Two strings of length k, differing in one character, share a prefix of length l and a suffix of length m such that k=l+m+1.
The answer by Simon Prins encodes this by storing all prefix/suffix combinations explicitly, i.e. "abc" becomes "*bc", "a*c" and "ab*". That's k=3, l=0,1,2 and m=2,1,0.
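As a minimal sketch of that explicit encoding (the function names here are made up for illustration, and it assumes "*" never occurs in the input strings), each word can be filed under its k wildcard patterns, and any two distinct words that land in the same bucket differ in exactly one character:

```python
from collections import defaultdict

def wildcard_patterns(word):
    # One pattern per position: prefix of length l, "*", suffix of length m,
    # with l + m + 1 == len(word).
    return [word[:i] + "*" + word[i + 1:] for i in range(len(word))]

def near_duplicate_pairs(words):
    # Assumption: "*" is not a character of any input word.
    buckets = defaultdict(set)
    for w in words:
        for pattern in wildcard_patterns(w):
            buckets[pattern].add(w)

    pairs = set()
    for group in buckets.values():
        for a in group:
            for b in group:
                if a < b:  # report each unordered pair once
                    pairs.add((a, b))
    return pairs
```

This stores k patterns of length k per word, which is the memory cost the tree-based approach tries to avoid.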
As valarMorghulis points out, you can organize words in a prefix tree. There's also the very similar suffix tree. It's fairly easy to augment the tree with the number of leaf nodes below each prefix or suffix; this can be updated in O(k) when inserting a new word.
You want these sibling counts so that, given a new word, you know whether to enumerate all strings with the same prefix or all strings with the same suffix. E.g. for "abc" as input, the possible prefixes are "", "a" and "ab", while the corresponding suffixes are "bc", "c" and "". Clearly, for short suffixes it's better to enumerate siblings in the prefix tree, and vice versa.
As @einpoklum points out, it's certainly possible that all strings share the same prefix of length k/2. That's not a problem for this approach; the prefix tree will be linear up to depth k/2, with each node up to that depth being the ancestor of 100,000 leaf nodes. As a result, the suffix tree will be used up to depth (k/2-1), which is good because the strings have to differ in their suffixes given that they share prefixes.
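The prefix/suffix choice described above could be sketched roughly as follows. This is only an illustration under stated assumptions: all words have the same length k, the "suffix tree" is modelled as a plain trie over the reversed words, and the class and method names (`Trie`, `NearDuplicateIndex`, `neighbours`) are invented for this example.

```python
class Trie:
    def __init__(self):
        self.children = {}
        self.count = 0  # number of inserted words at or below this node

    def insert(self, word):
        node = self
        node.count += 1
        for ch in word:  # O(k) update of the counts along the path
            node = node.children.setdefault(ch, Trie())
            node.count += 1

    def find(self, prefix):
        node = self
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def words(self, path):
        # Yield every word stored below this node; `path` is the text so far.
        # Assumes equal-length words, so only leaves are complete words.
        if not self.children:
            yield path
        for ch, child in self.children.items():
            yield from child.words(path + ch)


class NearDuplicateIndex:
    def __init__(self):
        self.prefixes = Trie()
        self.suffixes = Trie()  # holds reversed words

    def add(self, word):
        self.prefixes.insert(word)
        self.suffixes.insert(word[::-1])

    def neighbours(self, word):
        """Return stored words differing from `word` in exactly one position."""
        found = set()
        for i in range(len(word)):
            prefix, suffix = word[:i], word[i + 1:]
            p = self.prefixes.find(prefix)
            s = self.suffixes.find(suffix[::-1])
            if p is None or s is None:
                continue
            if p.count <= s.count:
                # Fewer words share the prefix: enumerate those, test the suffix.
                for cand in p.words(prefix):
                    if cand[i + 1:] == suffix and cand[i] != word[i]:
                        found.add(cand)
            else:
                # Fewer words share the suffix: enumerate those, test the prefix.
                for rev in s.words(suffix[::-1]):
                    cand = rev[::-1]
                    if cand[:i] == prefix and cand[i] != word[i]:
                        found.add(cand)
        return found
```

For example, after adding "abd", "xbc", "aac" and "xyz", calling `neighbours("abc")` enumerates the single word ending in "bc" at position 0 rather than all four words sharing the empty prefix.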
[edit]
As an optimization, once you've determined the shortest unique prefix of a string, you know that a single differing character can't come after the last character of that prefix, so you'd already have found any near-duplicate when checking the shorter prefixes. So if "abcde" has the shortest unique prefix "abc", there are other strings that start with "ab" but none that start with "abc". If one of those differs in only one character, it must be that third character. You don't need to check "abc?e" anymore.
By the same logic, if you would find that "cde" is a unique shortest suffix, then you know you need to check only the length-2 "ab" prefix and not length 1 or 3 prefixes.
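To make the pruning concrete, here is a small sketch that computes which positions still need checking. It is an illustration only: `positions_to_check` is a made-up name, it uses flat counters instead of the trees for brevity, and it assumes all words have the same length and that `word` occurs exactly once in `words`.

```python
from collections import Counter

def positions_to_check(word, words):
    # Assumptions: every word has length k, and `word` is in `words` once.
    k = len(word)
    prefix_counts = Counter(w[:n] for w in words for n in range(1, k + 1))
    suffix_counts = Counter(w[k - n:] for w in words for n in range(1, k + 1))

    # Length of the shortest prefix / suffix that no other word shares.
    p = next(n for n in range(1, k + 1) if prefix_counts[word[:n]] == 1)
    s = next(n for n in range(1, k + 1) if suffix_counts[word[k - n:]] == 1)

    # A near-duplicate shares word[:i] and word[i + 1:], so the differing
    # position i must satisfy i <= p - 1 and i >= k - s.
    return range(k - s, p)
```

With "abcde" and "abxde" as the only words, the shortest unique prefix is "abc" and the shortest unique suffix is "cde", leaving only position 2 (the length-2 prefix "ab") to check. An empty range means the word has no near-duplicate at all.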
Note that this method works only for differences of exactly one character and does not generalize to two-character differences: it relies on one character being the separation between identical prefixes and identical suffixes.