Estructura de datos o algoritmo para encontrar rápidamente diferencias entre cadenas

19

Tengo una matriz de 100,000 cadenas, todas de longitud $k$ . Quiero comparar cada cadena con todas las demás para ver si dos cadenas difieren en 1 carácter. En este momento, a medida que agrego cada cadena a la matriz, la estoy verificando con cada cadena que ya está en la matriz, que tiene una complejidad temporal de $\frac{n(n-1)}{2} k$ .

¿Existe una estructura de datos o algoritmo que pueda comparar cadenas entre sí más rápido de lo que ya estoy haciendo?

Alguna información adicional:

El orden es importante: abcdey xbcdedifieren en 1 carácter, mientras que abcdey edcbadiferir en 4 caracteres.
Para cada par de cadenas que difieren en un carácter, eliminaré una de esas cadenas de la matriz.
En este momento, estoy buscando cadenas que difieran solo en 1 carácter, pero sería bueno que esa diferencia de 1 carácter se pudiera aumentar a, por ejemplo, 2, 3 o 4 caracteres. Sin embargo, en este caso, creo que la eficiencia es más importante que la capacidad de aumentar el límite de diferencia de caracteres.
$k$ suele estar en el rango de 20-40.

— JGut
fuente

44

Buscar un diccionario de cadenas con 1 error es un problema bastante conocido, por ejemplo, cs.nyu.edu/~adi/CGL04.pdf

— KWillets

1

20-40mers pueden usar un poco de espacio. Puede mirar un filtro Bloom ( en.wikipedia.org/wiki/Bloom_filter ) para probar si las cadenas degeneradas (el conjunto de todos los mers de una, dos o más sustituciones en una prueba mer) son "tal vez en" o "definitivamente -not-in "un conjunto de kmers. Si obtiene un "quizás-en", luego compare las dos cadenas para determinar si es un falso positivo o no. Los casos "definitivamente no están en" son verdaderos negativos que reducirán el número total de comparaciones letra por letra que tiene que hacer, al limitar las comparaciones solo a los posibles golpes "tal vez".

— Alex Reynolds

Si estaba trabajando con un rango más pequeño de k, podría usar un conjunto de bits para almacenar una tabla hash de booleanos para todas las cadenas degeneradas (por ejemplo, github.com/alexpreynolds/kmer-boolean para el juguete). Sin embargo, para k = 20-40, los requisitos de espacio para un conjunto de bits son simplemente demasiado.

— Alex Reynolds

12

Es posible lograr $O(nk \log k)$ peor tiempo de ejecución.

Comencemos simple. Si le importa una solución fácil de implementar que sea eficiente en muchas entradas, pero no en todas, aquí hay una solución simple, pragmática y fácil de implementar que muchas son suficientes en la práctica para muchas situaciones. Sin embargo, vuelve al tiempo de ejecución cuadrático en el peor de los casos.

Tome cada cadena y guárdela en una tabla hash, codificada en la primera mitad de la cadena. Luego, itera sobre los cubos de tabla hash. Para cada par de cadenas en el mismo grupo, verifique si difieren en 1 carácter (es decir, verifique si su segunda mitad difiere en 1 carácter).

Luego, toma cada cadena y guárdala en una tabla hash, esta vez tecleada en el segunda mitad de la cadena. Nuevamente revise cada par de cuerdas en el mismo cubo.

Suponiendo que las cadenas están bien distribuidas, el tiempo de ejecución probablemente será de . Además, si existe un par de cadenas que difieren en 1 carácter, se encontrará durante una de las dos pasadas (dado que difieren solo en 1 carácter, ese carácter diferente debe estar en la primera o segunda mitad de la cadena, entonces la segunda o primera mitad de la cadena debe ser la misma). Sin embargo, en el peor de los casos (p. Ej., Si todas las cadenas comienzan o terminan con los mismos caracteres ), esto se degrada a $O(nk)$ $k/2$ $O(n^2 k)$ tiempo de ejecución, por lo que su peor tiempo de ejecución no es una mejora en la fuerza bruta .

Como una optimización del rendimiento, si algún depósito tiene demasiadas cadenas, puede repetir el mismo proceso de forma recursiva para buscar un par que difiera en un carácter. La invocación recursiva será en cadenas de longitud . $k/2$

Si le preocupa el peor tiempo de ejecución:

Con la optimización de rendimiento anterior, creo que el peor tiempo de ejecución es . $O(nk \log k)$

— DW
fuente

3

Si las cadenas

comparten la misma primera mitad, lo cual puede suceder en la vida real, entonces no ha mejorado la complejidad.

Ω (n)

$\Omega(n)$

— einpoklum

@einpoklum, claro! Es por eso que escribí el enunciado en mi segunda oración que se reduce al tiempo de ejecución cuadrático en el peor de los casos, así como el enunciado en mi última oración que describe cómo lograr

peor de los casos si te importa sobre el peor de los casos. Pero supongo que tal vez no lo expresé muy claramente, así que edité mi respuesta en consecuencia. ¿Esta mejor ahora?

O (n k \log k)

$O(nk \log k)$

— DW

15

Mi solución es similar a la de j_random_hacker pero usa solo un conjunto de hash.

Crearía un conjunto hash de cadenas. Para cada cadena en la entrada, agregue al conjunto $k$ cadenas. En cada una de estas cadenas, reemplace una de las letras con un carácter especial, que no se encuentra en ninguna de las cadenas. Mientras los agrega, verifique que aún no estén en el conjunto. Si lo son, entonces tiene dos cadenas que solo difieren en (como máximo) un carácter.

Un ejemplo con las cadenas 'abc', 'adc'

Para abc agregamos '* bc', 'a * c' y 'ab *'

Para adc agregamos '* dc', 'a * c' y 'ad *'

Cuando agregamos 'a * c' la segunda vez, notamos que ya está en el conjunto, por lo que sabemos que hay dos cadenas que solo difieren en una letra.

El tiempo total de ejecución de este algoritmo es . Esto se debe a que creamos nuevas cadenas para todas las cadenas en la entrada. Para cada una de esas cadenas, necesitamos calcular el hash, que generalmente toma tiempo . $O(n*k^2)$ $k$ $n$ $O(k)$

El almacenamiento de todas las cadenas ocupa espacio . $O(n*k^2)$

Futuras mejoras

Podemos mejorar aún más el algoritmo no almacenando las cadenas modificadas directamente, sino almacenando un objeto con una referencia a la cadena original y el índice del carácter que está enmascarado. De esta manera, no necesitamos crear todas las cadenas y solo necesitamos espacio para almacenar todos los objetos. $O(n*k)$

Deberá implementar una función hash personalizada para los objetos. Podemos tomar la implementación de Java como ejemplo, consulte la documentación de Java . El código hash de Java multiplica el valor Unicode de cada carácter con (con la longitud de la cadena e el índice basado en uno). Tenga en cuenta que cada cadena alterada solo difiere en un carácter del original. Podemos calcular fácilmente la contribución de ese carácter al código hash. Podemos restar eso y agregar nuestro carácter de máscara en su lugar. Esto requiere para calcular. Esto nos permite reducir el tiempo total de ejecución a $31^{k-i}$ $k$ $i$ $O(1)$ $O(n*k)$

— Simon Prins
fuente

44

@JollyJoker Sí, el espacio es algo preocupante con este método. Puede reducir el espacio no almacenando las cadenas modificadas, sino almacenando un objeto con una referencia a la cadena y al índice enmascarado. Eso debería dejarte con espacio O (nk).

— Simon Prins

Para calcular los

hashes para cada cadena en el tiempo

, creo que necesitará una función hash casera especial (por ejemplo, calcular el hash de la cadena original en tiempo

, luego XOR con cada uno de los borrados caracteres en

cada vez (aunque esta es probablemente una función hash bastante mala en otras formas)). Por cierto, esto es bastante similar a mi solución, pero con una única tabla hash en lugar de

separadas, y reemplazando un carácter con "*" en lugar de eliminarlo.

k

$k$

O (k)

$O(k)$

O (k)

$O(k)$

O (1)

$O(1)$

k

$k$

— j_random_hacker

@SimonPrins Con métodos personalizados equalsy hashCodeque podrían funcionar. Simplemente crear la cadena de estilo a * b en esos métodos debería hacerlo a prueba de balas; Sospecho que algunas de las otras respuestas aquí tendrán problemas de colisión hash.

— JollyJoker

1

@DW Modifiqué mi publicación para reflejar el hecho de que calcular los hash toma tiempo

y agregué una solución para reducir el tiempo total de ejecución a

.

O (k)

$O(k)$

O (n * k)

$O(n*k)$

— Simon Prins

1

@SimonPrins El peor de los casos podría ser nk ^ 2 debido a la comprobación de igualdad de cadenas en hashset.contains cuando los hash colisionan. Por supuesto, el peor caso es cuando cada cadena tiene el mismo hash exacta, lo que requeriría un conjunto más o menos artesanal de cuerdas, sobre todo para obtener el mismo hash para *bc, a*c, ab*. Me pregunto si podría mostrarse imposible.

— JollyJoker

7

I would make $k$ hashtables $H_1, \dots, H_k$ , each of which has a $(k-1)$ -length string as the key and a list of numbers (string IDs) as the value. The hashtable $H_i$ will contain all strings processed so far but with the character at position $i$ deleted. For example, if $k=6$ , then $H_3[ABDEF]$ will contain a list of all strings seen so far that have the pattern $AB\cdot DEF$ , where $\cdot$ means "any character". Then to process the $j$ -th input string $s_j$ :

For each in the range 1 to :
- Form string $s_j'$ by deleting the $i$ -th character from $s_j$ .
- Look up $H_i[s_j']$ . Every string ID here identifies an original string that is either equal to $s$ , or differs at position $i$ only. Output these as matches for string $s_j$ . (If you wish to exclude exact duplicates, make the value type of the hashtables a (string ID, deleted character) pair, so that you can test for those that have had the same character deleted as we just deleted from $s_j$ .)
- Insert $j$ into $H_i$ for future queries to use.

If we store each hash key explicitly, then we must use $O(nk^2)$ space and thus have time complexity at least that. But as described by Simon Prins, it's possible to represent a series of modifications to a string (in his case described as changing single characters to *, in mine as deletions) implicitly in such a way that all $k$ hash keys for a particular string need just $O(k)$ space, leading to $O(nk)$ space overall, and opening the possibility of $O(nk)$ time too. To achieve this time complexity, we need a way to compute the hashes for all $k$ variations of a length- $k$ string in $O(k)$ time: for example, this can be done using polynomial hashes, as suggested by D.W. (and this is likely much better than simply XORing the deleted character with the hash for the original string).

Simon Prins's implicit representation trick also means that the "deletion" of each character is not actually performed, so we can use the usual array-based representation of a string without a performance penalty (rather than linked lists as I had originally suggested).

— j_random_hacker
fuente

2

Nice solution. An example of a suitable bespoke hash function would be a polynomial hash.

— D.W.

Thanks @D.W. Could you perhaps clarify a bit what you mean by "polynomial hash"? Googling the term didn't get me anything that seemed definitive. (Please feel free to edit my post directly if you want.)

— j_random_hacker

1

Just read the string as a base

q

$q$ number modulo

p

$p$ , where

p

$p$ is some prime less than your hashmap size, and

q

$q$ is a primitive root of

p

$p$ , and

q

$q$ is more than the alphabet size. It's called "polynomial hash" because it is like evaluating the polynomial whose coefficients are given by the string at

q

$q$ . I'll leave it as an exercise to figure out how to compute all the desired hashes in

O (k)

$O(k)$ time. Note that this approach is not immune to an adversary, unless you randomly choose both

p, q

$p,q$ satisfying the desired conditions.

— user21820

1

I think this solution can be further refined by observing that only one of the k hash tables needs to exist at any one time, thus reducing the memory requirement.

— Michael Kay

1

@MichaelKay: That won't work if you want to compute the

k

$k$ hashes of the possible alterations of a string in

O (k)

$O(k)$ time. You still need to store them somewhere. So if you only check one position at a time, you will take

k

$k$ times as long as if you check all positions together using

k

$k$ times as many hashtable entries.

— user21820

2

Here is a more robust hashtable approach than the polynomial-hash method. First generate $k$ random positive integers $r_{1..k}$ that are coprime to the hashtable size $M$ . Namely, $0 \le r_i < M$ . Then hash each string $x_{1..k}$ to $(\sum_{i=1}^k x_i r_i ) \bmod M$ . There is almost nothing an adversary can do to cause very uneven collisions, since you generate $r_{1..k}$ on run-time and so as $k$ increases the maximum probability of collision of any given pair of distinct strings goes quickly to $1/M$ . It is also obvious how to compute in $O(k)$ time all the possible hashes for each string with one character changed.

If you really want to guarantee uniform hashing, you can generate one random natural number $r(i,c)$ less than $M$ for each pair $(i,c)$ for $i$ from $1$ to $k$ and for each character $c$ , and then hash each string $x_{1..k}$ to $(\sum_{i=1}^k r(i,x_i) ) \bmod M$ . Then the probability of collision of any given pair of distinct strings is exactly $1/M$ . This approach is better if your character set is relatively small compared to $n$ .

— user21820
fuente

2

A lot of the algorithms posted here use quite a bit of space on hash tables. Here's an $O(1)$ auxiliary storage $O((n \lg n) \cdot k^2)$ runtime simple algorithm.

The trick is to use $C_k(a, b)$ , which is a comparator between two values $a$ and $b$ that returns true if $a < b$ (lexicographically) while ignoring the $k$ th character. Then algorithm is as follows.

First, simply sort the strings regularly and do a linear scan to remove any duplicates.

Then, for each $k$ :

Sort the strings with $C_k$ as comparator.
Strings that differ only in $k$ are now adjacent and can be detected in a linear scan.

— orlp
fuente

1

Two strings of length k, differing in one character, share a prefix of length l and a suffix of length m such that k=l+m+1.

The answer by Simon Prins encodes this by storing all prefix/suffix combinations explicitly, i.e. abc becomes *bc, a*c and ab*. That's k=3, l=0,1,2 and m=2,1,0.

As valarMorghulis points out, you can organize words in a prefix tree. There's also the very similar suffix tree. It's fairly easy to augment the tree with the number of leaf nodes below each prefix or suffix; this can be updated in O(k) when inserting a new word.

The reason you want these sibling counts is so you know, given a new word, whether you want to enumerate all strings with the same prefix or whether to enumerate all strings with the same suffix. E.g. for "abc" as input, the possible prefixes are "", "a" and "ab", while the corresponding suffixes are "bc", "c" and "". As it obvious, for short suffixes it's better to enumerate siblings in the prefix tree and vice versa.

As @einpoklum points out, it's certainly possible that all strings share the same k/2 prefix. That's not a problem for this approach; the prefix tree will be linear up to depth k/2 with each node up to k/2 depth being the ancestor of 100.000 leaf nodes. As a result, the suffix tree will be used up to (k/2-1) depth, which is good because the strings have to differ in their suffixes given that they share prefixes.

[edit] As an optimization, once you've determined the shortest unique prefix of a string, you know that if there's one different character, it must be the last character of the prefix, and you'd have found the near-duplicate when checking a prefix that was one shorter. So if "abcde" has a shortest unique prefix "abc", that means there are other strings that start with "ab?" but not with "abc". I.e. if they'd differ in only one character, that would be that third character. You don't need to check for "abc?e" anymore.

By the same logic, if you would find that "cde" is a unique shortest suffix, then you know you need to check only the length-2 "ab" prefix and not length 1 or 3 prefixes.

Note that this method works only for exactly one character differences and does not generalize to 2 character differences, it relies one one character being the separation between identical prefixes and identical suffixes.

— MSalters
fuente

Are you suggesting that for each string

s

$s$ and each

1 \leq i \leq k

$1 \le i \le k$ , we find the node

P [s_{1}, \dots, s_{i - 1}]

$P[s_1, \dots, s_{i-1}]$ corresponding to the length-

(i - 1)

$(i-1)$ prefix in the prefix trie, and the node

S [s_{i + 1}, \dots, s_{k}]

$S[s_{i+1}, \dots, s_k]$ corresponding to the length-

(k - i - 1)

$(k-i-1)$ suffix in the suffix trie (each takes amortised

O (1)

$O(1)$ time), and compare the number of descendants of each, choosing whichever has fewer descendants, and then "probing" for the rest of the string in that trie?

— j_random_hacker

1

What is the running time of your approach? It looks to me like in the worst case it might be quadratic: consider what happens if every string starts and ends with the same

k / 4

$k/4$ characters.

— D.W.

The optimization idea is clever and interesting. Did you have in mind a particular way to do the check for mtaches? If the "abcde" has the shortest unique prefix "abc", that means we should check for some other string of the form "ab?de". Did you have in mind a particular way to do that, that will be efficient? What's the resulting running time?

— D.W.

@D.W.: The idea is that to find strings in the form "ab?de", you check the prefix tree how many leaf nodes exist below "ab" and in the suffix tree how many nodes exist under "de", then choose the smallest of the two to enumerate. When all strings begin and end with the same k/4 characters; that means the first k/4 nodes in both trees have one child each. And yes, every time you need those trees, those have to be traversed which is an O(n*k) step.

— MSalters

To check for a string of the form "ab?de" in the prefix trie, it suffices to get to the node for "ab", then for each of its children

v

$v$ , check whether the path "de" exists below

v

$v$ . That is, don't bother enumerating any other nodes in these subtries. This takes

O (a h)

$O(ah)$ time, where

a

$a$ is the alphabet size and

h

$h$ is the height of the initial node in the trie.

h

$h$ is

O (k)

$O(k)$ , so if the alphabet size is

O (n)

$O(n)$ then it is indeed

O (n k)

$O(nk)$ time overall, but smaller alphabets are common. The number of children (not descendants) is important, as well as the height.

— j_random_hacker

1

Storing strings in buckets is a good way (there are already different answers outlining this).

An alternative solution could be to store strings in a sorted list. The trick is to sort by a locality-sensitive hashing algorithm. This is a hash algorithm which yields similar results when the input is similar[1].

Each time you want to investigate a string, you could calculate its hash and lookup the position of that hash in your sorted list (taking $O(log(n))$ for arrays or $O(n)$ for linked lists). If you find that the neighbours (considering all close neighbours, not only those with an index of +/- 1) of that position are similar (off by one character) you found your match. If there are no similar strings, you can insert the new string at the position you found (which takes $O(1)$ for linked lists and $O(n)$ for arrays).

One possible locality-sensitive hashing algorithm could be Nilsimsa (with open source implementation available for example in python).

[1]: Note that often hash algorithms, like SHA1, are designed for the opposite: producing greatly differing hashes for similar, but not equal inputs.

Disclaimer: To be honest, I would personally implement one of the nested/tree-organized bucket-solutions for a production application. However, the sorted list idea struck me as an interesting alternative. Note that this algorithm highly depends on the choosen hash algorithm. Nilsimsa is one algorithm I found - there are many more though (for example TLSH, Ssdeep and Sdhash). I haven't verified that Nilsimsa works with my outlined algorithm.

— tessi
fuente

1

Interesting idea, but I think we would need to have some bounds on how far apart two hash values can be when their inputs differ by just 1 character -- then scan everything within that range of hash values, instead of just neighbours. (It's impossible to have a hash function that produces adjacent hash values for all possible pairs of strings that differ by 1 character. Consider the length-2 strings in a binary alphabet: 00, 01, 10 and 11. If h(00) is adjacent to both h(10) and h(01) then it must be between them, in which case h(11) can't be adjacent to them both, and vice versa.)

— j_random_hacker

Looking at neighbors isn't sufficient. Consider the list abcd, acef, agcd. There exists a matching pair, but your procedure will not find it, as abcd is not a neighbor of agcd.

— D.W.

You both are right! With neighbours I didn't mean only "direct neighbours" but thought of "a neighbourhood" of close positions. I didn't specify how many neighbours need to be looked at since that depends on the hash algorithm. But you're right, I should probably note this down in my answer. thanks :)

— tessi

1

"LSH... similar items map to the same “buckets” with high probability" - since it's probability algorithm, result isn't guaranteed. So it depends on TS whether he needs 100% solution or 99.9% is enough.

— Bulat

1

One could achieve the solution in $O(nk+ n^2)$ time and $O(nk)$ space using enhanced suffix arrays (Suffix array along with the LCP array) that allows constant time LCP (Longest Common Prefix) query (i.e. Given two indices of a string, what is the length of the longest prefix of the suffixes starting at those indices). Here, we could take advantage of the fact that all strings are of equal length. Specifically,

Build the enhanced suffix array of all the $n$ strings concatenated together. Let $X = x_1.x_2.x_3 .... x_n$ where $x_i, \forall 1 \le i \le n$ is a string in the collection. Build the suffix array and LCP array for $X$ .

Now each $x_i$ starts at position $(i-1)k$ in the zero-based indexing. For each string $x_i$ , take LCP with each of the string $x_j$ such that $j<i$ . If LCP goes beyond the end of $x_j$ then $x_i = x_j$ . Otherwise, there is a mismatch (say $x_i[p] \ne x_j[p]$ ); in this case take another LCP starting at the corresponding positions following the mismatch. If the second LCP goes beyond the end of $x_j$ then $x_i$ and $x_j$ differ by only one character; otherwise there are more than one mismatches.

for (i=2; i<= n; ++i){
    i_pos = (i-1)k;
    for (j=1; j < i; ++j){
        j_pos = (j-1)k;
        lcp_len = LCP (i_pos, j_pos);
        if (lcp_len < k) { // mismatch
            if (lcp_len == k-1) { // mismatch at the last position
            // Output the pair (i, j)
            }
            else {
              second_lcp_len = LCP (i_pos+lcp_len+1, j_pos+lcp_len+1);
              if (lcp_len+second_lcp_len>=k-1) { // second lcp goes beyond
                // Output the pair(i, j)
              }
            }
        }
    }
}

You could use SDSL library to build the suffix array in compressed form and answer the LCP queries.

Analysis: Building the enhanced suffix array is linear in the length of $X$ i.e. $O(nk)$ . Each LCP query takes constant time. Thus, querying time is $O(n^2)$ .

Generalisation: This approach can also be generalised to more than one mismatches. In general, running time is $O(nk + qn^2)$ where $q$ is the number of allowed mismatches.

If you wish to remove a string from the collection, instead of checking every $j<i$ , you could keep a list of only 'valid' $j$ .

— Ritu Kundu
fuente

Can i say that

O (k n^{2})

$O(kn^2)$ algo es trivial: ¿solo compara cada par de cuerdas y cuenta el número de coincidencias? Y ken esta fórmula prácticamente se puede omitir, ya que con SSE puede contar bytes coincidentes en 2 ciclos de CPU por 16 símbolos (es decir, 6 ciclos para k = 40).

— Bulat

Apologies but I could not understand your query. The above approach is

O (n k + n^{2})

$O(nk + n^2)$ and not

O (k n^{2})

$O(kn^2)$ . Also, it is virtually alphabet-size independent. It could be used in conjunction with the hash-table approach -- Once two strings are found to have the same hashes, they could be tested if they contain a single mismatch in

O (1)

$O(1)$ time.

— Ritu Kundu

Mi punto es que k = 20..40 para el autor de la pregunta y comparar cadenas tan pequeñas requieren solo unos pocos ciclos de CPU, por lo que la diferencia práctica entre la fuerza bruta y su enfoque probablemente no exista.

— Bulat

1

Una mejora para todas las soluciones propuestas. Todos requieren $O(nk)$ memoria en el peor de los casos. Puede reducirlo calculando hashes de cadenas con *cada carácter, es decir *bcde, a*cde... y procesando en cada pasada solo variantes con valor hash en cierto rango entero. Fe con valores de hash pares en la primera pasada y valores de hash impares en la segunda.

También puede usar este enfoque para dividir el trabajo entre múltiples núcleos de CPU / GPU.

— Bulat
fuente

Sugerencia inteligente! En este caso, la pregunta original dice

n = 100, 000

$n=100,000$ y

k \approx 40

$k\approx 40$ , entonces

O (n k)

$O(nk)$ No parece probable que la memoria sea un problema (que podría ser algo así como 4 MB). Sin embargo, sigue siendo una buena idea que vale la pena saber si es necesario ampliar esto.

— DW

0

Esta es una versión corta de la respuesta de @SimonPrins que no involucra hashes.

Suponiendo que ninguna de sus cadenas contenga un asterisco:

Crea una lista de tallas $nk$ donde cada una de sus cadenas se produce en $k$ variaciones, cada una con una letra reemplazada por un asterisco (tiempo de ejecución $\mathcal{O}(nk^2)$ )
Ordenar esa lista (tiempo de ejecución $\mathcal{O}(nk^2\log nk)$ )
Compruebe si hay duplicados comparando las entradas posteriores de la lista ordenada (tiempo de ejecución $\mathcal{O}(nk^2)$ )

Una solución alternativa con el uso implícito de hashes en Python (no puede resistir la belleza):

def has_almost_repeats(strings,k):
    variations = [s[:i-1]+'*'+s[i+1:] for s in strings for i in range(k)]
    return len(set(variations))==k*len(strings)

— Bananach
fuente

Gracias. Por favor mencione también el

k

$k$ copias de duplicados exactos, y haré +1. (Hmm, acabo de notar que hice la misma afirmación sobre

O (n k)

$O(nk)$ tiempo en mi propia respuesta ... Mejor arreglar eso ...)

— j_random_hacker

@j_random_hacker No sé qué es exactamente lo que el OP quiere informar, así que dejé el paso 3 vago, pero creo que es trivial con algo de trabajo adicional informar ya sea (a) un binario cualquier resultado duplicado / no duplicado o (b) una lista de pares de cadenas que difieren en a lo sumo una posición, sin duplicados. Si tomamos el OP literalmente ("... para ver si hay dos cadenas ..."), entonces (a) parece ser deseable. Además, si se deseara (b), por supuesto, simplemente crear una lista de pares puede tomar

O (n^{2})

$\mathcal{O}(n^2)$ si todas las cadenas son iguales

— Bananach

0

Aquí está mi opinión sobre 2+ buscadores de desajustes. Tenga en cuenta que en esta publicación considero que cada cadena es circular, la subcadena fe de longitud 2 en el índice k-1consiste en un símbolo str[k-1]seguido de str[0]. ¡Y la subcadena de longitud 2 en el índice -1es la misma!

Si tenemos Mdesajustes entre dos cadenas de longitud k, tienen una subcadena coincidente con una longitud de al menos $mlen(k,M) = \lceil{k/M}\rceil-1$ dado que, en el peor de los casos, los símbolos que no coinciden dividen la cadena (circular) en Msegmentos de igual tamaño. Fe con k=20y M=4el "peor" partido puede tener el patrón abcd*efgh*ijkl*mnop*.

Ahora, el algoritmo para buscar todos los desajustes hasta los Msímbolos entre cadenas de ksímbolos:

para cada i de 0 a k-1
- dividir todas las cadenas en grupos por str[i..i+L-1], donde L = mlen(k,M). Fe si L=4y tiene un alfabeto de solo 4 símbolos (del ADN), esto hará 256 grupos.
- Los grupos más pequeños que ~ 100 cadenas se pueden verificar con el algoritmo de fuerza bruta
- Para grupos más grandes, debemos realizar la división secundaria:
  - Elimine de cada cadena en los Lsímbolos de grupo que ya coincidimos
  - para cada j de i-L + 1 a kL-1
    - dividir todas las cadenas en grupos por str[i..i+L1-1], donde L1 = mlen(k-L,M). Fe si k=20, M=4, alphabet of 4 symbols, así L=4y así L1=3, esto formará 64 grupos.
    - el resto se deja como ejercicio para el lector: D

¿Por qué no comenzamos jdesde 0? Debido a que ya formamos estos grupos con el mismo valor de i, entonces el trabajo con j<=i-Lserá exactamente equivalente al trabajo con los valores i y j intercambiados.

Otras optimizaciones:

En cada posición, considere también las cadenas str[i..i+L-2] & str[i+L]. Esto solo duplica la cantidad de trabajos creados, pero permite aumentar Len 1 (si mis cálculos son correctos). Entonces, fe en lugar de 256 grupos, dividirá los datos en 1024 grupos.
Si algun $L[i]$ se vuelve demasiado pequeño, siempre podemos usar el *truco: para cada i in 0..k-1, elimine el i-ésimo símbolo de cada cadena y cree trabajos en busca de M-1desajustes en esas cadenas de longitud k-1.

— Bulat
fuente

0

Trabajo todos los días para inventar y optimizar algos, por lo que si necesita hasta el último rendimiento, ese es el plan:

Verifique *en cada posición de forma independiente, es decir, en lugar de n*kvariantes de cadena de procesamiento de trabajo único : inicie ktrabajos independientes para cada ncadena de verificación . Puede distribuir estos ktrabajos entre múltiples núcleos de CPU / GPU. Esto es especialmente importante si va a verificar 2+ char diffs. Un tamaño de trabajo más pequeño también mejorará la localidad de caché, que por sí sola puede hacer que el programa sea 10 veces más rápido.
Si va a usar tablas hash, use su propia implementación empleando sondeo lineal y ~ 50% de factor de carga. Es rápido y bastante fácil de implementar. O utilice una implementación existente con direccionamiento abierto. Las tablas hash STL son lentas debido al uso de encadenamiento separado.
Puede intentar prefiltrar datos utilizando un filtro Bloom de 3 estados (distinguiendo 0/1/1 + ocurrencias) como lo propone @AlexReynolds.
Para cada i de 0 a k-1 ejecute el siguiente trabajo:
- Genere estructuras de 8 bytes que contengan hash de 4-5 bytes de cada cadena (con *la posición i-ésima) e índice de cadena, y luego ordénelos o cree una tabla hash a partir de estos registros.

Para ordenar, puede probar el siguiente combo:

primer paso es la clasificación de radios MSD en 64-256 formas empleando el truco TLB
el segundo pase es la clasificación de radios MSD en 256-1024 formas sin truco TLB (64K formas en total)
tercer paso es el tipo de inserción para corregir las inconsistencias restantes

— Bulat
fuente