Un mejor algoritmo de clasificación de similitud para cadenas de longitud variable

152

Estoy buscando un algoritmo de similitud de cadenas que produzca mejores resultados en cadenas de longitud variable que las que generalmente se sugieren (distancia levenshtein, índice sonoro, etc.).

Por ejemplo,

Dada la cadena A: "Robert",

Luego cadena B: "Amy Robertson"

sería una mejor combinación que

Cadena C: "Richard"

Además, preferiblemente, este algoritmo debe ser independiente del idioma (también funciona en otros idiomas además del inglés).

— marzagao
fuente

similar en .net: stackoverflow.com/questions/83777/…

— Ciro Santilli 郝海东冠状病六四事件法轮功

También echa un vistazo: coeficiente de dados

— avid_useR 02 de

155

Simon White de Catalysoft escribió un artículo sobre un algoritmo muy inteligente que compara pares de caracteres adyacentes que funciona realmente bien para mis propósitos:

http://www.catalysoft.com/articles/StrikeAMatch.html

Simon tiene una versión Java del algoritmo y a continuación escribí una versión PL / Ruby (tomada de la versión ruby simple hecha en el comentario de entrada del foro relacionado por Mark Wong-VanHaren) para que pueda usarla en mis consultas de PostgreSQL:

CREATE FUNCTION string_similarity(str1 varchar, str2 varchar)
RETURNS float8 AS '

str1.downcase! 
pairs1 = (0..str1.length-2).collect {|i| str1[i,2]}.reject {
  |pair| pair.include? " "}
str2.downcase! 
pairs2 = (0..str2.length-2).collect {|i| str2[i,2]}.reject {
  |pair| pair.include? " "}
union = pairs1.size + pairs2.size 
intersection = 0 
pairs1.each do |p1| 
  0.upto(pairs2.size-1) do |i| 
    if p1 == pairs2[i] 
      intersection += 1 
      pairs2.slice!(i) 
      break 
    end 
  end 
end 
(2.0 * intersection) / union

' LANGUAGE 'plruby';

¡Funciona de maravilla!

— marzagao
fuente

32

¿Encontraste la respuesta y escribiste todo eso en 4 minutos? ¡Impresionante!

— Matt J

28

Preparé mi respuesta después de un poco de investigación e implementación. Lo pongo aquí en beneficio de quien quiera que busque una respuesta práctica en SO utilizando un algoritmo alternativo porque la mayoría de las respuestas en preguntas relacionadas parecen girar en torno a levenshtein o soundex.

— marzagao

18

Justo lo que he estado buscando. ¿Te casarías conmigo?

— BlackTea

66

@JasonSundram tiene razón: de hecho, este es el conocido coeficiente Dice en bigrams a nivel de personaje, como escribe el autor en el "anexo" (parte inferior de la página).

— Fred Foo

44

Esto devuelve un "puntaje" de 1 (100% de coincidencia) al comparar cadenas que tienen una sola letra aislada como diferencia, como este ejemplo: string_similarity("vitamin B", "vitamin C") #=> 1¿hay una manera fácil de prevenir este tipo de comportamiento?

— MrYoshiji

77

La respuesta de marzagao es genial. Lo convertí a C #, así que pensé en publicarlo aquí:

Enlace Pastebin

/// <summary>
/// This class implements string comparison algorithm
/// based on character pair similarity
/// Source: http://www.catalysoft.com/articles/StrikeAMatch.html
/// </summary>
public class SimilarityTool
{
    /// <summary>
    /// Compares the two strings based on letter pair matches
    /// </summary>
    /// <param name="str1"></param>
    /// <param name="str2"></param>
    /// <returns>The percentage match from 0.0 to 1.0 where 1.0 is 100%</returns>
    public double CompareStrings(string str1, string str2)
    {
        List<string> pairs1 = WordLetterPairs(str1.ToUpper());
        List<string> pairs2 = WordLetterPairs(str2.ToUpper());

        int intersection = 0;
        int union = pairs1.Count + pairs2.Count;

        for (int i = 0; i < pairs1.Count; i++)
        {
            for (int j = 0; j < pairs2.Count; j++)
            {
                if (pairs1[i] == pairs2[j])
                {
                    intersection++;
                    pairs2.RemoveAt(j);//Must remove the match to prevent "GGGG" from appearing to match "GG" with 100% success

                    break;
                }
            }
        }

        return (2.0 * intersection) / union;
    }

    /// <summary>
    /// Gets all letter pairs for each
    /// individual word in the string
    /// </summary>
    /// <param name="str"></param>
    /// <returns></returns>
    private List<string> WordLetterPairs(string str)
    {
        List<string> AllPairs = new List<string>();

        // Tokenize the string and put the tokens/words into an array
        string[] Words = Regex.Split(str, @"\s");

        // For each word
        for (int w = 0; w < Words.Length; w++)
        {
            if (!string.IsNullOrEmpty(Words[w]))
            {
                // Find the pairs of characters
                String[] PairsInWord = LetterPairs(Words[w]);

                for (int p = 0; p < PairsInWord.Length; p++)
                {
                    AllPairs.Add(PairsInWord[p]);
                }
            }
        }

        return AllPairs;
    }

    /// <summary>
    /// Generates an array containing every 
    /// two consecutive letters in the input string
    /// </summary>
    /// <param name="str"></param>
    /// <returns></returns>
    private string[] LetterPairs(string str)
    {
        int numPairs = str.Length - 1;

        string[] pairs = new string[numPairs];

        for (int i = 0; i < numPairs; i++)
        {
            pairs[i] = str.Substring(i, 2);
        }

        return pairs;
    }
}

— Michael La Voie
fuente

2

¡+100 si pudiera, me acabas de salvar unos días de trabajo duro compañero! Salud.

— vvohra87

1

¡Muy agradable! La única sugerencia que tengo es convertir esto en una extensión.

— Levitikon

+1! Genial que funcione, con ligeras modificaciones también para Java. Y parece devolver mejores respuestas que Levenshtein.

— Xyene

1

Agregué una versión que convierte esto a un método de extensión a continuación. Gracias por la versión original y la increíble traducción.

— Frank Rundatz

@Michael La Voie ¡Gracias, es muy lindo! Aunque es un pequeño problema con (2.0 * intersection) / union: obtengo Double.NaN al comparar dos cadenas vacías.

— Vojtěch Dohnal

41

Aquí hay otra versión de la respuesta de marzagao , esta escrita en Python:

def get_bigrams(string):
    """
    Take a string and return a list of bigrams.
    """
    s = string.lower()
    return [s[i:i+2] for i in list(range(len(s) - 1))]

def string_similarity(str1, str2):
    """
    Perform bigram comparison between two strings
    and return a percentage match in decimal form.
    """
    pairs1 = get_bigrams(str1)
    pairs2 = get_bigrams(str2)
    union  = len(pairs1) + len(pairs2)
    hit_count = 0
    for x in pairs1:
        for y in pairs2:
            if x == y:
                hit_count += 1
                break
    return (2.0 * hit_count) / union

if __name__ == "__main__":
    """
    Run a test using the example taken from:
    http://www.catalysoft.com/articles/StrikeAMatch.html
    """
    w1 = 'Healed'
    words = ['Heard', 'Healthy', 'Help', 'Herded', 'Sealed', 'Sold']

    for w2 in words:
        print('Healed --- ' + w2)
        print(string_similarity(w1, w2))
        print()

— John Rutledge
fuente

2

Hay un pequeño error en string_similarity cuando hay ngrams duplicados en una palabra, lo que resulta en una puntuación> 1 para cadenas idénticas. Agregar un 'descanso' después de "hit_count + = 1" lo arregla.

— jbaiter

1

@jbaiter: Buena captura. Lo cambié para reflejar tus cambios.

— John Rutledge

3

En el artículo de Simon White, dice "Tenga en cuenta que cada vez que se encuentra una coincidencia, ese par de caracteres se elimina de la segunda lista de matriz para evitar que coincidamos con el mismo par de caracteres varias veces. (De lo contrario, 'GGGGG' obtendría una coincidencia perfecta contra 'GG'.) "Alteraría esta afirmación para decir que daría una coincidencia más alta que perfecta. Sin tener esto en cuenta, también parece tener el resultado de que el algoritmo no es transitivo (similitud (x, y) = / = similitud (y, x)). Agregar pares2.remover (y) después de la línea hit_count + = 1 soluciona el problema.

— NinjaMeTimbers

17

Aquí está mi implementación PHP del algoritmo StrikeAMatch sugerido, por Simon White. Las ventajas (como dice en el enlace) son:

Un verdadero reflejo de similitud léxica : las cadenas con pequeñas diferencias deben reconocerse como similares. En particular, una superposición de subcadenas significativa debe apuntar a un alto nivel de similitud entre las cadenas.
Una solidez a los cambios en el orden de las palabras : dos cadenas que contienen las mismas palabras, pero en un orden diferente, deben reconocerse como similares. Por otro lado, si una cadena es solo un anagrama aleatorio de los caracteres contenidos en la otra, entonces (generalmente) debería reconocerse como diferente.
Independencia del idioma : el algoritmo debería funcionar no solo en inglés, sino en muchos idiomas diferentes.

<?php
/**
 * LetterPairSimilarity algorithm implementation in PHP
 * @author Igal Alkon
 * @link http://www.catalysoft.com/articles/StrikeAMatch.html
 */
class LetterPairSimilarity
{
    /**
     * @param $str
     * @return mixed
     */
    private function wordLetterPairs($str)
    {
        $allPairs = array();

        // Tokenize the string and put the tokens/words into an array

        $words = explode(' ', $str);

        // For each word
        for ($w = 0; $w < count($words); $w++)
        {
            // Find the pairs of characters
            $pairsInWord = $this->letterPairs($words[$w]);

            for ($p = 0; $p < count($pairsInWord); $p++)
            {
                $allPairs[] = $pairsInWord[$p];
            }
        }

        return $allPairs;
    }

    /**
     * @param $str
     * @return array
     */
    private function letterPairs($str)
    {
        $numPairs = mb_strlen($str)-1;
        $pairs = array();

        for ($i = 0; $i < $numPairs; $i++)
        {
            $pairs[$i] = mb_substr($str,$i,2);
        }

        return $pairs;
    }

    /**
     * @param $str1
     * @param $str2
     * @return float
     */
    public function compareStrings($str1, $str2)
    {
        $pairs1 = $this->wordLetterPairs(strtoupper($str1));
        $pairs2 = $this->wordLetterPairs(strtoupper($str2));

        $intersection = 0;

        $union = count($pairs1) + count($pairs2);

        for ($i=0; $i < count($pairs1); $i++)
        {
            $pair1 = $pairs1[$i];

            $pairs2 = array_values($pairs2);
            for($j = 0; $j < count($pairs2); $j++)
            {
                $pair2 = $pairs2[$j];
                if ($pair1 === $pair2)
                {
                    $intersection++;
                    unset($pairs2[$j]);
                    break;
                }
            }
        }

        return (2.0*$intersection)/$union;
    }
}

— Igal Alkon
fuente

17

Una versión más corta de la respuesta de John Rutledge :

def get_bigrams(string):
    '''
    Takes a string and returns a list of bigrams
    '''
    s = string.lower()
    return {s[i:i+2] for i in xrange(len(s) - 1)}

def string_similarity(str1, str2):
    '''
    Perform bigram comparison between two strings
    and return a percentage match in decimal form
    '''
    pairs1 = get_bigrams(str1)
    pairs2 = get_bigrams(str2)
    return (2.0 * len(pairs1 & pairs2)) / (len(pairs1) + len(pairs2))

— cuántico
fuente

Incluso la intersectionvariable es un desperdicio de línea.

— Chibueze Opata

14

Esta discusión ha sido realmente útil, gracias. Convertí el algoritmo a VBA para usarlo con Excel y escribí algunas versiones de una función de hoja de trabajo, una para la comparación simple de un par de cadenas y la otra para comparar una cadena con un rango / matriz de cadenas. La versión strSimLookup devuelve la última mejor coincidencia como una cadena, índice de matriz o métrica de similitud.

Esta implementación produce los mismos resultados enumerados en el ejemplo de Amazon en el sitio web de Simon White con algunas excepciones menores en los partidos de baja puntuación; No estoy seguro de dónde entra la diferencia, podría ser la función Split de VBA, pero no he investigado ya que funciona bien para mis propósitos.

'Implements functions to rate how similar two strings are on
'a scale of 0.0 (completely dissimilar) to 1.0 (exactly similar)
'Source:   http://www.catalysoft.com/articles/StrikeAMatch.html
'Author: Bob Chatham, bob.chatham at gmail.com
'9/12/2010

Option Explicit

Public Function stringSimilarity(str1 As String, str2 As String) As Variant
'Simple version of the algorithm that computes the similiarity metric
'between two strings.
'NOTE: This verision is not efficient to use if you're comparing one string
'with a range of other values as it will needlessly calculate the pairs for the
'first string over an over again; use the array-optimized version for this case.

    Dim sPairs1 As Collection
    Dim sPairs2 As Collection

    Set sPairs1 = New Collection
    Set sPairs2 = New Collection

    WordLetterPairs str1, sPairs1
    WordLetterPairs str2, sPairs2

    stringSimilarity = SimilarityMetric(sPairs1, sPairs2)

    Set sPairs1 = Nothing
    Set sPairs2 = Nothing

End Function

Public Function strSimA(str1 As Variant, rRng As Range) As Variant
'Return an array of string similarity indexes for str1 vs every string in input range rRng
    Dim sPairs1 As Collection
    Dim sPairs2 As Collection
    Dim arrOut As Variant
    Dim l As Long, j As Long

    Set sPairs1 = New Collection

    WordLetterPairs CStr(str1), sPairs1

    l = rRng.Count
    ReDim arrOut(1 To l)
    For j = 1 To l
        Set sPairs2 = New Collection
        WordLetterPairs CStr(rRng(j)), sPairs2
        arrOut(j) = SimilarityMetric(sPairs1, sPairs2)
        Set sPairs2 = Nothing
    Next j

    strSimA = Application.Transpose(arrOut)

End Function

Public Function strSimLookup(str1 As Variant, rRng As Range, Optional returnType) As Variant
'Return either the best match or the index of the best match
'depending on returnTYype parameter) between str1 and strings in rRng)
' returnType = 0 or omitted: returns the best matching string
' returnType = 1           : returns the index of the best matching string
' returnType = 2           : returns the similarity metric

    Dim sPairs1 As Collection
    Dim sPairs2 As Collection
    Dim metric, bestMetric As Double
    Dim i, iBest As Long
    Const RETURN_STRING As Integer = 0
    Const RETURN_INDEX As Integer = 1
    Const RETURN_METRIC As Integer = 2

    If IsMissing(returnType) Then returnType = RETURN_STRING

    Set sPairs1 = New Collection

    WordLetterPairs CStr(str1), sPairs1

    bestMetric = -1
    iBest = -1

    For i = 1 To rRng.Count
        Set sPairs2 = New Collection
        WordLetterPairs CStr(rRng(i)), sPairs2
        metric = SimilarityMetric(sPairs1, sPairs2)
        If metric > bestMetric Then
            bestMetric = metric
            iBest = i
        End If
        Set sPairs2 = Nothing
    Next i

    If iBest = -1 Then
        strSimLookup = CVErr(xlErrValue)
        Exit Function
    End If

    Select Case returnType
    Case RETURN_STRING
        strSimLookup = CStr(rRng(iBest))
    Case RETURN_INDEX
        strSimLookup = iBest
    Case Else
        strSimLookup = bestMetric
    End Select

End Function

Public Function strSim(str1 As String, str2 As String) As Variant
    Dim ilen, iLen1, ilen2 As Integer

    iLen1 = Len(str1)
    ilen2 = Len(str2)

    If iLen1 >= ilen2 Then ilen = ilen2 Else ilen = iLen1

    strSim = stringSimilarity(Left(str1, ilen), Left(str2, ilen))

End Function

Sub WordLetterPairs(str As String, pairColl As Collection)
'Tokenize str into words, then add all letter pairs to pairColl

    Dim Words() As String
    Dim word, nPairs, pair As Integer

    Words = Split(str)

    If UBound(Words) < 0 Then
        Set pairColl = Nothing
        Exit Sub
    End If

    For word = 0 To UBound(Words)
        nPairs = Len(Words(word)) - 1
        If nPairs > 0 Then
            For pair = 1 To nPairs
                pairColl.Add Mid(Words(word), pair, 2)
            Next pair
        End If
    Next word

End Sub

Private Function SimilarityMetric(sPairs1 As Collection, sPairs2 As Collection) As Variant
'Helper function to calculate similarity metric given two collections of letter pairs.
'This function is designed to allow the pair collections to be set up separately as needed.
'NOTE: sPairs2 collection will be altered as pairs are removed; copy the collection
'if this is not the desired behavior.
'Also assumes that collections will be deallocated somewhere else

    Dim Intersect As Double
    Dim Union As Double
    Dim i, j As Long

    If sPairs1.Count = 0 Or sPairs2.Count = 0 Then
        SimilarityMetric = CVErr(xlErrNA)
        Exit Function
    End If

    Union = sPairs1.Count + sPairs2.Count
    Intersect = 0

    For i = 1 To sPairs1.Count
        For j = 1 To sPairs2.Count
            If StrComp(sPairs1(i), sPairs2(j)) = 0 Then
                Intersect = Intersect + 1
                sPairs2.Remove j
                Exit For
            End If
        Next j
    Next i

    SimilarityMetric = (2 * Intersect) / Union

End Function

— bchatham
fuente

@bchatham Esto parece extremadamente útil, pero soy nuevo en VBA y el código me desafía un poco. ¿Es posible que publique un archivo de Excel que haga uso de su contribución? Para mis propósitos, espero usarlo para que coincida con nombres similares de una sola columna en Excel con aproximadamente 1000 entradas (extracto aquí: dropbox.com/s/ofdliln9zxgi882/first-names-excerpt.xlsx ). Luego usaré las coincidencias como sinónimos en una búsqueda de personas. (ver también softwarerecs.stackexchange.com/questions/38227/… )

— bjornte

12

Lo siento, la respuesta no fue inventada por el autor. Este es un algoritmo bien conocido que fue presentado por primera vez por Digital Equipment Corporation y a menudo se lo conoce como herpes zóster.

http://www.hpl.hp.com/techreports/Compaq-DEC/SRC-TN-1997-015.pdf

10

Traduje el algoritmo de Simon White a PL / pgSQL. Esta es mi contribución.

<!-- language: lang-sql -->

create or replace function spt1.letterpairs(in p_str varchar) 
returns varchar  as 
$$
declare

    v_numpairs integer := length(p_str)-1;
    v_pairs varchar[];

begin

    for i in 1 .. v_numpairs loop
        v_pairs[i] := substr(p_str, i, 2);
    end loop;

    return v_pairs;

end;
$$ language 'plpgsql';

--===================================================================

create or replace function spt1.wordletterpairs(in p_str varchar) 
returns varchar as
$$
declare
    v_allpairs varchar[];
    v_words varchar[];
    v_pairsinword varchar[];
begin
    v_words := regexp_split_to_array(p_str, '[[:space:]]');

    for i in 1 .. array_length(v_words, 1) loop
        v_pairsinword := spt1.letterpairs(v_words[i]);

        if v_pairsinword is not null then
            for j in 1 .. array_length(v_pairsinword, 1) loop
                v_allpairs := v_allpairs || v_pairsinword[j];
            end loop;
        end if;

    end loop;


    return v_allpairs;
end;
$$ language 'plpgsql';

--===================================================================

create or replace function spt1.arrayintersect(ANYARRAY, ANYARRAY)
returns anyarray as 
$$
    select array(select unnest($1) intersect select unnest($2))
$$ language 'sql';

--===================================================================

create or replace function spt1.comparestrings(in p_str1 varchar, in p_str2 varchar)
returns float as
$$
declare
    v_pairs1 varchar[];
    v_pairs2 varchar[];
    v_intersection integer;
    v_union integer;
begin
    v_pairs1 := wordletterpairs(upper(p_str1));
    v_pairs2 := wordletterpairs(upper(p_str2));
    v_union := array_length(v_pairs1, 1) + array_length(v_pairs2, 1); 

    v_intersection := array_length(arrayintersect(v_pairs1, v_pairs2), 1);

    return (2.0 * v_intersection / v_union);
end;
$$ language 'plpgsql';

— fabiolimace
fuente

¡Funciona en mi PostgreSQL que no tiene soporte plruby! ¡Gracias!

— hostnik

¡Gracias! ¿Cómo haría esto en Oracle SQL?

— olovholm

Este puerto es incorrecto. La cadena exacta no devuelve 1.

— Brandon Wigfield

9

Las métricas de similitud de cadenas contienen una visión general de muchas métricas diferentes utilizadas en la comparación de cadenas ( Wikipedia también tiene una visión general). Gran parte de estas métricas se implementan en una biblioteca simétrica .

Otro ejemplo de métrica, no incluido en la descripción general dada es, por ejemplo, la distancia de compresión (intentando aproximar la complejidad de Kolmogorov ), que puede usarse para textos un poco más largos que el que usted presentó.

También puede considerar examinar un tema mucho más amplio del procesamiento del lenguaje natural . Estos paquetes R pueden ayudarlo a comenzar rápidamente (o al menos darle algunas ideas).

Y una última edición: busque las otras preguntas sobre este tema en SO, hay bastantes preguntas relacionadas.

— Anónimo
fuente

9

Una versión PHP más rápida del algoritmo:

/**
 *
 * @param $str
 * @return mixed
 */
private static function wordLetterPairs ($str)
{
    $allPairs = array();

    // Tokenize the string and put the tokens/words into an array

    $words = explode(' ', $str);

    // For each word
    for ($w = 0; $w < count($words); $w ++) {
        // Find the pairs of characters
        $pairsInWord = self::letterPairs($words[$w]);

        for ($p = 0; $p < count($pairsInWord); $p ++) {
            $allPairs[$pairsInWord[$p]] = $pairsInWord[$p];
        }
    }

    return array_values($allPairs);
}

/**
 *
 * @param $str
 * @return array
 */
private static function letterPairs ($str)
{
    $numPairs = mb_strlen($str) - 1;
    $pairs = array();

    for ($i = 0; $i < $numPairs; $i ++) {
        $pairs[$i] = mb_substr($str, $i, 2);
    }

    return $pairs;
}

/**
 *
 * @param $str1
 * @param $str2
 * @return float
 */
public static function compareStrings ($str1, $str2)
{
    $pairs1 = self::wordLetterPairs(mb_strtolower($str1));
    $pairs2 = self::wordLetterPairs(mb_strtolower($str2));


    $union = count($pairs1) + count($pairs2);

    $intersection = count(array_intersect($pairs1, $pairs2));

    return (2.0 * $intersection) / $union;
}

Para los datos que tenía (aproximadamente 2300 comparaciones), tuve un tiempo de ejecución de 0.58 segundos con la solución Igal Alkon versus 0.35 segundos con la mía.

— Andy
fuente

9

Una versión en la bella Scala:

  def pairDistance(s1: String, s2: String): Double = {

    def strToPairs(s: String, acc: List[String]): List[String] = {
      if (s.size < 2) acc
      else strToPairs(s.drop(1),
        if (s.take(2).contains(" ")) acc else acc ::: List(s.take(2)))
    }

    val lst1 = strToPairs(s1.toUpperCase, List())
    val lst2 = strToPairs(s2.toUpperCase, List())

    (2.0 * lst2.intersect(lst1).size) / (lst1.size + lst2.size)

  }

— Ignobilis
fuente

6

Aquí está la versión R:

get_bigrams <- function(str)
{
  lstr = tolower(str)
  bigramlst = list()
  for(i in 1:(nchar(str)-1))
  {
    bigramlst[[i]] = substr(str, i, i+1)
  }
  return(bigramlst)
}

str_similarity <- function(str1, str2)
{
   pairs1 = get_bigrams(str1)
   pairs2 = get_bigrams(str2)
   unionlen  = length(pairs1) + length(pairs2)
   hit_count = 0
   for(x in 1:length(pairs1)){
        for(y in 1:length(pairs2)){
            if (pairs1[[x]] == pairs2[[y]])
                hit_count = hit_count + 1
        }
   }
   return ((2.0 * hit_count) / unionlen)
}

— Rainer
fuente

Este algoritmo es mejor pero bastante lento para datos grandes. Quiero decir, si uno tiene que comparar 10000 palabras con otras 15000 palabras, es demasiado lento. ¿Podemos aumentar su rendimiento en términos de velocidad?

— indra_patil

6

Publicando la respuesta de marzagao en C99, inspirada en estos algoritmos

double dice_match(const char *string1, const char *string2) {

    //check fast cases
    if (((string1 != NULL) && (string1[0] == '\0')) || 
        ((string2 != NULL) && (string2[0] == '\0'))) {
        return 0;
    }
    if (string1 == string2) {
        return 1;
    }

    size_t strlen1 = strlen(string1);
    size_t strlen2 = strlen(string2);
    if (strlen1 < 2 || strlen2 < 2) {
        return 0;
    }

    size_t length1 = strlen1 - 1;
    size_t length2 = strlen2 - 1;

    double matches = 0;
    int i = 0, j = 0;

    //get bigrams and compare
    while (i < length1 && j < length2) {
        char a[3] = {string1[i], string1[i + 1], '\0'};
        char b[3] = {string2[j], string2[j + 1], '\0'};
        int cmp = strcmpi(a, b);
        if (cmp == 0) {
            matches += 2;
        }
        i++;
        j++;
    }

    return matches / (length1 + length2);
}

Algunas pruebas basadas en el artículo original :

#include <stdio.h>

void article_test1() {
    char *string1 = "FRANCE";
    char *string2 = "FRENCH";
    printf("====%s====\n", __func__);
    printf("%2.f%% == 40%%\n", dice_match(string1, string2) * 100);
}


void article_test2() {
    printf("====%s====\n", __func__);
    char *string = "Healed";
    char *ss[] = {"Heard", "Healthy", "Help",
                  "Herded", "Sealed", "Sold"};
    int correct[] = {44, 55, 25, 40, 80, 0};
    for (int i = 0; i < 6; ++i) {
        printf("%2.f%% == %d%%\n", dice_match(string, ss[i]) * 100, correct[i]);
    }
}

void multicase_test() {
    char *string1 = "FRaNcE";
    char *string2 = "fREnCh";
    printf("====%s====\n", __func__);
    printf("%2.f%% == 40%%\n", dice_match(string1, string2) * 100);

}

void gg_test() {
    char *string1 = "GG";
    char *string2 = "GGGGG";
    printf("====%s====\n", __func__);
    printf("%2.f%% != 100%%\n", dice_match(string1, string2) * 100);
}


int main() {
    article_test1();
    article_test2();
    multicase_test();
    gg_test();

    return 0;
}

— guiones
fuente

5

Sobre la base de la increíble versión C # de Michael La Voie, según la solicitud para que sea un método de extensión, esto es lo que se me ocurrió. El principal beneficio de hacerlo de esta manera es que puede ordenar una lista genérica por el porcentaje de coincidencia. Por ejemplo, considere que tiene un campo de cadena llamado "Ciudad" en su objeto. Un usuario busca "Chester" y desea devolver los resultados en orden descendente de coincidencia. Por ejemplo, desea que aparezcan coincidencias literales de Chester antes de Rochester. Para hacer esto, agregue dos nuevas propiedades a su objeto:

    public string SearchText { get; set; }
    public double PercentMatch
    {
        get
        {
            return City.ToUpper().PercentMatchTo(this.SearchText.ToUpper());
        }
    }

Luego, en cada objeto, establezca SearchText en lo que el usuario buscó. Luego puede ordenarlo fácilmente con algo como:

    zipcodes = zipcodes.OrderByDescending(x => x.PercentMatch);

Aquí está la ligera modificación para que sea un método de extensión:

    /// <summary>
    /// This class implements string comparison algorithm
    /// based on character pair similarity
    /// Source: http://www.catalysoft.com/articles/StrikeAMatch.html
    /// </summary>
    public static double PercentMatchTo(this string str1, string str2)
    {
        List<string> pairs1 = WordLetterPairs(str1.ToUpper());
        List<string> pairs2 = WordLetterPairs(str2.ToUpper());

        int intersection = 0;
        int union = pairs1.Count + pairs2.Count;

        for (int i = 0; i < pairs1.Count; i++)
        {
            for (int j = 0; j < pairs2.Count; j++)
            {
                if (pairs1[i] == pairs2[j])
                {
                    intersection++;
                    pairs2.RemoveAt(j);//Must remove the match to prevent "GGGG" from appearing to match "GG" with 100% success

                    break;
                }
            }
        }

        return (2.0 * intersection) / union;
    }

    /// <summary>
    /// Gets all letter pairs for each
    /// individual word in the string
    /// </summary>
    /// <param name="str"></param>
    /// <returns></returns>
    private static List<string> WordLetterPairs(string str)
    {
        List<string> AllPairs = new List<string>();

        // Tokenize the string and put the tokens/words into an array
        string[] Words = Regex.Split(str, @"\s");

        // For each word
        for (int w = 0; w < Words.Length; w++)
        {
            if (!string.IsNullOrEmpty(Words[w]))
            {
                // Find the pairs of characters
                String[] PairsInWord = LetterPairs(Words[w]);

                for (int p = 0; p < PairsInWord.Length; p++)
                {
                    AllPairs.Add(PairsInWord[p]);
                }
            }
        }

        return AllPairs;
    }

    /// <summary>
    /// Generates an array containing every 
    /// two consecutive letters in the input string
    /// </summary>
    /// <param name="str"></param>
    /// <returns></returns>
    private static  string[] LetterPairs(string str)
    {
        int numPairs = str.Length - 1;

        string[] pairs = new string[numPairs];

        for (int i = 0; i < numPairs; i++)
        {
            pairs[i] = str.Substring(i, 2);
        }

        return pairs;
    }

— Frank Rundatz
fuente

Creo que sería mejor usar un bool isCaseSensitive con un valor predeterminado de falso, incluso si es cierto, la implementación es mucho más limpia

— Jordania

5

Mi implementación de JavaScript toma una cadena o matriz de cadenas y un piso opcional (el piso predeterminado es 0.5). Si le pasa una cadena, devolverá verdadero o falso dependiendo de si el puntaje de similitud de la cadena es mayor o igual que el piso. Si le pasa un conjunto de cadenas, devolverá un conjunto de esas cadenas cuyo puntaje de similitud es mayor o igual al piso, ordenado por puntaje.

Ejemplos:

'Healed'.fuzzy('Sealed');      // returns true
'Healed'.fuzzy('Help');        // returns false
'Healed'.fuzzy('Help', 0.25);  // returns true

'Healed'.fuzzy(['Sold', 'Herded', 'Heard', 'Help', 'Sealed', 'Healthy']);
// returns ["Sealed", "Healthy"]

'Healed'.fuzzy(['Sold', 'Herded', 'Heard', 'Help', 'Sealed', 'Healthy'], 0);
// returns ["Sealed", "Healthy", "Heard", "Herded", "Help", "Sold"]

Aquí está:

(function(){
  var default_floor = 0.5;

  function pairs(str){
    var pairs = []
      , length = str.length - 1
      , pair;
    str = str.toLowerCase();
    for(var i = 0; i < length; i++){
      pair = str.substr(i, 2);
      if(!/\s/.test(pair)){
        pairs.push(pair);
      }
    }
    return pairs;
  }

  function similarity(pairs1, pairs2){
    var union = pairs1.length + pairs2.length
      , hits = 0;

    for(var i = 0; i < pairs1.length; i++){
      for(var j = 0; j < pairs2.length; j++){
        if(pairs1[i] == pairs2[j]){
          pairs2.splice(j--, 1);
          hits++;
          break;
        }
      }
    }
    return 2*hits/union || 0;
  }

  String.prototype.fuzzy = function(strings, floor){
    var str1 = this
      , pairs1 = pairs(this);

    floor = typeof floor == 'number' ? floor : default_floor;

    if(typeof(strings) == 'string'){
      return str1.length > 1 && strings.length > 1 && similarity(pairs1, pairs(strings)) >= floor || str1.toLowerCase() == strings.toLowerCase();
    }else if(strings instanceof Array){
      var scores = {};

      strings.map(function(str2){
        scores[str2] = str1.length > 1 ? similarity(pairs1, pairs(str2)) : 1*(str1.toLowerCase() == str2.toLowerCase());
      });

      return strings.filter(function(str){
        return scores[str] >= floor;
      }).sort(function(a, b){
        return scores[b] - scores[a];
      });
    }
  };
})();

— gruppler
fuente

1

Bug / Typo! for(var j = 0; j < pairs1.length; j++){debería serfor(var j = 0; j < pairs2.length; j++){

— Searle

3

El algoritmo del coeficiente Dice (respuesta de Simon White / marzagao) se implementa en Ruby en el método pair_distance_similar en la gema amatch

https://github.com/flori/amatch

Esta gema también contiene implementaciones de varios algoritmos de comparación aproximada y comparación de cadenas: distancia de edición de Levenshtein, distancia de edición de Sellers, distancia de Hamming, longitud de subsecuencia común más larga, longitud de subcadena común más larga, métrica de distancia de par, métrica de Jaro-Winkler .

— s01ipsist
fuente

2

Una versión de Haskell: siéntase libre de sugerir modificaciones porque no he hecho mucho Haskell.

import Data.Char
import Data.List

-- Convert a string into words, then get the pairs of words from that phrase
wordLetterPairs :: String -> [String]
wordLetterPairs s1 = concat $ map pairs $ words s1

-- Converts a String into a list of letter pairs.
pairs :: String -> [String]
pairs [] = []
pairs (x:[]) = []
pairs (x:ys) = [x, head ys]:(pairs ys)

-- Calculates the match rating for two strings
matchRating :: String -> String -> Double
matchRating s1 s2 = (numberOfMatches * 2) / totalLength
  where pairsS1 = wordLetterPairs $ map toLower s1
        pairsS2 = wordLetterPairs $ map toLower s2
        numberOfMatches = fromIntegral $ length $ pairsS1 `intersect` pairsS2
        totalLength = fromIntegral $ length pairsS1 + length pairsS2

— matthewpalmer
fuente

2

Clojure:

(require '[clojure.set :refer [intersection]])

(defn bigrams [s]
  (->> (split s #"\s+")
       (mapcat #(partition 2 1 %))
       (set)))

(defn string-similarity [a b]
  (let [a-pairs (bigrams a)
        b-pairs (bigrams b)
        total-count (+ (count a-pairs) (count b-pairs))
        match-count (count (intersection a-pairs b-pairs))
        similarity (/ (* 2 match-count) total-count)]
    similarity))

— Shaun Lebron
fuente

1

¿Qué pasa con la distancia de Levenshtein, dividida por la longitud de la primera cadena (o alternativamente dividida mi longitud mínima / máxima / promedio de ambas cadenas)? Eso me ha funcionado hasta ahora.

— Tehvan
fuente

Sin embargo, para citar otra publicación sobre este tema, lo que devuelve es a menudo "errático". Clasifica 'echo' como bastante similar a 'dog'.

— Xyene

@Nox: La parte "dividida por la longitud de la primera cadena" de esta respuesta es significativa. Además, esto funciona mejor que el muy alabado algoritmo de Dice para errores tipográficos y de transposición, e incluso conjugaciones comunes (considere comparar "nadar" y "nadar", por ejemplo).

— Recogida de Logan el

1

Hola chicos, probé esto en javascript, pero soy nuevo en esto, ¿alguien sabe formas más rápidas de hacerlo?

function get_bigrams(string) {
    // Takes a string and returns a list of bigrams
    var s = string.toLowerCase();
    var v = new Array(s.length-1);
    for (i = 0; i< v.length; i++){
        v[i] =s.slice(i,i+2);
    }
    return v;
}

function string_similarity(str1, str2){
    /*
    Perform bigram comparison between two strings
    and return a percentage match in decimal form
    */
    var pairs1 = get_bigrams(str1);
    var pairs2 = get_bigrams(str2);
    var union = pairs1.length + pairs2.length;
    var hit_count = 0;
    for (x in pairs1){
        for (y in pairs2){
            if (pairs1[x] == pairs2[y]){
                hit_count++;
            }
        }
    }
    return ((2.0 * hit_count) / union);
}


var w1 = 'Healed';
var word =['Heard','Healthy','Help','Herded','Sealed','Sold']
for (w2 in word){
    console.log('Healed --- ' + word[w2])
    console.log(string_similarity(w1,word[w2]));
}

— usuario779420
fuente

Esta implementación es incorrecta. La función bigram se rompe para la entrada de longitud 0. El método string_similarity no se rompe correctamente dentro del segundo ciclo, lo que puede llevar a contar pares varias veces, lo que lleva a un valor de retorno que excede el 100%. Y también ha olvidado declarar xy y, y no debe recorrer los bucles con un for..in..bucle (use for(..;..;..)en su lugar).

— Rob W

1

Aquí hay otra versión de Similitud basada en el índice Sørensen – Dice (respuesta de marzagao), esta escrita en C ++ 11:

/*
 * Similarity based in Sørensen–Dice index.
 *
 * Returns the Similarity between _str1 and _str2.
 */
double similarity_sorensen_dice(const std::string& _str1, const std::string& _str2) {
    // Base case: if some string is empty.
    if (_str1.empty() || _str2.empty()) {
        return 1.0;
    }

    auto str1 = upper_string(_str1);
    auto str2 = upper_string(_str2);

    // Base case: if the strings are equals.
    if (str1 == str2) {
        return 0.0;
    }

    // Base case: if some string does not have bigrams.
    if (str1.size() < 2 || str2.size() < 2) {
        return 1.0;
    }

    // Extract bigrams from str1
    auto num_pairs1 = str1.size() - 1;
    std::unordered_set<std::string> str1_bigrams;
    str1_bigrams.reserve(num_pairs1);
    for (unsigned i = 0; i < num_pairs1; ++i) {
        str1_bigrams.insert(str1.substr(i, 2));
    }

    // Extract bigrams from str2
    auto num_pairs2 = str2.size() - 1;
    std::unordered_set<std::string> str2_bigrams;
    str2_bigrams.reserve(num_pairs2);
    for (unsigned int i = 0; i < num_pairs2; ++i) {
        str2_bigrams.insert(str2.substr(i, 2));
    }

    // Find the intersection between the two sets.
    int intersection = 0;
    if (str1_bigrams.size() < str2_bigrams.size()) {
        const auto it_e = str2_bigrams.end();
        for (const auto& bigram : str1_bigrams) {
            intersection += str2_bigrams.find(bigram) != it_e;
        }
    } else {
        const auto it_e = str1_bigrams.end();
        for (const auto& bigram : str2_bigrams) {
            intersection += str1_bigrams.find(bigram) != it_e;
        }
    }

    // Returns similarity coefficient.
    return (2.0 * intersection) / (num_pairs1 + num_pairs2);
}

— chema989
fuente

1

Estaba buscando la implementación de puro rubí del algoritmo indicado por la respuesta de @marzagao. Desafortunadamente, el enlace indicado por @marzagao está roto. En la respuesta @ s01ipsist, indicó una coincidencia de gemas de rubí donde la implementación no está en rubí puro. Así que busqué un poco y encontré la gema fuzzy_match que tiene una implementación de rubí puro (aunque este uso de gemas amatch) aquí . Espero que esto ayude a alguien como yo.

— Engr. Hasanuzzaman Sumon
fuente