algorithm - para - Encontrar al vecino más cercano utilizando el algoritmo optimizado de Levenshtein

distancia de levenshtein sql (1)

Lo he hecho varias veces. La forma en que lo hago es con una caminata de árbol recursiva en profundidad del árbol de juego de posibles cambios. Hay un presupuesto k de cambios, que utilizo para podar el árbol. Con esa rutina a mano, primero la ejecuto con k = 0, luego k = 1, luego k = 2 hasta que reciba un golpe o no quiero ir más alto.

char* a = /* string 1 */; char* b = /* string 2 */; int na = strlen(a); int nb = strlen(b); bool walk(int ia, int ib, int k){ /* if the budget is exhausted, prune the search */ if (k < 0) return false; /* if at end of both strings we have a match */ if (ia == na && ib == nb) return true; /* if the first characters match, continue walking with no reduction in budget */ if (ia < na && ib < nb && a[ia] == b[ib] && walk(ia+1, ib+1, k)) return true; /* if the first characters don''t match, assume there is a 1-character replacement */ if (ia < na && ib < nb && a[ia] != b[ib] && walk(ia+1, ib+1, k-1)) return true; /* try assuming there is an extra character in a */ if (ia < na && walk(ia+1, ib, k-1)) return true; /* try assuming there is an extra character in b */ if (ib < nb && walk(ia, ib+1, k-1)) return true; /* if none of those worked, I give up */ return false; }

Agregado para explicar trie-search:

// definition of trie-node: struct TNode { TNode* pa[128]; // for each possible character, pointer to subnode }; // simple trie-walk of a node // key is the input word, answer is the output word, // i is the character position, and hdis is the hamming distance. void walk(TNode* p, char key[], char answer[], int i, int hdis){ // If this is the end of a word in the trie, it is marked as // having something non-null under the ''/0'' entry of the trie. if (p->pa[0] != null){ if (key[i] == ''/0'') printf("answer = %s, hdis = %d/n", answer, hdis); } // for every actual subnode of the trie for(char c = 1; c < 128; c++){ // if it is a real subnode if (p->pa[c] != null){ // keep track of the answer word represented by the trie answer[i] = c; answer[i+1] = ''/0''; // and walk that subnode // If the answer disagrees with the key, increment the hamming distance walk(p->pa[c], key, answer, i+1, (answer[i]==key[i] ? hdis : hdis+1)); } } } // Note: you have to edit this to handle short keys. // Simplest is to just append a lot of ''/0'' bytes to the key.

Ahora, para limitarlo a un presupuesto, simplemente rechace bajar si hdis es demasiado grande.

Hace poco publiqué una pregunta sobre la optimización del algoritmo para calcular la distancia de Levenshtein, y las respuestas me llevaron al artículo de Wikipedia sobre la distancia de Levenshtein .

El artículo menciona que si hay una k enlazada en la distancia máxima, un resultado posible puede ser de la consulta dada, entonces el tiempo de ejecución puede reducirse de O (mn) a O (kn) , siendo myn las longitudes de la instrumentos de cuerda. Busqué el algoritmo, pero no pude entender cómo implementarlo. Esperaba obtener algunas pistas sobre eso aquí.

La optimización es # 4 en "Posibles mejoras".

La parte que me confundió fue la que decía que solo necesitamos calcular una franja diagonal de ancho 2k + 1 , centrada en la diagonal principal (la diagonal principal se define como coordenadas (i, i)).

Si alguien pudiera ofrecer algo de ayuda / conocimiento, realmente lo apreciaría. Si es necesario, puedo publicar la descripción completa del algoritmo en el libro como una respuesta aquí.