levenshtein - Comprobando subcadena difusa/aproximada existente en una cadena más larga, en Python?

fuzzywuzzy python (5)

¿Qué le difflib.SequenceMatcher.get_matching_blocks usar difflib.SequenceMatcher.get_matching_blocks ?

>>> import difflib >>> large_string = "thelargemanhatanproject" >>> query_string = "manhattan" >>> s = difflib.SequenceMatcher(None, large_string, query_string) >>> sum(n for i,j,n in s.get_matching_blocks()) / float(len(query_string)) 0.8888888888888888 >>> query_string = "banana" >>> s = difflib.SequenceMatcher(None, large_string, query_string) >>> sum(n for i,j,n in s.get_matching_blocks()) / float(len(query_string)) 0.6666666666666666

ACTUALIZAR

import difflib def matches(large_string, query_string, threshold): words = large_string.split() for word in words: s = difflib.SequenceMatcher(None, word, query_string) match = ''''.join(word[i:i+n] for i, j, n in s.get_matching_blocks() if n) if len(match) / float(len(query_string)) >= threshold: yield match large_string = "thelargemanhatanproject is a great project in themanhattincity" query_string = "manhattan" print list(matches(large_string, query_string, 0.8))

Sobre el código de impresión: [''manhatan'', ''manhattn'']

Usando algoritmos como leveinstein (leveinstein o difflib), es fácil encontrar coincidencias aproximadas.

>>> import difflib >>> difflib.SequenceMatcher(None,"amazing","amaging").ratio() 0.8571428571428571

Las coincidencias borrosas se pueden detectar al decidir un umbral según sea necesario.

Requisito actual: para encontrar subcadenas difusas basadas en un umbral en una cadena más grande.

p.ej.

large_string = "thelargemanhatanproject is a great project in themanhattincity" query_string = "manhattan" #result = "manhatan","manhattin" and their indexes in large_string

Una solución de fuerza bruta es generar todas las subcadenas de longitud N-1 a N + 1 (u otra longitud coincidente), donde N es la longitud de query_string, y usar levenstein en ellas una por una y ver el umbral.

¿Hay mejor solución disponible en python, preferiblemente un módulo incluido en Python 2.7 o un módulo disponible externamente?

ACTUALIZACIÓN : el módulo de expresión regular de Python funciona bastante bien, aunque es un poco más lento que el módulo incorporado para los casos de subcadenas difusas, que es un resultado obvio debido a las operaciones adicionales. La salida deseada es buena y el control sobre la magnitud de la borrosidad se puede definir fácilmente.

>>> import regex >>> input = "Monalisa was painted by Leonrdo da Vinchi" >>> regex.search(r''/b(leonardo){e<3}/s+(da)/s+(vinci){e<2}/b'',input,flags=regex.IGNORECASE) <regex.Match object; span=(23, 41), match='' Leonrdo da Vinchi'', fuzzy_counts=(0, 2, 1)>

La nueva biblioteca de expresiones regulares que pronto se supone que reemplazará incluye una coincidencia difusa.

https://pypi.python.org/pypi/regex/

La sintaxis de coincidencia difusa se ve bastante expresiva, pero esto le daría una coincidencia con una o menos inserciones / adiciones / eliminaciones.

import regex regex.match(''(amazing){e<=1}'', ''amaging'')

Los enfoques anteriores son buenos, pero necesitaba encontrar una aguja pequeña en gran cantidad de heno, y terminé acercándome así:

from difflib import SequenceMatcher as SM from nltk.util import ngrams import codecs needle = "this is the string we want to find" hay = "text text lots of text and more and more this string is the one we wanted to find and here is some more and even more still" needle_length = len(needle.split()) max_sim_val = 0 max_sim_string = u"" for ngram in ngrams(hay.split(), needle_length + int(.2*needle_length)): hay_ngram = u" ".join(ngram) similarity = SM(None, hay_ngram, needle).ratio() if similarity > max_sim_val: max_sim_val = similarity max_sim_string = hay_ngram print max_sim_val, max_sim_string

Rendimientos:

0.72972972973 this string is the one we wanted to find

Recientemente escribí una biblioteca de alineación para Python: https://github.com/eseraygun/python-alignment

Utilizándolo, puede realizar alineaciones globales y locales con estrategias arbitrarias de puntuación en cualquier par de secuencias. En realidad, en su caso, necesita alineaciones semi-locales ya que no le importan las subcadenas de query_string . Simulé el algoritmo semi-local utilizando alineación local y algunas heurísticas en el siguiente código, pero es fácil ampliar la biblioteca para una implementación adecuada.

Aquí está el código de ejemplo en el archivo README modificado para su caso.

from alignment.sequence import Sequence, GAP_ELEMENT from alignment.vocabulary import Vocabulary from alignment.sequencealigner import SimpleScoring, LocalSequenceAligner large_string = "thelargemanhatanproject is a great project in themanhattincity" query_string = "manhattan" # Create sequences to be aligned. a = Sequence(large_string) b = Sequence(query_string) # Create a vocabulary and encode the sequences. v = Vocabulary() aEncoded = v.encodeSequence(a) bEncoded = v.encodeSequence(b) # Create a scoring and align the sequences using local aligner. scoring = SimpleScoring(1, -1) aligner = LocalSequenceAligner(scoring, -1, minScore=5) score, encodeds = aligner.align(aEncoded, bEncoded, backtrace=True) # Iterate over optimal alignments and print them. for encoded in encodeds: alignment = v.decodeSequenceAlignment(encoded) # Simulate a semi-local alignment. if len(filter(lambda e: e != GAP_ELEMENT, alignment.second)) != len(b): continue if alignment.first[0] == GAP_ELEMENT or alignment.first[-1] == GAP_ELEMENT: continue if alignment.second[0] == GAP_ELEMENT or alignment.second[-1] == GAP_ELEMENT: continue print alignment print ''Alignment score:'', alignment.score print ''Percent identity:'', alignment.percentIdentity() print

La salida de minScore=5 es la siguiente.

m a n h a - t a n m a n h a t t a n Alignment score: 7 Percent identity: 88.8888888889 m a n h a t t - i m a n h a t t a n Alignment score: 5 Percent identity: 77.7777777778 m a n h a t t i n m a n h a t t a n Alignment score: 7 Percent identity: 88.8888888889

Si elimina el argumento de minScore , obtendrá solo las mejores coincidencias de puntuación.

m a n h a - t a n m a n h a t t a n Alignment score: 7 Percent identity: 88.8888888889 m a n h a t t i n m a n h a t t a n Alignment score: 7 Percent identity: 88.8888888889

Tenga en cuenta que todos los algoritmos en la biblioteca tienen O(n * m) complejidad de tiempo, n y m son las longitudes de las secuencias.

Uso fuzzywuzzy para concordancia difusa en función del umbral y fuzzysearch para extraer palabras difusas del partido.

process.extractBests toma una consulta, una lista de palabras y una puntuación de corte y devuelve una lista de tuplas de coincidencia y puntaje por encima de la puntuación de corte.

find_near_matches toma el resultado de process.extractBests y devuelve los índices de inicio y final de las palabras. Utilizo los índices para construir las palabras y uso la palabra construida para encontrar el índice en la cadena grande. max_l_dist of find_near_matches es ''distancia de Levenshtein'' que debe ajustarse para satisfacer las necesidades.

from fuzzysearch import find_near_matches from fuzzywuzzy import process large_string = "thelargemanhatanproject is a great project in themanhattincity" query_string = "manhattan" def fuzzy_extract(qs, ls, threshold): ''''''fuzzy matches ''qs'' in ''ls'' and returns list of tuples of (word,index) '''''' for word, _ in process.extractBests(qs, (ls,), score_cutoff=threshold): print(''word {}''.format(word)) for match in find_near_matches(qs, word, max_l_dist=1): match = word[match.start:match.end] print(''match {}''.format(match)) index = ls.find(match) yield (match, index)

Prueba;

print(''query: {}/nstring: {}''.format(query_string, large_string)) for match,index in fuzzy_extract(query_string, large_string, 70): print(''match: {}/nindex: {}''.format(match, index)) query_string = "citi" print(''query: {}/nstring: {}''.format(query_string, large_string)) for match,index in fuzzy_extract(query_string, large_string, 30): print(''match: {}/nindex: {}''.format(match, index)) query_string = "greet" print(''query: {}/nstring: {}''.format(query_string, large_string)) for match,index in fuzzy_extract(query_string, large_string, 30): print(''match: {}/nindex: {}''.format(match, index))

Salida;
consulta: manhattan
string: thelargemanhatanproject es un gran proyecto en themanhattincity
partido: manhatan
índice: 8
partido: manhattin
índice: 49

consulta: citi
string: thelargemanhatanproject es un gran proyecto en themanhattincity
partido: ciudad
índice: 58

consulta: saludar
string: thelargemanhatanproject es un gran proyecto en themanhattincity
partido: genial
índice: 29