separar - La subcadena común más larga de más de dos cadenas: Python

python separar string por caracter (7)

Estoy buscando una biblioteca de Python para encontrar la subcadena común más larga de un conjunto de cadenas . Hay dos formas de resolver este problema :

usando árboles de sufijo
usando programación dinámica.

El método implementado no es importante. Es importante que se pueda usar para un conjunto de cadenas (no solo para dos cadenas).

Estas funciones emparejadas encontrarán la cadena común más larga en cualquier matriz arbitraria de cadenas:

def long_substr(data): substr = '''' if len(data) > 1 and len(data[0]) > 0: for i in range(len(data[0])): for j in range(len(data[0])-i+1): if j > len(substr) and is_substr(data[0][i:i+j], data): substr = data[0][i:i+j] return substr def is_substr(find, data): if len(data) < 1 and len(find) < 1: return False for i in range(len(data)): if find not in data[i]: return False return True print long_substr([''Oh, hello, my friend.'', ''I prefer Jelly Belly beans.'', ''When hell freezes over!''])

Sin duda, el algoritmo podría mejorarse y no he tenido mucha exposición a Python, así que tal vez podría ser más eficiente sintácticamente también, pero debería hacer el trabajo.

EDITAR: alineó la segunda función is_substr como lo demostró JF Sebastian. El uso sigue siendo el mismo Nota: no hay cambio en el algoritmo.

def long_substr(data): substr = '''' if len(data) > 1 and len(data[0]) > 0: for i in range(len(data[0])): for j in range(len(data[0])-i+1): if j > len(substr) and all(data[0][i:i+j] in x for x in data): substr = data[0][i:i+j] return substr

Espero que esto ayude,

Jason.

Esto se puede hacer más corto:

def long_substr(data): substrs = lambda x: {x[i:i+j] for i in range(len(x)) for j in range(len(x) - i + 1)} s = substrs(data[0]) for val in data[1:]: s.intersection_update(substrs(val)) return max(s, key=len)

set''s (probablemente) implementado como hash-maps, lo que lo hace un poco ineficiente. Si (1) implementa un tipo de datos set como un trie y (2) solo almacena los postfixes en el trie y luego fuerza a cada nodo a ser un punto final (esto sería el equivalente de agregar todas las subcadenas), ENTONCES en teoría lo adivinaría este bebé es bastante eficiente desde el punto de vista de la memoria, especialmente porque las intersecciones de los intentos son muy fáciles.

Sin embargo, esto es corto y la optimización prematura es la raíz de una cantidad significativa de tiempo perdido.

Prefiero esto para is_substr , ya que me resulta un poco más legible e intuitivo:

def is_substr(find, data): """ inputs a substring to find, returns True only if found for each data in data list """ if len(find) < 1 or len(data) < 1: return False # expected input DNE is_found = True # and-ing to False anywhere in data will return False for i in data: print "Looking for substring %s in %s..." % (find, i) is_found = is_found and find in i return is_found

Puede usar el módulo SuffixTree que es un contenedor basado en una implementación ANSI C de árboles de sufijo generalizados. El módulo es fácil de manejar ....

Echa un vistazo a: here

Si alguien está buscando una versión generalizada que también puede tomar una lista de secuencias de objetos arbitrarios:

def get_longest_common_subseq(data): substr = [] if len(data) > 1 and len(data[0]) > 0: for i in range(len(data[0])): for j in range(len(data[0])-i+1): if j > len(substr) and is_subseq_of_any(data[0][i:i+j], data): substr = data[0][i:i+j] return substr def is_subseq_of_any(find, data): if len(data) < 1 and len(find) < 1: return False for i in range(len(data)): if not is_subseq(find, data[i]): return False return True # Will also return True if possible_subseq == seq. def is_subseq(possible_subseq, seq): if len(possible_subseq) > len(seq): return False def get_length_n_slices(n): for i in xrange(len(seq) + 1 - n): yield seq[i:i+n] for slyce in get_length_n_slices(len(possible_subseq)): if slyce == possible_subseq: return True return False print get_longest_common_subseq([[1, 2, 3, 4, 5], [2, 3, 4, 5, 6]]) print get_longest_common_subseq([''Oh, hello, my friend.'', ''I prefer Jelly Belly beans.'', ''When hell freezes over!''])

# this does not increase asymptotical complexity # but can still waste more time than it saves. TODO: profile def shortest_of(strings): return min(strings, key=len) def long_substr(strings): substr = "" if not strings: return substr reference = shortest_of(strings) #strings[0] length = len(reference) #find a suitable slice i:j for i in xrange(length): #only consider strings long at least len(substr) + 1 for j in xrange(i + len(substr) + 1, length + 1): candidate = reference[i:j] # ↓ is the slice recalculated every time? if all(candidate in text for text in strings): substr = candidate return substr

Descargo de responsabilidad Esto agrega muy poco a la respuesta de jtjacques. Sin embargo, con suerte, esto debería ser más fácil de leer y más rápido y no encajaba en un comentario, por lo tanto, por qué estoy publicando esto en una respuesta. No estoy satisfecho con el shortest_of , para ser sincero.

def common_prefix(strings): """ Find the longest string that is a prefix of all the strings. """ if not strings: return '''' prefix = strings[0] for s in strings: if len(s) < len(prefix): prefix = prefix[:len(s)] if not prefix: return '''' for i in range(len(prefix)): if prefix[i] != s[i]: prefix = prefix[:i] break return prefix

De http://bitbucket.org/ned/cog/src/tip/cogapp/whiteutils.py