python nlp n-gram language-model




How to compute skipgrams in Python?

Edited

The latest NLTK version, 3.2.5, has skipgrams implemented.

Here is a cleaner implementation by @jnothman from the NLTK repository: https://github.com/nltk/nltk/blob/develop/nltk/util.py#L538

    # Imports needed when running this function outside nltk/util.py
    from itertools import combinations
    from nltk.util import ngrams, pad_sequence

    def skipgrams(sequence, n, k, **kwargs):
        """
        Returns all possible skipgrams generated from a sequence of items, as an iterator.
        Skipgrams are ngrams that allows tokens to be skipped.
        Refer to http://homepages.inf.ed.ac.uk/ballison/pdf/lrec_skipgrams.pdf

        :param sequence: the source data to be converted into trigrams
        :type sequence: sequence or iter
        :param n: the degree of the ngrams
        :type n: int
        :param k: the skip distance
        :type k: int
        :rtype: iter(tuple)
        """
        # Pads the sequence as desired by **kwargs.
        if 'pad_left' in kwargs or 'pad_right' in kwargs:
            sequence = pad_sequence(sequence, n, **kwargs)

        # Note when iterating through the ngrams, the pad_right here is not
        # the **kwargs padding, it's for the algorithm to detect the SENTINEL
        # object on the right pad to stop inner loop.
        SENTINEL = object()
        for ngram in ngrams(sequence, n + k, pad_right=True, right_pad_symbol=SENTINEL):
            head = ngram[:1]
            tail = ngram[1:]
            for skip_tail in combinations(tail, n - 1):
                if skip_tail[-1] is SENTINEL:
                    continue
                yield head + skip_tail

[out]:

    >>> from nltk.util import skipgrams
    >>> sent = "Insurgents killed in ongoing fighting".split()
    >>> list(skipgrams(sent, 2, 2))
    [('Insurgents', 'killed'), ('Insurgents', 'in'), ('Insurgents', 'ongoing'),
     ('killed', 'in'), ('killed', 'ongoing'), ('killed', 'fighting'),
     ('in', 'ongoing'), ('in', 'fighting'), ('ongoing', 'fighting')]
    >>> list(skipgrams(sent, 3, 2))
    [('Insurgents', 'killed', 'in'), ('Insurgents', 'killed', 'ongoing'),
     ('Insurgents', 'killed', 'fighting'), ('Insurgents', 'in', 'ongoing'),
     ('Insurgents', 'in', 'fighting'), ('Insurgents', 'ongoing', 'fighting'),
     ('killed', 'in', 'ongoing'), ('killed', 'in', 'fighting'),
     ('killed', 'ongoing', 'fighting'), ('in', 'ongoing', 'fighting')]

A k-skipgram is an n-gram that is a superset of all n-grams and of every (k-i)-skipgram until (k-i) == 0 (which includes 0-skip-grams). So how can these skipgrams be computed efficiently in Python?

Below is the code I tried, but it does not work as expected:

<pre>
input_list = ['all', 'this', 'happened', 'more', 'or', 'less']

def find_skipgrams(input_list, N,K):
    bigram_list = []
    nlist=[]

    K=1
    for k in range(K+1):
        for i in range(len(input_list)-1):
            if i+k+1<len(input_list):
                nlist=[]
                for j in range(N+1):
                    if i+k+j+1<len(input_list):
                        nlist.append(input_list[i+k+j+1])

            bigram_list.append(nlist)
    return bigram_list
</pre>

The code above is not rendering correctly, but print find_skipgrams(['all', 'this', 'happened', 'more', 'or', 'less'],2,1) gives the following output:

    [['this', 'happened', 'more'], ['happened', 'more', 'or'], ['more', 'or', 'less'],
     ['or', 'less'], ['less'], ['happened', 'more', 'or'], ['more', 'or', 'less'],
     ['or', 'less'], ['less'], ['less']]

The code listed here does not give the correct output either: https://github.com/heaven00/skipgram/blob/master/skipgram.py

print skipgram_ndarray("What is your name") gives: ['What,is', 'is,your', 'your,name', 'name,', 'What,your', 'is,name']

name is a unigram!


How about using someone else's implementation, https://github.com/heaven00/skipgram/blob/master/skipgram.py , where k = skip_size and n = ngram_order:

    # Python 2 code: it relies on integer division with `/` and on map() being eager.
    import numpy as np

    def skipgram_ndarray(sent, k=1, n=2):
        """
        This is not exactly a vectorized version, because we are still
        using a for loop
        """
        tokens = sent.split()
        if len(tokens) < k + 2:
            raise Exception("REQ: length of sentence > skip + 2")
        matrix = np.zeros((len(tokens), k + 2), dtype=object)
        matrix[:, 0] = tokens
        matrix[:, 1] = tokens[1:] + ['']
        result = []
        for skip in range(1, k + 1):
            matrix[:, skip + 1] = tokens[skip + 1:] + [''] * (skip + 1)
        for index in range(1, k + 2):
            temp = matrix[:, 0] + ',' + matrix[:, index]
            map(result.append, temp.tolist())
        limit = (((k + 1) * (k + 2)) / 6) * ((3 * n) - (2 * k) - 6)
        return result[:limit]

    def skipgram_list(sent, k=1, n=2):
        """
        Form skipgram features using list comprehensions
        """
        tokens = sent.split()
        tokens_n = ['''tokens[index + j + {0}]'''.format(index)
                    for index in range(n - 1)]
        x = '(tokens[index], ' + ', '.join(tokens_n) + ')'
        query_part1 = 'result = [' + x + ' for index in range(len(tokens))'
        query_part2 = ' for j in range(1, k+2) if index + j + n < len(tokens)]'
        exec(query_part1 + query_part2)
        return result


Although this departs entirely from your code and defers the work to an external library, you can use Colibri Core ( https://proycon.github.io/colibri-core ) for skipgram extraction. It is a library written specifically for efficient n-gram and skipgram extraction from large text corpora. The code base is in C++ (for speed and efficiency), but a Python binding is available.

You rightly mentioned efficiency: skipgram extraction quickly shows exponential complexity. That may not be a big issue if you only pass a single sentence, as you did with your input_list, but it becomes problematic once you run it on large corpus data. To mitigate this, you can set parameters such as an occurrence threshold, or require that each skip of a skipgram be fillable by at least x distinct n-grams.

    import colibricore

    # Prepare corpus data (will be encoded for efficiency)
    corpusfile_plaintext = "somecorpus.txt"  # input, one sentence per line
    encoder = colibricore.ClassEncoder()
    encoder.build(corpusfile_plaintext)
    corpusfile = "somecorpus.colibri.dat"  # corpus output
    classfile = "somecorpus.colibri.cls"  # class encoding output
    encoder.encodefile(corpusfile_plaintext, corpusfile)
    encoder.save(classfile)

    # Set options for skipgram extraction (mintokens is the occurrence threshold,
    # maxlength the maximum ngram/skipgram length)
    options = colibricore.PatternModelOptions(mintokens=2, maxlength=8, doskipgrams=True)

    # Instantiate an empty pattern model
    model = colibricore.UnindexedPatternModel()

    # Train the model on the encoded corpus file (this does the skipgram extraction)
    model.train(corpusfile, options)

    # Load a decoder so we can view the output
    decoder = colibricore.ClassDecoder(classfile)

    # Output all skipgrams
    for pattern in model:
        if pattern.category() == colibricore.Category.SKIPGRAM:
            print(pattern.tostring(decoder))

There is a more extensive Python tutorial about all this on the website.

Disclaimer: I am the author of Colibri Core.
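To get a feel for how quickly the number of skipgrams grows with the skip distance, here is a minimal counting sketch. It is not part of Colibri Core; it assumes the windowed definition used by the NLTK-style code elsewhere in this thread (a head token combined with n - 1 tokens chosen from the following n - 1 + k tokens). The name count_skipgrams is hypothetical, and math.comb needs Python 3.8 or later.

```python
from math import comb


def count_skipgrams(length, n, k):
    """Count the k-skip-n-grams in a sentence of `length` tokens (windowed definition)."""
    total = 0
    for start in range(length):
        # Tokens available after the head, capped by the n - 1 + k window.
        window = min(n - 1 + k, length - 1 - start)
        # Choose the remaining n - 1 tokens from that window.
        total += comb(window, n - 1)
    return total


# How the count changes for a hypothetical 20-token sentence and 4-grams.
for k in range(5):
    print(k, count_skipgrams(20, 4, k))
```

Multiplied over every sentence of a large corpus, this growth is one reason an occurrence threshold such as mintokens is useful.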


Refer to this for complete information.

The example below has already been mentioned there as a usage example, and it works like a charm!

    >>> sent = "Insurgents killed in ongoing fighting".split()
    >>> list(skipgrams(sent, 2, 2))
    [('Insurgents', 'killed'), ('Insurgents', 'in'), ('Insurgents', 'ongoing'),
     ('killed', 'in'), ('killed', 'ongoing'), ('killed', 'fighting'),
     ('in', 'ongoing'), ('in', 'fighting'), ('ongoing', 'fighting')]


From the paper that the OP links, the following string:

Insurgents killed in ongoing fighting

Yields:

2-skip-bi-grams = {insurgents killed, insurgents in, insurgents ongoing, killed in, killed ongoing, killed fighting, in ongoing, in fighting, ongoing fighting}

2-skip-tri-grams = {insurgents killed in, insurgents killed ongoing, insurgents killed fighting, insurgents in ongoing, insurgents in fighting, insurgents ongoing fighting, killed in ongoing, killed in fighting, killed ongoing fighting, in ongoing fighting}.
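The two sets above can be reproduced straight from the definition: pick n token positions in order and keep the selection if at most k tokens are skipped between the first and the last position. The brute-force sketch below is only an illustration under that reading of the definition, not code from the paper or from NLTK, and the name skipgrams_by_definition is hypothetical.

```python
from itertools import combinations


def skipgrams_by_definition(tokens, n, k):
    """Brute-force k-skip-n-grams: n positions in order, at most k tokens skipped in total."""
    result = []
    for positions in combinations(range(len(tokens)), n):
        # Number of tokens skipped between the first and last chosen position.
        skipped = positions[-1] - positions[0] - (n - 1)
        if skipped <= k:
            result.append(tuple(tokens[i] for i in positions))
    return result


sent = "Insurgents killed in ongoing fighting".split()
bigrams_2skip = skipgrams_by_definition(sent, 2, 2)
trigrams_2skip = skipgrams_by_definition(sent, 3, 2)
print(len(bigrams_2skip), len(trigrams_2skip))  # prints: 9 10
```

Because the condition is "at most k", the result for a given k contains the results for every smaller k, down to the plain n-grams at k = 0. It enumerates every combination of positions, so it is only suitable for checking small examples.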

With a slight modification to NLTK's ngrams code ( https://github.com/nltk/nltk/blob/develop/nltk/util.py#L383 ):

    from itertools import chain, combinations
    import copy
    from nltk.util import ngrams

    def pad_sequence(sequence, n, pad_left=False, pad_right=False, pad_symbol=None):
        if pad_left:
            sequence = chain((pad_symbol,) * (n-1), sequence)
        if pad_right:
            sequence = chain(sequence, (pad_symbol,) * (n-1))
        return sequence

    def skipgrams(sequence, n, k, pad_left=False, pad_right=False, pad_symbol=None):
        sequence_length = len(sequence)
        sequence = iter(sequence)
        sequence = pad_sequence(sequence, n, pad_left, pad_right, pad_symbol)

        if sequence_length + pad_left + pad_right < k:
            raise Exception("The length of sentence + padding(s) < skip")

        if n < k:
            raise Exception("Degree of Ngrams (n) needs to be bigger than skip (k)")

        history = []
        nk = n+k

        # Return point for recursion.
        if nk < 1:
            return
        # If n+k longer than sequence, reduce k by 1 and recur
        elif nk > sequence_length:
            for ng in skipgrams(list(sequence), n, k-1):
                yield ng

        while nk > 1: # Collects the first instance of n+k length history
            history.append(next(sequence))
            nk -= 1

        # Iteratively drops the first item in history and picks up the next
        # while yielding skipgrams for each iteration.
        for item in sequence:
            history.append(item)
            current_token = history.pop(0)
            # Iterates through the rest of the history and
            # picks out all combinations of the n-1grams
            for idx in list(combinations(range(len(history)), n-1)):
                ng = [current_token]
                for _id in idx:
                    ng.append(history[_id])
                yield tuple(ng)

        # Recursively yield the skipgrams for the rest of the sequence where
        # len(sequence) < n+k
        for ng in list(skipgrams(history, n, k-1)):
            yield ng

Let's run a quick test to match the example in the paper:

    >>> text = "Insurgents killed in ongoing fighting".split()
    >>> two_skip_bigrams = list(skipgrams(text, n=2, k=2))
    [('Insurgents', 'killed'), ('Insurgents', 'in'), ('Insurgents', 'ongoing'),
     ('killed', 'in'), ('killed', 'ongoing'), ('killed', 'fighting'),
     ('in', 'ongoing'), ('in', 'fighting'), ('ongoing', 'fighting')]
    >>> two_skip_trigrams = list(skipgrams(text, n=3, k=2))
    [('Insurgents', 'killed', 'in'), ('Insurgents', 'killed', 'ongoing'),
     ('Insurgents', 'killed', 'fighting'), ('Insurgents', 'in', 'ongoing'),
     ('Insurgents', 'in', 'fighting'), ('Insurgents', 'ongoing', 'fighting'),
     ('killed', 'in', 'ongoing'), ('killed', 'in', 'fighting'),
     ('killed', 'ongoing', 'fighting'), ('in', 'ongoing', 'fighting')]

But note that if n+k > len(sequence), it yields the same result as skipgrams(sequence, n, k-1) (this is not a bug, it is a fail-safe feature), e.g.

    >>> three_skip_trigrams = list(skipgrams(text, n=3, k=3))
    >>> three_skip_fourgrams = list(skipgrams(text, n=4, k=3))
    >>> four_skip_fourgrams = list(skipgrams(text, n=4, k=4))
    >>> four_skip_fivegrams = list(skipgrams(text, n=5, k=4))
    >>>
    >>> print len(three_skip_trigrams), three_skip_trigrams
    10 [('Insurgents', 'killed', 'in'), ('Insurgents', 'killed', 'ongoing'), ('Insurgents', 'killed', 'fighting'), ('Insurgents', 'in', 'ongoing'), ('Insurgents', 'in', 'fighting'), ('Insurgents', 'ongoing', 'fighting'), ('killed', 'in', 'ongoing'), ('killed', 'in', 'fighting'), ('killed', 'ongoing', 'fighting'), ('in', 'ongoing', 'fighting')]
    >>> print len(three_skip_fourgrams), three_skip_fourgrams
    5 [('Insurgents', 'killed', 'in', 'ongoing'), ('Insurgents', 'killed', 'in', 'fighting'), ('Insurgents', 'killed', 'ongoing', 'fighting'), ('Insurgents', 'in', 'ongoing', 'fighting'), ('killed', 'in', 'ongoing', 'fighting')]
    >>> print len(four_skip_fourgrams), four_skip_fourgrams
    5 [('Insurgents', 'killed', 'in', 'ongoing'), ('Insurgents', 'killed', 'in', 'fighting'), ('Insurgents', 'killed', 'ongoing', 'fighting'), ('Insurgents', 'in', 'ongoing', 'fighting'), ('killed', 'in', 'ongoing', 'fighting')]
    >>> print len(four_skip_fivegrams), four_skip_fivegrams
    1 [('Insurgents', 'killed', 'in', 'ongoing', 'fighting')]

This allows n == k but disallows n < k, as shown in the lines:

    if n < k:
        raise Exception("Degree of Ngrams (n) needs to be bigger than skip (k)")

To see how it works, let's try to understand the "mystical" line:

    for idx in list(combinations(range(len(history)), n-1)):
        pass # Do something

Given a list of unique items, combinations produces this:

    >>> from itertools import combinations
    >>> x = [0,1,2,3,4,5]
    >>> list(combinations(x,2))
    [(0, 1), (0, 2), (0, 3), (0, 4), (0, 5), (1, 2), (1, 3), (1, 4), (1, 5),
     (2, 3), (2, 4), (2, 5), (3, 4), (3, 5), (4, 5)]

And since the indices of a list of tokens are always unique, e.g.

    >>> sent = ['this', 'is', 'a', 'foo', 'bar']
    >>> current_token = sent.pop(0) # i.e. 'this'
    >>> range(len(sent))
    [0,1,2,3]

It is possible to compute the possible combinations (without replacement) of the range:

    >>> n = 3
    >>> list(combinations(range(len(sent)), n-1))
    [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]

If we map the indices back to the list of tokens:

    >>> [tuple(sent[id] for id in idx) for idx in combinations(range(len(sent)), 2)]
    [('is', 'a'), ('is', 'foo'), ('is', 'bar'), ('a', 'foo'), ('a', 'bar'), ('foo', 'bar')]

Then we concatenate with the current_token to get the skipgrams for the current token and its context + skip window:

    >>> [tuple([current_token]) + tuple(sent[id] for id in idx) for idx in combinations(range(len(sent)), 2)]
    [('this', 'is', 'a'), ('this', 'is', 'foo'), ('this', 'is', 'bar'),
     ('this', 'a', 'foo'), ('this', 'a', 'bar'), ('this', 'foo', 'bar')]

After that, we move on to the next word.
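The steps above can be put together in a compact form. This is a minimal sketch rather than the code from this answer or from NLTK: it assumes the window for each token is simply the next n - 1 + k tokens, and the name sliding_skipgrams is hypothetical.

```python
from itertools import combinations


def sliding_skipgrams(tokens, n, k):
    """Yield k-skip-n-grams by combining each token with n - 1 tokens from its window."""
    tokens = list(tokens)
    for position, current_token in enumerate(tokens):
        # Context + skip window: the next n - 1 + k tokens after the current one.
        window = tokens[position + 1:position + n + k]
        # Index combinations within the window, mapped back to tokens.
        for idx in combinations(range(len(window)), n - 1):
            yield (current_token,) + tuple(window[i] for i in idx)


sent = ['this', 'is', 'a', 'foo', 'bar']
for gram in sliding_skipgrams(sent, n=3, k=2):
    print(gram)
```

Near the end of the sentence the window shrinks, and combinations() yields nothing once fewer than n - 1 tokens remain, so no padding or special casing is needed in this version.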