python string nltk n-gram

n-grams in Python: four, five, six grams?




Great native-Python answers have already been given by other users. But here is the NLTK approach (just in case the OP gets penalized for reinventing what already exists in the NLTK library).

There is an ngram module ( http://www.nltk.org/_modules/nltk/model/ngram.html ) that people seldom use in NLTK. It is not because reading ngrams is hard, but because training a model on n-grams with n > 3 results in a lot of data sparsity.

from nltk import ngrams

sentence = 'this is a foo bar sentences and i want to ngramize it'
n = 6
sixgrams = ngrams(sentence.split(), n)

for grams in sixgrams:
    print grams
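For anyone on Python 3 with a recent NLTK, the same idea only needs print() as a function; a minimal sketch, assuming nothing beyond nltk.util.ngrams:

from nltk import ngrams

sentence = 'this is a foo bar sentences and i want to ngramize it'

# ngrams() returns a lazy generator of tuples; iterate over it or wrap it in list()
for grams in ngrams(sentence.split(), 6):
    print(grams)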

I am looking for a way to split a text into n-grams. Normally I would do something like:

import nltk
from nltk import bigrams

string = "I really like python, it's pretty awesome."
string_bigrams = bigrams(string)
print string_bigrams

I am aware that nltk only offers bigrams and trigrams, but is there a way to split my text into four-grams, five-grams or even hundred-grams?

Thanks!


I'm surprised that this hasn't shown up yet:

In [34]: sentence = "I really like python, it's pretty awesome.".split()

In [35]: N = 4

In [36]: grams = [sentence[i:i+N] for i in xrange(len(sentence)-N+1)]

In [37]: for gram in grams: print gram
['I', 'really', 'like', 'python,']
['really', 'like', 'python,', "it's"]
['like', 'python,', "it's", 'pretty']
['python,', "it's", 'pretty', 'awesome.']
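If you want the same comprehension wrapped up for an arbitrary n, a small plain-Python sketch (the helper name is my own) could look like this:

def ngrams_of(tokens, n):
    # one tuple per window of n consecutive tokens
    return [tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1)]

sentence = "I really like python, it's pretty awesome.".split()
print(ngrams_of(sentence, 5))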


NLTK is great, but sometimes it is overhead for some projects:

import re

def tokenize(text, ngrams=1):
    # strip punctuation and collapse whitespace before splitting into tokens
    text = re.sub(r'[\b\(\)\\\"\'\/\[\]\s+\,\.:\?;]', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    tokens = text.split()
    return [tuple(tokens[i:i+ngrams]) for i in xrange(len(tokens)-ngrams+1)]

Example usage:

>> text = "This is an example text"
>> tokenize(text, 2)
[('This', 'is'), ('is', 'an'), ('an', 'example'), ('example', 'text')]
>> tokenize(text, 3)
[('This', 'is', 'an'), ('is', 'an', 'example'), ('an', 'example', 'text')]


I have never dealt with nltk, but I did N-grams as part of a small class project. If you want to find the frequency of all N-grams occurring in the string, here is a way to do that. D would give you the histogram of your N-word groups.

D = dict()
string = 'whatever string...'
strparts = string.split()
N = 3  # n-gram size; pick whatever n you need
for i in range(len(strparts)-N+1):  # +1 so the final N-gram is included
    try:
        D[tuple(strparts[i:i+N])] += 1
    except KeyError:
        D[tuple(strparts[i:i+N])] = 1
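If you only need the frequencies, collections.Counter does the same bookkeeping in one line; a small sketch (not part of the original answer):

from collections import Counter

def ngram_counts(text, n):
    # map each n-gram tuple to the number of times it occurs
    tokens = text.split()
    return Counter(tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1))

print(ngram_counts('to be or not to be to be', 2).most_common(3))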


For four-grams, it is already in NLTK; here is a piece of code that can help you with this:

from nltk.collocations import *
import nltk

# You should tokenize your text
text = "I do not like green eggs and ham, I do not like them Sam I am!"
tokens = nltk.wordpunct_tokenize(text)

fourgrams = nltk.collocations.QuadgramCollocationFinder.from_words(tokens)
for fourgram, freq in fourgrams.ngram_fd.items():
    print fourgram, freq

Hope that helps.
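If you need several n-gram lengths at once rather than collocation statistics, recent NLTK releases also ship nltk.util.everygrams; a minimal sketch, assuming that helper is available in your NLTK version:

import nltk
from nltk.util import everygrams

text = "I do not like green eggs and ham, I do not like them Sam I am!"
tokens = nltk.wordpunct_tokenize(text)

# yields every n-gram with 4 <= n <= 6 as a tuple of tokens
for gram in everygrams(tokens, min_len=4, max_len=6):
    print(gram)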


You can easily whip up your own function to do this using itertools:

from itertools import izip, islice, tee

s = 'spam and eggs'
N = 3
trigrams = izip(*(islice(seq, index, None) for index, seq in enumerate(tee(s, N))))
list(trigrams)
# [('s', 'p', 'a'), ('p', 'a', 'm'), ('a', 'm', ' '),
#  ('m', ' ', 'a'), (' ', 'a', 'n'), ('a', 'n', 'd'),
#  ('n', 'd', ' '), ('d', ' ', 'e'), (' ', 'e', 'g'),
#  ('e', 'g', 'g'), ('g', 'g', 's')]
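Note that itertools.izip no longer exists on Python 3; the built-in zip is already lazy there, so the same character-level trigrams can be written as follows (a Python 3 sketch of the snippet above):

from itertools import islice, tee

s = 'spam and eggs'
N = 3
# tee() makes N independent iterators over s; islice() shifts each one by its index
trigrams = zip(*(islice(seq, index, None) for index, seq in enumerate(tee(s, N))))
print(list(trigrams))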


You can get all the 4-grams to 6-grams using the code below, without any other package:

from itertools import chain

def get_m_2_ngrams(input_list, min, max):
    for s in chain(*[get_ngrams(input_list, k) for k in range(min, max+1)]):
        yield ' '.join(s)

def get_ngrams(input_list, n):
    return zip(*[input_list[i:] for i in range(n)])

if __name__ == '__main__':
    input_list = ['I', 'am', 'aware', 'that', 'nltk', 'only', 'offers', 'bigrams', 'and', 'trigrams', ',', 'but', 'is', 'there', 'a', 'way', 'to', 'split', 'my', 'text', 'in', 'four-grams', ',', 'five-grams', 'or', 'even', 'hundred-grams']
    for s in get_m_2_ngrams(input_list, 4, 6):
        print(s)

the output is below:

I am aware that
am aware that nltk
aware that nltk only
that nltk only offers
nltk only offers bigrams
only offers bigrams and
offers bigrams and trigrams
bigrams and trigrams ,
and trigrams , but
trigrams , but is
, but is there
but is there a
is there a way
there a way to
a way to split
way to split my
to split my text
split my text in
my text in four-grams
text in four-grams ,
in four-grams , five-grams
four-grams , five-grams or
, five-grams or even
five-grams or even hundred-grams
I am aware that nltk
am aware that nltk only
aware that nltk only offers
that nltk only offers bigrams
nltk only offers bigrams and
only offers bigrams and trigrams
offers bigrams and trigrams ,
bigrams and trigrams , but
and trigrams , but is
trigrams , but is there
, but is there a
but is there a way
is there a way to
there a way to split
a way to split my
way to split my text
to split my text in
split my text in four-grams
my text in four-grams ,
text in four-grams , five-grams
in four-grams , five-grams or
four-grams , five-grams or even
, five-grams or even hundred-grams
I am aware that nltk only
am aware that nltk only offers
aware that nltk only offers bigrams
that nltk only offers bigrams and
nltk only offers bigrams and trigrams
only offers bigrams and trigrams ,
offers bigrams and trigrams , but
bigrams and trigrams , but is
and trigrams , but is there
trigrams , but is there a
, but is there a way
but is there a way to
is there a way to split
there a way to split my
a way to split my text
way to split my text in
to split my text in four-grams
split my text in four-grams ,
my text in four-grams , five-grams
text in four-grams , five-grams or
in four-grams , five-grams or even
four-grams , five-grams or even hundred-grams

you can find more details on this blog


You can use sklearn.feature_extraction.text.CountVectorizer:

import sklearn.feature_extraction.text  # FYI http://scikit-learn.org/stable/install.html

ngram_size = 4
string = ["I really like python, it's pretty awesome."]
vect = sklearn.feature_extraction.text.CountVectorizer(ngram_range=(ngram_size, ngram_size))
vect.fit(string)
print('{1}-grams: {0}'.format(vect.get_feature_names(), ngram_size))

outputs:

4-grams: [u'like python it pretty', u'python it pretty awesome', u'really like python it']

You can set ngram_size to any positive integer. That is, you can split a text into four-grams, five-grams or even hundred-grams.
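A single vectorizer can also extract a whole range of n-gram sizes at once by passing different bounds to ngram_range; a sketch (note that scikit-learn 1.0+ renames get_feature_names() to get_feature_names_out(), so adjust to your version):

import sklearn.feature_extraction.text

docs = ["I really like python, it's pretty awesome."]
# extract every n-gram with 4 <= n <= 6 in one pass
vect = sklearn.feature_extraction.text.CountVectorizer(ngram_range=(4, 6))
vect.fit(docs)
print(vect.get_feature_names_out())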


If efficiency is an issue and you have to build multiple different n-grams (up to a hundred as you say), but you want to use pure Python, I would do:

from itertools import chain

def n_grams(seq, n=1):
    """Returns an iterator over the n-grams given a listTokens"""
    shiftToken = lambda i: (el for j, el in enumerate(seq) if j >= i)
    shiftedTokens = (shiftToken(i) for i in range(n))
    tupleNGrams = zip(*shiftedTokens)
    return tupleNGrams  # if join in generator: (" ".join(i) for i in tupleNGrams)

def range_ngrams(listTokens, ngramRange=(1,2)):
    """Returns an iterator over all n-grams for n in range(ngramRange) given a listTokens."""
    return chain(*(n_grams(listTokens, i) for i in range(*ngramRange)))

Usage:

>>> input_list = 'test the ngrams generator'.split()
>>> list(range_ngrams(input_list, ngramRange=(1,4)))
[('test',), ('the',), ('ngrams',), ('generator',), ('test', 'the'), ('the', 'ngrams'), ('ngrams', 'generator'), ('test', 'the', 'ngrams'), ('the', 'ngrams', 'generator')]

~ Same speed as NLTK:

import nltk

%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
nltk.ngrams(input_list, n=5)
# 7.02 ms ± 79 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
n_grams(input_list, n=5)
# 7.01 ms ± 103 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
nltk.ngrams(input_list, n=1)
nltk.ngrams(input_list, n=2)
nltk.ngrams(input_list, n=3)
nltk.ngrams(input_list, n=4)
nltk.ngrams(input_list, n=5)
# 7.32 ms ± 241 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
range_ngrams(input_list, ngramRange=(1,6))
# 7.13 ms ± 165 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Repost from my previous answer.


A more elegant approach to build bigrams uses Python's built-in zip(). Simply convert the original string into a list with split(), then pass the list once normally and once offset by one element.

string = "I really like python, it's pretty awesome."

def find_bigrams(s):
    input_list = s.split(" ")
    return zip(input_list, input_list[1:])

def find_ngrams(s, n):
    input_list = s.split(" ")
    return zip(*[input_list[i:] for i in range(n)])

find_bigrams(string)
[('I', 'really'), ('really', 'like'), ('like', 'python,'), ('python,', "it's"), ("it's", 'pretty'), ('pretty', 'awesome.')]
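The generalized helper above works the same way for any n; a small usage sketch (on Python 3, wrap the lazy zip object in list() to see the tuples):

# five-grams from the same sentence, using find_ngrams defined above
print(list(find_ngrams(string, 5)))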


Using only nltk tools:

from nltk.tokenize import word_tokenize
from nltk.util import ngrams

def get_ngrams(text, n):
    n_grams = ngrams(word_tokenize(text), n)
    return [' '.join(grams) for grams in n_grams]

Example output:

get_ngrams('This is the simplest text i could think of', 3)
['This is the', 'is the simplest', 'the simplest text', 'simplest text i', 'text i could', 'i could think', 'could think of']

In order to keep the ngrams in array format, just remove ' '.join
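A tiny sketch of that tuple-returning variant (same imports as above, just without the join; the helper name is my own):

from nltk.tokenize import word_tokenize
from nltk.util import ngrams

def get_ngram_tuples(text, n):
    # same as get_ngrams above, but keeps each n-gram as a tuple of tokens
    return list(ngrams(word_tokenize(text), n))

print(get_ngram_tuples('This is the simplest text i could think of', 3))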


here is another simple way of doing n-grams

>>> import nltk
>>> from nltk.util import ngrams
>>> text = "I am aware that nltk only offers bigrams and trigrams, but is there a way to split my text in four-grams, five-grams or even hundred-grams"
>>> tokenize = nltk.word_tokenize(text)
>>> tokenize
['I', 'am', 'aware', 'that', 'nltk', 'only', 'offers', 'bigrams', 'and', 'trigrams', ',', 'but', 'is', 'there', 'a', 'way', 'to', 'split', 'my', 'text', 'in', 'four-grams', ',', 'five-grams', 'or', 'even', 'hundred-grams']
>>> bigrams = ngrams(tokenize, 2)
>>> bigrams
[('I', 'am'), ('am', 'aware'), ('aware', 'that'), ('that', 'nltk'), ('nltk', 'only'), ('only', 'offers'), ('offers', 'bigrams'), ('bigrams', 'and'), ('and', 'trigrams'), ('trigrams', ','), (',', 'but'), ('but', 'is'), ('is', 'there'), ('there', 'a'), ('a', 'way'), ('way', 'to'), ('to', 'split'), ('split', 'my'), ('my', 'text'), ('text', 'in'), ('in', 'four-grams'), ('four-grams', ','), (',', 'five-grams'), ('five-grams', 'or'), ('or', 'even'), ('even', 'hundred-grams')]
>>> trigrams = ngrams(tokenize, 3)
>>> trigrams
[('I', 'am', 'aware'), ('am', 'aware', 'that'), ('aware', 'that', 'nltk'), ('that', 'nltk', 'only'), ('nltk', 'only', 'offers'), ('only', 'offers', 'bigrams'), ('offers', 'bigrams', 'and'), ('bigrams', 'and', 'trigrams'), ('and', 'trigrams', ','), ('trigrams', ',', 'but'), (',', 'but', 'is'), ('but', 'is', 'there'), ('is', 'there', 'a'), ('there', 'a', 'way'), ('a', 'way', 'to'), ('way', 'to', 'split'), ('to', 'split', 'my'), ('split', 'my', 'text'), ('my', 'text', 'in'), ('text', 'in', 'four-grams'), ('in', 'four-grams', ','), ('four-grams', ',', 'five-grams'), (',', 'five-grams', 'or'), ('five-grams', 'or', 'even'), ('or', 'even', 'hundred-grams')]
>>> fourgrams = ngrams(tokenize, 4)
>>> fourgrams
[('I', 'am', 'aware', 'that'), ('am', 'aware', 'that', 'nltk'), ('aware', 'that', 'nltk', 'only'), ('that', 'nltk', 'only', 'offers'), ('nltk', 'only', 'offers', 'bigrams'), ('only', 'offers', 'bigrams', 'and'), ('offers', 'bigrams', 'and', 'trigrams'), ('bigrams', 'and', 'trigrams', ','), ('and', 'trigrams', ',', 'but'), ('trigrams', ',', 'but', 'is'), (',', 'but', 'is', 'there'), ('but', 'is', 'there', 'a'), ('is', 'there', 'a', 'way'), ('there', 'a', 'way', 'to'), ('a', 'way', 'to', 'split'), ('way', 'to', 'split', 'my'), ('to', 'split', 'my', 'text'), ('split', 'my', 'text', 'in'), ('my', 'text', 'in', 'four-grams'), ('text', 'in', 'four-grams', ','), ('in', 'four-grams', ',', 'five-grams'), ('four-grams', ',', 'five-grams', 'or'), (',', 'five-grams', 'or', 'even'), ('five-grams', 'or', 'even', 'hundred-grams')]