n-grams in Python: four, five, six grams?
Great native-Python answers have already been given by other users. But here is the NLTK approach (just in case the OP gets penalized for reinventing what already exists in the NLTK library).
There is an ngram module ( http://www.nltk.org/_modules/nltk/model/ngram.html ) that people rarely use in NLTK. It's not because reading ngrams is hard, but because training a model on n-grams where n > 3 results in a lot of data sparsity.
from nltk import ngrams
sentence = 'this is a foo bar sentences and i want to ngramize it'
n = 6
sixgrams = ngrams(sentence.split(), n)
for grams in sixgrams:
    print grams
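If you are on Python 3 (where print is a function), the same NLTK call would look like this; a minimal sketch, nothing beyond the snippet above changes:
from nltk import ngrams

sentence = 'this is a foo bar sentences and i want to ngramize it'
n = 6
sixgrams = ngrams(sentence.split(), n)
for grams in sixgrams:
    print(grams)  # each gram is a 6-tuple of tokens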
I'm looking for a way to split a text into n-grams. Normally I would do something like:
import nltk
from nltk import bigrams
string = "I really like python, it's pretty awesome."
string_bigrams = bigrams(string)
print string_bigrams
I am aware that nltk only offers bigrams and trigrams, but is there a way to split my text into four-grams, five-grams or even hundred-grams?
Thanks!
I'm surprised this hasn't shown up yet:
In [34]: sentence = "I really like python, it's pretty awesome.".split()
In [35]: N = 4
In [36]: grams = [sentence[i:i+N] for i in xrange(len(sentence)-N+1)]
In [37]: for gram in grams: print gram
['I', 'really', 'like', 'python,']
['really', 'like', 'python,', "it's"]
['like', 'python,', "it's", 'pretty']
['python,', "it's", 'pretty', 'awesome.']
NLTK is great, but sometimes it is overkill for some projects:
import re
def tokenize(text, ngrams=1):
    # strip punctuation and brackets, then collapse whitespace before splitting
    text = re.sub(r'[\b\(\)\\\"\'\[\]\s+\,\.:\?;]', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    tokens = text.split()
    return [tuple(tokens[i:i+ngrams]) for i in xrange(len(tokens)-ngrams+1)]
Example usage:
>> text = "This is an example text"
>> tokenize(text, 2)
[('This', 'is'), ('is', 'an'), ('an', 'example'), ('example', 'text')]
>> tokenize(text, 3)
[('This', 'is', 'an'), ('is', 'an', 'example'), ('an', 'example', 'text')]
I've never dealt with nltk, but I did N-grams as part of a small class project. If you want to find the frequency of all N-grams occurring in a string, here is a way to do that. D would give you the histogram of your N-words.
N = 4 # n-gram size (was left undefined in the original snippet)
D = dict()
string = 'whatever string...'
strparts = string.split()
for i in range(len(strparts)-N+1): # N-grams; +1 so the last one is not skipped
    try:
        D[tuple(strparts[i:i+N])] += 1
    except KeyError:
        D[tuple(strparts[i:i+N])] = 1
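For reference, the same frequency count can be written more idiomatically with collections.Counter; a minimal sketch, assuming the same whitespace tokenization and a predefined N:
from collections import Counter

N = 4 # n-gram size
string = 'whatever string...'
strparts = string.split()
# Counter handles the missing-key case that the try/except above works around
D = Counter(tuple(strparts[i:i+N]) for i in range(len(strparts)-N+1))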
Four-grams are already covered in NLTK; here is a piece of code that can help you with this:
from nltk.collocations import *
import nltk
#You should tokenize your text
text = "I do not like green eggs and ham, I do not like them Sam I am!"
tokens = nltk.wordpunct_tokenize(text)
fourgrams=nltk.collocations.QuadgramCollocationFinder.from_words(tokens)
for fourgram, freq in fourgrams.ngram_fd.items():
    print fourgram, freq
Hope this helps.
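As a side note, ngram_fd above is an NLTK FreqDist, so if you only need the most frequent fourgrams you can (assuming a recent NLTK, where FreqDist behaves like collections.Counter) ask for the top entries directly:
# top 5 most frequent fourgrams in the tokenized text
print(fourgrams.ngram_fd.most_common(5))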
You can easily whip up your own function to do this using itertools:
from itertools import izip, islice, tee
s = 'spam and eggs'
N = 3
trigrams = izip(*(islice(seq, index, None) for index, seq in enumerate(tee(s, N))))
list(trigrams)
# [('s', 'p', 'a'), ('p', 'a', 'm'), ('a', 'm', ' '),
# ('m', ' ', 'a'), (' ', 'a', 'n'), ('a', 'n', 'd'),
# ('n', 'd', ' '), ('d', ' ', 'e'), (' ', 'e', 'g'),
# ('e', 'g', 'g'), ('g', 'g', 's')]
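Iterating over the raw string this way yields character n-grams. If you want word n-grams with the same itertools pattern, split the string first; a small sketch under that assumption (using Python 3's built-in zip instead of izip):
from itertools import islice, tee

words = 'spam and eggs and more spam'.split()
N = 3
word_trigrams = zip(*(islice(seq, i, None) for i, seq in enumerate(tee(words, N))))
print(list(word_trigrams))
# [('spam', 'and', 'eggs'), ('and', 'eggs', 'and'), ('eggs', 'and', 'more'), ('and', 'more', 'spam')]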
You can get all of the 4-grams through 6-grams using the code below, without any other package:
from itertools import chain
def get_m_2_ngrams(input_list, min, max):
    for s in chain(*[get_ngrams(input_list, k) for k in range(min, max+1)]):
        yield ' '.join(s)

def get_ngrams(input_list, n):
    return zip(*[input_list[i:] for i in range(n)])

if __name__ == '__main__':
    input_list = ['I', 'am', 'aware', 'that', 'nltk', 'only', 'offers', 'bigrams', 'and', 'trigrams', ',', 'but', 'is', 'there', 'a', 'way', 'to', 'split', 'my', 'text', 'in', 'four-grams', ',', 'five-grams', 'or', 'even', 'hundred-grams']
    for s in get_m_2_ngrams(input_list, 4, 6):
        print(s)
The output is below:
I am aware that
am aware that nltk
aware that nltk only
that nltk only offers
nltk only offers bigrams
only offers bigrams and
offers bigrams and trigrams
bigrams and trigrams ,
and trigrams , but
trigrams , but is
, but is there
but is there a
is there a way
there a way to
a way to split
way to split my
to split my text
split my text in
my text in four-grams
text in four-grams ,
in four-grams , five-grams
four-grams , five-grams or
, five-grams or even
five-grams or even hundred-grams
I am aware that nltk
am aware that nltk only
aware that nltk only offers
that nltk only offers bigrams
nltk only offers bigrams and
only offers bigrams and trigrams
offers bigrams and trigrams ,
bigrams and trigrams , but
and trigrams , but is
trigrams , but is there
, but is there a
but is there a way
is there a way to
there a way to split
a way to split my
way to split my text
to split my text in
split my text in four-grams
my text in four-grams ,
text in four-grams , five-grams
in four-grams , five-grams or
four-grams , five-grams or even
, five-grams or even hundred-grams
I am aware that nltk only
am aware that nltk only offers
aware that nltk only offers bigrams
that nltk only offers bigrams and
nltk only offers bigrams and trigrams
only offers bigrams and trigrams ,
offers bigrams and trigrams , but
bigrams and trigrams , but is
and trigrams , but is there
trigrams , but is there a
, but is there a way
but is there a way to
is there a way to split
there a way to split my
a way to split my text
way to split my text in
to split my text in four-grams
split my text in four-grams ,
my text in four-grams , five-grams
text in four-grams , five-grams or
in four-grams , five-grams or even
four-grams , five-grams or even hundred-grams
You can find more details on this blog.
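For completeness, recent NLTK versions also ship nltk.util.everygrams, which covers the same 4-to-6-gram range in a single call; a minimal sketch, assuming NLTK 3.x is installed:
from nltk.util import everygrams

tokens = 'I am aware that nltk only offers bigrams and trigrams'.split()
for gram in everygrams(tokens, min_len=4, max_len=6):
    print(' '.join(gram))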
You can use sklearn.feature_extraction.text.CountVectorizer:
import sklearn.feature_extraction.text # FYI http://scikit-learn.org/stable/install.html
ngram_size = 4
string = ["I really like python, it''s pretty awesome."]
vect = sklearn.feature_extraction.text.CountVectorizer(ngram_range=(ngram_size,ngram_size))
vect.fit(string)
print('{1}-grams: {0}'.format(vect.get_feature_names(), ngram_size))
produces:
4-grams: [u'like python it pretty', u'python it pretty awesome', u'really like python it']
You can set ngram_size to any positive integer. That is, you can split a text into four-grams, five-grams or even hundred-grams.
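Since ngram_range takes a lower and an upper bound, the same vectorizer can also extract several sizes in one pass; a small sketch assuming scikit-learn is installed (newer releases expose get_feature_names_out() instead of get_feature_names()):
import sklearn.feature_extraction.text

string = ["I really like python, it's pretty awesome."]
vect = sklearn.feature_extraction.text.CountVectorizer(ngram_range=(4, 6)) # 4-, 5- and 6-grams at once
vect.fit(string)
print(vect.get_feature_names_out()) # use get_feature_names() on older scikit-learn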
If efficiency is an issue and you have to build several different n-grams (up to a hundred as you say), but you want to use pure Python, I would do:
from itertools import chain
def n_grams(seq, n=1):
    """Returns an iterator over the n-grams given a listTokens"""
    shiftToken = lambda i: (el for j,el in enumerate(seq) if j>=i)
    shiftedTokens = (shiftToken(i) for i in range(n))
    tupleNGrams = zip(*shiftedTokens)
    return tupleNGrams # if join in generator: (" ".join(i) for i in tupleNGrams)

def range_ngrams(listTokens, ngramRange=(1,2)):
    """Returns an iterator over all n-grams for n in range(ngramRange) given a listTokens."""
    return chain(*(n_grams(listTokens, i) for i in range(*ngramRange)))
Usage:
>>> input_list = 'test the ngrams generator'.split()
>>> list(range_ngrams(input_list, ngramRange=(1,3)))
[('test',), ('the',), ('ngrams',), ('generator',), ('test', 'the'), ('the', 'ngrams'), ('ngrams', 'generator'), ('test', 'the', 'ngrams'), ('the', 'ngrams', 'generator')]
~Same speed as NLTK:
import nltk
%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
nltk.ngrams(input_list,n=5)
# 7.02 ms ± 79 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
n_grams(input_list,n=5)
# 7.01 ms ± 103 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
nltk.ngrams(input_list,n=1)
nltk.ngrams(input_list,n=2)
nltk.ngrams(input_list,n=3)
nltk.ngrams(input_list,n=4)
nltk.ngrams(input_list,n=5)
# 7.32 ms ± 241 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
range_ngrams(input_list, ngramRange=(1,6))
# 7.13 ms ± 165 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Reposting from my previous answer.
A more elegant approach to building bigrams uses Python's built-in zip(). Simply convert the original string into a list with split(), then pass the list once normally and once offset by one element.
string = "I really like python, it''s pretty awesome."
def find_bigrams(s):
input_list = s.split(" ")
return zip(input_list, input_list[1:])
def find_ngrams(s, n):
input_list = s.split(" ")
return zip(*[input_list[i:] for i in range(n)])
find_bigrams(string)
[(''I'', ''really''), (''really'', ''like''), (''like'', ''python,''), (''python,'', "it''s"), ("it''s", ''pretty''), (''pretty'', ''awesome.'')]
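One caveat, assuming you run this under Python 3: there zip() returns a lazy iterator, so the list shown above only appears if you materialize it, for example:
print(list(find_bigrams(string)))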
Using only nltk tools:
from nltk.tokenize import word_tokenize
from nltk.util import ngrams

def get_ngrams(text, n):
    n_grams = ngrams(word_tokenize(text), n)
    return [' '.join(grams) for grams in n_grams]
Example output:
get_ngrams('This is the simplest text i could think of', 3)
['This is the', 'is the simplest', 'the simplest text', 'simplest text i', 'text i could', 'i could think', 'could think of']
To keep the ngrams in array format, just remove ' '.join
Here is another simple way of doing n-grams:
>>> import nltk
>>> from nltk.util import ngrams
>>> text = "I am aware that nltk only offers bigrams and trigrams, but is there a way to split my text in four-grams, five-grams or even hundred-grams"
>>> tokenize = nltk.word_tokenize(text)
>>> tokenize
['I', 'am', 'aware', 'that', 'nltk', 'only', 'offers', 'bigrams', 'and', 'trigrams', ',', 'but', 'is', 'there', 'a', 'way', 'to', 'split', 'my', 'text', 'in', 'four-grams', ',', 'five-grams', 'or', 'even', 'hundred-grams']
>>> bigrams = ngrams(tokenize, 2)
>>> bigrams
[('I', 'am'), ('am', 'aware'), ('aware', 'that'), ('that', 'nltk'), ('nltk', 'only'), ('only', 'offers'), ('offers', 'bigrams'), ('bigrams', 'and'), ('and', 'trigrams'), ('trigrams', ','), (',', 'but'), ('but', 'is'), ('is', 'there'), ('there', 'a'), ('a', 'way'), ('way', 'to'), ('to', 'split'), ('split', 'my'), ('my', 'text'), ('text', 'in'), ('in', 'four-grams'), ('four-grams', ','), (',', 'five-grams'), ('five-grams', 'or'), ('or', 'even'), ('even', 'hundred-grams')]
>>> trigrams = ngrams(tokenize, 3)
>>> trigrams
[('I', 'am', 'aware'), ('am', 'aware', 'that'), ('aware', 'that', 'nltk'), ('that', 'nltk', 'only'), ('nltk', 'only', 'offers'), ('only', 'offers', 'bigrams'), ('offers', 'bigrams', 'and'), ('bigrams', 'and', 'trigrams'), ('and', 'trigrams', ','), ('trigrams', ',', 'but'), (',', 'but', 'is'), ('but', 'is', 'there'), ('is', 'there', 'a'), ('there', 'a', 'way'), ('a', 'way', 'to'), ('way', 'to', 'split'), ('to', 'split', 'my'), ('split', 'my', 'text'), ('my', 'text', 'in'), ('text', 'in', 'four-grams'), ('in', 'four-grams', ','), ('four-grams', ',', 'five-grams'), (',', 'five-grams', 'or'), ('five-grams', 'or', 'even'), ('or', 'even', 'hundred-grams')]
>>> fourgrams = ngrams(tokenize, 4)
>>> fourgrams
[('I', 'am', 'aware', 'that'), ('am', 'aware', 'that', 'nltk'), ('aware', 'that', 'nltk', 'only'), ('that', 'nltk', 'only', 'offers'), ('nltk', 'only', 'offers', 'bigrams'), ('only', 'offers', 'bigrams', 'and'), ('offers', 'bigrams', 'and', 'trigrams'), ('bigrams', 'and', 'trigrams', ','), ('and', 'trigrams', ',', 'but'), ('trigrams', ',', 'but', 'is'), (',', 'but', 'is', 'there'), ('but', 'is', 'there', 'a'), ('is', 'there', 'a', 'way'), ('there', 'a', 'way', 'to'), ('a', 'way', 'to', 'split'), ('way', 'to', 'split', 'my'), ('to', 'split', 'my', 'text'), ('split', 'my', 'text', 'in'), ('my', 'text', 'in', 'four-grams'), ('text', 'in', 'four-grams', ','), ('in', 'four-grams', ',', 'five-grams'), ('four-grams', ',', 'five-grams', 'or'), (',', 'five-grams', 'or', 'even'), ('five-grams', 'or', 'even', 'hundred-grams')]