
Python NLTK pos_tag not returning the correct part-of-speech tag

Given this:

import nltk
from nltk import word_tokenize
text = word_tokenize("The quick brown fox jumps over the lazy dog")

And running:

nltk.pos_tag(text)

I get:

[('The', 'DT'), ('quick', 'NN'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'NNS'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'NN'), ('dog', 'NN')]

This is incorrect. The tags for quick, brown, and lazy in the sentence should be:

('quick', 'JJ'), ('brown', 'JJ'), ('lazy', 'JJ')

Testing this through their online tool gives the same result; quick, brown, and lazy should be adjectives, not nouns.
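
To make the complaint concrete, the reported output can be compared against the expected tags programmatically. This is a minimal sketch in plain Python; `tag_mismatches` is a hypothetical helper name, not part of NLTK, and the data is copied from the output and expectations quoted above.

```python
def tag_mismatches(predicted, expected):
    """Return (token, predicted_tag, expected_tag) for each token whose tag differs."""
    expected_by_token = dict(expected)
    return [
        (token, tag, expected_by_token[token])
        for token, tag in predicted
        if token in expected_by_token and tag != expected_by_token[token]
    ]


# Output reported in the question.
predicted = [('The', 'DT'), ('quick', 'NN'), ('brown', 'NN'), ('fox', 'NN'),
             ('jumps', 'NNS'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'NN'),
             ('dog', 'NN')]

# Tags the question expects for the three adjectives.
expected = [('quick', 'JJ'), ('brown', 'JJ'), ('lazy', 'JJ')]

for token, got, want in tag_mismatches(predicted, expected):
    print("%s: got %s, expected %s" % (token, got, want))
```

Tokens without an entry in `expected` are skipped, so the helper only reports disagreements on words the question actually makes a claim about.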

In short:

NLTK is not perfect. In fact, no model is perfect.

Note:

As of NLTK version 3.1, the default pos_tag function no longer uses the old MaxEnt English pickle.

It is now the perceptron tagger from @Honnibal's implementation; see nltk.tag.pos_tag:

>>> import inspect
>>> from nltk import pos_tag
>>> print(inspect.getsource(pos_tag))
def pos_tag(tokens, tagset=None):
    tagger = PerceptronTagger()
    return _pos_tag(tokens, tagset, tagger)

It is better, but still not perfect:

>>> from nltk import pos_tag
>>> pos_tag("The quick brown fox jumps over the lazy dog".split())
[('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]

If you just want a TL;DR solution, see https://github.com/alvations/nltk_cli
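
Because the default tagger changed at version 3.1, the same pos_tag call can return different tags depending on which NLTK is installed. The following is a rough sketch of a version check; `default_is_perceptron` is a hypothetical helper name, and it assumes the version string starts with a `major.minor` pair.

```python
import re


def default_is_perceptron(version):
    """Return True if the given NLTK version string is 3.1 or later."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError("unrecognized version string: %r" % version)
    major, minor = int(match.group(1)), int(match.group(2))
    # Tuple comparison orders by major first, then minor.
    return (major, minor) >= (3, 1)


print(default_is_perceptron("3.0.5"))  # False
print(default_is_perceptron("3.1"))    # True
```

In a real session the argument would be `nltk.__version__`.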

In long:

Try using another tagger (see https://github.com/nltk/nltk/tree/develop/nltk/tag), for example:

HunPos
Stanford POS
Senna

Using NLTK's old default MaxEnt POS tagger, i.e. nltk.pos_tag before version 3.1:

>>> from nltk import word_tokenize, pos_tag
>>> text = "The quick brown fox jumps over the lazy dog"
>>> pos_tag(word_tokenize(text))
[('The', 'DT'), ('quick', 'NN'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'NNS'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'NN'), ('dog', 'NN')]

Using the Stanford POS tagger:

$ cd ~
$ wget http://nlp.stanford.edu/software/stanford-postagger-2015-04-20.zip
$ unzip stanford-postagger-2015-04-20.zip
$ mv stanford-postagger-2015-04-20 stanford-postagger
$ python
>>> from os.path import expanduser
>>> home = expanduser("~")
>>> from nltk.tag.stanford import POSTagger
>>> _path_to_model = home + '/stanford-postagger/models/english-bidirectional-distsim.tagger'
>>> _path_to_jar = home + '/stanford-postagger/stanford-postagger.jar'
>>> st = POSTagger(path_to_model=_path_to_model, path_to_jar=_path_to_jar)
>>> text = "The quick brown fox jumps over the lazy dog"
>>> st.tag(text.split())
[(u'The', u'DT'), (u'quick', u'JJ'), (u'brown', u'JJ'), (u'fox', u'NN'), (u'jumps', u'VBZ'), (u'over', u'IN'), (u'the', u'DT'), (u'lazy', u'JJ'), (u'dog', u'NN')]

Using HunPOS (NOTE: the default encoding is ISO-8859-1, not UTF-8):

$ cd ~
$ wget https://hunpos.googlecode.com/files/hunpos-1.0-linux.tgz
$ tar zxvf hunpos-1.0-linux.tgz
$ wget https://hunpos.googlecode.com/files/en_wsj.model.gz
$ gzip -d en_wsj.model.gz
$ mv en_wsj.model hunpos-1.0-linux/
$ python
>>> from os.path import expanduser
>>> home = expanduser("~")
>>> from nltk.tag.hunpos import HunposTagger
>>> _path_to_bin = home + '/hunpos-1.0-linux/hunpos-tag'
>>> _path_to_model = home + '/hunpos-1.0-linux/en_wsj.model'
>>> ht = HunposTagger(path_to_model=_path_to_model, path_to_bin=_path_to_bin)
>>> text = "The quick brown fox jumps over the lazy dog"
>>> ht.tag(text.split())
[('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'), ('jumps', 'NNS'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]
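
The encoding note only matters once the input contains non-ASCII characters. As a minimal sketch, assuming Python 3 strings, the following flags tokens that cannot be represented in ISO-8859-1 before they are handed to the tagger; `not_latin1` is a hypothetical helper name and is not part of NLTK or HunPos.

```python
def not_latin1(tokens):
    """Return the tokens that cannot be encoded as ISO-8859-1."""
    bad = []
    for token in tokens:
        try:
            token.encode("iso-8859-1")
        except UnicodeEncodeError:
            bad.append(token)
    return bad


# Accented Western European letters fit in ISO-8859-1.
print(not_latin1(["The", "café", "naïve", "dog"]))  # []

# Characters such as "Ł" fall outside ISO-8859-1.
print(not_latin1(["Łódź", "fox"]))  # ['Łódź']
```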

Using Senna (make sure you have the latest version of NLTK; some changes were made to the API):

$ cd ~
$ wget http://ronan.collobert.com/senna/senna-v3.0.tgz
$ tar zxvf senna-v3.0.tgz
$ python
>>> from os.path import expanduser
>>> home = expanduser("~")
>>> from nltk.tag.senna import SennaTagger
>>> st = SennaTagger(home + '/senna')
>>> text = "The quick brown fox jumps over the lazy dog"
>>> st.tag(text.split())
[('The', u'DT'), ('quick', u'JJ'), ('brown', u'JJ'), ('fox', u'NN'), ('jumps', u'VBZ'), ('over', u'IN'), ('the', u'DT'), ('lazy', u'JJ'), ('dog', u'NN')]
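
The outputs recorded in this answer can be put side by side with a few lines of plain Python. This is only a sketch: the tag sequences are copied from the sessions in this thread, while the reference sequence is an assumption (the question only states that quick, brown, and lazy should be JJ; the remaining reference tags are taken from the Stanford and Senna outputs).

```python
# Tags for: The quick brown fox jumps over the lazy dog
# Assumed reference: adjectives as JJ per the question, the rest as Stanford/Senna report.
REFERENCE = "DT JJ JJ NN VBZ IN DT JJ NN".split()

# Tag sequences copied from the outputs shown in this thread.
OUTPUTS = {
    "MaxEnt (pre-3.1 default)": "DT NN NN NN NNS IN DT NN NN".split(),
    "Perceptron (3.1+ default)": "DT JJ NN NN VBZ IN DT JJ NN".split(),
    "Stanford POS": "DT JJ JJ NN VBZ IN DT JJ NN".split(),
    "HunPos": "DT JJ JJ NN NNS IN DT JJ NN".split(),
    "Senna": "DT JJ JJ NN VBZ IN DT JJ NN".split(),
}


def count_correct(tags, reference):
    """Count positions where the tagger output agrees with the reference."""
    return sum(1 for got, want in zip(tags, reference) if got == want)


SCORES = {name: count_correct(tags, REFERENCE) for name, tags in OUTPUTS.items()}

for name, score in sorted(SCORES.items(), key=lambda item: -item[1]):
    print("%-26s %d/%d" % (name, score, len(REFERENCE)))
```

On this one sentence, the Stanford and Senna outputs match all nine reference tags, the perceptron and HunPos outputs match eight, and the old MaxEnt output matches five. A single sentence is not a benchmark, so this says nothing about accuracy on real text.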

Or try building a better POS tagger:

Ngram Tagger: http://streamhacker.com/2008/11/03/part-of-speech-tagging-with-nltk-part-1/
Affix / Regex Tagger: http://streamhacker.com/2008/11/10/part-of-speech-tagging-with-nltk-part-2/
Build your own Brill tagger (read the code, it's a pretty fun tagger: http://www.nltk.org/_modules/nltk/tag/brill.html), see http://streamhacker.com/2008/12/03/part-of-speech-tagging-with-nltk-part-3/
Perceptron Tagger: https://honnibal.wordpress.com/2013/09/11/a-good-part-of-speechpos-tagger-in-about-200-lines-of-python/
LDA Tagger: http://scm.io/blog/hack/2015/02/lda-intentions/
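
To give a feel for the simplest of these approaches, here is a toy sketch of a lookup tagger that backs off to suffix rules and then to a default tag. It is plain Python rather than NLTK's own tagger classes, and the lexicon, the suffix rules, and the `backoff_tag` name are all made up for illustration.

```python
import re

# Hypothetical toy lexicon: known words and their tags.
LEXICON = {"the": "DT", "over": "IN", "quick": "JJ", "brown": "JJ", "lazy": "JJ"}

# Hypothetical suffix rules, tried in order when a word is not in the lexicon.
SUFFIX_RULES = [
    (re.compile(r".*ing$"), "VBG"),
    (re.compile(r".*ed$"), "VBD"),
    (re.compile(r".*ly$"), "RB"),
    (re.compile(r".*s$"), "NNS"),
]


def backoff_tag(tokens, default="NN"):
    """Tag by lexicon lookup, then by suffix rule, then with a default tag."""
    tagged = []
    for token in tokens:
        tag = LEXICON.get(token.lower())
        if tag is None:
            for pattern, candidate in SUFFIX_RULES:
                if pattern.match(token):
                    tag = candidate
                    break
        tagged.append((token, tag or default))
    return tagged


print(backoff_tag("The quick brown fox jumps over the lazy dog".split()))
# [('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'), ('jumps', 'NNS'),
#  ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]
```

Note that jumps still comes out as NNS: a suffix rule cannot tell a plural noun from a third-person verb without looking at context, which is the gap the Brill and perceptron taggers linked above try to close.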

Complaints about pos_tag accuracy include:

Issues with NLTK HunPos include:

Issues with NLTK and the Stanford POS Tagger include: