WordNet lemmatization and POS tagging in Python


Steps to convert: Document -> Sentences -> Tokens -> POS -> Lemmas

    import nltk
    from nltk.stem import WordNetLemmatizer
    from nltk.corpus import wordnet

    # example text
    text = ("What can I say about this place. The staff of these restaurants "
            "is nice and the eggplant is not bad")


    class Splitter(object):
        """Split the document into sentences and tokenize each sentence."""

        def __init__(self):
            self.splitter = nltk.data.load('tokenizers/punkt/english.pickle')
            self.tokenizer = nltk.tokenize.TreebankWordTokenizer()

        def split(self, text):
            """
            out (one list per sentence):
            [['What', 'can', 'I', 'say', 'about', 'this', 'place', '.'], ...]
            """
            # split the document into single sentences
            sentences = self.splitter.tokenize(text)
            # tokenize each sentence
            tokens = [self.tokenizer.tokenize(sent) for sent in sentences]
            return tokens


    class LemmatizationWithPOSTagger(object):
        def __init__(self):
            pass

        def get_wordnet_pos(self, treebank_tag):
            """Return the WordNet POS (a, n, r, v) expected by the WordNet lemmatizer."""
            if treebank_tag.startswith('J'):
                return wordnet.ADJ
            elif treebank_tag.startswith('V'):
                return wordnet.VERB
            elif treebank_tag.startswith('N'):
                return wordnet.NOUN
            elif treebank_tag.startswith('R'):
                return wordnet.ADV
            else:
                # the default POS in lemmatization is noun
                return wordnet.NOUN

        def pos_tag(self, tokens):
            # POS-tag each sentence: [('What', 'WP'), ('can', 'MD'), ('I', 'PRP'), ...]
            pos_tokens = [nltk.pos_tag(token) for token in tokens]

            # lemmatize using the POS tag and build
            # [('What', 'What', ['WP']), ('can', 'can', ['MD']), ...]
            # i.e. (original word, lemmatized word, [POS tag]);
            # relies on the module-level `lemmatizer` created below
            pos_tokens = [
                [(word, lemmatizer.lemmatize(word, self.get_wordnet_pos(pos_tag)), [pos_tag])
                 for (word, pos_tag) in pos]
                for pos in pos_tokens
            ]
            return pos_tokens


    lemmatizer = WordNetLemmatizer()
    splitter = Splitter()
    lemmatization_using_pos_tagger = LemmatizationWithPOSTagger()

    # step 1: split the document into sentences, then tokenize
    tokens = splitter.split(text)

    # step 2: lemmatize using the POS tagger
    lemma_pos_token = lemmatization_using_pos_tagger.pos_tag(tokens)

    print(lemma_pos_token)

I wanted to use the WordNet lemmatizer in Python, and I have learned that the default POS tag is NOUN and that it does not produce the correct lemma for a verb unless the POS tag is explicitly specified as VERB.

My question is: what is the best approach to perform the above lemmatization accurately?

I did the POS tagging using nltk.pos_tag, and I am lost on how to map the Treebank POS tags to WordNet-compatible POS tags. Please help.

    import nltk
    from nltk.stem.wordnet import WordNetLemmatizer

    lmtzr = WordNetLemmatizer()
    tagged = nltk.pos_tag(tokens)  # tokens: a list of word strings

I get output tags such as NN, JJ, VB, RB. How can I convert these into WordNet-compatible tags?

Also, do I have to train nltk.pos_tag() with a tagged corpus, or can I use it directly on my data to evaluate?

@Suzana_K's answer was working for me, but some cases result in a KeyError, as @Clock Slave mentioned.

Convert Treebank tags to WordNet tags

    from nltk.corpus import wordnet

    def get_wordnet_pos(treebank_tag):
        if treebank_tag.startswith('J'):
            return wordnet.ADJ
        elif treebank_tag.startswith('V'):
            return wordnet.VERB
        elif treebank_tag.startswith('N'):
            return wordnet.NOUN
        elif treebank_tag.startswith('R'):
            return wordnet.ADV
        else:
            return None  # for an easy if-statement

Now we pass pos to the lemmatize function only if we have a WordNet tag:

    import nltk
    from nltk.stem.wordnet import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    tagged = nltk.pos_tag(tokens)  # tokens: a list of word strings

    for word, tag in tagged:
        wntag = get_wordnet_pos(tag)
        if wntag is None:  # do not supply a tag in case of None
            lemma = lemmatizer.lemmatize(word)
        else:
            lemma = lemmatizer.lemmatize(word, pos=wntag)

As defined in the source code of nltk.corpus.reader.wordnet ( http://www.nltk.org/_modules/nltk/corpus/reader/wordnet.html ):

    #{ Part-of-speech constants
    ADJ, ADJ_SAT, ADV, NOUN, VERB = 'a', 's', 'r', 'n', 'v'
    #}

    POS_LIST = [NOUN, VERB, ADJ, ADV]

First of all, you can use nltk.pos_tag() directly without training it. The function loads a pretrained tagger from a file. In older NLTK releases you can see the file name with nltk.tag._POS_TAGGER:

    nltk.tag._POS_TAGGER
    # output reported in the answer:
    # 'taggers/maxent_treebank_pos_tagger/english.pickle'

Since it was trained on the Treebank corpus, it also uses the Treebank tag set.

The following function would map the Treebank tags to WordNet part-of-speech names:
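A minimal sketch of such a function, reconstructed from the remarks around it rather than copied from the answer. The empty-string fallback is inferred from the KeyError warning further down, and the constants are written out as plain strings (the same values as wordnet.ADJ, wordnet.VERB, wordnet.NOUN and wordnet.ADV) so the snippet runs without the WordNet corpus installed:

```python
# WordNet POS constants, with the values listed in nltk.corpus.reader.wordnet.
ADJ, ADV, NOUN, VERB = "a", "r", "n", "v"


def get_wordnet_pos(treebank_tag):
    """Map a Penn Treebank tag (e.g. 'VBG') to a WordNet POS constant."""
    if treebank_tag.startswith("J"):
        return ADJ
    elif treebank_tag.startswith("V"):
        return VERB
    elif treebank_tag.startswith("N"):
        return NOUN
    elif treebank_tag.startswith("R"):
        return ADV
    else:
        # Assumption: tags with no WordNet counterpart map to an empty string.
        return ""


print(get_wordnet_pos("VBG"))
```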

You can then use the return value with the lemmatizer:

    from nltk.corpus import wordnet
    from nltk.stem.wordnet import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    lemmatizer.lemmatize('going', wordnet.VERB)
    # 'go'

Check the return value before passing it to the lemmatizer, because an empty string would raise a KeyError.
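One way to write that check is sketched below. The helper name lemmatize_checked is hypothetical, and toy_lemmatize is only a stand-in for WordNetLemmatizer().lemmatize so the snippet runs without NLTK data. The stand-in imitates the KeyError described above but is not NLTK's actual code:

```python
def toy_lemmatize(word, pos="n"):
    """Hypothetical stand-in for WordNetLemmatizer().lemmatize."""
    if pos not in ("a", "s", "r", "n", "v"):
        raise KeyError(pos)  # imitates the failure on an empty POS string
    table = {("going", "v"): "go", ("dogs", "n"): "dog"}
    return table.get((word, pos), word)


def lemmatize_checked(lemmatize, word, wn_pos):
    """Pass `wn_pos` along only when it is non-empty ('' and None are falsy)."""
    if wn_pos:
        return lemmatize(word, wn_pos)
    return lemmatize(word)  # fall back to the lemmatizer's default (noun)


print(lemmatize_checked(toy_lemmatize, "going", "v"))
print(lemmatize_checked(toy_lemmatize, "dogs", ""))
```

With a real lemmatizer, the first argument would be lemmatizer.lemmatize.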

You can create a map using Python's defaultdict and take advantage of the fact that the lemmatizer's default tag is noun.

    from collections import defaultdict

    from nltk import word_tokenize, pos_tag
    from nltk.corpus import wordnet as wn
    from nltk.stem.wordnet import WordNetLemmatizer

    # any tag whose first letter is not J, V or R falls back to noun
    tag_map = defaultdict(lambda: wn.NOUN)
    tag_map['J'] = wn.ADJ
    tag_map['V'] = wn.VERB
    tag_map['R'] = wn.ADV

    text = "Another way of achieving this task"
    tokens = word_tokenize(text)
    lmtzr = WordNetLemmatizer()

    for token, tag in pos_tag(tokens):
        lemma = lmtzr.lemmatize(token, tag_map[tag[0]])
        print(token, "=>", lemma)
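A small design note on the defaultdict approach: looking up a missing key stores a new entry, so the map grows by one key for every unseen first letter. That is harmless here, but a plain dict with .get() gives the same noun fallback without the side effect. A sketch, where TAG_MAP and to_wordnet_pos are hypothetical names and the POS values are the plain strings behind wn.ADJ, wn.VERB, wn.ADV and wn.NOUN:

```python
from collections import defaultdict

# defaultdict inserts the default value on every lookup of a missing key.
dd = defaultdict(lambda: "n", {"J": "a"})
dd["P"]          # lookup of an unseen key...
print(len(dd))   # ...leaves a second entry behind

# Plain dict + .get(): same fallback, and the mapping never changes.
TAG_MAP = {"J": "a", "V": "v", "R": "r"}


def to_wordnet_pos(treebank_tag):
    """Map a Treebank tag to a WordNet POS, defaulting to noun."""
    return TAG_MAP.get(treebank_tag[:1], "n")


print(to_wordnet_pos("JJR"), to_wordnet_pos("PRP"))
```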

You can do this in one line:

    wnpos = lambda e: ('a' if e[0].lower() == 'j' else e[0].lower()) if e[0].lower() in ['j', 'n', 'r', 'v'] else 'n'

Then use wnpos(nltk_pos) to get the POS to pass to .lemmatize(). In your case, lmtzr.lemmatize(word=tagged[0][0], pos=wnpos(tagged[0][1])).
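One thing to watch in a one-liner like this: the membership test has to include 'j', otherwise the adjective branch can never fire and JJ tags silently fall back to 'n'. A spelled-out equivalent makes that easier to verify. The name wnpos_checked is hypothetical, and the sketch assumes non-empty Treebank tags:

```python
def wnpos_checked(treebank_tag):
    """Return the WordNet POS letter for a non-empty Treebank tag."""
    first = treebank_tag[0].lower()
    if first == "j":
        return "a"  # Treebank adjectives start with J; WordNet uses 'a'
    # n, r and v already match WordNet's letters; everything else -> noun
    return first if first in ("n", "r", "v") else "n"


for tag in ("JJ", "VBD", "NNS", "RB", "MD"):
    print(tag, "->", wnpos_checked(tag))
```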