Counting n-gram frequency in Python NLTK


The finder.ngram_fd.viewitems() function works (on Python 3, where dictionaries have no viewitems() method, use finder.ngram_fd.items() instead).
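As a minimal sketch of that suggestion, the snippet below builds a finder from a small in-memory token list and looks at the raw bigram counts before any filter is applied. The sample sentence and the threshold of 2 are assumptions made for illustration; it targets Python 3 with NLTK 3, so it uses items() and most_common() rather than viewitems().

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Hypothetical sample text; a real document would be read from a file.
tokens = "the cat sat on the mat and the cat ran to the mat".split()

finder = BigramCollocationFinder.from_words(tokens)

# ngram_fd is a FreqDist keyed by bigram tuples. Copy the counts before
# filtering, because apply_freq_filter() removes entries from ngram_fd.
bigram_counts = dict(finder.ngram_fd.items())
for bigram, count in finder.ngram_fd.most_common(5):
    print(bigram, count)

# Choose a threshold after inspecting the counts (2 is only an example),
# then filter and rank the remaining bigrams by PMI.
finder.apply_freq_filter(2)
top = finder.nbest(BigramAssocMeasures().pmi, 10)
print(top)
```

Copying the counts into a plain dict first keeps the unfiltered frequencies available after the filter has been applied.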

I have the following code. I know I can use the apply_freq_filter function to filter out collocations that occur fewer times than a given frequency count. However, I don't know how to get the frequencies of all the n-gram tuples (bigrams, in my case) in a document before deciding what frequency to set for the filter. As you can see, I am using the nltk collocations class.

    import nltk
    from nltk.collocations import *

    line = ""
    open_file = open('a_text_file', 'r')
    for val in open_file:
        line += val
    tokens = line.split()

    bigram_measures = nltk.collocations.BigramAssocMeasures()
    finder = BigramCollocationFinder.from_words(tokens)
    finder.apply_freq_filter(3)
    print(finder.nbest(bigram_measures.pmi, 100))

NLTK comes with its own bigrams generator, as well as a convenient FreqDist() class.

    import nltk

    f = open('a_text_file')
    raw = f.read()
    tokens = nltk.word_tokenize(raw)

    # Create your bigrams
    bgs = nltk.bigrams(tokens)

    # Compute the frequency distribution for all the bigrams in the text
    fdist = nltk.FreqDist(bgs)
    for k, v in fdist.items():
        print(k, v)

Once you have access to the bigrams and their frequency distribution, you can filter them according to your needs.

Hope that helps.
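The filtering step described in this answer could look roughly like the sketch below. To stay free of external dependencies it uses collections.Counter from the standard library in place of nltk.FreqDist (in NLTK 3, FreqDist is a Counter subclass, so the same calls should apply). The helper names bigram_counts and filter_by_min_count and the sample sentence are hypothetical, introduced only for illustration.

```python
from collections import Counter

def bigram_counts(tokens):
    """Count adjacent token pairs (bigrams) in a token list."""
    return Counter(zip(tokens, tokens[1:]))

def filter_by_min_count(counts, min_count):
    """Keep only the n-grams that occur at least min_count times."""
    return {ngram: c for ngram, c in counts.items() if c >= min_count}

# Hypothetical sample text standing in for a real document.
tokens = "to be or not to be that is the question".split()
counts = bigram_counts(tokens)

# How many distinct bigrams occur once, twice, and so on. Looking at this
# summary first can help when choosing a cutoff.
count_histogram = Counter(counts.values())
print(sorted(count_histogram.items()))

frequent = filter_by_min_count(counts, 2)
print(frequent)
```

Summarizing the counts before filtering addresses the original question: the cutoff is chosen from the observed distribution rather than guessed in advance.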

    from nltk import FreqDist
    from nltk.util import ngrams

    def compute_freq():
        bigramfdist = FreqDist()
        # Read the corpus line by line so the whole file is never held in memory
        with open('corpus.txt', 'r') as textfile:
            for line in textfile:
                if len(line) > 1:  # skip blank lines
                    tokens = line.strip().split(' ')
                    bigrams = ngrams(tokens, 2)
                    bigramfdist.update(bigrams)
        # Return the distribution so the caller can inspect or filter it
        return bigramfdist

    compute_freq()