Simple implementation of N-Gram, tf-idf and Cosine similarity in Python

I need to compare documents stored in a DB and come up with a similarity score between 0 and 1.

The method I use has to be very simple: a standard implementation of n-grams (where it is possible to define how many grams to use), along with simple implementations of tf-idf and Cosine similarity.

Is there any program that can do this? Or should I start writing this from scratch?
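If an off-the-shelf route is acceptable, scikit-learn bundles all three pieces. A minimal sketch (not from the original answers), assuming the documents have already been fetched from the DB as plain strings:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = ["this is a foo bar",
            "foo bar bar black sheep",
            "this is a sentence"]

    # ngram_range=(1, 2) builds unigram and bigram features; adjust as needed.
    vectorizer = TfidfVectorizer(ngram_range=(1, 2))
    tfidf_matrix = vectorizer.fit_transform(docs)

    # Pairwise cosine similarities; each value is in [0, 1] because
    # tf-idf vectors are non-negative.
    print(cosine_similarity(tfidf_matrix))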


Here is an answer with just python + numpy. In short:

Cosine:

def cosine_sim(u, v):
    return np.dot(u, v) / (sqrt(np.dot(u, u)) * sqrt(np.dot(v, v)))
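As a quick sanity check (assuming numpy as np and math.sqrt are imported as in the full listing below), applied to two of the count vectors used in the tests further down:

    print(cosine_sim([1, 1, 0, 1, 1, 0, 0, 1], [1, 0, 0, 0, 1, 1, 0, 1]))
    # -> 0.67082039325  ("this is a foo bar" vs "this is a sentence")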

Ngrams:

def ngrams(sentence, n):
    return zip(*[sentence.split()[i:] for i in range(n)])
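For example, applied to the first test sentence (the same bigrams appear in the test output further down):

    print(list(ngrams("this is a foo bar", 2)))
    # -> [('this', 'is'), ('is', 'a'), ('a', 'foo'), ('foo', 'bar')]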

TF-IDF (it is a little unusual, in that the idf here is plain N/df without the usual log, but it works):

def tfidf(corpus, vocab):
    """
    INPUT:
    corpus = [('this is a foo bar', [1, 1, 0, 1, 1, 0, 0, 1]),
              ('foo bar bar black sheep', [0, 2, 1, 1, 0, 0, 1, 0]),
              ('this is a sentence', [1, 0, 0, 0, 1, 1, 0, 1])]
    vocab = ['a', 'bar', 'black', 'foo', 'is', 'sentence', 'sheep', 'this']
    OUTPUT:
    [[0.300, 0.300, 0.0, 0.300, 0.300, 0.0, 0.0, 0.300],
     [0.0, 0.600, 0.600, 0.300, 0.0, 0.0, 0.600, 0.0],
     [0.375, 0.0, 0.0, 0.0, 0.375, 0.75, 0.0, 0.375]]
    """
    def termfreq(matrix, doc, term):
        try:
            return matrix[doc][term] / float(sum(matrix[doc].values()))
        except ZeroDivisionError:
            return 0
    def inversedocfreq(matrix, term):
        try:
            return float(len(matrix)) / sum([1 for i, _ in enumerate(matrix) if matrix[i][term] > 0])
        except ZeroDivisionError:
            return 0
    matrix = [{k: v for k, v in zip(vocab, i[1])} for i in corpus]
    tfidf = defaultdict(dict)
    for doc, _ in enumerate(matrix):
        for term in matrix[doc]:
            tf = termfreq(matrix, doc, term)
            idf = inversedocfreq(matrix, term)
            tfidf[doc][term] = tf * idf
    return [[tfidf[doc][term] for term in vocab] for doc, _ in enumerate(tfidf)]

Here is the long answer with the tests (note: the listing is Python 2):

import numpy as np
from math import sqrt, log
from itertools import chain, product
from collections import defaultdict

def cosine_sim(u, v):
    return np.dot(u, v) / (sqrt(np.dot(u, u)) * sqrt(np.dot(v, v)))

def ngrams(sentence, n):
    return zip(*[sentence.split()[i:] for i in range(n)])

def tfidf(corpus, vocab):
    """
    INPUT:
    corpus = [('this is a foo bar', [1, 1, 0, 1, 1, 0, 0, 1]),
              ('foo bar bar black sheep', [0, 2, 1, 1, 0, 0, 1, 0]),
              ('this is a sentence', [1, 0, 0, 0, 1, 1, 0, 1])]
    vocab = ['a', 'bar', 'black', 'foo', 'is', 'sentence', 'sheep', 'this']
    OUTPUT:
    [[0.300, 0.300, 0.0, 0.300, 0.300, 0.0, 0.0, 0.300],
     [0.0, 0.600, 0.600, 0.300, 0.0, 0.0, 0.600, 0.0],
     [0.375, 0.0, 0.0, 0.0, 0.375, 0.75, 0.0, 0.375]]
    """
    def termfreq(matrix, doc, term):
        try:
            return matrix[doc][term] / float(sum(matrix[doc].values()))
        except ZeroDivisionError:
            return 0
    def inversedocfreq(matrix, term):
        try:
            return float(len(matrix)) / sum([1 for i, _ in enumerate(matrix) if matrix[i][term] > 0])
        except ZeroDivisionError:
            return 0
    matrix = [{k: v for k, v in zip(vocab, i[1])} for i in corpus]
    tfidf = defaultdict(dict)
    for doc, _ in enumerate(matrix):
        for term in matrix[doc]:
            tf = termfreq(matrix, doc, term)
            idf = inversedocfreq(matrix, term)
            tfidf[doc][term] = tf * idf
    return [[tfidf[doc][term] for term in vocab] for doc, _ in enumerate(tfidf)]

def corpus2vectors(corpus):
    def vectorize(sentence, vocab):
        return [sentence.split().count(i) for i in vocab]
    vectorized_corpus = []
    vocab = sorted(set(chain(*[i.lower().split() for i in corpus])))
    for i in corpus:
        vectorized_corpus.append((i, vectorize(i, vocab)))
    return vectorized_corpus, vocab

def create_test_corpus():
    sent1 = "this is a foo bar"
    sent2 = "foo bar bar black sheep"
    sent3 = "this is a sentence"
    all_sents = [sent1, sent2, sent3]
    corpus, vocab = corpus2vectors(all_sents)
    return corpus, vocab

def test_cosine():
    corpus, vocab = create_test_corpus()
    for sentx, senty in product(corpus, corpus):
        print sentx[0]
        print senty[0]
        print "cosine =", cosine_sim(sentx[1], senty[1])
        print

def test_ngrams():
    corpus, vocab = create_test_corpus()
    for sentx in corpus:
        print sentx[0]
        print ngrams(sentx[0], 2)
        print ngrams(sentx[0], 3)
        print

def test_tfidf():
    corpus, vocab = create_test_corpus()
    print corpus
    print vocab
    print tfidf(corpus, vocab)

print "Testing cosine..."
test_cosine()
print
print "Testing ngrams..."
test_ngrams()
print
print "Testing tfidf..."
test_tfidf()
print

[out]:

Testing cosine...
this is a foo bar
this is a foo bar
cosine = 1.0

this is a foo bar
foo bar bar black sheep
cosine = 0.507092552837

this is a foo bar
this is a sentence
cosine = 0.67082039325

foo bar bar black sheep
this is a foo bar
cosine = 0.507092552837

foo bar bar black sheep
foo bar bar black sheep
cosine = 1.0

foo bar bar black sheep
this is a sentence
cosine = 0.0

this is a sentence
this is a foo bar
cosine = 0.67082039325

this is a sentence
foo bar bar black sheep
cosine = 0.0

this is a sentence
this is a sentence
cosine = 1.0

Testing ngrams...
this is a foo bar
[('this', 'is'), ('is', 'a'), ('a', 'foo'), ('foo', 'bar')]
[('this', 'is', 'a'), ('is', 'a', 'foo'), ('a', 'foo', 'bar')]

foo bar bar black sheep
[('foo', 'bar'), ('bar', 'bar'), ('bar', 'black'), ('black', 'sheep')]
[('foo', 'bar', 'bar'), ('bar', 'bar', 'black'), ('bar', 'black', 'sheep')]

this is a sentence
[('this', 'is'), ('is', 'a'), ('a', 'sentence')]
[('this', 'is', 'a'), ('is', 'a', 'sentence')]

Testing tfidf...
[('this is a foo bar', [1, 1, 0, 1, 1, 0, 0, 1]), ('foo bar bar black sheep', [0, 2, 1, 1, 0, 0, 1, 0]), ('this is a sentence', [1, 0, 0, 0, 1, 1, 0, 1])]
['a', 'bar', 'black', 'foo', 'is', 'sentence', 'sheep', 'this']
[[0.30000000000000004, 0.30000000000000004, 0.0, 0.30000000000000004, 0.30000000000000004, 0.0, 0.0, 0.30000000000000004], [0.0, 0.6000000000000001, 0.6000000000000001, 0.30000000000000004, 0.0, 0.0, 0.6000000000000001, 0.0], [0.375, 0.0, 0.0, 0.0, 0.375, 0.75, 0.0, 0.375]]


Check out the NLTK package: http://www.nltk.org. It has everything you need.

For cosine similarity:

import math
import numpy

def cosine_distance(u, v):
    """
    Returns the cosine of the angle between vectors v and u. This is
    equal to u.v / |u||v|.
    """
    return numpy.dot(u, v) / (math.sqrt(numpy.dot(u, u)) * math.sqrt(numpy.dot(v, v)))

For ngrams:

from itertools import chain

def ngrams(sequence, n, pad_left=False, pad_right=False, pad_symbol=None):
    """
    A utility that produces a sequence of ngrams from a sequence of items.
    For example:

    >>> ngrams([1,2,3,4,5], 3)
    [(1, 2, 3), (2, 3, 4), (3, 4, 5)]

    Use ingram for an iterator version of this function.

    Set pad_left or pad_right to true in order to get additional ngrams:

    >>> ngrams([1,2,3,4,5], 2, pad_right=True)
    [(1, 2), (2, 3), (3, 4), (4, 5), (5, None)]

    @param sequence: the source data to be converted into ngrams
    @type sequence: C{sequence} or C{iterator}
    @param n: the degree of the ngrams
    @type n: C{int}
    @param pad_left: whether the ngrams should be left-padded
    @type pad_left: C{boolean}
    @param pad_right: whether the ngrams should be right-padded
    @type pad_right: C{boolean}
    @param pad_symbol: the symbol to use for padding (default is None)
    @type pad_symbol: C{any}
    @return: The ngrams
    @rtype: C{list} of C{tuple}s
    """
    if pad_left:
        sequence = chain((pad_symbol,) * (n-1), sequence)
    if pad_right:
        sequence = chain(sequence, (pad_symbol,) * (n-1))
    sequence = list(sequence)
    count = max(0, len(sequence) - n + 1)
    return [tuple(sequence[i:i+n]) for i in range(count)]

For tf-idf you will first have to compute the term distribution. I am using Lucene to do that, but you could very well do something similar with NLTK; use FreqDist:

http://nltk.googlecode.com/svn/trunk/doc/book/ch01.html#frequency_distribution_index_term
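As a rough sketch of that route (my assumption, not the original author's code): count term frequencies per document with FreqDist, then combine them with a standard log idf. Note the log differs from the no-log idf used in the numpy answer above.

    from math import log
    from nltk import FreqDist

    docs = [s.split() for s in ["this is a foo bar",
                                "foo bar bar black sheep",
                                "this is a sentence"]]
    fdists = [FreqDist(doc) for doc in docs]

    def tf(term, fdist):
        # relative frequency of the term within one document
        return fdist[term] / float(fdist.N())

    def idf(term, docs):
        # standard log(N / df); df = number of documents containing the term
        df = sum(1 for doc in docs if term in doc)
        return log(float(len(docs)) / df) if df else 0.0

    for doc, fdist in zip(docs, fdists):
        print([(t, tf(t, fdist) * idf(t, docs)) for t in fdist])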

If you like pylucene, this will tell you how to compute tf.idf:

# reader = lucene.IndexReader(FSDirectory.open(index_loc))
docs = reader.numDocs()
for i in xrange(docs):
    tfv = reader.getTermFreqVector(i, fieldname)
    if tfv:
        rec = {}
        terms = tfv.getTerms()
        frequencies = tfv.getTermFrequencies()
        for (t, f, x) in zip(terms, frequencies, xrange(maxtokensperdoc)):
            df = searcher.docFreq(Term(fieldname, t))  # number of docs with the given term
            tmap.setdefault(t, len(tmap))
            rec[t] = sim.tf(f) * sim.idf(df, max_doc)  # compute TF.IDF
        # and normalize the values using cosine normalization
        if cosine_normalization:
            denom = sum([x**2 for x in rec.values()])**0.5
            for k, v in rec.items():
                rec[k] = v / denom


In case you are still interested in this problem, I have done something very similar using Lucene Java and Jython. Here are some snippets from my code.

Lucene preprocesses documents and queries using so-called analyzers. This one uses Lucene's built-in n-gram filter:

class NGramAnalyzer(Analyzer):
    '''Analyzer that yields n-grams for minlength <= n <= maxlength'''
    def __init__(self, minlength, maxlength):
        self.minlength = minlength
        self.maxlength = maxlength
    def tokenStream(self, field, reader):
        lower = ASCIIFoldingFilter(LowerCaseTokenizer(reader))
        return NGramTokenFilter(lower, self.minlength, self.maxlength)

To turn a list of ngrams into a Document:

doc = Document()
doc.add(Field('n-grams', ' '.join(ngrams),
              Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES))

To store a document in an index:

# NGramAnalyzer takes the n-gram range defined in its __init__ above
wr = IndexWriter(index_dir, NGramAnalyzer(minlength, maxlength), True,
                 IndexWriter.MaxFieldLength.LIMITED)
wr.addDocument(doc)

Creating queries is a bit harder, since Lucene's QueryParser expects a query language with special operators, quotation marks, etc., but it can be circumvented (as partially explained here).
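One way to sidestep QueryParser entirely, sketched here as an assumption based on the classic (pre-4.0) Lucene API the snippets above appear to target, is to assemble the query programmatically from the same n-gram tokens:

    # Hypothetical sketch; BooleanQuery/TermQuery/BooleanClause come from
    # the classic Lucene API, and query_ngrams stands for the n-gram
    # tokens of the query text (same form as the indexed 'n-grams' field).
    bq = BooleanQuery()
    for token in query_ngrams:
        bq.add(TermQuery(Term('n-grams', token)), BooleanClause.Occur.SHOULD)
    searcher = IndexSearcher(index_dir)
    hits = searcher.search(bq, 10)  # top 10 matches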


For our information retrieval course, we use some code written by our professor in Java. Sorry, there is no Python port. "It is released for educational and research purposes only under the GNU General Public License."

You can check the documentation at http://userweb.cs.utexas.edu/~mooney/ir-course/doc/

But more specifically, take a look at: http://userweb.cs.utexas.edu/users/mooney/ir-course/doc/ir/vsr/HashMapVector.html

You can download it from http://userweb.cs.utexas.edu/users/mooney/ir-course/