python nlp nltk normalize cosine-similarity

python - Normalizing a ranking score with weights




I'm working on a document retrieval problem where, given a set of documents and a search query, I want to find the document closest to the query. The model I'm using is based on TfidfVectorizer in scikit-learn. I created 4 different tf_idf vectors for all the documents using 4 different tokenizers. Each tokenizer splits the string into n-grams where n is in the range 1 ... 4.

For example:

doc_1 = "Singularity is still a confusing phenomenon in physics"
doc_2 = "Quantum theory still wins over String theory"

So model_1 will use a 1-gram tokenizer, model_2 will use a 2-gram tokenizer, and so on.

Then, for a given search query, I calculate the cosine similarity between the query and all the documents using these 4 models.

For example, search query: Singularity in quantum physics. The query is split into n-grams and the tf_idf values are computed from the corresponding n-gram model.

Therefore, for each query-document pair I have 4 similarity values, one per n-gram model used. For example:

1-gram similarity = 0.4370303325246957
2-gram similarity = 0.36617374546988996
3-gram similarity = 0.29519246156322099
4-gram similarity = 0.2902998188509896

All these similarity scores are normalized on a scale of 0 to 1. Now I want to compute an aggregated, normalized score such that, for any query-document pair, the highest n-gram similarity gets a really high weight. Basically, the higher the ngram similarity, the bigger its impact on the overall score.
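For reference, a minimal sketch of the setup described above, assuming scikit-learn's TfidfVectorizer and cosine_similarity (the names and the loop are illustrative, not my actual code):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["Singularity is still a confusing phenomenon in physics",
        "Quantum theory still wins over String theory"]
query = "Singularity in quantum physics"

# One model per n-gram size: model_1 uses 1-grams, model_2 uses 2-grams, etc.
similarities = {}
for n in range(1, 5):
    model = TfidfVectorizer(analyzer='word', ngram_range=(n, n))
    doc_matrix = model.fit_transform(docs)  # fit tf_idf on the documents
    query_vec = model.transform([query])    # vectorize the query with the same model
    similarities[n] = cosine_similarity(query_vec, doc_matrix)[0]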

Can anyone suggest a solution?


There are many ways to play with the numbers:

>>> onegram_sim = 0.43
>>> twogram_sim = 0.36
>>> threegram_sim = 0.29
>>> fourgram_sim = 0.29

# Sum(x) / len(list)
>>> all_sim = sum([onegram_sim, twogram_sim, threegram_sim, fourgram_sim]) / 4
>>> all_sim
0.3425

# Sum(x*x) / len(list)
>>> all_sim = sum(map(lambda x: x**2, [onegram_sim, twogram_sim, threegram_sim, fourgram_sim])) / 4
>>> all_sim
0.120675

# Product(x)
>>> from functools import reduce  # built in on Python 2; this import is needed on Python 3
>>> from operator import mul
>>> onetofour_sim = [onegram_sim, twogram_sim, threegram_sim, fourgram_sim]
>>> reduce(mul, onetofour_sim, 1)
0.013018679999999998

Ultimately, whatever gets you the best accuracy on your end task is the best solution.
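If you specifically want the highest n-gram similarity to dominate the aggregate, two more sketches (the exponent p and temperature t below are knobs I'm making up, not canonical values). A power mean keeps the result in the 0-1 range while letting the largest score dominate as p grows, and a softmax-weighted average weights each score by an exponential of itself:

import math

sims = [0.4370303325246957, 0.36617374546988996,
        0.29519246156322099, 0.2902998188509896]

def power_mean(scores, p=3):
    # Raising each score to a power p > 1 before averaging lets the
    # largest similarity dominate; the result stays within [0, 1].
    return (sum(s ** p for s in scores) / len(scores)) ** (1.0 / p)

def softmax_weighted(scores, t=0.1):
    # Each score is weighted by exp(score / t); a smaller temperature t
    # pulls the aggregate closer to the maximum similarity.
    weights = [math.exp(s / t) for s in scores]
    total = sum(weights)
    return sum(w / total * s for w, s in zip(weights, scores))

print(power_mean(sims))        # ~0.36, pulled toward the max
print(softmax_weighted(sims))  # ~0.38, dominated by the 1-gram score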

Going beyond your question:

For computing document similarity, there is a long-running SemEval task: Semantic Textual Similarity, https://groups.google.com/forum/#!forum/sts-semeval

Common strategies include (non-exhaustively):

  1. Use a corpus annotated with similarity scores for sentence pairs, extract some features, train a regressor, and output a similarity score

  2. Use some kind of vector space semantics (highly recommended reading: http://www.jair.org/media/2934/live-2934-4846-jair.pdf) and then compute some vector similarity scores (take a look at How to calculate cosine similarity given 2 sentence strings? - Python; a minimal cosine function is also sketched right after this list)

    i. A subset of vector space semantics will come in handy (sometimes known as word embeddings); sometimes people train a vector space with topic models / neural nets / deep learning (other related buzzwords), see http://u.cs.biu.ac.il/~yogo/cvsc2015.pdf

    ii. You could also use more traditional bag-of-words vectors, compress the space with TF-IDF or any other "latent" dimensionality reduction, and then use some vector similarity function to get the similarity.

    iii. Create a fancy vector similarity function (e.g. cosmul, see https://radimrehurek.com/gensim/models/word2vec.html) and then tune the function across different spaces.

  3. Use some lexical resources that contain an ontology of concepts (e.g. WordNet, Cyc, etc.) and then compare similarity by traversing the concept graphs (see http://www.nltk.org/howto/wordnet.html). An example would be https://github.com/alvations/pywsd/blob/master/pywsd/similarity.py
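As a reference point for the vector similarity scores mentioned in point 2, here is the textbook cosine similarity over two vectors in plain numpy (a generic formula, not tied to any particular library):

import numpy as np

def cosine(u, v):
    # Cosine similarity: dot product scaled by the product of the norms.
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / denom) if denom else 0.0

print(cosine([1, 0, 1], [1, 1, 0]))  # 0.5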

With the above as background, and without annotations, let's try hacking up some vector space examples:

First let's try plain ngrams with simple binary vectors:

import numpy as np
from nltk import ngrams

doc1 = "Singularity is still a confusing phenomenon in physics".split()
doc2 = "Quantum theory still wins over String theory".split()

_vec1 = list(ngrams(doc1, 3))
_vec2 = list(ngrams(doc2, 3))

# Create a full dictionary of all possible ngrams.
vec_dict = list(set(_vec1).union(_vec2))
print('Vector Dict:', vec_dict)

# Now vectorize the documents.
vec1 = [1 if ng in _vec1 else 0 for ng in vec_dict]
vec2 = [1 if ng in _vec2 else 0 for ng in vec_dict]
print('Vectorized:', vec1, vec2)
print('Similarity:', np.dot(vec1, vec2))

[out]:

Vector Dict: [('still', 'a', 'confusing'), ('confusing', 'phenomenon', 'in'), ('theory', 'still', 'wins'), ('is', 'still', 'a'), ('over', 'String', 'theory'), ('a', 'confusing', 'phenomenon'), ('wins', 'over', 'String'), ('Singularity', 'is', 'still'), ('still', 'wins', 'over'), ('phenomenon', 'in', 'physics'), ('Quantum', 'theory', 'still')]
Vectorized: [1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0] [0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
Similarity: 0

Now let's try including everything from 1-grams up to n-grams (where n = len(sent)), putting them all in the vector dictionary with binary values:

import numpy as np
from nltk import ngrams

def everygrams(sequence):
    """
    Returns all possible ngrams for n ranging from 1 to len(sequence).
    >>> list(everygrams('a b c'.split()))
    [('a',), ('b',), ('c',), ('a', 'b'), ('b', 'c'), ('a', 'b', 'c')]
    """
    for n in range(1, len(sequence) + 1):
        for ng in ngrams(sequence, n):
            yield ng

doc1 = "Singularity is still a confusing phenomenon in physics".split()
doc2 = "Quantum theory still wins over String theory".split()

_vec1 = list(everygrams(doc1))
_vec2 = list(everygrams(doc2))

# Create a full dictionary of all possible ngrams.
vec_dict = list(set(_vec1).union(_vec2))
print('Vector Dict:', vec_dict, '\n')

# Now vectorize the documents.
vec1 = [1 if ng in _vec1 else 0 for ng in vec_dict]
vec2 = [1 if ng in _vec2 else 0 for ng in vec_dict]
print('Vectorized:', vec1, vec2, '\n')
print('Similarity:', np.dot(vec1, vec2), '\n')

[out]:

Vector Dict: [('still', 'a'), ('over', 'String'), ('theory', 'still', 'wins', 'over', 'String', 'theory'), ('String', 'theory'), ('physics',), ('in',), ('wins', 'over', 'String', 'theory'), ('is', 'still', 'a', 'confusing', 'phenomenon', 'in'), ('theory', 'still', 'wins'), ('Singularity', 'is', 'still', 'a', 'confusing', 'phenomenon'), ('a',), ('wins',), ('is', 'still', 'a'), ('Singularity', 'is'), ('phenomenon', 'in'), ('still', 'wins', 'over', 'String'), ('Singularity', 'is', 'still', 'a', 'confusing', 'phenomenon', 'in', 'physics'), ('Quantum', 'theory', 'still', 'wins', 'over'), ('a', 'confusing', 'phenomenon'), ('Singularity', 'is', 'still', 'a'), ('confusing', 'phenomenon'), ('confusing', 'phenomenon', 'in', 'physics'), ('Singularity', 'is', 'still'), ('is', 'still', 'a', 'confusing', 'phenomenon', 'in', 'physics'), ('wins', 'over'), ('theory', 'still', 'wins', 'over'), ('phenomenon',), ('Quantum', 'theory', 'still', 'wins', 'over', 'String'), ('is', 'still'), ('still', 'wins', 'over'), ('is', 'still', 'a', 'confusing', 'phenomenon'), ('phenomenon', 'in', 'physics'), ('Quantum', 'theory', 'still', 'wins'), ('Quantum', 'theory', 'still'), ('a', 'confusing', 'phenomenon', 'in', 'physics'), ('Singularity', 'is', 'still', 'a', 'confusing'), ('still', 'a', 'confusing', 'phenomenon', 'in'), ('still', 'a', 'confusing'), ('is', 'still', 'a', 'confusing'), ('in', 'physics'), ('Quantum', 'theory', 'still', 'wins', 'over', 'String', 'theory'), ('confusing', 'phenomenon', 'in'), ('theory', 'still'), ('Quantum', 'theory'), ('is',), ('String',), ('over', 'String', 'theory'), ('still', 'a', 'confusing', 'phenomenon', 'in', 'physics'), ('a', 'confusing'), ('still', 'wins'), ('still',), ('over',), ('still', 'a', 'confusing', 'phenomenon'), ('wins', 'over', 'String'), ('Singularity',), ('confusing',), ('theory',), ('Singularity', 'is', 'still', 'a', 'confusing', 'phenomenon', 'in'), ('still', 'wins', 'over', 'String', 'theory'), ('a', 'confusing', 'phenomenon', 'in'), ('Quantum',), ('theory', 'still', 'wins', 'over', 'String')]
Vectorized: [1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0] [0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1]
Similarity: 1

Now let's try normalizing by the number of possible ngrams:

import numpy as np
from nltk import ngrams

def everygrams(sequence):
    """
    Returns all possible ngrams for n ranging from 1 to len(sequence).
    >>> list(everygrams('a b c'.split()))
    [('a',), ('b',), ('c',), ('a', 'b'), ('b', 'c'), ('a', 'b', 'c')]
    """
    for n in range(1, len(sequence) + 1):
        for ng in ngrams(sequence, n):
            yield ng

doc1 = "Singularity is still a confusing phenomenon in physics".split()
doc2 = "Quantum theory still wins over String theory".split()

_vec1 = list(everygrams(doc1))
_vec2 = list(everygrams(doc2))

# Create a full dictionary of all possible ngrams.
vec_dict = list(set(_vec1).union(_vec2))
print('Vector Dict:', vec_dict, '\n')

# Now vectorize the documents, normalizing by the number of ngrams per document.
vec1 = [1 / len(_vec1) if ng in _vec1 else 0 for ng in vec_dict]
vec2 = [1 / len(_vec2) if ng in _vec2 else 0 for ng in vec_dict]
print('Vectorized:', vec1, vec2, '\n')
print('Similarity:', np.dot(vec1, vec2), '\n')

Looks better; [out]:

Vector Dict: (same 62-entry dictionary as in the previous output)
Vectorized: [0.027777777777777776, 0, 0, 0, 0.027777777777777776, 0.027777777777777776, 0, 0.027777777777777776, 0, 0.027777777777777776, 0.027777777777777776, 0, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0, 0.027777777777777776, 0, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0, 0, 0.027777777777777776, 0, 0.027777777777777776, 0, 0.027777777777777776, 0.027777777777777776, 0, 0, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0, 0.027777777777777776, 0, 0, 0.027777777777777776, 0, 0, 0.027777777777777776, 0.027777777777777776, 0, 0.027777777777777776, 0, 0.027777777777777776, 0, 0.027777777777777776, 0.027777777777777776, 0, 0.027777777777777776, 0, 0.027777777777777776, 0, 0] [0, 0.03571428571428571, 0.03571428571428571, 0.03571428571428571, 0, 0, 0.03571428571428571, 0, 0.03571428571428571, 0, 0, 0.03571428571428571, 0, 0, 0, 0.03571428571428571, 0, 0.03571428571428571, 0, 0, 0, 0, 0, 0, 0.03571428571428571, 0.03571428571428571, 0, 0.03571428571428571, 0, 0.03571428571428571, 0, 0, 0.03571428571428571, 0.03571428571428571, 0, 0, 0, 0, 0, 0, 0.03571428571428571, 0, 0.03571428571428571, 0.03571428571428571, 0, 0.03571428571428571, 0.03571428571428571, 0, 0, 0.03571428571428571, 0.03571428571428571, 0.03571428571428571, 0, 0.03571428571428571, 0, 0, 0.03571428571428571, 0, 0.03571428571428571, 0, 0.03571428571428571, 0.03571428571428571]
Similarity: 0.000992063492063
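(To see where these numbers come from: doc1 has 8 tokens, so it yields 8*9/2 = 36 everygrams and each nonzero entry is 1/36 ≈ 0.0278; doc2 has 7 tokens, so 28 everygrams and entries of 1/28 ≈ 0.0357. The only ngram the two documents share is ('still',), so the dot product is 1/36 * 1/28 = 1/1008 ≈ 0.000992.)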

Now let's try counting the ngrams instead of taking 1/len(_vec), i.e. _vec.count(ng) / len(_vec):

import numpy as np
from nltk import ngrams

def everygrams(sequence):
    """
    Returns all possible ngrams for n ranging from 1 to len(sequence).
    >>> list(everygrams('a b c'.split()))
    [('a',), ('b',), ('c',), ('a', 'b'), ('b', 'c'), ('a', 'b', 'c')]
    """
    for n in range(1, len(sequence) + 1):
        for ng in ngrams(sequence, n):
            yield ng

doc1 = "Singularity is still a confusing phenomenon in physics".split()
doc2 = "Quantum theory still wins over String theory".split()

_vec1 = list(everygrams(doc1))
_vec2 = list(everygrams(doc2))

# Create a full dictionary of all possible ngrams.
vec_dict = list(set(_vec1).union(_vec2))
print('Vector Dict:', vec_dict, '\n')

# Now vectorize the documents using relative ngram counts.
vec1 = [_vec1.count(ng) / len(_vec1) if ng in _vec1 else 0 for ng in vec_dict]
vec2 = [_vec2.count(ng) / len(_vec2) if ng in _vec2 else 0 for ng in vec_dict]
print('Vectorized:', vec1, vec2, '\n')
print('Similarity:', np.dot(vec1, vec2), '\n')

As expected, since the only shared ngram, ('still',), has a count of 1 in both documents, the similarity score stays the same (only the entry for ('theory',), which occurs twice in doc2, changes):

Vector Dict: (same 62-entry dictionary as in the previous outputs)
Vectorized: [0.027777777777777776, 0, 0, 0, 0.027777777777777776, 0.027777777777777776, 0, 0.027777777777777776, 0, 0.027777777777777776, 0.027777777777777776, 0, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0, 0.027777777777777776, 0, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0, 0, 0.027777777777777776, 0, 0.027777777777777776, 0, 0.027777777777777776, 0.027777777777777776, 0, 0, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0, 0.027777777777777776, 0, 0, 0.027777777777777776, 0, 0, 0.027777777777777776, 0.027777777777777776, 0, 0.027777777777777776, 0, 0.027777777777777776, 0, 0.027777777777777776, 0.027777777777777776, 0, 0.027777777777777776, 0, 0.027777777777777776, 0, 0] [0, 0.03571428571428571, 0.03571428571428571, 0.03571428571428571, 0, 0, 0.03571428571428571, 0, 0.03571428571428571, 0, 0, 0.03571428571428571, 0, 0, 0, 0.03571428571428571, 0, 0.03571428571428571, 0, 0, 0, 0, 0, 0, 0.03571428571428571, 0.03571428571428571, 0, 0.03571428571428571, 0, 0.03571428571428571, 0, 0, 0.03571428571428571, 0.03571428571428571, 0, 0, 0, 0, 0, 0, 0.03571428571428571, 0, 0.03571428571428571, 0.03571428571428571, 0, 0.03571428571428571, 0.03571428571428571, 0, 0, 0.03571428571428571, 0.03571428571428571, 0.03571428571428571, 0, 0.03571428571428571, 0, 0, 0.07142857142857142, 0, 0.03571428571428571, 0, 0.03571428571428571, 0.03571428571428571]
Similarity: 0.000992063492063
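Note that the scores above come from a raw dot product. If you want them on the same 0-to-1 cosine scale as in your question, you can divide by the product of the vector norms; a minimal tweak, reusing vec1 and vec2 from the last snippet:

import numpy as np

# Scale the dot product by the vector norms to get cosine similarity.
norm1, norm2 = np.linalg.norm(vec1), np.linalg.norm(vec2)
print('Cosine similarity:', np.dot(vec1, vec2) / (norm1 * norm2))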

Besides ngrams, you can also try skipgrams: How to compute skip-grams in python?
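For instance, a quick sketch, assuming a recent NLTK that ships skipgrams in nltk.util (check your version):

from nltk.util import skipgrams

doc1 = "Singularity is still a confusing phenomenon in physics".split()

# Bigrams allowing up to 2 skipped tokens in between (n=2, k=2).
print(list(skipgrams(doc1, 2, 2)))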