
Using MultiLabelBinarizer on test data with labels that are not in the training set

Given this simple multilabel classification example (taken from the question "use scikit-learn to classify into multiple categories"):

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn import preprocessing
from sklearn.metrics import accuracy_score

X_train = np.array(["new york is a hell of a town",
                    "new york was originally dutch",
                    "the big apple is great",
                    "new york is also called the big apple",
                    "nyc is nice",
                    "people abbreviate new york city as nyc",
                    "the capital of great britain is london",
                    "london is in the uk",
                    "london is in england",
                    "london is in great britain",
                    "it rains a lot in london",
                    "london hosts the british museum",
                    "new york is great and so is london",
                    "i like london better than new york"])
y_train_text = [["new york"],["new york"],["new york"],["new york"],
                ["new york"],["new york"],["london"],["london"],
                ["london"],["london"],["london"],["london"],
                ["new york","london"],["new york","london"]]

X_test = np.array(['nice day in nyc',
                   'welcome to london',
                   'london is rainy',
                   'it is raining in britian',
                   'it is raining in britian and the big apple',
                   'it is raining in britian and nyc',
                   'hello welcome to new york. enjoy it here and london too'])
y_test_text = [["new york"],["london"],["london"],["london"],
               ["new york", "london"],["new york", "london"],["new york", "london"]]

lb = preprocessing.MultiLabelBinarizer()
Y = lb.fit_transform(y_train_text)
Y_test = lb.fit_transform(y_test_text)

classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])

classifier.fit(X_train, Y)
predicted = classifier.predict(X_test)

print "Accuracy Score: ",accuracy_score(Y_test, predicted)

The code runs fine and prints the accuracy score. However, if I change y_test_text to

y_test_text = [["new york"],["london"],["england"],["london"],["new york", "london"],["new york", "london"],["new york", "london"]]

I get

Traceback (most recent call last):
  File "/Users/scottstewart/Documents/scikittest/example.py", line 52, in <module>
    print "Accuracy Score: ",accuracy_score(Y_test, predicted)
  File "/Library/Python/2.7/site-packages/sklearn/metrics/classification.py", line 181, in accuracy_score
    differing_labels = count_nonzero(y_true - y_pred, axis=1)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/scipy/sparse/compressed.py", line 393, in __sub__
    raise ValueError("inconsistent shapes")
ValueError: inconsistent shapes

Note the introduction of the 'england' label, which is not in the training set. How can I use multilabel classification so that, if a "test" label is introduced, I can still run some of the metrics? Or is that even possible?

EDIT: Thanks for the answers, guys. I think my question is more about how the scikit-learn binarizer works, or should work. Given my short sample code, I would also expect that if I changed y_test_text to

y_test_text = [["new york"],["new york"],["new york"],["new york"],["new york"],["new york"],["new york"]]

it would work, since we have already fitted on that label. But in this case I get

ValueError: Can't handle mix of binary and multilabel-indicator

In short, this is an ill-posed problem. Classification assumes that all labels are known in advance, and so does the binarizer. Fit it on all the labels, then train on whatever subset you want.
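The advice above can be sketched as follows. This is only a minimal illustration, assuming Python 3 and a recent scikit-learn; the shortened label lists are made up for the example and are not the question's full data.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Shortened, illustrative label lists (not the question's full data).
y_train_text = [["new york"], ["london"], ["new york", "london"]]
y_test_text = [["new york"], ["england"], ["new york", "london"]]

# Fit the binarizer once, on every label that can occur in train or test.
lb = MultiLabelBinarizer()
lb.fit(y_train_text + y_test_text)

# From here on only transform, so both matrices share the same columns.
Y_train = lb.transform(y_train_text)
Y_test = lb.transform(y_test_text)

print(lb.classes_)                  # labels in sorted order
print(Y_train.shape, Y_test.shape)  # same number of columns in both
```

Both matrices now have one column per known label, so the metrics can compare them. Note that the "england" column is all zeros in the training matrix, so a classifier has nothing to learn from for that label.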

You can, if you also "introduce" the new label in the training set, like this:

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn import preprocessing
from sklearn.metrics import accuracy_score

X_train = np.array(["new york is a hell of a town",
                    "new york was originally dutch",
                    "the big apple is great",
                    "new york is also called the big apple",
                    "nyc is nice",
                    "people abbreviate new york city as nyc",
                    "the capital of great britain is london",
                    "london is in the uk",
                    "london is in england",
                    "london is in great britain",
                    "it rains a lot in london",
                    "london hosts the british museum",
                    "new york is great and so is london",
                    "i like london better than new york"])
y_train_text = [["new york"],["new york"],["new york"],["new york"],
                ["new york"],["new york"],["london"],["london"],
                ["london"],["london"],["london"],["london"],
                ["new york","England"],["new york","london"]]

X_test = np.array(['nice day in nyc',
                   'welcome to london',
                   'london is rainy',
                   'it is raining in britian',
                   'it is raining in britian and the big apple',
                   'it is raining in britian and nyc',
                   'hello welcome to new york. enjoy it here and london too'])
y_test_text = [["new york"],["new york"],["new york"],["new york"],
               ["new york"],["new york"],["new york"]]

lb = preprocessing.MultiLabelBinarizer(classes=("new york","london","England"))
Y = lb.fit_transform(y_train_text)
Y_test = lb.fit_transform(y_test_text)
print Y_test

classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])

classifier.fit(X_train, Y)
predicted = classifier.predict(X_test)
print predicted

print "Accuracy Score: ",accuracy_score(Y_test, predicted)

Output:

Accuracy Score: 0.571428571429

The key part is:

y_train_text = [["new york"],["new york"],["new york"], ["new york"],["new york"],["new york"], ["london"],["london"],["london"],["london"], ["london"],["london"],["new york","England"], ["new york","london"]]

where we also insert "England". This makes sense: otherwise, how could the classifier predict a label it has never seen before? So this way we create a three-label classification problem.

EDITED:

lb = preprocessing.MultiLabelBinarizer(classes=("new york","london","England"))

You have to pass the classes as an argument to MultiLabelBinarizer(); it will then work with any y_test_text whose labels are among those classes.
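To see why fixing the classes helps with the all-"new york" test set from the question, here is a small sketch, assuming Python 3 and a recent scikit-learn. The `predicted` matrix is hand-written purely for illustration and is not the output of any classifier. Refitting the binarizer on the test labels alone produces a single column, which the metrics most likely treat as a plain binary target; that would explain the "mix of binary and multilabel-indicator" error.

```python
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import accuracy_score

y_test_text = [["new york"]] * 7

# Refitting on the test labels alone yields a single column.
refit = MultiLabelBinarizer().fit_transform(y_test_text)
print(refit.shape)

# With a fixed class list there is always one column per known label.
lb = MultiLabelBinarizer(classes=["new york", "london", "England"])
lb.fit(y_test_text)  # with classes given, the columns do not depend on this data
Y_test = lb.transform(y_test_text)
print(Y_test.shape)

# Hypothetical predictions (made up for this sketch): every row is correct
# except the first, which wrongly adds "london".
predicted = Y_test.copy()
predicted[0, 1] = 1

# For multilabel input, accuracy_score counts a row only if all labels match.
score = accuracy_score(Y_test, predicted)
print("Accuracy Score:", score)
```

Using `transform` for the test labels, rather than a second `fit_transform`, makes it explicit that the columns come from the fixed class list and not from whatever labels happen to appear in the test set.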