example - Clustering text documents using scikit-learn's KMeans in Python


This article is very useful for clustering documents with K-Means: http://brandonrose.org/clustering

To understand the algorithm itself, you can also read this article: https://datasciencelab.wordpress.com/2013/12/12/clustering-with-k-means-in-python/
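As a rough illustration of what the algorithm does, here is a minimal pure-Python sketch of the standard assign-then-update loop (Lloyd's algorithm). It is not taken from either linked article, and the function name `kmeans`, its `init` parameter, and the sample points are made up for this example. scikit-learn's `KMeans` adds smarter initialization (k-means++) and works on sparse matrices.

```python
import random

def kmeans(points, k, max_iter=100, seed=0, init=None):
    """Plain Lloyd's algorithm on a list of equal-length numeric tuples.

    `init` (hypothetical helper parameter) lets the caller supply starting
    centroids; otherwise k of the input points are picked at random.
    """
    rng = random.Random(seed)
    centroids = list(init) if init is not None else rng.sample(points, k)
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid
        # (squared Euclidean distance).
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the centroids will too
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids

# Two well-separated groups of 2-D points (illustrative data).
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

Because the starting centroids are random, the cluster numbering (which group is 0 and which is 1) can differ between seeds; only the grouping is meaningful.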

I need to implement scikit-learn's KMeans to cluster text documents. The example code works fine as is, but it takes some 20newsgroups data as input. I want to use the same code to cluster a list of documents, as shown below:

    documents = ["Human machine interface for lab abc computer applications",
                 "A survey of user opinion of computer system response time",
                 "The EPS user interface management system",
                 "System and human system engineering testing of EPS",
                 "Relation of user perceived response time to error measurement",
                 "The generation of random binary unordered trees",
                 "The intersection graph of paths in trees",
                 "Graph minors IV Widths of trees and well quasi ordering",
                 "Graph minors A survey"]

What changes do I need to make to the KMeans example code to use this list as input? (Simply setting 'dataset = documents' doesn't work.)

Here is a simpler example:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans
    from sklearn.metrics import adjusted_rand_score

    documents = ["Human machine interface for lab abc computer applications",
                 "A survey of user opinion of computer system response time",
                 "The EPS user interface management system",
                 "System and human system engineering testing of EPS",
                 "Relation of user perceived response time to error measurement",
                 "The generation of random binary unordered trees",
                 "The intersection graph of paths in trees",
                 "Graph minors IV Widths of trees and well quasi ordering",
                 "Graph minors A survey"]

Vectorize the text, i.e. convert the strings to numeric features:

    # Build TF-IDF features, dropping common English stop words.
    vectorizer = TfidfVectorizer(stop_words='english')
    X = vectorizer.fit_transform(documents)
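To see what the vectorizer actually produces, a small sketch like the following can help. It uses only three of the documents to keep the output short, and the variable name `docs` is just for this illustration. The result of `fit_transform` is a sparse matrix with one row per document and one column per vocabulary term.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# A three-document subset of the list above, chosen only for brevity.
docs = [
    "The intersection graph of paths in trees",
    "Graph minors A survey",
    "The EPS user interface management system",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)  # SciPy sparse matrix, one row per document

print(X.shape)                         # (number of documents, vocabulary size)
print(sorted(vectorizer.vocabulary_))  # terms that survived stop-word removal
print(X.toarray().round(2))            # dense view, fine for tiny inputs only
```

Note that text is lowercased by default, so "Graph" and "graph" map to the same column. This is also why passing the raw list of strings to KMeans does not work: the estimator needs the numeric matrix `X`, not the strings.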

Cluster the documents:

    true_k = 2
    # k-means++ seeding, a single initialization, at most 100 iterations.
    model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
    model.fit(X)
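Once the model is fitted, the cluster assigned to each document is available in `model.labels_`, and new text can be assigned to a cluster with `predict`. The sketch below repeats the setup so it runs on its own. The `random_state=0` and `n_init=10` arguments and the `new_docs` sentence are additions for this illustration, not part of the code above.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(documents)

# random_state and n_init=10 are assumptions added here to make runs repeatable
# and less sensitive to a single unlucky initialization.
model = KMeans(n_clusters=2, init="k-means++", max_iter=100,
               n_init=10, random_state=0)
model.fit(X)

# labels_ holds one cluster index per input document, in input order.
for label, doc in zip(model.labels_, documents):
    print(label, doc)

# New text must go through the SAME fitted vectorizer: transform, not fit_transform.
new_docs = ["A survey of graph minors and trees"]  # made-up example sentence
predicted = model.predict(vectorizer.transform(new_docs))
print(predicted)
```

The cluster numbers themselves are arbitrary; what matters is which documents share a label.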

Print the top terms per cluster:

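    print("Top terms per cluster:")
    # Sort each centroid's term weights in descending order.
    order_centroids = model.cluster_centers_.argsort()[:, ::-1]
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
    terms = vectorizer.get_feature_names_out()
    for i in range(true_k):
        print("Cluster %d:" % i, end="")
        # The ten highest-weighted terms of this cluster's centroid.
        for ind in order_centroids[i, :10]:
            print(" %s" % terms[ind], end="")
        print()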
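The `argsort()[:, ::-1]` idiom can be hard to read at first. The toy sketch below shows it on a hand-written 2x3 array standing in for `cluster_centers_`; the numbers and the three terms are invented for illustration.

```python
import numpy as np

# Pretend centroid weights: 2 clusters (rows) x 3 vocabulary terms (columns).
centers = np.array([[0.1, 0.7, 0.2],
                    [0.5, 0.0, 0.3]])
terms = np.array(["graph", "system", "user"])

# argsort gives column indices in ascending weight order per row;
# reversing each row with [:, ::-1] turns that into descending order.
order = centers.argsort()[:, ::-1]
print(order.tolist())                # [[1, 2, 0], [0, 2, 1]]

# Top two terms for each cluster.
print(terms[order[:, :2]].tolist())  # [['system', 'user'], ['graph', 'user']]
```

Each centroid is a vector of average TF-IDF weights, so the highest-weighted columns are the terms that best characterize the cluster.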

If you want a more visual idea of what this looks like, see this answer.