machine - python scikit-learn agrupamiento con datos faltantes

machine learning python español pdf (2)

Quiero agrupar datos con columnas faltantes. Al hacerlo manualmente, calcularía la distancia en caso de que falte una columna simplemente sin esta columna.

Con scikit-learn, los datos faltantes no son posibles. Tampoco hay posibilidad de especificar una función de distancia del usuario.

¿Hay alguna posibilidad de agruparse con datos faltantes?

Ejemplo de datos:

n_samples = 1500 noise = 0.05 X, _ = make_swiss_roll(n_samples, noise) rnd = np.random.rand(X.shape[0],X.shape[1]) X[rnd<0.1] = np.nan

Aquí hay un algoritmo diferente que uso. En lugar de reemplazar los valores que faltan, los valores se ignoran y para capturar las diferencias entre los ítems faltantes y no extraviados se implícita.

Comparado con el algoritmo de Alis, parece más fácil para las observaciones con observatorios faltantes saltar de clase en clase. Como no llené los valores faltantes.

Afortunadamente no tuve tiempo de comparar usando el hermoso código de Ali, pero siéntete libre de hacerlo (podría hacerlo cuando tenga el tiempo) y contribuir a la discusión sobre el mejor método.

import numpy as np class kmeans_missing(object): def __init__(self,potential_centroids,n_clusters): #initialize with potential centroids self.n_clusters=n_clusters self.potential_centroids=potential_centroids def fit(self,data,max_iter=10,number_of_runs=1): n_clusters=self.n_clusters potential_centroids=self.potential_centroids dist_mat=np.zeros((data.shape[0],n_clusters)) all_centroids=np.zeros((n_clusters,data.shape[1],number_of_runs)) costs=np.zeros((number_of_runs,)) for k in range(number_of_runs): idx=np.random.choice(range(potential_centroids.shape[0]), size=(n_clusters), replace=False) centroids=potential_centroids[idx] clusters=np.zeros(data.shape[0]) old_clusters=np.zeros(data.shape[0]) for i in range(max_iter): #Calc dist to centroids for j in range(n_clusters): dist_mat[:,j]=np.nansum((data-centroids[j])**2,axis=1) #Assign to clusters clusters=np.argmin(dist_mat,axis=1) #Update clusters for j in range(n_clusters): centroids[j]=np.nanmean(data[clusters==j],axis=0) if all(np.equal(clusters,old_clusters)): break # Break when to change in clusters if i==max_iter-1: print(''no convergence before maximal iterations are reached'') else: clusters,old_clusters=old_clusters,clusters all_centroids[:,:,k]=centroids costs[k]=np.mean(np.min(dist_mat,axis=1)) self.costs=costs self.cost=np.min(costs) self.best_model=np.argmin(costs) self.centroids=all_centroids[:,:,self.best_model] self.all_centroids=all_centroids def predict(self,data): dist_mat=np.zeros((data.shape[0],self.n_clusters)) for j in range(self.n_clusters): dist_mat[:,j]=np.nansum((data-self.centroids[j])**2,axis=1) prediction=np.argmin(dist_mat,axis=1) cost=np.min(dist_mat,axis=1) return prediction,cost

Aquí hay un ejemplo de cómo podría ser útil.

from sklearn.datasets import make_blobs from matplotlib import pyplot as plt from mpl_toolkits.mplot3d import Axes3D from kmeans_missing import * def make_fake_data(fraction_missing, n_clusters=5, n_samples=1500, n_features=2, seed=None): # complete data gen = np.random.RandomState(seed) X, true_labels = make_blobs(n_samples, n_features, n_clusters, random_state=gen) # with missing values missing = gen.rand(*X.shape) < fraction_missing Xm = np.where(missing, np.nan, X) return X, true_labels, Xm X, true_labels, X_hat = make_fake_data(fraction_missing=0.3, n_clusters=3, seed=0) X_missing_dummies=np.isnan(X_hat) n_clusters=3 X_hat = np.concatenate((X_hat,X_missing_dummies),axis=1) kmeans_m=kmeans_missing(X_hat,n_clusters) kmeans_m.fit(X_hat,max_iter=100,number_of_runs=10) print(kmeans_m.costs) prediction,cost=kmeans_m.predict(X_hat) for i in range(n_clusters): print([np.mean((prediction==i)*(true_labels==j)) for j in range(3)],np.mean((prediction==i)))

--EDITAR--

En este ejemplo, las ocurrencias de valores faltantes son completamente aleatorias y cuando ese es el caso. No agregar mejor las preformas dummy de valor perdido, ya que las variables ficticias de valor faltante en ese caso son ruido. No incluirlos también sería lo correcto para comparar con el algoritmo de Ali.

Creo que puedes usar un algoritmo iterativo de tipo EM:

Inicializar los valores faltantes en su columna significa
Repita hasta la convergencia:
Realice clustering K-means en los datos rellenos
Establezca los valores faltantes en las coordenadas del centroide de los clústeres a los que fueron asignados

Implementación

import numpy as np from sklearn.cluster import KMeans def kmeans_missing(X, n_clusters, max_iter=10): """Perform K-Means clustering on data with missing values. Args: X: An [n_samples, n_features] array of data to cluster. n_clusters: Number of clusters to form. max_iter: Maximum number of EM iterations to perform. Returns: labels: An [n_samples] vector of integer labels. centroids: An [n_clusters, n_features] array of cluster centroids. X_hat: Copy of X with the missing values filled in. """ # Initialize missing values to their column means missing = ~np.isfinite(X) mu = np.nanmean(X, 0, keepdims=1) X_hat = np.where(missing, mu, X) for i in xrange(max_iter): if i > 0: # initialize KMeans with the previous set of centroids. this is much # faster and makes it easier to check convergence (since labels # won''t be permuted on every iteration), but might be more prone to # getting stuck in local minima. cls = KMeans(n_clusters, init=prev_centroids) else: # do multiple random initializations in parallel cls = KMeans(n_clusters, n_jobs=-1) # perform clustering on the filled-in data labels = cls.fit_predict(X_hat) centroids = cls.cluster_centers_ # fill in the missing values based on their cluster centroids X_hat[missing] = centroids[labels][missing] # when the labels have stopped changing then we have converged if i > 0 and np.all(labels == prev_labels): break prev_labels = labels prev_centroids = cls.cluster_centers_ return labels, centroids, X_hat

Ejemplo con datos falsos

from sklearn.datasets import make_blobs from matplotlib import pyplot as plt from mpl_toolkits.mplot3d import Axes3D def make_fake_data(fraction_missing, n_clusters=5, n_samples=1500, n_features=3, seed=None): # complete data gen = np.random.RandomState(seed) X, true_labels = make_blobs(n_samples, n_features, n_clusters, random_state=gen) # with missing values missing = gen.rand(*X.shape) < fraction_missing Xm = np.where(missing, np.nan, X) return X, true_labels, Xm X, true_labels, Xm = make_fake_data(fraction_missing=0.3, n_clusters=5, seed=0) labels, centroids, X_hat = kmeans_missing(Xm, n_clusters=5) # plot the inferred points, color-coded according to the true cluster labels fig, ax = plt.subplots(1, 2, subplot_kw={''projection'':''3d'', ''aspect'':''equal''}) ax[0].scatter3D(X[:, 0], X[:, 1], X[:, 2], c=true_labels, cmap=''gist_rainbow'') ax[1].scatter3D(X_hat[:, 0], X_hat[:, 1], X_hat[:, 2], c=true_labels, cmap=''gist_rainbow'') ax[0].set_title(''Original data'') ax[1].set_title(''Imputed (30% missing values)'') fig.tight_layout()

Punto de referencia

Para evaluar el rendimiento del algoritmo, podemos usar la información mutua ajustada entre las etiquetas de clúster verdadero e inferido. Una puntuación de 1 es un rendimiento perfecto y 0 representa una oportunidad:

from sklearn.metrics import adjusted_mutual_info_score fraction = np.arange(0.0, 1.0, 0.05) n_repeat = 10 scores = np.empty((2, fraction.shape[0], n_repeat)) for i, frac in enumerate(fraction): for j in range(n_repeat): X, true_labels, Xm = make_fake_data(fraction_missing=frac, n_clusters=5) labels, centroids, X_hat = kmeans_missing(Xm, n_clusters=5) any_missing = np.any(~np.isfinite(Xm), 1) scores[0, i, j] = adjusted_mutual_info_score(labels, true_labels) scores[1, i, j] = adjusted_mutual_info_score(labels[any_missing], true_labels[any_missing]) fig, ax = plt.subplots(1, 1) scores_all, scores_missing = scores ax.errorbar(fraction * 100, scores_all.mean(-1), yerr=scores_all.std(-1), label=''All labels'') ax.errorbar(fraction * 100, scores_missing.mean(-1), yerr=scores_missing.std(-1), label=''Labels with missing values'') ax.set_xlabel(''% missing values'') ax.set_ylabel(''Adjusted mutual information'') ax.legend(loc=''best'', frameon=False) ax.set_ylim(0, 1) ax.set_xlim(-5, 100)

Actualizar:

De hecho, después de una rápida búsqueda en Google, parece que lo que se me ocurrió arriba es más o menos lo mismo que el algoritmo k- POD para la agrupación de K-means de datos faltantes (Chi, Chi y Baraniuk, 2016) .