python - onehotencoder - sklearn

Impute categorical missing values in scikit-learn

I have pandas data with some text-type columns, and there are NaN values in those text columns. What I am trying to do is impute those NaNs with sklearn.preprocessing.Imputer (replacing NaN with the most frequent value). The problem is in the implementation. Suppose there is a pandas DataFrame df with 30 columns, 10 of which are categorical. Once I run:

    from sklearn.preprocessing import Imputer
    imp = Imputer(missing_values='NaN', strategy='most_frequent', axis=0)
    imp.fit(df)

Python raises an error: 'could not convert string to float: 'run1'', where 'run1' is an ordinary (non-missing) value from the first column with categorical data.

Any help would be very welcome.

Copying and modifying sveitser's answer, I made an imputer for a pandas.Series object:

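    import pandas as pd
    from pandas.api.types import is_numeric_dtype
    from sklearn.base import BaseEstimator, TransformerMixin


    class CustomImputer(BaseEstimator, TransformerMixin):
        def __init__(self, strategy='mean', filler='NA'):
            self.strategy = strategy
            self.filler = filler  # stored under its own name so get_params() works
            self.fill = filler

        def fit(self, X, y=None):
            if self.strategy in ['mean', 'median']:
                # mean and median are only defined for numeric data
                dtypes = [X.dtype] if isinstance(X, pd.Series) else X.dtypes
                if not all(is_numeric_dtype(t) for t in dtypes):
                    raise ValueError('dtypes mismatch: np.number dtype is '
                                     'required for ' + self.strategy)
            if self.strategy == 'mean':
                self.fill = X.mean()
            elif self.strategy == 'median':
                self.fill = X.median()
            elif self.strategy == 'mode':
                # the first row of mode() holds the most frequent value(s)
                self.fill = X.mode().iloc[0]
            elif self.strategy == 'fill':
                # a list of fillers is matched to the columns by position
                if type(self.fill) is list and type(X) is pd.DataFrame:
                    self.fill = dict(zip(X.columns, self.fill))
            return self

        def transform(self, X, y=None):
            return X.fillna(self.fill)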

To use it you would do:

    >> df
         MasVnrArea FireplaceQu
    Id
    1         196.0         NaN
    974       196.0         NaN
    21        380.0          Gd
    5         350.0          TA
    651         NaN          Gd

    >> CustomImputer(strategy='mode').fit_transform(df)
         MasVnrArea FireplaceQu
    Id
    1         196.0          Gd
    974       196.0          Gd
    21        380.0          Gd
    5         350.0          TA
    651       196.0          Gd

    >> CustomImputer(strategy='fill', filler=[0, 'NA']).fit_transform(df)
         MasVnrArea FireplaceQu
    Id
    1         196.0          NA
    974       196.0          NA
    21        380.0          Gd
    5         350.0          TA
    651         0.0          Gd

This code fills in a series with the most frequent category:

    import pandas as pd
    import numpy as np

    # create fake data
    m = pd.Series(list('abca'))
    m.iloc[1] = np.nan  # artificially introduce nan
    print('m = ')
    print(m)

    # make dummy variables, count and sort descending:
    most_common = pd.get_dummies(m).sum().sort_values(ascending=False).index[0]


    def replace_most_common(x):
        if pd.isnull(x):
            return most_common
        else:
            return x


    new_m = m.map(replace_most_common)  # apply function to original data
    print('new_m = ')
    print(new_m)

Output:

    m =
    0      a
    1    NaN
    2      c
    3      a
    dtype: object
    new_m =
    0    a
    1    a
    2    c
    3    a
    dtype: object
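The same result can be reached more directly. As a minimal sketch (variable names are illustrative): Series.mode() ignores NaN by default and returns the most frequent value or values, so its first entry can be passed straight to fillna().

```python
import numpy as np
import pandas as pd

# Same fake data as above: 'a', NaN, 'c', 'a'
m = pd.Series(list('abca'), dtype=object)
m.iloc[1] = np.nan

# mode() skips NaN; if several values tie, iloc[0] takes the first one returned
most_common = m.mode().iloc[0]
new_m = m.fillna(most_common)
print(new_m.tolist())
```

This avoids building dummy variables and mapping a Python function over every element.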

Inspired by the answers here, and for want of a go-to Imputer for all use cases, I ended up writing this. It supports four imputation strategies (mean, mode, median, fill) and works on both pd.DataFrame and pd.Series.

mean and median work only for numeric data; mode and fill work for both numeric and categorical data.

    import numpy
    import pandas
    from sklearn.base import TransformerMixin


    class SeriesImputer(TransformerMixin):

        def __init__(self):
            """Impute missing values.

            If the Series is of dtype Object, then impute with the most frequent object.
            If the Series is not of dtype Object, then impute with the mean.
            """

        def fit(self, X, y=None):
            if X.dtype == numpy.dtype('O'):
                self.fill = X.value_counts().index[0]
            else:
                self.fill = X.mean()
            return self

        def transform(self, X, y=None):
            return X.fillna(self.fill)

Usage:

    # Make a series
    s1 = pandas.Series(['k', 'i', 't', 't', 'e', numpy.nan])

    a = SeriesImputer()   # Initialize the imputer
    a.fit(s1)             # Fit the imputer
    s2 = a.transform(s1)  # Get a new series

To use mean values for numeric columns and the most frequent value for non-numeric columns, you could do something like this. You could further distinguish between integers and floats; I guess it might make sense to use the median for integer columns instead.

    import pandas as pd
    import numpy as np

    from sklearn.base import TransformerMixin


    class DataFrameImputer(TransformerMixin):

        def __init__(self):
            """Impute missing values.

            Columns of dtype object are imputed with the most frequent value
            in column.

            Columns of other types are imputed with mean of column.
            """

        def fit(self, X, y=None):
            self.fill = pd.Series([X[c].value_counts().index[0]
                                   if X[c].dtype == np.dtype('O') else X[c].mean()
                                   for c in X],
                                  index=X.columns)
            return self

        def transform(self, X, y=None):
            return X.fillna(self.fill)


    data = [
        ['a', 1, 2],
        ['b', 1, 1],
        ['b', 2, 2],
        [np.nan, np.nan, np.nan]
    ]

    X = pd.DataFrame(data)
    xt = DataFrameImputer().fit_transform(X)

    print('before...')
    print(X)
    print('after...')
    print(xt)

which prints,

    before...
         0    1    2
    0    a    1    2
    1    b    1    1
    2    b    2    2
    3  NaN  NaN  NaN
    after...
       0         1         2
    0  a  1.000000  2.000000
    1  b  1.000000  1.000000
    2  b  2.000000  2.000000
    3  b  1.333333  1.666667
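The suggestion to treat integer columns differently could be sketched as follows. This is only one possible reading of it, not the answer's own code: the class name TypedDataFrameImputer and the sample columns are hypothetical. Because a pandas integer column that already contains NaN is stored as float, the sketch fits on complete training data and then fills gaps in a separate frame.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_integer_dtype, is_numeric_dtype
from sklearn.base import BaseEstimator, TransformerMixin


class TypedDataFrameImputer(TransformerMixin, BaseEstimator):
    """Hypothetical variant: mode for non-numeric columns,
    median for integer columns, mean for other numeric columns."""

    def fit(self, X, y=None):
        fills = {}
        for c in X.columns:
            col = X[c]
            if not is_numeric_dtype(col):
                fills[c] = col.mode().iloc[0]   # most frequent value
            elif is_integer_dtype(col):
                fills[c] = col.median()         # robust to outliers
            else:
                fills[c] = col.mean()
        self.fill_ = fills
        return self

    def transform(self, X):
        return X.fillna(self.fill_)


# Illustrative data; the column names are made up for this sketch.
train = pd.DataFrame({
    'kind': ['a', 'b', 'b', 'a', 'b'],
    'count': [1, 2, 2, 3, 10],            # int64, no missing values
    'score': [0.5, 1.5, 2.5, 3.5, 4.5],   # float64
})
test = pd.DataFrame({
    'kind': ['a', np.nan],
    'count': [4, np.nan],
    'score': [1.0, np.nan],
})

imputer = TypedDataFrameImputer().fit(train)
filled = imputer.transform(test)
print(filled)
```

Here the outlier 10 in the integer column pulls the mean up to 3.6, while the median stays at 2, which is the kind of case where the median may be the better fill value.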

You can use sklearn_pandas.CategoricalImputer for the categorical columns. Details:

First, (from the book Hands-On Machine Learning with Scikit-Learn and TensorFlow) you can have subpipelines for numerical and string/categorical features, where each subpipeline's first transformer is a selector that takes a list of column names (and the full_pipeline.fit_transform() call takes a pandas DataFrame):

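    from sklearn.base import BaseEstimator, TransformerMixin


    class DataFrameSelector(BaseEstimator, TransformerMixin):
        def __init__(self, attribute_names):
            self.attribute_names = attribute_names

        def fit(self, X, y=None):
            return self

        def transform(self, X):
            # select the named columns and return them as a NumPy array
            return X[self.attribute_names].values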

You can then combine these subpipelines with sklearn.pipeline.FeatureUnion, for example:

    full_pipeline = FeatureUnion(transformer_list=[
        ("num_pipeline", num_pipeline),
        ("cat_pipeline", cat_pipeline)
    ])

Now, in the num_pipeline you can simply use sklearn.preprocessing.Imputer(), but in the cat_pipeline you can use CategoricalImputer() from the sklearn_pandas package.

Note: the sklearn-pandas package can be installed with pip install sklearn-pandas, but it is imported as import sklearn_pandas
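The pieces described in this answer might fit together as in the sketch below. It makes one substitution that should be read as an assumption rather than as the answer's method: both imputers are replaced by sklearn.impute.SimpleImputer, since Imputer is no longer in sklearn.preprocessing in current scikit-learn releases and CategoricalImputer may not be available in your sklearn-pandas version. The column names and data are made up.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import FeatureUnion, Pipeline


class DataFrameSelector(TransformerMixin, BaseEstimator):
    """Select the named columns and pass them on as a NumPy array."""

    def __init__(self, attribute_names):
        self.attribute_names = attribute_names

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.attribute_names].values


# Illustrative data; 'area' and 'quality' are hypothetical column names.
df = pd.DataFrame({
    'area': [196.0, 380.0, np.nan, 350.0],
    'quality': pd.Series(['Gd', np.nan, 'Gd', 'TA'], dtype=object),
})

# Numeric subpipeline: fill gaps with the column mean.
num_pipeline = Pipeline([
    ('selector', DataFrameSelector(['area'])),
    ('imputer', SimpleImputer(strategy='mean')),
])
# Categorical subpipeline: fill gaps with the most frequent value.
cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(['quality'])),
    ('imputer', SimpleImputer(strategy='most_frequent')),
])

full_pipeline = FeatureUnion(transformer_list=[
    ('num_pipeline', num_pipeline),
    ('cat_pipeline', cat_pipeline),
])

# Columns come back in transformer order: numeric first, then categorical.
result = full_pipeline.fit_transform(df)
print(result)
```

Because the two halves have different types, the stacked result is an object array; wrap it in a DataFrame and cast columns back if you need typed output.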

Similar. Modify Imputer for strategy='most_frequent':

    import pandas as pd
    # Imputer was removed from sklearn.preprocessing in scikit-learn 0.22
    from sklearn.preprocessing import Imputer


    class GeneralImputer(Imputer):
        def __init__(self, **kwargs):
            Imputer.__init__(self, **kwargs)

        def fit(self, X, y=None):
            if self.strategy == 'most_frequent':
                # per-column mode, computed with pandas so strings are allowed
                self.fills = pd.DataFrame(X).mode(axis=0).squeeze()
                self.statistics_ = self.fills.values
                return self
            else:
                return Imputer.fit(self, X, y=y)

        def transform(self, X):
            if hasattr(self, 'fills'):
                return pd.DataFrame(X).fillna(self.fills).values.astype(str)
            else:
                return Imputer.transform(self, X)

where pandas.DataFrame.mode() finds the most frequent value for each column and then pandas.DataFrame.fillna() fills the missing values with these. Other strategy values are still handled the same way by Imputer.
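On scikit-learn versions that ship sklearn.impute.SimpleImputer, the same per-column mode filling may not need a subclass at all. The sketch below assumes such a version and reuses the small mixed-type frame from the DataFrameImputer answer; it is an alternative, not a drop-in equivalent of GeneralImputer.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

data = [
    ['a', 1, 2],
    ['b', 1, 1],
    ['b', 2, 2],
    [np.nan, np.nan, np.nan],
]
X = pd.DataFrame(data)

# most_frequent accepts string columns, unlike mean or median
imp = SimpleImputer(strategy='most_frequent')
filled = imp.fit_transform(X)

# fit_transform returns a NumPy array; rebuild the frame to keep the labels
xt = pd.DataFrame(filled, columns=X.columns, index=X.index)
print(xt)
```

Note that this applies the most frequent value to the numeric columns as well, so the last row is filled with 1 and 2 rather than with the column means.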