
python - Sklearn PolynomialFeatures preprocessing - How to keep the column names/headers of the output array/dataframe




Working example, all in one line (I assume "readability" is not the goal here):

target_feature_names = ['x'.join(['{}^{}'.format(pair[0], pair[1]) for pair in tuple if pair[1] != 0]) for tuple in [zip(input_df.columns, p) for p in poly.powers_]]
output_df = pd.DataFrame(output_nparray, columns=target_feature_names)
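Unpacked into separate steps, the same logic reads roughly as follows (a sketch that assumes the poly, input_df and output_nparray objects defined in the question below; the intermediate variable names are illustrative only):

# one row of exponents in poly.powers_ per output column, paired with the input column names
pairs_per_output = [zip(input_df.columns, p) for p in poly.powers_]

# build a label such as "a^1xb^2" for each output column, skipping features with exponent 0;
# the all-zero bias row produces an empty label ''
target_feature_names = ['x'.join('{}^{}'.format(name, exp) for name, exp in pairs if exp != 0)
                        for pairs in pairs_per_output]

output_df = pd.DataFrame(output_nparray, columns=target_feature_names)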

TLDR: How do I get headers for the output numpy array from the sklearn.preprocessing.PolynomialFeatures() function?

Say I have the following code...

import pandas as pd
import numpy as np
from sklearn import preprocessing as pp

a = np.ones(3)
b = np.ones(3) * 2
c = np.ones(3) * 3

input_df = pd.DataFrame([a, b, c])
input_df = input_df.T
input_df.columns = ['a', 'b', 'c']

input_df
   a  b  c
0  1  2  3
1  1  2  3
2  1  2  3

poly = pp.PolynomialFeatures(2)
output_nparray = poly.fit_transform(input_df)
print output_nparray
[[ 1.  1.  2.  3.  1.  2.  3.  4.  6.  9.]
 [ 1.  1.  2.  3.  1.  2.  3.  4.  6.  9.]
 [ 1.  1.  2.  3.  1.  2.  3.  4.  6.  9.]]

How can I get that 3x10 matrix / output_nparray to carry the a, b, c labels in a way that reflects how they relate to the data above?
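For reference, both answers below build the labels from the transformer's powers_ attribute, which holds one row per output column with the exponent applied to each input feature. A quick look, assuming the poly and input_df objects above (the exact row order shown here is what matches the output in the question; it may differ across scikit-learn versions):

>>> poly.powers_
array([[0, 0, 0],
       [1, 0, 0],
       [0, 1, 0],
       [0, 0, 1],
       [2, 0, 0],
       [1, 1, 0],
       [1, 0, 1],
       [0, 2, 0],
       [0, 1, 1],
       [0, 0, 2]])

The all-zero first row is the bias column, and a row such as [1, 0, 1] corresponds to the a^1 * c^1 column.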


This works:

def PolynomialFeatures_labeled(input_df, power):
    '''Basically this is a cover for the sklearn preprocessing function.
    The problem with that function is that if you give it a labeled dataframe, it outputs
    an unlabeled dataframe with potentially a whole bunch of unlabeled columns.

    Inputs:
    input_df = Your labeled pandas dataframe (list of x's not raised to any power)
    power = what order polynomial you want variables up to (use the same power as you
            would pass into pp.PolynomialFeatures(power) directly)

    Output: This function relies on the powers_ matrix, which is one of the preprocessing
    function's outputs, to create logical labels, and outputs a labeled pandas dataframe
    '''
    poly = pp.PolynomialFeatures(power)
    output_nparray = poly.fit_transform(input_df)
    powers_nparray = poly.powers_

    input_feature_names = list(input_df.columns)
    target_feature_names = ["Constant Term"]
    for feature_distillation in powers_nparray[1:]:
        intermediary_label = ""
        final_label = ""
        for i in range(len(input_feature_names)):
            if feature_distillation[i] == 0:
                continue
            else:
                variable = input_feature_names[i]
                power = feature_distillation[i]
                intermediary_label = "%s^%d" % (variable, power)
                if final_label == "":               # if the final label isn't yet specified
                    final_label = intermediary_label
                else:
                    final_label = final_label + " x " + intermediary_label
        target_feature_names.append(final_label)
    output_df = pd.DataFrame(output_nparray, columns=target_feature_names)
    return output_df

output_df = PolynomialFeatures_labeled(input_df, 2)
output_df
   Constant Term  a^1  b^1  c^1  a^2  a^1 x b^1  a^1 x c^1  b^2  b^1 x c^1  c^2
0              1    1    2    3    1          2          3    4          6    9
1              1    1    2    3    1          2          3    4          6    9
2              1    1    2    3    1          2          3    4          6    9


scikit-learn 0.18 added a nifty get_feature_names() method!

>>> input_df.columns
Index(['a', 'b', 'c'], dtype='object')
>>> poly.fit_transform(input_df)
array([[ 1.,  1.,  2.,  3.,  1.,  2.,  3.,  4.,  6.,  9.],
       [ 1.,  1.,  2.,  3.,  1.,  2.,  3.,  4.,  6.,  9.],
       [ 1.,  1.,  2.,  3.,  1.,  2.,  3.,  4.,  6.,  9.]])
>>> poly.get_feature_names(input_df.columns)
['1', 'a', 'b', 'c', 'a^2', 'a b', 'a c', 'b^2', 'b c', 'c^2']

Note that you have to give it the column names, since sklearn does not read them off the DataFrame on its own.
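Putting it together, one way to get the labeled dataframe the question asks for (a sketch, not part of the original answer) is:

labeled_df = pd.DataFrame(poly.fit_transform(input_df),
                          columns=poly.get_feature_names(input_df.columns))

If you call poly.get_feature_names() without arguments, you get generic names such as 'x0', 'x1', 'x2' instead. (In much newer scikit-learn releases this method has been replaced by get_feature_names_out().)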