python pandas dataframe dictionary

python - Add a new column to a DataFrame based on a dictionary




Since score is a dictionary (so its keys are unique), we can use MultiIndex alignment:

    df = df.set_index(['gender', 'age', 'cholesterol', 'smoke'])
    df['score'] = pd.Series(score)  # Assign values based on the tuple
    df = df.fillna(0, downcast='infer').reset_index()  # Back to columns

       gender  age  cholesterol  smoke  score
    0       1   13            1      0      0
    1       1   45            2      0      0
    2       0    1            2      1      5
    3       1   45            1      1      4
    4       1   15            1      7      0
    5       0   16            1      8      0
    6       0   16            1      3      0
    7       0   16            1      4      0
    8       1   15            1      4      0
    9       0   15            1      2      0
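The alignment step above can be sketched as a self-contained script. This is a minimal illustration, not the answer's exact code: the names `keys`, `lookup`, `indexed`, and `result` are introduced here, and the missing values are filled on the `score` column only, with an explicit integer cast instead of the `downcast` argument.

```python
import pandas as pd

score = {(1, 45, 1, 1): 4, (0, 1, 2, 1): 5}
df = pd.DataFrame({
    'gender':      [1, 1, 0, 1, 1, 0, 0, 0, 1, 0],
    'age':         [13, 45, 1, 45, 15, 16, 16, 16, 15, 15],
    'cholesterol': [1, 2, 2, 1, 1, 1, 1, 1, 1, 1],
    'smoke':       [0, 0, 1, 1, 7, 8, 3, 4, 4, 2],
})
keys = ['gender', 'age', 'cholesterol', 'smoke']  # illustrative name

# A Series built from a tuple-keyed dict gets a MultiIndex with one level per tuple position.
lookup = pd.Series(score)
print(type(lookup.index).__name__, lookup.index.nlevels)

# Move the key columns into the index, then let pandas align the lookup on it.
indexed = df.set_index(keys)
indexed['score'] = lookup            # rows without a matching key become NaN
result = indexed.reset_index()
result['score'] = result['score'].fillna(0).astype(int)
print(result)
```

Filling only the `score` column keeps the cast from touching the other columns.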

I have a DataFrame and a dictionary. I need to add a new column to the DataFrame and compute its values from the dictionary.

This is for machine learning: adding new features based on a lookup table:

    import numpy as np
    import pandas as pd

    score = {(1, 45, 1, 1): 4, (0, 1, 2, 1): 5}
    df = pd.DataFrame(data={
        'gender':      [1, 1, 0, 1, 1, 0, 0, 0, 1, 0],
        'age':         [13, 45, 1, 45, 15, 16, 16, 16, 15, 15],
        'cholesterol': [1, 2, 2, 1, 1, 1, 1, 1, 1, 1],
        'smoke':       [0, 0, 1, 1, 7, 8, 3, 4, 4, 2]},
        dtype=np.int64)
    print(df, '\n')
    df['score'] = 0
    # Attempted lookup; this line does not work as written
    df.score = score[(df.gender, df.age, df.cholesterol, df.smoke)]
    print(df)

I expect the following result:

       gender  age  cholesterol  smoke  score
    0       1   13            1      0      0
    1       1   45            2      0      0
    2       0    1            2      1      5
    3       1   45            1      1      4
    4       1   15            1      7      0
    5       0   16            1      8      0
    6       0   16            1      3      0
    7       0   16            1      4      0
    8       1   15            1      4      0
    9       0   15            1      2      0
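For context, the lookup in the question fails because a dictionary key must be hashable, and a tuple of whole columns is not: pandas Series objects refuse to be hashed. The sketch below reproduces the error and then looks up a single row; `cols`, `error`, and `row_score` are names introduced here for illustration.

```python
import pandas as pd

score = {(1, 45, 1, 1): 4, (0, 1, 2, 1): 5}
df = pd.DataFrame({
    'gender':      [1, 1, 0, 1, 1, 0, 0, 0, 1, 0],
    'age':         [13, 45, 1, 45, 15, 16, 16, 16, 15, 15],
    'cholesterol': [1, 2, 2, 1, 1, 1, 1, 1, 1, 1],
    'smoke':       [0, 0, 1, 1, 7, 8, 3, 4, 4, 2],
})
cols = ['gender', 'age', 'cholesterol', 'smoke']  # illustrative name

# Looking up a tuple of Series hashes each Series, which raises TypeError.
error = None
try:
    score[(df.gender, df.age, df.cholesterol, df.smoke)]
except TypeError as exc:
    error = exc
    print('lookup failed:', exc)

# The dictionary expects one tuple of scalars per row, for example row 3.
row_score = score.get(tuple(df.loc[3, cols]), 0)
print(row_score)
```

So the task reduces to building one tuple per row and looking each tuple up in `score`.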


List comprehension and map:

    df['score'] = (pd.Series(zip(df.gender, df.age, df.cholesterol, df.smoke))
                     .map(score)
                     .fillna(0)
                     .astype(int)
                  )

Output:

       gender  age  cholesterol  smoke  score
    0       1   13            1      0      0
    1       1   45            2      0      0
    2       0    1            2      1      5
    3       1   45            1      1      4
    4       1   15            1      7      0
    5       0   16            1      8      0
    6       0   16            1      3      0
    7       0   16            1      4      0
    8       1   15            1      4      0
    9       0   15            1      2      0


You could use map, since score is a dictionary:

    df['score'] = df[['gender', 'age', 'cholesterol', 'smoke']].apply(tuple, axis=1).map(score).fillna(0)
    print(df)

Output

       gender  age  cholesterol  smoke  score
    0       1   13            1      0    0.0
    1       1   45            2      0    0.0
    2       0    1            2      1    5.0
    3       1   45            1      1    4.0
    4       1   15            1      7    0.0
    5       0   16            1      8    0.0
    6       0   16            1      3    0.0
    7       0   16            1      4    0.0
    8       1   15            1      4    0.0
    9       0   15            1      2    0.0

As an alternative, you could use a list comprehension:

    df['score'] = [score.get(t, 0) for t in zip(df.gender, df.age, df.cholesterol, df.smoke)]
    print(df)
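One practical difference between the two variants above is the resulting dtype. The sketch below compares them; `row_keys`, `via_map`, and `via_get` are illustrative names. With map, unmatched rows pass through NaN, so the column is floating point unless it is cast afterwards, while `dict.get` with a default of 0 never produces NaN.

```python
import pandas as pd

score = {(1, 45, 1, 1): 4, (0, 1, 2, 1): 5}
df = pd.DataFrame({
    'gender':      [1, 1, 0, 1, 1, 0, 0, 0, 1, 0],
    'age':         [13, 45, 1, 45, 15, 16, 16, 16, 15, 15],
    'cholesterol': [1, 2, 2, 1, 1, 1, 1, 1, 1, 1],
    'smoke':       [0, 0, 1, 1, 7, 8, 3, 4, 4, 2],
})

# One tuple of scalars per row, in the same order as the dictionary keys.
row_keys = list(zip(df.gender, df.age, df.cholesterol, df.smoke))

# map leaves NaN for missing keys; fillna(0) then yields a float column.
via_map = pd.Series(row_keys, index=df.index).map(score).fillna(0)

# dict.get supplies the default directly, so the values stay integers.
via_get = pd.Series([score.get(t, 0) for t in row_keys], index=df.index)

print(via_map.dtype, via_get.dtype)
```

If an integer column is wanted from the map variant, a final `.astype(int)` does the conversion.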


Another way could be to use .loc[]:

    m = df.set_index(df.columns.tolist())
    (m.loc[list(score.keys())]
      .assign(score=list(score.values()))
      .reindex(m.index, fill_value=0)
      .reset_index())

       gender  age  cholesterol  smoke  score
    0       1   13            1      0      0
    1       1   45            2      0      0
    2       0    1            2      1      5
    3       1   45            1      1      4
    4       1   15            1      7      0
    5       0   16            1      8      0
    6       0   16            1      3      0
    7       0   16            1      4      0
    8       1   15            1      4      0
    9       0   15            1      2      0


A simple one-line solution: use get and tuple row-wise.

    df['score'] = df.apply(lambda x: score.get(tuple(x), 0), axis=1)

The solution above assumes the DataFrame contains only the desired columns, in the same order as the dictionary keys. If it does not, select the columns explicitly:

    cols = ['gender', 'age', 'cholesterol', 'smoke']
    df['score'] = df[cols].apply(lambda x: score.get(tuple(x), 0), axis=1)
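To illustrate why the column selection matters, here is a small sketch with an extra column. The `patient_id` column is hypothetical and not part of the original data. With it present, every row tuple has five elements, so no key in `score` can match.

```python
import pandas as pd

score = {(1, 45, 1, 1): 4, (0, 1, 2, 1): 5}
df = pd.DataFrame({
    'patient_id':  list(range(100, 110)),   # hypothetical extra column
    'gender':      [1, 1, 0, 1, 1, 0, 0, 0, 1, 0],
    'age':         [13, 45, 1, 45, 15, 16, 16, 16, 15, 15],
    'cholesterol': [1, 2, 2, 1, 1, 1, 1, 1, 1, 1],
    'smoke':       [0, 0, 1, 1, 7, 8, 3, 4, 4, 2],
})
cols = ['gender', 'age', 'cholesterol', 'smoke']

# Using every column: the 5-element tuples never match the 4-element keys.
naive = df.apply(lambda row: score.get(tuple(row), 0), axis=1)

# Restricting to the key columns, in key order, restores the matches.
selected = df[cols].apply(lambda row: score.get(tuple(row), 0), axis=1)

print(naive.tolist())
print(selected.tolist())
```

The failure in the first case is silent: no error is raised, every row simply falls back to the default.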


Using assign with a list comprehension: build a tuple of values from each row and look it up in the score dictionary, defaulting to zero if it is not found.

    >>> df.assign(score=[score.get(tuple(row), 0) for row in df.values])
       gender  age  cholesterol  smoke  score
    0       1   13            1      0      0
    1       1   45            2      0      0
    2       0    1            2      1      5
    3       1   45            1      1      4
    4       1   15            1      7      0
    5       0   16            1      8      0
    6       0   16            1      3      0
    7       0   16            1      4      0
    8       1   15            1      4      0
    9       0   15            1      2      0

Timings

Given the variety of approaches, I thought it would be interesting to compare some of the timings.

    # Initial dataframe 100k rows (10 rows of identical data replicated 10k times).
    df = pd.DataFrame(data={
        'gender':      [1, 1, 0, 1, 1, 0, 0, 0, 1, 0] * 10000,
        'age':         [13, 45, 1, 45, 15, 16, 16, 16, 15, 15] * 10000,
        'cholesterol': [1, 2, 2, 1, 1, 1, 1, 1, 1, 1] * 10000,
        'smoke':       [0, 0, 1, 1, 7, 8, 3, 4, 4, 2] * 10000},
        dtype=np.int64)

    %timeit -n 10 df.assign(score=[score.get(tuple(v), 0) for v in df.values])
    # 223 ms ± 9.28 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

    %%timeit -n 10
    df.assign(score=[score.get(t, 0) for t in zip(*map(df.get, df))])
    # 76.8 ms ± 2.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

    %%timeit -n 10
    df.assign(score=[score.get(v, 0) for v in df.itertuples(index=False)])
    # 113 ms ± 2.58 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

    %timeit -n 10 df.assign(score=df.apply(lambda x: score.get(tuple(x), 0), axis=1))
    # 1.84 s ± 77.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

    %%timeit -n 10
    (df
     .set_index(['gender', 'age', 'cholesterol', 'smoke'])
     .assign(score=pd.Series(score))
     .fillna(0, downcast='infer')
     .reset_index()
    )
    # 138 ms ± 11.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

    %%timeit -n 10
    s = pd.Series(score)
    s.index.names = ['gender', 'age', 'cholesterol', 'smoke']
    df.merge(s.to_frame('score').reset_index(), how='left').fillna(0).astype(int)
    # 24 ms ± 2.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

    %%timeit -n 10
    df.assign(score=pd.Series(zip(df.gender, df.age, df.cholesterol, df.smoke))
              .map(score)
              .fillna(0)
              .astype(int))
    # 191 ms ± 7.54 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

    %%timeit -n 10
    df.assign(score=df[['gender', 'age', 'cholesterol', 'smoke']]
              .apply(tuple, axis=1)
              .map(score)
              .fillna(0))
    # 1.95 s ± 134 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
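The timings above use IPython magics. As a rough sketch of how a comparison like this could be reproduced in a plain script, the standard `timeit` module can be used instead. The function names, the smaller frame size, and the repeat counts below are choices made for this illustration, and the absolute numbers will differ by machine and pandas version.

```python
import timeit
import pandas as pd

score = {(1, 45, 1, 1): 4, (0, 1, 2, 1): 5}
base = {
    'gender':      [1, 1, 0, 1, 1, 0, 0, 0, 1, 0],
    'age':         [13, 45, 1, 45, 15, 16, 16, 16, 15, 15],
    'cholesterol': [1, 2, 2, 1, 1, 1, 1, 1, 1, 1],
    'smoke':       [0, 0, 1, 1, 7, 8, 3, 4, 4, 2],
}
keys = list(base)
# 10k rows here (assumption: kept smaller than the 100k above for quick runs).
df = pd.DataFrame({name: values * 1000 for name, values in base.items()})

def with_zip(frame):
    # Column-wise zip, then one dict lookup per row.
    return [score.get(t, 0) for t in zip(*(frame[c] for c in keys))]

def with_itertuples(frame):
    # name=None makes itertuples yield plain tuples.
    return [score.get(t, 0) for t in frame[keys].itertuples(index=False, name=None)]

def with_merge(frame):
    # Turn the dict into a small lookup frame and left-join on the key columns.
    s = pd.Series(score)
    s.index.names = keys
    merged = frame.merge(s.to_frame('score').reset_index(), how='left', on=keys)
    return merged['score'].fillna(0).astype(int).tolist()

candidates = {'zip': with_zip, 'itertuples': with_itertuples, 'merge': with_merge}

# Check that the approaches agree before timing them.
results = {name: fn(df) for name, fn in candidates.items()}

for name, fn in candidates.items():
    best = min(timeit.repeat(lambda: fn(df), number=3, repeat=3)) / 3
    print(f'{name:>10}: {best * 1000:.2f} ms per call')
```

Comparing the outputs first guards against timing an approach that is fast but returns something different.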


reindex

    df['score'] = pd.Series(score).reindex(pd.MultiIndex.from_frame(df), fill_value=0).values
    df

       gender  age  cholesterol  smoke  score
    0       1   13            1      0      0
    1       1   45            2      0      0
    2       0    1            2      1      5
    3       1   45            1      1      4
    4       1   15            1      7      0
    5       0   16            1      8      0
    6       0   16            1      3      0
    7       0   16            1      4      0
    8       1   15            1      4      0
    9       0   15            1      2      0

Or merge

    s = pd.Series(score)
    s.index.names = ['gender', 'age', 'cholesterol', 'smoke']
    df = df.merge(s.to_frame('score').reset_index(), how='left').fillna(0)
    df

       gender  age  cholesterol  smoke  score
    0       1   13            1      0    0.0
    1       1   45            2      0    0.0
    2       0    1            2      1    5.0
    3       1   45            1      1    4.0
    4       1   15            1      7    0.0
    5       0   16            1      8    0.0
    6       0   16            1      3    0.0
    7       0   16            1      4    0.0
    8       1   15            1      4    0.0
    9       0   15            1      2    0.0
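The merge above joins on whatever column names the two frames share. A slightly more defensive variant, sketched below, names the join columns explicitly and asks pandas to check that each key appears at most once in the lookup table. The names `keys`, `lookup`, and `out` are illustrative, and the integer cast is added here to avoid the float column shown above.

```python
import pandas as pd

score = {(1, 45, 1, 1): 4, (0, 1, 2, 1): 5}
df = pd.DataFrame({
    'gender':      [1, 1, 0, 1, 1, 0, 0, 0, 1, 0],
    'age':         [13, 45, 1, 45, 15, 16, 16, 16, 15, 15],
    'cholesterol': [1, 2, 2, 1, 1, 1, 1, 1, 1, 1],
    'smoke':       [0, 0, 1, 1, 7, 8, 3, 4, 4, 2],
})
keys = ['gender', 'age', 'cholesterol', 'smoke']

# One row per dictionary entry: the four key values followed by the score.
lookup = pd.DataFrame(
    [(*key, value) for key, value in score.items()],
    columns=keys + ['score'],
)

# Left join on the named key columns; validate raises if lookup keys repeat.
out = df.merge(lookup, how='left', on=keys, validate='many_to_one')
out['score'] = out['score'].fillna(0).astype(int)
print(out)
```

Because the keys come from a dictionary they are already unique, so the `validate` check is a safeguard for lookup tables built some other way.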