python - Add a new column to a DataFrame based on a dictionary
Since score is a dictionary (so its keys are unique), we can use MultiIndex alignment:
df = df.set_index(['gender', 'age', 'cholesterol', 'smoke'])
df['score'] = pd.Series(score)  # Assign values based on the tuple
# Note: the downcast keyword is deprecated in pandas 2.x; fillna(0) followed by astype(int) is an alternative
df = df.fillna(0, downcast='infer').reset_index()  # Back to columns
gender age cholesterol smoke score
0 1 13 1 0 0
1 1 45 2 0 0
2 0 1 2 1 5
3 1 45 1 1 4
4 1 15 1 7 0
5 0 16 1 8 0
6 0 16 1 3 0
7 0 16 1 4 0
8 1 15 1 4 0
9 0 15 1 2 0
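To make the alignment step more concrete, here is a minimal sketch on a three-row frame. The small frame and the names indexed and result are illustrative assumptions, not part of the original answer. It shows that pd.Series(score) turns the tuple keys into a four-level MultiIndex, and that the assignment then matches rows by index rather than by position.

```python
import pandas as pd

score = {(1, 45, 1, 1): 4, (0, 1, 2, 1): 5}

# Hypothetical three-row frame, used only for illustration
df = pd.DataFrame({
    "gender": [1, 0, 1],
    "age": [45, 1, 13],
    "cholesterol": [1, 2, 1],
    "smoke": [1, 1, 0],
})

s = pd.Series(score)
# The tuple keys of the dict become the levels of a MultiIndex
print(type(s.index).__name__, s.index.nlevels)

indexed = df.set_index(["gender", "age", "cholesterol", "smoke"])
indexed["score"] = s  # aligned on the index; rows without a key become NaN

result = indexed.reset_index()
result["score"] = result["score"].fillna(0).astype("int64")
print(result)
```

This variant uses fillna(0) followed by astype instead of the downcast keyword, which is one way to get integer scores without relying on downcasting.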
I have a DataFrame and a dictionary. I need to add a new column to the DataFrame and compute its values from the dictionary.
This is for machine learning: adding new features based on a lookup table:
import numpy as np
import pandas as pd

score = {(1, 45, 1, 1): 4, (0, 1, 2, 1): 5}
df = pd.DataFrame(data={
    'gender': [1, 1, 0, 1, 1, 0, 0, 0, 1, 0],
    'age': [13, 45, 1, 45, 15, 16, 16, 16, 15, 15],
    'cholesterol': [1, 2, 2, 1, 1, 1, 1, 1, 1, 1],
    'smoke': [0, 0, 1, 1, 7, 8, 3, 4, 4, 2]},
    dtype=np.int64)
print(df, '\n')
df['score'] = 0
# My attempt: this raises TypeError, because a tuple of Series is not a hashable dict key
df.score = score[(df.gender, df.age, df.cholesterol, df.smoke)]
print(df)
I expect the following output:
gender age cholesterol smoke score
0 1 13 1 0 0
1 1 45 2 0 0
2 0 1 2 1 5
3 1 45 1 1 4
4 1 15 1 7 0
5 0 16 1 8 0
6 0 16 1 3 0
7 0 16 1 4 0
8 1 15 1 4 0
9 0 15 1 2 0
Building the tuples with zip and looking them up with map:
df['score'] = (pd.Series(zip(df.gender, df.age, df.cholesterol, df.smoke))
               .map(score)
               .fillna(0)
               .astype(int)
               )
Output:
gender age cholesterol smoke score
0 1 13 1 0 0
1 1 45 2 0 0
2 0 1 2 1 5
3 1 45 1 1 4
4 1 15 1 7 0
5 0 16 1 8 0
6 0 16 1 3 0
7 0 16 1 4 0
8 1 15 1 4 0
9 0 15 1 2 0
You could use map, since score is a dictionary:
df['score'] = df[['gender', 'age', 'cholesterol', 'smoke']].apply(tuple, axis=1).map(score).fillna(0)
print(df)
Output:
gender age cholesterol smoke score
0 1 13 1 0 0.0
1 1 45 2 0 0.0
2 0 1 2 1 5.0
3 1 45 1 1 4.0
4 1 15 1 7 0.0
5 0 16 1 8 0.0
6 0 16 1 3 0.0
7 0 16 1 4 0.0
8 1 15 1 4 0.0
9 0 15 1 2 0.0
Alternatively, you could use a list comprehension:
df['score'] = [score.get(t, 0) for t in zip(df.gender, df.age, df.cholesterol, df.smoke)]
print(df)
Another option is to use .loc[]:
m = df.set_index(df.columns.tolist())
m.loc[list(score.keys())].assign(
    score=list(score.values())).reindex(m.index, fill_value=0).reset_index()
gender age cholesterol smoke score
0 1 13 1 0 0
1 1 45 2 0 0
2 0 1 2 1 5
3 1 45 1 1 4
4 1 15 1 7 0
5 0 16 1 8 0
6 0 16 1 3 0
7 0 16 1 4 0
8 1 15 1 4 0
9 0 15 1 2 0
A simple one-line solution: use get and tuple row-wise:
df['score'] = df.apply(lambda x: score.get(tuple(x), 0), axis=1)
The solution above assumes the DataFrame contains only the desired columns, in the same order as the dictionary keys. If it does not, select just those columns:
cols = ['gender', 'age', 'cholesterol', 'smoke']
df['score'] = df[cols].apply(lambda x: score.get(tuple(x), 0), axis=1)
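The column-order assumption can be illustrated with a small sketch. The extra patient_id column and the two-row frame are hypothetical, added only to show the effect: with an extra column, every row tuple has five elements and never matches a four-element key, so every score falls back to 0.

```python
import pandas as pd

score = {(1, 45, 1, 1): 4, (0, 1, 2, 1): 5}

# Hypothetical frame with one extra column (patient_id) for illustration
df = pd.DataFrame({
    "patient_id": [10, 11],
    "gender": [1, 0],
    "age": [45, 1],
    "cholesterol": [1, 2],
    "smoke": [1, 1],
})
cols = ["gender", "age", "cholesterol", "smoke"]

# Whole rows produce 5-element tuples, which are never keys of score
whole_row = df.apply(lambda x: score.get(tuple(x), 0), axis=1)

# Restricting to the key columns produces tuples that match the dict keys
selected = df[cols].apply(lambda x: score.get(tuple(x), 0), axis=1)

print(whole_row.tolist())
print(selected.tolist())
```

Listing the columns explicitly also protects against a frame whose columns are in a different order from the dictionary keys.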
Using assign with a list comprehension: build a tuple from the values of each row and look it up in the score dictionary, defaulting to zero if it is not found.
>>> df.assign(score=[score.get(tuple(row), 0) for row in df.values])
gender age cholesterol smoke score
0 1 13 1 0 0
1 1 45 2 0 0
2 0 1 2 1 5
3 1 45 1 1 4
4 1 15 1 7 0
5 0 16 1 8 0
6 0 16 1 3 0
7 0 16 1 4 0
8 1 15 1 4 0
9 0 15 1 2 0
Timings
Given the variety of approaches, I thought it would be interesting to compare some of the timings.
# Initial dataframe 100k rows (10 rows of identical data replicated 10k times).
df = pd.DataFrame(data={
    'gender': [1, 1, 0, 1, 1, 0, 0, 0, 1, 0] * 10000,
    'age': [13, 45, 1, 45, 15, 16, 16, 16, 15, 15] * 10000,
    'cholesterol': [1, 2, 2, 1, 1, 1, 1, 1, 1, 1] * 10000,
    'smoke': [0, 0, 1, 1, 7, 8, 3, 4, 4, 2] * 10000},
    dtype=np.int64)
%timeit -n 10 df.assign(score=[score.get(tuple(v), 0) for v in df.values])
# 223 ms ± 9.28 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit -n 10
df.assign(score=[score.get(t, 0) for t in zip(*map(df.get, df))])
# 76.8 ms ± 2.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit -n 10
df.assign(score=[score.get(v, 0) for v in df.itertuples(index=False)])
# 113 ms ± 2.58 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit -n 10 df.assign(score=df.apply(lambda x: score.get(tuple(x), 0), axis=1))
# 1.84 s ± 77.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit -n 10
(df
 .set_index(['gender', 'age', 'cholesterol', 'smoke'])
 .assign(score=pd.Series(score))
 .fillna(0, downcast='infer')
 .reset_index()
)
# 138 ms ± 11.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit -n 10
s = pd.Series(score)
s.index.names = ['gender', 'age', 'cholesterol', 'smoke']
df.merge(s.to_frame('score').reset_index(), how='left').fillna(0).astype(int)
# 24 ms ± 2.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit -n 10
df.assign(score=pd.Series(zip(df.gender, df.age, df.cholesterol, df.smoke))
.map(score)
.fillna(0)
.astype(int))
# 191 ms ± 7.54 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit -n 10
df.assign(score=df[['gender', 'age', 'cholesterol', 'smoke']]
          .apply(tuple, axis=1)
          .map(score)
          .fillna(0))
# 1.95 s ± 134 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
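The %timeit lines above need IPython or Jupyter. Outside a notebook, a rough comparison could be run with the standard-library timeit module, as in the sketch below. The function names with_zip and with_merge, the 10,000-row frame size, and the repeat counts are assumptions chosen for illustration; absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

score = {(1, 45, 1, 1): 4, (0, 1, 2, 1): 5}
cols = ["gender", "age", "cholesterol", "smoke"]

# Smaller than the 100k-row frame above so the sketch runs quickly
df = pd.DataFrame(data={
    "gender": [1, 1, 0, 1, 1, 0, 0, 0, 1, 0] * 1000,
    "age": [13, 45, 1, 45, 15, 16, 16, 16, 15, 15] * 1000,
    "cholesterol": [1, 2, 2, 1, 1, 1, 1, 1, 1, 1] * 1000,
    "smoke": [0, 0, 1, 1, 7, 8, 3, 4, 4, 2] * 1000},
    dtype=np.int64)


def with_zip(frame):
    """List comprehension over zipped columns with dict.get."""
    keys = zip(frame.gender, frame.age, frame.cholesterol, frame.smoke)
    return frame.assign(score=[score.get(t, 0) for t in keys])


def with_merge(frame):
    """Left merge against the dictionary converted to a frame."""
    s = pd.Series(score)
    s.index.names = cols
    lookup = s.to_frame("score").reset_index()
    return frame.merge(lookup, how="left").fillna(0).astype(int)


for name, fn in [("zip + get", with_zip), ("merge", with_merge)]:
    runs = timeit.repeat(lambda: fn(df), number=3, repeat=3)
    # Report the best run, divided by the number of calls per run
    print(f"{name}: {min(runs) / 3 * 1000:.1f} ms per call")
```

Taking the minimum of several repeats is a common way to reduce noise from other processes; it is not the same statistic as the mean and standard deviation that %timeit reports.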
Using reindex:
df['score'] = pd.Series(score).reindex(pd.MultiIndex.from_frame(df), fill_value=0).values
df
Out[173]:
gender age cholesterol smoke score
0 1 13 1 0 0
1 1 45 2 0 0
2 0 1 2 1 5
3 1 45 1 1 4
4 1 15 1 7 0
5 0 16 1 8 0
6 0 16 1 3 0
7 0 16 1 4 0
8 1 15 1 4 0
9 0 15 1 2 0
Or merge:
s = pd.Series(score)
s.index.names = ['gender', 'age', 'cholesterol', 'smoke']
df = df.merge(s.to_frame('score').reset_index(), how='left').fillna(0)
Out[166]:
gender age cholesterol smoke score
0 1 13 1 0 0.0
1 1 45 2 0 0.0
2 0 1 2 1 5.0
3 1 45 1 1 4.0
4 1 15 1 7 0.0
5 0 16 1 8 0.0
6 0 16 1 3 0.0
7 0 16 1 4 0.0
8 1 15 1 4 0.0
9 0 15 1 2 0.0
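The 0.0 and 5.0 values in the merge output come from the left join: rows without a match get NaN, which forces a float column. A minimal sketch of casting back to integers follows; the three-row frame and the names merged and fixed are illustrative assumptions.

```python
import pandas as pd

score = {(1, 45, 1, 1): 4, (0, 1, 2, 1): 5}
cols = ["gender", "age", "cholesterol", "smoke"]

# Hypothetical three-row frame; the middle row has no entry in score
df = pd.DataFrame({
    "gender": [1, 1, 0],
    "age": [45, 13, 1],
    "cholesterol": [1, 1, 2],
    "smoke": [1, 0, 1],
})

s = pd.Series(score)
s.index.names = cols
merged = df.merge(s.to_frame("score").reset_index(), how="left")
print(merged["score"].dtype)  # float, because the unmatched row holds NaN

# Fill the gaps, then cast only the score column back to integers
fixed = merged.fillna({"score": 0}).astype({"score": "int64"})
print(fixed)
```

Casting only the score column leaves the dtypes of the other columns untouched, which matters if the frame also holds non-integer features.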