python - score - ¿Cómo usar el filtro de pandas con IQR?

remove outliers python pandas (3)

Esto le dará el subconjunto de df que se encuentra en el IQR de la columna column :

def subset_by_iqr(df, column, whisker_width=1.5): """Remove outliers from a dataframe by column, including optional whiskers, removing rows for which the column value are less than Q1-1.5IQR or greater than Q3+1.5IQR. Args: df (`:obj:pd.DataFrame`): A pandas dataframe to subset column (str): Name of the column to calculate the subset from. whisker_width (float): Optional, loosen the IQR filter by a factor of `whisker_width` * IQR. Returns: (`:obj:pd.DataFrame`): Filtered dataframe """ # Calculate Q1, Q2 and IQR q1 = df[column].quantile(0.25) q3 = df[column].quantile(0.75) iqr = q3 - q1 # Apply filter with respect to IQR, including optional whiskers filter = (df[column] >= q1 - whisker_width*iqr) & (df[column] <= q3 + whisker_width*iqr) return df.loc[filter] # Example for whiskers = 1.5, as requested by the OP df_filtered = subset_by_iqr(df, ''column_name'', whisker_width=1.5)

¿Existe una forma integrada de filtrar en una columna por IQR (es decir, valores entre Q1-1.5IQR y Q3 + 1.5IQR)? además, se apreciará cualquier otro posible filtrado generalizado en pandas sugerido.

Otro enfoque utilizando Series.between ():

iqr = df[''col''][df[''col''].between(df[''col''].quantile(.25), df[''col''].quantile(.75), inclusive=True)]

Sacado:

q1 = df[''col''].quantile(.25) q3 = df[''col''].quantile(.75) mask = d[''col''].between(q1, q2, inclusive=True) iqr = d.loc[mask, ''col'']

Que yo sepa, la notación más compacta parece ser traída por el método de query .

# Some test data np.random.seed(33454) df = ( # A standard distribution pd.DataFrame({''nb'': np.random.randint(0, 100, 20)}) # Adding some outliers .append(pd.DataFrame({''nb'': np.random.randint(100, 200, 2)})) # Reseting the index .reset_index(drop=True) ) # Computing IQR Q1 = df[''nb''].quantile(0.25) Q3 = df[''nb''].quantile(0.75) IQR = Q3 - Q1 # Filtering Values between Q1-1.5IQR and Q3+1.5IQR filtered = df.query(''(@Q1 - 1.5 * @IQR) <= nb <= (@Q3 + 1.5 * @IQR)'')

Luego podemos trazar el resultado para comprobar la diferencia. Observamos que el valor atípico en el diagrama de caja izquierdo (la cruz en 183) ya no aparece en la serie filtrada.

# Ploting the result to check the difference df.join(filtered, rsuffix=''_filtered'').boxplot()

Desde esta respuesta, he escrito una post sobre este tema donde puede encontrar más información.