superponer - subplot python

Gráfico de contaje marino con eje y normalizado por grupo (4)

Me preguntaba si es posible crear un gráfico de recuento de Seaborn, pero en lugar de los recuentos reales en el eje y, muestre la frecuencia relativa (porcentaje) dentro de su grupo (según se especifica con el parámetro de hue ).

De alguna manera lo arreglé con el siguiente enfoque, pero no puedo imaginar que este sea el enfoque más fácil:

# Plot percentage of occupation per income class grouped = df.groupby([''income''], sort=False) occupation_counts = grouped[''occupation''].value_counts(normalize=True, sort=False) occupation_data = [ {''occupation'': occupation, ''income'': income, ''percentage'': percentage*100} for (income, occupation), percentage in dict(occupation_counts).items() ] df_occupation = pd.DataFrame(occupation_data) p = sns.barplot(x="occupation", y="percentage", hue="income", data=df_occupation) _ = plt.setp(p.get_xticklabels(), rotation=90) # Rotate labels

Resultado:

Estoy utilizando el conocido conjunto de datos para adultos del repositorio de aprendizaje automático de UCI . El marco de datos de los pandas se crea así:

# Read the adult dataset df = pd.read_csv( "data/adult.data", engine=''c'', lineterminator=''/n'', names=[''age'', ''workclass'', ''fnlwgt'', ''education'', ''education_num'', ''marital_status'', ''occupation'', ''relationship'', ''race'', ''sex'', ''capital_gain'', ''capital_loss'', ''hours_per_week'', ''native_country'', ''income''], header=None, skipinitialspace=True, na_values="?" )

Esta pregunta está relacionada, pero no hace uso del parámetro hue . Y en mi caso no puedo simplemente cambiar las etiquetas en el eje y, porque la altura de la barra debe depender del grupo.

Me sorprendió que Seaborn no proporcionara nada como esto fuera de la caja.

Aún así, fue bastante fácil modificar el código fuente para obtener lo que querías. El siguiente código, con la función "gráfico de porcentaje (x, tono, datos)" funciona igual que sns.countplot, pero regula cada barra por grupo (es decir, divide el valor de cada barra verde por la suma de todas las barras verdes)

En efecto, convierte esto (difícil de interpretar porque la N diferente de Apple contra Android): sns.countplot en esto (Normado de modo que las barras reflejen la proporción del total para Apple, frente a Android): Percentageplot

¡¡Espero que esto ayude!!

from seaborn.categorical import _CategoricalPlotter, remove_na import matplotlib as mpl class _CategoricalStatPlotter(_CategoricalPlotter): @property def nested_width(self): """A float with the width of plot elements when hue nesting is used.""" return self.width / len(self.hue_names) def estimate_statistic(self, estimator, ci, n_boot): if self.hue_names is None: statistic = [] confint = [] else: statistic = [[] for _ in self.plot_data] confint = [[] for _ in self.plot_data] for i, group_data in enumerate(self.plot_data): # Option 1: we have a single layer of grouping # -------------------------------------------- if self.plot_hues is None: if self.plot_units is None: stat_data = remove_na(group_data) unit_data = None else: unit_data = self.plot_units[i] have = pd.notnull(np.c_[group_data, unit_data]).all(axis=1) stat_data = group_data[have] unit_data = unit_data[have] # Estimate a statistic from the vector of data if not stat_data.size: statistic.append(np.nan) else: statistic.append(estimator(stat_data, len(np.concatenate(self.plot_data)))) # Get a confidence interval for this estimate if ci is not None: if stat_data.size < 2: confint.append([np.nan, np.nan]) continue boots = bootstrap(stat_data, func=estimator, n_boot=n_boot, units=unit_data) confint.append(utils.ci(boots, ci)) # Option 2: we are grouping by a hue layer # ---------------------------------------- else: for j, hue_level in enumerate(self.hue_names): if not self.plot_hues[i].size: statistic[i].append(np.nan) if ci is not None: confint[i].append((np.nan, np.nan)) continue hue_mask = self.plot_hues[i] == hue_level group_total_n = (np.concatenate(self.plot_hues) == hue_level).sum() if self.plot_units is None: stat_data = remove_na(group_data[hue_mask]) unit_data = None else: group_units = self.plot_units[i] have = pd.notnull( np.c_[group_data, group_units] ).all(axis=1) stat_data = group_data[hue_mask & have] unit_data = group_units[hue_mask & have] # Estimate a statistic from the vector of data if not stat_data.size: statistic[i].append(np.nan) else: statistic[i].append(estimator(stat_data, group_total_n)) # Get a confidence interval for this estimate if ci is not None: if stat_data.size < 2: confint[i].append([np.nan, np.nan]) continue boots = bootstrap(stat_data, func=estimator, n_boot=n_boot, units=unit_data) confint[i].append(utils.ci(boots, ci)) # Save the resulting values for plotting self.statistic = np.array(statistic) self.confint = np.array(confint) # Rename the value label to reflect the estimation if self.value_label is not None: self.value_label = "{}({})".format(estimator.__name__, self.value_label) def draw_confints(self, ax, at_group, confint, colors, errwidth=None, capsize=None, **kws): if errwidth is not None: kws.setdefault("lw", errwidth) else: kws.setdefault("lw", mpl.rcParams["lines.linewidth"] * 1.8) for at, (ci_low, ci_high), color in zip(at_group, confint, colors): if self.orient == "v": ax.plot([at, at], [ci_low, ci_high], color=color, **kws) if capsize is not None: ax.plot([at - capsize / 2, at + capsize / 2], [ci_low, ci_low], color=color, **kws) ax.plot([at - capsize / 2, at + capsize / 2], [ci_high, ci_high], color=color, **kws) else: ax.plot([ci_low, ci_high], [at, at], color=color, **kws) if capsize is not None: ax.plot([ci_low, ci_low], [at - capsize / 2, at + capsize / 2], color=color, **kws) ax.plot([ci_high, ci_high], [at - capsize / 2, at + capsize / 2], color=color, **kws) class _BarPlotter(_CategoricalStatPlotter): """Show point estimates and confidence intervals with bars.""" def __init__(self, x, y, hue, data, order, hue_order, estimator, ci, n_boot, units, orient, color, palette, saturation, errcolor, errwidth=None, capsize=None): """Initialize the plotter.""" self.establish_variables(x, y, hue, data, orient, order, hue_order, units) self.establish_colors(color, palette, saturation) self.estimate_statistic(estimator, ci, n_boot) self.errcolor = errcolor self.errwidth = errwidth self.capsize = capsize def draw_bars(self, ax, kws): """Draw the bars onto `ax`.""" # Get the right matplotlib function depending on the orientation barfunc = ax.bar if self.orient == "v" else ax.barh barpos = np.arange(len(self.statistic)) if self.plot_hues is None: # Draw the bars barfunc(barpos, self.statistic, self.width, color=self.colors, align="center", **kws) # Draw the confidence intervals errcolors = [self.errcolor] * len(barpos) self.draw_confints(ax, barpos, self.confint, errcolors, self.errwidth, self.capsize) else: for j, hue_level in enumerate(self.hue_names): # Draw the bars offpos = barpos + self.hue_offsets[j] barfunc(offpos, self.statistic[:, j], self.nested_width, color=self.colors[j], align="center", label=hue_level, **kws) # Draw the confidence intervals if self.confint.size: confint = self.confint[:, j] errcolors = [self.errcolor] * len(offpos) self.draw_confints(ax, offpos, confint, errcolors, self.errwidth, self.capsize) def plot(self, ax, bar_kws): """Make the plot.""" self.draw_bars(ax, bar_kws) self.annotate_axes(ax) if self.orient == "h": ax.invert_yaxis() def percentageplot(x=None, y=None, hue=None, data=None, order=None, hue_order=None, orient=None, color=None, palette=None, saturation=.75, ax=None, **kwargs): # Estimator calculates required statistic (proportion) estimator = lambda x, y: (float(len(x))/y)*100 ci = None n_boot = 0 units = None errcolor = None if x is None and y is not None: orient = "h" x = y elif y is None and x is not None: orient = "v" y = x elif x is not None and y is not None: raise TypeError("Cannot pass values for both `x` and `y`") else: raise TypeError("Must pass values for either `x` or `y`") plotter = _BarPlotter(x, y, hue, data, order, hue_order, estimator, ci, n_boot, units, orient, color, palette, saturation, errcolor) plotter.value_label = "Percentage" if ax is None: ax = plt.gca() plotter.plot(ax, kwargs) return ax

Podría estar confundido. La diferencia entre su salida y la salida de

occupation_counts = (df.groupby([''income''])[''occupation''] .value_counts(normalize=True) .rename(''percentage'') .mul(100) .reset_index() .sort_values(''occupation'')) p = sns.barplot(x="occupation", y="percentage", hue="income", data=occupation_counts) _ = plt.setp(p.get_xticklabels(), rotation=90) # Rotate labels

Es, me parece, sólo el orden de las columnas.

Y parece que eso te importa, ya que pasas sort=False . Pero luego, en su código, el orden se determina únicamente por casualidad (y el orden en el que se itera el diccionario, incluso cambia de ejecución a ejecución con Python 3.5).

Puede proporcionar estimadores para la altura de la barra (a lo largo del eje y) en un gráfico de conteo marítimo utilizando la palabra clave estimador.

ax = sns.barplot(x="x", y="x", data=df, estimator=lambda x: len(x) / len(df) * 100)

El fragmento de código anterior es de https://github.com/mwaskom/seaborn/issues/1027

Tienen una discusión completa sobre cómo proporcionar porcentajes en un gráfico de cuenta. Esta respuesta se basa en el mismo hilo vinculado anteriormente.

En el contexto de su problema específico, probablemente pueda hacer algo como esto:

ax = sb.barplot(x=''occupation'', y=''some_numeric_column'', data=raw_data, estimator=lambda x: len(x) / len(raw_data) * 100, hue=''income'') ax.set(ylabel="Percent")

El código anterior funcionó para mí (en un conjunto de datos diferente con atributos diferentes). Tenga en cuenta que necesita poner en alguna columna numérica para y si no, da un error: "ValueError: Ni la variable x ni y parece ser numérica".

Puede usar la biblioteca Dexplot para realizar el conteo, así como para normalizar sobre cualquier variable para obtener frecuencias relativas.

Pase el aggplot una cadena / variable categórica al parámetro agg y producirá automáticamente un gráfico de barras de los conteos de todos los valores únicos. Usa el hue para subdividir los conteos por otra variable. Observe que Dexplot ajusta automáticamente las etiquetas x-tick.

dxp.aggplot(agg=''occupation'', data=df, hue=''income'')

Use el parámetro normalize para normalizar los conteos sobre cualquier variable (o combinación de variables con una tupla). También puede usar "all" para normalizar sobre el total de cuentas.

dxp.aggplot(''occupation'', data=df, hue=''income'', normalize=''income'')