


pyspark collect_set or collect_list with groupby

How can I use collect_set or collect_list on a DataFrame after a groupby? For example: df.groupby('key').collect_set('values'). I get an error: AttributeError: 'GroupedData' object has no attribute 'collect_set'


You need to use agg. Example:

    from pyspark import SparkContext
    from pyspark.sql import HiveContext
    from pyspark.sql import functions as F

    sc = SparkContext("local")
    sqlContext = HiveContext(sc)

    df = sqlContext.createDataFrame([
        ("a", None, None),
        ("a", "code1", None),
        ("a", "code2", "name2"),
    ], ["id", "code", "name"])

    df.show()

    +---+-----+-----+
    | id| code| name|
    +---+-----+-----+
    |  a| null| null|
    |  a|code1| null|
    |  a|code2|name2|
    +---+-----+-----+

Note that in the example above you have to create a HiveContext. See https://.com/a/35529093/690430 for how this differs between Spark versions.

    (df
      .groupby("id")
      .agg(F.collect_set("code"),
           F.collect_list("name"))
      .show())

    +---+-----------------+------------------+
    | id|collect_set(code)|collect_list(name)|
    +---+-----------------+------------------+
    |  a|   [code1, code2]|           [name2]|
    +---+-----------------+------------------+
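
One detail in that output is easy to miss: both aggregates skip nulls, which is why collect_list(name) contains only name2 even though the group has three rows. The plain-Python sketch below mimics that behaviour on the same rows so it can be tried without Spark. The helper collect_by_key is hypothetical and not part of PySpark, and unlike Spark's collect_set, which gives no ordering guarantee, it keeps values in first-seen order.

```python
from collections import defaultdict


def collect_by_key(rows, key_index, value_index, distinct=False):
    """Hypothetical helper: group rows by one column and gather the
    non-null values of another column, roughly like collect_list
    (distinct=False) or collect_set (distinct=True)."""
    groups = defaultdict(list)
    for row in rows:
        key = row[key_index]
        value = row[value_index]
        collected = groups[key]  # creates the key even if every value is null
        if value is None:
            continue  # nulls are skipped, as in the Spark output
        if distinct and value in collected:
            continue  # drop duplicates for the set-like variant
        collected.append(value)
    return dict(groups)


# Same rows as the DataFrame in the answer: (id, code, name)
rows = [
    ("a", None, None),
    ("a", "code1", None),
    ("a", "code2", "name2"),
]

codes = collect_by_key(rows, 0, 1, distinct=True)
names = collect_by_key(rows, 0, 2)

print(codes)  # {'a': ['code1', 'code2']}
print(names)  # {'a': ['name2']}
```

This is only meant to make the semantics concrete; on a real DataFrame the agg call shown in the answer is the way to go.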