pyspark collect_set or collect_list with groupby
How can I use collect_set or collect_list on a DataFrame after groupby? For example:
df.groupby('key').collect_set('values')
I get an error:
AttributeError: 'GroupedData' object has no attribute 'collect_set'
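You need to use agg: collect_set and collect_list are aggregate functions in pyspark.sql.functions, not methods of GroupedData. Example: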
from pyspark import SparkContext
from pyspark.sql import HiveContext
from pyspark.sql import functions as F
sc = SparkContext("local")
sqlContext = HiveContext(sc)
df = sqlContext.createDataFrame([
    ("a", None, None),
    ("a", "code1", None),
    ("a", "code2", "name2"),
], ["id", "code", "name"])
df.show()
+---+-----+-----+
| id| code| name|
+---+-----+-----+
|  a| null| null|
|  a|code1| null|
|  a|code2|name2|
+---+-----+-----+
Note that in the example above you have to create a HiveContext. See https://.com/a/35529093/690430 for how to handle different Spark versions.
(df
    .groupby("id")
    .agg(F.collect_set("code"),
         F.collect_list("name"))
    .show())
+---+-----------------+------------------+
| id|collect_set(code)|collect_list(name)|
+---+-----------------+------------------+
|  a|   [code1, code2]|           [name2]|
+---+-----------------+------------------+
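The result above shows only [name2] for collect_list(name) even though the group has three rows, because both functions skip nulls; collect_set also drops duplicate values. The following is a plain-Python sketch of that behaviour on the same rows, not Spark's implementation. The helper name collect_by_key and its parameters are hypothetical and exist only for this illustration.

```python
from collections import defaultdict

# Same rows as the DataFrame above: (id, code, name); None stands in for null.
rows = [
    ("a", None, None),
    ("a", "code1", None),
    ("a", "code2", "name2"),
]


def collect_by_key(rows, key_idx, value_idx, distinct=False):
    """Hypothetical helper: group values by key, skipping None.

    distinct=False mimics collect_list; distinct=True mimics collect_set.
    """
    grouped = defaultdict(list)
    for row in rows:
        bucket = grouped[row[key_idx]]  # every key gets a (possibly empty) list
        value = row[value_idx]
        if value is None:
            continue  # nulls are not collected
        if distinct and value in bucket:
            continue  # keep a single copy of each value
        bucket.append(value)
    return dict(grouped)


codes = collect_by_key(rows, 0, 1, distinct=True)
names = collect_by_key(rows, 0, 2)
print(codes)  # {'a': ['code1', 'code2']}
print(names)  # {'a': ['name2']}
```

One difference to keep in mind: this sketch keeps values in first-seen order, whereas Spark gives no ordering guarantee for the elements returned by collect_set.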