scala - How to pivot DataFrame?
I am starting to use Spark DataFrames and I need to be able to pivot the data to create multiple columns out of 1 column with multiple rows. There is built-in functionality for this in Scalding, and I believe in Pandas in Python, but I can't find anything for the new Spark DataFrame.
I assume I can write a custom function of some sort that will do this, but I'm not even sure how to start, especially since I am a novice with Spark. If anyone knows how to do this with built-in functionality, or has suggestions on how to write something in Scala, it would be greatly appreciated.
As mentioned by @user2000823, Spark provides a pivot function since version 1.6. The general syntax looks as follows:
    df
      .groupBy(grouping_columns)
      .pivot(pivot_column, [values])
      .agg(aggregate_expressions)
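To make the reshaping concrete, here is a minimal sketch of what a pivot does, written with plain Scala collections (no Spark needed). The names `FlightRow` and `pivotLong` are illustrative, not Spark API: distinct values of the pivot column (`hour`) become columns, and each (group, pivot) cell is aggregated, here with an average.

```scala
// One "long" record per observation; hypothetical example type.
case class FlightRow(origin: String, hour: Int, delay: Double)

// Turn each distinct `hour` value into its own column, averaging `delay`
// per (origin, hour) cell -- the same reshaping that
// .groupBy("origin").pivot("hour").agg(avg("delay")) performs on a DataFrame.
def pivotLong(rows: Seq[FlightRow]): Map[String, Map[Int, Double]] =
  rows
    .groupBy(_.origin)                        // grouping column
    .map { case (origin, rs) =>
      origin -> rs
        .groupBy(_.hour)                      // pivot column
        .map { case (h, cell) =>              // aggregate each cell
          h -> cell.map(_.delay).sum / cell.size
        }
    }

val sample = Seq(
  FlightRow("JFK", 9, 10.0),
  FlightRow("JFK", 9, 20.0),
  FlightRow("JFK", 10, 5.0)
)
// pivotLong(sample)("JFK") == Map(9 -> 15.0, 10 -> 5.0)
```

The same three-step shape (group, pivot, aggregate) is what the DataFrame API expresses in the syntax above.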
Usage examples using nycflights13 and spark-csv:
Python:

    from pyspark.sql.functions import avg

    flights = (sqlContext
        .read
        .format("com.databricks.spark.csv")
        .options(inferSchema="true", header="true")
        .load("flights.csv")
        .na.drop())

    flights.registerTempTable("flights")
    sqlContext.cacheTable("flights")

    gexprs = ("origin", "dest", "carrier")
    aggexpr = avg("arr_delay")

    flights.count()
    ## 336776

    %timeit -n10 flights.groupBy(*gexprs).pivot("hour").agg(aggexpr).count()
    ## 10 loops, best of 3: 1.03 s per loop
Scala:

    val flights = sqlContext
      .read
      .format("com.databricks.spark.csv")
      .options(Map("inferSchema" -> "true", "header" -> "true"))
      .load("flights.csv")

    flights
      .groupBy($"origin", $"dest", $"carrier")
      .pivot("hour")
      .agg(avg($"arr_delay"))
R:

    library(magrittr)

    flights <- read.df("flights.csv", source = "csv", header = TRUE, inferSchema = TRUE)

    flights %>%
      groupBy("origin", "dest", "carrier") %>%
      pivot("hour") %>%
      agg(avg(column("arr_delay")))
Performance considerations:

Generally speaking, pivoting is an expensive operation.
If you can, try to provide a values list, since this lets Spark skip the extra job otherwise needed to compute the distinct values of the pivot column:

    vs = list(range(25))
    %timeit -n10 flights.groupBy(*gexprs).pivot("hour", vs).agg(aggexpr).count()
    ## 10 loops, best of 3: 392 ms per loop
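Supplying the values list also fixes the output schema up front: listed values with no matching rows become null cells, and rows whose pivot value is not in the list are silently dropped. A toy illustration of that behaviour with plain Scala collections (not Spark's implementation):

```scala
// Each tuple is (origin, hour, delay); hour 30 is a bogus value that a
// fixed values list of 0..24 will exclude from the output.
val rows = Seq(("JFK", 9, 12.0), ("JFK", 30, 99.0))
val values = (0 until 25).toList

// One cell per listed pivot value; None plays the role of a null column.
val cells: Map[Int, Option[Double]] =
  values.map { h =>
    val hits = rows.collect { case (_, `h`, d) => d }
    h -> hits.headOption                  // no data -> None (null in Spark)
  }.toMap

// cells(9) == Some(12.0); cells(10) == None; hour 30 never appears
```

Because the schema is known immediately, Spark can plan the pivot in a single pass over the data.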
In some cases it proved to be beneficial to repartition and / or pre-aggregate the data.

For reshaping only, you can use first: Pivot String column on Pyspark Dataframe
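When the cell values are strings rather than numbers, `first` is the natural aggregate: each (group, pivot) cell is assumed to hold at most one value, and `first` simply picks it, so the pivot is pure reshaping with no arithmetic. A plain-Scala sketch of that long-to-wide reshaping (the names `Rec` and `reshape` are illustrative, not Spark API):

```scala
// Long-format record: one (id, key, value) triple per attribute.
case class Rec(id: Int, key: String, value: String)

// Reshape long -> wide by taking the first value per (id, key) cell,
// mirroring .groupBy("id").pivot("key").agg(first("value")).
def reshape(recs: Seq[Rec]): Map[Int, Map[String, String]] =
  recs
    .groupBy(_.id)
    .map { case (id, rs) =>
      id -> rs.groupBy(_.key).map { case (k, cell) => k -> cell.head.value }
    }

val long = Seq(Rec(1, "name", "alice"), Rec(1, "city", "NYC"))
// reshape(long)(1) == Map("name" -> "alice", "city" -> "NYC")
```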