scala - How to pivot DataFrame?


I am starting to use Spark DataFrames and need to be able to pivot the data to create multiple columns out of one column with multiple rows. There is built-in functionality for this in Scalding, and I believe in Pandas in Python, but I can't find anything for the new Spark DataFrame.

I assume I can write a custom function of some sort to do this, but I'm not even sure how to start, since I am a novice with Spark. If anyone knows how to do this with built-in functionality, or has suggestions for how to write something in Scala, it would be appreciated.

As mentioned by @user2000823, Spark provides a pivot function since version 1.6. The general syntax looks as follows:

df
  .groupBy(grouping_columns)
  .pivot(pivot_column, [values])
  .agg(aggregate_expressions)
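To make the semantics of that chain concrete, here is a minimal plain-Python sketch (independent of Spark, with made-up column names) of what `groupBy(...).pivot(...).agg(avg(...))` computes: one output row per group, one column per distinct pivot value, each cell holding the average for that (group, pivot) pair.

```python
from collections import defaultdict

def pivot_avg(rows, group_keys, pivot_key, value_key):
    """Sketch of groupBy(group_keys).pivot(pivot_key).agg(avg(value_key))."""
    sums = defaultdict(lambda: [0.0, 0])   # (group, pivot value) -> [sum, count]
    pivot_values = set()
    for row in rows:
        g = tuple(row[k] for k in group_keys)
        p = row[pivot_key]
        pivot_values.add(p)
        acc = sums[(g, p)]
        acc[0] += row[value_key]
        acc[1] += 1
    result = {}
    for (g, p), (s, n) in sums.items():
        result.setdefault(g, {})[p] = s / n
    # groups with no rows for some pivot value get null, as in Spark
    for g in result:
        for p in pivot_values:
            result[g].setdefault(p, None)
    return result

rows = [
    {"origin": "JFK", "hour": 5, "arr_delay": 10.0},
    {"origin": "JFK", "hour": 5, "arr_delay": 20.0},
    {"origin": "JFK", "hour": 6, "arr_delay": 30.0},
    {"origin": "LGA", "hour": 5, "arr_delay": 40.0},
]
pivoted = pivot_avg(rows, ("origin",), "hour", "arr_delay")
# pivoted[("JFK",)] -> {5: 15.0, 6: 30.0}; pivoted[("LGA",)] -> {5: 40.0, 6: None}
```

The distinct values of the pivot column (here, `hour`) become the new columns; that is exactly the shape the Spark examples below produce.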

Usage examples, using the nycflights13 data and spark-csv:

Python:

from pyspark.sql.functions import avg

flights = (sqlContext
    .read
    .format("com.databricks.spark.csv")
    .options(inferSchema="true", header="true")
    .load("flights.csv")
    .na.drop())

flights.registerTempTable("flights")
sqlContext.cacheTable("flights")

gexprs = ("origin", "dest", "carrier")
aggexpr = avg("arr_delay")

flights.count()
## 336776

%timeit -n10 flights.groupBy(*gexprs).pivot("hour").agg(aggexpr).count()
## 10 loops, best of 3: 1.03 s per loop

Scala:

val flights = sqlContext
  .read
  .format("com.databricks.spark.csv")
  .options(Map("inferSchema" -> "true", "header" -> "true"))
  .load("flights.csv")

flights
  .groupBy($"origin", $"dest", $"carrier")
  .pivot("hour")
  .agg(avg($"arr_delay"))

R:

library(magrittr)

flights <- read.df("flights.csv", source="csv", header="true", inferSchema="true")

flights %>%
  groupBy("origin", "dest", "carrier") %>%
  pivot("hour") %>%
  agg(avg(column("arr_delay")))

Performance considerations:

Generally speaking, pivoting is an expensive operation. If you can, try to provide the values list, since this avoids an extra scan over the data to compute the distinct values of the pivot column.
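Why the values list helps can be shown with a small plain-Python sketch (not Spark itself, just the logic): when the output columns are not known up front, the engine has to make a pass over the data just to discover the distinct pivot values before it can lay out the result; passing them in skips that pass, which in Spark is an extra distributed job.

```python
def pivot_columns(rows, pivot_key, values=None):
    """Return the output columns for a pivot. Without an explicit
    `values` list, a full pass over the data is needed first."""
    if values is None:
        # the extra scan that an explicit values list avoids
        values = sorted({row[pivot_key] for row in rows})
    return list(values)

rows = [{"hour": h} for h in (9, 5, 5, 6)]
cols_scanned = pivot_columns(rows, "hour")           # scans rows -> [5, 6, 9]
cols_given = pivot_columns(rows, "hour", [5, 6, 9])  # no scan needed
```

In Spark terms, `cols_given` corresponds to calling `pivot("hour", values)` with a known list, while `cols_scanned` corresponds to `pivot("hour")` with no values argument.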

