Complicated join in Spark: RDD elements have many key-value pairs


I'm new to Spark, and I'm trying to find a way to integrate information from one RDD into another, but their structures don't lend themselves to the standard join function.

I have one RDD of the format:

[{a:a1, b:b1, c:[1,2,3,4], d:d1},  {a:a2, b:b2, c:[5,6,7,8], d:d2}] 

and another of the format:

[{1:x1},{2:x2},{3:x3},{4:x4},{5:x5},{6:x6},{7:x7},{8:x8}] 

I want to match the values in the second RDD with the keys in the first RDD (which are in the list value under the c key). I know how to manipulate them once they're matched, so I'm not too concerned with the final output, but I'd maybe like to see something like this:

[{a:a1, b:b1, c:[1,2,3,4],c0: [x1,x2,x3,x4], d:d1},  {a:a2, b:b2, c:[5,6,7,8],c0: [x5,x6,x7,x8], d:d2}] 

or this:

[{a:a1, b:b1, c:[(1,x1),(2,x2),(3,x3),(4,x4)], d:d1},  {a:a2, b:b2, c:[(5,x5),(6,x6),(7,x7),(8,x8)], d:d2}] 

or anything else that matches the keys in the second RDD with the values in the first. I considered collecting the second RDD into a dictionary, which I would know how to work with, but I think the data is too large for that.

Thanks so much, I appreciate it.

A join after a flatMap, or a cartesian, makes for many shuffles.

One possible solution is to use cartesian after a groupBy with a HashPartitioner.

(Sorry, Scala code.)

val rdd0: RDD[(String, String, Seq[Int], String)]
val rdd1: RDD[(Int, String)]

val partitioner = new HashPartitioner(rdd0.partitions.size)

// Here is the point!
val grouped = rdd1.groupBy(partitioner.getPartition(_))

val result = rdd0.cartesian(grouped).map { case (left, (_, right)) =>
    val map = right.toMap
    (left._1, left._2, left._4) -> left._3.flatMap(v => map.get(v).map(v -> _))
}.groupByKey().map { case (key, value) =>
    (key._1, key._2, value.flatten.toSeq, key._3)
}
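For comparison, the shape of the flatMap-then-join approach mentioned above can be sketched on local Scala collections (no Spark; the sample records mirror the question's data, and the lookup-by-map stands in for the RDD join, which only works locally because the data here is tiny):

```scala
// Records shaped like the first RDD: (a, b, c, d) with c a list of keys.
val rdd0 = Seq(
  ("a1", "b1", Seq(1, 2, 3, 4), "d1"),
  ("a2", "b2", Seq(5, 6, 7, 8), "d2")
)
// Pairs shaped like the second RDD: key -> value.
val rdd1 = Seq(1 -> "x1", 2 -> "x2", 3 -> "x3", 4 -> "x4",
               5 -> "x5", 6 -> "x6", 7 -> "x7", 8 -> "x8")

// Explode each record into one (key, record) pair per element of c
// (this is the flatMap step; in Spark each pair is then shuffled by key).
val exploded = rdd0.flatMap { case rec @ (_, _, c, _) => c.map(k => k -> rec) }

// Join each exploded pair with its value from the second collection,
// keeping the matched (key, value) pair alongside the original record.
val lookup = rdd1.toMap
val joined = exploded.flatMap { case (k, rec) =>
  lookup.get(k).map(v => rec -> (k -> v))
}

// Regroup by the original record and collect the matched pairs,
// yielding the second desired output: c as a list of (key, value) tuples.
val result = joined.groupBy(_._1).map { case ((a, b, _, d), pairs) =>
  (a, b, pairs.map(_._2), d)
}

result.foreach(println)
```

Each of these local steps has a direct Spark counterpart (flatMap, join, groupByKey), but both the join and the regrouping shuffle the full exploded data set, which is what the partitioner-based answer avoids.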
