Complicated join in Spark: RDD elements have many key-value pairs
I'm new to Spark and trying to find a way to integrate information from one RDD into another, but their structures don't lend themselves to the standard join function.
I have one RDD of the format:
[{a:a1, b:b1, c:[1,2,3,4], d:d1}, {a:a2, b:b2, c:[5,6,7,8], d:d2}]
and another of the format:
[{1:x1},{2:x2},{3:x3},{4:x4},{5:x5},{6:x6},{7:x7},{8:x8}]
I want to match the values in the second RDD to the keys in the first RDD (which are in the list under the c key). I know how to manipulate them once they're matched, so I'm not too concerned about the exact final output, but I'd maybe like to see something like this:
[{a:a1, b:b1, c:[1,2,3,4],c0: [x1,x2,x3,x4], d:d1}, {a:a2, b:b2, c:[5,6,7,8],c0: [x5,x6,x7,x8], d:d2}]
or this:
[{a:a1, b:b1, c:[(1,x1),(2,x2),(3,x3),(4,x4)], d:d1}, {a:a2, b:b2, c:[(5,x5),(6,x6),(7,x7),(8,x8)], d:d2}]
or anything else where I can match the keys in the second RDD to the values in the first. I considered turning the second RDD into a dictionary, which I would know how to work with, but I think the data is too large for that.
Thank you very much, I appreciate it.
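To make the target output concrete, here is the transformation sketched with plain Scala collections, using the sample data above. (The lookup is shown as an in-memory Map purely for illustration; in reality it is too large to hold like this, which is the whole problem.)

```scala
// Sample data from the question, as plain Scala values.
case class Record(a: String, b: String, c: Seq[Int], d: String)

val records = Seq(
  Record("a1", "b1", Seq(1, 2, 3, 4), "d1"),
  Record("a2", "b2", Seq(5, 6, 7, 8), "d2")
)

// The second RDD, shown here as an in-memory map (too large in reality).
val lookup = Map(1 -> "x1", 2 -> "x2", 3 -> "x3", 4 -> "x4",
                 5 -> "x5", 6 -> "x6", 7 -> "x7", 8 -> "x8")

// Desired result: each record keeps its fields, with the values in c
// paired up with their matches from the lookup table.
val matched = records.map { r =>
  (r.a, r.b, r.c.flatMap(k => lookup.get(k).map(v => k -> v)), r.d)
}
// matched == Seq(
//   ("a1", "b1", Seq(1 -> "x1", 2 -> "x2", 3 -> "x3", 4 -> "x4"), "d1"),
//   ("a2", "b2", Seq(5 -> "x5", 6 -> "x6", 7 -> "x7", 8 -> "x8"), "d2"))
```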
A join after a flatMap, or a plain cartesian, causes many shuffles. One possible solution is to use cartesian after a groupBy with a HashPartitioner.
(Sorry for the Scala code.)
```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

// Your two input RDDs (types from the question).
val rdd0: RDD[(String, String, Seq[Int], String)] = ???
val rdd1: RDD[(Int, String)] = ???

val partitioner = new HashPartitioner(rdd0.partitions.size)

// Here is the point! Collapse rdd1 into one group per partition, so the
// cartesian below pairs each rdd0 record with a handful of groups rather
// than with every single (Int, String) element.
val grouped = rdd1.groupBy(partitioner.getPartition(_))

val result = rdd0.cartesian(grouped).map { case (left, (_, right)) =>
  // For each (record, group) pairing, resolve whatever subset of the
  // record's c values this group happens to contain.
  val map = right.toMap
  (left._1, left._2, left._4) -> left._3.flatMap(v => map.get(v).map(v -> _))
}.groupByKey().map { case (key, value) =>
  // Reassemble the partial matches from all groups into one list per record.
  (key._1, key._2, value.flatten.toSeq, key._3)
}
```
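To see what the function inside the cartesian is doing, here is that lookup step isolated with plain Scala collections. The two groups below are hypothetical stand-ins for the per-partition groups that groupBy(partitioner.getPartition(_)) produces.

```scala
// Two groups of (Int, String) pairs, standing in for the partitions
// produced by grouping rdd1 with the HashPartitioner.
val groups = Seq(
  Seq(1 -> "x1", 3 -> "x3", 5 -> "x5", 7 -> "x7"),
  Seq(2 -> "x2", 4 -> "x4", 6 -> "x6", 8 -> "x8")
)

// One record from rdd0.
val left = ("a1", "b1", Seq(1, 2, 3, 4), "d1")

// cartesian pairs `left` with every group; each pairing resolves only
// the subset of left._3 that this group contains.
val partial = groups.map { right =>
  val map = right.toMap
  (left._1, left._2, left._4) -> left._3.flatMap(v => map.get(v).map(x => v -> x))
}

// groupByKey + flatten then merges the partial matches back together.
val merged = partial.flatMap(_._2)
// merged == Seq(1 -> "x1", 3 -> "x3", 2 -> "x2", 4 -> "x4")
```

Note that no single group has to see all of left._3: each pairing contributes only its own matches, and the final groupByKey stitches them back into one list per record.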