Removing duplicates from rows based on specific columns in an RDD/Spark DataFrame
Let's say I have a rather large dataset in the following form:
data = sc.parallelize([('foo', 41, 'us', 3), ('foo', 39, 'uk', 1),
                       ('bar', 57, 'ca', 2), ('bar', 72, 'ca', 2),
                       ('baz', 22, 'us', 6), ('baz', 36, 'us', 6)])
What I would like to do is remove duplicate rows based on the values of the first, third and fourth columns only.
Removing entirely duplicate rows is straightforward:
data = data.distinct()
and either row 5 or row 6 will be removed.
But how do I remove duplicate rows based on columns 1, 3 and 4 only? I.e. remove either one of these two:
('baz', 22, 'us', 6)
('baz', 36, 'us', 6)
In Python (pandas) this can be done by specifying the columns in .drop_duplicates(). How can I achieve the same in Spark/PySpark?
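For comparison, this is roughly what I would do in pandas (a minimal sketch; the integer column labels are just the defaults you get when building a frame from a list of tuples):

import pandas as pd

df = pd.DataFrame([('foo', 41, 'us', 3), ('foo', 39, 'uk', 1),
                   ('bar', 57, 'ca', 2), ('bar', 72, 'ca', 2),
                   ('baz', 22, 'us', 6), ('baz', 36, 'us', 6)])

# keep one row per combination of columns 0, 2 and 3
df = df.drop_duplicates(subset=[0, 2, 3])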
From your question, it is unclear which columns you want to use to determine the duplicates. The general idea behind the solution is to create a key based on the values of the columns that identify duplicates. Then you can use the reduceByKey or reduce operations to eliminate the duplicates.
Here is some code to get you started:
def get_key(x):
    # build a key from columns 1, 3 and 4 (a separator avoids accidental collisions)
    return "{0}|{1}|{2}".format(x[0], x[2], x[3])

m = data.map(lambda x: (get_key(x), x))
Now you have a key-value RDD that is keyed by columns 1, 3 and 4. The next step would be either a reduceByKey, or a groupByKey and a filter. This would eliminate the duplicates.
r = m.reduceByKey(lambda x, y: x)
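To get plain rows back (and to show the groupByKey variant), a minimal sketch building on m and r from above could look like this; deduped and first_per_group are just illustrative names:

# reduceByKey keeps one row per key; drop the key again to recover the original tuples
deduped = r.map(lambda kv: kv[1])
print(deduped.collect())

# alternative: group all rows sharing a key and keep only the first row of each group
first_per_group = m.groupByKey().map(lambda kv: list(kv[1])[0])

If you work with DataFrames instead of RDDs, dropDuplicates accepts a subset of column names, which is the closest analogue to pandas' drop_duplicates (assuming a SparkSession/SQLContext is available; columns from a tuple RDD default to _1, _2, _3, _4):

df = data.toDF()
df = df.dropDuplicates(['_1', '_3', '_4'])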