Removing duplicates from rows based on specific columns in an RDD/Spark DataFrame
Let's say I have a rather large dataset in the following form:
data = sc.parallelize([('foo', 41, 'us', 3), ('foo', 39, 'uk', 1),
                       ('bar', 57, 'ca', 2), ('bar', 72, 'ca', 2),
                       ('baz', 22, 'us', 6), ('baz', 36, 'us', 6)])
What I would like to do is remove duplicate rows based on the values of the first, third and fourth columns only.
Removing entirely duplicate rows is straightforward:
data = data.distinct()
and either row 5 or row 6 will be removed.
But how do I remove duplicate rows based on columns 1, 3 and 4 only? I.e. remove either one of these two:
('baz', 22, 'us', 6)
('baz', 36, 'us', 6)
In Python (pandas) this can be done by specifying the columns in .drop_duplicates(). How can I achieve the same in Spark/PySpark?
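For comparison, this is roughly what I would do in pandas (a minimal sketch; the integer column labels are just the defaults you get when building a frame from a list of tuples):

import pandas as pd

df = pd.DataFrame([('foo', 41, 'us', 3), ('foo', 39, 'uk', 1),
                   ('bar', 57, 'ca', 2), ('bar', 72, 'ca', 2),
                   ('baz', 22, 'us', 6), ('baz', 36, 'us', 6)])

# keep one row per combination of columns 0, 2 and 3
df = df.drop_duplicates(subset=[0, 2, 3])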
From your question, it is unclear which columns you want to use to determine the duplicates. The general idea behind the solution is to create a key based on the values of the columns that identify duplicates. Then you can use the reduceByKey or reduce operations to eliminate the duplicates.
Here is some code to get you started:
def get_key(x):
    # build a key from columns 1, 3 and 4 (a separator avoids accidental collisions)
    return "{0}|{1}|{2}".format(x[0], x[2], x[3])

m = data.map(lambda x: (get_key(x), x))
Now you have a key-value RDD that is keyed by columns 1, 3 and 4. The next step would be either a reduceByKey, or a groupByKey and a filter. This would eliminate the duplicates.
r = m.reduceByKey(lambda x, y: x)
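To get plain rows back (and to show the groupByKey variant), a minimal sketch building on m and r from above could look like this; deduped and first_per_group are just illustrative names:

# reduceByKey keeps one row per key; drop the key again to recover the original tuples
deduped = r.map(lambda kv: kv[1])
print(deduped.collect())

# alternative: group all rows sharing a key and keep only the first row of each group
first_per_group = m.groupByKey().map(lambda kv: list(kv[1])[0])

If you work with DataFrames instead of RDDs, dropDuplicates accepts a subset of column names, which is the closest analogue to pandas' drop_duplicates (assuming a SparkSession/SQLContext is available; columns from a tuple RDD default to _1, _2, _3, _4):

df = data.toDF()
df = df.dropDuplicates(['_1', '_3', '_4'])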