Removing duplicates from rows based on specific columns in an RDD/Spark DataFrame


Let's say we have a rather large dataset in the following form:

data = sc.parallelize([('foo', 41, 'us', 3),
                       ('foo', 39, 'uk', 1),
                       ('bar', 57, 'ca', 2),
                       ('bar', 72, 'ca', 2),
                       ('baz', 22, 'us', 6),
                       ('baz', 36, 'us', 6)])

What I would like to do is remove duplicate rows based on the values of the first, third, and fourth columns only.

Removing rows that are entirely duplicated is straightforward:

data = data.distinct() 

and either row 5 or row 6 will be removed.

But how do I remove duplicate rows based on columns 1, 3, and 4 only? I.e. remove one of these two:

('baz', 22, 'us', 6)
('baz', 36, 'us', 6)

In Python (pandas), this can be done by specifying the columns with .drop_duplicates(). How can I achieve the same in Spark/PySpark?
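For reference, this is roughly what that looks like in pandas (a minimal sketch; the column names c1 through c4 are hypothetical):

import pandas as pd

# Hypothetical pandas version: keep one row per unique (c1, c3, c4) combination.
df = pd.DataFrame(
    [('foo', 41, 'us', 3), ('foo', 39, 'uk', 1),
     ('bar', 57, 'ca', 2), ('bar', 72, 'ca', 2),
     ('baz', 22, 'us', 6), ('baz', 36, 'us', 6)],
    columns=['c1', 'c2', 'c3', 'c4'])
deduped = df.drop_duplicates(subset=['c1', 'c3', 'c4'])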

From your question, it is unclear which columns you want to use to determine duplicates. The general idea behind the solution is to create a key based on the values of the columns that identify duplicates. Then you can use the reduceByKey or reduce operations to eliminate duplicates.

Here is some code to get you started:

def get_key(x):
    return "{0}{1}{2}".format(x[0], x[2], x[3])

m = data.map(lambda x: (get_key(x), x))

Now you have a key-value RDD that is keyed by columns 1, 3, and 4. The next step would be either a reduceByKey or a groupByKey and filter. This would eliminate duplicates.

r = m.reduceByKey(lambda x, y: x)
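To finish the example, here is a minimal sketch (assuming the m and r variables from above) that strips the keys and keeps one row per (column 1, 3, 4) combination, along with the groupByKey alternative mentioned above:

# r has one (key, row) pair per distinct (column 1, 3, 4) key; strip the keys.
deduped = r.map(lambda kv: kv[1])
print(deduped.collect())

# Alternative mentioned above: groupByKey, then take the first row of each group.
deduped_alt = m.groupByKey().map(lambda kv: list(kv[1])[0])

If you are working with the DataFrame API instead of raw RDDs, dropDuplicates also accepts a subset of column names, e.g. df.dropDuplicates(['c1', 'c3', 'c4']), which avoids building keys by hand.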
