Apache Spark - Why does dropna() not work?


System: Spark 1.3.0 (Anaconda Python dist.) on Cloudera QuickStart VM 5.4

Here's my Spark DataFrame:

    from pyspark.sql import SQLContext
    from pyspark.sql.types import *
    sqlContext = SQLContext(sc)

    data = sc.parallelize([('foo', 41, 'us', 3),
                           ('foo', 39, 'uk', 1),
                           ('bar', 57, 'ca', 2),
                           ('bar', 72, 'ca', 3),
                           ('baz', 22, 'us', 6),
                           (None, 75, None, 7)])

    schema = StructType([StructField('name', StringType(), True),
                         StructField('age', IntegerType(), True),
                         StructField('country', StringType(), True),
                         StructField('score', IntegerType(), True)])

    df = sqlContext.createDataFrame(data, schema)

    df.show()

    name age country score
    foo  41  us      3
    foo  39  uk      1
    bar  57  ca      2
    bar  72  ca      3
    baz  22  us      6
    null 75  null    7

However, neither of these works:

    df.dropna()
    df.na.drop()

I get this message:

    >>> df.show()
    name age country score
    foo  41  us      3
    foo  39  uk      1
    bar  57  ca      2
    bar  72  ca      3
    baz  22  us      6
    null 75  null    7
    >>> df.dropna().show()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/lib/spark/python/pyspark/sql/dataframe.py", line 580, in __getattr__
        jc = self._jdf.apply(name)
      File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
      File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
    py4j.protocol.Py4JJavaError: An error occurred while calling o50.apply.
    : org.apache.spark.sql.AnalysisException: Cannot resolve column name "dropna" among (name, age, country, score);
        at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:162)
        at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:162)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:161)
        at org.apache.spark.sql.DataFrame.col(DataFrame.scala:436)
        at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:426)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
        at py4j.Gateway.invoke(Gateway.java:259)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:207)
        at java.lang.Thread.run(Thread.java:745)

Has anyone else experienced this problem? What's the workaround? PySpark seems to think that I am looking for a column called "na". Any help would be appreciated!

tl;dr: The methods na and dropna are only available since Spark 1.3.1.
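The traceback shows the call passing through `__getattr__` in dataframe.py, which suggests why a missing method is reported as a missing column: any attribute Python cannot find on the object is treated as a column name. The toy class below is a pure-Python sketch of that fallback, not PySpark's actual source; `ToyFrame` and its error text are hypothetical stand-ins.

```python
class ToyFrame(object):
    """Hypothetical stand-in for a DataFrame that has no dropna method."""

    def __init__(self, columns):
        self.columns = list(columns)

    def __getattr__(self, name):
        # Python calls __getattr__ only when normal attribute lookup fails,
        # so a missing method such as dropna ends up here as well.
        if name not in self.columns:
            raise AttributeError(
                'Cannot resolve column name "%s" among (%s)'
                % (name, ", ".join(self.columns)))
        return "Column<%s>" % name


toy = ToyFrame(["name", "age", "country", "score"])
print(toy.age)  # an existing column resolves normally

try:
    toy.dropna()  # no such method, so the name is looked up as a column
except AttributeError as exc:
    print(exc)
```

On a version where `dropna` is a real method, normal attribute lookup succeeds and the fallback is never reached.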

A few mistakes you made:

  1. data = sc.parallelize([....('',75,'', 7 )]): you intended to use '' to represent None; however, it is just an empty string, not a null.

  2. na and dropna are both methods on the DataFrame class; therefore, you should call them on your df.

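Runnable code:

    # Run in the pyspark shell (Spark 1.3.1 or later), where sc is already defined.
    from pyspark.sql import SQLContext
    from pyspark.sql.types import *
    sqlContext = SQLContext(sc)

    data = sc.parallelize([('foo', 41, 'us', 3),
                           ('foo', 39, 'uk', 1),
                           ('bar', 57, 'ca', 2),
                           ('bar', 72, 'ca', 3),
                           ('baz', 22, 'us', 6),
                           (None, 75, None, 7)])

    schema = StructType([StructField('name', StringType(), True),
                         StructField('age', IntegerType(), True),
                         StructField('country', StringType(), True),
                         StructField('score', IntegerType(), True)])

    df = sqlContext.createDataFrame(data, schema)

    # Both calls drop the row that contains nulls.
    df.dropna().show()
    df.na.drop().show()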
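If upgrading is not an option, one possible workaround on Spark 1.3.0 is to drop incomplete rows before the DataFrame is built. The sketch below is plain Python so it runs without Spark; `has_no_nulls` is a hypothetical helper, and the sample rows are taken from the question plus one row of empty strings to show that '' is kept while None is dropped.

```python
def has_no_nulls(row):
    """Return True if no field in the row is None."""
    return all(value is not None for value in row)


rows = [('foo', 41, 'us', 3),
        ('', 75, '', 7),        # empty strings are values, not nulls
        (None, 75, None, 7)]    # real nulls

clean_rows = [row for row in rows if has_no_nulls(row)]
print(clean_rows)

# In the pyspark shell the same predicate could be passed to RDD.filter
# before creating the DataFrame (assumes data, schema and sqlContext
# are defined as in the code above):
#   df = sqlContext.createDataFrame(data.filter(has_no_nulls), schema)
```

This only mimics the default behaviour of dropping a row when any field is null; it is not a replacement for the thresholds or column subsets that dropna offers in later versions.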
