Apache Spark - Why does dropna() not work?
System: Spark 1.3.0 (Anaconda Python dist.) on Cloudera QuickStart VM 5.4

Here's the Spark DataFrame:
from pyspark.sql import SQLContext
from pyspark.sql.types import *

sqlContext = SQLContext(sc)
data = sc.parallelize([('foo',41,'us',3),
                       ('foo',39,'uk',1),
                       ('bar',57,'ca',2),
                       ('bar',72,'ca',3),
                       ('baz',22,'us',6),
                       (None,75,None,7)])
schema = StructType([StructField('name', StringType(), True),
                     StructField('age', IntegerType(), True),
                     StructField('country', StringType(), True),
                     StructField('score', IntegerType(), True)])
df = sqlContext.createDataFrame(data, schema)
df.show()
name age country score
foo  41  us      3
foo  39  uk      1
bar  57  ca      2
bar  72  ca      3
baz  22  us      6
null 75  null    7
However, neither of these works!
df.dropna()
df.na.drop()
I get this error message:
>>> df.show()
name age country score
foo  41  us      3
foo  39  uk      1
bar  57  ca      2
bar  72  ca      3
baz  22  us      6
null 75  null    7

>>> df.dropna().show()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/spark/python/pyspark/sql/dataframe.py", line 580, in __getattr__
    jc = self._jdf.apply(name)
  File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o50.apply.
: org.apache.spark.sql.AnalysisException: Cannot resolve column name "dropna" among (name, age, country, score);
    at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:162)
    at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:162)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:161)
    at org.apache.spark.sql.DataFrame.col(DataFrame.scala:436)
    at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:426)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:207)
    at java.lang.Thread.run(Thread.java:745)
Has anyone else experienced this problem? What's the workaround? PySpark seems to think I am looking for a column called "na". Any help would be appreciated!
tl;dr: The methods na and dropna have only been available since Spark 1.3.1.

A few mistakes you made:

1. data = sc.parallelize([....('',75,'', 7 )]): you intended to use '' to represent None; however, it's a string instead of a null.

2. na and dropna are both methods on the DataFrame class; therefore, you should call them on your df.
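For context on the error itself: in Spark 1.3.0, DataFrame.__getattr__ falls back to a column lookup for any unknown attribute, so calling a method that doesn't exist in that version surfaces as "cannot resolve column name". Here is a toy sketch of that dispatch pattern (MiniFrame is a hypothetical class for illustration, not Spark's actual implementation, which raises AnalysisException on the JVM side):

```python
class MiniFrame:
    """Toy illustration of the __getattr__ fallback that turns
    unknown attributes into column lookups."""
    def __init__(self, columns):
        self.columns = columns

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails,
        # e.g. for a method that doesn't exist in this version.
        if name in self.columns:
            return f"Column<{name}>"
        raise AttributeError(
            f'cannot resolve column name "{name}" among {self.columns}')

mf = MiniFrame(['name', 'age', 'country', 'score'])
print(mf.age)     # Column<age>
# mf.dropna  ->  cannot resolve column name "dropna" among [...]
```

This is why the traceback mentions column resolution rather than a plain "method not found".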
Runnable code:

data = sc.parallelize([('foo',41,'us',3),
                       ('foo',39,'uk',1),
                       ('bar',57,'ca',2),
                       ('bar',72,'ca',3),
                       ('baz',22,'us',6),
                       (None, 75, None, 7)])
schema = StructType([StructField('name', StringType(), True),
                     StructField('age', IntegerType(), True),
                     StructField('country', StringType(), True),
                     StructField('score', IntegerType(), True)])
df = sqlContext.createDataFrame(data, schema)
df.dropna().show()
df.na.drop().show()
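To see why '' is not treated as a missing value, here is a plain-Python sketch (no Spark required) of what dropna() logically does: it drops rows containing null (None in Python), while empty strings are ordinary values and survive.

```python
rows = [('foo', 41, 'us', 3),
        ('baz', 22, '', 6),    # '' is an ordinary string, not a null
        (None, 75, None, 7)]   # None is what maps to a SQL null

# Rough equivalent of df.dropna(): drop any row containing None
kept = [r for r in rows if all(v is not None for v in r)]
print(kept)  # the ('baz', 22, '', 6) row survives; only the None row is dropped
```

If you do want empty strings treated as missing, replace them with None before building the DataFrame (or filter them explicitly).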