scala - GC overhead limit exceeded with large RDD[MatrixEntry] in Apache Spark


I have a CSV file storing user-item data of dimension 6,365 x 214, and I am finding the user-user similarity using columnSimilarities() of org.apache.spark.mllib.linalg.distributed.CoordinateMatrix.

My code looks like this:

import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.linalg.distributed.{RowMatrix, MatrixEntry, CoordinateMatrix}
import org.apache.spark.rdd.RDD

def rddToCoordinateMatrix(input_rdd: RDD[String]): CoordinateMatrix = {

    // Convert RDD[String] to RDD[Tuple3]
    val coo_matrix_input: RDD[Tuple3[Long, Long, Double]] = input_rdd.map(
        line => line.split(',').toList
    ).map {
        e => (e(0).toLong, e(1).toLong, e(2).toDouble)
    }

    // Convert RDD[Tuple3] to RDD[MatrixEntry]
    val coo_matrix_matrixEntry: RDD[MatrixEntry] = coo_matrix_input.map(e => MatrixEntry(e._1, e._2, e._3))

    // Convert RDD[MatrixEntry] to CoordinateMatrix
    val coo_matrix: CoordinateMatrix = new CoordinateMatrix(coo_matrix_matrixEntry)

    return coo_matrix
}

// Read CSV file into RDD[String]
val input_rdd: RDD[String] = sc.textFile("user_item.csv")

// Convert RDD[String] to CoordinateMatrix
val coo_matrix = rddToCoordinateMatrix(input_rdd)

// Transpose CoordinateMatrix
val coo_matrix_trans = coo_matrix.transpose()

// Convert CoordinateMatrix to RowMatrix
val mat: RowMatrix = coo_matrix_trans.toRowMatrix()

// Compute similar columns perfectly, with brute force
// Returns a CoordinateMatrix
val simsPerfect: CoordinateMatrix = mat.columnSimilarities()

// CoordinateMatrix to RDD[MatrixEntry]
val simsPerfect_entries = simsPerfect.entries

simsPerfect_entries.count()

// Write results to file
val results_rdd = simsPerfect_entries.map(line => line.i + "," + line.j + "," + line.value)

results_rdd.saveAsTextFile("similarity-output")

// Close REPL terminal
System.exit(0)

And when I run this script on spark-shell, I get the following error after the line simsPerfect_entries.count() is executed:

java.lang.OutOfMemoryError: GC overhead limit exceeded

Updated:

I tried many of the solutions suggested by others, but with no success (a sketch of how I supplied these settings follows the list):

1. Increasing the amount of memory to use per executor process: spark.executor.memory=1g

2. Decreasing the number of cores used by the driver process: spark.driver.cores=1
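For reference, a minimal sketch of how I supplied these settings, either on the spark-shell command line or through a SparkConf when building a standalone context (the app name is illustrative; inside spark-shell the sc context already exists):

// On the command line:
//   spark-shell --conf spark.executor.memory=1g --conf spark.driver.cores=1
//
// Or programmatically, in a standalone application:
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("user-item-similarity")   // illustrative name, not from the original script
  .set("spark.executor.memory", "1g")   // memory per executor process
  .set("spark.driver.cores", "1")       // cores used by the driver process
val sc = new SparkContext(conf)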

Please suggest a way to resolve this issue.

All Spark transformations are lazy until you actually materialize them. When you define RDD-to-RDD data manipulations, Spark chains the operations together without performing the actual computation. So when you call simsPerfect_entries.count(), the whole chain of operations is executed and you get your number.
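A minimal sketch of that laziness, assuming the same CSV layout as in your question (the value names here are illustrative):

// Each of these lines only builds up the lineage; nothing is read or computed yet.
val lines   = sc.textFile("user_item.csv")
val fields  = lines.map(_.split(','))
val ratings = fields.map(e => e(2).toDouble)

// Only the action forces Spark to execute the whole chain of transformations.
val n = ratings.count()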

The error GC overhead limit exceeded means that JVM garbage collector activity was so high that execution of your code was stopped. GC activity can be that high for these reasons:

  • You produce lots of small objects and immediately discard them. It looks like you're not doing that.
  • Your data does not fit into your JVM heap, as if you tried to load a 2GB text file into RAM while having only 1GB of JVM heap. It looks like this is your case.

To fix this issue, try to increase the amount of JVM heap (see the sketch after this list) on:

  • your worker nodes, if you have a distributed Spark setup.
  • your spark-shell app.
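For example, roughly like this (the 4g values below are arbitrary placeholders; use whatever memory your machines can actually spare):

# Larger heap for the driver running spark-shell and for the executors:
spark-shell --driver-memory 4g --executor-memory 4g

# Or set it once for the cluster in conf/spark-defaults.conf:
spark.driver.memory   4g
spark.executor.memory 4g

Note that the driver heap has to be set before the JVM starts (on the command line or in spark-defaults.conf); setting it from inside an already running spark-shell has no effect.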
