Does caching in Spark Streaming increase performance?


So I'm performing multiple operations on the same RDD in a Kafka stream. Is caching that RDD going to improve performance?

When running multiple operations on the same DStream, cache will substantially improve performance. This can be observed on the Spark UI:

Without the use of cache, each iteration on the DStream takes the same amount of time, so the total time to process the data in each batch interval grows linearly with the number of iterations on the data:

(Screenshot: Spark Streaming, no cache.)

When cache is used, the first time the transformation pipeline on the RDD executes, the RDD is cached, and every subsequent iteration on the RDD takes only a fraction of the time to execute.
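As a minimal sketch of that behavior (assuming a lines DStream of strings coming from Kafka; the filter predicates are illustrative only):

import org.apache.spark.streaming.dstream.DStream

def multiPass(lines: DStream[String]): Unit = {
  // Mark every RDD produced by this DStream for caching (DStreams
  // default to StorageLevel.MEMORY_ONLY_SER when persisted).
  lines.cache()

  // The first action in each batch materializes and caches the RDD...
  lines.filter(_.contains("ERROR")).count().print()
  // ...the second reuses the cached data instead of recomputing the
  // whole pipeline from the Kafka source.
  lines.filter(_.contains("WARN")).count().print()
}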

(In the screenshot, the execution time of the same job was further reduced from 3s to 0.4s by reducing the number of partitions.)

(Screenshot: Spark Streaming, with cache.)
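That partition-count reduction is a separate optimization, presumably along these lines, where the batch RDD is compacted before the iterative passes (a sketch only; the target of 4 partitions is an arbitrary example):

dstream.foreachRDD { rdd =>
  // coalesce reduces the number of partitions, and therefore the number
  // of tasks launched per iteration, without triggering a shuffle.
  val compact = rdd.coalesce(4).cache()
  compact.count()        // placeholder action standing in for the real iterations
  compact.unpersist(true)
}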

Instead of using dstream.cache, I would recommend using dstream.foreachRDD or dstream.transform to gain direct access to the underlying RDD and apply the persist operation there. Use matching persist and unpersist calls around the iterative code to clean up memory as soon as possible:

dstream.foreachRDD { rdd =>
  rdd.cache()
  col.foreach { id => rdd.filter(elem => elem.id == id).map(...).saveAs... }
  rdd.unpersist(true)
}
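A fuller sketch of the same pattern, where the Event case class, the id list, and the output directory are hypothetical stand-ins for the elided parts above:

import org.apache.spark.streaming.dstream.DStream

case class Event(id: Int, payload: String)

def writePerId(dstream: DStream[Event], ids: Seq[Int], outDir: String): Unit = {
  dstream.foreachRDD { (rdd, time) =>
    rdd.cache()                     // materialize once, reuse across all ids
    ids.foreach { id =>
      rdd.filter(_.id == id)
         .map(_.payload)
         .saveAsTextFile(s"$outDir/id=$id/batch=${time.milliseconds}")
    }
    rdd.unpersist(blocking = true)  // free the memory before the next batch
  }
}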

Otherwise, one needs to wait for the time configured in spark.cleaner.ttl before the memory is cleared.

Note that the default value of spark.cleaner.ttl is infinite, which is not recommended for a production 24x7 Spark Streaming job.
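For reference, this setting goes on the SparkConf before the StreamingContext is created. A sketch, where the 3600-second value is just an example (spark.cleaner.ttl is a legacy property of older Spark versions and was removed in later releases in favor of the automatic ContextCleaner):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("StreamingCacheExample")
  // Periodically clean old metadata and persisted RDDs instead of
  // keeping them forever (the infinite default). Example value: 1 hour.
  .set("spark.cleaner.ttl", "3600")

val ssc = new StreamingContext(conf, Seconds(10))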

