Does caching in Spark Streaming increase performance?
I'm performing multiple operations on the same RDD in a Kafka stream. Is caching the RDD going to improve performance?
When running multiple operations on the same DStream, caching will substantially improve performance. This can be observed on the Spark UI:
Without the use of cache, each iteration on the DStream takes the same time, so the total time to process the data in each batch interval grows linearly with the number of iterations on the data:
When cache is used, the first time the transformation pipeline on the RDD is executed, the RDD is cached, and every subsequent iteration on the RDD takes only a fraction of the time to execute.
(In the screenshot, the execution time of the same job was further reduced from 3s to 0.4s by reducing the number of partitions.)
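A minimal sketch of that scenario, assuming a hypothetical DStream[String] named lines (e.g. obtained from a Kafka source); the stream name and the two actions on it are illustrative, not from the original question:

import org.apache.spark.streaming.dstream.DStream

def process(lines: DStream[String]): Unit = {
  lines.cache()  // mark every RDD produced by this stream for caching

  // First action: runs the full pipeline from the source and populates the cache.
  lines.filter(_.contains("ERROR")).count().print()

  // Second action: served from the cached data instead of recomputing from the source.
  lines.map(_.length).reduce(_ + _).print()
}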
Instead of using dstream.cache, I would recommend using dstream.foreachRDD or dstream.transform to gain direct access to the underlying RDD and apply the persist operation there. Use matching persist and unpersist calls around the iterative code to free the memory as soon as possible:
dstream.foreachRDD { rdd =>
  rdd.cache()                    // cache the RDD once per batch
  col.foreach { id =>            // iterate over the same cached RDD many times
    rdd.filter(elem => elem.id == id).map(...).saveAs...
  }
  rdd.unpersist(true)            // release the memory as soon as we're done
}
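A fleshed-out, runnable variant of the same pattern; the Event case class, the ids collection, and the text-file sink are hypothetical stand-ins for the parts elided above:

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.dstream.DStream

case class Event(id: Int, payload: String)

def saveById(dstream: DStream[Event], ids: Seq[Int]): Unit =
  dstream.foreachRDD { (rdd, time) =>
    rdd.persist(StorageLevel.MEMORY_ONLY)  // cache once per batch
    ids.foreach { id =>                    // reuse the cached RDD for every id
      rdd.filter(_.id == id)
         .map(_.payload)
         .saveAsTextFile(s"/tmp/events/$id-${time.milliseconds}")  // hypothetical sink
    }
    rdd.unpersist(blocking = true)         // free the memory right away
  }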
Otherwise, one needs to wait for the time configured in spark.cleaner.ttl for the memory to be cleared.
Note that the default value of spark.cleaner.ttl is infinite, which is not recommended for a production 24x7 Spark Streaming job.
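spark.cleaner.ttl is set like any other Spark property, for example on the SparkConf before the StreamingContext is created; the application name, the 3600-second value, and the 10-second batch interval below are only examples (note that later Spark releases removed this setting in favor of automatic, reference-tracking cleanup):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("streaming-job")
  .set("spark.cleaner.ttl", "3600")  // seconds after which old RDDs/metadata are cleared

val ssc = new StreamingContext(conf, Seconds(10))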