statistics - Is it possible to real-time update a value with Spark Streaming?
Let's assume we have a stream of double values and we want to compute the average every ten seconds. How can we have a sliding window that doesn't need to recompute the average, but instead updates it by, let's say, removing the values of the oldest ten seconds and adding the new ten seconds' values?
TL;DR: use reduceByWindow with both of its function arguments (jump to the last paragraph for the code snippet).
There are two interpretations of the question: a specific one (how do I get a running mean over one hour, updated every two seconds?) and a general one (how do I get a computation that updates state in a sparse way?). Here's the answer to the general one.
First, notice that there is a way to represent your data such that your average-with-updates is easy to compute, based on a windowed DStream: this represents your data as an incremental construction of the stream, with maximal sharing. But, as you noted, it is less efficient computationally to recompute the mean on each batch.
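For contrast, here is a minimal sketch of that recompute-per-batch variant (the stream name and the durations are illustrative assumptions, not part of the original answer):

import org.apache.spark.streaming.Durations
import org.apache.spark.streaming.dstream.DStream

val myInitialDStream: DStream[Float] = ??? // hypothetical source stream

// window() keeps the last hour of batches, and reduce() re-aggregates all of
// them on every 2s slide – correct, but no work is shared between windows.
val recomputedMeans: DStream[Float] = myInitialDStream
  .map(x => (x, 1L))
  .window(Durations.seconds(3600), Durations.seconds(2))
  .reduce((a, b) => (a._1 + b._1, a._2 + b._2))
  .map(p => p._1 / p._2)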
If you want an update of a complex stateful computation which is invertible, but you don't want to touch the stream's construction, then there is updateStateByKey – but there Spark doesn't help you in reflecting the incremental aspect of your computation in the stream; you have to manage it yourself.
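The answer stops at naming updateStateByKey, but as a rough sketch of what that route could look like (not from the original answer; the dummy key and the (sum, count) state shape are assumptions made for illustration):

// updateStateByKey needs a keyed stream, so fold everything under a dummy key.
val keyed: DStream[(Unit, (Float, Long))] =
  myInitialDStream.map(x => ((), (x, 1L)))

// Accumulate a global (sum, count). Note this state grows over the whole
// stream: evicting the batches that leave the window is left entirely to you –
// exactly the "manage it yourself" part. Requires ssc.checkpoint(...) as well.
val state: DStream[(Unit, (Float, Long))] =
  keyed.updateStateByKey { (newValues: Seq[(Float, Long)], old: Option[(Float, Long)]) =>
    val (s0, n0) = old.getOrElse((0f, 0L))
    Some((newValues.map(_._1).sum + s0, newValues.map(_._2).sum + n0))
  }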
Here, you do have something simple and invertible, and you don't have a notion of keys. You can use reduceByWindow with its inverse reduction argument, using the usual add/subtract functions that let you compute an incremental mean.
import org.apache.spark.streaming.Durations
import org.apache.spark.streaming.dstream.DStream

val myInitialDStream: DStream[Float] = ??? // your source stream of values

// Pair each value with a count of 1, so sums and counts slide together.
val myDStreamWithCount: DStream[(Float, Long)] =
  myInitialDStream.map((x) => (x, 1L))

// Add the (sum, count) of the batch entering the window.
def addOneBatchToMean(previousMean: (Float, Long), newBatch: (Float, Long)): (Float, Long) =
  (previousMean._1 + newBatch._1, previousMean._2 + newBatch._2)

// Subtract the (sum, count) of the batch leaving the window.
def removeOneBatchToMean(previousMean: (Float, Long), oldBatch: (Float, Long)): (Float, Long) =
  (previousMean._1 - oldBatch._1, previousMean._2 - oldBatch._2)

// 1h window sliding every 2s; thanks to the inverse function, each slide only
// touches the batches that enter and leave the window. Note this variant of
// reduceByWindow requires checkpointing to be enabled on the context.
val runningMeans = myDStreamWithCount.reduceByWindow(
  addOneBatchToMean, removeOneBatchToMean,
  Durations.seconds(3600), Durations.seconds(2))
You get a stream of one-element RDDs, each of which contains a pair (m, n), where m is the running sum over the 1h window and n is the number of elements in the 1h window. You can return (or map to) m/n to get the mean.
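For instance (a small usage sketch; print() is just a stand-in for whatever output operation you use):

// Divide the running sum by the running count to obtain the mean per slide.
val means: DStream[Float] = runningMeans.map { case (sum, count) => sum / count }
means.print()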