(版本定制)第16课:SparkStreaming源码解读之数据清理内幕彻底解密
本期内容:
创新互联建站是一家专注于网站设计制作、网站设计与策划设计,呼中网站建设哪家好?创新互联建站做网站,专注于网站建设十多年,网设计领域的专业建站公司;建站业务涵盖:呼中等地区。呼中做网站价格咨询:18980820575
1、Spark Streaming元数据清理详解
2、Spark Streaming元数据清理源码解析
一、如何研究Spark Streaming元数据清理
操作DStream的时候会产生元数据,所以要解决RDD的数据清理工作就一定要从DStream入手。因为DStream是RDD的模板,DStream之间有依赖关系。
DStream的操作产生了RDD,接收数据也靠DStream,数据的输入,数据的计算,输出整个生命周期都是由DStream构建的。由此,DStream负责RDD的整个生命周期。因此研究的入口的是DStream。基于Kafka数据来源,通过Direct的方式访问Kafka,DStream随着时间的进行,会不断的在自己的内存数据结构中维护一个HashMap,HashMap维护的就是时间窗口,以及时间窗口下的RDD.按照Batch Duration来存储RDD以及删除RDD.
Spark Streaming本身是一直在运行的,在自己计算的时候会不断的产生RDD,例如每秒Batch Duration都会产生RDD,除此之外可能还有累加器,广播变量。由于不断的产生这些对象,因此Spark Streaming有自己的一套对象,元数据以及数据的清理机制。
Spark Streaming对RDD的管理就相当于JVM的GC
二、源码解析
Spark Streaming是通过我们设定的Batch Durations来不断的产生RDD,Spark Streaming清理元数据跟时钟有关,因为数据是周期性的产生,所以肯定是周期性的释放,这些都跟JobGenerator有关,所以我们先从这开始研究。
1、RecurringTimer: 消息循环器将消息不断的发送给EventLoop
= RecurringTimer(...millisecondslongTime => .post((Time(longTime))))
2、eventLoop:onReceive接收到消息
(): = synchronized { (!= ) = EventLoop[JobGeneratorEvent]() { (event: JobGeneratorEvent): = processEvent(event) (e: ): = { jobScheduler.reportError(e) } } .start() (.) { restart() } { startFirstTime() } }
3、在processEvent中接收清理元数据消息
/** Processes all events */ private def processEvent(event: JobGeneratorEvent) { logDebug("Got event " + event) event match { case GenerateJobs(time) => generateJobs(time) case ClearMetadata(time) => clearMetadata(time) //清理元数据 case DoCheckpoint(time, clearCheckpointDataLater) => doCheckpoint(time, clearCheckpointDataLater) case ClearCheckpointData(time) => clearCheckpointData(time) //清理checkpoint } }
具体的方法实现内容就不再这里说,我们进一步分析下这些清理动作是在什么时候被调用的,在Spark Streaming应用程序中,最终Job是交给JobHandler来执行的,所以我们分析下JobHandler
private class JobHandler(job: Job) extends Runnable with Logging { import JobScheduler._ def run() { try { val formattedTime = UIUtils.formatBatchTime( job.time.milliseconds, ssc.graph.batchDuration.milliseconds, showYYYYMMSS = false) val batchUrl = s"/streaming/batch/?id=${job.time.milliseconds}" val batchLinkText = s"[output operation ${job.outputOpId}, batch time ${formattedTime}]" ssc.sc.setJobDescription( s"""Streaming job from $batchLinkText""") ssc.sc.setLocalProperty(BATCH_TIME_PROPERTY_KEY, job.time.milliseconds.toString) ssc.sc.setLocalProperty(OUTPUT_OP_ID_PROPERTY_KEY, job.outputOpId.toString) // We need to assign `eventLoop` to a temp variable. Otherwise, because // `JobScheduler.stop(false)` may set `eventLoop` to null when this method is running, then // it's possible that when `post` is called, `eventLoop` happens to null. var _eventLoop = eventLoop if (_eventLoop != null) { _eventLoop.post(JobStarted(job, clock.getTimeMillis())) // Disable checks for existing output directories in jobs launched by the streaming // scheduler, since we may need to write output to an existing directory during checkpoint // recovery; see SPARK-4835 for more details. PairRDDFunctions.disableOutputSpecValidation.withValue(true) { job.run() } _eventLoop = eventLoop if (_eventLoop != null) { _eventLoop.post(JobCompleted(job, clock.getTimeMillis())) } } else { // JobScheduler has been stopped. } } finally { ssc.sc.setLocalProperty(JobScheduler.BATCH_TIME_PROPERTY_KEY, null) ssc.sc.setLocalProperty(JobScheduler.OUTPUT_OP_ID_PROPERTY_KEY, null) } } } }
当Job完成的时候,会发JobCompleted消息给onReceive,通过processEvent来执行具体的方法
private def processEvent(event: JobSchedulerEvent) { try { event match { case JobStarted(job, startTime) => handleJobStart(job, startTime) case JobCompleted(job, completedTime) => handleJobCompletion(job, completedTime) case ErrorReported(m, e) => handleError(m, e) } } catch { case e: Throwable => reportError("Error in job scheduler", e) } }
private def handleJobCompletion(job: Job, completedTime: Long) { val jobSet = jobSets.get(job.time) jobSet.handleJobCompletion(job) job.setEndTime(completedTime) listenerBus.post(StreamingListenerOutputOperationCompleted(job.toOutputOperationInfo)) logInfo("Finished job " + job.id + " from job set of time " + jobSet.time) if (jobSet.hasCompleted) { jobSets.remove(jobSet.time) jobGenerator.onBatchCompletion(jobSet.time) logInfo("Total delay: %.3f s for time %s (execution: %.3f s)".format( jobSet.totalDelay / 1000.0, jobSet.time.toString, jobSet.processingDelay / 1000.0 )) listenerBus.post(StreamingListenerBatchCompleted(jobSet.toBatchInfo)) } job.result match { case Failure(e) => reportError("Error running job " + job, e) case _ => } }
通过jobGenerator.onBatchCompletion来清理元数据
/** * Callback called when a batch has been completely processed. */ def onBatchCompletion(time: Time) { eventLoop.post(ClearMetadata(time)) }
到这里Spark Streaming清理元数据的步骤基本上完成了
文章名称:(版本定制)第16课:SparkStreaming源码解读之数据清理内幕彻底解密
文章起源:http://pwwzsj.com/article/ighjdp.html