In the previous article I walked through the source code that starts and registers Spark's Master and Workers, which means the cluster is now up and running. This article covers what happens when an application is submitted to the cluster: how the Master's resource-scheduling source code allocates resources, and how the Driver and Executor processes are then launched on the Workers and begin working. Enough preamble, let's dive straight in.
### When is schedule() called? (resource scheduling)
In fact, schedule() is called to re-allocate resources whenever a new application joins the cluster or the available resources change. So how does it allocate them? Let's analyze it at the source level.
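As a toy illustration of this trigger pattern (this is not Spark code; the event names and bodies are made up for the example), every resource-changing event the Master handles ends with a call to schedule():

```scala
// Toy model of the Master's event handling; illustrative only.
sealed trait MasterEvent
case object RegisterApplication extends MasterEvent // a new app needs resources
case object RegisterWorker extends MasterEvent      // new resources became available
case object ExecutorLost extends MasterEvent        // resources were freed up

def schedule(): Unit =
  println("schedule(): re-allocating waiting drivers and executors")

def handle(event: MasterEvent): Unit = {
  println(s"handling $event")
  schedule() // every resource-changing event triggers a re-schedule
}

Seq(RegisterApplication, RegisterWorker, ExecutorLost).foreach(handle)
```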
schedule
First, here is the source of schedule():
private def schedule(): Unit = {
  // If the Master's state is not ALIVE, return immediately:
  // a standby Master never performs resource scheduling.
  if (state != RecoveryState.ALIVE) {
    return
  }
  // Drivers take strict precedence over executors
  // Schedule drivers first, then executors.
  // Random.shuffle randomly reorders the elements of the collection passed in.
  // 1. Shuffle the alive workers so drivers are not always assigned to the same worker.
  val shuffledAliveWorkers = Random.shuffle(workers.toSeq.filter(_.state == WorkerState.ALIVE))
  // Number of workers in the ALIVE state.
  val numWorkersAlive = shuffledAliveWorkers.size
  var curPos = 0
  // Iterate over the workers, scheduling the drivers first.
  /**
   * Why schedule drivers at all? Only drivers submitted in cluster deploy mode
   * (spark-submit --deploy-mode cluster against a standalone cluster) are
   * registered with the Master and scheduled here; in client mode the driver
   * is launched directly on the submitting machine, so it is never registered
   * and the Master never schedules it.
   */
  // 2. Loop over a copy of the queue of waiting drivers.
  for (driver <- waitingDrivers.toList) { // iterate over a copy of waitingDrivers
    // We assign workers to each waiting driver in a round-robin fashion. For each driver, we
    // start from the last worker that was assigned a driver, and continue onwards until we have
    // explored all alive workers.
    var launched = false
    // 2.1 numWorkersVisited counts how many alive workers have been examined so far.
    var numWorkersVisited = 0
    // Keep going while there are alive workers we have not visited yet
    // and the current driver has not been launched (launched == false).
    while (numWorkersVisited < numWorkersAlive && !launched) {
      // 2.2 Pick a worker: the first index is 0, and subsequent indices follow
      //     curPos = (curPos + 1) % numWorkersAlive.
      val worker = shuffledAliveWorkers(curPos)
      numWorkersVisited += 1
      // If the worker's free memory is at least what the driver needs,
      // and its free CPU cores are at least what the driver needs...
      if (worker.memoryFree >= driver.desc.mem && worker.coresFree >=
          driver.desc.cores) {
        // ...launch the driver on this worker,
        launchDriver(worker, driver)
        // and remove it from the waitingDrivers queue.
        waitingDrivers -= driver
        launched = true
      }
      // Advance the pointer to the next worker.
      curPos = (curPos + 1) % numWorkersAlive
    }
  }
  // Once all waiting drivers have been placed, start the executors.
  startExecutorsOnWorkers()
}
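To make the round-robin driver placement concrete, here is a minimal, self-contained sketch; the Worker and Driver case classes are hypothetical stand-ins for Spark's WorkerInfo and DriverInfo, not the real classes:

```scala
import scala.util.Random

// Hypothetical stand-ins, just for illustration.
case class Worker(id: String, var memoryFree: Int, var coresFree: Int)
case class Driver(id: String, mem: Int, cores: Int)

val workers = Random.shuffle(Seq(
  Worker("w1", 4096, 4), Worker("w2", 2048, 2), Worker("w3", 8192, 8)))
val waitingDrivers = Seq(Driver("d1", 1024, 1), Driver("d2", 4096, 4))

var curPos = 0
for (driver <- waitingDrivers) {
  var launched = false
  var visited = 0
  while (visited < workers.size && !launched) {
    val w = workers(curPos)
    visited += 1
    if (w.memoryFree >= driver.mem && w.coresFree >= driver.cores) {
      w.memoryFree -= driver.mem   // "launch": account for the driver's memory
      w.coresFree -= driver.cores  // and its cores on the chosen worker
      println(s"launched ${driver.id} on ${w.id}")
      launched = true
    }
    curPos = (curPos + 1) % workers.size
  }
}
```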
Now a quick look at the shuffle method used above.
Entering Random.shuffle(workers.toSeq.filter(_.state == WorkerState.ALIVE)):
def shuffle[T, CC[X] <: TraversableOnce[X]](xs: CC[T])(implicit bf: CanBuildFrom[CC[T], T, CC[T]]): CC[T] = {
  val buf = new ArrayBuffer[T] ++= xs
  def swap(i1: Int, i2: Int) {
    val tmp = buf(i1)
    buf(i1) = buf(i2)
    buf(i2) = tmp
  }
  // Fisher-Yates: walk backwards from the end of the buffer and swap each
  // element with one at a randomly chosen earlier (or same) position.
  for (n <- buf.length to 2 by -1) {
    val k = nextInt(n)
    swap(n - 1, k)
  }
  (bf(xs) ++= buf).result()
}
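A quick usage example of scala.util.Random.shuffle, which is all the Master relies on here:

```scala
import scala.util.Random

val workers = Seq("worker-1", "worker-2", "worker-3", "worker-4")
// Each call returns a new collection with the elements in random order,
// so drivers are not always offered to the same worker first:
println(Random.shuffle(workers)) // e.g. List(worker-3, worker-1, worker-4, worker-2)
println(Random.shuffle(workers)) // likely a different order
```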
Launching the Driver
The allocation flow is annotated in detail in the source above; now let's look at the launchDriver method:
// Launch a driver on a specific worker.
private def launchDriver(worker: WorkerInfo, driver: DriverInfo) {
  // 1. Log the launch.
  logInfo("Launching driver " + driver.id + " on worker " + worker.id)
  // 2. Add the driver to the worker's in-memory bookkeeping, which also adds
  //    the driver's memory and CPU requirements to the worker's used totals.
  worker.addDriver(driver)
  // 3. Record the worker in the driver's own bookkeeping as well.
  driver.worker = Some(worker)
  // 4. Send a LaunchDriver message over the worker's endpoint (an actor in
  //    older versions) so the worker actually starts the driver.
  worker.endpoint.send(LaunchDriver(driver.id, driver.desc))
  // 5. Mark the driver's state as RUNNING.
  driver.state = DriverState.RUNNING
}
Next, let's see how the corresponding Worker handles the LaunchDriver message.
In the Worker source, find the LaunchDriver case:
// How the worker handles the LaunchDriver message.
case LaunchDriver(driverId, driverDesc) =>
  // 1. Log the request.
  logInfo(s"Asked to launch driver $driverId")
  // 2. Instantiate a DriverRunner.
  val driver = new DriverRunner(
    conf,
    driverId,
    workDir,
    sparkHome,
    driverDesc.copy(command = Worker.maybeUpdateSSLSettings(driverDesc.command, conf)),
    self,
    workerUri,
    securityMgr)
  // 3. Record the new runner in the drivers map.
  drivers(driverId) = driver
  // 4. Start the driver (we trace the start method next).
  driver.start()
  // 5. Account for the resources the driver now occupies.
  coresUsed += driverDesc.cores
  memoryUsed += driverDesc.mem
Following driver.start():
// Starts a thread to run and manage the driver.
private[worker] def start() = {
  new Thread("DriverRunner for " + driverId) {
    override def run() {
      var shutdownHook: AnyRef = null
      try {
        shutdownHook = ShutdownHookManager.addShutdownHook { () =>
          logInfo(s"Worker shutting down, killing driver $driverId")
          kill()
        }
        // prepare driver jars and run driver
        // Creates the working directory, downloads the jars the driver needs,
        // and runs it; the concrete launch happens inside this method.
        val exitCode = prepareAndRunDriver()
        // ... (remainder elided: exitCode is turned into a final DriverState
        // and a DriverStateChanged message is sent back to the worker)
      } catch {
        case e: Exception =>
          kill() // elided: the exception is also recorded as the final state
      }
    }
  }.start()
}
Stepping into prepareAndRunDriver():
private[worker] def prepareAndRunDriver(): Int = {
  // Create the driver's working directory and download the user jar into it.
  val driverDir = createWorkingDirectory()
  val localJarFilename = downloadUserJar(driverDir)
  def substituteVariables(argument: String): String = argument match {
    case "{{WORKER_URL}}" => workerUrl
    case "{{USER_JAR}}" => localJarFilename
    case other => other
  }
  // TODO: If we add ability to submit multiple jars they should also be added here
  val builder = CommandUtils.buildProcessBuilder(driverDesc.command, securityManager,
    driverDesc.mem, sparkHome.getAbsolutePath, substituteVariables)
  // The driver is launched through the ProcessBuilder built here.
  runDriver(builder, driverDir, driverDesc.supervise)
}
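To illustrate the placeholder substitution above, here is a self-contained sketch; workerUrl and localJarFilename are made-up example values:

```scala
val workerUrl = "spark://Worker@192.168.0.10:7078"       // example value
val localJarFilename = "/work/driver-20240101/user.jar"  // example value

def substituteVariables(argument: String): String = argument match {
  case "{{WORKER_URL}}" => workerUrl
  case "{{USER_JAR}}" => localJarFilename
  case other => other
}

// The driver command's arguments contain {{...}} placeholders that can only
// be resolved once the concrete worker and local jar path are known:
val args = Seq("--worker-url", "{{WORKER_URL}}", "--jar", "{{USER_JAR}}")
println(args.map(substituteVariables))
// -> List(--worker-url, spark://Worker@192.168.0.10:7078, --jar, /work/driver-20240101/user.jar)
```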
You can see that runDriver is what actually launches the Driver (we won't analyze the launch details here).
If the launch succeeds, a DriverStateChanged message is sent to the worker. On receiving it, the Worker calls handleDriverStateChanged for a series of processing steps; we'll skip the details, but the key step is forwarding a driverStateChanged message to the Master, which removes the driver's information when it receives the message:
case DriverStateChanged(driverId, state, exception) =>
  state match {
    case DriverState.ERROR | DriverState.FINISHED | DriverState.KILLED | DriverState.FAILED =>
      removeDriver(driverId, state, exception)
    case _ =>
      throw new Exception(s"Received unexpected state update for driver $driverId: $state")
  }
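As a toy model (not Spark code) of how a driver's terminal state travels upward, each layer forwards the message and cleans up its own bookkeeping:

```scala
// DriverRunner thread -> Worker -> Master; names are illustrative only.
case class DriverStateChanged(driverId: String, state: String)

def driverRunnerFinished(id: String, exitCode: Int): DriverStateChanged =
  DriverStateChanged(id, if (exitCode == 0) "FINISHED" else "FAILED")

def workerHandle(msg: DriverStateChanged): DriverStateChanged = {
  println(s"worker: freeing cores/memory of ${msg.driverId}")
  msg // forwarded to the Master
}

def masterHandle(msg: DriverStateChanged): Unit =
  println(s"master: removing ${msg.driverId} (state=${msg.state})")

masterHandle(workerHandle(driverRunnerFinished("driver-001", 0)))
```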
At this point the process of allocating resources to the Driver and launching it is complete. Next, let's look at launching the Executors, i.e. the flow of startExecutorsOnWorkers():
// Start executors on the workers.
private def startExecutorsOnWorkers(): Unit = {
  // Apps are scheduled in first-in, first-out order.
  // Right now this is a very simple FIFO scheduler. We keep trying to fit in the first app
  // in the queue, then the second app, etc.
  for (app <- waitingApps if app.coresLeft > 0) {
    val coresPerExecutor: Option[Int] = app.desc.coresPerExecutor
    // Filter out workers that don't have enough resources to launch an executor
    // 1. Keep only ALIVE workers that satisfy the app's memory and core
    //    requirements, sorted by free CPU cores in descending order.
    val usableWorkers = workers.toArray.filter(_.state == WorkerState.ALIVE)
      .filter(worker => worker.memoryFree >= app.desc.memoryPerExecutorMB &&
        worker.coresFree >= coresPerExecutor.getOrElse(1))
      .sortBy(_.coresFree).reverse
    // 2. Decide how many CPU cores to allocate on each worker. spreadOutApps
    //    (true by default) chooses between the two algorithms described below.
    val assignedCores = scheduleExecutorsOnWorkers(app, usableWorkers, spreadOutApps)
    // Now that we've decided how many cores to allocate on each worker, let's allocate them
    // 3. Perform the allocation.
    for (pos <- 0 until usableWorkers.length if assignedCores(pos) > 0) {
      allocateWorkerResourceToExecutors(
        app, assignedCores(pos), coresPerExecutor, usableWorkers(pos))
    }
  }
}
In the steps above, how is it decided how many cores each worker contributes?
There are two algorithms, and the first is the default:
Algorithm 1 (spread out): spread the executors across as many workers as possible. For example, with 10 worker nodes of 10 cores each and a spark-submit request for 20 cores, every worker contributes 2 cores to run your Spark program. The benefit is that as many workers as possible take part in the job.
Algorithm 2 (consolidate): the opposite. With the same 10 workers of 10 cores each and a request for 20 cores, only the first 2 workers are used, 10 cores each, leaving the other 8 idle.
spreadOutApps defaults to true, so the first algorithm is used; see the sketch below.
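Here is a minimal sketch of the core idea behind scheduleExecutorsOnWorkers under the two algorithms. It is a simplification of the real method, which also checks memory limits and the per-executor core count; this sketch ignores both:

```scala
// Simplified model of the spreadOut decision; not the real
// scheduleExecutorsOnWorkers.
def assignCores(coresNeeded: Int, freeCores: Array[Int], spreadOut: Boolean): Array[Int] = {
  val assigned = Array.fill(freeCores.length)(0)
  var toAssign = coresNeeded

  def hasCapacity(pos: Int) = assigned(pos) < freeCores(pos)

  var pos = 0
  while (toAssign > 0 && (0 until freeCores.length).exists(hasCapacity)) {
    if (hasCapacity(pos)) {
      assigned(pos) += 1
      toAssign -= 1
      // Spread out: move to the next worker after every core; otherwise
      // keep filling the current worker until it has no free cores left.
      if (spreadOut) pos = (pos + 1) % freeCores.length
    } else {
      pos = (pos + 1) % freeCores.length
    }
  }
  assigned
}

// 10 workers with 10 free cores each, 20 cores requested:
println(assignCores(20, Array.fill(10)(10), spreadOut = true).mkString(","))
// -> 2,2,2,2,2,2,2,2,2,2   (algorithm 1: every worker runs part of the app)
println(assignCores(20, Array.fill(10)(10), spreadOut = false).mkString(","))
// -> 10,10,0,0,0,0,0,0,0,0 (algorithm 2: fill up the first workers)
```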
Next comes actually allocating the computed resources to executors and launching them:
private def allocateWorkerResourceToExecutors(
    app: ApplicationInfo,
    assignedCores: Int,
    coresPerExecutor: Option[Int],
    worker: WorkerInfo): Unit = {
  // If the number of cores per executor is specified, we divide the cores assigned
  // to this worker evenly among the executors with no remainder.
  // Otherwise, we launch a single executor that grabs all the assignedCores on this worker.
  val numExecutors = coresPerExecutor.map { assignedCores / _ }.getOrElse(1)
  val coresToAssign = coresPerExecutor.getOrElse(assignedCores)
  for (i <- 1 to numExecutors) {
    // Record the executor in the application's bookkeeping.
    val exec = app.addExecutor(worker, coresToAssign)
    // Launch the executor.
    launchExecutor(worker, exec)
    app.state = ApplicationState.RUNNING
  }
}
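A quick worked example of the executor-count arithmetic above, with made-up values:

```scala
val assignedCores = 8

// spark.executor.cores set to 2: four executors with 2 cores each.
val limited: Option[Int] = Some(2)
println(limited.map(assignedCores / _).getOrElse(1)) // numExecutors = 4
println(limited.getOrElse(assignedCores))            // coresToAssign = 2

// No per-executor limit: one executor grabs all 8 cores on this worker.
val unlimited: Option[Int] = None
println(unlimited.map(assignedCores / _).getOrElse(1)) // numExecutors = 1
println(unlimited.getOrElse(assignedCores))            // coresToAssign = 8
```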
Launching the executors:
// Adds the executor to the worker's bookkeeping, asks the worker to launch
// it, and notifies the application's driver.
private def launchExecutor(worker: WorkerInfo, exec: ExecutorDesc): Unit = {
  logInfo("Launching executor " + exec.fullId + " on worker " + worker.id)
  // Add the executor to the worker's in-memory bookkeeping.
  worker.addExecutor(exec)
  // Send a LaunchExecutor message so the worker starts the executor.
  worker.endpoint.send(LaunchExecutor(masterUrl,
    exec.application.id, exec.id, exec.application.desc, exec.cores, exec.memory))
  // Send an ExecutorAdded message to the driver of the executor's application.
  exec.application.driver.send(
    ExecutorAdded(exec.id, worker.id, worker.hostPort, exec.cores, exec.memory))
}
After receiving the LaunchExecutor message, the worker performs the actual launch and reports back to the Master.
The executor's resource information is also sent to the driver; once the driver has it, the application starts executing. With that, the rough flow of allocating resources for and launching executors is complete.
Finally, a diagram summarizing the simplified flow of launching the Driver and Executors: