E-HPC Supports Multi-Queue Management and Auto Scaling

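Alibaba Cloud E-HPC (Elastic High Performance Computing) added support for multi-queue scheduling and management in a recent release, together with auto scaling policies for multi-queue scheduling.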

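This article covers the following:

Background and use cases for multi-queue scheduling
How E-HPC implements multi-queue scheduling
How each type of HPC scheduler configures and manages queues and node groups
How to call the E-HPC multi-queue features through OpenAPI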

Preface

When traditional on-premises HPC clusters are migrated to the cloud, some of them adopt a hybrid cloud model.

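Compute resources in the cloud may have different specifications from the on-premises compute nodes, so a single cluster has to support several kinds of compute resources. HPC clusters usually manage nodes of different specifications with separate job queues or node groups, and then dispatch jobs to different queues to separate cloud jobs from on-premises jobs.

Some customers also need to run different types of jobs in one E-HPC cluster, and each job type has different resource requirements. For example, pre-processing jobs need ordinary ECS virtual machines with 8 cores and 32 GiB of memory, while back-end compute tasks need bare metal servers.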

Multi-Queue Support in E-HPC

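E-HPC supports multi-queue deployment through the following features:

A new instance type can be specified when scaling out
Nodes can be added to a specified queue when creating a cluster or scaling out; if the queue does not exist, it is created automatically
Jobs can be submitted to a specified queue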


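The auto scaling service supports per-queue elasticity policies. The following can be configured for each queue:

The instance type used for automatic scale-out
The billing method for scale-out: pay-as-you-go or preemptible instances
For preemptible instances, the bidding policy: automatic bidding by the system or a maximum price

All other scaling settings are shared from the cluster-wide configuration. Auto scaling can also be enabled for some queues and left disabled for others.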
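The relationship between the per-queue settings and the shared cluster-wide settings can be sketched as below. This is only an illustration under stated assumptions, not how E-HPC implements it: the class names, the field names, the default values, and the `effective_policy` helper are all hypothetical. Only the strategy values `NoSpot` and `SpotWithPriceLimit` come from the API example in this article.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class GlobalScalingConfig:
    """Cluster-wide settings shared by every queue (hypothetical fields and defaults)."""
    grow_interval_minutes: int = 2
    shrink_interval_minutes: int = 2
    max_nodes: int = 100


@dataclass(frozen=True)
class QueuePolicy:
    """Per-queue settings listed in the text (hypothetical field names)."""
    queue_name: str
    instance_type: str
    spot_strategy: str = "NoSpot"            # pay-as-you-go
    spot_price_limit: Optional[float] = None
    enable_autoscale: bool = True

    def validate(self) -> None:
        # A maximum price is only meaningful when bidding with a price cap.
        if self.spot_strategy == "SpotWithPriceLimit":
            if self.spot_price_limit is None or self.spot_price_limit <= 0:
                raise ValueError("SpotWithPriceLimit needs a positive price limit")


def effective_policy(global_cfg, queue):
    """Merge shared settings with one queue's settings.

    Returns None when auto scaling is disabled for the queue.
    """
    if not queue.enable_autoscale:
        return None
    queue.validate()
    return {**asdict(global_cfg), **asdict(queue)}
```

A queue with `enable_autoscale=False` is skipped, while every other queue inherits the shared intervals and node limit and contributes its own instance type and billing choice.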

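Multi-Queue Support in HPC Clusters

E-HPC can create and deploy clusters with several HPC schedulers, and each scheduler type supports queues differently. This section gives a brief overview.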

PBSPro

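PBSPro has two queue types:

execution: an executable queue; a job must be in an execution queue before it can be dispatched and run
routing: used to dispatch jobs to other queues; the destination can be a routing queue or an execution queue

PBSPro creates the execution queue workq by default, and this queue is started and enabled by default. If no queue is specified when nodes are added, jobs in workq can be dispatched to those nodes.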

The following PBSPro commands manage queues:

qmgr -c "create queue high queue_type = execution"
qmgr -c "set queue high started = true"
qmgr -c "set queue high enabled = true"
# Assign node001 to queue high; the node will then run only jobs from queue high
qmgr -c "set node node001 queue = high"

At present, E-HPC manages only execution queues in PBSPro clusters.
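When several queues have to be set up, the qmgr commands above can be generated rather than typed by hand. The sketch below only builds the command strings and does not run them. The function name `pbs_queue_commands` is hypothetical, and it assumes the same three queue attributes used in the commands above.

```python
import shlex


def pbs_queue_commands(queue, nodes):
    """Return the qmgr command lines that create an execution queue and bind nodes to it."""
    directives = [
        f"create queue {queue} queue_type = execution",
        f"set queue {queue} started = true",
        f"set queue {queue} enabled = true",
    ]
    # Binding a node to a queue restricts it to jobs from that queue.
    directives += [f"set node {node} queue = {queue}" for node in nodes]
    # shlex.quote keeps each directive as a single shell argument.
    return ["qmgr -c " + shlex.quote(d) for d in directives]


for line in pbs_queue_commands("high", ["node001", "node002"]):
    print(line)
```

The printed lines can be reviewed first and then run on the PBSPro server node.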

Slurm

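In Slurm, the concept that corresponds to a queue is the partition. A partition can be viewed as a node group that divides the nodes into sets. It can also be viewed as a job queue: limits can be set on the jobs that run in a partition, such as a job run time limit or user permission restrictions.

The default partition in the Slurm cluster is comp, and all compute nodes belong to it.

The following shows how partitions are configured in Slurm: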

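# Create a new partition and assign nodes to it. This setting is not persistent:
# restarting the slurmctld service overwrites it
scontrol create PartitionName=heavy nodes=compute0

# Alternatively, edit the configuration file
# Open /opt/slurm/17.02.4/etc/slurm.conf; the partition configuration is at the end of the file
PartitionName=comp Nodes=ALL Default=YES MaxTime=INFINITE State=UP
# A new partition can be added, for example (Default=NO keeps comp as the default partition)
PartitionName=light Nodes=compute0,compute1 Default=NO MaxTime=INFINITE State=UP
# Restart slurmctld
systemctl restart slurmctld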
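The partition lines in slurm.conf follow a fixed key=value layout, so they are easy to render from data. The following is a minimal sketch that builds such lines and refuses more than one default partition. The helper names are hypothetical, the single-default check is a choice made here to keep job routing unambiguous, and only the five keys shown in the configuration above are emitted.

```python
def partition_line(name, nodes, default=False, max_time="INFINITE", state="UP"):
    """Render one PartitionName line in the layout used in slurm.conf above."""
    if not nodes:
        raise ValueError("a partition needs at least one node")
    return (
        f"PartitionName={name} Nodes={','.join(nodes)} "
        f"Default={'YES' if default else 'NO'} MaxTime={max_time} State={state}"
    )


def render_partitions(partitions):
    """Render (name, nodes, is_default) tuples, allowing at most one default partition."""
    defaults = [name for name, _, is_default in partitions if is_default]
    if len(defaults) > 1:
        raise ValueError(f"more than one default partition: {defaults}")
    return "\n".join(partition_line(n, nodes, d) for n, nodes, d in partitions)


print(render_partitions([
    ("comp", ["ALL"], True),
    ("light", ["compute0", "compute1"], False),
]))
```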

LSF/CUBE

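The default queue in LSF or CUBE is normal, and all nodes join it by default. Individual nodes or node groups can be configured to join a specific queue. The queue configuration looks like this: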

# Open the queue configuration file lsb.queues (for CUBE the configuration path is /opt/cubeman/etc; LSF is similar)
# Add the following queue configuration
Begin Queue
QUEUE_NAME   = high
PRIORITY     = 30
NICE         = 20
#QJOB_LIMIT   = 60         # job limit of the queue
#UJOB_LIMIT   = 5               # job limit per user
#PJOB_LIMIT   = 2               # job limit per processor
#RUN_WINDOW   = 5:19:00-1:8:30 20:00-8:30
#r1m         = 0.7/2.0        # loadSched/loadStop
#r15m          = 1.0/2.5
#pg          = 4.0/8
#ut           = 0.2
#io          = 50/240
#CPULIMIT     = 180/apple      # 3 hours of host apple
#FILELIMIT    = 20000
#MEMLIMIT     = 5000           # jobs bigger than this (5M) will be niced
#DATALIMIT    = 20000          # jobs data segment limit
#STACKLIMIT   = 2048
#CORELIMIT    = 20000
#PROCLIMIT    = 5              # job processor limit
#USERS        = all            # users who can submit jobs to this queue
HOSTS        = high            # hostgroup high
#PRE_EXEC     = /usr/local/lsf/misc/testq_pre >> /tmp/pre.out
#POST_EXEC    = /usr/local/lsf/misc/testq_post |grep -v "Hey"
#REQUEUE_EXIT_VALUES = 55 34 78
DESCRIPTION  = For normal low priority jobs, running only if hosts are \
lightly loaded.
End Queue

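# Open the host group configuration file lsb.hosts and append the node group configuration at the end (for CUBE the configuration path is /opt/cubeman/etc; LSF is similar)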
Begin HostGroup
GROUP_NAME    GROUP_MEMBER    # Key words
high        (compute0 compute1)    # Define a host group
End HostGroup

# Restart the service
service cubeman restart
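Because the queue's HOSTS entry refers to a host group by name, it can help to confirm that the group is actually defined before restarting the service. The sketch below parses the HostGroup section in the layout shown above. It is a simplified reader written for this example, not an LSF or CUBE tool: it assumes the `Begin HostGroup` / `End HostGroup` markers appear exactly as written and that each group lists its members in parentheses on one line.

```python
import re


def parse_host_groups(text):
    """Map each host group name to its member list from an lsb.hosts-style text."""
    groups = {}
    in_section = False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        if line == "Begin HostGroup":
            in_section = True
        elif line == "End HostGroup":
            in_section = False
        elif in_section and not line.startswith("GROUP_NAME"):
            match = re.match(r"(\S+)\s+\((.*)\)", line)
            if match:
                groups[match.group(1)] = match.group(2).split()
    return groups


sample = """Begin HostGroup
GROUP_NAME    GROUP_MEMBER    # Key words
high        (compute0 compute1)    # Define a host group
End HostGroup
"""
groups = parse_host_groups(sample)
print(groups)
```

Checking `"high" in groups` before the restart catches a queue whose HOSTS value points at a group that was never added.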

SGE(Sun Grid Engine)

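The default queue in SGE is all.q and the default node group is @allhosts. All nodes belong to this node group by default.

The following shows how queues are configured in SGE: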

# Add a node group
qconf -ahgrp @high

group_name @high
hostlist compute0 compute1

# Add a queue
qconf -aq high
# Set the node group in the queue definition
hostlist              @high
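The host group definition above is plain text, so it can also be produced by a script. The following sketch renders the two lines shown above for a given group name and host list. The function name is hypothetical, and the check on the leading `@` reflects the naming used in this section.

```python
def hostgroup_definition(name, hosts):
    """Render an SGE host group definition with the two fields shown above."""
    if not name.startswith("@"):
        raise ValueError("host group names in this section start with '@'")
    if not hosts:
        raise ValueError("a host group needs at least one host")
    return f"group_name {name}\nhostlist {' '.join(hosts)}\n"


print(hostgroup_definition("@high", ["compute0", "compute1"]), end="")
```

One possible use, assuming the file-based variant `qconf -Ahgrp <file>` is available in the installed SGE version, is to write this text to a file and load it without opening the interactive editor.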

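API Call Examples

Some customers and partners integrate with E-HPC through OpenAPI, so this section shows how to call the relevant APIs. The samples are written in Python; sample code in other languages is available in OpenAPI Explorer.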

CreateCluster

#!/usr/bin/env python
#coding=utf-8

from aliyunsdkcore.client import AcsClient
from aliyunsdkcore.request import CommonRequest
client = AcsClient('<accessKeyId>', '<accessSecret>','cn-hangzhou')

request = CommonRequest()
request.set_accept_format('json')
request.set_domain('ehpc.cn-hangzhou.aliyuncs.com')
request.set_method('GET')
request.set_version('2018-04-12')
request.set_action_name('CreateCluster')

# Set the queue; the compute nodes created with the cluster are assigned to it, and the queue is created automatically
request.add_query_param('JobQueue', 'high')

# Set the other CreateCluster parameters
# ......

response = client.do_action_with_exception(request)

AddNodes

#!/usr/bin/env python
#coding=utf-8

from aliyunsdkcore.client import AcsClient
from aliyunsdkcore.request import CommonRequest
client = AcsClient('<accessKeyId>', '<accessSecret>','cn-hangzhou')

request = CommonRequest()
request.set_accept_format('json')
request.set_domain('ehpc.cn-hangzhou.aliyuncs.com')
request.set_method('GET')
request.set_version('2018-04-12')
request.set_action_name('AddNodes')
# Set the queue; the newly added compute nodes are assigned to it, and the queue is created automatically if it does not exist
request.add_query_param('JobQueue', 'high')

# Set the other AddNodes parameters
# ......

response = client.do_action_with_exception(request)

ListQueues

This new API returns the list of queues in a cluster.

#!/usr/bin/env python
#coding=utf-8

from aliyunsdkcore.client import AcsClient
from aliyunsdkcore.request import CommonRequest
client = AcsClient('<accessKeyId>', '<accessSecret>','cn-hangzhou')

request = CommonRequest()
request.set_accept_format('json')
request.set_domain('ehpc.cn-hangzhou.aliyuncs.com')
request.set_method('GET')
request.set_version('2018-04-12')
request.set_action_name('ListQueues')

request.add_query_param('RegionId', 'cn-hangzhou')
request.add_query_param('ClusterId', '<clusterId>')

response = client.do_action_with_exception(request)
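The call above returns the raw response body. A minimal sketch for pulling the queue names out of it follows. The JSON shape used here (`Queues` containing `QueueInfo`, each entry carrying `QueueName`) is an assumption made for this example, as is the sample payload; check the actual response in OpenAPI Explorer before relying on it. The `queue_names` helper is hypothetical.

```python
import json


def queue_names(response):
    """Extract queue names from a ListQueues response body (assumed JSON shape)."""
    if isinstance(response, bytes):
        response = response.decode("utf-8")
    body = json.loads(response)
    # Missing keys yield an empty list instead of raising.
    queue_infos = body.get("Queues", {}).get("QueueInfo", [])
    return [info["QueueName"] for info in queue_infos if "QueueName" in info]


# Hypothetical payload, for illustration only.
sample = b'{"RequestId": "example", "Queues": {"QueueInfo": [{"QueueName": "workq"}, {"QueueName": "high"}]}}'
print(queue_names(sample))
```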

SubmitJob

#!/usr/bin/env python
#coding=utf-8

from aliyunsdkcore.client import AcsClient
from aliyunsdkcore.request import CommonRequest
client = AcsClient('<accessKeyId>', '<accessSecret>','cn-hangzhou')

request = CommonRequest()
request.set_accept_format('json')
request.set_domain('ehpc.cn-hangzhou.aliyuncs.com')
request.set_method('GET')
request.set_version('2018-04-12')
request.set_action_name('SubmitJob')

# Submit the job to this queue
request.add_query_param('JobQueue', 'high')

# Set the other SubmitJob parameters
# ......

response = client.do_action_with_exception(request)

SetAutoScaleConfig

#!/usr/bin/env python
#coding=utf-8

from aliyunsdkcore.client import AcsClient
from aliyunsdkcore.request import CommonRequest
client = AcsClient('<accessKeyId>', '<accessSecret>','cn-hangzhou')

request = CommonRequest()
request.set_accept_format('json')
request.set_domain('ehpc.cn-hangzhou.aliyuncs.com')
request.set_method('GET')
request.set_version('2018-04-12')
request.set_action_name('SetAutoScaleConfig')

# For queue high: scale out with the GPU instance type ecs.gn6v-c8g1.8xlarge, pay-as-you-go
request.add_query_param('Queues.1.QueueName', 'high')
request.add_query_param('Queues.1.InstanceType', 'ecs.gn6v-c8g1.8xlarge')
request.add_query_param('Queues.1.SpotStrategy', 'NoSpot')
request.add_query_param('Queues.1.SpotPriceLimit', '0')

# For queue low: scale out with instance type ecs.g5.large as preemptible instances, with a maximum price of 0.1
request.add_query_param('Queues.2.QueueName', 'low')
request.add_query_param('Queues.2.InstanceType', 'ecs.g5.large')
request.add_query_param('Queues.2.SpotStrategy', 'SpotWithPriceLimit')
request.add_query_param('Queues.2.SpotPriceLimit', '0.1')

# Set the other SetAutoScaleConfig parameters
# ......

response = client.do_action_with_exception(request)
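The `Queues.N.Field` parameter names in the example above follow a regular pattern, so with more than a couple of queues they can be generated from a list. The sketch below does only that flattening. The helper name `flatten_queue_params` is hypothetical, and the field names are the ones used in the example above.

```python
def flatten_queue_params(queues):
    """Turn a list of per-queue dicts into 'Queues.N.Field' query parameters (N starts at 1)."""
    params = {}
    for index, queue in enumerate(queues, start=1):
        for field, value in queue.items():
            params[f"Queues.{index}.{field}"] = str(value)
    return params


queues = [
    {"QueueName": "high", "InstanceType": "ecs.gn6v-c8g1.8xlarge",
     "SpotStrategy": "NoSpot", "SpotPriceLimit": 0},
    {"QueueName": "low", "InstanceType": "ecs.g5.large",
     "SpotStrategy": "SpotWithPriceLimit", "SpotPriceLimit": 0.1},
]
params = flatten_queue_params(queues)
for key, value in params.items():
    print(key, value)
```

Each key and value pair can then be passed to `request.add_query_param(key, value)` in place of the hand-written lines.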

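Additional Setup for LSF/CUBE Clusters

LSF and CUBE require a license. After the cluster is created, the user must manually configure license authentication and then manually configure the queue and node group information, as described in the sections above. After that, subsequent node scale-out and auto scaling can manage multiple queues automatically.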
