
Collecting cluster statistics with the Hadoop RESTful APIs

(Applies to Hadoop 2.7 and later.)

ResourceManager REST APIs:

https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/resourcemanagerrest.html

WebHDFS REST API:

https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/webhdfs.html

MapReduce History Server REST APIs:

https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/historyserverrest.html

Spark Monitoring and Instrumentation:

http://spark.apache.org/docs/latest/monitoring.html

Example request (WebHDFS content summary of the HDFS root directory):

<a href="http://emr-header-1:50070/webhdfs/v1/?user.name=hadoop&amp;op=getcontentsummary">http://emr-header-1:50070/webhdfs/v1/?user.name=hadoop&amp;op=getcontentsummary</a>

Response:
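The original response body is not reproduced here. Per the WebHDFS documentation, GETCONTENTSUMMARY returns a ContentSummary object shaped roughly as follows (the numbers are illustrative only, for a directory whose files are replicated three times):

{
  "ContentSummary": {
    "directoryCount": 2,
    "fileCount": 1,
    "length": 24930,
    "quota": -1,
    "spaceConsumed": 74790,
    "spaceQuota": -1
  }
}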

Notes on the response:

Note the relationship between length and spaceConsumed: spaceConsumed counts every block replica, so with the default HDFS replication factor of 3 it is roughly 3 × length.

To report usage for each group's working directory, issue a request like the following:

<a href="http://emr-header-1:50070/webhdfs/v1/user/feed_aliyun?user.name=hadoop&amp;op=getcontentsummary">http://emr-header-1:50070/webhdfs/v1/user/feed_aliyun?user.name=hadoop&amp;op=getcontentsummary</a>

<a href="http://emr-header-1:8088/ws/v1/cluster">http://emr-header-1:8088/ws/v1/cluster</a>

Response:
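The response body is not reproduced here. The Cluster Information API returns a clusterInfo object whose fields include id, startedOn, state, haState, resourceManagerVersion and hadoopVersion.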

<a href="http://emr-header-1:8088/ws/v1/cluster/scheduler">http://emr-header-1:8088/ws/v1/cluster/scheduler</a>

For detailed field descriptions, see:

<a href="https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/resourcemanagerrest.html#cluster_application_queue_api">https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/resourcemanagerrest.html#cluster_application_queue_api</a>

<a href="http://emr-header-1:8088/ws/v1/cluster/apps">http://emr-header-1:8088/ws/v1/cluster/apps</a>

To restrict the statistics to a fixed time window, append the parameters "?finishedTimeBegin={timestamp}&finishedTimeEnd={timestamp}" (epoch milliseconds), for example:

<a href="http://emr-header-1:8088/ws/v1/cluster/apps?finishedtimebegin=1496742124000&amp;finishedtimeend=1496742134000">http://emr-header-1:8088/ws/v1/cluster/apps?finishedtimebegin=1496742124000&amp;finishedtimeend=1496742134000</a>

The amount of data a job scans has to be obtained from the history server REST APIs, and the MapReduce and Spark APIs differ slightly.

<a href="http://emr-header-1:19888/ws/v1/history/mapreduce/jobs/job_1495123166259_0962/counters">http://emr-header-1:19888/ws/v1/history/mapreduce/jobs/job_1495123166259_0962/counters</a>

In the response, the BYTES_READ counter in the group org.apache.hadoop.mapreduce.lib.input.FileInputFormatCounter is the amount of data the job scanned.
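A sketch of pulling that counter out of the response, assuming the documented jobCounters layout (a counterGroup list whose entries hold a counter list with totalCounterValue); the job id is the one from the example URL:

# Minimal sketch: read BYTES_READ from the MapReduce history server counters API.
import requests

HISTORY = "http://emr-header-1:19888"
JOB_ID = "job_1495123166259_0962"
GROUP = "org.apache.hadoop.mapreduce.lib.input.FileInputFormatCounter"

url = "%s/ws/v1/history/mapreduce/jobs/%s/counters" % (HISTORY, JOB_ID)
job_counters = requests.get(url).json()["jobCounters"]

bytes_read = 0
for group in job_counters.get("counterGroup", []):
    if group.get("counterGroupName") == GROUP:
        for counter in group.get("counter", []):
            if counter.get("name") == "BYTES_READ":
                bytes_read = counter.get("totalCounterValue", 0)

print("job %s scanned %d bytes" % (JOB_ID, bytes_read))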

<a href="http://emr-header-1:18080/api/v1/applications/application_1495123166259_1050/executors">http://emr-header-1:18080/api/v1/applications/application_1495123166259_1050/executors</a>

The sum of totalInputBytes over all executors is the total amount of data scanned by the job.
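A sketch of that summation against the Spark history server, assuming the totalInputBytes field of ExecutorSummary from the Spark monitoring documentation and the application id from the example URL:

# Minimal sketch: total bytes read by a Spark application, summed over its executors.
# The executor list also contains the driver, which normally reports 0 input bytes.
import requests

SPARK_HS = "http://emr-header-1:18080"
APP_ID = "application_1495123166259_1050"

executors = requests.get("%s/api/v1/applications/%s/executors" % (SPARK_HS, APP_ID)).json()
total_input = sum(e.get("totalInputBytes", 0) for e in executors)

print("application %s scanned %d bytes" % (APP_ID, total_input))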