
HBase Region Data Inconsistency

【Background】

An application hit the following error while using bulkload to import HFiles into HBase:

2019-04-03 11:27:18,509 [LoadIncrementalHFiles-2][org.apache.hadoop.hbase.client.RpcRetryingCaller:132] [INFO ] - Call exception, tries=20, retries=35, started=269051 ms ago, cancelled=false, msg=row '048000-1229171819-48889202-48889202-0-79818770#090' on table 'tbl_glhis_hb_swt_addn_inf' at region=tbl_glhis_hb_swt_addn_inf,048000-1229171819-48889202-48889202-0-79818770#090,1554144453180.10bd9b9ffb8bcaac4f939a79a29f4ba1., hostname=y3050705,60020,1517293052636, seqNum=1


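The HBase Master web UI showed that the region of tbl_glhis_hb_swt_addn_inf hosted on node y3050705 was unavailable, so the regionserver service on that node was stopped manually. The application retried the import and hit the same error.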

【Repair process】

On any one of the regionserver nodes, run:

export HADOOP_USER_NAME=hbase

hbase hbck -details tbl_glhis_hb_swt_addn_inf > 1.txt 2>&1

The log in 1.txt shows the following error:

---- Table 'tbl_achis_hb_trans_flow': overlap groups

There are 0 overlap groups with 0 overlapping regions

19/04/03 13:40:15 INFO util.HBaseFsck: Computing mapping of all store files

...........................................................................................java.lang.OutOfMemoryError: Java heap space

Dumping heap to java_pid16636.hprof ...

Heap dump file created [330952255 bytes in 2.844 secs]

#

# java.lang.OutOfMemoryError: Java heap space

# -XX:OnOutOfMemoryError="kill -9 %p"

#   Executing /bin/sh -c "kill -9 16636"...

Killed

[END] 2019/4/3 13:42:50

The hbck run failed because it ran out of heap memory, so check /etc/hbase/conf/hbase-env.sh:


The HBASE_OPTS setting there limits the maximum JVM heap size to 256 MB.

So temporarily raise the JVM heap size on that regionserver:


export HBASE_OPTS="-Xmx4294967296 -XX:+HeapDumpOnOutOfMemoryError -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -Djava.net.preferIPv4Stack=true $HBASE_OPTS"
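An -Xmx value without a unit suffix is read by the JVM as a byte count, so the setting above is easier to sanity-check once converted. The helper below is a minimal sketch; the name bytes_to_gib is made up for illustration and is not part of HBase.

```shell
# Convert a byte count, as passed to -Xmx without a suffix, to whole GiB.
# bytes_to_gib is a hypothetical helper name, not an HBase or JVM tool.
bytes_to_gib() {
  echo $(( $1 / 1024 / 1024 / 1024 ))
}

bytes_to_gib 4294967296   # prints 4, so -Xmx4294967296 is the same as -Xmx4g
```

Because the variable is exported in the current shell only, the larger heap applies to hbck runs started from that session and does not change hbase-env.sh.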


Run the check again:

hbase hbck -details tbl_glhis_hb_swt_addn_inf > 1.txt 2>&1

This time it detects 4 inconsistencies:

4 inconsistencies detected.

Status: INCONSISTENT
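When the report is redirected to a file, the two summary lines above can be pulled out without scrolling through the whole log. This is a rough sketch that assumes the report ends with lines shaped like "N inconsistencies detected." and "Status: ..."; hbck_summary is a hypothetical helper name.

```shell
# Print the inconsistency count and final status from a saved hbck report.
# $1: path to the report, e.g. the 1.txt written by the hbck run above.
# hbck_summary is a hypothetical helper, not an hbck option.
hbck_summary() (
  count=$(sed -n 's/^\([0-9][0-9]*\) *inconsistencies detected.*/\1/p' "$1" | tail -n 1)
  status=$(sed -n 's/^Status: *\(.*\)$/\1/p' "$1" | tail -n 1)
  echo "inconsistencies=${count:-unknown} status=${status:-unknown}"
)

# Usage (commented out because it needs a real report file):
# hbck_summary 1.txt
```

Running it after each repair pass gives a quick view of whether the count is going down.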


The errors in the log are:

ERROR: Region { meta => tbl_glhis_hb_swt_addn_inf,048000-1229171819-48889202-48889202-0-79818770#090,1554144453180.10bd9b9ffb8bcaac4f939a79a29f4ba1., hdfs => null, deployed => , replicaId => 0 } found in META, but not in HDFS or deployed on any region server.

ERROR: Region { meta => tbl_glhis_hb_swt_addn_inf,047648-0225101088-48429202-48429202-0-79809774#090,1554144453180.bddba6808daaf163334bdda58fe8ff47., hdfs => null, deployed => , replicaId => 0 } found in META, but not in HDFS or deployed on any region server.

ERROR: Region { meta => null, hdfs => null, deployed => y3031903,60020,1534847671712;tbl_glhis_hb_swt_addn_inf,047648-0225101088-48429202-48429202-0-79809774#090,1554143472127.defad07ae1207932536feb505f0d047b., replicaId => 0 }, key=defad07ae1207932536feb505f0d047b, not on HDFS or in hbase:meta but deployed on y3031903,60020,1534847671712

at org.apache.hadoop.hbase.master.MasterRpcServices.offlineRegion(MasterRpcServices.java:1232)

ERROR: No regioninfo in Meta or HDFS. { meta => null, hdfs => null, deployed => y3031903,60020,1534847671712;tbl_glhis_hb_swt_addn_inf,047648-0225101088-48429202-48429202-0-79809774#090,1554143472127.defad07ae1207932536feb505f0d047b., replicaId => 0 }

ERROR: There is a hole in the region chain between 047648-0225101088-48429202-48429202-0-79809774#090 and 048337-0529174121-48028810-00098800-0-79829470#090.  You need to create a new .regioninfo and region dir in hdfs to plug the hole.

ERROR: Found inconsistency in table tbl_glhis_hb_swt_addn_inf


So run a repair:

hbase hbck -fixHdfsHoles -fixMeta -fixAssignments tbl_glhis_hb_swt_addn_inf > 3.txt 2>&1

The log now shows the inconsistencies reduced to two:

ERROR: Region { meta => null, hdfs => hdfs://nameservice/hbase/data/default/tbl_glhis_hb_swt_addn_inf/1cb1bdb5a6d35637d1365d1bfddfa839, deployed => , replicaId => 0 } on HDFS, but not listed in hbase:meta or deployed on any region server

ERROR: There is a hole in the region chain between 047648-0225101088-48429202-48429202-0-79809774#090 and 048337-0529174121-48028810-00098800-0-79829470#090.  You need to create a new .regioninfo and region dir in hdfs to plug the hole.

ERROR: Found inconsistency in table tbl_glhis_hb_swt_addn_inf

So run hbck once more:

hbase hbck -repairHoles tbl_glhis_hb_swt_addn_inf > repair.txt 2>&1


The table now checks out as healthy with no inconsistencies, and the application's re-run of the import succeeded.

【Summary】

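What does HBCK check?

1. HBase region consistency

a. Every region in the cluster is assigned, and is deployed on exactly one regionserver

b. The region's state is consistent across memory, the hbase:meta table, and ZooKeeper

2. HBase table integrity

For any table in the cluster, every rowkey must fall within exactly one region's key range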
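The table integrity rule can be illustrated with a small check over a sorted list of region boundaries. This is only a sketch of the idea, not how hbck is implemented. It assumes one region per line as "START END", sorted by start key, with non-empty keys that contain no whitespace (real tables have an empty first start key and last end key, which would need placeholders). The name check_region_chain and the sample keys are hypothetical.

```shell
# Read "START END" pairs sorted by START and report gaps or overlaps
# between adjacent regions. Keys are compared as strings in the C locale,
# which approximates HBase's byte-wise rowkey ordering.
# check_region_chain is a hypothetical helper, not an hbck feature.
check_region_chain() {
  LC_ALL=C awk '
    NR > 1 && ($1 "") > (prev_end "") { print "hole between " prev_end " and " $1 }
    NR > 1 && ($1 "") < (prev_end "") { print "overlap between " $1 " and " prev_end }
    { prev_end = $2 }
  '
}

# Example: the third region starts at "h" but the second ended at "f".
printf 'a c\nc f\nh k\n' | check_region_chain   # prints: hole between f and h
```

A hole means some rowkeys belong to no region, and an overlap means some rowkeys belong to more than one. Both break the rule that every rowkey maps to exactly one region.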

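Region inconsistencies mainly fall into the following types:

1. There is a hole in the region chain between X and Y.

This is an HDFS-level problem: the region's .regioninfo (metadata) file does not exist. Repair it with "-fixHdfsHoles".

-fixHdfsHoles: repairs region holes (a key range that is not covered by any region).

2. Found lingering reference file X.

This is almost always left over from a region split; these files are reference (link) files. Repair it with "-fixReferenceFiles".

3. Region X on HDFS, but not listed in hbase:meta or deployed on any region server.

Here the region's actual data exists, but the region has no entry in hbase:meta. Repair it with "-fixMeta" to resynchronize the information.

-fixMeta: mainly repairs inconsistencies between the .regioninfo files and the hbase:meta table. HDFS is treated as the source of truth: if a region exists on HDFS but not in hbase:meta, a row is added to hbase:meta; conversely, if it does not exist on HDFS but does exist in hbase:meta, the corresponding row is deleted from hbase:meta.

4. Region X not deployed on any region server.

Here the region's HFiles and other data are all present; the region is simply not online on any regionserver. Repair it with "-fixAssignments".

-fixAssignments: repairs regions that are unassigned, incorrectly assigned, or assigned to multiple RegionServers at the same time.

5. Repair with "-repairHoles", which is equivalent to -fixAssignments -fixMeta -fixHdfsHoles -fixHdfsOrphans.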
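The mapping from error message to repair option described in the list above can be written down as a small lookup, which is handy when triaging a long report. This is a sketch based only on the message texts quoted in this article; suggest_hbck_flag is a hypothetical helper name, and real hbck output may use wordings not covered here.

```shell
# Suggest an hbck repair option for one ERROR line, following the list above.
# suggest_hbck_flag is a hypothetical helper, not part of hbck.
suggest_hbck_flag() {
  case "$1" in
    *"There is a hole in the region chain"*)   echo "-fixHdfsHoles" ;;
    *"Found lingering reference file"*)        echo "-fixReferenceFiles" ;;
    *"on HDFS, but not listed in hbase:meta"*) echo "-fixMeta" ;;
    # Per the -fixMeta description: in hbase:meta but missing on HDFS.
    *"found in META, but not in HDFS"*)        echo "-fixMeta" ;;
    *"not deployed on any region server"*)     echo "-fixAssignments" ;;
    *)                                         echo "unknown" ;;
  esac
}

# Usage: classify every ERROR line of a saved report (needs a real file):
# grep '^ERROR:' 1.txt | while IFS= read -r line; do suggest_hbck_flag "$line"; done | sort | uniq -c
```

The output is only a hint for which option to look at first; the full error text should still be read before running any repair.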

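Inconsistencies show up in many forms and for different reasons, but generally speaking the underlying causes are regionserver crashes and regions being moved between regionservers.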

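Step 1. hbase hbck: run the check and print all ERROR messages; each ERROR describes the problem it found.

Step 2. hbase hbck -fixTableOrphans: first repair missing tableinfo, regenerating the tableinfo file from the in-memory cache or from the table directory structure on HDFS.

Step 3. hbase hbck -fixHdfsOrphans: repair missing regioninfo, regenerating the .regioninfo file from the HFiles under the region directory.

Step 4. hbase hbck -fixHdfsOverlaps: repair overlapping regions by merging them into a single region directory and generating a new .regioninfo.

Step 5. hbase hbck -fixHdfsHoles: repair missing regions by using the boundaries of the missing rowkey range to create a new region directory and .regioninfo that plug the hole.

Step 6. hbase hbck -fixMeta: repair the meta table, using the .regioninfo information to regenerate the corresponding meta rows and fill in a default regionserver assignment.

Step 7. hbase hbck -fixAssignments: bring the offline regions online; as each region is reopened it is assigned to a real RegionServer, and its row in the meta table is updated.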
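The seven steps above can be laid out as an ordered command plan for one table. The sketch below only prints the commands and does not run them, so each one can be reviewed and run by hand with its output checked before moving on. The name hbck_repair_plan is hypothetical; the options are the ones named in the steps.

```shell
# Print, in order, the hbck commands from steps 1 to 7 for one table.
# Nothing is executed; hbck_repair_plan is a hypothetical helper name.
hbck_repair_plan() (
  table=$1
  echo "hbase hbck $table"
  for flag in -fixTableOrphans -fixHdfsOrphans -fixHdfsOverlaps \
              -fixHdfsHoles -fixMeta -fixAssignments; do
    echo "hbase hbck $flag $table"
  done
)

hbck_repair_plan tbl_glhis_hb_swt_addn_inf
```

Re-running the plain check from step 1 between repair steps shows whether the later options are still needed.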
