
A few Hadoop problems

Preface

    Whether massive data is being stored or computed on, the architecture always has to guarantee high availability. Building a data warehouse is a process of consolidation, while microservices are a process of splitting things apart; as the old saying goes, that which is long divided must unite, and that which is long united must divide.

    Different scenarios call for different technologies. Don't treat any one technology as a silver bullet; it may turn out to be just a flash in the pan.

Hadoop-related problems

Preamble: the namenode high-availability problem

Namenode high availability is implemented with the QJM (quorum journal manager), zkfc, and a ZooKeeper cluster. When the namenode process dies and is restarted, failover happens quickly; but if the whole machine goes down or hangs so that ssh can no longer log in, fencing fails and the other namenode stays in standby forever... sshfence works by logging in over ssh, checking whether the namenode process still exists, and killing it if it does. To make failover succeed even when the machine itself is down, the fencing configuration must be changed as follows:

<property>
        <name>dfs.ha.fencing.methods</name>
        <value>
        sshfence
        shell(/bin/true)
        </value>
</property>
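    With this in place, a graceful failover can be exercised by hand to confirm that fencing now succeeds even when ssh is unreachable. A minimal sketch, assuming namenode IDs nn1 and nn2 (use whatever your dfs.ha.namenodes setting actually defines):

#check which namenode is currently active/standby
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2
#ask for a coordinated failover from nn1 to nn2, then watch the zkfc log for the fencing steps
hdfs haadmin -failover nn1 nn2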

     When no namenode is active, the whole cluster becomes unusable:

#Even read-only operations fail
hdfs dfs -ls hdfs://ns/user
21/03/02 04:21:21 INFO retry.RetryInvocationHandler: Exception while invoking getFileInfo of class ClientNamenodeProtocolTranslatorPB over KEL1/192.168.1.99:9000 after 1 fail over attempts. Trying to fail over after sleeping for 659ms.
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
        at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:87)
        at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1774)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1313)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:3850)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:1011)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:843)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
##namenode log
2021-03-02 05:30:27,301 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Get corrupt file blocks returned error: Operation category READ is not supported in state standby
2021-03-02 05:30:28,171 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Get corrupt file blocks returned error: Operation category READ is not supported in state standby
##zkfc log
2021-03-02 06:14:13,573 WARN org.apache.hadoop.ha.HealthMonitor: Transport-level exception trying to monitor health of NameNode at KEL/192.168.1.10:9000: java.net.SocketTimeoutException: 45000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/192.168.1.10:54563 remote=KEL/192.168.1.10:9000] Call From KEL/192.168.1.10 to KEL:9000 failed on socket timeout exception: java.net.SocketTimeoutException: 45000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/192.168.1.10:54563 remote=KEL/192.168.1.10:9000]; For more details see:  http://wiki.apache.org/hadoop/SocketTimeout
##log of the two fencing attempts during failover
2021-03-02 08:11:18,790 WARN org.apache.hadoop.ha.FailoverController: Unable to gracefully make NameNode at KEL/192.168.1.10:9000 standby (unable to connect)
org.apache.hadoop.net.ConnectTimeoutException: Call From KEL1/192.168.1.99 to KEL:9000 failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=KEL/192.168.1.10:9000]; For more details see:  http://wiki.apache.org/hadoop/SocketTimeout
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792)
        at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:751)
        at org.apache.hadoop.ipc.Client.call(Client.java:1479)
        at org.apache.hadoop.ipc.Client.call(Client.java:1412)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
        at com.sun.proxy.$Proxy9.transitionToStandby(Unknown Source)
        at org.apache.hadoop.ha.protocolPB.HAServiceProtocolClientSideTranslatorPB.transitionToStandby(HAServiceProtocolClientSideTranslatorPB.java:112)
        at org.apache.hadoop.ha.FailoverController.tryGracefulFence(FailoverController.java:172)
        at org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:514)
        at org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:505)
        at org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:61)
        at org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:892)
        at org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:910)
        at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:809)
        at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:418)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
Caused by: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=KEL/192.168.1.10:9000]
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:534)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495)
        at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:614)
        at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:712)
        at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375)
        at org.apache.hadoop.ipc.Client.getConnection(Client.java:1528)
        at org.apache.hadoop.ipc.Client.call(Client.java:1451)
        ... 14 more
2021-03-02 08:11:18,791 INFO org.apache.hadoop.ha.NodeFencer: ====== Beginning Service Fencing Process... ======
2021-03-02 08:11:18,791 INFO org.apache.hadoop.ha.NodeFencer: Trying method 1/2: org.apache.hadoop.ha.SshFenceByTcpPort(null)
2021-03-02 08:11:18,793 INFO org.apache.hadoop.ha.SshFenceByTcpPort: Connecting to KEL...
2021-03-02 08:11:18,793 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Connecting to KEL port 22
2021-03-02 08:11:48,818 WARN org.apache.hadoop.ha.SshFenceByTcpPort: Unable to connect to KEL as user root
com.jcraft.jsch.JSchException: timeout: socket is not established
        at com.jcraft.jsch.Util.createSocket(Util.java:386)
        at com.jcraft.jsch.Session.connect(Session.java:182)
        at org.apache.hadoop.ha.SshFenceByTcpPort.tryFence(SshFenceByTcpPort.java:100)
        at org.apache.hadoop.ha.NodeFencer.fence(NodeFencer.java:97)
        at org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:532)
        at org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:505)
        at org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:61)
        at org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:892)
        at org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:910)
        at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:809)
        at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:418)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
2021-03-02 08:11:48,820 WARN org.apache.hadoop.ha.NodeFencer: Fencing method org.apache.hadoop.ha.SshFenceByTcpPort(null) was unsuccessful.
2021-03-02 08:11:48,820 INFO org.apache.hadoop.ha.NodeFencer: Trying method 2/2: org.apache.hadoop.ha.ShellCommandFencer(/bin/true)
2021-03-02 08:11:48,852 INFO org.apache.hadoop.ha.ShellCommandFencer: Launched fencing command '/bin/true' with pid 5931
2021-03-02 08:11:48,856 INFO org.apache.hadoop.ha.NodeFencer: ====== Fencing successful by method org.apache.hadoop.ha.ShellCommandFencer(/bin/true) ======
2021-03-02 08:11:48,856 INFO org.apache.hadoop.ha.ActiveStandbyElector: Writing znode /hadoop-ha/ns/ActiveBreadCrumb to indicate that the local node is the most recent active...
2021-03-02 08:11:48,863 INFO org.apache.hadoop.ha.ZKFailoverController: Trying to make NameNode at KEL1/192.168.1.99:9000 active...
2021-03-02 08:11:49,387 INFO org.apache.hadoop.ha.ZKFailoverController: Successfully transitioned NameNode at KEL1/192.168.1.99:9000 to active state

    1 Data replica problems

    When checking the namenode log, block-management errors show up: there are not enough replicas to place the data, and the WARN keeps repeating:

2021-03-02 04:35:38,652 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Failed to place enough replicas, still in need of 1 to reach 3 (unavailableStorages=[], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=false) For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy
2021-03-02 04:35:38,653 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Failed to place enough replicas, still in need of 1 to reach 3 (unavailableStorages=[DISK], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=false) For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy

    The message roughly means that the block is supposed to have three replicas, but one more replica is still needed to reach three.

    There are many possible causes. After all, Hadoop runs on inexpensive commodity machines, so both server nodes and disks can fail at any time; a bad disk or a bad machine leaves blocks without enough replicas.

    To keep data reliable, HDFS splits each file into blocks (128 MB by default) and spreads the blocks across different datanodes. When a node has problems, the affected blocks are automatically re-replicated onto healthy nodes.
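    To see the replication factor and block size that actually apply to a given file, hdfs dfs -stat can print them directly. A small sketch; the path /user/test/somefile is only an illustrative example:

#%r = replication factor, %o = block size in bytes, %n = file name
hdfs dfs -stat "replication=%r blocksize=%o name=%n" /user/test/somefile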

    Occasionally you also need to trigger this by hand: if a file that should have three replicas is under-replicated, first set its replication factor to 2 and then back to 3, which forces HDFS to copy the missing replicas.
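    The two-step trick above can be done with hdfs dfs -setrep. A minimal sketch, assuming the affected file is /user/test/somefile (substitute the paths that fsck reports):

#drop the target replication to 2, then raise it back to 3;
#-w waits until the requested replication has actually been reached
hdfs dfs -setrep 2 /user/test/somefile
hdfs dfs -setrep -w 3 /user/test/somefile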

[root@KEL logs]# hdfs fsck /
Connecting to namenode via http://KEL1:50070/fsck?ugi=root&path=%2F
FSCK started by root (auth:SIMPLE) from /192.168.1.10 for path / at Wed Mar 03 13:03:26 CST 2021
.............................................................................................................................................................................................................................................
/tmp/hadoop-yarn/staging/root/.staging/job_1614266885310_0001/job.jar:  Under replicated BP-184102405-192.168.1.10-1612873956948:blk_1073742395_1571. Target Replicas is 10 but found 2 replica(s).
./tmp/hadoop-yarn/staging/root/.staging/job_1614266885310_0001/job.split:  Under replicated BP-184102405-192.168.1.10-1612873956948:blk_1073742396_1572. Target Replicas is 10 but found 3 replica(s).
.....
./tmp/hadoop-yarn/staging/root/.staging/job_1614529282649_0001/job.jar:  Under replicated BP-184102405-192.168.1.10-1612873956948:blk_1073747250_6426. Target Replicas is 10 but found 3 replica(s).
./tmp/hadoop-yarn/staging/root/.staging/job_1614529282649_0001/job.split:  Under replicated BP-184102405-192.168.1.10-1612873956948:blk_1073747251_6427. Target Replicas is 10 but found 3 replica(s).
...
./tmp/hadoop-yarn/staging/root/.staging/job_1614648088274_0001/job.jar:  Under replicated BP-184102405-192.168.1.10-1612873956948:blk_1073747449_6625. Target Replicas is 10 but found 3 replica(s).
./tmp/hadoop-yarn/staging/root/.staging/job_1614648088274_0001/job.split:  Under replicated BP-184102405-192.168.1.10-1612873956948:blk_1073747450_6626. Target Replicas is 10 but found 3 replica(s).
...........................................................................................................................................................................................................................................................................................................................................................................................
Status: HEALTHY
 Total size:  577584452 B
 Total dirs:  300
 Total files:  626
 Total symlinks:    0
 Total blocks (validated):  521 (avg. block size 1108607 B)
 Minimally replicated blocks:  521 (100.0 %)
 Over-replicated blocks:  0 (0.0 %)
 Under-replicated blocks:  6 (1.1516315 %)
 Mis-replicated blocks:    0 (0.0 %)
 Default replication factor:  3
 Average block replication:  2.9980807
 Corrupt blocks:    0
 Missing replicas:    43 (2.6791277 %)
 Number of data-nodes:    3
 Number of racks:    1
FSCK ended at Wed Mar 03 13:03:26 CST 2021 in 70 milliseconds
The filesystem under path '/' is HEALTHY

    You can use hdfs fsck to check the files; it verifies whether every block has the required number of replicas. As shown above, a few files are under-replicated. The cause here is that some jobs failed partway through and their staging data was never cleaned up.
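    To see exactly which files and blocks are affected, rather than just the summary totals, fsck can be asked for per-file detail. A sketch:

#list files, their blocks and the datanodes holding each replica for everything under /tmp
hdfs fsck /tmp -files -blocks -locations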

#Manually set the replication factor of all files back to 3
[root@KEL module]# hdfs dfs -setrep -w 3 -R /
#It ran for a long time without ever finishing:
Waiting for /tmp/hadoop-yarn/staging/history/done/2021/03/02/000000/job_1614665577456_0002_conf.xml ... done
Waiting for /tmp/hadoop-yarn/staging/root/.staging/job_1614266885310_0001/job.jar

    Since the files under the tmp directory never finished, I simply deleted them:

[root@KEL module]# hdfs dfs -rm -r /tmp/hadoop-yarn/staging/
21/03/03 21:51:42 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted /tmp/hadoop-yarn/staging
#After deleting the temporary files, check again
[root@KEL module]# hdfs fsck /
Connecting to namenode via http://KEL1:50070/fsck?ugi=root&path=%2F
FSCK started by root (auth:SIMPLE) from /192.168.1.10 for path / at Wed Mar 03 13:41:53 CST 2021
..........................................................................................................................................................................................................................................................................................................................................................................................
Status: HEALTHY
 Total size:  533172477 B
 Total dirs:  268
 Total files:  378
 Total symlinks:    0
 Total blocks (validated):  274 (avg. block size 1945884 B)
 Minimally replicated blocks:  274 (100.0 %)
 Over-replicated blocks:  0 (0.0 %)
 Under-replicated blocks:  0 (0.0 %)
 Mis-replicated blocks:    0 (0.0 %)
 Default replication factor:  3
 Average block replication:  3.0
 Corrupt blocks:    0
 Missing replicas:    0 (0.0 %)
 Number of data-nodes:    3
 Number of racks:    1
FSCK ended at Wed Mar 03 13:41:53 CST 2021 in 76 milliseconds
The filesystem under path '/' is HEALTHY

    Afterwards, viewing the job history files produced an error (historyserver log):

2021-03-03 21:56:50,962 ERROR org.apache.hadoop.mapreduce.v2.hs.JobHistory: Error while scanning intermediate done dir
java.io.FileNotFoundException: File /tmp/hadoop-yarn/staging/history/done_intermediate does not exist.
        at org.apache.hadoop.fs.Hdfs$DirListingIterator.<init>(Hdfs.java:211)
        at org.apache.hadoop.fs.Hdfs$DirListingIterator.<init>(Hdfs.java:195)
        at org.apache.hadoop.fs.Hdfs$2.<init>(Hdfs.java:177)
        at org.apache.hadoop.fs.Hdfs.listStatusIterator(Hdfs.java:177)
        at org.apache.hadoop.fs.FileContext$22.next(FileContext.java:1494)
        at org.apache.hadoop.fs.FileContext$22.next(FileContext.java:1489)
        at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
        at org.apache.hadoop.fs.FileContext.listStatus(FileContext.java:1489)
        at org.apache.hadoop.mapreduce.v2.jobhistory.JobHistoryUtils.listFilteredStatus(JobHistoryUtils.java:505)
        at org.apache.hadoop.mapreduce.v2.jobhistory.JobHistoryUtils.localGlobber(JobHistoryUtils.java:452)
        at org.apache.hadoop.mapreduce.v2.jobhistory.JobHistoryUtils.localGlobber(JobHistoryUtils.java:444)
        at org.apache.hadoop.mapreduce.v2.jobhistory.JobHistoryUtils.localGlobber(JobHistoryUtils.java:439)
        at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.scanIntermediateDirectory(HistoryFileManager.java:795)
        at org.apache.hadoop.mapreduce.v2.hs.JobHistory$MoveIntermediateToDoneRunnable.run(JobHistory.java:189)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

    The history web page returned a 500 error... With MapReduce log aggregation enabled, the history files are stored under this tmp path by default, so deleting it broke the history server. (When automatic re-replication does not kick in, trigger the migration manually instead of deleting data.)
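    If the staging history directories get deleted like this, one way to recover is simply to recreate the paths the history server expects and restart it. A sketch, assuming the default locations shown in the error above (check mapreduce.jobhistory.intermediate-done-dir and mapreduce.jobhistory.done-dir if they were customized, and verify the permissions that fit your setup):

#recreate the directories the history server scans
hdfs dfs -mkdir -p /tmp/hadoop-yarn/staging/history/done_intermediate
hdfs dfs -mkdir -p /tmp/hadoop-yarn/staging/history/done
#permissions may need to be relaxed so that jobs can write their history files here
hdfs dfs -chmod -R 1777 /tmp/hadoop-yarn/staging/history/done_intermediate
#restart the history server
mr-jobhistory-daemon.sh stop historyserver
mr-jobhistory-daemon.sh start historyserver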

    In the hdfs-site.xml configuration file, the dfs.replication parameter sets the number of replicas; the default is 3.
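    To confirm which value is actually in effect on a node, rather than what you think is configured, a quick check:

#print the effective replication factor from the loaded configuration
hdfs getconf -confKey dfs.replication

    Note that dfs.replication is a client-side default applied when a file is created; files already written keep whatever replication they were created with, or whatever setrep assigned later.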

    2 Mounting multiple directories for HDFS

    This usually comes up when a cluster has just gone live, or when a server's disk has failed and a new disk has to be added.

#Partition the new disk
[root@KEL ~]# fdisk /dev/sdb
Welcome to fdisk (util-linux 2.23.2).

Changes will remain in memory only, until you decide to write them.
Be careful before using the write command.

Device does not contain a recognized partition table
Building a new DOS disklabel with disk identifier 0x4670e170.
Command (m for help): n
Partition type:
   p   primary (0 primary, 0 extended, 4 free)
   e   extended
Select (default p): p
Partition number (1-4, default 1): 
First sector (2048-20971519, default 2048): 
Using default value 2048
Last sector, +sectors or +size{K,M,G} (2048-20971519, default 20971519): 
Using default value 20971519
Partition 1 of type Linux and of size 10 GiB is set
Command (m for help): w
The partition table has been altered!
#Check the partitioning result
[root@KEL ~]# fdisk -l
Disk /dev/sdb: 10.7 GB, 10737418240 bytes, 20971520 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk label type: dos
Disk identifier: 0x4670e170

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1            2048    20971519    10484736   83  Linux

    Then format the disk:

[root@KEL data2]# mkfs.ext3 /dev/sdb1
mke2fs 1.42.9 (28-Dec-2013)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=0 blocks, Stripe width=0 blocks
655360 inodes, 2621184 blocks
131059 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=2684354560
80 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks: 
  32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632
Allocating group tables: done
Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done
#Create the mount directory
[root@KEL hadoop-2.7.2]# mkdir data2
#Add an entry to fstab so the mount comes up at boot
[root@KEL data2]# cat /etc/fstab 
#
# /etc/fstab
# Created by anaconda on Sat Feb  6 18:18:57 2021
#
# Accessible filesystems, by reference, are maintained under '/dev/disk'
# See man pages fstab(5), findfs(8), mount(8) and/or blkid(8) for more info
#
/dev/mapper/rhel-root   /                       xfs     defaults        0 0
UUID=a6f53869-d27b-4a99-94b5-4ec61040d008 /boot                   xfs     defaults        0 0
/dev/mapper/rhel-swap   swap                    swap    defaults        0 0
/dev/sdb1  /opt/module/hadoop-2.7.2/data2 ext3  defaults  0  0
#Test the mount
[root@KEL data2]# mount -a
#Modify the hdfs configuration file
[root@KEL hadoop]# vim hdfs-site.xml
<property>
        <name>dfs.datanode.data.dir</name>
        <value>file://${hadoop.tmp.dir}/dfs/data,file:///opt/module/hadoop-2.7.2/data2</value>
</property>
#Start the related processes on this node
[root@KEL hadoop]# start-dfs.sh 
#Check the result (mainly verify that the clusterID and other metadata match between directories)
[root@KEL current]# pwd
/opt/module/hadoop-2.7.2/data2/current
[root@KEL current]# cat VERSION 
#Wed Mar 03 17:07:06 CST 2021
storageID=DS-72c239ca-daff-4655-b95a-4e5a50bd28a8
clusterID=CID-073cc3af-78f6-4716-993b-92a0b398291a
cTime=0
datanodeUuid=1c4ffb79-0fae-42a1-af32-65fecc962cda
storageType=DATA_NODE
layoutVersion=-56
[root@KEL current]# cd -
/opt/module/hadoop-2.7.2/data/tmp/dfs/data/current
[root@KEL current]# cat VERSION 
#Wed Mar 03 17:07:06 CST 2021
storageID=DS-ff23d5a7-fe98-404b-a5b9-5d449e3b31d2
clusterID=CID-073cc3af-78f6-4716-993b-92a0b398291a
cTime=0
datanodeUuid=1c4ffb79-0fae-42a1-af32-65fecc962cda
storageType=DATA_NODE
layoutVersion=-56
[root@KEL current]# pwd
/opt/module/hadoop-2.7.2/data/tmp/dfs/data/current
#Check whether the capacity has increased
[root@KEL hadoop]# hdfs dfsadmin -report
-------------------------------------------------
Live datanodes (3):

Name: 192.168.1.10:50010 (KEL)
Hostname: KEL
Decommission Status : Normal
Configured Capacity: 29180092416 (27.18 GB)
DFS Used: 788815872 (752.27 MB)
Non DFS Used: 3924152320 (3.65 GB)
DFS Remaining: 24467124224 (22.79 GB)
DFS Used%: 2.70%
DFS Remaining%: 83.85%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Wed Mar 03 09:12:20 CST 2021
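    Besides the dfsadmin report above, a quick way to confirm the new directory is really in use is to check the mount and look for block files appearing under it. A sketch using the paths from this setup; adjust them to your own layout:

#the new data directory should show up as its own mount with the expected size
df -h /opt/module/hadoop-2.7.2/data2
#once the datanode registers the volume and data is written, block files appear under its block-pool directory
find /opt/module/hadoop-2.7.2/data2/current -name 'blk_*' | head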

The increased capacity as seen in the web UI:


    3 Unbalanced data

        After a new machine joins the cluster, the data becomes unbalanced: the newly added node has low load and stores little data, so a balance operation is needed.

#Start a balancer; -threshold 1 means each datanode's utilization may deviate from the cluster average by at most 1%
[root@KEL sbin]# ./start-balancer.sh -threshold 1
starting balancer, logging to /opt/module/hadoop-2.7.2/logs/hadoop-root-balancer-KEL.out
[root@KEL sbin]# vim ../logs/hadoop-root-balancer-KEL.log 
[root@KEL sbin]# jps
10121 Balancer
2021-03-03 17:47:38,546 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/192.168.1.199:50010
2021-03-03 17:47:38,546 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/192.168.1.99:50010
2021-03-03 17:47:38,546 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/192.168.1.10:50010
2021-03-03 17:47:38,547 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: 1 over-utilized: [192.168.1.199:50010:DISK]
2021-03-03 17:47:38,547 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: 0 underutilized: []
2021-03-03 17:47:38,547 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: Need to move 7.59 MB to make the cluster balanced.
2021-03-03 17:47:38,547 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: Decided to move 27.38 MB bytes from 192.168.1.199:50010:DISK to 192.168.1.99:50010:DISK
2021-03-03 17:47:38,547 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: Decided to move 151.40 MB bytes from 192.168.1.199:50010:DISK to 192.168.1.10:50010:DISK
2021-03-03 17:47:38,547 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: Will move 178.78 MB in this iteration
2021-03-03 17:47:47,568 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/192.168.1.10:50010
2021-03-03 17:47:47,569 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/192.168.1.99:50010
2021-03-03 17:47:47,569 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/192.168.1.199:50010

    The balancing result as seen in the web UI:


    Once the data is roughly balanced, stop the balancer manually; otherwise the heavy data movement degrades the performance of running jobs.

[root@KEL sbin]# ./stop-balancer.sh      
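    If stopping the balancer outright feels too blunt, its impact on running jobs can also be limited by capping the bandwidth each datanode may spend on balancing. A sketch; 10 MB/s is shown purely as an example value:

#value is in bytes per second; it applies until the datanodes restart or it is set again
hdfs dfsadmin -setBalancerBandwidth 10485760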

    4 ZooKeeper fails to start after a forced shutdown

    After a forced power-off, a stale pid file is left behind, so the startup script still finds a pid, mistakenly assumes the process is running, and refuses to start it.

#status says it is not running
[root@KEL2 zkdata]# /opt/module/zookeeper-3.4.12/bin/zkServer.sh status /opt/module/zookeeper-3.4.12/zkdata/zoo.cfg 
ZooKeeper JMX enabled by default
Using config: /opt/module/zookeeper-3.4.12/zkdata/zoo.cfg
Error contacting service. It is probably not running.
#start finds the pid file and thinks the process is already running
[root@KEL2 zkdata]# /opt/module/zookeeper-3.4.12/bin/zkServer.sh start /opt/module/zookeeper-3.4.12/zkdata/zoo.cfg 
ZooKeeper JMX enabled by default
Using config: /opt/module/zookeeper-3.4.12/zkdata/zoo.cfg
Starting zookeeper ... already running as process 2260.

    Two ways to fix it:

#Find the pid file, delete it, and then start again.
[root@KEL2 zkdata]# grep data zoo.cfg 
dataDir=/opt/module/zookeeper-3.4.12/zkdata
[root@KEL2 zkdata]# ls -l
total 16
-rw-r--r-- 1 root root    2 Feb  9 18:31 myid
drwxr-xr-x 2 root root 4096 Feb 17 19:38 version-2
-rw-r--r-- 1 root root  170 Feb  9 18:38 zoo.cfg
-rw-r--r-- 1 root root    4 Feb 16 18:36 zookeeper_server.pid
[root@KEL2 zkdata]# cat zookeeper_server.pid 
2260
[root@KEL2 zkdata]# rm -rf zookeeper_server.pid 
[root@KEL2 zkdata]# /opt/module/zookeeper-3.4.12/bin/zkServer.sh start /opt/module/zookeeper-3.4.12/zkdata/zoo.cfg 
ZooKeeper JMX enabled by default
Using config: /opt/module/zookeeper-3.4.12/zkdata/zoo.cfg
Starting zookeeper ... STARTED

    Or simply run stop first and then start again.

    ZooKeeper is not the only process that hits this at startup; other Hadoop daemons can run into the same stale-pid-file problem, and the same stop-then-start approach works for them too.
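    A small sketch of a safer cleanup along these lines: only remove the pid file when no process with that pid is actually alive (paths taken from this setup):

PIDFILE=/opt/module/zookeeper-3.4.12/zkdata/zookeeper_server.pid
#kill -0 only tests whether the process exists; it sends no signal
if [ -f "$PIDFILE" ] && ! kill -0 "$(cat "$PIDFILE")" 2>/dev/null; then
    rm -f "$PIDFILE"
    /opt/module/zookeeper-3.4.12/bin/zkServer.sh start /opt/module/zookeeper-3.4.12/zkdata/zoo.cfg
fi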