HBase Region Server hijacked – Failed transition from OFFLINE to OPENING

Recently, during my tests, my HBase cluster crashed. The master would not start up, and the region servers reported that their regions were possibly hijacked, i.e. removed.

At first I simply tried starting the servers again, but they would not come up.

HBase Master:

2015-11-23 23:56:59,100 FATAL [test-ambari-h-147:16000.activeMasterManager] master.HMaster: Failed to become active master


java.lang.RuntimeException: java.lang.NullPointerException
    at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:208)
    at org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:320)
    at org.apache.hadoop.hbase.client.ClientScanner.nextScanner(ClientScanner.java:295)
    at org.apache.hadoop.hbase.client.ClientScanner.initializeScannerInConstruction(ClientScanner.java:160)
    at org.apache.hadoop.hbase.client.ClientScanner.<init>(ClientScanner.java:155)
    at org.apache.hadoop.hbase.client.HTable.getScanner(HTable.java:821)
    at org.apache.hadoop.hbase.MetaTableAccessor.fullScan(MetaTableAccessor.java:602)
    at org.apache.hadoop.hbase.MetaTableAccessor.fullScanOfMeta(MetaTableAccessor.java:143)
    at org.apache.hadoop.hbase.MetaMigrationConvertingToPB.isMetaTableUpdated(MetaMigrationConvertingToPB.java:163)
    at org.apache.hadoop.hbase.MetaMigrationConvertingToPB.updateMetaIfNecessary(MetaMigrationConvertingToPB.java:130)
    at org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:758)
    at org.apache.hadoop.hbase.master.HMaster.access$500(HMaster.java:182)
    at org.apache.hadoop.hbase.master.HMaster$1.run(HMaster.java:1646)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
    at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.getMetaReplicaNodes(ZooKeeperWatcher.java:395)
    at org.apache.hadoop.hbase.zookeeper.MetaTableLocator.blockUntilAvailable(MetaTableLocator.java:553)
    at org.apache.hadoop.hbase.client.ZooKeeperRegistry.getMetaRegionLocation(ZooKeeperRegistry.java:61)
    at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateMeta(ConnectionManager.java:1185)
    at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1152)
    at org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.getRegionLocations(RpcRetryingCallerWithReadReplicas.java:300)
    at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:151)
    at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:59)
    at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200)

I looked at ZooKeeper and found that the node '/hbase-unsecure', i.e. the parent node for HBase, was missing.

So I went in and created it manually.

Then I started the master. It came up and began waiting for the region servers to check in.


....
2015-11-23 23:16:18,134 INFO  [test-ambari-h-147:16000.activeMasterManager] master.ServerManager: Waiting for region servers count to settle; currently checked in 0, slept for 0 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.
....

HBase Region Server:

The region servers started up and tried to move the regions from OFFLINE to OPENING.

The HBase Master UI showed the regions in transition.


[Screenshot: HBase Master UI showing the regions in transition]

After a while, the region server gave up and logged the following warnings:

WARN [RS_OPEN_REGION-test-ambari-jn-12:16020-1] zookeeper.ZKAssign: regionserver:16020-0x151379ff470012d, quorum=zookeeper.internal:2181, baseZNode=/hbase-unsecure Attempt to transition the unassigned node for 45f2d52bc925c1645e33242ba3b4bf30 from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING failed, the server that tried to transition was ip-10-2-12-12.internal,16020,1448338125490 not the expected test-ambari-jn-12,16020,1448338125490

WARN [RS_OPEN_REGION-test-ambari-jn-12:16020-1] coordination.ZkOpenRegionCoordination: Failed transition from OFFLINE to OPENING for region=45f2d52bc925c1645e33242ba3b4bf30

WARN [RS_OPEN_REGION-test-ambari-jn-12:16020-1] handler.OpenRegionHandler: Region was hijacked? Opening cancelled for encodedName=45f2d52bc925c1645e33242ba3b4bf3
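
The first warning carries the whole diagnosis in its tail, so it can help to pull the two server names apart. Below is a minimal sketch of how that could be done, assuming the message keeps the wording shown above and that each server name has the form host,port,startcode. The function names and the LINE constant are made up for this illustration and are not part of HBase.

```python
import re

# Matches the tail of the ZKAssign warning quoted above.
PATTERN = re.compile(
    r"the server that tried to transition was (?P<actual>\S+) "
    r"not the expected (?P<expected>\S+)"
)


def parse_server_name(name):
    """Split 'host,port,startcode' into its three parts."""
    host, port, startcode = name.split(",")
    return host, int(port), int(startcode)


def explain_mismatch(line):
    """Return both hosts and whether port and startcode match, or None."""
    match = PATTERN.search(line)
    if match is None:
        return None
    actual = parse_server_name(match.group("actual"))
    expected = parse_server_name(match.group("expected"))
    return {
        "actual_host": actual[0],
        "expected_host": expected[0],
        # Same port and startcode suggests one process known under two names.
        "same_process": actual[1:] == expected[1:],
    }


# Tail of the warning from this post, shortened for readability.
LINE = (
    "Attempt to transition the unassigned node failed, "
    "the server that tried to transition was "
    "ip-10-2-12-12.internal,16020,1448338125490 "
    "not the expected test-ambari-jn-12,16020,1448338125490"
)

print(explain_mismatch(LINE))
```

If the port and startcode are identical and only the host part differs, as here, the warning is most likely about one region server process seen under two names rather than about two competing servers.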

Now the interesting part: the master expected the region to be opened by
test-ambari-jn-12
but the server that tried the transition identified itself as
ip-10-2-12-12.internal

For us, test-ambari-jn-12 and ip-10-2-12-12.internal were the same machine: test-ambari-jn-12 was a CNAME and ip-10-2-12-12.internal was the hostname.

Both names pointed to the same box, but it seems the name resolution failed, i.e. HBase recognized the same box as two different region servers.
[Screenshot]
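
To confirm that two names really are one box, one could compare the addresses they resolve to. This is a rough sketch using only Python's standard socket module. The helper names are hypothetical, and the resolver argument exists only so the check can be tried without touching real DNS.

```python
import socket


def resolve_all(name, resolver=socket.gethostbyname_ex):
    """Return the set of IPv4 addresses for a name (empty set on failure)."""
    try:
        _canonical, _aliases, addresses = resolver(name)
    except OSError:  # socket.gaierror and socket.herror are OSError subclasses
        return set()
    return set(addresses)


def same_box(name_a, name_b, resolver=socket.gethostbyname_ex):
    """True if the two names share at least one resolved address."""
    return bool(resolve_all(name_a, resolver) & resolve_all(name_b, resolver))


# Usage on a region server host (performs real lookups):
# same_box("test-ambari-jn-12", "ip-10-2-12-12.internal")
```

A True result only confirms that both names lead to the same machine. Judging by the warning, HBase compares the server names themselves, so the two names still have to resolve consistently on every node.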

Resolution:

1) Stop the HBase Master and the Region Servers.
2) Create the HBase parent node in ZooKeeper. For the Hortonworks Ambari distribution it is '/hbase-unsecure'.
3) Start the HBase Master.
4) Add an entry to /etc/hosts on each region server so that the CNAME and the hostname resolve to the same box.
5) Restart the region servers.
6) They should connect and move the regions around. After that I was able to get my data back.
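
For step 4, a quick way to check the result is to verify that both names end up on the same address in /etc/hosts. The sketch below parses hosts-style text. The address 10.2.12.12 and the order of the names in SAMPLE are assumptions made for illustration (the post does not show the actual entry), and the helper names are hypothetical.

```python
def parse_hosts(text):
    """Map each name in /etc/hosts-style text to the address on its line."""
    mapping = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        address, *names = line.split()
        for name in names:
            mapping.setdefault(name, address)  # keep the first entry for a name
    return mapping


def share_address(text, name_a, name_b):
    """True if both names are present and mapped to the same address."""
    mapping = parse_hosts(text)
    return name_a in mapping and mapping.get(name_a) == mapping.get(name_b)


# Assumed address and name order, for illustration only.
SAMPLE = """
127.0.0.1   localhost
10.2.12.12  ip-10-2-12-12.internal  test-ambari-jn-12
"""

print(share_address(SAMPLE, "test-ambari-jn-12", "ip-10-2-12-12.internal"))
```

On a real node the text would come from reading /etc/hosts instead of SAMPLE.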

HBase version 1.1 – Hortonworks 2.3

NOTE: This is just one scenario in which my regions were reported as hijacked.
