HBase Region Server hijacked – Failed transition from OFFLINE to OPENING


Recently, during my tests, my HBase cluster crashed. The master did not want to start up, and the region servers reported that their regions were possibly hijacked, i.e. removed.

At first I tried restarting the servers, but they would not come up.

HBase Master:

2015-11-23 23:56:59,100 FATAL [test-ambari-h-147:16000.activeMasterManager] master.HMaster: Failed to become active master

java.lang.RuntimeException: java.lang.NullPointerException
    at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:208)
    at org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:320)
    at org.apache.hadoop.hbase.client.ClientScanner.nextScanner(ClientScanner.java:295)
    at org.apache.hadoop.hbase.client.ClientScanner.initializeScannerInConstruction(ClientScanner.java:160)
    at org.apache.hadoop.hbase.client.ClientScanner.<init>(ClientScanner.java:155)
    at org.apache.hadoop.hbase.client.HTable.getScanner(HTable.java:821)
    at org.apache.hadoop.hbase.MetaTableAccessor.fullScan(MetaTableAccessor.java:602)
    at org.apache.hadoop.hbase.MetaTableAccessor.fullScanOfMeta(MetaTableAccessor.java:143)
    at org.apache.hadoop.hbase.MetaMigrationConvertingToPB.isMetaTableUpdated(MetaMigrationConvertingToPB.java:163)
    at org.apache.hadoop.hbase.MetaMigrationConvertingToPB.updateMetaIfNecessary(MetaMigrationConvertingToPB.java:130)
    at org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:758)
    at org.apache.hadoop.hbase.master.HMaster.access$500(HMaster.java:182)
    at org.apache.hadoop.hbase.master.HMaster$1.run(HMaster.java:1646)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
    at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.getMetaReplicaNodes(ZooKeeperWatcher.java:395)
    at org.apache.hadoop.hbase.zookeeper.MetaTableLocator.blockUntilAvailable(MetaTableLocator.java:553)
    at org.apache.hadoop.hbase.client.ZooKeeperRegistry.getMetaRegionLocation(ZooKeeperRegistry.java:61)
    at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateMeta(ConnectionManager.java:1185)
    at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1152)
    at org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.getRegionLocations(RpcRetryingCallerWithReadReplicas.java:300)
    at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:151)
    at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:59)
    at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200)
I looked at ZooKeeper, and the node '/hbase-unsecure', i.e. the parent node for HBase, was missing.

So I went in and created it manually.
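For reference, recreating the missing parent znode can be sketched with the plain ZooKeeper Java client (an untested sketch, equivalent to a `create /hbase-unsecure ''` from the ZooKeeper CLI; the quorum address is the one from the logs above and should be substituted with your own):

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Sketch: recreate the missing HBase parent znode.
// "zookeeper.internal:2181" is the quorum from the log above - substitute yours.
public class CreateHBaseParentZnode {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("zookeeper.internal:2181", 30000, event -> { });
        if (zk.exists("/hbase-unsecure", false) == null) {
            // empty data, open ACL, persistent node
            zk.create("/hbase-unsecure", new byte[0],
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }
        zk.close();
    }
}
```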

Then I started the master; it came up and began waiting for the region servers.

2015-11-23 23:16:18,134 INFO  [test-ambari-h-147:16000.activeMasterManager] master.ServerManager: Waiting for region servers count to settle; currently checked in 0, slept for 0 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.

HBase Region Server:

The region servers started up and tried to transition the regions from OFFLINE to OPENING.

The HBase master UI showed the regions in TRANSITION.




After a while, the region server gives up and throws an error:

WARN [RS_OPEN_REGION-test-ambari-jn-12:16020-1] 
zookeeper.ZKAssign: regionserver:16020-0x151379ff470012d, quorum=zookeeper.internal:2181, baseZNode=/hbase-unsecure Attempt to transition the unassigned node for 45f2d52bc925c1645e33242ba3b4bf30 from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING failed, the server that tried to transition was ip-10-2-12-12.internal,16020,1448338125490 not the expected test-ambari-jn-12,16020,1448338125490

WARN [RS_OPEN_REGION-test-ambari-jn-12:16020-1] 
coordination.ZkOpenRegionCoordination: Failed transition from OFFLINE to OPENING for region=45f2d52bc925c1645e33242ba3b4bf30

WARN [RS_OPEN_REGION-test-ambari-jn-12:16020-1] 
handler.OpenRegionHandler: Region was hijacked? Opening cancelled for encodedName=45f2d52bc925c1645e33242ba3b4bf3

Now the interesting part: the server that tried the transition was ip-10-2-12-12.internal, while the expected server was test-ambari-jn-12.

But for us, both test-ambari-jn-12 and ip-10-2-12-12.internal were the same machine: test-ambari-jn-12 was a CNAME, and ip-10-2-12-12.internal was the hostname.

They pointed to the same box; it seems the name resolution failed, i.e. HBase recognized the same box as two different region servers.


1) Stop the HBase master and region servers.
2) Create the HBase parent node in ZooKeeper. For the Hortonworks Ambari distribution it is '/hbase-unsecure'.
3) Start the HBase master.
4) Add an entry to /etc/hosts on each region server so that the CNAME and the hostname resolve to the same address (e.g. `<ip-address>  ip-10-2-12-12.internal  test-ambari-jn-12`).
5) Restart the region servers.
6) They should connect and move the regions around; I was able to get my data back.

HBase version 1.1 – Hortonworks 2.3

NOTE: This is one scenario where my regions were hijacked.

MySQL Aurora CPU spikes

Recently we encountered an issue with Aurora where our CPU started spiking every X minutes (X < 30).


We had around 20–30 connections, the majority of which were inserts, with a few selects.

Every time the spike occurred, our inserts would stall, waiting and waiting.

Here is the innotop output:


We checked our code; nothing major had changed.

The MySQL error log had this mysterious message, which I could not trace in the MySQL/Percona/MariaDB source:

“Innodb: Retracted crabbing in btr_cur_search_to_nth_level 33001 times since restart”

If you know the meaning of this, please do let me know; I would love to understand it.


Finally, we contacted the Aurora support team; after their investigation, it turned out to be an issue in Aurora itself.

I am not too sure whether the above error-log message had any bearing on this case.

PS: If you are wondering why this post ends abruptly, I am a little surprised by the ending too.


Should I worry about the Query Cache in Aurora?

There are a lot of blog posts on the internet that warn you about using the query cache in MySQL.

I was surprised to see that the query cache was enabled in Aurora.


This was the size on a ‘db.r3.large’ instance.

On a ‘db.r3.2xlarge’ instance, it was set to 2460900352 bytes, i.e. roughly 2.4 GB.

I am not sure if Amazon has done something to improve the query cache.

So do run your own tests with Aurora and see if the query cache suits your workload.
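One way to run such a test is to sample the standard MySQL query-cache status counters while your workload runs. A hedged JDBC sketch (the endpoint and credentials are placeholders for your own Aurora instance):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Sketch: sample the standard MySQL query-cache status counters.
// A rising Qcache_lowmem_prunes or a low hit rate suggests the cache is not helping.
public class QueryCacheCheck {
    public static void main(String[] args) throws Exception {
        try (Connection c = DriverManager.getConnection(
                 "jdbc:mysql://your-aurora-endpoint:3306/", "user", "password");
             Statement s = c.createStatement();
             ResultSet rs = s.executeQuery(
                 "SHOW GLOBAL STATUS WHERE Variable_name IN "
                 + "('Qcache_hits','Qcache_inserts','Qcache_lowmem_prunes','Com_select')")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + " = " + rs.getString(2));
            }
        }
    }
}
```

The classic rule of thumb: the hit rate is roughly Qcache_hits / (Qcache_hits + Com_select); if it stays low, the cache is mostly overhead.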


Storm Spout Coordinator – IllegalStateException – Deactivate/Activate Topology

Recently, while working with transactional topologies on Storm 0.9.4, we came across this error in the spout:

Expecting previous txid state to be the previous transaction

We were a little confused as to why this happened, as the code change was minimal.

Root Cause Analysis:

Normally the Coordinator's ‘isReady()’ method returns true. We changed it to return false, i.e. we wanted to implement deactivation/activation of the topology instead of killing it cold, i.e. a graceful stop.

In addition to that, our topology had

conf.put(Config.TOPOLOGY_MAX_SPOUT_PENDING, 10);

i.e. the number of parallel in-flight transactions was set to 10.

What happens when ‘isReady()’ returns false?

As per the Coordinator Javadoc, the next transaction ids are skipped, i.e. they will not be used; refer to TransactionalSpoutCoordinator.java:

if(_activeTx.size() < _maxTransactionActive) {
    BigInteger curr = _currTransaction;
    for(int i=0; i<_maxTransactionActive; i++) {
        if((_coordinatorState.hasCache(curr) || _coordinator.isReady())
                && !_activeTx.containsKey(curr)) {
            Object state = _coordinatorState.getState(curr, _initializer);
            // ... emit the batch for this attempt ...
        }
        curr = nextTransactionId(curr);   // <------- txn++ : the id advances even when skipped
    }
}

What happens when ‘isReady()’ returns true, i.e. we activate the topology again?

The same logic from above runs, i.e.

Object state = _coordinatorState.getState(curr, _initializer);

because we have set

conf.put(Config.TOPOLOGY_MAX_SPOUT_PENDING, 10); 
// hence it enters the for loop above

Now if we go into the getState() method:

if(_strictOrder) {
    if(prev!=null && !prev.equals(txid.subtract(BigInteger.ONE))) {
        throw new IllegalStateException("Expecting previous txid state to be the previous transaction");

Let's say we deactivate the topology, making isReady() return false.

Assume the previously committed txn is 3, and the current one is, say, 6 after some skipping.

From the above check, 6 - 1 is not equal to 3; STRICT order must be maintained.

Storm expects the next txn to be 4, since 4 - 1 == 3, but due to the skipping this is violated.

Hence the exception, since we have set

conf.put(Config.TOPOLOGY_MAX_SPOUT_PENDING, 10);

If this is set to 1, this will not happen, as it will never enter the for loop.

I guess this is a bug.
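The failing check is easy to reproduce in isolation. A minimal, Storm-free sketch of the strict-order test quoted from getState() above (`checkStrictOrder` is my own stand-in name for that check):

```java
import java.math.BigInteger;

// Plain-Java reproduction of the strict-order check quoted above:
// if txids were skipped while the topology was deactivated, the previous
// committed txid no longer equals curr - 1 and the check throws.
public class StrictOrderDemo {
    static void checkStrictOrder(BigInteger prev, BigInteger curr) {
        if (prev != null && !prev.equals(curr.subtract(BigInteger.ONE))) {
            throw new IllegalStateException(
                "Expecting previous txid state to be the previous transaction");
        }
    }

    public static void main(String[] args) {
        BigInteger prev = BigInteger.valueOf(3);
        checkStrictOrder(prev, BigInteger.valueOf(4));     // 4 - 1 == 3: passes
        try {
            checkStrictOrder(prev, BigInteger.valueOf(6)); // 6 - 1 != 3: throws
        } catch (IllegalStateException e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```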

Work Around:

Deactivate your topology by making isReady() return false, wait for the data being processed to drain to zero, and then kill your topology.

or set

conf.put(Config.TOPOLOGY_MAX_SPOUT_PENDING, 1);
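The first work-around can be sketched as a coordinator whose isReady() is driven by a flag. This is an untested sketch against the Storm 0.9 transactional API; how the flag gets flipped (JMX, a znode, a file) and the transaction metadata are left open:

```java
import java.math.BigInteger;
import backtype.storm.transactional.ITransactionalSpout;

// Untested sketch: a coordinator that can be "deactivated" so the topology
// drains in-flight transactions before being killed. Do not flip the flag
// back to true - per the analysis above, reactivating after skipping txids
// trips the strict-order check.
public class DrainableCoordinator implements ITransactionalSpout.Coordinator<Object> {
    private volatile boolean active = true;

    public void deactivate() { active = false; }  // stop starting new transactions

    @Override
    public boolean isReady() {
        return active;  // false => Storm stops initiating new transactions
    }

    @Override
    public Object initializeTransaction(BigInteger txid, Object prevMetadata) {
        return null;    // real metadata computation elided in this sketch
    }

    @Override
    public void close() { }
}
```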

Hope this helps.

A wild supposition: can MySQL be Kafka?

This is an idea I presented at Percona Live 2015.

Is MySQL an avatar of Apache Kafka?

Can it be Kafka?

Yes, it can.

This talk takes a shot at modeling MySQL as Kafka.

PS: it’s unconventional, hence a WILD supposition :)

slides @



MySQL Cluster – Java Connector / Bindings

While working with MySQL Cluster, I was looking for a monitoring framework for the cluster.

I came across a library at https://launchpad.net/ndb-bindings, which provides Java and other connectors to NDB; the library is a wrapper around the existing C++ NDB API.

This library allowed me to connect to the management node, get the state of the cluster, and receive real-time notifications about heartbeat misses and node disconnections.

The library errored out under some conditions; with a small fix, it works with MySQL Cluster 7.3.


I have listed the steps for compiling and running a sample program on GitHub.