Graphing Amazon RDS MySQL Metrics with Prometheus & Grafana

Recently the MySQL community got an awesome monitoring solution for MySQL with Prometheus & Grafana. The graphs are simply beautiful and really lively.

I started off with this nice post on the MySQL Performance Blog by Roman Vynar and got the solution up and running very easily.

You can actually monitor an Amazon RDS instance with the same steps mentioned in the above post, but with a few changes:


The monitoring framework consists of 4 components:

  1. Prometheus server on port 9090
  2. Grafana server on port 3000
  3. MySQL exporter process, which connects to the MySQL server
  4. Node exporter process, which gets system metrics of the box hosting MySQL



  1. Create an RDS instance.
  2. Create an Amazon EC2 instance. (This will host all 4 components.)


Components 3 & 4 – Node exporter & MySQL exporter:

Amazon _DOES_NOT_ allow us to install anything on the RDS box.

So, I am sorry, we will not be able to get the system metrics of RDS – please rely on CloudWatch / the RDS console for load averages, CPU usage, I/O, etc.

So follow the steps as mentioned in the nice post, BUT make the following changes:

  • Install the Node exporter & MySQL exporter processes on the EC2 instance,
    • so the ‘/opt/prometheus/prometheus.yml’ file will look like the following
    • i.e. you are now monitoring system metrics of the EC2 instance, not the RDS box!
cat << EOF > /opt/prometheus/prometheus.yml
global:
  scrape_interval: 5s
  evaluation_interval: 5s
scrape_configs:
  - job_name: linux
    static_configs:
      - targets: ['localhost:9100']
        labels:
          alias: db1
  - job_name: mysql
    static_configs:
      - targets: ['localhost:9104']
        labels:
          alias: db1
EOF

But we need to tell the MySQL exporter to pull from the RDS endpoint, so the .my.cnf file for the MySQL exporter should be as follows (user, password and host below are placeholders – use your own monitoring credentials and your RDS endpoint):

[root@centos7 prometheus_exporters]# cat << EOF > .my.cnf
[client]
user=prom
password=your_password
host=your-rds-instance.xxxxxxxx.us-east-1.rds.amazonaws.com
port=3306
EOF


Components 1 & 2 – Grafana & Prometheus:

Just follow the steps as mentioned in the nice post.


And voilà… you should be able to graph Amazon RDS metrics 🙂

[Image: Screen Shot 2016-05-05 at 4.08.17 pm]



Punch Clock for debugging Apache Storm

I have been working a lot with transactional topologies in Apache Storm these days.

In the course of my work, I have come up with questions like:

  • Why is the topology stuck?
  • Which part of the topology is the batch stuck at?

So I came up with an idea based on a punch clock.

[Image: Screen Shot 2016-01-13 at 9.47.36 pm]

To find out…

  1. When did the person enter / exit the office?
  2. Who is still in the office?

Context Change:

[Image: Screen Shot 2016-01-13 at 9.50.33 pm]

[Image: Screen Shot 2016-01-13 at 9.50.52 pm]

Apply Punch clock to Storm:

[Image: Screen Shot 2016-01-13 at 9.50.58 pm]

Possible Solution:

[Image: Screen Shot 2016-01-13 at 10.08.35 pm]

[Image: Screen Shot 2016-01-13 at 10.03.08 pm]


In the emitBatch method of the partitioned transactional spout:

punchCardId = "SPOUT__" + InetAddress.getLocalHost().getHostAddress() + "__" + Thread.currentThread().getId() + "__" + System.currentTimeMillis();
PunchClock.getInstance().punchIn(punchCardId);   // Punch in
collector.emit(tuples);                          // Emit tuple(s)
PunchClock.getInstance().punchOut(punchCardId);  // Punch out


In the prepare method of the transactional bolt:

       punchCardId = "Bolt__" + Thread.currentThread().getId() + "__" + System.currentTimeMillis();  // Create punch card for txn

In the execute method of the transactional bolt:

       PunchClock.getInstance().punchIn(punchCardId);   // Punch in

In the finishBatch method of the transactional bolt:

       PunchClock.getInstance().punchOut(punchCardId);  // Punch out
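The PunchClock itself isn't shown in the post, so here is a minimal sketch of how it could look, assuming a thread-safe singleton that keeps open punch cards in a ConcurrentHashMap (the class and method names mirror the snippets above; everything else is my own assumption):

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical PunchClock: tracks which punch cards are "in the office".
// If the topology is stuck, the open cards tell you which spout/bolt
// punched in but never punched out.
public class PunchClock {
    private static final PunchClock INSTANCE = new PunchClock();
    private final Map<String, Long> openCards = new ConcurrentHashMap<>();

    private PunchClock() {}

    public static PunchClock getInstance() {
        return INSTANCE;
    }

    public void punchIn(String punchCardId) {
        openCards.put(punchCardId, System.currentTimeMillis());
    }

    public void punchOut(String punchCardId) {
        openCards.remove(punchCardId);
    }

    // Cards that punched in but never punched out - likely stuck components.
    public Set<String> openCardIds() {
        return openCards.keySet();
    }
}
```

A stuck batch then shows up as a card that stays in openCardIds() – dump that set periodically (e.g. from a timer thread or a metrics hook) to see which component punched in but never punched out.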


[Image: Screen Shot 2016-01-13 at 10.06.34 pm]


Small Demo:

[Image: Screen Shot 2016-01-13 at 5.57.09 pm]

[Image: Screen Shot 2016-01-13 at 5.57.39 pm]

PS: If there are no open punch cards anywhere and the topology is still stuck, then the problem is probably not your bolts/spout.


Presentation: here

Other Solutions:

(1) logging while entering and exiting

(2) using -> suggested by my friend Angad @Inmobi

Hope you like the idea & hope it's useful to you.

Thank you for reading & any feedback is welcome.




The Tricolor in a graph :)

One can always find inspiration randomly – that's what I believe.

While debugging my Apache Storm topology, I was looking at metrics using Ambari, and I noticed something calling out to me.

It was the tricolour – the flag of my nation, INDIA ==> INSPIRATION.

[Image: Screen Shot 2015-12-10 at 4.31.02 pm]

#JAIHIND  #GraphArt #inspiration


HBase Region Server hijacked – Failed transition from OFFLINE to OPENING


Recently, during my tests, my HBase cluster crashed. My master did not want to start up, and my region servers said their regions were possibly hijacked, i.e. removed.

At first I tried starting the servers, but they would not come up.

Hbase Master:

2015-11-23 23:56:59,100 FATAL [test-ambari-h-147:16000.activeMasterManager] master.HMaster: Failed to become active master

java.lang.RuntimeException: java.lang.NullPointerException
    at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(
    at org.apache.hadoop.hbase.client.ClientScanner.nextScanner(
    at org.apache.hadoop.hbase.client.ClientScanner.initializeScannerInConstruction(
    at org.apache.hadoop.hbase.client.ClientScanner.<init>(
    at org.apache.hadoop.hbase.client.HTable.getScanner(
    at org.apache.hadoop.hbase.MetaTableAccessor.fullScan(
    at org.apache.hadoop.hbase.MetaTableAccessor.fullScanOfMeta(
    at org.apache.hadoop.hbase.MetaMigrationConvertingToPB.isMetaTableUpdated(
    at org.apache.hadoop.hbase.MetaMigrationConvertingToPB.updateMetaIfNecessary(
    at org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(
    at org.apache.hadoop.hbase.master.HMaster.access$500(
    at org.apache.hadoop.hbase.master.HMaster$
Caused by: java.lang.NullPointerException
    at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.getMetaReplicaNodes(
    at org.apache.hadoop.hbase.zookeeper.MetaTableLocator.blockUntilAvailable(
    at org.apache.hadoop.hbase.client.ZooKeeperRegistry.getMetaRegionLocation(
    at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateMeta(
    at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(
    at org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.getRegionLocations(
    at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(
I looked at ZooKeeper, and the node '/hbase-unsecure', i.e. the parent node for HBase, was missing.

So I went in and created it manually (from zkCli.sh: create /hbase-unsecure '').

Then I started the master; it came up and started to wait for the region servers.

2015-11-23 23:16:18,134 INFO  [test-ambari-h-147:16000.activeMasterManager] master.ServerManager: Waiting for region servers count to settle; currently checked in 0, slept for 0 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.

Hbase Region Server:

The region servers started up and tried to move the regions from OFFLINE to OPENING.

The HBase master UI showed the regions in TRANSITION.

[Image: Screen Shot 2015-11-24 at 3.53.34 pm]



After a while, the Region server gives up and throws error:

WARN [RS_OPEN_REGION-test-ambari-jn-12:16020-1] 
zookeeper.ZKAssign: regionserver:16020-0x151379ff470012d, quorum=zookeeper.internal:2181, baseZNode=/hbase-unsecure Attempt to transition the unassigned node for 45f2d52bc925c1645e33242ba3b4bf30 from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING failed, the server that tried to transition was ip-10-2-12-12.internal,16020,1448338125490 not the expected test-ambari-jn-12,16020,1448338125490

WARN [RS_OPEN_REGION-test-ambari-jn-12:16020-1] 
coordination.ZkOpenRegionCoordination: Failed transition from OFFLINE to OPENING for region=45f2d52bc925c1645e33242ba3b4bf30

WARN [RS_OPEN_REGION-test-ambari-jn-12:16020-1] 
handler.OpenRegionHandler: Region was hijacked? Opening cancelled for encodedName=45f2d52bc925c1645e33242ba3b4bf3

Now the interesting part: the server that tried to transition the region was ip-10-2-12-12.internal, while the expected server was test-ambari-jn-12.

But for us, both test-ambari-jn-12 & ip-10-2-12-12.internal were the same box:
test-ambari-jn-12 was a CNAME;
ip-10-2-12-12.internal was the hostname.

They pointed to the same box; it seems the name resolution failed, i.e. HBase recognized the same box as two different region servers.
[Image: Screen Shot 2015-11-24 at 4.03.55 pm]
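The root cause boils down to a string comparison: HBase compares region server identities as name strings, so a CNAME and a hostname never match each other, even when both resolve to the same IP. A hypothetical helper (hostnames below are illustrative) that would have shown the two names were in fact the same box:

```java
import java.net.InetAddress;

// Hypothetical check: do two names resolve to the same address?
// In our case this would have returned true for the CNAME and the
// hostname, yet HBase compared the server *names* as strings and
// still rejected the region transition.
public class NameCheck {
    public static boolean sameHost(String a, String b) throws Exception {
        return InetAddress.getByName(a).getHostAddress()
                .equals(InetAddress.getByName(b).getHostAddress());
    }
}
```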


1) Stop the HBase master and region servers.
2) Create the HBase parent node in ZooKeeper. For the Hortonworks Ambari distribution it is '/hbase-unsecure'.
3) Start the HBase master.
4) Add an entry to /etc/hosts on each region server so that both names resolve consistently, e.g. '10.2.12.12  ip-10-2-12-12.internal  test-ambari-jn-12'.
5) Restart the region servers.

6) They should connect and move the regions around – I was able to get my data back.

HBase version 1.1 – Hortonworks HDP 2.3.

NOTE: This is one scenario where my regions were hijacked.

MySQL Aurora CPU spikes

Recently we encountered an issue with Aurora, where our CPU started spiking every X minutes (X < 30).

[Image: Screen Shot 2015-08-27 at 3.02.27 pm]

We had around 20–30 connections, out of which the majority were inserts and a few selects.

Every time the spike occurred, our inserts would start waiting and waiting….

Here is the innotop output:


We checked our code; nothing major had changed.

The MySQL error log had this mysterious message, which I could not trace in the MySQL/Percona/MariaDB source:

“Innodb: Retracted crabbing in btr_cur_search_to_nth_level 33001 times since restart”

If you get to know the meaning of this, please do let me know – I would love to understand it.

[Image: crab_Screen Shot 2015-09-17 at 9.29.17 pm]

Finally we contacted the Aurora support team; after an investigation on their side, it turned out to be an issue in Aurora.

I am not too sure whether the above ‘error log message’ had any impact on this case.

PS: If you are wondering whether this post ended abruptly – even I am a little surprised by the ending too.

[Image: Screen Shot 2015-08-24 at 7.47.15 pm]