So recently, while on call, I noticed that the load average was very high on my AWS EC2 instance.
The first thing I did was run the top command, which showed me the following:
ntop was taking 105% CPU and 500 MB of memory!
Now that's fishy!
I googled and found these two links, which suggested that ntop is broken:
So I proceeded to stop ntop, but the stop failed, so I just killed it:
kill -9 <ntop_pid>
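If you do not know the PID off-hand, something along these lines (assuming pgrep is available on the box) finds and kills it in one go:

pgrep ntop                  # list the ntop PID(s)
kill -9 $(pgrep -o ntop)    # force-kill the oldest matching ntop process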
Voilà, the load averages returned to normal 🙂
Recently the MySQL community got an awesome monitoring solution for MySQL.
You can actually monitor an Amazon RDS instance with the same steps mentioned in the above post, but with a few changes.
The monitoring framework consists of 4 components: the Prometheus server, the node (Linux) exporter, the MySQL exporter, and the graphing dashboard.
Amazon does NOT allow us to install anything on the RDS box.
So, sorry, we will not be able to get system metrics from the RDS instance itself; rely on CloudWatch / the RDS console for load averages, CPU usage, I/O, etc.
So follow the steps as mentioned in the nice post, but make the following changes.
cat << EOF > /opt/prometheus/prometheus.yml
global:
  scrape_interval: 5s
  evaluation_interval: 5s
scrape_configs:
  - job_name: linux
    target_groups:
      - targets: ['localhost:9100']
        labels:
          alias: db1
  - job_name: mysql
    target_groups:
      - targets: ['localhost:9104']
        labels:
          alias: db1
EOF
But we need to tell the MySQL exporter to pull from the RDS endpoint, so the .my.cnf file for the MySQL exporter should be as follows:
[root@centos7 prometheus_exporters]# cat << EOF > .my.cnf
[client]
user=prom
password=abc123
host=amazon-rds-instance.amazonaws.com
EOF
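Then start the MySQL exporter pointing at that file. For example (treat the exact flag name as an assumption, it has changed between mysqld_exporter versions):

[root@centos7 prometheus_exporters]# ./mysqld_exporter -config.my-cnf=".my.cnf" &

The exporter connects to the RDS endpoint from .my.cnf and serves the metrics on port 9104, which is what the mysql job in the prometheus.yml above scrapes.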
Then just follow the rest of the steps as mentioned in the nice post.
And voilà… you should be able to graph Amazon RDS metrics 🙂
A possible solution to handle deletion of shards in a stream with Apache Storm.
I have been working a lot with transactional topologies in Apache Storm these days.
In the course of my work, I have come up with questions like
So I came up with an idea based on a punch clock.
To find out …
Applying the punch clock to Storm:
In the emitBatch method of the partitioned transactional spout:
punchCardId = "SPOUT__"+ InetAddress.getLocalHost().getHostAddress()+Thread.currentThread().getId()+"__"+System.currentTimeMillis();
PunchClock.getInstance().punchIn(punchCardId);  // Punch in
collector.emit(tuples);                         // Emit tuple(s)
PunchClock.getInstance().punchOut(punchCardId); // Punch out
In the prepare method of the transactional bolt:
punchCardId ="Bolt__"+Thread.currentThread().getId()+"__"+System.currentTimeMillis(); //Create Punch Card for txn
In the execute method of the transactional bolt:
PunchClock.getInstance().punchIn(punchCardId); // Punch In
In the finishBatch method of the transactional bolt:
PunchClock.getInstance().punchOut(punchCardId); // Punch Out
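For completeness, here is a minimal sketch of what the PunchClock could look like. Only getInstance(), punchIn() and punchOut() are the calls used above; the backing set and the getOpenPunchCards() helper are my assumptions about one possible implementation, not the exact class from my topology:

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Minimal punch-clock sketch: tracks which punch cards are currently "punched in".
public class PunchClock {

    private static final PunchClock INSTANCE = new PunchClock();

    // Punch cards that have punched in but not yet punched out.
    private final Set<String> openCards = ConcurrentHashMap.newKeySet();

    private PunchClock() { }

    public static PunchClock getInstance() {
        return INSTANCE;
    }

    public void punchIn(String punchCardId) {
        openCards.add(punchCardId);
    }

    public void punchOut(String punchCardId) {
        openCards.remove(punchCardId);
    }

    // Hypothetical helper: if this is empty while the topology is stuck,
    // the spout/bolts are idle and the problem is probably elsewhere.
    public Set<String> getOpenPunchCards() {
        return openCards;
    }
}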
PS: if there are no open punch cards anywhere and the topology is still stuck, then the problem is probably not in your bolts/spout.
A couple of alternatives to the punch clock:
(1) logging while entering and exiting
(2) using http://riemann.io/ (suggested by my friend Angad @Inmobi)
Hope you like the idea and hope it's useful to you.
Thank you for reading & any feedback is welcome.
One can always find inspiration randomly – that's what I believe.
While debugging my Apache Storm topology, I was looking at metrics using Ambari and noticed something calling out to me:
it was the tricolor, the flag of my nation INDIA ==> INSPIRATION.
#JAIHIND #GraphArt #inspiration
Recently, during my tests, my HBase cluster crashed: the master did not want to start up, and my region servers reported that their regions were possibly hijacked, i.e. removed.
At first I tried starting the servers back up, but they would not come up. The master failed with:
2015-11-23 23:56:59,100 FATAL [test-ambari-h-147:16000.activeMasterManager] master.HMaster: Failed to become active master
java.lang.RuntimeException: java.lang.NullPointerException
    at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:208)
    at org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:320)
    at org.apache.hadoop.hbase.client.ClientScanner.nextScanner(ClientScanner.java:295)
    at org.apache.hadoop.hbase.client.ClientScanner.initializeScannerInConstruction(ClientScanner.java:160)
    at org.apache.hadoop.hbase.client.ClientScanner.<init>(ClientScanner.java:155)
    at org.apache.hadoop.hbase.client.HTable.getScanner(HTable.java:821)
    at org.apache.hadoop.hbase.MetaTableAccessor.fullScan(MetaTableAccessor.java:602)
    at org.apache.hadoop.hbase.MetaTableAccessor.fullScanOfMeta(MetaTableAccessor.java:143)
    at org.apache.hadoop.hbase.MetaMigrationConvertingToPB.isMetaTableUpdated(MetaMigrationConvertingToPB.java:163)
    at org.apache.hadoop.hbase.MetaMigrationConvertingToPB.updateMetaIfNecessary(MetaMigrationConvertingToPB.java:130)
    at org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:758)
    at org.apache.hadoop.hbase.master.HMaster.access$500(HMaster.java:182)
    at org.apache.hadoop.hbase.master.HMaster$1.run(HMaster.java:1646)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
    at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.getMetaReplicaNodes(ZooKeeperWatcher.java:395)
    at org.apache.hadoop.hbase.zookeeper.MetaTableLocator.blockUntilAvailable(MetaTableLocator.java:553)
    at org.apache.hadoop.hbase.client.ZooKeeperRegistry.getMetaRegionLocation(ZooKeeperRegistry.java:61)
    at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateMeta(ConnectionManager.java:1185)
    at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1152)
    at org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.getRegionLocations(RpcRetryingCallerWithReadReplicas.java:300)
    at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:151)
    at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:59)
    at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200)
I looked at ZooKeeper and found that the node '/hbase-unsecure', i.e. the parent node for HBase, was missing. So I went in and created it manually. Then I started the master; it came up and started waiting for the region servers:

....
2015-11-23 23:16:18,134 INFO [test-ambari-h-147:16000.activeMasterManager] master.ServerManager: Waiting for region servers count to settle; currently checked in 0, slept for 0 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.
....

HBase region servers: the region servers started up and tried to move the regions from OFFLINE to OPENING, and the HBase master UI showed the regions in transition.
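(For reference, creating the missing parent znode manually can be done from the ZooKeeper CLI, roughly as below; the zkCli.sh path is the usual HDP layout and may differ on your cluster:

/usr/hdp/current/zookeeper-client/bin/zkCli.sh -server localhost:2181
# then, inside the zkCli shell:
create /hbase-unsecure ""
)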
After a while, the region servers give up and throw errors:
WARN [RS_OPEN_REGION-test-ambari-jn-12:16020-1] zookeeper.ZKAssign: regionserver:16020-0x151379ff470012d, quorum=zookeeper.internal:2181, baseZNode=/hbase-unsecure Attempt to transition the unassigned node for 45f2d52bc925c1645e33242ba3b4bf30 from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING failed, the server that tried to transition was ip-10-2-12-12.internal,16020,1448338125490 not the expected test-ambari-jn-12,16020,1448338125490
WARN [RS_OPEN_REGION-test-ambari-jn-12:16020-1] coordination.ZkOpenRegionCoordination: Failed transition from OFFLINE to OPENING for region=45f2d52bc925c1645e33242ba3b4bf30
WARN [RS_OPEN_REGION-test-ambari-jn-12:16020-1] handler.OpenRegionHandler: Region was hijacked? Opening cancelled for encodedName=45f2d52bc925c1645e33242ba3b4bf30

Now the interesting part: it tried to transition the region from test-ambari-jn-12 to ip-10-2-12-12.internal, but for us test-ambari-jn-12 and ip-10-2-12-12.internal were the same machine.
test-ambari-jn-12 was a CNAME and ip-10-2-12-12.internal was the hostname.
They pointed to the same box, but it seems the name resolution got confused,
i.e. HBase recognized the same box as two different region servers.
The fix:
1) Stop the HBase master and the region servers.
2) Create the HBase parent node in ZooKeeper; for the Hortonworks/Ambari distribution that is /hbase-unsecure (see the zkCli example above).
3) Start the HBase master.
4) Add an entry to /etc/hosts on each region server so that the CNAME and the hostname resolve consistently (a sample entry follows this list).
5) Restart the region servers.
6) They should connect and move the regions around; I was able to get my data back.
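For illustration, the /etc/hosts entry from step 4 could look like the line below; the IP address is only a guess inferred from the hostname ip-10-2-12-12.internal, so use your region server's actual private IP:

10.2.12.12   ip-10-2-12-12.internal   test-ambari-jn-12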
HBase version 1.1 (Hortonworks HDP 2.3).
NOTE: This is just one scenario in which regions get 'hijacked'.