Druid Indexing Fails – ISE – Cannot find instance of indexer to talk to!

Recently my Druid Overlord and MiddleManager stopped indexing data.

I tried the usual multiple restarts, but that did not help.

The logs of the middle manager revealed the following:

2018-10-02 06:01:37 [main] WARN  RemoteTaskActionClient:102 - Exception submitting action for task[index_nom_14days_2018-10-02T06:01:29.633Z]
java.io.IOException: Failed to locate service uri
        at io.druid.indexing.common.actions.RemoteTaskActionClient.submit(RemoteTaskActionClient.java:94) [druid-indexing-service-0.10.0.jar:0.10.0]
        at io.druid.indexing.common.task.IndexTask.isReady(IndexTask.java:150) [druid-indexing-service-0.10.0.jar:0.10.0]
        at io.druid.indexing.worker.executor.ExecutorLifecycle.start(ExecutorLifecycle.java:169) [druid-indexing-service-0.10.0.jar:0.10.0]
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_181]
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_181]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_181]
        at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_181]
        at io.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler.start(Lifecycle.java:364) [java-util-0.10.0.jar:0.10.0]
        at io.druid.java.util.common.lifecycle.Lifecycle.start(Lifecycle.java:263) [java-util-0.10.0.jar:0.10.0]
        at io.druid.guice.LifecycleModule$2.start(LifecycleModule.java:156) [druid-api-0.10.0.jar:0.10.0]
        at io.druid.cli.GuiceRunnable.initLifecycle(GuiceRunnable.java:102) [druid-services-0.10.0.jar:0.10.0]
        at io.druid.cli.CliPeon.run(CliPeon.java:277) [druid-services-0.10.0.jar:0.10.0]
        at io.druid.cli.Main.main(Main.java:108) [druid-services-0.10.0.jar:0.10.0]
Caused by: io.druid.java.util.common.ISE: Cannot find instance of indexer to talk to!
        at io.druid.indexing.common.actions.RemoteTaskActionClient.getServiceInstance(RemoteTaskActionClient.java:168) ~[druid-indexing-service-0.10.0.jar:0.10.0]
        at io.druid.indexing.common.actions.RemoteTaskActionClient.submit(RemoteTaskActionClient.java:89) ~[druid-indexing-service-0.10.0.jar:0.10.0]

The main cause was:

Caused by: io.druid.java.util.common.ISE: Cannot find instance of indexer to talk to!

Apparently the middle manager and its peons could not find the overlord service, so indexing could not start and all jobs were failing.

Root Cause:

I logged into ZooKeeper and found the following two discovery entries, for the Coordinator and the Overlord, to be empty even though both services were up and running.

Since they were empty, the middle manager, which looks up the Overlord via ZooKeeper, could not find an indexer to talk to!

[zk: localhost:2181(CONNECTED) 1] ls /druid/discovery/druid:coordinator
[]
[zk: localhost:2181(CONNECTED) 3] ls /druid/discovery/druid:overlord
[]

Fix:

Assuming that the Coordinator runs on Host_A and the Overlord runs on Host_B, do the following:

(1) Generate two random IDs: one for the Coordinator and one for the Overlord.

Druid's own code uses the following Java snippet:

System.out.println(java.util.UUID.randomUUID().toString());

You can use any random string. Let's assume you generate:

45c3697f-414f-48f2-b1bc-b9ab5a0ebbd4 for the Coordinator

and

f1babb39-26c1-42cb-ac4a-0eb21ae5d77d for the Overlord.
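
If you want to generate both IDs in one go, here is a minimal runnable wrapper around the same call (the class name GenIds is made up for illustration):

public class GenIds {
    public static void main(String[] args) {
        // One discovery id for the Coordinator, one for the Overlord.
        System.out.println("coordinator id: " + java.util.UUID.randomUUID());
        System.out.println("overlord id:    " + java.util.UUID.randomUUID());
    }
}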

(2) Log into the ZooKeeper CLI and run the following commands:

[zk: localhost:2181(CONNECTED) 3] create /druid/discovery/druid:coordinator/45c3697f-414f-48f2-b1bc-b9ab5a0ebbd4 {"name":"druid:coordinator","id":"45c3697f-414f-48f2-b1bc-b9ab5a0ebbd4","address":"HOST_A_IP","port":8081,"sslPort":null,"payload":null,"registrationTimeUTC":1538459656828,"serviceType":"DYNAMIC","uriSpec":null}

[zk: localhost:2181(CONNECTED) 3] create /druid/discovery/druid:overlord/f1babb39-26c1-42cb-ac4a-0eb21ae5d77d {"name":"druid:overlord","id":"f1babb39-26c1-42cb-ac4a-0eb21ae5d77d","address":"HOST_B_IP","port":8090,"sslPort":null,"payload":null,"registrationTimeUTC":1538459763281,"serviceType":"DYNAMIC","uriSpec":null}

You will notice a registrationTimeUTC field; put in any epoch-millisecond timestamp you see fit.

This should create the necessary ZooKeeper nodes for the middle manager to look up.
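
If you would rather script this than type into the ZooKeeper CLI, here is a minimal sketch using Apache Curator (this assumes curator-framework on the classpath; the class name and the hardcoded values just mirror the Overlord example above and are illustrative only):

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;
import java.nio.charset.StandardCharsets;

public class RegisterOverlord {
    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "localhost:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();
        String id = "f1babb39-26c1-42cb-ac4a-0eb21ae5d77d";
        // Same JSON payload as the zk CLI command above.
        String payload = "{\"name\":\"druid:overlord\",\"id\":\"" + id + "\","
                + "\"address\":\"HOST_B_IP\",\"port\":8090,\"sslPort\":null,"
                + "\"payload\":null,\"registrationTimeUTC\":1538459763281,"
                + "\"serviceType\":\"DYNAMIC\",\"uriSpec\":null}";
        client.create().creatingParentsIfNeeded().forPath(
                "/druid/discovery/druid:overlord/" + id,
                payload.getBytes(StandardCharsets.UTF_8));
        client.close();
    }
}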

Now restart the Coordinator, then the Overlord, then the MiddleManager.

Submit an indexing task and it should work.

Additional tips:

In $Druid_Home/conf/druid/middleManager/runtime.properties you may have:

druid.indexer.runner.javaOpts=-cp conf/druid/_common:conf/druid/middleManager/:conf/druid/middleManager:lib/* -server -Ddruid.selectors.indexing.serviceName=druid/overlord -Xmx7g -XX:+UseG1GC -XX:MaxGCPauseMillis=100 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Dhadoop.mapreduce.job.user.classpath.first=true

and in $Druid_Home/conf/druid/middleManager/jvm.config you may have:

-Ddruid.selectors.indexing.serviceName=druid/overlord

Hope this helps you.

java.lang.RuntimeException: native snappy library not available – druid ingestion

I was ingesting some data into Druid using local batch mode, and the ingestion failed with:

2018-07-10T04:57:51,501 INFO [Thread-31] org.apache.hadoop.mapred.LocalJobRunner - map task executor complete.
2018-07-10T04:57:51,506 WARN [Thread-31] org.apache.hadoop.mapred.LocalJobRunner - job_local265070473_0001
java.lang.Exception: java.lang.RuntimeException: native snappy library not available: this version of libhadoop was built without snappy support.
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:489) ~[hadoop-mapreduce-client-common-2.6.0-cdh5.9.0.jar:?]
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:549) [hadoop-mapreduce-client-common-2.6.0-cdh5.9.0.jar:?]
Caused by: java.lang.RuntimeException: native snappy library not available: this version of libhadoop was built without snappy support.

To avoid Snappy compression, I disabled it with the following tuningConfig in the ingestion spec:

"tuningConfig": {

                        "type": "hadoop",

                        "partitionsSpec": {

                                "type": "hashed",

                                "targetPartitionSize": 5000000

                        },

                        "jobProperties": {

                                 "mapreduce.framework.name" : "local",

                                 "mapreduce.map.output.compress":"false"

                                "mapreduce.job.classloader": "true",

                                "mapreduce.job.classloader.system.classes": "-javax.validation.,java.,javax.,org.apache.commons.logging.,org.apache.log4j.,org.apache.hadoop."
                                  .........
                                  .........
                        }

                }
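
If you want to confirm whether your libhadoop actually supports Snappy before flipping compression off, a minimal sketch like this can tell you (run it with the same Hadoop jars and java.library.path as your peons; the class name SnappyCheck is made up):

import org.apache.hadoop.util.NativeCodeLoader;

public class SnappyCheck {
    public static void main(String[] args) {
        // True only if libhadoop was found on java.library.path.
        boolean nativeLoaded = NativeCodeLoader.isNativeCodeLoaded();
        System.out.println("native hadoop loaded: " + nativeLoaded);
        // buildSupportsSnappy() is itself a native method, so only call
        // it when the native library is actually loaded.
        if (nativeLoaded) {
            System.out.println("snappy supported: " + NativeCodeLoader.buildSupportsSnappy());
        }
    }
}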


Pushing Druid metrics to Graphite/Grafana with a full metrics whitelist

This blog post is about configuring Druid to send metrics to Graphite/Grafana.

Kindly refer to the steps at:

https://github.com/jaihind213/druid_stuff/tree/master/druid_metrics_graphite

The Grafana dashboard templates are in the GitHub link above.

———————————————————————————————————————

After you have followed the steps above, restart Druid and it should start sending metrics to Graphite and Grafana.

You should see log statements like:

INFO [main] io.druid.emitter.graphite.GraphiteEmitter - Starting Graphite Emitter.

INFO [GraphiteEmitter-0] io.druid.emitter.graphite.GraphiteEmitter - trying to connect to graphite server

INFO [main] io.druid.initialization.Initialization - Loading extension [graphite-emitter] for class
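
If metrics still do not show up, it can help to sanity-check that the Graphite plaintext listener itself is reachable. Here is a minimal sketch that writes one data point using Graphite's plaintext protocol (the host, the default port 2003, the metric name, and the class name GraphitePing are all illustrative):

import java.io.OutputStreamWriter;
import java.io.Writer;
import java.net.Socket;

public class GraphitePing {
    public static void main(String[] args) throws Exception {
        // Graphite plaintext protocol: "<metric.path> <value> <epoch-seconds>\n"
        try (Socket sock = new Socket("localhost", 2003);
             Writer out = new OutputStreamWriter(sock.getOutputStream(), "UTF-8")) {
            long now = System.currentTimeMillis() / 1000;
            out.write("druid.test.ping 1 " + now + "\n");
            out.flush();
        }
    }
}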


Tip: you can set the frequency at which metrics are reported and also the types of monitors you wish to have, based on the Druid release version you use.

Important tip: start off with the JVM monitor first, then slowly add more. The sys monitor needs the Sigar jar, and some monitors might not work immediately.




Druid Indexing & Xerces Hell – Loader Constraint Violation

Recently my teammate Vihag came across a java.lang.LinkageError while doing Druid indexing with CDH 5.10.2.

It was good fun, we must say, finding out the reason: taking the MD5 of class files, comparing them, and figuring out which jar each class was loaded from.
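
A quicker trick than hashing class files is to ask the JVM directly which jar it loaded a class from; a minimal sketch (the class name WhichJar is made up):

public class WhichJar {
    public static void main(String[] args) throws Exception {
        Class<?> c = Class.forName("org.apache.xerces.jaxp.DocumentBuilderImpl");
        // Prints the jar URL the class came from; note this may be null
        // for classes loaded by the bootstrap class loader.
        System.out.println(c.getProtectionDomain().getCodeSource().getLocation());
    }
}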

—————————————-

2018-05-11 11:03:04,848 FATAL [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Error starting MRAppMaster
java.lang.LinkageError: loader constraint violation: when resolving overridden method "org.apache.xerces.jaxp.DocumentBuilderImpl.newDocument()Lorg/w3c/dom/Document;" the class loader (instance of org/apache/hadoop/util/ApplicationClassLoader) of the current class, org/apache/xerces/jaxp/DocumentBuilderImpl, and its superclass loader (instance of <bootloader>), have different Class objects for the type org/w3c/dom/Document used in the signature
at org.apache.xerces.jaxp.DocumentBuilderFactoryImpl.newDocumentBuilder(Unknown Source)
at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:2541)
at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:2503)
at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2409)
at org.apache.hadoop.conf.Configuration.get(Configuration.java:1233)
at org.apache.hadoop.yarn.factory.providers.RecordFactoryProvider.getRecordFactory(RecordFactoryProvider.java:49)
at org.apache.hadoop.mapreduce.TypeConverter.<clinit>(TypeConverter.java:62)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$1.call(MRAppMaster.java:481)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$1.call(MRAppMaster.java:469)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.callWithJobClassLoader(MRAppMaster.java:1579)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.createOutputCommitter(MRAppMaster.java:469)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.serviceInit(MRAppMaster.java:391)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$4.run(MRAppMaster.java:1537)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.initAndStartAppMaster(MRAppMaster.java:1534)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.main(MRAppMaster.java:1467)
2018-05-11 11:03:04,851 INFO [main] org.apache.hadoop.util.ExitUtil: Exiting with status 

—————————————-

Vihag has documented the resolution nicely on GitHub.

—————————————-

The crux of the matter was that the class DocumentBuilderImpl was present in Druid's xercesImpl-2.9.1.jar, while YARN was loading it from xercesImpl-2.10.0.jar.

So we removed the 2.9.1 jar from Druid and replaced it with the 2.10.0 jar.

We also noticed that the 2.10.0 jar plays well with xml-apis-1.4.01.jar and not xml-apis-1.3.04.jar.

Hope this helps you.


Druid Segment Diskspace Calculator

Recently I have been working with Druid and trying to come up with disk space sizing for the historical nodes, as we have to deploy to remote customer locations, for which we need to put in machine requests well in advance.


This took me into the world of bitmaps: Concise and Roaring.

Druid uses Concise bitmaps by default and also offers Roaring as an option.
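
To get a feel for how much disk a per-value bitmap index can take, here is a minimal sketch using the RoaringBitmap library (this assumes org.roaringbitmap:RoaringBitmap on the classpath; Druid defaults to Concise, so treat the number only as a rough proxy):

import org.roaringbitmap.RoaringBitmap;

public class BitmapSize {
    public static void main(String[] args) {
        RoaringBitmap rb = new RoaringBitmap();
        // Simulate one dimension value appearing in every 10th row
        // of a 10-million-row segment.
        for (int row = 0; row < 10_000_000; row += 10) {
            rb.add(row);
        }
        rb.runOptimize(); // apply run-length encoding where it helps
        System.out.println("serialized bytes: " + rb.serializedSizeInBytes());
    }
}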


So after reading a bit, I decided to come up with a calculator for the Druid segment disk space needed on the Druid historical nodes, assuming Concise bitmaps are used.

You can find the calculator @ github.com/jaihind213/druid-calculator

PS: Initially I wanted to write a blog post explaining the logic behind my calculator, but I was too lazy, so I wrote code and included the rationale in the code comments. 🙂

Thank you, and I hope you find it useful. Feedback is welcome.