golang: Parse a large json file in streaming manner

hi all,

i recently tried to parse a large json file in golang and used

'ioutil.ReadAll(file)' followed by 'json.Unmarshal(data, &msgs)'

This lead to a huge spike in memory usage as the entire file is read into memory in 1 go.

I wanted to parse the file in a streaming fashion, i.e. read the file bit by bit and decode the json data just like the jackson streaming api.

I did some digging and found

func NewDecoder(r io.Reader) *Decoder {...} in json package

The above uses a reader to buffer & pick up data.

here is some sample code to do it:


The differences are in terms of memory footprint:

We print memory usage as we parse a ~900MB file generated by generate_data.sh.

Streaming Fashion:

Screenshot 2019-05-07 at 11.20.15


Screenshot 2019-05-07 at 11.31.36

Simple Parsing by reading Everything in Mem:

Screenshot 2019-05-07 at 11.32.28

Screenshot 2019-05-07 at 11.32.54


The streaming parser has a very low Heap_inUse when compared to the massive usage of the readAllInMem approach.


Druid Indexing Fails – ISE – Cannot find instance of indexer to talk to!

Recently my druid overlord and middle manager stopped indexing data.

I tried the basic multiple restarts but it did not help.

The logs of the middle manager revealed the following:

2018-10-02 06:01:37 [main] WARN  RemoteTaskActionClient:102 - Exception submitting action for task[index_nom_14days_2018-10-02T06:01:29.633Z]

java.io.IOException: Failed to locate service uri

        at io.druid.indexing.common.actions.RemoteTaskActionClient.submit(RemoteTaskActionClient.java:94) [druid-indexing-service-0.10.0.jar:0.10.0]

        at io.druid.indexing.common.task.IndexTask.isReady(IndexTask.java:150) [druid-indexing-service-0.10.0.jar:0.10.0]

        at io.druid.indexing.worker.executor.ExecutorLifecycle.start(ExecutorLifecycle.java:169) [druid-indexing-service-0.10.0.jar:0.10.0]

        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_181]

        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_181]

        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_181]

        at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_181]

        at io.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler.start(Lifecycle.java:364) [java-util-0.10.0.jar:0.10.0]

        at io.druid.java.util.common.lifecycle.Lifecycle.start(Lifecycle.java:263) [java-util-0.10.0.jar:0.10.0]

        at io.druid.guice.LifecycleModule$2.start(LifecycleModule.java:156) [druid-api-0.10.0.jar:0.10.0]

        at io.druid.cli.GuiceRunnable.initLifecycle(GuiceRunnable.java:102) [druid-services-0.10.0.jar:0.10.0]

        at io.druid.cli.CliPeon.run(CliPeon.java:277) [druid-services-0.10.0.jar:0.10.0]

        at io.druid.cli.Main.main(Main.java:108) [druid-services-0.10.0.jar:0.10.0]

Caused by: io.druid.java.util.common.ISE: Cannot find instance of indexer to talk to!

        at io.druid.indexing.common.actions.RemoteTaskActionClient.getServiceInstance(RemoteTaskActionClient.java:168) ~[druid-indexing-service-0.10.0.jar:0.10.0]

        at io.druid.indexing.common.actions.RemoteTaskActionClient.submit(RemoteTaskActionClient.java:89) ~[druid-indexing-service-0.10.0.jar:0.10.0]

The main cause was

Caused by: io.druid.java.util.common.ISE: Cannot find instance of indexer to talk to!

Apparently the middle manager and its peons could not seem to find the overlord service for indexing to start and hence all jobs were failing.

Root Cause:

I logged into zookeeper and found the following to entries for Overlord/Coordinator to be empty even though they are up and running.

Since they were empty, the middle manager could not find an indexer to talk to ..!! as the middle manager looks up zookeeper.

[zk: localhost:2181(CONNECTED) 1] ls /druid/discovery/druid:coordinator
[zk: localhost:2181(CONNECTED) 3] ls /druid/discovery/druid:overlord


assuming that Coordinator runs on Host_A & Overlord runs on Host_B and … do the following:

(1) Generate 2 random ids. 1 for Coordinator & 1 for Overlord

Druid code uses the java code :


you can use any random string.

lets assume you generate

45c3697f-414f-48f2-b1bc-b9ab5a0ebbd4‘ for coordinator


f1babb39-26c1-42cb-ac4a-0eb21ae5d77d‘ for overlord.

(2) log into zookeeper cli and run the following commands:

[zk: localhost:2181(CONNECTED) 3] create /druid/discovery/druid:coordinator/45c3697f-414f-48f2-b1bc-b9ab5a0ebbd4 {"name":"druid:coordinator","id":"45c3697f-414f-48f2-b1bc-b9ab5a0ebbd4","address":"HOST_A_IP","port":8081,"sslPort":null,"payload":null,"registrationTimeUTC":1538459656828,"serviceType":"DYNAMIC","uriSpec":null}

[zk: localhost:2181(CONNECTED) 3] create /druid/discovery/druid:overlord/f1babb39-26c1-42cb-ac4a-0eb21ae5d77d {"name":"druid:overlord","id":"f1babb39-26c1-42cb-ac4a-0eb21ae5d77d","address":"HOST_B_IP","port":8090,"sslPort":null,"payload":null,"registrationTimeUTC":1538459763281,"serviceType":"DYNAMIC","uriSpec

You will notice a registrationTimeUtc field – put in some time u see fit.

This should create the necessary zookeeper nodes for middle manager to lookup

now restart coordinator then overlord then middleManager

Submit a indexing task and it should work.

Additional tips:

In $Druid_Home/conf/druid/middleManager/runtime.properties you may have

druid.indexer.runner.javaOpts=-cp conf/druid/_common:conf/druid/middleManager/:conf/druid/middleManager:lib/* -server -Ddruid.selectors.indexing.serviceName=druid/overlord -Xmx7g -XX:+UseG1GC -XX:MaxGCPauseMillis=100 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Dhadoop.mapreduce.job.user.classpath.first=true

and $Druid_Home/conf/druid/middleManager/jvm.config you may have


Hope this helps you.

java.lang.RuntimeException: native snappy library not available – druid ingestion

Was ingesting some data into druid using local batch mode & the ingestion failed with

2018-07-10T04:57:51,501 INFO [Thread-31] org.apache.hadoop.mapred.LocalJobRunner – map task executor complete.
2018-07-10T04:57:51,506 WARN [Thread-31] org.apache.hadoop.mapred.LocalJobRunner – job_local265070473_0001
java.lang.Exception: java.lang.RuntimeException: native snappy library not available: this version of libhadoop was built without snappy support.
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:489) ~[hadoop-mapreduce-client-common-2.6.0-cdh5.9.0.jar:?]
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:549) [hadoop-mapreduce-client-common-2.6.0-cdh5.9.0.jar:?]
Caused by: java.lang.RuntimeException: native snappy library not available: this version of libhadoop was built without snappy support.

To avoid the snappy compression… i disabled it with the following tuning config in the ingestion spec

"tuningConfig": {

                        "type": "hadoop",

                        "partitionsSpec": {

                                "type": "hashed",

                                "targetPartitionSize": 5000000


                        "jobProperties": {

                                 "mapreduce.framework.name" : "local",


                                "mapreduce.job.classloader": "true",

                                "mapreduce.job.classloader.system.classes": "-javax.validation.,java.,javax.,org.apache.commons.logging.,org.apache.log4j.,org.apache.hadoop."



Pushing druid metrics to Graphite Grafana with full metrics whitelist

This blog post is about configuring druid to send metrics to graphite/grafana.

kindly refer to steps @


The Grafana dashboard templates are in above github link.


after you have followed the steps above, restart druid and it should send stuff to graphite and grafana.

You should see log statements like

INFO [main] io.druid.emitter.graphite.GraphiteEmitter - Starting Graphite Emitter.

INFO [GraphiteEmitter-0] io.druid.emitter.graphite.GraphiteEmitter - trying to connect to graphite server

INFO [main] io.druid.initialization.Initialization - Loading extension [graphite-emitter] for class

Screen Shot 2018-06-11 at 7.52.12 PMScreen Shot 2018-06-11 at 7.51.58 PM

Screen Shot 2018-06-11 at 7.58.00 PM

Tip: you can set the frequency at which metrics are reported and also set the types of monitors you wish to have BASED ON THE RELEASE version you USE.

Important Tip: Start off with JVM monitor first, then slowly add. The sys monitor needs sigar jar. some monitors mite nite work immediately.

Screen Shot 2018-06-11 at 7.59.12 PM.png


UnknownArchetype: The desired archetype does not exist

recently encountered this error while generating a project in my intellij.

[INFO] ————————————————————————
[INFO] : org.apache.maven.archetype.exception.UnknownArchetype: The desired archetype does not exist (org.apache.maven.archetypes:maven-archetype-webapp:RELEASE)
The desired archetype does not exist (org.apache.maven.archetypes:maven-archetype-webapp:RELEASE)

Solution: The archtypes are downloaded from the maven central repo, so make sure the repo is specified in your .m2/settings.xml

In the settings.xml,  under profiles->profile->repositories add the maven central repo url

      <profiles> <profile>  <repositories>


                <releases> <enabled>true</enabled> </releases>






     </repositories> </profile> </profiles>

Tip: Check if the url ‘https://repo1.maven.org/maven2‘ works first , else give a repo which works.