Scylla Chronicles: Working with streaming while joining a node

We at Eyeota currently run a single cluster across multiple AWS regions. It's a global cluster, but we have keyspaces specific to each region, i.e. apps local to an AWS region read from and write to their corresponding keyspace (for example, apps in us-east-1 would interact only with the keyspace us-east-1_txns). A few keyspaces (the smaller ones, like Singapore's) have cross-DC replication.

We use Scylla 3.0.8 on Linux 5.1.15-1.el7.elrepo.x86_64.

The instance classes used were i3/i3en, but the size differed from AWS region to AWS region – from i3.xlarge / i3en.xlarge up to i3en.6xlarge (within each region the instance type is homogeneous). The amount of traffic and data varies from region to region, hence the different instance sizes. We used i3.xlarge in Singapore because i3en was not available there, and used the same class in Sydney as well because we had cross-DC replication between AU and SG.

Here is a brief summary of how we set up the cluster in the different data centers:

  1. Scylla provides an AMI to use in AWS, but it's not available in all regions, so we created our own AMI from it – one AMI for all regions.
  2. In each region, we have nodes in different availability zones.
  3. We fired up instances in the different AWS regions using our AMI and started Scylla.
  4. We migrated data from our old data store to Scylla: we wrote a custom app which reads from the old store and inserts/updates records into Scylla.
  5. After migration, SSD disk usage was about 40-50% on each node.
  6. We use the leveled compaction strategy, with the new 'mc' SSTable format enabled and a compression chunk size of 4 KB (see the example snippet after this list).
  7. The record size and number of records vary from region to region, and we used an appropriate i3en size for each region: us-east-1 has bigger machines than Australia because its traffic, and hence its data, is larger.
  8. Since we needed to expand, we had to add nodes in each region.
  9. We decided to add nodes in Frankfurt first, before moving to the big us-east-1 machines (i3en.6xlarge).
  10. Note: the new node had auto_bootstrap = true and we did not do a node rebuild.
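For illustration, here is roughly what items 6 and 10 above translate to. The option names below are assumptions based on how Scylla 3.0-era settings are commonly spelled, not a copy of our configuration files, so verify them against the documentation for your version:

# scylla.yaml – assumed option names
auto_bootstrap: true
enable_sstables_mc_format: true

-- CQL table options, with placeholder keyspace/table names
ALTER TABLE my_keyspace.my_table
  WITH compaction = {'class': 'LeveledCompactionStrategy'}
  AND compression = {'sstable_compression': 'LZ4Compressor', 'chunk_length_in_kb': 4};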

Bear in mind that while the node was being added, reads/writes were enabled on the Frankfurt nodes, i.e. live traffic was flowing to them.

From here, things got slower and slower…

When we added a node in Frankfurt (i3en.2xlarge), we expected it to go into UN state within a day or two, but it did not. At this point we were worried; going by our estimates, it would take about 7 days to complete, which was not viable.

If this was the case for Frankfurt, it would take even more time for our us-east-1 nodes to join 😦

We immediately started to debug:

  • iftop (incoming network bytes) – low; streaming was coming from 2-3 boxes in the same AZ.
  • sar -d 1 (iowait around 0.21) – not high on either the source nodes or the joining node.
  • load average at a maximum: 'x' on an 'x'-core box.
  • Scylla logs – the usual: no errors, only info messages about compaction and streaming.

Now we began to wonder: was the setup done properly?

  • We noticed that scylla_ami_setup had failed to run (we went through the scripts and found that it was not a big deal if the AMI setup failed).
  • We noticed that when we ran scylla_setup, it failed too – see bug (this too was not a big deal).
  • Our RAID was set up properly on all hosts.
    • RAID setup was also done on i3en.xlarge => RAID on a box with a single disk.
      • A Redundant Array of Independent Disks with one disk has no redundancy, but anyway – it's the standard setup everywhere.
  • Our CPU set covered all the cores, as decided by the scylla_setup script for the instance type, and network interrupts were in mq mode, also based on the instance type.
  • Our io_properties.yaml was good too.
  • We took the Scylla dashboards from Scylla Monitoring; we mostly looked at 'scylla_io_queue_streaming_write_total_bytes' and 'scylla_io_queue_streaming_read_total_bytes'.
  • 'scylla_io_queue_streaming_write_total_bytes' was always around X MB/second, where X = the number of cores on the box – interestingly, about 1 MB per core per second. This was consistent across all the instance types we used.
  • We noticed FS trim was not set up or running. We did try running fstrim once a day while the node was joining, but we could not see much difference.

Looking at this blog post about the time taken to rebuild a Scylla node, we were obviously nowhere near those numbers.

So what's different from the Scylla blog post benchmark?

  1. Is record size a factor? Possibly.
  2. Is the compaction strategy a factor? Possibly, when we think of write amplification on the disks.
  3. Is the number of records a factor? – see this bug.
  4. Active reads/writes happening simultaneously on the source nodes => compactions => additional disk activity and CPU usage => yes, it is a factor.
  5. We decided to run our own benchmark. We took 2 nodes (i3en.2xlarge).
    1. Generated data on one node (each record ~900 kB, ~600 GB compressed).
    2. Started the 2nd node (auto_bootstrap=true); it went into UJ state and eventually to UN.
    3. Joining time = ~1 hour, which was awesome.
    4. Joining time from i3en.2xlarge to i3en.6xlarge = 40 minutes.
    5. The results were great, but this was not what we saw in prod.
  6. What was missing in the above test was active reads/writes simulating production traffic while the join was in progress.
  7. What we could take away from the test (i3en.2xlarge to i3en.6xlarge) was that the number of cores is obviously a factor => CPU.
  8. Looking into GitHub, we found the notion of CPU shares for various activities like streaming, compaction, etc. The shares are hard-coded and there is no way to tune them without recompiling the code. The idea was: give more CPU to streaming, which, going by the code, has the lowest priority. But there are no levers or configuration parameters to help streaming out.
  9. After banging our heads trying different things and looking into the code, it seemed hopeless: the disks were definitely not overloaded (in our test above we saw iowait go to 3.1 using `sar -d 1`, while production sat at 0.21), and the network was nowhere near saturated.

Other thoughts in our heads:

1) Let's change the compaction strategy to size-tiered, as the compaction load on disk would be less than with leveled. We could try, but it's a painful process to change, not feasible in the long run, and our disks didn't seem loaded anyway.

2) FS trim – let's clean up the SSD? We tried, but it didn't help much, and in any case we can run it at most once per day, not once per hour.

3) We stopped active reads to help out, but no joy (we could not stop active writes).

4) For a while we looked at Seastar to see if there were any tuning knobs that could help.

5) At times we engaged in blindly trying out configuration parameters / black-box tuning.

….

With project deadlines approaching, we had to expand our cluster (in all regions) fast, by any means – hack or crook 🙂

We finally stumbled upon the configuration parameter 'compaction_static_shares' in config.cc.

We decided to play with this, going by the philosophy 'hack or crook' (https://en.wikipedia.org/wiki/By_hook_or_by_crook).

Since the CPU shares for the streaming (maintenance) scheduling group are hard-coded to 200 in the code:

auto maintenance_scheduling_group = make_sched_group(streaming, 200);

we decided to set the compaction shares to a value below 200, which effectively makes streaming more important relative to compaction. A rough sketch of the setting follows.
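As a minimal sketch, assuming (as is usually the case for Scylla options defined in config.cc) that the parameter can be set under the same name in scylla.yaml – verify this against your version – the override looks like:

# scylla.yaml – assumed key name, taken from config.cc; remove again after the join completes
compaction_static_shares: 100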

So:

  1. We decided to try this out in our Sydney region (i3.xlarge).
  2. We set compaction_static_shares=100 on the streaming source nodes and on the joining node.
  3. We restarted the streaming source nodes first.
  4. Then we restarted the joining node, so that it could make fresh connections to the source nodes (the order of restarts is important!).
  5. We were pleasantly surprised: streaming moved at a noticeably faster pace.

[Screenshot: Scylla Monitoring graph comparing the two node joins]

As you can see, the node in green took about 56 hours, vs. about 18 hours for the node in yellow.

=> That is a 3x boost!

Note: the partition sizes (in bytes, from nodetool cfhistograms; the exact command is shown right after these numbers) were:

  • Frankfurt keyspace: 50%-924, 75%-4768, 95%-24501, 99%-42510, Max-785939.
  • Sydney keyspace: 50%-2759, 75%-14237, 95%-61214, 99%-152321, Max-1629722.
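The percentile numbers above come from nodetool's per-table histograms; the invocation, with placeholder keyspace and table names, is simply:

nodetool cfhistograms <keyspace> <table>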

Armed with this, we started adding nodes across the other AWS regions. Once they had joined, we removed the setting and restarted the nodes.

Lowering the compaction CPU shares => fewer compactions, which led to an increase in read latency; that was OK for us.

Our objective was to join nodes as quickly as possible because of the project deadlines, so we chose this approach.

Going by our past experience with traffic patterns, we add a node only once in a while and expect to do so in the future as well. Hence we might choose not to use this configuration parameter when there is no rush: if a node takes 6 days to join in the Frankfurt region (i3en.2xlarge), so be it – we are willing to wait.

Having said that, we are still puzzled by the slow joining (we reported this on the Scylla GitHub).

In the end, the team (Krishna, Rahul and myself) enjoyed debugging and learning new stuff.

Thanks for reading. If you feel we got anything wrong, feel free to point it out – we would appreciate any feedback or questions.

PS: this post is quite delayed – had a lot of vacations 🙂

Scylla: received an invalid gossip generation for peer [How to resolve]

At my current org eyeota.com, we run a lot of Scylla & Cassandra clusters.

Our Scylla cluster for analytics/data science has > 40 nodes holding terabytes of data and runs version 2.1.3-0.20180501.8e33e80ad.

The cluster had recently been losing nodes: nodes that restarted were not being allowed to rejoin the cluster during the gossip phase of boot-up. The reason: nodes with status=UN (up and normal) were spewing the following error and not allowing the affected nodes to join during gossip.

Jul 04 01:54:17 host-10.3.7.77 scylla[30263]: [shard 0] gossip – received an invalid gossip generation for peer 10.3.7.7; local generation = 1526993447, received generation = 1562158865

Now let's go into the details and context of the above error message:

  • Every node is configured with a list of seeds, with which it tries to gossip and gather cluster info during boot-up.
  • On boot-up, a node creates a 'generation' number (the generation number is an epoch timestamp), which it shares with the seed during gossip (refer to the code here).
  • On its FIRST ever boot-up, the node sends its generation number to the seed, and the seed gossips with the other nodes to pass the info on. The seed stores this generation number as a reference; this is the local_generation referred to in the error message above. In other words, the UN node 10.3.7.77 is saying that peer 10.3.7.7 sent generation number 1562158865 (the remote_generation), while it has 1526993447 stored as its reference. Note that 1526993447 is the epoch for May 22, 2018 and 1562158865 is the epoch for July 3, 2019, i.e. node 10.3.7.7 first started on May 22, 2018 and sent 1526993447 as its generation number back then.

  • Since the difference between the two epochs is greater than 1 year, the UN node refuses to let the other node join (see the code here).
  • During boot-up, the logic of increment_and_get is as follows (refer to the code here): the server first looks up the generation number in the system.local table. If the value is empty, it generates a fresh number, i.e. the current time, since the logic for generating a generation number depends solely on the current time (see here). If it is not empty, it compares the stored value with the current time, uses the larger (i.e. more recent) value, and writes it back to system.local (hence this StackOverflow suggestion of updating the table won't work).
  • So the generation number generated and sent by a node to its seed on boot-up is, in general, close to the current time, but the generation number stored by the UN seed node as its local reference never changes.
  • In our example error message, the UN node 10.3.7.77 reports that peer 10.3.7.7 is sending 1562158865, but since 1526993447 + 1 year < 1562158865, the peer will not be allowed to join. (A simplified sketch of this logic follows below.)
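To make the check concrete, here is a simplified sketch of the logic described above. Scylla's actual implementation is in C++ (see the gossiper.cc and system_keyspace.cc links in the References section); this Go version is only illustrative and uses the numbers from our error message.

package main

import (
	"fmt"
	"time"
)

// The 1-year window described above, in seconds.
const maxGenerationDifference = int64(365 * 24 * 3600)

// Seed-side check: reject a peer whose remote generation is more than
// a year ahead of the locally stored reference.
func generationAccepted(localGeneration, remoteGeneration int64) bool {
	return remoteGeneration <= localGeneration+maxGenerationDifference
}

// Booting-node side, a sketch of increment_and_get: take the larger of the
// value stored in system.local and the current time (the real code also
// writes the result back to system.local).
func incrementAndGet(storedGeneration int64, now time.Time) int64 {
	if current := now.Unix(); current > storedGeneration {
		return current
	}
	return storedGeneration
}

func main() {
	local := int64(1526993447)  // reference stored on the UN seed (May 22, 2018)
	remote := int64(1562158865) // generation sent by the rebooted node (July 3, 2019)

	fmt.Println(generationAccepted(local, remote))     // false: more than 1 year apart
	fmt.Println(generationAccepted(local, 1554030000)) // true: March 31, 2019 falls within the window
}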

 

So how do we solve this problem?

(1) We tried multiple rolling restarts, as suggested on the internet, but no joy – it might or might not work for you.

(2) Stop all clients, stop all nodes, and start them ALL up within a window of a few minutes. This should work, as the generation numbers get reset across the cluster (we have not verified this reasoning, but people on the internet have reported that this approach works). The only problem is the DOWNTIME of rebooting an entire cluster.

(3) To avoid an entire cluster reboot: we adopted this approach in production, based on the code logic explained above.

  • The root problem was that the local generation of the problematic node, as stored in the UN seed node, never changes (while the problematic node sends a fresh generation number, close to the current time, on every reboot).
  • IDEA: let's update the local generation of the problematic node stored in the UN node, so that the remote generation number sent by the problematic node falls within the 1-year window.
    • So how do we update this value in the UN seed node?
      • We need to make the problematic node send a generation number (an epoch) that falls within 1 year of the local generation number stored in the UN seed node.
      • But since the code always takes the current time as the generation number, and the current time is July 2019, what can we do?
      • We change TIME back on the problematic node to a value that falls within 1 year of 1526993447. Choose an epoch towards the end of that 1-year window, i.e. change the system time to, say, March 31, 2019 (epoch 1554030000) rather than October 2, 2018, and restart the node. The node reboots and sends generation number 1554030000 to the seed (it either picks it up from the system.local table or takes the current time, which is March 31, 2019 either way).
      • The UN seed node gets this value, validates that the remote generation number sent by the problematic node is within 1 year of May 22, 2018, and proceeds to update its reference (local generation).
      • Voilà, we have successfully updated the reference (local generation) for the problematic node stored in the UN seed node.
      • Now we stop the problematic node, reset its system time back to the current time and reboot; the problematic node will now send the latest epoch, say July 4, 2019, i.e. 1562215230.
      • Since 1562215230 (the generation sent by the problematic node using the latest time) minus 1554030000 (the local reference now stored in the UN seed node) is less than 1 year, the problematic node will be allowed to join the cluster.
      • Hurray!
      • We advise you to choose an epoch/date towards the end of the 1-year window (but still within it) – the later the better, as a new 1-year window starts from the date you choose and the problem is mitigated for that long. Yes, this problem occurs on long-running clusters, which means you need to do a rolling restart once in a while to extend the window. Reminds me of visa extensions.

Steps:

  1. If the problematic node is 10.3.7.7 and the error is reported on, say, 10.3.7.77 (a UN node), make sure the seed for 10.3.7.7 is 10.3.7.77, so that we guarantee it talks to that node and we don't have to search the cluster to find out whom it is talking to. If the seed for the 7.7 node is different from the node reporting the error, look at the error message printed by that seed node to decide which epoch to reset to. In our case, since we saw the error on 7.77, we changed the seed of 7.7 to the 7.77 node.
  2. Start the problematic node.
  3. The seed node should start printing the error. Capture the error message for our node and note the local generation number, which we will use to choose a date to reset to. In our case the message was as follows:
  4. Jul 04 01:54:17 host-10.3.7.77 scylla[30263]: [shard 0] gossip – received an invalid gossip generation for peer 10.3.7.7; local generation = 1526993447, received generation = 1562158865
  5. cqlsh to the problematic node 10.3.7.7 and update the generation number to an epoch within 1 year of 1526993447, BUT choose an epoch towards the end of that 1-year window, like 1554030000 (March 31, 2019), rather than, say, July or October 2018, so that you get a longer new 1-year window (see the epoch/date conversion helper after these steps).
  6. On the problematic node, run:

    1. update system.local set gossip_generation = 1554030000 where key='local';
  7. On the problematic node, run:

    1. nodetool flush
  8. Stop the problematic node.
  9. Edit the config file and change the CQL port (native_transport_port) from 9042 to 9043 so that clients can't connect and insert data – inserts during this phase would be written with March 2019 timestamps, which is not correct; in other words, we prevent data corruption. This is a precaution.
  10. Change the system time, e.g. date -s "31 MAR 2019 11:03:25".
  11. Verify the system time has changed by running the date command.
  12. Start the problematic node and tail the logs of the UN seed node; the error should go away.
  13. Wait for some time (a few minutes is enough) for gossip to occur and verify that the problematic node is now UN.
  14. You can tail the logs of the UN seed node and check whether you still get the error.
    1. If you do see the error again, repeat the steps from the start – you might have missed something.
  15. Once the node is declared UN:
    1. Shut down the node.
    2. Change the CQL port (native_transport_port) back from 9043 to 9042 in the config file.
    3. RESET the system time on the box.
    4. Verify the system time is back to normal.
  16. Start up the node once you have changed the time and port back.
  17. You will notice that the node is still UN 🙂
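A quick way to convert between epochs and dates when picking the reset value (GNU date on Linux; the numbers are the ones from our example):

# epoch -> date (UTC)
date -u -d @1526993447            # the local reference stored on the seed (May 2018)
date -u -d @1554030000            # the value we chose to reset to (March 31, 2019)

# date -> epoch (UTC)
date -u -d "31 Mar 2019 11:00:00" +%s   # prints 1554030000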

Confessions:

  1. Yes, we did this exercise in production. The node was considered dead anyway, so the risk was minimal: screwing up a dead node even more makes no difference, and if the procedure failed we would sacrifice only that one node and be left with a full cluster reboot as the only remaining option.
  2. We scanned the Scylla master branch for usages of system time in cluster communication and found only two places, which gave us confidence that changing the system time would work. Also, by changing the CQL port to 9043, we eliminated any contamination of existing data by clients.

Morals of the story:

  1. This happened on Scylla 2.1, and as of today (July 4, 2019) the Scylla master branch still has the same code logic, so it can occur in version 3 and above too.
  2. Every few months, it is better to do rolling restarts of the nodes, so that they send a new generation number via gossip and the 1-year window is extended.
  3. If you have a cluster that has been running for more than a year and a node is restarted, it will hit this error; the more node restarts happen, the more the epidemic spreads.
  4. This can work for Cassandra too, if the code logic is the same – which I think it is.

References:

https://github.com/scylladb/scylla/blob/134b59a425da71f6dfa86332322cc63d47a88cd7/gms/gossiper.cc

https://github.com/scylladb/scylla/blob/94d2194c771dfc2fb260b00f7f525b8089092b41/service/storage_service.cc

https://github.com/scylladb/scylla/blob/077c639e428a643cd4f0ffe8e90874c80b1dc669/db/system_keyspace.cc

https://stackoverflow.com/questions/40676589/cassandra-received-an-invalid-gossip-generation-for-peer

Thanks to my teammate Rahul Anand for helping debug the issue and providing key breakthroughs during the firefighting.

If you found this useful, please comment either on Stack Overflow or here. Thanks.

Update:

The folks at Scylla have now fixed this.

refer: https://github.com/scylladb/scylla/pull/5195

golang: Parse a large JSON file in a streaming manner

Hi all,

I recently tried to parse a large JSON file in Go and used

'ioutil.ReadAll(file)' followed by 'json.Unmarshal(data, &msgs)'

This led to a huge spike in memory usage, as the entire file is read into memory in one go.

I wanted to parse the file in a streaming fashion, i.e. read it bit by bit and decode the JSON incrementally, just like the Jackson streaming API.

I did some digging and found

func NewDecoder(r io.Reader) *Decoder {...} in the encoding/json package

The above uses a reader to buffer and pick up the data incrementally.
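Here is a minimal, self-contained sketch of the streaming approach. The Msg fields and the file name are made up for illustration (the real code linked below handles my actual record format), and it assumes the file is one large JSON array:

package main

import (
	"encoding/json"
	"fmt"
	"log"
	"os"
)

// Msg is a placeholder record shape; adjust the fields to match your data.
type Msg struct {
	ID   int    `json:"id"`
	Body string `json:"body"`
}

func main() {
	f, err := os.Open("big.json") // assumed input: a large JSON array [{...},{...},...]
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	dec := json.NewDecoder(f)

	// Consume the opening '[' of the array.
	if _, err := dec.Token(); err != nil {
		log.Fatal(err)
	}

	// Decode one element at a time; only one Msg is held in memory per iteration.
	count := 0
	for dec.More() {
		var m Msg
		if err := dec.Decode(&m); err != nil {
			log.Fatal(err)
		}
		count++ // process m here instead of accumulating everything in a slice
	}

	// Consume the closing ']'.
	if _, err := dec.Token(); err != nil {
		log.Fatal(err)
	}
	fmt.Println("decoded", count, "records")
}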

Here is some sample code to do it:

https://github.com/jaihind213/golang/blob/master/main.go

The difference is in memory footprint:

We print memory usage as we parse a ~900MB file generated by generate_data.sh.

Streaming fashion:

[Screenshots: memory usage printed while parsing in streaming mode]

Simple parsing by reading everything into memory:

[Screenshots: memory usage printed for the read-all-into-memory approach]

Conclusion:

The streaming parser has a very low Heap_inUse compared to the massive usage of the read-all-into-memory approach.

Druid Indexing Fails – ISE – Cannot find instance of indexer to talk to!

Recently my Druid overlord and middle manager stopped indexing data.

I tried the basic approach of multiple restarts, but it did not help.

The logs of the middle manager revealed the following:

2018-10-02 06:01:37 [main] WARN  RemoteTaskActionClient:102 - Exception submitting action for task[index_nom_14days_2018-10-02T06:01:29.633Z]

java.io.IOException: Failed to locate service uri

        at io.druid.indexing.common.actions.RemoteTaskActionClient.submit(RemoteTaskActionClient.java:94) [druid-indexing-service-0.10.0.jar:0.10.0]

        at io.druid.indexing.common.task.IndexTask.isReady(IndexTask.java:150) [druid-indexing-service-0.10.0.jar:0.10.0]

        at io.druid.indexing.worker.executor.ExecutorLifecycle.start(ExecutorLifecycle.java:169) [druid-indexing-service-0.10.0.jar:0.10.0]

        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_181]

        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_181]

        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_181]

        at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_181]

        at io.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler.start(Lifecycle.java:364) [java-util-0.10.0.jar:0.10.0]

        at io.druid.java.util.common.lifecycle.Lifecycle.start(Lifecycle.java:263) [java-util-0.10.0.jar:0.10.0]

        at io.druid.guice.LifecycleModule$2.start(LifecycleModule.java:156) [druid-api-0.10.0.jar:0.10.0]

        at io.druid.cli.GuiceRunnable.initLifecycle(GuiceRunnable.java:102) [druid-services-0.10.0.jar:0.10.0]

        at io.druid.cli.CliPeon.run(CliPeon.java:277) [druid-services-0.10.0.jar:0.10.0]

        at io.druid.cli.Main.main(Main.java:108) [druid-services-0.10.0.jar:0.10.0]

Caused by: io.druid.java.util.common.ISE: Cannot find instance of indexer to talk to!

        at io.druid.indexing.common.actions.RemoteTaskActionClient.getServiceInstance(RemoteTaskActionClient.java:168) ~[druid-indexing-service-0.10.0.jar:0.10.0]

        at io.druid.indexing.common.actions.RemoteTaskActionClient.submit(RemoteTaskActionClient.java:89) ~[druid-indexing-service-0.10.0.jar:0.10.0]

The main cause was

Caused by: io.druid.java.util.common.ISE: Cannot find instance of indexer to talk to!

Apparently the middle manager and its peons could not find the overlord service needed for indexing to start, and hence all jobs were failing.

Root Cause:

I logged into ZooKeeper and found the following two entries for the overlord/coordinator to be empty, even though both services were up and running.

Since they were empty, the middle manager (which looks up ZooKeeper) could not find an indexer to talk to!

[zk: localhost:2181(CONNECTED) 1] ls /druid/discovery/druid:coordinator
[]
[zk: localhost:2181(CONNECTED) 3] ls /druid/discovery/druid:overlord
[]

Fix:

Assuming that the coordinator runs on HOST_A and the overlord runs on HOST_B, do the following:

(1) Generate 2 random IDs, one for the coordinator and one for the overlord.

Druid uses this Java code:

System.out.println(java.util.UUID.randomUUID().toString());

You can use any random string.

Let's assume you generate

'45c3697f-414f-48f2-b1bc-b9ab5a0ebbd4' for the coordinator

&

'f1babb39-26c1-42cb-ac4a-0eb21ae5d77d' for the overlord.

(2) Log into the ZooKeeper CLI and run the following commands:

[zk: localhost:2181(CONNECTED) 3] create /druid/discovery/druid:coordinator/45c3697f-414f-48f2-b1bc-b9ab5a0ebbd4 {"name":"druid:coordinator","id":"45c3697f-414f-48f2-b1bc-b9ab5a0ebbd4","address":"HOST_A_IP","port":8081,"sslPort":null,"payload":null,"registrationTimeUTC":1538459656828,"serviceType":"DYNAMIC","uriSpec":null}

[zk: localhost:2181(CONNECTED) 3] create /druid/discovery/druid:overlord/f1babb39-26c1-42cb-ac4a-0eb21ae5d77d {"name":"druid:overlord","id":"f1babb39-26c1-42cb-ac4a-0eb21ae5d77d","address":"HOST_B_IP","port":8090,"sslPort":null,"payload":null,"registrationTimeUTC":1538459763281,"serviceType":"DYNAMIC","uriSpec":null}

You will notice a registrationTimeUTC field – put in any reasonable value (the examples above use epoch time in milliseconds).

This should create the necessary ZooKeeper nodes for the middle manager to look up.

Now restart the coordinator, then the overlord, then the middle manager.

Submit an indexing task and it should work.

Additional tips:

In $Druid_Home/conf/druid/middleManager/runtime.properties you may have

druid.indexer.runner.javaOpts=-cp conf/druid/_common:conf/druid/middleManager/:conf/druid/middleManager:lib/* -server -Ddruid.selectors.indexing.serviceName=druid/overlord -Xmx7g -XX:+UseG1GC -XX:MaxGCPauseMillis=100 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Dhadoop.mapreduce.job.user.classpath.first=true

and $Druid_Home/conf/druid/middleManager/jvm.config you may have

-Ddruid.selectors.indexing.serviceName=druid/overlord

Hope this helps you.

java.lang.RuntimeException: native snappy library not available – druid ingestion

I was ingesting some data into Druid using local batch mode and the ingestion failed with:

2018-07-10T04:57:51,501 INFO [Thread-31] org.apache.hadoop.mapred.LocalJobRunner – map task executor complete.
2018-07-10T04:57:51,506 WARN [Thread-31] org.apache.hadoop.mapred.LocalJobRunner – job_local265070473_0001
java.lang.Exception: java.lang.RuntimeException: native snappy library not available: this version of libhadoop was built without snappy support.
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:489) ~[hadoop-mapreduce-client-common-2.6.0-cdh5.9.0.jar:?]
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:549) [hadoop-mapreduce-client-common-2.6.0-cdh5.9.0.jar:?]
Caused by: java.lang.RuntimeException: native snappy library not available: this version of libhadoop was built without snappy support.

To avoid Snappy compression, I disabled it with the following tuningConfig in the ingestion spec:

"tuningConfig": {

                        "type": "hadoop",

                        "partitionsSpec": {

                                "type": "hashed",

                                "targetPartitionSize": 5000000

                        },

                        "jobProperties": {

                                 "mapreduce.framework.name" : "local",

                                 "mapreduce.map.output.compress":"false"

                                "mapreduce.job.classloader": "true",

                                "mapreduce.job.classloader.system.classes": "-javax.validation.,java.,javax.,org.apache.commons.logging.,org.apache.log4j.,org.apache.hadoop."
                                  .........
                                  .........
                        }

                }

 

Pushing druid metrics to Graphite Grafana with full metrics whitelist

This blog post is about configuring Druid to send metrics to Graphite/Grafana.

Kindly refer to the steps at:

https://github.com/jaihind213/druid_stuff/tree/master/druid_metrics_graphite

The Grafana dashboard templates are in the above GitHub link.

———————————————————————————————————————

After you have followed the steps above, restart Druid and it should start sending metrics to Graphite and Grafana.

You should see log statements like

INFO [main] io.druid.emitter.graphite.GraphiteEmitter - Starting Graphite Emitter.

INFO [GraphiteEmitter-0] io.druid.emitter.graphite.GraphiteEmitter - trying to connect to graphite server

INFO [main] io.druid.initialization.Initialization - Loading extension [graphite-emitter] for class

[Screenshots: Grafana dashboards showing Druid metrics pulled from Graphite]

Tip: you can set the frequency at which metrics are reported and also the types of monitors you wish to have, BASED ON THE RELEASE version you use.

Important tip: start with the JVM monitor first, then slowly add others. The sys monitor needs the Sigar jar; some monitors might not work immediately.
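For illustration only, the relevant settings end up looking something like the sketch below in common.runtime.properties. The exact property keys and monitor class names depend on your Druid release (which is why we say to check the docs for your version), so treat every line here as an assumption to verify rather than a copy of our config:

# illustrative sketch – verify names against your Druid release
druid.extensions.loadList=["graphite-emitter"]   # plus whatever extensions you already load
druid.emitter=graphite
druid.emitter.graphite.hostname=graphite.example.com   # assumed host
druid.emitter.graphite.port=2003
# start with just the JVM monitor, add others once this works
druid.monitoring.monitors=["com.metamx.metrics.JvmMonitor"]
druid.monitoring.emissionPeriod=PT1M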


Druid Indexing & Xerces Hell – Loader Constraint Violation

Recently my teammate Vihag came across a java.lang.LinkageError while doing Druid indexing with CDH 5.10.2.

It was good fun, we must say, finding out the reason – taking the MD5 of class files, comparing them, and finding out which jar each class was loaded from.

—————————————-

2018-05-11 11:03:04,848 FATAL [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Error starting MRAppMaster
java.lang.LinkageError: loader constraint violation: when resolving overridden method “org.apache.xerces.jaxp.DocumentBuilderImpl.newDocument()Lorg/w3c/dom/Document;” the class loader (instance of org/apache/hadoop/util/ApplicationClassLoader) of the current class, org/apache/xerces/jaxp/DocumentBuilderImpl, and its superclass loader (instance of <bootloader>), have different Class objects for the type org/w3c/dom/Document used in the signature
at org.apache.xerces.jaxp.DocumentBuilderFactoryImpl.newDocumentBuilder(Unknown Source)
at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:2541)
at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:2503)
at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2409)
at org.apache.hadoop.conf.Configuration.get(Configuration.java:1233)
at org.apache.hadoop.yarn.factory.providers.RecordFactoryProvider.getRecordFactory(RecordFactoryProvider.java:49)
at org.apache.hadoop.mapreduce.TypeConverter.<clinit>(TypeConverter.java:62)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$1.call(MRAppMaster.java:481)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$1.call(MRAppMaster.java:469)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.callWithJobClassLoader(MRAppMaster.java:1579)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.createOutputCommitter(MRAppMaster.java:469)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.serviceInit(MRAppMaster.java:391)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$4.run(MRAppMaster.java:1537)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.initAndStartAppMaster(MRAppMaster.java:1534)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.main(MRAppMaster.java:1467)
2018-05-11 11:03:04,851 INFO [main] org.apache.hadoop.util.ExitUtil: Exiting with status 

—————————————-

Vihag has documented the resolution nicely here on GitHub.

—————————————-

The crux of the matter was that the class 'DocumentBuilderImpl' was present in Druid's xercesImpl-2.9.1.jar while YARN was loading it from xercesImpl-2.10.0.jar.

So we removed the 2.9.1 jar from Druid and replaced it with the 2.10.0 jar.

We also noticed that the 2.10.0 jar plays well with xml-apis-1.4.01.jar and not with xml-apis-1.3.04.jar.

Hope this helps you.

UnknownArchetype: The desired archetype does not exist

I recently encountered this error while generating a project in IntelliJ.

[ERROR] BUILD FAILURE
[INFO] ————————————————————————
[INFO] : org.apache.maven.archetype.exception.UnknownArchetype: The desired archetype does not exist (org.apache.maven.archetypes:maven-archetype-webapp:RELEASE)
The desired archetype does not exist (org.apache.maven.archetypes:maven-archetype-webapp:RELEASE)

Solution: the archetypes are downloaded from the Maven Central repo, so make sure the repo is specified in your ~/.m2/settings.xml.

In settings.xml, under profiles -> profile -> repositories, add the Maven Central repo URL:

      <profiles>
        <profile>
          <repositories>

            <repository>

              <releases> <enabled>true</enabled> </releases>

              <snapshots><enabled>true</enabled></snapshots>

              <id>central</id>

              <url>https://repo1.maven.org/maven2</url>

            </repository>

            ……

          </repositories>
        </profile>
      </profiles>

Tip: check that the URL https://repo1.maven.org/maven2 works first; otherwise use a repo that does.