Scylla: received an invalid gossip generation for peer [How to resolve]

At my current org eyeota.com, we run a lot of Scylla & Cassandra clusters.

Our Scylla cluster for analytics/data science has > 40 nodes holding terabytes of data and runs on version 2.1.3-0.20180501.8e33e80ad.

The cluster had recently been losing nodes: those nodes restarted and were not allowed to rejoin the cluster during the gossip phase of bootup. The reason was that nodes which were status=UN (up and normal) were spewing the following error and refusing to let the affected nodes join the cluster during gossip.

Jul 04 01:54:17 host-10.3.7.77 scylla[30263]: [shard 0] gossip – received an invalid gossip generation for peer 10.3.7.7; local generation = 1526993447, received generation = 1562158865

Now let's go into the details and context of the above error message:

  • Every node is configured with a list of seeds, which it gossips with to gather cluster info during bootup.

  • On bootup, a node creates a 'generation' number (the generation number is an epoch timestamp) which it shares with the seed host during gossip (refer to the code here).

  • On its FIRST ever bootup, the node sends its generation number to the seed, and the seed gossips with the other nodes to pass the info on. The seed stores this generation number as a reference. This is the local_generation term in the error message above, i.e. the UN node 10.3.7.77 was saying that peer 10.3.7.7 was sending a generation number of 1562158865 (the remote_generation), but it had stored 1526993447 as its reference. You will notice that 1526993447 is the epoch for May 22, 2018 and 1562158865 is the epoch for July 3, 2019, i.e. node 10.3.7.7 first started on May 22, 2018 and had sent 1526993447 as its generation number back then.

  • Since the difference between the two epochs is greater than 1 year, the UN node refuses to allow the other node to join (see the code here).

  • Now, during bootup, the logic of increment_and_get is as follows (refer to the code here): the server first looks up the generation number in the system.local table. If the value is empty, it generates a fresh number, i.e. the current time, since the logic for generating a generation number depends solely on the current time (see here). If it is not empty, it compares the stored value with the current time, uses the larger value, i.e. the more recent time, and writes it back to the system.local table (hence this stackoverflow suggestion of simply updating the table won't work). A sketch of this logic is shown just after this list.

  • So the generation number generated and sent by a node to its seed on bootup is, in general, close to the current time, but the generation number stored by the UN seed node as the local reference does not change.

  • In our example error message, the UN node 10.3.7.77 reports that peer 10.3.7.7 is sending 1562158865, but since 1526993447 + 1 year < 1562158865, the peer won't be allowed to join.
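To make the above concrete, here is a rough sketch of the two pieces of logic, written in Go purely for illustration (the real implementation is Scylla C++, linked in the references; the constant and function names below are mine, and the 1-year limit is written out as 86400 * 365 seconds):

package main

import (
	"fmt"
	"time"
)

// Roughly 1 year in seconds; the seed rejects a peer whose generation is
// further than this ahead of its stored reference (illustrative constant).
const maxGenerationDifference = 86400 * 365

// incrementAndGet mimics what a node does on bootup: take the larger of the
// generation stored in system.local (0 if empty) and the current time; that
// value is what gets written back and gossiped to the seed.
func incrementAndGet(storedGeneration int64) int64 {
	now := time.Now().Unix()
	if storedGeneration > now {
		return storedGeneration
	}
	return now
}

// acceptPeer mimics the seed-side check that produced our error message.
func acceptPeer(localGeneration, remoteGeneration int64) bool {
	return remoteGeneration-localGeneration <= maxGenerationDifference
}

func main() {
	fmt.Println(incrementAndGet(1526993447)) // effectively "now", since now > the stored value

	local := int64(1526993447)  // reference stored by the UN seed: May 22, 2018
	remote := int64(1562158865) // sent by the rebooting node: July 3, 2019
	fmt.Println(acceptPeer(local, remote)) // false: more than 1 year apart, so the node is rejected
}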

 

So how do we solve this problem?

(1) We tried multiple rolling restarts as suggested on the internet, but no joy. This might or might not work for you.

(2) Stop all clients, stop all nodes, and start them ALL up within a window of a few minutes. This should work, as generation numbers get reset across the cluster (we have not verified this reasoning ourselves, but people on the internet report that this approach works). The only problem is the DOWNTIME of rebooting an entire cluster.

(3) Avoid an entire cluster reboot: this is the approach we adopted in production, based on the code logic explained above.

  • The root problem is that the local generation of the problematic node stored in the UN seed node does not change (while the problematic node, on every reboot, sends a fresh generation number close to the current time).
  • IDEA: let's update the local generation of the problematic node stored in the UN node, so that the remote generation number sent by the problematic node falls within 1 year of it.
    • So how do we update this value in the UN seed node?
      • We need to make the problematic node send a generation number (epoch) whose value falls within the 1-year window of the local generation number stored in the UN seed node.
      • But the code always takes the current time as the generation number, and the current time is July 2019, so what can we do?
      • We change the TIME back on the problematic node to a value that falls within 1 year of 1526993447. Choose an epoch value towards the end of that 1-year window, i.e. change the system time to, say, March 31, 2019 (epoch 1554030000) rather than October 2, 2018, and restart the node. The node reboots and sends a generation number of 1554030000 (as it looks up the system.local table), or the current time, which is March 31, 2019 anyway, to the seed.
      • The UN seed node gets this value, validates that the remote generation number sent by the problematic node is within 1 year of May 22, 2018, and proceeds to update its reference (local generation).
      • Voilà, we have successfully updated the reference (local generation) of the problematic node stored in the UN seed node.
      • Now we stop the problematic node, reset its time back to the current time, and reboot. The problematic node now sends the latest epoch, say July 4, 2019, i.e. epoch 1562215230.
      • Since 1562215230 (the generation sent by the problematic node using the latest time) minus 1554030000 (the local reference stored in the UN seed node) is less than 1 year, the problematic node is allowed to join the cluster.
      • Hurray!
      • We advise you to choose an epoch/date towards the end of the 1-year window, but still within it; the later the better, because a new 1-year window starts from the date you choose, and the problem is mitigated for that long. Yes, this problem occurs on long-running clusters, which means you need to do rolling restarts once in a while to extend the 1-year window. Reminds me of visa extensions. (A tiny helper for checking a candidate date is sketched just after this list.)
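Before touching anything, it helps to double-check the arithmetic. Below is a tiny, purely illustrative Go helper (not part of any Scylla tooling) that takes the local generation printed in the error message plus a candidate reset epoch, and prints the window you must land in and the new window you would get:

package main

import (
	"fmt"
	"time"
)

const oneYear = 365 * 24 * time.Hour

func main() {
	localGen := int64(1526993447)  // local generation stored by the UN seed (from the error message)
	candidate := int64(1554030000) // epoch we plan to reset gossip_generation / system time to

	windowEnd := time.Unix(localGen, 0).UTC().Add(oneYear)
	c := time.Unix(candidate, 0).UTC()

	fmt.Println("window closes at:", windowEnd)                   // ~May 22, 2019
	fmt.Println("candidate date:", c, "ok:", c.Before(windowEnd)) // March 31, 2019 -> true
	fmt.Println("new window then lasts until:", c.Add(oneYear))   // ~March 2020
}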

Steps:

  1. If the problematic node is 10.3.7.7 and the error is reported on, say, 10.3.7.77 (a UN node), make sure the seed for 10.3.7.7 is 10.3.7.77. That way we guarantee it is talking to this node and we don't have to search the cluster to find out who it is talking to. If the seed of the 7.7 node is different from the node reporting the error, look at the error message printed by that seed node to decide which epoch to reset to. In our case, since I saw the error on 7.77, I changed the seed of 7.7 to the 7.77 node.
  2. Start the problematic node.
  3. The seed node should start printing the error. Capture the error message for our node and make a note of the local generation number so we can choose a date to reset to. In our case the message was:

     Jul 04 01:54:17 host-10.3.7.77 scylla[30263]: [shard 0] gossip – received an invalid gossip generation for peer 10.3.7.7; local generation = 1526993447, received generation = 1562158865

  4. cqlsh to the problematic node 10.3.7.7 and update the generation number to an epoch within 1 year of 1526993447, BUT choose an epoch towards the end of that 1-year window, like 1554030000 (March 31, 2019), rather than, say, July/October 2018, so that your new 1-year window is longer.
  5. On the problematic node, run:

     update system.local set gossip_generation = 1554030000 where key='local';

  6. On the problematic node, run:

     nodetool flush

  7. Stop the problematic node.
  8. Edit the config file and change the CQL port (native_transport_port) from 9042 to 9043 so that clients cannot connect and insert data. Any data inserted during this phase would get records with a timestamp of March 2019, which is not correct, i.e. this prevents data corruption. This is a precaution.
  9. Change the system time, i.e. "date -s '31 MAR 2019 11:03:25'".
  10. Verify the system time has changed by running the date command.
  11. Start the problematic node and tail the logs of the UN seed node; the error should go away.
  12. Wait for some time (a few minutes is enough) for gossip to occur, and verify that the problematic node is now UN.
  13. You can tail the logs of the UN seed node and check whether you still get the error.
    1. If you do see the error again, repeat the steps from the start; you might have missed something.
  14. Once the node is declared UN:
    1. Shut down the node.
    2. Change the CQL port (native_transport_port) back from 9043 to 9042 in the config file.
    3. RESET the system time on the box.
    4. Verify the system time is back to normal.
  15. Start up the node once you have changed back the time and the port.
  16. You will notice that the node is still UN 🙂

Confessions:

  1. Yes, we did this exercise in production. The node was considered dead anyway, so the risk was minimal: screwing up a dead node even more would not make a difference, and if the procedure failed we would only sacrifice that 1 node and be left with the full cluster reboot as the only remaining option.
  2. We scanned the Scylla code base (master branch) for usages of system time in cluster communication and found only 2 places, which gave us confidence that changing the system time would work. Also, by changing the CQL port to 9043, we eliminated any contamination of existing data by clients.

Morals of the story:

  1. This happened on Scylla version 2.1, and as of today, July 4, 2019, the master branch of Scylla still has the same code logic, hence this can occur on version 3 and above too.
  2. Every few months, it is better to do rolling restarts of the nodes so that they send a new generation number for gossip and the 1-year window is extended.
  3. If you have a cluster that has been running for more than 1 year, any node that restarts can be affected by this error; the more node restarts that happen, the more the epidemic spreads.
  4. This can work for Cassandra if the code logic is the same, which I think it is.

References:

https://github.com/scylladb/scylla/blob/134b59a425da71f6dfa86332322cc63d47a88cd7/gms/gossiper.cc

https://github.com/scylladb/scylla/blob/94d2194c771dfc2fb260b00f7f525b8089092b41/service/storage_service.cc

https://github.com/scylladb/scylla/blob/077c639e428a643cd4f0ffe8e90874c80b1dc669/db/system_keyspace.cc

https://stackoverflow.com/questions/40676589/cassandra-received-an-invalid-gossip-generation-for-peer

Thanks

to my teammate Rahul Anand for helping debug the issue and providing key breakthroughs during the firefighting.

If you find this useful, please comment either on stackoverflow or here. Thanks.

Update:

The folks at Scylla have now fixed this.

Refer to: https://github.com/scylladb/scylla/pull/5195

golang: Parse a large JSON file in a streaming manner

Hi all,

I recently tried to parse a large JSON file in golang and used

'ioutil.ReadAll(file)' followed by 'json.Unmarshal(data, &msgs)'

This led to a huge spike in memory usage, as the entire file is read into memory in one go.

I wanted to parse the file in a streaming fashion, i.e. read the file bit by bit and decode the JSON data, just like the Jackson streaming API.

I did some digging and found

func NewDecoder(r io.Reader) *Decoder {...} in the encoding/json package

The above uses a reader to buffer and pick up data incrementally, instead of loading the whole file.

Here is some sample code to do it:

https://github.com/jaihind213/golang/blob/master/main.go
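If you just want the gist without clicking through, here is a minimal sketch of the streaming approach. It assumes the file holds one big top-level JSON array of message objects; the Msg fields and the file name are made up for illustration and are not necessarily what the linked repo uses.

package main

import (
	"encoding/json"
	"fmt"
	"log"
	"os"
)

// Msg is a stand-in for whatever one element of the JSON array looks like.
type Msg struct {
	ID   int    `json:"id"`
	Body string `json:"body"`
}

func main() {
	f, err := os.Open("big.json") // illustrative file name
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	dec := json.NewDecoder(f)

	// Consume the opening '[' of the top-level array.
	if _, err := dec.Token(); err != nil {
		log.Fatal(err)
	}

	// Decode one element at a time; only one Msg is held in memory at once.
	count := 0
	for dec.More() {
		var m Msg
		if err := dec.Decode(&m); err != nil {
			log.Fatal(err)
		}
		count++ // process m here instead of appending everything to a slice
	}

	// Consume the closing ']'.
	if _, err := dec.Token(); err != nil {
		log.Fatal(err)
	}

	fmt.Println("decoded", count, "messages")
}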

The differences are in terms of memory footprint:

We print memory usage as we parse a ~900MB file generated by generate_data.sh.
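The heap figures in the screenshots below were printed with a helper along these lines (a sketch, not the exact code from the repo; HeapInuse from runtime.MemStats is the number being compared):

package main

import (
	"fmt"
	"runtime"
)

// printMemUsage prints the heap-in-use figure; call it every N decoded records.
func printMemUsage(label string) {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Printf("%s: HeapInuse = %d MiB\n", label, m.HeapInuse/1024/1024)
}

func main() {
	printMemUsage("before parse")
	// ... parse the file here, calling printMemUsage periodically ...
	printMemUsage("after parse")
}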

Streaming fashion: heap-in-use stays low throughout the parse (screenshots of the printed memory stats).

Simple parsing by reading everything into memory: heap-in-use is massive (screenshots of the printed memory stats).

Conclusion:

The streaming parser has a very low Heap_inUse when compared to the massive usage of the readAllInMem approach.

Druid Indexing Fails – ISE – Cannot find instance of indexer to talk to!

Recently, my Druid Overlord and MiddleManager stopped indexing data.

I tried the usual multiple restarts, but they did not help.

The logs of the middle manager revealed the following:

2018-10-02 06:01:37 [main] WARN  RemoteTaskActionClient:102 - Exception submitting action for task[index_nom_14days_2018-10-02T06:01:29.633Z]

java.io.IOException: Failed to locate service uri

        at io.druid.indexing.common.actions.RemoteTaskActionClient.submit(RemoteTaskActionClient.java:94) [druid-indexing-service-0.10.0.jar:0.10.0]

        at io.druid.indexing.common.task.IndexTask.isReady(IndexTask.java:150) [druid-indexing-service-0.10.0.jar:0.10.0]

        at io.druid.indexing.worker.executor.ExecutorLifecycle.start(ExecutorLifecycle.java:169) [druid-indexing-service-0.10.0.jar:0.10.0]

        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_181]

        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_181]

        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_181]

        at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_181]

        at io.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler.start(Lifecycle.java:364) [java-util-0.10.0.jar:0.10.0]

        at io.druid.java.util.common.lifecycle.Lifecycle.start(Lifecycle.java:263) [java-util-0.10.0.jar:0.10.0]

        at io.druid.guice.LifecycleModule$2.start(LifecycleModule.java:156) [druid-api-0.10.0.jar:0.10.0]

        at io.druid.cli.GuiceRunnable.initLifecycle(GuiceRunnable.java:102) [druid-services-0.10.0.jar:0.10.0]

        at io.druid.cli.CliPeon.run(CliPeon.java:277) [druid-services-0.10.0.jar:0.10.0]

        at io.druid.cli.Main.main(Main.java:108) [druid-services-0.10.0.jar:0.10.0]

Caused by: io.druid.java.util.common.ISE: Cannot find instance of indexer to talk to!

        at io.druid.indexing.common.actions.RemoteTaskActionClient.getServiceInstance(RemoteTaskActionClient.java:168) ~[druid-indexing-service-0.10.0.jar:0.10.0]

        at io.druid.indexing.common.actions.RemoteTaskActionClient.submit(RemoteTaskActionClient.java:89) ~[druid-indexing-service-0.10.0.jar:0.10.0]

The main cause was

Caused by: io.druid.java.util.common.ISE: Cannot find instance of indexer to talk to!

Apparently the middle manager and its peons could not find the Overlord service needed for indexing to start, and hence all jobs were failing.

Root Cause:

I logged into ZooKeeper and found the following two entries for the Coordinator/Overlord to be empty, even though both services were up and running.

Since they were empty, the middle manager could not find an indexer to talk to, as the middle manager looks these entries up in ZooKeeper.

[zk: localhost:2181(CONNECTED) 1] ls /druid/discovery/druid:coordinator
[]
[zk: localhost:2181(CONNECTED) 3] ls /druid/discovery/druid:overlord
[]

Fix:

Assuming that the Coordinator runs on Host_A and the Overlord runs on Host_B, do the following:

(1) Generate 2 random IDs: 1 for the Coordinator and 1 for the Overlord.

Druid uses the following Java code to generate such IDs:

System.out.println(java.util.UUID.randomUUID().toString());

but you can use any random string.

Let's assume you generate

'45c3697f-414f-48f2-b1bc-b9ab5a0ebbd4' for the Coordinator

and

'f1babb39-26c1-42cb-ac4a-0eb21ae5d77d' for the Overlord.

(2) Log into the ZooKeeper CLI and run the following commands:

[zk: localhost:2181(CONNECTED) 3] create /druid/discovery/druid:coordinator/45c3697f-414f-48f2-b1bc-b9ab5a0ebbd4 {"name":"druid:coordinator","id":"45c3697f-414f-48f2-b1bc-b9ab5a0ebbd4","address":"HOST_A_IP","port":8081,"sslPort":null,"payload":null,"registrationTimeUTC":1538459656828,"serviceType":"DYNAMIC","uriSpec":null}

[zk: localhost:2181(CONNECTED) 3] create /druid/discovery/druid:overlord/f1babb39-26c1-42cb-ac4a-0eb21ae5d77d {"name":"druid:overlord","id":"f1babb39-26c1-42cb-ac4a-0eb21ae5d77d","address":"HOST_B_IP","port":8090,"sslPort":null,"payload":null,"registrationTimeUTC":1538459763281,"serviceType":"DYNAMIC","uriSpec":null}

You will notice a registrationTimeUTC field; put in any time you see fit.

This should create the necessary ZooKeeper nodes for the middle manager to look up.

Now restart the Coordinator, then the Overlord, then the MiddleManager.

Submit an indexing task and it should work.
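If you prefer to verify the discovery entries programmatically rather than with the ZooKeeper CLI, something like the following Go sketch works; the github.com/go-zookeeper/zk client and the connection details here are assumptions on my part, and the zkCli 'ls' commands shown above are equally fine:

package main

import (
	"fmt"
	"time"

	"github.com/go-zookeeper/zk"
)

func main() {
	conn, _, err := zk.Connect([]string{"localhost:2181"}, 5*time.Second)
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	for _, path := range []string{
		"/druid/discovery/druid:coordinator",
		"/druid/discovery/druid:overlord",
	} {
		children, _, err := conn.Children(path)
		if err != nil {
			fmt.Printf("%s: error: %v\n", path, err)
			continue
		}
		// An empty list here is exactly the broken state we diagnosed above.
		fmt.Printf("%s -> %v\n", path, children)
	}
}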

Additional tips:

In $Druid_Home/conf/druid/middleManager/runtime.properties you may have

druid.indexer.runner.javaOpts=-cp conf/druid/_common:conf/druid/middleManager/:conf/druid/middleManager:lib/* -server -Ddruid.selectors.indexing.serviceName=druid/overlord -Xmx7g -XX:+UseG1GC -XX:MaxGCPauseMillis=100 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Dhadoop.mapreduce.job.user.classpath.first=true

and $Druid_Home/conf/druid/middleManager/jvm.config you may have

-Ddruid.selectors.indexing.serviceName=druid/overlord

Hope this helps you.

java.lang.RuntimeException: native snappy library not available – druid ingestion

I was ingesting some data into Druid using local batch mode and the ingestion failed with:

2018-07-10T04:57:51,501 INFO [Thread-31] org.apache.hadoop.mapred.LocalJobRunner – map task executor complete.
2018-07-10T04:57:51,506 WARN [Thread-31] org.apache.hadoop.mapred.LocalJobRunner – job_local265070473_0001
java.lang.Exception: java.lang.RuntimeException: native snappy library not available: this version of libhadoop was built without snappy support.
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:489) ~[hadoop-mapreduce-client-common-2.6.0-cdh5.9.0.jar:?]
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:549) [hadoop-mapreduce-client-common-2.6.0-cdh5.9.0.jar:?]
Caused by: java.lang.RuntimeException: native snappy library not available: this version of libhadoop was built without snappy support.

To avoid the snappy compression, I disabled it with the following tuningConfig in the ingestion spec:

"tuningConfig": {

                        "type": "hadoop",

                        "partitionsSpec": {

                                "type": "hashed",

                                "targetPartitionSize": 5000000

                        },

                        "jobProperties": {

                                 "mapreduce.framework.name" : "local",

                                 "mapreduce.map.output.compress":"false"

                                "mapreduce.job.classloader": "true",

                                "mapreduce.job.classloader.system.classes": "-javax.validation.,java.,javax.,org.apache.commons.logging.,org.apache.log4j.,org.apache.hadoop."
                                  .........
                                  .........
                        }

                }

 

Pushing druid metrics to Graphite Grafana with full metrics whitelist

This blog post is about configuring Druid to send metrics to Graphite/Grafana.

Kindly refer to the steps at

https://github.com/jaihind213/druid_stuff/tree/master/druid_metrics_graphite

The Grafana dashboard templates are in the above GitHub link.

———————————————————————————————————————

After you have followed the steps above, restart Druid and it should start sending metrics to Graphite and Grafana.

You should see log statements like:

INFO [main] io.druid.emitter.graphite.GraphiteEmitter - Starting Graphite Emitter.

INFO [GraphiteEmitter-0] io.druid.emitter.graphite.GraphiteEmitter - trying to connect to graphite server

INFO [main] io.druid.initialization.Initialization - Loading extension [graphite-emitter] for class


Tip: you can set the frequency at which metrics are reported and also the types of monitors you wish to have, BASED ON THE RELEASE version you use.

Important tip: start off with the JVM monitor first, then slowly add others. The sys monitor needs the sigar jar; some monitors might not work immediately.



 

UnknownArchetype: The desired archetype does not exist

I recently encountered this error while generating a project in IntelliJ.

[ERROR] BUILD FAILURE
[INFO] ————————————————————————
[INFO] : org.apache.maven.archetype.exception.UnknownArchetype: The desired archetype does not exist (org.apache.maven.archetypes:maven-archetype-webapp:RELEASE)
The desired archetype does not exist (org.apache.maven.archetypes:maven-archetype-webapp:RELEASE)

Solution: the archetypes are downloaded from the Maven Central repo, so make sure the repo is specified in your .m2/settings.xml.

In settings.xml, under profiles -> profile -> repositories, add the Maven Central repo URL:

<profiles>
  <profile>
    <repositories>
      <repository>
        <releases><enabled>true</enabled></releases>
        <snapshots><enabled>true</enabled></snapshots>
        <id>central</id>
        <url>https://repo1.maven.org/maven2</url>
      </repository>
      ……
    </repositories>
  </profile>
</profiles>

Tip: check that the URL 'https://repo1.maven.org/maven2' works first, else use a repo which does.