onyx 2016-12-03 | Slack Archive

michaeldrogalis00:12:19

If someone makes some tooling to be able to tell when a Spec has grown with forward-compatibility, it would be possible to make a deployer for Onyx that does rolling-restarts when there is new code, and does a full-stop restart when there is an incompatibility.

jholmberg02:12:13

Evening gents. I'm working with @camechis on ingesting some data. A job that's been working well for us has been acting a little funny on this latest run. I noticed that the docker container for the job was killed by mesos' OOM killer. The error log contained this:

Exception in thread "main" clojure.lang.ExceptionInfo: empty String {:original-exception :java.lang.NumberFormatException}
    at onyx.compression.nippy$fn__10918$fn__10919.invoke(nippy.clj:33)
    at taoensso.nippy$read_custom_BANG_.invokeStatic(nippy.clj:1052)
    at taoensso.nippy$read_custom_BANG_.invoke(nippy.clj:1049)
    at taoensso.nippy$thaw_from_in_BANG_.invokeStatic(nippy.clj:1218)
    at taoensso.nippy$thaw_from_in_BANG_.invoke(nippy.clj:1098)
    at taoensso.nippy$thaw$thaw_data__10761.invoke(nippy.clj:1330)
    at taoensso.nippy$thaw.invokeStatic(nippy.clj:1356)
    at taoensso.nippy$thaw.invoke(nippy.clj:1279)
    at onyx.compression.nippy$zookeeper_decompress.invokeStatic(nippy.clj:56)
    at onyx.compression.nippy$zookeeper_decompress.invoke(nippy.clj:55)
    at onyx.log.zookeeper$fn__16795$fn__16797$fn__16798.invoke(zookeeper.clj:564)
    at onyx.log.zookeeper$clean_up_broken_connections.invokeStatic(zookeeper.clj:77)
    at onyx.log.zookeeper$clean_up_broken_connections.invoke(zookeeper.clj:75)
    at onyx.log.zookeeper$fn__16795$fn__16797.invoke(zookeeper.clj:561)
    at onyx.monitoring.measurements$measure_latency.invokeStatic(measurements.clj:11)
    at onyx.monitoring.measurements$measure_latency.invoke(measurements.clj:5)
    at onyx.log.zookeeper$fn__16795.invokeStatic(zookeeper.clj:560)
    at onyx.log.zookeeper$fn__16795.doInvoke(zookeeper.clj:558)
    at clojure.lang.RestFn.invoke(RestFn.java:445)
    at clojure.lang.MultiFn.invoke(MultiFn.java:238)
    at onyx.test_helper$feedback_exception_BANG_.invokeStatic(test_helper.clj:24)
    at onyx.test_helper$feedback_exception_BANG_.invoke(test_helper.clj:13)
    at onyx.test_helper$feedback_exception_BANG_.invokeStatic(test_helper.clj:19)
    at onyx.test_helper$feedback_exception_BANG_.invoke(test_helper.clj:13)
    at centrifuge.core$_main.invokeStatic(core.clj:91)
    at centrifuge.core$_main.doInvoke(core.clj:67)
    at clojure.lang.RestFn.applyTo(RestFn.java:137)
    at centrifuge.core.main(Unknown Source)

Trying to find something that might give me a clue what's going on. Any thoughts?

gardnervickers02:12:57

How are you setting the JVM heap limit?

Travis02:12:35

@gardnervickers we are basically using the formula from the onyx template

Travis02:12:04

In the start peers script

jholmberg02:12:24

Peers seem to run ok for 30-45 min. The job that started by the peers script, it seems to die in about 2-3 min.

gardnervickers02:12:19

How much memory are you allocating the containers?

Travis02:12:39

@jholmberg what's the latest settings on the container?

jholmberg02:12:10

on the peers 5G,

jholmberg02:12:17

on the job 1G

gardnervickers02:12:45

Oh the job launcher is being killed too?

jholmberg02:12:01

job launcher is the one getting killed every 2-3 min

Travis02:12:21

Oh it's the Job not the peer?

jholmberg02:12:03

Peers do end up dying eventually (every 40 min or so). But its the job that's getting killed more by far

Travis02:12:27

By oom?

jholmberg02:12:31

yeah

jholmberg02:12:42

I bumped up the mem to 2G but didn't see any change

gardnervickers02:12:42

That's just watching the Zookeeper log, I wonder why it's being killed. I'm not familiar with the Mesos OOM killer.

gardnervickers02:12:57

Can you set the heap size manually?

Travis02:12:50

Oom killer is basically killing the docker container if it exceeds it's allocated mem

jholmberg02:12:29

The settings get passed in from marathon. I bet I could set the heap for jvm explicitly from marathon in the peer script

gardnervickers02:12:25

Yea set the heap sizes for the JVM's to half the container allocation to start with.

jholmberg02:12:06

Ok, thanks @gardnervickers!

gardnervickers02:12:43

Of course! If you see this again I would be interested in taking a look at your kernel logs wherever the OOM killer process writes too. That should indicate how much memory the JVM is actually consuming in the container.

Travis02:12:50

Cool, I will probably be working on this on Monday as well

gardnervickers02:12:57

For anyone running any JVM apps in containers, this is a great read. http://matthewkwilliams.com/index.php/2016/03/17/docker-cgroups-memory-constraints-and-java-cautionary-tale/

Travis02:12:35

Cool, think I have seen and read this but will read again

michaeldrogalis06:12:01

@jasonbell Watching your Skills Matter talk. Nice 🙂

jasonbell08:12:50

@michaeldrogalis Pleasure, it certainly got some interest. Looking forward to doing some more. 🙂

lucasbradstreet08:12:56

https://skillsmatter.com/skillscasts/9153-introducing-streaming-processing-with-kafka-and-the-onyx-platform?utm_content=buffer800ae&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer for those who are interested

jasonbell08:12:53

Hopefully it got the concepts across, that’s what I was aiming for. If I had more time then I could have got into the peer management a little more but 25 minutes went very fast.

jasonbell08:12:58

Proposal is in for Strata London with a more refined version

lucasbradstreet08:12:59

Great. Yeah, with 25 min talks you need to pick your battles

jasonbell09:12:05

indeed

lucasbradstreet10:12:10

With respect to the semver discussion above, I think semver is handy (we could do it better), but my feeling has always been that it alone is not enough. Giving good testing mechanisms, along with validation (of which spec is part) are both very important

mariusz_jachimowicz12:12:35

@lucasbradstreet @michaeldrogalis Could you merge my PR's for dashboard ? https://github.com/onyx-platform/onyx-dashboard/pull/75 https://github.com/onyx-platform/onyx-dashboard/pull/74

lucasbradstreet12:12:55

Thank you. Will review soon 🙂

lucasbradstreet12:12:17

Changing base on both to master

akiel14:12:03

There are some confusing dependencies in Onyx right now. For example Netty 3.7.0.Final from Zookeeper and Netty 3.9.4.Final from Bookkeeper. I usually run lein with-profile production deps :tree or even use :pedantic? :abort in my projects.

michaeldrogalis20:12:52

@mariusz_jachimowicz Thanks, sorry the delay in merging. Things have been a little hectic getting ready to move @lucasbradstreet to my neighborhood. 🙂

michaeldrogalis20:12:45

@akiel Is Leiningen picking Netty 3.7.0.Final? Is it giving you trouble, or just giving us a heads up about the conflict?

akiel20:12:51

I think it picks Netty 3.7.0.Final. I have no problems with it. I’m just concerned about the conflict itself. I like to have everything as reproduceable as possible.

michaeldrogalis20:12:05

@akiel I don’t think there’s much we can do about that short of excluding 3.9.4.Final in Onyx core itself.

michaeldrogalis20:12:53

We’re going to drop BookKeeper in favor of another pluggable storage interface for 0.10. Iterative state snapshots will come back in the future, but probably not with BookKeeper, if that’s any consolation.

michaeldrogalis20:12:24

0.10 will snapshot entire values onto S3/HDFS/ZooKeeper, or whatever else we/you implement behind the interface. All functionality will be preserved.

michaeldrogalis20:12:22

I know we’ve seen saying it forever, but we’ll have a preview release out in ~1 week. It’s been hard to judge since it’s ended up being a rewrite of all the critical parts of core.

akiel20:12:13

I appreciate pluggable storage. That would solve another issue that currently onyx has many dependencies that a production peer doesn’t need.

akiel20:12:17

Regarding Netty: you should decide which version Onyx likes to use und just add that dependency directly into Onyx itself.

michaeldrogalis20:12:08

We’re close enough to the tech preview that I probably won’t patch it, it’ll get ripped out shortly.

michaeldrogalis20:12:55

@jasonbell Yeah, 25 minutes goes by in the blink of an eye

akiel21:12:03

@michaeldrogalis No problem. Onyx works for me very well so far.

2016-12-03

Channels