#onyx
2016-02-11
robert-stuttaford08:02:26

@lucasbradstreet: looks like swap space is helping

robert-stuttaford08:02:59

it’s not using a lot of swap, just 500mb, but that appears to give it enough room to GC etc

lucasbradstreet08:02:21

Cool, looks like -Xmx was too high too, otherwise it’d have gotten into the multiple GB swap range

lucasbradstreet08:02:17

We’ve tracked down the startup issue to a condition where the whole cluster was killed, and started up again, and the coordination replica was in a stuck state

lucasbradstreet08:02:29

I’m working on reproducing it with jepsen tests, but we have a fix incoming

robert-stuttaford08:02:46

i’m glad we were able to identify a bug, and i’m proud it’s of the sort that you need jepsen to reproduce simple_smile

lucasbradstreet08:02:53

I think starting up 40 peers on one host made the issue easier to hit

lucasbradstreet08:02:57

because they joined slower

lucasbradstreet08:02:22

Well, it’s kinda easy to reproduce it elsewhere, but I want to make sure we’ve fixed all the cases like it simple_smile

lucasbradstreet08:02:54

I’m curious about your service shutdown issue

lucasbradstreet08:02:58

Ah, it just took a bit to upload

robert-stuttaford08:02:14

grepping local code for ‘aeron-ec2-user’ ...

lucasbradstreet08:02:39

Oh, that’s just automatic aeron stuff

robert-stuttaford08:02:51

# Clean temporary files before start
rm -rf /dev/shm/aeron-ec2-user

robert-stuttaford08:02:56

we do this before upstart starts the jar

lucasbradstreet08:02:16

hmm, is there any chance the service hasn’t finished shutting down before you start a new jar?

robert-stuttaford08:02:27

oh you mean the component shut down

robert-stuttaford08:02:32

yes, quite possible

lucasbradstreet08:02:38

ok, I think that’s what is going on there

robert-stuttaford08:02:40

it hangs, i kill -9 it, start a new jar

lucasbradstreet08:02:49

Oh, that should be fine

robert-stuttaford08:02:11

CI deployments will sleep 3 before starting a new one

robert-stuttaford08:02:24

i guess we need to make sure that upstart waits for the previous shutdown properly

lucasbradstreet08:02:26

What I thought might be happening is: you trigger an upstart shutdown, it’s still shutting down, but the startup script rm -rfs before the upstart service has finished

robert-stuttaford08:02:48

yes, verrry possible

lucasbradstreet08:02:58

I don’t think it’s very serious, but obviously not really desirable

robert-stuttaford08:02:15

so, how do you script restarts?

robert-stuttaford08:02:36

i feel like this should be a solved problem, and we’re just not using the proper solution

lucasbradstreet08:02:02

Oh, you should just remove the rm -rf from your startup script

lucasbradstreet08:02:43

I was going to say the embedded driver now does it by default

lucasbradstreet08:02:48

So you don’t need the script

lucasbradstreet08:02:08

But typically I’d want to make sure the service that runs onyx is shut down before starting a new one

lucasbradstreet08:02:27

i.e. by waiting for the service to shut down, perhaps with a timeout that kills it after some time period

robert-stuttaford08:02:19

ok. upstart should be waiting for the process to exit before it starts another one, surely?

robert-stuttaford08:02:30

even if it takes a couple minutes

lucasbradstreet08:02:35

I’d have thought so

lucasbradstreet08:02:42

It depends on how you’ve implemented the shutdown I guess
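For context on what “how you’ve implemented the shutdown” means in upstart terms: graceful-shutdown patience is governed by the `kill timeout` stanza, which controls how long upstart waits after SIGTERM before escalating to SIGKILL. A hypothetical sketch — the file name, timeout value, and jar path are placeholders, not the actual config discussed here:

```
# /etc/init/onyx-app.conf (hypothetical sketch)
description "Onyx peer service"

# Give the JVM up to 120s to shut down gracefully after SIGTERM
# before upstart escalates to SIGKILL.
kill timeout 120

respawn

script
  exec java -jar /opt/app/app.jar
end script
```

With a generous `kill timeout`, `restart` should not start the new process until the old one has actually exited, which avoids the rm -rf racing a still-running driver.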

lucasbradstreet08:02:52

I’m not really familiar with what you’ve used to make it a service

robert-stuttaford08:02:00

i can show you our upstart script, if you know that system

robert-stuttaford08:02:15

what do you use to herd jvm processes on your servers?

lucasbradstreet08:02:31

We typically put things in docker containers

lucasbradstreet08:02:53

Tho you do need a monitor inside the container

robert-stuttaford09:02:17

cool. will work on it. thanks Lucas

lowl4tency14:02:09

lucasbradstreet: do you have an article about onyx inside docker?

lowl4tency14:02:16

interesting to take a look

lucasbradstreet14:02:36

I'm afraid not. We're still fairly immature with respect to devops docs

lucasbradstreet15:02:02

The best I have for you is what's in the meetup example which is part of onyx-template

lucasbradstreet15:02:40

@robert-stuttaford: I’ve tracked down your issue

lowl4tency15:02:27

lucasbradstreet: simple_smile okay, just interested whether there are any issues with java/onyx and docker simple_smile

lucasbradstreet15:02:50

main one is to do with shm-size, because aeron needs more /dev/shm space

lucasbradstreet15:02:03

We currently hack around it in the template by running in --privileged mode

lucasbradstreet15:02:07

and running a shell script

lucasbradstreet15:02:21

But they just released Docker 1.10, which allows --shm-size to be supplied
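With Docker 1.10+, the --privileged workaround can presumably be replaced by passing the flag directly. A sketch — the image name and size are placeholders, and the right size depends on how many peers and Aeron log buffers you run:

```shell
# Give the container a larger /dev/shm so the Aeron media driver
# has room to map its log buffers (512m is a placeholder; tune it).
docker run --shm-size=512m my-onyx-app:latest
```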

michaeldrogalis15:02:22

@robert-stuttaford: Would recommend kill -2 in general simple_smile Not saying what you found isn’t a bug, but kill -9 is worth avoiding 😛

lucasbradstreet15:02:25

@michaeldrogalis: they initially kill (not -9), but it gets blocked on peer shutdown

robert-stuttaford15:02:43

yup. upstart does the graceful thing and then i lose patience and murderrrr it

lucasbradstreet15:02:21

(let [c (chan 1)
      _ (>!! c :a)
      _ (close! c)
      r (future (>!! c :b))]
  (Thread/sleep 500)
  (deref r))
=> false

lucasbradstreet15:02:37

(let [c (chan 1)
      _ (>!! c :a)
      r (future (>!! c :b))]
  (Thread/sleep 500)
  (close! c)
  (deref r)) ;; blocks forever

lucasbradstreet15:02:57

IMO, they should behave the same

robert-stuttaford15:02:03

wow. how did you arrive there so quickly?

robert-stuttaford15:02:09

i assume this is in the read-log plugin?

lucasbradstreet15:02:15

Yep, read-log as suspected

robert-stuttaford15:02:19

or is it in the short-circuit

lucasbradstreet15:02:30

Basically the read buffer is filled up

lucasbradstreet15:02:39

And we close the channel, but the final block never exits

robert-stuttaford15:02:48

well done for finding it

lucasbradstreet15:02:51

or rather, blocking put

robert-stuttaford15:02:04

looking at how simple your examples are, i guess the remedy is easy to do?

lucasbradstreet15:02:19

well, there’s not really any good remedy assuming you’re not using a non-blocking channel

lucasbradstreet15:02:33

my current best is to use the new offer! fn

lucasbradstreet15:02:56

and basically do:

lucasbradstreet15:02:01

(defn >!!-safe [ch v]
  (loop []
    (if-not (offer! ch v)
      (do (Thread/sleep 1)
          (if (closed? ch)
            false
            (recur)))
      true)))

lucasbradstreet15:02:23

unfortunately closed? is in the core.async impl ns

robert-stuttaford15:02:00

is this a conceptual issue with core.async?

lucasbradstreet15:02:25

I think the blocking >!! on the channel should return false once the channel is closed

lucasbradstreet15:02:41

Otherwise you need to drain the channel for it to ever return
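To illustrate the draining point, here is a small REPL sketch in the same style as the examples above (using clojure.core.async as before): once a consumer takes the buffered value, the blocked put can complete instead of hanging forever.

```clojure
(require '[clojure.core.async :refer [chan >!! <!! close!]])

(let [c (chan 1)
      _ (>!! c :a)             ;; fill the one-slot buffer
      r (future (>!! c :b))]   ;; this put blocks: buffer is full
  (close! c)
  (<!! c)                      ;; drain :a; the pending put can now complete
  (deref r))                   ;; returns instead of blocking forever
```

This is why “drain the channel” is the only escape hatch short of offer! plus the non-public closed?.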

lucasbradstreet15:02:52

Clojure devs may disagree

lucasbradstreet15:02:16

I’m ok with it as long as they make closed? public eventually

robert-stuttaford15:02:30

ok - so what do we do now? live with using impl?

robert-stuttaford15:02:41

or switch to raw threads simple_smile

lucasbradstreet15:02:42

Yeah, tbh, I’m already using closed? in core

michaeldrogalis15:02:56

I think it’s hard to argue that’s not a bug in core.async. What’s your thought about it?

lucasbradstreet15:02:03

I completely agree

lucasbradstreet15:02:29

The fact that it acts differently if you do it after it’s closed indicates it’s a bug

michaeldrogalis15:02:32

We literally cannot unblock a thread without doing something silly. That's terrible for a thread-heavy app

michaeldrogalis15:02:07

Even if it's accepted as a bug, it will probably take a while to get patched and released.

lucasbradstreet15:02:21

Like the only other solution besides using offer! (and non-public closed?) is to drain the channel

lucasbradstreet15:02:42

Yeah, I’m sticking with offer! and closed? and going to report it to the clojure mailing list to get some discussion

michaeldrogalis15:02:15

Okay. Nice job running that one down. ^^

robert-stuttaford15:02:42

how helpful was the peer log i shared for this, Lucas?

robert-stuttaford15:02:55

i’m curious where the clue was

lucasbradstreet15:02:03

It was good because it helped me correlate which peers were up against what was in the onyx.log

lucasbradstreet15:02:08

onyx.log was the main thing

lucasbradstreet15:02:18

But the peer log helped prove it

robert-stuttaford15:02:48

i’m glad it helped simple_smile

lucasbradstreet15:02:13

The peer log is pretty cool. It allows us to step through the cluster state over time

lucasbradstreet15:02:29

Something I need to add is an ability to supply the timezone of the server that generated the log

lucasbradstreet15:02:45

That way I can compare the time in your onyx.log vs the time in the cluster log

lucasbradstreet15:02:12

Some excellent advice that timbre doesn’t seem to take is to print the timezone with the date and the time

robert-stuttaford15:02:33

that seems like a rather obvious oversight

robert-stuttaford15:02:43

i’m sure peter would be happy to add it

lucasbradstreet15:02:45

Totally mandatory when you have lots of servers

lucasbradstreet15:02:19

To be honest with you, I had a feeling it had to do with waiting for the producer-ch to close, because I had to disable this functionality in onyx-bookkeeper to get successful runs in jepsen 😛

robert-stuttaford15:02:24

peter’s another south african clojurist simple_smile although living in your third of the world, Lucas

lucasbradstreet15:02:25

It was on my todo list I swear

lucasbradstreet15:02:02

Ah yes, he’s in Thailand or something like that?

lucasbradstreet15:02:34

Ah yeah, it was vietnam

lucasbradstreet15:02:40

technomancy is in Thailand

lucasbradstreet15:02:53

I can release a new onyx-datomic for you. I’d prefer to do it against 0.8.9

lucasbradstreet15:02:15

You will need to ditch :onyx/restart-pred-fn if you upgrade to 0.8.9 but it’s an easy fix

robert-stuttaford15:02:34

that means using the new lifecycle error handling right?

robert-stuttaford15:02:44

be happy to, we need to do this anyway

michaeldrogalis15:02:40

@robert-stuttaford: Yep, just toss in :lifecycle/handle-exception (constantly :restart) if you always want to restart on a task.

michaeldrogalis15:02:02

Can also choose :kill or :defer. You get a lot of info to decide what to do in the supplied function.
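As a sketch of such a deciding function — the argument list here is an assumption based on the Onyx lifecycle docs of that era (check your version), and the exception class chosen is purely illustrative:

```clojure
;; Hypothetical handler: restart the task on connectivity problems,
;; kill the job for anything else.
(defn handle-exception [event lifecycle lifecycle-name e]
  (if (instance? java.net.ConnectException e)
    :restart
    :kill))

(def error-calls
  {:lifecycle/handle-exception handle-exception})
```

The event map and the thrown exception give the function enough context to make per-error decisions rather than a blanket policy.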

robert-stuttaford15:02:06

and i’d extend that lifecycle to all my tasks

lucasbradstreet15:02:19

Yeah, just use :all for your lifecycle

lucasbradstreet15:02:35

That’s a lifecycle map

michaeldrogalis15:02:42

I really like this feature. Twas @lucasbradstreet's idea simple_smile

lucasbradstreet15:02:43

so (def restart-lifecycles {:lifecycle/handle-exception (constantly :restart)})

lucasbradstreet15:02:18

{:lifecycle/task :all :lifecycle/calls :path-to-restart/restart-lifecycles}

lucasbradstreet15:02:25

Yeah, it’s much better than restart-pred-fn

robert-stuttaford15:02:34

nice, i’ll drop that in when i grab the new release

lucasbradstreet15:02:11

(you don’t really see the benefit here)

lucasbradstreet15:02:31

It allows us to do stuff like handle certain exceptions in the plugins that we know are safe

robert-stuttaford15:02:32

so this sorts the issue with clean stops. then i will have eliminated all the known impediments to multi-node bar one: that only one node seems to process events. i still have to do the separate java entry point to start the jobs, as well

robert-stuttaford15:02:13

so i’ll do that, get this error handling change done, and see where we are at. i’ll make a list of everything i’ve done to get it working, and if i’m still stuck, then i think we’ll be ready for our first Hour simple_smile if it’s all working properly, then i’ll roll straight on to load testing the snot out of it, and work on a list of things i’d like your help with.

robert-stuttaford15:02:30

either way, that’ll take us to first thing next week quite nicely

robert-stuttaford15:02:10

having error handling in lifecycles makes sense, just like having a web middleware for exception logging

lucasbradstreet15:02:40

It took a while to come to the right solution

robert-stuttaford15:02:51

i love that transducers basically do the same thing

robert-stuttaford15:02:00

why do i love it? because i can actually understand it!

michaeldrogalis15:02:31

:defer is pretty handy too. Basically lets a downstream lifecycle figure out what to do if your lifecycle cant know

robert-stuttaford15:02:26

and kill? does that take the job or the peer out?

michaeldrogalis15:02:38

Kills the whole job.

robert-stuttaford15:02:35

ok folks - dadding time. thanks again, Lucas. i look forward to sharing progress tomorrow simple_smile

robert-stuttaford20:02:02

@michaeldrogalis: would the new lifecycle error catch work on onyx-datomic plugin tasks too? we just had an uncaught transactor-unavailable take the system down 😞

lucasbradstreet20:02:47

Ah, that must have happened at the wrong point

lucasbradstreet20:02:58

the new lifecycle handling has much better coverage (it should cover anything)

lucasbradstreet20:02:00

(Jepsen found some cases we missed in the first pass of the new lifecycles, and we’re pretty sure we’ve got them all now :D)

michaeldrogalis20:02:54

Another cool thing is that the deciding function has access to the event map and the exception that was thrown, so you can recover + take some action, like send a pager notification