2016-02-11
@lucasbradstreet: looks like swap space is helping
it’s not using a lot of swap, just 500mb, but that appears to give it enough room to GC etc
Cool, looks like -Xmx was too high too, otherwise it’d have gotten into the multiple GB swap range
We’ve tracked the startup issue down to a condition where the whole cluster was killed and started up again, and the coordination replica was left in a stuck state
I’m working on reproducing it with jepsen tests, but we have a fix incoming
oh, wonderful!
i’m glad we were able to identify a bug, and i’m proud it’s of the sort that you need jepsen to reproduce
I think starting up 40 peers on one host made the issue easier to hit
because they joined slower
Well, it’s kinda easy to reproduce it elsewhere, but I want to make sure we’ve fixed all the cases like it
I’m curious about your service shutdown issue
This one:
Ah, it just took a bit to upload
grepping local code for ‘aeron-ec2-user’ ...
Oh, that’s just automatic aeron stuff
# Clean temporary files before start
rm -rf /dev/shm/aeron-ec2-user
we do this before upstart starts the jar
hmm, is there any chance the service hasn’t finished shutting down before you start a new jar?
oh you mean the component shut down
yes, quite possible
ok, I think that’s what is going on there
it hangs, i kill -9 it, start a new jar
Oh, that should be fine
CI deployments will sleep 3
before starting a new one
i guess we need to make sure that upstart waits for the previous shutdown properly
What I thought might be happening is: you trigger an upstart shutdown, it’s still shutting down, and the startup script rm -rfs before the upstart service has finished
yes, verrry possible
I don’t think it’s very serious, but obviously not really desirable
so, how do you script restarts?
i feel like this should be a solved problem, and we’re just not using the proper solution
Oh, you should just remove the rm -rf from your startup script
Well...
I was going to say the embedded driver now does it by default
So you don’t need the script
But typically I’d want to make sure the service that runs onyx is shut down before starting a new one
via a graceful service shutdown, perhaps with a timeout that kills it after some time period
ok. upstart should be waiting for the process to exit before it starts another one, surely?
even if it takes a couple minutes
I’d have thought so
It depends on how you’ve implemented the shutdown I guess
I’m not really familiar with what you’ve used to make it a service
i can show you our upstart script, if you know that system
what do you use to herd jvm processes on your servers?
We typically put things in docker containers
Tho you do need a monitor inside the container
cool. will work on it. thanks Lucas
lucasbradstreet: do you have an article about onyx inside docker?
would be interesting to take a look
I'm afraid not. We're still fairly immature with respect to devops docs
The best I have for you is what's in the meetup example which is part of onyx-template
@robert-stuttaford: I’ve tracked down your issue
lucasbradstreet: okay, just interested whether there are any issues with java/onyx and docker
main one is to do with shm-size, because aeron needs more /dev/shm space
We currently hack around it in the template by running in --privileged mode
and running a shell script
But docker 1.10 was just released, which allows --shm-size to be supplied
@robert-stuttaford: Would recommend kill -2 in general. Not saying what you found isn’t a bug, but kill -9 is worth avoiding 😛
@michaeldrogalis: they initially kill (not -9), but it gets blocked on peer shutdown
yup. upstart does the graceful thing and then i lose patience and murderrrr it
@robert-stuttaford: this is the issue
(let [c (chan 1)
      _ (>!! c :a)
      _ (close! c)
      r (future (>!! c :b))]
  (Thread/sleep 500)
  (deref r))
;; => false
(above is good)
(let [c (chan 1)
      _ (>!! c :a)
      r (future (>!! c :b))]
  (Thread/sleep 500)
  (close! c)
  ;; blocks forever
  (deref r))
not good
IMO, they should behave the same
wow. how did you arrive there so quickly?
i assume this is in the read-log plugin?
Yep, read-log as suspected
or is it in the short-circuit
Basically the read buffer is filled up
And we close the channel, but the final block never exits
s/block/put/
well done for finding it
or rather, blocking put
looking at how simple your examples are, i guess the remedy is easy to do?
well, there’s not really any good remedy assuming you’re not switching to non-blocking puts
my current best is to use the new offer! fn
and basically do:
(defn >!!-safe [ch v]
  (loop []
    (if-not (offer! ch v)
      (do (Thread/sleep 1)
          (if (closed? ch)   ;; closed? comes from the core.async impl ns
            false
            (recur)))
      true)))
unfortunately closed? is in the core.async impl ns
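For reference, a minimal sketch of pulling closed? out of the impl namespace, as mentioned above; that namespace is not part of core.async's public API, so treat this as a workaround:

(require '[clojure.core.async :refer [chan close!]]
         '[clojure.core.async.impl.protocols :as impl])

;; closed? is a protocol method on the channel itself
(defn chan-closed? [ch]
  (impl/closed? ch))

(def c (chan 1))
(chan-closed? c)   ;; => false
(close! c)
(chan-closed? c)   ;; => true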
is this a conceptual issue with core.async?
I believe so
I think the blocking >!! on the channel should return false once the channel is closed
Otherwise you need to drain the channel for it to ever return
Clojure devs may disagree
I’m ok with it as long as they make closed? public eventually
ok - so what do we do now? live with using impl?
or switch to raw threads
Yeah, tbh, I’m already using closed? in core
I think it’s hard to argue that’s not a bug in core.async. What’s your thought about it?
I completely agree
The fact that it acts differently if you do it after it’s closed indicates it’s a bug
We literally cannot unblock a thread without doing something silly. That's terrible for a thread-heavy app
Even if it's accepted as a bug, it will probably take a while to get patched and released.
Like the only other solution besides using offer! (and non-public closed?) is to drain the channel
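A minimal sketch of that drain workaround, mirroring the blocked example above: taking from the closed channel until it is empty is what finally lets the parked put return.

(require '[clojure.core.async :refer [chan >!! <!! close!]])

(let [c (chan 1)
      _ (>!! c :a)                ;; fills the single-slot buffer
      r (future (>!! c :b))]      ;; this put parks: the buffer is full
  (Thread/sleep 500)
  (close! c)
  ;; draining frees buffer space, so the parked put can complete
  (while (some? (<!! c)))
  (deref r))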
Yeah, I’m sticking with offer! and closed? and going to report it to the clojure mailing list to get some discussion
Okay. Nice job running that one down. ^^
yeah, well done!
how helpful was the peer log i shared for this, Lucas?
i’m curious where the clue was
It was good because it helped me correlate which peers were up against what was in the onyx.log
onyx.log was the main thing
But the peer log helped prove it
i’m glad it helped
The peer log is pretty cool. It allows us to step through the cluster state over time
Something I need to add is an ability to supply the timezone of the server that generated the log
That way I can compare the time in your onyx.log vs the time in the cluster log
Some excellent advice that timbre doesn’t seem to take is to print the timezone with the date and the time
that seems like a rather obvious oversight
i’m sure peter would be happy to add it
Totally mandatory when you have lots of servers
Hugely agreed
To be honest with you, I had a feeling it had to do with waiting for the producer-ch to close, because I had to disable this functionality in onyx-bookkeeper to get successful runs in jepsen 😛
peter’s another south african clojurist although living in your third of the world, Lucas
It was on my todo list I swear
Ah yes, he’s in Thailand or something like that?
or vietnam
Ah yeah, it was vietnam
technomancy is in Thailand
I can release a new onyx-datomic for you. I’d prefer to do it against 0.8.9
You will need to ditch :onyx/restart-pred-fn if you upgrade to 0.8.9 but it’s an easy fix
that means using the new lifecycle error handling right?
be happy to, we need to do this anyway
@robert-stuttaford: Yep, just toss in :lifecycle/handle-exception (constantly :restart) if you always want to restart on a task.
Can also choose :kill or :defer. You get a lot of info to decide what to do in the supplied function.
and i’d extend that lifecycle to all my tasks
Correct, yep
Yeah, just use :all
for your lifecycle
That’s a lifecycle map
I really like this feature. Twas @lucasbradstreet's idea
so (def restart-lifecycles {:lifecycle/handle-exception (constantly :restart)})
{:lifecycle/task :all :lifecycle/calls :path-to-restart/restart-lifecycles}
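Put together, a minimal sketch of that wiring, assuming path-to-restart is just the namespace that holds the calls map (the :lifecycle/calls keyword names that var):

(ns path-to-restart)   ;; hypothetical namespace matching the keyword above

(def restart-lifecycles
  {:lifecycle/handle-exception (constantly :restart)})

;; goes into the job's :lifecycles vector; :all applies it to every task
(def lifecycles
  [{:lifecycle/task :all
    :lifecycle/calls :path-to-restart/restart-lifecycles}])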
Yeah, it’s much better than restart-pred-fn
nice, i’ll drop that in when i grab the new release
(you don’t really see the benefit here)
It allows us to do stuff like handle certain exceptions in the plugins that we know are safe
so this sorts the issue with clean stops. then i will have eliminated all the known impediments to multi-node bar one: that only one node seems to process events. i still have to do the separate java entry point to start the jobs, as well
so i’ll do that, get this error handling change done, and see where we are at. i’ll make a list of everything i’ve done to get it working, and if i’m still stuck, then i think we’ll be ready for our first Hour. if it’s all working properly, then i’ll roll straight on to load testing the snot out of it, and work on a list of things i’d like your help with.
either way, that’ll take us to first thing next week quite nicely
:thumbsup:
having error handling in lifecycles makes sense, just like having a web middleware for exception logging
Agreed
It took a while to come to the right solution
i love that transducers basically do the same thing
why do i love it? because i can actually understand it!
:defer is pretty handy too. Basically lets a downstream lifecycle figure out what to do if your lifecycle can’t know
and kill? does that take the job or the peer out?
Kills the whole job.
ok folks - dadding time. thanks again, Lucas. i look forward to sharing progress tomorrow
:thumbsup:
@robert-stuttaford: Cool, talk soon!
@michaeldrogalis: would the new lifecycle error catch work on onyx-datomic plugin tasks too? we just had an uncaught transactor-unavailable take the system down 😞
Yes, it does
Ah, that must have happened at the wrong point
the new lifecycles have much better coverage (should cover anything)
(Jepsen found some cases we missed in the first pass of the new lifecycles, and we’re pretty sure we’ve got them all now :D)
Another cool thing is that the deciding function has access to the event map and the exception that was thrown, so you can recover + take some action, like send a pager notification
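A hedged sketch of such a deciding function; the exact argument list beyond the event map and the exception is an assumption here, and notify-pager! is a hypothetical stand-in for whatever alerting you use:

(defn notify-pager!
  "Hypothetical alerting hook; replace with a real pager/notification call."
  [msg]
  (println "PAGE:" msg))

(defn handle-exception
  ;; assumed arity: event map, lifecycle map, lifecycle name, thrown exception
  [event lifecycle lifecycle-name e]
  (notify-pager! (.getMessage e))
  (if (instance? java.net.ConnectException e)
    :restart   ;; transient connectivity problem: restart the task
    :kill))    ;; anything else: kill the whole job

(def error-calls
  {:lifecycle/handle-exception handle-exception})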