2016-08-19
i've got an interesting new problem with onyx on mesos - mesos seems to be occasionally killing onyx processes and restarting them, but the restarted processes are failing to initialise correctly because the aeron listen port is already bound https://www.refheap.com/a49365f7551a143ea7763322e
my guess is that this is happening because mesos is helpfully trying to start the new onyx process before terminating the old one
Is it killing the process or the whole container?
i think the whole container
i think i must have a problem where the badly initialised onyx peer hangs or something... or i would have expected mesos to have another go
That's a tough one. Not sure what to suggest. That should be a common problem. I'd imagine other servers would hit these kinds of issues
Yeah, maybe you need it to exit after it fails to bind
is there any kind of onyx status port i could hook a mesos http health-check into ?
There will be once we get this merged https://github.com/onyx-platform/onyx/pull/637
oh, cool 🙂
will that allow me to query the state of each peer process with mesos health-checks, or is it per-cluster health-checks ?
It'll allow you to query each onyx node for where it's up to in the log, view its knowledge of the cluster replica, etc
So you can look at your nodes, see if any are lagging, etc
It might not translate perfectly to a mesos health check, but it's probably a pretty good measure of it
hmm... i note that i've also got mesos set to do a rolling restart - that may be counter-productive in this case and actually causing the problem
hey lucasbradstreet 🙂 got a moment to look at something for me?
go for it
busy load-testing our whole stack
that blip around 13:30 is when i started a new 5 minute web-based load test. you can see a lot of stuff going into datomic, and the pending segment count goes straight to 10k and stays there. the cluster did start to process segs - i can see a bunch logged - but then it just ... stopped. no exceptions in our or onyx logs. zookeeper metrics look fine. where should i dig?
the tx log blip (green hill in graph row two column one) peaks at 13k
i haven't restarted anything yet. i want to know what i can do to understand what's happening or not happening before i take it down
Hmm. Where to start. No real signs of failure other than it going to 10K and stopping
yep. i'm sshing into all 3 instances now to make super sure of zero errors
wait - got a bunch of these 16-Aug-19 11:29:29 uat-HSClusterStack-M9AGN9V5VAUL-i-be6bcd2b WARN [onyx.messaging.aeron.publication-manager] - Writing nil publication manager, likely due to timeout on creation.
Ah. Those almost always tend to be memory pressure/GC issues
ok, so perhaps Aeron is insufficiently provisioned?
Generally on the Onyx peer JVM
What’s probably happening is that you’re reading a lot of the log in, and the whole chain of tasks are processing more than they usually would
and then you hit memory pressure
confirmed those errors on all 3 nodes
ok. so max-pending-segs of 10k is probably too high? that seems oddly low for a fleet of 3 x c4.xlarges
10K might be OK for the actual input segments, but it will multiply throughout the system
since each segment will produce more segments
(this is just a guess though)
10K as a max-pending is totally fine, but it depends on what you’re doing
that's quite possible
I regularly use 100,000/1M on some benchmark tests, no problem
our client app batches events and they get transacted together
and the vast majority of this workload is such transactions
so it could be ballooning to 10 or 20x that
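(For reference, a minimal sketch of where that knob lives: max-pending is set on the input task's catalog entry. Names here are illustrative, using the core.async plugin rather than the actual job being discussed.)
```clojure
;; Illustrative input catalog entry; task and plugin names are hypothetical.
;; :onyx/max-pending caps how many input segments may be outstanding
;; (read but not yet fully acked) at once -- the 10k under discussion.
{:onyx/name :read-events
 :onyx/plugin :onyx.plugin.core-async/input
 :onyx/type :input
 :onyx/medium :core.async
 :onyx/max-pending 10000
 :onyx/batch-size 20
 :onyx/doc "Reads segments from a core.async channel"}
```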
@lucasbradstreet: two of the three instances recorded a jfr. i have one open. what am i looking for?
click the memory tab
Go to GC pauses tab
then have a look at how long your pauses are, and at what time
so i'm trying to correlate events on this graph with those in the metrics, right?
argh the jfrs i copied weren't the full dump
Ah, I was going to say, those all look reasonable
i've restarted the cluster with pending at 2k now
argh. by: java.lang.IllegalStateException: Missing file for cnc: /dev/shm/aeron-ec2-user/cnc
coulda sworn we'd handled this
Yeah I thought you guys had that get deleted on startup.
we stopped doing that, as i thought it became unnecessary?
it's complaining that it's not there, not that it is
ah right. Hmm, that’s weird though
i got it running again. just manually stopped everything and started it again
maybe some weird timing issue with aeron media driver & peer starting up
sorry to interrupt. Feel free to get back to this question at your nearest convenience. I'm trying to understand if it is advisable or possible to have a task's window perform an aggregation which is then sent immediately to a downstream task. e.g.
workflow [[:a :c] [:b :c]]
windows
[{:window/id :sum-segs-a :window/task :a :window/aggregation :sum}
 {:window/id :sum-segs-b :window/task :b :window/aggregation :sum}]
where i'm looking for c's input segments to be something like...
{:start-time 1 :end-time 2 :a-sum 10}
{:start-time 1 :end-time 2 :b-sum 20}
In most onyx examples i see, the aggregations are at the end of the workflow. They're usually emitting to a storage location.
However in the use case that i'm thinking of now, it's not currently necessary to store those values (a-sum and b-sum), and i would assume there would be a performance increase in sending them right on to the next task that needs them.
In brainstorming this, assuming it's possible, the two concerns i had were:
* how to push the output of an aggregation downstream to another task. Right now i assume i can use the event map in the trigger sync function
* What would the function of task-a be? In most examples i see with aggregations the task is the identity function, which i believe is done just as a placeholder since the values don't necessarily flow anywhere. I'm worried that if i use the identity function it will be confusing, as that's not actually what's sent to the next task.
Pushing outputs downstream to another task will be supported in the next major version
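(A rough sketch of the trigger/sync approach mentioned above, for task a only; names are hypothetical and the keys follow the 0.9-era windowing docs, so verify against your version.)
```clojure
;; Hypothetical window + trigger for task :a; task :b would mirror it.
(def windows
  [{:window/id :sum-segs-a
    :window/task :a
    :window/type :fixed
    :window/aggregation :onyx.windowing.aggregation/sum
    :window/sum-key :value
    :window/window-key :event-time
    :window/range [1 :minute]}])

(def triggers
  [{:trigger/window-id :sum-segs-a
    :trigger/refinement :onyx.refinements/accumulating
    :trigger/on :onyx.triggers/timer
    :trigger/period [30 :seconds]
    :trigger/sync ::push-sum!}])

(defn push-sum!
  "Runs when the trigger fires; state is the window's current sum.
   Until downstream emission is supported, this is the place to hand the
   value off (queue, storage, etc.) to whatever feeds task :c."
  [event window trigger {:keys [lower-bound upper-bound]} state]
  (prn {:start-time lower-bound :end-time upper-bound :a-sum state}))
```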
@lucasbradstreet sweet, any idea when we can expect to see that land?
@lucasbradstreet does it make sense to move aeron off to a single big machine for all onyx instances to talk to, with its own dedicated cores and ram?
@drewverlee: hopefully we'll have a preview out in the next month
@robert-stuttaford: it's kinda like a user land TCP stack, so it needs to be running on every machine
are there any recommended JVM opts for the aeron media driver process... min/max heap, gc etc ?
(i can't see anything in the aeron docs)
I generally just run with -server. @gardnervickers would have a good idea about what Xmx you should use since he did some experiments which you can find in the onyx-template Aeron scripts
The media driver memory usage should be steady because almost all the memory is off heap in /dev/shm
@mccraigmccraig: the only thing I would worry about there is making sure to limit your heap to the mem available to the container, as the JVM will incorrectly assume it has all the memory available that the host has. So if you have 8gb of mem and give 1gb to your container, both JVMs will assume they have 8gb available and take a default heap size of 2gb, which will occasionally result in OOM or any number of weird errors.
The template script has some logic to automatically calculate the "true" available mem and return a fraction of that set as -Xmx
hmm... it may be time for me to upgrade / re-create my onyx project from template
It shouldn't be too hard to copy to your launch scripts
My feeling is that you could get away with a 500MB Xmx but I haven't tested it
Oh yea I think I've ran containers with less than that for both the media driver and peer
Good to know
What's the simplest way to 'debounce' incoming segments? Like only process at most 1 segment, per time period t ?
fixed windows, @greywolve http://www.onyxplatform.org/docs/user-guide/latest/windowing.html
@mccraigmccraig: I was catching up on an earlier conversation. I think this might help: if you're using DCOS, what we do is pass in the ephemeral port $PORT0, so the peers won't collide and should be able to spin back up
aha @camechis - that is a great idea 🙂
ah - but the aeron port needs to be the same for all peer processes doesn't it ?
so each peer process can have a different aeron port ? ok, i'll give that a spin...
export BIND_PORT="$PORT0"
ADDR=$(ifconfig eth0 | grep "inet addr:" | cut -d : -f 2 | cut -d " " -f 1)
export BIND_ADDR="$HOST"
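(Presumably the peers then pick those up when building the peer-config, something along these lines — key names are from the 0.9 peer-config/onyx-template, so double-check against the cheat sheet.)
```clojure
;; Sketch: point each peer's Aeron messaging at the Mesos-assigned
;; address and port so colocated peers don't contend for the same port.
(def peer-config
  {:onyx/id (System/getenv "ONYX_ID")   ; tenancy id, assumed to come from the environment
   :onyx.messaging/impl :aeron
   :onyx.messaging/bind-addr (System/getenv "BIND_ADDR")
   :onyx.messaging/peer-port (Integer/parseInt (System/getenv "BIND_PORT"))})
```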
the compaction-transition method is getting the onyx.core/window-state key, but it should be onyx.core/windows-state, right?
And I don't see how this method is called
+1 for multiple aggregation flows. e.g. perhaps flow conditions that lead to separate aggs, then a desire to join them back based on some group by
@aaelony This should be easier to implement with the changes we're putting in the next release. Can you please create an issue with a suggestion for how you would imagine it working?
We know it's in demand as it is needed to reduce overhead
Also just simplify things
hey @lucasbradstreet: I created an issue with some thoughts on this, https://github.com/onyx-platform/onyx/issues/639
hi, just wanted to say congrats on the big news from you guys!!! really exciting and looking forward to seeing Onyx grow!!!
Thanks! Tons to do, but we have a lot more room to grow now.
Will be hiring in the fall. We need a few months more to let the dust settle.
My implementation of window filter based on LMDB seems to work correctly, I am writing tests now
Thanks @mariusz_jachimowicz. Will give it a look over soon!
question... should all functions check for and handle the :done case? or is that overkill?
e.g. in a workflow of [A B] [B C] [C D] where B and C reference functions, both functions underlying B and C need to handle :done, correct?
@aaelony The sentinel value is extracted at the input task, no functions ever see this value. So you never need to handle it in a function.
Functions will only ever be invoked with segment values.
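(So an ordinary task function can just assume a segment map — a trivial sketch with a made-up key:)
```clojure
;; Task functions only ever receive segment maps; the :done sentinel is
;; consumed at the input task, so no special-casing is needed here.
(defn enrich [segment]
  (assoc segment :processed-at (System/currentTimeMillis)))
```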
okay thanks, that makes sense. I was seeing an error because the :done wasn't a map and I was trying to extract the value for a key earlier, but now I can't reproduce it... (I guess that is good)
what's the normal way of repl-interaction during dev with the newer onyx-template ? i'd gotten used to the dev-path (reset)
and dev-job in my old onyx-template... but that all seems to be gone ?
It's on the cards to add those back in, but we prefer to go with with-test-env to run integration type tests on our workflows
@lucasbradstreet: Still having a hard time with performance and we suspect it might have something to do with the size of our segments. Not sure how easy it would be to convey but what would be considered a large segment especially with the windowing involved ?
that seems kinda likely, I’m not sure what the optimal size is, but I would absolutely try to reduce the size of what gets changelogged. Much better to be storing a few hundred bytes of what you need vs a few hundred kilobytes. The problem is that you’re probably trying to collect all your segments and write it out in one go
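(e.g. a hypothetical upstream task that strips segments down to only the keys the windowed task needs, so the changelog stores a few fields rather than the whole segment — key names are made up:)
```clojure
;; Sketch: slim each segment before it reaches the windowed task.
(defn slim-segment [segment]
  (select-keys segment [:id :event-time :value]))
```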
yeah, we have tried to boil it down to just what we need but I don’t think we can shrink it anymore
FlightRecorder would be able to tell you for sure where the hotspot is.
right now we are processing roughly 1.5 million of our segs per hour which feel pretty slow
Yep, quite slow indeed.
I have some ideas for how to tune it a bit
but flight recorder would be a good start
ok cool, any suggestions would be welcome. One thing we want to do is get it off our piece of a cluster and on to AWS. Also want to give it dedicated bookie servers so we can put the ledger and journal on separate disks, which sounds like that would help some
you could try tuning this http://www.onyxplatform.org/docs/cheat-sheet/latest/#peer-config/:onyx.bookkeeper/write-batch-backoff
if you increase it, it’ll batch together more segments when it writes them to bookkeeper
did you say you’re using a batch size of 1?
right. You’re losing a lot of the amortisation of costs that you’ll get from higher batch sizes
I’d try increasing it some more
and also try that write-batch-backoff. Actually, I forgot how it worked. I think it won’t help to increase it
since it’s a backoff. It used to be a timeout
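(The knobs in question as they'd sit in the peer-config; the backoff key is the one from the cheat sheet link above, the batch-size key name is from memory, and the values are arbitrary.)
```clojure
;; Sketch of the BookKeeper write batching discussed here.
{:onyx.bookkeeper/write-batch-size 40       ; state entries batched per write
 :onyx.bookkeeper/write-batch-backoff 50}   ; ms to wait for a batch to fill before flushing
```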
we did finally get retries down to 0. had to set max-pending to around 2500 but we might be able to push that a little more
ok yep, these things can hurt you with throughput too. How many peers on the windowing task?
k. Might help to scale those out a bit more, since it’s probably the bottleneck
one thing is, with our metrics it always seems to be only receiving data from one of the hosts for that task
Anyway, try having a play with some of those things independently, and also measure where all your CPU is going with flight recorder
for the windowing task?
or for all?
for some I can definitely see more hosts working on a task, but for that one I only seem to get one line
worth investigation
Sleep time for me
I am no expert in that so it could be something wrong but we based it off of the benchmark project
LMDB window filter + tests ready, but I have some warnings about reflections
please review the code and I will squash the commits then
@michaeldrogalis: what is the flight recorder?
@camechis An Oracle tool