#onyx
2016-08-19
mccraigmccraig10:08:41

i've got an interesting new problem with onyx on mesos - mesos seems to be occasionally killing onyx processes and restarting them, but the restarted processes are failing to initialise correctly because the aeron listen port is already bound https://www.refheap.com/a49365f7551a143ea7763322e

mccraigmccraig10:08:11

my guess is that this is happening because mesos is helpfully trying to start the new onyx process before terminating the old one

lucasbradstreet10:08:06

Is it killing the process or the whole container?

mccraigmccraig10:08:38

i think the whole container

mccraigmccraig10:08:26

i think i must have a problem where the badly initialised onyx peer hangs or something... or i would have expected mesos to have another go

lucasbradstreet10:08:30

That's a tough one. Not sure what to suggest. That should be a common problem. I'd imagine other servers would hit these kinds of issues

lucasbradstreet10:08:56

Yeah, maybe you need it to exit after it fails to bind

mccraigmccraig10:08:29

is there any kind of onyx status port i could hook a mesos http health-check into ?

mccraigmccraig10:08:06

oh, cool 🙂

mccraigmccraig10:08:56

will that allow me to query the state of each peer process with mesos health-checks, or is it per-cluster health-checks ?

lucasbradstreet10:08:06

It'll allow you to query each onyx node to see where it's up to in the log, view its knowledge of the cluster replica, etc

lucasbradstreet10:08:18

So you can look at your nodes, see if any are lagging, etc

lucasbradstreet10:08:08

It might not translate perfectly to a mesos health check, but it's probably a pretty good measure of it
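
For reference, the status endpoint being discussed is switched on through the peer-config. A minimal sketch, assuming the onyx-peer-http-query key names (worth verifying against your Onyx version):

;; Enable the embedded HTTP query server on each peer so a Mesos HTTP
;; health check can hit it. Key names are assumptions, not confirmed above.
(def peer-config
  {:onyx.messaging/impl :aeron
   :onyx.query/server? true
   :onyx.query.server/ip "0.0.0.0"
   :onyx.query.server/port 8080})   ; point the Mesos health check at this port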

mccraigmccraig10:08:24

hmm... i note that i've also got mesos set to do a rolling restart - that may be counter-productive in this case and actually causing the problem

robert-stuttaford11:08:41

hey lucasbradstreet 🙂 got a moment to look at something for me?

robert-stuttaford11:08:21

busy load-testing our whole stack

robert-stuttaford11:08:51

that blip around 13:30 is when i started a new 5 minute web-based load test. you can see a lot of stuff going into datomic, and the pending segment count goes straight to 10k and stays there. the cluster did start to process segs - i can see a bunch logged - but then it just ... stopped. no exceptions in our or onyx logs. zookeeper metrics look fine. where should i dig?

robert-stuttaford11:08:50

the tx log blip (green hill in graph row two column one) peaks at 13k

robert-stuttaford11:08:36

i haven't restarted anything yet. i want to know what i can do to understand what's happening or not happening before i take it down

lucasbradstreet11:08:47

Hmm. Where to start. No real signs of failure other than it going to 10K and stopping

robert-stuttaford11:08:09

yep. i'm sshing into all 3 instances now to make super sure of zero errors

robert-stuttaford11:08:28

wait - got a bunch of these 16-Aug-19 11:29:29 uat-HSClusterStack-M9AGN9V5VAUL-i-be6bcd2b WARN [onyx.messaging.aeron.publication-manager] - Writing nil publication manager, likely due to timeout on creation.

lucasbradstreet11:08:00

Ah. Those almost always tend to be memory pressure/GC issues

robert-stuttaford11:08:23

ok, so perhaps Aeron is insufficiently provisioned?

lucasbradstreet11:08:21

Generally on the Onyx peer JVM

lucasbradstreet11:08:44

What’s probably happening is that you’re reading a lot of the log in, and the whole chain of tasks are processing more than they usually would

lucasbradstreet11:08:47

and then you hit memory pressure

robert-stuttaford11:08:59

confirmed those errors on all 3 nodes

robert-stuttaford11:08:49

ok. so max-pending-segs of 10k is probably too high? that seems oddly low for a fleet of 3 x c4.xlarges

lucasbradstreet12:08:50

10K might be OK for the actual input segments, but it will multiply throughout the system

lucasbradstreet12:08:03

since each segment will produce more segments

lucasbradstreet12:08:08

(this is just a guess though)

lucasbradstreet12:08:21

10K as a max-pending is totally fine, but it depends on what you’re doing
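
For orientation, max-pending is a property of the input catalog entry. A rough sketch with a hypothetical Datomic log reader task (plugin keys illustrative, not taken from the conversation):

;; :onyx/max-pending caps how many input segments can be in flight, unacked,
;; at once; downstream tasks can still fan each one out into more segments.
{:onyx/name :read-tx-log
 :onyx/plugin :onyx.plugin.datomic/read-log
 :onyx/type :input
 :onyx/medium :datomic
 :onyx/max-pending 10000
 :onyx/batch-size 20
 :onyx/doc "Hypothetical input task reading the Datomic tx log"}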

robert-stuttaford12:08:34

that's quite possible

lucasbradstreet12:08:35

I regularly use 100,000/1M on some benchmark tests, no problem

robert-stuttaford12:08:48

our client app batches events and they get transacted together

robert-stuttaford12:08:01

and the vast majority of this workload is such transactions

robert-stuttaford12:08:15

so it could be ballooning to 10 or 20x that

robert-stuttaford12:08:34

@lucasbradstreet: two of the three instances recorded a jfr. i have one open. what am i looking for?

lucasbradstreet12:08:40

click the memory tab

lucasbradstreet12:08:53

Go to GC pauses tab

lucasbradstreet12:08:26

then have a look at how long your pauses are, and at what time

robert-stuttaford12:08:11

so i'm trying to correlate events on this graph with those in the metrics, right?

robert-stuttaford12:08:43

argh the jfrs i copied weren't the full dump

lucasbradstreet12:08:11

Ah, I was going to say, those all look reasonable

robert-stuttaford12:08:32

i've restarted the cluster with pending at 2k now

robert-stuttaford12:08:27

argh. by: java.lang.IllegalStateException: Missing file for cnc: /dev/shm/aeron-ec2-user/cnc

robert-stuttaford12:08:34

coulda sworn we'd handled this

lucasbradstreet12:08:53

Yeah I thought you guys had that get deleted on startup.

robert-stuttaford12:08:12

we stopped doing that, as i thought it became unnecessary?

robert-stuttaford12:08:47

it's complaining that it's not there, not that it is

lucasbradstreet12:08:45

ah right. Hmm, that’s weird though

robert-stuttaford12:08:49

i got it running again. just manually stopped everything and started it again

lucasbradstreet12:08:10

maybe some weird timing issue with aeron media driver & peer starting up

Drew Verlee12:08:41

sorry to interrupt. Feel free to get back to this question at your nearest convenience. I’m trying to understand if it is advisable or possible to have a task’s window perform an aggregation which is then sent immediately to a downstream task. e.g.

workflow [[:a :c] [:b :c]]

windows
  [{:window/id :sum-segs :window/task :a :window/aggregation :sum}
   {:window/id :sum-segs :window/task :b :window/aggregation :sum}]
where i’m looking for c’s input segments to be something like...
{:start-time 1 :end-time 2 :a-sum 10}
{:start-time 1 :end-time 2 :b-sum 20}
In most onyx examples i see, the aggregations are at the end of the workflow, usually emitting to a storage location. However in the use case i'm thinking of now, it's not currently necessary to store those values (a-sum and b-sum), and i would assume there would be a performance increase in sending them right on to the next task that needs them. In brainstorming this, assuming it's possible, the two concerns i had were:
* how to push the output of an aggregation downstream to another task. Right now i assume i can use the event map in the trigger sync function
* what the function of task-a would be. In most examples i see with aggregations the task is the identity function, which i believe is just a placeholder since the values don't necessarily flow anywhere. I'm worried that if i use the identity function it will be confusing, as that's not actually what's sent to the next task.
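
For context, the "trigger sync function" mentioned above looks roughly like the sketch below in 0.9-era Onyx; the names here are hypothetical and the exact keys/arity should be checked against the trigger docs:

;; Hypothetical trigger entry: fire on every segment and call a sync fn.
{:trigger/window-id :sum-segs
 :trigger/refinement :accumulating
 :trigger/on :segment
 :trigger/threshold [1 :elements]
 :trigger/sync ::emit-sum!}

;; The sync fn receives the event map plus the window extent's state,
;; which is where the aggregated value would be picked up.
(defn emit-sum! [event window-id lower-bound upper-bound state]
  (println "sum for" window-id "=" state))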

lucasbradstreet12:08:22

Pushing outputs downstream to another task will be supported in the next major version

Drew Verlee12:08:13

@lucasbradstreet sweet, any idea when we can expect to see that land?

robert-stuttaford13:08:56

@lucasbradstreet does it make sense to move aeron off to a single big machine for all onyx instances to talk to, with its own dedicated cores and ram?

lucasbradstreet13:08:06

@drewverlee: hopefully we'll have a preview out in the next month

lucasbradstreet13:08:30

@robert-stuttaford: it's kinda like a user land TCP stack, so it needs to be running on every machine

mccraigmccraig13:08:08

are there any recommended JVM opts for the aeron media driver process... min/max heap, gc etc ?

mccraigmccraig13:08:21

(i can't see anything in the aeron docs)

lucasbradstreet13:08:46

I generally just run with -server. @gardnervickers would have a good idea about what Xmx you should use since he did some experiments which you can find in the onyx-template Aeron scripts

lucasbradstreet13:08:14

The media driver memory usage should be steady because almost all the memory is off heap in /dev/shm

gardnervickers13:08:37

@mccraigmccraig: the only thing I would worry about there is making sure to limit your heap to the mem available to the container, as the JVM will incorrectly assume it has all the memory available that the host has. So if you have 8gb of mem and give 1gb to your container, both JVMs will assume they have 8gb available and take a default heap size of 2gb, which will occasionally result in OOM or any number of weird errors.

gardnervickers13:08:16

The template script has some logic to automatically calculate the "true" available mem and return a fraction of that set as -Xmx
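
The idea behind that script, sketched in Clojure purely for illustration (the cgroup path and the 50% fraction are assumptions, not the template's exact logic):

;; Read the container's real memory limit from cgroups and derive an -Xmx
;; value from a fraction of it, rather than trusting the JVM's host-based default.
(defn container-xmx-mb [fraction]
  (let [limit-bytes (-> (slurp "/sys/fs/cgroup/memory/memory.limit_in_bytes")
                        .trim
                        Long/parseLong)]
    (long (/ (* limit-bytes fraction) 1024 1024))))

;; e.g. (str "-Xmx" (container-xmx-mb 0.5) "m")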

mccraigmccraig13:08:03

hmm... it may be time for me to upgrade / re-create my onyx project from template

gardnervickers13:08:23

It shouldn't be too hard to copy to your launch scripts

lucasbradstreet13:08:09

My feeling is that you could get away with a 500MB Xmx but I haven't tested it

gardnervickers13:08:51

Oh yea I think I've run containers with less than that for both the media driver and peer

greywolve14:08:43

What's the simplest way to 'debounce' incoming segments? Like only process at most 1 segment, per time period t ?

Travis14:08:41

@mccraigmccraig: I was catching up on an earlier conversation. I think this might help: if you're using DCOS, what we do is pass in the ephemeral port $PORT0, so the peers won't collide and should be able to spin back up

mccraigmccraig14:08:42

aha @camechis - that is a great idea 🙂

Travis14:08:11

we had to adjust the config so the aeron port is a little dynamic

Travis14:08:22

through an ENV var

mccraigmccraig14:08:06

ah - but the aeron port needs to be the same for all peer processes doesn't it ?

Travis14:08:24

as long as it gets advertised correctly in zookeeper you should be fine

Travis14:08:28

seems to be working

mccraigmccraig14:08:29

so each peer process can have a different aeron port ? ok, i'll give that a spin...

Travis14:08:19

we did a little something like this in our run_peer.sh script before running the peer

Travis14:08:23

export BIND_PORT="$PORT0"
ADDR=$(ifconfig eth0 | grep "inet addr:" | cut -d : -f 2 | cut -d " " -f 1)
export BIND_ADDR="$HOST"

Travis14:08:16

we ran in HOST mode. I had trouble getting the combo right in bridge
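
On the Onyx side, the peer-config can then pick those env vars up; a sketch only (messaging key names worth double-checking for your Onyx version):

;; Make the Aeron bind address/port dynamic so each restarted task gets a
;; fresh port instead of colliding with the old container's binding.
(def peer-config
  {:onyx.messaging/impl :aeron
   :onyx.messaging/bind-addr (System/getenv "BIND_ADDR")
   :onyx.messaging/peer-port (Integer/parseInt (System/getenv "BIND_PORT"))})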

mariusz_jachimowicz14:08:51

compaction-transition method is getting onyx.core/window-state key but it should be onyx.core/windows-state, right ?

mariusz_jachimowicz15:08:59

And I don't see how this method is called

aaelony15:08:50

+1 for multiple aggregation flows. e.g. perhaps flow conditions that lead to separate aggs, then a desire to join them back based on some group by

lucasbradstreet15:08:52

@aaelony This should be easier to implement with the changes we're putting in the next release. Can you please create an issue with a suggestion for how you would imagine it working?

aaelony15:08:14

ok, will do

lucasbradstreet15:08:31

We know it's in demand as it is needed to reduce overhead

lucasbradstreet15:08:40

Also just simplify things

aaelony15:08:18

you can imagine aggs leading to a dataset for use in ml

aaelony15:08:38

that would be really cool

aaelony16:08:07

hey @lucasbradstreet: I created an issue with some thoughts on this, https://github.com/onyx-platform/onyx/issues/639

thomas17:08:59

hi, just wanted to say congrats on the big news from you guys!!! really exciting and looking forward to seeing Onyx grow!!!

michaeldrogalis17:08:33

Thanks! Tons to do, but we have a lot more room to grow now.

michaeldrogalis17:08:01

Will be hiring in the fall. We need a few months more to let the dust settle.

mariusz_jachimowicz17:08:24

My implementation of window filter based on LMDB seems to work correctly, I am writing tests now

michaeldrogalis17:08:11

Thanks @mariusz_jachimowicz. Will give it a look over soon!

aaelony18:08:51

question... should all functions check for and handle the :done case ? or is that overkill?

aaelony18:08:25

e.g. in a workflow of [A B] [B C] [C D] where B and C reference functions, both functions underlying B and C need to handle :done, correct?

michaeldrogalis18:08:47

@aaelony The sentinel value is extracted at the input task, no functions ever see this value. So you never need to handle it in a function.

michaeldrogalis18:08:05

Functions will only ever be invoked with segment values.
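
In other words, a task function can be written as a plain map-in/map-out function with no sentinel handling; a tiny sketch:

;; The :done sentinel is consumed at the input task, so the functions backing
;; B and C only ever receive ordinary segment maps.
(defn enrich [segment]
  (assoc segment :seen? true))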

aaelony18:08:04

okay thanks, that makes sense. I was seeing an error because the :done wasn't a map and I was trying to extract the value for a key earlier, but now I can't reproduce it... (I guess that is good)

aaelony18:08:57

still a novice at this, but learning quick

mccraigmccraig18:08:58

what's the normal way of repl-interaction during dev with the newer onyx-template ? i'd gotten used to the dev-path (reset) and dev-job in my old onyx-template... but that all seems to be gone ?

lucasbradstreet19:08:36

It's on the cards to add those back in, but we prefer to go with with-test-env to run integration type tests on our workflows
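
A rough sketch of that style, assuming onyx.test-helper/with-test-env as generated by the template (argument order worth checking against the template's tests; env-config, peer-config and job stand in for your own):

;; Spins up an in-process env plus n virtual peers for the body, then tears
;; everything down — usable from the REPL for integration-style runs.
(require '[onyx.api]
         '[onyx.test-helper :refer [with-test-env]])

(with-test-env [test-env [3 env-config peer-config]]
  (onyx.api/submit-job peer-config job)
  ;; ... feed input, read output, assert ...
  )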

Travis19:08:51

@lucasbradstreet: Still having a hard time with performance and we suspect it might have something to do with the size of our segments. Not sure how easy it would be to convey but what would be considered a large segment especially with the windowing involved ?

lucasbradstreet19:08:31

that seems kinda likely, I’m not sure what the optimal size is, but I would absolutely try to reduce the size of what gets changelogged. Much better to be storing a few hundred bytes of what you need vs a few hundred kilobytes. The problem is that you’re probably trying to collect all your segments and write it out in one go
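
One cheap way to shrink what gets changelogged is to project segments down to only the keys the windowed task needs, in the function feeding it; a sketch with hypothetical key names:

;; Hypothetical upstream function: keep a few small fields per segment so the
;; window state stores hundreds of bytes rather than the whole original map.
(defn slim-down [segment]
  (select-keys segment [:event-id :user-id :amount :timestamp]))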

Travis19:08:11

yeah, we have tried to boil it down to just what we need but I don’t think we can shrink it anymore

Travis19:08:00

maybe we can come up with some kind of trick but not sure

michaeldrogalis19:08:58

FlightRecorder would be able to tell you for sure where the hotspot is.

Travis19:08:52

cool, might attempt to give that a try soon

Travis19:08:26

right now we are processing roughly 1.5 million of our segs per hour which feels pretty slow

michaeldrogalis20:08:04

Yep, quite slow indeed.

lucasbradstreet20:08:07

I have some ideas for how to tune it a bit

lucasbradstreet20:08:32

but flight recorder would be a good start

Travis20:08:48

ok cool, any suggestions would be welcome. One thing we want to do is get it off our piece of a cluster and on to AWS. Also want to give it dedicated bookie servers so we can put the ledger and journal on separate disks, which sounds like that would help some

lucasbradstreet20:08:17

if you increase it, it’ll batch together more segments when it writes them to bookkeeper

lucasbradstreet20:08:56

did you say you’re using a batch size of 1?

Travis20:08:14

yeah, we changed it to 2 and then to 3 but things started to go more south

lucasbradstreet20:08:51

right. You’re losing a lot of the amortisation of costs that you’ll get from higher batch sizes

lucasbradstreet20:08:00

I’d try increasing it some more

Travis20:08:19

ok worth a shot

lucasbradstreet20:08:00

and also try that write-batch-backoff. Actually, I forgot how it worked. I think it won’t help to increase it

lucasbradstreet20:08:12

since it’s a backoff. It used to be a timeout
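
For reference, those knobs live in the peer-config; a sketch with assumed key names (the exact names should be checked against the cheat sheet for your version):

;; Assumed peer-config keys for the BookKeeper state-log batching discussed above.
{:onyx.bookkeeper/write-batch-size 20      ; entries batched per BookKeeper write
 :onyx.bookkeeper/write-batch-backoff 50}  ; backoff (ms) while a batch fills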

Travis20:08:17

we did finally get retries down to 0. had to set MAX-pending to around 2500 but we might be able to push that a little more

lucasbradstreet20:08:07

ok yep, these things can hurt you with throughput too. How many peers on the windowing task?

Travis20:08:42

i think we have around 3

lucasbradstreet20:08:29

k. Might help to scale those out a bit more, since it’s probably the bottleneck
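
Peer counts per task are set on the catalog entry; a sketch with a hypothetical windowed task name:

;; Raising :onyx/min-peers / :onyx/max-peers (or pinning with :onyx/n-peers)
;; scales out the task that is likely the bottleneck.
{:onyx/name :window-rollup
 :onyx/fn :clojure.core/identity
 :onyx/type :function
 :onyx/min-peers 3
 :onyx/max-peers 6
 :onyx/batch-size 20}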

Travis20:08:28

one thing is with our metrics it always seems to be only receiving data from one of the hosts for that task

lucasbradstreet20:08:33

Anyway, try having a play with some of those things independently, and also measure where all your CPU is going with flight recorder

Travis20:08:35

like its only running on one

lucasbradstreet20:08:42

for the windowing task?

Travis20:08:18

on some tasks I can definitely see more hosts working, but for that one I only seem to get one line

Travis20:08:22

for throughput

Travis20:08:36

dashboard says its on 3

Travis20:08:00

and i have the min/max set to that but the grafana report makes me wonder

lucasbradstreet20:08:15

worth investigation

lucasbradstreet20:08:08

Sleep time for me

Travis20:08:14

I am no expert in that so it could be something wrong but we based it off of the benchmark project

Travis20:08:20

sleep well!

mariusz_jachimowicz20:08:24

LMDB window filter + tests ready, but I have some warnings about reflections

mariusz_jachimowicz20:08:03

please review the code, and then I will squash the commits

Travis21:08:43

@michaeldrogalis: what is the flight recorder?

Travis21:08:12

Ah gotcha, so a JVM thing