@lucasbradstreet: would the new task constraints feature allow us to safely continue using core.async inputs on a single instance?
to inject work into the cluster
really don’t want to have to run e.g. kafka just to support occasional from-the-side data correction tasks
It absolutely would
Just be careful that you don't get into a scenario where you can't recover because that node is down (I'm guessing you'd bounce it anyway)
I would consider just using an extra Datomic database that you can occasionally trash though. It seems operationally simpler.
It'll also give you a record of the manual adjustments
If you play with the backoff policy in a log reader for that database, it wouldn't be very expensive to maintain resources wise
interesting. you’re saying watch a trash db but have the side input do the work on the actual db?
Yeah, basically use the side db as an input source to inject the side correction segments
i like that idea!
i’m going to explore this
Since you can always trash the db you don't have to worry about overloading your main db and you get fault tolerance on your side tasks
and a system of record
you’re good at this!
Ha. Thanks
Compliments will get you everywhere with me 😛
using git commit hash based :onyx/id
was the final thing i needed to fix
the submitting node is processing events. the other node has started peers etc but has not submitted jobs. it’s not processing events. should it be live before the jobs are submitted?
it started after the jobs were submitted
It shouldn’t matter which node submits
Once the job is submitted, both nodes should start processing events
as long as they use the same onyx/id and the same catalog/workflow, right?
jobs do the catalog/workflow bit
same onyx/id yup
would number two do anything if number 1 had enough peers to start the work?
the first one’s onyx.log shows ‘enough peers are active’, the second one doesn't
depends on your scheduling
But almost definitely yes
:onyx.job-scheduler/balanced :onyx.task-scheduler/balanced
Assuming you’re not putting max-peers on every task
Both nodes should start the job
sorry, should start working on the job
ok. so that’s not happening right now. i didn’t start them up cleanly (i was messing with the env vars driving the onyx/id thing), so i’m going to do a full system cold boot and see what happens
Yeah, I suspect they’re on diff onyx/ids
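A minimal sketch of the point above: nodes only cooperate on a job when their peer configs carry the same `:onyx/id`. The helper name `build-peer-config`, the env var names `GIT_COMMIT` and `ZK_ADDRESS`, and the default ZooKeeper address are assumptions for illustration. As far as I know, Onyx 0.8 reads the job scheduler from `:onyx.peer/job-scheduler` in the peer config, while the task scheduler goes in the job map under `:task-scheduler`.

```clojure
(defn build-peer-config
  "Builds a peer config map from an environment map (hypothetical helper).
   Throws if the commit hash is missing, so a node cannot silently start
   under a different :onyx/id."
  [env]
  (let [commit (get env "GIT_COMMIT")] ; hypothetical env var name
    (when (empty? commit)
      (throw (ex-info "GIT_COMMIT is not set; refusing to guess an :onyx/id" {})))
    {:onyx/id                 commit
     :zookeeper/address       (get env "ZK_ADDRESS" "127.0.0.1:2181") ; assumed default
     :onyx.peer/job-scheduler :onyx.job-scheduler/balanced}))

;; Two nodes deployed from the same commit end up with the same :onyx/id.
(def node-a (build-peer-config {"GIT_COMMIT" "abc1234"}))
(def node-b (build-peer-config {"GIT_COMMIT" "abc1234"}))
```

In a real startup this would probably be called as `(build-peer-config (System/getenv))`, and logging the resulting `:onyx/id` on each node makes a mismatch easy to spot.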
what do you use to watch logs?
2 instances, 3 logs each (app, onyx, zk), makes for a lot of terminals
i’m so close i can taste it!!
I’m a little too used to the pain there
something like might be good
but then you have another service
cool. we use that for prod. we’ll get it set up for test as well
even if just to watch the background stuff (onyx/zk)
Ah yeah, it’s a no brainer then
@lucasbradstreet: given a peer object instance, what can i print to identify it? do they have ids?
(doseq [peer peers]
  (try
    (log/info "> Peer" (??? peer))
    (onyx/shutdown-peer peer)
    (catch Throwable e
      (log/info "Error during peer shutdown:" e))))
what can i use in place of ???
i have read the source but nothing jumped out at me
It’s a little opaque because it’s nested fairly deep in the peer system
And peers can restart and obtain new peer-ids
yeah. so i’m discovering
What are you trying to achieve?
just want to log something unique for each peer so i can see 40 distinct shutdowns
currently have a hung shutdown
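Since the peer objects don't expose a stable id, one way to see 40 distinct shutdowns is to log each peer's position in the collection instead. This is only a sketch: `shutdown-all!` is a hypothetical helper, and the shutdown and logging functions are passed in so it does not depend on any Onyx internals.

```clojure
(defn shutdown-all!
  "Calls shutdown-fn on each peer, logging a 1-based position before and
   after each call. Returns a vector of [position :ok] or
   [position exception], one entry per peer."
  [peers shutdown-fn log-fn]
  (let [total (count peers)]
    (vec
     (map-indexed
      (fn [i peer]
        (let [n (inc i)]
          (log-fn (str "> Peer " n "/" total " shutting down"))
          (try
            (shutdown-fn peer)
            (log-fn (str "> Peer " n "/" total " stopped"))
            [n :ok]
            (catch Throwable e
              (log-fn (str "Error during shutdown of peer " n ": " e))
              [n e]))))
      peers))))
```

With the names from the snippet above it might be called as `(shutdown-all! peers onyx/shutdown-peer #(log/info %))`. If a shutdown hangs, the last "shutting down" line without a matching "stopped" line shows which position is stuck.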
You don’t see anything like this in your log? 16-Feb-10 16:44:17 lbpro INFO [onyx.peer.virtual-peer] - Stopping Virtual Peer 36e1e995-b33c-408e-995c-b05bf1df13fe
last onyx log on both is 16-Feb-10 03:36:24
and ZK has lots of closed-socket and/or processed session termination
I should probably put the peer-id in that log message
Is this occurring on calling shutdown-peer?
that’s what i’m working on figuring out now
logging the peer shutdowns and the peer group shutdown explicitly
more soon
Usually if it falls out of outbox loop it’s either a shutdown or it hit an exception and should be rebooting
as you can see, we do catch and log Throwables, which i’m not seeing
i’m thinking the peers shut down but the peer-group doesn't
Ah. Are you catching around the peer-group shutdown?
looks like it’s failing on some peers
got a meeting now. will share info soon
@robert-stuttaford: off the top of my head, the most likely cause is a peer getting blocked doing something like opening a connection to datomic. Then when you shut down, it's still in a task startup state
ok, assuming that is so, how do we cleanly abort peers in such a state?
I don't currently have a good answer for you, though it's something we'd want to look at if that is the problem
cool. i think i’m going to disable yeller for test servers, i can never be sure if any exceptions are silently being eaten
Yeah that is a concern
man. prod just stopped again. had to restart zookeeper and use a new onyx/id before it would start working again.
Ok that's not good
i’m definitely still suffering some big knowledge gaps
No exceptions in prod?
It would be good to see some metrics charts leading up to the event
You obviously need some extra metrics but I might be able to read the matrix a little
Using a new onyx id shouldn't be necessary
zookeeper, and input metrics
So that is definitely a worry
everything came back up with the same onyx id, but no work happened
switching to a new id, work happened
~2500 datomic txes in the backfill
looks like ZK conns just started dropping off
Wow is that really a 40s batch latency on the read-log?
looks like it 😐
I suspect GC
datomic txor/ddb graphs are nominal
even after a reboot?
so there’s that it died, and there’s that it wouldn’t start again until i used a new id
Yeah I'm only speaking to the dying atm
Is this after it died?
The chart with the 40K batch latency I mean
I'm asking whether this is after startup or whether it's before death
the read latency is just before death
the very last data point is 35 seconds
Ok. Definitely seems like GC
the dreaded GC
You're using G1GC though
well, not explicitly. it might be included in AggressiveOpts
Oh yeah, the conc mark GC. I don't actually know much about how that works
should we switch to G1GC?
I think you still get short GCs with conc mark
It'd be tops if you had flight recorder output :/
Oh you are recording it
hold on a moment
You may have overwritten it when you started it up again :/
i’ve got a lot of these
yeah the file is 0 bytes
Are you using Aeron yet?
externally, no
And too bad. Next time grab the flight recorder file before starting up
But internally yes? Is this multi node?
single node
this is prod
With short circuiting right?
with short-circuiting
Yeah, ok. Def ram
ok. is that because aeron uses unsafe?
I don't think so because with short circuiting it should never be allocating any buffers
the jar is given 12000m via both -Xmx and -Xms
Seems like an internal Java thing
7000m for datomic peer cache
Peer cache is within Java heap, as far as you know?
I think it would be
Without flight recorder recording it's gonna be super hard to know
I suspect it was swapping before this occurred too
ok. i’m not going to change anything, and next time i’ll keep the recording and all the pid logs
Do you have datadog stats for the machine?
Yup. Maybe modify the startup so local recording is to a diff file each time
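One way to keep a restart from overwriting the previous recording is to put the startup time in the file name. A minimal sketch, assuming the startup script can compute the JVM options before launching; `jfr-filename` and `jfr-start-option` are hypothetical helpers and the directory is made up. `-XX:StartFlightRecording=filename=...` is the standard HotSpot option, and on Oracle JDK 8 it also needs `-XX:+UnlockCommercialFeatures -XX:+FlightRecorder`.

```clojure
(defn jfr-filename
  "Returns a recording path that embeds the startup time in epoch millis,
   so each start writes to its own file (hypothetical helper)."
  [dir prefix start-millis]
  (str dir "/" prefix "-" start-millis ".jfr"))

(defn jfr-start-option
  "Builds the JVM option that starts a recording written to path."
  [path]
  (str "-XX:StartFlightRecording=filename=" path))

;; Example: compute the option for this startup.
(def jfr-option
  (jfr-start-option
   (jfr-filename "/var/log/app" "recording" (System/currentTimeMillis))))
```

Old recordings then accumulate, so something would need to prune them eventually.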
Machine has 16G ram?
zk and onyx on the machine
not explicitly setting ram for ZK
k, I’m going to have a poke around datadog for a minute
I’d like to know if it was swapping
ZK using ~3.5% ram via top
Maybe you can look at cloudwatch?
ZK is on one machine?
ZK and Onyx are both on the same machine together
Ah, have a look at system.swap.used in datadog
Yeah, but one ZK, yes?
I guess when you restart it with a new onyx/id, it comes back up, which suggests that ZK is still working. I was going to say that maybe the swapping was taking out both
I don’t think that’s the case
Though it is definitely a reason why you’d want ZK on other machines
one ZK, yes
system.swap.* all appear to be flat
available swap also zero throughout
oh, your machines might not have swap
That would be odd though
             total       used       free     shared    buffers     cached
Mem:         15039       5992       9047        196        126       1274
-/+ buffers/cache:       4592      10447
Swap:            0          0          0
cue spooky music
ok. so we’re doing it wrong. it’s using everything up and can’t swap, and then it falls over because it’s run out of options
That would be my diagnosis
I still want to see flight recorder logs so that I can find out if we have a memory leak, but that configuration is not a good thing
i’ll get you a flight record for sure
even if only for your edification
so, assuming a sane swap setup, it should be able to stay up for a long time, right
Well… yeah
I mean, we’re not trying to build this to fall over 😛
amateur hour over here, heh
Once you have swap, I’d want to put a monitor on it in datadog
You may want to decrease -Xmx too
I’d rather the machine GC’d a little more often than having it go into swap
we’ve had one all along 😐
note the big empty box bottom middle
Sorry, I mean an alert
oh, yes
Haha, this is an example of how metrics can trick you 😄
how much swap should we have?
Well, swap only really delays the issue, so you have to think about how much swap will cause a death spiral anyway
i think we’ll start with the same amount as ram
Yeah, that’s reasonable. You don’t want to use it though, so I think it’s important that -Xmx is reduced, unless you absolutely need all that memory
it’s a 16gb instance, and we’ve assigned it 12gb
Yeah, I’m curious about why it’s gone over that so much
i think it might be old highstorm processes sticking around
actually, no
this is the last 2 days
Restart is where that first and second drop is?
Looks to me like other processes are using > 4GB RAM if so
just checked that we only have zk and HS running
ok. next time this happens, we’re in a much better place to diagnose
we’ll add swap asap, and i’ll be sure to keep .jfr file before attempting to recover things
k, just be aware of the Xmx link above, because 12GB is likely too high
Swap could help there though
If you’re going into swap then it’s probably too high
where should we dial it back to?
I was thinking 10. If you’re adding swap it might be OK because it doesn’t mean automatic death. My preference would be less than 12 though
ok. we’ll go with 10
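The arithmetic behind that choice, as a rough sketch using only numbers quoted in this conversation (15039 MB total from the `free` output, a 12000m heap today, 10000m proposed). `headroom-mb` is a hypothetical helper. Note that a JVM uses memory beyond `-Xmx` (thread stacks, metaspace, direct buffers), so the real margin is smaller than these figures.

```clojure
(defn headroom-mb
  "Memory left for everything outside the JVM heap, in MB."
  [total-mb heap-mb]
  (- total-mb heap-mb))

(def total-mb 15039) ; "total" column of the free output in the chat

(def headroom-at-12g (headroom-mb total-mb 12000)) ; current -Xmx
(def headroom-at-10g (headroom-mb total-mb 10000)) ; proposed -Xmx
```

That leaves 3039 MB outside the heap today and 5039 MB after the change, to be shared by ZK (about 3.5% of RAM per `top`, roughly 526 MB), the OS, and the JVM's own non-heap memory.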
Even if it doesn’t die, get me a copy of the recording on the next restart/deploy
i will do
You may want to reduce the datomic peer cache if you reduce it to 10
As for why it didn’t start back up OK, I think I need a dump of the peer log
I need a dump from ZK
I’ll point you in the right direction in a little bit
ok cool
[org.onyxplatform/onyx "0.8.8"]
[org.onyxplatform/onyx-datomic ""]
[org.onyxplatform/onyx-metrics ""]