#onyx
2016-02-10
robert-stuttaford06:02:14

@lucasbradstreet: would the new task constraints feature allow us to safely continue using core.async inputs on a single instance?

robert-stuttaford06:02:25

to inject work into the cluster

robert-stuttaford06:02:57

really don’t want to have to run e.g. kafka just to support occasional from-the-side data correction tasks

lucasbradstreet06:02:06

It absolutely would

lucasbradstreet06:02:59

Just be careful that you don't get into a scenario where you can't recover because that node is down (I'm guessing you'd bounce it anyway)
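
(For reference, a minimal sketch of what this could look like, assuming the "task constraints" feature here means Onyx's peer tagging; the key names are from memory of the 0.8.x docs and should be verified. Only peers started with a matching tag may run the tagged task, which pins the core.async input to the node that owns the channel.)

(def onyx-id "example-cluster-id")                       ; shared by every node

(def peer-config
  {:onyx/id onyx-id
   :zookeeper/address "zk1:2181"                         ; placeholder address
   :onyx.peer/job-scheduler :onyx.job-scheduler/balanced
   :onyx.messaging/impl :aeron
   :onyx.peer/tags [:local-input]})                      ; only on the node hosting the channel

;; catalog entry for the core.async side-input task
{:onyx/name :side-input
 :onyx/plugin :onyx.plugin.core-async/input
 :onyx/type :input
 :onyx/medium :core.async
 :onyx/required-tags [:local-input]                      ; constrains which peers may run it
 :onyx/max-peers 1
 :onyx/batch-size 20}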

lucasbradstreet06:02:44

I would consider just using an extra Datomic database that you can occasionally trash though. It seems operationally simpler.

lucasbradstreet06:02:58

It'll also give you a record of the manual adjustments

lucasbradstreet06:02:51

If you play with the backoff policy in a log reader for that database, it wouldn't be very expensive to maintain, resource-wise

robert-stuttaford06:02:20

interesting. you’re saying watch a trash db but have the side input do the work on the actual db?

lucasbradstreet06:02:28

Yeah, basically use the side db as an input source to inject the side correction segments
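
(A rough sketch of that side-database input, using the onyx-datomic read-log plugin; the URI and checkpoint key are placeholders, and the exact option names should be checked against the plugin README for your version.)

{:onyx/name :read-corrections
 :onyx/plugin :onyx.plugin.datomic/read-log
 :onyx/type :input
 :onyx/medium :datomic
 :datomic/uri "datomic:dev://localhost:4334/side-corrections"  ; throwaway side db
 :checkpoint/key "side-corrections"                            ; resume point survives restarts
 :checkpoint/force-reset? false
 :onyx/max-peers 1
 :onyx/batch-size 20
 :onyx/doc "Streams tx-log entries from the side db as correction segments"}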

robert-stuttaford06:02:49

i like that idea!

robert-stuttaford06:02:00

i’m going to explore this

lucasbradstreet06:02:15

Since you can always trash the db you don't have to worry about overloading your main db and you get fault tolerance on your side tasks

robert-stuttaford06:02:42

and a system of record

robert-stuttaford06:02:48

you’re good at this!

lucasbradstreet06:02:48

Compliments will get you everywhere with me 😛

robert-stuttaford07:02:31

using git commit hash based :onyx/id was the final thing i needed to fix
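
(A minimal sketch of the commit-hash-based :onyx/id, assuming the deploy step exports the sha in an env var; the var name is hypothetical. Every node must resolve the same value, otherwise they form separate clusters.)

(def onyx-id
  (or (System/getenv "GIT_COMMIT_SHA")   ; e.g. set from `git rev-parse HEAD` at deploy time
      (throw (ex-info "GIT_COMMIT_SHA not set" {}))))

;; the same value goes into both configs
(def env-config  {:onyx/id onyx-id :zookeeper/address "zk1:2181"})
(def peer-config {:onyx/id onyx-id
                  :zookeeper/address "zk1:2181"
                  :onyx.peer/job-scheduler :onyx.job-scheduler/balanced
                  :onyx.messaging/impl :aeron})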

robert-stuttaford07:02:08

the submitting node is processing events. the other node has started peers etc but has not submitted jobs. it’s not processing events. should it be live before the jobs are submitted?

robert-stuttaford07:02:17

it started after the jobs were submitted

lucasbradstreet07:02:51

It shouldn’t matter which node submits

lucasbradstreet07:02:01

Once the job is submitted, both nodes should start processing events

robert-stuttaford07:02:35

as long as they use the same onyx/id and the same catalog/workflow, right?

robert-stuttaford07:02:45

jobs do the catalog/workflow bit

lucasbradstreet07:02:48

same onyx/id yup

robert-stuttaford07:02:20

would number two do anything if number 1 had enough peers to start the work?

robert-stuttaford07:02:39

the first one’s onyx.log shows ‘enough peers are active’, the second one doesn't

lucasbradstreet07:02:40

depends on your scheduling

lucasbradstreet07:02:55

But almost definitely yes

robert-stuttaford07:02:56

:onyx.job-scheduler/balanced :onyx.task-scheduler/balanced
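
(For reference, where those two keywords live, sketched with submit-job; peer-config, workflow, catalog, and lifecycles are assumed to be defined elsewhere. The job scheduler sits in the peer-config as :onyx.peer/job-scheduler; the task scheduler is passed per job at submit time.)

(require '[onyx.api :as onyx])

(onyx/submit-job peer-config
                 {:workflow workflow
                  :catalog catalog
                  :lifecycles lifecycles
                  :task-scheduler :onyx.task-scheduler/balanced})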

lucasbradstreet07:02:06

Assuming you’re not putting max-peers on every task

lucasbradstreet07:02:12

Both nodes should start the job

lucasbradstreet07:02:20

sorry, should start working on the job

robert-stuttaford07:02:59

ok. so that’s not happening right now. i didn’t start them up cleanly (i was messing with the env vars driving the onyx/id thing), so i’m going to do a full system cold boot and see what happens

lucasbradstreet07:02:15

Yeah, I suspect they’re on diff onyx/ids

robert-stuttaford07:02:09

what do you use to watch logs?

robert-stuttaford07:02:25

2 instances, 3 logs each (app, onyx, zk), makes for a lot of terminals

robert-stuttaford07:02:58

i’m so close i can taste it!!

lucasbradstreet07:02:24

I’m a little too used to the pain there

lucasbradstreet07:02:56

but then you have another service

robert-stuttaford07:02:04

cool. we use that for prod. we’ll get it set up for test as well

robert-stuttaford07:02:18

even if just to watch the background stuff (onyx/zk)

lucasbradstreet07:02:19

Ah yeah, it’s a no brainer then

robert-stuttaford08:02:11

@lucasbradstreet: given a peer object instance, what can i print to identify it? do they have ids?

robert-stuttaford08:02:54

(doseq [peer peers]
  (try
    (log/info "> Peer" (??? peer))
    (onyx/shutdown-peer peer)
    (catch Throwable e
      (log/info "Error during peer shutdown:" e))))

robert-stuttaford08:02:03

what can i use in place of ???

robert-stuttaford08:02:15

i have read the source but nothing jumped out at me

lucasbradstreet08:02:41

It’s a little opaque because it’s nested fairly deep in the peer system

lucasbradstreet08:02:07

And peers can restart and obtain new peer-ids

robert-stuttaford08:02:07

yeah. so i’m discovering 🙂

lucasbradstreet08:02:37

What are you trying to achieve?

robert-stuttaford08:02:12

just want to log something unique for each peer so i can see 40 distinct shutdowns
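
(Given that peer ids are buried in the peer system, a hedged workaround that only relies on what is already in scope in the snippet above: log a positional index so 40 distinct shutdowns, and the point of any hang, are visible.)

(doseq [[i peer] (map-indexed vector peers)]
  (try
    (log/info "> Shutting down peer" (inc i) "of" (count peers))
    (onyx/shutdown-peer peer)
    (log/info "> Peer" (inc i) "shut down")
    (catch Throwable e
      (log/info "Error during peer shutdown:" e))))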

robert-stuttaford08:02:30

currently have a hung shutdown

lucasbradstreet08:02:03

You don’t see anything like this in your log? 16-Feb-10 16:44:17 lbpro INFO [onyx.peer.virtual-peer] - Stopping Virtual Peer 36e1e995-b33c-408e-995c-b05bf1df13fe

robert-stuttaford08:02:26

last onyx log on both is 16-Feb-10 03:36:24 INFO [onyx.peer.virtual-peer] - Fell out of outbox loop

robert-stuttaford08:02:43

and ZK has lots of closed-socket and/or processed session termination

lucasbradstreet08:02:53

I should probably put the peer-id in that log message

lucasbradstreet08:02:23

Is this occurring on calling shutdown-peer?

robert-stuttaford08:02:36

that’s what i’m working on figuring out now

robert-stuttaford08:02:48

logging the peer shutdowns and the peer group shutdown explicitly

lucasbradstreet08:02:06

Usually if it falls out of outbox loop it’s either a shutdown or it hit an exception and should be rebooting

robert-stuttaford08:02:13

as you can see, we do catch and log Throwables, which i’m not seeing

robert-stuttaford09:02:05

i’m thinking the peers shut down but the peer-group doesn't

lucasbradstreet09:02:55

Ah. Are you catching around the peer-group shutdown?
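
(Along the same lines, a small sketch that wraps the peer-group shutdown too, so a hang can be attributed to shutdown-peer vs shutdown-peer-group; peer-group is assumed to be bound elsewhere.)

(try
  (log/info "Shutting down peer group...")
  (onyx/shutdown-peer-group peer-group)
  (log/info "Peer group shut down")
  (catch Throwable e
    (log/info "Error during peer group shutdown:" e)))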

robert-stuttaford09:02:30

looks like it’s failing on some peers

robert-stuttaford09:02:36

got a meeting now. will share info soon

lucasbradstreet09:02:05

@robert-stuttaford: off the top of my head, the most likely is a peer getting blocked doing something like opening a connection to datomic, or something like that. Then when you shutdown it's still in a task startup state

robert-stuttaford10:02:35

ok, assuming that is so, how do we cleanly abort peers in such a state?

lucasbradstreet10:02:01

I don't currently have a good answer for you, though it's something we'd want to look at if that is the problem
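
(Not an Onyx-level answer, but one application-side mitigation, sketched here under the assumption that the blocking call is a Datomic connect made in a lifecycle: bound it with a timeout so a peer stuck in task startup throws and reboots instead of hanging; assumes datomic.api is required as d.)

(require '[datomic.api :as d])

(defn connect-with-timeout
  "Connect to Datomic, throwing if it takes longer than timeout-ms."
  [uri timeout-ms]
  (let [f      (future (d/connect uri))
        result (deref f timeout-ms ::timed-out)]
    (if (= ::timed-out result)
      (do (future-cancel f)
          (throw (ex-info "Timed out connecting to Datomic"
                          {:uri uri :timeout-ms timeout-ms})))
      result)))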

robert-stuttaford10:02:35

cool. i think i’m going to disable yeller for test servers, i can never be sure if any exceptions are silently being eaten

lucasbradstreet10:02:47

Yeah that is a concern

robert-stuttaford10:02:11

man. prod just stopped again. had to restart zookeeper and use a new onyx/id before it would start working again.

lucasbradstreet10:02:20

Ok that's not good

robert-stuttaford10:02:23

i’m definitely still suffering some big knowledge gaps

lucasbradstreet10:02:35

No exceptions in prod?

lucasbradstreet10:02:54

It would be good to see some metrics charts leading up to the event

lucasbradstreet10:02:10

You obviously need some extra metrics but I might be able to read the matrix a little

lucasbradstreet10:02:54

Using a new onyx id shouldn't be necessary

robert-stuttaford10:02:04

zookeeper, and input metrics

lucasbradstreet10:02:06

So that is definitely a worry

robert-stuttaford10:02:20

everything came back up with the same onyx id, but no work happened

robert-stuttaford10:02:27

switching to a new id, work happened

robert-stuttaford10:02:36

~2500 datomic txes in the backfill

robert-stuttaford10:02:55

looks like ZK conns just started dropping off

lucasbradstreet10:02:00

Wow is that really a 40s batch latency on the read-log?

robert-stuttaford10:02:18

looks like it 😐

robert-stuttaford10:02:29

datomic txor/ddb graphs are nominal

robert-stuttaford10:02:59

even after a reboot?

robert-stuttaford10:02:13

so there’s the fact that it died, and the fact that it wouldn’t start again until i used a new id

lucasbradstreet10:02:32

Yeah I'm only speaking to the dying atm

lucasbradstreet10:02:36

Is this after it died?

lucasbradstreet10:02:01

The chart with the 40K batch latency I mean

lucasbradstreet10:02:17

I'm asking whether this is after startup or whether it's before death

robert-stuttaford10:02:27

the read latency is just before death

robert-stuttaford11:02:07

the very last data point is 35 seconds

lucasbradstreet11:02:48

Ok. Definitely seems like GC

lucasbradstreet11:02:02

You're using G1GC though

robert-stuttaford11:02:09

well, not explicitly. it might be included in AggressiveOpts

lucasbradstreet11:02:10

Oh yeah, the conc mark GC. I don't actually know much about how that works

robert-stuttaford11:02:24

should we switch to G1GC?

lucasbradstreet11:02:32

I think you still get short GCs with conc mark

lucasbradstreet11:02:45

It'd be tops if you had flight recorder output :/

lucasbradstreet11:02:39

Oh you are recording it

lucasbradstreet11:02:48

You may have overwritten it when you started it up again :/

robert-stuttaford11:02:50

i’ve got a lot of these

robert-stuttaford11:02:51

hs_err_pid29570.log

robert-stuttaford11:02:59

yeah the file is 0 bytes

lucasbradstreet11:02:08

Are you using Aeron yet?

lucasbradstreet11:02:19

And too bad. Next time grab the flight recorder file before starting up

lucasbradstreet11:02:27

But internally yes? Is this multi node?

lucasbradstreet11:02:39

With short circuiting right?

robert-stuttaford11:02:44

with short-circuiting

lucasbradstreet11:02:55

Yeah, ok. Def ram

robert-stuttaford11:02:30

ok. is that because aeron uses unsafe?

lucasbradstreet11:02:54

I don't think so because with short circuiting it should never be allocating any buffers

robert-stuttaford11:02:07

the jar is given 12000m via -Xmx/-Xms

lucasbradstreet11:02:10

Seems like an internal Java thing

robert-stuttaford11:02:16

7000m for datomic peer cache

lucasbradstreet11:02:23

Peer cache is within Java heap, as far as you know?

lucasbradstreet11:02:31

I think it would be

lucasbradstreet11:02:54

Without flight recorder recording it's gonna be super hard to know

lucasbradstreet11:02:40

I suspect it was swapping before this occurred too

robert-stuttaford11:02:50

ok. i’m not going to change anything, and next time i’ll keep the recording and all the pid logs

lucasbradstreet11:02:50

Do you have datadog stats for the machine?

lucasbradstreet11:02:08

Yup. Maybe modify the startup so local recording is to a diff file each time

lucasbradstreet11:02:53

Machine has 16G ram?

robert-stuttaford11:02:14

zk and onyx on the machine

robert-stuttaford11:02:32

not explicitly setting ram for ZK

lucasbradstreet11:02:02

k, I’m going to have a poke around datadog for a minute

lucasbradstreet11:02:08

I’d like to know if it was swapping

robert-stuttaford11:02:09

ZK using ~3.5% ram via top

lucasbradstreet11:02:13

Maybe you can look at cloudwatch?

lucasbradstreet11:02:06

ZK is on one machine?

robert-stuttaford11:02:19

ZK and Onyx are both on the same machine together

lucasbradstreet11:02:25

Ah, have a look at system.swap.used in datadog

lucasbradstreet11:02:29

Yeah, but one ZK, yes?

lucasbradstreet11:02:10

I guess when you restart it with a new onyx/id, it comes back up, which suggests that ZK is still working. I was going to say that maybe the swapping was taking out both

lucasbradstreet11:02:23

I don’t think that’s the case

lucasbradstreet11:02:33

Though it is definitely a reason why you’d want ZK on other machines

robert-stuttaford11:02:46

system.swap.* all appear to be flat

robert-stuttaford11:02:30

available swap also zero throughout

lucasbradstreet11:02:41

oh, your machines might not have swap

lucasbradstreet11:02:51

That would be odd though

robert-stuttaford11:02:00

             total       used       free     shared    buffers     cached
Mem:         15039       5992       9047        196        126       1274
-/+ buffers/cache:        4592      10447
Swap:            0          0          0

robert-stuttaford11:02:24

ok. so we’re doing it wrong. it’s using everything up and can’t swap, and then it falls over because it’s run out of options

lucasbradstreet11:02:55

That would be my diagnosis

lucasbradstreet11:02:30

I still want to see flight recorder logs so that I can find out if we have a memory leak, but that configuration is not a good thing

robert-stuttaford11:02:40

i’ll get you a flight record for sure

robert-stuttaford11:02:47

even if only for your edification

robert-stuttaford11:02:09

so, assuming a sane swap setup, it should be able to stay up for a long time, right

lucasbradstreet11:02:41

I mean, we’re not trying to build this to fall over 😛

robert-stuttaford11:02:06

amateur hour over here, heh

lucasbradstreet11:02:11

Once you have swap, I’d want to put a monitor on it in datadog

lucasbradstreet11:02:25

You may want to decrease -Xmx too

lucasbradstreet11:02:49

I’d rather the machine GC’d a little more often than having it go into swap

robert-stuttaford11:02:49

we’ve had one all along 😐

robert-stuttaford11:02:59

note the big empty box bottom middle

lucasbradstreet11:02:17

Sorry, I mean an alert

lucasbradstreet11:02:34

Haha, this is an example of how metrics can trick you 😄

robert-stuttaford11:02:44

how much swap should we have?

lucasbradstreet11:02:20

Well, swap only really delays the issue, so you have to think about how much swap will cause a death spiral anyway

robert-stuttaford11:02:38

i think we’ll start with the same amount as ram

lucasbradstreet11:02:25

Yeah, that’s reasonable. You don’t want to use it though, so I think it’s important that -Xmx is reduced, unless you absolutely need all that memory

robert-stuttaford11:02:20

it’s a 16gb instance, and we’ve assigned it 12gb

lucasbradstreet11:02:38

Yeah, I’m curious about why it’s gone over that so much

robert-stuttaford11:02:21

i think it might be old highstorm processes sticking around

robert-stuttaford11:02:04

this is the last 2 days

lucasbradstreet11:02:40

Restart is where that first and second drop is?

lucasbradstreet11:02:04

Looks to me like other processes are using > 4GB RAM if so

robert-stuttaford11:02:35

just checked that we only have zk and HS running

robert-stuttaford11:02:12

ok. next time this happens, we’re in a much better place to diagnose

robert-stuttaford11:02:34

we’ll add swap asap, and i’ll be sure to keep .jfr file before attempting to recover things

lucasbradstreet11:02:22

k, just be aware of the Xmx link above, because 12GB is likely too high

lucasbradstreet11:02:28

Swap could help there though

lucasbradstreet11:02:41

If you’re going into swap then it’s probably too high

robert-stuttaford11:02:13

where should we dial it back to?

lucasbradstreet11:02:11

I was thinking 10. If you’re adding swap it might be OK because it doesn’t mean automatic death. My preference would be less than 12 though

robert-stuttaford11:02:19

ok. we’ll go with 10

lucasbradstreet11:02:49

Even if it doesn’t die, get me a copy of the recording on the next restart/deploy

lucasbradstreet11:02:15

You may want to reduce the datomic peer cache if you reduce it to 10
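
(A sketch of the agreed direction in project.clj terms: heap capped at 10g, the Datomic object cache dialled down to leave headroom, and flight recorder dumping on exit. The project name, the 4g cache size, and the .jfr filename are assumptions; making the filename unique per run would be up to the launch script.)

(defproject highstorm "0.1.0"            ; name/version are placeholders
  :dependencies [[org.onyxplatform/onyx "0.8.8"]]
  :jvm-opts ["-Xmx10g" "-Xms10g"
             "-Ddatomic.objectCacheMax=4g"
             "-XX:+UnlockCommercialFeatures" "-XX:+FlightRecorder"
             "-XX:StartFlightRecording=disk=true,dumponexit=true,filename=highstorm.jfr"])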

lucasbradstreet11:02:49

As for why it didn’t start back up OK, I think I need a dump of the peer log

lucasbradstreet11:02:57

I need a dump from ZK

lucasbradstreet11:02:04

I’ll point you in the right direction in a little bit

robert-stuttaford11:02:21

[org.onyxplatform/onyx "0.8.8"]
[org.onyxplatform/onyx-datomic "0.8.8.0"]
[org.onyxplatform/onyx-metrics "0.8.8.0"]