#onyx
2016-01-20
lucasbradstreet11:01:58

@greywolve Right, so did you have to add 10 corresponding peers? I'm guessing it's running on your own machine alone?

lucasbradstreet11:01:08

The solution is to increase the timeout

lucasbradstreet11:01:37

This can often pop up during long GCs too.

greywolve11:01:32

yup, added the peers, batch timeout? our max pending is also set pretty low

lucasbradstreet11:01:32

If you look at the notes listed in https://github.com/onyx-platform/onyx-jepsen/blob/master/README.md, you could use similar settings to those. Maybe reduce the timeout a bit. I think it's set to 60s
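
A minimal sketch of raising that timeout to 60s, assuming the Aeron client liveness timeout is the one meant here; Aeron reads it as a JVM system property in nanoseconds, so it has to be set before any Aeron resources are created:

;; Set before starting peers or the media driver; value is in nanoseconds.
(System/setProperty "aeron.client.liveness.timeout"
                    (str (* 60 1000 1000 1000))) ; 60s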

lucasbradstreet11:01:40

How many peers total now? 25-30? All running on your lone machine?

greywolve11:01:43

39 😛 , ahh right the aeron timeout itself

lucasbradstreet11:01:29

Hah that's a lot of tasks for one machine to handle. No wonder you're hitting issues.

lucasbradstreet11:01:20

I'd consider rationalising/fusing some where it makes sense

greywolve11:01:21

haha, ok then i'm just being silly i guess

greywolve11:01:23

btw, do you think running two datomic read-log tasks, in separate jobs, is wise? i actually thought that was causing my issues, and ended up merging the 2nd job into the first, to share a read-log, and transact

greywolve11:01:27

does that matter?

lucasbradstreet11:01:26

It'll add a bit of load because you're reading multiple times, but you get the advantage of decoupling the jobs, which can help a lot with retries

lucasbradstreet11:01:53

Given how many tasks you have in your job I would definitely consider doing it

lucasbradstreet11:01:23

If you do, make sure they're not both checkpointing to the same key
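
A hedged sketch of what distinct checkpoint keys look like for two jobs sharing one Datomic read-log input; :checkpoint/key follows the onyx-datomic read-log plugin's option name, while the task maps, URI, and key values here are hypothetical:

(def job-a-read-log
  {:onyx/name :read-log
   :onyx/plugin :onyx.plugin.datomic/read-log
   :onyx/type :input
   :onyx/medium :datomic
   :datomic/uri "datomic:dev://localhost:4334/my-db" ; example URI
   :checkpoint/key "job-a" ; must differ between the two jobs
   :onyx/batch-size 20})

(def job-b-read-log
  (assoc job-a-read-log :checkpoint/key "job-b"))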

greywolve11:01:35

i actually had it like that originally, then i thought this aeron issue was due to that, lol. thankfully i kept the other service as is, so i'll just revert back

greywolve11:01:45

yup i made sure of that 🙂

lucasbradstreet11:01:34

Cool. I think it makes sense to split them again. It's easier to unit test/reason about too

lucasbradstreet11:01:52

I think it's a good idea to still increase the timeout and switch to shared mode media driver

greywolve11:01:50

will do, you mean the stand alone driver? even for local dev?

lucasbradstreet11:01:52

I think I'd switch to shared mode in both the stand alone driver (prod) as well as for you local dev

lucasbradstreet11:01:47

Dedicated mode can achieve better throughput, but at the cost of burning CPU, and your throughput is rather low for what Aeron can handle

lucasbradstreet11:01:55

We've switched to shared mode by default

greywolve11:01:03

awesome, i'll try it out, thanks

lucasbradstreet12:01:24

@lsnape: I think I’d benefit from having a quick look at your onyx.log first

lsnape12:01:38

@lucasbradstreet: wrt the issue I just posted on google groups: https://groups.google.com/forum/#!topic/onyx-user/6s7VNT6iloM I'm going to wipe my onyx.log and submit the offending job again...

lucasbradstreet12:01:46

Ok, sounds good.

lucasbradstreet12:01:49

Hmm, wow it doesn’t get very far at all

lsnape12:01:59

I start the system with 4 peers. Like I said, the sample job runs fine but the modified one does not

lucasbradstreet12:01:25

Do you have more than 4 tasks in the new job?

lsnape12:01:50

no, there's only input and output

lsnape12:01:18

:workflow [[:read-messages :write-output]]

lsnape12:01:51

my kafka topic is configured with 8 partitions, but I think that's irrelevant

lucasbradstreet12:01:57

I have a feeling that you’re submitting the job to a different ZK cluster than the peers are looking at

lucasbradstreet12:01:03

or with a different onyx/id

lucasbradstreet12:01:25

It seems like the peers are just waiting around and never see the submit-job

lucasbradstreet12:01:41

Though it does seem to be written successfully to ZooKeeper
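
The invariant being described, as a minimal sketch (names and values hypothetical): the config used to start the peers and the one passed to submit-job must agree on both :zookeeper/address and :onyx/id, or the peers never see the job's log entry:

(def onyx-id #uuid "00000000-0000-0000-0000-000000000000")

(def peer-config
  {:zookeeper/address "127.0.0.1:2181" ; same ZK the peers watch
   :onyx/id onyx-id                    ; same cluster id on both sides
   :onyx.peer/job-scheduler :onyx.job-scheduler/greedy
   :onyx.messaging/impl :aeron
   :onyx.messaging/bind-addr "localhost"})

;; (onyx.api/submit-job peer-config job) must use this same config.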

lsnape12:01:41

so :kafka/zookeeper is set to the prod zookeeper instance

lsnape12:01:24

the peers are looking at local zk

lucasbradstreet12:01:37

that should be fine I think

lsnape12:01:42

so why might the onyx id be different? I'll take another look..

lucasbradstreet12:01:21

:onyx/max-peers 8, :onyx/min-peers 8,

lucasbradstreet12:01:32

That is concerning, given that you only have 4 peers

lsnape12:01:45

ah yes, you're right. I'll change the number of peers to 8. I think I've run it with 8 peers before though and had the same problem. Will try again now though

lucasbradstreet12:01:46

I believe the scheduler will see the submit-job entry and not schedule it because there aren’t enough peers yet

lucasbradstreet12:01:17

You’ll need more than 8: at least 1 peer for each task, except for tasks that set a minimum, which need that many. So in your case I think it would be 9
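
A hypothetical catalog for the two-task job above, illustrating the count: the Kafka input pins :onyx/min-peers and :onyx/max-peers to the 8 partitions, and :write-output needs at least 1 peer, so the scheduler wants 8 + 1 = 9 virtual peers before it will start the job:

(def catalog
  [{:onyx/name :read-messages
    :onyx/plugin :onyx.plugin.kafka/read-messages
    :onyx/type :input
    :onyx/medium :kafka
    :kafka/topic "my-topic" ; example topic with 8 partitions
    :onyx/min-peers 8
    :onyx/max-peers 8
    :onyx/batch-size 100}
   {:onyx/name :write-output
    :onyx/plugin :onyx.plugin.core-async/output ; example output plugin
    :onyx/type :output
    :onyx/medium :core.async
    :onyx/batch-size 100}])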

lsnape12:01:07

okay something has happened! The messenger buffers now start and I get exceptions further downstream

lsnape12:01:09

ah it's complaining about a missing symbol. This kind of stuff is expected. I should be alright from here I think

lucasbradstreet12:01:14

You may also be interested in our new template (currently it’s only a snapshot release). We’re still iterating on it, but it uses some of the latest best practices. https://github.com/onyx-platform/onyx-template/tree/feature/new-idioms

lsnape12:01:33

oh nice, I'll check it out 🙂

lucasbradstreet12:01:49

Cool. Sounds like you’ll be good from here then. Feel free to come back with any other issues 🙂

lsnape12:01:06

Thanks for your help 👍

lucasbradstreet12:01:12

You’re welcome

lucasbradstreet12:01:07

You hit a common problem that we’d like a validator to deal with when deving locally. It doesn’t make sense to throw an exception in prod because you might just be waiting for more peers to come up, or other jobs to finish.

lsnape12:01:03

yeah, so I guess that would be a case of scanning the workflow and catalog to find the minimum number of peers required to run the job?

lucasbradstreet12:01:41

And checking it against the number of peers currently running in the peer coordination replica. It would only be useful in dev
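
A rough sketch of such a dev-only check (function and argument names hypothetical): sum the minimum peers each task in the workflow requires, defaulting to 1, and compare against the peers currently in the replica:

(defn min-peers-required [workflow catalog]
  (let [task-names (set (flatten workflow))]
    (->> catalog
         (filter (comp task-names :onyx/name))
         (map #(get % :onyx/min-peers 1))
         (reduce +))))

(defn enough-peers? [workflow catalog n-running-peers]
  (<= (min-peers-required workflow catalog) n-running-peers))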

gardnervickers13:01:53

@lsnape: I will have some updates for that template in about an hour, hopefully it’ll make it a little clearer

lsnape13:01:43

@gardnervickers: awesome. I aim to give it a whirl this afternoon

lucasbradstreet13:01:40

@lsnape: you may need to lein install it manually and use lein new onyx-app proj-name --snapshot to make it work

lucasbradstreet13:01:56

grr, annoying that it converts -- to an em dash

greywolve14:01:31

thanks again @lucasbradstreet, your suggestions seem to have made my machine handle both services - no starvation. 🙂