onyx 2016-06-30 | Slack Archive

dspiteself16:06:09

Is there anything we can do to get a better error message when we submit a job and the cluster/test-environment does not have enough peers?

gardnervickers16:06:00

@dspiteself: There’s this https://github.com/onyx-platform/onyx/blob/0.9.x/src/onyx/test_helper.clj#L40

gardnervickers16:06:20

For the with-test-env macro, if that’s what you’re talking about
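
A minimal sketch of the kind of check being discussed, not the code behind the link. It assumes a job needs one peer per task unless the catalog entry sets `:onyx/min-peers`, and it ignores everything else the real scheduler considers. The names `min-required-peers` and `check-enough-peers!` are hypothetical.

```clojure
(defn min-required-peers
  "Sums the peers needed by every task named in the workflow.
   Assumption: each task needs :onyx/min-peers peers, defaulting to 1."
  [catalog workflow]
  (let [tasks   (set (apply concat workflow))
        by-name (into {} (map (juxt :onyx/name identity)) catalog)]
    (reduce + (map #(get-in by-name [% :onyx/min-peers] 1) tasks))))

(defn check-enough-peers!
  "Throws with a readable message when n-peers cannot cover the job,
   otherwise returns true. Hypothetical helper for test code only."
  [n-peers catalog workflow]
  (let [required (min-required-peers catalog workflow)]
    (when (< n-peers required)
      (throw (ex-info (str "Not enough peers to run this job: have "
                           n-peers ", need at least " required)
                      {:n-peers n-peers :required required})))
    true))
```

Failing fast like this only makes sense in a test environment, where the number of peers is fixed and known up front.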

dspiteself16:06:47

yea

dspiteself16:06:49

awesome

dspiteself16:06:59

could I push for that to happen by default

dspiteself16:06:39

we have probably lost 4-8 hours of total development time as each developer runs into that issue

lucasbradstreet16:06:44

@dspiteself: unfortunately there’s not a good solution other than that sort of check, because the cluster is dynamic and not having enough peers can be a normal condition if you have lots of jobs

dspiteself16:06:25

how about a default on with-test-env ?

dspiteself16:06:41

or a warning

lucasbradstreet16:06:08

I could see a wrapper for submit-job that uses with-test-env, or does the submit as part of with-test-env, but those are more limited. Making it a warning could be bad under certain conditions. I could see cases where thousands of messages could be printed every time a new cluster event happens (e.g. do you want to print it every time a peer joins, leaves, etc.?)

dspiteself16:06:02

with-dummy-test-env

dspiteself16:06:05

🙂

lucasbradstreet16:06:11

I will think about it some more. We get blocked on it because there isn’t really a great solution.

gardnervickers16:06:40

it would be more like wrapping submit-job to be submit-test-job, I think that’s something that would be pretty simple to do in your own codebase if needed.
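
A rough sketch of what such a wrapper could look like in a project's own test code. `submit-test-job`, `enough-peers-for!` and `fake-submit!` are hypothetical names. The validation and submit functions are passed in so the sketch runs without a cluster; in a real codebase `submit!` would presumably close over `submit-job` and a peer config. The check assumes one peer per task, which is a simplification.

```clojure
(defn submit-test-job
  "Runs validate! on the job first (it should throw if the job cannot run),
   then hands the job to submit! and returns its result."
  [validate! submit! job]
  (validate! job)
  (submit! job))

(defn enough-peers-for!
  "Returns a validation fn for a test env with n-peers.
   Assumption: every task in the workflow needs exactly one peer."
  [n-peers]
  (fn [job]
    (let [required (count (set (apply concat (:workflow job))))]
      (when (< n-peers required)
        (throw (ex-info "Not enough peers for this job"
                        {:n-peers n-peers :required required}))))))

;; Stand-in for the real submit call, for illustration only.
(defn fake-submit! [job]
  {:success? true})

(submit-test-job (enough-peers-for! 3) fake-submit! {:workflow [[:in :inc] [:inc :out]]})
```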

dspiteself16:06:52

we are going to wrap submit-job in ours, I am just worried about the beginner experience of onyx for the next learners.

michaeldrogalis16:06:12

@dspiteself: See https://github.com/onyx-platform/onyx/issues/452

michaeldrogalis16:06:11

Onyx is designed to intentionally handle both under and over allocated clusters. Every other platform has some notion of process slots, there's nothing terribly different going on here.

lucasbradstreet16:06:31

It worries me too, but we’ve not found a good user friendly way to inform users since it’s the expected behaviour. @michaeldrogalis what about a submit-job version that will auto-kill if the log entry plays and it can’t allocate it? We could print a message at that point. Of course, submit-job would have just returned :success? so that’s not ideal either

dspiteself16:06:37

I understand. I read that when I hit the problem after 2-3 hours of fiddling, but now another developer of ours went through the same thing.

dspiteself16:06:04

and storm as of 0.9.X did give error messages for that case

dspiteself16:06:35

I respect that stance

gardnervickers16:06:42

The onyx peers do give error messages though right? They claim not enough peers to start the job.

lucasbradstreet16:06:42

@michaeldrogalis: I guess I could see printing a warning when the submit-job entry is played, and only then.

dspiteself16:06:55

I am just making noise of felt pain

lucasbradstreet16:06:01

@gardnervickers: that’s only printed when the job has been started but all the peers aren’t warmed up

michaeldrogalis16:06:18

Our architecture is substantially different. This sounds a bit awful, but now that you've hit that problem, you're unlikely to make that mistake again. Losing a couple of hours and learning about how the scheduler works is preferable to doing something hacky in the architecture.

gardnervickers16:06:19

ah gotcha

lucasbradstreet16:06:20

@dspiteself: I feel it since it comes up enough, which is why we added that validation function

michaeldrogalis16:06:33

@dspiteself: Definitely. I understand.

michaeldrogalis16:06:57

Serious question though -- does my previous paragraph sound that terrible?

lucasbradstreet16:06:06

@michaeldrogalis: after thinking about it some more, I think we could output some log entries for this properly.

lucasbradstreet16:06:31

The “Our architecture” one?

dspiteself16:06:44

if I had seen "waiting for node to become available" I would have known what to do

michaeldrogalis16:06:45

I'm alright with someone stumbling for a few hours if it's only going to happen once.

lucasbradstreet16:06:56

It happens many times tho, and people do forget

lucasbradstreet16:06:03

I’ve forgotten that it could be the reason

lucasbradstreet16:06:21

and fumble around and then figure it out… which is why I added that validation function

dspiteself16:06:22

it was just once but once for several people on the team

dspiteself16:06:05

maybe a validation function is not necessary if the node allocation logs were clear enough.

lucasbradstreet16:06:15

What I’m thinking is that we output something to timbre under the following conditions:

lucasbradstreet16:06:22

1. When the submit-job entry is played, but the job isn’t scheduled, output a warning.

lucasbradstreet16:06:36

2. When a job is newly scheduled, output an info
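
The two conditions above can be sketched as a pure decision function. This is only an illustration of the proposal, not Onyx code: `scheduling-log-event` and the keys of its argument map are hypothetical, and the actual logging call is left out.

```clojure
(defn scheduling-log-event
  "Returns [level message] for a job state change, or nil when nothing
   should be logged.
   submitted?     - the submit-job log entry was just played
   was-scheduled? - the job had peers before this entry
   scheduled?     - the job has peers after this entry"
  [{:keys [job-id submitted? was-scheduled? scheduled?]}]
  (cond
    ;; 1. Submitted, but the scheduler could not allocate it: warn once.
    (and submitted? (not scheduled?))
    [:warn (str "Job " job-id " was submitted but is not scheduled: not enough free peers")]

    ;; 2. The job went from unscheduled to scheduled: info.
    (and scheduled? (not was-scheduled?))
    [:info (str "Job " job-id " is now scheduled")]

    ;; Any other cluster event (peer joins, leaves, ...) prints nothing.
    :else nil))
```

Because other cluster events return nil, this avoids the flood of repeated warnings raised earlier in the discussion.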

dspiteself16:06:02

that would have been perfect for us

lucasbradstreet16:06:28

It would have been a bit cumbersome before when every peer would’ve printed it, but we have a peer-group now.

michaeldrogalis16:06:46

I feel like a substantial number of people don't even realize Onyx writes to a log file though, which is a problem all on its own

lucasbradstreet16:06:13

This is true.

lucasbradstreet16:06:22

I’m not suggesting we try to solve that one 😛

michaeldrogalis16:06:34

Can someone make an issue for this so I can think about it later? Please keep discussing -- I gotta run though

michaeldrogalis16:06:14

(Or reopen the issue I linked to)

dspiteself16:06:28

agreed, we were looking at our logs

michaeldrogalis17:06:03

@dspiteself: For what it's worth, I use this in all my tests so I never have to think about adjusting the peer count until I go to production: https://github.com/onyx-platform/learn-onyx/blob/master/src/workshop/workshop_utils.clj#L28

dspiteself17:06:21

nice

asolovyov18:06:08

hey all! What happens if a task function returns nil as a segment? Is it then passed as a segment further or is it skipped?

michaeldrogalis18:06:45

@asolovyov: Undefined. Been meaning to raise an exception for that. Functions need to either return a map or a sequence of maps

asolovyov18:06:26

right

asolovyov18:06:43

so if I need to skip, I have to return something and then stop it with flow condition?

dspiteself18:06:00

@asolovyov: you can return an empty sequence of maps

lucasbradstreet18:06:03

Part of me would like to see nil treated as an empty sequence, since map/empty?/etc all do. I’m of two minds about it though

asolovyov18:06:40

@dspiteself: you mean just []? 🙂

dspiteself18:06:16

yea

asolovyov18:06:49

@lucasbradstreet: I would say I'm also not completely sure. On one hand it seems really confusing, on the other it really complements the ability to return lists

lucasbradstreet18:06:56

Yeah, there’s an implicit mapcat when you return a vector of segments. So returning [] means you’re just dropping all the output
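
A small plain-Clojure illustration of that implicit mapcat, not Onyx's actual implementation. `normalize-output`, `run-task` and `keep-even` are hypothetical names for this sketch.

```clojure
(defn normalize-output
  "Wraps a single segment (a map) in a vector; leaves a sequence of
   segments unchanged."
  [result]
  (if (map? result) [result] result))

(defn run-task
  "Applies task function f to every input segment and flattens the
   results, the way mapcat would."
  [f segments]
  (vec (mapcat (comp normalize-output f) segments)))

(defn keep-even
  "Passes segments with an even :n through and drops the rest by
   returning an empty vector."
  [segment]
  (if (even? (:n segment)) segment []))

(run-task keep-even [{:n 1} {:n 2} {:n 3} {:n 4}])
;; => [{:n 2} {:n 4}]
```

In this sketch a nil return would also be flattened away, since mapcat treats nil as an empty sequence. As noted above, Onyx itself leaves nil undefined, so returning [] is the safer way to drop a segment.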

asolovyov18:06:56

so like 'do nothing after that'

asolovyov18:06:09

@dspiteself: thanks! that's a nice hack 🙂
