This page is not created by, affiliated with, or supported by Slack Technologies, Inc.
2015-10-23
Okay, so after some effort I have determined that submitting a job from inside of another job caused the issue
I don't know if that is something you want to support but you should probably either make sure it works or explicitly forbid it in the docs somehow
I do not have a reproducible case for you that isn't our entire app, but when I get a moment I will try to put something together to demonstrate the issue
@spangler: Were you by chance reusing the same ZooKeeper connection from the event map to submit the second job?
Very little actually happens when you submit a job. It writes some data to ZooKeeper storage, so I'm suspicious about the connection being reused and shutdown.
@michaeldrogalis Hmm, quite possible! Which is the event map, and which is the zookeeper connection inside that map?
Here is the map I am passing into submit-job:
{:zookeeper/address "127.0.0.1:2181",
:zookeeper.server/port 2181,
:onyx/id #uuid "134c49a7-2df1-4094-8578-4364501435d0",
:onyx.peer/job-scheduler :onyx.job-scheduler/balanced,
:onyx.messaging/ack-daemon-timeout 60000,
:onyx.messaging/impl :aeron,
:onyx.messaging/bind-addr "localhost",
:onyx.messaging/peer-port-range [40200 40600]}
:onyx.core/log in the event map. Will take a peek later today.
@spangler: Is :onyx/id different between the first and second job you submit?
If so, I see a problem there. :onyx/id strongly partitions deployments to give you multi-tenancy in ZooKeeper. Each :onyx/id tracks which ports are being used by which hosts in Aeron. There would be a port collision if you passed the same values for :peer-port-range in that last parameter to two different instances.
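The port-collision point can be sketched as two peer-configs that differ only in their Aeron port range. This is a sketch, not from the conversation; `base-config` is a hypothetical map holding the shared keys, and the ranges are illustrative:

```clojure
;; Sketch only: two Onyx instances on one box need non-overlapping
;; Aeron port ranges. `base-config` is a hypothetical shared map.
(def peer-config-a
  (assoc base-config :onyx.messaging/peer-port-range [40200 40399]))

(def peer-config-b
  (assoc base-config :onyx.messaging/peer-port-range [40400 40599]))
```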
@michaeldrogalis No, :onyx/id is set at startup
Cool, much appreciated @spangler!
@michaeldrogalis Okay, set up the test project, but getting this mysterious error:
WARNING: /var/folders/2m/0shxv04j04lcm1c0dddyf4qw0000gn/T/aeron-rspangler already exists.
INFO: Aeron directory /var/folders/2m/0shxv04j04lcm1c0dddyf4qw0000gn/T/aeron-rspangler exists
INFO: Aeron CnC file /var/folders/2m/0shxv04j04lcm1c0dddyf4qw0000gn/T/aeron-rspangler/cnc exists
uk.co.real_logic.aeron.driver.exceptions.ActiveDriverException: active driver detected
I see this issue about it, but it says it is fixed: https://github.com/onyx-platform/onyx/issues/273
@spangler: You started 2 Aeron drivers
Aeron restricts you to 1 of those per box
It probably started through start-env if you configured it to use an embedded media driver
You can have two Onyx jars running, you just need to be cognisant of Aeron
My env-config
(def env-config
{:zookeeper/address "127.0.0.1:2181"
:zookeeper/server? false
:zookeeper.server/port 2181
:onyx/id onyx-id})
:onyx.messaging.aeron/embedded-driver? defaults to true.
Embedded Aeron is useful for development since you don't need to manage it yourself. If you're going to bring up two instances of Onyx on the same machine you need to manage it yourself. I usually deploy into Docker so it's not an issue
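A minimal sketch of a peer-config that turns the embedded driver off, using the `:onyx.messaging.aeron/embedded-driver?` key mentioned above; the other values are illustrative, taken from the config shown earlier in the thread:

```clojure
;; Sketch: run against an externally managed Aeron media driver
;; (one driver per box) instead of the embedded one.
(def peer-config
  {:zookeeper/address "127.0.0.1:2181"
   :onyx/id onyx-id
   :onyx.messaging/impl :aeron
   :onyx.messaging/bind-addr "localhost"
   :onyx.messaging/peer-port-range [40200 40600]
   :onyx.messaging.aeron/embedded-driver? false})
```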
Just caught something there that might need a mention. You won't be able to run two onyx jars on the same host if you bind them to the same ip
Even if you use different ports. The short circuiting will go haywire
If you disable short circuiting and make them use different ports it'll be fine
Right, good catch @lucasbradstreet.
@michaeldrogalis Okay, I have a reproducible for you: https://github.com/littlebird/onyx-test
@spangler: Ty. Will let you know
I can give it a look in the morning if you don't get to it tonight
@spangler: One quick thing. You can't reuse a catalog entry more than once in a workflow. Surprised our schema checks don't catch that, I'll add a fix for it. So inner-in and inner-out
https://github.com/littlebird/onyx-test/blob/master/src/onyx_test/core.clj#L12
@michaeldrogalis I don't follow. This example for instance reuses catalog entries: https://github.com/onyx-platform/onyx-examples/blob/0.7.x/multi-output-workflow/src/multi_output_workflow/core.clj
I don't quite follow either
I'm pretty sure that incidentally works, and should not (pretty old example that I forgot to update)... There are several cases where I'm not confident that reusing a keyword in a workflow would be correct.
I'll verify that though.
What do you mean by reusing a keyword?
Specifically which task?
Duplicate left-side usage: https://github.com/onyx-platform/onyx-examples/blob/0.7.x/multi-output-workflow/src/multi_output_workflow/core.clj#L13-L14
This example also uses :out several times:
https://github.com/lbradstreet/onyx-timeline-example/blob/master/src/clj/onyx_timeline_example/onyx/workflow.clj
Do you mean because it's an input task? Isn't that the standard way to flow to multiple tasks?
Hold up, I have a work thing. Back in a bit.
@lucasbradstreet: It would be fine to have 2 catalog entries that differ only by their :onyx/name
np, take your time
To me it looks similar to https://github.com/onyx-platform/onyx/blob/0.7.x/test/onyx/peer/dag_test.clj#L172
Aside from the fact that :in is an input task
no need to respond now
As another example, the onyx-starter project does the same thing: https://github.com/onyx-platform/onyx-starter/blob/0.7.x/src/onyx_starter/workflows/sample_workflow.clj
Yeah, it looks fine to me. We’ll just have to wait to hear back from @michaeldrogalis to clarify
Ah, here was the canonical one I was looking for, right in the docs: https://onyx-platform.gitbooks.io/onyx/content/doc/user-guide/concepts.html
;;;           input
;;;          /     \
;;; processing-1  processing-2
;;;          \     /
;;;           output
[[:input :processing-1]
[:input :processing-2]
[:processing-1 :output]
[:processing-2 :output]]
Just documenting these here. If it IS wrong, then all of these examples are very misleading!
Yeah, don’t worry, I think Mike just got himself a bit confused
You’re starting up a buttload of jobs, is the problem
Try bumping up the number of peers to 40 and run it again
I didn’t quite get the correct output, but it’s closer at least, and it finishes
INFO: Aeron toDriver consumer heartbeat is 45790 ms old
SUBMITTING OUTER
INPUT OUTER SEGMENTS
OUTER! {:outer suffix}
SUBMITTING INNER
INPUT INNER SEGMENT {:inner what}
INPUT INNER SEGMENT {:inner context}
INPUT INNER SEGMENT {:inner is}
INPUT INNER SEGMENT {:inner this?}
BBBBBBBBBBBBB
CCCCCCCCCCCCC
AAAAAAAAAAAAA
CCCCCCCCCCCCC
CCCCCCCCCCCCC
BBBBBBBBBBBBB
AAAAAAAAAAAAA
AAAAAAAAAAAAA
AAAAAAAAAAAAA
BBBBBBBBBBBBB
CCCCCCCCCCCCC
BBBBBBBBBBBBB
OUTER SEGMENTS COMPLETE!
INNER SEGMENTS COMPLETE!
[:done]
Basically you get 1 task running on one peer at a time
So if you only have 11 peers and you start 3 jobs taking 4 tasks each, only two of them will run concurrently
You can start up more peers than your number of cores if they’re generally lying around doing nothing though
Lemme check what scheduler you’re using
The fact that it starts one job over and over again (which is wrong), then never completes makes me feel like something is wrong
Balanced hmm
There definitely may be some other weirdness going on yes
You’re right, you have an 8 task job and a 3 task job so it should be able to run one of the 3 task jobs at a time
Yes, I did
Oh, hmm
Yeah the fact that the outer job is triggered over and over again is bad behavior too, since then it queues up another set of inner jobs
Understood
You may have to deal with preventing duplicate job submissions anyway, because it’s the nature of the onyx retry mechanism
Scheduling the jobs via an onyx job may not be the best approach
If a message isn’t acked all the way to the output, it’ll resend the message again from the input source, i.e. at-least-once messaging
I’m sure it’s possible to do it via Onyx, you’ll just have to think about it a little and make sure you filter out duplicate job submissions
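One possible shape for that duplicate filtering, sketched with an atom of already-seen job keys. `submit-once!` and `job-key` are illustrative names, not Onyx API; only `onyx.api/submit-job` is real:

```clojure
(require '[onyx.api])

;; Sketch: remember which job keys have already been submitted, so an
;; at-least-once retry of the triggering segment doesn't resubmit the job.
(def submitted-jobs (atom #{}))

(defn submit-once!
  "Submit job only if job-key has not been seen before.
  A sketch; there's a small check-then-act race, which is fine here
  since a double submit is what retries would cause anyway."
  [peer-config job job-key]
  (when-not (contains? @submitted-jobs job-key)
    (swap! submitted-jobs conj job-key)
    (onyx.api/submit-job peer-config job)))
```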
Our stateful processing in 0.8 allows you to get pretty close to exactly-once quite easily
@spangler @lucasbradstreet Sorry nevermind, had a brain-fart. There's nothing wrong with that.
2 years later and I can still mess up my own data model, heh
There’s definitely something weird going on with the scheduler
I don’t quite understand it yet
Ah I think I know what’s going on
11 peers
Job with 3 peers on it, job with 8 peers on it
sorry, I mean, job with min 3 peers, job with min 8 peers
balanced task scheduler
Job with 3 peers is running, and it submits a job requiring 8 peers
Hmm, no
Ah, and it’s max-peers, not min-peers
Bingo https://github.com/littlebird/onyx-test/blob/master/src/onyx_test/core.clj#L95
There’s definitely a lot of fun stuff happening with the scheduler I think
core.async input is kinda bad when the peers get switched around too because the data is going to be lost
Right. Also, that line I pointed to is making 4 peer processes try to read from the same channel.
I guess that’s only really a problem in terms of the :done
@michaeldrogalis But that is max peers right?
Yes. But anything > 1 will yield incorrect behavior there. At least, that was true when I first wrote the core.async plugin, which is why the docs specify max-1
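For reference, a core.async input catalog entry pinned to one peer might look like the following. This is a sketch following the 0.7-era core.async plugin conventions; the batch size and doc string are illustrative:

```clojure
;; Sketch: core.async input task limited to a single peer, since the
;; plugin's channel state can't be shared across peers.
{:onyx/name :in
 :onyx/plugin :onyx.plugin.core-async/input
 :onyx/type :input
 :onyx/medium :core.async
 :onyx/batch-size 10
 :onyx/max-peers 1
 :onyx/doc "Reads segments from a core.async channel"}
```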
Mmkay. That's a start at least.
@lucasbradstreet What do you mean "data is going to be lost"?
Added an issue to schema check that
So, the core.async plugin takes from the input channel, puts it in a pending-messages map for replay, and then sends it on
If the peer that’s on that task gets switched to another job then that pending-messages map is lost. Normally it’s not a problem with other plugins like kafka because we can start from the same position and we won’t “complete” it until it’s fully acked
But since core.async is mutable and the data isn’t in the channel you’re kinda screwed. The core.async plugin is only really for testing
Okay, so I should use kafka for this then? We are using kafka for other things, so wouldn't be that bad to switch it
Yeah, since you’re doing some complicated scheduling related stuff it might be better
I don’t really see the downside in just using kafka
Okay, set :onyx/max-peers 1 and it got through the tasks at least... (!) But no data came through?
Yeah, you’re probably going to have to debug that one further
@spangler: Btw, you can add (shutdown-agents) to your clean-up process for a quicker shutdown
@michaeldrogalis Ah thank you for that
@michaeldrogalis: I think there’s something non-ideal going on with the scheduling but it might just be task switching going slow with not many spare peers. I’ll give it a look at another time. I think it should be starting up these jobs much more quickly without any changes
Okay right, this was my other issue with the job. The outer job completes before the inner ones
Ahhh I see what’s going on
You’re blocking in one of your functions, right?
So basically you end up with a lot of peers on these body tasks
just blocking
Waiting for inner jobs to run, but these peers can’t be rescheduled
Because they’re currently blocking
I set the peer count to 15 and get the expected behavior after setting max-peers to 1. I'd not be totally surprised if this is an edge case on the scheduler trying to reallocate with little spare room.
Also, @lucasbradstreet, makes sense. Nice assessment.
That’s why the scheduler seems borked
That explains what I just noted, too. More room to reallocate non-blocked peers.
We just need to point @spangler to looking at the log to see when his jobs are completing
Rather than blocking there
I have to go to bed, if you don’t work through it today talk to me tomorrow, I’ll explain how you could do things differently
@spangler basically when there are no jobs it’s overallocating peers to your body task
But then they get tied up blocking on take-segments!
because they’re blocking and never finish the batch
they never get reallocated
By overallocating I mean it’s just throwing as many peers on it as you have
because there are no other jobs running
so chuck all 11 on that one job
Then all the peers get tied up on the take-segments! body task and never finish their batch
So never get reallocated to your newly submitted job
Then because you never returned from the body function, and the segment isn’t acked
the job will retry the original segment
So the peers get tied up, even though they aren't directly involved in the take-segments!
?
Also makes sense why you'd see "Not enough peers starting..." The scheduler thinks the peer can move, but the peer doesn't because it's blocked.
You’d have had 9 peers on the :body task to start
and 1 on the input, one on the output
because the scheduler will use all the peers that it has
Then all the segments come in to body, and all those peers will do take-segments!
and submit the job
The scheduler will try to reallocate the peers but it does it in between batches
those peers never finish their batches because they’re waiting for take-segments!
and the job never starts
I hear what you are saying, and that explains the behavior, but it still seems undesirable to me... ?
Well, you could set :onyx/max-peers on the :body task
There’s also a percentage job scheduler
that you could use
Blocking in your tasks like you’re doing is probably an antipattern
What do you mean wait on a task they aren’t even running?
That’s not how it works
The scheduler will use as many peers as you have for the jobs that are running. In the balanced scheduler it’ll try to evenly balance them, max-peers/min-peers notwithstanding
When you only have one job running (the outer job)
it’ll throw all of the peers on that job, because everything = balanced for one job
So you’ll end up with almost all your peers on the body task
Okay, I guess I don't understand what the peers are doing then? How can they all be involved in just running one function?
The way onyx works is that you have a task, with a function on it
and other tasks send them segments to be processed
so maybe one :body task gets one segment
another gets another segment
each time they get a segment, the function, in this case :onyx-test.core/body is called on it
But if they never got a segment, they are still allocated, just waiting for a segment?
So say you have 8 peers currently scheduled to the :body task
correct
They don’t deallocate until they finish a full task lifecycle, i.e. process their batch
maybe the batch is empty
because it hit a small timeout
maybe it’s not
Hmm, ok I can kinda see why you’re confused
you’re only putting one segment on
right?
Yeah, ok, maybe there’s something else going on too
Second job can start immediately if there are enough spare peers
including peers already being used on the first job
The only reason I was saying they weren’t being reused was because I thought they were blocking in the body fn
on take-segments!
But looking at it again, you only have one segment
so yeah it should only be one peer
Sorry, I was looking at the inner segments
It’s ok, you got a decent lesson 😉
A little bit, I’m not sure where the fault lies though
Maybe it’s just what @michaeldrogalis said
the max-peers 4 on the input task on the inner job
I think because only one sees the done
How do you know?
You’re using the same output channel for all the inner jobs right?
you’re going to end up with a channel with lots of dones on it
with messages in between
take-segments! will only read up to the done
the first done
Ah, I had a whole discussion with @michaeldrogalis about this, where he assured me that take-segments! does not return until all jobs have finished processing
It does not return until it reads a :done
any done
I did not say that 😛
take-segments is agnostic to anything Onyx related.
Tiny utility fn: https://github.com/onyx-platform/onyx/blob/0.7.x/src/onyx/plugin/core_async.clj#L122
Yeah we mostly use it in testing
Technically it’s right. A done is put on the end
after the job is completed, you don’t see it in the tasks
However, you’re running multiple jobs and reusing the output media
So… all bets are off
If you used a new channel for each job you’d be safe
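A sketch of the fresh-channel-per-job idea. `build-job` here is a hypothetical constructor that wires the channel into the job's output lifecycles; only `async/chan` and `onyx.api/submit-job` are real API:

```clojure
(require '[clojure.core.async :as async]
         '[onyx.api])

(defn run-inner-job!
  "Submit a job with its own dedicated output channel, so its :done
  sentinel can't be confused with another job's. A sketch; the channel
  buffer size is illustrative."
  [peer-config build-job]
  (let [out-chan (async/chan 1000)      ; fresh channel per job
        job      (build-job out-chan)]  ; hypothetical constructor
    (onyx.api/submit-job peer-config job)
    out-chan))
```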
everything is finished != all jobs
Right. I had in mind a single job, there.
Ok, it’s only the outer job channel
in that case
the job is finishing
Another misreading
We’re getting good at this
no I mean
I meant the take-segments in the body fn
You’re using the same channel for all the inner jobs right?
I guess that can kiiiind of work though it’s a bit sketchy
since you read up the done, print, return
reuse the channel again
It’s recommended
I’d probably just make the switch to kafka
So using one output channel for many tasks is always wrong, since you can never tell when it is really done
Yeah, it’s a bad idea
Which example?
Which part is misleading?
And, you shouldn't really refer to it as a DAG, since you never want to merge paths once diverged
I think we’re getting ourselves confused here
All these things are fine
That example is totally fine
Excuse what I said earlier, I was on the phone trying to multitask. Its not incorrect to do that.
You are telling me that if I send :done to something that branches, if it merges again then you can't depend on :done to tell you when it is done
The done never goes to any task
It’s used to decide when to finish the job
and it’s written to the output task if requested
scratch “requested"
and no, it’s just used to decide when all the input tasks are fully processed
and to punctuate on all the output tasks
Refer to what I said in that screen capture.
so when you read from the media you know it’s done… buuuuuuut
because you’re using the same channel in between jobs you’ll get
[segment segment segment :done segment segment segment :done]
take-segments! is just a helper function for our testing
which just reads all the way up to the first done
and returns
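Illustrated with two jobs sharing one channel; the segment values are made up, and the behavior shown is the "reads up to the first :done" semantics described above:

```clojure
;; Suppose two jobs share one output channel, so their results interleave:
;;   channel contents: {:a 1} {:b 2} :done {:c 3} :done
;;
;; take-segments! reads only up to (and including) the FIRST :done:
;;   (take-segments! out-chan) ;=> [{:a 1} {:b 2} :done]
;;
;; The second job's segments and its :done are still sitting on the
;; channel, waiting to confuse the next reader.
```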
The problem is that you’re relying on it for something that it wasn’t intended to be used for
Maybe it should have a better docstring
Docstring looks pretty ok
Normally we wouldn’t ever re-use a channel between multiple tasks
or multiple jobs
so we wouldn’t ever get multiple :dones on a channel
Aren’t you re-using it in the inner job?
Gah I forgot there is only one segment
If you had more than one segment on the input task on the outer job you would hit this problem
You’ll definitely hit this problem later when you submit multiple jobs via multiple segments
I’ll have to discuss that with you another time. it’s 5am here
Singapore
I’ve never tried to submit jobs within onyx jobs, so there are probably a few pitfalls
I think there are probably better approaches
Sure, happy to chat another time
(let [results (take-segments! (:output inner))]
ah nevermind
input outer and output inner get me all mixed up
because inner and input are similar sounding
I'm sure the problem will jump out with a little time. At any rate, it's worth a reminder that Onyx is reaching nearly 15,000 lines of code in total, along with all of its documentation. If something doesn't make sense straight away or there's an error in an example, keep that in mind. I develop Onyx for free in my spare time, so not everything will be 100% smooth all the time.
@michaeldrogalis Of course, and your effort is appreciated. I am usually on the other side of the equation, so when I have a new user who is trying to figure out how to use my code, I take it as an opportunity to learn all the ways in which things appear to an outsider. So many assumptions go into making something like this that you become blind to how it appears to someone trying to figure it out for the first time. In that way feedback from new users is invaluable (if also annoying!)
So, I am offering my perspective in trying to make it work, at least in this use case
Right, it's appreciated. As is your patience. Need to run now, catch you later.
Whatever is going wrong is going wrong in the body function. It’s really weird. Something to do with take-segments and I don’t quite understand it
@lucasbradstreet I thought you were going to bed!
Yeah I know but I hate to be beat
Anyway
For reals this time
If you put a constant segment at the end of the body fn, it’ll never end up on the output channel. If you take out the take-segments, it gets called fine
Yeah, agree with both of those
I’m not convinced there’s a scheduling problem, but I’m definitely interested in looking into that further.
I’m pretty sure segments are ending up on the inner output channel
you may want to read from it manually to debug
You need to figure out why take-segments! is not reading from it properly in body
But in the version I just ran, I get this
OUTER SEGMENTS COMPLETE!
INNER SEGMENTS COMPLETE!
I am willing to acknowledge that I should not submit jobs from inside other jobs and wait for them to complete, but it still means there is some kind of mystery
Yeah which is why I’m spending more time on it
I don’t like strange things 😛
There's probably a 70-80% chance the last one is a bug in your code, but I still prefer to check just in case
me either
Yes, but it’s a bit convoluted
and it could be easy to mix up inner for outer somewhere, etc, not that I’ve found it
alright. really sleep now
catch you, gl
Scumbag brain won't stop thinking about it. So I think the reverse of what we talked about before could be happening
Meh maybe not
Alright, out