This page is not created by, affiliated with, or supported by Slack Technologies, Inc.
2015-10-23
Okay, so after some effort I have determined that submitting a job from inside of another job caused the issue
I don't know if that is something you want to support but you should probably either make sure it works or explicitly forbid it in the docs somehow
I do not have a reproducible case for you that isn't our entire app, but when I get a moment I will try to put something together to demonstrate the issue
@spangler: Were you by chance reusing the same ZooKeeper connection from the event map to submit the second job?
Very little actually happens when you submit a job. It writes some data to ZooKeeper storage, so I'm suspicious about the connection being reused and shutdown.
@michaeldrogalis Hmm, quite possible! Which is the event map, and which is the zookeeper connection inside that map?
Here is the map I am passing into submit-job:
{:zookeeper/address "127.0.0.1:2181",
:zookeeper.server/port 2181,
:onyx/id #uuid "134c49a7-2df1-4094-8578-4364501435d0",
:onyx.peer/job-scheduler :onyx.job-scheduler/balanced,
:onyx.messaging/ack-daemon-timeout 60000,
:onyx.messaging/impl :aeron,
:onyx.messaging/bind-addr "localhost",
:onyx.messaging/peer-port-range [40200 40600]}
:onyx.core/log in the event map. Will take a peek later today.
@spangler: Is :onyx/id different between the first and second job you submit?
If so, I see a problem there. :onyx/id strongly partitions deployments to give you multi-tenancy in ZooKeeper. Each :onyx/id tracks which ports are being used by which hosts in Aeron. There would be a port collision if you passed the same values for :peer-port-range in that last parameter to two different instances.
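The port-collision point can be sketched as two peer-configs that differ only in their Aeron port range. This is a sketch, not from the conversation; `base-config` is a hypothetical map holding the shared keys, and the ranges are illustrative:

```clojure
;; Sketch only: two Onyx instances on one box need non-overlapping
;; Aeron port ranges. `base-config` is a hypothetical shared map.
(def peer-config-a
  (assoc base-config :onyx.messaging/peer-port-range [40200 40399]))

(def peer-config-b
  (assoc base-config :onyx.messaging/peer-port-range [40400 40599]))
```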
@michaeldrogalis No, :onyx/id is set at startup
Cool, much appreciated @spangler!
@michaeldrogalis Okay, set up the test project, but getting this mysterious error:
WARNING: /var/folders/2m/0shxv04j04lcm1c0dddyf4qw0000gn/T/aeron-rspangler already exists.
INFO: Aeron directory /var/folders/2m/0shxv04j04lcm1c0dddyf4qw0000gn/T/aeron-rspangler exists
INFO: Aeron CnC file /var/folders/2m/0shxv04j04lcm1c0dddyf4qw0000gn/T/aeron-rspangler/cnc exists
uk.co.real_logic.aeron.driver.exceptions.ActiveDriverException: active driver detected
I see this issue about it, but it says it is fixed: https://github.com/onyx-platform/onyx/issues/273
@spangler: You started 2 Aeron drivers
Aeron restricts you to 1 of those per box
It probably started through start-env if you configured it to use an embedded media driver
You can have two Onyx jars running, you just need to be cognisant of Aeron
My env-config
(def env-config
{:zookeeper/address "127.0.0.1:2181"
:zookeeper/server? false
:zookeeper.server/port 2181
:onyx/id onyx-id})
:onyx.messaging.aeron/embedded-driver? defaults to true.
Embedded Aeron is useful for development since you don't need to manage it yourself. If you're going to bring up two instances of Onyx on the same machine you need to manage it yourself. I usually deploy into Docker so it's not an issue
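A minimal sketch of a peer-config that turns the embedded driver off, using the `:onyx.messaging.aeron/embedded-driver?` key mentioned above; the other values are illustrative, taken from the config shown earlier in the thread:

```clojure
;; Sketch: run against an externally managed Aeron media driver
;; (one driver per box) instead of the embedded one.
(def peer-config
  {:zookeeper/address "127.0.0.1:2181"
   :onyx/id onyx-id
   :onyx.messaging/impl :aeron
   :onyx.messaging/bind-addr "localhost"
   :onyx.messaging/peer-port-range [40200 40600]
   :onyx.messaging.aeron/embedded-driver? false})
```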
Just caught something there that might need a mention. You won't be able to run two onyx jars on the same host if you bind them to the same ip
Even if you use different ports. The short circuiting will go haywire
If you disable short circuiting and make them use different ports it'll be fine
Right, good catch @lucasbradstreet.
@michaeldrogalis Okay, I have a reproducible for you: https://github.com/littlebird/onyx-test
@spangler: Ty. Will let you know
I can give it a look in the morning if you don't get to it tonight
@spangler: One quick thing. You can't reuse a catalog entry more than once in a workflow. Surprised our schema checks don't catch that, I'll add a fix for it. So inner-in and inner-out
https://github.com/littlebird/onyx-test/blob/master/src/onyx_test/core.clj#L12
@michaeldrogalis I don't follow. This example for instance reuses catalog entries: https://github.com/onyx-platform/onyx-examples/blob/0.7.x/multi-output-workflow/src/multi_output_workflow/core.clj
I don't quite follow either
I'm pretty sure that incidentally works, and should not (pretty old example that I forgot to update)... There are several cases where I'm not confident that reusing a keyword in a workflow would be correct.
I'll verify that though.
What do you mean by reusing a keyword?
Specifically which task?
Duplicate left-side usage: https://github.com/onyx-platform/onyx-examples/blob/0.7.x/multi-output-workflow/src/multi_output_workflow/core.clj#L13-L14
This example also uses :out several times:
https://github.com/lbradstreet/onyx-timeline-example/blob/master/src/clj/onyx_timeline_example/onyx/workflow.clj
Do you mean because it's an input task? Isn't that the standard way to flow to multiple tasks?
Hold up, I have a work thing. Back in a bit.
@lucasbradstreet: It would be fine to have 2 catalog entries that differ only by their :onyx/name
np, take your time
To me it looks similar to https://github.com/onyx-platform/onyx/blob/0.7.x/test/onyx/peer/dag_test.clj#L172
Aside from the fact that :in is an input task
no need to respond now
As another example, the onyx-starter project does the same thing: https://github.com/onyx-platform/onyx-starter/blob/0.7.x/src/onyx_starter/workflows/sample_workflow.clj
Yeah, it looks fine to me. We’ll just have to wait to hear back from @michaeldrogalis to clarify
Ah, here was the canonical one I was looking for, right in the docs: https://onyx-platform.gitbooks.io/onyx/content/doc/user-guide/concepts.html
;;;           input
;;;          /     \
;;; processing-1  processing-2
;;;          \     /
;;;           output
[[:input :processing-1]
[:input :processing-2]
[:processing-1 :output]
[:processing-2 :output]]
Just documenting these here. If it IS wrong, then all of these examples are very misleading!
Yeah, don’t worry, I think Mike just got himself a bit confused
You’re starting up a buttload of jobs, is the problem
Try bumping up the number of peers to 40 and run it again
I didn’t quite get the correct output, but it’s closer at least, and it finishes
INFO: Aeron toDriver consumer heartbeat is 45790 ms old
SUBMITTING OUTER
INPUT OUTER SEGMENTS
OUTER! {:outer suffix}
SUBMITTING INNER
INPUT INNER SEGMENT {:inner what}
INPUT INNER SEGMENT {:inner context}
INPUT INNER SEGMENT {:inner is}
INPUT INNER SEGMENT {:inner this?}
BBBBBBBBBBBBB
CCCCCCCCCCCCC
AAAAAAAAAAAAA
CCCCCCCCCCCCC
CCCCCCCCCCCCC
BBBBBBBBBBBBB
AAAAAAAAAAAAA
AAAAAAAAAAAAA
AAAAAAAAAAAAA
BBBBBBBBBBBBB
CCCCCCCCCCCCC
BBBBBBBBBBBBB
OUTER SEGMENTS COMPLETE!
INNER SEGMENTS COMPLETE!
[:done]
Basically you get 1 task running on one peer at a time
So if you only have 11 peers and you start 3 jobs taking 4 tasks each, only two of them will run concurrently
You can start up more peers than your number of cores if they’re generally lying around doing nothing though
Lemme check what scheduler you’re using
The fact that it starts one job over and over again (which is wrong), then never completes makes me feel like something is wrong
Balanced hmm
There definitely may be some other weirdness going on yes
You’re right, you have an 8 task job and a 3 task job so it should be able to run one of the 3 task jobs at a time
Yes, I did
Oh, hmm
Yeah the fact that the outer job is triggered over and over again is bad behavior too, since then it queues up another set of inner jobs
Understood
You may have to deal with preventing duplicate job submissions anyway, because it’s the nature of the onyx retry mechanism
Scheduling the jobs via an onyx job may not be the best approach
If a message isn’t acked all the way to the output, it’ll resend the message again from the input source, i.e. at-least-once messaging
I’m sure it’s possible to do it via Onyx, you’ll just have to think about it a little and make sure you filter out duplicate job submissions
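One possible shape for that duplicate filtering, sketched with an atom of already-seen job keys. `submit-once!` and `job-key` are illustrative names, not Onyx API; only `onyx.api/submit-job` is real:

```clojure
(require '[onyx.api])

;; Sketch: remember which job keys have already been submitted, so an
;; at-least-once retry of the triggering segment doesn't resubmit the job.
(def submitted-jobs (atom #{}))

(defn submit-once!
  "Submit job only if job-key has not been seen before.
  A sketch; there's a small check-then-act race, which is fine here
  since a double submit is what retries would cause anyway."
  [peer-config job job-key]
  (when-not (contains? @submitted-jobs job-key)
    (swap! submitted-jobs conj job-key)
    (onyx.api/submit-job peer-config job)))
```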
Our stateful processing in 0.8 allows you to get pretty close to exactly-once quite easily
@spangler @lucasbradstreet Sorry nevermind, had a brain-fart. There's nothing wrong with that.
2 years later and I can still mess up my own data model, heh
There’s definitely something weird going on with the scheduler
I don’t quite understand it yet
Ah I think I know what’s going on
11 peers
Job with 3 peers on it, job with 8 peers on it
sorry, I mean, job with min 3 peers, job with min 8 peers
balanced task scheduler
Job with 3 peers is running, and it submits a job requiring 8 peers
Hmm, no
Ah, and it’s max-peers, not min-peers
Bingo https://github.com/littlebird/onyx-test/blob/master/src/onyx_test/core.clj#L95
There’s definitely a lot of fun stuff happening with the scheduler I think
core.async input is kinda bad when the peers get switched around too because the data is going to be lost
Right. Also, that line I pointed to is making 4 peer processes try to read from the same channel.
I guess that’s only really a problem in terms of the :done
@michaeldrogalis But that is max peers right?
Yes. But anything > 1 will yield incorrect behavior there. At least, that was true when I first wrote the core.async plugin, which is why the docs specify max-1
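For reference, a core.async input catalog entry pinned to one peer might look like the following. This is a sketch following the 0.7-era core.async plugin conventions; the batch size and doc string are illustrative:

```clojure
;; Sketch: core.async input task limited to a single peer, since the
;; plugin's channel state can't be shared across peers.
{:onyx/name :in
 :onyx/plugin :onyx.plugin.core-async/input
 :onyx/type :input
 :onyx/medium :core.async
 :onyx/batch-size 10
 :onyx/max-peers 1
 :onyx/doc "Reads segments from a core.async channel"}
```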
Mmkay. That's a start at least.
@lucasbradstreet What do you mean "data is going to be lost"?
Added an issue to schema check that
So, the core.async plugin takes from the input channel, puts it in a pending-messages map for replay, and then sends it on
If the peer that’s on that task gets switched to another job then that pending-messages map is lost. Normally it’s not a problem with other plugins like kafka because we can start from the same position and we won’t “complete” it until it’s fully acked
But since core.async is mutable and the data isn’t in the channel you’re kinda screwed. The core.async plugin is only really for testing
Okay, so I should use kafka for this then? We are using kafka for other things, so wouldn't be that bad to switch it
Yeah, since you’re doing some complicated scheduling related stuff it might be better
I don’t really see the downside in just using kafka
Okay, set :onyx/max-peers 1 and it got through the tasks at least... (!) But no data came through?
Yeah, you’re probably going to have to debug that one further
@spangler: Btw, you can add (shutdown-agents) to your clean-up process for a quicker shutdown
@michaeldrogalis Ah thank you for that
@michaeldrogalis: I think there’s something non-ideal going on with the scheduling but it might just be task switching going slow with not many spare peers. I’ll give it a look at another time. I think it should be starting up these jobs much more quickly without any changes
Okay right, this was my other issue with the job. The outer job completes before the inner ones
Ahhh I see what’s going on
You’re blocking in one of your functions, right?
So basically you end up with a lot of peers on these body tasks
just blocking
Waiting for inner jobs to run, but these peers can’t be rescheduled
Because they’re currently blocking
I set the peer count to 15 and get the expected behavior after setting max-peers to 1. I'd not be totally surprised if this is an edge case on the scheduler trying to reallocate with little spare room.
Also, @lucasbradstreet, makes sense. Nice assessment.
That’s why the scheduler seems borked
That explains what I just noted, too. More room to reallocate non-blocked peers.
We just need to point @spangler to looking at the log to see when his jobs are completing
Rather than blocking there
I have to go to bed, if you don’t work through it today talk to me tomorrow, I’ll explain how you could do things differently
@spangler basically when there are no jobs it’s overallocating peers to your body task
But then they get tied up blocking on take-segments!
because they’re blocking and never finish the batch
they never get reallocated
By overallocating I mean it’s just throwing as many peers on it as you have
because there are no other jobs running
so chuck all 11 on that one job
Then all the peers get tied up on the take-segments! body task and never finish their batch
So never get reallocated to your newly submitted job
Then because you never returned from the body function, and the segment isn’t acked
the job will retry the original segment
So the peers get tied up, even though they aren't directly involved in the take-segments!
?
Also makes sense why you'd see "Not enough peers starting..." The scheduler thinks the peer can move, but the peer doesn't because it's blocked.
You’d have had 9 peers on the :body task to start
and 1 on the input, one on the output
because the scheduler will use all the peers that it has
Then all the segments come in to body, and all those peers will do take-segments!
and submit the job
The scheduler will try to reallocate the peers but it does it in between batches
those peers never finish their batches because they’re waiting for take-segments!
and the job never starts
I hear what you are saying, and that explains the behavior, but it still seems undesirable to me... ?
Well, you could set :onyx/max-peers on the :body task
There’s also a percentage job scheduler
that you could use
Blocking in your tasks like you’re doing is probably an antipattern
What do you mean wait on a task they aren’t even running?
That’s not how it works
The scheduler will use as many peers as you have for the jobs that are running. In the balanced scheduler it’ll try to evenly balance them, max-peers/min-peers notwithstanding
When you only have one job running (the outer job)
it’ll throw all of the peers on that job, because everything = balanced for one job
So you’ll end up with almost all your peers on the body task
Okay, I guess I don't understand what the peers are doing then? How can they all be involved in just running one function?
The way onyx works is that you have a task, with a function on it
and other tasks send them segments to be processed
so maybe one :body task gets one segment
another gets another segment
each time they get a segment, the function, in this case :onyx-test.core/body is called on it
But if they never got a segment, they are still allocated, just waiting for a segment?
So say you have 8 peers currently scheduled to the :body task
correct
They don’t deallocate until they finish a full task lifecycle, i.e. process their batch
maybe the batch is empty
because it hit a small timeout
maybe it’s not
Hmm, ok I can kinda see why you’re confused
you’re only putting one segment on
right?
Yeah, ok, maybe there’s something else going on too
Second job can start immediately if there are enough spare peers
including peers already being used on the first job
The only reason I was saying they weren’t being reused was because I thought they were blocking in the body fn
on take-segments!
But looking at it again, you only have one segment
so yeah it should only be one peer
Sorry, I was looking at the inner segments
It’s ok, you got a decent lesson 😉
A little bit, I’m not sure where the fault lies though
Maybe it’s just what @michaeldrogalis said
the max-peers 4 on the input task on the inner job
I think because only one sees the done
How do you know?
You’re using the same output channel for all the inner jobs right?
you’re going to end up with a channel with lots of dones on it
with messages in between
take-segments! will only read up to the done
the first done
Ah, I had a whole discussion with @michaeldrogalis about this, where he assured me that take-segments! does not return until all jobs have finished processing
It does not return until it reads a :done
any done
I did not say that 😛
take-segments is agnostic to anything Onyx related.
Tiny utility fn: https://github.com/onyx-platform/onyx/blob/0.7.x/src/onyx/plugin/core_async.clj#L122
Yeah we mostly use it in testing
Technically it’s right. A done is put on the end
after the job is completed, you don’t see it in the tasks
However, you’re running multiple jobs and reusing the output media
So… all bets are off
If you used a new channel for each job you’d be safe
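A sketch of the fresh-channel-per-job idea. `build-job` here is a hypothetical constructor that wires the channel into the job's output lifecycles; only `async/chan` and `onyx.api/submit-job` are real API:

```clojure
(require '[clojure.core.async :as async]
         '[onyx.api])

(defn run-inner-job!
  "Submit a job with its own dedicated output channel, so its :done
  sentinel can't be confused with another job's. A sketch; the channel
  buffer size is illustrative."
  [peer-config build-job]
  (let [out-chan (async/chan 1000)      ; fresh channel per job
        job      (build-job out-chan)]  ; hypothetical constructor
    (onyx.api/submit-job peer-config job)
    out-chan))
```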
everything is finished != all jobs
Right. I had in mind a single job, there.
Ok, it’s only the outer job channel
in that case
the job is finishing
Another misreading
We’re getting good at this
no I mean
I meant the take-segments in the body fn
You’re using the same channel for all the inner jobs right?
I guess that can kiiiind of work though it’s a bit sketchy
since you read up the done, print, return
reuse the channel again
It’s recommended
I’d probably just make the switch to kafka
So using one output channel for many tasks is always wrong, since you can never tell when it is really done
Yeah, it’s a bad idea
Which example?
Which part is misleading?
And, you shouldn't really refer to it as a DAG, since you never want to merge paths once diverged
I think we’re getting ourselves confused here
All these things are fine
That example is totally fine
Excuse what I said earlier, I was on the phone trying to multitask. Its not incorrect to do that.
You are telling me that if I send :done to something that branches, if it merges again then you can't depend on :done to tell you when it is done
The done never goes to any task
It’s used to decide when to finish the job
and it’s written to the output task if requested
scratch “requested"
and no, it’s just used to decide when all the input tasks are fully processed
and to punctuate on all the output tasks
Refer to what I said in that screen capture.
so when you read from the media you know it’s done… buuuuuuut
because you’re using the same channel in between jobs you’ll get
[segment segment segment :done segment segment segment :done]
take-segments! is just a helper function for our testing
which just reads all the way up to the first done
and returns
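Illustrated with two jobs sharing one channel; the segment values are made up, and the behavior shown is the "reads up to the first :done" semantics described above:

```clojure
;; Suppose two jobs share one output channel, so their results interleave:
;;   channel contents: {:a 1} {:b 2} :done {:c 3} :done
;;
;; take-segments! reads only up to (and including) the FIRST :done:
;;   (take-segments! out-chan) ;=> [{:a 1} {:b 2} :done]
;;
;; The second job's segments and its :done are still sitting on the
;; channel, waiting to confuse the next reader.
```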
The problem is that you’re relying on it for something that it wasn’t intended to be used for
Maybe it should have a better docstring
Docstring looks pretty ok
Normally we wouldn’t ever re-use a channel between multiple tasks
or multiple jobs
so we wouldn’t ever get multiple :dones on a channel
Aren’t you re-using it in the inner job?
Gah I forgot there is only one segment
If you had more than one segment on the input task on the outer job you would hit this problem
You’ll definitely hit this problem later when you submit multiple jobs via multiple segments
I’ll have to discuss that with you another time. it’s 5am here
Singapore
I’ve never tried to submit jobs within onyx jobs, so there are probably a few pitfalls
I think there are probably better approaches
Sure, happy to chat another time
(let [results (take-segments! (:output inner))]
ah nevermind
input outer and output inner get me all mixed up
because inner and input are similar sounding
I'm sure the problem will jump out with a little time. At any rate, it's worth a reminder that Onyx is reaching nearly 15,000 lines of code in total, along with all of its documentation. If something doesn't make sense straight away or there's an error in an example, keep that in mind. I develop Onyx for free in my spare time, so not everything will be 100% smooth all the time.
@michaeldrogalis Of course, and your effort is appreciated. I am usually on the other side of the equation, so when I have a new user who is trying to figure out how to use my code, I take it as an opportunity to learn all the ways in which things appear to an outsider. So many assumptions go into making something like this that you become blind to how it appears to someone trying to figure it out for the first time. In that way feedback from new users is invaluable (if also annoying!)
So, I am offering my perspective in trying to make it work, at least in this use case
Right, it's appreciated. As is your patience. Need to run now, catch you later.
Whatever is going wrong is going wrong in the body function. It’s really weird. Something to do with take-segments and I don’t quite understand it
@lucasbradstreet I thought you were going to bed!
Yeah I know but I hate to be beat
Anyway
For reals this time
If you put a constant segment at the end of the body fn, it’ll never end up on the output channel. If you take out the take-segments, it gets called fine
Yeah, agree with both of those
I’m not convinced there’s a scheduling problem, but I’m definitely interested in looking into that further.
I’m pretty sure segments are ending up on the inner output channel
you may want to read from it manually to debug
You need to figure out why take-segments! is not reading from it properly in body
But in the version I just ran, I get this
OUTER SEGMENTS COMPLETE!
INNER SEGMENTS COMPLETE!
I am willing to acknowledge that I should not submit jobs from inside other jobs and wait for them to complete, but it still means there is some kind of mystery
Yeah which is why I’m spending more time on it
I don’t like strange things 😛
There's probably a 70-80% chance the last one is a bug in your code, but I still prefer to check just in case
me either
Yes, but it’s a bit convoluted
and it could be easy to mix up inner for outer somewhere, etc, not that I’ve found it
alright. really sleep now
catch you, gl
Scumbag brain won't stop thinking about it. So I think the reverse of what we talked about before could be happening
Meh maybe not
Alright, out