#onyx
2015-10-16
gardnervickers00:10:16

replacing in-calls with its version for each job

spangler00:10:27

Maybe this is my misconception about how onyx works then... can't I run the same job multiple times with different inputs?

spangler00:10:39

So, I have a job that calculates some metric on a graph say

spangler00:10:59

I want to run that same job using many different graphs as input

spangler00:10:29

So, I have a graph with 5000 nodes, I submit a job to analyze it

spangler00:10:48

Then, I have another graph with 10000 nodes, I submit a job to analyze that

spangler00:10:13

two different jobs, same workflow and lifecycles etc

spangler00:10:31

Ideally different input/output channels, since maybe they are both running at the same time

spangler00:10:46

I must have this all messed up somehow

gardnervickers00:10:50

If you want to keep their results separate, I believe it is idiomatic to launch two jobs in this case.

noisesmith00:10:01

spangler: sorry to butt in, but why not sort them by task-id on the way back out?

spangler00:10:52

So you are saying, there is only one output channel for a particular workflow, even if I submit multiple jobs for it?

noisesmith00:10:10

that makes sense to me - you define the number of workers reading from that channel

noisesmith00:10:17

the channel itself should not be a bottleneck

gardnervickers00:10:43

You can have multiple output channels, how you route messages internal to that workflow is up to you

noisesmith00:10:47

remember the peers config

gardnervickers00:10:05

If you wanted to do something like route based on an id, there are flow conditions for that.
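
For reference, routing on an id with flow conditions might look roughly like the sketch below. The `:graph-id` key, the `:out-a`/`:out-b` task names, and the `my.app.flow` namespace are hypothetical, and the four-argument predicate signature is how the Onyx docs of this era are recalled to describe it, so check it against the version in use.

```clojure
;; Predicates receive the event map, the segment before the task ran,
;; the segment the task produced, and all segments it produced.
;; :graph-id is a hypothetical key carried on each segment.
(defn graph-a? [event old-segment segment all-new]
  (= :graph-a (:graph-id segment)))

(defn graph-b? [event old-segment segment all-new]
  (= :graph-b (:graph-id segment)))

;; Assumes a workflow such as [[:in :inc] [:inc :out-a] [:inc :out-b]].
;; Flow conditions are plain data; predicates are named by keyword.
(def flow-conditions
  [{:flow/from :inc
    :flow/to [:out-a]
    :flow/predicate :my.app.flow/graph-a?}
   {:flow/from :inc
    :flow/to [:out-b]
    :flow/predicate :my.app.flow/graph-b?}])
```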

gardnervickers00:10:28

Am I thinking of something different than you?

spangler00:10:15

So what if I have different processes for each job?

gardnervickers00:10:42

Do you mean the workflow varied for each kind of work you wanted to do?

spangler00:10:43

If each is reading from the same channel, then one will get messages meant for the other and vice versa

spangler00:10:54

No, the workflow is the same

gardnervickers00:10:14

You can have two different channels as input in your workflow

spangler00:10:27

That would be awesome

spangler00:10:36

How do I provide different channels to the workflow?

gardnervickers00:10:45

something like this

[[:in1 :inc]
[:in2 :inc]
[:inc :out]]

spangler00:10:45

Right now it seems I need to def them

spangler00:10:24

No, I have an undetermined number of jobs that will run with this same workflow

spangler00:10:29

not just two

spangler00:10:54

Maybe 40, maybe 4000

spangler00:10:25

I think there is some misconception here on what a "job" is perhaps

gardnervickers00:10:48

A job in the docs is an entire work-unit. Lifecycle+Catalog+Workflow

spangler00:10:48

So for example, I have a network I want to calculate betweenness on

spangler00:10:06

I submit the job, send the network info in on the input channel, and read the result from the output channel

spangler00:10:04

But maybe I submit another job at the same time, from a different process

noisesmith00:10:11

I think output channels are only for things you would combine and / or reduce over

spangler00:10:19

how do I keep the wires from being crossed here?

noisesmith00:10:19

they aren't for specific task results

gardnervickers00:10:49

You would submit separate jobs, especially in this case as it (I assume) involves aggregation.

gardnervickers00:10:23

You can get around this, but I don’t see the benefit

spangler00:10:56

@gardnervickers Right, that is what I want, to literally call (submit-job) from two different processes, put input on two different channels and read output from two different channels

gardnervickers00:10:16

Yea, I’m sorry, was that what you were asking about all along?

gardnervickers00:10:38

Shoot, didn’t pick up on that for some reason haha

spangler00:10:12

The thing is, I don't see how to provide different input channels during submit job, since the lifecycle map refers to def'd functions as keywords

gardnervickers00:10:40

hold on I have a snippet of code from a while ago doing this

spangler00:10:11

Cooooool thank you

gardnervickers00:10:31

As long as you have a reference to it somewhere it shouldn’t be GC’d, so you could throw a bunch of channels in a map

gardnervickers00:10:43

Having some trouble finding this code

spangler00:10:53

I thought of that, having a map of channels in an atom or something?

gardnervickers00:10:15

Yea I believe the Leiningen template does something similar

gardnervickers00:10:21

stores them by GUID

gardnervickers00:10:57

they actually have a really good setup in there for this kind of stuff

spangler00:10:00

Hmm.... then I can clean them up when the task is complete possibly?

spangler00:10:12

I saw that, they memoize the function based on the id
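
A minimal sketch of the memoize-by-id idea mentioned here, assuming core.async is on the classpath. The function names and the `:core.async/id` lifecycle key are assumptions modeled on what the template is described as doing, not a copy of its code.

```clojure
(require '[clojure.core.async :as async])

;; Memoizing on the id means every call with the same id returns the same
;; channel object, while a new id gets a fresh channel.
(def get-input-channel
  (memoize (fn [id] (async/chan 100))))

;; Hypothetical lifecycle hook: the lifecycle entry is passed as the second
;; argument, so an id stored in that map (assumed key :core.async/id)
;; selects the channel belonging to this job.
(defn inject-in-ch [event lifecycle]
  {:core.async/chan (get-input-channel (:core.async/id lifecycle))})
```

The memoize cache holds every channel for the life of the process, which is the trade-off against an explicit map discussed in the next few messages.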

gardnervickers00:10:16

yea that’s it

gardnervickers00:10:33

Or you can just leave them hanging around, they’re so lightweight

spangler00:10:39

So that is the standard way to do it then?

gardnervickers00:10:41

dissoc should toss them to the GC though

gardnervickers00:10:06

I’m not sure about standard but that’s the approach I took

spangler00:10:12

Well, how do you dissoc a memoized function? I mean, memoizing a function is basically using a map behind the scenes really

spangler00:10:21

I would rather just use an explicit map in that case

gardnervickers00:10:25

Oh I meant just dissoc from a hash map

gardnervickers00:10:55

but don’t even worry about that, chans use nothing in terms of resources

gardnervickers00:10:18

especially if they’re closed (which you should do when you finish a particular graph processing)

spangler00:10:40

So, every time I submit a job, assoc the chan into a map, do the processing, close the channel?

spangler00:10:43

Then read from that channel?
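
The assoc, process, close, read sequence above could be sketched with an explicit map in an atom, assuming core.async is available. `channels`, `channel-for!`, and `release-channel!` are hypothetical names, and the segment put on the channel is a placeholder.

```clojure
(require '[clojure.core.async :as async])

;; Registry of job id -> channel, kept in an atom so any thread can use it.
(defonce channels (atom {}))

(defn channel-for!
  "Returns the channel registered under id, creating it on first use."
  [id]
  (get (swap! channels
              (fn [m]
                (if (contains? m id)
                  m
                  (assoc m id (async/chan 100)))))
       id))

(defn release-channel!
  "Closes the channel for id and removes it from the registry."
  [id]
  (when-let [ch (get @channels id)]
    (async/close! ch)
    (swap! channels dissoc id))
  nil)

;; Usage: one id per submitted job.
(def job-id (java.util.UUID/randomUUID))
(def results (channel-for! job-id))
(async/>!! results {:placeholder :segment})
(release-channel! job-id)
;; Values buffered before close! can still be taken afterwards.
(async/<!! results)
```

Closing before reading is safe for buffered values, since a closed channel keeps delivering what it already holds and only then returns nil.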

gardnervickers00:10:01

Yea you can have that be part of the lifecycle

gardnervickers00:10:12

to close the channel after :done or something

gardnervickers00:10:20

if they still use :done 😕

gardnervickers00:10:23

it’s been a while

spangler00:10:29

They do, I just checked : )

spangler00:10:43

Cool thanks!

spangler00:10:50

implementing now....

gardnervickers00:10:51

Honestly the data-driven approach makes this kind of stuff easy to do programmatically

gardnervickers00:10:06

Also, the Gitter chat room is far more likely to get a response in my experience, Mike and Lucas are on there reliably

spangler00:10:35

@gardnervickers Ah, I was trying to avoid having another communication channel open : P

spangler00:10:58

Trying to figure out how to use your tool here ; )

michaeldrogalis00:10:06

@spangler: The core.async I/O plugins are pretty much only useful for dev. Have you seen the application template? We have a few idioms in there to use multiple channels.

spangler00:10:06

@michaeldrogalis I don't know if you have read through the entire history here... are you referring to the memoizing the channel function based on id?

michaeldrogalis00:10:28

Yeah. I read through some of it. Still stuck, should I reread it?

spangler00:10:18

I am just trying to figure out how to submit multiple jobs using the same workflow

spangler00:10:31

Basically I want each job to have its own set of input and output channels

spangler00:10:22

But in all the examples I have seen, the lifecycle refers by keyword to the functions that inject the channels

gardnervickers00:10:24

Storing them in a map is functionally equivalent to just making a bunch of (def a (chan)), (def b (chan)) but way easier to manipulate with code.

spangler00:10:53

Yes, that seems like what I want

gardnervickers00:10:54

One for the cookbook/faq/tips&tricks 😉

spangler00:10:07

It would be awesome if I could pass the channels I need into the call to submit-job somehow... or the call to submit-job returned the channels I need. But I can basically create an abstraction that does this

spangler00:10:16

using the map-inside-an-atom approach

gardnervickers00:10:27

Yea that sounds interesting

michaeldrogalis00:10:59

A quick explanation of how things work might help: -everything- you pass into onyx.api/submit-job is data, and is strictly not shared across jobs. You can reuse those things freely, and they need not exist on the peers at compile time. Lifecycles are a construct that helps you create side effects. You're feeling stuck because you're sharing the "recipe" for creating core.async channels, but the actual implementation of how channel access works is where you want to focus. I think you have what you're looking for, but that's another way to phrase it.

michaeldrogalis00:10:26

Everything that goes into submit-job must be serializable, we drop it into ZooKeeper. Keep learning to use lifecycles, they're what you want.
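
To illustrate the point that the job is only data: a sketch of building per-job lifecycle entries that carry a unique, serializable id, while the keywords name functions that live on the peers. The `my.app.lifecycles` namespace and the `:core.async/id` key are assumptions; `:onyx.plugin.core-async/reader-calls` and `writer-calls` are the plugin's own lifecycle calls.

```clojure
(defn job-lifecycles
  "Builds the lifecycle data for one job. Only keywords and the id are
  stored, so the result can be serialized; no channel appears in it."
  [job-id]
  [{:lifecycle/task :in
    :lifecycle/calls :my.app.lifecycles/in-calls    ; hypothetical namespace
    :core.async/id job-id}
   {:lifecycle/task :in
    :lifecycle/calls :onyx.plugin.core-async/reader-calls}
   {:lifecycle/task :out
    :lifecycle/calls :my.app.lifecycles/out-calls   ; hypothetical namespace
    :core.async/id job-id}
   {:lifecycle/task :out
    :lifecycle/calls :onyx.plugin.core-async/writer-calls}])

;; Two jobs share the same workflow and catalog but get distinct ids,
;; which the injecting functions can use to look up distinct channels.
(def lifecycles-a (job-lifecycles (java.util.UUID/randomUUID)))
(def lifecycles-b (job-lifecycles (java.util.UUID/randomUUID)))
```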

spangler00:10:48

Right, that makes sense

michaeldrogalis00:10:35

Indeed, @gardnervickers, need to write that cookbook sometime.

spangler00:10:55

So if I need channels for each job, I need to refer to a function that generates them

spangler00:10:47

I wonder, is this how onyx was intended to be used? Or should I be decomposing my problem another way?

michaeldrogalis00:10:25

That's a normal usage pattern.

gardnervickers00:10:27

The 'new-years resolution' of the software world, empty git repo https://github.com/gardnervickers/onyx_cookbook

gardnervickers00:10:35

now’s as good a time as any to transfer my notes :)

spangler00:10:09

Well, thanks for your help @michaeldrogalis and @gardnervickers. I will probably be asking more questions in the future : )

gardnervickers00:10:16

@spangler: always ask, I love working on this stuff.

michaeldrogalis15:10:35

@shaunxcode: There's a patch in develop that implements the grouping by window behavior that we discussed.

michaeldrogalis15:10:07

Haven't documented it yet - basically if you use :onyx/group-by-key or :onyx/group-by-fn, window state is maintained as a map rather than a scalar.
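
A plain-Clojure illustration of "a map rather than a scalar", not Onyx's windowing code. The segments and the `:user`/`:n` keys are made up for the example.

```clojure
;; Made-up segments for illustration.
(def segments [{:user :a :n 1} {:user :b :n 5} {:user :a :n 2}])

;; Without grouping, a sum aggregation keeps one scalar for the window.
(def scalar-state
  (reduce (fn [acc seg] (+ acc (:n seg))) 0 segments))
;; => 8

;; With grouping by :user, the same aggregation keeps one value per key,
;; so the window state becomes a map.
(def grouped-state
  (reduce (fn [acc seg] (update acc (:user seg) (fnil + 0) (:n seg)))
          {}
          segments))
;; => {:a 3, :b 5}
```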

parentheian17:10:12

awesome, looking at it now!

cddr22:10:16

Hey @ericlavigne! How's it going?

ericlavigne22:10:06

@cddr Andy? :) Having a great time. Trying to choose a scalable database at the moment.