
@michaeldrogalis Thank you for the pointers. Yes, they run consecutively. Sounds like we should consider external stores (Kafka, Redis, a DB, etc.) rather than serializing large values


@aaron51 Yup! You have plenty of choices there.


Will the onyx-seq plugin ensure that the peers coordinate on the input segments rather than processing a unique segment n times (n being the number of peers assigned to that input task)? In other words, if I'm passing a list of segments to onyx-seq 3 times during task definition on 3 peers, will the peers coordinate to see which of them has seen a given segment?


It won’t, though assuming you’re just passing it a static list, one input peer would likely be enough to distribute the work downstream. It’d be pretty easy to update onyx-seq to partition the work based on the number of peers, but I haven’t seen a reason to yet


Alternatively, if you inject the data via a lifecycle, you can partition it based on the slot id and the number of peers


What if there's a lot of data to be put into the seq? Would it still work to put all of it in the lifecycle? I had the understanding that putting data in the lifecycle might have the unintended consequence of making your barriers really big


Yes, so what I meant by that is that you have a before-task-start fn that injects, say, (map (fn [i] {:n i}) (range 10000))


but instead you check :onyx.core/slot-id inside that fn, and based on the slot, you return a differently partitioned range for each peer


that way each peer gets a different part of your data set
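To make the suggestion above concrete, here is a minimal sketch of a before-task-start lifecycle fn that partitions a static data set by the peer's slot id. It assumes a fixed peer count (n-peers, matching :onyx/max-peers on the input task) and uses the :seq/seq key that onyx-seq reads its input from; the fn name and the round-robin scheme are illustrative, not taken from any project code.

```clojure
;; Hypothetical sketch: each input peer keeps only the segments whose
;; index maps to its slot, so the peers read disjoint parts of the data.
(def n-peers 3) ;; assumed to match :onyx/max-peers on the input task

(defn inject-partitioned-seq [event lifecycle]
  (let [slot (:onyx.core/slot-id event) ;; 0-based slot for this peer
        data (map (fn [i] {:n i}) (range 10000))
        ;; round-robin: keep every n-peers-th segment, offset by slot
        my-part (keep-indexed
                  (fn [idx segment]
                    (when (= slot (mod idx n-peers)) segment))
                  data)]
    {:seq/seq my-part}))

(def in-lifecycle
  {:lifecycle/before-task-start inject-partitioned-seq})
```

With this in place, slot 0 sees {:n 0}, {:n 3}, {:n 6}, ..., slot 1 sees {:n 1}, {:n 4}, ..., and so on, so no segment is processed twice.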


it also works if you know you want to read, say, a bunch of S3 objects. Based on the particular slot the peer is on, you’d have each peer’s injection return a different partition of the objects that you passed in


(e.g. first peer gets the first 1/3, second gets the second 1/3, third gets the last 1/3)
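The contiguous-slices variant can be sketched as a small pure helper; the function name is made up for illustration, and it just carves a collection into roughly equal chunks by slot:

```clojure
(defn slice-for-slot
  "Returns the contiguous portion of coll belonging to the 0-based
   `slot` out of `n-peers` total peers. Illustrative helper only."
  [coll slot n-peers]
  (let [v     (vec coll)
        total (count v)
        ;; ceiling division so the last slot absorbs any remainder
        size  (long (Math/ceil (/ total (double n-peers))))
        start (min total (* slot size))
        end   (min total (+ start size))]
    (subvec v start end)))

(slice-for-slot ["a" "b" "c" "d" "e" "f"] 0 3) ;; => ["a" "b"]
(slice-for-slot ["a" "b" "c" "d" "e" "f"] 2 3) ;; => ["e" "f"]
```

Each peer would call this from its injection fn with its own :onyx.core/slot-id, e.g. over a list of S3 keys fetched up front.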


the idea works whether you’re generating the segments or passing them in via the lifecycle


Gotcha, ty