#onyx
2018-01-23
michaeldrogalis02:01:04

@atwrdik Ran out of time tonight -- going to have to take a look tomorrow. Sorry! 😕

sparkofreason15:01:26

Starting work on a use-case where a user will upload a file that needs to be converted and streamed into our Onyx system. We're currently just uploading straight to S3. Before I go reinventing any wheels, if anyone has any tips or pointers to useful libraries for this task, it would be much appreciated. Thanks.

michaeldrogalis15:01:37

@dave.dixon S3 -> notification into Lambda -> read file inside Lambda and stream to Kafka -> into Onyx is one route. From what I've heard about your use case previously, Kafka's not ideal here though.
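For reference, the "stream to Kafka -> into Onyx" leg of that route would be an onyx-kafka input task. A rough sketch of the catalog entry is below; the topic, group id, ZooKeeper address, and deserializer are placeholders, and the exact connection keys vary by onyx-kafka version, so check the plugin README.

```
;; Sketch of an onyx-kafka input task that reads the records the
;; Lambda pushed into Kafka. Placeholder values throughout.
{:onyx/name :read-upload-records
 :onyx/plugin :onyx.plugin.kafka/read-messages
 :onyx/type :input
 :onyx/medium :kafka
 :kafka/topic "uploaded-file-records"
 :kafka/group-id "onyx-upload-consumer"
 :kafka/zookeeper "127.0.0.1:2181"
 :kafka/offset-reset :earliest
 :kafka/deserializer-fn :my.app.serde/deserialize
 :onyx/batch-size 100
 :onyx/doc "Reads uploaded-file records from Kafka"}
```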

michaeldrogalis15:01:04

Being able to make use of onyx-amazon-s3 directly off of S3 change notifications would be pretty nice

sparkofreason15:01:21

We'll get an explicit notification of the file upload in our Onyx job. I was going to have that task stream in the file, parse/transform it, and output to another topic (the contents get transformed into the same commands used in other parts of the application). I think that should be fine; it's just that one task will be spewing a lot of segments. Will that work? The main thing to avoid is having to vacuum up the whole file, or having to accumulate all output segments and push them at once, and I'm not entirely clear how to manage that.
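A minimal sketch of that task's :onyx/fn, assuming the notification segment carries the S3 bucket and key, the AWS Java SDK v1 is on the classpath, and a hypothetical parse-line does the per-record transform. Note that, as discussed below, the returned seq is realized in full before the segments move downstream:

```
(ns my.app.upload-tasks
  (:require [clojure.java.io :as io])
  (:import [com.amazonaws.services.s3 AmazonS3ClientBuilder]))

(defn parse-line
  "Hypothetical per-record transform: one line of the uploaded file
   becomes one command segment."
  [line]
  {:command :import-record :payload line})

(defn expand-upload
  "The :onyx/fn for the task described above. Takes the upload
   notification segment and returns one segment per parsed record."
  [{:keys [bucket key] :as notification}]
  (let [client (AmazonS3ClientBuilder/defaultClient)
        object (.getObject client bucket key)]
    (with-open [rdr (io/reader (.getObjectContent object))]
      ;; doall: the reader closes when this fn returns, and Onyx
      ;; realizes the whole result before sending it on anyway.
      (doall (map parse-line (line-seq rdr))))))
```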

sparkofreason15:01:24

Or maybe that requires a plugin, and I need to kick off another parameterized job?

michaeldrogalis16:01:47

I think the main risk in that approach is knowing what to do if one of your Onyx nodes gets kicked over. Are those notifications replayable?

michaeldrogalis16:01:25

I think it'd work okay if you can parallelize the reading activity across multiple processes on the input task.
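In catalog terms that parallelism is set per task, e.g. with :onyx/n-peers (or :onyx/min-peers / :onyx/max-peers). A sketch, reusing the hypothetical expand-upload fn from above:

```
;; Sketch: fan the reading/parsing task out across several peers.
{:onyx/name :expand-upload
 :onyx/fn :my.app.upload-tasks/expand-upload
 :onyx/type :function
 :onyx/n-peers 4          ;; dedicate four peers to this task
 :onyx/batch-size 20}
```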

sparkofreason18:01:55

If a task returns a seq of segments, is it evaluated lazily?

michaeldrogalis18:01:56

How late does the evaluation need to occur for it to be "lazy"? 🙂

michaeldrogalis18:01:06

Or rather, evaluated by whom?

sparkofreason18:01:44

Presumably the evaluator is whatever is distributing the segments to downstream tasks. Just trying to get my head around how a task could output a large batch of segments without having to surface them all in memory.

michaeldrogalis19:01:57

I'm not sure that it's handled lazily all the way through. I could be wrong though. @lucasbradstreet?

lucasbradstreet19:01:11

If a task returns a seq of segments, it’ll realize the whole seq before sending it on. It may split the seq into multiple batches on the messenger, but it still has to realize it all at once.

lucasbradstreet19:01:38

The exception is if you use something like onyx-seq. With an onyx-seq input task, the plugin can realize a lazy seq incrementally.
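For completeness, a rough sketch of that onyx-seq route, loosely following the lazy file-reading example in the Onyx docs. The :seq-input/filename lifecycle key is a made-up placeholder, and names like :seq/rdr, :seq/seq, and :onyx.plugin.seq/reader-calls should be checked against your Onyx version:

```
(ns my.app.seq-input
  (:require [clojure.java.io :as io]))

;; Lifecycle that hands the plugin an incrementally-realized lazy seq.
(defn inject-in-reader [event lifecycle]
  (let [rdr (io/reader (:seq-input/filename lifecycle))]
    {:seq/rdr rdr
     :seq/seq (line-seq rdr)}))

(defn close-reader [event lifecycle]
  (.close (:seq/rdr event)))

(def in-calls
  {:lifecycle/before-task-start inject-in-reader
   :lifecycle/after-task-stop close-reader})

;; Catalog entry for the input task.
(def catalog-entry
  {:onyx/name :in
   :onyx/plugin :onyx.plugin.seq/input
   :onyx/type :input
   :onyx/medium :seq
   :onyx/max-peers 1
   :onyx/batch-size 20
   :onyx/doc "Reads segments from a lazy seq"})

;; Lifecycles wiring the reader to the :in task.
(def lifecycles
  [{:lifecycle/task :in
    :seq-input/filename "/path/to/uploaded-file.csv"
    :lifecycle/calls :my.app.seq-input/in-calls}
   {:lifecycle/task :in
    :lifecycle/calls :onyx.plugin.seq/reader-calls}])
```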