This page is not created by, affiliated with, or supported by Slack Technologies, Inc.
2016-04-14
Channels
- # admin-announcements (5)
- # aws (3)
- # beginners (35)
- # boot (96)
- # cider (1)
- # clara (6)
- # cljs-dev (12)
- # cljsrn (34)
- # clojure (151)
- # clojure-boston (3)
- # clojure-brasil (4)
- # clojure-canada (1)
- # clojure-czech (8)
- # clojure-dusseldorf (11)
- # clojure-japan (5)
- # clojure-russia (120)
- # clojure-taiwan (1)
- # clojure-uk (3)
- # clojurescript (7)
- # component (27)
- # cursive (13)
- # data-science (45)
- # datomic (1)
- # devcards (5)
- # emacs (3)
- # funcool (65)
- # hoplon (103)
- # instaparse (3)
- # jobs (14)
- # jobs-discuss (1)
- # juxt (2)
- # lein-figwheel (2)
- # off-topic (16)
- # om (20)
- # onyx (49)
- # parinfer (17)
- # perun (1)
- # planck (5)
- # proton (4)
- # re-frame (14)
- # ring-swagger (4)
- # spacemacs (4)
- # untangled (110)
- # yada (14)
@jeroenvandijk: cool 😄
I would like to batch messages by criteria other than batch-size. For instance, I would like to track the accumulated file size and write when a threshold is reached
@aspra: I'll have to think about that a little bit. I can't think of many good options off the top of my head
@lucasbradstreet thanks!
@aspra: Is the idea to control the batch size for outgoing segments from an output plugin?
@michaeldrogalis yes exactly
@aspra: Interesting use case. Just out of curiosity, is there a problem with incrementally writing to an open file handle rather than going at it all at once?
I have created an output plugin that writes a new file per batch, based on batch-size, but I would like to do it based on byte size
The best I’ve got so far is to either do that, or to manually ack from an output plugin and accumulate the segments until you hit your criteria, at which point you write out and ack.
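The accumulate-then-ack idea could be sketched roughly as below. This is a minimal sketch, not the Onyx plugin API: `write-out!` and `ack-all!` are hypothetical placeholders for however the plugin writes its batch and acks its segments, and it assumes the plugin processes segments from a single thread.

```clojure
;; Hypothetical sketch: hold pending segments in an atom until a byte
;; threshold is reached, then write the batch and ack the segments.
(def pending (atom {:segments [] :bytes 0}))

(def byte-threshold (* 512 1024)) ; flush once ~0.5 MB has accumulated

(defn handle-segment [segment segment-size write-out! ack-all!]
  (let [{:keys [segments bytes]}
        (swap! pending
               (fn [{:keys [segments bytes]}]
                 {:segments (conj segments segment)
                  :bytes    (+ bytes segment-size)}))]
    (when (>= bytes byte-threshold)
      (write-out! segments) ; write the accumulated batch to its destination
      (ack-all! segments)   ; only ack once the write has succeeded
      (reset! pending {:segments [] :bytes 0}))))
```

The key property is that segments are only acked after the write succeeds, so a crash before the flush just means they are replayed.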
@lucasbradstreet: Can you see any problems with using a leaf function task for that with a global window and a trigger that writes to a file when the criteria is met?
Sorry, need to brb for a bit.
That would work, but it would be achieving fault tolerance by journalling the whole file to BookKeeper, which is probably undesirable
@aspra: Can you point me to another tool that has a similar feature? Just curious to see how it works elsewhere.
@michaeldrogalis: no idea if there is such a tool I am afraid
I think @michaeldrogalis’s suggestion of incrementally writing to the file would work. You could track the size of the file and switch to a new handle if the file would grow past the size limit
@aspra: Okay, no worries.
I think windows/triggers are probably the wrong solution because you will end up journalling all your files to BookKeeper too
Instead of creating a file per batch in your output plugin, you can keep your file handle in an atom in your plugin, and switch to a new one each time the file becomes too big
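That rolling-file idea could look something like the sketch below. It assumes the plugin writes from a single thread; `new-file-name` is a placeholder for however the plugin names its output files.

```clojure
;; Hypothetical sketch: keep the current writer and its byte count in an
;; atom, and switch to a fresh file once the size limit would be exceeded.
(require '[clojure.java.io :as io])

(def size-limit (* 512 1024)) ; ~0.5 MB per file

(defn new-file-name []
  (str "out-" (System/currentTimeMillis) ".log"))

(def state (atom {:writer (io/writer (new-file-name)) :bytes 0}))

(defn write-segment! [^String line]
  (let [n (alength (.getBytes line "UTF-8"))
        {:keys [writer bytes]} @state]
    (when (> (+ bytes n) size-limit)
      ;; roll over to a new file before exceeding the limit
      (.close ^java.io.Writer writer)
      (reset! state {:writer (io/writer (new-file-name)) :bytes 0}))
    (let [{:keys [writer]} @state]
      (.write ^java.io.Writer writer line)
      (swap! state update :bytes + n))))
```

Since the file handle lives in the plugin rather than in a window, nothing needs to be journalled to BookKeeper for this to work.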
Yeah, was just a quick suggestion. An async ack is probably the best shot for performance.
Or that, yeah
@michaeldrogalis: one example of a tool that does this is the Pail consolidator for Hadoop https://github.com/nathanmarz/dfs-datastores/blob/develop/dfs-datastores/src/main/java/com/backtype/hadoop/Consolidator.java
FYI, this is the use case we want to implement http://metamx.github.io/docs.metamarkets.com/docs/latest/send-data.html#file-formats-names-and-compression . Metamarkets uses Druid as its data-crunching solution
Ah. Will uploading to a HTTPS endpoint be part of the Onyx job, or will that happen outside of Onyx?
The idea was to do it inside, but we are noobs
@jeroenvandijk: Ah okay, I understand the motivation now.
OK, so in that case you would be both writing to a file and also making the http call?
You might not need to write to a file first, I’m guessing?
Alright I gotta run for real now, catch ya'll.
Catch you
@lucasbradstreet there are two use cases
Right, it makes sense to start with the second one
Let me know if you'd like any pointers splitting up the file writing in the plugin then
What is batch-file doing?
It collects messages, creates a file of around 0.5 MB, and then gzips it (that’s the goal, at least)
I’m also not sure whether it is OK if the file needs to be transported from one node to another via Aeron
Yeah, that’s the main thing I wanted to determine
Making sure that file is on the same node as the send-to-api and write-to-s3 tasks is a little tricky.
Is it good practice to serialize the file and send it from one node to the other?
Maybe reusing the batching step isn’t a particularly good idea if it’s not
I think it’s OK to send the file contents in a segment to each, based on it being around 300-600KB. However, it might not be worth the pain. It might be better to just re-use the batching code in both send-to-api and write-to-s3
@lucasbradstreet ok. So more what we were discussing before, the batching happens on the output plugins
Ah, I must have misunderstood the purpose of metamarkets/batch-file?
Yeah, I guess it's a case of premature optimization :#
my bad