#onyx
2017-03-23
Travis 02:03:46

Hey guys, anyone ever attempted to do anything with onyx and parquet files?

lucasbradstreet 02:03:43

Haven't heard of anybody, but I'd be interested, as it's something I've thought about doing.

jeroenvandijk 12:03:59

@camechis we’re looking into this

Travis 12:03:18

Cool. I was just exploring some ideas and wondered if anyone has used it in onyx. I was thinking of storing some data in parquet on S3.

jeroenvandijk 12:03:51

Yep, we are planning on doing the same if it all works out. My hope is that the only change to the current setup is the way we generate bytes (parquet instead of JSON), but I'm not sure yet.

jeroenvandijk 12:03:00

@camechis Do you see any particular challenges?

Travis 12:03:56

I haven't even begun to look at what it would take. Just an idea that popped into my head; that's about as far as it's got, lol.

jeroenvandijk 12:03:47

My hope/estimate is that you only need to create a custom serializer-fn for parquet: https://github.com/onyx-platform/onyx-amazon-s3/blob/0.10.x/test/onyx/plugin/s3_end_to_end_test.clj#L59
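
(For reference, a minimal sketch of what that could look like. The task map keys below follow the 0.10.x test linked above as far as I remember them, and `segments->parquet-bytes` is a hypothetical function you'd implement with whatever parquet writer library you choose; check the onyx-amazon-s3 README for the exact serializer-fn contract.)

```clojure
;; Hypothetical serializer: encode the segments handed to the S3 writer as
;; parquet bytes. The parquet encoding itself is left as a stub here.
(defn segments->parquet-bytes
  ^bytes [segments]
  ;; e.g. build a parquet file in memory with your writer library of choice
  (byte-array 0))

;; Output task map, modelled on the linked s3_end_to_end_test.
(def write-s3-task
  {:onyx/name :write-s3
   :onyx/plugin :onyx.plugin.s3-output/output
   :onyx/type :output
   :onyx/medium :s3
   :s3/bucket "my-bucket"
   :s3/serializer-fn ::segments->parquet-bytes
   :onyx/batch-size 50
   :onyx/doc "Writes batches of segments to S3 as parquet objects"})
```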

jeroenvandijk 12:03:59

the rest should be the same as writing other data to s3

jeroenvandijk 12:03:42

Maybe, if you want to read the parquet files from some other HDFS client, you need to write to the proper paths (based on the data), but that's it, I think.
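
(A hypothetical sketch of deriving HDFS/Hive-style partitioned paths from the data itself; how you plug it in, e.g. via the plugin's key-naming option, depends on the plugin version, so check its README for the exact signature.)

```clojure
;; Hypothetical helper: build a partitioned object key from fields in the data,
;; e.g. "events/date=2017-03-23/part-0001.parquet", so other HDFS-style
;; clients can discover the files by path.
(defn partitioned-key
  [{:keys [date part]}]
  (format "events/date=%s/part-%04d.parquet" date part))

(partitioned-key {:date "2017-03-23" :part 1})
;; => "events/date=2017-03-23/part-0001.parquet"
```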

Travis 12:03:56

Sounds reasonable

gardnervickers12:03:20

If it's just a file format for files in S3, a custom deserializer will be sufficient for ingest. Any other kind of storage medium will need a plugin.

theblackbox 13:03:26

If I wanted to emit a segment from one task to multiple tasks would I need to use flow-conditions? Or am I barking up the wrong tree there?

Travis 13:03:13

You should be able to just use that task as the input to the other tasks in your workflow, I believe.

gardnervickers 13:03:30

@theblackbox To emit a segment from task :a to :b and :c, you’d make your workflow look like [[:a :b] [:a :c]].
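
(That workflow written out; segments emitted by :a flow to both :b and :c with no flow conditions needed. Flow conditions only come into play if you want to route segments to :b or :c conditionally.)

```clojure
;; :a fans out to both :b and :c
(def workflow
  [[:a :b]
   [:a :c]])
```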

michaeldrogalis 19:03:13

Long-awaited patch going in shortly. onyx-kafka peers will be able to consume from multiple partitions, chosen at runtime: https://github.com/onyx-platform/onyx-kafka/pull/35

Travis 19:03:42

@michaeldrogalis I am pretty sure I know what this means, but can you explain a little?

lucasbradstreet 19:03:21

@camechis it means you can set onyx/n-peers to any number, and the partitions will be distributed between them. So if you have a topic with 4 partitions, and set onyx/n-peers to 2, one peer will take partitions 0,1 and the other will take 2,3
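
(A sketch of what that looks like in the catalog, assuming the usual onyx-kafka input task keys; the option names are from memory, so check the onyx-kafka README for your version.)

```clojure
;; Kafka input task reading a 4-partition topic with 2 peers; after the patch,
;; each peer is assigned 2 of the 4 partitions at runtime.
(def read-events-task
  {:onyx/name :read-events
   :onyx/plugin :onyx.plugin.kafka/read-messages
   :onyx/type :input
   :onyx/medium :kafka
   :kafka/topic "events"                          ;; topic with 4 partitions
   :kafka/zookeeper "127.0.0.1:2181"
   :kafka/deserializer-fn ::deserialize-message   ;; hypothetical deserializer
   :onyx/n-peers 2                                ;; partitions split across these 2 peers
   :onyx/batch-size 100})
```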

Travis 19:03:59

Awesome, that's what I thought. So now you don't have to change it to match how many partitions you have.

Travis 19:03:08

very handy

lucasbradstreet 19:03:26

Should help quite a bit because the prevailing wisdom is to overpartition your topic to save you trouble down the line.