#onyx
2016-06-08
jeroenvandijk11:06:12

Are people already using onyx with hdfs? I guess not, given this issue https://github.com/onyx-platform/onyx/issues/7

jeroenvandijk11:06:20

E.g. in case someone happens to be thinking about the same thing, I’m wondering how this would be implemented with Onyx https://github.com/nathanmarz/dfs-datastores

lucasbradstreet12:06:43

@jeroenvandijk: not yet, but it's one of our medium term priorities

jeroenvandijk12:06:30

I’m not sure if I need hdfs (we would write to S3, which is hdfs compatible), but I definitely need something that reads and writes (larger) files in a fault-tolerant way. Not sure what that is exactly

lucasbradstreet12:06:39

@jeroenvandijk: would you mind explaining how you'd use it?

lucasbradstreet12:06:08

Mostly we're waiting for the right use case before we implement something with it

jeroenvandijk12:06:09

so the Pail library is something I would like to have in Onyx

jeroenvandijk12:06:16

I can explain that if you want

jeroenvandijk12:06:06

Pail is inside dfs-datastores

jeroenvandijk12:06:03

We use Pail/dfs-datastores to store large sets of data. It’s robust, cheap and we don’t have to worry about scale

lucasbradstreet12:06:14

Ok, would it usually be outputting to HDFS, or would an input plugin be important too? Guessing you'd want both, since you'll probably want to read your results from other jobs

jeroenvandijk12:06:25

yeah, so we have two steps in storage. The first is writing raw, unconnected data; sometimes we query this directly (with Cascalog). The next step is to connect different events together. For this we need to read multiple files and join them, then write the result to another set, which we also query (with Cascalog). I would like to replace this process by writing to Parquet or JSON files: the writing and joining would be done by Onyx, and I would do the querying with Drill or Redshift.
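
For concreteness, here is a rough sketch of how a pipeline like that could be shaped as an Onyx job. This is not from the conversation: the `:sketch.plugin.hdfs/*` plugin keywords are placeholders for the then-unwritten HDFS/S3 plugin, and a real event join would likely use Onyx's windowing/state features rather than a plain function task; only the workflow/catalog structure is standard Onyx.

```clojure
;; Sketch only: plugin keywords are placeholders for a hypothetical
;; HDFS/S3 plugin; the workflow/catalog shape is ordinary Onyx data.
(def workflow
  [[:read-raw-events :join-events]
   [:join-events :write-joined]])

(def catalog
  [{:onyx/name       :read-raw-events
    :onyx/plugin     :sketch.plugin.hdfs/input   ; hypothetical plugin
    :onyx/type       :input
    :onyx/medium     :hdfs
    :onyx/batch-size 100
    :onyx/doc        "Read raw, unconnected event files"}

   {:onyx/name       :join-events
    :onyx/fn         :my.app/join-by-event-id    ; hypothetical user fn
    :onyx/type       :function
    :onyx/batch-size 100
    :onyx/doc        "Connect related events; a real join would use windows/state"}

   {:onyx/name       :write-joined
    :onyx/plugin     :sketch.plugin.hdfs/output  ; hypothetical plugin
    :onyx/type       :output
    :onyx/medium     :hdfs
    :onyx/batch-size 100
    :onyx/doc        "Write joined events as Parquet/JSON for Drill/Redshift"}])
```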

jeroenvandijk12:06:30

That’s a rough idea 🙂

jeroenvandijk12:06:41

I’m not sure if it is worth the trouble, but we had some debugging problems with Hadoop/Cascalog in the past. I hope Onyx would be more transparent

lucasbradstreet12:06:52

Cool. That makes sense.

lucasbradstreet12:06:56

Yes, I'd hope so too

lucasbradstreet12:06:46

A lot of the reason @michaeldrogalis started work on Onyx was to get away from the inflexibility and opacity of Cascalog and Storm

jeroenvandijk12:06:19

Yeah also the reason why I became enthusiastic about Onyx 🙂

lucasbradstreet12:06:01

I'd love to have an HDFS plugin, but we're a bit busy at the moment, so it would have to be user-contributed or sponsored work until we're done with our current batch of Onyx work

jeroenvandijk12:06:16

yeah makes sense

jeroenvandijk12:06:24

Maybe I'll ask you later

lucasbradstreet12:06:33

Sounds good. It's definitely something we need

michaeldrogalis14:06:10

Definitely keen on getting HDFS support out soon. It's probably the most requested feature to be honest.

Drew Verlee17:06:33

@jeroenvandijk:
> we would write to S3 which is hdfs compatible
could you explain what that means briefly?

michaeldrogalis17:06:56

@drewverlee: I think he's off for today since he's in Europe, but I'll take a guess. A lot of systems can read/write to S3 as if it were HDFS since they present roughly the same abstraction - a K/V store for large files.

michaeldrogalis17:06:22

One of the important differences is that S3 doesn't support block-level reading, though.
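
As an aside, the "same abstraction" point is visible in Hadoop's FileSystem API, which resolves the implementation from the URI scheme. A minimal sketch, assuming the hadoop-aws (s3a) module is on the classpath and credentials are configured; the paths are made up:

```clojure
(import '[java.net URI]
        '[org.apache.hadoop.conf Configuration]
        '[org.apache.hadoop.fs FileSystem Path])

(defn write-text
  "Write `s` to `uri`; the caller never branches on the backing store."
  [uri s]
  (let [fs (FileSystem/get (URI. uri) (Configuration.))]
    ;; fs is an HDFS, S3, or local FileSystem depending on the scheme
    (with-open [out (.create fs (Path. uri))]
      (.writeBytes out s))))

;; Same call, different backing stores (hypothetical paths):
;; (write-text "hdfs://namenode:8020/events/raw.json" "{\"id\": 1}")
;; (write-text "s3a://my-bucket/events/raw.json"      "{\"id\": 1}")
```

The caveat above still applies: scheme-level compatibility doesn't make the stores equivalent, e.g. S3 lacks block-level reads.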

Drew Verlee17:06:03

@michaeldrogalis: thanks! Cool. I ask because part of the team (me and my brother) is trying to write to S3 as an alternative to HDFS, largely because we want options. So hearing they share the same abstraction is encouraging.