#onyx
2016-03-10
mike_ananev20:03:41

Hi there! I've briefly read the Onyx examples but can't find how to process JSON files from an HDFS folder. Should I take the files from HDFS and put them into a Kafka topic in order to do ETL?

michaeldrogalis20:03:49

@mike1452: Hello! We don't have an HDFS plugin yet, but you have some options. You can stream the files into Kafka, or you can use onyx-seq to read the file and buffer it into an in-memory data structure: https://github.com/onyx-platform/onyx-seq
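
A rough sketch of the first option (streaming the files into Kafka), not taken from the conversation. It assumes the JSON files have already been copied out of HDFS to a local folder and hold one JSON object per line. The `send` callable is a hypothetical stand-in for whatever Kafka producer is in use, and the function names are illustrative only.

```python
import json
from pathlib import Path
from typing import Callable, Iterator


def read_json_records(folder: str) -> Iterator[dict]:
    """Yield one dict per non-empty line of every *.json file in folder."""
    # Assumption: files are newline-delimited JSON and fit a simple glob.
    for path in sorted(Path(folder).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:
                    yield json.loads(line)


def stream_records(folder: str, send: Callable[[bytes], None]) -> int:
    """Serialize each record and hand it to send; return how many were sent."""
    # `send` is hypothetical: in practice it would wrap a Kafka producer call.
    count = 0
    for record in read_json_records(folder):
        send(json.dumps(record).encode("utf-8"))
        count += 1
    return count
```

Passing `send` in as a parameter keeps the file-reading logic independent of any particular Kafka client, so the same loop can be tried against a plain list before wiring in a real producer.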

mike_ananev20:03:39

@michaeldrogalis: thank you! Kafka is my choice, because we have a big data cluster with offline files (in HDFS) and online data (clickstream). As an architect, I'm considering running a pilot with Onyx. Currently we work a lot with Docker. Is there an Onyx Docker image that I can pull and start?

mike_ananev21:03:13

@michaeldrogalis: OK, I can describe my task in more detail. I have 56 servers with 760 TB of HDFS storage. Currently we rely heavily on Spark to process CSV, TSV, and JSON files, but Spark is a very capricious thing. We also have a clickstream from our site. I'm a Clojure developer, so I want to try out Onyx in order to compare ease of development, Onyx stability, declarative data processing...

michaeldrogalis21:03:43

Awesome, that sounds like a really good use case for Onyx. Our last release has the latest windowing abstractions for processing stateful streams of both online and offline data transparently.