Fork me on GitHub

Hi! We have an ETL pipeline where we pull from a changelog (ms dynamics webservice) every second or so. When something has changed we fetch the relevant records from another endpoint, sending the records to kafka and save the timestamp. We use Samza today and persists timestamps etc. with the builtin checkpointing/state (rocksdb) features of samza. Any tips on how I can build this with Onyx? Would it be idiomatic to integrate the webservice client as an onyx plugin or would it be sufficient to use lifecycle functions for this? (spin up a "worker thread" during task initialization that will emit onyx segments?). Also, any suggestions for how to persist my "state" (basicly just storing UUIDs with a "last modified date" timestamp) ?


Hi @ckarlsen. Yes, it would be idiomatic to either build an input plugin for the webservice client, or continue to send the records to Kafka and use the Kafka input plugin.


As for persisting your state, you would want to use the windowing/state management feature, which will maintain your state in a fault tolerant way


You’re welcome.


looking forward to play with this. Samza works and the API is simple, but the dev and testing experience is horrible with the java interop and the clumsy shell scripts and .properties files...


You could maybe even use onyx-seq, via a lazy-seq that polls the web service, to prototype it out


If the web service isn’t replayable then you probably won’t lose anything by using it


I hope you have a good experience with Onyx. One thing we’ve really focused on is making dev and testing a breeze


I recommend giving the template a look for some of our best practices


nice. thanks


Can anyone offer me a invite?


We no longer need and in the kafka plugin, since we run the in-memory versions. Any objections to dropping those?


Drop em. I have a feeling that we may need to get them back one day, especially for ZooKeeper (the testing ZK may not be representative), but we can always just grab them from earlier revisions