#onyx
2016-03-23
aaelony01:03:09

Something just came up that I'd normally do in Hive, but I'd like to get a sense of the feasibility, ease, and processing time needed to do this in Onyx. The task is essentially to read in a large amount (~1 TB) of JSON files with 20 or so keys from an S3 bucket, sort by 2 or 3 of the keys, and output the sorted data, tab-delimited, into another S3 bucket. In Hive this is an easy task but will probably be quite slow... What will it be like via Onyx? (thanks in advance)

michaeldrogalis02:03:53

@aaelony: Use :onyx/group-by-key in the catalog entries for your tasks.

michaeldrogalis02:03:13

Can do :onyx/group-by-key [:name :country :age] for a multi-key grouping.
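A minimal sketch of what such a catalog entry might look like. The task name and `:onyx/fn` value are hypothetical, and the `:onyx/flux-policy` and `:onyx/min-peers` settings are assumptions about what a grouping task needs; check them against the information model for your Onyx version. Note that grouping routes segments with the same key values to the same peer, so any ordering within a group would still have to be handled by the task itself.

```clojure
;; Hypothetical catalog entry; names and values are illustrative only.
{:onyx/name :collect-by-person
 :onyx/fn :my.app.tasks/collect            ; assumed user-defined function
 :onyx/type :function
 :onyx/group-by-key [:name :country :age]  ; multi-key grouping, as described above
 :onyx/flux-policy :kill                   ; assumption: grouping tasks declare a flux policy
 :onyx/min-peers 1                         ; assumption: paired with the flux policy
 :onyx/batch-size 20}
```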

aaelony03:03:34

very cool! if I can learn enough before tomorrow to make this happen, I may give it a shot ;)

lvh19:03:35

I’m trying to understand ; specifically why it was moved from lifecycles to tasks. Wasn’t it always a task?

lvh19:03:19

also, that code appears to assume that you can reach in and get a key at [:onyx.core/results :tree] and you’re going to get a bunch of things that have :leaves keys, but that’s not what the docs suggest at http://www.onyxplatform.org/docs/cheat-sheet/latest/#/event-map

lvh19:03:42

(they suggest: ":onyx.core/results: A map of read segment to a vector of segments produced by applying the function of this task")

lvh19:03:13

So I was guessing a map that looks like {{...}: [{...} {...} {...}]}
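To make the mismatch concrete, here is a small sketch contrasting the two shapes. The sample segments are made up, and the `:tree`/`:leaves` layout is only what the discussion above says the code assumes, not a confirmed description of the event map.

```clojure
;; Shape inferred from the cheat sheet: input segment -> vector of outputs.
(def results-per-docs
  {{:n 1} [{:n 2} {:n 3}]})

;; Shape the code under discussion appears to assume (hypothetical sample data).
(def event
  {:onyx.core/results
   {:tree [{:leaves [{:n 2} {:n 3}]}
           {:leaves [{:n 4}]}]}})

(defn output-segments
  "Collects every leaf found under [:onyx.core/results :tree]."
  [event]
  (vec (mapcat :leaves (get-in event [:onyx.core/results :tree]))))

(defn output-segments-per-docs
  "Collects outputs if the results were a plain segment -> outputs map."
  [results]
  (vec (mapcat val results)))
```

A test for something like log-batch would need to build its fixture in whichever shape the running version actually produces.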

lucasbradstreet19:03:47

@lvh the reason it was moved to tasks was because of the "add-task" functionality. Really it's a task behaviour. We did debate whether it should be under tasks.

lucasbradstreet19:03:19

Thanks for the heads up about the docs. I recently added that from the old information model markdown file and missed that it was out of date

lvh19:03:21

the add-task functionality being add-logging, or a fn I’m not aware of called add-task?

lvh19:03:24

no worries

lvh19:03:39

I am writing tests for log-batch now but the tests are based on the assumption the implementation is correct

lvh19:03:56

(we have a bunch of junior clojure developers so I am hoping this gives them some opportunities for easy refactoring)

gardnervickers19:03:27

add-task is a new function added into core to assist in building up jobs; it's just sugar on top of the data structure API.

lucasbradstreet20:03:42

We're going through the release process for 0.9.0 now. I believe add-task was part of the 0.8.x template, right Gardner?

lucasbradstreet20:03:02

We will be writing a blog post about composing the behaviours and task builders soon

gardnervickers20:03:03

It’s changed since

gardnervickers20:03:16

@lvh: Here’s an example of the job builder pattern we are working on currently. https://github.com/gardnervickers/twit/blob/master/src/twit/jobs/basic.clj

gardnervickers20:03:37

There’s accompanying tests

lvh20:03:51

that does look much neater

lucasbradstreet20:03:35

You can stick with the old way, but we find this more composable and reusable while keeping the data as the overall DSL that you build your jobs out of

lvh20:03:57

hm, I’m surprised that has an empty catalog

lucasbradstreet20:03:04

You don't need to use the multi-arity task builders. All of them take an opts map that gets merged into the main task map if you prefer to just keep them together.

gardnervickers20:03:18

The catalog is constructed with the add-task functionality

gardnervickers20:03:22

Oftentimes you have a catalog entry and lifecycles that are linked together. Separating them breaks their functionality as a logical task unit. By combining them into a function, you can keep them together AND validate the input in the context of each piece.

gardnervickers20:03:05

All the plugins involved with the 0.9.x release should have schematized “task bundles”, making dropping them into a workflow dead simple: https://github.com/onyx-platform/onyx-redis/blob/0.9.x/src/onyx/tasks/redis.clj

gardnervickers20:03:45

aka (add-task (redis-writer :write-to-redis redis-opts))

lvh20:03:00

I see. So, add-task takes your job and does whatever it needs to do to the job in order to get it to work. That does seem simpler, and it seems like you would be able to model pretty much all behavior (e.g. adding metrics) that way?

gardnervickers20:03:32

The onyx job map is very flexible, but rather verbose. So we use what we’ve been calling “task bundles” to tie together logical units (catalog entry, lifecycle, window… etc) into one function call that knows how to validate/create itself.
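A rough sketch of the task bundle idea in plain Clojure. This is not Onyx's own `add-task`; `add-task-sketch`, `logging-task`, the bundle keys, and the `:my.app.logging/calls` lifecycle are all hypothetical names chosen for illustration. It assumes a bundle is a map holding a catalog entry and its lifecycles, and that adding it to a job appends each piece to the matching section.

```clojure
(defn add-task-sketch
  "Merges one task bundle into a job map. Hypothetical stand-in, not Onyx's add-task."
  [job {:keys [task-map lifecycles]}]
  (-> job
      (update :catalog (fnil conj []) task-map)
      (update :lifecycles (fnil into []) lifecycles)))

(defn logging-task
  "Hypothetical bundle builder: checks its input, then returns a catalog
  entry together with the lifecycle that belongs to it."
  [task-name opts]
  {:pre [(keyword? task-name) (map? opts)]}
  {:task-map (merge {:onyx/name task-name
                     :onyx/type :function
                     :onyx/batch-size 10}
                    opts) ; opts are merged into the main task map
   :lifecycles [{:lifecycle/task task-name
                 :lifecycle/calls :my.app.logging/calls}]})

;; The job starts with an empty catalog; the bundle fills it in.
(def job
  (-> {:workflow [[:in :log] [:log :out]]
       :catalog []
       :lifecycles []}
      (add-task-sketch (logging-task :log {:onyx/fn :clojure.core/identity}))))
```

Because the builder owns both the catalog entry and the lifecycle, the two cannot drift apart, and the precondition gives one place to reject bad input.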

lvh20:03:42

makes sense

gardnervickers20:03:03

Gives nice opportunities for doc generation, etc. too

gardnervickers20:03:30

and is a lot nicer at the REPL when using autocomplete, as you don't have to jump back and forth between docs as much.

gardnervickers20:03:20

I’m working on incorporating this into the template as we speak. I want to try it with as many edge cases as I can think of before we release, just to make sure the pattern is bulletproof.

lvh20:03:14

gardnervickers: Gotcha. What’s the threshold for moving stuff into onyx or a separate package?

lvh20:03:24

gardnervickers: E.g. this logging ns seems really generic

gardnervickers20:03:40

We have lib-onyx currently

gardnervickers20:03:48

I have a few other bundles I’m looking to move there

gardnervickers20:03:58

i.e. generic connection pooling/schema migrations

gardnervickers20:03:27

That logging, as it currently stands, would most likely not be put into core however.

michaeldrogalis20:03:31

@lvh In general, the threshold is that we use something in production, then Lucas and Gardner say it needs to be in core, then I say no, then they badger me about it and a week later it's in

lvh20:03:40

yes my dear

michaeldrogalis22:03:54

Onyx 0.9.0 has been released. We'll have a more public announcement with the ramifications in a few days. Here's the changelog for now: https://github.com/onyx-platform/onyx/blob/0.9.x/changes.md#090

michaeldrogalis22:03:21

This release mostly increases the performance of windowing, and fixes some Jepsen-related bugs.