#onyx
2016-11-02
Travis01:11:30

can’t wait to see this

kjothen11:11:52

Hi, maybe someone can help with this. I have a live stream of stock prices (1,000/sec), and I want to reduce this stream to store just one price per stock per minute in a DB (e.g. the first price in that minute interval). What's the right Onyx function/window/trigger combo to do this? I do have something running in Onyx, but I can't help but feel I've got it wrong - I've had to write a custom aggregate and use timer triggers...

lucasbradstreet12:11:44

@kjothen: you should be able to use group-by-key with a simple aggregation that just returns the new value from both the apply and the create fns. Then use a timed trigger like you are doing.
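
A minimal sketch of such an aggregation, assuming the Onyx 0.9-era custom aggregation API (an aggregation is a map of :aggregation/init, :aggregation/create-state-update, and :aggregation/apply-state-update fns); the :price key is illustrative:

```clojure
(defn latest-init [window]
  nil)

(defn latest-create-state-update [window state segment]
  ;; The state-update entry is simply the new value.
  (:price segment))

(defn latest-apply-state-update [window state entry]
  ;; Replace whatever value we had with the new one.
  entry)

(def latest-price
  {:aggregation/init latest-init
   :aggregation/create-state-update latest-create-state-update
   :aggregation/apply-state-update latest-apply-state-update})
```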

eelke14:11:07

Hi, does anyone have experience with loading JSON data into S3 and copying it to Redshift?

eelke14:11:20

loading it through Onyx

michaeldrogalis15:11:54

@eelke We don’t have an S3 reader plugin, but we’re working on that. And yeah, what @yonatanel said ^ If you’re staying inside AWS, hard to beat that.

michaeldrogalis15:11:38

I suppose if you were looking to alter the data between S3 and Redshift, Onyx would be viable again.

eelke15:11:42

Ah sorry, my intention is to load data from Kafka into S3 using Onyx, and then indeed use the COPY command to load from S3 into Redshift. I actually had a very minor issue with the serializer-fn writing to S3 in the right JSON format so that the COPY command could deal with it. That is now resolved. Anyway, thanks for the swift response.
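
For reference, a sketch of such a serializer-fn emitting newline-delimited JSON, the layout Redshift's COPY can load as JSON. It assumes the serializer receives a batch of segments and must return bytes, and uses cheshire:

```clojure
(ns my.app.serializers
  (:require [cheshire.core :as json]
            [clojure.string :as str]))

(defn json-lines-serializer
  "Serialize a batch of segments as newline-delimited JSON:
  one object per line, so COPY can parse each record."
  ^bytes [segments]
  (.getBytes (str (str/join "\n" (map json/generate-string segments)) "\n")
             "UTF-8"))
```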

michaeldrogalis15:11:21

Oh, yeah that seems quite reasonable. @eelke

michaeldrogalis15:11:35

Glad you’re off to the races then

michaeldrogalis15:11:30

@kjothen group-by-key on the stock symbol, 1-minute timer triggers, and global windows. A custom aggregate seems okay here, since you need some logic to determine that you’re seeing the “latest” stock value; data order isn’t guaranteed.
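
A sketch of that shape, assuming the 0.9-era windowing and trigger keys; ::latest-price and ::write-price! are hypothetical, and other catalog details (uniqueness key, plugin tasks) are elided:

```clojure
;; Catalog entry: group segments by stock symbol.
{:onyx/name :aggregate-prices
 :onyx/fn :clojure.core/identity
 :onyx/type :function
 :onyx/group-by-key :stock/symbol
 :onyx/flux-policy :recover
 :onyx/min-peers 1
 :onyx/batch-size 20}

;; Global window; each group gets its own window state.
{:window/id :price-window
 :window/task :aggregate-prices
 :window/type :global
 :window/aggregation ::latest-price}

;; Fire once a minute and sync the window contents to the DB.
{:trigger/window-id :price-window
 :trigger/refinement :accumulating
 :trigger/on :timer
 :trigger/period [1 :minutes]
 :trigger/sync ::write-price!}
```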

aaelony16:11:07

@eelke, you might consider https://github.com/uswitch/blueshift, I happen to like it.

kjothen18:11:38

@michaeldrogalis the problem with the global window is that I'm not guaranteed to capture the first stock price in that minute interval. That is, the timer fires every minute, not on the minute. Using a fixed one-minute accumulating window with a custom aggregation does work, though. However, I think memory usage grows unbounded: one window per stock per minute, with one segment in each.

michaeldrogalis18:11:45

@kjothen You can use a discarding refinement mode on the trigger to dump unused state. Is there an attribute of the data that signifies that you’re looking at the “right” stock price for that interval?

kjothen18:11:53

@michaeldrogalis the discarding refinement mode dumps all state for that window, no? In my case, the right stock price is the one with the earliest timestamp in the minute interval. But it's a general sampling problem, I think.

michaeldrogalis18:11:24

@kjothen You can use two triggers on a window. Use a predicate trigger with a discarding refinement to ditch old state, and a timer trigger with an accumulating refinement to sync it to the DB, perhaps?
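
A sketch of that two-trigger combination, again assuming the 0.9-era keys; ::stale? (a punctuation predicate) and the sync fns are hypothetical:

```clojure
[;; Timer trigger: accumulate state and sync it to the DB every minute.
 {:trigger/window-id :price-window
  :trigger/refinement :accumulating
  :trigger/on :timer
  :trigger/period [1 :minutes]
  :trigger/sync ::write-price!}
 ;; Predicate (punctuation) trigger: discard window state once it is
 ;; no longer needed, keeping memory bounded.
 {:trigger/window-id :price-window
  :trigger/refinement :discarding
  :trigger/on :punctuation
  :trigger/pred ::stale?
  :trigger/sync ::no-op}]
```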

michaeldrogalis18:11:54

Each group gets its own window, and triggers operate per window.

michaeldrogalis18:11:12

So in effect, every stock symbol has its own completely isolated window and trigger set

kjothen18:11:22

@michaeldrogalis I hadn't considered two triggers, neat. Whilst writing a custom aggregation was fun, I think it would be useful to bundle min-segment and max-segment aggregations in the platform, to pin an entire segment, not just a value. Thanks for your help!
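
A sketch of what a bundled min-segment might look like, following Onyx's convention of parameterizing the aggregation with a key (e.g. :window/aggregation [::min-segment :timestamp]); 0.9-era aggregation keys assumed and names illustrative:

```clojure
(defn min-segment-create [window state segment]
  segment)

(defn min-segment-apply [window state segment]
  ;; The comparison key is the second element of :window/aggregation,
  ;; e.g. [::min-segment :timestamp]. Keep the entire segment with the
  ;; smallest value; ties keep the first segment seen.
  (let [k (second (:window/aggregation window))]
    (if (or (nil? state) (< (get segment k) (get state k)))
      segment
      state)))

(def min-segment
  {:aggregation/init (constantly nil)
   :aggregation/create-state-update min-segment-create
   :aggregation/apply-state-update min-segment-apply})
```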

michaeldrogalis18:11:01

@kjothen Can you either open an issue with that thought or send the code in via a PR? Seems like a reasonable addition

mccraigmccraig22:11:56

i have some failure-prone network operations which i want to partition by a key and implement a "latest op for a key wins" strategy - i.e. whenever a new op arrives for a key X, forget about any already incomplete ops for X and focus on the new op, otherwise keep re-trying an incomplete op until it succeeds or the give-up threshold is reached. the ops themselves are not serializable data (they are cancellable promises), but are described by serializable data (clojure maps) - this seems like something i might be able to jam into onyx aggregations, but i'm not sure - does it seem feasible ?

michaeldrogalis22:11:23

@mccraigmccraig I can think of a few features you could combine to get 80% of the way there, but it feels like a paradigm mismatch. I think this would be better served by a workflow engine, something like Amazon SWF.

mccraigmccraig22:11:53

thanks @michaeldrogalis, i'll take a look at SWF - but what were the onyx features you were thinking of?

michaeldrogalis22:11:01

You could use a window and write an aggregate to only ever keep the last operation map, store the promise on the Event map, and then use a segment trigger of size 1.
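
A sketch of that idea, with the same 0.9-era caveats: ::latest-op would be an aggregation like the one above that keeps only the newest op map, and ::start-op! is a hypothetical sync fn that cancels the superseded promise and kicks off the new op:

```clojure
{:window/id :op-window
 :window/task :run-ops
 :window/type :global
 :window/aggregation ::latest-op}

;; Fire on every incoming segment, so each new op for a key
;; immediately supersedes the one before it.
{:trigger/window-id :op-window
 :trigger/refinement :accumulating
 :trigger/on :segment
 :trigger/threshold [1 :elements]
 :trigger/sync ::start-op!}
```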

michaeldrogalis22:11:09

That would work. It feels mismatched, but yeah — that would do it

michaeldrogalis22:11:51

Does it matter if two operations are running concurrently for a period? Is this the kind of thing where you need an absolute guarantee that no two operations are ever running at the same time?

michaeldrogalis22:11:01

Read up a little, you want the aggregation that @kjothen wrote earlier today.

mccraigmccraig22:11:06

no, absolute guarantees are not necessary - the ops are push-notifications partitioned by device, and there are occasional upstream errors which cause missing notifications, so i want to have some retries, but at the same time i don't want to keep retrying a notification which is out of date (since the notifications carry badge-counts with them)

michaeldrogalis22:11:32

Oh, yeah. This might be fine then.

michaeldrogalis22:11:52

Use a window to keep one op around, use a trigger to transition between ops. Should work okay for that

mccraigmccraig22:11:48

excellent - i will probably take a little mismatch over adding another major system component 🙂

michaeldrogalis22:11:27

Heh, sure. Understandable. :thumbsup: