onyx 2015-12-03 | Slack Archive

robert-stuttaford07:12:47

@lucasbradstreet: ok if we use https://github.com/pyr/clj-statsd/blob/master/project.clj in our onyx-metrics PR for statsd?

lucasbradstreet07:12:18

Sure, but make it part of the dev dependencies of the project, and then document that the dependency should be explicitly included if you use the functionality.

lucasbradstreet07:12:34

That's what we do with riemann

greywolve07:12:47

great, that's perfect

robert-stuttaford07:12:06

clever!

lucasbradstreet07:12:45

One day we may split out the projects so we can upgrade the library dependencies for our users but this works well enough for now

robert-stuttaford07:12:21

last two days have been bliss

robert-stuttaford07:12:42

not much input, true, but the latencies have been flat!

lucasbradstreet07:12:56

Yeah, that’s a really good sign that you’ve fixed the underlying issue which was mostly performance

lucasbradstreet07:12:04

Looks great

robert-stuttaford07:12:22

yeah

robert-stuttaford07:12:36

we’ll be going multi-node next - back down to 4-core 8gb ram machines, but two of them

robert-stuttaford07:12:07

probably going to need input on externalising aeron from you soon

lucasbradstreet07:12:24

Sure

robert-stuttaford13:12:45

@lucasbradstreet: i am SO happy with our Onyx system now. thank you so much for your patience and mentorship. it means the world!

lucasbradstreet13:12:04

I'm really glad to hear it! You definitely understand what's going on a lot better now, so I think you're well placed to deal with any potential issues that might come up.

lucasbradstreet13:12:38

Do you have any other Onyx shaped problems that you might tackle in the future?

robert-stuttaford13:12:24

yes. we’re going to be working on triggered/scheduled/reminder messages, and gamification badges quite soon. it’s all going to be tasks in Highstorm

robert-stuttaford13:12:27

the only thing that won’t be in HS is the actual scheduler. that’ll live outside. but the scheduler will create work for HS to perform

lucasbradstreet13:12:47

Cool, within the same job? Or will you decouple it?

robert-stuttaford13:12:02

fantastic question!

robert-stuttaford13:12:22

i suppose we’ll start with the same job and see

robert-stuttaford13:12:04

splitting it apart isn’t that hard to do. we have a find-tasks-in-namespaces abstraction, so we can easily just re-org namespaces and duplicate the job and link things correctly

lucasbradstreet13:12:24

One cool idea I just had, that is an advantage of splitting the job up, is that you could possibly use a degraded mode at a high load time by killing the secondary job until you get another box up.

robert-stuttaford13:12:15

that’s true. that’s a good reason to split things up early

lucasbradstreet13:12:32

Either way, it should be relatively easy to switch the code between the two

robert-stuttaford13:12:15

yeah

robert-stuttaford14:12:37

just pushed 1700 segments through, no latency spikes above 1s

robert-stuttaford14:12:47

backfilling all the missing stats from the last 90 days

robert-stuttaford14:12:51

-super impressed-

lucasbradstreet14:12:42

Killing it!

lucasbradstreet14:12:42

How's the load on the server looking? Depending on how much you've improved things you may want to look at using a higher max pending in the future.

lucasbradstreet14:12:40

Not that I'm suggesting you start mucking with things now that you have it working nicely :D

michaeldrogalis15:12:34

@robert-stuttaford: Nice man, congrats

robert-stuttaford16:12:16

we’re going multi-node next, and then we’ll look at raising that value

lucasbradstreet16:12:29

Sounds reasonable. I guess the key is how it performs under load / on hard data.

lucasbradstreet16:12:29

With datomic you could probably run experiments for that pretty easily in your test env by setting the starting tx suitably low and seeing how it handles the catch up.

lucasbradstreet16:12:06

Assuming you have something there that looks like the real db

lucasbradstreet16:12:17

That's the awesome thing about log processing

greywolve17:12:47

we have some simulation tests going already, working on one that will let us go back to a previous backup, and replay transactions up till whatever point beyond that, at the same real time they occurred. and then adding a param to vary that replay speed. should be interesting. have a sim test currently that just spams it with chat events, on this i7 macbook, with chrome etc running in the background, the system can handle around 120 users concurrently no problem, which is around ~1500 transactions a minute. i'm really happy with that, thanks so much.

greywolve17:12:30

also working to get a close to production test setup going on aws, so we can do proper sim tests

lucasbradstreet17:12:39

That’s sweet

lucasbradstreet17:12:44

Exactly what you need!

lucasbradstreet17:12:52

It’ll be great when you implement new features

greywolve17:12:00

for sure, we'd like to not have a repeat of the past, haha

robert-stuttaford17:12:39

amen to that
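
The replay test greywolve describes above (re-running recorded transactions at the pace they originally occurred, with a parameter to vary the speed) could be sketched roughly as follows. This is only an illustration in Python, not their implementation: `replay`, `speed`, `sleep`, and `handle` are hypothetical names, and reading the recorded transactions from a backup is not shown.

```python
import time


def replay(events, speed=1.0, sleep=time.sleep, handle=print):
    """Replay (timestamp_seconds, payload) pairs, preserving the gaps between them.

    speed=1.0 replays at the original pace; speed=10.0 replays ten times faster.
    `sleep` and `handle` are injectable so the pacing can be tested without waiting.
    """
    if speed <= 0:
        raise ValueError("speed must be positive")
    previous_ts = None
    for ts, payload in events:
        if previous_ts is not None:
            # Wait for the recorded gap, scaled by the replay speed.
            gap = max(0.0, ts - previous_ts)
            sleep(gap / speed)
        handle(payload)
        previous_ts = ts


if __name__ == "__main__":
    recorded = [(0, "tx-1"), (2, "tx-2"), (5, "tx-3")]
    replay(recorded, speed=100.0)  # original 5 seconds replayed in about 0.05 seconds
```

Sleeping per gap is the simplest approach; it ignores the time spent in `handle`, so a long replay would drift slightly behind the recorded schedule.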

greywolve19:12:35

@lucasbradstreet, question, do you know what the pros and cons are of doing the onyx-metrics approach, of summing and doing percentiles etc within the app, and then just sending these to something to plot them, versus not doing any calculations inside the app, but rather sending them directly to something like statsd to do the summing / percentiles etc, for you? i'm just wondering where one would be better than the other.

lucasbradstreet19:12:52

for high throughput work, we’d never be able to send that many events without a hit

lucasbradstreet19:12:12

so we really have no choice but to coalesce them in the app

lucasbradstreet19:12:19

I think for the current work you’re doing, you could probably send all the events and not take too much of a hit

lucasbradstreet19:12:55

For StatsD, are you guys going to initially do what we do and transform the events that the main metrics lifecycle puts on the channel, and send them out? As in here: https://github.com/onyx-platform/onyx-metrics/blob/0.8.x/src/onyx/lifecycle/metrics/riemann.clj

lucasbradstreet19:12:24

I really have to clean up https://github.com/onyx-platform/onyx-metrics/blob/0.8.x/src/onyx/lifecycle/metrics/metrics.clj at some point

lucasbradstreet19:12:01

@greywolve: back to the first point, we’re processing on the order of 3M segments a second on some of our benchmark tests, where it would be pretty costly to send that order of events to riemann

lucasbradstreet19:12:58

sending out a whole event via TCP, versus adding one new measurement to an interval-metrics reservoir are very different orders of cost

greywolve19:12:53

ahhh that makes sense
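
The trade-off discussed above, coalescing measurements in the app versus sending every event over the network, can be illustrated with a small sketch. It is written in Python for brevity and is not the onyx-metrics or interval-metrics code: `IntervalReservoir`, `record`, `flush`, and `percentile` are hypothetical names, and the nearest-rank percentile is just one simple choice.

```python
class IntervalReservoir:
    """Collects raw measurements in memory and summarises them once per interval."""

    def __init__(self):
        self._values = []

    def record(self, value):
        # Cheap per-event work: one list append, no network I/O.
        self._values.append(value)

    def flush(self):
        """Return a summary of the current interval and reset the reservoir."""
        values, self._values = sorted(self._values), []
        if not values:
            return None
        return {
            "count": len(values),
            "sum": sum(values),
            "max": values[-1],
            "p50": percentile(values, 50),
            "p99": percentile(values, 99),
        }


def percentile(sorted_values, pct):
    # Nearest-rank percentile over an already sorted, non-empty list.
    # Integer ceiling division avoids floating-point rounding surprises.
    rank = max(1, (pct * len(sorted_values) + 99) // 100)
    return sorted_values[rank - 1]


if __name__ == "__main__":
    reservoir = IntervalReservoir()
    for latency_ms in [5, 1, 3, 2, 4]:
        reservoir.record(latency_ms)
    # One small summary per interval would be sent out, instead of five events.
    print(reservoir.flush())
```

With this shape the per-event cost stays constant and local, and only one summary per interval has to cross the network, however many segments were processed in that interval.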

michaeldrogalis23:12:08

A little peek at what's coming out next: https://gist.github.com/MichaelDrogalis/9f6109703c660789839b

michaeldrogalis23:12:31

Automated transfer of data across storage mediums through Onyx. Removes the grunt work.
