2015-11-30
@lucasbradstreet: curious - is a metric that shows the current pending segment count available? if not, is it possible?
that’s the gas pedal we’re using to control whether we retry or not. we’ve found a number of ways to optimise our code, but at some point, we’re going to reach the scaling limits of our system again. thinking of ways to predict outages rather than just respond to them after-the-fact, and i’m thinking that monitoring the pending count might help
I’d really like to do that as part of metrics. The main problem that I can see is that it’ll depend on all the input plugins putting the pending-messages map in the same place
Actually, I think I have a better idea
We just use a counter that is increased as we read a batch, and decremented as we ack and retry
I’ve been wanting to do this for a long time
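(For illustration, a minimal sketch of that inc-on-read, dec-on-ack/retry counter; the names here are hypothetical, not the eventual onyx-metrics code.)
```clojure
;; Hypothetical sketch of the pending-count idea: a per-task counter bumped when
;; a batch is read from the input medium and decremented as segments are acked
;; or retried. The metrics sender would periodically report its current value.
(def pending-count (atom 0))

(defn on-read-batch [segments]
  (swap! pending-count + (count segments)))

(defn on-ack [n]
  (swap! pending-count - n))

(defn on-retry [n]
  (swap! pending-count - n))
```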
Something else you might want to do is monitor how many txes the log reader is behind the current basis-t
I would suggest doing that as a measure of txes rather than time, because otherwise you’d need perfect clock sync
Tho I guess you could look up the time of your current tx and compare it to the latest tx, then both timestamps would be transactor timestamps
@robert-stuttaford: I'll implement a pending count and I'd accept a patch for the latter
@robert-stuttaford: how’s the 0.8.0 conversion going? Are you in prod yet?
yes, i want to monitor that time gap as well. we’ve been co-plotting read-log throughput with cloudwatch transaction sum, but they rarely correlate at all
0.8.2 is looking good. we’re still in stress-testing. hope to be shipping the new code tomorrow
really just want to establish that the new code is indeed faster, and by how much
(i’m 100% certain it is faster, but not at all sure by how much)
No worries, just not trying to rush you at all. It’s just good to know when a version is in real use.
I’ll try to get the pending count done tonight so you can try to include it in your next release
assuming the technique I mentioned will work
the inc-when-read, dec-when-ack approach is good
we need to look into how we might do the second one
There are a few different ways. Happy to chat with you about them when you start thinking about doing it
the read-log plugin would have to do it, right. if, on reading a batch from the tx-log, it could log a metric of how many txes there are beyond its ‘to’ value, we could monitor that count
e.g. 25 new txes, it reads 15, the count is 10
I’m thinking what we do is track it via the commit loop, which will track the highest completely acked tx
Which we already do
Then periodically (in another thread or something), we get the basis-t from the db and compare it to the highest completely acked tx, and output a metric to the channel used by the metrics plugin
You could do both highest acked and highest read if you want
(- latest-basis-t highest-acked) is a measure of how far you’ve completed
(- latest-basis-t highest-read) is a measure of how far you’ve started
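(A rough sketch of those two measures, assuming the job keeps atoms with the highest completely acked and highest read t; d/basis-t is Datomic’s basis-t for a db value.)
```clojure
(require '[datomic.api :as d])

;; highest-acked-t and highest-read-t are hypothetical atoms maintained by the
;; job (e.g. updated from the commit loop as described above).
(defn log-lag [conn highest-acked-t highest-read-t]
  (let [latest-basis-t (d/basis-t (d/db conn))]
    {:completed-lag (- latest-basis-t @highest-acked-t)   ; how far you've completed
     :started-lag   (- latest-basis-t @highest-read-t)})) ; how far you've started
```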
awesome
i’d be totally ok with just the first number - how far we’ve completed
as the ones in ‘the middle’ would be covered by the pending count we’ve already discussed
Yep true
@robert-stuttaford: onyx-metrics 0.8.2.4 has been released with support for pending message count
@robert-stuttaford: release notes https://github.com/onyx-platform/onyx-metrics/blob/0.8.x/changes.md#0823
wonderful thanks, @lucasbradstreet !
looks like we have some dashboard changes too - .
Yep, that's right
@lucasbradstreet: i’d like you and @michaeldrogalis to consider adding bounded retry
i worry that we’re always going to be on the back foot, the way it’s set up now
is it possible to do so in a manner that makes it optional?
the reason why is whenever we encounter retries, we end up having to restart and lower our max-pending
and, in that situation, the process almost never exits gracefully
meaning a kill -9
and then a manual cleanup to find and re-process the stuff that was busy
I think the best way to do it would be via a lifecycle call that’s similar to the restart fn call. Then you can decide whether to keep it, and if not write it out somewhere
that’s great, if we can say that it happens after N retries
Yeah, you’d code some kind of monitoring into it
It’d be up to you though
I’m also not sold on the idea yet
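(Purely as a sketch of what user-level bounded retry could look like, with hypothetical names and no actual Onyx API: count retries per message id and dead-letter a segment after N attempts.)
```clojure
(def retry-counts (atom {}))
(def max-retries 3)

(defn handle-retry
  "Returns :retry while under the limit; dead-letters the segment and
   returns :drop once max-retries is exceeded. dead-letter! is whatever
   you use to persist the segment for later reprocessing."
  [message-id segment dead-letter!]
  (let [n (get (swap! retry-counts update message-id (fnil inc 0)) message-id)]
    (if (> n max-retries)
      (do (dead-letter! segment)
          (swap! retry-counts dissoc message-id)
          :drop)
      :retry)))
```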
when we have retries, our cpu is totally hosed (understandably) but it never really recovers
You should probably increase your pending-timeout
If you’re just getting data that is going unprocessed because it’s hard, you have a pending-timeout problem
(or just a general performance problem)
here’s the thing though, it feels like a couple retries immediately compounds the problem
Yeah, I understand
that is, the CPU is ok until it starts retrying, and then very quickly it’s poked
know what i mean?
because the pipeline starts backing up and up and up
That probably means that you can barely handle even the max-pending you have now
look at how sharply these mountains form
that’s a 24hr time period
yeah. i’ll agree to that.
OK, here’s how we should look at it
You’re basically using core.async messaging
Since you’re on a single box
so the pending count will help a lot. because if it approaches max-pending at all, we’re headed for trouble
Which means that your only source of retries is that you’re going too slow for all the messages that are currently being processed
ok, that makes sense
(because retries mostly exist in case you lose messages along the way, and you’re not losing any because you have a perfect medium)
So that means max-pending is probably too high, or pending-timeout is probably too low, or some mixture of the two
For example, if you increase max-pending, you probably need to increase pending-timeout to give all the segments more time because there’s going to be more queuing
the big mountain was at 250, the small mountain at 50 max-pending
ok. to increase pending-timeout we also need to increase ack-timeout
If you increase pending-timeout to a big enough figure, you won’t get a retry storm
but you will get things starting to back up if your tasks are slow
We increased the default ack-daemon-timeout to 480 seconds in 0.8.0
it was dumb to have it equivalent to the pending timeout as it gives you no room at run time
ah, i was just about to ask
:onyx.messaging/ack-daemon-timeout
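(For reference, a sketch of where those knobs live; the values and the rest of the entry are made up, but max-pending and pending-timeout sit on the input task’s catalog entry and the ack daemon timeout in the peer config.)
```clojure
(def read-log-entry
  {:onyx/name :read-log
   :onyx/type :input
   :onyx/batch-size 20
   :onyx/max-pending 50            ; cap on unacked segments in flight
   :onyx/pending-timeout 120000})  ; ms before a pending segment is retried

(def peer-config
  {:onyx.messaging/ack-daemon-timeout 480000}) ; should comfortably exceed pending-timeout
```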
@robert-stuttaford: A lot of the pain you're experiencing comes from the fact that your cluster is 1 machine. Normally I'd say add more machines to off-set the load. I know why that's currently a problem, but that's the source of a lot of the friction that's coming up.
ok. i think if we can alert on pending reaching some % of max-pending .. e.g. 60% or 80%, once we’ve established sensible workaday performance baselines, that’ll help a lot
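(That alert is a one-liner once the pending-count metric is flowing; sketch:)
```clojure
;; True when the observed pending count crosses the given fraction of max-pending.
(defn pending-alert? [pending-count max-pending threshold]
  (>= pending-count (* threshold max-pending)))

(pending-alert? 42 50 0.8) ;; => true, time to worry
```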
The other thing you need to consider is what part of your job is realtime and which part is not
For example, your slow links calculations could maybe be persisted and calculated in another job with different constraints
thus decoupling the hard work from the realtime user connected work
hey Michael! yeah. the constraint that created that situation has been lifted. we just need to get to stability and then we’ll start work on going multiple instance
Yeah, they bumped their machine
we’ve actually factored most of the work in the links task out
Oh nice. Did you end up increasing your Datomic peer limit then?
we did bump the machine, but we also have more Datomic peers now
@robert-stuttaford: "we’ve actually factored most of the work in the links task out”. Do you see the issues in this code?
we have two tasks (move scheduler to separate app, externalise aeron) and i have one question: if we multi-peer a task, there’s no ordering guarantees right?
yes. one ridiculous one, and a couple subtle ones
There’s no ordering guarantees regardless
ok. i think we need to take advantage of windows and triggers
sorry, I mean do you see the retry issue in that code?
Once you’ve decoupled it, improved it
oh, no, not at all
OK, so I guess what you’re asking for is a way for things to degrade more gracefully
Until you can fix the underlying issues?
the new code is tearing through the same work much faster than the current prod. all this discussion is in preparation for next time when we hit our next scaling issue
I think most of the issues you’ve hit are underlying issues
However, I do understand why you’d want a way for things to degrade more gracefully
We can work on better backpressure strategies, but we're definitely not going to do bounded retry. That has to come at the user level. Onyx guarantees that it's going to process the data corpus to completion - completion being defined by your app.
Yes, I forgot to mention that we have a new backpressure mechanism in the works
The old one only really worked with high throughput workloads
that’s totally cool, Michael, if i have a way to decide what to do at each retry - that i can know it’s a retry
You have a low throughput, high CPU usage workload
I get what you need though, @robert-stuttaford. It's going to be a combination of continued tuning, understanding the fault tolerance model, and us having a few more helpful options.
@robert-stuttaford: That's sensible. I'll see what we can do there.
Yeah, we’ll think about it some more
btw, aside from the mild panic, this is actually supremely fascinating
my respect for you and all the people who build such systems grows every day
Hah. Yeah, tuning is hard. I've never had an easy time doing this as the user of any system.
35 degrees Celsius here 😓
Well, feel free to let us know if you can think of any more helpful metrics. I'm almost always in favor of adding more visibility.
Eeep, hot.
cool! the one Lucas added earlier should help a crapton
I think the new planned backpressure mechanism will probably help your case too
We currently use the same one as Twitter's Heron but I think it’s very particular to certain workloads and requires tuning
@robert-stuttaford: If I were in your shoes, I would use a CloudWatch Alarm to spin up more machines for Onyx when pending-segments-count goes over a certain threshold.
Yeah, something like that is the end game
absolutely
What exactly sets off the retries for you guys? Slower than expected Datomic transactions?
slower than expected queries that create transactions
once the txes reach the :output it’s fast
queries -> magic -> transaction data
Then once the queries were slow they’d end up retrying, which ends up with bigger batches
Ah. Yup, the answer is scaling that specific piece when things get bad.
The bigger batches would then definitely take too long
I’m not so sure scaling would solve that issue though. The issue there is partially due to too large batch sizes
yeah. so we increased the peer count on those, but very likely didn’t dial the max-pending down enough
It’ll help slightly that there’s less CPU contention
@lucasbradstreet: If more peers were reading from the upstream tasks, they'd get fewer segments in their batch since they're concurrently racing to read.
Not currently
Well, sorta
You’re mostly right, except for the fact that from A->B, a whole batch -> new batch is sent to B, regardless of how big it is
You’re right in that the next batch will probably be sent to another peer
The problem is that the batches can start getting big enough that when the whole batch is sent to the next peer
it’s too big to be processed in less than pending-timeout, and then you’re screwed
that sounds familiar 😄
Ah, you're right. segs
is the whole group: https://github.com/onyx-platform/onyx/blob/0.8.x/src/onyx/peer/function.clj#L34
We have an issue for this
We could stripe the segments over the peer set to alleviate that.
Aha. 😄
We now have write queues, so the next step should be pretty easy
We just need to chunk it
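(A rough sketch of the chunking idea, not actual Onyx internals: deal fixed-size chunks of an outgoing batch across the downstream peers instead of handing the whole batch to one of them.)
```clojure
(defn stripe-batch
  "Pairs chunks of segments with downstream peers round-robin."
  [segments downstream-peers chunk-size]
  (map vector
       (cycle downstream-peers)
       (partition-all chunk-size segments)))

;; (stripe-batch [:s1 :s2 :s3 :s4 :s5] [:peer-a :peer-b] 2)
;; => ([:peer-a (:s1 :s2)] [:peer-b (:s3 :s4)] [:peer-a (:s5)])
```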
Totally got this under control, @robert-stuttaford 😛
I’ll put it on an 0.8 milestone, because I think it’s a priority
-chuckle-
@michaeldrogalis is right in that you’re probably oversubscribing your tasks vs your number of cores, and all the retries are going to cause all your cores to be overly active
27 peers, 8 cores
very likely oversubscribed
the code fixes will buy us the time we need to go multi-instance
Yeah, under normal circumstances it’s fine, but once you start getting your bad behaviour all of them will start being used at the same time
Yep, I’d still suggest increasing your pending-timeout
will do
Note that it normally comes at a cost of fault tolerance latency
i.e. a message gets lost and then takes a long time to end up being processed
I don’t think this is a problem for you currently because you’re not on a lossy medium, but it may have costs later, so I would probably comment the entry in the task
if it prevents retry storms
and allows us to eventually be consistent without manual effort, i prefer it
Yep, I would also monitor how badly you’re lagging behind the latest basis-t
we realised that datadog puts statsd on every instance, so we’re going to use that instead of riemann. Nikita is already using it for C3 metrics, it works great
You mentioned that. Have you written the sender function for that yet?
It should be really easy
not yet. was waiting for Nikita to do his 😄
Ha, good idea 😄
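(A sketch of what such a sender could look like, assuming the clj-statsd client talking to the Datadog agent’s local statsd port and a metric map with :service and :metric keys; the exact contract onyx-metrics expects from a sender isn’t shown here.)
```clojure
(require '[clj-statsd :as statsd])

(statsd/setup "127.0.0.1" 8125) ; dogstatsd listens locally on the agent

(defn send-metric!
  "Report a single measurement as a statsd gauge."
  [{:keys [service metric]}]
  (statsd/gauge (str service) metric))
```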
@lucasbradstreet, @michaeldrogalis: odd one: Capacity must be a positive power of 2 + TRAILER_LENGTH: capacity=1048448
when starting peer group
seen this before?
we have an odd number of peers (27)
you have to trash your aeron dir
It should be listed a few lines up
They’ve started versioning their data recently, so it should be more obvious
ok, cool
just murder the whole /dev/shm/aeron-deploy
?
I would suggest adding rm -rf
that dir to your startup script
We best effort delete it on shutdown but when you kill -9 there’s not much we can do
We can’t do it on startup because there can be multiple JVMs using the same media driver
But you’ll only have one
startup script, you mean instance startup?
Whatever starts your peers
ok. we have an upstart script that starts our jar, and the jar starts the peers on startup
Because presumably you’ve started your peers, killed them, and started them again on this machine
Looking
Actually, deleting after you kill the peers is probably the right approach
peer == app with onyx?
ok, will add that removal to the start script
@lucasbradstreet: is this how the metrics are named now?
Whoa, the double _ is weird
i presume these ids change on every startup
this going to make dashboards hella hard to manage
I can see where this is going
In Grafana things are more dynamic than that
trying to see if i can wildcard it
Yeah, the problem is that we need some way to segment the individual peers
chatting to ddog support about it
Otherwise when :read-log sends two 5000 measurements, we don’t know whether it’s saying the throughput was 5000 and is still 5000
we can’t be the first
or whether it’s two peers outputting 5000 throughput each
Cool, I’m happy to make changes there as long as two peers each outputting 5000 can still be summed as 5000 + 5000
yup, totally understand that
i’m guessing numbering them in a monotonic and deterministic fashion is not so easy
esp considering there might be 1000s
I suspect this is the answer
ah, yes
We currently assign tags to the metrics, but I have no idea how you get at them
via datadog
they’re there
INFO [2015-11-30 13:59:12,554] defaultEventExecutorGroup-2-1 - riemann.config - #riemann.codec.Event{:host wheee, :service [:semaphore/send-email]_c15c58e3-bf85-4671-9354-741f9f1a0f9a 60s_throughput, :state ok, :description nil, :metric 0.0, :tags [throughput_60s onyx :semaphore/send-email highstorm], :time 1448909952, :ttl nil}
https://github.com/onyx-platform/onyx-metrics/blob/0.8.x/src/onyx/lifecycle/metrics/metrics.clj#L66
yep, those tags look good
So, you probably need to figure out how to get at them in datadog
for reference, what does grafana let you do?
I have no idea how to get at the tags in grafana either
It’s getting better but honestly I’m not a fan
I was considering trying out datadog
"The solution here would be to submit all data under the same metric name” -cry-
“and add the UUID as a tag for this metric” hmm?
That’s kinda weird. It’s basically the reverse
would it work for the problem you’re solving with the uuids in the metric name?
as far as i understand, ddog allows creating graphs only by metric names, and it allows separating graphs with the same metric name by tags
an aside: pending messages count is working
yup lowl4tency i’m on with datadog support on irc. that’s what they just said too
For example, metric A allows creating 2 separate graphs with tags env:prod and env:staging
Blah. As far as I understand that’s the opposite of how Grafana works
robert-stuttaford: hah, you are quicker
i’m doing nothing else right now heh
With the datadog method, I don’t know how it adds everything up. I guess you give it a period to aggregate over?
So, say you have 5 measurements over 1 second
They’re summed?
asking...
#datadog on freenode if you want to watch
Hm, one way that comes to mind: push to the datadog api to create a graph with the actual name of the metric at each start of the app
A new name with each start of the app?
yeah that could work but it creates big time dependencies between our systems
It's a huge ugly hack 😞
highstorm shouldn’t even know datadog exists
Whereas I think Grafana may get the most recent within some time limit for each service with some sub metric, and sums them
OK, I think our main problem is we were just using a shitty version of Grafana
I see group by time period in the latest version
@michaeldrogalis is pushing our benchmark suite forward now, which means a later version of grafana
It looks like we can group by 1s and then sum, which is exactly what we want
We don’t have it up and running yet, but I’m happy to revert the [task-name]_peer-id changes
and just put the peer-id in the tags
that is a tremendous relief, thank you Lucas
No worries
I’m glad we have users
"Ok, in the API case, if more than one datapoint is submitted under the same name with the exact same tag set during the same split second 9:21 MartinDdog only the last datapoint will be stored"
same split second? i.e. two measurements in one second?
Yeah, that’s bad
well riemann is pushing on a beat so it’d always do this
yeah, but say you have two :in tasks
same tag set
not same service
if they’re tagged, we control how they’re combined
"So regarding your question, if the 5 values are sent under different tags set, if you use the sum
space-aggregator (i.e. query sum:your_metric{your_scope}
) the 5 values will be added together"
right so if you have two values, on service task-name, tags: [task-name peer-1] [task-name peer-2]
in the same second
you should have two unique values to aggregate
OK, cool
once i learn how, heh
ok. so should we avg or sum them?
sum for throughput
Average or independently plot for latency
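(Sketch of that roll-up, assuming events shaped like {:kind :throughput/:latency, :timestamp <second>, :value <n>}: sum throughput per second across peers, average latency.)
```clojure
(defn rollup [events]
  (for [[[kind ts] es] (group-by (juxt :kind :timestamp) events)]
    {:kind kind
     :timestamp ts
     :value (case kind
              :throughput (reduce + (map :value es))
              :latency    (/ (reduce + (map :value es)) (count es)))}))
```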
ok, great
we’ve had avg for both
feel the learn!
learn so hard
Releasing now
you’re the man
got it, thanks
Hah, it must have just finished the deploy
0.8.2.5
@robert-stuttaford: I just had to deal with that /dev/shm issue myself
rm -rf on media-driver startup is the right approach
not on kill
Since you don’t run an external media driver since you’re on a single machine
yeah, we’re putting it in before the jar start
you can just do it before you start the peers
Perfect
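(One way to do that from the app itself, sketched with a hypothetical helper; the path is an example. Run it before starting the peer group.)
```clojure
(require '[clojure.java.io :as io])

(defn delete-aeron-dir!
  "Recursively delete the Aeron media driver directory left behind by a kill -9."
  [path]
  (let [dir (io/file path)]
    (when (.exists dir)
      (doseq [f (reverse (file-seq dir))] ; children before parents
        (io/delete-file f true)))))

(delete-aeron-dir! "/dev/shm/aeron-deploy")
;; ... then start the peer group / peers as usual
```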
robert-stuttaford: i'll send a PR shortly
lucasbradstreet: rm -rf /dev/shm/aeron-deploy or rm -rf /dev/shm/aeron-deploy/* better?
the whole folder is fine
Lucas, thank you SO much for all your help today. truly appreciated. i must go and rest now. seeing double!
Alrighty, let us know how you go when it’s all up and running
Thanks for the constructive feedback
certainly will do. now just have to get a test server metrics dash running and smoke-test it all
then we can
we’re still getting retries at a low max-pending, but it’s recovering from them
which tells me we’re just doing way too much work and asking it to do too many at a time
Yep, I agree
pending-timeout will help with the retries but it’ll still cause lag
at the end of the day you need to make your tasks faster or scale up your boxes to cope with it
lucasbradstreet: 15-Nov-30 15:21:18 http://highstorm.cognician.com INFO [onyx.peer.task-lifecycle] - [370d49e7-1925-453d-96ff-28cffb28f114] Not enough virtual peers have warmed up to start the task yet, backing off and trying again...
could you describe this? I've got these messages and lost riemann ._.
something is very wrong
Do you see any other exceptions?
s/other/any/
15-Nov-30 15:23:46 http://highstorm.cognician.com WARN [onyx.lifecycle.metrics.riemann] - Lost riemann connection 45.55.69.56 5555
java.lang.Thread.run Thread.java: 745
java.util.concurrent.ThreadPoolExecutor$Worker.run ThreadPoolExecutor.java: 617
java.util.concurrent.ThreadPoolExecutor.runWorker ThreadPoolExecutor.java: 1142
java.util.concurrent.FutureTask.run FutureTask.java: 266
...
clojure.core/binding-conveyor-fn/fn core.clj: 1916
onyx.lifecycle.metrics.riemann/riemann-sender/fn riemann.clj: 36
onyx.lifecycle.metrics.riemann/riemann-sender/fn/fn riemann.clj: 39
riemann.client/send-event client.clj: 72
com.aphyr.riemann.client.RiemannClient.sendEvent RiemannClient.java: 115
com.aphyr.riemann.client.RiemannClient.sendMessage RiemannClient.java: 110
com.aphyr.riemann.client.TcpTransport.sendMessage TcpTransport.java: 259
com.aphyr.riemann.client.TcpTransport.sendMessage TcpTransport.java: 289
java.io.IOException: no channels available
Lots like this
Is that IP correct?
Quick google says it's network issue
Yeah, seems like it can’t connect
But I've got riemann zk and onyx on same host
Port are available, I can connect via telnet
Hmm, that was my next suggestion
And I've tried localhost and interface ip for connects
That’s very weird
Riemann config was not changed
telnetted from same machine?
I mean, the machine the peers are running on
deploy@highstorm:~$ telnet localhost 5555 Trying 127.0.0.1... Connected to localhost. Escape character is '^]'.
Try: 45.55.69.56
zk, onyx and riemann. All stuff is on one machine
deploy@highstorm:~$ telnet 45.55.69.56 5555 Trying 45.55.69.56... Connected to 45.55.69.56. Escape character is '^]'.
Hmm, it should have logged more of the exception
15-Nov-30 15:23:46 WARN [onyx.lifecycle.metrics.riemann] - Lost riemann connection 45.55.69.56 5555
java.lang.Thread.run Thread.java: 745
java.util.concurrent.ThreadPoolExecutor$Worker.run ThreadPoolExecutor.java: 617
java.util.concurrent.ThreadPoolExecutor.runWorker ThreadPoolExecutor.java: 1142
java.util.concurrent.FutureTask.run FutureTask.java: 266
...
clojure.core/binding-conveyor-fn/fn core.clj: 1916
onyx.lifecycle.metrics.riemann/riemann-sender/fn riemann.clj: 36
onyx.lifecycle.metrics.riemann/riemann-sender/fn/fn riemann.clj: 39
riemann.client/send-event client.clj: 72
com.aphyr.riemann.client.RiemannClient.sendEvent RiemannClient.java: 115
com.aphyr.riemann.client.RiemannClient.sendMessage RiemannClient.java: 110
com.aphyr.riemann.client.TcpTransport.sendMessage TcpTransport.java: 259
com.aphyr.riemann.client.TcpTransport.sendMessage TcpTransport.java: 277
org.jboss.netty.channel.AbstractChannel.write AbstractChannel.java: 248
org.jboss.netty.channel.Channels.write Channels.java: 671
org.jboss.netty.channel.Channels.write Channels.java: 704
com.aphyr.riemann.client.TcpHandler.handleDownstream TcpHandler.java: 60
org.jboss.netty.channel.Channels.write Channels.java: 686
org.jboss.netty.channel.Channels.write Channels.java: 725
org.jboss.netty.handler.codec.oneone.OneToOneEncoder.handleDownstream OneToOneEncoder.java: 59
org.jboss.netty.handler.codec.oneone.OneToOneEncoder.doEncode OneToOneEncoder.java: 66
org.jboss.netty.handler.codec.protobuf.ProtobufEncoder.encode ProtobufEncoder.java: 68
com.google.protobuf.AbstractMessageLite.toByteArray AbstractMessageLite.java: 64
com.aphyr.riemann.Proto$Msg.getSerializedSize Proto.java: 4195
com.google.protobuf.CodedOutputStream.computeMessageSize CodedOutputStream.java: 628
com.google.protobuf.CodedOutputStream.computeMessageSizeNoTag CodedOutputStream.java: 865
com.aphyr.riemann.Proto$Event.getSerializedSize Proto.java: 2102
com.google.protobuf.UnmodifiableLazyStringList.getByteString UnmodifiableLazyStringList.java: 68
com.google.protobuf.LazyStringArrayList.getByteString LazyStringArrayList.java: 187
com.google.protobuf.LazyStringArrayList.asByteString LazyStringArrayList.java: 231
java.lang.ClassCastException: java.util.UUID cannot be cast to [B
java.io.IOException: Write failed.
Ok, that’s far more helpful
OK, bad release
I’ll have a fix for you in a couple mins
lucasbradstreet: thanks a lot!
@lowl4tency: mocking in a nutshell https://pbs.twimg.com/media/CIPzMN4WEAAvoJL.jpg:large
It’s going through our release process now
All tests passed
We really need to test against a real riemann instance
I mocked that part out and it doesn’t coerce things to strings
e.g. uuids
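(The kind of fix involved, sketched with an illustrative event shape: make sure everything handed to the riemann client is a string, since the protobuf encoder can’t serialize UUIDs or keywords.)
```clojure
(defn stringify-event [event]
  (-> event
      (update :service str)
      (update :tags #(mapv str %))))
```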
0.8.2.6 is up
@lowl4tency: let me know if you hit any other issues
Gotcha, let me build and try
lucasbradstreet: onyx-metrics?
Just FYI: seems it works pretty good
My metrics are alive again
Excellent
Remember to aggregate over 1s periods now
I'm afraid I'm out of the context
Ah, Rob will know what to do. Basically you’re going to get multiple 1s throughput messages within a 1s time block
s/block/period/
For throughput these measurements should be added together
Also for retries
For latency they should either be averaged or graphed independently
Ok, I guess we will discuss it tomorrow
Yeah, as long as it’s submitting metrics
Now all works, will tune it
15-Nov-30 16:27:16 http://highstorm.cognician.com INFO [onyx.lifecycle.metrics.metrics] - Message send timeout count: 1
15-Nov-30 16:27:16 http://highstorm.cognician.com INFO [onyx.lifecycle.metrics.metrics] - Message send timeout count: 1
15-Nov-30 16:27:16 http://highstorm.cognician.com INFO [onyx.lifecycle.metrics.metrics] - Message send timeout count: 1
15-Nov-30 16:27:16 http://highstorm.cognician.com INFO [onyx.lifecycle.metrics.metrics] - Message send timeout count: 1
15-Nov-30 16:27:16 http://highstorm.cognician.com INFO [onyx.lifecycle.metrics.metrics] - Message send timeout count: 1
hm. timeout, but in rieman logs I'm watching metrics
does it mean I can miss several metrics?
It will retry them until the buffer is full at which point it will drop them
If you don't see them constantly then you're probably safe
I could probably improve the message to make it more obvious that it was retried
Are you getting timeouts reasonably consistently? Are you using a Riemann instance on the same network?
I think we should probably be batching our Riemann sends too
Anyway, occasional timeouts are ok, as long as they're not happening every second (there's a sender for each task so they can happen many times a second though)
Actually it's riemann on same machine
Yeah, you really shouldn't be getting timeouts there
Let me know if they're consistent. I'm off to sleep though