#onyx
2017-02-28
jasonbell10:02:28

Is there a calculation for the Aeron buffer size based on the message size and task count?

jasonbell10:02:58

If I have n peers processing x bytes per message, I should really be tuning Aeron to z

lucasbradstreet10:02:00

I believe it's approximately aeron.term.buffer.length ~= onyx/batch-size (or onyx/write-batch-size if used) * segment size

lucasbradstreet10:02:38

I.e. the buffer needs to be big enough to hold a batch of segments.
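
A rough sizing sketch of that rule of thumb in Clojure; the batch size and ~1MB segment size below are just the figures from this conversation, not defaults:

(defn suggested-term-buffer-bytes
  "Rule of thumb from above: the term buffer should be able to hold
   roughly one full batch of segments. Aeron expects term buffer
   lengths to be a power of two, so treat this as a lower bound to
   round up from."
  [batch-size segment-bytes]
  (* batch-size segment-bytes))

;; e.g. a batch of one ~1MB segment:
(suggested-term-buffer-bytes 1 (* 1024 1024)) ;; => 1048576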

jasonbell10:02:15

Does that apply on every task, or just on any :input types?

lucasbradstreet10:02:14

On any task. Are you trying to figure out how big the buffers should be, or how big your shm space should be?

jasonbell10:02:17

With shm-size at 1536m on a 3GB container (20% Xmx for the media driver and 30% for the peers, leaving 50% for Docker, the OS etc.) and aeron.term.buffer.length at 64MB (16MB is the default), Docker complains about shared memory being exhausted.

jasonbell10:02:26

So I’m just figuring out what’s best.

lucasbradstreet10:02:24

Out of interest, why did you increase the term buffer length?

jasonbell10:02:36

Messages could be up to 1MB apiece, across 8 tasks (3-partition Kafka) and the rest of the workflow.

jasonbell10:02:02

So I was interested to see what happens when you increase that buffer.

jasonbell10:02:29

Currently the system is dying as it runs out of shared memory in the container.

jasonbell10:02:37

Obviously running 0.10 has added overhead, with metrics running etc.

lucasbradstreet10:02:42

OK, so there will be a 3 × term-buffer-size log for each task-to-node/task connection, on both the client and the server.

jasonbell10:02:37

gotcha, that's good to know

jasonbell11:02:33

64 × 3 × 8 = 1536, so that explains it then
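
As a back-of-envelope sketch of that arithmetic (assumption: each Aeron log is roughly 3 × term.buffer.length, and the log count is something you derive from your own job's task placement):

(defn estimated-shm-mb
  "Rough shared-memory requirement: 3 x term.buffer.length per log,
   times the number of logs your connections need."
  [term-buffer-mb n-logs]
  (* 3 term-buffer-mb n-logs))

(estimated-shm-mb 64 8) ;; => 1536, the figure above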

lucasbradstreet11:02:50

So if you have a job [[:A :B] [:A :C] [:B :C]], and A and B are on node1 and C is on node2, then you will need (3 * term.buffer.length) * 2 (pub and sub) for A->B and A->C

lucasbradstreet11:02:59

A->B and A->C can use the same logs

jasonbell11:02:25

yeah I understand, nice to see it that way

lucasbradstreet11:02:32

Sorry, there’s a mistake there, but you get the point

jasonbell11:02:39

I get the point

lucasbradstreet11:02:02

so you probably don’t need 64MB term buffer lengths

jasonbell11:02:17

I've set it back to 16MB

lucasbradstreet11:02:35

you will probably want to dial back onyx/batch-size or onyx/write-batch-size though

jasonbell11:02:02

well that batch size is 1

jasonbell11:02:23

what I'm dealing with here are files basically.

jasonbell11:02:35

so I want to maintain one file at a time in the workflow
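
A minimal catalog-entry sketch for that case, with a batch size of one large segment; the task name and function here are hypothetical placeholders, not from the conversation:

{:onyx/name :process-file            ;; hypothetical task name
 :onyx/fn :my.app/process-file       ;; hypothetical function
 :onyx/type :function
 :onyx/batch-size 1                  ;; one ~1MB file-sized segment per batch
 :onyx/doc "Processes one file-sized segment at a time."}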

lucasbradstreet11:02:18

The memory consumed by the Aeron buffers has increased, since we no longer multiplex all task-to-task connections over a single network image. This should have better QoS and performance properties, but it is quite a bit harder to tune, especially because you need to worry about SHM size.

lucasbradstreet11:02:41

I’ll have to write a document describing how to tune it before we release

lucasbradstreet11:02:45

Sleep time for me

jasonbell11:02:55

@lucasbradstreet thanks for the feedback

jasonbell11:02:04

Enjoy your rest.

jasonbell11:02:14

Just ran another test that has settled down; I think it got to the end of the offsets in Kafka without any real issues.

17-02-28 11:53:08 334bf4bc63ed INFO [onyx.peer.coordinator:284] - Coordinator stopped.
17-02-28 11:53:08 334bf4bc63ed INFO [onyx.messaging.aeron.publisher:84] - Stopping publisher {:session-id -1506935240, :slot-id -1, :src-peer-id #uuid "1abaf64c-5a49-eb0f-0a13-7fc0166d38bd", :site {:address "localhost", :port 40200, :aeron/peer-task-id nil}, :pos 7534752, :rv 416, :e 11, :stream-id 1782891076, :dst-channel "aeron:udp?endpoint=localhost:40200", :short-id 10, :ready? true, :dst-task-id [#uuid "bce08049-b619-cb58-0930-4697af32b054" :out-processed]}
17-02-28 11:53:08 334bf4bc63ed INFO [onyx.messaging.aeron.endpoint-status:58] - Stopping endpoint status [#uuid "1abaf64c-5a49-eb0f-0a13-7fc0166d38bd"]
17-02-28 11:53:08 334bf4bc63ed INFO [onyx.messaging.aeron.publisher:84] - Stopping publisher {:session-id -1506935241, :slot-id -1, :src-peer-id #uuid "1abaf64c-5a49-eb0f-0a13-7fc0166d38bd", :site {:address "localhost", :port 40200, :aeron/peer-task-id nil}, :pos 390848, :rv 416, :e 11, :stream-id 2043388692, :dst-channel "aeron:udp?endpoint=localhost:40200", :short-id 8, :ready? true, :dst-task-id [#uuid "bce08049-b619-cb58-0930-4697af32b054" :out-error]}
17-02-28 11:53:08 334bf4bc63ed INFO [onyx.messaging.aeron.endpoint-status:58] - Stopping endpoint status [#uuid "1abaf64c-5a49-eb0f-0a13-7fc0166d38bd"]
17-02-28 11:53:08 334bf4bc63ed INFO [onyx.messaging.aeron.subscriber:103] - Stopping subscriber [[#uuid "bce08049-b619-cb58-0930-4697af32b054" :cheapest-flight] -1 {:address "localhost", :port 40200, :aeron/peer-task-id nil}] :subscription 328
17-02-28 11:53:08 334bf4bc63ed INFO [onyx.messaging.aeron.status-publisher:33] - Closing status pub. {:completed? false, :src-peer-id #uuid "1abaf64c-5a49-eb0f-0a13-7fc0166d38bd", :site {:address "localhost", :port 40200, :aeron/peer-task-id nil}, :blocked? false, :pos 1054304, :type :status-publisher, :stream-id 0, :dst-channel "aeron:udp?endpoint=localhost:40200", :dst-peer-id #uuid "a5c04374-4e8c-3045-1841-49f9110e1dfc", :dst-session-id -1506935243, :short-id 4, :status-session-id -1506935239}
17-02-28 11:53:08 334bf4bc63ed INFO [onyx.messaging.aeron.status-publisher:33] - Closing status pub. {:completed? false, :src-peer-id #uuid "1abaf64c-5a49-eb0f-0a13-7fc0166d38bd", :site {:address "localhost", :port 40200, :aeron/peer-task-id nil}, :blocked? false, :pos 1054304, :type :status-publisher, :stream-id 0, :dst-channel "aeron:udp?endpoint=localhost:40200", :dst-peer-id #uuid "9fc0dc04-361d-c511-bcfa-048f6f53fd83", :dst-session-id -1506935243, :short-id 3, :status-session-id -1506935239}
My one concern is the 1x CPU allocated; Mesos was having to feed it 3.3 CPUs because of the throughput. Running docker stats showed CPU% averaging 1300% under heavy load.

jcf12:02:06

Is anyone using the Datadog integration with Onyx Metrics? I'm not seeing any metrics reach Datadog from Onyx, but I can send metrics using Cognician's library directly.

jcf12:02:24

Just wondering if the Datadog integration has been tested by a few people.

jcf12:02:24

This lifecycle looks right to me:

{:dogstatsd/global-sample-rate 1.0,
 :dogstatsd/global-tags ["myapp" "dev"],
 :dogstatsd/url "10.20.0.249:8125",
 :lifecycle/calls :onyx.lifecycle.metrics.metrics/calls,
 :lifecycle/doc "Instruments all tasks, and submits to Riemann.",
 :lifecycle/task :all,
 :metrics/buffer-capacity 10000,
 :metrics/sender-fn :onyx.metrics.dogstatsd/dogstatsd-sender}

jcf12:02:37

Apart from the doc! 😉
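
For reference, a sketch of how a lifecycle map like the one above is typically attached to a job at submission time; the peer-config, workflow, catalog and metrics-lifecycle vars are placeholders for your own definitions:

(require '[onyx.api :as onyx])

(onyx/submit-job
 peer-config
 {:workflow workflow                    ;; placeholder workflow
  :catalog catalog                      ;; placeholder catalog
  :lifecycles [metrics-lifecycle]       ;; the dogstatsd lifecycle map shown above
  :task-scheduler :onyx.task-scheduler/balanced})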

robert-stuttaford17:02:54

@jcf what you’re doing matches what we’re doing

jcf17:02:04

Thanks @robert-stuttaford. Restarting the DD agent and the Onyx peers seems to have fixed things.

michaeldrogalis18:02:09

Sooo anyone else twiddling their thumbs with the S3 outage?

Travis18:02:37

well that sucks