#onyx
2017-12-14
danielstockton11:12:50

I'm trying to send data to an AWS Kinesis stream and write the output to S3: https://gist.github.com/danielstockton/1004aed11873daa4730199829f2bef19 Can't seem to get it working, whereas I do get output onto a core.async channel. I'm thinking there is an exception somewhere, but feedback-exception! causes the job to hang.
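
For reference, feedback-exception! from onyx.test-helper blocks until the job completes and rethrows any exception a task threw; with an unbounded input like Kinesis the job never completes, which by itself would explain the hang. A minimal sketch (the two-argument arity is from memory, so worth double-checking against onyx.test-helper):

(require '[onyx.api]
         '[onyx.test-helper :refer [feedback-exception!]])

;; Submit the job, then block until it completes, rethrowing any
;; exception a task threw. With a streaming input the job never
;; completes, so this call never returns.
(let [{:keys [job-id]} (onyx.api/submit-job peer-config job)]
  (feedback-exception! peer-config job-id))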

danielstockton11:12:18

Anyone know what the problem might be?

mccraigmccraig11:12:29

i keep getting
>17-12-14 11:35:16 WARN [onyx.messaging.aeron.publication-manager:79] [aeron-client-conductor] - Aeron messaging publication error: io.aeron.exceptions.ConductorServiceTimeoutException: Timeout between service calls over 5000000000ns
after peers have been running for a few days

mccraigmccraig11:12:42

are there any known issues which could cause this to happen ?

jasonbell11:12:59

@mccraigmccraig Are the heartbeats timing out?

jasonbell11:12:31

They wouldn’t at that point I don’t think.

mccraigmccraig11:12:15

i've got three containers running onyx peer processes, with aeron running in a separate process managed by the s6-overlay, which is the recommended configuration i think

mccraigmccraig11:12:05

the container logging the aeron timeouts seems to stop functioning as a peer, unsurprisingly

michaeldrogalis16:12:43

@danielstockton I assume no exceptions in the logs? First thing to check would be that your AWS keys are getting picked up by the S3 writer.

michaeldrogalis16:12:26

Assuming they are, the next thing you can do -- if you're local -- is to drop a debug function here: https://gist.github.com/danielstockton/1004aed11873daa4730199829f2bef19#file-etl-clj-L23

michaeldrogalis16:12:39

e.g. (fn [x] (prn x) x). Remember to return the segment
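
Spelled out as a named task function (the namespace and fn name here are hypothetical), that debug shim is just:

(ns my.etl.debug) ;; hypothetical namespace

;; Pass-through task fn: prints each segment, then returns it
;; unchanged. Onyx task functions must return the segment (or a
;; vector of segments), hence the trailing segment.
(defn spy [segment]
  (prn segment)
  segment)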

michaeldrogalis16:12:09

If you're having issues in a real environment, throughput metrics will tell you where something's not flowing properly.

danielstockton08:12:39

Thanks, that should get me started!

michaeldrogalis16:12:39

@mccraigmccraig What version are you on now?

michaeldrogalis16:12:43

Also have you changed your cluster's hardware or experienced changes in data intensity? That looks like starvation of Aeron again

mccraigmccraig16:12:00

@michaeldrogalis 0.9.15 on production - we'll be upgrading with our new cluster (with newer docker and kafka which both required upgrades), but i'm stuck on 0.9.15 for the moment

mccraigmccraig16:12:09

there has been a considerable increase in data intensity

michaeldrogalis16:12:05

Allocate more resources to the Aeron container.

mccraigmccraig16:12:29

i'm currently giving 2.5GB to the container, with a 0.6 fraction (1.5GB) of that going to peer heap and 0.2 (0.5GB) going to aeron... the onyx peers don't seem to need anything like that amount of heap though - they only seem to use a couple of hundred MB according to yourkit, so i could give 1.5GB to aeron easily enough - does that seem reasonable ?

michaeldrogalis16:12:52

Yeah, that would likely help. I'd have to look at the metrics to give you a good answer, but it's definitely in the right direction.

michaeldrogalis16:12:23

Running them in separate containers would help more if your setup supports it.

mccraigmccraig16:12:09

running onyx and aeron in separate containers. hmm. i think that would be difficult with my current setup (mesos+marathon) but might be feasible on the new cluster (dc/os) with pods

michaeldrogalis16:12:52

Ah, wasn't sure what you meant by your last message when you said "the container". Got it

lucasbradstreet17:12:07

@mccraigmccraig you’re probably hitting GCs which are causing timeouts in the Aeron conductor. You could give it a bit more RAM, and you can increase the conductor service timeout so that it doesn’t time out quite so easily. Taking a flight recorder log would help you diagnose it further

lucasbradstreet17:12:06

Unfortunately I can only help so much with 0.9 issues

eriktjacobsen19:12:55

My understanding is that the onyx kafka input task manages its own offsets in the format partition->offset, like {0 50, 1 90}, which can be passed in as :kafka/start-offsets. I'm curious, for a given snapshot / resume point, how would I get that offset map? For instance I have a resume point:

:in
 {:input
  {:tenancy-id "2",
   :job-id #uuid "23709a1c-ae8b-0b3b-a1a5-5bce4cc935cb",
   :replica-version 25,
   :epoch 144223,
   :created-at 1512603190357,
   :mode :resume,
   :task-id :in,
   :slot-migration :direct}},

What path would I look at in ZK to get the actual offsets? (I have S3 checkpointing on, if it's in the checkpoint.)
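
For context, that start-offsets map plugs into the kafka input catalog entry roughly like so -- a sketch, with the topic name and surrounding keys illustrative; check the onyx-kafka README for your version:

{:onyx/name :in
 :onyx/plugin :onyx.plugin.kafka/read-messages
 :onyx/type :input
 :onyx/medium :kafka
 :kafka/topic "events"               ;; hypothetical topic
 :kafka/start-offsets {0 50, 1 90}   ;; partition -> offset, as above
 :onyx/batch-size 20}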

eriktjacobsen19:12:58

Additionally, is there anything in onyx.api or an onyx lib that could pull those offsets out of ZK into my REPL env?

lucasbradstreet19:12:47

It’s in the S3 checkpoint.

lucasbradstreet20:12:42

I would love one to exist. Currently the best you can do is instantiate this: https://github.com/onyx-platform/onyx/blob/0.12.x/src/onyx/storage/s3.clj#L122
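
A rough REPL sketch of the steps involved -- the namespaces are real, but the constructor and read functions are deliberately left unnamed here, because they are better taken from the linked source than from memory:

(require '[onyx.storage.s3]     ;; the namespace at the link above
         '[onyx.checkpoint])

;; 1. Build the storage instance via the constructor at s3.clj#L122.
;; 2. Read the checkpoint stored at the resume point's coordinates:
;;    tenancy-id, job-id, replica-version, epoch, task-id, slot.
;; 3. Deserialize it; for a kafka input task the value contains the
;;    partition -> offset map.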

eriktjacobsen20:12:06

@lucasbradstreet Thanks a bunch! looking into it

lucasbradstreet20:12:43

It shouldn’t be so hard to wire up. If you have any questions let me know, but I would love it if you shared it after 🙂

eriktjacobsen20:12:20

We're trying. Going through the corporate open sourcing process in January. We also forked onyx-dashboard so you can view each task's window state; it's been super helpful for us.

lucasbradstreet20:12:35

The extension from here will be that you will be able to repartition the offsets a bit more easily

lucasbradstreet20:12:44

I’ve been considering allowing resume points to be passed as values for these sorts of circumstances, as an alternative to :kafka/start-offsets

lucasbradstreet20:12:01

:in
 {:input
  {:tenancy-id "2",
   :job-id #uuid "23709a1c-ae8b-0b3b-a1a5-5bce4cc935cb",
   :replica-version 25,
   :epoch 144223,
   :created-at 1512603190357,
   :mode :value,
   :task-id :in,
   :value {0 {0 35} 1 {1 99}}
   :slot-migration :direct}},

eriktjacobsen20:12:44

That would definitely be helpful as well

lucasbradstreet20:12:05

Would remove the complexity from the plugin side

niamu22:12:37

So, I’ve been using the new :reduce type and noticed that it breaks the visualization graph in the onyx-visualization library and therefore the dashboard as well. If I were to open a pull request to fix that in onyx-visualization, should it share the same node colour as a :function?

michaeldrogalis22:12:30

@niamu Sure, works for us.

michaeldrogalis22:12:35

Happy to merge it when it's ready.

niamu23:12:34

Are there any long-term plans for the onyx-visualization library? I’ve been talking with a coworker about making a fork of the library that would visualize flow conditions and other components of the job as well.

lucasbradstreet23:12:35

onyx-dashboard and onyx-visualization have taken a back seat to all the plugins and helper utilities, but we do love PRs

lucasbradstreet23:12:51

Sorry about the reduce breakage :/

niamu23:12:10

No worries. We’re still actively working on deploying our first Onyx job, so it wasn’t a big issue for us, just something we noticed that was easily fixed.

lucasbradstreet23:12:49

How are you finding the reduce type?

niamu23:12:37

It’s great. Perfectly solved the hack of using a flow condition to stop the segments from flowing downstream in our pipeline.
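
For anyone following along, the shape of the fix is that a terminal aggregating task can now be declared directly, instead of being a :function task with a flow condition that drops every segment. An illustrative catalog entry (the task name and batch size are made up):

;; Terminal windowed task using the 0.12 :reduce type -- it keeps
;; window/aggregation state but emits nothing downstream, so no
;; segment-dropping flow condition and no null output plugin.
{:onyx/name :aggregate
 :onyx/type :reduce
 :onyx/batch-size 20}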

niamu23:12:26

We started working in earnest with Onyx during the 0.12 release, so noticing that in the release notes let us fix the portion of the job we found a little ugly.

lucasbradstreet23:12:28

Nice. Yes, it was about time for a solution to that. It actually solved a number of issues quite cleanly (e.g. needing to use a null plugin for aggregating terminal tasks). Glad it worked out for you too.

niamu23:12:54

Today we started our first steps migrating from docker-compose to a Kubernetes deployment and we’ve encountered a problem that we’re not sure how to debug involving the Aeron media driver not starting.

niamu23:12:57

>17-12-14 21:21:14 onyx-peer-79d6d498d4-b8c6r WARN [onyx.peer.peer-group-manager:277] - Aeron media driver has not started up. Waiting for media driver before starting peers, and backing off for 500ms.

niamu23:12:10

Any thoughts?

lucasbradstreet23:12:59

Sounds like you’re not starting a media driver when you start the peers. We run it in a sidecar container, sharing memory between the two containers. If you want to get unstuck for now you can start the embedded driver (search the cheat sheet for the right peer config)
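
The peer-config toggle in question should be the embedded-driver flag -- from memory, so verify against the cheat sheet for your version:

;; Runs the Aeron media driver inside the peer JVM. Handy for
;; getting unstuck locally; a separate driver process/container
;; remains the recommended production setup.
{:onyx.messaging/impl :aeron
 :onyx.messaging.aeron/embedded-driver? true}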

lucasbradstreet23:12:37

I don’t think we have an example for the sidecar up anywhere yet. @gardnervickers?

niamu23:12:24

We’ve been following the manifests from the onyx-twitter-sample (https://github.com/onyx-platform/onyx-twitter-sample/tree/master/kubernetes) but it sounds like there are improvements to be made in that process.

gardnervickers23:12:23

@niamu are you using Helm?

niamu23:12:48

First I’ve heard of it actually.

gardnervickers23:12:02

We need some examples soon for compiling/running the peer sidecar container. It consists of running two containers in a single pod, sharing /dev/shm as a type: Memory volume.

niamu23:12:34

Ok, sounds similar to what is defined already in the sample manifests here: https://github.com/onyx-platform/onyx-twitter-sample/blob/master/kubernetes/peer.deployment.yaml#L26

niamu23:12:16

Apart from the two containers in a single pod bit.

gardnervickers23:12:15

Essentially you don’t want to run multiple processes in a single container, so splitting out the template docker container into two containers is best practice.

niamu23:12:33

Oh I see. So we just need a container whose sole job is to start the media driver, and share its volume with the other container in the pod that runs the peer process.

niamu23:12:34

Thanks a lot. I’ll try that out tomorrow and see how far I get.

lucasbradstreet23:12:32

@niamu just a heads up for when you go to prod with k8s and onyx. I highly recommend wiring up health checks to https://github.com/onyx-platform/onyx-peer-http-query#route
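
Wiring that up means enabling the query/health HTTP server in peer-config; from memory (verify against the onyx-peer-http-query README linked above) it looks like:

;; Starts the HTTP health/query server alongside the peers, giving
;; k8s liveness/readiness probes an endpoint to hit.
{:onyx.query/server? true
 :onyx.query.server/ip "0.0.0.0"
 :onyx.query.server/port 8080}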

niamu23:12:45

Yes, I believe I saw you recommend that a while back to someone else. I have that bookmarked to revisit. 🙂

lucasbradstreet23:12:51

Good good. Ah, I forgot that I collected all of this stuff into: http://www.onyxplatform.org/docs/user-guide/0.12.x/#production-check-list. Less of a reason to tell everyone 🙂