#onyx
2017-10-25
ben.mumford10:10:54

hi all, i have a project that consumes from a kafka topic. however when i kill the job and resubmit it starts from the beginning again. the job may change in future (new tasks, different flow conditions etc) but i still want to preserve where my application got to. i can't understand the resume point documentation at all. any clues how to get started?

ben.mumford10:10:29

I also can't find any examples, and there's nothing in the workshop

gardnervickers11:10:07

@ben.mumford Are you setting :kafka/offset-reset?

ben.mumford12:10:24

i have to set this otherwise i get a schema exception when i submit the job

ben.mumford12:10:23

i think that value is used if no offset can be found (my interpretation of the docs)

gardnervickers13:10:21

@ben.mumford are you setting a :kafka/group-id? This is on Onyx 0.11 I presume?

ben.mumford13:10:42

i am using 0.11

ben.mumford13:10:50

i don't see that option in the documentation

ben.mumford13:10:06

is the documentation out of date?

ben.mumford13:10:27

having a randomly generated group id would explain why it keeps going back to the beginning! 🙂

ben.mumford13:10:29

i'll try that now

gardnervickers13:10:25

No it's not randomly generated, it actually defaults to "onyx".

ben.mumford13:10:35

then setting to something else won't solve my problem will it?

ben.mumford14:10:28

yeah, tried setting the group-id, no difference

ben.mumford14:10:42

still reprocesses from the beginning of the topic

gardnervickers14:10:49

@ben.mumford Yea that’s strange behavior. Could you post your onyx-kafka catalog entry?

ben.mumford14:10:15

from my config edn:

:input-config {:onyx/name :in
               :kafka/topic "input"
               :kafka/zookeeper "127.0.0.1:2181"
               :kafka/offset-reset :earliest
               :kafka/group-id "wrangler"
               :kafka/deserializer-fn :wrangler.utils.shared/deserialize-message-json
               :onyx/batch-size 10
               :onyx/batch-timeout 50
               :onyx/n-peers 1}

then i call (add-task (kafka-task/consumer :in input-config))

gardnervickers14:10:51

And you’re running zookeeper externally right?

gardnervickers14:10:56

So the offsets are being committed for the "wrangler" consumer group between job runs.

michaeldrogalis15:10:41

@ben.mumford Docs are up-to-date as far as I know. Resume functionality works fine. Are you seeing any of the policy INFO messages in the logs? https://github.com/onyx-platform/onyx-kafka/blob/0.11.x/src/onyx/plugin/kafka.clj#L44

ben.mumford15:10:07

@michaeldrogalis the earliest/latest functionality works perfectly. i wasn't sure how to make the application resume from the same point it shut down. for example: i submit my job, post a message to the input topic, see the message appear downstream, kill the job, resubmit the job, and see the message again (this last bit i don't want)

ben.mumford15:10:32

i wasn't clear how to add a resume point for my application based on the documentation (or if i need one). the documentation i thought might be out of date was the onyx kafka one (i can't see group-id in the readme)

michaeldrogalis15:10:59

Ahh, got it. Sorry I misunderstood. Yeah, you'll want to use resume points. Roughly:

michaeldrogalis15:10:53

(->> job-id
     (onyx.api/job-snapshot-coordinates peer-config tenancy-id)
     (onyx.api/build-resume-point new-job)
     (assoc new-job :resume-point)
     (onyx.api/submit-job peer-config))

michaeldrogalis15:10:03

That's out of the docs - which I assume you saw but are still confused by

ben.mumford15:10:12

what is new-job?

michaeldrogalis15:10:26

So you're stopping your job and standing up a new one, right?

michaeldrogalis15:10:35

It's the new one 🙂

ben.mumford15:10:58

it's the map with workflow, flow-conditions, catalog etc?

ben.mumford15:10:51

so i need to add a task to my jar. so far i have start-peers and submit-job, and i need another called "resume-job" which takes the id of the job to restart

ben.mumford15:10:13

is there a way to specify the name of a job (instead of a randomly generated id)?

michaeldrogalis15:10:37

If you're adding a new function that you need on your classpath, then yeah you'll want to modify your jar. If you're only updating your job map with new parameters, you can leave your jar alone

michaeldrogalis15:10:12

Not a name, but you can submit an ID that you choose.

michaeldrogalis15:10:27

(assoc-in job [:metadata :job-id] UUID)

michaeldrogalis15:10:45

These values need to be unique, which is why it needs to be a UUID. We assume you'll pick a random one and store it on your side
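
For reference, a minimal sketch of that flow, assuming you persist the chosen id yourself (job-id-file is a hypothetical stand-in for whatever store you actually use):

(require '[onyx.api])

;; Pick a job-id up front and persist it so a later resume run can find it.
;; job-id-file is a hypothetical stand-in for your real storage.
(def job-id-file "job-id.edn")

(defn submit-with-chosen-id! [peer-config job]
  (let [job-id (java.util.UUID/randomUUID)]
    (spit job-id-file (pr-str job-id))
    (onyx.api/submit-job peer-config
                         (assoc-in job [:metadata :job-id] job-id))))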

ben.mumford15:10:11

it needs to be unique but not if i'm resuming

ben.mumford15:10:14

i think i get it

ben.mumford15:10:19

i will try again

lucasbradstreet15:10:40

It needs to be unique if you’re resuming, as it considers each job a new job that brings in state from the old job.

ben.mumford15:10:00

that makes sense

lucasbradstreet15:10:20

We need a resume points example in learn-onyx or in onyx-examples

ben.mumford15:10:21

i think i know what i'm doing (although i've been wrong before)

ben.mumford15:10:32

yeah, that would be great

ben.mumford15:10:26

to conclude: my main method will have three actions: start-peers, submit-job and (the new) resume-job, which takes a pre-existing job to restart and runs michael's code
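
A hedged sketch of what that resume-job action could look like, combining the snippets above; old-job-id and tenancy-id are assumptions about how your config and storage are laid out:

(require '[onyx.api])

;; Sketch of a resume-job entry point: fetch the old job's latest snapshot
;; coordinates, build a resume point for the new job, and submit it under a
;; fresh job-id. old-job-id/tenancy-id come from your own config or storage.
(defn resume-job! [peer-config tenancy-id old-job-id new-job]
  (let [coords       (onyx.api/job-snapshot-coordinates peer-config tenancy-id old-job-id)
        resume-point (onyx.api/build-resume-point new-job coords)]
    (onyx.api/submit-job peer-config
                         (-> new-job
                             (assoc :resume-point resume-point)
                             (assoc-in [:metadata :job-id] (java.util.UUID/randomUUID))))))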

Travis20:10:28

so still running into the media driver timeout exception with an idle job (roughly 4hrs before death). Is there anything we need to tweak with the way the media driver runs? Like the threading mode? Or is that unrelated?

lucasbradstreet20:10:44

@camechis are you running in process or out of process?

Travis20:10:10

we are spinning the media driver process up in the docker container with the peer, like the template, so out of process

lucasbradstreet20:10:57

ok. Any chance the container is running out of memory and OOMing it?

lucasbradstreet20:10:33

We health check both the peer JVM and the aeron media driver. It’s easy enough to do with onyx-peer-http-query if you’re using that.

Travis20:10:23

i don’t think it’s being OOM killed. I do have onyx-peer-http-query enabled and it’s currently being scraped by prometheus.

lucasbradstreet20:10:27

Try polling /network/media-driver on peer-http-query

lucasbradstreet20:10:36

you’ll get something like this:

lucasbradstreet20:10:37

{:active true, :driver-timeout-ms 10000, :log "INFO: Aeron directory /var/folders/c5/2t4q99_53mz_c1h9hk12gn7h0000gn/T/aeron-lucas exists INFO: Aeron CnC file /var/folders/c5/2t4q99_53mz_c1h9hk12gn7h0000gn/T/aeron-lucas/cnc.dat exists INFO: Aeron toDriver consumer heartbeat is 687 ms old"}

lucasbradstreet20:10:08

Are you tracking JVM GCs?

lucasbradstreet20:10:33

The fact that it dies when idle is strange as that would usually not indicate a GC

lucasbradstreet20:10:53

best advice I can really give you is to perform a flight recording of your JVM

lucasbradstreet20:10:41

you could increase the timeout past 10s, but you’re probably papering over another issue and it’ll eventually occur anyway.

Travis20:10:32

yeah, i agree

Travis20:10:35

so i am polling the peer query endpoint but only seeing something like this:

{:status :success, :result true} 

Travis20:10:41

actually it's showing the result as nil?

lucasbradstreet20:10:30

hmm, maybe the docs are wrong 😮

lucasbradstreet20:10:21

yea, looks like that call is broken

eriktjacobsen20:10:56

Are there any guides/best practices for monitoring metrics? We have a Datadog contract so prefer that over prometheus, and I've got the JMX export working, but even without a job it runs into the 350-metric limit for datadog, and I'm curious which are the "important" ones. I can take some guesses, but seeing someone else's dashboard setup, or which beans/attributes they focus on, would be a huge help. I can release my datadog jmx conf after going through this.

lucasbradstreet20:10:20

@eriktjacobsen you’re going to have to reshape the metrics, as each peer has its own tag

lucasbradstreet20:10:05

I’m not all that familiar with datadog, but essentially you need to turn something like this:

lucasbradstreet20:10:20

throughput.peer-id.XXX.slot-id.YYY value

lucasbradstreet20:10:37

throughput {peer-id = XXX, slot-id = YYY}

lucasbradstreet20:10:55

so they are all effectively the same metric, just different workers
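
For reference, a minimal sketch of that reshaping, assuming dotted metric names shaped exactly like the example above (parse-metric is a hypothetical helper, not part of Onyx):

(require '[clojure.string :as str])

;; Hypothetical helper: turn "throughput.peer-id.XXX.slot-id.YYY" into one
;; metric name plus a tag map, so every worker reports the same metric.
(defn parse-metric [dotted]
  (let [[metric & kvs] (str/split dotted #"\.")]
    {:metric metric
     :tags   (into {} (map vec (partition 2 kvs)))}))

(parse-metric "throughput.peer-id.XXX.slot-id.YYY")
;; => {:metric "throughput", :tags {"peer-id" "XXX", "slot-id" "YYY"}}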

lucasbradstreet20:10:52

@camechis fixed that query on master. You can do /network/media-driver/active to check whether it’s active though
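
For reference, a minimal polling sketch against that endpoint, assuming clj-http is available and the server returns an EDN body shaped like {:status :success, :result true} (host and port are placeholders):

(require '[clj-http.client :as http]
         '[clojure.edn :as edn])

;; Poll onyx-peer-http-query's media-driver check; returns true while the
;; media driver is alive. Assumes an EDN body like {:status :success, :result true}.
(defn media-driver-active? [host port]
  (let [resp (http/get (str "http://" host ":" port "/network/media-driver/active"))]
    (true? (:result (edn/read-string (:body resp))))))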

Travis20:10:06

active comes back as true

Travis20:10:14

at least at the moment

Travis20:10:33

should i wait for the exception again and then check?

lucasbradstreet20:10:46

Yeah, that’s probably the best idea

lucasbradstreet20:10:53

To give a bit of context, normally the media driver should operate fine out of process and shouldn’t be subject to GCs in your main peer, so it shouldn’t really ever time out unless it dies or it wasn’t given enough memory (it doesn’t need much)

Travis20:10:38

ok, I will wait for it to die and see if the metric comes back as active or not. That should let me know for sure if it's still running or not

eriktjacobsen20:10:29

Gotcha, shouldn't be a problem as I can filter metrics by host, so two hosts can just emit a throughput metric. My primary concern is just which of the beans often break. I don't have a good mental map of how things like zookeeper.force-read-chunk, zookeeper.read-lifecycles, and zookeeper.read-origin relate to each other. For all I know, two are downstream of the other, so if one starts slowing down, they all will. Or which metrics can fluctuate but aren't worth investigating on their own until they cause higher-level problems. Just some guide to which primary metrics most people are watching

Travis20:10:39

@lucasbradstreet I see this stuff in the logs occasionally. Is this normal?

17-10-25 20:50:50 onyx-peer-383723949-vm5s9 INFO [onyx.messaging.aeron.utils:60] - Error stopping subscriber's subscription. io.aeron.exceptions.RegistrationException: Unknown Subscription: 346

lucasbradstreet21:10:02

Nope, that’s not normal. What’s probably happening there is you’re getting a bad GC and the media driver is completely timing out and collecting your subscriptions, and thus it doesn’t know about them any more

Travis21:10:34

yeah, I just noticed this time the peer restarted on me twice. trying to determine why

Travis21:10:26

exit code 143

Travis21:10:24

this time I did find one that was OOM killed

Travis21:10:02

If you're running in Kubernetes, do you guys run the media driver in a separate container in the pod, or sidecar it in a single container?

lucasbradstreet21:10:22

We’re sidecaring it in a separate container with shared shm space

Travis21:10:10

gotcha, maybe i will move it out to that. Currently I am running them in the same container, which makes it tricky to know what really died

Travis21:10:49

out of curiosity, do you have an example of setting up a shared shm space? I know that's coming next, lol

gardnervickers21:10:25

@camechis

volumes:
  - name: peersharedmem
    emptyDir:
      medium: "Memory"

gardnervickers21:10:34

That’ll give you a ramdisk

Travis21:10:43

how does it know how big?

gardnervickers21:10:08

volumeMounts:
  - name: peersharedmem
    mountPath: "/dev/shm"
Then you can include that in both your peer and media-driver container definitions in your pod template.

gardnervickers21:10:25

It shares your memory allocation

Travis21:10:29

ah, gotcha

Travis21:10:50

cool, thanks

gardnervickers21:10:03

Also, the most recent JDK has these nice options

- -XX:+UnlockExperimentalVMOptions
- -XX:+UseCGroupMemoryLimitForHeap

gardnervickers21:10:18

Which are awesome for container work, especially with Kubernetes where resources might be dynamic.

Travis21:10:29

recent as in JDK 9?

lucasbradstreet21:10:55

Latest JDK 8 works too

Travis21:10:02

so if you use those settings then you don’t have to do the trick of figuring out how much memory the container has and setting -Xmx to some percentage of that, to avoid the JVM trying to take more memory than the container has

gardnervickers22:10:37

Which frees you up to use resource request/limit ranges in Kubernetes

Travis22:10:22

Nice, will give it a go