This page is not created by, affiliated with, or supported by Slack Technologies, Inc.
2017-10-25
Channels
- # aws (2)
- # aws-lambda (2)
- # beginners (95)
- # boot (47)
- # cider (13)
- # clara (5)
- # cljs-dev (36)
- # cljsjs (9)
- # clojure (51)
- # clojure-austin (1)
- # clojure-greece (25)
- # clojure-italy (4)
- # clojure-japan (10)
- # clojure-russia (13)
- # clojure-spec (61)
- # clojure-uk (25)
- # clojurescript (26)
- # core-matrix (5)
- # cursive (8)
- # data-science (7)
- # datomic (43)
- # dirac (2)
- # emacs (8)
- # events (3)
- # fulcro (17)
- # graphql (29)
- # jobs-rus (4)
- # lambdaisland (4)
- # lein-figwheel (3)
- # leiningen (60)
- # luminus (15)
- # lumo (8)
- # mount (3)
- # off-topic (28)
- # om (22)
- # onyx (115)
- # other-languages (6)
- # pedestal (5)
- # re-frame (41)
- # reagent (12)
- # ring-swagger (12)
- # shadow-cljs (127)
- # unrepl (27)
- # yada (5)
hi all, i have a project that consumes from a kafka topic. however when i kill the job and resubmit it, it starts from the beginning again. the job may change in future (new tasks, different flow conditions etc) but i still want to preserve where my application got to. i can't understand the resume point documentation at all. any clues how to get started?
I also can't find any examples and nothing in the workshop
@ben.mumford Are you setting :kafka/offset-reset?
i have to set this otherwise i get a schema exception when i submit the job
i think that value is used if no offset can be found (my interpretation of the docs)
@ben.mumford are you setting a :kafka/group-id? This is on Onyx 0.11 I presume?
i am using 0.11
i don't see that option in the documentation
is the documentation out of date?
having a randomly generated group id would explain why it keeps going back to the beginning! 🙂
i'll try that now
No it's not randomly generated, it actually defaults to "onyx".
then setting it to something else won't solve my problem, will it?
yeah, tried setting the group-id, no difference
still reprocesses from the beginning of the topic
@ben.mumford Yea that’s strange behavior. Could you post your onyx-kafka catalog entry?
from my config edn:
:input-config {:onyx/name :in
               :kafka/topic "input"
               :kafka/zookeeper "127.0.0.1:2181"
               :kafka/offset-reset :earliest
               :kafka/group-id "wrangler"
               :kafka/deserializer-fn :wrangler.utils.shared/deserialize-message-json
               :onyx/batch-size 10
               :onyx/batch-timeout 50
               :onyx/n-peers 1}
then i call (add-task (kafka-task/consumer :in input-config))
And you’re running zookeeper externally right?
So the offsets are being committed for the "wrangler" consumer group between job runs.
@ben.mumford Docs are up-to-date as far as I know. Resume functionality works fine. Are you seeing any of the policy INFO messages in the logs? https://github.com/onyx-platform/onyx-kafka/blob/0.11.x/src/onyx/plugin/kafka.clj#L44
@michaeldrogalis the earliest/latest functionality works perfectly. i wasn't sure how to make the application resume from the same point it shut down. for example i submit my job, post a message to input topic, i see the message appear downstream, kill job, resubmit job, see the message again (this bit i don't want)
i wasn't clear how to add a resume point for my application based on the documentation (or if i need one). the documentation i thought might be out of date was the onyx kafka one (i can't see group-id in the readme)
Ahh, got it. Sorry I misunderstood. Yeah, you'll want to use resume points. Roughly:
(->> job-id
     (onyx.api/job-snapshot-coordinates peer-config tenancy-id)
     (onyx.api/build-resume-point new-job)
     (assoc new-job :resume-point)
     (onyx.api/submit-job peer-config))
That's out of the docs - which I assume you saw but are still confused by
what is new-job?
So you're stopping your job and standing up a new one, right?
It's the new one 🙂
it's the map with workflow, flow-conditions, catalog etc?
Exactly, yep.
so i need to add a task to my jar, so far i have start-peers, submit-job and i need to have another called "resume-job" which takes the id of the job to restart
correct?
is there a way to specify the name of a job (instead of a randomly generated id)?
If you're adding a new function that you need on your classpath, then yeah you'll want to modify your jar. If you're only updating your job map with new parameters, you can leave your jar alone
Not a name, but you can submit an ID that you choose.
(assoc-in job [:metadata :job-id] UUID)
These values need to be unique, hence why it needs to be a UUID. We assume you'll pick a random one and store it on your side
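a minimal sketch of that, assuming you persist the id yourself somewhere (my-job here is a stub job map, not a real Onyx job):

```clojure
(import 'java.util.UUID)

;; a stub job map; a real one carries :workflow, :catalog, :lifecycles etc.
(def my-job {:workflow [[:in :out]]})

;; pick a random UUID once and store it on your side (db, file, ...)
;; so you can hand it to job-snapshot-coordinates later
(def job-id (UUID/randomUUID))

;; attach it under :metadata so the job is submitted with a known id
(def job-with-id (assoc-in my-job [:metadata :job-id] job-id))
```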
it needs to be unique but not if i'm resuming
i think i get it
i will try again
It needs to be unique if you’re resuming, as it considers each job a new job that brings in state from the old job.
ah i see
that makes sense
We need a resume points example in learn-onyx or in onyx-examples
i think i know what i'm doing (although i've been wrong before)
yeah, that would be great
to conclude: my main method will have three actions: start-peers, submit-job and (the new) resume-job which takes a pre-existing job to restart and runs michael's code
@ben.mumford Yup :thumbsup:
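for reference, a rough sketch of what that resume-job action might look like, glued together from michael's snippet above (the name resume-job! and the argument names are placeholders, not part of the Onyx API; assumes onyx.api is on the classpath and peers are running):

```clojure
;; hypothetical glue: resume a new job from an old job's latest snapshot
(defn resume-job!
  [peer-config tenancy-id old-job-id new-job]
  (let [coordinates  (onyx.api/job-snapshot-coordinates
                       peer-config tenancy-id old-job-id)
        resume-point (onyx.api/build-resume-point new-job coordinates)]
    (onyx.api/submit-job peer-config
                         (assoc new-job :resume-point resume-point))))
```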
so still running into the media driver timeout exception with an idle job (roughly 4hrs before death). Is there anything we need to tweak with the way the media driver runs? Like the threading mode? Or is that unrelated?
@camechis are you running in process or out of process?
we are spinning the media driver process up in the docker container with the peer, like the template so out of process
ok. Any chance the container is running out of memory and OOMing it?
We health check both the peer JVM and the aeron media driver. It’s easy enough to do with onyx-peer-http-query if you’re using that.
i don’t think it’s being OOMed, I do have onyx-peer-http-query enabled and it's currently being scraped by prometheus.
Try polling /network/media-driver on peer-http-query. you’ll get something like this:
{:active true, :driver-timeout-ms 10000, :log "INFO: Aeron directory /var/folders/c5/2t4q99_53mz_c1h9hk12gn7h0000gn/T/aeron-lucas exists INFO: Aeron CnC file /var/folders/c5/2t4q99_53mz_c1h9hk12gn7h0000gn/T/aeron-lucas/cnc.dat exists INFO: Aeron toDriver consumer heartbeat is 687 ms old"}
Are you tracking JVM GCs?
The fact that it dies when idle is strange as that would usually not indicate a GC
best advice I can really give you is to perform a flight recording of your JVM
you could increase the timeout past 10s, but you’re probably papering over another issue and it’ll eventually occur anyway.
so i am polling the peer query end point but only seeing something like this
{:status :success, :result true}
hmm, maybe the docs are wrong 😮
yea, looks like that call is broken
Are there any guides/best practices for monitoring metrics? We have a Datadog contract so prefer that over prometheus, and I've got the JMX export working, but even without a job it runs into the 350 metric limit for datadog, and I'm curious which are the "important" ones. I can take some guesses, but seeing someone else's dashboard setup or which beans/attributes they are focusing on would be a huge help. I can release my datadog jmx conf after going through this.
@eriktjacobsen you’re going to have to reshape the metrics, as each peer has its own tag
I’m not all that familiar with datadog, but essentially you need to turn something like this:
throughput.peer-id.XXX.slot-id.YYY value
into:
throughput {peer-id = XXX, slot-id = YYY}
so they are all effectively the same metric, just different workers
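as a sketch, that reshaping could look something like this (a hypothetical helper, not something onyx or datadog ships):

```clojure
(require '[clojure.string :as str])

;; turn "throughput.peer-id.XXX.slot-id.YYY" into a metric name plus tags,
;; so every peer reports under the same metric
(defn reshape-metric [metric-name]
  (let [[metric & kvs] (str/split metric-name #"\.")]
    {:metric metric
     :tags   (into {} (map vec (partition 2 kvs)))}))

(reshape-metric "throughput.peer-id.XXX.slot-id.YYY")
;; => {:metric "throughput", :tags {"peer-id" "XXX", "slot-id" "YYY"}}
```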
@camechis fixed that query on master. You can do /network/media-driver/active to check whether it’s active though
Yeah, that’s probably the best idea
To give a bit of context, normally the media driver should operate fine out of process and shouldn’t be subject to GCs in your main peer, so it shouldn’t really ever timeout unless it dies or if it wasn’t given enough memory (it doesn’t need much)
ok, I will wait for it to die and see if the metric comes back as active or not. That should let me know for sure if it's still running or not
Gotcha, shouldn't be a problem as I can filter metrics by host, so two hosts can just emit a throughput metric. My primary concern is just which of the beans often break. I don't have a good mental mapping of how things like zookeeper.force-read-chunk relate to zookeeper.read-lifecycles or zookeeper.read-origin. For all I know, two are downstream of the other, so if one starts slowing down, they all will. Or which metrics generally can fluctuate but wouldn't be worth investigating on their own until they cause higher level problems. Just some guide / focus on which are the primary metrics most people are watching
@lucasbradstreet I see this stuff in the logs occasionally. Is this normal?
17-10-25 20:50:50 onyx-peer-383723949-vm5s9 INFO [onyx.messaging.aeron.utils:60] - Error stopping subscriber's subscription. io.aeron.exceptions.RegistrationException: Unknown Subscription: 346
Nope, that’s not normal. What’s probably happening there is you’re getting a bad GC and the media driver is completely timing out and collecting your subscriptions, and thus it doesn’t know about them any more
yeah, I just noticed this time the peer restarted on me twice. trying to determine why
OK, great
For those of you running in Kubernetes, do you run the media driver in a separate container in the pod, or in a single container alongside the peer?
We’re running it as a side car in a separate container with shared shm space
gotcha, maybe i will move it out to that. Currently I am running them in the same container which makes it tricky to know what really died
out of curiosity, do you have an example of setting up a shared shm space? I know that's coming next, lol
volumes:
  - name: peersharedmem
    emptyDir:
      medium: "Memory"
That’ll give you a ramdisk
volumeMounts:
  - name: peersharedmem
    mountPath: "/dev/shm"
Then you can include that in both your peer and media-driver container definitions in your pod template. It shares your memory allocation
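putting the two snippets together, a pod spec might look roughly like this (container names and images are placeholders):

```yaml
# sketch: peer and media-driver side by side sharing a /dev/shm ramdisk
apiVersion: v1
kind: Pod
metadata:
  name: onyx-peer
spec:
  volumes:
    - name: peersharedmem
      emptyDir:
        medium: "Memory"
  containers:
    - name: peer
      image: my-registry/onyx-peer:latest      # placeholder image
      volumeMounts:
        - name: peersharedmem
          mountPath: "/dev/shm"
    - name: media-driver
      image: my-registry/media-driver:latest   # placeholder image
      volumeMounts:
        - name: peersharedmem
          mountPath: "/dev/shm"
```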
Also, the most recent JDK has these nice options
- -XX:+UnlockExperimentalVMOptions
- -XX:+UseCGroupMemoryLimitForHeap
Which are awesome for container work, especially with Kubernetes where resources might be dynamic.
Latest JDK 8 works too
JDK8u131
so if you use those settings then you don’t have to do the trick of figuring out how much memory the container has and setting the XMX to some percentage of that memory, to avoid the JVM trying to take more memory than the container has
Exactly
Which frees you up to use resource request/limit ranges in Kubernetes