2016-08-16
Hey all! Is there a good way to make a 'retry in 60 seconds'? I have to POST something to an API which happens to be unavailable at times. 🙂
That’ll have to be handled at the plugin level. Are you using the onyx-http plugin?
Actually, if I remember correctly I think I added this feature
also, there is an action :retry
for exceptions - I wonder if it's possible to limit the number of retries
> Accepts response as argument, should return boolean to indicate if the response was successful. Request will be retried in case it wasn't a success.
nice! Is it possible to add some logic there? 🙂
Right. It turns out it's in another branch that we created for a client. We will have to backport it.
The one we implemented allows for exponential backoff, and will stop retrying once it hits the pending timeout, since the message will be retried from the root anyway
See allow-retry in https://github.com/onyx-platform/onyx-http/blob/batch-byte-size/src/onyx/plugin/http_output.clj
I believe we cut a snapshot, but I'd have to look up what the snapshot is
Yup. I agree. The intent is to get it in the main branch.
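For reference, a minimal sketch of the success-check hook quoted above. The catalog keys and plugin symbol are my recollection of the onyx-http README, so treat them as assumptions and verify against the plugin version you're on:
```clojure
;; Sketch only: key names assumed from the onyx-http README, values illustrative.
(defn success? [{:keys [status]}]
  ;; anything that isn't a 2xx response will be retried
  (and status (<= 200 status 299)))

{:onyx/name :out
 :onyx/plugin :onyx.plugin.http-output/output
 :onyx/type :output
 :onyx/medium :http
 :http-output/success-fn ::success?
 :onyx/batch-size 10
 :onyx/doc "POSTs segments to the flaky API, retrying non-2xx responses"}
```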
Can someone explain how namespaced keywords were useful in building onyx? Michael says it's worth it but a pain to use, here: https://groups.google.com/d/msg/clojure-dev/8tHLCm8LpyU/NX9VTg53GAAJ
Is it correctly understood that the onyx-dashboard no longer supports showing the metrics sent through the websocket-sender?
That's correct. I'd like to add something like that back in once we release onyx-metrics Prometheus support. The main reason we took it out is because you should use something else in production, and the timbre metrics should be enough otherwise
Hi, why is submit-job calling serialize-job-to-zookeeper twice?
Are you seeing it actually be called twice? Submit job should only take one path that calls serialize-job-to-zookeeper, depending on whether or not a job with that job id already exists. This is so job submission is idempotent
Hi, any suggestions on how I can troubleshoot the following error? I'm trying to import 42 million records from MS SQL to Datomic
:29:27 silver FATAL [onyx.peer.task-lifecycle] -
java.lang.Thread.run Thread.java: 745
java.util.concurrent.ThreadPoolExecutor$Worker.run ThreadPoolExecutor.java: 617
java.util.concurrent.ThreadPoolExecutor.runWorker ThreadPoolExecutor.java: 1142
...
clojure.core.async/thread-call/fn async.clj: 439
onyx.peer.task-lifecycle/launch-aux-threads!/fn task_lifecycle.clj: 288
onyx.plugin.sql.SqlPartitionKeys/ack-segment sql.clj: 157
clojure.core.async/>!! async.clj: 142
clojure.core.async.impl.channels.ManyToManyChannel/put! channels.clj: 69
java.lang.IllegalArgumentException: Can't put nil on channel
clojure.lang.ExceptionInfo: Internal error. Failed to read core.async channels -> Exception type: java.lang.IllegalArgumentException. Exception message: Can't put nil on channel
job-id: #uuid "ec8af69f-5f8e-4be3-9ef6-3775e830273b"
metadata: {:job-id #uuid "ec8af69f-5f8e-4be3-9ef6-3775e830273b", :job-hash "71b1c13a4b7bbf3302da4fb30990c6b63ee5183877cbbc6add62882d3e3bd"}
peer-id: #uuid "fa06ff13-e6ae-4939-8151-0f7cdca8b1a2"
task-name: :partition-keys
Looks like a bug in onyx-sql
What’s probably happening is that messages are getting retried, then the segment that was retried is being acked, but is already missing
I would do two things: one, reduce :onyx/max-pending on the input task (it will help with retries)
also potentially reduce :sql/rows-per-segment
and try the onyx-sql snapshot that is being released through CI at the moment
I’ll let you know the coordinates for the snapshot shortly
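As a rough sketch, the two knobs mentioned above sit on the input catalog entry something like this (values are only starting points to experiment with; the plugin keyword follows the onyx-sql README, so verify it for your version):
```clojure
;; Illustrative tuning only; :onyx/max-pending * :sql/rows-per-segment rows
;; can be in flight at once, so lowering either reduces retry pressure.
{:onyx/name :partition-keys
 :onyx/plugin :onyx.plugin.sql/partition-keys
 :onyx/type :input
 :onyx/medium :sql
 :onyx/max-pending 500       ; fewer in-flight segments => fewer retries
 :sql/rows-per-segment 500   ; smaller segments ease memory/ack pressure
 :onyx/batch-size 100
 :onyx/doc "Partitions rows from the MS SQL table into segments"}
```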
Ok I’m currently on 0.9.7 - I only saw it after I added max-pending 1000. rows-per-segment is 1000 and batch-size is 100.
onyx-sql 0.9.9.1-20160816.124319-6 is available now.
The main reason I added the max-pending was because after a while I would lose the connection to the transactor and Onyx would not appear to notice (detected by monitoring the Datomic tx-report queue)
Trying the snapshot out now
It would appear that it worked; it has now processed 200k records
With the new snapshot?
I'm currently running with a heap size of 10g, and I import approx. 200K-400K records before a timeout in the Aeron conductor, which is probably because of resources. I'm using the embedded Aeron with short-circuit set to false. Do you have any rules of thumb about sizing of heap vs number of records vs embedded/standalone Aeron?
1. use short circuiting because it’ll improve perf in prod. Turning it off is mostly for testing whether things are serializable, etc.
correction: short-circuit is set to true.
ok cool
I’d use a non-embedded conductor, because then your Onyx JVM can’t take it down
I’d also switch on the G1GC
A dedicated media driver in a separate JVM helped us when Aeron was complaining about timeouts
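A sketch of what the non-embedded setup looks like in the peer-config, assuming the standard Onyx Aeron messaging keys (worth confirming against the Onyx cheat sheet for your version):
```clojure
;; Peer-config sketch: with the embedded driver disabled, the Aeron media
;; driver runs as its own JVM process, so the Onyx peers can't take it down.
{:onyx.messaging/impl :aeron
 :onyx.messaging.aeron/embedded-driver? false
 :onyx.messaging/bind-addr "localhost"}
```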
Then I would probably reduce the max-pending some more. You’ll have rows-per-segment * max-pending input segments in flight at any given time
Should I increase the Aeron timeouts?
You should probably do that too. There are some java properties that you can set
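For example, something along these lines; the property names are my best recollection of Aeron's configuration options, so double-check them against the Aeron docs for the version Onyx pulls in:
```clojure
;; Assumed Aeron property names; set them before Aeron starts (in the media
;; driver's JVM, which is the peer JVM when the driver is embedded).
(System/setProperty "aeron.client.liveness.timeout" (str (* 20 1000 1000 1000))) ; 20s, in ns
(System/setProperty "aeron.driver.timeout" (str (* 20 1000)))                    ; 20s, in ms
```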
Thx 🙂
dumb question, but how do you know if Aeron is complaining about timeouts? We are experiencing a high number of retries and haven’t been able to figure out what might be causing it.
It should be a pretty obvious exception
@camechis: are you using windows?
How are you using the windows? What kind of aggregation?
tried a few different types (mainly fixed / session) but always a collect-by-key aggregation. We are grouping segments based on that key and doing a collapse of sorts
OK, but it’s your own aggregation, and you’re not keeping the whole segment in the state?
You have to be careful because depending on what the changelog entry contains, you might get a lot of data being journaled
Ok. So what is going to happen is that every segment will be written to a BookKeeper log
This can hurt you perf wise if it’s big
One easy approach would be to just select-keys/get what you need in your aggregation
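A minimal sketch of that select-keys idea as a custom aggregation; the :k1/:k2 fields are placeholders, and the aggregation map keys and arities are assumed from the Onyx windowing docs:
```clojure
;; Conj-style aggregation that journals only the keys we actually need,
;; rather than the whole segment.
(defn init-fn [window] [])

(defn create-state-update [window state segment]
  ;; this return value is what ends up in the BookKeeper changelog
  (select-keys segment [:k1 :k2]))

(defn apply-state-update [window state entry]
  (conj state entry))

(def collect-trimmed
  {:aggregation/init init-fn
   :aggregation/create-state-update create-state-update
   :aggregation/apply-state-update apply-state-update})
```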
More importantly, you have a main tuning knob for retries, which is changing onyx/max-pending
try something really low, see if you get retries, and then start to move it up
If you’re just running it on a single node, also beware that the testing BookKeeper server / configuration will end up journalling it on three nodes locally
so that will hurt you too
if it’s on multiple machines it’ll be ok
right now we have 5 peers running in Mesos with 3 dedicated bookies running physically on 3 nodes
OK, that’s good
That should be fine
Metrics would help a lot
Understood
-pours camechis a ☕-
i've been here, buddy. one thing at a time 🙂
@robert-stuttaford understands
It may also be a good time to bring us on for a small fixed amount of hours
Alrighty. Happy to help here where I can too
What’s your complete latency like, by the way?
heartily recommend doing that, camechis. we didn't for a long time, and it cost us a lot of wasted time
much appreciated! We are waiting for some stuff to go finish up and then I am hoping we can do that
First thing is to start with a really really low max-pending and see where you're at
@lucasbradstreet: I just saw you asked about the latency. Let me see if i can get that
I lost my numbers but we will be running a new one soon and I will let you know what we see
No worries. It’ll give you a decent idea about how long a segment is taking to process end to end
As you increase max-pending, complete latency will creep up, and throughput will be increasing a lot, but at some point you’ll max out throughput and your complete latency will keep increasing (and possibly you’ll start hitting retries). That’s a sign you’ve gone too far
Do you happen to know, with the aero config stuff, is there a way to read an ENV var in as an int? We have hit several spots where we need to make it a little more configurable on the Docker container by setting envs, but aero reads environment variables in as strings and then it blows up the Onyx schema for certain things
Then you’ll probably need to fix up other things to improve things further, at which point you may be able to increase max-pending again
You’ll have to ask @gardnervickers that question
The easy answer would be to update-in them into ints in your code, after you’ve loaded the config
I know aero is extensible, so maybe check the docs to see if there are any tags for that sort of thing
I don’t think there is though
Nope, this is an issue with any tool using env vars. You'll need to parse that yourself or make a custom Aero reader
#env-edn would be a handy one
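Putting both suggestions together in a sketch; the #env-int tag and the config path are made up for illustration, and only aero.core/read-config and the aero.core/reader multimethod are Aero's own API:
```clojure
(ns myapp.config
  (:require [aero.core :as aero]))

;; Option 1 (the "easy answer"): coerce the strings after loading the config.
(defn load-config [path]
  (-> (aero/read-config path)
      (update :onyx/batch-size #(Long/parseLong %))))   ; example path only

;; Option 2: a custom reader tag, e.g. #env-int "BATCH_SIZE" in the EDN file.
;; The tag name is made up here; Aero doesn't ship it.
(defmethod aero/reader 'env-int
  [_opts _tag value]
  (some-> (System/getenv (str value)) Long/parseLong))
```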
Hi guys, just a quick question. I’m firing up a ten-peer cluster. I have one job requiring three peers and then another job requiring another three peers. The first job is a Kafka-in > Process > Kafka-out job, and the second is a Kafka-in > Process > Onyx-out. The first job (showing in the dashboard) connects fine and reports using the three peers for in - process - out. The second job never gets allocated peers from the host. Is there anything I can check to see what’s going on?
@lucasbradstreet: So I got the metrics going again and I dropped the max-pending to 1000. No retries right now. I am trying to figure out the best way to relay what the max_complete time is to you. Any suggestions?
there should be a max_complete_latency metric on your input tasks. You should just be able to let me know what it usually looks like
@lucasbradstreet: so what I am seeing in grafana is numbers between 5K-20K kind of spiking all around in there
10-15s end to end
if the max is 10-15K that means segments are flowing through the whole DAG and being acked in about that time
you should be able to add up the batch latencies for each task (divided by the batch size) and get a pretty good idea about what is taking the most time
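For example, a quick back-of-the-envelope with made-up numbers:
```clojure
;; Per-segment cost is roughly the sum of each task's batch latency divided
;; by its batch size; all figures here are illustrative.
(let [tasks [{:task :in      :batch-latency-ms 50  :batch-size 100}
             {:task :process :batch-latency-ms 400 :batch-size 100}
             {:task :out     :batch-latency-ms 150 :batch-size 100}]]
  (reduce + (map #(/ (:batch-latency-ms %) (:batch-size %)) tasks)))
;; => 6 (ms per segment) - far below a 10-15s complete latency, which would
;; point at queueing/ack time rather than time spent in the task functions.
```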
The ack happens when the segment is written to the output task and also journalled to BookKeeper (it reference counts the ack in that task, so both need to occur before the ack is sent)
ok, also we are thinking of writing our triggered data back into kafka so the pipeline can pick it back up and do some last minute enrichment. Right now our trigger is doing a ton of last minute work on the collected data. Thought this might make it more efficient? Any thoughts there?
Also, that should be doable in one job (I think)? We probably cannot use the kafka output task in this case
@lucasbradstreet so a previously fired job can be rerun only via submit with a different id?
@mariusz_jachimowicz: A job is considered equal to another job if it shares the same job ID. If the IDs are different, they have no relationship to one another as far as idempotency is concerned, even if the jobs themselves are identical.
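A small sketch of how that plays out at submit time; the :metadata/:job-id convention is my reading of the Onyx docs, and peer-config and job are assumed to be defined already:
```clojure
(require '[onyx.api])

(def my-job-id (java.util.UUID/randomUUID))

;; First submission runs the job.
(onyx.api/submit-job peer-config (assoc job :metadata {:job-id my-job-id}))
;; Resubmitting with the same id is idempotent; a fresh id is a brand new job.
(onyx.api/submit-job peer-config (assoc job :metadata {:job-id my-job-id}))
```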
Anyone have any ideas on why we might have missing data in our windows? We are using a session window with a timer trigger, and after processing a predefined set of data in Kafka the numbers don’t add up and are somewhat different on each run. Kafka -> tasks -> window collect-by-key -> timer trigger sums up and writes to Elastic
I'm playing with learn-onyx, and when I submit the same job definition twice within one test, tasks are only fired for the first job
@mariusz_jachimowicz: Is there a message in onyx.log noting that there aren't enough virtual peers to sustain execution for the second job?
yes, I thought that peers would be reused after the first job
They will if the first job runs to completion.
Ok, I had to bind inputs again for the second job; it's working correctly now
Hi, I'm getting an error when a message is large, e.g.
Aeron write from buffer error: java.lang.IllegalArgumentException: Encoded message exceeds maxMessageLength of 2097152, length=2295220
What's the best way to print the segment that's causing the problem to onyx.log?
@aaelony: That error is being thrown from within Onyx itself; there's not really a good way to bail out of that one. Would recommend increasing the Aeron max message size. It's bounded and unchecked to keep performance quick.
it's actually due to bad data... I'm reading from large files now, and just isolated the problem by iteratively narrowing it down (painful, but found it). It appears the file happens to have fused records together by ^J instead of \n at like the 100th megabyte or something like that.
I had increased the message size earlier, but I couldn't increase it further without (I surmise) recompiling Aeron or something, so I'm just thinking about how I would like to detect and handle this in production.
suppose bad data starts coming in, without the right delimiter... the segments start clumping together until they exceed maxMessageLength. I'm currently looking for ideas, but instead of calling (add-task (s/buffered-file-reader :in filename batch-settings))
I'm thinking of calling a new function with the same signature that will wrap that, first checking for bad data...
I think that might translate to 0a, not sure if it's a good idea for the line reader to understand that natively...
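One possible shape for that wrapper, purely as a sketch: the helper name, the 1MB threshold, and the alias s (for whatever namespace provides buffered-file-reader above) are all illustrative; it just pre-scans the file and then delegates to the same call:
```clojure
(require '[clojure.java.io :as io])

;; Hypothetical wrapper: reject files containing suspiciously long lines
;; before handing them to the reader task.
(defn checked-file-reader [task-name filename batch-settings]
  (with-open [rdr (io/reader filename)]
    (doseq [[i line] (map-indexed vector (line-seq rdr))]
      (when (> (count line) (* 1024 1024))
        (throw (ex-info "Suspiciously long line; wrong delimiter?"
                        {:file filename :line i :length (count line)})))))
  (s/buffered-file-reader task-name filename batch-settings))

;; usage: (add-task (checked-file-reader :in filename batch-settings))
```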
@michaeldrogalis: If an Aeron exception is encountered, it seems the job terminates; is this correct? If the job were fault tolerant to this kind of error and processed the messages that don't have this problem, that would be a better scenario...
eventually, this data will flow through Kafka, so I will (hopefully) be able to assume that these issues won't occur
@aaelony: The error you're seeing is very deep inside Onyx, and is irrecoverable without reconfiguring Aeron to accept larger message sizes at peer launch time.
right, the trouble is that the message could grow arbitrarily long if the delimiter has unexpectedly changed...