#onyx
2016-08-16
asolovyov07:08:53

Hey all! Is there a good way to make a 'retry in 60 seconds'? I have to POST something to an API which happens to be unavailable at times. 🙂

lucasbradstreet07:08:35

That’ll have to be handled at the plugin level. Are you using the onyx-http plugin?

asolovyov08:08:59

(sorry for the pause, got dragged into a meeting)

lucasbradstreet08:08:11

Actually, if I remember correctly I think I added this feature

asolovyov08:08:17

also, there is an action :retry for exceptions - I wonder if it's possible to limit the number of retries

asolovyov08:08:24

oh really? let me check docs again

asolovyov08:08:20

> Accepts response as argument, should return boolean to indicate if the response was successful. Request will be retried in case it wasn't a success.
nice! Is it possible to add some logic there? 🙂
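(A minimal sketch of what such a success predicate could look like, assuming this is the onyx-http output plugin and that the predicate is wired in via a catalog key named something like :http-output/success-fn; the key name and task names are assumptions based on the docs quoted above, so verify them against your plugin version.)

;; Treat any 2xx status as success; anything else will be retried by the plugin.
(defn success? [response]
  (<= 200 (:status response) 299))

;; hypothetical fragment of the output catalog entry
{:onyx/name :post-to-api
 :onyx/type :output
 :http-output/success-fn :my.app.tasks/success?}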

asolovyov08:08:31

I'd say I want to retry a few times with a pause between them...

asolovyov08:08:42

I guess that's not really possible right now?

lucasbradstreet08:08:44

Right. It turns out it's in another branch that we created for a client. We will have to backport it.

lucasbradstreet08:08:22

The one we implemented allows for exponential backoff, and will stop retrying once it hits the pending timeout, since the message will be retried from the root anyway
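(Not the plugin's actual implementation, just a rough sketch of the idea described here: exponential backoff between attempts, capped so the total wait never exceeds the input task's pending timeout. post-to-api! and success? are hypothetical helpers.)

(defn with-backoff
  "Call f until it returns a truthy value, sleeping between attempts with
   exponential backoff starting at base-ms, and giving up once the total sleep
   would exceed max-total-ms (e.g. the input task's pending timeout)."
  [f base-ms max-total-ms]
  (loop [sleep base-ms
         slept 0]
    (let [result (f)]
      (cond
        result result
        (> (+ slept sleep) max-total-ms) result
        :else (do (Thread/sleep sleep)
                  (recur (* 2 sleep) (+ slept sleep)))))))

;; e.g. retry a POST for at most a 60s pending timeout, starting at 500ms:
;; (with-backoff #(success? (post-to-api! segment)) 500 60000)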

asolovyov08:08:49

is it possible to use this stuff without an official release?

lucasbradstreet08:08:15

I believe we cut a snapshot, but I'd have to look up what the snapshot is

asolovyov08:08:09

I guess I can make a release for myself under my orgname on clojars 😄

asolovyov08:08:32

but that seems like a really useful thing!

asolovyov08:08:37

bad http apis are everywhere 😄

lucasbradstreet08:08:35

Yup. I agree. The intent is to get it in the main branch.

yonatanel09:08:16

Can someone explain how namespaced keywords were useful in building onyx? Michael says it's worth it but a pain to use, here: https://groups.google.com/d/msg/clojure-dev/8tHLCm8LpyU/NX9VTg53GAAJ

zamaterian11:08:28

Is it correctly understood that the onyx-dashboard no longer supports showing the metrics sent through the websocket-sender?

lucasbradstreet11:08:38

That's correct. I'd like to add something like that back in once we release onyx-metrics Prometheus support. The main reason we took it out is because you should use something else in production, and the timbre metrics should be enough otherwise

mariusz_jachimowicz12:08:04

Hi, why does submit-job call serialize-job-to-zookeeper twice?

lucasbradstreet12:08:15

Are you seeing it actually be called twice? Submit-job should only take one path that calls serialize-job-to-zookeeper, depending on whether or not a job with that job id already exists. This is so job submission is idempotent
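(A small sketch of leaning on that idempotency by fixing the job id yourself. The job map carries the id under :metadata, the same place it appears in the error metadata later in this log; resubmitting the identical map is then a no-op. workflow, catalog, lifecycles and peer-config are assumed to be defined elsewhere.)

(require '[onyx.api])

(def job-id #uuid "11111111-1111-1111-1111-111111111111") ; any fixed UUID for this logical job

(def job
  {:metadata {:job-id job-id}   ; same id => submit-job is idempotent
   :workflow workflow
   :catalog catalog
   :lifecycles lifecycles
   :task-scheduler :onyx.task-scheduler/balanced})

;; Submitting twice only serializes the job to ZooKeeper once:
(onyx.api/submit-job peer-config job)
(onyx.api/submit-job peer-config job)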

zamaterian12:08:26

Hi, any suggestions on how I troubleshoot the following error? Trying to import 42 million records from MS SQL to Datomic.

:29:27 silver FATAL [onyx.peer.task-lifecycle] - 
                                   java.lang.Thread.run              Thread.java:  745
     java.util.concurrent.ThreadPoolExecutor$Worker.run  ThreadPoolExecutor.java:  617
      java.util.concurrent.ThreadPoolExecutor.runWorker  ThreadPoolExecutor.java: 1142
                                                    ...                               
                      clojure.core.async/thread-call/fn                async.clj:  439
        onyx.peer.task-lifecycle/launch-aux-threads!/fn       task_lifecycle.clj:  288
           onyx.plugin.sql.SqlPartitionKeys/ack-segment                  sql.clj:  157
                                 clojure.core.async/>!!                async.clj:  142
clojure.core.async.impl.channels.ManyToManyChannel/put!             channels.clj:   69
java.lang.IllegalArgumentException: Can't put nil on channel
        clojure.lang.ExceptionInfo: Internal error. Failed to read core.async channels -> Exception type: java.lang.IllegalArgumentException. Exception message: Can't put nil on channel
       job-id: #uuid "ec8af69f-5f8e-4be3-9ef6-3775e830273b"
     metadata: {:job-id #uuid "ec8af69f-5f8e-4be3-9ef6-3775e830273b", :job-hash "71b1c13a4b7bbf3302da4fb30990c6b63ee5183877cbbc6add62882d3e3bd"}
      peer-id: #uuid "fa06ff13-e6ae-4939-8151-0f7cdca8b1a2"
    task-name: :partition-keys

lucasbradstreet12:08:24

Looks like a bug in onyx-sql

lucasbradstreet12:08:56

What’s probably happening is that messages are getting retried, then the segment that was retried is being acked, but is already missing

lucasbradstreet12:08:39

I would do two things, one reduce :onyx/max-pending on the input task (will help with retries)

lucasbradstreet12:08:53

also potentially reduce :sql/rows-per-segment
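(To make both suggestions concrete, a hypothetical fragment of the :partition-keys input catalog entry; the plugin keyword and values are illustrative only, and the :sql/* connection keys are elided.)

{:onyx/name :partition-keys
 :onyx/plugin :onyx.plugin.sql/partition-keys
 :onyx/type :input
 :onyx/medium :sql
 :sql/rows-per-segment 200   ; smaller segments mean less data tied to each ack
 :onyx/max-pending 200       ; fewer pending input segments means fewer retries under load
 :onyx/batch-size 100}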

lucasbradstreet12:08:09

and try the onyx-sql snapshot that is releasing through CI at the moment

lucasbradstreet12:08:22

I’ll let you know the coordinates for the snapshot shortly

zamaterian12:08:36

Ok, I'm currently on 0.9.7. I only saw it after I added max-pending 1000. rows-per-segment is 1000 and batch-size is 100.

lucasbradstreet12:08:46

onyx-sql 0.9.9.1-20160816.124319-6 is available now.
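(For reference, pulling that snapshot in from Leiningen would look roughly like the following, assuming the usual org.onyxplatform group id and that the Clojars snapshots repository is on your :repositories list.)

;; project.clj fragment
:dependencies [[org.onyxplatform/onyx-sql "0.9.9.1-20160816.124319-6"]]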

zamaterian12:08:30

The main reason I added the max-pending was that after a while I would lose the connection to the transactor and Onyx would not appear to notice (detected by monitoring the Datomic tx-report queue)

zamaterian12:08:39

Trying the snapshot out now

zamaterian12:08:40

It would appear that it worked; it has now processed 200k records.

lucasbradstreet12:08:10

With the new snapshot?

zamaterian13:08:17

I'm currently running with a heap size of 10g, and I import approximately 200K-400K records before a timeout in the Aeron conductor, which is probably because of resources. Using the embedded Aeron with short-circuit set to false. Do you have any rules of thumb about heap sizing vs number of records vs embedded/standalone Aeron?

lucasbradstreet13:08:59

1. Use short-circuiting because it'll improve perf in prod. Disabling it is mostly for testing whether things are serializable etc

zamaterian13:08:03

Correction: short-circuit is set to true.

lucasbradstreet13:08:49

I’d use a non-embedded conductor, because then your Onyx JVM can’t take it down

lucasbradstreet13:08:59

I’d also switch on the G1GC
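(Putting those three suggestions together, a sketch of the relevant settings; the peer-config key names are what I believe the 0.9.x options are called, so verify them against the Onyx cheat sheet.)

;; peer-config fragment
{:onyx.messaging/allow-short-circuit? true      ; keep short-circuiting on in prod
 :onyx.messaging.aeron/embedded-driver? false}  ; run the Aeron media driver in its own JVM

;; and on the peer JVM:
;; JVM_OPTS="-Xmx10g -XX:+UseG1GC"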

dignati13:08:08

A dedicated media driver in a separate JVM helped us when Aeron was complaining about timeouts

lucasbradstreet13:08:49

Then I would probably reduce the max-pending some more. You’ll have rows-per-segment * max-pending input segments in flight at any given time

zamaterian13:08:25

Should I increase the Aeron timeouts?

lucasbradstreet13:08:40

You should probably do that too. There are some java properties that you can set
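(Those are plain Java system properties on the process running the media driver, and on the peers for the client side. For example, raising the client liveness timeout to 20 seconds might look like the line below; the property name and its nanosecond unit are from memory of the Aeron docs, so verify them for your Aeron version.)

-Daeron.client.liveness.timeout=20000000000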

zamaterian13:08:55

Thx 🙂

Travis13:08:17

Dumb question, but how do you know if Aeron is complaining about timeouts? We are experiencing a high number of retries and haven't been able to figure out what might be causing it.

lucasbradstreet13:08:48

It should be a pretty obvious exception

lucasbradstreet13:08:53

@camechis: are you using windows?

Travis13:08:53

ok, gotcha

Travis13:08:04

Yeah we are

lucasbradstreet13:08:35

How are you using the windows? What kind of aggregation?

Travis13:08:38

Tried a few different types (mainly fixed / session) but always a collect-by-key aggregation. We are grouping segments based on that key and doing a collapse of sorts

lucasbradstreet13:08:29

OK, but it’s your own aggregation, and you’re not keeping the whole segment in the state?

lucasbradstreet13:08:53

You have to be careful because depending on what the changelog entry contains, you might get a lot of data being journaled

Travis13:08:08

it is the whole segment, but we discard it once the trigger fires

lucasbradstreet13:08:37

Ok. So what is going to happen is that every segment will be written to a BookKeeper log

Travis13:08:48

makes sense

lucasbradstreet13:08:56

This can hurt you perf-wise if it's big

lucasbradstreet13:08:22

One easy approach would be to just select-keys/get what you need in your aggregation
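(A rough sketch of that idea as a custom aggregation, using the 0.9.x aggregation map keys as I recall them; the namespace, the key list, and the changelog-entry format here are all hypothetical.)

(ns my.app.aggregations)

(def keep-keys [:user-id :event-type :amount]) ; only what the trigger actually needs

(defn create-state-update
  "The changelog entry carries only the selected keys, so far less data is journaled."
  [window state segment]
  [:conj (select-keys segment keep-keys)])

(defn apply-state-update [window state [_ slim-segment]]
  (conj state slim-segment))

(def slim-collect
  {:aggregation/init (fn [window] [])
   :aggregation/create-state-update create-state-update
   :aggregation/apply-state-update apply-state-update})

;; referenced from the window map as :window/aggregation :my.app.aggregations/slim-collect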

lucasbradstreet13:08:46

More importantly, you have a main tuning knob for retries, which is changing onyx/max-pending

lucasbradstreet13:08:54

try something really low, see if you get retries, and then start to move it up

Travis13:08:09

ok, we have started playing with that knob

lucasbradstreet13:08:41

If you’re just running it on a single node, also beware that the testing BookKeeper server / configuration will end up journalling it on three nodes locally

lucasbradstreet13:08:43

so that will hurt you too

lucasbradstreet13:08:05

if it’s on multiple machines it’ll be ok

Travis13:08:12

right now we have 5 peers running in Mesos with 3 dedicated bookies running physically on 3 nodes

lucasbradstreet13:08:41

OK, that’s good

Travis13:08:51

well, not dedicated but installed outside of docker on 3 nodes

lucasbradstreet13:08:02

That should be fine

lucasbradstreet13:08:14

Metrics would help a lot

Travis13:08:21

we do have metrics now

Travis13:08:32

still trying to wrap our heads around interpreting them

Travis13:08:54

got Riemann, InfluxDB, and Grafana going. Based the queries on the benchmark project

robert-stuttaford13:08:17

-pours camechis a ☕-

robert-stuttaford13:08:33

i've been here, buddy. one thing at a time 🙂

lucasbradstreet13:08:54

It may also be a good time to bring us on for a small fixed number of hours

Travis13:08:10

believe me, I am working on that one

lucasbradstreet13:08:38

Alrighty. Happy to help here where I can too

lucasbradstreet13:08:16

What’s your complete latency like, by the way?

robert-stuttaford13:08:33

heartily recommend doing that, camechis. we didn't for a long time, and it cost us a lot of wasted time

Travis13:08:09

much appreciated! We are waiting for some stuff to go finish up and then I am hoping we can do that

lucasbradstreet13:08:43

First thing is to start with a really really low max-pending and see where you're at

Travis13:08:25

cool, we will do that

Travis13:08:36

@lucasbradstreet: I just saw you asked about the latency. Let me see if i can get that

Travis14:08:23

I lost my numbers but we will be running a new one soon and I will let you know what we see

lucasbradstreet14:08:41

No worries. It’ll give you a decent idea about how long a segment is taking to process end to end

lucasbradstreet14:08:56

As you increase max-pending, complete latency will creep up and throughput will increase a lot, but at some point you'll max out throughput and your complete latency will keep increasing (and possibly you'll start hitting retries). That's a sign you've gone too far

Travis14:08:31

Do you happen to know, with the Aero config stuff, is there a way to read an ENV var in as an int? We have hit several spots where we need to make things a little more configurable on the Docker container by setting envs, but Aero reads environment variables in as strings, and then that blows up the Onyx schema for certain things

lucasbradstreet14:08:50

Then you’ll probably need to fix up other things to improve things further, at which point you may be able to increase max-pending again

lucasbradstreet14:08:07

You’ll have to ask @gardnervickers that question

lucasbradstreet14:08:44

The easy answer would be to update-in them into ints in your code, after you've loaded the config

Travis14:08:13

ok, thats what we have been doing

lucasbradstreet14:08:39

I know aero is extensible, so maybe check the docs to see if there are any tags for that sort of thing

Travis14:08:58

yeah, I think we can make our own. Just wasn't sure if that was something built in

lucasbradstreet14:08:03

I don’t think there is though

gardnervickers14:08:23

Nope, this is an issue with any tool using env vars. You'll need to parse that yourself or make a custom Aero reader

lucasbradstreet14:08:06

#env-edn would be a handy one
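(One way to do that with Aero's documented extension point, a multimethod keyed on the tag symbol; the #env-edn tag name and the env var used here are only illustrative.)

(ns my.app.config
  (:require [aero.core :as aero]
            [clojure.edn :as edn]))

;; #env-edn "ONYX_MAX_PENDING" reads the env var and parses it as EDN,
;; so "1000" comes back as the integer 1000 rather than a string.
(defmethod aero/reader 'env-edn
  [_opts _tag value]
  (some-> (System/getenv (str value)) edn/read-string))

;; config.edn:  {:onyx/max-pending #env-edn "ONYX_MAX_PENDING"}
;; (aero/read-config "config.edn")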

jasonbell14:08:29

Hi guys, just a quick question. I'm firing up a ten-peer cluster. I have one job requiring three peers and then another job requiring another three peers. The first job is a Kafka-in > Process > Kafka-out job and the second is a Kafka-in > Process > Onyx-out. The first job (showing in the dashboard) connects fine and reports using the three peers for in - process - out. The second job never gets allocated peers from the host. Is there anything I can check to see what's going on?

Travis14:08:59

might check the job scheduler

Travis14:08:12

if it's set to greedy, I think the second job will never get any peers

jasonbell14:08:09

tis greedy, good catch. thanks
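(For reference, the scheduler is set in the peer config; switching to the balanced scheduler lets the remaining peers be allocated across both jobs.)

;; peer-config fragment
{:onyx.peer/job-scheduler :onyx.job-scheduler/balanced}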

Travis15:08:19

@lucasbradstreet: So I got the metrics going again and I dropped the max-pending to 1000. No retries right now. I am trying to figure out the best way to relay what the max_complete time is to you. Any suggestions?

lucasbradstreet15:08:55

there should be a max_complete_latency metric on your input tasks. You should just be able to let me know what it usually looks like

Travis15:08:53

@lucasbradstreet: so what I am seeing in grafana is numbers between 5K-20K kind of spiking all around in there

Travis15:08:18

maybe closer between 10K-15K on average

Travis15:08:33

if those numbers sound right

lucasbradstreet15:08:29

10-15s end to end

Travis15:08:10

I am not sure; my Y axis is 0 - 20K and the X axis is in 2-minute increments

lucasbradstreet15:08:00

if the max is 10-15K that means segments are flowing through the whole DAG and being acked in about that time

lucasbradstreet15:08:23

you should be able to add up the batch latencies for each task (divided by the batch size) and get a pretty good idea about what is taking the most time

Travis17:08:00

what causes a segment to get acked when a window is the last piece of your pipeline?

lucasbradstreet17:08:32

when the segment is written to the output task, and also journalled to bookkeeper (it reference counts the ack in that task, so both need to occur before the ack is sent)

Travis17:08:07

ok, also we are thinking of writing our triggered data back into Kafka so the pipeline can pick it back up and do some last-minute enrichment. Right now our trigger is doing a ton of last-minute work on the collected data. Thought this might make it more efficient? Any thoughts there?

Travis18:08:05

Also, that should be doable in one job (I think)? Probably cannot use the Kafka output task in this case

mariusz_jachimowicz19:08:01

@lucasbradstreet so a previously fired job can be rerun only via submit with a different id?

michaeldrogalis20:08:15

@mariusz_jachimowicz: A job is considered equal to another job if it shares the same job ID. If the IDs are different, they have no relationship to one another as far as idempotency is concerned, even if the jobs themselves are identical.

Travis20:08:20

Anyone have any ideas on why we might have missing data in our windows? We are using a session window with a timer trigger, and after processing a predefined set of data in Kafka the numbers don't add up and are somewhat different on each run. Kafka -> tasks -> window collect-by-key -> timer trigger sums up and writes to Elastic

mariusz_jachimowicz20:08:43

I'm playing with learn-onyx, and when I submit the same job definition twice within one test, tasks are only fired for the first job

michaeldrogalis20:08:20

@mariusz_jachimowicz: Is there a message in onyx.log noting that there aren't enough virtual peers to sustain execution for the second job?

mariusz_jachimowicz20:08:47

yes, I thought that peers would be reused after the first job

michaeldrogalis20:08:32

They will if the first job runs to completion.

mariusz_jachimowicz20:08:11

Ok, I had to bind inputs again for the second job; it's working correctly now

aaelony21:08:24

Hi, I'm getting an error when a message is large, e.g.

Aeron write from buffer error:  java.lang.IllegalArgumentException: Encoded message exceeds maxMessageLength of 2097152, length=2295220
What's the best way to print the segment that's causing the problem to onyx.log?

michaeldrogalis21:08:01

@aaelony: That error is being thrown from within Onyx itself; there's not really a good way to bail out of that one. Would recommend increasing the Aeron max message size. It's bounded and unchecked to keep performance quick.
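(If memory serves, the Aeron term buffer, and with it the max message length of roughly term-buffer-size / 8, hence the 2097152 = 16 MB / 8 in the error above, is set in the peer config at peer launch. A sketch, with the key name to be checked against the Onyx cheat sheet for your version.)

;; peer-config fragment: a 64 MB term buffer gives roughly an 8 MB max message
{:onyx.messaging/term-buffer-size.bytes (* 64 1024 1024)}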

aaelony21:08:51

it's actually due to bad data... I'm reading from large files now, and just isolated the problem by iteratively narrowing it down (painful, but found it). It appears the file has records fused together by ^J instead of \n at around the 100th megabyte or something like that.

aaelony21:08:47

I had increased the message size earlier, but I couldn't increase it further without (I surmise) recompiling Aeron or something, so I'm just thinking about how I would like to detect and handle this in production.

aaelony21:08:43

suppose bad data starts coming in, without the right delimiter... the segments start clumping together until they exceed maxMessageLength. I'm currently looking for ideas, but instead of calling (add-task (s/buffered-file-reader :in filename batch-settings)) I'm thinking of calling a new function with the same signature that will wrap that, first checking for bad data...

aaelony22:08:10

I think that might translate to 0a, not sure if it's a good idea for the line reader to understand that natively...

aaelony23:08:40

@michaeldrogalis: If an Aeron exception is encountered, it seems the job terminates; is this correct? If the job were fault-tolerant to this kind of error and processed the messages that don't have this problem, that would be a better scenario...

aaelony23:08:45

eventually, this data will flow through Kafka, so I will (hopefully) be able to assume that these issues won't occur

michaeldrogalis23:08:27

@aaelony: The error you're seeing is very deep inside Onyx, and is irrecoverable without reconfiguring Aeron to accept larger message sizes at peer launch time.

aaelony23:08:51

right, the trouble is that the message could be arbitrarily long if the delimiter has unexpectedly changed...

aaelony23:08:50

basically it's treating what should be many many segments as one segment, until this badness in the data stops

aaelony23:08:20

this would never happen if the data were clean prior to consumption