#onyx
2017-11-15
lucasbradstreet00:11:55

@asolovyov any idea why it might be failing on circleci? I think maybe it’s the case where exceptions are thrown before the other calls succeed.

lucasbradstreet01:11:21

Figured it out via some judicious sleeps 😛

asolovyov07:11:11

@lucasbradstreet I'll look into backpressure!

asolovyov07:11:57

also, about Thread/sleep: that's the only thing I found in docs about timeouts, but yeah, I'm not exactly sure about it

asolovyov07:11:21

it didn't manifest as a problem in tests, but I guess that's because there are more threads in the aleph thread pool than requests

lucasbradstreet07:11:32

I remembered the other thing that might be useful. If you have long running retries, you may wish to stick them in a map when they fail, and return them from the checkpoint call. Then you can continue to progress on sync even if there are retries that didn’t succeed.

lucasbradstreet07:11:47

This could help you for the Thread/sleep issue too.

asolovyov07:11:48

I guess I need to test for that, I have an idea how to replace Thread/sleep

asolovyov07:11:31

IIRC you also had an idea where retries are handled by Onyx' inner machinery, should I maybe look more into that?

lucasbradstreet07:11:52

Yeah, what I just mentioned is the best play for now. I can run you through it quickly if you’re interested.

lucasbradstreet07:11:32

K. So at the moment you increase the in-flight-writes atom when you process the segment for the first time

lucasbradstreet07:11:16

Then you retry the message until it either fails or you give up. But until you decrement in-flight-writes you can’t move to the next epoch, so eventually it’ll block the whole pipeline https://github.com/onyx-platform/onyx-http/blob/0.12.x/src/onyx/plugin/http_output.clj#L86
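Roughly the shape of the pattern being described, as an illustration only (this is not the actual plugin source; in-flight-writes, send-request! and retry-params are stand-in names):

;; Illustration of the blocking approach: the in-flight counter is only
;; released once the retry loop finishes, so a slow retry stalls the epoch.
(defn process-segment-blocking!
  [in-flight-writes send-request! retry-params segment]
  (swap! in-flight-writes inc)
  (loop [attempt 0]
    (let [{:keys [success?]} (send-request! segment)]
      (cond success?
            (swap! in-flight-writes dec)
            (< attempt (:max-attempts retry-params))
            (do (Thread/sleep (:sleep-ms retry-params))
                (recur (inc attempt)))
            :else
            (swap! in-flight-writes dec)))))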

asolovyov07:11:52

yeah... and depending on retry-params it can take a while to end

lucasbradstreet07:11:02

So instead, I think when you get a failure that you want to retry on, you should decrement in-flight-writes, but add the message you want to retry to a checkpoint atom, along with the time to try it next.

lucasbradstreet07:11:41

This gives you a way to store your retryable segments in a fault tolerant way.

asolovyov07:11:57

this makes sense

lucasbradstreet07:11:59

These segments will be checkpointed to S3 on each checkpoint.

asolovyov07:11:09

what do you mean by a checkpoint atom? should I have an atom there?

asolovyov07:11:35

like (reset! saved-state [req1 req2 req3])?

asolovyov07:11:51

ah, so I manage saved state manually

lucasbradstreet07:11:01

And (swap! saved-state assoc next-try-time segment) when you want to stash it away

lucasbradstreet07:11:07

or something like that
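A minimal sketch of that idea, assuming the plugin keeps two atoms around (in-flight-writes and saved-state are the names used above; retry-delay-ms and the response shape are made up for the example):

;; Sketch: on a retryable failure, release the in-flight slot straight away and
;; stash the segment keyed by the time it should next be attempted. The
;; saved-state atom is what would get returned from the plugin's checkpoint call.
;; (A real implementation would need to cope with two segments failing in the
;; same millisecond, e.g. by storing a vector of segments per timestamp.)
(defn handle-response!
  [in-flight-writes saved-state retry-delay-ms segment {:keys [success?]}]
  (swap! in-flight-writes dec)
  (when-not success?
    (swap! saved-state assoc
           (+ (System/currentTimeMillis) retry-delay-ms)
           segment)))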

asolovyov07:11:09

well, this doesn't sound too hard

lucasbradstreet07:11:18

Yeah, onyx should handle all the complex bits for you

asolovyov07:11:45

and I can "wait" (Thread/sleep or whatever) before making that request in process-message

lucasbradstreet07:11:46

Then you just need to occasionally check for things to retry in your write-batch, without making that check cost too much.

asolovyov07:11:11

so that instead of loop/recur I'll go through Onyx' checkpoints, but then I need to "wait" anyway

asolovyov07:11:39

instead of Thread/sleep I can just check whether enough time has passed

asolovyov07:11:42

in write-batches

asolovyov07:11:56

that sounds much nicer than what I have right now!

lucasbradstreet07:11:24

Just try not to check it on every write-batch, otherwise you might end up burning a lot of CPU when you have a lot of segments to retry. Maybe put a tick interval on it
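One way that periodic check might look (a sketch only; saved-state, last-scan, send-request! and the tick interval are assumptions, and wiring it into write-batch is left out):

;; Called from write-batch. Skips the scan entirely until the tick interval has
;; elapsed, then re-issues every segment whose next-try-time has passed.
(defn maybe-retry!
  [saved-state last-scan tick-interval-ms send-request!]
  (let [now (System/currentTimeMillis)]
    (when (> (- now @last-scan) tick-interval-ms)
      (reset! last-scan now)
      (doseq [[next-try-time segment] @saved-state
              :when (<= next-try-time now)]
        ;; remove it first; a failed retry will stash it again with a new time
        (swap! saved-state dissoc next-try-time)
        (send-request! segment)))))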

lucasbradstreet07:11:29

Yeah, it should come out pretty clean

asolovyov07:11:30

the less async code there is, the better

asolovyov07:11:55

okay, I don't promise to do it today, but that's something I'm definitely interested in! 🙂

lucasbradstreet07:11:07

Sure thing. Anyway, I’ve merged and released onyx-http as part of 0.12.0-beta2

asolovyov07:11:21

Cool thanks!

lucasbradstreet07:11:29

I had to put a big sleep in the test to get it to pass on CI, so you may want to take it out when running your tests locally.

asolovyov07:11:11

would you recommend updating to that version if we're in the middle of an upgrade right now? Or is it better to stick with 0.11.1, given that Black Friday is ahead of us and we're a little bit worried about stability? :-))

lucasbradstreet07:11:56

I made a lot of breaking changes in the last release, so if what you’re currently doing is working then I would stick to that until after.

asolovyov07:11:46

oh okay, then 0.11.1 it is for now 🙂

asolovyov07:11:41

we're seeing weird problems with 0.9.5 lately (after changing bits of infrastructure) and I'm wary of reporting it until we upgrade because we can't determine if it's problems with Onyx or what

asolovyov07:11:50

but jobs just get stuck from time to time 😕

asolovyov07:11:08

> onyx-http has been restored to functionality. Thanks Vsevolod Solovyov

hey! Vsevolod is my brother 🙂

asolovyov07:11:56

I think what confused you is that I fixed the link in the readme 🙂

asolovyov07:11:11

he did the initial implementation two years ago

lucasbradstreet07:11:32

You’re right in that I copied it from the readme 😉

asolovyov07:11:34

but now he's just "oh I forgot why it was like that"

asolovyov08:11:31

> BREAKING CHANGE Event map key :onyx.core/results has been removed. It has been replaced by :onyx.core/transformed, :onyx.core/triggered, and :onyx.core/write-batch. Output plugins should use :onyx.core/write-batch when writing outputs to their storage medium.

this repeats twice in changes.md

asolovyov08:11:51

three times 🙂

lucasbradstreet08:11:54

Cool, issue looks good and I agree.

asolovyov08:11:09

It's fairly important though 😄

lmergen08:11:25

well at least everyone should be aware there's a breaking change then 🙂

lucasbradstreet08:11:30

@lmergen you will like this one:
> New api function onyx.api/job-ids. Allows reverse lookup of a :job-name job-key, to corresponding job-id and tenancy-id. This makes supporting long running migrateable jobs much easier.

asolovyov08:11:01

we tried to move to giving explicit job ids in 0.9.5 but it broke the world for us

lmergen08:11:27

@lucasbradstreet does that mean we can start working with job names rather than job ids ?

lucasbradstreet08:11:55

Yes, so if you have a stable service name, you'll be able to look up the current corresponding tenancy and job-id for that job

asolovyov08:11:30

that is something I dreamed of

lucasbradstreet08:11:32

This should make it really easy to write a bunch of helper utils which do stuff like: give me the current state of the job with this name. Tell me what the resume point is for the job with this name and then kill the old job id.
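A hypothetical sketch of one such helper, assuming onyx.api/job-ids takes the peer/client config plus the job name and returns the tenancy-id and job-id (the exact signature and return shape weren't pinned down in this conversation):

;; Hypothetical helper: resolve a stable job name to its current tenancy/job-id,
;; then kill that job. The key names in the returned map are an assumption.
(require '[onyx.api :as api])

(defn kill-job-by-name
  [peer-config job-name]
  (let [{:keys [tenancy-id job-id]} (api/job-ids peer-config job-name)]
    (api/kill-job (assoc peer-config :onyx/tenancy-id tenancy-id) job-id)))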

lucasbradstreet08:11:47

I’ll bang out an onyx-example before the release is final.

asolovyov08:11:10

now if we could improve onyx-dashboard to show this, plus allow executing scripts (to restart/resubmit jobs)...

lucasbradstreet08:11:33

onyx-dashboard could certainly use some love. Unfortunately I don’t think it’ll be getting any, any time soon.

lmergen08:11:32

@lucasbradstreet gotcha. will be useful.

lmergen08:11:51

@lucasbradstreet i'm making more progress on debugging my issue btw. apparently reducing things to a single peer group has no effect, so that's actually good news. but using only a single job apparently fixes the issue. my focus is now on kafka. i am creating topics ad hoc and launching onyx jobs within milliseconds, so there might be a race condition going on there, with an onyx job launching before the topic is fully initialized.

lucasbradstreet08:11:15

Ooh, yeah, good idea

lmergen08:11:35

and if this is indeed the issue, then i want to find out why i don't see any useful errors

lmergen14:11:03

@lucasbradstreet well apparently there is true wisdom in adding sleep statements... i just had 25 successful test runs after adding a 1s sleep after creating a kafka topic facepalm

lmergen14:11:35

so i am relatively confident that it was a race condition, some error the kafka plugin is running into with a topic not yet being ready
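If that is the cause, one alternative to a fixed 1s sleep is to poll the broker until the topic's metadata is visible before submitting the job. A sketch using Kafka's AdminClient via interop (the helper name, timeouts and poll interval are assumptions; this is not part of onyx-kafka):

(import '(org.apache.kafka.clients.admin AdminClient))

;; Block until `topic` is visible to the brokers or `timeout-ms` elapses.
;; describeTopics fails (UnknownTopicOrPartitionException) while the topic is
;; still being created, so a successful .get means the metadata has propagated.
(defn wait-for-topic!
  [bootstrap-servers topic timeout-ms]
  (with-open [admin (AdminClient/create {"bootstrap.servers" bootstrap-servers})]
    (let [deadline (+ (System/currentTimeMillis) timeout-ms)]
      (loop []
        (if (try (-> (.describeTopics admin [topic]) (.all) (.get)) true
                 (catch java.util.concurrent.ExecutionException _ false))
          true
          (if (> (System/currentTimeMillis) deadline)
            (throw (ex-info "Topic never became available" {:topic topic}))
            (do (Thread/sleep 100) (recur))))))))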

lmergen14:11:57

i'm now going to focus on figuring out why it's not displaying any errors, though

michaeldrogalis15:11:39

@lmergen I've seen this before with Onyx out of the equation. If you try to connect to a topic or partition that doesn't exist yet, the client will block and sit there. 😕

michaeldrogalis15:11:46

Not sure if this has been fixed as of Kafka 1.0

lmergen15:11:08

well i tested it and apparently it is able to request partition information

michaeldrogalis15:11:13

Potentially also a problem with Onyx, too though. Just noting that this seems familiar.

lmergen15:11:19

in onyx, it just never arrives at the recover! call

lmergen15:11:31

somewhere between start and recover things die

michaeldrogalis15:11:23

Ahhh, got it. Disregard my comment then

lmergen15:11:50

so in this case, i think there is some obscure error being triggered, somewhere

lmergen15:11:06

i will continue adding println statements all around the stack to figure this out 🙂

michaeldrogalis15:11:45

Like a true Clojure champion.

lmergen16:11:49

anyway i'm glad i got this resolved now. spent way too much time on this.

nha16:11:57

https://github.com/onyx-platform/onyx-http/blob/0.12.x/src/onyx/plugin/http_output.clj#L10 This fn called successively with the same params may sometimes return nil right? Is that a concern? (I don’t use onyx-http, I am just curious - maybe it is just a matter of choosing “good” values for the params)

lmergen16:11:39

apparently aeron was not the problem -- apart from some "unavailable network image" errors being generated, which were solved by not running multiple instances of the embedded aeron driver on the same machine.

michaeldrogalis16:11:02

@nha It seems okay at a glance because it's being checked by the cond before getting used.

kenny21:11:00

According to the Onyx cheat sheet, :trigger/sync is not required (http://www.onyxplatform.org/docs/cheat-sheet/latest/#trigger-entry/:trigger/sync) but Onyx specs marks it as being required (https://github.com/onyx-platform/onyx-spec/blob/0.11.x/src/onyx/spec.cljc#L302-L306). It doesn't seem like it should be required though. Which is correct?

michaeldrogalis21:11:43

@kenny The spec is incorrect, it is not required.

michaeldrogalis21:11:56

I'll make the change now, thanks for reporting.

michaeldrogalis21:11:26

@kenny Pushed to master. It will go out with the next release.

kenny21:11:36

Awesome. Thanks!

michaeldrogalis21:11:42

Anytime! Thank you.

kenny21:11:14

Is there a way to get better catalog validation errors? I am receiving this error right now and it is quite vague:

------ Onyx Job Error -----
There was a validation error in your catalog for key 13

{
    :onyx/batch-size 1
    :onyx/batch-timeout 1000
    :onyx/name :pull-updates-aggregation
    :onyx/fn :compute.command-processor.tasks.query/pull-updates-task
    :onyx/params [:db-uri]
    :db-uri "datomic:"
    :onyx/type :function
    :onyx/group-by-fn :compute.command-processor.tasks.query/pull-updates-aggregation-group-by
    :onyx/flux-policy :kill
}

------

lucasbradstreet21:11:06

I’ve never seen a catalog validation error that bad 😕

kenny21:11:53

Running validate-job on my job prints the above and contains this exception as well:

Value does not match schema: {:catalog [nil nil nil nil nil nil nil nil nil nil nil nil nil (not (valid-flux-policy-min-max-n-peers a-clojure.lang.PersistentArrayMap)) nil nil nil nil nil]}

michaeldrogalis21:11:12

Grouped tasks require :onyx/n-peers to be defined
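For the catalog entry pasted above, a sketch of the fix is just adding :onyx/n-peers (or :onyx/min-peers and :onyx/max-peers) to the grouped task; the value 1 is only illustrative:

{:onyx/name :pull-updates-aggregation
 :onyx/fn :compute.command-processor.tasks.query/pull-updates-task
 :onyx/type :function
 :onyx/group-by-fn :compute.command-processor.tasks.query/pull-updates-aggregation-group-by
 :onyx/flux-policy :kill
 :onyx/n-peers 1                ;; required for grouped tasks; value is illustrative
 :onyx/params [:db-uri]
 :db-uri "datomic:"
 :onyx/batch-size 1
 :onyx/batch-timeout 1000}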

michaeldrogalis21:11:36

We need to rip out the Schema validation in favor of something like Expound, this isn't holding up over time

kenny21:11:16

Yeah, that would be extremely helpful. We've run into a number of schema errors that just baffle us. Spec errors help a lot here.

michaeldrogalis21:11:31

Wish Spec was out 2 years earlier 🙂