This page is not created by, affiliated with, or supported by Slack Technologies, Inc.
2016-09-14
Channels
- # aws-lambda (5)
- # beginners (38)
- # boot (197)
- # carry (7)
- # clara (3)
- # cljs-dev (7)
- # cljsjs (6)
- # cljsrn (24)
- # clojure (39)
- # clojure-art (10)
- # clojure-austin (7)
- # clojure-dusseldorf (1)
- # clojure-italy (8)
- # clojure-russia (89)
- # clojure-spec (119)
- # clojure-taiwan (1)
- # clojure-uk (19)
- # clojurescript (104)
- # community-development (2)
- # conf-proposals (22)
- # copenhagen-clojurians (8)
- # cursive (2)
- # datomic (35)
- # devcards (4)
- # dirac (79)
- # euroclojure (2)
- # immutant (35)
- # om (138)
- # om-next (2)
- # onyx (172)
- # proton (4)
- # protorepl (1)
- # re-frame (36)
- # reagent (34)
- # spacemacs (1)
- # specter (7)
- # untangled (89)
- # yada (2)
That Grafana query looks right to me. The usual problem is not aggregating over the most discrete time unit, but you're doing that.
I’m using the onyx lein template. If I break the included test such that the job throws an exception, it appears that the test never completes...
would I normally just have to wait for it to time out? Should the job being killed due to an unhandled exception normally fail the test?
@smw the test is generally killed in such a way that the feedback-exception! call will return. Is this a fresh version of the onyx template?
Awesome to hear :)
Yeah, definitely not getting exceptions back — or test failure, even with timeout set much lower.
Also, if I interrupt execution, I have to restart the repl due to the old system holding the port open.
Can you check onyx.log to see if there’s anything interesting there? Sounds like the job isn’t shutting down cleanly
(ns user
  (:require [clojure.tools.namespace.repl :refer [refresh set-refresh-dirs]]))

(set-refresh-dirs "src" "test")

(defn init [])
(defn start [])
(defn stop [])

(defn go []
  (init)
  (start))

(defn reset []
  (stop)
  (refresh))
I saw you guys using the ‘reloaded’ pattern in the testing onyx jobs section of the user guide.
But you’re not actually defining anything for init/start/stop? Not using alter-var-root, etc?
Ah, just checked out the template. It looks like the feedback-exception! part got dropped in a refactor
Hmm, where did that user ns get pasted from? Heh
Oh, I can write a PR for this tomorrow, but it would also be nice to have some docs in your onyx.metrics readme specifying that you need to match the onyx version.
Yeah, we should say that more places
Ah yes 😕
You still need my onyx log? Should I assume that the missing feedback-exception! is why the onyx test environment doesn’t get killed?
No need
(deftest basic-test
  (testing "That we can have a basic in-out workflow run through Onyx"
    (let [{:keys [env-config peer-config]} (read-config (io/resource "config.edn"))
          job (my-app-name.jobs.basic/basic-job {:onyx/batch-size 10
                                                 :onyx/batch-timeout 1000})
          {:keys [in out]} (get-core-async-channels job)]
      (with-test-env [test-env [3 env-config peer-config]]
        (onyx.test-helper/validate-enough-peers! test-env job)
        (let [job-id (:job-id (onyx.api/submit-job peer-config job))]
          (doseq [segment segments]
            (>!! in segment))
          (onyx.test-helper/feedback-exception! peer-config job-id))
        (is (= (set (take-segments! out))
               (set [{:n 2} {:n 3} {:n 4} {:n 5} {:n 6} :done])))))))
try that
The problem is that take-segments! is blocking and has no way to know the job is killed
I’m having another problem where my actual ‘test’ (which happens to be doing my real job right now, with limited data) seems to complete the work correctly, but the test doesn’t finish executing.
@lucasbradstreet Can you recall the conversation we had last week where Onyx seems to hang without killing/stopping the job under heavy load on the transactor? It occurred three times yesterday; I could even kill the transactor without affecting the peer. When the peer is restarted it starts working on the job again, without me having to resubmit the job. I have added 2 thread dumps and the state from the replica server. https://gist.github.com/zamaterian/fe8495e07caafc20f9ab8f5a8384d010
@smw are you signaling that the job is :done?
@zamaterian thanks, I’ll look at it
keyword
@zamaterian so the peer gets stuck, you can kill it, and it’ll start work again including having a working connection to the transactor?
correct
do you get any exceptions in the log?
@lucasbradstreet In your fix above...
job-id which we obtained from the submitted job
No exception in the logs
make sure you get the order of the >!!s right, because you need to put them on the channel before the feedback-exception!, since it’ll block
and then the job will never finish
@zamaterian anything in the transactor logs?
No. After I restart the transactor, I start seeing Datomic debug messages in the logs on the Onyx peer, without any reaction from Onyx or in the Onyx-specific log files
@zamaterian this sounds quite similar, but sounds like it should’ve been fixed years ago http://thread.gmane.org/gmane.comp.db.datomic.user/3437
I think the diagnosis is right overall though, I suspect the async write is being blocked when derefing and it’s never finding out. It’s probably due to the same cause (GC issue)
@smw hmm. Yuck. Interesting though. What it’s doing is deserializing the exception that was written to zookeeper
@smw is there anything interesting in the exception that was thrown?
@zamaterian I have a few suggestions. One, are you using "-XX:+UseG1GC -server" for your JAVA_OPTS? Can you increase -Xmx? Can you decrease onyx/max-pending on your input task? All of these things will decrease memory pressure a bit
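For anyone following along, those suggestions together would look roughly like the fragment below. The -Xmx4g value is an assumption here; size the heap for your own host.

```shell
# Example peer JVM options combining the suggestions above.
# -Xmx4g is an illustrative value, not a recommendation for every host.
export JAVA_OPTS="-server -XX:+UseG1GC -Xmx4g"
```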
One of my tasks searches elasticsearch. Looks like I’m now getting a blank segment or something that I wasn’t getting before. I’ll investigate.
We serialize the exception that killed the job to zookeeper, but it looks like it’s having trouble round tripping it
@zamaterian you’re using write-bulk-datoms-async?
ahh, I think I found my problem. clojure failure. Evidently it’s hard to (conj) something to the end of a lazy-seq
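For anyone else hitting the same thing: conj adds at whatever position is efficient for the concrete collection, which for seqs (including lazy seqs) is the front, so "adding to the end" needs concat instead. A quick sketch:

```clojure
;; conj on a seq (including a lazy seq) prepends:
(conj (map inc [1 2 3]) :done)
;; => (:done 2 3 4)

;; to append while staying lazy, use concat:
(concat (map inc [1 2 3]) [:done])
;; => (2 3 4 :done)
```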
Ok, that’s beautiful. Everything working wonderfully. Thanks again for your time and the amazing project!
:thumbsup:
@lucasbradstreet just got caught in the daily scrum ritual 🙂 Correct, using write-bulk-datoms-async.
@zamaterian: my initial thoughts are GC pressure leading to datomic causing futures being unable to be derefed. I can add a timeout to the derefs, which is good practice, but I suspect you'll need to resolve your memory pressure issues too.
@lucasbradstreet if it's a GC issue, then shouldn't Onyx start recovering after some time? The last hang was from 17:46 yesterday to 08:00, when I killed the Onyx peer.
@zamaterian: my diagnosis / guess based on that issue is that memory pressure is happening, the peer becomes unresponsive for some time, doesn't receive the acknowledgement to its write and then is blocking forever on deref'ing the async write. When you kill the peer it gets unstuck ok. If this is true, there are two problems. One is that memory pressure is causing issues with datomic, the second is that we should probably be timing out the derefs and rebooting the peer when they do time out
@zamaterian: a bit later I can push a snapshot up that does the timeout. Hopefully you can give it a go with the current configuration so we can make sure it's handling it right.
Hi, I use the onyx-http plugin with retry-params (https://github.com/onyx-platform/onyx-http/commit/4b7c985bf065d570600597319027e9cc8b6abcdb). I need to retry the POST every 5 minutes for up to 1 day, something like this
:http-output/retry-params
{:base-sleep-ms 2000
:max-sleep-ms 300000
:max-total-sleep-ms 86400000}
With these params I get 4 POSTs every minute
@lucasbradstreet can you look?
@vladclj it’s implemented to exponentially back off
So it probably starts at 2000, then something like 4000, etc etc
You would probably need to factor out the retry mechanism, and allow a function to be passed in
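To illustrate why the observed sleeps start at 2000 and climb: a generic capped exponential backoff behaves like the hypothetical helper below. This is a sketch of the general technique, not onyx-http's actual implementation, which may also add jitter.

```clojure
;; Hypothetical helper illustrating capped exponential backoff.
(defn backoff-ms
  "Sleep before retry number `attempt` (0-based): doubles from
  base-sleep-ms and is capped at max-sleep-ms."
  [attempt base-sleep-ms max-sleep-ms]
  (min max-sleep-ms (* base-sleep-ms (long (Math/pow 2 attempt)))))

;; With :base-sleep-ms 2000 and :max-sleep-ms 300000:
(map #(backoff-ms % 2000 300000) (range 10))
;; => (2000 4000 8000 16000 32000 64000 128000 256000 300000 300000)
```

So the early retries come far more often than once per 5 minutes; only after ~8 attempts does the cap of 300000 ms kick in.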
Hm, what about the release branch (without retry-params)? Can I set the timeout for the next retry in the :http-output/success-fn function, :my.namespace/success?
Actually, now that I think about it, all you need to do is set the :onyx/pending-timeout on the input task to be 5 mins
then don’t use any retry in onyx-http
I guess the main problem is that it’ll never give up
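To make that suggestion concrete: :onyx/pending-timeout is set in milliseconds on the input task's catalog entry, so a 5-minute retry interval would look something like the sketch below. The task name, plugin, and batch size here are illustrative assumptions, not taken from the user's job.

```clojure
;; Illustrative input-task catalog entry; names and plugin are made up.
{:onyx/name :in
 :onyx/plugin :onyx.plugin.core-async/input
 :onyx/type :input
 :onyx/medium :core.async
 :onyx/batch-size 10
 ;; 5 minutes in ms: segments that aren't acked within this window are retried
 :onyx/pending-timeout 300000
 :onyx/doc "Reads segments from a core.async channel"}
```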
eh, sometimes the only way to find the source of a problem is to turn on logging for the included libraries
@mariusz_jachimowicz +1. I'm no stranger to lein install-ing local clones to do some digging around too.
I know it's kind of barbaric, but it gets me a solution pretty quickly. Shrug
is it feasible to have a more or less permanently running ‘development’ cluster that I can run tests against instead of running them locally?
There’s nothing stopping you, but you will probably have to kill your jobs in between doing things like re-defining functions, because Onyx resolves a lot of stuff at the time you start jobs
ok… so if I want to do TDD with something larger than my laptop, I should probably write something like with-test-env that launches a newly built uberjar against Marathon?
@smw with-test-env doesn't require an uberjar; that's something you can use right at the repl
It'll build the entire environment up in memory, run the test, then tear down in a timely manner, no uberjaring needed
I realize that part of the magic of onyx is that you can do iterative development with smaller datasets on your laptop...
but for some of this I really want to test results with larger datasets. Would love to be able to use the same pattern locally that somehow has your test client join a cluster and submit the job...
but Lucas just suggested that I probably need to restart the peers to pick up new functions.
maybe I can write a simple mesos framework with a web service that I can submit a jar to...
Just want the ‘build cluster that will run this newly modified job, run it, check for exceptions’ to be more seamless.
yes @michaeldrogalis, I use this install-local-clones technique also 😄. I have been struggling with strange behaviour in my current PR, and after turning on logging for the included jars I was able to find the source of the problem.
@smw So the issue with having a live cluster is that you won't be able to iterate on your function implementations. It's easy enough to continually kill the last job and submit a new one to keep all the peers executing the same job, since the job is data and that's easy to change.
I've done some experimentation with using nrepl to jack into a running cluster and modify the functions from my Emacs repl. It worked, I just didn't need it at the time
Clojure socket repl is another viable option
Right…. but what if you had a mesos framework that would (a) accept a jar, and (b) restart all the peers with the same jar?
@smw I'd be annoyed by the time it takes to make another uberjar and upload it every time
but would you only have to point the repl at one of the peers, or would you have to update the functions on all of them?
But if that's not a problem for you, maybe that can work?
All of them
@smw trust me, that is very annoying. We have really been through that pain dealing with performance issues
The cost of uberjar'ing is just awful across the board for every project I've worked on, Onyx and not. 😛
So if you want something iterative, I don't see how that can be part of the game, you know?
Yeah, it’s not quite as big of a deal if it’s going to take me 5 minutes to see the results of the job anyway.
Yeah, not a particularly viable path design wise either
I think there's a ton of value in someone exploring a repl-connection to a live cluster to update functions
Another thing I did was stand up my entire stack via DockerCompose to test a more real world example. Still ran the job via the test-env
I believe the point at which I put it down was establishing multiple socket connections through Emacs. I was on a plane traveling, so I was just playing around.
Yeah, we favor that approach too @camechis
I would also make a lot of your knobs configurable through ENV vars so you can just change values in marathon without having to redo the uberjar. We also suffered that, slowly correcting those
I'm not a big fan of either of those, but I'm for the idea behind it. Whatever you can use to replicate your prod environment locally to support your Onyx job is good.
Another thing we did to help speed things up is automate our Jenkins 2.0 pipeline to auto-deploy the peer directly into Marathon, so it's just a commit, push, approve-deploy process
You certainly have some interesting options if you want to update a live cluster - mostly because of how dynamic Clojure is.
@camechis Yeah, I’ll do something similar for prod jobs, but I’m not sure I want to go there for dev.
(we have an in-house tool we use for deploys that configures monitoring, pushes grafana dashboards, configures calico profiles, etc., so the process would really go [ci-server] -> [pickwick] -> [marathon])
@camechis can you update job performance knobs while the job is running? possible to make it poll consul or something?
@smw At the moment, no. But there will be far fewer performance knobs to tune for Onyx itself in the next release, so it will be less of an issue.
Namely backpressure will be automatic, and that's the big one that usually requires the most attention.
@michaeldrogalis Tell me more on this!
@camechis The short summary is that instead of the root peer maintaining a pool of segments to possibly roll back to in the event of failure, and throttling when its pool is too big, every peer can backpressure locally, no matter what spot in the workflow it's at. Markers called barriers are injected regularly into the stream of segments to act as checkpoints for progress and potentially rollback.
I thought automatic backpressure was already implemented? http://michaeldrogalis.github.io/jekyll/update/2015/06/08/Onyx-0.7.0.html
oh, is that the change you were talking about on the defn podcast? More like Flink's model?
@aengelberg To some degree, I think using the word "automatic" there was a misstep. There was more we could have done, but we didn't know how far we could go at the time.
@smw Correct, yeah. Async Barrier Snapshotting.
It's pretty complicated to get right, but worth the time. This'll be Onyx's 3rd rewrite of the streaming engine.
StrangeLoop is already here -- that snuck up quickly. This marks ~2 years since Onyx was open sourced. Thanks so much to everyone who contributes to Onyx and the discussion here. Our little ecosystem is a lot bigger than it used to be, so everyone helping each other out has gone a very long way towards making it awesome. 🙂
on that note, is anyone that frequents #onyx at strangeloop this year?
you can keep an eye out for @dspiteself, he isn't on this thing often, but he
@bfaber will do
@bfaber Can we put your company logo on the GitHub README of core?
I can't imagine we'd object at all, but that decision rests with greater minds and powers, I'll ask
🙏 Thanks
@drewverlee Enjoy the museum party. One of the most fun events everrr. 😄
@michaeldrogalis I have been excited about it for a whole year. You're going to have to emerge from your work cocoon and go next year.
Hi, so digging a bit into encrypted S3 buckets, looking at http://www.programcreek.com/java-api-examples/index.php?api=com.amazonaws.services.s3.transfer.Transfer.TransferState there is an objectMetadata.setSSEAlgorithm("AES256");
that could be added to https://github.com/onyx-platform/onyx-amazon-s3/blob/0.9.x/src/onyx/plugin/s3_utils.clj#L32 in some cogent way to indicate that the s3 plugin needs to use SSE... thoughts? Should I open an issue for this?
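For context, such a change might look roughly like the sketch below, written against the v1 AWS Java SDK. The :s3/encryption keyword and both helper names are hypothetical until a PR actually lands.

```clojure
;; Hypothetical mapping from a proposed :s3/encryption task param to the
;; SDK's server-side-encryption algorithm string. "AES256" is the value of
;; ObjectMetadata/AES_256_SERVER_SIDE_ENCRYPTION in the AWS Java SDK v1.
(defn encryption->sse-algorithm [encryption]
  (case encryption
    :aes256 "AES256"
    :none   nil))

;; Sketch: build the upload's ObjectMetadata, setting SSE when requested.
;; Requires com.amazonaws.services.s3.model.ObjectMetadata on the classpath.
(defn object-metadata [encryption ^long content-length]
  (let [metadata (doto (com.amazonaws.services.s3.model.ObjectMetadata.)
                   (.setContentLength content-length))]
    (when-let [algo (encryption->sse-algorithm encryption)]
      (.setSSEAlgorithm metadata algo))
    metadata))
```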
@drewverlee I went the last 2 years, actually was with @lucasbradstreet last year. Needed a break. 🙂
@aaelony A PR that adds a writer param :s3/encryption and allows the value :aes256 would be great, if you can.