This page is not created by, affiliated with, or supported by Slack Technologies, Inc.
2016-09-14
Channels
- # aws-lambda (5)
- # beginners (38)
- # boot (197)
- # carry (7)
- # clara (3)
- # cljs-dev (7)
- # cljsjs (6)
- # cljsrn (24)
- # clojure (39)
- # clojure-art (10)
- # clojure-austin (7)
- # clojure-dusseldorf (1)
- # clojure-italy (8)
- # clojure-russia (89)
- # clojure-spec (119)
- # clojure-taiwan (1)
- # clojure-uk (19)
- # clojurescript (104)
- # community-development (2)
- # conf-proposals (22)
- # copenhagen-clojurians (8)
- # cursive (2)
- # datomic (35)
- # devcards (4)
- # dirac (79)
- # euroclojure (2)
- # immutant (35)
- # om (138)
- # om-next (2)
- # onyx (172)
- # proton (4)
- # protorepl (1)
- # re-frame (36)
- # reagent (34)
- # spacemacs (1)
- # specter (7)
- # untangled (89)
- # yada (2)
That Grafana query looks right to me. The usual problem is not aggregating over the most discrete time unit, but you're doing that.
I’m using the onyx lein template. If I break the included test such that the job throws an exception, it appears that the test never completes...
would I normally just have to wait for it to time out? Should the job being killed due to an unhandled exception normally fail the test?
@smw the test is generally killed in such a way that the feedback-exception! call will return. Is this a fresh version of the onyx template?
Awesome to hear :)
Yeah, definitely not getting exceptions back — or test failure, even with timeout set much lower.
Also, if I interrupt execution, I have to restart the repl due to the old system holding the port open.
Can you check onyx.log to see if there’s anything interesting there? Sounds like the job isn’t shutting down cleanly
(ns user
  (:require [clojure.tools.namespace.repl :refer [refresh set-refresh-dirs]]))

(set-refresh-dirs "src" "test")

(defn init [])
(defn start [])
(defn stop [])

(defn go []
  (init)
  (start))

(defn reset []
  (stop)
  (refresh))
I saw you guys using the ‘reloaded’ pattern in the testing onyx jobs section of the user guide.
But you’re not actually defining anything for init/start/stop? Not using alter-var-root, etc?
Ah, just checked out the template. It looks like the feedback-exception! part got dropped in a refactor
Hmm, where did that user ns get pasted from? Heh
Oh, I can write a PR for this tomorrow, but it would also be nice to have some docs in your onyx.metrics readme specifying that you need to match the onyx version.
Yeah, we should say that more places
Ah yes 😕
You still need my onyx log? Should I assume that the missing feedback-exception! is why the onyx test environment doesn’t get killed?
No need
(deftest basic-test
  (testing "That we can have a basic in-out workflow run through Onyx"
    (let [{:keys [env-config peer-config]} (read-config (io/resource "config.edn"))
          job (my-app-name.jobs.basic/basic-job {:onyx/batch-size 10
                                                 :onyx/batch-timeout 1000})
          {:keys [in out]} (get-core-async-channels job)]
      (with-test-env [test-env [3 env-config peer-config]]
        (onyx.test-helper/validate-enough-peers! test-env job)
        (let [job-id (:job-id (onyx.api/submit-job peer-config job))]
          (doseq [segment segments]
            (>!! in segment))
          (onyx.test-helper/feedback-exception! peer-config job-id))
        (is (= (set (take-segments! out))
               (set [{:n 2} {:n 3} {:n 4} {:n 5} {:n 6} :done])))))))
try that
The problem is that take-segments! is blocking and has no way to know the job is killed
I’m having another problem where my actual ‘test’ (which happens to be doing my real job right now, with limited data) seems to complete the work correctly, but the test doesn’t finish executing.
@lucasbradstreet Can you recall the conversation we had last week where Onyx seems to hang without killing/stopping the job under heavy load on the transactor? It occurred three times yesterday; I could even kill the transactor without affecting the peer. When the peer is restarted it starts working on the job again, without me having to resubmit the job. I have added 2 thread dumps and the state from the replica server. https://gist.github.com/zamaterian/fe8495e07caafc20f9ab8f5a8384d010
@smw are you signaling that the job is :done?
@zamaterian thanks, I’ll look at it
keyword
@zamaterian so the peer gets stuck, you can kill it, and it’ll start work again including having a working connection to the transactor?
correct
do you get any exceptions in the log?
@lucasbradstreet In your fix above...
job-id which we obtained from the submitted job
No exception in the logs
make sure you get the order of the >!!s right, because you need to put them on the channel before the feedback-exception!, since it’ll block
and then the job will never finish
@zamaterian anything in the transactor logs?
No. After I restart the transactor, I start seeing Datomic debug messages in the logs on the Onyx peer, without any reaction from Onyx or in the Onyx-specific log files
@zamaterian this sounds quite similar, but sounds like it should’ve been fixed years ago http://thread.gmane.org/gmane.comp.db.datomic.user/3437
I think the diagnosis is right overall though, I suspect the async write is being blocked when derefing and it’s never finding out. It’s probably due to the same cause (GC issue)
@smw hmm. Yuck. Interesting though. What it’s doing is deserializing the exception that was written to zookeeper
@smw is there anything interesting in the exception that was thrown?
@zamaterian I have a few suggestions. One, are you using "-XX:+UseG1GC -server" for your JAVA_OPTS? Can you increase -Xmx? Can you decrease onyx/max-pending on your input task? All of these things will decrease memory pressure a bit
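For anyone following along, those suggestions together would look roughly like the fragment below. The -Xmx4g value is an assumption here; size the heap for your own host.

```shell
# Example peer JVM options combining the suggestions above.
# -Xmx4g is an illustrative value, not a recommendation for every host.
export JAVA_OPTS="-server -XX:+UseG1GC -Xmx4g"
```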
One of my tasks searches elasticsearch. Looks like I’m now getting a blank segment or something that I wasn’t getting before. I’ll investigate.
We serialize the exception that killed the job to zookeeper, but it looks like it’s having trouble round tripping it
@zamaterian you’re using write-bulk-datoms-async?
ahh, I think I found my problem. clojure failure. Evidently it’s hard to (conj) something to the end of a lazy-seq
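For anyone else hitting the same thing: conj adds at whatever position is efficient for the concrete collection, which for seqs (including lazy seqs) is the front, so "adding to the end" needs concat instead. A quick sketch:

```clojure
;; conj on a seq (including a lazy seq) prepends:
(conj (map inc [1 2 3]) :done)
;; => (:done 2 3 4)

;; to append while staying lazy, use concat:
(concat (map inc [1 2 3]) [:done])
;; => (2 3 4 :done)
```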
Ok, that’s beautiful. Everything working wonderfully. Thanks again for your time and the amazing project!
:thumbsup:
@lucasbradstreet just got caught in the daily scrum ritual 🙂 Correct, using write-bulk-datoms-async.
@zamaterian: my initial thoughts are GC pressure leading to datomic causing futures being unable to be derefed. I can add a timeout to the derefs, which is good practice, but I suspect you'll need to resolve your memory pressure issues too.
@lucasbradstreet if it's a GC issue, then shouldn't Onyx start recovering after some time? The last hang was from 17:46 yesterday to 08:00, when I killed the Onyx peer.
@zamaterian: my diagnosis / guess based on that issue is that memory pressure is happening, the peer becomes unresponsive for some time, doesn't receive the acknowledgement to its write and then is blocking forever on deref'ing the async write. When you kill the peer it gets unstuck ok. If this is true, there are two problems. One is that memory pressure is causing issues with datomic, the second is that we should probably be timing out the derefs and rebooting the peer when they do time out
@zamaterian: a bit later I can push a snapshot up that does the timeout. Hopefully you can give it a go with the current configuration so we can make sure it's handling it right.
Hi, I use the onyx-http plugin with retry-params (https://github.com/onyx-platform/onyx-http/commit/4b7c985bf065d570600597319027e9cc8b6abcdb). I need to retry the POST every 5 minutes for up to 1 day, something like this
:http-output/retry-params
{:base-sleep-ms 2000
:max-sleep-ms 300000
:max-total-sleep-ms 86400000}
With these params I get 4 POSTs every minute
@lucasbradstreet can you look?
@vladclj it’s implemented to exponentially back off
So it probably starts at 2000, then something like 4000, etc etc
You would probably need to factor out the retry mechanism, and allow a function to be passed in
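To illustrate why the observed sleeps start at 2000 and climb: a generic capped exponential backoff behaves like the hypothetical helper below. This is a sketch of the general technique, not onyx-http's actual implementation, which may also add jitter.

```clojure
;; Hypothetical helper illustrating capped exponential backoff.
(defn backoff-ms
  "Sleep before retry number `attempt` (0-based): doubles from
  base-sleep-ms and is capped at max-sleep-ms."
  [attempt base-sleep-ms max-sleep-ms]
  (min max-sleep-ms (* base-sleep-ms (long (Math/pow 2 attempt)))))

;; With :base-sleep-ms 2000 and :max-sleep-ms 300000:
(map #(backoff-ms % 2000 300000) (range 10))
;; => (2000 4000 8000 16000 32000 64000 128000 256000 300000 300000)
```

So the early retries come far more often than once per 5 minutes; only after ~8 attempts does the cap of 300000 ms kick in.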
Hm, what about the release branch (without retry-params)? Can I set the timeout for the next retry in the :http-output/success-fn function, :my.namespace/success?
Actually, now that I think about it, all you need to do is set the :onyx/pending-timeout on the input task to be 5 mins
then don’t use any retry in onyx-http
I guess the main problem is that it’ll never give up
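To make that suggestion concrete: :onyx/pending-timeout is set in milliseconds on the input task's catalog entry, so a 5-minute retry interval would look something like the sketch below. The task name, plugin, and batch size here are illustrative assumptions, not taken from the user's job.

```clojure
;; Illustrative input-task catalog entry; names and plugin are made up.
{:onyx/name :in
 :onyx/plugin :onyx.plugin.core-async/input
 :onyx/type :input
 :onyx/medium :core.async
 :onyx/batch-size 10
 ;; 5 minutes in ms: segments that aren't acked within this window are retried
 :onyx/pending-timeout 300000
 :onyx/doc "Reads segments from a core.async channel"}
```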
eh, sometimes the only way to find the source of a problem is to turn on logging for the included libraries
@mariusz_jachimowicz +1. I'm no stranger to lein install-ing local clones to do some digging around too.
I know it's kind of barbaric, but it gets me a solution pretty quickly. Shrug
is it feasible to have a more or less permanently running ‘development’ cluster that I can run tests against instead of running them locally?
There’s nothing stopping you, but you will probably have to kill your jobs in between doing things like re-defining functions, because Onyx resolves a lot of stuff at the time you start jobs
ok… so if I want to do TDD with something larger than my laptop, I should probably write something like with-test-env that launches a newly built uberjar against Marathon?
@smw with-test-env doesn't require an uberjar; that's something you can use right at the repl
It'll build the entire environment up in memory, run the test, then tear down in a timely manner, no uberjaring needed
I realize that part of the magic of onyx is that you can do iterative development with smaller datasets on your laptop...
but for some of this I really want to test results with larger datasets. Would love to be able to use the same pattern locally that somehow has your test client join a cluster and submit the job...
but Lucas just suggested that I probably need to restart the peers to pick up new functions.
maybe I can write a simple mesos framework with a web service that I can submit a jar to...
Just want the ‘build cluster that will run this newly modified job, run it, check for exceptions’ to be more seamless.
yes @michaeldrogalis, I use this install-local-clones technique also 😄. I have been struggling with strange behaviour in my current PR, and after turning on logging for the included jars I was able to find the source of the problem.
@smw So the issue with having a live cluster is that you won't be able to iterate on your function implementations. It's easy enough to continually kill the last job and submit a new one to keep all the peers executing the same job, since the job is data and that's easy to change.
I've done some experimentation with using nrepl to jack into a running cluster and modify the functions from my Emacs repl. It worked, I just didn't need it at the time
Clojure socket repl is another viable option
Right…. but what if you had a mesos framework that would (a) accept a jar, and (b) restart all the peers with the same jar?
@smw I'd be annoyed by the time it takes to make another uberjar and upload it every time
but would you only have to point the repl at one of the peers, or would you have to update the functions on all of them?
But if that's not a problem for you, maybe that can work?
All of them
@smw trust me, that is very annoying. We have really been through that pain dealing with performance issues
The cost of uberjar'ing is just awful across the board for every project I've worked on, Onyx and not. 😛
So if you want something iterative, I don't see how that can be part of the game, you know?
Yeah, it’s not quite as big of a deal if it’s going to take me 5 minutes to see the results of the job anyway.
Yeah, not a particularly viable path design wise either
I think there's a ton of value in someone exploring a repl-connection to a live cluster to update functions
Another thing I did was stand up my entire stack via DockerCompose to test a more real world example. Still ran the job via the test-env
I believe the point at which I put it down was establishing multiple socket connections through Emacs. I was on a plane traveling, so I was just playing around.
Yeah, we favor that approach too @camechis
I would also make a lot of your knobs configurable through ENV vars so you can just change values in marathon without having to redo the uberjar. We also suffered that, slowly correcting those
I'm not a big fan of either of those, but I'm for the idea behind it. Whatever you can use to replicate your prod environment locally to support your Onyx job is good.
Another thing we did to help speed things up is automate our Jenkins 2.0 pipeline to auto-deploy the peer directly into Marathon, so it's just a commit, push, approve-deploy process
You certainly have some interesting options if you want to update a live cluster - mostly because of how dynamic Clojure is.
@camechis Yeah, I’ll do something similar for prod jobs, but I’m not sure I want to go there for dev.
(we have an in-house tool we use for deploys that configures monitoring, pushes grafana dashboards, configures calico profiles, etc., so the process would really go [ci-server] -> [pickwick] -> [marathon])
@camechis can you update job performance knobs while the job is running? possible to make it poll consul or something?
@smw At the moment, no. But there will be far fewer performance knobs to tune for Onyx itself in the next release, so it will be less of an issue.
Namely backpressure will be automatic, and that's the big one that usually requires the most attention.
@michaeldrogalis Tell me more on this!
@camechis The short summary is that instead of the root peer maintaining a pool of segments to possibly roll back to in the event of failure, and throttling when its pool is too big, every peer can backpressure locally, no matter what spot in the workflow it's at. Markers called barriers are injected regularly into the stream of segments to act as checkpoints for progress and potentially rollback.
I thought automatic backpressure was already implemented? http://michaeldrogalis.github.io/jekyll/update/2015/06/08/Onyx-0.7.0.html
oh, is that the change you were talking about on the defn podcast? More like Flink's model?
@aengelberg To some degree, I think using the word "automatic" there was a misstep. There was more we could have done, but we didn't know how far we could go at the time.
@smw Correct, yeah. Async Barrier Snapshotting.
It's pretty complicated to get right, but worth the time. This'll be Onyx's 3rd rewrite of the streaming engine.
StrangeLoop is already here -- that snuck up quickly. This marks ~2 years since Onyx was open sourced. Thanks so much to everyone who contributes to Onyx and the discussion here. Our little ecosystem is a lot bigger than it used to be, so everyone helping each other out has gone a very long way towards making it awesome. 🙂
on that note, is anyone that frequents #onyx at strangeloop this year?
you can keep an eye out for @dspiteself, he isn't on this thing often, but he
@bfaber will do
@bfaber Can we put your company logo on the GitHub README of core?
I can't imagine we'd object at all, but that decision rests with greater minds and powers, I'll ask
🙏 Thanks
@drewverlee Enjoy the museum party. One of the most fun events everrr. 😄
@michaeldrogalis I have been excited about it for a whole year. You're going to have to emerge from your work cocoon and go next year.
Hi, so digging a bit into encrypted S3 buckets, looking at http://www.programcreek.com/java-api-examples/index.php?api=com.amazonaws.services.s3.transfer.Transfer.TransferState there is an objectMetadata.setSSEAlgorithm("AES256");
that could be added to https://github.com/onyx-platform/onyx-amazon-s3/blob/0.9.x/src/onyx/plugin/s3_utils.clj#L32 in some cogent way to indicate that the s3 plugin needs to use SSE... thoughts? Should I open an issue for this?
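For context, such a change might look roughly like the sketch below, written against the v1 AWS Java SDK. The :s3/encryption keyword and both helper names are hypothetical until a PR actually lands.

```clojure
;; Hypothetical mapping from a proposed :s3/encryption task param to the
;; SDK's server-side-encryption algorithm string. "AES256" is the value of
;; ObjectMetadata/AES_256_SERVER_SIDE_ENCRYPTION in the AWS Java SDK v1.
(defn encryption->sse-algorithm [encryption]
  (case encryption
    :aes256 "AES256"
    :none   nil))

;; Sketch: build the upload's ObjectMetadata, setting SSE when requested.
;; Requires com.amazonaws.services.s3.model.ObjectMetadata on the classpath.
(defn object-metadata [encryption ^long content-length]
  (let [metadata (doto (com.amazonaws.services.s3.model.ObjectMetadata.)
                   (.setContentLength content-length))]
    (when-let [algo (encryption->sse-algorithm encryption)]
      (.setSSEAlgorithm metadata algo))
    metadata))
```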
@drewverlee I went the last 2 years, actually was with @lucasbradstreet last year. Needed a break. 🙂
@aaelony A PR that adds a writer param :s3/encryption and allows the value :aes256 would be great, if you can.