This page is not created by, affiliated with, or supported by Slack Technologies, Inc.
2017-08-03
@michaeldrogalis if i recall correctly, you once talked about an onyx job management tool that you’re using at pyroclast to manage job state (create, delete, list). do i remember correctly that this tool was also reading the onyx log and storing any interesting mutations (job stopped, job crashed, etc.) in datomic? because that’s pretty much what i’m considering doing right now to make my life a bit easier
@lmergen Correct, that’s basically what we’re doing. Recommended. :thumbsup:
@lmergen yes, we essentially do that, and also track a job-id for each deployment. When we kill a job, and resume the state in a new job, we update the job-id in the db
yes, indeed, seems like a necessary interface on top of all this if you’re running continuous / streaming jobs
Would love for something like that to be open source, but we don’t have time to extract it right now.
The best part is probably the ability to audit everything that’s happened as jobs start and resume.
what i'm wondering is: you're pretty much keeping a shadow log of the onyx log. do these things ever get out of sync? or do you just periodically rebuild from the onyx log?
(i must admit that i haven't looked into the onyx log that much yet, but since peers need a consistent view of the log, i can only imagine that this is a solved problem)
@lmergen we assume the log is the source of truth, and only flow one way
so if you submit a job, we wouldn’t go and update the db with a preliminary status, we would wait until things happen on the log before updating the db
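The one-way flow described above — the log is the source of truth, and the db view is only ever derived from it — can be sketched as a pure reduction over log entries. The entry shapes here are hypothetical, for illustration only; Onyx's real log entries look different:

```clojure
;; Sketch of treating the log as the single source of truth: the db
;; view is only ever derived by folding over log entries, never
;; written to preemptively. Entry shapes are made up for illustration.
(defn apply-entry
  "Folds one log entry into the job-state view."
  [state {:keys [event job-id]}]
  (case event
    :job-submitted (assoc state job-id :running)
    :job-killed    (assoc state job-id :killed)
    :job-completed (assoc state job-id :completed)
    state))

(defn rebuild-view
  "Rebuilds the entire view from scratch by replaying the log,
  which is also how you'd recover if the shadow copy got corrupted."
  [log-entries]
  (reduce apply-entry {} log-entries))

(rebuild-view
 [{:event :job-submitted :job-id "j1"}
  {:event :job-submitted :job-id "j2"}
  {:event :job-killed    :job-id "j1"}])
;; => {"j1" :killed, "j2" :running}
```

Because the fold is pure, "rebuild from the onyx log" and "stay in sync" are the same operation — replaying the same entries always yields the same view.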
@ghadi! Welcome. 🙂
howdy @michaeldrogalis .
I think I have a good use case for investigating onyx for myself. I saw this blog post: http://formcept.com/blog/locality-sensitive-hashing-part-2-moving-from-spark-to-onyx/#sthash.zLbPj90c.dpbs
@ghadi Cool! Do you have any particular questions or want to bounce your design off us?
Not yet, but I need to immerse myself in onyx / the space. Right now I'm duplicating a lot of what that post does, but on a single machine with core.async
I'm not using locality sensitive hashing but I have to take 1M records, generate a list of candidate pairs, then compare them for matches
Cool, that’s a good way to start. If you want to dabble with the abstractions, onyx-local-rt can be helpful. It’s a purely deterministic version of Onyx without any threads/networking.
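For reference, an onyx-local-rt job is plain data. A minimal sketch of the kind of job map it consumes — the task names, batch sizes, and the `:my.ns/my-inc` function keyword are hypothetical, and this isn't verified against a specific onyx-local-rt version:

```clojure
;; Hypothetical job map: a workflow wiring :in -> :inc -> :out, with a
;; catalog describing each task. onyx-local-rt's api/init consumes a
;; map like this; api/new-segment feeds it input and api/drain runs it
;; to completion, all in-process with no threads or networking.
(def job
  {:workflow [[:in :inc] [:inc :out]]
   :catalog  [{:onyx/name :in
               :onyx/type :input
               :onyx/batch-size 20}
              {:onyx/name :inc
               :onyx/type :function
               :onyx/fn :my.ns/my-inc   ; fully-qualified fn keyword
               :onyx/batch-size 20}
              {:onyx/name :out
               :onyx/type :output
               :onyx/batch-size 20}]
   :lifecycles []})
```

Driving it looks roughly like `(-> (api/init job) (api/new-segment :in {:n 41}) (api/drain) (api/env-summary))`, per the onyx-local-rt README.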
it's the middle part (generating candidate pairs) and the last part where it might be nice to have a framework
Pretty neat - is it strictly a batch problem, or are you getting a stream of records?
Cool. You might actually want to look at Spark for doing the joining piece if it’s completely batch-oriented. Their architecture is geared more toward working with a static dataset, since they can make more assumptions.
Ha, well I want you to win the competition after all 😛
It would work, but I think you can do better.
the "path" of the computation i need is so close to what is in the formcept post, just not streaming.
On the plus side, if you’re doing it with core async already, it may be easier to convert over to Onyx.
The locality problem with batching is a little different to streaming.
Even if the performance isn’t equivalent to what you could get out of Spark for this use case.
That’s also true. If you don’t need absolutely crushing performance, it may matter less.
the task is matching patient records / finding dups. The technical challenges are 1) avoiding exploring the n-squared space of pairs 2) accounting for erroneous input data 3) having some fast feedback while changing the comparison algorithm for a given pair of patients
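Challenge 1 above (avoiding the n² space of pairs) is usually tackled with blocking: group records by a cheap key and only generate pairs within a block — locality sensitive hashing, as in the formcept post, is one way to pick such keys. A minimal plain-Clojure sketch, with hypothetical record shapes and a naive blocking key:

```clojure
;; Blocking: only pair up records that share a cheap blocking key
;; (here: first letter of surname + birth year), so we never compare
;; every record against every other record. Record shapes and the
;; key choice are hypothetical.
(defn blocking-key [{:keys [surname birth-year]}]
  [(first surname) birth-year])

(defn pairs-within
  "All unordered pairs from a single block."
  [records]
  (for [[i a] (map-indexed vector records)
        b     (drop (inc i) records)]
    [a b]))

(defn candidate-pairs
  "Candidate pairs from blocking; far fewer than n^2 for a good key."
  [records]
  (mapcat pairs-within (vals (group-by blocking-key records))))

(def patients
  [{:surname "Smith" :birth-year 1970 :id 1}
   {:surname "Smyth" :birth-year 1970 :id 2}
   {:surname "Jones" :birth-year 1980 :id 3}])

(count (candidate-pairs patients))
;; => 1  ; only Smith/Smyth share a block
```

This also addresses challenge 3: since the expensive comparison only runs on candidate pairs, you can persist the pair set once and iterate quickly on the comparison function alone.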
Once you’ve set up your flow with core.async for processing the data, do you expect any piece of that equation to need to change in response to the problem?
e.g. can the problem be asked a bit differently that would require you to process the data in a different way
Also, if I find a novel way of getting more high quality candidate pairs, I can just extend & persist the set of candidate pairs
i haven't determined if i'm going to need to scale out to more than one machine because the pair matching algo isn't complete or measured
Ah. I’d avoid Onyx or Spark until you know you need to distribute. local-rt might be an optimal middle ground because you can use a proven abstraction, but avoid the distribution headaches until you need to make the investment.
Can easily become “and then you have n problems” 😄
sage advice. i hope i can win this competition with only clojure core libs and a few utility libraries for string similarity algos
Any interesting prizes?
only doing it b/c I used to work in clinical records wayyyyy back and have a bit of domain knowledge
If you hit perf issues that don’t seem to be solvable with algorithms + performance optimisation, you can always run it on an 18-core ec2 spot instance and judiciously pmap first.
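Since scoring each candidate pair is independent of every other pair, the pmap suggestion is a one-line change. A sketch with a toy stand-in for a real string-similarity function (`score-pair` and the threshold are hypothetical):

```clojure
;; Parallelise the pairwise comparison with pmap: each candidate pair
;; is scored independently, so the work is embarrassingly parallel.
;; score-pair is a toy stand-in for a real string-similarity algo.
(defn score-pair [[a b]]
  ;; fraction of aligned positions where the two names agree
  (let [hits (count (filter true? (map = a b)))
        len  (max (count a) (count b))]
    (if (zero? len) 0.0 (double (/ hits len)))))

(defn likely-matches
  "Scores all pairs in parallel and keeps those above the threshold."
  [pairs threshold]
  (->> pairs
       (pmap (fn [p] {:pair p :score (score-pair p)}))
       (filter #(>= (:score %) threshold))))

(likely-matches [["smith" "smyth"] ["smith" "jones"]] 0.7)
;; => ({:pair ["smith" "smyth"] :score 0.8})
```

pmap's fixed chunking is crude, but for a CPU-bound scoring function on a single big box it is often enough before reaching for anything heavier.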
That’s pretty cool, sounds like a fun time.