#onyx
2017-08-03
lmergen 15:08:55

@michaeldrogalis if i recall correctly, you once talked about an onyx job management tool that you’re using at pyroclast to manage job state (create, delete, list). do i remember correctly that this tool was also reading the onyx log and storing any interesting mutations (job stopped, job crashed, etc) in datomic? because that’s pretty much what i’m considering doing right now to make my life a bit easier

michaeldrogalis 16:08:54

@lmergen Correct, that’s basically what we’re doing. Recommended. :thumbsup:

lucasbradstreet 17:08:58

@lmergen yes, we essentially do that, and also track a job-id for each deployment. When we kill a job, and resume the state in a new job, we update the job-id in the db
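
A minimal sketch of the kind of deployment → job-id mapping being described, assuming Datomic on the backend; the attribute names (:deployment/name, :deployment/job-id, :deployment/status) are made up for illustration, not the actual schema:

```clojure
(require '[datomic.api :as d])

;; Hypothetical schema: one entity per long-running deployment, pointing at
;; whichever Onyx job id currently backs it.
(def deployment-schema
  [{:db/ident       :deployment/name
    :db/valueType   :db.type/string
    :db/cardinality :db.cardinality/one
    :db/unique      :db.unique/identity}
   {:db/ident       :deployment/job-id
    :db/valueType   :db.type/uuid
    :db/cardinality :db.cardinality/one}
   {:db/ident       :deployment/status
    :db/valueType   :db.type/keyword
    :db/cardinality :db.cardinality/one}])

;; When a job is killed and its state resumed in a new job, point the
;; deployment at the new job id.
(defn resume-deployment! [conn deployment-name new-job-id]
  @(d/transact conn [{:deployment/name   deployment-name
                      :deployment/job-id new-job-id
                      :deployment/status :resumed}]))
```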

lmergen 18:08:48

yes, indeed, seems like a necessary interface on top of all this if you’re running continuous / streaming jobs

lucasbradstreet 18:08:40

Would love for something like that to be open source, but we don’t have time to extract it right now.

michaeldrogalis 18:08:50

The best part is probably the ability to audit everything that’s happened as jobs start and resume.

lmergen 18:08:43

what i'm wondering: you're pretty much keeping a shadow log of the onyx log. do these things ever get out of sync? or do you just periodically rebuild from the onyx log?

lmergen 18:08:30

(i must admit that i haven't looked into the onyx log that much yet, but since peers need a consistent view of the log, i can only imagine that this is a solved problem)

lucasbradstreet 19:08:21

@lmergen we assume the log is the source of truth, and only flow one way

lmergen 19:08:15

yeah that makes a lot of sense.

lucasbradstreet 19:08:44

so if you submit a job, we wouldn’t go and update the db with a preliminary status, we would wait until things happen on the log before updating the db
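
Roughly, that one-way flow could look like the sketch below: tail the log with onyx.api/subscribe-to-log and only write to the database in reaction to entries that actually appear. This assumes subscribe-to-log takes the peer config and a core.async channel; the entry keys (:fn, :args, :id, :job) vary by Onyx version and store-status! is a made-up stand-in for the Datomic write:

```clojure
(require '[clojure.core.async :refer [chan <!!]]
         '[onyx.api])

;; Stand-in for transacting a status change into Datomic.
(defn store-status! [job-id status]
  (println job-id "->" status))

;; Tail the Onyx log and only update the database in response to entries
;; that show up on it -- never preemptively on submit.
(defn follow-log! [peer-config]
  (let [ch (chan 1000)]
    (onyx.api/subscribe-to-log peer-config ch)
    (loop []
      (when-let [entry (<!! ch)]
        (case (:fn entry)
          :submit-job (store-status! (get-in entry [:args :id]) :submitted)
          :kill-job   (store-status! (get-in entry [:args :job]) :killed)
          nil) ; ignore everything else
        (recur)))))
```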

lmergen 19:08:52

this makes a lot of sense, and shouldn't be too difficult to implement actually.

ghadi 21:08:41

I think I have a good usecase for investigating onyx for myself. I saw this blog post: http://formcept.com/blog/locality-sensitive-hashing-part-2-moving-from-spark-to-onyx/#sthash.zLbPj90c.dpbs

ghadi 21:08:00

And it's very similar to a design I'm working on for a US gov competition

michaeldrogalis 21:08:27

@ghadi Cool! Do you have any particular questions or want to bounce your design off us?

ghadi 21:08:24

Not yet, but I need to immerse myself in onyx / the space. Right now I'm duplicating a lot of what that post does, but on a single machine with core.async

ghadi 21:08:38

I'm not using locality sensitive hashing but I have to take 1M records, generate a list of candidate pairs, then compare them for matches

michaeldrogalis 21:08:03

Cool, that’s a good way to start. If you want to dabble with the abstractions, onyx-local-rt can be helpful. It’s a purely deterministic version of Onyx without any threads/networking.
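
For reference, the onyx-local-rt README works roughly like this: the job is pure data, segments are pushed in, drained, and summarised, all deterministically in-process. Sketch from memory, so check the README for the current API:

```clojure
(require '[onyx-local-rt.api :as api])

;; An ordinary function used as the :inc task.
(defn my-inc [segment]
  (update-in segment [:n] inc))

(def job
  {:workflow [[:in :inc] [:inc :out]]
   :catalog  [{:onyx/name :in
               :onyx/type :input
               :onyx/batch-size 20}
              {:onyx/name :inc
               :onyx/type :function
               :onyx/fn ::my-inc
               :onyx/batch-size 20}
              {:onyx/name :out
               :onyx/type :output
               :onyx/batch-size 20}]
   :lifecycles []})

;; Deterministic, in-process run: feed one segment in, drain, inspect.
(-> (api/init job)
    (api/new-segment :in {:n 41})
    (api/drain)
    (api/stop)
    (api/env-summary))
;; => the :out task's :outputs should contain {:n 42}
```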

ghadi 21:08:12

the middle part (generate candidate pairs) and the last part are where it might be nice to have a framework

ghadi 21:08:24

Thanks, I'll be looking into that

michaeldrogalis 21:08:39

Pretty neat - is it strictly a batch problem, or are you getting a stream of records?

ghadi 21:08:47

strictly batch

ghadi 21:08:06

1M records input --> make a list of matches (out)

ghadi 21:08:42

so 1M * 1M possible comparisons. If i have enough time i might just brute force

michaeldrogalis 21:08:54

Cool. You might actually want to look at Spark for doing the joining piece if it’s completely batch-oriented. Their architecture is better geared to working with a static dataset since they can make more assumptions.

ghadi 21:08:16

how mature of you to suggest that

michaeldrogalis 21:08:43

Ha, well I want you to win the competition after all 😛

ghadi 21:08:03

Thanks! So onyx may be inappropriate for this particular use case?

michaeldrogalis 21:08:50

It would work, but I think you can do better.

ghadi 21:08:58

the "path" of the computation i need is so close to what is in the formcept post, just not streaming.

lucasbradstreet 21:08:08

On the plus side, if you’re doing it with core async already, it may be easier to convert over to Onyx.

michaeldrogalis 21:08:16

The locality problem with batching is a little different to streaming.

lucasbradstreet 21:08:22

Even if the performance isn’t equivalent to what you could get out of Spark for this use case.

michaeldrogalis 21:08:40

That’s also true. If you don’t need absolutely crushing performance, it may matter less.

ghadi 21:08:41

the task is matching patient records / finding dups. The technical challenges are 1) avoiding exploring the n-squared space of pairs 2) accounting for erroneous input data 3) having some fast feedback while changing the comparison algorithm for a given pair of patients
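
One low-tech way to avoid exploring the n-squared pair space (the LSH approach in the post is a fancier version of the same idea) is plain blocking: only generate pairs for records that share a cheap blocking key. A toy sketch, with made-up record fields and a similarity predicate left abstract:

```clojure
(require '[clojure.string :as str])

;; Made-up blocking key: first letter of the last name plus date of birth.
;; Records that don't share a key are never compared.
(defn blocking-key [record]
  [(some-> (:last-name record) str/lower-case first)
   (:dob record)])

;; All unordered pairs within one block.
(defn all-pairs [xs]
  (let [v (vec xs)]
    (for [i (range (count v))
          j (range (inc i) (count v))]
      [(v i) (v j)])))

;; Candidate pairs only within blocks, instead of 1M x 1M comparisons.
(defn candidate-pairs [records]
  (mapcat (fn [[_ block]] (all-pairs block))
          (group-by blocking-key records)))

;; Brute force over the (much smaller) candidate set.
(defn matches [records similar?]
  (filter (fn [[a b]] (similar? a b))
          (candidate-pairs records)))
```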

michaeldrogalis 21:08:46

Once you’ve set up your flow with core.async for processing the data, do you expect any piece of that equation to need to change in response to the problem?

michaeldrogalis 21:08:07

e.g. could the problem be asked a bit differently, in a way that would require you to process the data differently?

ghadi 21:08:58

not really. I can cache the list of candidate pairs, and treat them like a worklist

ghadi 21:08:28

Also, if I find a novel way of getting more high quality candidate pairs, I can just extend & persist the set of candidate pairs
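
Persisting that worklist could be as simple as a set in an EDN file that new pair-generation strategies union into; purely a sketch, with arbitrary helper names and file path:

```clojure
(require '[clojure.edn :as edn]
         '[clojure.java.io :as io])

;; Load the persisted candidate-pair set, or start empty.
(defn load-pairs [path]
  (if (.exists (io/file path))
    (edn/read-string (slurp path))
    #{}))

;; Union newly generated candidate pairs into the persisted worklist.
(defn extend-pairs! [path new-pairs]
  (let [all (into (load-pairs path) new-pairs)]
    (spit path (pr-str all))
    all))
```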

ghadi 21:08:08

i haven't determined if i'm going to need to scale out to more than one machine because the pair matching algo isn't complete or measured

ghadi 21:08:25

the hardest part is generating quality pairs

michaeldrogalis 21:08:03

Ah. I’d avoid Onyx or Spark until you know you need to distribute. local-rt might be an optimal middle ground because you can use a proven abstraction, but avoid the distribution headaches until you need to make the investment.

lucasbradstreet 21:08:58

Can easily become “and then you have n problems” 😄

ghadi 21:08:22

sage advice. i hope i can win this competition with only clojure core libs and a few utility libraries for string similarity algos

michaeldrogalis 21:08:03

Any interesting prizes?

ghadi 21:08:00

only doing it b/c I used to work in clinical records wayyyyy back and have a bit of domain knowledge

ghadi 21:08:08

that was before I could code tho

lucasbradstreet 21:08:58

If you hit perf issues that don’t seem to be solvable with algorithms + performance optimisation, you can always run it on an 18-core EC2 spot instance and judiciously pmap first.
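
If it does come to pmap, chunking the worklist first usually pays off, since per-pair parallelism drowns in coordination overhead. A rough sketch, reusing the hypothetical candidate-pair and similarity names from above; the chunk size is arbitrary:

```clojure
;; Chunk the candidate pairs so each pmap task does a meaningful amount of
;; work instead of one comparison at a time.
(defn parallel-matches [candidate-pairs similar?]
  (->> (partition-all 10000 candidate-pairs)
       (pmap (fn [chunk] (filterv (fn [[a b]] (similar? a b)) chunk)))
       (apply concat)))
```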

michaeldrogalis 21:08:59

That’s pretty cool, sounds like a fun time.