2017-03-02
Channels
- # aws-lambda (1)
- # beginners (28)
- # boot (54)
- # cider (11)
- # clara (28)
- # cljs-dev (74)
- # cljsrn (13)
- # clojure (342)
- # clojure-austin (3)
- # clojure-dusseldorf (4)
- # clojure-france (2)
- # clojure-greece (11)
- # clojure-italy (42)
- # clojure-poland (7)
- # clojure-russia (11)
- # clojure-spec (44)
- # clojure-uk (156)
- # clojure-ukraine (4)
- # clojurescript (102)
- # cursive (17)
- # datascript (19)
- # datomic (17)
- # dirac (39)
- # emacs (22)
- # funcool (56)
- # hoplon (25)
- # jobs (3)
- # jobs-discuss (31)
- # leiningen (2)
- # luminus (4)
- # lumo (3)
- # off-topic (47)
- # om (51)
- # onyx (57)
- # re-frame (13)
- # reagent (57)
- # remote-jobs (15)
- # ring (9)
- # ring-swagger (7)
- # robots (2)
- # rum (6)
- # specter (16)
- # sql (7)
- # test-check (37)
- # untangled (7)
- # yada (5)
@lucasbradstreet I'm using onyx-datomic 0.10.0-beta5 log-reader ... the ABS "no backoff in log reader" seems to be crashing my topology ... is this something that could have a quick solution, or do I need to stay on 0.9.15? I have a bifurcated onyx app right now and I'd really like to move completely onto 0.10.0-x
@michaeldrogalis also if you have any input on this ^ ?
@hunter Sorry, could I get a little more context?
so I have several onyx topologies which use onyx.plugin.datomic/read-log to read my Datomic transaction log ... I have 0.9.15 in production and it works great ...
I am trying to move to 0.10.0-beta5 to use the newest onyx-kafka and some windowing features
but currently 0.10.0-beta5 onyx-datomic has this warning https://github.com/onyx-platform/onyx-datomic#abs-issues
specifically "checkpointing for log reader can't be global / savepoints" seems to be causing the checkpointing to lose track of its place in the tx-log immediately
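(For context, the kind of read-log catalog entry under discussion looks roughly like the sketch below. The key names follow the onyx-datomic README as I recall it, and the task name, URI, and checkpoint key are placeholder assumptions, not values from this conversation.)

```clojure
;; Rough sketch of an onyx.plugin.datomic/read-log catalog entry.
;; The Datomic URI, task name, and :checkpoint/key value are made up
;; for illustration; check the onyx-datomic README for your version.
{:onyx/name :read-tx-log
 :onyx/plugin :onyx.plugin.datomic/read-log
 :onyx/type :input
 :onyx/medium :datomic
 :datomic/uri "datomic:dev://localhost:4334/my-db" ; placeholder
 :checkpoint/key "read-tx-log"                     ; placeholder
 :checkpoint/force-reset? false
 :onyx/max-peers 1
 :onyx/batch-size 20
 :onyx/doc "Reads the Datomic transaction log"}
```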
The work to fix the latter of the issues should be done. I can’t patch it today.
I’ll see if we can get another beta of this out tomorrow.
@hunter No prob.
@hunter are you saying that you can’t use :checkpoint/key any more as it’s crashing the job?
oh, sorry, I mixed the two discussions up.
Thought they were both about onyx-datomic
Could you pastebin the exception that you’re seeing? I would think that the lack of backoff wouldn’t crash the job.
it's that the tx-log stops being tracked by the read-log plugin after a couple of segments
Ok thanks. I'll look into it. That's not a known issue
@lucasbradstreet thanks, I'll generate fresh logs in a little while and get them to you.
Loving the Async Barrier Snapshotting, just one question. When designing a job am I to think of $NPEERS + 1 additional peer for the ABS task?
you mean to cover the coordinator? If so, there’s no need. The coordinator is very lightweight and piggybacks on a regular peer (albeit in an extra thread)
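(A rough way to reason about it, then: the peer count a job needs is just the sum of what you allocate per task, with nothing extra for the coordinator. A minimal sketch, assuming every catalog entry sets :onyx/n-peers explicitly; required-peers is a made-up helper, not an Onyx API:)

```clojure
;; Minimal sketch: the peers a job needs is the sum over its tasks.
;; Assumes each catalog entry declares :onyx/n-peers explicitly; the
;; coordinator piggybacks on one of these peers, so no +1 is needed.
(defn required-peers [catalog]
  (reduce + (map :onyx/n-peers catalog)))

(required-peers
 [{:onyx/name :in        :onyx/n-peers 1}
  {:onyx/name :transform :onyx/n-peers 2}
  {:onyx/name :out       :onyx/n-peers 1}])
;; => 4
```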
Where are the checkpoint messages being generated from?
Checkpointed input #uuid "6ae670af-0e7b-1f0b-271d-d6ad8f1ed878" 6 29 :in 0 :input
Checkpointed output #uuid "6ae670af-0e7b-1f0b-271d-d6ad8f1ed878" 6 28 :out 0 :input
what do the numbers mean past the uuid?
cluster replica version for the allocation = 6 (this ensures that all the peers think they’re doing the same thing), barrier epoch = 28 (resets to 1 on a new job replica version, increases on each snapshot)
0 = the slot the peer is on. If you have 10 peers on that task there will be 10 slots. Slots are used to ensure consistent hashing in the group bys
the combination of the replica version and the epoch forms a vector clock of sorts. When you restore from a resume point and want to restore the latest checkpoint, you would find the checkpoint with the largest replica-version, and find the checkpoint with the largest epoch for that replica-version.
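(A minimal sketch of that selection rule, treating checkpoints as plain maps of :replica-version and :epoch; the map shape here is an assumption for illustration, not the actual checkpoint storage format:)

```clojure
;; Latest-checkpoint selection: highest replica-version wins, with ties
;; broken by highest epoch. Checkpoints are modelled as plain maps here.
(defn latest-checkpoint [checkpoints]
  (last (sort-by (juxt :replica-version :epoch) checkpoints)))

(latest-checkpoint
 [{:replica-version 5 :epoch 40}
  {:replica-version 6 :epoch 28}
  {:replica-version 6 :epoch 29}])
;; => {:replica-version 6, :epoch 29}
```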
@hunter I think I’ve narrowed down the issue. what batch size are you using?
Also thinking, with large volumes within Docker, is it worth mapping /dev/shm to a point on the instance outside of the container?
If I understand you correctly, you mean some shared memory space that could be shared by the containers on that node?
I believe we're doing something like that, with the media driver being in its own container
If you're using SSDs and you don't have much memory, you can also choose to put your log buffers on disk, though there will be a performance hit
@lucasbradstreet batch-size 1
In prod we’re running separate containers for the media driver and peer and mapping a memory volume to /dev/shm
On both containers
@hunter OK, I’m pretty certain I know what’s going on. Previously we would try to read the log without setting an end-tx, but now we’re setting the log to read from (last-tx, last-tx+batch-size). Unfortunately datomic doesn’t always increase the tx we can read by 1 (I’ve always wondered what was going on there), so now it is trying to read from (tx, tx+1) but there is never a tx+1 and it doesn’t advance.
@hunter can you try increasing the batch-size to 20 and tell me if it starts working?
Oh, I think that is because transaction-ids are like any other eid
Damn. That makes it hard
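(To illustrate why batch-size 1 gets stuck: Datomic's d/tx-range end is exclusive, and successive transaction t values aren't guaranteed to be contiguous, so a (tx, tx+1) window can be empty even though later transactions exist. A hedged sketch, with the connection URI and checkpointed t value as placeholders:)

```clojure
;; Why a one-transaction window can come back empty: tx-range's end is
;; exclusive and t values can skip numbers. URI and last-t are made up.
(require '[datomic.api :as d])

(def conn (d/connect "datomic:dev://localhost:4334/my-db")) ; placeholder

(let [log    (d/log conn)
      last-t 1000 ; pretend this is the checkpointed t
      batch  1]
  ;; With batch 1 this asks for [1001, 1002); if the next transaction
  ;; landed at t = 1003, this is nil and the reader never advances.
  (seq (d/tx-range log (inc last-t) (+ (inc last-t) batch))))
```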
@jasonbell I don’t think it’s written down anywhere but it’s quite simple. Build lib-onyx and then create a container that uses this namespace as the jar entrypoint.
https://github.com/onyx-platform/lib-onyx/blob/master/src/lib_onyx/media_driver.clj
Then mount /dev/shm on both containers.
I’m not sure how that’s done in Mesos but I can help if you’re running on Kubernetes
@hunter actually, depending on how many entities you’re transacting in each tx, you may need an even bigger number
@gardnervickers thanks for the offer, not sure which way I'm going to turn yet. It depends on a few factors.
@jasonbell please let me know what you end up deciding with shm (especially how big it ended up being). It’s good feedback.
@lucasbradstreet will do, to be honest by moving up to 0.10 we removed the need for an external heartbeat server with Yada wrapped in Component, so we've freed up overhead there. I'll have a better idea tomorrow. Oh, and I put tighter control on the peer/partition so the throughput is more controlled. Once I get some data I'll let you know.
Great. There’s no rush 🙂