xtdb 2021-06-21 | Slack Archive

reborg08:06:00

Happy Monday #crux, I was wondering how to run the benchmarks from https://github.com/juxt/crux/tree/master/crux-bench locally. Running bin/run-bench.sh assumes some Kafka setup (which I have running), but I’m still getting a “Timed out waiting for a node assignment.” message. The reason I’d like to run them is that I’m experimenting with some changes to Clojure libraries and I wanted to see how they impact real-world projects. So not urgent, but Crux was a good use case. Thanks!

refset08:06:36

Hey @reborg - thank you & Happy Monday likewise ☺️ Is your Kafka port open on 9092? I'm looking at https://github.com/juxt/crux/blob/master/crux-bench/cloudformation.yaml#L181-L196

reborg09:06:52

Hey Jeremy, hope all well. Good hint, as I can telnet zookeeper but apparently not kafka on 9092, so trying to fix that and I’ll get back. Running kafka/zk dockerized with https://github.com/wurstmeister/kafka-docker/blob/master/README.md

🙂 3

reborg08:06:41

Hey @refset I’ve been able to get past the Kafka connection problem. It is now asking for AWS credentials. Do the benchmarks require an AWS account? I was hoping they would skip the step https://github.com/juxt/crux/blob/master/crux-bench/README.md#setting-up-aws-credentials But if necessary, what sort of tasks is the benchmark performing on AWS? That is to prepare for potential costs.

refset09:06:53

Cool! So you shouldn't need AWS. Are you happy running the benchmarks on your own hardware?

refset09:06:12

> It is now asking for AWS credentials.

What is the prompt / log message here?

reborg10:06:23

Right thanks for pinging. So after a few attempts with lein run -m crux.bench.main I tried ./bin/run-bench.sh which I guess is AWS dependent. So back to using lein run. I can see the following:

λ: nc -z 192.168.99.100 9092
Connection to 192.168.99.100 port 9092 [tcp] succeeded!
λ: echo dump | nc 192.168.99.100 2181 | grep brokers
        /brokers/ids/1001
λ: lein run -m crux.bench.main
Would post to Slack:
 *Starting Benchmark*, Commit Hash: null

Syntax error (TimeoutException) compiling at (/private/var/folders/km/lcsz0x0j4kg2_4h36m7bvjcm0000gn/T/form-init9219056612490469057.clj:1:125).
Timed out waiting for a node assignment.

Full report at:
/var/folders/km/lcsz0x0j4kg2_4h36m7bvjcm0000gn/T/clojure-6052950630285481569.edn

Which unfortunately puts me back at the (likely) Kafka connectivity problem. What I did solve is connectivity from the command line: Kafka/ZK can now be reached from there, but apparently not from lein run.

reborg10:06:06

just noticed several localhost:9092 hardcoded, so going to change those to 192.168.99.100:9092 and see what happens
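
Instead of editing each hardcoded localhost:9092, the broker address could be resolved in one place and passed to wherever a Kafka config is built. A minimal sketch, assuming a hypothetical KAFKA_BOOTSTRAP_SERVERS environment variable and a hypothetical helper name; neither exists in crux-bench:

```clojure
(require '[clojure.string :as str])

(def default-bootstrap-servers
  "Fallback matching the address hardcoded in the benchmark code."
  "localhost:9092")

(defn bootstrap-servers
  "Returns the broker address found in `env` (a map of environment
  variables), or the default when the variable is unset or blank.
  KAFKA_BOOTSTRAP_SERVERS is an assumed name, not a crux-bench setting."
  [env]
  (let [v (get env "KAFKA_BOOTSTRAP_SERVERS")]
    (if (and v (not (str/blank? v)))
      v
      default-bootstrap-servers)))

;; With a dockerized broker on a VM, as in this thread:
(bootstrap-servers {"KAFKA_BOOTSTRAP_SERVERS" "192.168.99.100:9092"})
```

At the REPL, (bootstrap-servers (System/getenv)) reads the real environment, since get also works on the java.util.Map that System/getenv returns.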

🤞 3

reborg10:06:49

Progress! Now I’m getting a bunch of CloudWatchException: The security token included in the request is invalid. but waiting to see if I can get to some relevant output

refset10:06:29

Is there a benchmark in particular that you're hoping to run? Or are you aiming for the full suite?

refset10:06:13

Whenever I run benchmarks locally I just do it via the REPL, and sidestep a lot of the orchestration code in crux-bench

reborg10:06:50

I see… I guess the most interesting tests for me would be around crux-core query engine. So I can see just the following after many pages of CloudWatch exceptions:

{"av-count":14005000,"time-taken-ms":409141,"bench-ns":"ts-devices","crux-commit":null,"bench-type":"ingest","bytes-indexed":2986412207,"doc-count":1001000,"crux-node-type":"kafka-rocksdb","success?":true}
{"bytes-on-disk":388133899,"compacted-bytes-on-disk":309455640,"time-taken-ms":19729,"crux-node-type":"kafka-rocksdb","bench-ns":"ts-devices","crux-commit":null,"bench-type":"compaction"}
{"success?":true,"time-taken-ms":426,"crux-node-type":"kafka-rocksdb","bench-ns":"ts-devices","crux-commit":null,"bench-type":"recent-battery-readings"}
{"success?":true,"time-taken-ms":186,"crux-node-type":"kafka-rocksdb","bench-ns":"ts-devices","crux-commit":null,"bench-type":"busiest-devices"}
{"success?":true,"time-taken-ms":35193,"crux-node-type":"kafka-rocksdb","bench-ns":"ts-devices","crux-commit":null,"bench-type":"min-max-battery-level-per-hour"}

Would that be a full output?

reborg10:06:17

I’m wondering if the other benchmarks are unable to display because of the thrown exception.

reborg10:06:58

Commented out https://github.com/juxt/crux/blob/master/crux-bench/src/crux/bench.clj#L211 and retrying

reborg10:06:01

but nope it still wants to talk with CW

refset11:06:55

I think you would need to comment out the cw/reporter config lines in the various start-node config maps

refset11:06:22

This is an example of how I would run each ns one-by-one (i.e. just ignore everything in the main bench.clj ns) https://github.com/juxt/crux/blob/11fd82577223ac35c9666b74cee8aca2d39a9262/crux-bench/src/crux/bench/tpch_stress_test.clj#L57

reborg11:06:48

thanks, where is that user/node coming from?

reborg11:06:12

In the meantime, I ran the bench with the reporter commented out, and without exceptions! However, it sounds like tests like the one you're pointing at above are not part of the suite. Good to know, I’ll probably scout the namespaces to find what would be good to run.

refset11:06:48

ah, that user/node is a legacy reference, you would now use dev/crux-node https://github.com/juxt/crux/blob/11fd82577223ac35c9666b74cee8aca2d39a9262/dev/dev.clj#L93 which you get running when starting the repl and doing (dev) then (go)

refset11:06:37

> it sounds like tests like the one you're pointing at above are not part of the suite

This is a good point...we don't run all of the benchmarks in the nightly runs, since the ones we do run normally give such excellent coverage that more data would just be more noise 🙂

refset11:06:40

As a bit of an aside, I also found these generative tests very helpful when spiking a crux-redis KV module: https://github.com/juxt/crux/blob/6d602bb5b6caed199f10fd8c3711cb034d49248a/crux-test/test/crux/kv_test.clj#L235-L342 ...but they don't live in crux-bench 🙂 I can't think of other generative tests like this though!

refset11:06:56

what changes are you trying to make?

reborg12:06:37

I’m working on a replacement for clojure.set and evaluating its impact on real-world projects that depend on set operations. Crux seems to depend on it for querying (no idea to what degree, compile-time or runtime, etc). My thinking is that perhaps I’m lucky and I can run before/after benchmarks to show some improvement. This worked for Datascript (for instance) and I’ll be presenting the results at the next Clojurians meetup.

refset12:06:56

oh wow, that sounds awesome 🙂

➕ 3

refset12:06:33

Crux definitely uses clojure.set operations during query compilation, for aggregates, and for generally returning the results (as per https://github.com/juxt/crux/blob/master/crux-core/src/crux/query.clj) - but I can't see that it's on the "hot path" for the runtime query execution /cc @jarohen

jarohen13:06:31

yeah, I don't think it's on the hot path, but it'd no doubt be an improvement at compile-time regardless - it's not unlikely that compile-time dominates for cold runs of low-latency queries. nice one @reborg 👏

👍 3

reborg13:06:59

Thanks @jarohen for having a look. I see other sub-modules depending on clojure.set. Do you think there could be other interesting hot paths to be aware of? I’m sure that if you’re doing serious perf work you are not going to forget a set/* call in your path :)

jarohen13:06:33

anxiously checks hot paths 😅 (edit: looks good)

reborg13:06:41

anyway, will see shortly if the change has an impact (assuming crux.bench.tpch-stress-test is a good measure for that)

jarohen13:06:02

sounds good - cheers 🙂

jarohen13:06:59

we do have other benchmarks which are more tailored to ingest, but I can't find any references to clojure.set in ingest

reborg14:06:29

No problem, thanks for helping. How do I read the results from the benchmark? I got a query time out but the rest seems to be fine, with {"success?":true,"av-count":6258483,"bytes-indexed":1262752421,"doc-count":432844,"time-taken-ms":259212,"bench-ns":"tpch-stress","bench-type":"ingest"} as a result

jarohen14:06:12

:success? true is checking against the published TPC-H results, guessing you've spotted time-taken-ms

jarohen14:06:58

:bench-type :ingest - we split the benchmarks out into :ingest and :queries - there should be another entry for the latter

jarohen14:06:15

:bench-ns is a wider category - that's for the different benchmarks. e.g. ts-devices is a different benchmark

jarohen14:06:30

bench-type is like a sub-category of bench-ns

jarohen14:06:02

the counts are more for space benchmarks - we added those when we were going through a period of bashing at the disk space usage

jarohen14:06:18

also good for smoke tests, can see if a change has resulted in less data (especially if it shouldn't have!)
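
A smoke test along these lines could compare the count fields of two ingest results and report any that differ. This is only a sketch: the helper name is made up, and the "candidate" value below is invented to show a mismatch; only the baseline numbers come from the run quoted above.

```clojure
(def count-keys
  "Result fields that should stay stable when a change only affects speed."
  [:doc-count :av-count])

(defn count-diffs
  "Returns a map of count key -> [baseline candidate] for every key whose
  value differs between the two result maps. Empty when the counts match."
  [baseline candidate]
  (into {}
        (for [k count-keys
              :let [b (get baseline k)
                    c (get candidate k)]
              :when (not= b c)]
          [k [b c]])))

;; Baseline taken from the tpch-stress ingest result in this thread.
(def ingest-result
  {:success? true :av-count 6258483 :doc-count 432844
   :bench-ns :tpch-stress :bench-type :ingest})

;; Hypothetical candidate run that lost some documents.
(count-diffs ingest-result (assoc ingest-result :doc-count 432000))
```

An empty map means the change did not alter how much data was indexed, which is what you would want from a pure performance tweak.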

reborg14:06:11

ok thanks for the breakdown. Yeah I wasn’t sure whether to expect a single elapsed time for each query. There is a :query-stress entry but it timed out

[{:success? true, :av-count 6258483, :bytes-indexed 1262752421, :doc-count 432844, :time-taken-ms 259212, :bench-ns :tpch-stress, :bench-type :ingest} {:error "java.util.concurrent.ExecutionException: java.util.concurrent.TimeoutException: Query timed out.", :time-taken-ms 1006201, :bench-ns :tpch-stress, :bench-type :query-stress}]
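
Since the results are plain EDN, they can be read back and summarized at the REPL. A rough sketch, assuming the vector above has been captured as a string; the summarize helper and its output shape are illustrative, not part of crux-bench:

```clojure
(require '[clojure.edn :as edn])

;; The result vector from this run, pasted as a string.
(def results-edn
  "[{:success? true, :av-count 6258483, :bytes-indexed 1262752421, :doc-count 432844, :time-taken-ms 259212, :bench-ns :tpch-stress, :bench-type :ingest} {:error \"java.util.concurrent.ExecutionException: java.util.concurrent.TimeoutException: Query timed out.\", :time-taken-ms 1006201, :bench-ns :tpch-stress, :bench-type :query-stress}]")

(defn summarize
  "One line per result: which benchmark it was, how long it took in
  seconds, and whether it carried an :error entry."
  [results]
  (for [{:keys [bench-ns bench-type time-taken-ms error]} results]
    {:bench   [bench-ns bench-type]
     :seconds (/ time-taken-ms 1000.0)
     :status  (if error :failed :ok)}))

(summarize (edn/read-string results-edn))
```

Here the ingest entry comes out as :ok and the :query-stress entry as :failed, because only the latter has an :error key.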

jarohen14:06:49

you might find it easier to base yourself on the crux.bench.tpch namespace - there's a run-tpch function in there that takes a node and scale-factor

jarohen14:06:22

for mid-sized (i.e. more than microbenches but not long runs) we use SF0.01 - that's the one that TPC-H provide expected results for, too

jarohen14:06:54

if you don't want any of the bench harness, you pretty much only need tpch/load-docs! (only have to do that once for a node if you're benching query times) and run-tpch-queries

reborg14:06:31

ok, let me try that

🤞 3

reborg15:06:52

This is what I get, am I doing it correctly?

(require '[crux.bench.tpch :as tp])
(require '[crux.fixtures.tpch :as tpch])
(let [node (dev/crux-node) scale-factor 0.01]
   (tpch/load-docs! node scale-factor tpch/tpch-entity->pkey-doc)
   (tp/run-tpch-queries node {:scale-factor scale-factor}))
Transacting TPC-H tables...
Transacted 1500 customer
Transacted 15000 orders
Transacted 60175 lineitem
Transacted 2000 part
Transacted 8000 partsupp
Transacted 100 supplier
Transacted 25 nation
Transacted 5 region

FAIL in () (tpch.clj:678)
expected: (<= diff epsilon)
  actual: (not (<= 22500.0 0.01))

FAIL in () (tpch.clj:678)
expected: (<= diff epsilon)
  actual: (not (<= 417.0 0.01))

FAIL in () (tpch.clj:678)
expected: (<= diff epsilon)
  actual: (not (<= 3090671.039999999 0.01))
false
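
For context on the FAIL lines: the assertion shape (<= diff epsilon) is a tolerance check, and the output shows an epsilon of 0.01. A sketch of that kind of comparison follows; the function name and the way diff is computed are assumptions, not the actual code at tpch.clj:678.

```clojure
(defn within-epsilon?
  "True when `actual` is within `epsilon` of `expected`.
  Illustrative only; the real check in crux.bench.tpch may differ."
  [expected actual epsilon]
  (<= (Math/abs (double (- expected actual))) epsilon))

;; A result that is off by a rounding error passes...
(within-epsilon? 100.0 100.005 0.01)

;; ...while a difference of 22500.0, like the first FAIL above, does not.
(within-epsilon? 100.0 22600.0 0.01)
```

Differences as large as those reported point to genuinely different query results rather than floating-point noise.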

jarohen15:06:32

yep, although I wouldn't've expected to see those FAILs

jarohen15:06:35

🤔

jarohen15:06:03

just to check - you've only loaded the docs once on that node?

reborg15:06:11

think so, that’s straight after opening up a repl and can see only one set of “Transacting” messages

reborg15:06:50

going to remove validation just to see if I can get the numbers

👍 2

reborg16:06:58

I’m going to take it as a positive :)

;; before "Elapsed time: 339841.871333 msecs"
;; after  "Elapsed time: 291307.660398 msecs"
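
To put those two timings in relative terms, here is a small sketch (the helper name is made up for illustration):

```clojure
(defn percent-faster
  "Relative reduction in elapsed time, as a percentage of `before-ms`."
  [before-ms after-ms]
  (* 100.0 (/ (- before-ms after-ms) before-ms)))

;; Using the two elapsed times above: roughly a 14.3% reduction.
(percent-faster 339841.871333 291307.660398)
```

This is a single before/after pair, so it says nothing about run-to-run variance.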

jarohen16:06:48

niice 👏 🙇‍♂️

➕ 2

reborg16:06:51

perhaps not that big, but the lib is a drop-in replacement, just a require away, so perhaps it makes some sense for people dealing with sets. Thanks for your support today, much appreciated!

jarohen16:06:58

you're welcome, and thank you 🙂

jarohen16:06:21

will it be becoming available on Maven soon? if so, happy to include it in our CI and overnight bench runs

reborg16:06:41

It’s on Clojars if that’s good https://github.com/droitfintech/fset

👀 2

refset18:06:19

nice work! Just to confirm though, are those before & after runs just for end-to-end query times? And are they completely distinct (e.g. full node shutdown and restart)? Or is it possible that there may be some effect from warm caches?

reborg18:06:53

Here’s the repro case. Open repo top level and:

(dev)
(go)
(require '[crux.bench.tpch :as tp])
(require '[crux.fixtures.tpch :as tpch])
(let [node (dev/crux-node)
      scale-factor 0.01]
  (tpch/load-docs! node scale-factor tpch/tpch-entity->pkey-doc)
  (time (tp/run-tpch-queries node {:scale-factor scale-factor}))
  (time (tp/run-tpch-queries node {:scale-factor scale-factor})))

Taking the time twice just in case warming up makes a difference, but didn’t see any. Then kill repl, replace all require [clojure.set :as set] with require [tech.droit.fset :as set] and try again.
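
If more than two timings are wanted, the repeated (time ...) calls could be wrapped in a small helper that collects the numbers instead of printing them. A sketch with a stand-in workload, since the helper is not part of Crux:

```clojure
(defn timed-runs
  "Calls the zero-argument function `f` `n` times and returns a vector of
  elapsed milliseconds, one entry per call, in call order."
  [n f]
  (vec (repeatedly n
                   (fn []
                     (let [start (System/nanoTime)]
                       (f)
                       (/ (- (System/nanoTime) start) 1e6))))))

;; Stand-in workload; at the Crux REPL the function would instead wrap
;; the tp/run-tpch-queries call from the repro above.
(timed-runs 3 #(reduce + (range 100000)))
```

Comparing the first entry against the later ones gives a quick read on whether warm-up matters for a given workload.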

refset19:06:03

great, that looks correct to me 👍 🙂

bocaj19:06:49

Would the content-hash replicate, similar to https://github.com/replikativ/hasch ?

bocaj19:06:00

I have, for example, a file with 300,000 records. I’ll process it for any new records, deletes, or updates. It would be useful to use hashing extracted from crux, or to use crux itself. Ideally my coworker gets the same hash on their system.

jarohen19:06:09

It's an internal function, but you could try re-using the hashing from crux.codec/new-id?

➕ 3

refset19:06:14

Crux currently uses 20-byte SHA-1 hashes, which are regarded as purely internal to the Crux system, and so they aren't being depended on for any security properties (unlike the reasoning behind hasch's 64-byte SHA-512 hashes). The various SHA-1 implementations should definitely be system-independent though, see here for details https://github.com/juxt/crux/blob/master/crux-core/src/crux/hash.clj
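
As a plain illustration of SHA-1 being deterministic and 20 bytes long, here is a sketch using the JDK's MessageDigest. It hashes a UTF-8 string directly; how Crux serializes a document before hashing it is not shown here, and the helper names are made up.

```clojure
(import 'java.security.MessageDigest)

(defn sha1-bytes
  "SHA-1 digest of the UTF-8 bytes of `s`, as a 20-byte array."
  ^bytes [^String s]
  (.digest (MessageDigest/getInstance "SHA-1") (.getBytes s "UTF-8")))

(defn hex
  "Lower-case hex string for a byte array."
  [^bytes bs]
  (apply str (map #(format "%02x" (bit-and % 0xff)) bs)))

;; The standard SHA-1 test vector for "abc":
(hex (sha1-bytes "abc"))
;; => "a9993e364706816aba3e25717850c26c9cd0d89d"
```

The same input bytes give the same digest on any machine, so cross-system differences would have to come from how a record is turned into bytes, not from SHA-1 itself.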

bocaj20:06:42

Great, thanks for the info. In theory, the SHA-1 would be consistent across machines/env?

refset20:06:12

Np! And I believe so, yes...it would be a pretty major bug for us if that wasn't the case 😅

😀 3

👍 2
