#xtdb
2022-12-15
Hukka07:12:05

Bah. Startup time until the node is synced has increased 50% in a couple of weeks as data amounts have increased. I guess I have to start looking at checkpointing

tatut07:12:30

ephemeral nodes?

tatut07:12:54

we are doing the same, checkpointing is a must
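
(For anyone following along, a minimal checkpointing sketch for the RocksDB index store, assuming the built-in filesystem checkpoint store; the paths and frequency below are placeholders, and in the cloud you would swap the :store for the S3/GCS checkpoint-store module.)

```clojure
;; Sketch only: RocksDB index store with periodic checkpoints.
;; Paths and frequency are placeholders; in production, swap the :store module
;; for the checkpoint store from the relevant cloud xtdb module.
{:xtdb/index-store
 {:kv-store {:xtdb/module 'xtdb.rocksdb/->kv-store
             :db-dir "data/index"
             :checkpointer {:xtdb/module 'xtdb.checkpoint/->checkpointer
                            :store {:xtdb/module 'xtdb.checkpoint/->filesystem-checkpoint-store
                                    :path "/var/lib/xtdb/checkpoints"}
                            ;; ISO-8601 duration: roughly how often to write a checkpoint
                            :approx-frequency "PT6H"}}}}
```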

Hukka07:12:28

What kind of startup times are you seeing with checkpoints, with which checkpoint store and number of documents?

tatut07:12:30

in AWS in the long run you need to configure the S3 client because the checkpoint is so huge it times out downloading it 🙈
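
(A hedged sketch of what that client tuning can look like with the AWS SDK v2; how the client gets handed to the xtdb-s3 module, i.e. its configurator hook, is an assumption here, so check the module docs for the exact wiring.)

```clojure
;; Sketch only: build an S3 async client with a longer read timeout so a
;; multi-GB checkpoint download doesn't get cut off. The SDK calls are standard
;; AWS SDK v2; plugging the client into xtdb-s3 is left to its configurator.
(import '(software.amazon.awssdk.services.s3 S3AsyncClient)
        '(software.amazon.awssdk.http.nio.netty NettyNioAsyncHttpClient)
        '(java.time Duration))

(defn patient-s3-client []
  (-> (S3AsyncClient/builder)
      (.httpClientBuilder (-> (NettyNioAsyncHttpClient/builder)
                              (.readTimeout (Duration/ofMinutes 10))))
      (.build)))
```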

Hukka07:12:49

Hm. We are on GCP, let's see how it goes

tatut07:12:51

I don’t have access to prod, but it should be just the time it takes to download the latest checkpoint and replay anything after that… so it depends on your checkpointing frequency

Hukka07:12:37

I wonder if the local, in-process xtdb node is tenable after all, or do we have to switch to three tiers: golden stores / xtdb node / application code, where only the last is updated often. Have to say that I've been surprised how slow the startup is, considering the somewhat low number of documents. In production we should have thousands, if not tens of thousands of times more data than what we are testing with now.

tatut07:12:35

you should get quite far with checkpoints, but I have been thinking that ephemeral nodes are not the best fit

tatut07:12:08

something like a permanent EC2 machine where you upgrade by just deploying a new uberjar for the application code would likely be better and incur no startup cost

tatut07:12:03

you can still have checkpoints on top of that to support scaling up without starting from empty

Hukka08:12:14

Yeah, sure. Just wouldn't want to maintain even the VM OS, if possible. With the more common two-tier SQL / application code setup it would be pretty simple

tatut08:12:02

I guess checkpoints still work, startup grace periods just need to be a little longer

tatut08:12:24

unless you have hundreds of gigabytes of data

Hukka08:12:53

I guess we should test the amount of data in bytes too, but given how slow the transmission from the SQL to the node is (100 kB/s peaks), I'm guessing that it's more about the number of documents rather than how big they are

Hukka08:12:22

Though perhaps if the data is split into lots and lots of small key-values, all indexed, it might not matter

refset10:12:51

Hey, @U8ZQ1J1RR in relation to "given how slow the transmission from the SQL to the node is" ...are you using 1.22.1? Or something older? If you can create and send a flamegraph (e.g. using YourKit or https://github.com/clojure-goes-fast/clj-async-profiler) of what the node is doing during that replay, it would help analyse what the real bottleneck is
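
(In case it helps, a minimal clj-async-profiler sketch for capturing the replay; it assumes the com.clojure-goes-fast/clj-async-profiler dependency is on the classpath and the JVM was started with -Djdk.attach.allowAttachSelf.)

```clojure
;; Sketch: sample the JVM while the node replays the tx log, then render a
;; flamegraph. Output lands under /tmp/clj-async-profiler/ by default.
(require '[clj-async-profiler.core :as prof])

(prof/start)        ; begin CPU sampling for the whole JVM
;; ... start the node / wait for it to catch up with the tx log ...
(prof/stop)         ; stop sampling and write the flamegraph file
```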

Hukka11:12:51

.0. I remember reading that .1 is some 40% faster in some bulk ops, but I thought that it doesn't change the scaling. That is, a thousand times more documents would still take a thousand times as long, so even doubling the speed won't help here. Upgrading to .1 is one thing I listed to try, in any case

Hukka11:12:48

And sure, I can make a flamegraph. We use those for our own code too

🙏 1
refset16:12:30

.0 should be the ~current performance (ignoring the RC we're about to put out based on the new master), .1 was just a bug-fix release really so I wouldn't expect any performance difference

Hukka07:12:12

Ah, true, we are still at 1.20.0, not 1.21.1. Shouldn't have tried to be clever and save 4 chars

tatut08:12:22

btw, what have you set rocksdb block cache size to? and other memory configs

Hukka08:12:03

I haven't touched anything specific to rocksdb, sounds like I should?

Hukka08:12:43

> cache-size (int): Size of the cache in bytes - default size is 8Mb, although it is recommended (https://github.com/facebook/rocksdb/wiki/Setup-Options-and-Basic-Tuning#block-cache-size) this is set to a higher amount.
Aha, gotta try that
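
(A hedged sketch of wiring that up in XTDB 1.x node config, with one shared LRU block cache referenced by the kv-store; the 512 MB number is only an example.)

```clojure
;; Sketch: define a shared RocksDB LRU block cache and point the index store's
;; kv-store at it. Module names follow the xtdb-rocksdb docs; size is an example.
{:xtdb.rocksdb/block-cache {:xtdb/module 'xtdb.rocksdb/->lru-block-cache
                            :cache-size (* 512 1024 1024)} ; bytes
 :xtdb/index-store {:kv-store {:xtdb/module 'xtdb.rocksdb/->kv-store
                               :db-dir "data/index"
                               :block-cache :xtdb.rocksdb/block-cache}}}
```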

Hukka09:12:18

Locally, raising it from the default to 1 GB didn't do anything to the startup time. The local node is further away from the Postgres, but still only about 25% slower than running in the same data center. Have you seen changes with the cache size? Or any other setting?

Hukka09:12:29

Hm, looking at the config options again, I think I didn't write them properly. It seems it won't complain about config being out of spec, it just ignores it

Hukka10:12:20

@U899JBRPF That flamegraph btw is really tall. Perhaps best to pass you the whole html file instead of just a screenshot?

Hukka10:12:12

Also a significant part is just under the jvm.so. I suppose debug symbols might help with that. Never had to go that deep with flamegraphs myself

Hukka10:12:17

Well, took the liberty of sending you that file already. Debug symbols indeed helped. On a high level, a third of the time is spent in index_tx_events, a significant part in InFlightTx.commit, for some reason a small but still quite visible part in the nrepl main loop, and what looks to my untrained eye like quite a lot of time in JIT compilation and garbage collection. Especially the JIT part seems perplexing when we are talking about a runtime of four minutes; it shouldn't be tens of percent. Perhaps I'm just interpreting it wrong (CompileBroker::compiler_thread_loop())

🙏 1
tatut10:12:20

we have 512mb of cache-size, and I’m thinking I should increase it still, the default is way too low

tatut10:12:30

but yes, it probably won’t have that big an effect on tx log replay, only queries

refset10:12:37

@U8ZQ1J1RR thanks for sending. I can already see that the upgrade to 1.22.1 or the https://repo1.maven.org/maven2/com/xtdb/xtdb-core/1.22.2-rc1/ will have some improvement, based on where some of the time is spent in your profile. I agree the JIT part is interesting, I'll review internally and get back to you

refset10:12:07

> we have 512mb of cache-size, and I’m thinking I should increase it still, the default is way too low
Agreed the default is conservative for many realistic workloads - we are looking to improve memory & cache handling in the future so that the defaults can be safely raised/lifted... but it's a complex problem due to the number of caches and the way memory is allocated both initially and incrementally. @U0GE2S1NH has been looking into what can be done here fairly recently

Hukka11:12:38

I took a shot at simplifying the graph

wotbrew11:12:55

@U8ZQ1J1RR I would be interested to see another graph after a bump to 1.22.2-rc1, would that be possible?

Hukka11:12:24

Sure! Not sure if I can manage to make a reasonable diff graph, but at least a standalone graph

🙏 1
wotbrew11:12:44

standalone is fine! Diff will be quite noisy as a lot changed between 1.21 and 1.22

Hukka13:12:55

Sorry, had a meeting. The startup time fell to about 40% of what it was. Much better, but still almost two minutes. The flamegraph is very different: Clojure protocols, into, reduces, transducers etc. are not really visible any more, and RocksDB itself becomes visible

Hukka13:12:24

I also sent @U0GE2S1NH the html with the flamegraph for interactive viewing

Hukka13:12:29

Heh, commits from yesterday seem to say pretty much what I just noticed (avoid protocol dispatch, avoid boxing etc)

🙂 1
Hukka18:12:32

So, after all the tuning tips, the startup time went from four minutes to one. The biggest change by far was trying 1.22.2-rc1. After that, increasing the block cache size to 64 MB helped too, but going bigger had no further impact. Using the new :enable-filters? option to enable RocksDB bloom filters shaved off another 14%. Increasing the initial heap size helped as well, though I still need to test that more thoroughly (I just gave Java loads more max and initial heap, and stack memory)
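
(For reference, the tuning described above as one sketch; :enable-filters? is the option named here, and the cache size and JVM flags are just the values from this experiment or outright placeholders.)

```clojure
;; Sketch: the index-store tuning described above - bloom filters on and a
;; 64 MB block cache. Sizes are from this experiment, not recommendations.
{:xtdb/index-store
 {:kv-store {:xtdb/module 'xtdb.rocksdb/->kv-store
             :db-dir "data/index"
             :enable-filters? true
             :block-cache {:xtdb/module 'xtdb.rocksdb/->lru-block-cache
                           :cache-size (* 64 1024 1024)}}}}
;; JVM side (placeholders): -Xms4g -Xmx4g -Xss2m
```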

Hukka18:12:16

Still, it is a minute. So while a 4× speedup is great, it won't help when the data is a hundred times larger. So checkpoints are still needed, or some other way of getting the RocksDB indices onto the starting nodes.

tatut19:12:21

yes, I wouldn’t expect to be able to run ephemeral nodes without checkpoints

tatut19:12:28

at least not in a real production system

Hukka19:12:18

13s startup locally with GCS checkpoints. Should be faster in the same data center. Has to be, or moving to graal is still not going to happen 😉

tatut19:12:08

with checkpoints it is the download time that matters, so yes it should be fast within the same datacenter