#xtdb
2022-10-31
tatut 09:10:25

any ideas what could cause an “incomplete checkpoint restore” startup problem (RocksDB)?

tatut 09:10:08

the exception info has an :expected list with 105 items, but :actual has only 62… could the node that made the checkpoint have been stopped while it was writing it?

jarohen 09:10:11

if there are any issues uploading the checkpoint, the node shouldn't upload the metadata-*.edn file at the end - nodes then check for this file when choosing a checkpoint to restore from

jarohen 09:10:35

is this file present in your checkpoint store?

jarohen 10:10:09

looking at the error message, it seems the problem is on the download side - :expected vs :actual here is the number of checkpoint files the new node sees on S3 vs the number it was able to download successfully

tatut 10:10:13

yes, the metadata file was present

tatut 10:10:04

the throw is in xtdb.s3.checkpoint/download-checkpoint line 51 (on version 1.21.0)

tatut 10:10:10

so this seems to be the cause - it can’t download the files:

{"context": "default",
"exception": "software.amazon.awssdk.core.exception.SdkClientException: Unable to execute HTTP request: Acquire operation took longer than the configured maximum time. This indicates that a request cannot get a connection from the pool within the specified maximum time. This can be due to high request rate.\nConsider taking any of the following actions to mitigate the issue: increase max connections, increase acquire timeout, or slowing the request rate.\nIncreasing the max connections can increase client throughput (unless the network interface is already fully utilized), but can eventually start to hit operation system limitations on the number of file descriptors used by the process. If you already are fully utilizing your network interface or cannot further increase your connection count, increasing the acquire timeout gives extra time for requests to acquire a connection before timing out. If the connections doesn't free up, the subsequent requests will still timeout.\nIf the above mechanisms are not able to fix the issue, try smoothing out your requests so that large traffic bursts cannot overload the client, being more efficient with the number of times you need to call AWS, or by increasing the number of hosts sending requests.\n\t
at software.amazon.awssdk.core.exception.SdkClientException$BuilderImpl.build(SdkClientException.java:102)\n\t
at software.amazon.awssdk.core.exception.SdkClientException.create(SdkClientException.java:47)\n\t
at software.amazon.awssdk.core.internal.http.pipeline.stages.utils.RetryableStageHelper.setLastException(RetryableStageHelper.java:211)\n\t
at software.amazon.awssdk.core.internal.http.pipeline.stages.utils.RetryableStageHelper.setLastException(RetryableStageHelper.java:207)\n\t
at software.amazon.awssdk.core.internal.http.pipeline.stages.AsyncRetryableStage$RetryingExecutor.maybeRetryExecute(AsyncRetryableStage.java:179)\n\t
at software.amazon.awssdk.core.internal.http.pipeline.stages.AsyncRetryableStage$RetryingExecutor.lambda$attemptExecute$1(AsyncRetryableStage.java:159)\n\t
at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863)\n\t
at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841)\n\t
at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)\n\t
at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2162)\n\t
at software.amazon.awssdk.utils.CompletableFutureUtils.lambda$forwardExceptionTo$0(CompletableFutureUtils.java:76)\n\t
at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863)\n\t
at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841)\n\t
at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)\n\t
at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2162)\n\t
at software.amazon.awssdk.core.internal.http.pipeline.stages.MakeAsyncHttpRequestStage.lambda$null$0(MakeAsyncHttpRequestStage.java:103)\n\t
at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863)\n\t
at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841)\n\t
at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)\n\t
at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2162)\n\t
at software.amazon.awssdk.core.internal.http.pipeline.stages.MakeAsyncHttpRequestStage.lambda$executeHttpRequest$3(MakeAsyncHttpRequestStage.java:165)\n\t
at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863)\n\t
at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841)\n\t
at java.base/java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:482)\n\t
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)\n\t
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)\n\t
at java.base/java.lang.Thread.run(Thread.java:833)\nCaused by: java.lang.Throwable: Acquire operation took longer than the configured maximum time. This indicates that a request cannot get a connection from the pool within the specified maximum time. This can be due to high request rate.\nConsider taking any of the following actions to mitigate the issue: increase max connections, increase acquire timeout, or slowing the request rate.\nIncreasing the max connections can increase client throughput (unless the network interface is already fully utilized), but can eventually start to hit operation system limitations on the number of file descriptors used by the process. If you already are fully utilizing your network interface or cannot further increase your connection count, increasing the acquire timeout gives extra time for requests to acquire a connection before timing out. If the connections doesn't free up, the subsequent requests will still timeout.\nIf the above mechanisms are not able to fix the issue, try smoothing out your requests so that large traffic bursts cannot overload the client, being more efficient with the number of times you need to call AWS, or by increasing the number of hosts sending requests.\n\t
at software.amazon.awssdk.http.nio.netty.internal.utils.NettyUtils.decorateException(NettyUtils.java:63)\n\t
at software.amazon.awssdk.http.nio.netty.internal.NettyRequestExecutor.handleFailure(NettyRequestExecutor.java:310)\n\t
at software.amazon.awssdk.http.nio.netty.internal.NettyRequestExecutor.makeRequestListener(NettyRequestExecutor.java:189)\n\t
at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:578)\n\t
at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:571)\n\t
at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:550)\n\t
at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:491)\n\t
at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:616)\n\t
at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:609)\n\t
at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117)\n\t
at software.amazon.awssdk.http.nio.netty.internal.CancellableAcquireChannelPool.lambda$acquire$1(CancellableAcquireChannelPool.java:58)\n\t
at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:578)\n\t
at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:552)\n\t
at io.netty.util.concurrent.DefaultPromise.access$200(DefaultPromise.java:35)\n\t
at io.netty.util.concurrent.DefaultPromise$1.run(DefaultPromise.java:502)\n\t
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)\n\t
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)\n\t
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:497)\n\t
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)\n\t
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)\n\t... 1 common frames omitted\nCaused by: java.util.concurrent.TimeoutException: Acquire operation took longer than 10000 milliseconds.\n\t
at software.amazon.awssdk.http.nio.netty.internal.HealthCheckedChannelPool.timeoutAcquire(HealthCheckedChannelPool.java:77)\n\t
at software.amazon.awssdk.http.nio.netty.internal.HealthCheckedChannelPool.lambda$acquire$0(HealthCheckedChannelPool.java:67)\n\t
at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)\n\t
at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170)\n\t... 6 common frames omitted\n",
"level": "WARN", "logger": "xtdb.s3",
"message": "Error fetching S3 object: s3://...redacted.../index/checkpoint-43549667-2022-10-30T14:03:19.410-00:00/005788.sst",
"thread": "sdk-async-response-0-1", "timestamp": "2022-10-31T10:08:27.389Z" }

tatut 10:10:16

so perhaps we need to add some Netty configuration

tatut 10:10:15

so it looks like we need to supply a reified S3Configurator that implements makeClient with a better Netty config
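A minimal sketch of that (assuming the AWS SDK v2 Netty client, software.amazon.awssdk/netty-nio-client, is on the classpath; the tuning values are illustrative, and ->configurator is just a placeholder name that would be referenced from the S3 module's config rather than called directly):

(import '(xtdb.s3 S3Configurator)
        '(software.amazon.awssdk.services.s3 S3AsyncClient)
        '(software.amazon.awssdk.http.nio.netty NettyNioAsyncHttpClient)
        '(java.time Duration))

(defn ->configurator [_opts]
  (reify S3Configurator
    (makeClient [_]
      (-> (S3AsyncClient/builder)
          (.httpClientBuilder
           (-> (NettyNioAsyncHttpClient/builder)
               ;; allow more pooled connections than the SDK default of 50
               (.maxConcurrency (int 200))
               ;; allow longer than the default 10s to acquire a connection
               ;; from the pool (the timeout seen in the log above)
               (.connectionAcquisitionTimeout (Duration/ofSeconds 60))))
          (.build)))))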

tatut 11:10:38

looks like xtdb is using an older awssdk - hopefully it can safely take a newer one
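One way to try that, sketched as a deps.edn fragment (versions are illustrative - check the current AWS SDK 2.x release; a top-level dep wins over the older transitive version pulled in by xtdb-s3):

{:deps {com.xtdb/xtdb-s3                        {:mvn/version "1.21.0"}
        software.amazon.awssdk/s3               {:mvn/version "2.17.230"}
        software.amazon.awssdk/netty-nio-client {:mvn/version "2.17.230"}}}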

refset 16:10:43

> hopefully it can safely take a newer one
I haven't heard of anyone having problems so far. Let us know if you're blocked though 🙂

tatut 10:11:01

yeah, looks like that S3Configurator worked, even with a newer awssdk

awesome 1
🙌 1
tatut 10:11:08

I can’t say I’m a fan of the awssdk default of a 10s timeout for S3AsyncClient HTTP pool acquisition

jarohen 10:11:15

yeah, that does seem a bit off...