This page is not created by, affiliated with, or supported by Slack Technologies, Inc.
2022-10-31
Channels
- # aleph (12)
- # announcements (4)
- # asami (7)
- # babashka (20)
- # beginners (92)
- # calva (74)
- # clj-kondo (8)
- # cljdoc (70)
- # clojure (47)
- # clojure-dev (29)
- # clojure-europe (27)
- # clojure-nl (7)
- # clojure-norway (3)
- # clojurescript (7)
- # cursive (2)
- # datomic (1)
- # emacs (8)
- # events (5)
- # fulcro (36)
- # gratitude (4)
- # humbleui (25)
- # introduce-yourself (1)
- # lsp (26)
- # malli (6)
- # missionary (8)
- # nbb (50)
- # off-topic (9)
- # pathom (2)
- # pedestal (3)
- # portal (32)
- # practicalli (5)
- # reitit (5)
- # releases (1)
- # ring (6)
- # shadow-cljs (87)
- # sql (31)
- # tools-deps (26)
- # vim (3)
- # xtdb (15)
any ideas what could cause an “incomplete checkpoint restore” startup problem (RocksDB)?
the exception info has an :expected list with 105 items, but :actual only has 62… could the node making the checkpoint have been stopped while making it?
if there are any issues uploading the checkpoint, the node shouldn't upload the metadata-*.edn file at the end - nodes then check for this file when choosing the checkpoint to restore from
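In effect the metadata file acts as a commit marker. A minimal sketch of that scheme (not XTDB's actual code - here the bucket is modelled as an atom of keys purely for illustration):

```clojure
;; Sketch only: the metadata-*.edn file is uploaded last, so its presence
;; "commits" the checkpoint; an interrupted upload never becomes restorable.
(defn upload-checkpoint! [bucket files metadata-file]
  (run! #(swap! bucket conj %) files)   ; upload every checkpoint file first
  (swap! bucket conj metadata-file))    ; only reached if all uploads succeeded

(defn latest-restorable-checkpoint [bucket]
  (->> @bucket
       (filter #(re-find #"metadata-.*\.edn$" %))
       sort
       last))

;; e.g. (upload-checkpoint! (atom #{}) ["000001.sst" "000002.sst"]
;;                          "metadata-2022-10-30.edn")
```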
looking at the error message, it seems the problem is on the download side - :expected vs :actual here is the new node comparing the file count on S3 against the number of files it was able to successfully download
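So the :expected/:actual pair in the exception is just that file-count comparison. A hedged sketch of the shape of the check (the function name and arguments are made up for illustration):

```clojure
;; Illustrative only: fail the restore when the files listed on S3 don't all
;; make it to local disk.
(defn check-restore! [expected actual]
  (when (not= (count expected) (count actual))
    (throw (ex-info "incomplete checkpoint restore"
                    {:expected expected, :actual actual}))))

;; (check-restore! (range 105) (range 62))
;; => ExceptionInfo "incomplete checkpoint restore" {:expected ..., :actual ...}
```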
so this seems to be the cause - it can't download the files:
{"context": "default",
"exception": "software.amazon.awssdk.core.exception.SdkClientException: Unable to execute HTTP request: Acquire operation took longer than the configured maximum time. This indicates that a request cannot get a connection from the pool within the specified maximum time. This can be due to high request rate.\nConsider taking any of the following actions to mitigate the issue: increase max connections, increase acquire timeout, or slowing the request rate.\nIncreasing the max connections can increase client throughput (unless the network interface is already fully utilized), but can eventually start to hit operation system limitations on the number of file descriptors used by the process. If you already are fully utilizing your network interface or cannot further increase your connection count, increasing the acquire timeout gives extra time for requests to acquire a connection before timing out. If the connections doesn't free up, the subsequent requests will still timeout.\nIf the above mechanisms are not able to fix the issue, try smoothing out your requests so that large traffic bursts cannot overload the client, being more efficient with the number of times you need to call AWS, or by increasing the number of hosts sending requests.\n\t
at software.amazon.awssdk.core.exception.SdkClientException$BuilderImpl.build(SdkClientException.java:102)\n\t
at software.amazon.awssdk.core.exception.SdkClientException.create(SdkClientException.java:47)\n\t
at software.amazon.awssdk.core.internal.http.pipeline.stages.utils.RetryableStageHelper.setLastException(RetryableStageHelper.java:211)\n\t
at software.amazon.awssdk.core.internal.http.pipeline.stages.utils.RetryableStageHelper.setLastException(RetryableStageHelper.java:207)\n\t
at software.amazon.awssdk.core.internal.http.pipeline.stages.AsyncRetryableStage$RetryingExecutor.maybeRetryExecute(AsyncRetryableStage.java:179)\n\t
at software.amazon.awssdk.core.internal.http.pipeline.stages.AsyncRetryableStage$RetryingExecutor.lambda$attemptExecute$1(AsyncRetryableStage.java:159)\n\t
at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863)\n\t
at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841)\n\t
at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)\n\t
at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2162)\n\t
at software.amazon.awssdk.utils.CompletableFutureUtils.lambda$forwardExceptionTo$0(CompletableFutureUtils.java:76)\n\t
at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863)\n\t
at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841)\n\t
at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)\n\t
at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2162)\n\t
at software.amazon.awssdk.core.internal.http.pipeline.stages.MakeAsyncHttpRequestStage.lambda$null$0(MakeAsyncHttpRequestStage.java:103)\n\t
at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863)\n\t
at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841)\n\t
at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)\n\t
at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2162)\n\t
at software.amazon.awssdk.core.internal.http.pipeline.stages.MakeAsyncHttpRequestStage.lambda$executeHttpRequest$3(MakeAsyncHttpRequestStage.java:165)\n\t
at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863)\n\t
at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841)\n\t
at java.base/java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:482)\n\t
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)\n\t
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)\n\t
at java.base/java.lang.Thread.run(Thread.java:833)\nCaused by: java.lang.Throwable: Acquire operation took longer than the configured maximum time. This indicates th
at a request cannot get a connection from the pool within the specified maximum time. This can be due to high request rate.\nConsider taking any of the following actions to mitigate the issue: increase max connections, increase acquire timeout, or slowing the request rate.\nIncreasing the max connections can increase client throughput (unless the network interface is already fully utilized), but can eventually start to hit operation system limitations on the number of file descriptors used by the process. If you already are fully utilizing your network interface or cannot further increase your connection count, increasing the acquire timeout gives extra time for requests to acquire a connection before timing out. If the connections doesn't free up, the subsequent requests will still timeout.\nIf the above mechanisms are not able to fix the issue, try smoothing out your requests so th
at large traffic bursts cannot overload the client, being more efficient with the number of times you need to call AWS, or by increasing the number of hosts sending requests.\n\t
at software.amazon.awssdk.http.nio.netty.internal.utils.NettyUtils.decorateException(NettyUtils.java:63)\n\t
at software.amazon.awssdk.http.nio.netty.internal.NettyRequestExecutor.handleFailure(NettyRequestExecutor.java:310)\n\t
at software.amazon.awssdk.http.nio.netty.internal.NettyRequestExecutor.makeRequestListener(NettyRequestExecutor.java:189)\n\t
at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:578)\n\t
at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:571)\n\t
at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:550)\n\t
at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:491)\n\t
at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:616)\n\t
at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:609)\n\t
at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117)\n\t
at software.amazon.awssdk.http.nio.netty.internal.CancellableAcquireChannelPool.lambda$acquire$1(CancellableAcquireChannelPool.java:58)\n\t
at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:578)\n\t
at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:552)\n\t
at io.netty.util.concurrent.DefaultPromise.access$200(DefaultPromise.java:35)\n\t
at io.netty.util.concurrent.DefaultPromise$1.run(DefaultPromise.java:502)\n\t
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)\n\t
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)\n\t
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:497)\n\t
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)\n\t
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)\n\t... 1 common frames omitted\nCaused by: java.util.concurrent.TimeoutException: Acquire operation took longer than 10000 milliseconds.\n\t
at software.amazon.awssdk.http.nio.netty.internal.HealthCheckedChannelPool.timeoutAcquire(HealthCheckedChannelPool.java:77)\n\t
at software.amazon.awssdk.http.nio.netty.internal.HealthCheckedChannelPool.lambda$acquire$0(HealthCheckedChannelPool.java:67)\n\t
at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)\n\t
at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170)\n\t... 6 common frames omitted\n",
"level": "WARN", "logger": "xtdb.s3",
"message": "Error fetching S3 object: s3://...redacted.../index/checkpoint-43549667-2022-10-30T14:03:19.410-00:00/005788.sst",
"thread": "sdk-async-response-0-1", "timestamp": "2022-10-31T10:08:27.389Z" }
so it looks like we need to supply a reified S3Configurator that implements makeClient with better Netty config
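A sketch of what that could look like - the S3Configurator interface and its makeClient method are as named above, but the specific Netty builder values here are assumptions chosen to illustrate the knobs the SDK error message points at, not tested recommendations:

```clojure
(ns my.app.s3
  (:import (xtdb.s3 S3Configurator)
           (software.amazon.awssdk.services.s3 S3AsyncClient)
           (software.amazon.awssdk.http.nio.netty NettyNioAsyncHttpClient)
           (java.time Duration)))

;; Sketch: raise the Netty connection-pool limits so that restoring a
;; checkpoint with ~100 parallel S3 downloads doesn't exhaust the pool.
(defn ->configurator [_opts]
  (reify S3Configurator
    (makeClient [_]
      (let [http-client (-> (NettyNioAsyncHttpClient/builder)
                            ;; illustrative values; the SDK defaults are 50
                            ;; connections and a 10s acquire timeout
                            (.maxConcurrency (int 200))
                            (.connectionAcquisitionTimeout (Duration/ofSeconds 60)))]
        (-> (S3AsyncClient/builder)
            (.httpClientBuilder http-client)
            (.build))))))
```

This would then be plugged in via the S3 checkpoint store's :configurator option in the node config (check the XTDB S3 module docs for the exact topology keys).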
> hopefully it can safely take a newer one
I haven't heard of anyone having problems so far. Let us know if you're blocked though 🙂