datomic

jaret 2025-10-23T20:23:27.940289Z

Hi Friends! I have a new release of Datomic Pro. This release contains two fixes to Datomic Pro and one critical release notice for customers who use Excision. Please check out the forum post and the excision repair tool release notice in the docs for this release. Release notice: https://docs.datomic.com/release-notices.html#excision-repair-tool Forum post: https://forum.datomic.com/t/critical-release-datomic-1-0-7469-pro-now-available/2598

🎉 5
2025-10-28T14:30:40.565979Z

Thanks for fixing this! We are impacted by it. I have a couple of questions about the repair tool to clarify our risks. The docs for it state:
> When a transactor is running, its indexing process may try to update the index at the same time as the repair process. If the transactor encounters an index the repair tool generated, it may self-terminate.
The first sentence implies the txor will exit if it tries to update the same index as the tool at the same time the tool is updating it, but the latter implies that the txor will exit if it updates an index that was last updated by the tool. Can you clarify that behavior? And what does "index" mean here? The full as-of and history indexes? I'm assuming those would be updated by every indexing event on the txor.
Next question: Is there a way to feed the report file generated in the report step into the repair step to tell it what to repair? Running the report on our db took 16 hours, so I assume w/o this feature, the repair will take at least that long.

favila 2025-10-28T14:34:56.703029Z

> Can you clarify that behavior?
1. txor reads index, starts preparing new index.
2. repair tool reads index, makes new one, writes it.
3. txor finishes making its index, tries to CAS it in; CAS fails, transactor self-terminates.

favila 2025-10-28T14:35:14.900279Z

The two sentences are describing the same CAS-failure scenario.

favila 2025-10-28T14:35:33.325399Z

not two different scenarios
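
The three steps above can be sketched as a toy compare-and-swap race. This is only an illustration of the sequence described in the thread, not how Datomic implements it: the temp file standing in for the index root, the `cas` helper, and the `index-v1` / `index-v2` / `index-repaired` names are all hypothetical.

```shell
#!/usr/bin/env bash
# Toy model only: a temp file stands in for the index root pointer in storage.
root=$(mktemp)
echo "index-v1" > "$root"

# cas EXPECTED NEW: write NEW only if the pointer still holds EXPECTED.
# (Real storage does this check-and-write atomically; this sketch does not.)
cas() {
  if [ "$(cat "$root")" = "$1" ]; then
    echo "$2" > "$root"
    return 0
  fi
  return 1
}

# 1. The transactor reads the index and starts preparing a new one.
txor_seen=$(cat "$root")

# 2. The repair tool reads the same index, makes a new one, and writes it.
tool_seen=$(cat "$root")
cas "$tool_seen" "index-repaired" && tool_result="tool: wrote repaired index"

# 3. The transactor finishes and tries to CAS its index in; the root has moved.
if cas "$txor_seen" "index-v2"; then
  txor_result="txor: wrote new index"
else
  txor_result="txor: CAS failed, would self-terminate"
fi

echo "$tool_result"
echo "$txor_result"
rm -f "$root"
```

Whichever writer swaps first wins; the other finds that the value it read is stale, which is why both sentences in the docs describe the same scenario.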

jaret 2025-10-28T14:41:18.260829Z

> The first sentence implies the txor will exit if it tries to update the same index as the tool at the same time the tool is updating it, but the latter implies that the txor will exit if it updates an index that was last updated by the tool. Can you clarify that behavior?
Both the tool and the transactor can error.
> And what does "index" mean here? The full as-of and history indexes? I'm assuming those would be updated by every indexing event on the txor.
The repair tool is fixing the index. This is the same index as created by the txor.
> Next ?: Is there a way to feed the report file generated in the report step into the repair step to tell it what to repair? Running the report on our db took 16 hours, so I assume w/o this feature, the repair will take at least that long.
The repair tool takes the report file in its input line:

bin/run -Xmx${MEM} -m datomic.tools.excise-history repair ${DB_URI} ${REPORT_FILE_NAME} ${MAX_ATTEMPTS}
> Running the report on our db took 16 hours, so I assume w/o this feature, the repair will take at least that long.
What flags did you give the report tool, specifically memory? And would you be able to share your diagnostics or db-stats (we can open a support case to share this info)? Can you run in a staging environment or against a restored backup to validate?

2025-10-28T14:42:52.929819Z

Thanks y'all!

2025-10-28T14:44:25.468969Z

> The repair tool takes the report file in its input line
I assumed that was so the repair tool could write its own report. But if that report is input, it might be worth clarifying that.

2025-10-28T14:46:44.376519Z

> What flags did you give the report tool, specifically memory? and would you be able to share your diagnostics or db-stats (we can open a support case to share this info). Can you run in a staging environment or against a restored backup to validate?
I just gave it -Xmx10g -Xms10g as arguments. I can go bigger if that would be helpful. But if the report output is used by repair, I shouldn't need to run for 16 hours again (we haven't had any excisions since I ran the report). Regarding diagnostics/db-stats, how should I capture those?

jaret 2025-10-28T14:51:56.481239Z

@tcrawley I am wrong about the input of the report filename. It does not accept the file nor can the implementation be sped up by providing the output of the report tool. Credit @favila for setting me straight there.

jaret 2025-10-28T14:54:41.230899Z

> Regarding diagnostics/db-stats, how should I capture those?
https://docs.datomic.com/clojure/index.html#datomic.api/db-stats
And for diagnostics (here is my template e-mail):
> To run diagnostics you can launch a Clojure REPL from a machine that can connect to the DB. You can do this by running bin/repl from the Datomic root directory and executing the following command:

(require '[datomic.integrity :as int])
(int/diagnostics uri)
> You'll want to ensure that the REPL can connect to the system with all appropriate perms. You'll know you have the correct output from diagnostics if :db is not nil. If you're unable to run from a REPL, you can run directly from the command line (again, the machine you're on has to be able to connect/have perms to the system), and you will need to replace $DB-URI with your DB URI:
bin/run -m datomic.integrity $DB-URI
> Please let me know if you have any questions about providing a diagnostic report.

2025-10-28T14:58:02.667729Z

Ah, I see. I thought you meant diagnostics during the tool run. I can get you those and move this to a ticket.

jaret 2025-10-28T15:02:48.501299Z

> But if the report output is used by repair, I shouldn't need to run for 16 hours again (we haven't had any excisions since I ran the report).
Report output is not used.

jaret 2025-10-28T15:06:24.251869Z

If RAM turns out not to be the bottleneck, we will want to look at the datomic.readConcurrency and datomic.writeConcurrency settings. Are you currently setting these on your transactor system?

2025-10-28T15:46:49.101179Z

We currently set those to the following on the txor:

write-concurrency=4
read-concurrency=24
But I used the defaults with the tool. I can try with those settings to see if it is faster (though I assume write-concurrency doesn't matter for report mode, only for repair).
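
A minimal sketch of carrying those two transactor settings over to the tool as JVM system properties, using the datomic.readConcurrency and datomic.writeConcurrency names mentioned earlier in the thread. The `to_jvm_flag` helper is hypothetical and covers only these two settings; it makes no claim about how other transactor properties map.

```shell
#!/usr/bin/env bash
# Hypothetical helper: maps only the two transactor properties discussed here
# to the system property names mentioned earlier in the thread.
to_jvm_flag() {
  case "$1" in
    read-concurrency=*)  echo "-Ddatomic.readConcurrency=${1#*=}" ;;
    write-concurrency=*) echo "-Ddatomic.writeConcurrency=${1#*=}" ;;
    *) echo "unmapped property: $1" >&2; return 1 ;;
  esac
}

flags="$(to_jvm_flag write-concurrency=4) $(to_jvm_flag read-concurrency=24)"
echo "$flags"
```

The resulting flags would go on the bin/run command line ahead of the -m argument.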

2025-10-28T15:47:01.283309Z

I'll also give it more memory (since I can)

jaret 2025-10-28T15:53:22.896399Z

Yeah, I definitely would bump read-concurrency as well when looking for a knob that would have an effect.

jaret 2025-10-28T15:54:01.138619Z

The other thing to consider is valcache; all other Datomic configuration options also apply when it comes to helping with performance.

jaret 2025-10-28T15:54:38.850949Z

It might be worthwhile to configure valcache, populate the cache (perhaps by running the report), and then run repair.

2025-10-28T15:54:48.524529Z

That makes sense; I can add valcache to the mix as well.

2025-10-28T16:04:09.180979Z

I just opened a support ticket with our stats and diagnostics. I'll work on starting a new run with better settings this afternoon.

👍 1
2025-10-28T18:02:34.065029Z

Is this the correct way to provide properties to the tool?

bin/run -Xmx55g -Xms55g \
  -XX:+UseTransparentHugePages \
  -XX:+AlwaysPreTouch \
  -Ddatomic.memcachedServers=<elided> \
  -Ddatomic.valcachePath=/opt/datomic/valcache \
  -Ddatomic.valcacheMaxGb=200 \
  -Ddatomic.readConcurrency=24 \
  -Ddatomic.writeConcurrency=4 \
  -m datomic.tools.excise-history report \
  $URI \
  report-prod.edn
I ask because I'm running it now, but it isn't writing anything to the valcache dir.

ghadi 2025-10-28T18:09:57.049549Z

valcache config looks correct, but it should be writing...

2025-10-28T18:10:02.154889Z

Another question: Given the list of datoms, can we just excise them again via the txor to avoid tool <> txor contention when indexing? We would have to trickle them to the txor in batches, but have tooling to do that.

ghadi 2025-10-28T18:12:10.772669Z

no, re-excision is not guaranteed to work (because excision uses the E leading indexes to find the datoms to excise in AVET, and the datoms may be correctly removed already from the E-leading indexes)

ghadi 2025-10-28T18:13:07.219919Z

if your vc config is not working, I suspect your other params are not either

2025-10-28T18:15:35.217469Z

Thanks Ghadi. The -X* options are working, but I don't have a way to verify any of the -D props

ghadi 2025-10-28T18:16:29.762079Z

jcmd $PID VM.command_line

ghadi 2025-10-28T18:18:44.775009Z

Unfortunately I think the cache stack is inactive during this tool.

ghadi 2025-10-28T18:18:57.312949Z

hold please.

2025-10-28T18:20:32.055789Z

Thanks for checking Ghadi. Just confirming the properties are making it to the java process:

jvm_args: -Xmx1g -Xms1g -Xmx55g -Xms55g -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch -Ddatomic.memcachedServers=<elided> -Ddatomic.valcachePath=/opt/datomic/valcache -Ddatomic.valcacheMaxGb=200 -Ddatomic.readConcurrency=24 -Ddatomic.writeConcurrency=4 
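
A small sketch for checking that kind of jvm_args line mechanically. The sample string is copied from the output above (the memcached value stays elided); the `datomic_props` and `last_xmx` helpers are hypothetical, and the "last -Xmx wins" reading of the duplicated heap flags is an assumption about the JVM rather than something stated in this thread.

```shell
#!/usr/bin/env bash
# Sample copied from the jvm_args line above (memcachedServers stays elided).
jvm_args='-Xmx1g -Xms1g -Xmx55g -Xms55g -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch -Ddatomic.memcachedServers=<elided> -Ddatomic.valcachePath=/opt/datomic/valcache -Ddatomic.valcacheMaxGb=200 -Ddatomic.readConcurrency=24 -Ddatomic.writeConcurrency=4'

# Print each -Ddatomic.* property as name=value, one per line.
# Splits on spaces, so values containing spaces are not handled.
datomic_props() {
  tr ' ' '\n' | grep '^-Ddatomic\.' | sed 's/^-D//'
}

# Print the last -Xmx value; assumes the JVM honors the last occurrence.
last_xmx() {
  tr ' ' '\n' | grep '^-Xmx' | tail -n 1 | sed 's/^-Xmx//'
}

props=$(printf '%s\n' "$jvm_args" | datomic_props)
heap=$(printf '%s\n' "$jvm_args" | last_xmx)

printf '%s\n' "$props"
echo "last -Xmx on the command line: $heap"
```

Against a live process, the same helpers could be fed from `jcmd "$PID" VM.command_line | sed -n 's/^jvm_args: //p'`.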

ghadi 2025-10-28T18:37:44.292609Z

nice. I can confirm that the settings, aside from memcache and valcache, are being respected.

ghadi 2025-10-28T18:38:07.197409Z

you may want to use a separate memcache than your prod memcache

2025-10-28T18:40:39.981969Z

What would the value of an empty memcache cluster be? Is it to keep repair from writing to the production one? If so, would having one even be useful?

ghadi 2025-10-28T18:42:07.456069Z

repair does a superset of the work that report does. a cache (valcache is sufficient) would help avoid going to backing storage twice

2025-10-28T18:42:41.198369Z

Ah, I see.

ghadi 2025-10-28T18:42:52.285959Z

not sure if the box you're running from is persistent or ephemeral

ghadi 2025-10-28T18:43:09.540519Z

but memcached, even empty, would give you persistence

2025-10-28T18:44:16.286419Z

It's an ephemeral AWS instance. But I can't run repair yet, as we are still in the process of upgrading to 7469. I'm just trying to get a handle on how long this will take to run. I'm fine with running in report mode again when we are ready to repair.

ghadi 2025-10-28T18:45:02.924789Z

sounds good. Keenly watching.

jaret 2025-10-28T19:03:57.782189Z

@tcrawley please do take a Datomic level Backup before you do run repair.

2025-10-28T19:05:37.033499Z

Will do!

2025-10-29T12:22:50.934409Z

I reran it with the above settings (and with the cache layer initialized), and it still took 15 hours. However, it looks like it didn't actually use valcache. It initialized the directory structure, but did not write any data to it.

ghadi 2025-10-29T12:54:51.641299Z

Sigh. I will get this sorted

2025-10-29T13:10:13.943569Z

Thanks!

jaret 2025-10-23T20:25:58.832859Z

While we expect that few customer databases will be affected, given the rare circumstances in which an excision of datoms in the historical indexes can be incomplete, we have created a detection tool and a repair tool in addition to resolving the issue that led to incomplete excisions.

jaret 2025-10-23T20:26:56.378279Z

If you have questions, please reach out to me here, contact support@cognitect.com, or open a support ticket: https://support.cognitect.com/hc/en-us/requests/new