Hi Friends! I have a new release of Datomic Pro. This release contains two fixes to Datomic Pro and one critical release notice for customers who use Excision. Please check out the forum post and the excision repair tool release notice in the docs for this release:
https://forum.datomic.com/t/critical-release-datomic-1-0-7469-pro-now-available/2598
https://docs.datomic.com/release-notices.html#excision-repair-tool
Thanks for fixing this! We are impacted by it. I have a couple of questions about the repair tool to clarify our risks: The docs for it state:
> When a transactor is running, its indexing process may try to update the index at the same time as the repair process. If the transactor encounters an index the repair tool generated, it may self-terminate.
The first sentence implies the txor will exit if it tries to update the same index as the tool at the same time the tool is updating it, but the latter implies that the txor will exit if it updates an index that was last updated by the tool. Can you clarify that behavior? And what does "index" mean here? The full as-of and history indexes? I'm assuming those would be updated by every indexing event on the txor.
Next ?: Is there a way to feed the report file generated in the report step into the repair step to tell it what to repair? Running the report on our db took 16 hours, so I assume w/o this feature, the repair will take at least that long.
> Can you clarify that behavior?
1. txor reads index, starts preparing new index.
2. repair tool reads index, makes new one, writes it.
3. txor finishes making its index, tries to CAS it in; CAS fails, transactor self-terminates.
The two sentences are describing the same CAS-failure scenario, not two different scenarios.
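The three steps above can be sketched as a toy compare-and-set race. This is only an illustration of the sequence described, not Datomic's implementation: the `index_root` variable, the `cas` helper, and the version names are all hypothetical stand-ins.

```shell
#!/bin/sh
# Toy model only: one variable stands in for the pointer to the current index.
# Every name here is hypothetical; this is not how Datomic stores its indexes.
index_root="index-v1"

# cas EXPECTED NEW: install NEW only if the root still equals EXPECTED.
cas() {
  if [ "$index_root" = "$1" ]; then
    index_root="$2"
    return 0
  fi
  return 1
}

txor_seen="$index_root"                  # 1. txor reads index, starts preparing a new one
repair_seen="$index_root"                # 2. repair tool reads index...
cas "$repair_seen" "index-v2-repaired"   #    ...makes a new one and writes it

# 3. txor finishes and tries to CAS its index in, but the root has moved on.
if cas "$txor_seen" "index-v2-txor"; then
  txor_outcome="installed"
else
  txor_outcome="self-terminate"
fi

echo "root=$index_root txor=$txor_outcome"
```

In this ordering the repair tool's write wins and the transactor's swap is rejected, which is the single scenario both sentences in the docs refer to.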
> The first sentence implies the txor will exit if it tries to update the same index as the tool at the same time the tool is updating it, but the latter implies that the txor will exit if it updates an index that was last updated by the tool. Can you clarify that behavior?
Both the tool and the transactor can error.
> And what does "index" mean here? The full as-of and history indexes? I'm assuming those would be updated by every indexing event on the txor.
The repair tool is fixing the index. This is the same index as created by the txor.
> Next ?: Is there a way to feed the report file generated in the report step into the repair step to tell it what to repair? Running the report on our db took 16 hours, so I assume w/o this feature, the repair will take at least that long.
The repair tool takes the report file in its input line:
bin/run -Xmx${MEM} -m datomic.tools.excise-history repair ${DB_URI} ${REPORT_FILE_NAME} ${MAX_ATTEMPTS}
> Running the report on our db took 16 hours, so I assume w/o this feature, the repair will take at least that long.
What flags did you give the report tool, specifically memory? and would you be able to share your diagnostics or db-stats (we can open a support case to share this info). Can you run in a staging environment or against a restored backup to validate?
Thanks y'all!
> The repair tool takes the report file in its input line
I assumed that was so the repair tool could write its own report. But if that report is input, it might be worth clarifying that.
> What flags did you give the report tool, specifically memory? and would you be able to share your diagnostics or db-stats (we can open a support case to share this info). Can you run in a staging environment or against a restored backup to validate?
I just gave it -Xmx10g -Xms10g as arguments. I can go bigger if that would be helpful. But if the report output is used by repair, I shouldn't need to run for 16 hours again (we haven't had any excisions since I ran the report).
Regarding diagnostics/db-stats, how should I capture those?
> Regarding diagnostics/db-stats, how should I capture those?
https://docs.datomic.com/clojure/index.html#datomic.api/db-stats
And for diagnostics (here is my template e-mail):
> To run diagnostics you can launch a Clojure REPL from a machine that can connect to the DB. You can do this by running bin/repl from the Datomic root directory and executing the following commands:
(require '[datomic.integrity :as int])
(int/diagnostics uri)
> You'll want to ensure that the REPL can connect to the system with all appropriate perms. You'll know you have the correct output from diagnostics if :db is not nil. If you're unable to run from a REPL you can run directly from the command line (again, the machine you're on has to be able to connect/have perms to the system) and you will need to replace $DB-URI with your DB URI:
bin/run -m datomic.integrity $DB-URI
> Please let me know if you have any questions about providing a diagnostic report.
Ah, I see. I thought you meant diagnostics during the tool run. I can get you those and move this to a ticket.
> But if the report output is used by repair, I shouldn't need to run for 16 hours again (we haven't had any excisions since I ran the report).
Report output is not used.
If RAM is not the bottleneck, we will want to look at the datomic.readConcurrency and datomic.writeConcurrency settings. Are you currently setting these on your transactor system?
We currently set those to the following on the txor:
write-concurrency=4
read-concurrency=24
But I used the defaults with the tool. I can try with those settings to see if it is faster (though I assume write-concurrency doesn't matter for report mode, only for repair).
I'll also give it more memory (since I can)
Yeah, I definitely would bump read-concurrency as well when looking for a knob that would have an effect.
The other thing to consider is valcache; all other Datomic configuration options also apply and can help with performance.
it might be worthwhile to configure valcache, populate the cache (perhaps by running the report), and then run repair.
That makes sense; I can add valcache to the mix as well.
I just opened a support ticket with our stats and diagnostics. I'll work on starting a new run with better settings this afternoon.
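For reference, a minimal sketch of turning the two transactor settings quoted above into JVM flags for the tool. It assumes the tool honors the `datomic.readConcurrency` and `datomic.writeConcurrency` system properties named earlier in the thread; the `to_jvm_flag` helper is hypothetical and maps only those two settings.

```shell
#!/bin/sh
# to_jvm_flag KEY=VALUE: map one transactor-properties line to a -D flag.
# Hypothetical helper; only the two settings from this thread are mapped.
to_jvm_flag() {
  key=${1%%=*}     # text before the first '='
  value=${1#*=}    # text after the first '='
  case "$key" in
    read-concurrency)  echo "-Ddatomic.readConcurrency=$value" ;;
    write-concurrency) echo "-Ddatomic.writeConcurrency=$value" ;;
    *) echo "unmapped setting: $key" >&2; return 1 ;;
  esac
}

flags="$(to_jvm_flag write-concurrency=4) $(to_jvm_flag read-concurrency=24)"
echo "$flags"
```

The resulting flags would go between `bin/run` and `-m`, alongside the `-Xmx` options.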
Is this the correct way to provide properties to the tool?
bin/run -Xmx55g -Xms55g \
-XX:+UseTransparentHugePages \
-XX:+AlwaysPreTouch \
-Ddatomic.memcachedServers=<elided> \
-Ddatomic.valcachePath=/opt/datomic/valcache \
-Ddatomic.valcacheMaxGb=200 \
-Ddatomic.readConcurrency=24 \
-Ddatomic.writeConcurrency=4 \
-m datomic.tools.excise-history report \
$URI \
report-prod.edn
I ask because I'm running it now, but it isn't writing anything to the valcache dir.
valcache config looks correct, but it should be writing...
Another question: Given the list of datoms, can we just excise them again via the txor to avoid tool <> txor contention when indexing? We would have to trickle them to the txor in batches, but we have tooling to do that.
no, re-excision is not guaranteed to work (because excision uses the E-leading indexes to find the datoms to excise in AVET, and the datoms may already be correctly removed from the E-leading indexes)
if your vc config is not working, I suspect your other params are not either
Thanks Ghadi. The -X* options are working, but I don't have a way to verify any of the -D props
jcmd $PID VM.command_line
Unfortunately I think the cache stack is inactive during this tool.
hold please.
Thanks for checking Ghadi. Just confirming the properties are making it to the java process:
jvm_args: -Xmx1g -Xms1g -Xmx55g -Xms55g -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch -Ddatomic.memcachedServers=<elided> -Ddatomic.valcachePath=/opt/datomic/valcache -Ddatomic.valcacheMaxGb=200 -Ddatomic.readConcurrency=24 -Ddatomic.writeConcurrency=4
nice. I can confirm that the settings, aside from memcache and valcache, are being respected.
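The check above can be scripted. A small sketch that pulls only the `-Ddatomic.*` properties out of a `jvm_args:` line; `datomic_props` is a hypothetical helper, the sample line is a shortened copy of the output quoted above, and the whitespace split assumes no property value contains a space.

```shell
#!/bin/sh
# datomic_props: read text on stdin and print each -Ddatomic.* property
# on its own line with the -D prefix removed. Hypothetical helper.
datomic_props() {
  tr ' ' '\n' | grep '^-Ddatomic\.' | sed 's/^-D//'
}

# Sample modeled on the jvm_args output in this thread (shortened).
sample='jvm_args: -Xmx55g -Xms55g -Ddatomic.valcacheMaxGb=200 -Ddatomic.readConcurrency=24 -Ddatomic.writeConcurrency=4'

props="$(printf '%s\n' "$sample" | datomic_props)"
echo "$props"
```

Against a live process this could be fed from `jcmd $PID VM.command_line | datomic_props`. It only shows what reached the JVM, not whether the tool acts on each property.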
you may want to use a separate memcache than your prod memcache
What would the value of an empty memcache cluster be? Is it to keep repair from writing to the production one? If so, would having one even be useful?
repair does a superset of the work that report does. a cache (valcache is sufficient) would help avoid going to backing storage twice
Ah, I see.
not sure if the box you're running from is persistent or ephemeral
but memcached, even empty, would give you persistence
It's an ephemeral AWS instance. But I can't run repair yet, as we are still in the process of upgrading to 7469. I'm just trying to get a handle on how long this will take to run. I'm fine with running in report mode again when we are ready to repair.
sounds good. Keenly watching.
@tcrawley please do take a Datomic-level backup before you run repair.
Will do!
I reran it with the above settings (and with the cache layer initialized), and it still took 15 hours. However, it looks like it didn't actually use valcache. It initialized the directory structure, but did not write any data to it.
Sigh. I will get this sorted
Thanks!
While we expect that few customer databases will be affected, given the rare circumstances in which an excision of datoms in the historical indexes can be incomplete, we have created a detection tool and a repair tool in addition to resolving the issue that led to incomplete excisions.
If you have questions, please reach out to me here, contact support@cognitect.com, or open a support ticket at https://support.cognitect.com/hc/en-us/requests/new.