Another metaschema question... it seems like the metaschema data is not actually being ETL'd out of Datomic, but rather Datomic is being queried directly. If that's the case, is there a way to specify a particular db value to query? So if I want to run a SQL query against yesterday's db state, for example...


Yes, it is querying directly, and no, there is no way to pin the db value. They must be getting that feature request a lot, though


cool, thanks 🙂 that would be a super handy thing to be able to do 😅


Has anyone here had to "decant" a Datomic database to clear out some bad data in the tree?


I'm wondering if the best approach is to iterate through the transaction log


If you can tolerate it being in history, just retraction should be fine. (If you can’t, make sure you’re not falling into this trap.)
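For the simple case, retraction is just an ordinary transaction; the old value stays queryable via `d/history`. A sketch (the entity id, attribute, and `big-string` are illustrative):

```clojure
;; Retract one bad value (it remains visible in history):
@(d/transact conn [[:db/retract 17592186045418 :doc/body big-string]])

;; Or retract the entity wholesale:
@(d/transact conn [[:db/retractEntity 17592186045418]])
```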


if you can’t tolerate it in history and are using on-prem, consider excision
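On-prem excision is itself just a transaction (entity id and attribute here are illustrative):

```clojure
;; Excise everything ever asserted about an entity, including history:
@(d/transact conn [{:db/excise 17592186045418}])

;; Or excise only the values of specific attributes on that entity:
@(d/transact conn [{:db/excise       17592186045418
                    :db.excise/attrs [:doc/body]}])
```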


if excision isn’t possible, then iterating and transforming the transaction log is your only option
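The tx-log iteration usually looks something like this minimal sketch (peer API; `keep-datom?` and the connection names are hypothetical, and a real decant also has to install the schema in the new db first and remap entity ids):

```clojure
(require '[datomic.api :as d])

;; Replay the old db's tx log into a fresh db, dropping bad datoms.
;; `keep-datom?` decides what survives; in real code you must also
;; remap old entity ids (e.g. via tempids) rather than reusing them.
(defn decant! [old-conn new-conn keep-datom?]
  (doseq [{:keys [data]} (d/tx-range (d/log old-conn) nil nil)]
    (let [tx-data (for [d data
                        :when (keep-datom? d)]
                    [(if (:added d) :db/add :db/retract)
                     (:e d) (:a d) (:v d)])]
      (when (seq tx-data)
        @(d/transact new-conn tx-data)))))
```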


Thanks @U09R86PA4, the angle here is not data privacy but rather performance. We've identified some nodes in our tree that have some unfortunately large string values (megabytes+), and we're observing very high read and write impact in DynamoDB. I've read that excision does not necessarily help in this case. We do not need the history for business logic


yeah, I’ve had that pain


excision can get rid of it, though


but fortunately the decant should be relatively straightforward if you have a solid plan for putting those strings somewhere else


ok noted. We do have a strategy now, we didn't then


we have a rule in our database (checked by test suite) that every string must have an attribute predicate that limits length
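In Datomic on-prem, that kind of rule can be enforced at transaction time with `:db.attr/preds` (the namespace, attribute name, and 4 KB limit here are illustrative):

```clojure
;; The predicate must be a named, fully-qualified function available
;; on the transactor's classpath. Namespace and limit are hypothetical.
(ns myapp.preds)

(defn short-string?
  "Reject strings longer than 4 KB."
  [s]
  (<= (count s) 4096))

;; Schema: attach the predicate to a string attribute; transactions
;; asserting a longer value will fail.
[{:db/ident       :note/body
  :db/valueType   :db.type/string
  :db/cardinality :db.cardinality/one
  :db.attr/preds  'myapp.preds/short-string?}]
```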


oh nice idea


we'll definitely do that


When you had this issue, where did you see most of the impact? What we're seeing is periodic storage IO spikes that align with the transactor's index rebuilds


Unfortunately it was really difficult for us to put our finger on it. It manifested as drag on the entire system


the biggest problem we had was fulltext indexes would occasionally produce huge index merges, but that’s only for the fulltext-indexed values


there’s no way to drop fulltext from an attribute, so we had to actually physically move those values to a different attribute
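That move is essentially copy-then-retract into a new attribute created without `:db/fulltext`. A sketch with hypothetical attribute names:

```clojure
;; New attribute with no :db/fulltext (fulltext can only be set at
;; attribute creation, so the old attribute can't be changed in place).
(def new-attr-schema
  [{:db/ident       :doc/body-2
    :db/valueType   :db.type/string
    :db/cardinality :db.cardinality/one}])

;; Move a batch of values from the fulltext attribute to the new one.
(defn migrate-batch! [conn eids]
  (let [db (d/db conn)]
    @(d/transact conn
       (mapcat (fn [e]
                 (let [v (:doc/body (d/entity db e))]
                   [[:db/add e :doc/body-2 v]
                    [:db/retract e :doc/body v]]))
               eids))))
```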


yep, this is our exact experience as well


did you then excise the fulltext ones or just stop putting them?


but we also saw unpredictably large index sizes and large uncacheable segments in memcached; inspecting the segments, they were often DirNodes > 1 MB


we just stopped putting them to stop future writes. Excision at our scale would have been untenable


we are prepping for another round of decanting though


got it, this is all very familiar, appreciate the wisdom


on the decant, I'm assuming you're using some global identifier to negotiate the entity IDs?


Yeah, fortunately we did a decant ~2-3 years ago, the purpose of which was to renumber entity ids with partitioning for performance (I wasn’t there). During that time all :db/id dependencies were shaken out of the code, so it’s resilient to decants using ids we control ourselves

👍 1

the decant itself would just keep track of the entity id mapping; we also injected an additional assertion (long-type not ref type) with the entity-id of the entity in the previous system, which was good for correlating later
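One way to get both the mapping and the breadcrumb for free is to make that old-id attribute `:db/unique :db.unique/identity`, so the new entity id can be looked up by the old one (attribute name and plumbing here are hypothetical):

```clojure
;; Hypothetical schema: the old entity id, stored as a unique long
;; (long-type, not a ref) so entities can be correlated later.
[{:db/ident       :legacy/eid
  :db/valueType   :db.type/long
  :db/cardinality :db.cardinality/one
  :db/unique      :db.unique/identity}]

;; During the decant, every copied entity carries its old id; the
;; old->new mapping then falls out via a lookup ref.
(defn copy-entity! [new-conn old-eid attr-map]
  (let [{:keys [db-after]}
        @(d/transact new-conn
           [(merge {:legacy/eid old-eid} attr-map)])]
    (d/entid db-after [:legacy/eid old-eid])))
```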


that's a good tip


do 100% of your entities have unique IDs? We have some obvious high-level domain entities which get UUIDs, but there are certain things, like a referenced list of settings, that do not. Seems like those would need them, or should be refactored to be flattened onto the main entities


table brain


Almost all of them do


noted. thanks a bunch


Something you might consider if you really don’t need history is to copy every assertion at a time T, then do a partial decant from T to now. That might be better or worse depending on your circumstances

👍 1

you mean to enable doing it in chunks?


The “copy assertions at time T” part is to throw away history. You get a smaller target db, it’s faster than replaying every tx, and you avoid having to deal with any weirdness in the distant past of your tx log. The partial decant is just to reduce downtime: it covers whatever happened in the db while you were doing the bulk copy. If you can tolerate downtime, you don’t need it.

👍 1
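That two-phase approach can be sketched like this (peer API; names are hypothetical, and as with any decant, real code needs schema setup and entity-id remapping):

```clojure
;; Phase 1: copy the current assertions as of time T, discarding
;; history, in batches to keep transactions a reasonable size.
(defn bulk-copy-as-of! [old-conn new-conn t]
  (let [db (d/as-of (d/db old-conn) t)]
    (doseq [batch (partition-all 1000 (seq (d/datoms db :eavt)))]
      @(d/transact new-conn
         (for [d batch] [:db/add (:e d) (:a d) (:v d)])))))

;; Phase 2: replay only the transactions after T to catch up with
;; whatever happened during the bulk copy.
(defn catch-up! [old-conn new-conn t]
  (doseq [{:keys [data]} (d/tx-range (d/log old-conn) t nil)]
    @(d/transact new-conn
       (for [d data]
         [(if (:added d) :db/add :db/retract)
          (:e d) (:a d) (:v d)]))))
```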