#datomic
2022-01-20
cmdrdats 13:01:22

another metaschema question... it seems like the metaschema setup is not actually ETL'ing data out of Datomic, but rather querying Datomic directly - if that's the case, is there a way to specify a specific value of the db you want to query? e.g. if I want to run a SQL query against yesterday's db state

favila 13:01:34

Yes, it is querying, and no, there is no way. They must be getting that feature request a lot though
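
For reference: the peer API itself can query a past database value with d/as-of; it's only the analytics/SQL layer that lacks the knob. A minimal sketch, assuming an existing peer connection conn and a placeholder attribute :some/attr:

```clojure
(require '[datomic.api :as d])

;; A database value as of a point in time (a t, tx id, or instant).
;; Queries against it see only datoms asserted up to that point.
(def yesterday-db
  (d/as-of (d/db conn) #inst "2022-01-19"))

(d/q '[:find ?e :where [?e :some/attr]] yesterday-db)
```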

cmdrdats 13:01:15

cool, thanks 🙂 that would be a super handy thing to be able to do 😅

➕ 3
bhurlow 20:01:51

Has anyone here had to "decant" a Datomic database to clear out some bad data in the tree?

bhurlow 20:01:23

I'm wondering if the best approach is to iterate through the transaction log

favila 20:01:47

If you can tolerate it being in history, just retraction should be fine. (If you can’t, make sure you’re not falling into this trap https://vvvvalvalval.github.io/posts/2017-07-08-Datomic-this-is-not-the-history-youre-looking-for.html)
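
A plain retraction is just a transaction; a minimal sketch, with bad-eid, :doc/body, and huge-string as placeholders:

```clojure
;; Retract one bad datom; it stays visible in history views.
@(d/transact conn [[:db/retract bad-eid :doc/body huge-string]])

;; Or retract the whole entity (plus its component subtree):
@(d/transact conn [[:db/retractEntity bad-eid]])
```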

favila 20:01:02

if you can’t tolerate it in history and are using on-prem, consider excision
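
On-prem excision is itself just a transaction; a sketch, with bad-eid a placeholder:

```clojure
;; Request excision of every datom about an entity. The excision
;; takes effect after the next indexing job completes.
@(d/transact conn [{:db/id "excision-request"
                    :db/excise bad-eid}])

;; Or limit the excision to particular attributes:
@(d/transact conn [{:db/id "excision-request"
                    :db/excise bad-eid
                    :db.excise/attrs [:doc/body]}])
```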

favila 20:01:24

if excision isn’t possible, then iterating and transforming the transaction log is your only option
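
The skeleton of that approach, assuming hypothetical helpers excluded-datom? and datom->tx-form (the latter must also remap old entity ids to tempids in the new db):

```clojure
;; Walk the old database's transaction log and replay each tx into a
;; fresh connection, rewriting or dropping datoms along the way.
(let [log (d/log old-conn)]
  (doseq [{:keys [data]} (d/tx-range log nil nil)]
    (let [tx-data (->> data
                       (remove excluded-datom?)
                       (map datom->tx-form))]
      (when (seq tx-data)
        @(d/transact new-conn tx-data)))))
```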

bhurlow 21:01:21

Thanks @U09R86PA4, the angle here is not data privacy but rather performance. We've identified some nodes in our tree that have some unfortunately large string values (megabytes +) and we're observing some very high read and write impacts in ddb. I've read that excision does not necessarily help in this case. We do not need the history for business logic

favila 21:01:52

yeah, I’ve had that pain

bhurlow 21:01:54

yea 🤒

favila 21:01:06

excision can get rid of it, though

favila 21:01:40

but fortunately the decant should be relatively straightforward if you have a solid plan for putting those strings somewhere else

bhurlow 21:01:03

ok noted. We do have a strategy now, we didn't then

favila 21:01:20

we have a rule in our database (checked by test suite) that every string must have an attribute predicate that limits length
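
That rule maps onto Datomic's attribute predicates; a sketch with an invented namespace and limit (on-prem, the predicate fn must be resolvable on the transactor's classpath):

```clojure
;; In a real project this lives in its own namespace,
;; e.g. (ns myapp.preds), so the transactor can resolve it.
(defn short-string? [s]
  (<= (count s) 4096))

;; Install an attribute guarded by the predicate; any transaction
;; asserting a longer string is rejected.
@(d/transact conn
  [{:db/ident       :note/body
    :db/valueType   :db.type/string
    :db/cardinality :db.cardinality/one
    :db.attr/preds  'myapp.preds/short-string?}])
```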

bhurlow 21:01:30

oh nice idea

bhurlow 21:01:34

we'll definitely do that

bhurlow 21:01:06

When you had this issue, where did you see most of the impact? What we're seeing is periodic storage IO spikes that align with the transactor index rebuilds

favila 21:01:39

Unfortunately it was really difficult for us to put our finger on it. It manifested as drag on the entire system

favila 21:01:12

the biggest problem we had was fulltext indexes would occasionally produce huge index merges, but that’s only for the fulltext-indexed values

favila 21:01:49

there’s no way to drop fulltext from an attribute, so we had to actually physically move those values to a different attribute
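
A sketch of that kind of move, with invented attribute names (the old :doc/body is fulltext-indexed; the replacement :doc/body-raw is not):

```clojure
;; 1. Install a replacement attribute without :db/fulltext true.
@(d/transact conn
  [{:db/ident       :doc/body-raw
    :db/valueType   :db.type/string
    :db/cardinality :db.cardinality/one}])

;; 2. Copy each value across and retract the original.
;;    (Batch these in practice; one tx per entity shown for clarity.)
(doseq [[e v] (d/q '[:find ?e ?v :where [?e :doc/body ?v]] (d/db conn))]
  @(d/transact conn [[:db/add e :doc/body-raw v]
                     [:db/retract e :doc/body v]]))
```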

bhurlow 21:01:11

yep, this is our exact experience as well

bhurlow 21:01:29

did you then excise the fulltext ones or just stop putting them?

favila 21:01:11

but we also saw unpredictably big index sizes and large uncacheable segments in memcached, and when inspecting the segments they were often DirNodes > 1MB

favila 21:01:42

we just stopped putting them to stop future writes. Excision at our scale would have been untenable

favila 21:01:52

we are prepping for another round of decanting though

bhurlow 21:01:46

got it, this is all very familiar, appreciate the wisdom

bhurlow 21:01:58

on the decant, I'm assuming you're using some global identifier to negotiate the entity IDs?

favila 21:01:34

Yeah fortunately we did a decant ~2-3 years ago, the purpose of which was to renumber entity ids with partitioning for performance (I wasn’t there). During that time all :db/id dependencies were shaken out of the code so it’s resilient to decants using ids we control ourselves

👍 1
favila 21:01:21

the decant itself would just keep track of the entity id mapping; we also injected an additional assertion (long-type not ref type) with the entity-id of the entity in the previous system, which was good for correlating later
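
A sketch of that bookkeeping, with invented names (:legacy/eid carries the old system's entity id as a plain long, never a ref; :my.part/orders is a pre-installed domain partition; eid->tempid is an atom holding the mapping):

```clojure
;; Long-typed so it never resolves against the new db; unique
;; identity makes correlating with the old system cheap later.
@(d/transact new-conn
  [{:db/ident       :legacy/eid
    :db/valueType   :db.type/long
    :db/cardinality :db.cardinality/one
    :db/unique      :db.unique/identity}])

;; During the decant, allocate each entity a tempid in its domain
;; partition and remember the mapping for rewiring refs later.
(defn entity-tx [eid->tempid old-eid]
  (let [tempid (d/tempid :my.part/orders)]
    (swap! eid->tempid assoc old-eid tempid)
    {:db/id      tempid
     :legacy/eid old-eid}))
```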

bhurlow 21:01:01

that's a good tip

bhurlow 21:01:34

do 100% of your entities have unique IDs? we have some obvious high-level domain entities which get UUIDs, but there are certain things like a referenced list of settings that do not. Seems like those would need them, or should be refactored to be flattened onto the main entities

bhurlow 21:01:47

table brain

favila 21:01:08

Almost all of them do

bhurlow 21:01:28

noted. thanks a bunch

favila 22:01:37

Something you might consider if you really don’t need history is to copy every assertion at a time T, then do a partial decant from T to now. That might be better or worse depending on your circumstances

👍 1
bhurlow 22:01:23

you mean to enable doing it in chunks?

favila 22:01:56

The “copy assertions at time T” part is to throw away history. You get a smaller target db, it’s faster than replaying every tx, and you avoid having to deal with any weirdness in the distant past of your tx log. The partial decant is just to reduce downtime: it catches up on whatever happened in the db while you were doing the bulk copy. If you can tolerate downtime you don’t need it.
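
Roughly, assuming the same hypothetical datom->tx-form rewriter as above and a chosen basis point t:

```clojure
;; 1. Bulk copy: every assertion current as of t, history discarded.
(let [db-at-t (d/as-of (d/db old-conn) t)]
  (doseq [batch (partition-all 1000 (seq (d/datoms db-at-t :eavt)))]
    @(d/transact new-conn (mapv datom->tx-form batch))))

;; 2. Partial decant: replay only the log from t forward to catch up.
(doseq [{:keys [data]} (d/tx-range (d/log old-conn) t nil)]
  @(d/transact new-conn (mapv datom->tx-form data)))
```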

👍 1