
Hello everyone, is it possible to get the result of a Datomic query as a stream of results? My use case is applying a transducer to very large result sets without realizing the whole result set in memory. Thanks

Linus Ericsson 07:08:49

In short, no. The query result is realised as a hash set. Can you break the query down into several steps? That would make it possible to read parts of the result while working on it.


Thanks for the answer. I can break it into several pieces, but it would significantly increase the complexity of the operation. My primary goal is to test entities against some core.specs database-wide to ensure data integrity over time.

Linus Ericsson 08:08:58

Is this a one-time operation or something you strive to do often? Because if you need to do it often, chances are that you need to reconsider that approach.


Both. I do it at dev time but I'm considering integrating this approach for data integrity in a test suite.

Linus Ericsson 08:08:06

If you can adapt your validation logic to check only what has changed, you will be able to use this approach. Otherwise you have to run pre-validations or similar for every transaction hitting the database; if not, the checks will soon become quite hard to scale.
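(A minimal sketch of the "validate only what changed" idea, assuming the peer API's transaction report queue and a hypothetical :user/ratings spec; all the names here are illustrative, not from the thread:)

```clojure
(require '[clojure.spec.alpha :as s]
         '[datomic.api :as d])

;; Hypothetical spec for the entities we care about.
(s/def :user/ratings (s/coll-of int? :min-count 1))

(defn validate-changed-entities!
  "Re-check only the entities touched by each transaction.
  `conn` is a peer connection; this blocks forever, so run it on its own thread."
  [conn]
  (let [queue (d/tx-report-queue conn)]
    (loop []
      (let [{:keys [db-after tx-data]} (.take queue)
            ;; entity ids touched by this transaction
            eids (distinct (map :e tx-data))]
        (doseq [eid  eids
                :let [entity (d/pull db-after '[*] eid)]
                :when (contains? entity :user/ratings)]
          (when-not (s/valid? (s/keys :req [:user/ratings]) entity)
            (println "Invalid entity" eid
                     (s/explain-str (s/keys :req [:user/ratings]) entity)))))
      (recur))))
```

This keeps the cost of validation proportional to transaction volume rather than database size, which is the scaling point being made above.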

Linus Ericsson 08:08:41

(Sorry to sound negative. I tried this some years ago, and it works quite well for limited test cases, but IMO it is not so helpful once the database grows and the integrity checks take longer and longer.)


Thanks for the feedback. I already do it at transaction time, but the business domain varies so much from one client to another that the data evolves very rapidly. It would be convenient to back-check the data already recorded.

Linus Ericsson 09:08:03

Well, don't take my word for it. I think it sounds like a reasonable use case. Maybe you can spin up a large cloud instance to do this when necessary and keep the whole database in memory.


Thanks. For the time being the data set is still small, and for at least one or two years it will remain processable on a laptop. The next step is indeed manageable with big instances with a lot of memory, but that is a workaround that adds complexity to the architecture. Stream processing would push this scaling problem further out.


You can use d/qseq to pull lazy, but the result set itself is still eager

👍 4

Great info. Thanks.


If I understand qseq well, this piece of code would 1. eagerly realize the set of integers that satisfy the query, then 2. get the required pull fields one by one, apply the xform, and step by step calculate the mean of the max ratings.

(->> (d/qseq {:query '[:find [(pull ?user [:user/ratings]) ...]
                       :in $ ?survey
                       :where [?survey :survey/user ?user]]
              :args [(get-current-db!) 1234]})
     (filter seq)
     (map #(->> % :user/ratings (apply max))))
One question: after pulling and processing an element of this lazy seq, is it garbage collected (stream like) or is it maintained into memory until the whole seq is consumed? Another way of asking this is: after realizing the integer set, is the memory consumption constant or is it growing on every step of the lazy seq?


(What’s the (filter seq) for?)


I don’t know for sure, but this feature would be pointless if the query result weren’t truly lazy


I would expect at least the pulls to be lazily realized and released even if the backing entity id list is not, but I see no reason why that can’t be lazily released also


but the query evaluation itself is not lazy, and the entire result (sans pull) is also fully realized by the time qseq returns, that much I do know

Joe Lane 14:08:47

@U94V75LDV
• How large is "very large sets of results"?
• Am I interpreting the scenario correctly in my next statement? "For a given survey (`1234`), find the maximum rating for every user associated with that survey (defined as maxes), and then find the mean of maxes."
• Is this for a report, an analytics scenario, or another batch job?


@U0CJ19XAM
1. For the time being, the result set is quite small. But since my business domain is academia, fresh data accumulates every year. So within a few years I suppose the data sets might not fit into a laptop's memory.
2. Yes, that is the meaning of the piece of code. The idea was having a pull expression inside the :find clause AND a "simple non-trivial" xform in the transducer.
3. The code example is not linked to the project I'm working on. As said in 1., the domain is academia; the code was only for the example. My actual use case is more about having tests for the integrity of the formats of the entities. As an example, I have evaluation entities. They tend to have various forms depending on the context, and they also tend to evolve quite a lot during the initial phase (I'm still discovering the business domain). What I'd like to have is a test suite to check that the entities match the core.specs I have in my code AND that each ref points to the entity type it is supposed to refer to. I already have thousands (if not tens of thousands) of evaluation entities, so I fear that within a few years I won't be able to load the whole collection of evaluations eagerly to check their shapes (~ their core.specs).

Joe Lane 15:08:54

I see. So this is primarily about enforcing consistency? Have you seen entity predicates before? Many entity predicates can be run per transaction; they may help ensure consistency on the way into the system. As for ensuring consistency of entities already in the database, you could leverage a nested pull pattern with index-pull to lazily walk through evaluation entities and their ref entities and check each entity against its spec.
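(A sketch of the index-pull idea Joe describes, assuming the client API, an :evaluation/id attribute to walk the AVET index on, a nested :evaluation/author ref, and a hypothetical ::evaluation spec; all attribute names are assumptions for illustration:)

```clojure
(require '[clojure.spec.alpha :as s]
         '[datomic.client.api :as d])

;; Hypothetical spec for evaluation entities.
(s/def ::evaluation (s/keys :req [:evaluation/id]))

(defn invalid-evaluations
  "Lazily walk every evaluation entity via the AVET index and return the
  ones that fail the spec. Memory stays bounded because index-pull is
  lazy and this seq is never held onto as a whole."
  [db]
  (->> (d/index-pull db {:index    :avet
                         :selector '[* {:evaluation/author [*]}]
                         :start    [:evaluation/id]})
       (remove #(s/valid? ::evaluation %))))
```

A test suite could then just assert `(empty? (take 1 (invalid-evaluations db)))`, realizing only as much of the seq as needed to find a counterexample.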


@U09R86PA4 the (filter seq) is a shorter equivalent of (filter #(-> % empty? not)). I just realized I could have written (filter not-empty), but I tend to use #(-> % empty? not) in my projects.


Do you expect more than a thousand user ratings? perhaps a subquery would be better


[:find (avg ?max-rating) .
 :in $ ?survey
 :where
 [?survey :survey/user ?user]
 [(datomic.api/datoms $ :aevt :user/ratings ?user) ?user-ratings]
 [(apply max-key :v ?user-ratings) [_ _ ?max-rating]]]
might work. Or just go for broke and make the whole thing out of d/datoms


or use datomic-analytics for aggregations. Presto’s query planner is just much better than datomic datalog’s at performing aggregations in bounded memory. I’m continually astonished at how fast it is, especially considering it’s using the client api. Unfortunately you lose all history or as-of features

👍 2

Also, sql 😕


I think this would be the equivalent with just datoms. This will not necessarily be the fastest, but it will consume a small and constant amount of memory no matter how big your data gets. You also may not want to hand-write one of these for every query you have in mind…


(let [db db]
  (keep #(transduce
          (map :v)
          max
          Long/MIN_VALUE            ; init value, since (max) has no 0-arity
          (d/datoms db :aevt :user/ratings (:v %)))
        (d/datoms db :aevt :survey/user survey)))


This is the opposite extreme. d/q is realize+parallelize everything, this is lazy-everything and keep no intermediaries.

👌 1

Waa, that is a lot of very interesting information. Thanks a lot. You're right, the datoms API should be sufficient for my needs and match the low memory requirements for very large datasets.


Also > Do you expect more than a thousand user ratings? perhaps a subquery would be better The rating thing was for the example. My project is in the academic world. For the evaluations (of various kinds), I expect to have hundreds of thousands of entities soon. Maybe a million within a year or two. But that is the only such case in the architecture.


I mentioned that because pull expressions have a default limit of 1000 for cardinality-many
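(The 1000-value default can be overridden in the pull pattern itself; a minimal sketch, reusing the :user/ratings attribute from the earlier example:)

```clojure
;; Default: at most 1000 values come back for a cardinality-many attribute.
'[:user/ratings]

;; :limit nil removes the limit entirely.
'[(:user/ratings :limit nil)]

;; Or raise it to a specific bound.
'[(:user/ratings :limit 100000)]
```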

👌 1
Ben Hammond 17:08:57

I have a Datomic Ion that is only ever going to be called via HTTP (it handles an AWS Cognito callback). What are the pros/cons of Lambda vs HTTP Direct for this situation? Does it ever make sense to implement it as a Lambda if it will never be called from any other part of AWS, or should it always be HTTP Direct? I am using Pedestal to serve the HTTP Direct endpoint; I presume there aren't any overhead issues there?

Ben Hammond 17:08:26

is this just an irrelevant issue and I should stop procrastinating?

Ben Hammond 17:08:12

(bike sheds should always be racing green, IMHO)

Daniel Jomphe 21:08:00

Datomic Cloud Backup and Restore

👍 1
Daniel Jomphe 21:08:17

Hi! From what I gather, there is no official recipe. Sources:
1. Cognitect mentions S3 is very durable, but doesn't say whether we can just copy S3 data from one stack to another and expect that to work as a restore (and what about DDB, EFS, valcache?).
2. Cognitect didn't respond.
3. Documentation doesn't mention backup and restore.
4. No answer in the new Ask forum, but I read we can vote on the request and Cognitect will see that as an interesting signal.

Daniel Jomphe 21:08:52

And there is an unofficial recipe coming out of a capable open source developer, @U0CKQ19AQ. Some people shared interesting feedback when Tony announced the lib, but I don't recall it:
1. Announcement
2. Update

Daniel Jomphe 21:08:43

I'd like to ask if some good practices were shared, with details about proven recipes. Our wants, going into production soon, are:
• Recovering from very bad days (e.g. someone deletes the raw data in DDB or S3, or executes d/delete-database).
• Copying data into staging.
• Being able to migrate to another AWS region.
So I'd like to explore official/bare solutions and what Tony did too. I see them as complementary. But I wonder if anyone has found any guidance about a bare solution that would not involve streaming and replaying transactions?


afaict there is no official solution (yet at least) for offsite backups for datomic cloud


@U0514DPR7 I’m actually doing it two ways. The streaming way is the only way to really get a perfect snapshot of Cloud as far as I know. There may be a way to copy the underlying Datomic storage and pay Cognitect to manually stand it up elsewhere, but there is definitely no documented way, as you’ve found. So, streaming is the public-API way. The other thing I’m doing (and you can do) is code that can analyze the dependency graph of a subset of data (e.g. a customer) and emit that as a snapshot of the “db now”. That is way harder, because it is pretty easy for a database to be “too well connected”, where you get a lot more than you wanted, and when you start filtering that out you often get too little. If you want to actually recover a db from a “bad day”, I don’t see an alternative (until Cognitect makes one).

Daniel Jomphe 15:08:37

Thanks to both of you!

Daniel Jomphe 15:08:58

@U0CKQ19AQ then I think I will start experimenting with the tool you published, and develop internal enablers around it for our team.


Yeah, I’m still working on it. The most recent version should work, but I have not had time to get a full successful restore yet…just FYI

Daniel Jomphe 15:08:03

Good to know! 🙂 So would it be better to wait some time before trying it out and starting to provide feedback?


The backup looks fine, though, and it isn’t hard to have a thread running in prod that keeps it up to date. The restore had a bug or two in it, and I think I’ve fixed them all, but I tore down my prod restore system to save $ until I had time to try it again. Running a small target cluster, the restore was going to take 14 days to start (58M transactions). No need to wait, no. I’d love to have help on it, which is why I published it.


Like I said, I think it should actually be working at this point, other than you writing scaffolding around it to deploy it.

Daniel Jomphe 15:08:09

Ok, good - then I should start trying it out. As for the eventual feedback, where would be best for you?


Issues on the GH project is fine. You can also just DM me about it.

👍 1
🙏 1

My restore loop that I’m using (and have tested fairly well) is just:

(defn do-restore-step [dbname redis-conn]
  (let [conn         (connection dbname)
        start-key    (str "next-segment-start" dbname)
        next-start-t (car/wcar redis-conn
                       (car/get start-key))
        start-t      (if (int? next-start-t) next-start-t 0)]
    (log/debug "Restoring" dbname "segment starting at" start-t)
    (let [{latest-start-t :start-t} (dcbp/last-segment-info dbstore dbname)
          next-start (if (or (nil? latest-start-t) (< latest-start-t start-t))
                       (log/debugf "Segment starting at %d is not available yet for %s." start-t dbname)
                       (cloning/restore-segment! dbname conn dbstore idmapper start-t
                                                 {:blacklist #{:invoice/dian-response :invoice/dian-xml}}))]
      (if (= next-start start-t)
        (log/debug "There was nothing new to restore on" dbname)
        (Thread/sleep 5000))
      (log/info "Restored through" dbname next-start)
      (car/wcar redis-conn
        (car/set start-key (car/freeze next-start))))))
where I’m using Redis to track the remappings and where I last tried to restore. If you run a single node in the “restore target” that runs this code, then you should be fine; otherwise you’d have to do some kind of leader-election algorithm to make sure only one restore thread ever runs at once.


I think there are things you could technically do to your database that could cause problems (renaming an ident), so be aware of how it works.

Daniel Jomphe 15:08:18

oh, that's right, thanks for the warning!


Curious, what are your use cases for Datomic, that you put up with so much uncertainty from the company like this (things they never respond to or document)?

👍 1

I would love to hear a statement from the official team. Already asked many times. 😞


@U0514DPR7 Thanks for collecting the list of forum posts, etc. where related topics have come up before. I will visit each one and make sure that they are answered.

🙂 2

@U0CKQ19AQ Thanks for writing it. As a streaming ETL, it will probably be adaptable to many uses other than just DR. After I finish reviewing all the points on this thread, I will get back to you with more help on checking the DR box.


@U2J4FRT2T I am very sorry that we have failed to answer questions that you have asked repeatedly. I will be auditing our process for monitoring the various forums to make sure that doesn't happen again. ... After I answer the questions. 🙂


@U0514DPR7 Having revisited those old threads, I think it would be easier to follow along if I just start fresh, which I have done in Slack. Please let me know if this covers your questions.


@stuarthalloway Yeah, I'm still struggling with that, and have opened an official Datomic Cloud support ticket around backups in general. I can get the backup, but the restore process still doesn't work as written, and I see various little issues with the way I'm doing it (there is no way to easily do two-phase commit so that I'm sure I saved the tempid rewrites after the transaction, because there is no way to back out the transaction; I'd need to pre-allocate the :db/id, but I'm not sure I can do that). Anyway, I usually get about 600k transactions restored and then it crashes for some reason or other (have not diagnosed it yet, definitely a bug in my code), but since it takes a few hours to blow up, it's really painfully slow to test. I mean, I'm testing it in the small and it only manages to hit the problems after several hours. So, an official solution would be awesome.


@U0CKQ19AQ Official solution is definitely where I want to end up. You should not need two phase commit. If you lose your tempid mapping, you should be able to recover it by comparing the matching transactions in the old system and the new.


Or (even more cautious) I have seen some flavors of this that save the source system entity id as an extra attribute/value on every entity in the target system.
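(That bookkeeping can be sketched as a one-attribute schema plus a helper used during replay; the :source/eid name and the tx shape are illustrative assumptions, not from any particular library:)

```clojure
;; Schema: record the entity id each entity had in the source system.
(def source-eid-schema
  [{:db/ident       :source/eid
    :db/valueType   :db.type/long
    :db/cardinality :db.cardinality/one
    :db/unique      :db.unique/identity}])

;; During replay, assert the source eid on every restored entity, so the
;; tempid mapping can always be recovered with a lookup ref, e.g.
;;   [:source/eid 17592186045418]
(defn with-source-eid [tx-map source-eid]
  (assoc tx-map :source/eid source-eid))
```

Because the attribute is :db.unique/identity, re-running a replay segment upserts instead of duplicating entities, which also makes the restore loop idempotent.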

💡 2
Daniel Jomphe 19:09:12

@stuarthalloway Thanks a lot for following up! A follow-up FAQ would be why and what made it sensible or reasonable to not publish official solutions until now. Apart from lack of time or resources, I expect you'll be able to tell us that when you helped some Cloud users, you often took solution X and they found it sufficient for their needs. What was this X? Was X a log replay, or copying with decrypt re-encrypt at the storage level?


Cloud users helped themselves and shared their experiences with us. (Like @U0CKQ19AQ but in most cases less public.)

Daniel Jomphe 19:09:03

I'm asking that because it might help us weigh our options. We're more into trying to make sure we tackle the most potent problem first, before deciding on a specific solution.


X is always some flavor of log replay, because that is what is possible in user space.


And log replay will continue to be valuable even in a world with full-copy backup/restore, because it lets you do things like filter or split databases.


right, would love to see a more full-featured version of my quick hack library 🙂


just don't have time to do it myself

Daniel Jomphe 19:09:11

Incredible, Stu (coming from a fan of Clojure and Datomic since 2009). Even though it's somehow a bit weird to not yet have an official tool, this speaks to the empowerment your core tools give the community, that they could help themselves in such a way.


We are always listening to our users to prioritize what we do next. Cloud backup/restore has been high on the list for a long time but never at the very top.


@U0514DPR7 I don't know your requirements, but across a broad set of possible requirements log replay is the only user space option. Meanwhile I will be working on the options in product space!

Daniel Jomphe 19:09:00

Your recent redesign to make it possible to pick your own size and price of Prod topology was very welcome indeed, making it practical for us to spin all our stacks using the same topology and tooling.

Daniel Jomphe 19:09:49

Thanks for clarifying our user space options, @stuarthalloway, this is definitely helpful. One of our top requirements was: what can we do with the least effort that supports a fully functional restore in a regular deployment (not dev-local).


The biggest problem with the streaming replication is the time it takes to get the initial restore replayed. I estimate our restore can easily "keep up" once it gets through the last 2 years of history, but that is going to take about 2 weeks with a t3 xlarge. That means if my restore ever gets screwed up, I have a business continuity issue that could cost me 2 weeks or more of time. That is pretty unacceptable.


So a fast backup option that could be run regularly (or continuously via s3 replication or something) would really be ideal @stuarthalloway


Ah, I see you responded to my support request. For those reading here: The s3 copy doesn't work "because of how Cloud uses AWS encryption."


We have seen the exact same perf, using an almost identical process to your lib @U0CKQ19AQ. On top of the poor full-restore performance, you also have the added eng & infra expense of needing to run and maintain this other process that continually backs up your Cloud system. This is absolutely worth it for DR purposes, but it's really just a band-aid to work around the lengthy full-restore processes we're able to write in user space.


Well, to be honest I like the idea that we can have up-to-the-minute streaming replication in user space; however, the initial restore time means I have to consider whether I want two restores running in parallel, just in case one gets screwed up, so I don't have a 2-week exposure.


For a lot of businesses I could definitely see the appeal, certainly when a full restore takes that order of magnitude of time. Curious: if a full restore could be completed in 15 minutes, would the streaming replication process still be as valuable? I.e., is streaming replication solving the long-full-restore problem, providing the most up-to-date backup, or something else?


@U0514DPR7 Not to speak for Stu, but every business has limited resources and time to provide features. AWS multi-region/multi-hardware means that the data is pretty darn safe. Yes, you can accidentally delete your data, so from a DR perspective we do have to eventually check that box...but as a business (Datomic, e.g.) I imagine you have to ask yourself, as a company, will people buy my product (D.Cloud) without having feature X or Y? If X is backups and Y is something that gets you more customers, which do you pick to do?


@U083D6HK9 In our case we handle financial data, so not losing anything is important. So being able to continuously back up is the real benefit. Continuous restore just reduces down-time if you have to use your backup...No large (atomic, transactional) database I'm aware of can restore the amount of data we have in 15 mins....unless you do streaming replication (which is what I've done in the past with postgresql)

Daniel Jomphe 19:09:42

@U0CKQ19AQ as for us, we're so small for now; we'll go into production for the first time in a few months, with a relatively low volume of ERP-style transactions across few clients for the first year. And we're all learning clojure and datomic on the job. So for us, log replication is certainly quite fast until the DB grows, and your library as it is will probably work easily, and who knows if we'll even use it to rewrite our entire history with better schemas at some point?


so, I'm about to revamp the way the library deals with tempids...I never liked that part, and Stu gave me a good idea

🙂 2
