datahike

filipkon 2024-06-14T16:29:21.359479Z

Hello! I've been using the latest version of datahike with the file backend and I stumbled upon a case where I get an error when inserting data into the DB after the connection has been released and re-established at least once. In the following snippet, I managed to reproduce this while trying to add some data to a schemaless DB.

(require '[datahike.api :as d])

(def products (mapv (fn [i] {:product/id       i
                             :product/revision {:name  (str "foo" i)
                                                :price (rand-int 10)}})
                    (range 1000)))

(def db-config {:schema-flexibility :read
                :store              {:backend :file
                                     :path    "datahike"}})
(d/create-database db-config)

(def conn (d/connect db-config))

(dotimes [_ 100]
  (d/transact conn products))

(d/release conn)

(let [conn (d/connect db-config)]
  (d/transact conn products)) ; adding after release fails

While the connection has not been released, transactions work without issues. But as soon as the connection is released and re-established, the same transaction fails with the following error:
Caused by java.lang.ClassCastException
   class clojure.lang.PersistentArrayMap cannot be cast to class
   java.lang.Comparable (clojure.lang.PersistentArrayMap is in unnamed module of
   loader 'app'; java.lang.Comparable is in module java.base of loader
   'bootstrap')

                 Util.java:  153  clojure.lang.Util/compare
                datom.cljc:  302  datahike.datom$cmp_temporal_datoms_aevt_quick/invokeStatic
                datom.cljc:  298  datahike.datom$cmp_temporal_datoms_aevt_quick/invoke
            AFunction.java:   53  clojure.lang.AFunction/compare
               Branch.java:  190  me.tonsky.persistent_sorted_set.Branch/add
  PersistentSortedSet.java:  249  me.tonsky.persistent_sorted_set.PersistentSortedSet/cons
 persistent_sorted_set.clj:   19  me.tonsky.persistent-sorted-set/conj
 persistent_sorted_set.clj:   16  me.tonsky.persistent-sorted-set/conj
       persistent_set.cljc:  162  datahike.index.persistent_set$temporal_upsert/invokeStatic
       persistent_set.cljc:  151  datahike.index.persistent_set$temporal_upsert/invoke
       persistent_set.cljc:  194  datahike.index.persistent_set$eval88483$fn__88490/invoke
            interface.cljc:    4  datahike.index.interface$eval85915$fn__86004$G__85888__86017/invoke
          transaction.cljc:  259  datahike.db.transaction$with_datom_upsert$fn__93276/invoke
                  AFn.java:  154  clojure.lang.AFn/applyToHelper
                  ...
With a small amount of data this does not happen immediately, and transactions work fine with short-lived connections, but the same error occurs once a certain amount of data has been inserted this way. Has anyone run into this before?

whilo 2024-06-14T18:25:28.640779Z

Hey @filipconstantinos! I think this might not be related to just the release call, but to the fact that you transact nested maps and I suspect that the nested maps for product/revision are inserted into the index and the index comparator stumbles over it once there is data in the db already. Does the problem persist when you remove the release call?

whilo 2024-06-14T18:27:36.108109Z

I think in general even for schemaless you want to have a schema entry for product/revision here that declares it to be a ref.

whilo 2024-06-16T19:45:42.730539Z

I can reproduce the problem and as I remembered it can be solved by adding the schema as follows

(def products (mapv (fn [i] {:product/id       i
                             :product/revision {:name  (str "foo" i)
                                                :price (rand-int 10)}})
                    (range 1000)))

(def db-config {:schema-flexibility :read
                :store              {:backend :file
                                     :path    "/tmp/datahike"}})

(d/create-database db-config)

(def conn (d/connect db-config))

(d/transact conn [{:db/ident :product/revision
                   :db/valueType :db.type/ref}])

(dotimes [_ 100]
  (d/transact conn products))

(d/release conn)

(let [conn (d/connect db-config)]
  (d/transact conn products)) ; adding after release now works

(d/delete-database db-config)

whilo 2024-06-16T19:46:01.261829Z

Adding after release works then.

whilo 2024-06-16T19:48:05.877579Z

transact uses the schema to decide what to do with the map and it won't destructure it into datoms if it is not a ref. I think we should maybe throw an error if it is not a ref or alternatively support map values. The problem is that it is not so easy to define a meaningful order on map values, but maybe a trivial one would do. I think storing nested blobs is a bit unfortunate though since we want to have data represented in triple format.
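Since only attributes declared as refs get destructured, one way to avoid missing a nested attribute is to derive the ref schema entries from a sample of the data. The following is a minimal sketch in plain Clojure; `map-valued-attrs` and `ref-schema` are hypothetical helpers, not part of Datahike, and the `:revision/meta` attribute is made up for illustration. It only looks at values that are maps and ignores vectors of maps.

```clojure
(defn map-valued-attrs
  "Returns the set of attribute keywords whose values are maps,
  searching nested maps as well. Hypothetical helper."
  [entity]
  (reduce-kv (fn [acc k v]
               (if (map? v)
                 ;; remember this attribute, then look inside the nested map
                 (into (conj acc k) (map-valued-attrs v))
                 acc))
             #{}
             entity))

(defn ref-schema
  "Builds one schema entry per map-valued attribute found in `entities`,
  in the same shape as the schema transaction shown above."
  [entities]
  (mapv (fn [attr] {:db/ident attr :db/valueType :db.type/ref})
        (sort (into #{} (mapcat map-valued-attrs) entities))))

;; Example input with a second level of nesting.
(def sample
  [{:product/id       1
    :product/revision {:name          "foo1"
                       :price         3
                       :revision/meta {:source "import"}}}])

(ref-schema sample)
```

The resulting vector could be passed to `d/transact` before the data itself, so that every map-valued attribute is declared as a ref up front.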

whilo 2024-06-17T19:41:39.048889Z

As long as the index is in memory the comparator works, which is confusing, so it is worth opening an issue for this.

filipkon 2024-06-15T07:56:11.372569Z

@whilo Thank you for your answer.

> this might not be related to just the release call, but to the fact that you transact nested maps and I suspect that the nested maps for product/revision are inserted into the index and the index comparator stumbles over it once there is data in the db already.

Indeed, this error only occurs when the data under product/revision are nested maps. It's curious however that for a smaller number of product revisions (e.g., 5-10), nested maps can be inserted without issues. Only after surpassing a number of products (a few hundreds in my case) the error occurs.

> Does the problem persist when you remove the release call?

As shown in the snippet, transacting the same data many times before the connection is released has no issues. The error appears with transactions after releasing the connection once.

> I think in general even for schemaless you want to have a schema entry for product/revision here that declares it to be a ref.

In our codebase we use schema entries for product/id and product/revision(ref) but that did not make any difference regarding the issue. I decided to only showcase a minimal reproducible example here.

Another observation is that this error appears only when using the file backend. Similar tests using the mem backend did not have those issues. I'll try to do the same with the jdbc backend too.

whilo 2024-06-15T21:18:37.477689Z

Yes, I think the problem happens because releasing the connection for the file backend drops the caches while for the memory backend everything stays in memory. As long as you compare values in memory it works, but as soon as you transact after releasing it has problems with accessing the stored index.

whilo 2024-06-15T21:19:54.448249Z

Maps are not a valid value for our comparator, they need to be broken down into their own entities, but somehow they slip through here and land in the index as maps.
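The `ClassCastException` in the stack trace can be reproduced without Datahike, because Clojure's `compare` only orders values that implement `java.lang.Comparable`, and persistent maps do not. A minimal sketch, where `comparable?` is a hypothetical helper written for this illustration:

```clojure
;; Plain Clojure, no Datahike required.
(defn comparable?
  "True when `compare` can order a and b,
  false when it throws a ClassCastException."
  [a b]
  (try
    (compare a b)
    true
    (catch ClassCastException _ false)))

;; Numbers and strings have a natural order.
(comparable? 1 2)
(comparable? "foo1" "foo2")

;; Maps do not implement Comparable, so a comparator built on
;; `compare` cannot place them in a sorted index.
(comparable? {:name "foo1"} {:name "foo2"})
```

This only illustrates the comparison failure itself; when exactly the index comparator ends up comparing two stored values is specific to Datahike and the backend, as discussed above.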

whilo 2024-06-15T21:22:17.878919Z

That is why I thought declaring product/revision as a ref in the schema should fix the issue, but I have to look into it more.

filipkon 2024-06-20T09:28:37.053349Z

@whilo Thanks for taking time to check this out. I can confirm that defining the :product/revision as ref in the schema solves the issue in the above example. I guess our original problem, even after defining the schema, was because the product/revision contained more nested maps itself.

> I think the problem happens because releasing the connection for the file backend drops the caches while for the memory backend everything stays in memory. As long as you compare values in memory it works, but as soon as you transact after releasing it has problems with accessing the stored index.

> I think storing nested blobs is a bit unfortunate though since we want to have data represented in triple format.

That starts to make sense to me. I was under the wrong impression that nested blobs could be stored in a schemaless db as "simple unstructured data" - without being represented as separate entities. I suspected that it might be a bug because this did not happen with the mem backend, but I reproduced the issue with the jdbc backend too. I guess the best workaround would be to store a serialized version of the unstructured data in the db.
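The serialization workaround mentioned here could look roughly like the sketch below, which turns the nested map into an EDN string before it is transacted and parses it again after reading. `serialize-revision` and `deserialize-revision` are hypothetical helper names, and the sketch assumes the nested data contains only values that round-trip through EDN. The stored value is then an ordinary string, so it is comparable, but its contents are no longer queryable as datoms.

```clojure
(require '[clojure.edn :as edn])

(defn serialize-revision
  "Replaces the nested map under :product/revision with its EDN string."
  [product]
  (update product :product/revision pr-str))

(defn deserialize-revision
  "Parses the EDN string under :product/revision back into a map."
  [product]
  (update product :product/revision edn/read-string))

(def product {:product/id       1
              :product/revision {:name "foo1" :price 3}})

;; The serialized form is what would be handed to d/transact.
(def stored (serialize-revision product))

;; Reading it back restores the original nested map.
(deserialize-revision stored)
```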

whilo 2024-06-20T10:10:17.058229Z

Thanks for confirming! Storing blobs is one option, but it might be nicer to do what I sketched in the main channel because it will store all nested structures in a way that is queryable. I am not sure whether the translator makes sense as I posted it.

👍 1
grounded_sage 2024-06-14T04:58:49.800689Z

What are the use cases for Datahike S3 compared to using, say, Postgres? Specifically I am interested in storing large amounts of relational text across many chats from different people with LLMs. I intend to use Weaviate for the indexing and search. This data is likely to be retrieved regularly to derive new insights as new data comes in. Some of the text will basically be like daily journaling.

whilo 2024-06-14T07:15:50.639909Z

You can put larger messages in Datahike and just have to adjust your caches and expectations. The argument earlier about imbalance happens if their sizes vary a lot and you expect access times for small objects to hold in general.

whilo 2024-06-14T07:16:44.736259Z

The main reason to put them in Datahike is to be able to use index lookups on their content, e.g. having sorted phone book style names.

whilo 2024-06-14T07:16:59.604409Z

We do not have a good fulltext story yet.

grounded_sage 2024-06-14T08:02:20.955339Z

Okay. Thank you for the discussion. I have already been thinking about building some more db admin tooling and actually monitoring all of the queries and their performance and recording it. So I think I will just start with everything in one db and then address concerns as they arise.

grounded_sage 2024-06-14T08:04:41.370369Z

Anything you would be curious about tracking let me know and I will see about adding this so we have some actual data to work and reason about.

👍 1
grounded_sage 2024-06-14T05:03:53.043059Z

To add more context. Think daily journaling and building relations to multiple spiritual texts. Not just as an individual but also for groups of people to find patterns and meaning.

whilo 2024-06-14T05:43:05.949099Z

The benefit is that you do not need to run additional infrastructure and can scale out readers with S3, which is fairly cheap.

whilo 2024-06-14T05:44:18.857519Z

Postgres has lower latency and allows you to run your backend independently of AWS, but there are also alternative open source implementations of S3.

grounded_sage 2024-06-14T05:49:26.120499Z

Yes, there are a lot of S3 compatible storage options around now so it may be nice to have that as an option - not suggesting that be implemented. Could you expand a little bit on what you mean by scale out readers?

grounded_sage 2024-06-14T05:51:28.750249Z

From what I understand this would be when you have a LOT of data. But I feel I may also be missing some finer details.

whilo 2024-06-14T05:59:43.050059Z

S3 can be just convenient. Creating a bucket and running your Datahike instances against it allows you to distribute without much thinking, while you can use its access control to make sure that the right people can access (e.g. read) the database. The only thing you need to make sure of is that you use a single writer (server) that you dispatch the transactions to.

whilo 2024-06-14T06:00:40.961339Z

The downside is somewhat high latency depending on where you are located compared to your AWS region.

whilo 2024-06-14T06:04:18.253189Z

When you can do everything locally, just use the file backend though. You can still migrate later.

whilo 2024-06-14T06:04:39.339849Z

The most important thing about Datahike is to make you not worry about accessing your data.

✨ 1
grounded_sage 2024-06-14T06:13:54.003109Z

I am using an electric app with the deployment to Fly that comes with the starter repo. Supabase offers a managed Postgres service with them. Fly is also working on an S3 API compatible storage option. The other option is to migrate to using AWS and co-locate with the S3 bucket. Which could also potentially be https://aws.amazon.com/s3/storage-classes/express-one-zone/ if I need even more performance with the data access. Both of these seem like viable options and I am unsure what/if any the finer trade-offs are from a technical point of view. Excluding vendor considerations and other features.

grounded_sage 2024-06-14T06:19:44.280869Z

One consideration is the single writer you mentioned. When doing deploys while users are active, handling this adds an extra layer. I was considering the ops complexity of keeping around older servers until the client disconnects. But this would potentially violate the single transactor requirement. Though I also don’t fully know the subtleties of having multiple connections to the same db either. https://clojurians.slack.com/archives/C7Q9GSHFV/p1716731253884639

grounded_sage 2024-06-14T06:21:51.885869Z

From what I understand the length of a string can impact the queries. Is there a max suggested length for strings or size for a datom?

whilo 2024-06-14T06:26:02.069279Z

We do not enforce that yet, but I think we should do so and allow you to set the value in the config.

whilo 2024-06-14T06:26:38.542709Z

Yes, handing over between multiple writers can be tricky; basically you have to ensure the old writer is down first.

whilo 2024-06-14T06:27:05.478359Z

Datomic outsourced that to ZooKeeper, there might be better newer options.

grounded_sage 2024-06-14T06:31:26.796869Z

When you say that it does not enforce that, would that mean that I could transact a very large string and Datahike would break it up to optimise for keeping the tree balanced?

whilo 2024-06-14T06:41:53.796729Z

That would mean that one of the tree nodes would be massive, which will implicitly violate the balance.

whilo 2024-06-14T06:42:32.143599Z

The right way to do it for really large things (larger than a few kilobytes) is to put them in blobs and transact the external blob id.
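As a rough illustration of that pattern, the sketch below writes the text to an external store and keeps only an id in the entity that would be transacted. It uses a local directory as a stand-in for real blob storage and a SHA-256 content hash as the id; the helper names and the `:message/author` and `:message/blob-id` attributes are assumptions made for this example, not Datahike conventions.

```clojure
(require '[clojure.java.io :as io])
(import 'java.security.MessageDigest)

(defn sha-256-hex
  "Hex-encoded SHA-256 of a string, used here as a content-addressed blob id."
  [^String s]
  (let [digest (.digest (MessageDigest/getInstance "SHA-256")
                        (.getBytes s "UTF-8"))]
    (apply str (map #(format "%02x" %) digest))))

(defn store-blob!
  "Writes `text` to a file named after its hash inside `dir`, returns the id."
  [dir text]
  (let [id (sha-256-hex text)
        f  (io/file dir id)]
    (io/make-parents f)
    (spit f text)
    id))

(defn load-blob
  "Reads the blob with the given id back from `dir`."
  [dir id]
  (slurp (io/file dir id)))

(defn message-entity
  "Builds the small entity map to transact: metadata plus the blob id only."
  [dir author text]
  {:message/author  author
   :message/blob-id (store-blob! dir text)})
```

The database then holds a short fixed-length string per message, and the text is fetched with `load-blob` only when it is actually needed.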

grounded_sage 2024-06-14T06:55:43.153879Z

Okay. Yea this is what I thought. I was just considering ways I could simplify. But it seems the best way would be to store all messages in blob storage and then query them through Weaviate. Some messages may be small but it probably doesn’t make much sense to keep some messages in Datahike and some in blob storage when they are of the same type.