Hello! I've been using the latest version of datahike with the file backend and I stumbled upon a case where I get an error when inserting data into the DB after the connection has been released and re-established at least once. In the following snippet, I managed to reproduce this while trying to add some data to a schemaless DB.
(require '[datahike.api :as d])
(def products (mapv (fn [i] {:product/id i
                             :product/revision {:name (str "foo" i)
                                                :price (rand-int 10)}})
                    (range 1000)))
(def db-config {:schema-flexibility :read
                :store {:backend :file
                        :path "datahike"}})
(d/create-database db-config)
(def conn (d/connect db-config))
(dotimes [_ 100]
  (d/transact conn products))
(d/release conn)
(let [conn (d/connect db-config)]
  (d/transact conn products)) ; adding after release fails
While the connection has not been released, transactions work without issues. But as soon as the connection is released and re-established, the same transaction fails with the following error:
Caused by java.lang.ClassCastException
class clojure.lang.PersistentArrayMap cannot be cast to class
java.lang.Comparable (clojure.lang.PersistentArrayMap is in unnamed module of
loader 'app'; java.lang.Comparable is in module java.base of loader
'bootstrap')
Util.java: 153 clojure.lang.Util/compare
datom.cljc: 302 datahike.datom$cmp_temporal_datoms_aevt_quick/invokeStatic
datom.cljc: 298 datahike.datom$cmp_temporal_datoms_aevt_quick/invoke
AFunction.java: 53 clojure.lang.AFunction/compare
Branch.java: 190 me.tonsky.persistent_sorted_set.Branch/add
PersistentSortedSet.java: 249 me.tonsky.persistent_sorted_set.PersistentSortedSet/cons
persistent_sorted_set.clj: 19 me.tonsky.persistent-sorted-set/conj
persistent_sorted_set.clj: 16 me.tonsky.persistent-sorted-set/conj
persistent_set.cljc: 162 datahike.index.persistent_set$temporal_upsert/invokeStatic
persistent_set.cljc: 151 datahike.index.persistent_set$temporal_upsert/invoke
persistent_set.cljc: 194 datahike.index.persistent_set$eval88483$fn__88490/invoke
interface.cljc: 4 datahike.index.interface$eval85915$fn__86004$G__85888__86017/invoke
transaction.cljc: 259 datahike.db.transaction$with_datom_upsert$fn__93276/invoke
AFn.java: 154 clojure.lang.AFn/applyToHelper
...
Now, for a small amount of data this does not happen immediately and transactions work fine with short-lived connections, but the same error occurs once a certain amount of data has been inserted this way. Has anyone met this case before?
Hey @filipconstantinos! I think this might not be related to just the release call, but to the fact that you transact nested maps. I suspect that the nested maps for product/revision are inserted into the index and the index comparator stumbles over them once there is data in the db already. Does the problem persist when you remove the release call?
I think in general even for schemaless you want to have a schema entry for product/revision here that declares it to be a ref.
I can reproduce the problem and, as I remembered, it can be solved by adding the schema as follows:
(def products (mapv (fn [i] {:product/id i
                             :product/revision {:name (str "foo" i)
                                                :price (rand-int 10)}})
                    (range 1000)))
(def db-config {:schema-flexibility :read
                :store {:backend :file
                        :path "/tmp/datahike"}})
(d/create-database db-config)
(def conn (d/connect db-config))
(d/transact conn [{:db/ident :product/revision
                   :db/valueType :db.type/ref}])
(dotimes [_ 100]
  (d/transact conn products))
(d/release conn)
(let [conn (d/connect db-config)]
  (d/transact conn products)) ; adding after release now works
(d/delete-database db-config)
Adding after release works then.
transact uses the schema to decide what to do with the map and it won't destructure it into datoms if it is not a ref. I think we should maybe throw an error if it is not a ref or alternatively support map values. The problem is that it is not so easy to define a meaningful order on map values, but maybe a trivial one would do. I think storing nested blobs is a bit unfortunate though since we want to have data represented in triple format.
As long as the index is in memory the comparator works, which is confusing. So it is worth opening an issue for this.
@whilo Thank you for your answer.
> this might not be related to just the release call, but to the fact that you transact nested maps and I suspect that the nested maps for product/revision are inserted into the index and the index comparator stumbles over it once there is data in the db already.
Indeed, this error only occurs when the data under product/revision are nested maps. It's curious, however, that for a smaller number of product revisions (e.g., 5-10), nested maps can be inserted without issues. Only after surpassing a certain number of products (a few hundred in my case) does the error occur.
> Does the problem persist when you remove the release call?
As shown in the snippet, transacting the same data many times before the connection is released has no issues. The error appears with transactions after releasing the connection once.
> I think in general even for schemaless you want to have a schema entry for product/revision here that declares it to be a ref.
In our codebase we use schema entries for product/id and product/revision(ref) but that did not make any difference regarding the issue. I decided to only showcase a minimal reproducible example here.
Another observation is that this error appears only when using the file backend. Similar tests using the mem backend did not have those issues. I'll try to do the same with the jdbc backend too.
Yes, I think the problem happens because releasing the connection for the file backend drops the caches while for the memory backend everything stays in memory. As long as you compare values in memory it works, but as soon as you transact after releasing it has problems with accessing the stored index.
Maps are not a valid value for our comparator, they need to be broken down into their own entities, but somehow they slip through here and land in the index as maps.
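To make the comparator point concrete, here is a minimal sketch in plain Clojure (no Datahike involved). It only shows that `clojure.core/compare` rejects maps with the same ClassCastException as in the stack trace above; `comparable?` and `try-compare` are hypothetical helper names for illustration, not part of Datahike.

```clojure
;; Plain Clojure, no Datahike required.
;; `comparable?` and `try-compare` are hypothetical helpers for illustration.

(defn comparable?
  "True if `v` implements java.lang.Comparable."
  [v]
  (instance? Comparable v))

(defn try-compare
  "Runs clojure.core/compare and returns :class-cast-exception
  instead of throwing when a value is not Comparable."
  [a b]
  (try
    (compare a b)
    (catch ClassCastException _
      :class-cast-exception)))

(comparable? "foo0")          ; strings implement Comparable
(comparable? {:name "foo0"})  ; persistent maps do not

(try-compare "foo0" "foo1")                  ; compares fine
(try-compare {:name "foo0"} {:name "foo1"})  ; hits the ClassCastException path
```

Any sorted index that relies on `compare` for its values will run into this as soon as it actually has to compare two map values.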
That is why I thought declaring product/revision as a ref in the schema should fix the issue, but I have to look into it more.
@whilo Thanks for taking time to check this out. I can confirm that defining the :product/revision as ref in the schema solves the issue in the above example. I guess our original problem, even after defining the schema, was because the product/revision contained more nested maps itself.
> I think the problem happens because releasing the connection for the file backend drops the caches while for the memory backend everything stays in memory. As long as you compare values in memory it works, but as soon as you transact after releasing it has problems with accessing the stored index.
> I think storing nested blobs is a bit unfortunate though since we want to have data represented in triple format.
That starts to make sense to me. I was under the wrong impression that nested blobs could be stored in a schemaless db as "simple unstructured data" - without being represented as separate entities. I suspected that it might be a bug because this did not happen with the mem backend, but I reproduced the issue with the jdbc backend too. I guess the best workaround would be to store a serialized version of the unstructured data in the db.
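A minimal sketch of that serialization workaround, assuming EDN strings via `pr-str` and `clojure.edn/read-string` are acceptable. `serialize-revision` and `deserialize-revision` are hypothetical helper names, and the commented-out `d/transact` call assumes the `d` alias and `conn` from the snippets above.

```clojure
(require '[clojure.edn :as edn])

;; Hypothetical helpers: store the nested map as an EDN string
;; so the value that ends up in the index is a plain string.

(defn serialize-revision
  "Replaces the nested map under :product/revision with its EDN string."
  [product]
  (update product :product/revision pr-str))

(defn deserialize-revision
  "Parses the EDN string under :product/revision back into a map."
  [product]
  (update product :product/revision edn/read-string))

(def product {:product/id 1
              :product/revision {:name "foo1" :price 7}})

(def stored (serialize-revision product))

;; (d/transact conn [stored]) ; assumes `d` and `conn` from the snippets above

;; Round trip: reading the stored string back gives the original map.
(= product (deserialize-revision stored))
```

The trade-off is that the contents of the serialized string are opaque to queries; only the string as a whole is a value in the db.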
Thanks for confirming! Storing blobs is one option, it might be nicer to do what I sketched in the main channel because it will store all nested structures in a way that is queryable. I am not sure whether the translator makes sense as I posted.
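This is not the translator sketched in the main channel, just a small illustration of the ref approach that fixed the example above, extended to deeper nesting. It assumes the guess is right that every map-valued attribute needs its own ref declaration; `map-valued-attrs` and `ref-schema` are hypothetical helpers, and `:revision/author` is a made-up attribute.

```clojure
;; Hypothetical helpers, plain Clojure. They only look at map values,
;; not at vectors of maps (cardinality-many refs are left out here).

(defn map-valued-attrs
  "Returns the set of keys, at any nesting depth of `m`, whose values are maps."
  [m]
  (reduce-kv (fn [acc k v]
               (if (map? v)
                 (into (conj acc k) (map-valued-attrs v))
                 acc))
             #{}
             m))

(defn ref-schema
  "Builds one schema entry per attribute, in the same shape as the
  :product/revision entry transacted in the example above."
  [attrs]
  (mapv (fn [a] {:db/ident a :db/valueType :db.type/ref})
        (sort attrs)))

;; :revision/author is a made-up attribute to show a second nesting level.
(def product {:product/id 1
              :product/revision {:name "foo1"
                                 :price 7
                                 :revision/author {:author/name "bar"}}})

(def schema (ref-schema (map-valued-attrs product)))

;; (d/transact conn schema) ; assumes `d` and `conn` from the snippets above
```

Running this over a sample of the real data before the first transaction would at least reveal which attributes carry nested maps.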
What are the use cases for Datahike S3 compared to using, say, Postgres? Specifically I am interested in storing large amounts of relational text across many chats from different people with LLMs. I intend to use Weaviate for the indexing and search. This data is likely to be retrieved regularly to derive new insights as new data comes in. Some of the text will basically be like daily journaling.
You can put larger messages in Datahike and just have to adjust your caches and expectations. The argument earlier about imbalance happens if their sizes vary a lot and you expect access times for small objects to hold in general.
The main reason to put them in Datahike is to be able to use index lookups on their content, e.g. having sorted phone book style names.
We do not have a good fulltext story yet.
Okay. Thank you for the discussion. I have already been thinking about building some more db admin tooling and actually monitoring all of the queries and their performance and recording it. So I think I will just start with everything in one db and then address concerns as they arise.
Anything you would be curious about tracking let me know and I will see about adding this so we have some actual data to work and reason about.
To add more context. Think daily journaling and building relations to multiple spiritual texts. Not just as an individual but also for groups of people to find patterns and meaning.
The benefit is that you do not need to run additional infrastructure, and you can scale out readers with S3, which is fairly cheap.
Postgres has lower latency and allows you to run your backend independently of AWS, but there are also alternative open source implementations of S3.
Yes, there are a lot of S3 compatible storage options around now so it may be nice to have that as an option - not suggesting that be implemented. Could you expand a little bit on what you mean by scale out readers?
From what I understand this would be when you have a LOT of data. But I feel I may also be missing some finer details.
S3 can be just convenient. Creating a bucket and running your Datahike instances against it allows you to distribute without much thinking, while you can use its access control to make sure that the right people can access (e.g. read) the database. The only thing you need to make sure of is that you use a single writer (server) that you dispatch the transactions to.
The downside is somewhat high latency depending on where you are located compared to your AWS region.
When you can do everything locally, just use the file backend though. You can still migrate later.
The most important thing about Datahike is to make you not worry about accessing your data.
I am using an electric app with the deployment to Fly that comes with the starter repo. Supabase offers a managed Postgres service with them. Fly is also working on an S3 API compatible storage option. The other option is to migrate to using AWS and co-locate with the S3 bucket. Which could also potentially be https://aws.amazon.com/s3/storage-classes/express-one-zone/ if I need even more performance with the data access. Both of these seem like viable options and I am unsure what/if any the finer trade-offs are from a technical point of view. Excluding vendor considerations and other features.
One consideration is the single writer you mentioned. When doing deploys while users are active, handling this adds an extra layer. I was considering the ops complexity of keeping around older servers until the client disconnects. But this would potentially violate the single transactor requirement. Though I also don’t fully know the subtleties of having multiple connections to the same db either. https://clojurians.slack.com/archives/C7Q9GSHFV/p1716731253884639
From what I understand the length of a string can impact the queries. Is there a max suggested length for strings or size for a datom?
We do not enforce that yet, but I think we should do so and allow you to set the value in the config.
Yes, handing over between multiple writers can be tricky, basically you have to ensure the old writer is down first.
Datomic outsourced that to ZooKeeper, there might be better newer options.
When you say that it does not enforce that, would that mean that I could transact a very large string and Datahike would break it up to optimise for keeping the tree balanced?
That would mean that one of the tree nodes would be massive, which will implicitly violate the balance.
The right way to do it for really large things (larger than a few kilobytes) is to put them in blobs and transact the external blob id.
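A minimal sketch of the blob-id idea, assuming a local directory as the blob store and a SHA-256 content hash as the id; a real setup might use S3 or another object store instead. `sha-256-hex`, `put-blob!`, `get-blob`, the directory name, and the `:message/...` attributes are all hypothetical.

```clojure
(require '[clojure.java.io :as io])
(import '(java.security MessageDigest))

;; Hypothetical helpers: a directory on disk stands in for the blob store.

(defn sha-256-hex
  "Hex-encoded SHA-256 digest of the UTF-8 bytes of `s`."
  [^String s]
  (let [digest (.digest (MessageDigest/getInstance "SHA-256")
                        (.getBytes s "UTF-8"))]
    (apply str (map #(format "%02x" %) digest))))

(defn put-blob!
  "Writes `content` to <dir>/<sha-256 of content> and returns the hash as blob id."
  [dir ^String content]
  (let [id (sha-256-hex content)
        f  (io/file dir id)]
    (io/make-parents f)
    (spit f content)
    id))

(defn get-blob
  "Reads the blob with the given id back from `dir`."
  [dir id]
  (slurp (io/file dir id)))

;; Hypothetical location for the blobs.
(def blob-dir (str (System/getProperty "java.io.tmpdir") "/datahike-blobs"))

(def blob-id (put-blob! blob-dir "A long journal entry that should not live in the index."))

;; Only the small, fixed-size id goes into the db.
;; (d/transact conn [{:message/id 1 :message/blob-id blob-id}]) ; hypothetical attributes

(get-blob blob-dir blob-id)
```

Because the id is derived from the content, writing the same text twice ends up in the same file, and every id stored in the db has the same length regardless of how large the text is.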
Okay. Yea this is what I thought. I was just considering ways I could simplify. But it seems the best way would be to store all messages in blob storage and then query them through weaviate. Some messages may be small, but it probably doesn’t make much sense to keep some messages in Datahike and some in blob storage when they are of the same type.