2025-02-03 datalevin | Clojure Slack Archive

datalevin 2025-02-03

Jeremy 2025-02-03T07:49:36.022859Z

Hi all, How can I get the max entity id from the db? I'm using fill-db, so need the max-id to resolve temp-ids. These cause out of mem errors with around 46 mil entities:

(d/datom-e (first (d/rseek-datoms db :eav))) ;; I didn't know rseek is eager
(d/q '[:find (max ?e)
         :where [?e]] db)

and it seems (:max-eid db) isn't fully consistent

✅ 1

Jeremy 2025-02-03T12:28:51.786869Z

Searching the codebase, it seems this might be the right approach: (datalevin.storage/init-max-eid (:store db))

✅ 1

Huahai 2025-02-03T22:03:41.312679Z

We can add a max-eid function to core

simongray 2025-02-03T08:52:49.530699Z

How can I negate a full-text search? I want the texts where the query tokens do NOT appear.

Huahai 2025-02-03T15:38:52.308709Z

Full-text search boolean operators is on the roadmap. For now, you will have to filter them out manually.

Huahai 2025-02-03T16:35:12.808629Z

It would be great if someone can pickup some of the tickets 🙂

👍 1

Huahai 2025-02-03T22:26:58.137149Z

ok, I can add boolean expression in the next release

🙏 1

simongray 2025-02-04T07:30:12.016779Z

Sounds good!

simongray 2025-02-04T07:31:06.720619Z

yeah, maybe it’s about time I also make some commits.

simongray 2025-02-04T07:31:28.709379Z

you’re a bit of a one-man army 😅

Huahai 2025-02-09T04:22:54.290679Z

phrase search is also available in the master branch

🎉 1

Huahai 2025-02-06T15:50:18.301369Z

The first cut of search boolean expression is available in the master branch.

👀 1

Jeremy 2025-02-03T14:39:45.661729Z

Can queries be compiled (aot planning) then reused over and over? Despite using the same query (but slightly different inputs), I'm getting really long planning times, as shown by d/explain. planning times is about 2 - 18s, and execution time is about 10 - 300ms.

Jeremy 2025-02-03T14:45:11.539249Z

datom count: 192,188,444 entity count: 39,939,786 query (all join keys are indexed):

[:find
 ?sel-id ?mname ?mval
 :in $ ?mk-id ?time ?mname
 :where
 [?mkc :mkc/market-id ?mk-id] ;; indexed attribute
 [?mt :mt/parent ?mkc] ;; indexed attribute. mt = metric
 [?mt :mt/object-id ?sel-id]
 [?mt :mt/name ?mname]
 
 [?mcell :mt.cell/parent ?mt] ;; indexed attribute
 [?mcell :mt.cell/value ?mval]
 [?mcell :mt.cell/valid-from ?mcell-vf] ;; indexed attribute
 [?mcell :mt.cell/valid-to ?mcell-vt] ;; indexed attribute
 [(<= ?mcell-vf ?time)]
 [(> ?mcell-vt ?time)]
 :order-by [?mval :asc]]

Huahai 2025-02-03T15:33:00.565249Z

which version of Datalevin is this?

Jeremy 2025-02-03T15:37:55.845439Z

{:mvn/version "0.9.13"}

Huahai 2025-02-03T15:40:27.327809Z

Can you upgrade to the latest? Planning time that long usually means the background sampler and counters are not working properly, which is the case in the earlier versions.

Jeremy 2025-02-03T15:41:07.052059Z

alright, would do so immediately

Jeremy 2025-02-03T16:31:48.847139Z

Planning time is now about 300-500ms; only once did it reach 3s. I still do wonder if it's possible to cache planning. Ik the parsed query is cached, but not sure about the planning

Huahai 2025-02-03T16:39:41.785709Z

We do have a plan cache, but the keys are probably not doing what we want, the keys needs to be the unparsed query?

Huahai 2025-02-03T16:40:52.327769Z

For the parsed one has references to db, which changes after a transaction

Jeremy 2025-02-03T16:51:46.308509Z

okk, I do now see that the plan is cached in q-result. I'm thinking of a my own helper macro/function that allows to hint at cache keys. Do you mind telling me how important the inputs are to the query planning? so I know which ones are best to cache

Huahai 2025-02-03T16:56:41.689259Z

Depending on what the input is used for.

Huahai 2025-02-03T17:05:45.030799Z

For example, if an input is used as a bound value. Different values can have dramatic impact on the plan, e.g. value 1 has 1 match in db, whereas value 2 has 100 millions matches. The plans should be very different.

👍 1

Jeremy 2025-02-03T17:11:57.215769Z

That makes sense. Though, I didn't think the planner scans db to know the resulting complexity of the inputs

Huahai 2025-02-03T17:13:06.565979Z

It does. How else does the planner know how to plan

Jeremy 2025-02-03T17:14:27.885789Z

😅 true. I was imagining a "dumb/generic" planner

Huahai 2025-02-03T17:14:46.029729Z

One thing we can do, is to be smart about the plan keys. If the input is not too different from the cached ones, we can still use the plan

Jeremy 2025-02-03T17:15:30.274899Z

Thanks vm. I'm not familiar with db internals (dlvn is the first db i'm able to explore some of its internals), so it's great to know this. Loving the journey so far

Huahai 2025-02-03T17:16:15.416249Z

Let me file an issue about smarter plan caching.

👍 1

Jeremy 2025-02-03T17:18:31.277709Z

knowing about the planner, the current approach (depending of db) makes sense. It's just that my usecase performs lots of random jumps

Huahai 2025-02-03T17:23:15.782299Z

https://github.com/juji-io/datalevin/issues/309

Jeremy 2025-02-03T23:25:58.160839Z

Some of my queries worked great initially, but with about 120 mil entities, there has been a massive slowdown. I'm beginning to question whether I'm making good use of the indexes. 1. does having a :db/index true attribute property do anything at all? I've been using it (came across it in a datascript tutorial), but I dont see it anywhere in datalevin doc. 2. I have an entity with :cell/valid-from and :cell/valid-to attributes. How can I efficiently find the value(s) of an entity at a specific time? either with a range function or d/q I currently use:

[?mcell :mt.cell/valid-from ?mcell-vf]
[?mcell :mt.cell/valid-to ?mcell-vt]
[(<= ?mcell-vf ?time)]
[(> ?mcell-vt ?time)]

but it seems this is doing a linear scan. I've attempted to use index-range with :cell/valid-from, but it can't work, as there's no option to start from the nearest datom before start

Jeremy 2025-02-04T19:38:14.056719Z

😞 After looking into this further, it's not possible to utilize a typical index for time-range/interval scan. At least in the case of overlapping intervals (range scan would work otherwise). I'd need to look into R trees or Interval B trees.| For now, I'd focus on the plan cache optimization to get up and running

Huahai 2025-02-04T01:03:00.516239Z

1. we don't use :db/index

Huahai 2025-02-04T01:09:24.237099Z

2. (or (< ?mcell-vf ?time ?mcell-vt) (= ?mcell-vf ?Tim))

👍 1

Huahai 2025-02-04T01:09:35.182509Z

probably better

Huahai 2025-02-04T01:14:24.681279Z

depends on your query, some queries would be slow no matter what you do (e.g. selectivity is low). You can look at the explanation to see if there's some massively misjudgement on the part of the planner. Compared with Postgres, our planner doesn't normally make big mistakes. So I guess the query could be just slow (e.g. there's no selective clauses).

Huahai 2025-02-04T01:15:41.156079Z

I would be interested to see examples of massive discrepancies between estimation and actual sizes.

Huahai 2025-02-04T01:18:26.204579Z

Since you upgraded to this version, and after that you haven't done too many transactions, so it is possible that the old estimations of sizes are still in there and they are off base. One fix is to re-index, if you can take the db offline, so counts are freshly collected.

Huahai 2025-02-04T01:19:24.878189Z

If there's no new transactions. the resampling would not be triggered.

Huahai 2025-02-04T06:13:18.689019Z

B-tree based DB will get slower as data volume increases , that's inevitable. I plan to mitigate this by sharding the data into multiple LMDB files. https://github.com/juji-io/datalevin/issues/303

Clojurians Log v2

datalevin 2025-02-03