Hi all,
How can I get the max entity id from the db?
I'm using fill-db, so need the max-id to resolve temp-ids.
These cause out of mem errors with around 46 mil entities:
(d/datom-e (first (d/rseek-datoms db :eav))) ;; I didn't know rseek is eager
(d/q '[:find (max ?e)
:where [?e]] db)
and it seems (:max-eid db) isn't fully consistentSearching the codebase, it seems this might be the right approach:
(datalevin.storage/init-max-eid (:store db))
We can add a max-eid function to core
How can I negate a full-text search? I want the texts where the query tokens do NOT appear.
Full-text search boolean operators is on the roadmap. For now, you will have to filter them out manually.
It would be great if someone can pickup some of the tickets π
ok, I can add boolean expression in the next release
Sounds good!
yeah, maybe itβs about time I also make some commits.
youβre a bit of a one-man army π
phrase search is also available in the master branch
The first cut of search boolean expression is available in the master branch.
Can queries be compiled (aot planning) then reused over and over?
Despite using the same query (but slightly different inputs), I'm getting really long planning times, as shown by d/explain.
planning times is about 2 - 18s, and execution time is about 10 - 300ms.
datom count: 192,188,444 entity count: 39,939,786 query (all join keys are indexed):
[:find
?sel-id ?mname ?mval
:in $ ?mk-id ?time ?mname
:where
[?mkc :mkc/market-id ?mk-id] ;; indexed attribute
[?mt :mt/parent ?mkc] ;; indexed attribute. mt = metric
[?mt :mt/object-id ?sel-id]
[?mt :mt/name ?mname]
[?mcell :mt.cell/parent ?mt] ;; indexed attribute
[?mcell :mt.cell/value ?mval]
[?mcell :mt.cell/valid-from ?mcell-vf] ;; indexed attribute
[?mcell :mt.cell/valid-to ?mcell-vt] ;; indexed attribute
[(<= ?mcell-vf ?time)]
[(> ?mcell-vt ?time)]
:order-by [?mval :asc]]which version of Datalevin is this?
{:mvn/version "0.9.13"}
Can you upgrade to the latest? Planning time that long usually means the background sampler and counters are not working properly, which is the case in the earlier versions.
alright, would do so immediately
Planning time is now about 300-500ms; only once did it reach 3s. I still do wonder if it's possible to cache planning. Ik the parsed query is cached, but not sure about the planning
We do have a plan cache, but the keys are probably not doing what we want, the keys needs to be the unparsed query?
For the parsed one has references to db, which changes after a transaction
okk, I do now see that the plan is cached in q-result.
I'm thinking of a my own helper macro/function that allows to hint at cache keys.
Do you mind telling me how important the inputs are to the query planning?
so I know which ones are best to cache
Depending on what the input is used for.
For example, if an input is used as a bound value. Different values can have dramatic impact on the plan, e.g. value 1 has 1 match in db, whereas value 2 has 100 millions matches. The plans should be very different.
That makes sense. Though, I didn't think the planner scans db to know the resulting complexity of the inputs
It does. How else does the planner know how to plan
π true. I was imagining a "dumb/generic" planner
One thing we can do, is to be smart about the plan keys. If the input is not too different from the cached ones, we can still use the plan
Thanks vm. I'm not familiar with db internals (dlvn is the first db i'm able to explore some of its internals), so it's great to know this. Loving the journey so far
Let me file an issue about smarter plan caching.
knowing about the planner, the current approach (depending of db) makes sense. It's just that my usecase performs lots of random jumps
Some of my queries worked great initially, but with about 120 mil entities, there has been a massive slowdown. I'm beginning to question whether I'm making good use of the indexes.
1. does having a :db/index true attribute property do anything at all? I've been using it (came across it in a datascript tutorial), but I dont see it anywhere in datalevin doc.
2. I have an entity with :cell/valid-from and :cell/valid-to attributes. How can I efficiently find the value(s) of an entity at a specific time? either with a range function or d/q
I currently use:
[?mcell :mt.cell/valid-from ?mcell-vf]
[?mcell :mt.cell/valid-to ?mcell-vt]
[(<= ?mcell-vf ?time)]
[(> ?mcell-vt ?time)]
but it seems this is doing a linear scan.
I've attempted to use index-range with :cell/valid-from, but it can't work, as there's no option to start from the nearest datom before startπ After looking into this further, it's not possible to utilize a typical index for time-range/interval scan. At least in the case of overlapping intervals (range scan would work otherwise). I'd need to look into R trees or Interval B trees.| For now, I'd focus on the plan cache optimization to get up and running
1. we don't use :db/index
2. (or (< ?mcell-vf ?time ?mcell-vt) (= ?mcell-vf ?Tim))
probably better
depends on your query, some queries would be slow no matter what you do (e.g. selectivity is low). You can look at the explanation to see if there's some massively misjudgement on the part of the planner. Compared with Postgres, our planner doesn't normally make big mistakes. So I guess the query could be just slow (e.g. there's no selective clauses).
I would be interested to see examples of massive discrepancies between estimation and actual sizes.
Since you upgraded to this version, and after that you haven't done too many transactions, so it is possible that the old estimations of sizes are still in there and they are off base. One fix is to re-index, if you can take the db offline, so counts are freshly collected.
If there's no new transactions. the resampling would not be triggered.
B-tree based DB will get slower as data volume increases , that's inevitable. I plan to mitigate this by sharding the data into multiple LMDB files. https://github.com/juji-io/datalevin/issues/303