[P.S.: none of the problems I describe below are a big deal; feel free to ignore unless you find any interesting.] I have this query
[:find ?t ?u
:where
[?log :time ?t]
[?log :username ?u]
:order-by [?t :desc]
:limit 50]
which gives correct results but with very erratic performance on a local DB with some 6GB of data (only one "entity type" with smallish values in all attributes; I can provide the schema if it matters). The first time I run this it took some 50 seconds (although similar queries with other attributes typically take 5-15sec). The second time it took less than one millisecond. I shut down my REPL, created a new one, and the first time it run in :limit and interrupted the evaluation in Cider before attempting this one. time reported the time as spent in the new query, but maybe something in Datalevin was busy from the previous one?]
I plan to use this setup for interactive log exploration, so for any given query the first query time is the one that matters to me. I suppose what's taking long the first time is query planning. In this case I wish I could just say Datalevin "please just fetch the 50 entities with highest time; trust me that all entities have all the attributes I'm querying".now there's an analyze function in master branch.
That query doesn’t engage query planner for it only has a single entity. However, it does require counting the datoms to know which attributes to use. The counting is intended to be in the background when the db runs. That is to say, Datalevin is designed for the job of Postgres and SQLite, not the job of DuckDB or Clickhouse
We compete with the former, not the later.
Let me explain what you observed: the first time you run, you have not give enough time for the DB to run to collect samples and counts, so it has to do them during the query. The second time you run, the results come straight from the cache. The third time, you restarted the JVM, the counts and samples are already stored in DB file. So they are used directly. There’s nothing mysterious going on.
:limit is after full query, so it doesn’t do what you think it does
If you plan to interactively explore log files, use duckdb or clickhouse, Datalevin is not designed to do that, at least not at the current stage. It can probably do some of those things after version 1.0.0
But in any case, Datalevin is intended to be an operational DB, I.e up and running all the time, not an one off thing
Maybe I can provide an analyze function to help with this use case
The 50 second case is likely due to counting the :time attribute, the current counting method is not optimal for it. It probably has a huge cardinality, so the our current method of summarizing the individual value counts is slower than counting directly.
Order attributes by cardinality probably helps too
If you mean :db/cardinality, the cardinality of all attributes is unspecified in the schema and 1 in practice, which I assumed was the same as specifying :db.cardinality/one. Would the latter help the planner in any way?
While I understand I'm using datalevin "off-label" here, for my exploration it still wins over DuckDB et al because of datalog: total "think query, write query, get results" time is still lower for me than with those other DBs.
Thanks for your insights into what's going on behind the scenes! Since I'm exploring with a single "entity type", possibly bigger logs than in this experiment, and order-by/limit use cases will happen often, I guess I'll make helpers to address those using datom scanning functions.
I did so and added a "warm-up-conn" helper that runs some typical queries in the background (i.e., a future) and ignores the results. I run this after DB load and before thinking and writing my own explorations. This is more than fast enough for me now, thanks again!
No, cardinality means the number of unique values of an attribute
I guess that's what the datom scanning functions are for. I just wish I can avoid those when quickly exploring something in the REPL. Maybe this problem is just hard in general and that's why other datalog DBs don't implement order-by / limit? (BTW the documentation for q mentions :order-by but not :limit).
Thx
Right. You should be using those. e.g. seek-datoms with n, will be much faster than queries that return the whole db
Order by and limit are intended for pagination, they are after the query, they donot limit the query in any way
This variant
(dtlv/q '[:find ?t (pull ?log ?ks)
:in $ ?ks
:where [?log :time ?t]
:order-by [?t :desc]
:limit 50]
(dtlv/db conn)
(keys schema))
Takes some 10-15sec in my machine the first time and again on the order of 1ms subsequently. But if I (pull ?log [:*]) instead, it takes 40-75sec the first time, even though it returns the exact same data.
To be clear, 10-15sec is workable for my purposes, so all in all this is not a big problem once I've learned to avoid [:*] and how.
Finally, I noticed that while I can pass the pull pattern as an input to the query (i.e., ?ks), the :limit needs to be provided as a constant, i.e., this doesn't work
(dtlv/q '[:find ?t (pull ?log ?ks)
:in $ ?ks ?n
:where [?log :time ?t]
:order-by [?t :desc]
:limit ?n]
(dtlv/db conn)
(keys schema)
50)
and thus if I want to try a different limit I need a new query plan. Is this because of a fundamental problem with parameterizing :limit?Parameterized :limit was not in my mind, but we can add that feature
These queries are returning everything in you db
:limit and :order-by are not part of the query semantics, so not part of the plan