datalevin

euccastro 2025-02-15T23:03:39.558219Z

[P.S.: none of the problems I describe below are a big deal; feel free to ignore unless you find any interesting.] I have this query

[:find ?t ?u
 :where
 [?log :time ?t]
 [?log :username ?u]
 :order-by [?t :desc]
 :limit 50]
which gives correct results but with very erratic performance on a local DB with some 6GB of data (only one "entity type" with smallish values in all attributes; I can provide the schema if it matters). The first time I run this it took some 50 seconds (although similar queries with other attributes typically take 5-15sec). The second time it took less than one millisecond. I shut down my REPL, created a new one, and the first time it run in 2sec, then subsequent times again under 1ms. I found this puzzling; Datalevin does not cache query plans in the LMDB stores, right? Maybe it's that my disk cache was warmer from the previous queries, or something? [P.S.: later I remembered that in the case where things took 50sec I had run a query before forgetting the :limit and interrupted the evaluation in Cider before attempting this one. time reported the time as spent in the new query, but maybe something in Datalevin was busy from the previous one?] I plan to use this setup for interactive log exploration, so for any given query the first query time is the one that matters to me. I suppose what's taking long the first time is query planning. In this case I wish I could just say Datalevin "please just fetch the 50 entities with highest time; trust me that all entities have all the attributes I'm querying".

Huahai 2025-02-17T18:58:30.683619Z

now there's an analyze function in master branch.

đź‘€ 1
Huahai 2025-02-16T01:21:04.976239Z

That query doesn’t engage query planner for it only has a single entity. However, it does require counting the datoms to know which attributes to use. The counting is intended to be in the background when the db runs. That is to say, Datalevin is designed for the job of Postgres and SQLite, not the job of DuckDB or Clickhouse

Huahai 2025-02-16T01:21:49.026259Z

We compete with the former, not the later.

Huahai 2025-02-16T01:27:21.247229Z

Let me explain what you observed: the first time you run, you have not give enough time for the DB to run to collect samples and counts, so it has to do them during the query. The second time you run, the results come straight from the cache. The third time, you restarted the JVM, the counts and samples are already stored in DB file. So they are used directly. There’s nothing mysterious going on.

🙏 1
Huahai 2025-02-16T01:34:01.350579Z

:limit is after full query, so it doesn’t do what you think it does

Huahai 2025-02-16T01:39:24.161569Z

If you plan to interactively explore log files, use duckdb or clickhouse, Datalevin is not designed to do that, at least not at the current stage. It can probably do some of those things after version 1.0.0

Huahai 2025-02-16T01:44:38.611229Z

But in any case, Datalevin is intended to be an operational DB, I.e up and running all the time, not an one off thing

Huahai 2025-02-16T03:50:43.907189Z

Maybe I can provide an analyze function to help with this use case

Huahai 2025-02-16T03:59:48.738199Z

The 50 second case is likely due to counting the :time attribute, the current counting method is not optimal for it. It probably has a huge cardinality, so the our current method of summarizing the individual value counts is slower than counting directly.

Huahai 2025-02-16T04:18:18.285189Z

Order attributes by cardinality probably helps too

euccastro 2025-02-16T12:27:05.455519Z

If you mean :db/cardinality, the cardinality of all attributes is unspecified in the schema and 1 in practice, which I assumed was the same as specifying :db.cardinality/one. Would the latter help the planner in any way? While I understand I'm using datalevin "off-label" here, for my exploration it still wins over DuckDB et al because of datalog: total "think query, write query, get results" time is still lower for me than with those other DBs. Thanks for your insights into what's going on behind the scenes! Since I'm exploring with a single "entity type", possibly bigger logs than in this experiment, and order-by/limit use cases will happen often, I guess I'll make helpers to address those using datom scanning functions.

euccastro 2025-02-16T13:45:55.438819Z

I did so and added a "warm-up-conn" helper that runs some typical queries in the background (i.e., a future) and ignores the results. I run this after DB load and before thinking and writing my own explorations. This is more than fast enough for me now, thanks again!

Huahai 2025-02-16T15:18:46.834419Z

No, cardinality means the number of unique values of an attribute

👍 1
euccastro 2025-02-15T23:08:47.249439Z

I guess that's what the datom scanning functions are for. I just wish I can avoid those when quickly exploring something in the REPL. Maybe this problem is just hard in general and that's why other datalog DBs don't implement order-by / limit? (BTW the documentation for q mentions :order-by but not :limit).

Huahai 2025-02-16T02:13:10.358139Z

Thx

Huahai 2025-02-16T04:33:06.241749Z

Right. You should be using those. e.g. seek-datoms with n, will be much faster than queries that return the whole db

Huahai 2025-02-16T04:34:28.956289Z

Order by and limit are intended for pagination, they are after the query, they donot limit the query in any way

euccastro 2025-02-15T23:29:44.262819Z

This variant

(dtlv/q '[:find ?t (pull ?log ?ks)
          :in $ ?ks
          :where [?log :time ?t]
          :order-by [?t :desc]
          :limit 50]
          (dtlv/db conn)
          (keys schema))
Takes some 10-15sec in my machine the first time and again on the order of 1ms subsequently. But if I (pull ?log [:*]) instead, it takes 40-75sec the first time, even though it returns the exact same data. To be clear, 10-15sec is workable for my purposes, so all in all this is not a big problem once I've learned to avoid [:*] and how. Finally, I noticed that while I can pass the pull pattern as an input to the query (i.e., ?ks), the :limit needs to be provided as a constant, i.e., this doesn't work
(dtlv/q '[:find ?t (pull ?log ?ks)
          :in $ ?ks ?n
          :where [?log :time ?t]
          :order-by [?t :desc]
          :limit ?n]
          (dtlv/db conn)
          (keys schema)
          50)
and thus if I want to try a different limit I need a new query plan. Is this because of a fundamental problem with parameterizing :limit?

Huahai 2025-02-16T01:51:46.020439Z

Parameterized :limit was not in my mind, but we can add that feature

Huahai 2025-02-16T04:27:07.295689Z

These queries are returning everything in you db

Huahai 2025-02-16T04:43:58.819339Z

:limit and :order-by are not part of the query semantics, so not part of the plan

👍 1