datalevin

2024-08-23T14:55:05.292759Z

I’ve hit a weird issue: if I run a query on an empty database before running a transaction, subsequent queries on that attribute return nil. The following should work, but doesn’t.

(comment
  ;; `d` is assumed to alias datalevin.core
  (require '[datalevin.core :as d])

  (def schema
    {:transaction/signature
     {:db/unique      :db.unique/identity
      :db/valueType   :db.type/string
      :db/cardinality :db.cardinality/one}
     :transaction/block-time
     {:db/valueType   :db.type/long
      :db/cardinality :db.cardinality/one}})
  
  (def conn
    (d/get-conn "db1" schema
      {:validate-data?    true
       :closed-schema?    true
       :auto-entity-time? true}))

  (d/q '[:find [?block-time ?signature]
         :where
         [?t :transaction/signature ?signature]
         [?t :transaction/block-time ?block-time]]
    @conn)

  (d/transact! conn [{:transaction/signature  "foo"
                      :transaction/block-time 234324324}])

  (d/q '[:find [(max ?bt)]
         :where
         [?t :transaction/block-time ?bt]]
    @conn)
  ;; => nil (expected [234324324], as in the working example below)
  )
This does work; here the transaction runs before the first query.
(comment
  ;; `d` is assumed to alias datalevin.core
  (require '[datalevin.core :as d])

  (def schema
    {:transaction/signature
     {:db/unique      :db.unique/identity
      :db/valueType   :db.type/string
      :db/cardinality :db.cardinality/one}
     :transaction/block-time
     {:db/valueType   :db.type/long
      :db/cardinality :db.cardinality/one}})
  
  (def conn
    (d/get-conn "db2" schema
      {:validate-data?    true
       :closed-schema?    true
       :auto-entity-time? true}))
  
  (d/transact! conn [{:transaction/signature  "foo"
                      :transaction/block-time 234324324}])

  (d/q '[:find [?block-time ?signature]
         :where
         [?t :transaction/signature ?signature]
         [?t :transaction/block-time ?block-time]]
    @conn)
  ;; => [234324324 "foo"]

  (d/q '[:find [(max ?bt)]
         :where
         [?t :transaction/block-time ?bt]]
    @conn)
  ;; => [234324324]

  )
Struggling to understand what the issue could be. Any pointers?

2024-08-23T15:05:35.837749Z

So it looks like this issue was introduced in 0.9.9. I’ll try to narrow down the commit.

2024-08-23T15:46:55.786509Z

So as far as I can tell it’s this commit that introduces the issue

2024-08-23T15:46:56.274679Z

https://github.com/juji-io/datalevin/commit/720a7d79b8800e58fb6136c76698b102ec77afdc Going to take a bash at fixing this.

2024-08-24T14:03:31.548389Z

So this seems to be caused by this line: https://github.com/juji-io/datalevin/blob/master/src/datalevin/query.clj#L1468

sort-by (fn [[_ attr _]] (db/-count db [nil attr nil]))
My guess is that db/-count is stateful and is now being called in both the reduce and the sort, whereas before it was only being called in the reduce.

Huahai 2024-08-24T14:25:38.775309Z

Looks like a caching problem, will investigate when I have time
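
A minimal sketch of how a stale cached count could produce the symptom reported above. This is an illustration only, not Datalevin's code: `store`, `count-cache`, `cached-count`, `transact!` and `max-value` are hypothetical names, and the "database" is a plain atom of `[e a v]` datoms. Whether this is the actual mechanism in Datalevin is not established in this thread.

```clojure
;; Hypothetical illustration; none of these names exist in Datalevin.
(def store (atom []))         ; vector of [e a v] datoms
(def count-cache (atom {}))   ; attr -> cached datom count

(defn cached-count
  "Counts datoms for attr and remembers the first answer."
  [attr]
  (if-let [[_ n] (find @count-cache attr)]
    n
    (let [n (count (filter #(= attr (second %)) @store))]
      (swap! count-cache assoc attr n)
      n)))

(defn transact!
  "Appends datoms. Deliberately does NOT invalidate count-cache."
  [datoms]
  (swap! store into datoms))

(defn max-value
  "Returns [max] of attr's values, or nil when the cached count
  says the attribute has no datoms."
  [attr]
  (when (pos? (cached-count attr))
    [(apply max (map last (filter #(= attr (second %)) @store)))]))

;; Query first: the count 0 is cached and the result is nil.
(max-value :transaction/block-time)
;; Transact afterwards: the store changes, the cache does not.
(transact! [[1 :transaction/block-time 234324324]])
;; Still nil, because the stale count of 0 short-circuits the query.
(max-value :transaction/block-time)
```

In this sketch, clearing the cache with `(reset! count-cache {})` makes the last query return `[234324324]`, which mirrors why running the transaction before the first query avoids the problem.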

Huahai 2024-08-24T14:34:31.273839Z

Thanks for narrowing it down.

2024-08-24T14:34:37.274079Z

Awesome! I’ve opened a PR that reverts the sort with all the detail (short term fix). Feel free to merge it or close it when you work out the caching issue. I tried to delve into the cache stuff but that is going to take me a fair bit longer to get to grips with. https://github.com/juji-io/datalevin/pull/269 Thanks again for the awesome project. We’ve recently started using it in production in embedded mode and it’s an absolute joy. No connection pool, no SQL query builder, no db management.

Tiago Luchini 2024-08-23T20:58:29.016149Z

We have a topology of a few VMs connected (`d/get-conn...`) to a dtlv server. The version is 0.9.8. Every so often the clients start throwing `The client is disconnected` when querying, pulling, or transacting. The current design keeps the connection from `get-conn` long-lived on the client. Is that a safe approach? Should we reconnect per request, or should 0.9.8 reconnect automatically on demand?

Huahai 2024-08-24T14:22:47.335669Z

Client is supposed to reconnect automatically.

Huahai 2024-08-24T14:23:43.457179Z

I recommend running the JVM version as the server. The native version cannot really handle many concurrent connections well, as the community version of GraalVM has only a serial GC.

Huahai 2024-08-24T14:39:02.478499Z

If you could reproduce this somehow, it would be awesome.

Tiago Luchini 2024-09-04T17:48:23.607839Z

FYI @huahaiy. The db files are corrupt themselves. I am trying to find something creative here 🙂

Huahai 2024-09-04T17:53:46.787159Z

Can you read the db files and export them as a text file?

Tiago Luchini 2024-09-04T17:55:12.840169Z

dtlv dump -d . -a is working

Huahai 2024-09-04T17:55:59.462709Z

I mean, as datalog

Tiago Luchini 2024-09-04T17:56:30.485239Z

No. -g throws the same nippy thaw exception as when I try to open the db

Huahai 2024-09-04T18:04:03.052279Z

If only :datalevin/opts dbi is bad, you can dump the other dbi, and import them into a new dir

Huahai 2024-09-04T18:05:15.059339Z

An empty opts dbi is probably okay

Tiago Luchini 2024-09-04T18:05:48.265669Z

These are the ones I have:

#{"datalevin/meta"
  "datalevin/opts"
  "datalevin/schema"
  "datalevin/giants"
  "datalevin/ave"
  "datalevin/vae"
  "datalevin/eav"}
I would dump all of them but /opts?

Huahai 2024-09-04T18:06:49.400939Z

Yeah

Tiago Luchini 2024-09-04T18:07:58.889589Z

They all throw the same exception

Huahai 2024-09-04T18:08:34.438859Z

You said dump is working

Tiago Luchini 2024-09-04T18:08:53.120899Z

dump -a works no problem... it is dump -g that throws

Huahai 2024-09-04T18:09:18.032229Z

You can dump individual dbi

Tiago Luchini 2024-09-04T18:10:06.626859Z

This is what I am trying. Let me know where I am getting it wrong: dtlv -d . -g dump datalevin/schemas

Huahai 2024-09-04T18:10:07.429349Z

Then load each into the same db file

Huahai 2024-09-04T18:10:15.190219Z

No

Huahai 2024-09-04T18:10:41.020229Z

Dump them as kv

Huahai 2024-09-04T18:11:24.217759Z

-g is for datalog, which is file scoped

Huahai 2024-09-04T18:11:40.786249Z

Kv is dbi scoped

Tiago Luchini 2024-09-04T18:12:15.226909Z

to dump as kv, should I use -a instead then? I am not familiar with kv stores

Huahai 2024-09-04T18:13:23.810839Z

Or dump them with -a, and edit the file, remove the opts content

Tiago Luchini 2024-09-04T18:13:34.128789Z

for instance.... I am uncertain what I would do with the exported file from -a

Huahai 2024-09-04T18:13:40.167909Z

Read the docs

Huahai 2024-09-04T18:13:58.297899Z

Import them

Tiago Luchini 2024-09-04T18:18:40.439829Z

My confusion is that I am not using the kv-store options and just datalog instead. The docs for dtlv only show options for -g (which is datalog and what I am using). There are no references to kv (or why it would be relevant in this particular case). I am happy to improve the docs once I wrap my head around this.

Huahai 2024-09-04T18:20:04.224109Z

Datalog store is on top of kv store

Huahai 2024-09-04T18:21:25.850249Z

Everything is stored in a kv store

Huahai 2024-09-04T18:21:54.737889Z

What’s confusing about it?

Tiago Luchini 2024-09-04T18:22:20.519929Z

Got you. So, for dtlv dump and dtlv load the absence of -g means just kv store, correct?

Huahai 2024-09-04T18:22:31.750499Z

correct

Tiago Luchini 2024-09-04T18:23:51.332369Z

Perfect! Thank you. This is a crucial detail that completely escaped me.

Huahai 2024-09-04T18:24:33.600439Z

Each dbi is a kv map

Huahai 2024-09-04T18:25:26.988309Z

The datalog store is implemented with the 7 dbi you listed above

Huahai 2024-09-04T18:26:17.505669Z

Only opts is using nippy encoding completely

Huahai 2024-09-04T18:26:43.293619Z

The hope is that only that dbi is damaged

Tiago Luchini 2024-09-04T18:27:02.138549Z

That's reasonable and makes total sense. And opts should be ok recycling from an empty db, right? That's your hypothesis

Huahai 2024-09-04T18:27:17.775929Z

Right

Huahai 2024-09-04T18:28:40.835439Z

If others are also damaged, it would not be able to recover

Tiago Luchini 2024-09-04T18:28:47.495019Z

Once I am done with this, I will open a PR with a few reminders for the dtlv docs page for your consideration. Keep up the good work! What you are up to is fantastic!

Huahai 2024-09-04T18:30:32.973199Z

When using -g to dump, only opts, schema, eav and maybe giants dbi are used.

Tiago Luchini 2024-09-04T18:31:27.688339Z

Your hypothesis is sound as it's throwing from load-opts from storage/open

Huahai 2024-09-04T18:32:13.373379Z

But maybe others are also damaged, opts are read first

Huahai 2024-09-04T18:32:59.677089Z

Anyway, this is something I will need to look into and test for

Tiago Luchini 2024-09-04T18:33:48.465699Z

I will gladly file a report once I find a way to consistently reproduce it

Huahai 2024-09-04T18:35:18.353569Z

Thanks

Tiago Luchini 2024-09-04T18:35:28.697329Z

Thank YOU!

Tiago Luchini 2024-09-04T20:17:20.792949Z

You were totally right @huahaiy. Dumping took a long time (~20 min for 64MB worth of data). Loading back onto a new db seemed to be very slow - left it running for 2h and gave up. Ultimately I ran the following on a copy of the db: dtlv -d . exec '(let [db (open-kv ".")] (drop-dbi db "datalevin/opts") (list-dbis db))' and when I reconnected, opts was recreated successfully.
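
The same workaround can be run from a Clojure REPL instead of `dtlv exec`. This is a sketch under assumptions: `datalevin.core` is on the classpath, `dir` points at a copy of the damaged database, and `drop-opts-dbi!` is a hypothetical helper name. `open-kv`, `drop-dbi` and `list-dbis` are the calls from the one-liner above; `close-kv` is added here so the environment is released afterwards.

```clojure
(require '[datalevin.core :as d])

(defn drop-opts-dbi!
  "Opens the database at dir as a plain KV store, drops the
  datalevin/opts dbi, and returns the names of the remaining dbis.
  Run this on a copy of the database, not the original."
  [dir]
  (let [kv (d/open-kv dir)]
    (try
      (d/drop-dbi kv "datalevin/opts")
      ;; realize the list before the store is closed
      (vec (d/list-dbis kv))
      (finally
        (d/close-kv kv)))))
```

As described in the message above, the opts dbi was recreated when the database was next opened as a Datalog store.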

Tiago Luchini 2024-09-04T20:18:57.796689Z

A slightly uncommon thing I have on this app is a :db.type/bytes node that receives a lot of reads with concurrent writes. Is that also possibly an area of concern?

Tiago Luchini 2024-09-03T19:22:54.289249Z

It seems to be when the server throws this:

[info] Exception in thread "main" clojure.lang.ExceptionInfo: Error creating server:"Fail to get-range: #error {\n :cause \"No reader provided for custom type id: -63\"\n :data {:type-id -63, :prefixed? nil}\n :via\n [{:type clojure.lang.ExceptionInfo\n :message \"Thaw failed against type-id: -63\"\n :data {:type-id -63}\n :at [taoensso.nippy$thaw_from_in_BANG_ invokeStatic \"nippy.clj\" 1722]}\n {:type clojure.lang.ExceptionInfo\n :message \"No reader provided for custom type id: -63\"\n :data {:type-id -63, :prefixed? nil}\n :at [taoensso.nippy$read_custom_BANG_ invokeStatic \"nippy.clj\" 1390]}]\n :trace\n [[taoensso.nippy$read_custom_BANG_ invokeStatic \"nippy.clj\" 1390]\n [taoensso.nippy$thaw_from_in_BANG_ invokeStatic \"nippy.clj\" 1722]\n [taoensso.nippy$fast_thaw invokeStatic \"nippy.clj\" 1781]\n [datalevin.bits$deserialize invokeStatic \"bits.clj\" 98]\n [datalevin.bits$get_data invokeStatic \"bits.clj\" 179]\n [datalevin.bits$read_buffer invokeStatic \"bits.clj\" 994]\n [datalevin.scan$get_range invokeStatic \"scan.clj\" 79]\n [datalevin.binding.java.LMDB get_range \"java.clj\" 545]\n [datalevin.binding.java.LMDB get_range \"java.clj\" 821]\n [datalevin.storage$load_opts invokeStatic \"storage.clj\" 1188]\n [datalevin.storage$open invokeStatic \"storage.clj\" 1265]\n [datalevin.storage$open invoke \"storage.clj\" 1255]\n [datalevin.storage$open invokeStatic \"storage.clj\" 1262]\n [datalevin.storage$open invoke \"storage.clj\" 1255]\n [datalevin.storage$open invokeStatic \"storage.clj\" 1260]\n [datalevin.server$open_store invokeStatic \"server.clj\" 913]\n [datalevin.server$reopen_dbs invokeStatic \"server.clj\" 1001]\n [datalevin.server$create invokeStatic \"server.clj\" 2388]\n [datalevin.main$_main invokeStatic \"main.clj\" 527]\n [datalevin.main$_main doInvoke \"main.clj\" 513]\n [clojure.lang.RestFn applyTo \"RestFn.java\" 137]\n [datalevin.main main nil -1]]}" {}

Tiago Luchini 2024-09-03T22:27:05.482339Z

Actually, this looks like our DB got corrupt. Any thoughts why @huahaiy?

Huahai 2024-09-03T22:30:27.415029Z

Don’t think so. You won’t be able to start if a db is corrupt

Tiago Luchini 2024-09-03T22:31:00.877449Z

Yes. That's what it looks like happened

Tiago Luchini 2024-09-03T22:31:16.111239Z

Any idea why it would get corrupt? Or how to fix it? 🙂

Huahai 2024-09-03T22:34:51.899059Z

Prior to 0.9, a db may be corrupt if the program is killed. Shouldn’t happen after 0.9.0. So I don’t think it is the issue

Tiago Luchini 2024-09-03T22:35:31.384589Z

I am running 0.9.8

Tiago Luchini 2024-09-03T22:35:54.518439Z

I saw your thread about kv-opts prior to 0.9

Huahai 2024-09-03T22:37:59.821199Z

You have not described what you are doing.

Tiago Luchini 2024-09-03T22:42:13.332039Z

This is a java server running datalevin in server mode with a few clients connected to it. Clients run a pretty standard web app that writes (transacts) and reads (queries and pulls) from this server.

Tiago Luchini 2024-09-03T22:43:03.541939Z

Every few days this server throws this error. Last few times we restarted with a fresh DB.

Huahai 2024-09-03T22:54:45.663329Z

This error happens when reopening the DB, specifically when reading the options

Huahai 2024-09-03T22:56:05.155069Z

I will need to know the history of these dbs, are they upgraded from older versions, etc

Tiago Luchini 2024-09-03T22:58:56.871149Z

They are fresh DBs. Never upgraded. We have a sibling environment where this has never happened. The difference between these two environments is the number of concurrent clients. The environment where it never crashed has just one client for one server (both running Java on different VMs). The environment that crashes every few days has one server and three clients (all on separate VMs). In both, the clients connect to the server over the wire.

Huahai 2024-09-03T23:02:26.292529Z

Ok, sounds like a concurrency problem, it seems that we need to lock the system db when reconnecting

Tiago Luchini 2024-09-03T23:03:42.973169Z

Clients do fail and get restarted. They are behind an auto-scaler as well so I did have peaks of 13 client vms.

Tiago Luchini 2024-09-03T23:03:47.021579Z

If it's relevant

Huahai 2024-09-03T23:04:13.929169Z

If you could file an issue, it would be great

Tiago Luchini 2024-09-03T23:04:26.110089Z

In terms of restoring this db, what are my options here? I can't connect to it but can dtlv copy

Huahai 2024-09-03T23:10:12.936429Z

Sounds like only the server system db is having problems

Huahai 2024-09-03T23:12:32.997619Z

You can create a new server instance with a new root directory, recreate dbs, shut down server, and copy the old db files over

Huahai 2024-09-03T23:13:12.170779Z

System db only needs the db names

Tiago Luchini 2024-09-03T23:13:35.304699Z

Awesome! I will try. Thank you very much 🙇

Huahai 2024-09-03T23:17:29.324479Z

Also, using the latest version helps, as it does improve concurrent reading during transaction