#datalevin
2022-04-20
Eugen08:04:28

hi, is there a swap! or swap-vals! function for datalevin in kv mode? Can we / does it make sense to add one?

Huahai15:04:13

I am not sure the semantics of swap! would be a good fit here. Maybe a map-specific API would be more appropriate?

Eugen17:04:58

A map interface would be very nice.

Eugen17:04:13

I would like to update a value safely and concurrently

Eugen17:04:42

if I read then update, there is a chance another thread might come in and change the data
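(A minimal sketch of the read-then-write race described above, using the existing KV API in datalevin.core; the dbi name, key, and directory are illustrative only.)

(require '[datalevin.core :as d])

(def db (d/open-kv "/tmp/cas-demo"))
(d/open-dbi db "counters")

;; Read-modify-write with no atomicity guarantee: between get-value and
;; transact-kv another thread may write a newer value, which this call
;; would silently overwrite -- exactly the race a swap!-like CAS would close.
(defn unsafe-inc! [db k]
  (let [v (or (d/get-value db "counters" k) 0)]
    (d/transact-kv db [[:put "counters" k (inc v)]])))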

Huahai17:04:02

sure. maybe file an issue

Eugen18:04:56

sure. will do that.

Eugen18:04:11

I also filed one for the iterator API

Eugen19:04:46

should I make a PR to expose and document the scan/visitor API?

Huahai19:04:18

sure, thanks

Eugen06:04:59

This is my attempt at an issue for CAS https://github.com/juji-io/datalevin/issues/110 . I hope it makes some sense

Eugen08:04:14

Also, an opinion: I find it confusing to keep both the KV API and the Datalog API in the same ns. Can I work with both on the same Datalevin instance?

Huahai15:04:52

Yes, you can work with both APIs in the same dir. The default number of DBIs (sub-DBs, or maps) supported is 128 (we can make it configurable in the future). The Datalog DB takes up 10 of them if full-text search is enabled, so the rest are for your own KV DBs. The intention is for users to use both APIs in the same DB environment, hence they are all in datalevin.core
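(A minimal sketch of what mixing the two APIs might look like, based on the reply above; the dir, schema, and dbi name are placeholders, and whether open-kv on the same dir shares the environment with the Datalog connection is an assumption, not confirmed here.)

(require '[datalevin.core :as d])

(def dir "/tmp/mixed-demo")

;; Datalog API
(def conn (d/get-conn dir {:user/name {:db/unique :db.unique/identity}}))
(d/transact! conn [{:user/name "alice"}])

;; KV API against the same dir, using a separate DBI for application data
;; (assumption: the underlying environment is shared/cached per dir)
(def kv (d/open-kv dir))
(d/open-dbi kv "app-cache")
(d/transact-kv kv [[:put "app-cache" :last-sync 1650000000]])
(d/get-value kv "app-cache" :last-sync)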

Eugen17:04:12

thanks, that is good to know. should be added to docs.

Huahai17:04:47

docs will be added when we are ready to bump to 1.0.

vlad_poh13:04:43

Hi, for this problem http://www.learndatalogtoday.org/chapter/8 why is this not a valid rule:

[[(sequels ?m1 ?m2)
  [?m1 :movie/sequel ?m2]]
 [(sequels ?m1 ?m3)
  (sequels ?m1 ?m2)
  (sequels ?m2 ?m3)]]
but this is a valid rule:
[[(sequels ?m1 ?m2)
  [?m1 :movie/sequel ?m2]]
 [(sequels ?m1 ?m2)
  [?m :movie/sequel ?m2]
  (sequels ?m1 ?m)]]
Both work on the site, which I assume runs Datomic?
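(For context, a minimal sketch of how such a rule set is invoked, assuming the learndatalogtoday data where movies have :movie/title and :movie/sequel; db is assumed to already hold that data set.)

(require '[datalevin.core :as d])

(def rules
  '[[(sequels ?m1 ?m2)
     [?m1 :movie/sequel ?m2]]
    [(sequels ?m1 ?m2)
     [?m :movie/sequel ?m2]
     (sequels ?m1 ?m)]])

;; the rule set is passed in via the % input
(d/q '[:find [?title ...]
       :in $ % ?name
       :where
       [?m :movie/title ?name]
       (sequels ?m ?s)
       [?s :movie/title ?title]]
     db rules "Mad Max")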

Huahai16:04:09

Does this work in Datascript? If so, it’s easy to port it over. If not, then there will be more work. Please file an issue if possible.

vlad_poh20:04:17

Nope, it does not work in Datascript either. It just hangs (infinite loop?). Will file an issue. Thanks!

James Reber15:04:44

I have a question, too. I got this error while trying to load a 600 MB CSV into Datalevin:

{:type datalevin.ni.Lib$LMDBException
   :message MDB_MAP_RESIZED: Database contents grew beyond environment mapsize
   :at [datalevin.ni.Lib checkRc Lib.java 630]}
I received that error in Babashka; I tried reading the database from Babashka because my REPL had been trying to ingest that file for ten hours and I wanted to see what the database had in its 21 GB. After that error, I now get this error:
{:cause Fail to get-value: "MDB_CORRUPTED: Located page was wrong type"
 :data {:dbi datalevin/meta, :k :last-modified, :k-type :attr, :v-type :long}
 ...}
I’ve only ever used toy datasets with Datalevin. Can anyone point me to what I’m doing wrong here? Is a 600 MB file too much for Datalevin/LMDB?

Huahai16:04:07

You will have to be more specific about what you did, so we can potentially tell you the possible cause of your problem. For example, maybe show some code of your “loading a 600 MB CSV”? My guess is that you are not doing transactions in large batches. You should. The best would be to create datoms yourself and use init-db to load them directly, avoiding transactions.

James Reber16:04:59

This is the code that reads the CSV and loads it into Datalevin. I am chunking 1k rows at a time. Each row has… ~200 columns. This worked when I ingested only ten rows. Actually, I was wrong: it is a 1.2 GB file.

Huahai16:04:21

i am assuming each is a map?

James Reber16:04:34

Yes, each row is a map.

Huahai16:04:37

there’s no need to do the chunking there

James Reber16:04:40

To manually create datoms, would I use datalevin.datom/datom? I can think of how to do that naively.

James Reber16:04:15

No need to chunk? d/transact! can handle an arbitrarily large lazy seq?

Huahai16:04:41

i don’t see what good that partition is doing there.

James Reber16:04:43

I just wasn’t sure how transact! would react to being given millions of maps. Do you think removing that chunking would solve this? Or should I still look at init-db and manually creating datoms?

Huahai16:04:54

internally, datalevin partitions datoms into batches of 100k each

Huahai16:04:25

removing the chunking will improve the performance greatly, that may be enough

Huahai16:04:33

if not, you can try init-db

Huahai16:04:23

you are producing lazy seqs, the size doesn’t really matter
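(A minimal sketch of the no-chunking approach suggested above: hand the whole lazy seq of row maps to a single transact! call and let Datalevin do its own internal batching. The file path, header parsing, and clojure.data.csv dependency are assumptions about the original loading code.)

(require '[clojure.data.csv :as csv]
         '[clojure.java.io :as io]
         '[datalevin.core :as d])

(def conn (d/get-conn "/tmp/csv-demo"))

(with-open [rdr (io/reader "rows.csv")]
  (let [[header & rows] (csv/read-csv rdr)
        row->map        (fn [row] (zipmap (map keyword header) row))]
    ;; one transact! over the whole lazy seq, no partition-all
    (d/transact! conn (map row->map rows))))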

James Reber16:04:20

That’s great. The rest of Datalevin seems so well-built that I should have trusted that transact! could handle large inputs. Thank you for your help!

Huahai16:04:34

i am even considering removing the internal batching

Huahai16:04:56

LMDB is great at handling large data sets in big batches, but not so great at doing tons of small writes while doing tons of reads at the same time (which is what a Datalog transaction does), because it does MVCC but maintains only 2 copies of the DB, so a lot of dirty pages have to be kept around.

James Reber20:04:07

@huahaiy I have received two OOM exceptions now — first trying to load the 1.2 GB file, and then another trying to load a 600 MB file. I’ve pasted the 600 MB failure’s stack trace here. Do you have suggestions about how to load this data? Datalevin can handle this much data, right? Should I try the init-db approach you suggested, or will that face OOM as well?

Huahai20:04:35

data size is not a problem, but each map has 200 keys, and that probably is not a usual case. You can init-db, which bypasses the transaction logic

👍 1
Huahai20:04:48

Another option is to transact one map at a time. I would try that first.

James Reber04:04:28

If you don’t mind, I have a few more questions for you. I tried transacting rows one at a time but that was very slow (10k rows/min), and the queries were very slow too (something like 30s when I had 100k rows * 200 fields = 20MM datoms). I will probably have billions of datoms if I get this file loaded; I hope Datalevin can handle that scale. So I pivoted to manually creating the datoms like you suggested. It wasn’t too hard, except that now I am not getting anything in my database. I see data.mdb is 100M on disk, which tells me LMDB thinks it should be storing something, and (d/schema conn) works. But I can’t get any entities out of the database. Any suggestions of what I’m doing wrong?

James Reber15:04:24

@huahaiy No rush for you to respond, just wanted to ping you if you hadn’t seen my latest question ^^.

Huahai16:04:26

Probably your code is not adding anything to the DB. Unless you are using a transducer, the sequence function is rarely used in Clojure. You need to force the lazy seq; otherwise probably nothing is done.
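(A minimal sketch of the init-db/conn-from-datoms route with an eagerly realized datom vector; the entity ids, attributes, and exact arities are illustrative and may differ by Datalevin version.)

(require '[datalevin.core :as d])

(defn row->datoms
  "Turn one row map into datoms that share a fresh entity id."
  [eid row]
  (map (fn [[attr v]] (d/datom eid attr v)) row))

(let [rows   [{:person/name "Ada"  :person/born 1815}
              {:person/name "Alan" :person/born 1912}]
      ;; vec forces the lazy seq here, unlike sequence
      datoms (vec (mapcat row->datoms (rest (range)) rows))
      conn   (d/conn-from-datoms datoms "/tmp/init-db-demo")]
  (d/q '[:find ?n :where [_ :person/name ?n]] @conn))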

Huahai16:04:16

As to the slowness of query, that’s the current state. It is why I am working on the query engine rewrite.

Huahai16:04:52

BTW, all Datalog DBs in the Clojure world are slow like that. In fact, I would venture to say that others will be even slower, since you are talking about billions of datoms and each entity has 200 attributes. My goal for the query engine rewrite is to bring query performance close to that of a relational DB. Basically, a relational DB stores a row together, whereas in a triple store the row is broken down. A triple store affords maximal flexibility, but demands that the query engine do a lot more work, hence it is much slower. It is a well-known problem that I am attempting to solve.

Huahai16:04:40

As to the 100MB file size, that’s the default, even with empty data.

James Reber23:04:04

Makes sense about the query time. I ran into that with Crux (XTDB?) when I played with it a while back. It's a bummer because I much prefer Datalog to something like SQL. I'm interested to see what your query engine rewrite can do. I used sequence because I was using a transducer with mapcat, and I assumed conn-from-datoms would realize the lazy seq for me so I wouldn't need doall. Perhaps that assumption is wrong. I will play around with it and see what I can do.

Huahai18:04:49

0.6.7 is released; it fixed the float data type bug and allows all classes in Babashka pods

🎉 5
👍 1
Norman Kabir17:04:44

Hi @huahaiy, the latest version 0.6.7 produces this warning. Is this expected?
> user=> (require '[datalevin.core :as d])
> WARNING: abs already refers to: #'clojure.core/abs in namespace: taoensso.encore, being replaced by: #'taoensso.encore/abs
> nil

Huahai18:04:51

i think so. Clojure 1.11 introduced abs, and taoensso.encore still has its own abs.

👍 1
Eugen18:04:09

@huahaiy: would something like Polylith make sense for Datalevin? Also, having the ability to build multiple artifacts from the same code base is a nice feature. I think it does provide benefits, especially the clarification of which namespaces are API interfaces and which are implementation (something that the core ns already does, kind of). That could be used even without adopting Polylith. The gist of it is: have an interface ns (it can be named whatever) that imports the implementations and exports the public API. An example here: https://github.com/furkan3ayraktar/clojure-polylith-realworld-example-app/blob/master/components/database/src/clojure/realworld/database/interface.clj .
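(A generic sketch of the interface-namespace idea described here, with hypothetical names; it is not Datalevin's actual layout. Callers require only the interface ns, which simply delegates to the implementation namespace.)

(ns myapp.storage.interface
  "The only namespace callers are expected to require."
  (:require [myapp.storage.impl :as impl]))

(defn open
  "Public entry point; the details live in myapp.storage.impl."
  [dir]
  (impl/open dir))

(defn put!
  [store k v]
  (impl/put! store k v))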

Huahai18:04:39

core is the interface ns. Is it not sufficient?

Eugen19:04:35

looking at it right now, since I figured it kind of does that. I got a bit confused since I saw some implementations there