Fork me on GitHub
#datalevin
<
2023-04-26
>
Eugen16:04:25

hi @huahaiy, trying to implement Iterator so the map implementation supports entrySet.

Huahai16:04:47

Iterators are already implemented.

Eugen16:04:08

you mean get-range?

Huahai16:04:34

all these functions use an iterator

Huahai17:04:25

right, get-range

Huahai17:04:18

As I said last time, all these functionalities are already there

Eugen17:04:19

that is statefull though and might not always work. I don't always control who calls entrySet

Huahai17:04:27

no need to implement your own

Eugen17:04:02

I am trying the implement the java.util.Map interface over datalevin API

Huahai17:04:10

iterator is stateful, that’s the contract

Huahai17:04:24

by definition

Huahai17:04:40

next, hasNext, that’s state

Eugen17:04:41

sure, but range-seq opens a transaction that should be closed

Eugen17:04:00

how can I make sure clients will close transaction ?

Huahai17:04:40

that’s why you don’t use range-seq, entrySet will return the whole thing in one go,that’s the contract

Eugen17:04:11

ok, but that won't work for large data sets - which is my case

Huahai17:04:39

it is already spilling to disk

Huahai17:04:44

so that’s not a problem

Huahai17:04:56

in term of respecting the API

Eugen17:04:59

if I have functions to get first key from db, and get next key after a key than this could be solved in a cleaner way IMO

Huahai17:04:40

that’s an iterator, which by defintion, is stateful

Huahai17:04:31

you are talking about map API, which is not stateful. The stateful thing are already implemented, what you do will not be different, you are not buying anything by doing your own

Huahai17:04:12

get a key, then get next, that’s what iterator is

Huahai17:04:27

which is already implemented, and it is complicated

Eugen17:04:51

I know get range can get keys using some filters > range-type can be one of :all, :at-least, :at-most, :closed, :closed-open, :greater-than, :less-than

Huahai17:04:00

don’t think that’s somethong you should be concerned about, you either use it, or not, there’s no point trying to implement it

Eugen17:04:04

I will see if I can get something that for a given key "a" I can get the next key in DB after it.

Eugen17:04:28

I think lmdb is the state. I can keep the current key in an atom inside iterator

Huahai17:04:40

You are not doing anything useful by doing that. If you want an iterator, it is in there

Huahai17:04:24

what you describe is an iterator, which itself requires state that is independent from data

Huahai17:04:03

not just LMDB, iterator requires its own state

Eugen17:04:08

(defn entry-set-iterator
  "Implemnts ^java.util.Iterator for use in entrySet implementation.
   
   "
  [db dbi & _opts]
  (let [state (atom {:current-key nil})]
    (reify java.util.Iterator
      (hasNext [_this]
        (throw (UnsupportedOperationException. "Not implemented")))
      (next [_this]
        (throw (UnsupportedOperationException. "Not implemented")))
      (remove [_this]
        (throw (UnsupportedOperationException. "Not implemented"))))))

Eugen17:04:37

I can keep iterator state in atom

Huahai17:04:42

you don’t need these

Huahai17:04:06

you are doing unnecessary work, because we already have an iterator, you can just use it

Huahai17:04:12

i don’t see how this is helping

Huahai17:04:09

Of course, you can layer things on top of things, but you gain nothing but lose performance. I don’t see why we you want to do that

Eugen17:04:53

ok, for some context. I pass my map implementation to a function that uses this code:

(doseq [x coll]
      (.write wr (print* x))
      (.write wr (int \newline)))
the above code called entrySet on my map and failed. If I can provide entrySet it will work out of the box and iterate over collection.

Eugen17:04:30

I understand fro you that I can use: • get-range - loads everything in memory and spills to disk if OOM • range-seq - opens a transaction that needs to be closed

Huahai17:04:31

iterate-kv is a part of IDB protocol, it returns an iterator

Huahai17:04:18

Really, I don’t say how you can do better than just use get-range

Huahai17:04:50

I mean, you say you have big data, how implement your own iterator can avoid this? You still got big data.

Eugen17:04:45

yes, but I don't need to keep a transaction open if I open a transaction only when next() is called

Eugen17:04:52

if I keep the current key

Huahai17:04:04

java.util.map API entrySet requires you return the whole set, you cannot escape that

Eugen17:04:11

and call functions to get next key from lmdb then I should be ok

Eugen17:04:37

> Returns a Set view of the mappings contained in this map.

Eugen17:04:47

a set VIEW - does not need to be realized

Eugen17:04:59

that is how I understand it anyway

Huahai17:04:01

you are still going to keep the data somewhere? right, what would that achieve?

Eugen17:04:18

yeah but I need to keep the current key in memory

Eugen17:04:27

get-range reads all

Eugen17:04:38

while range-seq needs an open transaction

Eugen17:04:52

in my case it should open a transaction whenever it's doing work

Eugen17:04:07

(e.g. call next() / hasNext() , etc)

Huahai17:04:24

sure, you can do that. it is doable, the performance will suffer, but it’s a trade off

Eugen17:04:00

I can put it in the notes

Eugen17:04:16

I guess all versions have some rade-off

Huahai17:04:17

as long as you are willing to open a transaction to read every key, sure, why not

Huahai17:04:50

we do renew read transaction, so it may not be too bad

Huahai17:04:21

you can use get-range and get-some to implement this

Huahai17:04:38

get-some will try to get the next one

Huahai17:04:49

so it is doable

Eugen17:04:52

ok, thanks for the pointers

Eugen17:04:03

and for the talk.

Eugen17:04:11

I feel it was a bit intense at some point

Eugen17:04:26

I am fine. I hope you are too

Huahai17:04:55

no worries, don’t be alarmed, that’s how I talk with people,

Huahai17:04:46

i tend to talk straight, it’s about ideas, not about people

Eugen17:04:02

I saw both get-* return k-v . I guesss it's no way to avoid getting the value back

Huahai17:04:03

i am not a people person

Eugen17:04:52

but I get a buffer so I can avoid reading it - and it should be ok ?!

Huahai17:04:56

get-kv is when you know the key, but you don’t in an iterator

Huahai17:04:02

it should be ok, you don’t have to read the buffer, when you do, you don’t have to read the whole buffer

Huahai17:04:29

you can do whatever with the buffer in fact. you can write the buffer as one data type, but read it as another, or partially. I do these in writing the search engine, for example.

Eugen17:04:24

thanks. I will see what I can make out.

Eugen17:04:36

there is also with-transaction-kv that I forgot about.

Eugen17:04:01

I can make so processing is using a single transaction

Huahai17:04:14

i would still not deal with these low level things for this purpose though, I would use get-some, etc.

Huahai17:04:31

i don’t think that’s possible

Huahai17:04:19

with-transaction-kv is for write

Huahai17:04:24

not for read

Huahai17:04:37

read transacton and write transaction are two different thing

Eugen17:04:01

I mean use it like this:

(with-transaction-kv ....

(.entrySet my-mao)
....
) 

Eugen17:04:08

iterators can remove elements

Eugen17:04:42

and I think the entrySet implementation can also support remove operation

Huahai17:04:46

the purpose of with-transaction-kv is to make sure reads during write see what’s just being written

Huahai17:04:12

otherwise, writes are only visible after commit

Huahai17:04:31

so with-transaction-kv is for implement things like CAS

Huahai17:04:27

when you use with-transaction-kv, you are blocking write, which is not user would expect

Huahai17:04:43

there can be only one writer at a time

Eugen17:04:47

yeah, unless it's documented 🙂

Eugen17:04:57

then it's a feature 🙂

Eugen17:04:47

btw, I did find guava-testlib that has unit tests for collection implementation. Planning to use those to check the implementation. Added a comment on issue with links

Huahai17:04:23

I wouldn’t think it is user friendly for a java.util.map to be blocking writes to the whole DB.

Eugen17:04:57

yeah, it wouldn't be

Eugen17:04:48

btw, for > 0.11.0 A new Datalog query engine with improved performance. I have an idea to implement datalog over Apache Calcite relational algebra. I think that might be a better query engine.

Eugen17:04:42

I did a small lib that allows you to use calcite from Clojure https://github.com/ieugen/calcite-clj

Eugen17:04:02

and do SQL over clojure collections

Eugen17:04:28

I believe it is possible to have datalog over relational algebra : https://calcite.apache.org/docs/algebra.html

Eugen17:04:52

and calcite has a powerfull query engine (or that is my impression)

Huahai17:04:50

Well, I tend to not believe marketing words. I think these things do need innovation from research, which is what I intend to do.

Huahai17:04:21

These things are not easily improved without doing innovative work, which most open source projects do not

Huahai17:04:25

the reason I do datalevin is because I think SQL is a bad language

Huahai17:04:34

even its inventors admit that

Huahai17:04:13

I came from IBM Research - Almaden, I worked in the same floor as people who invented SQL

Huahai17:04:14

I don’t think they like SQL themselves. LOL

Huahai17:04:56

its’ a historical accident that sql gets to where it is now. it’s not good

Huahai17:04:33

my intention is to bring datalog query processing performance to be on par with RDBMS, which is doable.

Huahai17:04:19

clojure is good for experimentation. If the experiment is successful, it is possible to write it in a lower level language like rust, zig, or whatever, that’s my long term plan

💯 2
Huahai17:04:47

the benefit of triple store is the unrolled storage, which makes the most difficult part of query optimization easy, that is cardinality estimation. This is what I see as the breakthrough.

Eugen17:04:19

ok, calcite has an SQL parser. it translates the SQL query to relational algebra that it uses to run the query. there could be a Datalog -> relational algebra conversion

Huahai17:04:59

relation algebra with recursion, that’s datalog. they are the same

Eugen17:04:33

final FrameworkConfig config;
final RelBuilder builder = RelBuilder.create(config);
final RelNode node = builder
  .scan("EMP")
  .build();
System.out.println(RelOptUtil.toString(node));
> LogicalTableScan(table=[[scott, EMP]])

Huahai17:04:47

SQL is not a good relational algebra implmentation. That’s the part where it fails.

Huahai17:04:59

anyway, that’s why i am doing this, so database is so central to programming. a litle improvement will have huge impact.

Eugen18:04:16

not sure, but I don't think calcite's relational algebra is SQL only.

Eugen18:04:23

thanks for the personal info

Eugen18:04:40

I don't have such a high vision

Eugen18:04:45

just playing around for now 🙂

Eugen18:04:21

I would like to expore datalog on top of calcite relational algebra when I get to that part

Huahai18:04:23

somebody has to have.

Huahai18:04:40

another vision is to bring most of data access into an integrated story for application development.

Huahai18:04:16

since most of code is just transforming data from one db to another, such a waste of human efforts.

Huahai18:04:53

these are all laid out in README, if one reads it carefully.

Eugen18:04:04

sounds like a long term plan.

Eugen18:04:17

good luck!

Eugen18:04:31

thanks again for the conversation.

Eugen18:04:38

signing off soon

Huahai18:04:56

cheers, thanks for the interest.

Huahai19:04:34

I agree that adding a third way of iterate over the data would be a good addition.

Eugen16:04:52

Any ideas on how to do that? are there any functions to work with keys / iterate over them, get next item after a key>