datalevin

borkdude 2025-06-18T21:24:13.032769Z

Hi @huahaiy - I'm looking for information on how to expose a datalevin database to the public by just exposing d/q via an endpoint. Is there any way people could abuse and mess up the system? One attack vector would be if datalevin lets you execute random functions in queries. In Datomic I think you have to register those functions first, how does it work in datalevin?

2025-06-20T07:16:21.725739Z

> There are too many ways to create an infinite loop. There is already a :timeout option in the query that one can set, you can try it to see if that works for you. (edited) I think it can be narrowed down a bit. From what I have seen in the code and if understand correctly you use one Sci setup for two different kind of uses. One for the evaluation of developer approved functions and one for ad hoc code like (and (like? ?a) (or (= ?b 2))) . For developer approved code you can allow things like (range) and (repeat) etc, but for the ad hoc code added to queries by external users you could have a different Sci context with a much smaller whitelist. I noticed that the problem is not what predicates are allowed, such as .toString or str , but the arguments that these predicates can receive such as (vec (range)). So I think splitting this two uses up would probably fix this potential exploit without removing any intentional features

Huahai 2025-06-20T15:36:12.958729Z

What is "developer approved functions" vs. "ad hoc code"? I don't see the distinction. The goal is to have the same code working regardless the modes of DB operation. It is possible for a developer of embedded use to write code of infinite loop. Therefore, the fact that you directly use a code 3rd party submits to you is your own problem, not Datalevin's. All Datalevin can provide, is a timeout facility that can deal with infinite loop. That's it. It's developer's responsibility to prevent such code from running. Just like you don't take 3rd party's SQL statement and directly run on your DB, you shouldn't be taking a random query string to run in Datalevin.

2025-06-20T15:39:12.196809Z

Fair enough. Yeah i guess it is a usecase that is not really what Datalevin was intended for

Huahai 2025-06-20T15:39:49.094579Z

Remember the basic tenant of computer science, you cannot detect a code will Halt or not.

2025-06-20T15:40:49.208859Z

Of course it is up to you what you want to do with. I was just offering a potential solution to avoid any change and give a bit of protection for this particular case

2025-06-20T15:41:31.788199Z

I'm also curious how you will do a timeout. You cannot really stop a thread so you have to wait for the VM to give up i think?

Huahai 2025-06-20T15:41:36.581849Z

There's no such distinction as you described. There's no different context.

2025-06-20T15:42:33.042899Z

In the case of what @borkdude i think there is but he is the user of Datalevin so for Datalevin it might just be one context. He can of course try to sanitize the input somehow

borkdude 2025-06-20T15:42:51.663579Z

The use case for exposing a read only dataset to the public and making it queryable via a public API is a different use case than using the db in your own app

borkdude 2025-06-20T15:43:19.025869Z

But thatโ€™s not what most dbs provide out of the box

Huahai 2025-06-20T15:43:20.878669Z

There's a way to do it, if the query is done asynchronously, which I plan to do for other reasons as well, e.g. to prevent the dreaded "max reader reached" error.

Huahai 2025-06-20T15:45:29.848099Z

we can of course add an option to disable running certain forms in query, that should handle these cases.

Huahai 2025-06-20T15:47:41.189089Z

Basically, when this option is turned on, some forms or functions are not allowed.

Huahai 2025-06-20T15:47:57.422439Z

we can certainly do that.

Huahai 2025-06-20T15:49:23.242909Z

But it won't be 100% safe, or it may feel too strict, in any case, it would not make everyone happy, but at least good for 80% cases.

Huahai 2025-06-20T15:50:00.782579Z

so let user to customize the ban list would be necessary.

borkdude 2025-06-20T15:53:42.125239Z

No need to do this right now, at least not for me, right now just exploring

Huahai 2025-06-20T15:54:38.057769Z

yeah, we can file a ticket and get to it when we have time

2025-06-20T15:56:10.654749Z

I was more optimistic about the possibilities. I don't think anyone would need things like (range) or (repeatedly) in the query itself. I think for the query itself you would use things like (and (like ?a) (= ?b 2)) like in the examples in the tests. So i think for the query you can make a simple Sci context. And for the internal functions you can support all the other things with more elaborate Sci context. So that's why I was talking about two contexts if that makes sense now. I think Sci is pretty safe if you give only limited options. But anyway, like @borkdude says this not required yet

Huahai 2025-06-20T15:57:31.760529Z

SCI is slow, so you don't want to run everything in SCI.

Huahai 2025-06-20T16:00:16.061879Z

Query predicates should run in JVM directly to have acceptable performance. SCI is used only for 1. stored functions, 2. functions that sent over wire. 3. code from command line.

Huahai 2025-06-20T16:01:01.520009Z

We use SCI to accommodate GraalVM, for eval is not allowed in GraalVM.

Huahai 2025-06-20T16:01:24.099079Z

if eval is allowed, we do't need sci

2025-06-20T16:01:25.866649Z

Ok I understand, but my tests seems to indicate that Sci is used for all predicates. Am I wrong?

(d/q '[:find  ?sales
              :in    $ ?sales
              :where [(.toString (vec (range)))
                      ]]
             (d/db conn) 10)

Huahai 2025-06-20T16:01:37.573409Z

no

Huahai 2025-06-20T16:01:52.304609Z

predicates are all in jvm

2025-06-20T16:04:20.714379Z

Ah ok. Maybe I should not say anything about projects that I haven't studied for 100%. So you are saying that you never intended to use Sci as a Sandbox?

Huahai 2025-06-20T16:04:58.774559Z

no, sci was introduced entirely to get around the eval limitation

2025-06-20T16:05:15.599459Z

ah ok my bad. Sorry misunderstood that completely

borkdude 2025-06-20T16:05:59.309889Z

of course you could use eval in JVM clojure if you want to and SCI for the native image

borkdude 2025-06-20T16:06:58.628919Z

I would need to see benchmarks about which is really slower though since eval can be slower if you don't cache the evaluation since the clojure analyzer + evaluation can take more time in total in some cases

Huahai 2025-06-20T16:09:01.566189Z

of course, SCI is also needed for the command line tool dtlv

Huahai 2025-06-20T16:10:28.889799Z

for embedded use, predicates are already clojure forms, so we don't need eval for those, I don't think sci can compete with that.

Huahai 2025-06-20T16:17:34.751069Z

so the history is like this, sci was introduced early for the command line tool, for we need to evaluate code strings from command line, here we indeed considered the sandbox aspect; then I found that graalvm doesn't work with eval, since I am already using sci, so I introduced inter-fn macro to use sci to evaluate code for stored code in :db/fn and over the wire code (babushka and server).

Huahai 2025-06-20T16:18:53.817369Z

so sci was never intended for evaluating query predicates, which always run in JVM.

Huahai 2025-06-20T16:22:20.540659Z

Query predicates can run millions of time in a single query, so it is really not something should be running in an interpreter. Functions such as like? are compiled into FSM and cached in memory, and these are written in Java.

Huahai 2025-06-20T16:27:52.399279Z

We have to do a lot of optimizations. How else could we compete with PostgreSQL in running complex query? It is written in highly optimized C.

Huahai 2025-06-20T16:30:27.836569Z

We call Datalevin "simple", meaning it is simple in concept and use, but the implementation is not really simple, otherwise, we won't be able to compete.

Huahai 2025-06-20T16:51:41.692089Z

Just looked at how SQLite handles this, if the user defined function is written in C and it has an infinite loop, SQLite won't be able to handle it either. So if we provide a ban list, it would be an improvement.

borkdude 2025-06-20T16:58:32.466189Z

๐Ÿ‘

Huahai 2025-06-18T21:26:25.966499Z

DL execute random query functions only in the embedded library mode. So this should not be applicable for an API endpoint.

Huahai 2025-06-18T21:27:12.218889Z

Otherwise, the function has to be provided in source form, and executed in SCI, which you know, are white list based.

borkdude 2025-06-18T21:28:18.278159Z

๐Ÿ™ which docs are best to read on this?

Huahai 2025-06-18T21:34:59.293799Z

so really, if you provides an API endpoint that takes a query string, I don't think it is possible to abuse it, as the function won't work since it is not part of your clojure environment.

Huahai 2025-06-18T21:35:23.956939Z

you will just get an exception.

borkdude 2025-06-18T21:54:10.153229Z

ah right. I see how it uses SCI now

borkdude 2025-06-18T21:58:02.309949Z

so it seems to give access to datalevin.core etc by default. does this mean that you could potentially change/delete stuff while doing a query?

Huahai 2025-06-18T23:26:35.211829Z

Not really, thereโ€™s no access to the reference to db at api endpoint

Huahai 2025-06-18T23:27:12.076599Z

So impossible to call those functions in a query.

2025-06-19T09:36:07.073529Z

@borkdude As an attack vector I would consider a DOS via infinite recursion. There are several ways to do this in Sci if you have enough available functions

borkdude 2025-06-19T09:36:29.195479Z

yep

borkdude 2025-06-19T09:37:03.495369Z

just a simple (loop [] (recur)) would do if you allow those

2025-06-19T09:37:29.918249Z

And any fn definition that has a self reference (via atom, volatile etc)

borkdude 2025-06-19T09:37:55.554119Z

I don't know if/how datalevin allows function execution by default though, perhaps there's way to disable it

2025-06-19T09:45:43.675799Z

My guess is that you can do recursion. I cannot find any :deny or :allow in the repo. Here is the ctx definition https://github.com/juji-io/datalevin/blob/4289de2c3577ca59190f2349a77dd23cd0814139/src/datalevin/interpret.clj#L205-L213

borkdude 2025-06-19T09:46:47.275179Z

does anyone have an example of how you can run a loop in an arbitrary query?

2025-06-19T10:11:46.772669Z

No and I am probably wrong there is some whitelisting going on here https://github.com/juji-io/datalevin/blob/4289de2c3577ca59190f2349a77dd23cd0814139/src/datalevin/query.clj#L664

2025-06-19T10:58:15.538769Z

Maybe I should change career

(def conn (d/get-conn "/tmp/datalevin-mydb" {}
                             )) 
(d/q '[:find  ?sales
              :in    $ ?sales
              :where [(.toString (vec (range)))
                      ]]
             (d/db conn) 10)
Do it where you can kill the process ๐Ÿ™‚

2025-06-19T10:59:06.120569Z

the dot-form is the weak point

borkdude 2025-06-19T11:00:37.161389Z

Any reason for not using str?

2025-06-19T11:00:55.174879Z

so i can exploit the weakness ๐Ÿ™‚

2025-06-19T11:01:06.693069Z

the dotform gives me a free pass

2025-06-19T11:12:27.380939Z

Oh but you are right, with normal str it also does what i intended. Just didn't think about that before ๐Ÿ˜ฌ

Huahai 2025-06-19T15:33:24.606719Z

There are too many ways to create an infinite loop. There is already a :timeout option in the query that one can set, you can try it to see if that works for you.

Huahai 2025-06-19T17:36:35.304289Z

The :timeout option probably won't work for handling infinite loop in code. Need to file an GitHub issue for this.

borkdude 2025-06-19T19:13:11.902039Z

@huahaiy Just FYI, I'm not running into an urgent situation with this, I was just exploring options. Mostly for #dewey which is a dataset about the clojure ecosystem which we could make queryable with datalevin potentially, via the web