Hi @huahaiy - I'm looking for information on how to expose a datalevin database to the public by just exposing d/q via an endpoint. Is there any way people could abuse and mess up the system? One attack vector would be if datalevin lets you execute random functions in queries. In Datomic I think you have to register those functions first, how does it work in datalevin?
> There are too many ways to create an infinite loop. There is already a :timeout option in the query that one can set, you can try it to see if that works for you. (edited)
I think it can be narrowed down a bit. From what I have seen in the code and if understand correctly you use one Sci setup for two different kind of uses. One for the evaluation of developer approved functions and one for ad hoc code like (and (like? ?a) (or (= ?b 2))) . For developer approved code you can allow things like (range) and (repeat) etc, but for the ad hoc code added to queries by external users you could have a different Sci context with a much smaller whitelist.
I noticed that the problem is not what predicates are allowed, such as .toString or str , but the arguments that these predicates can receive such as (vec (range)). So I think splitting this two uses up would probably fix this potential exploit without removing any intentional features
What is "developer approved functions" vs. "ad hoc code"? I don't see the distinction. The goal is to have the same code working regardless the modes of DB operation. It is possible for a developer of embedded use to write code of infinite loop. Therefore, the fact that you directly use a code 3rd party submits to you is your own problem, not Datalevin's. All Datalevin can provide, is a timeout facility that can deal with infinite loop. That's it. It's developer's responsibility to prevent such code from running. Just like you don't take 3rd party's SQL statement and directly run on your DB, you shouldn't be taking a random query string to run in Datalevin.
Fair enough. Yeah i guess it is a usecase that is not really what Datalevin was intended for
Remember the basic tenant of computer science, you cannot detect a code will Halt or not.
Of course it is up to you what you want to do with. I was just offering a potential solution to avoid any change and give a bit of protection for this particular case
I'm also curious how you will do a timeout. You cannot really stop a thread so you have to wait for the VM to give up i think?
There's no such distinction as you described. There's no different context.
In the case of what @borkdude i think there is but he is the user of Datalevin so for Datalevin it might just be one context. He can of course try to sanitize the input somehow
The use case for exposing a read only dataset to the public and making it queryable via a public API is a different use case than using the db in your own app
But thatโs not what most dbs provide out of the box
There's a way to do it, if the query is done asynchronously, which I plan to do for other reasons as well, e.g. to prevent the dreaded "max reader reached" error.
we can of course add an option to disable running certain forms in query, that should handle these cases.
Basically, when this option is turned on, some forms or functions are not allowed.
we can certainly do that.
But it won't be 100% safe, or it may feel too strict, in any case, it would not make everyone happy, but at least good for 80% cases.
so let user to customize the ban list would be necessary.
No need to do this right now, at least not for me, right now just exploring
yeah, we can file a ticket and get to it when we have time
I was more optimistic about the possibilities. I don't think anyone would need things like (range) or (repeatedly) in the query itself. I think for the query itself you would use things like (and (like ?a) (= ?b 2)) like in the examples in the tests. So i think for the query you can make a simple Sci context. And for the internal functions you can support all the other things with more elaborate Sci context. So that's why I was talking about two contexts if that makes sense now. I think Sci is pretty safe if you give only limited options.
But anyway, like @borkdude says this not required yet
SCI is slow, so you don't want to run everything in SCI.
Query predicates should run in JVM directly to have acceptable performance. SCI is used only for 1. stored functions, 2. functions that sent over wire. 3. code from command line.
We use SCI to accommodate GraalVM, for eval is not allowed in GraalVM.
if eval is allowed, we do't need sci
Ok I understand, but my tests seems to indicate that Sci is used for all predicates. Am I wrong?
(d/q '[:find ?sales
:in $ ?sales
:where [(.toString (vec (range)))
]]
(d/db conn) 10)no
predicates are all in jvm
Ah ok. Maybe I should not say anything about projects that I haven't studied for 100%. So you are saying that you never intended to use Sci as a Sandbox?
no, sci was introduced entirely to get around the eval limitation
ah ok my bad. Sorry misunderstood that completely
of course you could use eval in JVM clojure if you want to and SCI for the native image
I would need to see benchmarks about which is really slower though since eval can be slower if you don't cache the evaluation since the clojure analyzer + evaluation can take more time in total in some cases
of course, SCI is also needed for the command line tool dtlv
for embedded use, predicates are already clojure forms, so we don't need eval for those, I don't think sci can compete with that.
so the history is like this, sci was introduced early for the command line tool, for we need to evaluate code strings from command line, here we indeed considered the sandbox aspect; then I found that graalvm doesn't work with eval, since I am already using sci, so I introduced inter-fn macro to use sci to evaluate code for stored code in :db/fn and over the wire code (babushka and server).
so sci was never intended for evaluating query predicates, which always run in JVM.
Query predicates can run millions of time in a single query, so it is really not something should be running in an interpreter. Functions such as like? are compiled into FSM and cached in memory, and these are written in Java.
We have to do a lot of optimizations. How else could we compete with PostgreSQL in running complex query? It is written in highly optimized C.
We call Datalevin "simple", meaning it is simple in concept and use, but the implementation is not really simple, otherwise, we won't be able to compete.
Just looked at how SQLite handles this, if the user defined function is written in C and it has an infinite loop, SQLite won't be able to handle it either. So if we provide a ban list, it would be an improvement.
๐
DL execute random query functions only in the embedded library mode. So this should not be applicable for an API endpoint.
Otherwise, the function has to be provided in source form, and executed in SCI, which you know, are white list based.
๐ which docs are best to read on this?
https://cljdoc.org/d/datalevin/datalevin/0.9.22/api/datalevin.interpret
https://cljdoc.org/d/datalevin/datalevin/0.9.22/api/datalevin.interpret#inter-fn
so really, if you provides an API endpoint that takes a query string, I don't think it is possible to abuse it, as the function won't work since it is not part of your clojure environment.
you will just get an exception.
ah right. I see how it uses SCI now
so it seems to give access to datalevin.core etc by default. does this mean that you could potentially change/delete stuff while doing a query?
Not really, thereโs no access to the reference to db at api endpoint
So impossible to call those functions in a query.
@borkdude As an attack vector I would consider a DOS via infinite recursion. There are several ways to do this in Sci if you have enough available functions
yep
just a simple (loop [] (recur)) would do if you allow those
And any fn definition that has a self reference (via atom, volatile etc)
I don't know if/how datalevin allows function execution by default though, perhaps there's way to disable it
My guess is that you can do recursion. I cannot find any :deny or :allow in the repo. Here is the ctx definition https://github.com/juji-io/datalevin/blob/4289de2c3577ca59190f2349a77dd23cd0814139/src/datalevin/interpret.clj#L205-L213
does anyone have an example of how you can run a loop in an arbitrary query?
No and I am probably wrong there is some whitelisting going on here https://github.com/juji-io/datalevin/blob/4289de2c3577ca59190f2349a77dd23cd0814139/src/datalevin/query.clj#L664
Maybe I should change career
(def conn (d/get-conn "/tmp/datalevin-mydb" {}
))
(d/q '[:find ?sales
:in $ ?sales
:where [(.toString (vec (range)))
]]
(d/db conn) 10)
Do it where you can kill the process ๐the dot-form is the weak point
Any reason for not using str?
so i can exploit the weakness ๐
the dotform gives me a free pass
Oh but you are right, with normal str it also does what i intended. Just didn't think about that before ๐ฌ
There are too many ways to create an infinite loop. There is already a :timeout option in the query that one can set, you can try it to see if that works for you.
The :timeout option probably won't work for handling infinite loop in code. Need to file an GitHub issue for this.
@huahaiy Just FYI, I'm not running into an urgent situation with this, I was just exploring options. Mostly for #dewey which is a dataset about the clojure ecosystem which we could make queryable with datalevin potentially, via the web