I have been testing large language models' ability to translate natural language queries to Datomic-flavored Datalog. Basically, the prompt includes a schema and a natural language query, and the LLM needs to produce a Datalog query.
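To make the task concrete, here is a sketch of one such pair — the `:movie/*` attribute names are made up for illustration, not from any real schema:

```clojure
;; NL: "Find the titles of all movies released after 2000"
;; A Datomic-flavored Datalog query the LLM is expected to produce
;; (attribute names are hypothetical):
[:find ?title
 :where
 [?m :movie/title ?title]
 [?m :movie/year ?year]
 [(> ?year 2000)]]
```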
Among the open source LLMs, mixtral is quite good; it is the only model that can do this task with decent performance among all the models I tested, which include every code generation model in the ollama library.
However, even GPT4 makes a lot of mistakes, and mixtral is a bit worse at following instructions. I think these models have not seen enough Datalog in their training data.
On the other hand, I think translating natural language queries to Datalog is a much simpler task than translating to SQL, which, even with so much effort and so much training data available, is still far from solved.
Datomic-flavored Datalog is very simple conceptually; that's why GPT4 and mixtral can do a decent job without having seen a lot of examples in their training data. I think with fine-tuning, we may solve this completely, unlike SQL.
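One illustration of the conceptual simplicity: joins that need explicit JOIN/ON clauses in SQL fall out of shared logic variables in Datalog. A sketch with made-up attributes:

```clojure
;; NL: "Which actors appeared in movies directed by Ridley Scott?"
;; In SQL this needs explicit joins over foreign keys; here the joins
;; are just the shared variables ?d, ?m, ?a (attributes hypothetical):
[:find ?actor-name
 :where
 [?d :person/name "Ridley Scott"]
 [?m :movie/director ?d]
 [?m :movie/cast ?a]
 [?a :person/name ?actor-name]]
```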
My plan is to collect enough good training data for natural language to Datalevin query translation, use the data to fine-tune a model with mixtral as the base, and release the model to huggingface.
Hey Huahai. This project looks like a great initiative, I’m not an expert in training LLMs but interested in participating
Yeah sounds interesting indeed. Maybe also an idea to share this in #C054XC5JVDZ?
I am not an expert either. The main effort will be data preparation, though. I am targeting Datalevin specifically, so it may not be of general interest.
If a small model can be successfully trained, we could even add it as an optional module of Datalevin, so we become a DB that comes with an NL interface. Other AI features can be gradually added, like schema generation, sample data population, etc.
Again, this model is going to be Datalevin specific, because of the schema differences. Datomic treats schema as data, which needlessly complicates things (e.g. partitions); Datascript doesn't really do data types; and so on. In addition, we will need to enforce some conventions for this to work, e.g. namespaced attributes, ref naming, etc.
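A sketch of what such conventions might look like in Datalevin's plain-map schema style — namespaced attributes, and ref attributes named after the entity type they point to. The attribute names here are illustrative, not prescriptive:

```clojure
;; Hypothetical schema following the proposed conventions:
{:movie/title    {:db/valueType :db.type/string}
 :movie/year     {:db/valueType :db.type/long}
 :movie/director {:db/valueType :db.type/ref}  ; ref named for its target entity
 :person/name    {:db/valueType :db.type/string
                  :db/unique    :db.unique/identity}}
```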
The goal is to beat NL-to-SQL interfaces by a large margin. Not just on benchmarks like spider and wikisql, which are too simple, but on real-world databases that are more complicated.
We need to collect data: schemas plus NL and Datalog pairs. And use an intermediary EDN format for these, so we can generate various training data formats out of them.
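One possible shape for such an intermediary record — the field names are a guess at a format, not anything settled; various training file formats could be generated from a collection of these:

```clojure
;; Hypothetical intermediary EDN record (one training example):
{:schema  {:movie/title {:db/valueType :db.type/string}
           :movie/year  {:db/valueType :db.type/long}}
 :nl      "List all movie titles from 1999"
 :datalog '[:find ?t
            :where
            [?m :movie/title ?t]
            [?m :movie/year 1999]]}
```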
For 1, we need to write the corresponding NL queries. Also, code in the wild often avoids query, instead writing Clojure code to work on pull results, so it may not be suitable for our purposes.
Once collected, this data set can be used to train smaller models as well. For example, several small models, such as codellama 7b and deepseek-coder 6.7b, already know the syntax of Datomic-flavored Datalog, so it should be possible to fine-tune them to achieve good results. If we can have a small open source model that does a great job at NL-to-Datalog translation, it would probably spur the use of the Datalog query language.
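For fine-tuning, each intermediary record would need to be rendered into a prompt/completion pair. A minimal sketch — the function name, record fields, and prompt wording are all hypothetical:

```clojure
;; Turn one hypothetical intermediary EDN record into a
;; prompt/completion pair for instruction fine-tuning:
(defn ->training-pair [{:keys [schema nl datalog]}]
  {:prompt     (str "Schema:\n" (pr-str schema)
                    "\nQuestion: " nl
                    "\nDatalog query:")
   :completion (pr-str datalog)})
```

The same records could just as easily be rendered into chat-style or JSONL formats, which is the point of keeping an intermediary representation.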
