Fork me on GitHub
#datalevin
<
2024-01-14
>
Huahai02:01:50

I have been testing large language models' ability to translate natural language query to Datomic flavored Datalog. Basically, the prompt will include a schema and a natural language query, and LLM needs to produce a Datalog query.

Huahai02:01:58

Not surprisingly, GPT4 is the best in the commercial LLMs for this task.

Huahai02:01:15

In the open source LLMs, mixtral is quite good, and is the only model that can do this task with decent performance, among all the models that I tested, which includes every code generation models in the ollama library.

Huahai02:01:50

However, even GPT4 makes a lot of mistakes. mixtral is a bit worse in following instructions. I think these models have not seen enough Datalog in their training data.

Huahai02:01:04

On the other hand, I think translating natural language query to Datalog is a much simpler task than translating to SQL, which, even with so much efforts and so much training data available, is still far from being solved.

Huahai02:01:17

Datomic flavored Datalog is very simple conceptually, that's why GPT4 and mixtral can do a decent job without having to see a lot of examples in their training data. I think with fine-turning, we may solve this completely, unlike SQL.

Huahai02:01:27

My plan is to collect enough good training data for natural language to Datalevin query translation, and use the data to do a fine-tuning using mixtral as the base, and release the model to huggingface.

👍 1
Huahai02:01:45

If any of you are interested in joining me in this project, please let me know.

ts150309:01:10

Hey Huahai. This project looks like a great initiative, I’m not an expert in training LLMs but interested in participating

jeroenvandijk11:01:50

Yeah sounds interesting indeed. Maybe also an idea to share this in #C054XC5JVDZ?

Huahai18:01:32

I am not an expert either. Main effort will be data preparation though. I am targeting Datalevin specifically, so it may not be of general interest

Huahai18:01:08

If a small model can be successfully trained, we could even add it as an optional module of Datalevin, so we become a DB that comes with a NL interface. Other AI features can be gradually added, like schema generation, sample data population, etc

Huahai18:01:34

Again, this model is going to be Datalevin specific, because of the schema difference. Datomic treats schema as data, which needlessly complicates things, e.g. partition etc; Datascript doesn’t really do data types, and so on. In addition, we will need to enforce some conventions for this to work, e.g. namespaced attributes, ref naming, etc.

Huahai18:01:22

That is to say, not everything the DB can do will be supported by the NL interface

Huahai18:01:03

The goal is to beat NL to SQL interface with a large margin. Not just in benchmarks like spider, wikisql, etc, which are too simple, but in real world database that are more complicated.

ts150319:01:26

Do you have a plan in mind? What steps should be taken?

Huahai21:01:56

We need to collect data: schema, NL and Datalog pairs. And use an intermediary EDN format for these, so we can generate various training data formats out of these.

Huahai21:01:01

1. We can scrape clojure repos for code using datomic, xtdb, datascript etc.

Huahai21:01:04

2. Convert nl2sql benchmark datasets, spider, wikisql

Huahai21:01:27

For 1, we need to write corresponding NL query. Also, code in the wild often avoids query, but write clojure code to work on pull results instead. So may not suitable for our purposes

Huahai21:01:13

For 2, we need to write the corresponding datalog query

Huahai00:01:01

These can all be semi-automated with gpt4

Huahai02:01:11

The main effort will in data preparation.

Huahai02:01:30

Once collected, this data set can be used to train smaller models as well. For example, several small models, such as codellama 7b and deepseek-coder 6.7b already know the syntax of Datomic flavored Datalog, so it is possible to fine tune these to achieve good results. If we can have a small open source model that does a great job in doing NL-to-Datalog translation, it probably would spur the use of Datalog query language.

nice 2