#ai
2023-04-24
john00:04:25

A full dump of the Clojurians chat logs could be useful for Clojure instruction tuning too. We could also host a chat-based "Clojure coaching" service, where Clojure newbies get free help and Clojure coaches know that their coaching is contributing to the instruction-tuning dataset for a Clojure coding assistant.

john00:04:57

A few open source groups are crowd sourcing their fine tuning instruction datasets, like Open Assistant. They've got over 150 thousand conversation trees. But we might be able to get away with just a few tens of thousands of high quality clojure coding instruction trees, plus existing chatlogs and gen'd repl interactions

Rupert (All Street)17:04:02

> I'm curious if we could just spec gen billions of functions and record the output

This sounds like a good idea. It also plays well to Clojure that you can generate code as data (it's harder to generate valid Java from just a schema).

> When playing with chatgpt, it's clear that it's transferring knowledge about programming solutions in other languages sometimes when giving a clojure solution.

You need a high-quality underlying model: it's not enough to just be good at Clojure, the LLM also needs a good model of the world to be able to interpret the user's request.

> where having a language coding assistant is as common as having a code linter or an LSP server, so we may want to think about how to make sure we have a clojure assistant story

Yes, I think an LLM to assist with Clojure coding, either in-editor (like GitHub Copilot) or as a chat bot (like ChatGPT), could be very important for the Clojure community's continued growth. It may be that OpenAI/Microsoft/etc. will do this for us, or it may be that we can do better with our own fine-tuned ClojureLLM.

> A full dump of clojurians chat logs for clojure instruction tuning could be useful for that too.

We could use an LLM to analyse each message: "Is this a question or an answer?", "Does this answer the question above?", "Do people seem satisfied with this answer based on the subsequent replies?" Then we can build a nice training set from it.

> We could also host a chat-based "clojure coaching" service,

That's a good idea; alternatively we could just use Slack as we are today and apply the classification approach mentioned above.

> They've got over 150 thousand conversation trees. But we might be able to get away with just a few tens of thousands of high quality clojure coding instruction trees, plus existing chatlogs and gen'd repl interactions.

Agreed. GitHub Clojure code + Clojurians Slack + Clojure StackOverflow + some extra examples could provide enough examples for fine tuning.
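
Backing up to the first point: a minimal sketch of the spec-gen idea in Clojure, generating inputs from a spec, running them through a function, and recording each call as an (input, output) training pair. The spec, function, and pair format here are illustrative only, and it assumes org.clojure/test.check is on the classpath.

```clojure
;; Sketch: generate inputs from a spec, apply a function, and record each
;; call as a training pair. Spec and function are illustrative, not a
;; real pipeline.
(require '[clojure.spec.alpha :as s]
         '[clojure.spec.gen.alpha :as gen])

(s/def ::xs (s/coll-of int? :min-count 1))

(defn gen-training-pairs
  "Generate n sample inputs from spec, apply f, and emit (input, output)
  pairs as strings suitable for an instruction-tuning dataset."
  [spec f n]
  (for [input (gen/sample (s/gen spec) n)]
    {:input  (pr-str (list 'my-fn input))
     :output (pr-str (f input))}))

(gen-training-pairs ::xs #(reduce + %) 3)
;; => ({:input "(my-fn [3 -1])", :output "2"} ...)
```

Because the generated calls are plain data, the same trick extends to generating whole forms and capturing their REPL output, which is the code-as-data advantage mentioned above.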

john17:04:33

> We could use an LLM to analyse each message "Is this a Question or answer?" "Does this answer the question above?" "Do people seem satisfied with this answer based on the subsequent replies?"

For the highest quality data there, in the shortest time, we could use GPT-3/4 to gen all that. But the licensing gets tricky. Then again, I'm not sure it matters if it's an open source project; I believe the licensing is only tricky if you want to commercialize the output.

Rupert (All Street)17:04:56

We could also get the LLM to tidy up the messages: "Make this message more readable and understandable for programmers", "Remove any references to usernames", etc.
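
A rough sketch of what those calls could look like from Clojure, assuming clj-http and cheshire as dependencies; the prompts, model choice, and OPENAI_API_KEY env var are placeholders, not a settled design.

```clojure
;; Rough sketch of LLM-based cleanup/classification calls.
;; Assumes clj-http and cheshire; prompts and model are placeholders.
(require '[clj-http.client :as http]
         '[cheshire.core :as json])

(defn chat-complete [prompt]
  (-> (http/post "https://api.openai.com/v1/chat/completions"
                 {:headers {"Authorization" (str "Bearer " (System/getenv "OPENAI_API_KEY"))
                            "Content-Type"  "application/json"}
                  :body    (json/generate-string
                             {:model    "gpt-3.5-turbo"
                              :messages [{:role "user" :content prompt}]})})
      :body
      (json/parse-string true)
      (get-in [:choices 0 :message :content])))

(defn classify-message [msg]
  (chat-complete (str "Is the following Slack message a question or an answer? "
                      "Reply with exactly one word: QUESTION or ANSWER.\n\n" msg)))

(defn tidy-message [msg]
  (chat-complete (str "Make this message more readable and understandable for "
                      "programmers. Remove any references to usernames.\n\n" msg)))
```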

Rupert (All Street)17:04:12

Classification of question/answer could probably work on a slightly less capable model like Pythia.

john17:04:22

Might want to replace each username with a stable anonymized name. I heard a recent story that Stability AI's recent model problems came from blanking out usernames, and the model didn't know who was talking to whom lol
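
A tiny sketch of stable anonymization: every occurrence of a username maps to the same pseudonym, so the conversational structure survives. The message shape ({:author ... :text ...}) is hypothetical.

```clojure
;; Stable anonymization sketch: the same username always maps to the same
;; pseudonym, preserving who-replies-to-whom structure. A fuller version
;; would also rewrite @mentions inside the message text.
(defn stable-aliases [usernames]
  (into {} (map-indexed (fn [i u] [u (str "user-" (inc i))])
                        (distinct usernames))))

(defn anonymize [messages]
  (let [aliases (stable-aliases (map :author messages))]
    (map #(update % :author aliases) messages)))

(anonymize [{:author "alice" :text "how do I use reduce?"}
            {:author "bob"   :text "like this..."}
            {:author "alice" :text "thanks!"}])
;; => ({:author "user-1" ...} {:author "user-2" ...} {:author "user-1" ...})
```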

Rupert (All Street)17:04:15

Yeah, anonymization needs to happen at the right step in the fine-tuning data-generation pipeline.

john17:04:01

Aye. Well, some new Clojure LLM multi-GPT hotness just dropped: https://clojurians.slack.com/archives/C06MAR553/p1682349923753599

john17:04:17

So I'm down to help out with a ClojureLLM project 🙂 Looks like that project is just using OpenAI for now. I wouldn't mind just starting with that, with the option to bolt on private LLMs when they're ready

Rupert (All Street)17:04:18

Agreed, I'm down to help too. A few high-level tasks that come to mind:
• Collect fine-tuning data (raw data -> data-cleanup pipeline; see the sketch below)
• Run fine tuning on top of open source models
• Provide APIs, command-line interfaces, user interfaces, websites, and IDE integrations for interacting with the new model.
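
The first bullet could be sketched as a plain data pipeline, chaining the hypothetical helpers sketched earlier in the thread (anonymize, classify-message, tidy-message):

```clojure
;; Sketch of the raw-data -> cleanup pipeline, reusing the hypothetical
;; helpers sketched earlier in this thread.
(defn build-dataset [raw-messages]
  (->> raw-messages
       anonymize
       (map #(assoc % :kind (classify-message (:text %))))
       (filter (comp #{"QUESTION" "ANSWER"} :kind))
       (map #(update % :text tidy-message))))
```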

john17:04:31

Those are three pretty good, broad categories

john17:04:13

I'm probably the grayest around the second point. I've been doing a lot of research, but I haven't yet gotten my hands dirty with an actual fine-tuning run

Rupert (All Street)17:04:29

I can probably do that and cover compute cost for it too.

john17:04:35

And fine tuning one of the smaller open source models to be a reliable coding assistant might take a whole lot of experimentation... I haven't yet assessed how good they all are

Rupert (All Street)17:04:34

If it doesn't work great, then we still have the data and can wait for a better model to come along to fine-tune on top of.

Rupert (All Street)17:04:44

No guarantees with this project: it could be that we do this work and end up with a model that is too low quality in the end.

john17:04:53

Nice! We should probably do an assessment. Yeah, if there are any other LLaMA code-assistant models that are really good, we could probably run with that. If we're sure this bot won't be commercial, and will always be a Clojure community asset, then I think we can use an existing high-quality LLaMA

john17:04:55

But we can have multiple models in there, it's whatever

Rupert (All Street)17:04:06

Yeah, no reason to be stuck to just one model. I guess another task is assessing the output of the model, perhaps running its suggestions side by side with another AI's and asking GPT-4 to decide which is best.
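
A sketch of that side-by-side judging, reusing the hypothetical chat-complete helper from earlier (pointed at "gpt-4" instead); the prompt wording is illustrative.

```clojure
;; Sketch of pairwise evaluation with GPT-4 as the judge, reusing the
;; hypothetical chat-complete helper above (with :model set to "gpt-4").
(defn judge [task answer-a answer-b]
  (chat-complete
    (str "Two AI assistants answered the same Clojure task.\n\n"
         "Task:\n" task "\n\n"
         "Answer A:\n" answer-a "\n\n"
         "Answer B:\n" answer-b "\n\n"
         "Which answer is better? Reply with exactly A or B.")))
```

In practice you'd run each pair twice with the answers swapped, since LLM judges are known to show position bias.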

john17:04:10

The same data can be used on any of the models as they come out

john17:04:35

Yeah, that's what folks are doing

Rupert (All Street)17:04:46

Or the data might even get picked up by other groups and used in their models by default.

john17:04:49

Having GPT-4 score the output against itself

john17:04:21

yeah, there might be a lot of model sexing going on in the near future lol

john17:04:32

People folding weights back in from branch LLMs that were trained on other datasets

john18:04:35

This is interesting: https://replit.com/site/ghostwriter
Aren't there Clojure folks over at Replit?

john18:04:32

Codeium is free for individuals and claims to support Clojure here: https://codeium.com/blog/code-assistant-comparison-copilot-tabnine-ghostwriter-codeium

john18:04:08

Might be worth looking first into Dolly or Pythia, as they allow commercial use

john18:04:44

Open Assistant will likely have some decent Pythia models coming soon too

john22:04:35

> use the combination of HuggingFace, DeepSpeed, and Ray to build a system for fine-tuning and serving LLMs, in 40 minutes for less than $7 for a 6 billion parameter model

Carlo11:04:56

If I can be of help here, I'm all for it! In my experience trying to get LLMs to spit out reasonable Clojure, ChatGPT is really the only one that does an acceptable job. But then, maybe we just need to gather or generate a training dataset and fine-tune CodeAlpaca?

Carlo11:04:08

The issue with fine-tuning OpenAI stuff is that you can't fine-tune the chat models; you can fine-tune the davinci LLM, which has quality comparable to ChatGPT while costing 10x more per token. On coding tasks the tokens pile up fast, on top of the privacy issues
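
To make the 10x concrete, a back-of-the-envelope calculation. The prices below are assumed early-2023 list prices (gpt-3.5-turbo around $0.002/1K tokens, base davinci around $0.02/1K), so worth double-checking against OpenAI's pricing page:

```clojure
;; Back-of-the-envelope token cost. Prices are assumed early-2023 list
;; prices: gpt-3.5-turbo ~$0.002/1K tokens, davinci ~$0.02/1K (10x).
(defn cost-usd [tokens price-per-1k]
  (* (/ tokens 1000.0) price-per-1k))

(cost-usd 1e6 0.002) ;=> 2.0   ; one million tokens on the chat model
(cost-usd 1e6 0.02)  ;=> 20.0  ; the same million tokens on davinci
```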

Carlo11:04:45

I'm not sure how good a 6-billion-parameter model can get at generating code; I have huge doubts, but I'm curious 🙂

Carlo11:04:56

Using logs as a starting point for a high quality dataset seems a great idea!

Rupert (All Street)11:04:46

> If I can be of help here, I'm all for it!

Great, the more the merrier. If we have good fine-tuning data then we can use it to fine-tune multiple models, both now and as future models come out. And once we have the end-to-end workflow we can optimise it later, or at least learn its limitations.