ai

john 2023-04-24T00:05:25.698599Z

A full dump of clojurians chat logs for clojure instruction tuning could be useful for that too. We could also host a chat-based "clojure coaching" service, where Clojure newbies can get free help and Clojure coaches can know that their training is contributing to a clojure coding assistant instruction training dataset

john 2023-04-24T00:09:57.920309Z

A few open source groups are crowd sourcing their fine tuning instruction datasets, like Open Assistant. They've got over 150 thousand conversation trees. But we might be able to get away with just a few tens of thousands of high quality clojure coding instruction trees, plus existing chatlogs and gen'd repl interactions

Rupert (Sevva/All Street) 2023-04-24T17:30:02.644799Z

> I'm curious if we could just spec gen billions of functions and record the output
This sounds like a good idea. It also plays to Clojure's strengths that you can generate code as data (it's harder to generate valid Java from just a schema).

> When playing with chatgpt, it's clear that it's transferring knowledge about programming solutions in other languages sometimes when giving a clojure solution.
You need a high quality underlying model - it's not enough to just be good at Clojure, the LLM also needs a good model of the world to be able to interpret the user's request.

> where having a language coding assistant is as common as having a code linter or an LSP server, so we may want to think about how to make sure we have a clojure assistant story
Yes, I think an LLM to assist with Clojure coding - either in the editor (like GitHub Copilot) or as a chat bot (like ChatGPT) - could be very important for the Clojure community's continued growth. It may be that OpenAI/Microsoft/etc will do this for us - or it may be that we can do better with our own fine tuned ClojureLLM.

> A full dump of clojurians chat logs for clojure instruction tuning could be useful for that too.
We could use an LLM to analyse each message: "Is this a Question or answer?" "Does this answer the question above?" "Do people seem satisfied with this answer based on the subsequent replies?" - then we can build a nice training set from it.

> We could also host a chat-based "clojure coaching" service,
That's a good idea. Alternatively, we could just use Slack as we are today and apply the classification approach mentioned above.

> They've got over 150 thousand conversation trees. But we might be able to get away with just a few tens of thousands of high quality clojure coding instruction trees, plus existing chatlogs and gen'd repl interactions.
Agreed, GitHub Clojure code + Clojurians Slack + Clojure StackOverflow + some extra examples could provide enough examples for fine tuning.
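
The classification idea above could be sketched roughly as follows. This is only an illustration in Python, not anything the thread settled on: `parse_log`, `build_pairs` and the `ask` callable are hypothetical names, and `ask` stands in for whatever LLM would actually answer the yes/no questions. The third question (whether people seem satisfied with the answer) is left out for brevity.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List

# Header lines in this log look like "<user> <ISO timestamp>Z".
HEADER = re.compile(r"^(?P<user>.+?) (?P<ts>\d{4}-\d{2}-\d{2}T\S+Z)$")


@dataclass
class Message:
    user: str
    ts: str
    text: str


def parse_log(raw: str) -> List[Message]:
    """Split a raw chat dump into messages, one per header line."""
    messages: List[Message] = []
    current = None
    for line in raw.splitlines():
        line = line.strip()
        match = HEADER.match(line)
        if match:
            current = Message(match["user"], match["ts"], "")
            messages.append(current)
        elif current is not None and line:
            # Body lines are appended to the most recent message.
            current.text = (current.text + "\n" + line).strip()
    return messages


def build_pairs(
    messages: List[Message],
    ask: Callable[[str, str], bool],  # hypothetical: (prompt, context) -> yes/no
) -> List[Dict[str, str]]:
    """Pair each question with the first later message judged to answer it."""
    pairs: List[Dict[str, str]] = []
    question = None
    for msg in messages:
        if ask("Is this message a question?", msg.text):
            question = msg
        elif question is not None and ask(
            "Does this answer the question above?",
            f"Q: {question.text}\nA: {msg.text}",
        ):
            pairs.append({"instruction": question.text, "response": msg.text})
            question = None
    return pairs


def toy_ask(prompt: str, context: str) -> bool:
    """Toy stand-in for an LLM call; a real classifier would replace this."""
    if prompt.startswith("Is this"):
        return context.rstrip().endswith("?")
    return True
```

Keeping `ask` as a plain callable means the same pipeline could be pointed at a smaller or larger model later without changing the pairing logic.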

john 2023-04-24T17:33:33.291189Z

> We could use an LLM to analyse each message "Is this a Question or answer?" "Does this answer the question above?" "Do people seem satisfied with this answer based on the subsequent replies?"
For the highest quality data there, in the shortest time, we could use gpt3/4 to gen all that. But the licensing gets tricky. But I'm not sure it matters if it's an open source project. I believe the licensing is only tricky if you want to commercialize the output.

Rupert (Sevva/All Street) 2023-04-24T17:34:56.301209Z

We could also get the LLM to tidy up the message: "Make this message more readable and understandable for programmers", "Remove any references to usernames", etc.

Rupert (Sevva/All Street) 2023-04-24T17:36:12.271919Z

Classification of question/answer could probably work on a slightly less capable model like Pythia.

john 2023-04-24T17:36:22.347679Z

Might want to replace each username with a stable anonymized name. I heard a recent story that stabilityAI's recent model problems came from blanking out usernames and the model didn't know who was talking to who lol
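
A minimal sketch of the stable-pseudonym idea, in Python. The function names, the `user_` prefix and the salt value are all made up for illustration. The point is only that the same username always maps to the same placeholder, so the model can still tell who is talking to whom.

```python
import hashlib
import re
from typing import List, Tuple


def pseudonym(username: str, salt: str = "example-salt") -> str:
    """Map a username to a stable placeholder such as 'user_3f2a9c1e'."""
    digest = hashlib.sha256((salt + username).encode("utf-8")).hexdigest()
    return f"user_{digest[:8]}"


def anonymize(
    messages: List[Tuple[str, str]], salt: str = "example-salt"
) -> List[Tuple[str, str]]:
    """Replace speakers and in-text mentions with stable pseudonyms.

    `messages` is a list of (username, text) pairs. Only names that appear
    as speakers are replaced inside message text.
    """
    if not messages:
        return []
    # Longest names first so a longer name wins over a shorter prefix of it.
    names = sorted({user for user, _ in messages}, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(name) for name in names))
    result = []
    for user, text in messages:
        # Single pass, so freshly inserted pseudonyms are never re-matched.
        clean = pattern.sub(lambda m: pseudonym(m.group(0), salt), text)
        result.append((pseudonym(user, salt), clean))
    return result
```

Note that plain substring matching would also rewrite a username that happens to be an ordinary word inside a message, and truncating the hash to 8 characters makes collisions possible in a very large user base. A real pipeline would need to handle both.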

Rupert (Sevva/All Street) 2023-04-24T17:37:15.983139Z

Yeah - anonymization needs to happen at the right step in the fine tuning data generation pipeline.

john 2023-04-24T17:38:01.802449Z

Aye. Well some new clojure llm mutl-gpt hotness just dropped: https://clojurians.slack.com/archives/C06MAR553/p1682349923753599

john 2023-04-24T17:39:17.517769Z

So I'm down to help out with a ClojureLLM project 🙂 Looks like that project is just using OpenAI for now. I wouldn't mind just starting with that, with the option to bolt on private LLMs when they're ready

Rupert (Sevva/All Street) 2023-04-24T17:42:18.381759Z

Agreed - I'm down to help too. A few high level tasks that come to mind:
• Collect fine tuning data (raw data -> data cleanup pipeline)
• Run fine tuning on top of open source models
• Provide APIs, command-line interfaces, user interfaces, websites and IDE integrations for interacting with the new model.

john 2023-04-24T17:43:31.895019Z

Those are three pretty good, broad categories

john 2023-04-24T17:45:13.272299Z

I'm probably the grayest around the second point. I've been doing a lot of research but I haven't yet gotten my hands dirty with doing a fine tuning run

Rupert (Sevva/All Street) 2023-04-24T17:46:29.436909Z

I can probably do that and cover compute cost for it too.

john 2023-04-24T17:46:35.973969Z

And fine tuning one of the smaller open source models to be a reliable coding assistant might take a whole lot of experimentation... I haven't yet assessed how good they all are

Rupert (Sevva/All Street) 2023-04-24T17:47:34.240109Z

If it doesn't work great - then we still have the data and can wait for a better model to come along to fine tune on.

Rupert (Sevva/All Street) 2023-04-24T17:48:44.808119Z

No guarantees with this project - could be that we do this work and end up with a model that is too low quality in the end.

john 2023-04-24T17:48:53.863709Z

Nice! We should probably do an assessment. Yeah, if there are any other llama code assistant models that are really good, we could probably run with that. If we're sure this bot won't be commercial, and will always be a clojure community asset, then I think we can use an existing high quality llama model

john 2023-04-24T17:50:55.812689Z

But we can have multiple models in there, it's whatever

Rupert (Sevva/All Street) 2023-04-24T17:51:06.777749Z

Yeah - no reason to be stuck to just one model. I guess another task is assessing the output of the model - perhaps running its suggestions side by side with another AI's and asking GPT-4 to decide which is best.
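
A rough sketch of that side-by-side comparison, in Python. The `judge` callable is hypothetical: it stands in for a call to a stronger model such as GPT-4 and returns 0 or 1 for whichever of the two answers it was shown it prefers. The order in which the answers are shown is shuffled per prompt, since judges can favour whichever answer comes first.

```python
import random
from typing import Callable, Sequence


def win_rate(
    prompts: Sequence[str],
    model_a: Callable[[str], str],
    model_b: Callable[[str], str],
    judge: Callable[[str, str, str], int],  # hypothetical: returns 0 or 1
    seed: int = 0,
) -> float:
    """Fraction of prompts where the judge prefers model_a over model_b."""
    if not prompts:
        return 0.0
    rng = random.Random(seed)
    wins = 0
    for prompt in prompts:
        answer_a, answer_b = model_a(prompt), model_b(prompt)
        # Randomly swap presentation order to reduce position bias.
        swapped = rng.random() < 0.5
        first, second = (answer_b, answer_a) if swapped else (answer_a, answer_b)
        choice = judge(prompt, first, second)
        # Map the judge's positional choice back to model_a / model_b.
        a_preferred = (choice == 1) if swapped else (choice == 0)
        wins += int(a_preferred)
    return wins / len(prompts)
```

Because the models and the judge are all plain callables, the same harness could compare any two models on the same prompt set as new ones come out.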

john 2023-04-24T17:51:10.300799Z

The same data can be used on any of the models as they come out

john 2023-04-24T17:51:15.025469Z

right

john 2023-04-24T17:51:35.865299Z

Yeah, that's what folks are doing

Rupert (Sevva/All Street) 2023-04-24T17:51:46.361439Z

Or it might even get picked up by other groups and used in their models by default.

john 2023-04-24T17:51:49.258129Z

Having gpt4 score the output against itself

john 2023-04-24T17:52:21.986939Z

yeah, there might be a lot of model sexing going on in the near future lol

john 2023-04-24T17:53:32.674479Z

People folding weights back in from branch LLMs that were trained on other datasets

john 2023-04-24T17:58:48.460909Z

https://github.com/sourcegraph/awesome-code-ai

john 2023-04-24T18:00:35.004599Z

This is interesting https://replit.com/site/ghostwriter Aren't there clojure folks over at replit?

john 2023-04-24T18:07:32.968889Z

Codeium is free for individuals and claims to support clojure here: https://codeium.com/blog/code-assistant-comparison-copilot-tabnine-ghostwriter-codeium

john 2023-04-24T18:11:21.378889Z

Open source version of copilot: https://github.com/fauxpilot/fauxpilot

john 2023-04-24T18:15:36.767119Z

https://github.com/Hannibal046/Awesome-LLM

john 2023-04-24T18:24:08.280979Z

Might be worth looking first into dolly or pythia as they allow commercial use

john 2023-04-24T18:24:44.483729Z

Open Assistant will likely have some decent pythia models coming out soon too

john 2023-04-24T22:27:35.312999Z

> use the combination of HuggingFace, DeepSpeed, and Ray to build a system for fine-tuning and serving LLMs, in 40 minutes for less than $7 for a 6 billion parameter model

Carlo 2023-04-25T11:27:56.498589Z

If I can be of help here, I'm all for it! In my experience trying to get llms to spit reasonable clojure, chatgpt is really the only one that does an acceptable job. But then, maybe we just need to gather or generate a training dataset and finetune codealpaca?

Carlo 2023-04-25T11:30:08.205109Z

The issue with fine tuning openai stuff is that you can't finetune the chat models; you can finetune the davinci llm, which has a quality comparable to chatgpt while costing 10x more per token. On coding tasks the tokens pile up fast, on top of the privacy issues

Carlo 2023-04-25T11:31:45.285889Z

I'm not sure how good a 6 billion parameter model can get at generating code. I have huge doubts, but I'm curious 🙂

Carlo 2023-04-25T11:32:56.469859Z

Using logs as a starting point for a high quality dataset seems a great idea!

Rupert (Sevva/All Street) 2023-04-25T11:38:46.669659Z

> If I can be of help here, I'm all for it!
Great, the more the merrier. If we have good fine tuning data then we can use it to fine tune multiple models, both now and as future models come out. If we have the end to end workflow, we can optimise it later, or at least learn its limitations.