#community-development
2023-04-24
john 22:04:18

There's been some talk of a community effort to make a "ClojureLLM" code assistant. As it turns out, the logs in the beginners channel and a few others might be really useful for building instruction trees for fine-tuning on talking through code. Looking at https://clojurians-log.clojureverse.org/x/x/x/about I see that Gaiwan and Lambda Island are hosting the data. The data is being hosted publicly so it could be scraped by a bot to build the dataset, but I'm curious if anyone with access to the raw data would like to pitch it in. Also I'm curious if anyone in this channel might have ideas for other Clojure datasets that might be useful for a community "ClojureLLM" and where to find them.

seancorfield 23:04:32

Not all channels have @U055W814A so not all channels are logged to that clojureverse archive -- people may have good reason for not inviting the logging bots (zulip too) into a channel, if they don't want it "public". So no one is getting access to "the raw data", only what's already public. This is probably also a good point to remind folks of the Slack terms of service which state that anything posted to Slack belongs to the organization or people who "own" the Slack workspace -- so using something like ChatGPT on the clojureverse archive might need permission from the moderator team but wouldn't need permission from the folks who posted the original content... ...but as one of those "owners" I would definitely want to hear how people here feel about code fragments they've posted (and other information) getting sucked into some LLM training data set?

john 23:04:24

Sorry Sean, your cyber spirit has been imprinted all over the internet and you're getting uploaded whether you like it or not

john 23:04:55

"ChatGPT, tell me how to solve this clojure problem in the style of Sean Corfield"

😂 4
john 23:04:28

(kidding)

john 23:04:16

I'm also interested in what people think though

john 23:04:27

I'm also curious whether the Clojure Core team would like to have the clojure.core codebase included in datasets. I know that the Red Pajamas dataset that was just released (which many open source models will be built on, including coding assistants) does not include EPL code projects in the github scrape.

john 23:04:38

It may well be that making a bot that is an expert in a particular language benefits from having direct access to the source code of the language

seancorfield 05:04:00

So... I asked ChatGPT about this:

> Well, that's an interesting question there, my friend. I can certainly understand the temptation to use the Clojure community's Slack contents to train a ChatGPT instance about the programming language. After all, the Clojure community is a vibrant and active one, and there's no doubt that there's a lot of valuable information being shared on that platform.

> However, before we rush headlong into such an endeavor, we need to take a step back and think about what we're trying to achieve here. Yes, it's certainly true that training a ChatGPT instance on the Slack contents of the Clojure community could potentially be a powerful tool for learning about the language. But we need to be careful not to overestimate the value of such an approach.

> You see, the thing about programming languages is that they're more than just collections of syntax and semantics. They're living, breathing ecosystems that are constantly evolving and changing. And while the Clojure community's Slack contents can certainly provide us with a wealth of information about the language, they can't give us the full picture of what it means to work with Clojure in the real world.

> That's why I think it's important to approach this question with a healthy dose of skepticism. Yes, using the Clojure community's Slack contents to train a ChatGPT instance could be a valuable tool for learning about the language. But we need to be careful not to rely too heavily on this approach, and to remember that there's no substitute for real-world experience when it comes to mastering a programming language.

> So in summary, while using the Clojure community's Slack contents to train a ChatGPT instance could be a useful tool for learning about the language, we need to be careful not to overestimate its value and to remember that real-world experience is always the best teacher.

(and, yes, I said "Can you answer this question in the style of Sean Corfield?")

Rupert (All Street) 06:04:43

My feedback on the ChatGPT answer above is that there is a fair bit of waffling (lots of repetition and little substance); it certainly does not read like a Sean Corfield answer to me!

jjttjj 13:04:57

> the Red Pajamas dataset that was just released (which many open source models will be built on, including coding assistants) does not include EPL code projects in the github scrape.

That's interesting. I've noticed ChatGPT is generally worse at programming Clojure than the mainstream languages (where it can pretty much build you a working app from scratch for a lot of boilerplate things). I assumed this makes sense as Clojure is just a pretty niche language and thus would have much less overall input to the model. But now I'm curious to what degree it's due to the licensing of a lot of the Clojure libraries being EPL?

💯 2
john 21:04:19

@U04V70XH6 On another note, I do think that if we get a good amount of training data out of the clojurians archive, we'll have one of the most cordial, professional and intelligent hacker bots out there. These LLMs tend to take on the personalities of the people they're trained on, so I think the quality of Clojure's community will be reflected in the bot.

👍 2
john 21:04:10

FYI #C054XC5JVDZ now exists. Feel free to poke your head in there if anyone wants to rap about the data question.

cfleming 02:04:35

ClojureVerse would be another obvious thing to train on, and the Clojure mailing list (pretty dead these days, but wasn’t always so). But the same data provenance issues apply to those as well, of course.

john 22:04:53

There are a few GitHub corpora published that likely have a few gigs of Clojure data in them

john 22:04:40

But @rupert was also mentioning that Stack Overflow might have good Clojure discussion trees