Not sure if this is the right place. I tried to give llama.clj a spin. It worked for the Qwen 2 model in the notes, but if I try to use a Qwen 3 model (GGUF file) it errors. Any guidance? Error in thread.
https://github.com/phronmophobic/llama.clj/issues/18 I will still try to update it myself when I can. (still trying to catch up with all this MCP stuff!)
Ps. Do you have any hot takes on using llama.clj directly with RAG or whatever approaches, vs the MCP praxis?
I haven't had a chance to catch up with the MCP stuff. I'm still trying to get #easel up and running so I can use it full time.
My guess is that the LLMs available via the API are higher quality. I think the llama.clj options can also be interesting as freely available LLMs get better. Running everything locally can offer more privacy and doesn't require a network connection.
but I don't know.
I still have trouble believing that an LLM would be worth using to write most of the code that I'm working on, but maybe if I gave it a real try it would be useful.
> I’m still trying to get #easel up and running so I can use it full time. Yes please! Waiting with bated breath. Interactive development is why I got into Clojure, yet all the various client/server stuff and distended runtimes, even at the fine-grained REPL access layer, is endlessly confusing to me and feels like accidental complexity.
We'll see how things play out, but I think I'm starting to realize that having my IDE in another language/process/runtime is just enough of a barrier that I don't invest in it even though it's a lisp (emacs).
> I still have trouble believing that an LLM would be worth using to write most of the code that I’m working on. I’m sort of thinking of a level of fuzzy tool calling, rather than writing novel code. Maybe using a local LLM that could run tools almost deterministically within a defined space, but provide some common sense reasoning and a natural language interface. I’m assuming this might not need the heaviest world-beating models, and also would have the benefit of being able to run offline. One scenario I’m interested in is querying my personal PKM knowledge base. The data is there. I want to ask freeform questions (that would be formatted to proper queries). There exists an MCP server. But I’m hesitant to open my most personal data to OpenAI or Anthropic.
> One scenario I’m interested in is querying my personal PKM knowledge base. > The data is there. I want to ask freeform questions (that would be formatted to proper queries). But I’m hesitant to open my most personal data to OpenAI or Anthropic. Yea, I think this is an awesome idea
You can just record thoughts and have a fuzzy index.
I think it's the type of problem that AIs are actually already pretty good at. I don't think it would be that hard to implement either.
> record thoughts and have a fuzzy index Not quite getting this. Prior art?
I was thinking something like https://www.youtube.com/watch?v=OxzUjpihIH4. You insert a fact like "Using LLMs as a personal knowledge DB seems like an interesting idea.". The tool turns that into RDF. You can then query by creating a natural language question, which the LLM turns into datalog and pulls up related facts.
It's pretty half baked, but something like that
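To make the fact/query idea concrete, here is a minimal sketch of the storage and query half only. It assumes the LLM steps (turning a sentence into triples, and a question into patterns) happen elsewhere; the triples, the `?var` convention, and the `match`/`query` helpers are all made up for illustration and are not taken from the video or from any library.

```python
# Facts are (subject, predicate, object) triples. In the idea above an LLM
# would produce these from free text; here they are written by hand.
FACTS = [
    ("llm-knowledge-db", "is-a", "idea"),
    ("llm-knowledge-db", "seems", "interesting"),
    ("llm-knowledge-db", "uses", "llm"),
    ("easel", "is-a", "ide"),
]

def is_var(term):
    """Terms starting with '?' are query variables (a hypothetical convention)."""
    return isinstance(term, str) and term.startswith("?")

def match(pattern, triple, bindings):
    """Unify one pattern with one triple; return extended bindings or None."""
    new = dict(bindings)
    for p, t in zip(pattern, triple):
        if is_var(p):
            # Bind the variable, or check it agrees with an earlier binding.
            if new.setdefault(p, t) != t:
                return None
        elif p != t:
            return None
    return new

def query(facts, patterns):
    """Return every set of bindings that satisfies all patterns (a join)."""
    results = [{}]
    for pattern in patterns:
        next_results = []
        for bindings in results:
            for triple in facts:
                extended = match(pattern, triple, bindings)
                if extended is not None:
                    next_results.append(extended)
        results = next_results
    return results

# "Which ideas seem like something, and what?"
answers = query(FACTS, [("?x", "is-a", "idea"), ("?x", "seems", "?how")])
print(answers)
```

A real datalog engine would add rules and indexes; this only shows the shape of the data the LLM would have to produce on each side.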
You can also create embeddings for thoughts that can be queried via RAG
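A rough sketch of the embeddings variant, under a big assumption: `embed` here is a toy bag-of-words vector standing in for a real embedding model, so it only matches on shared words. `ThoughtIndex` and the sample thoughts are hypothetical names for illustration.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy 'embedding': word counts. A real setup would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class ThoughtIndex:
    """Stores thoughts with their vectors and returns the closest ones."""

    def __init__(self):
        self.entries = []  # list of (text, vector) pairs

    def add(self, text):
        self.entries.append((text, embed(text)))

    def search(self, question, k=3):
        q = embed(question)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[1]), reverse=True)
        return [text for text, _ in ranked[:k]]

index = ThoughtIndex()
index.add("Running models locally offers more privacy")
index.add("Emacs is a lisp IDE in another process")
index.add("GGUF is the file format llama.cpp loads")
print(index.search("privacy of local models", k=1))
```

In a RAG setup the retrieved thoughts would then be pasted into the prompt for the local model, which is the part this sketch leaves out.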
Thanks for the tip! Will explore that. Do you have any recommendations for a model (size) for a MacBook Pro with 16 GB RAM (that would be general purpose, and handle this text-to-query stuff)? The README says to start with 0.5B for a test. Do you have any sense of the sweet spot for the upper limit, if I want to try out the best possible local reasoning within my hardware budget, with usable latency? PS. I know this is subjective, and depends on the particular model, and how much resources I want to use atm. Just trying to suss out the ballpark to experiment with.
Unfortunately, I have no idea. I've only tested a few models
I bet https://www.reddit.com/r/LocalLLaMA/ has some recommendations
Thx. Just poking around it seems I should be able to run the 7-8B models ok
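For a ballpark, the weights alone take roughly parameters × bits per weight / 8 bytes. A small sketch of that arithmetic follows; `gguf_size_gib` is a made-up helper, and the 4.5 bits-per-weight figure for a 4-bit quant is an assumption (the 8.50 BPW for Q8_0 is what the loader log in the thread reports). It ignores the KV cache and other runtime memory, so real usage will be higher.

```python
def gguf_size_gib(params_billion, bits_per_weight):
    """Approximate size of the weights in GiB: parameters x bits per weight / 8."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30

# 8.5 BPW is the Q8_0 figure from the loader log; 4.5 BPW is an assumed
# figure for a 4-bit quantization.
for params in (0.6, 7, 8):
    for bpw in (4.5, 8.5):
        size = gguf_size_gib(params, bpw)
        print(f"{params}B at {bpw} BPW: about {size:.2f} GiB")
```

As a sanity check, 0.6B at 8.5 BPW comes out near 608 MiB, in the same ballpark as the 604.15 MiB file size in the log. The Metal device in that log reports 10922 MiB free, which is the number to compare the estimates against.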
Just for reference, the current setup works fine with the Qwen 2.5 model 🎉 Took me a while to realize I didn't have to join split .gguf files 😅 But the 7B model seemed to load fine.
I noticed the newer models don't necessarily publish in .gguf format. Is that something that would be resolved with a llama.cpp update, or does llama.clj need an issue for that?
As far as I know, llama.cpp requires models to be in gguf format, but there are tools in the llama.cpp repo for converting models to gguf from other formats.
https://github.com/ggml-org/llama.cpp?tab=readme-ov-file#obtaining-and-quantizing-models
> llama.cpp requires the model to be stored in the GGUF (https://github.com/ggml-org/ggml/blob/master/docs/gguf.md) file format. Models in other data formats can be converted to GGUF using the convert_*.py Python scripts in this repo.
clojure -M:mvn-llama -m com.phronemophobic.llama "models/Qwen3-0.6B-Q8_0.gguf" "what is 2 + 2?"
llama_model_load_from_file_impl: using device Metal (Apple M1) - 10922 MiB free
llama_model_loader: loaded meta data with 28 key-value pairs and 310 tensors from models/Qwen3-0.6B-Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen3
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen3 0.6B Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Qwen3
llama_model_loader: - kv 5: general.size_label str = 0.6B
llama_model_loader: - kv 6: qwen3.block_count u32 = 28
llama_model_loader: - kv 7: qwen3.context_length u32 = 40960
llama_model_loader: - kv 8: qwen3.embedding_length u32 = 1024
llama_model_loader: - kv 9: qwen3.feed_forward_length u32 = 3072
llama_model_loader: - kv 10: qwen3.attention.head_count u32 = 16
llama_model_loader: - kv 11: qwen3.attention.head_count_kv u32 = 8
llama_model_loader: - kv 12: qwen3.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 13: qwen3.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 14: qwen3.attention.key_length u32 = 128
llama_model_loader: - kv 15: qwen3.attention.value_length u32 = 128
llama_model_loader: - kv 16: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 17: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 18: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 19: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 20: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 21: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 22: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 23: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 24: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 25: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
llama_model_loader: - kv 26: general.quantization_version u32 = 2
llama_model_loader: - kv 27: general.file_type u32 = 7
llama_model_loader: - type f32: 113 tensors
llama_model_loader: - type q8_0: 197 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q8_0
print_info: file size = 604.15 MiB (8.50 BPW)
llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'qwen3'
llama_model_load_from_file_impl: failed to load model
Execution error (ExceptionInfo) at com.phronemophobic.llama.raw-gguf-b4634/create-context (raw_gguf_b4634.clj:655).
Error creating model
Full report at:
/var/folders/9n/gnkbswvn6l30l2jt5dzn_cc00000gn/T/clojure-2657754005343506520.edn
{:clojure.main/message
"Execution error (ExceptionInfo) at com.phronemophobic.llama.raw-gguf-b4634/create-context (raw_gguf_b4634.clj:655).\nError creating model\n",
:clojure.main/triage
{:clojure.error/class clojure.lang.ExceptionInfo,
:clojure.error/line 655,
:clojure.error/cause "Error creating model",
:clojure.error/symbol
com.phronemophobic.llama.raw-gguf-b4634/create-context,
:clojure.error/source "raw_gguf_b4634.clj",
:clojure.error/phase :execution},
:clojure.main/trace
{:via
[{:type clojure.lang.ExceptionInfo,
:message "Error creating model",
:data {:params nil, :model-path "models/Qwen3-0.6B-Q8_0.gguf"},
:at
[com.phronemophobic.llama.raw_gguf_b4634$create_context
invokeStatic
"raw_gguf_b4634.clj"
655]}],
:trace
[[com.phronemophobic.llama.raw_gguf_b4634$create_context
invokeStatic
"raw_gguf_b4634.clj"
655]
[com.phronemophobic.llama.raw_gguf_b4634$create_context
invoke
"raw_gguf_b4634.clj"
645]
[com.phronemophobic.llama.raw_gguf_b4634$reify__22943
create_context
"raw_gguf_b4634.clj"
791]
[com.phronemophobic.llama$create_context
invokeStatic
"llama.clj"
231]
[com.phronemophobic.llama$create_context invoke "llama.clj" 151]
[com.phronemophobic.llama$create_context
invokeStatic
"llama.clj"
198]
[com.phronemophobic.llama$create_context invoke "llama.clj" 151]
[com.phronemophobic.llama$_main invokeStatic "llama.clj" 452]
[com.phronemophobic.llama$_main invoke "llama.clj" 451]
[clojure.lang.AFn applyToHelper "AFn.java" 156]
[clojure.lang.AFn applyTo "AFn.java" 144]
[clojure.lang.Var applyTo "Var.java" 707]
[clojure.core$apply invokeStatic "core.clj" 667]
[clojure.main$main_opt invokeStatic "main.clj" 515]
[clojure.main$main_opt invoke "main.clj" 511]
[clojure.main$main invokeStatic "main.clj" 665]
[clojure.main$main doInvoke "main.clj" 617]
[clojure.lang.RestFn applyTo "RestFn.java" 140]
[clojure.lang.Var applyTo "Var.java" 707]
[clojure.main main "main.java" 40]],
:cause "Error creating model",
:data {:params nil, :model-path "models/Qwen3-0.6B-Q8_0.gguf"}}}
Most models have a slightly different architecture, which requires llama.cpp to be updated. Depending on how new the model is, llama.cpp may require updating.
Or it might be a bug.
I try to update llama.cpp every few months. Previously, they were making breaking changes every 1-2 weeks. It's possible the part of the API that llama.clj uses has stabilized a bit.
If llama.cpp didn’t make breaking changes all the time, I would just publish a new build of the native library regularly.
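One way to tell in advance whether a model needs a newer llama.cpp is to read `general.architecture` from the file header and compare it with what the bundled build knows about. Below is a sketch of a header reader, assuming the GGUF v2/v3 layout described in the GGUF spec linked earlier (little-endian, 64-bit counts); `read_architecture` is a hypothetical helper, not part of llama.clj or llama.cpp.

```python
import struct

# Scalar GGUF value type ids mapped to struct format codes.
SCALARS = {0: "B", 1: "b", 2: "H", 3: "h", 4: "I", 5: "i",
           6: "f", 7: "?", 10: "Q", 11: "q", 12: "d"}
STRING, ARRAY = 8, 9

def _read(f, fmt):
    """Read one little-endian scalar from the file object."""
    size = struct.calcsize("<" + fmt)
    data = f.read(size)
    if len(data) != size:
        raise ValueError("unexpected end of file")
    return struct.unpack("<" + fmt, data)[0]

def _read_string(f):
    """GGUF strings are a uint64 length followed by UTF-8 bytes."""
    length = _read(f, "Q")
    return f.read(length).decode("utf-8", errors="replace")

def _read_value(f, vtype):
    if vtype in SCALARS:
        return _read(f, SCALARS[vtype])
    if vtype == STRING:
        return _read_string(f)
    if vtype == ARRAY:
        elem_type = _read(f, "I")
        count = _read(f, "Q")
        return [_read_value(f, elem_type) for _ in range(count)]
    raise ValueError(f"unknown GGUF value type {vtype}")

def read_architecture(f):
    """Return the general.architecture string, or None if the key is absent."""
    if f.read(4) != b"GGUF":
        raise ValueError("not a GGUF file")
    _version = _read(f, "I")
    _tensor_count = _read(f, "Q")
    kv_count = _read(f, "Q")
    for _ in range(kv_count):
        key = _read_string(f)
        value = _read_value(f, _read(f, "I"))
        if key == "general.architecture":
            return value
    return None

# Usage (path taken from the log above):
# with open("models/Qwen3-0.6B-Q8_0.gguf", "rb") as f:
#     print(read_architecture(f))
```

For the file in the log this should report the same value the loader printed for `general.architecture`. Which architectures a given llama.cpp build accepts still has to come from that build itself; this only reads what the file claims.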
Ok, thanks, I'll give that a shot
If you file an issue on github, I can try to look into it later this week