membrane

chromalchemy 2025-05-27T18:44:28.184719Z

Not sure if this is the right place. I tried to give llama.clj a spin. It worked for the Qwen 2 model in the notes, but if I try to use a Qwen 3 model (gguf file) it errors. Any guidance? Error in thread.

chromalchemy 2025-05-29T16:40:29.316629Z

https://github.com/phronmophobic/llama.clj/issues/18 I will still try to update it myself when I can. (still trying to catch up with all this MCP stuff!)

🙏 1
chromalchemy 2025-05-29T16:41:33.602519Z

Ps. Do you have any hot takes on using llama.clj directly with RAG or whatever approaches, vs the MCP praxis?

phronmophobic 2025-05-29T16:43:11.411459Z

I haven't had a chance to catch up with the MCP stuff. I'm still trying to get #easel up and running so I can use it full time.

phronmophobic 2025-05-29T16:44:44.481849Z

My guess is that the LLMs available via the API are higher quality. I think the llama.clj options can also be interesting as freely available LLMs get better. Running everything locally can offer more privacy and doesn't require a network connection.

phronmophobic 2025-05-29T16:44:57.279559Z

but I don't know.

phronmophobic 2025-05-29T16:45:41.517719Z

I still have trouble believing that an LLM would be worth using to write most of the code that I'm working on, but maybe if I gave it a real try it would be useful.

chromalchemy 2025-05-29T16:46:21.627879Z

> I'm still trying to get #easel up and running so I can use it full time.
Yes please! Waiting with bated breath. Interactive development is why I got into Clojure, yet all the various client/server stuff and distended runtimes, even at the fine-grained REPL access layer, is endlessly confusing to me and feels like accidental complexity.

phronmophobic 2025-05-29T16:49:57.605319Z

We'll see how things play out, but I think I'm starting to realize that having my IDE in another language/process/runtime is just enough of a barrier that I don't invest in it even though it's a lisp (emacs).

chromalchemy 2025-05-29T16:53:32.452539Z

> I still have trouble believing that an LLM would be worth using to write most of the code that I'm working on
I'm thinking more of a level of fuzzy tool calling, rather than writing novel code. Maybe using a local LLM that could run tools almost deterministically within a defined space, but provide some common-sense reasoning and a natural-language interface. I'm assuming this might not need the heaviest world-beating models, and it would also have the benefit of being able to run offline. One scenario I'm interested in is querying my personal PKM knowledge base. The data is there. I want to ask freeform questions (that would be formatted into proper queries). There exists an MCP server, but I'm hesitant to open my most personal data to OpenAI or Anthropic.

phronmophobic 2025-05-29T16:54:22.074239Z

> One scenario I'm interested in is querying my personal PKM knowledge base.
> The data is there. I want to ask freeform questions (that would be formatted to proper queries). But I'm hesitant to open my most personal data to OpenAI or Anthropic.
Yea, I think this is an awesome idea

phronmophobic 2025-05-29T16:54:52.639229Z

You can just record thoughts and have a fuzzy index.

phronmophobic 2025-05-29T16:55:47.279789Z

I think it's the type of problem that AIs are actually already pretty good at. I don't think it would be that hard to implement either.

chromalchemy 2025-05-29T16:57:11.172659Z

> record thoughts and have a fuzzy index
Not quite getting this. Prior art?

phronmophobic 2025-05-29T17:05:09.949849Z

I was thinking something like https://www.youtube.com/watch?v=OxzUjpihIH4. You insert a fact like "Using LLMs as a personal knowledge DB seems like an interesting idea.". The tool turns that into RDF. You can then query by creating a natural language question, which the LLM turns into datalog and pulls up related facts.
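A rough sketch of the storage-and-lookup half of that idea, in plain Python. The LLM steps (sentence to triples, question to query) are left out entirely: the triples and the query patterns below are hand-written stand-ins for what a model might emit, and the names `FACTS`, `match`, and `query` are hypothetical, not taken from the video or any tool in this thread. Facts are subject/predicate/object triples, and a query is a list of datalog-style patterns where a term starting with `?` is a variable.

```python
# Hand-written stand-ins for triples an LLM might extract from notes (assumption).
FACTS = [
    ("llm-knowledge-db", "is-a", "idea"),
    ("llm-knowledge-db", "rated", "interesting"),
    ("gguf", "is-a", "file-format"),
]


def is_var(term):
    """A term such as '?x' is a variable; anything else is a constant."""
    return term.startswith("?")


def match(pattern, triple, bindings):
    """Unify one pattern with one triple.

    Returns the bindings extended with any new variables, or None on mismatch.
    """
    new = dict(bindings)
    for p, t in zip(pattern, triple):
        if is_var(p):
            # Bind the variable, or check it agrees with an earlier binding.
            if new.setdefault(p, t) != t:
                return None
        elif p != t:
            return None
    return new


def query(facts, patterns):
    """Return every set of variable bindings that satisfies all patterns."""
    results = [{}]
    for pattern in patterns:
        results = [
            extended
            for bindings in results
            for fact in facts
            if (extended := match(pattern, fact, bindings)) is not None
        ]
    return results


# "Which ideas did I rate as interesting?" written out as patterns by hand.
print(query(FACTS, [("?x", "is-a", "idea"), ("?x", "rated", "interesting")]))
# prints [{'?x': 'llm-knowledge-db'}]
```

In a real setup the pattern list would come from the model, which is where the "fuzzy" part lives; the lookup itself stays deterministic.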

phronmophobic 2025-05-29T17:05:25.011449Z

It's pretty half baked, but something like that

phronmophobic 2025-05-29T17:05:58.589779Z

You can also create embeddings for thoughts that can be queried via RAG
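A minimal sketch of the retrieval step in that approach, in plain Python. The `embed` function here is a toy bag-of-words counter standing in for a real embedding model (an assumption made only so the example runs without dependencies), and `NOTES`, `cosine`, and `top_k` are hypothetical names. The shape is the same with real embeddings: embed each note, embed the question, rank notes by cosine similarity, and hand the top few to the LLM as context.

```python
import math
from collections import Counter

# Example notes, paraphrased from points made earlier in this conversation.
NOTES = [
    "llama.cpp requires models in gguf format",
    "running locally offers privacy and needs no network",
    "the qwen 2.5 7B model loaded fine",
]


def embed(text):
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(question, notes, k=2):
    """Return the k notes most similar to the question, best first."""
    q = embed(question)
    return sorted(notes, key=lambda n: cosine(q, embed(n)), reverse=True)[:k]


print(top_k("which format does llama.cpp need", NOTES, k=1))
```

Word counts only match on exact shared words, so this will miss paraphrases; a real embedding model is what makes the index fuzzy.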

chromalchemy 2025-05-29T17:10:12.037819Z

Thanks for the tip! Will explore that. Do you have any recommendations for a model (size) for a MacBook Pro with 16 GB RAM (that would be general purpose, and for this text-to-query stuff)? The readme says to start with 0.5B for a test. Do you have any sense of the sweet spot for the upper limit, if I want to try out the best possible local reasoning within my hardware budget, with usable latency? PS. I know this is subjective, and depends on the particular model and how much resources I want to use atm. Just trying to suss out the ballpark to experiment with.

phronmophobic 2025-05-29T17:11:28.163819Z

Unfortunately, I have no idea. I've only tested a few models

phronmophobic 2025-05-29T17:11:43.032349Z

I bet https://www.reddit.com/r/LocalLLaMA/ has some recommendations

chromalchemy 2025-05-29T17:34:20.965439Z

Thx. Just poking around it seems I should be able to run the 7-8B models ok
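A back-of-the-envelope way to check that, sketched in Python. It estimates the weight size as parameters times bits per weight, and compares it with the free memory that llama.cpp reported in the thread log (10922 MiB on the M1). The 8.50 bits per weight for Q8_0 comes from that same log; the 4.5 figure for a 4-bit quantization and the 0.8 headroom factor (to leave room for context and everything else) are assumptions, and the function names are hypothetical.

```python
def estimated_weight_mib(n_params, bits_per_weight):
    """Approximate size of the quantized weights in MiB."""
    return n_params * bits_per_weight / 8 / (1024 ** 2)


def fits(n_params, bits_per_weight, free_mib, headroom=0.8):
    """True if the weights fit within a fraction (headroom) of free memory."""
    return estimated_weight_mib(n_params, bits_per_weight) <= free_mib * headroom


FREE_MIB = 10922  # "Metal (Apple M1) - 10922 MiB free" from the thread log

for label, params, bpw in [
    ("0.6B at Q8_0", 0.6e9, 8.5),
    ("7B at Q8_0", 7e9, 8.5),
    ("8B at Q8_0", 8e9, 8.5),
    ("14B at Q8_0", 14e9, 8.5),
    ("14B at ~4.5 bpw", 14e9, 4.5),
]:
    size = estimated_weight_mib(params, bpw)
    print(f"{label}: ~{size:.0f} MiB, fits: {fits(params, bpw, FREE_MIB)}")
```

As a sanity check, the 0.6B estimate lands within a few MiB of the 604.15 MiB file size in the log. This only covers the weights, so it says nothing about latency or how much memory a long context needs.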

chromalchemy 2025-05-29T18:17:08.409039Z

Just for reference, the current setup works fine with the Qwen 2.5 model 🎉 Took me a while to realize I didn't have to join split .gguf files 😅 But the 7B model seemed to load fine.

chromalchemy 2025-05-29T18:18:40.564679Z

I noticed the newer models don't necessarily publish in .gguf format. Is that something that would be resolved with a llama.cpp update, or does llama.clj need an issue for that?

phronmophobic 2025-05-29T18:51:37.507919Z

As far as I know, llama.cpp requires models to be in gguf format, but there are tools in the llama.cpp repo for converting models to gguf from other formats.

👍 1
phronmophobic 2025-05-29T18:52:23.729669Z

https://github.com/ggml-org/llama.cpp?tab=readme-ov-file#obtaining-and-quantizing-models
> llama.cpp requires the model to be stored in the GGUF (https://github.com/ggml-org/ggml/blob/master/docs/gguf.md) file format. Models in other data formats can be converted to GGUF using the convert_*.py Python scripts in this repo.

chromalchemy 2025-05-27T18:45:04.783589Z

clojure -M:mvn-llama -m com.phronemophobic.llama "models/Qwen3-0.6B-Q8_0.gguf" "what is 2 + 2?"


llama_model_load_from_file_impl: using device Metal (Apple M1) - 10922 MiB free
llama_model_loader: loaded meta data with 28 key-value pairs and 310 tensors from models/Qwen3-0.6B-Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen3 0.6B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Qwen3
llama_model_loader: - kv   5:                         general.size_label str              = 0.6B
llama_model_loader: - kv   6:                          qwen3.block_count u32              = 28
llama_model_loader: - kv   7:                       qwen3.context_length u32              = 40960
llama_model_loader: - kv   8:                     qwen3.embedding_length u32              = 1024
llama_model_loader: - kv   9:                  qwen3.feed_forward_length u32              = 3072
llama_model_loader: - kv  10:                 qwen3.attention.head_count u32              = 16
llama_model_loader: - kv  11:              qwen3.attention.head_count_kv u32              = 8
llama_model_loader: - kv  12:                       qwen3.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  13:     qwen3.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  14:                 qwen3.attention.key_length u32              = 128
llama_model_loader: - kv  15:               qwen3.attention.value_length u32              = 128
llama_model_loader: - kv  16:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  17:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  18:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  19:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  20:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  21:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  22:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  23:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  24:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  25:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  26:               general.quantization_version u32              = 2
llama_model_loader: - kv  27:                          general.file_type u32              = 7
llama_model_loader: - type  f32:  113 tensors
llama_model_loader: - type q8_0:  197 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q8_0
print_info: file size   = 604.15 MiB (8.50 BPW) 
llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'qwen3'
llama_model_load_from_file_impl: failed to load model
Execution error (ExceptionInfo) at com.phronemophobic.llama.raw-gguf-b4634/create-context (raw_gguf_b4634.clj:655).
Error creating model

Full report at:
/var/folders/9n/gnkbswvn6l30l2jt5dzn_cc00000gn/T/clojure-2657754005343506520.edn

chromalchemy 2025-05-27T18:45:08.626759Z

{:clojure.main/message
 "Execution error (ExceptionInfo) at com.phronemophobic.llama.raw-gguf-b4634/create-context (raw_gguf_b4634.clj:655).\nError creating model\n",
 :clojure.main/triage
 {:clojure.error/class clojure.lang.ExceptionInfo,
  :clojure.error/line 655,
  :clojure.error/cause "Error creating model",
  :clojure.error/symbol
  com.phronemophobic.llama.raw-gguf-b4634/create-context,
  :clojure.error/source "raw_gguf_b4634.clj",
  :clojure.error/phase :execution},
 :clojure.main/trace
 {:via
  [{:type clojure.lang.ExceptionInfo,
    :message "Error creating model",
    :data {:params nil, :model-path "models/Qwen3-0.6B-Q8_0.gguf"},
    :at
    [com.phronemophobic.llama.raw_gguf_b4634$create_context
     invokeStatic
     "raw_gguf_b4634.clj"
     655]}],
  :trace
  [[com.phronemophobic.llama.raw_gguf_b4634$create_context
    invokeStatic
    "raw_gguf_b4634.clj"
    655]
   [com.phronemophobic.llama.raw_gguf_b4634$create_context
    invoke
    "raw_gguf_b4634.clj"
    645]
   [com.phronemophobic.llama.raw_gguf_b4634$reify__22943
    create_context
    "raw_gguf_b4634.clj"
    791]
   [com.phronemophobic.llama$create_context
    invokeStatic
    "llama.clj"
    231]
   [com.phronemophobic.llama$create_context invoke "llama.clj" 151]
   [com.phronemophobic.llama$create_context
    invokeStatic
    "llama.clj"
    198]
   [com.phronemophobic.llama$create_context invoke "llama.clj" 151]
   [com.phronemophobic.llama$_main invokeStatic "llama.clj" 452]
   [com.phronemophobic.llama$_main invoke "llama.clj" 451]
   [clojure.lang.AFn applyToHelper "AFn.java" 156]
   [clojure.lang.AFn applyTo "AFn.java" 144]
   [clojure.lang.Var applyTo "Var.java" 707]
   [clojure.core$apply invokeStatic "core.clj" 667]
   [clojure.main$main_opt invokeStatic "main.clj" 515]
   [clojure.main$main_opt invoke "main.clj" 511]
   [clojure.main$main invokeStatic "main.clj" 665]
   [clojure.main$main doInvoke "main.clj" 617]
   [clojure.lang.RestFn applyTo "RestFn.java" 140]
   [clojure.lang.Var applyTo "Var.java" 707]
   [clojure.main main "main.java" 40]],
  :cause "Error creating model",
  :data {:params nil, :model-path "models/Qwen3-0.6B-Q8_0.gguf"}}}

phronmophobic 2025-05-27T18:52:52.429259Z

Most models have a slightly different architecture, which requires llama.cpp to be updated. Depending on how new the model is, llama.cpp may require updating.
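For checking which architecture a file declares before trying to load it, here is a small Python sketch that reads the `general.architecture` key straight from the GGUF header (the value the log above prints as `qwen3`). It follows my reading of the GGUF spec and assumes a little-endian file at GGUF version 2 or later, where counts and string lengths are 64-bit; the function names are hypothetical and this is not how llama.cpp or llama.clj do it internally.

```python
import struct

# GGUF scalar value type ids mapped to little-endian struct formats.
_SCALARS = {0: "<B", 1: "<b", 2: "<H", 3: "<h", 4: "<I", 5: "<i",
            6: "<f", 7: "<?", 10: "<Q", 11: "<q", 12: "<d"}
_STRING, _ARRAY = 8, 9


def _read(f, fmt):
    """Read one fixed-size value from a binary file object."""
    size = struct.calcsize(fmt)
    data = f.read(size)
    if len(data) != size:
        raise ValueError("unexpected end of file")
    return struct.unpack(fmt, data)[0]


def _read_string(f):
    """GGUF strings are a 64-bit length followed by UTF-8 bytes."""
    length = _read(f, "<Q")
    return f.read(length).decode("utf-8")


def _read_value(f, value_type):
    """Read a metadata value of the given type id."""
    if value_type in _SCALARS:
        return _read(f, _SCALARS[value_type])
    if value_type == _STRING:
        return _read_string(f)
    if value_type == _ARRAY:
        element_type = _read(f, "<I")
        count = _read(f, "<Q")
        return [_read_value(f, element_type) for _ in range(count)]
    raise ValueError(f"unknown GGUF value type {value_type}")


def gguf_architecture(f):
    """Return the general.architecture string, or None if the key is absent."""
    if f.read(4) != b"GGUF":
        raise ValueError("not a GGUF file")
    _read(f, "<I")             # format version (not used here)
    _read(f, "<Q")             # tensor count (not used here)
    kv_count = _read(f, "<Q")  # number of metadata key/value pairs
    for _ in range(kv_count):
        key = _read_string(f)
        value = _read_value(f, _read(f, "<I"))
        if key == "general.architecture":
            return value
    return None
```

Usage would look like `with open("models/Qwen3-0.6B-Q8_0.gguf", "rb") as f: print(gguf_architecture(f))`. Whether the bundled llama.cpp build knows that architecture still has to be checked separately.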

phronmophobic 2025-05-27T18:53:18.264359Z

Or it might be a bug.

phronmophobic 2025-05-27T18:54:58.327659Z

I try to update llama.cpp every few months. Previously, they were making breaking changes every 1-2 weeks. It's possible the part of the API that llama.clj uses has stabilized a bit.

phronmophobic 2025-05-27T19:03:24.575609Z

If llama.cpp didn’t make breaking changes all the time, I would just publish a new build of the native library regularly.

chromalchemy 2025-05-27T20:39:00.317119Z

Ok, thanks, I'll give that a shot

phronmophobic 2025-05-27T21:07:24.980389Z

If you file an issue on GitHub, I can try to look into it later this week