announcements

tony.kay 2026-05-21T13:47:39.483489Z

I’ve created an LLM-driven, SQLite-based code search tool for CLJ(C). It computes a SHA-like signature of each function so it can do targeted change detection, and it indexes as follows:
• Send the code form to an LLM for a description, OR in fast mode just use the docstring
  ◦ This allows indexing of poorly documented code
• Store the result in SQLite (without embeddings for now), with FTS5 and BM25 ranking over (qualified_name, description_llm, docstring, tags_llm, domain_signals_llm).
https://github.com/awkay/clj-code-search
The result is a high-speed bb tool (installable with bbin) where you or an LLM can ask questions:

code-search "path string"
and get a near-instant list of the functions in your code base that are of interest.
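[Editor's sketch] The FTS5-with-BM25 approach described above can be illustrated roughly as follows. This is a minimal Python sketch, not the tool's actual code: only the five column names come from the message, while the table name `fn_index`, the sample rows, and the choice to OR the query terms together are assumptions made for illustration. It also assumes the local SQLite build includes FTS5.

```python
import sqlite3

# Hypothetical sample rows: (qualified_name, description_llm, docstring,
# tags_llm, domain_signals_llm). None of these come from the real index.
ROWS = [
    ("demo.strings/without-trailing-slash",
     "Removes a trailing slash from a path string.",
     "Ensures the given string does not end with a forward slash.",
     "string path normalization", "url"),
    ("demo.strings/trailing-slash",
     "Adds a trailing slash to a path string when missing.",
     "Ensures the given string has exactly one trailing slash.",
     "string path normalization", "url"),
    ("demo.payroll/total",
     "Sums payroll line items.",
     "",
     "payroll math", "accounting"),
]


def build_index(rows):
    """Create an in-memory FTS5 table over the five indexed columns."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE VIRTUAL TABLE fn_index USING fts5("
        "qualified_name, description_llm, docstring, "
        "tags_llm, domain_signals_llm)"
    )
    conn.executemany("INSERT INTO fn_index VALUES (?, ?, ?, ?, ?)", rows)
    return conn


def search(conn, query, limit=5):
    """Return (qualified_name, score) pairs, best match first.

    Each term is quoted so punctuation is not parsed as FTS5 syntax, and
    terms are OR-ed together (an assumption, not the tool's behaviour).
    FTS5's bm25() yields smaller values for better matches, so ascending
    order puts the best match first.
    """
    terms = " OR ".join(
        '"' + term.replace('"', '""') + '"' for term in query.split()
    )
    sql = (
        "SELECT qualified_name, bm25(fn_index) AS score "
        "FROM fn_index WHERE fn_index MATCH ? "
        "ORDER BY score LIMIT ?"
    )
    return conn.execute(sql, (terms, limit)).fetchall()


conn = build_index(ROWS)
for name, score in search(conn, "trailing slash"):
    print(f"{score:10.4f}  {name}")
```

Because the ranking is computed inside SQLite, the calling script only builds one query string and reads back a handful of rows.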

tony.kay 2026-05-21T13:49:42.104939Z

I tried embeddings using a local ollama model and it was so slow as to be useless…so, in case you’re wondering

respatialized 2026-05-21T17:18:20.739199Z

if local speed is a feature, an encoder-only model may be a better choice for the embeddings for semantic search: https://huggingface.co/docs/transformers/main/en/model_doc/modernbert

respatialized 2026-05-21T17:19:37.058699Z

background: https://huggingface.co/blog/modernbert

respatialized 2026-05-21T17:20:10.912059Z

you'll get much better throughput and probably more "deterministic" results as compared with using an LLM for this, I'd bet

tony.kay 2026-05-21T17:38:41.208619Z

I did try an encoder-only model for embeddings. The point of the LLM usage wasn’t embeddings. The LLM usage is to get descriptions for functions that have no docstring so that there’s something to search for 😄

tony.kay 2026-05-21T17:39:38.982299Z

The “fast” mode in the code search indexing relies on docstrings alone, indexing only documented functions, which is useful when you want to add, say, a library’s functions to the index

tony.kay 2026-05-21T17:40:40.003929Z

My speed comment was that when I put the embeddings into the db, the search got a lot slower (because it’s SCI and math is slow, so searching a large db of possible matches was too CPU intensive). I want sub-ms searches, and my code base has 20k+ functions.
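[Editor's sketch] To make the CPU cost concrete, here is a rough Python sketch of what an un-indexed similarity scan looks like: every stored vector is compared against the query vector on each search. The message does not say how the embeddings were actually searched, so the scan itself, the names `cosine`, `top_k`, and `multiply_adds`, and the 768-dimension figure are all assumptions for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, rows, k=3):
    """Score every (name, vector) row against the query; return the k best."""
    scored = [(cosine(query_vec, vec), name) for name, vec in rows]
    scored.sort(reverse=True)
    return scored[:k]


def multiply_adds(n_rows, dim):
    """Dot-product multiply-adds needed for one full scan."""
    return n_rows * dim


# Assumed figures: 20,000 functions and a 768-dimensional embedding.
print(multiply_adds(20_000, 768))
```

Under those assumed figures a single query needs millions of floating-point operations in interpreted code, whereas a full-text index only touches the rows that contain the query terms.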

Josh 2026-05-21T19:47:12.086059Z

did you try using https://github.com/asg017/sqlite-vec for search? it's pretty fast for 20k docs

tony.kay 2026-05-22T09:24:12.311079Z

This is using vector search…just not embeddings. Which is why I find it “good enough”

steveb8n 2026-05-22T10:28:18.717699Z

This QMD fork does similar things. Works great for me across all my projects https://github.com/nextdoc/qmd/commit/e5ac6379d7535b907765a3b33427bca0f2f1d9b1

tony.kay 2026-05-22T10:39:10.475919Z

Nice, that has some nice attributes. I may give it a shot. Any idea how it is on large projects? I’m indexing 400k+ LOC and need the search speed on 20k+ functions to be super fast…

steveb8n 2026-05-22T10:41:09.305679Z

I haven’t counted up my projects’ lines of code, but I have more than 20 projects, including a few enterprise-software-sized repos, so I haven’t seen anything close to a limit yet. I run on an M4 Mac mini and each search takes 2 to 4 seconds.

steveb8n 2026-05-22T10:41:36.224099Z

The automated search expansion is really valuable

tony.kay 2026-05-22T10:41:53.916869Z

be interested in what cloc says for your project.

tony.kay 2026-05-22T10:42:37.173729Z

The other thing I’m doing is keeping a signature of functions so when I want to reindex (which is often) it can figure out precisely which ones have changed. The signatures tolerate trivial changes (movement and docstring edits).
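[Editor's sketch] One way a change-tolerant signature like this could work is to hash only the parts of a function that matter: leave out the file position and docstring, normalize whitespace, and compare hashes on reindex. The Python sketch below is an illustration under those assumptions, not the tool's implementation. The names `signature` and `changed_functions` are hypothetical, and the function body is assumed to arrive already separated from its docstring.

```python
import hashlib


def signature(qualified_name, body):
    """Hash a function's name and whitespace-normalized body.

    File position and docstring are never part of the input, so moving
    the function or editing its docstring leaves the signature unchanged.
    Collapsing whitespace is a simplification: it would also alter
    whitespace inside string literals.
    """
    normalized = " ".join(body.split())
    payload = f"{qualified_name}\n{normalized}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def changed_functions(stored, current_bodies):
    """Return the names whose signature is new or differs from the stored one.

    stored:         dict of qualified name -> previously saved signature
    current_bodies: dict of qualified name -> current body text
    """
    return sorted(
        name
        for name, body in current_bodies.items()
        if stored.get(name) != signature(name, body)
    )


stored = {"demo/add": signature("demo/add", "[a b] (+ a b)")}
current = {
    "demo/add": "[a b]\n  (+ a b)",   # only reformatted
    "demo/sub": "[a b] (- a b)",      # newly added
}
print(changed_functions(stored, current))
```

Only the names returned by `changed_functions` would need to go back through the expensive description step.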

steveb8n 2026-05-22T10:42:50.759389Z

I won’t get back to work for another 10 hours, but you can very easily install and test it on one repo

tony.kay 2026-05-22T10:42:59.502859Z

yeah, will do. Thanks

tony.kay 2026-05-22T11:48:04.483569Z

OK, here are the results of my trials. qmd has the cool attribute of being able to find code by comprehension without an LLM. My tool has a high initial indexing cost (I’m asking an LLM, through claude -p, to tell me what an undocumented function does and saving that string). So qmd seems to excel when the query is a complicated, concept-oriented one like “where do I obfuscate user data?”: the quality of those results is high, though the runtime on my code base is around 10s. Not great, but for that specific purpose it’s pretty nice. Definitely a complementary tool. My code-search index fails on that kind of query.

But my tool is targeting a different task: is there a function that already does what I want? It’s specifically a function index. The problem I’m solving is LLMs rewriting stuff that already exists, over and over, in my code base. So I want a fast tool so they have a chance of finding the candidates quickly. When I asked qmd that kind of question, it took 20 seconds and did return a match:

Expanding query... (2.5s)
├─ strip slash
├─ lex: remove slashes
├─ lex: cut away slashes
├─ vec: remove slashes
├─ vec: cut away slashes
└─ hyde: The topic of strip slash covers remove slashes. Proper implementation...
Searching 6 queries...
Embedding 4 queries... (557ms)
Reranking 40 chunks... (16.0s)
dataico/lib/strings.cljc:22 #ea110f
Title: strings
Score:  90%

@@ -21,4 @@ (20 before, 87 after)

(>defn trailing-slash
  "Ensures the given string has exactly one trailing slash.
   Converts nil to \"/\".


dataico/model/payroll-replacement.cljc #309fce
but the second ranked item was junk, and it missed the other good candidates. code-search was way faster and still found what I wanted (ran in 0.1 seconds…200x faster):
$ time code-search 'strip slash'
dataico.lib.strings/trailing-slash
  Ensures the given string has exactly one trailing slash.
dataico.lib.strings/without-leading-slash
  Ensures the given string does not start with a forward slash.
dataico.lib.strings/without-trailing-slash
  Ensures the given string does not end with a forward slash.
and the output is way more succinct and compact, though there is an expanded mode that is useful for LLMs:
$ code-search 'strip slash' -v
dataico.lib.strings/trailing-slash
  file: src/main/dataico/lib/strings.cljc:22-32
  desc: Ensures the given string has exactly one trailing slash.
  gp: 0.70  conf: 0.98  callers: 16  tags: string, path, normalization  score: -25.9600
dataico.lib.strings/without-leading-slash
  file: src/main/dataico/lib/strings.cljc:34-44
  desc: Ensures the given string does not start with a forward slash.
  gp: 0.75  conf: 0.98  callers: 15  tags: string, path, normalization  score: -25.0706
dataico.lib.strings/without-trailing-slash
  file: src/main/dataico/lib/strings.cljc:46-56
  desc: Ensures the given string does not end with a forward slash.
  gp: 0.70  conf: 0.98  callers: 14  tags: string, path, normalization  score: -25.0060
still same speed and compact, but with code locations so an LLM can go check for itself without over-reading file content into context.

tony.kay 2026-05-22T11:52:11.582419Z

I will admit that the other (current) main weakness of my tool is that it assumes you’ll use claude -p with haiku on each function of the program (meant for subscription use); it has no “api” mode at present and can’t use a local engine. That said, code-search index --fast will use just docstrings instead of an LLM.