I’ve created an LLM-driven, SQLite-based code search tool for CLJ(C). It computes a SHA-like signature of each function so it can do targeted change detection, and it indexes as follows:
• Send the code form to an LLM for description OR in fast mode just use the docstring
◦ This allows indexing of poorly documented code
• Store the result in SQLite (without embeddings for now), indexed with FTS5 using BM25 ranking over (qualified_name, description_llm, docstring, tags_llm, domain_signals_llm).
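A minimal sketch of the indexing scheme listed above, in Python rather than the tool's own Clojure, and not taken from the repo. The column names come from the list; the table name `fn_index`, the OR-joining of query terms, and the `example.dates/parse-date` row are assumptions made for illustration. It also assumes the local SQLite build includes FTS5.

```python
import sqlite3

# Sample rows: (qualified_name, description_llm, docstring, tags_llm, domain_signals_llm).
# The last row is a hypothetical function added only as a non-matching control.
SAMPLE_ROWS = [
    ("dataico.lib.strings/trailing-slash",
     "Ensures the given string has exactly one trailing slash.",
     "", "string, path, normalization", ""),
    ("dataico.lib.strings/without-leading-slash",
     "Ensures the given string does not start with a forward slash.",
     "", "string, path, normalization", ""),
    ("example.dates/parse-date",
     "Parses an ISO-8601 date string.",
     "", "date, parsing", ""),
]

def build_index(rows):
    """Create an in-memory FTS5 table and load the given rows into it."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE VIRTUAL TABLE fn_index USING fts5("
        "qualified_name, description_llm, docstring, tags_llm, domain_signals_llm)"
    )
    conn.executemany("INSERT INTO fn_index VALUES (?, ?, ?, ?, ?)", rows)
    return conn

def search(conn, query, limit=5):
    """Return (qualified_name, bm25 score) pairs, best match first."""
    # Quote each term and join with OR so a row matching any term is a candidate.
    terms = ['"' + t.replace('"', '""') + '"' for t in query.split()]
    if not terms:
        return []
    # FTS5's bm25() gives better matches lower (more negative) scores;
    # ORDER BY rank sorts by that value, best first.
    sql = ("SELECT qualified_name, bm25(fn_index) FROM fn_index "
           "WHERE fn_index MATCH ? ORDER BY rank LIMIT ?")
    return conn.execute(sql, (" OR ".join(terms), limit)).fetchall()
```

Calling `search(build_index(SAMPLE_ROWS), "strip slash")` returns the two slash helpers and skips the date parser. The scores are negative because FTS5 reports BM25 so that smaller is better.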
https://github.com/awkay/clj-code-search
The result is a high-speed bb tool (installable with bbin) where you or an LLM can ask questions:
code-search "path string"
and get a near-instant response listing the functions in your code base that are of interest.
I tried embeddings using a local Ollama model and it was so slow as to be useless… so, in case you’re wondering
if local speed is a feature, an encoder-only model may be a better choice for the embeddings for semantic search: https://huggingface.co/docs/transformers/main/en/model_doc/modernbert
background: https://huggingface.co/blog/modernbert
you'll get much better throughput and probably more "deterministic" results compared with using an LLM for this, I'd bet
I did try an encoder-only model for the embeddings. The point of the LLM usage wasn’t embeddings. The LLM usage is to get descriptions for functions that have no docstring so that there’s something to search for 😄
The “fast” mode in the code search indexing relies on docstrings alone, indexing only documented functions, which is useful when you want to add, say, a library’s functions to the index
My speed comment was that when I put the encodings into the db, the search got a lot slower (because it’s SCI and math is slow, so scanning a large db of possible matches was too CPU intensive). I want sub-ms searches, and my code base has 20k+ functions.
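To make the CPU cost concrete: without a vector index, an embedding search is a linear scan that computes one similarity per stored function, so each query costs roughly (number of functions × vector dimensions) multiply-adds. A rough Python sketch of that scan, not the tool's actual code; the `cosine` and `top_k` names are made up for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k(query_vec, rows, k=3):
    """Brute-force search: score every (name, vector) row, return the k best."""
    scored = [(cosine(query_vec, vec), name) for name, vec in rows]
    scored.sort(reverse=True)  # highest similarity first
    return [(name, score) for score, name in scored[:k]]
```

With 20k+ rows and embeddings of several hundred dimensions, that loop runs millions of floating-point operations per query, which is the kind of work an interpreter handles poorly.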
did you try using https://github.com/asg017/sqlite-vec for search? it's pretty fast for 20k docs
This is using vector search…just not embeddings. Which is why I find it “good enough”
This QMD fork does similar things. Works great for me across all my projects https://github.com/nextdoc/qmd/commit/e5ac6379d7535b907765a3b33427bca0f2f1d9b1
Nice, that has some nice attributes. I may give it a shot. Any idea how it does on large projects? I’m indexing 400k+ LOC and need the search speed on 20k+ functions to be super fast…
I haven’t counted up my projects’ lines of code, but I have more than 20 of them, including a few enterprise-software-sized repos, so I haven’t seen anything close to a limit yet. I run on an M4 Mac mini and each search takes 2 to 4 seconds.
The automated search expansion is really valuable
be interested in what cloc says for your project.
The other thing I’m doing is keeping a signature of functions so when I want to reindex (which is often) it can figure out precisely which ones have changed. The signatures tolerate trivial changes (movement and docstring edits).
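One way such a change-tolerant signature could work, sketched in Python. This is an assumption about the approach, not the tool's actual scheme: it hashes only the function name and a whitespace-normalized body, so the docstring and the file position never enter the hash. The `signature` and `changed` names are hypothetical.

```python
import hashlib

def signature(name, body, docstring=None):
    """Hash a function so that moving it or editing its docstring changes nothing.

    The docstring argument is accepted but deliberately ignored, and no file
    path or line number is part of the input.
    """
    # Collapse all whitespace runs so re-indentation does not alter the hash.
    # Caveat: this also collapses whitespace inside string literals.
    normalized = " ".join(body.split())
    return hashlib.sha256(f"{name}\n{normalized}".encode("utf-8")).hexdigest()

def changed(old_sigs, current_sigs):
    """Return the names whose signature is new or differs from the stored one."""
    return sorted(n for n, sig in current_sigs.items() if old_sigs.get(n) != sig)
```

On reindex, only the names returned by `changed` would need a fresh LLM description; everything else keeps its stored row.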
I won’t get back to work for another 10 hours, but you can very easily install and test it on one repo
yeah, will do. Thanks
OK, here are the results of my trials. qmd has the cool attribute of being able to find code by comprehension without an LLM. My tool has a high cost of initial indexing (I’m asking an LLM through claude -p to tell me what an undocumented function does and saving that string). So, qmd seems to excel when the query is a complicated, concept-level one like “where do I obfuscate user data?”: the quality of those results is high, though the runtime on my code base is around 10s. So, not great, but for that specific purpose it’s pretty nice. Definitely a complementary tool. My code-search index fails on that kind of query.
But my tool is targeting a different task: is there a function that already does what I want? It’s specifically a function index. The problem I’m solving is LLMs rewriting stuff that already exists over and over in my code base. So, I want a fast tool so they have a chance of finding the candidates quickly. Asking qmd for “strip slash” took 20 seconds, and it did return a match:
Expanding query... (2.5s)
├─ strip slash
├─ lex: remove slashes
├─ lex: cut away slashes
├─ vec: remove slashes
├─ vec: cut away slashes
└─ hyde: The topic of strip slash covers remove slashes. Proper implementation...
Searching 6 queries...
Embedding 4 queries... (557ms)
Reranking 40 chunks... (16.0s)
dataico/lib/strings.cljc:22 #ea110f
Title: strings
Score: 90%
@@ -21,4 @@ (20 before, 87 after)
(>defn trailing-slash
"Ensures the given string has exactly one trailing slash.
Converts nil to \"/\".
dataico/model/payroll-replacement.cljc #309fce
but the second ranked item was junk, and it missed the other good candidates.
code-search was way faster and still found what I wanted (ran in 0.1 seconds…200x faster):
$ time code-search 'strip slash'
dataico.lib.strings/trailing-slash
Ensures the given string has exactly one trailing slash.
dataico.lib.strings/without-leading-slash
Ensures the given string does not start with a forward slash.
dataico.lib.strings/without-trailing-slash
Ensures the given string does not end with a forward slash.
and the output is way more succinct and compact, though there is an expanded mode that is useful for LLMs:
$ code-search 'strip slash' -v
dataico.lib.strings/trailing-slash
file: src/main/dataico/lib/strings.cljc:22-32
desc: Ensures the given string has exactly one trailing slash.
gp: 0.70 conf: 0.98 callers: 16 tags: string, path, normalization score: -25.9600
dataico.lib.strings/without-leading-slash
file: src/main/dataico/lib/strings.cljc:34-44
desc: Ensures the given string does not start with a forward slash.
gp: 0.75 conf: 0.98 callers: 15 tags: string, path, normalization score: -25.0706
dataico.lib.strings/without-trailing-slash
file: src/main/dataico/lib/strings.cljc:46-56
desc: Ensures the given string does not end with a forward slash.
gp: 0.70 conf: 0.98 callers: 14 tags: string, path, normalization score: -25.0060
still the same speed and compact, but with code locations so an LLM can go check for itself without over-reading file content into context.
I will admit that the other (current) main weakness of mine is that it assumes you’ll use claude -p with haiku on each function of the program (meant for subscription use), and it has no “api” mode at present, nor support for a local engine. That said, code-search index --fast will use just docstrings instead of an LLM.