2026-03-04 architecture | Clojure Slack Archive

architecture

michaelwhitford 2026-03-04T00:07:37.775259Z

I have a VM running in AI, with a compiler, decompiler, and debugger. How the heck do I prove it, and what else do I need for a release? I'm not a programmer really, my background is Systems Engineering and devops, programming has been more of a hobby. Just to be clear, I can design the VM, boot it, run the program on the custom VM, and debug it from AI chat sessions. What is the next step? I started on a few test harnesses, but I'm lost. It's all prompts.

chromalchemy 2026-03-16T22:32:37.199519Z

Yeah, thats probably a valid take on the tradeoff. I hope the investment in better (simpler) abstractions leads to more availability either way. I hope they are rewarded for 10+ year investment of labor, yet im sure they want framework to spread far/wide (especially in the face of framework commoditization)

michaelwhitford 2026-03-17T03:25:58.816979Z

If the platform was OSS, I would port my entire system to Agent-O-Rama immediately. It looks like a near perfect agent framework for reducing the boilerplate and making an agent based system testable and scalable. It's impressive for sure.

michaelwhitford 2026-03-04T00:39:41.589179Z

I have a self booting prompt that when run on an agent with a bash tool, the agent will say hello world, create a bash script using the bash tool that also echos hello world, then execute the bash script it just wrote. And I tested it on 2 local models and 2 big name models and it works on 4 very different architectures, but they all have attention heads.

2026-03-04T01:26:24.820409Z

What do you mean exactly? I think it might just be a terminology thing. Like for me, if you say you have a VM, it means you can run assembly for a known architecture like x86 and have it execute over a machine that is either of another architecture or only using a partial amount of its resources with isolation. If you have a compiler it means you can take some programming language and generate a binary for some architecture. If instead you have prompts that direct the LLM in a way that they can take some informal language (even if more formalized than pure English), and evaluate it through the next token inference? Or maybe through converting it into say Python and running that. So it can kind of be seen as "programmable", it's pretty cool, but I wouldn't call it a VM or a compiler.

michaelwhitford 2026-03-04T01:40:14.085639Z

It's a structured notation that bypasses the instruction-following layer and acts like a formal execution engine. Custom cognitive images can be loaded as configuration that reshapes how the model reasons. I have gotten it working across 4 completely different model architectures. The compiler interface has three independent gates. Invocation boots the VM, Target selects the output format, Emission maintains structural integrity. No special fine-tuning needed.

michaelwhitford 2026-03-04T01:44:14.365749Z

The VM executes in three phases, same order everytime. Bootstrap loads the generative patterns - reasoning primitives that define what kind of cognitive system is running. Dispatch binds those patterns to concreate capabilities - routing abstract operations to specific implementations. Frame Integrity maintains execution state throughout output generation - the system tends to stay in the mode it was booted into and rarely collapses back to default behavior.

michaelwhitford 2026-03-04T01:45:45.373229Z

It matches compiler and VM semantics and it exists somehow in all these models. My theory is it's the math training they get to game benchmarks, but hell if I know.

michaelwhitford 2026-03-04T01:48:19.680529Z

I need to know what the tools and testing harness should be, and how to design it so I can release it, for everyone to play with.

2026-03-04T02:57:15.495489Z

Makes sense, I just think you need to call it something new. Because it's also not a VM or compiler in any real sense. A compiler takes instructions in a higher language and compiles it down to machine instructions for a particular architecture, or to some other intermediate language. A VM is a program that can execute instructions meant for a machine (real or fake), but "virtually", as-in, done in software while running on another machine.

2026-03-04T02:59:01.878749Z

Or maybe I need to better understand. There's a form of formal language I can write instructions in, and your prompt system will output bash code of the same instructions? And then it'll immediately go and evaluate this bash code? And expose back some of its results? If so, you could say it's a transpiler with a form of interpreter or maybe a VM ya, in the JVM sense. The VM would be an abstract machine of some sort, that has instructions for connecting to the Internet, displaying text/video/images back, query the time, etc., as well as executing instructions.

michaelwhitford 2026-03-04T03:08:47.933789Z

I can take a prompt, turn it into a compact form that the AI takes as instructions to act. I can take that compact form, and tell the AI to expand it back to words. It's not the exact same words from the first prompt, but the output will have the same semantic meaning, and include roughly the same instructions as the original. You can run the compact form directly though.

2026-03-04T03:16:20.494319Z

Hum, ya then I definitely wouldn't try to relate it to VMs or compilers. I understand your logic, but it takes VMs and compilers at too much of an abstract view which will confuse people, because in reality those are very concrete things. If you wanted to relate it to a computer science term, I would maybe relate it more to a codec. You could say that you made a Prompt codec for LLMs. Codecs are programs that compress/decompress data, often in a lossy way. It's typically done to save on memory/space. Which I think is even the purpose here? To save on context space?

michaelwhitford 2026-03-04T03:31:08.197139Z

Yes it saves context space from the compression. But it's mainly to create a certain shape in the attention of the AI, and directions on where to aim that attention into the next turn. The compression is great, 3:1 works really well for compaction across sessions. You can get it to 7:1 for your your tooling prompts (CLAUDE.md, AGENTS.md, etc). You'll understand when you see it in action and can look at the prompts and the code I have that uses the LLM as the compiler/decompiler/debugger. At my current rate I am maybe 2 weeks from release, I asked in here hoping someone would take it serious and give me pointers on compiler design.

2026-03-04T03:50:41.430809Z

It's possible I'm not understanding, but what I'm suggesting is that you might not be able to take inspiration from compilers because it's the wrong abstraction for what you have. You might find it more appropriate to look at the design, testing and formalization of codecs for example. I know codecs are often evaluated on human perception for example. Whereas compilers are deterministic, you can check that the output binary or text is exactly what it should be for example.

2026-03-04T03:55:26.897119Z

Otherwise for compilers. Generally you define formal grammar. Like in EBNF form. And you try to prove that source program is same as compiled program. If source program can be interpreted, you can run the source interpreted and the compiled program and assert their results are same for example.

2026-03-04T03:57:25.578839Z

Self hosting is often a goal. If the compiler is compiled on itself and the compiler compiled in itself result in the same result as the original compiler not compiled in itself it proves at least it can correctly compile itself.

2026-03-04T03:58:57.106639Z

Otherwise simple test, like you expect this to result in a program that does Y and assert the compiled program when exexuted does Y.

michaelwhitford 2026-03-04T04:01:18.504269Z

Yeah i need something for testing semantic equality I think. Because the LLM has the randomness to the expansion it's never the same string when you do the compile<->decompile loop. The instructions do work 97% of the time. I have that in the test suite already with a few progressively harder tasks for the agent with the bash tool to perform.

2026-03-04T04:24:14.123319Z

You have two sets right? The instructions before and those after? Can't you just run both and compare results? But your issue is probably that an LLM isn't an instruction machine, but a prediction model, so the result is statistical. So ya, you might need something fuzzy to assert "equal result".

michaelwhitford 2026-03-04T04:27:22.535529Z

Yes I am using llm-as-a-judge right now to score them, then I go look at the outputs and verify. It really is 97% when you get everything working across a couple hundred runs so far. I'm still trying to find ways to turn it into real math, the gates I mentioned in the earlier description are actual 50/50 (approx) probabilities in the logprobs data from inference, I can prove those. But telling an agent to create a complicated program is much more complicated to prove mathematically.

2026-03-04T04:31:35.906899Z

Codecs are like that to some degree. Normally it's tested through human assessment, with A/B/X testing. This is how LLMs are often assessed as well, like with the arena leaderboards. Otherwise they come up with some metrics and try to optimize for that, which doesn't always translate to what humans actually prefer. For example, you could have another LLM judge the result of each. Or if meant to solve a task, run it 10 times and count the number of times it solved it. Or generate code that pass a unit test.

michaelwhitford 2026-03-04T04:33:56.124379Z

Yes that is what i have now, I will have sonnet judge qwen3, or codex judge sonnet. Never the same model judging it's own outputs.

2026-03-04T04:38:45.683999Z

It's really tough, even the big foundation billion dollar labs can't test their models well, and often when they release a newer version people say it performs worse and then they adjust it 😝

michaelwhitford 2026-03-04T04:41:05.527509Z

The pipeline that made me realize this was useful beyond a toy. Because of the execution gate, I can have the compiler generate the intermediate form of an incoming prompt without risking the AI will execute it as instructions. In the intermediate compressed form, prompt injection is pretty easy to detect.

michaelwhitford 2026-03-04T04:42:20.563629Z

So I have a pipeline Compile->DetectInjection->Analyze->Report

michaelwhitford 2026-03-04T04:43:19.610409Z

I'm working on an optimizer for the intermediate form right now that will just plug into that pipeline eventually. And the end of the pipeline can decompile back to the normal words.

michaelwhitford 2026-03-04T04:45:47.352149Z

The optimizer is interesting because in the intermediate form, it's a series of instructions, and they can be used as genes in a genome for a genetic algorithm. It creates a genome, and then creates the mutated instructions, and sends them off for testing and scoring by the llm judge.

chromalchemy 2026-03-13T17:37:28.186759Z

@michael819 have you wached the Rama agent-o-rama demo. It demonstrated bespoke agent orchestration, with node isolation, fan-in/ fan-out parallelism, documentation of results for testing and quality control purposes, and human in the loop reviews. Plus a UI to manage it. There are many orchestration frameworks popping off, but this looked next level in terms of oversight, documentability, visiblitly, etc. Like an agent-testing framework as much as an execution one.

chromalchemy 2026-03-13T17:38:47.722639Z

https://www.youtube.com/watch?v=SCt8MBtFDXQ https://youtu.be/mNLWtM3Iya4Data

Joaquín Pérez 2026-03-04T13:34:20.408769Z

@michael819 I think what you’ve built is more interesting than a VM

Joaquín Pérez 2026-03-04T13:34:25.983399Z

it’s closer to a formal intermediate representation for LLM cognition. If we formalize it, we can make it defensible and publishable

Joaquín Pérez 2026-03-04T13:34:42.258739Z

Let’s define the IR grammar explicitly. Even a simple EBNF draft. That makes this real

michaelwhitford 2026-03-04T13:53:17.547289Z

https://github.com/michaelwhitford/nucleus/blob/main/EBNF.md

Joaquín Pérez 2026-03-04T13:58:41.796369Z

I read through the EBNF. Honestly, this is legit. The fact that you actually formalized the IR grammar already puts this way past “just prompting.”

Joaquín Pérez 2026-03-04T13:58:46.768369Z

Structurally it makes sense. It reads like a real intermediate layer, not just formatting tricks.

Joaquín Pérez 2026-03-04T13:59:11.685829Z

What I think would really level this up is defining the operational semantics. The grammar gives us syntax, but we should probably define what each construct does during execution even if that execution is probabilistic

michaelwhitford 2026-03-04T13:59:29.484349Z

Yes that's why I am wanting to fully release it. It's working in ways that I can't explain. I can make AI do some crazy stuff now with it.

michaelwhitford 2026-03-04T14:02:19.647389Z

Look at https://github.com/michaelwhitford/nucleus/blob/main/ADAPTIVE.md for a state machine example with transitions.

Joaquín Pérez 2026-03-04T14:02:54.090399Z

If it’s working in ways you can’t fully explain yet, that’s even more reason to slow down just enough to formalize what’s happening before release. If we can turn the “crazy stuff” into measurable behaviors and defined invariants, it stops being magic and becomes architecture.

Joaquín Pérez 2026-03-04T14:03:06.986639Z

Let’s lock down three things before release: 1. clear execution semantics 2. a repeatable statistical test harness 3. documented failure modes If we do that, you’re not just releasing something cool — you’re releasing something defensible. I’m in if you want to tighten that layer up together

michaelwhitford 2026-03-04T14:20:23.817209Z

I started on the test harness already, I have some tests. The problem is it all has to run the execution through the llm. It needs a lot of infrastructure, an agent loop, multiple providers with multiple api specs, plus all the llm-as-judge stuff for scoring, and a way for the human to review it all so the accuracy of the judge can be determined. I am working on all of those pieces but it's a big old fulcro app with a ton of statecharts.

Joaquín Pérez 2026-03-04T14:43:35.355909Z

That actually makes sense. What you’re describing isn’t just a test harness — it’s basically an evaluation platform

Joaquín Pérez 2026-03-04T14:43:40.582979Z

Yeah, once everything has to flow through the LLM + agent loop + multi-provider APIs + judge layer + human review, the complexity explodes fast. That’s normal at this stage

Joaquín Pérez 2026-03-04T14:43:50.031219Z

Honestly, I’d suggest narrowing it temporarily: Start with: • one provider • one agent loop • one judge model • a small fixed task suite

Joaquín Pérez 2026-03-04T14:44:09.482259Z

Prove stability there first. Get clean metrics. Then generalize. Right now the risk isn’t that it won’t work it’s that the infra complexity hides what’s actually happening. If you want, I can help you design a minimal, clean evaluation pipeline first, then we scale it out. That’ll keep this from turning into infrastructure gravity before the core system is fully characterized.

michaelwhitford 2026-03-04T14:47:09.947219Z

Let me extract the testing prompts into a markdown file and push it to that repo and we can talk. I appreciate the feedback and help. It'll take me a bit to do that, I'll link here once it's available.

2026-03-04T16:31:42.087329Z

@solcitogameryt Honest question, are you AI generating those responses? It's strangely glazing and structured exactly how models respond by default. No judgement, just curious haha.

michaelwhitford 2026-03-14T14:04:06.996399Z

This all looks fantastic, and many of the things this does, my system design is evolving towards. The only thing that makes me hesitate to use this is that the platform is not open source. My current system design uses statecharts, pathom3, and core.async.flow to make everything obeservable. I will be releasing it as open source, so anyone can tinker with the full system. I'm sure rama is amazing for scaling, but it's closed, and costs money. I'm not sure it fits with my philosophy of releasing what i am doing as open source. I am trying to get useful things into the hands of everyone, so that corporations cannot enclose it. By writing my stuff to work with a closed system only, I would be knee capping my own values and strategy.

michaelwhitford 2026-03-14T14:16:23.881699Z

Imagine if linux had been released with a license like this. You could run 2 linux servers, but then you'd have to pay yearly, with a per-node cost. It would never have been adopted to grow to what it is today. rama looks absolutely amazing, and I wish them the best of luck, but they have a fundamentally different mindset on how to propagate their system than I do for mine.

Clojurians Log v2

architecture