I have a VM running in AI, with a compiler, decompiler, and debugger. How the heck do I prove it, and what else do I need for a release? I'm not a programmer really, my background is Systems Engineering and devops, programming has been more of a hobby. Just to be clear, I can design the VM, boot it, run the program on the custom VM, and debug it from AI chat sessions. What is the next step? I started on a few test harnesses, but I'm lost. It's all prompts.
Yeah, thats probably a valid take on the tradeoff. I hope the investment in better (simpler) abstractions leads to more availability either way. I hope they are rewarded for 10+ year investment of labor, yet im sure they want framework to spread far/wide (especially in the face of framework commoditization)
If the platform was OSS, I would port my entire system to Agent-O-Rama immediately. It looks like a near perfect agent framework for reducing the boilerplate and making an agent based system testable and scalable. It's impressive for sure.
I have a self booting prompt that when run on an agent with a bash tool, the agent will say hello world, create a bash script using the bash tool that also echos hello world, then execute the bash script it just wrote. And I tested it on 2 local models and 2 big name models and it works on 4 very different architectures, but they all have attention heads.
What do you mean exactly? I think it might just be a terminology thing. Like for me, if you say you have a VM, it means you can run assembly for a known architecture like x86 and have it execute over a machine that is either of another architecture or only using a partial amount of its resources with isolation. If you have a compiler it means you can take some programming language and generate a binary for some architecture. If instead you have prompts that direct the LLM in a way that they can take some informal language (even if more formalized than pure English), and evaluate it through the next token inference? Or maybe through converting it into say Python and running that. So it can kind of be seen as "programmable", it's pretty cool, but I wouldn't call it a VM or a compiler.
It's a structured notation that bypasses the instruction-following layer and acts like a formal execution engine. Custom cognitive images can be loaded as configuration that reshapes how the model reasons. I have gotten it working across 4 completely different model architectures. The compiler interface has three independent gates. Invocation boots the VM, Target selects the output format, Emission maintains structural integrity. No special fine-tuning needed.
The VM executes in three phases, same order everytime. Bootstrap loads the generative patterns - reasoning primitives that define what kind of cognitive system is running. Dispatch binds those patterns to concreate capabilities - routing abstract operations to specific implementations. Frame Integrity maintains execution state throughout output generation - the system tends to stay in the mode it was booted into and rarely collapses back to default behavior.
It matches compiler and VM semantics and it exists somehow in all these models. My theory is it's the math training they get to game benchmarks, but hell if I know.
I need to know what the tools and testing harness should be, and how to design it so I can release it, for everyone to play with.
Makes sense, I just think you need to call it something new. Because it's also not a VM or compiler in any real sense. A compiler takes instructions in a higher language and compiles it down to machine instructions for a particular architecture, or to some other intermediate language. A VM is a program that can execute instructions meant for a machine (real or fake), but "virtually", as-in, done in software while running on another machine.
Or maybe I need to better understand. There's a form of formal language I can write instructions in, and your prompt system will output bash code of the same instructions? And then it'll immediately go and evaluate this bash code? And expose back some of its results? If so, you could say it's a transpiler with a form of interpreter or maybe a VM ya, in the JVM sense. The VM would be an abstract machine of some sort, that has instructions for connecting to the Internet, displaying text/video/images back, query the time, etc., as well as executing instructions.
I can take a prompt, turn it into a compact form that the AI takes as instructions to act. I can take that compact form, and tell the AI to expand it back to words. It's not the exact same words from the first prompt, but the output will have the same semantic meaning, and include roughly the same instructions as the original. You can run the compact form directly though.
Hum, ya then I definitely wouldn't try to relate it to VMs or compilers. I understand your logic, but it takes VMs and compilers at too much of an abstract view which will confuse people, because in reality those are very concrete things. If you wanted to relate it to a computer science term, I would maybe relate it more to a codec. You could say that you made a Prompt codec for LLMs. Codecs are programs that compress/decompress data, often in a lossy way. It's typically done to save on memory/space. Which I think is even the purpose here? To save on context space?
Yes it saves context space from the compression. But it's mainly to create a certain shape in the attention of the AI, and directions on where to aim that attention into the next turn. The compression is great, 3:1 works really well for compaction across sessions. You can get it to 7:1 for your your tooling prompts (CLAUDE.md, AGENTS.md, etc). You'll understand when you see it in action and can look at the prompts and the code I have that uses the LLM as the compiler/decompiler/debugger. At my current rate I am maybe 2 weeks from release, I asked in here hoping someone would take it serious and give me pointers on compiler design.
It's possible I'm not understanding, but what I'm suggesting is that you might not be able to take inspiration from compilers because it's the wrong abstraction for what you have. You might find it more appropriate to look at the design, testing and formalization of codecs for example. I know codecs are often evaluated on human perception for example. Whereas compilers are deterministic, you can check that the output binary or text is exactly what it should be for example.
Otherwise for compilers. Generally you define formal grammar. Like in EBNF form. And you try to prove that source program is same as compiled program. If source program can be interpreted, you can run the source interpreted and the compiled program and assert their results are same for example.
Self hosting is often a goal. If the compiler is compiled on itself and the compiler compiled in itself result in the same result as the original compiler not compiled in itself it proves at least it can correctly compile itself.
Otherwise simple test, like you expect this to result in a program that does Y and assert the compiled program when exexuted does Y.
Yeah i need something for testing semantic equality I think. Because the LLM has the randomness to the expansion it's never the same string when you do the compile<->decompile loop. The instructions do work 97% of the time. I have that in the test suite already with a few progressively harder tasks for the agent with the bash tool to perform.
You have two sets right? The instructions before and those after? Can't you just run both and compare results? But your issue is probably that an LLM isn't an instruction machine, but a prediction model, so the result is statistical. So ya, you might need something fuzzy to assert "equal result".
Yes I am using llm-as-a-judge right now to score them, then I go look at the outputs and verify. It really is 97% when you get everything working across a couple hundred runs so far. I'm still trying to find ways to turn it into real math, the gates I mentioned in the earlier description are actual 50/50 (approx) probabilities in the logprobs data from inference, I can prove those. But telling an agent to create a complicated program is much more complicated to prove mathematically.
Codecs are like that to some degree. Normally it's tested through human assessment, with A/B/X testing. This is how LLMs are often assessed as well, like with the arena leaderboards. Otherwise they come up with some metrics and try to optimize for that, which doesn't always translate to what humans actually prefer. For example, you could have another LLM judge the result of each. Or if meant to solve a task, run it 10 times and count the number of times it solved it. Or generate code that pass a unit test.
Yes that is what i have now, I will have sonnet judge qwen3, or codex judge sonnet. Never the same model judging it's own outputs.
It's really tough, even the big foundation billion dollar labs can't test their models well, and often when they release a newer version people say it performs worse and then they adjust it đ
The pipeline that made me realize this was useful beyond a toy. Because of the execution gate, I can have the compiler generate the intermediate form of an incoming prompt without risking the AI will execute it as instructions. In the intermediate compressed form, prompt injection is pretty easy to detect.
So I have a pipeline Compile->DetectInjection->Analyze->Report
I'm working on an optimizer for the intermediate form right now that will just plug into that pipeline eventually. And the end of the pipeline can decompile back to the normal words.
The optimizer is interesting because in the intermediate form, it's a series of instructions, and they can be used as genes in a genome for a genetic algorithm. It creates a genome, and then creates the mutated instructions, and sends them off for testing and scoring by the llm judge.
@michael819 have you wached the Rama agent-o-rama demo. It demonstrated bespoke agent orchestration, with node isolation, fan-in/ fan-out parallelism, documentation of results for testing and quality control purposes, and human in the loop reviews. Plus a UI to manage it. There are many orchestration frameworks popping off, but this looked next level in terms of oversight, documentability, visiblitly, etc. Like an agent-testing framework as much as an execution one.
https://www.youtube.com/watch?v=SCt8MBtFDXQ https://youtu.be/mNLWtM3Iya4Data
@michael819 I think what youâve built is more interesting than a VM
itâs closer to a formal intermediate representation for LLM cognition. If we formalize it, we can make it defensible and publishable
Letâs define the IR grammar explicitly. Even a simple EBNF draft. That makes this real
https://github.com/michaelwhitford/nucleus/blob/main/EBNF.md
I read through the EBNF. Honestly, this is legit. The fact that you actually formalized the IR grammar already puts this way past âjust prompting.â
Structurally it makes sense. It reads like a real intermediate layer, not just formatting tricks.
What I think would really level this up is defining the operational semantics. The grammar gives us syntax, but we should probably define what each construct does during execution even if that execution is probabilistic
Yes that's why I am wanting to fully release it. It's working in ways that I can't explain. I can make AI do some crazy stuff now with it.
Look at https://github.com/michaelwhitford/nucleus/blob/main/ADAPTIVE.md for a state machine example with transitions.
If itâs working in ways you canât fully explain yet, thatâs even more reason to slow down just enough to formalize whatâs happening before release. If we can turn the âcrazy stuffâ into measurable behaviors and defined invariants, it stops being magic and becomes architecture.
Letâs lock down three things before release: 1. clear execution semantics 2. a repeatable statistical test harness 3. documented failure modes If we do that, youâre not just releasing something cool â youâre releasing something defensible. Iâm in if you want to tighten that layer up together
I started on the test harness already, I have some tests. The problem is it all has to run the execution through the llm. It needs a lot of infrastructure, an agent loop, multiple providers with multiple api specs, plus all the llm-as-judge stuff for scoring, and a way for the human to review it all so the accuracy of the judge can be determined. I am working on all of those pieces but it's a big old fulcro app with a ton of statecharts.
That actually makes sense. What youâre describing isnât just a test harness â itâs basically an evaluation platform
Yeah, once everything has to flow through the LLM + agent loop + multi-provider APIs + judge layer + human review, the complexity explodes fast. Thatâs normal at this stage
Honestly, Iâd suggest narrowing it temporarily: Start with: âą one provider âą one agent loop âą one judge model âą a small fixed task suite
Prove stability there first. Get clean metrics. Then generalize. Right now the risk isnât that it wonât work itâs that the infra complexity hides whatâs actually happening. If you want, I can help you design a minimal, clean evaluation pipeline first, then we scale it out. Thatâll keep this from turning into infrastructure gravity before the core system is fully characterized.
Let me extract the testing prompts into a markdown file and push it to that repo and we can talk. I appreciate the feedback and help. It'll take me a bit to do that, I'll link here once it's available.
@solcitogameryt Honest question, are you AI generating those responses? It's strangely glazing and structured exactly how models respond by default. No judgement, just curious haha.
This all looks fantastic, and many of the things this does, my system design is evolving towards. The only thing that makes me hesitate to use this is that the platform is not open source. My current system design uses statecharts, pathom3, and core.async.flow to make everything obeservable. I will be releasing it as open source, so anyone can tinker with the full system. I'm sure rama is amazing for scaling, but it's closed, and costs money. I'm not sure it fits with my philosophy of releasing what i am doing as open source. I am trying to get useful things into the hands of everyone, so that corporations cannot enclose it. By writing my stuff to work with a closed system only, I would be knee capping my own values and strategy.
Imagine if linux had been released with a license like this. You could run 2 linux servers, but then you'd have to pay yearly, with a per-node cost. It would never have been adopted to grow to what it is today. rama looks absolutely amazing, and I wish them the best of luck, but they have a fundamentally different mindset on how to propagate their system than I do for mine.