malli

2025-12-01T06:40:08.881949Z

I have a performance improvement for recursive validators here https://github.com/metosin/malli/pull/1245 It compiles recursive functions for recursive validators instead of lazily caching each level, leading to better time and space perf for recursive validators. Please try it out in the wild and let me know if this breaks anything.

👀 4
❤️ 1
2026-01-12T15:26:28.866269Z

> I think this way we can remove the dynamic variables in the validator case. I'll try it out.
seems to work by sharing rf via the options map https://github.com/metosin/malli/pull/1254/changes

wotbrew 2026-01-09T10:54:47.997539Z

Morning, warming up now. Here is what I meant yesterday about 'multiple roots':

(let [reg (merge (m/default-schemas)
                 {::A [:tuple :int [:maybe [:ref ::B]]]
                  ::B [:tuple ::A ::A ::A ::A ::A]})
      v (m/validator ::B {:registry reg})]
  (println (mm/measure v)))

Asking for a validator for ::B is much more expensive than asking for one for ::A. If you ask for ::B, a new ->validator is created for each reference to ::A. This follows because the [:ref] establishes the binding scope: only the nodes along the path have been cached in the *ref-validators* table. So if you cycle you will not reallocate, but branches to new nodes will not share a cache. Does that make sense? Sorry, I could probably do a better job of explaining this. If only a registry could cache the Schema for each node, things would be a lot more straightforward. We also have a problem with the explainer, right? That seems hard to square.

opqdonut 2026-01-09T10:56:57.021969Z

that makes sense to me, at least

wotbrew 2026-01-09T10:57:09.896249Z

For the validator case maybe we can make id->validator mutable, but then you get races / cljs, yuck

wotbrew 2026-01-09T11:02:22.698439Z

@joel.kaasinen would there be any appetite for making interpretation the default and pushing graph-caching concerns into a special kind of registry / wrapper? One could keep all the existing memoization machinery everywhere except refs (and maybe pointers/bare keywords).

opqdonut 2026-01-09T11:02:56.880739Z

I'm not sure what you mean

wotbrew 2026-01-09T11:17:25.158509Z

Hmm, maybe something like this: I think you could introduce a mutable hash map in the top-level options call. -validator looks it up in the options. If it is absent, we do not memoize/cache/close over validators at the reference site at all. If present, we use it to retain the validator for a referenced node. Because it's not scoped to any particular reference, it ought to cache each node just once.
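A toy Clojure model may make this proposal easier to discuss. Everything in it (the vector schema forms, compile-form, validator, and the :cache and :compiled option keys) is hypothetical and only mimics the idea; it is not malli's implementation. With only a path-scoped table, ::A is compiled once per reference from ::B; with a mutable cache injected into the options by the public entry point, each node is compiled once.

```clojure
;; Hypothetical toy model of the proposal, not malli's implementation.
(def registry
  {::A [:tuple :int [:maybe [:ref ::B]]]
   ::B [:tuple ::A ::A ::A ::A ::A]})

(defn compile-form
  "Compiles toy schema `form` into a predicate. `path` maps ids on the current
  reference path to their predicates. `opts` may hold a shared mutable :cache
  atom (the proposal); :compiled counts how many registry nodes get compiled."
  [form path opts]
  (cond
    (= form :int) int?

    (keyword? form)
    (let [cache (:cache opts)]
      (if-let [p (or (get path form) (and cache (get @cache form)))]
        p
        ;; Tie the knot: publish a forwarding fn before compiling the body,
        ;; so references back to `form` reuse it instead of recompiling.
        (let [self (volatile! nil)
              f    (fn [x] (@self x))]
          (swap! (:compiled opts) inc)
          (when cache (swap! cache assoc form f))
          (vreset! self (compile-form (registry form) (assoc path form f) opts))
          f)))

    (= :ref (first form)) (compile-form (second form) path opts)

    (= :maybe (first form))
    (let [p (compile-form (second form) path opts)]
      (fn [x] (or (nil? x) (boolean (p x)))))

    (= :tuple (first form))
    (let [ps (mapv #(compile-form % path opts) (rest form))]
      (fn [x] (and (vector? x)
                   (= (count x) (count ps))
                   (every? identity (map (fn [p v] (p v)) ps x)))))))

(defn validator
  "Public entry point: starts with an empty path and, when `shared?`, injects a
  fresh mutable cache into the options. Returns [predicate compiled-node-count]."
  [id shared?]
  (let [opts {:compiled (atom 0) :cache (when shared? (atom {}))}
        pred (compile-form id {} opts)]
    [pred @(:compiled opts)]))
```

In this sketch (validator ::B false) reports 6 compiled nodes (::B once, ::A once per reference), while (validator ::B true) and (validator ::A false) each report 2.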

opqdonut 2026-01-09T11:18:45.103109Z

That sounds reasonable, yeah. Might also simplify the various recursive memoization cases? Performance by default would be nice, but opt-in performance sounds good as well.

wotbrew 2026-01-09T11:18:52.291039Z

> If it is absent, we do not memoize/cache/close-over validators at the reference site at all.
This case would be unlikely if we introduce the cache on the user's behalf, right? Because nobody is gonna call -validator most of the time

wotbrew 2026-01-09T11:19:07.337579Z

it's more that we don't have the impl baggage/headache from trying to get it right in place

opqdonut 2026-01-09T11:19:13.913159Z

Right, we could init an empty cache in the public validator?

wotbrew 2026-01-09T11:19:16.410579Z

ya

wotbrew 2026-01-09T11:19:21.700159Z

It's an idea

wotbrew 2026-01-09T11:19:30.700059Z

right now things are so hard to think about that it really does make my head spin

opqdonut 2026-01-09T11:20:06.208519Z

I feel like we need some collective hammock time on this instead of merging multiple small surgical PRs. If you could write up a draft of this approach, it would be easier to talk about.

opqdonut 2026-01-09T11:20:52.007689Z

Also want to hear what Ambrose thinks about this.

wotbrew 2026-01-09T11:21:15.396119Z

yea totally agreed! I plan to do a write-up today, so hopefully we will have something to think about. Also keen to hear thoughts from others. I am not super experienced with malli and it's my first time in the internals here

opqdonut 2026-01-09T11:21:54.883869Z

One thing that I'm wondering about is what would the keys in the mutable hash map be...

opqdonut 2026-01-09T11:22:41.859009Z

refs are easy to turn into keys, but other schemas might not be

wotbrew 2026-01-09T11:22:50.986969Z

I think you only need to do this for refs

opqdonut 2026-01-09T11:22:51.448389Z

is it enough to cache refs?

wotbrew 2026-01-09T11:22:54.052239Z

yea

wotbrew 2026-01-09T11:23:09.406979Z

explainer, I think, is the spanner in the works for my idea though

wotbrew 2026-01-09T11:23:20.368959Z

explainer closes over paths that seem somewhat important

wotbrew 2026-01-09T11:23:32.503199Z

so it's kind of baked in that they need to allocate for different pathways

wotbrew 2026-01-09T11:23:43.656399Z

the same kind of memory pathology ought to exist there

opqdonut 2026-01-09T11:23:55.240949Z

I don't see why explainers couldn't be curried like (-explainer [this] (fn [path] ...))
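A rough toy sketch of that curried shape. The factory and explain names, the error-map keys and the vector schema forms are hypothetical and much simpler than malli's explainer protocol. Each form compiles once into a function of path; refs look up a shared id->factory atom and apply the path only when explaining, so a recursive schema is not recompiled per path, at the cost of allocating the path-specific closures at explain time.

```clojure
;; Hypothetical toy model, not malli's Schema protocol: each form compiles once
;; into a factory (fn [path] explainer), where an explainer is
;; (fn [value errors] errors).
(def registry {::tree [:tuple :int [:maybe [:ref ::tree]]]})

(defn ref-factory
  "Refs resolve the shared factory at explain time, so recursion is tied through
  `factories` and only the path-specific closures are allocated, on demand."
  [id factories]
  (fn [path]
    (fn [x errors] (((get @factories id) path) x errors))))

(defn factory
  "Compiles `form` once into a path-taking factory. `factories` is a shared
  id->factory atom; a nil placeholder marks an id whose body is being compiled."
  [form factories]
  (cond
    (= form :int)
    (fn [path]
      (fn [x errors] (if (int? x) errors (conj errors {:path path :value x}))))

    (= :ref (first form))
    (let [id (second form)]
      (when-not (contains? @factories id)
        (swap! factories assoc id nil)
        (swap! factories assoc id (factory (registry id) factories)))
      (ref-factory id factories))

    (= :maybe (first form))
    (let [child (factory (second form) factories)]
      (fn [path]
        (let [e (child path)]
          (fn [x errors] (if (nil? x) errors (e x errors))))))

    (= :tuple (first form))
    (let [children (mapv #(factory % factories) (rest form))]
      (fn [path]
        (let [es (vec (map-indexed (fn [i c] (c (conj path i))) children))]
          (fn [x errors]
            (if (and (vector? x) (= (count x) (count es)))
              (reduce (fn [errs [e v]] (e v errs)) errors (map vector es x))
              (conj errors {:path path :value x}))))))))

(defn explain
  "Compiles `form` once, specialises it to the root path, and explains `value`."
  [form value]
  (let [explainer ((factory form (atom {})) [])]
    (explainer value [])))
```

For example, (explain [:ref ::tree] [1 [2 [:x nil]]]) reports a single error at path [1 1 0].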

opqdonut 2026-01-09T11:24:10.642369Z

of course that would be a big-ish and probably slightly breaking change

wotbrew 2026-01-09T11:24:43.770769Z

I would guess it might be easier to swallow recomputing explainers on demand, as they are not often used where that kind of perf would matter? Again, only for refs.

opqdonut 2026-01-09T11:25:04.458529Z

yeah, agreed. getting the perf for validators & coercers is most important.

👍 1
opqdonut 2026-01-09T11:26:23.643219Z

adding a note to the channel that there's an interesting discussion about recursive validators & performance here

❗ 1
2026-01-09T16:51:47.046989Z

regarding tying the knot for explainers, here's an implementation that doesn't introduce breaking changes but should save memory on creating redundant schema children https://github.com/metosin/malli/pull/1254/changes

2026-01-09T16:56:41.967809Z

@danstone I think a top-level mutable cache is my eventual goal (this is what Plumatic Schema does). We should figure out what is happening here first tho.

wotbrew 2026-01-09T16:56:47.639409Z

Write-up is in progress, I'll link that new explainers commit (haven't looked at it yet tho)

👏 1
wotbrew 2026-01-09T16:57:49.827939Z

@ambrosebs here is the WIP, worked a bit on examples that I think show what is going on: https://gist.github.com/wotbrew/6bc413291e2c667d66d35b78868dc2d2 Apologies not quite done / missing stuff / spelling etc 😉

👀 1
wotbrew 2026-01-09T17:02:04.851409Z

The actual schemas at metabase have lots of structure, conditionality and recursion points. It all makes sense to me now why we get GBs allocated up front, and why it can still be validators holding the memory despite the fix. If I add the laziness back in, things get better (up front), but still pretty bad.

wotbrew 2026-01-09T17:04:14.161989Z

Obviously can comment on the issue etc when I create it, so don't feel you have to read it all now

2026-01-09T17:04:50.475029Z

I agree with everything in the first section. I think we're jumping ahead tho, the knot tying addresses validators leaking memory at runtime, not compile time. There's been no attempt at fixing the second problem of expecting linear growth.

wotbrew 2026-01-09T17:06:14.375679Z

Yea I agree, sorry I'll clear that up!

2026-01-09T17:07:13.514579Z

fwiw I describe the same problem as future work in my summary for the knot tying work https://www.clojuriststogether.org/news/december-q3-2025-project-updates/#ambrose-bonnaire-seargent-malli

2026-01-09T17:08:30.914929Z

so it's absolutely on my radar. first thing I'd like to know is whether metabase is actually observing constant memory at runtime. it sounds like yes? or, it might if you could get it to start? 🙂

wotbrew 2026-01-09T17:10:57.467709Z

I can get it to start occasionally but it is left in a broken state. I have not looked at runtime behaviour as the previous way I was doing so is not possible.

wotbrew 2026-01-09T17:11:20.680219Z

I could do so but it would be somewhat synthetic and the same as what I am doing in the more minimal examples

2026-01-09T17:12:07.779949Z

ok then the broken state is the most important thing. there must be some assumption either the patch or metabase is violating.

wotbrew 2026-01-09T17:12:31.008209Z

I would not put it on malli; I would put good money on it being metabase - some job dies, a timeout is hit, etc.

2026-01-09T17:13:58.064519Z

ok. I think we should park all the perf stuff until we get to the bottom of how it breaks metabase.

2026-01-09T17:15:53.107039Z

it sounds like there are two problems:
1. the preallocation of validators takes up too much memory
2. even if you allocate enough memory, metabase doesn't like the validators generated

2026-01-09T17:16:22.431319Z

on 1, you said that lazy vars help. I'd like to know more about that.

wotbrew 2026-01-09T17:23:44.702229Z

> 2. even if you allocate enough memory, metabase doesn't like the validators generated
I think it is memory in each case (pressure on something else), but let me play some more. Forcing the if lazy branch puts us in a similar spot to 0.2.0: everything loads OK at this point. The memory usage is high, but does not obviously grow at runtime. In the 0.2.0 case it looks like a memory leak.

2026-01-09T17:25:29.699449Z

ah ok. for lazy refs, it will grow at runtime but to a bounded amount until it ties the knot.

2026-01-09T17:27:57.934639Z

my first hunch with your report is that this work revealed a failure mode in metabase on what actually happens if all ref schemas are realized, even to one level.

wotbrew 2026-01-09T17:30:54.813489Z

I agree with that - and I did not know the scaling was a known issue. If we have to live with the (apparent) exponential scaling, we might decide to keep validators lazy. I would then discuss removing any top level caching of validators at metabase - as we would then at least limit the memory balloon to a specific validation scope.

wotbrew 2026-01-09T17:31:51.855609Z

I'd be happy with that outcome - OTOH a PR for the top-level mutable cache might not be too hard? What do you think? I kind of would like the eagerness now 🙂

2026-01-09T17:31:58.129779Z

another possibility is that the optimization overstepped. it's really two optimizations: tying the knot and eager realization.
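To separate the two, here is a toy ref validator whose target is built either eagerly or on first use via delay. ref-validator and compile-target are hypothetical names, not malli's internals, and knot tying (reusing a target when a cycle returns to it) is deliberately left out; this only shows when the target gets built.

```clojure
;; Hypothetical toy, not malli's code: the same ref validator built either
;; eagerly (target compiled immediately) or lazily (compiled on first call).
(defn ref-validator
  "`compile-target` is a zero-arg fn that builds the referenced predicate."
  [compile-target lazy?]
  (if lazy?
    ;; delay runs compile-target at most once, on the first validation call
    (let [target (delay (compile-target))]
      (fn [x] (@target x)))
    ;; eager realization: pay the compilation cost up front
    (let [target (compile-target)]
      (fn [x] (target x)))))
```

Because a delay is realized at most once, each lazy ref site builds its target only when data actually reaches it, whereas the eager version builds every target at construction time.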

👍 1
2026-01-09T17:32:33.971549Z

maybe a simple way forward is to force the lazy path until we deal with the "other" issue

2026-01-09T17:32:50.016679Z

it will at least bound the memory leak

wotbrew 2026-01-09T17:33:47.990829Z

Yep totally agree 👍

2026-01-09T17:34:40.980219Z

re: top-level cache, it's a few steps ahead of where I'm comfortable. I think we can break it up into smaller steps like we're doing here, fixing smaller issues, and get to that destination

2026-01-09T17:35:32.331249Z

ok, but you've already tried forcing the lazy case and still found issues. maybe concentrate on that case in your investigation going forward?

2026-01-09T17:36:55.260609Z

once we get to the bottom of it I will make the non-lazy case opt-in.

2026-01-09T17:41:32.822079Z

also, the reason you didn't hear about the exponential scaling issue is because I didn't advertise it. my head is also spinning, but over several years on how to even explain these issues.

wotbrew 2026-01-09T17:42:31.198769Z

The issue with 'force lazy' for us is 'malli memory usage is still very high'. Obviously this is metabase specific! This is more of a 'it'd be nice to reduce that' type issue for me. A second issue is it still leaves open runtime growth. On smaller heaps I'd have to say OOMs are only less likely due to malli, instead of practically eliminated - it would still 'look like a leak'. This is all a bit speculative until we roll out into production or run some significant experiment tbh 🙂. I figure trying the top-level cache might be relatively cheap (not proposing merging anything into malli yet!)

wotbrew 2026-01-09T17:43:14.561639Z

Malli is free to allocate what it wants of course, and I 💯 agree that without the forced eagerness the situation seems improved.

wotbrew 2026-01-09T17:43:33.295519Z

And I have options metabase side to mitigate the kinds of situations we see now

2026-01-09T17:43:58.094179Z

another issue is that I proposed a top-level validator cache last year and it was not accepted. Instead of pushing for it I decided to break the problem down further.

wotbrew 2026-01-09T17:45:20.427709Z

By top-level do you mean as I propose? Injected into the options map? Sorry, I should probably start by reading your linked doc.

2026-01-09T17:46:25.067419Z

I mean for the purposes of this discussion, starting at that high level of abstraction had buy-in issues and was hard to explain.

👍 1
2026-01-09T17:46:50.539169Z

that's one reason why I tried to break it down into tiny but effective PRs.

2026-01-09T17:47:40.851589Z

I can dig up my attempt.

2026-01-09T17:49:09.860049Z

https://github.com/metosin/malli/pull/1180

👀 1
2026-01-09T17:49:51.649679Z

the feedback I got was basically, it's against malli's design to cache so much.

2026-01-09T17:50:49.131609Z

I struggled to formulate my argument against it, eventually it emerged from this work via this pr, which basically says that malli is free to cache schemas as it sees fit https://github.com/metosin/malli/pull/1244

2026-01-09T18:00:30.502239Z

I would go with a different approach today tho, back then I didn't realize the exponential proliferation was a root cause. It really was fixing a symptom of that problem, by caching validators at the schema ctor level.

2026-01-09T18:01:21.623699Z

my dream scenario is that we can tie the knot at schema-creation time. I don't know if that's possible in practice yet.

2026-01-09T18:01:55.662599Z

that way we might not have to have custom knot tying for every op.

2026-01-09T18:03:04.302389Z

I have a feeling that it wouldn't work in practice. there are so many different ops and different use cases.

2026-01-09T18:04:23.959489Z

however, we could probably use the same tech to automatically create :ref schemas.

2026-01-09T18:05:25.998089Z

instead of manually changing ::foo to [:ref ::foo], just detect if you've seen ::foo and if so turn it into the latter.
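A minimal sketch of that detection over plain schema vectors, assuming registry values are malli-style vector forms keyed by keywords. auto-ref-entry is a hypothetical helper, not part of malli; it also inlines non-recursive references, which a real implementation would probably avoid.

```clojure
(defn auto-ref-entry
  "Hypothetical helper (not malli API): returns the definition of registry entry
  `k` with registry keywords inlined, except that a keyword already being
  expanded on the current path is replaced by [:ref kw]."
  [registry k]
  (letfn [(walk [form seen]
            (cond
              ;; keyword that names a registry entry
              (and (keyword? form) (contains? registry form))
              (if (contains? seen form)
                [:ref form]                                  ; cycle: emit a ref
                (walk (get registry form) (conj seen form))) ; first visit: inline
              ;; leave explicit refs untouched
              (and (vector? form) (= :ref (first form))) form
              (vector? form) (mapv #(walk % seen) form)
              :else form))]
    (walk (get registry k) #{k})))
```

For a self-recursive entry such as {::list [:maybe [:tuple :int ::list]]}, the bare ::list inside the definition becomes [:ref ::list].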

2026-01-09T18:15:14.209069Z

hmm, just had a thought that it's not the refs themselves we'd want to tie the knot with, but their children. this is the approach I took for explainers. what if the shared child is available to all ops at parse time? https://github.com/metosin/malli/pull/1254/changes

2026-01-09T18:17:09.502759Z

I think this way we can remove the dynamic variables in the validator case. I'll try it out.

🤞 1
wotbrew 2026-01-08T09:21:44.386619Z

yes I did try that. I'm not convinced in our case that the validator call is the only problem. I'm continuing to investigate today, hopefully I'll have more soon.

wotbrew 2026-01-08T10:22:57.363209Z

here is an interesting example that I think demonstrates the original problem (might?) still exist in the coercer/explainer case. I suspect something might be going on with the early forcing of (rf) in the non-lazy ref case. https://gist.github.com/wotbrew/e5af4d05655f70fe4cbf21b6820021f1

wotbrew 2026-01-08T10:23:04.936869Z

I'll put together an issue for GH

wotbrew 2026-01-08T11:08:03.304779Z

Still not sure why forcing the lazy validator as you say did not seem to work; it really seems like it should be better than 0.2.0! I'm going to play around with this again, as maybe I just messed something up.

👍 1
2026-01-08T14:25:45.332089Z

the "after 10 random data structures" results look promising

2026-01-08T14:27:01.444639Z

tho maybe misleading since I don't think ref's -validator uses it any more

wotbrew 2026-01-08T15:11:39.940639Z

I suspect that if you switch if lazy to if true, you somehow end up with the same runtime combinatorics, but they move to the knot/->validator closure. Looking at the heap, you see the same sorts of AtomicReference numbers, but it's dominated more by Atom paths (via the ->validator closure).

wotbrew 2026-01-08T15:14:55.657679Z

Making my head hurt a bit 😄

wotbrew 2026-01-08T15:36:19.517429Z

This is where I am:
• The validator path is healthy, bounded more by schema complexity than by runtime data-structure complexity.
• explainer is the same as before (still growing when encountering new structure). This is not surprising, as that method has not been touched.
• coercer is really bad. if lazy = true makes things better, but it's still allocating huge amounts for a data structure with any sort of depth or complexity.

wotbrew 2026-01-08T15:44:05.435509Z

The coercer problem is probably the same issue on the transform side; we have a fix for that, so I'll pull that commit in. Almost there. I think the only remaining knot to tie will be the explainer case.

2026-01-08T17:13:59.163529Z

thanks for investigating. I'm not sure how to tie the knot with the explainer since path grows with each step, so needs a new thunk at each level.

2026-01-08T17:17:18.474919Z

thinking it through, we could have the functions in the id->explainer map take a path and initialize a new explainer. this should at least prevent us from needing to recreate another layer of schemas.

wotbrew 2026-01-08T17:19:42.339389Z

Even if I remove the (rf) and turn any caching for explainers off, numbers get better in the test example. I still have a lot of memory usage in metabase, but as I say, the heap dump looks different: ->validator is retaining everything. The thing is, I do not see the same in my toy example, so there is something else in the mix I am missing.

wotbrew 2026-01-08T17:20:59.509799Z

Hopefully will actually be able to describe what is going on by the end of the week. Actual fix is probably like one line of code somewhere 😄

wotbrew 2026-01-08T17:22:11.389979Z

Metabase schemas are a lot more complicated. But perhaps there are other indirection mechanisms that break/create multiple id->validator roots across the tree.

2026-01-08T17:25:07.902419Z

I think it would help to print out the keys of *ref-validators* also. we can see which schemas are being realized and where the knots are being tied. it's unsorted tho, but might still be useful unchanged.

👍 1
2026-01-08T17:28:57.423299Z

ideally I'd like to see a (def ^:dynamic *nested-ref-path* []) where we add (binding [*nested-ref-path* (conj *nested-ref-path* (:name id))] ...) around where we rebind the *ref-validators* var. then collect the prints of that var.
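A self-contained toy of that instrumentation pattern, assuming a registry represented as a map from each id to the ids it references. walk-refs is hypothetical and only mimics binding the path var around each ref it enters; it is not malli's code.

```clojure
;; Hypothetical toy walker: mimics binding a path-tracking dynamic var
;; around each ref that is entered.
(def ^:dynamic *nested-ref-path* [])

(defn walk-refs
  "Follows references in `registry` (id -> seq of referenced ids) from `id`,
  printing the ref path each time a new ref is entered. Stops at ids already
  on the current path, i.e. where the knot would be tied."
  [registry id]
  (when-not (some #{id} *nested-ref-path*)
    (binding [*nested-ref-path* (conj *nested-ref-path* id)]
      (println "*nested-ref-path*" *nested-ref-path*)
      (doseq [child (get registry id)]
        (walk-refs registry child)))))
```

With a three-node cycle such as {:a [:b] :b [:c] :c [:a]}, walking from :a prints three paths, [:a], [:a :b] and [:a :b :c], before the cycle back to :a stops the descent.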

2026-01-08T17:33:25.917489Z

try this pr https://github.com/metosin/malli/pull/1251

2026-01-08T17:34:30.983329Z

should print out stuff like

*nested-ref-path* [:malli.swagger-test/a]
*nested-ref-path* [:malli.swagger-test/a :malli.swagger-test/b]
*nested-ref-path* [:malli.swagger-test/a :malli.swagger-test/b :malli.swagger-test/c]
*nested-ref-path* [:malli.swagger-test/b]
*nested-ref-path* [:malli.swagger-test/b :malli.swagger-test/c]
*nested-ref-path* [:malli.swagger-test/b :malli.swagger-test/c :malli.swagger-test/a]
*nested-ref-path* [:malli.swagger-test/c]
*nested-ref-path* [:malli.swagger-test/c :malli.swagger-test/a]
*nested-ref-path* [:malli.swagger-test/c :malli.swagger-test/a :malli.swagger-test/b]

2026-01-08T17:35:53.817359Z

oh, I messed it up. fixed

wotbrew 2026-01-08T17:36:57.763719Z

Will do. Just FYI, I already identified that you can get multiple roots, as the root var binding can be established multiple times when you enter via a non-ref. This is why I added the ptr/direct cases.

2026-01-08T17:37:56.980359Z

hmm I don't follow, example?

wotbrew 2026-01-08T17:38:07.506239Z

give me a few mins, typing this while in a call 😄

👌 1
wotbrew 2026-01-08T17:39:43.344499Z

Might be tomorrow tbh, I can also smell dinner cooking

😄 1
opqdonut 2026-01-12T06:38:18.833449Z

I'm definitely willing to revisit the top-level cache decision once we have more information & experiments. But definitely don't want to rush it. Thanks for looking into this.

➕ 1
wotbrew 2026-01-07T16:32:28.930739Z

I have been looking at memory issues with malli at metabase. Bad news: I tried your new commit to see if it helps, but now I cannot load metabase at all; malli completely fills however much -Xmx you give it. So I think something might be wrong. I have not gotten a minimal repro yet, so this is just a heads-up. I suspect malli is uniquely memoizing subgraphs for each ref pointer jump, as the schema/validator is cached at the reference site. Each subpath through the registry gets its own chain of memoized thunks. This means you get combinatorial growth as unique paths are taken at runtime. As metabase also caches top-level schema validators/coercers in a global cache, we end up retaining (and effectively leaking) large amounts of memory. If I redefine -memoize to identity, the memory usage comes right down 😅. @ambrosebs are you still working on issues relating to memoization / refs? I would love not to duplicate any effort, and would love to know any early thoughts before I raise any issue against malli (or think about solutions).

2026-01-07T18:39:10.855749Z

@danstone thanks for the report! nope I'm not working on it at the moment. we should start investigating so please go ahead with raising issues and we'll discuss solutions.

🙏 1
2026-01-08T05:22:24.380259Z

@danstone could you try changing this if lazy to if true https://github.com/metosin/malli/blob/80138076960e7820523b4cb932c5b5d1936d4e7f/src/malli/core.cljc#L1997