I have a performance improvement for recursive validators here https://github.com/metosin/malli/pull/1245 It compiles recursive functions for recursive validators instead of lazily caching each level, leading to better time and space perf for recursive validators. Please try it out in the wild and let me know if this breaks anything.
> I think this way we can remove the dynamic variables in the validator case. I'll try it out.
seems to work by sharing rf via the options map https://github.com/metosin/malli/pull/1254/changes
Morning, warming up now - here is what I meant yesterday about 'multiple roots'.
> (let [reg (merge (m/default-schemas)
>                  {::A [:tuple :int [:maybe [:ref ::B]]]
>                   ::B [:tuple ::A ::A ::A ::A ::A]})
>       v (m/validator ::B {:registry reg})]
>   (println (mm/measure v)))
Asking for a validator for ::B is much more expensive than one for ::A. If you ask for ::B - a new ->validator is created for each reference to ::A
This follows as the [:ref] establishes the binding scope. Only the nodes along the path have been cached in the *ref-validators* table. So if you cycle you will not reallocate. But branches to new nodes will not share a cache.
Does that make sense? Sorry I'm sure I could probably do a better job of explaining here.
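A toy model of the 'multiple roots' effect described above, as a rough sketch rather than malli's actual code. The names `graph`, `count-path-scoped` and `count-shared` are hypothetical, and the graph only mirrors the reference structure of the ::A / ::B example (here `:A` and `:B`). It assumes a validator is reused only when its id is already on the current path, which is how the cycle case is described above.

```clojure
(def graph
  ;; hypothetical stand-in for the registry: id -> ids it references
  {:A [:B]
   :B [:A :A :A :A :A]})

(defn count-path-scoped
  "Counts validators allocated when the cache only covers the current path."
  ([id] (count-path-scoped id #{}))
  ([id on-path]
   (if (contains? on-path id)
     0 ; cycle: the validator already on this path is reused
     (reduce + 1 (map #(count-path-scoped % (conj on-path id))
                      (graph id))))))

(defn count-shared
  "Counts validators allocated when every node is cached exactly once."
  [id]
  (loop [todo [id] seen #{}]
    (if-let [n (peek todo)]
      (if (contains? seen n)
        (recur (pop todo) seen)
        (recur (into (pop todo) (graph n)) (conj seen n)))
      (count seen))))

(count-path-scoped :B) ;; => 6
(count-path-scoped :A) ;; => 2
(count-shared :B)      ;; => 2
```

In this model, entering at `:B` allocates one validator per reference to `:A`, while a cache shared across branches would allocate each node once.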
If only a registry could cache the Schema for each node, things would be a lot more straightforward. We also have a problem with explainer right, which seems hard to square.
that makes sense to me, at least
For the validator case maybe we can make id->validator mutable, but then you got races / cljs yuk
@joel.kaasinen would there be any appetite in having interpretation be the default and pushing graph caching concerns into a special kind of registry / wrapper? One could keep all the existing memoization machinery everywhere but refs (and maybe pointers/bare-kws).
I'm not sure what you mean
Hmm, maybe something like this: I think you could introduce a mutable hash map in the top level options call. -validator looks it up in the options. If it is absent, we do not memoize/cache/close-over validators at the reference site at all. If present we use it to retain the validator for some referenced node. Because it's not scoped to any particular reference - it ought to cache each node just once.
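A minimal sketch of that idea, using a toy registry instead of malli itself. Everything here (`node-registry`, `ref-validator`, `validator`, the `::cache` options key and the `allocations` counter) is hypothetical and only illustrates the shape: no cache in the options means plain interpretation, a cache in the options means one retained validator per referenced node.

```clojure
(def node-registry
  ;; hypothetical registry: id -> predicate plus the ids it references by key
  {:node {:pred map? :refs {:left :node, :right :node}}})

(def allocations (atom 0)) ; counts validators built, for illustration only

(defn ref-validator
  "Returns a validator for id. Reuses one from (::cache options) when a cache
  is supplied; otherwise builds a fresh one on every call."
  [id options]
  (let [cache (::cache options)]
    (or (when cache (get @cache id))
        (let [{:keys [pred refs]} (node-registry id)
              v (fn [x]
                  (and (pred x)
                       (every? (fn [[k child-id]]
                                 (let [child (get x k)]
                                   (or (nil? child)
                                       ;; resolved at call time, so a cycle
                                       ;; never recurses while building
                                       ((ref-validator child-id options) child))))
                               refs)))]
          (swap! allocations inc)
          (when cache (swap! cache assoc id v))
          v))))

(defn validator
  "Public entry point: creates the cache on the caller's behalf."
  [id]
  (ref-validator id {::cache (atom {})}))

(def cached (validator :node))
(cached {:left {:right {}} :right {}}) ;; => true
(cached {:left 1})                     ;; => false
```

With the cache present, the counter stays at one for `:node` however deep the data is. Calling `(ref-validator :node {})` directly takes the uncached path and builds a validator per reference visited, but retains none of them.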
That sounds reasonable, yeah. Might also simplify the various recursive memoization cases? Performance by default would be nice, but opt-in performance sounds good as well.
> If it is absent, we do not memoize/cache/close-over validators at the reference site at all.
This case would be unlikely if we introduce the cache on the user's behalf, right? Because nobody is gonna call -validator most of the time
it's more that we do not have the impl baggage/headache from trying to get it right in-place
Right, we could init an empty cache in the public validator?
ya
It's an idea
right now things are so hard to think about it really does make my head spin
I feel like we need some collective hammock time on this instead of merging multiple small surgical PRs. If you could write up a draft of this approach, it would be easier to talk about.
Also want to hear what Ambrose thinks about this.
yea totally agreed! I plan to do a write up today, so hopefully will have something to think about. Also keen to hear thoughts from others. I am not super experienced with malli and it's my first time in the internals here
One thing that I'm wondering about is what would the keys in the mutable hash map be...
refs are easy to turn into keys, but other schemas might not
I think you only need to do this for refs
is it enough to cache refs?
yea
explainer I think is the spanner with my idea though
explainer closes over paths that seem somewhat important
so it's kind of baked in that they need to allocate for different pathways
the same kind of memory pathology ought to exist there
I don't see why explainers couldn't be curried like (-explainer [this] (fn [path] ...))
of course that would be a big-ish and probably slightly breaking change
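A rough sketch of what a curried explainer could look like, on a toy linked-list shape rather than malli's real protocol. `build-list-explainer`, `with-path` and `explainer-builds` are hypothetical names. The point is only that the path-independent part is built once and the path is supplied later.

```clojure
(def explainer-builds (atom 0)) ; counts builds, for illustration only

(defn build-list-explainer
  "Builds the path-independent part once and returns (fn [path] explainer)."
  []
  (swap! explainer-builds inc)
  (fn with-path [path]
    (fn [x acc]
      (cond
        (nil? x) acc
        (not (map? x)) (conj acc {:path path :value x})
        :else
        (let [acc (if (int? (:value x))
                    acc
                    (conj acc {:path (conj path :value) :value (:value x)}))]
          ;; the recursive step reuses with-path and only extends the path
          ((with-path (conj path :next)) (:next x) acc))))))

(def explain-list ((build-list-explainer) []))

(explain-list {:value 1 :next {:value "x" :next nil}} [])
;; => [{:path [:next :value], :value "x"}]
```

This still allocates a small closure per step while explaining, but nothing per path is retained between calls.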
it might be easier to swallow recomputing explainers on demand I would guess, as they are not used so often where that kind of perf would matter? Again only for refs.
yeah, agreed. getting the perf for validators & coercers is most important.
adding a note to the channel that there's an interesting discussion about recursive validators & performance here
regarding tying the knot for explainers, here's an implementation that doesn't introduce breaking changes but should save memory on creating redundant schema children https://github.com/metosin/malli/pull/1254/changes
@danstone I think a top-level mutable cache is my eventual goal (this is what Plumatic Schema does). We should figure out what is happening here first tho.
Write up is in progress, I'll link that new explainers commit (haven't looked at yet tho)
@ambrosebs here is the WIP, worked a bit on examples that I think show what is going on: https://gist.github.com/wotbrew/6bc413291e2c667d66d35b78868dc2d2 Apologies not quite done / missing stuff / spelling etc 😉
The actual schemas at metabase have lots of structure, conditionality, recursion points - it now makes sense to me why we get GBs allocated up front - and why it can still be validators holding the memory despite the fix. If I add the laziness back in, things get better (up front) - but still pretty bad.
Obviously can comment on the issue etc when I create it so don't feel you have to read it all now
I agree with everything in the first section. I think we're jumping ahead tho, the knot tying addresses validators leaking memory at runtime, not compile time. There's been no attempt at fixing the second problem of expecting linear growth.
Yea I agree, sorry I'll clear that up!
fwiw I describe the same problem as future work in my summary for the knot tying work https://www.clojuriststogether.org/news/december-q3-2025-project-updates/#ambrose-bonnaire-seargent-malli
so it's absolutely on my radar. first thing I'd like to know is whether metabase is actually observing constant memory at runtime. it sounds like yes? or, it might if you could get it to start? 🙂
I can get it to start occasionally but it is left in a broken state. I have not looked at runtime behaviour as the previous way I was doing so is not possible.
I could do so but it would be somewhat synthetic and the same as what I am doing in the more minimal examples
ok then the broken state is the most important thing. there must be some assumption either the patch or metabase is violating.
I would not put it on malli, I would put good money on it being metabase - some job dies, timeout is hit etc etc
ok. I think we should park all the perf stuff until we get to the bottom of how it breaks metabase.
it sounds like there are two problems: 1. the preallocation of validators takes up too much memory 2. even if you allocate enough memory, metabase doesn't like the validators generated
on 1, you said that lazy vars help. I'd like to know more about that.
> 2. even if you allocate enough memory, metabase doesn't like the validators generated
I think it is memory in each case (pressure on something else), but let me play some more
forcing the if lazy branch puts us in a similar spot to 0.2.0 - everything loads ok at this point. The memory usage is high, but does not obviously grow at runtime. In the 0.2.0 case it looks like a memory leak.
ah ok. for lazy refs, it will grow at runtime but to a bounded amount until it ties the knot.
my first hunch with your report is that this work revealed a failure mode in metabase on what actually happens if all ref schemas are realized, even to one level.
I agree with that - and I did not know the scaling was a known issue. If we have to live with the (apparent) exponential scaling, we might decide to keep validators lazy. I would then discuss removing any top level caching of validators at metabase - as we would then at least limit the memory balloon to a specific validation scope.
I'd be happy with that outcome - OTOH a PR for the top-level mutable cache might not be too hard? What do you think? I kind of would like the eagerness now 🙂
another possibility is that the optimization overstepped. it's really two optimizations: tying the knot and eager realization.
maybe a simple way forward is to force the lazy path until we deal with the "other" issue
it will at least bound the memory leak
Yep totally agree 👍
re: top-level cache, it's a few steps ahead of where I'm comfortable, I think we can break it up into smaller steps like we're doing here, fixing smaller issues and get to that destination
ok, but you've already tried forcing the lazy case and still found issues. maybe concentrate on that case in your investigation going forward?
once we get to the bottom of it I will make the non-lazy case opt-in.
also, the reason you didn't hear about the exponential scaling issue is because I didn't advertise it. my head is also spinning, but over several years on how to even explain these issues.
The issue with 'force lazy' for us is 'malli memory usage is still very high'. Obviously this is metabase specific! This is more of a 'it'd be nice to reduce that' type issue for me. A second issue is it still leaves open runtime growth. On smaller heaps I'd have to say OOMs are only less likely due to malli, instead of practically eliminated - it would still 'look like a leak'. This is all a bit speculative until we roll out into production or run some significant experiment tbh 🙂. I figure trying the top level cache might be relatively cheap (not proposing merging anything into malli yet!)
Malli is free to allocate what it wants of course, and I 💯 agree that without the forced eagerness the situation seems improved.
And I have options metabase side to mitigate the kinds of situations we see now
another issue is that I proposed a top-level validator cache last year and it was not accepted. Instead of pushing for it I decided to break the problem down further.
By top-level do you mean as I propose? Injected into the options map? Sorry, I should probably start by reading your linked doc.
I mean for the purposes of this discussion, starting at that high level of abstraction had buy-in issues and was hard to explain.
that's one reason why I tried to break it down into tiny but effective pr's.
I can dig up my attempt.
the feedback I got was basically, it's against malli's design to cache so much.
I struggled to formulate my argument against it, eventually it emerged from this work via this pr, which basically says that malli is free to cache schemas as it sees fit https://github.com/metosin/malli/pull/1244
I would go with a different approach today tho, back then I didn't realize the exponential proliferation was a root cause. It really was fixing a symptom of that problem, by caching validators at the schema ctor level.
my dream scenario is that we can tie the knot at schema-creation time. I don't know if that's possible in practice yet.
that way we might not have to have custom knot tying for every op.
I have a feeling that it wouldn't work in practice. there are so many different ops and different use cases.
however, we could probably use the same tech to automatically create :ref schemas.
instead of manually changing ::foo to [:ref ::foo], just detect if you've seen ::foo and if so turn it into the latter.
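One way to picture the automatic :ref detection, as a sketch over plain hiccup-style schema forms. `auto-ref` is a hypothetical helper, not a malli function. It assumes the first element of every vector is a schema type or map key that must stay untouched, and it would mishandle literal-valued schemas such as :enum.

```clojure
(defn auto-ref
  "Expands registry keywords inside form, replacing any keyword that is
  already being expanded further up the path with [:ref kw]."
  ([registry form] (auto-ref registry form #{}))
  ([registry form on-path]
   (cond
     (and (keyword? form) (contains? registry form))
     (if (contains? on-path form)
       [:ref form] ; seen before on this path: emit a ref instead of expanding
       (auto-ref registry (registry form) (conj on-path form)))

     ;; explicit refs are left as written
     (and (vector? form) (= :ref (first form)))
     form

     (and (vector? form) (seq form))
     (into [(first form)] ; keep the head untouched
           (map #(auto-ref registry % on-path))
           (rest form))

     :else form)))

(auto-ref {:tree [:map [:value :int] [:kids [:vector :tree]]]} :tree)
;; => [:map [:value :int] [:kids [:vector [:ref :tree]]]]
```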
hmm, just had a thought that it's not the refs themselves we'd want to tie the knot with, but their children. this is the approach I took for explainers. what if the shared child is available to all ops at parse time? https://github.com/metosin/malli/pull/1254/changes
I think this way we can remove the dynamic variables in the validator case. I'll try it out.
yes I did try that. I'm not convinced in our case that the validator call is the only problem. I'm continuing to investigate today, hopefully I'll have more soon.
here is an interesting example that I think demonstrates the original problem (might?) still exist in the coercer/explainer case. I suspect something might be going on with the early forcing of (rf) in the non-lazy ref case.
https://gist.github.com/wotbrew/e5af4d05655f70fe4cbf21b6820021f1
I'll put together an issue for GH
Still not sure why forcing the lazy validator as you say did not seem to work - it really seems like it should be better than 0.2.0! I'm going to play around with this again as maybe I just messed something up.
the "after 10 random data structures" results look promising
tho maybe misleading since I don't think ref's -validator uses it any more
I suspect somehow if you switch if lazy for if true you end up with the same runtime combinatorics but it moves to the knot/->validator closure. looking at heap you see the same sorts of AtomicReference numbers but it's dominated more by Atom paths (via ->validator closure).
Making my head hurt a bit 😄
This is where I am:
• The validator path is healthy, bounded more by schema complexity than by runtime data structure complexity.
• explainer is the same as before (still growing when encountering new structure); this is not surprising as that method has not been touched.
• coercer is really bad: if lazy = true makes things better, but it's still allocating huge amounts for a data structure with any sort of depth or complexity.
coercer is probably because of the same issue on the transform side - we have a fix for that so I'll pull that commit in. Almost there. I think the only remaining knot to tie will be the explainer case.
thanks for investigating. I'm not sure how to tie the knot with the explainer since path grows with each step, so needs a new thunk at each level.
thinking it through, we could have the functions in the id->explainer map take a path and initialize a new explainer. this should at least prevent us from needing to recreate another layer of schemas.
Even if I remove the (rf) and turn any caching for explainers off, numbers get better in the test example - I still have a lot of memory usage in metabase, but as I say heap dump looks different. ->validator retaining everything. The thing is I do not see the same in my toy example, so there is something else in the mix I am missing.
Hopefully will actually be able to describe what is going on by the end of the week. Actual fix is probably like one line of code somewhere 😄
Metabase schemas are a lot more complicated. But perhaps there are other indirection mechanisms that break/create multiple id->validator roots across the tree.
I think it would help to print out the keys of *ref-validators* also. we can see which schemas are being realized and where the knots are being tied. it's unsorted tho, but might still be useful unchanged.
ideally I'd like to see a (def ^:dynamic *nested-ref-path* []) where we add (binding [*nested-ref-path* (conj *nested-ref-path* (:name id))] around where we rebind the *ref-validators* var. then collect the prints of that var.
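A small standalone sketch of that tracing idea, on a toy reference graph rather than inside malli. `ref-graph` and `trace-refs` are hypothetical; only `*nested-ref-path*` comes from the suggestion above. The traversal uses `run!` because a lazy `map` could be realized after the `binding` has been popped.

```clojure
(def ^:dynamic *nested-ref-path* [])

(def ref-graph
  ;; hypothetical registry shape: id -> ids it references
  {:a [:b] :b [:c] :c [:a]})

(defn trace-refs
  "Prints the chain of refs entered so far, stopping when a cycle closes."
  [id]
  (when-not (some #{id} *nested-ref-path*)
    (binding [*nested-ref-path* (conj *nested-ref-path* id)]
      (println "*nested-ref-path*" *nested-ref-path*)
      ;; run! is eager, so the children are visited while the binding holds
      (run! trace-refs (ref-graph id)))))

(trace-refs :a)
;; prints:
;; *nested-ref-path* [:a]
;; *nested-ref-path* [:a :b]
;; *nested-ref-path* [:a :b :c]
```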
try this pr https://github.com/metosin/malli/pull/1251
should print out stuff like
*nested-ref-path* [:malli.swagger-test/a]
*nested-ref-path* [:malli.swagger-test/a :malli.swagger-test/b]
*nested-ref-path* [:malli.swagger-test/a :malli.swagger-test/b :malli.swagger-test/c]
*nested-ref-path* [:malli.swagger-test/b]
*nested-ref-path* [:malli.swagger-test/b :malli.swagger-test/c]
*nested-ref-path* [:malli.swagger-test/b :malli.swagger-test/c :malli.swagger-test/a]
*nested-ref-path* [:malli.swagger-test/c]
*nested-ref-path* [:malli.swagger-test/c :malli.swagger-test/a]
*nested-ref-path* [:malli.swagger-test/c :malli.swagger-test/a :malli.swagger-test/b]
oh I messed it up, fixed
Will do, just fyi I already identified you can get multiple roots as the root var binding can be established multiple times when you enter via a non-ref. This is why I added the ptr/direct cases.
hmm I don't follow, example?
give me a few mins, typing this while in a call 😄
Might be tomorrow tbh, I can also smell dinner cooking
I'm definitely willing to revisit the top-level cache decision once we have more information & experiments. But definitely don't want to rush it. Thanks for looking into this.
I have been looking at memory issues with malli at metabase.
Bad news, I tried your new commit to see if it helps - but now I cannot load metabase at all - malli completely fills as much Xmx as you want to give it. So I think something might be wrong. I have not gotten a minimal repro yet, so just a heads up.
I suspect malli is uniquely memoizing subgraphs for each ref pointer jump as the schema/validator is cached at the reference site. Each subpath through the registry gets its own chain of memoized thunks. This means you get combinatoric growth as unique paths are taken at runtime. As metabase also caches top-level schema validators/coercers in a global cache we end up retaining (and effectively leaking) large amounts of memory. If I redefine -memoize to identity the memory usage comes right down 😅 .
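To get a feel for how fast 'one chain of thunks per path' can grow, here is a toy count under the assumption stated above: every distinct cycle-free path from the root gets its own memoized thunk. `complete-graph` and `path-scoped-thunks` are hypothetical helpers, and the fully connected graph is a worst case, not a model of the metabase schemas.

```clojure
(defn complete-graph
  "Hypothetical registry of n schemas where each one references all the others."
  [n]
  (into {} (for [i (range n)]
             [i (vec (remove #{i} (range n)))])))

(defn path-scoped-thunks
  "Counts one thunk per distinct cycle-free path starting at root."
  ([graph root] (path-scoped-thunks graph root #{}))
  ([graph id on-path]
   (if (contains? on-path id)
     0 ; the path closed a cycle, nothing new is memoized
     (reduce + 1 (map #(path-scoped-thunks graph % (conj on-path id))
                      (graph id))))))

(mapv #(path-scoped-thunks (complete-graph %) 0) (range 1 7))
;; => [1 2 5 16 65 326]
```

A cache keyed by node would hold n entries for the same graphs, i.e. 1 through 6.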
@ambrosebs are you still working on issues relating to memoization / refs? I would love not to duplicate any effort - and would love to know any early thoughts before I raise any issue against malli (or think about solutions).
@danstone thanks for the report! nope I'm not working on it at the moment. we should start investigating so please go ahead with raising issues and we'll discuss solutions.
@danstone could you try changing this if lazy to if true https://github.com/metosin/malli/blob/80138076960e7820523b4cb932c5b5d1936d4e7f/src/malli/core.cljc#L1997