Ben Sless 14:10:08

What's the reason Keyword doesn't implement IObj and can't carry metadata?


at the very least, you can’t intern them directly


keywords are interned

Alex Miller (Clojure team) 14:10:09

they're cached and reused so metadata in one place would affect other uses

👍 3

When saying they are interned, or they are cached, is that the same thing as saying: "we want equality to be fast, determined by whether references/pointers to them are equal", or is there any more nuance to it than that?

Alex Miller (Clojure team) 14:10:44

fast equality (identity) checks and memory impact
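For readers outside the thread: interning is the same trick Java's `String.intern` plays — equal values collapse onto one canonical object, so equality checks reduce to a pointer comparison. A quick illustration with strings (interned keywords get the same benefit):

```java
public class InternEquality {
    public static void main(String[] args) {
        // Two distinct objects with equal contents:
        String a = new String("kw");
        String b = new String("kw");
        System.out.println(a == b);                   // false: different objects
        System.out.println(a.equals(b));              // true: equal contents
        // Interning collapses equal values onto one canonical object,
        // so equality becomes a cheap identity (pointer) check:
        System.out.println(a.intern() == b.intern()); // true
    }
}
```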


it also means they won't ever be garbage collected, so creating a gazillion different keywords in a long-running process isn't a good idea maybe?


that's not the typical clojure program of course


The clj implementation uses weakrefs to intern, which can be GCed
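The weak-ref interning scheme can be sketched in isolation. This is a hypothetical generic cache in the spirit of `clojure.lang.Keyword.intern` (the class, names, and `Function` factory are made up for illustration; the real code also sweeps dead entries, discussed below):

```java
import java.lang.ref.WeakReference;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical weak-ref intern cache: the table holds only WeakReferences,
// so an interned value that nothing else references stays collectable
// under GC pressure.
class InternCache<K, V> {
    private final ConcurrentHashMap<K, WeakReference<V>> table = new ConcurrentHashMap<>();

    V intern(K key, Function<K, V> make) {
        WeakReference<V> ref = table.get(key);
        if (ref != null) {
            V existing = ref.get();
            if (existing != null)
                return existing;       // canonical instance already cached
            table.remove(key, ref);    // entry died in the interim, do over
            return intern(key, make);
        }
        V fresh = make.apply(key);
        WeakReference<V> prev = table.putIfAbsent(key, new WeakReference<>(fresh));
        if (prev == null)
            return fresh;              // we installed the canonical instance
        V existing = prev.get();
        if (existing != null)
            return existing;           // lost the race to another thread
        table.remove(key, prev);
        return intern(key, make);
    }
}
```

As long as some code holds a strong reference, `intern` keeps returning the same object, so identity checks stay valid; once the last strong reference drops, the GC may clear the `WeakReference`.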


ah, very good


but the map holding them I guess could still get very big

Alex Miller (Clojure team) 15:10:52

under gc pressure, they are gc'ed

👍 3

just as a btw, cljs will “intern” (build a fixed map of) keywords discovered statically during compilation, but keywords created dynamically are not interned, because JS has no facility to intern them without leaking. So cljs doesn’t guarantee that equal keywords are identical.

💯 3

that's an interesting detail, thank you


it has an extra predicate keyword-identical? to get some of the performance back


ah that's where that comes from. yeah. in .cljc I usually write (defn kw-identical? [k v] #?(:clj (identical? k v) :cljs (keyword-identical? k v)))


I guess one could make the Symbol -> Reference to Keyword map smaller under GC pressure, too, but that would require some kind of sweep over that map to remove entries from it, and some kind of trigger to call that sweep?


hmm, so the corresponding symbols still won't get GC-ed... what's the point then of these weak refs?


Hmmm, perhaps Clojure already does make that map smaller, e.g. in its method clearCache ...


with the trigger to call it for the Symbol->Reference to Keyword map being an intern call on a keyword that finds a null reference


so for each new (non-identical) keyword that trigger is called?


The trigger appears to be calling intern on a Symbol that already has an entry in the map, but whose weak reference has been cleared, which is evidence that an earlier GC run freed the Keyword object. If you never call intern on such a Symbol, there will be no call to clearCache that I can see.


right, that makes a lot of sense


One could imagine trying to call clearCache in other situations, but it appears to be a computation time vs. space tradeoff; e.g. a time-based periodic trigger to call clearCache would be a waste of CPU time in most situations.


I suppose the "perfect" trigger would be some thread whose only job was to sit in an infinite loop doing a blocking remove() call on the reference queue of GC'ed WeakReferences, and call clearCache every time that remove() returned something. Then you'd need synchronization on that map.
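That design can be sketched concretely. This is a hypothetical eager-cleanup variant (`EagerCache`, `put`/`get`, and the sweeper thread are all invented for illustration — Keyword.java instead cleans up lazily from inside intern):

```java
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical eager-cleanup design: every WeakReference is registered with a
// ReferenceQueue, and a daemon thread blocks on remove(), pruning table entries
// as soon as the GC clears them.
class EagerCache {
    final ConcurrentHashMap<String, WeakReference<Object>> table = new ConcurrentHashMap<>();
    final ReferenceQueue<Object> queue = new ReferenceQueue<>();

    EagerCache() {
        Thread sweeper = new Thread(() -> {
            try {
                while (true) {
                    Reference<?> dead = queue.remove(); // blocks until a reference is enqueued
                    // ConcurrentHashMap's collection views are thread-safe, so
                    // no extra synchronization is needed for the removal itself.
                    table.values().remove(dead);
                }
            } catch (InterruptedException e) {
                // shut down
            }
        });
        sweeper.setDaemon(true);
        sweeper.start();
    }

    void put(String key, Object val) {
        table.put(key, new WeakReference<>(val, queue));
    }

    Object get(String key) {
        WeakReference<Object> ref = table.get(key);
        return ref == null ? null : ref.get();
    }
}
```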


errr, or maybe the ConcurrentHashMap is safe for that use already


doesn’t this actually only call clearCache when there’s a keyword miss?


That is what it appears to me, yes.


well, nm, it runs intern again after removing the entry, and that ends up clearing out the table and reference queue


What line(s) of code are you referring to when you say "it runs intern again after removing the entry"?


if (existingk != null)
    return existingk;
// entry died in the interim, do over
table.remove(sym, existingRef);
return intern(sym);


existingk == null means a cached item was GC'ed (the table has an empty weakref in it), so it removes the entry and runs intern again; this time table.get(sym) will be null, so it runs clearCache
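What that sweep amounts to can be sketched as a pass over the table dropping entries whose weak reference has been cleared (`clearDead` here is an illustrative stand-in for Keyword.clearCache, which additionally drains its reference queue):

```java
import java.lang.ref.WeakReference;
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class CacheSweep {
    // Drop every entry whose WeakReference has lost its referent,
    // i.e. the cached object was collected by an earlier GC run.
    static <K, V> void clearDead(ConcurrentHashMap<K, WeakReference<V>> table) {
        for (Iterator<Map.Entry<K, WeakReference<V>>> it = table.entrySet().iterator(); it.hasNext(); ) {
            if (it.next().getValue().get() == null)
                it.remove();   // stale entry: referent is gone
        }
    }
}
```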


Suppose some Keywords were GC'ed, and every call you made to intern after that were for Symbols whose Keywords were never GC'ed. It appears to me that clearCache would never be called. Do you see a way that clearCache could be called in that scenario?


no, but that would also not be a case with memory pressure


It could be. When I said "some Keywords were GC'ed", it could be 99% of a billion of them that were GC'ed.


eventually the GC will collect a keyword that was only not in use for a brief time and gets used again; at that point it will be detected


if the GC never bothers to do that, then it has spare memory


the worst case I can think of would be 1) create a large number of unique keywords 2) then stop and never use a new keyword again


The scenario I am describing is one where the keywords that are GC'ed are never used again by the application. I know there are applications that will not behave that way, but that is the scenario I was trying to describe above.


correct, but if there’s memory pressure from those empty entries, it seems that some keyword somewhere that the application still uses will eventually be GC'ed, if the application is still using any keywords at all that don’t have permanent lifetimes


Ah, I see what you mean. clearCache is called either if you re-intern an old GC'ed Keyword, OR if you intern a brand new keyword. The only situation where you avoid it indefinitely is by sticking with the Keywords you have now, forever.


not only sticking with them, but keeping them “alive”


so, it’s possible but seems very unlikely


doesn’t mean you won’t have a bad time when clearCache gets called though


It is a kind of stop-the-world GC on that map, yes.


or at least stop-the-thread-calling-intern, not the world


I think the lesson here is don’t dynamically create large numbers of unique short-lived keywords


So don’t create keywords from user input (form, json, edn, xml… parsing).


edn billion laughs attack

😀 3

It would be fun to get some statistics on the number of interned keywords in various clojure apps

Alex Miller (Clojure team) 16:10:53

I have done that in the past

Alex Miller (Clojure team) 16:10:22

I also have a benchmark (based on some real world cases) that is importing json with a lot of unique keywords to push this case

Alex Miller (Clojure team) 16:10:06

the last set of changes done were specifically to make that case better (used to be kind of slow to hit the gc conditions)