
I recently finished a transcript of Rich Hickey's 2008 talk "Clojure for Java Programmers". I will announce it in the usual places once the PR merging it into the master GitHub repo is accepted, but I wanted to ask here about one particular point in it that seems misleading or wrong to me. Search for the words "that is made faster" there, and then a few sentences later for my editorial note/question starting at "Rich says above".

đź‘Ť 4

AFAIK, Clojure's = often gets a performance win by doing identical? before a deep comparison, but what he says in that talk implies that Clojure uses hash values to speed up =, which I have never seen in the implementation. It seems imaginable that it could be done (e.g. if two collections' hash values differ, quickly return false with no deep equality check), but I have never seen anything in the implementation that does that.


@andy.fingerhut But Clojure did (does?) cache hash values, which is what he says. For example, that came up with records not matching hash maps in that respect.


(I don't see the direct correlation with equality tho' now)

Alex Miller (Clojure team) 23:12:37

Some data structures cache hash values, and hash value checks can be a “fast fail” check

Alex Miller (Clojure team) 23:12:05

Off the top of my head I’m not sure if that’s done anywhere


I agree that Clojure caches hash values for collections. But the context of that statement is these three sentences (maybe not enough context; I am happy to be corrected on that): "Identical contents are identical by equals. That is made faster than you might imagine by caching hash values. But equality is equality of value in Clojure." I could be misinterpreting things here, of course, but it could be read to imply that cached hash values can make equality checks on collections faster.


Yeah, and maybe that was the way he was thinking of doing the equality checks back then? But later changed his mind.


Could be. This talk was June 2008, about a year before Clojure 1.0 was released.

Alex Miller (Clojure team) 23:12:45

Identity is a “fast pass” check. Hash checking is a “fast fail” check. Value equality is the final answer. Not all hashes are cached, and it may not always be worth checking. In general, we make those decisions based on testing.
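
[Editorial note: the three-stage check Alex describes can be sketched in plain Java. This is a hypothetical fastEquals helper, not Clojure's actual implementation; it is sound only because the hashCode/equals contract guarantees that equal objects have equal hash codes, so differing hashes imply inequality.]

```java
import java.util.List;

public class FastEquals {
    // Hypothetical sketch of the three stages: identity fast-pass,
    // hash fast-fail, then full value comparison.
    static boolean fastEquals(Object a, Object b) {
        if (a == b) return true;                        // identity: fast pass
        if (a == null || b == null) return false;
        if (a.hashCode() != b.hashCode()) return false; // differing hashes: fast fail
        return a.equals(b);                             // final answer: value equality
    }

    public static void main(String[] args) {
        List<Integer> xs = List.of(1, 2, 3);
        System.out.println(fastEquals(xs, xs));              // true  (identity)
        System.out.println(fastEquals(xs, List.of(1, 2, 3))); // true  (deep comparison)
        System.out.println(fastEquals(xs, List.of(1, 2, 4))); // false
    }
}
```

Note that without caching, computing hashCode on a collection is itself O(n), so the fast-fail step only pays off when the hashes are already cached.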


Related: I’ve wondered why distinct doesn’t use a set of hashes instead of holding the elements themselves in a set? Seems hashes would cost less memory if you’re streaming over the values in the seq, and could free the elements for GC?
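
[Editorial note: for reference, a minimal eager distinct in Java — a sketch, not Clojure's lazy-seq implementation — shows why seen elements are retained: the seen set holds a strong reference to every element it has emitted.]

```java
import java.util.*;

public class DistinctSketch {
    // Sketch: yields each element the first time it is seen. The 'seen' set
    // holds references to all previously emitted elements, so none of them
    // can be garbage-collected while the traversal is alive.
    static <T> List<T> distinct(Iterable<T> xs) {
        Set<T> seen = new HashSet<>();
        List<T> out = new ArrayList<>();
        for (T x : xs) {
            if (seen.add(x)) {  // add() returns false if x was already present
                out.add(x);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(distinct(List.of(1, 2, 1, 3, 2))); // [1, 2, 3]
    }
}
```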


I agree hash checking could be a fast fail check on =. I just don't think the implementation does that anywhere.


@rutledgepaulv I suspect if you actually benchmark those implementations over different sizes of collections with different types of elements, you might find it's only faster in a limited number of cases -- and will be slower in some common cases...


@rutledgepaulv All of the elements in the set are held by reference to the existing objects, not making new copies of those objects. So storing hashes would involve creating new Long or Integer objects that might actually cost more memory than how it is done now.


yeah not suggesting it’s faster, but maybe equivalent + less memory?


yeah but if they were large structures or something, it might be cheaper to allocate a new hash and discard the larger thing sooner (but agreed, there may be cases where it’s worse)


just had a hunch it might be better in the typical case, but haven’t proven that obviously 🙂


You can't discard anything before distinct returns, can you?


Hmm. It returns a lazy sequence, so I guess maybe. Also note that of course you would have to deal with hash collisions correctly


as you’re consuming elements from the seq (if you don’t hold onto the head) then presumably elements could be discarded except for the fact that they’re held in the closure by the set?


are sets of elements not subject to the same hash collision problems? or do they fall back to equality checks when it finds a matching hash?


In fact, to deal with hash collisions correctly, you could never remove references to previously seen objects.


You never know if a future object will collide in its hash with a previously seen object, until you have seen it.


hash sets keep all objects with equal hashes together (e.g. in a linked list / collision node) and fall back to equality checks within that group


thanks. for some reason i was thinking the hashes were a safe replacement but maybe that’s asking for a non-existent hash function


It would. People have done research and development on perfect hash functions, but as far as I know those always require a set of keys that is known ahead of time and doesn't change.


There are only 2^32 possible Java int values. The birthday paradox says to expect a 50% chance of a collision after roughly the square root of that many objects (a small constant factor above sqrt(2^32) = 65,536).
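
[Editorial note: the 50% point can be made precise. With N = 2^32 possible hash values, the standard birthday approximation p(collision) ≈ 1 − exp(−n²/2N) gives p = 0.5 at n ≈ sqrt(2 N ln 2) ≈ 1.18·sqrt(N). A quick check, assuming uniformly distributed hashes:]

```java
public class BirthdayBound {
    public static void main(String[] args) {
        double N = Math.pow(2, 32);  // possible int hash values
        // p(collision) ≈ 1 - exp(-n^2 / (2N)); solving for p = 0.5
        // gives n ≈ sqrt(2 N ln 2) ≈ 1.177 * sqrt(N).
        long n = Math.round(Math.sqrt(2 * N * Math.log(2)));
        System.out.println(n); // on the order of 77,000 objects
    }
}
```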