This page is not created by, affiliated with, or supported by Slack Technologies, Inc.
2024-04-10
Channels
- # babashka (8)
- # beginners (38)
- # calva (28)
- # cider (7)
- # clj-kondo (6)
- # clojure (113)
- # clojure-austin (2)
- # clojure-berlin (11)
- # clojure-europe (12)
- # clojure-losangeles (5)
- # clojure-madison (1)
- # clojure-nl (1)
- # clojure-norway (17)
- # clojure-spec (5)
- # clojure-uk (4)
- # clojurescript (23)
- # events (2)
- # fulcro (6)
- # hyperfiddle (2)
- # missionary (5)
- # music (1)
- # off-topic (20)
- # portal (5)
- # reitit (4)
- # releases (1)
- # scittle (3)
- # sql (10)
- # squint (2)
Curious. How does Clojure add the line number and column number into the metadata for each of the vars?
Even a pointer towards the part of the source that handles it would be a good start
https://github.com/clojure/clojure/blob/master/src/jvm/clojure/lang/LineNumberingPushbackReader.java
Thanks a ton!
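For anyone else curious, a quick REPL sketch of the metadata in question (the exact :line/:column/:file values depend on where the form is evaluated):

```clojure
;; Vars created by def/defn carry :line, :column, and :file metadata,
;; captured at read time via LineNumberingPushbackReader.
(defn example-fn [x] x)

(select-keys (meta #'example-fn) [:line :column :file])
;; the values depend on where this is evaluated,
;; e.g. :file is "NO_SOURCE_PATH" at a bare REPL
```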
Anyone know why the indentation is inconsistent in this class? The constructor is indented 8 spaces and some of the other methods 4… 😄
and one of the methods isn’t indented at all
the code is immutable too? 😄
The Clojure core team is not a stickler for changes having indenting consistent with the original code. And it is a stickler for not making wholesale code changes just for whitespace.
I recommend running a pretty-formatter-thingy on it before browsing it, if it bothers you.
It doesn’t bother me one bit. Just curious that’s all
I vaguely remember seeing discussion here on the relative difficulty of reliably writing EDN, and seeing someone recommend a set of dynamic bindings that they use to minimize the pitfalls. Anyone happen to have some variant of those bindings handy? (Or perhaps I'm making things up?)
Oh, here we go, never mind! https://clojurians.slack.com/archives/C03S1KBA2/p1679254983827799?thread_ts=1679245480.586599&channel=C03S1KBA2&message_ts=1679254983.827799
A recent discussion on X reminded me about a thing I have been wondering about. How are equality checks between large data structures made efficient? I sort of have some intuition for it when the two structures are created such that they share things (b is a derivative of a). But if a and b are created in two different ways, say a is from an API call, and b comes from reading something from a file. I sense that I am missing some piece of the puzzle here that I need to understand.
The discussion: https://twitter.com/meekaale/status/1777963616144171073 (It’s a bit sad reading in parts, you have been warned).
Equality checks aren't really very efficient when the two structures do not share ancestry 😅
So it's more a case of it being possible, because they're values, not that it's always fast
Even when they do share stuff, it's not as efficient. For hash maps, for example, it still boils down to comparing all kv-pairs when the size is the same.
(def a {:a 1 :b 2})
(def b {:b 2 :a 1})
(time
  (dotimes [i 10000000]
    (= a b)))
"Elapsed time: 620.7485 msecs"
(time
  (dotimes [i 10000000]
    (identical? a b)))
"Elapsed time: 12.702459 msecs"
😅 It's just a few Java classes, there's no magic in there - you can trace the whole path = takes to return something for any combination of types.
identical? is fun in the sense that (identical? {} {}) is true but (identical? {:a 1} {:a 1}) is false. For an unrelated reason, though.
Because identical? is == in Java. It doesn't look inside the object at all. {} is identical to another {} only because the reader expands it to the same static constant.
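A minimal REPL check of that claim:

```clojure
;; {} reads as the shared empty-map constant, so reference equality holds;
;; non-empty literals construct a fresh object each time they are read.
(identical? {} {})          ;; => true
(identical? {:a 1} {:a 1})  ;; => false
(= {:a 1} {:a 1})           ;; => true, value equality still holds
```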
Well, = might not use hash, but to answer the question of "How are equality checks between large data structures made efficient?", isn't that generally done by hashing? And Clojure uses hash equality checks internally a lot for efficiency reasons, right?
No and no.
A "large data structure" is a very ambiguous term. Try computing (hash (map identity (range))).
Even a finite collection requires a full loop over its contents to compute its hash.
Of course, the hash can be and is often cached. If we're comparing two collections that have their hashes cached, equality can potentially be made faster. But at the cost of behaving differently in the bad cases when two collections are considered equal by the current impl but have different hashes.
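A rough way to observe that caching from a REPL (the timings are illustrative, not exact):

```clojure
;; Hashing a large vector walks every element once; the result is then
;; cached on the collection, so repeated calls are near-instant.
(def big (vec (range 1000000)))
(time (hash big))  ;; first call walks all elements
(time (hash big))  ;; second call returns the cached hash
```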
Yeah, I guess I'm conflating the equality of immutable values with hash equality. I didn't realize vectors and array-maps completely ignore the hash values of their items. Seems surprising; if you know you're comparing two immutable things with the same hashing semantics, why not just use the hashes?
And that aforementioned "comparing things that break the equality/hashing rule or whatever it's called".
> why not just use the hashes?
Also, hash collisions.
Honestly, for all the datatypes that implement IHash, I thought it was mostly for more efficient comparisons. But it's mostly just for hash maps and hash sets?
Chance of collision is pretty low though, right? Is a collision domain that's good enough for hash sets not good enough for a deep equality check?
Oh, case might also use hashes.
> Chance of collision is pretty low though, right?
Not really. Especially when the hash algorithm is known by a potential attacker.
The hashing of objects is just supposed to be cheap; it doesn't require cryptographic strength.
> I mean don't use this for crypto lol
You underestimate how important proper hashing is. It has nothing to do with crypto.
Well, there's no such thing as a perfect collision domain. The hash is smaller than the object, so obviously there's collisions
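A concrete example: "Aa" and "BB" are a classic Java String.hashCode collision, and Clojure's hash for strings is derived from that hashCode, so they collide in Clojure too.

```clojure
;; "Aa" and "BB" have the same Java hashCode (2112); Clojure's hasheq for
;; strings is a function of that hashCode, so they collide here as well.
(= (.hashCode "Aa") (.hashCode "BB"))  ;; => true
(= (hash "Aa") (hash "BB"))            ;; => true
(= "Aa" "BB")                          ;; => false
```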
For shits and giggles, I just tried this:
Clojure 1.11.2
user=> (require '[clojure.spec.alpha :as s])
nil
user=> (require '[clojure.spec.gen.alpha :as gen])
nil
user=> (require '[clojure.test.check.generators])
nil
user=> (def data (gen/sample (s/gen string?) 1000000))
#'user/data
user=> (doseq [a data, b data, :when (not= a b)] (when (= (hash a) (hash b)) (prn a b)))
The list of collisions is long, and it's still going.
> Well, there's no such thing as a perfect collision domain. The hash is smaller than the object, so obviously there's collisions
True. But if an attacker can submit some data to your server and knows that that data is cached in the same hash map, they can bring your server to a halt by feeding it just the right data.
One particularly nasty case is using keyword on user data. I don't know if that's been (or can be) handled, though.
By exploiting hash collisions, an attacker can make a known hash map put all the data into the same bucket, turning O(1) into O(n).
Yeah, I get your point about deep equality checking. But is it possible for hash maps and hash sets to have any false negatives or false positives on a wrong key? Or do they always do a full equality check?
just to be totally pedantic, first the hash is used to navigate the trie structure (HAMT), then equality is used to confirm sameness
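That two-step behavior can be seen with two keys that share a hash ("Aa" and "BB" have the same Java String hashCode, and Clojure's string hash is derived from it):

```clojure
;; Even with a hash collision, lookups stay correct: the hash picks the
;; path through the trie, then equality confirms which key actually matched.
(def m (hash-map "Aa" 1 "BB" 2))
(get m "Aa")  ;; => 1
(get m "BB")  ;; => 2
```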
Even that assumption rests on the "equal things generate equal hashes" rule. Ghadi, please recall the name of the rule for me. :D
@U050PJ2EU No, because hash implementation is technically independent from equality implementation. You can make the first always return 1 and the second ask an atomic clock.
https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/lang/Object.html#hashCode()
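A hypothetical type making that point (Dumb is made up for illustration): the hash can legally be a constant while equality does real work. Lookups degrade to O(n) within a single bucket, but they remain correct.

```clojure
;; A deliberately bad but contract-respecting hash: always 1.
;; Equal values still get equal hashes, so maps remain correct,
;; just slow, because everything lands in the same bucket.
(deftype Dumb [v]
  Object
  (hashCode [_] (int 1))
  (equals [_ other]
    (and (instance? Dumb other) (= v (.-v ^Dumb other))))
  clojure.lang.IHashEq
  (hasheq [_] (int 1)))

(= (hash (->Dumb 1)) (hash (->Dumb 2)))  ;; => true, constant hash
(= (->Dumb 1) (->Dumb 1))                ;; => true, real equality
(= (->Dumb 1) (->Dumb 2))                ;; => false
```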
the simplest hash is a counter, right? As long as each is unique in a given runtime, it's good enough in some cases, right?
> the simplest hash is a counter, right?
Such an impractical hash can be made even simpler by always returning the same constant or a random number. There is no practical way of implementing a global counter for all possible data types and values.
So what do hashing datastructures use the hash for? partitioning? randomizing the distribution?
Still, if I'm iterating through two lists, comparing pairs of items for equality, and I see two maps and their hashes are different, then I know I don't even need to traverse those maps to check. If they have different hashes, they're definitely not equal. So why not use that hash when doing that comparison?
> and I see two maps and their hashes are different
For that, you first need to compute the hashes. What is the hash of {:a (range)}?
> If they have different hashes, they're definitely not equal.
Using it in = would break it because there absolutely are cases out there that don't adhere to the contract of hashCode that Ghadi has linked above.
You might be able to implement = in a way that uses hashes and doesn't return false results. It will probably be spaghetti with lots of ifs where you check type combinations. And in the end, it will break a lot of existing code because not everything is hashable (throw or infinite loop) and not everything adheres to the contract of hashCode.
Hmm, I thought hash was getting called on creation of a type implementing IHash. I guess I figured lazy seqs got a random hash or something. How is the hash computed when something within an infinite seq is in the key position?
(def a {{:x (range)} {:x (range)}
        :b {:c :d}})
(-> a :b hash) ; => -1916895501
Ah, it doesn't throw because it hangs even with (range 2), because it tries to compare the keys.
This, on the other hand, throws: (def x (into {} (map (fn [n] [{n (range)} n])) (range 10))).
If you have two deeply nested collections, and one was created by doing update on a few key/val pairs from the other using Clojure's update or assoc, then note that = comparison between them can be made faster by the fact that many of the sub-collections will return true for identical?, and Clojure's = does check identical? quickly before any other checks.
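A small sketch of the structural sharing that makes this fast:

```clojure
;; assoc on the top level reuses the untouched sub-collections: the inner
;; map is literally the same object in both values, so = can short-circuit
;; on it via identical?.
(def a {:x {:big (vec (range 1000))} :y 1})
(def b (assoc a :y 2))
(identical? (:x a) (:x b))  ;; => true, shared structure
(= a b)                     ;; => false, :y differs
```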
It might be nice if = checked to see if the hash value was already calculated and stored for two collections, and if so, returned false quickly if their hashes are different, but IIRC Clojure does not have such code.
Well I definitely had the wrong idea how that worked. I thought hashes were used more for equality checking
I think we could get hash-boosted equality if we used immutable collections backed by Merkle trees rather than HAMTs. But I believe Merkle trees are significantly slower than HAMTs for almost all common use cases (except for equality comparison, which is O(1)). Also, the memory overhead of keeping a long hash (e.g. SHA-512) for each element is significant. (For people curious about Merkle trees, I suggest looking into IPFS.)