This page is not created by, affiliated with, or supported by Slack Technologies, Inc.
2024-04-10
Channels
- # babashka (8)
- # beginners (38)
- # calva (28)
- # cider (7)
- # clj-kondo (6)
- # clojure (113)
- # clojure-austin (2)
- # clojure-berlin (11)
- # clojure-europe (12)
- # clojure-losangeles (5)
- # clojure-madison (1)
- # clojure-nl (1)
- # clojure-norway (17)
- # clojure-spec (5)
- # clojure-uk (4)
- # clojurescript (23)
- # events (2)
- # fulcro (6)
- # hyperfiddle (2)
- # missionary (5)
- # music (1)
- # off-topic (20)
- # portal (5)
- # reitit (4)
- # releases (1)
- # scittle (3)
- # sql (10)
- # squint (2)
Curious. How does Clojure add the line number and column number into the metadata for each of the vars?
Even a pointer towards the part of the source that handles it would be a good start
https://github.com/clojure/clojure/blob/master/src/jvm/clojure/lang/LineNumberingPushbackReader.java
Thanks a ton!
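For anyone else curious, a quick REPL sketch of the metadata in question (the exact :line/:column/:file values depend on where the form is evaluated):

```clojure
;; Vars created by def/defn carry :line, :column, and :file metadata,
;; captured at read time via LineNumberingPushbackReader.
(defn example-fn [x] x)

(select-keys (meta #'example-fn) [:line :column :file])
;; the values depend on where this is evaluated,
;; e.g. :file is "NO_SOURCE_PATH" at a bare REPL
```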
Anyone know why the indentation is inconsistent in this class? The constructor is indented 8 spaces and some of the other methods 4… 😄
and one of the methods isn’t indented at all
the code is immutable too? 😄
The Clojure core team is not a stickler for changes having indenting consistent with the original code. And it is a stickler for not making wholesale code changes just for whitespace.
I recommend running a pretty-formatter-thingy on it before browsing it, if it bothers you.
It doesn’t bother me one bit. Just curious that’s all
I vaguely remember seeing discussion here on the relative difficulty of reliably writing EDN, and seeing someone recommend a set of dynamic bindings that they use to minimize the pitfalls. Anyone happen to have some variant of those bindings handy? (Or perhaps I'm making things up?)
Oh, here we go, never mind! https://clojurians.slack.com/archives/C03S1KBA2/p1679254983827799?thread_ts=1679245480.586599&channel=C03S1KBA2&message_ts=1679254983.827799
A recent discussion on X reminded me about a thing I have been wondering about. How are equality checks between large data structures made efficient? I sort of have some intuition for it when the two structures are created such that they share things (b is a derivative of a). But if a and b are created in two different ways, say a is from an API call, and b comes from reading something from a file. I sense that I am missing some piece of the puzzle here that I need to understand.
The discussion: https://twitter.com/meekaale/status/1777963616144171073 (It’s a bit sad reading in parts, you have been warned).
Equality checks aren't really very efficient when the two structures do not share ancestry 😅
So it's more a case of it being possible, because they're values, not that it's always fast
Even when they do share stuff, it's not as efficient. For hash maps, for example, it still boils down to comparing all kv-pairs when the size is the same.
(def a {:a 1 :b 2})
(def b {:b 2 :a 1})
(time
  (dotimes [i 10000000]
    (= a b)))
"Elapsed time: 620.7485 msecs"
(time
  (dotimes [i 10000000]
    (identical? a b)))
"Elapsed time: 12.702459 msecs"
😅 It's just a few Java classes, there's no magic in there - you can trace the whole path = takes to return something for any combination of types.
identical? is fun in the sense that (identical? {} {}) is true but (identical? {:a 1} {:a 1}) is false. For an unrelated reason, though.
Because identical? is == in Java. It doesn't look inside the object at all. {} is identical to another {} only because the reader expands it to the same static constant.
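A minimal REPL check of that claim:

```clojure
;; {} reads as the shared empty-map constant, so reference equality holds;
;; non-empty literals construct a fresh object each time they are read.
(identical? {} {})          ;; => true
(identical? {:a 1} {:a 1})  ;; => false
(= {:a 1} {:a 1})           ;; => true, value equality still holds
```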
Well, = might not use hash, but to answer the question of "How are equality checks between large data structures made efficient?", isn't that generally done by hashing? And Clojure uses hash equality checks internally a lot for efficiency reasons, right?
No and no.
A "large data structure" is a very ambiguous term. Try computing (hash (map identity (range))).
Even a finite collection requires a full loop over its contents to compute its hash.
Of course, the hash can be and is often cached. If we're comparing two collections that have their hashes cached, equality can potentially be made faster. But at the cost of behaving differently in the bad cases when two collections are considered equal by the current impl but have different hashes.
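A rough way to observe that caching from a REPL (the timings are illustrative, not exact):

```clojure
;; Hashing a large vector walks every element once; the result is then
;; cached on the collection, so repeated calls are near-instant.
(def big (vec (range 1000000)))
(time (hash big))  ;; first call walks all elements
(time (hash big))  ;; second call returns the cached hash
```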
Yeah, I guess I'm conflating the equality of immutable values with hash equality. I didn't realize vectors and array-maps completely ignore the hash values of their items. Seems surprising; if you know you're comparing two immutable things with the same hashing semantics, why not just use the hashes?
And that aforementioned "comparing things that break the equality/hashing rule or whatever it's called".
> why not just use the hashes?
Also, hash collisions.
Honestly, for all the datatypes that implement IHash, I thought it was mostly for more efficient comparisons. But it's mostly just for hash maps and hash sets?
Chance of collision is pretty low though, right? Is a collision domain that's good enough for hash sets not good enough for a deep equality check?
Oh, case might also use hashes.
> Chance of collision is pretty low though, right?
Not really. Especially when the hash algorithm is known by a potential attacker.
The hashing of objects is just supposed to be cheap; it doesn't require cryptographic strength.
> I mean don't use this for crypto lol
You underestimate how important proper hashing is. It has nothing to do with crypto.
Well, there's no such thing as a perfect collision domain. The hash is smaller than the object, so obviously there's collisions
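A concrete example: "Aa" and "BB" are a classic Java String.hashCode collision, and Clojure's hash for strings is derived from that hashCode, so they collide in Clojure too.

```clojure
;; "Aa" and "BB" have the same Java hashCode (2112); Clojure's hasheq for
;; strings is a function of that hashCode, so they collide here as well.
(= (.hashCode "Aa") (.hashCode "BB"))  ;; => true
(= (hash "Aa") (hash "BB"))            ;; => true
(= "Aa" "BB")                          ;; => false
```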
For shits and giggles, I just tried this:
Clojure 1.11.2
user=> (require '[clojure.spec.alpha :as s])
nil
user=> (require '[clojure.spec.gen.alpha :as gen])
nil
user=> (require '[clojure.test.check.generators])
nil
user=> (def data (gen/sample (s/gen string?) 1000000))
#'user/data
user=> (doseq [a data, b data, :when (not= a b)] (when (= (hash a) (hash b)) (prn a b)))
The list of collisions is long, and it's still going.
> Well, there's no such thing as a perfect collision domain. The hash is smaller than the object, so obviously there's collisions
True. But if an attacker can submit some data to your server and knows that that data is cached in the same hash map, they can bring your server to a halt by feeding it just the right data.
One particularly nasty case is using keyword on user data. I don't know if that's been (or can be) handled, though.
By exploiting hash collisions, an attacker can make a known hash map put all the data into the same bucket, turning O(1) into O(n).
Yeah, I get your point about deep equality checking. But is it possible for hash maps and hash sets to have any false negatives or false positives on a wrong key? Or do they always do a full equality check?
just to be totally pedantic, first the hash is used to navigate the trie structure (HAMT), then equality is used to confirm sameness
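That two-step behavior can be seen with two keys that share a hash ("Aa" and "BB" have the same Java String hashCode, and Clojure's string hash is derived from it):

```clojure
;; Even with a hash collision, lookups stay correct: the hash picks the
;; path through the trie, then equality confirms which key actually matched.
(def m (hash-map "Aa" 1 "BB" 2))
(get m "Aa")  ;; => 1
(get m "BB")  ;; => 2
```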
Even that assumption rests on the "equal things generate equal hashes" rule. Ghadi, please recall the name of the rule for me. :D
@U050PJ2EU No, because hash implementation is technically independent from equality implementation. You can make the first always return 1 and the second ask an atomic clock.
https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/lang/Object.html#hashCode()
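A hypothetical type making that point (Dumb is made up for illustration): the hash can legally be a constant while equality does real work. Lookups degrade to O(n) within a single bucket, but they remain correct.

```clojure
;; A deliberately bad but contract-respecting hash: always 1.
;; Equal values still get equal hashes, so maps remain correct,
;; just slow, because everything lands in the same bucket.
(deftype Dumb [v]
  Object
  (hashCode [_] (int 1))
  (equals [_ other]
    (and (instance? Dumb other) (= v (.-v ^Dumb other))))
  clojure.lang.IHashEq
  (hasheq [_] (int 1)))

(= (hash (->Dumb 1)) (hash (->Dumb 2)))  ;; => true, constant hash
(= (->Dumb 1) (->Dumb 1))                ;; => true, real equality
(= (->Dumb 1) (->Dumb 2))                ;; => false
```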
the simplest hash is a counter, right? As long as each is unique in a given runtime, it's good enough in some cases, right?
> the simplest hash is a counter, right?
Such an impractical hash can be made even simpler by always returning the same constant or a random number. There is no practical way of implementing a global counter for all possible data types and values.
So what do hashing datastructures use the hash for? partitioning? randomizing the distribution?
Still, if I'm iterating through two lists, comparing pairs of items for equality, and I see two maps and their hashes are different, then I know I don't even need to traverse those maps to check. If they have different hashes, they're definitely not equal. So why not use that hash when doing that comparison?
> and I see two maps and their hashes are different
For that, you first need to compute the hashes. What is the hash of {:a (range)}?
> If they have different hashes, they're definitely not equal.
Using it in = would break it because there absolutely are cases out there that don't adhere to the contract of hashCode that Ghadi has linked above.
You might be able to implement = in a way that uses hashes and doesn't return false results. It will probably be spaghetti with lots of ifs where you check type combinations. And in the end, it will break a lot of existing code because not everything is hashable (throw or infinite loop) and not everything adheres to the contract of hashCode.
Hmm, I thought hash was getting called on creation of a type implementing IHash. I guess I figured lazy seqs got a random hash or something. How is the hash computed when something within an infinite seq is in the key position?
(def a {{:x (range)} {:x (range)}
        :b {:c :d}})
(-> a :b hash) ; => -1916895501
Ah, it doesn't throw because it hangs even with (range 2), because it tries to compare the keys.
This, on the other hand, throws: (def x (into {} (map (fn [n] [{n (range)} n])) (range 10))).
If you have two deeply nested collections, and one was created by doing update on a few key/val pairs from the other using Clojure's update or assoc, then note that = comparison between them can be made faster by the fact that many of the sub-collections will return true for identical?, and Clojure's = does check identical? quickly before any other checks.
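A small sketch of the structural sharing that makes this fast:

```clojure
;; assoc on the top level reuses the untouched sub-collections: the inner
;; map is literally the same object in both values, so = can short-circuit
;; on it via identical?.
(def a {:x {:big (vec (range 1000))} :y 1})
(def b (assoc a :y 2))
(identical? (:x a) (:x b))  ;; => true, shared structure
(= a b)                     ;; => false, :y differs
```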
It might be nice if = checked to see if the hash value was already calculated and stored for two collections, and if so, returned false quickly if their hashes are different, but IIRC Clojure does not have such code.
Well I definitely had the wrong idea how that worked. I thought hashes were used more for equality checking
I think we could get hash-boosted equality if we used immutable collections backed by Merkle trees rather than HAMTs. But I believe Merkle trees are significantly slower than HAMTs for almost all common use cases (except for equality comparison, which is O(1)). Also, the memory overhead of keeping a long hash (e.g. SHA-512) for each element is significant. (For people curious about Merkle trees, I suggest looking into IPFS.)