Fork me on GitHub
#clojure
<
2022-04-01
>
Joshua Suskalo03:04:39

Question: is it worth it to make a rope implementation that accepts non-string components? 🧵

Joshua Suskalo03:04:21

I've been working on a rope implementation for a while now, and just because it was easy in the initial implementation I was allowing the leaf nodes to have either a string or a vector as the collection storing the objects.

Joshua Suskalo03:04:53

The problem is this: I have one implementation that fits for both strings-as-ropes-of-characters, and ropes of general values. There's a number of conditions in the leaves based on the type stored there.

Joshua Suskalo03:04:49

This isn't necessarily an issue, but if I make them two separate types, even though it makes it slightly more performant for a lack of conditionals, it would cause significant code duplication.

Joshua Suskalo03:04:51

The one thing that would be nice about having them be separate is to allow them to have separate semantics when conj is used, that is for string-based ropes conjing on a string would append the string, while for value-based ropes conjing on a string would add a string as one element, rather than each of the characters individually.

Joshua Suskalo03:04:44

This use of conj with a string to append the whole string is practically necessary because adding characters one by one to a string causes way too much copying.

Joshua Suskalo03:04:48

I could perhaps solve this with a transient that causes the final node that gets appended to to have a StringBuilder as the leaf container.

Joshua Suskalo03:04:48

What I have been considering at this stage is just removing the vector leaf container altogether, and I guess the question comes down to: is having a rope data structure for non-character sequences useful enough to bother with the implementation complexity?

phronmophobic03:04:29

I might be interested in a generic rope data structure. Basically, I've wanted to have a decent, performant data structure for text that allows for characters of different fonts, styles, slants, underlines, font sizes, etc.

phronmophobic03:04:25

There are various rope implementations available, but all the ones I've seen are for just sequences of characters.

Joshua Suskalo03:04:11

yeah, that makes sense, so you'd basically want like a rope of maps with an actual string of text and other meta?

Joshua Suskalo03:04:51

potentially with ropes for the actual strings stored too?

phronmophobic03:04:56

The liquid clojure library has a https://github.com/mogenslund/liquid/blob/master/src/liq/buffer.cljc#L13 which allows for a map of data for each character, but I haven't really delved into the trade-offs.

Joshua Suskalo03:04:41

Yeah, so I have metadata on ropes, another way it could potentially be done is to just have a meta map with vectors of ranges to data

Joshua Suskalo03:04:59

that would make the actual string manipulation more efficient, at the cost of making the meta a little more complex to handle.

Joshua Suskalo03:04:09

But good to know there's interest in ropes for other types of sequences.

Joshua Suskalo03:04:29

I'll try to figure out this API question then.

phronmophobic03:04:31

To be honest, I'm not sure what I want. I just know that membrane lacks good representation for stylized text.

phronmophobic03:04:04

So I'm interested and would be happy to try things out. Not sure how helpful that is.

phronmophobic03:04:48

If you do solve the problem of how to represent stylized text in a performant way, I would be eternally grateful 🙏 😁

Joshua Suskalo03:04:44

So the one real like api kink that I have at the moment is this:

(conj (rope) "hello")
;; => #rope [ \h \e \l \l \o ]
(conj (rope) [1])
;; => #rope [ [1] ]

Joshua Suskalo03:04:26

which is distinct from

(concat (rope) (rope ["hello"]))
;; => #rope [ "hello" ]

Joshua Suskalo03:04:05

This has a slightly odd signature for conj where it takes two collections and merges them into one, rather than adding the second collection as an element of the first.

Joshua Suskalo03:04:21

Which isn't unheard of, maps do this.

Joshua Suskalo03:04:32

And when making ropes specifically for characters it's practically necessary if I don't implement transients, because

(into (rope) "hello")
would be horribly unperformant if it's appending characters one at a time to a string. super memory inefficient

phronmophobic03:04:21

> Which isn't unheard of, maps do this.

(conj {} [1 2 3 4])
Execution error (IllegalArgumentException)
Vector arg to map conj must be a pair

phronmophobic03:04:28

I don't think maps do that

Joshua Suskalo03:04:41

(conj {:a 1} {:b 2})
;; => {:a 1, :b 2}

Joshua Suskalo03:04:59

but since this makes the version with strings inconsistent with the one with vectors, it means you can't write code against this rope implementation that's fully generic over what types are contained.

Joshua Suskalo03:04:56

I guess I could work on the transient implementation and make it so that conjing a string on adds the full string as one element and enable you to use into to efficiently copy a string into a rope this way, or perhaps just encourage using concat with another call to rope

phronmophobic03:04:27

I'm actually pretty surprised to see that concat returns anything besides a seq

Joshua Suskalo03:04:37

this is a custom rope-only concat

Joshua Suskalo03:04:49

I just didn't think another name fit

phronmophobic03:04:11

Is the performance difference when adding 1 character at a time because you're not pre-allocating?

Joshua Suskalo03:04:47

I mean I can't preallocate with strings, but also it's all immutable

Joshua Suskalo03:04:33

the issue is this code is horribly inefficient:

(-> "hello"
    (str \w)
    (str \o)
    (str \r)
    (str \l)
    (str \d))

Joshua Suskalo03:04:39

it copies the string five times

phronmophobic03:04:00

right, but you could have a character buffer

phronmophobic03:04:09

and just wrap with pointer to buffer + length

Joshua Suskalo03:04:16

right but then that's mutable

Joshua Suskalo03:04:20

which I don't want

Joshua Suskalo03:04:29

I could do that with a transient rope.

Joshua Suskalo03:04:34

But I'm not on to transient implementations yet.

Joshua Suskalo03:04:20

The way to do this efficiently in the current system would be

(concat (rope "hello") (rope "world"))
which ends up having the same performance characteristics of
(str "hello" "world")

phronmophobic03:04:22

if someone gives you pointer to 10 length buffer with length 3, then the fact that you eventually add stuff to the buffer at positions 4-9 doesn't affect you

phronmophobic03:04:38

I thought that's how the some of the transient stuff worked under the hood generally

Joshua Suskalo03:04:58

well transients don't allow you to save multiple iterations of edits on the same structure

Joshua Suskalo03:04:35

I suppose that you're right about that for buffer appends, which helps in this case, but it does make splits a bit more complex.

phronmophobic03:04:06

yea, it prevents copies for some additions, and deletions from the end, but it is more complicated for splits

phronmophobic03:04:38

this is at the limits of my knowledge for these data structures

Joshua Suskalo03:04:21

right, and splits are one of the main things that ropes are supposed to make efficient and relatively easy.

Joshua Suskalo03:04:06

I am planning on making a form of transient structure that makes it efficient to append many smaller strings or collections onto a rope in sequence.

🆒 1
Joshua Suskalo03:04:36

it would end up being a StringBuffer and a TransientVector at the bottom

phronmophobic03:04:37

if there's a spot for per character info, that would be right up my alley

Joshua Suskalo03:04:15

Yeah, so in order to do that you would just make a non-string rope, gives you the same nice memory characteristics for having a sequence of them, and it's nicer than a vector for it because of the log time for subseq views, trimming out a subseq, splitting in two, etc. The elements would probably just be two-element vectors of character-and-meta.

Joshua Suskalo03:04:20

I suppose at this late stage of development I should probably think about testing the actual performance characteristics of this structure as opposed to a vector, lol

phronmophobic03:04:26

Would the meta info be metadata? The reason I ask is that, for my use case, the info would ideally impact equality (as opposed to metadata)

Joshua Suskalo03:04:41

It'd be better than a vector for character sequences, but for arbitrary values I should look at how it actually compares to just a vector.

phronmophobic03:04:58

Yea, I was just thinking that maybe I should just give the builtin data structures a go to see if that works.

Joshua Suskalo03:04:11

It would not need to be meta, and actually couldn't be since characters arent IMetas

👍 1
Joshua Suskalo03:04:44

well the builtin structures definitely shouldn't be able to compete for strings

Joshua Suskalo03:04:51

for arbitrary values maybe

phronmophobic04:04:13

yea, to be fair, I have been using liquid's buffers so far without issue, but I've only been using them very superficially for smallish buffer sizes.

Joshua Suskalo04:04:30

actually, there's also primitive vectors. hmm.

Joshua Suskalo04:04:39

I wonder how this will compare to (vector-of :char)

🤯 1
🧠 1
Joshua Suskalo04:04:28

Anyway, I'll have to do some testing to see how it compares.

phronmophobic04:04:58

very cool. out of curiosity, what are you interested in using them for?

Joshua Suskalo04:04:28

I made the first version of them while in a voice call with another clojurist to show them how you could make a data structure with deftype

Joshua Suskalo04:04:41

and it was fun enough that I've devoted a good bit of time to it now

👍 1
Joshua Suskalo04:04:59

I genuinely might start making a text editor that's a cross of liquid and nightlight just to have an excuse to use them though.

Joshua Suskalo04:04:14

If they're any better than vectors that is.

Joshua Suskalo04:04:18

if not then this was just a fun side project that resulted in learning a lot more about the collection model of clojure

phronmophobic04:04:11

I think there are https://www.youtube.com/watch?v=uqKta5i7A9c&amp;list=PLb_VRZPxjMADovzE7xYIzMr68BHXLVzH3in #visual-tools (including me) that would be interested in a clojure based editor.

Joshua Suskalo04:04:51

yeah, liquid is definitely already that, but the thing that it isn't is embedded in your project. And while nightlight is embedded in your project, it's a webapp, and I consider that to be a big detriment to its usability.

phronmophobic04:04:38

Liquid is embedded in my projects.

Joshua Suskalo04:04:13

wait really? I didn't realize you could embed it like that. Does it use the fact that it lives in the same jvm to do code completion etc? or does it use nrepl still?

phronmophobic04:04:23

I'm not sure. I don't currently have code completion for my uses, but I'm only using it very superficially.

phronmophobic04:04:38

The basic liquid stuff is vim-esque, but I've been trying to use it in an emacs-esque way.

phronmophobic04:04:18

Ideally, there would be a clojure based code editor that does all the stuff codemirror does. I don't think it's quite there yet.

phronmophobic04:04:35

I have gotten good use of just basic liquid usage though.

Joshua Suskalo04:04:23

What actual functionality does liquid provide as a library?

phronmophobic04:04:42

There's lots of stuff I'm not using, but the functionality I'm currently using is code highlighting, paren matching, text editing, and cursor navigation

Joshua Suskalo04:04:21

Okay, took a chance to do some benchmarks, and the rope really showed its value at concats, but I already knew it'd beat out vectors there since the best way to do vector concat is with into which is linear with the size of the second argument, versus the concat with ropes which is constant time. Now I'm testing splits which I expect to be closer to evenly matched.

🆒 1
Joshua Suskalo04:04:29

ah, that's good to know

Joshua Suskalo04:04:27

oooh, okay, so apparently splits are not comparable across ropes of arbitrary values and primitive vectors. Now to see if that tracks across string ropes, and how non-primitive vectors measure up.

Joshua Suskalo04:04:54

apparently the rope is slightly faster than a non-primitive vector

Joshua Suskalo04:04:05

now to see how the string-based rope compares

Richie01:04:08

I'm also interested in making a clojure coding environment that works better for me. I feel like I've tried them all.

Joshua Suskalo01:04:35

i might look at performance of those to compare, they're probably faster than what i've made. What I'm making is as much a learning tool for me as anything else, since I'd never dived this far into the collection model before.

Joshua Suskalo01:04:07

I'm making it as full an implementation as i can though

Joshua Suskalo01:04:59

today i got the charsequence impl in to allow clojure regex functions to work with it now.

leifericf11:04:48

As a dynamic language with a simple “code-as-data” syntax, reflection capabilities, and powerful macro/metaprogramming facilities, it seems like Clojure has what is needed for more sophisticated static code analysis and “semantic diffing.” For example, suppose I wanted to know how many times a particular function is used in a project. Instead of diffing an entire file against a previous version, it would be helpful to diff each function within the files based on their parameters, return values, and bodies. Imagine cloning all Clojure project repos from GitHub and analyzing the code to see which core functions are most frequently used, which libraries are most often used together in the same project (cool graph visualization!), commonly occurring code patterns/idioms in function bodies, semantically duplicate functions across projects, etc. It could make for a cool website to explore and learn about functions, libraries, “design patterns,” and idioms across all real Clojure projects on GitHub. It could be an excellent resource for newbies in particular! Is this feasible? I can’t think of any other programming language community that has this. Has somebody already tried to do something like this?

Alex Miller (Clojure team)11:04:27

In short, yes, there are many things in this area

👍 1
Alex Miller (Clojure team)12:04:01

Probably the closest match was the cross clj site which cross-indexed all Clojure projects on GitHub. The author decided to stop running the site but the code is available if someone wants to relaunch it

👀 1
💡 1
borkdude12:04:02

There's also something here: https://github.com/borkdude/api-diff In general such diffs can be easily made using the static analysis of clj-kondo

👀 1
💡 1
Alex Miller (Clojure team)12:04:40

I'd say grasp is similar to this for search

borkdude12:04:05

See here for the data specification of the clj-kondo output: https://github.com/clj-kondo/clj-kondo/blob/master/analysis/README.md

👀 1
Alex Miller (Clojure team)12:04:45

And then there are things like https://github.com/Datomic/codeq for semantic analysis of projects over time (all git history)

👀 1
Alex Miller (Clojure team)12:04:32

You can query that for things like “when was a function added”

😮 2
borkdude12:04:56

There is also https://github.com/Wilfred/difftastic and autochrome but those tools produce diffs on syntax, not semantics

👀 2
🚀 1
borkdude12:04:11

Still pretty cool to use when viewing git diffs

leifericf12:04:14

Very cool; thanks for sharing! Now I have some reading to do. As I become more proficient with Clojure, I might try to solve a problem in this space to leverage my data science experience to benefit the community by improving the “getting started” experience. I feel like it’s a bit too advanced for my current skill level—some ideas are sloshing around in my mind for now.

💪 1
flowthing12:04:58

I've been wanting to try out Codeq, but it sadly doesn't seem to be possible because Datomic Free is no longer available and as far as I understand, Datomic Pro doesn't support the free: protocol.

Alex Miller (Clojure team)13:04:56

You should be able to use Datomic Pro Starter and just change the protocol for connection afaik

Alex Miller (Clojure team)13:04:23

Might even work with devlocal, can't say I've tried that though

flowthing13:04:41

Nah, I get java.lang.IllegalArgumentException: :db.error/invalid-storage-protocol Unsupported storage protocol [protocol=free] in transactor properties config/samples/dev-transactor-template.properties. Codeq doesn't support the dev protocol. Might be possible to rewrite it to do so, though, haven't looked into it yet.

Alex Miller (Clojure team)13:04:57

there is actually a codeq2 that works with cloud from a couple years ago that no one ever got around to releasing. not sure anyone has the bandwidth to do so atm

flowthing13:04:43

Cool! Maybe one day. 🙂

phronmophobic18:04:13

#cljdoc already does some of this and aspires to do more.

👀 2
macrobartfast19:04:50

As a clojure learner cross clj was very very helpful to me… +1 on any relaunch or similar thing… it would not be possible for me to do it yet.

Max16:04:52

Is there a better way to express “is there anything in this collection that isn’t nil”? (some some? col) feels like it’ll draw funny looks

p-himik16:04:04

FWIW that's exactly how I'd do it.

1
Nundrum16:04:24

I find anything with 'some' in it is hard to grok

1
kenny16:04:50

Not builtin, but here's a helper macro for this. Won't work if col is dynamic, ofc.

(defmacro or-some
  "Evaluates exprs one at a time, from left to right. If a form
  returns a non-nil value, or-some returns that value and doesn't
  evaluate any of the other expressions, otherwise it returns the
  value of the last expression. (or-some) returns nil."
  ([] nil)
  ([x] x)
  ([x & next]
   `(let [or# ~x]
      (if (some? or#) or# (or-some ~@next)))))

Kelvin17:04:19

Real question about accessing record fields - which way does the community generally prefer?

(defrecord Foo [my-field]
  ...
  (bar [this]
    (let [x my-field] ...)))
(access the record field via the binding directly) or
(defrecord Foo [my-field]
  ...
  (bar [this]
    (let [x (:my-field this)] ...)))
(getting the value from the record like it's a normal map)

Kelvin17:04:15

I went with the former in our app recently since our team believes it's more idiomatic, but we could be wrong

genmeblog17:04:06

Directly imho, it should lead to the more performant code

Kelvin17:04:29

Do you (or anyone else) happen to have a source for this?

Kelvin17:04:04

In general one of the reasons for using records I hear alot is "records are more performant than regular maps" but I personally haven't seen a lot of benchmarks that actually support this.

Kelvin17:04:49

So while I can see that "records are more performant" can be true, I'm skeptical as to whether it's actually true

genmeblog17:04:15

I have docompiled the code, here it is:

genmeblog17:04:57

(wait a second)

genmeblog17:04:23

(defprotocol IProto
  [bar [_]])

(defrecord A [some]
  IProto
  (bar [this] (str some)))

;; final Object some = this.some;

(defrecord B [some]
  IProto
  (bar [this] (str (:some this))))

;; final ILookupThunk _thunk__0__ = B.__thunk__0__;
;; final B b = this;
;; Object o;
;; if (_thunk__0__ == (o = _thunk__0__.get(b))) {
;;   o = (B.__thunk__0__ = B.__site__0__.fault(b)).get(b);
;; }

genmeblog17:04:17

Former just accesses this field. Latter goes through get

Kelvin17:04:26

Ok makes sense!

Kelvin17:04:29

I'm still not sure if it'll be more performant in practice (the difference will likely be negligible compared to DB queries and network calls) but it's nice to see what happens under the hood

genmeblog17:04:13

The difference between direct field access vs map access is minimal afair (I did some tests in the past). However if you define function inside defrecord or deftype fields are already in the scope. I don't see why you shouldn't use them directly.

Alex Miller (Clojure team)17:04:29

the former is definitely the preferred use

dpsutton17:04:35

I can’t think of a reason to use the second version.

quoll17:04:24

4 days late, but I thought I'd add something here... I sort of do this in some places, where I want to share a function between records.

(defprotocol Thing
  (read [this arg] "read the thing")
  (write [this arg1 arg2] "write the thing"))

(defn read*
  [{:keys [a b c] :as this}]
  ;; read some data with a, b and c
  )

(defrecord RThing [a b c]
  (read [this arg] (read* this arg))
  (write [this arg1 arg2]
    (throw (UnsupportedOperationException. "read-only"))))

(defrecord RWThing [a b c]
  (read [this arg] (read* this arg))
  (write [this arg1 arg2]
    ;; update the thing
    ))

quoll17:04:35

It may be faster/better to pass the fields along to the common function, rather than the entire record, but I like this approach

apt22:04:25

Hi folks. Is there any way to preventing pretty printing to print commas between keys in maps? When I copy some map from the repl, I always remove the commas afterwards, so I’d prefer not getting them in the first place.

p-himik23:04:04

There's no easy way to do it, but you can create a new dispatch function that wraps clojure.pprint/simple-dispatch and handles maps differently, and then pass that function to clojure.pprint/set-pprint-dispatch. Alternatively, you can override the simple-dispatch multimethod on clojure.lang.IPersistentMap to do your own thing.

p-himik23:04:19

It's all in clojure/pprint/dispatch.clj.

apt00:04:40

Cool, thanks!

👍 1
misha23:04:46

(loop [[x & ys] (range 10),  !m (transient {})]
  (if x
    (recur ys (assoc! !m x x))
    (persistent! !m)))
=> {0 0, 7 7, 1 1, 4 4, 6 6, 3 3, 2 2, 9 9, 5 5, 8 8}  ;; clojure.lang.PersistentHashMap

(let [!m (transient {})]
  (loop [[x & ys] (range 10)]
    (when x
      (assoc! !m x x)
      (recur ys)))
  (persistent! !m))
=> {0 0, 1 1, 2 2, 3 3, 4 4, 5 5, 6 6, 7 7}   ;;<-- this is funny, does not grow past clojure.lang.PersistentArrayMap size

(->> (range 10)
  (reduce #(assoc! %1 %2 %2)
    (transient {}))
  (persistent!))
=> {0 0, 7 7, 1 1, 4 4, 6 6, 3 3, 2 2, 9 9, 5 5, 8 8} ;; clojure.lang.PersistentHashMap

p-himik23:04:30

You must use the result of assoc!.

misha23:04:39

from cljs:
... 
          (if (<= (+ len 2) (* 2 (.-HASHMAP-THRESHOLD PersistentArrayMap)))
            (do (set! len (+ len 2))
                (.push arr key)
                (.push arr val)
                tcoll)
            (assoc! (array->transient-hash-map len arr) key val))  ;; this is where it returns different obj
...

nivekuil23:04:42

is there a randomness library for cljc, that takes seeds and generates consistent outputs on both platforms?

p-himik00:04:30

Probably https://github.com/clojure/test.check I did a very superficial check for longs and the results are the same for CLJ and CLJS.

p-himik00:04:10

But its usage might be unusual because the generator itself is immutable. This is the line of most interest: https://github.com/clojure/test.check/blob/master/src/main/clojure/clojure/test/check.cljc#L210

Alex Miller (Clojure team)00:04:59

Gary did a talk about this area if you want to track it down

nivekuil01:04:21

never would have thought of this! glad I asked

nivekuil02:04:12

for future reference, it's actually not quite consistent out of the box because rand-long generates a goog.math.Long which has a typical javascript moment when you try to do stuff to it, so (mod (random/rand-long (random/make-random 1)) 2) gives different answers on clj/cljs

p-himik10:04:56

In that regard, there can never be 100% consistency because the platforms are way too different. There are no long numbers in JS, so they have to be emulated somehow - that's why there's goog.math.Long.

p-himik10:04:52

mod does not operate on goog.math.Long, so the object gets converted to a regular JS number (I think). In order to avoid that, you'd have to recreate all arithmetic in CLJS in such a way so it supports goog.math.Long.