This page is not created by, affiliated with, or supported by Slack Technologies, Inc.
2021-12-08
I saw this code in the reagent docs:
(defn timer-component []
  (let [seconds-elapsed (reagent/atom 0)]
    (fn []
      (js/setTimeout #(swap! seconds-elapsed inc) 1000)
      [:div "Seconds Elapsed: " @seconds-elapsed])))
can we also write this as?
(defn timer-component []
  (let [seconds-elapsed (reagent/atom 0)]
    (do
      (js/setTimeout #(swap! seconds-elapsed inc) 1000)
      [:div "Seconds Elapsed: " @seconds-elapsed])))
the first version returns a function, but the second returns a vector. This makes a difference in how reagent handles it. There's an explanation that follows:
> The previous example also uses another feature of Reagent: a component function can return another function, that is used to do the actual rendering. This function is called with the same arguments as the first one.
> This allows you to perform some setup of newly created components without resorting to React’s lifecycle events.
Thanks for the response @U7RJTCH6J, but how is it useful if it returns a function?
reagent will check to see if you returned a function or not. if you do return a function, reagent will assume that your component just does the setup when called and that it returns the actual render function
yea, it's a reagent specific thing
I'm not sure it's completely necessary, but it would definitely help.
Is this the kind of thing you're thinking of? http://www.clara-rules.org/docs/approach/
Zach Oakes' work is probably a great intro https://www.youtube.com/watch?v=XONRaJJAhpA&t=1238s
Hello all, I want to generate XML dynamically from data - like JSON, but producing XML. Does anyone know how to do it? Thanks.
Eg:
Data:
{:user [{:name "Romit"}
        {:name "Demo"}]}
Should be converted to
<root>
  <user>
    <name>Romit</name>
  </user>
  <user>
    <name>Demo</name>
  </user>
</root>
Likewise. Thanks.
Maybe a library like https://github.com/clojure/data.xml will help. Could use hiccup format with sexp-as-element
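A sketch of that suggestion, assuming the clojure.data.xml dependency is available (sexp-as-element and emit-str are its API; building the hiccup-style tree from the map is just a plain mapping step):

```clojure
(require '[clojure.data.xml :as xml])

(def data {:user [{:name "Romit"} {:name "Demo"}]})

;; turn each {:name ...} map into a hiccup-style [:user [:name ...]] form
(def tree
  (into [:root]
        (map (fn [{:keys [name]}] [:user [:name name]]))
        (:user data)))
;; tree is [:root [:user [:name "Romit"]] [:user [:name "Demo"]]]

;; sexp-as-element converts the hiccup form to an XML element, and
;; emit-str renders it as a string (with an XML declaration in front)
(xml/emit-str (xml/sexp-as-element tree))
```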
(into {} your-map)
But in most cases, there’s no need to do this, as many clojure functions already work on Map
I have a very large seq like this:
((([4 4] [4 4] [4 4]))
(([4 4] [4 4] [4 4]))
(([0 2])
([0 1])
([0 2])
([0 1] [0 1] [0 1] [0 1] [0 1] [0 1] [0 1] [0 1])
([0 1])))
I need to compute two numbers:
a) the sum of all first integers in each entry
b) the sum of all second integers in each entry
Is there an efficient approach for doing this? I think at the moment it's overloading my memory and slowing down as it gets further along. I guess I need to partition-all it to operate on smaller chunks and then reduce the totals?
(filter vector? (tree-seq list? identity data))
will give you a seq of tuples; then you can use reduce, combine it with transducers, etc. to get the sums you want
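A small sketch of that approach on a trimmed-down version of the data (the reducing fn assumes every leaf is a two-element vector):

```clojure
(def data '((([4 4] [4 4] [4 4]))
            (([0 2]) ([0 1]))))

;; tree-seq walks every node; filter keeps only the [a b] leaf tuples
(def tuples (filter vector? (tree-seq list? identity data)))
;; => ([4 4] [4 4] [4 4] [0 2] [0 1])

;; sum first and second positions across all tuples
(reduce (fn [[a b] [x y]] [(+ a x) (+ b y)]) [0 0] tuples)
;; => [12 15]
```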
Like here:
(tree-seq list? identity '((1 2 (3)) (4) 5))
I'm expecting 5 to be excluded from the result because list? tests false for it. But it returns:
(((1 2 (3)) (4) 5) (1 2 (3)) 1 2 (3) 3 (4) 4 5)
(tree-seq branch? children root)
> Will only be called on nodes for which branch? returns true.
So shouldn't it NOT call identity (`children`) on 5?
@UPWHQK562 it didn't, but it did call it on the list that included 5; and on that list identity returns all the elements of the list - including 5
user=> (tree-seq list? #(do (prn %) %) '(((1 2 (3)) (4) 5) (1 2 (3)) 1 2 (3) 3 (4) 4 5))
((((1 2 (3)) (4) 5) (1 2 (3)) 1 2 (3) 3 (4) 4 5)
((1 2 (3)) (4) 5)
(((1 2 (3)) (4) 5) (1 2 (3)) 1 2 (3) 3 (4) 4 5) (1 2 (3))
((1 2 (3)) (4) 5) (1 2 (3)) 1 (3)
2 (3) (4)
3 (4) 4 (1 2 (3))
5 (1 2 (3)) 1 (3)
2 (3) 3 1 (3)
2 (3) 3 (4)
3 (4) 4 4 5)
the trick is that all the other elements in the nested lists will still be returned by tree-seq - but only those that are branch? will also be expanded via children
> I'm expecting 5 to be excluded from the result because `list?` tests false for it. But it returns:
Don't think of tree-seq as filter or remove -> instead, it's like seq, but instead of just returning all the elements in the list, it will also recursively expand (kind of like a recursive mapcat) anything that looks like a nested branch. So, tree-seq will return the same number of items as seq (or perhaps more); and then you need to filter as usual in a second step.
(transduce (filter vector?)
(fn
([acc] acc)
([[l-acc r-acc] [l r]]
[(+ l-acc l) (+ r-acc r)]))
[0 0]
(tree-seq (complement vector?) identity input))
@UPWHQK562 this is how you would use transduce to filter out just the vectors (if that is indeed how leaves are defined), and how you can accumulate both sums in parallel
Thanks @U017QJZ9M7W and @U05476190. I'm going to have to take some time to study this stuff. Still, no matter what I try, I can't seem to get this to work efficiently. It starts to get quite slow due to low memory after a while. Here's the snippet and the results:
(doseq [part (->> (extract-data)
(map transform1)
(map (fn [{:keys [x data]}]
; returning this instead produces a result very quickly
; so I'm convinced I've isolated the latency to this code
; {:x x :y 0 :z 0}
(let [[y z] (transduce (filter vector?)
(fn
([acc] acc)
([[l-acc r-acc] [l r]]
[(+ l-acc l) (+ r-acc r)]))
[0 0]
(tree-seq (complement vector?) identity data))]
{:x x :y y :z z})))
; tried all kinds of partition sizes, not much difference
(partition-all 50))]
(prn (count part))))
;; returning static map
13:56:38.459Z - 1000
13:56:38.637Z - 1000
13:56:38.826Z - 1000
13:56:38.993Z - 1000
13:56:39.171Z - 1000
13:56:39.354Z - 1000
13:56:39.508Z - 1000
13:56:39.646Z - 1000
13:56:39.789Z - 1000
13:56:39.949Z - 1000
13:56:40.125Z - 1000
13:56:40.242Z - 1000
13:56:40.351Z - 979
"Elapsed time: 2081.846321 msecs"
;; transducing totals
13:57:27.456Z - 1000
13:57:30.291Z - 1000
13:57:33.112Z - 1000
13:57:35.567Z - 1000
13:57:39.327Z - 1000
;; major slowdown slow here
13:58:06.428Z - 1000
13:58:30.104Z - 1000
13:58:33.682Z - 1000
13:58:43.273Z - 1000
13:58:49.193Z - 1000
13:58:50.857Z - 1000
13:58:51.640Z - 1000
13:58:52.349Z - 979
"Elapsed time: 86573.092394 msecs"
Any suggestions for how I might fix this? I was under the impression that by using lazy sequences, partitioning them into small parts, and then realizing them with doseq, I could make this efficient even with a small amount of available memory, but I can't seem to get it to work well.
@UPWHQK562 can you put up a gist with some slow example data?
@U017QJZ9M7W I'm trying, but when I dump the data to an edn file and run my minimal example code against it, it's fast lol. I guess I'm missing something here. Something about extract-data or transform1 must be adding load and leading to latency.
Going to try profiling it with https://github.com/ptaoussanis/tufte and seeing if that gives some insights.
I'm not having much luck figuring this out. When I export the data to a file and process its lines, it processes quickly. When I process the data in place (without exporting), the transduce-totals function @U017QJZ9M7W provided above eats up the majority of the processing time and it's quite slow overall. Here's some profile data with each snippet:
(profile {}
(doseq [part (->> (extract-data)
(map #(p :details (transform1 %)))
(map #(p :totals (transduce-totals %))))]
(log/info)
(prn (count part))))
; pId nCalls Min 50% ≤ 90% ≤ 95% ≤ 99% ≤ Max Mean MAD Clock Total
;
; :totals 10,265 2.93μs 462.61μs 3.11ms 6.19ms 45.93ms 6.60s 5.95ms ±163% 1.02m 91%
; :details 10,265 58.57μs 208.15μs 390.52μs 551.75μs 2.73ms 64.50ms 338.06μs ±69% 3.47s 5%
;
; Accounted 1.08m 97%
; Clock 1.11m 100%
(defn slow-sample []
  (with-open [rdr (io/reader exported-edn)]
    (doseq [part (->> (line-seq rdr)
                      (map read-string)
                      (map transduce-totals))]
      (prn (count part)))))
(time (slow-sample))
; => "Elapsed time: 1340.89125 msecs"
It is hard to help without seeing those other functions. Can you share a small project that can reproduce this?
Also, is there any structure to this input beyond “nested sequences that eventually bottom out in vectors”?
These are operations on a git repo to extract its log data using jgit. The input at the transduce-totals step of the ->> is just a seq of maps. Does what happens in the previous steps matter in such a scenario? I thought the steps would be isolated from each other.
They are all maps exactly like this:
{:sha "cd946b229bd2316cfe8c336badb0392b38c81015", :changes (([4 4] [4 4] [4 4]))}
{:sha "c4cd9d58808ce00916a495bf03b5706c07b8a148", :changes (([0 2]) ([0 1]) ([0 2]) ([0 1] [0 1] [0 1] [0 1] [0 1] [0 1] [0 1] [0 1]) ([0 1]) ([0 1] [0 1] [0 1] [0 1] [0 1]) ([0 1]) ([0 4] [0 3] [0 2] [0 2] [1 1] [0 6] [0 2]) ([0 2] [0 13]) ([0 1]) ([19 0] [1 2] [1 1] [1 1] [21 13]) ([0 7]) ([0 9]) ([0 9]) ([0 1] [0 1] [0 27]))}
{:sha "3742992e0204ed8a1b2b559cf5f34afb1805b8e3", :changes (([0 1] [1 1] [0 2] [0 1] [0 1] [1 1] [1 2] [1 3] [0 1] [0 1]) ([1 1] [1 1] [1 1] [1 1] [1 1] [1 1]) ([2 3]))}
{:sha "bf9445c365f663d484b4fce480cd77456e56d0b1", :changes (([2 1] [2 1]) ([0 3]))}
{:sha "ef18886ade283a0c51ca5568cc06a0ae78574609", :changes (([1 1]) ([1 1] [1 1] [1 1] [1 1]) ([1 1]) ([1 1]) ([0 4]) ([5 5] [3 3]) ([1 1] [1 1] [1 1] [1 1] [1 1]) ([1 1] [1 3] [2 1] [0 10] [13 0] [1 1]) ([0 170]) ([0 46]) ([0 246]) ([1 1] [1 1] [6 2] [1 1] [1 1] [1 1] [1 1] [1 1] [1 1] [1 1] [1 1] [1 1] [1 1]) ([1 1] [1 1]) ([0 1] [0 1] [2 39] [1 1] [8 0]) ([1 0] [1 0] [1 0] [1 0] [1 0] [16 0] [90 0] [70 0]))}
I can work on a minimal example project. At this point it's more about understanding why I'm getting this behaviour and how to avoid it rather than fixing this exact implementation. I can think of other ways to get the data - there are workarounds.
Something is hanging on to the head of the sequence as you materialize it from jgit, in way that is not happening when you pull from the file
@UPWHQK562 okay, here we go. and what is the final result you want from that?
totals for each map, along with one of the keys extracted?
if you want the global totals, try this:
(defn global-totals
"the first transducer pulls out the `:changes` entry for each map and
concatenates them all together. The second one, `cat`, concatenates all the
subsequences together. This will feed only the vectors into the reducing
function."
[data]
(let [xform (comp (mapcat :changes) cat)
f (completing
(fn [acc item]
(mapv + acc item)))]
(transduce xform f [0 0] data)))
sicmutils.env> (time (global-totals (take 100000 (cycle inputs))))
"Elapsed time: 2386.474 msecs"
[6440000 14540000]
this can do 500,000 maps in 2.3 seconds
The changes key is a seq of tuples. I need the sum of the first positions within each tuple (per map) and the sum of the second positions, likewise per map. The sha key just gets extracted and is used to identify which commit each sum pair belongs to.
So result should be a set of maps, each with a sha and changes reduced into deletions/insertions. The computation was producing the correct value already, it's just too slow.
awesome, let me post something to try that does it per map
that should be absolutely no problem so the slowdown has gotta be somebody holding onto the full sequence of maps
that function above is a good one to stare at if you have not used the transducer idea yet
(while elevator music plays and I type)
(defn global-totals [data]
(transduce (comp (mapcat :changes) cat)
(completing
(partial mapv +))
[0 0]
data))
alternate way of writing it, with no let to give things names, and a use of partial for fun
Thanks very much for helping here. I might be a little slow to respond, just afk right now... But I'll definitely dig right into this.
(defn change-sum
"Collapses a sequence of changes into a pair of sums; the first entry is the sum
of all first entries in the leaves of each changeset, the second is the sum of
all second entries."
[xs]
(let [f (completing (partial mapv +))]
(transduce cat f [0 0] xs)))
(defn sum-changes [m]
(update m :changes change-sum))
sicmutils.env> (map sum-changes inputs)
({:sha "cd946b229bd2316cfe8c336badb0392b38c81015", :changes [12 12]} {:sha "c4cd9d58808ce00916a495bf03b5706c07b8a148", :changes [44 127]} {:sha "3742992e0204ed8a1b2b559cf5f34afb1805b8e3", :changes [12 23]} {:sha "bf9445c365f663d484b4fce480cd77456e56d0b1", :changes [4 5]} {:sha "ef18886ade283a0c51ca5568cc06a0ae78574609", :changes [250 560]})
@UPWHQK562 change-sum does what you want for the entry under :changes, and then sum-changes uses that to make a function that processes each map individually
so now if you do (map sum-changes inputs), you will get a lazy sequence of transformed maps, with that entry updated
slightly slower, roughly 2.3 seconds to do 100k maps
sicmutils.env> (time (nth (map sum-changes (cycle inputs)) 100000))
"Elapsed time: 2319.499959 msecs"
{:sha "cd946b229bd2316cfe8c336badb0392b38c81015", :changes [12 12]}
but even at 500k entries if I hold on to the head by binding it like this:
(def input-hold (cycle inputs))
Do you think using this code instead of the previous function will resolve the issue of something holding the whole seq in memory?
well, I am less convinced now that my guess was right, since these maps are not big…
but yeah I don’t reach for tree-seq much, so it could be that collapsing all of the maps into one and then using tree-seq materializes a huge amount of stuff?
but I would be very surprised if this approach is slow for you (though I’m prepared to be surprised 🙂)
I'll try and post results. Probably tomorrow. It could be RevCommit objects being held in the JVM.
@U017QJZ9M7W it turns out that just one or two of the changesets in the repo I'm exploring contain a large number of changes - in the hundreds of thousands. I tried it using your code above and it still chokes up and performs pretty slowly, which makes sense, because change-sum will hold all that data when transducing it.
pId nCalls Min 50% ≤ 90% ≤ 95% ≤ 99% ≤ Max Mean MAD Clock Total
:sum-changes 10,265 1.87μs 366.70μs 2.29ms 4.60ms 36.34ms 5.78s 5.26ms ±167% 54.03s 93%
:details 10,265 82.01μs 166.60μs 324.75μs 463.01μs 982.93μs 83.94ms 239.72μs ±59% 2.46s 4%
Accounted 56.49s 98%
Clock 57.86s 100%
It is an edge case, but I think I will come across it once in a while.
@UPWHQK562 it should not hold anything in memory while transducing - a transducer will happily walk and realize a lazy sequence, transducing as it goes. Probably the code producing the changesets is realizing it all in memory, vs realizing it in a lazy way (or with a java iterator, for example)
for example:
(def example-changeset
'(([0 2])
([0 1])
([0 2])
([0 1] [0 1] [0 1] [0 1] [0 1] [0 1] [0 1] [0 1])
([0 1])
([0 1] [0 1] [0 1] [0 1] [0 1])
([0 1])
([0 4] [0 3] [0 2] [0 2] [1 1] [0 6] [0 2])
([0 2] [0 13])
([0 1])
([19 0] [1 2] [1 1] [1 1] [21 13])
([0 7])
([0 9])
([0 9])
([0 1] [0 1] [0 27])))
But then wouldn't that show as latency in those steps of the ->> instead of in the sum-changes step? I'm confused why the time is being consumed in this particular portion of the ->> if it's really happening in another step.
15 items; you can use cycle to get an infinite lazy stream of those elements, repeating, so we can check the function’s performance on big stuff
2.5 seconds for 1M items:
sicmutils.env> (time (change-sum
(take 1000000
(cycle example-changeset))))
"Elapsed time: 2578.9755 msecs"
[2933305 8466638]
@UPWHQK562 I think the key to debugging is to try and isolate a single record that is going very slow
what does its :changes field look like?
can you gist?
I agree that 5.78s is very strange
unless it is 2M items!
Yep, gisting it. 5s is strange, but the total time there is about a minute usually!
it is faster btw if we skip the mapv thing and just directly add the items
(defn change-sum
"Collapses a sequence of changes into a pair of sums; the first entry is the sum
of all first entries in the leaves of each changeset, the second is the sum of
all second entries."
[xs]
(letfn [(f
([] [0 0])
([acc] acc)
([[l-acc r-acc] [l r]]
[(+ l-acc l) (+ r-acc r)]))]
(transduce cat f xs)))
a little more than 2x faster
sicmutils.env> (time (change-sum
(take 1000000
(cycle example-changeset))))
"Elapsed time: 1327.460958 msecs"
[2933305 8466638]
I also took away “completing”, since all it does is add that single-arity version that just returns the result; and I added a 0-arity that provides the starting value for the transduce
haha that record is a monster @UPWHQK562, my browser tab is choking!
on my machine (and you had found something like this before) summing the values in that record takes 20ms
when you isolate it do you see that too?
lol I know 🙂 Transduce is something I've avoided up until now because it seemed like complexity that I didn't need as a beginner, but now that I'm running into performance issues of this kind, I guess it's time to watch Rich's talk on transducers and do some studying. The concept is not clear.
here is how to think about it -
in this case these are identical:
(defn change-sum
"Collapses a sequence of changes into a pair of sums; the first entry is the sum
of all first entries in the leaves of each changeset, the second is the sum of
all second entries."
[xs]
(letfn [(f
([] [0 0])
([acc] acc)
([[l-acc r-acc] [l r]]
[(+ l-acc l) (+ r-acc r)]))]
(reduce f (mapcat identity xs))))
(defn change-sum
"Collapses a sequence of changes into a pair of sums; the first entry is the sum
of all first entries in the leaves of each changeset, the second is the sum of
all second entries."
[xs]
(letfn [(f
([] [0 0])
([acc] acc)
([[l-acc r-acc] [l r]]
[(+ l-acc l) (+ r-acc r)]))]
(transduce cat f xs)))
don’t worry about how it works, just think about it as a combo of some mapping / filtering / mapcatting transformation step and a reduce at the same time. If you do it the first way, (mapcat identity xs) is going to make a new sequence; but then that sequence is immediately eaten up by the reduce and collapsed into the final counts.
so transduce is reduce with an extra slot for a “transform”
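To make that "extra slot" concrete, here is the same flatten-and-sum written both ways (core functions only):

```clojure
;; reduce: first build an intermediate flattened sequence, then fold it
(reduce + 0 (mapcat identity '((1 2) (3 4))))
;; => 10

;; transduce: same fold, but the flattening (the cat transducer) happens
;; inside each reduction step - no intermediate sequence is built
(transduce cat + 0 '((1 2) (3 4)))
;; => 10
```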
I gotta run out for a short while unfortunately. Duty calls 🙂 will be back shortly to dig back into this.
@UPWHQK562 in this case, it is cat because we have a sequence of sequences of vectors, and we want to concatenate them all
@UPWHQK562 I think your issue is somewhere in how this data is getting produced; maybe it is streaming from jgit, and there is some rate limiting thing going on? so that you block waiting for that :changes entry to appear
you could test this by timing (def all-data (doall (get-the-data))), where doall will force the whole sequence. then separately time the reduction
(time
(porc/with-repo repo-path
(let [df (#'querying/diff-formatter-for-changes repo)
old-tree-iter (EmptyTreeIterator.)
reader (.newObjectReader (.getRepository repo))]
(time (def all-data (doall (->> (extract-repo-commits repo)
(map #(detailed-changed-files repo % df old-tree-iter reader))))))
(time (doall (map transduce-totals all-data))))))
"Elapsed time: 1890.655849 msecs"
"Elapsed time: 47884.185641 msecs"
"Elapsed time: 49778.123504 msecs"
@U017QJZ9M7W still looks like the slow part is the reduction... for some reason I totally don't understand.
With the newest change-sum and isolating it to the single record:
(time
(porc/with-repo repo-path
(let [df (#'querying/diff-formatter-for-changes repo)
old-tree-iter (EmptyTreeIterator.)
reader (.newObjectReader (.getRepository repo))]
(->> (porc/git-log repo :since "439fd78cd426b7ba2b1a1cba0c712b018a5c27c3^"
:until "439fd78cd426b7ba2b1a1cba0c712b018a5c27c3"
:rev-filter (. RevFilter NO_MERGES))
(map #(detailed-changed-files repo % df old-tree-iter reader))
(map sum-changes-2)
doall))))
({:sha "439fd78cd426b7ba2b1a1cba0c712b018a5c27c3", :changes [65755 188558]})
"Elapsed time: 8077.543741 msecs"
({:sha "439fd78cd426b7ba2b1a1cba0c712b018a5c27c3", :changes [65755 188558]})
"Elapsed time: 5888.550605 msecs"
({:sha "439fd78cd426b7ba2b1a1cba0c712b018a5c27c3", :changes [65755 188558]})
"Elapsed time: 6133.369632 msecs"
({:sha "439fd78cd426b7ba2b1a1cba0c712b018a5c27c3", :changes [65755 188558]})
"Elapsed time: 6327.125068 msecs"
({:sha "439fd78cd426b7ba2b1a1cba0c712b018a5c27c3", :changes [65755 188558]})
"Elapsed time: 6329.911996 msecs"
Sorry, did this isolate the fetch, then once the fetch is done with doall, THEN do the sum?
https://clojurians.slack.com/archives/C053AK3F9/p1639153516124200?thread_ts=1638975773.014100&cid=C053AK3F9 that one did it that way, but using transduce-totals. Let me fix up the most recent snippet that isolates it.
(time
(porc/with-repo repo-path
(let [df (#'querying/diff-formatter-for-changes repo)
old-tree-iter (EmptyTreeIterator.)
reader (.newObjectReader (.getRepository repo))]
(time (def all-data (doall (->> (porc/git-log repo :since "439fd78cd426b7ba2b1a1cba0c712b018a5c27c3^"
:until "439fd78cd426b7ba2b1a1cba0c712b018a5c27c3"
:rev-filter (. RevFilter NO_MERGES))
(map #(detailed-changed-files repo % df old-tree-iter reader))))))
(time (doall (map sum-changes-2 all-data))))))
"Elapsed time: 7.821379 msecs"
"Elapsed time: 4990.340047 msecs"
"Elapsed time: 5000.335191 msecs"
Looks like it's quick as can be to sum-changes when it's isolated to a single record.
Do you think detailed-changed-files is holding onto all the data, filling memory, and then making sum-changes-2 look slow due to lack of available memory when they aren't done as separate steps?
yeah I bet detailed-changed-files is doing something lazy, and when you actually go to access each record it forces it to go hit the network or something
@UPWHQK562 it is very very suspicious to me that querying all of those records etc, populating all the lists, vectors maps etc would take 7ms
the doall was an attempt to force side effects, but in this case I think you need to do the equivalent of mapping doall across the sequence and forcing that
does that make sense?
It might take 7ms because in that last snippet I'm limiting the output to a single revision by using the same since/until.
detailed-changed-files has a pair of nested for-loops to go through each entry in the diff list. No network calls or anything like that, all local.
How about this- can you do the aggregation call twice and time each one?
See if it is faster the second time
@UPWHQK562 I suspect all-data is keeping some nested lazy data. Normally I'd say force a realization, e.g. spit it all to a file, then slurp it back - but if this is not feasible due to Java objects, then evaling over the same collection twice, as @U017QJZ9M7W suggests, should do the trick.
@U05476190 yeah, I've actually gone through that above. The spit/slurp thing makes it so reading the data is very quick. It's an available workaround, but doesn't really advance my understanding of how to do this properly/better next time around 🙂
The way forward usually is to try and isolate the piece that you force, and figure out which tiniest function call is slow
> How about this - can you do the aggregation call twice and time each one?
So get all the data and then pass it to the transducer twice?
(time (doall (map sum-changes-2 all-data)))
Add this line a second time after the first
I would profile the function - so we know where the CPU (and/or I/O) is actually hanging. https://github.com/clojure-goes-fast/clj-async-profiler is your friend
If you need some help getting the profile working this should help you get started: http://clojure-goes-fast.com/blog/profiling-tool-async-profiler/ and/or just ask :] I think a flamechart would clear up a lot of questions.
I'd check the :cpu flamegraph first, but if it's a question of laziness - the :allocation flamegraph may also be insightful.
"get data"
"Elapsed time: 14.498089 msecs"
"first sum-changes"
"Elapsed time: 14630.261689 msecs"
"second sum-change"
"Elapsed time: 8.897242 msecs"
"Elapsed time: 14658.340088 msecs"
Woohoo, so some lazy thing is getting forced by the traversal to get the sums
And the second time everything is already in memory
@U05476190 I'll try those. I've been using tufte for profiling but it doesn't offer memory profiling, just timing. I'll definitely try your link - good tool to get familiar with.
You're getting close!
I've decided if I don't figure it out today - with your generous help or without 🙂 - I'm going to "just get it done" in the hackiest way imaginable until I get better with Clojure and revisit it.
tufte timing is OK, but what you really want - irrespective of CPU or allocation - is a sampling profiler that will give you more granular details about which part of the code is slow.
I'm not even sure if it's "slow" because of holding on to some lazy head, or if this is not just a case of some I/O issues.
^ eg. if it's allocation issues, you'd see lots of objects being created and then GC'ed; if it's I/O... then you have a different problem
@U017QJZ9M7W if you have some ideas for something I should try while working on that, I can probably multitask.
if it were me at the REPL, these would be my next few tasks other than profiling:
• go look at a single record and see the type - is it a clojure lazy seq, or some java iterable thing that is getting forced into a seq later by the transduce call?
• either way, read down into the code you’re calling to figure out how :changes is getting populated, and why it is so slow. is it disk access? is it maybe doing a separate jgit call for every single change entry, or something silly like that?
for the next power debugging session, it will be way easier if you can make a reproducible example that we could point at a local git repo or something
what is porc? what library?
I'm guessing it comes down to this silly thing I wrote:
(defn- entries->change-map
[entries df]
(let [changes (for [entry entries
:let [fh (.toFileHeader df entry)
el (.toEditList fh)]]
(for [edit el
:let [deletions (.getLengthA edit)
insertions (.getLengthB edit)]]
[deletions insertions]))]
changes))
it is going to be something like https://stackoverflow.com/questions/65989092/obtain-a-git-blob-size-efficiently-in-java where something like ObjectReader is loading way more than you need
When I first embarked on this effort, I thought "I should just spawn processes, call git directly, and parse its output so I don't have to deal with the git->java->clj abstraction layers." Then I second-guessed myself and thought "it would be best to get familiar with the tooling in the ecosystem instead." Kinda thinking I made the wrong call lol
once you get to clojure life is great!! but yeah I feel you for sure
@UPWHQK562 perhaps a little off-topic, but if you plan on merging those deletions and insertions later anyways, perhaps you can do it eagerly instead. One idea:
(defn- entries->change-map
  [entries df]
  (letfn [(calc-edit [m edit]
            (-> m
                (update :deletions + (.getLengthA edit))
                (update :insertions + (.getLengthB edit))))
          (calc-entries [m entry]
            (let [header (.toFileHeader df entry)
                  edits (.toEditList header)]
              (reduce calc-edit m edits)))]
    (reduce calc-entries
            {:deletions 0
             :insertions 0}
            entries)))
I changed the tuple [deletions insertions] to a map, but it's not strictly necessary. This is irrespective of what @U017QJZ9M7W mentioned about using the most appropriate jgit API (which I'm unfamiliar with).
It's not totally clear to me when I should switch between lazy/eager eval. I thought that it was a good idea to stay lazy as long as possible until I actually needed the realization.
rule of thumb (with lots of caveats): lazy is good if you don't need all the results now (or perhaps ever) and if you can "summarize" or forget things you've seen; lazy is terrible if you keep holding on to lots of things you've seen and your structure just grows more and more as you progress further in the calculation.
if you know you need to go through everything to get a result, and the result is some kind aggregation/summary - usually eager is going to perform better (and more predictably)
transducers (as mentioned earlier) try to get the best of both worlds: you write things as composable independent pieces (which is one of the reasons why lazy may have previously been used) and you still get the performance of running it eager (and with extra optimizations that the compiler can make, because it has better control of the runtime assumptions).
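A tiny sketch of that rule of thumb (hypothetical helper names; the only difference between the two is whether anything retains the head of the lazy seq):

```clojure
;; good lazy use: summarize as you go - nothing retains the head,
;; so already-consumed elements can be garbage collected
(defn streaming-sum [n]
  (reduce + 0 (map inc (range n))))

;; problematic lazy use: binding the seq and traversing it twice keeps
;; every realized element alive for the whole computation
(defn head-holding-sum [n]
  (let [xs (map inc (range n))]      ; xs retains the head
    [(reduce + 0 xs) (count xs)]))   ; second pass needs it all again
```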
I think my case fits in the good category then. I don't need the result at all except to summarize.
Lazy is great when you can stream through and aggregate as you go;
yeah, but notice that your entries->change-map is actually keeping a lot of state in memory and not aggregating aggressively enough
Like, in theory this is a great lazy computation (the sequence with the maps) because each item can get processed individually
profiling: ran into https://github.com/clojure-goes-fast/clj-async-profiler/issues/8 and I don't see an immediate fix for that
> ran into https://github.com/clojure-goes-fast/clj-async-profiler/issues/8 and I don't see an immediate fix for that Hmm, that's weird; if you want to debug that further: (1) which version of the JDK are you running? (2) lein or deps-tools? (3) have you added the correct JVM options? https://github.com/clojure-goes-fast/clj-async-profiler#jvm-options
Any insight?
I'm trying to understand how to interpret the CPU graph. Looks like I need to install something else for the allocation profiling to work.
@UPWHQK562 can you upload the cpu svg somewhere?
Did that.. it gave me:
Execution error (ExceptionInfo) at clj-async-profiler.core/start (core.clj:277).
No AllocTracer symbols found. Are JDK debug symbols installed?
I guess you can run
(prof/list-event-types)
to see which ones are supported on your JDK
> I'm trying to understand how to interpret the CPU graph
You can click on any event to "zoom in". If you hover over an event, it shows at the bottom what percentage of the total is spent there
^ notice the amount of time (percentage of row width vs total width) spent by DiffFormatter.open and RawText.load
and you can further see that a large portion of the DiffFormatter.open time is spent in PackFile.decompress
so, no matter how much you improve the speed of calculating the diffs, most of the time is actually spent reading from Disk and decompressing the data
So in this case either this is how long it takes - OR if you are doing multiple passes then you may want to be careful about getting it into and keeping the data in memory so you can stay fast
I suggest you check if there is a different jgit API available that won't make you spend all this time loading this data into memory and decompressing it; I'm assuming there must be a better and more efficient way of just getting the git stats you're interested in
in case it's not clear, the way to read a flame graph is bottom to top -> each entry width shows the total time spent in this function, and the row above it shows what that function called (and how long each of those functions spent as a percentage of the parent function). I'm not sure that's a good description... :P
Also pro tip: all the way on the top left and right of the SVG are small links "Reset zoom" and "Search" to help with interactive exploration. I only mention this, because they're small, gray, and easy to miss.
How do you know to zero in on those particular items in the flamegraph? Do you just scan up until the total percentage starts to get narrow? I understand that familiarity with the code certainly helps, and I did identify the DF as a line of interest, but from there I don't know how you pick out decompress as the next item of interest... it looks about the same as the others, except it also starts to narrow.
precisely, you kind of gawk at it and notice that entries->change-map-2 is interesting (in my codebase); then look up and most of those callers are the same width (which means they don't spend much time doing anything - all their time is spent in their "child")
the DiffFormatter.open calls a bunch of stuff (that takes basically no time) - but decompress sounds like the important one (before that we're in java.util.zip.Inflater.inflate)
I get it. Then all the items above decompress are clearly compression related, so the buck stops there.
then, all you can do is make some hypotheses that would explain these different behaviors; consider which ones you should test; and consider which ones you can reasonably fix
e.g. you probably won't speed up java.util.zip.Inflater.inflate - but if it's a function you control, perhaps you can - or if it's the compression, perhaps you can avoid calling it as often.
@U017QJZ9M7W and @U05476190, you guys are awesome. Thank you so much for helping me reason through this process and arrive at this bitter end lol 🙂 I've certainly learned a great deal in this thread.
@U017QJZ9M7W made an important distinction: sampling profilers are a ghost image of the system; they just pause every X ticks and check what the CPU was working on. But they don't really tell you if "this function is really slow" or "this function is getting called a lot more than it needs to". That, you need to figure out separately.
There is a similar flamegraph you can build for allocations (assuming you get your JVM sorted)
Looks like this gitblit thing might be somewhat of a successor to jgit and has the diffing functionality in it https://github.com/gitblit/gitblit/blob/master/src/main/java/com/gitblit/utils/JGitUtils.java#L1047
Thanks for being a good patient @UPWHQK562 :) I have also learned a lot, and I am sure I will avoid similar bugs in the future when my ears start to tingle during jgit interactions, no question about it
Hopefully this is a positive experience! Sitting over top of a few layers of code can induce anxiety (“what the heck is going on, my code is simple and fast!!!”) in the best of us
I gotta admit those transducer functions you shared are still causing a bit of anxiety 🙂 I just gotta sit down and pull it apart when I'm not trying to get stuff done. "reducer with a transform"...
yes! if you want to use those, try always writing them FIRST as "transform then reduce", then see if you can change them into transducers
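A minimal sketch of that progression (the function names here are illustrative, not from the thread): first the plain "transform then reduce" version, then the same pipeline as a transducer.

```clojure
;; Step 1: plain "transform then reduce" - easy to read and debug,
;; but allocates an intermediate lazy seq between filter and map.
(defn total-even-squares [xs]
  (reduce + 0 (map #(* % %) (filter even? xs))))

;; Step 2: the same pipeline as a transducer - the filter and map
;; steps are fused, so no intermediate sequences are built.
(defn total-even-squares-xf [xs]
  (transduce (comp (filter even?) (map #(* % %))) + 0 xs))
```

Both return the same result, e.g. `(total-even-squares [1 2 3 4])` is 20; once the two agree on your test data, you can keep whichever the profiler favors.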
Another note @UPWHQK562, where @U05476190 may disagree but let me toss it out there - if you find that, in this example, it is NOT beneficial to push your reduction down into your entries->change-map, then don't do it; keep that aggregation-of-changes function separate
then you will be able to reuse it no problem if you swap out git libraries later etc
unless it really matters for speed, write your functions in terms of data structures you care about, and separately do the xform ->change-map, just like you did
and then later, say you have to fuse them to get speed - well, keep the old, simple, easy to understand separate ones there in your test code
(now we are into software stuff you probably already know, just wanted to reinforce in this context!!)
Kudos to @UPWHQK562 for not giving up and coming out the other side (hopefully with more knowledge and confidence) and @U017QJZ9M7W for the awesome support (technical, but I would say even more importantly, emotional). Next time I'm dealing with a hairy problem I'd love to have you two in my corner. :)
(PS. I have absolutely no need to rescue that code; it was just a suggestion when I thought maybe the nested lazy-seqs were eating up memory. "You are not your code" and you should most definitely get rid of any abstractions that don't make it easier to reason about the system)
Yeah @U05476190, been kinda a slow week in terms of crossing items off my list but a good week in terms of covering new ground and putting more tools in my arsenal. Totally off-topic, but I think I watched one of your talks on Fulcro a few months back!
Hello! Say I don't know a thing about JS or frontend. What would be the easiest way to build a very basic page displaying a couple of tables with the data fed from a Clojure backend say via a websocket or something? Maybe any libraries/frameworks/resources you could point me to? The more Clojure and the less JS it is, the better.
I’d probably use shadow-cljs, reagent (or helix) and sente. Shadow compiles your code, reagent is a react wrapper for clojurescript (same for helix) and sente is a websocket library for Clojure(script). If you don’t feel like playing around with React you can probably go with plain goog.dom, but that will probably look very akin to plain js
Thanks. I'm alright with playing with it a bit, just need to do something very barebones, doesn't have to be pretty or anything, just give me an interface to conveniently and interactively view endless maps of maps of maps I have on the backend. I don't want to spend too much time on it since the data is far more interesting than how I display it.
Thank you for the pointers!
I just wrote something like this for screen scraping package tracking data, cljs and clojure in a single file, no fancy shadow this or that or whatever
lemme make sure I don't have any passwords or anything in the file and I will gist it
Oh neat. That'd be great, thanks
https://gist.github.com/hiredman/c6868603eb9bf3620f2b89acfaef623e#file-packages-cljc-L1105-L1125 is where the clojure file compiles itself as a clojurescript file to serve to clients
😳 there are some terrible bits in there, like the macros trying to provide a unified logging api between clojure and clojurescript
No worries, I think I'll be able to get the basic idea from it 🙂 Thank you a ton for sharing!
yeah, I've never seen anyone do a project like that, but I have no interest in adding more tooling layers (like shadow-cljs) for something so simple
You can also use htmx + clojure to present your data table. Here is an example with Babashka https://github.com/prestancedesign/babashka-htmx-todoapp
> just need to do something very barebones, doesn't have to be pretty or anything, just give me an interface to conveniently and interactively view endless maps of maps of maps I have on the backend. @U5USC6WNL - do you need to build a webpage? Perhaps it may be enough to just use (or perhaps extend a custom viewer) for something like https://github.com/djblue/portal or https://vlaaad.github.io/reveal/ Or, if you need to publish these results online - perhaps something like https://github.com/nextjournal/clerk will be sufficient as both a data explorer and static public website?
I'm doing something like this with re-com, re-frame, and reagent
I'm trying to (spit filename data) and when I open the file it's a string representation of a LazySeq. data evaluates to (loop <some stuff>) so I tried setting data equal to (doall (loop...)) instead
it's not evaluation that is your problem. If you spit data you just get a string representation of that data.
(str (filter even? [1 2 3]))
the toString on a lazy sequence (regardless of whether it has been realized or not) is just "clojure.lang.LazySeq@21" or similar
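A quick sketch of the usual fix: print the data with pr-str (readable edn form) instead of relying on str/toString, then spit that string.

```clojure
(def data (filter even? (range 4)))

;; str falls back to Object.toString for a lazy seq: just the class
;; name and a hash, whether or not the seq has been realized.
(def as-str (str data))     ; e.g. "clojure.lang.LazySeq@..."

;; pr-str prints the data in readable (edn) form, realizing it.
(def as-edn (pr-str data))  ; "(0 2)"

;; so instead of (spit filename data), write:
;; (spit filename (pr-str data))
```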
hm. I guess what's confusing me tho is evaluating the loop form in my repl gives results
(defn spit
"Opposite of slurp. Opens f with writer, writes content, then
closes f. Options passed to clojure.java.io/writer."
{:added "1.2"}
[f content & options]
(with-open [^java.io.Writer w (apply jio/writer f options)]
(.write w (str content))))
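The source above is easy to mimic if you want readable output: the sketch below swaps str for pr-str (pr-spit is a made-up name, not part of clojure.core).

```clojure
(require '[clojure.java.io :as jio])

;; Like spit, but writes the content in readable edn form
;; instead of calling str on it.
(defn pr-spit [f content & options]
  (with-open [^java.io.Writer w (apply jio/writer f options)]
    (.write w (pr-str content))))
```

With this, (pr-spit "out.edn" (filter even? (range 4))) writes (0 2) to the file rather than a clojure.lang.LazySeq@... string.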
spit is a very simple thing. it just calls str on the argument. (str (filter even? (range 4))) will show what you end up with. If you need more control you can mimic what spit does here and control the writer
hi, i'm programming a reagent app and i have a question about this piece of code:
(let [atom-val (r/atom "")]
  [:> rn/View {:style {:align-items :center}}
   [:> rn/TextInput {:on-change-text #(reset! atom-val %)}]
   [:> rn/Button {:title @atom-val}]])
the title property of the Button isn't being updated by the TextInput callback. am i wrong in assuming that this should work? or is there a bug in the code? thanks in advance.
there's a bug in this. each time it renders, it creates a new r/atom whose contents is the empty string. When you call reset! to some new value, it re-renders, and creates a new r/atom whose contents are the empty string …
that is a super common mistake. you’ll probably do it again in the future and add some printlns that you won’t believe possible. and then you’ll remember after 7 minutes of screaming “this cannot be”
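The usual fix is a form-2 component: do the (r/atom ...) setup in an outer let and return the render function, so every re-render closes over the same atom. Since reagent/react-native can't be loaded here, this is a plain-Clojure simulation of that closure pattern, with the UI stubbed out as a map:

```clojure
;; Simulates a form-2 component: the atom is created ONCE when the
;; component mounts; the returned fn is what gets called on every
;; re-render, and it closes over that same atom.
(defn form-2-component []
  (let [atom-val (atom "")]        ; like (r/atom "") in the outer let
    (fn render []                  ; reagent would call this each render
      {:title @atom-val
       :on-change-text #(reset! atom-val %)})))

;; "Mount" once, then render repeatedly: state survives re-renders.
(def render (form-2-component))
(def first-ui (render))
((:on-change-text first-ui) "hello")   ; simulate typing in the input
```

After the callback fires, (:title (render)) is "hello" - whereas the form-1 version in the snippet above would have rebuilt the atom as "" on the very next render.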