Fork me on GitHub
#nextjournal
<
2022-02-17
>
Carsten Behring09:02:51

I try to investigate an issue, where Clerk sometimes decides to re-evaluate a certain form. I run clerk/show in a loop and measure time, and indeed I see sometimes far longer duration of form evaluations. This is confirmed by the debug output coming from here: https://github.com/nextjournal/clerk/blob/b0e13859af5280e4d75ba9cb324cbf8010e555cd/src/nextjournal/clerk.clj#L145 I have somehow the impression that the hashing of the forms is not always "stable". I see for example that the same form gets different hashes in different calls of "show!" without any code change in between.

"EVAL !!!" :hash "5dr7SHMXW1tdQYL1UbqD6Xdg3bd3jj" :form (def pipe-fn (ml/pipeline (mm/select-columns (concat cat-features numeric-features [:SalePrice])) (mm/replace-missing cat-features :value :NA) (mm/replace-missing numeric-features :midpoint) (fn [ctx] (assoc ctx : train-data)) (mm/transform-one-hot cat-features :full) (mm/set-inference-target :SalePrice) #:metamorph{:id :model} (mm/model {:model-type :smile.regression/random-forest, :trees 1000})))
.....
"EVAL !!!" :hash "5dtaiHaQeang6r83DzaWmbUmTtExFJ" :form (def pipe-fn (ml/pipeline (mm/select-columns (concat cat-features numeric-features [:SalePrice])) (mm/replace-missing cat-features :value :NA) (mm/replace-missing numeric-features :midpoint) (fn [ctx] (assoc ctx : train-data)) (mm/transform-one-hot cat-features :full) (mm/set-inference-target :SalePrice) #:metamorph{:id :model} (mm/model {:model-type :smile.regression/random-forest, :trees 1000})))

Carsten Behring09:02:10

So running this:

(def times
    (doall
     (repeatedly 10 (fn [] (clerk/time-ms (clerk/show! "src/kaggle.clj"))))))

Carsten Behring09:02:31

produces: ({:result nil, :time-ms 22870.775138} {:result nil, :time-ms 20775.000902} {:result nil, :time-ms 1289.373927} {:result nil, :time-ms 1285.455593} {:result nil, :time-ms 21002.457645} {:result nil, :time-ms 20808.522161} {:result nil, :time-ms 20450.540389} {:result nil, :time-ms 1207.209592} {:result nil, :time-ms 20773.024323} {:result nil, :time-ms 1282.546341})

Carsten Behring09:02:25

4 times fast (from cache) and 6 times slow (Clerk re-evaluated the slow form)

Carsten Behring10:02:04

We are getting somewhere...

(repeatedly
 10
 (fn []
   (->
    (clerk/parse-file "src/kaggle.clj")
    (h/build-graph)
    h/hash
    (get 'kaggle/train))))

Carsten Behring10:02:14

Returns different results.....

Carsten Behring10:02:29

;; => ("5dt6Bi1CeH3guTNdnCKCid5TB1hgwA" ;; "5duH7vdgoEpu9E36jc5fxzWJhMJ7Xg" ;; "5dt6Bi1CeH3guTNdnCKCid5TB1hgwA" ;; "5duH7vdgoEpu9E36jc5fxzWJhMJ7Xg" ;; "5duH7vdgoEpu9E36jc5fxzWJhMJ7Xg" ;; "5ds3nXtgTQ4aUw3xt4KiM41tQbJUsE" ;; "5duH7vdgoEpu9E36jc5fxzWJhMJ7Xg" ;; "5dt6Bi1CeH3guTNdnCKCid5TB1hgwA" ;; "5duH7vdgoEpu9E36jc5fxzWJhMJ7Xg" ;; "5ds3nXtgTQ4aUw3xt4KiM41tQbJUsE")

mkvlr11:02:27

thanks for digging into this!

Carsten Behring11:02:39

Very inteesting.... I came this far: (-> (clerk/parse-file "src/kaggle.clj") (h/build-graph) :graph :dependencies (get 'kaggle/r-object)) ;; => #{clojure.core/ex-info ;; clojure.string/split ;; kaggle/base-url ;; clojure.core/first ;; clojure.lang.RT/nth ;; opencpu-clj.ocpu/object ;; clojure.lang.Numbers/gt}

Carsten Behring11:02:26

The 'kaggle/r-object changes it's hash, eventhough all deps are a constant string "kaggle/base-url" and the other symbols are from jars

Carsten Behring11:02:09

So the hash should be constant, should it ?

Carsten Behring11:02:47

For ope symbol no location was found ...

mkvlr11:02:50

an issue with a minimal repro would be great

mkvlr11:02:58

which one?

Carsten Behring11:02:06

(h/find-location 'opencpu-clj.ocpu/object) -> nil

Carsten Behring11:02:31

this should not be

Carsten Behring11:02:33

Is it this "-" in the ns ?

Carsten Behring11:02:30

(h/ns->jar (namespace 'opencpu_clj.ocpu/object)) -> working

Carsten Behring11:02:39

(h/ns->jar (namespace 'opencpu-clj.ocpu/object)) > nil

Carsten Behring12:02:42

This fix is needed in my view:

(defn ns->path [ns]
  (str/replace (str/replace (str ns) "-" "_")  "." fs/file-separator))

Carsten Behring12:02:06

So, the jar get found for opencpu-clj namespace.

Carsten Behring12:02:35

now all deps above have a location.

Carsten Behring12:02:52

But the hash of kaggle/r-object is nevertheless sometimes different. -> a hash of one of the deps need to be different "sometimes" as code text is constant!

mkvlr12:02:09

nice catch!

mkvlr12:02:08

adding a test now

mkvlr12:02:48

where’s the r-object coming from? don’t see it in https://github.com/behrica/kaggleHP/blob/main/src/kaggle.clj

Carsten Behring13:02:28

I will try to commit later:

(defn r-object [library function params]
  (let [resp (ocpu/object base-url :library library  :R function params)]
    (when (>  (:status resp) 201) (throw (ex-info "error" resp)))
    (-> resp
        :result
        first
        (str/split #"/")
        (nth 3))))

Carsten Behring13:02:02

It only depends on one more object:

(def base-url "")

Carsten Behring15:02:14

I tried everything, but don't understand it. All inputs to the hashing are constant, and the hashing itself seems as well deterministic

Carsten Behring15:02:33

I did not find , why it fails, but at least which line it makes fail.

(str/split #"/")

Carsten Behring15:02:43

So replacing the above definition with:

(def re #"/")

(defn r-object [library function params]
  (let [resp (ocpu/object base-url :library library  :R function params)]
    (when (>  (:status resp) 201) (throw (ex-info "error" resp)))
    (-> resp
        :result
        first
        (str/split re)
        (nth 3))))

Carsten Behring15:02:23

This small example reproduces it. There is something strange about the re

(ns kaggle
  (:require [nextjournal.clerk :as clerk]
            [nextjournal.clerk.viewer :as v]))


(def re #"/")

(defn use-re--working-always [s]
  (println "very slow !! - cache working always")
  (Thread/sleep 5000)
  (clojure.string/split s re))


(defn use-re--failing-sometimes [s]
  (println "very slow !! -- cache failing sometimes ")
  (Thread/sleep 5000)
  (clojure.string/split s #"/"))

(def splitted (use-re--working-always "hello/clerk"))


(comment
  (def times
    (do
      (clerk/clear-cache!)
      (doall
       (repeatedly 100 (fn [] (clerk/time-ms (clerk/show! "src/buggy.clj")))))))

  :ok)

;;  using `use-re--failing-sometimes`  does not cache 3 times
(->> times (map  :time-ms) (filter #(> % 5000)))


;; => (5118.088344
;;     5112.789448
;;     5111.539959


;;  using `use-re--alawys-working`  does always cache
(->> times (map  :time-ms) (filter #(> % 5000)))
;; => []

Carsten Behring15:02:29

In here the "frequency" of "Not caching" is very low, 3 out of 1000 In my real code is higher, evey 5th run, I would say.

mkvlr17:02:32

regular expressions don’t have equality in clojure

mkvlr17:02:10

digging into how this could affect caching…

mkvlr18:02:58

this seems to reproduce it pretty reliably:

mkvlr18:02:59

(require '[nextjournal.clerk.hashing :as h])
(into []
      (comp (map (comp set vals h/hash h/build-graph h/parse-clojure-string))
            (distinct))
      (repeat 100 "(fn [x] (clojure.string/split x #\"/\"))"))

mkvlr18:02:50

thanks a lot for figuring out the the regexp can cause issues! 🙌

mkvlr18:02:09

(frequencies (into []
                   (comp (map (comp vals h/hash h/build-graph h/parse-clojure-string)))
                   (repeat 1000 "(fn [x] (clojure.string/split x #\"/\"))")))
;;=> {("5dtVT9HVuLAa4BCJBWwz9KQ2b9u37n" "5drKysDWADNZJWwWfgWYs6TETrz7zx")
 946,
 ("5dtVT9HVuLAa4BCJBWwz9KQ2b9u37n" "5dt92a8LBSfEScpUk3X7Y1L3vBDKJm")
 54}

mkvlr18:02:00

a heisenbug 🙃

Carsten Behring19:02:56

Indeed interesting. But I though that Clerk uses the "text representation" via prn-str for hashing.

mkvlr19:02:33

for hashing yes but as nodes in the dep graph it uses either var names or for top level expression the read form

Carsten Behring21:02:16

Looks good to me 👍

Carsten Behring21:02:35

Clerk evaluated 'src/kaggle.clj' in 373.011182ms.
Clerk evaluated 'src/kaggle.clj' in 408.822775ms.
Clerk evaluated 'src/kaggle.clj' in 454.97627ms.
Clerk evaluated 'src/kaggle.clj' in 529.977895ms.
Clerk evaluated 'src/kaggle.clj' in 506.647304ms.
Clerk evaluated 'src/kaggle.clj' in 437.870516ms.
Clerk evaluated 'src/kaggle.clj' in 466.663666ms.
Clerk evaluated 'src/kaggle.clj' in 431.667927ms.
Clerk evaluated 'src/kaggle.clj' in 488.64744ms.
Clerk evaluated 'src/kaggle.clj' in 449.524183ms.

mkvlr21:02:51

great, thanks

Carsten Behring21:02:54

I am trying to make a viewer for a MD dataset and started withg something like:

(clerk/set-viewers! [{:pred tc/dataset?
                      :render-fn (quote v/table-viewer)
                      :transform-fn (fn [x]
                                      (->

                                       (cons (tds/column-names x)
                                             (into [] (tds/rowvecs x)))))}])
The transform-fn transforms the dataset to something the table viewer of Clerk understands, but it does not work.

mkvlr21:02:18

@carsten.behring something like this should work

(clerk/set-viewers! [{:pred tc/dataset?
                      :transform-fn #(hash-map :nextjournal/value
                                               (clerk/table {:head (tds/column-names x)
                                                             :rows (tds/rowvecs x)}))}])

mkvlr21:02:50

this is what I used to test it

^{::clerk/viewer {:transform-fn #(hash-map :nextjournal/value
                                           (clerk/table {:head (first %) :rows (rest %)}))}}
[[:a :b] [:c :d]]

mkvlr19:02:00

@carsten.behring on latest main, this can be simplified to

{:pred tc/dataset?
 :transform-fn #(clerk/table {:head (tds/column-names %)
                              :rows (tds/rowvecs %)})}

Carsten Behring19:02:26

ok, I will try it.👍 Would be good to have at some point docu somewhere on the viewer API, it is for sure the most complex public surface of Clerk.

mkvlr19:02:10

yep for sure, will focus on that once it has settled and we feel it's stable

💯 1
mkvlr21:02:28

(I only tested it on a simpler example)

Carsten Behring21:02:14

Do you think there is any way, to get this working with Clerk ? https://datatables.net/