nextjournal

2022-02-17T09:46:51.883439Z

I try to investigate an issue, where Clerk sometimes decides to re-evaluate a certain form. I run clerk/show in a loop and measure time, and indeed I see sometimes far longer duration of form evaluations. This is confirmed by the debug output coming from here: https://github.com/nextjournal/clerk/blob/b0e13859af5280e4d75ba9cb324cbf8010e555cd/src/nextjournal/clerk.clj#L145 I have somehow the impression that the hashing of the forms is not always "stable". I see for example that the same form gets different hashes in different calls of "show!" without any code change in between.

"EVAL !!!" :hash "5dr7SHMXW1tdQYL1UbqD6Xdg3bd3jj" :form (def pipe-fn (ml/pipeline (mm/select-columns (concat cat-features numeric-features [:SalePrice])) (mm/replace-missing cat-features :value :NA) (mm/replace-missing numeric-features :midpoint) (fn [ctx] (assoc ctx : train-data)) (mm/transform-one-hot cat-features :full) (mm/set-inference-target :SalePrice) #:metamorph{:id :model} (mm/model {:model-type :smile.regression/random-forest, :trees 1000})))
.....
"EVAL !!!" :hash "5dtaiHaQeang6r83DzaWmbUmTtExFJ" :form (def pipe-fn (ml/pipeline (mm/select-columns (concat cat-features numeric-features [:SalePrice])) (mm/replace-missing cat-features :value :NA) (mm/replace-missing numeric-features :midpoint) (fn [ctx] (assoc ctx : train-data)) (mm/transform-one-hot cat-features :full) (mm/set-inference-target :SalePrice) #:metamorph{:id :model} (mm/model {:model-type :smile.regression/random-forest, :trees 1000})))

2022-02-17T09:50:10.300849Z

So running this:

(def times
    (doall
     (repeatedly 10 (fn [] (clerk/time-ms (clerk/show! "src/kaggle.clj"))))))

2022-02-17T09:51:31.216759Z

produces: ({:result nil, :time-ms 22870.775138} {:result nil, :time-ms 20775.000902} {:result nil, :time-ms 1289.373927} {:result nil, :time-ms 1285.455593} {:result nil, :time-ms 21002.457645} {:result nil, :time-ms 20808.522161} {:result nil, :time-ms 20450.540389} {:result nil, :time-ms 1207.209592} {:result nil, :time-ms 20773.024323} {:result nil, :time-ms 1282.546341})

2022-02-17T09:52:25.210429Z

4 times fast (from cache) and 6 times slow (Clerk re-evaluated the slow form)

2022-02-17T10:14:04.271299Z

We are getting somewhere...

(repeatedly
 10
 (fn []
   (->
    (clerk/parse-file "src/kaggle.clj")
    (h/build-graph)
    h/hash
    (get 'kaggle/train))))

2022-02-17T10:14:14.509899Z

Returns different results.....

2022-02-17T10:15:29.809319Z

;; => ("5dt6Bi1CeH3guTNdnCKCid5TB1hgwA" ;; "5duH7vdgoEpu9E36jc5fxzWJhMJ7Xg" ;; "5dt6Bi1CeH3guTNdnCKCid5TB1hgwA" ;; "5duH7vdgoEpu9E36jc5fxzWJhMJ7Xg" ;; "5duH7vdgoEpu9E36jc5fxzWJhMJ7Xg" ;; "5ds3nXtgTQ4aUw3xt4KiM41tQbJUsE" ;; "5duH7vdgoEpu9E36jc5fxzWJhMJ7Xg" ;; "5dt6Bi1CeH3guTNdnCKCid5TB1hgwA" ;; "5duH7vdgoEpu9E36jc5fxzWJhMJ7Xg" ;; "5ds3nXtgTQ4aUw3xt4KiM41tQbJUsE")

mkvlr 2022-02-17T11:20:27.814159Z

thanks for digging into this!

2022-02-17T11:31:39.436039Z

Very inteesting.... I came this far: (-> (clerk/parse-file "src/kaggle.clj") (h/build-graph) :graph :dependencies (get 'kaggle/r-object)) ;; => #{clojure.core/ex-info ;; clojure.string/split ;; kaggle/base-url ;; clojure.core/first ;; clojure.lang.RT/nth ;; opencpu-clj.ocpu/object ;; clojure.lang.Numbers/gt}

2022-02-17T11:32:26.295079Z

The 'kaggle/r-object changes it's hash, eventhough all deps are a constant string "kaggle/base-url" and the other symbols are from jars

2022-02-17T11:33:09.289889Z

So the hash should be constant, should it ?

mkvlr 2022-02-17T11:33:15.232039Z

yes

2022-02-17T11:33:24.530529Z

One thing ...

2022-02-17T11:33:47.630049Z

For ope symbol no location was found ...

mkvlr 2022-02-17T11:33:50.086809Z

an issue with a minimal repro would be great

mkvlr 2022-02-17T11:33:58.245189Z

which one?

2022-02-17T11:34:06.495599Z

(h/find-location 'opencpu-clj.ocpu/object) -> nil

2022-02-17T11:34:31.266879Z

this should not be

2022-02-17T11:37:33.438729Z

Is it this "-" in the ns ?

2022-02-17T11:37:50.235899Z

"_" vs "-"

2022-02-17T11:39:10.367519Z

here we go !

2022-02-17T11:39:30.851589Z

(h/ns->jar (namespace 'opencpu_clj.ocpu/object)) -> working

2022-02-17T11:39:39.330799Z

(h/ns->jar (namespace 'opencpu-clj.ocpu/object)) > nil

2022-02-17T12:08:42.606889Z

This fix is needed in my view:

(defn ns->path [ns]
  (str/replace (str/replace (str ns) "-" "_")  "." fs/file-separator))

2022-02-17T12:09:06.846099Z

So, the jar get found for opencpu-clj namespace.

2022-02-17T12:09:35.675779Z

now all deps above have a location.

2022-02-17T12:11:52.849349Z

But the hash of kaggle/r-object is nevertheless sometimes different. -> a hash of one of the deps need to be different "sometimes" as code text is constant!

mkvlr 2022-02-17T12:13:09.316899Z

nice catch!

mkvlr 2022-02-17T12:14:08.791199Z

adding a test now

mkvlr 2022-02-17T12:39:48.569309Z

where’s the r-object coming from? don’t see it in https://github.com/behrica/kaggleHP/blob/main/src/kaggle.clj

2022-02-17T13:46:28.294519Z

I will try to commit later:

(defn r-object [library function params]
  (let [resp (ocpu/object base-url :library library  :R function params)]
    (when (>  (:status resp) 201) (throw (ex-info "error" resp)))
    (-> resp
        :result
        first
        (str/split #"/")
        (nth 3))))

2022-02-17T13:47:02.778669Z

It only depends on one more object:

(def base-url "")

2022-02-17T15:05:14.424839Z

I tried everything, but don't understand it. All inputs to the hashing are constant, and the hashing itself seems as well deterministic

2022-02-17T15:14:33.722459Z

I did not find , why it fails, but at least which line it makes fail.

(str/split #"/")

2022-02-17T15:15:43.249789Z

So replacing the above definition with:

(def re #"/")

(defn r-object [library function params]
  (let [resp (ocpu/object base-url :library library  :R function params)]
    (when (>  (:status resp) 201) (throw (ex-info "error" resp)))
    (-> resp
        :result
        first
        (str/split re)
        (nth 3))))

2022-02-17T15:15:50.393439Z

makes it work.

2022-02-17T15:50:23.661039Z

This small example reproduces it. There is something strange about the re

(ns kaggle
  (:require [nextjournal.clerk :as clerk]
            [nextjournal.clerk.viewer :as v]))


(def re #"/")

(defn use-re--working-always [s]
  (println "very slow !! - cache working always")
  (Thread/sleep 5000)
  (clojure.string/split s re))


(defn use-re--failing-sometimes [s]
  (println "very slow !! -- cache failing sometimes ")
  (Thread/sleep 5000)
  (clojure.string/split s #"/"))

(def splitted (use-re--working-always "hello/clerk"))


(comment
  (def times
    (do
      (clerk/clear-cache!)
      (doall
       (repeatedly 100 (fn [] (clerk/time-ms (clerk/show! "src/buggy.clj")))))))

  :ok)

;;  using `use-re--failing-sometimes`  does not cache 3 times
(->> times (map  :time-ms) (filter #(> % 5000)))


;; => (5118.088344
;;     5112.789448
;;     5111.539959


;;  using `use-re--alawys-working`  does always cache
(->> times (map  :time-ms) (filter #(> % 5000)))
;; => []

2022-02-17T15:51:29.367429Z

In here the "frequency" of "Not caching" is very low, 3 out of 1000 In my real code is higher, evey 5th run, I would say.

mkvlr 2022-02-17T17:23:32.026529Z

regular expressions don’t have equality in clojure

mkvlr 2022-02-17T17:29:10.336209Z

digging into how this could affect caching…

mkvlr 2022-02-17T18:17:58.927919Z

this seems to reproduce it pretty reliably:

mkvlr 2022-02-17T18:17:59.943069Z

(require '[nextjournal.clerk.hashing :as h])
(into []
      (comp (map (comp set vals h/hash h/build-graph h/parse-clojure-string))
            (distinct))
      (repeat 100 "(fn [x] (clojure.string/split x #\"/\"))"))

mkvlr 2022-02-17T18:18:50.048679Z

thanks a lot for figuring out the the regexp can cause issues! 🙌

mkvlr 2022-02-17T18:48:09.178009Z

(frequencies (into []
                   (comp (map (comp vals h/hash h/build-graph h/parse-clojure-string)))
                   (repeat 1000 "(fn [x] (clojure.string/split x #\"/\"))")))
;;=> {("5dtVT9HVuLAa4BCJBWwz9KQ2b9u37n" "5drKysDWADNZJWwWfgWYs6TETrz7zx")
 946,
 ("5dtVT9HVuLAa4BCJBWwz9KQ2b9u37n" "5dt92a8LBSfEScpUk3X7Y1L3vBDKJm")
 54}

mkvlr 2022-02-17T18:49:00.045259Z

a heisenbug 🙃

2022-02-17T19:36:56.782389Z

Indeed interesting. But I though that Clerk uses the "text representation" via prn-str for hashing.

mkvlr 2022-02-17T19:38:33.140989Z

for hashing yes but as nodes in the dep graph it uses either var names or for top level expression the read form

mkvlr 2022-02-17T21:07:14.220269Z

can you try https://github.com/nextjournal/clerk/compare/regex-hash-fix?expand=1 ?

2022-02-17T21:24:16.863759Z

Looks good to me 👍

2022-02-17T21:24:35.945519Z

Clerk evaluated 'src/kaggle.clj' in 373.011182ms.
Clerk evaluated 'src/kaggle.clj' in 408.822775ms.
Clerk evaluated 'src/kaggle.clj' in 454.97627ms.
Clerk evaluated 'src/kaggle.clj' in 529.977895ms.
Clerk evaluated 'src/kaggle.clj' in 506.647304ms.
Clerk evaluated 'src/kaggle.clj' in 437.870516ms.
Clerk evaluated 'src/kaggle.clj' in 466.663666ms.
Clerk evaluated 'src/kaggle.clj' in 431.667927ms.
Clerk evaluated 'src/kaggle.clj' in 488.64744ms.
Clerk evaluated 'src/kaggle.clj' in 449.524183ms.

mkvlr 2022-02-17T21:24:51.611669Z

great, thanks

2022-02-17T21:11:54.363369Z

I am trying to make a viewer for a MD dataset and started withg something like:

(clerk/set-viewers! [{:pred tc/dataset?
                      :render-fn (quote v/table-viewer)
                      :transform-fn (fn [x]
                                      (->

                                       (cons (tds/column-names x)
                                             (into [] (tds/rowvecs x)))))}])
The transform-fn transforms the dataset to something the table viewer of Clerk understands, but it does not work.

mkvlr 2022-02-17T21:32:18.305819Z

@carsten.behring something like this should work

(clerk/set-viewers! [{:pred tc/dataset?
                      :transform-fn #(hash-map :nextjournal/value
                                               (clerk/table {:head (tds/column-names x)
                                                             :rows (tds/rowvecs x)}))}])

mkvlr 2022-02-20T19:04:00.073909Z

@carsten.behring on latest main, this can be simplified to

{:pred tc/dataset?
 :transform-fn #(clerk/table {:head (tds/column-names %)
                              :rows (tds/rowvecs %)})}

2022-02-20T19:06:26.270029Z

ok, I will try it.👍 Would be good to have at some point docu somewhere on the viewer API, it is for sure the most complex public surface of Clerk.

mkvlr 2022-02-20T19:12:10.931799Z

yep for sure, will focus on that once it has settled and we feel it's stable

💯 1
mkvlr 2022-02-17T21:32:50.617009Z

this is what I used to test it

^{::clerk/viewer {:transform-fn #(hash-map :nextjournal/value
                                           (clerk/table {:head (first %) :rows (rest %)}))}}
[[:a :b] [:c :d]]

mkvlr 2022-02-17T21:32:28.103879Z

(I only tested it on a simpler example)

2022-02-17T21:44:12.685879Z

Yes, great !

2022-02-17T21:49:14.164459Z

Do you think there is any way, to get this working with Clerk ? https://datatables.net/