#clerk
2023-08-22
respatialized 20:08:18

I'm frequently encountering a significant cache overhead with medium-to-large datasets used in Clerk notebooks. I took some time this afternoon to create a reproducible example, using this ~150MB / 284k row dataset https://www.kaggle.com/datasets/joebeachcapital/credit-card-fraud. deps.edn:

{:paths ["src"]
 :deps  {io.github.nextjournal/clerk {:mvn/version "0.14.919"}
         com.cnuernber/charred {:mvn/version "1.032"}}}
src/cache_issue.clj:
(ns cache-issue
  (:require [nextjournal.clerk :as clerk]
            [charred.api :as charred]
            [clojure.java.io :as io]))

^{::clerk/visibility {:result :hide}}
(def cc-data
  (time
   (with-open [r (io/reader "resources/creditcard_csv.csv")]
     (into [] (charred/read-csv r)))))

(count cc-data)

(comment
  (clerk/serve! {})

  (clerk/show! "src/cache_issue.clj")
  )
Calling clerk/show! results in the following, which is indicative of the discrepancy:
"Elapsed time: 4703.877497 msecs"
Clerk evaluated '/home/andrew/repos_main/clerk-cache/src/respatialized/cache_issue.clj' in 19861.947261ms.
charred takes about 5 seconds to read the data (the time output is not displayed in the notebook). There are about 15 additional seconds of overhead, which I am assuming is caused by caching the value. Any pointers on avoiding this behavior? Disabling caching on the dataset means I don't get the benefit of caching for subsequent computations on the same dataset.
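[Editor's note: for context, Clerk supports opting individual forms out of its cache via `::clerk/no-cache` form metadata. A minimal sketch of applying it to the expensive form (the trade-off the message above describes still applies, since downstream forms that depend on an uncached value recompute on each show):]

```clojure
;; Sketch: opt a single expensive form out of Clerk's cache via the
;; documented ::clerk/no-cache metadata, while hiding the large result.
(ns cache-sketch
  (:require [nextjournal.clerk :as clerk]
            [charred.api :as charred]
            [clojure.java.io :as io]))

^{::clerk/no-cache true
  ::clerk/visibility {:result :hide}}
(def cc-data
  (with-open [r (io/reader "resources/creditcard_csv.csv")]
    (into [] (charred/read-csv r))))
```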

mkvlr 20:08:10

thanks for the repro, I’ll try and take a closer look later today

mkvlr 20:08:53

curious what happens without the into

respatialized 20:08:29

an Unhandled java.io.IOException Stream closed error when you get to count :simple_smile:
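[Editor's note: the error above is the classic with-open/laziness pitfall, sketched here in plain Clojure (no Clerk or charred needed): with-open closes the reader as soon as its body returns, so a lazy sequence that still needs the reader fails when finally realized.]

```clojure
(require '[clojure.java.io :as io])

(defn lazy-lines [path]
  (with-open [r (io/reader path)]
    (line-seq r)))                ; lazy; reader is closed before realization

(defn eager-lines [path]
  (with-open [r (io/reader path)]
    (into [] (line-seq r))))      ; fully realized while reader is open

;; (count (lazy-lines "some.csv"))  ; throws java.io.IOException: Stream closed
;; (count (eager-lines "some.csv")) ; works
```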

mkvlr 23:08:23

I can’t reproduce the issue on main:

"Elapsed time: 506.300625 msecs"
"Elapsed time: 546.625166 msecs"
Clerk evaluated '/Users/mk/dev/clerk/notebooks/scratch_cache.clj' in 2958.997917ms.
Clerk evaluated '/Users/mk/dev/clerk/notebooks/scratch_cache.clj' in 60.986583ms.
Clerk evaluated '/Users/mk/dev/clerk/notebooks/scratch_cache.clj' in 60.197208ms.
Clerk evaluated '/Users/mk/dev/clerk/notebooks/scratch_cache.clj' in 57.910167ms.
Clerk evaluated '/Users/mk/dev/clerk/notebooks/scratch_cache.clj' in 67.838875ms.
Clerk evaluated '/Users/mk/dev/clerk/notebooks/scratch_cache.clj' in 87.97375ms.
Clerk evaluated '/Users/mk/dev/clerk/notebooks/scratch_cache.clj' in 49.528459ms.

mkvlr 23:08:44

let me try {:mvn/version "0.14.919"}

mkvlr 23:08:48

same, as expected it’s slow the first time but subsequent shows are fast:

"Elapsed time: 640.844084 msecs"
"Elapsed time: 522.548208 msecs"
Clerk evaluated '/Users/mk/dev/clerk-demo/src/scratch_cache.clj' in 2809.95775ms.
Clerk evaluated '/Users/mk/dev/clerk-demo/src/scratch_cache.clj' in 34.3155ms.
Clerk evaluated '/Users/mk/dev/clerk-demo/src/scratch_cache.clj' in 52.093875ms.
Clerk evaluated '/Users/mk/dev/clerk-demo/src/scratch_cache.clj' in 48.5775ms.
Clerk evaluated '/Users/mk/dev/clerk-demo/src/scratch_cache.clj' in 40.103584ms.
Clerk evaluated '/Users/mk/dev/clerk-demo/src/scratch_cache.clj' in 44.81075ms.
Clerk evaluated '/Users/mk/dev/clerk-demo/src/scratch_cache.clj' in 41.364125ms.
Clerk evaluated '/Users/mk/dev/clerk-demo/src/scratch_cache.clj' in 39.268542ms.
Clerk evaluated '/Users/mk/dev/clerk-demo/src/scratch_cache.clj' in 38.433625ms.

respatialized 00:08:28

It's the initial slowness that I'm concerned about, I think! I know that fine-grained control over caching is not ready yet. I'm encountering other (perhaps unrelated) issues where evaluations that should be fast are quite slow in the context of Clerk. I will try to pin those down as well.

respatialized 13:08:26

Also, it seems like exceptions now invalidate a namespace's cache in recent versions of Clerk, but I don't think this was always the case. Is there a way to control that behavior?

respatialized 14:08:20

Adding a

(throw (ex-info "error" {}))
to the above example notebook will invalidate the cache and result in a longer evaluation, even if the dataset is bound to a var using defonce

mkvlr 16:08:50

can see how that makes the current behavior problematic

mkvlr 17:08:32

curious also in feedback if the initial cache overhead is still a problem for you in normal usage with this

respatialized 12:08:38

@U5H74UNSF I finally had the chance to test latest main - I observe correct cache hits after exceptions, but also the same overhead on initial eval.

respatialized 12:08:37

Binding cc-data with defonce avoids the initial read when I clear Clerk's cache, but the initial eval without the cache still takes about 13s on my machine. Seems like defonce may be a way to more effectively isolate the cache overhead for large-ish values.
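[Editor's note: a sketch of the defonce workaround described above. The defonce'd value survives repeated shows within the same JVM session; combining it with `::clerk/no-cache` (an assumption here, not something the thread confirms was done) would additionally keep Clerk from serializing the large value to its disk cache:]

```clojure
;; Sketch: defonce evaluates its init expression only once per JVM
;; session, so re-shows skip the expensive read even if Clerk's own
;; cache misses. ::clerk/no-cache skips disk-cache serialization.
^{::clerk/no-cache true
  ::clerk/visibility {:result :hide}}
(defonce cc-data
  (with-open [r (io/reader "resources/creditcard_csv.csv")]
    (into [] (charred/read-csv r))))

;; Downstream computations on cc-data can still be cached as usual:
(count cc-data)
```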

mkvlr 12:08:00

@UFTRLDZEW yes, no change on the initial overhead. Do you feel like the initial overhead is still an issue in practice with the cache hit after exceptions fixed?

respatialized 14:08:02

@U5H74UNSF I think the biggest source of friction in my workflow is eliminated by this change for the time being. If (when it's ready) you need an adventurous beta tester for development versions of exposing the cache functions to end users, let me know!

👍 2
mkvlr 19:08:30

@UFTRLDZEW excellent, will do!