This page is not created by, affiliated with, or supported by Slack Technologies, Inc.
2023-08-22
Channels
- # announcements (13)
- # babashka (22)
- # beginners (22)
- # biff (17)
- # calva (6)
- # clerk (20)
- # clj-kondo (25)
- # clj-together (5)
- # clj-yaml (20)
- # cljdoc (16)
- # cljs-dev (1)
- # clojure (42)
- # clojure-brasil (1)
- # clojure-europe (26)
- # clojure-nl (6)
- # clojure-norway (24)
- # clojure-turkiye (3)
- # clojure-uk (5)
- # clojurescript (37)
- # core-async (7)
- # core-logic (2)
- # datalevin (7)
- # datomic (43)
- # events (2)
- # fulcro (7)
- # gratitude (1)
- # hyperfiddle (7)
- # java (7)
- # jobs (3)
- # lsp (4)
- # off-topic (16)
- # pathom (18)
- # polylith (1)
- # portal (27)
- # reitit (4)
- # releases (3)
- # shadow-cljs (47)
- # tools-build (14)
- # tools-deps (16)
- # yamlscript (11)
I'm frequently encountering a significant cache overhead with medium-to-large datasets used in Clerk notebooks. I took some time this afternoon to create a reproducible example, using this ~150MB / 284k row dataset https://www.kaggle.com/datasets/joebeachcapital/credit-card-fraud.
deps.edn:
{:paths ["src"]
:deps {io.github.nextjournal/clerk {:mvn/version "0.14.919"}
com.cnuernber/charred {:mvn/version "1.032"}}}
src/cache_issue.clj:
(ns cache-issue
  (:require [nextjournal.clerk :as clerk]
            [charred.api :as charred]
            [clojure.java.io :as io]))
^{::clerk/visibility {:result :hide}}
(def cc-data
  (time
   (with-open [r (io/reader "resources/creditcard_csv.csv")]
     (into [] (charred/read-csv r)))))
(count cc-data)
(comment
  (clerk/serve! {})
  (clerk/show! "src/cache_issue.clj")
  )
Calling clerk/show! results in the following, which is indicative of the discrepancy:
"Elapsed time: 4703.877497 msecs"
Clerk evaluated '/home/andrew/repos_main/clerk-cache/src/respatialized/cache_issue.clj' in 19861.947261ms.
charred takes about 5 seconds to read the data, which is not displayed in the notebook. There are about 15 additional seconds of overhead, which I'm assuming is caused by caching the value. Any pointers on avoiding this behavior? Disabling caching on the dataset means I don't get the benefit of caching for subsequent computations on the same dataset.
There's also an Unhandled java.io.IOException Stream closed error when you get to count.
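For reference on the trade-off mentioned above: Clerk supports opting a single form or var out of caching with ^::clerk/no-cache metadata. A minimal sketch using the cc-data var from the notebook above (same ns and requires assumed):
;; Sketch: opt only the expensive read out of Clerk's cache.
;; This skips the caching overhead for this form, but as noted in the
;; message above, subsequent computations on cc-data then lose the
;; benefit of a cached value whenever this form re-runs.
^{::clerk/no-cache true
  ::clerk/visibility {:result :hide}}
(def cc-data
  (with-open [r (io/reader "resources/creditcard_csv.csv")]
    (into [] (charred/read-csv r))))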
I can’t reproduce the issue on main:
"Elapsed time: 506.300625 msecs"
"Elapsed time: 546.625166 msecs"
Clerk evaluated '/Users/mk/dev/clerk/notebooks/scratch_cache.clj' in 2958.997917ms.
Clerk evaluated '/Users/mk/dev/clerk/notebooks/scratch_cache.clj' in 60.986583ms.
Clerk evaluated '/Users/mk/dev/clerk/notebooks/scratch_cache.clj' in 60.197208ms.
Clerk evaluated '/Users/mk/dev/clerk/notebooks/scratch_cache.clj' in 57.910167ms.
Clerk evaluated '/Users/mk/dev/clerk/notebooks/scratch_cache.clj' in 67.838875ms.
Clerk evaluated '/Users/mk/dev/clerk/notebooks/scratch_cache.clj' in 87.97375ms.
Clerk evaluated '/Users/mk/dev/clerk/notebooks/scratch_cache.clj' in 49.528459ms.
Same; as expected, it’s slow the first time but subsequent shows are fast:
"Elapsed time: 640.844084 msecs"
"Elapsed time: 522.548208 msecs"
Clerk evaluated '/Users/mk/dev/clerk-demo/src/scratch_cache.clj' in 2809.95775ms.
Clerk evaluated '/Users/mk/dev/clerk-demo/src/scratch_cache.clj' in 34.3155ms.
Clerk evaluated '/Users/mk/dev/clerk-demo/src/scratch_cache.clj' in 52.093875ms.
Clerk evaluated '/Users/mk/dev/clerk-demo/src/scratch_cache.clj' in 48.5775ms.
Clerk evaluated '/Users/mk/dev/clerk-demo/src/scratch_cache.clj' in 40.103584ms.
Clerk evaluated '/Users/mk/dev/clerk-demo/src/scratch_cache.clj' in 44.81075ms.
Clerk evaluated '/Users/mk/dev/clerk-demo/src/scratch_cache.clj' in 41.364125ms.
Clerk evaluated '/Users/mk/dev/clerk-demo/src/scratch_cache.clj' in 39.268542ms.
Clerk evaluated '/Users/mk/dev/clerk-demo/src/scratch_cache.clj' in 38.433625ms.
It's the initial slowness that I'm concerned about, I think! I know that fine-grained control over caching is not ready yet. I'm encountering other (perhaps unrelated) issues where evaluations that should be fast are quite slow in the context of Clerk. I will try to pin those down as well.
Also, it seems like exceptions now invalidate a namespace's cache in recent versions of Clerk, but I don't think this was always the case. Is there a way to control that behavior?
Adding a (throw (ex-info "error" {})) to the above example notebook will invalidate the cache and result in a longer duration, even if the dataset is bound to a var using defonce.
right, made https://github.com/nextjournal/clerk/issues/549 to track this
@UFTRLDZEW can you give https://github.com/nextjournal/clerk/pull/550 a try?
Also curious for feedback on whether the initial cache overhead is still a problem for you in normal usage with this.
Fix is in main now: https://github.com/nextjournal/clerk/commit/0846e106ce8b22202232a529f8c76bb85a95763b
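One way to try the fix before a release is a git dependency pinned to the commit linked above; this deps.edn sketch assumes the io.github.nextjournal/clerk git coordinate inferred from the repository URL:
;; deps.edn sketch (assumption): depend on Clerk at the exact commit from
;; the link above instead of a released :mvn/version.
{:paths ["src"]
 :deps {io.github.nextjournal/clerk
        {:git/sha "0846e106ce8b22202232a529f8c76bb85a95763b"}
        com.cnuernber/charred {:mvn/version "1.032"}}}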
@U5H74UNSF I finally had the chance to test latest main. I observe correct cache hits after exceptions, but also the same overhead on initial eval. Binding cc-data with defonce avoids the initial read when I clear Clerk's cache, but the initial eval without the cache still takes about 13s on my machine. Seems like defonce may be a way to more effectively isolate the cache overhead for large-ish values.
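A minimal sketch of that defonce workaround, reusing the ns and requires from the notebook earlier in the thread:
;; defonce only evaluates its body if the var is still unbound, so clearing
;; Clerk's cache no longer forces a re-read of the CSV; only the first eval
;; in a fresh JVM process pays the full read-plus-cache cost described above.
^{::clerk/visibility {:result :hide}}
(defonce cc-data
  (with-open [r (io/reader "resources/creditcard_csv.csv")]
    (into [] (charred/read-csv r))))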
@UFTRLDZEW yes, no change on the initial overhead. Do you feel like the initial overhead is still an issue in practice with the cache hit after exceptions fixed?
@U5H74UNSF I think the biggest source of friction in my workflow is eliminated by this change for the time being. If (when it's ready) you need an adventurous beta tester for development versions of exposing the cache functions to end users, let me know!
@UFTRLDZEW excellent, will do!