nextjournal

2022-02-15T10:28:37.209829Z

I tried to build Clerk myself, but it cannot resolve:

io.github.nextjournal/cas {:git/url "git@github.com:nextjournal/cas"
                                                     :git/sha "5e8079b720e347b9466db9c2282ce79a125a011c"}
"io.github.nextjournal/cas" does not exist publicly, it seems

2022-02-15T11:39:48.959289Z

I still have an issue with freezing / un-freezing of TMD datasets. After a JVM restart, I get the exception below. Cleaning the cache and re-evaluating makes it go away.

Unhandled java.lang.ClassCastException
   class [D cannot be cast to class [Ljava.lang.Object; ([D and
   [Ljava.lang.Object; are in module java.base of loader 'bootstrap')

          array_buffer.clj:  333  tech.v3.datatype.array-buffer/array-buffer/reify
           BufferIter.java:   60  tech.v3.datatype.BufferIter/next
             protocols.clj:   49  clojure.core.protocols/iter-reduce
             protocols.clj:   75  clojure.core.protocols/fn
             protocols.clj:   75  clojure.core.protocols/fn
             protocols.clj:   13  clojure.core.protocols/fn/G
                  core.clj: 6886  clojure.core/transduce
                  core.clj: 6901  clojure.core/into
                  core.clj: 6889  clojure.core/into
               viewer.cljc:  422  nextjournal.clerk.viewer$describe/invokeStatic
               viewer.cljc:  366  nextjournal.clerk.viewer$describe/invoke
               viewer.cljc:  424  nextjournal.clerk.viewer$describe$fn__19341/invoke
                  core.clj: 7300  clojure.core/map-indexed/fn/fn
                  core.clj: 2881  clojure.core/take/fn/fn
                  core.clj: 2929  clojure.core/drop/fn/fn
             protocols.clj:   49  clojure.core.protocols/iter-reduce
             protocols.clj:   75  clojure.core.protocols/fn
             protocols.clj:   75  clojure.core.protocols/fn
             protocols.clj:   13  clojure.core.protocols/fn/G
                  core.clj: 6886  clojure.core/transduce
                  core.clj: 6901  clojure.core/into
                  core.clj: 6889  clojure.core/into
               viewer.cljc:  422  nextjournal.clerk.viewer$describe/invokeStatic
               viewer.cljc:  366  nextjournal.clerk.viewer$describe/invoke
               viewer.cljc:  424  nextjournal.clerk.viewer$describe$fn__19341/invoke
                  core.clj: 7300  clojure.core/map-indexed/fn/fn
                  core.clj: 2881  clojure.core/take/fn/fn
                  core.clj: 2929  clojure.core/drop/fn/fn
             ArraySeq.java:  116  clojure.lang.ArraySeq/reduce
                  core.clj: 6885  clojure.core/transduce
                  core.clj: 6901  clojure.core/into
                  core.clj: 6889  clojure.core/into
               viewer.cljc:  422  nextjournal.clerk.viewer$describe/invokeStatic
               viewer.cljc:  366  nextjournal.clerk.viewer$describe/invoke
               viewer.cljc:  372  nextjournal.clerk.viewer$describe/invokeStatic
               viewer.cljc:  366  nextjournal.clerk.viewer$describe/invoke
                  view.clj:  110  nextjournal.clerk.view/->result
                  view.clj:  109  nextjournal.clerk.view/->result
                  view.clj:  164  nextjournal.clerk.view/describe-block
                  view.clj:  151  nextjournal.clerk.view/describe-block
                  core.clj: 2635  clojure.core/partial/fn
                  core.clj: 2746  clojure.core/map/fn/fn
     PersistentVector.java:  343  clojure.lang.PersistentVector/reduce
                  core.clj: 6885  clojure.core/transduce
                  core.clj: 6901  clojure.core/into
                  core.clj: 6889  clojure.core/into
                  view.clj:  171  nextjournal.clerk.view/doc->viewer/fn
                  core.clj: 6185  clojure.core/update
                  core.clj: 6177  clojure.core/update
                  view.clj:  171  nextjournal.clerk.view/doc->viewer
                  view.clj:  167  nextjournal.clerk.view/doc->viewer
                  view.clj:  168  nextjournal.clerk.view/doc->viewer
                  view.clj:  167  nextjournal.clerk.view/doc->viewer
             webserver.clj:   80  nextjournal.clerk.webserver/update-doc!
             webserver.clj:   78  nextjournal.clerk.webserver/update-doc!
                 clerk.clj:  221  nextjournal.clerk/show!
                 clerk.clj:  208  nextjournal.clerk/show!
                      REPL:    1  kaggle/eval43131
                      REPL:    1  kaggle/eval43131
             Compiler.java: 7181  clojure.lang.Compiler/eval
             Compiler.java: 7136  clojure.lang.Compiler/eval
                  core.clj: 3202  clojure.core/eval
                  core.clj: 3198  clojure.core/eval

mkvlr 2022-02-15T11:44:24.129699Z

btw you can opt out of the clerk cache by setting the clerk.disable_cache system prop to a value that isn't false
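
A minimal sketch of the tip above. The property name `clerk.disable_cache` comes from the message; that Clerk reads it when its namespaces load, so that it has to be set beforehand (or passed as `-Dclerk.disable_cache=true` on the JVM command line), is an assumption here.

```clojure
;; Sketch: set the property before requiring nextjournal.clerk.
;; Assumption: Clerk reads the property at namespace load time, so setting it
;; later in an already running REPL may have no effect.
(System/setProperty "clerk.disable_cache" "true")

;; Per the message above, any value other than "false" opts out of the cache.
(def cache-disabled-prop (System/getProperty "clerk.disable_cache"))
```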

2022-02-15T11:53:12.289199Z

Thanks for the tip. My current notebook contains "training of a model", which is slow, so using Clerk at all only makes sense if caching is enabled. For this concrete issue, "cleaning the cache once" is good enough as a workaround. Nevertheless I think we need an in-memory cache, maybe even as the default. It seems to me that the nippy freeze / unfreeze has many issues and should not be the default.

mkvlr 2022-02-15T12:08:20.508259Z

that’s a strong statement 🙃

mkvlr 2022-02-15T12:08:36.448739Z

it does work incredibly well with regular Clojure data

mkvlr 2022-02-15T12:09:23.438219Z

there is an in-memory cache, I just need to fix an issue where it’s not used for things that aren’t nippy freezable which I’m doing right now

respatialized 2022-02-15T15:05:20.384749Z

@mkvlr an idea occurs to me about this issue, which I have also faced: could metadata be used to allow Clerk users to annotate values with their own caching functions (analogous to custom viewers)? certainly ordinary Clojure data is easy to persist using Nippy, but if Clerk's caching mechanism were extensible to other disk-backed formats (e.g. CSV/Arrow/etc) then you could potentially cache to a file format that makes sense for bigger things like TMD datasets (or images, etc).

mkvlr 2022-02-15T15:07:00.324899Z

@afoltzm yes! Something @jackrusher has also mentioned and definitely on the roadmap.

🎯 1
➕ 1
2022-02-15T15:20:22.417379Z

> that's a strong statement 🙃
Yes, bad wording on my side. I wanted to say that I still think it is technically impossible to build a serialisation system that is guaranteed to faithfully serialize / deserialize all (even unknown) potential JVM classes out of the box. So relying fully on it in Clerk without a fallback (= in-memory cache) seems dangerous. Nippy is a great library!!!

2022-02-15T15:21:32.546509Z

On my issue with TMD:

class [D cannot be cast to class [Ljava.lang.Object; ([D and
   [Ljava.lang.Object; are in module java.base of loader 'bootstrap')

2022-02-15T15:26:47.545119Z

Does this mean that "something" (Nippy?) converts object arrays into double arrays or the other way around? Or are we still seeing a "rendering" issue in Clerk? It goes away when cleaning caches (but comes back on the next JVM restart).
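
For readers unfamiliar with the names in that message: `[D` is the JVM's internal class name for `double[]` and `[Ljava.lang.Object;` is the name for `Object[]`. The sketch below only illustrates that a primitive double array is not an object array and cannot be cast to one. It is not the code path from the stack trace, and the var names are made up for the example.

```clojure
;; Illustration only: a primitive double array vs. an object array.
(def primitive-doubles (double-array [1.0 2.0 3.0]))
(def boxed-objects     (object-array [1.0 2.0 3.0]))

(def double-array-class-name (.getName (class primitive-doubles)))
(def object-array-class-name (.getName (class boxed-objects)))

;; Casting double[] to Object[] fails with a ClassCastException.
(def cast-failed?
  (try
    (cast (Class/forName "[Ljava.lang.Object;") primitive-doubles)
    false
    (catch ClassCastException _ true)))
```

So the exception says that some code expected an `Object[]` and was handed a `double[]` instead. Which component produced the mismatch is exactly the open question in the message above.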

mkvlr 2022-02-15T15:29:09.048379Z

do you have a small repro of the above error?

2022-02-15T15:34:10.994809Z

https://github.com/behrica/kaggleHP

2022-02-15T15:35:58.642389Z

doing: 1. (clerk/show! "src/kaggle.clj") 2. JVM restart 3. (clerk/show! "src/kaggle.clj")

2022-02-15T15:36:04.863499Z

should trigger it

mkvlr 2022-02-15T15:37:19.208869Z

thanks, I’ll try…

mkvlr 2022-02-15T15:37:51.414709Z

btw, you’re still mostly working with the file watcher, right?

mkvlr 2022-02-15T15:38:29.282279Z

can highly recommend trying with a hotkey for clerk/show! https://github.com/nextjournal/clerk#editor-workflow

2022-02-15T15:47:57.185209Z

Regarding "configurable caching": my TMD datasets are often nested inside a map, which seems to make targeted caching via annotations pretty hard. Viewers are simpler, as we control well what to view, while data structures are often just big maps. Another way to say this: The caching should be IMHO completely invisible even if this means that persistent caching (across restarts) is not possible. Thinking about the viewers is quite some work already; having to think about "how to store objects" should be avoided (especially as I don't yet have a use case for the persistence).

2022-02-15T15:50:44.248369Z

No, I switched to a hotkey in Emacs. Very good, indeed! "C-c c" does all of it 👍

mkvlr 2022-02-15T15:52:00.998189Z

> The caching should be IMHO completely invisible even if this means that persistent caching (across restarts) is not possible.
that’s how it works now or am I misunderstanding?

mkvlr 2022-02-15T15:52:31.942439Z

(well besides the bug you’re encountering 😹)

👍 1
mkvlr 2022-02-15T15:53:02.298849Z

do you know which cell it is that causes the failure?

mkvlr 2022-02-15T16:01:07.510479Z

seems to be this one

(def test-data
  (load-hp-data "test.csv.gz"))

mkvlr 2022-02-15T16:32:59.615149Z

this seems to be a more minimal repro

(ns kaggle-min
  (:require [nextjournal.clerk :as clerk]
            [tablecloth.api :as tc]))



(defn load-hp-data [file]
  (println "load a file : " file)
  (-> (tc/dataset file {:key-fn keyword})

      (tc/convert-types (zipmap [:BedroomAbvGr
                                 :BsmtFullBath
                                 :BsmtHalfBath
                                 :Fireplaces
                                 :FullBath
                                 :GarageCars
                                 :HalfBath
                                 :KitchenAbvGr
                                 :OverallCond
                                 :OverallQual
                                 :MoSold
                                 :TotRmsAbvGrd
                                 :MSSubClass]
                                (repeat :string)))))

(def df (load-hp-data "train.csv.gz"))

(defn ->table [df]
  (clerk/table {:head (tc/column-names df)
                :rows (tc/rows df :as-seqs)}))


;;  # The data
^{::clerk/width :full}
(->table df)

respatialized 2022-02-15T16:56:25.302869Z

In my personal experience as a data science practitioner, I'd say if you're working with models and datasets that are expensive to retrain/recompute, you probably are going to need to think explicitly about how serializing to disk fits into a workflow at some point. I don't see why Clerk asking its users to be explicit about custom caching is really that different than a lot of work that's already a big part of the data science lifecycle. I think Clerk is a very useful library, but expecting it to work well with heavyweight computation like that without user-supplied configuration may be asking too much of it. My personal preference is for flexible configuration via programming that may require more upfront effort than configuration via settings that is more brittle.

➕ 2
2022-02-15T23:42:51.287479Z

That is for sure true. There is a moment where exploration becomes engineering, and then Clerk is the wrong tool. But this line can be dynamic, and I think Clerk should support the use case where a single form's computation may take 1 minute. And that minute I only want to wait if really needed (after changes to relevant code). And I think (and it seems so) that Clerk can deliver that via caching.

👍 1
2022-02-16T19:29:24.846139Z

The issue above is clearly related to caching. When I disable nippy cache, it goes away.

2022-02-18T12:31:22.027259Z

@mkvlr I found the root cause of the exception

Unhandled java.lang.ClassCastException
   class [D cannot be cast to class [Ljava.lang.Object; ([D and
   [Ljava.lang.Object; are in module java.base of loader 'bootstrap')

2022-02-18T12:31:58.790159Z

it happens when Clerk calls on a dataset:

(#{:nextjournal/missing} df)
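
As background on that expression: a Clojure set called as a function looks its argument up in the set, and a hash-set lookup has to hash the argument first. The sketch below shows this with a made-up probe type (`HashProbe`, `hash-calls`, and `missing-marker` are hypothetical names for the example). Whether hashing is the step that fails for the dataset is not confirmed here; it is only one plausible reading of why such a harmless-looking call touches the dataset's contents.

```clojure
;; A set used as a function returns the element if present, otherwise nil.
(def missing-marker #{:nextjournal/missing})

(def hit  (missing-marker :nextjournal/missing)) ; the keyword itself
(def miss (missing-marker {:a [1.0 2.0]}))       ; nil, the map is not in the set

;; Hypothetical probe type that counts how often it is asked for its hash.
(deftype HashProbe [calls]
  clojure.lang.IHashEq
  (hasheq [_]
    (swap! calls inc)
    (hash "probe")))

(def hash-calls (atom 0))

;; The lookup returns nil, but the probe is hashed along the way.
(def probe-result (missing-marker (HashProbe. hash-calls)))
```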

2022-02-18T12:50:20.651899Z

The same exception happens when doing this:

(clerk/eval-string "df")

2022-02-18T12:51:23.356119Z

So does this mean that the nippy caching somehow changes the dataset object and makes it "inconsistent" or something like this?

2022-02-18T13:35:03.321779Z

It is not a problem of Clerk. https://github.com/techascent/tech.ml.dataset/issues/287