2023-05-25
Channels
- # clerk (21)
Are there recommendations for plotting large datasets, e.g. >100K data points? In my case I'm typically doing some sort of binning into 1D or 2D histograms when plotting data this large. Vega-Lite can do this for me, but it's pretty slow with data that large. I could do the binning myself on the JVM side and send only the binned data over for plotting, which should help. I'm working on an implementation for that now, but I was wondering (1) whether there are other suggested routes for larger datasets, and (2) whether routines already exist for these operations (e.g. binning)
I usually manipulate the dataset on the JVM side and send the processed data to VL.
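For example (a hypothetical sketch with made-up data; only the ten bin counts cross the wire to the browser):

(require '[nextjournal.clerk :as clerk])

(let [data   (repeatedly 100000 rand)
      ;; bin on the JVM: ten equal-width bins over [0, 1)
      binned (frequencies (map #(long (* % 10)) data))]
  (clerk/vl
   {:data {:values (for [[bin c] (sort binned)] {:bin bin :count c})}
    :mark "bar"
    :encoding {:x {:field "bin" :type "ordinal"}
               :y {:field "count" :type "quantitative"}}}))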
yea makes sense
I'm already using libpython for other things, so I was wondering - can I display matplotlib images easily?
Matplotlib might require a custom viewer on the Clerk side to display an image serialized as bytes.
Perhaps save a PNG with matplotlib, then
(clerk/html
 [:img {:src "img/myplot.png"}])
?
https://github.com/generateme/cljplot can be very helpful for backend-side rendering. On the client side, Bokeh.js is supposed to be clever about big datasets; I don't think we have a good Clojure adapter for it yet. If you are passing large datasets to the browser for Vega, then picking a relatively efficient data format (https://vega.github.io/vega-lite/docs/data.html) might help: CSV will tend to be smaller than JSON.
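For example, inline CSV data in a Vega-Lite spec looks roughly like this (a sketch with made-up values; Vega-Lite accepts a CSV string for :values when :format declares the type):

(clerk/vl
 {:data {:values "x,y\n0,1\n1,3\n2,2"   ;; CSV text instead of a seq of maps
         :format {:type "csv"}}
  :mark "line"
  :encoding {:x {:field "x" :type "quantitative"}
             :y {:field "y" :type "quantitative"}}})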
great, thanks all for the suggestions! I'll do some experimenting and see what works well
It should work without a custom viewer: use the right matplotlib code to get the image as bytes, and libpython-clj will convert them to JVM bytes, which can be shown via Clerk's image support: https://github.clerk.garden/nextjournal/book-of-clerk/commit/160f7aaaa6c4a30c1ba53a35cb888095ac8f64ce/#images
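Something along these lines might work (a rough, untested sketch assuming libpython-clj2; the python-bytes-to-byte-array conversion in particular is an assumption):

(require '[libpython-clj2.require :refer [require-python]]
         '[libpython-clj2.python :as py])

(require-python '[matplotlib :as mpl])
(mpl/use "Agg")                         ;; headless backend, set before pyplot
(require-python '[matplotlib.pyplot :as plt]
                '[io :as pyio])

(let [buf (pyio/BytesIO)]
  (plt/plot [0 1 2 3] [3 1 4 1])
  (plt/savefig buf :format "png")
  ;; python bytes -> JVM byte array -> BufferedImage, which Clerk can render
  (javax.imageio.ImageIO/read
   (java.io.ByteArrayInputStream.
    (py/->jvm (py/py. buf getvalue)))))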
Okay, so cljplot led me to how they use fastmath for histograms, which provides data for 1D histograms pretty trivially, so I just needed to do something for 2D. Using the source for histograms in fastmath, I came up with this:
;; assumes: (require '[fastmath.core :as fm]
;;                   '[fastmath.stats :as stats])
(defn- search-array
  ;; compute bin edges for one dimension: returns step size, bin count,
  ;; extent, and a sorted double-array of left bin edges for binary search
  ([vs] (search-array vs nil))
  ([vs bins]
   (let [[mn mx] (stats/extent vs)
         ^long bins (stats/estimate-bins vs bins)
         diff (- mx mn)
         bins (if (zero? diff) 1 bins)
         step (/ diff bins)]
     {:step step,
      :size bins,
      :min mn,
      :max mx,
      :search-array (double-array (butlast
                                   (fm/slice-range mn mx (inc bins))))})))
(defn hist2d-data
  ;; bin the paired sequences vs and zs into a 2D grid of counts
  ([vs zs] (hist2d-data vs zs nil))
  ([vs zs bins]
   (let [sv (search-array vs bins)
         sz (search-array zs bins)
         zbins (-> sz :search-array count)
         bins (* zbins (-> sv :search-array count))
         buff (long-array bins)]
     ;; for each point, binary-search each axis for its bin and bump the count
     (doseq [[v z] (map vector vs zs)]
       (let [bv (java.util.Arrays/binarySearch ^doubles (:search-array sv) v)
             bz (java.util.Arrays/binarySearch ^doubles (:search-array sz) z)
             ;; binarySearch returns (- (inc insertion-point)) on a miss,
             ;; so (abs (+ result 2)) recovers the bin index
             ^int vpos (if (neg? bv) (fm/abs (+ bv 2)) bv)
             ^int zpos (if (neg? bz) (fm/abs (+ bz 2)) bz)
             pos (+ (* zbins vpos) zpos)]
         (fastmath.java.Array/inc ^longs buff pos)))
     {:size bins,
      :x (dissoc sv :search-array),
      :y (dissoc sz :search-array),
      :bins (map vector
                 (for [v (:search-array sv) z (:search-array sz)] [v z])
                 buff)})))
;; render the binned grid as Vega-Lite rect marks; empty bins are painted white
(defn hist2d
  ([vs zs] (hist2d vs zs nil))
  ([vs zs bins]
   (let [{:keys [x y bins]} (hist2d-data vs zs bins)
         xstep (:step x)
         ystep (:step y)]
     (clerk/vl
      {:$schema "https://vega.github.io/schema/vega-lite/v5.json",
:config {:view {:stroke "transparent"}},
:data {:values
(for [[[xx yy] c] bins]
{:x xx, :x2 (+ xx xstep), :y yy, :y2 (+ yy ystep), :count c})},
:encoding {:color {:field "count",
:type "quantitative",
:condition {:test "datum['count'] == 0",
:value "white"}},
:x {:field "x",
:type "quantitative",
:axis {:grid false},
:scale {:domain [(:min x) (:max x)]}},
:x2 {:field "x2", :type "quantitative"},
:y {:field "y",
:type "quantitative",
:axis {:grid false},
:scale {:domain [(:min y) (:max y)]}},
:y2 {:field "y2", :type "quantitative"}},
:mark {:type :rect}}))))
;; sample data; assumes (require '[clojure.math :as math]
;;                               '[fastmath.random :as rand])
(def n 500001)
(def x (map double (range n)))
;; a sine wave with added Gaussian noise
(def y
(map-indexed (fn [i nn]
(-> i
(/ n)
(* math/PI 2)
math/sin
(+ (* nn 0.2))))
(rand/->seq (rand/distribution :normal) n)))
^{:nextjournal.clerk/visibility {:result :show}} (hist2d x y)
^{:nextjournal.clerk/visibility {:result :show}}
(clerk/vl {:$schema "https://vega.github.io/schema/vega-lite/v5.json",
:config {:view {:stroke "transparent"}},
:data {:values (for [[xx yy] (map vector x y)] {:x xx, :y yy})},
:encoding {:color {:aggregate "count", :type "quantitative"},
:x {:bin true, :field "x", :type "quantitative"},
:y {:bin true, :field "y", :type "quantitative"}},
:mark {:type :rect}})
which works basically as intended (modulo defaults used by Vega-Lite). Although now that I have the implementation, I'm not sure how much faster it really is, and I'm no longer sure the Vega-Lite plots were what was actually slowing me down. So I guess my new question is: what are some good patterns for benchmarking individual cells of a Clerk notebook? It seems especially tricky with VL plots, where some part of the runtime happens in the browser (if I understand how they work).
@U7CAHM72M I'll have to try out getting a byte stream from matplotlib at some point. it would be good to have an escape hatch if I just need to make a plot quickly.
> not super sure anymore whether the vegalite plots were what was actually slowing me down
First rule of performance optimization: measure, don’t guess.
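For the JVM side, criterium gives solid numbers (a sketch assuming criterium is on the classpath; note it won't capture browser-side rendering):

(require '[criterium.core :as crit])

;; measure only the JVM-side binning, independent of Clerk and the browser
(crit/quick-bench (hist2d-data x y))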
Okay, separating each option (JVM- vs browser-side) into separate notebooks, there does appear to be some speed improvement using the JVM-side operations. But I don't know how to get a quantitative assessment, because the extended runtime happens in the browser after the clerk/show! functions have returned, while the JavaScript is still computing on the data. Is there a good way to query whether the browser page has finished rendering?
(Holiday weekend in Germany, just back from camping) You can get timings for transmitting and rendering using the browser tools. We might add some more affordances to make this easier in the future, as we feel that viewer authoring is not yet as easy as it should be!
awesome, thanks!
I'd also appreciate some guidelines on how to combine large datasets with Clerk's support for reactive dataflow programming. When I wrote about rainbow tables (https://github.clerk.garden/teodorlu/clerk-stuff/commit/7bd85d28726a0f166d8f4952b0dbf70936531b3e/src/rainbow_tables.html), I got myself into several weird problems. I didn't find a good way to "just work in the Clerk document" while still avoiding regenerating the rainbow tables unnecessarily.
If I were to write the rainbow tables text again, I'd probably isolate the "generate rainbow tables" part completely from the "use rainbow tables to look up passwords" part. Then I'd use the REPL only for generating the tables, and Clerk for using them.
But if I generate rainbow-tables.sqlite from a (comment ,,,) form, how will Clerk Garden know what to do?
In this case, perhaps the answer is "Please don't use Nextjournal Clerk Garden infrastructure for rainbow table generation, that's not what it's for. We also don't allow bitcoin mining."
But I think the general question stands. I'd like to do a compute-heavy thing once, and also have a snappy Clerk experience. In my opinion, it makes sense to cover something like that in the Clerk book.
We definitely don’t allow bitcoin mining on our hardware, but generating a table of passwords in a pedagogical text is entirely within scope 😆
I find this code very strange, btw — going the long way around and making things hard for itself without a clear reason why. Leaving that aside, you could use (for example) defonce for something you want to run once, or — if it's something you want to manually re-trigger — consider using an atom like recompute-table to wrap a boolean that is referenced in your re-computation code, as in the sketch below. (Ultimately, it's just Clojure code!)
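A rough sketch of that pattern (build-table is a stand-in for the expensive computation, not a real function from the original code):

;; run the expensive step once per JVM session, surviving re-evaluations
(defonce rainbow-table (build-table))

;; or guard re-computation behind a boolean you control from the REPL
(defonce recompute-table (atom false))

(def rainbow-table-2
  (when @recompute-table   ;; flip the atom to true, then re-evaluate to rebuild
    (build-table)))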
@U3X7174KS A quick code review:
• why use an external process to compute the SHA-1 when that algorithm is built-in?
• why use a hex (base 16) text encoding of the hash (40 characters) instead of base64 (28 characters)?
• why use an external SQL database to store a small lookup table?
• consider:
(defn sha1sum-digest [password]
  (->> (.digest
        (doto (java.security.MessageDigest/getInstance "SHA-1")
          (.update (.getBytes password "UTF-8"))))
       (.encodeToString (java.util.Base64/getEncoder))))
(defn alphabet->lookup-table [alphabet]
  ;; map digest -> password for every 3-character password over the alphabet
  (reduce #(assoc %1 (sha1sum-digest %2) %2)
          {}
          (for [a alphabet
                b alphabet
                c alphabet]
            (str a b c))))
(defonce rainbow-table
(alphabet->lookup-table "abceot"))
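Usage, for instance:

;; reverse a hash by lookup (nil if the password isn't in the table)
(rainbow-table (sha1sum-digest "cat"))
;; => "cat"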
Good questions.
> why use an external process to compute the SHA-1 when that algorithm is built-in?
I had never used java.security before. Cutting the dependency on the sha1sum binary is great. Thanks!
> why use a hex (base 16) text encoding of the hash (40 characters) instead of base64 (28 characters)?
No good answer here either. sha1sum gave me base 16 by default. Saving space is great too.
> why use an external SQL database to store a small lookup table?
I originally built a ~10 MB SQLite database, but cut back on the size when I realized the rainbow table might get recomputed on each deploy. After cutting down the lookup table size, I didn't reconsider whether I still needed files on disk 🙂
---
Thanks a lot for the feedback! I rolled your suggestions into a new version (https://github.clerk.garden/teodorlu/lab/commit/12c6c2518c23bd9899f85cd456e22386c172a4f7/src/rainbow_tables_2.html). Ditching SQLite allowed me to delete about half the code and fixed my previous issues with updates. I feel like it's easier to read about rainbow tables now that there's less SQLite.
I still want to experiment with workflows for big datasets (1 MB to 1 GB) with Clerk. defonce + a manual REPL trigger sounds like it's going to work!