babashka

borkdude 2025-11-12T13:58:19.100739Z

I'm at the Conj and also brought babashka, clj-kondo and squint stickers. Feel free to approach me at any time to ask for a sticker or just chat!

❤️ 8
leifericf 2025-11-12T07:52:11.239479Z

With Babashka, is it somehow possible to read the contents of .zip archives without unzipping them to disk? For example, by unzipping to memory or “streaming” the file contents inside the .zip for downstream processing.

teodorlu 2025-11-26T11:41:04.598969Z

I wrote this little thing yesterday:

(defn slurp-from-zip
  "From zip file <zip>, slurp contained file <entry>"
  [zip entry]
  (let [tempdir (fs/create-temp-dir)]
    (fs/delete-on-exit tempdir)
    (fs/unzip zip tempdir)
    (fs/list-dir tempdir)
    (slurp (fs/file tempdir entry))))
Saves you manual cleanup, fs/delete-on-exit will remove unzipped files when the JVM shuts down. Return something else than just the one file if you want. No idea if this fits your desired performance characteristics. If it was me, I'd try the simple route first, then tune for performance if the simple route is too slow.

👀 1
leifericf 2025-11-26T13:56:25.750049Z

Nice one! For that to work, do I need to know the name of the file within the archive in advance? In my case, each archive contains .csv files, but the names are not consistent/known in advance. In most cases there's only one .csv file in there. I need to slurp all .csv files within the archives, essentially.

leifericf 2025-11-26T13:57:29.262809Z

Ah, I see now. Nice idea of "containing" the temp dir.

borkdude 2025-11-26T13:57:39.508349Z

I once had a project where I had to process zip files that had nested zip files with csv in them

😅 2
leifericf 2025-11-26T14:01:15.223589Z

For the curious, I'm basically doing this:
1. Scrape a web page to find all .csv and .zip file links.
   ◦ The data provider used to provide data in archives, but stopped doing that at some point.
2. Download all new files (skip files already on disk, which were previously downloaded).
3. Unzip all the .zip files (skip the previously unzipped ones).
4. Slurp all the .csv files and check for consistency (the format has changed over time).
5. Wrangle the data from all the .csv files into the same format.
6. Transact the facts into a Datomic database.

It's sort of like an ETL job, now that I think about it. Here's the working code (I'm at step 4):

(ns bysykkel
  (:require [babashka.fs :as fs]
            [babashka.http-client :as http]
            [clojure.java.io :as io]
            [clojure.string :as str]))

(defn get-file-urls [file-ext web-page]
  (let [pattern (re-pattern (str "https?://[^\\s\"']+\\" file-ext "(?:\\.zip)?"))]
    (->> (http/get web-page)
         :body
         (re-seq pattern)
         distinct
         sort)))

(defn url->path [parent-dir url]
  (let [parts (str/split url #"/")
        filename (last parts)]
    (if-let [[_ year] (re-matches #".*/(\d{4})/.*" url)]
      (str parent-dir year "/" filename)
      (str parent-dir filename))))

(defn ensure-dir [path]
  (when-let [parent (fs/parent path)]
    (when-not (fs/exists? (fs/parent path))
      (fs/create-dirs parent))))

(defn download-file [dir url]
  (let [path (url->path dir url)
        file (fs/file path)]
    (when-not (fs/exists? file)
      (ensure-dir path)
      (println "Downloading file:" file)
      (with-open [out (io/output-stream file)]
        (io/copy (:body (http/get url {:as :stream})) out)))))

(defn unzip-file [path]
  (-> path
      (fs/unzip (fs/parent path) {:replace-existing true})))

(comment
  (->> ""
       (get-file-urls ".csv")
       (pmap #(download-file "data/raw/" %))
       doall)

  (->> (fs/glob "data" "**/*.zip")
       (pmap unzip-file)
       doall))
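
Step 4 in the list above (checking the .csv files for consistency) could start with something like the sketch below. It assumes every file has its column names on the first line and simply groups files by that header line, so a format change shows up as an extra group. The names `csv-header` and `files-by-header` are made up for illustration.

```clojure
(require '[babashka.fs :as fs]
         '[clojure.java.io :as io])

(defn csv-header
  "Returns the first line of the CSV file at `path`, or nil for an empty file."
  [path]
  (with-open [rdr (io/reader (fs/file path))]
    ;; `first` forces the first line before the reader is closed
    (first (line-seq rdr))))

(defn files-by-header
  "Groups all .csv files under `dir` by their header line.
  Returns a map of header line -> vector of paths."
  [dir]
  ;; \"**.csv\" also matches files directly in `dir`, not only in subdirectories
  (group-by csv-header (fs/glob dir "**.csv")))
```

Calling `(keys (files-by-header "data"))` would then list every distinct header seen so far, which is one way to spot where the format changed.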

leifericf 2025-11-26T14:25:28.867979Z

This is primarily a learning project to learn Datomic with some real-world data 🙂 I'm just learning other ETL-ish stuff on the way.

teodorlu 2025-11-26T14:38:10.470029Z

> Nice one! For that to work, do I need to know the name of the file within the archive in advance?

fs/list-dir is your friend, then! Or fs/glob if there are folders inside the zip. Here's a starting point:

(defn zipfile->map
  [zip]
  (let [tempdir (fs/create-temp-dir)]
    (fs/delete-on-exit tempdir)
    (fs/unzip zip tempdir)
    (into {}
          (comp (filter fs/regular-file?)
                (map slurp))
          (fs/list-dir tempdir))))

💡 1
👀 1
leifericf 2025-11-26T15:01:55.014379Z

Cool! I didn't think of that. I had to change it a bit because (I think) your version was trying to slurp a file path directly, and into didn't work correctly because slurp returned a string (not a key-value pair). I got this working:

(defn zipfile->map
  [zip]
  (let [tempdir (fs/create-temp-dir)]
    (fs/delete-on-exit tempdir)
    (fs/unzip zip tempdir)
    (into {}
          (comp (filter fs/regular-file?)
                (map (fn [f] [(fs/file-name f) (slurp (fs/file f))])))
          (fs/list-dir tempdir))))
Thanks for the pointers, @teodorlu!

💯 1
leifericf 2025-11-26T15:02:41.847059Z

But it's slow as hell even on a single .zip file 😅 (Trying to figure out why)

teodorlu 2025-11-26T15:07:50.055709Z

• How big is the zip file before extraction?
• How big after?
• How many files were extracted?

leifericf 2025-11-12T08:08:57.176879Z

Context: I'm downloading thousands of tiny .zip archives, each containing .csv files with data. Sometimes these data files are not in a .zip archive because the producer is inconsistent. I have some logic to determine whether a file is a .zip file, and if so, unzip it to a temp area. Then, I merge all the .csv files into a single large dataset, which I intend to load into Datomic for more flexible data analytics. Being able to read the .zip files without unzipping to disk would likely speed things up and allow me to remove some code that checks whether the file already exists (whether it has already been unzipped), and clean up the /tmp directory after each run.

2025-11-12T08:42:15.673029Z

perhaps not quite what you are looking for, but fwiw there is a way to get babashka.fs to use a function to decide whether to extract a particular zip entry: https://github.com/babashka/fs/blob/57c089b5f4ff97f343e1f7f7fa04e4134104d1ea/API.md#unzip

👀 1
teodorlu 2025-11-12T08:55:04.745879Z

use the platform! Java has classes for this.

$ bb
Babashka v1.12.208 REPL.
Use :repl/quit or :repl/exit to quit the REPL.
Clojure rocks, Bash reaches.

user=> (import 'java.util.zip.ZipInputStream)
java.util.zip.ZipInputStream
I found the class reading https://www.baeldung.com/java-compress-and-uncompress.

💡 1
👀 1
leifericf 2025-11-12T08:58:10.906499Z

Ah! Good tip, @teodorlu. Thanks! I restricted myself to babashka.fs/unzip and forgot to consider Java might have some other stuff.

Tomas Brejla 2025-11-12T08:59:47.850279Z

yup, ZipInputStream works for this use-case

👀 1
teodorlu 2025-11-12T09:06:19.104709Z

Easy to forget! All of babashka.fs generally just calls Java classes. babashka.fs/unzip also uses these classes directly, specifically java.util.zip.ZipInputStream. The implementation is readable: https://github.com/babashka/fs/blob/ab826ff7e073dbd55842a66e04ed93ce0eee1e9b/src/babashka/fs.cljc#L977-L1005

🙌 1
borkdude 2025-11-12T12:17:31.553189Z

@leif.eric.fredheim Java interop is available in bb for this, but you could perhaps also use:

* `:extract-fn` - function that decides if the current ZipEntry
     should be extracted. The function is only called for the file case
     (not directories) with a map with entries:
     * `:entry` and the current ZipEntry
     * `:name` and the name of the ZipEntry (result of calling `getName`)
     Extraction only occurs if a truthy value is returned (i.e. not
     nil/false)."
maybe
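
For reference, the `:extract-fn` option quoted above might be used like this to pull only the .csv entries out of an archive. This is a sketch, assuming a babashka.fs version that already ships `:extract-fn`; the wrapper name `unzip-csv-only` is hypothetical. Note that it still writes the selected entries to disk.

```clojure
(require '[babashka.fs :as fs]
         '[clojure.string :as str])

(defn unzip-csv-only
  "Extracts only the .csv entries of `zip` into `dest-dir`.
  Other entries in the archive are skipped."
  [zip dest-dir]
  (fs/unzip zip dest-dir
            {:replace-existing true
             ;; called with a map containing :entry and :name for each file entry;
             ;; a truthy return value means the entry is extracted
             :extract-fn (fn [{:keys [name]}]
                           (str/ends-with? (str/lower-case name) ".csv"))}))
```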

🙌 1
borkdude 2025-11-12T12:18:00.186079Z

although that will extract stuff to disk, so then yes, Java interop is what you need
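
A minimal sketch of the Java interop route, reading entries straight into memory with `java.util.zip.ZipInputStream` so nothing is written to disk. It assumes the entries are UTF-8 text small enough to hold in memory; the function name `zip-entries->map` and the `keep?` predicate are made up for illustration and are not part of babashka.fs.

```clojure
(ns zip-in-memory
  (:require [clojure.java.io :as io]
            [clojure.string :as str])
  (:import [java.io ByteArrayOutputStream]
           [java.util.zip ZipInputStream]))

(defn zip-entries->map
  "Reads the file entries of the zip at `zip-path` whose name satisfies
  `keep?` into memory. Returns a map of entry name -> contents (UTF-8 string)."
  ([zip-path] (zip-entries->map zip-path (constantly true)))
  ([zip-path keep?]
   (with-open [zis (ZipInputStream. (io/input-stream zip-path))]
     (loop [acc {}]
       (if-let [entry (.getNextEntry zis)]
         (let [entry-name (.getName entry)]
           (if (and (not (.isDirectory entry)) (keep? entry-name))
             (let [baos (ByteArrayOutputStream.)]
               ;; io/copy reads until the end of the current entry and leaves
               ;; the stream open; slurp would close it after the first entry
               (io/copy zis baos)
               (recur (assoc acc entry-name (String. (.toByteArray baos) "UTF-8"))))
             (recur acc)))
         acc)))))
```

For the use case in this thread, something like `(zip-entries->map "some.zip" #(str/ends-with? % ".csv"))` would return only the .csv contents, keyed by their name inside the archive.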

leifericf 2025-11-12T12:23:54.143079Z

Thanks for the advice! My current unzip function looks like this:

(defn unzip-file [path]
  (-> path
      (fs/unzip (fs/parent path))))
Very simple and easy with Babashka! The only thing that took me a minute was figuring out how to unzip to the same directory as the .zip file was located. That's why I wrapped it in a function like that. I'm calling it like this:
(doall
   (->> (fs/glob "data" "**/*.zip")
        (pmap unzip-file)))

👍 1
leifericf 2025-11-12T12:25:23.933639Z

I naively expected the default location to be the same as the .zip (as when you unzip manually via a GUI tool)

leifericf 2025-11-12T12:25:32.455459Z

But it's actually "`.`" (the root of my project)

borkdude 2025-11-12T12:25:46.800989Z

nice! do note that this will only extract zip files from sub directories, not the current directory. in that case you'll need (fs/glob "data" "**.zip") to include the current dir

👍 1
leifericf 2025-11-12T12:26:24.290509Z

Oh! Hahaha, of course 🤦‍♂️
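
The difference between the two glob patterns can be checked with a small throwaway directory. This is a sketch; `zip-names` and `demo-root` are hypothetical names used only for the demonstration.

```clojure
(require '[babashka.fs :as fs])

(defn zip-names
  "Returns the set of file names under `root` matched by glob `pattern`."
  [root pattern]
  (set (map fs/file-name (fs/glob root pattern))))

;; Throwaway layout: <tmp>/top.zip and <tmp>/2024/nested.zip
(def demo-root (fs/create-temp-dir))
(fs/create-dirs (fs/path demo-root "2024"))
(fs/create-file (fs/path demo-root "top.zip"))
(fs/create-file (fs/path demo-root "2024" "nested.zip"))

;; "**/*.zip" requires a directory separator, so it only sees subdirectories
(println (zip-names demo-root "**/*.zip"))

;; "**.zip" also matches files directly under the root
(println (zip-names demo-root "**.zip"))
```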

borkdude 2025-11-12T12:27:32.184319Z

yeah, zip preserves the structure within the zip file, I think I mimicked that behavior of the command line zip program, it can be surprising

borkdude 2025-11-12T12:28:51.182149Z

so if the zip file contains foo/bar.txt it will create the foo directory in the current working directory

👍 1
borkdude 2025-11-12T12:30:30.329079Z

these are symmetric:

borkdude@m1-5 /tmp $ bb -e '(fs/zip "dude.zip" "dude")'
borkdude@m1-5 /tmp $ zipinfo dude.zip
Archive:  dude.zip
Zip file size: 280 bytes, number of entries: 2
-rw----     2.0 fat        0 bX defN 25-Nov-12 07:29 dude/
-rw----     2.0 fat        0 bX defN 25-Nov-12 07:29 dude/foo.txt
2 files, 0 bytes uncompressed, 4 bytes compressed:  0.0%
borkdude@m1-5 /tmp $ bb -e '(fs/unzip "dude.zip")'
----- Error --------------------------------------------------------------------
Type:     java.nio.file.FileAlreadyExistsException
Message:  ./dude/foo.txt
Location: NO_SOURCE_PATH:1:1

👀 1
leifericf 2025-11-12T12:45:12.751619Z

That's nice! I saw in the docs there's also an option to overwrite files if they already exist.

borkdude 2025-11-12T12:45:33.337909Z

yep

leifericf 2025-11-12T12:46:32.498399Z

Maybe it would be possible to add an "in-memory" option as well. But I guess it's a rare use case and not worthwhile since it can be done with Java.

leifericf 2025-11-12T12:48:22.167979Z

Probably 99% of the time people want to unzip to disk.

borkdude 2025-11-12T12:49:00.837929Z

take a look at that example

leifericf 2025-11-12T12:49:34.566099Z

Haha! Nice one. I need to get better at using ChatGPT for stuff like that. Good example.

borkdude 2025-11-12T12:59:30.999679Z

This is actually the working version on my system here:

(ns demo.zipfs
  (:import [java.nio.file FileSystems Files Paths]
           [java.net URI]
           [java.util Map])
  (:require [babashka.fs :as fs]))

;; Suppose we have a zip file "dude.zip" with an entry "dude/foo.txt"
(def zip-path "dude.zip")
(def entry-name "dude/foo.txt")

;; Open the zip file as a FileSystem
(with-open [fs (FileSystems/newFileSystem
                (URI/create (str "jar:file:" (.toAbsolutePath (Paths/get zip-path (into-array String [])))))
                (java.util.HashMap.))]
  (prn :fs fs)
  (let [entry-path (.getPath fs (str "/" entry-name) (into-array String []))]
    ;; Read the entry as a string
    (println "Contents of" entry-name ":")
    (println (slurp (Files/newInputStream entry-path (into-array java.nio.file.OpenOption []))))))

;; fs is automatically closed here
With the above zip file

👀 1
borkdude 2025-11-12T13:00:54.985219Z

I guess babashka.fs path could implement support for ZipFileSystem so you can write:

(.getPath fs (str "/" entry-name) (into-array String []))
=>
(fs/path fs entry-name)

borkdude 2025-11-12T13:02:08.986819Z

also:

(Files/newInputStream entry-path (into-array java.nio.file.OpenOption []))
could perhaps be supported better in fs. Maybe the whole above example could be nicer in bb fs :)

👀 1