I'm at the Conj and also brought babashka, clj-kondo and squint stickers. Feel free to approach me at any time to ask for a sticker or just chat!
With Babashka, is it somehow possible to read the contents of .zip archives without unzipping them to disk? For example, by unzipping to memory or "streaming" the file contents inside the .zip for downstream processing.
I wrote this little thing yesterday:
(defn slurp-from-zip
"From zip file <zip>, slurp contained file <entry>"
[zip entry]
(let [tempdir (fs/create-temp-dir)]
(fs/delete-on-exit tempdir)
(fs/unzip zip tempdir)
(fs/list-dir tempdir)
(slurp (fs/file tempdir entry))))
Saves you manual cleanup, fs/delete-on-exit will remove unzipped files when the JVM shuts down. Return something else than just the one file if you want.
No idea if this fits your desired performance characteristics. If it was me, I'd try the simple route first, then tune for performance if the simple route is too slow.
Nice one! For that to work, do I need to know the name of the file within the archive in advance?
In my case, each archive contains .csv files, but the names are not consistent/known in advance. In most cases there's only one .csv file in there.
I need to slurp all .csv files within the archives, essentially.
Ah, I see now. Nice idea of "containing" the temp dir.
I once had a project where I had to process zip files that had nested zip files with csv in them
For the curious, I'm basically doing this:
1. Scrape a web page to find all .csv and .zip file links.
   - The data provider used to provide data in archives, but stopped doing that at some point.
2. Download all new files (skip files already on disk, which were previously downloaded).
3. Unzip all the .zip files (skip the previously unzipped ones).
4. Slurp all the .csv files and check for consistency (the format has changed over time).
5. Wrangle the data from all the .csv files into the same format.
6. Transact the facts into a Datomic database.
It's sort of like an ETL job, now that I think about it.
Here's the working code (I'm at step 4):
(ns bysykkel
(:require [babashka.fs :as fs]
[babashka.http-client :as http]
[clojure.java.io :as io]
[clojure.string :as str]))
(defn get-file-urls [file-ext web-page]
(let [pattern (re-pattern (str "https?://[^\\s\"']+\\" file-ext "(?:\\.zip)?"))]
(->> (http/get web-page)
:body
(re-seq pattern)
distinct
sort)))
(defn url->path [parent-dir url]
(let [parts (str/split url #"/")
filename (last parts)]
(if-let [[_ year] (re-matches #".*/(\d{4})/.*" url)]
(str parent-dir year "/" filename)
(str parent-dir filename))))
(defn ensure-dir [path]
(when-let [parent (fs/parent path)]
(when-not (fs/exists? (fs/parent path))
(fs/create-dirs parent))))
(defn download-file [dir url]
(let [path (url->path dir url)
file (fs/file path)]
(when-not (fs/exists? file)
(ensure-dir path)
(println "Downloading file:" file)
(with-open [out (io/output-stream file)]
(io/copy (:body (http/get url {:as :stream})) out)))))
(defn unzip-file [path]
(-> path
(fs/unzip (fs/parent path) {:replace-existing true})))
(comment
(->> ""
(get-file-urls ".csv")
(pmap #(download-file "data/raw/" %))
doall)
(->> (fs/glob "data" "**/*.zip")
(pmap unzip-file)
doall))
This is primarily a learning project to learn Datomic with some real-world data. I'm just learning other ETL-ish stuff on the way.
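For step 4 in the list above (checking the .csv files for consistency), one possible starting point is to group the files by their header row, so each distinct format shows up as one group. This is only a sketch: `header-line` and `files-by-header` are hypothetical helper names, and it assumes the first line of every file is the header.

```clojure
(ns csv-headers-sketch
  (:require [babashka.fs :as fs]
            [clojure.java.io :as io]))

(defn header-line
  "Returns the first line of the file at path (assumed to be the CSV header)."
  [path]
  (with-open [rdr (io/reader (fs/file path))]
    (first (line-seq rdr))))

(defn files-by-header
  "Groups every .csv file under root (recursively) by its header line.
  Returns a map of header string -> vector of paths."
  [root]
  (->> (fs/glob root "**.csv")
       (group-by header-line)))
```

Calling `(keys (files-by-header "data"))` would then list each header variant that step 5 has to wrangle into a common format.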
> Nice one! For that to work, do I need to know the name of the file within the archive in advance?
fs/list-dir is your friend, then! Or fs/glob if there's folders inside the zip. Here's a starting point:
(defn zipfile->map
[zip]
(let [tempdir (fs/create-temp-dir)]
(fs/delete-on-exit tempdir)
(fs/unzip zip tempdir)
(into {}
(comp (filter fs/regular-file?)
(map slurp))
(fs/list-dir tempdir))))
Cool! I didn't think of that. I had to change it a bit because (I think) your version was trying to slurp a file path directly, and into didn't work correctly because slurp returned a string (not a key-value pair).
I got this working:
(defn zipfile->map
[zip]
(let [tempdir (fs/create-temp-dir)]
(fs/delete-on-exit tempdir)
(fs/unzip zip tempdir)
(into {}
(comp (filter fs/regular-file?)
(map (fn [f] [(fs/file-name f) (slurp (fs/file f))])))
(fs/list-dir tempdir))))
Thanks for the pointers, @teodorlu!
But it's slow as hell even on a single .zip file
(Trying to figure out why)
• How big is zip file before extraction?
• How big after?
• How many files were extracted?
Context: I'm downloading thousands of tiny .zip archives, each containing .csv files with data. Sometimes these data files are not in a .zip archive because the producer is inconsistent. I have some logic to determine whether a file is a .zip file, and if so, unzip it to a temp area.
Then, I merge all the .csv files into a single large dataset, which I intend to load into Datomic for more flexible data analytics.
Being able to read the .zip files without unzipping to disk would likely speed things up and allow me to remove some code that checks whether the file already exists (whether it has already been unzipped), and clean up the /tmp directory after each run.
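The "is this a .zip file?" logic mentioned above is not shown in the thread. One common way to do it, sketched here as an assumption rather than the author's actual code, is to check for the zip local-file-header signature (`PK\3\4`) at the start of the file instead of trusting the extension. `zip-file?` is a hypothetical name.

```clojure
(ns zip-detect-sketch
  (:require [clojure.java.io :as io]))

(defn zip-file?
  "True when the file starts with the zip local file header signature
  0x50 0x4B 0x03 0x04. Note: an empty archive starts with a different
  signature and is reported as false by this sketch."
  [file]
  (with-open [in (io/input-stream file)]
    (let [buf (byte-array 4)
          n   (.read in buf)]
      (and (= n 4)
           ;; bytes are signed on the JVM, so mask before comparing
           (= [0x50 0x4B 0x03 0x04] (map #(bit-and % 0xFF) buf))))))
```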
perhaps not quite what you are looking for, but fwiw there is a way to get babashka.fs to use a function to decide whether to extract a particular zip entry: https://github.com/babashka/fs/blob/57c089b5f4ff97f343e1f7f7fa04e4134104d1ea/API.md#unzip
use the platform! Java has classes for this.
$ bb
Babashka v1.12.208 REPL.
Use :repl/quit or :repl/exit to quit the REPL.
Clojure rocks, Bash reaches.
user=> (import 'java.util.zip.ZipInputStream)
java.util.zip.ZipInputStream
I found the class reading https://www.baeldung.com/java-compress-and-uncompress.
Ah! Good tip, @teodorlu. Thanks! I restricted myself to babashka.fs/unzip and forgot to consider Java might have some other stuff.
yup, ZipInputStream works for this use-case
Easy to forget! All of babashka.fs generally just calls Java classes. babashka.fs/unzip also uses these classes directly, specifically java.util.zip.ZipInputStream.
The implementation is readable: https://github.com/babashka/fs/blob/ab826ff7e073dbd55842a66e04ed93ce0eee1e9b/src/babashka/fs.cljc#L977-L1005
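To make the ZipInputStream suggestion concrete, here is a minimal sketch of reading every .csv entry of an archive into memory without writing anything to disk. `zip-csvs->map` is a hypothetical name, and the sketch assumes the entries are UTF-8 text and small enough to hold in memory.

```clojure
(ns zip-in-memory-sketch
  (:require [clojure.java.io :as io]
            [clojure.string :as str])
  (:import [java.io ByteArrayOutputStream]
           [java.util.zip ZipInputStream]))

(defn zip-csvs->map
  "Returns a map of entry name -> contents for every .csv entry in zip.
  Nothing is extracted to disk."
  [zip]
  (with-open [zis (ZipInputStream. (io/input-stream zip))]
    (loop [acc {}]
      (if-let [entry (.getNextEntry zis)]
        (let [entry-name (.getName entry)]
          (if (and (not (.isDirectory entry))
                   (str/ends-with? (str/lower-case entry-name) ".csv"))
            ;; io/copy reads up to the end of the current entry and leaves
            ;; the stream open; slurp would close zis after the first entry.
            (let [out (ByteArrayOutputStream.)]
              (io/copy zis out)
              (recur (assoc acc entry-name (.toString out "UTF-8"))))
            (recur acc)))
        acc))))
```

Since entry names are discovered while streaming, this also works when the names of the .csv files are not known in advance.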
@leif.eric.fredheim Java interop is available in bb for this, but you could perhaps also use:
* `:extract-fn` - function that decides if the current ZipEntry
should be extracted. The function is only called for the file case
(not directories) with a map with entries:
* `:entry` and the current ZipEntry
* `:name` and the name of the ZipEntry (result of calling `getName`)
Extraction only occurs if a truthy value is returned (i.e. not
nil/false)."
maybe
although that will extract stuff to disk, so then yes, Java interop is what you need
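Going by the docstring quoted above, `:extract-fn` could be used roughly like this to extract only the .csv entries. It still writes to disk, and it assumes a babashka.fs version that already includes the option. `unzip-csvs` is a hypothetical name.

```clojure
(ns extract-fn-sketch
  (:require [babashka.fs :as fs]
            [clojure.string :as str]))

(defn unzip-csvs
  "Extracts only the .csv entries of zip into dest (still written to disk)."
  [zip dest]
  (fs/unzip zip dest
            {:replace-existing true
             ;; called with a map containing :entry and :name for each file entry
             :extract-fn (fn [{:keys [name]}]
                           (str/ends-with? (str/lower-case name) ".csv"))}))
```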
Thanks for the advice! My current unzip function looks like this:
(defn unzip-file [path]
(-> path
(fs/unzip (fs/parent path))))
Very simple and easy with Babashka!
The only thing that took me a minute was figuring out how to unzip to the same directory as the .zip file was located.
That's why I wrapped it in a function like that.
I'm calling it like this:
(doall
(->> (fs/glob "data" "**/*.zip")
(pmap unzip-file)))
I naively expected the default location to be the same as the .zip (as when you unzip manually via a GUI tool)
But it's actually "`.`" (the root of my project)
nice! do note that this will only extract zip files from sub directories, not the current directory.
in that case you'll need (fs/glob "data" "**.zip") to include the current dir
Oh! Hahaha, of course
yeah, zip preserves the structure within the zip file, I think I mimicked that behavior of the command line zip program, it can be surprising
so if the zip file contains foo/bar.txt it will create the foo directory in the current working directory
these are symmetric:
borkdude@m1-5 /tmp $ bb -e '(fs/zip "dude.zip" "dude")'
borkdude@m1-5 /tmp $ zipinfo dude.zip
Archive: dude.zip
Zip file size: 280 bytes, number of entries: 2
-rw---- 2.0 fat 0 bX defN 25-Nov-12 07:29 dude/
-rw---- 2.0 fat 0 bX defN 25-Nov-12 07:29 dude/foo.txt
2 files, 0 bytes uncompressed, 4 bytes compressed: 0.0%
borkdude@m1-5 /tmp $ bb -e '(fs/unzip "dude.zip")'
----- Error --------------------------------------------------------------------
Type: java.nio.file.FileAlreadyExistsException
Message: ./dude/foo.txt
Location: NO_SOURCE_PATH:1:1
That's nice! I saw in the docs there's also an option to overwrite files if they already exist.
yep
Maybe it would be possible to add an "in-memory" option as well. But I guess it's a rare use case and not worthwhile since it can be done with Java.
Probably 99% of the time people want to unzip to disk.
https://chatgpt.com/share/69148230-baa0-8012-a951-7f55db920b3d
take a look at that example
Haha! Nice one. I need to get better at using ChatGPT for stuff like that. Good example.
This is actually the working version on my system here:
(ns demo.zipfs
(:import [java.nio.file FileSystems Files Paths]
[java.net URI]
[java.util Map])
(:require [babashka.fs :as fs]))
;; Suppose we have a zip file "example.zip" with an entry "hello.txt"
(def zip-path "dude.zip")
(def entry-name "dude/foo.txt")
;; Open the zip file as a FileSystem
(with-open [fs (FileSystems/newFileSystem
(URI/create (str "jar:file:" (.toAbsolutePath (Paths/get zip-path (into-array String [])))))
(java.util.HashMap.))]
(prn :fs fs)
(let [entry-path (.getPath fs (str "/" entry-name) (into-array String []))]
;; Read the entry as a string
(println "Contents of" entry-name ":")
(println (slurp (Files/newInputStream entry-path (into-array java.nio.file.OpenOption []))))))
;; fs is automatically closed here
With the above zip file
I guess babashka.fs path could implement support for ZipFileSystem so you can write:
(.getPath fs (str "/" entry-name) (into-array String []))
=>
(fs/path fs entry-name)
also:
(Files/newInputStream entry-path (into-array java.nio.file.OpenOption []))
could perhaps be supported better in fs.
Maybe the whole above example could be nicer in bb fs :)