data-science

jf 2025-11-24T08:32:45.289119Z

has anybody tried using tablecloth to create a dataset from an xlsx file? The doc (assuming https://scicloj.github.io/tablecloth/ is it) is really scant on how to do so. At https://scicloj.github.io/tablecloth/#dataset-creation, it mentions that xlsx is supported (`file types: raw/gzipped csv/tsv, json, xls(x) taken from local file system or URL`), so I've been trying (tc/dataset "path-to-xlsx-file") to no avail. Initially I had a multi-sheet xlsx... then I tried a single-sheet one, then I tried putting the xlsx file in the same directory where Noj is running (that's how I am using tablecloth)... I still keep getting "Unrecognized read file type: xlsx"

2025-11-24T09:22:50.268389Z

I usually use the poi namespace from the underlying tmd library https://techascent.github.io/tech.ml.dataset/tech.v3.libs.poi.html

Gent Krasniqi 2025-11-24T09:32:37.556789Z

I've used tech.v3.libs.fastexcel successfully in the past. (And the name of that java library is accurate) If I remember correctly I did have to take a peek at the tmd wrapper code to figure out how to refer to exact sheets by name and things like that, but I could have misunderstood the API the wrapper exposes as well.

Gent Krasniqi 2025-11-24T09:35:26.219689Z

I might be misremembering but one of the other solutions I had tried, possibly poi, was loading too much stuff I didn't need in memory by default, where I was just after a particularly named sheet in a file.

jf 2025-11-24T09:56:58.124449Z

> (And the name of that java library is accurate)
as in it is fast? otherwise I'm not getting what you mean

genmeblog 2025-11-24T10:00:25.621369Z

You have to add additional libraries to make it possible.

Gent Krasniqi 2025-11-24T10:00:45.001069Z

Yes that's what I meant @jf.slack-clojurians

genmeblog 2025-11-24T10:03:13.127499Z

Ooops. Sorry for the duplicate. I didn't realize that it's already answered

jf 2025-11-24T14:18:42.002779Z

could somebody perhaps help me out with requiring fastexcel? I've got techascent/tech.ml.dataset {:mvn/version "7.067"} in my deps.edn as per https://clojars.org/techascent/tech.ml.dataset... and then I figured maybe [tech.v3.libs.fastexcel] in (:require) in my clj file (https://techascent.github.io/tech.ml.dataset/100-walkthrough.html has user> (require '[tech.v3.libs.fastexcel])) would work... but I'm getting issues.
deps.edn:

{:deps {...
        techascent/tech.ml.dataset {:mvn/version "7.067"}}}
a.clj:
(ns a
  (:require [tablecloth.api :as tc]
            [tech.v3.dataset :as ds]
            [tech.v3.libs.fastexcel :as fe]))
Error:
Execution error (ClassNotFoundException) at java.net.URLClassLoader/findClass (URLClassLoader.java:445).
org.dhatim.fastexcel.reader.ReadableWorkbook
The org.dhatim in the error message reminds me of https://techascent.github.io/tech.ml.dataset/tech.v3.libs.fastexcel.html:
Required Dependencies:

  [org.dhatim/fastexcel-reader "0.12.8" :exclusions [org.apache.poi/poi-ooxml]]
but I'm not sure what to make of it. If I try requiring [org.dhatim/fastexcel-reader :as fe] instead in my ns form, I get the following:
Syntax error macroexpanding clojure.core/ns at (a.clj:1:1).
((:require [tablecloth.api :as tc] [tech.v3.dataset :as ds] [org.dhatim/fastexcel-reader :as fe])) - failed: Extra input spec: :clojure.core.specs.alpha/ns-form

jf 2025-11-24T14:44:52.527899Z

re problem with requiring fastexcel: I think I have it now. Getting something like this working shouldn't be a goose chase like this, but I found https://github.com/techascent/tech.ml.dataset/issues/405 which gave me a clue that perhaps fastexcel is a separate dependency. So putting that into deps.edn resolved it.
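
A minimal sketch of the kind of deps.edn entry this describes, assuming the Leiningen coordinate quoted above from the tech.v3.libs.fastexcel docs translates directly into deps.edn syntax. The `...` stands for whatever other deps are already present, and the version is the one shown in those docs, not necessarily the one that was actually used here:

```clojure
;; deps.edn
{:deps {...
        techascent/tech.ml.dataset  {:mvn/version "7.067"}
        ;; Assumed deps.edn translation of the Leiningen vector
        ;; [org.dhatim/fastexcel-reader "0.12.8" :exclusions [org.apache.poi/poi-ooxml]]
        org.dhatim/fastexcel-reader {:mvn/version "0.12.8"
                                     :exclusions  [org.apache.poi/poi-ooxml]}}}
```

With that library on the classpath, the `[tech.v3.libs.fastexcel :as fe]` require from the earlier a.clj should load, since `org.dhatim.fastexcel.reader.ReadableWorkbook` is the class the error reported as missing.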

jf 2025-11-24T15:08:44.329909Z

I am just noticing https://github.com/techascent/tech.ml.dataset/commit/b1cb8d058d085ae01e4c694695feb499bdcc2ba5 (via https://github.com/techascent/tech.ml.dataset/issues/283) where the comment is made that ... poi is more robust. Hm.

2025-11-24T15:10:10.717759Z

I use poi for most of my needs. Occasionally I'll get a spreadsheet so messy that I'll save what I want out to csv (I could possibly wrestle it down, but sometimes the hacky way is shorter and less error prone)

2025-11-24T15:12:29.789559Z

I also use input->workbook (https://techascent.github.io/tech.ml.dataset/tech.v3.libs.poi.html#var-input-.3Eworkbook) if some of the sheets are a mess, but the one I want is OK and I don't want to parse the messy ones

jf 2025-11-24T15:23:11.858559Z

gotcha. Thanks for the tip!

jf 2025-11-24T15:49:38.383619Z

sorry, a few more questions on poi:
1. do I need to declare any deps?
2. do I need any require in my ns form?
3. I just tried (ds/->dataset "test.xlsx"), and I am getting an error:
Execution error at (io.clj:54).
Unrecognized read file type: :xlsx

2025-11-24T16:50:05.425229Z

you'll need poi in your deps

2025-11-24T16:50:16.658749Z

and you'll need to require the tmd poi NS

2025-11-24T16:50:28.965779Z

(sorry for the drive by help, I'm pulled in different directions today)

jf 2025-11-25T05:51:07.204689Z

no problem. You're still putting in effort to help and I appreciate that!

jf 2025-11-25T07:26:55.593819Z

(mostly for my own records, but good for whoever else might be having an issue) I had to have both poi and poi-ooxml in deps:
deps.edn:

{:deps { ...
         org.apache.poi/poi       {:mvn/version "5.5.0"}
         org.apache.poi/poi-ooxml {:mvn/version "5.5.0"}
         ...}}
code.clj:
(ns code
  (:require ...
            [tech.v3.dataset :as ds]
            [tech.v3.libs.poi :as poi]
            ...))
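
A sketch of reading one named sheet with this setup, assuming `workbook->datasets` in tech.v3.libs.poi accepts a file path and returns one dataset per sheet, each named after its sheet (as its docstring describes). The file name and the sheet name "Sheet1" are placeholders, not taken from the thread:

```clojure
(ns code
  (:require [tech.v3.dataset :as ds]
            [tech.v3.libs.poi :as poi]))

;; One dataset per sheet in the workbook.
(def sheets (poi/workbook->datasets "test.xlsx"))

;; Pick a sheet by its name; "Sheet1" is a hypothetical sheet name.
(def sheet1
  (first (filter #(= "Sheet1" (ds/dataset-name %)) sheets)))
```

Calling the poi namespace directly like this does not depend on which reader ends up registered for the xlsx file type. It does parse every sheet before selecting one; input->workbook, mentioned earlier in the thread, is the route for skipping messy sheets.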