Hi everyone, I am trying to use tech.v3.libs.parquet to create some parquet files from datasets. It seems as if the ds->parquet fn takes a loooooooooong time. I noticed the note about perf problems from too much debug logs, but I'm not getting any debug logs, so maybe it's not that (?!)
Here are my numbers:
• The dataset had 42607 records in it and 16 fields
• the final parquet file was 2.6Mb
• Export took 74930 msecs, which is about 75 seconds.
Does this sound normal?
Nicely done - happy to help.
No, that does not sound normal. If you can make a reproducible case and log it as a github issue (https://github.com/techascent/tech.ml.dataset/issues) we can look at it.
Kindly help me with one more thing? my code looks like this:
(count records') ;; 949267
(def test-dataset
  (let [record->ds (ds/mapseq-parser {:dataset-name "test"})]
    (doseq [r records']
      (record->ds r))
    (record->ds)))
(count test-dataset) ;; 16 ;; WHAT? Why? I thought it would be 949267
(def worker
  (csp/thread (time
               (ds-parquet/ds->parquet test-dataset
                                       "test2.parquet"))
              (log/info "done with export")))
;; with (first records'):
{"bridge-timestamp" "2025-11-16T13:28:32.397275650Z",
"heading" 8.0,
"longitude" 15.357746,
"device-timestamp" "2025-11-16T13:28:27Z",
"latitude" -34.6354833,
"vehicle-meta-data-vehicle-id" "1xxx70",
"device-id" "auto/10xx70",
"vehicle-bridge-id" "1/1/886",
"odometer" 0,
"vehicle-meta-data-registration" "108070",
"provider-timestamp" "2025-11-16T13:28:27Z",
"fleet-name" "Probgon__RUBEN",
"speed" 0.0}
----
"Elapsed time: 1524752.336913 msecs"
Am I doing something obviously wrong here?
Is records' a sequence of maps? Or something else?
It is a LazySeq created by map over a PersistentVector.
(of maps)
Does (def test-dataset (ds/->>dataset records')) make sense then?
And then, what is (ds/row-count test-dataset) ?
Datasets are Maps (of column name to column data), so (count test-dataset) is counting the map entries (columns). So, presumably, there are 16 columns.
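A minimal sketch of that distinction, using three made-up rows (the data and the name tiny-ds are illustrative only, not from the thread). It assumes tech.v3.dataset is on the classpath:

```clojure
(require '[tech.v3.dataset :as ds])

;; Three made-up rows with two keys each (illustrative data only).
(def tiny-ds
  (ds/->dataset [{"speed" 0.0  "heading" 8.0}
                 {"speed" 12.5 "heading" 90.0}
                 {"speed" 3.1  "heading" 181.0}]))

;; count treats the dataset as a map, so it counts columns.
(def n-map-entries (count tiny-ds))

;; The dataset-specific functions make the intent explicit.
(def n-columns (ds/column-count tiny-ds))
(def n-rows (ds/row-count tiny-ds))
```

Here n-map-entries and n-columns should both come out as 2, and n-rows as 3.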
> is counting the map entries (columns). aaah, that explains it, thank you.
(def test-dataset (ds/->>dataset records'))
(type test-dataset) ;; tech.v3.dataset.impl.dataset.Dataset
(ds/row-count test-dataset) ;; 949267
Seems good. I also don't understand what csp/thread is doing, but maybe just see if the parquet write of that ds performs better.
If it's still slow you can use async-profiler and send us a flamegraph as well.
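One possible way to get such a flamegraph from the REPL is the clj-async-profiler wrapper. That library is an assumption here (the thread only names async-profiler itself), and sample-ds and the output filename are made up for illustration. A rough sketch, which on recent JDKs likely needs the JVM started with -Djdk.attach.allowAttachSelf:

```clojure
(require '[clj-async-profiler.core :as prof]
         '[tech.v3.dataset :as ds]
         '[tech.v3.libs.parquet :as ds-parquet])

;; Tiny stand-in dataset (hypothetical); profile your real dataset instead.
(def sample-ds
  (ds/->dataset [{"speed" 0.0} {"speed" 12.5} {"speed" 3.1}]))

;; Profile only the parquet write. When the body finishes,
;; clj-async-profiler generates a flamegraph file for the run.
(prof/profile
  (ds-parquet/ds->parquet sample-ds "profile-test.parquet"))
```

By default the results should land under /tmp/clj-async-profiler/results/, but check the library's README for your version.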
Apologies, I have not made a self-contained reproducible example yet, because I have trouble with re-creating/cleaning the data I'm working with... But here is a screenshot of the flamegraph so long
From my cursory inspection, it seems as if the slf4j is present in all of the time-consuming stacks... it is possible that the logging code is still running despite me not seeing any logs. I'm using a heavily configured timbre for logging.
I've created a logback.xml that looks like this:
(slurp (io/resource "logback.xml"))
"<configuration>\n <root level=\"OFF\" />\n</configuration>\n"
... but it has not affected the performance.
I'll see if I can get a sample of data and package it in a repro.
ok... Apologies for the train of thought in my posts... I was looking at the flamegraph some more and noticed that the portions of "logging code" are in the bridge to timbre... I looked at my timbre config and realized it was still set to the global :debug level. I made that :info and the perf normalized (~6 seconds for the export to parquet).
(taoensso.timbre/set-level! :info)
so my conclusion is that my logging configuration is/was to blame for the slowness I saw. The logback.xml that I created was useless as that's not the logging framework in use here...
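For anyone hitting the same thing, a quick check of which logging backend SLF4J is actually bound to can save a detour through the wrong config file. A hedged sketch: the logger name "org.apache.parquet" is an assumption about where the chatty logs come from, and the results depend entirely on what is on your classpath:

```clojure
(require '[taoensso.timbre :as timbre])
(import '(org.slf4j LoggerFactory))

;; Which SLF4J backend is bound? If this is not a logback class,
;; a logback.xml on the classpath will have no effect.
(def slf4j-factory-class
  (class (LoggerFactory/getILoggerFactory)))

;; Does the bound backend consider DEBUG enabled for this logger?
;; "org.apache.parquet" is an assumed logger name.
(def parquet-debug-before?
  (.isDebugEnabled (LoggerFactory/getLogger "org.apache.parquet")))

;; Raise timbre's global level, as in the thread, then check again.
(timbre/set-level! :info)

(def parquet-debug-after?
  (.isDebugEnabled (LoggerFactory/getLogger "org.apache.parquet")))
```

If SLF4J is bridged to timbre, parquet-debug-after? would be expected to be false once the level is :info. With a different backend, the timbre call will not change the answer.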
Thanks for your questions and prompts along the way, they helped me think about the problem in a better way 👍🏽