data-science

pieterbreed 2025-11-17T09:53:02.127819Z

Hi everyone, I am trying to use tech.v3.libs.parquet to create some parquet files from datasets. It seems as if the ds->parquet fn takes a loooooooooong time. I noticed the note about perf problems from too much debug logging, but I'm not getting any debug logs, so maybe it's not that (?!) Here are my numbers:
• The dataset had 42607 records in it and 16 fields
• The final parquet file was 2.6 MB
• Export took 74930 msecs, which is about 75 seconds

Does this sound normal?

Harold 2025-11-18T14:02:57.040509Z

Nicely done - happy to help.

Harold 2025-11-17T13:36:18.929729Z

No, that does not sound normal. If you can make a reproducible case and log it as a github issue (https://github.com/techascent/tech.ml.dataset/issues) we can look at it.

👍🏽 1
pieterbreed 2025-11-17T14:10:34.782579Z

Could you kindly help me with one more thing? My code looks like this:

(count records') ;; 949267

(def test-dataset
  (let [record->ds (ds/mapseq-parser {:dataset-name "test"})]
    (doseq [r records']
      (record->ds r))
    (record->ds)))

(count test-dataset) ;; 16 ;; WHAT? Why? I thought it would be 949267

(def worker
  (csp/thread (time
               (ds-parquet/ds->parquet test-dataset
                                       "test2.parquet"))
              (log/info "done with export")))

;; with (first records'):
{"bridge-timestamp" "2025-11-16T13:28:32.397275650Z",
 "heading" 8.0,
 "longitude" 15.357746,
 "device-timestamp" "2025-11-16T13:28:27Z",
 "latitude" -34.6354833,
 "vehicle-meta-data-vehicle-id" "1xxx70",
 "device-id" "auto/10xx70",
 "vehicle-bridge-id" "1/1/886",
 "odometer" 0,
 "vehicle-meta-data-registration" "108070",
 "provider-timestamp" "2025-11-16T13:28:27Z",
 "fleet-name" "Probgon__RUBEN",
 "speed" 0.0}

----

"Elapsed time: 1524752.336913 msecs"

pieterbreed 2025-11-17T14:11:34.212389Z

Am I doing something obviously wrong here?

😅 1
Harold 2025-11-17T16:30:10.355019Z

Is records' a sequence of maps? Or something else?

pieterbreed 2025-11-17T16:32:07.629899Z

It is a LazySeq created by map over a PersistentVector.

pieterbreed 2025-11-17T16:32:11.213389Z

(of maps)

Harold 2025-11-17T16:33:43.104589Z

Does (def test-dataset (ds/->>dataset records')) make sense then? And then, what is (ds/row-count test-dataset) ? Datasets are Maps (of column name to column data), so (count test-dataset) is counting the map entries (columns). So, presumably, there are 16 columns.

pieterbreed 2025-11-17T16:36:26.635709Z

> is counting the map entries (columns).

Aaah, that explains it, thank you.

(def test-dataset (ds/->>dataset records'))
(type test-dataset) ;; tech.v3.dataset.impl.dataset.Dataset
(ds/row-count test-dataset) ;; 949267
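
The count-versus-row-count distinction above can be illustrated with a plain Clojure map. This is only an analogy, assuming a column-oriented table modelled as a map of column name to vector; `columns` is a hypothetical stand-in and not how the Dataset type is actually implemented.

```clojure
;; Hypothetical stand-in: a column-oriented table as a plain map
;; of column name -> vector of values (NOT a real tech.v3 Dataset).
(def columns
  {"speed"   [0.0 12.5 30.0]
   "heading" [8.0 9.0 10.0]})

;; count on a map counts its entries, i.e. the columns here.
(def n-columns (count columns))

;; The number of rows is the length of any single column.
(def n-rows (count (first (vals columns))))

(println "columns:" n-columns "rows:" n-rows)
```

With a real dataset, ds/row-count (as used above) is the call that reports the number of rows.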

Harold 2025-11-17T16:38:59.310999Z

Seems good. I also don't understand what csp/thread is doing, but maybe just see if the parquet write of that ds performs better.

Harold 2025-11-17T16:39:56.168809Z

If it's still slow you can use async-profiler, and send us a flamegraph as well.

👍🏽 1
pieterbreed 2025-11-18T07:43:32.877509Z

Apologies, I have not made a self-contained reproducible example yet, because I'm having trouble re-creating/cleaning the data I'm working with... But here is a screenshot of the flamegraph in the meantime.

pieterbreed 2025-11-18T07:44:50.249429Z

From my cursory inspection, it seems that slf4j is present in all of the time-consuming stacks... It is possible that the logging code is still running despite me not seeing any logs. I'm using a heavily configured timbre for logging.

pieterbreed 2025-11-18T07:52:33.879869Z

I've created a logback.xml that looks like this:

(slurp (io/resource "logback.xml"))
"<configuration>\n  <root level=\"OFF\" />\n</configuration>\n"
... but it has not affected the performance. I'll see if I can get a sample of data and package it in a repro.

pieterbreed 2025-11-18T07:58:08.117889Z

OK... Apologies for the train of thought in my posts... I was looking at the flamegraph some more and noticed that the portions of "logging code" are in the bridge to timbre... I looked at my timbre config and realized it was still set to a global :debug level. I changed that to :info and the perf normalized (~6 seconds for the export to parquet).

(taoensso.timbre/set-level! :info)
So my conclusion is that my logging configuration is/was to blame for the slowness I saw. The logback.xml that I created was useless, as that's not the logging framework in use here... Thanks for your questions and prompts along the way, they helped me think about the problem in a better way 👍🏽
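
One way to confirm a diagnosis like this is to time the same export under both log levels. The following is a minimal sketch, assuming timbre and the slf4j-to-timbre bridge are on the classpath as described in this thread; `sample-ds` and the output file names are hypothetical, and a dataset this small may not show a measurable difference, so the real dataset should be substituted.

```clojure
(require '[taoensso.timbre :as timbre]
         '[tech.v3.dataset :as ds]
         '[tech.v3.libs.parquet :as ds-parquet])

;; Hypothetical stand-in for the real data; swap in the actual dataset.
(def sample-ds
  (ds/->dataset [{"speed" 0.0  "odometer" 0}
                 {"speed" 12.5 "odometer" 3}]))

;; Export with the global level at :debug (the slow configuration in this thread).
(timbre/set-level! :debug)
(time (ds-parquet/ds->parquet sample-ds "debug-level.parquet"))

;; Export again with the global level at :info (the fix reported above).
(timbre/set-level! :info)
(time (ds-parquet/ds->parquet sample-ds "info-level.parquet"))
```

If the two timings differ substantially on the real data, the logging level is the likely cause rather than the parquet writer itself.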