#data-science
2023-09-11
cdeln 06:09:42

Hi all. I've implemented a library for doing automatic reference counting: https://github.com/cdeln/lexref-clj/tree/main , which I think might come in handy for people in this channel. Sales pitch: if you run OOM in your REPL when jiggling around with large array expressions, get this. End of sales pitch. I also wrote an article that accompanies the library and describes the issue I am trying to solve: https://nextjournal.com/cdeln/reference-counting-in-clojure . I've currently used Python interop for my experimentation, but the lib should be adaptable to anything really. It would be cool to see it used together with libraries such as tech.v3.tensor or neanderthal, or something I haven't thought about at all. I'll probably do some cross-posting, hope you don't mind. Since this channel is the primary audience I had in mind, I figured I should share it here first. Cheers!

🙏 2
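
For a sense of the problem lexref-clj targets, here is a rough sketch of the kind of expression it is meant to help with. It does not use the library itself, just libpython-clj and numpy: each arithmetic step allocates a fresh native array, and those arrays are only released when the JVM happens to GC their small wrapper objects, so native memory can run out long before any GC is triggered.

(require '[libpython-clj2.python :as py]
         '[libpython-clj2.require :refer [require-python]])
(py/initialize!)
(require-python '[numpy :as np])

(defn normalize [x]
  ;; (x - mean(x)) / std(x) creates two large temporary arrays per call
  (np/divide (np/subtract x (np/mean x)) (np/std x)))

(def big (np/ones [10000 10000]))  ; ~800MB of float64
(dotimes [_ 20]
  (normalize big))                 ; temporaries can pile up faster than the JVM GC releases them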
Harold 16:09:02

Hello! Please read our new blog post about accessing DuckDB from Clojure through TMD - 🦆: https://techascent.com/blog/just-ducking-around.html Wherein we join 1.4B+ rows on a laptop in 2.5s. The story continues to coalesce: single developers, or small teams, with functional data science can accomplish processing tasks that would otherwise drive longer timelines and higher headcounts, and involve bigger machines and dramatic/unwieldy tools.

💜 2
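
For anyone wanting to try the query in the next message at their own REPL, a minimal connection sketch: sql->dataset is the call shown below in the thread, while initialize!, open-db, and connect are assumed from the tmducken README, so check there for the current API.

(require '[tmducken.duckdb :as duckdb])

(duckdb/initialize!)                     ; load the native duckdb shared library
(def db   (duckdb/open-db "data.ddb"))   ; file-backed database (path is a placeholder)
(def conn (duckdb/connect db))

(duckdb/sql->dataset conn "SELECT 42 AS answer;")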
Harold 17:09:26

@U04JJEM2X1V - thank you! There was a typo - I will correct it now. There are 1.4B rows, but it takes 2.5s ⏳

user> (time (duckdb/sql->dataset conn "SELECT COUNT(*) FROM data INNER JOIN colors ON data.sku = colors.sku;"))
"Elapsed time: 2486.620275 msecs"
:_unnamed [1 1]:

| count_star() |
|-------------:|
|   1416737859 |

👍 1
Ben Kamphaus 17:09:16

@UJ7RSSWDU did y'all have to do any row_group size tuning in your uses of parquet, or just defaults? > An optimized read setup would be: 1GB row groups, 1GB HDFS block size, 1 HDFS block per HDFS file. Or have you not had to bother? Context: really large files that need joins like ^ but also iteration over rows; trying to avoid duplication into avro or compressed csv/tsv.
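
For reference, row-group size is fixed when the parquet file is written. If the files happened to be produced with DuckDB (which may well not be the case here), it would be an option on COPY, with the size given in rows rather than bytes - an illustrative sketch using the conn from above:

(duckdb/sql->dataset conn
  ;; assumes sql->dataset accepts non-SELECT statements; otherwise run the COPY from the duckdb CLI
  "COPY (SELECT * FROM data)
     TO 'data.parquet' (FORMAT PARQUET, ROW_GROUP_SIZE 1000000);")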

Harold 15:09:56

Ben! Nice to see you --- I don't think so - never messed with anything like that and read speed hasn't been the bottleneck for me yet... Maybe @UDRJMEFSN has encountered this?

chrisn 19:09:12

Not had to bother, as Harold said. Not sure the juice is worth the squeeze - I doubt lining those numbers up that exactly would help unless you were padding blocks or something like that. Also, we don't generally use HDFS as the underlying storage layer, so I haven't put any thought into tuning for it. I don't see why you would need duplication: parquet has fine iteration speeds, and compressed tsv/csv means you are re-parsing all the dates and numbers every time you scan the file, which is a substantial cost (10x in my experience) compared to reading the parquet.

👍 1
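
A rough way to see the cost chrisn describes, with placeholder file names (parquet support needs tech.v3.libs.parquet plus the parquet-hadoop dependency on the classpath):

(require '[tech.v3.dataset :as ds]
         '[tech.v3.libs.parquet :as parquet])

(time (parquet/parquet->ds "events.parquet"))  ; typed columns come back as-is
(time (ds/->dataset "events.csv.gz"))          ; every date and number is re-parsed on each load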
chrisn 19:09:09

DuckDB works from parquet on S3, and I would guess with equally good join performance, but we haven't tested that specifically. They use a hash-join algorithm, same as TMD, which is far faster than Spark's sort-join algorithm.
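
An untested sketch of what that looks like from the same conn as above - the bucket path is a placeholder, and the httpfs extension plus S3 credentials have to be configured on the DuckDB side first:

(duckdb/sql->dataset conn
  "SELECT count(*) AS n
     FROM read_parquet('s3://my-bucket/data/*.parquet') AS data
    INNER JOIN colors ON data.sku = colors.sku;")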

chrisn 19:09:24

They have an article about exactly that somewhere on their site.