#data-science
2023-09-05
adham11:09:41

Hey all! I'm doing some data cleaning and preparation. One operation is mapping strings to keywords. The strings aren't always consistent, so I have to map each one manually; that isn't a problem, as there are at most 19 of them whenever the operation is needed. My question is which is more optimal: using (tc/map-rows (fn [{:keys [colname]}] {:colname (case colname "A" :a)})), or using (tc/map-columns) where I supply a map (e.g. {"A" :a}) in a let binding and use some form of #() to do the mapping?
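For readers following along, the two options in the question might look roughly like this. It is only a sketch: the dataset `ds`, the lookup table `string->kw`, and the target column `:kw` are hypothetical names, and it assumes a tablecloth version that provides both `tc/map-rows` and `tc/map-columns`.

```clojure
(require '[tablecloth.api :as tc])

;; Hypothetical data: one string column with a few values to map.
(def ds (tc/dataset {:colname ["A" "B" "A"]}))

;; Hypothetical lookup table, kept in one place.
(def string->kw {"A" :a, "B" :b})

;; Row-based: the fn receives each row as a map and returns a map of
;; columns that is merged back into the dataset.
(def by-rows
  (tc/map-rows ds (fn [{:keys [colname]}]
                    {:kw (case colname
                           "A" :a
                           "B" :b
                           nil)}))) ; unmatched strings become nil

;; Column-based: the fn receives one value of :colname at a time.
(def by-column
  (tc/map-columns ds :kw [:colname] #(string->kw %)))
```

Both versions write the result to a new `:kw` column here, so the original strings stay available for checking the mapping.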

🎉 2
otfrom12:09:05

I've done both, but I've never profiled it

chrisn13:09:09

My guess is the column-based approach, but as otfrom said, profiling would be ideal. Using a Java HashMap would probably also be a small bit quicker, as its lookup times are lower than the persistent hash map's.
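A minimal sketch of the HashMap suggestion, assuming the same kind of small lookup table as in the question. The names `string->kw`, `jmap`, `lookup-persistent`, and `lookup-hashmap` are made up for illustration, and whether the difference is measurable for around 19 keys is exactly what profiling would have to show.

```clojure
(import 'java.util.HashMap)

;; Hypothetical lookup table as a Clojure persistent map.
(def string->kw {"A" :a, "B" :b})

;; Copy it once into a java.util.HashMap; the type hints avoid reflection.
(def ^HashMap jmap (HashMap. ^java.util.Map string->kw))

(defn lookup-persistent
  "Looks up s in the persistent map."
  [s]
  (get string->kw s))

(defn lookup-hashmap
  "Looks up s in the java.util.HashMap copy."
  [s]
  (.get jmap s))
```

The HashMap is mutable, so this only makes sense if it is built once and never modified afterwards.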

chrisn13:09:41

For this type of profiling I use criterium

adham09:09:02

criterium seems straightforward with its (bench) function. I'll benchmark it tomorrow and report back. Thank you both for your feedback, and for the benchmarking library for future questions like this.
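A benchmark along these lines could start from the following sketch. It times only the string-to-keyword lookup over a made-up sample vector, not the dataset operations themselves; `sample`, `via-case`, `via-map`, and `run-benchmarks` are hypothetical names. It uses criterium's `quick-bench`, which runs faster than `bench` at the cost of less precise estimates.

```clojure
(require '[criterium.core :as crit])

;; Hypothetical sample of inputs to map.
(def sample (vec (take 1000 (cycle ["A" "B"]))))

(def string->kw {"A" :a, "B" :b})

(defn via-case
  "Maps s with a case expression; unmatched strings give nil."
  [s]
  (case s
    "A" :a
    "B" :b
    nil))

(defn via-map
  "Maps s with a lookup in the persistent map."
  [s]
  (get string->kw s))

(defn run-benchmarks
  "Prints criterium timings for both lookup styles over the sample."
  []
  (println "case:")
  (crit/quick-bench (mapv via-case sample))
  (println "persistent map:")
  (crit/quick-bench (mapv via-map sample)))
```

Calling `(run-benchmarks)` at the REPL prints the timings; to compare the two tablecloth approaches directly, the `mapv` expressions would be replaced with the corresponding dataset calls.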

aaelony15:09:36

For those who enjoy syntax comparisons, I came across a nice cheatsheet comparing pandas and R's data.table: https://atrebas.github.io/post/2020-06-14-datatable-pandas/