Fork me on GitHub
#sql
<
2022-05-25
>
zane23:05:09

A common task at my organization is “numericalization”: For each non-numeric column in a CSV, replace each unique value in the column with a corresponding unique integer. Output the updated CSV and a table mapping column values to their corresponding integers. For example, input:

|---------|
| species |
|---------|
| cat     |
| dog     |
| bird    |
| bird    |
| cat     |
Output:
|---------|
| species |
|---------|
|       0 |
|       1 |
|       2 |
|       2 |
|       0 |
{"species" {"cat" 0
            "dog" 1
            "bird" 2}}
Currently this is achieved via a script that leverages org.clojure/data.csv, but it’s a bit verbose. I’m wondering if there’s a simpler way, perhaps one that takes advantage of next.jdbc.

seancorfield00:05:47

Are you currently using a database? I'm a bit unclear on what your process is...

zane00:05:04

No database at present.

zane00:05:25

But would be open to using one, potentially an in-memory database, to achieve the task.

zane00:05:32

Might be a better task for something like https://github.com/scicloj/tablecloth, but I figured I’d check.

seancorfield00:05:05

That sort of get-or-create logic can be a bit messy in SQL/JDBC I think, assuming you need to make it thread-safe etc...

seancorfield00:05:42

We have quite a few such instances, specifically for the use case you outline: turning strings into unique ID values.

zane00:05:36

Doesn't need to be thread-safe.

zane00:05:47

When you say “instances” what are you referring to? I’m not sure I understand yet. 😅

seancorfield00:05:25

We have quite a few slightly different get-or-create-<something>-id functions. Instances of that logic.

Eugen08:05:21

have you looked datalevin? it's based on lmdb - kv store. I think it should be a good fit for your use case

zane16:05:00

I’ll lave a look! Thanks, @U011NGC5FFY!