data-science

simonacca 2025-02-28T19:14:44.139579Z

Hello 👋 I have a question as a tech.v3.dataset novice. I am importing a CSV dataset where one of the columns is an integer representing a Unix timestamp with nanosecond precision. How do I convert this into a date or an instant upon loading the dataset? I tried:

(d/->dataset "myfile.csv" {:parser-fn {"start" :epoch-nanoseconds}})
However, I get this error:
; Execution error at ham_fisted.Casts/longCast (Casts.java:85).
; Object cannot be casted to long: 1740459600000000000
I also tried :instant, and then I get:
; Execution error at tech.v3.dataset.io.column_parsers.FixedTypeParser/addValue (column_parsers.clj:233).
; Failed to parse value 1740459600000000000 as datatype :instant on row 0
Here is a sample of the file:
exp,start
ABC,1738558800000000000
Thanks in advance for your help!

Harold 2025-02-28T19:31:13.819129Z

First thoughts:

user> (require '[tech.v3.dataset :as ds])
nil
user> (slurp "t.csv")
"exp,start\nABC,1738558800000000000"
user> (def ds (ds/->dataset "t.csv"))
#'user/ds
user> ds
t.csv [1 2]:

| exp |               start |
|-----|--------------------:|
| ABC | 1738558800000000000 |
user> (map meta (ds/columns ds))
({:categorical? true, :name "exp", :datatype :string, :n-elems 1}
 {:name "start", :datatype :int64, :n-elems 1})
user> (ds/row-map ds (fn [{:strs [start]}]
                       (let [i (java.time.Instant/ofEpochSecond (quot start 1000000000)
                                                                (rem start 1000000000))]
                         {"inst" i
                          "date" (java.util.Date/from i)})))
t.csv [1 4]:

| exp |               start |                 inst |                         date |
|-----|--------------------:|----------------------|------------------------------|
| ABC | 1738558800000000000 | 2025-02-03T05:00:00Z | Sun Feb 02 22:00:00 MST 2025 |
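
The conversion in the session above can be pulled out into a small standalone function. This is only a sketch: the name `nanos->instant` is hypothetical (it is not part of tech.v3.dataset), and it assumes the column has already been read as `:int64`. It uses `quot` and `rem` so that a timestamp that is not a whole number of seconds stays a long instead of becoming a ratio.

```clojure
(import '(java.time Instant))

(defn nanos->instant
  "Hypothetical helper: converts epoch nanoseconds (a long) to a java.time.Instant."
  [^long nanos]
  ;; Split into whole seconds plus the leftover nanoseconds,
  ;; which is the shape Instant/ofEpochSecond expects.
  (Instant/ofEpochSecond (quot nanos 1000000000)
                         (rem nanos 1000000000)))

;; The sample value from the CSV above:
(str (nanos->instant 1738558800000000000))
;; => "2025-02-03T05:00:00Z"
```

With this in place, the function passed to `ds/row-map` could return something like `{"inst" (nanos->instant start)}`.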

simonacca 2025-02-28T19:47:00.665409Z

Thanks Harold! Since the column is read natively as :int64, I was hoping to leverage one of the packing formats to avoid actually touching the data. Not sure if that's possible.

Harold 2025-02-28T21:27:31.417859Z

You're welcome. My gut is that'd be very unlikely to be worth it. More flexible this way, and if performance became a concern (e.g., you have 10B+ rows of this every day) then there'd be bigger wins elsewhere (switching serialization formats, maybe).