#data-science
2023-08-07
kenny 19:08:23

The dataset I am using happens to represent missing values as "". I’d like to treat these values as missing. Is there a way to do this? What I’m currently attempting is to use row-map, changing the value to :tech.ml.dataset.parse/missing when it is "". I don’t believe this is doing what I intend, however.

genmeblog 19:08:40

Hmmm... when I create a dataset by hand, "" is treated as missing:

(def ds (tc/dataset {:a ["a" "" " " "b"]}))

ds
;; => _unnamed [4 1]:
;;    | :a |
;;    |----|
;;    |  a |
;;    |    |
;;    |    |
;;    |  b |

(tc/info ds)
;; => _unnamed: descriptive-stats [1 7]:
;;    | :col-name | :datatype | :n-valid | :n-missing | :mode | :first | :last |
;;    |-----------|-----------|---------:|-----------:|-------|--------|-------|
;;    |        :a |   :string |        3 |          1 |     a |      a |     b |

(tc/replace-missing ds :a :value "it was a missing value")
;; => _unnamed [4 1]:
;;    |                     :a |
;;    |------------------------|
;;    |                      a |
;;    | it was a missing value |
;;    |                        |
;;    |                      b |

kenny 19:08:05

Oh, interesting. I am creating the dataset from a parquet file.

genmeblog 19:08:35

Oh, probably it's a different path then. @UDRJMEFSN can you look at this?

kenny 19:08:52

If I pass a special parse-method for the column, it will mark it as missing. e.g.,

(ds/->dataset "data.parquet"
  {:parser-fn
   {"product_to_region_code" [:string (fn [s]
                                        (if (str/blank? s)
                                          :tech.v3.dataset/missing
                                          s))]}})

chrisn 22:08:09

For parquet we use the file's missing indicators, and it isn't parsed like a CSV. Is it acceptable to use column-map after load and return nil if the string is ""?
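A minimal sketch of the column-map approach suggested here, assuming `tech.v3.dataset` is aliased as `ds`. The helper names `blank->nil` and `blank-as-missing` are hypothetical. The sketch relies on `column-map` scanning the result for nil values when no result datatype is supplied, which is worth verifying against the library version in use. It uses `str/blank?` to match the parser function earlier in the thread, so whitespace-only strings are also treated as missing.

```clojure
(require '[clojure.string :as str]
         '[tech.v3.dataset :as ds])

(defn blank->nil
  "Returns nil for nil, empty, or whitespace-only strings; otherwise returns s."
  [s]
  (when-not (str/blank? s) s))

(defn blank-as-missing
  "Rewrites column `colname` so that blank strings become nil.
  Assumption: with no explicit datatype, column-map infers the missing
  set from nil results."
  [dataset colname]
  (ds/column-map dataset colname blank->nil [colname]))

(comment
  ;; Usage at the REPL, assuming parquet support is available
  ;; (for example by requiring tech.v3.libs.parquet).
  (-> (ds/->dataset "data.parquet")
      (blank-as-missing "product_to_region_code")
      (ds/missing)))
```

Keeping `blank->nil` separate from the dataset call means the same predicate can be reused in a `:parser-fn` for formats that do go through the parsing path.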

kenny 19:08:07

One thing that’s still a bit frustrating: even with the columns marking "" as missing, missing values are passed to row-map as nil rather than the column’s key being elided from the row map entirely. Any idea if there’s a way to have the row passed to map-fn elide columns whose value is missing?
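For library versions where missing values still arrive as nil, one possible workaround is to strip nil-valued keys before the row function sees them. This is a plain-Clojure sketch, not part of the library; `elide-missing` and `wrap-elide` are hypothetical helper names, and it assumes the row passed to the map function can be treated as a sequence of map entries.

```clojure
(defn elide-missing
  "Returns a plain map containing only the entries of `row` whose value
  is non-nil. Values such as false are kept; only nil is dropped."
  [row]
  (into {} (remove (comp nil? val)) row))

(defn wrap-elide
  "Wraps a row function `f` so it receives rows with nil-valued keys removed."
  [f]
  (fn [row]
    (f (elide-missing row))))

;; Example on an ordinary map standing in for a dataset row:
(elide-missing {"product_to_region_code" nil "sku" "A-1"})
;; => {"sku" "A-1"}
```

The wrapped function could then be handed to row-map in place of the original, for example `(ds/row-map dataset (wrap-elide my-row-fn))`, where `my-row-fn` is whatever function was being used before.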

chrisn 00:08:45

It’s the same issue. That seems like a very reasonable expectation; I will address it soon.

chrisn 12:08:04

It’s fixed in beta-53

kenny 16:08:19

Thank you!!

chrisn 22:08:53

Open issue currently :-)