Fork me on GitHub
#data-science
<
2021-03-26
>
David Vujic16:03:54

👋 I have a question about .dataset and hope this is the channel to use 😄 I can't figure out how to tell the ->dataset to parse Excel date cells as dates (and not numerics). Which cells that are date cells is unknown when parsing, is there a way to default to dates? Currently I get a column with a numeric value (the Excel way of calculating dates: "days since 1900 1 jan").

David Vujic16:03:41

Example - a column is recognized as :datatype :float64 with values like 44258.0

genmeblog16:03:14

@UDRJMEFSN can you help here?

chrisn16:03:15

Could I get an example document? Do you have a safe, small one you can send? There is no 'date' type in POI and fastexcel's cell type. You can do it by supplying a custom parse function to convert from days-since-jan-1 to a local date type but I would want to see some data and I can get you a test parse line. Something like

chrisn16:03:20

(->dataset "data.xls" {:parser-fn {"date" [:local-date convert-to-local-date]}})

chrisn17:03:22

where convert-to-local-date is a function from a number to your date type. Be wary of nils or strings; excel can really put anything in any cell.

David Vujic17:03:06

Thanks! I think I succeeded with doing that, there was a similar example in the readme of the repo, I think.

chrisn17:03:07

Your function can return :http://tech.ml.dataset.parse/missing or :http://tech.ml.dataset.parse/parse-failure to denote a missing value or a value you expect to parse but it failed. In that case the value will be in the metadata of the column.

chrisn17:03:47

OK, great 🙂.

David Vujic17:03:15

If the cell containing date is unknown, is there a way to figure out what column is a date, in the meta data or something similar?

David Vujic17:03:51

Aha, do you mean that my parse function could do that - if parsing failed -> not a date?

chrisn17:03:34

I haven't found a way to do that yet. I think there may be but it is through the POI and fastexcel API's.

chrisn17:03:50

Also, if you want to build/edit spreadsheets there is a new library I think is really nice named https://github.com/zero-one-group/fxl.

👍 3
David Vujic17:03:21

Thanks! I'll have a look!

David Vujic12:03:03

In our scenario, we guess the contents of an excel file and present it in a UI as suggestion for later parsing. I have solved the Excel date thing, by reading the file with ->dataset as ususal. Then I take samples of columns from the dataset with type float64 to make an educated guess if it is a reasonable date, by using the Apache POI DateUtil/getLocalDateTime function. I collect the column names that are candidates for excel dates and build a parser-fn map and run the ->dataset again, with the parser functions. There's an extra roundtrip, but could work 😄 Let's see what my colleagues think of it 😅

chrisn12:03:45

I like that approach honestly. It probably parses a file more or less instantly and you get the entire dataset to run your type heuristics on.

chrisn13:03:22

You can re-parse a column though: https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.column.html#var-parse-column Then you can just re-parse the columns using same syntax as your parser-fn.

chrisn13:03:06

I should have directed you towards https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-column-cast - that is more general and not specific towards string columns.

David Vujic13:03:29

I read about the column-cast and will give that one a try too. I think I may have misunderstood the api of it yesterday, but it looks like a good approach. Thank you!

👍 3