#clojure-dev
2019-08-28
rutledgepaulv00:08:42

I’ve been running into some issues parsing CSVs generated by Excel. It looks like Excel (at least on OS X) adds a zero-width character at the beginning as a byte order mark, and this ends up being returned as part of the string in the first cell of the first row by org.clojure/data.csv. Thoughts on whether that should be handled in org.clojure/data.csv or whether it should be a consumer’s responsibility? My understanding is that it’s part of the metadata of the file and not intended to be part of the content. If the consensus is that data.csv should handle it, I can create a ticket and patch.

andy.fingerhut00:08:02

Do you know if it is a standard Unicode byte order mark byte sequence, or something different from those? I have a half memory that some Java Reader implementations might skip over those for you, but I do not know whether data.csv uses those.

rutledgepaulv00:08:45

I think it’s standard. I guess it might be a consequence of Excel using UTF-16 with a BOM and the file being read as UTF-8, where the BOM is interpreted as part of the content instead of indicating the byte order?

andy.fingerhut00:08:58

If data.csv lets you pass in a Java Reader that you create yourself, would it be easy to try an experiment with creating a UTF-16 decoding Reader?

rutledgepaulv00:08:39

yeah I can give that a shot. thanks.

andy.fingerhut00:08:16

It looks like there is a section of data.csv's README that mentions byte order marks, with a couple of suggested ways of handling them.

rutledgepaulv01:08:28

thanks! fwiw reading with a utf-16 encoding reader stripped that char too
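
[Editor's sketch] The behaviour described above can be reproduced with the JDK alone. This is a minimal sketch, not taken from the conversation: the class name `BomDecodeDemo` and the sample bytes are hypothetical. It assumes the standard JDK charsets, where the `UTF-8` decoder keeps a leading BOM as the character U+FEFF while the `UTF-16` decoder uses the BOM to pick the byte order and consumes it.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the sample bytes are made up for illustration.
final class BomDecodeDemo {
    // "a,b" preceded by the UTF-8 BOM (EF BB BF), decoded as UTF-8.
    static String decodeUtf8WithBom() {
        byte[] bytes = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'a', ',', 'b'};
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // "a,b" preceded by the little-endian UTF-16 BOM (FF FE), decoded as UTF-16.
    static String decodeUtf16WithBom() {
        byte[] bytes = {(byte) 0xFF, (byte) 0xFE, 'a', 0, ',', 0, 'b', 0};
        return new String(bytes, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        String utf8 = decodeUtf8WithBom();
        // The BOM survives as a zero-width first character of the first cell.
        System.out.println(utf8.length());                  // 4, not 3
        System.out.println(utf8.charAt(0) == 0xFEFF);       // true
        // The UTF-16 decoder consumes the BOM, leaving only the content.
        System.out.println(decodeUtf16WithBom().equals("a,b")); // true
    }
}
```

Note that this only shows how the two decoders treat a BOM that matches their own encoding; it does not settle which encoding Excel actually wrote.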

seancorfield01:08:36

FWIW, I run into this problem trying to use CLI/`deps.edn` on Windows sometimes. If I create a `deps.edn` file via echo or something similar in PowerShell, it ends up with a UTF-16 BOM, and tools.deps reads it in (and barfs on the content) rather than skipping it.

rutledgepaulv01:08:12

yeah. looks like the safe option is to use a BOMInputStream and auto-detect / skip
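
[Editor's sketch] The "auto-detect / skip" idea can be sketched with only the JDK. This is not the `BOMInputStream` class mentioned above (presumably the one from Apache Commons IO), and the names `BomSkippingReader`, `open`, and `decode` are hypothetical. It assumes UTF-8 when no BOM is present, and it recognises only the UTF-8 and UTF-16 BOMs, so a UTF-32LE file (whose BOM starts with FF FE) would be misdetected.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: peeks at up to 3 bytes, skips a known BOM, picks a charset.
final class BomSkippingReader {
    static Reader open(InputStream in) throws IOException {
        PushbackInputStream pb = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        while (n < 3) {                       // read may return fewer bytes than asked
            int r = pb.read(head, n, 3 - n);
            if (r < 0) break;
            n += r;
        }
        Charset cs = StandardCharsets.UTF_8;  // assumption: default when no BOM is found
        int skip = 0;
        if (n >= 3 && (head[0] & 0xFF) == 0xEF && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            skip = 3;                         // UTF-8 BOM
        } else if (n >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            cs = StandardCharsets.UTF_16BE;   // big-endian UTF-16 BOM
            skip = 2;
        } else if (n >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            cs = StandardCharsets.UTF_16LE;   // little-endian UTF-16 BOM
            skip = 2;
        }
        pb.unread(head, skip, n - skip);      // push back everything after the BOM
        return new InputStreamReader(pb, cs);
    }

    // Convenience wrapper for trying the reader on an in-memory byte array.
    static String decode(byte[] bytes) {
        try (Reader r = open(new ByteArrayInputStream(bytes))) {
            StringBuilder sb = new StringBuilder();
            for (int c; (c = r.read()) != -1; ) sb.append((char) c);
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A Reader produced this way could then be handed to whatever consumes the CSV, since the first cell no longer starts with U+FEFF regardless of which of the three BOMs the file carried.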

alexmiller01:08:18

There is a huge amount of backstory on BOMs and Java readers going back years