This page is not created by, affiliated with, or supported by Slack Technologies, Inc.
2019-08-28
Channels
- # aleph (1)
- # announcements (16)
- # bangalore-clj (1)
- # beginners (78)
- # cider (109)
- # clara (3)
- # cljdoc (6)
- # cljsjs (3)
- # clojure (209)
- # clojure-dev (11)
- # clojure-europe (1)
- # clojure-france (9)
- # clojure-italy (13)
- # clojure-nl (3)
- # clojure-spain (2)
- # clojure-spec (19)
- # clojure-uk (50)
- # clojurescript (41)
- # clojutre (2)
- # core-async (45)
- # cursive (2)
- # datomic (14)
- # emacs (6)
- # figwheel-main (1)
- # fulcro (101)
- # graalvm (1)
- # graphql (3)
- # jobs-discuss (3)
- # kaocha (12)
- # leiningen (8)
- # music (4)
- # off-topic (47)
- # parinfer (8)
- # pathom (17)
- # pedestal (53)
- # re-frame (47)
- # reagent (22)
- # reitit (4)
- # shadow-cljs (49)
- # tools-deps (87)
I’ve been running into some issues parsing CSVs generated by Excel. It looks like Excel (at least on macOS) adds a zero-width character at the beginning as a byte order mark, and it ends up being returned as part of the string in the first cell of the first row by org.clojure/data.csv. Thoughts on whether that should be handled in org.clojure/data.csv, or is it something that should be a consumer’s responsibility? My understanding is that it’s part of the metadata of the file and not intended to be part of the content. If the consensus is that data.csv should handle it, I can create a ticket and patch.
Do you know if it is a standard Unicode byte order mark byte sequence, or something different? I vaguely remember that some Java Reader implementations might skip those for you, but I don’t know whether data.csv uses them.
I think it’s standard.. I guess it might be a consequence of Excel writing a BOM and the file being read as UTF-8, where the BOM is interpreted as part of the content instead of as an indication of the byte order?
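A quick Java sketch of what’s likely happening here: the JDK’s UTF-8 decoder does not treat the BOM bytes (`EF BB BF`) specially, so they decode to a zero-width U+FEFF character at the front of the content rather than being consumed.

```java
import java.nio.charset.StandardCharsets;

public class BomDemo {
    public static void main(String[] args) {
        // UTF-8 encoding of the BOM code point U+FEFF is EF BB BF,
        // followed here by a single 'a'.
        byte[] bytes = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'a'};

        // Java's UTF-8 decoder passes the BOM through as content:
        String s = new String(bytes, StandardCharsets.UTF_8);
        System.out.println(s.length());        // 2, not 1
        System.out.println((int) s.charAt(0)); // 65279, i.e. U+FEFF
    }
}
```

So any CSV parser that just reads characters off a UTF-8 Reader will see U+FEFF glued to the first cell, which matches the symptom described above.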
If data.csv lets you pass in a Java Reader that you create yourself, would it be easy to try an experiment with creating a UTF-16 encoding Reader?
yeah I can give that a shot. thanks.
It looks like there is a section of data.csv's README that mentions byte order marks, with a couple of suggested ways of handling them.
thanks! FWIW, reading with a UTF-16 encoding reader stripped that char too
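That result matches how the JDK’s built-in `UTF-16` charset behaves: it uses the BOM to pick the byte order and then consumes it, so the zero-width character never reaches the content. A small sketch:

```java
import java.nio.charset.StandardCharsets;

public class Utf16BomDemo {
    public static void main(String[] args) {
        // Big-endian UTF-16 with BOM: FE FF, then 00 61 ("a").
        byte[] be = {(byte) 0xFE, (byte) 0xFF, 0x00, 0x61};
        // The "UTF-16" charset reads the BOM to choose byte order and drops it:
        System.out.println(new String(be, StandardCharsets.UTF_16)); // "a"

        // Little-endian UTF-16 with BOM: FF FE, then 61 00 ("a").
        byte[] le = {(byte) 0xFF, (byte) 0xFE, 0x61, 0x00};
        System.out.println(new String(le, StandardCharsets.UTF_16)); // "a"
    }
}
```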
FWIW, I run into this problem trying to use the CLI/`deps.edn` on Windows sometimes. If I create a `deps.edn` file via `echo` or something similar in PowerShell, it ends up with a UTF-16 BOM, and tools.deps reads it in (and barfs on the content) rather than skipping it.
yeah. looks like the safe option is to use a BOMInputStream and auto-detect / skip
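`BOMInputStream` comes from Apache Commons IO; for reference, the same detect-and-skip idea can be sketched with only the JDK using `PushbackInputStream`. This is a minimal sketch handling just the UTF-8 BOM (the helper name `skipUtf8Bom` is purely for illustration):

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

public class BomSkip {
    // If the stream starts with the UTF-8 BOM (EF BB BF), consume it;
    // otherwise push the bytes back so the caller sees them untouched.
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pb = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = pb.read(head, 0, 3);
        boolean bom = n == 3
            && head[0] == (byte) 0xEF
            && head[1] == (byte) 0xBB
            && head[2] == (byte) 0xBF;
        if (!bom && n > 0) {
            pb.unread(head, 0, n); // not a BOM: restore whatever we read
        }
        return pb;
    }

    public static void main(String[] args) throws IOException {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        InputStream in = skipUtf8Bom(new ByteArrayInputStream(withBom));
        String line = new BufferedReader(
            new InputStreamReader(in, StandardCharsets.UTF_8)).readLine();
        System.out.println(line); // prints "hi", BOM gone
    }
}
```

A fuller version (like Commons IO’s `BOMInputStream`) would also probe for the UTF-16 and UTF-32 BOM sequences before falling back to pushing the bytes back.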
There is a huge amount of backstory on BOMs and Java Readers going back years