#beginners
2023-01-19
Daniel Kvist17:01:35

Hi everyone! I'm reading a file with Swedish characters "åäö" using slurp, but when printing the contents these characters show up as some other Greek characters. The file is saved with UTF-8 encoding, and slurp should use UTF-8 by default as I understand it. I'm pretty bad at encodings and such, so I might be missing something obvious. Would be happy if someone could point me in the right direction!

hiredman17:01:41

1. how do you know the file is saved as utf?

hiredman17:01:01

2. what is the default encoding your jvm is using (this is usually utf8, but on some os'es it has been different, like osx)

hiredman17:01:45

3. what environment are you printing it in?

hiredman17:01:23

slurp does not default to utf8, it defaults to whatever the jvm defaults to

Alex Miller (Clojure team)17:01:50

you can pass an encoding as extra args to slurp too

Alex Miller (Clojure team)17:01:28

(slurp "foo" :encoding "UTF-8")

R.A. Porter17:01:27

And if you want to know what your default encoding is, you can run this in your REPL (java.nio.charset.Charset/defaultCharset)
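
To make the kind of mismatch being discussed concrete, here is a minimal sketch (not from the thread) that decodes the same UTF-8 bytes with two different charsets. The var names are made up for illustration, and the sample text is written with \u escapes so the result does not depend on how the source file itself is read.

```clojure
(import '(java.nio.charset Charset StandardCharsets))

;; "åäö" written with escapes: U+00E5, U+00E4, U+00F6
(def sample "\u00e5\u00e4\u00f6")

;; The bytes a UTF-8 file containing this text would hold (2 bytes per character)
(def utf8-bytes (.getBytes sample StandardCharsets/UTF_8))

;; Decoding with the matching charset round-trips the text
(def decoded-ok (String. utf8-bytes StandardCharsets/UTF_8))
;; => "åäö"

;; Decoding the same bytes as windows-1252 turns each character into two
(def decoded-wrong (String. utf8-bytes (Charset/forName "windows-1252")))
;; => "Ã¥Ã¤Ã¶"
```

The same effect can occur on the output side: correct characters encoded one way and displayed by a terminal that assumes another.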

Daniel Kvist17:01:56

I created the file myself and saved it with UTF-8 encoding, and I'm printing it to the VS Code console using println. The default encoding though seems to be windows-1252. However, I've tried explicitly specifying UTF-8 encoding when calling slurp but I still got the same result, is this expected behavior?

hiredman17:01:32

it's a whole thing

hiredman18:01:27

you have a file of bytes, you tell slurp to interpret it as utf8, slurp reads them in as utf8 characters converting from utf8 to the utf16 that the jvm uses to represent characters, then builds a string out of those characters

hiredman18:01:34

but strings get tagged with an encoding

hiredman18:01:13

and slurp doesn't specify encoding for the string it builds, so it gets whatever the jvm default is

lread18:01:54

I think the default for JDK18+ is UTF-8. But before that, platform dependent.

hiredman18:01:52

then when you go to print the string, those bytes are written to stdout, and whatever is displaying stdout to you may or may not match whatever charset those bytes are supposed to be

hiredman18:01:40

most likely if you run the jvm with the default encoding set to utf8 everything snaps in to line

hiredman18:01:16

but it is still possible if you are running just in like cmd.exe on windows, that might expect to be displaying windows-1252 and not utf8

hiredman18:01:07

-Dfile.encoding=UTF-8
is the flag for java to change the default encoding

hiredman18:01:50

strings store characters as java characters (utf16 sometimes optimized to be smaller as utf8, but for backwards compat they always have to appear to be utf16), but the encoding strings are tagged with will affect things like calls to getBytes

hiredman18:01:32

often it can be easier to verify that the bytes (the numeric values) are what you want them to be than trying to sort out display issues

Daniel Kvist18:01:45

Yeah, I can understand there are a lot of steps where the encoding could go wrong. How would I check the numeric values of the characters? I'll have to try changing the default encoding to UTF-8 in a bit. Maybe the encoding of the VS Code console isn't "right" either. Thanks for amazing help so far!

hiredman18:01:37

something like https://docs.oracle.com/javase/7/docs/api/java/lang/String.html#codePointAt(int) might be the best way, you'll get a numeric value and then check if that value matches the Unicode code point of the character you expect

hiredman18:01:51

if you google "å unicode code point" google has an info box that gives you the hex
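
A small sketch of that check, assuming a hypothetical helper named utf16-units (not part of any library). It lists the UTF-16 code units of a string in hex; for characters in the Basic Multilingual Plane, such as å, ä and ö, these are the same numbers as the Unicode code points.

```clojure
;; Hypothetical helper: hex value of each UTF-16 code unit in s.
;; For BMP characters this equals the Unicode code point.
(defn utf16-units [s]
  (mapv #(format "U+%04X" (int %)) s))

(utf16-units "\u00e5\u00e4\u00f6")
;; => ["U+00E5" "U+00E4" "U+00F6"]

;; codePointAt gives the same number for a single position
(.codePointAt "\u00e5\u00e4\u00f6" 0)
;; => 229 (0xE5)
```

If a slurped string yields these values, the text was read correctly and any remaining problem is on the printing or sending side.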

didibus18:01:18

> and slurp doesn't specify encoding for the string it builds, so it gets whatever the jvm default is

That's pretty weird, is this not something worth changing? I'm not even sure how slurp would do this: read the bytes as a UTF-8 string and then create another string out of it?

hiredman18:01:25

@U0K064KQV I think I might not be interpreting what it does correctly

hiredman18:01:59

I don't use stringreaders and writers much, go right for baos and bais

hiredman18:01:29

like, if you had an inputstream of your file (not a reader, streams are byte based, no encoding) and copied it in to a baos(ByteArrayOutputStream), and then took the byte array and constructed a string explicitly using the utf8 encoding and see how that prints, that would rule out encoding issues while reading in
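
The byte-level approach described above could look roughly like this. It is a sketch only; read-utf8 is a made-up name, and the path is whatever file is being checked.

```clojure
(require '[clojure.java.io :as io])
(import '(java.io ByteArrayOutputStream))

;; Hypothetical helper: copy the raw bytes of a file into a
;; ByteArrayOutputStream, then decode them explicitly as UTF-8.
;; No reader is involved, so the JVM default charset plays no role in reading.
(defn read-utf8 [path]
  (with-open [in  (io/input-stream path)
              out (ByteArrayOutputStream.)]
    (io/copy in out)
    (String. (.toByteArray out) "UTF-8")))
```

If the string returned by (read-utf8 "input.edn") has the expected code points but still prints oddly, the reading step can be ruled out.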

Daniel Kvist18:01:13

@U0NCTKEV8 thanks, I'll check the character codes then and see if they are correct! I realized also maybe I should have mentioned I'm trying to do this in a Babashka-script. So changing the default JVM-encoding would not change anything in this case, correct?

lread18:01:14

On Windows 10 (I assume you are on Windows?):

PS C:\Users\lee> bb
Babashka v1.0.170 REPL.
Use :repl/quit or :repl/exit to quit the REPL.
Clojure rocks, Bash reaches.

user=> (java.nio.charset.Charset/defaultCharset)
#object[sun.nio.cs.MS1252 0xf912cf3 "windows-1252"]

lread19:01:26

Can't remember if there is a way to change the default encoding for bb, @U04V15CAJ would be the person to answer that.

borkdude19:01:38

If you want to change the default encoding, you can call bb like this: bb -Dfile.encoding=UTF-8 ...

borkdude19:01:54

C:\Users\borkdude>bb -Dfile.encoding=UTF-8
Babashka v1.0.170 REPL.
Use :repl/quit or :repl/exit to quit the REPL.
Clojure rocks, Bash reaches.

user=>  (System/getProperty "file.encoding")
"UTF-8"
user=> ^D^C
C:\Users\borkdude>bb
Babashka v1.0.170 REPL.
Use :repl/quit or :repl/exit to quit the REPL.
Clojure rocks, Bash reaches.

user=>  (System/getProperty "file.encoding")
"Cp1252"

lread19:01:23

Oh.. maybe (java.nio.charset.Charset/defaultCharset) does not reflect file.encoding?

borkdude19:01:03

I think so yes

lread19:01:29

C:\Users\lee>bb -Dfile.encoding=UTF-8
Babashka v1.0.170 REPL.
Use :repl/quit or :repl/exit to quit the REPL.
Clojure rocks, Bash reaches.

user=> (java.nio.charset.Charset/defaultCharset)
#object[sun.nio.cs.MS1252 0xf912cf3 "windows-1252"]

lread19:01:59

Anyway @U04GNS14BQB, because the default encoding can be different on different platforms, best to explicitly specify when reading and writing.
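
A minimal sketch of being explicit in both directions, using the :encoding option that slurp and spit accept. The temporary file and the var names are only there for illustration.

```clojure
;; Illustration only: a throwaway file to write to and read from
(def tmp (java.io.File/createTempFile "encoding-demo" ".txt"))

;; Write with an explicit encoding instead of relying on the platform default
(spit tmp "\u00e5\u00e4\u00f6" :encoding "UTF-8")

;; Three characters, two bytes each in UTF-8
(def byte-count (.length tmp))
;; => 6

;; Read back with the same explicit encoding
(def round-tripped (slurp tmp :encoding "UTF-8"))
;; => "åäö"
```

With both sides pinned to UTF-8, the result should not depend on whether the platform default is windows-1252 or something else.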

👍 4
Daniel Kvist21:01:51

Thanks @U04V15CAJ, managed to change the default encoding this way. Had to use quotes around the statement for some reason though!

Daniel Kvist21:01:10

@UE21H2HHD sounds like a good idea. However, changing the default encoding when running bb still did not change anything. I also tried changing the VS Code terminal encoding, which didn't seem to change anything either, or else I just didn't get it right!

borkdude21:01:07

@U04GNS14BQB In doubt, also try with Clojure JVM.

Daniel Kvist21:01:21

What I'm really trying to achieve with my little script is generating speech audio-files by slurping an EDN-file, sending the (processed) contents to the Google Text-To-Speech API using babashka.curl, and so what led me to discovering this problem in the first place was that when listening to the generated speech, I realized that these "special" Swedish characters were just replaced with quiet. So they don't seem to be recognized correctly by Google either for some reason. When checking the character codes of the slurped - but oddly printed - characters, they seem to be corresponding to the right characters although in UTF-16 character codes.

borkdude21:01:51

if you have WSL2 on your system, it would be interesting to see what file your-file.txt makes of it

borkdude21:01:16

or better: file --mime-encoding <the-file>

borkdude21:01:07

or just send the file here on slack and I'll check for you

Daniel Kvist21:01:41

@U04V15CAJ running file --mime-encoding input.edn returns input.edn: utf-8

hiredman21:01:25

have you tried passing it as an input stream to babashka.curl instead of as a string?

borkdude21:01:49

it also accepts (io/file ..)

hiredman21:01:10

I can't quite articulate it, but it seems like maybe there could be something fishy there, because if you pass a string that string becomes a command line argument to curl, and I dunno maybe something funky with encodings for command line arguments

borkdude21:01:03

you can just pass the file reference and it won't be processed by babashka at all, it will just become a file argument to curl

hiredman21:01:05

(I guess it isn't really a command line argument, there is no shell running, it just directly invokes the thing with args, but maybe curl interprets the arg string with the windows locale or whatever)

borkdude21:01:48

also the Greek character stuff could be an artifact of printing, i.e. your terminal doesn't support printing these characters? What do you see for (println "åäö")?

Daniel Kvist22:01:52

It's probably easier if I just send some code. Here's part of how I'm doing it:

;; Create a request body for synthesizing the given SSML
(defn req-body [ssml voice]
  {:input {:ssml ssml}
   :voice voice
   :audioConfig {:audioEncoding "MP3"}})

;; Send a request to the TTS API with the given body
(defn send-request [req-body]
  (curl/post ""
             {:compressed false
              :headers {"Authorization" (str "Bearer " access-token)
                        "Content-Type" "application/json; charset=utf-8"}
              :body (json/generate-string req-body)}))
The voice argument is a subset of a map retrieved by slurping the file and parsing the contents with core.edn/read-string. I could refactor it to pass a file reference directly, but I'd like to be able to provide the data to the request partially from the file and partially from within the script, if that makes sense!

Daniel Kvist22:01:01

As for printing "åäö", if I just run bb in the terminal and execute (println "åäö") I actually get the correct characters printed, even without specifying the encoding explicitly.

borkdude22:01:47

@U04GNS14BQB It would be even better if you could make a github repo with this file and this code. If you send it to a service like https://postman-echo.com you will get the request you sent back as JSON

borkdude22:01:13

I can then try it on windows as well

Daniel Kvist22:01:56

I simply run it using bb script.clj input.edn. For me, this produces the following output:

{:args {},
 :data {:input {:ssml {:sample "Eget k?rf?lt"}}},
 :files {},
 :form {},
 :headers
 {:x-forwarded-proto "https",
  :x-forwarded-port "443",
  :host "",
  :x-amzn-trace-id "Root=1-63c9c424-7d0d9c76671e069327bee74d",
  :content-length "44",
  :user-agent "curl/7.83.1",
  :accept "*/*",
  :content-type "application/json; charset=utf-8"},
 :json {:input {:ssml {:sample "Eget k?rf?lt"}}},
 :url ""}

borkdude22:01:28

ok, I'm getting: :json {:input {:ssml {:sample "Eget körfält"}}} here on mac, I can try windows tomorrow

Daniel Kvist22:01:22

Okay, interesting... Would be awesome if you'd like to do that! I get the same odd result when running it with PowerShell, Command Prompt and Git Bash, both with and without specifying explicit default encoding with -Dfile.encoding=UTF-8. Maybe something odd with my setup 😅 Let's see how it turns out for you tomorrow!

hiredman22:01:07

I still think you should try passing it as a file or inputstream instead of a string

borkdude22:01:36

@U0NCTKEV8 the issue is that he is constructing a JSON string and you can't embed an input stream or file within a JSON string

hiredman22:01:59

sure, I mean, I am not saying forever do it that way, I just think because that causes the data to be processed differently by the library, seeing whether and how the results differ would help narrow down the source of the difference

👍 2
Daniel Kvist22:01:48

@U04V15CAJ I don't have bb installed on WSL2, and installing it proved a bit more complicated than what I have time for tonight. Haven't used WSL2 much at all before, so I don't have brew installed nor any knowledge to do it in another way 😅 Will have to continue with that tomorrow night!

Daniel Kvist22:01:51

@U0NCTKEV8 I'll try this tomorrow as well, I suppose it could maybe rule out some possible causes. Thanks!

borkdude22:01:19

bash < <(curl -s )

Daniel Kvist23:01:57

@U04V15CAJ awesome! Testing this as soon as I get home tomorrow!

lread23:01:24

From what I'm reading I think cmd.exe and Powershell UTF-8 support is... umm.... confusing. But, apparently, Microsoft created https://apps.microsoft.com/store/detail/windows-terminal/9N0DX20HK701?hl=en-us&gl=us&activetab=pivot%3Aoverviewtab to address this problem?

lread23:01:33

Dunno, hope I'm not sending you on a wild goose chase there.

borkdude10:01:26

@U04GNS14BQB It appears to be an issue with curl on windows. You can use babashka.http-client instead and then it works as expected:

((requiring-resolve 'babashka.deps/add-deps) '{:deps {org.babashka/http-client {:mvn/version "0.0.2"}}})

(require '[cheshire.core :as json]
         '[clojure.edn :as edn]
         '[clojure.pprint :refer [pprint]]
         '[babashka.http-client :as http])

(defn send-request [ssml]
  (http/post ""
             {:headers {"Content-Type" "application/json; charset=utf-8"}
              :body (json/generate-string {:input {:ssml ssml}})}))

borkdude10:01:23

I'll dig a little deeper to see why this is a problem with curl

borkdude10:01:02

I'm not even going to bother digging deeper after googling around... Just use bb.http-client or the built-in org.httpkit.client - I'll make bb.http-client a built-in on the next release or so

Daniel Kvist18:01:46

@U04V15CAJ oh, alright! I suppose that works just as well (or, better in this case), I just happened to find babashka.curl first! As for the weird output in terminal I'll see if I can figure something out some other time, at least the script will do what I want for now. Super happy for all the effort to help, thanks to all of you! 🙂

borkdude13:01:06

@U04GNS14BQB babashka.http-client is now included with bb v1.1.171

👍 2
Andrew Carlile23:01:18

really boiling my noodle over using pathom eql, eg com.wsscode.pathom3.interface.async.eql/process.

Andrew Carlile23:01:38

I’m trying to set up a super basic api. here’s a set of image galleries:

(pco/defresolver base
  [{id :gallery/id}]
  {::pco/output [:gallery/name
                 {:gallery/images [:filepath]}]}
  (condp = id
    0 {:gallery/name "Foo"
       :gallery/images
       [{:filepath "/images/foo1.jpg"}
        {:filepath "/images/foo2.jpg"}]}
    1 {:gallery/name "Bar"
       :gallery/images
       [{:filepath "/images/bar1.jpg"}
        {:filepath "/images/bar2.jpg"}]}))

Andrew Carlile23:01:09

here’s my “get all” resolver:

(pco/defresolver all
  []
  {::pco/output [{:galleries/all [:gallery/id]}]}
  (js/console.log "getting all galleries")
  (p/promise
   {:galleries/all [{:gallery/id 0}
                    {:gallery/id 1}
                    {:gallery/id 2}]}))

Andrew Carlile23:01:13

here’s the frontend event I dispatch to “get all”

(re-frame/reg-event-fx
 ::get-galleries
 (fn [{:keys [_db]} _]
   {:request {:name :get-galleries
              :params [{:galleries/all
                        [:gallery/id
                         :gallery/name]}]
              :on-success [::get-galleries-success]
              :on-failure [::get-galleries-failure]}}))

Andrew Carlile23:01:28

but I can’t for the life of me figure out how to “get one”

Andrew Carlile23:01:28

this fails:

(re-frame/reg-event-fx
 ::get-gallery
 (fn [{:keys [_db]} [_ gallery-id]]
   (js/console.log "com.archemedx.frontend.events/get-gallery: " gallery-id)
   {:request {:name :get-gallery
              :params '`[{(:>/gallery {:gallery/id ~id})
                          [:gallery/id
                           :gallery/name]]}]
              :on-success [::get-gallery-success]
              :on-failure [::get-gallery-failure]}}))

seancorfield23:01:35

Should ~id be ~gallery-id?

Andy Carlile23:01:11

yes, that's a typo here. I still get the same error. I find it difficult to inspect the query response because it comes back as a Promise[~] and I don't know how to force it to resolve

seancorfield00:01:31

You might need to ask in #C87NB2CFN to get more specific help...

Jakub Holý (HolyJak)00:01:19

What is the error you are getting?

Jakub Holý (HolyJak)00:01:12

This works just fine to “get one”

(pco/defresolver get-v [_]
  {::pco/output [{:coll [:v]}]
   ::pco/batch? false}
  {:coll [{:v 1}]})

(->
  (com.wsscode.pathom3.interface.eql/process
    (com.wsscode.pathom3.connect.indexes/register get-v)
    {}
    [{:coll [:v]}])
  :coll first)
Do you have a Pathom problem or a re-frame problem? Try to simplify it down with hardcoded data and such

wilkerlucio02:01:20

hello @U03HD1DNGDP! I think a nice way to handle the one here is to support some key like :entity at that :request effect, which you can send as an argument to process, like: (p.a.eql/process env (:entity params) query)

wilkerlucio02:01:44

(re-frame/reg-event-fx
  ::get-gallery
  (fn [{:keys [_db]} [_ gallery-id]]
    (js/console.log "com.archemedx.frontend.events/get-gallery: " gallery-id)
    {:request {:name   :get-gallery
               :entity {:gallery/id gallery-id}
               :params [:gallery/id
                        :gallery/name]}}))

wilkerlucio03:01:18

the way you are doing with the placeholder is also valid, but I think your unquoting of id might be wrong, you can try using this as params instead:

[{(list :>/gallery {:gallery/id gallery-id})
  [:gallery/id
   :gallery/name]}]