This page is not created by, affiliated with, or supported by Slack Technologies, Inc.
2023-01-19
Channels
- # announcements (19)
- # asami (9)
- # babashka (26)
- # beginners (87)
- # biff (23)
- # calva (6)
- # clerk (7)
- # clj-kondo (3)
- # cljsrn (3)
- # clojure (115)
- # clojure-belgium (1)
- # clojure-berlin (1)
- # clojure-europe (31)
- # clojure-gamedev (5)
- # clojure-nl (2)
- # clojure-norway (8)
- # clojure-uk (2)
- # clojurescript (43)
- # clr (23)
- # datalevin (1)
- # datomic (14)
- # dev-tooling (23)
- # fulcro (38)
- # graphql (1)
- # gratitude (1)
- # jobs (1)
- # lsp (30)
- # off-topic (7)
- # pathom (25)
- # portal (21)
- # quil (6)
- # releases (5)
- # remote-jobs (1)
- # shadow-cljs (34)
- # sql (5)
- # tools-deps (6)
- # xtdb (13)
Hi everyone! I'm reading a file with Swedish characters "åäö" using slurp, but when printing the contents these characters show up as some other Greek characters. The file is saved with UTF-8 encoding, and slurp should use UTF-8 by default as I understand it. I'm pretty bad at encodings and such, so I might be missing something obvious. Would be happy if someone could point me in the right direction!
2. what is the default encoding your JVM is using? (this is usually UTF-8, but on some OSes it has been different, like macOS)
you can pass an encoding as an extra arg to slurp too
(slurp "foo" :encoding "UTF-8")
And if you want to know what your default encoding is, you can run this in your REPL
(java.nio.charset.Charset/defaultCharset)
I created the file myself and saved it with UTF-8 encoding, and I'm printing it to the VS Code console using println. The default encoding though seems to be windows-1252. However, I've tried explicitly specifying UTF-8 encoding when calling slurp but I still got the same result, is this expected behavior?
you have a file of bytes; you tell slurp to interpret it as UTF-8; slurp reads them in as UTF-8 characters, converting from UTF-8 to the UTF-16 that the JVM uses to represent characters, then builds a string out of those characters
and slurp doesn't specify an encoding for the string it builds, so it gets whatever the JVM default is
then when you go to print the string, those bytes are written to stdout, and whatever is displaying stdout to you may or may not match whatever charset those bytes are supposed to be
most likely if you run the JVM with the default encoding set to UTF-8, everything snaps into line
but it is still possible, if you are running just in cmd.exe on Windows, that it might expect to be displaying windows-1252 and not UTF-8
strings store characters as Java characters (UTF-16, sometimes optimized internally to be smaller, but for backwards compat they always have to appear to be UTF-16), and the charset you pick will affect things like calls to getBytes
often it can be easier to verify that the bytes (the numeric values) are what you want them to be than trying to sort out display issues
Yeah, I can understand there are a lot of steps where the encoding could go wrong. How would I check the numeric values of the characters? I'll have to try changing the default encoding to UTF-8 in a bit. Maybe the encoding of the VS Code console isn't "right" either. Thanks for the amazing help so far!
something like https://docs.oracle.com/javase/7/docs/api/java/lang/String.html#codePointAt(int) might be the best way; you'll get a numeric value, and then check if that value matches the Unicode code point of the character you expect
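A minimal sketch of that check; the values 229, 228 and 246 are the standard Unicode code points of å, ä and ö:

```clojure
;; Inspect the numeric values of the characters you read in.
(def s "åäö")

(map int s)          ;; => (229 228 246)
(.codePointAt s 0)   ;; => 229
```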
> and slurp doesn't specify an encoding for the string it builds, so it gets whatever the JVM default is

That's pretty weird, is this not something worth changing? I'm not even sure how slurp would do this, read the bytes as a UTF-8 string and then create another string out of it?
@U0K064KQV I think I might not be interpreting what it does correctly
like, if you had an input stream of your file (not a reader; streams are byte-based, no encoding) and copied it into a ByteArrayOutputStream (baos), and then took the byte array and constructed a string explicitly using the UTF-8 encoding and saw how that prints, that would rule out encoding issues while reading in
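That suggestion could be sketched like this (the path is a stand-in for your own file):

```clojure
(require '[clojure.java.io :as io])

;; Copy the raw bytes of the file (no charset involved at this point),
;; then decode them explicitly as UTF-8.
(defn slurp-utf8 [path]
  (let [baos (java.io.ByteArrayOutputStream.)]
    (with-open [in (io/input-stream path)]
      (io/copy in baos))
    (String. (.toByteArray baos) "UTF-8")))
```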
@U0NCTKEV8 thanks, I'll check the character codes then and see if they are correct! I realized also maybe I should have mentioned I'm trying to do this in a Babashka script. So changing the default JVM encoding would not change anything in this case, correct?
On Windows 10 (I assume you are Windows?):
PS C:\Users\lee> bb
Babashka v1.0.170 REPL.
Use :repl/quit or :repl/exit to quit the REPL.
Clojure rocks, Bash reaches.
user=> (java.nio.charset.Charset/defaultCharset)
#object[sun.nio.cs.MS1252 0xf912cf3 "windows-1252"]
Can't remember if there is a way to change the default encoding for bb, @U04V15CAJ would be the person to answer that.
If you want to change the default encoding, you can call bb like this: bb -Dfile.encoding=UTF-8 ...
C:\Users\borkdude>bb -Dfile.encoding=UTF-8
Babashka v1.0.170 REPL.
Use :repl/quit or :repl/exit to quit the REPL.
Clojure rocks, Bash reaches.
user=> (System/getProperty "file.encoding")
"UTF-8"
user=> ^D^C
C:\Users\borkdude>bb
Babashka v1.0.170 REPL.
Use :repl/quit or :repl/exit to quit the REPL.
Clojure rocks, Bash reaches.
user=> (System/getProperty "file.encoding")
"Cp1252"
C:\Users\lee>bb -Dfile.encoding=UTF-8
Babashka v1.0.170 REPL.
Use :repl/quit or :repl/exit to quit the REPL.
Clojure rocks, Bash reaches.
user=> (java.nio.charset.Charset/defaultCharset)
#object[sun.nio.cs.MS1252 0xf912cf3 "windows-1252"]
Anyway @U04GNS14BQB, because the default encoding can be different on different platforms, best to explicitly specify when reading and writing.
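That advice in a minimal sketch ("out.txt" is just an illustrative path): name the charset on both ends instead of relying on the platform default.

```clojure
;; Write and read with an explicit charset so the platform default
;; never gets a say.
(spit "out.txt" "åäö" :encoding "UTF-8")
(slurp "out.txt" :encoding "UTF-8")   ;; => "åäö"
```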
Thanks @U04V15CAJ, managed to change the default encoding this way. Had to use quotes around the statement for some reason though!
@UE21H2HHD sounds like a good idea. However, changing the default encoding when running bb still did not change anything. I also tried changing the VS Code terminal encoding, which didn't seem to change anything either, or else I just didn't get it right!
@U04GNS14BQB If in doubt, also try with the Clojure JVM.
What I'm really trying to achieve with my little script is generating speech audio files by slurping an EDN file and sending the (processed) contents to the Google Text-to-Speech API using babashka.curl. What led me to discovering this problem in the first place was that when listening to the generated speech, I realized that these "special" Swedish characters were just replaced with silence. So they don't seem to be recognized correctly by Google either for some reason.
When checking the character codes of the slurped (but oddly printed) characters, they seem to correspond to the right characters, although in UTF-16 character codes.
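For reference, a small sketch of how a charset mismatch mangles exactly these characters; this is a common failure mode, not necessarily the exact one in play here:

```clojure
;; å is the two bytes 0xC3 0xA5 in UTF-8; windows-1252 reads those
;; bytes as the two characters Ã and ¥.
(String. (.getBytes "å" "UTF-8") "windows-1252")   ;; => "Ã¥"

;; The reverse direction: windows-1252 bytes for åäö are not valid
;; UTF-8, so decoding them as UTF-8 yields replacement characters.
(String. (.getBytes "åäö" "windows-1252") "UTF-8")
```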
if you have WSL2 on your system, it would be interesting to see what file your-file.txt makes of it
@U04V15CAJ running file --mime-encoding input.edn returns input.edn: utf-8
have you tried passing it as an input stream to babashka.curl instead of as a string?
I can't quite articulate it, but it seems like maybe there could be something fishy there, because if you pass a string that string becomes a command line argument to curl, and I dunno maybe something funky with encodings for command line arguments
you can just pass the file reference and it won't be processed by babashka at all, it will just become a file argument to curl
(I guess it isn't really a command line argument, there is no shell running, it just directly invokes the thing with args, but maybe curl interprets the arg string with the windows locale or whatever)
also the Greek character stuff could be an artifact of printing, i.e. your terminal doesn't support printing these characters?
What do you see for (println "åäö")?
It's probably easier if I just send some code. Here's part of how I'm doing it:
;; Create a request body for synthesizing the given SSML
(defn req-body [ssml voice]
  {:input {:ssml ssml}
   :voice voice
   :audioConfig {:audioEncoding "MP3"}})

;; Send a request to the TTS API with the given body
(defn send-request [req-body]
  (curl/post ""
             {:compressed false
              :headers {"Authorization" (str "Bearer " access-token)
                        "Content-Type" "application/json; charset=utf-8"}
              :body (json/generate-string req-body)}))
The voice argument is a subset of a map retrieved by slurping the file and parsing the contents with core.edn/read-string. I could refactor it to pass a file reference directly, but I'd like to be able to provide the data to the request partially from the file and partially from within the script, if that makes sense!
As for printing "åäö", if I just run bb in the terminal and execute (println "åäö"), I actually get the correct characters printed, even without specifying the encoding explicitly.
@U04GNS14BQB It would be even better if you could make a github repo with this file and this code. If you send it to a service like https://postman-echo.com you will get the request you sent back as JSON
@U04V15CAJ here is a repo with a minimal reproducible example: https://github.com/danielkvist/babashka-swedish-characters
I simply run it using bb script.clj input.edn. For me, this produces the following output:
{:args {},
 :data {:input {:ssml {:sample "Eget k?rf?lt"}}},
 :files {},
 :form {},
 :headers {:x-forwarded-proto "https",
           :x-forwarded-port "443",
           :host "",
           :x-amzn-trace-id "Root=1-63c9c424-7d0d9c76671e069327bee74d",
           :content-length "44",
           :user-agent "curl/7.83.1",
           :accept "*/*",
           :content-type "application/json; charset=utf-8"},
 :json {:input {:ssml {:sample "Eget k?rf?lt"}}},
 :url ""}
ok, I'm getting :json {:input {:ssml {:sample "Eget körfält"}}} here on mac, I can try windows tomorrow
Okay, interesting... Would be awesome if you'd like to try that! I get the same odd result when running it with PowerShell, Command Prompt, and Git Bash, both with and without specifying an explicit default encoding with -Dfile.encoding=UTF-8. Maybe something odd with my setup 😅 Let's see how it turns out for you tomorrow!
@U04GNS14BQB and in WSL2?
I still think you should try passing it as a file or input stream instead of a string
@U0NCTKEV8 the issue is that he is constructing a JSON string and you can't embed an input stream or file within a JSON string
sure, I mean, I am not saying do it that way forever; I just think that because that causes the data to be processed differently by the library, seeing whether the results are different would help narrow down the source of the difference
@U04V15CAJ I don't have bb installed on WSL2, and installing it proved a bit more complicated than what I have time for tonight. Haven't used WSL2 much at all before, so I don't have brew installed nor any knowledge to do it in another way 😅 Will have to continue with that tomorrow night!
@U0NCTKEV8 I'll try this tomorrow as well, I suppose it could maybe rule out some possible causes. Thanks!
@U04V15CAJ awesome! Testing this as soon as I get home tomorrow!
From what I'm reading I think cmd.exe and Powershell UTF-8 support is... umm.... confusing. But, apparently, Microsoft created https://apps.microsoft.com/store/detail/windows-terminal/9N0DX20HK701?hl=en-us&gl=us&activetab=pivot%3Aoverviewtab to address this problem?
@U04GNS14BQB It appears to be an issue with curl on windows
You can use babashka.http-client instead and then it works as expected:
((requiring-resolve 'babashka.deps/add-deps)
 '{:deps {org.babashka/http-client {:mvn/version "0.0.2"}}})

(require '[cheshire.core :as json]
         '[clojure.edn :as edn]
         '[clojure.pprint :refer [pprint]]
         '[babashka.http-client :as http])

(defn send-request [ssml]
  (http/post ""
             {:headers {"Content-Type" "application/json; charset=utf-8"}
              :body (json/generate-string {:input {:ssml ssml}})}))
I'm not even going to bother digging deeper after googling around... Just use bb.http-client or the built-in org.httpkit.client - I'll make bb.http-client a built-in on the next release or so
@U04V15CAJ oh, alright! I suppose that works just as well (or better, in this case), I just happened to find babashka.curl first! As for the weird output in the terminal, I'll see if I can figure something out some other time; at least the script will do what I want for now. Super happy for all the effort to help, thanks to all of you! 🙂
really boiling my noodle over using Pathom EQL, e.g. com.wsscode.pathom3.interface.async.eql/process.
I’m trying to set up a super basic API. Here’s a set of image galleries:
(pco/defresolver base
  [{id :gallery/id}]
  {::pco/output [:gallery/name
                 {:gallery/images [:filepath]}]}
  (condp = id
    0 {:gallery/name "Foo"
       :gallery/images [{:filepath "/images/foo1.jpg"}
                        {:filepath "/images/foo2.jpg"}]}
    1 {:gallery/name "Bar"
       :gallery/images [{:filepath "/images/bar1.jpg"}
                        {:filepath "/images/bar2.jpg"}]}))
here’s my “get all” resolver:
(pco/defresolver all
  []
  {::pco/output [{:galleries/all [:gallery/id]}]}
  (js/console.log "getting all galleries")
  (p/promise
   {:galleries/all [{:gallery/id 0}
                    {:gallery/id 1}
                    {:gallery/id 2}]}))
here’s the frontend event I dispatch to “get all”
(re-frame/reg-event-fx
 ::get-galleries
 (fn [{:keys [_db]} _]
   {:request {:name :get-galleries
              :params [{:galleries/all
                        [:gallery/id
                         :gallery/name]}]
              :on-success [::get-galleries-success]
              :on-failure [::get-galleries-failure]}}))
but I can’t for the life of me figure out how to “get one”
this fails:
(re-frame/reg-event-fx
 ::get-gallery
 (fn [{:keys [_db]} [_ gallery-id]]
   (js/console.log "com.archemedx.frontend.events/get-gallery: " gallery-id)
   {:request {:name :get-gallery
              :params '`[{(:>/gallery {:gallery/id ~id})
                          [:gallery/id
                           :gallery/name]}]
              :on-success [::get-gallery-success]
              :on-failure [::get-gallery-failure]}}))
Should ~id be ~gallery-id?
yes, that's a typo here. I still get the same error. I find it difficult to inspect the query response because it comes back as a Promise[~] and I don't know how to force it to resolve
You might need to ask in #C87NB2CFN to get more specific help...
What is the error you are getting?
This works just fine to “get one”
(pco/defresolver get-v [_]
  {::pco/output [{:coll [:v]}]
   ::pco/batch? false}
  {:coll [{:v 1}]})

(-> (com.wsscode.pathom3.interface.eql/process
     (com.wsscode.pathom3.connect.indexes/register get-v)
     {}
     [{:coll [:v]}])
    :coll
    first)
Do you have a Pathom or a re-frame problem? Try to simplify it down with hardcoded data and such.
hello @U03HD1DNGDP! I think a nice way to handle the "get one" here is to support some key like :entity at that :request effect, which you can send as an argument to process, like: (p.a.eql/process env (:entity params) query)
(re-frame/reg-event-fx
 ::get-gallery
 (fn [{:keys [_db]} [_ gallery-id]]
   (js/console.log "com.archemedx.frontend.events/get-gallery: " gallery-id)
   {:request {:name :get-gallery
              :entity {:gallery/id gallery-id}
              :params [:gallery/id
                       :gallery/name]}}))
the way you are doing with the placeholder is also valid, but I think your unquoting of id might be wrong, you can try using this as params instead:
[{(list :>/gallery {:gallery/id gallery-id})
  [:gallery/id
   :gallery/name]}]
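To make both suggestions concrete, here is a hedged, self-contained sketch using the synchronous Pathom 3 interface (the thread uses the async interface, which returns promises; the resolver body and names here are invented for illustration, assuming Pathom 3 is on the classpath):

```clojure
(require '[com.wsscode.pathom3.connect.operation :as pco]
         '[com.wsscode.pathom3.connect.indexes :as pci]
         '[com.wsscode.pathom3.interface.eql :as p.eql])

;; Illustrative resolver: derives a name from the id.
(pco/defresolver gallery-name [{id :gallery/id}]
  {::pco/output [:gallery/name]}
  {:gallery/name (str "Gallery " id)})

(def env (pci/register gallery-name))

;; The :entity style: seed the entity data, then query it.
(p.eql/process env {:gallery/id 0} [:gallery/id :gallery/name])

;; The placeholder style: (list ...) builds the parameterized key at
;; runtime, so gallery-id is an ordinary evaluated value, no unquoting.
(p.eql/process env [{(list :>/gallery {:gallery/id 0})
                     [:gallery/id :gallery/name]}])
```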