#babashka
2021-05-09
Ben Sless06:05:29

Am I doing something wrong here or is there a bug?

bb repl
Babashka v0.4.0 REPL.
Use :repl/quit or :repl/exit to quit the REPL.
Clojure rocks, Bash reaches.

user=> (require '[babashka.process :refer [process sh]])
nil
user=> (sh ["youtube-dl" "--dump-json" "--ignore-errors" "--skip-download" ""])
Segmentation fault (core dumped)
An even more minimal example:
bb repl
Babashka v0.4.0 REPL.
Use :repl/quit or :repl/exit to quit the REPL.
Clojure rocks, Bash reaches.

user=> (require '[babashka.process :refer [process sh]])
nil
user=> (sh ["youtube-dl"])
Segmentation fault (core dumped)

borkdude07:05:45

@ben.sless are you using the static Linux binary perhaps?

borkdude07:05:47

If that’s the case, try the normal one

lispyclouds07:05:44

I've linked you to the thread where we are discussing the issues

lispyclouds07:05:28

@ben.sless what OS and kernel versions are you using? I will try to dig a bit deeper

Ben Sless08:05:16

@rahul080327

OS: Pop 20.10 groovy
Kernel: x86_64 Linux 5.11.0-7614-generic

lispyclouds08:05:24

Thanks!

👍 3
lispyclouds09:05:26

ah

> By default, --static will statically link your native-image against your host libc (glibc)
https://github.com/oracle/graal/issues/2691#issuecomment-670010352

lispyclouds09:05:41

this could explain the segfaults

lispyclouds09:05:06

i would advocate for using musl too

borkdude09:05:38

ok PR welcome for whoever wants to figure out how to do this :)

lispyclouds09:05:14

will try this

borkdude09:05:39

Maybe try with pod-babashka-aws. It just got a static build an hour ago

borkdude09:05:44

but the tests don't pass there

borkdude09:05:54

so it would be a good reason to start doing that

kokada11:05:11

Yeah, this makes sense if we are linking with glibc

kokada11:05:58

I think even when linking statically, glibc still tries to access some system files at runtime

kokada11:05:05

Something related to nss

kokada11:05:17

And some private symbols

kokada11:05:36

And yeah, a glibc version mismatch can cause segmentation faults

kokada11:05:53

musl should work much better for static binaries

borkdude18:05:19

@ben.sless Any chance you could test this (musl-linked) static binary for linux with the same program that didn't work for you this morning? https://github.com/babashka/babashka/files/6448129/babashka-0.4.1-SNAPSHOT-linux-amd64-static-musl.tar.gz

🎉 3
Ben Sless18:05:22

Looks like it works now

kokada19:05:52

@rahul080327 Seems like musl is really the way to go 🙂

lispyclouds19:05:52

awesome work @thiagokokada great pairing session!

👍 3
kokada19:05:02

Thanks to you too @rahul080327

borkdude19:05:29

Would it make sense that iff you run the babashka static binary, then you also download static pods?

borkdude19:05:35

or would it be best to control this via an explicit env var:

BABASHKA_PODS_STATIC=true

lispyclouds19:05:50

if the pods are guaranteed to be on the same machine, which i think they are, i would say it makes sense to make it the default

lispyclouds19:05:08

most users may not think of that env var i think?

borkdude19:05:48

in this case we must capture the fact that babashka has been compiled statically (not sure if graalvm already provides a system property for this that you can read at compile time)

lispyclouds19:05:44

im not sure of it

borkdude19:05:37

this is easily done using (def static? (System/getenv "BABASHKA_STATIC")) in the relevant pods library namespace and then using that to download the right pod
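
For illustration, a minimal sketch of that idea (the env var and the artifact naming are assumptions, not the pods library's actual internals):

(def static?
  ;; evaluated at image build time, so the value is baked into the binary
  (= "true" (System/getenv "BABASHKA_STATIC")))

(defn linux-pod-artifact [pod-name version]
  (if static?
    (str pod-name "-" version "-linux-amd64-static.zip")
    (str pod-name "-" version "-linux-amd64.zip")))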

borkdude20:05:00

Meanwhile, @thiagokokada are you going to do a similar PR to babashka for the static one?

borkdude20:05:51

Another question: we can "prefer" a static pod if bb is itself static, but what if there is no static linux pod available but there is a dynamic one: should we select that one, or just crash with "no such pod here"?

kokada20:05:51

Sorry, didn't understand the question

borkdude20:05:16

@thiagokokada My impression was that you were going to update bb static itself with musl too?

borkdude20:05:27

or did I misunderstand?

kokada20:05:03

I thought we would use the @rahul080327 approach instead of mine, so I was waiting for his PR

borkdude20:05:43

ok, that PR is merged now. but that was for the aws pod.

kokada20:05:51

But I can do it too if @rahul080327 is not interested in porting the approach from pod-babashka-aws

lispyclouds20:05:52

you can go ahead with it @thiagokokada if you want 🙂

👍 3
kokada20:05:53

About the question of whether we should try to download dynamic pods in case the static one is unavailable: no, I don't think so

kokada20:05:08

Too much magic may make things more confusing

borkdude20:05:34

maybe automatically selecting a static or dynamic one is also too much magic - that is why I proposed an env var to control this

kokada20:05:48

Yeah, maybe

kokada20:05:02

But I don't like controlling those kinds of things via env vars

kokada20:05:14

Could we have opts passed to load-pod instead :thinking_face:?

borkdude20:05:36

That can work, but then you'd have to change the code if you deploy your code from a mac to the cloud for example

kokada20:05:52

Like:

(pods/load-pod 'some-pod "0.0.0" {:static true})

kokada20:05:23

(pods/load-pod 'some-pod "0.0.0" {:linux-opts {:static true}})

kokada20:05:39

This would make it clear that it only applies to linux (so macOS is unaffected)

kokada20:05:58

It also has the advantage of making something like "static with dynamic fallback" possible

kokada20:05:26

They just need to catch the exception when the call with {:static true} fails
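
For illustration, a rough sketch of that fallback using the :static option proposed above (the option is hypothetical at this point, and 'org.example/some-pod is a placeholder):

(require '[babashka.pods :as pods])

(try
  (pods/load-pod 'org.example/some-pod "0.0.1" {:static true})
  (catch Exception _
    ;; no static artifact available; fall back to the dynamically linked pod
    (pods/load-pod 'org.example/some-pod "0.0.1")))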

borkdude20:05:56

It would be much simpler if we could just always use the static pod on linux

borkdude20:05:35

Maybe let's test this out a bit for a few months and if the static one works out ok, we can just switch over all linux pods to it

borkdude21:05:11

Made a new release for the pod-babashka-aws pod

borkdude21:05:01

why did you make this change? ARG BABASHKA_XMX="-J-Xmx4500m"

kokada21:05:18

> It would be much simpler if we could just always use the static pod on linux
Maybe we could go even further and eventually simply remove the dynamically compiled version of Babashka. If someone wants to use Babashka with dynamic linking, they can always depend on their distro's packages.

borkdude21:05:24

How would a dynamically linked bb magically appear in "their distro's packages" if we don't provide it?

kokada21:05:33

For example, NixOS's Babashka package is dynamically linked. I was imagining the distros themselves packaging Babashka.

borkdude21:05:05

I am all for it; the only reason I would not do it is if it wouldn't work for someone

kokada21:05:48

> I am all for it; the only reason I would not do it is if it wouldn't work for someone
Yeah, I think we can work on this static version now and ask more people to test, and if everything looks good we could eventually remove the non-static version

borkdude21:05:50

I can secretly use the static aws pod in the manifest and see if someone starts yelling?

borkdude21:05:23

Seems like a good starting point since it's still a "young" pod

kokada21:05:34

> I can secretly use the static aws pod in the manifest and see if someone starts yelling?
Well, LGTM, I don't know if anyone has objections to this

borkdude21:05:57

let's do it; if it breaks certain environments, we will switch back and find another solution

borkdude21:05:21

updated the 0.0.6 manifest

borkdude21:05:37

Merged 🙏

kokada21:05:11

Thanks @U04V15CAJ. Just did another PR to make sure that we are using the correct zlib sources (avoiding possible security issues): https://github.com/babashka/babashka/pull/830

borkdude21:05:58

Very good. We should start adopting this for pods as well.

borkdude21:05:04

(I mean, after downloading a pod we should verify it against a sha in the manifest)
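
For illustration, a hypothetical sketch of such a check (not the pods library's actual implementation): compute the sha256 of the downloaded archive and compare it with the checksum recorded in the manifest.

(require '[clojure.java.io :as io])

(defn sha256-hex [file]
  (let [md  (java.security.MessageDigest/getInstance "SHA-256")
        out (java.io.ByteArrayOutputStream.)]
    (with-open [in (io/input-stream file)]
      (io/copy in out))
    (->> (.digest md (.toByteArray out))
         (map #(format "%02x" %))
         (apply str))))

(defn verify-pod! [pod-file expected-sha256]
  (when-not (= expected-sha256 (sha256-hex pod-file))
    (throw (ex-info "Pod checksum mismatch"
                    {:file pod-file :expected expected-sha256}))))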

👍 3
borkdude21:05:47

I tested the musl bb on my VPS and had no problems

borkdude21:05:17

Hmm, I just enabled the musl static build on master

borkdude21:05:19

and it fails

borkdude21:05:35

I'm logging off for tonight, but feel free to take a look and I'll be back tomorrow

borkdude22:05:00

which graalvm version did you use to compile locally?

kokada22:05:28

> Hmm, I just enabled the musl static build on master
Strange, it seems to work fine on my machine

kokada22:05:41

> which graalvm version did you use to compile locally?
I used the version included in the Dockerfile

kokada22:05:58

Seems to be version 21.0.0

borkdude22:05:22

hmm ok. your version works fine on my VPS with the clj-http-lite library

borkdude22:05:33

but it fails on CI with some mysterious stackoverflow

kokada22:05:25

In that previous version I built musl from scratch, while this one uses the musl-tools prebuilt from Debian

kokada22:05:50

But I ran the tests above on my machine with the changes that were merged on master

borkdude22:05:02

Did you run lein test or script/test?

borkdude22:05:47

on CI there are more tests, notably script/run_lib_tests

borkdude22:05:04

and to test the native, you have to set BABASHKA_TEST_ENV=native

kokada22:05:04

For the result above I ran BABASHKA_TEST_ENV="native" ./script/run_lib_tests > result

borkdude22:05:16

that should work yeah

kokada22:05:34

I am building it locally again and will run the lib tests again

kokada22:05:45

Just to make sure that I am using the correct versions of everything

kokada22:05:50

Huh... Yeah, now that I built it again it seems to segfault

kokada22:05:55

I will try my approach again

kokada22:05:03

(Building musl from scratch)

borkdude22:05:14

Any chance you were using 21.1 the first time around? I just merged into my 21.1 branch and there it seems to work fine...

borkdude22:05:25

(GraalVM version 21.1.0 JDK11 to be specific)

kokada22:05:14

Nope, because I don't even have GraalVM installed on my system

kokada22:05:25

So if it is building, it should be from the Dockerfile

borkdude22:05:47

I see. Note that you can bump up the BABASHKA_XMX value if you want to speed up your local compile a bit, btw

borkdude22:05:40

Can you perhaps also try the graal21.1 branch (pull recent changes from Github)?

kokada22:05:08

You mean trying musl from source with graal21.1?

kokada22:05:31

Right now I am building Babashka with musl from source, and will run the tests

borkdude22:05:33

No, there is a graal21.1 branch which is the same as master but with 21.1 instead of 21.0

borkdude22:05:52

if that works, maybe that's the better approach then

kokada22:05:40

Nope, I also got an exception. So maybe the difference is the musl version :thinking_face:?

kokada22:05:50

Yep, I think the issue is the musl version

kokada22:05:21

Let me try one more thing

kokada23:05:30

Well, downgrading to whatever musl version comes from Debian fixed the clj-http.lite.client-test tests but broke clojure.data.generators-test :man-shrugging:

kokada23:05:03

I think the 1.2.0 that I used first (compiling from source) passed all tests

kokada23:05:12

But maybe it is just luck

kokada23:05:32

Let me try the GraalVM 21.1

kokada23:05:34

Yeah, with GraalVM 21.1 it seems to work

kokada23:05:01

So, a summary:
- musl 1.22 + GraalVM 21.0: broke clj-http.lite.client-test
- musl 1.1.21 (I think; this is the musl from Debian Buster but I am not sure that this is the version we use) + GraalVM 21.0: broke clojure.data.generators-test
- musl 1.22 + GraalVM 21.1: works fine with all libs that we have tests for

kokada23:05:43

Actually, forget about it

kokada23:05:49

This test (`clj-http.lite.client-test`) seems flaky

kokada23:05:27

So yeah, very bizarre :man-shrugging:

kokada23:05:49

Maybe we should report this issue to GraalVM itself :thinking_face:?

kokada23:05:20

Yeah, got the same flaky behavior even with Babashka compiled from master

kokada23:05:35

So if I run the tests once it works, again it segfaults, again it works, again it segfaults...

kokada23:05:14

@U04V15CAJ All I can say for now is that the tests are really, really flaky with those static binaries. We could add a retry and run them a few times, but it doesn't seem to be a good sign, considering that this behavior doesn't happen with the old glibc binaries. CC @rahul080327

kokada23:05:50

I do think we should report this to GraalVM itself since it seems to be an issue with them, but I don't know how we could report this (we probably need some way to reproduce this with a smaller example that can trigger this behavior)

kokada23:05:06

My suggestion is maybe releasing this musl static version separately from the other static binary, at least until we can figure out why these tests are flaky. But I will leave the decision to @U04V15CAJ

kokada23:05:00

Just to note it here: I also tried to build with Debian Buster (we seem to actually use Debian Stretch, since the GCC used by the image is 6.3 instead of 8.3), since Native Image in the end builds everything with GCC, and 6.3 is quite an old GCC version. I ended up using openjdk-17-lein-2.9.6-buster, which also uses updated versions of lein and OpenJDK, since I didn't find a way to only bump the Debian version. But no dice, same behavior.

kokada01:05:11

> My suggestion is maybe releasing this musl static version separately from the other static binary, at least until we can figure out why these tests are flaky.
Actually, after more testing, maybe it is fine. It is just some libs that fail sometimes (but when it works, all tests pass), and this musl-based binary should work on any Linux system, so this is still a win in my book. The current static binary based on glibc seems more useless in comparison: when it fails, it fails hard, and most of the time people are better off using the non-static version anyway. I also don't think the static glibc version works on Alpine anyway, since Alpine is a musl-based system (I may be wrong here). My opinion is to release it as is, just noting somewhere in the release notes that this release is more unstable than the non-static build, so the non-static build is still preferred. My 2c.

kokada01:05:37

(And we should keep track of new GraalVM releases; I think it is a bug there that will eventually be fixed. It does seem that with GraalVM 21.1 this bug is harder to trigger, but this is a completely non-scientific observation.)

borkdude07:05:39

Hmm. The static image was compiled to work with Alpine, that was the only reason it was there. If the musl build has issues that the other static image doesn’t have, I’m inclined to roll back changes until we sort it out. Cc @rahul080327

borkdude07:05:37

I won’t have a build that is flaky, that’s a constant source of headaches

borkdude07:05:15

What kind of issue did you see with data generators, also a stackoverflow?

borkdude08:05:17

I think I found the issue. The static binary stackoverflows much sooner than the non-static one. E.g. try:

./bb -e "(defn foo [x] (prn x) (foo (inc x))) (foo 0)"
The non-static goes to about 20k, while the static one only goes through 186 or so

borkdude08:05:14

> musl allocates memory of very small size for thread stacks. musl documentation says it's only 80KB (now increased to 128KB) for default, which is rather small compared to 2-10MB of glibc. And that's been said to be the major reason for stack-consuming executable to crash (segmentation fault) in musl libc environment.

lispyclouds08:05:34

Very interesting find! My opinion here would be that we could make it explicit that we are using musl and point out the caveats. If this is the reason for the flakiness, I think we can go ahead with musl, with warning stickers. The ability to run in most places should outweigh the stack size issues. musl has quite widespread production usage and I think this would be known to people.

lispyclouds08:05:04

We could try patching the binary, but again I get shivers when we change the defaults of a well-used lib, especially when it comes to memory regions

lispyclouds08:05:43

Things like Rust, Zig and others solely rely on musl for static links; I would say that's a good reason to stick with it

borkdude09:05:36

can't we just up the stack size limit?

borkdude09:05:37

@rahul080327 Patching the binary with muslstack didn't work btw. Can't we set a higher stack limit during compilation?

borkdude09:05:19

Is there a way to pass stack-size=2097152 to the compilation?

Ben Sless10:05:40

This option needs to be passed to the native-image compiler?

borkdude10:05:09

I am trying now with: "-H:CCompilerOption=-Wl,-z,stack-size=2097152" So far it seems to work... more testing now

borkdude10:05:53

@thiagokokada Perhaps you can also try this option locally. I'm running the build a few times over to see if works repeatedly. https://app.circleci.com/pipelines/github/babashka/babashka?branch=stack-size

borkdude10:05:01

Doesn't seem to work. The recursion depth is also more limited still with this binary, so maybe it was just luck that it worked a couple of times.

borkdude11:05:32

Discussing with a GraalVM dev now

kokada13:05:09

> What kind of issue did you see with data generators, also a stackoverflow?
Yeah, but I only saw this issue once. The issue with clj-http-lite seems to occur more frequently (and it also does not seem related to the stack, since it occurs randomly; or maybe it is, I will try it later with the recent stack changes).
> Very interesting find! My opinion here would be that we could make it explicit that we are using musl and point out the caveats. If this is the reason for the flakiness, I think we can go ahead with musl, with warning stickers. The ability to run in most places should outweigh the stack size issues. musl has quite widespread production usage and I think this would be known to people.
I concur with @rahul080327 here.

borkdude13:05:15

I've gone deep into the weeds with a GraalVM dev already. It's related to what musl does with the stack size. We should resolve this before publishing the new static bb; I find this unacceptable.

kokada13:05:40

I am building Babashka with "-H:CCompilerOption=-Wl,-z,stack-size=2097152" just to run the tests and see what we get

borkdude14:05:46

@thiagokokada I already tried this and it doesn't have any effect

borkdude14:05:01

You can see this when using the java repro from the issue

borkdude14:05:51

So there is a workaround, it seems.

borkdude14:05:52

The workaround is to start bb in a thread with a better default for the stack size. The rest of the threads should get their stack size from the stack-size argument. It does incur some extra startup time, so I will make that workaround conditional based on some env var

borkdude14:05:03

We don't even have to set the stack size in Java; as long as we start everything in a thread and pass a better default for the stack size, things should work again
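
For illustration, a rough Clojure sketch of the thread idea (not babashka's actual implementation): java.lang.Thread has a four-argument constructor that takes an explicit stack size in bytes, so running the real work on such a thread avoids relying on musl's small default stack for the initial thread.

(defn run-on-thread-with-stack [f stack-size-bytes]
  (let [t (Thread. nil f "bb-main" stack-size-bytes)]
    (.start t)
    (.join t)))

;; e.g. give the main work a 16 MiB stack
(run-on-thread-with-stack #(prn :hello) (* 16 1024 1024))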

kokada15:05:42

> It does incur some extra startup time so I will make that workaround conditional based on some env var
So we will have something like BABASHKA_THREAD_STACK_SIZE that can be set by the user if they want it? Is this the idea? LGTM by the way

borkdude15:05:01

Let's continue in #babashka-sci-dev

👍 3
borkdude15:05:06

Thread is getting long

dorab21:05:08

What is the recommended way to have a bb task that consists of a Unix pipeline? I tried {:tasks {clean (-> (shell "find . -type f -name '*~' -print") (shell "xargs rm -f"))}} but bb clean gives me a java.util.concurrent.ExecutionException.

borkdude21:05:46

@dorab shell doesn't support this with ->, but babashka.process/process does.
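
For example, something along these lines should work (a rough sketch; the exact option handling is an assumption):

(require '[babashka.process :as p])

(-> (p/process ["find" "." "-type" "f" "-name" "*~" "-print"])
    (p/process ["xargs" "rm" "-f"])
    deref)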

borkdude21:05:16

I think shell could support this though (made an issue: https://github.com/babashka/babashka/issues/829)

borkdude21:05:02

What is also possible:

(let [out (:out (shell {:out :string} "find . -type f -name '*~' -print"))]
  (shell {:in out} "xargs rm -f"))

borkdude21:05:43

although I think this can all be done using babashka.fs glob and delete as well

borkdude21:05:16

does *~ mean you want to find all files ending with a tilde, or does it expand into something else?

borkdude21:05:19

{:requires ([babashka.fs :as fs])
 :task (run! fs/delete
             (fs/glob "." "*~"))}