#tools-deps
2021-09-14
kenny16:09:35

Hi. I am updating our CI images to use 1.10.3.967 from 1.10.3.855. After the update, the tests for a project failed by timing out after 10 minutes. The entirety of the test output was as follows.

Downloading: com/datadoghq/dd-trace-api/0.86.0/dd-trace-api-0.86.0.pom from central
Downloading: org/clojure/clojure/maven-metadata.xml from clojars
Downloading: org/clojure/clojure/maven-metadata.xml from central
Downloading: org/clojure/clojure/maven-metadata.xml from datomic-cloud
I will be rolling back to the earlier version, but I am curious, is this something others have hit? Are there some particular steps I should follow when upgrading from 855 to 967?

Alex Miller (Clojure team)16:09:17

I think this may be specific to the datomic-cloud repo, but maybe should move that to #datomic

kenny16:09:02

Interesting. Was there a change in 855 -> 967 that would affect s3 repos?

Alex Miller (Clojure team)16:09:07

no, I think there may be changes that have happened in one of the datomic repos

kenny16:09:44

Ah, ok. Fwiw, I worked it backward. 1.10.3.933 works and 1.10.3.943 does not work.

Alex Miller (Clojure team)16:09:56

we've bumped some aws/s3 deps in that version range, but I have not seen anything like you describe

Alex Miller (Clojure team)16:09:40

hrm, well probably ignore what I said above then

Alex Miller (Clojure team)16:09:51

I've tested s3 stuff but not seen what you're describing

Alex Miller (Clojure team)16:09:27

does it persist with -Sforce?

Alex Miller (Clojure team)16:09:43

if so, do you have a deps.edn that repros that you can give me?

Alex Miller (Clojure team)16:09:35

we did make some changes in the http-client config for s3 in that version range

kenny16:09:19

All the tests I ran were run with -Sforce. I don't have a readily available deps.edn. I also don't think this is related to Datomic. In a test run with 943, the cli output was just this.

Downloading: org/clojure/clojure/maven-metadata.xml from cs-mvn
cs-mvn is an s3 repo.

kenny16:09:50

It seems like a deadlock, but that would be very speculative. I'm surprised the requests aren't timing out.

kenny16:09:31

If it helps, here is the command I'm running.

/home/circleci/clj-1.10.3.943/bin/clojure -Sforce -J-Dclojure.main.report=stderr -J-Xmx3800m -A:test:test-runner -M -m kaocha.runner --reporter kaocha.report/documentation --plugin profiling --plugin kaocha.plugin/junit-xml --junit-xml-file test-results/kaocha/results.xml

Alex Miller (Clojure team)16:09:28

can you ctrl-\ to get a thread dump?

Alex Miller (Clojure team)16:09:26

well, that's interesting. don't think it necessarily has anything to do with s3 based on that

kenny16:09:09

Yeah. DefaultMetadataResolver-0-1

Alex Miller (Clojure team)16:09:05

it could indeed be a deadlock in the session locks. I'll have to think about this more. you might be able to bypass with -Sthreads 1
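
For readers following along: a thread dump taken with ctrl-\ on a HotSpot JVM normally includes a section for any lock-cycle deadlock it finds. The same check is available in-process through the standard `ThreadMXBean`. The sketch below is only an illustration with made-up names (`DeadlockCheckSketch`, `locker-1`, `locker-2`). It builds an artificial lock-order deadlock and has nothing to do with the actual session-lock code in tools.deps or the Maven resolver.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

public class DeadlockCheckSketch {
    // Number of threads the JVM reports as stuck in a lock cycle
    // (object monitors or ownable synchronizers such as ReentrantLock).
    static int deadlockedThreadCount() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] ids = mx.findDeadlockedThreads(); // null when no cycle is found
        return ids == null ? 0 : ids.length;
    }

    // Starts two daemon threads that take the same two locks in opposite order.
    static void startLockOrderDeadlock() {
        Object a = new Object();
        Object b = new Object();
        CountDownLatch bothHoldFirst = new CountDownLatch(2);
        startDaemon("locker-1", a, b, bothHoldFirst);
        startDaemon("locker-2", b, a, bothHoldFirst);
    }

    private static void startDaemon(String name, Object first, Object second,
                                    CountDownLatch bothHoldFirst) {
        Thread t = new Thread(() -> {
            synchronized (first) {
                bothHoldFirst.countDown();
                try {
                    bothHoldFirst.await(); // wait until the other thread holds its first lock
                } catch (InterruptedException e) {
                    return;
                }
                synchronized (second) {
                    // never reached: the other thread holds `second` and wants `first`
                }
            }
        }, name);
        t.setDaemon(true); // daemon threads do not keep the JVM alive
        t.start();
    }

    // Polls until a deadlock is reported or the time budget runs out.
    static int waitForDeadlock(long maxMillis) throws InterruptedException {
        long deadline = System.currentTimeMillis() + maxMillis;
        int n;
        while ((n = deadlockedThreadCount()) == 0
                && System.currentTimeMillis() < deadline) {
            Thread.sleep(20);
        }
        return n;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("before: " + deadlockedThreadCount());
        startLockOrderDeadlock();
        System.out.println("after: " + waitForDeadlock(2000));
    }
}
```

Note that this kind of check only reports cycles of locks. A hang where threads are merely waiting on work that never gets scheduled would not show up here, so an empty result does not rule out a hang.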

kenny16:09:15

I know close to 0 about the maven lib, so take this with a grain of salt... if they are using a fixed size thread pool to get s3 creds in parallel, that could easily result in a deadlock.
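
The general failure mode kenny is speculating about can be sketched as follows. This is a minimal illustration of fixed-size thread pool starvation, assuming a task that blocks on another task submitted to the same pool. The class name, the `"creds"` placeholder and the timeout are all hypothetical, and nothing here is taken from the Maven resolver or the s3 transport.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class PoolStarvationSketch {
    // Returns true if the outer task did not finish within the timeout,
    // i.e. the pool starved itself.
    static boolean starves(int poolSize, long timeoutMillis) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            Future<String> outer = pool.submit(() -> {
                // The outer task occupies one worker, then blocks on an inner
                // task that needs another worker from the same pool.
                Future<String> inner = pool.submit(() -> "creds");
                return inner.get();
            });
            try {
                outer.get(timeoutMillis, TimeUnit.MILLISECONDS);
                return false;
            } catch (TimeoutException e) {
                return true;
            }
        } finally {
            pool.shutdownNow(); // interrupt any stuck worker so the JVM can exit
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("pool of 1 starves: " + starves(1, 500)); // true
        System.out.println("pool of 2 starves: " + starves(2, 500)); // false
    }
}
```

With a single worker, the inner task sits in the queue forever because the only worker is blocked waiting for it. With two workers the inner task runs and the outer one completes.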

Alex Miller (Clojure team)16:09:41

there are several layers to this problem, but I don't think it has anything to do with s3

Alex Miller (Clojure team)16:09:32

there have been changes in the session caching I'm doing, and in the underlying maven lib. I suspect my changes at the moment. :)

kenny16:09:56

With 1.10.3.943, -Sthreads 1 does not have any impact -- still hangs.

Alex Miller (Clojure team)16:09:34

that's probably an important clue :)

Alex Miller (Clojure team)16:09:45

I'm probably not going to get to it today, but I will take a look soon

kenny16:09:03

Sounds good. I'll leave us at 933 for now.

Alex Miller (Clojure team)21:09:47

hey kenny, I have not succeeded in reproducing this or figuring it out, but I did spend some time looking at the various versions of the libs and I think I was using maven resolver libs (1.7.x) that require maven core libs (4.0 alpha+), so I've fallen back to the 1.6.x series there. they have been reworking the concurrency and locking parts of maven in 1.7.x. I don't see that directly implicated but it might be related. But if you wanted to try Clojure CLI 1.10.3.981, it's available.

kenny22:09:18

Thanks for digging in. I'll give it a shot, and let you know if it fixes the issue.

kenny00:09:50

Just ran a test with 1.10.3.981, and I'm still getting the same hang.

kenny00:09:39

I can repro locally (not on ci) too, btw.

kenny00:09:34

I tried the work-it-backwards approach of commenting things out, and I cannot discern any pattern at all. I'll send you the smallest deps.edn I worked it back to. Any other change, such as commenting something out or moving local/root deps into this deps.edn, results in no hang. Let me know if there's any info you're interested in.

Alex Miller (Clojure team)16:09:00

I guess if you're not local, you probably can't

kenny16:09:27

circleci ssh access is one of my fav features 🙂

jaret18:09:30

Sorry @alexmiller @kenny. Do you still think this is Datomic repo related? I am trying to catch up.

Alex Miller (Clojure team)18:09:51

I am pretty sure I understand the source of the problem, and I introduced it

Alex Miller (Clojure team)18:09:21

I actually have failing tests for this on CI on newer versions of Java that I had missed