Fork me on GitHub
#clojure-dev
<
2019-08-22
>
andy.fingerhut03:08:06

I have tried an Amazon EC2 Ubuntu 18.04 Linux t2.micro instance running OpenJDK 11 and seen the same behavior, running the tryme.sh script

Alex Miller (Clojure team)03:08:11

I'm sure it has something to do with opt and deopt of that loop, but you're not running with enough debug stuff on in the jvm to tell

Alex Miller (Clojure team)03:08:48

I've seen this before, don't remember what in particular

Alex Miller (Clojure team)03:08:15

you need to actually have it print all the asm stuff probably

andy.fingerhut03:08:52

I can set that up and collect that data, and give a link to it later.

Alex Miller (Clojure team)03:08:13

well, I don't have time to look at it, but that's what I'd do

andy.fingerhut03:08:29

From what I saw doing it once earlier today, I had the impression that it was optimizing up to 'level 4' at some point, and that was when it got slower.

Alex Miller (Clojure team)03:08:44

that's tier 4, not level 4

andy.fingerhut03:08:51

Sure, not expecting you to dive in to the data, but someone else might see something I don't.

Alex Miller (Clojure team)03:08:39

it could be when it does the on-stack replacement it finds something much faster but then deopts at the termination condition or something

Alex Miller (Clojure team)03:08:29

just tweet it to https://twitter.com/shipilev, he'll probably tell you in 5 min

andy.fingerhut03:08:00

Thanks. I may do that once I have something with a JVM byte code listing and JIT logging enabled.

andy.fingerhut06:08:24

Good news, Alex, I saw your email message to the Clojure Google group, for the first time in a while.

parrot 8
Alex Miller (Clojure team)12:08:51

Yes, that was kind of a test. I’ve been messing with my DKIM and SPF records, may have actually made some progress :)

cgrand08:08:16

@andy.fingerhut your loop is in return position and returns a primitive, the function return type is object, this return type mismatch is the root of the issue: 1/ Repro of your case:

(defn foo2 [n]
  (let [n (int n)]
    (loop [i (int 0)]
      (if (< i n)
        (recur (inc i))
        i))))
#'user/foo2
user=> (time (foo2 100000000))
"Elapsed time: 53.865585 msecs"
100000000
user=> (time (foo2 100000000))
"Elapsed time: 226.535655 msecs"
100000000
2/ return object:
(defn foo2 [n]
  (let [n (int n)]
    (loop [i (int 0)]
      (if (< i n)
        (recur (inc i))
        :foo))))
#'user/foo2
user=> (time (foo2 100000000))
"Elapsed time: 57.383683 msecs"
:foo
user=> (time (foo2 100000000))
"Elapsed time: 56.647265 msecs"
:foo
3/ return prim
(defn ^long foo2 [n]
  (let [n (int n)]
    (loop [i (int 0)]
      (if (< i n)
        (recur (inc i))
        i))))
#'user/foo2
user=> (time (foo2 100000000))
"Elapsed time: 51.049683 msecs"
100000000
user=> (time (foo2 100000000))
"Elapsed time: 47.310656 msecs"
100000000

👍 4
reborg08:08:33

@cgrand any idea why the first time it runs that boxing into object seems to be ignored? Is the var object somewhat ignoring the return type and then save the wrong info somewhere?

cgrand08:08:35

No clear idea. I believe it has something to do with on stack replacement, and later realizing that the replacement sig does not match the replacee sig 🤷 maybe even pushing some boxing inside the loop. Inspecting assembly could help figure out. But i definitely encountered this in the past.

cgrand09:08:31

More experiments: • explicit boxing on exit of the loop works great

user=>
(defn foo2 [n]
  (let [n (int n)]
    (loop [i (int 0)]
      (if (< i n)
        (recur (inc i))
        (Integer/valueOf i)))))
#'user/foo2
user=> (time (foo2 100000000))
"Elapsed time: 46.156153 msecs"
100000000
user=> (time (foo2 100000000))
"Elapsed time: 54.608185 msecs"
100000000
• explicit boxing on the loop expression doesn’t
(defn foo2 [n]
  (let [n (int n)]
    (Integer/valueOf
      (loop [i (int 0)]
        (if (< i n)
          (recur (inc i))
          i)))))
#'user/foo2
user=> (time (foo2 100000000))
"Elapsed time: 45.705437 msecs"
100000000
user=> (time (foo2 100000000))
"Elapsed time: 220.503665 msecs"
100000000
I believe it doesn’t work because the loop expression is hoisted into a function that suffers from the same issue as the original foo2.

cgrand09:08:51

However Long/valueOf instead of Integer/valueOf doesn’t work. Does it have to do with the size of the return type? I tried disabling UseCompressedOops but have seen no difference

reborg10:08:18

On the plus side, I’m only able to reproduce on a plain REPL, it does not happen on a lein repl or from a main (meaning, prod apps are not affected)

andy.fingerhut19:08:14

Not sure if you were aware that Leiningen by default adds its own JVM command line options related to Tiered compilation, which make performance experiment results suspect, unless you override that by using something like the following in your project.clj file: :jvm-opts ^:replace ["jvm-opts-I-want"]

reborg20:08:36

Indeed! That’s now clear. It also reproduces on a pure java main, so there is “a problem” also in production code

cgrand11:08:13

On the minus side, I can:

$ clj -m foo2
"Elapsed time: 49.448754 msecs"
"Elapsed time: 240.798172 msecs"

cgrand11:08:28

where foo2.clj is

(ns foo2)

(defn foo2 [n]
  (let [n (int n)]
    (loop [i (int 0)]
      (if (< i n)
        (recur (inc i))
        i))))

(defn -main []
  (time (foo2 100000000))
  (time (foo2 100000000)))

cgrand11:08:00

also it’s tied to C2

$ clj -J-XX:TieredStopAtLevel=1 -m foo2
"Elapsed time: 85.075081 msecs"
"Elapsed time: 84.942798 msecs"
$ clj -J-XX:TieredStopAtLevel=4 -m foo2
"Elapsed time: 58.025492 msecs"
"Elapsed time: 222.761918 msecs"

cgrand11:08:11

@dominicm AOT:

$ java -cp classes:`clj -Spath` foo2
"Elapsed time: 57.144601 msecs"
"Elapsed time: 218.541054 msecs"

dominicm11:08:35

Ah. Maybe C2 is the difference here then.

Alex Miller (Clojure team)12:08:07

Andy’s earlier post looked like it was doing osr in C3, then deopt in C3, then reopt in C4

Alex Miller (Clojure team)12:08:31

Which matches cgrand’s hypothesis

cgrand12:08:35

C2 is what you call C4 (Tier 4), no? To me Tier 0 is interpreter, Tiers 1 to 3 are C1 with increasing profiling, Tier 4 is C2. We need a JDK specialist, @ghadi?

andy.fingerhut17:08:07

Wow, thanks for all the investigation @cgrand. I will be trying to reproduce your cases and commentary in my repo for possible future reference.

andy.fingerhut20:08:01

OK, still no Twitter message to Aleksey Shipilev, and still no disassembly to JVM byte code nor Intel assembly code, but I have added brief instructions to the README, and attempted to automate what is there working across JDK8 through the latest versions (AdoptOpenJDK, OpenJDK, Azul Zulu, and Amazon Corretto all tested), and added Christophe's variations: https://github.com/jafingerhut/leeuwenhoek

andy.fingerhut20:08:18

My best guess is that we are seeing a case where some Clojure-compiled JVM byte codes are being optimized to faster machine code at JIT tier 3, vs. slower machine code when compiled at JIT tier 4, depending on exactly how you write that Clojure function.