Hi, I'd like to share my experiment using Clojure in competitive programming. My reasoning was as follows: since Java is the second most popular programming language for competitions, and Clojure's performance is comparable to Java, I can use Clojure in competitive programming. I started with the AtCoder Beginner Selection problems, and everything went well—I solved 8 problems, but then I encountered the "Time limit exceeded" limit on the ninth problem, "Toshidama" (out of 11 problems total). I optimized the algorithm and reduced the number of tests that exceeded the time limit from 6 to 2. The exceedance is minor: the slowest test took 2035 ms compared to the 2000 ms limit. Here's the problem: there are N banknotes—x 10,000-yen bills, y 5,000-yen bills, and z 1,000-yen bills, for a total of T yen. Find the combination x, y, z if x 10k + y 5k + z * 1k = T Here's my best solution (2 tests exceeded the time limit):
(require '[clojure.string :as str])
(let [input (slurp *in*)
in (str/split input #" |\n")
n (Integer. (get in 0))
t (Integer. (get in 1))
res (first
(filter (fn [[x y z]] (= t (+ (* x 10000) (* y 5000) (* z 1000))))
(filter #(= n (apply + %))
(for [x (range (inc n))
y (range (inc n))
:let [z (if (neg? (- n x y)) 0 (- n x y))]]
[x y z]))))]
(if (nil? res)
(print -1 -1 -1)
(apply print res)))
It's strange that further optimization—`(range (inc (min n (quot t 10000))))` etc.—led to an increase in the number of tests exceeding the time limit.
Therefore, I have to use C++ for time-consuming problems or choose language-agnostic contests where you submit test output instead of code.Without delving into the algorithm itself (I have no idea what the proper algorithm is and I actively resist being nerd-sniped :D), the way res is created is most definitely sub-optimal. You create three lazy sequences just to find a single value, you become a victim of chunking (lazy sequences, depending on the context, can evaluate items in chunks and not 1-by-1, which usually leads to better performance, but not in this case). Something like a simple loop or reduce would be better. Or even just (first (for ...)) since for supports not just :let but also :when and :while, although combining them could be counter-intuitive.
Another thing is boxing. When you don't specify primitive types, everything is an object - you even used Integer yourself. So a simple value like 17 ends up being wrapped in an object which makes it significantly slower.
Try (set! *unchecked-math* :warn-on-boxed) before evaluating that code above and see what it gives you.
> It's strange that further optimization—(range (inc (min n (quot t 10000)))) etc.
How would the code look with that optimization?
(require '[clojure.string :as str])
(let [input (slurp *in*)
in (str/split input #" |\n")
n (Integer. (get in 0))
t (Integer. (get in 1))
max-x (min n (quot t 10000))
max-y (min n (quot t 5000))
max-z (min n (quot t 1000))
res (first
(filter (fn [[x y z]] (= t (+ (* x 10000) (* y 5000) (* z 1000))))
(filter #(= n (apply + %))
(for [x (range (inc max-x))
y (range (inc max-y))
:let [z (if (neg? (- n x y)) 0 (min max-z (- n x y)))]]
[x y z]))))]
(if (nil? res)
(print -1 -1 -1)
(apply print res)))Can you give a sample input for which the original code is faster than the optimized version?
I just tried with 2 11000 - the optimized version is slightly faster.
mmm, AtCoder only disclose a few first tests, but here is the slowest of disclosed: 1000 1234000
I probably was lucky (low CPU usage etc) to get only 2 failed tests with not finally optimized version
The optimized version is still faster on my end, as expected (`f` is the original f2 is optimized):
(crit/quick-bench
(f 1000 1234000))
Evaluation count : 1260 in 6 samples of 210 calls.
Execution time mean : 562.057965 µs
Execution time std-deviation : 139.024057 µs
Execution time lower quantile : 478.888005 µs ( 2.5%)
Execution time upper quantile : 795.856552 µs (97.5%)
Overhead used : 6.098224 ns
Found 1 outliers in 6 samples (16.6667 %)
low-severe 1 (16.6667 %)
Variance from outliers : 64.7749 % Variance is severely inflated by outliers
=> nil
(crit/quick-bench
(f2 1000 1234000))
Evaluation count : 4578 in 6 samples of 763 calls.
Execution time mean : 133.713648 µs
Execution time std-deviation : 3.044836 µs
Execution time lower quantile : 130.066537 µs ( 2.5%)
Execution time upper quantile : 136.497486 µs (97.5%)
Overhead used : 6.098224 ns
=> nilI’ll try to write a “loop” version tonight
I can give some general notes on that code, not performance-related, if you want.
That would be great
From the very start, it's much better to write things in a composable manner. It's not harder to do so, but it's better if you do any sort of experiments.
Some specific things I'd do
• Wrap let in a function that accepts n and t as arguments and returns res without any checks or printing. Reading the input, processing it, deciding the default value in case of a failure, printing - all are concepts unrelated to each other
• Use so-called "rich comments" to run experiments with that function instead of feeding it different stdin
• For the purpose of the online test, just call that function from that let form
Much, much easier to experiment that way and exclude unrelated things like stdin and printing from benchmarking.
Also, check out ->>. Your code would be much more readable with it - it allows you to chain collection-related functions (when coll is the last argument) without nesting. Similar, -> also exists for when something is the first argument, but here it wouldn't be useful.
By trying out many random combinations in a loop, found the value for which the optimized version is consistently slower. But the original timing is still well within a single SD from it. N = 1000, T = 9283000.
Repeated that test many more times - the consistency is gone, both f and f2 take time that's within SD of each other, in any direction. Probably an outlier case that could be slower because of CPU branch prediction or something like that.
I wonder if there is a smart algorithm instead of my brute force solution
In any case, you'll see incomparably smaller times when you get rid of lazy collections in combination with first and boxed math.
What is SD?
I mean abbreviation
Standard Deviation.
> I wonder if there is a smart algorithm instead of my brute force solution Most likely, it's a popular problem.
Oh, almost forgot - that z (if ...) can be rewritten as z (max (- n x y) 0).
Tried out of curiosity optimizing the very first version, without any changes to the algorithm.
Using a nested loop made it 5 times faster.
Avoiding box math made it 2 times faster.
So, a whole order of magnitude.
Wow, can you post the solution?
Sure:
(defn f3 [^long n ^long t]
(loop [xs (range (inc n))]
(when-first [^long x xs]
(or (loop [ys (range (inc n))]
(when-first [^long y ys]
(let [z (max (- n x y) 0)]
(if (= (+ (* x 10000)
(* y 5000)
(* z 1000))
t)
[x y z]
(recur (next ys))))))
(recur (next xs))))))It uses ^long instead of ^int because in Clojure number literals are longs by default, not ints.
I’ll test it tonight (far from the laptop now). It should pass the time constraint
Ah, but we don't really need the ranges and iterating over collections, do we? This version goes 20 times faster than the previous one.
(defn f4 [^long n ^long t]
(loop [x 0]
(when (<= x n)
(or (loop [y 0]
(when (<= y n)
(let [z (max (- n x y) 0)]
(if (= (+ (* x 10000)
(* y 5000)
(* z 1000))
t)
[x y z]
(recur (inc y))))))
(recur (inc x))))))
Tested a plain Java version - it's at exactly the same time as the above version.
So we ended up with a Clojure function that has almost no type hints (since there are no generic collections, types of most bindings can be deduced) and is as performant as an equivalent Java function.The equivalent Java code:
public class X {
public static long[] f(long n, long t) {
for (long x = 0; x <= n; x++) {
for (long y = 0; y <= n; y++) {
long z = Math.max(n - x - y, 0);
if (x * 10000 + y * 5000 + z * 1000 == t) {
return new long[] { x, y, z };
}
}
}
return null;
}
}
Used as:
(import 'X)
(defn f-java [^long n ^long t]
(vec (X/f n t)))
The results:
(crit/quick-bench
(f4 1000 9283000))
Evaluation count : 936 in 6 samples of 156 calls.
Execution time mean : 647.384538 µs
Execution time std-deviation : 2.895507 µs
Execution time lower quantile : 643.851417 µs ( 2.5%)
Execution time upper quantile : 650.939564 µs (97.5%)
Overhead used : 6.098224 ns
=> nil
(crit/quick-bench
(f-java 1000 9283000))
Evaluation count : 912 in 6 samples of 152 calls.
Execution time mean : 665.660780 µs
Execution time std-deviation : 572.906808 ns
Execution time lower quantile : 665.104704 µs ( 2.5%)
Execution time upper quantile : 666.351677 µs (97.5%)
Overhead used : 6.098224 ns
=> nilAwesome!
Heh, wrote an equivalent C++ function. It is 2x slower than the Java version. It's been more than a decade since the last time I touched C++ seriously, so still investigating.
It could be that the constant inputs that I provide make the JIT overeagerly optimize it. Maybe @alexyakushev knows for sure what would be the best way to test it properly, without making JIT optimize for inputs or doing something else that wouldn't be useful in a situation where the inputs aren't known in advance.
Nah, changing the inputs on every iteration still shows 10% faster execution than an equivalent C++ code.
Apparently I could improve the performance of that code by using likely/`unlikely` with if and some fast integer types.
Maybe some extra compilation flags would make it faster. Asked a friend who knows C++ well, he'll get back to me later.
It's hard to say what JIT is doing without inspecting the produced assembly.
However, since this is competitive programming, the "correct" way to measure performance here is an antipattern, precisely opposite of what we always say to do. You would need to spin a fresh REPL and run the test only once measured by time , because that's what the testing systems in those competitions usually do.
I don't know, maybe the testing protocols have changed and JIT warmup is permitted these days, but I have no idea how they control for caching then, besides, JIT is a kind of a cache you can say.
You can get rid of the nested loop entirely. After dividing t by 1000, the constraints of the original problem are:
x + y + z = n
10x + 5y + z = t
We pick a value for x in the outermost loop, which leaves two variables and two equations, which can be solved directly, yielding:
y = (t-n-9x)/4
z = n-x-y
The same thing in Clojure:
(set! *unchecked-math* :warn-on-boxed)
(defn f5 [^long n ^long t]
(let [t (quot t 1000)
max-x-count (quot t 10)]
; if t is not b/w n and 10n, there's no point.
(when (and (<= n t) (<= t (* 10 n)))
(loop [x 0]
(when (<= x max-x-count)
(let [tmp (- t n (* 9 x))]
(or (when (zero? (rem tmp 4))
(let [y (quot tmp 4)
z (- n x y)]
(when (and (>= y 0) (>= z 0))
[x y z])))
(recur (inc x)))))))))
And it's ~700x as fast as the f4 above:
user=> (crit/quick-bench (f4 1000 9283000))
Evaluation count : 1056 in 6 samples of 176 calls.
nil
Execution time mean : 571.315100 µs
Execution time std-deviation : 407.084473 ns
Execution time lower quantile : 570.681813 µs ( 2.5%)
Execution time upper quantile : 571.724933 µs (97.5%)
Overhead used : 1.903179 ns
user=> (crit/quick-bench (f5 1000 9283000))
Evaluation count : 779106 in 6 samples of 129851 calls.
nil
Execution time mean : 773.858834 ns
Execution time std-deviation : 4.238559 ns
Execution time lower quantile : 768.838230 ns ( 2.5%)
Execution time upper quantile : 778.421931 ns (97.5%)
Overhead used : 1.903179 nsYeah, I wasn't trying to change the algorithm at all, just focused on the Clojure aspect of things. A better algorithm is almost always better than just head-on performance tweaks. :)
Yup. The first thing I noticed though, even without changing the algorithm, is to take some work from the innermost loop to the outer. After we freeze x, we don't have to do (* 10000 x) in the innermost loop each time.
(defn f4-b [^long n ^long t]
(loop [x 0]
(when (<= x n)
(let [t (- t (* x 10000))
n (- n x)]
(or (loop [y 0]
(when (<= y n)
(let [z (max (- n y) 0)]
(if (= (+ (* y 5000)
(* z 1000))
t)
[x y z]
(recur (inc y))))))
(recur (inc x)))))))
This makes f4 about twice as fast on my machine.The division of t by 1000 is a similar improvement, because we don't be doing (* z 1000) in the innermost loop.
I'm used to bring nerd-sniped, so I'll take a shot...
Taking a very different approach, after dividing t by 1000 (`t'`), I started at the ones... we know the ones number will be in (range (mod t' 5) (inc n) 5), because it'd be impossible to get e.g. a '2' in the last place with 3 ones. So, that should cut the possibilities down significantly. I think (based on very vague benchmarking with a couple best-ish and worst-ish case scenarios) that this (limiting the possiblities for ones, and then checking the possible fives and tens from there) is a bit faster overall. Then, I think if you wanted to go wild, you could maybe even turn the fives and tens into a binary search situation - for each possible value of ones you've got remaining bills and a remaining total they need to make; I think you could split the remaining bills, calculate the total, and if the total is low, move bills to the tens side, if it's high, move bills to the fives side, and cut the possibilities in half each time.
finally submitted f5: execution time dropped to 1200 ms
> Without delving into the algorithm itself (I have no idea what the proper algorithm is and I actively resist being nerd-sniped :D) 👀 Reading the rest of the thread and thinking: That escalated quickly 🤣 But kudos for the code insights that people can hopefully take out of this and apply to other problems.
I don't consider Clojure-related tasks nerd-sniping. :) Most of the things that I did in the thread are just mechanical transformations. Even Java and C++-related ones (for which I asked an LLM to write the code for me and only changed it slightly, so no human brain was involved).