#announcements
2022-04-01
chrisn16:04:02

I have a drop-in replacement for clojure.data.csv that is much faster - around 10x - for large (1GB+) csv files. I took the time to study how univocity gets its great performance and wrote a small csv parsing system with equivalent performance that removes some of univocity's complications, such as fixed column widths.

CSV parsing isn't meaningfully parallelizable because quoted sections can contain newlines, and inside a quoted section a double quote escapes a quote, so you can't jump to a random address and start parsing - you don't know whether you are inside a quote. You have to parse from the beginning, and tricks like line-seq also are not going to work. line-seq is, however, amazingly fast, as the reader's readLine function is very optimized across jdks. Any parallelized approach is a two-pass algo where the second pass can be parallelized, which I don't think you can do efficiently in java, although I have seen vectorized two-pass approaches in c++.

Turns out anywhere you use an actual java.io.PushbackReader you are sunk as far as performance is concerned. You have to read into character arrays and then implement a pushback-style abstraction on top of that if you want good performance parsing files. This is even more so, interestingly enough, for jdk-17, where int read() has been heavily nerfed while read(char[], int, int) has received an amount of buffs bordering on insane. By my tests on jdk-17, read() of a 1.7GB file through a pushback reader takes 51sec vs 500ms for a tight loop reading into a character array. This may mean there are avenues to parsing large amounts of json and edn that are significantly faster than the approaches used by various existing libraries. It also means that fancy methods of reading character data from a file, such as memory-mapping it or even potentially io_uring on linux, are unlikely to get any faster for CSV parsing specifically, unless you know your data doesn't have quoted sections.

At the end of the day - don't use csv. Use arrow or parquet if you need any performance at all; a parquet file of a 1.7GB test csv set was ~240MB. But if you do use or have to parse a lot of csv data - https://cnuernber.github.io/dtype-next/tech.v3.datatype.char-input.html#var-read-csv.
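To illustrate the per-char read() vs. bulk read(char[], int, int) point, here is a rough reconstruction of what the read-all-chars / read-all-cbuf helpers used in the timings further down might look like (my sketch, not the actual benchmark code):

(require '[clojure.java.io :as io])

;; slow path: one int read() call per character
(defn read-all-chars
  "Drain rdr one char at a time via int read(); returns the char count."
  ^long [^java.io.Reader rdr]
  (loop [n 0]
    (if (== -1 (.read rdr))
      n
      (recur (inc n)))))

;; fast path: bulk read(char[], int, int) into one reusable buffer
(defn read-all-cbuf
  "Drain rdr via read(char[], int, int); returns the char count."
  ^long [^java.io.Reader rdr]
  (let [buf (char-array (* 16 1024))]
    (loop [n 0]
      (let [k (.read rdr buf 0 (alength buf))]
        (if (pos? k)
          (recur (+ n k))
          n)))))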

🤯 2
amazed 15
clojure-spin 9
wow 15
👍 1
❤️ 6
🏎️ 1
🎉 17
🔥 8
Alex Miller (Clojure team)16:04:58

are these changes that could be applied in data.csv instead?

☝️ 7
Noah Bogart16:04:07

Gonna have to check this out, see if there are any wins available for #data-json

chrisn16:04:21

Please read the namespace docs. I used java classes for tight loops.

👍 2
chrisn16:04:57

@U064X3EF3 - I suggest a quick look over the code. It is a complete rewrite that is much more intense than the original pathway. The original clojure.data.csv code is, I think, very beautiful and concise, and thus probably a better teaching tool. What I would perhaps consider is moving away from PushbackReaders in clojure's java edn and lisp reader pathways.

👍 1
Alex Miller (Clojure team)17:04:03

I have done some experiments like that in the past, so cool to see an example

Ben Sless19:04:01

Very cool! I've done something similar going the other way around - writing CSV - making it possible to pretty much stream it directly.

metasoarous20:04:49

Fantastic work as always @UDRJMEFSN!

chrisn21:04:52

@UK0810AQ2 - That is interesting - all of this is really for tech.ml.dataset at the end of the day, and we don't serialize large datasets to csv. We do, however, come across csv datasets in the wild all the time.

👍 2
Ben Sless04:04:16

@UDRJMEFSN we often have to produce reports for clients at work and they can be interested in getting them as CSV sometimes

slipset08:04:59

During my work on speeding up data.json I guess I found some of the same stuff as you did @UDRJMEFSN. In essence there were two bottlenecks: one was parsing out strings, i.e. stuff inside quotes, and the other was creating the persistent data-structures. I ended up using a java.io.PushbackReader, but rather than reading char by char, I read into a buffer of length 64 and worked on that, as you can see here https://github.com/clojure/data.json/blob/master/src/main/clojure/clojure/data/json.clj#L94 I do believe I did some experiments with implementing a PushbackReader-like thing in Java, but couldn't really get any significant performance out of it.
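A rough sketch of that buffered idea (my reconstruction for illustration - read-until and its exact behavior are made up, not data.json's actual code): bulk-read up to 64 chars, scan the buffer, and unread whatever wasn't consumed.

(import '[java.io PushbackReader StringReader])

(defn read-until
  "Read chars up to (not including) stop-ch, pushing the rest back onto
  the stream. Gives up after one 64-char buffer; the real code would
  fall back to a slower path in that case."
  ^String [^PushbackReader stream stop-ch]
  (let [buf (char-array 64)
        n   (.read stream buf 0 64)]
    (if (neg? n)
      ""
      (loop [i 0]
        (if (and (< i n) (not= stop-ch (aget buf (int i))))
          (recur (inc i))
          (do (.unread stream buf (int i) (int (- n i))) ;; push the tail back
              (String. buf 0 (int i))))))))

;; (read-until (PushbackReader. (StringReader. "abc\"def") 64) \")
;;=> "abc"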

chrisn12:04:24

@U04V5VAUN - then I think there were so many other things going on that they overwhelmed the time it took - if you are building persistent datastructures there are probably many things you can do quicker. Here are timings for int read() of a pushback reader vs. my custom https://github.com/cnuernber/dtype-next/blob/master/java/tech/v3/datatype/CharReader.java -

(time (read-all-chars (io/reader fname)))
;; jdk-8  - 5728ms
;; jdk-17 - 28257ms

(time (read-all-chars (PushbackReader. (io/reader fname))))
;; jdk-8  - 19964ms
;; jdk-17 - 51795ms

(time (read-all-char-reader (char-input/reader->char-reader (io/reader fname)
                                                            {:async? false
                                                             :n-buffers 2
                                                             :bufsize (* 16 1024)})))
;; jdk-8  - 3728ms
;; jdk-17 - 3053ms

(time (read-all-char-reader (char-input/reader->char-reader (io/reader fname)
                                                            {:async? true
                                                             :queue-depth 4})))
;; jdk-8  - 2872ms
;; jdk-17 - 2832ms
Even compared to read() on a plain reader, the custom implementation is faster, up through jdk-17. If we look at the source code of PushbackReader we can see why - https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/io/PushbackReader.java#L86 - it's a synchronized method followed by noise, as opposed to, in most cases, an index bump and returning from an aget. But the large win is when you have a tight loop and reading the next character is an aget, like you did for the strings.

Moving from conj! with a transient persistent vector to an arraylist gets a ton of speed right there, and using a normal hash table and such has similar speed improvements, so it really depends on whether you need persistent datastructures at the end of the day. Also, for CSV parsing I parse into 1 arraylist and have 1 stringbuilder-type object for the entire file, and I clear() both of them, just copying the arraylist when returning the value, so I have a 'workspace' that is fixed and thus uses the same amount of memory all the time.

Moving the tight loops from Clojure into Java like I did also improved the time considerably, but only once everything else was fixed. I checked that everything would inline correctly with -XX:+PrintInlining and friends. Finally, I have a read thread that does the IO and use an ArrayBlockingQueue to communicate read blocks so the cpu-bound thread nearly never blocks on IO.

Because my use case is tmd I don't care whether the datastructures are persistent or not, but I imagine that if I took a careful look at data.json I could make improvements in the same vein as I mentioned - it would just end up being more Java than Clojure. We just don't see GB+ json files like we see large CSV files, so it hasn't come up. I am not sure using transients is ever faster than using arraylists, calling toArray, and then calling the appropriate persistent constructor (LazilyPersistentVector, one of the two maps). Perhaps it is, but I don't know. It is definitely slower in the vector-type case than just using arraylist add.
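A minimal sketch of that IO-thread-plus-queue arrangement (names and parameters made up; judging by the :n-buffers option above, the real implementation recycles a fixed set of buffers, while this sketch just allocates fresh ones):

(import '[java.util.concurrent ArrayBlockingQueue]
        '[java.util Arrays])

(defn block-supplier
  "Start a daemon reader thread; returns a no-arg fn yielding successive
  char[] blocks, or nil at EOF."
  [^java.io.Reader rdr ^long bufsize ^long queue-depth]
  (let [queue (ArrayBlockingQueue. (int queue-depth))
        eof   (char-array 0)]
    (doto (Thread.
           (fn []
             (loop []
               (let [buf (char-array bufsize)
                     n   (.read rdr buf 0 (int bufsize))]
                 (if (pos? n)
                   (do (.put queue (if (== n bufsize) buf (Arrays/copyOf buf n)))
                       (recur))
                   ;; a zero-length block signals EOF to the consumer
                   (.put queue eof))))))
      (.setDaemon true)
      (.start))
    (fn []
      (let [^chars buf (.take queue)]
        (when (pos? (alength buf)) buf)))))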

👍 1
chrisn12:04:18

It's also the case that json is just harder by far than csv. metosin's jsonista, on top of the jackson architecture, only gets like a 2x perf gain - if that - over data.json in terms of decoding time. univocity got 10x+ against data.csv - so there was clearly a lot to gain there. Interestingly, jsonista gets a major encoding gain, which I wouldn't expect.

slipset12:04:59

I guess one major difference is that json is recursive in its structure, whereas csv is not.

slipset12:04:35

Which means sticking to one stringbuffer for the whole parsing operation won’t do (unless you do something really smart)

slipset12:04:33

And, we’re talking different sizes here, ["10b" "100b" "1k" "10k" "100k"] is the range of sizes I was working with (and which are “normal” I guess for json parsing)

slipset12:04:26

(defn- read-array* [^PushbackReader stream options]
  ;; Handles all array values after the first.
  (let [result (java.util.ArrayList.)]
    (loop []
      (.add result (-read stream true nil options))
      (codepoint-case (int (next-token stream))
        \] (clojure.lang.PersistentVector/create result)
        \, (recur)
        (throw (invalid-array-exception))))))

(defn- read-array* [^PushbackReader stream options]
  ;; Handles all array values after the first.
  (loop [result (transient [])]
    (let [r (conj! result (-read stream true nil options))]
      (codepoint-case (int (next-token stream))
        \] (persistent! r)
        \, (recur r)
        (throw (invalid-array-exception))))))
These two versions are basically the same speed.

slipset12:04:12

(and one might argue that at least for my use case, the maps dominates lists in the json)

chrisn12:04:46

Then, if you can return arraylists instead of vectors, that would be a speed gain - but kind of irrelevant. Maps have a similar low-level constructor from an array of key,val,key,val.
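For example, using the internal clojure.lang constructors (implementation classes, so subject to change - a sketch, not a recommendation of a public API):

(import '[java.util ArrayList])

;; persistent vector straight from a java collection, no transients
(let [row (ArrayList. ["a" "b" "c"])]
  (clojure.lang.LazilyPersistentVector/create row))
;;=> ["a" "b" "c"]

;; persistent map straight from a flat key,val,key,val array
(clojure.lang.PersistentArrayMap/createWithCheck
  (object-array [:name "chrisn" :lib "dtype-next"]))
;;=> {:name "chrisn", :lib "dtype-next"}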

slipset12:04:22

Yeah, but returning anything non-persistent would be contract breakage in the case of data.json.

chrisn12:04:28

Yes - which is why I work in dtype-next 🙂.

😂 1
chrisn13:04:12

I will take a look at json parsing at some point. If jackson is barely faster, then it is more or less using the same system under the hood, and there may be something there. dtype-next also has https://github.com/cnuernber/dtype-next/blob/master/java/tech/v3/datatype/ListPersistentVector.java to create a persistent-vector-compatible object from any java list, in place.

chrisn13:04:32

json is also harder because of those sizes above - 10b -> 100k. No one cares about 1k csv files at all - they only care when the file gets big. For servers, parsing small json is something you do all the time, so it is very relevant.

chrisn13:04:05

100k csv is like just barely getting started.

chrisn13:04:08

I mean, why not just read everything into a single character array, where push-back just writes data back into the array and resets the index?
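A tiny sketch of exactly that idea (hypothetical names; assumes the whole file has already been read into one char array):

(definterface IPushback
  (^long readChar [])
  (unreadChar []))

(deftype ArrayPushback [^chars data ^:unsynchronized-mutable ^long pos]
  IPushback
  (readChar [_]
    ;; next char as a long, or -1 at EOF
    (if (< pos (alength data))
      (let [c (aget data (int pos))]
        (set! pos (inc pos))
        (long c))
      -1))
  (unreadChar [_]
    ;; pushback is just an index decrement - no synchronization, no copying
    (set! pos (dec pos))))

(comment
  (let [pb (ArrayPushback. (.toCharArray "a,b") 0)]
    (.readChar pb)          ;=> 97
    (.unreadChar pb)
    (char (.readChar pb)))) ;=> \a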

slipset13:04:08

I believe I did something like that at some point (to see how much I could gain from it), but everything sort of got dwarfed by https://github.com/clojure/data.json/blob/master/src/main/clojure/clojure/data/json.clj#L94 and specifically https://github.com/clojure/data.json/blob/master/src/main/clojure/clojure/data/json.clj#L108 - so basically creating the new string.

slipset13:04:23

(but then again, it's a bit over a year since I last worked on this, so I don't remember everything.)

slipset13:04:09

But, I guess you should be seeing much of the same, since parsing csv is very much about creating strings.

slipset13:04:29

And I believe I implemented read-quoted-string in Java at some point without seeing huge increases in speed.

chrisn13:04:39

I realized that to get a major gain I had to be tight-looping over indexes of a character array, and that array had to be local to the method (meaning spilling field variables to locals). Then it was lots of reorganizing code and profiling and such. I did see significant speed increases when I moved the read-row loop to java myself, but I haven't seen those increases in general when moving e.g. a summation loop into java. I guess it depends on the context a bit.

chrisn13:04:29

The high end of hotspot's ability to optimize pretty much only works on arrays, primitives, and simple base jvm datastructures.
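For example, the kind of loop hotspot rewards - a method-local primitive array, long locals, and nothing but compares and index bumps in the body (a made-up separator-counting loop, just for illustration):

(set! *unchecked-math* true)

(defn count-seps
  "Count separator chars in the first n chars of one buffer block."
  ^long [^chars buf ^long n]
  (loop [i 0 cnt 0]
    (if (< i n)
      (recur (inc i)
             (if (== (int (aget buf (int i))) (int \,))
               (inc cnt)
               cnt))
      cnt)))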

slipset13:04:01

public final class json$read_quoted_string extends AFunction
{
    public static final Object const__1;
    public static final Var const__11;
    
    public static Object invokeStatic(final Object stream) {
        final Object buffer = Numbers.char_array(json$read_quoted_string.const__1);
        final int read = ((PushbackReader)stream).read((char[])buffer, RT.intCast(0L), RT.intCast(64L));
        final int end_index = read - 1;
        if (read < 0L) {
            throw new EOFException("JSON error (end-of-file inside string)");
        }
        long i = RT.intCast(0L);
        Object o = null;
    Label_0537:
        while (true) {
            final int G__18164;
            final int c = G__18164 = RT.intCast(((char[])buffer)[RT.intCast(i)]);
            switch (G__18164) {
                case 34: {
                    final int off = RT.intCast(i) + 1;
                    final int len = read - off;
                    ((PushbackReader)stream).unread((char[])buffer, off, len);
                    o = new String((char[])buffer, RT.intCast(0L), RT.intCast(i));
                    break Label_0537;
                }
                case 92: {
                    final long off2 = i;
                    final int len = read - RT.intCast(off2);
                    ((PushbackReader)stream).unread((char[])buffer, RT.intCast(off2), len);
                    o = ((IFn)json$read_quoted_string.const__11.getRawRoot()).invoke(stream, new String((char[])buffer, RT.intCast(0L), RT.intCast(i)));
                    break Label_0537;
                }
                default: {
                    if (i == end_index) {
                        ((PushbackReader)stream).unread(c);
                        o = ((IFn)json$read_quoted_string.const__11.getRawRoot()).invoke(stream, new String((char[])buffer, RT.intCast(0L), RT.intCast(i)));
                        break Label_0537;
                    }
                    i = RT.intCast(i) + 1;
                    continue;
                }
            }
        }
        return o;
    }
That is the decompiled read-quoted-string. What's a bit annoying are the intCasts that go on, but maybe that is just handled by the JIT?

chrisn13:04:22

ok, so that was another reason I worked in java. To avoid casting every char to int and indexes to int. Here is the tight loop for csv parsing: https://github.com/cnuernber/dtype-next/blob/master/java/tech/v3/datatype/CharReader.java#L97.

chrisn13:04:52

Also note that hotspot's optimization pathways are built expecting the output of javac. The extent to which they work with other compilers is luck.

chrisn13:04:34

haha - the extent to which they work even with javac's output is luck, much less other compilers 🙂.

chrisn13:04:48

...the magic of hotspot...

Ben Sless13:04:12

@U04V5VAUN you can't use ints as primitive locals in loops - you're doing more casts that way. Replace it with a regular long and set unchecked-math to true.

👍 1
Ben Sless13:04:58

If you want to actually see what hotspot does with your code, you can use tools like jitwatch.

Ben Sless13:04:16

It's a bit of a hassle but very doable

Ben Sless14:04:12

@UDRJMEFSN regarding the "canonicity" of javac, it's not as bad as you put it. It's a pretty simple compiler, and you can know exactly which bytecode will be emitted just from looking at the java. I'd go as far as saying you could compile most java by hand (wouldn't want to). The JVM spec contains sections on targeting the JVM with your own compiler. Clojure also does a good job, on both the implementation and compilation fronts, of emitting JIT-friendly code, especially after we got direct linking. I'd say the only detrimental parts in the implementation are around ints, and you can often work around most of them, although a java implementation will always be faster there. Still, even in java, if you want to squeeze the most performance out of an implementation, you have to write both JIT-friendly and CPU-friendly code.

👍 1
chrisn15:04:50

@UK0810AQ2 - True, there are some statements that compile to similar bytecode, and there are some loops that optimize well from clojure through hotspot, but that pertains to the extent they match what javac would emit for various bits of code. That was really my point, and my first statement - that hotspot is built to target the output of javac - is very true. Small variations that have no discernible difference in code output, such as locals clearing, can disable various optimizations that would have happened given very simply written java code. An example from the CSV parser: I got a major perf gain from simply spilling field variables to locals, despite the fact that the entire parser is small enough to fit in registers on current CPU architectures - hotspot of course couldn't detect that the field variable should be kept in a register for a given loop.

I hadn't heard about jitwatch and you just made my day - compiling hsdis for a given platform is a pita. Totally agreed about jit- and cpu-friendly code - hence the design of the CharReader.java that I pointed to earlier. Hotspot is fairly weak - if you move beyond arrays and primitives then you are majorly dropping the chances that it will do anything meaningful as compared to an implementation in c++. In fact, when I was working through a math library a while back in java, I found that simply adding an offset to an array access in a loop, so I could use sub-arrays, caused a major performance drain in a tight loop as compared to no array offset - despite the fact that x86 assembly and its various SIMD extensions have instructions meant for indirect access of data.

So at least I have to be really mindful and minimal if I want hotspot to do what I expect it to do, given my background in other languages. Local variables, arrays, primitives, simple datastructures, and simple looping constructs. Anything else is usually off the table. And certainly no synchronization - absolutely not synchronization per character.

❤️ 1
👍 1
chrisn15:04:01

Honestly the bigger thing to think about is moving away from pushback reader in https://github.com/clojure/clojure/search?q=pushbackreader - the tools every clojure person uses many many times every day.

Ben Sless16:04:20

It takes a bit of tinkering, but I was able to get jitwatch to even display the clojure source alongside the bytecode and assembly. Feel free to ping me about it if you have questions. I also think it went better with an uberjar than a repl.

chrisn17:04:29

That is amazing literally. What a great tool - I will get into it and ping you. Thanks a ton!

rickmoynihan16:04:12

@UDRJMEFSN:
> This is even more so, interestingly enough, for jdk-17, where int read() has been heavily nerfed while read(char[], int, int) has received an amount of buffs bordering on insane.
Can you clarify what you mean by this? I get that you're saying pushback reader read() is slower in jdk-17, but what are you saying about read(char[], int, int)?

chrisn18:04:51

@U06HHF230 - Sure - according to my tests, PushbackReader's read (and clojure.data.csv) are far slower on jdk-17, while read(char[], int, int) on a Reader is far faster. Here are the timings I found:

(time (read-all-chars (io/reader fname)))
;; jdk-8  - 5728ms
;; jdk-17 - 28257ms

(time (read-all-chars (PushbackReader. (io/reader fname))))
;; jdk-8  - 19964ms
;; jdk-17 - 51795ms

;; line-seq is pretty good, but you can have newlines within quoted sections
(time (count (line-seq (io/reader fname))))
;; jdk-17 - 2222ms

(time (read-all-cbuf (io/reader fname)))
;; jdk-8  - 1480ms
;; jdk-17 - 426.24ms

chrisn18:04:48

Reading the large CSV with clojure.data.csv also displayed this disparity:

(time (count (csv/read-csv (io/reader fname))))
;; jdk-8  - 56937ms
;; jdk-17 - 81223ms

Alex Miller (Clojure team)18:04:56

1.11.1-rc1 is now available (https://clojure.org/releases/devchangelog#v1.11.1-rc1):
• https://clojure.atlassian.net/browse/CLJ-2701 Pin serialVersionUID for Keyword and ArraySeq back to 1.10.3 values to retain binary serialization

🎉 17
Noah Bogart20:04:07

Any particular reason you didn't want to pin the other class IDs?

Alex Miller (Clojure team)20:04:47

they didn't change, so we're trying to do as small a change as possible in 1.11.1; we will do that in 1.12

👍 2
Noah Bogart20:04:47

That makes sense. Thanks

jumar05:04:13

Excellent - thanks a ton Alex for the quick turnaround. We'll give it a try.

jumar17:04:50

I've just tried it and it works beautifully - thanks again!

🎉 2