#announcements
2022-04-12
practicalli-johnny15:04:37

https://github.com/practicalli/clojure-deps-edn user-level aliases for Clojure CLI - library version updates and a few removals
• Removed :deps from the configuration to avoid overriding the version from the Clojure CLI install
• Removed :inspect/rebl (alias is commented) after deprecating it 6 months ago
• GitHub Actions workflow .github/workflows/lint-with-clj-kondo.yml updated to clj-kondo version 2022.04.08
• Updated library versions using the clojure -T:search/outdated command (see https://github.com/practicalli/clojure-deps-edn/blob/live/CHANGELOG.org for details)

practicalli 4
🎉 2
chrisn18:04:57

Introducing https://github.com/cnuernber/charred - fast JSON/CSV encode and decode. This library finalizes my research into CSV and JSON parsing and is a complete drop-in replacement for clojure.data.csv and clojure.data.json. Same API, much better (5-10x) performance. This library gets as good performance for those tasks as anything on the JVM and avoids the Jackson hairball entirely. You can find my previous post on fast CSV parsing for the reasons why the system is fast, or just read the source code - all the files are pretty short. I moved the code from dtype-next into a stand-alone library and added encoding (writing) to the mix, so you don't need any other dependencies. Finally, this library passes the same conformance suite as the libraries it replaces, so you can feel at least somewhat confident it will handle your data with respect. Enjoy :-)
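A minimal sketch of the drop-in usage, based on the entry points that come up later in this thread (file names are placeholders):

(require '[clojure.java.io :as io]
         '[charred.api :as charred])

;; CSV: same shape as clojure.data.csv - a sequence of row vectors.
(with-open [reader (io/reader "data.csv")]
  (doall (charred/read-csv reader)))

;; JSON: :key-fn mirrors clojure.data.json's keyword coercion.
(charred/read-json (java.io.File. "data.json") :key-fn keyword)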

wow 29
🎉 42
💯 21
⏱️ 4
clojure-spin 11
gratitude 4
❤️ 6
4
seancorfield18:04:02

I'm curious why org.clojure/tools.logging {:mvn/version "1.2.4"} is a dependency of it?

chrisn18:04:19

Because it uses an offline thread to do blocking reads and sometimes that thread may log depending on the situation.

chrisn19:04:08

I love tools.logging, btw. Far and away the best logging framework IMO. The whole zero dependencies thing is extremely helpful when it comes to logging systems.
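For context, the facade looks like this - the calls compile against no particular backend, and tools.logging discovers one at runtime:

(require '[clojure.tools.logging :as log])

;; No backend is referenced here - tools.logging finds one (slf4j,
;; log4j2, java.util.logging, ...) on the classpath at runtime, which
;; is what keeps the library itself zero-dep.
(log/info "parsing started")
(log/warn "buffer refill took" 123 "ms")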

1
👍 4
borkdude19:04:46

Some other people and I are looking to extend the tools.logging approach to more libraries, like JSON and HTTP clients, here: https://github.com/clj-easy/tools.misc Feedback is welcome on that!

☝️ 6
❤️ 3
Kelvin19:04:11

Do you happen to have benchmarks, especially against Jackson-based libraries? (I know you've discussed your findings with the community, but would be nice to have results in one place.)

Kelvin19:04:21

I see. So was Charred extracted from the larger dtype-next library?

chrisn19:04:35

Yes - and I added writing the formats efficiently. I thought it would be more palatable to many people if the library was small and exact and had minimal deps. dtype-next is specifically targeted towards HPC, and thus it has somewhat more dependencies, many of them unrelated to reading or writing CSV and JSON data.

👍 2
chrisn19:04:00

I guess, additionally, I feel like charred is a good library to learn techniques from, as it is precisely targeted. dtype is a bit of a battleship.

4
kennytilton19:04:32

on the name.

domparry19:04:11

I am unreasonably excited about this.

domparry19:04:15

It's going to be great to get rid of the Jackson dependencies for my ETL pipelines. Thanks so much, Chris.

chrisn19:04:54

@U1S4MH05T - That is great!! The fastest pathway for parsing is to create a https://cnuernber.github.io/charred/charred.api.html#var-parse-json-fn with the options you want and call that. It turns out that for parsing small JSON blobs, simply mapping the options map to a parser is a significant portion of the parse time. That function is safe to use in a multithreaded context, so you probably only ever need to create one.
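Roughly like this - a sketch based on the linked docs, assuming parse-json-fn accepts the same options as read-json:

(require '[charred.api :as charred])

;; Build the parser once - option handling is hoisted out of the hot
;; path - then reuse it freely; it is safe across threads.
(def parse-json (charred/parse-json-fn {:key-fn keyword}))

(parse-json "{\"a\": 1, \"b\": [2, 3]}") ;=> {:a 1, :b [2 3]}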

domparry19:04:52

That's terrific! We're also parsing a LOT of JSON in our normal database and message queue usage of the Google Cloud API, so this will give us a great lift in performance there too.

richiardiandrea19:04:34

This is an awesome library! FWIW, a way to be more flexible w.r.t. logging is to be able to pass a log function as an option. I like this approach so much that I could not stop sharing it 😄

metal 2
chrisn19:04:31

That is a great point. Then I could say it has zero dependencies 🙂.

❤️ 3
🙏 2
🌈 1
1
flowthing20:04:10

This looks fantastic — thanks for making it!

kennytilton21:04:14

Looks a little better than twice as fast as c.data.csv on a couple of 30-60k-row, 3-column CSVs I am inhaling. 🏎️

chrisn21:04:27

Try :async? false

chrisn21:04:31

And a smaller buffer size

chrisn21:04:31

:bufsize 8192

chrisn21:04:02

The test CSV was 1.7GB

chrisn21:04:20

So the system is tuned for larger files

chrisn21:04:25

Also - use the supplier interface that avoids creation of persistent vectors
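A sketch of that, assuming the entry point is charred.api/read-csv-supplier returning a closeable java.util.function.Supplier that yields one row per .get and nil when exhausted:

(require '[charred.api :as charred])

;; Hypothetical usage - count rows without materializing a persistent
;; vector for each one.
(with-open [rows (charred/read-csv-supplier (java.io.File. "subtlex-lite.csv")
                                            :async? false :bufsize 8192)]
  (loop [row (.get ^java.util.function.Supplier rows), n 0]
    (if row
      (recur (.get ^java.util.function.Supplier rows) (inc n))
      n)))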

chrisn22:04:09

I guess at those sizes you can just load things into memory in an offline thread pool and just parse the actual strings.

kennytilton23:04:13

Thx! Don't mind me, I am just an applications guy. I thought 2X was great! 🙂 Now I have maybe 100x faster. For one, I started with:

(with-open [reader (io/reader "subtlex-lite.csv")]
  (doall
    (rest
      (charred/read-csv reader))))

This is the 100x improvement:

(charred/read-csv (java.io.File. "subtlex-lite.csv")
                  :async? false
                  :bufsize 8192)

Not having luck finding a replacement for this in the doc:

(json/parse-stream (io/reader "wordnet.json") true)

I stole this from the test suite:

(let [input (java.io.File. "wordnet.json")]
  (charred/read-json input :key-fn keyword))

Gives me:

Execution error at charred.JSONReader/readObject (JSONReader.java:375).
JSON parse error - unrecognized 'null' entry.

Anyway, 100x, not too shabby! 👏

awesome 2
🎉 1
chrisn23:04:54

100X is amazing actually - that error means something started with 'n' and didn't finish with 'ull' - I really should print more context there. Is that JSON file accessible publicly?

kennytilton23:04:47

Wordnet: https://wordnet.princeton.edu/download/current-version The so-called synset IDs are indeed strings like "n.123456". The 100x came on https://www.ugent.be/pp/experimentele-psychologie/en/research/documents/subtlexus I believe I grabbed the 75k Excel 2007 version, then deleted all but a few words in OS X Numbers, then exported as CSV.

kennytilton23:04:39

Actually, I recall I got my json version from here: https://github.com/fluhus/wordnet-to-json

kennytilton23:04:26

That ^^ conveniently knits everything together in one JSON file.

chrisn00:04:34

Perhaps that is JSON5. The JSON spec I targeted requires all strings, including map keys, to be quoted. I will, however, check it out and get it to work - thanks for the heads up :-).

kennytilton00:04:54

Maybe I am mis-describing the data. Here is the start via less in a terminal:

{
  "synset": {
    "a1000283": {
      "offset": 1000283,
      "pos": "s",
      "word": [
        "admonitory",
        "admonishing",
        "reproachful",
        "reproving"
      ],
      "pointer": [
        {
          "symbol": "\u0026",
          "synset": "a999867",
          "source": -1,
          "target": -1
        },
:

Ben Sless03:04:49

Now the obvious question - EDN reader next?

❤️ 3
chrisn12:04:17

@UK0810AQ2 I thought about EDN, but I don't know anyone who reads/writes EDN at scale. I know plenty of people who process JSON at scale, and large CSV files are all over the place.

Ben Sless12:04:43

@UDRJMEFSN besides all of us reading and compiling Clojure every day. Faster compiler and tools?

chrisn13:04:59

@U0PUGPSFR - Wordnet surfaced 2 issues related to reading data right at the edge of one buffer, leading into the next. New version - 1.001 🙂. Great test case - lots of escaped data and quite large.

🦾 2
chrisn13:04:09

@UK0810AQ2 - OK, now you are talking, but that is a bit more involved than faster EDN/Lisp readers. That involves looking at the entire architecture of the Clojure compiler itself, as reading the source code is like 1 step out of 5.

chrisn13:04:40

I am sure Rich would love it if I forked Clojure and started making aggressive changes 🙂.

👍 1
chrisn13:04:05

If I were going to speed things up, I would speed up the time it takes to compile dtype-next and core.async. core.async, once required, is a 3+ second hit, I think due to the compilation of macros and such. dtype-next is at least a 1.5 second hit, with the dataset library doubling it - faster require times for those would be beneficial to me and to new people comparing the system against pandas and dplyr. So I think that would involve looking at the macro execution pathway and seeing if I can find some gain there. I guess it would start with the EDN and Lisp readers, however.

Ben Sless15:04:45

@UDRJMEFSN did you by any chance profile this?

chrisn15:04:29

I have only roughly profiled it. The document at https://cnuernber.github.io/dtype-next/datatype-to-dtype-next.html goes into what I found. Going from dtype v2 to v3 halved the require time, and I figured this stuff out by starting a blank REPL and then timing the require statement.
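In transcript form, the rough measurement is just this (the elapsed number below is illustrative, per the 3+ second hit mentioned above):

;; fresh JVM, nothing loaded yet
user=> (time (require '[clojure.core.async :as async]))
"Elapsed time: 3100.0 msecs"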

chrisn15:04:57

I have not (yet) figured out a way to profile this stuff meaningfully aside from starting a jar with a main function that does a dynamic require, and that is fairly tedious but doable, as VisualVM allows you to profile an executable from startup.

Ben Sless15:04:21

Doesn't have to be a jar; you can even pass the flags to clj and eval the require expression

chrisn15:04:46

Oh rly? I hadn't considered that. That is very useful.

Ben Sless15:04:47

clj -J-(visual VM flags) -e (require ...)

chrisn15:04:21

I wonder, then, if VisualVM is the best option. We are getting into territory where I would just want a data file produced and a tool to look at it. Do you have a suggestion for a tool for that type of profiling?

Ben Sless15:04:40

JFR, attach to JVM at startup, capture recording to file, analyze and share freely

Ben Sless15:04:34

-XX:+UnlockCommercialFeatures -XX:+FlightRecorder -XX:StartFlightRecording=duration=60s,filename=myrecording.jfr

chrisn16:04:29

Got it - you are extremely helpful - thank you!

Ben Sless16:04:39

clojure -J-XX:+FlightRecorder -J-XX:StartFlightRecording=duration=20s,filename=myrecording.jfr -Sdeps '{:deps {org.clojure/core.async {:mvn/version "1.3.618"}}}' -M -e "(require '[clojure.core.async])"

Glad I could help 🙂

kennytilton17:04:21

"I am sure Rich would love it if I forked Clojure and started making aggressive changes 🙂." Great, @UDRJMEFSN! I'll add the OOP module! GDR&amp;H

😆 3
chrisn20:04:28

@U0C8489U6 - Now zero deps for real 🙂. Thanks for the idea, log-fn is totally fine for this use case.
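So something like this, presumably - a guess at the shape, assuming the new option is :log-fn and it receives the messages the read thread would otherwise log:

;; Hypothetical - the exact arity of :log-fn is assumed here.
(charred/read-json (java.io.File. "wordnet.json")
                   :key-fn keyword
                   :log-fn (fn [msg] (println "charred:" msg)))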

❤️ 1
domparry20:04:12

Already integrated into our data pipelines. 🙂 And just upgraded to the zero dep version. 😄

chrisn21:04:19

Fortune favors the bold 🙂.

Ben Sless05:04:08

@UDRJMEFSN any chance of readers for arbitrary data sources? Especially byte arrays and input streams? Currently all parsing is done in terms of chars and strings; any reason not to use bytes?

Ben Sless05:04:07

Would you like me to open an issue for that?

chrisn12:04:48

@UK0810AQ2 - If you pass in something that is not a string or a char[], the system attempts to make a reader out of it. This allows it to parse input streams and java.io.Files and such. Is that what you are looking for? So, for instance, for the large wordnet.json Kenny used earlier you just do (read-json (java.io.File. "wordnet.json") :async? true) and off you go. More abstractly, you could create a CharReader from anything that can supply a sequence of character arrays.

chrisn12:04:23

Put another way, anything that clojure.java.io can turn into a reader is fair game.
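So all of these should be fair game (a sketch - strings and char[]s parse in place, everything else goes through a reader):

(charred/read-json "{\"a\": 1}" :key-fn keyword)                    ; string, parsed in place
(charred/read-json (java.io.File. "wordnet.json") :key-fn keyword)  ; file -> reader
(with-open [in (java.io.FileInputStream. "wordnet.json")]
  (charred/read-json in :key-fn keyword))                           ; input stream -> reader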

chrisn12:04:37

One thing I think is an issue is a file that starts with the Unicode BOM. All the systems I have seen have specific handling for that and I have none, so I imagine those files will fail.

Ben Sless13:04:03

Since you're using a buffer, I guess the parser is still shuffling some data around. I thought that parsing bytes directly could help avoid casting and copying.

chrisn14:04:12

If you knew you were only going to parse files of a particular encoding then perhaps, but my guess is the actual decoding of the byte information into characters is a very minor part of the overall time. The :async? flag allows you to move that conversion to an offline thread, and that helps with larger files. You could use a stream of bytes and a decoder, and this is how https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/lang/String.java#L157 does it. What makes the parsers fast, however, is implementing things like parsing strings or numbers in tight loops (https://github.com/cnuernber/charred/blob/master/java/charred/JSONReader.java#L134). My guess is that doing the decode from bytes to chars in those loops would not be an overall win, as you would increase the code size in each loop. I could definitely be wrong on this, however.

chrisn15:04:27

Reading from bytes has the strong advantage of fitting twice as much data into the cache, and in this realm that could actually be nearly a 2x gain.

chrisn15:04:34

Potentially you could write the parser to work from byte data and, in a lot of cases, just encode chars that are uninteresting to the parser as a special byte value. I don't know; you would want to make sure there was a potential win there before committing real work to trying it out.

chrisn15:04:55

String.java has some very interesting information in the comments, btw.

❤️ 1
👀 1
ambrosebs23:04:55

Typed Clojure 1.0.27 - Check your programs without depending on typedclojure! https://www.patreon.com/posts/65065388 Also includes fixes to malli->type translation and other improvements.

😮 13
🆒 10
🎉 12
🏆 5
clojure-spin 5
ambrosebs23:04:14

Here's a fleshed-out tutorial on how to type check a Clojure library without introducing a runtime dependency on Typed Clojure: https://github.com/typedclojure/typedclojure/tree/main/example-projects/zero-deps
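The shape of the setup, roughly - a sketch of the alias approach from that example project (artifact name assumed from the typedclojure repo, version per this announcement):

;; deps.edn - the checker lives behind an alias, so consumers of the
;; library never pull in a Typed Clojure dependency.
{:deps {org.clojure/clojure {:mvn/version "1.10.3"}}
 :aliases
 {:typed {:extra-deps {org.typedclojure/typed.clj.checker
                       {:mvn/version "1.0.27"}}}}}

;; Then, from a REPL started with `clj -A:typed` (entry point assumed):
;; ((requiring-resolve 'typed.clojure/check-ns-clj) 'my.lib.core)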