#clojure
2020-02-24
dmarjenburgh08:02:56

Are there plans to add clojure to https://sdkman.io/?

Alex Miller (Clojure team)14:02:15

I'm not working on it, but patches welcome

andy.fingerhut18:02:10

If you use a library like data.csv to read CSV files, and probably the other main Clojure libraries for reading CSV files (but I haven't confirmed for those yet), if there are many 'cell values' that are the same as each other, e.g. all the empty string, or many cell values equal to the string "foo", then a separate Java string object is returned for each one, which has a memory overhead of about 40 or 48 bytes per string/csv-cell. That overhead can be several times larger than the CSV file itself, if most cells are short and there are many duplicates. Would folks be interested in an option for a time/space tradeoff that looked for duplicates and returned references to identical Java strings in memory when strings were equal in the input file?

rickmoynihan11:02:56

For what it’s worth I have noticed this too and I think it is a real problem. Especially where you have datasets containing repeated codes, e.g. statistical observation data with codes like MALE FEMALE keyed against dimensions. However depending on the application I think it’s trivially solved by essentially mapping #(.intern %) over the columns that contain such codes; or for cases where you want to manage the pool’s lifecycle, or don’t trust the data, building that pool yourself in a hashmap. I’m really not sure the library should incorporate this as a feature; however I have seen people complain about memory etc of handling CSV in clojure; where they’ve just consumed a large sequence into memory without any thought… so it might be worth mentioning this in the README?
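
A minimal sketch of the `#(.intern %)` idea from the message above, assuming rows arrive as vectors of strings (the shape data.csv returns). `intern-columns` and `sample-rows` are hypothetical names for illustration; nothing here is part of data.csv.

```clojure
(defn intern-columns
  "Returns a lazy seq of rows in which the cells at the given column
  indexes are replaced by their interned String. Hypothetical helper,
  not part of data.csv."
  [col-idxs rows]
  (let [cols (set col-idxs)]
    (map (fn [row]
           (into []
                 (map-indexed (fn [i cell]
                                (if (contains? cols i)
                                  (.intern ^String cell)
                                  cell)))
                 row))
         rows)))

;; (String. "MALE") forces distinct objects, as a CSV parser would produce.
(def sample-rows
  [["1" (String. "MALE")]
   ["2" (String. "FEMALE")]
   ["3" (String. "MALE")]])

(def interned (vec (intern-columns [1] sample-rows)))

;; Rows 0 and 2 now share one String object for "MALE".
(identical? (get-in interned [0 1]) (get-in interned [2 1]))
```

Interned strings live in a JVM-wide pool, which is why the message above suggests a hand-built pool when the data is untrusted or the pool's lifetime matters.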

rickmoynihan11:02:52

ahh just seen your article posted below mentions interning too

andy.fingerhut18:02:46

That may be a better question for the #data-science channel, now that I've typed it in.

andy.fingerhut18:02:15

Since data.csv returns a lazy sequence of rows, it seems the 'deduplication' could be done in a separate optional step from data.csv itself, which might be a useful way to provide such functionality as an option.

Alex Miller (Clojure team)18:02:38

Java's G1 will automatically dedupe string references like this

andy.fingerhut18:02:08

Yeah, I found an article mentioning that. Sounds like a nifty option.

Alex Miller (Clojure team)18:02:23

there is an enhancement with patch (currently sitting in a support queue) for data.csv for this

ghadi18:02:47

I don’t think G1 does that

ghadi18:02:53

By default

andy.fingerhut18:02:39

It looks like an option. Here is the article I found about it, but haven't tried it out myself: https://itblues.pl/2019/01/02/all-you-need-to-know-about-string-performance-in-java/

Alex Miller (Clojure team)18:02:44

I'm pretty conflicted about making that be a default thing for data.csv - it's a lot of complexity for a problem most people don't have

andy.fingerhut18:02:56

Looking for review and/or performance measurements of the data.csv patch?

andy.fingerhut18:02:18

or perhaps a good way to make it an option for data.csv users rather than always-on?

ghadi18:02:19

(CompactStrings is now the default, but not StringDedup)
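
For anyone wanting to try the G1 option discussed here, a hedged sketch of a deps.edn alias. `-XX:+UseG1GC` and `-XX:+UseStringDeduplication` are real HotSpot flags; the alias name `:string-dedup` is made up for illustration, and whether the option helps a given workload would need measuring.

```clojure
;; deps.edn (fragment)
{:aliases
 {:string-dedup
  {:jvm-opts ["-XX:+UseG1GC"
              "-XX:+UseStringDeduplication"]}}}
```

A REPL started with something like `clj -A:string-dedup` would then run with deduplication enabled.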

andy.fingerhut18:02:49

Is the support queue you mentioned containing an enhancement patch outside of Clojure JIRA? I see several JIRA tickets for data.csv, and could have missed the one you talked about while skimming, but none look like what you described.

Alex Miller (Clojure team)18:02:02

it's in jira, but it's not publicly visible

Alex Miller (Clojure team)18:02:39

it's weird that it's there at all but I can't decide what to do with it (it's the only thing in the support queue)

andy.fingerhut18:02:26

If you have a way to make it public, I might take a look at it and see if there is a way to package it as an option, default off.

Alex Miller (Clojure team)18:02:10

I moved it here https://clojure.atlassian.net/browse/DCSV-20 - note the patch is not from a contributor, so I'm not going to look at it

Alex Miller (Clojure team)18:02:37

would need either them to sign or clean room impl

andy.fingerhut18:02:58

So, perhaps a clean room implementation, where it is off using the currently documented API calls, but on with either some new call (or a new arity to an existing call), with some performance measurements of time and memory use using new vs. current API, from a signed-up contributor, might be more interesting?

Alex Miller (Clojure team)18:02:04

this is solution stuff, not problem stuff

andy.fingerhut18:02:05

Sure. One of the problem statements is pretty clear, though: In a data dependent fashion, some (many?) CSV files with repeated cell values use exorbitant amounts of memory, and hammer X can reduce that significantly.

andy.fingerhut18:02:27

Granted I am putting the solution into that problem statement 🙂

andy.fingerhut18:02:58

I just have no idea how to state a problem that is independent of at least a general statement of the approach used to improve upon it.

ghadi19:02:26

Retaining all rows or processing lazily?

andy.fingerhut19:02:53

The library enables both, so any changes to the library should consider both, yes?

andy.fingerhut19:02:56

If one is processing a row at a time, transforming it into some other data that is not strings, e.g. into numbers, then you will not hit this particular issue, because all strings returned by the lib would become garbage soon.

andy.fingerhut19:02:16

So you would be happy with the existing behavior, and wouldn't go looking for an option to reduce memory utilization.

andy.fingerhut19:02:46

So likely you would only consider using such an option if you were retaining all or large fractions of the rows.

andy.fingerhut19:02:53

As I hinted at earlier, such an optimization can be done completely outside of data.csv code, on the returned lazy sequence. If the answer is thus "make each user of data.csv that wants this option reimplement it, or discover library foo that does it as an add on", then that is certainly a choice.
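
A sketch of what such an outside-the-library step might look like, written as a stateful transducer over rows of strings. `dedupe-cells` is a hypothetical name, and this is not the DCSV-20 patch; it only assumes rows are vectors of strings.

```clojure
(defn dedupe-cells
  "Returns a stateful transducer over rows (vectors of strings).
  Each distinct string value is kept once in a private pool, and every
  equal cell is replaced by that one pooled object."
  []
  (fn [rf]
    (let [pool (volatile! {})]
      (fn
        ([] (rf))
        ([result] (rf result))
        ([result row]
         (rf result
             (mapv (fn [cell]
                     (if-let [seen (get @pool cell)]
                       seen
                       (do (vswap! pool assoc cell cell)
                           cell)))
                   row)))))))

;; (String. ...) forces distinct objects, as a CSV parser would produce.
(def parsed
  [[(String. "foo") (String. "")]
   [(String. "foo") (String. "")]])

(def compact (into [] (dedupe-cells) parsed))

;; Equal cells are now the same object.
(identical? (get-in compact [0 0]) (get-in compact [1 0]))
```

Using `(sequence (dedupe-cells) rows)` instead of `into` keeps the result lazy. The time/space tradeoff mentioned earlier shows up as one map lookup per cell, and the pool is dropped along with the transducer's state.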

✔️ 8
ghadi19:02:34

I meant, is the problem you’re experiencing occurring during reducing/streaming or full retention of a vec of rows?

andy.fingerhut19:02:28

Me personally -- I am the dreaded 'see a hammer, and a box of nails that other people noticed weren't getting hammered' kinda person in this situation. I don't have any production use of data.csv yet.

ghadi19:02:11

.... walks away slowly

Alex Miller (Clojure team)19:02:56

I, personally, use data.csv all the time (usually streaming), and have never had a problem with this. it is way below my radar.

andy.fingerhut19:02:34

Understood. Makes sense.

andy.fingerhut19:02:08

It is a good insight that this only arises if someone is using data.csv in a way that retains the returned data 'long term', versus transforming it a row at a time.

kulminaator20:02:40

i naively think that most people do what i do when there's a lot of data coming in ... reduce it straight away and avoid keeping it around 😄

zilti20:02:54

Is there a way to have a "heredoc" inside an EDN file that then gets read as string?

ghadi20:02:15

not anything that bypasses EDN string escaping rules

zilti20:02:08

Alternatively, what kind of data formats are there that would allow such a thing? Requirement would be that nothing except the start token has to be escaped, and that token must be unusual, unlike single- or double-quotes or XML tags

ghadi20:02:56

you can use a tagged reader in EDN that refers to another file, and read in with your own string interp

ghadi20:02:37

{:some :edn
 :other  #heredoc {:file "stuff.template" :params :whatever}}

ghadi20:02:57

then install a reader that slurps the file and interpolates

ghadi20:02:04

but not in situ
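
A rough sketch of the tagged-reader approach described above, using the `:readers` option of `clojure.edn/read-string`. The `{{name}}` placeholder syntax, the `read-heredoc` and `read-config` names, and treating `:params` as a map are all assumptions made for illustration; the message above leaves the interpolation scheme open.

```clojure
(require '[clojure.edn :as edn]
         '[clojure.string :as str])

(defn read-heredoc
  "Reader for the #heredoc tag: slurps :file and replaces each {{key}}
  placeholder with the matching value from :params."
  [{:keys [file params]}]
  (reduce-kv (fn [text k v]
               (str/replace text (str "{{" (name k) "}}") (str v)))
             (slurp file)
             params))

(defn read-config
  "Reads EDN text with the #heredoc tag installed."
  [edn-text]
  (edn/read-string {:readers {'heredoc read-heredoc}} edn-text))

;; Usage, with a throwaway template file:
(def template (java.io.File/createTempFile "stuff" ".template"))
(spit template "<p class=\"note\">Hello, {{who}}!</p>")

(def config
  (read-config
    (str "{:some :edn :other #heredoc {:file " (pr-str (.getPath template))
         " :params {:who \"zilti\"}}}")))

(:other config)
```

The template file needs no escaping at all, since EDN string rules never apply to its contents.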

zilti20:02:05

Hm, well... ideally it'd all be in one file. Of course I could hack together a "pre-parser" that reads the EDN file as text, splits it by the heredoc and... well... in the end I basically have a new format ^^

zilti20:02:09

So yea... it seems to be hard to find such a format. EDN, XML, YAML, INI all fail

zilti20:02:32

Maybe the world does need yet another format for once

andy.fingerhut21:02:35

When you say "and that token must be unusual, unlike single- or double-quotes or XML tags", are you saying that XML tags are 'usual', because the text you might want inside the heredoc has XML strings as an expected common use case you have in mind?

zilti21:02:50

There might be HTML inside the heredoc, and as we know HTML can be nasty

andy.fingerhut21:02:01

If 'yes', then note that anything you come up with, if you want it to be able to nest, if it becomes popular, then has new common strings that you need to be able to 'quote' without escaping.

zilti21:02:46

That's the sweet thing with heredocs though, you define the token that is used as delimiter yourself

andy.fingerhut21:02:16

Are you saying you'd be happy doing a linear time scan through the heredoc contents to calculate such a delimiter string?

🔎 4
bfabry21:02:52

that's the sweet thing with external files too, as the delimiter is something you would never put inside a string 🙂

zilti21:02:19

The not-so-sweet thing with external files is though that they are exactly that - external files

andy.fingerhut21:02:00

That would guarantee it. You can also use randomly generated N-bit strings in hex/whatever as delimiters, and hope (with pretty good ways to calculate how likely you are to accidentally collide)
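
To make the two ideas concrete, a small sketch that generates a random delimiter and also scans the content to confirm it is absent. The function names and the `EOF-` prefix are hypothetical, and the wrapped layout (delimiter line, content, delimiter line) is only one possible convention.

```clojure
(require '[clojure.string :as str])

(defn fresh-delimiter
  "Returns a random delimiter that does not occur in content.
  The includes? check is the linear scan; the UUID makes a retry
  extremely unlikely."
  [content]
  (loop []
    (let [candidate (str "EOF-" (java.util.UUID/randomUUID))]
      (if (str/includes? content candidate)
        (recur)
        candidate))))

(defn wrap-heredoc
  "Surrounds content with a delimiter line before and after."
  [content]
  (let [d (fresh-delimiter content)]
    (str d "\n" content "\n" d)))

(defn unwrap-heredoc
  "Recovers the content: the first line names the delimiter, and the
  body runs up to the final newline-plus-delimiter."
  [text]
  (let [nl    (str/index-of text "\n")
        delim (subs text 0 nl)
        end   (str/last-index-of text (str "\n" delim))]
    (subs text (inc nl) end)))

(def html "<a href=\"x\">'quotes' & %%% everywhere</a>\nsecond line")

(= html (unwrap-heredoc (wrap-heredoc html)))
```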

zilti21:02:52

In my case I simply know delimiter tokens that are guaranteed not to be in the contents. Everything that's too insane to be used in normal text and/or HTML. Which means, triple symbols are already enough. Something like %%% would be sufficient, e.g.

andy.fingerhut21:02:17

And will someone down the line try to nest these things inside each other? 🙂

andy.fingerhut21:02:32

I ask this, not to dissuade you from using a good quick engineering solution for your situation, which is very likely sufficient. I mention it mainly to show one issue why a new public format intended for general use might be a bit tricky.

ghadi21:02:52

sounds like you're reinventing all the terrible parts of YAML

ghadi21:02:10

context sensitive escapes, etc.

zilti21:02:43

No, but even if - nesting heredocs with custom-defined delimiters is not a problem. Since the heredoc itself doesn't get interpreted, and would have to be sent to an interpreter manually later on

ghadi21:02:19

interpreter == EDN tag handler

andy.fingerhut21:02:56

I'm simply pointing out that using %%% as a delimiter works for one level of heredoc. If you use that at one level, you can't use that same one for nested heretics. ("heretics" is Slack or maybe macOS autocorrect for "heredocs")

zilti21:02:17

Yea, but why would anyone do that

andy.fingerhut21:02:06

When something is proposed as general purpose, people sometimes imagine uses the creator did not imagine or intend.

andy.fingerhut21:02:21

If it is not general purpose, just say "be very cautious, or even better do not ever, nest MyCoolHereDocs inside of each other" and Bob's your Uncle.

zilti21:02:36

If you're afraid whoever uses your data structure wants to nest something inside of it, just use a more unthinkable delimiter. Autogenerate one using a password generator, if it must be

andy.fingerhut21:02:57

Already mentioned above 🙂 (in my message mentioning "randomly generated N-bit strings")

zilti21:02:26

So yea, I guess no such format exists?

andy.fingerhut21:02:43

It is far more common to have a notion of a string inside of some larger file, where double-quotes and some other characters must be escaped.

andy.fingerhut21:02:24

The idea of having a custom start-delimiter for a string explicitly mentioned at the beginning and end I've seen mentioned in some programming language. Rust, maybe?

isak21:02:56

sounds like ruby object notation would work, but it's not a thing outside ruby

andy.fingerhut21:02:48

Escaping of a known fixed delimiter character is pretty easy to get right, and define in a spec.

zilti21:02:18

Many languages have it. Rust, PHP, Perl, Scheme, Bash, ...

andy.fingerhut21:02:13

In a programming language context, it is expected that a person is responsible for making sure the delimiter does not appear within the string. (or a person writing a program that generates code in that language, which amounts to the same thing -- the person is responsible for making sure the delimiter does not appear in the body of the string). I think people could quickly tell you whether any of the most commonly used data formats like YAML, XML, etc. have such a feature, but there are so many uncommonly used ones that no one here has even heard of, that I wouldn't even know if there is a list of 'all' of them.

Parenoid21:02:40

what's the quick and dirty way to get a deps include for deps.edn... clojars has the lein version (I can figure it out from that but just wondering if there's a faster way to get it)... on a side note, is there any emacs way to look it up or even (this would be awesome) a cli or emacs way to find the latest version or versions of a library and have it pasted into deps.edn automagically?

Parenoid21:02:16

of course, the holy grail is a way to include libraries without a clj restart, but that's another topic (but if this is possible now let me know!).

zilti21:02:16

The latter is possible, yes

zilti21:02:27

...and even that is possible ^^

Parenoid21:02:45

was standing but finding a chair.

Parenoid21:02:02

legs are shaky all of a sudden.

zilti21:02:20

For the version updates in deps.edn I use this alias: :outdated {:extra-deps {olical/depot {:mvn/version "1.8.4"}} :main-opts ["-m" "depot.outdated.main" "-a" "outdated" "--update"]}
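
Laid out inside a deps.edn file, the alias from the message above would sit under `:aliases` roughly like this (the surrounding map is the only addition):

```clojure
;; deps.edn (fragment)
{:aliases
 {:outdated
  {:extra-deps {olical/depot {:mvn/version "1.8.4"}}
   :main-opts  ["-m" "depot.outdated.main" "-a" "outdated" "--update"]}}}
```

It would presumably be run as `clj -A:outdated` with the CLI of the time; newer CLI versions use `-M:outdated` for aliases that carry `:main-opts`.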

zilti21:02:04

As for adding libraries without a clj restart, there is a library for that, but I never used it, and I forgot the name

Parenoid21:02:50

the version update string is great, though.

Parenoid21:02:31

aha... wonderful!

Parenoid21:02:31

this is fantastic... and pomegranate is current seeming.

Parenoid21:02:19

It has always baffled me that we don't have an npm install --save for clojure. But I should stop whining and try to create it (though I lack the skills).

zilti21:02:10

What does npm install --save do?

Parenoid21:02:16

well, it would really be, say, npm install <some-dependency> --save and it would 1. find the dependency online, if no version is specified it grabs the latest 2. add the latest version to the includes in package.json (the js version of deps.edn or project.clj) 3. install the dependency in the local dependencies in your project.

Parenoid21:02:53

the --save is what adds it to package.json, otherwise it just pulls it down into the project, handy if you just want to try it out.

Parenoid21:02:36

such a utility is a really basic expectation of coders these days, but I find myself hunting around clojars and so on for every dependency I need.

Parenoid21:02:28

which is ultra antiquated, and I know for a fact that folks checking out clojure/clojurescript have been flummoxed on finding out this is missing.

seancorfield21:02:24

The add-lib branch of tools.deps.alpha itself lets you add libraries on the fly to a running REPL.

Parenoid21:02:40

oh fantastic!!

Parenoid21:02:27

I remember there was a cli thing called "plz" or something that was like npm install but it wasn't maintained.

zilti21:02:07

Hm that find-deps seems a bit dead though

Parenoid21:02:20

yeah, looking at it now.

Parenoid21:02:44

actually, seeing activity a couple weeks ago and so on.

zilti21:02:22

What's the difference to pomegranate?

dominicm21:02:39

Other than #7 for find-deps, it's pretty much OK as is.

zilti21:02:07

I think the whole npm install <some-dependency> --save is not much of an advantage though, is it? You have to know the package name anyway beforehand, and when you do, you usually also have a version. And then, is it really simpler to run a shell command over just copying the version string?

Parenoid21:02:31

it is absolutely an advantage for me (and apparently 1000's of js devs in their world)... the key is you just remember the package (say, 'enlive') and definitely not the version... it pulls down the latest.

Parenoid21:02:58

if you want a specific version, you type enlive@<version> or whatever.

Parenoid21:02:09

of course, you can list available packages right at the cli.

zilti21:02:32

Well then add enlive {:mvn/version "0.0.1"} and run the update command from above (haven't tried, but shouldn't matter that the version number you typed in is likely invalid)

zilti21:02:47

Package search though, true, that would be handy

seancorfield21:02:59

enlive {:mvn/version "RELEASE"} will get you the newest release -- but that is frowned on in the JVM world because it leads to non-repeatable builds (because you might get a different version later on)

zilti21:02:23

enlive is dead anyway

Parenoid21:02:34

which makes the npm thing great... as it pulls down the latest and hard codes that into package.json.

seancorfield21:02:37

A lot of JVM folks look at the npm world with horror over how cavalier JS devs seem to be about stability 🙂

Parenoid21:02:44

enlive was a terrible example!

zilti21:02:06

Yea as I said... run the update alias I gave you above, and you have the same

Parenoid21:02:04

well, the latest version thing is a secondary concern... mainly, I want, while I am in the flow of coding and realize I need some dep, to be able to add it to my project and roll on. without going into my browser to find it.

Parenoid21:02:53

there was also an emacs thing I used, M-x clojars or something, that was great, but no longer works.

Parenoid21:02:15

a cli approach would be editor agnostic and better.

Parenoid21:02:39

but this is, I think, a culture mismatch between perhaps the often-java-world-veteran Clojure dev and the new kids on the block. for folks who toiled through Java, looking up a dep on the web is trivial, but it's an annoyance for folks who've cut their teeth on js.

Parenoid21:02:30

but I'm grateful for all the links, and hopefully the tools.deps and pomegranate thing will help.

Parenoid22:02:09

I can't help saying that making Clojure as easy as possible, especially around dependencies and all that, would be a good strategy for adoption. but, as mentioned, something for me to try (though it's over my head knowledge-wise now).

zilti22:02:24

The thing is that right now there are a whole bunch of build tools being used. Sure, deps.edn is gaining momentum, but Leiningen is still the most popular, and Boot is quite popular as well. And quite a few people use Gradle, or Maven. Each have different file formats.

Parenoid22:02:36

yeah, the options are awesome... but the truth is that Clojure's many options thing is crippling adoption... I mean, someone looking to switch to clj/cljs is now going to start into it and find 8 options.

Parenoid22:02:45

it's a paradox.

Parenoid22:02:02

clj? boot? lein? etc.

Parenoid22:02:21

to Clojurists, options are the killer feature. but it makes starting heinous.

seancorfield22:02:28

Adoption isn't really a goal for Clojure tho'...

Parenoid22:02:38

well, that's clear online.

seancorfield22:02:57

It's intended to be "simple", rather than "easy" 🙂

Parenoid22:02:13

where have I heard that? seems a bit familiar. ;-D

Parenoid22:02:31

I like Clojure, that's for sure.

Michael J Dorian22:02:06

I felt like lein was a pretty solid and well recommended starting place, and grabbing those strings from github isn't a concern since I already had to go to github in my search for docs. Not saying things couldn't be better, just that it wasn't a barrier for me in particular

zilti22:02:52

The answer nowadays should be "use deps.edn". Boot is just awful for beginners, the docs are abysmal. Leiningen still has the best docs though, better than deps.edn, and is just as easy

Parenoid22:02:10

boot seems like it may have lost traction.

Parenoid22:02:38

I also started deducing it was more appealing for vim/fireplace users...

seancorfield22:02:42

lein is certainly easy. I'm not sure that boot ever really had much traction. I loved it. I switched from lein to boot at work in late 2015 and was very happy about that.

Parenoid22:02:01

so you use boot, gotcha.

zilti22:02:04

I liked and used Boot for quite a while, but my god, is it cumbersome to find out how to do a certain thing

seancorfield22:02:08

(we switched to the CLI/`deps.edn` stuff pretty quickly after it appeared)

Parenoid22:02:22

and that is what I'm trying to do.

zilti22:02:27

Really, yea, I switched to deps.edn and Makefiles ^^

zilti22:02:53

In the end, all I used Boot for was to create a glorified Makefile. Might as well use the original.

Michael J Dorian22:02:05

So deps.edn is the up and coming option? Does it do anything lein doesn't or does it come down to preferences?

Michael J Dorian22:02:43

Just what I wanted, thanks!

zilti22:02:25

All I miss is some kind of watch feature that automatically watches for file changes and then does things I tell it to

Parenoid22:02:45

well, shadow-cljs has some features like that.

Parenoid22:02:52

for web pages, anyway.

zilti22:02:10

I mean, run stuff that goes beyond code reloading, that's a no-brainer in Lisp world 😛 Though it helps that CIDER has a feature to automatically run the tests upon file reload

Parenoid22:02:24

ah, gotcha.

seancorfield22:02:29

@zilti I really don't like file watchers and reload-based workflows -- and I really find that I don't need them. I think a lot of it comes down to your REPL-based workflow. I eval every single change, as I make it, and can re-run tests via a hot key in my editor easily enough.

👍 4
g22:02:38

how do you get around things like request handlers? or do you always define them by value and not var?

seancorfield22:02:27

Use #' so they are passed as Vars. That way they can be updated while the program is running.
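
A self-contained illustration of the difference, with `start-server` as a made-up stand-in for whatever long-running component holds on to the handler (a web server, a router, and so on):

```clojure
(defn handler [_request] {:status 200 :body "v1"})

(defn start-server
  "Stand-in for a server's start function: it captures whatever
  handler it is given, once, like a long-running server would."
  [h]
  (fn [request] (h request)))

(def by-value (start-server handler))    ; captures the current fn object
(def by-var   (start-server #'handler))  ; captures the Var itself

;; Re-evaluate the handler, as one would from the REPL.
(defn handler [_request] {:status 200 :body "v2"})

(:body (by-value {})) ; still the old function
(:body (by-var {}))   ; sees the new definition
```

Calling a Var looks up its current value on every call, which is what lets the running program pick up the redefinition.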

✔️ 4
g22:02:32

hey everyone, i have a test suite that leverages with-redefs to mock functionality of tangential machinery but i’d like to move toward making it parallelizable. the things that i’m binding over are not dynamic, is there some other way i can pull this off?
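
For context, `with-redefs` temporarily changes a Var's root value, which every thread sees, so concurrent tests can trip over each other's mocks. One possible direction, sketched here as an assumption rather than anything proposed in the channel, is to pass the collaborator in as an argument so each test supplies its own stub. `fetch-user` and `greeting` are hypothetical names.

```clojure
(defn fetch-user
  "Hypothetical 'real' collaborator; tests should never reach it."
  [id]
  (throw (ex-info "real lookup attempted" {:id id})))

(defn greeting
  "Takes the lookup function as an argument; the one-arg arity
  defaults to the real one."
  ([id] (greeting fetch-user id))
  ([lookup id] (str "Hello, " (:name (lookup id)))))

;; Each test supplies its own stub; no Var root is touched.
(greeting (fn [_id] {:name "Ada"}) 1)
(greeting (fn [_id] {:name "Grace"}) 2)
```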

zilti22:02:04

@seancorfield yea I used it for a few niche things, like recompiling Garden CSS upon file save, things like that

seancorfield22:02:52

I don't do any front end stuff -- I might feel differently if I had to deal with CSS "compilation" etc 🙂

zilti22:02:56

I try to avoid it whenever possible. If frontend, I do JavaFX stuff, and there, the REPL works just fine.

zilti22:02:29

Right now the only work-related thing I have is a web crawling infrastructure that crawls webpages (and the individual crawlers are written in a simple DSL)

Parenoid22:02:55

reaver is a great library for scraping

Parenoid22:02:58

built on jsoup

seancorfield22:02:05

We have about 90,000 lines of backend Clojure at work.

seancorfield22:02:19

No ClojureScript. Our front end is JS with React etc.

Parenoid22:02:23

but your dsl may be more advanced than reaver, for sure.

Parenoid22:02:47

@seancorfield not tempted by reagent/re-frame?

seancorfield22:02:43

We looked at cljs back in late 2014 I think it was. We built a proof of concept in Om, then rewrote it in Reagent, and we liked that a lot. But cljs tooling was very fragile and hard to use back then, and there were a lot of annoying differences between cljs and clj. So we decided to build our front end with JS in early 2015.

seancorfield22:02:11

I think if we were starting over today, we might try to use cljs instead -- it's matured a lot in the last five years.

zilti22:02:00

Yea, it isn't limited to JSoup, it also crawls JSON and XML. I originally also had it do crawling through a headless browser, but I figured I might as well just load the page in the embedded browser if it needs JS, then just extract the source and parse that one with JSoup

Parenoid22:02:23

gotcha... was just looking at sparkledriver for js stuff.

zilti22:02:56

Oh, around JFX WebKit, interesting... I was/am using Etaoin

zilti22:02:25

"Because of changes to the underlying JavaFX libraries, this library currently works with Java 1.8! It will fail immediately on more modern JVMs." sigh

zilti22:02:46

But yea, I don't feel like I'm missing out on anything by simply using a headless Firefox via Etaoin. Does all I could ever have imagined. (Okay, almost.)

Parenoid22:02:50

didn't know about Etaoin... I'll take a look.

Parenoid22:02:30

is this the predecessor to what's in tools.deps? https://github.com/hagmonk/find-deps

seancorfield22:02:14

No. It's built to use with tools.deps/CLI.

seancorfield22:02:04

See https://github.com/clojure/tools.deps.alpha/wiki/Tools for a long list of tools built for CLI/`deps.edn`

Parenoid22:02:38

well, dang. that's right up the alley of everything I've been babbling on about. https://github.com/clojure/tools.deps.alpha/wiki/Tools#deps-management

zilti22:02:59

Hmmm Meyvn looks interesting... At least the concept does. But it seems like in the end, it just slaps yet another file on top of the pile

seancorfield22:02:29

We use depstar and test-runner heavily at work, and I use deps-deploy with all my non-work projects.

zilti22:02:49

Can test-runner generate TAP compatible output?

dchelimsky23:02:39

@zilti clojure.test doesn't really have any facilities for formatting output beyond string messages

dchelimsky23:02:17

Oh, wait. I stand corrected.

dchelimsky23:02:20

> Generic reporting function, may be overridden to plug in different report formats (e.g., TAP, JUnit).

dchelimsky23:02:42

Unfortunately the referenced test_is.clj file is no longer on master.
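
As a sketch of overriding the report function for TAP: Clojure itself ships a `clojure.test.tap` namespace whose `with-tap-output` macro rebinds `clojure.test/report` to a TAP reporter. The test name below is made up, and this shows plain clojure.test only, not test-runner.

```clojure
(require '[clojure.test :as t :refer [deftest is]]
         '[clojure.test.tap :as tap])

(deftest addition-works
  (is (= 4 (+ 2 2))))

;; with-tap-output rebinds clojure.test/report to the TAP reporter
;; for the duration of the body.
(tap/with-tap-output
  (t/run-tests))
```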

plexus09:02:45

Note that kaocha has a TAP reporter as well.

zilti13:02:42

Yes, kaocha is what I am using 🙂

zilti22:02:12

I'm using Kaocha so far

seancorfield23:02:35

@zilti No idea. Never used TAP stuff

seancorfield23:02:36

test-runner is just a way to run clojure.test stuff from the command-line (like lein test).