This page is not created by, affiliated with, or supported by Slack Technologies, Inc.
2020-02-24
Are there plans to add clojure to https://sdkman.io/?
https://clojure.atlassian.net/browse/TDEPS-113 Somewhat related: https://clojure.atlassian.net/browse/TDEPS-36

I'm not working on it, but patches welcome
If you use a library like data.csv to read CSV files, and probably the other main Clojure libraries for reading CSV files (but I haven't confirmed for those yet), if there are many 'cell values' that are the same as each other, e.g. all the empty string, or many cell values equal to the string "foo", then a separate Java string object is returned for each one, which has a memory overhead of about 40 or 48 bytes per string/csv-cell. That overhead can be several times larger than the CSV file itself, if most cells are short and there are many duplicates. Would folks be interested in an option for a time/space tradeoff that looked for duplicates and returned references to identical Java strings in memory when strings were equal in the input file?
For what it’s worth I have noticed this too and I think it is a real problem. Especially where you have datasets containing repeated codes, e.g. statistical observation data with codes like `MALE`, `FEMALE` keyed against dimensions.
However, depending on the application, I think it’s trivially solved by essentially mapping `#(.intern %)` over the columns that contain such codes; or for cases where you want to manage the pool's lifecycle, or don’t trust the data, building that pool yourself in a hashmap.
I’m really not sure the library should incorporate this as a feature; however I have seen people complain about memory etc of handling CSV in clojure; where they’ve just consumed a large sequence into memory without any thought… so it might be worth mentioning this in the README?
ahh just seen your article posted below mentions interning too
That may be a better question for the #data-science channel, now that I've typed it in.
Since data.csv returns a lazy sequence of rows, it seems the 'deduplication' could be done in a separate optional step from data.csv itself, which might be a useful way to provide such functionality as an option.
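A minimal sketch of such a separate step, assuming rows arrive as a sequence of vectors of strings (the shape data.csv's `read-csv` returns). `dedupe-strings` is a hypothetical helper name, not part of data.csv. It keeps its own pool in a `java.util.HashMap` instead of calling `.intern`, so the pool becomes garbage along with the result.

```clojure
(defn dedupe-strings
  "Returns a lazy seq of rows in which equal cell values share one
  String object. Hypothetical helper, not part of data.csv."
  [rows]
  (let [pool (java.util.HashMap.)]
    (map (fn [row]
           (mapv (fn [cell]
                   ;; Reuse the pooled instance if an equal string was seen,
                   ;; otherwise remember this one.
                   (or (.get pool cell)
                       (do (.put pool cell cell) cell)))
                 row))
         rows)))

;; Stand-in for parsed CSV data: (String. "foo") forces distinct objects,
;; much as a CSV parser would produce.
(def rows [[(String. "foo") (String. "")]
           [(String. "foo") (String. "")]])

(def deduped (doall (dedupe-strings rows)))

;; Equal cells in different rows are now the very same object.
(identical? (ffirst deduped) (first (second deduped)))
```

The pool is a plain mutable map, so this sketch is only meant for a sequence consumed from a single thread.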
Java's G1 will automatically dedupe string references like this
Yeah, I found an article mentioning that. Sounds like a nifty option.
there is an enhancement with patch (currently sitting in a support queue) for data.csv for this
It looks like an option. Here is the article I found about it, but haven't tried it out myself: https://itblues.pl/2019/01/02/all-you-need-to-know-about-string-performance-in-java/
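For reference, G1's string deduplication has to be requested with a JVM flag. A sketch of a `deps.edn` alias that turns it on; the alias name `:dedup` is arbitrary. As far as the JVM documentation describes it, this shares the backing arrays of equal strings rather than the `String` objects themselves, so some per-object overhead remains.

```clojure
;; deps.edn (fragment)
{:aliases
 {:dedup {:jvm-opts ["-XX:+UseG1GC" "-XX:+UseStringDeduplication"]}}}
```

A REPL started with `clj -A:dedup` would then run with both flags.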
I'm pretty conflicted about making that be a default thing for data.csv - it's a lot of complexity for a problem most people don't have
Looking for review and/or performance measurements of the data.csv patch?
or perhaps a good way to make it an option for data.csv users rather than always-on?
Is the support queue you mentioned containing an enhancement patch outside of Clojure JIRA? I see several JIRA tickets for data.csv, and could have missed the one you talked about while skimming, but none look like what you described.
it's in jira, but it's not publicly visible
it's weird that it's there at all but I can't decide what to do with it (it's the only thing in the support queue)
If you have a way to make it public, I might take a look at it and see if there is a way to package it as an option, default off.
I moved it here https://clojure.atlassian.net/browse/DCSV-20 - note the patch is not from a contributor, so I'm not going to look at it
would need either them to sign or clean room impl
So, perhaps a clean room implementation, where it is off using the currently documented API calls, but on with either some new call (or a new arity to an existing call), with some performance measurements of time and memory use using new vs. current API, from a signed-up contributor, might be more interesting?
this is solution stuff, not problem stuff
Sure. One of the problem statements is pretty clear, though: In a data dependent fashion, some (many?) CSV files with repeated cell values use exorbitant amounts of memory, and hammer X can reduce that significantly.
Granted I am putting the solution into that problem statement 🙂
I just have no idea how to state a problem that is independent of at least a general statement of the approach used to improve upon it.
The library enables both, so any changes to the library should consider both, yes?
If one is processing a row at a time, transforming it into some other data that is not strings, e.g. into numbers, then you will not hit this particular issue, because all strings returned by the lib would become garbage soon.
So you would be happy with the existing behavior, and wouldn't go looking for an option to reduce memory utilization.
So likely you would only consider using such an option if you were retaining all or large fractions of the rows.
As I hinted at earlier, such an optimization can be done completely outside of data.csv code, on the returned lazy sequence. If the answer is thus "make each user of data.csv that wants this option reimplement it, or discover library foo that does it as an add on", then that is certainly a choice.
I meant, is the problem you’re experiencing occurring during reducing/streaming or full retention of a vec of rows?
Me personally -- I am the dreaded 'see a hammer, and a box of nails that other people noticed weren't getting hammered' kinda person in this situation. I don't have any production use of data.csv yet.
I, personally, use data.csv all the time (usually streaming), and have never had a problem with this. it is way below my radar.
Understood. Makes sense.
It is a good insight that this only arises if someone is using data.csv in a way that retains the returned data 'long term', versus transforming it a row at a time.
i naively think that most people do what i do when there's a lot of data coming in ... reduce it straight away and avoid keeping it around 😄
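A small sketch of that streaming style, assuming rows are vectors of strings. `sum-column` and `fake-rows` are made-up names; `fake-rows` only stands in for the lazy sequence a CSV reader would return. Each cell string is parsed and then dropped, so duplicate strings never pile up.

```clojure
(defn fake-rows
  "Stand-in for a lazy seq of parsed CSV rows: [id, numeric-string]."
  [n]
  (map (fn [i] [(str "id-" i) (str i)]) (range n)))

(defn sum-column
  "Sums the numeric column at index idx, one row at a time, without
  retaining the rows."
  [rows idx]
  (reduce (fn [total row]
            (+ total (Long/parseLong ^String (nth row idx))))
          0
          rows))

(sum-column (fake-rows 1000) 1)
```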
Alternatively, what kind of data formats are there that would allow such a thing? Requirement would be that nothing except the start token has to be escaped, and that token must be unusual, unlike single- or double-quotes or XML tags
you can use a tagged reader in EDN that refers to another file, and read in with your own string interp
Hm, well... ideally it'd all be in one file. Of course I could hack together a "pre-parser" that reads the EDN file as text, splits it by the heredoc and... well... in the end I basically have a new format ^^
When you say "and that token must be unusual, unlike single- or double-quotes or XML tags", are you saying that XML tags are 'usual', because the text you might want inside the heredoc has XML strings as an expected common use case you have in mind?
If 'yes', then note that anything you come up with, if you want it to be able to nest, if it becomes popular, then has new common strings that you need to be able to 'quote' without escaping.
That's the sweet thing with heredocs though, you define the token that is used as delimiter yourself
Are you saying you'd be happy doing a linear time scan through the heredoc contents to calculate such a delimiter string?
that's the sweet thing with external files too, as the delimiter is something you would never put inside a string 🙂
The not-so-sweet thing with external files is though that they are exactly that - external files
That would guarantee it. You can also use randomly generated N-bit strings in hex/whatever as delimiters, and hope (with pretty good ways to calculate how likely you are to accidentally collide)
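Both ideas can be combined in a few lines: draw a random hex delimiter, then do the linear scan to confirm it does not occur in the content. This is only a sketch; `random-delimiter` and `safe-delimiter` are hypothetical names.

```clojure
(require '[clojure.string :as str])

(defn random-delimiter
  "Returns a hex string built from n-bytes of secure random data."
  [n-bytes]
  (let [bs (byte-array n-bytes)]
    (.nextBytes (java.security.SecureRandom.) bs)
    (apply str (map (fn [b] (format "%02x" b)) bs))))

(defn safe-delimiter
  "Draws random delimiters until one does not occur in content.
  The includes? check is the linear scan over the heredoc body."
  [content n-bytes]
  (first (remove (fn [d] (str/includes? content d))
                 (repeatedly (fn [] (random-delimiter n-bytes))))))

(def delimiter (safe-delimiter "some <html> with %%% inside" 8))
```

With 8 random bytes the first candidate will almost always pass the scan, so the loop is a safety net rather than the common path.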
In my case I simply know delimiter tokens that are guaranteed not to be in the contents. Everything that's too insane to be used in normal text and/or HTML. Which means, triple symbols are already enough. Something like `%%%` would be sufficient, e.g.
And will someone down the line try to nest these things inside each other? 🙂
I ask this, not to dissuade you from using a good quick engineering solution for your situation, which is very likely sufficient. I mention it mainly to show one issue why a new public format intended for general use might be a bit tricky.
No, but even if - nesting heredocs with custom-defined delimiters is not a problem. Since the heredoc itself doesn't get interpreted, and would have to be sent to an interpreter manually later on
I'm simply pointing out that using %%% as a delimiter works for one level of heredoc. If you use that at one level, you can't use that same one for nested heretics. ("heretics" is Slack or maybe macOS autocorrect for "heredocs")
When something is proposed as general purpose, people sometimes imagine uses the creator did not imagine or intend.
If it is not general purpose, just say "be very cautious, or even better do not ever, nest MyCoolHereDocs inside of each other" and Bob's your Uncle.
If you're afraid whoever uses your data structure wants to nest something inside of it, just use a more unthinkable delimiter. Autogenerate one using a password generator, if it must be
Already mentioned above 🙂 (in my message mentioning "randomly generated N-bit strings")
It is far more common to have a notion of a string inside of some larger file, where double-quotes and some other characters must be escaped.
The idea of having a custom start-delimiter for a string explicitly mentioned at the beginning and end I've seen mentioned in some programming language. Rust, maybe?
Escaping of a known fixed delimiter character is pretty easy to get right, and define in a spec.
In a programming language context, it is expected that a person is responsible for making sure the delimiter does not appear within the string. (or a person writing a program that generates code in that language, which amounts to the same thing -- the person is responsible for making sure the delimiter does not appear in the body of the string). I think people could quickly tell you whether any of the most commonly used data formats like YAML, XML, etc. have such a feature, but there are so many uncommonly used ones that no one here has even heard of, that I wouldn't even know if there is a list of 'all' of them.
what's the quick and dirty way to get a deps include for deps.edn... clojars has the lein version (I can figure it out from that but just wondering if there's a faster way to get it)... on a side note, is there any emacs way to look it up or even (this would be awesome) a cli or emacs way to find the latest version or versions of a library and have it pasted into deps.edn automagically?
of course, the holy grail is a way to include libraries without a clj restart, but that's another topic (but if this is possible now let me know!).
For the version updates in deps.edn I use this alias:
:outdated {:extra-deps {olical/depot {:mvn/version "1.8.4"}}
           :main-opts ["-m" "depot.outdated.main" "-a" "outdated" "--update"]}
As for adding libraries without a clj restart, there is a library for that, but I never used it, and I forgot the name
Ah, here you go: https://github.com/clj-commons/pomegranate
It has always baffled me that we don't have an `npm install --save` for clojure. But I should stop whining and try to create it (though I lack the skills).
well, it would really be, say, `npm install <some-dependency> --save` and it would 1. find the dependency online, if no version is specified it grabs the latest 2. add the latest version to the includes in package.json (the js version of deps.edn or project.clj) 3. install the dependency in the local dependencies in your project.
the `--save` is what adds it to package.json, otherwise it just pulls it down into the project, handy if you just want to try it out.
such a utility is a really basic expectation of coders these days, but I find myself hunting around clojars and so on for every dependency I need.
which is ultra antiquated, and I know for a fact that folks seeing what clojure/clojurescript have been flummoxed on finding out this is missing.
The `add-lib` branch of `tools.deps.alpha` itself lets you add libraries on the fly to a running REPL.
I remember there was a cli thing called "plz" or something that was like `npm install` but it wasn't maintained.
https://github.com/hagmonk/find-deps for deps.edn
I think the whole `npm install <some-dependency> --save` is not much of an advantage though, is it? You have to know the package name anyway beforehand, and when you do, you usually also have a version. And then, is it really simpler to run a shell command over just copying the version string?
it is absolutely an advantage for me (and apparently 1000's of js devs in their world)... the key is you just remember the package (say, 'enlive') and definitely not the version... it pulls down the latest.
if you want a specific version, you type `enlive@<version>` or whatever.
Well then add `enlive {:mvn/version "0.0.1"}` and run the update command from above (haven't tried, but shouldn't matter that the version number you typed in is likely invalid)
`enlive {:mvn/version "RELEASE"}` will get you the newest release -- but that is frowned on in the JVM world because it leads to non-repeatable builds (because you might get a different version later on)
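To make the contrast concrete, here are the two forms side by side as `deps.edn` fragments. The pinned version number is only an illustration; check Clojars for the current one.

```clojure
;; Floating: resolves to whatever the newest release is at resolution time,
;; so two builds on different days can differ.
{:deps {enlive/enlive {:mvn/version "RELEASE"}}}

;; Pinned: the same artifact every time (version shown is illustrative).
{:deps {enlive/enlive {:mvn/version "1.1.6"}}}
```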
which makes the npm thing great... as it pulls down the latest and hard codes that into package.json.
A lot of JVM folks look at the npm world with horror over how cavalier JS devs seem to be about stability 🙂
well, the latest version thing is a secondary concern... mainly, I want, while I am in the flow of coding and realize I need some dep, to be able to add it to my project and roll on. without going into my browser to find it.
there was also an emacs thing I used, M-x clojars or something, that was great, but no longer works.
but this is, I think, a culture mismatch between perhaps the often-java-world-veteran Clojure dev and the new kids on the block. for folks who toiled through Java looking up a dep on the web is trivial, and an annoyance for folks who've cut their teeth on js.
but I'm grateful for all the links, and hopefully the tools.deps and pomegranate thing will help.
I can't help saying that making Clojure as easy as possible, especially around dependencies and all that, would be a good strategy for adoption. but, as mentioned, something for me to try (though it's over my head knowledge-wise now).
The thing is that right now there are a whole bunch of build tools being used. Sure, `deps.edn` is gaining momentum, but Leiningen is still the most popular, and Boot is quite popular as well. And quite a few people use Gradle, or Maven. Each has a different file format.
yeah, the options are awesome... but the truth is that Clojure's many options thing is crippling adoption... I mean, someone looking to switch in clj/cljs is going now to start into it and find 8 options.
Adoption isn't really a goal for Clojure tho'...
It's intended to be "simple", rather than "easy" 🙂
I felt like lein was a pretty solid and well recommended starting place, and grabbing those strings from github isn't a concern since I already had to go to github in my search for docs. Not saying things couldn't be better, just that it wasn't a barrier for me in particular
The answer nowadays should be "use deps.edn". Boot is just awful for beginners, the docs are abysmal. Leiningen still has the best docs though, better than deps.edn, and is just as easy
`lein` is certainly easy. I'm not sure that `boot` ever really had much traction. I loved it. I switched from `lein` to `boot` at work in late 2015 and was very happy about that.
I liked and used Boot for quite a while, but my god, is it cumbersome to find out how to do a certain thing
(we switched to the CLI/`deps.edn` stuff pretty quickly after it appeared)
In the end, all I used Boot for was to create a glorified Makefile. Might as well use the original.
So deps.edn is the up and coming option? Does it do anything lein doesn't or does it come down to preferences?
More Q&A on that topic available here: https://clojureverse.org/t/is-there-a-sales-pitch-for-switching-to-deps-edn-from-lein-in-2020/5367
Just what I wanted, thanks!
All I miss is some kind of watch feature that automatically watches for file changes and then does things I tell it to
I mean, run stuff that goes beyond code reloading, that's a no-brainer in Lisp world 😛 Though it helps that CIDER has a feature to automatically run the tests upon file reload
@zilti I really don't like file watchers and reload-based workflows -- and I really find that I don't need them. I think a lot of it comes down to your REPL-based workflow. I eval every single change, as I make it, and can re-run tests via a hot key in my editor easily enough.
how do you get around things like request handlers? or do you always define them by value and not var?
Use `#'` so they are passed as Vars. That way they can be updated while the program is running.
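A self-contained sketch of the difference. `start-server` is a made-up stand-in for anything that captures its handler once at startup (a web server, a router); it is not a real library function.

```clojure
(defn handler [request]
  {:status 200 :body "v1"})

;; Hypothetical stand-in for a server that holds on to whatever it was given.
(defn start-server [f]
  (fn [request] (f request)))

(def by-value (start-server handler))   ; captures the current function object
(def by-var   (start-server #'handler)) ; captures the Var, looked up on each call

;; Re-evaluate the handler, as you would from the editor.
(defn handler [request]
  {:status 200 :body "v2"})

(:body (by-value {})) ; still the old function
(:body (by-var {}))   ; sees the new definition
```

This works because a Var is itself callable and dereferences to its current value on every invocation.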
hey everyone, i have a test suite that leverages `with-redefs` to mock functionality of tangential machinery but i’d like to move toward making it parallelizable. the things that i’m binding over are not dynamic, is there some other way i can pull this off?
@seancorfield yea I used it for a few niche things, like recompiling Garden CSS upon file save, things like that
I don't do any front end stuff -- I might feel differently if I had to deal with CSS "compilation" etc 🙂
I try to avoid it whenever possible. If frontend, I do JavaFX stuff, and there, the REPL works just fine.
Right now the only work-related thing I have is a web crawling infrastructure that crawls webpages (and the individual crawlers are written in a simple DSL)
We have about 90,000 lines of backend Clojure at work.
No ClojureScript. Our front end is JS with React etc.
@seancorfield not tempted by reagent/re-frame?
We looked at cljs back in late 2014 I think it was. We built a proof of concept in Om, then rewrote it in Reagent, and we liked that a lot. But cljs tooling was very fragile and hard to use back then, and there were a lot of annoying differences between cljs and clj. So we decided to build our front end with JS in early 2015.
I think if we were starting over today, we might try to use cljs instead -- it's matured a lot in the last five years.
Yea, it isn't limited to JSoup, it also crawls JSON and XML. I originally also had it do crawling through a headless browser, but I figured I might as well just load the page in the embedded browser if it needs JS, then just extract the source and parse that one with JSoup
"Because of changes to the underlying JavaFX libraries, this library currently works with Java 1.8! It will fail immediately on more modern JVMs." sigh
But yea, I don't feel like I'm missing out on anything by simply using a headless Firefox via Etaoin. Does all I could ever have imagined. (Okay, almost.)
is this the predecessor to what's in tools.deps? https://github.com/hagmonk/find-deps
No. It's built to use with tools.deps/CLI.
See https://github.com/clojure/tools.deps.alpha/wiki/Tools for a long list of tools built for CLI/`deps.edn`
well, dang. that's right up the alley of everything I've been babbling on about. https://github.com/clojure/tools.deps.alpha/wiki/Tools#deps-management
Hmmm Meyvn looks interesting... At least the concept does. But it seems like in the end, it just slaps yet another file on top of the pile
We use `depstar` and `test-runner` heavily at work, and I use `deps-deploy` with all my non-work projects.
@zilti clojure.test doesn't really have any facilities for formatting output beyond string messages
Oh, wait. I stand corrected.
> Generic reporting function, may be overridden to plug in different report formats (e.g., TAP, JUnit).
Unfortunately the referenced `test_is.clj` file is no longer on `master`.
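One concrete instance of that overridable report function is `clojure.test.tap`, which ships with Clojure. A small sketch; the test named `addition` is made up purely so there is something to report on.

```clojure
(require '[clojure.test :as t]
         '[clojure.test.tap :refer [with-tap-output]])

;; Hypothetical test, only here for illustration.
(t/deftest addition
  (t/is (= 4 (+ 2 2))))

;; with-tap-output rebinds clojure.test/report to a TAP-producing
;; function for the duration of its body; run-tests still returns
;; the usual summary map.
(def summary (with-tap-output (t/run-tests)))
```

`clojure.test.junit` offers a similar `with-junit-output` wrapper for JUnit-style XML.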
@zilti No idea. Never used TAP stuff
`test-runner` is just a way to run `clojure.test` stuff from the command-line (like `lein test`).