This page is not created by, affiliated with, or supported by Slack Technologies, Inc.
2022-01-30
What's the best way to deal with filesystem resources in cljc code? I've got a medium-sized word list that I'd like to include in a cljc library, but now when I try to use it in a project that uses an uberjar for deployment, I get a Method code too large! error while trying to compile.
Before, when it was just clj code I was slurping it in via io/resource, and that wasn't a problem. I decided to just def it inside a data namespace.
You can still use something like io/resource in CLJC, but it will have to be in a macro on the CLJS side so it gets executed at compile time (it will of course use compile time classpath as well), so it might be a hassle given that you can't use it as a macro on the CLJ side since it would lead to the same error.
Another approach would be to split that string into a bunch of defs or, since that might not work as well, a bunch of defns, and then make a new def by combining them. That way, the large string will be split into smaller ones, each one in its own method.
To ease the second approach, especially if the chance of you changing that string is high, you can write a macro that reads a file and generates a bunch of defns where each contains a string of, say, at most 10kB, and an extra def that uses them to come up with the resulting string.
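A rough sketch of that macro idea, in case it helps. The names chunk-string and def-chunked are made up for illustration, the file is read with slurp at macro-expansion time, and the chunk size counts characters rather than bytes, so treat all of that as assumptions rather than a tested recipe:

```clojure
(require '[clojure.java.io :as io])

(defn chunk-string
  "Splits s into consecutive substrings of at most n characters (n > 0)."
  [s n]
  (mapv (fn [start] (subs s start (min (count s) (+ start n))))
        (range 0 (count s) n)))

(defmacro def-chunked
  "Hypothetical helper: reads the file at `path` during macro expansion,
  emits one zero-arg defn per chunk, then a def that joins the chunks.
  `path` and `chunk-size` must be literals, since they are used at expansion time."
  [sym path chunk-size]
  (let [chunks   (chunk-string (slurp (io/file path)) chunk-size)
        fn-names (mapv #(symbol (str sym "-chunk-" %)) (range (count chunks)))]
    `(do
       ;; one small method per chunk
       ~@(map (fn [f c] `(defn ~f [] ~c)) fn-names chunks)
       ;; the combined string is built at load time by calling each chunk fn
       (def ~sym (str ~@(map list fn-names))))))

;; Usage (assumes a words.txt file relative to the compiler's working directory):
;; (def-chunked words "words.txt" 10000)
;; (def word-list (clojure.string/split-lines words))
```

Each chunk ends up as a constant in its own tiny function, so no single method has to carry the whole file.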
@U028BUU1P3R can you provide a brief example of the code that was failing?
I think that I've seen this sort of problem before; the tldr is that one should favor thin code that points to big data, over large code that maps 1:1 to data.
Method code too large! says it all - Java classes (which is what all clj code ultimately compiles to) aren't meant to be arbitrarily large.
Sure. I made myself a cli wordle clone to satisfy my wordle-cravings. I've got two wordlists in separate files - one is 14K, and the other is 67K. I wrote it in clj but I have vague intentions of making a web version so my family can play together, so I changed it to cljc.
;; old way
;(def targets (str/split-lines (slurp (io/resource "targets.txt"))))
;(def dict (str/split-lines (slurp (io/resource "dict.txt"))))
;; cljc way - this blows up with Method code too large!
(def targets ["word1" "word2" ,,, "word2500"])
(def dict ["word1" ,,, "word12972"])
I've just changed it back to the slurp for now, but I think I'll take a swing at the macro approach sometime soon.
Yeah the old way is the one way to go in general. Normally I'd add a delay so that one avoids the compile-time side-effect - those are yucky in general :)
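For reference, the delay variant could look roughly like this on the JVM side. read-lines is a made-up helper name, and the resource names are the ones from the snippet above:

```clojure
(require '[clojure.java.io :as io]
         '[clojure.string :as str])

(defn read-lines
  "Returns the lines of a classpath resource, or nil if the resource is missing."
  [resource-name]
  (some-> (io/resource resource-name) slurp str/split-lines))

;; Nothing is read when the namespace loads; the first deref does the work
;; and the result is cached for later derefs.
(def targets (delay (read-lines "targets.txt")))
(def dict    (delay (read-lines "dict.txt")))

;; Callers deref where they need the data:
;; (count @dict)
```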
For js compat, maybe I'd look into how to "slurp" from cljs (which surely varies from web to cli) and make a protocol, with jvm and (possibly two) cljs implementations.
I’ve been looking around but can’t seem to find anything - what characters are valid namespace characters? How can you print the file path that will be looked for with that namespace?
I’m setting up a directory structure based routing system, and I’m trying to figure out how to name variables (i.e. /photos/:photo-id would map to routes.photos.[$+*-]photo-id). I know I can at least prefix it with a hyphen that will look for an underscore, but I’m curious about the others ($, *, and +).
Is there a clojure function that functions like (ns->filepath 'routes.$photo-id)?
There is, but it's private, clojure.core/root-resource. And it's so small - you basically already know what it does.
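A small sketch of what that boils down to. ns->filepath here is a hypothetical re-implementation, not something that ships with Clojure; it only swaps hyphens for underscores and dots for slashes, so a character like $ passes through untouched:

```clojure
(require '[clojure.string :as str])

(defn ns->filepath
  "Hypothetical helper: classpath-relative path (without extension)
  that a namespace symbol maps to."
  [ns-sym]
  (-> (name ns-sym)
      (str/replace "-" "_")
      (str/replace "." "/")))

(ns->filepath 'routes.$photo-id)        ;; => "routes/$photo_id"
(ns->filepath 'routes.photos.-photo-id) ;; => "routes/photos/_photo_id"

;; The private var can also be called through its var, at your own risk;
;; it returns the same path with a leading slash:
;; (#'clojure.core/root-resource 'routes.$photo-id)
```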
To what extent is keyword interning a serious danger with respect to memory usage if you intern keywords that result from reading EDN/JSON from some third party?
Yep, this message and the following two: https://clojurians.slack.com/archives/C03S1KBA2/p1643147571081700
ok, I'm asking because I noticed that when using keywords as keys in a map, the lookup becomes much quicker than when using symbols
but this danger is already present if you evaluate random programs, as they can contain random keywords, so maybe it's not a huge deal
so probably: if the user was malicious, he/she could cause an OOM, but in normal usage, it shouldn't be a problem
What you want to do is prevent the creation of too many keywords. Any mechanism that keeps track of the number of keywords created and throttles or forces the cleanup of old keywords can be set up.
You can set a policy which biases known keywords, dig in the implementation of Keyword to see how you can force the cache to evict unreferenced keywords, and generally wrap keyword in something which will bound its behavior in that context
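One cheap way to wrap keyword along those lines, as a sketch: only hand back keywords that are already interned (for example because they appear literally in loaded code) and leave everything else as a string. safe-keyword and known-keys are made-up names, and this assumes it's acceptable for unknown keys to stay strings:

```clojure
;; Keywords that appear literally in loaded code are already interned.
(def known-keys #{:genome :sample})

(defn safe-keyword
  "Returns the existing keyword for s if one is already interned,
  otherwise returns s unchanged, so untrusted input never creates keywords."
  [s]
  (or (find-keyword s) s))

(safe-keyword "genome")                ;; => :genome
(safe-keyword "some-attacker-key-123") ;; stays a string unless that keyword exists
```

It could be passed as the key function to whatever JSON/EDN reader is in use, where the reader accepts one.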
@UK0810AQ2 I could also convert the symbols (which are a limited set/scope) to some number, or maybe a keyword like :k1 etc instead of :g__auto_001
I notice that a long lookup is much slower than a keyword lookup:
user=> (let [m {:a 1 :b 2 :c 3 'd 6 'x 7 1 1}] (time (dotimes [i 10000000] (.get m 1))))
"Elapsed time: 348.475696 msecs"
nil
user=> (let [m {:a 1 :b 2 :c 3 'd 6 'x 7 1 1}] (time (dotimes [i 10000000] (.get m :a))))
"Elapsed time: 31.343012 msecs"
Seems so!
user=> (let [^java.util.Map m (into {} [[1 2] [3 4] [:b 2]])] (time (dotimes [i 1000000000] (.get m :b))))
"Elapsed time: 3000.653457 msecs"
user=> (let [^java.util.Map m (into (i/int-map) [[1 2] [3 4] [4 2]])] (time (dotimes [i 1000000000] (.get m 4))))
"Elapsed time: 500.608064 msecs"
I wonder if such a thing also exists for CLJS, or if that would even make sense there
It seems in CLJS using an int as a key is faster than a keyword anyway:
cljs.user=> (let [m (into {} [[1 2] [3 4] [:b 2]])] (time (dotimes [i 100000000] (.get m :b))))
"Elapsed time: 2821.658472 msecs"
nil
cljs.user=> (let [m (into {} [[1 2] [3 4] [:b 2]])] (time (dotimes [i 100000000] (.get m 1))))
"Elapsed time: 794.211123 msecs"
@U04V15CAJ Your comparisons are not entirely correct. Those are small maps so they are array maps and not hash maps - the lookup is sequential. When the long is the last item, of course its lookup will be slow.
Try hash-map, the lookup speed should be around the same for all the keys - as long as you don't have some bad case of hash collision.
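To see the difference at the REPL (JVM Clojure), a quick check of the concrete map types, using the same keys as the benchmark above:

```clojure
;; A small literal map is an array map: lookups scan the entries in order.
(def small  {:a 1 :b 2 :c 3 'd 6 'x 7 1 1})

;; hash-map with arguments builds a real hash map, whatever the size.
(def hashed (hash-map :a 1 :b 2 :c 3 'd 6 'x 7 1 1))

(class small)  ;; => clojure.lang.PersistentArrayMap
(class hashed) ;; => clojure.lang.PersistentHashMap

;; Re-running the timing against the hash map should put both keys
;; in the same ballpark (no numbers claimed here):
;; (let [^java.util.Map m hashed] (time (dotimes [i 10000000] (.get m 1))))
;; (let [^java.util.Map m hashed] (time (dotimes [i 10000000] (.get m :a))))
```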
Thanks for having this conversation, it reminded me that I needed to get #farolero to stop interning keywords at runtime. I was able to cut a new release today fixing what would probably have been a pretty hard-to-trace bug a while from now when I deploy farolero to production on something.
@U5NCUG8NR Aha! Are there any performance repercussions when disabling that?
should be about the same on unwind because I've exchanged an indirect keyword identity comparison (it was still using = at a top level) for a boxed long comparison, but it should be faster on entry: a block that needs to construct a jump target can now do so via a simple fetch_add on an atomic rather than needing to do a gensym, get the string out of it, and then construct a keyword with that and a fixed namespace string.
The gensym one clearly does an atomic fetch_add somewhere in its body anyway, so I'm just going down to only the fetch_add, rather than needing string ops as well
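To illustrate the two strategies being compared (this is not farolero's actual code, just a sketch with made-up names):

```clojure
(import 'java.util.concurrent.atomic.AtomicLong)

;; Counter-based ids: a single atomic increment, nothing gets interned.
(def ^AtomicLong id-counter (AtomicLong.))

(defn fresh-id
  "Returns a process-unique long."
  []
  (.getAndIncrement id-counter))

;; Keyword-based ids: gensym, string conversion, then interning a new keyword
;; on every call, which is the part that grows the keyword table at runtime.
(defn fresh-keyword
  []
  (keyword "example.ns" (str (gensym "g"))))
```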
interesting... I ended up not using int-map yet, since assoc seemed to be way slower, so in total performance would be actually worse than just a regular hash-map.... A hash-map + keywords are in total the fastest for the problem I have, but the interning of keywords still could be a problem for some environments. Maybe I'll make the keywordizing configurable or so. :thinking_face:
making it configurable makes sense when you can know at app development that you're receiving only trusted data. Another thing to consider would be to have your own interning deftype that allows you to have "epochs" where you say "all the keys from this epoch will get GCed together"
If you're doing tons of stuff that needs efficiency through a single request but across requests it won't give much benefit, then maybe just wrap an epoch around your handler, but if you're doing something where there's savings to be made across requests, you could just make it like a ttl cache basically.
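A very rough sketch of the epoch idea, using a plain atom and opaque objects instead of a deftype. new-epoch and intern-in! are made-up names, and the assumption is that keys only need identity comparison within one epoch:

```clojure
(defn new-epoch
  "An epoch is just an atom holding a map from string to key object."
  []
  (atom {}))

(defn intern-in!
  "Returns the canonical key object for string s within the given epoch,
  creating it on first use."
  [epoch s]
  (let [m (swap! epoch (fn [m] (if (contains? m s) m (assoc m s (Object.)))))]
    (get m s)))

;; Wrapping a request handler: every key interned during the request hangs
;; off `epoch`, so they all become garbage together once it is dropped.
;; (defn handler [request]
;;   (let [epoch (new-epoch)]
;;     (process request epoch)))   ; `process` is hypothetical
```

Within an epoch the keys can be compared with identical?, which is where the speed would come from.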
ofc doing it this way brings problems if your users want to get data out by writing literal keywords into their code
Could make a custom associative data structure that just wraps a map basically but compares keywords as equal to the new key type.
But that loses a lot of the performance benefits if you're doing that constantly.
(let me know if my ramblings are useless, I can stop)
So if the intention is to keep things convenient while being performant, I think that means you basically need a global "initial epoch" that would have all the interned items that are literal values in your code, and you could provide an alternative to the get function that would map a keyword to something in this initial epoch at compile time, and then you make interning be a two-step process of first looking things up in the initial epoch and then if it's missing falling through to the non-static epoch, that way you get fast lookup at all your callsites, and construction of keys is also not that much more expensive.
Personally I'd opt for making this alternate get actually be a function and just have an inline function definition that will do the epoch registration. It would mean that depending on if you're using it as an argument to a HOF you'd get different performance characteristics, but it'd be as performant as possible in all cases.
I think that mostly covers the main usecases for "I want to make my own interning", although I'd love to hear if you have any thoughts about holes in this idea. There's probably other good ways that don't go down this route though.
I think they'll be about the same size seeing as boxed numbers have their size dominated by the object header. Although I think being able to use identical? would help performance.
I want to move to mutable arrays eventually anyway. Using keywords was a nice cheap way to get some free speed-ups without changing much
but as it's only safe to intern keywords in environments that aren't long-lived or do not process untrusted input, making a configuration option for this in the meanwhile makes sense I think
yeah, I agree
so with your int map of symbols being converted to a number, what did you do to allow easy lookup from the data without losing the benefits of a simple type like integers for comparison? Were you able to move the mapping from symbol to integer to compile time or was that a runtime lookup? Or am I missing something and there's no lookup needed with known-at-compile-time keys?
ah, fair enough
Anyone having issues with connecting to cljdoc? I’m getting “Peer’s Certificate has been revoked” on Firefox with SEC_ERROR_REVOKED_CERTIFICATE err code, and NET::ERR_CERT_REVOKED on chrome. Seems a bit weird though since it seems to only be an issue on my osx machine.
a random stack overflow answer mentions that firefox will error if any resource on the page (eg. favicon) has a revoked certificate, but other browsers just silently fail to load it
but that's just from a quick google
On that note, cljdoc is looking for ops-savvy folks for advice/help. Drop by #cljdoc if you have time/interest/love-in-your-heart.
On both Safari and Firefox I get messages that secure connection can’t be established/revoked certificate.
Thanks @UTFAPNRPT, known issue.
Hello, i've encountered this which seems surprising to me, any idea?
(with-meta list {:meta :data})
=> Exception
this might clarify https://ask.clojure.org/index.php/11514/functions-with-metadata-can-not-take-more-than-20-arguments?show=11515#a11515
are you trying to add metadata to the list function or do you mean an actual list instance?
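If the goal really is metadata on something that behaves like the list function, one workaround (a sketch, assuming a thin wrapper is acceptable) is to put the metadata on a wrapping fn, or on a list value instead:

```clojure
;; Metadata on a function that delegates to list.
(def list-with-meta
  (with-meta (fn [& args] (apply list args))
             {:meta :data}))

(meta list-with-meta)   ;; => {:meta :data}
(list-with-meta 1 2 3)  ;; => (1 2 3)

;; Metadata on an actual list instance works directly.
(meta (with-meta (list 1 2 3) {:meta :data})) ;; => {:meta :data}
```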
If I have a map of search terms like {:genome "hg38"} and a "database" of maps like [{:genome "hg19" :sample "A"} {:genome "hg38" :sample "A"} {:genome "hg38" :sample "B"} {:genome "hg19" :sample "B"}] is there an easy way to get all matches, i.e. {:genome "hg38" :sample "A"} {:genome "hg38" :sample "B"}?
Note that there might be multiple kv pairs in the search terms and fewer in the db:
A search like {:genome "hg38" :sample "A"} in the db [{:genome "hg19"} {:genome "hg38"}] is possible and should return {:genome "hg38"}.
It feels like it should be very simple (a clojure.set operation or something) but my current attempts are quite verbose. It is almost like a join that requires all keys (that exist in the db) on the left side to match.
Erm, this might just be what set/join does I glean from reading the docs
Ugh, I feel stupid now XD I had never used set/join before
user=> (def db #{{:genome "hg19" :sample "A"} {:genome "hg38" :sample "A"} {:genome "hg38" :sample "B"} {:genome "hg19" :sample "B"}})
#'user/db
user=> (def search #{{:genome "hg38"}})
#'user/search
user=> (clojure.set/join)
Execution error (ClassNotFoundException) at java.net.URLClassLoader/findClass (URLClassLoader.java:471).
clojure.set
user=> (require '[clojure.set :as set :refer [join]])
nil
user=> (join search db)
#{{:genome "hg38", :sample "A"} {:genome "hg38", :sample "B"}}
I was just typing up that you would need to put single k/v entries into a hash-map to do the query, clearly you beat me to it
Regarding your first question - if that's a frequent operation, just create an index with group-by or clojure.set/index. If it's a one-off thing, just reduce over the db and accumulate the matching records.
Regarding the second part - so a query map represents an OR query? Note that clojure.set/join would join the query together with the results - probably not what you want, given that then all results will have :sample "A".
Note also that join may work incorrectly if you have maps with different keys. Also, it may easily be slower than a hand-rolled alternative because it indexes all the data each time you call the function.
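A hand-rolled version for the one-off case could look like this. matches? and search are made-up names, and the matching rule is the one described in the question: every query key that is present in a record must have the same value there, while query keys missing from the record are ignored:

```clojure
(defn matches?
  "True when record agrees with query on every key they share."
  [query record]
  (every? (fn [[k v]]
            (or (not (contains? record k))
                (= v (get record k))))
          query))

(defn search
  "Returns the records of db (any seqable of maps) that match query."
  [db query]
  (filterv #(matches? query %) db))

(search [{:genome "hg19" :sample "A"} {:genome "hg38" :sample "A"}
         {:genome "hg38" :sample "B"} {:genome "hg19" :sample "B"}]
        {:genome "hg38"})
;; => [{:genome "hg38", :sample "A"} {:genome "hg38", :sample "B"}]

(search [{:genome "hg19"} {:genome "hg38"}]
        {:genome "hg38" :sample "A"})
;; => [{:genome "hg38"}]
```

One caveat of this rule: a record that shares no keys with the query matches vacuously, so filter those out first if that matters.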
Thanks for pointing me to the index function. It will speed up my code but only add one line AFAICS.
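For anyone reading along, the index version could look roughly like this (db is the set from the REPL session above; by-genome is a made-up name):

```clojure
(require '[clojure.set :as set])

(def db #{{:genome "hg19" :sample "A"} {:genome "hg38" :sample "A"}
          {:genome "hg38" :sample "B"} {:genome "hg19" :sample "B"}})

;; Build the index once: a map from {:genome value} to the set of matching records.
(def by-genome (set/index db [:genome]))

;; Each query is then a plain map lookup.
(get by-genome {:genome "hg38"})
;; => #{{:genome "hg38", :sample "A"} {:genome "hg38", :sample "B"}}
```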
I have two sorted, lazy lists containing integers, possibly between 10 and 1000 values each. I want to write a function that only keeps values which exist in both lists. It will be called quite often, so efficiency is a concern (which it rarely was in my work with Clojure so far). I played with the data in the REPL and came up with this solution; are there any obvious issues that come to mind? I control the source of those lists, so I could generate different data types, if it helped.
LGTM.
If the :else branch is potentially the most frequently hit one, it could make sense to make it the first one in cond. But perhaps JVM itself will do that in runtime.
You could also make it a tiny bit faster by replacing destructuring with manual calls to first and rest/next, given how the first two branches of cond don't use tail-2 and tail-1, respectively.
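Since the original snippet isn't shown here, this is only a guess at the general shape, written with explicit first/rest calls instead of destructuring. sorted-intersection is a made-up name, and it assumes both inputs are sorted ascending:

```clojure
(defn sorted-intersection
  "Lazily returns the values present in both ascending-sorted seqs."
  [xs ys]
  (lazy-seq
    (let [xs (seq xs)
          ys (seq ys)]
      (when (and xs ys)
        (let [x (first xs)
              y (first ys)]
          (cond
            ;; only the tail that is actually needed gets computed per branch
            (< x y) (sorted-intersection (rest xs) ys)
            (> x y) (sorted-intersection xs (rest ys))
            :else   (cons x (sorted-intersection (rest xs) (rest ys)))))))))

(sorted-intersection [1 2 3 5 9] [2 4 5 8 9 10])
;; => (2 5 9)
```

The result is lazy too, so a caller that only needs the first few common values doesn't pay for the rest.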
I guess it’s not possible to predict which branches are hit the most. Calling first and rest explicitly is a good point, I changed to destructuring for better readability, but did not think of this 👍
Currently we always need all values from the lists, but if the system grows it might become a concern, you also raise a good point there