#clojure
2022-01-30
winsome04:01:55

What's the best way to deal with filesystem resources in cljc code? I've got a medium-sized word list that I'd like to include in a cljc library, but now when I try to use it in a project that uses an uberjar for deployment, I get a Method code too large! error while trying to compile.

winsome04:01:46

Before, when it was just clj code, I was slurping it in via io/resource, and that wasn't a problem. For the cljc version I decided to just def it inside a data namespace.

p-himik10:01:53

You can still use something like io/resource in CLJC, but it will have to be in a macro on the CLJS side so it gets executed at compile time (it will of course use compile time classpath as well), so it might be a hassle given that you can't use it as a macro on the CLJ side since it would lead to the same error. Another approach would be to split that string into a bunch of defs or, since that might not work as well, a bunch of defns, and then make a new def by combining them. That way, the large string will be split into smaller ones, each one in its own method. To ease the second approach, especially if the chance of you changing that string is high, you can write a macro that reads a file and generates a bunch of defns where each contains, say, at most 10kB string, and an extra def that uses them to come up with the resulting string.
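A minimal sketch of the chunking idea described above, for the JVM side only. It is a simplified variation: instead of generating separate defns, the macro emits a single (str ...) call over string literals of at most 10,000 characters each. The names chunk-string and inline-resource are hypothetical, and whether this is enough to avoid the error in a given project is an assumption that would need checking. For CLJS the macro would additionally have to be pulled in via :require-macros.

```clojure
(require '[clojure.java.io :as io])

(defn chunk-string
  "Splits s into substrings of at most n characters each."
  [s n]
  (mapv #(apply str %) (partition-all n s)))

(defmacro inline-resource
  "Reads the classpath resource at path while the macro expands
  (i.e. at compile time) and emits a (str ...) call over the chunks."
  [path]
  (let [text (slurp (io/resource path))]
    `(str ~@(chunk-string text 10000))))

;; Usage, assuming targets.txt is on the compile-time classpath:
;; (def targets (clojure.string/split-lines (inline-resource "targets.txt")))
```

Each chunk becomes its own string constant, so no single literal carries the whole file.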

vemv05:01:35

@U028BUU1P3R can you provide a brief example of the code that was failing? I think that I've seen this sort of problem before; the tldr is that one should favor thin code that points to big data, over large code that maps 1:1 to data. Method code too large! says it all - Java classes (which is what all clj code ultimately compiles to) aren't meant to be arbitrarily large.

winsome06:01:39

Sure. I made myself a cli wordle clone to satisfy my wordle-cravings. I've got two wordlists in separate files - one is 14K, and the other is 67K. I wrote it in clj but I have vague intentions of making a web version so my family can play together, so I changed it to cljc.

;; old way
;(def targets (str/split-lines (slurp (io/resource "targets.txt"))))
;(def dict (str/split-lines (slurp (io/resource "dict.txt"))))

;; cljc way - this blows up with Method code too large!
(def targets ["word1" "word2" ,,, "word2500"])
(def dict ["word1" ,,, "word12972"])

winsome06:01:48

I've just changed it back to the slurp for now, but I think I'll take a swing at the macro approach sometime soon.

vemv06:01:47

Yeah the old way is the one way to go in general. Normally I'd add a delay so that one avoids the compile-time side-effect - those are yucky in general :) For js compat, maybe I'd look into how to "slurp" from cljs (which surely varies from web to cli) and make a protocol, with jvm and (possibly two) cljs implementations.
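A small sketch of the "old way plus a delay" suggestion, JVM-only. The helper name load-lines is hypothetical; the resource names are the ones from the snippet earlier in the thread.

```clojure
(require '[clojure.java.io :as io]
         '[clojure.string :as str])

(defn load-lines
  "Reads a classpath resource and returns its lines as a vector."
  [resource-name]
  (str/split-lines (slurp (io/resource resource-name))))

;; Nothing is read when the namespace loads; each file is read on first deref.
(def targets (delay (load-lines "targets.txt")))
(def dict    (delay (load-lines "dict.txt")))

;; Usage: (rand-nth @targets)
```

Callers deref with @targets, so the file read happens at first use rather than as a side effect of compiling or loading the namespace.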

barrell09:01:15

I’ve been looking around but can’t seem to find anything - what characters are valid namespace characters? How can you print the file path that will be looked for with that namespace? I’m setting up a directory structure based routing system, and I’m trying to figure out how to name variables (i.e. /photos/:photo-id would map to routes.photos.[$+*-]photo-id). I know I can at least prefix it with a hyphen that will look for an underscore, but I’m curious about the others ($, *, and +). Is there a clojure function that functions like (ns->filepath 'routes.$photo-id)?

p-himik10:01:31

There is, but it's private, clojure.core/root-resource. And it's so small - you basically already know what it does.
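A sketch of what a public equivalent might look like, assuming the private function does what I understand it to do: replace hyphens with underscores and dots with slashes, then prefix a slash. ns->filepath is the hypothetical name from the question, not an existing function; the loader appends the file extension separately.

```clojure
(require '[clojure.string :as str])

(defn ns->filepath
  "Maps a namespace symbol to its root resource path (without extension)."
  [ns-sym]
  (str "/" (-> (name ns-sym)
               (str/replace "-" "_")
               (str/replace "." "/"))))

(ns->filepath 'routes.photos.-photo-id)
;; => "/routes/photos/_photo_id"
```

The private var can also be called directly for comparison, e.g. (#'clojure.core/root-resource 'routes.photos).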

borkdude10:01:12

To what extent is keyword interning a serious danger with respect to memory usage if you intern keywords that result from reading EDN/JSON from some third party?

Ben Sless10:01:12

Slightly mitigated by weak references, but you can probably still OOM quickly

borkdude10:01:57

ok, I'm asking because I noticed that when using keywords as keys in a map, the lookup becomes much quicker than when using symbols

borkdude10:01:15

and those symbols could be the result of auto-generating in macros

borkdude10:01:13

but this danger is already present if you evaluate random programs, as they can contain random keywords, so maybe it's not a huge deal

Ben Sless10:01:33

You could put them behind a bounded cache

borkdude10:01:46

so probably: if the user was malicious, he/she could cause an OOM, but in normal usage, it shouldn't be a problem

borkdude10:01:32

How would you use a bounded cache for this?

Ben Sless11:01:38

What you want to do is prevent the creation of too many keywords. Any mechanism that keeps track of the number of keywords created and throttles or forces the cleanup of old keywords can be set up. You can set a policy which biases known keywords, dig into the implementation of Keyword to see how you can force the cache to evict unreferenced keywords, and generally wrap keyword in something which will bound its behavior in that context
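One simple policy in this spirit (not a bounded cache, just a bias towards known keywords) can be sketched with clojure.core/find-keyword, which returns an already-interned keyword or nil and never interns a new one. The names safe-key and safe-keys are hypothetical.

```clojure
;; Keywords written literally in the program are interned when the code is read.
(def expected-keys #{:name :version})

(defn safe-key
  "Returns the existing keyword for s if one is already interned, else s itself."
  [s]
  (or (find-keyword s) s))

(defn safe-keys
  "Converts string keys to keywords only when the keyword already exists."
  [m]
  (into {} (map (fn [[k v]] [(safe-key k) v])) m))

(safe-keys {"name" "x" "some-unseen-key-8c1f" 1})
;; => {:name "x", "some-unseen-key-8c1f" 1}
```

Unknown keys from third-party input stay as strings, so they cannot grow the keyword table.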

borkdude11:01:52

@UK0810AQ2 I could also convert the symbols (which are a limited set/scope) to some number, or maybe a keyword like :k1 etc instead of :g__auto_001

borkdude11:01:37

I notice that a long lookup is much slower than a keyword lookup:

user=> (let [m {:a 1 :b 2 :c 3 'd 6 'x 7 1 1}] (time (dotimes [i 10000000] (.get m 1))))
"Elapsed time: 348.475696 msecs"
nil
user=> (let [m {:a 1 :b 2 :c 3 'd 6 'x 7 1 1}] (time (dotimes [i 10000000] (.get m :a))))
"Elapsed time: 31.343012 msecs"

borkdude11:01:17

I would just use numbers if I could make that as fast as the keyword lookup

Ben Sless11:01:59

maybe int-map is faster?

borkdude11:01:51

Seems so!

user=> (let [^java.util.Map m (into {} [[1 2] [3 4] [:b 2]])] (time (dotimes [i 1000000000] (.get m :b))))
"Elapsed time: 3000.653457 msecs"

user=> (let [^java.util.Map m (into (i/int-map) [[1 2] [3 4] [4 2]])] (time (dotimes [i 1000000000] (.get m 4))))
"Elapsed time: 500.608064 msecs"

borkdude11:01:32

I wonder if such a thing also exists for CLJS, or if that would even make sense there

borkdude11:01:49

It seems in CLJS using an int as a key is faster than a keyword anyway:

cljs.user=> (let [m (into {} [[1 2] [3 4] [:b 2]])] (time (dotimes [i 100000000] (.get m :b))))
"Elapsed time: 2821.658472 msecs"
nil
cljs.user=> (let [m (into {} [[1 2] [3 4] [:b 2]])] (time (dotimes [i 100000000] (.get m 1))))
"Elapsed time: 794.211123 msecs"

p-himik12:01:07

@U04V15CAJ Your comparisons are not entirely correct. Those are small maps so they are array maps and not hash maps - the lookup is sequential. When the long is the last item, of course its lookup will be slow.

p-himik12:01:02

Try hash-map, the lookup speed should be around the same for all the keys - as long as you don't have some bad case of hash collision.
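A quick way to see the difference being pointed out: a small map literal is an array map, while hash-map builds a hash map. The helper time-lookups is a hypothetical name, and timings are left out because they vary by machine.

```clojure
(def small-literal {:a 1 :b 2 :c 3 'd 6 'x 7 1 1})
(def hashed (hash-map :a 1 :b 2 :c 3 'd 6 'x 7 1 1))

(class small-literal) ;; clojure.lang.PersistentArrayMap
(class hashed)        ;; clojure.lang.PersistentHashMap

(defn time-lookups
  "Times ten million .get calls for key k on map m."
  [^java.util.Map m k]
  (time (dotimes [_ 10000000] (.get m k))))

;; Compare, for example:
;; (time-lookups hashed 1)
;; (time-lookups hashed :a)
```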

borkdude12:01:34

good point!

Joshua Suskalo16:01:14

Thanks for having this conversation, it reminded me that I needed to get #farolero to stop interning keywords at runtime. I was able to cut a new release today fixing what would probably have been a pretty hard-to-trace bug a while from now when I deploy farolero to production on something.

borkdude16:01:02

@U5NCUG8NR Aha! Are there any performance repercussions when disabling that?

Joshua Suskalo16:01:11

Should be about the same on unwind, because I've exchanged an indirect keyword identity comparison (it was still using = at the top level) for a boxed long comparison. But it should be faster on entry: a block that needs to construct a jump target can now do so via a simple fetch_add on an atomic, rather than needing to do a gensym, get the string out of it, and then construct a keyword with that and a fixed namespace string.

Joshua Suskalo16:01:44

The gensym one clearly does an atomic fetch_add somewhere in its body anyway, so I'm just going down to only the fetch_add, rather than needing string ops as well

borkdude16:01:01

interesting... I ended up not using int-map yet, since assoc seemed to be way slower, so in total performance would be actually worse than just a regular hash-map.... A hash-map + keywords are in total the fastest for the problem I have, but the interning of keywords still could be a problem for some environments. Maybe I'll make the keywordizing configurable or so. :thinking_face:

Joshua Suskalo16:01:33

making it configurable makes sense when you can know at app development that you're receiving only trusted data. Another thing to consider would be to have your own interning deftype that allows you to have "epochs" where you say "all the keys from this epoch will get GCed together"

Joshua Suskalo16:01:06

If you're doing tons of stuff that needs efficiency through a single request but across requests it won't give much benefit, then maybe just wrap an epoch around your handler, but if you're doing something where there's savings to be made across requests, you could just make it like a ttl cache basically.

Joshua Suskalo16:01:11

ofc doing it this way brings problems if your users want to get data out by writing literal keywords into their code

Joshua Suskalo16:01:46

Could make a custom associative data structure that just wraps a map basically but compares keywords as equal to the new key type.

Joshua Suskalo16:01:14

But that loses a lot of the performance benefits if you're doing that constantly.

Joshua Suskalo16:01:29

(let me know if my ramblings are useless, I can stop)

borkdude16:01:47

oh please continue ;)

Joshua Suskalo16:01:26

So if the intention is to keep things convenient while being performant, I think that means you basically need a global "initial epoch" that would have all the interned items that are literal values in your code, and you could provide an alternative to the get function that would map a keyword to something in this initial epoch at compile time, and then you make interning be a two-step process of first looking things up in the initial epoch and then if it's missing falling through to the non-static epoch, that way you get fast lookup at all your callsites, and construction of keys is also not that much more expensive.

Joshua Suskalo16:01:18

Personally I'd opt for making this alternate get actually be a function and just have an inline function definition that will do the epoch registration. It would mean that depending on if you're using it as an argument to a HOF you'd get different performance characteristics, but it'd be as performant as possible in all cases.

Joshua Suskalo16:01:14

I think that mostly covers the main usecases for "I want to make my own interning", although I'd love to hear if you have any thoughts about holes in this idea. There's probably other good ways that don't go down this route though.

borkdude16:01:25

That's what I basically did with the int-map, symbols were converted to a number

borkdude16:01:42

but instead of a number I could just create Objects

Joshua Suskalo16:01:34

I think they'll be about the same size seeing as boxed numbers have their size dominated by the object header. Although I think being able to use identical? would help performance.

borkdude16:01:34

I want to move to mutable arrays eventually anyway. Using keywords was a nice cheap way to get some free speed-ups without changing much

borkdude17:01:20

but as it's only safe to intern keywords in environments that aren't long-lived or do not process untrusted code or input, making a configuration option for this in the meanwhile makes sense I think

Joshua Suskalo17:01:29

so with your int map of symbols being converted to a number, what did you do to allow easy lookup from the data without losing the benefits of a simple type like integers for comparison? Were you able to move the mapping from symbol to integer to compile time or was that a runtime lookup? Or am I missing something and there's no lookup needed with known-at-compile-time keys?

borkdude17:01:04

compile time

borkdude17:01:21

the lookups were faster

borkdude17:01:34

but the changing of the map using assoc was slower

borkdude17:01:37

so in total it was slower

Joshua Suskalo17:01:37

ah, fair enough

pinealan13:01:26

Anyone having issues with connecting to cljdoc? I’m getting “Peer’s Certificate has been revoked” on Firefox with SEC_ERROR_REVOKED_CERTIFICATE err code, and NET::ERR_CERT_REVOKED on chrome. Seems a bit weird though since it seems to only be an issue on my osx machine.

p-himik13:01:11

Same on Linux, but works in Chrome.

noisesmith13:01:08

a random stack overflow answer mentions that firefox will error if any resource on the page (e.g. the favicon) has a certificate problem, but other browsers just silently fail to load it

noisesmith13:01:23

but that's just from a quick google

lread13:01:13

On that note, cljdoc is looking for ops-savvy folks for advice/help. Drop by #cljdoc if you have time/interest/love-in-your-heart.

wevrem17:01:32

On both Safari and Firefox I get messages that secure connection can’t be established/revoked certificate.

lread17:01:48

Thanks @UTFAPNRPT, known issue.

lread13:01:30

@U050TNB9F has resolved the certificate issue.

🙏 2
pbaille17:01:19

Hello, i've encountered this which seems surprising to me, any idea? (with-meta list {:meta :data}) => Exception

Alex Miller (Clojure team)18:01:21

are you trying to add metadata to the list function or do you mean an actual list instance?

pbaille06:01:09

Yes I was trying to add meta to the clojure.core/list function

pbaille06:01:55

I believed that every IObj could take meta
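A possible workaround, sketched under the assumption that the goal is simply a list-building function that carries metadata: wrap list in an ordinary fn, which does accept metadata. The name list-with-meta is hypothetical.

```clojure
;; (with-meta list {:meta :data}) throws, as reported above.
;; An ordinary fn wrapper accepts metadata:
(def list-with-meta
  (with-meta (fn [& xs] (apply list xs)) {:meta :data}))

(meta list-with-meta)   ;; => {:meta :data}
(list-with-meta 1 2 3)  ;; => (1 2 3)
```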

Nom Nom Mousse18:01:49

If I have a map of search terms like {:genome "hg38"} and a "database" of maps like [{:genome "hg19" :sample "A"} {:genome "hg38" :sample "A"} {:genome "hg38" :sample "B"} {:genome "hg19" :sample "B"}] is there an easy way to get all matches, i.e. {:genome "hg38" :sample "A"} {:genome "hg38" :sample "B"}? Note that there might be multiple kv pairs in the search terms and fewer in the db: A search like {:genome "hg38" :sample "A"} in the db [{:genome "hg19"} {:genome "hg38"}] is possible and should return {:genome "hg38"}. It feels like it should be very simple (a clojure.set operation or something) but my current attempts are quite verbose. It is almost like a join that requires all keys (that exist in the db) on the left side to match.

Nom Nom Mousse18:01:37

Erm, this might just be what set/join does I glean from reading the docs

Nom Nom Mousse18:01:43

Ugh, I feel stupid now XD I had never used set/join before

user=> (def db #{{:genome "hg19" :sample "A"} {:genome "hg38" :sample "A"} {:genome "hg38" :sample "B"} {:genome "hg19" :sample "B"}})
#'user/db
user=> (def search #{{:genome "hg38"}})
#'user/search
user=> (clojure.set/join)
Execution error (ClassNotFoundException) at java.net.URLClassLoader/findClass (URLClassLoader.java:471).
clojure.set
user=> (require '[clojure.set :as set :refer [join]])
nil
user=> (join search db)
#{{:genome "hg38", :sample "A"} {:genome "hg38", :sample "B"}}

noisesmith18:01:56

I was just typing up that you would need to put single k/v entries into a hash-map to do the query, clearly you beat me to it

🙏 1
p-himik18:01:58

Regarding your first question - if that's a frequent operation, just create an index with group-by or clojure.set/index. If it's a one-off thing, just reduce over the db and accumulate the matching records. Regarding the second part - so a query map represents an OR query? Note that clojure.set/join would join the query together with the results - probably not what you want, given that then all results will have :sample "A".

🙏 1
p-himik18:01:54

Note also that join may work incorrectly if you have maps with different keys. Also, it may easily be slower than a hand-rolled alternative because it indexes all the data each time you call the function.
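A hand-rolled alternative could look roughly like this, assuming the intended rule is "a record matches when every search key that is present in the record has the same value". The names matches? and search-db are hypothetical. Under this rule, a record that shares no keys with the query matches everything.

```clojure
(defn matches?
  "True when every query key that exists in record has an equal value there."
  [query record]
  (every? (fn [[k v]]
            (or (not (contains? record k))
                (= v (get record k))))
          query))

(defn search-db
  "Returns the records in db that match query, without merging query into them."
  [query db]
  (filter #(matches? query %) db))

(search-db {:genome "hg38" :sample "A"} [{:genome "hg19"} {:genome "hg38"}])
;; => ({:genome "hg38"})
```

Unlike join, the results are the original records, so the query's :sample "A" does not leak into them.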

Nom Nom Mousse18:01:16

Thanks for pointing me to the index function. It will speed up my code but only add one line AFAICS.
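For reference, a short sketch of the clojure.set/index usage being discussed, using the db from the REPL session above. The var name by-genome is hypothetical.

```clojure
(require '[clojure.set :as set])

(def db #{{:genome "hg19" :sample "A"} {:genome "hg38" :sample "A"}
          {:genome "hg38" :sample "B"} {:genome "hg19" :sample "B"}})

;; Build the index once; keys are maps of the indexed key/value pairs.
(def by-genome (set/index db [:genome]))

(get by-genome {:genome "hg38"})
;; => #{{:genome "hg38", :sample "A"} {:genome "hg38", :sample "B"}}
```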

javahippie21:01:15

I have two sorted, lazy lists containing integers, possibly between 10 and 1000 values each. I want to write a function that keeps only the values that exist in both lists. It will be called quite often, so efficiency is a concern (which it rarely was in my current work with Clojure). I played with the data in the REPL and came up with this solution; are there any obvious issues that come to mind? I control the source of those lists, so I could generate different data types, if it helped.

p-himik21:01:12

LGTM. If the :else branch is potentially the most frequently hit one, it could make sense to make it the first one in cond. But perhaps the JVM itself will do that at runtime.

p-himik22:01:10

You could also make it a tiny bit faster by replacing destructuring with manual calls to first and rest/next, given how the first two branches of cond don't use tail-2 and tail-1, respectively.

javahippie22:01:32

I guess it’s not possible to predict which branches are hit the most. Calling first and rest explicitly is a good point, I changed to destructuring for better readability, but did not think of this 👍

hiredman22:01:54

It isn't lazy if that is a concern

javahippie22:01:12

Currently we always need all values from the lists, but if the system grows it might become a concern, you also raise a good point there
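The snippet under discussion is not part of this log, so the following is only a sketch of what such a function might look like, with explicit first/next calls instead of destructuring, plus a lazy variant addressing the last point. Both function names are hypothetical, and both assume ascending, sorted input.

```clojure
(defn sorted-intersection
  "Eagerly returns a vector of the values present in both sorted seqs."
  [xs ys]
  (loop [xs (seq xs), ys (seq ys), acc (transient [])]
    (if (and xs ys)
      (let [x (first xs), y (first ys)]
        (cond
          (< x y) (recur (next xs) ys acc)
          (> x y) (recur xs (next ys) acc)
          :else   (recur (next xs) (next ys) (conj! acc x))))
      (persistent! acc))))

(defn lazy-sorted-intersection
  "Lazy variant: only consumes as much of each input as the caller realizes."
  [xs ys]
  (lazy-seq
    (let [xs (seq xs), ys (seq ys)]
      (when (and xs ys)
        (let [x (first xs), y (first ys)]
          (cond
            (< x y) (lazy-sorted-intersection (next xs) ys)
            (> x y) (lazy-sorted-intersection xs (next ys))
            :else   (cons x (lazy-sorted-intersection (next xs) (next ys)))))))))

(sorted-intersection [1 2 3 4] [2 4 6])
;; => [2 4]
(take 3 (lazy-sorted-intersection (iterate #(+ % 2) 0) (iterate #(+ % 3) 0)))
;; => (0 6 12)
```

The eager version is likely the better fit while all values are always needed; the lazy one trades some overhead for the ability to stop early.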