2023-04-03
Channels
- # beginners (7)
- # calva (25)
- # clerk (2)
- # clj-kondo (5)
- # clojure (42)
- # clojure-brasil (1)
- # clojure-europe (10)
- # clojure-nl (1)
- # clojure-norway (14)
- # clojure-uk (3)
- # conjure (6)
- # datahike (4)
- # datomic (3)
- # etaoin (4)
- # fulcro (6)
- # graalvm (7)
- # hoplon (9)
- # hyperfiddle (6)
- # introduce-yourself (2)
- # london-clojurians (1)
- # off-topic (22)
- # pedestal (5)
- # portal (12)
- # proletarian (1)
- # releases (1)
- # shadow-cljs (9)
- # vim (9)
I want to extend `java.io.PushbackReader` to capture the characters it reads into a string, similarly to `clojure.lang.LineNumberingPushbackReader`. (I can’t use `LineNumberingPushbackReader`, because I need different CR/LF handling.)
I know I can use `proxy` to override the `read` method like this:
(defn ^:private string-capturing-pushback-reader
  [reader]
  (let [sb (StringBuilder.)]
    (proxy [java.io.PushbackReader] [reader]
      (read []
        ;; hint `this` so proxy-super doesn't reflect
        (let [^java.io.Reader this this
              n (proxy-super read)]
          ;; (char -1) throws, so only capture before EOF
          (when-not (neg? n)
            (.append sb (char n)))
          n)))))
Is there a way to add a method that the extendee (`PushbackReader`) doesn’t have, though? I’d like to add `getString` to get at the captured string. Or is there some other idiom for this sort of thing? I could override `toString`, but that feels hacky. Do I need to resort to making a Java class?

Thanks, you’re right, it does. I forgot about that flowchart. Looks like I’ll need to use `gen-class`.
You can also split it into two things if that's OK in your scenario - e.g. return a vector of `[(proxy ...) (fn [] (str sb))]`.
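A minimal sketch of that split, reusing the StringBuilder from the earlier example (the two-element return shape and the usage below are illustrative, not from the original message):

(defn ^:private string-capturing-pushback-reader
  [reader]
  (let [sb (StringBuilder.)]
    ;; return both the capturing reader and a thunk exposing the capture
    [(proxy [java.io.PushbackReader] [reader]
       (read []
         (let [n (proxy-super read)]
           (when-not (neg? n)
             (.append sb (char n)))
           n)))
     (fn [] (str sb))]))

(let [[rdr captured] (string-capturing-pushback-reader
                      (java.io.StringReader. "abc"))]
  (.read rdr)
  (captured)) ;=> "a"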
Also `defprotocol` creates an interface, so maybe that's usable.
There's also `definterface` but it works only with AOT.
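A sketch of the `defprotocol` route: the interface that `defprotocol` generates can be listed among the proxy's interfaces. This assumes the code lives in a `user` namespace, so the generated interface class is `user.CapturedString` (the fully qualified name is needed because the bare symbol resolves to the protocol var, not the class):

(defprotocol CapturedString
  (getString [this] "Returns the text captured so far."))

(defn ^:private string-capturing-pushback-reader
  [reader]
  (let [sb (StringBuilder.)]
    (proxy [java.io.PushbackReader user.CapturedString] [reader]
      (getString [] (str sb))
      (read []
        (let [n (proxy-super read)]
          (when-not (neg? n)
            (.append sb (char n)))
          n)))))

`(getString rdr)` then works through the protocol function, because the proxy object implements the generated interface.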
Doesn't the StringBuilder already give back the captured text with toString? Everything has a toString, and perhaps PushbackReader's toString isn't very important anyway?
Yes, you’re right. Just using `toString` is probably the simplest solution here, especially considering that this is only for the private use of a single namespace anyway.
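For reference, a minimal sketch of that resolution - the earlier proxy with a `toString` override added:

(defn ^:private string-capturing-pushback-reader
  [reader]
  (let [sb (StringBuilder.)]
    (proxy [java.io.PushbackReader] [reader]
      ;; expose the captured text via the stock toString slot
      (toString [] (str sb))
      (read []
        (let [n (proxy-super read)]
          (when-not (neg? n)
            (.append sb (char n)))
          n)))))

Calling `(str rdr)` then yields everything read so far.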
Is there a good way of ensuring I'm operating on a file line-by-line and not loading the whole thing into memory? I have something like:
(with-open [reader (io/reader csv-file)]
  (let [[header & rows] (csv/read-csv reader)]
    (rows->schema header rows)))
(defn- rows->schema
  [headers rows]
  (->> rows
       (map ...)
       (reduce ...)
       (map ...)
       (map vector headers)
       (ordered-map/ordered-map)))
A coworker ran the code with a 78MB file in a JVM constrained to 16MB of memory and it ran (albeit more slowly than on a JVM with several GBs of memory, but I suppose that makes sense); am I right in thinking that if I were loading the whole file into memory the JVM would just OOM? Or am I "holding the head" by keeping hold of the reference to `header`?
I guess this is really two questions:
1. Is the code reasonable? Please tell me how you know it is or isn't.
2. Is there an automated test I can write to ensure that I don't accidentally blow up memory usage in the future?

Reduce lazily consumes the seq you pass in though, right? Like, I'm only on the hook for the size of whatever `reduce` returns, not what gets put into it?
If your processing was truly lazy, you'd get a file closed exception, so your processing must be operating greedily, presumably because of the reduce. Also I would have thought ordered map would have to hold everything in memory in order to do the ordering? Anyway, memory usage will surely depend on how much 'reduction' is occurring in your reduce step.
> Reduce lazily consumes the seq you pass in though, right?

No, reduce doesn't lazily consume, nor lazily return. E.g.:
(->> (range)
     (reduce conj [])
     (take 10))
Runs forever.

If the sequence you pass to reduce is lazy, then it will be consumed lazily:
(->> (range)
     (reduce (fn [acc x]
               (println "X" x)
               (conj acc x))
             [])
     (take 10))
This will keep running forever but it will print the numbers as it goes on.

😄 yeah. I'm not sure there's a difference between `(range)` "lazily producing" and anything downstream "lazily consuming". That is, I don't think there's anything special about `reduce` that is "lazily consuming" here, right @U7S5E44DB?
Yeah exactly. Which is what @UJLF48QJC is looking for (reducing an arbitrarily large file)
> If your processing was truly lazy, you'd get a file closed exception, so your processing must be operating greedily
Won't `with-open` keep it open until (in this case) `(rows->schema ...)` returns? But `rows->schema` itself is operating on lazily-loaded line by lazily-loaded line?
If your `rows->schema` returned a lazy result, then no, it wouldn't. E.g. try returning something that returns a lazy result (e.g. `map`) and see what happens.
To be a bit pedantic, even if it did return a lazy sequence (a map instead of an ordered-map, for example), the file would still be fully consumed at the `reduce` step.
For your other question, how to test that it doesn't blow up the memory in the future: I'm not sure how to do it (besides running the test in a memory-constrained environment…), but there are ways to test whether you've fully consumed a lazy sequence.
(defn number-generator
  ([] (number-generator 1))
  ([n]
   (if (< n 1000)
     (lazy-seq (cons n (number-generator (inc n))))
     (lazy-seq (throw (Exception. "Fully consumed"))))))
(take 999 (number-generator)) ;; works
(take 1000 (number-generator)) ;; throws exception "Fully consumed"
You could create a lazy sequence that would throw an exception after a certain number of items have been consumed. It doesn't help for your `rows->schema` function though, as you want everything to be consumed once you reach `reduce`.
Has OP's question about whether the `[header & rows]` formulation "holds the head" been answered?
i don't think so. it got so wrapped up in lazy-or-not-lazy talk that the core question remained unanswered.
This isn't the most rigorous experiment, but memory usage remained constant when I ran
(let [[head & tail] (repeatedly rand)]
  (reduce max tail))
which suggests that the front bit of `tail` is being GCed even though I'm keeping a reference to `head`.
In this example, isn’t header simply a collection of column headings? It doesn’t hold a sequence of the other rows. It’s the destructured ‘rows’ that holds the head of a lazy sequence, but you then greedily process that so the sequence is fully realised by the end of the with-open block.
I am using the time-literals library, which adds a data_readers.cljc that adds `time/date` and other reader tags for Java 8 time object literals. The thing is, though, that if I use `clojure.core/read-string` it will read them correctly, but `clojure.edn/read-string` will complain about an unknown literal. How do I fix this?
You have to pass the `:readers` option to `edn/read-string`. You can just pass it `*data-readers*` if you want to use the current ones.
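A minimal sketch, assuming the time-literals readers are registered via its data_readers.cljc and therefore present in `*data-readers*`:

(require '[clojure.edn :as edn])

;; pass the currently registered data readers through to the EDN reader
(edn/read-string {:readers *data-readers*}
                 "#time/date \"2023-04-03\"")
;; should yield a java.time.LocalDate if time-literals is on the classpath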
oh ok
I am working with a java library that is trying to serialize a nested clojure datastructure that I'm passing to the library. When I pass in my value I'm getting an exception:
java.io.NotSerializableException
clojure.lang.RT$4
however, when I `println` my value once before passing it to the library, the exception doesn't occur. Does anyone know what's going on with that `clojure.lang.RT$4`?

My guess is you have some sort of lazy/partially-realized data structure that the `println` is forcing evaluation of, making it serializable.
I tried running `doall` on the data structure, but I guess if there are nested lazy sequences a top-level `doall` won't realize everything?
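A sketch of one way to force every level, using `clojure.walk` (not from the thread; it rebuilds the whole structure, so it only makes sense for finite data):

(require '[clojure.walk :as walk])

;; postwalk rebuilds the structure bottom-up; its seq handling calls
;; doall internally, so lazy seqs at any depth end up realized
(defn deep-doall [x]
  (walk/postwalk identity x))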
this object is being read in from a transit string using `cognitect.transit/reader`. I don't need laziness. It's a business object, let me try to obscure it a bit
it's roughly like this, arbitrarily nested:
(lazy-seq-1 :a-val (lazy-seq-2 :b-val (lazy-seq-3 :c-val)))
no, it's being read just like this with no further processing
(with-open [input-stream (ByteArrayInputStream. (.getBytes transit-str))]
  (transit/read (transit/reader input-stream :json opts)))
Does transit preserve the lazy-seq type vs. converting to a persistent list? I guess I can test that question myself:
(type (<-transit-json-str (->transit-json-str (list 1 2 3))))
#_#<Class@14a50707 clojure.lang.LazySeq>
Using the round-trip functions I have on hand produces a lazy seq. I wonder if I can force that to come out as a PersistentList?
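If concrete persistent lists are wanted rather than (realized) lazy seqs, one option is a postwalk that rebuilds every seq - a sketch, with `seqs->lists` as a made-up name:

(require '[clojure.walk :as walk])

;; rebuild every seq in the structure as a PersistentList
(defn seqs->lists [x]
  (walk/postwalk
   (fn [form]
     (if (seq? form)
       (apply list form)
       form))
   x))

(type (seqs->lists (map inc [1 2 3])))
;=> clojure.lang.PersistentList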