#clojure
2023-04-03
flowthing 07:04:21

I want to extend `java.io.PushbackReader` to capture the characters it reads into a string, similarly to `clojure.lang.LineNumberingPushbackReader`. (I can’t use `LineNumberingPushbackReader`, because I need different CR/LF handling). I know I can use `proxy` to override the `read` method like this:

(defn ^:private string-capturing-pushback-reader
  [reader]
  (let [sb (StringBuilder.)]
    (proxy [java.io.PushbackReader] [reader]
      (read []
        ;; rebinding this with a type hint avoids reflection in proxy-super
        (let [^java.io.Reader this this
              n (proxy-super read)]
          ;; read returns -1 at EOF, which is not a valid char
          (when-not (neg? n)
            (.append sb (char n)))
          n)))))
Is there a way to add a method that the extendee (`PushbackReader`) doesn’t have, though? I’d like to add `getString` to get at the captured string. Or is there some other idiom for this sort of thing? I could override `toString`, but that feels hacky. Do I need to resort to making a Java class?

flowthing 08:04:36

Thanks, you’re right, it does. I forgot about that flowchart. Looks like I’ll need to use gen-class.

👍 2
p-himik 08:04:41

You can also split it into two things if that's OK in your scenario - e.g. return a vector of `[(proxy ...) (fn [] (str sb))]`.
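A minimal untested sketch of that split, reusing the StringBuilder capture and EOF guard from the proxy above:

(defn ^:private string-capturing-pushback-reader
  [reader]
  (let [sb (StringBuilder.)]
    ;; return the reader plus a thunk that yields the captured string
    [(proxy [java.io.PushbackReader] [reader]
       (read []
         (let [n (proxy-super read)]
           (when-not (neg? n)
             (.append sb (char n)))
           n)))
     (fn [] (str sb))]))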

p-himik 08:04:31

Also `defprotocol` creates an interface, so maybe that's usable. There's also `definterface` but it works only with AOT.
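An untested sketch of the `defprotocol` route (protocol name and method are made up, and it assumes evaluation in the user namespace): `defprotocol` generates a JVM interface, and proxy accepts interfaces alongside the concrete superclass.

(defprotocol StringCapture
  (getString [this] "Returns the characters captured so far."))

(defn ^:private string-capturing-pushback-reader
  [reader]
  (let [sb (StringBuilder.)]
    ;; user.StringCapture is the interface generated by defprotocol above
    (proxy [java.io.PushbackReader user.StringCapture] [reader]
      (read []
        (let [n (proxy-super read)]
          (when-not (neg? n)
            (.append sb (char n)))
          n))
      (getString []
        (str sb)))))

Calling (getString rdr) then dispatches through the protocol to the proxy's method.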

p-himik 08:04:20

Or maybe you can even feed that proxy into reify. Dunno, haven't tested it.

phill 09:04:14

Doesn't StringReader release the captured text with `toString`? Everything has a `toString`, and perhaps PushbackReader's `toString` is not already very important?

flowthing 09:04:35

Yes, you’re right. Just using `toString` is probably the simplest solution here, especially considering that this is only for the private use of a single namespace anyway.
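That variant is the same proxy with one extra method, e.g. (untested sketch):

(defn ^:private string-capturing-pushback-reader
  [reader]
  (let [sb (StringBuilder.)]
    (proxy [java.io.PushbackReader] [reader]
      (read []
        (let [n (proxy-super read)]
          (when-not (neg? n)
            (.append sb (char n)))
          n))
      ;; (str rdr) now returns everything read so far
      (toString []
        (str sb)))))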

TMac 10:04:37

Is there a good way of ensuring I'm operating on a file line-by-line and not loading the whole thing into memory? I have something like:

(with-open [reader (io/reader csv-file)]
  (let [[header & rows] (csv/read-csv reader)]
    (rows->schema header rows)))

(defn- rows->schema
  [headers rows]
  (->> rows
       (map ...)
       (reduce ...)
       (map ...)
       (map vector headers)
       (ordered-map/ordered-map)))
A coworker ran the code with a 78MB file in a JVM constrained to 16MB of memory and it ran (albeit more slowly than on a JVM with several GBs of memory, but I suppose that makes sense); am I right in thinking that if I were loading the whole file into memory the JVM would just OOM? Or am I "holding the head" by keeping hold of the reference to header? I guess this is really two questions:
1. Is the code reasonable? Please tell me how you know it is or isn't.
2. Is there an automated test I can write to ensure that I don't accidentally blow up memory usage in the future?

p-himik 11:04:00

Reduce is not lazy.

TMac 11:04:16

Reduce lazily consumes the seq you pass in though, right? Like, I'm only on the hook for the size of whatever reduce returns, not for what gets fed into it?

Mark Wardle 11:04:48

If your processing was truly lazy, you'd get a file closed exception, so your processing must be operating greedily, presumably because of the reduce. Also I would have thought ordered map would have to hold everything in memory in order to do the ordering? Anyway, memory usage will surely depend on how much 'reduction' is occurring in your reduce step.

tomd 12:04:27

> Reduce lazily consumes the seq you pass in though, right? No, reduce neither consumes lazily nor returns lazily. E.g.

(->> (range)
     (reduce conj [])
     (take 10))
Runs forever

solf 12:04:32

If the sequence you pass to reduce is lazy, then it will be consumed lazily.

(->> (range)
     (reduce (fn [acc x]
               (println "X" x)
               (conj acc x)) [])
     (take 10))
This will keep running forever but it will print the numbers as it goes on

tomd 12:04:51

😄 yeah. I'm not sure there's a difference between (range) "lazily producing" and anything downstream "lazily consuming". That is, I don't think there's anything special about reduce that is "lazily consuming" here, right @U7S5E44DB?

solf 12:04:44

Yeah exactly. Which is what @UJLF48QJC is looking for (reducing an arbitrarily large file)

👍 4
TMac 12:04:58

> If your processing was truly lazy, you'd get a file closed exception, so your processing must be operating greedily Won't with-open keep it open until (in this case) (rows->schema ...) returns? But rows->schema itself is operating on lazily-loaded line by lazily-loaded line?

Mark Wardle 12:04:51

If your rows->schema returned a lazy result then no, it wouldn’t. E.g. try returning something that returns a lazy result (e.g. map) and see what happens.

TMac 12:04:20

ah, that makes sense then—`rows->schema` does not return a lazy result

solf 13:04:40

To be a bit pedantic, even if it did return a lazy sequence (a map instead of an ordered-map, for example), the file would still be fully consumed at the reduce step

👍 2
solf 13:04:28

For your other question, how to test that it doesn’t blow up the memory in the future, I’m not sure how to do it (besides running the test in a memory-constrained environment…), but there are ways to test whether you’ve fully consumed a lazy sequence.

(defn number-generator 
  ([] (number-generator 1))
  ([n]
   (if (< n 1000)
     (lazy-seq (cons n (number-generator (inc n))))
     (lazy-seq (throw (Exception. "Fully consumed"))))))

(take 999 (number-generator))  ;; works
(take 1000 (number-generator)) ;; throws exception "Fully consumed"

solf 13:04:50

You could create a lazy-sequence that would throw an exception after a certain number of items have been consumed. It doesn’t help for your rows->schema function though, as you want everything to be consumed once you reach reduce
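A generalized version of the generator above could look like this (untested sketch; throw-after is a made-up name):

(defn throw-after
  "Wraps coll in a lazy seq that throws if more than n items are realized."
  [n coll]
  (lazy-seq
    (when-let [s (seq coll)]
      (if (pos? n)
        (cons (first s) (throw-after (dec n) (rest s)))
        (throw (ex-info "Realized more items than allowed" {}))))))

(doall (throw-after 5 (range 5)))  ;; works
(doall (throw-after 5 (range 6)))  ;; throws on the sixth item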

phill 15:04:38

Has OP's question about whether the [header & rows] formulation "holds the head" been answered?

dpsutton 16:04:55

i don’t think so. it got wrapped up in lazy-or-not-lazy talk and the core question remained unanswered.

TMac 17:04:23

This isn't the most rigorous experiment, but memory usage remained constant when I ran

(let [[head & tail] (repeatedly rand)]
  (reduce max tail))
which suggests that the front bit of tail is being GCed even though I'm keeping a reference to head

😃 1
Mark Wardle 18:04:27

In this example, isn’t header simply a collection of column headings? It doesn’t hold a sequence of the other rows. It’s the destructured ‘rows’ that holds the head of a lazy sequence, but you then greedily process that so the sequence is fully realised by the end of the with-open block.
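For contrast, a sketch of actually holding the head (the count is arbitrary; with a large enough count this exhausts the heap, so don't run it casually):

(let [xs (repeatedly 100000000 rand)]
  ;; xs is used again after the reduce, so every realized cell stays
  ;; reachable for the whole reduction instead of being GCed as it goes
  [(reduce max xs) (first xs)])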

roklenarcic 12:04:56

I am using the time-literals library, which adds a data_readers.cljc that adds time/date and other reader tags for Java 8 time object literals. The thing is, though, that if I use clojure.core/read-string it will read them correctly, but clojure.edn/read-string will complain about an unknown literal. How do I fix this?

jjttjj 13:04:14

You have to pass the :readers option to edn/read-string. You can just pass it *data-readers* if you want to use the current ones
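For example (assuming time-literals is on the classpath and loaded, so its tags are already in *data-readers*):

(require '[clojure.edn :as edn])

;; *data-readers* holds the tag->reader mappings collected from every
;; data_readers.cljc on the classpath, including the time-literals ones
(edn/read-string {:readers *data-readers*} "#time/date \"2023-04-03\"")
;; => a java.time.LocalDate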

adamfrey 18:04:32

I am working with a Java library that is trying to serialize a nested Clojure data structure that I'm passing to the library. When I pass in my value I'm getting an exception:

java.io.NotSerializableException
   clojure.lang.RT$4
however, when I println my value once before passing it to the library, the exception doesn't occur. Does anyone know what's going on with that clojure.lang.RT$4 ?

seancorfield 19:04:45

My guess is you have some sort of lazy/partially-realized data structure that the println is forcing evaluation of and making it serializable.

adamfrey 19:04:47

I tried running doall on the data structure, but I guess if there are nested lazy sequences a top-level doall won't realize everything?

ghadi 19:04:21

somewhere nested inside you have a seq over an Iterable source

ghadi 19:04:59

the more details you can reveal, the better

adamfrey 19:04:23

this object is being read in from a transit string using cognitect.transit/reader. I don't need laziness. It's a business object, let me try to obscure it a bit

adamfrey 19:04:23

it's roughly like this, arbitrarily nested:

(lazy-seq-1 :a-val (lazy-seq-2 :b-val (lazy-seq-3 :c-val)))

ghadi 19:04:55

are you transforming what you read from transit? using eduction?

adamfrey 19:04:28

no, it's being read just like this with no further processing

(with-open [input-stream (ByteArrayInputStream. (.getBytes transit-str))]
     (transit/read (transit/reader input-stream :json opts)))
Does transit preserve the lazy-seq type vs. converting to a persistent list? I guess I can test that question myself

adamfrey 19:04:16

(type (<-transit-json-str (->transit-json-str (list 1 2 3))))
#_#<Class@14a50707 clojure.lang.LazySeq>
using the round-trip functions I have on hand produces a lazy seq. I wonder if I can force that to come out as a PersistentList?

ghadi 19:04:08

you can call vec on all your lazy seqs
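For arbitrarily nested data, one way to do that is a postwalk (a sketch; deep-vec is a made-up name, and turning every seq into a vector is just one choice):

(require '[clojure.walk :as walk])

;; recursively replaces every seq (lazy or not) with a vector, fully
;; realizing the structure so only concrete collection types remain
(defn deep-vec [data]
  (walk/postwalk (fn [x] (if (seq? x) (vec x) x)) data))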

adamfrey 19:04:11

Yeah, I was able to override the transit list handler to match what I wanted and now my value is serializable:

(def persistent-list-opts
  {:handlers {"list" (transit/read-handler (fn [v] (apply list v)))}})
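Plugging those opts into the reader from the earlier snippet then looks like:

(with-open [input-stream (ByteArrayInputStream. (.getBytes transit-str))]
  (transit/read (transit/reader input-stream :json persistent-list-opts)))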

adamfrey 19:04:42

thanks for your help, all