So for next.jdbc, I am running into an issue where a connection cannot be constructed: it attempts to set a property to a default value before setting the value I supply in the dbspec, and this fails because the driver I am using, the Google Cloud BigQuery driver, doesn't provide a method to get a default for that property, in this case ProjectID. Is there a way to get it to not attempt to set defaults before constructing the 'real' properties?
nvm, this appears to be a bug in the driver
Also: #sql is more likely to get targeted help for next.jdbc etc.
thanks, i had looked for a jdbc channel but it was archived
I set the topic on that to "Archived. Join #sql instead." to help future folks. The description already had that but it was less visible I think.
I create lots of big PDFs in Clojure, and the generation always follows a pattern, which I implement abstractly like below.
So I generate a seq of byte arrays (the bytes are the PDF content), most often in various map calls.
This runs out of memory very quickly if I have, let's say, 1000 files of 100 MB each.
But in reality I only need to have one byte array in memory at any given point in time; after it has been written to disk, it could be discarded.
(->>
 (map-indexed (fn [index _]
                {:index index
                 :bytes (byte-array (repeat 100000000 1))})
              (range 1000))
 (run!
  (fn [info]
    (with-open [out (io/output-stream (format "/tmp/%s.bin" (:index info)))]
      (io/copy (:bytes info) out)))))
I know that I could get this by doing the full file generation inside the run!, but that doesn't seem idiomatic Clojure, as it is procedural.
(run!
 (fn [index]
   (println :index index)
   (let [info {:index index
               :bytes (byte-array (repeat 100000000 1))}]
     (with-open [out (io/output-stream (format "/tmp/%s.bin" (:index info)))]
       (io/copy (:bytes info) out))))
 (range 1000))
Is there any pattern to solve this which is more idiomatic Clojure?
I wonder about a "lazy sequence" which somehow "forgets" a value after it was consumed once....
you want to avoid holding the head of a large sequence, which will prevent the whole thing from being garbage collected. run! is a good approach to do so
it achieves this by using reduce instead of ->> - you could also do the same
is the first example here OOMing on chunking?
Not sure, I just think that at one point in time I will have 1000 * 100 000 000 bytes in memory. I don't think that using lazy sequence prevents this.
i'm claiming that using the first form in this thread
(->>
 (map-indexed (fn [index _]
                {:index index
                 :bytes (byte-array (repeat 100000000 1))})
              (range 1000))
 (run!
  (fn [info]
    (with-open [out (io/output-stream (format "/tmp/%s.bin" (:index info)))]
      (io/copy (:bytes info) out)))))
you will probably have only ~32 byte arrays in memory at once due to chunking of the lazy sequence
This is exactly one of my questions.... But even if so, ideally I want "at max 1" .. not 32
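For reference, the chunking mentioned here can be observed with a simple counter. This is a minimal sketch, not code from the thread: `count-realized` and `realized` are made-up names, it uses plain `map` rather than `map-indexed`, and it assumes a Clojure version in which `(range n)` produces a chunked seq.

```clojure
;; Count how many elements a lazy `map` realizes when only the first is requested.
(defn count-realized
  "Returns how many times the mapping fn ran after taking only the first element
  of (map f (range n))."
  [n]
  (let [realized (atom 0)]
    (first (map (fn [x] (swap! realized inc) x) (range n)))
    @realized))

;; Asking for one element realizes a whole chunk of the underlying range.
(count-realized 1000)
```

If each element were a 100 MB byte array, that whole chunk would be allocated before the first file is written.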
ah. gotcha. i bet @hiredman has a transducing function that does this immediately.
I don't think it is a specific transducing function, but just using transducers
map-indexed has a transducer arity
transducers tend to be better than using lazy-seqs when you really care about when something happens, in this case you care because you can't have more than one in memory at once
I think transducers are indeed the answer:
(def process-xf
  (comp
   (map (fn [index]
          {:index index
           :bytes (byte-array (repeat 100000000 1))}))))
(transduce
 process-xf
 (completing
  (fn [_ info]
    (println :index (:index info))
    (with-open [out (io/output-stream (format "/tmp/%s.bin" (:index info)))]
      (io/copy (:bytes info) out))))
 nil
 (range 1000))
This runs through while using "stable memory", it seems (but in parallel, can that be?). Is it correct to assume that this then holds either "1" or at max (max-cores) of my big byte-arrays in memory?
If so, then this is the general pattern I was looking for.
i don't believe there is anything that fans out to cores. and it can garbage collect but it might have more than one in memory at once
I see in htop that multiple cores are active... but do get results in order....
That was an "artifact" of htop... The run! version shows the same, and that's not parallel for sure.
Right, "it can garbage collect", but of course it might not guarantee "only 1" in memory.
same as the run! version.
But my "first" version does not allow garbage collection, as I "hold on to the head"....
I learned something... and a potential use case for transducers.
doesn't run! on an eduction -as opposed to a map-indexed chunked seq- do the trick here?
yes, as well. Docu says:
;; This will run out of memory eventually,
;; because the entire seq is realized,
;; because the head of the lazy seq is retained.
(let [s (range 100000000)]
  (do (apply print s) (first s)))
;; This iterates through the lazy seq without realizing the seq.
(let [s (eduction identity (range 100000000))]
  (do (apply print s) (first s)))
"Docu says"?
yeah it's very misleading
why ?
user=> (let [x (eduction identity (range 100))] (map System/identityHashCode [(seq x) (seq x)]))
(602830277 296204898)
user=> (let [x (range 100)] (map System/identityHashCode [(seq x) (seq x)]))
(1938259481 1938259481)
they are different seqs in the eduction case
it's iterating through the first seq, throwing it away, then creating another seq
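The same point can be seen by counting side effects: an eduction re-applies its transformation every time it is reduced or iterated, so nothing is cached. A minimal sketch (`xform-runs` and `runs` are made-up names for illustration):

```clojure
(defn xform-runs
  "Reduces the same eduction twice and returns how often the mapping fn ran."
  [n]
  (let [runs (atom 0)
        e    (eduction (map (fn [x] (swap! runs inc) x)) (range n))]
    (reduce + 0 e)   ; first pass runs the mapping fn n times
    (reduce + 0 e)   ; second pass runs it n times again
    @runs))

(xform-runs 5) ; => 10
```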
you can make this work easily with seqs as mentioned above 1. write a fn that takes one pdf 2. don't hold the entire collection of pdfs
the clojuredocs comment is wrong
This does work, which is my case. I can write 1000 files of one GB each. So it does garbage collection in between.
(eduction
 (map (fn [index]
        {:index index
         :bytes (byte-array (repeat 1000000000 1))}))
 (map
  (fn [info]
    (println :index (:index info))
    (with-open [out (io/output-stream (format "/tmp/%s.bin" (:index info)))]
      (io/copy (:bytes info) out))))
 (range 1000))
you do not need to use eduction
plain seq/coll fns will work
eduction also stops chunking
so if you took the original code and replaced (range 1000) with (eduction identity (range 1000)) there is a good chance it would also stop running out of memory
I suspect @ghadi is keying off the mentioning of head holding, but that doesn't sound like what is happening here, it looks like chunking is causing 32 or so gigabyte size byte arrays to try to exist in memory at once. you can do stuff (like the eduction thing with identity) to try and avoid chunking (range is chunked so you have to unchunk which eduction does, but things like map and even for pass chunking through) but the best way to have complete control of what is realized and not is using transducers (and I would say processing with transduce, using transducers in eduction has complicating factors).
^ yes I didn't catch the chunking concerns
what hiredman said
Fwiw what I was proposing was
(run! F (eduction (map-indexed G) Xs))
as opposed to
(run! F (map-indexed G Xs))
with the assumption that the resource consumption happens in G.
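A small trace makes the difference between the two forms visible. This is only a sketch under assumptions: `event-order`, `produce`, and `consume` are hypothetical names standing in for G and F, plain `map` is used instead of `map-indexed`, and the input is a three-element range, which fits in a single chunk.

```clojure
(defn event-order
  "Returns the order of produce/consume events when run! walks
  (make-coll produce (range 3))."
  [make-coll]
  (let [events  (atom [])
        produce (fn [x] (swap! events conj [:produce x]) x)
        consume (fn [x] (swap! events conj [:consume x]))]
    (run! consume (make-coll produce (range 3)))
    @events))

;; Lazy, chunked seq: the whole chunk is produced before anything is consumed.
(event-order (fn [g xs] (map g xs)))
;; => [[:produce 0] [:produce 1] [:produce 2] [:consume 0] [:consume 1] [:consume 2]]

;; Eduction: run! reduces it directly, so each item is consumed right after it is produced.
(event-order (fn [g xs] (eduction (map g) xs)))
;; => [[:produce 0] [:consume 0] [:produce 1] [:consume 1] [:produce 2] [:consume 2]]
```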
As a reference, this is what eduction's reduce looks like:
(reduce [_ f init]
  ;; NB (completing f) isolates completion of inner rf from outer rf
  (transduce xform (completing f) init coll))
i think the signature of f that run! expects gets a bit funky with transducers right?
wdym?
run! expects a function that takes a single arg whereas reducing functions treat that as the completion arity. So it gets a bit wonky i thought. i swapped back to transducer rather than run for this
but you baked the transducer into the eduction (which feels weird to me, but works here)
that doesn't track for me, I never had problems run! ing eductions
yes. i was thinking (run! (xf rf) seq) which is what i wanted to reach for rather than (run! f (eduction xf seq))
yeah that wouldn't work because run! goes into a plain reduce
https://ask.clojure.org/index.php/13153/could-gain-another-arity-cover-common-case-feeding-eduction
As far as I understand, you're trying to share the same byte array across multiple PDFs. This is fragile and might be broken by switching from map to pmap (or any kind of parallel execution). It would be simpler to use a dedicated ByteArrayOutputStream for each file. This output is dynamically increasing on demand. Once you've processed your PDF, the stream gets garbage-collected.
Also, consider an output stream pointing to a temp file. It's slower of course but you won't saturate memory when processing PDFs in parallel
Not sharing. I want to be sure that the byte array can be garbage collected after I have written it to disk
Which run! does allow, but I was wondering about a different pattern. But the referenced Ask Clojure post looks like what I was looking for
It makes no sense to allocate an array in advance, as you don't know the exact size of the output for sure
Just use ByteArrayOutputStream if you need raw bytes. Write to it, close it and then invoke the method called .toByteArray
But I don't want to allocate. The code is simplified. I use a library which produces PDFs as bytes
perhaps the lib already supports outputting to a stream?
Are you sure there is no way to pass an output stream?
That might be a solution, indeed.
I will look for that.
In- and output streams are bread and butter for Java, there must be a method that accepts either of these
Maybe I was focusing too much on simple values, as used in Clojure, and so ended up with byte arrays
Yes, I think this was the right direction. My usage of byte-arrays was wrong, using streams instead is better.
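A rough sketch of the stream-based shape discussed above, assuming the PDF library can write to a `java.io.OutputStream`. `render-pdf!` is a hypothetical stand-in for that library call (here it just writes a few bytes), and `write-all!` is likewise a made-up name; neither comes from the thread.

```clojure
(require '[clojure.java.io :as io])

(defn render-pdf!
  "Hypothetical stand-in for a library call that writes one PDF to `out`."
  [index ^java.io.OutputStream out]
  (.write out (.getBytes (str "pdf-" index) "UTF-8")))

(defn write-all!
  "Writes n files into dir, streaming each one straight to disk so that
  no complete document has to be held in memory as a byte array."
  [dir n]
  (run! (fn [index]
          (with-open [out (io/output-stream (io/file dir (format "%s.bin" index)))]
            (render-pdf! index out)))
        (range n)))
```

With this shape the question of how many byte arrays are alive at once mostly goes away, since the content goes to the file as it is produced.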
Anyone else obsessed with layers of macro DSL's?
macros are great fun for showing off how clever you are
> Nothing says "scr*w you" like a DSL - Stuart Halloway - https://youtu.be/LEZv-kQUSi4?si=JJQp05ZKs8Q71ofl&t=483
Data is better!
I'd much rather have a linear data pipeline than DSL magic.
Like a lasagna with layers of cheesy macro compilers mapping between language and semantic domains?
6, 7 layers deep??
Nope
Not me, never... Why, ya got any?
Not me. Oftentimes my ideas that start with macros just end up as a simple function or a DSL over Clojure data
Wow very surprised by this response from lisp people
Clojure's got its peculiarities! But I suspect you might find similar opinions from Common Lisp people who have been exposed to macro spaghetti written by other people!
@gcoumessos curious about your take? for/against macros?