This page is not created by, affiliated with, or supported by Slack Technologies, Inc.
2021-12-08
Channels
- # adventofcode (49)
- # babashka (21)
- # babashka-sci-dev (12)
- # beginners (250)
- # calva (23)
- # cider (6)
- # clj-kondo (11)
- # cljsrn (8)
- # clojure (129)
- # clojure-europe (50)
- # clojure-france (8)
- # clojure-italy (6)
- # clojure-nl (14)
- # clojure-romania (7)
- # clojure-spec (21)
- # clojure-uk (3)
- # clojurescript (17)
- # conjure (1)
- # core-async (40)
- # core-logic (24)
- # core-typed (7)
- # datavis (2)
- # datomic (2)
- # emacs (29)
- # fulcro (10)
- # graalvm (6)
- # graphql (24)
- # gratitude (6)
- # jobs (1)
- # lsp (9)
- # malli (6)
- # missionary (1)
- # nextjournal (46)
- # off-topic (2)
- # other-languages (3)
- # pathom (5)
- # portal (2)
- # re-frame (37)
- # remote-jobs (1)
- # shadow-cljs (15)
- # spacemacs (9)
- # testing (6)
- # tools-deps (13)
- # vim (32)
- # xtdb (16)
• Hey folks, I'm new to this Slack, so please forgive me if I don't follow the CoC in some way... I think I've stumbled over something weird. I recently introduced core.async to one of our services, being new to core.async myself. Basically four CSPs (go-loops, alts!, timeout, pipeline-blocking, ...) that request big data structures via UUID from a db, process them, and transform them into a file. Unfortunately, when working on large data (a long e-mail with attachments, "metadata", ...) we see heap dumps in the logs. The service recovers (after a gc, I guess?), but I still find it quite puzzling. I managed to load a dump into Eclipse MAT and found (noob that I am) a bunch of clojure.core.async.impl.timers.TimeoutQueueEntry instances, the biggest item showing 65,893,400 bytes (4.57%). In total, that 'problem suspect' seems to be a quite large (1.3Gi) object array inside of core.async. Is there a known issue where the channels keep track of content somehow, even though items have been <!'d out of the channel?
when you put a value onto a channel and the put can't immediately be completed, a handler wrapping the value goes into the put queue
there was a bug fixed recently that could allow alt handlers to get stranded in channel queues so that could be a match to your description
https://clojure.atlassian.net/browse/ASYNC-204 is the bug and it was fixed in core.async 1.5.640 (latest is 1.5.644)
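For reference, a minimal sketch of the version bump, assuming a deps.edn-based project (Leiningen users would bump the corresponding :dependencies entry instead):

```clojure
;; deps.edn — depend on a core.async release that includes the ASYNC-204 fix
{:deps {org.clojure/core.async {:mvn/version "1.5.644"}}}
```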
interesting! I'm gonna try the 1.5.644 version and see if the OOM still occurs
Thanks for the lightspeed response btw. Really appreciate it 🙂
there is also https://clojure.atlassian.net/browse/ASYNC-234 pertaining to alts and timeouts and might be worth thinking about architecturally. I am still on the fence about whether that even is a "bug" or whether there is something that should be done.
the memory is not really leaked here, just held until cleaning happens
I think it's very unlikely we'd want to do the nack stuff
I think that is a mistake, but also expected; the first patch is independent of the nack stuff and would go a long way
the reference clearing is maybe interesting (I assume "0001-ASYNC-234-reduce-leaked-memory-on-alts-not-taken.patch" is what you're referring to)
But, as I said, I don't consider it a complete solution, because you can still have atoms piling up, not being freed even though they are "dead"
is cleaning more aggressively in timeout chans something you thought about at all?
seems like this is particularly a problem for that case
It is a problem for any case of a "global" service that operates the way timeouts do; timeouts are just the only case of that in the core.async std lib
if the channel is being used (puts/takes) it's being regularly cleaned. timeouts are unique in potentially going long periods w/o that and if there's only one it's not a big deal to spin a cleaner thread.
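As a rough illustration of that cleaner-thread idea (hypothetical code, and it leans on an implementation detail: touching a channel via put!/take!/poll! is what gives it a chance to sweep abandoned alts! handlers out of its queues):

```clojure
(require '[clojure.core.async :as a])

;; Hypothetical cleaner thread. Only safe for channels that never carry
;; values (timeout-style channels, which only ever close): poll! is a
;; non-blocking take, so on a normal channel it would consume real data.
(defn start-cleaner
  [chans interval-ms]
  (doto (Thread.
         (fn []
           (while true
             (doseq [ch chans]
               (a/poll! ch))            ; touching the channel triggers cleanup
             (Thread/sleep (long interval-ms)))))
    (.setDaemon true)
    (.start)))
```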
If I have some way to communicate with an rpc server that gives me a channel for replies and keeps a reference to the channel it gave me, so it can send me data that is a reply to my rpc
but if you are regularly getting replies, you are also regularly cleaning
it depends on how fine-grained the rpc calls (timeout creation) are; if you make lots of rpc calls for each little thing, then the rate of data coming through each rpc reply channel is low, vs if you have a single "give me the firehose" rpc call
like, there are already tickets where people are asking to be able to close timeout channels so they can manually manage this stuff, the wrinkle with timeout channels vs the general kind of rpc thing I mentioned is timeout channels are shared and shouldn't be closed
with the rpc thing, if I am in the know, I can close the response channel in both branches of an alt to manually manage this
with our production core.async code base, a year or two before I stared long enough at alts to see this, we had all kinds of memory issues when we first deployed, and I just sort of tossed everything and the kitchen sink at them, so hard to tell what fixed/helped them, but one of the things I did was say "oh, timeouts are a global reference somewhere, so maybe just limit the number of timeouts"
so I did that, and the other big thing is we have a sort of proxy weak reference channel; that, combined with limiting usages of timeouts, seems to have worked around "extended reference retention" (if not leaks) in different parts of core.async (pub/sub, alts, etc)
async-204 was definitely exacerbating this problem
oh, I guess we now have our own version of pub based to some degree on my patch on that issue
"that issue" = ASYNC-90 ?
and from a prior art perspective, I know my bible thumping about concurrent ml is tiresome, but cml includes a nack combinator for events for this kind of resource management
mmh... The issue still seems to persist after 1.5.644.
I was reading through 234 and thought that it may be related; interestingly, the use case seems to be quite similar. A quick overview of what we do here: (A) puts UUIDs onto an ids-chan, which is consumed by (B), a pipeline-blocking (10 threads / 20 cores). (B) RPCs to a db-wrapper and puts the return value onto the file-chan. The file-chan is then consumed by a loop (C) until the file-chan is closed, after which (C) returns to the main thread. What I'm curious about is whether the pipeline may retain items (i.e. keep a reference to them) on the to-channel. The big data structures in MAT would support this (65Mi being a large mail), as (C) takes these big items via alts! but then immediately writes them to the file. However, alts! usually selects the file-chan; the timeout, on the other hand, is the same across all iterations.
(pipeline-blocking (processes-count env)
file-chan
(map (partial get-x-rpc env tenant ex-chan))
ids-chan
true
#(>!! ex-chan %))
(loop [[item port] (alts! [file-chan timeout-chan ex-chan])]
  (cond
    (= port file-chan)    (when item
                            (item-fn item)
                            (recur (alts! [file-chan timeout-chan ex-chan])))
    (= port timeout-chan) (throw (TimeoutException.))
    (= port ex-chan)      (throw (AbortException.))))
(def file-chan (chan 1 (mapcat explode-specific-value)))
(def timeout-chan (timeout 600000))
(def ex-chan (promise-chan))
It very well could be. The issue in 234 is subtle; a mutable array is used for locals, and the way slots in it tend to be overwritten in loops often mitigates it
An easy check/mitigation is, instead of sending item,
send (atom item)
and nil out the atom when you get it on that branch of the cond
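Spelling that suggestion out as a sketch (hypothetical code; file-chan, timeout-chan, ex-chan, and item-fn are from the snippets above, and in the pipeline setup the boxing would happen inside the map fn given to pipeline-blocking):

```clojure
(require '[clojure.core.async :as a])

;; Producer side: box the big value so the channel (and any stranded
;; TimeoutQueueEntry) only ever holds a small atom, not the value itself.
(a/>!! file-chan (atom item))

;; Consumer side: unbox, then nil the atom so the big value is no longer
;; reachable even if the wrapper lingers in a queue somewhere.
(let [[boxed port] (a/alts!! [file-chan timeout-chan ex-chan])]
  (when (and (= port file-chan) boxed)
    (let [item @boxed]
      (reset! boxed nil)
      (item-fn item))))
```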