#core-async
2021-12-08
Al Z. Heymer13:12:53

Hey folks, I'm new to the Slack, so please forgive me if I don't follow any CoC... I was wondering if I stumbled over something weird. I recently introduced core.async to one of our services, being new to core.async myself. Basically 4 CSPs (go-loops, alts!, timeout, pipeline-blocking, ...) that request big data structures from a db via UUID, process them, and transform them into a file. Unfortunately, when working on large data (a long e-mail with attachment, "metadata", ...) we stumble across heap dumps in the logs. The service recovers (after a GC, I guess?), but I still find it quite puzzling. I managed to load a dump into Eclipse MAT and noobily found out that there were apparently a bunch of clojure.core.async.impl.timers.TimeoutQueueEntry objects, with the biggest item showing 65,893,400 (4.57%) bytes. In total that 'problem suspect' seems to be a quite large (1.3Gi) object array inside of core.async. Is there a known issue or something where the channels keep track of content somehow, even though items are <! out of the channel?

Alex Miller (Clojure team)13:12:13

when you put a value to a channel and the put can't immediately be completed, a handler wrapping the value goes into the put queue
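(For illustration, a minimal sketch with hypothetical names of what that means from user code: on an unbuffered channel with no taker, the put cannot complete immediately, so the value sits parked in the channel's put queue until something takes it.)

(require '[clojure.core.async :as a])

(def c (a/chan))              ; unbuffered: a put can't complete without a taker
(a/put! c {:big "payload"})   ; no taker yet, so a handler wrapping the value is
                              ; parked on the channel's put queue
(a/<!! c)                     ; the take completes the pending put and releases it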

Alex Miller (Clojure team)13:12:01

there was a bug fixed recently that could allow alt handlers to get stranded in channel queues so that could be a match to your description

Alex Miller (Clojure team)13:12:41

https://clojure.atlassian.net/browse/ASYNC-204 is the bug and it was fixed in core.async 1.5.640 (latest is 1.5.644)

Al Z. Heymer13:12:05

interesting! I'm gonna try the 1.5.644 version and see if the OOM still occurs

Al Z. Heymer13:12:27

Thanks for the lightspeed response btw. Really appreciate it 🙂

Alex Miller (Clojure team)13:12:23

there is also https://clojure.atlassian.net/browse/ASYNC-234, pertaining to alts and timeouts, which might be worth thinking about architecturally. I am still on the fence about whether that even is a "bug" or whether there is something that should be done.

Alex Miller (Clojure team)16:12:23

the memory is not really leaked here, just held until cleaning happens
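(A minimal sketch of the alts-plus-timeout pattern under discussion, with hypothetical names: when work-chan wins the alts!, the losing handler stays parked on the long-lived timeout channel and can keep the go block's state, and whatever it references, reachable until the timeout fires or the queue is next cleaned.)

(require '[clojure.core.async :as a])

(def work-chan (a/chan 1))

(a/go (a/>! work-chan (vec (range 1000000))))   ; a large value

(a/go
  (let [[v port] (a/alts! [work-chan (a/timeout 600000)])]
    (when (= port work-chan)
      (count v))))
;; work-chan wins here, but the handler left on the 10-minute timeout channel
;; can keep this go block's frame (and so v) alive until that timeout elapses.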

Alex Miller (Clojure team)16:12:08

I think it's very unlikely we'd want to do the nack stuff

hiredman16:12:15

I think that is a mistake, but also expected; the first patch is independent of the nack stuff, and would go a long way

Alex Miller (Clojure team)16:12:04

the reference clearing is maybe interesting (I assume "0001-ASYNC-234-reduce-leaked-memory-on-alts-not-taken.patch" is what you're referring to)

hiredman16:12:08

But, as I said, I don't consider it a complete solution, because you can still have atoms piling up, not being freed, even though they are "dead"

Alex Miller (Clojure team)16:12:19

is cleaning more aggressively in timeout chans something you thought about at all?

Alex Miller (Clojure team)16:12:51

seems like this is particularly a problem for that case

hiredman16:12:49

It is a problem for any case of a "global" service that operates like timeouts do; timeouts are just the only case of that in the core.async std lib

Alex Miller (Clojure team)16:12:21

if the channel is being used (puts/takes) it's being regularly cleaned. timeouts are unique in potentially going long periods w/o that and if there's only one it's not a big deal to spin a cleaner thread.

hiredman16:12:49

If I have some way to communicate with an rpc server that gives me a channel for replies and keeps a reference to the channel it gave me, so it can send me data that is a reply to my rpc

hiredman16:12:58

It will also have this kind of issue

Alex Miller (Clojure team)17:12:01

but if you are regularly getting replies, you are also regularly cleaning

hiredman17:12:52

it depends on how fine-grained the rpc calls (timeout creation) are; if you make lots of rpc calls for each little thing, then the rate of data coming through the rpc reply channels is low, vs if you have a single "give me the firehose" rpc call

hiredman17:12:05

like, there are already tickets where people are asking to be able to close timeout channels so they can manually manage this stuff; the wrinkle with timeout channels vs the general kind of rpc thing I mentioned is that timeout channels are shared and shouldn't be closed

hiredman17:12:54

with the rpc thing, if I am in the know, I can close the response channel in both branches of an alt to manually manage this

hiredman17:12:18

(assuming I am done with the channel)
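(A sketch of that manual management, assuming a hypothetical rpc-call that returns a per-request reply channel the rpc layer also holds on to: closing the channel in both branches of the alt signals that you are done with it.)

(require '[clojure.core.async :as a])

;; hypothetical: returns a reply channel the rpc layer keeps a reference to
(defn rpc-call [req]
  (a/promise-chan))

(a/go
  (let [resp     (rpc-call {:op :fetch})
        [v port] (a/alts! [resp (a/timeout 5000)])]
    (if (= port resp)
      (do (a/close! resp) v)               ; reply arrived, done with the channel
      (do (a/close! resp) ::timed-out))))  ; timed out, close it in this branch too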

hiredman17:12:22

with our production core.async code base, a year or two before I stared long enough at alts to see this, we had all kinds of memory issues when we first deployed, and I just sort of tossed everything and the kitchen sink at them, so it's hard to tell what fixed/helped them, but one of the things I did was say "oh, timeouts are a global reference somewhere, so maybe just limit the number of timeouts"

hiredman17:12:36

so I did that, and the other big thing is we have a sort of proxy weak-reference channel; that, combined with limiting usage of timeouts, seems to have worked around "extended reference retention" (if not leaks) in different parts of core.async (pub/sub, alts, etc)

hiredman17:12:51

which I wasn't even aware of at the time
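(For illustration, a minimal sketch of the "limit the number of timeouts" idea, with hypothetical names: one deadline channel per batch, reused across iterations, instead of a fresh (timeout ...) per take, so at most one handler is ever parked on a timeout channel.)

(require '[clojure.core.async :as a])

(defn drain-with-deadline [in ms handle]
  (let [deadline (a/timeout ms)]
    (loop []
      (let [[v port] (a/alts!! [in deadline])]
        (cond
          (= port deadline) ::timed-out
          (nil? v)          ::done            ; in was closed
          :else             (do (handle v) (recur)))))))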

Alex Miller (Clojure team)17:12:05

async-204 was definitely exacerbating this problem

hiredman17:12:21

oh, I guess we now have our own version of pub based to some degree on my patch on that issue

Alex Miller (Clojure team)17:12:12

"that issue" = ASYNC-90 ?

hiredman17:12:30

and from a prior art perspective, I know my bible thumping about concurrent ml is tiresome, but cml includes a nack combinator for events for this kind of resource management

Al Z. Heymer13:12:20

mmh... The issue still seems to persist after 1.5.644.

Al Z. Heymer14:12:31

I was reading through 234 and thought that it may be related. Interestingly, the use case seems to be quite similar. So a quick overview of what we do here: (A) puts UUIDs on an ids-chan, which is consumed by (B), a pipeline-blocking (10 threads / 20 cores). B RPCs to a db-wrapper and puts the return value onto the file-chan. The file-chan is then consumed by a loop (C) until the file-chan is closed. C then returns to the main thread. So what I'm curious about is whether the pipeline may persist items (i.e. keep them as a reference) on the to-channel. The big data structures in MAT would support this (65Mi being a large mail), as (C) takes these big items via alts! but then immediately writes them to the file. However, alts! usually selects the file-chan. The timeout, on the other hand, is the same all across.

(pipeline-blocking (processes-count env)
    file-chan
    (map (partial get-x-rpc env tenant ex-chan))
    ids-chan
    true
    #(>!! ex-chan %))

(loop [[item port] (alts! [file-chan timeout-chan ex-chan])]
  (cond
    (= port file-chan)    (when item
                            (item-fn item)
                            (recur (alts! [file-chan timeout-chan ex-chan])))
    (= port timeout-chan) (throw (TimeoutException.))
    (= port ex-chan)      (throw (AbortException.))))

(def file-chan (chan 1 (mapcat explode-specific-value)))
(def timeout-chan (timeout 600000))
(def ex-chan (promise-chan))

hiredman16:12:31

It very well could be. The issue in 234 is subtle; the way a mutable array is used for locals, and the fact that its slots tend to be overwritten in loops, often mitigates it

hiredman16:12:17

An easy check/mitigation is to, instead of sending item, send (atom item) and nil out the atom when you get it on that branch of the cond

hiredman16:12:16

My first patch on 234 basically does the atom thing automatically

hiredman16:12:18

I don't consider that a complete solution, because it still potentially leaks the atoms

hiredman16:12:36

But an atom is potentially much smaller than a big chunk of data
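(A standalone sketch of the atom-wrapping mitigation described above, with hypothetical names: the producer puts (atom item) instead of item, and the consumer nils the atom out as soon as it has the value, so any handler still stranded in another channel's queue only retains an empty atom.)

(require '[clojure.core.async :as a])

(def big-chan (a/chan 1))

;; producer wraps the big value in an atom before putting it
(a/go (a/>! big-chan (atom {:big "payload"})))

;; consumer derefs the atom and clears it before doing any work
(let [[box port] (a/alts!! [big-chan (a/timeout 1000)])]
  (when (and (= port big-chan) (some? box))
    (let [v @box]
      (reset! box nil)          ; drop the reference to the big value
      (println "processing" v))))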