Fork me on GitHub
#clojure
<
2020-12-03
>
devn07:12:40

Observability for Clojure applications, state of the art: What is it?

simongray09:12:15

#observability

chrisblom10:12:09

is there a similar facade for micrometer?

ghadi14:12:46

I love Honeycomb for tracing

ghadi15:12:52

Micrometer is ok for metrics

ghadi15:12:00

OpenTelemetry metrics are too new, but the tracing side is ok

ghadi15:12:43

Dropwizard/CodaHale are showing age

mpenet16:12:04

honeycomb is quite nice, but it can get expensive at scale. But it might not matter in that case.

mpenet08:12:36

micrometer is a good solution, it's quite easy to work with via interop

mpenet08:12:01

if you want to go the extra mile you can also use something like opentelemetry

Elad Gur16:12:54

Hey, What is the correct way to add a timeout to channels? I've tried to add a naive implementation using an async/timeout but got huge spikes with memory increase Then found out that I'm not the only one to struggle this (see here https://clojure.atlassian.net/browse/ASYNC-225)

ghadi16:12:40

can you post your naive implementation?

Elad Gur17:12:03

(defmacro <?t
  "Read a value from a channel (using <!) and if the value is Throwable, throw it."
  ([channel]
   `(<?t ~channel default-take-timeout))
  ([channel timeout-msecs]
   `(let [in-chan# ~channel
          timeout-chan# (async/timeout ~timeout-msecs)
          [v# c#] (async/alts! [in-chan# timeout-chan#])]
      (if (= c# in-chan#)
        (<?* v#)
        (throw (ex-info timeout-msg {:cause timeout-msg
                                     :timeout ~timeout-msecs}))))))

hiredman16:12:09

There is a #core-async channel, but timeout channels and alt is the way to do it

hiredman17:12:00

async-225 is fairly trivial to work around (async/go (async/<! (async/timeout whatever)))

potetm17:12:59

is it tho?

potetm17:12:21

does timeout chan sharing only happen in a single go block?

potetm17:12:20

the underlying issue is that prolifically using timeouts increase CPU overhead

potetm17:12:32

and you cannot close timeouts because they are shared

hiredman17:12:54

you are misunderstanding

hiredman17:12:21

the channel returned from (async/go (async/<! (async/timeout whatever))) is a fresh channel that while close whenever the timeout does

hiredman17:12:41

I doubt prolific timeout usage is the root cause of your increased cpu usage

hiredman17:12:31

or I should say, I doubt it is the sharing of timeouts that is the root cause

hiredman17:12:22

my guess is it is the fundamental global nature of timeouts keeping objects alive, filling your heap, causing the gc run all the time spiking your cpu usage

potetm17:12:57

Yeah, I’ve not looked into it, but the ticket says that timeout checks caused CPU usage.

potetm17:12:07

(I hope that’s not the case, but that’s what it says.)

hiredman17:12:12

the thing about timeout channels is there is a global singleton scheduler that is doing the timing out, so it keeps alive a reference to each timeout channel until it closes

hiredman17:12:47

and the nature of go blocks is the code is transformed into callbacks are registered on channels

hiredman17:12:40

so if you wait on a timeout you attach a closure to the timeout that closes over all the data your code references

hiredman17:12:44

and that callback may still be live (reachable) because of the timeout even when you stop waiting on the timeout, because of how channels do bookkeeping

hiredman17:12:32

so my guess is the issue both you, and the guy in the ticket are having, has nothing to do with the shared nature of timeouts, and everything with the gc implications

hiredman17:12:12

at work are have custom sort of weak reference channel implementation we use to help combat this kind of thing

hiredman17:12:30

and in fact towards the end of the ticket they even talk about it alleviating gc pressure

noisesmith17:12:59

yeah, theres a seeming "inversion", where the channel becomes the gc root holding onto all the objects in the code waiting on the channel. it is clearly counterintuitive, and I could see how that could easily lead to misunderstanding the source of a problem like this

potetm17:12:16

So the sharing comes into play if you try to manually clean up for yourself.

potetm17:12:37

It’s not clear to me what the actual underlying problem is. As you say, it could be GC. However, if you bork the timeout check, you can introduce redundant checks in the DelayQueue.

potetm17:12:46

(I know because I’ve done it.)

potetm17:12:14

They say they confirmed that timeout checks were a problem. I have no idea if that’s true.

hiredman17:12:11

it is absolutely the gc, because the cpu usage from increased gc is what caused them to start closing shared timeout channels in the first place, which introduced further breakage

potetm17:12:52

they probably assumed closing the chan would stop the checking?

hiredman17:12:18

but why did they looking for a problem to solve in the first place?

hiredman17:12:40

because they had metrics somewhere that showed increased cpu usage

hiredman17:12:02

so they started doing things like closing channels, and looking at the delayqueue, etc

potetm17:12:04

> Using timeout channels (with long timeouts of 2 minutes) without closing them solved our core problem (too many TCP sockets in CLOSE_WAIT state) > but immediately caused the application to use approximately 10%+ CPU and way more memory than before.

potetm17:12:16

which is in line with gc probs

potetm17:12:24

> After analyzing the problem, we concluded that leaving the timeout channels as is resulted in them continuously executing the code that checks if their time had passed, even if the “safe take” go block had returned

potetm17:12:36

which is a dubious claim, but not impossible

hiredman17:12:10

yeah, the whole root cause analysis in that ticket doesn't go deep enough and is bound up way to much in what they already decided is a solution based on how they go there

potetm17:12:30

Agreed. I would be interested to know how they arrived at that claim.

hiredman17:12:03

the issue is alts attach a closure to all the channels in the alts, and don't get rid of the reference once one of the alternatives is committed to

potetm17:12:53

yeah, agreed

hiredman17:12:43

I just wrote a little patch that I believe fixes that and left it as a comment on the ticket, but I'll need to remember how the contribution process works to attach a real patch and then lean on alex for a few months to a year to get the patch in to a new core.async release

😆 3
awesome 3
6
didibus23:12:04

I'm still surprised nobody started a community fork of core.async (or a whole other alternative to it). Considering I feel everyone always complain about its slow pace of change and lack of higher level error handling and all.

hiredman23:12:38

on the other hand maybe going slow is good, I am still convinced there is a leak there in alts!, but I've been trying to show a difference in memory usage before and after a patch, and I am having trouble getting anything repeatable. which could be any number of things 1. no leak 2. poor data collection 3. poor test case 4. the jvm is optimization the differences away somehow 5. the overhead of the timeout structures is dwarfing the leak of data via alts

didibus00:12:35

true true, I've found in general, I've rarely encountered deal breaker issues, and slow and steady can be better then too fast, though sometimes I'm not sure about the "steady" when it come to core libraries 😛

hiredman04:12:16

Ah, I bet it is locals clearing that mostly handles avoids the leaks, so I need to make a more complicated test case to avoid locals clearing

hiredman05:12:41

yeah, a simple go-loop doesn't do it either because of locals clearing / the way the go blocks re-use slots in the array of locals, but change it from a loop/recur to actual recursive function calls, and throw in some exception catching to extend the life time, and it leaks

3
dominicm20:12:14

woah:

(mm/measure (map inc (take 100 (range)))) => 232B
(mm/measure (walk/postwalk identity (map inc (take 100 (range))))) => "8.6 KiB"
This was an interesting discovery. Not sure what to make of it. mm is clj-memory-meter.

hiredman20:12:22

map returns a lazy seq, and range has a specifically optimized lazy seq, walking a lazy-seq realizes it

dominicm20:12:07

Oh. The map hasn't been realized at that point. I see.

dominicm20:12:14

Duh, needs more doall.

dominicm20:12:59

Feels less mind blowing now. Lazy sequences always ruin profiling :)

Evan Bowling20:12:33

Hadn’t heard of clj-mm, this is cool

dominicm20:12:42

It's been helping me out with a production issue :). We handle analytical-style data and clojure.walk is somehow "leaking" memory. I thought I had a repro, but evidently not.

dominicm20:12:48

Okay, this is actually it:

(let [x [[]]]
    (prn (mm/measure x)) => "304 B"
    (prn (mm/measure (walk/postwalk identity x)))) => "488 B"
It has to be a nested vector, [1] won't do it, nor will [()]. It's something to do with into, but I haven't figured out what.

dominicm20:12:33

(let [x [[:a]]]
    (prn (mm/measure x))
    (prn (mm/measure (into [] x))))
Also has the memory leak.

andy.fingerhut21:12:44

FYI, if you ever want to dig into the details of these calculations, with pictures of the JVM objects involved and pointers between them, there is this tool: https://github.com/jafingerhut/cljol

dominicm21:12:35

Oh, that would have been super useful.

dominicm21:12:14

Although, this is a 350MB data structure when it hasn't been exploded by postwalk, so it may still have needed some help :p

andy.fingerhut21:12:27

You can also give it a sequence of "root objects", and it will visually show any same-object-in-memory sub-structures shared between them as the same node in the drawing, versus different objects in memory are always different nodes in the drawing.

dominicm21:12:47

Seems like there's overlap with visualvm, but REPL friendly?

dominicm21:12:56

I've been wanting a way to "send" something to visualvm while working on this.

dominicm21:12:09

I've resorted to creating a custom type to hold what I'm trying to measure.

andy.fingerhut21:12:18

Yeah, cljol is best for small structures that you want to see all of. It is possible to tweak the graph it constructs in memory to prune it to depth 2, depth 3, etc., or whatever kinds of pruning you want, but you typically don't ever want to draw a graph with hundreds of JVM objects.

andy.fingerhut21:12:07

There might be overlap with visualvm -- cljol was something I put some hobby time into hacking on, for fun, without attempting to compare it to other tools.

Alex Miller (Clojure team)20:12:06

into does transient->persistent, might affect what's held in fields

dominicm21:12:53

That's exactly the problem. A new root is being established, which is only for that vector. So it's wasting loads of memory (we have ~500,000 vectors in this data structure). Obviously we shouldn't be postwalking it, but now I'm determined to dig into why there's a leak.

dominicm21:12:11

(persistent! (transient [[]])) is bigger than [[]] by 184 Bytes.

dominicm21:12:35

@U064X3EF3 does this warrant a jira?

dominicm21:12:06

https://clojure.atlassian.net/browse/CLJ-2541 maybe there's already one, but it was concerned with CPU and not Mem.

dominicm21:12:07

Because it clones the empty root array, whereas before it pointed at the common one in clojure.lang.PersistentVector/EMPTY.

dominicm21:12:29

So, essentially /EMPTY is being cloned creating arbitrary overhead.

Alex Miller (Clojure team)21:12:04

so this is specific only to the case of an empty vector?

Alex Miller (Clojure team)21:12:23

seems like a non-empty vector wouldn't have the benefit of reusing the special EMPTY vector - do you see that in comparing (persistent! (transient [[1]])) and [[1]] ?

dominicm21:12:11

user=> (mm/measure (persistent! (transient [[1]])))
"520 B"
user=> (mm/measure [[1]])
"336 B"
Yeah. It's more general than empty vectors. We picked up on this because we have keys like [:a [:b :c]] in maps.

Alex Miller (Clojure team)21:12:05

I guess I don't understand your explanation above then

dominicm21:12:00

Maybe I'm not understanding well. For [1] there's a root set of EMPTY. I assume vectors are red/black tries or something?

Alex Miller (Clojure team)21:12:09

pretty sure the cloning is important to avoid having the new persistent thing refer into the old transient thing

dominicm21:12:36

Yeah, I think so.

Alex Miller (Clojure team)21:12:01

not red/black, this is the same HAMT thing used in maps

dominicm21:12:29

My understanding of that part is vague hand wavey stuff :)

dominicm21:12:49

EMPTY was wrong. It's actually EMPTY_NODE.

dominicm21:12:59

So, [[]] has a root set to clojure.lang.PersistentVector/EMPTY_NODE. But the one that's been transient'd has a "brand new root".

dominicm21:12:33

If I do (def a [[]]) and (def b [[]]) both of those vectors will share a root, EMPTY_NODE. Once it goes through a transient/persistent cycle, there's a new root.

dominicm21:12:41

Which has been cloned.

Alex Miller (Clojure team)21:12:24

yeah, I get it. it's cloned at that point so it can be potentially edited in place. the constraints around time to do the p->t or t->p changes are important too. seems like you could potentially detect this case on t->p and use EMPTY_NODE

Alex Miller (Clojure team)21:12:03

but it's a judgement call whether the perf of that op or the loss of the memory optimization are more important

Alex Miller (Clojure team)21:12:19

would be ok by me to file a ticket. I guess for it to rise up the awareness list I'd want to know how it's actually affecting you. presumably you have these in volume for it to be mattering

dominicm21:12:27

Yeah. I get that. I can raise a jira. We can fix this short term by migrating our keys to lists rather than vectors, and probably drop postwalk long-term.

dominicm21:12:33

This causes our data structure to grow from 373.1MiB to 842.3MiB. It's probably got around 640,000 vectors in it.

dominicm21:12:59

We've been having OOMs in prod and tracked it back to this.

Alex Miller (Clojure team)22:12:25

well that's the kind of info that will catch Rich's eye, so definitely include that

andy.fingerhut22:12:47

Please send me a link to the JIRA with any steps you have to reproduce. I have looked at Clojure vector and transient implementations quite a bit, and may have seen this before, or at least be in a position to quickly analyze it.

Alex Miller (Clojure team)22:12:23

I guess I'm still a little puzzled as to what's different in the non-empty case

Alex Miller (Clojure team)22:12:58

you don't need to figure that out to log the jira, but include that info above on [[1]]

dominicm22:12:25

I'm puzzled about that too. It's a bit beyond me :p

dominicm22:12:24

The repro is very small :)

Alex Miller (Clojure team)22:12:22

there are only diffs up to [[ .. 32 elem ]]

dominicm22:12:48

Probably changes the root at that point.

andy.fingerhut00:12:49

Makes sense. I added a comment to the ticket. It would be a slightly-more-than-trivial change to the TransientVector class to avoid the need to do this copying of the empty root node, I think, unless there is an approach I'm not thinking of yet that is a smaller change.

Alex Miller (Clojure team)00:12:05

I think you have to do that

andy.fingerhut15:12:30

I added a patch that I've used collection-check and a few small manual tests to test, to the CLJ-2594 ticket, plus a link to a small git repo I made with code for drawing pictures of various persistent and transient vectors that help to see the behavior before and after.

dominicm21:12:47

@U0CMVHBL2 thank you for digging into this.

andy.fingerhut21:12:43

No worries. It was something that I may have even noticed before and wondered about, in my digging into the implementation of persistent and transient vectors a while ago, so right up there in my wheel house

schmidt7320:12:44

So I have a mixed project consisting of Java/Clojure code. It works wonderfully on Linux, but on Windows I run into trouble with the Java bit.

schmidt7320:12:15

(defproject guidescan-web "2.0"
  :description "Version 2.0 of the Guidescan website."
  :url ""
  :author "Henri Schmidt"
  :dependencies [[org.clojure/clojure "1.10.1"]
                 [org.clojure/data.csv "1.0.0"]
                 [com.github.samtools/htsjdk "2.23.0"]
                 [com.stuartsierra/component "1.0.0"]
                 [http-kit "2.4.0"]
                 [compojure "1.6.2"]
                 [ring/ring-defaults "0.3.2"]
                 [cheshire "5.10.0"]
                 [failjure "2.0.0"]
                 [selmer "1.12.28"]
                 [com.taoensso/timbre "4.10.0"]
                 [org.clojure/tools.cli "1.0.194"]
                 [seancorfield/next.jdbc "1.1.610"]
                 [honeysql "1.0.444"]
                 [com.h2database/h2 "1.4.197"]
                 [org.postgresql/postgresql "42.2.5"]
                 [com.mchange/c3p0 "0.9.5.5"]
                 [ring "1.8.1"]]
  :profiles {:uberjar {:aot :all}
             :dev {:resource-paths ["test/resources"]
                   :source-paths ["src" "dev" "test"]
                   :java-source-paths ["src/java"]
                   :javac-options ["-target" "1.8" "-source" "1.8"]
                   :repl-options {:init-ns reload}
                   :dependencies [[org.clojure/tools.namespace "1.0.0"]]}
             :prod {:main guidescan-web.core
                    :jar-name "guidescan.jar-THIN"
                    :uberjar-name "guidescan.jar"
                    :java-source-paths ["src/java"]
                    :javac-options ["-target" "1.8" "-source" "1.8"]}
             ;;;; Scripts for various maintenance tasks
             :create-gene-db {:source-paths ^:replace ["scripts"]
                              :uberjar-name "create-gene-db.jar"
                              :jar-name "create-gene-db.jar-THIN"
                              :main create-gene-db}
             :add-organism   {:source-paths ^:replace ["scripts"]
                              :uberjar-name "add-organism.jar"
                              :jar-name "add-organism.jar-THIN"
                              :main add-organism}})

schmidt7320:12:03

Anyone have any ideas?

Alex Miller (Clojure team)20:12:26

you haven't said what the trouble is

schmidt7320:12:39

Sorry, let me attach that.

schmidt7321:12:04

Caused by: java.lang.ClassNotFoundException: guidescan.algo.RabinKarp
        at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:471)
        at clojure.lang.DynamicClassLoader.findClass(DynamicClassLoader.java:69)
        at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:588)
        at clojure.lang.DynamicClassLoader.loadClass(DynamicClassLoader.java:77)
        at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:521)
        at java.base/java.lang.Class.forName0(Native Method)
        at java.base/java.lang.Class.forName(Class.java:398)
        at clojure.lang.RT.classForName(RT.java:2211)
        at clojure.lang.RT.classForNameNonLoading(RT.java:2224)
        at guidescan_web.genomics.resolver$eval5366$loading__6721__auto____5367.invoke(resolver.clj:1)
        at guidescan_web.genomics.resolver$eval5366.invokeStatic(resolver.clj:1)
        at guidescan_web.genomics.resolver$eval5366.invoke(resolver.clj:1)
        at clojure.lang.Compiler.eval(Compiler.java:7177)
        at clojure.lang.Compiler.eval(Compiler.java:7166)
        at clojure.lang.Compiler.load(Compiler.java:7636)
        ... 146 more
Suppressed exit

schmidt7321:12:20

Essentially, I get a stacktrace telling me that it can't find my Java code.

schmidt7321:12:44

But I have explicitly specified it and everything is working on linux.

dpsutton21:12:02

are there any classfiles from your java sources?

schmidt7321:12:27

No. They do not appear to be getting compiled.

dpsutton21:12:15

so on windows it doesn't seem to be honoring your java source paths. if you delete those class files on the linux machine does it still work?

schmidt7321:12:58

@dpsutton let me check, but I imagine it would simply recompile them.

schmidt7321:12:22

@dpsutton the issue I think is that it's not honoring my java source paths.

dpsutton21:12:51

so it failed to compile them on linux? if so we've eliminated the os from the problem (or at least the problem as far as you've gotten so far)

schmidt7321:12:59

But I make sure to run lein with-profile dev repl explicitly.

schmidt7321:12:04

No it did not fail to compile them on linux.

schmidt7321:12:26

Okay so I think I got it working... Still it was rather strange behaviour.

schmidt7321:12:38

See: Having one source root contain another (e.g. src and src/java) can cause obscure problems. from https://github.com/technomancy/leiningen/blob/stable/doc/MIXED_PROJECTS.md#polyglot-clojure-java-projects-with-leiningen