Fork me on GitHub
#core-async
<
2023-02-27
>
danielcompton16:02:04

I deployed a change over the weekend to add OpenTelemetry tracing to our core.async based webservers. We use binding to pass tracing context between goroutines. A tiny fraction of requests (0.0004%) are throwing exceptions IllegalStateException: Pop without matching push. It seems like this might be a race condition or threading issue. The stack trace is throwing at:

java.lang.IllegalStateException: Pop without matching push
    at clojure.lang.Var.popThreadBindings(Var.java:336)
    at clojure.core$pop_thread_bindings.invokeStatic(core.clj:1948)
    at clojure.core$pop_thread_bindings.invoke(core.clj:1948)
    at server.monitoring.opentelemetry$wrap_open_telemetry$fn__$fn__$state_machine__$fn__.invoke(opentelemetry.clj:101)
    at server.monitoring.opentelemetry$wrap_open_telemetry$fn__$fn__$state_machine__.invoke(opentelemetry.clj:101)
    at clojure.core.async.impl.ioc_macros$run_state_machine.invokeStatic(ioc_macros.clj:978)
    at clojure.core.async.impl.ioc_macros$run_state_machine.invoke(ioc_macros.clj:977)
    at clojure.core.async.impl.ioc_macros$run_state_machine_wrapped.invokeStatic(ioc_macros.clj:982)
    at clojure.core.async.impl.ioc_macros$run_state_machine_wrapped.invoke(ioc_macros.clj:980)
    at clojure.core.async.impl.ioc_macros$take_BANG_$fn__.invoke(ioc_macros.clj:991)
    at clojure.core.async.impl.channels.ManyToManyChannel$fn__$fn__.invoke(channels.clj:99)
    at clojure.lang.AFn.run(AFn.java:22)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at clojure.core.async.impl.concurrent$counted_thread_factory$reify__$fn__.invoke(concurrent.clj:29)
    at clojure.lang.AFn.run(AFn.java:22)
    at java.lang.Thread.run(Thread.java:833)
I’m not quite sure what to do to debug this because it happens so infrequently. I’ve tried running a load test locally but it hasn’t thrown yet. Here is the code which is using binding
(defmacro with-span [span-opts & body]
  `(let [span-opts# ~span-opts
         source#    (merge {:line ~(:line (meta &form))
                           :file ~*file*
                           :ns   ~(str *ns*)}
                           (:source span-opts#))
         span-opts# (assoc span-opts#
                           :source source#
                           :parent (or (:parent span-opts#) *current-context*))]
     (span/with-span-binding' [context# span-opts#]
       ;; We bind the context to a dynamic var so that it propagates between core.async Go blocks.
       ;; By default, OpenTelemetry propagates the context via thread locals which doesn't work with core.async.
       (binding [*current-context* context#]
         ~@body))))
Does anyone have any ideas? I can see https://clojure.atlassian.net/browse/ASYNC-170 but that was resolved a while ago. We’re using core.async 1.5.648.

Alex Miller (Clojure team)16:02:34

I assume this is used in a go block?

danielcompton17:02:51

Yep, this is used inside a go block, e.g.

(otel/with-span {:name "token"}
        (go
          (<? (pg/execute>
                {:new-query {:insert-into :token
                             :values [(update record :data $edn/write-string)]}}
                :trx trx))))

Alex Miller (Clojure team)17:02:02

I assume it is probably some variant of ASYNC-170 maybe in a different branch of the analyzer

Alex Miller (Clojure team)17:02:31

it would be helpful to me if you could put it on ask and when I do my next sweep I'll make a jira out of it

danielcompton19:02:09

Ok will do. I’m tracking down some internal stuff it could possibly be related to as well. I’ll post an update either way

hiredman20:02:14

The stacktrace there mentions wrap-open-telemetry

hiredman20:02:28

For the above with-span call, the push and pops for the bindings are all outside of the go block, so cannot be the source of the error, unless it is also wrapped in a go block somewhere (the stacktrace is from a running go block)

danielcompton20:02:14

Here’s the wrapper

(defn wrap-open-telemetry [handler>]
  (fn [req]
    (go
      (try
        (with-span (trace-http/server-span-opts (assoc req :protocol "HTTP/1.1")
                                                {:captured-request-headers
                                                 ;; We may want to capture these as custom whimsical.* Attributes
                                                 ;; (Baggage?) in the future, but this works for now.
                                                 ["w-app-version" "w-session-id" "w-version"]})
                   (let [res (<? (handler> req))]
                     (trace-http/add-server-span-response-data! res {:context *current-context*})
                     ;; There is trace-http/add-route-data!, but it sets the span name incorrectly.
                     ;; > HTTP server span names SHOULD be {http.method} {http.route} if there is a (low-cardinality)
                     ;; > http.route available.
                     ;; > 
                     (when-let [endpoint (:endpoint res)]
                       (span/add-span-data! {:context *current-context*
                                             :name (str (request-method req) " " endpoint)
                                             :attributes {SemanticAttributes/HTTP_ROUTE endpoint}}))
                     res))
        (catch Throwable t
          ($/error t "Exception while wrapping with tracing")
          (sentry/send-event! {:throwable t
                               :message "Exception while wrapping OpenTelemetry"
                               :request req})
          {:status 500
           :headers {"Content-Type" "text/plain"}
           :body "Whimsical internal server error (tracing)"})))))