clj-otel

devurandom 2024-05-14T16:27:09.142149Z

I am struggling to debug an issue where JavaAgent auto instrumentation spans end up in a different trace than the span manually created using clj-otel. I use Quartz. In my org.quartz.Job/execute implementation I call steffan-westcott.clj-otel.api.trace.span/with-span!, with parent being equal to steffan-westcott.clj-otel.context/root (for the case described below), and the Clojure function implementing the task inside that. In calls to steffan-westcott.clj-otel.context/current that I make in my Clojure function I see the context that I created using with-span!. But when I call clojure.java.jdbc/execute! (for example) the OTel (JDBC) auto instrumentation places the span inside a span created by the Quartz auto instrumentation (`otel.library.name io.opentelemetry.quartz-2.0`) instead of the span I created using with-span!. In some semi-clojury pseudo-code (`=>` is a direct observation, << is what I see in Jaeger):

(require '[steffan-westcott.clj-otel.context :as context])
(require '[steffan-westcott.clj-otel.trace.span :as span])

;<< context / span created by OTel JavaAgent for Quartz Job
(org.quartz.Job/execute
  (context/current) ;=> root context
  (span/with-span! ; in Jaeger this span appears empty
    (context/current) ;=> my context / span
    (clojure.java.jdbc/execute!) ;<< context / span created by OTel JavaAgent for Quartz Job
Why would the auto instrumentation fail to use the span created from Clojure / clj-otel? Why does my Clojure code not see the context / span the Quartz auto instrumentation created as being the current one? Any hints on what I should look at to figure out what is going on?

devurandom 2024-07-12T14:17:58.172859Z

I now tried the (context/root) ; Must allow agent to instrument ContextStorage/root before emitting agent log messages workaround you proposed and can gladly report that it also works in our case. 🙂

devurandom 2024-07-12T14:18:00.084799Z

Could this issue be related to https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/main/docs/safety-mechanisms.md#classloader-separation ? I think Clojure uses a different ClassLoader, right? So maybe when clj-otel gets called too early, it will load things using the wrong (Clojure's) class loader, instead of the class loaders described in https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/main/docs/contributing/javaagent-structure.md ?

steffan 2024-07-12T17:24:38.729409Z

As you point out, the OpenTelemetry instrumentation agent performs classloader separation. Thank you for finding that documentation, I was not aware of that simple_smile I honestly don't know the exact reason why logging during a call to ContextStorage/root during its own instrumentation should cause the problem, so what you suggest may be correct.

steffan 2024-05-17T10:36:43.913289Z

To support your assertion that the code is synchronous, verify that the thread.name and thread.id is the same in the spans created by the agent and your code.

devurandom 2024-05-17T11:57:17.224769Z

thread.name and thread.id are the same for both traces, the one with clj-otels span and the one with the agent's spans.

devurandom 2024-05-17T12:17:11.796449Z

I turned on -Dio.opentelemetry.context.enableStrictContext=true (`clojure -J-Dio.opentelemetry...`) but that had no effect (neither with agent 2.1.0 nor with 2.3.0). I turned on debug logging for the opentelemetry libraries, and got (when running with agent 2.3.0):

[CT] | EXECUTE ROOT CONTEXT #object[io.opentelemetry.context.ArrayBasedContext 0x1f1bb8aa {}]
[CT] | EXECUTE CURRENT CONTEXT #object[io.opentelemetry.context.ArrayBasedContext 0x1f1bb8aa {}]
[CT] | EXECUTE BOUND CONTEXT #object[io.opentelemetry.context.ArrayBasedContext 0x1f1bb8aa {}]
[CT] | 2024-05-17T12:08:18.725Z 4eacb247fa91 DEBUG [io.opentelemetry.javaagent.instrumentation.opentelemetryapi.context.AgentContextStorage:63] - unexpected context: {}
[CT] |                                                      org.quartz.simpl.SimpleThreadPool$WorkerThread.run     SimpleThreadPool.java: 573
[CT] |                                                                         org.quartz.core.JobRunShell.run          JobRunShell.java: 202
[CT] |                                                                                    ....ProxyJob/execute            task_queue.clj: 314
[CT] |                                                      steffan-westcott.clj-otel.api.trace.span/new-span!                  span.clj: 188
[CT] |     io.opentelemetry.javaagent.instrumentation.opentelemetryapi.trace.ApplicationSpan$Builder.setParent      ApplicationSpan.java: 205
[CT] | io.opentelemetry.javaagent.instrumentation.opentelemetryapi.context.AgentContextStorage.getAgentContext  AgentContextStorage.java: 121
[CT] | java.lang.Exception: unexpected context
[CT] |
[CT] |
[CT] | EXECUTE WITH SPAN ROOT CONTEXT #object[io.opentelemetry.context.ArrayBasedContext 0x1f1bb8aa {}]
[CT] | EXECUTE WITH SPAN CURRENT CONTEXT #object[io.opentelemetry.context.ArrayBasedContext 0x27c116ef {opentelemetry-trace-span-key=ApplicationSpan{agentSpan=SdkSpan{traceId=171ea6c4cb4b5e70e8683c2f78e3170d, spanId=b7cfbdd4e3949f62, parentSpanContext=ImmutableSpanContext{traceId=00000000000000000000000000000000, spanId=0000000000000000, traceFlags=00, traceState=ArrayBasedTraceState{entries=[]}, remote=false, valid=false}, name=TEST/execute, kind=CONSUMER, attributes=AttributesMap{data={code.lineno=314, thread.name=periodic-triggers_Worker-1, code.filepath=.../task_queue.clj, code.namespace=....task-queue, thread.id=35}, capacity=128, totalAddedValues=7}, status=ImmutableStatusData{statusCode=UNSET, description=}, totalRecordedEvents=0, totalRecordedLinks=0, startEpochNanos=1715947698766721000, endEpochNanos=0}}}]
[CT] | EXECUTE WITH SPAN BOUND CONTEXT #object[io.opentelemetry.context.ArrayBasedContext 0x27c116ef {opentelemetry-trace-span-key=ApplicationSpan{agentSpan=SdkSpan{traceId=171ea6c4cb4b5e70e8683c2f78e3170d, spanId=b7cfbdd4e3949f62, parentSpanContext=ImmutableSpanContext{traceId=00000000000000000000000000000000, spanId=0000000000000000, traceFlags=00, traceState=ArrayBasedTraceState{entries=[]}, remote=false, valid=false}, name=TEST/execute, kind=CONSUMER, attributes=AttributesMap{data={code.lineno=314, thread.name=periodic-triggers_Worker-1, code.filepath=.../task_queue.clj, code.namespace=....task-queue, thread.id=35}, capacity=128, totalAddedValues=7}, status=ImmutableStatusData{statusCode=UNSET, description=}, totalRecordedEvents=0, totalRecordedLinks=0, startEpochNanos=1715947698766721000, endEpochNanos=0}}}]
Note the java.lang.Exception: unexpected context. With agent 2.1.0 I do not get this exception.

devurandom 2024-05-17T12:37:38.716769Z

span.clj: 188 calls (.setParent parent-context) with parent-context (or parent (context/root). Whether I pass parent (context/current) or not does not make a difference (I tried both) since the default is parent (context/dyn). The this argument comes via cond-> (.spanBuilder tracer' (str name)), where tracer' (or tracer (get-default-tracer!)). Since I do not pass tracer and there is no default, tracer' is (set-default-tracer! (get-tracer)) where get-tracer builds the tracer from (or open-telemetry (otel/get-default-otel!)). Since I never call steffan-westcott.clj-otel.api.otel/set-default-otel! or otel/set-global-otel!, it seems that otel/get-default-otel! sees what was set up by the agent. Which makes it even more surprsising that steffan-westcott.clj-otel.context/current does not see what the agent set.

steffan 2024-05-17T12:38:24.098249Z

With which versions of the agent do you get this exception? The statement above contradicts itself, saying with and without 2.3.0

devurandom 2024-05-17T12:39:25.790309Z

Thanks for pointing that out. Was a typo. I get the exception with 2.3.0 and I do not get it with 2.1.0.

steffan 2024-05-17T12:42:08.073409Z

Can you please try running your Quartz example with the agent and strict context checker but without clj-otel? If you still get exceptions then it could point to bad behaviour in the agent.

steffan 2024-05-17T12:43:08.868389Z

It's difficult for me to provide any assistance without a minimal app showing the issue.

devurandom 2024-05-17T12:43:16.660779Z

I'll try to create a minimal example on the weekend, if I find the time.

steffan 2024-05-17T12:46:58.288889Z

Your observation of this behaviour on simple Jetty server handling would probably be easier to analyse, as there would be fewer moving parts. A repro with a minimal sever app would be great!

devurandom 2024-05-17T12:47:38.206429Z

Do you mean I should copy the implementation of steffan-westcott.clj-otel.api.trace.span/with-span! into our code and remove the dependency on clj-otel? (Because the exception is thrown from new-span!. Seemingly because the otel implementation returned by (GlobalOpenTelemetry/get) is handed a context by new-span! that it did not expect. So I would expect the exception to go away if I stop calling with-span!.)

steffan 2024-05-17T12:48:49.850039Z

Hmm, no. I didn't know that's when you are getting an exception.

devurandom 2024-05-17T12:49:22.385349Z

The stacktrace is:

[io.opentelemetry.javaagent.instrumentation.opentelemetryapi.context.AgentContextStorage:63] - unexpected context: {}
[CT] |                                                      org.quartz.simpl.SimpleThreadPool$WorkerThread.run     SimpleThreadPool.java: 573
[CT] |                                                                         org.quartz.core.JobRunShell.run          JobRunShell.java: 202
[CT] |                                                                                    ....ProxyJob/execute            task_queue.clj: 314
[CT] |                                                      steffan-westcott.clj-otel.api.trace.span/new-span!                  span.clj: 188
[CT] |     io.opentelemetry.javaagent.instrumentation.opentelemetryapi.trace.ApplicationSpan$Builder.setParent      ApplicationSpan.java: 205
[CT] | io.opentelemetry.javaagent.instrumentation.opentelemetryapi.context.AgentContextStorage.getAgentContext  AgentContextStorage.java: 121
[CT] | java.lang.Exception: unexpected context

devurandom 2024-05-17T12:51:27.630779Z

Anyway, I'll try to build a reproduction with just Jetty, OTel agent and clj-otel on the weekend.

1
steffan 2024-05-17T12:55:10.499019Z

Yesterday I deployed 0.2.7-SNAPSHOT of clj-otel, so you can use that or 0.2.6 for your repro.

🙏 1
steffan 2024-05-17T14:38:16.384479Z

You could use the "countries-service" example as a starting point for your minimal repro. That example is a small standalone Jetty HTTP service that uses the OpenTelemetry agent. I've just tried adding a span inside the agent span and that works as expected. There must be some unexpected situation in your larger application to cause this to go wrong.

🙏 1
steffan 2024-05-17T14:41:57.092909Z

The clj-otel tutorial is another possible starting point. The tutorial walks through adding a span inside a span created by the agent.

devurandom 2024-05-22T08:42:46.406319Z

Thanks for taking the time to investigate! > if you really require DEBUG level agent logs. We first noticed the problem with Timbre only emitting info level log messages from namespaces outside of our own. Either Timbre fully processes log messages it will later discard, or there is something more fishy going on. (The logs I quoted above contain debug level messages, because I turned them on in the hopes to get some helpful information as to why we were not seeing the agent-created context.) > The above suggests at least two workarounds: We are currently using https://github.com/devurandom/clj-otel-mwe/commit/51cb7dd1d43aa77cf5ae10403f5ec1dbc4341484 as a workaround and that seems to work well, even with otel.javaagent.logging=application and info level logging for 3rd party namespaces configured in Timbre.

devurandom 2024-05-22T08:45:29.316579Z

To avoid the JSON parser issues, the alternative to otel.javaagent.logging=application would be otel.javaagent.logging=none, but I would rather like to avoid that, since to my understanding that would mean we would not get warning and error messages from the agent either. I don't feel comfortable with that yet.

devurandom 2024-05-15T08:53:55.041409Z

I haven't worked with Quartz, but I suspect what you are seeing is a result of async processing, or lazy evaluation. > It is important to note with-span! works with the current context, which is a thread local reference maintained by Java OpenTelemetry. Therefore, it only works well with synchronous code. > Something else doesn't follow in your description: You should never get root as the current context inside a span, if you are using synchronous code. That was also baffling me. As far as I can tell my code is completely synchronous. As far as I can tell Quartz' Java code is also synchronous (the JavaAgent code linked above contains comment with a link to the Quartz code as proof: https://github.com/quartz-scheduler/quartz/blob/quartz-2.0.x/quartz/src/main/java/org/quartz/core/JobRunShell.java#L180, which still looks the same in the version of Quartz we are running: https://github.com/quartz-scheduler/quartz/blob/quartz-2.3.x/quartz-core/src/main/java/org/quartz/core/JobRunShell.java#L173). And still my code does not see the context/current that the JavaAgent created, but the JDBC code called by my code sees it. Code:

(deftype ProxyJob [f]
  Job
  (execute [_ job-context]
    (println "EXECUTE ROOT CONTEXT" (context/root))
    (println "EXECUTE CURRENT CONTEXT" (context/current)) ; same as root
    (println "EXECUTE BOUND CONTEXT" (context/dyn)) ; same as root
    (span/with-span! {:span-kind  :consumer
                      :name       "TEST/execute"
                      :parent     (context/current)}
                     (println "EXECUTE WITH SPAN ROOT CONTEXT" (context/root))
                     (println "EXECUTE WITH SPAN CURRENT CONTEXT" (context/current)) ; picks up the new context
                     (println "EXECUTE WITH SPAN BOUND CONTEXT" (context/dyn)) ; picks up the new context
                     (jdbc/execute! (tdb/connection) "SELECT 1;")))) ; creates traces in an unrelated context
Logs:
[CT] | EXECUTE ROOT CONTEXT #object[io.opentelemetry.context.ArrayBasedContext 0x37a55a95 {}]
[CT] | EXECUTE CURRENT CONTEXT #object[io.opentelemetry.context.ArrayBasedContext 0x37a55a95 {}]
[CT] | EXECUTE BOUND CONTEXT #object[io.opentelemetry.context.ArrayBasedContext 0x37a55a95 {}]
[CT] | EXECUTE WITH SPAN ROOT CONTEXT #object[io.opentelemetry.context.ArrayBasedContext 0x37a55a95 {}]
[CT] | EXECUTE WITH SPAN CURRENT CONTEXT #object[io.opentelemetry.context.ArrayBasedContext 0x45ceb000 {opentelemetry-trace-span-key=ApplicationSpan{agentSpan=SdkSpan{traceId=af775bcb2f1f9cfbf2ffbb4e25aef9e8, spanId=4cf96add0cf57d01, parentSpanContext=ImmutableSpanContext{traceId=00000000000000000000000000000000, spanId=0000000000000000, traceFlags=00, traceState=ArrayBasedTraceState{entries=[]}, remote=false, valid=false}, name=TEST/execute, kind=CONSUMER, attributes=AttributesMap{data={code.namespace=...task-queue, code.filepath=.../task_queue.clj, thread.id=40, code.lineno=315, thread.name=periodic-triggers_Worker-1}, capacity=128, totalAddedValues=7}, status=ImmutableStatusData{statusCode=UNSET, description=}, totalRecordedEvents=0, totalRecordedLinks=0, startEpochNanos=1715761427266618000, endEpochNanos=0}}}]
[CT] | EXECUTE WITH SPAN BOUND CONTEXT #object[io.opentelemetry.context.ArrayBasedContext 0x45ceb000 {opentelemetry-trace-span-key=ApplicationSpan{agentSpan=SdkSpan{traceId=af775bcb2f1f9cfbf2ffbb4e25aef9e8, spanId=4cf96add0cf57d01, parentSpanContext=ImmutableSpanContext{traceId=00000000000000000000000000000000, spanId=0000000000000000, traceFlags=00, traceState=ArrayBasedTraceState{entries=[]}, remote=false, valid=false}, name=TEST/execute, kind=CONSUMER, attributes=AttributesMap{data={code.namespace=...task-queue, code.filepath=.../task_queue.clj, thread.id=40, code.lineno=315, thread.name=periodic-triggers_Worker-1}, capacity=128, totalAddedValues=7}, status=ImmutableStatusData{statusCode=UNSET, description=}, totalRecordedEvents=0, totalRecordedLinks=0, startEpochNanos=1715761427266618000, endEpochNanos=0}}}]
custom-span.png is http://localhost:16686/trace/af775bcb2f1f9cfbf2ffbb4e25aef9e8, contains only the TEST/execute custom span, and Jaeger shows under "process" attributes:
{
  "key": "otel.library.name",
  "type": "string",
  "value": "steffan-westcott.clj-otel"
}
{
  "key": "otel.library.version",
  "type": "string",
  "value": "0.2.6"
}
auto-instrumentation.png is http://localhost:16686/trace/371834e48b57205783ee7de680136940, contains the DEFAULT.... Quartz span, and Jaeger shows under "process" attributes:
{
  "key": "otel.library.name",
  "type": "string",
  "value": "io.opentelemetry.quartz-2.0"
}
{
  "key": "otel.library.version",
  "type": "string",
  "value": "2.3.0-alpha"
}
The SELECT ... JDBC span inside it has these "process" attributes:
{
  "key": "otel.library.name",
  "type": "string",
  "value": "io.opentelemetry.jdbc"
}
{
  "key": "otel.library.version",
  "type": "string",
  "value": "2.3.0-alpha"
}
All also have these "process" attributes:
{
  "key": "process.runtime.description",
  "type": "string",
  "value": "Microsoft OpenJDK 64-Bit Server VM 11.0.22+7-LTS"
}
{
  "key": "telemetry.distro.name",
  "type": "string",
  "value": "opentelemetry-java-instrumentation"
}
{
  "key": "telemetry.distro.version",
  "type": "string",
  "value": "2.3.0"
}
{
  "key": "telemetry.sdk.language",
  "type": "string",
  "value": "java"
}
{
  "key": "telemetry.sdk.name",
  "type": "string",
  "value": "opentelemetry"
}
{
  "key": "telemetry.sdk.version",
  "type": "string",
  "value": "1.37.0"
}
And all have "tags" attributes, exactly like the context/current I see:
{
  "key": "thread.id",
  "type": "int64",
  "value": 40
}
{
  "key": "thread.name",
  "type": "string",
  "value": "periodic-triggers_Worker-1"
}

devurandom 2024-05-15T17:00:33.906449Z

I noticed that also other manually created spans appear out of context, so I thought about a "split brain" situation between clj-otel and the OpenTelemetry JavaAgent. I managed to trace it down to: • https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/tag/v2.1.0 is the last version that works • https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/tag/v2.2.0 is the first version where clj-otel does not see the agent's auto instrumentation contexts

devurandom 2024-05-15T17:07:32.010619Z

I do not see anything obvious in the release notes of 2.2.0, except maybe https://github.com/open-telemetry/opentelemetry-java-instrumentation/pull/10854. But we run Java 11, not 21, i.e. we have no virtual threads. But maybe something in the release notes catches your eye?

steffan 2024-05-15T17:17:23.706039Z

I don't see any connection to virtual threads here. When you say "other manually created threads appear out of context", are you referring to your Quartz example specifically, or all applications in general? I don't see any problem with the "sentence summary" microservice example in clj-otel which adds a manually created span inside a span created by the agent. I'm running with agent 2.3.0 in my test. (Need to go out now, will test with 2.2.0 later)

devurandom 2024-05-15T19:14:46.732739Z

> When you say "other manually created threads appear out of context" We also create manual spans elsewhere (not just in our Quartz code) and they also appear separate from the trace context that the auto instrumentation created when Jetty handled the HTTP requests. > I don't see any problem with the "sentence summary" microservice example in clj-otel which adds a manually created span inside a span created by the agent. I'm running with agent 2.3.0 in my test. Thanks for testing it. I'll see whether I can figure out more specifically what is causing this. I was using 2.3.0 for https://clojurians.slack.com/archives/C034UN5273N/p1715704029142149. For the https://clojurians.slack.com/archives/C034UN5273N/p1715792433906449?thread_ts=1715704029.142149&amp;cid=C034UN5273N I tried all versions from 2.0.0 to 2.2.0. 2.0.0 and 2.1.0 had no problem. The problem seems to have started with 2.2.0.

steffan 2024-05-19T13:48:58.987079Z

About turning off agent logging, are you sure turning it off does not help? If I modify your example such that the JVM option "-Dotel.javaagent.logging=none" is used, the issue is resolved when I run it. This setting mutes all agent logging, including the log messages on startup.

devurandom 2024-05-19T13:51:24.837049Z

Thanks! I did not try setting it to none, only removing the property, which leaves it at the default value. Doing that would resolve the "clj-otel cannot see the context" issue, but trip the JSON processing of log messages.

steffan 2024-05-19T13:54:35.252749Z

I'm not sure I follow, if all the agent log messages are muted, there would be no issue with JSON processing...?

devurandom 2024-05-19T13:55:01.232989Z

I did not try setting it to none.

👍🏼 1
devurandom 2024-05-16T16:44:39.293059Z

For comparison, the logs with agent 2.1.0:

[CT] | EXECUTE ROOT CONTEXT #object[io.opentelemetry.javaagent.instrumentation.opentelemetryapi.context.AgentContextWrapper 0x60f4cc44 AgentContextWrapper{agentContext={}, applicationContext={}}]
[CT] | EXECUTE CURRENT CONTEXT #object[io.opentelemetry.javaagent.instrumentation.opentelemetryapi.context.AgentContextWrapper 0x1d81157b AgentContextWrapper{agentContext={opentelemetry-trace-span-key=SdkSpan{traceId=6de3b0a63bdd1731eb71aaddd29d6145, spanId=70d201f4c67c2583, parentSpanContext=ImmutableSpanContext{traceId=00000000000000000000000000000000, spanId=0000000000000000, traceFlags=00, traceState=ArrayBasedTraceState{entries=[]}, remote=false, valid=false}, name=DEFAULT...., kind=INTERNAL, attributes=AttributesMap{data={thread.name=periodic-triggers_Worker-1, thread.id=34, code.namespace=...task_queue.ProxyJob, code.function=execute}, capacity=128, totalAddedValues=4}, status=ImmutableStatusData{statusCode=UNSET, description=}, totalRecordedEvents=0, totalRecordedLinks=0, startEpochNanos=1715876358742397000, endEpochNanos=0}, opentelemetry-traces-local-root-span=SdkSpan{traceId=6de3b0a63bdd1731eb71aaddd29d6145, spanId=70d201f4c67c2583, parentSpanContext=ImmutableSpanContext{traceId=00000000000000000000000000000000, spanId=0000000000000000, traceFlags=00, traceState=ArrayBasedTraceState{entries=[]}, remote=false, valid=false}, name=DEFAULT...., kind=INTERNAL, attributes=AttributesMap{data={thread.name=periodic-triggers_Worker-1, thread.id=34, code.namespace=...task_queue.ProxyJob, code.function=execute}, capacity=128, totalAddedValues=4}, status=ImmutableStatusData{statusCode=UNSET, description=}, totalRecordedEvents=0, totalRecordedLinks=0, startEpochNanos=1715876358742397000, endEpochNanos=0}}, applicationContext={}}]
[CT] | EXECUTE BOUND CONTEXT #object[io.opentelemetry.javaagent.instrumentation.opentelemetryapi.context.AgentContextWrapper 0x3c1e4ba8 AgentContextWrapper{agentContext={opentelemetry-trace-span-key=SdkSpan{traceId=6de3b0a63bdd1731eb71aaddd29d6145, spanId=70d201f4c67c2583, parentSpanContext=ImmutableSpanContext{traceId=00000000000000000000000000000000, spanId=0000000000000000, traceFlags=00, traceState=ArrayBasedTraceState{entries=[]}, remote=false, valid=false}, name=DEFAULT...., kind=INTERNAL, attributes=AttributesMap{data={thread.name=periodic-triggers_Worker-1, thread.id=34, code.namespace=...task_queue.ProxyJob, code.function=execute}, capacity=128, totalAddedValues=4}, status=ImmutableStatusData{statusCode=UNSET, description=}, totalRecordedEvents=0, totalRecordedLinks=0, startEpochNanos=1715876358742397000, endEpochNanos=0}, opentelemetry-traces-local-root-span=SdkSpan{traceId=6de3b0a63bdd1731eb71aaddd29d6145, spanId=70d201f4c67c2583, parentSpanContext=ImmutableSpanContext{traceId=00000000000000000000000000000000, spanId=0000000000000000, traceFlags=00, traceState=ArrayBasedTraceState{entries=[]}, remote=false, valid=false}, name=DEFAULT...., kind=INTERNAL, attributes=AttributesMap{data={thread.name=periodic-triggers_Worker-1, thread.id=34, code.namespace=...task_queue.ProxyJob, code.function=execute}, capacity=128, totalAddedValues=4}, status=ImmutableStatusData{statusCode=UNSET, description=}, totalRecordedEvents=0, totalRecordedLinks=0, startEpochNanos=1715876358742397000, endEpochNanos=0}}, applicationContext={}}]
[CT] | EXECUTE WITH SPAN ROOT CONTEXT #object[io.opentelemetry.javaagent.instrumentation.opentelemetryapi.context.AgentContextWrapper 0x60f4cc44 AgentContextWrapper{agentContext={}, applicationContext={}}]
[CT] | EXECUTE WITH SPAN CURRENT CONTEXT #object[io.opentelemetry.javaagent.instrumentation.opentelemetryapi.context.AgentContextWrapper 0x1ba892e AgentContextWrapper{agentContext={opentelemetry-trace-span-key=SdkSpan{traceId=6de3b0a63bdd1731eb71aaddd29d6145, spanId=3818496ccb896ee1, parentSpanContext=ImmutableSpanContext{traceId=6de3b0a63bdd1731eb71aaddd29d6145, spanId=70d201f4c67c2583, traceFlags=01, traceState=ArrayBasedTraceState{entries=[]}, remote=false, valid=true}, name=TEST/execute, kind=CONSUMER, attributes=AttributesMap{data={code.lineno=314, thread.name=periodic-triggers_Worker-1, thread.id=34, code.filepath=.../task_queue.clj, code.namespace=....task-queue}, capacity=128, totalAddedValues=7}, status=ImmutableStatusData{statusCode=UNSET, description=}, totalRecordedEvents=0, totalRecordedLinks=0, startEpochNanos=1715876358743204879, endEpochNanos=0}, opentelemetry-traces-local-root-span=SdkSpan{traceId=6de3b0a63bdd1731eb71aaddd29d6145, spanId=70d201f4c67c2583, parentSpanContext=ImmutableSpanContext{traceId=00000000000000000000000000000000, spanId=0000000000000000, traceFlags=00, traceState=ArrayBasedTraceState{entries=[]}, remote=false, valid=false}, name=DEFAULT...., kind=INTERNAL, attributes=AttributesMap{data={thread.name=periodic-triggers_Worker-1, thread.id=34, code.namespace=...task_queue.ProxyJob, code.function=execute}, capacity=128, totalAddedValues=4}, status=ImmutableStatusData{statusCode=UNSET, description=}, totalRecordedEvents=0, totalRecordedLinks=0, startEpochNanos=1715876358742397000, endEpochNanos=0}, otel-context={}}, applicationContext={}}]
[CT] | EXECUTE WITH SPAN BOUND CONTEXT #object[io.opentelemetry.javaagent.instrumentation.opentelemetryapi.context.AgentContextWrapper 0x50838d38 AgentContextWrapper{agentContext={opentelemetry-trace-span-key=SdkSpan{traceId=6de3b0a63bdd1731eb71aaddd29d6145, spanId=3818496ccb896ee1, parentSpanContext=ImmutableSpanContext{traceId=6de3b0a63bdd1731eb71aaddd29d6145, spanId=70d201f4c67c2583, traceFlags=01, traceState=ArrayBasedTraceState{entries=[]}, remote=false, valid=true}, name=TEST/execute, kind=CONSUMER, attributes=AttributesMap{data={code.lineno=314, thread.name=periodic-triggers_Worker-1, thread.id=34, code.filepath=.../task_queue.clj, code.namespace=....task-queue}, capacity=128, totalAddedValues=7}, status=ImmutableStatusData{statusCode=UNSET, description=}, totalRecordedEvents=0, totalRecordedLinks=0, startEpochNanos=1715876358743204879, endEpochNanos=0}, opentelemetry-traces-local-root-span=SdkSpan{traceId=6de3b0a63bdd1731eb71aaddd29d6145, spanId=70d201f4c67c2583, parentSpanContext=ImmutableSpanContext{traceId=00000000000000000000000000000000, spanId=0000000000000000, traceFlags=00, traceState=ArrayBasedTraceState{entries=[]}, remote=false, valid=false}, name=DEFAULT...., kind=INTERNAL, attributes=AttributesMap{data={thread.name=periodic-triggers_Worker-1, thread.id=34, code.namespace=...task_queue.ProxyJob, code.function=execute}, capacity=128, totalAddedValues=4}, status=ImmutableStatusData{statusCode=UNSET, description=}, totalRecordedEvents=0, totalRecordedLinks=0, startEpochNanos=1715876358742397000, endEpochNanos=0}}, applicationContext={}}]
> I'll see whether I can figure out more specifically what is causing this. To test the hypothesis that this could be a "diamond dependency" problem, I tried adding io.opentelemetry/opentelemetry-api {:mvn/version "1.36.0"} to deps.edn and running with https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/tag/v2.2.0, and io.opentelemetry/opentelemetry-api {:mvn/version "1.37.0"} with https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/tag/v2.3.0, but no change -- EXECUTE CURRENT CONTEXT is initially output with an empty io.opentelemetry.context.ArrayBasedContext instead of the io.opentelemetry.javaagent.instrumentation.opentelemetryapi.context.AgentContextWrapper I would expect. Since io.opentelemetry.semconv/opentelemetry-semconv {:mvn/version "1.24.0-alpha"} mentions breaking changes and io.opentelemetry.semconv/opentelemetry-semconv {:mvn/version "1.25.0-alpha"} is supposed to restore backward compatibility, I also tried to override clj-otel-api's dependency with 1.25.0-alpha in our deps.edn just in case the agent's version of it would somehow leak into the classpath, but no change either -- ArrayBasedContext instead of AgentContextWrapper. I also tried checking out 3eb216d3553b28a2574e323470a0bce077394b26 of clj-otel-api locally and using that instead (without mentioning any io.opentelemetry* dependencies in our deps.edn), but still the same output -- ArrayBasedContext instead of AgentContextWrapper.

steffan 2024-05-16T16:58:29.799969Z

It would be helpful if you could provide a minimal project which reproduces the issue. I'm having no luck reproducing the issue you are seeing.

👍 1
steffan 2024-05-14T22:44:47.234579Z

Note that the agent also instruments JDBC access automatically. See https://github.com/steffan-westcott/clj-otel/tree/master/examples/rpg-service/src/example/rpg_service for an example. It is not clear in your description, but the "span created by OTel JavaAgent for Quartz Job" might actually be a span around database access instead.

steffan 2024-05-14T22:46:59.070519Z

I'm having difficulty following your explanation of what spans you actually see. Perhaps some screenshots of the traces displayed in Jaeger might clear things up for me.

steffan 2024-05-14T22:49:55.039419Z

I haven't worked with Quartz, but I suspect what you are seeing is a result of async processing, or lazy evaluation. The body in your with-span! form has completed before the execute! starts.

steffan 2024-05-14T22:52:24.726579Z

It is important to note with-span! works with the current context, which is a thread local reference maintained by Java OpenTelemetry. Therefore, it only works well with synchronous code.

steffan 2024-05-14T23:31:25.771379Z

Something else doesn't follow in your description: You should never get root as the current context inside a span, if you are using synchronous code.

steffan 2024-05-14T23:33:01.244699Z

I think I'd also need to see your code to be able to offer help.

steffan 2024-05-21T14:04:35.604799Z

After some investigation, I have a better understanding of what is happening. Also, I've found an alternative easy workaround if you really require DEBUG level agent logs. I hoped to make a simple repro in Java using the OpenTelemetry Log4J2 NDC provider, but I have not been able to do this. The OpenTelemetry instrumentation agent https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/b36d2845b35db088008a8ea11f6666977d285e9e/instrumentation/opentelemetry-api/opentelemetry-api-1.0/javaagent/src/main/java/io/opentelemetry/javaagent/instrumentation/opentelemetryapi/ContextInstrumentation.java#L36-L38 ContextStorage/root (when it is first used) to wrap the ArrayBasedContext that would otherwise have been returned with a helper method in AgentContextStorage. If DEBUG level agent logging is turned on, log messages are emitted during this instrumentation process. My tests show that a call to ContextStorage/root during its own instrumentation will cause issues. This will happen if we attempt to access the current context during emitting an agent log message when instrumenting ContextStorage/root. The above suggests at least two workarounds: 1 - Consider if you really need otel.javaagent.logging=application and enabling DEBUG level agent logging 2 - Evaluate (context/root) before routing agent logs through the logger in your application

(defn -main
  []
  (context/root)  ; Must allow agent to instrument ContextStorage/root before emitting agent log messages
  (log/merge-config! {:min-level  [[#{"io.opentelemetry.javaagent.*"} :debug] [#{"*"} :info]]
                      :middleware [otel-context-middleware]})
  (jetty/run-jetty #'handler {:port 3000}))

steffan 2024-05-18T11:19:23.541639Z

Java agent 2.4.0 was released a few hours ago. I've done a quick test with countries-service and sentence-summary-service examples with no problems.

🙏 1
devurandom 2024-05-18T14:54:46.131309Z

I found a repro: https://github.com/devurandom/clj-otel-mwe/commit/f0a17c82e6bce30ee2939020b7e7c8422c551078 Both the Timbre middleware and the Java property are needed to break clj-otel.

devurandom 2024-05-18T14:59:59.047529Z

It can be fixed with https://github.com/devurandom/clj-otel-mwe/commit/51cb7dd1d43aa77cf5ae10403f5ec1dbc4341484.

devurandom 2024-05-18T15:07:24.225159Z

(Not visible in the mwe: We use the otel-context-middleware to transport the otel context if (.isValid span-context) into a Timbre {:output-opts {:msg-fn ...}} that includes the trace and span ID in the logged message, if present.)

devurandom 2024-05-18T15:19:39.309559Z

I do not even need to call (span/get-span-context). Calling (context/current) is enough: https://github.com/devurandom/clj-otel-mwe/commit/469d9e5f35c72571eb2a28dacfe228214858cd31

steffan 2024-05-18T16:54:42.617589Z

I have reproduced the issue, so thanks for that 😄 I believe what you are seeing is an interaction of turning on agent logging, LazyStorage (which will establish a source of Context objects only when one is requested, which is normally well after the rest of the OpenTelemetry SDK is initialised) and SLF4J classpath shenanigans. Somewhere in all of that mess is causing this issue.

steffan 2024-05-18T16:55:37.626169Z

As a short-term fix, perhaps you don't need agent logging turned on.

steffan 2024-05-18T17:28:16.746949Z

I suspect this is an issue with the interaction I mentioned above and not related to clj-otel at all. I will try to recreate the issue in a small Java application. If I succeed, I will report the issue on the opentelemetry-java-instrumentation Github repo.

🙏 1
steffan 2024-05-18T17:33:23.411899Z

For your information, to cut down the noisy log, I'm using this modified version of your repro:

(defn- handler
 [_]
 (println "CONTEXT" (context/current))
 (span/with-span! "Nested span")
 {:status 200 :body nil})

(defn- otel-context-middleware
 [data]
 (context/current)
 data)

(defn -main
 []
 (log/merge-config! {:min-level [[#{"io.opentelemetry.javaagent.*"} :debug] [#{"*"} :info]]
                     ;; With this middleware and `otel.javaagent.logging=application`
                     ;; clj-otel cannot see agent-generated spans:
                     :middleware [otel-context-middleware]})
 (jetty/run-jetty #'handler {:port 3000}))

devurandom 2024-05-18T18:05:29.741609Z

> As a short-term fix, perhaps you don't need agent logging turned on. > Sadly we do need it, as otherwise the agent emits text log messages and breaks the JSON parsing of a colleague's logging stack. But https://github.com/devurandom/clj-otel-mwe/commit/51cb7dd1d43aa77cf5ae10403f5ec1dbc4341484 works as a workaround, so we're good. > I suspect this is an issue with the interaction I mentioned above and not related to clj-otel at all. I will try to recreate the issue in a small Java application. If I succeed, I will report the issue on the opentelemetry-java-instrumentation Github repo. > If this could be fixed in the agent, that would be awesome. Thanks for looking into it!