I am struggling to debug an issue where JavaAgent auto instrumentation spans end up in a different trace than the span manually created using clj-otel.
I use Quartz. In my org.quartz.Job/execute implementation I call steffan-westcott.clj-otel.api.trace.span/with-span!, with parent being equal to steffan-westcott.clj-otel.context/root (for the case described below), and the Clojure function implementing the task inside that.
In calls to steffan-westcott.clj-otel.context/current that I make in my Clojure function I see the context that I created using with-span!. But when I call clojure.java.jdbc/execute! (for example) the OTel (JDBC) auto instrumentation places the span inside a span created by the Quartz auto instrumentation (`otel.library.name io.opentelemetry.quartz-2.0`) instead of the span I created using with-span!.
In some semi-clojury pseudo-code (`=>` is a direct observation, << is what I see in Jaeger):
(require '[steffan-westcott.clj-otel.context :as context])
(require '[steffan-westcott.clj-otel.trace.span :as span])
;<< context / span created by OTel JavaAgent for Quartz Job
(org.quartz.Job/execute
(context/current) ;=> root context
(span/with-span! ; in Jaeger this span appears empty
(context/current) ;=> my context / span
(clojure.java.jdbc/execute!) ;<< context / span created by OTel JavaAgent for Quartz Job
Why would the auto instrumentation fail to use the span created from Clojure / clj-otel?
Why does my Clojure code not see the context / span the Quartz auto instrumentation created as being the current one?
Any hints on what I should look at to figure out what is going on?I now tried the (context/root) ; Must allow agent to instrument ContextStorage/root before emitting agent log messages workaround you proposed and can gladly report that it also works in our case. 🙂
Could this issue be related to https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/main/docs/safety-mechanisms.md#classloader-separation ? I think Clojure uses a different ClassLoader, right? So maybe when clj-otel gets called too early, it will load things using the wrong (Clojure's) class loader, instead of the class loaders described in https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/main/docs/contributing/javaagent-structure.md ?
As you point out, the OpenTelemetry instrumentation agent performs classloader separation. Thank you for finding that documentation, I was not aware of that simple_smile I honestly don't know the exact reason why logging during a call to ContextStorage/root during its own instrumentation should cause the problem, so what you suggest may be correct.
To support your assertion that the code is synchronous, verify that the thread.name and thread.id is the same in the spans created by the agent and your code.
thread.name and thread.id are the same for both traces, the one with clj-otels span and the one with the agent's spans.
I turned on -Dio.opentelemetry.context.enableStrictContext=true (`clojure -J-Dio.opentelemetry...`) but that had no effect (neither with agent 2.1.0 nor with 2.3.0).
I turned on debug logging for the opentelemetry libraries, and got (when running with agent 2.3.0):
[CT] | EXECUTE ROOT CONTEXT #object[io.opentelemetry.context.ArrayBasedContext 0x1f1bb8aa {}]
[CT] | EXECUTE CURRENT CONTEXT #object[io.opentelemetry.context.ArrayBasedContext 0x1f1bb8aa {}]
[CT] | EXECUTE BOUND CONTEXT #object[io.opentelemetry.context.ArrayBasedContext 0x1f1bb8aa {}]
[CT] | 2024-05-17T12:08:18.725Z 4eacb247fa91 DEBUG [io.opentelemetry.javaagent.instrumentation.opentelemetryapi.context.AgentContextStorage:63] - unexpected context: {}
[CT] | org.quartz.simpl.SimpleThreadPool$WorkerThread.run SimpleThreadPool.java: 573
[CT] | org.quartz.core.JobRunShell.run JobRunShell.java: 202
[CT] | ....ProxyJob/execute task_queue.clj: 314
[CT] | steffan-westcott.clj-otel.api.trace.span/new-span! span.clj: 188
[CT] | io.opentelemetry.javaagent.instrumentation.opentelemetryapi.trace.ApplicationSpan$Builder.setParent ApplicationSpan.java: 205
[CT] | io.opentelemetry.javaagent.instrumentation.opentelemetryapi.context.AgentContextStorage.getAgentContext AgentContextStorage.java: 121
[CT] | java.lang.Exception: unexpected context
[CT] |
[CT] |
[CT] | EXECUTE WITH SPAN ROOT CONTEXT #object[io.opentelemetry.context.ArrayBasedContext 0x1f1bb8aa {}]
[CT] | EXECUTE WITH SPAN CURRENT CONTEXT #object[io.opentelemetry.context.ArrayBasedContext 0x27c116ef {opentelemetry-trace-span-key=ApplicationSpan{agentSpan=SdkSpan{traceId=171ea6c4cb4b5e70e8683c2f78e3170d, spanId=b7cfbdd4e3949f62, parentSpanContext=ImmutableSpanContext{traceId=00000000000000000000000000000000, spanId=0000000000000000, traceFlags=00, traceState=ArrayBasedTraceState{entries=[]}, remote=false, valid=false}, name=TEST/execute, kind=CONSUMER, attributes=AttributesMap{data={code.lineno=314, thread.name=periodic-triggers_Worker-1, code.filepath=.../task_queue.clj, code.namespace=....task-queue, thread.id=35}, capacity=128, totalAddedValues=7}, status=ImmutableStatusData{statusCode=UNSET, description=}, totalRecordedEvents=0, totalRecordedLinks=0, startEpochNanos=1715947698766721000, endEpochNanos=0}}}]
[CT] | EXECUTE WITH SPAN BOUND CONTEXT #object[io.opentelemetry.context.ArrayBasedContext 0x27c116ef {opentelemetry-trace-span-key=ApplicationSpan{agentSpan=SdkSpan{traceId=171ea6c4cb4b5e70e8683c2f78e3170d, spanId=b7cfbdd4e3949f62, parentSpanContext=ImmutableSpanContext{traceId=00000000000000000000000000000000, spanId=0000000000000000, traceFlags=00, traceState=ArrayBasedTraceState{entries=[]}, remote=false, valid=false}, name=TEST/execute, kind=CONSUMER, attributes=AttributesMap{data={code.lineno=314, thread.name=periodic-triggers_Worker-1, code.filepath=.../task_queue.clj, code.namespace=....task-queue, thread.id=35}, capacity=128, totalAddedValues=7}, status=ImmutableStatusData{statusCode=UNSET, description=}, totalRecordedEvents=0, totalRecordedLinks=0, startEpochNanos=1715947698766721000, endEpochNanos=0}}}]
Note the java.lang.Exception: unexpected context.
With agent 2.1.0 I do not get this exception.span.clj: 188 calls (.setParent parent-context) with parent-context (or parent (context/root). Whether I pass parent (context/current) or not does not make a difference (I tried both) since the default is parent (context/dyn).
The this argument comes via cond-> (.spanBuilder tracer' (str name)), where tracer' (or tracer (get-default-tracer!)). Since I do not pass tracer and there is no default, tracer' is (set-default-tracer! (get-tracer)) where get-tracer builds the tracer from (or open-telemetry (otel/get-default-otel!)).
Since I never call steffan-westcott.clj-otel.api.otel/set-default-otel! or otel/set-global-otel!, it seems that otel/get-default-otel! sees what was set up by the agent. Which makes it even more surprsising that steffan-westcott.clj-otel.context/current does not see what the agent set.
With which versions of the agent do you get this exception? The statement above contradicts itself, saying with and without 2.3.0
Thanks for pointing that out. Was a typo. I get the exception with 2.3.0 and I do not get it with 2.1.0.
Can you please try running your Quartz example with the agent and strict context checker but without clj-otel? If you still get exceptions then it could point to bad behaviour in the agent.
It's difficult for me to provide any assistance without a minimal app showing the issue.
I'll try to create a minimal example on the weekend, if I find the time.
Your observation of this behaviour on simple Jetty server handling would probably be easier to analyse, as there would be fewer moving parts. A repro with a minimal sever app would be great!
Do you mean I should copy the implementation of steffan-westcott.clj-otel.api.trace.span/with-span! into our code and remove the dependency on clj-otel? (Because the exception is thrown from new-span!. Seemingly because the otel implementation returned by (GlobalOpenTelemetry/get) is handed a context by new-span! that it did not expect. So I would expect the exception to go away if I stop calling with-span!.)
Hmm, no. I didn't know that's when you are getting an exception.
The stacktrace is:
[io.opentelemetry.javaagent.instrumentation.opentelemetryapi.context.AgentContextStorage:63] - unexpected context: {}
[CT] | org.quartz.simpl.SimpleThreadPool$WorkerThread.run SimpleThreadPool.java: 573
[CT] | org.quartz.core.JobRunShell.run JobRunShell.java: 202
[CT] | ....ProxyJob/execute task_queue.clj: 314
[CT] | steffan-westcott.clj-otel.api.trace.span/new-span! span.clj: 188
[CT] | io.opentelemetry.javaagent.instrumentation.opentelemetryapi.trace.ApplicationSpan$Builder.setParent ApplicationSpan.java: 205
[CT] | io.opentelemetry.javaagent.instrumentation.opentelemetryapi.context.AgentContextStorage.getAgentContext AgentContextStorage.java: 121
[CT] | java.lang.Exception: unexpected contextAnyway, I'll try to build a reproduction with just Jetty, OTel agent and clj-otel on the weekend.
Yesterday I deployed 0.2.7-SNAPSHOT of clj-otel, so you can use that or 0.2.6 for your repro.
You could use the "countries-service" example as a starting point for your minimal repro. That example is a small standalone Jetty HTTP service that uses the OpenTelemetry agent. I've just tried adding a span inside the agent span and that works as expected. There must be some unexpected situation in your larger application to cause this to go wrong.
The clj-otel tutorial is another possible starting point. The tutorial walks through adding a span inside a span created by the agent.
Thanks for taking the time to investigate!
> if you really require DEBUG level agent logs.
We first noticed the problem with Timbre only emitting info level log messages from namespaces outside of our own. Either Timbre fully processes log messages it will later discard, or there is something more fishy going on. (The logs I quoted above contain debug level messages, because I turned them on in the hopes to get some helpful information as to why we were not seeing the agent-created context.)
> The above suggests at least two workarounds:
We are currently using https://github.com/devurandom/clj-otel-mwe/commit/51cb7dd1d43aa77cf5ae10403f5ec1dbc4341484 as a workaround and that seems to work well, even with otel.javaagent.logging=application and info level logging for 3rd party namespaces configured in Timbre.
To avoid the JSON parser issues, the alternative to otel.javaagent.logging=application would be otel.javaagent.logging=none, but I would rather like to avoid that, since to my understanding that would mean we would not get warning and error messages from the agent either. I don't feel comfortable with that yet.
I haven't worked with Quartz, but I suspect what you are seeing is a result of async processing, or lazy evaluation. > It is important to note with-span! works with the current context, which is a thread local reference maintained by Java OpenTelemetry. Therefore, it only works well with synchronous code.
> Something else doesn't follow in your description: You should never get root as the current context inside a span, if you are using synchronous code.
That was also baffling me. As far as I can tell my code is completely synchronous. As far as I can tell Quartz' Java code is also synchronous (the JavaAgent code linked above contains comment with a link to the Quartz code as proof: https://github.com/quartz-scheduler/quartz/blob/quartz-2.0.x/quartz/src/main/java/org/quartz/core/JobRunShell.java#L180, which still looks the same in the version of Quartz we are running: https://github.com/quartz-scheduler/quartz/blob/quartz-2.3.x/quartz-core/src/main/java/org/quartz/core/JobRunShell.java#L173). And still my code does not see the context/current that the JavaAgent created, but the JDBC code called by my code sees it.
Code:
(deftype ProxyJob [f]
Job
(execute [_ job-context]
(println "EXECUTE ROOT CONTEXT" (context/root))
(println "EXECUTE CURRENT CONTEXT" (context/current)) ; same as root
(println "EXECUTE BOUND CONTEXT" (context/dyn)) ; same as root
(span/with-span! {:span-kind :consumer
:name "TEST/execute"
:parent (context/current)}
(println "EXECUTE WITH SPAN ROOT CONTEXT" (context/root))
(println "EXECUTE WITH SPAN CURRENT CONTEXT" (context/current)) ; picks up the new context
(println "EXECUTE WITH SPAN BOUND CONTEXT" (context/dyn)) ; picks up the new context
(jdbc/execute! (tdb/connection) "SELECT 1;")))) ; creates traces in an unrelated context
Logs:
[CT] | EXECUTE ROOT CONTEXT #object[io.opentelemetry.context.ArrayBasedContext 0x37a55a95 {}]
[CT] | EXECUTE CURRENT CONTEXT #object[io.opentelemetry.context.ArrayBasedContext 0x37a55a95 {}]
[CT] | EXECUTE BOUND CONTEXT #object[io.opentelemetry.context.ArrayBasedContext 0x37a55a95 {}]
[CT] | EXECUTE WITH SPAN ROOT CONTEXT #object[io.opentelemetry.context.ArrayBasedContext 0x37a55a95 {}]
[CT] | EXECUTE WITH SPAN CURRENT CONTEXT #object[io.opentelemetry.context.ArrayBasedContext 0x45ceb000 {opentelemetry-trace-span-key=ApplicationSpan{agentSpan=SdkSpan{traceId=af775bcb2f1f9cfbf2ffbb4e25aef9e8, spanId=4cf96add0cf57d01, parentSpanContext=ImmutableSpanContext{traceId=00000000000000000000000000000000, spanId=0000000000000000, traceFlags=00, traceState=ArrayBasedTraceState{entries=[]}, remote=false, valid=false}, name=TEST/execute, kind=CONSUMER, attributes=AttributesMap{data={code.namespace=...task-queue, code.filepath=.../task_queue.clj, thread.id=40, code.lineno=315, thread.name=periodic-triggers_Worker-1}, capacity=128, totalAddedValues=7}, status=ImmutableStatusData{statusCode=UNSET, description=}, totalRecordedEvents=0, totalRecordedLinks=0, startEpochNanos=1715761427266618000, endEpochNanos=0}}}]
[CT] | EXECUTE WITH SPAN BOUND CONTEXT #object[io.opentelemetry.context.ArrayBasedContext 0x45ceb000 {opentelemetry-trace-span-key=ApplicationSpan{agentSpan=SdkSpan{traceId=af775bcb2f1f9cfbf2ffbb4e25aef9e8, spanId=4cf96add0cf57d01, parentSpanContext=ImmutableSpanContext{traceId=00000000000000000000000000000000, spanId=0000000000000000, traceFlags=00, traceState=ArrayBasedTraceState{entries=[]}, remote=false, valid=false}, name=TEST/execute, kind=CONSUMER, attributes=AttributesMap{data={code.namespace=...task-queue, code.filepath=.../task_queue.clj, thread.id=40, code.lineno=315, thread.name=periodic-triggers_Worker-1}, capacity=128, totalAddedValues=7}, status=ImmutableStatusData{statusCode=UNSET, description=}, totalRecordedEvents=0, totalRecordedLinks=0, startEpochNanos=1715761427266618000, endEpochNanos=0}}}]
custom-span.png is http://localhost:16686/trace/af775bcb2f1f9cfbf2ffbb4e25aef9e8, contains only the TEST/execute custom span, and Jaeger shows under "process" attributes:
{
"key": "otel.library.name",
"type": "string",
"value": "steffan-westcott.clj-otel"
}
{
"key": "otel.library.version",
"type": "string",
"value": "0.2.6"
}
auto-instrumentation.png is http://localhost:16686/trace/371834e48b57205783ee7de680136940, contains the DEFAULT.... Quartz span, and Jaeger shows under "process" attributes:
{
"key": "otel.library.name",
"type": "string",
"value": "io.opentelemetry.quartz-2.0"
}
{
"key": "otel.library.version",
"type": "string",
"value": "2.3.0-alpha"
}
The SELECT ... JDBC span inside it has these "process" attributes:
{
"key": "otel.library.name",
"type": "string",
"value": "io.opentelemetry.jdbc"
}
{
"key": "otel.library.version",
"type": "string",
"value": "2.3.0-alpha"
}
All also have these "process" attributes:
{
"key": "process.runtime.description",
"type": "string",
"value": "Microsoft OpenJDK 64-Bit Server VM 11.0.22+7-LTS"
}
{
"key": "telemetry.distro.name",
"type": "string",
"value": "opentelemetry-java-instrumentation"
}
{
"key": "telemetry.distro.version",
"type": "string",
"value": "2.3.0"
}
{
"key": "telemetry.sdk.language",
"type": "string",
"value": "java"
}
{
"key": "telemetry.sdk.name",
"type": "string",
"value": "opentelemetry"
}
{
"key": "telemetry.sdk.version",
"type": "string",
"value": "1.37.0"
}
And all have "tags" attributes, exactly like the context/current I see:
{
"key": "thread.id",
"type": "int64",
"value": 40
}
{
"key": "thread.name",
"type": "string",
"value": "periodic-triggers_Worker-1"
}I noticed that also other manually created spans appear out of context, so I thought about a "split brain" situation between clj-otel and the OpenTelemetry JavaAgent. I managed to trace it down to:
• https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/tag/v2.1.0 is the last version that works
• https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/tag/v2.2.0 is the first version where clj-otel does not see the agent's auto instrumentation contexts
I do not see anything obvious in the release notes of 2.2.0, except maybe https://github.com/open-telemetry/opentelemetry-java-instrumentation/pull/10854. But we run Java 11, not 21, i.e. we have no virtual threads. But maybe something in the release notes catches your eye?
I don't see any connection to virtual threads here.
When you say "other manually created threads appear out of context", are you referring to your Quartz example specifically, or all applications in general? I don't see any problem with the "sentence summary" microservice example in clj-otel which adds a manually created span inside a span created by the agent. I'm running with agent 2.3.0 in my test. (Need to go out now, will test with 2.2.0 later)
> When you say "other manually created threads appear out of context" We also create manual spans elsewhere (not just in our Quartz code) and they also appear separate from the trace context that the auto instrumentation created when Jetty handled the HTTP requests. > I don't see any problem with the "sentence summary" microservice example in clj-otel which adds a manually created span inside a span created by the agent. I'm running with agent 2.3.0 in my test. Thanks for testing it. I'll see whether I can figure out more specifically what is causing this. I was using 2.3.0 for https://clojurians.slack.com/archives/C034UN5273N/p1715704029142149. For the https://clojurians.slack.com/archives/C034UN5273N/p1715792433906449?thread_ts=1715704029.142149&cid=C034UN5273N I tried all versions from 2.0.0 to 2.2.0. 2.0.0 and 2.1.0 had no problem. The problem seems to have started with 2.2.0.
About turning off agent logging, are you sure turning it off does not help? If I modify your example such that the JVM option "-Dotel.javaagent.logging=none" is used, the issue is resolved when I run it. This setting mutes all agent logging, including the log messages on startup.
Thanks! I did not try setting it to none, only removing the property, which leaves it at the default value. Doing that would resolve the "clj-otel cannot see the context" issue, but trip the JSON processing of log messages.
I'm not sure I follow, if all the agent log messages are muted, there would be no issue with JSON processing...?
I did not try setting it to none.
For comparison, the logs with agent 2.1.0:
[CT] | EXECUTE ROOT CONTEXT #object[io.opentelemetry.javaagent.instrumentation.opentelemetryapi.context.AgentContextWrapper 0x60f4cc44 AgentContextWrapper{agentContext={}, applicationContext={}}]
[CT] | EXECUTE CURRENT CONTEXT #object[io.opentelemetry.javaagent.instrumentation.opentelemetryapi.context.AgentContextWrapper 0x1d81157b AgentContextWrapper{agentContext={opentelemetry-trace-span-key=SdkSpan{traceId=6de3b0a63bdd1731eb71aaddd29d6145, spanId=70d201f4c67c2583, parentSpanContext=ImmutableSpanContext{traceId=00000000000000000000000000000000, spanId=0000000000000000, traceFlags=00, traceState=ArrayBasedTraceState{entries=[]}, remote=false, valid=false}, name=DEFAULT...., kind=INTERNAL, attributes=AttributesMap{data={thread.name=periodic-triggers_Worker-1, thread.id=34, code.namespace=...task_queue.ProxyJob, code.function=execute}, capacity=128, totalAddedValues=4}, status=ImmutableStatusData{statusCode=UNSET, description=}, totalRecordedEvents=0, totalRecordedLinks=0, startEpochNanos=1715876358742397000, endEpochNanos=0}, opentelemetry-traces-local-root-span=SdkSpan{traceId=6de3b0a63bdd1731eb71aaddd29d6145, spanId=70d201f4c67c2583, parentSpanContext=ImmutableSpanContext{traceId=00000000000000000000000000000000, spanId=0000000000000000, traceFlags=00, traceState=ArrayBasedTraceState{entries=[]}, remote=false, valid=false}, name=DEFAULT...., kind=INTERNAL, attributes=AttributesMap{data={thread.name=periodic-triggers_Worker-1, thread.id=34, code.namespace=...task_queue.ProxyJob, code.function=execute}, capacity=128, totalAddedValues=4}, status=ImmutableStatusData{statusCode=UNSET, description=}, totalRecordedEvents=0, totalRecordedLinks=0, startEpochNanos=1715876358742397000, endEpochNanos=0}}, applicationContext={}}]
[CT] | EXECUTE BOUND CONTEXT #object[io.opentelemetry.javaagent.instrumentation.opentelemetryapi.context.AgentContextWrapper 0x3c1e4ba8 AgentContextWrapper{agentContext={opentelemetry-trace-span-key=SdkSpan{traceId=6de3b0a63bdd1731eb71aaddd29d6145, spanId=70d201f4c67c2583, parentSpanContext=ImmutableSpanContext{traceId=00000000000000000000000000000000, spanId=0000000000000000, traceFlags=00, traceState=ArrayBasedTraceState{entries=[]}, remote=false, valid=false}, name=DEFAULT...., kind=INTERNAL, attributes=AttributesMap{data={thread.name=periodic-triggers_Worker-1, thread.id=34, code.namespace=...task_queue.ProxyJob, code.function=execute}, capacity=128, totalAddedValues=4}, status=ImmutableStatusData{statusCode=UNSET, description=}, totalRecordedEvents=0, totalRecordedLinks=0, startEpochNanos=1715876358742397000, endEpochNanos=0}, opentelemetry-traces-local-root-span=SdkSpan{traceId=6de3b0a63bdd1731eb71aaddd29d6145, spanId=70d201f4c67c2583, parentSpanContext=ImmutableSpanContext{traceId=00000000000000000000000000000000, spanId=0000000000000000, traceFlags=00, traceState=ArrayBasedTraceState{entries=[]}, remote=false, valid=false}, name=DEFAULT...., kind=INTERNAL, attributes=AttributesMap{data={thread.name=periodic-triggers_Worker-1, thread.id=34, code.namespace=...task_queue.ProxyJob, code.function=execute}, capacity=128, totalAddedValues=4}, status=ImmutableStatusData{statusCode=UNSET, description=}, totalRecordedEvents=0, totalRecordedLinks=0, startEpochNanos=1715876358742397000, endEpochNanos=0}}, applicationContext={}}]
[CT] | EXECUTE WITH SPAN ROOT CONTEXT #object[io.opentelemetry.javaagent.instrumentation.opentelemetryapi.context.AgentContextWrapper 0x60f4cc44 AgentContextWrapper{agentContext={}, applicationContext={}}]
[CT] | EXECUTE WITH SPAN CURRENT CONTEXT #object[io.opentelemetry.javaagent.instrumentation.opentelemetryapi.context.AgentContextWrapper 0x1ba892e AgentContextWrapper{agentContext={opentelemetry-trace-span-key=SdkSpan{traceId=6de3b0a63bdd1731eb71aaddd29d6145, spanId=3818496ccb896ee1, parentSpanContext=ImmutableSpanContext{traceId=6de3b0a63bdd1731eb71aaddd29d6145, spanId=70d201f4c67c2583, traceFlags=01, traceState=ArrayBasedTraceState{entries=[]}, remote=false, valid=true}, name=TEST/execute, kind=CONSUMER, attributes=AttributesMap{data={code.lineno=314, thread.name=periodic-triggers_Worker-1, thread.id=34, code.filepath=.../task_queue.clj, code.namespace=....task-queue}, capacity=128, totalAddedValues=7}, status=ImmutableStatusData{statusCode=UNSET, description=}, totalRecordedEvents=0, totalRecordedLinks=0, startEpochNanos=1715876358743204879, endEpochNanos=0}, opentelemetry-traces-local-root-span=SdkSpan{traceId=6de3b0a63bdd1731eb71aaddd29d6145, spanId=70d201f4c67c2583, parentSpanContext=ImmutableSpanContext{traceId=00000000000000000000000000000000, spanId=0000000000000000, traceFlags=00, traceState=ArrayBasedTraceState{entries=[]}, remote=false, valid=false}, name=DEFAULT...., kind=INTERNAL, attributes=AttributesMap{data={thread.name=periodic-triggers_Worker-1, thread.id=34, code.namespace=...task_queue.ProxyJob, code.function=execute}, capacity=128, totalAddedValues=4}, status=ImmutableStatusData{statusCode=UNSET, description=}, totalRecordedEvents=0, totalRecordedLinks=0, startEpochNanos=1715876358742397000, endEpochNanos=0}, otel-context={}}, applicationContext={}}]
[CT] | EXECUTE WITH SPAN BOUND CONTEXT #object[io.opentelemetry.javaagent.instrumentation.opentelemetryapi.context.AgentContextWrapper 0x50838d38 AgentContextWrapper{agentContext={opentelemetry-trace-span-key=SdkSpan{traceId=6de3b0a63bdd1731eb71aaddd29d6145, spanId=3818496ccb896ee1, parentSpanContext=ImmutableSpanContext{traceId=6de3b0a63bdd1731eb71aaddd29d6145, spanId=70d201f4c67c2583, traceFlags=01, traceState=ArrayBasedTraceState{entries=[]}, remote=false, valid=true}, name=TEST/execute, kind=CONSUMER, attributes=AttributesMap{data={code.lineno=314, thread.name=periodic-triggers_Worker-1, thread.id=34, code.filepath=.../task_queue.clj, code.namespace=....task-queue}, capacity=128, totalAddedValues=7}, status=ImmutableStatusData{statusCode=UNSET, description=}, totalRecordedEvents=0, totalRecordedLinks=0, startEpochNanos=1715876358743204879, endEpochNanos=0}, opentelemetry-traces-local-root-span=SdkSpan{traceId=6de3b0a63bdd1731eb71aaddd29d6145, spanId=70d201f4c67c2583, parentSpanContext=ImmutableSpanContext{traceId=00000000000000000000000000000000, spanId=0000000000000000, traceFlags=00, traceState=ArrayBasedTraceState{entries=[]}, remote=false, valid=false}, name=DEFAULT...., kind=INTERNAL, attributes=AttributesMap{data={thread.name=periodic-triggers_Worker-1, thread.id=34, code.namespace=...task_queue.ProxyJob, code.function=execute}, capacity=128, totalAddedValues=4}, status=ImmutableStatusData{statusCode=UNSET, description=}, totalRecordedEvents=0, totalRecordedLinks=0, startEpochNanos=1715876358742397000, endEpochNanos=0}}, applicationContext={}}]
> I'll see whether I can figure out more specifically what is causing this.
To test the hypothesis that this could be a "diamond dependency" problem, I tried adding io.opentelemetry/opentelemetry-api {:mvn/version "1.36.0"} to deps.edn and running with https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/tag/v2.2.0, and io.opentelemetry/opentelemetry-api {:mvn/version "1.37.0"} with https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/tag/v2.3.0, but no change -- EXECUTE CURRENT CONTEXT is initially output with an empty io.opentelemetry.context.ArrayBasedContext instead of the io.opentelemetry.javaagent.instrumentation.opentelemetryapi.context.AgentContextWrapper I would expect.
Since io.opentelemetry.semconv/opentelemetry-semconv {:mvn/version "1.24.0-alpha"} mentions breaking changes and io.opentelemetry.semconv/opentelemetry-semconv {:mvn/version "1.25.0-alpha"} is supposed to restore backward compatibility, I also tried to override clj-otel-api's dependency with 1.25.0-alpha in our deps.edn just in case the agent's version of it would somehow leak into the classpath, but no change either -- ArrayBasedContext instead of AgentContextWrapper.
I also tried checking out 3eb216d3553b28a2574e323470a0bce077394b26 of clj-otel-api locally and using that instead (without mentioning any io.opentelemetry* dependencies in our deps.edn), but still the same output -- ArrayBasedContext instead of AgentContextWrapper.It would be helpful if you could provide a minimal project which reproduces the issue. I'm having no luck reproducing the issue you are seeing.
This is how the io.opentelemetry.context.Context that steffan-westcott.clj-otel.context/current does not see is created: https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/68cf431abfc97d1ba3132e8dbaf89e6ecd058561/instrumentation/quartz-2.0/library/src/main/java/io/opentelemetry/instrumentation/quartz/v2_0/TracingJobListener.java#L37-L66
Note that the agent also instruments JDBC access automatically. See https://github.com/steffan-westcott/clj-otel/tree/master/examples/rpg-service/src/example/rpg_service for an example. It is not clear in your description, but the "span created by OTel JavaAgent for Quartz Job" might actually be a span around database access instead.
I'm having difficulty following your explanation of what spans you actually see. Perhaps some screenshots of the traces displayed in Jaeger might clear things up for me.
I haven't worked with Quartz, but I suspect what you are seeing is a result of async processing, or lazy evaluation. The body in your with-span! form has completed before the execute! starts.
It is important to note with-span! works with the current context, which is a thread local reference maintained by Java OpenTelemetry. Therefore, it only works well with synchronous code.
Something else doesn't follow in your description: You should never get root as the current context inside a span, if you are using synchronous code.
I think I'd also need to see your code to be able to offer help.
After some investigation, I have a better understanding of what is happening. Also, I've found an alternative easy workaround if you really require DEBUG level agent logs. I hoped to make a simple repro in Java using the OpenTelemetry Log4J2 NDC provider, but I have not been able to do this.
The OpenTelemetry instrumentation agent https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/b36d2845b35db088008a8ea11f6666977d285e9e/instrumentation/opentelemetry-api/opentelemetry-api-1.0/javaagent/src/main/java/io/opentelemetry/javaagent/instrumentation/opentelemetryapi/ContextInstrumentation.java#L36-L38 ContextStorage/root (when it is first used) to wrap the ArrayBasedContext that would otherwise have been returned with a helper method in AgentContextStorage. If DEBUG level agent logging is turned on, log messages are emitted during this instrumentation process.
My tests show that a call to ContextStorage/root during its own instrumentation will cause issues. This will happen if we attempt to access the current context during emitting an agent log message when instrumenting ContextStorage/root.
The above suggests at least two workarounds:
1 - Consider if you really need otel.javaagent.logging=application and enabling DEBUG level agent logging
2 - Evaluate (context/root) before routing agent logs through the logger in your application
(defn -main
[]
(context/root) ; Must allow agent to instrument ContextStorage/root before emitting agent log messages
(log/merge-config! {:min-level [[#{"io.opentelemetry.javaagent.*"} :debug] [#{"*"} :info]]
:middleware [otel-context-middleware]})
(jetty/run-jetty #'handler {:port 3000}))Java agent 2.4.0 was released a few hours ago. I've done a quick test with countries-service and sentence-summary-service examples with no problems.
I found a repro: https://github.com/devurandom/clj-otel-mwe/commit/f0a17c82e6bce30ee2939020b7e7c8422c551078 Both the Timbre middleware and the Java property are needed to break clj-otel.
It can be fixed with https://github.com/devurandom/clj-otel-mwe/commit/51cb7dd1d43aa77cf5ae10403f5ec1dbc4341484.
(Not visible in the mwe: We use the otel-context-middleware to transport the otel context if (.isValid span-context) into a Timbre {:output-opts {:msg-fn ...}} that includes the trace and span ID in the logged message, if present.)
I do not even need to call (span/get-span-context). Calling (context/current) is enough: https://github.com/devurandom/clj-otel-mwe/commit/469d9e5f35c72571eb2a28dacfe228214858cd31
I have reproduced the issue, so thanks for that 😄 I believe what you are seeing is an interaction of turning on agent logging, LazyStorage (which will establish a source of Context objects only when one is requested, which is normally well after the rest of the OpenTelemetry SDK is initialised) and SLF4J classpath shenanigans. Somewhere in all of that mess is causing this issue.
As a short-term fix, perhaps you don't need agent logging turned on.
I suspect this is an issue with the interaction I mentioned above and not related to clj-otel at all. I will try to recreate the issue in a small Java application. If I succeed, I will report the issue on the opentelemetry-java-instrumentation Github repo.
For your information, to cut down the noisy log, I'm using this modified version of your repro:
(defn- handler
[_]
(println "CONTEXT" (context/current))
(span/with-span! "Nested span")
{:status 200 :body nil})
(defn- otel-context-middleware
[data]
(context/current)
data)
(defn -main
[]
(log/merge-config! {:min-level [[#{"io.opentelemetry.javaagent.*"} :debug] [#{"*"} :info]]
;; With this middleware and `otel.javaagent.logging=application`
;; clj-otel cannot see agent-generated spans:
:middleware [otel-context-middleware]})
(jetty/run-jetty #'handler {:port 3000}))> As a short-term fix, perhaps you don't need agent logging turned on.
>
Sadly we do need it, as otherwise the agent emits text log messages and breaks the JSON parsing of a colleague's logging stack. But https://github.com/devurandom/clj-otel-mwe/commit/51cb7dd1d43aa77cf5ae10403f5ec1dbc4341484 works as a workaround, so we're good.
> I suspect this is an issue with the interaction I mentioned above and not related to clj-otel at all. I will try to recreate the issue in a small Java application. If I succeed, I will report the issue on the opentelemetry-java-instrumentation Github repo.
>
If this could be fixed in the agent, that would be awesome. Thanks for looking into it!