#otel: live progress

shy berry
#

Hey @plucky drift 👋 I'm trying to add live progress support in Python, can you help me figure out what's missing? This is what I have:

class LiveSpanProcessor(sdktrace.SpanProcessor):
    def __init__(self, exp: SpanExporter):
        self._span_processor = BatchSpanProcessor(exp, schedule_delay_millis=100)

    def on_start(self, span: sdktrace.Span, parent_context=None) -> None:
        return super().on_end(span)


def live_traces_enabled() -> bool:
    return os.getenv("OTEL_EXPORTER_OTLP_TRACES_LIVE") is not None


def _init_tracing(exporters: dict[str, type[SpanExporter]]):
    provider = get_tracer_provider()

    if isinstance(provider, sdktrace.TracerProvider):
        for exporter_class in exporters.values():
            proc_cls = (
                LiveSpanProcessor if live_traces_enabled() else BatchSpanProcessor
            )
            provider.add_span_processor(proc_cls(exporter_class()))

That comes from here: https://github.com/dagger/dagger/blob/cb77cc8da722f69dded2d4e59a14212ef9a02243/sdk/python/src/dagger/telemetry.py#L107-L108

#

Just noticed the on_end is not being called on the underlying span processor.

plucky drift
#

yea that's the only part that stuck out to me 🙂 needs to use self._span_processor right?

shy berry
#

yep yep

#

Sorry for the ping 🙂

plucky drift
#

nice! np np

shy berry
#

Hmm.... they seem to be stuck in a pending state even when it looks like they've finished (same link)

plucky drift
#

some guesses:

  • you might need to do the same thing as here - though this would be more likely to affect very fast spans
  • is the program exiting without flushing?
shy berry
#

The program is still running... it takes a really long time. But on the top level there's only 4 that should happen at the same time (there's hundreds), if you see a fifth, the first has finished, etc...

#

That SnapshotSpan seems lower level. Was it written by you or copied from the otel library?

plucky drift
#

written by me; the Go version is dumb, it converts to protobuf and back lol

#

i'm not sure if that's the reason though, total shot in the dark, might have been behavior unique to Go anyway

shy berry
#

If not for that filter, is the idea that it should just be a read-only span?

plucky drift
#

The problem was that OTel spans mutate in-place in the Go SDK (likely others too), so if a span started and finished within an export interval, you'd end up with two completed spans being sent, rather than one unfinished one and one completed one. The Dagger CLI uses FilterLiveSpansExporter to prevent sending unfinished spans to places like Honeycomb, but without this change we would end up sending the same completed span twice, instead of filtering it out
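To make the in-place mutation problem concrete, here is a minimal sketch. It uses hypothetical stand-in classes (`FakeSpan`, `QueueingExporter`), not the real OTel SDK or the SnapshotSpan code discussed above, and it assumes a shallow copy is enough to freeze the span's state:

```python
import copy
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeSpan:
    """Hypothetical stand-in for an SDK span that is mutated in place."""

    name: str
    end_time: Optional[float] = None

    def end(self, at: float) -> None:
        self.end_time = at


class QueueingExporter:
    """Hypothetical exporter that only inspects spans when flush() runs."""

    def __init__(self, snapshot: bool):
        self.snapshot = snapshot
        self.queue = []

    def enqueue(self, span: FakeSpan) -> None:
        # With snapshot=True the span's state is frozen at enqueue time;
        # otherwise the queue holds a reference to the live, mutable span.
        self.queue.append(copy.copy(span) if self.snapshot else span)

    def flush(self) -> list:
        # Report (name, finished?) for every queued entry, then clear.
        sent = [(s.name, s.end_time is not None) for s in self.queue]
        self.queue.clear()
        return sent


def run(snapshot: bool) -> list:
    exporter = QueueingExporter(snapshot)
    span = FakeSpan("fast-span")
    exporter.enqueue(span)  # on_start: queue the still-running span
    span.end(at=1.0)        # span finishes before the next export tick
    exporter.enqueue(span)  # on_end: queue the finished span
    return exporter.flush()
```

Without the snapshot, both queue entries point at the same object, so `run(False)` reports two finished spans. With it, `run(True)` reports one unfinished span followed by one finished span, which is what a downstream filter for live spans needs in order to tell them apart.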

shy berry
#

Since that Go code is used by server, client, etc.... it's hard to tell what's client specific that would apply to other SDKs.

plucky drift
#

Yeah - that filtering isn't something you have to worry about, since it's handled in the CLI for you

#

Actually hm

#

you might still care technically, since you'd have the same problem at this level, only it'd be Python sending to our internal exporter, which would pass along the "bad data" and still be unable to filter it

#

but, I don't think that's what you're running into right now - since that would mean only a completed span gets sent, and not a running one, and it looks like you're setting an export interval that's shorter than the spans

shy berry
#

Yep

plucky drift
#

seems like this is happening for basically every span?

#

every Python-created span anyway

shy berry
#

Yes

#

But in this case, I'm manually setting error statuses. Not sure if that makes a dent. I'll try removing that.

plucky drift
#

hmm shouldn't matter

#

my only other guess atm is something weird in the Python OTel SDK like keeping track of spans which have already been exported. but, i'd be a little surprised by that, seems like it would be a memory leak

plucky drift
#

oh

#

do you need to implement on_end to call through to the underlying span processor?

shy berry
#

Yep

plucky drift
#

lol I imagined it as a Go embed, glossed over it

shy berry
#

Oh 😳

#

Maybe all the methods? force_flush, shutdown as well 🙂

plucky drift
#

I guess so - is there no Python pattern for wrapping like that? :thinkspin:
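For reference, explicit forwarding might look roughly like the sketch below. The classes are stand-ins rather than the real SDK (`SpanProcessorBase` and `RecordingProcessor` are hypothetical), with hook signatures modeled on the SDK's `SpanProcessor`. Because the base class already defines every hook as a no-op, a `__getattr__`-based delegate would never be consulted for them, so each one has to be overridden by hand:

```python
class SpanProcessorBase:
    """Hypothetical stand-in for the SDK's SpanProcessor: all hooks are no-ops."""

    def on_start(self, span, parent_context=None) -> None:
        pass

    def on_end(self, span) -> None:
        pass

    def shutdown(self) -> None:
        pass

    def force_flush(self, timeout_millis: int = 30000) -> bool:
        return True


class RecordingProcessor(SpanProcessorBase):
    """Hypothetical stand-in for BatchSpanProcessor that records hook calls."""

    def __init__(self):
        self.calls = []

    def on_end(self, span) -> None:
        self.calls.append(("on_end", span))

    def shutdown(self) -> None:
        self.calls.append(("shutdown", None))

    def force_flush(self, timeout_millis: int = 30000) -> bool:
        self.calls.append(("force_flush", timeout_millis))
        return True


class LiveSpanProcessor(SpanProcessorBase):
    """Wraps another processor and forwards every hook explicitly."""

    def __init__(self, inner: SpanProcessorBase):
        self._span_processor = inner

    def on_start(self, span, parent_context=None) -> None:
        # Report the span as soon as it starts so it shows up as running.
        self._span_processor.on_end(span)

    def on_end(self, span) -> None:
        # Without this override the inherited no-op swallows the finished span.
        self._span_processor.on_end(span)

    def shutdown(self) -> None:
        self._span_processor.shutdown()

    def force_flush(self, timeout_millis: int = 30000) -> bool:
        return self._span_processor.force_flush(timeout_millis)
```

A started-and-finished span then reaches the inner processor twice, once from `on_start` and once from `on_end`, and `shutdown` and `force_flush` are no longer silently dropped.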

shy berry
#

There's an easier way, if I inherit from SynchronousMultiSpanProcessor, it already has an underlying span processor (actually supports several).

#
class LiveSpanProcessor(sdktrace.SynchronousMultiSpanProcessor):
    def __init__(self, exp: SpanExporter):
        super().__init__()
        self.add_span_processor(BatchSpanProcessor(exp))

    def on_start(self, span: sdktrace.Span, parent_context=None) -> None:
        return self.on_end(span)
#

So simple, love it!

#

Thanks 🙏

plucky drift
#

nice! np

#

i would recommend keeping the 100ms interval btw

#

it helps a lot for the CLI, i think the defaults are much higher, on the order of seconds

shy berry
#

I have, but moved to an env var:

        if live_traces_enabled():
            os.environ.setdefault(OTEL_BSP_SCHEDULE_DELAY, "100")

This way it can be easily configured.
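A small sketch of why `setdefault` keeps this configurable. The helper name `apply_live_defaults` is hypothetical, and the constant is assumed to hold the same string as the SDK's `OTEL_BSP_SCHEDULE_DELAY` constant:

```python
import os

# Assumption: same string value as the SDK constant used in the snippet above.
OTEL_BSP_SCHEDULE_DELAY = "OTEL_BSP_SCHEDULE_DELAY"


def apply_live_defaults(env) -> None:
    """Set a 100 ms batch delay unless the caller already chose one."""
    if env.get("OTEL_EXPORTER_OTLP_TRACES_LIVE") is not None:
        # setdefault only writes when the key is missing, so an explicit
        # user-provided delay always wins over the 100 ms default.
        env.setdefault(OTEL_BSP_SCHEDULE_DELAY, "100")


if __name__ == "__main__":
    apply_live_defaults(os.environ)
```

This presumably has to run before the BatchSpanProcessor is constructed, since the delay is read when the processor is created.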

plucky drift
#

ah cool