@vito I'm working on integrating | Dagger | Page 1

heavy flint Sep 17, 2024, 11:44 PM

#

My first attempt has been to just do something dumb with pushing fake logs that have the data in them and updating the TUI code to handle that data and display it separately

#

I'm sure that's viable but I'm wondering if I should just be jumping to using OTEL metrics for this? I know we have some plumbing code in various places for metrics, but would it be a huge effort to hook all that up still? Or is everything in place basically to your knowledge?

#

Also, if you have any general suggestions entirely different than what I'm doing I'm all ears of course 😄

#

(continuing in the background with hacking it together with logs and special attributes, no rush)

fickle dune Sep 18, 2024, 12:28 AM

#

heavy flint I'm sure that's viable but I'm wondering if I should just be jumping to using OT...

Yeah I think it's worth giving a go at OTel metrics - but I haven't messed with it myself yet so don't have any guidance besides guessing your way through all the plumbing. So feel free to just do whatever works first and I can take a look tomorrow. Excited to see progress here AHHH

tranquil void Sep 19, 2024, 4:14 PM

#

@heavy flint NICE

+1 on using native otel metrics

We don't support those in the API at the moment, my suggestion would be to:

Still use otel metrics anyway in the codebase
Perhaps there's a way to ship those metrics outside of cloud to try them out, @fickle dune?
I can help adding metrics to cloud API, once we confirm they fit the bill

heavy flint Sep 19, 2024, 10:55 PM

#

Been working through the otel metrics integration this afternoon, after finishing a refactor of the cgroup monitoring code to publish samples on a chan (plus some other simplifications). So far it has been 90% copy-pasting chunks of code and then s/Log/Metric/ more or less 😆 The OTEL libs and our own wrapping code around them all seem quite standardized in that respect, which is nice :guy_fieri_chef_kiss: Think I'm nearing the more meaty parts that hit the TUI soon though

fickle dune Sep 19, 2024, 10:59 PM

#

heavy flint Been working through the otel metrics integration this afternoon after finishing...

lol, sweet - I think the trickiest part is just the OTLP protobufs <-> OTel Go SDK type conversions, and even that is mostly mindless. Sorry I didn't find time for it anyway!

heavy flint Sep 19, 2024, 11:00 PM

#

fickle dune lol, sweet - I think the trickiest part is just the OTLP protobufs <-> OTel Go S...

No problem, I'm happy to get a little more xp with OTEL so I can have more of a clue what's going on

heavy flint Sep 20, 2024, 12:16 AM

#

fickle dune lol, sweet - I think the trickiest part is just the OTLP protobufs <-> OTel Go S...

Spit-balling approaches as I try them if you have a sec:

Since otel metrics aren't inherently associated with spans/traces (afaict?) do you think for now just including the Call digest as an attribute on the metric and using that digest to find the right place in the TUI to display the associated metrics makes sense? Since this is just for exec-ops for now I feel like that would cover all the cases atm.

Alternatively I'm guessing I could include the span/trace id of the exec op as an attribute on the metric instead? Might be more generic in the long run?

Also curious if one or the other would hypothetically fit better in the cloud's db schema + indexing
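
A rough sketch of the digest-as-attribute idea, using only the standard library rather than the real OTel SDK types. Everything here is hypothetical and for illustration only: `MetricPoint`, `MetricIndex`, and the attribute keys `example.call_digest` / `example.trace_id` are made-up names, not anything from Dagger or OTel. The idea is just that each incoming point carries the Call digest (and optionally the trace ID) as attributes, and the TUI side keeps the latest point per digest and metric name so it can find the right row to decorate.

```go
package main

import "sync"

// Hypothetical attribute keys; the real key names are not specified in this thread.
const (
	AttrCallDigest = "example.call_digest"
	AttrTraceID    = "example.trace_id"
)

// MetricPoint is a simplified stand-in for one exported metric data point.
type MetricPoint struct {
	Name  string
	Value int64
	Attrs map[string]string
}

// MetricIndex keeps the most recent point per (call digest, metric name),
// so a UI can look up the metrics that belong to a given call.
type MetricIndex struct {
	mu       sync.Mutex
	byDigest map[string]map[string]MetricPoint
}

func NewMetricIndex() *MetricIndex {
	return &MetricIndex{byDigest: make(map[string]map[string]MetricPoint)}
}

// Record stores p under its call digest. It returns false when the point
// has no digest attribute, since there is then nowhere to display it.
func (idx *MetricIndex) Record(p MetricPoint) bool {
	digest := p.Attrs[AttrCallDigest] // reading a nil map is safe in Go
	if digest == "" {
		return false
	}
	idx.mu.Lock()
	defer idx.mu.Unlock()
	byName, ok := idx.byDigest[digest]
	if !ok {
		byName = make(map[string]MetricPoint)
		idx.byDigest[digest] = byName
	}
	byName[p.Name] = p // later points overwrite earlier ones
	return true
}

// Latest returns the most recent point recorded for a digest and metric name.
func (idx *MetricIndex) Latest(digest, name string) (MetricPoint, bool) {
	idx.mu.Lock()
	defer idx.mu.Unlock()
	p, ok := idx.byDigest[digest][name]
	return p, ok
}
```

Keeping the trace ID as a plain extra attribute costs nothing in this shape, so "throw in both" works fine until it's clearer which one the lookup should key on.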

fickle dune Sep 20, 2024, 12:25 AM

#

heavy flint Spit-balling approaches as I try them if you have a sec: Since otel metrics are...

Call digest or Buildkit op digest makes sense, yeah - wouldn't be the first time we used those to inform the UI

#

In the long run we'll likely have stable function digests and/or stable span digests, so the metrics can be compared across traces - actually, speaking of, yeah it might make sense to put the trace ID in there as an attribute too

#

Honestly just guessing here, we can iterate, the OTel metrics patterns are super different from everything else

heavy flint Sep 20, 2024, 12:28 AM

#

fickle dune Honestly just guessing here, we can iterate, the OTel metrics patterns are super...

Yep that's what I figured, just checking if I'm on a vaguely correct path here 😄 That sounds good though, I'll just throw in all the attributes for now

heavy flint Sep 20, 2024, 6:45 PM

#

<whining> WHY is this package internal??? https://github.com/open-telemetry/opentelemetry-go/blob/main/exporters/otlp/otlpmetric/otlpmetrichttp/internal/transform/metricdata.go </whining>

#

(seems like those transform packages being public would save us 100s of lines of boilerplate code for logs/traces/metrics, oh well)

fickle dune Sep 20, 2024, 7:07 PM

#

yuuuuuuuuuuuuuuuuuuuuuuuuuuuuuup

#

OTel has a pretty obnoxious nanny culture around what's exposed and how, it's almost like they're offended people are using their standards sometimes

#

like part of me wonders if that Ok/Error 1 <=> 2 swap between OTLP and the Go SDK is literally just to keep integrators on their toes
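
For reference, a small sketch of that swap. The numeric values mirror the OTLP `StatusCode` enum (Unset=0, Ok=1, Error=2) and the Go SDK's `codes` package (Unset=0, Error=1, Ok=2), but the constants and function names below are local stand-ins written for illustration, not the real generated or SDK types.

```go
package main

// Values as defined by the OTLP trace protobuf Status.StatusCode enum.
const (
	otlpUnset int32 = 0
	otlpOk    int32 = 1
	otlpError int32 = 2
)

// Values as defined by go.opentelemetry.io/otel/codes (Code is a uint32 there).
const (
	sdkUnset uint32 = 0
	sdkError uint32 = 1
	sdkOk    uint32 = 2
)

// otlpToSDK converts an OTLP status code to the Go SDK numbering.
// Unknown values fall back to Unset (an assumption made for this sketch).
func otlpToSDK(c int32) uint32 {
	switch c {
	case otlpOk:
		return sdkOk
	case otlpError:
		return sdkError
	default:
		return sdkUnset
	}
}

// sdkToOTLP converts a Go SDK status code to the OTLP numbering.
func sdkToOTLP(c uint32) int32 {
	switch c {
	case sdkOk:
		return otlpOk
	case sdkError:
		return otlpError
	default:
		return otlpUnset
	}
}
```

A plain numeric cast between the two would silently turn every Ok into an Error and vice versa, which is why the conversion has to go through an explicit switch.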

heavy flint Sep 24, 2024, 5:22 PM

#

Huh...

runtime: pointer 0x4028a250d0 to unused region of span span.base()=0x400dac8000 span.limit=0x400dac9fe0 span.state=1
runtime: found in object at *(0x400edc5b70+0x0)
object=0x400edc5b70 s.base()=0x400edc4000 s.limit=0x400edc6000 s.spanclass=4 s.elemsize=16 s.state=mSpanInUse
 *(object+0) = 0x4028a250d0 <==
 *(object+8) = 0x1d
fatal error: found bad pointer in Go heap (incorrect use of unsafe or cgo?)

(from the engine)

#

I did something wrong in the OTEL metrics pipeline (it's trying to export way too much), but that's a new failure mode for me 😄

fickle dune Sep 24, 2024, 5:41 PM

#

oh dear

#

seems like some kind of data race?

#

https://github.com/golang/go/issues/51552 only result lol

#

the word "span" there might even be a coincidence

heavy flint Sep 24, 2024, 5:44 PM

#

Oh yeah span is referring to the Go GC I think lol. We don't have cgo anywhere so I'm inclined to believe it when it says "incorrect use of unsafe". Last I checked I think at least protobuf uses unsafe somewhere. And whatever bug I have seems to be resulting in way too many exports, which would probably cause way too many protobuf ser/desers, etc.

#

I'll fix the export problem and hope it never reappears for now 🤞

heavy flint Sep 25, 2024, 6:17 PM

#

Okay finally got the pipeline working and displaying in the TUI in the most ugly way possible (green text is each metric for number of disk bytes written in the exec, which writes 100MB every 2s for 10 iterations)

Screenshot_2024-09-25_at_11.16.22_AM.png

#

The metrics do show up live and append as they come in though, which is the important point. So just gotta clean up the code and make that output more comprehensible. I'm imagining something simple like:
Container.withExec(...) <timer> | Disk Read Bytes: <num formatted with kb/mb/etc.> | Disk Write Bytes: <num formatted>, with the numbers just being the most recent read values
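
A minimal sketch of what that line could look like in code. The helper names `formatBytes` and `statusLine` are made up for illustration, and the choice of binary units (KiB/MiB, 1024-based) is an assumption; swapping in 1000-based kB/MB would only change the constant and the suffix.

```go
package main

import (
	"fmt"
	"time"
)

// formatBytes renders a byte count with a binary unit suffix,
// e.g. 1536 becomes "1.5 KiB".
func formatBytes(n uint64) string {
	const unit = 1024
	if n < unit {
		return fmt.Sprintf("%d B", n)
	}
	div, exp := uint64(unit), 0
	for v := n / unit; v >= unit; v /= unit {
		div *= unit
		exp++
	}
	return fmt.Sprintf("%.1f %ciB", float64(n)/float64(div), "KMGTPE"[exp])
}

// statusLine builds the one-line summary for a call, using only the most
// recently read values for each metric.
func statusLine(call string, elapsed time.Duration, readBytes, writeBytes uint64) string {
	return fmt.Sprintf("%s %s | Disk Read Bytes: %s | Disk Write Bytes: %s",
		call,
		elapsed.Round(100*time.Millisecond),
		formatBytes(readBytes),
		formatBytes(writeBytes),
	)
}
```

For example, `statusLine("Container.withExec(...)", 2*time.Second, 0, 100*1024*1024)` gives `Container.withExec(...) 2s | Disk Read Bytes: 0 B | Disk Write Bytes: 100.0 MiB`.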

#

Something like a trendline would be cool but will save for down the line
