#@vito I'm working on integrating
My first attempt has been to just do something dumb with pushing fake logs that have the data in them and updating the TUI code to handle that data and display it separately
I'm sure that's viable but I'm wondering if I should just be jumping to using OTEL metrics for this? I know we have some plumbing code in various places for metrics, but would it be a huge effort to hook all that up still? Or is everything in place basically to your knowledge?
Also, if you have any general suggestions entirely different than what I'm doing I'm all ears of course
(continuing in the background with hacking it together with logs and special attributes, no rush)
Yeah I think it's worth giving a go at OTel metrics - but I haven't messed with it myself yet so don't have any guidance besides guessing your way through all the plumbing. So feel free to just do whatever works first and I can take a look tomorrow. Excited to see progress here 
@heavy flint NICE
+1 on using native otel metrics
We don't support those in the API at the moment, my suggestion would be to:
- Still use otel metrics anyway in the codebase
- Perhaps there's a way to ship those metrics outside of cloud to try them out, @fickle dune?
- I can help adding metrics to cloud API, once we confirm they fit the bill
Been working through the otel metrics integration this afternoon after finishing a refactor of the cgroup monitoring code to publish samples on a chan and some other simplifications. So far it has been 90% copy-pasting chunks of code and then s/Log/Metric/ more or less. The OTEL libs and our own wrapping code around them all seem quite standardized in that respect, which is nice
Think I'm nearing the more meaty parts that hit the TUI soon though
lol, sweet - I think the trickiest part is just the OTLP protobufs <-> OTel Go SDK type conversions, and even that is mostly mindless. Sorry I didn't find time for it anyway!
No problem, I'm happy to get a little more xp with OTEL so I can have more of a clue what's going on
Spit-balling approaches as I try them if you have a sec:
Since otel metrics aren't inherently associated with spans/traces (afaict?) do you think for now just including the Call digest as an attribute on the metric and using that digest to find the right place in the TUI to display the associated metrics makes sense? Since this is just for exec-ops for now I feel like that would cover all the cases atm.
Alternatively I'm guessing I could include the span/trace id of the exec op as an attribute on the metric instead? Might be more generic in the long run?
Also curious if one or the other would hypothetically fit better in the cloud's db schema + indexing
Call digest or Buildkit op digest makes sense, yeah - wouldn't be the first time we used those to inform the UI
In the long run we'll likely have stable function digests and/or stable span digests, so the metrics can be compared across traces - actually, speaking of, yeah it might make sense to put the trace ID in there as an attribute too
Honestly just guessing here, we can iterate, the OTel metrics patterns are super different from everything else
Yep that's what I figured, just checking if I'm on a vaguely correct path here. That sounds good though, I'll just throw in all the attributes for now
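A rough sketch of the "digest as an attribute" idea discussed above, using only the standard library. Everything named here is an assumption for illustration: the `point` type is a simplified stand-in for an exported metric data point (not an OTel SDK type), and the attribute keys and metric name are made up. The UI side groups incoming points by Call digest and keeps the latest value per metric.

```go
package main

import "fmt"

// Hypothetical attribute keys; the real names are not settled in this thread.
const (
	attrCallDigest = "example.call.digest"
	attrTraceID    = "example.trace.id"
)

// point is a simplified stand-in for one exported metric data point.
type point struct {
	Name  string
	Value int64
	Attrs map[string]string
}

// latestByCall groups points by Call digest and keeps the most recent value
// per metric name, assuming points arrive in chronological order.
func latestByCall(points []point) map[string]map[string]int64 {
	out := make(map[string]map[string]int64)
	for _, p := range points {
		digest, ok := p.Attrs[attrCallDigest]
		if !ok {
			continue // no digest, so nothing in the UI to attach it to
		}
		if out[digest] == nil {
			out[digest] = make(map[string]int64)
		}
		out[digest][p.Name] = p.Value
	}
	return out
}

func main() {
	attrs := map[string]string{attrCallDigest: "digest-a", attrTraceID: "trace-1"}
	points := []point{
		{Name: "disk.write.bytes", Value: 100, Attrs: attrs},
		{Name: "disk.write.bytes", Value: 200, Attrs: attrs},
	}
	fmt.Println(latestByCall(points)["digest-a"]["disk.write.bytes"])
}
```

Carrying the trace ID alongside the digest costs little here and leaves room to switch the lookup key later.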
<whining> WHY is this package internal??? https://github.com/open-telemetry/opentelemetry-go/blob/main/exporters/otlp/otlpmetric/otlpmetrichttp/internal/transform/metricdata.go </whining>
(seems like those transform packages being public would save us 100s of lines of boilerplate code for logs/traces/metrics, oh well)
yuuuuuuuuuuuuuuuuuuuuuuuuuuuuuup
OTel has a pretty obnoxious nanny culture around what's exposed and how, it's almost like they're offended people are using their standards sometimes
like part of me wonders if that Ok/Error 1 <=> 2 swap between OTLP and the Go SDK is literally just to keep integrators on their toes
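For anyone hitting the same swap, a minimal sketch of the mapping: in the OTLP protobuf, STATUS_CODE_OK is 1 and STATUS_CODE_ERROR is 2, while in the Go SDK's `codes` package `Error` is 1 and `Ok` is 2. The constants below are local stand-ins that mirror those values rather than imports from either library, and the function names are made up for illustration.

```go
package main

import "fmt"

// otlpStatus mirrors the OTLP protobuf status code values.
type otlpStatus int32

const (
	otlpUnset otlpStatus = 0
	otlpOK    otlpStatus = 1
	otlpError otlpStatus = 2
)

// sdkCode mirrors the values in go.opentelemetry.io/otel/codes.
type sdkCode uint32

const (
	sdkUnset sdkCode = 0
	sdkError sdkCode = 1
	sdkOk    sdkCode = 2
)

// sdkToOTLP converts by name, never by numeric value.
func sdkToOTLP(c sdkCode) otlpStatus {
	switch c {
	case sdkOk:
		return otlpOK
	case sdkError:
		return otlpError
	default:
		return otlpUnset
	}
}

// otlpToSDK is the inverse of sdkToOTLP.
func otlpToSDK(s otlpStatus) sdkCode {
	switch s {
	case otlpOK:
		return sdkOk
	case otlpError:
		return sdkError
	default:
		return sdkUnset
	}
}

func main() {
	fmt.Println(sdkToOTLP(sdkError), otlpToSDK(otlpOK))
}
```

A plain numeric cast between the two types would silently turn every Ok into an Error and vice versa, which is why the switch is worth the extra lines.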
Huh...
runtime: pointer 0x4028a250d0 to unused region of span span.base()=0x400dac8000 span.limit=0x400dac9fe0 span.state=1
runtime: found in object at *(0x400edc5b70+0x0)
object=0x400edc5b70 s.base()=0x400edc4000 s.limit=0x400edc6000 s.spanclass=4 s.elemsize=16 s.state=mSpanInUse
*(object+0) = 0x4028a250d0 <==
*(object+8) = 0x1d
fatal error: found bad pointer in Go heap (incorrect use of unsafe or cgo?)
(from the engine)
I did something wrong in the OTEL metrics pipeline (it's trying to export way too much), but that's a new failure mode for me
oh dear
seems like some kind of data race?
https://github.com/golang/go/issues/51552 only result lol
the word "span" there might even be a coincidence
Oh yeah span is referring to the Go GC I think lol. We don't have cgo anywhere so I'm inclined to believe it when it says "incorrect use of unsafe". Last I checked I think at least protobuf uses unsafe somewhere. And whatever bug I have seems to be resulting in way too many exports, which would probably cause way too many protobuf ser/desers, etc.
I'll fix the export problem and hope it never reappears for now
Okay finally got the pipeline working and displaying in the TUI in the most ugly way possible (green text is each metric for number of disk bytes written in the exec, which writes 100MB every 2s for 10 iterations)
The metrics do show up live and append as they come in though, which is the important point. So just gotta clean up the code and make that output more comprehensible. I'm imagining something simple like:
Container.withExec(...) <timer> | Disk Read Bytes: <num formatted with kb/mb/etc.> | Disk Write Bytes: <num formatted>, with the numbers just being the most recent read values
Something like a trendline would be cool but will save for down the line
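The status line proposed above could be rendered roughly like this. It's only a sketch under assumptions: `formatBytes` and `statusLine` are hypothetical helpers rather than existing TUI code, and decimal units (1 kB = 1000 B) are an arbitrary choice.

```go
package main

import (
	"fmt"
	"time"
)

// formatBytes renders a byte count with decimal units (kB, MB, ...).
func formatBytes(n uint64) string {
	const unit = 1000
	if n < unit {
		return fmt.Sprintf("%d B", n)
	}
	div, exp := uint64(unit), 0
	for v := n / unit; v >= unit; v /= unit {
		div *= unit
		exp++
	}
	return fmt.Sprintf("%.1f %cB", float64(n)/float64(div), "kMGTPE"[exp])
}

// statusLine builds the one-line summary from the most recent sample values.
func statusLine(call string, elapsed time.Duration, read, written uint64) string {
	return fmt.Sprintf("%s %s | Disk Read Bytes: %s | Disk Write Bytes: %s",
		call, elapsed.Round(time.Second), formatBytes(read), formatBytes(written))
}

func main() {
	// 10 iterations of 100MB, one every 2s, as in the test exec above.
	fmt.Println(statusLine("Container.withExec(...)", 20*time.Second, 0, 1_000_000_000))
}
```

Since only the latest values are shown, the line can be redrawn in place on each new sample; a trendline would need the history kept around instead.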