#@vito I'm working on integrating

heavy flint
#

My first attempt has been to just do something dumb with pushing fake logs that have the data in them and updating the TUI code to handle that data and display it separately

#

I'm sure that's viable but I'm wondering if I should just be jumping to using OTEL metrics for this? I know we have some plumbing code in various places for metrics, but would it be a huge effort to hook all that up still? Or is everything in place basically to your knowledge?

#

Also, if you have any general suggestions entirely different than what I'm doing I'm all ears of course 😄

#

(continuing in the background with hacking it together with logs and special attributes, no rush)

fickle dune
tranquil void
#

@heavy flint NICE

+1 on using native otel metrics

We don't support those in the API at the moment, my suggestion would be to:

  1. Still use otel metrics anyway in the codebase
  2. Perhaps there's a way to ship those metrics outside of cloud to try them out, @fickle dune?
  3. I can help add metrics to the cloud API, once we confirm they fit the bill
heavy flint
#

Been working through the otel metrics integration this afternoon, after finishing a refactor of the cgroup monitoring code to publish samples on a chan plus some other simplifications. So far it has been 90% copy-pasting chunks of code and then s/Log/Metric/ more or less 😆 The OTEL libs and our own wrapping code around them all seem quite standardized in that respect, which is nice :guy_fieri_chef_kiss: Think I'm nearing the more meaty parts that hit the TUI soon though

fickle dune
heavy flint
# fickle dune lol, sweet - I think the trickiest part is just the OTLP protobufs <-> OTel Go S...

Spit-balling approaches as I try them if you have a sec:

Since otel metrics aren't inherently associated with spans/traces (afaict?) do you think for now just including the Call digest as an attribute on the metric and using that digest to find the right place in the TUI to display the associated metrics makes sense? Since this is just for exec-ops for now I feel like that would cover all the cases atm.

Alternatively I'm guessing I could include the span/trace id of the exec op as an attribute on the metric instead? Might be more generic in the long run?

Also curious if one or the other would hypothetically fit better in the cloud's db schema + indexing
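To make the digest-as-attribute idea concrete, here is a stdlib-only sketch of the lookup side. `MetricPoint` is a stand-in for whatever the exporter hands the TUI (not the OTel SDK type), and the `call.digest` key and `latestByCall` index are made-up names for illustration.

```go
package main

import "fmt"

// callDigestKey is a hypothetical attribute key; the real name is undecided.
const callDigestKey = "call.digest"

// MetricPoint models one exported data point. It is a simplified stand-in,
// not the OTel SDK's metric data type.
type MetricPoint struct {
	Name  string
	Value int64
	Attrs map[string]string
}

// latestByCall maps call digest -> metric name -> most recent value, which is
// all the TUI would need to render the latest numbers next to a call.
type latestByCall map[string]map[string]int64

// Record stores the point under its call digest and reports whether the
// point carried a digest attribute at all.
func (l latestByCall) Record(p MetricPoint) bool {
	digest, ok := p.Attrs[callDigestKey]
	if !ok {
		return false
	}
	if l[digest] == nil {
		l[digest] = map[string]int64{}
	}
	l[digest][p.Name] = p.Value
	return true
}

func main() {
	idx := latestByCall{}
	attrs := map[string]string{callDigestKey: "digest-a"}
	idx.Record(MetricPoint{Name: "disk.write.bytes", Value: 100, Attrs: attrs})
	idx.Record(MetricPoint{Name: "disk.write.bytes", Value: 200, Attrs: attrs})
	fmt.Println(idx["digest-a"]["disk.write.bytes"])
}
```

Swapping the key for a span or trace ID would not change the shape of this index, only what the TUI uses to look rows up.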

fickle dune
#

In the long run we'll likely have stable function digests and/or stable span digests, so the metrics can be compared across traces - actually, speaking of, yeah it might make sense to put the trace ID in there as an attribute too

#

Honestly just guessing here, we can iterate, the OTel metrics patterns are super different from everything else

heavy flint
#

(seems like those transform packages being public would save us 100s of lines of boilerplate code for logs/traces/metrics, oh well)

fickle dune
#

yuuuuuuuuuuuuuuuuuuuuuuuuuuuuuup

#

OTel has a pretty obnoxious nanny culture around what's exposed and how, it's almost like they're offended people are using their standards sometimes

#

like part of me wonders if that Ok/Error 1 <=> 2 swap between OTLP and the Go SDK is literally just to keep integrators on their toes
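For reference, the swap in question: OTLP's protobuf status enum is Unset=0, Ok=1, Error=2, while the Go SDK's `codes` package is Unset=0, Error=1, Ok=2. A minimal sketch of an explicit mapping follows. The constants are mirrored locally so it compiles with the stdlib alone; the type and function names are hypothetical, and real code would use the generated protobuf enum and `go.opentelemetry.io/otel/codes`.

```go
package main

import "fmt"

// otlpStatus mirrors the OTLP protobuf Status.StatusCode numbering.
type otlpStatus int32

const (
	otlpUnset otlpStatus = 0
	otlpOK    otlpStatus = 1
	otlpError otlpStatus = 2
)

// sdkCode mirrors the numbering of the Go SDK's codes.Code.
type sdkCode uint32

const (
	sdkUnset sdkCode = 0
	sdkError sdkCode = 1
	sdkOK    sdkCode = 2
)

// otlpToSDK converts by name rather than by casting the number, since
// Ok and Error have swapped numeric values between the two enums.
func otlpToSDK(s otlpStatus) sdkCode {
	switch s {
	case otlpOK:
		return sdkOK
	case otlpError:
		return sdkError
	default:
		return sdkUnset
	}
}

// sdkToOTLP is the inverse mapping.
func sdkToOTLP(c sdkCode) otlpStatus {
	switch c {
	case sdkOK:
		return otlpOK
	case sdkError:
		return otlpError
	default:
		return otlpUnset
	}
}

func main() {
	fmt.Println(otlpToSDK(otlpOK) == sdkOK)
	fmt.Println(sdkToOTLP(sdkError) == otlpError)
}
```

A plain numeric cast between the two would silently turn every Ok into an Error and vice versa, which is why the switch is spelled out.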

heavy flint
#

Huh...

runtime: pointer 0x4028a250d0 to unused region of span span.base()=0x400dac8000 span.limit=0x400dac9fe0 span.state=1
runtime: found in object at *(0x400edc5b70+0x0)
object=0x400edc5b70 s.base()=0x400edc4000 s.limit=0x400edc6000 s.spanclass=4 s.elemsize=16 s.state=mSpanInUse
 *(object+0) = 0x4028a250d0 <==
 *(object+8) = 0x1d
fatal error: found bad pointer in Go heap (incorrect use of unsafe or cgo?)

(from the engine)

#

I did something wrong in the OTEL metrics pipeline (it's trying to export way too much), but that's a new failure mode for me 😄

fickle dune
#

oh dear

#

seems like some kind of data race?

#

the word "span" there might even be a coincidence

heavy flint
#

Oh yeah span is referring to the Go GC I think lol. We don't have cgo anywhere so I'm inclined to believe it when it says "incorrect use of unsafe". Last I checked I think at least protobuf uses unsafe somewhere. And whatever bug I have seems to be resulting in way too many exports, which would probably cause way too many protobuf ser/desers, etc.

#

I'll fix the export problem and hope it never reappears for now 🤞

heavy flint
#

Okay finally got the pipeline working and displaying in the TUI in the most ugly way possible (green text is each metric for number of disk bytes written in the exec, which writes 100MB every 2s for 10 iterations)

#

The metrics do show up live and append as they come in though, which is the important point. So I just need to clean up the code and make that output more comprehensible. I'm imagining something simple like:
Container.withExec(...) <timer> | Disk Read Bytes: <num formatted with kb/mb/etc.> | Disk Write Bytes: <num formatted>, with the numbers just being the most recent read values

#

Something like a trendline would be cool but will save for down the line