#Prometheus high cardinality

1 messages · Page 1 of 1 (latest)

crude nimbus
#

Hello there,

thanks for the great product, loved it!

I faced a problem with the prometheus metrics, causing high cardinality issue in prometheus, thus, overloading it. This happens on a plain CRUD operations, when we have a resource /users and an id of a resource /users/1, but in fact, we interested only in a template of a path - /users/{user_id}

I came up with a draft solution how to resolve it. Could someone take a quick look and guide me if Im trying to fix this in a correct place
https://github.com/litestar-org/litestar/pull/3533

p.s. I havent fixed CI, just showing off the idea

harsh igloo
#

so, it's the name of the route/operation ID you want to track?

crude nimbus
#

I want to track which endpoints I have overloaded in general. I cant imagine a situation, where I need an exact set of parameters for an endpoint on grafana dashboard to spot the problem

so, instead of /v1/users/1, /v1/users/2, /v1/users/34, I wanna operate with more ephemeral data - /v1/users/{user_id}

harsh igloo
#

we can get this fixed for you. I think your request makes sense and it should be how it works

#

at first glance, your change looks ok. I'll need to review it a bit more today though

junior hazel
#

Hi !

#

same problem here

#

In fact let me code that for you

jolly karmaBOT
#

Summary

Hello,

I'm interested into improving the prometheus metrics system of litestar and work on a PR.

Let me explain the problem I'm trying to solve.

So I want to use the /metrics endpoint, but the issue I have is that it's registering every requests even the one I don't need to have, it clutter my time series
For exemple, when quering a route with random or bad params in the URL with 404 responses, it register the calls which is useless and clutter the monitoring data.
It can be a problem because it increase the active time series number in my prometheus server and create high cardinality labels.

So we need to implement some way to control the potential cardinality explosion.

Currently there is a exclude_unhandled_paths setting in litestar.contrib.prometheus.PrometheusConfig which should not register metrics when a the server respond a 404.
But, apparently it does nothing currently, it's unimplemented.

So my proposal is to work on these issues:

Have a way to ignore 404 calls,

For that we can implement the exclude_unhandled_paths setting, to not register metrics when a call return 404. All others status code can be interesting.

May be we can replace this setting by exclude_status_code so it can be more configurable and we allow the user to exclude others status codes like 400 for example.

Have a way to specify which params in a URL are worth registering

We can implement a route_keep_params setting for each route, there is 2 way that I see to do that:

Add a MetricParameter in Annotated with a include option. We can make the default value configurable.

So what are your thoughts on this? Feedback and critics appreciated.

Basic Example

Example of endpoint config, adding a meta data in the Annotated type to tell the metrics handler to include or ignore a param:

from typing_extensions import Annotated
from litestar.params import Parameter

THINGS = …

@get(path="/thing/{category:str}/{uuid:int}", sync_to_thread=False)
def get_product_version(
    category: Annotated[
        str,
        Parameter(
            title="Category slug",
        ),
        MetricParameter(include = True),
    ],
    uuid: Annotated[
        str,
        Parameter(
            title="Thing UUID",
        ),
        MetricParameter(ignore = True),
    ],
) -> thing:
    return THINGS[category][uuid]

So if I'm making a request GET /thing/bike/1
then a request GET /thing/bike/2, then a request GET /thing/house/1
before this feature I'm getting in the /metrics:

requests_total{method="GET", path="/thing/bike/1"}
requests_total{method="GET", path="/thing/bike/2"}
requests_total{method="GET", path="/thing/house/1"}

```...
Labels

Enhancement

junior hazel
#

@crude nimbus @harsh igloo I was planning to contribute to add this feature

crude nimbus
#

@junior hazel I don’t see a way of implementing what you want, cause it’s slightly different from mine. I want “template all parametrized parts”, you want only a certain ones. But the motivation looks the same 🙂

junior hazel
#

Well you see I'm using annotation on parameters

#

I think by default we should have all parameters be ignored

#

but some time I like having cardinality on some parameters, if this parameter has low possibilities count

crude nimbus
#

hi @harsh igloo did you take a look? should I finalize the MR?

violet crystal
crude nimbus
#

no rush from my end, just keeping an eye in it 🙂

harsh igloo
#

Can someone give me a TLDR on the difference between the two open PRs on this?

#

nm, it looks like 1 is the issue

#

and 1 is the PR

#

@crude nimbus can you take a look at the open telementry implementation?

#

I feel like OTEL doesn't have the issue here

#

and i'm seeing this

#


def get_route_details_from_scope(scope: Scope) -> tuple[str, dict[Any, str]]:
    """Retrieve the span name and attributes from the ASGI scope.

    Args:
        scope: The ASGI scope instance.

    Returns:
        A tuple of the span name and a dict of attrs.
    """
    route_handler_fn_name = scope["route_handler"].handler_name
    return route_handler_fn_name, {SpanAttributes.HTTP_ROUTE: route_handler_fn_name}
#

we may be able to leverage the same thing?

jolly karmaBOT
#

litestar/contrib/opentelemetry/_utils.py line 21

def get_route_details_from_scope(scope: Scope) -> tuple[str, dict[Any, str]]:
crude nimbus
#

we can use handler name in metrics as well, but I’m afraid that with this approach application layer leaks into monitoring and infra layer and this doesn’t sound good to me

harsh igloo
#

I’d prefer them to be as close as possible between the two monitoring options.

#

If there’s something wrong with the otel approach it should be fixed.

#

If not, you should use the otel approach

crude nimbus
#

we can use the same get_route_details_from_scope extractor as default in prometheus metrics and provide an option to redefine it, as done in otel
thus, we can have the same approach, which can be redefined/extended. WDYT?

harsh igloo
#

Sorry for being dense here (I'm much more familiar with OTEL than Prometheus), can you explain why the approach we have in OTEL may leak application layer info where as the method above does not?

#

I feel like this is going to be something obvious once you point it out

#

Maybe it's just monday and my brain isn't working yet

crude nimbus
#

no worries, lets have the best option possible
OTEL is something for devs, to figure out what is slow and has to be refactored

prometheus metrics is more about monitoring and monitoring is often done by another department and having metrics like
/v1/users/{user_id} 134 (134 is rps)
is much more transparent and standartized, than
get_user_by_id 134
because function name is something, that devs handle and API contracts are managed on the company level and related to everyone in the same way

#

also API in prometheus metrics offers SRE to tune cache on the higher levels, like API gateway, because you know an exact API uri
having handler name doesnt offer such option

crude nimbus
#

@harsh igloo lets align otel and prometheus metrics, having the same extractor by default, but also provide a drop-in-replacement with the logic I described, WDYT?

harsh igloo
#

Unless, @violet crystal you have a few to weigh in here?

violet crystal
#

sadly, none as of now 🙃

crude nimbus
#

any updates? I have some time this weekend, can implement what we've discussed so far

harsh igloo
#

Sorry , for the delay on this. I 100% agree. However, I need to go and review some old otel code I had. I really thought it logged as this /v1/users/{user_id}

harsh igloo
harsh igloo
crude nimbus
harsh igloo
jolly karmaBOT
#

litestar/contrib/prometheus/config.py line 30

app_name: str = field(default="litestar")
crude nimbus
#

my proposal is to have a "path extraction function" param with the default function that yields the route name, but also provide a function, that extracts templated uri path and make it by default in v3

harsh igloo
violet crystal
crude nimbus
#

gotcha, infra tag didnt fit to me as it felt related something to CI

violet crystal
#

normally that would just be a code 🙂

but you were trying for something different did not want to leave you hanging

fwiw, maintainence is more for maintainers

crude nimbus
#

finished MR https://github.com/litestar-org/litestar/pull/3533

decided to keep current behaviour not to introduce breaking changes and add a parameter, which enables new metrics style on demand
this parameter, group_path can be changed to True by default in v3, so we can provider smoother transition here

jolly karmaBOT
#

Adding new extraction function for prometheus metrics to avoid high cardinality issue in prometheus, eg having metrics GET /v1/users/{id} is preferable over GET /v1/users/1, GET /v1/users/2,GET /v1/users/3

More info about prometheus high cardinality
https://grafana.com/blog/2022/02/15/what-are-cardinality-spikes-and-why-do-they-matter/

Labels

Triage Required :hospital:, area/docs, pr/external, size: small, type/bug

harsh igloo
#

@crude nimbus Thanks for talking me through the change here