[ENG] Basic Understanding of Prometheus Query (PromQL)

by 토마스.dev 2024. 12. 30.

The following is a translation into English of a blog post I wrote in 2017. Please note that there may be some errors in the translation.

 

Prometheus Query (hereafter PromQL) differs from SQL, and it can be hard to follow when you first encounter it. Once you grasp it, however, you will find that it is a well-designed language. This post explains the basic syntax of PromQL and metric joins (vector matching).

Data Model

First, let's examine the format in which Prometheus outputs metrics, which is as follows:

http_requests_total{container="A"} 1037

<metric_name>{<label_key>=<label_value>, <label_key>=<label_value> ...} <metric_value> [<timestamp>]

The metric name appears first, followed by labels that describe the characteristics of the metric. Finally, there is the metric value. A timestamp can also be included if needed. However, the timestamp is usually not displayed because the value shown is the most recently collected one (although the timestamp is actually sent, it is typically not shown on the web interface).
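
To make the format concrete, here is a minimal Python sketch that splits one such line into its parts. The helper name parse_sample is hypothetical and not part of any Prometheus library, and the regular expressions are deliberately simplified (they do not handle escaped quotes or braces inside label values).

```python
import re

# Simplified patterns for illustration only; real parsers handle more cases.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'  # metric name
    r'(?:\{(?P<labels>[^}]*)\})?'            # optional {key="value", ...}
    r'\s+(?P<value>\S+)'                     # metric value
    r'(?:\s+(?P<ts>-?\d+))?\s*$'             # optional timestamp
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_sample(line):
    """Return (name, labels, value, timestamp or None) for one metric line."""
    m = SAMPLE_RE.match(line.strip())
    if m is None:
        raise ValueError(f"not a sample line: {line!r}")
    labels = dict(LABEL_RE.findall(m.group("labels") or ""))
    ts = int(m.group("ts")) if m.group("ts") else None
    # Values are always float64 in Prometheus, so convert to float here.
    return m.group("name"), labels, float(m.group("value")), ts


print(parse_sample('http_requests_total{container="A"} 1037'))
```

The value is converted to float because, as described next, Prometheus stores every sample value as a float64.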

Data Type

A PromQL expression evaluates to one of four data types. These are not the type of the metric value mentioned above: the values themselves are always float64.

  • Instant vector - a set of time series containing a single sample for each time series, all sharing the same timestamp
  • Range vector - a set of time series containing a range of data points over time for each time series
  • Scalar - a simple numeric floating point value
  • String - a simple string value; currently unused
 

Source: Querying basics | Prometheus (prometheus.io)

 

The four descriptions above are terse, so some additional explanation follows. Starting directly with the instant vector can be confusing, so let's begin with time series.

A time series is a sequence of values that change over time. The metric data we want to observe records how a value changes over time and is expressed in the format [time, value]...[time, value]. For example, the CPU values of Container A can be represented as [1 minute, 0.1], [2 minutes, 0.2], [3 minutes, 0.1]. These pairs together form a single time series, and a single element such as [1 minute, 0.1] is called a sample.

http_requests_total{container="A"} 1037
http_requests_total{container="B"} 500

The example above shows two samples, one from each of two time series. The sequence of samples recorded for one label set as time passes is that time series.

Instant Vector

An instant vector is a set of samples, one from each of several time series, that all refer to the same point in time. As shown above, there are two time series for http_requests_total, and the two samples that correspond to the same time form an instant vector.

To use an analogy, if a time series represents the horizontal axis over time, an instant vector can be thought of as a cross-section cut vertically through the time series.

Basically, if you execute a query using just the metric name, you can obtain an instant vector, and if you want to apply filtering, you can use an instant vector selector (explained later).
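
The cross-section analogy can be sketched in a few lines of Python. The data structure and the function name instant_vector are illustrative assumptions, the timestamps are made up for readability, and the sketch ignores the lookback window that Prometheus applies in practice.

```python
# Each time series: label set -> list of (timestamp, value), sorted by time.
# Values come from the example above; the timestamps are hypothetical.
series = {
    (("container", "A"),): [(60, 1037.0), (120, 1038.0), (180, 1040.0)],
    (("container", "B"),): [(60, 500.0), (120, 501.0), (180, 502.0)],
}


def instant_vector(series, t):
    """Cut a vertical slice at time t: one sample per series."""
    result = {}
    for labels, samples in series.items():
        # Keep only samples at or before t, then take the newest one.
        candidates = [value for ts, value in samples if ts <= t]
        if candidates:
            result[labels] = candidates[-1]
    return result


print(instant_vector(series, 130))
```

Each series contributes at most one value, which is what makes the result an instant vector rather than a range vector.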

Range Vector

A range vector holds, for each time series, an array of values over a specific period. Whereas each time series in an instant vector has exactly one sample, each series in a range vector can carry all samples from a given reference time back through the specified period.

To denote a range vector, append a period in square brackets, [<period>], to the selector. (This is called a range selector.)

For example, executing the query http_requests_total[5m] returns the following.

http_requests_total{container="A"}[5m]
[1037 @1551242271.728, 1038 @1551242331.728, 1040 @1551242391.728]

http_requests_total{container="B"}[5m]
[500 @1551242484.013, 501 @1551242544.013, 502 @1551242604.013]

The array elements are in the form of <value> @<timestamp>. In the example, there are three values over five minutes.

Since there are multiple values expressed as an array instead of just one, you cannot create a graph with this result alone. To create a graph, you need a single (time, value) pair based on the time axis, but here there are three.
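
As a rough illustration of what a range selector returns, the following Python sketch filters one series down to the samples that fall inside a window ending at the evaluation time. The function name range_vector is hypothetical, and the exact treatment of the window boundaries is an assumption of this sketch.

```python
# Samples of http_requests_total{container="A"} from the example above,
# written as (timestamp, value) pairs.
samples_a = [
    (1551242271.728, 1037.0),
    (1551242331.728, 1038.0),
    (1551242391.728, 1040.0),
]


def range_vector(samples, t, window_seconds):
    """Return every sample inside the window that ends at time t."""
    return [(ts, v) for ts, v in samples if t - window_seconds < ts <= t]


# A 5-minute window keeps all three samples; a shorter one keeps fewer.
print(range_vector(samples_a, 1551242400.0, 300))
print(range_vector(samples_a, 1551242400.0, 100))
```

Because the result is a list per series rather than a single value, it cannot be plotted directly.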

So why does a range vector exist? Because some operations, such as calculating an average or a rate of change, need every sample in a period. In the example above, the values increase as 1037 > 1038 > 1040 because the counter is cumulative. What if we want to see how fast it changed over five minutes instead? Then we need an operation over the five-minute window, which can be expressed as rate(http_requests_total[5m]): the average per-second rate of increase over that window.

Here, Range Query and Range Vector might be confused, so to explain, a Range Query is a type of query expression that takes Start and End times as options (similar to an API), and a Range Vector is a data type.

To execute a query and create a graph, it must be executed with an Instant Vector + Range Query. When you execute a query with http_requests_total as explained in the example on the Prometheus Dashboard's Console input screen, it only shows the most recent value from the last five minutes (the reason for five minutes will also be explained later). At this time, it is executed not as a Range Query but as a basic Query (Instant Query). To create a graph, you need all the Instant Vector values between the Start and End times you want to view.

 

Instant Query

https://prometheus.io/docs/prometheus/latest/querying/api/#instant-queries

GET /api/v1/query

Grafana calls this API when it executes a basic query. Instead of start and end, the query string includes a single evaluation time. The resultType is vector.

Range Query

GET /api/v1/query_range

Grafana calls this API when it executes a range query. The endpoint is different, and start and end are included in the query string. The resultType is matrix (a composite type used internally that contains vector values for each time period).
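
The difference between the two endpoints can be sketched by building the request URLs in Python. The server address http://localhost:9090 is a placeholder assumption and the helper names are hypothetical. The query parameters query, time, start, end, and step are those of the Prometheus HTTP API. The last helper lists the timestamps at which a range query evaluates the expression, one instant evaluation per step.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:9090"  # assumption: placeholder server address


def instant_query_url(query, time):
    """URL for an instant query evaluated at a single timestamp."""
    return f"{BASE_URL}/api/v1/query?" + urlencode({"query": query, "time": time})


def range_query_url(query, start, end, step):
    """URL for a range query evaluated from start to end every step seconds."""
    params = {"query": query, "start": start, "end": end, "step": step}
    return f"{BASE_URL}/api/v1/query_range?" + urlencode(params)


def evaluation_timestamps(start, end, step):
    """Timestamps at which the range query evaluates the expression."""
    return list(range(start, end + 1, step))


print(instant_query_url("http_requests_total", 1551242400))
print(range_query_url("rate(http_requests_total[5m])", 1551242100, 1551242400, 60))
print(evaluation_timestamps(1551242100, 1551242400, 60))
```

Each timestamp in that list yields one instant vector, and stacking them gives the matrix that a graph is drawn from.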

If you cannot create a graph with a range vector, how should you use one? You pass it to a function that reduces the samples in each series' range to a single value, so that the result is an instant vector.

A representative function is rate.

rate(http_requests_total{container="A"}[5m])
0.01098

The rate function calculates the average per-second rate of increase. In the previous example, it reduces the three values collected over five minutes to a single value expressed per second.
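
To show the idea behind this reduction, here is a deliberately simplified Python sketch: the increase between the first and last sample divided by the elapsed seconds. It is not how Prometheus implements rate, which also extrapolates to the window boundaries and compensates for counter resets, so its output will generally differ from the value Prometheus reports. The function name simple_rate is hypothetical.

```python
# Samples of http_requests_total{container="A"} within the 5-minute range.
samples_a = [
    (1551242271.728, 1037.0),
    (1551242331.728, 1038.0),
    (1551242391.728, 1040.0),
]


def simple_rate(samples):
    """Per-second increase between the first and last sample (no extrapolation)."""
    if len(samples) < 2:
        return None  # a rate needs at least two samples
    (t_first, v_first), (t_last, v_last) = samples[0], samples[-1]
    return (v_last - v_first) / (t_last - t_first)


# Three samples in, one number out: the range collapses to a single value.
print(simple_rate(samples_a))
```

The important point is the shape of the computation: many samples per series go in, and exactly one value per series comes out.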

If you try to create a graph using only a range vector without such aggregation functions, an error occurs. In other words, Range vector + Query range is not possible.

An error occurs saying, "Invalid expression type 'range vector' for range query, must be Scalar or instant Vector."

Scalar/String

Scalar refers to a value without time information and can be thought of simply as a number.

String is a string value, but it is not actually used. (It is also not used in Prometheus code).

 

Understanding Basic Queries

Selectors

With the most basic syntax, you can query the desired instant vector based on label conditions.

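http_requests_total{environment=~"staging|testing|development",method!="GET"}

=~ is the regex-match operator. The selector above returns the instant vector named http_requests_total whose environment label value is staging, testing, or development and whose method is not GET.

However, when you want to match every value of a label, use .+ rather than .*. Because .* also matches the empty string, a selector built only from such matchers is rejected: a selector needs a metric name or at least one matcher that does not match the empty string.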

Since metric names are also stored as a label called __name__, if you want to search for multiple metric names or patterns, leave the metric name empty and search using the __name__ label.

For example, a query such as {__name__=~"container.*"} returns all instant vectors with names that start with container.
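
The matching rules can be sketched in Python as follows. The function name matches and the tuple representation of matchers are assumptions made for illustration. Prometheus regex matchers are fully anchored, which re.fullmatch imitates here, although Python's regex dialect is not identical to the one Prometheus uses.

```python
import re


def matches(labels, matchers):
    """Check a label set against (label, operator, value) matchers."""
    for name, op, value in matchers:
        actual = labels.get(name, "")  # a missing label acts like an empty value
        if op == "=":
            ok = actual == value
        elif op == "!=":
            ok = actual != value
        elif op == "=~":
            ok = re.fullmatch(value, actual) is not None
        elif op == "!~":
            ok = re.fullmatch(value, actual) is None
        else:
            raise ValueError(f"unknown operator: {op}")
        if not ok:
            return False
    return True


selector = [
    ("__name__", "=", "http_requests_total"),  # the metric name is a label too
    ("environment", "=~", "staging|testing|development"),
    ("method", "!=", "GET"),
]
sample = {"__name__": "http_requests_total", "environment": "staging", "method": "POST"}
print(matches(sample, selector))

# .* accepts a missing (empty) label, while .+ requires a non-empty value.
print(matches({}, [("job", "=~", ".*")]), matches({}, [("job", "=~", ".+")]))
```

Treating the metric name as just another label is what makes the __name__ pattern search possible.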

Aggregation Operators

You may need to perform various mathematical calculations, such as summing or averaging the searched metric results. In such cases, you can use the operators below, which apply syntax similar to SQL's GROUP BY.

  • sum (calculate sum over dimensions)
  • min (select minimum over dimensions)
  • max (select maximum over dimensions)
  • avg (calculate the average over dimensions)
  • stddev (calculate population standard deviation over dimensions)
  • stdvar (calculate population standard variance over dimensions)
  • count (count number of elements in the vector)
  • count_values (count number of elements with the same value)
  • bottomk (smallest k elements by sample value)
  • topk (largest k elements by sample value)
  • quantile (calculate φ-quantile (0 ≤ φ ≤ 1) over dimensions)

https://prometheus.io/docs/prometheus/latest/querying/operators/#aggregation-operators

<aggr-op>([parameter,] <vector expression>) [without|by (<label list>)]

Both by and without serve roles similar to group by, but without is the opposite of by. In other words, when using without, the specified label is excluded, and the grouping is performed based on the remaining labels.

sum(http_requests_total) by (method)

The query above groups http_requests_total by method and sums each group, producing a result of the following form:

2000
3000


After aggregation, the metric name has disappeared: the result is a new time series.

If the vector expression is long, you can place the by clause in front of it instead.

sum by (method) (http_requests_total)
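
The grouping behavior of by and without can be sketched in Python. The input vector below is hypothetical (the method and container values are invented so that the sums come out as 2000 and 3000), and sum_by and sum_without are illustrative names rather than Prometheus functions.

```python
from collections import defaultdict

# Hypothetical instant vector for http_requests_total: (labels, value) pairs.
vector = [
    ({"method": "get", "container": "A"}, 1200.0),
    ({"method": "get", "container": "B"}, 800.0),
    ({"method": "post", "container": "A"}, 1000.0),
    ({"method": "post", "container": "B"}, 2000.0),
]


def sum_by(vector, by):
    """Group on the listed labels only, then sum each group."""
    groups = defaultdict(float)
    for labels, value in vector:
        key = tuple((name, labels.get(name, "")) for name in by)
        groups[key] += value
    return dict(groups)


def sum_without(vector, without):
    """Group on every label except the listed ones, then sum each group."""
    groups = defaultdict(float)
    for labels, value in vector:
        key = tuple(sorted((k, v) for k, v in labels.items() if k not in without))
        groups[key] += value
    return dict(groups)


print(sum_by(vector, ["method"]))
print(sum_without(vector, ["container"]))
```

With this input both calls produce the same groups, which shows that by and without are two ways of naming the same grouping key.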

 

Understanding Join Queries (Vector Matching)

I believe this is one of the most powerful features of PromQL and likely the most frequently used syntax. By matching Instant Vectors with various names through labels, you can utilize them for the following purposes:

  • Operations involving two or more values, such as ratios or percentages
  • Combining labels

When performing vector matching, the following scenarios can occur:

  • One-to-One: When vectors match exactly on a 1:1 basis (when two or more vectors are mapped 1:1 through label matching)
  • One-to-Many: When vectors have a 1:N or N:1 relationship
  • Many-to-Many: When vectors have an N:M relationship

Ultimately, you need to match two or more vectors to form a single value per result element. A many-to-many relationship cannot be computed that way, so arithmetic and comparison operators do not allow it.

One-to-One

The syntax for One-to-One is simple:

<vector expr> <bin-op> ignoring(<label list>) <vector expr>
<vector expr> <bin-op> on(<label list>) <vector expr>

The left <vector expr> and the right <vector expr> refer to the two vectors to be matched.

<bin-op> refers to the binary operator that performs an operation on the values of the two vectors to produce a single value.

on refers to the list of labels to be used for matching, while ignoring has the opposite meaning of on. When using ignoring, the specified labels are excluded, and the remaining labels are used for matching.

method_code:http_errors:rate5m{method="get", code="500"}  24
method_code:http_errors:rate5m{method="get", code="404"}  30
method_code:http_errors:rate5m{method="put", code="501"}  3
method_code:http_errors:rate5m{method="post", code="500"} 6
method_code:http_errors:rate5m{method="post", code="404"} 21

method:http_requests:rate5m{method="get"}  600
method:http_requests:rate5m{method="del"}  34
method:http_requests:rate5m{method="post"} 120

Using the metrics above as input, consider the following query:

method_code:http_errors:rate5m{code="500"} / ignoring(code) method:http_requests:rate5m

It takes the method_code:http_errors:rate5m series whose code is 500 and matches them against method:http_requests:rate5m.

First, let's look at the results of method_code:http_errors:rate5m{code="500"}:

method_code:http_errors:rate5m{method="get", code="500"}  24
method_code:http_errors:rate5m{method="post", code="500"} 6

It uses ignoring to exclude code and matches only based on the remaining method label.

method_code:http_errors:rate5m{method="get", code="500"}  24
method_code:http_errors:rate5m{method="post", code="500"} 6

method:http_requests:rate5m{method="get"}  600
method:http_requests:rate5m{method="del"}  34
method:http_requests:rate5m{method="post"} 120

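Each side now has exactly one series per method, so they are matched in a 1:1 manner. The method="del" series has no counterpart on the left side and is dropped from the result. Since the operator is / (division), the result is as follows.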

{method="get"}  0.04            //  24 / 600
{method="post"} 0.05            //   6 / 120
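
The matching steps above can be reproduced with a small Python sketch. It is an illustration of the idea, not Prometheus's implementation, and the names signature and divide_one_to_one are hypothetical. Each series gets a signature made of its labels minus the ignored ones, and series with equal signatures are paired.

```python
def signature(labels, ignoring):
    """Labels used for matching: everything except the ignored ones."""
    return tuple(sorted((k, v) for k, v in labels.items() if k not in ignoring))


def divide_one_to_one(lhs, rhs, ignoring):
    """lhs / ignoring(...) rhs, raising if the match is not one-to-one."""
    right = {}
    for labels, value in rhs:
        sig = signature(labels, ignoring)
        if sig in right:
            raise ValueError("duplicate series on the right side")
        right[sig] = value
    result = {}
    for labels, value in lhs:
        sig = signature(labels, ignoring)
        if sig not in right:
            continue  # no partner: the series is dropped
        if sig in result:
            raise ValueError("multiple matches on the left side")
        result[sig] = value / right[sig]
    return result


# method_code:http_errors:rate5m{code="500"}
errors_500 = [
    ({"method": "get", "code": "500"}, 24.0),
    ({"method": "post", "code": "500"}, 6.0),
]
# method:http_requests:rate5m
requests = [
    ({"method": "get"}, 600.0),
    ({"method": "del"}, 34.0),
    ({"method": "post"}, 120.0),
]
print(divide_one_to_one(errors_500, requests, ignoring={"code"}))
```

The result keys contain only the method label, matching the output shown above where code no longer appears.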

 

One-to-Many

The syntax for One-to-Many is as follows.

<vector expr> <bin-op> ignoring(<label list>) group_left(<label list>) <vector expr>
<vector expr> <bin-op> ignoring(<label list>) group_right(<label list>) <vector expr>
<vector expr> <bin-op> on(<label list>) group_left(<label list>) <vector expr>
<vector expr> <bin-op> on(<label list>) group_right(<label list>) <vector expr>

Before explaining the group_left/group_right syntax, let's look at an example first.

method_code:http_errors:rate5m / ignoring(code) group_left method:http_requests:rate5m

If you exclude code and match only by method in the example above, in the case of method="get", it results in a 2:1 (N:1) relationship as follows.

method_code:http_errors:rate5m{method="get", code="500"}  24
method_code:http_errors:rate5m{method="get", code="404"}  30

method:http_requests:rate5m{method="get"}  600

The value that can ultimately be displayed here is the method_code:http_errors:rate5m vector corresponding to N.

{method="get", code="500"}  0.04            //  24 / 600
{method="get", code="404"}  0.05            //  30 / 600

The vector that contributes more elements to each match is said to have "high cardinality."

(* Cardinality here means the variety of label combinations a vector can take. In this example, a single method such as get has several code values, so method and code are in a 1:N relationship, and method_code:http_errors:rate5m, which includes code, is said to have the higher cardinality.)

group_left/group_right select which vector has the higher cardinality: group_left means the left vector, and group_right means the right vector. (You can think of it as the basis vector for the final result.)

If you do it the other way around in the above example, an error naturally occurs because it cannot determine what value to divide 600 by.
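
A Python sketch of the group_left case follows, under the same caveat that it only illustrates the idea and uses hypothetical helper names. The right side must be unique per signature, while the left side may contribute several series per signature, and each result keeps the labels of the left (many) side.

```python
def signature(labels, ignoring):
    """Labels used for matching: everything except the ignored ones."""
    return tuple(sorted((k, v) for k, v in labels.items() if k not in ignoring))


def divide_group_left(lhs, rhs, ignoring):
    """lhs / ignoring(...) group_left rhs: many on the left, one on the right."""
    right = {}
    for labels, value in rhs:
        sig = signature(labels, ignoring)
        if sig in right:
            raise ValueError("the 'one' side must be unique per signature")
        right[sig] = value
    result = []
    for labels, value in lhs:
        sig = signature(labels, ignoring)
        if sig in right:
            # The result keeps the labels of the high-cardinality (left) side.
            result.append((dict(labels), value / right[sig]))
    return result


errors = [
    ({"method": "get", "code": "500"}, 24.0),
    ({"method": "get", "code": "404"}, 30.0),
    ({"method": "put", "code": "501"}, 3.0),
    ({"method": "post", "code": "500"}, 6.0),
    ({"method": "post", "code": "404"}, 21.0),
]
requests = [
    ({"method": "get"}, 600.0),
    ({"method": "del"}, 34.0),
    ({"method": "post"}, 120.0),
]
for labels, value in divide_group_left(errors, requests, ignoring={"code"}):
    print(labels, value)
```

Swapping the two arguments makes the function raise, because several error series would then sit on the side that has to be unique.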

Label Merging

You can include a list of labels in group_left/group_right, which means merging the labels of a vector with lower cardinality into a vector with higher cardinality. In fact, vector matching is primarily used for such label merging rather than for value operations.

method_code:http_errors:rate5m  24
method_code:http_errors:rate5m  30
method_code:http_errors:rate5m  3
method_code:http_errors:rate5m 6
method_code:http_errors:rate5m 21

method:http_requests:rate5m  600
method:http_requests:rate5m  34
method:http_requests:rate5m 120

If you want to include the message label in the final result in the modified example above, the query is as follows.

method_code:http_errors:rate5m / ignoring(code) group_left(message) method:http_requests:rate5m

In the case of method="get":

  0.04            //  24 / 600
  0.05            //  30 / 600

 

Let's understand more complex queries

sum by (node, type) (
    kube_node_status_allocatable{resource="cpu"}
    * on (node) group_left(type)
    label_replace(
        kube_node_labels, "type", "$1", "label_type", "(.+)"
    )
)

The purpose of the above query is to fetch the type label from kube_node_labels and combine it with kube_node_status_allocatable for display.

kube_node_labels is one of the metrics generated by kube-state-metrics and is a metric that contains the label information (Kubernetes labels) set on nodes. Each label key name is stored in the form of label_<label key>.

kube_node_labels

 

The label_replace function is used to change label keys and values, and its usage is as follows:

label_replace(v instant-vector, dst_label string, replacement string, src_label string, regex string)

In the query above, label_replace reads the label named label_type and copies its entire value into a new label named type.

Note that label_replace does not rename the label; it creates a new one. Even after type is created, label_type remains unchanged, so it has to be dropped in the end (here, sum by (node, type) takes care of that).

Then, match these two vectors using the label node and multiply them (since kube_node_labels is a metric for label information and all its values are 1, multiplying does not change the original metric kube_node_status_allocatable), resulting in the label type being included in the result.
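
The behavior of label_replace described above can be imitated in Python. This is a sketch rather than the real implementation: the input series is hypothetical, Python's regex dialect differs from the one Prometheus uses, and the handling of an empty replacement result is left out.

```python
import re


def label_replace(vector, dst_label, replacement, src_label, regex):
    """Add dst_label built from src_label when regex matches the whole value."""
    pattern = re.compile(regex)
    # Translate $1-style references into Python's \g<1> form.
    template = re.sub(r"\$(\d+)", r"\\g<\1>", replacement)
    out = []
    for labels, value in vector:
        new_labels = dict(labels)  # the source label is kept as-is
        m = pattern.fullmatch(labels.get(src_label, ""))
        if m is not None:
            new_labels[dst_label] = m.expand(template)
        out.append((new_labels, value))
    return out


# Hypothetical kube_node_labels series; the value is always 1.
node_labels = [({"node": "worker-2", "label_type": "vm"}, 1.0)]
print(label_replace(node_labels, "type", "$1", "label_type", "(.+)"))
```

The output carries both label_type and type, which is why the original label still has to be dropped by a later aggregation.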

kube_node_status_allocatable is also one of the metrics generated by kube-state-metrics, and an example of this is as follows. In the example, the CPU resource is targeted.

kube_node_status_allocatable{endpoint="http",instance="10.240.3.2:8080",job="kube-state-metrics",namespace="default",node="worker-1",pod="nginx-77cfp4rb",resource="cpu",service="kube-state-metrics",unit="core"} 0.1

 

Additionally, the sum by (node, type) operation sums up while ignoring job, endpoint, etc., except for node and type.

{node="worker-1",type="bare-metal"}	15.8
{node="worker-2",type="vm"}		4.8
{node="worker-3",type="vm"}		4.8

 

When performing vector matching, be careful: a query that is 1:1 or 1:N when first tested may become N:M under different circumstances and fail with a "many-to-many" error. In that case the metric might not be generated at all.

Vector matching is a powerful feature, but it is also one of the biggest factors that can degrade Prometheus's performance. A single vector matching operation is generally acceptable, but as queries become more complex, you might end up performing vector matching two or three times in one query. If you combine this with range vectors, Prometheus might run out of memory (OOM) and crash. When queries become complex, it is essential to use recording rules. (Recording rules are comparable to materialized views in a traditional DBMS. They evaluate a query at the instant vector level and store the result as a new metric. Since the result is calculated and stored at a configured interval, querying it performs about the same as querying a single metric.)

 

Calculating Timeframes During Vector Matching

Vectors do not always have exactly the same timestamps. For example, if there is a MySQL exporter and a Node exporter being scraped by Prometheus, the intervals of the two exporters may differ, and the collection durations may vary, making it unlikely for the collection times to exactly match. However, as mentioned earlier in the explanation of the meaning of vectors, there is a phrase "all sharing the same timestamp."

In Prometheus, this concept is handled by applying a "buffered timeframe," within which the included vectors are considered to have the "same timestamp."

Looking back at the example mentioned during the explanation of Range Vectors, you can see that each value has a different timestamp.

http_requests_total{container="A"}[5m]
[1037 @1551242271.728, 1038 @1551242331.728, 1040 @1551242391.728]

http_requests_total{container="B"}[5m]
[500 @1551242484.013, 501 @1551242544.013, 502 @1551242604.013]

How long is that "time buffer"? By default, Prometheus sets it to 5 minutes.

Staleness

When queries are run, timestamps at which to sample data are selected independently of the actual present time series data. This is mainly to support cases like aggregation (sum, avg, and so on), where multiple aggregated time series do not exactly align in time. Because of their independence, Prometheus needs to assign a value at those timestamps for each relevant time series. It does so by simply taking the newest sample before this timestamp.

If a target scrape or rule evaluation no longer returns a sample for a time series that was previously present, that time series will be marked as stale. If a target is removed, its previously returned time series will be marked as stale soon afterwards.

If a query is evaluated at a sampling timestamp after a time series is marked stale, then no value is returned for that time series. If new samples are subsequently ingested for that time series, they will be returned as normal.

If no sample is found (by default) 5 minutes before a sampling timestamp, no value is returned for that time series at this point in time. This effectively means that time series "disappear" from graphs at times where their latest collected sample is older than 5 minutes or after they are marked stale.

Staleness will not be marked for time series that have timestamps included in their scrapes. Only the 5 minute threshold will be applied in that case.

https://prometheus.io/docs/prometheus/latest/querying/basics/#staleness

Because of this behavior, you can also observe in the Prometheus Console that series whose latest sample is older than 5 minutes do not appear.

This value can be adjusted with a command-line flag when starting Prometheus: --query.lookback-delta.
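
The lookback rule can be sketched in Python using the two series from the range vector example. The function name value_at is hypothetical, the exact boundary handling is an assumption, and staleness markers are not modeled.

```python
LOOKBACK_DELTA = 300.0  # seconds; the 5-minute default described above

samples_a = [(1551242271.728, 1037.0), (1551242331.728, 1038.0), (1551242391.728, 1040.0)]
samples_b = [(1551242484.013, 500.0), (1551242544.013, 501.0), (1551242604.013, 502.0)]


def value_at(samples, t, lookback=LOOKBACK_DELTA):
    """Newest sample value at or before t, unless it is older than lookback."""
    for ts, value in reversed(samples):
        if ts <= t:
            return value if t - ts < lookback else None
    return None  # no sample at or before t


# Both series have a sample within 5 minutes, so both join the same instant vector.
print(value_at(samples_a, 1551242610.0), value_at(samples_b, 1551242610.0))
# Ninety seconds later, series A's newest sample is too old and A disappears.
print(value_at(samples_a, 1551242700.0), value_at(samples_b, 1551242700.0))
```

This is how samples scraped at different moments end up sharing one evaluation timestamp, and why a series vanishes once its latest sample falls outside the window.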

Conclusion

This concludes a brief understanding of Prometheus Query. For various additional examples, calculations, and usage of functions, please refer to the official website. I hope this explanation helps you understand Prometheus Query to some extent.

 

 
