Agent

Setting Up With Thanos

Configuring external_labels for Thanos

The agent extracts node and container information from the Prometheus metrics that Thanos aggregates. If the label keys configured for podLabel, nodeLabel and containerLabel are already used as external labels in your Prometheus configuration, we recommend renaming those keys to avoid a conflict.
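A quick way to spot such a collision is to compare the two sets of names. The following is a minimal Python sketch, not part of the agent: the `conflicting_labels` helper and all label values are hypothetical and only illustrate the check.

```python
def conflicting_labels(external_labels: dict, agent_labels: dict) -> list:
    """Return agent label keys that are also used as Prometheus external label names."""
    return sorted(set(external_labels) & set(agent_labels.values()))

# Illustrative values only: three agent label keys and two made-up external labels.
agent_labels = {"podLabel": "pod", "nodeLabel": "node", "containerLabel": "container"}
external_labels = {"cluster": "prod-eu", "node": "prometheus-0"}

print(conflicting_labels(external_labels, agent_labels))
```

Any name the helper reports is a candidate for renaming on one side or the other.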

Setting Up the Thanos Receive Component

Because of availability concerns with the sidecar, we strongly recommend using the Thanos receive component, which provides the real-time Prometheus results the agent needs to collect metrics.

Review the Thanos receive setup docs and configure the remote_write section of each of your Prometheus installations.

Here is a sample Prometheus configuration that works with the Thanos receive component:

# prometheus.yaml
remote_write:
  # remote_write takes a list of endpoints
  - url: <thanos-receive-url>/api/v1/receive
    # headers is a map of header name to value
    headers:
      THANOS-TENANT: <replex-cluster-name>

Configuring the Querier With The Replex Agent

The Querier (Query Gateway) is the Thanos component that provides a Prometheus-compatible endpoint, which the Replex agent can query directly.

If you have a Thanos instance set up, verify that this endpoint is accessible with a simple curl command. Replace THANOS_QUERIER_URL with the URL of your Thanos instance:

curl "$THANOS_QUERIER_URL/api/v1/query?query=up"

The endpoint responds with JSON. If the result contains series like the following (shown here in condensed form), the endpoint is Prometheus-compatible and will work with the agent.

up{instance="replex.io:9090", job="prometheus"} 1
up{instance="replex.io:9091", job="pushgateway"} 1
up{instance="replex.io:9093", job="alertmanager"} 1
up{instance="replex.io:9100", job="node"} 1
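The query endpoint returns JSON rather than the condensed series listing. The following minimal Python sketch checks such a response, assuming the standard Prometheus HTTP API response shape (`status`, `data.result`). The sample payload and the `reachable_targets` helper are illustrative and not part of the agent.

```python
import json

# Illustrative payload in the Prometheus HTTP API format; values are made up.
SAMPLE_RESPONSE = """
{"status": "success",
 "data": {"resultType": "vector",
          "result": [
            {"metric": {"__name__": "up", "instance": "replex.io:9090", "job": "prometheus"},
             "value": [1700000000, "1"]},
            {"metric": {"__name__": "up", "instance": "replex.io:9100", "job": "node"},
             "value": [1700000000, "0"]}
          ]}}
"""

def reachable_targets(body: str) -> list:
    """Return the instances whose `up` sample equals 1."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError("query failed: %s" % payload.get("error", "unknown error"))
    return [
        series["metric"].get("instance", "")
        for series in payload["data"]["result"]
        if float(series["value"][1]) == 1.0
    ]

print(reachable_targets(SAMPLE_RESPONSE))
```

A response that parses this way and lists at least one target suggests the endpoint speaks the Prometheus query API.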

Finally, set the prometheus.url variable described in the agent docs to your THANOS_QUERIER_URL, and you're up and running with Thanos on Replex.

Configuring Self-Signed SSL Certificates (On-Prem Only)

For on-prem pushgateway deployments, if the pushgateway is served with a self-signed SSL certificate, the agent may encounter errors when trying to sync with it.

To resolve this, you can use the sslCertificate Helm chart parameter to pass your certificate into the agent.

Example:

sslCertificate: "-----BEGIN CERTIFICATE-----\nMIIC1TCCAb2gAwIBAgIJAKbCs/2knCwGMA0GCSqGSIb3DQEBBQUAMBoxGDAWBgNV\nZAeRdaEZS6Bs\n-----END CERTIFICATE-----"
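The example passes the certificate as a single line with `\n` separators. The following minimal Python sketch produces that form from a PEM file's contents, assuming the chart expects exactly the escaped single-line format shown above. The `pem_to_helm_value` helper and the truncated certificate body are hypothetical.

```python
def pem_to_helm_value(pem_text: str) -> str:
    """Collapse a multi-line PEM certificate into one line with literal \\n separators."""
    lines = [line.strip() for line in pem_text.strip().splitlines() if line.strip()]
    return "\\n".join(lines)

# Hypothetical, truncated certificate body used only for illustration.
example_pem = """-----BEGIN CERTIFICATE-----
MIIC1TCCAb2gAwIBAgIJAKbCs
ZAeRdaEZS6Bs
-----END CERTIFICATE-----
"""

print('sslCertificate: "%s"' % pem_to_helm_value(example_pem))
```

The printed line can then be pasted into your Helm values file.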

Filesystem Metrics (Prometheus)

This section applies only to setups that use Prometheus as the metrics provider. The agent can collect PVC information from different metric sources. The default setting uses cAdvisor's kubelet_volume* metrics.

cAdvisor

The default setup uses the cAdvisor metrics to get PVC information. In that case, the METRICS_FILESYSTEM environment variable can be left at its default value, cadvisor.

Metrics used:

| Storage Metric | cAdvisor Metrics |
| --- | --- |
| Capacity | kubelet_volume_stats_capacity_bytes |
| Used | kubelet_volume_stats_used_bytes |

CSI

If kubelet_volume* metrics are not available and you are using CSI plugins, you must set the METRICS_FILESYSTEM environment variable to csi. This mode requires kube-state-metrics: the agent gets PVC information from node_exporter and kube-state-metrics metrics.

Metrics used:

| Storage Metric | node_exporter Metrics | kube-state-metrics Metrics |
| --- | --- | --- |
| Capacity | node_filesystem_size_bytes | kube_persistentvolumeclaim_info |
| Used | node_filesystem_size_bytes - node_filesystem_free_bytes | kube_persistentvolumeclaim_info |
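The arithmetic behind the CSI values can be sketched in a few lines of Python. This shows only the size-minus-free calculation. How the agent joins the result with kube_persistentvolumeclaim_info is not described here, so that step is left out. The `pvc_usage` helper and the sample byte counts are made up for illustration.

```python
GIB = 1024 ** 3

def pvc_usage(size_bytes: float, free_bytes: float) -> dict:
    """Derive capacity and used bytes from filesystem size and free space."""
    if free_bytes > size_bytes:
        raise ValueError("free bytes cannot exceed filesystem size")
    return {"capacity": size_bytes, "used": size_bytes - free_bytes}

# Made-up sample values for node_filesystem_size_bytes and node_filesystem_free_bytes.
usage = pvc_usage(size_bytes=10 * GIB, free_bytes=4 * GIB)
print(usage["used"] / GIB)
```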

Exposed Metrics

The agent exposes its own metrics, which can be accessed via the /metrics route on port 8083.

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| replex_agent_provider_status | gauge | name: metrics provider name | Indicates whether or not a metrics provider is reachable. 1 if it is reachable and 0 if not |
| replex_agent_sync_duration_count | counter | agent_version, response_code | Total number of times the metrics were synchronized with the Replex server |
| replex_agent_sync_duration_sum | gauge | agent_version, response_code | The total duration of all sync requests to the Replex server in seconds |
| replex_agent_retry_cache_size | gauge | cluster_id | Count of cached metrics that are waiting to be re-sent to the Replex server |
| replex_agent_failed_metrics_total | counter | cluster_id | The total count of once failed metrics |
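One use of these metrics is deriving the average sync duration by dividing replex_agent_sync_duration_sum by replex_agent_sync_duration_count. The following minimal Python sketch does this over scraped text. The sample scrape output is made up, and the parser is deliberately simple: it ignores lines with timestamps and label values containing `}`.

```python
import re
from collections import defaultdict

# Illustrative scrape output; label values and numbers are made up.
SAMPLE_METRICS = """\
replex_agent_sync_duration_count{agent_version="1.0.0",response_code="200"} 12
replex_agent_sync_duration_sum{agent_version="1.0.0",response_code="200"} 3.0
replex_agent_provider_status{name="prometheus"} 1
"""

LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?P<labels>\{[^}]*\})?\s+(?P<value>\S+)$'
)

def mean_sync_seconds(text: str) -> dict:
    """Return the average sync duration per label set (sum divided by count)."""
    totals = defaultdict(dict)
    for raw in text.splitlines():
        match = LINE.match(raw.strip())
        if not match:
            continue  # skip comments, blanks, and lines this simple parser cannot read
        name, labels = match["name"], match["labels"] or ""
        for suffix in ("_sum", "_count"):
            if name == "replex_agent_sync_duration" + suffix:
                totals[labels][suffix] = float(match["value"])
    return {
        labels: parts["_sum"] / parts["_count"]
        for labels, parts in totals.items()
        if parts.get("_count") and "_sum" in parts
    }

print(mean_sync_seconds(SAMPLE_METRICS))
```

In a Prometheus setup the same ratio would normally be computed with a query over the two series rather than by parsing the text by hand.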

Used Metrics

These are the metrics the agent currently uses:

| Property | Description | Prometheus | Instana (plugin: metric) | Stackdriver | Datadog |
| --- | --- | --- | --- | --- | --- |
| 1 | Container CPU Usage | container_cpu_usage_seconds_total | docker: cpu.total_usage | kubernetes.io/container/cpu/core_usage_time | kubernetes.cpu.usage.total |
| 2 | Container MEM Usage | container_memory_working_set_bytes | docker: memory.usage | kubernetes.io/container/memory/used_bytes | kubernetes.memory.working_set |
| 3 | Node CPU Usage | node_cpu_seconds_total | docker: cpu.total_usage | kubernetes.io/node/cpu/core_usage_time | kubernetes.cpu.usage.total |
| 4 | Node MEM Usage | node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes | docker: memory.usage | kubernetes.io/node/memory/used_bytes | kubernetes.memory.usage |
| 5 | Storage (Capacity) | kubelet_volume_stats_capacity_bytes, node_filesystem_size_bytes | - | kubernetes.io/pod/volume/total_bytes | kubernetes.kubelet.volume.stats.capacity_bytes |
| 6 | Storage (Used) | kubelet_volume_stats_used_bytes, node_filesystem_free_bytes | - | kubernetes.io/pod/volume/used_bytes | kubernetes.kubelet.volume.stats.used_bytes |
| 7 | Disk Capacity | container_fs_limit_bytes | - | - | system.disk.total |
| 8 | Disk Used | container_fs_usage_bytes | - | - | system.disk.used |
| 9 | Network I/O (Received) | container_network_receive_bytes_total | docker: network.rx.bytes | kubernetes.io/pod/network/received_bytes_count | kubernetes.network.rx_bytes |
| 10 | Network I/O (Sent) | container_network_transmit_bytes_total | docker: network.tx.bytes | kubernetes.io/pod/network/sent_bytes_count | kubernetes.network.tx_bytes |
| 11 | Disk I/O (Written) | container_fs_writes_bytes_total | docker: blkio.blk_write | - | kubernetes.io.write_bytes |
| 12 | Disk I/O (Read) | container_fs_reads_bytes_total | docker: blkio.blk_read | - | kubernetes.io.read_bytes |

Environment Variables

|  | Variable | Required | Default | Comment |
| --- | --- | --- | --- | --- |
| 1 | REPLEX_TOKEN | Yes |  |  |
| 2 | METRIC_PROVIDER | Yes |  | Options: prometheus, datadog, stackdriver, instana, thanos |
| 3 | CLUSTER_ID | If "KUBERNETES_INFO_PROVIDER" == "kubernetes" |  |  |
| 4 | CLUSTER_NAME | If "KUBERNETES_INFO_PROVIDER" == "kubernetes" |  |  |
| 6 | PROMETHEUS_SERVER_URL | If "METRIC_PROVIDER" == "prometheus" |  |  |
| 7 | DATADOG_API_KEY | If "METRIC_PROVIDER" == "datadog" |  |  |
| 8 | DATADOG_APPLICATION_KEY | If "METRIC_PROVIDER" == "datadog" |  |  |
| 9 | DATADOG_SITE | No | com | Options: com, eu |
| 10 | GCP_PROJECT_ID | If "METRIC_PROVIDER" == "stackdriver" |  |  |
| 11 | INSTANA_BASE_URL | If "METRIC_PROVIDER" == "instana" |  | Format: https://tenant-unit.instana.io |
| 12 | INSTANA_API_TOKEN | If "METRIC_PROVIDER" == "instana" |  |  |
| 13 | KUBERNETES_INFO_PROVIDER | No | kubernetes | Options: kubernetes, instana |
| 14 | INSTANA_CLUSTER_ID | No |  |  |
| 15 | ONLY_USE_READY_NODES | No | false | Track only nodes that are in "Ready" state |
| 16 | PROMETHEUS_NODE_LABEL | No | node | The label that represents the node in the Prometheus metrics |
| 17 | PROMETHEUS_CONTAINER_LABEL | No | container | The label that represents the container in the Prometheus metrics |
| 18 | PROMETHEUS_POD_LABEL | No | pod | The label that represents the pod in the Prometheus metrics |
| 19 | CLOUD_PROVIDER_OVERRIDE | No | Detected automatically | Options: aws, azure, gce, custom, alibaba |
| 20 | USE_CONTROL_PLANE_COST | No | false | Track costs of the Kubernetes Control Plane |
| 21 | METRICS_FILESYSTEM | No | cadvisor | Specify the filesystem metric source. Options: cadvisor, csi |
| 22 | SYNC_INTERVAL_SECONDS | No | 300 |  |
| 23 | LOG_LEVEL | No | 3 | Higher value means higher verbosity |
| 24 | METRICS_RETRY_INTERVAL_SECONDS | No | 300 |  |
| 25 | METRICS_CACHE_DISK | No | true | Cache failed metrics on disk |
| 26 | METRICS_CACHE_DISK_DIR | No | /data/metrics | Directory to cache metrics if METRICS_CACHE_DISK == true |
| 27 | PROMETHEUS_BEARER_TOKEN | No |  | Bearer token for requests to the Prometheus server. Only used if METRIC_PROVIDER == prometheus |
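The conditional requirements in the table above can be checked before deploying. The following minimal Python sketch encodes only the rules stated in the table. The `missing_variables` helper is hypothetical, and the agent's own validation may differ.

```python
import os

# Conditional requirements copied from the table above.
REQUIRED_BY_METRIC_PROVIDER = {
    "prometheus": ["PROMETHEUS_SERVER_URL"],
    "datadog": ["DATADOG_API_KEY", "DATADOG_APPLICATION_KEY"],
    "stackdriver": ["GCP_PROJECT_ID"],
    "instana": ["INSTANA_BASE_URL", "INSTANA_API_TOKEN"],
    "thanos": [],  # the table lists no thanos-specific variables
}

def missing_variables(env: dict) -> list:
    """List required variables that are unset or empty in `env`."""
    required = ["REPLEX_TOKEN", "METRIC_PROVIDER"]
    provider = env.get("METRIC_PROVIDER", "")
    required += REQUIRED_BY_METRIC_PROVIDER.get(provider, [])
    # KUBERNETES_INFO_PROVIDER defaults to "kubernetes" per the table.
    if env.get("KUBERNETES_INFO_PROVIDER", "kubernetes") == "kubernetes":
        required += ["CLUSTER_ID", "CLUSTER_NAME"]
    return [name for name in required if not env.get(name)]

# Check the current process environment; an empty list means nothing is missing.
print(missing_variables(dict(os.environ)))
```

Passing a plain dictionary instead of `os.environ` makes it easy to check a planned configuration without exporting anything.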
