Agent

Setting Up With Thanos

Configuring external_labels for Thanos

Because of the way the agent extracts node and container information from the Prometheus metrics aggregated by Thanos, we recommend renaming the keys configured for podLabel, nodeLabel and containerLabel if they are already used in your Prometheus external labels.

Setting up the thanos receive component

Due to availability concerns with the sidecar, we strongly recommend using the Thanos receive component, which provides the real-time Prometheus results the agent needs to collect metrics.
Review the Thanos receive setup docs and configure the remote_write sections of your Prometheus installation(s).
Here is a sample of the Prometheus configuration needed to work with the Thanos receive component:
# prometheus.yaml
remote_write:
  - url: <thanos-receive-url>/api/v1/receive
    headers:
      THANOS-TENANT: <replex-cluster-name>

Configuring the Querier With The Replex Agent

The Querier (Query Gateway) is the Thanos component that provides a Prometheus-compatible endpoint, which the Replex agent can use directly.
If you have a Thanos instance set up, verify that this endpoint is accessible with a simple curl command, replacing THANOS_QUERIER_URL with the URL of your Thanos instance:
curl "$THANOS_QUERIER_URL/api/v1/query?query=up"
The output should resemble the following, which indicates that the endpoint is Prometheus-compatible and will work with the agent.
up{instance="replex.io:9090", job="prometheus"} 1
up{instance="replex.io:9091", job="pushgateway"} 1
up{instance="replex.io:9093", job="alertmanager"} 1
up{instance="replex.io:9100", job="node"} 1
Following the previous guide, set the prometheus.url variable to your THANOS_QUERIER_URL as described in the agent docs in this section, and you're up and running with Thanos on Replex.
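When the same check is scripted, note that the Prometheus HTTP API returns query results as JSON rather than the one-line-per-series form shown above. The following is a minimal sketch of how a response body could be validated; the function names and the sample payload are illustrative and not part of the agent.

```python
import json


def is_prometheus_compatible(body):
    """Return True if body looks like a successful instant-query response."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    if not isinstance(payload, dict) or payload.get("status") != "success":
        return False
    data = payload.get("data")
    if not isinstance(data, dict):
        return False
    # An instant query such as `up` yields a result of type "vector".
    return data.get("resultType") == "vector" and isinstance(data.get("result"), list)


def up_series(body):
    """Extract (job, instance, value) for each series in the response."""
    payload = json.loads(body)
    rows = []
    for series in payload["data"]["result"]:
        labels = series.get("metric", {})
        # "value" is a [timestamp, "string value"] pair in the HTTP API.
        _, value = series["value"]
        rows.append((labels.get("job", ""), labels.get("instance", ""), value))
    return rows


# Hypothetical response body, shaped like the output of the curl command above.
sample = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {"__name__": "up", "instance": "replex.io:9090", "job": "prometheus"},
                "value": [1700000000, "1"],
            }
        ],
    },
})
```

An HTML error page or a non-success status fails the first check, which is usually a sign that the URL points at something other than the Querier.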

Configuring Self-Signed SSL Certificates (On-Prem Only)

For on-prem pushgateway deployments, if the pushgateway is served with a self-signed SSL certificate, the agent may encounter errors when trying to sync with the pushgateway.
To resolve this, you can use the sslCertificate Helm chart parameter to pass your certificate into the agent.
Example:
sslCertificate: "-----BEGIN CERTIFICATE-----\nMIIC1TCCAb2gAwIBAgIJAKbCs/2knCwGMA0GCSqGSIb3DQEBBQUAMBoxGDAWBgNV\nZAeRdaEZS6Bs\n-----END CERTIFICATE-----"
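The example value is a PEM certificate collapsed onto one line, with each line break written as a literal \n. Assuming that is the form the chart expects (an inference from the example, not a documented requirement), a small helper like the one below can produce it from a PEM file's contents. The function name is hypothetical.

```python
def pem_to_single_line(pem_text):
    """Join the lines of a PEM block with literal backslash-n separators."""
    lines = [line.strip() for line in pem_text.strip().splitlines() if line.strip()]
    if (
        len(lines) < 2
        or not lines[0].startswith("-----BEGIN")
        or not lines[-1].startswith("-----END")
    ):
        raise ValueError("input does not look like a PEM block")
    # "\\n" is a two-character sequence: a backslash followed by the letter n.
    return "\\n".join(lines)


# Placeholder certificate body, for illustration only.
pem = "-----BEGIN CERTIFICATE-----\nAAAA\nBBBB\n-----END CERTIFICATE-----\n"
single_line = pem_to_single_line(pem)
```

The result contains no real newline characters, so it can be pasted between double quotes as the sslCertificate value.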

Filesystem Metrics (Prometheus)

This section applies only to setups that use Prometheus as the metrics provider. The agent can collect PVC information from different metric sources. The default setting uses cAdvisor's kubelet_volume* metrics.
cAdvisor
The default setup uses the cAdvisor metrics to get PVC information. In that case, the METRICS_FILESYSTEM environment variable can be left at its default value, cadvisor.
Metrics used:
| Storage Metric | cAdvisor Metrics |
| --- | --- |
| Capacity | kubelet_volume_stats_capacity_bytes |
| Used | kubelet_volume_stats_used_bytes |
CSI
If kubelet_volume* metrics are not available and you are using CSI plugins, you must set the METRICS_FILESYSTEM environment variable to csi. This mode requires kube-state-metrics, because the PVC information is derived from both node_exporter and kube-state-metrics metrics.
Metrics used:
| Storage Metric | node_exporter Metrics | kube-state-metrics Metrics |
| --- | --- | --- |
| Capacity | node_filesystem_size_bytes | kube_persistentvolumeclaim_info |
| Used | node_filesystem_size_bytes - node_filesystem_free_bytes | kube_persistentvolumeclaim_info |

Exposed Metrics

The agent exposes its own metrics, which can be accessed via the /metrics route on port 8083.
| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| replex_agent_provider_status | gauge | name: metrics provider name | Indicates whether a metrics provider is reachable: 1 if it is reachable, 0 if not |
| replex_agent_sync_duration_count | counter | agent_version, response_code | Total number of times the metrics were synchronized with the Replex server |
| replex_agent_sync_duration_sum | gauge | agent_version, response_code | Total duration of all sync requests to the Replex server, in seconds |
| replex_agent_retry_cache_size | gauge | cluster_id | Number of cached metrics waiting to be re-sent to the Replex server |
| replex_agent_failed_metrics_total | counter | cluster_id | Total number of metrics that failed at least once |
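One way to use these metrics is to divide replex_agent_sync_duration_sum by replex_agent_sync_duration_count to get the average sync duration per label set. The sketch below does this on scraped /metrics text. It uses a deliberately simplified parser that assumes one sample per line with no trailing timestamp, and the sample values and version string are made up for illustration.

```python
from collections import defaultdict


def parse_samples(text):
    """Parse simple exposition lines into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # Assumption: the value is the last space-separated field.
        series, _, value = line.rpartition(" ")
        name, _, labels = series.partition("{")
        samples.append((name, labels.rstrip("}"), float(value)))
    return samples


def average_sync_seconds(text):
    """Return the average sync duration in seconds, keyed by label string."""
    sums = defaultdict(float)
    counts = defaultdict(float)
    for name, labels, value in parse_samples(text):
        if name == "replex_agent_sync_duration_sum":
            sums[labels] += value
        elif name == "replex_agent_sync_duration_count":
            counts[labels] += value
    return {key: sums[key] / counts[key] for key in counts if counts[key] > 0}


# Hypothetical scrape output; the numbers are illustrative only.
scrape = """\
# TYPE replex_agent_provider_status gauge
replex_agent_provider_status{name="prometheus"} 1
replex_agent_sync_duration_sum{agent_version="1.0.0",response_code="200"} 12.5
replex_agent_sync_duration_count{agent_version="1.0.0",response_code="200"} 5
"""
averages = average_sync_seconds(scrape)
```

For production use, a full exposition-format parser such as the one in the official Prometheus client library is a safer choice than this hand-rolled one.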

Used Metrics

These are the metrics the agent currently uses:
| Property | Description | Prometheus | Instana (plugin: metric) | Stackdriver | Datadog |
| --- | --- | --- | --- | --- | --- |
| 1 | Container CPU Usage | container_cpu_usage_seconds_total | docker: cpu.total_usage | kubernetes.io/container/cpu/core_usage_time | kubernetes.cpu.usage.total |
| 2 | Container MEM Usage | container_memory_working_set_bytes | docker: memory.usage | kubernetes.io/container/memory/used_bytes | kubernetes.memory.working_set |
| 3 | Node CPU Usage | node_cpu_seconds_total | docker: cpu.total_usage | kubernetes.io/node/cpu/core_usage_time | kubernetes.cpu.usage.total |
| 4 | Node MEM Usage | node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes | docker: memory.usage | kubernetes.io/node/memory/used_bytes | kubernetes.memory.usage |
| 5 | Storage (Capacity) | kubelet_volume_stats_capacity_bytes, node_filesystem_size_bytes | - | kubernetes.io/pod/volume/total_bytes | kubernetes.kubelet.volume.stats.capacity_bytes |
| 6 | Storage (Used) | kubelet_volume_stats_used_bytes, node_filesystem_free_bytes | - | kubernetes.io/pod/volume/used_bytes | kubernetes.kubelet.volume.stats.used_bytes |
| 7 | Disk Capacity | container_fs_limit_bytes | - | - | system.disk.total |
| 8 | Disk Used | container_fs_usage_bytes | - | - | system.disk.used |
| 9 | Network I/O (Received) | container_network_receive_bytes_total | docker: network.rx.bytes | kubernetes.io/pod/network/received_bytes_count | kubernetes.network.rx_bytes |
| 10 | Network I/O (Sent) | container_network_transmit_bytes_total | docker: network.tx.bytes | kubernetes.io/pod/network/sent_bytes_count | kubernetes.network.tx_bytes |
| 11 | Disk I/O (Written) | container_fs_writes_bytes_total | docker: blkio.blk_write | - | kubernetes.io.write_bytes |
| 12 | Disk I/O (Read) | container_fs_reads_bytes_total | docker: blkio.blk_read | - | kubernetes.io.read_bytes |

Environment Variables

| # | Variable | Required | Default | Comment |
| --- | --- | --- | --- | --- |
| 1 | REPLEX_TOKEN | Yes | | |
| 2 | METRIC_PROVIDER | Yes | | Options: prometheus, datadog, stackdriver, instana, thanos |
| 3 | CLUSTER_ID | If "KUBERNETES_INFO_PROVIDER" == "kubernetes" | | |
| 4 | CLUSTER_NAME | If "KUBERNETES_INFO_PROVIDER" == "kubernetes" | | |
| 5 | PUSHGATEWAY_URL | No | | |
| 6 | PROMETHEUS_SERVER_URL | If "METRIC_PROVIDER" == "prometheus" | | |
| 7 | DATADOG_API_KEY | If "METRIC_PROVIDER" == "datadog" | | |
| 8 | DATADOG_APPLICATION_KEY | If "METRIC_PROVIDER" == "datadog" | | |
| 9 | DATADOG_SITE | No | com | Options: com, eu |
| 10 | GCP_PROJECT_ID | If "METRIC_PROVIDER" == "stackdriver" | | |
| 11 | INSTANA_BASE_URL | If "METRIC_PROVIDER" == "instana" | | Format: https://tenant-unit.instana.io |
| 12 | INSTANA_API_TOKEN | If "METRIC_PROVIDER" == "instana" | | |
| 13 | KUBERNETES_INFO_PROVIDER | No | kubernetes | Options: kubernetes, instana |
| 14 | INSTANA_CLUSTER_ID | No | | |
| 15 | ONLY_USE_READY_NODES | No | false | Track only nodes that are in "Ready" state |
| 16 | PROMETHEUS_NODE_LABEL | No | node | The label that represents the node in the Prometheus metrics |
| 17 | PROMETHEUS_CONTAINER_LABEL | No | container | The label that represents the container in the Prometheus metrics |
| 18 | PROMETHEUS_POD_LABEL | No | pod | The label that represents the pod in the Prometheus metrics |
| 19 | CLOUD_PROVIDER_OVERRIDE | No | Detected automatically | Options: aws, azure, gce, custom, alibaba |
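The conditional "Required" rules above can be checked before deploying. The following is a minimal sketch of such a pre-flight check, not the agent's own validation logic. The function name is hypothetical, and the empty entry for thanos reflects only that the table lists no Thanos-specific variable.

```python
ALWAYS_REQUIRED = ("REPLEX_TOKEN", "METRIC_PROVIDER")

# Variables required per METRIC_PROVIDER value, as listed in the table.
PROVIDER_REQUIRED = {
    "prometheus": ("PROMETHEUS_SERVER_URL",),
    "datadog": ("DATADOG_API_KEY", "DATADOG_APPLICATION_KEY"),
    "stackdriver": ("GCP_PROJECT_ID",),
    "instana": ("INSTANA_BASE_URL", "INSTANA_API_TOKEN"),
    "thanos": (),  # assumption: the table names no thanos-specific variable
}


def missing_variables(env):
    """Return the names of required variables that are unset or empty."""
    missing = [name for name in ALWAYS_REQUIRED if not env.get(name)]
    provider = env.get("METRIC_PROVIDER", "")
    if provider and provider not in PROVIDER_REQUIRED:
        raise ValueError("unknown METRIC_PROVIDER: " + provider)
    missing += [name for name in PROVIDER_REQUIRED.get(provider, ()) if not env.get(name)]
    # KUBERNETES_INFO_PROVIDER defaults to "kubernetes" when unset.
    if env.get("KUBERNETES_INFO_PROVIDER", "kubernetes") == "kubernetes":
        missing += [name for name in ("CLUSTER_ID", "CLUSTER_NAME") if not env.get(name)]
    return missing


# Illustrative environment with one Datadog key left out.
example_env = {
    "REPLEX_TOKEN": "token",
    "METRIC_PROVIDER": "datadog",
    "DATADOG_API_KEY": "key",
    "CLUSTER_ID": "cluster-1",
    "CLUSTER_NAME": "demo",
}
example_missing = missing_variables(example_env)
```

Passing os.environ instead of the example dictionary checks the current process environment in the same way.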