Agent
Due to the way we extract node and container information from the Prometheus metrics aggregated by Thanos, we recommend renaming the keys configured for podLabel, nodeLabel and containerLabel if they are already used in your Prometheus external labels.
Due to availability concerns with the sidecar, we strongly recommend using the thanos receive component, which emulates all the real-time Prometheus results that the agent needs to get metrics.
Review the setup docs and configure the remote_write sections of your Prometheus installation(s).
Here is a sample Prometheus configuration that works with the thanos receive component:
# prometheus.yaml
remote_write:
  - url: <thanos-receive-url>/api/v1/receive
    headers:
      THANOS-TENANT: <replex-cluster-name>
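The label recommendation at the top of this section can be checked before deploying the agent. The following is a minimal sketch, not part of the Replex tooling: find_label_conflicts is a hypothetical helper and the external label values are made up. The agent label defaults (pod, node, container) are the ones listed in the environment variable table at the end of this page.

```python
def find_label_conflicts(external_labels, agent_labels):
    """Return the agent label settings whose key is also a Prometheus external label."""
    return {
        setting: key
        for setting, key in agent_labels.items()
        if key in external_labels
    }


# Hypothetical values for illustration only.
external_labels = {"cluster": "prod-eu", "node": "thanos-sidecar-0"}
agent_labels = {"podLabel": "pod", "nodeLabel": "node", "containerLabel": "container"}

conflicts = find_label_conflicts(external_labels, agent_labels)
for setting, key in conflicts.items():
    print(f"{setting} uses '{key}', which is also an external label; rename one of them")
```

With these sample values only nodeLabel is reported, because node appears in both mappings.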
The Querier / Query Gateway is the Thanos component that provides a Prometheus-compatible endpoint, which works with the Replex agent.
If you have a Thanos instance set up, verify that this endpoint is accessible with a simple curl command, replacing THANOS_QUERIER_URL with the URL of your Thanos instance:
curl "$THANOS_QUERIER_URL/api/v1/query?query=up"
The result should contain up series similar to the following, which indicates that the endpoint is Prometheus compatible and will work with the agent:
up{instance="replex.io:9090", job="prometheus"} 1
up{instance="replex.io:9091", job="pushgateway"} 1
up{instance="replex.io:9093", job="alertmanager"} 1
up{instance="replex.io:9100", job="node"} 1
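The same check can be scripted. This is a minimal sketch using only the Python standard library, assuming the endpoint answers with the standard Prometheus HTTP API JSON envelope (a status field plus data.result). The function names are hypothetical and not part of the agent.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, query="up"):
    """Build the instant-query URL for a Prometheus-compatible endpoint."""
    return base_url.rstrip("/") + "/api/v1/query?" + urllib.parse.urlencode({"query": query})


def is_prometheus_compatible(payload):
    """Check that a decoded JSON response has the Prometheus query API shape."""
    data = payload.get("data")
    return (
        payload.get("status") == "success"
        and isinstance(data, dict)
        and isinstance(data.get("result"), list)
    )


def check_querier(base_url, timeout=10):
    """Query the endpoint and report whether the response looks Prometheus compatible."""
    with urllib.request.urlopen(build_query_url(base_url), timeout=timeout) as resp:
        return is_prometheus_compatible(json.load(resp))
```

Calling check_querier(os.environ["THANOS_QUERIER_URL"]) performs the request. The two helper functions can be exercised without network access.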
Following the previous guide, provide the THANOS_QUERIER_URL under the prometheus.url variable as described in the agent docs in this section, and you are up and running with Thanos on Replex.
For on-prem pushgateway deployments, if the pushgateway is served with a self-signed SSL certificate, the agent may encounter errors when trying to sync with the pushgateway.
To resolve this, use the sslCertificate Helm chart parameter to pass your certificate into the agent. Example:
sslCertificate: "-----BEGIN CERTIFICATE-----\nMIIC1TCCAb2gAwIBAgIJAKbCs/2knCwGMA0GCSqGSIb3DQEBBQUAMBoxGDAWBgNV\nZAeRdaEZS6Bs\n-----END CERTIFICATE-----"
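A PEM file normally spans several lines, while the example above is a single line with escaped newlines. The sketch below converts one form into the other, assuming the chart expects exactly that escaped single-line form. pem_to_single_line is a hypothetical helper, and the sample certificate is truncated and not valid.

```python
def pem_to_single_line(pem_text):
    """Join the lines of a PEM certificate with literal backslash-n sequences."""
    lines = [line.strip() for line in pem_text.strip().splitlines() if line.strip()]
    return "\\n".join(lines)


# Truncated, non-functional sample for illustration only.
sample_pem = """-----BEGIN CERTIFICATE-----
MIIC1TCCAb2gAwIBAgIJAKbCs
ZAeRdaEZS6Bs
-----END CERTIFICATE-----
"""

print(f'sslCertificate: "{pem_to_single_line(sample_pem)}"')
```

In practice, read the certificate from disk with open(path).read() instead of using an inline string.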
This section applies only to setups that use Prometheus as the metrics provider. We use metrics from different sources to collect PVC information. The default setting uses cAdvisor's kubelet_volume* metrics.
cAdvisor
The default setup uses the cAdvisor metrics to get the PVC information. In that case, the METRICS_FILESYSTEM environment variable can be left at its default value, cadvisor.
Metrics used:
Storage Metric | cAdvisor Metrics |
Capacity | kubelet_volume_stats_capacity_bytes |
Used | kubelet_volume_stats_used_bytes |
CSI
If kubelet_volume* metrics are not available and you are using CSI plugins, you must set the METRICS_FILESYSTEM environment variable to csi. In that case, kube-state-metrics is required. For csi, we get the PVC information from the node_exporter and kube-state-metrics metrics.
Metrics used:
Storage Metric | node_exporter Metrics | kube-state-metrics Metrics |
Capacity | node_filesystem_size_bytes | kube_persistentvolumeclaim_info |
Used | node_filesystem_size_bytes - node_filesystem_free_bytes | kube_persistentvolumeclaim_info |
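The two tables above can be summarized as a lookup from the METRICS_FILESYSTEM value to the metric expressions involved. This is an illustrative sketch only: the metric names come from the tables, but storage_expressions is a hypothetical helper and these are not necessarily the exact queries the agent issues. How kube_persistentvolumeclaim_info is joined in the csi case is not described here, so it is left out.

```python
# Metric expressions per METRICS_FILESYSTEM value, copied from the tables above.
STORAGE_EXPRESSIONS = {
    "cadvisor": {
        "capacity": "kubelet_volume_stats_capacity_bytes",
        "used": "kubelet_volume_stats_used_bytes",
    },
    "csi": {
        "capacity": "node_filesystem_size_bytes",
        "used": "node_filesystem_size_bytes - node_filesystem_free_bytes",
    },
}


def storage_expressions(metrics_filesystem="cadvisor"):
    """Return the capacity and used expressions for the given metric source."""
    source = metrics_filesystem.strip().lower()
    if source not in STORAGE_EXPRESSIONS:
        raise ValueError(f"unsupported METRICS_FILESYSTEM value: {metrics_filesystem!r}")
    return STORAGE_EXPRESSIONS[source]


for name, expression in storage_expressions("csi").items():
    print(f"{name}: {expression}")
```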
The agent exposes its own metrics. They can be accessed via the /metrics route on port 8083.
Metric | Type | Labels | Description |
replex_agent_provider_status | gauge | name : metrics provider name | Indicates whether or not a metrics provider is reachable. 1 if it is reachable and 0 if not |
replex_agent_sync_duration_count | counter | agent_version, response_code | Total number of times the metrics were synchronized with the Replex server |
replex_agent_sync_duration_sum | gauge | agent_version, response_code | Total duration of all sync requests to the Replex server, in seconds |
replex_agent_retry_cache_size | gauge | cluster_id | Number of cached metrics waiting to be re-sent to the Replex server |
replex_agent_failed_metrics_total | counter | cluster_id | Total count of metrics that failed at least once |
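These metrics can be used for a simple health check. The sketch below parses a small piece of Prometheus text exposition output and derives two values from the table above: the providers reported as unreachable and the mean sync duration (sum divided by count). The parser is deliberately simplified, the sample text is invented, and all function names are hypothetical.

```python
import re

# name{labels} value [timestamp]; the labels part is optional.
SAMPLE_LINE = re.compile(
    r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+-?\d+)?$'
)
LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse exposition text into (name, labels, value) tuples, skipping comments."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_LINE.match(line)
        if match is None:
            continue  # ignore lines this simplified parser does not understand
        name, raw_labels, value = match.groups()
        samples.append((name, dict(LABEL.findall(raw_labels or "")), float(value)))
    return samples


def unreachable_providers(samples):
    """Names of providers whose replex_agent_provider_status is 0."""
    return sorted(
        labels.get("name", "")
        for name, labels, value in samples
        if name == "replex_agent_provider_status" and value == 0
    )


def mean_sync_seconds(samples):
    """Mean sync duration across all label sets, or None if nothing was synced."""
    total = sum(v for n, _, v in samples if n == "replex_agent_sync_duration_sum")
    count = sum(v for n, _, v in samples if n == "replex_agent_sync_duration_count")
    return total / count if count else None


# Invented sample output for illustration only.
sample_text = """
# TYPE replex_agent_provider_status gauge
replex_agent_provider_status{name="prometheus"} 1
replex_agent_provider_status{name="pushgateway"} 0
replex_agent_sync_duration_count{agent_version="1.0.0",response_code="200"} 4
replex_agent_sync_duration_sum{agent_version="1.0.0",response_code="200"} 6
"""

samples = parse_metrics(sample_text)
print("unreachable:", unreachable_providers(samples))
print("mean sync seconds:", mean_sync_seconds(samples))
```

To run this against a live agent, fetch the text from the /metrics route on port 8083 of the agent pod (the host name depends on your setup) and pass it to parse_metrics.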
These are the metrics the agent currently uses:
Property | Description | Prometheus | Instana (plugin: metric) | Stackdriver | Datadog |
1 | Container CPU Usage | container_cpu_usage_seconds_total | docker: cpu.total_usage | kubernetes.io/container/cpu/core_usage_time | kubernetes.cpu.usage.total |
2 | Container MEM Usage | container_memory_working_set_bytes | docker: memory.usage | kubernetes.io/container/memory/used_bytes | kubernetes.memory.working_set |
3 | Node CPU Usage | node_cpu_seconds_total | docker: cpu.total_usage | kubernetes.io/node/cpu/core_usage_time | kubernetes.cpu.usage.total |
4 | Node MEM Usage | node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes | docker: memory.usage | kubernetes.io/node/memory/used_bytes | kubernetes.memory.usage |
5 | Storage (Capacity) | kubelet_volume_stats_capacity_bytes, node_filesystem_size_bytes | - | kubernetes.io/pod/volume/total_bytes | kubernetes.kubelet.volume.stats.capacity_bytes |
6 | Storage (Used) | kubelet_volume_stats_used_bytes, node_filesystem_free_bytes | - | kubernetes.io/pod/volume/used_bytes | kubernetes.kubelet.volume.stats.used_bytes |
7 | Disk Capacity | container_fs_limit_bytes | - | - | system.disk.total |
8 | Disk Used | container_fs_usage_bytes | - | - | system.disk.used |
9 | Network I/O (Received) | container_network_receive_bytes_total | docker: network.rx.bytes | kubernetes.io/pod/network/received_bytes_count | kubernetes.network.rx_bytes |
10 | Network I/O (Sent) | container_network_transmit_bytes_total | docker: network.tx.bytes | kubernetes.io/pod/network/sent_bytes_count | kubernetes.network.tx_bytes |
11 | Disk I/O (Written) | container_fs_writes_bytes_total | docker: blkio.blk_write | - | kubernetes.io.write_bytes |
12 | Disk I/O (Read) | container_fs_reads_bytes_total | docker: blkio.blk_read | - | kubernetes.io.read_bytes |
| Variable | Required | Default | Comment |
1 | REPLEX_TOKEN | Yes | | |
2 | METRIC_PROVIDER | Yes | | Options: prometheus, datadog, stackdriver, instana, thanos |
3 | CLUSTER_ID | If "KUBERNETES_INFO_PROVIDER" == "kubernetes" | | |
4 | CLUSTER_NAME | If "KUBERNETES_INFO_PROVIDER" == "kubernetes" | | |
5 | PUSHGATEWAY_URL | No | | |
6 | PROMETHEUS_SERVER_URL | If "METRIC_PROVIDER" == "prometheus" | | |
7 | DATADOG_API_KEY | If "METRIC_PROVIDER" == "datadog" | | |
8 | DATADOG_APPLICATION_KEY | If "METRIC_PROVIDER" == "datadog" | | |
9 | DATADOG_SITE | No | com | Options: com, eu |
10 | GCP_PROJECT_ID | If "METRIC_PROVIDER" == "stackdriver" | | |
11 | INSTANA_BASE_URL | If "METRIC_PROVIDER" == "instana" | | Format: https://tenant-unit.instana.io |
12 | INSTANA_API_TOKEN | If "METRIC_PROVIDER" == "instana" | | |
13 | KUBERNETES_INFO_PROVIDER | No | kubernetes | Options: kubernetes, instana |
14 | INSTANA_CLUSTER_ID | No | | |
15 | ONLY_USE_READY_NODES | No | false | Track only nodes that are in "Ready" state |
16 | PROMETHEUS_NODE_LABEL | No | node | The label that represents the node in the Prometheus metrics |
17 | PROMETHEUS_CONTAINER_LABEL | No | container | The label that represents the container in the Prometheus metrics |
18 | PROMETHEUS_POD_LABEL | No | pod | The label that represents the pod in the Prometheus metrics |
19 | CLOUD_PROVIDER_OVERRIDE | No | Detected automatically | Options: aws, azure, gce, custom, alibaba |
20 | USE_CONTROL_PLANE_COST | No | false | Track costs of the Kubernetes Control Plane |
21 | METRICS_FILESYSTEM | No | cadvisor | Specify the filesystem metric source. Options: cadvisor, csi |
22 | SYNC_INTERVAL_SECONDS | No | 300 | |
23 | LOG_LEVEL | No | 3 | Higher value means higher verbosity |
24 | METRICS_RETRY_INTERVAL_SECONDS | No | 300 | |
25 | METRICS_CACHE_DISK | No | true | Cache failed metrics on disk |
26 | METRICS_CACHE_DISK_DIR | No | /data/metrics | Directory to cache metrics if METRICS_CACHE_DISK == true |
27 | PROMETHEUS_BEARER_TOKEN | No | | Bearer token sent with requests to the Prometheus server. Only used if METRIC_PROVIDER == prometheus |
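The conditional requirements in the table can be checked before starting the agent. The following is a minimal sketch that encodes only the "Required" column as written above. missing_variables is a hypothetical helper, not something shipped with the agent, and the table lists no extra requirements for the thanos provider, so none are checked.

```python
# Provider-specific required variables, taken from the "Required" column above.
REQUIRED_BY_PROVIDER = {
    "prometheus": ["PROMETHEUS_SERVER_URL"],
    "datadog": ["DATADOG_API_KEY", "DATADOG_APPLICATION_KEY"],
    "stackdriver": ["GCP_PROJECT_ID"],
    "instana": ["INSTANA_BASE_URL", "INSTANA_API_TOKEN"],
}


def missing_variables(env):
    """Return the required variables that are unset or empty in the given mapping."""
    provider = env.get("METRIC_PROVIDER", "")
    info_provider = env.get("KUBERNETES_INFO_PROVIDER", "kubernetes")  # table default

    required = ["REPLEX_TOKEN", "METRIC_PROVIDER"]
    if info_provider == "kubernetes":
        required += ["CLUSTER_ID", "CLUSTER_NAME"]
    required += REQUIRED_BY_PROVIDER.get(provider, [])
    return [name for name in required if not env.get(name)]


# Hypothetical environment for illustration only.
example_env = {
    "REPLEX_TOKEN": "example-token",
    "METRIC_PROVIDER": "datadog",
    "CLUSTER_ID": "example-id",
    "CLUSTER_NAME": "example-cluster",
    "DATADOG_API_KEY": "example-key",
}
print(missing_variables(example_env))
```

Passing os.environ instead of example_env checks the current process environment.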