# Agent

### Setting Up With Thanos

#### Configuring external\_labels for Thanos

The agent extracts node and container information from labels on the Prometheus metrics aggregated by Thanos. If the keys configured for `podLabel`, `nodeLabel`, or `containerLabel` are already used in your Prometheus external labels, we recommend renaming those keys to avoid collisions.
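
As a rough illustration of the collision check this implies, the sketch below compares a list of external label names against the label keys the agent uses. The function name `find_label_collisions` is hypothetical, and the sketch assumes that `podLabel`, `nodeLabel`, and `containerLabel` correspond to the `PROMETHEUS_POD_LABEL`, `PROMETHEUS_NODE_LABEL`, and `PROMETHEUS_CONTAINER_LABEL` variables (defaults `pod`, `node`, `container`) listed under Environment Variables.

```bash
#!/usr/bin/env bash
# Hypothetical helper: prints every external label name (passed as arguments)
# that collides with a label key the agent uses.
# Assumption: the agent keys come from the PROMETHEUS_*_LABEL variables,
# falling back to the documented defaults pod, node and container.
find_label_collisions() {
  local agent_labels external agent
  agent_labels="${PROMETHEUS_POD_LABEL:-pod} ${PROMETHEUS_NODE_LABEL:-node} ${PROMETHEUS_CONTAINER_LABEL:-container}"
  for external in "$@"; do
    for agent in $agent_labels; do
      if [ "$external" = "$agent" ]; then
        echo "$external"
      fi
    done
  done
}

# Usage: pass the names found under external_labels in your Prometheus config.
#   find_label_collisions cluster region node
```

Any name the function prints is a candidate for renaming.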

#### Setting up the thanos receive component

Because of availability concerns with the sidecar, we **strongly recommend** using the `thanos receive` component, which provides the real-time Prometheus results the agent needs to collect metrics.

Review the [`thanos receive` setup docs](https://thanos.io/tip/components/receive.md/) and configure the `remote_write` section of each Prometheus installation.

Here is a sample Prometheus configuration that works with the `thanos receive` component:

```yaml
# prometheus.yaml
remote_write:
  # remote_write is a list of endpoints; headers is a map of name to value
  - url: <thanos-receive-url>/api/v1/receive
    headers:
      THANOS-TENANT: <replex-cluster-name>
```

#### Configuring the Querier With The Replex Agent

The Querier (Query Gateway) is one of the [Thanos components](https://thanos.io/v0.6/thanos/getting-started.md/#components). It provides a Prometheus-compatible endpoint that works with the Replex agent.

If you have a Thanos instance set up, verify that this endpoint is accessible with a simple curl command. Replace `THANOS_QUERIER_URL` with the URL of your Thanos instance:

```bash
# Quote the URL so the shell does not interpret the "?" as a glob character
curl "$THANOS_QUERIER_URL/api/v1/query?query=up"
```

This should return the `up` series, similar to the following, which indicates that the endpoint is Prometheus compatible and will work with the agent. The series are shown in abbreviated form; the API wraps them in a JSON response.

```
up{instance="replex.io:9090", job="prometheus"} 1
up{instance="replex.io:9091", job="pushgateway"} 1
up{instance="replex.io:9093", job="alertmanager"} 1
up{instance="replex.io:9100", job="node"} 1
```
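
To check the response programmatically, one option is to look at the `status` field of the JSON envelope returned by the Prometheus HTTP API. The sketch below is a minimal, `grep`-based check rather than a full JSON parser; the function name `is_query_success` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads a Prometheus HTTP API response body on stdin and
# succeeds only if the body contains "status":"success".
# This is a plain text match, not a JSON parser.
is_query_success() {
  grep -Eq '"status"[[:space:]]*:[[:space:]]*"success"'
}

# Usage (assumes THANOS_QUERIER_URL is set and reachable):
#   curl -fsS "$THANOS_QUERIER_URL/api/v1/query?query=up" | is_query_success \
#     && echo "querier answered the query"
```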

Finally, set the [prometheus.url](https://docs.replex.io/#metric-provider-parameters) parameter described in the agent docs to your `THANOS_QUERIER_URL`, and the agent is up and running with Thanos on Replex.

### Configuring Self-Signed SSL Certificates (On-Prem Only)

For on-prem pushgateway deployments, if the pushgateway is served with a self-signed SSL certificate, the agent may encounter errors when trying to sync with it.

To resolve this, you can use the `sslCertificate` Helm chart parameter to pass your certificate into the agent.

Example:

```yaml
sslCertificate: "-----BEGIN CERTIFICATE-----\nMIIC1TCCAb2gAwIBAgIJAKbCs/2knCwGMA0GCSqGSIb3DQEBBQUAMBoxGDAWBgNV\nZAeRdaEZS6Bs\n-----END CERTIFICATE-----"
```
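
The example value is a PEM certificate collapsed onto one line, with each line break written as a literal `\n`. A minimal sketch of producing that form from a PEM file follows; the function name `pem_to_one_line` and the file name `ca.pem` are placeholders, not part of the chart.

```bash
#!/usr/bin/env bash
# Hypothetical helper: prints the PEM file given as $1 on a single line,
# joining its non-empty lines with a literal backslash-n sequence.
pem_to_one_line() {
  awk 'NF {
    sub(/\r$/, "")              # drop a trailing carriage return, if any
    printf "%s%s", sep, $0      # separator is empty before the first line
    sep = "\\n"                 # a backslash followed by the letter n
  }' "$1"
}

# Usage (ca.pem is a placeholder path):
#   pem_to_one_line ca.pem
```

The printed string can then be pasted between double quotes as the `sslCertificate` value.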

### Filesystem Metrics (Prometheus)

This section applies only to setups that use Prometheus as the metrics provider. The agent can collect PVC information from different metric sources. The default setting uses [cAdvisor's](https://github.com/google/cadvisor) `kubelet_volume*` metrics.

**cAdvisor**

The default setup uses the `cAdvisor` metrics to get PVC information. In that case, the `METRICS_FILESYSTEM` environment variable can be left at its default value, `cadvisor`.

Metrics used:

| Storage Metric | `cAdvisor` Metrics                    |
| -------------- | ------------------------------------- |
| Capacity       | `kubelet_volume_stats_capacity_bytes` |
| Used           | `kubelet_volume_stats_used_bytes`     |

**CSI**

If `kubelet_volume*` metrics are not available and you are using [CSI](https://kubernetes.io/blog/2019/01/15/container-storage-interface-ga/) plugins, you must set the `METRICS_FILESYSTEM` environment variable to `csi`. In that case, [kube-state-metrics](https://github.com/kubernetes/kube-state-metrics) is required. With `csi`, the agent gets PVC information from the `node_exporter` and `kube-state-metrics` metrics.

Metrics used:

| Storage Metric | `node_exporter` Metrics                                     | `kube-state-metrics` Metrics      |
| -------------- | ----------------------------------------------------------- | --------------------------------- |
| Capacity       | `node_filesystem_size_bytes`                                | `kube_persistentvolumeclaim_info` |
| Used           | `node_filesystem_size_bytes` - `node_filesystem_free_bytes` | `kube_persistentvolumeclaim_info` |
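
One way to decide between the two settings is to ask Prometheus whether any `kubelet_volume_stats_capacity_bytes` series exist. The sketch below inspects the JSON response of such an instant query; the function name `suggest_metrics_filesystem` is hypothetical, and the text match on an empty `result` array is a simplification rather than real JSON parsing. A `csi` suggestion only applies if you actually use CSI plugins and have `kube-state-metrics` and `node_exporter` available.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads the JSON response of an instant query for
# kubelet_volume_stats_capacity_bytes on stdin and prints a suggested
# METRICS_FILESYSTEM value. Prints "unknown" and returns 1 if the query failed.
suggest_metrics_filesystem() {
  local body
  body="$(cat)"
  if ! printf '%s' "$body" | grep -Eq '"status"[[:space:]]*:[[:space:]]*"success"'; then
    echo "unknown"
    return 1
  fi
  if printf '%s' "$body" | grep -Eq '"result"[[:space:]]*:[[:space:]]*\[[[:space:]]*\]'; then
    echo "csi"        # empty result: no kubelet_volume* series were found
  else
    echo "cadvisor"   # series are present: the default source can be used
  fi
}

# Usage (assumes PROMETHEUS_SERVER_URL is set and reachable):
#   curl -fsS "$PROMETHEUS_SERVER_URL/api/v1/query?query=kubelet_volume_stats_capacity_bytes" \
#     | suggest_metrics_filesystem
```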

### Exposed Metrics

The agent exposes its own metrics, which can be accessed via the `/metrics` route on port `8083`.

| Metric                              | Type    | Labels                           | Description                                                                                 |
| ----------------------------------- | ------- | -------------------------------- | ------------------------------------------------------------------------------------------- |
| `replex_agent_provider_status`      | gauge   | `name`: metrics provider name    | Indicates whether or not a metrics provider is reachable. 1 if it is reachable and 0 if not |
| `replex_agent_sync_duration_count`  | counter | `agent_version`, `response_code` | Total number of times the metrics were synchronized with the Replex server                  |
| `replex_agent_sync_duration_sum`    | gauge   | `agent_version`, `response_code` | The total duration of all sync requests to the Replex server in seconds                     |
| `replex_agent_retry_cache_size`     | gauge   | `cluster_id`                     | Count of cached metrics that are waiting to be re-sent to the Replex server                 |
| `replex_agent_failed_metrics_total` | counter | `cluster_id`                     | Total count of metrics that failed at least once                                            |
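
These metrics can also be checked from a script. The sketch below reads the text exposition format and reports whether every `replex_agent_provider_status` sample equals 1. The function name `all_providers_reachable` and the `<agent-host>` placeholder are hypothetical, and the parsing assumes the samples carry no trailing timestamp.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads Prometheus text-format metrics on stdin.
# Succeeds only if at least one replex_agent_provider_status sample is present
# and every such sample has the value 1.
# Assumption: the sample value is the last field on the line (no timestamp).
all_providers_reachable() {
  awk '
    /^replex_agent_provider_status[{ ]/ { seen = 1; if ($NF != 1) bad = 1 }
    END { if (seen && !bad) exit 0; exit 1 }
  '
}

# Usage (<agent-host> is a placeholder for wherever the agent is reachable):
#   curl -fsS "http://<agent-host>:8083/metrics" | all_providers_reachable \
#     || echo "a metrics provider is unreachable or the metric is missing"
```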

### Used Metrics

These are the metrics the agent currently uses:

| #        | Property               | Prometheus                                                             | Instana `(plugin: metric)` | [Stackdriver](https://cloud.google.com/monitoring/api/metrics_kubernetes) | [Datadog](https://docs.datadoghq.com/integrations/kubernetes/) |
| -------- | ---------------------- | ---------------------------------------------------------------------- | -------------------------- | ------------------------------------------------------------------------- | -------------------------------------------------------------- |
| 1        | Container CPU Usage    | container\_cpu\_usage\_seconds\_total                                  | *docker*: cpu.total\_usage | kubernetes.io/container/cpu/core\_usage\_time                             | kubernetes.cpu.usage.total                                     |
| 2        | Container MEM Usage    | container\_memory\_working\_set\_bytes                                 | *docker*: memory.usage     | kubernetes.io/container/memory/used\_bytes                                | kubernetes.memory.working\_set                                 |
| 3        | Node CPU Usage         | node\_cpu\_seconds\_total                                              | *docker*: cpu.total\_usage | kubernetes.io/node/cpu/core\_usage\_time                                  | kubernetes.cpu.usage.total                                     |
| 4        | Node MEM Usage         | node\_memory\_MemTotal\_bytes - node\_memory\_MemAvailable\_bytes      | *docker*: memory.usage     | kubernetes.io/node/memory/used\_bytes                                     | kubernetes.memory.usage                                        |
| 5        | Storage (Capacity)     | kubelet\_volume\_stats\_capacity\_bytes, node\_filesystem\_size\_bytes | -                          | kubernetes.io/pod/volume/total\_bytes                                     | kubernetes.kubelet.volume.stats.capacity\_bytes                |
| 6        | Storage (Used)         | kubelet\_volume\_stats\_used\_bytes, node\_filesystem\_free\_bytes     | -                          | kubernetes.io/pod/volume/used\_bytes                                      | kubernetes.kubelet.volume.stats.used\_bytes                    |
| 7        | Disk Capacity          | container\_fs\_limit\_bytes                                            | -                          | -                                                                         | system.disk.total                                              |
| 8        | Disk Used              | container\_fs\_usage\_bytes                                            | -                          | -                                                                         | system.disk.used                                               |
| 9        | Network I/O (Received) | container\_network\_receive\_bytes\_total                              | *docker*: network.rx.bytes | kubernetes.io/pod/network/received\_bytes\_count                          | kubernetes.network.rx\_bytes                                   |
| 10       | Network I/O (Sent)     | container\_network\_transmit\_bytes\_total                             | *docker*: network.tx.bytes | kubernetes.io/pod/network/sent\_bytes\_count                              | kubernetes.network.tx\_bytes                                   |
| 11       | Disk I/O (Written)     | container\_fs\_writes\_bytes\_total                                    | *docker*: blkio.blk\_write | -                                                                         | kubernetes.io.write\_bytes                                     |
| 12       | Disk I/O (Read)        | container\_fs\_reads\_bytes\_total                                     | *docker*: blkio.blk\_read  | -                                                                         | kubernetes.io.read\_bytes                                      |

### Environment Variables

|    | Variable                          | Required                                        | Default                              | Comment                                                                            |
| -- | --------------------------------- | ----------------------------------------------- | ------------------------------------ | ---------------------------------------------------------------------------------- |
| 1  | REPLEX\_TOKEN                     | Yes                                             |                                      |                                                                                    |
| 2  | METRIC\_PROVIDER                  | Yes                                             |                                      | Options: `prometheus`, `datadog`, `stackdriver`, `instana`, `thanos`               |
| 3  | CLUSTER\_ID                       | If "KUBERNETES\_INFO\_PROVIDER" == "kubernetes" |                                      |                                                                                    |
| 4  | CLUSTER\_NAME                     | If "KUBERNETES\_INFO\_PROVIDER" == "kubernetes" |                                      |                                                                                    |
| 5  | PUSHGATEWAY\_URL                  | No                                              | <https://pushgateway.replex.io/push> |                                                                                    |
| 6  | PROMETHEUS\_SERVER\_URL           | If "METRIC\_PROVIDER" == "prometheus"           |                                      |                                                                                    |
| 7  | DATADOG\_API\_KEY                 | If "METRIC\_PROVIDER" == "datadog"              |                                      |                                                                                    |
| 8  | DATADOG\_APPLICATION\_KEY         | If "METRIC\_PROVIDER" == "datadog"              |                                      |                                                                                    |
| 9  | DATADOG\_SITE                     | No                                              | com                                  | Options: `com`, `eu`                                                               |
| 10 | GCP\_PROJECT\_ID                  | If "METRIC\_PROVIDER" == "stackdriver"          |                                      |                                                                                    |
| 11 | INSTANA\_BASE\_URL                | If "METRIC\_PROVIDER" == "instana"              |                                      | Format: `https://tenant-unit.instana.io`                                           |
| 12 | INSTANA\_API\_TOKEN               | If "METRIC\_PROVIDER" == "instana"              |                                      |                                                                                    |
| 13 | KUBERNETES\_INFO\_PROVIDER        | No                                              | kubernetes                           | Options: `kubernetes`, `instana`                                                   |
| 14 | INSTANA\_CLUSTER\_ID              | No                                              |                                      |                                                                                    |
| 15 | ONLY\_USE\_READY\_NODES           | No                                              | false                                | Track only nodes that are in "Ready" state                                         |
| 16 | PROMETHEUS\_NODE\_LABEL           | No                                              | node                                 | The label that represents the node in the Prometheus metrics                       |
| 17 | PROMETHEUS\_CONTAINER\_LABEL      | No                                              | container                            | The label that represents the container in the Prometheus metrics                  |
| 18 | PROMETHEUS\_POD\_LABEL            | No                                              | pod                                  | The label that represents the pod in the Prometheus metrics                        |
| 19 | CLOUD\_PROVIDER\_OVERRIDE         | No                                              | Detected automatically               | Options: `aws`, `azure`, `gce`, `custom`, `alibaba`                                |
| 20 | USE\_CONTROL\_PLANE\_COST         | No                                              | false                                | Track costs of the Kubernetes Control Plane                                        |
| 21 | METRICS\_FILESYSTEM               | No                                              | cadvisor                             | Specify the filesystem metric source. Options: `cadvisor`, `csi`                   |
| 22 | SYNC\_INTERVAL\_SECONDS           | No                                              | 300                                  |                                                                                    |
| 23 | LOG\_LEVEL                        | No                                              | 3                                    | Higher value means higher verbosity                                                |
| 24 | METRICS\_RETRY\_INTERVAL\_SECONDS | No                                              | 300                                  |                                                                                    |
| 25 | METRICS\_CACHE\_DISK              | No                                              | true                                 | Cache failed metrics on disk                                                       |
| 26 | METRICS\_CACHE\_DISK\_DIR         | No                                              | /data/metrics                        | Directory to cache metrics if `METRICS_CACHE_DISK` == `true`                       |
| 27 | PROMETHEUS\_BEARER\_TOKEN         | No                                              |                                      | Bearer token for Prometheus server requests. Only if `METRIC_PROVIDER` == `prometheus` |
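
The conditional requirements in the table can be checked before starting the agent. The sketch below is a hypothetical pre-flight helper (the name `missing_agent_env` is not part of the agent) that follows the "Required" column as written; it makes no assumption about what the `thanos` provider needs beyond the common variables.

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight check: prints the name of every required variable
# that is unset or empty, one per line, based on the "Required" column.
missing_agent_env() {
  local required="REPLEX_TOKEN METRIC_PROVIDER" name value
  case "${METRIC_PROVIDER:-}" in
    prometheus)  required="$required PROMETHEUS_SERVER_URL" ;;
    datadog)     required="$required DATADOG_API_KEY DATADOG_APPLICATION_KEY" ;;
    stackdriver) required="$required GCP_PROJECT_ID" ;;
    instana)     required="$required INSTANA_BASE_URL INSTANA_API_TOKEN" ;;
  esac
  # KUBERNETES_INFO_PROVIDER defaults to "kubernetes" when unset.
  if [ "${KUBERNETES_INFO_PROVIDER:-kubernetes}" = "kubernetes" ]; then
    required="$required CLUSTER_ID CLUSTER_NAME"
  fi
  for name in $required; do
    eval "value=\${$name:-}"   # indirect lookup of the variable called $name
    [ -n "$value" ] || echo "$name"
  done
}

# Usage: run in the environment the agent will start in.
#   missing_agent_env
```

An empty output means every variable the table marks as required for the chosen providers is set.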
