# Monitoring and Alerts

## Quick Note

{% hint style="info" %}
This documentation applies to on-prem Replex installations.\
\
Clients hosted on `*.replex.io` can skip this page: all alerts and configuration described here are managed for you.
{% endhint %}

## Dependencies

Setting up your monitoring stack requires Grafana, Prometheus, and Alertmanager installed on the target cluster.\
\
Ensure Grafana 5.3.4 or later is installed; the dashboard JSON specification used by our charts is not compatible with earlier versions.

## Setting up Prometheus

Our core applications expose their metrics in Prometheus format, so Prometheus is the metrics backend we use.\
\
Our deployment spec already carries the required annotation (shown below), so your Prometheus instance will scrape the metrics automatically.

```yaml
  # k8s-file.yaml

  ...
  annotations:
    prometheus.io/scrape: "true"
```
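For context, here is a hypothetical deployment fragment showing where this annotation sits in a pod template. The `prometheus.io/port` and `prometheus.io/path` annotations are optional extras shown for illustration, not something the Replex manifests necessarily set:

```yaml
# Illustrative only — the Replex manifests already carry the scrape annotation.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app
spec:
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
      annotations:
        prometheus.io/scrape: "true"   # opt this pod in to scraping
        prometheus.io/port: "8080"     # optional: metrics port
        prometheus.io/path: "/metrics" # optional: metrics path
    spec:
      containers:
      - name: example-app
        image: example/app:latest
        ports:
        - containerPort: 8080
```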

To access the metrics, add the scrape config named `kubernetes-pods` from [here](https://github.com/prometheus/prometheus/blob/master/documentation/examples/prometheus-kubernetes.yml#L254) to your `prometheus.yml` file to complete the configuration.\
\
You may skip this step if your Prometheus installation already includes an equivalent configuration.
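For reference, an abridged sketch of that `kubernetes-pods` job follows; consult the linked upstream example for the full, current version:

```yaml
# prometheus.yml (excerpt) — abridged from the upstream Prometheus example
- job_name: 'kubernetes-pods'
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  # Only scrape pods annotated with prometheus.io/scrape: "true"
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    action: keep
    regex: true
  # Honour an optional prometheus.io/path annotation
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
    action: replace
    target_label: __metrics_path__
    regex: (.+)
  # Copy namespace and pod name into the labels used by the alert rules below
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_pod_name]
    action: replace
    target_label: kubernetes_pod_name
```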

## Configuring AlertManager (Optional)

Alerts can also be configured based on certain metrics exposed to your Prometheus instance.

This step is only necessary if you are installing AlertManager for the first time; there is a good guide to installing it and configuring your receivers [here](http://elatov.github.io/2020/01/alerting-with-prometheus-on-kubernetes/#install-alertmanager).\
\
The installation process itself is outside the scope of this document.\
\
The first step in setting up alerts is to confirm that your Prometheus instance is pointed at AlertManager correctly.

We use the following configuration for our instance:

```yaml
# prometheus.yml

rule_files:
  - /etc/prometheus-rules/rules    # This points to where the rules are stored

alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - alertmanager.<namespace>.svc.cluster.local:9093   # FQDN of your AlertManager instance
```

{% hint style="info" %}
Restarting your Prometheus instance is required after changing this configuration
{% endhint %}

You can check out [this](https://grafana.com/blog/2020/02/25/step-by-step-guide-to-setting-up-prometheus-alertmanager-with-slack-pagerduty-and-gmail/) article about alerting rules and pointing Prometheus to AlertManager.
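If you are setting up AlertManager from scratch, a minimal `alertmanager.yml` might look like the sketch below. The Slack receiver, channel name, and grouping labels are assumptions for illustration; substitute your own receivers:

```yaml
# alertmanager.yml — minimal sketch, not a production configuration
route:
  receiver: default
  group_by: [alertname, kubernetes_namespace]
  group_wait: 30s
  repeat_interval: 4h
receivers:
- name: default
  slack_configs:
  - api_url: https://hooks.slack.com/services/YOUR/WEBHOOK/URL  # your Slack webhook
    channel: '#alerts'
```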

## Setting up Alerts

Once AlertManager is properly configured with Prometheus, you can add the following rules to your Prometheus rules file.\
\
You can copy the template below and modify it to fit your use cases or preferred alert messages:

```yaml
# /etc/prometheus-rules/rules

groups: 
- name: uptime  
  rules: 
  - alert: CAdvisorHostDown
    expr: up{job="kubernetes-cadvisor"} == 0
    for: 1m 
    labels: 
      severity: high 
    annotations: 
      summary: cAdvisor reports host {{ $labels.instance }} is down, investigate immediately!
  - alert: NodeExporterNodeDown
    expr: up{job="kubernetes-nodes"} == 0
    for: 1m
    labels:
      severity: high
    annotations:
      summary: NodeExporter reports {{ $labels.instance }} is down, investigate immediately!
  - alert: APIServerDown
    expr: up{job="kubernetes-apiservers"} == 0
    for: 1m
    labels:
      severity: high
    annotations:
      summary: APIServer {{ $labels.instance }} is down, investigate immediately!
  - alert: PodDown
    expr: up{job="kubernetes-pods"} == 0
    for: 1m
    labels:
      severity: high
    annotations:
      summary: Pod {{ $labels.kubernetes_namespace }}/{{ $labels.kubernetes_pod_name}} is down, investigate immediately!

- name: pvc
  rules:
  - alert: VolumeRequestThresholdExceeded
    expr: (kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes) > 0.9
    for: 1m
    labels:
      severity: high
    annotations:
      summary: Volume {{ $labels.persistentvolumeclaim }} in namespace {{ $labels.namespace }} on node {{ $labels.kubernetes_io_hostname }} exceeded threshold capacity of 90%
  - alert: UnboundedPV
    expr: kube_persistentvolume_status_phase{phase != "Bound"} == 1
    for: 1d
    labels:
      severity: high
    annotations:
      summary: PV {{ $labels.persistentvolume }} has been in phase {{ $labels.phase }} for more than 1 day.
  - alert: UnboundedPVC
    expr: kube_persistentvolumeclaim_status_phase{phase != "Bound"} == 1
    for: 5m
    labels:
      severity: high
    annotations:
      summary: PVC {{ $labels.persistentvolumeclaim }} in namespace {{ $labels.namespace }} is currently in phase {{ $labels.phase }}.

- name: replex
  rules:
  - alert: ServerErrorAlert
    expr: sum by (kubernetes_namespace, kubernetes_pod_name) (changes(server_http_request_duration_seconds_count{job="kubernetes-pods",status_code=~"5.*"}[1m])) > 0
    for: 30s
    labels:
      severity: medium
    annotations:
      summary: 5xx errors on {{$labels.kubernetes_namespace}}/{{ $labels.kubernetes_pod_name }} for url {{ $labels.url }} exceeded threshold of 1 request in 1 minute
  - alert: PushGatewayError
    expr: sum by (kubernetes_namespace, kubernetes_pod_name) (changes(pushgateway_push_requests_duration_seconds_count{job="kubernetes-pods", status=~"5.*"}[30m])) > 0
    for: 30s
    labels:
      severity: medium
    annotations:
      summary: 5xx errors on pod {{$labels.kubernetes_namespace}}/{{ $labels.kubernetes_pod_name }} exceeded threshold of 1 request within 30 minutes
  - alert: PricingAPIError
    expr: sum by (kubernetes_namespace, kubernetes_pod_name) (changes(pricingapi_http_request_duration_seconds_count{job="kubernetes-pods",status_code=~"5.*"}[1m])) > 0
    for: 30s
    labels:
      severity: medium
    annotations:
      summary: 5xx errors on pod {{$labels.kubernetes_namespace}}/{{ $labels.kubernetes_pod_name }} exceeded threshold of 1 request in 1 minute
  - alert: AggregatorErrors
    expr: changes(aggregator_aggregation_duration_seconds_count{job="kubernetes-pods",status="0"}[15m]) > 0
    for: 15m
    labels:
      severity: medium
    annotations:
      summary: Failed aggregations on pod {{$labels.kubernetes_namespace}}/{{ $labels.kubernetes_pod_name }} exceeded threshold of 1 in 15 minutes

- name: database 
  rules: 
  - alert: ReplicationStopped 
    expr: pg_repl_stream_active{job="kubernetes-pods"} == 0 
    for: 10s 
    labels: 
      severity: high 
    annotations: 
      summary: Replication slot {{ $labels.slot_name }} for server {{ $labels.server }} is no longer active 
  - alert: PushGateWayDatabaseUnavailable
    expr: pushgateway_database_status{job="kubernetes-pods"} == 0
    for: 1m
    labels:
      severity: high
    annotations:
      summary: Database connection for pod {{$labels.kubernetes_namespace}}/{{ $labels.kubernetes_pod_name }} is no longer active
      description: Database connection for the pod {{$labels.kubernetes_namespace}}/{{ $labels.kubernetes_pod_name }} is no longer active, consider restarting the pod
  - alert: AggregatorDatabaseUnavailable
    expr: aggregator_database_status{job="kubernetes-pods"} == 0
    for: 1m
    labels:
      severity: high
    annotations:
      summary: Database connection for pod {{$labels.kubernetes_namespace}}/{{ $labels.kubernetes_pod_name }} is no longer active
      description: Database connection for the pod {{$labels.kubernetes_namespace}}/{{ $labels.kubernetes_pod_name }} is no longer active, consider restarting the pod
  - alert: FailedAggregationEvent
    expr: changes(aggregator_query_duration_seconds_count{job="kubernetes-pods",status="0"}[15m]) > 0
    for: 1m
    labels:
      severity: high
    annotations:
      summary: "{{ $labels.aggregation_type }} cron job failed on pod {{$labels.kubernetes_namespace}}/{{ $labels.kubernetes_pod_name }}"
      description: A failed aggregation event has occurred on {{ $labels.aggregation_type }} on pod {{$labels.kubernetes_namespace}}/{{ $labels.kubernetes_pod_name }}
  - alert: HighNumberOfConnections
    expr: pg_stat_database_numbackends{datname="postgres"} > 30
    for: 1m
    labels:
      severity: high
    annotations:
      summary: "More than 30 connections to database {{$labels.datname}} on server {{ $labels.server }}"
      description: "More than 30 connections to database {{$labels.datname}} on server {{ $labels.server }}"
```

After copying the template above into your Prometheus rules file (modifying it if necessary), check your Prometheus dashboard to verify the alerts are registered on the Alerts page.

{% hint style="info" %}
Restarting your Prometheus instance is required after editing the rules
{% endhint %}
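Before restarting, you can sanity-check the rules file with `promtool`, which ships with the official Prometheus release. The path below assumes the rules location used earlier:

```shell
# Validate the alerting rules before restarting Prometheus.
# Skips quietly if promtool or the rules file is not available here.
RULES_FILE=/etc/prometheus-rules/rules
if command -v promtool >/dev/null 2>&1 && [ -f "$RULES_FILE" ]; then
  promtool check rules "$RULES_FILE"
else
  echo "skipping validation: promtool or $RULES_FILE not available"
fi
```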

![Snapshot of correctly configured Prometheus alerts](https://4068579783-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MFGA2pJIFrMdGZ5i84m%2F-MM0wF6T78aKqZ7cpPgx%2F-MM1jjckiwmX-JGArSXu%2Fimage.png?alt=media\&token=0418f2c9-2d4c-45bc-a385-987f5ce58642)

## Finishing with Grafana

Once Prometheus is configured, you can proceed to install the Grafana charts.

The charts are hosted and maintained publicly on Grafana.com:

| Name             | URL                                            |
| ---------------- | ---------------------------------------------- |
| Request Metrics  | <https://grafana.com/grafana/dashboards/13401> |
| Database Metrics | <https://grafana.com/grafana/dashboards/13400> |

Provided the metrics from the Replex components are exposed properly, you should see dashboards similar to these:

![Request Metrics Dashboard](https://4068579783-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MFGA2pJIFrMdGZ5i84m%2F-MM0wF6T78aKqZ7cpPgx%2F-MM1nk6pnJ0E6E7SdGl8%2Fimage.png?alt=media\&token=f6b24ac1-50aa-4286-b8a1-bcd72011f817)

![Database Metrics Dashboard](https://4068579783-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MFGA2pJIFrMdGZ5i84m%2F-MM0wF6T78aKqZ7cpPgx%2F-MM1ntVvLXdwcbY9qPri%2Fimage.png?alt=media\&token=3470ca31-f442-4d2b-9447-8fc130ec32ab)
