Monitoring and Alerts
Describes how to configure your Grafana, Prometheus, and Alertmanager instances to monitor your Replex deployments.

Quick Note

This documentation applies to on-premises Replex installations. Clients hosted on *.replex.io can skip it; all alerts and configurations described here are handled for you.

Dependencies

Setting up your monitoring stack requires Grafana, Prometheus, and Alertmanager to be installed on the target cluster. Ensure Grafana version 5.3.4 or later is installed, as the charts' JSON specification is only compatible from that version onward.

Setting up Prometheus

Prometheus is used for our metrics because our core applications expose them in Prometheus format. We already configure our deployment specs with the required annotation (shown below), so your Prometheus instance will scrape the metrics automatically.
# k8s-file.yaml

...
annotations:
  prometheus.io/scrape: "true"
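If a component serves metrics on a non-default port or path, the companion annotations commonly used with this scrape convention look like the following; they are honoured by the kubernetes-pods scrape configuration described below, and the port and path values shown are placeholders rather than Replex defaults:
# Optional companion annotations (values are placeholders)
annotations:
  prometheus.io/scrape: "true"
  prometheus.io/port: "8080"    # container port serving metrics
  prometheus.io/path: /metrics  # scrape path, if not /metrics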
To scrape these metrics, add the scrape config named kubernetes-pods from here to your prometheus.yml file. You may skip this step if your Prometheus installation already ships an equivalent configuration.
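For reference, a minimal sketch of such a kubernetes-pods job, adapted from the standard Prometheus Kubernetes example configuration; adjust the relabeling to your setup. The kubernetes_namespace and kubernetes_pod_name labels it produces are the ones referenced by the alert rules later in this document.
# prometheus.yml (sketch, adapted from the upstream Prometheus Kubernetes example)
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Only scrape pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Honour optional prometheus.io/path and prometheus.io/port annotations
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      # Expose namespace and pod name as the labels used by the alert rules below
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name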

Configuring Alertmanager (Optional)

Alerts can also be configured based on certain metrics exposed to your Prometheus instance.
This step is only necessary if you are installing Alertmanager for the first time; there is a helpful guide on installing it and configuring your receivers here, as the installation process itself is outside the scope of this document. The first step in setting up alerts is to confirm that your Prometheus instance is configured to point to Alertmanager correctly.
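If you are setting up receivers from scratch, a minimal alertmanager.yml sketch might look like the following; the Slack receiver, webhook URL, and channel are placeholders and not part of the Replex installation:
# alertmanager.yml (illustrative sketch only; receiver details are placeholders)
route:
  receiver: default                # default receiver for all alerts
  group_by: [alertname, kubernetes_namespace]
  group_wait: 30s                  # wait before sending the first notification for a group
  group_interval: 5m
  repeat_interval: 4h
receivers:
  - name: default
    slack_configs:
      - api_url: https://hooks.slack.com/services/<your-webhook>
        channel: "#alerts"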
On the Prometheus side, we use the following configuration for our instance:
# prometheus.yml

rule_files:
  - /etc/prometheus-rules/rules # This points to where the rules are stored

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager.<namespace>.svc.cluster.local:9093 # FQDN of your Alertmanager instance
Restarting your Prometheus instance is required after changing this configuration.
You can check out this article about alerting rules and pointing Prometheus to Alertmanager.
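If your Prometheus instance runs inside the cluster, one common way to provide the rules file at /etc/prometheus-rules/rules is a ConfigMap mounted into the Prometheus pod. A minimal sketch, assuming a ConfigMap named prometheus-rules in a monitoring namespace (these names are illustrative; adjust them to your deployment):
# Illustrative only: serving the rules file from a ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-rules
  namespace: monitoring
data:
  rules: |
    # paste the rule groups from the next section here
    groups: []

# In the Prometheus pod spec, mount the ConfigMap at /etc/prometheus-rules:
#   volumes:
#     - name: rules
#       configMap:
#         name: prometheus-rules
#   volumeMounts:
#     - name: rules
#       mountPath: /etc/prometheus-rules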

Setting up Alerts

Once Alertmanager is properly configured with Prometheus, add the rules specified here to your Prometheus rules configuration. You can copy the template below and modify it to fit your use cases or preferred alerting messages:
# /etc/prometheus-rules/rules

groups:
  - name: uptime
    rules:
      - alert: CAdvisorHostDown
        expr: up{job="kubernetes-cadvisor"} == 0
        for: 1m
        labels:
          severity: high
        annotations:
          summary: cAdvisor reports host {{ $labels.instance }} is down, investigate immediately!
      - alert: NodeExporterNodeDown
        expr: up{job="kubernetes-nodes"} == 0
        for: 1m
        labels:
          severity: high
        annotations:
          summary: NodeExporter reports {{ $labels.instance }} is down, investigate immediately!
      - alert: APIServerDown
        expr: up{job="kubernetes-apiservers"} == 0
        for: 1m
        labels:
          severity: high
        annotations:
          summary: APIServer {{ $labels.instance }} is down, investigate immediately!
      - alert: PodDown
        expr: up{job="kubernetes-pods"} == 0
        for: 1m
        labels:
          severity: high
        annotations:
          summary: Pod {{ $labels.kubernetes_namespace }}/{{ $labels.kubernetes_pod_name }} is down, investigate immediately!

  - name: pvc
    rules:
      - alert: VolumeRequestThresholdExceeded
        expr: (kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes) > 0.9
        for: 1m
        labels:
          severity: high
        annotations:
          summary: Volume {{ $labels.persistentvolumeclaim }} in namespace {{ $labels.namespace }} on node {{ $labels.kubernetes_io_hostname }} exceeded threshold capacity of 90%
      - alert: UnboundedPV
        expr: kube_persistentvolume_status_phase{phase!="Bound"} == 1
        for: 1d
        labels:
          severity: high
        annotations:
          summary: PV {{ $labels.persistentvolume }} has been in phase {{ $labels.phase }} for more than 1 day.
      - alert: UnboundedPVC
        expr: kube_persistentvolumeclaim_status_phase{phase!="Bound"} == 1
        for: 5m
        labels:
          severity: high
        annotations:
          summary: PVC {{ $labels.persistentvolumeclaim }} in namespace {{ $labels.namespace }} is currently in phase {{ $labels.phase }}.

  - name: replex
    rules:
      - alert: ServerErrorAlert
        expr: sum by (kubernetes_namespace, kubernetes_pod_name) (changes(server_http_request_duration_seconds_count{job="kubernetes-pods",status_code=~"5.*"}[1m])) > 0
        for: 30s
        labels:
          severity: medium
        annotations:
          summary: 5xx errors on {{ $labels.kubernetes_namespace }}/{{ $labels.kubernetes_pod_name }} for url {{ $labels.url }} exceeded threshold of 1 request in 1 minute
      - alert: PushGatewayError
        expr: sum by (kubernetes_namespace, kubernetes_pod_name) (changes(pushgateway_push_requests_duration_seconds_count{job="kubernetes-pods",status=~"5.*"}[30m])) > 0
        for: 30s
        labels:
          severity: medium
        annotations:
          summary: 5xx errors on pod {{ $labels.kubernetes_namespace }}/{{ $labels.kubernetes_pod_name }} exceeded threshold of 1 request within 30 minutes
      - alert: PricingAPIError
        expr: sum by (kubernetes_namespace, kubernetes_pod_name) (changes(pricingapi_http_request_duration_seconds_count{job="kubernetes-pods",status_code=~"5.*"}[1m])) > 0
        for: 30s
        labels:
          severity: medium
        annotations:
          summary: 5xx errors on pod {{ $labels.kubernetes_namespace }}/{{ $labels.kubernetes_pod_name }} exceeded threshold of 1 request in 1 minute
      - alert: AggregatorErrors
        expr: changes(aggregator_aggregation_duration_seconds_count{job="kubernetes-pods",status="0"}[15m]) > 0
        for: 15m
        labels:
          severity: medium
        annotations:
          summary: Failed aggregation on pod {{ $labels.kubernetes_namespace }}/{{ $labels.kubernetes_pod_name }} exceeded threshold of 1 request in 15 minutes

  - name: database
    rules:
      - alert: ReplicationStopped
        expr: pg_repl_stream_active{job="kubernetes-pods"} == 0
        for: 10s
        labels:
          severity: high
        annotations:
          summary: Replication slot {{ $labels.slot_name }} for server {{ $labels.server }} is no longer active
      - alert: PushGateWayDatabaseUnavailable
        expr: pushgateway_database_status{job="kubernetes-pods"} == 0
        for: 1m
        labels:
          severity: high
        annotations:
          summary: Database connection for pod {{ $labels.kubernetes_namespace }}/{{ $labels.kubernetes_pod_name }} is no longer active
          description: Database connection for the pod {{ $labels.kubernetes_namespace }}/{{ $labels.kubernetes_pod_name }} is no longer active, consider restarting the pod
      - alert: AggregatorDatabaseUnavailable
        expr: aggregator_database_status{job="kubernetes-pods"} == 0
        for: 1m
        labels:
          severity: high
        annotations:
          summary: Database connection for pod {{ $labels.kubernetes_namespace }}/{{ $labels.kubernetes_pod_name }} is no longer active
          description: Database connection for the pod {{ $labels.kubernetes_namespace }}/{{ $labels.kubernetes_pod_name }} is no longer active, consider restarting the pod
      - alert: FailedAggregationEvent
        expr: changes(aggregator_query_duration_seconds_count{job="kubernetes-pods",status="0"}[15m]) > 0
        for: 1m
        labels:
          severity: high
        annotations:
          summary: "{{ $labels.aggregation_type }} cron job failed on pod {{ $labels.kubernetes_namespace }}/{{ $labels.kubernetes_pod_name }}"
          description: A failed aggregation event has occurred on {{ $labels.aggregation_type }} on pod {{ $labels.kubernetes_namespace }}/{{ $labels.kubernetes_pod_name }}
      - alert: HighNumberOfConnections
        expr: pg_stat_database_numbackends{datname="postgres"} > 30
        for: 1m
        labels:
          severity: high
        annotations:
          summary: "More than 30 connections to database {{ $labels.datname }} on server {{ $labels.server }}"
          description: "More than 30 connections to database {{ $labels.datname }} on server {{ $labels.server }}"
After copying the template above into your Prometheus rules file (modifying it if necessary), check the Alerts page of your Prometheus dashboard to verify that the alerts are registered.
Restarting your Prometheus instance is required after editing the rules.
Snapshot of correctly configured Prometheus alerts

Finishing with Grafana

Once Prometheus is configured, you can proceed to install the Grafana charts.
The charts are hosted and maintained publicly on Grafana.
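The charts query Prometheus through a Grafana data source. If you provision data sources from files, a minimal sketch might look like the following; the file path and in-cluster URL are assumptions, so point them at your own Prometheus service:
# Illustrative sketch: /etc/grafana/provisioning/datasources/prometheus.yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus.<namespace>.svc.cluster.local:9090  # your Prometheus service
    isDefault: true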
Provided the metrics from the Replex components are exposed properly, you should see dashboards similar to these:
Request Metrics Dashboard
Database Metrics Dashboard