This is the second article in my series on practicing SRE.
In the previous article, SRE Onboarding, I discussed the concept of SRE and how it could help in the context of my current organisation. This article is a follow-up: a collection of tips I want to share after setting up and fixing our alerting system.
Our current alerting system is based on Prometheus with Alertmanager. As I mentioned in the previous article SRE Onboarding, the alerting system was half-broken:
- low readability: some alerts do not contain the information we need, whereas some alerts contain too much useless information
- low quality: not all alerts require manual intervention
- low freshness: some alerts are outdated, whereas some alerts are not implemented yet
I consider a functional alerting system one of the cornerstones of practicing SRE. Therefore, I decided to make alerting great again (MAGA for short). In this article I will share some thoughts from that journey.
Discuss, trial and error as a team
To make the alerting system great again, the first step is to build a shared understanding of it within the team. There are a lot of questions that need to be answered.
The scope of alerting should be identified: What is the architecture of the system? Which parts are critical for our business and for our customers? What kinds of errors might the system encounter? What are the indicators and their threshold values?
Next, regarding the alerting system: How should we record alerts? How should we send alerts? How should we receive the alerts?
A protocol for alert handling needs to be established: Who will take care of those alerts? Who should take action when an alert is firing? What is the rotation plan? What is the response time and how is it related to our SLA?
For a team like ours that does not have a clearly defined SRE role, these questions will require trial and error to find suitable answers.
Apply CaC and build a CICD pipeline for the alerting system
This helps to improve and maintain the freshness of alerts.
Our alerting system was more or less manually configured and deployed. The configuration was not checked into a version control system. This is error-prone: the configuration tends to become outdated and could easily be lost.
The alerting system we use, Prometheus and Alertmanager, supports Configuration as Code (CaC), so the configuration can and should be checked into the version control system. This in turn enables us to build a Continuous Integration and Continuous Deployment (CICD) pipeline (Jenkins in our case) for the alerting system.
By doing so, we can hardly lose the configuration, and adding a new rule is simply a matter of adding some lines to the configuration file. This encourages the team to keep it up to date.
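Once the rules live in version control, the pipeline can also check them before deployment. Prometheus ships promtool for syntax checks (promtool check rules); on top of that, a team can enforce its own conventions. The following is a minimal Python sketch of such a check, not the one we actually run. It assumes the rule file has already been parsed into a dictionary that follows the Prometheus rule-file layout (groups, rules, alert, expr, labels, annotations). The policy itself (a mandatory severity label and summary annotation) and the function name lint_rule_groups are assumptions made for illustration.

```python
from typing import Dict, List

# Assumed team policy (hypothetical): every alerting rule must carry these.
REQUIRED_LABELS = ("severity",)
REQUIRED_ANNOTATIONS = ("summary",)


def lint_rule_groups(rule_file: Dict) -> List[str]:
    """Return a list of human-readable problems found in a parsed rule file."""
    problems = []
    for group in rule_file.get("groups", []):
        group_name = group.get("name", "<unnamed group>")
        for rule in group.get("rules", []):
            if "alert" not in rule:
                continue  # recording rules are not checked here
            where = f"{group_name}/{rule['alert']}"
            if not rule.get("expr"):
                problems.append(f"{where}: missing expr")
            labels = rule.get("labels") or {}
            for label in REQUIRED_LABELS:
                if label not in labels:
                    problems.append(f"{where}: missing label '{label}'")
            annotations = rule.get("annotations") or {}
            for annotation in REQUIRED_ANNOTATIONS:
                if annotation not in annotations:
                    problems.append(f"{where}: missing annotation '{annotation}'")
    return problems


# Example input: one rule that lacks the assumed mandatory fields.
example = {
    "groups": [
        {
            "name": "jenkins",
            "rules": [{"alert": "JenkinsJobFailed", "expr": "up == 0"}],
        }
    ]
}
for problem in lint_rule_groups(example):
    print(problem)
```

A pipeline stage could fail the build whenever the returned list is not empty, so an incomplete rule never reaches the running Prometheus instance.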
Use a positive list for alerts
This can improve the quality of the alerting system. Too many alerts make people reluctant to check them. Configure as alerts only the conditions that require immediate manual intervention.
To give you a concrete example: we have a Jenkins instance where our CICD pipelines are located, and we want to send alerts for failed jobs. My first approach was to use the default_jenkins_builds_last_build_result metric exposed by the Prometheus Jenkins plugin to alert on all jobs that failed in their last run.
But after discussing it with my team, we found that this might flood our alerting channel with alerts nobody needs to act on (e.g., some pipelines are not maintained, so nobody cares about them). I then wondered whether I should use a negative list to filter out the jobs we are not interested in. After much deliberation, I decided to use a positive list of alerts.
In this way, only the alerts we are interested in are triggered, and since they are listed in the configuration, we know exactly which alerts can fire.
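To make the idea of a positive list concrete, here is a small Python sketch that turns an explicit list of job names into a PromQL expression for an alerting rule. Several things in it are assumptions for illustration and should be checked against what your Jenkins plugin actually exposes: the label name jenkins_job, the reading of the value 0 as "last build failed", and the job names themselves. The helper build_failed_job_expr is hypothetical.

```python
import re

METRIC = "default_jenkins_builds_last_build_result"

# The positive list: only these jobs may ever raise an alert.
# Job names are made up for this example.
ALERTED_JOBS = ["deploy-production", "nightly-integration-tests"]

# Only accept names without regex metacharacters, so that they can be
# joined into a regex matcher without any escaping.
_SAFE_NAME = re.compile(r"[A-Za-z0-9_/-]+")


def build_failed_job_expr(jobs, label="jenkins_job"):
    """Build a PromQL expression that matches failed runs of listed jobs only.

    Assumes (not verified here) that the metric is 0 when the last build
    failed and that jobs are identified by the given label.
    """
    if not jobs:
        raise ValueError("the positive list must not be empty")
    for job in jobs:
        if not _SAFE_NAME.fullmatch(job):
            raise ValueError(f"unsupported characters in job name: {job!r}")
    pattern = "|".join(sorted(set(jobs)))
    return f'{METRIC}{{{label}=~"{pattern}"}} == 0'


print(build_failed_job_expr(ALERTED_JOBS))
```

Because Prometheus anchors regex matchers at both ends, a job that is not on the list cannot match by accident, and adding a job to the alerting scope becomes a one-line, reviewable change.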
Format alerts to contain minimal information
This aims to improve the readability of alerts by making them clear and concise.
Previously, our alerts had more than 10 tags, and a single alert took up the entire screen in Microsoft Teams. (I would avoid MS Teams for a software engineering team because it is not designed for that. BTW, to send alerts from Alertmanager to Microsoft Teams via webhook, you will need an intermediate server such as https://github.com/prometheus-msteams/prometheus-msteams.)
I suggest going as minimal as you can for alerts sent to your instant messaging software (Slack, Teams). In my case, I simplified each alert to include only the name of the alert, a one-sentence summary, and a link to the sender (the Alertmanager instance) where the details can be found.
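As an illustration of how small such a message can be, the following Python sketch reduces an Alertmanager webhook payload to three lines per alert: name, summary, and link. It is not our actual template. The fields it reads (alerts, labels.alertname, annotations.summary, externalURL) are part of the Alertmanager webhook payload, while the function names, the fallback texts, and the sample data are assumptions for this example.

```python
def format_alert(alert: dict, alertmanager_url: str) -> str:
    """Render one alert as three lines: name, summary, link to the details."""
    labels = alert.get("labels") or {}
    annotations = alert.get("annotations") or {}
    name = labels.get("alertname", "UnknownAlert")
    summary = annotations.get("summary", "No summary provided.")
    return f"{name}\n{summary}\nDetails: {alertmanager_url}"


def format_webhook(payload: dict) -> list:
    """Render every alert in an Alertmanager webhook payload."""
    url = payload.get("externalURL", "")
    return [format_alert(alert, url) for alert in payload.get("alerts", [])]


# Sample payload with made-up values; real payloads carry many more fields,
# which are deliberately ignored here.
sample = {
    "externalURL": "http://alertmanager.example.internal:9093",
    "alerts": [
        {
            "labels": {"alertname": "JenkinsJobFailed", "team": "platform"},
            "annotations": {"summary": "Job deploy-production failed."},
        }
    ],
}
for message in format_webhook(sample):
    print(message)
```

Everything else (the remaining labels, timestamps, the generator URL) stays available behind the link, so the chat message only has to tell the reader what fired and where to look.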
I hope these tips help you make your alerting system great.
After all, the alerting system is usually built by SRE and built for SRE, so treat yourself better!
Thank you for reading :)