DevOps and the Challenge of ‘Alert Noise’
DevOps represents a set of ideas and practices designed to increase responsiveness, business value, and quality of delivered service. Although, as the term implies, DevOps calls for tighter, more harmonious collaboration and flow between Development and Operations, much of the focus in DevOps since its inception has been centered on the development side. But as organizations continue to transform into digital businesses, they realize they need to understand IT operations in order to react effectively to customer experience issues.
Adopting a true DevOps culture is chock-full of challenges, such as shifting away from legacy infrastructure to a more microservices-centric approach, integrating tools, managing priorities, provisioning environments, maintaining traceability, and more. To successfully address these challenges, organizations need an approach in which Dev teams and Ops teams work together, using the same artifacts. This can drastically change the way an organization is able to move forward with each new release. For DevOps teams, proactively monitoring to the “left” of production — that is, in pre-production environments — can be a revolutionary element in achieving these goals, helping teams amplify feedback loops, move quickly, iterate confidently, and automate.
Responding to service disruptions has become one of the biggest challenges facing IT organizations today. With the rise of “always on” services and customer expectations, urgent responses to IT problems have become highly critical for all kinds of organizations. This need has resulted in changes to the structure of incident response teams and the DevOps movement itself.
Organizations need to plan and prepare for incidents while reducing ‘alert noise’, so that they never miss a critical alert, always notify the right people, and gain sufficient insight to improve operational efficiency.
A Sad Tale: No Monitoring? Big Problems
There once was a team that had virtually no monitoring in place. A new player came on board and said: “We cannot continue to operate like this! We need monitoring!” So, the team started looking and found tools like PRTG and Pingdom, which was great. And they started throwing dashboards on wall screens for these different tools so they could get a kind of “command central” view. That was surely an improvement!
But one of the issues that came up was that people started getting alerts during business hours because they were not necessarily able to tune things out, or tuning was difficult because alert settings were scattered across many different tools. And the team was getting alerts after hours, which again, was better than getting no alerts and having systems be down when people got into the office. But alas! Since the incoming alerts were lacking in context, the teams had insufficient information to ascertain the severity or impact of these alerts, nor were they able to triage and handle them properly.
A typical scenario would be for someone to wake up at 5:30 am and wow, there are all these texts and something is broken. But was it fixed? And who fixed it? Again, this was better than not knowing, but it definitely introduced a lot of chaos. And then, the new player moved to a different position in the organization and needed to stop getting alerts — but there was no straightforward process for detaching himself from everything.
Two months later, he got another alert! Where did that alert come from? Who did he have to reach out to in order to be taken off this list? And the flip side of that was, what about when someone leaves the company and no one realizes that they’re the only person getting the alert. Or worse, multiple people leave the company and no one gets ANY alerts!
The moral of the story is that, by now, we are all familiar with infamous and damaging data breaches in which early warning signals were detected and alerts generated, but distributed to people who were no longer employed by the organization. In some of these cases, a simple patch could have prevented the breach, helping the organization avoid massive financial and reputational damage.
Calling on Atlassian Opsgenie as a DevOps Solution
In an ideal world, your DevOps toolchain would be highly automated for incident management and allow your teams to resolve issues at DevOps speed. An alert triggered by monitoring tools like Datadog or AWS Cloudwatch would notify on-call engineers, kick your collaboration tools into gear, and automatically document the issue in ITSM and ticketing tools. The tools themselves support this type of automation, but without the right control in place your DevOps toolchain will just be flooded with noise.
The solution is to automate alert categorization before notifying responders and the rest of the DevOps toolchain.
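As a sketch of what automated categorization can look like, here is a minimal, hypothetical example in Python. The priority labels, keyword lists, and alert payload shape below are illustrative assumptions, not Opsgenie's actual API or rule syntax:

```python
# Illustrative sketch: categorize an alert by message keywords before
# deciding whether to page a human. Keyword lists and priority labels
# are hypothetical and would be tuned per organization.

PRIORITY_KEYWORDS = {
    "P1": ["outage", "down", "data loss"],          # page immediately
    "P2": ["error rate", "latency", "failed"],      # page on-call
    "P5": ["disk usage", "certificate expires"],    # informational
}

def categorize(alert: dict) -> str:
    """Return a priority tag for an alert based on message keywords."""
    message = alert.get("message", "").lower()
    for priority, keywords in PRIORITY_KEYWORDS.items():
        if any(kw in message for kw in keywords):
            return priority
    return "P3"  # default: investigate during business hours

def should_notify(alert: dict) -> bool:
    """Only notify responders for high-priority alerts; queue the rest."""
    return categorize(alert) in ("P1", "P2")
```

The point of a filter like this is that low-priority alerts are still recorded and queryable, but they never wake anyone up — the DevOps toolchain only escalates what genuinely needs a responder.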
In a typical IT environment, inputs arrive from a large number of different systems on the left. Atlassian Opsgenie sits in the middle and connects all of those inputs to your people. Each person may prefer a different method of communication, and Opsgenie can tailor notifications to your team’s individual communication needs.
In addition, in a typical IT environment, we have the IT Dev stack on the left, which is constantly increasing in size and complexity. On top of that, we add monitoring tools, which hook into those systems and send out lots of notification emails to the teams. That is the scenario many teams find themselves in. The ideal situation adds the alert management piece: with Opsgenie in the middle, everything comes into one place. The highly flexible Opsgenie rules engine allows you to tag, route, escalate, and prioritize alerts based on the source, the alert keywords and contents, the time of day, and a number of other characteristics — and then notify the operations teams accordingly.
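To make the routing idea concrete, here is a hedged Python sketch of the kind of logic such a rules engine expresses — routing by source, keywords, and time of day. The team names, sources, and conditions are hypothetical, and real Opsgenie rules are configured in its UI or API rather than written as code like this:

```python
# Illustrative sketch of alert routing by source, message keywords, and
# time of day. Team names and conditions are hypothetical examples.

from datetime import time

def route(alert: dict, now: time) -> str:
    """Pick a responder team for an alert."""
    source = alert.get("source", "")
    message = alert.get("message", "").lower()

    # Route by source plus keyword: database alerts from Datadog go to DBAs.
    if source == "Datadog" and "database" in message:
        return "dba-oncall"
    # Route by keyword alone: payment issues always page the payments team.
    if "payment" in message:
        return "payments-oncall"
    # Route the rest by time of day.
    if time(9) <= now <= time(17):
        return "ops-business-hours"
    return "ops-after-hours"
```

The same alert can thus reach different responders depending on when it fires and what it says, which is exactly what keeps every input from becoming a 3 am page.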
One of the byproducts of these actions is the generation and collection of data, context, documentation, and reporting at the tail end. This treasure trove of information can be used to generate important business and operational insights, best practices, and learning, ensuring constant improvement and success.
In this blog post we covered some of the formidable challenges faced by teams adopting DevOps. We highlighted the criticality of staying aware of issues while delivering alerts to the people who can take action. In part 2 we’ll describe how Opsgenie addresses these challenges.