Atlassian OpsGenie – Helping DevOps Teams Stay in Control (Part 2)

In Part One, we took a high-level view of one of the most common challenges facing DevOps teams: “alert noise.”

Basically, with all the inputs coming from the IT stack (which is constantly growing in size and complexity), and an added layer of monitoring solutions that may or may not be coordinated, many teams are barraged by an incessant flood of alerts. These alerts often lack context or severity information, making them difficult to prioritize. They also tend to go to either too many people or too few. Either way, they’re not guaranteed to reach the right person at the right time, which is the key to effective alert management and efficient resolution of service disruptions.

In Part Two, we’re going to dive deeper into the Atlassian OpsGenie application and dig into how it addresses these common issues to offer a true solution to “alert noise” and other DevOps challenges.

The core components of the OpsGenie solution

To begin with, let’s take a quick look at the core components that make up OpsGenie:
  • Users
  • Teams Configuration
  • On-call Schedules
  • Escalation Policies
  • Routing Rules
  • Integrations
  • Services
  • Alerts
  • Reporting

Users, of course, are the individuals who make up the larger DevOps team. They can also be individuals outside DevOps who need to be aware of certain issues. Within OpsGenie, users are managed via unique profiles, granular permissions, and a range of personal settings that make the tool work for each team member, rather than the other way around.

Importantly, each user can control, within their personal profile, how they receive alerts, which types of alerts they receive, and what they can do with them once they arrive.

Teams are made up of individual users. The teams you configure within OpsGenie will generally mirror your organizational teams, but you’ll want to give some forethought to any exceptions or adjustments that make sense from an alert management standpoint. That’s because on-call schedules, routing rules, and escalation policies are all managed at the team level.

Solution: Getting an alert to the right person at the right time

As you might expect, an on-call schedule reflects which team members (users) are responsible for handling alert responses at a given time. An escalation policy, in turn, sets rules around where the alert should be sent first, then if and when it needs to be sent elsewhere based on the response to the initial alert. Finally, a routing rule tells the system which escalation policies and/or on-call schedules to follow based on what sort of alert is coming in, what day and time it is, and so on.
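To make the relationship between these three pieces concrete, here is a minimal sketch in Python. It is not OpsGenie’s implementation or API; the names `Shift`, `EscalationStep`, `RoutingRule`, and `route`, along with the sample users and the ten-minute escalation delay, are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, time
from typing import Callable, List, Optional


@dataclass
class Shift:
    """One entry in a hypothetical on-call schedule."""
    user: str
    start: time
    end: time


def on_call(shifts: List[Shift], now: datetime) -> Optional[str]:
    """Return the user whose shift covers `now`, if any."""
    for shift in shifts:
        if shift.start <= now.time() < shift.end:
            return shift.user
    return None


@dataclass
class EscalationStep:
    """Notify `notify` once the alert has gone unacknowledged this long."""
    after_minutes: int
    notify: str  # "on_call" or an explicit user name


@dataclass
class RoutingRule:
    """Pick a schedule and escalation steps for alerts that match."""
    matches: Callable[[dict], bool]
    steps: List[EscalationStep]
    shifts: List[Shift]


def route(rules: List[RoutingRule], alert: dict, now: datetime,
          minutes_unacknowledged: int = 0) -> List[str]:
    """Return everyone who should have been notified so far."""
    for rule in rules:
        if not rule.matches(alert):
            continue
        notified: List[str] = []
        for step in rule.steps:
            if minutes_unacknowledged >= step.after_minutes:
                user = (on_call(rule.shifts, now)
                        if step.notify == "on_call" else step.notify)
                if user and user not in notified:
                    notified.append(user)
        return notified
    return []  # no rule matched, so nobody is paged


# Illustrative data only: two half-day shifts and one routing rule.
SHIFTS = [
    Shift("alice", time(0, 0), time(12, 0)),
    Shift("bob", time(12, 0), time.max),
]
RULES = [
    RoutingRule(
        matches=lambda alert: alert.get("source") == "database",
        steps=[EscalationStep(0, "on_call"), EscalationStep(10, "team-lead")],
        shifts=SHIFTS,
    )
]
```

With this data, a database alert at 09:30 goes to `alice` first and, if nobody acknowledges it within ten minutes, to `team-lead` as well. The same alert at 14:00 starts with `bob` instead.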

Once these rules are set, alert delivery is fully automated, which addresses the problem of getting each alert to the right person at the right time. But what about the problem of alerts lacking context and prioritization? That’s where OpsGenie’s integrations and services come in.

Solution: Adding context to alerts

OpsGenie comes with over 200 pre-loaded integrations with numerous input and output applications, and more are being added regularly. There’s also an API, so you can create your own integrations relatively easily. Basically, an integration means OpsGenie can talk to the other application and can either receive data from it or feed data to it. In many cases, the data flow goes both ways.
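As a rough idea of what a custom integration might send, the sketch below builds (but does not send) a request to create an alert with a priority and some contextual details attached. The endpoint URL, the `GenieKey` authorization scheme, and the `message`, `priority`, `details`, and `tags` fields reflect our understanding of the OpsGenie Alert API and should be checked against the current documentation. The helper name `build_alert_request`, the placeholder API key, and the sample values are hypothetical.

```python
import json
import urllib.request

# Assumed endpoint for the Alert API; verify against the official docs.
OPSGENIE_ALERTS_URL = "https://api.opsgenie.com/v2/alerts"
VALID_PRIORITIES = {"P1", "P2", "P3", "P4", "P5"}


def build_alert_request(api_key, message, priority="P3",
                        details=None, tags=None):
    """Build a POST request that would create an alert with context."""
    if priority not in VALID_PRIORITIES:
        raise ValueError(f"priority must be one of {sorted(VALID_PRIORITIES)}")
    payload = {
        "message": message,
        "priority": priority,
        "details": details or {},  # free-form key/value context
        "tags": tags or [],
    }
    return urllib.request.Request(
        OPSGENIE_ALERTS_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"GenieKey {api_key}",
        },
        method="POST",
    )


# Illustrative values only; nothing is sent over the network here.
request = build_alert_request(
    "YOUR_API_KEY",
    "Checkout latency above 2s",
    priority="P2",
    details={"service": "checkout", "region": "eu-west-1"},
    tags=["latency"],
)
# Passing `request` to urllib.request.urlopen would actually send it.
```

The point of the `details` and `priority` fields is that whoever receives the alert sees which service is affected and how urgent it is, rather than a bare message.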

For every integration you choose to activate, you’re provided with a host of possible actions OpsGenie can take with the data it receives and/or hands out. And, importantly, it’s in these various actions that you gain the power to add context to every alert you receive.

Solution: Prioritizing alerts and building a repeatable system

OpsGenie Services are a higher-level form of automation you can put in place once you’ve worked with the system for some time and have established a successful combination of integration actions and alert management rules. Services tie all these together:

  • a specific input (or input type)
  • a protocol the system will follow
  • a list of users and their predefined responsibilities in relation to that input
  • a list of stakeholders who have no immediate responsibility, but should be aware of the situation/outcome
  • a selection of reports to be generated and distributed upon completion

So, for example, if a given alert has proven to be the harbinger of a high-priority incident the last ten times you saw it, it makes sense to create a Service around that alert. Automate the protocol, notifications, and reporting. Then, as your list of established Services grows, the number of manually triaged issues flowing in gets smaller and smaller.

Which brings us to the bottom line: OpsGenie addresses the biggest alert management issues facing today’s DevOps teams. Think back to the last downtime snafu you had to scramble to resolve. Now, think about how much faster and easier it would have been with OpsGenie on board to route, contextualize, and prioritize every alert for your team.

Maxwell Traers
Technical Content Contributor, Cprime