Introduction

In our webinar, The Coming Tsunami in Microservices: Operating Microservices at Scale, we discussed the precipice some organizations face as they make significant commitments to microservices. The impact of microservices is felt from development to operations, spanning new architecture, design, and governance as well as new tools and frameworks. Operationally, implementing microservice-based architectures can be “one bridge too far”, or simply put, beyond the current capabilities of your team. This post discusses one important challenge: debugging a failed transaction.
Debugging a Failed Transaction

In the past, with distributed applications, a common approach for operations teams debugging a failed transaction was to RDP or SSH into the various application servers. It was not uncommon to have 10-15 terminal tabs open, connected to the database(s), application server(s), proxy server(s), and web server(s).
If we scale this up to a microservice architecture, we could have 10 containers/pods for one domain, 10 containers/pods for another, a cloud database, a cloud in-memory cache cluster, and so on. When all the services and containers are tallied, there could be tens to hundreds of instances. RDP and SSH will not scale to this problem, so what can we do?
Correlation identifier

To start, it is important that each transaction be uniquely identified. Regardless of which style of architecture you are using, the correlation identifier is the most critical component of distributed debugging. Examples of how an identifier could appear are:
- A message ID in the header of a message on a queue
- A custom identifier in an HTTP header
It is the team's responsibility to generate a unique identifier and to have every technology employed consume and forward it to the next participant. Done consistently, each service's and container's log carries the correlation identifier as a fingerprint.
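As a minimal sketch of this generate-or-forward behavior, assuming Python services and the conventional `X-Correlation-ID` header name (a common convention, not a standard — use whatever header your team agrees on):

```python
import uuid

# Assumed header name; any consistently-used header works.
CORRELATION_HEADER = "X-Correlation-ID"

def ensure_correlation_id(headers: dict) -> str:
    """Reuse an inbound correlation ID, or mint one at the edge of the system."""
    cid = headers.get(CORRELATION_HEADER)
    if cid is None:
        cid = str(uuid.uuid4())
        headers[CORRELATION_HEADER] = cid
    return cid

def forward_headers(inbound: dict) -> dict:
    """Build outbound headers for the next participant, propagating the ID."""
    return {CORRELATION_HEADER: ensure_correlation_id(inbound)}
```

The key property is that only the first participant ever generates an ID; every downstream service copies the inbound value forward unchanged.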
Log management

Once messages are being processed by a microservice architecture, applications, databases, and other tools generate logs. By default, these logs reside on the servers where the solutions run, or within the cloud in which they execute. Tools such as Elasticsearch, CloudWatch, Splunk, and others provide a means to forward those logs for indexing, querying, and display, eliminating much of the manual effort teams previously expended via RDP, SSH, and the like. Again, for this process to work well, applications need to be instrumented to emit relevant events along with the correlation identifier for traceability.
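One common way to make logs indexable by these tools is to emit structured JSON lines that always carry the correlation ID. A sketch using only the Python standard library (the service name and field names here are assumptions, not a prescribed schema):

```python
import json
import logging
import sys

# Hypothetical service name for illustration.
logger = logging.getLogger("orders-service")
logger.setLevel(logging.INFO)
logger.addHandler(logging.StreamHandler(sys.stdout))

def log_event(message: str, correlation_id: str, **fields) -> str:
    """Emit a JSON log line; the correlation_id field lets log tooling
    (Elasticsearch, CloudWatch, Splunk, ...) join events across services."""
    record = {"message": message, "correlation_id": correlation_id, **fields}
    line = json.dumps(record)
    logger.info(line)
    return line
```

Querying for a single failed transaction then becomes a search on `correlation_id` rather than a hunt across hosts.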
Telemetry

Where log management helps you dig deeper into the participating services for a single transaction using the correlation identifier, telemetry provides a visual representation. A telemetry console exposes latency within the various layers of a transaction: the original receipt at the web server or API gateway layer, the individual Lambda functions or containers, and finally the data/storage layer. Within AWS, AWS X-Ray provides this information by collecting data from each layer and composing it as shown below:
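Under the hood, tools like X-Ray record per-layer timing segments tagged with a trace ID. A simplified, framework-free sketch of that idea (this is an illustration of the concept, not the X-Ray SDK API):

```python
import time
from contextlib import contextmanager

# Collected (layer, correlation_id, duration_ms) tuples; a real collector
# would ship these to a backend instead of an in-memory list.
segments = []

@contextmanager
def timed_segment(layer: str, correlation_id: str):
    """Record how long a layer (gateway, function, storage) spent on a transaction."""
    start = time.perf_counter()
    try:
        yield
    finally:
        duration_ms = (time.perf_counter() - start) * 1000.0
        segments.append((layer, correlation_id, duration_ms))
```

Wrapping each layer's work in `with timed_segment("api-gateway", cid): ...` yields the per-layer latency breakdown a telemetry console visualizes.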
X-Ray provides a series of examples demonstrating how this technology can be plugged in and enabled for teams using AWS.
Service map

In addition to understanding latency within the architecture in real time, understanding the participants in a transaction is also extremely important. Operators can take advantage of service maps in X-Ray to see the architecture visually and how its services relate to each other. The following image shows services that have registered with X-Ray and how they interact with one another.
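Conceptually, a service map is just an aggregation of caller-to-callee edges observed in traces. A toy sketch of deriving one (the service names below are hypothetical):

```python
from collections import defaultdict

def build_service_map(calls):
    """Aggregate observed (caller, callee) pairs from traces into an
    adjacency map: {service: sorted list of services it calls}."""
    graph = defaultdict(set)
    for caller, callee in calls:
        graph[caller].add(callee)
    return {svc: sorted(deps) for svc, deps in graph.items()}
```

A visualization layer then renders this adjacency data as the node-and-edge diagram operators see in the console.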
Kubernetes

Many of today's microservice architectures leverage containers and container orchestration for running services in production. If you are using this infrastructure, there are helpful commands to dig a bit deeper at a lower level of abstraction. Using kubectl, operators can get a variety of information. Examples include:
- Examine pod details: kubectl describe pod <pod-name>
- Access logs from a pod: kubectl logs <pod-name>
- Shell into a container: kubectl exec -it <pod-name> -- /bin/sh
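Tying this back to the correlation identifier: once logs have been pulled (for example via `kubectl logs`), a small filter can isolate one transaction's lines across many pods. A sketch, assuming the correlation ID appears verbatim somewhere in each relevant log line:

```python
def trace_across_pods(pod_logs: dict, correlation_id: str) -> dict:
    """Given {pod_name: [log lines]} (e.g. collected with `kubectl logs`),
    return only the lines belonging to one transaction, keyed by pod.
    Pods with no matching lines are omitted."""
    return {
        pod: [line for line in lines if correlation_id in line]
        for pod, lines in pod_logs.items()
        if any(correlation_id in line for line in lines)
    }
```

This is essentially what log management platforms do at scale, but it shows why the correlation identifier is the thread that holds distributed debugging together.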