In this post, you'll learn what distributed tracing is, how it works, how it differs from logging, its benefits and challenges, and the type of tools that can help you accomplish this practice.
In recent years, surveys by organizations such as IBM and O’Reilly have revealed that a growing number of enterprises are adopting microservice-based architectures for their software applications.
The surge in adoption shows how strongly companies believe that this architectural pattern’s benefits outweigh its challenges. However, it’s still important to recognize the implications of the challenges that come with adopting it.
One of the glaring challenges in this pattern is the complexity around service-to-service communication. This is especially difficult with the distributed nature of microservice solutions.
Traditionally, software teams could easily navigate how requests traversed monolithic applications. By design, monolithic architectures are tightly coupled and meant to function as a single unit. In contrast, microservices are independently deployable units that expose an API of their own.
As a result, transactions to the different services can originate from multiple locations, and each service could easily be one of many stops in a single request.
Nonetheless, software teams should be able to understand and trace the network behavior across the different applications in order to deal with any performance issues, bottlenecks or errors that arise. That’s where distributed tracing comes in. Distributed tracing is a method of tracking distinct network requests in distributed application systems.
What is distributed tracing?
Distributed tracing is an observability technique, closely related to logging, with an acute focus on tracking the flow, activity, and behavior of application network requests. These traces can be end-to-end, in which case the entire path of the network request is captured from initiation to destination.
For example, end-to-end distributed tracing would allow you to track a request from the moment an end-user clicks on something in a frontend application until that request reaches its backend and database service destinations.
Alternatively, distributed tracing can be solely focused on tracking the network flow between backend service requests after inception. The whole idea is to observe requests that are propagated across single or multiple runtime environments in order for teams to connect the dots seamlessly.
How distributed tracing works
To understand how distributed tracing works, the best starting point is to consider a trace. As implied earlier, a trace can be best thought of in terms of application network requests or RPCs (remote procedure calls).
The different services in your software system perform some function in response to these requests or remote procedure calls from other services. For example, one service can be dedicated to performing authentication and authorization while another stores data in a database.
The functions of the services in your system may vary, as do the relationships and interactions between them. Distributed tracing models and conveys the relationships between the services in your distributed system in terms of their RPCs.
To use distributed tracing in your software applications and environments, you need to add instrumentation to your code so that requests can be monitored and tracked. Instrumented code emits the trace data that a tracing platform collects and analyzes.
When a request occurs, a unique trace ID is generated, along with a unique ID for every step in the trace. These IDs are propagated to every service that the request passes through, and data from every step or service interaction is captured and analyzed.
Whatever business logic they may be performing, the services in your system send requests and receive responses. Performing their respective functions takes a certain amount of time. The work they do in this time period is what is referred to as a span or a segment. Spans typically consist of metadata and logs from events.
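The ID generation and propagation described here can be sketched in a few lines. This is a minimal illustration, not how any particular platform implements it. It assumes the IDs travel in an HTTP header laid out like the W3C Trace Context `traceparent` header (version, trace ID, parent span ID, flags); the post itself doesn’t name a propagation format, and the function names `new_trace_id`, `new_span_id`, `inject`, and `extract` are hypothetical.

```python
import secrets


def new_trace_id() -> str:
    """Return a random 16-byte trace ID as 32 hex characters."""
    return secrets.token_hex(16)


def new_span_id() -> str:
    """Return a random 8-byte span ID as 16 hex characters."""
    return secrets.token_hex(8)


def inject(headers: dict, trace_id: str, span_id: str, sampled: bool = True) -> dict:
    """Return a copy of `headers` with a traceparent entry for the calling span."""
    flags = "01" if sampled else "00"
    outgoing = dict(headers)
    outgoing["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"
    return outgoing


def extract(headers: dict):
    """Parse traceparent into (trace_id, parent_span_id, sampled), or None if absent/malformed."""
    value = headers.get("traceparent")
    if value is None:
        return None
    parts = value.split("-")
    if len(parts) != 4 or len(parts[1]) != 32 or len(parts[2]) != 16:
        return None
    try:
        sampled = bool(int(parts[3], 16) & 1)  # lowest bit of the flags marks "sampled"
    except ValueError:
        return None
    return parts[1], parts[2], sampled


# Service A starts a trace and calls service B.
trace_id, span_a = new_trace_id(), new_span_id()
outgoing = inject({"accept": "application/json"}, trace_id, span_a)

# Service B reads the header and continues the same trace with its own span ID.
received_trace_id, parent_span_id, sampled = extract(outgoing)
span_b = new_span_id()
```

Service B keeps the trace ID it received and records service A’s span ID as its parent, which is what lets a tracing backend stitch the two steps into one trace.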
The first span is known as the parent span (often called the root span) in a tracing platform. It covers the whole path of execution for the request. Each span represents an activity like a frontend calling an API Gateway, a microservice calling another microservice, or a microservice making a database query.
Every step after the parent span is recorded as a top-level child span in the tracing journey. In some cases, calling a certain service triggers multiple further requests.
For example, an end-user on a Question and Answer platform might trigger an API call from the frontend to the backend when looking to retrieve information about a specific question. That question might have additional information such as comments, answers, and the respective system users that created them.
These three components, namely comments, answers, and users, could each represent a microservice that gets called after the initial API call. In the context of distributed tracing, these subsequent calls will be captured as nested child spans of the top-level child span (initiated by the frontend API call to fetch details for a specific question).
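The nesting in the Question and Answer example can be modeled as a small tree in which every span records the ID of its parent. The sketch below is illustrative only: the `Span` class, the span IDs, and the service names are made up for this example and do not come from any tracing library.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    span_id: str
    name: str
    parent_id: Optional[str] = None            # None marks the first span of the trace
    children: list = field(default_factory=list)


def build_tree(spans):
    """Link each span to its parent and return the spans that have no parent."""
    by_id = {span.span_id: span for span in spans}
    roots = []
    for span in spans:
        parent = by_id.get(span.parent_id)
        if parent is None:
            roots.append(span)
        else:
            parent.children.append(span)
    return roots


def render(span, depth=0):
    """Return the span and its descendants as indented lines of text."""
    lines = ["  " * depth + span.name]
    for child in span.children:
        lines.extend(render(child, depth + 1))
    return lines


# Hypothetical spans for one "view question" request.
spans = [
    Span("a", "frontend: view question"),
    Span("b", "questions-api: fetch question", parent_id="a"),
    Span("c", "comments-service: list comments", parent_id="b"),
    Span("d", "answers-service: list answers", parent_id="b"),
    Span("e", "users-service: fetch authors", parent_id="b"),
]
roots = build_tree(spans)
tree_lines = render(roots[0])
print("\n".join(tree_lines))
```

The three service calls end up two levels deep, under the API call that triggered them, which mirrors how a tracing platform groups them for display.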
For optimal analysis and visualization, an end-to-end platform like Middleware captures all of these activities (spans), along with additional details such as request-response codes, request duration, latency, faults, errors, and other forms of metadata. All of this is recorded and presented in the form of an automatically generated flame graph.
This format gives a detailed bird’s-eye view of network requests traversing the relevant application environment and enables engineers and operators to easily detect, analyze, and prioritize performance issues or errors revealed by the traces.
Here’s how it looks in Middleware:
Software teams and other stakeholders can view all the relevant telemetry, data and service-to-service behavior collected across their distributed applications in a consolidated place.
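To get a feel for what such a timeline view conveys, here is a rough text approximation that draws one bar per span, offset by when the span started and sized by how long it took. It is a toy rendering with made-up timings, and the `render_waterfall` helper is hypothetical; it is not how Middleware or any other platform draws its graphs.

```python
def render_waterfall(spans, width=40):
    """Return one text row per span: a bar offset by start time, sized by duration.

    `spans` is a list of (name, start_ms, duration_ms) tuples with integer times.
    """
    total_ms = max(start + duration for _, start, duration in spans)
    label_width = max(len(name) for name, _, _ in spans)
    rows = []
    for name, start, duration in spans:
        offset = start * width // total_ms             # integer math keeps bars aligned
        length = max(1, duration * width // total_ms)  # always draw at least one cell
        bar = (" " * offset + "#" * length).ljust(width)
        rows.append(f"{name.ljust(label_width)} |{bar}| {duration} ms")
    return rows


# Illustrative timings for the Question and Answer example (invented numbers).
spans = [
    ("frontend", 0, 200),
    ("questions-api", 10, 180),
    ("comments", 20, 40),
    ("answers", 20, 120),
    ("users", 145, 40),
]
rows = render_waterfall(spans)
print("\n".join(rows))
```

Even in this crude form, the long `answers` bar stands out as the largest contributor to the request’s total time, which is the kind of signal the graphical view is meant to surface.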
What is the difference between distributed tracing and logging?
Both distributed tracing and logging are used to capture information on the activities in our application environments so that we can better understand the low-level details and the context of the behavior under the hood.
This is especially critical when seeking to resolve various errors, faults, and performance issues. However, tracing and logging accomplish this in different ways.
For starters, logging by itself cannot capture the additional context that distributed tracing provides. Logging provides fine-grained insight into system events related to input, processing, and output. Logs are timestamped events, which makes them especially useful for debugging and auditing. They can be emitted at various levels or tiers in your solution.
Logs can be generated from the infrastructure, network, and application layers and will capture a specific event that occurred in your system at a certain point in time. For example, in the context of Kubernetes, you can capture log events for occurrences in your cluster, nodes, and containers.
In contrast, distributed tracing follows the full path of a single request. It does make use of logging by recording events that happen along the path of the request being traced. Distributed tracing provides more context and simplifies the process of analysis and troubleshooting by narrowing down the search scope when issues and errors occur.
In the context of complex distributed systems like microservice architectures, generating logs is essential for the respective components that make up the system. Furthermore, as mentioned above, they provide insight into the various timestamped events and can prove very useful when troubleshooting.
However, in conjunction with this, software teams should adopt distributed tracing platforms to capture the wider context of their software solutions. Both should be part of your observability strategy in order to run reliable distributed software systems.
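One common way to use the two together is to stamp every log line with the ID of the trace it belongs to, so that a log entry can be looked up from a trace and vice versa. The sketch below uses only Python’s standard `logging` and `contextvars` modules. The `current_trace_id` variable, the `TraceIdFilter` class, the logger name, and the sample trace ID are assumptions made for this example.

```python
import contextvars
import io
import logging

# Holds the trace ID of the request currently being handled ("-" when there is none).
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Attach the active trace ID to every log record passing through the handler."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True


# An in-memory stream keeps the example self-contained; a real service
# would typically log to stdout or a file instead.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s"))
handler.addFilter(TraceIdFilter())

logger = logging.getLogger("answers-service")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

# While a request is being handled, its trace ID is set in the context.
token = current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
logger.info("stored answer")
current_trace_id.reset(token)

# Outside of any request, log lines carry the placeholder instead.
logger.info("idle")

print(stream.getvalue(), end="")
```

With the trace ID on each line, searching the logs for a single ID returns exactly the events that occurred along that request’s path.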
Benefits of distributed tracing
We’ve already covered the main benefits of distributed tracing to some degree. However, in this section, we’ll delve a little bit deeper into more of these benefits.
1. Enhance Team Productivity
One of the main challenges software teams face when developing microservice environments is budgeting time between troubleshooting errors and enhancing applications. Microservice environments are complex systems with several components that have different relationships with each other.
Without any observability, each service and its interactions with other services simply make the system as a whole harder to understand. Distributed tracing is an important piece of observability. It gives software teams detailed, consolidated visibility so they can quickly trace errors when they occur, with enough context to expedite remediation.
As a result, there’s less time dedicated to manually traversing and analyzing the system in an attempt to discover the root causes. Tracing contributes towards the optimization of discovering and remediating software incidents, errors, and performance issues. In turn, software teams can be more productive and dedicate more time to enhancing their applications.
2. Improve Application Health
An application environment that is clear and transparent to the relevant engineering teams is in a position to be very healthy because the visibility will serve as a launchpad to quickly and accurately resolve system incidents.
In contrast, application environments that are both complex and obscure are more likely to be very unhealthy because the lack of clarity acts as a hindrance to quickly resolving system issues.
Over time, the incidents and issues contribute to a bigger backlog of undetected bugs, as well as technical debt from teams taking shortcuts to progress through a long list of issues.
3. Support Environment Heterogeneity & Flexibility
Distributed tracing platforms like Middleware are agnostic to the programming languages and frameworks used in the microservices and the underlying runtime environments. This means they support a technologically heterogeneous environment comprising diverse languages, frameworks, and runtimes.
For example, a request may start from an end-user with an Android-native mobile client, which will then pass through an Amazon API Gateway, followed by a Java-based GraphQL API reaching out to multiple other services in different cloud environments running other languages and frameworks.
The trace from start to finish and the activity in between each upstream service will be captured without disruption despite the technical differences. This means software teams using distributed tracing platforms can remain flexible in choosing the technologies best suited for the functions of a specific service and not be locked into a particular language or framework because of the tracing platform.
4. Bring Understanding to Service Relationships
As mentioned in a previous point, distributed tracing clarifies how system components behave in relation to one another. When organizations adopt microservices, they typically structure their teams in one of two ways with regard to the services:
- Strong ownership – Each team is responsible for a single service
- Collective ownership – All teams have a shared responsibility over the different services in the software system
In a strong ownership model, teams are more likely to struggle to understand how all the pieces fit together in the bigger picture. This is less likely to happen in collective ownership because team members have the same level of access and contribution to each service.
Distributed tracing platforms can be especially helpful in companies that follow a strong ownership approach to their services so that teams can understand the activity flow for each service they don’t have ownership over.
As for collective ownership, distributed tracing supports this model as part of a general observability practice, and it becomes more valuable as the distributed system grows and scales.
5. Support Compliance with Service Level Agreements (SLAs)
Many organizations have to uphold service level agreements (SLAs) with both internal and external customers or end-users to meet defined performance goals. Using a distributed tracing platform will help consolidate and aggregate data gathered from the different microservices to track performance properly.
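As a small illustration of tracking performance against a target, the sketch below takes request durations (as they might be aggregated from traces) and checks a 95th-percentile latency goal. The numbers, the 300 ms target, and the helper names `percentile` and `meets_latency_target` are invented for this example; real SLAs define their own metrics and thresholds.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_latency_target(durations_ms, target_ms, pct=95):
    """Return True when the chosen percentile of request durations is within the target."""
    return percentile(durations_ms, pct) <= target_ms


# Hypothetical end-to-end durations (in ms) taken from ten traces.
durations_ms = [120, 95, 110, 105, 480, 101, 99, 130, 115, 108]

p50 = percentile(durations_ms, 50)
p95 = percentile(durations_ms, 95)
within_target = meets_latency_target(durations_ms, target_ms=300)
print(f"p50={p50} ms, p95={p95} ms, within 300 ms target: {within_target}")
```

The median looks healthy here, but the single slow request pushes the 95th percentile past the target, which is why percentile-based goals are usually tracked rather than averages.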
Challenges associated with distributed tracing
Now that you know the benefits of distributed tracing, it’s important to understand the challenges that come with it:
1. Manual Instrumentation
A number of distributed tracing systems require you to apply changes to the source code of your applications in order to trace requests. This manual approach can easily introduce errors and adds to maintenance overhead. In addition, if your services are technically diverse, you need to apply the code changes separately for each language or framework.
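To make the maintenance cost concrete, here is a sketch of what hand-written instrumentation can look like in Python. Everything in it is hypothetical: the `traced` decorator, the `RECORDED_SPANS` list (a stand-in for an exporter that would send spans to a tracing backend), and the two example functions.

```python
import functools
import time

RECORDED_SPANS = []  # stand-in for an exporter that would ship spans to a backend


def traced(name):
    """Wrap a function so that each call is recorded as a span with a status and duration."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            status = "ok"
            try:
                return func(*args, **kwargs)
            except Exception:
                status = "error"
                raise
            finally:
                RECORDED_SPANS.append({
                    "name": name,
                    "status": status,
                    "duration_ms": (time.perf_counter() - start) * 1000.0,
                })
        return wrapper
    return decorator


# Every function that should appear in a trace needs its own annotation.
@traced("answers.count_words")
def count_words(text):
    return len(text.split())


@traced("answers.parse_score")
def parse_score(raw):
    return int(raw)


word_count = count_words("distributed tracing in practice")
try:
    parse_score("not-a-number")
except ValueError:
    pass  # the failed call is still recorded, with status "error"
```

Each traced function has to be annotated by hand, and a service written in another language needs its own equivalent of `traced`, which is where the overhead described above comes from.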
2. Head-based Sampling
In head-based sampling, the decision to keep or drop a trace is made when the trace is first initiated, before anything is known about how the request will turn out. As a result, organizations can sometimes fail to capture the important or highly valuable information that they want, such as a rare error. The opposite of this is tail-based sampling. In that approach, the decision is made after the trace has completed, so it can take the complete trace information into account, including additional attributes such as a region or customer details.
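The difference between the two approaches shows up clearly in code. In this simplified sketch, the head-based decision sees only the trace ID, while the tail-based decision sees the finished spans. The function names, the span dictionaries, the 1% rate, and the 500 ms threshold are all assumptions made for illustration.

```python
def head_sample(trace_id: str, rate: float) -> bool:
    """Decide at trace start, using only the trace ID.

    Mapping the ID to a number in [0, 1) means every service that sees
    the same ID reaches the same keep-or-drop decision.
    """
    bucket = int(trace_id[-8:], 16) / 0x100000000
    return bucket < rate


def tail_sample(spans, latency_threshold_ms: float) -> bool:
    """Decide after the trace completes, when errors and total latency are known."""
    has_error = any(span.get("error") for span in spans)
    total_ms = max(s["end_ms"] for s in spans) - min(s["start_ms"] for s in spans)
    return has_error or total_ms > latency_threshold_ms


# A hypothetical trace that is fast but contains a failing span.
trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
spans = [
    {"name": "questions-api", "start_ms": 0, "end_ms": 40},
    {"name": "answers-service", "start_ms": 5, "end_ms": 30, "error": True},
]

kept_by_head = head_sample(trace_id, rate=0.01)              # decided before the error happened
kept_by_tail = tail_sample(spans, latency_threshold_ms=500)  # decided with the error in view
print(f"head-based keeps it: {kept_by_head}, tail-based keeps it: {kept_by_tail}")
```

With a 1% head-based rate this particular trace is dropped even though it contains an error, while the tail-based rule keeps it. The trade-off is that tail-based sampling has to hold every span until the trace finishes before it can decide.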
3. Application Coverage Limitations
As pointed out in the section on how distributed tracing works, a tracing ID is generated and propagated throughout the flow or path of the request. This helps maintain the unique thread of the trace.
However, unless you are using an end-to-end distributed tracing system like Middleware, you can only trace the backend flow of your applications. This can introduce challenges contrary to the point of tracing because software teams may not know whether errors have their root cause in the frontend (which isn’t covered by some tracing platforms) or the backend.
Open-source distributed tracing standards
With the growing need to support the main areas of observability, namely metrics, traces, and logs, there’s an increasing number of open-source approaches to distributed tracing. These include OpenTracing and OpenCensus, two earlier projects that have since merged to form OpenTelemetry.
Examples of distributed tracing tools
Below is a short list of distributed tracing tools that you can use in your microservice environments:
Middleware is a unified observability platform with rich features that cover end-to-end distributed tracing. This means you can observe and track requests as they move through your distributed systems, with full visibility from frontend to backend.
Middleware automatically begins collecting, analyzing, and contextualizing trace data from the user’s first interaction.
While most distributed tracing tools rely on code-level instrumentation (for example, with OpenTelemetry), Middleware uses an eBPF-based approach for distributed tracing.
Jaeger is an open-source Cloud Native Computing Foundation (CNCF) project originally developed and released by Uber Technologies. It is mainly used to monitor, trace, and troubleshoot microservices-based distributed application environments.
Jaeger helps DevOps find the root cause behind an error and also helps with performance and latency optimization.
Zipkin is an open-source distributed tracing platform used to troubleshoot issues with latency in your software services. Zipkin was developed in Java and released by Twitter.
Zipkin helps gather the timing data needed to troubleshoot latency problems. Its key feature is that it handles both the collection and the lookup of this timing data.
Why your business needs distributed tracing
Organizations that follow modern approaches to architecting software solutions can benefit significantly from the detailed visibility and insight that distributed tracing offers.
Businesses can improve the overall health of their applications, raise team productivity, support heterogeneous and flexible application environments, meet their SLAs, and view all their service behavior in one consolidated place.
How Middleware can help
Middleware’s distributed tracing platform can help your business optimize its observability strategy and accomplish system reliability. With Middleware, you’ll be able to follow the network traffic flow and data across your application environments for the entire journey.
In addition, with this platform your team will be able to scale application environments with complete visibility into service-to-service communication and to find and resolve errors faster.
Sign up on the platform to see it in action!