Why automate packet analysis for performance troubleshooting?
There are many ways to analyze network traffic. Flow records exported by network devices reflect traffic volumes; packet analysis is the other main approach.
When it comes to performance troubleshooting and application transaction visibility, packet data is your only option for understanding both the scope and the root causes of a performance degradation.
The main constraints on packet analysis are:
- the time required to manually analyze packet traces,
- the skills required,
- the need to locate, with sufficient precision, which packets are worth analyzing. This means you must already know:
  - which application is affected,
  - which client is affected,
  - precisely when the degradation happened,
  - what the normal response time is for that application.
If you cannot answer these questions, you will most likely waste your time analyzing the packets.
Automating packet analysis (i.e. implementing Wire Data analysis to produce your performance analytics) is an alternative.
1. Traffic volume
The size of networks, the bandwidth used by each application, and the number of applications all grow continuously. This is the first challenge when you want to leverage network traffic to diagnose performance issues.
What is the maximum size of capture file you can analyze with a software sniffer?
Most people agree that a file larger than 100 MB cannot be loaded within a reasonable processing time; some would say that even a few MB is already too much.
What does 100MB represent for your network?
- 100 MB = 800 Mb
- 800 Mb / 10 Gbps = 0.08 seconds, or 800 Mb / 1 Gbps = 0.8 seconds
Each time you load a file of this size, you view only a short snapshot of your network activity: less than 1 second of traffic on a 1 Gbps link, and 0.08 seconds on a 10 Gbps link.
Even though this is a short timeframe, if an application transaction represents a data exchange (query and response) of 50 kB, you will still have collected 2,000 transactions to analyze manually.
Obviously, if you cannot tell which of these transactions deserves attention and which normal ones to use as a reference, you will probably not get an answer to your questions.
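The arithmetic above can be sketched in a few lines; the 100 MB file size and 50 kB per transaction are the figures used in this section, and full link utilization is assumed:

```python
# Back-of-the-envelope math: how much traffic time does a capture file
# represent, and how many transactions does it hold?

def capture_window_seconds(file_bytes: int, link_bps: float) -> float:
    """Seconds of fully utilized link traffic held in a capture file."""
    return (file_bytes * 8) / link_bps

FILE_SIZE = 100 * 10**6   # 100 MB capture file
TX_SIZE = 50 * 10**3      # 50 kB per query + response

print(capture_window_seconds(FILE_SIZE, 1e9))    # 1 Gbps  -> 0.8 s
print(capture_window_seconds(FILE_SIZE, 10e9))   # 10 Gbps -> 0.08 s
print(FILE_SIZE // TX_SIZE)                      # 2000 transactions
```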
2. History and retention time
One of the essential characteristics of performance degradations is that they are intermittent, caused either by a congestion phenomenon (on a given system or infrastructure device) or by an application flaw.
One of the key challenges is to determine when the degradation occurred. This requires you:
- to retain the network traffic for a sufficient period of time (each hour of packet capture for a 1 Gbps link requires roughly 500 GB of storage; retaining 24 hours of that raw traffic requires about 12 TB),
- to first define the normal behavior / response time of the application (overall and for a given type of transaction), and then identify when the performance degraded.
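The retention figures above can be checked the same way: 1 Gbps at full utilization works out to 450 GB per hour and 10.8 TB per day, which the round numbers of 500 GB and 12 TB approximate once capture headers and indexing overhead are accounted for.

```python
# Raw capture storage for a given link speed and retention window,
# assuming full link utilization (a worst-case estimate).

def storage_bytes(link_bps: float, hours: float) -> float:
    """Bytes of raw packet data produced over `hours` at full utilization."""
    return (link_bps / 8) * hours * 3600

one_hour_gb = storage_bytes(1e9, 1) / 1e9    # 450 GB/hour on 1 Gbps
one_day_tb = storage_bytes(1e9, 24) / 1e12   # 10.8 TB/day on 1 Gbps
print(round(one_hour_gb), "GB/hour")
print(round(one_day_tb, 1), "TB/day")
```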
3. Performance Overview
To identify which flows are worth analyzing in detail, you need to locate the perimeter of the degradation. This calls for an overview of the performance of a given segment / application, which lets you identify the scope of the degradation:
- For which clients,
- Connecting to which servers,
- For which transactions (all or only some, and which ones).
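As a minimal illustration of such an overview, the sketch below aggregates response times per (client, server) pair so the degraded perimeter stands out. The flow records, addresses, and timings are made up for the example, not taken from any real tool:

```python
# Group pre-computed per-flow response times by (client, server) so an
# anomalous pair is immediately visible against the others.
from collections import defaultdict
from statistics import mean

# (client, server, response_time_ms) -- hypothetical records
flows = [
    ("10.0.0.1", "app-srv", 40), ("10.0.0.1", "app-srv", 45),
    ("10.0.0.2", "app-srv", 900), ("10.0.0.2", "app-srv", 1100),
]

by_pair = defaultdict(list)
for client, server, rt in flows:
    by_pair[(client, server)].append(rt)

for pair, rts in by_pair.items():
    # the 10.0.0.2 pair stands out as degraded
    print(pair, f"avg={mean(rts):.1f} ms")
```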
4. Metric computation
To troubleshoot performance issues and identify the root cause of a degradation, you need a comprehensive set of metrics:
- Network health metrics (latency, packet loss, retransmissions, flow characteristics: QoS settings, path, etc.),
- TCP metrics (Session setup metrics, TCP errors, server response times, query and response transfer times),
- Common services metrics (DNS response times and success rates),
- Application transaction metrics (processing time, query and response PDU transfer, error code, page load times for HTTP).
You certainly want to manipulate data based on these metrics across large traffic volumes, see their evolution over time, and then drill down on the flows that present an anomaly. For that, you cannot rely on manual (or near-manual, after-the-fact) metric computation, which results in long processing times.
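As one concrete example of a metric that can be computed automatically, the sketch below derives a server response time from packet timestamps: the gap between the last packet of the query and the first packet of the response. The packet tuples are synthetic stand-ins for parsed capture data, not the output of any specific capture library:

```python
# Server response time from a list of (timestamp_seconds, direction)
# tuples, where direction is "C" (client -> server) or "S" (server -> client).

def server_response_time(packets):
    """Seconds between the last client packet and the first
    server packet that follows it."""
    last_query = max(t for t, d in packets if d == "C")
    first_resp = min(t for t, d in packets if d == "S" and t > last_query)
    return first_resp - last_query

# Synthetic transaction: query finishes at t=0.012, response starts at t=0.250
pkts = [(0.000, "C"), (0.012, "C"), (0.250, "S"), (0.260, "S")]
print(round(server_response_time(pkts), 3))  # 0.238
```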
5. Application transaction visibility
You cannot efficiently troubleshoot degradations by looking at application performance metrics that apply to all transactions considered globally. To get to the root cause and provide actionable information to fix the issue, you need to be able to define the perimeter of the degradation at the individual transaction level.
This requires you to identify the slow transactions, but also to compare them to baseline information for that specific transaction type: for example, comparing the performance of a specific SQL query to that of similar SQL queries.
Doing this requires that you can:
- Identify similar queries in the large volume of traffic,
- Easily access the metrics for these transactions for comparison.
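A rough sketch of both steps is shown below. The normalization rule (stripping numeric and string literals so structurally identical queries group together) and the 5x outlier threshold are assumptions for the example, not something the text prescribes:

```python
# Group SQL executions by a normalized query shape, then flag executions
# far above that group's baseline (median) execution time.
import re
from collections import defaultdict
from statistics import median

def normalize(sql: str) -> str:
    """Replace literals so structurally identical queries compare equal."""
    return re.sub(r"\d+|'[^']*'", "?", sql.lower())

# Hypothetical (query, execution_seconds) records extracted from traffic
executions = [
    ("SELECT * FROM orders WHERE id = 1", 0.02),
    ("SELECT * FROM orders WHERE id = 7", 0.03),
    ("SELECT * FROM orders WHERE id = 9", 1.50),   # slow outlier
]

groups = defaultdict(list)
for sql, seconds in executions:
    groups[normalize(sql)].append(seconds)

for shape, times in groups.items():
    base = median(times)
    slow = [t for t in times if t > 5 * base]   # arbitrary 5x threshold
    print(shape, "baseline:", base, "outliers:", slow)
```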
Although there is still value in using a packet analyzer to access frame-level details once you know exactly which frames correspond to the defective transactions, you need a new approach to packet analysis.
The essential requirement is automation, in order to:
- cope with traffic volumes,
- automate metric computation,
- provide a performance overview,
- provide a drill-down interface,
- provide application transaction visibility.
If you would like to read a guide on how to approach automation for network analysis, I recommend that you take a look at this guide: "Performance Troubleshooting: 6 reasons to change your approach of network traffic analysis"