How to diagnose TCP Connection set up issues
TCP Series #1: What can go wrong in a TCP Handshake?
This is the first article in a series covering what you need to know to troubleshoot performance issues affecting applications that rely on TCP.
Let's have a look at how TCP sessions are established... and what can go wrong!
TCP is a connection-oriented protocol, which means a connection is established and maintained until the application programs at each end have finished exchanging messages. TCP works on top of the Internet Protocol (IP).
TCP provides reliable, ordered, and error-checked delivery. To do so, it relies on mechanisms such as the handshake and on flags carried in each segment header (SYN, ACK, PSH, RST, FIN, and others) that open, maintain, and close the connection without losing data.
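As a rough illustration of the flags mentioned above, here is a minimal Python sketch that decodes the flags byte of a TCP header into flag names. The bit values are the standard ones from the TCP header; the name `decode_flags` is only illustrative and is not part of PV.

```python
from typing import List

# Standard bit positions of the classic TCP flags in the header flags byte.
TCP_FLAGS = {
    0x01: "FIN",
    0x02: "SYN",
    0x04: "RST",
    0x08: "PSH",
    0x10: "ACK",
    0x20: "URG",
}


def decode_flags(flags_byte: int) -> List[str]:
    """Return the names of the flags set in a TCP flags byte (illustrative helper)."""
    return [name for bit, name in TCP_FLAGS.items() if flags_byte & bit]


if __name__ == "__main__":
    print(decode_flags(0x02))  # first packet of a handshake
    print(decode_flags(0x12))  # server reply to that packet
```

With this helper, a packet whose flags byte is 0x12 decodes to SYN and ACK, which is the second packet of a handshake.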
Many application protocols, such as HTTP, run on top of TCP, so it is important to know how to diagnose TCP issues. In this series of articles, we will explain this TCP metadata, why it matters for performance troubleshooting, and how to measure it easily with PerformanceVision (PV).
How does a session Start? TCP Handshake & Connection Time
A TCP connection is set up with a 3-way handshake made of SYN, SYN+ACK and ACK packets. From this handshake, we can extract a performance metric called Connection Time (CT), which summarizes how fast a session can be set up between a client and a server over a network. For details, look at this nice article on Wikipedia.
Fig1 - How TCP handshake is analysed
The three steps of the TCP handshake are:
- The 'SYN' is the first packet sent from a client to a server; it asks the server to open a connection.
- If it can accept, the server responds with a 'SYN+ACK', which means “I received your 'SYN' and I’m OK”.
- Finally, the client sends an 'ACK' to confirm the connection.
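The three steps above can be sketched in Python as a small function that derives a connection time from captured packets. This is a minimal sketch under stated assumptions: the `Packet` record and its fields are hypothetical, the connection time is taken as the delay between the first SYN and the ACK that completes the handshake as seen at the capture point, and sequence numbers are ignored. PV's exact definition of CT may differ.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Packet:
    """Hypothetical capture record; only the fields needed here."""
    timestamp: float  # seconds, as seen at the capture point
    src: str
    dst: str
    syn: bool = False
    ack: bool = False


def connection_time(packets: Iterable[Packet], client: str, server: str) -> Optional[float]:
    """Seconds between the first SYN and the handshake-completing ACK, or None."""
    syn_ts = None
    synack_seen = False
    for p in sorted(packets, key=lambda pkt: pkt.timestamp):
        from_client = p.src == client and p.dst == server
        from_server = p.src == server and p.dst == client
        if from_client and p.syn and not p.ack:
            if syn_ts is None:  # keep the first SYN, ignore retransmissions
                syn_ts = p.timestamp
        elif from_server and p.syn and p.ack and syn_ts is not None:
            synack_seen = True
        elif from_client and p.ack and not p.syn and synack_seen:
            return p.timestamp - syn_ts
    return None  # handshake never completed


if __name__ == "__main__":
    capture = [
        Packet(0.000, "192.0.2.10", "198.51.100.5", syn=True),
        Packet(0.020, "198.51.100.5", "192.0.2.10", syn=True, ack=True),
        Packet(0.021, "192.0.2.10", "198.51.100.5", ack=True),
    ]
    print(connection_time(capture, "192.0.2.10", "198.51.100.5"))
```

A capture that contains only SYN packets returns `None`, which corresponds to the "SYN without connection" case discussed later in this article.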
How to Diagnose TCP connection faults
1 - SYN Without Connections
The first case you can easily diagnose with PV is: “Can my clients connect to my servers?” In the PV menu, go to Application → Clients, then choose the TCP theme and set the filter called “Only Unilateral Flow”. The pattern to look for is traffic from the client to the server with no response from the server.
Fig2 - Filter on Unilateral Flows Only
This view lists the top client IPs whose flows carry traffic from the client only, without any response.
For Advanced Users of PerformanceVision
The unilateral flows filter mostly surfaces 'SYN' issues, but it can also return other kinds of flows. To query only the 'SYN' without connections, use this custom filter:
syn.count > 0 and ct.count = 0
Fig3 - PV finds unilateral flows and sorts them.
As you can see in the results above, several IPs request a connection to a server (SYN > 0) but never manage to establish one (Connections = 0).
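Outside of PV, the same selection logic can be sketched in a few lines of Python over exported flow records. This is only an illustration: the record layout and the field names `client_ip`, `syn_count` and `ct_count` are assumptions that mirror the filter above, not a PV export format.

```python
from typing import Dict, List


def syn_without_connection(flows: List[Dict]) -> List[Dict]:
    """Keep flows with at least one SYN and no established connection, busiest first."""
    failing = [f for f in flows if f["syn_count"] > 0 and f["ct_count"] == 0]
    return sorted(failing, key=lambda f: f["syn_count"], reverse=True)


if __name__ == "__main__":
    # Hypothetical flow records, one per client/server pair.
    flows = [
        {"client_ip": "192.0.2.10", "syn_count": 12, "ct_count": 0},
        {"client_ip": "192.0.2.11", "syn_count": 4, "ct_count": 4},
        {"client_ip": "192.0.2.12", "syn_count": 30, "ct_count": 0},
    ]
    for flow in syn_without_connection(flows):
        print(flow["client_ip"], flow["syn_count"])
```

Sorting by SYN count puts the clients that retry the most at the top, which is usually where to start looking.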
Here are common failure cases:
- A firewall denies those connections. You can apply the same query on client zones (in the same menu) to see whether the IPs are in the same zone.
- The server no longer exists or is not available. This happens frequently when a server IP has changed and some clients continue to query the old one.
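To check one of these cases from a client machine, a small active probe can help. The sketch below uses Python's standard `socket` module; the function names and result labels are illustrative assumptions. A timeout roughly matches the unilateral pattern (SYN sent, nothing comes back), whereas a refusal means the host did answer with a reset, so the flow is not unilateral.

```python
import socket
from typing import Optional


def classify_connect_result(error: Optional[BaseException]) -> str:
    """Map the outcome of a connection attempt to a coarse label (labels are illustrative)."""
    if error is None:
        return "connected"
    if isinstance(error, ConnectionRefusedError):
        return "refused"      # the host replied with a reset
    if isinstance(error, (socket.timeout, TimeoutError)):
        return "no_response"  # nothing came back before the timeout
    return "other_error"      # e.g. name resolution failure, unreachable network


def probe(host: str, port: int, timeout: float = 3.0) -> str:
    """Attempt one TCP connection and classify the result."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return classify_connect_result(None)
    except OSError as exc:
        return classify_connect_result(exc)


if __name__ == "__main__":
    # 192.0.2.1 is a documentation address; replace it with a server you own.
    print(probe("192.0.2.1", 80, timeout=1.0))
```

A "no_response" result from several clients in the same zone points toward filtering on the path, while the same result from every client suggests the server itself is gone.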
2 - Bad connection ratio
In a perfect world, you should have 1 'SYN' per TCP connection. PV provides a metric for this connection efficiency: the 'SYN' per Connection rate, which is the ratio of the number of SYN packets to the number of TCP sessions set up. This metric is available in the 'details' tables of the TCP theme. You can also graph its evolution over time in Application → Custom charts.
Fig4 - PV custom chart SYN/Conn
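For readers who want to reproduce this ratio from their own counters, here is a minimal Python sketch. The function name and the handling of the zero-connection case are assumptions made for illustration, not PV's implementation.

```python
import math
from typing import Optional


def syn_per_connection(syn_count: int, connection_count: int) -> Optional[float]:
    """Ratio of SYN packets to established connections; 1.0 is the ideal value."""
    if connection_count == 0:
        # No connection at all: undefined without SYNs, unbounded with SYNs.
        return math.inf if syn_count > 0 else None
    return syn_count / connection_count


if __name__ == "__main__":
    print(syn_per_connection(100, 100))  # ideal case
    print(syn_per_connection(120, 100))  # some SYNs had to be retried
    print(syn_per_connection(15, 0))     # SYNs but no connection at all
```

A value above 1 means that some connection attempts needed more than one SYN, or never succeeded.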
A bad 'SYN' efficiency is sometimes a network issue: the failed connection attempts are caused by packet loss or congestion. You can check this assumption by looking at the Connection Time. If it remains low and several hosts are impacted, then it’s probably a network issue.
If instead the Connection Time is high, the issue is more likely on the server side: the server is overloaded and cannot answer all clients. Finally, if the 'SYN' ratio is huge, you may have a security issue, such as a DDoS attack.
The network latency - RTT (Round Trip Time) - can give you another indication that the issue is on the network side. PV provides the RTT in the Network Performances metric theme.
Fig5 - Troubleshoot connections with Connection Times and SYN rates
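The reasoning in this section can be summarized as a small decision function. This is a sketch only: every threshold below is a hypothetical placeholder to tune for your own environment, and the returned labels are hints for where to look first, not a diagnosis.

```python
def triage(syn_ratio: float, connection_time_ms: float, rtt_ms: float,
           *, ratio_ok: float = 1.2, ratio_huge: float = 10.0,
           ct_high_ms: float = 200.0, rtt_high_ms: float = 100.0) -> str:
    """Suggest where to look first, following the rules of thumb in this section.

    All default thresholds are illustrative assumptions.
    """
    if syn_ratio <= ratio_ok:
        return "ok"        # close to 1 SYN per connection
    if syn_ratio >= ratio_huge:
        return "security"  # huge SYN ratio, e.g. a possible DDoS attack
    if rtt_ms >= rtt_high_ms:
        return "network"   # high latency points to the network side
    if connection_time_ms >= ct_high_ms:
        return "server"    # slow handshakes, possibly an overloaded server
    return "network"       # bad ratio but fast handshakes: likely packet loss


if __name__ == "__main__":
    print(triage(syn_ratio=2.0, connection_time_ms=5, rtt_ms=10))
    print(triage(syn_ratio=2.0, connection_time_ms=500, rtt_ms=10))
    print(triage(syn_ratio=50.0, connection_time_ms=5, rtt_ms=10))
```

Keeping the thresholds as keyword arguments makes it easy to adapt them per application or per zone.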
In this first article, we gave a short presentation of TCP performance metrics and of how TCP sets up connections with SYN / SYN+ACK / ACK packets. We also saw some common failure cases that can be diagnosed easily with PerformanceVision.
To troubleshoot these kinds of issues, we used the Top Clients, Top Client Zones and Custom Charts pages. To go further, we used “Advanced Filter: Unilateral Flows” to filter flows with no responses.
We introduced several metrics: the number of 'SYN' and 'Handshakes' (connections), the SYN Efficiency and the Connection Time.
In the next article, we will have a look at how to end a connection with Reset and Fin packets.
In the meantime, if you would like to give it a try, just download our evaluation virtual appliance: