
TCP Protocol: From Packet Structure to Reliable Transmission Mechanisms
This article is not an introductory guide to the TCP protocol but is intended for learners who already have some foundational knowledge of networking. Before reading, it is recommended that you at least understand the basic concepts and working principles of TCP. This article aims to help you systematically connect fragmented TCP knowledge points into a complete framework.
To better understand the content, we recommend reviewing the following foundational materials first:
- Ruan Yifeng’s blog post "Introduction to the TCP Protocol," which explains the core concepts of TCP in an easy-to-understand manner.
- A video tutorial by "Code Novice" titled "Computer Networks - Analysis of a TCP Communication Process," which visually demonstrates the entire TCP communication process.
TCP (Transmission Control Protocol), as one of the core protocols of the transport layer, provides connection-oriented, reliable, byte-stream-based transmission services for upper-layer applications. Its reliability is achieved through the following core mechanisms:
- Ordered Transmission: Data chunks are numbered using a sequence number mechanism.
- Acknowledgment (ACK): The ACK mechanism ensures data delivery confirmation.
- Error Recovery: Lost packets are handled through timeout retransmission.
- Rate Adjustment:
  - End-to-end flow control (sliding window mechanism)
  - Network-level congestion control algorithms
A TCP segment (often loosely called a packet) is a binary transmission unit consisting of a header and a variable-length data section. The header is 20 bytes long without options and can grow to 60 bytes with them. Its structure is as follows:
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|          Source Port          |       Destination Port        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        Sequence Number                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                     Acknowledgment Number                     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  Data |           |U|A|P|R|S|F|                               |
| Offset| Reserved  |R|C|S|S|Y|I|            Window             |
|       |           |G|K|H|T|N|N|                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           Checksum            |         Urgent Pointer        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                 Options (if Data Offset > 5)                  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                             Data                              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Port Identification (16 bits each)
- Source Port: Identifies the sender’s application port.
- Destination Port: Identifies the receiver’s application port.
- Together, they define the endpoints of a TCP connection.
Sequence Control Fields (32 bits each)
- Sequence Number (Seq):
  - Indicates the position of the first byte of the current segment in the byte stream.
  - Randomly initialized during connection setup (ISN - Initial Sequence Number).
- Acknowledgment Number (Ack):
  - Indicates the sequence number of the next byte the receiver expects.
  - Implicitly confirms that all data before this sequence number has been received.
Control Flags (1 bit each)
- URG: Urgent pointer valid flag.
- ACK: Acknowledgment number valid flag (set on every segment after the initial SYN).
- PSH: Instructs the receiver to immediately deliver data to the application layer.
- RST: Forces an abnormal connection reset.
- SYN: Synchronizes sequence numbers (used for connection setup).
- FIN: Graceful connection termination flag.
Flow Control Field
- Window Size: A 16-bit field indicating the available buffer space (in bytes) on the receiver’s side, forming the basis of sliding window flow control.
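The field layout above can be made concrete with a small decoder. The following is a minimal sketch, assuming only the fixed 20-byte part of the header matters (options are ignored); the names `TcpHeader` and `parse_tcp_header` and the source port 54321 are illustrative choices, not part of any standard API.

```python
import struct
from typing import NamedTuple


class TcpHeader(NamedTuple):
    src_port: int
    dst_port: int
    seq: int
    ack: int
    data_offset: int   # header length in 32-bit words
    flags: set
    window: int
    checksum: int
    urgent_ptr: int


# Flag bits in the low 6 bits of the offset/flags word, lowest bit first.
FLAG_NAMES = ["FIN", "SYN", "RST", "PSH", "ACK", "URG"]


def parse_tcp_header(raw: bytes) -> TcpHeader:
    """Decode the fixed 20-byte part of a TCP header (options are ignored)."""
    if len(raw) < 20:
        raise ValueError("a TCP header is at least 20 bytes")
    # "!" = network byte order; H = 16-bit field, I = 32-bit field
    src, dst, seq, ack, off_flags, window, checksum, urg = struct.unpack(
        "!HHIIHHHH", raw[:20]
    )
    data_offset = off_flags >> 12   # top 4 bits
    flag_bits = off_flags & 0x3F    # low 6 bits: URG down to FIN
    flags = {name for i, name in enumerate(FLAG_NAMES) if flag_bits & (1 << i)}
    return TcpHeader(src, dst, seq, ack, data_offset, flags, window, checksum, urg)


# Build a sample SYN header by hand (hypothetical port 54321) and parse it back.
sample = struct.pack(
    "!HHIIHHHH", 54321, 80, 123456789, 0, (5 << 12) | 0x02, 65535, 0, 0
)
header = parse_tcp_header(sample)
print(header.flags, "header bytes:", header.data_offset * 4)
```

A Data Offset of 5 means five 32-bit words, which is exactly the 20-byte header without options.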
When first encountering the TCP packet structure, the numerous fields may seem overwhelming. It’s advisable not to rush to memorize all details at once but to gradually understand each field’s role in the context of actual communication processes discussed in later sections.
The TCP connection establishment process involves exchanging specific control packets (three-way handshake) to achieve the following goals:
- Synchronize initial sequence numbers (ISN).
- Exchange TCP parameters (e.g., window size, MSS).
- Create and initialize socket data structures in both endpoints’ kernels.
This socket is essentially a complex state machine. TCP simulates the concept of a "connection" by maintaining this state machine. Notably, a TCP connection is not a physical link but a logical state maintained by both parties.
Detailed Three-Way Handshake Process
Let’s analyze the process using an example where a client (192.168.1.100) connects to a server (203.179.24.36:80):
- SYN Packet (Client → Server)
  Sequence Number (Seq) = 123456789 (randomly generated ISN)
  Acknowledgment Number (Ack) = 0
  Control Flags = SYN
  Window Size = 65535
  - Client enters the SYN_SENT state.
  - MSS option is typically negotiated here.
- SYN-ACK Packet (Server → Client)
  Sequence Number (Seq) = 987654321 (server’s ISN)
  Acknowledgment Number (Ack) = 123456790 (client’s ISN + 1)
  Control Flags = SYN|ACK
  Window Size = 8192
  - Server enters the SYN_RCVD state.
  - The acknowledgment of the client’s SYN is implicit in the Ack field.
- ACK Packet (Client → Server)
  Sequence Number (Seq) = 123456790
  Acknowledgment Number (Ack) = 987654322 (server’s ISN + 1)
  Control Flags = ACK
  Window Size = 65535
  - The client enters the ESTABLISHED state when it sends this ACK; the server enters it when the ACK arrives.
  - This packet may carry application-layer data (e.g., an HTTP request).
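The sequence and acknowledgment numbers in the three segments follow directly from the two ISNs. As a minimal sketch (the function name `three_way_handshake` is an illustrative assumption), the arithmetic can be reproduced like this:

```python
MOD = 2**32  # sequence numbers are 32-bit and wrap around


def three_way_handshake(client_isn: int, server_isn: int) -> list:
    """Return the (sender, flags, seq, ack) tuples of the three handshake segments."""
    syn = ("client", "SYN", client_isn, 0)
    # The SYN consumes one sequence number, so the server acknowledges ISN + 1.
    syn_ack = ("server", "SYN|ACK", server_isn, (client_isn + 1) % MOD)
    # The server's SYN also consumes one sequence number.
    ack = ("client", "ACK", (client_isn + 1) % MOD, (server_isn + 1) % MOD)
    return [syn, syn_ack, ack]


segments = three_way_handshake(123456789, 987654321)
for sender, flags, seq, ack in segments:
    print(f"{sender:6} {flags:7} seq={seq} ack={ack}")
```

The modulo keeps the numbers inside the 32-bit sequence space, which matters when an ISN happens to sit near the top of the range.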
Key Design Considerations
- ISN Randomization:
  - Prevents interference from stale connections (old packets from the same connection tuple may still be in the network).
  - Security measure to prevent sequence number prediction attacks.
- Sequence Number Semantics:
  - SYN and FIN flags each consume one sequence number.
  - During data transmission, sequence numbers increment by the byte count.
- Why Three-Way Handshake?
  - Ensures bidirectional communication capability.
  - Prevents delayed connection requests from unexpectedly reaching the server.
Four-Way Handshake (Connection Termination) Example
Assume the current sequence numbers are:
- Active closer: Seq = 2000000000
- Passive closer: Seq = 3000000000
- First Segment: FIN (Active → Passive)
  Seq = 2000000000
  Ack = 3000000000
  Flags = FIN|ACK
  - Active party enters FIN_WAIT_1.
- Second Segment: ACK (Passive → Active)
  Seq = 3000000000
  Ack = 2000000001
  Flags = ACK
  - Passive party enters CLOSE_WAIT; it can still send any remaining data in this state.
  - Active party moves to FIN_WAIT_2 upon receipt.
- Third Segment: FIN (Passive → Active)
  Seq = 3000000000
  Ack = 2000000001
  Flags = FIN|ACK
  - Passive party enters LAST_ACK.
- Fourth Segment: ACK (Active → Passive)
  Seq = 2000000001
  Ack = 3000000001
  Flags = ACK
  - Active party enters TIME_WAIT.
  - Passive party fully closes the connection upon receipt.
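The state changes in the four segments above can be written down as a small transition table. This is a simplified sketch that covers only the ordinary close path; simultaneous close (the CLOSING state) and resets are deliberately left out, and the event names are illustrative labels rather than kernel terminology.

```python
# (current state, event) -> next state, for the ordinary close sequence only.
TRANSITIONS = {
    ("ESTABLISHED", "send FIN"): "FIN_WAIT_1",
    ("FIN_WAIT_1", "recv ACK"): "FIN_WAIT_2",
    ("FIN_WAIT_2", "recv FIN"): "TIME_WAIT",    # the final ACK is sent here
    ("TIME_WAIT", "2MSL timeout"): "CLOSED",
    ("ESTABLISHED", "recv FIN"): "CLOSE_WAIT",  # the FIN is acknowledged here
    ("CLOSE_WAIT", "send FIN"): "LAST_ACK",
    ("LAST_ACK", "recv ACK"): "CLOSED",
}


def run(events, state="ESTABLISHED"):
    """Apply events in order and return every state visited."""
    visited = [state]
    for event in events:
        state = TRANSITIONS[(state, event)]
        visited.append(state)
    return visited


active = run(["send FIN", "recv ACK", "recv FIN", "2MSL timeout"])
passive = run(["recv FIN", "send FIN", "recv ACK"])
print("active : " + " -> ".join(active))
print("passive: " + " -> ".join(passive))
```

An event that is not valid in the current state raises `KeyError`, which is a convenient way to spot an impossible sequence when experimenting.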
Deep Dive into TIME_WAIT State
The TIME_WAIT state lasts for 2MSL (twice the Maximum Segment Lifetime) and serves the following purposes:
- Ensures Reliable Termination:
  - Gives the final ACK time to reach the peer.
  - If the ACK is lost, the passive party retransmits FIN, and the active party can still respond.
- Cleans Up Stale Packets:
  - Waits long enough for residual packets in the network to expire.
  - Prevents old data from being mistaken for data on a new connection with the same tuple.
- Practical Implications:
  - May exhaust ephemeral ports on hosts that actively close many short-lived connections.
  - The SO_REUSEADDR socket option lets a restarted server bind its listening port while old connections are still in TIME_WAIT. It does not shorten TIME_WAIT itself; for outbound connections on Linux, net.ipv4.tcp_tw_reuse is the relevant setting.
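As a small illustration of the SO_REUSEADDR option mentioned above, the sketch below sets it on a listening socket before `bind()`. The helper name `make_listener` and the use of port 0 on the loopback address are assumptions made so the example runs anywhere; a real server would bind its well-known port.

```python
import socket


def make_listener(host: str = "127.0.0.1", port: int = 0) -> socket.socket:
    """Create a listening TCP socket with SO_REUSEADDR set before bind()."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # The option must be set before bind(). On Linux it lets a restarted
    # server bind even though earlier connections on the port are in TIME_WAIT.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, port))  # port 0 asks the OS for any free port
    sock.listen()
    return sock


listener = make_listener()
print("listening on", listener.getsockname())
listener.close()
```

The exact semantics of SO_REUSEADDR differ between operating systems, so behavior observed on Linux should not be assumed elsewhere.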
Through this carefully designed state machine, TCP achieves reliable end-to-end communication over the unreliable IP layer. Understanding these state transitions is crucial for diagnosing network issues and optimizing TCP performance.
Sliding Window Demonstration
Consider the following scenario:
- Initial sequence number: 1
- Send window size: 4 segments
- MSS: Fixed for the connection
Phase 1: Initial Transmission
[Sender Actions]
1. Send Segment 1 (Seq=1, Len=100) → Sequence space moves to 101
2. Send Segment 2 (Seq=101, Len=150) → Sequence space moves to 251
3. Send Segment 3 (Seq=251, Len=200) → Sequence space moves to 451
4. Send Segment 4 (Seq=451, Len=50) → Sequence space moves to 501
(Window exhausted; waits for ACKs)
Phase 2: Receiver Processing
[Receiver Actions]
1. Successfully receives Segments 1, 2, 3 (450 bytes total).
2. Sends Ack=451 (expects next byte 451).
3. Receives Segment 4.
4. Sends Ack=501.
Phase 3: Window Advancement
[Sender Actions]
1. Upon Ack=451, window slides to 451.
2. Upon Ack=501, window slides to 501.
3. New window range: 501 to (501 + 4*MSS).
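The three phases above can be replayed in a few lines. This is a minimal sketch, assuming the window is counted in segments as in the scenario (real TCP counts bytes); `SlidingWindowSender` is a hypothetical class written for this illustration.

```python
class SlidingWindowSender:
    """Tracks the send window by counting unacknowledged segments."""

    def __init__(self, initial_seq: int, window_segments: int):
        self.next_seq = initial_seq   # sequence number of the next new byte
        self.window_segments = window_segments
        self.in_flight = []           # (seq, length) sent but not yet acknowledged

    def can_send(self) -> bool:
        return len(self.in_flight) < self.window_segments

    def send(self, length: int) -> int:
        if not self.can_send():
            raise RuntimeError("window exhausted; wait for an ACK")
        seq = self.next_seq
        self.in_flight.append((seq, length))
        self.next_seq += length
        return seq

    def on_ack(self, ack: int) -> None:
        # A cumulative ACK covers every segment that ends at or before `ack`.
        self.in_flight = [(s, n) for s, n in self.in_flight if s + n > ack]

    @property
    def window_start(self) -> int:
        """Lowest unacknowledged sequence number (left edge of the window)."""
        return self.in_flight[0][0] if self.in_flight else self.next_seq


sender = SlidingWindowSender(initial_seq=1, window_segments=4)
sent = [sender.send(n) for n in (100, 150, 200, 50)]  # Phase 1
print("sent:", sent, "can send more:", sender.can_send())
sender.on_ack(451)                                    # Phase 3, step 1
print("window start:", sender.window_start, "can send more:", sender.can_send())
sender.on_ack(501)                                    # Phase 3, step 2
print("window start:", sender.window_start)
```

Because ACKs are cumulative, a single `Ack=451` releases the first three segments at once.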
Dynamic Window Adjustments
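- Window Scaling: Negotiates a shift count (at most 14) during the handshake, raising the maximum window from 64KB to about 1GB.
- Zero Window Probing: Sender periodically probes the receiver when the advertised window is 0, so a lost window update cannot stall the connection.
- Delayed ACKs: Receiver may delay ACKs (the delay must stay under 500ms; in practice it is usually far shorter) to reduce the number of pure ACK segments.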
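The window scaling limit is simple arithmetic: the 16-bit Window field is shifted left by the negotiated shift count. A short worked sketch, assuming the RFC 7323 maximum shift of 14 (the function name `effective_window` is illustrative):

```python
MAX_WINDOW_FIELD = 0xFFFF  # largest value of the 16-bit Window field
MAX_SHIFT = 14             # largest shift count allowed by RFC 7323


def effective_window(window_field: int, shift: int) -> int:
    """Receive window in bytes once the negotiated scale is applied."""
    if not 0 <= window_field <= MAX_WINDOW_FIELD:
        raise ValueError("window field must fit in 16 bits")
    if not 0 <= shift <= MAX_SHIFT:
        raise ValueError("shift count must be between 0 and 14")
    return window_field << shift


print(effective_window(65535, 0))   # without scaling
print(effective_window(65535, 14))  # with the maximum shift
```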
Fast Retransmit Trigger Mechanism
Example Scenario (Window Size=5):
1. Send Segments: Seq=1,101,201,301,401
2. Result:
- Segment 2 (Seq=101) lost.
- Segments 1,3,4,5 arrive.
3. Receiver Behavior:
- Receives Seq=1: Sends Ack=101.
- Receives Seq=201: Sends Ack=101 (dup ACK #1).
- Receives Seq=301: Sends Ack=101 (dup ACK #2).
- Receives Seq=401: Sends Ack=101 (dup ACK #3).
4. Sender Response:
- Triggers fast retransmit upon 3rd dup ACK.
- Immediately retransmits Seq=101.
5. Recovery:
- Receiver acknowledges all data with Ack=501.
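The scenario above can be simulated with a cumulative-ACK receiver and a duplicate-ACK counter on the sender. This is a simplified sketch, assuming fixed 100-byte segments and no SACK; the function names are illustrative.

```python
def receiver_acks(arrivals, seg_len=100, first_seq=1):
    """Yield the cumulative ACK sent after each arriving segment."""
    received = set()
    expected = first_seq
    for seq in arrivals:
        received.add(seq)
        # Advance over every segment that is now contiguous.
        while expected in received:
            expected += seg_len
        yield expected


def fast_retransmit_point(acks, threshold=3):
    """Return the sequence number to retransmit once `threshold` duplicate ACKs are seen."""
    last_ack, dup_count = None, 0
    for ack in acks:
        if ack == last_ack:
            dup_count += 1
            if dup_count == threshold:
                return ack  # the receiver is still waiting for this byte
        else:
            last_ack, dup_count = ack, 0
    return None


arrivals = [1, 201, 301, 401]  # Seq=101 was lost
acks = list(receiver_acks(arrivals))
lost = fast_retransmit_point(acks)
print("ACKs:", acks, "retransmit:", lost)
final_ack = list(receiver_acks(arrivals + [lost]))[-1]
print("ACK after retransmission:", final_ack)
```

The first `Ack=101` is the original acknowledgment; only the three that repeat it count as duplicates.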
Retransmission Strategy Comparison
Feature | Timeout Retransmit | Fast Retransmit |
---|---|---|
Trigger Condition | Timer expiration | 3 duplicate ACKs |
Response Time | ≥RTO (200ms-2s) | ≤1 RTT (~50ms) |
Network Efficiency | Low (transmission halts) | High (continues sending) |
Use Case | Consecutive losses/ACK loss | Single packet loss |
Congestion Response | Window=1, slow start | Fast recovery algorithm |
Notes:
- Dup ACK threshold is configurable (Linux default: 3).
- Selective ACK (SACK) improves recovery for multiple losses.
- Timestamps help detect spurious retransmissions.
These mechanisms ensure reliability while dynamically adapting to network conditions. Understanding them is key for performance tuning and troubleshooting, especially in high-latency or loss-prone networks.
The root cause of sticky packets lies in the mismatch between TCP’s byte-stream nature and application-layer message boundaries:
Since TCP has no inherent message boundaries, multiple writes by the sender may be merged into one segment (sender-side sticking), and the receiver may deliver multiple messages at once due to buffer size or read timing (receiver-side sticking).
This semantic gap between the transport layer’s byte orientation and the application layer’s message orientation—combined with MTU limits, ACK delays, and other mechanisms—requires additional framing logic to reconstruct message boundaries.
Comparison of Solutions
Method | Principle | Pros | Cons | Use Cases |
---|---|---|---|---|
Fixed Length | Fixed message size | Simple parsing | Wastes bandwidth | Fixed-size record protocols |
Delimiters | Special chars (e.g., \r\n ) | Text-friendly | Needs escaping | HTTP headers |
TLV Format | Type-Length-Value | Extensible, structured | Complex parsing | ASN.1 encoding |
Length Prefix | Header declares payload size | Efficient, binary-friendly | Requires length upfront | gRPC/Protobuf |
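Engineering Tips:
- Prefer delimiters for text protocols (e.g., HTTP’s \r\n\r\n header terminator).
- Use length prefix + TLV for binary protocols.
- The MSG_WAITALL flag makes a blocking recv() wait for the full requested length, but it can still return fewer bytes on a signal, error, or disconnect, so always check the returned length.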
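To show how length-prefix framing restores message boundaries, here is a minimal sketch. The 4-byte big-endian length header is an assumed wire format chosen for the example, and `FrameDecoder` is a hypothetical helper, not a library class.

```python
import struct

HEADER = struct.Struct("!I")  # 4-byte big-endian payload length (assumed format)


def encode_frame(payload: bytes) -> bytes:
    """Prefix the payload with its length."""
    return HEADER.pack(len(payload)) + payload


class FrameDecoder:
    """Reassembles complete messages from arbitrarily split or merged chunks."""

    def __init__(self):
        self._buffer = bytearray()

    def feed(self, chunk: bytes) -> list:
        """Add received bytes and return every message that is now complete."""
        self._buffer += chunk
        messages = []
        while len(self._buffer) >= HEADER.size:
            (length,) = HEADER.unpack_from(self._buffer)
            end = HEADER.size + length
            if len(self._buffer) < end:
                break  # payload incomplete; wait for more bytes
            messages.append(bytes(self._buffer[HEADER.size:end]))
            del self._buffer[:end]
        return messages


stream = encode_frame(b"hello") + encode_frame(b"world!")
decoder = FrameDecoder()
# Simulate TCP delivering the two messages merged and cut at an arbitrary point.
first = decoder.feed(stream[:7])
second = decoder.feed(stream[7:])
print(first, second)
```

The decoder never assumes that one `recv()` equals one message, which is exactly the assumption that breaks in the sticky-packet case. A production decoder would also cap the accepted length to avoid unbounded buffering.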
HOL blocking occurs because TCP’s strict in-order delivery clashes with network unpredictability:
When a single packet is lost or delayed, the receiver cannot deliver the packets that arrived after it to the application (even if they belong to different messages) until the gap is filled, stalling the entire connection.
This stems from TCP’s coupling of reliability (retransmissions) and ordering (sequence numbers). It creates cascading bottlenecks at both physical (single interface) and transport (single connection) layers, especially worsening performance in multiplexed scenarios. Newer protocols like QUIC address this by introducing per-stream multiplexing and independent retransmissions.
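The stall is easy to see in a toy receive buffer that releases only contiguous data. The following is a simplified sketch, assuming fixed 100-byte segments; `deliver_in_order` is an illustrative name.

```python
def deliver_in_order(arrivals, first_seq=1, seg_len=100):
    """Return, per arrival, the segments the application can read at that moment."""
    buffered = set()
    expected = first_seq
    timeline = []
    for seq in arrivals:
        buffered.add(seq)
        released = []
        # Only a contiguous run starting at `expected` may leave the receive buffer.
        while expected in buffered:
            buffered.remove(expected)
            released.append(expected)
            expected += seg_len
        timeline.append((seq, released))
    return timeline


# Segment 101 is lost and only arrives last, after a retransmission.
timeline = deliver_in_order([1, 201, 301, 401, 101])
for seq, released in timeline:
    print(f"arrived {seq:3} -> delivered {released}")
```

Segments 201, 301, and 401 sit in the buffer until 101 arrives, even though they were received intact.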
Traditional TCP Throughput Model
T ≈ min( W*MSS/RTT , MSS/(RTT*sqrt(p)) )
Where:
- W: Congestion window size (packets).
- p: Packet loss rate.
- MSS: Maximum segment size.
- RTT: Round-trip time.
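The model can be evaluated directly. The sketch below computes the formula exactly as written, with no constant factor in the loss term, so its results are not expected to match figures derived with a different constant. The window of 64 packets is an assumed value, chosen large enough that the loss term dominates.

```python
import math


def tcp_throughput_bps(window_pkts: float, mss_bytes: int, rtt_s: float, loss_rate: float) -> float:
    """Evaluate T = min(W*MSS/RTT, MSS/(RTT*sqrt(p))) in bits per second."""
    mss_bits = mss_bytes * 8
    window_bound = window_pkts * mss_bits / rtt_s
    if loss_rate <= 0:
        return window_bound  # no loss: only the window limits throughput
    loss_bound = mss_bits / (rtt_s * math.sqrt(loss_rate))
    return min(window_bound, loss_bound)


# W = 64 packets is an assumption; MSS and RTT are example values.
for p in (0.001, 0.01):
    mbps = tcp_throughput_bps(64, 1460, 0.050, p) / 1e6
    print(f"loss={p:.1%}: {mbps:.2f} Mbps")
```

Because the loss term scales with 1/sqrt(p), a tenfold increase in loss cuts the loss-bound throughput by a factor of about 3.16.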
Multiplexed Protocol Optimization
A simplified model of a multiplexed protocol such as QUIC with n streams:
T ≈ n*[ min( (W/n)*MSS/RTT , MSS/(RTT*sqrt(p/n)) ) ]
Key assumptions of the model:
- Stream isolation: One stream’s blockage doesn’t affect others.
- Shared congestion control: n streams share total window W.
- Reduced loss impact: Each stream is assumed to see an effective loss rate of p/n.
Illustrative Comparison (MSS=1460B, RTT=50ms; indicative figures, not derived exactly from the formulas above):
Loss Rate | TCP Throughput | QUIC (4 streams) |
---|---|---|
0.1% | 11.7Mbps | 23.4Mbps |
1% | 3.7Mbps | 7.4Mbps |
Standard Parameters
Parameter | RFC Default | Linux Default | Tuning Advice |
---|---|---|---|
Initial Window (IW) | 10*MSS (~14KB) | 10*MSS | BBR: Set to 16*MSS |
Minimum RTO | 1s (RFC 6298) | 200ms | Data centers: 10ms |
Max Retransmissions | R2 ≥ 100s (RFC 1122) | 15 (tcp_retries2) | Wireless: 8 |
TIME_WAIT Duration | 2*MSL (240s, MSL = 2 min) | 60s | Short-lived conns: 30s |
Max Receive Window | 65KB | 4MB (Linux 4.14+) | High-speed: ≥16MB |
Optimization Example: Video Streaming
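# Linux kernel tuning
# Enable window scaling so receive windows can exceed 64KB (on by default)
sysctl -w net.ipv4.tcp_window_scaling=1
# Raise the maximum socket receive buffer to 16MB
sysctl -w net.core.rmem_max=16777216
# Keep the congestion window after idle periods instead of restarting slow start
sysctl -w net.ipv4.tcp_slow_start_after_idle=0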
- MPTCP: Aggregates multiple physical paths.
- QUIC: UDP-based reliable transport.
- BBR: Congestion control driven by measured bandwidth and RTT rather than by packet loss.
- Zero-Copy: Reduces kernel-to-userspace copies.
Note: All models here are based on RFCs and Linux implementations. Real-world performance may vary—always A/B test in production.
This document combines theory, models, and practical insights to systematically explain TCP’s core mechanisms and optimization strategies. Mastering these advanced topics is essential for building high-performance networked applications.