+-----------+----------+----------+------------------+
| Data Link | IP | TCP/UDP | Data |
| Header | Header | Header | |
+-----------+----------+----------+------------------+
0 15 31
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| UDP source port | UDP destination port |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| UDP message length | UDP checksum |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
The UDP checksum is calculated over the data, the UDP header, and a
"pseudo-header" composed of information taken from the IP header. In
addition, the data is padded with a zero byte, if needed, to reach a
16-bit boundary. This is necessary because the checksum operates over a
set of consecutive 16-bit values. The layout of the buffer used for the
calculation is given below.
0 7 15 31
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ ----
| source IP address |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ psuedo
| destination IP address |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ header
| 0 | proto | UDP message length |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ ----
| UDP source port | UDP destination port |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ UDP header
| UDP message length | UDP checksum |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ ----
| |
| DATA |
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| | Pad (0) |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
The pseudo header is used so that UDP can verify that a datagram reached
its correct destination. This is possible because the sender is the one
who generates the checksum, and any change to the pseudo header
(different source or destination addresses) would cause a different
checksum to be calculated at the receiver. We will see that TCP also
uses the same pseudo header when it calculates its checksum. The proto
field in the pseudo header is set to 17, the value used in the IP header
for UDP. The UDP message length field in the UDP header is redundant
because the IP header contains a length field.
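The checksum calculation described above can be sketched in Python as
follows. This is a minimal illustration, not a reference implementation:
the function names are hypothetical, and the final step (sending 0xFFFF
when the result is 0, because 0 means "no checksum") comes from the UDP
specification rather than from these notes.

```python
import socket
import struct

UDP_PROTO = 17  # value of the proto field for UDP


def ones_complement_sum(data: bytes) -> int:
    """Sum consecutive 16-bit big-endian words with end-around carry."""
    if len(data) % 2:
        data += b"\x00"  # pad with a zero byte to a 16-bit boundary
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
        total = (total & 0xFFFF) + (total >> 16)  # fold the carry back in
    return total


def udp_checksum(src_ip: str, dst_ip: str, src_port: int, dst_port: int,
                 payload: bytes) -> int:
    """Checksum over pseudo header + UDP header + padded data."""
    length = 8 + len(payload)  # the UDP header is 8 bytes
    pseudo = (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
              + struct.pack("!BBH", 0, UDP_PROTO, length))
    # The checksum field itself is zero while the sum is computed.
    header = struct.pack("!HHHH", src_port, dst_port, length, 0)
    checksum = ~ones_complement_sum(pseudo + header + payload) & 0xFFFF
    return checksum or 0xFFFF  # assumption from the UDP spec: 0 means "none"
```

A receiver repeats the sum with the received checksum in place. If nothing
changed, the folded sum comes out as 0xFFFF.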
0 7 15 31
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| TCP source port | TCP destination port |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| sequence number |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| acknowledgment number |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| hlen | res |U|A|P|R|S|F| window size |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| TCP checksum | urgent pointer |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| TCP options (if any) ...|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
The hlen field contains the size of the TCP header in 32-bit words.
The res field is reserved for future use. We will explore the TCP header
as we start looking at its operation. The U, A, P, R, S, and F
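bits are flags. They stand for:
    U  URG  the urgent pointer field is valid
    A  ACK  the acknowledgment number field is valid
    P  PSH  pass the data to the application as soon as possible
    R  RST  reset the connection
    S  SYN  synchronize sequence numbers to open a connection
    F  FIN  the sender is finished sending data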
TCP assigns a sequence number to each byte of data it sends. The sequence number field of a segment therefore holds the sequence number of the first byte of data in the segment. The TCP header has no length field of its own; the length of the segment is derived from the length in the IP header.
TCP uses a window advertisement mechanism to perform flow control. Essentially, the receiver reports in its ACKs how much buffer space it has left. This is called the window and is placed in the window size field of the TCP header. It is expressed in bytes.
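To make the header layout concrete, here is a small sketch that unpacks
the fixed 20-byte TCP header with Python's struct module. The function
name and the dictionary keys are hypothetical; the flag names are the
conventional ones for the U, A, P, R, S, and F bits.

```python
import struct

# Flag names from the lowest bit upward: F, S, R, P, A, U.
FLAG_NAMES = ["FIN", "SYN", "RST", "PSH", "ACK", "URG"]


def parse_tcp_header(segment: bytes) -> dict:
    """Unpack the fixed part of a TCP header (network byte order)."""
    (src, dst, seq, ack, off_flags, window,
     checksum, urgent) = struct.unpack("!HHIIHHHH", segment[:20])
    hlen_words = off_flags >> 12   # top 4 bits: header size in 32-bit words
    flag_bits = off_flags & 0x3F   # low 6 bits: U A P R S F
    return {
        "src_port": src, "dst_port": dst,
        "seq": seq, "ack": ack,
        "header_bytes": hlen_words * 4,
        "flags": [n for i, n in enumerate(FLAG_NAMES) if flag_bits & (1 << i)],
        "window": window,          # receiver's free buffer space, in bytes
        "checksum": checksum, "urgent": urgent,
        "options": segment[20:hlen_words * 4],
    }
```

A header with hlen = 5 has no options, so header_bytes is 20 and the
options field comes back empty.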
client server
| | LISTEN (passive open)
(active open) | SYN ISN=J |
SYN_SENT +-------------------------------------->| SYN_RCVD
| |
| SYN ISN=K, ACK J+1 |
ESTABLISHED |<--------------------------------------+
| |
| ACK K+1 |
+-------------------------------------->| ESTABLISHED
The sequence is that a client initiates a connection (by performing
an active open) by sending a segment with the SYN bit on. The Initial
Sequence Number (ISN) is set to J in that segment. The server responds
with a SYN of its own carrying an ISN of K. In the same segment, it
ACKs J+1. The client then sends an ACK for K+1. The connection is then
established and data may flow.
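The numbering in the exchange above can be written out as a short sketch.
The function name and the dictionary layout are hypothetical. The wrap at
2^32 follows from the 32-bit sequence number field, and the sequence
number J+1 on the final ACK reflects the SYN consuming one sequence
number.

```python
def three_way_handshake(client_isn: int, server_isn: int) -> list:
    """Return the three segments of a connection setup, in order."""
    mod = 2 ** 32  # sequence numbers are 32 bits wide and wrap around
    return [
        {"from": "client", "flags": "SYN", "seq": client_isn, "ack": None},
        {"from": "server", "flags": "SYN,ACK", "seq": server_isn,
         "ack": (client_isn + 1) % mod},
        {"from": "client", "flags": "ACK", "seq": (client_isn + 1) % mod,
         "ack": (server_isn + 1) % mod},
    ]
```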
By the same kind of method, FINs are used to terminate a connection. Because TCP is full-duplex, each direction of data flow must be closed independently. The sequence is shown graphically below.
FIN M
(active close) +-------------------------------------->| (passive close)
FIN_WAIT_1 | | CLOSE_WAIT
| ACK M+1 |
FIN_WAIT_2 |<--------------------------------------+
| |
| FIN N |
TIME_WAIT |<--------------------------------------+ LAST_ACK
| |
| ACK N+1 |
+-------------------------------------->| CLOSED
The rules of what is a client and what is a server don't apply to the
termination process. Either the client or server may initiate connection
termination. The first entity to send a FIN is the active close agent.
The one receiving the FIN is the passive close agent.
TCP specifies a state machine to deal with the many exceptions to the processes above. The state machine was handed out in class. The all-capital labels in the diagrams indicate the TCP state each entity goes through.
For connection establishment, each SYN is retransmitted until the next expected segment is received from the other entity. If a number of retransmissions are sent but no segment is received, the connection is aborted and the application is notified. The same applies to most of the other messages sent in the TCP state machine.
A RST (Reset) causes the connection to be aborted and closed. It is sent under a few conditions: when a SYN arrives for a nonexistent port, when a segment arrives that is incorrect for the current state, and when one end wants to abort a connection. In the last case it replaces a FIN.
A connection can be terminated in one direction and kept open in the other. This is called a Half-Closed connection and can be performed using a special flag to the shutdown() sockets call. An example of this is the rsh command, where the command is executed and one direction is closed, but the other is left open to retrieve the data.
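As an illustration of a half-close, the sketch below uses Python's socket
module over the loopback interface. The function name and the message
contents are made up for the example; socket.SHUT_WR is the flag that
closes only the sending direction.

```python
import socket


def half_close_demo():
    """Client closes its sending side, then still reads the reply."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
    listener.listen(1)
    client = socket.create_connection(listener.getsockname())
    server, _ = listener.accept()

    client.sendall(b"run this command")
    client.shutdown(socket.SHUT_WR)   # sends a FIN: no more data from client

    request = b""
    while chunk := server.recv(1024): # recv() returns b"" once the FIN arrives
        request += chunk

    server.sendall(b"output of the command")  # this direction is still open
    server.close()

    reply = b""
    while chunk := client.recv(1024):
        reply += chunk
    client.close()
    listener.close()
    return request, reply
```

The server learns that the request is complete from the end-of-file on
its read side, yet it can still send its reply.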
A connection may also be Half-Open. In this case, one side crashes and comes back up. It no longer knows the state of any TCP connection, so any TCP segment that is not a SYN will cause that machine to send a RST. A Half-Open connection is the situation in which one side thinks the connection is in some state while the other side has no knowledge of the connection.
Special conditions allow both sides to attempt an active open simultaneously. This is called a Simultaneous Open. This is rare, but not impossible. In this case, both sides send a SYN and then both sides send SYN/ACKs. Both sides go through the SYN_SENT, SYN_RCVD, and ESTABLISHED states.
Also, special conditions allow both sides to attempt an active close simultaneously. This is called a Simultaneous Close. In this case, both sides send FINs, then send ACKs for each other's FINs. Both sides go through the FIN_WAIT_1, CLOSING, and TIME_WAIT states.
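The state sequences named above can be traced with a small table-driven
sketch. It covers only the transitions discussed in these notes and is
not the complete TCP state machine. The event names are hypothetical, and
the timeout that takes TIME_WAIT to CLOSED is an assumption taken from
the standard state diagram.

```python
# (current state, event) -> next state
TRANSITIONS = {
    ("CLOSED", "passive_open"): "LISTEN",
    ("CLOSED", "active_open"): "SYN_SENT",        # sends SYN
    ("LISTEN", "recv_SYN"): "SYN_RCVD",           # sends SYN, ACK
    ("SYN_SENT", "recv_SYN_ACK"): "ESTABLISHED",  # sends ACK
    ("SYN_SENT", "recv_SYN"): "SYN_RCVD",         # simultaneous open
    ("SYN_RCVD", "recv_ACK"): "ESTABLISHED",
    ("ESTABLISHED", "close"): "FIN_WAIT_1",       # sends FIN
    ("ESTABLISHED", "recv_FIN"): "CLOSE_WAIT",    # sends ACK
    ("FIN_WAIT_1", "recv_ACK"): "FIN_WAIT_2",
    ("FIN_WAIT_1", "recv_FIN"): "CLOSING",        # simultaneous close
    ("FIN_WAIT_2", "recv_FIN"): "TIME_WAIT",      # sends ACK
    ("CLOSING", "recv_ACK"): "TIME_WAIT",
    ("CLOSE_WAIT", "close"): "LAST_ACK",          # sends FIN
    ("LAST_ACK", "recv_ACK"): "CLOSED",
    ("TIME_WAIT", "timeout"): "CLOSED",           # assumption: not in the notes
}


def run(events, state="CLOSED"):
    """Return the list of states visited; raises KeyError on a bad event."""
    path = [state]
    for event in events:
        state = TRANSITIONS[(state, event)]
        path.append(state)
    return path
```

For example, run(["close", "recv_FIN", "recv_ACK"], state="ESTABLISHED")
walks one side through a simultaneous close. In a simultaneous open, the
ACK that moves SYN_RCVD to ESTABLISHED arrives inside the peer's SYN/ACK,
so that trace is run(["active_open", "recv_SYN", "recv_ACK"]).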
TCP uses several implementation timers to manage connections. Among these are:
Under many conditions, we would like to combine small pieces of data that would normally be sent as their own messages into a single message. High-delay lines such as modem links are a good example. To do this, TCP uses the Nagle algorithm. In short, the algorithm states, "if outstanding data that has not been ACKed exists, then small segments are not sent until the outstanding data is ACKed." This holds back small segments while data is outstanding, so the data they would have carried accumulates and is sent as one larger segment.
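A toy model of the rule quoted above is sketched here. The class and its
names are hypothetical, and it makes two simplifying assumptions: a single
ACK acknowledges everything outstanding, and a full-sized segment (mss
bytes) is never held back.

```python
class NagleSender:
    """Toy sender that holds small segments while data is unACKed."""

    def __init__(self, mss: int):
        self.mss = mss
        self.unacked = 0     # bytes sent but not yet ACKed
        self.pending = b""   # small writes being held back
        self.sent = []       # segments handed to the network, in order

    def write(self, data: bytes) -> None:
        self.pending += data
        self._try_send()

    def on_ack(self) -> None:
        self.unacked = 0     # assumption: one ACK covers all outstanding data
        self._try_send()

    def _emit(self, segment: bytes) -> None:
        self.sent.append(segment)
        self.unacked += len(segment)

    def _try_send(self) -> None:
        while len(self.pending) >= self.mss:      # full segments always go
            self._emit(self.pending[:self.mss])
            self.pending = self.pending[self.mss:]
        if self.pending and self.unacked == 0:    # small segment: idle only
            self._emit(self.pending)
            self.pending = b""
```

Writing one byte three times in a row sends the first byte at once. The
next two are held and leave together as a single two-byte segment when
the ACK arrives.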
Both the operation above and the flow control window are managed by a sliding window mechanism. We can look at this sliding window in terms of the four sections shown below. In this diagram, each number is a byte of data.
<----------- offered window --------->
<-- usable window -->
+------------------------------------+
1 2 3 | 4 5 6 | 7 8 9 | 10 11 12
+------------------------------------+
<--------------> <--------------> <-----------------> <------------->
sent and ACKed sent,not ACKed can send ASAP can't send until
window moves
The window in the diagram is the box, and it moves as data is ACKed. The
window changes size based on the window advertised by the receiver.
If the left edge of the window advances, the window is said to be
closing. This indicates that data has been sent and ACKed. If the right
edge of the window moves to the left, the window is said to be shrinking.
This is bad and should never happen. Assuming the other edge stays where
it is, it indicates that the receiver decreased its window without
receiving any more data. If the right edge of the window moves to the
right, the window is said to be opening. This indicates that the
receiver increased its window.
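The usable window in the diagram can be computed from three numbers, as
in the sketch below. The function and parameter names are hypothetical,
and bytes are numbered from 1 as in the diagram.

```python
def usable_window(last_acked: int, last_sent: int, advertised: int) -> range:
    """Byte numbers that may be sent right now.

    last_acked  highest byte number the receiver has ACKed
    last_sent   highest byte number sent so far
    advertised  window size advertised by the receiver, in bytes
    """
    right_edge = last_acked + advertised   # last byte inside the window
    first_unsent = last_sent + 1
    # An empty range means nothing can be sent until the window moves.
    return range(first_unsent, max(first_unsent, right_edge + 1))
```

With the values from the diagram, list(usable_window(3, 6, 6)) is
[7, 8, 9]: bytes 4 through 6 are in flight and bytes 10 and up lie
beyond the right edge.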
If the window size is 1 segment, then the operation is called stop-and-wait flow control. It is easy to see that this is exactly what happens: we send a segment and wait for its ACK before another may be sent.
To see the effect a sliding window has on throughput, let us look at the network utilization of TCP. Network utilization measures the effectiveness of a protocol as the ratio of its data to its data plus control information. This is sometimes called "goodput" as well. Assume that the IP header is 20 bytes and that the TCP header is 20 bytes. With a single ACK per segment, we have 1 data segment and 1 ACK. The data segment carries 20 bytes of IP and 20 bytes of TCP header, and so does the ACK, so the ratio looks like data/(80+data). If we take the normal Ethernet segment size of 1460 data bytes (the 1500-byte MTU less 40 bytes of headers), we get 1460/(80+1460), which is about 94.8%. On a 10 Mbps Ethernet, this is about 9.48 Mbps of data throughput (or goodput). On a phone line, with about 256 data bytes per segment, we get 256/(80+256), which is about 76.2%.
What happens if we try to get a single ACK for every 4 segments? Then we have 4 data segments and 1 ACK, which is 5*40 = 200 bytes of headers, and our ratio looks like 4*data/(200+4*data). With 1460 data bytes per segment, this gives 96.7%, and with 256 data bytes, we get 83.7%. You can see that TCP becomes more efficient when more data is sent at once and fewer ACKs are sent back.
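The ratio above generalizes to any number of segments per ACK. The sketch
below uses hypothetical names and assumes, as the text does, 40 bytes of
IP plus TCP header on every data segment and on every ACK.

```python
def utilization(data_bytes: int, segments_per_ack: int = 1,
                header_bytes: int = 40) -> float:
    """Fraction of transmitted bytes that is application data."""
    payload = segments_per_ack * data_bytes
    # Each data segment carries a header, and so does the single ACK.
    overhead = (segments_per_ack + 1) * header_bytes
    return payload / (payload + overhead)
```

utilization(1460) and utilization(1460, segments_per_ack=4) reproduce the
two Ethernet figures. Raising segments_per_ack further shows the ratio
approaching data/(40+data), the limit set by the data segments' own
headers.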
It is also worth seeing why a phone line uses a small MTU. Imagine a 9600 bps modem that sends each byte as 8 bits of data, 1 start bit, and 1 stop bit. This means that only 7680 bps is usable, (8/10)*9600. This is 960 bytes/sec. A 1024-byte packet would take about 1.066 sec to transmit. This is too long for good interactive response, so we must use something that balances a good MTU size with a good response time. It turns out that 296 bytes takes about 308 msec to transmit and is a happy medium.
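The arithmetic above is easy to check with a short sketch. The function
name and its defaults are illustrative; the default of 10 bits on the
wire per byte is the framing assumed in the paragraph.

```python
def transmit_seconds(num_bytes: int, line_bps: int = 9600,
                     wire_bits_per_byte: int = 10) -> float:
    """Time to clock num_bytes onto a serial line."""
    bytes_per_second = line_bps / wire_bits_per_byte   # 960 at 9600 bps
    return num_bytes / bytes_per_second
```

transmit_seconds(1024) and transmit_seconds(296) give the two times
quoted in the text.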
A bigger window means more segments can be outstanding and unACKed, which means more throughput and less waiting for ACKs if loss is not a problem. So, if we want to reach our maximum capacity, we want to fill our network with data, i.e. "fill the pipe". The network capacity can be approximated by its bandwidth-delay product. In other words, the bandwidth of a pipe multiplied by its delay gives us an estimate of its capacity. For a T1 line across country (1,544,000 bps and 0.06 sec round-trip time (RTT)), we get a product of 92,640 bits, or 11,580 bytes. We can visualize this in the picture below.
<--------- delay -------------->
+------------------------------+ ^
| Pipe | | bandwidth
+------------------------------+ v
Suppose we have an 8 KB receive buffer on the receiver and an RTT of 1 msec.
What is the maximum bandwidth that we can expect? By plugging into the
product above, we get RTT*X = 8 KB. Solving for X with RTT=0.001 sec and
8 KB = 65,536 bits, we get a maximum bandwidth of about 65.536 Mbps. This
is the fastest that TCP's flow control will allow us to send if the
maximum window size is 8 KB.
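Both calculations above are the same formula read in two directions, as
the following sketch shows. The function names are hypothetical.

```python
def bdp_bytes(bandwidth_bps: float, rtt_seconds: float) -> float:
    """Capacity of the pipe: bandwidth times delay, converted to bytes."""
    return bandwidth_bps * rtt_seconds / 8


def max_throughput_bps(window_bytes: int, rtt_seconds: float) -> float:
    """Best rate flow control allows: one full window per round trip."""
    return window_bytes * 8 / rtt_seconds
```

bdp_bytes(1_544_000, 0.06) gives the T1 figure, and
max_throughput_bps(8 * 1024, 0.001) gives the limit for the 8 KB buffer.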
Congestion can occur in two ways:
Loss hurts throughput. In addition to its flow control window, TCP maintains a window for congestion control. This is an estimate of the network's capacity. This congestion window can be estimated by analyzing the behavior of the RTT and the loss experienced.
TCP's congestion control ties in very tightly with how its retransmission timer is calculated. We want the retransmit timer to be fairly accurate so that we don't wait too long to send a retransmission. We also don't want it to be so short that it spuriously retransmits when there is no loss. To find this happy medium, we must estimate and track the RTT to the receiver. This can fluctuate based on changing routes to the receiver as well as any queuing done by the routers as described above. We measure the RTT by timing how long it takes from sending a piece of data to receiving the ACK for it. TCP also provides a timestamp option with which we can measure this even better. It is important not to measure the RTT on a segment that is retransmitted. In this case, we can't tell which data segment generated the ACK, the original or the retransmission. This rule is called Karn's algorithm. Once we have a measurement of the RTT, we have to smooth it using a low pass filter, such as R <- gR + (1-g)M, where M is the measured RTT, R is the smoothed RTT, and g is a smoothing factor between 0 and 1. Van Jacobson noticed that you need to track variance as well. His algorithm is:
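A minimal sketch of an estimator that tracks both the smoothed RTT and
its deviation is shown below. The class name, the gains of 1/8 and 1/4,
the factor of 4 on the deviation, and the way the first sample is handled
are assumptions taken from the commonly published form of this
calculation, not from these notes.

```python
class RttEstimator:
    """Smoothed RTT plus mean deviation, giving a retransmit timeout."""

    def __init__(self, gain: float = 0.125, dev_gain: float = 0.25):
        self.gain = gain           # assumption: 1/8 for the average
        self.dev_gain = dev_gain   # assumption: 1/4 for the deviation
        self.srtt = None           # smoothed RTT
        self.dev = None            # smoothed mean deviation

    def update(self, measured: float, retransmitted: bool = False):
        """Feed one RTT measurement and return the new timeout."""
        if retransmitted:
            return self.rto()      # Karn: never sample a retransmission
        if self.srtt is None:      # assumption: seed from the first sample
            self.srtt = measured
            self.dev = measured / 2
        else:
            err = measured - self.srtt
            self.srtt += self.gain * err
            self.dev += self.dev_gain * (abs(err) - self.dev)
        return self.rto()

    def rto(self):
        """Timeout = smoothed RTT + 4 * deviation (None before any sample)."""
        return None if self.srtt is None else self.srtt + 4 * self.dev
```

Because the deviation term grows whenever measurements jump around, the
timeout widens on a jittery path and tightens again once the RTT settles.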
TCP's congestion window is controlled by retransmissions and received ACKs. A TCP connection uses slow start when it begins. During slow start, each ACK received causes the congestion window to increase by 1 segment. This is an exponential increase because as the window opens up, more ACKs arrive per RTT and the window increases faster and faster. At some point, TCP switches to another method of increasing the congestion window. This is congestion avoidance. In this method, the window is increased by 1/window each time an ACK is received. This is a linear increase because the window will only increase by 1 segment during an RTT.
When a retransmission is performed because the timer expired, the congestion window is set to 1 segment and slow start is performed until the window reaches 1/2 of its previous value. At that point, congestion avoidance is used.
TCP also provides another signal to change the congestion window. A receiver, when receiving data that is out of order, will send an immediate duplicate ACK back to the sender. This acts as a signal to the sender that the receiver needs the next piece after the ACKed data. Upon receiving 3 duplicate ACKs, the sender retransmits the missing data without waiting for the retransmission timer. This is called fast retransmit. The sender then sets its congestion window to 1/2 its current value and follows a congestion avoidance-style increase. This is called fast recovery. TCP's congestion control has more caveats to it than this, but the general idea is the same.
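The window adjustments described above can be sketched as a small class.
This is a simplified model: the names, the starting threshold of 64
segments, and the floor of 2 segments on the threshold are assumptions
for illustration. The threshold variable (ssthresh) records the point at
which slow start hands over to congestion avoidance.

```python
class CongestionWindow:
    """Congestion window in units of segments."""

    def __init__(self, ssthresh: float = 64.0):
        self.cwnd = 1.0            # a connection begins in slow start
        self.ssthresh = ssthresh   # assumption: arbitrary starting threshold

    def on_ack(self) -> None:
        if self.cwnd < self.ssthresh:
            self.cwnd += 1.0               # slow start: +1 segment per ACK
        else:
            self.cwnd += 1.0 / self.cwnd   # avoidance: about +1 per RTT

    def on_timeout(self) -> None:
        # Retransmit timer expired: remember half the window, restart at 1.
        self.ssthresh = max(self.cwnd / 2, 2.0)
        self.cwnd = 1.0

    def on_triple_dup_ack(self) -> None:
        # Fast retransmit, then fast recovery: halve and skip slow start.
        self.ssthresh = max(self.cwnd / 2, 2.0)
        self.cwnd = self.ssthresh
```

After on_triple_dup_ack() the window equals the threshold, so the next
ACKs take the congestion avoidance branch. After on_timeout() the window
climbs by slow start until it reaches half of its earlier size.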