The idea (and most of the text) for both of the following assignments has been taken from the courses CS 244a, Stanford University, and CPSC441, University of Calgary, Canada. You are encouraged to consult the original text on those courses' websites to aid your understanding. A copy of it is also available locally. [proxy-1] [proxy-2] [tcp]
A Simple Web Proxy
The purpose of this assignment is to learn about Web proxies and the HyperText Transfer Protocol (HTTP). Along the way, you will also learn a bit about TCP/IP and socket programming.
A Web proxy is a software entity that functions as an intermediary between a Web client (browser) and a Web server. The Web proxy intercepts Web requests from clients and reformulates the requests for transmission to a Web server. When a response is received from the Web server, the proxy sends the response back to the client. While the presence of the proxy as an intermediary in the request-response interaction adds some overhead, one advantage of a proxy is that it conceals the identity of the client from the Web server. That is, from the server's point of view, the proxy is the client. Similarly, from the client's point of view, the proxy is the server. A Web proxy thus provides a single point of control to regulate Internet access between clients and servers.
In some deployments of Web proxies, the proxy is augmented to have a local storage capability called a cache. If one or more clients access the same Web content repeatedly, then a Web proxy offers a natural point to store a "local" cached copy of that Web content. By storing a copy locally, the proxy can respond to some requests immediately without contacting the origin Web server. This reduces the response time for the Web user, reduces traffic on the core Internet, and offloads the server from processing repeated requests. These are the main reasons why Web proxy caches are popular.
In this assignment, you will implement and test a simple Web proxy. This Web proxy performs the first role (proxying) but not the second role (caching). The goals of the assignment are to build a properly functioning Web proxy for simple Web pages, and then to extend the functionality in certain ways to offer some novel Web proxy features. You do not need to implement caching in this Web proxy.
The most important HTTP command for your Web proxy to handle is the "GET" request, which specifies the URL for an object to be retrieved. In the basic operation of your proxy, it should be able to parse, understand, and forward to the Web server a (possibly modified) version of the client request. Similarly, the proxy should be able to parse, understand, and return to the client a (possibly modified) version of the response that the Web server provided to the proxy. Your proxy should be able to handle response codes such as 200 (OK) and 404 (Not Found) correctly, notifying the client as appropriate. Reasonable handling of Conditional GET requests and 304 (Not Modified) responses is also desirable. Adding support for HTTP request redirection (302) is optional; such requests can be handled "recursively" by your proxy if you want to implement this. (Tricky!)
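The parse-and-forward step for a GET request can be sketched as follows. This is only an illustration, not a required design: the function name `rewrite_request` is hypothetical, and the choices to downgrade the request to HTTP/1.0 and to send `Connection: close` are assumptions made here so that the server simply closes the connection at the end of the response. The sketch assumes the browser sends an absolute URL in the request line, as browsers normally do when configured to use a proxy. Other headers, including `If-Modified-Since` for Conditional GET, pass through unchanged.

```python
from urllib.parse import urlsplit


def rewrite_request(raw: bytes):
    """Turn a browser-to-proxy request into a proxy-to-server request.

    Returns (host, port, request_bytes). Assumes the request line carries
    an absolute URL such as http://host/path.
    """
    head, _sep, rest = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    method, target, _version = lines[0].split(" ", 2)

    url = urlsplit(target)
    host = url.hostname
    port = url.port or 80
    path = url.path or "/"
    if url.query:
        path += "?" + url.query

    # Drop headers that the proxy replaces with its own values.
    dropped = ("host:", "connection:", "proxy-connection:")
    headers = [h for h in lines[1:] if not h.lower().startswith(dropped)]

    host_header = host if port == 80 else f"{host}:{port}"
    out = [f"{method} {path} HTTP/1.0",
           f"Host: {host_header}",
           "Connection: close"] + headers
    return host, port, ("\r\n".join(out) + "\r\n\r\n").encode("iso-8859-1") + rest
```

The returned host and port tell the proxy where to open its second TCP socket; the returned bytes are what it writes to that socket.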
You will need at least one TCP (stream) socket for client-proxy communication, and at least one additional TCP (stream) socket for proxy-server communication. If you want your proxy to support multiple concurrent HTTP transactions (highly recommended), you will need to fork child processes for request handling as well. Each child process will use its own socket instances for its communications with the client and with the server.
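One possible shape for the accept-and-fork loop is sketched below, assuming a Unix-like system (`os.fork` is not available on Windows). The names `make_listener`, `serve_forever`, and `handle_client`, and the default port 8080, are illustrative assumptions. The handler is where the proxy would read the client request, open its own socket to the Web server, and relay the response.

```python
import os
import signal
import socket


def make_listener(host: str = "127.0.0.1", port: int = 8080) -> socket.socket:
    """Create the TCP (stream) socket on which the proxy accepts clients."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind((host, port))
    listener.listen(16)
    return listener


def serve_forever(listener: socket.socket, handle_client) -> None:
    """Accept clients forever, handling each one in a forked child process."""
    # Ask the kernel to reap finished children so they do not become zombies.
    signal.signal(signal.SIGCHLD, signal.SIG_IGN)
    while True:
        client, _addr = listener.accept()
        if os.fork() == 0:
            # Child: it does not accept new clients, so close the listener.
            listener.close()
            try:
                handle_client(client)
            finally:
                client.close()
                os._exit(0)
        # Parent: the child owns this connection now.
        client.close()
```

A proxy would start with something like `serve_forever(make_listener(), handle_client)`. Closing the client socket in the parent matters: otherwise the connection stays open even after the child has finished with it.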
You should be able to compile and run your Web proxy on any campus machine, or even on your home machine. You should be able to use your proxy from any Web browser (e.g., Netscape Navigator, Internet Explorer, Mozilla Firefox), and from any machine (either on campus or at home). To test the proxy, you will have to configure your Web browser to use your specific Web proxy (e.g., look for menu selections like Edit, Preferences, Advanced, Proxies).
As you design and build your Web proxy, give careful consideration to how you will debug and test it. For example, you may want to print out information about requests and responses received and processed. Once you become confident with the basic operation of your Web proxy, you can toggle off the verbose debugging output. If you are testing on your home network, you can also use tools like ethereal or tcpdump to collect network packet traces. By studying the HTTP/TCP packets going to and from your proxy, you can convince yourself that it is working properly.
In your testing of the proxy, you may want to go through incremental steps similar to the following:
- download a small ASCII text file such as this 1 KB test file.
- download a larger ASCII text file such as this 16 KB test file.
- download a simple HTML file such as the projects page.
- download a modest Web page with a few embedded objects, such as your instructor's main page.
- download a more complicated Web page such as your instructor's home page.
- download a more complicated Web page such as the BBC's home page.
The primary test of correctness for your proxy is a simple visual test. That is, the content displayed by your Web browser should look the same regardless of whether you are using your Web proxy or retrieving content directly from the Web server. This mode of operation can be called "invisible" mode, since the presence of the proxy is invisible to the user.
In addition to invisible mode, please implement a "visible" mode for your proxy, wherein your proxy inserts an additional tag line such as "This page retrieved by dexter's proxy" to be displayed at the bottom of the Web page. This feature will involve modifying the HTTP response header and the HTML byte stream. You should be able to toggle between visible mode and invisible mode either in your source code, or using a command-line option when you start your proxy. Visible mode can be helpful while debugging your proxy (e.g., if you forget whether your browser is configured to use the proxy or not).
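Visible mode can be sketched roughly as below. The function name `add_tagline` is hypothetical, and the sketch assumes that the whole response has already been read, that `header` holds the status line and header lines without the terminating blank line, and that the body is neither compressed nor chunked. It shows why both parts change: the tag line goes into the HTML byte stream, and the `Content-Length` header must then be rewritten to match the new body length.

```python
def add_tagline(header: bytes, body: bytes, tagline: str):
    """Insert a tag line at the bottom of an HTML page.

    Returns (new_header, new_body). Non-HTML responses are returned unchanged.
    """
    lines = header.split(b"\r\n")
    is_html = any(
        line.lower().startswith(b"content-type:") and b"text/html" in line.lower()
        for line in lines
    )
    if not is_html:
        return header, body

    note = f"<p>{tagline}</p>".encode("iso-8859-1")
    pos = body.lower().rfind(b"</body>")
    if pos == -1:
        body = body + note                      # no closing tag: append at the end
    else:
        body = body[:pos] + note + body[pos:]   # just before </body>

    # The body grew, so the old Content-Length is wrong; replace it.
    lines = [l for l in lines if not l.lower().startswith(b"content-length:")]
    lines.append(b"Content-Length: %d" % len(body))
    return b"\r\n".join(lines), body
```

Restricting the change to `text/html` responses keeps images and other embedded objects intact.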
The suggested grading scheme for the assignment is as follows:
- 10 marks for the design and implementation of a functional Web proxy that can handle simple HTTP GET interactions between client and server, assuming OK responses. Your implementation should include proper use of TCP/IP socket programming, and suitably commented code. This proxy should support both invisible and visible modes of operation.
- 5 additional marks for a Web proxy that can properly handle requests for multiple embedded objects within a typical Web page.
- 5 additional marks for a Web proxy that can handle multiple requests concurrently using forked child processes.
- 5 additional marks for a clear and concise user manual (about 1 page) that describes how to compile, configure, and use your Web proxy. Make sure to indicate the required features and optional features (if any) that the proxy supports.
- 5 additional marks for a description of the testing of the proxy, accompanied by documented evidence (e.g., debug output, packet traces), where appropriate. The latter is particularly important if your Web proxy is not fully working. Make sure to clarify where and how the testing was done (e.g., home, university, work), and which cases were successful, and which ones were not.
TCP Traffic Analysis
The purpose of this assignment is to learn about the Transmission Control Protocol (TCP). In particular, you will write a program to analyze a specially formatted network traffic trace file, in order to assess and understand the TCP/IP protocol, including its handshaking behaviour and its protocol states.
The file 441trace.dat (270 KB ASCII data file) shows some TCP/IP packet traffic collected using a network traffic analyzer on a research network at the University of Calgary. (You may instead use a different trace file captured with Wireshark/Ethereal.) This trace contains 2,808 TCP/IP packets, and lasts about 3.6 minutes. During the period traced, a single Web client was downloading Web pages from different Web sites on the Internet. This trace is to be used for your TCP traffic analysis, and for answering the questions given below.
Each line of data in the trace file represents one TCP/IP packet. There are multiple columns of data on each line, separated by spaces. The columns, from left to right, represent:
- the timestamp (in seconds) at which the packet was seen
- the IP source address in the packet
- the IP destination address in the packet
- the size of the IP packet (including the IP header and the TCP header)
- the protocol type in the packet (always TCP in this trace)
- the source port specified in the packet
- the destination port specified in the packet
- the TCP sequence numbers carried in the packet: two values separated by a colon, giving the sequence number associated with the first byte of data and with the last byte of data in the packet. Comparing the two values shows whether the packet is carrying TCP data.
- the TCP acknowledgement number carried in the packet
- the receive window advertised by the receiver
- the TCP flags carried in the packet, if any. These flags are encoded in the trace as 'S' for a SYN packet (i.e., handshake to open a connection), 'F' for a FIN packet (i.e., handshake to close a connection), 'P' for the PUSH bit, 'A' to indicate a valid acknowledgement number, and 'R' to reset a connection due to a protocol error. These are the only possibilities in this trace.
An example of a line in this trace format is:
7.974098 192.168.1.9 -> 18.104.22.168 44 TCP 1104 80 533868 : 533868 0 win: 32768 S
This packet traveled from IP source address 192.168.1.9 (port 1104) to IP destination address 18.104.22.168 (port 80) at time 7.974098 sec. It was a SYN packet of size 44 bytes (including TCP/IP protocol headers). The proposed starting TCP sequence number was 533868. This packet carried no actual TCP data bytes. The acknowledgement field was invalid, and initialized to 0. The flow control window size advertised was 32 KB.
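A minimal parsing sketch for one line of this trace format is given below. The `Packet` type and `parse_line` function are illustrative names, not part of the assignment. The sketch assumes that fields are separated by whitespace exactly as in the example line, and that any flags appear as the final token or tokens (a packet with no flags yields an empty string).

```python
from typing import NamedTuple


class Packet(NamedTuple):
    time: float       # timestamp in seconds
    src: str          # IP source address
    dst: str          # IP destination address
    size: int         # IP packet size, headers included
    proto: str        # protocol type (always TCP in this trace)
    sport: int        # source port
    dport: int        # destination port
    seq_first: int    # first sequence number of the pair
    seq_last: int     # second sequence number of the pair
    ack: int          # acknowledgement number
    window: int       # advertised receive window
    flags: str        # e.g. "S", "PA", or "" if none


def parse_line(line: str) -> Packet:
    """Split one trace line into typed fields, skipping '->', ':' and 'win:'."""
    t = line.split()
    return Packet(
        time=float(t[0]), src=t[1], dst=t[3], size=int(t[4]), proto=t[5],
        sport=int(t[6]), dport=int(t[7]),
        seq_first=int(t[8]), seq_last=int(t[10]),
        ack=int(t[11]), window=int(t[13]),
        flags="".join(t[14:]),
    )
```

Applied to the example line above, `parse_line` yields a packet with `sport` 1104, `dport` 80, equal `seq_first` and `seq_last` (no data), and `flags` equal to `"S"`.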
You need to write a program (20 marks) for parsing and processing trace files in this format (or in a format compatible with Wireshark), and tracking TCP state information. In particular, the program processes the trace file and computes summary information about TCP connections. Note that a TCP connection is identified by a 4-tuple (IP source address, source port, IP destination address, destination port), and packets can flow in both directions on a connection (i.e., from host A to host B, and from host B to host A). Also note that the packets from different connections can be arbitrarily interleaved with each other in time, so your program will need to extract packets and associate them with the correct connection.
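One way to associate packets from both directions with the same connection is to build a key that does not depend on direction, for example by ordering the two (address, port) endpoints. The sketch below is only one possible approach; `connection_key` and `group_by_connection` are hypothetical names, and packets are represented as plain dictionaries for brevity.

```python
from collections import defaultdict


def connection_key(src: str, sport: int, dst: str, dport: int):
    """Return the same key for A->B and B->A packets of one connection."""
    a, b = (src, sport), (dst, dport)
    return (a, b) if a <= b else (b, a)


def group_by_connection(packets):
    """Map each connection key to its packets, preserving trace order."""
    connections = defaultdict(list)
    for pkt in packets:
        key = connection_key(pkt["src"], pkt["sport"], pkt["dst"], pkt["dport"])
        connections[key].append(pkt)
    return connections
```

Because the trace is processed in time order, each per-connection list is also in time order, even when packets of different connections are interleaved. Note that this simple key would merge two successive connections that happen to reuse the same 4-tuple; whether that occurs in a given trace is something to check.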
The summary information to be computed for each TCP connection includes:
- the state of the connection. Possible states are: S0F0 (no SYN and no FIN), S1F0 (one SYN and no FIN), S2F0 (two SYN and no FIN), S1F1 (one SYN and one FIN), S2F1 (two SYN and one FIN), S2F2 (two SYN and two FIN), S0F1 (no SYN and one FIN), S0F2 (no SYN and two FIN), and so on, as well as R (connection reset due to protocol error). Getting this state information correct is the most important part of your program. We are especially interested in the complete TCP connections for which we see at least one SYN and at least one FIN. For these complete connections, you can report additional information, as indicated in the following.
- the starting time, ending time, and duration of each complete connection
- the number of packets sent in each direction on each complete connection, as well as the total packets
- the number of data bytes sent in each direction on each complete connection, as well as the total bytes. This byte count is for data bytes (i.e., excluding the TCP and IP protocol headers).
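The summary items listed above can be computed for one connection roughly as follows. This is a sketch under several stated assumptions rather than the expected solution: the function name `summarize` is hypothetical; a connection containing any 'R' packet is reported as state R and not counted as complete; the client is taken to be the sender of the connection's first packet; the start and end times are those of the first and last packets seen; and the data bytes in a packet are estimated as the difference between its two sequence numbers (which is 0 in the example SYN packet shown earlier).

```python
def summarize(packets):
    """Summarise one connection from its packets, given in time order.

    Each packet is a dict with keys: time, src, sport, flags,
    seq_first, seq_last.
    """
    if any("R" in p["flags"] for p in packets):
        return {"state": "R", "complete": False}

    syn = sum("S" in p["flags"] for p in packets)
    fin = sum("F" in p["flags"] for p in packets)
    summary = {"state": f"S{syn}F{fin}", "complete": syn >= 1 and fin >= 1}
    if not summary["complete"]:
        return summary

    # Assumption: whoever sent the first packet is the client.
    client = (packets[0]["src"], packets[0]["sport"])
    count = {"client": 0, "server": 0}
    data = {"client": 0, "server": 0}
    for p in packets:
        side = "client" if (p["src"], p["sport"]) == client else "server"
        count[side] += 1
        data[side] += p["seq_last"] - p["seq_first"]

    start, end = packets[0]["time"], packets[-1]["time"]
    summary.update(
        start=start, end=end, duration=end - start,
        packets=count, total_packets=sum(count.values()),
        bytes=data, total_bytes=sum(data.values()),
    )
    return summary
```

Counting SYN and FIN packets separately is what produces states such as S1F0 or S2F2; the per-direction counters are only filled in for complete connections.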
For testing your program, here are some small example traces. The trace example1.dat (17 TCP packets) contains a single complete TCP connection. The trace example2.dat (12 TCP packets) contains a single TCP connection that is reset. The trace example3.dat (100 TCP packets) contains 8 TCP connections (5 complete, 2 reset, and 1 still in progress when the trace ended). When you have your program working properly, you can run it on the real trace file for this assignment.
Use your program, and the 441trace.dat file, to answer the following questions.
- (a) (1 mark) How many complete TCP connections are observed in the trace?
- (b) (1 mark) How many reset TCP connections are observed in the trace?
- (c) (1 mark) How many TCP connections were still open when the trace capture ended?
- (d) (1 mark) What are the minimum, mean, and maximum time durations of the complete TCP connections that you observed?
- (e) (1 mark) What are the minimum, mean, and maximum number of packets sent on the complete TCP connections that you observed?
- (f) (1 mark) What are the minimum, mean, and maximum number of data bytes sent on the complete TCP connections that you observed?
- (g) (4 marks) Find in this trace a complete TCP connection that claims to have downloaded about 33 KB of TCP data from the server. For this connection, answer the following questions.
(i) What source port number did the client use to initiate this Web transfer?
(ii) What is the IP address of the Web server?
(iii) What is the domain name of the Web site?
(iv) What is the approximate round trip time to this Web site? Justify your answer. For example, show some traceroute output illustrating the current Internet routing path to this Web site.
Last updated: 23rd of April, 2007.