C Web Server (part from me, rest given in an online EdX CS50 class from Harvard) and the plans to expand it to interface to the physical computing world:

Link to the web server in C project class files

I implemented parts of a basic web server in C, taken from an online class's description.

I also have plans to learn other ways of doing web servers, in the newer languages for that purpose that people use, as listed in my "current learning" page.

The parts I wrote for this web server project I used to get started were in the file listed in the next section called "server.c":
- lookup, to return the type of web page requested from the client browser, such as html, css, gif, ico, jpg, js (javascript), php, and png
- parse, to extract the absolute path and query from the client web browser's web page request line, and return standard web error messages if the request had an error
- load, to read all bytes from a requested web page: if html, dynamically allocate memory on the heap, and save a pointer to the web page content and its length; or, if a php web page request, invoke the php interpreter to run the requested php script, and return in content what the php script returned, along with its length
- indexes, which returns a "path/to/a/directory" with either the html or php file extension on the end of it, if it actually exists.
- This project helped me learn to implement a web server in a language I know very well, C, before learning to do that in the more common way of programming web servers these days in languages I do not know as well.
- Future plans may be to expand it to work on a board for the physical computing interface environment, where the board responds to requests from the client web browser.
- The socket communications between this server and the client browser were done by the online course people.
- The directory structure for this basic web server in C was as follows, and I simply filled in the above functions described in the file called server.c.
- The directory contained these files for getting started below.
  Makefile
  
  public directory
  
  cat.html (which has an IMG tag whose src attribute is cat.jpg, provided by the class project, and we love all friendly type animals!!)
  
  cat.jpg
  
  favicon.ico
  
  hello.html (has a form that's configured to submit via GET a text field called name to "hello.php")
  
  hello.php (mostly HTML but inside its body is a bit of PHP code to deal with html special characters)
  
  test
  
  index.html
  
  index.php
  
  server.c (implements a web server that knows how to serve static content (i.e., files ending in .html, .jpg, et al.) and dynamic content (i.e., files ending in .php))
- "Usage: server (port number) (path to root)"
  To specify a (TCP) port number on which server should listen for HTTP requests, include (port number) as a command line argument.
  
  If you do not specify a port number, the program will default to port 8080.
  
  The last command line argument to server should be the PATH to your server's "root" (the directory from which files will be served).
  
  To Test:
  
  Server is started with: "./server public"
  
  Listening on port 8080 (should be output)
  
  Using the public directory as the server's root, the directory from which files will be served.
  
  Under the public dir is the "test/index.html" file. You only specify directories or files underneath the public directory after the localhost:8080 to request web pages.
- Tests: in client browser while this server is running on same machine:
  - "http://localhost:8080/test/index.html" (starts a video of singer)
  - "http://localhost:8080/cat.jpg" (shows photo of a happy cat, and we like all happy animals!)
  - you should also see "GET /cat.jpg HTTP/1.1" in your terminal window, which is the "request line" that your browser sent to the server
  - Below that you should see all of the headers that your browser sent to the server, followed by "HTTP/1.1 200 OK", which is the server's response to the browser
  - "http://localhost:8080/test/index.php" (should do a dir listing) NOTE: still getting this to work!
- Another way to test it:
  - Open up Chrome's developer tools, per the instructions at "https://developer.chrome.com/devtools"
  - Then, once open, click the tools' Network tab, and then, while holding down Shift, reload the page.
  - Not only should you see Happy Cat again, you should also see the following in your terminal window.
  - "GET /cat.jpg HTTP/1.1"
  - "HTTP/1.1 200 OK"
  - You might also see the following.
  - "GET /favicon.ico HTTP/1.1"
  - "HTTP/1.1 200 OK"
  - What's happening is that, by convention, a lot of websites have in their root directory a "favicon.ico" file, which is a tiny icon that's meant to be displayed in a browser's address bar or tab. If you do see those lines in your terminal window, that just means the browser (Chrome in this example) is guessing that your server, too, might have a "favicon.ico" file, which it does!
- A Walkthrough Demo type test:
  - "http://localhost:8080/cat.html" (shows photo of a happy cat, with a margin around him, unlike when it was just the jpg, due to Chrome's default CSS properties)
  - If you look at the Chrome developer tools' Network tab (possibly after reloading, if they weren't still open), you should see that Chrome first requested "cat.html" followed by "cat.jpg", since the latter, recall, was specified as the value of the src attribute of the img element in "cat.html".
  - To confirm, take a look at the developer tools' Elements tab, wherein you will see a pretty printed version of the HTML in "cat.html". You can even change it, but only Chrome's in-memory copy thereof.
  - You can also tinker with the developer tools' Styles tab: even though this page doesn't have any CSS of its own, you can see and change (temporarily) Chrome's default CSS properties there.
  - If you submit your name via the form in "hello.html" and look at the resulting page's source code (as via the developer tools' Elements tab), you will see your name embedded within the HTML! That page is "dynamic" content, generated by "hello.php". By contrast, files like "cat.jpg" and "cat.html" (and even "hello.html") are "static" content, since they are not dynamically generated.
  - Testing with a browser is one technique. Testing via a command line is another, described in the following steps.
  - Open up a second terminal window and position it alongside your first.
  - In the first terminal window, execute "./server public" from within your own "~/workspace/webserver" directory, if the server isn't already running.
  - Then, in the second terminal window, execute the below. (Note the "http://" this time instead of "https://".) "curl -i http://localhost:8080/". If you haven't used curl before, it is a command line program with which you can send HTTP requests (and more) to a server in order to see its responses. The "-i" flag tells curl to include responses' HTTP headers in the output. Odds are, whilst debugging your server, you will find it more convenient (and revealing!) to see all of that via curl than by poking around Chrome's developer tools. Incidentally, take care not to request "cat.jpg" (or any binary file) via curl, else you will see quite a mess!
- "server.c" is a tour of what was written by the online class people, and what I wrote:
- In "server.c", only the lookup, parse, load, and indexes functions were written by me, as described in the previous section.
- Next I describe what was done by the online course people in "server.c".
- Atop the file are a bunch of "feature test macro requirements" that allow them to use certain functions that are declared (conditionally) in the header files further below.
- Defined next are a few constants that specify limits on HTTP requests' sizes. They (arbitrarily) based their values on defaults used by Apache, a popular web server. See "http://httpd.apache.org/docs/2.2/mod/core.html".
- Defined next is BYTES, a constant that specifies how many bytes we will eventually be reading into buffers at a time.
- Next are a bunch of header files, followed by a definition of BYTE , which we have indeed defined as an 8 bit char, followed by a bunch of prototypes.
- Finally, just above main are just a few global variables.
- main: Atop main is an initialization of what appears to be a global variable called errno . In fact, errno is defined in "errno.h" and is used by quite a few functions to indicate (via an int ), in cases of error, precisely which error has occurred. See man errno for more details.
- Shortly thereafter is a call to getopt , which is a function declared in "unistd.h" that makes it easier to parse command line arguments. See man 3 getopt if curious.
- Notice how we use getopt (and some Boolean expressions) to ensure that server is used properly.
- Next notice the call to start (for which you may have noticed a prototype earlier). More on that later.
- Below that is a declaration of a struct sigaction via which we will listen for SIGINT (i.e., "control c"), calling handler (a function defined by us elsewhere in "server.c" ) if heard.
- And then, after declaring some variables, main enters an infinite while loop.
- Atop that loop, we first free any memory that might have been allocated by a previous iteration of the loop.
- We then check whether we have been "signalled" via "control c" to stop the server.
- Thereafter, within an if statement, is a call to "connected", which returns true if a client (e.g., a browser or even curl) has connected to the server.
- After that is a call to parse, which parses a browser's HTTP request, storing its "absolute path" and "query" inside of two arrays that are passed into it by reference.
- Next is a bunch of code that decodes that path (decoding any URL encoded characters like "%20" ) and "resolves" the path to a local path, figuring out exactly what file was requested on the server itself.
- Below that, we ascertain whether that path leads to a directory or to a file and handle the request accordingly, ultimately calling list , interpret , or transfer .
- For directories (that do not have an "index.php" or "index.html" file inside them), we call list in order to display the directory's contents.
- For files ending in ".php" (whose "MIME type" is "text/x-php"), we call interpret.
- For other (supported) files, we call transfer.
- And that is it for main! Notice, though, that throughout main are a few uses of continue , the effect of which is to jump back to the start of that infinite loop. Just before continue in some cases, too, is a call to error (another function they wrote) with an HTTP status code. Together, those lines allow the server to handle and respond to errors just before returning its attention to new requests.
- connected: connected is below main. "memset()" function fills the first sizeof(client socket address) bytes of the memory area pointed to by (client socket address) with the constant byte zero.
- accept: extracts the first connection request on the queue of pending connections for the listening socket (server), sockfd, creates a new connected socket, and returns a new file descriptor referring to that socket (for this client it just connected to).
- error: error calls "reason" to determine the reason for the failure of obtaining the request for the client and places it in a phrase string. It forms a template string and then renders the template into a body string and its length. It then adds the headers and responds with the error code, header, body, and length to the client.
- freedir: This function exists simply to facilitate freeing memory that is allocated by a function called scandir that we call in list.
- handler: This function (called whenever a user hits "control c") essentially tells main to call stop by setting signaled , a global variable, to true .
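The SIGINT mechanism described above (a struct sigaction in main, plus a handler that only sets a flag) can be sketched as follows. This is a minimal illustration, not the course's code: the helper name listen_for_sigint is hypothetical, and the flag is declared as volatile sig_atomic_t (my assumption) because that is the type C guarantees can be written safely from a signal handler.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdbool.h>
#include <string.h>

/* Flag checked by the main loop; set asynchronously by the handler. */
static volatile sig_atomic_t signaled = 0;

/* Called when the user hits control-c; it only records that fact. */
static void handler(int signum)
{
    (void)signum;   /* unused */
    signaled = 1;
}

/* Hypothetical helper: registers handler for SIGINT via sigaction. */
bool listen_for_sigint(void)
{
    struct sigaction act;
    memset(&act, 0, sizeof act);
    act.sa_handler = handler;
    sigemptyset(&act.sa_mask);
    return sigaction(SIGINT, &act, NULL) == 0;
}
```

The main loop would then test signaled at the top of each iteration and call stop when it is set, so that the cleanup runs in normal program context instead of inside the handler.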
- htmlspecialchars: This function, named identically to that PHP function we saw earlier, escapes characters (e.g., < as &lt;) that might otherwise "break" an HTML page. We call it from list, lest some file or directory we are listing have a "dangerous" character in its name.
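A minimal sketch of this kind of escaping, assuming the function returns a newly allocated string that the caller frees (the actual signature in server.c may differ). The set of five escaped characters follows PHP's htmlspecialchars with quotes enabled, which is my assumption about what the course version covers.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Returns a heap-allocated copy of s with HTML-special characters
 * replaced by entities, or NULL if memory runs out. Caller frees. */
char *htmlspecialchars(const char *s)
{
    assert(s != NULL);

    /* Worst case: every character becomes a 6-byte entity. */
    char *out = malloc(strlen(s) * 6 + 1);
    if (out == NULL)
        return NULL;

    char *p = out;
    for (; *s != '\0'; s++) {
        const char *entity = NULL;
        switch (*s) {
            case '&':  entity = "&amp;";  break;
            case '<':  entity = "&lt;";   break;
            case '>':  entity = "&gt;";   break;
            case '"':  entity = "&quot;"; break;
            case '\'': entity = "&#039;"; break;
        }
        if (entity != NULL) {
            size_t n = strlen(entity);
            memcpy(p, entity, n);
            p += n;
        } else {
            *p++ = *s;   /* ordinary character: copy unchanged */
        }
    }
    *p = '\0';
    return out;
}
```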
- indexes: I wrote this function. It returns a "path/to/a/directory" with either the html or php file extension on the end of it, if it actually exists. The function, given a "/path/to/a/directory", returns "/path/to/a/directory/index.php" if "index.php" actually exists therein, or "/path/to/a/directory/index.html" if "index.html" actually exists therein, or NULL . In the first of those cases, this function should dynamically allocate memory on the heap for the returned string.
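The behaviour of indexes described above can be sketched like this. It is an illustration only, not necessarily the version in server.c: it assumes the path has no trailing slash and uses access() with F_OK to test for existence, and the returned string is heap-allocated for the caller to free.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Returns "<path>/index.php" or "<path>/index.html" (checked in that
 * order) if the file exists, else NULL. Caller frees the result. */
char *indexes(const char *path)
{
    assert(path != NULL);

    static const char *names[] = { "index.php", "index.html" };

    for (size_t i = 0; i < sizeof names / sizeof names[0]; i++) {
        /* path + '/' + name + terminating NUL */
        size_t len = strlen(path) + 1 + strlen(names[i]) + 1;
        char *candidate = malloc(len);
        if (candidate == NULL)
            return NULL;

        snprintf(candidate, len, "%s/%s", path, names[i]);
        if (access(candidate, F_OK) == 0)
            return candidate;

        free(candidate);
    }
    return NULL;
}
```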
- interpret: This function enables the server to interpret PHP files. It is a bit cryptic at first glance, but in a nutshell, all we are doing, upon receiving a request for, say, "hello.php", is executing a line like "QUERY_STRING='name=Alice' REDIRECT_STATUS=200 SCRIPT_FILENAME=/path/to/public", the effect of which is to pass the contents of "hello.php" to PHP's interpreter (i.e., php-cgi), with any HTTP parameters supplied via an "environment variable" called QUERY_STRING. Via load (a function I wrote), we then read the interpreter's output into memory. And then we respond to the browser with that (dynamically generated) output.
- popen: This function, called from interpret, opens a "pipe" to a process ("php-cgi" in our case), which provides us with a FILE pointer via which we can read that process's standard output (as though it were an actual file). You will notice how interpret then calls load in order to read the PHP interpreter's output into memory.
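To make the environment-variable-plus-pipe idea concrete, here is a small sketch. The function name run_with_query and its parameters are hypothetical, and any shell command can stand in for the PHP interpreter; it only shows that a process started with popen inherits QUERY_STRING and that its standard output can be read through the returned FILE pointer.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Runs command with QUERY_STRING set to query and copies up to
 * size - 1 bytes of its standard output into out (NUL-terminated).
 * Returns the status from pclose, or -1 on failure. */
int run_with_query(const char *command, const char *query,
                   char *out, size_t size)
{
    assert(command != NULL && query != NULL && out != NULL && size > 0);

    /* The child process started by popen inherits this variable. */
    if (setenv("QUERY_STRING", query, 1) != 0)
        return -1;

    FILE *stream = popen(command, "r");
    if (stream == NULL)
        return -1;

    /* Reading the pipe is just like reading a file. */
    size_t n = fread(out, 1, size - 1, stream);
    out[n] = '\0';

    return pclose(stream);
}
```

A real server would read the whole output with a growing buffer (as load does) and would avoid passing untrusted text into the command string itself.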
- list: A function that generates a directory listing. Notice how much code it takes to generate HTML using C, thanks to requisite memory management. (They pointed out here that with PHP this part is easier).
- load: This is a function that I wrote to read all bytes from a web page. If the request was for html, it dynamically allocates memory on the heap and saves a pointer to the web page content and its length; if the request was for a php web page, the php interpreter is invoked to run the requested php script, and load returns in content what the php script returned, along with its length. In detail, load:
  1. reads all available bytes from file.
  
  2. stores those bytes contiguously in dynamically allocated memory on the heap.
  
  3. stores the address of the first of those bytes in "*content".
  
  4. stores the number of bytes in *length.
  
  Note that content is a "pointer to a pointer" "(i.e., BYTE** )", which means that you can effectively "return" a "BYTE*" to whichever function calls load by dereferencing content and storing the address of a BYTE at "*content" . Meanwhile, length is a pointer "(i.e., size_t* )", which you can also dereference in order to "return" a "size_t" to whichever function calls load by dereferencing length and storing a number at "*length".
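The four steps and the "pointer to a pointer" convention can be sketched as below. This is a minimal illustration, assuming a signature of load(FILE*, BYTE**, size_t*) and a chunk size of 512 bytes (the CHUNK name and value are mine); it works the same way for a file on disk and for a pipe opened with popen.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef char BYTE;     /* an 8-bit char, as described for server.c */
#define CHUNK 512      /* hypothetical number of bytes read per step */

/* Reads every byte from file into one heap buffer, then "returns"
 * the buffer and its size through *content and *length. */
bool load(FILE *file, BYTE **content, size_t *length)
{
    assert(file != NULL && content != NULL && length != NULL);

    BYTE *buffer = NULL;
    size_t size = 0;
    BYTE chunk[CHUNK];
    size_t n;

    while ((n = fread(chunk, 1, sizeof chunk, file)) > 0) {
        /* Grow the buffer so the new bytes stay contiguous. */
        BYTE *grown = realloc(buffer, size + n);
        if (grown == NULL) {
            free(buffer);
            return false;
        }
        buffer = grown;
        memcpy(buffer + size, chunk, n);
        size += n;
    }
    if (ferror(file)) {
        free(buffer);
        return false;
    }

    *content = buffer;   /* address of the first byte (NULL if empty) */
    *length = size;      /* number of bytes read */
    return true;
}
```

A caller declares BYTE *content and size_t length, passes &content and &length, and frees content when done.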
- lookup: This is a function I wrote. It returns:
  - "text/css" for any file whose path ends in ".css" (or any capitalization thereof)
  - "text/html" for any file whose path ends in ".html" (or any capitalization thereof)
  - "image/gif" for any file whose path ends in ".gif" (or any capitalization thereof)
  - "image/x-icon" for any file whose path ends in ".ico" (or any capitalization thereof)
  - "image/jpeg" (not image/jpg ) for any file whose path ends in ".jpg" (or any capitalization thereof),
  - "text/javascript" for any file whose path ends in ".js" (or any capitalization thereof)
  - "text/x-php" for any file whose path ends in ".php" (or any capitalization thereof)
  - "image/png" for any file whose path ends in ".png" (or any capitalization thereof)
  - or NULL otherwise.
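One way to write such a lookup is a small table searched with a case-insensitive comparison. This is a sketch under assumptions, not necessarily the version in server.c: it takes the extension to be everything from the last '.' in the path, and it writes the MIME types with hyphens (for example "image/x-icon" and "text/x-php").

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <strings.h>   /* strcasecmp (POSIX) */

/* Returns the MIME type for path based on its extension,
 * ignoring capitalization, or NULL if it is not supported. */
const char *lookup(const char *path)
{
    assert(path != NULL);

    static const struct {
        const char *ext;
        const char *type;
    } types[] = {
        { ".css",  "text/css" },
        { ".html", "text/html" },
        { ".gif",  "image/gif" },
        { ".ico",  "image/x-icon" },
        { ".jpg",  "image/jpeg" },
        { ".js",   "text/javascript" },
        { ".php",  "text/x-php" },
        { ".png",  "image/png" },
    };

    /* The extension starts at the last '.' in the path. */
    const char *dot = strrchr(path, '.');
    if (dot == NULL)
        return NULL;

    for (size_t i = 0; i < sizeof types / sizeof types[0]; i++) {
        if (strcasecmp(dot, types[i].ext) == 0)
            return types[i].type;
    }
    return NULL;
}
```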
- parse: This is a function that I wrote to extract the absolute path and query from the client web browser's web page request line, and return standard web error messages if the request had an error. The function parses (i.e., iterates over) the "line" argument it is given, extracting its absolute path and query and storing them at "abs_path" and "query", respectively.
  abs_path: Per 3.1.1 of http://tools.ietf.org/html/rfc7230, a request line is defined as "method SP request-target SP HTTP-version CRLF", wherein SP represents a single space (" ") and CRLF represents "\r\n". None of method, request target, and HTTP version, meanwhile, may contain SP. Per 5.3 of the same RFC, request target can take several forms, the only one of which the server needs to support is "absolute path [ '?' query ]", whereby "absolute path" (which will not contain '?') must start with '/' and might optionally be followed by a '?' followed by a query, which may not contain double quotes. We had to ensure that request line (which is passed into parse as line) is consistent with these rules:
  - If it is not, we responded to the browser with "400 Bad Request" and returned false.
  - Even if request line is consistent with these rules, if method is not GET, we responded to the browser with "405 Method Not Allowed" and returned false.
  - If request target does not begin with '/', we responded to the browser with "501 Not Implemented" and returned false.
  - If request target contains a double quote, we responded to the browser with "400 Bad Request" and returned false.
  - If HTTP version is not "HTTP/1.1", we responded to the browser with "505 HTTP Version Not Supported" and returned false.
  
  If all is well, we stored "absolute path" at the address in "abs_path" (which was also passed into parse as an argument). We could assume that the memory to which "abs_path" points was at least of length "LimitRequestLine + 1".
  
  query: We stored at the address in query the query substring from request target. If that substring was absent (even if a '?' is present), then query should be "" (the empty string), thereby consuming one byte, whereby query[0] is '\0'. We could assume that the memory to which query points was at least of length "LimitRequestLine + 1". For instance, if request target is "/hello.php" or "/hello.php?", then query should have a value of "". And if request target was "/hello.php?q=Alice", then query had a value of "q=Alice".
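The rules for abs_path and query can be sketched as a single function. To keep the example self-contained it differs from the real parse in one deliberate way: instead of calling error and returning false, this hypothetical parse_request_line returns 0 on success or the HTTP status code that would be sent. It also assumes line ends with exactly one CRLF and that both output buffers are large enough.

```c
#include <assert.h>
#include <string.h>

/* Splits "method SP request-target SP HTTP-version CRLF" and stores
 * the absolute path and query. Returns 0 or an HTTP status code. */
int parse_request_line(const char *line, char *abs_path, char *query)
{
    assert(line != NULL && abs_path != NULL && query != NULL);

    /* Locate the two separating spaces and the trailing CRLF. */
    const char *sp1 = strchr(line, ' ');
    if (sp1 == NULL) return 400;
    const char *sp2 = strchr(sp1 + 1, ' ');
    if (sp2 == NULL) return 400;
    const char *version = sp2 + 1;
    const char *crlf = strstr(version, "\r\n");
    if (crlf == NULL || crlf[2] != '\0') return 400;

    size_t method_len = (size_t)(sp1 - line);
    const char *target = sp1 + 1;
    size_t target_len = (size_t)(sp2 - target);
    size_t version_len = (size_t)(crlf - version);

    /* No field may be empty, and the version may not contain SP. */
    if (method_len == 0 || target_len == 0 || version_len == 0) return 400;
    if (memchr(version, ' ', version_len) != NULL) return 400;

    if (method_len != 3 || strncmp(line, "GET", 3) != 0) return 405;
    if (target[0] != '/') return 501;
    if (memchr(target, '"', target_len) != NULL) return 400;
    if (version_len != 8 || strncmp(version, "HTTP/1.1", 8) != 0) return 505;

    /* Everything before the first '?' is the absolute path. */
    const char *mark = memchr(target, '?', target_len);
    size_t path_len = (mark != NULL) ? (size_t)(mark - target) : target_len;
    memcpy(abs_path, target, path_len);
    abs_path[path_len] = '\0';

    /* Everything after it (possibly nothing) is the query. */
    size_t query_len = (mark != NULL) ? target_len - path_len - 1 : 0;
    if (query_len > 0)
        memcpy(query, mark + 1, query_len);
    query[query_len] = '\0';

    return 0;
}
```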
- reason: This function simply mapped HTTP "status codes" (e.g., 200 ) to "reason phrases" (e.g., OK ).
- redirect: This function redirects a client to another location (i.e., URL) by sending a status code of 301 plus a Location header.
- request: When the server receives a request from a client, the server does not know in advance how many characters the request will comprise. So this function iteratively reads bytes from the client, one buffer's worth at a time, calling realloc as needed to store the entire message (i.e., request). Notice this function's use of pointers, dynamic memory allocation, pointer arithmetic, and more. Ultimately, it keeps reading bytes from the client until it encounters "\r\n\r\n" (aka CRLF CRLF), which, according to HTTP's spec, marks the end of a request's headers. Note that read() is quite like fread except that it reads from a "file descriptor" (i.e., an int) instead of from a FILE pointer (i.e., FILE*).
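The read-and-grow loop can be sketched as follows. This is a simplified illustration, not the course's implementation: the signature and the BYTES value of 512 are assumptions, it rescans the whole buffer after each read, it does not enforce the request size limits mentioned earlier, and it treats the headers as text (no embedded NUL bytes).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BYTES 512   /* assumed number of bytes read per iteration */

/* Reads from fd until CRLF CRLF is seen. On success stores a
 * NUL-terminated heap buffer in *message and its length in *length. */
bool request(int fd, char **message, size_t *length)
{
    assert(message != NULL && length != NULL);

    char *buffer = NULL;
    size_t size = 0;

    for (;;) {
        /* Make room for one more buffer's worth plus a NUL. */
        char *grown = realloc(buffer, size + BYTES + 1);
        if (grown == NULL)
            break;
        buffer = grown;

        ssize_t n = read(fd, buffer + size, BYTES);
        if (n <= 0)
            break;   /* error, or client closed before headers ended */
        size += (size_t)n;
        buffer[size] = '\0';

        /* CRLF CRLF marks the end of the request's headers. */
        char *end = strstr(buffer, "\r\n\r\n");
        if (end != NULL) {
            *length = (size_t)(end - buffer) + 4;
            buffer[*length] = '\0';
            *message = buffer;
            return true;
        }
    }

    free(buffer);
    return false;
}
```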
- respond: It is this function that actually sends a client an HTTP response, given a status code, headers, body, and that body's length.
- Know that dprintf is quite like printf (or, really, fprintf ) except that the former, like read , writes to a "file descriptor" instead of to a FILE* .
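A sketch of how reason and respond might fit together, using dprintf on a file descriptor as described. The signatures, the write_all helper, and the exact set of codes in the table (only codes mentioned on this page) are assumptions for illustration; the real server.c may differ.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Maps a status code to its reason phrase, or NULL if unknown. */
static const char *reason(unsigned short code)
{
    switch (code) {
        case 200: return "OK";
        case 301: return "Moved Permanently";
        case 400: return "Bad Request";
        case 404: return "Not Found";
        case 405: return "Method Not Allowed";
        case 501: return "Not Implemented";
        case 505: return "HTTP Version Not Supported";
        default:  return NULL;
    }
}

/* Hypothetical helper: keeps calling write until all bytes are sent. */
static bool write_all(int fd, const char *data, size_t length)
{
    while (length > 0) {
        ssize_t n = write(fd, data, length);
        if (n < 0)
            return false;
        data += n;
        length -= (size_t)n;
    }
    return true;
}

/* Sends status line, headers, a blank line, then the body.
 * headers must already end each header with CRLF. */
bool respond(int fd, unsigned short code, const char *headers,
             const char *body, size_t length)
{
    assert(headers != NULL && (body != NULL || length == 0));

    const char *phrase = reason(code);
    if (phrase == NULL)
        return false;

    if (dprintf(fd, "HTTP/1.1 %u %s\r\n", (unsigned)code, phrase) < 0)
        return false;

    /* The body may be binary (e.g., a JPEG), so it is sent with
     * write and an explicit length rather than a format string. */
    return write_all(fd, headers, strlen(headers))
        && write_all(fd, "\r\n", 2)
        && write_all(fd, body, length);
}
```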
- start: Start is the function that configures the server to listen for connections on a particular TCP port!
- stop: Stop does the opposite, freeing all memory and ultimately compelling the server to exit, without even returning control to main.
- transfer() This function's purpose in life is to transfer a file from the server to a client. Whereas interpret handles dynamic content (generated by PHP scripts), transfer handles static content (e.g., JPEGs). Notice how this function calls load in order to read some file from disk.
- urldecode() This function, also named after a PHP function, URL decodes a string, converting special characters like "%20" back to their original values.
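URL decoding can be sketched in a few lines. This is an illustration under assumptions rather than the course's code: it returns a new heap string, turns '+' into a space (as PHP's urldecode does), and copies a '%' unchanged when it is not followed by two hexadecimal digits.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Returns a heap-allocated, URL-decoded copy of s. Caller frees. */
char *urldecode(const char *s)
{
    assert(s != NULL);

    /* Decoding never makes the string longer. */
    char *out = malloc(strlen(s) + 1);
    if (out == NULL)
        return NULL;

    char *p = out;
    while (*s != '\0') {
        if (s[0] == '%' && isxdigit((unsigned char)s[1])
                        && isxdigit((unsigned char)s[2])) {
            /* "%20" -> the byte with hexadecimal value 20 (a space) */
            char hex[3] = { s[1], s[2], '\0' };
            *p++ = (char)strtol(hex, NULL, 16);
            s += 3;
        } else if (*s == '+') {
            *p++ = ' ';
            s++;
        } else {
            *p++ = *s++;
        }
    }
    *p = '\0';
    return out;
}
```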
- Current Status of Web Server in C: Works for HTML static page requests, but not for a PHP request to return a directory, and the simple Perl script mentioned below has not yet been tried.
- Possible plans for expanding the web server implemented in C described above (there are 2 parts to it):
  This part is similar to what I did in the online class assignment: write a C/C++ program that implements a web server. This web server will conform to "HTTP/1.x" for the purposes of client requests, and it will need to process client HTTP GET requests for web pages hosted on the server machine. It will need to use sockets to implement the communication between a client on one machine and the server on either the same or a remote machine.
- Add Interaction of physical computing with the web server: The new part, described in more detail below, will be adding functionality to the web server to support interaction with a physical device, a case where computers interact with the physical world through a collection of sensors and actuators. This forms a physical computing environment.
- Web server Description:
  The Basic HTTP Protocol: The basic structure of interaction between a web client and web server is as follows:
  - Client sends request (from a suitable browser)
  - GET filename HTTP/version
  - optional arguments
  - a blank line
  - Server sends reply
  - "HTTP/version" status code status message
  - additional information
  - a blank line
  - content
  - It will need to ensure the information sent back from the server is formatted as described above.
  - The additional information sent back in a server reply is of the form:
    "Content-type:text/plain"
    
    "text/html"
    
    "image/gif"
    
    "image/jpeg"
    
    "xxx/yyy"
  - The Server
    - Assume the server forks a child process for each incoming request.
    - Additionally, assume each request involves establishing a new connection with the server, rather than maintaining any notion of a session for multiple requests.
    - The web server is in a file called "webserv.c".
    - For execution, bind the server to a specific port.
    - To run the web server, type:
      "$./webserv port-number"
      
      where "port-number" should be a value in the range 5000-65535 such as 8080.
    - The server will handle a series of requests, such as listing the contents of a directory on the server machine, retrieving a file for viewing on the client, and running cgi scripts.
    - In particular, the server will handle static and dynamic content requests. In the former case, the content is simply retrieved from a preexisting file on the server. For dynamic content, you will be expected to run a program on the server, to process data and generate an HTML compliant file for sending back to the client.
    - The web server will handle HTTP status codes 200 (successful request), 404 (Not Found), and 501 (Not Implemented). For status code 501, the server does not recognize the request method. Further information about status codes can be found.
  - The Client
    The client is any web browser of your choice.
    
    Requests from the client should be in the form:
    - "http://ip.address.of.server:port-number/request"
      "ip.address.of.server" the IP address of the server machine
      
      "port-number" numeric port on which the server listens.
      
      Together with the IP address, this identifies an end point of communication (or socket) to which the client connects.
      
      "request" either a subdirectory on the server that you wish to list, the name of an html file, or a cgi file. In the latter case, a reference to a script on the server is executed to perform some command. The content of a cgi script, such as "test.cgi", must be set executable on the server and must refer to a shell or Perl script such as the following:
      
      "test.cgi" (set executable using chmod 755 "test.cgi"):
      
      "#!/bin/sh"
      
      "# test.cgi" a simple test
      
      printf "Content-type: text/plain\n\nThis is a test!\n"
      
      To execute a Perl script, you can issue a request such as:
      
      "http://ip.address.of.server:port-number/request.cgi"
      
      where "request.cgi" is an executable Perl script on the server having contents such as:
      
      "#!/usr/bin/perl"
      
      "# perl-test.cgi -- a simple Perl script test"
      
      print "Content-type: text/plain\n\nThis is a Perl test!\n";
  - Basic Test Cases: All of the following test cases will be supported:
    - A request for a directory listing
    - A request for a valid (and a nonexistent) html file. NOTE: A nonexistent request corresponds to an HTTP error status code of 404.
    - A request for a static image (in either gif or jpeg format, having a file ending of .gif, .jpg or .jpeg)
    - A request for a cgi script that requires execution of a basic shell command, executed using sh
    - A request for a perl script in a cgi file to process raw data and format it into an html file.
    - A request for a dynamically-created image using gnuplot on the server. Information about gnuplot can be found at: "http://www.gnuplot.info/"
    - For the latter case, above, it is assumed that the request specifies a cgi file describing a perl script. The perl script will process data as described in the next subsection.
    - Dynamic Content using Gnuplot: In this case, a program called "my-histogram" will be executed on the server, as follows: "$my-histogram file pattern1 pattern2 ... patternN". Here, file specifies the name of a file you wish to search for all occurrences of a given regular expression pattern or string sequence. For example: "$my-histogram file 'and' 'but' 'so' 'he.*lo'" will tally all occurrences of the words "and", "but" and "so" in file, along with all strings that match the pattern "he.*lo", such as "hello". You can assume all regular expression patterns that are acceptable to grep "-e" are valid. You can assume the number of pattern arguments is limited to 5.
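As one possible starting point for the tallying step, the sketch below counts non-overlapping matches of a pattern in a string using the POSIX regex functions. The function name count_matches is hypothetical, it works on an in-memory string rather than a file, it uses POSIX basic regular expression syntax (close to, but not guaranteed identical to, what grep -e accepts), and it counts substring matches, so "and" would also be counted inside "band".

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Returns the number of non-overlapping, non-empty matches of
 * pattern in text, or -1 if the pattern does not compile. */
long count_matches(const char *text, const char *pattern)
{
    assert(text != NULL && pattern != NULL);

    regex_t re;
    if (regcomp(&re, pattern, 0) != 0)   /* 0: basic regex syntax */
        return -1;

    long count = 0;
    const char *p = text;
    regmatch_t m;

    /* After the first search, p is no longer the start of the line. */
    while (*p != '\0' &&
           regexec(&re, p, 1, &m, p == text ? 0 : REG_NOTBOL) == 0) {
        if (m.rm_eo == m.rm_so) {
            /* Empty match: step over one character so we advance. */
            if (p[m.rm_so] == '\0')
                break;
            p += m.rm_so + 1;
            continue;
        }
        count++;
        p += m.rm_eo;   /* continue right after this match */
    }

    regfree(&re);
    return count;
}
```

Because matching is leftmost-longest, a pattern such as "he.*lo" matches once per line at most when the line contains several candidates, which may or may not be the tally wanted for the histogram.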
    - Once "my-histogram" has tallied all occurrences of the matching strings for each pattern, the results will be plotted as a histogram using gnuplot. The output of "my-histogram" will be piped to gnuplot using a Perl command as follows: "open (GNUPLOT, '|gnuplot'); # Notice the vertical bar for a pipe". After that, piping commands to gnuplot is analogous to writing to a file. "my-histogram" will be written in C, but any language can be used (Python, Perl, etc.). You are also free to use shell commands such as "grep -e" if you wish, or the built in Perl regular expression features. The output of the gnuplot histogram will be formatted to show "frequency" up the y axis and the labelled patterns on the x axis, so there is one frequency bar per pattern.
    - Next, gnuplot will be commanded to output the histogram to a file that records the information in gif or jpeg format.
    - After this, the cgi script will send your gnuplot gif or jpeg image back to the client for viewing.
    - Pretty Printed Output Just as this webpage has been formatted using html, an executable on the server will be invoked as part of your CGI script to pretty print your histogram.
    - Specifically, the histogram image file will be embedded in an HTML page that has a 16pt RED font title and white background. The title should read: "My Webserver". (I may experiment with the generated HTML content, producing image backgrounds and additional details. The base case will be formatted as described, however.)
    - The title will be centered on the page. Below it, will be a blank line (spacing of which is your choosing) followed by the histogram, which is also centered.
    - Advanced Features: A multi threaded web server: Instead of using "fork()" calls for each client request, spawn a thread using my own thread creation routines, based on the sigaltstack() method. Specifically, the "make/get/set/swapcontext()" functions and any pre-existing thread packages (e.g., pthreads) will NOT be used; I will write my own thread management code instead.
    - A web cache: I will develop a method to cache files in RAM for subsequent requests. The RAM cache should be a pool of memory of some defined size. This will be a tunable parameter from 4KB to 2MB.
    - Upon initialization, the cache is empty, but gets filled for each file request until it is full. At that point I will adopt a simple replacement strategy of my choice (e.g., first in first out, random, or least recently used).
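To illustrate the first-in-first-out option, here is a small sketch of a byte-limited cache. Everything in it (the cache and entry types, the function names, storing entries in a linked queue) is a hypothetical design for illustration, not a decision already made for the project.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

typedef struct entry {
    char *name;           /* request path used as the key */
    char *data;           /* cached file contents */
    size_t size;
    struct entry *next;   /* next-oldest entry */
} entry;

typedef struct {
    entry *head, *tail;   /* head is the oldest entry */
    size_t used, capacity;
} cache;

void cache_init(cache *c, size_t capacity)
{
    c->head = c->tail = NULL;
    c->used = 0;
    c->capacity = capacity;   /* e.g., anywhere from 4KB to 2MB */
}

static void evict_oldest(cache *c)
{
    entry *e = c->head;
    c->head = e->next;
    if (c->head == NULL) c->tail = NULL;
    c->used -= e->size;
    free(e->name); free(e->data); free(e);
}

const entry *cache_get(const cache *c, const char *name)
{
    for (const entry *e = c->head; e != NULL; e = e->next)
        if (strcmp(e->name, name) == 0) return e;
    return NULL;
}

/* Copies data into the cache, evicting oldest entries until it fits. */
bool cache_put(cache *c, const char *name, const char *data, size_t size)
{
    assert(c != NULL && name != NULL && data != NULL);
    if (size > c->capacity || cache_get(c, name) != NULL) return false;
    while (c->used + size > c->capacity) evict_oldest(c);

    entry *e = malloc(sizeof *e);
    if (e == NULL) return false;
    e->name = malloc(strlen(name) + 1);
    e->data = malloc(size > 0 ? size : 1);
    if (e->name == NULL || e->data == NULL) {
        free(e->name); free(e->data); free(e);
        return false;
    }
    strcpy(e->name, name);
    memcpy(e->data, data, size);
    e->size = size;
    e->next = NULL;
    if (c->tail != NULL) c->tail->next = e; else c->head = e;
    c->tail = e;
    c->used += size;
    return true;
}

void cache_free(cache *c) { while (c->head != NULL) evict_oldest(c); }
```

Switching to least recently used would mainly mean moving an entry to the tail of the queue whenever cache_get finds it.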
    - I will specify my replacement method in a README file. To make the cache beneficial, you should support requests that are both in the server's filesystem and also on a remote host.
    - Client requests should provide an optional argument to indicate the remote host location for files that are not stored in the server's local filesystem.
    - To test this feature, the server will act like a client for a remote host machine, thereby retrieving the necessary file(s) for placement in the web cache. In turn, these files will be relayed back to the original client.
    - Server configuration. For testing purposes, a way will be provided to disable the above advanced features, so that the web server falls back to operating in normal mode (without web caching and threads). This means I will produce only one version of my code. To enable or disable features of my server, I will either use a configuration script, pass in command line arguments, or (worst case) use defined constants within my code.
    - Physical Computing (NOTE: I may use what I learned in my embedded systems online class projects for this section, not sure yet, and that is described separately).
    - To tackle this part of the assignment requires me to have access to an Arduino Uno or similar Arduino compatible device. (In my embedded systems class, we used a TI Cortex M Arm based microcontroller Launch board, but I may instead use an Arduino). These can be purchased for about "$5.99" (roughly the price of two coffees) from places such as Microcenter. You can also buy a good quality starter kit from Amazon, which is a little more expensive but includes everything to get going with some basic building projects. If I have access to a Raspberry Pi or other similar single board computer, I can use that too.
    - The idea: as a way to do physical computing, I will be creative for this physical computing section. One idea would be to have an Arduino board connected via a serial interface to a server PC running the web server I have created.

Karen Shay West's Home Page
