CMPSC 311, Spring 2013, Project 3

Posted Jan. 30, 2013.  Due Feb. 8, 2013, by 11:55 pm on ANGEL.  25 points.

The goal of this project is to develop something like part of the first phase of a C compiler.  You should create a project3 directory along the lines of Project 2.  Again, if you have any problems using Unix, the editors or the compilers, don't hesitate to ask for help.  Otherwise, do this project on your own.

• CP:AMA
• Sec. 2.1 (Writing a Simple Program; Compiling and Linking), Sec. 2.3 (Comments), Sec. 2.8 (Layout of a C Program; tokens), Ch. 2 Q&A (p. 31)
• Sec. 3.1 (The printf Function; Escape Sequences)
• Sec. 7.3 (Character Types; Escape Sequences)
• Sec. 13.1 (String Literals), Ch. 13 Q&A (p. 304-305)
• Sec. 22.1 (Text Files versus Binary Files), Ch. 22 Q&A (p. 577-578)
• other sections as mentioned below
• CS:APP
• Sec. 1.2, 1.3 (review)
• Sec. 2.1.5, Representing Strings
• C:ARM
• Ch. 15 intro (Input/Output Facilities), Sec. 15.1 (FILE, EOF), Sec. 15.4 (stdin, stdout, stderr), Sec. 15.6 (getc), Sec. 15.9 (putc), Sec. 15.14 (feof, ferror)
• Ch. 2 intro (Lexical Elements), Sec. 2.1 intro (Character Set), Sec. 2.1.2 (Whitespace and Line Termination), Sec. 2.1.4 (Trigraphs), Sec. 2.2 (Comments), Sec. 2.7.5 (Escape Characters), Sec. 2.7.6 (Character Escape Codes), Exercise 4 in Sec. 2.10
• APUE
• Sec. 5.1 (Standard I/O Library, Introduction), Sec. 5.2 (Streams and FILE Objects), Sec. 5.3 (Standard Input, etc.), Sec. 5.6 (Reading and Writing a Stream)
• Some parts of these sections might not be directly relevant to the project, but they will all be useful later.

Tokenization is the compiler step that groups characters into tokens; this is also called lexical analysis.  Syntactic analysis and code generation follow lexical analysis.  This project deals with some of the C compiler steps that must be applied before tokenization.
• Some parts of the C language will be ignored here for simplicity, such as multibyte characters and wide characters (CP:AMA, Sec. 25.2; C:ARM, Sec. 2.1.5).
• In C, outside the preprocessor, the categories of tokens are keywords, identifiers, constants, string literals, and punctuators.
Whitespace characters consist of space, end-of-line (newline in Unix), vertical tab, form feed, and horizontal tab.  Note that the two-character sequence \n is not the same as the single character newline.  Whitespace is ignored but is used to separate adjacent tokens as the compiler checks the syntax of the program.  Comments are treated as whitespace, and are replaced by spaces or newlines (depending on how the comment was written) before tokenization.

Here is one point that you should understand, because it often confuses people.  When you write a program, the four-character sequence '\r' in the program source file represents a single abstract character (the carriage return character).  When the program is translated by the compiler, those four characters in the source file are translated into one token, and then into one numeric value in the executable file (an int value in C and a char value in C++).  The actual numeric value that is used is taken from a table that defines the execution environment's coding system for characters.  On the computers you have available today, you can reasonably expect that the coding system will be ASCII, a 7-bit code that coincides with the first 128 code points of Unicode, but that isn't guaranteed.  You should not use the numeric codes directly in your program, and (unless you are absolutely certain which computer you are using) you should not even implicitly assume properties of the coding system.  Functions like isalpha() in <ctype.h> are the proper place for those assumptions.
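
As a small illustration of the point above, here is a sketch that counts letters using isalpha() instead of comparing against numeric codes, and that checks the type of a character constant.  The function names count_letters and char_constant_is_int are made up for this example; they are not part of the project.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: count the letters in a string without assuming
 * anything about the numeric codes of the execution character set. */
size_t count_letters(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++) {
        /* isalpha() expects a value representable as unsigned char
         * (or EOF), hence the cast. */
        if (isalpha((unsigned char)*s))
            n++;
    }
    return n;
}

/* In C, the character constant '\r' has type int, so this returns 1
 * when compiled as C.  (In C++ the constant has type char.) */
int char_constant_is_int(void)
{
    return sizeof '\r' == sizeof(int);
}
```

Note that nothing here mentions 13, 0x0D, or a range such as 'a' through 'z'; the library and the compiler supply those details.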

Here is how to see the output of the C preprocessor; this will be useful for comparison at various times.  Suppose your source file is xyz.c.  The command
cc -E xyz.c
will send the result of the preprocessor to standard output, and the command
cc -E xyz.c > xyz.i
will write the result of the preprocessor to the file xyz.i.  Just as the filename xyz.c tells the compiler to expect C source code, the name xyz.i tells the compiler to expect C source code that does not need to go through the preprocessor.  Of course, you can replace "cc" with "c99" or "gcc" or "gcc -std=c99" and get the same behavior.

This project should be implemented as four programs connected through a command pipeline.  To run the set of programs on the source file xyz.c, use the command

cat xyz.c | pass1 | pass2 | pass3 | pass4

We will describe each pass separately, but remember that the ordering is important.  Project 3 requires you to implement all four passes as separate programs, to emphasize the use of pipelines and concurrent processes.

The overall structure of each pass should be something like this, where c0 is the current character for inspection, and c1 is the next character.  Either of these could be EOF.  The pair [c0, c1] acts like a moving window into the input stream.  The variable shift acts as an indicator for whether or not to "single shift" the pair [c0, c1] to [c1, getc(stdin)], or to "double shift" it to [getc(stdin), getc(stdin)].  If you already know about shift registers, that's all that's going on here.

int c0, c1;

c0 = getc(stdin); c1 = getc(stdin);

while (c0 != EOF) {
  int shift = 1;

  /* ... make a decision using c0 and c1, change some variables, maybe call putc() ... */

  if (shift == 1)
    { c0 = c1; c1 = getc(stdin); }
  else /* shift == 2 */
    { c0 = getc(stdin); c1 = getc(stdin); }
}

When you need to print a warning message, use stderr with fprintf().  Examples will be given.

The first part of compiling a C program begins with some more-or-less simple transformations.  We will simplify these transformations for this project, but not by much.  There are four steps, which must occur in the given order:
• certain control characters (FF, HT, VT) are replaced by a space character, and end-of-line markers (CR, LF, CRLF) are replaced with the newline character (LF), as described in CP:AMA, Sec. 22.1 (Text Files versus Binary Files), or C:ARM Sec. 2.1.2 (second paragraph);
• ISO trigraphs are replaced according to CP:AMA, Sec. 25.3, or C:ARM Sec. 2.1.4;
• source line continuation is implemented as described in CP:AMA, pp. 278-279 and 318, or C:ARM Sec. 2.1.2 (third paragraph);
• comments are removed and replaced with whitespace, as described in CP:AMA, p. 31, or C:ARM, Sec. 2.2.
We will speak of "the input C program", but in fact for this project the input is arbitrary and does not need to resemble a program at all.

In the first step, the whitespace characters form feed, vertical tab and horizontal tab (FF, VT, HT) are each replaced with a single space character, the two-character sequence return-newline (CR-LF) is replaced by a single newline character (LF), and the carriage return character (CR) is replaced by a single newline character.  All other characters are unchanged from input to output.  The intent is that "strange spaces" become normal spaces, and all line-endings become Unix-style line-endings.
• This replacement occurs inside or outside string literals as written in the input C program.
• Do not replace the two characters \r or the four characters '\r', etc.  For example, printf("\n\n\r\r\n") should not be changed, as it contains only representations of the newline and carriage return characters, not the characters themselves.
• It is not necessary to assume that end-of-line is Unix-style or DOS-style, since either case becomes Unix-style afterward.
• When developing your program, it would be a good idea to implement and test this step first, as the second step is more complicated.  The project starter kit shows how to do a simple test of each step separately.
In the second step, ISO trigraphs are removed and replaced according to CP:AMA, Sec. 25.3, or C:ARM Table 2-2.
• This replacement occurs inside or outside character constants and string literals.
• For example, '??(' becomes '[', "??(" becomes "[", and A??(i??) becomes A[i].
• C:ARM Sec. 2.1.4 has a technical explanation of how trigraphs work, and why trigraphs are part of C, but does not stress one essential point.  If you want your program to be read and understood quickly by a human being, do not use trigraphs.  The compiler supports trigraphs because, in some cases, they solve a real problem.  Your solution to this project can be written without using trigraphs, but if you think it cannot, then ask for a suggestion on how to avoid them.
In the third step, the two-character sequence backslash-newline is removed and not replaced.
• This replacement occurs inside or outside character constants and string literals.
• Again, note that '\n' does not contain the two-character sequence backslash-newline, nor does "\\\n".
In the fourth step, comments are removed and replaced with a single space character (if originally on one line), or with multiple newline characters (if originally on multiple lines).  This preserves the original formatting and line numbers of the program.

Any other character transformations that would be done by a real compiler should be omitted from this project.
• In particular, the alternate token spellings (digraphs and macros) described in CP:AMA Sec. 25.3, or  C:ARM Sec. 2.4, cannot be applied inside a string literal.  Recognizing string literals is similar to recognizing /* ... */ comments in Pass 4.

Here is some more detailed information, and a few examples.

Pass 1, Whitespace and End-of-Line Markers

See CP:AMA, Sec. 22.1 (Text Files versus Binary Files), Ch. 22 Q&A (p. 577-578), or C:ARM, Sec. 2.1.2, paragraph 2.

The whitespace characters form feed, vertical tab and horizontal tab are each replaced with a single space character, the two-character sequence return-newline is replaced by a single newline character, and the return character is replaced by a single newline character.  All other characters are unchanged from input to output.  This occurs for the entire source file, including inside character constants and string literals.
• This step is not specified in the C Standard.  The GNU C preprocessor implements it, and the Sun C preprocessor does not; see the file print-return.c for an example.
• The name 'carriage return' is archaic since today almost no one uses typewriters or teletypes with actual carriages, so 'return' is more common.  As an action, it means "go back to the beginning of the same line".  The 'form feed' action is "go to the beginning of the next page".  For some historical notes on the 'vertical tab', see here.
• For the Unicode Newline Guidelines, see here.  That's more than is needed for this project.
Here's an example for Pass 1.

% cat example1.c
#include <stdio.h>
int main(void) { printf("a\f1\v2\t3\nb\rc\r\n"); return 0; }

% cc example1.c

% a.out | od -c

0000000    a  \f   1  \v   2  \t   3  \n   b  \r   c  \r  \n
0000015

% a.out | pass1 | od -c

0000000    a       1       2       3  \n   b  \n   c  \n
0000014

% cat example1.c | pass1 | od -c

0000000    #   i   n   c   l   u   d   e       <   s   t   d   i   o   .
0000020    h   >  \n   i   n   t       m   a   i   n   (   v   o   i   d
0000040    )       {       p   r   i   n   t   f   (   "   a   \   f   1
0000060    \   v   2   \   t   3   \   n   b   \   r   c   \   r   \   n
0000100    "   )   ;       r   e   t   u   r   n       0   ;       }  \n
0000120
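
One way Pass 1 could be organized with the [c0, c1] schema is sketched below.  This is only a sketch under assumptions, not a reference solution: the function name pass1 and its FILE * parameters are chosen here for illustration, so that the same code can be run on stdin/stdout or on test files.

```c
#include <stdio.h>

/* Sketch of Pass 1: FF, VT and HT become a space; CR LF and a lone CR
 * become a single newline; everything else is copied unchanged. */
void pass1(FILE *in, FILE *out)
{
    int c0 = getc(in);
    int c1 = getc(in);

    while (c0 != EOF) {
        int shift = 1;

        switch (c0) {
        case '\f':
        case '\v':
        case '\t':
            putc(' ', out);
            break;
        case '\r':
            putc('\n', out);
            if (c1 == '\n')
                shift = 2;      /* CR LF is one end-of-line marker */
            break;
        default:
            putc(c0, out);
            break;
        }

        if (shift == 1) {
            c0 = c1; c1 = getc(in);
        } else {                /* shift == 2 */
            c0 = getc(in); c1 = getc(in);
        }
    }
}
```

A pass1.c built around this sketch would only need a main() that calls pass1(stdin, stdout).  Fed the output of a.out from the example above, it should produce the same bytes as the od -c listing shown there.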

Pass 2, Trigraph Replacement

See CP:AMA, Sec. 25.3, or C:ARM, Sec. 2.1.4.

Certain multi-character sequences are converted to single characters in the source character set.  This occurs for the entire source file, including inside character constants and string literals.

This pass requires looking ahead three characters, and a switch statement would be useful.

See also CP:AMA, Sec. 7.3 (Character Types; Escape Sequences), or C:ARM, Sec. 2.7.5, Escape Characters, and Sec. 2.7.6, for the character escape code \? .  The latter explains why you don't need to do anything here about \? .
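
A possible shape for Pass 2 is sketched here, assuming the two-character window is widened to three characters [c0, c1, c2] and the replacement is chosen with a switch statement.  The names trigraph and pass2 are made up for this illustration, and the code itself is written so that no trigraph appears in it.

```c
#include <stdio.h>

/* Given the third character of a candidate trigraph, return the
 * replacement character, or 0 if the sequence is not a trigraph. */
static int trigraph(int c)
{
    switch (c) {
    case '=':  return '#';
    case '(':  return '[';
    case '/':  return '\\';
    case ')':  return ']';
    case '\'': return '^';
    case '<':  return '{';
    case '!':  return '|';
    case '>':  return '}';
    case '-':  return '~';
    default:   return 0;
    }
}

/* Sketch of Pass 2: replace each trigraph by a single character. */
void pass2(FILE *in, FILE *out)
{
    int c0 = getc(in);
    int c1 = getc(in);
    int c2 = getc(in);

    while (c0 != EOF) {
        int r = 0;

        if (c0 == '?' && c1 == '?')
            r = trigraph(c2);   /* EOF falls into the default case */

        if (r != 0) {
            putc(r, out);       /* triple shift: consume all three */
            c0 = getc(in); c1 = getc(in); c2 = getc(in);
        } else {
            putc(c0, out);      /* single shift */
            c0 = c1; c1 = c2; c2 = getc(in);
        }
    }
}
```

With a single shift on a non-match, three question marks followed by ( come out as one question mark followed by [, which is also what a real compiler does.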

Pass 3, Source Line Continuation

See CP:AMA, Sec. 13.1 (Continuing a String Literal), or C:ARM, Sec. 2.1.2, paragraph 3.

A line ending with backslash is joined with the following line.

Don't forget that, inside a character constant or a string literal, the sequence \\ is the character escape code for the backslash character.  If you see backslash followed by any character other than newline, the backslash should be copied to the output without change.  It is only backslash followed by newline that represents source line continuation, and then the backslash and newline are not sent to the output.

There are actually four cases to consider when a backslash is noticed:

 backslash end-of-file            faulty input
 backslash (not newline)          looks like an escape character
 backslash newline                source line continuation
 backslash newline end-of-file    faulty input

In the case of source line continuation, we expect to find another character after the newline.  At this point, we don't bother checking that the second case is a proper escape character; that would be done later, in tokenization and lexical analysis.

There are two potential warning messages that should be issued in this pass, with output to stderr:
• fprintf(stderr, "warning: backslash at end of file\n");
• fprintf(stderr, "warning: backslash-newline at end of file\n");
Examples are given later.
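
The four backslash cases and the two warnings could be handled along the following lines.  This is a sketch under assumptions: the function name pass3 and its FILE * parameters are illustrative, and the choice to test for end-of-file right after a double shift is just one way to detect the backslash-newline-EOF case.

```c
#include <stdio.h>

/* Sketch of Pass 3: delete every backslash-newline pair and copy
 * everything else, warning about the two faulty end-of-file cases. */
void pass3(FILE *in, FILE *out)
{
    int c0 = getc(in);
    int c1 = getc(in);

    while (c0 != EOF) {
        int shift = 1;

        if (c0 == '\\' && c1 == '\n') {
            shift = 2;          /* line continuation: output nothing */
        } else {
            if (c0 == '\\' && c1 == EOF)
                fprintf(stderr, "warning: backslash at end of file\n");
            putc(c0, out);      /* includes backslash (not newline) */
        }

        if (shift == 1) {
            c0 = c1; c1 = getc(in);
        } else {                /* shift == 2 */
            c0 = getc(in); c1 = getc(in);
            if (c0 == EOF)
                fprintf(stderr,
                        "warning: backslash-newline at end of file\n");
        }
    }
}
```

Because the window only moves forward, a backslash that ends up in front of a newline after a splice is not examined again, so it is copied to the output like any other character.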

Pass 4, Comment Removal

See CP:AMA, Sec. 2.3 (Comments), or C:ARM, Sec. 2.2, Comments, Sec. 2.3, Tokens, Sec. 2.7, Constants.

Comments are treated as whitespace.  Each comment is replaced by one space character, or by several newline characters, depending on how the comment was written.  Note that the interiors of character constants (like 'a') and string literals (like "a") do not contain comments, so this pass is more complex than the first three.  You need to do some elements of tokenization here, but only to recognize the beginning and end of character constants, string literals, and the two forms of comments.
• The string literal "/* xyz */" looks like it contains a comment, but it doesn't, so don't change the string.  It goes to the output of this pass as-is.
The two forms of comment are
/* extending to the first */
// extending to the end of line

Note that this pass occurs after source line continuation, so the comment
// foo \
bar
will already have been modified to
// foo bar

The /* */ form can extend across source code lines.  It should be replaced with one space character if all on one line, or with one or more newline characters if on multiple lines (as a partial attempt to preserve the original number of lines).

Comments are not recognized within other comments.  The first example (a nested comment) is illegal in Standard C and the second is one comment, not two.  The third example is not terminated (it might be terminated on a later line).
/* /* something */ */
// /* something */
/* // something

The structure of this pass is more complicated than the first three, so here it is in more detail, using the program schema presented earlier.

String literals (CP:AMA, Sec. 13.1 and pp. 304-305, or C:ARM Sec. 2.7.4)
• If c0 is " (double-quote), copy characters up to (and including) the next ".  Note that if the character sequence \" appears, just copy it; this does not end the string literal.
• You don't need to do anything special for a wide-character string literal, which looks like L"...".
• Strings that were continued across lines with backslash-newline should have been fixed in Pass 3.
• Two adjacent strings, such as "abc" "def", will be concatenated in a later pass (not part of this project).
Character constants (CP:AMA, Sec. 7.3, or C:ARM Sec. 2.7.3)
• If c0 is ' (single-quote), copy characters up to (and including) the next '.  Note that if the character sequence \' appears, just copy it; this does not end the character constant.
• You don't need to do anything special for a wide-character constant, which looks like L'...'.
Single-line comments (CP:AMA, Sec. 2.3 and pp. 31-32, or C:ARM Sec. 2.2)
• If c0 is / and c1 is /, read characters up to (and including) the next newline, and output only one space character and a newline.
• If the comment is not terminated (you found // but not a following newline before end-of-file) then
fprintf(stderr, "warning: unterminated // comment, fixed, Pass 4\n");
and keep going (you should be at the end of the file at this point).
Multi-line comments (CP:AMA, Sec. 2.3 and pp. 31-32, or C:ARM Sec. 2.2)
• If c0 is / and c1 is *, read characters up to (and including) the next */ sequence, and output only one space character (if you had not seen a newline along the way) or n newlines (if you found n newlines before the close of the comment, for n >= 1).
• Note that the /* */ form is replaced by either one space character, or one or more newline characters, and each case is treated as whitespace in later compiler steps.
• If the comment is not terminated (you found /* but not */ before end-of-file) then
fprintf(stderr, "warning: unterminated /* */ comment, fixed, Pass 4\n");
and keep going (you should be at the end of file at this point).
Escape characters and escape codes (CP:AMA, Sec. 7.3, or C:ARM Secs. 2.7.5-7)
• Remember, \' does not start or end a character constant, and \" does not start or end a string literal.
Here's an example for Pass 4.  For display purposes, we added an extra : character before and after the space character that replaces a single-line comment, and added a + character before every newline that replaces part of a multiline comment.

% cat example4.c
/* first */
char foo[] = "// looks like a comment";         /* is a comment */
char bar[] = "/* looks like a comment */";      // is a comment
char blit[] = "\a\b\n\t'\"";    // blit
char blot0 = '"';               // blot0
char *blot1 = "'";              // blot1
/* second
* third */     // fourth /* inside a comment */
void func(void) // one
// two
{
// three
int a /* b */ c;/* xyz */
int d/*e*/f;// xyz
return; // four
// five
}

// six

char blat[] = "";               // blat
char blat0[] = "x";             // blat0
char blat1 = 'x';               // blat1
char blat2[] = "\a";            // blat2
char blat3 = '\a';              // blat3
char blat4[] = "\"y";           // blat4
char blat5[] = "z\"";           // blat5
char blat6 = '\'';              // blat6

/* a
* b
* c
*/

/* x
* y

% pass4 < example4.c
: :
char foo[] = "// looks like a comment";         : :
char bar[] = "/* looks like a comment */";      : :
char blit[] = "\a\b\n\t'\"";    : :
char blot0 = '"';               : :
char *blot1 = "'";              : :
+
: :
void func(void) : :
: :
{
: :
int a : : c;: :
int d: :f;: :
return; : :
: :
}

: :

char blat[] = "";               : :
char blat0[] = "x";             : :
char blat1 = 'x';               : :
char blat2[] = "\a";            : :
char blat3 = '\a';              : :
char blat4[] = "\"y";           : :
char blat5[] = "z\"";           : :
char blat6 = '\'';              : :

+
+
+

+

warning: unterminated /* */ comment, fixed, Pass 4
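
The rules for Pass 4 can be sketched as a small state machine on top of the [c0, c1] schema.  This is one possible arrangement, not the required one: the names state, emit_block and pass4 are invented here, and the "fix" applied to an unterminated comment (emit the replacement as if the comment had been closed at end-of-file) is an assumption.  The : and + display markers from the example above are not produced.

```c
#include <stdio.h>

enum state { CODE, STRING, CHARCONST, LINE_COMMENT, BLOCK_COMMENT };

/* Replacement for a block comment: one space if it contained no
 * newline, otherwise one newline for each newline it contained. */
static void emit_block(FILE *out, int newlines)
{
    if (newlines == 0)
        putc(' ', out);
    while (newlines-- > 0)
        putc('\n', out);
}

/* Sketch of Pass 4: remove comments, leave literals alone. */
void pass4(FILE *in, FILE *out)
{
    enum state st = CODE;
    int newlines = 0;           /* newlines seen in a block comment */
    int c0 = getc(in);
    int c1 = getc(in);

    while (c0 != EOF) {
        int shift = 1;

        switch (st) {
        case CODE:
            if (c0 == '/' && (c1 == '/' || c1 == '*')) {
                st = (c1 == '/') ? LINE_COMMENT : BLOCK_COMMENT;
                newlines = 0;
                shift = 2;
            } else {
                if (c0 == '"')  st = STRING;
                if (c0 == '\'') st = CHARCONST;
                putc(c0, out);
            }
            break;
        case STRING:
        case CHARCONST:
            putc(c0, out);
            if (c0 == '\\' && c1 != EOF) {
                putc(c1, out);  /* escape: copy both characters */
                shift = 2;
            } else if (c0 == (st == STRING ? '"' : '\'')) {
                st = CODE;      /* closing quote */
            }
            break;
        case LINE_COMMENT:
            if (c0 == '\n') {
                putc(' ', out);
                putc('\n', out);
                st = CODE;
            }
            break;
        case BLOCK_COMMENT:
            if (c0 == '\n') {
                newlines++;
            } else if (c0 == '*' && c1 == '/') {
                emit_block(out, newlines);
                st = CODE;
                shift = 2;
            }
            break;
        }

        if (shift == 1) { c0 = c1; c1 = getc(in); }
        else            { c0 = getc(in); c1 = getc(in); }
    }

    if (st == LINE_COMMENT) {
        fprintf(stderr, "warning: unterminated // comment, fixed, Pass 4\n");
        putc(' ', out);
        putc('\n', out);
    } else if (st == BLOCK_COMMENT) {
        fprintf(stderr, "warning: unterminated /* */ comment, fixed, Pass 4\n");
        emit_block(out, newlines);
    }
}
```

The double shift after /* matters: it keeps the sequence /*/ from being mistaken for a complete comment.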

(This is the end of the Pass 4 description, you can start breathing again.)

The remaining steps of compilation are
• the C preprocessor, for textual substitutions as directed by preprocessor command lines;
• tokenization, or lexical analysis, to collect characters into tokens;
• syntactic analysis, to decide if the program is legal;
• code generation, to produce machine instructions in an object file.
Most compilers are also capable of running the linker, to produce a single executable file from multiple object files.

Note that the preprocessor also goes through tokenization, syntactic analysis and code generation, but there are some additional tokens such as the # and ## operators, the syntax is different, and "code generation" produces plain text output.  You can see this output by using the -E compiler option.

The C tokenization step includes replacing alternate token spellings (digraphs, CP:AMA, Sec. 25.3, or C:ARM Sec. 2.4), which does not occur inside character strings.  You don't need to consider these for this project.

Project starter kit and testing methods

The programs you turn in (pass1.c, pass2.c, pass3.c, pass4.c) should all read from stdin, write to stdout, and write warning messages to stderr.

Here is a Makefile that you might find useful.  This is part of the materials to be turned in.

For the following files, be careful not to introduce DOS-style end-of-line markers when you save the files, or to add or delete space characters by copy-and-paste.  Use the browser to save the files directly to your project3 directory.

Create a file with whitespace and end-of-line characters, test your Pass 1 solution
• print-test1.c
• test1-in
• The web browser might display a box, or nothing at all, instead of CR, FF and VT.
• The vi editor may display CR, FF, VT as ^M, ^L, ^K, for control-M, control-L, control-K.
• test1-out-correct
• This should display cleanly in the browser and editor.
• These files were created with a correct version of pass1, and  cc -o test1 print-test1.c
test1 > test1-in
cat test1-in | pass1 > test1-out-correct
• If you get the response  test1: Command not found.  then try using  ./test1  instead.
• od -c test1-in
• cat test1-in | pass1 > test1-out 
• od -c test1-out

• diff test1-out-correct test1-out
• cat test1-in | pass1 | diff test1-out-correct -
• If you got it right, there will be no output.
• diff reports differences between its input files (- represents standard input).  When there is nothing to say, it's quiet.
Create a file with trigraphs, test your Pass 2 solution
Create a file with source line continuations, test your Pass 3 solution
Create a file with all of the above, test your combined solution
• test123-out-correct
• cat test[123]-in | pass1 | pass2 | pass3 | diff test123-out-correct - 
• If you got it right, there will be no output.
Another test case, also mentioned later in the Q&A section

Further notes

If you decide to write a recursive solution, instead of the iterative solution that has been suggested, be sure that the depth-of-recursion is not proportional to the size of the input, measured in characters.

If you generate an infinite loop or an infinite recursion, type control-C to interrupt and terminate the process.  This works even if the command you used was a pipeline, with more than one process.

If you run the vi or vim editor on test1-in, the whitespace and line break characters are displayed in the editor buffer as shown in the third column of this table.

 '\n'   newline (end of line)
 '\r'   carriage return         ^M
 '\f'   form feed               ^L
 '\v'   vertical tab            ^K

If you want to check your work against an actual C compiler, put your input data in a file hack.c, and compile it with one of
[Sun]    cc -E hack.c
[GCC]    gcc -E -trigraphs hack.c
The -E option tells the compiler to print the preprocessor's result to standard output.  Note that the file hack.c does not need to contain a valid C program, or even anything close to one.  To emphasize that trigraphs are NOT part of normal C programs, the GNU compiler requires you to make an extra effort to enable them.

You should definitely consider adding some additional warning messages, but make sure these go to stderr, not stdout.  Here are some examples; follow this style as closely as you can, and not just for Pass 1.
warning: empty input, Pass 1
warning: input error, Pass 1
Note that the file /dev/null is guaranteed to be empty.
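
Here is a sketch of how those two extra warnings might be produced, using feof/ferror-style checks after the read loop.  The function copy_with_warnings and its pass parameter are hypothetical; a real pass would do its own transformation inside the loop instead of a plain copy.

```c
#include <stdio.h>

/* Copy in to out and return the number of characters copied.  After
 * the loop, report a read error or an empty input on stderr. */
long copy_with_warnings(FILE *in, FILE *out, int pass)
{
    long count = 0;
    int c;

    while ((c = getc(in)) != EOF) {
        putc(c, out);
        count++;
    }

    /* getc() returns EOF both at end-of-file and on error, so ask
     * the stream which one it was. */
    if (ferror(in))
        fprintf(stderr, "warning: input error, Pass %d\n", pass);
    else if (count == 0)
        fprintf(stderr, "warning: empty input, Pass %d\n", pass);

    return count;
}
```

Redirecting input from /dev/null is a quick way to exercise the empty-input branch.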

When your code is complete and you are satisfied it is right, here is how to turn it in for a grade.

Log in to one of the CSE Linux systems, cd to your cmpsc311/project3 directory, maybe transfer files from your home system, and run some final tests just to make sure.
• The grading tests will use the GCC-compiled program on Linux.
• We will also use Sun's compiler on Solaris to see if any compiler errors or warnings are reported differently.
• The lint program on Solaris is often useful for catching problems not reported by the compilers.
Your code should be in four files named pass1.c, pass2.c, pass3.c and pass4.c which each start like 
/* CMPSC 311, Spring 2013, Project 3, Pass 1
 *
 * Author:   <your name>
 * Email:    <your preferred email address>
 *... <additional comment text>
 */

except, of course, the proper pass number and your own name and address should be there.  If you have some functions that are shared between pass[1234].c, these should go into the optional files pr3lib.h and pr3lib.c.

You should have a file named Makefile, which could be the same as the one linked above.  Modify Makefile, to enter your own name and address.  Otherwise, just add to it, and don't remove any of the targets that are already there.

Pick up the file wrap, and don't change it.  This contains the shell script that will bundle your project files.

Execute the command
sh wrap
which (if you have done everything right) should produce output like this.  If you made a mistake, you would get output like this instead.  You can expect some variation in the output because of different user names, file modification times, file sizes, Linux vs. Solaris, etc.

The wrap script will create a "gzipped tar file" containing the files README, Makefile, pass1.c, etc., and some additional data.  The actual name of the file depends on your username.  Tar stands for "tape archive"; it accumulates a collection of files into one file, preserving ownership and date information.  gzip does file compression, and the original tar file will be recovered by using gunzip.

Log in to ANGEL and put the file project-3-username.tar.gz in the ANGEL Dropbox for Project 3 (with your username substituted, of course).
• By now you have the SSH Client binary mode / text mode stuff working right.  If not, review the Project 2 instructions.
The ANGEL Dropbox will be open for two days after the due date, but you really should get this in on time.  Nothing will be accepted more than two days late.

• It is important that you write programs that are easily understood by another person.  Correct programs should be obviously correct.  Testing will help confirm correctness, but that's not the whole story.  If a knowledgeable person cannot read your program and immediately see that it is correct, then it isn't well written, and you should try to improve it.  You can expect points to be deducted for a "bad program", even if it passes the tests.
• It is important that your programs compile without errors.  The testing will be automated as much as possible, and there are too many people in the class for the grader to spend time fixing your program.  Ask for help from the TA or instructor before the project is due.
• In general, grades will be split among three parts:
• "the program is readable and well-written"
• "the program compiles without errors (even better, without warnings)"
• "the program runs, terminates cleanly, and produces correct output"
• Depending on the assignment, we may require output in a specific format.  If so, there will be examples.  In particular, this project requires output in a specific format, as shown by the examples above.
If you have questions about the assignment description or expectations, or need more examples, send mail to dheller at cse.psu.edu, with the subject line "CMPSC 311 Project 3" (this will help keep things organized).

Q&A

How is backspace treated in the source file?

C:ARM, p. 12, suggests that it acts like a space.  Suppose the source file xyz.c has a backspace character in it (remember, that's an actual backspace, not '\b' which only represents a backspace character).  The command "cc -E xyz.c" shows that the backspace is not removed by the preprocessor.  Actually, you need to run "cc -E xyz.c | od -c" to see this clearly.  The command "cc -c xyz.c", which just produces the object file xyz.o, shows that the backspace is an invalid source character (in an identifier, it acts like a space, separating tokens) but a valid character in a string (it remains in the object file, in the data section).

What should you do if  \\ ends a line?  Is this a case of source line continuation, or is it just an escape character?

The pair of characters \\ would not normally appear except inside a character constant or a string literal, where it does not act like source line continuation, because the immediately-following character is not a newline.

The source file (let's call it x.c)
 
int foo\
bar;
int foo\\
bar;
int c = '\\';
char d[] = "\\";
char e[] = "\\\
";

should become

int foobar;
int foo\bar;
int c = '\\';
char d[] = "\\";
char e[] = "\\";

Try the Sun C preprocessor

% cc -E x.c
# 1 "x.c"
int foobar;
# 3
int foo\bar;
# 5
int c = '\\';
char d[] = "\\";
char e[] = "\\";
#ident "acomp: Sun C 5.8 Patch 121015-01 2006/01/26"

Try the GNU C preprocessor

% gcc -E x.c
# 1 "x.c"
# 1 "<built-in>"
# 1 "<command line>"
# 1 "x.c"
int foobar;

int foo\bar;

int c = '\\';
char d[] = "\\";
char e[] = "\\";

Try an incorrect implementation of Pass 3

% cat x.c | pass3
int foobar;
int foo\\
bar;
int c = '\\';
char d[] = "\\";
char e[] = "\\";

What should you do if   backslash newline   ends the file?

This case cannot occur in a legal C program; see Clause 2 below.  For this project, it should bring a warning and drop the last two bytes.

Here's an example.  The program that generates a.out is simply doing  printf("This\\\n");

% cc example2.c
% a.out | od -c
0000000    T   h   i   s   \  \n
0000006
% a.out | pass1 | od -c
0000000    T   h   i   s   \  \n
0000006
% a.out | pass1 | pass2 | od -c
0000000    T   h   i   s   \  \n
0000006
% a.out | pass1 | pass2 | pass3 | od -c
warning: backslash-newline at end of file
0000000    T   h   i   s
0000004

What should you do if only   backslash   ends the file?

This case cannot occur in a legal C program; see Clause 2 below.  Here's a similar example.

% cc example3.c
% a.out | od -c
0000000    T   h   i   s   \
0000005
% a.out | pass1 | od -c
0000000    T   h   i   s   \
0000005
% a.out | pass1 | pass2 | od -c
0000000    T   h   i   s   \
0000005
% a.out | pass1 | pass2 | pass3 | od -c
warning: backslash at end of file
0000000    T   h   i   s   \
0000005

Note that the warning message is directed to stderr while the output of pass[123] is directed to stdout.  The warning would be upgraded to an error in an actual C compiler.

One more pathological case:

Suppose the file content per od -c is    T   h   i   s   \   \   \n   \n
The inner backslash-newline is a line continuation, so it is removed, but the outer (remaining) backslash-newline is not.

How should nested comments be treated?

Nested // comments present no problem - just discard characters from // to the first newline character, and it doesn't matter if some of them look like a comment.

Nested /* */ comments are not legal in C.  It's ok to write pass4.c in a way that doesn't bother checking for illegal C.

Here's an example.  For display purposes, we added an extra : character before and after the space character that replaces a single-line comment, and added a + character before every newline that replaces part of a multiline comment.

% cat < test4a-in.c
/* comment 1
/* comment 2
*/ // end of comment 2
*/ // end of comment 1
int a = 1;
// comment 3 /* comment 4 */
int b = 2;

% pass4 < test4a-in.c
+
+
: :
*/ : :
int a = 1;
: :
int b = 2;

% gcc -c test4a-in.c
test4a-in.c:4: error: syntax error before ‘/’ token

In this example, the single-quote on the first line matches the single quote on the fifth line, so all the comments between them are just copied without change.  Is that right?
That's enough.
/* first */
char foo[] = "// looks like a comment";         /* is a comment */
char bar[] = "/* looks like a comment */";      // is a comment
char blit[] = "\a\b\n\t'\"";    // blit

For this project, this is acceptable.  In a later pass, the compiler would be expected to complain about a multi-character constant, but you don't need to do that here.

I have a weird error in lint when I have it test pass4.  I may be misunderstanding the error but I don't see why pass4 would be dependent on pass2.
eru 110% lint pass4.c
lint: errors in pass4.c; no output created
lint: pass2 not run - errors in pass4.c

The last line refers to the second pass of lint itself, not to your pass2 program.  The first pass of lint is basically a compiler; the second pass does the checking.

(If there are further questions, they will be added here.)

If you want to see the actual rules for C, here is an excerpt from the C99 Standard, Sec. 5.1.1.2, Translation phases.

The precedence among the syntax rules of translation is specified by the following phases.

[footnote - Implementations shall behave as if these separate phases occur, even though many are typically folded together in practice.  Source files, translation units, and translated translation units need not necessarily be stored as files, nor need there be any one-to-one correspondence between these entities and any external representation.  The description is conceptual only, and does not specify any particular implementation.]

1.  Physical source file multibyte characters are mapped, in an implementation-defined manner, to the source character set (introducing new-line characters for end-of-line indicators) if necessary.  Trigraph sequences are replaced by corresponding single-character internal representations.
• This corresponds to Pass 1 and Pass 2 of the current project.
2.  Each instance of a backslash character (\) immediately followed by a new-line character is deleted, splicing physical source lines to form logical source lines.  Only the last backslash on any physical source line shall be eligible for being part of such a splice.  A source file that is not empty shall end in a new-line character, which shall not be immediately preceded by a backslash character before any such splicing takes place.
• This corresponds to Pass 3 of the current project.
3.  The source file is decomposed into preprocessing tokens and sequences of white-space characters (including comments).  A source file shall not end in a partial preprocessing token or in a partial comment.  Each comment is replaced by one space character.  New-line characters are retained.  Whether each nonempty sequence of white-space characters other than new-line is retained or replaced by one space character is implementation-defined.
• This corresponds to Pass 4 of the current project.
4.  Preprocessing directives are executed, macro invocations are expanded, and _Pragma unary operator expressions are executed.  If a character sequence that matches the syntax of a universal character name is produced by token concatenation (6.10.3.3), the behavior is undefined.  A #include preprocessing directive causes the named header or source file to be processed from phase 1 through phase 4, recursively.  All preprocessing directives are then deleted.

5.  Each source character set member and escape sequence in character constants and string literals is converted to the corresponding member of the execution character set; if there is no corresponding member, it is converted to an implementation-defined member other than the null (wide) character.

6.  Adjacent string literal tokens are concatenated.

7.  White-space characters separating tokens are no longer significant.  Each preprocessing token is converted into a token.  The resulting tokens are syntactically and semantically analyzed and translated as a translation unit.

8.  All external object and function references are resolved.  Library components are linked to satisfy external references to functions and objects not defined in the current translation.  All such translator output is collected into a program image which contains information needed for execution in its execution environment.
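
As an illustration of the trigraph replacement described in phase 1, here is a minimal sketch.  The function names trigraph_char and replace_trigraphs and the buffer interface are hypothetical, and this is not necessarily how your own pass should be organized.  The nine trigraphs each consist of two question marks followed by one of the characters = ( / ) ' < ! > - and they stand for # [ \ ] ^ { | } ~ respectively.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: given the third character of a candidate
   trigraph, returns the character the trigraph stands for, or 0 if the
   candidate is not a trigraph. */
static char trigraph_char(char third)
{
    switch (third) {
    case '=':  return '#';
    case '(':  return '[';
    case '/':  return '\\';
    case ')':  return ']';
    case '\'': return '^';
    case '<':  return '{';
    case '!':  return '|';
    case '>':  return '}';
    case '-':  return '~';
    default:   return 0;
    }
}

/* Hypothetical helper: copies src[0..n) to dst, replacing each trigraph
   with its single-character equivalent.  dst must have room for n bytes;
   the return value is the number of bytes written. */
size_t replace_trigraphs(const char *src, size_t n, char *dst)
{
    size_t i = 0, out = 0;

    assert(src != NULL && dst != NULL);
    while (i < n) {
        char r = 0;

        /* A trigraph needs three characters: two question marks and a
           third character taken from the table above. */
        if (i + 2 < n && src[i] == '?' && src[i + 1] == '?')
            r = trigraph_char(src[i + 2]);
        if (r != 0) {
            dst[out++] = r;
            i += 3;
        } else {
            dst[out++] = src[i++];
        }
    }
    return out;
}
```

The input is scanned left to right, so in a run of three question marks followed by = the first question mark is copied and the remaining three characters are replaced by #.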

Last revised, 29 Jan. 2013 (Q&A added 7 Feb. 2013)