CMPSC 311, Spring 2013, Project 3

Posted Jan. 30, 2013.  Due Feb. 8, 2013, by 11:55 pm on ANGEL.  25 points.

The goal of this project is to develop something like part of the first phase of a C compiler.  You should create a project3 directory along the lines of Project 2.  Again, if you have any problems using Unix, the editors or the compilers, don't hesitate to ask for help.  Otherwise, do this project on your own.



Reading



Let's start with some background information and vocabulary.

Tokenization is the compiler step that groups characters into tokens; this is also called lexical analysisSyntactic analysis and code generation follow lexical analysis.  This project deals with some of the C compiler steps that must be applied before tokenization.
Whitespace characters consist of space, end-of-line (newline in Unix), vertical tab, form feed, and horizontal tab.  Note that the two-character sequence \n is not the same as the single character newline.  Whitespace is ignored but is used to separate adjacent tokens as the compiler checks the syntax of the program.  Comments are treated as whitespace, and are replaced by spaces or newlines (depending on how the comment was written) before tokenization.

Here is one point that you should understand, that often confuses people.  When you write a program, the four-character sequence '\r' in the program source file represents a single abstract character (the carriage return character).  When the program is translated by the compiler, those four characters in the source file are translated into one token, and then into one numeric value in the executable file (an int value in C and a char value in C++).  The actual numeric value that is used is taken from a table that defines the execution environment's coding system for characters.  On the computers you have available today, you can reasonably expect that the coding system will be ASCII, which is the same as 8-bit Unicode, but that isn't guaranteed.  You should not use the numeric codes directly in your program, and (unless you are absolutely certain which computer you are using) you should not even implicitly assume properties of the coding system.  Functions like isalpha() in <ctype.h> are the proper place for those assumptions.

Here is how to see the output of the C preprocessor; this will useful for comparison at various times.  Suppose your source file is xyz.c.  The command
cc -E xyz.c
will send the result of the preprocessor to standard output, and the command
cc -E xyz.c > xyz.i
will write the result of the preprocessor to the file xyz.i.  Just as the filename xyz.c tells the compiler to expect C source code, the name xyz.i tells the compiler to expect C source code that does not need to go through the preprocessor.  Of course, you can replace "cc" with "c99" or "gcc" or "gcc -std=c99" and get the same behavior.



This project should be implemented as four programs connected through a command pipeline.  To run the set of programs on the source file xyz.c, use the command

cat xyz.c | pass1 | pass2 | pass3 | pass4

We will describe each pass separately, but remember that the ordering is important.  Project 3 requires you to implement all four passes as separate programs, to emphasize the use of pipelines and concurrent processes.

The overall structure of each pass should be something like this, where c0 is the current character for inspection, and c1 is the next character.  Either of these could be EOF.  The pair [c0, c1] acts like a moving window into the input stream.  The variable shift acts as an indicator for whether or not to "single shift" the pair [c0, c1] to [c1, getc(stdin)], or to "double shift" it to [getc(stdin), getc(stdin)].  If you already know about shift registers, that's all that's going on here.

int c0, c1;

c0 = getc(stdin); c1 = getc(stdin);

while (c0 != EOF) {
  int shift = 1;

  ... make a decision using c0 and c1, change some variables, maybe call putc()

  if (shift == 1)
    { c0 = c1; c1 = getc(stdin); }
  else /* shift == 2 */
    { c0 = getc(stdin); c1 = getc(stdin); }
}

When you need to print a warning message, use stderr with fprintf().  Examples will be given.



The first part of compiling a C program begins with some more-or-less simple transformations.  We will simplify these transformations for this project, but not by much.  There are four steps, which must occur in the given order:
We will speak of "the input C program", but in fact for this project the input is arbitrary and does not need to resemble a program at all.

In the first step, the whitespace characters form feed, vertical tab and horizontal tab (FF, VT, HT) are each replaced with a single space character, the two-character sequence return-newline (CR-LF) is replaced by a single newline character (LF), and the carriage return character (CR) is replaced by a single newline character.  All other characters are unchanged from input to output.  The intent is that "strange spaces" become normal spaces, and all line-endings become Unix-style line-endings.
In the second step, ISO trigraphs are removed and replaced according to CP:AMA, Sec. 25.3, or C:ARM Table 2-2.
In the third step, the two-character sequence backslash-newline is removed and not replaced.
In the fourth step, comments are removed and replaced with a single space character (if originally on one line), or with multiple newline characters (if originally on multiple lines).  This preserves the original formatting and line numbers of the program.

Any other character transformations that would be done by a real compiler should be omitted from this project.



Here is some more detailed information, and a few examples.

Pass 1, Whitespace and End-of-Line Markers

See CP:AMA. Sec. 22.1 (Text Files versus Binary Files), Ch. 22 Q&A (p. 577-578), or C:ARM, Sec. 2.1.2, paragraph 2.

The whitespace characters form feed, vertical tab and horizontal tab are each replaced with a single space character, the two-character sequence return-newline is replaced by a single newline character, and the return character is replaced by a single newline character.  All other characters are unchanged from input to output.  This occurs for the entire source file, including inside character constants and string literals.
Here's an example for Pass 1.

% cat example1.c
#include <stdio.h>
int main(void) { printf("a\f1\v2\t3\nb\rc\r\n"); return 0; }

% cc example1.c


% a.out | od -c

0000000    a  \f   1  \v   2  \t   3  \n   b  \r   c  \r  \n            
0000015

% a.out | pass1 | od -c

0000000    a       1       2       3  \n   b  \n   c  \n                
0000014

% cat example1.c | pass1 | od -c

0000000    #   i   n   c   l   u   d   e       <   s   t   d   i   o   .
0000020    h   >  \n   i   n   t       m   a   i   n   (   v   o   i   d
0000040    )       {       p   r   i   n   t   f   (   "   a   \   f   1
0000060    \   v   2   \   t   3   \   n   b   \   r   c   \   r   \   n
0000100    "   )   ;       r   e   t   u   r   n       0   ;       }  \n
0000120

Pass 2, Trigraph Replacement


See CP:AMA. Sec. 25.3, or C:ARM, Sec. 2.1.4.

Certain multi-character sequences are converted to single characters in the source character set.  This occurs for the entire source file, including inside character constants and string literals.

This pass requires looking ahead three characters, and a switch statement would be useful.

See also CP:AMA, Sec. 7.3 (Character Types; Escape Sequences), or C:ARM, Sec. 2.7.5, Escape Characters, and Sec. 2.7.6, for the character escape code \? .  The latter explains why you don't need to do anything here about \? .

Pass 3, Source Line Continuation

See CP:AMA. Sec. 13.1 (Continuing a String Literal), or C:ARM, Sec. 2.1.2, paragraph 3.

A line ending with backslash is joined with the following line.

Don't forget that, inside a character constant or a string literal, the sequence \\ is the character escape code for the backslash character.  If you see backslash followed by any character other than newline, the backslash should be copied to the output without change.  It is only backslash followed by newline that represents source line continuation, and then the backslash and newline are not sent to the output.

There are actually four cases to consider when a backslash is noticed:

backslash end-of-file faulty input
backslash (not newline) looks like an escape character
backslash newline source line continuation
backslash newline end-of-file
faulty input

In the case of source line continuation, we expect to find another character after the newline.  At this point, we don't bother checking that the second case is a proper escape character; that would be done later, in tokenization and lexical analysis.

There are two potential warning messages that should be issued in this pass, with output to stderr:
Examples are given later.

Pass 4, Comment Removal


See CP:AMA. Sec. 2.3 (Comments), or C:ARM, Sec. 2.2, Comments, Sec. 2.3, Tokens, Sec. 2.7, Constants.

Comments are treated as whitespace.  Each comment is replaced by one space character, or by several newline characters, depending on how the comment was written.  Note that the interior of character constants (like 'a') and string literals (like "a") do not contain comments, so this pass is more complex than the first three.  You need to do some elements of tokenization here, but only to recognize the beginning and end of character constants, string literals, and the two forms of comments.
The two forms of comment are
/* extending to the first */
// extending to the end of line

Note that this pass occurs after source line continuation, so the comment
// foo \
bar
will already have been modified to
// foo bar

The /* */ form can extend across source code lines.  It should be replaced with one space character if all on one line, or with one or more newline characters if on multiple lines (as a partial attempt to preserve the original number of lines).

Comments are not recognized within other comments.  The first example (a nested comment) is illegal in Standard C and the second is one comment not two,  The third example is not terminated (it might be terminated on a later line).
/* /* something */ */
// /* something */
/* // something

The structure of this pass is more complicated than the first three, so here it is in more detail, using the program schema presented earlier.

String literals (CP:AMA, Sec. 13.1 and pp. 304-305, or C:ARM Sec. 2.7.4)
Character constants (CP:AMA, Sec. 7.3, or C:ARM Sec. 2.7.3)
Single-line comments (CP:AMA, Sec. 2.3 and pp. 31-32, or C:ARM Sec. 2.2)
Multi-line comments (CP:AMA, Sec. 2.3 and pp. 31-32, or C:ARM Sec. 2.2)
Escape characters and escape codes (CP:AMA, Sec. 7.3, or C:ARM Secs. 2.7.5-7)
Here's an example for Pass 4.  For display purposes, we added an extra : character before and after the space character that replaces a single-line comment, and added a + character before every newline that replaces part of a multiline comment.

% cat example4.c
/* first */
char foo[] = "// looks like a comment";         /* is a comment */
char bar[] = "/* looks like a comment */";      // is a comment
char blit[] = "\a\b\n\t'\"";    // blit
char blot0 = '"';               // blot0
char *blot1 = "'";              // blot1
/* second
 * third */     // fourth /* inside a comment */
void func(void) // one
// two
{
  // three
  int a /* b */ c;/* xyz */
  int d/*e*/f;// xyz
  return; // four
  // five
}

// six

char blat[] = "";               // blat
char blat0[] = "x";             // blat0
char blat1 = 'x';               // blat1
char blat2[] = "\a";            // blat2
char blat3 = '\a';              // blat3
char blat4[] = "\"y";           // blat4
char blat5[] = "z\"";           // blat5
char blat6 = '\'';              // blat6

/* a
 * b
 * c
 */

/* x
 * y

% pass4 < example4.c
: :
char foo[] = "// looks like a comment";         : :
char bar[] = "/* looks like a comment */";      : :
char blit[] = "\a\b\n\t'\"";    : :
char blot0 = '"';               : :
char *blot1 = "'";              : :
+
     : :
void func(void) : :
: :
{
  : :
  int a : : c;: :
  int d: :f;: :
  return; : :
  : :
}

: :

char blat[] = "";               : :
char blat0[] = "x";             : :
char blat1 = 'x';               : :
char blat2[] = "\a";            : :
char blat3 = '\a';              : :
char blat4[] = "\"y";           : :
char blat5[] = "z\"";           : :
char blat6 = '\'';              : :

+
+
+


+

warning: unterminated /* */ comment, fixed, Pass 4


(This is the end of the Pass 4 description, you can start breathing again.)



The remaining steps of compilation are
Most compilers are also capable of running the linker, to produce a single executable file from multiple object files.

Note that the preprocessor also goes through tokenization, syntactic analysis and code generation, but there are some additional tokens such as the # and ## operators, the syntax is different, and "code generation" produces plain text output.  You can see this output by using the -E compiler option.

The C tokenization step includes replacing alternate token spellings (digraphs, CP:AMA, Sec. 25.3, or C:ARM Sec. 2.4), which does not occur inside character strings.  You don't need to consider these for this project.



Project starter kit and testing methods

The programs you turn in (pass1.c, pass2.c, pass3.c, pass4.c) should all read from stdin, write to stdout, and write warning messages to stderr.

Here is a Makefile that you might find useful.  This is part of the materials to be turned in.

For the following files, be careful not to introduce DOS-style end-of-line markers when you save the files, or to add or delete space characters by copy-and paste.  Use the browser to save the files directly to your project3 directory.

Create a file with whitespace and end-of-line characters, test your Pass 1 solution
Create a file with trigraphs, test your Pass 2 solution
Create a file with source line continuations, test your Pass 3 solution
Create a file with all of the above, test your combined solution
Another test case, also mentioned later in the Q&A section



Further notes

If you decide to write a recursive solution, instead of the iterative solution that has been suggested, be sure that the depth-of-recursion is not proportional to the size of the input, measured in characters.

If you generate an infinite loop or an infinite recursion, type control-C to interrupt and terminate the process.  This works even if the command you used was a pipeline, with more than one process.

If you run the vi or vim editor on test1-in, the whitespace and line break characters are displayed in the editor buffer as shown in the third column of this table.

'\n'
newline
(end of line)
'\r'
carriage return
^M
'\f'
form feed ^L
'\v'
vertical tab ^K

If you want to check your work against an actual C compiler, put your input data in a file hack.c, and compile it with one of
[Sun]    cc -E hack.c
[GCC]   
gcc -E -trigraphs hack.c
The -E option tells the compiler to print the preprocessor's result to standard output.  Note that the file hack.c does not need to contain a valid C program, or even anything close to one.  To emphasize that trigraphs are NOT part of normal C programs, the GNU compiler requires you to make an extra effort to enable them.

You should definitely consider adding some additional warning messages, but make sure these go to stderr, not stdout.  Here are some examples; follow this style as closely as you can, and not just for Pass 1.
warning: empty input, Pass 1
warning: input error, Pass 1

Note that the file /dev/null is guaranteed to be empty.



When your code is complete and you are satisfied it is right, here is how to turn it in for a grade.

Login to one of the CSE Linux systems, cd to your cmpsc311/project3 directory, maybe transfer files from your home system, and run some final tests just to make sure.
Your code should be in four files named pass1.c, pass2.c, pass3.c and pass4.c which each start like
/* CMPSC 311, Spring 2013, Project 3, Pass 1
*
* Author: <your name>
* Email: <your preferred email address>
*
... <additional comment text>
*/
except, of course, the proper pass number and your own name and address should be there.  If you have some functions that are shared between pass[1234].c, these should go into the optional files pr3lib.h and pr3lib.c.

You should have a file named Makefile, which could be the same as the one linked above.  Modify Makefile, to enter your own name and address.  Otherwise, just add to it, and don't remove any of the targets that are already there.

Pick up (i.e., download or copy) the file README, and modify it accordingly.  Any messages or comments about the project should go here.

Pick up the file wrap, and don't change it.  This contains the shell script that will bundle your project files.

Execute the command
sh wrap
which (if you have done everything right) should produce output like this.  If you made a mistake, you would get output like this instead.  You can expect some variation in the output because of different user names, file modification times, file sizes, Linux vs. Solaris, etc.

The wrap script will create a "gzipped tar file" containing the files README, Makefile, pass1.c, etc., and some additional data.  The actual name of the file depends on your username.  Tar stands for "tape archive"; it accumulates a collection of files into one file, preserving ownership and date information.  gzip does file compression, and the original tar file will be recovered by using gunzip.

Login to ANGEL and put the file project-3-username.tar.gz in the ANGEL Dropbox for Project 3 (with your username substituted, of course).
The ANGEL Dropbox will be open for two days after the due date, but you really should get this in on time.  Nothing will be accepted more than two days late.

Only the most recent version in the Dropbox will be graded.



Additional Notes.
If you have questions about the assignment description or expectations, or need more examples, send mail to dheller at cse.psu.edu, with the subject line "CMPSC 311 Project 3" (this will help keep things organized).



Q&A

How is backspace treated in the source file?

C:ARM, p. 12, suggests that it acts like a space.  Suppose the source file xyz.c has a backspace character in it (remember, that's an actual backspace, not '\b' which only represents a backspace character).  The command "cc -E xyz.c" shows that the backspace is not removed by the preprocessor.  Actually, you need to run "cc -E xyz.c | od -c" to see this clearly.  The command "cc -c xyz.c", which just produces the object file xyz.o, shows that the backspace is an invalid source character (in an identifier, it acts like a space, separating tokens) but a valid character in a string (it remains in the object file, in the data section).

What should you do if  \\ ends a line?  Is this a case of source line continuation, or is it just an escape character?

The pair of characters \\ would not normally appear except inside a character constant or a string literal, where it does not act like source line continuation, because the immediately-following character is not a newline.

The source file (let's call it x.c)

int foo\
bar;
int foo\\
bar;
int c = '\\';
char d[] = "\\";
char e[] = "\\\
";

should become

int foobar;
int foo\bar;
int c = '\\';
char d[] = "\\";
char e[] = "\\";

Try the Sun C preprocessor

% cc -E x.c
# 1 "x.c"
int foobar;
# 3
int foo\bar;
# 5
int c = '\\';
char d[] = "\\";
char e[] = "\\";
#ident "acomp: Sun C 5.8 Patch 121015-01 2006/01/26"

Try the GNU C preprocessor

% gcc -E x.c
# 1 "x.c"
# 1 "<built-in>"
# 1 "<command line>"
# 1 "x.c"
int foobar;
 
int foo\bar;
 
int c = '\\';
char d[] = "\\";
char e[] = "\\";

Try an incorrect implementation of Pass 3

% cat x.c | pass3
int foobar;
int foo\\
bar;
int c = '\\';
char d[] = "\\";
char e[] = "\\";

What should you do if   backslash newline   ends the file?

This case cannot occur in a legal C program; see Clause 2 below.  For this project, it should bring a warning and drop the last two bytes. 

Here's an example.  The program that generates a.out is simply doing  printf("This\\\n");

% cc example2.c

% a.out | od -c
0000000    T   h   i   s   \  \n                                       
0000006

% a.out | pass1 | od -c
0000000    T   h   i   s   \  \n                                       
0000006

% a.out | pass1 | pass2 | od -c
0000000    T   h   i   s   \  \n                                       
0000006

% a.out | pass1 | pass2 | pass3 | od -c
warning: backslash-newline at end of file
0000000    T   h   i   s                                               
0000004

What should you do if only   backslash   ends the file?

This case cannot occur in a legal C program; see Clause 2 below.  Here's a similar example.

% cc example3.c

% a.out | od -c
0000000    T   h   i   s   \                                           
0000005

% a.out | pass1 | od -c
0000000    T   h   i   s   \                                           
0000005

% a.out | pass1 | pass2 | od -c
0000000    T   h   i   s   \                                           
0000005

% a.out | pass1 | pass2 | pass3 | od -c
warning: backslash at end of file
0000000    T   h   i   s   \                                           
0000005


Note that the warning message is directed to stderr while the output of pass[123] is directed to stdout.  The warning would be upgraded to an error in an actual C compiler.

One more pathological case:

Suppose the file content per od -c is    T   h   i   s   \   \   \n   \n
The inner backslash-newline is a line continuation, so it is removed, but the outer (remaining) backslash-newline is not.

How should nested comments be treated?

Nested // comments present no problem - just discard characters from // to the first newline character, and it doesn't matter if some of them look like a comment.

Nested /* */ comments are not legal in C.  It's ok to write pass4.c in a way that doesn't bother checking for illegal C.

Here's an example.  For display purposes, we added an extra : character before and after the space character that replaces a single-line comment, and added a + character before every newline that replaces part of a multiline comment.

% cat < test4a-in.c
/* comment 1
   /* comment 2
    */ // end of comment 2
 */ // end of comment 1
int a = 1;
// comment 3 /* comment 4 */
int b = 2;

% pass4 < test4a-in.c
+
+
 : :
 */ : :
int a = 1;
: :
int b = 2;

% gcc -c test4a-in.c
test4a-in.c:4: error: syntax error before ‘/’ token

(Added, Feb. 7)

In this example, the single-quote on the first line matches the single quote on the fifth line, so all the comments between them are just copied without change.  Is that right?
That's enough.
/* first */
char foo[] = "// looks like a comment";         /* is a comment */
char bar[] = "/* looks like a comment */";      // is a comment
char blit[] = "\a\b\n\t'\"";    // blit

For this project, this is acceptable.  In a later pass, the compiler would be expected to complain about a multi-character constant, but you don't need to do that here.
I have a weird error in lint when I have it test pass4.  I may be misunderstanding the error but I don't see why pass4 would be dependent on pass2.
eru 110% lint pass4.c
lint: errors in pass4.c; no output created
lint: pass2 not run - errors in pass4.c

The last line refers to pass2 of lint.  The first pass is basically a compiler, the second pass does the checking.

(If there are further questions, they will be added here.)



If you want to see the actual rules for C, here is an excerpt from the C99 Standard, Sec. 5.1.1.2, Translation phases.

The precedence among the syntax rules of translation is specified by the following phases.

[footnote - Implementations shall behave as if these separate phases occur, even though many are typically folded together in practice.  Source files, translation units, and translated translation units need not necessarily be stored as files, nor need there be any one-to-one correspondence between these entities and any external representation.  The description is conceptual only, and does not specify any particular implementation.]

1.  Physical source file multibyte characters are mapped, in an implementation-defined manner, to the source character set (introducing new-line characters for end-of-line indicators) if necessary.  Trigraph sequences are replaced by corresponding single-character internal representations.
2.  Each instance of a backslash character (\) immediately followed by a new-line character is deleted, splicing physical source lines to form logical source lines.  Only the last backslash on any physical source line shall be eligible for being part of such a splice.  A source file that is not empty shall end in a new-line character, which shall not be immediately preceded by a backslash character before any such splicing takes place.
3.  The source file is decomposed into preprocessing tokens and sequences of white-space characters (including comments).  A source file shall not end in a partial preprocessing token or in a partial comment.  Each comment is replaced by one space character.  New-line characters are retained.  Whether each nonempty sequence of white-space characters other than new-line is retained or replaced by one space character is implementation-defined.
4.  Preprocessing directives are executed, macro invocations are expanded, and _Pragma unary operator expressions are executed.  If a character sequence that matches the syntax of a universal character name is produced by token concatenation (6.10.3.3), the behavior is undefined.  A #include preprocessing directive causes the named header or source file to be processed from phase 1 through phase 4, recursively.  All preprocessing directives are then deleted.

5.  Each source character set member and escape sequence in character constants and string literals is converted to the corresponding member of the execution character set; if there is no corresponding member, it is converted to an implementation- defined member other than the null (wide) character.

6.  Adjacent string literal tokens are concatenated.

7.  White-space characters separating tokens are no longer significant.  Each preprocessing token is converted into a token.  The resulting tokens are syntactically and semantically analyzed and translated as a translation unit.

8.  All external object and function references are resolved.  Library components are linked to satisfy external references to functions and objects not defined in the current translation.  All such translator output is collected into a program image which contains information needed for execution in its execution environment.



Last revised, 29 Jan. 2013 (Q&A added 7 Feb. 2013)