CMPSC 311, Spring 2013, Project 2

Posted Jan. 18, 2013.  Due Jan. 28, 2013, by 11:55 pm on ANGEL.  25 points.

The goal of this project is to improve your Unix skills, and to learn some more about characters and text files.  You should be able to log in, find your home directory, create a subdirectory, and run a text editor such as vi or emacs.  We'll show you some other steps for compiling and running a program, and then tell you how to turn in the program for a grade.  If you have any problems using Unix, the editor or the compiler, don't hesitate to ask for help.  Otherwise, do this project on your own.

Since this may be only the second time you have used Unix and the Unix tools (including Project 1), this project will be accepted up to two days after the due date and time, without penalty, if your delay is only a matter of learning to use Unix.  We'll explain how to make note of this in the section about how to turn in the program.

There are several links to files that you should download, both C code and test cases.  These should go into a directory which has only material for this project.  For example, you could execute these commands before downloading any of the files:
cd
mkdir cmpsc311
mkdir cmpsc311/project2
mkdir cmpsc311/project2/testdata
cd cmpsc311/project2
Then, as you pick up files, run the command
ls -l
to see what you have so far (for the moment, only the empty directory testdata).  The instructions for compiling and running the program, and for turning it in, will assume you have done something like the first four commands.



Reading, CP:AMA
There are references here to the book Advanced Programming in the Unix Environment (APUE), but you don't actually need the book.



The small program in APUE Fig. 1.5 shows how to read a file from stdin, and copy it to stdout, one character at a time.  Let's start with that one.  You can pick up all the code from the book by following the instructions here, or you can just pick up the file itself as linked below.  (Save the file with right-click or control-click on the link, depending on your browser and system.  It's easier than copy-and-paste.)

Source Code Starter Kit
There are examples of correct behavior, and links to more test files, given below.



The first premise of the project is that there are two widely-used conventions for marking the end-of-line in a plain text file.  The Unix style is to use one newline character ('\n').  The DOS style is to use a return-newline sequence ('\r' immediately followed by '\n').  The old Apple Macintosh style was to use one return character ('\r'), but Mac OS X replaced that with the Unix style.  More complex file formats, such as Word's .doc or Acrobat's .pdf, do fancier things (and the new .docx is even worse).
Can you write a program that differentiates Unix-style, DOS-style and old MacOS-style line endings?  Can you test it, and be confident it is correct?

The second premise of the project (and, actually, the easier part) is that characters can be classified in various ways.  Simply count how many of each character class you read.



Starting from pr2.2.c, write a program pr2.3.c that will
If a read error occurs, just treat it as end-of-file, and print the message "Input terminated on error".  No message is required if end-of-file occurs without an error.  The ferror() function is useful here.
If the file is empty, print the message "File is empty" and skip the rest of the output.
Next, print the character counts; some examples will follow.

Next, if the file does not end with a newline character, print the message "File does not end with a newline character".  Verify that the number of alphanumeric characters (determined by isalnum) is the sum of the number of alphabetic characters (isalpha) and the number of digit characters (isdigit).  If not, print the message "Alphanumerics don't add up"; if you actually print this message, there's a bug in your program, since the library functions are implemented correctly.

Next, try to diagnose the file as DOS-style, UNIX-style or old MacOS-style.  The nine possible messages are
That's all that is required for output.  The full program, with a decent amount of comments, can be done in 125-175 lines of C code.

If you have not seen the pattern underlying the nine possible messages in the last step, consider this table:
number of occurrences
return   newline   return-newline   diagnosis
0        0         0                cannot be determined
0        0         > 0              DOS
0        > 0       0                Unix (including Mac OS X)
0        > 0       > 0              mixed UNIX/DOS
> 0      0         0                MacOS (before Mac OS X)
> 0      0         > 0              mixed MacOS/DOS
> 0      > 0       0                mixed MacOS/UNIX
> 0      > 0       > 0              mixed MacOS/UNIX/DOS
PLEASE NOTE.  We will use an automated test method to help grade the programs.  It is important that you use the messages given here, and follow the style of the examples, to avoid problems for you and the grader.  We might not take off any points on this project for using different messages, but we reserve the option of doing so, and you should get into the right habits early.



To compile the program, and give it the name pr2, use one of these commands
You might need to use C99, to get access to the isblank() function.  In that case,
To check the program more thoroughly,


Here is some sample output from various cases.  The upper box holds the input file and the lower box holds the program output.  The tab, return and newline characters are not visible, so we use Δ or ∇ to indicate where a return or newline character is located, and replace a tab with 8 spaces at the beginning of a line.  No other tabs appear in these examples.  In each case the test run is like  pr2 < testfile .  The character counts are printed with printf("characters  %8d\n", count); and so on.

Referring to the Makefile, the first four examples were run with the command  make test-gcc-1 .  The fifth example was run with the command  pr2-gcc < exampledata/this ; this example is not supplied, to encourage you to create your own examples.

asdf
characters         5
  isalnum          4
  isalpha          4
  isdigit          0
  isgraph          4
  isprint          4
  iscntrl          1
  islower          4
  isupper          0
  ispunct          0
  isblank          0
  isspace          1
  isxdigit         3
end of line?
  return                 0
  newline                1
  return-newline         0
oddities?
  none detected
file type?
  UNIX-style text file, only LF line terminators

asdfΔ∇
characters         6
  isalnum          4
  isalpha          4
  isdigit          0
  isgraph          4
  isprint          4
  iscntrl          2
  islower          4
  isupper          0
  ispunct          0
  isblank          0
  isspace          2
  isxdigit         3
end of line?
  return                 0
  newline                0
  return-newline         1
oddities?
  none detected
file type?
  DOS-style text file, only CRLF line terminators

asdfΔ
characters         5
  isalnum          4
  isalpha          4
  isdigit          0
  isgraph          4
  isprint          4
  iscntrl          1
  islower          4
  isupper          0
  ispunct          0
  isblank          0
  isspace          1
  isxdigit         3
end of line?
  return                 1
  newline                0
  return-newline         0
oddities?
  File does not end with a newline character
file type?
  MacOS-style text file, only CR line terminators

asdf
characters         4
  isalnum          4
  isalpha          4
  isdigit          0
  isgraph          4
  isprint          4
  iscntrl          0
  islower          4
  isupper          0
  ispunct          0
  isblank          0
  isspace          0
  isxdigit         3
end of line?
  return                 0
  newline                0
  return-newline         0
oddities?
  File does not end with a newline character
file type?
  Strange text file, no CR or LF at all

This
        file
has
        seven
lines
        of
text with three tab characters.
characters        64
  isalnum         49
  isalpha         49
  isdigit          0
  isgraph         50
  isprint         54
  iscntrl         10
  islower         48
  isupper          1
  ispunct          1
  isblank          7
  isspace         14
  isxdigit        17
end of line?
  return                 0
  newline                7
  return-newline         0
oddities?
  none detected
file type?
  UNIX-style text file, only LF line terminators




If you want to create a test file like the second example with a '\r' character in it, try running this program
dos-style.c
#include <stdio.h>
int main(void) { printf("asdf\r\n"); return 0; }
cc dos-style.c
a.out > asdf-dos
On Solaris, if you run vi on asdf-dos, the return character displays as ^M (control-M), and the newline character is not displayed explicitly.  The vim editor (vim on Solaris, vi on Linux and Mac OS X) will not display the return character, but will note  "asdf-dos" [dos] 1L, 6C  on the bottom line of the editor window.

If you want to create a file with just a newline character at the end, run the obvious modification of dos-style.c, which would be unix-style.c.

Files created with Notepad on Windows, or saved as "Text Document - MS-DOS Format" with Wordpad, typically do not end with a return-newline sequence.  Files created with Word and saved as "Plain Text" do end with return-newline.  Files created with Excel can be saved as "Comma Separated Values (.csv)", "Windows Comma Separated (.csv)" or "MS-DOS Comma Separated (.csv)"; we'll leave it up to you to discover if all three formats are actually the same, or are different, and which, if any, text file format they resemble (but that's not actually part of the project assignment).
If you use the SSH Client on Windows, then moving .txt or .csv files for this project from Windows to Solaris or Linux should be done in Binary mode - in the File Transfer window in the SSH Client, select Operation / File Transfer Mode / Binary  (there is also a selection button for Binary mode, but it isn't obvious from the picture what it does).  The selections ASCII or Auto Select will strip the return characters from the file as part of the transfer.  You probably want Auto Select most of the time, but not now.

Here is another example in four different formats.  Each file's text has the words one two three on three lines, but the lines end differently.  These files should go into your testdata directory, created earlier.  Note that the files are not visibly different when displayed by a web browser.
Here are the bytes of these files, using the od utility on Solaris.  The byte numbers on the left side are in hexadecimal, starting from 0; this is requested by the operand x on the command line, after the file name.  You can use these examples to make sure you know what data you have before deciding if your program is right or wrong.  (% is the command prompt, the highlighted text is what we typed.)

% od -c vi123 x
0000000   o   n   e  \n   t   w   o  \n   t   h   r   e   e  \n
000000e

% od -c word123.txt x
0000000   o   n   e  \r  \n   t   w   o  \r  \n   t   h   r   e   e  \r
0000010  \n
0000011

% od -c wordpad123.txt x
0000000   o   n   e  \r  \n   t   w   o  \r  \n   t   h   r   e   e
000000f

% od -c mixed123 x
0000000   o   n   e  \r  \n   t   w   o  \n   t   h   r   e   e  \r  \n
0000010


On Linux or Mac OS X, use the command  od -A x -t c vi123  and so on.  The output is the same, only the command is different.

Here is the output of the tests on these files, using Sun's compiler on Solaris.

% cc -v -o pr2 pr2.3.c

% pr2 < testdata/vi123
characters        14
  isalnum         11
  isalpha         11
  isdigit          0
  isgraph         11
  isprint         11
  iscntrl          3
  islower         11
  isupper          0
  ispunct          0
  isblank          0
  isspace          3
  isxdigit         3
end of line?
  return                 0
  newline                3
  return-newline         0
oddities?
  none detected
file type?
  UNIX-style text file, only LF line terminators

% pr2 < testdata/word123.txt
characters        17
  isalnum         11
  isalpha         11
  isdigit          0
  isgraph         11
  isprint         11
  iscntrl          6
  islower         11
  isupper          0
  ispunct          0
  isblank          0
  isspace          6
  isxdigit         3
end of line?
  return                 0
  newline                0
  return-newline         3
oddities?
  none detected
file type?
  DOS-style text file, only CRLF line terminators

% pr2 < testdata/wordpad123.txt
characters        15
  isalnum         11
  isalpha         11
  isdigit          0
  isgraph         11
  isprint         11
  iscntrl          4
  islower         11
  isupper          0
  ispunct          0
  isblank          0
  isspace          4
  isxdigit         3
end of line?
  return                 0
  newline                0
  return-newline         2
oddities?
  File does not end with a newline character
file type?
  DOS-style text file, only CRLF line terminators

% pr2 < testdata/mixed123
characters        16
  isalnum         11
  isalpha         11
  isdigit          0
  isgraph         11
  isprint         11
  iscntrl          5
  islower         11
  isupper          0
  ispunct          0
  isblank          0
  isspace          5
  isxdigit         3
end of line?
  return                 0
  newline                1
  return-newline         2
oddities?
  none detected
file type?
  Mixed UNIX/DOS-style text file, both CRLF and LF line terminators

% pr2 < pr2
characters     10140
  isalnum       2419
  isalpha       2099
  isdigit        320
  isgraph       3254
  isprint       3608
  iscntrl       5232
  islower       1705
  isupper        394
  ispunct        835
  isblank        423
  isspace        554
  isxdigit      1005
end of line?
  return                17
  newline               65
  return-newline         0
oddities?
  File does not end with a newline character
file type?
  Not a text file

Note that you should not expect to have exactly the same result on the last example if you use Linux, or the GNU compiler, or a different processor, or a different version of the source code.  The executable file has contents that are OS-, compiler- and processor-specific, and even then it will depend on your program's structure.



At some point, you may get tired of typing certain commands over and over again.  Unix systems come with a utility named make which allows one file (called a make file) to contain sequences of commands that can be requested easily.  Make has a lot of other features that are useful in large projects.  We'll stick to the basics here, and describe more about make in later projects.

The examples here come from the file Makefile .  The file name is not random, so don't change it (Makefile.txt is a popular but wrong choice, for example).

Any line in a make file that starts with # is a comment line.
# CMPSC 311, Spring 2013, Project 2

Symbols can be defined, and can be evaluated later with the $() form.
SRC = pr2.3.c
cc -v -o pr2-sun $(SRC)

Make files describe dependencies between targets and sources.

In this example, all is the target and pr2-sun pr2-gcc pr2-lint are the sources it depends on.
all: pr2-sun pr2-gcc pr2-lint

A target-dependency line can be followed by a sequence of commands.  Each of the command lines must be indented with one tab character.  If you copy-and-paste a make file from someplace, it is likely that the tab will be replaced with eight spaces, and make won't know what to do about that except complain.

In this example, the target pr2-sun depends on $(SRC), which is pr2.3.c, and the cc command will be run to make the target.
pr2-sun: $(SRC)
        cc -v -o pr2-sun $(SRC)

When you give the command
make pr2-sun
make will compare the last-modification time of the requested target (pr2-sun) with the last-modification time of each of its dependent files (here, only pr2.3.c).  If any of the dependent files is newer than the target, or if the target doesn't exist, then the associated command sequence is executed.  (A target with no dependent files has nothing to be compared against, so its commands run only when no file with the target's name exists.)  Thus we have
% make pr2-sun
cc -v -o pr2-sun pr2.3.c

% make pr2-sun

`pr2-sun' is up to date.

make will first recursively check the target's dependent files, to see if any of those are out-of-date, and may cause other command sequences to be executed before the target's.  In this case, nothing else needs to be done for the target pr2-sun, but something might need to be done for the target all.

It's not actually necessary for the commands under a target to create a file with the name of the target.  In this case, the make file is a good way of packaging a short command sequence.  For example, the target clean is often used to delete files and restore the directory to a "clean" state. 
clean:
        -rm pr2-sun pr2-gcc pr2 a.out *.o
        -rm asdf-dos asdf-unix asdf-oldmacos asdf-empty


The hyphen before rm indicates to make that when the rm command is executed, make should ignore the success or failure of the command.  Normally, make will stop executing commands when one of them fails.  Thus, the command
make clean
will remove files, and make will not complain about rm trying to remove a file that doesn't exist or can't be removed.  (The -f option to rm will have a similar effect in this instance, and it's quieter.)
 
If you are working on Linux or Mac OS X, then give the command
make pr2-gcc
to compile the program.  Using  make pr2-sun or  make all  will do the wrong thing.  Note that the lint program is available only on Solaris.

If you don't specify a target, the command make will assume the target is the first one it finds, which is dummy in this case.

If you want to see what make would do, but not actually do it, add the option -n, as in the command
make -n pr2-gcc

There are many other features of make that we will get to later in the course.  It's a useful program that should be part of your Unix skillset.



When your program is complete and you are satisfied it is right, here is how to turn it in for a grade.

Log in to one of the CSE Linux systems, cd to your cmpsc311/project2 directory, transfer files from your home system if necessary, and run some final tests just to make sure.

Your program should be entirely in one file named pr2.3.c which starts like
/* CMPSC 311, Spring 2013, Project 2
*
* Author: <your name>
* Email: <your preferred email address>
*
... <additional comment text>
*/
except, of course, your own name and address should be there.

You should have a file named Makefile, which could be the same as the one linked above.  Modify Makefile, to enter your own name and address, and change the version number.  Otherwise, just add to it, and don't remove any of the targets that are already there.

Pick up (i.e., download or copy) the file README, and modify it accordingly.  Any messages or comments about the project should go here.

Pick up the file wrap, and don't change it.  This contains the shell script that will bundle your project files.  We will describe shell scripts in more detail later in the course.

Execute the command
sh wrap
which (if you have done everything right) should produce output like this.  If you made a mistake, you would get output like this instead.  You can expect some variation in the output because of different user names, file modification times, file sizes, Linux vs. Solaris, etc.

The wrap script will create a "gzipped tar file" containing the files README, Makefile, pr2.3.c, and some additional data.  The actual name of the file depends on your username.  Tar stands for "tape archive"; it accumulates a collection of files into one file, preserving ownership and date information.  gzip does file compression, and the original tar file will be recovered by using gunzip.

Log in to ANGEL and put the file project-2-username.tar.gz in the ANGEL Dropbox for Project 2 (with your username substituted, of course).
The ANGEL Dropbox will be open for three days after the due date, but you really should get this in on time.  There will be a penalty for late projects, as a percentage of the grade, increasing with time.  Nothing will be accepted more than three days late.
Only the most recent version in the Dropbox will be graded.



Additional Notes.



If you have questions about the assignment description or expectations, or need more examples, send mail to dheller at cse.psu.edu, with the subject line "CMPSC 311 Project 2" (this will help keep things organized).



Last modified 16 Jan. 2013