CMPSC 311, Introduction to Systems Programming

NULL-terminated arrays of null-terminated character strings

A NULL-terminated array of null-terminated character strings is a common data structure, and you need to understand it completely before you try to understand how main() works.  We need to discuss characters, arrays of characters, character strings, pointers, and arrays of pointers.

Indented text goes into more detail, or explains something related to the previous topic, and perhaps could be skipped on your first reading.


Pointers in C are the same as pointers in C++, but not the same as references in C++.  A pointer value is essentially the numerical address of a location in the process address space (the total memory and currently valid memory locations associated with the running program).  Assuming the pointer value is valid, you can go to that address in memory and retrieve some bytes.  Pointers have a type, so you can tell how to interpret the bytes at that address (how many do you need, and what can you do with them?).  The pointer type void * is used when you need to hold a pointer value, but want to prevent access to the data; in particular, there is no byte-count information available.  The pointer type char * is used in C to indicate the address of a single character, or the start of an array of characters, or the start of a character string; which one you intend should be clear from the context of its use.

The distinction between address and location might seem unnecessary.  The simplest analogy is to a box with a label on the outside and some contents on the inside.  The label describes the contents of the box.  The box itself is like a location.  If you organize the boxes in numerical order on shelves, and leave them in the same place forever, then the address tells you precisely where to find the box you are looking for.  A read operation inspects the contents, and a write operation changes the contents, but first you need to find the right location.  In most memory designs, it is therefore enough to speak of the address, since (assuming the address is valid) there is one and only one location associated with it, and we know how to find that location quickly.

The analogy changes when you can move boxes and their contents without changing their labels.  Now, to find a box, you can't go directly to one particular location; you need to inspect all the locations to find a matching label.  This is the idea behind an associative memory, which is used frequently as a data structure and as a hardware component.  An alternative is to allow the boxes to be moved, but maintain another set of boxes in fixed locations that tell you where the box you are looking for is currently located.  You can move a box anywhere as long as you leave a "forwarding address" to it.  This is how Java organizes objects in memory.  In this setting, a pointer is like a forwarding address.  Java really does have pointers, but they are hidden from the programmer, and can only be manipulated indirectly if at all.

Most computers today are byte-addressable, meaning that an address refers to one byte.  When you have a data item that occupies more than one byte, its address is the lowest address among the addresses of its bytes, and those bytes are expected to be at consecutive addresses in memory.  Note that an address is now typically 4 or 8 bytes (32 or 64 bits), but the C language leaves it up to the implementation to determine the size of a pointer; see CP:AMA pp. 252-3 for some examples of older addressing schemes.

The type conversion rules in C allow a void * pointer to be cast to any "pointer to data" type, and vice versa; this does not change the numerical value of the address, only the way bytes at that address are to be interpreted.  Some machines may enforce additional rules about addresses relative to the type; this is usually done for memory alignment, which improves speed of access.  Pointers to functions are also possible, with restrictions on type conversions and other operations.  Dynamic memory allocation in C uses the function malloc() which returns a value of type void *; the call to malloc() specifies how many bytes to allocate, but does not specify a type or initial value for those bytes.  Dynamic memory allocation in C++ uses the operator new, which specifies type and initial value information.  We'll discuss more of this later in the course.

There are two fundamental operators on pointers, unary & and unary * .

Don't confuse the period at the end of the sentence with the operator token.  Also, don't confuse unary & with binary & (bitwise AND) or binary && (logical AND) or the C++ reference declarator, and don't confuse unary * with binary * (multiply) or the C/C++ pointer declarator.  C++ has two additional operators .* and ->* (pointer to member) and a pointer to member declarator ::* that won't be discussed here.

The unary "address" operator & (also called the "address of" operator) yields a pointer value from something that represents declared storage.  The most common case is to declare some variable a, and then use &a to obtain its address.  If the type of a is T then the type of &a is T * (pointer to T).  If you declare an array A, then the value of A is the address of A[0].  If the type of A is T[n] (array of T) then the type of A is automatically converted to T * in many circumstances.  Note that information about the number of elements in A is not carried with the use of A in most cases.  For example, passing the name of an array as a function argument passes only the address of the first element of the array, and the size of the array must be passed separately, or the array contents must be constructed so that the effective end of the array can be detected by looking at the values of the array elements.  This is what happens with NULL-terminated arrays of null-terminated character strings.

You can't take the address of a numerical constant like 1 or '1'.  But, you can take the address of a string literal like "1", since a character string in C is an array of characters.  A string literal represents unnamed storage which should be treated as a constant.  If you take the address of a function, by writing &function_name, you normally get the address of the first byte of the compiled code for the function, but you should not depend on the details.

Exercise.  Why is it not legal to take the address of a numerical constant, such as 1 or '1'?

The unary "indirection" operator * (also called the "dereference" operator) yields a value from a pointer.  If the type of p is T * (pointer to T) then the type of *p is T.  This action of "following the pointer" requires that the pointer is valid and T is not void.  The null pointer constant, written as the symbol NULL that is defined in <stddef.h> and other headers, has the numerical value 0 and cannot be dereferenced.  The expression A[n] is defined as the expression *(A + n) once the operation "pointer plus integer" is defined properly (more on this next).  The expression *p can also be used on the left side of an assignment operator if the type T allows it (mostly, not const-qualified) and the pointer is valid for that type (not null and memory alignment is proper).

An attempt to dereference a null pointer should terminate the process immediately.  Attempts to dereference a non-null but otherwise invalid pointer may terminate the process immediately, or later, or not at all.  A delayed failure is worse than an immediate failure, because it is harder to find the origin of the problem.  As a wise but impertinent computer once said, "Double-check your damn pointers, okay?"

The "pointer plus integer" operation is defined in a way that makes pointer++ equivalent to "move the pointer to the next array element".  Since a valid pointer can be treated as the address of an array element, adding an integer to a pointer yields a pointer to a different element in the same array.  So, if p points to A[n] then p+m points to A[n+m].  Similarly, &A[n] has the same type and value as A+n.  The compiler takes care of the details of how the address arithmetic is actually performed; this depends on some machine-dependent details related to sizeof(A[n]) (more on this next).  You need to take care of further details to show that if p is a valid pointer then p+m is also a valid pointer, and neither one is aimed outside the allocated storage for the underlying array.

A pointer aimed at "one position past the end of the array" is legal, as long as you don't try to dereference it.  This is usually used for terminating loops which increment a pointer using ++, and compare the result to the end-of-array pointer.

The sizeof operator is often useful; it tells you the number of bytes occupied by a variable, or by an array if the size is known, or by some data type which has a known size (the so-called incomplete types don't have a known size).  The result of sizeof is an unsigned integer of type size_t, where size_t is defined by macro or typedef in <stddef.h> and some other headers.  For example, sizeof(int) is typically 4 or 8 on modern systems.

Actually, sizeof tells you the number of char equivalents that would occupy the same space, which is not necessarily the number of 8-bit bytes.  So, sizeof(char) is always 1.  The C Standard requires only that a char have at least 8 bits; the Posix Standard, however, does specify that a byte has exactly 8 bits.

You can't find the size of a function using sizeof.

You can expect that sizeof(char) ≤ sizeof(short int) ≤ sizeof(int) ≤ sizeof(long int) ≤ sizeof(long long int) (where ≤ is mathematical less-than-or-equal).

See also CP:AMA pp. 167-8 (a subsection of Sec. 8.1) and the alert on p. 196 (in Sec. 9.3).

To improve your understanding of pointers, draw boxes with labels to represent values stored in memory and the names and types of the variables that are associated with those locations.

For example, the declaration

int a;

gives you one box labeled a (the name or identifier) and int (the type).  The size of the box is sizeof(a), which is sizeof(int).  The inside of the box has a question mark since we didn't specify an initial value for a.

If you want to subdivide the "int box" into smaller boxes representing each of the bytes of a, that might help with your understanding of memory use, but it doesn't help your understanding of pointer use, since the bytes of a are not individually named and are not directly accessible.  Moreover, there are two commonly used methods of assigning the bytes of a to memory locations - least-significant byte at the lowest address (little-endian), or most-significant byte at the lowest address (big-endian).  All you can be completely sure of is that the bytes of a will be stored in consecutive locations, and that the machine is consistent in its choice of how to assign the bytes to memory.  Problems only arise when you try to exchange data between machines that use different sizes of ints or different organizations of the bytes.

Exercise.  Write an expression that extracts the second byte from a.  Use only the shift and bitwise-logical operators (binary <<, >>, &, |, ^), and appropriate constants.

Exercise.  Write an expression that extracts the second byte from a.  Do not use the shift or bitwise-logical operators.

Exercise.  Comparing your two expressions, did "second byte" mean the same in both?  If yes, is there another possible meaning?

Global variables have the default initial value 0, but it's often a good idea to help the human reader and specify the initial value explicitly.

Exercise.  In C, what is the difference between a declaration and a definition?  When we said "declaration" previously, should we have said "definition"?

You can find examples of the boxes-and-arrows diagrams in CP:AMA Ch. 11, Pointers; Ch. 12, Pointers and Arrays; Ch. 13, Strings, and especially in Ch. 17, Advanced Uses of Pointers.  However, these diagrams are not labeled as completely as they should be.

The declaration

int b[10];

gives you 10 int boxes, adjacent in memory, labeled b[0] through b[9].  The array element b[0] is at a lower address than b[1], with nothing between b[0] and b[1], and so on.  There is implicitly a box labeled b which contains a pointer of type int * and a value that is the address of b[0].  However, this is just notional - an assignment like b = &a is illegal, at least because b is a constant.

Expressions in C like b[n] are expected to provide a value of n between 0 and 9, inclusive, but nothing is automatically checked at compile time or at run time to verify this.  The expressions b[-1] and b[10] are legal and may provide a value, but the value is meaningless and potentially dangerous.  This "no bounds-checking" feature of C is regarded as a failure of language design (for example, by designers of languages like Java that force bounds-checking) or as an efficient use of machine time (by people who always write correct code and always use good data, and therefore don't need run-time bounds-checking).  In C++, you can compromise and overload the [] operator.  If you don't trust your data, then check it.

If you were to use a similar bit of code as a function parameter in a function prototype or in a function definition, then you have a choice of writing int arr[] or int *arr, because only a pointer to the array would be passed to the function, not the array itself.  For example,

int func(int arr[]);

could equally well be written as

int func(int *arr);

The rules for function prototypes also allow

int func(int *);

but that's rather uninformative when you have two such parameters.

The size of an array is seldom given as part of a function parameter type, since the C language does not perform bounds-checking on array indexes, and wouldn't pass the size information implicitly.  If you need bounds checking, pass the size of the array explicitly as an additional parameter, and write the code for the bounds checking yourself.  The same applies to loop limits.

Exercise.  Does "size of the array" mean number of elements, or number of bytes?

The declaration and initialization (using the address operator &)

int *c = &a;

gives you one box labeled c, which holds a pointer value of type int *, and that value tells you where in memory the variable a is stored.  You can illustrate this by drawing an arrow from the inside of c's box to the outside of a's box.  Note that the type labels should be consistent: a pointer to int must point to an int.

The declaration

int *d;

gives you one box labeled d, which holds a pointer value of type int *, but the value of d is indeterminate since it was not initialized.  You might be lucky, and the initial value would be NULL, the null pointer constant, or you might be unlucky, and the value would be non-NULL, a wild pointer.  You can illustrate this by drawing an arrow from the inside of d's box to nowhere, using a question mark at the end of the arrow.

Remember, allocating a pointer does not allocate anything for it to point to; that is done by initialization or assignment from something that does refer to or point to allocated storage.  Explicit initialization of a pointer to NULL is a good idea, unless there is some other appropriate initializer.

Keep drawing boxes with labels and arrows until you can get the sketch right in your head, and go back to actual paper drawings when things get complicated or you just lost track of what points to what.  Most bugs involving pointers can be resolved with two of these drawings, one of what the code actually does, and another of what you thought it should do.

When you read someone else's pointer code, the boxes-and-arrows technique will almost surely help in understanding the code.

Characters in any programming language are ultimately encoded as bit-sequences, but the particular encoding method is implementation-dependent.  The rule in C is that character values are a subrange of the integer values, and can fit in the char type, which has fewer bits than the int type.  When a char is converted to an int, more bits are prepended (attached to the most-significant end of the binary number) so that the numerical value does not change.  [It gets to be a little tricky when you consider there are unsigned char, signed char, and char types.  The char type in C agrees with one of the other two, but the C Standard allows either choice.  C also has wide characters and multibyte characters, which you need to deal with Unicode, and the new edition of C has explicit support for Unicode characters.]

Character constants in C start and end with a single-quote ' .  The value 'a' has int type in C (but char type in C++), and its numerical value is implementation-dependent.  The value '\0' is the null character; it has numerical value 0.  Escape characters and escape codes allow you to write character constants that are not directly on the keyboard.  Do not confuse '\0' with '0'; they have the same type but different values.  When you mean to use the null character, write '\0' and not 0; this will help people reading the program understand what your intent is.  Do not confuse '\0' with NULL; they have the same numerical value but different types.  When a character constant of type int (such as 'a') is assigned to a char variable, or converted to char in an expression, the upper bits of the int are thrown away.

Exercise (mildly tricky).  Recall that getc() in <stdio.h> can return EOF, which is a macro defined in <stdio.h> as a negative int value (typically -1).  When EOF is converted to a char, for example by (char) EOF, what is the result?

Exercise.  Write a function that returns 0 if char acts like unsigned char, or 1 if char acts like signed char.  Hint: what is the minimum value of these types?  Bigger hint: CHAR_MIN in <limits.h>.

A character string in C is a null-terminated array of char.  To detect the end of a string, iterate through it and watch for the null character.  For example, the C library function strlen() in <string.h> could be written as

int strlen(char s[])  /* string length */
{
  int n;
  for (n = 0; s[n] != '\0'; n++) ;
  return n;
}

or as

int strlen(char *s)
{
  int n = 0;
  while (*s != '\0') { s++; n++; }
  return n;
}

or as

int strlen(char *s)
{
  int n = 0;
  while (*s++ != '\0') n++;
  return n;
}

Exercise for now.  Which version is easier to understand?  If you can't understand the second version at all, then you need to review the basics of pointers, and perhaps also function parameters.  The third version might confuse you if you're not sure about operator precedence.  CP:AMA p. 262 has a nice summary explanation.

Exercise for later.  Which version is more efficient at runtime?  This would be important to know if you need to write a string library.  For now, don't forget that C already has a string library, and there are a lot of functions you don't need to write yourself.

Exercise.  What is returned by strlen(NULL) ?  What is returned by strlen("") ?

Exercise.  What is wrong with this version?

int strlen(char s[])
{
  for (int n = 0; s[n] != '\0'; n++) ;
  return n;
}

C++ refers to the C string library in the <cstring> header.  Don't forget that C doesn't have classes, so you can't write something like s.length().  Use strlen(s) instead.

String constants in C (correctly called string literals) start and end with a double-quote " .  The compiler allocates space for string literals in the data section of the process address space, and assigns initial values, including the extra null character which marks the end of the string.  So, the literal "" (the empty string) occupies one byte of memory.  You always need to be aware of the one-extra-char that is stored with a string.

Exercise (mildly tricky).  What are the values of strlen("a") and strlen('a')?  What are the values of sizeof("a") and sizeof('a')?  If we declare  char s[] = "a";  what are strlen(s) and sizeof(s)?  If we declare  char *t = "a";  what are strlen(t) and sizeof(t)?

Here are the results on a Mac OS X system:
strlen("a") = 1
sizeof("a") = 2
sizeof('a') = 4
strlen(s) = 1
sizeof(s) = 2
strlen(t) = 1
sizeof(t) = 4
Can you explain these results?  Why does strlen('a') not appear here?  What would be different in C++?

Exercise (this is important).  Is there anything wrong or suspicious about this code?  If nothing is wrong, when it runs, how many values are copied?  (count each character separately).

char *a = "abc";
char *b = "bcd";

a = b;

Exercise (this is important).  The same as previous but with

char a[] = "abc";
char b[] = "bcd";

Exercise.  The same as previous but with

char a[3] = "abc";
char b[3] = "bcd";

Exercise.  The same as previous but with

char a[6] = "abc";
char b[6] = "bcd";

When you need to copy a string, use strcpy() or strdup().  The former overwrites existing storage.  The latter allocates new storage, which can later be deallocated.  strcpy() is in the Standard C library, while strdup() is in the Posix library, and in the Sun and GNU versions of the C library.

Exercise.  Why is strncpy() safer than strcpy()?

Programming note.  If you use strdup() and the compiler gives you a message like
warning: implicit function declaration: strdup
then try adding the option -D_XOPEN_SOURCE=600 to the compile command.

Now let's consider the declaration

char *names[3];

This defines an array of 3 elements, each a pointer to char.  We could initialize the array by

char *names[3] = { "abc", "defghi", NULL };

which sets the first two elements of names[] to non-NULL pointers to the string literals "abc" and "defghi", and sets the remaining element of names[] to be NULL.  So, names[] is a NULL-terminated array of null-terminated character strings.  If you change 3 to 10, all that happens differently is that some additional storage is allocated for the array.  Since NULL is used to mark the effective end of the array, those extra elements would not be inspected by a carefully-written program (their values are 0 or NULL by C's rules for initializers and null pointers).

Exercise (this is important).  Draw a boxes-and-arrows diagram of the names array and everything it points to.  Compare your diagram to CP:AMA p. 303 (or CS:APP Fig. 8.19 and 8.20, or APUE Fig. 7.5).  Add some more indicators to the diagrams (yours and the book's) to make it clear which direction represents increasing addresses.

The most common code snippets with an array like names[] are to iterate through all its elements, or search for a particular element.

char **p;

for (p = names; *p != NULL; p++)
  printf("%s\n", *p);

for (p = names; *p != NULL; p++)
  if (strcmp(search_key, *p) == 0)  // string comparison
    break;  // found; p points to the matching element
if (*p == NULL) ... // not found

You can also iterate through all the characters of the strings that names points to.

char *q;

for (p = names; *p != NULL; p++)
  for (q = *p; *q != '\0'; q++)
    putc(*q, stdout);

Exercise (this is important).  Explain the type of p and its assignment from names.  In particular, why did we not use char *p[]; ?  Using the boxes-and-arrows illustration, show how the first for loop works.

Exercise (this is important).  Explain the type of q and its assignment as derived from names.  Using the boxes-and-arrows illustration, show how the nested for loops work.

Exercise.  What happens if a supposedly NULL-terminated array of null-terminated character strings isn't?

Exercise.  Rewrite the first for loop to use putc() instead of printf().  (This is to review your understanding of the data structures, not a recommended practice.  Use the string library as often as possible.)

Exercise (mildly tricky, but important).  How many bytes are allocated by this declaration?

char *foo[] = { "one", "two", "three", NULL };

Example.  The parameter verbose controls how much information is printed.

/* print elements of (*array)[] up to (but not including) a NULL pointer */

void print_NULL_terminated_array(char *name, char **array[], int verbose)
{
  char **p;

  if (verbose == 0)
    {
      for (p = *array; *p != NULL; p++)
        { printf("%s\n", *p); }
    }
  else
    {
      int i = 0;
      printf("address %p: %s = %p\n", array, name, *array);
      for (p = *array; *p != NULL; p++, i++)
        {
          printf("  address %p: %s[%d] = %p --> \"%s\"\n",
            p, name, i, *p, *p);
        }
    }
}


char *foo[] = { ... , NULL };
// ... here means comma-separated list of string literals

print_NULL_terminated_array("foo", &foo, verbose);

Exercise.  Explain why we used the parameter declaration char **array[] and not char *array[].  That is, why would the usage be

print_NULL_terminated_array("foo", &foo, verbose);

and not

print_NULL_terminated_array("foo", foo, verbose);

Exercise.  Write a macro interface to print_NULL_terminated_array() with two arguments.

Last revised, 1 Feb. 2013