CMPSC 311, Introduction to Systems Programming
NULL-terminated arrays of null-terminated character strings
A NULL-terminated array of null-terminated character strings is a
common data structure, and you need to understand it completely
before you try to understand how main() works.
We need to discuss characters, arrays of characters, character
strings, pointers, and arrays of pointers.
Indented text goes into more detail,
or explains something related to the previous topic, and perhaps
could be skipped on your first reading.
Reading
- CP:AMA
- Read Sec. 13.7 first, to see where we're headed.
- Ch. 11, Pointers (all, including the Q&A section)
- Ch. 12, Pointers and Arrays (all, including the Q&A
section)
- Ch. 13, Strings (all except Sec. 13.3, including the Q&A
section)
- Sec. 17.2, Dynamically Allocated Strings
- Sec. 7.3, Character Types
- Sec. 7.6, The sizeof Operator
- CS:APP
- Read Sec. 8.4.5 first, to see where we're headed.
- Sec. 3.8, Array Allocation and Access, esp. Sec. 3.8.1 and
3.8.2
- Sec. 3.10, Putting it Together: Understanding Pointers
- Sec. 9.11, Common Memory-Related Bugs in C Programs
- C:ARM
- Sec. 2.1, 2.7, 2.8.5
- Sec. 7.5.2, 7.5.6, 7.5.7, 7.6.2
- APUE
Pointers in C are the same as pointers in C++, but not the same as
references in C++. A pointer value is essentially the
numerical address of a location in the process address space (the
total memory and currently valid memory locations associated with
the running program). Assuming the pointer value is valid, you
can go to that address in memory and retrieve some bytes.
Pointers have a type, so you can tell how to interpret the bytes at
that address (how many do you need, and what can you do with
them?). The pointer type void * is used when you
need to hold a pointer value, but want to prevent access to the
data; in particular, there is no byte-count information
available. The pointer type char * is used in C
to indicate the address of a single character, or the start of an
array of characters, or the start of a character string; which one
you intend should be clear from the context of its use.
The distinction between address and
location might seem unnecessary. The simplest analogy is to
a box with a label on the outside and some contents on the
inside. The label describes the contents of the box.
The box itself is like a location. If you organize the boxes
in numerical order on shelves, and leave them in the same place
forever, then the address tells you precisely where to find the
box you are looking for. A read operation inspects the
contents, and a write operation changes the contents, but first
you need to find the right location. In most memory designs,
it is therefore enough to speak of the address, since (assuming
the address is valid) there is one and only one location
associated with it, and we know how to find that location quickly.
The analogy changes when you can move boxes and their contents
without changing their labels. Now, to find a box, you can't
go directly to one particular location; you need to inspect all
the locations to find a matching label. This is the idea
behind an associative memory, which is used frequently as a data
structure and as a hardware component. An alternative is to
allow the boxes to be moved, but maintain another set of boxes in
fixed locations that tell you where the box you are looking for is
currently located. You can move a box anywhere as long as
you leave a "forwarding address" to it. This is how Java
organizes objects in memory. In this setting, a pointer is
like a forwarding address. Java really does have pointers,
but they are hidden from the programmer, and can only be
manipulated indirectly if at all.
Most computers today are byte-addressable, meaning that an address
refers to one byte. When you have a data item that occupies
more than one byte, its address is the lowest address among the
addresses of its bytes, and those bytes are expected to be at
consecutive addresses in memory. Note that an address is now
typically 4 or 8 bytes (32 or 64 bits), but the C language leaves
it up to the implementation to determine the size of a pointer;
see CP:AMA pp. 252-3 for some examples of older addressing
schemes.
The type conversion rules in C allow
a void * pointer to be cast to any "pointer to data"
type, and vice versa; this does not change the numerical value of
the address, only the way bytes at that address are to be
interpreted. Some machines may enforce additional rules
about addresses relative to the type; this is usually done for
memory alignment, which improves speed of access. Pointers
to functions are also possible, with restrictions on type
conversions and other operations. Dynamic memory allocation
in C uses the function malloc() which returns a
value of type void *; the call to malloc()
specifies how many bytes to allocate, but does not specify a type
or initial value for those bytes. Dynamic memory allocation
in C++ uses the operator new, which specifies type
and initial value information. We'll discuss more of this
later in the course.
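As a small sketch of the malloc() point, the function below allocates room for n ints and then fills them in. The name make_counting_array is made up for this example; it is not a library function.

```c
#include <stdlib.h>

/* Hypothetical helper: allocate an array of n ints holding 0, 1, ..., n-1.
   Returns NULL if the allocation fails; the caller must free() the result. */
int *make_counting_array(size_t n)
{
    /* malloc() is told only a byte count; the void * it returns is
       converted to int * by the initialization */
    int *a = malloc(n * sizeof *a);
    if (a == NULL)
        return NULL;
    for (size_t i = 0; i < n; i++)
        a[i] = (int) i;    /* malloc() does not initialize the bytes, so we do */
    return a;
}
```

A caller would write something like int *a = make_counting_array(4); and later free(a);.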
There are two fundamental operators on pointers, unary &
and unary * .
Don't confuse the period at the end
of the sentence with the operator token. Also, don't confuse
unary & with binary & (bitwise
AND) or binary && (logical AND) or the C++
reference declarator, and don't confuse unary * with
binary * (multiply) or the C/C++ pointer
declarator. C++ has two additional operators .*
and ->* (pointer to member) and a pointer to
member declarator ::* that won't be discussed here.
The unary "address" operator & (also called the
"address of" operator) yields a pointer value from something that
represents declared storage. The most common case is to
declare some variable a, and then use &a
to obtain its address. If the type of a is T then the type of &a
is T * (pointer to T).
If you declare an array A, then the value of A
is the address of A[0]. If the type of A
is T[n] (array of T) then the type of A
is automatically converted to T * in many circumstances. Note
that information about the number of elements in A is
not carried with the use of A in most cases. For
example, passing the name of an array as a function argument passes
only the address of the first element of the array, and the size of
the array must be passed separately, or the array contents must be
constructed so that the effective end of the array can be detected
by looking at the values of the array elements. This is what
happens with NULL-terminated arrays of null-terminated character
strings.
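The two ways of finding the end of an array can be sketched side by side. Both function names are made up for this example, and the choice of -1 as an end marker is only an assumption for the sketch.

```c
#include <stddef.h>

/* The size is passed separately, as a number of elements. */
long sum_counted(const int a[], size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += a[i];
    return total;
}

/* The end is detected from the contents.
   Assumed convention: an element equal to -1 marks the end. */
long sum_until_sentinel(const int *a)
{
    long total = 0;
    for (; *a != -1; a++)
        total += *a;
    return total;
}
```

The second style is the one used by null-terminated strings (sentinel '\0') and NULL-terminated arrays of pointers (sentinel NULL).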
You can't take the address of a
numerical constant like 1 or '1'. But, you can take the
address of a string literal like "1", since a character string in
C is an array of characters. A string literal represents
unnamed storage which should be treated as a constant. If
you take the address of a function, by writing &function_name,
you normally get the address of the first byte of the compiled
code for the function, but you should not depend on the details.
Exercise. Why is it not legal to take the address of a
numerical constant, such as 1 or '1'?
The unary "indirection" operator * (also called the
"dereference" operator) yields a value from a pointer. If the
type of p is T * (pointer to T)
then the type of *p is T. This action of "following the pointer"
requires that the pointer is valid and T is not void.
The null pointer constant, written as the symbol NULL that is defined in <stddef.h>
and other headers, has the numerical value 0 and cannot be
dereferenced. The expression A[n] is defined as
the expression *(A + n) once the operation "pointer
plus integer" is defined properly (more on this next). The
expression *p can also be used on the left side of an
assignment operator if the type T
allows it (mostly, not const-qualified) and the
pointer is valid for that type (not null and memory alignment is
proper).
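Here is a minimal sketch of both operators at work, assuming the pointers passed in are valid. The names swap_ints and same_element are made up for this example.

```c
/* Hypothetical helper: exchange the ints that x and y point to.
   Both pointers must be valid (non-null, aimed at ints). */
void swap_ints(int *x, int *y)
{
    int tmp = *x;    /* read through the pointer */
    *x = *y;         /* *x on the left side of an assignment writes through it */
    *y = tmp;
}

/* Returns 1 when A[n] and *(A + n) refer to the same element,
   which is always the case for a valid A and n. */
int same_element(int *A, int n)
{
    return &A[n] == A + n && A[n] == *(A + n);
}
```

A typical call is swap_ints(&a, &b); where a and b are int variables.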
An attempt to dereference a null
pointer is undefined behavior in C; on most systems it terminates
the process immediately. Attempts
to dereference a non-null but otherwise invalid pointer may
terminate the process immediately, or later, or not at all.
A delayed failure is worse than an immediate failure, because it
is harder to find the origin of the problem. As a wise but
impertinent computer once said, "Double-check your damn pointers,
okay?"
The "pointer plus integer" operation is defined in a way that makes
pointer++ equivalent to "move the pointer to the next
array element". Since a valid pointer can be treated as the
address of an array element, adding an integer to a pointer yields a
pointer to a different element in the same array. So, if p
points to A[n] then p+m points to A[n+m].
Similarly, &A[n] has the same type and value as A+n.
The compiler takes care of the details of how the address arithmetic is actually performed;
this depends on some machine-dependent details related to sizeof(A[n])
(more on this next). You need to take care of further details
to show that if p is a valid pointer then p+m
is also a valid pointer, and neither one is aimed outside the
allocated storage for the underlying array.
A pointer aimed at "one position
past the end of the array" is legal, as long as you don't try to
dereference it. This is usually used for terminating loops
which increment a pointer using ++, and compare the
result to the end-of-array pointer.
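A loop of that kind might look like the following sketch. The function name max_element is made up for this example, and it assumes the array has at least one element.

```c
#include <stddef.h>

/* Hypothetical helper: largest value in a[0..n-1]; requires n > 0. */
int max_element(const int *a, size_t n)
{
    const int *end = a + n;    /* one past the last element: compared, never dereferenced */
    int best = *a;
    for (const int *p = a + 1; p != end; p++)
        if (*p > best)
            best = *p;
    return best;
}
```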
The sizeof operator is often useful; it tells you the
number of bytes occupied by a variable, or of an array if the size
is known, or of some data type which has a known size (the so-called
incomplete types don't have a known size). The result of sizeof
is an integer of type size_t, where size_t
is defined by macro or typedef in <stddef.h>
and some other headers. For example, sizeof(int)
is typically 4 or 8 on modern systems.
Actually, sizeof tells
you the number of char equivalents that would occupy
the same space, which is not necessarily the number of 8-bit bytes. So, sizeof(char)
is always 1. The POSIX Standard, however, does specify that
a byte has 8 bits.
You can't find the size of a function using sizeof.
You can expect that sizeof(char) ≤
sizeof(short int) ≤ sizeof(int)
≤ sizeof(long int) ≤
sizeof(long long int) (where ≤
is mathematical less-than-or-equal).
See also CP:AMA pp. 167-8 (a subsection of Sec. 8.1) and the alert
on p. 196 (in Sec. 9.3).
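One common use of sizeof is to compute the number of elements in an array whose declaration is visible. The macro name ARRAY_COUNT and the function count_demo are made up for this sketch; the macro gives the wrong answer if it is applied to a pointer instead of an array.

```c
#include <stddef.h>

/* Number of elements in an array: total bytes divided by bytes per element.
   Assumption: the argument is a real array, not a pointer parameter. */
#define ARRAY_COUNT(a) (sizeof(a) / sizeof((a)[0]))

size_t count_demo(void)
{
    double d[7];
    return ARRAY_COUNT(d);    /* 7, whatever sizeof(double) happens to be */
}
```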
To improve your understanding of pointers, draw boxes with labels to
represent values stored in memory and the names and types of the
variables that are associated with those locations.
For example, the declaration
int a;
gives you one box labeled a (the name or identifier)
and int (the type). The size of the box is sizeof(a),
which is sizeof(int). The inside of the box has a
question mark since we didn't specify an initial value for a.
If you want to subdivide the "int
box" into smaller boxes representing each of the bytes of a,
that might help with your understanding of memory use, but it
doesn't help your understanding of pointer use, since the bytes of
a are not individually named and are not directly
accessible. Moreover, there are two commonly used methods of
assigning the bytes of a to memory locations:
least-significant byte at the lowest address (little-endian), or
most-significant byte at the lowest address (big-endian).
All you can be completely sure of is that the
bytes of a will be stored in consecutive locations,
and that the machine is consistent in its choice of how to assign
the bytes to memory. Problems only arise when you try to
exchange data between machines that use different sizes of ints
or different organizations of the bytes.
Exercise. Write an expression
that extracts the second byte from a. Use only
the shift and bitwise-logical operators (binary <<,
>>, &, |, ^),
and appropriate constants.
Exercise. Write an expression that extracts the second byte
from a. Do not use the shift or
bitwise-logical operators.
Exercise. Comparing your two expressions, did "second byte"
mean the same in both? If yes, is there another possible
meaning?
Global variables have the default initial value 0, but it's
often a good idea to help the human reader and specify the initial
value explicitly.
Exercise. In C, what is the difference between a declaration
and a definition? When we said "declaration"
previously, should we have said "definition"?
You can find examples of the boxes-and-arrows diagrams in CP:AMA
Ch. 11, Pointers; Ch. 12, Pointers and Arrays; Ch. 13, Strings,
and especially in Ch. 17, Advanced Uses of Pointers.
However, these diagrams are not labeled as completely as they
should be.
The declaration
int b[10];
gives you 10 int boxes, adjacent in memory, labeled b[0]
through b[9]. The array element b[0]
is at a lower address than b[1], with nothing between
b[0] and b[1], and so on. There is
implicitly a box labeled b which contains a pointer of
type int * and a value that is the address of b[0].
However, this is just notional - an assignment like b =
&a is illegal, at least because b is a
constant.
Expressions in C like b[n]
are expected to provide a value of n between 0 and
9, inclusive, but nothing is automatically checked at compile time
or at run time to verify this. The expressions b[-1]
and b[10] will compile and may provide a value, but the
behavior is undefined; any value you get is meaningless and
potentially dangerous. This "no
bounds-checking" feature of C is regarded as a failure of language
design (for example, by designers of languages like Java that
force bounds-checking) or as an efficient use of machine time (by
people who always write correct code and always use good data, and
therefore don't need run-time bounds-checking). In C++, you
can compromise and overload the [] operator.
If you don't trust your data, then check it.
If you were to use a similar bit of
code as a function parameter in a function prototype or in a
function definition, then you have a choice of writing
int arr[] or int *arr, because only a pointer
to the array would be passed to the function, not the array
itself. For example,
int func(int arr[]);
could equally well be written as
int func(int *arr);
The rules for function prototypes also allow
int func(int *);
but that's rather uninformative when you have two such parameters.
The size of an array is seldom given as part of a function
parameter type, since the C language does not perform
bounds-checking on array indexes, and wouldn't pass the size
information implicitly. If you need bounds checking, pass
the size of the array explicitly as an additional parameter, and
write the code for the bounds checking yourself. The same
applies to loop limits.
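A hand-written bounds check might look like this sketch. The function name get_checked and its return convention are made up for this example; here n counts elements.

```c
#include <stddef.h>

/* Hypothetical helper: if i is in range, store arr[i] in *out and return 1;
   otherwise return 0 and leave *out unchanged. */
int get_checked(const int *arr, size_t n, size_t i, int *out)
{
    if (i >= n)
        return 0;     /* out of range: refuse instead of reading past the end */
    *out = arr[i];
    return 1;
}
```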
Exercise. Does "size of the array" mean number of elements,
or number of bytes?
The declaration and initialization (using the address operator &)
int *c = &a;
gives you one box labeled c, which holds a pointer
value of type int *, and that value tells you where in
memory the variable a is stored. You can
illustrate this by drawing an arrow from the inside of c's
box to the outside of a's box. Note that the
type labels should be consistent: a pointer to int
must point to an int.
The declaration
int *d;
gives you one box labeled d, which holds a pointer
value of type int *, but the value of d
is indeterminate since it was not initialized. You might be
lucky, and the initial value would be NULL, the null
pointer constant, or you might be unlucky, and the value would be
non-NULL, a wild pointer. You can illustrate this
by drawing an arrow from the inside of d's box to
nowhere, using a question mark at the end of the arrow.
Remember, allocating a pointer does not allocate anything for it to
point to; that is done by initialization or assignment from
something that does refer to or point to allocated storage.
Explicit initialization of a pointer to NULL is a good
idea, unless there is some other appropriate initializer.
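The following sketch shows the habit in a small search function. The name first_negative is made up for this example.

```c
#include <stddef.h>

/* Hypothetical helper: pointer to the first negative element of a[0..n-1],
   or NULL if there is none. */
const int *first_negative(const int *a, size_t n)
{
    const int *found = NULL;    /* explicitly "points to nothing yet" */
    for (size_t i = 0; i < n && found == NULL; i++)
        if (a[i] < 0)
            found = &a[i];      /* now it points to allocated storage */
    return found;
}
```

The caller can test the result against NULL before dereferencing it, which would not be possible if found had started with an indeterminate value.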
Keep drawing boxes with labels and arrows until you can get the
sketch right in your head, and go back to actual paper drawings when
things get complicated or you just lost track of what points to
what. Most bugs involving pointers can be resolved with two of
these drawings, one of what the code actually does, and another of
what you thought it should do.
When you read someone else's pointer code, the boxes-and-arrows
technique will almost surely help in understanding the code.
Characters in any programming language are ultimately encoded as
bit-sequences, but the particular encoding method is
implementation-dependent. The rule in C is that character
values are a subrange of the integer values, and can fit in the char
type, which has fewer bits than the int type.
When a char is converted to an int, more
bits are prepended (attached to the most-significant end of the
binary number) so that the numerical value does not change.
[It gets to be a little tricky when you consider there are unsigned
char, signed char, and char
types. The char type in C agrees with one of the
other two, but the C Standard allows either choice. C also has
wide characters and multibyte characters, which you need in order to deal
with Unicode, and the 2011 edition of C (C11) has explicit support for
Unicode characters.]
Character constants in C start and end with a single-quote '
. The value 'a' has int type in C
(but char type in C++), and its numerical value is
implementation-dependent. The value '\0' is the
null character; it has numerical value 0. Escape characters
and escape codes allow you to write character constants that are not
directly on the keyboard. Do not confuse '\0'
with '0'; they have the same type but different
values. When you mean to use the null character, write '\0'
and not 0; this will help people reading the program
understand what your intent is. Do not confuse '\0'
with NULL; they have the same numerical value but
different types. When a character constant of type int
(such as 'a') is assigned to a char
variable, or converted to char in an expression, the
upper bits of the int are thrown away.
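To see the difference between '0' and '\0' in use, consider this sketch. The function name digit_value is made up for this example.

```c
/* Hypothetical helper: numeric value of a decimal digit character,
   or -1 if c is not a digit. */
int digit_value(char c)
{
    if (c >= '0' && c <= '9')
        return c - '0';    /* C guarantees that '0' through '9' are consecutive */
    return -1;             /* includes the null character '\0' */
}
```

So digit_value('0') is 0, but digit_value('\0') is -1, since the null character is not a digit.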
Exercise (mildly tricky). Recall that getc() in
<stdio.h> can return EOF, which is
a macro defined in <stdio.h> as a negative int, typically -1.
When EOF is converted to a char, for
example by (char) EOF, what is the result?
Exercise. Write a function that returns 0 if char
acts like unsigned char, or 1 if char
acts like signed char. Hint: what is the minimum
value of these types? Bigger hint: CHAR_MIN in <limits.h>.
A character string in C is a null-terminated array of char.
To detect the end of a string, iterate through it and watch for the null
character. For example, the C library function strlen()
in <string.h> could be written as
int strlen(char s[])    /* string length */
{
    int n;
    for (n = 0; s[n] != '\0'; n++) ;
    return n;
}
or as
int strlen(char *s)
{
    int n = 0;
    while (*s != '\0') { s++; n++; }
    return n;
}
or as
int strlen(char *s)
{
    int n = 0;
    while (*s++ != '\0') n++;
    return n;
}
Exercise for now. Which version is easier to understand?
If you can't understand the second version at all, then you need to
review the basics of pointers, and perhaps also function
parameters. The third version might confuse you if you're not
sure about operator precedence. CP:AMA p. 262 has a nice
summary explanation.
Exercise for later. Which version is more efficient at
runtime? This would be important to know if you need to write
a string library. For now, don't forget that C already has a
string library, and there are a lot of functions you don't need to
write yourself.
Exercise. What is returned by strlen(NULL)
? What is returned by strlen("") ?
Exercise. What is wrong with this version?
int strlen(char s[])
{
    for (int n = 0; s[n] != '\0'; n++) ;
    return n;
}
C++ refers to the C string library in the <cstring>
header. Don't forget that C doesn't have classes, so you can't
write something like s.length(). Use strlen(s)
instead.
String constants in C (correctly called string literals) start and
end with a double-quote " . The compiler
allocates space for string literals in the data section of the
process address space, and assigns initial values, including the
extra null character which marks the end of the string. So,
the literal "" (the empty string) occupies one byte of
memory. You always need to be aware of the one-extra-char
that is stored with a string.
Exercise (mildly tricky). What are the values of strlen("a")
and strlen('a')? What are the values of sizeof("a")
and sizeof('a')? If we declare char
s[] = "a"; what are strlen(s) and sizeof(s)?
If we declare char *t = "a"; what are strlen(t)
and sizeof(t)?
Here are the results on a Mac OS X
system:
strlen("a") = 1
sizeof("a") = 2
sizeof('a') = 4
strlen(s) = 1
sizeof(s) = 2
strlen(t) = 1
sizeof(t) = 4
Can you explain these results? Why does strlen('a')
not appear here? What would be different in C++?
Exercise (this is important). Is there anything wrong or
suspicious about this code? If nothing is wrong, when it runs,
how many values are copied? (count each character separately).
char *a = "abc";
char *b = "bcd";
a = b;
Exercise (this is important). The same as previous but with
char a[] = "abc";
char b[] = "bcd";
Exercise. The same as previous but with
char a[3] = "abc";
char b[3] = "bcd";
Exercise. The same as previous but with
char a[6] = "abc";
char b[6] = "bcd";
When you need to copy a string, use strcpy() or strdup().
The former overwrites existing storage.
The latter allocates new storage, which can later be
deallocated with free(). strcpy() is in the Standard C
library, while strdup() is in the POSIX library, and
in the Sun and GNU versions of the C library.
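If strdup() is not available, its effect can be approximated with malloc() and strcpy(). This is only a sketch, and the name copy_string is made up for this example.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for strdup(): returns a newly allocated copy of s,
   or NULL if the allocation fails. The caller must free() the result. */
char *copy_string(const char *s)
{
    size_t len = strlen(s);
    char *copy = malloc(len + 1);    /* +1 for the terminating null character */
    if (copy != NULL)
        strcpy(copy, s);             /* copies the characters and the '\0' */
    return copy;
}
```

The copy lives in new storage, so changing it does not change the original string.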
Exercise. Why is strncpy() safer than strcpy()?
Programming note. If you use strdup()
and the compiler gives you a message like
warning: implicit function declaration: strdup
then try adding the option -D_XOPEN_SOURCE=600 to
the compile command.
Now let's consider the declaration
char *names[3];
This defines an array of 3 elements, each a pointer to char.
We could initialize the array by
char *names[3] = { "abc", "defghi", NULL };
which sets the first two elements of names[] to non-NULL
pointers to the string literals "abc" and "defghi",
and sets the remaining element of names[] to be NULL.
So, names[] is a NULL-terminated array
of null-terminated character strings. If you change 3 to 10,
all that happens differently is that some additional storage is
allocated for the array. Since NULL is used to
mark the effective end of the array, those extra elements would not
be inspected by a carefully-written program (their values are 0
or NULL by C's rules for initializers and null
pointers).
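Because NULL marks the effective end, the number of strings can be counted without knowing the declared size of the array. The function name count_strings is made up for this sketch.

```c
#include <stddef.h>

/* Hypothetical helper: number of strings before the terminating NULL. */
size_t count_strings(char *list[])
{
    size_t n = 0;
    while (list[n] != NULL)    /* stop at the NULL pointer, not at the declared size */
        n++;
    return n;
}
```

With the initializer above, count_strings(names) is 2 whether names was declared with 3 elements or with 10.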
Exercise (this is important). Draw a boxes-and-arrows diagram
of the names array and everything it points to.
Compare your diagram to CP:AMA p. 303 (or CS:APP Fig. 8.19 and 8.20,
or APUE Fig. 7.5). Add some more indicators to the diagrams
(yours and the book's) to make it clear which direction represents
increasing addresses.
The most common code snippets with an array like names[]
are to iterate through all its elements, or search for a particular
element.
char **p;
for (p = names; *p != NULL; p++)
    printf("%s\n", *p);
for (p = names; *p != NULL; p++)
    if (strcmp(search_key, *p) == 0)    // string comparison
        break;
if (*p == NULL) ...    // not found
You can also iterate through all the characters of the strings that
names points to.
char *q;
for (p = names; *p != NULL; p++)
    for (q = *p; *q != '\0'; q++)
        putc(*q, stdout);
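The search loop can be packaged as a function, as in this sketch. The name find_string and the convention of returning -1 for "not found" are made up for this example.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: index of key in the NULL-terminated list,
   or -1 if key is not present. */
int find_string(char *list[], const char *key)
{
    int i = 0;
    for (char **p = list; *p != NULL; p++, i++)
        if (strcmp(key, *p) == 0)    /* strcmp() returns 0 when the strings match */
            return i;
    return -1;
}
```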
Exercise (this is important). Explain the type of p
and its assignment from names. In particular,
why did we not use char *p[]; ? Using the
boxes-and-arrows illustration, show how the first for
loop works.
Exercise (this is important). Explain the type of q
and its assignment as derived from names. Using
the boxes-and-arrows illustration, show how the nested for
loops work.
Exercise. What happens if a supposedly NULL-terminated
array of null-terminated character strings isn't?
Exercise. Rewrite the first for loop to use putc()
instead of printf(). (This is to review your
understanding of the data structures, not a recommended
practice. Use the string library as often as possible.)
Exercise (mildly tricky, but important). How many bytes are
allocated by this declaration?
char *foo[] = { "one", "two", "three", NULL };
Example. The parameter verbose controls how much
information is printed.
/* print elements of (*array)[] up to (but not including) a NULL pointer */
void print_NULL_terminated_array(char *name, char **array[], int verbose)
{
    char **p;
    if (verbose == 0)
    {
        for (p = *array; *p != NULL; p++)
            { printf("%s\n", *p); }
    }
    else
    {
        int i = 0;
        /* %p expects a void * argument, so each pointer is cast */
        printf("address %p: %s = %p\n",
               (void *) array, name, (void *) *array);
        for (p = *array; *p != NULL; p++, i++)
        {
            printf(" address %p: %s[%d] = %p --> \"%s\"\n",
                   (void *) p, name, i, (void *) *p, *p);
        }
    }
}
Usage:
// foo must be a pointer variable, so that &foo has type char ***
// ... here means comma-separated list of string literals
char **foo = (char *[]){ ... , NULL };
print_NULL_terminated_array("foo", &foo, verbose);
Exercise. Explain why we used the parameter declaration char
**array[] and not char *array[]. That
is, why would the usage be
print_NULL_terminated_array("foo", &foo, verbose);
and not
print_NULL_terminated_array("foo", foo, verbose);
Exercise. Write a macro interface to
print_NULL_terminated_array() with two arguments.
Last revised, 1 Feb. 2013