CMPSC 311, Introduction to Systems Programming
Regular Expressions
Reading
- Harley Hahn's Guide to Unix and Linux
  - Ch. 19, Filters: Selecting, Sorting, Combining, and Changing
  - Ch. 20, Regular Expressions
- C Standard and man pages, see References list at the end
Regular Expressions, other than C/Unix
Regular Expressions
- Specify a pattern that describes a set of character strings.
- Posix Definition: "Regular Expression: A pattern that selects specific strings from a set of character strings."
- Given a string, is it in that set?
- Given a string and a pattern, can you find a substring that matches the pattern?
- editor commands, search, search-and-replace
Regular expressions are built from characters, "metacharacters", and concatenation.
- a "plain character" matches itself
[characters]
- matches any single character in the list
- ranges are possible, such as [a-z] to match any single lowercase letter
[^characters]
- matches any single character not in the list
.
- matches any single character
re*
- matches zero or more instances of the regular expression re
^
- matches the beginning of a line
$
- matches the end of a line
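The building blocks above can be tried out from C with the Posix regcomp() and regexec() functions described later in these notes. The following is only a sketch: bre_search() is a hypothetical helper name, and it assumes the default "C" locale, so that [a-z] means the ASCII lowercase letters.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * bre_search() is a hypothetical helper, not a library function.
 * Returns 1 if some substring of text matches the basic regular
 * expression in pattern, 0 if none does, -1 if pattern is invalid.
 */
int bre_search(const char *pattern, const char *text)
{
    regex_t re;
    int status;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, REG_NOSUB) != 0)   /* cflags without REG_EXTENDED: BRE */
        return -1;
    status = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return status == 0 ? 1 : 0;
}
```

For example, bre_search("^ab*c$", "abbbc") returns 1 because b* absorbs all three b characters, while bre_search("[^0-9]", "123") returns 0 because every character is in the excluded list.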
Unix commands using regular expressions -- grep
grep [options] pattern [file...]
- Select lines of a file that match a given pattern.
- The name comes from the ed/ex/vi command g/re/p which means "global regular expression print".
- Options
-c, count the number of matching lines
-i, ignore uppercase/lowercase
-l, list names of matching files only
-L, list names of non-matching files only
-n, also print the line number
-r, recursive, if given a directory name
-s, suppress certain warning messages
-v, reverse, print lines that do not match
-w, search only for complete words (composed of letters, digits, underscore, and separated by other characters)
-x, match the entire line
- Other versions
fgrep, for fixed-string patterns
egrep, extended grep, the modern version
egrep and "grep -E" use extended regular expressions (ERE)
grep uses basic regular expressions (BRE)
- On Linux and Mac OS X, these are the same program as grep, but the behavior changes according to the program name.
- Use the command cmp to compare binary files, but first try "ls -l".
Examples
How many subdirectories are there in a given directory? (example using /usr on Mac OS X)
% ls /usr
X11         bin         lib         sbin        standalone
X11R6       include     libexec     share
% ls -l /usr
total 8
drwxr-xr-x    8 root  wheel    272 Jul 23  2009 X11
lrwxr-xr-x    1 root  wheel      3 Jul 23  2009 X11R6 -> X11
drwxr-xr-x  924 root  wheel  31416 Jan 18 09:18 bin
drwxr-xr-x  267 root  wheel   9078 Jul 23  2009 include
drwxr-xr-x  395 root  wheel  13430 Jun 28  2011 lib
drwxr-xr-x   94 root  wheel   3196 Jul 11  2011 libexec
drwxr-xr-x  251 root  wheel   8534 Jan 18 09:18 sbin
drwxr-xr-x   72 root  wheel   2448 Jul 11  2011 share
drwxr-xr-x    5 root  wheel    170 Jul 23  2009 standalone
% ls -l /usr | grep '^d' | wc -l
8
% ls -F /usr
X11/        bin/        lib/        sbin/       standalone/
X11R6@      include/    libexec/    share/
% ls -F /usr | cat
X11/
X11R6@
bin/
include/
lib/
libexec/
sbin/
share/
standalone/
% ls -F /usr | grep /
X11/
bin/
include/
lib/
libexec/
sbin/
share/
standalone/
% ls -F /usr | grep -c /
8
Exercise. How can the ls command discover that it is writing to a pipe, and not to a terminal?
Edit all source code files with the struct name type.
vi `grep -l "struct name" *.[ch]`
Find all source code lines with the struct name type. Which of these is better?
grep "struct name" *.[ch]
grep -w "struct name" *.[ch]
grep "struct *name" *.[ch]
Exercise. Should you use single-quotes or double-quotes around the pattern in a grep or egrep command? Why?
Really find all ...
find . -type f -exec grep -l 'SearchPattern' {} \;
- grep -r -l 'SearchPattern' .
- Can you predict which command will run faster?
- How could you tell which ran faster?
Unix commands using regular expressions -- sed
sed [-i] command [file...]
sed [-i] -e command [file...]
- Non-interactive text editor
- The name comes from "stream editor", since it is often used in a pipeline.
- Until end-of-file,
  - read a line from the input stream,
  - execute the specified commands, making changes as necessary to the line,
  - write the line to the output stream
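- Options (not the entire list)
-i, change the file in-place (be careful!); not part of Posix. GNU sed accepts -i by itself, while BSD and Mac OS X sed require a backup-suffix argument, which may be empty (-i '')
-e, to specify a sequence of commands to be applied to each line
-f, take commands from a file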
- Commands
[ address [ , address ] ] command [ arguments ]
- Addresses
number
- line numbers start with 1; use $ for the last line number in the file.
number,number
/regex/
- Substitute command
[address|/pattern/]s/search/replacement/[g]
address is the address of one or more lines in the input stream
pattern is a regular expression; the command is applied only to lines that match it
search is a regular expression
replacement is the replacement text
- terms in brackets are optional, vertical bar indicates a choice
- final g indicates "global" - make all possible replacements on this line (non-overlapping), default is only the first possible
- etc.
Examples
- Convert DOS end-of-line to Unix end-of-line
sed 's/.$//'
- Is this right?
- Change all instances of Harvey to Pooka
sed 's/Harvey/Pooka/g' inputfile > outputfile
- Change all instances of mon or Mon to Monday
sed 's/[mM]on/Monday/g' inputfile
- If this is going to work correctly, what constraints should we place on the input file contents?
- Check the spelling in an HTML file, using the Solaris spell program
s/<[^>]*>/ /g
s/\&nbsp;/ /g
s/\&lt;/\</g
s/\&gt;/\>/g
s/\&reg;//g
s/\&amp;/\&/g
s/$/ /
sed -f strip.sed $1 | spell
- Does this work on all HTML files?
Extended Regular Expressions, basic matching
metacharacter   meaning
.               match any single character except newline
^               anchor: match the beginning of a line
$               anchor: match the end of a line
\<              anchor: match the beginning of a word
\>              anchor: match the end of a word
[list]          character class: match any character in list
[^list]         character class: match any character not in list
( )             group: treat as a single unit
|               alternation: match one of the choices
\               quote: interpret the following metacharacter literally
Extended Regular Expressions, repetition operators
operator   meaning
*          match zero or more times
+          match one or more times
?          match zero or one times
{n}        bound: match n times
{n,}       bound: match n or more times
{0,m}      bound: match m or fewer times
{,m}       bound: match m or fewer times (non-standard)
{n,m}      bound: match n to m times
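As a rough illustration of the repetition operators, the sketch below wraps an ERE in ^( )$ so that the whole string has to match, which makes the effect of each bound easy to see. The name ere_matches_whole() and the 256-byte pattern buffer are assumptions made for this example only.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>

/*
 * ere_matches_whole() is a hypothetical helper, not a library function.
 * Returns 1 if the entire text matches the extended regular expression
 * in pattern, 0 if it does not, -1 if the pattern is invalid or too long.
 */
int ere_matches_whole(const char *pattern, const char *text)
{
    char anchored[256];
    regex_t re;
    int n, status;

    assert(pattern != NULL && text != NULL);
    /* group the pattern and anchor it at both ends */
    n = snprintf(anchored, sizeof anchored, "^(%s)$", pattern);
    if (n < 0 || (size_t)n >= sizeof anchored)
        return -1;                  /* pattern too long for this sketch */
    if (regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    status = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return status == 0 ? 1 : 0;
}
```

With this helper, "colou?r" accepts both "color" and "colour", "[0-9]{4}" accepts "2013" but neither "201" nor "20130", and "a{2,3}" rejects "aaaa". The non-standard {,m} form is deliberately left out.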
Basic and Extended Regular Expressions, predefined character classes
class       meaning                                  similar to ...
[:lower:]   lowercase letters                        a-z
[:upper:]   uppercase letters                        A-Z
[:alpha:]   upper- and lowercase letters             A-Za-z
[:alnum:]   upper- and lowercase letters, numerals   A-Za-z0-9
[:digit:]   numerals                                 0-9
[:punct:]   punctuation characters
[:blank:]   space or tab (whitespace)
- Use these inside brackets, as in [[:upper:][:digit:]] to match an uppercase letter or digit.
- The set of letters is defined by the current locale.
- also, [:cntrl:], [:graph:], [:print:], [:space:], [:xdigit:]
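To see which characters a predefined class covers, one option is to build the bracket expression [[:name:]] and test a string one byte at a time. This is a sketch only: count_in_class() is a hypothetical helper, it assumes a single-byte encoding, and since it never calls setlocale() the classes are those of the default "C" locale.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>

/*
 * count_in_class() is a hypothetical helper, not a library function.
 * Returns the number of bytes in text that belong to the named class
 * (for example "upper" or "digit"), or -1 if the class is not accepted.
 */
int count_in_class(const char *class_name, const char *text)
{
    char pattern[64];
    char one[2] = { '\0', '\0' };   /* one-character string to test */
    regex_t re;
    int n, count = 0;

    assert(class_name != NULL && text != NULL);
    /* the class name goes inside a bracket expression: [[:name:]] */
    n = snprintf(pattern, sizeof pattern, "[[:%s:]]", class_name);
    if (n < 0 || (size_t)n >= sizeof pattern)
        return -1;
    if (regcomp(&re, pattern, REG_NOSUB) != 0)
        return -1;
    for (; *text != '\0'; text++) {
        one[0] = *text;
        if (regexec(&re, one, 0, NULL, 0) == 0)
            count++;
    }
    regfree(&re);
    return count;
}
```

For instance, count_in_class("digit", "K1A 0B1") returns 3 and count_in_class("blank", "a b\tc") returns 2.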
Basic Regular Expressions
- metacharacters
- bounds and groups are written with escaped braces and parentheses, \{ \} \( \); unescaped { } ( ) match themselves
- no ? + | (but you could use \{0,1\} for ? and \{1,\} for +)
- predefined character classes are not accepted by some older programs and functions, but are now part of the Posix requirements for BREs
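The difference between the two syntaxes can be checked by compiling the same text once as a BRE and once as an ERE. In this sketch, re_search() is a hypothetical helper whose cflags argument is passed straight to regcomp(): 0 selects BRE and REG_EXTENDED selects ERE.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * re_search() is a hypothetical helper, not a library function.
 * cflags is 0 for a basic RE or REG_EXTENDED for an extended RE.
 * Returns 1 on a match, 0 on no match, -1 if pattern is invalid.
 */
int re_search(const char *pattern, int cflags, const char *text)
{
    regex_t re;
    int status;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
        return -1;
    status = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return status == 0 ? 1 : 0;
}
```

Compiled as a BRE, "^a\{2\}$" matches "aa", and "^ab\{1,\}$" plays the role of the ERE "^ab+$". The unescaped BRE "^a{2}$" matches the literal text "a{2}" instead.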
Examples
Search for Canadian postal codes
- letter number letter space number letter number
- uppercase letters (English only?)
grep '[A-Z][0-9][A-Z] [0-9][A-Z][0-9]' data
grep '[[:upper:]][[:digit:]][[:upper:]] [[:digit:]][[:upper:]][[:digit:]]' data
Search for a word, and accept American or British spelling
Search for a year date
Search for dollars or Euros
egrep '(\$|EUR|Eur)[[:digit:]]' data
- Why not use € ?
- Is this foolproof?
Exercise. Explain \\\*.*[A-Za-z]+\$
- \\ --> a quoted backslash character
- etc.
Exercise. Why are the fields in struct stat named st_something?
Need a list of words from a dictionary?
- Solaris, look in /usr/dict/
- Linux, look in /usr/share/dict/
- GNU aspell, interactive spell checker
- Mac OS X, look in /usr/share/dict/
Build or solve crossword puzzles
grep -i '^..qu...$' /usr/dict/words
- Solaris, 17 words, including Nyquist but not dequeue
grep -i '^..qu...$' /usr/share/dict/words
- Linux, 105 words, including Nyquist and dequeue
grep -i '^..qu...$' /usr/share/dict/words
- Mac OS X, 53 words, but not Nyquist or dequeue
The Regular-Expression Library (Posix version, basic and extended regular expressions)
#include <regex.h>

regex_t      // compiled internal form
regmatch_t   // report match position
regoff_t     // signed integer, offset

int regcomp(regex_t * restrict preg,
            const char * restrict pattern,
            int cflags);
int regexec(const regex_t * restrict preg,
            const char * restrict string,
            size_t nmatch,
            regmatch_t pmatch[restrict],
            int eflags);
size_t regerror(int errcode,
                const regex_t * restrict preg,
                char * restrict errbuf,
                size_t errbuf_size);
void regfree(regex_t *preg);
regcomp() compiles an RE written as a string (pattern) into an internal form (preg)
regexec() matches an internal form (preg) against a string (string) and reports results (first nmatch elements of pmatch)
regerror() transforms error codes from regcomp() or regexec() into human-readable messages
regcomp() and regexec() return 0 if successful
regfree() frees any dynamically-allocated storage used by the internal form of an RE
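A minimal sketch of how regcomp() and regerror() might be combined: compile a pattern and, on failure, fill a caller-supplied buffer with the message for the error code. The function name compile_or_explain() is hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * compile_or_explain() is a hypothetical helper, not a library function.
 * Returns 0 and leaves errbuf empty if pattern compiles into *re;
 * otherwise returns the regcomp() error code and stores the
 * corresponding message in errbuf.
 * On success the caller must eventually call regfree(re).
 */
int compile_or_explain(regex_t *re, const char *pattern, int cflags,
                       char *errbuf, size_t errbuf_size)
{
    int rc;

    assert(re != NULL && pattern != NULL);
    assert(errbuf != NULL && errbuf_size > 0);
    errbuf[0] = '\0';
    rc = regcomp(re, pattern, cflags);
    if (rc != 0) {
        /* regerror() writes at most errbuf_size bytes, including the
         * terminating null character, truncating longer messages */
        regerror(rc, re, errbuf, errbuf_size);
    }
    return rc;
}
```

A caller would typically print errbuf to stderr when the return value is nonzero. The exact wording of the message differs between systems.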
typedef struct {
    size_t re_nsub;   // Number of parenthesised subexpressions.
    ...
} regex_t;
typedef struct {
    regoff_t rm_so;   // Byte offset from start of string to start of substring.
    regoff_t rm_eo;   // Byte offset from start of string of the first character after the end of substring.
    ...
} regmatch_t;
regcomp() cflags is bitwise OR of
REG_EXTENDED
- use Extended Regular Expressions
- default is Basic Regular Expressions (some systems, such as Mac OS X, define REG_BASIC as 0 for this; it is not part of Posix)
REG_ICASE
- ignore case in match
- default is case-sensitive
REG_NOSUB
- no subexpressions of the regular expression will be matched
- report only success/fail in regexec()
- default is to set preg->re_nsub in regcomp()
REG_NEWLINE
- change the handling of newline characters, as described in the rest of the specification (not copied here)
- etc.
regexec() eflags is bitwise OR of
REG_NOTBOL
- (not beginning of line)
- The first character of the string pointed to by string is not the beginning of the line. Therefore, the circumflex character (^), when taken as a special character, will not match the beginning of string.
REG_NOTEOL
- (not end of line)
- The last character of the string pointed to by string is not the end of the line. Therefore, the dollar sign ($), when taken as a special character, will not match the end of string.
- etc.
regexec() pmatch[] must have at least nmatch elements if nmatch is greater than 0 and REG_NOSUB was not used in regcomp(); otherwise pmatch is ignored.
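As an illustration of nmatch and pmatch[], the sketch below copies out the text matched by one parenthesised subexpression. Element 0 of pmatch describes the whole match and element i describes subexpression i. The name capture_group() and the MAX_GROUPS limit are assumptions made for this example.

```c
#include <assert.h>
#include <regex.h>
#include <string.h>

#define MAX_GROUPS 10   /* hypothetical limit chosen for this sketch */

/*
 * capture_group() is a hypothetical helper, not a library function.
 * Copies the text matched by subexpression number group (0 = whole
 * match) of the ERE pattern into out, truncating if necessary.
 * Returns 1 if copied, 0 if there is no such match, -1 on bad input.
 */
int capture_group(const char *pattern, const char *text, size_t group,
                  char *out, size_t out_size)
{
    regex_t re;
    regmatch_t pm[MAX_GROUPS];
    int found = 0;

    assert(pattern != NULL && text != NULL && out != NULL);
    if (group >= MAX_GROUPS || out_size == 0)
        return -1;
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)   /* no REG_NOSUB */
        return -1;
    /* re.re_nsub was set by regcomp() */
    if (group <= re.re_nsub
        && regexec(&re, text, MAX_GROUPS, pm, 0) == 0
        && pm[group].rm_so != -1) {
        size_t len = (size_t)(pm[group].rm_eo - pm[group].rm_so);
        if (len >= out_size)
            len = out_size - 1;
        memcpy(out, text + pm[group].rm_so, len);
        out[len] = '\0';
        found = 1;
    }
    regfree(&re);
    return found;
}
```

With the pattern "([A-Z][0-9][A-Z]) ([0-9][A-Z][0-9])" and the text "Ottawa ON K1A 0B1 Canada", group 0 is "K1A 0B1", group 1 is "K1A", and group 2 is "0B1".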
Example (from the Posix Standard and the Solaris man page)
#include <regex.h>
#include <stddef.h>   /* NULL */
/*
 * Match string against the extended regular expression
 * in pattern, treating errors as no match.
 *
 * return 1 for match, 0 for no match
 */
int match(const char *string, char *pattern)
{
    int status;
    regex_t re;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        { return 0; }
    status = regexec(&re, string, (size_t) 0, NULL, 0);
    regfree(&re);
    if (status != 0)
        { return 0; }
    return 1;
}
Example (from the Posix Standard and the Solaris man page)
- Find all substrings in a line that match a pattern supplied by a user.
- Note - by default, patterns are assumed to be basic regular expressions.
regex_t re;
regmatch_t pm;
int error;

(void) regcomp(&re, pattern, 0);
/* this call to regexec() finds the first match on the line */
error = regexec(&re, &buffer[0], 1, &pm, 0);
while (error == 0) {    /* while matches found */
    /* substring found is between pm.rm_so and pm.rm_eo */
    /* find the next match */
    error = regexec(&re, buffer + pm.rm_eo, 1, &pm, REG_NOTBOL);
}
regfree(&re);
- Do you see the bug? (This is not easy!)
- Here is the original version, with additional output.
- Here is a correct version, with error-checking and additional output.
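One possible repair is sketched below; it is not the course's own corrected version. The key observation is that rm_so and rm_eo are measured from the start of the string passed to each regexec() call, not from the start of buffer, so the loop has to keep a running offset. With the pattern ab and the line "ab ab ab", the original loop alternates between the same two search positions and never ends. The name find_all_matches() and the choice to step one byte past an empty match are assumptions of this sketch.

```c
#include <assert.h>
#include <regex.h>
#include <string.h>

/*
 * find_all_matches() is a hypothetical helper, not a library function.
 * Stores up to max_matches matches of the BRE pattern in matches[],
 * with offsets relative to the start of line.
 * Returns the number of matches stored, or -1 if pattern is invalid.
 */
int find_all_matches(const char *pattern, const char *line,
                     regmatch_t *matches, size_t max_matches)
{
    regex_t re;
    regmatch_t pm;
    size_t offset = 0;            /* where the current search starts */
    size_t len, count = 0;
    int eflags = 0;

    assert(pattern != NULL && line != NULL && matches != NULL);
    len = strlen(line);
    if (regcomp(&re, pattern, 0) != 0)
        return -1;
    while (count < max_matches && offset <= len
           && regexec(&re, line + offset, 1, &pm, eflags) == 0) {
        /* pm is relative to line + offset, so add offset back in */
        matches[count].rm_so = (regoff_t)offset + pm.rm_so;
        matches[count].rm_eo = (regoff_t)offset + pm.rm_eo;
        count++;
        offset += (size_t)pm.rm_eo;
        if (pm.rm_eo == pm.rm_so)
            offset++;             /* empty match: force progress */
        eflags = REG_NOTBOL;      /* later searches do not start a line */
    }
    regfree(&re);
    return (int)count;
}
```

For the pattern "ab" and the line "ab ab ab" this reports three matches, at byte offsets 0 to 2, 3 to 5, and 6 to 8.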
Some details
- character matching depends on the bit pattern of the character encoding, not on the graphic representation of the character
- a search starts at the beginning of a string, and ends when the first substring matching the regular expression is found
- the longest matching substring is matched
- each subpattern matches the longest possible substring
- a null string is longer than "no match at all"
- as usual, a null character denotes the end of the string
- The limit to the length of a regular expression is at least 256 bytes.
- We left out equivalence classes, collating elements, multicharacter collation, and a few other things.
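The "first, then longest" rules in the list above can be observed by asking regexec() for the span of the overall match. This is a sketch; match_span() is a hypothetical helper, and the results described assume a regex library that follows the Posix matching rules.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * match_span() is a hypothetical helper, not a library function.
 * Searches text for the ERE pattern and stores the byte offsets of
 * the match in *start and *end.
 * Returns 1 on a match, 0 on no match, -1 if pattern is invalid.
 */
int match_span(const char *pattern, const char *text, int *start, int *end)
{
    regex_t re;
    regmatch_t pm;
    int found;

    assert(pattern != NULL && text != NULL);
    assert(start != NULL && end != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    found = (regexec(&re, text, 1, &pm, 0) == 0);
    if (found) {
        *start = (int)pm.rm_so;
        *end = (int)pm.rm_eo;
    }
    regfree(&re);
    return found;
}
```

Searching "xabcd" for "a|ab|abc" yields the span 1 to 4: the match starts at the first possible position and the longest alternative wins. Searching "abbb" for "b*" yields the empty span 0 to 0, because a null string at the start of the text counts as a match and comes before the run of b characters.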
References
- Harley Hahn's Guide to Unix and Linux, see above
- assorted man pages
  - Solaris, archaic versions, libgen.h, regcmp(3C); the regcmp(1) command can produce C code.
  - Solaris, less-archaic versions, Simple Regular Expressions, regexp(5)
  - Solaris, modern versions, Basic and Extended Regular Expressions, regex(5), regex.h, regcomp(3C)
  - Linux, regcomp(3)
  - Mac OS X, regex(3)
- Posix, Base Definitions, Issue 6, 2004, Chapter 9, "Regular Expressions"
- Posix, Base Definitions, Issue 7, 2008, Chapter 9, "Regular Expressions"
Last revised, 8 Apr. 2013