CMPSC 311, Introduction to Systems Programming

Regular Expressions



Reading
Regular Expressions, other than C/Unix



Regular Expressions
Regular expressions are built from characters, "metacharacters", and concatenation.



Unix commands using regular expressions -- grep
Examples

How many subdirectories are there in a given directory?  (example using /usr on Mac OS X)

% ls /usr
X11        bin        lib        sbin        standalone
X11R6      include    libexec    share

% ls -l /usr
total 8
drwxr-xr-x    8 root  wheel    272 Jul 23  2009 X11
lrwxr-xr-x    1 root  wheel      3 Jul 23  2009 X11R6 -> X11
drwxr-xr-x  924 root  wheel  31416 Jan 18 09:18 bin
drwxr-xr-x  267 root  wheel   9078 Jul 23  2009 include
drwxr-xr-x  395 root  wheel  13430 Jun 28  2011 lib
drwxr-xr-x   94 root  wheel   3196 Jul 11  2011 libexec
drwxr-xr-x  251 root  wheel   8534 Jan 18 09:18 sbin
drwxr-xr-x   72 root  wheel   2448 Jul 11  2011 share
drwxr-xr-x    5 root  wheel    170 Jul 23  2009 standalone

% ls -l /usr | grep '^d' | wc -l
       8


% ls -F /usr
X11/        bin/        lib/        sbin/        standalone/
X11R6@      include/    libexec/    share/

% ls -F /usr | cat
X11/
X11R6@
bin/
include/
lib/
libexec/
sbin/
share/
standalone/

% ls -F /usr | grep /
X11/
bin/
include/
lib/
libexec/
sbin/
share/
standalone/

% ls -F /usr | grep -c /
8
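
The pipelines above count subdirectories by matching text in the output of ls. For comparison, here is a minimal sketch of the same count done directly in C. It is not how ls or grep work internally, and count_subdirs is a hypothetical helper name. It uses lstat() so that a symbolic link such as X11R6 is not counted, which agrees with grep '^d'.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* ask for lstat() under strict -std= modes */
#endif
#include <dirent.h>
#include <stdio.h>
#include <sys/stat.h>

/*
 * Count the subdirectories of dir without following symbolic links.
 * Returns the count, or -1 if dir cannot be opened.
 */
int count_subdirs(const char *dir)
{
  DIR *dp = opendir(dir);
  struct dirent *entry;
  struct stat sb;
  char path[4096];
  int count = 0;
  int n;

  if (dp == NULL)
    { return -1; }

  while ((entry = readdir(dp)) != NULL) {
    /* ls without -a hides names that begin with a dot, including . and .. */
    if (entry->d_name[0] == '.')
      { continue; }

    n = snprintf(path, sizeof path, "%s/%s", dir, entry->d_name);
    if (n < 0 || (size_t) n >= sizeof path)
      { continue; }   /* name does not fit in the buffer; skip it */

    if (lstat(path, &sb) == 0 && S_ISDIR(sb.st_mode))
      { count++; }
  }

  closedir(dp);
  return count;
}
```

On the /usr listing shown above, count_subdirs("/usr") would be expected to return 8.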

Exercise.  How can the ls command discover that it is writing to a pipe, and not to a terminal?
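
One plausible answer, sketched in C: the isatty() function reports whether a file descriptor refers to a terminal. This is an illustration of the idea, not a claim about how any particular ls is written, and fd_is_terminal is a hypothetical wrapper name.

```c
#include <unistd.h>

/*
 * Return 1 if the file descriptor fd refers to a terminal,
 * 0 if it does not (for example, a pipe or a regular file).
 */
int fd_is_terminal(int fd)
{
  return isatty(fd) ? 1 : 0;
}
```

A program could call fd_is_terminal(STDOUT_FILENO) once at startup and choose multi-column output for a terminal, one name per line otherwise.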

Edit all source code files with the struct name type.
Find all source code lines with the struct name type.  Which of these is better?
Exercise.  Should you use single-quotes or double-quotes around the pattern in a grep or egrep command?  Why?

Really find all ...



Unix commands using regular expressions -- sed

Examples
s/<[^>]*>/ /g
s/\&nbsp;/ /g
s/\&lt;/\</g
s/\&gt;/\>/g
s/\&reg;//g
s/\&amp;/\&/g
s/$/ /
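
To show what the g flag on the first command amounts to, here is a rough C sketch of s/<[^>]*>/ /g built on the Posix regex functions. It covers only that one substitution, not the entity replacements, and strip_tags is a hypothetical name. The caller is assumed to supply an output buffer at least as large as the input line.

```c
#include <regex.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy line to out, replacing each match of the basic regular
 * expression <[^>]*> with a single space, as s/<[^>]*>/ /g would.
 * out must have room for strlen(line) + 1 bytes.
 * Returns 0 on success, -1 if the pattern fails to compile.
 */
int strip_tags(const char *line, char *out)
{
  regex_t re;
  regmatch_t pm;
  const char *p = line;   /* start of the part not yet copied */
  int flags = 0;

  if (regcomp(&re, "<[^>]*>", 0) != 0)
    { return -1; }

  while (regexec(&re, p, 1, &pm, flags) == 0) {
    /* copy the text before the match, then the replacement */
    memcpy(out, p, (size_t) pm.rm_so);
    out += pm.rm_so;
    *out++ = ' ';

    /* every match contains at least "<>", so p always advances */
    p += pm.rm_eo;
    flags = REG_NOTBOL;
  }

  strcpy(out, p);   /* copy whatever follows the last match */
  regfree(&re);
  return 0;
}
```

Given the line "<b>bold</b> text", the function stores " bold  text" in out: each of the two tags becomes one space.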



Extended Regular Expressions, basic matching

metacharacter   meaning
.               match any single character except newline
^               anchor: match the beginning of a line
$               anchor: match the end of a line
\<              anchor: match the beginning of a word
\>              anchor: match the end of a word
[list]          character class: match any character in list
[^list]         character class: match any character not in list
(  )            group: treat as a single unit
|               alternation: match one of the choices
\               quote: interpret the following metacharacter literally
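
A small way to experiment with these metacharacters from C is to compile one pattern and apply it to several strings. The sketch below does that; count_matching is a hypothetical helper, and the word lists in the note that follows are made up for illustration. The word anchors \< and \> are left out because not every regex library accepts them.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Compile the extended regular expression pattern once, then count
 * how many of strings[0..n-1] contain a match.
 * Returns the count, or -1 if the pattern fails to compile.
 */
int count_matching(const char *pattern, const char *const strings[], size_t n)
{
  regex_t re;
  size_t i;
  int count = 0;

  if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
    { return -1; }

  for (i = 0; i < n; i++) {
    if (regexec(&re, strings[i], 0, NULL, 0) == 0)
      { count++; }
  }

  regfree(&re);
  return count;
}
```

With the strings "cat", "cot", "coat", "cart" and "a cat": the pattern c.t matches 3 of them, ^c.t$ matches 2, ^c matches 4, t$ matches all 5, c[^a]t matches only "cot", and ^c(oa|ar)t$ matches "coat" and "cart". With the strings "1.5" and "125", the pattern 1.5 matches both, while 1\.5 (written "1\\.5" in C source) matches only the first.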


Extended Regular Expressions, repetition operators

operator   meaning
*          match 0 or more times
+          match one or more times
?          match zero or one times
{n}        bound: match n times
{n,}       bound: match n or more times
{0,m}      bound: match m or fewer times
{,m}       bound: match m or fewer times (non-standard)
{n,m}      bound: match n to m times
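
The operators *, + and ? can each be rewritten as a bound: {0,}, {1,} and {0,1}. The sketch below checks such claims on sample strings by running two patterns side by side. It only tests the strings it is given, so agreement is evidence rather than proof. same_verdicts is a hypothetical helper name.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Return 1 if the extended regular expressions p1 and p2 accept
 * exactly the same members of strings[0..n-1], 0 if they disagree
 * on some string, and -1 if either pattern fails to compile.
 */
int same_verdicts(const char *p1, const char *p2,
                  const char *const strings[], size_t n)
{
  regex_t re1, re2;
  size_t i;
  int same = 1;

  if (regcomp(&re1, p1, REG_EXTENDED | REG_NOSUB) != 0)
    { return -1; }
  if (regcomp(&re2, p2, REG_EXTENDED | REG_NOSUB) != 0)
    { regfree(&re1); return -1; }

  for (i = 0; i < n; i++) {
    int m1 = (regexec(&re1, strings[i], 0, NULL, 0) == 0);
    int m2 = (regexec(&re2, strings[i], 0, NULL, 0) == 0);
    if (m1 != m2)
      { same = 0; break; }
  }

  regfree(&re1);
  regfree(&re2);
  return same;
}
```

On the strings "", "a", "aa", "aaa" and "aaaa", the pairs ^a*$ and ^a{0,}$, ^a+$ and ^a{1,}$, ^a?$ and ^a{0,1}$ all agree, while ^a+$ and ^a*$ disagree on the empty string.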


Basic and Extended Regular Expressions, predefined character classes

class       meaning                                   similar to ...
[:lower:]   lowercase letters                         a-z
[:upper:]   uppercase letters                         A-Z
[:alpha:]   upper- and lowercase letters              A-Za-z
[:alnum:]   upper- and lowercase letters, numerals    A-Za-z0-9
[:digit:]   numerals                                  0-9
[:punct:]   punctuation characters
[:blank:]   space or tab (whitespace)
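
A class name is written inside a bracket expression, so "any digit" is [[:digit:]], and a class can be combined with other characters, as in [[:alpha:]_]. The sketch below uses two such expressions to test whether a string looks like a C identifier. is_identifier is a hypothetical helper, and it deliberately ignores details such as reserved words.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Return 1 if s is a letter or underscore followed by any number of
 * letters, digits, or underscores; return 0 otherwise (including
 * the unlikely case that the pattern fails to compile).
 */
int is_identifier(const char *s)
{
  regex_t re;
  int status;

  if (regcomp(&re, "^[[:alpha:]_][[:alnum:]_]*$",
              REG_EXTENDED | REG_NOSUB) != 0)
    { return 0; }

  status = regexec(&re, s, 0, NULL, 0);
  regfree(&re);

  return status == 0;
}
```

For example, is_identifier("st_mode") returns 1 and is_identifier("9lives") returns 0.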


Basic Regular Expressions



Examples

Search for Canadian postal codes
Search for a word, and accept American or British spelling
Search for a year date
Search for dollars or Euros
Exercise.  Explain \\\*.*[A-Za-z]+\$
Exercise.  Why are the fields in struct stat named st_something?
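
For the four searches listed above, here are candidate patterns and a helper that reports where the first match lies. The patterns are illustrative guesses, each with a stated simplification, not the patterns used in class. The macro names and ere_find are hypothetical.

```c
#include <regex.h>
#include <stddef.h>

/* Illustrative patterns only; each one makes a simplifying assumption. */
#define POSTAL_CODE "[A-Z][0-9][A-Z] ?[0-9][A-Z][0-9]"  /* accepts any uppercase letter */
#define COLOR_WORD  "colou?r"                           /* color or colour */
#define YEAR        "(19|20)[0-9]{2}"                   /* 1900 through 2099 only */
#define CURRENCY    "(dollars|[Ee]uros)"                /* the words, not the currency signs */

/*
 * Find the first match of the extended regular expression pattern
 * in text.  On a match, store its offset and length and return 1;
 * return 0 for no match and -1 if the pattern fails to compile.
 */
int ere_find(const char *pattern, const char *text,
             size_t *offset, size_t *length)
{
  regex_t re;
  regmatch_t pm;
  int status;

  if (regcomp(&re, pattern, REG_EXTENDED) != 0)
    { return -1; }

  status = regexec(&re, text, 1, &pm, 0);
  regfree(&re);

  if (status != 0)
    { return 0; }

  *offset = (size_t) pm.rm_so;
  *length = (size_t) (pm.rm_eo - pm.rm_so);
  return 1;
}
```

Searching "Ottawa ON K1A 0B1 Canada" with POSTAL_CODE reports offset 10 and length 7, and searching "Last revised, 8 Apr. 2013" with YEAR reports offset 21 and length 4.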

Need a list of words from a dictionary?
Build or solve crossword puzzles
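
A crossword slot such as "three letters, c first, t last" becomes the pattern ^c.t$ applied to each line of a word list. The sketch below filters a stream of words that way, roughly as grep -E would. grep_stream is a hypothetical helper, and the location of a word list (often /usr/share/dict/words) is an assumption that varies between systems.

```c
#include <regex.h>
#include <stdio.h>
#include <string.h>

/*
 * Copy to out each line of in that matches the extended regular
 * expression pattern; out may be NULL to count without printing.
 * Lines longer than the buffer are processed in pieces, which is
 * good enough for a word list.
 * Returns the number of matching lines, or -1 for a bad pattern.
 */
long grep_stream(FILE *in, FILE *out, const char *pattern)
{
  regex_t re;
  char line[1024];
  long count = 0;

  if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
    { return -1; }

  while (fgets(line, sizeof line, in) != NULL) {
    /* drop the newline so that $ matches at the end of the word */
    line[strcspn(line, "\n")] = '\0';

    if (regexec(&re, line, 0, NULL, 0) == 0) {
      count++;
      if (out != NULL)
        { fprintf(out, "%s\n", line); }
    }
  }

  regfree(&re);
  return count;
}
```

Given a stream holding the words cat, cot, coat, cut and scat, one per line, the pattern ^c.t$ selects cat, cot and cut.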


The Regular-Expression Library (Posix version, basic and extended regular expressions)
Example (from the Posix Standard and the Solaris man page)

#include <stddef.h>   /* NULL */
#include <regex.h>
 
/*
 * Match string against the extended regular expression
 * in pattern, treating errors as no match.
 *
 * return 1 for match, 0 for no match
 */
 
int match(const char *string, const char *pattern)
{
  int status;
  regex_t re;
 
  if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
    { return 0; }
 
  status = regexec(&re, string, (size_t) 0, NULL, 0);
 
  regfree(&re);
 
  if (status != 0)
    { return 0; }
 
  return 1;
}
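
The match() example treats every error as "no match". When the pattern comes from a user, it can help to tell the two apart. Here is a variant sketch that uses regerror() to produce a message. match_report and its errbuf parameters are hypothetical additions, not part of the Posix example.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Like match(), but distinguish errors from "no match".
 * Returns 1 for match, 0 for no match, -1 for an error; on error,
 * a message from regerror() is stored in errbuf (errbuf_size bytes).
 */
int match_report(const char *string, const char *pattern,
                 char *errbuf, size_t errbuf_size)
{
  int status;
  regex_t re;

  status = regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB);
  if (status != 0) {
    regerror(status, &re, errbuf, errbuf_size);
    return -1;
  }

  status = regexec(&re, string, (size_t) 0, NULL, 0);

  if (status != 0 && status != REG_NOMATCH)
    { regerror(status, &re, errbuf, errbuf_size); }

  regfree(&re);

  if (status == 0)
    { return 1; }

  return (status == REG_NOMATCH) ? 0 : -1;
}
```

A pattern with an unclosed group, such as (b, makes match_report() return -1 and leaves a description of the problem in errbuf. The exact wording of the message depends on the library.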

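Example (adapted from the Posix Standard and the Solaris man page)
regex_t re;
regmatch_t pm;
int error;
const char *p = buffer;   /* start of the part of the line not yet searched */
 
(void) regcomp(&re, pattern, 0);
 
/* this call to regexec() finds the first match on the line */
error = regexec(&re, p, 1, &pm, 0);
 
while (error == 0) {     /* while matches found */
  /* substring found is between p + pm.rm_so and p + pm.rm_eo;
     the offsets are relative to p, not to buffer */

  /* find the next match, starting just past the previous one
     (a pattern that can match the empty string needs extra care,
     since p would not advance) */
  p += pm.rm_eo;
  error = regexec(&re, p, 1, &pm, REG_NOTBOL);
}

regfree(&re);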
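
The fragment above leaves pattern and buffer undeclared. Here is a complete sketch of the same loop, under two assumptions that the fragment does not state: match offsets are taken relative to the string handed to regexec(), and an empty match is counted and then stepped over by one character so the loop always finishes. count_matches is a hypothetical name.

```c
#include <regex.h>

/*
 * Count the non-overlapping matches of the basic regular expression
 * pattern in buffer.  Returns the count, or -1 for a bad pattern.
 */
int count_matches(const char *pattern, const char *buffer)
{
  regex_t re;
  regmatch_t pm;
  const char *p = buffer;   /* start of the part not yet searched */
  int flags = 0;
  int count = 0;

  if (regcomp(&re, pattern, 0) != 0)
    { return -1; }

  while (regexec(&re, p, 1, &pm, flags) == 0) {
    count++;                /* match runs from p + pm.rm_so to p + pm.rm_eo */

    p += pm.rm_eo;          /* offsets are relative to p */
    if (pm.rm_eo == pm.rm_so) {
      /* empty match: step over one character to guarantee progress */
      if (*p == '\0')
        { break; }
      p++;
    }
    flags = REG_NOTBOL;     /* p no longer points at the start of the line */
  }

  regfree(&re);
  return count;
}
```

With this loop, count_matches("ab", "ab ab ab") returns 3, and count_matches("^a", "aaa") returns 1 because REG_NOTBOL keeps ^ from matching after the first call.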



Some details



References



Last revised, 8 Apr. 2013