Chapter 9

Regular Expressions

Introduction

Regular expressions (REs) describe patterns against which text in files can be matched. They are used in the editor ed and the family of editors based on it; they are also used in many other Unix tools. In this chapter, we will learn about REs by using the Unix tool called grep.

Text to use

Here is a short file of text:

$ more cars
The typical American male devotes more than 1,600 hours a
year to his car.  He sits in it while it goes and while it
stands idling.  He parks it and searches for it.  He earns the
money to put down on it and to meet the monthly instalments.
He works to pay for petrol, tolls, insurance, taxes and
tickets.  He spends four of his sixteen waking hours on the
road or gathering resources for it.  The model American puts
in 1,600 hours to get 7,500 miles: less than five miles per
hour.  In countries deprived of a transportation industry,
people manage to do the same, walking wherever they want to
go, and they allocate only three to eight percent of their
society's time budget to traffic instead of 28 per cent.

Ivan Illich
$

The name of the file is cars; it will be used throughout this chapter and the next.

grep

If we wish to look for lines in cars containing four we can use this command:

$ grep four cars
tickets.  He spends four of his sixteen waking hours on the
$

The first of grep's arguments is a regular expression (RE) and the second is a file name. grep's usual action is to display all the lines in the file that match the RE. So, what we see is the only line of cars containing four.

If we try the following:

$ grep it cars
year to his car.  He sits in it while it goes and while it
stands idling.  He parks it and searches for it.  He earns the
money to put down on it and to meet the monthly instalments.
road or gathering resources for it.  The model American puts
$

grep displays more lines than before because the string it occurs on more lines of cars than four does.

We are not restricted to whole words with grep; it will find part-words too. For instance:

$ grep ling cars
stands idling.  He parks it and searches for it.  He earns the
$

Looking in many files

If we supply more than one file name, grep looks in all of the files and displays the file name before any lines containing the RE. The same thing happens if we use Unix's file name expansion facilities:

$ grep needle *
haystack:This is the line with the needle.
porcupine:Here is a needle
porcupine:Another needle
porcupine:Yet another needle

This is very handy when we know we have some text but don't know which file it is in.
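The following is a minimal sketch of a multi-file search, not a transcript from a real system. It assumes mktemp -d is available to make a scratch directory; the file contents and the third file, pillow, are made up for the illustration. The -l option, which is standard in grep, prints only the names of the files that contain a match.

```shell
# Build a scratch directory with three small files (contents are made up).
dir=$(mktemp -d)
printf '%s\n' 'This is the line with the needle.' 'Just hay here.' > "$dir/haystack"
printf '%s\n' 'Here is a needle' 'Another needle' > "$dir/porcupine"
printf '%s\n' 'nothing sharp' > "$dir/pillow"

# With several files, each matching line is prefixed by its file name.
matches=$(cd "$dir" && grep needle *)
printf '%s\n' "$matches"

# -l prints only the names of files holding at least one match.
names=$(cd "$dir" && grep -l needle *)
printf '%s\n' "$names"

# Tidy up the scratch directory.
rm -rf "$dir"
```

When we only want to know which file to open, -l is usually enough.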

Metacharacters

Regular expressions usually consist of ordinary characters and special characters; the special characters are known as metacharacters. The following are all metacharacters: $ ^ [ . *. We will examine them one by one.

Metacharacter $

If the dollar sign ($) comes at the end of a regular expression, it means the text that the regular expression matches has to occur at the end of a line. (If it helps, you can think of the dollar sign matching an imaginary, invisible character at the end of every line.) This example is similar to the previous one:

$ grep 'it$' cars
year to his car.  He sits in it while it goes and while it
$

This time, grep only finds the portion of the RE before the dollar sign (it) when it occurs at the end of a line. Notice that the RE has to be in single quotation marks; this is because the dollar sign has a special meaning to the shell besides its meaning in REs. The quotation marks hide the RE from the shell so it does not interpret any of its special characters.

Metacharacter ^

If the caret symbol (^) comes at the start of a regular expression, it means that the RE has to occur at the beginning of a line. (You can think of the caret symbol matching an invisible character at the start of every line.) In this example:

$ grep '^The' cars
The typical American male devotes more than 1,600 hours a
$

grep only finds The when it occurs at the start of a line.
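The two anchors can be compared side by side. This is a small sketch on a made-up three-line sample (not the cars file), with the results kept in shell variables whose names are only illustrative.

```shell
# Made-up sample: "it" appears at the start, end and middle of a line.
sample=$(printf '%s\n' 'it goes' 'He parks it' 'sit down')

# No anchor: every line containing "it" is counted.
anywhere=$(printf '%s\n' "$sample" | grep -c 'it')

# ^ ties the match to the start of the line.
at_start=$(printf '%s\n' "$sample" | grep '^it')

# $ ties the match to the end of the line.
at_end=$(printf '%s\n' "$sample" | grep 'it$')

echo "$anywhere"   # prints 3
echo "$at_start"   # prints: it goes
echo "$at_end"     # prints: He parks it
```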

Not always a metacharacter

The caret symbol and dollar sign are only treated as metacharacters when they occur at the beginning and the end of a RE respectively. In the following exchange:

$ grep 't^s$i' cars
$

grep is looking for an exact match for the five characters, including ^ and $, inside the quotation marks. Notice that grep simply displays nothing if no lines match the RE.
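To see a case where such an RE does match, here is a sketch using a made-up sample in which one line really contains a caret and a dollar sign. It relies on grep's default (basic) REs; with the -E option the two characters are treated as anchors wherever they appear.

```shell
# Made-up sample: only the first line contains a literal ^ and $.
sample=$(printf '%s\n' 'cost^2$x' 'cost' '2$x')

# In the middle of a basic RE, ^ and $ stand for themselves.
literal=$(printf '%s\n' "$sample" | grep 't^2$x')

printf '%s\n' "$literal"   # prints: cost^2$x
```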

Metacharacter . (dot)

The dot metacharacter matches any single character as we see here:

$ grep 'i.n' cars
hour.  In countries deprived of a transportation industry,
$

In our text, the dot matched the character o, letting the RE match ion. We can use more than one dot to match more characters. For example, to find a and e whenever they are separated by two other characters:

$ grep 'a..e' cars
money to put down on it and to meet the monthly instalments.
He works to pay for petrol, tolls, insurance, taxes and
road or gathering resources for it.  The model American puts
$

Even when grep has found the lines, it's still quite hard for us humans to spot the text the REs match!
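One way to see exactly what matched is the -o option, which prints only the matched text, one match per line. It is not part of standard grep but is provided by GNU and BSD grep, so treat its availability as an assumption about your system. The three words below are taken from the three lines displayed above.

```shell
# Three words taken from the matching lines of cars.
words=$(printf '%s\n' 'instalments' 'insurance' 'gathering')

# -o (a GNU/BSD extension, assumed available) prints only the matched text.
matched=$(printf '%s\n' "$words" | grep -o 'a..e')

printf '%s\n' "$matched"
# prints:
# alme
# ance
# athe
```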

Escaped metacharacters

If we want to remove the special meaning of a metacharacter we can do so by preceding it with a backslash (\). Here we look for lines ending with t.:

$ grep 't\.$' cars
society's time budget to traffic instead of 28 per cent.
$

In the example, the backslash turns the dot into an ordinary character so that grep finds the only line of cars that ends with a t and a full stop.
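The effect of the backslash is easiest to see by running the same RE with and without it. This is a sketch on a made-up two-line sample; the variable names are only illustrative.

```shell
# Made-up sample: one line ends in "t.", the other in "ts".
sample=$(printf '%s\n' '28 per cent.' '28 per cents')

# Unescaped, the dot matches any character, so both lines match.
unescaped=$(printf '%s\n' "$sample" | grep -c 't.$')

# Escaped, the dot must be a literal full stop, so only one line matches.
escaped=$(printf '%s\n' "$sample" | grep -c 't\.$')

echo "$unescaped"  # prints 2
echo "$escaped"    # prints 1
```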

Metacharacter [

If we put brackets ([ and ]) around some group of characters we get a RE which matches any single character from the group. For instance:

$ grep ' [Ii]n ' cars
year to his car.  He sits in it while it goes and while it
hour.  In countries deprived of a transportation industry,
$

The part of the RE in brackets matches either I or i; combined with the spaces and the n, the RE matches In or in wherever it has a space on either side.

Of course, we can have completely unrelated characters inside the brackets:

$ grep '[qxj]' cars
He works to pay for petrol, tolls, insurance, taxes and
tickets.  He spends four of his sixteen waking hours on the
$

There are two lines with q, x or j in them.

We can use a variation on the same theme to match a character in a range.

$ grep '[J-S]' cars
$

The example shows that no upper case letters between J and S occur in the text.

Another variation allows us to match characters that are not listed. The caret immediately after the opening bracket in the following example changes the sense of the search:

$ grep '[^A-Za-z0-9 .,:]' cars
society's time budget to traffic instead of 28 per cent.
$

Without the caret, the RE would have matched any single character taken from: letters, digits, space, dot, comma and colon. With the caret, the RE only matches any single character outside that set. (In the example, it is the apostrophe (') after society that matches the RE.)
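The bracket forms above can be tried on a small made-up sample, as in this sketch. Setting LC_ALL=C is an assumption made here so that ranges such as A-Z follow plain ASCII order; in other locales a range may include characters you do not expect.

```shell
# Made-up sample lines.
sample=$(printf '%s\n' 'In 1,600 hours' "society's time" 'in it' 'taxes')

# [Ii] matches either case of the first letter.
either_case=$(printf '%s\n' "$sample" | LC_ALL=C grep -c '^[Ii]n ')

# A caret just inside the bracket negates the set: match any character
# that is not a letter, digit, space, dot, comma or colon.
odd_char=$(printf '%s\n' "$sample" | LC_ALL=C grep '[^A-Za-z0-9 .,:]')

echo "$either_case"  # prints 2
echo "$odd_char"     # prints: society's time
```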

Metacharacter *

The asterisk (*) character is only used after some other character; it means zero or more occurrences of the character it follows. For example, a* means zero or more consecutive as and z* means zero or more consecutive zs. Clearly aa* matches one or more as and aaa* matches two or more as. This example shows how we look for one or more consecutive zeroes:

$ grep '00*' cars
The typical American male devotes more than 1,600 hours a
in 1,600 hours to get 7,500 miles: less than five miles per
$

Although there are three occurrences, we only get two lines displayed.
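The difference between matching lines and individual matches can be checked with a short sketch. It uses the two lines shown above, the standard -c option to count matching lines, and the -o option (a GNU/BSD extension, assumed to be available) to print each match on a line of its own.

```shell
# The two lines of cars that contain zeroes.
sample=$(printf '%s\n' \
  'The typical American male devotes more than 1,600 hours a' \
  'in 1,600 hours to get 7,500 miles: less than five miles per')

# -c counts matching lines.
lines=$(printf '%s\n' "$sample" | grep -c '00*')

# -o prints each match separately; counting those lines gives occurrences.
occurrences=$(printf '%s\n' "$sample" | grep -o '00*' | grep -c '.')

echo "$lines"        # prints 2
echo "$occurrences"  # prints 3
```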

A little warning: because the character-asterisk combination can match zero occurrences of the character, it should always be used together with something else; it should never be used on its own. For instance, as we show here, there are no z characters in cars:

$ grep 'z' cars
$

Even so, this command:

grep 'z*' cars

would display every line of the file because z* is satisfied by zero zs, and grep finds zero zs at the start of every line.
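A sketch of the same pitfall on a made-up sample with no z in it. Note that grep exits with status 1 when nothing matches, which is why the second command is followed by "|| true" in case the script runs under set -e.

```shell
# Made-up sample with no z anywhere.
sample=$(printf '%s\n' 'four' 'sixteen' 'hours')

# 'z*' allows zero zs, so it matches the empty string on every line.
star=$(printf '%s\n' "$sample" | grep -c 'z*')

# 'zz*' demands at least one z, so nothing matches.
one_or_more=$(printf '%s\n' "$sample" | grep -c 'zz*' || true)

echo "$star"         # prints 3
echo "$one_or_more"  # prints 0
```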

Dot plus asterisk

A dot followed by an asterisk (.*) matches any string of characters of any length, including none; that is, any character followed by any character for as long as possible. You might expect it to match any character and then allow only that particular character to be repeated; it does not work like that.

Dot plus asterisk is most useful in this sort of situation:

$ grep 'stands.*the' cars
stands idling.  He parks it and searches for it.  He earns the
$

As you see, .* matches all the text between stands and the.
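A short sketch on made-up lines shows two details that are easy to miss: .* is happy to match nothing at all, and the text on either side of it must still appear in the stated order.

```shell
# Made-up sample lines.
sample=$(printf '%s\n' 'stands idling.  He earns the' 'the car stands' 'standsthe')

# .* allows any text, including none, between the two words,
# but "stands" must come before "the" on the line.
found=$(printf '%s\n' "$sample" | grep 'stands.*the')

printf '%s\n' "$found"
# prints:
# stands idling.  He earns the
# standsthe
```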

Useful REs

Some REs are so useful that they are worth listing:

^$                empty (blank) lines
^  *              lines starting with spaces
  *$              lines ending with spaces
^  *$             (blank) lines containing only spaces
^.*$              the whole line
[A-Z][a-z][a-z]*  a word with a capital letter

It might be difficult to tell from the presentation, but the ones with spaces contain exactly two space characters.

Here are the first and last of those REs used on cars:

$ grep '^$' cars

$ grep '[A-Z][a-z][a-z]* [A-Z][a-z][a-z]*' cars
Ivan Illich
$

Notice the blank line between the two grep commands. It is the output from the first.
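The REs that involve spaces are easier to trust once they have been tried on lines whose spacing is known. This sketch uses a made-up six-line sample; each RE contains exactly two space characters, as in the list above.

```shell
# Made-up sample: line 2 is empty, line 4 holds only two spaces,
# line 5 starts with two spaces and line 6 ends with one.
sample=$(printf '%s\n' 'Ivan Illich' '' 'text' '  ' '  indented' 'trailing ')

empty=$(printf '%s\n' "$sample" | grep -c '^$')          # truly empty lines
only_spaces=$(printf '%s\n' "$sample" | grep -c '^  *$') # nothing but spaces
leading=$(printf '%s\n' "$sample" | grep -c '^  *')      # start with a space
trailing=$(printf '%s\n' "$sample" | grep -c '  *$')     # end with a space

echo "$empty $only_spaces $leading $trailing"   # prints: 1 1 2 2
```

The line holding only spaces is counted three times: it starts with a space, ends with a space and contains nothing else.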

Not the whole story

We have only seen the most important features of REs. For the whole story you will have to read the man pages for regexp in section five of the manual (on Linux systems, the corresponding page is regex in section seven).

Two grep options

One of grep's most useful options is -v which makes grep display lines which do not match the RE. This example:

$ grep -v 'e' cars

Ivan Illich
$

gives us the only two lines of the text that do not contain an e.

The -c option makes grep merely count the matching lines. For instance, this:

$ grep -c '^$' cars
1
$

confirms there is only one blank line in the file.

As you would expect, you can use both options at once:

$ grep -cv 'e' cars
2
$

The example confirms that two lines did not contain an e.
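Since every line either matches or does not, the counts from -c and -cv should add up to the number of lines in the input. The sketch below checks this on a made-up four-line sample; the RE ^ is used to count all the lines because it matches at the start of every line, even an empty one.

```shell
# Made-up sample: only the first line contains an e; the second is empty.
sample=$(printf '%s\n' 'He parks it' '' 'Ivan Illich' 'walking')

with_e=$(printf '%s\n' "$sample" | grep -c 'e')
without_e=$(printf '%s\n' "$sample" | grep -cv 'e')
total=$(printf '%s\n' "$sample" | grep -c '^')

echo "$with_e + $without_e = $total"   # prints: 1 + 3 = 4
```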

Common uses for grep

We often use grep to select some particular lines from many lines of output. Suppose we want to see if a friend is logged on. We could just use the who command but it might list a hundred or so people. This will save us time:

who | grep fred

Clearly, it is easier to let grep scan for fred in the list of logged in users.

Similarly, to count the number of directories in the current directory we could do ls -l and count the lines starting with d. However, since the purpose of computers should be to free people from boring, repetitive tasks we should look for a better solution. Here is one:

ls -l | grep '^d' | wc -l

You will find many similar uses for grep.
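The directory count can be tried safely in a scratch directory, as in this sketch. It assumes mktemp -d is available, and the directory and file names are made up. It also uses grep -c to do the counting, which makes the wc -l stage unnecessary.

```shell
# Scratch directory with two subdirectories and one ordinary file
# (all names are made up for the illustration).
dir=$(mktemp -d)
mkdir "$dir/docs" "$dir/src"
: > "$dir/notes.txt"

# In ls -l output, the line for a directory starts with d.
dirs=$(ls -l "$dir" | grep -c '^d')

echo "$dirs"   # prints 2

# Tidy up the scratch directory.
rm -rf "$dir"
```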

QUESTIONS

In the following you are asked to work with a file called Unix. Here is the Unix file for you to download and use.

  1. Look for lines in the Unix file containing "Unix".

    Answer

    If you are just looking, grep is the right tool for the job. So:

    grep 'Unix' Unix
    

    Or, if you put screws in with a hammer:

    sed -n '/Unix/p' Unix
    

  2. How many lines in Unix contain "Unix"? Do not use wc. (Hint: man grep)

    Answer

    grep -c 'Unix' Unix
    

    The hint meant: find the option by reading the man page for grep. If a logical person was picking an option letter to use for something to do with counting they would use "c". Therefore, -c is the one to look at first in the list of options.

  3. Look for lines in Unix:

    1. beginning with "U";

      Answer

      grep '^U' Unix
      

    2. ending with "x";

      Answer

      grep 'x$' Unix
      

    3. containing two "o"s separated by a single character;

      Answer

      grep 'o.o' Unix
      

    4. containing two or more consecutive equals signs;

      Answer

      grep '===*' Unix
      

      Three = are needed for two or more equals signs; == (without an *) would work but is less precise.

    5. containing "Unix" or "unix";

      Answer

      grep '[Uu]nix' Unix
      

    6. ending with a full stop;

      Answer

      grep '\.$' Unix
      

    7. ending with a character that is not a full stop; (This does not include blank lines.)

      Answer

      grep '[^.]$' Unix
      

      The dot doesn't need to be escaped as characters in the brackets are assumed not to be metacharacters.

    8. not ending with a full stop; (This includes blank lines. Consult the man page to find an option that makes grep behave like ed's "v" command.)

      Answer

      grep -v '\.$' Unix
      

      Did you take the hint to check out the -v option or did you read through the whole man page?

    9. containing upper case letters between "P" and "S".

      Answer

      grep '[P-S]' Unix
      

  4. Use sed to display the Unix file with:

    1. all occurrences of "Unix" changed to "UNIX".

      Answer

      sed '/Unix/s//UNIX/g' Unix
      

      OR

      sed 's/Unix/UNIX/g' Unix
      

      (Note the g suffixes, needed because a line might have more than one "Unix".)

    2. its present line 17 deleted.

      Answer

      sed '17d' Unix
      

    3. its present lines 17 to 19 deleted.

      Answer

      sed '17,19d' Unix
      

    4. all lines containing "and" deleted.

      Answer

      sed '/and/d' Unix
      

    5. all occurrences of "Unix" changed to "UNIX" and its present lines 17 to 19 deleted.

      Answer

      sed '/Unix/s//UNIX/g
           17,19d'  Unix
      

      OR (more typing):

      sed -e 's/Unix/UNIX/g' -e '17,19d' Unix
      

    6. only line 17 shown.

      Answer

      sed -n '17p' Unix
      

    7. its present line 17 displayed twice.

      Answer

      sed '17p' Unix
      

  5. Unix has a file called /etc/passwd which has one line per user. The login code (user name) is the first thing on the line and is immediately followed by a ":". Find your own entry in the file.

    Answer

    This:

    grep '^cmsps:' /etc/passwd
    

    finds my entry. Replace cmsps with your login code or use:

    grep "^$USER:" /etc/passwd
    
