Chapter 13

Unix Tools

Introduction

In this chapter, we meet the most common Unix data manipulation commands.

Filter

A filter is something we are familiar with in everyday life. For instance, in cooking a sieve filters lumps out of gravy. In electronics, a filter is used to remove unwanted signal frequencies. In Unix, a filter is the term for a command that accepts standard input and alters it in some way before sending it to the standard output. Some filters remove lines from their input, some alter the lines and some change the order of the lines.

In fact, most Unix commands are filters. The sed command that we saw in Chapter ten?? is an example of a very powerful filter; it can be programmed to remove, alter or add lines to its input. The more command is a filter that slows down its input so that we have time to read it. Even the wc command is a filter - albeit a rather drastic one that outputs only the number of lines, words and characters contained in the input. The tee command is a filter that makes no changes to its input because its purpose is to duplicate it.

No changes - cat

The cat command is another filter that makes no changes to its input; it was designed for joining files into one logical stream. Here is an example that uses two one-line files:

$ cat bread jam bread
wholemeal bread
strawberry jam
wholemeal bread
$

As usual with filters, cat's output is sent to the standard output and the input files are unchanged. We have to use output redirection with cat if we wish to join files physically. For example:

$ cat bread jam bread > sandwich
$

puts the bread file's contents before and after the jam file's contents in another file called sandwich.

Of course we can use cat with a single file:

$ cat jam
strawberry jam
$

Some people use cat instead of more to display files. If the file does not fit into the xterm window, they can scroll back to see the start of the file. However, if the file has more lines than xterm can scroll back, they have to use more anyway!

A common error

Beginners often start cat, or some other Unix command, without giving it files to work on. They then find that their "commands" no longer work and they can't get a shell prompt. For example:

$ cat         # do NOT do this
date
date
ls
ls
nothing works
nothing works
-
-

What has happened is that cat, having no files to work on, is taking the input and echoing it back. The way out of this is to type control D to indicate there is no more input for the command to process, or to type control C to stop the command.
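We can see the echoing behaviour safely by feeding cat from a pipe instead of the keyboard; the end of the piped input plays the part of control D. (A sketch; the printf command here is just a stand-in for typing.)

```shell
# With no file arguments, cat copies standard input to standard output.
# Piped input replaces the keyboard; end-of-file replaces control D.
printf 'date\nls\n' | cat     # echoes back the two lines: date, ls
```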

Filters and pipes

Filters are very useful when used with pipes: several filters can be used consecutively with the output of one filter piped to the input of the next as shown:

$ cat bread jam bread | grep 'rr' | tee copy | wc -l
       1
$

A command like the above is known as a pipeline. The intermediate stages (grep and tee in the example) must be filters. The first stage and last stages do not have to be filters, but the first must produce output and the last must accept input. A command such as rm that ignores standard input and output would not normally be used in a pipeline.
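For example, ls is not a filter because it ignores standard input, yet it can begin a pipeline since it produces output. A small sketch (the count depends on the directory):

```shell
# ls starts the pipeline (produces output, reads no standard input);
# wc -l ends it by counting the names ls produces
ls / | wc -l
```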

Counting - wc

By default, wc counts the number of lines, words and characters in its input. We can use it for counting the lines in a file:

$ wc -l bread
       1 bread
$

Without the -l option, wc would have given all three counts, not just the line count.

If we require two of the counts, here is how to get them:

$ wc -wl < bread
       1       2
$

Notice that this time wc did not display the name of the file. This is because it was the shell that arranged for wc to receive its input redirected from bread. The name of the file was not given as an argument to wc, so wc could not display it.

If we pass more than one file to wc:

$ wc -l bread jam
       1 bread
       1 jam
       2 total
$

we get a total as well as the individual counts.

Pure filter

Most Unix commands behave like wc: they operate on standard input if they are not given a filename argument. The next command we study is an exception: it is purely a filter; it does not accept filename arguments.

Character transliterations - tr

We used tr in Chapter Two ?? to copy its input to its output without making any changes but we did not see what tr was intended for. In fact, tr does character transliterations. Here it changes all the vowels to punctuation:

$ tr aeiou '.,:!?'
the quick brown fox jumped
th, q?:ck br!wn f!x j?mp,d
over the lazy dog
!v,r th, l.zy d!g
^D
$

Notice the lines are in pairs: the first line of the pair typed by me and the second displayed by tr after it has done its substitutions. Also, note that there are two arguments, both strings. The first string specifies which characters will be translated; the second specifies what they will be translated to. The first character in the first string is translated to the first character in the second string. Similarly, the second and following characters correspond in the same way.

We can use a shortcut to specify ranges of characters. For example:

$ tr '[a-z]' '[A-Z]'
the quick brown fox jumped
THE QUICK BROWN FOX JUMPED
over the lazy dog
OVER THE LAZY DOG
^D
$

Notice that you have to use input/output redirection for tr to work with files as it is a pure filter. Here tr transliterates the contents of two files:

$ cat jam bread | tr '[a-z]' '[A-Z]' > opensandwich
$

First lines - head

The head command is a filter that gives the first few lines of a file. For example:

$ head -2 cars
The typical American male devotes more than 1,600 hours a
year to his car.  He sits in it while it goes and while it
$

The argument determines the number of lines for head to display.

When more than one file argument is given, head uses the file-name as a title and outputs a blank line between files:

$ head -1 bread jam
==> bread <==
wholemeal bread

==> jam <==
strawberry jam
$

Try head with the asterisk file-name generation facility to see the first lines of all files in a directory.
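As a sketch, in a fresh directory holding only the bread and jam files from earlier (the directory name pantry is made up):

```shell
# Recreate the two one-line files in an empty directory
mkdir pantry && cd pantry
printf 'wholemeal bread\n' > bread
printf 'strawberry jam\n' > jam

# The shell expands * to: bread jam
head -1 *
```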

Last lines - tail

The tail filter shows the last lines of a file. For example:

$ tail -3 cars
society's time budget to traffic instead of 28 per cent.

Ivan Illich
$

A negative argument specifies the number of lines from the end of the file.

With a positive argument, tail starts displaying the specified number of lines from the start of the file. Although the first argument is different, this command:

$ tail +12 cars
society's time budget to traffic instead of 28 per cent.

Ivan Illich
$

has the same effect as the previous one.

Linux - tail

The Linux version of tail does not have the +lineNumber option. We have to use: -n +lineNumber. This:

$ tail -n +12 cars
society's time budget to traffic instead of 28 per cent.

Ivan Illich
$

is what we need.

Put in order - sort

Here we use a file containing health spending figures as the input to sort:

$ sort healthspend
Austria      984
Belgium     1032
Canada      1043
Denmark     1086
France      1054
Italy        847
Japan       1173
Switzerland 1463
UK           692
USA         1372
$

Notice that the default ordering is alphabetical using the whole line as the key. Also, remember that sort is a filter - it does not normally change its input files.

By default, sort uses the space character to split lines into fields allowing us to sort with alternative keys. Here the key is the second field on the line:

$ sort -k 2 healthspend
UK           692
USA         1372
Italy        847
Japan       1173
Canada      1043
France      1054
Austria      984
Belgium     1032
Denmark     1086
Switzerland 1463
$

But this is "wrong"! sort interprets two consecutive space characters as the end of a field followed by the end of an empty field. So, as far as sort is concerned, the second field on every line except the last is empty. This makes the countries with the shortest names come out first in our example.

We can use a b modifier after the field number to allow for the leading spaces in the key:

$ sort -k 2b healthspend
Belgium     1032
Canada      1043
France      1054
Denmark     1086
Japan       1173
USA         1372
Switzerland 1463
UK           692
Italy        847
Austria      984
$

Better but still not correct! We have sorted on the second field, but the numbers have been put into alphabetical, not numerical, order; so 692 came after 1463.

We can use an n modifier after the field number to treat the key as a number. If we use an r modifier as well the output will be in reverse order, as shown here:

$ sort -k 2nr healthspend
Switzerland 1463
USA         1372
Japan       1173
Denmark     1086
France      1054
Canada      1043
Belgium     1032
Austria      984
Italy        847
UK           692
$

The b modifier is not needed this time as the n modifier also allows for the leading spaces.

Britain must either be a healthy place to live, or a country without due concern for all its citizens!

We can use sort on numbers with decimal points too, as this file of education spending figures shows:

$ sort -k 2nr eduspend
Canada      7.2
Denmark     6.9
Holland     6.6
Ireland     6.2
WestGermany 6.2
France      5.7
Sweden      5.7
USA         5.7
Spain       5.0
Japan       4.9
Portugal    4.9
Italy       4.8
UK          4.7
$

Britain's place at the bottom is the reason the country will be much, much less prosperous in twenty years' time.

Multiple keys

We can specify more than one key, as this example shows:

$ sort -k 2b -k 1 eduspend
UK          4.7
Italy       4.8
Japan       4.9
Portugal    4.9
Spain       5.0
France      5.7
Sweden      5.7
USA         5.7
Ireland     6.2
WestGermany 6.2
Holland     6.6
Denmark     6.9
Canada      7.2
$

As before, the countries are ranked by their education spending but, this time, the -k 1 specifies that if two countries have the same figure, they should then be sorted on the country name too. This causes Ireland to appear before WestGermany.

Any character position

Keys can begin part way through a word as in the following example:

$ sort -k 1.2f eduspend
Canada      7.2
Japan       4.9
Denmark     6.9
WestGermany 6.2
UK          4.7
Holland     6.6
Portugal    4.9
Spain       5.0
France      5.7
Ireland     6.2
USA         5.7
Italy       4.8
Sweden      5.7
$

This time, the extra .2 says the sort key starts with the second letter of the field. The f modifier after the key prevents sort from distinguishing between upper and lower case letters in the key, so the UK and the USA appear in the right place.

A warning

You probably expect this to sort by the second and fourth letters of the country name:

$ sort -k 1.2f -k 1.4f eduspend
Canada      7.2
Japan       4.9
Denmark     6.9
WestGermany 6.2
UK          4.7
Holland     6.6
Portugal    4.9
Spain       5.0
France      5.7
Ireland     6.2
USA         5.7
Italy       4.8
Sweden      5.7
$

But, as you can see, it doesn't! The reason is that, by default, sort expects keys to continue to the end of the line, which means that the key for Canada's line is:

anada      7.2ada      7.2

and not:

aa

as we might have expected.

When using multiple keys, we have to specify where they end. We do it like this:

$ sort -k 1.2,1.2f -k 1.4,1.4f eduspend
Canada      7.2
Japan       4.9
Denmark     6.9
WestGermany 6.2
UK          4.7
Holland     6.6
Portugal    4.9
Spain       5.0
Ireland     6.2
France      5.7
USA         5.7
Italy       4.8
Sweden      5.7
$

And, this time, we get the expected results. The end positions are specified by adding a comma and a further key position (,1.2 and ,1.4) after the keys.

Any field separator

The following example illustrates multiple keys and a different field separator being used:

$ sort -t '.' -k 2 -k 1,1 eduspend
Spain       5.0
Canada      7.2
Ireland     6.2
WestGermany 6.2
Holland     6.6
France      5.7
Sweden      5.7
UK          4.7
USA         5.7
Italy       4.8
Denmark     6.9
Japan       4.9
Portugal    4.9
$

This time, the field separator is the full stop character; it is specified using sort's -t option. Therefore, the sort key is the decimal part of the number followed by the country's name.

A subtle point: we don't need to say where the first key field ends because sort's assumption of the end of the line is correct in this case.

Just a filter

Don't forget that none of the above altered the data in healthspend or eduspend; if we wanted the sorted data left in the file we could use the -o option:

$ sort -k 2nr -o eduspend eduspend
$

Alternatively we could use output redirection:

$ sort -k 2nr eduspend > eduspend.sorted
$

But that needs an extra file name.

Extract data - cut

Computer users often need to extract a column of data from a file. Unix uses the cut utility for this. We can extract certain characters or fields from each line using the -c or -f option. Either option is followed by a list of character or field numbers. Here we get characters one, two, three and six from a file:

$ cut -c 1-3,6 healthspend
Ita
Ausi
Belu
Cana
Frae
Denr
Jap
USA3
Swie
UK 2
$

If we omit the last number in a range, cut goes to the end of the line, as shown here:

$ cut -c 13- healthspend
 847
 984
1032
1043
1054
1086
1173
1372
1463
 692
$

Counting character positions is necessary when the columns are aligned with multiple spaces, like ours are.

One of the problems with these tools is that they were written at different times by various Unix users who chose their own option letters and default field separators. Unlike sort, cut uses the tab character as its default separator and the -d option to change it. In our file, the fields are separated by space characters, so we need to specify that after the -d option:

$ cut -f 1 -d ' ' healthspend
Italy
Austria
Belgium
Canada
France
Denmark
Japan
USA
Switzerland
UK
$

to get the first field from the file.

As cut is a filter, we need to use redirection if we wish to do more than just see the data:

$ cut -f 1 -d ' ' healthspend > countries
$ cut -c 13- healthspend | sed 's/ //' > spending
$ cut -c 1 healthspend > initials
$

These three new files will be used in the next section.

Joining columns - paste

Unix supplies the paste command to join columns of data together:

$ paste spending countries
847     Italy
984     Austria
1032    Belgium
1043    Canada
1054    France
1086    Denmark
1173    Japan
1372    USA
1463    Switzerland
692     UK
$

Notice that it too uses the tab character between columns as its default, so our second column has its left margin neatly aligned.

Again, we can choose to use another character:

$ paste -d ' ' spending countries
847 Italy
984 Austria
1032 Belgium
1043 Canada
1054 France
1086 Denmark
1173 Japan
1372 USA
1463 Switzerland
692 UK
$

It is just as easy to join three files:

$ paste -d ' :' initials countries spending
I Italy:847
A Austria:984
B Belgium:1032
C Canada:1043
F France:1054
D Denmark:1086
J Japan:1173
U USA:1372
S Switzerland:1463
U UK:692
$

If we supply more than one delimiter, paste will use them in rotation; if we don't supply enough, it will recycle them.
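A minimal sketch of the rotation, using four tiny single-column files made up for the purpose:

```shell
# Four one-line columns, two delimiters: paste uses '-' then ':',
# then recycles '-' for the third gap
printf 'a\n' > f1
printf 'b\n' > f2
printf 'c\n' > f3
printf 'd\n' > f4
paste -d '-:' f1 f2 f3 f4     # gives: a-b:c-d
```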

Compare - cmp

Unix commands rarely generate unnecessary output so we shouldn't be surprised that its file comparison tool outputs nothing when it has no differences to report:

$ cmp healthspend healthspend
$

Silence!

To see the effect when there is something to report, we need to make an altered version of the file:

$ sed '/U/d' healthspend > healthspend.1
$ echo 'UK           693' >> healthspend.1
$

We do so by filtering out the UK's entry and adding a new UK entry to the end of the altered file.

(By the way, don't worry about the echo command. We will find out about it in the chapter ??. For now it is simply an easily written way of editing a file in this book.)

When we compare the new version with the original:

$ cmp healthspend healthspend.1
healthspend healthspend.1 differ: char 120, line 8
$

cmp tells us where the first difference occurs. We need to use another tool to actually see the differences.

Differences - sdiff

The sdiff command shows two files side by side with the differences marked. The -w option specifies the screen width to use.

$ sdiff -w 60 healthspend healthspend.1
Italy        847                Italy        847
Austria      984                Austria      984
Belgium     1032                Belgium     1032
Canada      1043                Canada      1043
France      1054                France      1054
Denmark     1086                Denmark     1086
Japan       1173                Japan       1173
USA         1372             <
Switzerland 1463                Switzerland 1463
UK           692             |  UK           693
$

The less-than (<) and greater-than (>) symbols act like arrows, pointing at lines that are in one file and not in the other. The vertical bar (|) indicates that a line differs from one file to the other.

Database joins - join

The paste command joins files line by line without regard to their contents. Sometimes, we wish to merge lines that have a common field; this is known as a database join. As Unix's join facility expects files to have been sorted on the common field, we need to create two files sorted by country:

$ sort healthspend > shs
$ sort eduspend > ses
$

Then, we can join them:

$ join shs ses
Canada 1043 7.2
Denmark 1086 6.9
France 1054 5.7
Italy 847 4.8
Japan 1173 4.9
UK 692 4.7
USA 1372 5.7
$

As you see, only seven countries are in both files. Each line of output contains the key field followed by the data from the first file and the data from the second file.

What join is doing can be expressed as a diagram like this:


+-------------------+
|                   |
|       +-----------+-------+
|       |           |       |
|  1    | Common    |  2    |
|       | Keys      |       |
|       |           |       |
|       +-----------+-------+
|                   |
+-------------------+

Key: 1 - unmatched keys in file 1
     2 - unmatched keys in file 2

DIAGRAM OF DATABASE JOIN

The seven countries output by the previous join were the keys common to both files; they correspond to the intersection area in the diagram.

Unmatched lines

Options for join allow us to access the lines with unmatched keys in either file; they correspond to areas 1 and 2 in the diagram. The -a option gives the lines with unmatched keys in addition to the lines with matched keys. The -v option gives the lines with unmatched keys instead of the lines with matched keys.

For example, the following shows the lines with unmatched keys from the first (shs) file:

$ join -v 1 shs ses
Austria 984
Belgium 1032
Switzerland 1463
$

Similarly, this shows the unmatched key lines from the second (ses) file:

$ join -v 2 shs ses
Holland 6.6
Ireland 6.2
Portugal 4.9
Spain 5.0
Sweden 5.7
WestGermany 6.2
$

Obviously, using -v 1 and -v 2 together would show the unmatched key lines from both files.
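A sketch of that combination, rebuilding two miniature sorted files rather than reusing the chapter's full shs and ses:

```shell
# Two tiny files, each sorted on the country key
printf 'Austria 984\nCanada 1043\n' > mini.hs
printf 'Canada 7.2\nSpain 5.0\n'   > mini.es

# -v 1 -v 2 together: unmatched lines from both files
join -v 1 -v 2 mini.hs mini.es    # gives: Austria 984 and Spain 5.0
```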

If we use the -a option instead, this is what we get:

$ join -a 1 -a 2 shs ses
Austria 984
Belgium 1032
Canada 1043 7.2
Denmark 1086 6.9
France 1054 5.7
Holland 6.6
Ireland 6.2
Italy 847 4.8
Japan 1173 4.9
Portugal 4.9
Spain 5.0
Sweden 5.7
Switzerland 1463
UK 692 4.7
USA 1372 5.7
WestGermany 6.2
$

The -a 1 and -a 2 force join to output all the data from both input files.

Print - lp

To print a file on a printer, we use the lp command; this does not actually do the printing itself but sends the request to a spooler that queues it until the printer is free. In this exchange:

$ lp healthspend eduspend
request id is hbp246-29465 (2 file(s))
$

hbp246 is the name of the printer queue and 29465 is the request number. We could use those later to find out about, or cancel, the request.

Of course, lp is able to accept standard input:

$ cat healthspend eduspend | lp
request id is hbp246-29466 (standard input)
$

This is a handy way of printing several files without starting a new page for each one.

Do just one job well

Part of the Unix philosophy is that each tool should do just one job as well as reasonably possible. That is why none of them output titles or paginate their output - that would be a second job. In line with this, lp doesn't format its input in any way; it simply prints whatever is there. Unix has a separate command, called pr, which does all the fancy formatting anyone could ask for. If the user wants the output from a command to be formatted, all they have to do is send the output through pr, usually using a pipeline.

Formatting - pr

Bear in mind that pr is usually used as a front-end for lp. We will use it on its own in this section because sending its results to the standard output is the easiest way to see them. Here is pr's default operation on a file:

$ pr cars


Apr  3 13:34 1996  cars Page 1


The typical American male devotes more than 1,600 hours a
<ten lines deleted>
society's time budget to traffic instead of 28 per cent.

Ivan Illich

<46 lines deleted>
$

The pr command is set up to use 11 inch paper. Therefore, as the cars file only needs one page, pr produced 66 lines of output. (I removed the unimportant ones.) The first five lines are a page header with two blank lines before and after. The time in the header is when the file was last modified. By default, pr ends each page with a footer consisting of five blank lines.

Here is the usual way of paginating files before printing them:

$ pr -h 'Spending files' -l 70 -w 100 \
> aidspend eduspend healthspend | lp
request id is hbp246-29577 (standard input)
$

In this instance, pr has been given a customised header along with a special page length and width.

When used without lp, pr's -t option suppresses the header and footer and prevents pr from spewing blank lines to fill the page. Several thin files can be displayed side by side simultaneously with pr as shown here:

$ pr -t -m -w 57 aidspend eduspend healthspend
US          0.15   Canada      7.2    Italy        847
Italy       0.20   Japan       4.9    Austria      984
Ireland     0.24   Denmark     6.9    Belgium     1032
NewZealand  0.24   WestGermany 6.2    Canada      1043
Spain       0.26   UK          4.7    France      1054
Portugal    0.28   Holland     6.6    Denmark     1086
Austria     0.29   Portugal    4.9    Japan       1173
Japan       0.29   Spain       5.0    USA         1372
Belgium     0.30   France      5.7    Switzerland 1463
UK          0.30   Ireland     6.2    UK           692
Finland     0.31   USA         5.7
Germany     0.33   Italy       4.8
Switzerland 0.36   Sweden      5.7
Australia   0.38
Luxembourg  0.40
Canada      0.42
France      0.64
Netherlands 0.76
Sweden      0.90
Denmark     1.03
Norway      1.05
$

When doing multi-column output, pr divides the available page width (in characters) evenly between the columns.

If a file is long and thin, we can format it in several columns:

$ sort -k 2nr aidspend | pr -3 -t -w 57
Norway      1.05   Australia   0.38   Japan       0.29
Denmark     1.03   Switzerland 0.36   Portugal    0.28
Sweden      0.90   Germany     0.33   Spain       0.26
Netherlands 0.76   Finland     0.31   Ireland     0.24
France      0.64   Belgium     0.30   NewZealand  0.24
Canada      0.42   UK          0.30   Italy       0.20
Luxembourg  0.40   Austria     0.29   US          0.15
$

Note that pr fills up the first column before the second. Here, Japan is six from the bottom and the US is bottom of the aid table.

Repeated lines - uniq

To see Unix's filter for repeated lines, we first need to duplicate one of the lines in the healthspend file:

$ cp healthspend healthspend.2
$ echo 'France      1054' >> healthspend.2
$ sort -o healthspend.2 healthspend.2
$

Then we can see that uniq prevents France's entry from coming out twice:

$ uniq healthspend.2
Austria      984
Belgium     1032
Canada      1043
Denmark     1086
France      1054
Italy        847
Japan       1173
Switzerland 1463
UK           692
USA         1372
$

If we add another line for France to the end of the file:

$ echo 'France      1054' >> healthspend.2
$

and run uniq again:

$ uniq healthspend.2
Austria      984
Belgium     1032
Canada      1043
Denmark     1086
France      1054
Italy        847
Japan       1173
Switzerland 1463
UK           692
USA         1372
France      1054
$

we see that France now appears twice. The command's name is misleading: only consecutive repeated lines are removed.
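The usual remedy is to sort first so that repeats become adjacent. A quick sketch with a throwaway file:

```shell
# France appears twice but not consecutively, so uniq alone misses it
printf 'France\nItaly\nFrance\n' > dupes
uniq dupes             # all three lines come out

# Sorting brings the repeats together; now uniq removes one
sort dupes | uniq      # gives: France, Italy
```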

Text manipulation - awk

The awk command can do more complex manipulations involving text and numbers than can be done with the other tools. Its instructions can be so complicated that they are called a program and have to be enclosed in quotation marks. Here is the simplest possible example:

$ awk '/^U/' aidspend
US          0.15
UK          0.30
$

Programs for awk consist of one or more patterns (or conditions) each of which can be followed by an action. The action is carried out for all input lines that match the pattern (or meet the condition). In our example, the pattern is a simple regular expression and we have omitted the action. Without an action, awk does its default action, which is to display the input line. So the program, as you can see, displays the lines beginning with 'U' from the aidspend file.

Here is a program with an action:

$ awk '/^U/ { print $2, $1 }' aidspend
0.15 US
0.30 UK
$

Notice that the action has to be in braces ({}). The $1 and $2 refer to the first and second fields on the line. The example displays them in reverse order. Because individual fields are being displayed we only get one space character between the fields, so the lines look shorter than when we displayed the whole line.

Here is a program with a condition instead of a pattern:

$ awk '$2 > 0.7' aidspend
Netherlands 0.76
Sweden      0.90
Denmark     1.03
Norway      1.05
$

Again, the action has been omitted, so awk displays the countries that spend more than the United Nations' recommended minimum percentage of gross national product (GNP).

Here is a program with three pattern-action pairs:

$ awk '$2 > 1.0    { print $0, "(Generous)" }
>      /^UK /      { print $0, "(UK)" }
>      $2 < 0.2    { print $0, "(Mean)" }' aidspend
US          0.15 (Mean)
UK          0.30 (UK)
Denmark     1.03 (Generous)
Norway      1.05 (Generous)
$

Don't forget that Unix uses its other prompt (>) when a command spills over onto the next line, as shown above; this means that the prompts aren't part of the awk program! Also, notice how the actions have been aligned to make the command as readable as possible.

Normally, Unix commands as long and as complex as the one above would not be entered interactively - they would be stored in a file using an editor. We will see how to do this in the chapter on shell scripts. For now they will be entered the hard way. If you copy them, just be sure to miss out the prompts.
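Incidentally, awk itself can read its program from a file with its -f option, which avoids retyping. A sketch, using a made-up program file and a two-line stand-in for the aidspend file:

```shell
# Save a pattern-action pair in a file (the here-document just creates it)
cat > mean.awk <<'EOF'
$2 < 0.2 { print $0, "(Mean)" }
EOF

# A cut-down stand-in for the aidspend file
printf 'US 0.15\nUK 0.30\n' > aid.mini

awk -f mean.awk aid.mini      # gives: US 0.15 (Mean)
```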

Actions without patterns or conditions are performed on all input lines. We use that here to calculate an average. The first action has no pattern, so the second field of each input line is added into the variable called sum:

$ awk '      { sum = sum + $2 }
>      END   { average = sum / NR
>              printf "Average is %f\n", average
>            }' aidspend
Average is 0.434762
$

END is a special pattern whose action is performed at the end of the input file(s). Notice that actions can be spread over more than one line, as here. NR is a built-in variable that holds the number of lines (records) read from the input file. The printf statement will be recognised by C programmers; its first argument is a format string that shows the layout of the line to be displayed. The second and subsequent arguments are values to be slotted into the format string as indicated by place markers, which begin with a percent sign (%). The %f in this example is a place marker for a real number; the \n shows where a newline is needed in the output. Notice that variables are automatically initialised to zero for numbers and null for strings.

The next example shows a conditional (if) statement being used to find the longest country name occurring in the aid files:

$ awk '        { if (length($1) > length(longest))
>                     longest = $1 }
>      END     { print longest }' healthspend aidspend eduspend
Switzerland
$

length is a function that returns the number of characters in its argument. See how awk is given several files to work with in the example.

As a matter of fact, the same thing could be done without the if:

$ awk 'length($1) > length(longest)
>           { longest = $1 }
>      END  { print longest }' healthspend
Switzerland
$

In all the awk scripts so far, the condition and the action have been on the same line; that isn't essential. In the example, the condition is in the first line and the action is in the second. However, the two conditions have been aligned with each other, as have the actions, to aid readability.

The following example shows an if-else statement being used to compare the average spending on aid in two sets of countries:

$ awk 'BEGIN    { printf "Countries beginning with letters " }
>      $1 < "K" { sumA = sumA + $2
>                 countA = countA + 1 }
>      $1 > "K" { sumK = sumK + $2
>                 countK = countK + 1 }
>      END      { avA = sumA / countA
>                 avK = sumK / countK
>                 if ( avA > avK )
>                      printf "A-J"
>                 else printf "K-Z"
>                 print " are less mean."
>               }' aidspend
Countries beginning with letters K-Z are less mean.
$

The BEGIN action is performed before any lines of data are read from the input file. In this example, it is used to display a heading before processing the input.

These examples are only a sample of what awk can do; it is so powerful, whole books have been written about it.

QUESTIONS

You will have to read the manual pages very carefully to do some of these questions.

Here are some files for you to download and use in the questions: bob, bob1, cars, people, peter, peter1.

  1. Count the words, lines and characters in the cars file.

    Answer

    wc cars
    


  2. Repeat question (1) but count only the lines.

    Answer

    wc -l cars
    


  3. Sort the contents of the people file into:

    1. first name order;

      Answer

      sort people
      


    2. surname order;

      Answer

      sort -k 2b people
      

      Note the b. Without it the Joans and Johns will occur first. This is because those lines have more than one space between the first and second names. These are treated as spaces leading the second field.


    3. oldest first;

      Answer

      sort -t ',' -k 2r people
      


    4. oldest last.

      Answer

      sort -t ',' -k 2 people
      


    (You are not expected to change the file -- only the order in which its lines appear on the screen.)

  4. Put the first character of each line of the people file into another file.

    Answer

    cut -c1 people > initials
    


  5. Put the surnames from the people file into another file.

    Answer

    cut -f1 -d ',' people | cut -c7- > surnames
    

    Notice we need two cut commands because -f and -c are mutually exclusive.

    OR

    awk '{print $2}' people | sed 's/,//'
    

    This is a bit simpler as awk uses any white space characters as a field separator instead of just one character. However, we still need something to get rid of the commas.


  6. Create a file containing initials and surnames from the files created for questions (4) and (5).

    Answer

    paste initials surnames > names
    


  7. The split command does the opposite of cat. Read its man page and use it to split the people file into files of four lines each.

    Answer

    split -4 people part
    


  8. How would you check if the files bob and bob1 were identical?

    Answer

    cmp bob bob1
    


  9. Find the differences between files peter and peter1.

    Answer

    sdiff peter peter1
    

    OR (better, lazier)

    sdiff peter peter1 | grep '[<|>]'
    


  10. How would you check if there are any common lines in peter and peter1?

    Answer

    sdiff peter peter1
    

    OR (better, lazier)

    sdiff peter peter1 | grep -v '[<|>]'
    

    OR

    (using

    $ apropos common
    

    to find a better tool -- the comm command )

    sort peter > peter.sorted
    sort peter1 > peter1.sorted
    comm -12 peter.sorted peter1.sorted
    rm peter.sorted peter1.sorted
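A small, self-contained illustration of comm, using invented file names:

```shell
# comm needs sorted input; -12 suppresses columns 1 and 2, leaving
# only the lines common to both files
printf 'apple\nbanana\n' > a.tmp
printf 'banana\ncherry\n' > b.tmp
comm -12 a.tmp b.tmp    # prints banana
rm a.tmp b.tmp
```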
    


  11. Translate all the lower case letters in a file to upper-case.

    Answer

    tr "[a-z]" "[A-Z]" < bob
    

    (we must use < because tr is a pure filter: it reads only standard input and takes no filename arguments)
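The same translation on a one-line stream:

```shell
# The brackets are translated to themselves; a-z maps onto A-Z
printf 'strawberry jam\n' | tr "[a-z]" "[A-Z]"    # prints STRAWBERRY JAM
```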


  12. sort's -u option seems to provide the same facility as the uniq command. Are there any circumstances when they are not equivalent?

    Answer

    Yes -- when the sort key is not the whole line. uniq filters out duplicated adjacent lines; sort's -u option filters out lines that share the same sort key, so two lines may differ yet still count as duplicates.
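The difference shows up as soon as the key is only part of the line (a made-up two-line file):

```shell
printf 'apple 1\napple 2\n' > demo
sort -u -k 1,1 demo | wc -l   # 1: both lines share the key "apple"
sort demo | uniq | wc -l      # 2: the whole lines differ
rm demo
```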


  13. How would you get a printout of one of your files?

    Answer

    lp filename
    

    This may not work for you if your workplace doesn't provide you with a Unix printer.


  14. Format the people file into 2 columns, 40 characters wide with a suitable heading on 12 line "pages".

    Answer

    pr -h "suitable heading" -2 -w 80 -l 12 people
    


  15. One of the names in the people file is repeated. Produce another version with unique names WITHOUT using an editor.

    Answer

    sort people | uniq > newpeople
    

    OR

    sort -u people > newpeople
    


  16. Calculate the average age of the people in the people file (When you have decided which command to use, see the examples section of its man page.)

    Answer

    awk -F',' '    {sum += $2}
               END {print sum/NR}' people
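With two invented lines in the assumed "name, age" layout, the program sums the second field and divides by the line count:

```shell
# $2 is " 30" or " 20"; awk converts the text to a number automatically
printf 'Ann Apple, 30\nBob Brown, 20\n' |
awk -F',' '    {sum += $2}
           END {print sum/NR}'    # prints 25
```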
    


  17. Format the people file into two columns, but this time, put the names across the screen and down the columns rather than down the columns and across the page. (This question requires careful reading of the man page and/or ingenuity.)

    Answer

    pr -a -2 -w 80 people
    

    (using the -a [across] option)

    OR

    pr -t -l 1 -2 -w 80 people
    

    (an ingenious solution -- suppresses page breaks and sets the page length to one)

    OR

    paste - - < people
    

    (more creative still -- standard input lines are pasted alongside the next line of standard input, but the columns are not so tidy)
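How paste - - pairs up lines, on made-up input:

```shell
# paste reads standard input twice per output line:
# once for each "-" column, so successive lines land side by side
printf 'a\nb\nc\nd\n' | paste - -    # prints "a<TAB>b" then "c<TAB>d"
```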


  18. Run this command:

    grep '\(\<[A-Za-z][a-z]*\>\).*\<\1\>' cars
    

    What does it do? Explain why.

    Answer

    It finds lines that have any word repeated in the line.

    [A-Za-z][a-z]* means a string of letters, possibly with an initial capital. Enclosing it in \< \> makes it a separate word rather than just a string inside a longer word. \1 means whatever the parenthesised expression matched -- that is, an earlier string in the line. \<\1\> means the re-occurring string must also be a separate word.
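The expression in action on made-up input (note that \< and \> are GNU/BSD extensions, not POSIX):

```shell
# Only the first line contains a repeated separate word ("the")
printf 'the car hit the wall\nno repeats here\n' |
grep '\(\<[A-Za-z][a-z]*\>\).*\<\1\>'    # prints "the car hit the wall"
```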


  19. In an earlier exercise you used:

    who | wc -l
    

    to get a rough count of people logged in. Improve it so that duplicate logins are not counted.

    Answer

    who | sort -u -k 1,1 | wc -l
    

    The extra stage drops lines whose first field (the login code) is repeated.
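The effect of the extra stage, simulated with made-up who-style output in which ann is logged in twice:

```shell
# sort -u -k 1,1 keeps one line per login name before wc counts them
printf 'ann  tty1\nann  tty2\nbob  tty3\n' | sort -u -k 1,1 | wc -l   # 2 users
```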

