ITS ALL ABOUT AWK: LETS SEE AWK

Saturday, January 19, 2008

LETS SEE AWK

ALL ABOUT AWK LANGUAGE:

Getting Started with awk

The basic function of awk is to search files for lines (or other units of text) that contain certain patterns. When a line matches one of the patterns, awk performs specified actions on that line. awk keeps processing input lines in this way until the end of the input files is reached.

Programs in awk are different from programs in most other languages, because awk programs are data-driven; that is, you describe the data you wish to work with, and then what to do when you find it. Most other languages are procedural; you have to describe, in great detail, every step the program is to take. When working with procedural languages, it is usually much harder to clearly describe the data your program will process. For this reason, awk programs are often refreshingly easy to both write and read.

When you run awk, you specify an awk program that tells awk what to do. The program consists of a series of rules. (It may also contain function definitions, an advanced feature which we will ignore for now. See section User-defined Functions.) Each rule specifies one pattern to search for, and one action to perform when that pattern is found.

Syntactically, a rule consists of a pattern followed by an action. The action is enclosed in curly braces to separate it from the pattern. Rules are usually separated by newlines. Therefore, an awk program looks like this:
pattern { action }
pattern { action }
...
Names: What name to use to find awk.
Running gawk: How to run gawk programs; includes command line syntax.
Very Simple: A very simple example.
Two Rules: A less simple one-line example with two rules.
More Complex: A more complex example.
Statements/Lines: Subdividing or combining statements into lines.
Other Features: Other Features of awk.
When: When to use gawk and when to use other things.
A Rose By Any Other Name

The awk language has evolved over the years. Full details are provided in section The Evolution of the awk Language. The language described in this book is often referred to as "new awk."

Because of this, many systems have multiple versions of awk. Some systems have an awk utility that implements the original version of the awk language, and a nawk utility for the new version. Others have an oawk for the "old awk" language, and plain awk for the new one. Still others only have one version, usually the new one.(2)

All in all, this makes it difficult for you to know which version of awk you should run when writing your programs. The best advice we can give here is to check your local documentation. Look for awk, oawk, and nawk, as well as for gawk. Chances are, you will have some version of new awk on your system, and that is what you should use when running your programs. (Of course, if you're reading this book, chances are good that you have gawk!)

Throughout this book, whenever we refer to a language feature that should be available in any complete implementation of POSIX awk, we simply use the term awk. When referring to a feature that is specific to the GNU implementation, we use the term gawk.
How to Run awk Programs

There are several ways to run an awk program. If the program is short, it is easiest to include it in the command that runs awk, like this:
awk 'program' input-file1 input-file2 ...

where program consists of a series of patterns and actions, as described earlier. (The reason for the single quotes is described below, in section One-shot Throw-away awk Programs.)

When the program is long, it is usually more convenient to put it in a file and run it with a command like this:
awk -f program-file input-file1 input-file2 ...
One-shot: Running a short throw-away awk program.
Read Terminal: Using no input files (input from terminal instead).
Long: Putting permanent awk programs in files.
Executable Scripts: Making self-contained awk programs.
Comments: Adding documentation to gawk programs.
One-shot Throw-away awk Programs

Once you are familiar with awk, you will often type in simple programs the moment you want to use them. Then you can write the program as the first argument of the awk command, like this:
awk 'program' input-file1 input-file2 ...

where program consists of a series of patterns and actions, as described earlier.

This command format instructs the shell, or command interpreter, to start awk and use the program to process records in the input file(s). There are single quotes around program so that the shell doesn't interpret any awk characters as special shell characters. They also cause the shell to treat all of program as a single argument for awk and allow program to be more than one line long.

This format is also useful for running short or medium-sized awk programs from shell scripts, because it avoids the need for a separate file for the awk program. A self-contained shell script is more reliable since there are no other files to misplace.

The section Useful One Line Programs presents several short, self-contained programs.

As an interesting side point, the command
awk '/foo/' files ...

is essentially the same as
egrep foo files ...
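As a small check of that equivalence, the sketch below runs both commands on the same file and keeps the results side by side. The file contents and the variable names are invented for this example, and `grep -E' is used as the POSIX spelling of egrep.

```shell
# Build a small sample file; its contents are invented for this example.
sample=$(mktemp)
printf 'fooey 555-1234\nbarfly 555-7685\nmacfoo 555-6480\n' > "$sample"

# A pattern with no action prints every matching line ...
awk_out=$(awk '/foo/' "$sample")

# ... which is what grep -E does with the same expression.
grep_out=$(grep -E foo "$sample")

printf '%s\n' "$awk_out"
rm -f "$sample"
```

Both variables should end up holding the same two lines.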
Running awk without Input Files

You can also run awk without any input files. If you type the command line:
awk 'program'

then awk applies the program to the standard input, which usually means whatever you type on the terminal. This continues until you indicate end-of-file by typing Control-d. (On other operating systems, the end-of-file character may be different. For example, on OS/2 and MS-DOS, it is Control-z.)

For example, the following program prints a friendly piece of advice (from Douglas Adams' The Hitchhiker's Guide to the Galaxy), to keep you from worrying about the complexities of computer programming (`BEGIN' is a feature we haven't discussed yet).
$ awk "BEGIN { print \"Don't Panic!\" }"
- Don't Panic!

This program does not read any input. The `\' before each of the inner double quotes is necessary because of the shell's quoting rules, in particular because it mixes both single quotes and double quotes.
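One way to sidestep the backslashes, sketched here as an alternative rather than as the method this text recommends, is to keep the program in single quotes and write the apostrophe as the octal escape `\047'. The variable name msg is only for illustration.

```shell
# Same message, but with the program in single quotes.
# \047 is the octal escape for the apostrophe, so the shell never
# sees a stray single quote inside the program text.
msg=$(awk 'BEGIN { print "Don\047t Panic!" }')
printf '%s\n' "$msg"
```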

This next simple awk program emulates the cat utility; it copies whatever you type at the keyboard to its standard output. (Why this works is explained shortly.)
$ awk '{ print }'
Now is the time for all good men
- Now is the time for all good men
to come to the aid of their country.
- to come to the aid of their country.
Four score and seven years ago, ...
- Four score and seven years ago, ...
What, me worry?
- What, me worry?
Control-d
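Standard input does not have to be the keyboard. As a minimal sketch, the same program can be fed from a pipe, in which case end-of-file arrives when the writing command finishes; the two input lines and the variable name copied are made up for the example.

```shell
# With no file arguments, awk reads standard input; here a pipe
# supplies the text instead of the keyboard.
copied=$(printf 'Now is the time\nWhat, me worry?\n' | awk '{ print }')
printf '%s\n' "$copied"
```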
Running Long Programs

Sometimes your awk programs can be very long. In this case it is more convenient to put the program into a separate file. To tell awk to use that file for its program, you type:
awk -f source-file input-file1 input-file2 ...

The `-f' instructs the awk utility to get the awk program from the file source-file. Any file name can be used for source-file. For example, you could put the program:
BEGIN { print "Don't Panic!" }

into the file `advice'. Then this command:
awk -f advice

does the same thing as this one:
awk "BEGIN { print \"Don't Panic!\" }"

which was explained earlier (see section Running awk without Input Files). Note that you don't usually need single quotes around the file name that you specify with `-f', because most file names don't contain any of the shell's special characters. Notice that in `advice', the awk program did not have single quotes around it. The quotes are only needed for programs that are provided on the awk command line.

If you want to identify your awk program files clearly as such, you can add the extension `.awk' to the file name. This doesn't affect the execution of the awk program, but it does make "housekeeping" easier.
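The steps above can be tried end to end with a short sketch like the following. It assumes a temporary directory from mktemp and uses the file name advice.awk purely as an illustration of the `.awk' convention.

```shell
# Put the program in a file inside a scratch directory.
dir=$(mktemp -d)
cat > "$dir/advice.awk" <<'EOF'
BEGIN { print "Don't Panic!" }
EOF

# No shell quoting is needed inside the file, only on the command line.
from_file=$(awk -f "$dir/advice.awk")
printf '%s\n' "$from_file"

rm -rf "$dir"
```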
Executable awk Programs

Once you have learned awk, you may want to write self-contained awk scripts, using the `#!' script mechanism. You can do this on many Unix systems(3) (and someday on the GNU system).

For example, you could update the file `advice' to look like this:
#! /bin/awk -f

BEGIN { print "Don't Panic!" }

After making this file executable (with the chmod utility), you can simply type `advice' at the shell, and the system will arrange to run awk (4) as if you had typed `awk -f advice'.
$ advice
- Don't Panic!

Self-contained awk scripts are useful when you want to write a program which users can invoke without their having to know that the program is written in awk.

Some older systems do not support the `#!' mechanism. You can get a similar effect using a regular shell script. It would look something like this:
: The colon ensures execution by the standard shell.
awk 'program' "$@"

Using this technique, it is vital to enclose the program in single quotes to protect it from interpretation by the shell. If you omit the quotes, only a shell wizard can predict the results.

The "$@" causes the shell to forward all the command line arguments to the awk program, without interpretation. The first line, which starts with a colon, is used so that this shell script will work even if invoked by a user who uses the C shell. (Not all older systems obey this convention, but many do.)
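To see the wrapper technique in action, the sketch below writes such a script to a temporary file and runs it with sh. The awk program inside it (printing the first field) and the sample data are assumptions chosen only to make the example observable.

```shell
# Write a small wrapper script to a temporary file.
script=$(mktemp)
cat > "$script" <<'EOF'
: The colon ensures execution by the standard shell.
awk '{ print $1 }' "$@"
EOF

# Create a sample data file and pass its name through the wrapper;
# "$@" hands the file name to awk unchanged.
data=$(mktemp)
printf 'fooey 555-1234\nfoot 555-6699\n' > "$data"

names=$(sh "$script" "$data")
printf '%s\n' "$names"

rm -f "$script" "$data"
```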
Comments in awk Programs

A comment is some text that is included in a program for the sake of human readers; it is not really part of the program. Comments can explain what the program does, and how it works. Nearly all programming languages have provisions for comments, because programs are typically hard to understand without their extra help.

In the awk language, a comment starts with the sharp sign character, `#', and continues to the end of the line. The `#' does not have to be the first character on the line. The awk language ignores the rest of a line following a sharp sign. For example, we could have put the following into `advice':
# This program prints a nice friendly message. It helps
# keep novice users from being afraid of the computer.
BEGIN { print "Don't Panic!" }

You can put comment lines into keyboard-composed throw-away awk programs also, but this usually isn't very useful; the purpose of a comment is to help you or another person understand the program at a later time.
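A brief sketch of both comment positions follows. The program is held in a shell variable named program only to keep the example in one piece, and the apostrophe is written as `\047' so that the text can sit inside single quotes.

```shell
# An awk program with a full-line comment and a trailing comment.
# awk ignores everything from # to the end of each line.
program='
# Print a friendly message before any input is read.
BEGIN { print "Don\047t Panic!" }   # trailing comment, also ignored
'
commented=$(awk "$program")
printf '%s\n' "$commented"
```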
A Very Simple Example

The following command runs a simple awk program that searches the input file `BBS-list' for the string of characters: `foo'. (A string of characters is usually called a string. The term string is perhaps based on similar usage in English, such as "a string of pearls," or, "a string of cars in a train.")
awk '/foo/ { print $0 }' BBS-list

When lines containing `foo' are found, they are printed, because `print $0' means print the current line. (Just `print' by itself means the same thing, so we could have written that instead.)

You will notice that slashes, `/', surround the string `foo' in the awk program. The slashes indicate that `foo' is a pattern to search for. This type of pattern is called a regular expression, and is covered in more detail later (see section Regular Expressions). The pattern is allowed to match parts of words. There are single-quotes around the awk program so that the shell won't interpret any of it as special shell characters.

Here is what this program prints:
$ awk '/foo/ { print $0 }' BBS-list
- fooey 555-1234 2400/1200/300 B
- foot 555-6699 1200/300 B
- macfoo 555-6480 1200/300 A
- sabafoo 555-2127 1200/300 C

In an awk rule, either the pattern or the action can be omitted, but not both. If the pattern is omitted, then the action is performed for every input line. If the action is omitted, the default action is to print all lines that match the pattern.

Thus, we could leave out the action (the print statement and the curly braces) in the above example, and the result would be the same: all lines matching the pattern `foo' would be printed. By comparison, omitting the print statement but retaining the curly braces makes an empty action that does nothing; then no lines would be printed.
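The four cases just described can be compared directly. This is a sketch on two invented records rather than on the real `BBS-list' file, and the variable names are arbitrary.

```shell
# Sample input invented for this example.
input='fooey 555-1234
barfly 555-7685'

# Pattern and action: print matching lines.
both=$(printf '%s\n' "$input" | awk '/foo/ { print $0 }')

# Action omitted: the default action prints matching lines.
no_action=$(printf '%s\n' "$input" | awk '/foo/')

# Pattern omitted: the action runs for every line.
no_pattern=$(printf '%s\n' "$input" | awk '{ print $0 }')

# Empty action: lines still match, but nothing is printed.
empty_action=$(printf '%s\n' "$input" | awk '/foo/ { }')

printf '%s\n' "$both"
```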
An Example with Two Rules

The awk utility reads the input files one line at a time. For each line, awk tries the patterns of each of the rules. If several patterns match then several actions are run, in the order in which they appear in the awk program. If no patterns match, then no actions are run.

After processing all the rules (perhaps none) that match the line, awk reads the next line (however, see section The next Statement, and also see section The nextfile Statement). This continues until the end of the file is reached.

For example, the awk program:
/12/ { print $0 }
/21/ { print $0 }

contains two rules. The first rule has the string `12' as the pattern and `print $0' as the action. The second rule has the string `21' as the pattern and also has `print $0' as the action. Each rule's action is enclosed in its own pair of braces.

This awk program prints every line that contains the string `12' or the string `21'. If a line contains both strings, it is printed twice, once by each rule.

This is what happens if we run this program on our two sample data files, `BBS-list' and `inventory-shipped', as shown here:
$ awk '/12/ { print $0 }
> /21/ { print $0 }' BBS-list inventory-shipped
- aardvark 555-5553 1200/300 B
- alpo-net 555-3412 2400/1200/300 A
- barfly 555-7685 1200/300 A
- bites 555-1675 2400/1200/300 A
- core 555-2912 1200/300 C
- fooey 555-1234 2400/1200/300 B
- foot 555-6699 1200/300 B
- macfoo 555-6480 1200/300 A
- sdace 555-3430 2400/1200/300 A
- sabafoo 555-2127 1200/300 C
- sabafoo 555-2127 1200/300 C
- Jan 21 36 64 620
- Apr 21 70 74 514

Note how the line in `BBS-list' beginning with `sabafoo' was printed twice, once for each rule.
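The same behaviour can be reproduced without the sample files. In this sketch the three input records are invented so that the first matches only `12', the second only `21', and the third both.

```shell
# Each rule is tried against every line, in program order.
lines=$(printf 'a 1200\nb 21\nc 2127 1200\n' | awk '/12/ { print $0 }
/21/ { print $0 }')
printf '%s\n' "$lines"
```

The third record should appear twice in the output, once for each rule.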

Regular Expressions:

A regular expression, or regexp, is a way of describing a set of strings. Because regular expressions are such a fundamental part of awk programming, their format and use deserve a separate chapter.

A regular expression enclosed in slashes (`/') is an awk pattern that matches every input record whose text belongs to that set.

The simplest regular expression is a sequence of letters, numbers, or both. Such a regexp matches any string that contains that sequence. Thus, the regexp `foo' matches any string containing `foo'. Therefore, the pattern /foo/ matches any input record containing the three characters `foo', anywhere in the record. Other kinds of regexps let you specify more complicated classes of strings.

Initially, the examples will be simple. As we explain more about how regular expressions work, we will present more complicated examples.
Regexp Usage: How to Use Regular Expressions.
Escape Sequences: How to write non-printing characters.
Regexp Operators: Regular Expression Operators.
GNU Regexp Operators: Operators specific to GNU software.
Case-sensitivity: How to do case-insensitive matching.
Leftmost Longest: How much text matches.
Computed Regexps: Using Dynamic Regexps.
How to Use Regular Expressions

A regular expression can be used as a pattern by enclosing it in slashes. Then the regular expression is tested against the entire text of each record. (Normally, it only needs to match some part of the text in order to succeed.) For example, this prints the second field of each record that contains the three characters `foo' anywhere in it:
$ awk '/foo/ { print $2 }' BBS-list
- 555-1234
- 555-6699
- 555-6480
- 555-2127

Regular expressions can also be used in matching expressions. These expressions allow you to specify the string to match against; it need not be the entire current input record. The two operators, `~' and `!~', perform regular expression comparisons. Expressions using these operators can be used as patterns or in if, while, for, and do statements.
exp ~ /regexp/
This is true if the expression exp (taken as a string) is matched by regexp. The following example matches, or selects, all input records with the upper-case letter `J' somewhere in the first field:
$ awk '$1 ~ /J/' inventory-shipped
- Jan 13 25 15 115
- Jun 31 42 75 492
- Jul 24 34 67 436
- Jan 21 36 64 620
So does this:
awk '{ if ($1 ~ /J/) print }' inventory-shipped
exp !~ /regexp/
This is true if the expression exp (taken as a character string) is not matched by regexp. The following example matches, or selects, all input records whose first field does not contain the upper-case letter `J':
$ awk '$1 !~ /J/' inventory-shipped
- Feb 15 32 24 226
- Mar 15 24 34 228
- Apr 31 52 63 420
- May 16 34 29 208
...

When a regexp is written enclosed in slashes, like /foo/, we call it a regexp constant, much like 5.27 is a numeric constant, and "foo" is a string constant.
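The two matching operators can be tried on a few records typed inline. The records below only imitate the layout of `inventory-shipped' and are not taken from it; the variable names are arbitrary.

```shell
# Invented records in the style of inventory-shipped.
records='Jan 13
Feb 15
Jun 31'

# Select records whose first field contains J.
with_j=$(printf '%s\n' "$records" | awk '$1 ~ /J/')

# Select records whose first field does not contain J.
without_j=$(printf '%s\n' "$records" | awk '$1 !~ /J/')

# The same test inside an if statement, printing only the second field.
second=$(printf '%s\n' "$records" | awk '{ if ($1 ~ /J/) print $2 }')

printf '%s\n' "$with_j"
```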
Escape Sequences

Some characters cannot be included literally in string constants ("foo") or regexp constants (/foo/). You represent them instead with escape sequences, which are character sequences beginning with a backslash (`\').

One use of an escape sequence is to include a double-quote character in a string constant. Since a plain double-quote would end the string, you must use `\"' to represent an actual double-quote character as a part of the string. For example:
$ awk 'BEGIN { print "He said \"hi!\" to her." }'
- He said "hi!" to her.

The backslash character itself is another character that cannot be included normally; you write `\\' to put one backslash in the string or regexp. Thus, the string whose contents are the two characters `"' and `\' must be written "\"\\".
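As a quick check of that last claim, this sketch prints the string and captures it in a shell variable (the name two is arbitrary). The program is in single quotes, so the shell passes every backslash through to awk untouched.

```shell
# Print the two-character string made of a double quote and a backslash.
two=$(awk 'BEGIN { print "\"\\" }')
printf '%s\n' "$two"
```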

Another use of backslash is to represent unprintable characters such as tab or newline. While there is nothing to stop you from entering most unprintable characters directly in a string constant or regexp constant, they may look ugly.

Here is a table of all the escape sequences used in awk, and what they represent. Unless noted otherwise, all of these escape sequences apply to both string constants and regexp constants.
\\
A literal backslash, `\'.
\a
The "alert" character, Control-g, ASCII code 7 (BEL).
\b
Backspace, Control-h, ASCII code 8 (BS).
\f
Formfeed, Control-l, ASCII code 12 (FF).
\n
Newline, Control-j, ASCII code 10 (LF).
\r
Carriage return, Control-m, ASCII code 13 (CR).
\t
Horizontal tab, Control-i, ASCII code 9 (HT).
\v
Vertical tab, Control-k, ASCII code 11 (VT).
\nnn
The octal value nnn, where nnn are one to three digits between `0' and `7'. For example, the code for the ASCII ESC (escape) character is `\033'.
\xhh...
The hexadecimal value hh, where hh are hexadecimal digits (`0' through `9' and either `A' through `F' or `a' through `f'). Like the same construct in ANSI C, the escape sequence continues until the first non-hexadecimal digit is seen. However, using more than two hexadecimal digits produces undefined results. (The `\x' escape sequence is not allowed in POSIX awk.)
\/
A literal slash (necessary for regexp constants only). You use this when you wish to write a regexp constant that contains a slash. Since the regexp is delimited by slashes, you need to escape the slash that is part of the pattern, in order to tell awk to keep processing the rest of the regexp.
\"
A literal double-quote (necessary for string constants only). You use this when you wish to write a string constant that contains a double-quote. Since the string is delimited by double-quotes, you need to escape the quote that is part of the string, in order to tell awk to keep processing the rest of the string.
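A few of the sequences in the table can be exercised as follows. This is only a sketch: the sample strings and variable names are invented, and `\101' is used because it is the octal code for the letter `A' in ASCII.

```shell
# \t inside a string constant produces a tab character.
tabbed=$(awk 'BEGIN { print "a\tb" }')

# \101 is octal for the ASCII letter A.
octal=$(awk 'BEGIN { print "\101" }')

# \/ is a literal slash inside a regexp constant.
slash=$(printf 'usr/bin\nusrbin\n' | awk '/usr\/bin/')

printf '%s\n' "$tabbed" "$octal" "$slash"
```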

In gawk, there are additional two-character sequences that begin with a backslash and have special meaning in regexps. See section Additional Regexp Operators Only in gawk.

In a string constant, what happens if you place a backslash before something that is not one of the characters listed above? POSIX awk purposely leaves this case undefined. There are two choices.
Strip the backslash out. This is what Unix awk and gawk both do. For example, "a\qc" is the same as "aqc".
Leave the backslash alone. Some other awk implementations do this. In such implementations, "a\qc" is the same as if you had typed "a\\qc".

In a regexp, a backslash before any character that is not in the above table, and not listed in section Additional Regexp Operators Only in gawk, means that the next character should be taken literally, even if it would normally be a regexp operator. E.g., /a\+b/ matches the three characters `a+b'.
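The difference between the escaped and the unescaped `+' shows up on two short inputs. The inputs and variable names in this sketch are invented.

```shell
# /a\+b/ matches the literal text a+b.
literal=$(printf 'a+b\naab\n' | awk '/a\+b/')

# /a+b/ treats + as an operator: one or more a, then b.
operator=$(printf 'a+b\naab\n' | awk '/a+b/')

printf '%s\n' "$literal" "$operator"
```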

For complete portability, do not use a backslash before any character not listed in the table above.

Another interesting question arises. Suppose you use an octal or hexadecimal escape to represent a regexp metacharacter (see section Regular Expression Operators). Does awk treat the character as literal character, or as a regexp operator?

It turns out that historically, such characters were taken literally. However, the POSIX standard indicates that they should be treated as real metacharacters, and this is what gawk does. In compatibility mode (see section Command Line Options), gawk instead treats the characters represented by octal and hexadecimal escape sequences literally when used in regexp constants. Thus, in that mode, /a\52b/ is equivalent to /a\*b/.

To summarize:
The escape sequences in the table above are always processed first, for both string constants and regexp constants. This happens very early, as soon as awk reads your program.
gawk processes both regexp constants and dynamic regexps (see section Using Dynamic Regexps), for the special operators listed in section Additional Regexp Operators Only in gawk.
A backslash before any other character means to treat that character literally.
Regular Expression Operators

You can combine regular expressions with the following characters, called regular expression operators, or metacharacters, to increase the power and versatility of regular expressions.

The escape sequences described above in section Escape Sequences, are valid inside a regexp. They are introduced by a `\'. They are recognized and converted into the corresponding real characters as the very first step in processing regexps.

Here is a table of metacharacters. All characters that are not escape sequences and that are not listed in the table stand for themselves.
\
This is used to suppress the special meaning of a character when matching. For example:
\$
matches the character `$'.
^
This matches the beginning of a string. For example:
^@chapter
matches the `@chapter' at the beginning of a string, and can be used to identify chapter beginnings in Texinfo source files. The `^' is known as an anchor, since it anchors the pattern to matching only at the beginning of the string. It is important to realize that `^' does not match the beginning of a line embedded in a string. In this example the condition is not true:
if ("line1\nLINE 2" ~ /^L/) ...
$
This is similar to `^', but it matches only at the end of a string. For example:
p$
matches a record that ends with a `p'. The `$' is also an anchor, and also does not match the end of a line embedded in a string. In this example the condition is not true:
if ("line1\nLINE 2" ~ /1$/) ...
.
The period, or dot, matches any single character, including the newline character. For example:
.P
matches any single character followed by a `P' in a string. Using concatenation we can make a regular expression like `U.A', which matches any three-character sequence that begins with `U' and ends with `A'. In strict POSIX mode (see section Command Line Options), `.' does not match the NUL character, which is a character with all bits equal to zero. Otherwise, NUL is just another character. Other versions of awk may not be able to match the NUL character.
[...]
This is called a character list. It matches any one of the characters that are enclosed in the square brackets. For example:
[MVX]
matches any one of the characters `M', `V', or `X' in a string. Ranges of characters are indicated by using a hyphen between the beginning and ending characters, and enclosing the whole thing in brackets. For example:
[0-9]
matches any digit. Multiple ranges are allowed. E.g., the list [A-Za-z0-9] is a common way to express the idea of "all alphanumeric characters." To include one of the characters `\', `]', `-' or `^' in a character list, put a `\' in front of it. For example:
[d\]]
matches either `d' or `]'. This treatment of `\' in character lists is compatible with other awk implementations, and is also mandated by POSIX.
The regular expressions in awk are a superset of the POSIX specification for Extended Regular Expressions (EREs). POSIX EREs are based on the regular expressions accepted by the traditional egrep utility.
Character classes are a new feature introduced in the POSIX standard. A character class is a special notation for describing lists of characters that have a specific attribute, but where the actual characters themselves can vary from country to country and/or from character set to character set. For example, the notion of what is an alphabetic character differs in the USA and in France. A character class is only valid in a regexp inside the brackets of a character list. Character classes consist of `[:', a keyword denoting the class, and `:]'. Here are the character classes defined by the POSIX standard.
[:alnum:]
Alphanumeric characters.
[:alpha:]
Alphabetic characters.
[:blank:]
Space and tab characters.
[:cntrl:]
Control characters.
[:digit:]
Numeric characters.
[:graph:]
Characters that are printable and are also visible. (A space is printable, but not visible, while an `a' is both.)
[:lower:]
Lower-case alphabetic characters.
[:print:]
Printable characters (characters that are not control characters).
[:punct:]
Punctuation characters (characters that are not letters, digits, control characters, or space characters).
[:space:]
Space characters (such as space, tab, and formfeed, to name a few).
[:upper:]
Upper-case alphabetic characters.
[:xdigit:]
Characters that are hexadecimal digits.
For example, before the POSIX standard, to match alphanumeric characters, you had to write /[A-Za-z0-9]/. If your character set had other alphabetic characters in it, this would not match them. With the POSIX character classes, you can write /[[:alnum:]]/, and this will match all the alphabetic and numeric characters in your character set.
Two additional special sequences can appear in character lists. These apply to non-ASCII character sets, which can have single symbols (called collating elements) that are represented with more than one character, as well as several characters that are equivalent for collating, or sorting, purposes. (E.g., in French, a plain "e" and a grave-accented "è" are equivalent.)
Collating Symbols
A collating symbol is a multi-character collating element enclosed in `[.' and `.]'. For example, if `ch' is a collating element, then [[.ch.]] is a regexp that matches this collating element, while [ch] is a regexp that matches either `c' or `h'.
Equivalence Classes
An equivalence class is a list of equivalent characters enclosed in `[=' and `=]'. Thus, [[=eè=]] is a regexp that matches either `e' or `è'.
These features are very valuable in non-English speaking locales. Caution: The library functions that gawk uses for regular expression matching currently only recognize POSIX character classes; they do not recognize collating symbols or equivalence classes.
[^ ...]
This is a complemented character list. The first character after the `[' must be a `^'. It matches any single character except those in the square brackets. For example:
[^0-9]
matches any character that is not a digit.
|
This is the alternation operator, and it is used to specify alternatives. For example:
^P|[0-9]
matches any string that matches either `^P' or `[0-9]'. This means it matches any string that starts with `P' or contains a digit. The alternation applies to the largest possible regexps on either side. In other words, `|' has the lowest precedence of all the regular expression operators.
(...)
Parentheses are used for grouping in regular expressions as in arithmetic. They can be used to concatenate regular expressions containing the alternation operator, `|'. For example, `@(samp|code)\{[^}]+\}' matches both `@code{foo}' and `@samp{bar}'. (These are Texinfo formatting control sequences.)
*
This symbol means that the preceding regular expression is to be repeated as many times as necessary to find a match. For example:
ph*
applies the `*' symbol to the preceding `h' and looks for matches of one `p' followed by any number of `h's. This will also match just `p' if no `h's are present. The `*' repeats the smallest possible preceding expression. (Use parentheses if you wish to repeat a larger expression.) It finds as many repetitions as possible. For example:
awk '/\(c[ad][ad]*r x\)/ { print }' sample
prints every record in `sample' containing a string of the form `(car x)', `(cdr x)', `(cadr x)', and so on. Notice the escaping of the parentheses by preceding them with backslashes.
+
This symbol is similar to `*', but the preceding expression must be matched at least once. This means that:
wh+y
would match `why' and `whhy' but not `wy', whereas `wh*y' would match all three of these strings. This is a simpler way of writing the last `*' example:
awk '/\(c[ad]+r x\)/ { print }' sample
?
This symbol is similar to `*', but the preceding expression can be matched either once or not at all. For example:
fe?d
will match `fed' and `fd', but nothing else.
{n}
{n,}
{n,m}
One or two numbers inside braces denote an interval expression. If there is one number in the braces, the preceding regexp is repeated n times. If there are two numbers separated by a comma, the preceding regexp is repeated n to m times. If there is one number followed by a comma, then the preceding regexp is repeated at least n times.
wh{3}y
matches `whhhy' but not `why' or `whhhhy'.
wh{3,5}y
matches `whhhy' or `whhhhy' or `whhhhhy', only.
wh{2,}y
matches `whhy' or `whhhy', and so on.
Interval expressions were not traditionally available in awk. As part of the POSIX standard they were added, to make awk and egrep consistent with each other. However, since old programs may use `{' and `}' in regexp constants, by default gawk does not match interval expressions in regexps. If either `--posix' or `--re-interval' is specified (see section Command Line Options), then interval expressions are allowed in regexps.
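Several of the operators from the table can be tried on a short word list. This is a sketch only: the words, the helper function pick, and the variable names are all invented for illustration, and interval expressions are left out because their availability depends on the awk version and its options.

```shell
# Invented word list, one word per line.
words='why
whhy
wy
fed
fd
feed
Pat
7up'

# Hypothetical helper: run one awk program over the word list.
pick() { printf '%s\n' "$words" | awk "$1"; }

star=$(pick '/^wh*y$/')       # why, whhy, wy
plus=$(pick '/^wh+y$/')       # why, whhy
opt=$(pick '/^fe?d$/')        # fed, fd
dot=$(pick '/^.e.d$/')        # feed
list=$(pick '/^[0-9]/')       # 7up
nolist=$(pick '/^[^wf7]/')    # Pat

# ^ anchors to the start of the string, not to an embedded line.
anchor=$(awk 'BEGIN { if ("line1\nLINE 2" ~ /^L/) print "match"; else print "no match" }')

printf '%s\n' "$star" "$anchor"
```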

In regular expressions, the `*', `+', and `?' operators, as well as the braces `{' and `}', have the highest precedence, followed by concatenation, and finally by `|'. As in arithmetic, parentheses can change how operators are grouped.
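The effect of precedence is easiest to see with the alternation operator `|', which binds most loosely. In this sketch the three input lines and the variable names are invented; the only difference between the two programs is a pair of parentheses.

```shell
# Invented input lines.
data='Pat
x9
abc'

# Without parentheses this reads as (^P) or ([0-9]):
# a line that starts with P, or contains a digit anywhere.
loose=$(printf '%s\n' "$data" | awk '/^P|[0-9]/')

# Parentheses regroup it: the line must start with P or with a digit.
grouped=$(printf '%s\n' "$data" | awk '/^(P|[0-9])/')

printf '%s\n' "$loose" "$grouped"
```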

If gawk is in compatibility mode (see section Command Line Options), character classes and interval expressions are not available in regular expressions.

The next section discusses the GNU-specific regexp operators, and provides more detail concerning how command line options affect the way gawk interprets the characters in regular expressions.
Additional Regexp Operators Only in gawk

GNU software that deals with regular expressions provides a number of additional regexp operators. These operators are described in this section, and are specific to gawk; they are not available in other awk implementations.

Most of the additional operators are for dealing with word matching. For our purposes, a word is a sequence of one or more letters, digits, or underscores (`_').
\w
This operator matches any word-constituent character, i.e. any letter, digit, or underscore. Think of it as a short-hand for [[:alnum:]_].
\W
This operator matches any character that is not word-constituent. Think of it as a short-hand for [^[:alnum:]_].
\<
This operator matches the empty string at the beginning of a word. For example, /\<away/ matches `away', but not `stowaway'.
\>
This operator matches the empty string at the end of a word. For example, /stow\>/ matches `stow', but not `stowaway'.
\y
This operator matches the empty string at either the beginning or the end of a word (the word boundary). For example, `\yballs?\y' matches either `ball' or `balls' as a separate word.
\B
This operator matches the empty string within a word. In other words, `\B' matches the empty string that occurs between two word-constituent characters. For example, /\Brat\B/ matches `crate', but it does not match `dirty rat'. `\B' is essentially the opposite of `\y'.

There are two other operators that work on buffers. In Emacs, a buffer is, naturally, an Emacs buffer. For other programs, the regexp library routines that gawk uses consider the entire string to be matched as the buffer.

For awk, since `^' and `$' always work in terms of the beginning and end of strings, these operators don't add any new capabilities. They are provided for compatibility with other GNU software.

\`
This operator matches the empty string at the beginning of the buffer.
\'
This operator matches the empty string at the end of the buffer.

In other GNU software, the word boundary operator is `\b'. However, that conflicts with the awk language's definition of `\b' as backspace, so gawk uses a different letter.

An alternative method would have been to require two backslashes in the GNU operators, but this was deemed to be too confusing, and the current method of using `\y' for the GNU `\b' appears to be the lesser of two evils.

The various command line options (see section Command Line Options) control how gawk interprets characters in regexps.
No options
In the default case, gawk provides all the facilities of POSIX regexps and the GNU regexp operators described above. However, interval expressions are not supported.
--posix
Only POSIX regexps are supported; the GNU operators are not special (e.g., `\w' matches a literal `w'). Interval expressions are allowed.
--traditional
Traditional Unix awk regexps are matched. The GNU operators are not special, interval expressions are not available, and neither are the POSIX character classes ([[:alnum:]] and so on). Characters described by octal and hexadecimal escape sequences are treated literally, even if they represent regexp metacharacters.
--re-interval
Allow interval expressions in regexps, even if `--traditional' has been provided.
Case-sensitivity in Matching

Case is normally significant in regular expressions, both when matching ordinary characters (i.e. not metacharacters), and inside character sets. Thus a `w' in a regular expression matches only a lower-case `w' and not an upper-case `W'.

The simplest way to do a case-independent match is to use a character list: `[Ww]'. However, this can be cumbersome if you need to use it often; and it can make the regular expressions harder to read. There are two alternatives that you might prefer.

One way to do a case-insensitive match at a particular point in the program is to convert the data to a single case, using the tolower or toupper built-in string functions (which we haven't discussed yet; see section Built-in Functions for String Manipulation). For example:
tolower($1) ~ /foo/ { ... }

converts the first field to lower-case before matching against it. This will work in any POSIX-compliant implementation of awk.
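
As a small illustration of the tolower approach, the following sketch feeds three records to awk and prints the second field of each record whose first field contains `foo' in any mix of upper and lower case. The sample records and the shell variable name out are made up for this example.

```shell
# Sample input is invented for illustration; any POSIX awk should behave the same.
out=$(printf 'Foo 1\nbar 2\nFOOD 3\n' |
      awk 'tolower($1) ~ /foo/ { print $2 }')
printf '%s\n' "$out"
```

The records beginning with `Foo' and `FOOD' are selected, while `bar' is not. Note that the field itself is not modified; only the temporary lower-case copy is compared.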

Another method, specific to gawk, is to set the variable IGNORECASE to a non-zero value (see section Built-in Variables). When IGNORECASE is not zero, all regexp and string operations ignore case. Changing the value of IGNORECASE dynamically controls the case sensitivity of your program as it runs. Case is significant by default because IGNORECASE (like most variables) is initialized to zero.
x = "aB"
if (x ~ /ab/) ... # this test will fail

IGNORECASE = 1
if (x ~ /ab/) ... # now it will succeed

In general, you cannot use IGNORECASE to make certain rules case-insensitive and other rules case-sensitive, because there is no way to set IGNORECASE just for the pattern of a particular rule. To do this, you must use character lists or tolower. However, one thing you can do only with IGNORECASE is turn case-sensitivity on or off dynamically for all the rules at once.

IGNORECASE can be set on the command line, or in a BEGIN rule (see section Other Command Line Arguments; also see section Startup and Cleanup Actions). Setting IGNORECASE from the command line is a way to make a program case-insensitive without having to edit it.

Prior to version 3.0 of gawk, the value of IGNORECASE only affected regexp operations. It did not affect string comparison with `==', `!=', and so on. Beginning with version 3.0, both regexp and string comparison operations are affected by IGNORECASE.

Beginning with version 3.0 of gawk, the equivalences between upper-case and lower-case characters are based on the ISO-8859-1 (ISO Latin-1) character set. This character set is a superset of the traditional 128 ASCII characters, that also provides a number of characters suitable for use with European languages.

The value of IGNORECASE has no effect if gawk is in compatibility mode (see section Command Line Options). Case is always significant in compatibility mode.
How Much Text Matches?

Consider the following example:
echo aaaabcd | awk '{ sub(/a+/, "<A>"); print }'

This example uses the sub function (which we haven't discussed yet, see section Built-in Functions for String Manipulation) to make a change to the input record. Here, the regexp /a+/ indicates "one or more `a' characters," and the replacement text is `<A>'.

The input contains four `a' characters. What will the output be? In other words, how many is "one or more": will awk match two, three, or all four `a' characters?

The answer is, awk (and POSIX) regular expressions always match the leftmost, longest sequence of input characters that can match. Thus, in this example, all four `a' characters are replaced with `<A>'.
$ echo aaaabcd | awk '{ sub(/a+/, "<A>"); print }'
- <A>bcd

For simple match/no-match tests, this is not so important. But when doing text matching and substitutions with the match, sub, gsub, and gensub functions, it is very important. Understanding this principle is also important for regexp-based record and field splitting (see section How Input is Split into Records, and also see section Specifying How Fields are Separated).
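
The leftmost, longest rule can be observed directly with the match function, which sets RSTART and RLENGTH to the position and length of the matched text. The sketch below is only an illustration; the second input string and the shell variable names out1 and out2 are invented for the example.

```shell
# Leftmost, longest: all four a characters form one match starting at position 1.
out1=$(echo aaaabcd | awk '{ if (match($0, /a+/)) print RSTART, RLENGTH }')
# Leftmost takes priority over longest: the earlier "ab" is chosen, not the later "abb".
out2=$(echo xabcabbc | awk '{ if (match($0, /ab+/)) print RSTART, RLENGTH }')
printf '%s\n%s\n' "$out1" "$out2"
```

The first command reports a match of length four at position one. The second reports a match of length two at position two, even though a longer match exists further to the right.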
Using Dynamic Regexps

The right hand side of a `~' or `!~' operator need not be a regexp constant (i.e. a string of characters between slashes). It may be any expression. The expression is evaluated, and converted if necessary to a string; the contents of the string are used as the regexp. A regexp that is computed in this way is called a dynamic regexp. For example:
BEGIN { identifier_regexp = "[A-Za-z_][A-Za-z_0-9]*" }
$0 ~ identifier_regexp { print }

sets identifier_regexp to a regexp that describes awk variable names, and tests if the input record matches this regexp.

Caution: When using the `~' and `!~' operators, there is a difference between a regexp constant enclosed in slashes, and a string constant enclosed in double quotes. If you are going to use a string constant, you have to understand that the string is in essence scanned twice; the first time when awk reads your program, and the second time when it goes to match the string on the left-hand side of the operator with the pattern on the right. This is true of any string valued expression (such as identifier_regexp above), not just string constants.

What difference does it make if the string is scanned twice? The answer has to do with escape sequences, and particularly with backslashes. To get a backslash into a regular expression inside a string, you have to type two backslashes.

For example, /\*/ is a regexp constant for a literal `*'. Only one backslash is needed. To do the same thing with a string, you would have to type "\\*". The first backslash escapes the second one, so that the string actually contains the two characters `\' and `*'.
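
The following sketch shows both spellings side by side. It assumes the program is passed inside single quotes on a POSIX shell command line, so the shell leaves the backslashes alone; the input `a*b' and the variable names re and out are made up for the example.

```shell
out=$(echo 'a*b' | awk '{
    re = "\\*"                 # after string scanning, re holds the two characters \ and *
    if ($0 ~ re)   print "dynamic regexp matched"
    if ($0 ~ /\*/) print "regexp constant matched"
    print length(re)
}')
printf '%s\n' "$out"
```

Both tests succeed, and the length of re is two, which confirms that only one backslash survives the first scan of the string constant.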

Given that you can use both regexp and string constants to describe regular expressions, which should you use? The answer is "regexp constants," for several reasons.
String constants are more complicated to write, and more difficult to read. Using regexp constants makes your programs less error-prone. Not understanding the difference between the two kinds of constants is a common source of errors.
It is also more efficient to use regexp constants: awk can note that you have supplied a regexp and store it internally in a form that makes pattern matching more efficient. When using a string constant, awk must first convert the string into this internal form, and then perform the pattern matching.
Using regexp constants is better style; it shows clearly that you intend a regexp match.

Reading Input Files

In the typical awk program, all input is read either from the standard input (by default the keyboard, but often a pipe from another command) or from files whose names you specify on the awk command line. If you specify input files, awk reads them in order, reading all the data from one before going on to the next. The name of the current input file can be found in the built-in variable FILENAME (see section Built-in Variables).

The input is read in units called records, and processed by the rules of your program one record at a time. By default, each record is one line. Each record is automatically split into chunks called fields. This makes it more convenient for programs to work on the parts of a record.

On rare occasions you will need to use the getline command. The getline command is valuable, both because it can do explicit input from any number of files, and because the files used with it do not have to be named on the awk command line (see section Explicit Input with getline).
Records: Controlling how data is split into records.
Fields: An introduction to fields.
Non-Constant Fields: Non-constant Field Numbers.
Changing Fields: Changing the Contents of a Field.
Field Separators: The field separator and how to change it.
Constant Size: Reading constant width data.
Multiple Line: Reading multi-line records.
Getline: Reading files under explicit program control using the getline function.

Changing the Contents of a Field

You can change the contents of a field as seen by awk within an awk program; this changes what awk perceives as the current input record. (The actual input is untouched; awk never modifies the input file.)

Consider this example and its output:
$ awk '{ $3 = $2 - 10; print $2, $3 }' inventory-shipped
- 13 3
- 15 5
- 15 5
...

The `-' sign represents subtraction, so this program reassigns field three, $3, to be the value of field two minus ten, `$2 - 10'. (See section Arithmetic Operators.) Then field two, and the new value for field three, are printed.

In order for this to work, the text in field $2 must make sense as a number; the string of characters must be converted to a number in order for the computer to do arithmetic on it. The number resulting from the subtraction is converted back to a string of characters which then becomes field three. See section Conversion of Strings and Numbers.

When you change the value of a field (as perceived by awk), the text of the input record is recalculated to contain the new field where the old one was. Therefore, $0 changes to reflect the altered field. Thus, this program prints a copy of the input file, with 10 subtracted from the second field of each line.
$ awk '{ $2 = $2 - 10; print $0 }' inventory-shipped
- Jan 3 25 15 115
- Feb 5 32 24 226
- Mar 5 24 34 228
...

You can also assign contents to fields that are out of range. For example:
$ awk '{ $6 = ($5 + $4 + $3 + $2)
> print $6 }' inventory-shipped
- 168
- 297
- 301
...

We've just created $6, whose value is the sum of fields $2, $3, $4, and $5. The `+' sign represents addition. For the file `inventory-shipped', $6 represents the total number of parcels shipped for a particular month.

Creating a new field changes awk's internal copy of the current input record--the value of $0. Thus, if you do `print $0' after adding a field, the record printed includes the new field, with the appropriate number of field separators between it and the previously existing fields.

This recomputation affects and is affected by NF (the number of fields; see section Examining Fields), and by a feature that has not been discussed yet, the output field separator, OFS, which is used to separate the fields (see section Output Separators). For example, the value of NF is set to the number of the highest field you create.

Note, however, that merely referencing an out-of-range field does not change the value of either $0 or NF. Referencing an out-of-range field only produces an empty string. For example:
if ($(NF+1) != "")
    print "can't happen"
else
    print "everything is normal"

should print `everything is normal', because NF+1 is certain to be out of range. (See section The if-else Statement, for more information about awk's if-else statements. See section Variable Typing and Comparison Expressions, for more information about the `!=' operator.)

It is important to note that making an assignment to an existing field will change the value of $0, but will not change the value of NF, even when you assign the empty string to a field. For example:
$ echo a b c d | awk '{ OFS = ":"; $2 = ""
> print $0; print NF }'
- a::c:d
- 4

The field is still there; it just has an empty value. You can tell because there are two colons in a row.

This example shows what happens if you create a new field.
$ echo a b c d | awk '{ OFS = ":"; $2 = ""; $6 = "new"
> print $0; print NF }'
- a::c:d::new
- 6

The intervening field, $5, is created with an empty value (indicated by the second pair of adjacent colons), and NF is updated with the value six.
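
The difference between referencing and assigning an out-of-range field can be seen in one short run. This is a sketch with an invented three-field input line; the variable names x and out are not from the text above.

```shell
out=$(echo 'a b c' | awk '{
    x = $(NF + 2)        # reading a missing field yields "" and changes nothing
    print NF
    $(NF + 2) = "e"      # assigning creates an empty $4 and a new $5, and raises NF
    print NF
    print $0
}')
printf '%s\n' "$out"
```

NF is three after the reference and five after the assignment. The rebuilt record contains two adjacent separators (spaces, since OFS has its default value) where the empty fourth field sits.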

Using Regular Expressions to Separate Fields

The previous subsection discussed the use of single characters or simple strings as the value of FS. More generally, the value of FS may be a string containing any regular expression. In this case, each match in the record for the regular expression separates fields. For example, the assignment:
FS = ", \t"

makes every area of an input line that consists of a comma followed by a space and a tab, into a field separator. (`\t' is an escape sequence that stands for a tab; see section Escape Sequences, for the complete list of similar escape sequences.)

For a less trivial example of a regular expression, suppose you want single spaces to separate fields the way single commas were used above. You can set FS to "[ ]" (left bracket, space, right bracket). This regular expression matches a single space and nothing else (see section Regular Expressions).
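
To see the effect of "[ ]", compare the field counts for an input line with two spaces between two letters. This is a sketch; the input and the shell variable names out1 and out2 are chosen for illustration only.

```shell
# The input has two spaces between a and b.
out1=$(echo 'a  b' | awk '{ print NF }')                        # default FS
out2=$(echo 'a  b' | awk 'BEGIN { FS = "[ ]" } { print NF }')   # each space separates
printf '%s %s\n' "$out1" "$out2"
```

With the default FS the run of spaces counts as one separator, giving two fields. With "[ ]" each space is its own separator, so an empty field appears between them and NF is three.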

There is an important difference between the two cases of `FS = " "' (a single space) and `FS = "[ \t]+"' (left bracket, space, backslash, "t", right bracket, plus, which is a regular expression matching one or more spaces or tabs). For both values of FS, fields are separated by runs of spaces and/or tabs. However, when the value of FS is " ", awk will first strip leading and trailing whitespace from the record, and then decide where the fields are.

For example, the following pipeline prints `b':
$ echo ' a b c d ' | awk '{ print $2 }'
- b

However, this pipeline prints `a' (note the extra spaces around each letter):
$ echo ' a  b  c  d ' | awk 'BEGIN { FS = "[ \t]+" }
> { print $2 }'
- a

In this case, the first field is null, or empty.

The stripping of leading and trailing whitespace also comes into play whenever $0 is recomputed. For instance, study this pipeline:
$ echo '   a b c d' | awk '{ print; $2 = $2; print }'
-    a b c d
- a b c d

The first print statement prints the record as it was read, with leading whitespace intact. The assignment to $2 rebuilds $0 by concatenating $1 through $NF together, separated by the value of OFS. Since the leading whitespace was ignored when finding $1, it is not part of the new $0. Finally, the last print statement prints the new $0.

Explicit Input with getline

So far we have been getting our input data from awk's main input stream--either the standard input (usually your terminal, sometimes the output from another program) or from the files specified on the command line. The awk language has a special built-in command called getline that can be used to read input under your explicit control.
Getline Intro: Introduction to the getline function.
Plain Getline: Using getline with no arguments.
Getline/Variable: Using getline into a variable.
Getline/File: Using getline from a file.
Getline/Variable/File: Using getline into a variable from a file.
Getline/Pipe: Using getline from a pipe.
Getline/Variable/Pipe: Using getline into a variable from a pipe.
Getline Summary: Summary Of getline Variants.
Introduction to getline

This command is used in several different ways, and should not be used by beginners. It is covered here because this is the chapter on input. The examples that follow the explanation of the getline command include material that has not been covered yet. Therefore, come back and study the getline command after you have reviewed the rest of this book and have a good knowledge of how awk works.

getline returns one if it finds a record, and zero if the end of the file is encountered. If there is some error in getting a record, such as a file that cannot be opened, then getline returns -1. In this case, gawk sets the variable ERRNO to a string describing the error that occurred.
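
The three return values can be checked with a short experiment. The sketch below assumes a POSIX shell with mktemp available; the temporary file, the nonexistent path, and the variable names are all made up for the example, and it uses the `getline var < file' form that is described later in this chapter.

```shell
tmp=$(mktemp)
printf 'one\ntwo\nthree\n' > "$tmp"
out=$(awk -v f="$tmp" 'BEGIN {
    n = 0
    while ((getline line < f) > 0)   # 1 while records remain, 0 at end of file
        n++
    print "records:", n
    r = (getline line < "/no/such/dir/no-such-file")   # cannot be opened
    print "missing file:", r
}')
rm -f "$tmp"
printf '%s\n' "$out"
```

Testing the result with `> 0' matters: a loop written as `while (getline line < f)' would never terminate on a file that cannot be opened, because -1 counts as true.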

In the following examples, command stands for a string value that represents a shell command.
Using getline with No Arguments

The getline command can be used without arguments to read input from the current input file. All it does in this case is read the next input record and split it up into fields. This is useful if you've finished processing the current record, but you want to do some special processing right now on the next record. Here's an example:
awk '{
    if ((t = index($0, "/*")) != 0) {
        # value will be "" if t is 1
        tmp = substr($0, 1, t - 1)
        u = index(substr($0, t + 2), "*/")
        while (u == 0) {
            if (getline <= 0) {
                m = "unexpected EOF or error"
                m = (m ": " ERRNO)
                print m > "/dev/stderr"
                exit
            }
            t = -1
            u = index($0, "*/")
        }
        # substr expression will be "" if */
        # occurred at end of line
        $0 = tmp substr($0, t + u + 3)
    }
    print $0
}'

This awk program deletes all C-style comments, `/* ... */', from the input. By replacing the `print $0' with other statements, you could perform more complicated processing on the decommented input, like searching for matches of a regular expression. This program has a subtle problem--it does not work if one comment ends and another begins on the same line.

This form of the getline command sets NF (the number of fields; see section Examining Fields), NR (the number of records read so far; see section How Input is Split into Records), FNR (the number of records read from this input file), and the value of $0.

Note: the new value of $0 is used in testing the patterns of any subsequent rules. The original value of $0 that triggered the rule which executed getline is lost (d.c.). By contrast, the next statement reads a new record but immediately begins processing it normally, starting with the first rule in the program. See section The next Statement.
Using getline Into a Variable

You can use `getline var' to read the next record from awk's input into the variable var. No other processing is done.

For example, suppose the next line is a comment, or a special string, and you want to read it, without triggering any rules. This form of getline allows you to read that line and store it in a variable so that the main read-a-line-and-check-each-rule loop of awk never sees it.

The following example swaps every two lines of input. For example, given:
wan
tew
free
phore

it outputs:
tew
wan
phore
free

Here's the program:
awk '{
    if ((getline tmp) > 0) {
        print tmp
        print $0
    } else
        print $0
}'

The getline command used in this way sets only the variables NR and FNR (and of course, var). The record is not split into fields, so the values of the fields (including $0) and the value of NF do not change.
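
A quick way to confirm which variables change is to print them right after the call. This is a sketch with two invented input records; the variable names tmp and out are chosen for the example.

```shell
out=$(printf 'a b c\nd e\n' | awk 'NR == 1 {
    getline tmp                    # consumes the second record into tmp
    print NR, NF, $0 " / " tmp
}')
printf '%s\n' "$out"
```

NR has advanced to two, but NF is still three and $0 is still the first record; the second record is available only in tmp.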
Using getline from a File

Use `getline < file' to read the next record from the file file. Here file is a string-valued expression that specifies the file name. `< file' is called a redirection since it directs input to come from a different place.

For example, the following program reads its input record from the file `secondary.input' when it encounters a first field with a value equal to 10 in the current input file.
awk '{
    if ($1 == 10) {
        getline < "secondary.input"
        print
    } else
        print
}'

Since the main input stream is not used, the values of NR and FNR are not changed. But the record read is split into fields in the normal manner, so the values of $0 and other fields are changed. So is the value of NF.
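
The same kind of check works for this form. The sketch assumes a POSIX shell with mktemp; the temporary file stands in for a secondary input file, and its contents and the variable names are invented.

```shell
tmp=$(mktemp)
printf 'x y z w\n' > "$tmp"
out=$(echo 'one two' | awk -v f="$tmp" '{
    getline < f            # replaces $0 and NF; NR stays at 1
    print NR, NF, $0
}')
rm -f "$tmp"
printf '%s\n' "$out"
```

The record from the main input is gone, replaced by the four-field record from the file, while NR still counts only the one record read from the main input.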
Using getline Into a Variable from a File

Use `getline var < file' to read input from the file file and put it in the variable var. As above, file is a string-valued expression that specifies the file from which to read.

In this version of getline, none of the built-in variables are changed, and the record is not split into fields. The only variable changed is var.
For example, the following program copies all the input files to the output, except for records that say `@include filename'. Such a record is replaced by the contents of the file filename.
awk '{
    if (NF == 2 && $1 == "@include") {
        while ((getline line < $2) > 0)
            print line
        close($2)
    } else
        print
}'

Note here how the name of the extra input file is not built into the program; it is taken directly from the data, from the second field on the `@include' line.

The close function is called to ensure that if two identical `@include' lines appear in the input, the entire specified file is included twice. See section Closing Input and Output Files and Pipes.
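
The effect of close on a file read with getline can be shown in isolation. This sketch assumes a POSIX shell with mktemp; the two-line temporary file and the variable names a, b, and c are made up for the example.

```shell
tmp=$(mktemp)
printf 'first\nsecond\n' > "$tmp"
out=$(awk -v f="$tmp" 'BEGIN {
    getline a < f
    getline b < f      # continues where the previous read stopped
    close(f)
    getline c < f      # after close, reading starts over at the top
    print a, b, c
}')
rm -f "$tmp"
printf '%s\n' "$out"
```

Without the call to close, the third getline would reach the end of the file and leave c empty, which is exactly what would happen to a second identical `@include' line in the program above.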

One deficiency of this program is that it does not process nested `@include' statements (`@include' statements in included files) the way a true macro preprocessor would. See section An Easy Way to Use Library Functions, for a program that does handle nested `@include' statements.
Summary of getline Variants

With all the forms of getline, even though $0 and NF may be updated, the record will not be tested against all the patterns in the awk program, in the way that would happen if the record were read normally by the main processing loop of awk. However, the new record is tested against any subsequent rules.

Many awk implementations limit the number of pipelines an awk program may have open to just one! In gawk, there is no such limit. You can open as many pipelines as the underlying operating system will permit.

The following table summarizes the six variants of getline, listing which built-in variables are set by each one.
getline
sets $0, NF, FNR, and NR.
getline var
sets var, FNR, and NR.
getline < file
sets $0 and NF.
getline var < file
sets var.
command | getline
sets $0 and NF.
command | getline var
sets var.
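
The two pipe forms, written `command | getline' and `command | getline var', can be tried out from a BEGIN rule, where there is no current input record to get in the way. This is only a sketch; the echo command and the variable names cmd, word, and out are chosen for illustration.

```shell
out=$(awk 'BEGIN {
    cmd = "echo hello world"
    cmd | getline              # sets $0 and NF from the command output
    print NF, $2
    close(cmd)
    cmd | getline word         # sets only the variable word
    print NF, word
    close(cmd)
}')
printf '%s\n' "$out"
```

After the first call NF is two and $2 is `world'. The second call stores the whole line in word and leaves NF as it was. The command is closed between the two calls so that it is run again rather than read past its end.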

Printing Output

One of the most common actions is to print, or output, some or all of the input. You use the print statement for simple output. You use the printf statement for fancier formatting. Both are described in this chapter.

The print Statement

The print statement does output with simple, standardized formatting. You specify only the strings or numbers to be printed, in a list separated by commas. They are output, separated by single spaces, followed by a newline. The statement looks like this:
print item1, item2, ...

The entire list of items may optionally be enclosed in parentheses. The parentheses are necessary if any of the item expressions uses the `>' relational operator; otherwise it could be confused with a redirection (see section Redirecting Output of print and printf).

The items to be printed can be constant strings or numbers, fields of the current record (such as $1), variables, or any awk expressions. Numeric values are converted to strings, and then printed.

The print statement is completely general for computing what values to print. However, with two exceptions, you cannot specify how to print them--how many columns, whether to use exponential notation or not, and so on. (For the exceptions, see section Output Separators, and section Controlling Numeric Output with print.) For that, you need the printf statement (see section Using printf Statements for Fancier Printing).

The simple statement `print' with no items is equivalent to `print $0': it prints the entire current record. To print a blank line, use `print ""', where "" is the empty string.

To print a fixed piece of text, use a string constant such as "Don't Panic" as one item. If you forget to use the double-quote characters, your text will be taken as an awk expression, and you will probably get an error. Keep in mind that a space is printed between any two items.
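
The role of the comma is easy to see by leaving it out. In the following sketch (the input record and the variable name out are invented), the first print uses a comma, the second omits it, and the third mixes a string constant with an arithmetic expression.

```shell
out=$(echo 'Jan 13' | awk '{
    print $1, $2            # comma: items are separated by a space
    print $1 $2             # no comma: the two values are concatenated
    print "Total:", $2 + 10
}')
printf '%s\n' "$out"
```

Without the comma, awk treats `$1 $2' as a single item formed by string concatenation, so no space appears in the output.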

Each print statement makes at least one line of output. But it isn't limited to one line. If an item value is a string that contains a newline, the newline is output along with the rest of the string. A single print can make any number of lines this way.
Examples of print Statements

Here is an example of printing a string that contains embedded newlines (the `\n' is an escape sequence, used to represent the newline character; see section Escape Sequences):
$ awk 'BEGIN { print "line one\nline two\nline three" }'
- line one
- line two
- line three

Here is an example that prints the first two fields of each input record, with a space between them:
$ awk '{ print $1, $2 }' inventory-shipped
- Jan 13
- Feb 15
- Mar 15
...

Controlling Numeric Output with print

When you use the print statement to print numeric values, awk internally converts the number to a string of characters, and prints that string. awk uses the sprintf function to do this conversion (see section Built-in Functions for String Manipulation). For now, it suffices to say that the sprintf function accepts a format specification that tells it how to format numbers (or strings), and that there are a number of different ways in which numbers can be formatted. The different format specifications are discussed more fully in section Format-Control Letters.

The built-in variable OFMT contains the default format specification that print uses with sprintf when it wants to convert a number to a string for printing. The default value of OFMT is "%.6g". By supplying different format specifications as the value of OFMT, you can change how print will print your numbers. As a brief example:
$ awk 'BEGIN {
> OFMT = "%.0f" # print numbers as integers (rounds)
> print 17.23 }'
- 17

According to the POSIX standard, awk's behavior will be undefined if OFMT contains anything but a floating point conversion specification (d.c.).
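
One more sketch of OFMT, this time with a two-decimal format. It relies on the POSIX rule that a number with an integral value is printed as an integer regardless of OFMT; the values and the variable name out are chosen for the example.

```shell
out=$(awk 'BEGIN {
    OFMT = "%.2f"
    print 3.14159, 17      # OFMT applies to 3.14159; 17 is integral
}')
printf '%s\n' "$out"
```

The first value is rounded to two decimal places, while 17 is printed without a decimal point.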

Examples Using printf

Here is how to use printf to make an aligned table:
awk '{ printf "%-10s %s\n", $1, $2 }' BBS-list

prints the names of bulletin boards ($1) of the file `BBS-list' as a string of 10 characters, left justified. It also prints the phone numbers ($2) afterward on the line. This produces an aligned two-column table of names and phone numbers:
$ awk '{ printf "%-10s %s\n", $1, $2 }' BBS-list
- aardvark   555-5553
- alpo-net   555-3412
- barfly     555-7685
- bites      555-1675
- camelot    555-0542
- core       555-2912
- fooey      555-1234
- foot       555-6699
- macfoo     555-6480
- sdace      555-3430
- sabafoo    555-2127

Did you notice that we did not specify that the phone numbers be printed as numbers? They had to be printed as strings because the numbers are separated by a dash. If we had tried to print the phone numbers as numbers, all we would have gotten would have been the first three digits, `555'. This would have been pretty confusing.

We did not specify a width for the phone numbers because they are the last things on their lines. We don't need to put spaces after them.

We could make our table look even nicer by adding headings to the tops of the columns. To do this, we use the BEGIN pattern (see section The BEGIN and END Special Patterns) to force the header to be printed only once, at the beginning of the awk program:
awk 'BEGIN { print "Name       Number"
             print "----       ------" }
     { printf "%-10s %s\n", $1, $2 }' BBS-list

Did you notice that we mixed print and printf statements in the above example? We could have used just printf statements to get the same results:
awk 'BEGIN { printf "%-10s %s\n", "Name", "Number"
             printf "%-10s %s\n", "----", "------" }
     { printf "%-10s %s\n", $1, $2 }' BBS-list

By printing each column heading with the same format specification used for the elements of the column, we have made sure that the headings are aligned just like the columns.

The fact that the same format specification is used three times can be emphasized by storing it in a variable, like this:
awk 'BEGIN { format = "%-10s %s\n"
             printf format, "Name", "Number"
             printf format, "----", "------" }
     { printf format, $1, $2 }' BBS-list

See if you can use the printf statement to line up the headings and table data for our `inventory-shipped' example covered earlier in the section on the print statement (see section The print Statement).
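
One possible answer, as a sketch rather than the only solution: the three input rows below are reconstructed from the sample output shown in the earlier examples, and the column widths, headings, and the variable name out are choices made for this illustration.

```shell
out=$(printf 'Jan 13 25 15 115\nFeb 15 32 24 226\nMar 15 24 34 228\n' |
      awk 'BEGIN { printf "%-5s %5s\n", "Month", "Total" }
                 { printf "%-5s %5d\n", $1, $2 + $3 + $4 + $5 }')
printf '%s\n' "$out"
```

The month names are left justified in a five-character column, and the totals are right justified with `%5d' so that the digits line up under the heading.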

Special File Names in gawk

Running programs conventionally have three input and output streams already available to them for reading and writing. These are known as the standard input, standard output, and standard error output. These streams are, by default, connected to your terminal, but they are often redirected with the shell, via the `<', `<<', `>', `>>', `>&' and `|' operators. Standard error is typically used for writing error messages; the reason we have two separate streams, standard output and standard error, is so that they can be redirected separately.

In other implementations of awk, the only way to write an error message to standard error in an awk program is as follows:
print "Serious error detected!" | "cat 1>&2"

This works by opening a pipeline to a shell command which can access the standard error stream which it inherits from the awk process. This is far from elegant, and is also inefficient, since it requires a separate process. So people writing awk programs often neglect to do this. Instead, they send the error messages to the terminal, like this:
print "Serious error detected!" > "/dev/tty"

This usually has the same effect, but not always: although the standard error stream is usually the terminal, it can be redirected, and when that happens, writing to the terminal is not correct. In fact, if awk is run from a background job, it may not have a terminal at all. Then opening `/dev/tty' will fail.

gawk provides special file names for accessing the three standard streams. When you redirect input or output in gawk, if the file name matches one of these special names, then gawk directly uses the stream it stands for.

`/dev/stdin'
The standard input (file descriptor 0).
`/dev/stdout'
The standard output (file descriptor 1).
`/dev/stderr'
The standard error output (file descriptor 2).
`/dev/fd/N'
The file associated with file descriptor N. Such a file must have been opened by the program initiating the awk execution (typically the shell). Unless you take special pains in the shell from which you invoke gawk, only descriptors 0, 1 and 2 are available.

The file names `/dev/stdin', `/dev/stdout', and `/dev/stderr' are aliases for `/dev/fd/0', `/dev/fd/1', and `/dev/fd/2', respectively, but they are more self-explanatory.

The proper way to write an error message in a gawk program is to use `/dev/stderr', like this:
print "Serious error detected!" > "/dev/stderr"
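
To check that the two streams really are separate, redirect standard error away in the shell and capture what remains. This is a sketch; the message text on the first line and the shell variable name out are invented for the example.

```shell
out=$(awk 'BEGIN {
    print "result line"                                # goes to standard output
    print "Serious error detected!" > "/dev/stderr"    # goes to standard error
}' 2>/dev/null)
printf '%s\n' "$out"
```

Only `result line' is captured, because the error message travelled on standard error, which the shell discarded.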

gawk also provides special file names that give access to information about the running gawk process. Each of these "files" provides a single record of information. To read them more than once, you must first close them with the close function (see section Closing Input and Output Files and Pipes). The filenames are:

`/dev/pid'
Reading this file returns the process ID of the current process, in decimal, terminated with a newline.
`/dev/ppid'
Reading this file returns the parent process ID of the current process, in decimal, terminated with a newline.
`/dev/pgrpid'
Reading this file returns the process group ID of the current process, in decimal, terminated with a newline.
`/dev/user'
Reading this file returns a single record terminated with a newline. The fields are separated with spaces. The fields represent the following information:
$1
The return value of the getuid system call (the real user ID number).
$2
The return value of the geteuid system call (the effective user ID number).
$3
The return value of the getgid system call (the real group ID number).
$4
The return value of the getegid system call (the effective group ID number).
If there are any additional fields, they are the group IDs returned by the getgroups system call. (Multiple groups may not be supported on all systems.)

These special file names may be used on the command line as data files, as well as for I/O redirections within an awk program. They may not be used as source files with the `-f' option.

Recognition of these special file names is disabled if gawk is in compatibility mode (see section Command Line Options).

Caution: Unless your system actually has a `/dev/fd' directory (or any of the other above listed special files), the interpretation of these file names is done by gawk itself. For example, using `/dev/fd/4' for output will actually write on file descriptor 4, and not on a new file descriptor that was dup'ed from file descriptor 4. Most of the time this does not matter; however, it is important to not close any of the files related to file descriptors 0, 1, and 2. If you do close one of these files, unpredictable behavior will result.

The special files that provide process-related information may disappear in a future version of gawk. See section Probable Future Extensions.

Patterns and Actions

As you have already seen, each awk rule consists of a pattern with an associated action. This chapter describes how you build patterns and actions.
Pattern Overview: What goes into a pattern.
Action Overview: What goes into an action.
Pattern Elements

Patterns in awk control the execution of rules: a rule is executed when its pattern matches the current input record. This section explains how to write patterns.
Kinds of Patterns: A list of all kinds of patterns.
Regexp Patterns: Using regexps as patterns.
Expression Patterns: Any expression can be used as a pattern.
Ranges: Pairs of patterns specify record ranges.
BEGIN/END: Specifying initialization and cleanup rules.
Empty: The empty pattern, which matches every record.
Kinds of Patterns

Here is a summary of the types of patterns supported in awk.
/regular expression/
A regular expression as a pattern. It matches when the text of the input record fits the regular expression. (See section Regular Expressions.)
expression
A single expression. It matches when its value is non-zero (if a number) or non-null (if a string). (See section Expressions as Patterns.)
pat1, pat2
A pair of patterns separated by a comma, specifying a range of records. The range includes both the initial record that matches pat1, and the final record that matches pat2. (See section Specifying Record Ranges with Patterns.)
BEGIN
END
Special patterns that let you supply start-up or clean-up actions for your awk program. (See section The BEGIN and END Special Patterns.)
empty
The empty pattern matches every input record. (See section The Empty Pattern.)
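
The kinds of patterns summarized above can be seen side by side in one small program. This is only a sketch: the file names `bbs-sample.txt' and `kinds.awk' and the variable names are invented for the illustration, and the five records are copied from the `BBS-list' output shown elsewhere in this chapter.

```shell
# Five records taken from the BBS-list output shown in this chapter.
printf '%s\n' \
  'aardvark 555-5553 1200/300 B' \
  'alpo-net 555-3412 2400/1200/300 A' \
  'barfly 555-7685 1200/300 A' \
  'bites 555-1675 2400/1200/300 A' \
  'camelot 555-0542 300 C' > bbs-sample.txt

# One rule for each kind of pattern (kinds.awk is a hypothetical name).
cat > kinds.awk <<'EOF'
BEGIN           { print "start" }               # BEGIN: before any input
/2400/          { fast++ }                      # regular expression
$4 == "A"       { a_rated++ }                   # expression
/alpo/, /bites/ { print "in range:", $1 }       # range of records
                { total++ }                     # empty: every record
END             { print total, "records,", fast, "with 2400,", a_rated, "rated A" }
EOF

awk -f kinds.awk bbs-sample.txt
```

With this sample the program prints `start', then an `in range:' line for alpo-net, barfly and bites, and finally `5 records, 2 with 2400, 3 rated A'.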
Regular Expressions as Patterns

We have been using regular expressions as patterns since our early examples. This kind of pattern is simply a regexp constant in the pattern part of a rule. Its meaning is `$0 ~ /pattern/'. The pattern matches when the input record matches the regexp. For example:
/foobarbaz/ { buzzwords++ }
END { print buzzwords, "buzzwords seen" }
Expressions as Patterns

Any awk expression is valid as an awk pattern. Such a pattern matches if the expression's value is non-zero (if a number) or non-null (if a string).

The expression is reevaluated each time the rule is tested against a new input record. If the expression uses fields such as $1, the value depends directly on the new input record's text; otherwise, it depends only on what has happened so far in the execution of the awk program, but that may still be useful.
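
A pattern that uses no fields at all can still be useful, because it is tested again for every record. In the sketch below, the variable `seen' (a name chosen only for this illustration) is set by one rule and used as the whole pattern of another. The sample file name is likewise hypothetical, and its records are copied from the `BBS-list' output shown in this chapter.

```shell
# Five records taken from the BBS-list output shown in this chapter.
printf '%s\n' \
  'aardvark 555-5553 1200/300 B' \
  'alpo-net 555-3412 2400/1200/300 A' \
  'barfly 555-7685 1200/300 A' \
  'bites 555-1675 2400/1200/300 A' \
  'camelot 555-0542 300 C' > bbs-sample.txt

# The first rule sets a flag when "barfly" is seen.
# The second rule has the flag itself as its pattern, so it matches
# that record and every later one, whatever their text is.
awk '/barfly/ { seen = 1 }
     seen     { print $1 }' bbs-sample.txt
```

This prints barfly, bites and camelot: the pattern `seen' is zero (no match) until the first rule changes it, and non-zero from then on.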

A very common kind of expression used as a pattern is the comparison expression, using the comparison operators described in section Variable Typing and Comparison Expressions.

Regexp matching and non-matching are also very common expressions. The left operand of the `~' and `!~' operators is a string. The right operand is either a constant regular expression enclosed in slashes (/regexp/), or any expression, whose string value is used as a dynamic regular expression (see section Using Dynamic Regexps).
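
As a rough illustration of a dynamic regexp on the right of `~' and `!~', the sketch below passes the regexp text in a variable. The variable name `pat' and the file name `bbs-sample.txt' are assumptions made for this example; the records are copied from the `BBS-list' output shown in this chapter.

```shell
# Five records taken from the BBS-list output shown in this chapter.
printf '%s\n' \
  'core 555-2912 1200/300 C' \
  'fooey 555-1234 2400/1200/300 B' \
  'foot 555-6699 1200/300 B' \
  'macfoo 555-6480 1200/300 A' \
  'sabafoo 555-2127 1200/300 C' > bbs-sample.txt

# The string value of pat is used as the regexp: first field begins with foo.
awk -v pat='^foo' '$1 ~ pat { print $1, $2 }' bbs-sample.txt

# The same mechanism with the non-matching operator.
awk -v pat='foo' '$1 !~ pat { print $1 }' bbs-sample.txt
```

The first command prints the fooey and foot entries with their numbers; the second prints only core, the one name that does not contain `foo'.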

The following example prints the second field of each input record whose first field is precisely `foo'.
$ awk '$1 == "foo" { print $2 }' BBS-list

(There is no output, since there is no BBS site named "foo".) Contrast this with the following regular expression match, which would accept any record with a first field that contains `foo':
$ awk '$1 ~ /foo/ { print $2 }' BBS-list
- 555-1234
- 555-6699
- 555-6480
- 555-2127

Boolean expressions are also commonly used as patterns. Whether the pattern matches an input record depends on whether its subexpressions match.

For example, the following command prints all records in `BBS-list' that contain both `2400' and `foo'.
$ awk '/2400/ && /foo/' BBS-list
- fooey 555-1234 2400/1200/300 B

The following command prints all records in `BBS-list' that contain either `2400' or `foo', or both.
$ awk '/2400/ || /foo/' BBS-list
- alpo-net 555-3412 2400/1200/300 A
- bites 555-1675 2400/1200/300 A
- fooey 555-1234 2400/1200/300 B
- foot 555-6699 1200/300 B
- macfoo 555-6480 1200/300 A
- sdace 555-3430 2400/1200/300 A
- sabafoo 555-2127 1200/300 C

The following command prints all records in `BBS-list' that do not contain the string `foo'.
$ awk '! /foo/' BBS-list
- aardvark 555-5553 1200/300 B
- alpo-net 555-3412 2400/1200/300 A
- barfly 555-7685 1200/300 A
- bites 555-1675 2400/1200/300 A
- camelot 555-0542 300 C
- core 555-2912 1200/300 C
- sdace 555-3430 2400/1200/300 A

The subexpressions of a boolean operator in a pattern can be constant regular expressions, comparisons, or any other awk expressions. Range patterns are not expressions, so they cannot appear inside boolean patterns. Likewise, the special patterns BEGIN and END, which never match any input record, are not expressions and cannot appear inside boolean patterns.
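
For instance, a regexp constant and a comparison can be combined in a single boolean pattern. The following is a sketch only; the file name `bbs-sample.txt' is hypothetical and the records are copied from the `BBS-list' output shown above.

```shell
# Five records taken from the BBS-list output shown in this chapter.
printf '%s\n' \
  'aardvark 555-5553 1200/300 B' \
  'alpo-net 555-3412 2400/1200/300 A' \
  'bites 555-1675 2400/1200/300 A' \
  'camelot 555-0542 300 C' \
  'fooey 555-1234 2400/1200/300 B' > bbs-sample.txt

# Regexp AND comparison: records containing 2400 whose fourth field is A.
awk '/2400/ && $4 == "A" { print $1 }' bbs-sample.txt

# Negated regexp AND a parenthesized OR of two comparisons.
awk '! /foo/ && ($4 == "B" || $4 == "C") { print $1 }' bbs-sample.txt
```

The first command prints alpo-net and bites; the second prints aardvark and camelot.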

A regexp constant as a pattern is also a special case of an expression pattern. /foo/ as an expression has the value one if `foo' appears in the current input record; thus, as a pattern, /foo/ matches any record containing `foo'.
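
The value of a regexp constant can be made visible by storing it in a variable. In this sketch the names `hit' and `count' and the file name `bbs-sample.txt' are invented for the illustration; the records are copied from the `BBS-list' output shown above.

```shell
# Three records taken from the BBS-list output shown in this chapter.
printf '%s\n' \
  'aardvark 555-5553 1200/300 B' \
  'fooey 555-1234 2400/1200/300 B' \
  'macfoo 555-6480 1200/300 A' > bbs-sample.txt

# As an expression, /foo/ is 1 when the record contains foo and 0 otherwise,
# so the values can be printed and even added up.
awk '{ hit = /foo/; count += hit; print hit, $1 }
     END { print count, "records contain foo" }' bbs-sample.txt
```

The output is `0 aardvark', `1 fooey', `1 macfoo', and then `2 records contain foo'.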
