Finds lines in files that match patterns and then performs specified actions on them.
awk [ -F Ere ] [ -v Assignment ] ... { -f ProgramFile | 'Program' } [ [ File ... | Assignment ... ] ] ...
The awk command utilizes a set of user-supplied instructions to compare a set of files, one line at a time, to extended regular expressions supplied by the user. Then actions are performed upon any line that matches the extended regular expressions.
The pattern searching of the awk command is more general than that of the grep command, and it allows the user to perform multiple actions on input text lines. The awk command programming language requires no compiling, and allows the user to use variables, numeric functions, string functions, and logical operators.
The awk command takes two types of input: input text files and program instructions.
Searching and actions are performed on input text files. The files are specified by:
If multiple files are specified with the File variable, the files are processed in the order specified.
Instructions provided by the user control the actions of the awk command. These instructions come from either the `Program' variable on the command line or from a file specified by the -f flag together with the ProgramFile variable. If multiple program files are specified, the files are concatenated in the order specified and the resultant order of instructions is used.
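As a minimal sketch of supplying instructions from a program file, the following writes a two-line program to a temporary file and runs it with the -f flag, passing a value with the -v flag. The variable names prog, scale, and total, and the sample input, are invented for the illustration.

```shell
# Write a small awk program to a temporary file (hypothetical contents).
prog=$(mktemp)
cat > "$prog" <<'EOF'
{ total += $1 * scale }
END { print total }
EOF

# -v assigns scale before the program starts; -f names the program file.
scaled=$(printf '1\n2\n3\n' | awk -v scale=10 -f "$prog")
rm -f "$prog"
printf '%s\n' "$scaled"
```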
The awk command produces three types of output from the data within the input text file:
All of these types of output can be performed on the same file. The programming language recognized by the awk command allows the user to redirect output.
Files are processed in the following way:
The BEGIN statement in the awk programming language allows the user to specify a set of instructions to be done before the first record is read. This is particularly useful for initializing special variables.
A record is a set of data separated by a record separator. The default value for the record separator is the new-line character, which makes each line in the file a separate record. The record separator can be changed by setting the RS special variable.
The command instructions can specify that a specific field within the record be compared. By default, fields are separated by white space (blanks or tabs). Each field is referred to by a field variable. The first field in a record is assigned the $1 variable, the second field is assigned the $2 variable, and so forth. The entire record is assigned to the $0 variable. The field separator can be changed by using the -F flag on the command line or by setting the FS special variable. The FS special variable can be set to the values of: blank, single character, or extended regular expression.
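The three ways of choosing a field separator can be sketched as follows. The input lines are invented for the example; only the -F flag and the FS special variable come from the text above.

```shell
# Default separator: runs of blanks or tabs delimit fields.
default_out=$(printf 'alpha   beta\tgamma\n' | awk '{ print $2 }')

# -F sets a single-character separator on the command line.
flag_out=$(printf 'root:x:0:0\n' | awk -F: '{ print $1, $3 }')

# FS can also be an extended regular expression (a comma, or blanks and tabs).
ere_out=$(printf 'a,b c\n' | awk 'BEGIN { FS = ",|[ \t]+" } { print $3, NF }')

printf '%s\n%s\n%s\n' "$default_out" "$flag_out" "$ere_out"
```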
The END statement in the awk programming language allows the user to specify actions to be performed after the last record is read. This is particularly useful for sending messages about what work was accomplished by the awk command.
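The processing order described above (BEGIN, then one pass per record, then END) might look like this in practice. This is only a sketch: the semicolon record separator and the variable n are choices made for the example.

```shell
summary=$(printf 'a b;c d;e f' | awk '
    BEGIN { RS = ";" }                          # runs before the first record is read
    { n += NF }                                 # runs once for each record
    END { print NR, "records,", n, "fields" }   # runs after the last record
')
printf '%s\n' "$summary"
```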
The awk command programming language consists of statements in the form:
Pattern { Action }
If a record matches the specified pattern, or contains a field which matches the pattern, the associated action is then performed. A pattern can be specified without an action, in which case the entire line containing the pattern is written to standard output. An action specified without a pattern is performed for every input record.
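The three statement shapes can be compared side by side. The sketch below uses an invented three-line input held in a shell variable named input.

```shell
input='apple 3
banana 12
cherry 7'

# Pattern with no action: matching records are written whole.
pattern_only=$(printf '%s\n' "$input" | awk '/an/')

# Action with no pattern: the action runs for every record.
action_only=$(printf '%s\n' "$input" | awk '{ print $1 }')

# Pattern and action: the action runs only for matching records.
both=$(printf '%s\n' "$input" | awk '$2 > 5 { print $1 }')
```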
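There are four types of patterns used in the awk command language syntax: extended regular expressions, relational expressions, combinations of patterns, and BEGIN and END patterns.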
The extended regular expressions used by the awk command are similar to those used by the grep or egrep command. The simplest form of an extended regular expression is a string of characters enclosed in slashes. For an example, suppose a file named testfile had the following contents:
smawley, andy
smiley, allen
smith, alan
smithern, harry
smithhern, anne
smitters, alexis
Entering the following command line:
awk '/smi/' testfile
prints to standard output all records that contain an occurrence of the string smi. In this example, the program '/smi/' for the awk command is a pattern with no action. The output is:
smiley, allen
smith, alan
smithern, harry
smithhern, anne
smitters, alexis
The following special characters are used to form extended regular expressions:
The awk command recognizes most of the escape sequences used in C language conventions, as well as several that are used as special characters by the awk command itself. The escape sequences are:
Note: Except in the gsub, match, split, and sub built-in functions, the matching of extended regular expressions is based on input records. Record-separator characters (the new-line character by default) cannot be embedded in the expression, and no expression matches the record-separator character. If the record separator is not the new-line character, then the new-line character can be matched. In the four built-in functions specified, matching is based on text strings, and any character (including the record separator) can be embedded in the pattern so that the pattern matches the appropriate character. However, in all regular-expression matching with the awk command, the use of one or more NULL characters in the pattern produces undefined results.
The relational operators < (less than), > (greater than), <= (less than or equal to), >= (greater than or equal to), == (equal to), and != (not equal to) can be used to form patterns. For example, the pattern:
$1 < $4
matches records where the first field is less than the fourth field. The relational operators also work with string values. For example:
$1 != "q"
matches all records where the first field is not a q. String values can also be matched on collation values. For example:
$1 >= "d"
matches all records where the first field collates at or after the string d, such as fields that begin with d, e, or f. If no other information is given, field variables are compared as string values.
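The relational patterns above can be tried on small inputs. This is a sketch with invented data; it assumes a POSIX-style awk, where fields that look like numbers are compared numerically with each other, and lowercase ASCII input so that collation order is unsurprising.

```shell
# Both fields look like numbers, so 10 < 9 is false on the second record.
lt=$(printf '1 x y 10\n10 x y 9\n' | awk '$1 < $4 { print NR }')

# String inequality against a constant.
ne=$(printf 'q\nr\n' | awk '$1 != "q"')

# String ordering: records whose first field collates at or after "d".
ge=$(printf 'apple\ndate\nfig\n' | awk '$1 >= "d"')
```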
Patterns can be combined using three options:
/begin/,/end/ matches the record containing the string begin, and every record between it and the record containing the string end, including the record containing the string end.
$1 == "al" && $2 == "123" matches records where the first field is al and the second field is 123.
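Both combinations can be exercised on a few invented records, as in this short sketch:

```shell
# Range pattern: from the record matching /begin/ through the record matching /end/.
range=$(printf 'x\nbegin\ny\nend\nz\n' | awk '/begin/,/end/')

# Boolean AND of two relational expressions.
both=$(printf 'al 123\nal 456\nbo 123\n' | awk '$1 == "al" && $2 == "123"')
```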
Actions specified with the BEGIN pattern are performed before any input is read. Actions specified with the END pattern are performed after all input has been read. Multiple BEGIN and END patterns are allowed and processed in the order specified. An END pattern can precede a BEGIN pattern within the program statements. If a program consists only of BEGIN statements, the actions are performed and no input is read. If a program consists only of END statements, all the input is read prior to any actions being taken.
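The ordering rules for BEGIN and END can be checked with a short sketch. The messages printed are placeholders chosen for the example.

```shell
# A program with only BEGIN statements runs its actions and reads no input.
begin_only=$(awk 'BEGIN { print "no input needed" }')

# BEGIN actions run in the order written, even when an END pattern
# appears earlier in the program text.
order=$(printf 'r1\nr2\n' | awk '
    END   { print "end:", NR }
    BEGIN { printf "first " }
    BEGIN { print "second" }
')
```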
There are several types of action statements:
Action statements are enclosed in { } (braces). If the statements are specified without a pattern, they are performed on every record. Multiple actions can be specified within the braces, but must be separated by new-line characters or ; (semicolons), and the statements are processed in the order they appear. Action statements include:
Unary Statements | The unary - (minus) and unary + (plus) operate as in the C programming language, with the form: +Expression or -Expression |
Assignment Statements | The assignment operators += (addition), -= (subtraction), /= (division), and *= (multiplication) operate as in the C programming language, with the form: Variable += Expression, Variable -= Expression, Variable /= Expression, Variable *= Expression. For example, the statement $1 *= $2 multiplies the field variable $1 by the field variable $2 and then assigns the new value to $1. The assignment operators ^= (exponentiation) and %= (modulus) have the form: Variable1 ^= Expression1 and Variable2 %= Expression2. They are equivalent to the C programming language statements Variable1 = pow(Variable1, Expression1) and Variable2 = fmod(Variable2, Expression2), where pow is the pow subroutine and fmod is the fmod subroutine. |
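A brief sketch of the unary and assignment operators at work follows. The variables x, y, and z and the input record are invented for the example.

```shell
result=$(printf '3 4\n' | awk '{
    $1 *= $2          # $1 becomes 12 and $0 is rebuilt as "12 4"
    x = 2; x ^= 10    # exponentiation: 1024
    y = 17; y %= 5    # modulus: 2
    z = -x            # unary minus
    print $0, x, y, z
}')
printf '%s\n' "$result"
```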
The awk command language uses arithmetic functions, string functions, and general functions. The close Subroutine statement is necessary if you intend to write a file, then read it later in the same program.
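To illustrate why close matters, the sketch below writes a temporary file, closes it, and reads it back within one program, then reads one record from a command. The names f, line, n, cmd, and greeting are illustrative only.

```shell
tmp=$(mktemp)
readback=$(awk -v f="$tmp" 'BEGIN {
    print "first line" > f            # opens f for writing
    print "second line" > f
    close(f)                          # close before reading the same file
    while ((getline line < f) > 0)    # 1 on success, 0 at end of file
        n++
    close(f)
    print n, line

    cmd = "echo piped"
    cmd | getline greeting            # read one record from a command
    close(cmd)
    print greeting
}')
rm -f "$tmp"
printf '%s\n' "$readback"
```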
The following arithmetic functions perform the same actions as the C language subroutines by the same name:
gsub( Ere, Repl, [ In ] ) | Performs exactly as the sub function, except that all occurrences of the regular expression are replaced. |
sub( Ere, Repl, [ In ] ) | Replaces the first occurrence of the extended regular expression specified by the Ere parameter in the string specified by the In parameter with the string specified by the Repl parameter. The sub function returns the number of substitutions. An & (ampersand) appearing in the string specified by the Repl parameter is replaced by the string in the In parameter that matches the extended regular expression specified by the Ere parameter. If no In parameter is specified, the default value is the entire record ( the $0 record variable). |
index( String1, String2 ) | Returns the position, numbering from 1, within the string specified by the String1 parameter where the string specified by the String2 parameter occurs. If the String2 parameter does not occur in the String1 parameter, a 0 (zero) is returned. |
length [(String)] | Returns the length, in characters, of the string specified by the String parameter. If no String parameter is given, the length of the entire record (the $0 record variable) is returned. |
blength [(String)] | Returns the length, in bytes, of the string specified by the String parameter. If no String parameter is given, the length of the entire record (the $0 record variable) is returned. |
substr( String, M, [ N ] ) | Returns a substring with the number of characters specified by the N parameter. The substring is taken from the string specified by the String parameter, starting with the character in the position specified by the M parameter. The M parameter is specified with the first character in the String parameter as number 1. If the N parameter is not specified, the length of the substring will be from the position specified by the M parameter until the end of the String parameter. |
match( String, Ere ) | Returns the position, in characters, numbering from 1, in the string specified by the String parameter where the extended regular expression specified by the Ere parameter occurs, or else returns a 0 (zero) if the Ere parameter does not occur. The RSTART special variable is set to the return value. The RLENGTH special variable is set to the length of the matched string, or to -1 (negative one) if no match is found. |
split( String, A, [Ere] ) | Splits the string specified by the String parameter into array elements A[1], A[2], . . ., A[n], and returns the value of the n variable. The separation is done with the extended regular expression specified by the Ere parameter or with the current field separator (the FS special variable) if the Ere parameter is not given. The elements in the A array are created with string values, unless context indicates a particular element should also have a numeric value. |
tolower( String ) | Returns the string specified by the String parameter, with each uppercase character in the string changed to lowercase. The uppercase and lowercase mapping is defined by the LC_CTYPE category of the current locale. |
toupper( String ) | Returns the string specified by the String parameter, with each lowercase character in the string changed to uppercase. The uppercase and lowercase mapping is defined by the LC_CTYPE category of the current locale. |
sprintf(Format, Expr, Expr, . . . ) | Formats the expressions specified by the Expr parameters according to the printf subroutine format string specified by the Format parameter and returns the resulting string. |
close( Expression ) | Closes the file or pipe opened by a print or printf statement or a call to the getline function with the same string-valued Expression parameter. If the file or pipe is successfully closed, a 0 is returned; otherwise a non-zero value is returned. The close statement is necessary if you intend to write a file, then read the file later in the same program. |
system(Command ) | Executes the command specified by the Command parameter and returns its exit status. Equivalent to the system subroutine. |
Expression | getline [ Variable ] | Reads a record of input from a stream piped from the output of a command specified by the Expression parameter and assigns the value of the record to the variable specified by the Variable parameter. The stream is created if no stream is currently open with the value of the Expression parameter as its command name. The stream created is equivalent to one created by a call to the popen subroutine with the Command parameter taking the value of the Expression parameter and the Mode parameter set to a value of r. Each subsequent call to the getline function reads another record, as long as the stream remains open and the Expression parameter evaluates to the same string. If a Variable parameter is not specified, the $0 record variable and the NF special variable are set to the record read from the stream. |
getline [ Variable ] < Expression | Reads the next record of input from the file named by the Expression parameter and sets the variable specified by the Variable parameter to the value of the record. Each subsequent call to the getline function reads another record, as long as the stream remains open and the Expression parameter evaluates to the same string. If a Variable parameter is not specified, the $0 record variable and the NF special variable are set to the record read from the stream. |
getline [ Variable ] | Sets the variable specified by the Variable parameter to the next record of input from the current input file. If no Variable parameter is specified, $0 record variable is set to the value of the record, and the NF, NR, and FNR special variables are also set. |
Note: All forms of the getline function return 1 for successful input, zero for end of file, and -1 for an error.
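Several of the string functions described above can be tried in a single BEGIN action. This is a sketch on an invented string; the variable names s, t, k, n, and parts are not part of the awk language.

```shell
out=$(awk 'BEGIN {
    s = "smith, alan"
    print index(s, "alan")              # position of a substring
    print substr(s, 1, 5)               # five characters starting at position 1
    print match(s, /a+l/), RSTART, RLENGTH
    n = split(s, parts, /, */)          # split on a comma and optional blanks
    print n, parts[2]
    t = s
    k = sub(/a/, "[&]", t)              # first occurrence only; & is the matched text
    print k, t
    k = gsub(/a/, "[&]", s)             # every occurrence
    print k, s
}')
printf '%s\n' "$out"
```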
User-defined functions are declared in the following form:
function Name (Parameter, Parameter,...) { Statements }
A function can be referred to anywhere in an awk command program, and its use can precede its definition. The scope of the function is global.
Function parameters can be either scalars or arrays. Parameter names are local to the function; all other variable names are global. The same name should not be used for different entities; for example, a parameter name should not be duplicated as a function name, or special variable. Variables with global scope should not share the name of a function. Scalars and arrays should not have the same name in the same scope.
The number of parameters in the function definition does not have to match the number of parameters used when the function is called. Excess formal parameters can be used as local variables. Extra scalar parameters are initialized with a string value equivalent to the empty string and a numeric value of 0 (zero); extra array parameters are initialized as empty arrays.
When invoking a function, no white space is placed between the function name and the opening parenthesis. Function calls can be nested and recursive. Upon return from any nested or recursive function call, the values of all the calling function's parameters shall be unchanged, except for array parameters passed by reference. The return statement can be used to return a value.
Within a function definition, the new-line characters are optional before the opening { (brace) and after the closing } (brace).
An example of a function definition is:
function average ( g,n)
{
        for (i in g)
           sum=sum+g[i]
        avg=sum/n
        return avg
}
The function average is passed an array, g, and a variable, n, with the number of elements in the array. The function then obtains an average and returns it.
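One possible variant of this function, not the form given above, declares i and sum as extra formal parameters so that they behave as local variables and repeated calls do not accumulate into a global sum. The array a and its sample values are assumptions made for the sketch.

```shell
avg=$(awk '
# Extra formal parameters (i, sum) act as local variables; callers omit them.
function average(g, n,    i, sum) {
    for (i in g)
        sum += g[i]
    return sum / n
}
BEGIN {
    split("2 4 9", a)             # a[1]=2, a[2]=4, a[3]=9
    print average(a, 3)
    print average(a, 3)           # same result: sum does not carry over
}')
printf '%s\n' "$avg"
```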
Most conditional statements in the awk command programming language have the same syntax and function as conditional statements in the C programming language. All of the conditional statements allow the use of { } (braces) to group together statements. An optional new-line can be used between the expression portion and the statement portion of the conditional statement, and new-lines or ; (semicolon) are used to separate multiple statements in { } (braces). Six conditional statements in C language are:
Five conditional statements in the awk command programming language that do not follow C-language rules are:
Two output statements in the awk command programming language are:
print | Requires the following syntax: print [ ExpressionList ] [ Redirection ] [ Expression ] The print statement writes the value of each expression specified by the ExpressionList parameter to standard output. Each expression is separated by the current value of the OFS special variable, and each record is terminated by the current value of the ORS special variable. The output can be redirected using the Redirection parameter, which can specify the three output redirections with the > (greater than), >> (double greater than), and the | (pipe). The Redirection parameter specifies how the output is redirected, and the Expression parameter is either a path name to a file (when the Redirection parameter is > or >>) or the name of a command (when the Redirection parameter is a |). |
printf | Requires the following syntax: printf Format [ , ExpressionList ] [ Redirection ] [ Expression ] The printf statement writes to standard output the expressions specified by the ExpressionList parameter in the format specified by the Format parameter. The printf statement functions exactly like the printf command, except for the c conversion specification (%c). The Redirection and Expression parameters function the same as in the print statement. For the c conversion specification: if the argument has a numeric value, the character whose encoding is that value will be output. If the value is zero or is not the encoding of any character in the character set, the behavior is undefined. If the argument does not have a numeric value, the first character of the string value will be output; if the string does not contain any characters, the behavior is undefined. |
Note: If the Expression parameter specifies a path name for the Redirection parameter, the Expression parameter should be enclosed in double quotes to ensure that it is treated as a string.
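The output statements and their redirections might be used as in the following sketch. The directory variable dir, the file name pairs.txt, and the use of the sort command are assumptions made for the illustration.

```shell
outdir=$(mktemp -d)
piped=$(printf 'b 2\na 1\n' | awk -v dir="$outdir" '
    BEGIN { OFS = "-" }
    { print $1, $2 > (dir "/pairs.txt") }       # file redirection; path is a string expression
    { printf "%-3s|%03d\n", $1, $2 | "sort" }   # pipe formatted output to a command
    END { close("sort") }
')
saved=$(cat "$outdir/pairs.txt")
rm -rf "$outdir"

# %c prints the character for a numeric code, or the first character of a string.
charout=$(awk 'BEGIN { printf "%c%c\n", 65, "hello" }')
```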
Variables can be scalars, field variables, arrays, or special variables. Variable names cannot begin with a digit.
Variables can be used just by referencing them. With the exception of function parameters, they are not explicitly declared. Uninitialized scalar variables and array elements have both a numeric value of 0 (zero) and a string value of the null string ("").
Variables take on numeric or string values according to context. Each variable can have a numeric value, a string value, or both. For example:
x = "4" + "8"
assigns the value of 12 to the variable x. String constants should be enclosed in " " (double quotation marks).
There are no explicit conversions between numbers and strings. To force an expression to be treated as a number, add 0 (zero) to it. To force an expression to be treated as a string, append a null string ("").
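The two conversion idioms can be seen in a short sketch. The variable names are arbitrary, and the comparison results assume POSIX-style awk comparison rules, under which a string constant compared with a number is compared as a string.

```shell
conv=$(awk 'BEGIN {
    x = "4" + "8"                 # numeric context: 12
    s = 4 "" 8                    # string context: concatenation gives 48
    print x, s

    n = "3.0"                     # a string constant
    a = (n + 0 == 3)              # adding 0 forces a numeric comparison: 1
    b = (n == 3)                  # string comparison of "3.0" and "3": 0
    print a, b

    c = (10 < 9)                  # numeric comparison: 0
    d = (10 "" < 9 "")            # appending "" forces string comparison: 1
    print c, d
}')
printf '%s\n' "$conv"
```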
Field variables are designated by a $ (dollar sign) followed by a number or numerical expression. The first field in a record is assigned to the $1 variable, the second field is assigned to the $2 variable, and so forth. The entire record is assigned to the $0 field variable. New field variables can be created by assigning a value to them. Assigning a value to a non-existent field, that is, any field larger than the current value of the NF special variable, forces the creation of any intervening fields (set to the null string), increases the value of the NF special variable, and forces the value of the $0 record variable to be recalculated. When $0 is recalculated, the fields are joined with the output field separator (the value of the OFS special variable). Blanks and tabs are the default input field separators. To change the input field separator, use the -F flag, or assign the FS special variable a different value in the awk command program.
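Assigning to a field beyond the last one can be observed directly. The sketch below uses the default separators and an invented three-field record.

```shell
fields=$(printf 'a b c\n' | awk '{
    $5 = "e"                      # creates an empty $4 as well as $5
    print NF                      # NF is now 5
    print "[" $4 "]"              # the intervening field is the null string
    print $0                      # the record is rebuilt with the new fields
}')
printf '%s\n' "$fields"
```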
Arrays are initially empty and their sizes change dynamically. Arrays are represented by a variable with subscripts in [ ] (square brackets). The subscripts, or element identifiers, can be numbers or strings, which provide a type of associative array capability. For example, the program:
/red/ { x["red"]++ }
/green/ { y["green"]++ }
increments counts for both the red counter and the green counter.
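Run against a few invented input lines, with an END action added to report the totals, the counters behave as follows. This is a sketch; the END action is not part of the program shown above.

```shell
counts=$(printf 'red car\ngreen leaf\nred green flag\n' | awk '
    /red/   { x["red"]++ }
    /green/ { y["green"]++ }
    END     { print x["red"], y["green"] }
')
printf '%s\n' "$counts"
```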
Arrays can be indexed with more than one subscript, similar to multidimensional arrays in some programming languages. Because programming arrays for the awk command are really one dimensional, the comma-separated subscripts are converted to a single string by concatenating the string values of the separate expressions, with each expression separated by the value of the SUBSEP special variable. Therefore, the following two index operations are equivalent:
x[expr1, expr2,...exprn]
AND
x[expr1SUBSEPexpr2SUBSEP...SUBSEPexprn]
When using the in operator, a multidimensional Index value should be contained within parentheses. Except for the in operator, any reference to a nonexistent array element automatically creates that element.
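A small sketch of multiple subscripts and the in operator follows. The array name grid and the helper variables are invented for the example.

```shell
multi=$(awk 'BEGIN {
    grid[1, 2] = "x"
    a = ((1, 2) in grid)              # parenthesized index list with in
    b = ((1 SUBSEP 2) in grid)        # the same key built by hand
    c = ((2, 1) in grid)              # a membership test does not create the element
    print a, b, c

    for (k in grid) {                 # recover the subscripts from the combined key
        split(k, part, SUBSEP)
        print part[1], part[2], grid[k]
    }

    if (grid[9, 9] == "") n = 0       # a plain reference creates grid[9, 9]
    for (k in grid) n++
    print n
}')
printf '%s\n' "$multi"
```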
The following variables have special meaning for the awk command:
ARGC | The number of elements in the ARGV array. This value can be altered. |
ARGV | The array with each member containing one of the File variables or Assignment variables, taken in order from the command line, and numbered from 0 (zero) to ARGC -1. As each input file is finished, the next member of the ARGV array provides the name of the next input file, unless:
|
CONVFMT | The printf format for converting numbers to strings (except for output statements, where the OFMT special variable is used). The default is "%.6g". |
ENVIRON | An array representing the environment under which the awk command operates. Each element of the array is of the form:
ENVIRON [ "Environment VariableName" ] = EnvironmentVariableValue The values are set when the awk command begins execution, and that environment is used until the end of execution, regardless of any modification of the ENVIRON special variable. |
FILENAME | The path name of the current input file. During the execution of a BEGIN action, the value of FILENAME is undefined. During the execution of an END action, the value is the name of the last input file processed. |
FNR | The number of the current input record in the current file. |
FS | The input field separator. The default value is a blank. If the input field separator is a blank, any number of locale-defined spaces can separate fields. The FS special variable can take two additional values:
|
NF | The number of fields in the current record, with a limit of 99. Inside a BEGIN action, the NF special variable is undefined unless a getline function without a Variable parameter has been issued previously. Inside an END action, the NF special variable retains the value it had for the last record read, unless a subsequent, redirected, getline function without a Variable parameter is issued prior to entering the END action. |
NR | The number of the current input record. Inside a BEGIN action the value of the NR special variable is 0 (zero). Inside an END action, the value is the number of the last record processed. |
OFMT | The printf format for converting numbers to strings in output statements. The default is "%.6g". |
OFS | The output field separator (default is a space). |
ORS | The output record separator (default is a new-line character). |
RLENGTH | The length of the string matched by the match function. |
RS | Input record separator (default is a new-line character). If the RS special variable is null, records are separated by sequences of one or more blank lines; leading or trailing blank lines do not result in empty records at the beginning or end of input; and the new-line character is always a field separator, regardless of the value of the FS special variable. |
RSTART | The starting position of the string matched by the match function, numbering from 1. Equivalent to the return value of the match function. |
SUBSEP | Separates multiple subscripts. The default is \031. |
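A few of these special variables can be observed with two small files and a null RS value. This is only a sketch; the file names one.txt and two.txt and their contents are invented.

```shell
dir=$(mktemp -d)
printf 'a\nb\n' > "$dir/one.txt"
printf 'c\n' > "$dir/two.txt"

# NR counts records across all files; FNR restarts for each file.
counters=$(cd "$dir" && awk '{ print FILENAME, NR, FNR }' one.txt two.txt)

# A null RS separates records at blank lines, and the new-line
# character then also separates fields.
paragraphs=$(printf 'x y\nz\n\n\nw\n' | awk 'BEGIN { RS = "" } { print NR ": " NF }')
rm -rf "$dir"
```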
This command returns the following exit values:
0 | Successful completion. |
>0 | An error occurred. |
You can alter the exit status within the program by using the exit [ Expression ] conditional statement.
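As a sketch of setting the exit status from within a program, the following stops at the first record containing ERROR and reports failure to the shell. The variable bad and the input text are assumptions; in POSIX-style implementations, an exit statement outside an END action still runs the END actions before the program ends.

```shell
status=0
printf 'ok\nERROR disk full\nok\n' | awk '
    /ERROR/ { bad = 1; exit }     # stop reading input; END actions still run
    END     { exit bad }          # the expression becomes the exit status
' || status=$?
printf 'awk exited with status %s\n' "$status"
```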
awk 'length >72' chapter1
This selects each line of the chapter1 file that is longer than 72 characters and writes these lines to standard output, because no Action is specified. A tab character is counted as 1 byte.
awk '/start/,/stop/' chapter1
awk -f sum2.awk chapter1
The following program, sum2.awk, computes the sum and average of the numbers in the second column of the input file, chapter1:
{
    sum += $2
}
END {
    print "Sum: ", sum;
    print "Average:", sum/NR;
}
The first action adds the value of the second field of each line to the variable sum. All variables are initialized to the numeric value of 0 (zero) when first referenced. The pattern END before the second action causes those actions to be performed after all of the input file has been read. The NR special variable, which is used to calculate the average, specifies the number of records that have been read.
awk '{ print $2, $1 }' chapter1
BEGIN {FS = ",|[ \t]+"}
      {print $1, $2}
      {s += $1}
END   {print "sum is",s,"average is", s/NR }
Commands: egrep, fgrep, grep, lex, printf, sed.
Subroutines: popen, printf, system.
Books: Aho, A.V., Kernighan, B.W., and Weinberger, P.J. The Awk Programming Language. Bell Telephone Laboratories, Incorporated, 1988.