nawk(1) General Commands Manual nawk(1)
Name
nawk - data transformation, report generation language
Syntax
nawk [ -f programfile ] [ -Fs ] [ program ] [ var=value... ] [ file ... ]
Description
The nawk language is a file-processing language which is well suited to data manipulation and retrieval of information from text files. This
reference page provides a full technical description of nawk. If you are unfamiliar with the language, you will probably find it helpful to read
the Guide to the nawk Utility before reading the following material.
A program consists of any number of user-defined functions and `rules' of the form:
pattern {action}
There are two ways to specify the program:
(a) Directly on the command line. In this case, the program is a single command line argument, usually enclosed in apostrophes.
(b) By using the -f programfile option (where programfile contains the program). More than one -f option can appear on the command line.
The program will consist of the concatenation of the contents of all the specified programfiles. You can use - in place of a file
name, to obtain input from the standard input.
The input data manipulated by the program is provided in files specified on the command line. If no such files are specified, data is read
from the standard input. You can also specify a file name of - to mean the standard input.
Input to nawk is divided into records. By default, records are separated by new-line characters; however, you can specify a different record
separator if you wish.
One at a time, and in order, each input record is compared with the pattern of every `rule' in the program. When a pattern matches, the
action part of the rule is performed on the current input record. Patterns and actions often refer to separate fields within a record. By
default, fields are separated by white space (blanks, new-lines, or horizontal tab characters); however, you can specify a different field
separator string using the -Fs option (see Input).
You can omit the pattern or action part of a rule (but not both). If pattern is omitted, the action is performed on every input record (as
if every record matches). If action is omitted, every record matching the pattern will be written to the standard output.
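The two abbreviated rule forms can be sketched as follows. The input lines are made up for illustration, and awk is assumed to stand in for nawk on the local system.

```shell
# Hypothetical three-line input; awk is assumed to stand in for nawk.
input='apple 3
banana 7
cherry 5'

# Pattern only: records whose second field exceeds 4 are written unchanged.
pattern_only=$(printf '%s\n' "$input" | awk '$2 > 4')

# Action only: the action is performed on every record.
action_only=$(printf '%s\n' "$input" | awk '{ print $1 }')

printf '%s\n' "$pattern_only" "$action_only"
```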
If a line in a program contains a `#' character, the `#' and everything after it is considered to be a comment.
Program lines can be continued by adding a backslash `\' to the end of the line. Statement lines ending with a comma `,', double or-bars
`||', or double ampersands `&&', are automatically continued.
Options
-f programfile
Tells nawk to obtain its program from the specified file. There can be more than one of these on the command line.
-Fs Says that s is the field separator character within records.
Variables and Expressions
There are three types of variables in nawk: identifiers, fields, and array elements.
An identifier is a sequence of letters, digits, and underscores beginning with a letter or an underscore.
Fields are described in the Input subsection.
Arrays are associative collections of values called the elements of the array. Array elements are referenced with constructs of the form
identifier[subscript]
where subscript has the form expr or expr,expr,... Each such expr can have any string value. Arrays with multiple expr subscripts are
implemented by concatenating the string values of each expr with a separator character SUBSEP separating multiple expr. The initial value
of SUBSEP is set to `\034' (ASCII field separator).
Fields and identifiers are sometimes called scalar variables to distinguish them from arrays.
Variables are not declared and need not be initialized. The value of an uninitialized variable is the empty string. Variables can be ini-
tialized on the command line using
var=value
Such initializations can be interspersed with the names of input files on the command line. Initializations and input files will be pro-
cessed in the order they appear on the command line. For example, the command
nawk -f progfile A=1 f1 f2 A=2 f3
sets A to 1 before input is read from f1 and sets A to 2 before input is read from f3.
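A small sketch of this ordering, using two throwaway files whose names and contents are invented for the illustration (awk is assumed to stand in for nawk):

```shell
# Two scratch input files (hypothetical); awk is assumed to stand in for nawk.
dir=$(mktemp -d)
echo one > "$dir/f1"
echo two > "$dir/f2"

# A is 1 while f1 is being read and 2 while f2 is being read.
assigned=$(awk '{ print A, $0 }' A=1 "$dir/f1" A=2 "$dir/f2")
echo "$assigned"

rm -r "$dir"
```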
Certain built-in variables have special meaning to nawk, as described in later sections.
Expressions consist of constants, variables, functions, regular expressions and `subscript in array' conditions (see below) combined with
operators. Each variable and expression has a string value and a corresponding numeric value; the value appropriate to the context is
used. If a string is used in a numeric context, and the contents of the string cannot be interpreted as a number, the `value' of the
string is taken to be zero.
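The dual string and numeric value of a variable can be seen in a short sketch like the following (the variable names are illustrative only, and awk is assumed to stand in for nawk):

```shell
# awk is assumed to stand in for nawk.
numeric=$(awk 'BEGIN {
    s = "10"          # a string that can be read as a number
    t = "abc"         # a string that cannot
    print s + 5       # numeric context: 15
    print t + 5       # the string counts as zero: 5
    print s t         # string context (concatenation): 10abc
}')
echo "$numeric"
```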
Numeric constants are sequences of decimal digits.
String constants are quoted, as in "x". Escape sequences accepted in literal strings are:
Escape    ASCII Character
-------------------------------
\a        audible bell
\b        backspace
\f        formfeed
\n        new-line
\r        carriage return
\t        horizontal tab
\v        vertical tab
\ooo      octal value ooo
\xdd      hexadecimal value dd
\"        quotation mark
\c        any other character c
The regular expression syntax understood by nawk is that of the extended regular expressions of the egrep utility described in grep(1). Characters enclosed in slash
characters `/' are compiled as regular expressions when the program is read. In addition, literal strings and variables are interpreted as
dynamic regular expressions on the right side of a `~' or `!~' operator, or as certain arguments to built-in matching and substitution
functions. Note that when literal strings are used as regular expressions, extra backslashes are needed to escape regular expression
metacharacters because the backslash is also the literal string escape character.
The `subscript in array' condition is defined as:
index in array
where index looks like expr or (expr,...,expr). This condition evaluates to 1 if the string value of index is a subscript of array, and to
0 otherwise. This is a way to determine if an array element exists. If the element does not exist, this condition will not create it.
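As a sketch of the difference between testing for a subscript and referencing an element (the array name seen is hypothetical, and awk is assumed to stand in for nawk): the in test leaves the array empty, whereas reading the element directly creates it.

```shell
# awk is assumed to stand in for nawk.
existence=$(awk 'BEGIN {
    if (!("k" in seen)) print "k is absent"
    n = 0; for (i in seen) n++
    print "elements after in-test:", n        # still 0

    if (seen["k"] == "") print "k reads as empty"
    n = 0; for (i in seen) n++
    print "elements after reference:", n      # now 1
}')
echo "$existence"
```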
Symbol Table
The symbol table can be accessed through the built-in array SYMTAB.
SYMTAB[expr]
is equivalent to the variable named by the evaluation of expr. For example,
SYMTAB["var"]
is a synonym for the variable var.
Environment
A program can determine its initial environment by examining the ENVIRON array. If the environment consists of entries of the form:
name=value
then
ENVIRON[name]
has string value
"value"
For example, the following program prints every entry of the environment in name=value form:
BEGIN {
    for (i in ENVIRON)
        printf("%s=%s\n", i, ENVIRON[i])
    exit
}
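A single entry can be looked up in the same way. In this sketch the variable name GREETING is invented for the illustration, and awk is assumed to stand in for nawk:

```shell
# GREETING is a hypothetical environment variable set only for this command.
greeting=$(GREETING=hello awk 'BEGIN { print ENVIRON["GREETING"] }')
echo "$greeting"
```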
Operators
The usual precedence order of arithmetic operations is followed unless overridden with parentheses; a table giving the order of operations
appears at the end of the Guide to the nawk Utility. The unary operators are
- Negation
+ Nothing (place holder)
-- Decrement by one
++ Increment by one
where the `++' and `--' operators can be used as either postfix or prefix operators, as in C.
The binary arithmetic operators are
+ Addition
- Subtraction
* Multiplication
/ Division
% Modulus
^ Exponentiation
The conditional operator
expr ? expr1 : expr2
evaluates to expr1 if the value of expr is non-zero, and to expr2 otherwise.
If two expressions are not separated by an operator, their string values are concatenated.
The operator `~' yields 1 (true) if the regular expression on the right side matches the string on the left side. The operator `!~' yields
1 when the right side has no match on the left. To illustrate:
$2 ~ /[0-9]/
selects any line where the second field contains at least one digit. Any string or variable on the right side of `~' or `!~' is inter-
preted as a dynamic regular expression.
The relational operators are the usual `<', `<=', `>', `>=', `==', and `!='.
The boolean operators are `||' (or), `&&' (and), and `!' (not).
Values can be assigned to a variable with
var = expr
If op is a binary arithmetic operator,
var op= expr
is equivalent to
var = var op expr
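The operators described above can be tried out with a sketch such as this one (variable names are illustrative; awk is assumed to stand in for nawk):

```shell
# awk is assumed to stand in for nawk.
ops=$(awk 'BEGIN {
    x = 2
    x ^= 3                            # same as x = x ^ 3
    print x                           # 8
    x %= 5                            # same as x = x % 5
    print x                           # 3
    print (x > 2 ? "big" : "small")   # conditional operator: big
    print "item" x                    # concatenation: item3
    m1 = ("abc123" ~ /[0-9]/)         # match operator yields 1
    m2 = ("abc" ~ /[0-9]/)            # no digit, so 0
    print m1, m2
}')
echo "$ops"
```

The conditional expression is parenthesized so that `>' is not taken as output redirection by print.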
Command Line Arguments
The built-in variable ARGC is set to the number of command line arguments. The built-in array ARGV has elements subscripted with digits
from zero to ARGC-1, giving command line arguments in the order they appeared on the command line.
The ARGC count and the ARGV vector do not include command line options (beginning with `-') or the program file (following -f). They do include
the name of the command itself, initialization statements of the form
var=value
and the names of input data files.
The nawk language actually creates ARGC and ARGV before doing anything else. It then walks through ARGV processing the arguments. If an element
of ARGV is the empty string, it is simply skipped. If it contains an equals sign `=', it is interpreted as a variable assignment. If
it is a minus sign `-', it stands for the standard input and input is immediately read from the standard input until end-of-file is encoun-
tered. Otherwise, the argument is taken to be a file name; input will be read from that file until end-of-file is reached. Note that the
program is executed by `walking through' ARGV in this way; thus if the program changes ARGV, different files can be read and assignments
made.
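The contents of ARGC and ARGV can be inspected with a sketch like the one below. The file name data.txt is hypothetical and is never opened, because the program consists of a BEGIN action only; awk is assumed to stand in for nawk.

```shell
# ARGV[0] (the command name) is skipped here; awk is assumed to stand in for nawk.
args=$(awk 'BEGIN {
    print ARGC
    for (i = 1; i < ARGC; i++)
        print i, ARGV[i]
}' A=1 data.txt)
echo "$args"
```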
Input
Input is divided into records. Each record is separated from the next with a record separator character. The value of the built-in variable
RS gives the current record separator character; by default, it begins as the new-line `\n'. If you assign a different character to
RS, nawk will use that as the record separator character from that point on.
Records are divided into fields. Each field is separated from the next with a field separator string, given by the value of the built-in
variable FS. You can set a specific separator string by assigning a value to FS or by specifying the -Fs option on the command line. FS
can be assigned a regular expression. For example,
FS = "[,:$]"
says that fields can be separated by commas, colons, or dollar signs. As a special case, assigning FS a string containing only a blank
character sets the field separator to white space. In this case, any sequence of contiguous space and/or tab characters is considered a
single field separator. This is the default for FS. However, if FS is assigned a string containing any other character, that character
designates the start of a new field. For example, if we set
FS="\t"
(the tab character),
texta<TAB>textb<TAB> <TAB> <TAB>textc
(where <TAB> marks a single tab character)
contains five fields, two of which only contain blanks. With the default setting, the above would only contain three fields because the
sequence of multiple blanks and tabs would be considered a single separator.
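The effect of the different FS settings can be checked with a sketch such as the following. The sample lines are invented for the illustration, and awk is assumed to stand in for nawk.

```shell
# A line with four tabs; two of the tab-separated fields hold a single blank.
line=$(printf 'texta\ttextb\t \t \ttextc')

# Tab as the separator: every tab starts a new field.
tab_fields=$(printf '%s\n' "$line" | awk 'BEGIN { FS = "\t" } { print NF }')

# Default separator: runs of blanks and tabs count as one separator.
default_fields=$(printf '%s\n' "$line" | awk '{ print NF }')

# Regular expression separator: comma, colon, or dollar sign.
regex_fields=$(echo 'a,b:c$d' | awk 'BEGIN { FS = "[,:$]" } { print NF }')

echo "$tab_fields $default_fields $regex_fields"
```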
Various pieces of information about input are provided by the built-in variables listed below.
NF Number of fields in the current record
NR Number of records read so far
FILENAME Name of file containing current record
FNR Number of records read from current file
Field specifiers have the form $i where i runs from 1 through NF. Such a field specifier refers to the ith field of the current input
record. $0 (zero) refers to the entire current input record.
The getline function can read a value for a variable or $0 from the current input, from a file, or from a pipe. The result of getline is
an integer indicating whether the read operation was successful. A value of 1 indicates success; 0 indicates end-of-file encountered; and
-1 indicates that an error occurred. Possible forms for getline are:
getline
Reads next input record into $0 and splits the record into fields. NF, NR, and FNR are set appropriately.
getline var
Reads next input record into the variable var. The record is not split into fields (which means that the current $i values do not
change). NR and FNR are set appropriately.
getline <expr
Interprets the string value of expr to be a file name. The next record from that file is read into $0 and split into fields. NF is
set appropriately.
getline var <expr
Interprets the string value of expr to be a file name, and reads the next record from that file into the variable var. The record is
not split into fields.
expr | getline
Interprets the string value of expr as a command line to be executed. Output from this command is piped into getline, and read into
$0 in a manner similar to getline <expr. See the SYSTEM FUNCTION section for additional details.
expr | getline var
Executes the string value of expr as a command and pipes the output of the command into getline. The result is similar to getline var
<expr.
close(expr)
Only a limited number of files and pipes can be open at one time. This function will close open files or pipes. The expr must be one
that came before `|' or after `<' in getline, or after `>', `>>', or `|' in print or printf as described in the Output section. By
closing files and pipes that are no longer needed, you can use any number of files and pipes in the course of executing a program.
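A minimal sketch of reading from a pipe with getline and then closing it follows. The command string is made up for the illustration, and awk is assumed to stand in for nawk.

```shell
# awk is assumed to stand in for nawk.
readout=$(awk 'BEGIN {
    cmd = "echo first; echo second"
    while ((cmd | getline line) > 0)   # 1 on success, 0 at end-of-file
        print "read:", line
    close(cmd)                         # the same string that preceded the |
}')
echo "$readout"
```

Comparing the result with 0 keeps the loop from running forever if getline returns -1 on an error.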
Built-In Arithmetic Functions
int(expr)
Returns the integer part of the numeric value of expr. If (expr) is omitted, the integer part of $0 is returned.
exp(expr), log(expr), sqrt(expr)
Returns the exponential, natural logarithm, and square root of the numeric value of expr. If (expr) is omitted, $0 is used.
sin(expr), cos(expr)
Returns the sine and cosine of the numeric value of expr (interpreted as an angle in radians).
atan2(expr1, expr2)
Returns the arctangent of expr1/expr2 in the range of -pi through pi.
rand()
Returns a random floating-point number in the range 0 through 1.
srand(expr)
Sets the seed of the rand function to the integer value of expr. If (expr) is omitted, sets a default seed (which is the same each
time nawk is invoked).
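The arithmetic functions can be exercised with a short sketch like this one (awk is assumed to stand in for nawk; the seed value 1 is arbitrary):

```shell
# awk is assumed to stand in for nawk.
math=$(awk 'BEGIN {
    print int(3.9), int(-3.9)        # 3 -3
    print sqrt(16), exp(0), log(1)   # 4 1 0
    print atan2(0, -1)               # pi, printed through OFMT as 3.14159
    srand(1); a = rand()
    srand(1); b = rand()
    print (a == b)                   # same seed, same first value: 1
}')
echo "$math"
```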
Built-In String Functions
len = length(expr)
Returns the number of characters in the string value of expr. If (expr) is omitted, $0 is used.
n = split(string, array, regexp)
Splits the string into fields. The expression regexp is a regular expression giving the field separator string for the purposes of
this operation. The elements of array are assigned the separated fields in order; subscripts for array begin at 1. All other ele-
ments of array are discarded. The result of split is the number of fields into which string was divided (which is also the maximum
subscript for array). Note that regexp divides the record in the same way that the FS field separator string does. If regexp is
omitted in the call to split, the current value of FS will be used.
str = substr(string, m, len)
Returns the substring of string that begins in position m and is at most len characters long. The first character of the string has
m equal to one. If len is omitted, the rest of string is returned.
pos = index(s1, s2)
Returns the position of the first occurrence of string s2 in string s1; if s2 is not found in s1, index returns zero.
pos = match(string, regexp)
Searches string for the first substring matching the regular expression regexp, and returns an integer giving the position of this
substring. If no such substring is found, match returns zero. The built-in variable RSTART is set to pos and the built-in variable
RLENGTH is set to the length of the matched string. These are both set to zero if there is no match. The regexp can be enclosed in
slashes or given as a string.
n = gsub(regexp, repl, string)
Globally replaces every substring of string that matches the regular expression regexp with the string
repl. If string is omitted, the current record ($0) is used. The gsub function returns the number of substrings that were replaced,
or zero if no match occurred.
n = sub(regexp, repl, string)
Works like gsub except that at most one match and substitution is attempted.
str = sprintf(fmt, expr, expr...)
Formats the expression list expr, expr, ... using specifications from the string fmt, then returns the formatted string. The fmt
string consists of conversion specifications which convert and add the next expr to the string, and ordinary characters which are
simply added to the string. Conversion specifications have the form
%[-][x][.y]c
where
- left justifies the field
x is the minimum field width
y is the precision
c is the conversion character
In a string, the precision is the maximum number of characters to be printed from the string; in a number, the precision is the num-
ber of digits to be printed to the right of the decimal point in a floating point value. If x or y is `*' (asterisk), the minimum
field width or precision will be the value of the next expr in the call to sprintf.
The conversion character c is one of the following:
d Decimal integer
o Unsigned octal integer
x Unsigned hexadecimal integer
u Unsigned decimal integer
f Floating point
e Floating point (scientific notation)
g The shorter of e and f (suppresses non-significant zeros)
c Single character of an integer value
s String
n = ord(expr)
Returns the integer value of the first character in the string value of expr. This is useful in conjunction with `%c' in sprintf.
str = tolower(expr)
Converts all letters in the string value of expr into lower case, and returns the result. If expr is omitted, $0 is used.
str = toupper(expr)
Converts all letters in the string value of expr into upper case, and returns the result. If expr is omitted, $0 is used.
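Several of the string functions are shown together in the sketch below. The sample strings are invented for the illustration, and awk is assumed to stand in for nawk; ord is left out because other versions may not recognize it.

```shell
# awk is assumed to stand in for nawk.
strings=$(awk 'BEGIN {
    s = "hello world"
    print length(s)                    # 11
    print substr(s, 1, 5)              # hello
    print index(s, "world")            # 7
    pos = match(s, /o+/)
    print pos, RSTART, RLENGTH         # 5 5 1
    n = split("a:b:c", parts, ":")
    print n, parts[1], parts[3]        # 3 a c
    n = gsub(/o/, "0", s)
    print n, s                         # 2 hell0 w0rld
    print toupper(s)                   # HELL0 W0RLD
    print sprintf("%-5s|%6.2f|", "ab", 3.14159)
}')
echo "$strings"
```

The last line prints "ab" left-justified in five columns, followed by 3.14 right-justified in six.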
The System Function
status = system(expr)
Executes the string value of expr as a command. For example,
system("tail " $1)
calls the tail command, using the string value of $1 as the file that tail should examine. See the Restrictions section for a discussion of
the execution of the command.
User-Defined Functions
You can define your own functions using the form
function name(parameter-list) {
statements
}
A function definition can appear in the place of a pattern {action} rule. The parameter-list contains any number of normal (scalar) and
array variables separated by commas. When a function is called, scalar arguments are passed by value, and array arguments are passed by
reference. The names specified in the parameter-list are local to the function; all other names used in the function are global.
Local scalar variables can be defined by adding them to the end of the parameter list. These extra parameters are not used in any call to
the function.
A function returns to its caller either when the final statement in the function is executed, or when an explicit return statement is exe-
cuted.
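The argument-passing rules can be sketched as follows. The function name bump and all variable names are hypothetical, and awk is assumed to stand in for nawk.

```shell
# awk is assumed to stand in for nawk.
funcs=$(awk '
# Doubles every element of arr (passed by reference) and assigns to n.
# The extra parameter k is a local variable; callers never supply it.
function bump(arr, n,    k) {
    for (k in arr)
        arr[k] *= 2
    n = 99              # changes only the local copy of the scalar
    return n
}
BEGIN {
    k = "untouched"     # global k is not affected by the local k
    v[1] = 5
    count = 1
    r = bump(v, count)
    print v[1], count, r, k
}')
echo "$funcs"
```

The array element is doubled in the caller, while count and the global k keep their values.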
Patterns and Actions
A pattern is a regular expression, a special pattern, a pattern range, or any arithmetic expression.
BEGIN is a special pattern used to label actions that should be performed before any input records have been read. END is a special pat-
tern used to label actions that should be performed after all input records have been read.
A pattern range is given as
pattern1,pattern2
This matches all lines from one that matches pattern1 to one that matches pattern2, inclusive.
If a pattern is omitted, or if the numeric value of the pattern is non-zero (true), the resulting action is executed for the line.
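A sketch combining the special patterns with a pattern range is shown below. The input lines are made up for the illustration, and awk is assumed to stand in for nawk.

```shell
# awk is assumed to stand in for nawk.
ranged=$(printf 'a\nSTART\nb\nSTOP\nc\n' | awk '
BEGIN { print "begin" }
# Range with no action: records from START through STOP are printed.
/START/,/STOP/
END { print "end after", NR, "records" }
')
echo "$ranged"
```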
An action is a series of statements terminated by semicolons, new-lines, or closing braces. A condition is any expression; a non-zero
value is considered true, and a zero value is considered false. A statement is one of the following:
expression
if (condition)
statement
[else
statement]
while (condition)
statement
do
statement
while (condition)
for (expression1; condition; expression2)
statement
The for statement is equivalent to:
expression1
while (condition) {
statement
expression2
}
The for statement can also have the form
for (i in array)
statement
The statement is executed once for each element in array; on each repetition, the variable i will contain the name of a subscript of array,
running through all the subscripts in an arbitrary order. If array is multi-dimensional (has multiple subscripts), i will be expressed as
a single string with the SUBSEP character separating the subscripts. The following simple statements are supported:
break Exits a for or a while loop immediately.
continue
Stops the current iteration of a for or while loop and begins the next iteration (if there is one).
next Terminates any processing for the current input record and immediately starts processing the next input record. Processing for the
next record will begin with the first appropriate rule.
exit[ (expr) ]
Immediately goes to the END action if it exists; if there is no END action, or if nawk is already executing the END action, the program
terminates. The exit status of the program is set to the numeric value of expr. If (expr) is omitted, the exit status is 0.
return [expr]
Returns from the execution of a function. If an expr is specified, the value of the expression is returned as the result of the
function. Otherwise, the function result is undefined.
delete array[i]
Deletes element i from the given array.
print expr, expr, ...
Described below.
printf fmt, expr, expr, ...
Described below.
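The for (i in array) form, the SUBSEP separator, and the next and delete statements described above can be sketched together as follows. The input records and the array name total are invented for the illustration, and awk is assumed to stand in for nawk.

```shell
# awk is assumed to stand in for nawk.
multi=$(printf '1 2 10\n# skip\n1 2 5\n' | awk '
/^#/ { next }                       # skip records that start with #
{ total[$1, $2] += $3 }
END {
    for (key in total) {
        split(key, part, SUBSEP)    # recover the two subscripts
        print part[1], part[2], total[key]
    }
    delete total[1, 2]
    n = 0
    for (key in total) n++
    print "left:", n
}')
echo "$multi"
```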
Output
The print and printf statements write to the standard output. Output can be redirected to a file or pipe as described below.
If >expr is added to a print or printf statement, the string value of expr is taken to be a file name, and output is written to that file.
Similarly, if >>expr is added, output will be appended to the current contents of the file. The distinction between `>' and `>>' is
only important for the first print to the file expr. Subsequent outputs to an already open file will append to what is there already.
In order to eliminate ambiguities, statements such as
print a > b c
are syntactically illegal. Parentheses must be used to resolve the ambiguity.
If |expr is added to a print or printf statement, the string value of expr is taken to be an executable command. The command is executed
with the output from the statement piped as input into the command.
As noted earlier, only a limited number of files and pipes can be open at any time. To avoid going over the limit, you should use the
close function to close files and pipes when they are no longer needed.
The print statement prints its arguments with only simple formatting. If it has no arguments, the current input record is printed in its
entirety. The output record separator ORS is added to the end of the output produced by each print statement; when arguments in the print
statement are separated by commas, the corresponding output values will be separated by the output field separator OFS. ORS and OFS are
built-in variables whose values can be changed by assigning them strings. The default output record separator is a new-line and the
default output field separator is a space. The format of numbers output by print is given by the string OFMT. By default, the value is
`%.6g'; this can be changed by assigning OFMT a different string value.
The printf statement formats its arguments using the fmt argument. Formatting is the same as for the built-in function sprintf. Unlike
print, printf does not add output separators automatically. This gives the program more precise control of the output.
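The effect of OFS, ORS, and OFMT on print, and their absence from printf, can be sketched as follows (the separator values are arbitrary; awk is assumed to stand in for nawk):

```shell
# awk is assumed to stand in for nawk.
formatted=$(awk 'BEGIN {
    OFS = "-"; ORS = "!\n"
    print "a", "b"                  # a-b!
    OFMT = "%.2f"
    print 3.14159                   # 3.14!
    printf "%-4s|%3d\n", "x", 7     # no OFS or ORS is added by printf
}')
echo "$formatted"
```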
Restrictions
The longest input record is restricted to 20,000 bytes and the maximum number of fields supported is 4000. The length of the string pro-
duced by sprintf is limited to 1024 bytes.
The ord function may not be recognized by other versions of awk. The toupper and tolower functions and the ENVIRON array variable are found in
the Bell Labs version of awk; this version is a superset of `New awk' as described in The AWK Programming Language by Aho, Weinberger, and Kernighan.
The shell that is used by the functions
getline print printf system
and the return value of the system function are described in system(3).
Examples
The following example outputs the contents of the file input1 with line numbers prepended to each line:
nawk '{print NR ":" $0}' input1
The following is an example using var=value on the command line:
nawk '{print NR SEP $0}' SEP=":" input1
The program script can also be read from a file as in the command line:
nawk -f addline.nawk input1
This example produces the same output as the previous example when the file addline.nawk contains
{print NR ":" $0}
The following program appends all input lines starting with `January' to the file jan (which can already exist or not), and all lines starting
with `February' or `March' to the file febmar:
/^January/ {print >> "jan"}
/^February|^March/ {print >> "febmar"}
This program prints the total and average for the last column of each input line:
{s += $NF}
END {print "sum is", s, "average is", s/NR}
The following program interchanges the first and second fields of input lines:
{
tmp = $1
$1 = $2
$2 = tmp
print
}
The following example inserts line numbers so that output lines are left-aligned:
{printf "%-6d: %s\n", NR, $0}
This example prints input records in reverse order (assuming sufficient memory):
{
a[NR] = $0 # index using record number
}
END {
for (i = NR; i>0; --i)
print a[i]
}
The next program determines the number of lines starting with the same first field:
{
++a[$1] # array indexed using the first field
}
END { # note output will be in undefined order
for (i in a)
print a[i], "lines start with", i
}
The following program can be used to determine the number of lines in each input file:
{
++a[FILENAME]
}
END {
for (file in a)
if (a[file] == 1)
print file, "has 1 line"
else
print file, "has", a[file], "lines"
}
This program illustrates how a two dimensional array can be used in nawk. Assume the first field contains a product number, the second field
contains a month number, and the third field contains a quantity (bought, sold, or whatever). The program generates a table of products
versus month.
BEGIN {NUMPROD = 5}
{
array[$1,$2] += $3
}
END {
print "          Jan   Feb March April   May" \
      "  June  July   Aug  Sept   Oct   Nov   Dec"
for (prod = 1; prod <= NUMPROD; prod++) {
printf "%-7s", "prod#" prod
for (month = 1; month <= 12; month++){
printf " %5d", array[prod,month]
}
printf "\n"
}
}
As this program reads in each line of input, it reports whether the line matches a pre-determined value:
function randint() {
return (int((rand()+1)*10))
}
BEGIN {
prize[randint(),randint()] = "$100";
prize[randint(),randint()] = "$10";
prize[1,1] = "the booby prize"
}
{
if (($1,$2) in prize)
printf "You have won %s!\n", prize[$1,$2]
}
This example prints lines whose first and last fields are the same, reversing the order of the fields:
$1==$NF {
for (i = NF; i > 0; --i)
printf "%s", $i (i>1 ? OFS : ORS)
}
The following program prints the input files from the command line. The infiles function first empties the array passed to it, and then
fills the array. Notice that the extra parameter i of infiles is a local variable.
function infiles(f, i) {
for (i in f)
delete f[i]
for (i = 1; i < ARGC; i++)
if (index(ARGV[i],"=") == 0)
f[i] = ARGV[i]
}
BEGIN {
infiles(a)
for (i in a)
print a[i]
exit
}
This example is the standard recursive factorial function:
function fact(num) {
if (num <= 1)
return 1
else
return num * fact(num - 1)
}
{ print $0 " factorial is " fact($0) }
The last program illustrates the use of getline with a pipe. Here, getline sets the current record from the output of the wc command. The
program prints the number of words in each input file.
function words(file, string) {
string = "wc " file
string | getline
close(string)
return ($2)
}
BEGIN {
for (i=1; i<ARGC; i++) {
fn = ARGV[i]
printf "There are %d words in %s.\n",
words(fn), fn
}
}
See Also
ed(1), grep(1), sed(1), ex(1), system(3), ascii(7),
"Awk - A Pattern Scanning and Processing Language" ULTRIX Supplementary Documents, Vol. II: Programmer