lex(1) User Commands lex(1)
NAME
lex - generate programs for lexical tasks
SYNOPSIS
lex [-cntv] [-e | -w] [-V -Q [y | n]] [file...]
DESCRIPTION
The lex utility generates C programs to be used in lexical processing of character input, and that can be used as an interface to yacc. The
C programs are generated from lex source code and conform to the ISO C standard. Usually, the lex utility writes the program it generates
to the file lex.yy.c. The state of this file is unspecified if lex exits with a non-zero exit status. See EXTENDED DESCRIPTION for a com-
plete description of the lex input language.
OPTIONS
The following options are supported:
-c Indicates C-language action (default option).
-e Generates a program that can handle EUC characters (cannot be used with the -w option). yytext[] is of type unsigned
char[].
-n Suppresses the summary of statistics usually written with the -v option. If no table sizes are specified in the lex source
code and the -v option is not specified, then -n is implied.
-t Writes the resulting program to standard output instead of lex.yy.c.
-v Writes a summary of lex statistics to the standard error. (See the discussion of lex table sizes under the heading Defini-
tions in lex.) If table sizes are specified in the lex source code, and if the -n option is not specified, the -v option
may be enabled.
-w Generates a program that can handle EUC characters (cannot be used with the -e option). Unlike the -e option, yytext[] is
of type wchar_t[].
-V Prints out version information on standard error.
-Q[y|n] With -Qy, writes version information to the output file lex.yy.c. With -Qn, no version information is written; this is the
default.
OPERANDS
The following operand is supported:
file A pathname of an input file. If more than one such file is specified, all files will be concatenated to produce a single lex pro-
gram. If no file operands are specified, or if a file operand is -, the standard input will be used.
OUTPUT
The lex output files are described below.
Stdout
If the -t option is specified, the text file of C source code output of lex will be written to standard output.
Stderr
If the -t option is specified, informational, error, and warning messages concerning the contents of lex source code input will be written to
the standard error.
If the -t option is not specified:
1. Informational, error, and warning messages concerning the contents of lex source code input will be written to either the standard output
or standard error.
2. If the -v option is specified and the -n option is not specified, lex statistics will also be written to standard error. These statis-
tics may also be generated if table sizes are specified with a % operator in the Definitions in lex section (see EXTENDED DESCRIPTION),
as long as the -n option is not specified.
Output Files
A text file containing C source code will be written to lex.yy.c, or to the standard output if the -t option is present.
EXTENDED DESCRIPTION
Each input file contains lex source code, which is a table of regular expressions with corresponding actions in the form of C program frag-
ments.
When lex.yy.c is compiled and linked with the lex library (using the -l l operand with c89 or cc), the resulting program reads character
input from the standard input and partitions it into strings that match the given expressions.
When an expression is matched, these actions will occur:
o The input string that was matched is left in yytext as a null-terminated string; yytext is either an external character array or a
pointer to a character string. As explained in Definitions in lex, the type can be explicitly selected using the %array or %pointer
declarations, but the default is %array.
o The external int yyleng is set to the length of the matching string.
o The expression's corresponding program fragment, or action, is executed.
During pattern matching, lex searches the set of patterns for the single longest possible match. Among rules that match the same number of
characters, the rule given first will be chosen.
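As a minimal sketch of this matching rule, consider the following specification, in which a keyword rule precedes a more general identifier rule. The patterns and messages are illustrative only, and the sketch assumes the program is linked with the lex library, which supplies main and yywrap.

```lex
%{
#include <stdio.h>
%}
%%
if          { printf("keyword: %s\n", yytext); }
[a-z]+      { printf("identifier: %s\n", yytext); }
[ \t\n]+    { /* skip white space */ }
.           { printf("other: %s\n", yytext); }
```

For the input if, the first two rules both match two characters, so the rule given first (the keyword rule) is chosen. For the input iffy, the identifier rule matches four characters, which is longer than the two matched by the keyword rule, so the identifier rule is chosen.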
The general format of lex source is:
Definitions
%%
Rules
%%
User Subroutines
The first %% is required to mark the beginning of the rules (regular expressions and actions); the second %% is required only if user sub-
routines follow.
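The three sections might be laid out as in the following sketch, a line counter. The name line_count is hypothetical, and the sketch supplies its own main and yywrap rather than taking them from the lex library.

```lex
%{
/* Definitions section: C code between %{ and %} is copied to lex.yy.c. */
#include <stdio.h>
static int line_count = 0;      /* hypothetical counter used by the rules */
%}
%%
\n      { line_count++; }
.       { /* ignore all other characters */ }
%%
/* User subroutines section: copied to lex.yy.c after yylex. */
int yywrap(void)
{
    return 1;                   /* report that there is no more input */
}

int main(void)
{
    yylex();                    /* scan standard input until end of file */
    printf("%d\n", line_count);
    return 0;
}
```

Because user subroutines follow the rules here, the second %% is required.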
Any line in the Definitions in lex section beginning with a blank character will be assumed to be a C program fragment and will be copied
to the external definition area of the lex.yy.c file. Similarly, anything in the Definitions in lex section included between delimiter
lines containing only %{ and %} will also be copied unchanged to the external definition area of the lex.yy.c file.
Any such input (beginning with a blank character or within %{ and %} delimiter lines) appearing at the beginning of the Rules section
before any rules are specified will be written to lex.yy.c after the declarations of variables for the yylex function and before the first
line of code in yylex. Thus, user variables local to yylex can be declared here, as well as application code to execute upon entry to
yylex.
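A minimal sketch of such a declaration is shown here: a per-line word counter whose counter is local to yylex. The name words is hypothetical, and the sketch assumes the program is linked with the lex library, which supplies main and yywrap.

```lex
%{
#include <stdio.h>
%}
%%
        int words = 0;          /* becomes a local variable of yylex */
[^ \t\n]+   { words++; }
[ \t]+      { /* skip blanks */ }
\n          { printf("%d\n", words); words = 0; }
```

The indented declaration appears before any rule, so it is placed at the top of yylex. No action returns from yylex, so the variable keeps its value across the whole scan.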
The action taken by lex when encountering any input beginning with a blank character or within %{ and %} delimiter lines appearing in the
Rules section but coming after one or more rules is undefined. The presence of such input may result in an erroneous definition of the
yylex function.
Definitions in lex
Definitions in lex appear before the first %% delimiter. Any line in this section not contained between %{ and %} lines and not beginning
with a blank character is assumed to define a lex substitution string. The format of these lines is:
name substitute
If a name does not meet the requirements for identifiers in the ISO C standard, the result is undefined. The string substitute will replace
the string {name} when it is used in a rule. The name string is recognized in this context only when the braces are provided and when it
does not appear within a bracket expression or within double-quotes.
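The following sketch shows two substitution strings in use. The names DIGIT and LETTER are hypothetical, and the sketch assumes the program is linked with the lex library, which supplies main and yywrap.

```lex
%{
#include <stdio.h>
%}
DIGIT   [0-9]
LETTER  [A-Za-z_]
%%
{LETTER}({LETTER}|{DIGIT})*     { printf("identifier: %s\n", yytext); }
{DIGIT}+                        { printf("number: %s\n", yytext); }
"{DIGIT}"                       { printf("literal text: %s\n", yytext); }
.|\n                            { /* ignore everything else */ }
```

In the third rule the name appears within double-quotes, so it is not replaced; that rule matches the seven literal characters {DIGIT}.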
In the Definitions in lex section, any line beginning with a % (percent sign) character and followed by an alphanumeric word beginning with
either s or S defines a set of start conditions. Any line beginning with a % followed by a word beginning with either x or X defines a set
of exclusive start conditions. When the generated scanner is in a %s state, patterns with no state specified will also be active; in a %x
state, such patterns will not be active. The rest of the line, after the first word, is considered to be one or more blank-character-sepa-
rated names of start conditions. Start condition names are constructed in the same way as definition names. Start conditions can be used to
restrict the matching of regular expressions to one or more states as described in Regular expressions in lex.
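As a sketch of an exclusive start condition, the following specification removes C-style comments from its input and copies everything else through. It relies on three standard lex features that this section has not described: the <state> rule prefix, the BEGIN action, and the predefined INITIAL state. The name COMMENT is hypothetical, and the sketch assumes the program is linked with the lex library.

```lex
%x COMMENT
%%
"/*"                { BEGIN COMMENT; }
<COMMENT>"*/"       { BEGIN INITIAL; }
<COMMENT>(.|\n)     { /* discard comment text */ }
.|\n                { ECHO; }
```

Because COMMENT is declared with %x, the last rule, which names no state, is not active while the scanner is inside a comment. Had COMMENT been declared with %s, that rule would remain active there.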
Implementations accept either of the following two mutually exclusive declarations in the Definitions in lex section:
%array Declare the type of yytext to be a null-terminated character array.
%pointer Declare the type of yytext to be a pointer to a null-terminated character string.
Note: When using the %pointer option, you may not also use the yyless function to alter yytext.
%array is the default. If %array is specified (or neither %array nor %pointer is specified), then the correct way to make an external
reference to yytext is with a declaration of the form:
extern char yytext[];
If %pointer is specified, then the correct external reference is of the form:
extern char *yytext;
lex will accept declarations in the Definitions in lex section for setting certain internal table sizes. The declarations are shown in the
following table.
Table Size Declaration in lex
+------------------------------------------------------------------+
| Declaration   Description                          Default       |
| %p n          Number of positions                  2500          |
| %n n          Number of states                     500           |
| %a n          Number of transitions                2000          |
| %e n          Number of parse tree nodes           1000          |
| %k n          Number of packed character classes   10000         |
| %o n          Size of the output array             3000          |
+------------------------------------------------------------------+
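A sketch of how such declarations might appear at the top of a specification follows. The sizes are arbitrary values chosen for illustration, not recommendations, and the sketch assumes the program is linked with the lex library.

```lex
%p 5000
%n 1000
%e 2000
%%
[0-9]+      { ECHO; }
.|\n        { /* ignore everything else */ }
```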
Programs generated by lex need either the -e or -w option to handle input that contains EUC characters from supplementary codesets. If nei-
ther of these options is specified, yytext is of the type char[], and the generated program can handle only ASCII characters.
When the -e option is used, yytext is of the type unsigned char[] and yyleng gives the total number of bytes in the matched string. With
this option, the macros input(), unput(c), and output(c) should perform byte-based I/O in the same way as with regular ASCII lex. Two more
variables are available with the -e option, yywtext and yywleng, which behave the same as yytext and yyleng would under the -w option.
When the -w option is used, yytext is of the type wchar_t[] and yyleng gives the total number of characters in the matched string. If you
supply your own input(), unput(c), or output(c) macros with this option, they must return or accept EUC characters in the form of wide
characters (wchar_t). This allows a different interface between your program and the lex internals, to expedite some programs.
Rules in lex
The rules in lex source files form a table in which the left column contains regular expressions and the right column contains actions (C
program fragments) to be executed when the expressions are recognized.
ERE action
ERE action
...
The extended regular expression (ERE) portion of a row will be separated from action by one or more blank characters. A regular expression
containing blank characters is recognized under one of the following conditions:
o The entire expression appears within double-quotes.
o The blank characters appear within double-quotes or square brackets.
o Each blank character is preceded by a backslash character.
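The three conditions can be sketched as follows, one rule for each. The phrases and messages are illustrative only, and the sketch assumes the program is linked with the lex library, which supplies main and yywrap.

```lex
%{
#include <stdio.h>
%}
%%
"hello world"   { printf("quoted form\n"); }
good[ ]bye      { printf("bracket form\n"); }
see\ you        { printf("backslash form\n"); }
.|\n            { /* ignore everything else */ }
```

In each rule, the first blank that is not quoted, bracketed, or escaped ends the expression and begins the action.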
User Subroutines in lex
Anything in the user subroutines section will be copied to lex.yy.c following yylex.
Regular Expressions in lex
The lex utility supports the set of Extended Regular Expressions (EREs) described in regex(5) with the following additions and exceptions
to the syntax:
...
Any string enclosed in double-quotes will represent the characters within the double-quotes as themselves, except that backslash
escapes (which appear in the following table) are recognized. Any backslash-escape sequence is terminated by the closing quote. For
example, "