lex(1) [osf1 man page]

lex(1)							      General Commands Manual							    lex(1)

NAME

       lex - Generates programs for lexical tasks

SYNOPSIS

       lex [-ct] [-n  | -v] [file...]

       [Tru64 UNIX]  The following syntax applies when the CMD_ENV environment variable is set to svr4:

       lex [-crt] [-n  | -v] [-V] [-Qy  | -Qn] [file...]

STANDARDS

       Interfaces documented on this reference page conform to industry standards as follows:

       lex:  XPG4, XPG4-UNIX

       Refer to the standards(5) reference page for more information about industry standards and associated tags.

OPTIONS

       -c     Writes C code to the file lex.yy.c. This is the default.

       -n     Suppresses the statistics summary. When you set your own table sizes for the finite state machine, lex automatically produces
              this summary if you do not select this flag.

       -r     [Tru64 UNIX]  Writes RATFOR code to the file lex.yy.r. (There is no RATFOR compiler for Tru64 UNIX.)

       -t     Writes to standard output instead of writing to a file.

       -v     Provides a summary of the generated finite state machine statistics.

       -V     [Tru64 UNIX]  Outputs lex version number to standard error. Requires the environment variable CMD_ENV to be set to svr4.

       -Qy | -Qn
              [Tru64 UNIX]  Determines whether the lex version number is written to the output file. The -Qn option does not do so and is
              the default. Requires the environment variable CMD_ENV to be set to svr4.

DESCRIPTION

       The  lex  command  uses the rules and actions contained in file to generate a program, lex.yy.c, which can be compiled with the cc command.
       That program can then receive input, break the input into the logical pieces defined by the rules in file, and run program  fragments  con-
       tained in the actions in file.

       The  generated  program	is  a  C  Language  function called yylex(). The lex command stores yylex() in a file named lex.yy.c.  You can use
       yylex() alone to recognize simple, 1-word input, or you can use it with other C Language programs to perform more difficult input  analysis
       functions.   For example, you can use lex to generate a program that tokenizes an input stream before sending it to a parser program gener-
       ated by the yacc command.

       The yylex() function analyzes the input stream using a program structure called a finite state machine. This structure allows  the  program
       to  exist  in  only one state (or condition) at a time.	A finite number of states are allowed. The rules in file determine how the program
       moves from one state to another in response to the input that the program receives.
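
       The idea of a finite state machine can be sketched by hand in C. The following is an illustration only, not the code that lex
       generates: the state names and the classify() function are hypothetical. The machine treats a run of digits as a number and a
       letter followed by letters or digits as a name.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical states for a hand-written machine; lex builds its own tables. */
enum state { START, IN_NUMBER, IN_NAME, REJECTED };

/* Runs the machine over word and returns the state it ends in. */
enum state classify(const char *word)
{
    enum state s = START;

    assert(word != NULL);
    for (; *word != '\0'; word++) {
        unsigned char c = (unsigned char)*word;

        /* The machine is in exactly one state; each character selects the next. */
        switch (s) {
        case START:
            s = isdigit(c) ? IN_NUMBER : isalpha(c) ? IN_NAME : REJECTED;
            break;
        case IN_NUMBER:
            s = isdigit(c) ? IN_NUMBER : REJECTED;
            break;
        case IN_NAME:
            s = isalnum(c) ? IN_NAME : REJECTED;
            break;
        case REJECTED:
            return REJECTED;
        }
    }
    return s;
}
```

       Calling classify("123") ends in IN_NUMBER, classify("x1") ends in IN_NAME, and classify("1x") ends in REJECTED.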

       The lex command reads its skeleton finite state machine from the file /usr/ccs/lib/ncpform  or  /usr/ccs/lib/ncform.  Use  the  environment
       variable LEXER to specify another location for lex to read from.

       If you do not specify a file, lex reads standard input. It treats multiple files as a single file.

   Input File Format
       The  input  file can contain three sections:  definitions, rules, and user subroutines. Each section must be separated from the others by a
       line containing only the delimiter, %%.	The format is as follows:

       definitions
       %%
       rules
       %%
       user_subroutines

       The purpose and format of each of these sections are described under the headings that follow.

   Definitions Section
       If you want to use variables in rules, you must define them in the definitions section. The variables make up the left  column,	and  their
       definitions make up the right column.  For example, to define D as a numerical digit, enter:

       D    [0-9]

       You can use a defined variable in the rules section by enclosing the variable name in braces, {D}.

       In the definitions section, you can set either of the following two mutually exclusive declarations:

       %array      Declares the type of yytext to be a null-terminated character array.

       %pointer    Declares the type of yytext to be a pointer to a null-terminated character string.

       Use of the %pointer definition selects the /usr/ccs/lib/ncpform skeleton.

       In the definitions section, you can also set table sizes for the resulting finite state machine. The default sizes are large enough
       for small programs.  You may want to set larger sizes for more complex programs:

       %p number    Number of positions is number (default 5000)
       %n number    Number of states is number (default 2500)
       %e number    Number of parse tree nodes is number (default 2000)
       %a number    Number of transitions is number (default 5000)
       %k number    Number of packed character classes is number (default 2000)
       %o number    Number of output slots is number (default 5000)

       If extended characters appear in regular expression strings, you may need to reset the output array size with the %o parameter (possibly to
       array  sizes  in  the range 10,000 to 20,000).  This reset reflects the much larger number of extended characters relative to the number of
       ASCII characters.

   Rules Section
       The rules section is required, and it must be preceded by the %% delimiter, even if you do not have a definitions section. The lex  command
       does not recognize rules without the delimiter.

       In  this  section, the left column contains the pattern to be recognized in an input file to yylex().  The right column contains the C pro-
       gram fragment executed when that pattern is recognized.

       Patterns can include extended characters with one exception: extended characters may not appear in range  specifications  within  character
       class expressions surrounded by brackets.

       The columns are separated by a tab. For example, to search files for the word LEAD and replace it with GOLD, perform the following
       steps:

       Create a file called transmute.l containing the lines:

	      %%
	      (LEAD)  printf("GOLD");

       Then issue the following commands to the shell:

	      lex transmute.l
	      cc -o transmute lex.yy.c -ll

       You can test the resulting program with the command:

	      transmute <transmute.l

       This command echoes the contents of transmute.l, with the occurrences of LEAD changed to GOLD.

       Each  pattern may have a corresponding action, that is, a fragment of C source code to execute when the pattern is matched.  Each statement
       must end with a ; (semicolon).  If you use more than one statement in an action, you must enclose all of them  in  {}  (braces).  A  second
       delimiter, %%, must follow the rules section if you have a user subroutine section.

       When  yylex()  matches  a string in the input stream, it copies the matched text to an external character array, yytext, before it executes
       any actions in the rules section.

       You can use the following operators to form patterns that you want to match:

       x, y      Matches the characters written.

       [ ]       Matches any one character in the enclosed range ([.-.]) or the enclosed list ([...]). [abcx-z] matches a, b, c, x, y, or z.

       " "       Matches the enclosed character or string even if it is an operator.  "$" prevents lex from interpreting the $ character
                 as an operator.

       \         Acts the same as double quotes.  \$ prevents lex from interpreting the $ character as an operator.

       *         Matches zero or more occurrences of the single-character regular expression immediately preceding it.  x* matches zero
                 or more repeated literal characters x.

       +         Matches one or more occurrences of the single-character regular expression immediately preceding it.

       ?         Matches either zero or one occurrence of the single-character regular expression immediately preceding it.

       ^         Matches the character only at the beginning of a line.  ^x matches an x at the beginning of a line.

       [^ ]      Matches any character except for the characters following the ^.  [^xyz] matches any character but x, y, or z.

       .         Matches any character except the newline character.

       $         Matches the end of a line.

       |         Matches either of two characters.  x|y matches either x or y.

       /         Matches one extended regular expression (ERE) only when followed by a second ERE. It reads only the first token into
                 yytext.  Given the regular expression a*b/cc and the input aaabcc, yytext would contain the string aaab on this match.

       ( )       Matches the pattern in the ( ) (parentheses). This is used for grouping. It reads the whole pattern into yytext. A group
                 in parentheses can be used in place of any single character in any other pattern.  (xyz123) matches the pattern xyz123
                 and reads the whole string into yytext.

       { }       Matches the character as defined in the definitions section.  If D is defined as numeric digits, {D} matches all
                 numeric digits.

       {m,n}     Matches m-to-n occurrences of the specified character.  x{2,4} matches 2, 3, or 4 occurrences of x.
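
       The trailing-context operator (/) is the least obvious of these. As a rough illustration, the following C sketch mimics by hand
       what the pattern a*b/cc does: the cc must be present, but it is not counted as part of the matched text. The function name
       match_ab_before_cc() is hypothetical and is not part of lex.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hand-written stand-in for the pattern a*b/cc.
 * Returns the number of characters that would land in yytext (the a*b part),
 * or 0 if the input does not start with a*b followed by cc.
 */
size_t match_ab_before_cc(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    while (s[n] == 'a')                   /* a* : zero or more a characters */
        n++;
    if (s[n] != 'b')                      /* b  : exactly one b             */
        return 0;
    n++;
    if (s[n] != 'c' || s[n + 1] != 'c')   /* /cc : required but not consumed */
        return 0;
    return n;
}
```

       For the input aaabcc the function returns 4, the length of aaab; the two c characters remain available for the next match.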

       If a line begins with only a space, lex copies it to the lex.yy.c output file. If the line is in  the  definitions  section  of	file,  lex
       copies  it  to  the  declarations  section  of  lex.yy.c. If the line is in the rules section, lex copies it to the program code section of
       lex.yy.c.

   User Subroutines Section
       The lex library has three subroutines defined as macros that you can use in the rules:

       input()     Reads a character from yyin.

       unput()     Replaces a character after it is read.

       output()    Writes a character to yyout.

       You  can override these three macros by writing your own code for these routines in the user subroutines section. But if you write your own
       routines, you must undefine these macros in the definitions section as follows:

       %{
       #undef input
       #undef unput
       #undef output
       %}

       When you are using lex as a simple transformer/recognizer for stdin to stdout piping, you can avoid writing the framework by  using  libl.a
       (the lex library). It has a main routine that calls yylex() for you.

       External names generated by lex all begin with the prefix yy, as in yyin, yyout, yylex, and yytext.

   Putting Spaces in an Expression
       Normally, spaces or tabs end a rule and, therefore, the expression that defines a rule.	However, you can enclose the spaces or tab charac-
       ters in "" (double quotes) to include them in the expression. Use quotes around all spaces in expressions that are not already within  sets
       of [ ] (brackets).

   Other Special Characters
       The lex program recognizes many of the normal C language special characters.  These character sequences are as follows:

       Sequence   Meaning
       \n         Newline
       \t         Tab
       \b         Backspace
       \\         Backslash
       \digits    The character whose encoding is represented
                  by the three-digit octal number
       \xdigits   The character whose encoding is represented
                  by the hexadecimal integer

       Do not use the actual newline character in an expression.

       When  using  these  special  characters in an expression, you do not need to enclose them in quotes.  Every character, except these special
       characters and the previously described operator symbols, is always a text character.

   Matching Rules
       When more than one expression can match the current input, lex chooses the longest match first.	Among rules that match the same number	of
       characters, the rule that occurs first is chosen.  For example:

       integer   keyword action...;
       [a-z]+    identifier action...;

       If  the	preceding  rules  are  given  in  that order and integers is the input word, lex matches the input as an identifier because [a-z]+
       matches eight characters, while integer matches only seven.  However, if the input is integer, both rules match seven characters. The  key-
       word  rule  is selected because it occurs first. A shorter input, such as int, does not match the expression rule integer and causes lex to
       select the rule identifier.
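
       The selection logic described above can be sketched in C. This is a simplified illustration under stated assumptions, not the
       table-driven code lex produces: the two matcher functions stand in for the rules integer and [a-z]+, and choose_rule() is a
       hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Length of the prefix of s matched by the literal keyword "integer", or 0. */
static size_t match_keyword(const char *s)
{
    const char *kw = "integer";
    size_t n = strlen(kw);

    return strncmp(s, kw, n) == 0 ? n : 0;
}

/* Length of the prefix of s matched by [a-z]+, or 0. */
static size_t match_identifier(const char *s)
{
    size_t n = 0;

    while (s[n] >= 'a' && s[n] <= 'z')
        n++;
    return n;
}

/* Returns the 0-based index of the chosen rule, or -1 if no rule matches. */
int choose_rule(const char *s)
{
    size_t (*rules[])(const char *) = { match_keyword, match_identifier };
    size_t best_len = 0;
    int best = -1;
    int i;

    assert(s != NULL);
    for (i = 0; i < 2; i++) {
        size_t len = rules[i](s);

        /* Only a strictly longer match replaces the current choice,
           so the earlier rule wins when the lengths are equal. */
        if (len > best_len) {
            best_len = len;
            best = i;
        }
    }
    return best;
}
```

       With this sketch, choose_rule("integers") and choose_rule("int") return 1 (the identifier rule), while choose_rule("integer")
       returns 0 (the keyword rule).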

   Matching a String with Wildcard Characters
       Because lex chooses the longest match first, do not use rules containing expressions like .* (for example, '.*').

       The preceding rule might seem like a good way to recognize a string in single quotes.  However, the lexical analyzer reads far ahead, look-
       ing for a distant single quote to complete the long match.  If a lexical analyzer with such a rule gets the following input, it matches the
       whole string:

       'first' quoted string here, 'second' here

       To find the smaller strings, first and second, use the following rule:

       '[^'\n]*'

       This rule stops after matching 'first'.
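
       As a rough cross-check, the behavior of a rule of the form '[^'\n]*' (an opening quote, any characters other than a quote or a
       newline, and a closing quote) can be written out by hand in C. The function name match_quoted() is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Returns the length of the text matched by '[^'\n]*' at the start of s,
 * or 0 if s does not start with such a quoted string.
 */
size_t match_quoted(const char *s)
{
    size_t n = 1;

    assert(s != NULL);
    if (s[0] != '\'')
        return 0;
    /* Stop at the first closing quote, newline, or end of input. */
    while (s[n] != '\0' && s[n] != '\'' && s[n] != '\n')
        n++;
    return s[n] == '\'' ? n + 1 : 0;
}
```

       Given the input line shown earlier, the function returns 7, the length of 'first' including both quotes, rather than running on to
       the last quote on the line.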

       Errors of this type are not far-reaching because the . (dot) operator does not match a newline character.  Therefore, expressions
       like .* stop on the current line.  Do not try to defeat this with expressions like [.\n]+.  The lexical analyzer tries to read the
       entire input file, and an internal buffer overflow occurs.

   Finding Strings within Strings
       The lex program partitions the input stream and does not search for all possible matches of each expression.  Each character  is  accounted
       for once and only once.	For example, to count occurrences of both she and he in an input text, try the following rules:

       she   s++;
       he    h++;
       \n    |
       .     ;

       The  last two rules ignore everything besides he and she. However, because she includes he, lex does not recognize the instances of he that
       are included in she.

       To override this choice, use the REJECT action.	This directive tells lex to go to the next rule.  The lex command then adjusts	the  posi-
       tion  of  the  input pointer to where it was before the first rule was executed, and executes the second choice rule. For example, to count
       the included instances of he, use the following rules:

       she    {s++; REJECT;}
       he     {h++; REJECT;}
       \n     |
       .      ;

       After counting the occurrences of she, lex rejects the input stream and then counts the occurrences of he. In this case, you can  omit  the
       REJECT action on he because she includes he but not vice versa. In other cases, it may be difficult to determine which input characters are
       in both classes.

       In general, REJECT is useful whenever the purpose of lex is not to partition the input stream but to detect all examples of some  items	in
       the input, and the instances of these items may overlap or include each other.
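
       The net effect of using REJECT on both the she and he rules is that every input position is tested against both patterns before
       one character is consumed. A minimal C sketch of that counting behavior follows. It does not use lex, and the function name
       count_overlapping() is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Counts every occurrence of word in text, including occurrences that
 * overlap or are contained in other matches, by testing each position.
 */
int count_overlapping(const char *text, const char *word)
{
    size_t len;
    int count = 0;

    assert(text != NULL && word != NULL);
    len = strlen(word);
    if (len == 0)
        return 0;
    /* Advance one character at a time so no position is skipped. */
    for (; *text != '\0'; text++) {
        if (strncmp(text, word, len) == 0)
            count++;
    }
    return count;
}
```

       For the text "she said he", counting she gives 1 and counting he gives 2, because the he inside she is included.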

NOTES

       Because lex uses fixed names for intermediate and output files, you can have only one lex-generated program in a given directory. If the -t
       option is not specified, informational, error, and warning messages are written to stdout. If the -t option  is	specified,  informational,
       error, and warning messages are written to stderr.

       [Tru64 UNIX]  The yytext array has a default dimension of 200, controlled by the constant YYLMAX. If the programmer needs to allow a larger
       array, the YYLMAX constant may be redefined as follows from within the lex command file:

       %{
       #undef YYLMAX
       #define YYLMAX 8192
       %}

       Two other arrays also use YYLMAX: yysbuf and yylstate.

       The lex program can be compiled as a C program with -std0, -std, or -std1 mode. It can also be compiled as a C++ program. If YY_NOPROTO	is
       defined on the compilation command line, function prototypes are not generated.

EXAMPLES

       The following command draws lex instructions from the file lexcommands and places the output in lex.yy.c:

	      lex lexcommands

       The file lexcommands contains an example of a lex program that would be put into a lex command file.  The following program
       converts uppercase to lowercase, removes spaces at the end of a line, and replaces multiple spaces with single spaces:

	      %%
	      [A-Z]   putchar(tolower(yytext[0]));
	      [ ]+$   ;
	      [ ]+    putchar(' ');
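
       For comparison, the same transformation can be sketched as a plain C function. This is an approximation for illustration, not
       output from lex: normalize() and OUT_MAX are hypothetical names, and the sketch assumes that, as with the $ operator, only a
       run of spaces immediately before a newline counts as being at the end of a line.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

#define OUT_MAX 256   /* hypothetical fixed buffer size for this sketch */

/*
 * Lowercases letters, drops runs of spaces that precede a newline, and
 * collapses every other run of spaces to one space.  Returns a pointer to
 * a static buffer that is overwritten by the next call.
 */
const char *normalize(const char *in)
{
    static char out[OUT_MAX];
    char *p = out;

    assert(in != NULL && strlen(in) < OUT_MAX);
    while (*in != '\0') {
        if (*in == ' ') {
            while (*in == ' ')      /* consume the whole run, like [ ]+ */
                in++;
            if (*in != '\n')        /* a run before a newline is dropped */
                *p++ = ' ';
        } else {
            /* tolower() leaves characters other than A-Z unchanged. */
            *p++ = (char)tolower((unsigned char)*in);
            in++;
        }
    }
    *p = '\0';
    return out;
}
```

       The output is never longer than the input, so the length check at the top is enough to keep the buffer from overflowing.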

ENVIRONMENT VARIABLES

       The following environment variables affect the behavior of lex:

       LANG          Provides a default value for the locale category variables that are not set or null.

       LC_ALL        If set, overrides the values of all other locale variables.

       LC_COLLATE    Determines the order in which output is sorted for the -x option.

       LC_CTYPE      Determines the locale for the interpretation of byte sequences as characters (single-byte or multi-byte) in input
                     parameters and files.

       LC_MESSAGES   Determines the locale used to affect the format and contents of diagnostic messages displayed by the command.

       NLSPATH       Determines the location of message catalogs for the processing of LC_MESSAGES.

FILES

       libl.a
              Run-time library.

       /usr/ccs/lib/ncform
              Default C language skeleton finite state machine for lex.

       /usr/ccs/lib/ncpform
              Default C language skeleton finite state machine for lex, implemented with the pointer definition of yytext.

       Default RATFOR language skeleton finite state machine for lex.

SEE ALSO

       Commands:  yacc(1)

       Standards:  standards(5)

       Programming Support Tools

																	    lex(1)