Unix/Linux Go Back    

OpenSolaris 2009.06 - man page for lex (opensolaris section 1)

Linux & Unix Commands - Search Man Pages
Man Page or Keyword Search:   man
Select Man Page Set:       apropos Keyword Search (sections above)

lex(1)					  User Commands 				   lex(1)

       lex - generate programs for lexical tasks

       lex [-cntv] [-e | -w] [-V -Q [y | n]] [file]...

       The  lex utility generates C programs to be used in lexical processing of character input,
       and that can be used as an interface to yacc. The C programs are generated from lex source
       code  and  conform  to  the ISO C standard. Usually, the lex utility writes the program it
       generates to the file lex.yy.c. The state of this file is unspecified if lex exits with	a
       non-zero exit status. See EXTENDED DESCRIPTION for a complete description of the lex input

       The following options are supported:

       -c	   Indicates C-language action (default option).

       -e	   Generates a program that can handle EUC characters (cannot be used with the -w
		   option). yytext[] is of type unsigned char[].

       -n	   Suppresses the summary of statistics usually written with the -v option. If no
		   table sizes are specified in the lex source code and  the  -v  option  is  not
		   specified, then -n is implied.

       -t	   Writes the resulting program to standard output instead of lex.yy.c.

       -v	   Writes  a summary of lex statistics to the standard error. (See the discussion
		   of lex table sizes under the heading Definitions in lex.) If table  sizes  are
		   specified  in  the lex source code, and if the -n option is not specified, the
		   -v option may be enabled.

       -w	   Generates a program that can handle EUC characters (cannot be used with the -e
		   option). Unlike the -e option, yytext[] is of type wchar_t[].

       -V	   Prints out version information on standard error.

       -Q[y|n]	   Prints  out	version information to output file lex.yy.c by using -Qy. The -Qn
		   option does not print out version information and is the default.

       The following operand is supported:

       file	A pathname of an input file. If more than one such file is specified,  all  files
		will  be  concatenated	to  produce a single lex program. If no file operands are
		specified, or if a file operand is -, the standard input will be used.

       The lex output files are described below.

       If the -t option is specified, the text file of C source code output of lex will be  writ-
       ten to standard output.

       If  the	-t  option  is specified informational, error and warning messages concerning the
       contents of lex source code input will be written to the standard error.

       If the -t option is not specified:

	   1.	  Informational error and warning messages concerning the contents of lex  source
		  code input will be written to either the standard output or standard error.

	   2.	  If  the  -v option is specified and the -n option is not specified, lex statis-
		  tics will also be written to standard error. These statistics may also be  gen-
		  erated if table sizes are specified with a % operator in the Definitions in lex
		  section (see EXTENDED DESCRIPTION), as long as the -n option is not specified.

   Output Files
       A text file containing C source code will be written to lex.yy.c, or to the standard  out-
       put if the -t option is present.

       Each  input  file  contains  lex source code, which is a table of regular expressions with
       corresponding actions in the form of C program fragments.

       When lex.yy.c is compiled and linked with the lex library (using the -l l operand with c89
       or cc), the resulting program reads character input from the standard input and partitions
       it into strings that match the given expressions.

       When an expression is matched, these actions will occur:

	   o	  The input string that was matched  is  left  in  yytext  as  a  null-terminated
		  string;  yytext is either an external character array or a pointer to a charac-
		  ter string. As explained in Definitions in lex,  the	type  can  be  explicitly
		  selected using the %array or %pointer declarations, but the default is %array.

	   o	  The external int yyleng is set to the length of the matching string.

	   o	  The expression's corresponding program fragment, or action, is executed.

       During  pattern matching, lex searches the set of patterns for the single longest possible
       match. Among rules that match the same number of characters, the rule given first will  be

       The general format of lex source is:

	 User Subroutines

       The  first  %%  is  required  to  mark the beginning of the rules (regular expressions and
       actions); the second %% is required only if user subroutines follow.

       Any line in the Definitions in lex section  beginning  with  a  blank  character  will  be
       assumed	to  be a C program fragment and will be copied to the external definition area of
       the lex.yy.c file. Similarly, anything in the Definitions in lex section included  between
       delimiter  lines  containing  only %{ and %} will also be copied unchanged to the external
       definition area of the lex.yy.c file.

       Any such input (beginning with a blank character or within  %{  and  %}	delimiter  lines)
       appearing  at  the  beginning  of the Rules section before any rules are specified will be
       written to lex.yy.c after the declarations of variables for the yylex function and  before
       the first line of code in yylex. Thus, user variables local to yylex can be declared here,
       as well as application code to execute upon entry to yylex.

       The action taken by lex when encountering any input beginning with a  blank  character  or
       within  %{  and	%} delimiter lines appearing in the Rules section but coming after one or
       more rules is undefined. The presence of such input may result in an erroneous  definition
       of the yylex function.

   Definitions in lex
       Definitions in lex appear before the first %% delimiter. Any line in this section not con-
       tained between %{ and %} lines and not beginning with a	blank  character  is  assumed  to
       define a lex substitution string. The format of these lines is:

	 name	substitute

       If a name does not meet the requirements for identifiers in the ISO C standard, the result
       is undefined. The string substitute will replace the string { name } when it is used in	a
       rule.  The name string is recognized in this context only when the braces are provided and
       when it does not appear within a bracket expression or within double-quotes.

       In the Definitions in lex section, any line beginning with a %  (percent  sign)	character
       and  followed  by an alphanumeric word beginning with either s or S defines a set of start
       conditions. Any line beginning with a % followed by a word beginning with either  x  or	X
       defines	a set of exclusive start conditions. When the generated scanner is in a %s state,
       patterns with no state specified will be also active; in a %x state,  such  patterns  will
       not be active. The rest of the line, after the first word, is considered to be one or more
       blank-character-separated names of start conditions. Start condition names are constructed
       in the same way as definition names. Start conditions can be used to restrict the matching
       of regular expressions to one or more states as described in Regular expressions in lex.

       Implementations accept either of the following two mutually exclusive declarations in  the
       Definitions in lex section:

       %array	    Declare the type of yytext to be a null-terminated character array.

       %pointer     Declare  the  type	of  yytext to be a pointer to a null-terminated character

       Note: When using the %pointer option, you may not also use the yyless  function	to  alter

       %array  is  the	default. If %array is specified (or neither %array nor %pointer is speci-
       fied), then the correct way to make an external reference to yyext is with  a  declaration
       of the form:

       extern char yytext[]

       If %pointer is specified, then the correct external reference is of the form:

       extern char *yytext;

       lex  will accept declarations in the Definitions in lex section for setting certain inter-
       nal table sizes. The declarations are shown in the following table.

       Table Size Declaration in lex

       | Declaration		   Description			Default    |
       |%pn		Number of positions		     2500	   |
       |%nn		Number of states		     500	   |
       |%a n		Number of transitions		     2000	   |
       |%en		Number of parse tree nodes	     1000	   |
       |%kn		Number of packed character classes   10000	   |
       |%on		Size of the output array	     3000	   |

       Programs generated by lex need either the -e or -w option to handle  input  that  contains
       EUC  characters	from  supplementary  codesets.	If neither of these options is specified,
       yytext is of the type char[], and the generated program can handle only ASCII characters.

       When the -e option is used, yytext is of the type unsigned char[]  and  yyleng  gives  the
       total  number  of  bytes  in  the  matched  string.  With this option, the macros input(),
       unput(c), and output(c) should do a byte-based I/O in the same way  as  with  the  regular
       ASCII lex. Two more variables are available with the -e option, yywtext and yywleng, which
       behave the same as yytext and yyleng would under the -w option.

       When the -w option is used, yytext is of the type wchar_t[] and	yyleng	gives  the  total
       number  of characters in the matched string.  If you supply your own input(), unput(c), or
       output(c) macros with this option, they must return or accept EUC characters in	the  form
       of  wide  character  (wchar_t). This allows a different interface between your program and
       the lex internals, to expedite some programs.

   Rules in lex
       The Rules in lex source files are a table  in  which  the  left	column	contains  regular
       expressions  and  the  right  column contains actions (C program fragments) to be executed
       when the expressions are recognized.

	 ERE action
	 ERE action

       The extended regular expression (ERE) portion of a row will be separated  from  action  by
       one  or	more blank characters. A regular expression containing blank characters is recog-
       nized under one of the following conditions:

	   o	  The entire expression appears within double-quotes.

	   o	  The blank characters appear within double-quotes or square brackets.

	   o	  Each blank character is preceded by a backslash character.

   User Subroutines in lex
       Anything in the user subroutines section will be copied to lex.yy.c following yylex.

   Regular Expressions	   in lex
       The lex utility supports the set of  Extended  Regular  Expressions  (EREs)  described  on
       regex(5) with the following additions and exceptions to the syntax:

       ...	     Any  string  enclosed  in double-quotes will represent the characters within
		     the double-quotes as themselves, except that backslash escapes (which appear
		     in  the  following  table)  are recognized. Any backslash-escape sequence is
		     terminated by the closing quote. For example, "\01""1" represents	a  single
		     string: the octal value 1 followed by the character 1.


       <state1, state2, ...>r

	   The	regular expression r will be matched only when the program is in one of the start
	   conditions indicated by state, state1, and so forth. For more information, see Actions
	   in lex. As an exception to the typographical conventions of the rest of this document,
	   in this case <state> does not represent a metavariable, but the literal  angle-bracket
	   characters surrounding a symbol. The start condition is recognized as such only at the
	   beginning of a regular expression.


	   The regular expression r will be matched only if it is followed by  an  occurrence  of
	   regular  expression x. The token returned in yytext will only match r. If the trailing
	   portion of r matches the beginning of x, the result is unspecified. The  r  expression
	   cannot  include further trailing context or the $ (match-end-of-line) operator; x can-
	   not include the ^ (match-beginning-of-line) operator, nor trailing context, nor the	$
	   operator. That is, only one occurrence of trailing context is allowed in a lex regular
	   expression, and the ^ operator only can be used at the beginning of	such  an  expres-
	   sion.  A further restriction is that the trailing-context operator / (slash) cannot be
	   grouped within parentheses.


	   When name is one of the substitution symbols from the Definitions section, the string,
	   including  the enclosing braces, will be replaced by the substitute value. The substi-
	   tute value will be treated in the extended regular expression as if it  were  enclosed
	   in  parentheses.  No substitution will occur if {name} occurs within a bracket expres-
	   sion or within double-quotes.

       Within an ERE, a backslash character (\\, \a, \b, \f, \n, \r, \t,  \v)  is  considered  to
       begin an escape sequence. In addition, the escape sequences in the following table will be

       A literal newline character cannot occur within an ERE; the escape sequence \n can be used
       to represent a newline character. A newline character cannot be matched by a period opera-

       Escape Sequences in lex

       |Escape Sequences in lex 							      |
       |    Escape Sequence		  Description			    Meaning	      |
       |	\digits 	  A  backslash	character  fol-   The  character whose encod- |
       |			  lowed by the longest sequence   ing is represented  by  the |
       |			  of one, two or  three  octal-   one-,  two-  or three-digit |
       |			  digit  characters (01234567).   octal  integer.  Multi-byte |
       |			  Ifall of the	digits	are  0,   characters  require  multi- |
       |			  (that  is,  representation of   ple,	concatenated   escape |
       |			  the	NUL   character),   the   sequences   of  this	type, |
       |			  behavior is undefined.	  including the leading \ for |
       |							  each byte.		      |
       |       \xdigits 	  A  backslash	character  fol-   The character whose  encod- |
       |			  lowed by the longest sequence   ing  is  represented by the |
       |			  of  hexadecimal-digit charac-   hexadecimal integer.	      |
       |			  ters	(01234567abcdefABCDEF). 			      |
       |			  If  all  of the digits are 0, 			      |
       |			  (that is,  representation  of 			      |
       |			  the	NUL   character),   the 			      |
       |			  behavior is undefined.				      |
       |	  \c		  A  backslash	character  fol-   The character c, unchanged. |
       |			  lowed  by  any  character not 			      |
       |			  described  in   this	 table. 			      |
       |			  (\\, \a, \b, \f, \en, \r, \t, 			      |
       |			  \v).							      |

       The order of precedence given to extended regular expressions for lex is as shown  in  the
       following table, from high to low.

       Note:	 The escaped characters entry is not meant to imply that these are operators, but
		 they are included in the table to show their relationships to	the  true  opera-
		 tors.	The  start  condition, trailing context and anchoring notations have been
		 omitted from the table because of the placement restrictions described  in  this
		 section; they can only appear at the beginning or ending of an ERE.

       |      ERE Precedence in lex					|
       |collation-related bracket symbols   [= =]  [: :]  [. .] 	|
       |escaped characters		    \<special character>	|
       |bracket expression		    [ ] 			|
       |quoting 			    "..."			|
       |grouping			    ()				|
       |definition			    {name}			|
       |single-character RE duplication     * + ?			|
       |concatenation							|
       |interval expression		    {m,n}			|
       |alternation			    |				|

       The ERE anchoring operators (^ and $) do not appear in the table. With lex regular expres-
       sions, these operators are restricted in their use: the ^ operator can only be used at the
       beginning  of an entire regular expression, and the $ operator only at the end. The opera-
       tors apply to the entire regular expression. Thus, for example, the pattern  (^abc)|(def$)
       is  undefined;  it  can	instead  be  written  as two separate rules, one with the regular
       expression ^abc and one with def$, which share a common action via the  special	|  action
       (see below). If the pattern were written ^abc|def$, it would match either of abc or def on
       a line by itself.

       Unlike the general ERE rules, embedded anchoring is not allowed	by  most  historical  lex
       implementations.  An example of embedded anchoring would be for patterns such as (^)foo($)
       to match foo when it exists as a complete word. This functionality can be  obtained  using
       existing lex features:

	 ^foo/[ \n]|
	 " foo"/[ \n]	 /* found foo as a separate word */

       Notice also that $ is a form of trailing context (it is equivalent to /\n and as such can-
       not be used with regular expressions containing another instance of the operator (see  the
       preceding discussion of trailing context).

       The  additional	regular expressions trailing-context operator / (slash) can be used as an
       ordinary character if presented within double-quotes, "/"; preceded by a backslash, \/; or
       within  a  bracket expression, [/]. The start-condition < and > operators are special only
       in a start condition at the beginning of a regular expression; elsewhere  in  the  regular
       expression they are treated as ordinary characters.

       The following examples clarify the differences between lex regular expressions and regular
       expressions appearing elsewhere in this document. For regular expressions of the form r/x,
       the  string  matching  r  is  always returned; confusion may arise when the beginning of x
       matches the trailing portion of r. For example, given the regular  expression  a*b/cc  and
       the  input aaabcc, yytext would contain the string aaab on this match. But given the regu-
       lar expression x*/xy and the input xxxy, the token xxx, not xx, is returned by some imple-
       mentations because xxx matches x*.

       In the rule ab*/bc, the b* at the end of r will extend r's match into the beginning of the
       trailing context, so the result is unspecified. If this rule were ab/bc, however, the rule
       matches	the text ab when it is followed by the text bc. In this latter case, the matching
       of r cannot extend into the beginning of x, so the result is specified.

   Actions in lex
       The action to be taken when an ERE is matched can be a C program fragment or  the  special
       actions	described  below;  the program fragment can contain one or more C statements, and
       can also include special actions. The empty C statement ; is a valid action; any string in
       the  lex.yy.c input that matches the pattern portion of such a rule is effectively ignored
       or skipped. However, the absence of an action is not valid, and the action  lex	takes  in
       such a condition is undefined.

       The  specification  for	an action, including C statements and special actions, can extend
       across several lines if enclosed in braces:

	 ERE <one or more blanks> { program statement
	 program statement }

       The default action when a string in the input to a lex.yy.c program is not matched by  any
       expression  is to copy the string to the output. Because the default behavior of a program
       generated by lex is to read the input and copy it to the output, a minimal lex source pro-
       gram  that  has	just  %% generates a C program that simply copies the input to the output

       Four special actions are available:


       |	   The action | means that the action for the next rule is the	action	for  this
		   rule.  Unlike  the  other  three actions, | cannot be enclosed in braces or be
		   semicolon-terminated. It must be specified alone, with no other actions.

       ECHO;	   Writes the contents of the string yytext on the output.

       REJECT;	   Usually only a single expression is matched by a given string  in  the  input.
		   REJECT means "continue to the next expression that matches the current input,"
		   and causes whatever rule was the second choice after the current  rule  to  be
		   executed  for the same input. Thus, multiple rules can be matched and executed
		   for one input string or overlapping input strings. For example, given the reg-
		   ular  expressions  xyz  and	xy  and  the  input xyz, usually only the regular
		   expression xyz would match. The next attempted match would start after  z.  If
		   the	last  action  in  the xyz rule is REJECT , both this rule and the xy rule
		   would be executed. The REJECT action may be implemented in such a fashion that
		   flow of control does not continue after it, as if it were equivalent to a goto
		   to another part of yylex. The use of REJECT may result in somewhat larger  and
		   slower scanners.

       BEGIN	   The action:

		   BEGIN newstate;

		   switches  the  state (start condition) to newstate. If the string newstate has
		   not been declared previously as a start condition in the  Definitions  in  lex
		   section,  the  results  are unspecified. The initial state is indicated by the
		   digit 0 or the token INITIAL.

       The functions or macros described below are accessible to user code included  in  the  lex
       input.  It is unspecified whether they appear in the C code output of lex, or are accessi-
       ble only through the -l l operand to c89 or cc (the lex library).

       int yylex(void)	    Performs lexical analysis on the input; this is the primary  function
			    generated  by the lex utility. The function returns zero when the end
			    of input is reached; otherwise it returns  non-zero  values  (tokens)
			    determined by the actions that are selected.

       int yymore(void)     When called, indicates that when the next input string is recognized,
			    it is to be appended to the  current  value  of  yytext  rather  than
			    replacing it; the value in yyleng is adjusted accordingly.

       intyyless(int n)     Retains  n	initial  characters in yytext, NUL-terminated, and treats
			    the remaining characters as if they had not been read; the	value  in
			    yyleng is adjusted accordingly.

       int input(void)	    Returns the next character from the input, or zero on end-of-file. It
			    obtains input from the stream pointer yyin, although possibly via  an
			    intermediate  buffer.  Thus,  once	scanning has begun, the effect of
			    altering the value of  yyin  is  undefined.  The  character  read  is
			    removed  from  the input stream of the scanner without any processing
			    by the scanner.

       int unput(int c)     Returns the character c to the input; yytext and yyleng are undefined
			    until  the	next expression is matched. The result of using unput for
			    more characters than have been input is unspecified.

       The following functions appear only in the lex library accessible through the -l  l  oper-
       and; they can therefore be redefined by a portable application:

       int yywrap(void)

	   Called by yylex at end-of-file; the default yywrap always will return 1. If the appli-
	   cation requires yylex to continue processing with another source of	input,	then  the
	   application	can  include  a  function  yywrap, which associates another file with the
	   external variable FILE *yyin and will return a value of zero.

       int main(int argc, char *argv[])

	   Calls yylex to perform lexical analysis, then exits. The user code can contain main to
	   perform application-specific operations, calling yylex as applicable.

       The  reason  for  breaking  these functions into two lists is that only those functions in
       libl.a can be reliably redefined by a portable application.

       Except for input, unput and main, all external and static names	generated  by  lex  begin
       with the prefix yy or YY.

       Portable  applications  are  warned  that  in  the Rules in lex section, an ERE without an
       action is not acceptable, but need not be detected as erroneous by lex. This may result in
       compilation or run-time errors.

       The purpose of input is to take characters off the input stream and discard them as far as
       the lexical analysis is concerned. A common use is to discard the body of a  comment  once
       the beginning of a comment is recognized.

       The  lex utility is not fully internationalized in its treatment of regular expressions in
       the lex source code or generated lexical analyzer. It would seem  desirable  to	have  the
       lexical	analyzer  interpret  the regular expressions given in the lex source according to
       the environment specified when the lexical analyzer is executed, but this is not  possible
       with  the  current  lex	technology. Furthermore, the very nature of the lexical analyzers
       produced by lex must be closely tied to the lexical requirements  of  the  input  language
       being described, which will frequently be locale-specific anyway. (For example, writing an
       analyzer that is used for French text will not  automatically  be  useful  for  processing
       other languages.)

       Example 1 Using lex

       The  following  is an example of a lex program that implements a rudimentary scanner for a
       Pascal-like syntax:

	 /* need this for the call to atof() below */
	 #include <math.h>
	 /* need this for printf(), fopen() and stdin below */
	 #include <stdio.h>

	 DIGIT	  [0-9]
	 ID	  [a-z][a-z0-9]*

	 {DIGIT}+			   {
				printf("An integer: %s (%d)\n", yytext,

	 {DIGIT}+"."{DIGIT}*	{
				printf("A float: %s (%g)\n", yytext,

	 if|then|begin|end|procedure|function	     {
				printf("A keyword: %s\n", yytext);

	 {ID}			printf("An identifier: %s\n", yytext);

	 "+"|"-"|"*"|"/"	printf("An operator: %s\n", yytext);

	 "{"[^}\n]*"}"	       /* eat up one-line comments */

	 [ \t\n]+		/* eat up white space */

	 .			printf("Unrecognized character: %s\n", yytext);


	 int main(int argc, char *argv[])
			       ++argv, --argc;	/* skip over program name */
			       if (argc > 0)
											   yyin = fopen(argv[0], "r");
			       yyin = stdin;


       See environ(5) for descriptions of the following environment  variables	that  affect  the
       execution of lex: LANG, LC_ALL, LC_COLLATE, LC_CTYPE, LC_MESSAGES, and NLSPATH.

       The following exit values are returned:

       0      Successful completion.

       >0     An error occurred.

       See attributes(5) for descriptions of the following attributes:

       |      ATTRIBUTE TYPE	     |	    ATTRIBUTE VALUE	   |
       |Availability		     |SUNWbtool 		   |
       |Interface Stability	     |Standard			   |

       yacc(1), attributes(5), environ(5), regex(5), standards(5)

       If  routines such as yyback(), yywrap(), and yylock() in .l (ell) files are to be external
       C functions, the command line to compile a C++ program must define the __EXTERN_C__ macro.
       For example:

	 example%  CC -D__EXTERN_C__ ... file

SunOS 5.11				   22 Aug 1997					   lex(1)
Unix & Linux Commands & Man Pages : ©2000 - 2018 Unix and Linux Forums

All times are GMT -4. The time now is 08:38 AM.