Lex Theory
During the first phase the compiler
reads the input and converts strings in the source to tokens. With regular
expressions we can specify patterns to lex so it can generate code that will
allow it to scan and match strings in the input. Each pattern in the input to
lex has an associated action. Typically an action returns a token that
represents the matched string for subsequent use by the parser. Initially we
will simply print the matched string rather than return a token value.
The following represents a simple
pattern, composed of a regular expression, that scans for identifiers. Lex will
read this pattern and produce C code for a lexical analyzer that scans for
identifiers.
letter(letter|digit)*
This pattern matches a string of
characters that begins with a single letter followed by zero or more letters or
digits. This example nicely illustrates operations allowed in regular
expressions:
- repetition,
expressed by the "
*
" operator
- alternation,
expressed by the "
|
" operator
- concatenation
Any regular expression expressions may be expressed as a finite
state automaton (FSA). We can represent an FSA using states, and transitions
between states. There is one start state and one or more final or accepting
states.
Figure 3: Finite State Automaton
In Figure 3 state 0 is the start
state and state 2 is the accepting state. As characters are read we make a
transition from one state to another. When the first letter is read we transition
to state 1. We remain in state 1 as more letters or digits are read. When we
read a character other than a letter or digit we transition to accepting state
2. Any FSA may be expressed as a computer
program.
For example, our 3-state machine is
easily programmed:
start: goto state0
state0: read c
if c = letter goto state1
goto state0
state1: read c
if c = letter goto state1
if c = digit goto state1
goto state2
state2: accept string
This is the technique used by lex.
Regular expressions are translated by lex to a computer program that mimics an
FSA. Using the next input character and current state the next state is easily determined by
indexing into a computer-generated state table.
Now we can easily understand some
of lex’s limitations. For example, lex cannot be used to recognize nested
structures such as parentheses. Nested structures are handled by incorporating
a stack. Whenever we encounter a "(
" we push it on the stack. When
a ")
"
is encountered we match it with the top of the stack and pop the stack. However
lex only has states and transitions between states. Since it has no stack it is
not well suited for parsing nested structures. Yacc augments an FSA with a
stack and can process constructs such as parentheses with ease. The important
thing is to use the right tool for the job. Lex is good at pattern matching.
Yacc is appropriate for more challenging tasks.
Lex Practice
Table 1: Pattern Matching Primitives
|
Metacharacter
|
Matches
|
.
|
any character except newline
|
\n
|
newline
|
*
|
zero or more copies of the preceding
expression
|
+
|
one or more copies of the preceding
expression
|
?
|
zero or one copy of the preceding
expression
|
^
|
beginning of line
|
$
|
end of line
|
a|b
|
a or b
|
(ab)+
|
one or more copies of ab (grouping)
|
"a+b"
|
literal "a+b " (C escapes still work)
|
[]
|
character class
|
Table 2: Pattern Matching Examples
|
Expression
|
Matches
|
abc
|
abc
|
abc*
|
ab abc abcc
abccc ...
|
abc+
|
abc, abcc,
abccc, abcccc, ...
|
a(bc)+
|
abc, abcbc,
abcbcbc, ...
|
a(bc)?
|
a, abc
|
[abc]
|
one of: a, b, c
|
[a-z]
|
any letter, a through z
|
[a\-z]
|
one of: a, -, z
|
[-az]
|
one of: - a z
|
[A-Za-z0-9]+
|
one or more alphanumeric characters
|
[ \t\n]+
|
whitespace
|
[^ab]
|
anything except: a, b
|
[a^b]
|
a, ^, b
|
[a|b]
|
a, |, b
|
a|b
|
a, b
|
Regular expressions in lex are
composed of metacharacters (Table 1). Pattern-matching examples are shown in
Table 2. Within a character class normal operators lose their meaning. Two
operators allowed in a character class are the hyphen ("-
") and circumflex ("^
"). When used between two
characters the hyphen represents a range of characters. The circumflex, when
used as the first character, negates the expression. If two patterns match the
same string the longest match wins. In case both matches are the same length,
then the first pattern listed is used.
... definitions ...
%%
... rules ...
%%
... subroutines ...
|
Input to Lex is divided into three
sections with %%
dividing
the sections. This is best illustrated by example. The first example is the
shortest possible lex file:
%%
Input is copied to output one
character at a time. The first %%
is
always required as there must always be a rules section. However if we don’t
specify any rules then the default action is to match everything and copy it to
output. Defaults for input and output are stdin
and stdout
, respectively. Here is the same
example with defaults explicitly coded:
%%
/* match everything except newline */
. ECHO;
/* match newline */
\n ECHO;
%%
int yywrap(void) {
return 1;
}
int main(void) {
yylex();
return 0;
}
Two patterns have been specified in
the rules section. Each pattern must begin in column one. This is followed by whitespace (space, tab or newline) and an
optional action associated with the pattern. The action may be a single C
statement, or multiple C statements, enclosed in braces. Anything not starting
in column one is copied verbatim to the generated C file. We may take advantage
of this behavior to specify comments in our lex file. In this example there are
two patterns, ".
"
and "\n
",
with an ECHO
action
associated for each pattern. Several macros and variables are predefined by
lex. ECHO
is
a macro that writes code matched by the pattern. This is the default action for
any unmatched strings. Typically, ECHO
is
defined as:
#define ECHO fwrite(yytext, yyleng, 1, yyout)
Variable yytext
is a pointer to the matched string
(NULL-terminated) and yyleng
is the length of the matched string.
Variable yyout
is the output file and defaults to stdout
. Function yywrap
is called by lex when input is
exhausted. Return 1 if you are done or 0 if more processing is required. Every
C program requires a main
function.
In this case we simply call yylex
that is the main entry-point for lex.
Some implementations of lex include copies of main
and yywrap
in a library thus eliminating the need
to code them explicitly. This is why our first example, the shortest lex
program, functioned properly.
Table 3: Lex Predefined Variables
|
Name
|
Function
|
int
yylex(void)
|
call to invoke lexer, returns token
|
char *yytext
|
pointer to matched string
|
yyleng
|
length of matched string
|
yylval
|
value associated with token
|
int
yywrap(void)
|
wrapup, return 1 if done, 0 if not
done
|
FILE *yyout
|
output file
|
FILE *yyin
|
input file
|
INITIAL
|
initial start condition
|
BEGIN
condition
|
switch start condition
|
ECHO
|
write matched string
|
Here is a program that does nothing
at all. All input is matched but no action is associated with any pattern so
there will be no output.
%%
.
\n
The following example prepends line
numbers to each line in a file. Some implementations of lex predefine and calculate yylineno
. The input file for lex is yyin
and
defaults to stdin
.
%{
int yylineno;
%}
%%
^(.*)\n printf("%4d\t%s", ++yylineno, yytext);
%%
int main(int argc, char *argv[]) {
yyin = fopen(argv[1], "r");
yylex();
fclose(yyin);
}
The definitions section is composed
of substitutions, code, and start states. Code in the definitions section is
simply copied as-is to the top of the generated C file and must be bracketed
with "%
{"
and "%
}"
markers. Substitutions simplify pattern-matching rules. For example, we may
define digits and letters:
digit [0-9]
letter [A-Za-z]
%{
int count;
%}
%%
/* match identifier */
{letter}({letter}|{digit})* count++;
%%
int main(void) {
yylex();
printf("number of identifiers = %d\n", count);
return 0;
}
Whitespace must separate the
defining term and the associated expression. References to substitutions in the
rules section are surrounded by braces ({letter}
) to distinguish them from
literals. When we have a match in the rules section the associated C code is
executed. Here is a scanner that counts the number of characters, words, and
lines in a file (similar to Unix wc):
%{
int nchar, nword, nline;
%}
%%
\n { nline++; nchar++; }
[^ \t\n]+ { nword++, nchar += yyleng; }
. { nchar++; }
%%
int main(void) {
yylex();
printf("%d\t%d\t%d\n", nchar, nword, nline);
return 0;
}
Yacc Theory
Grammars for yacc are described
using a variant of Backus Naur Form (BNF). This technique, pioneered by John
Backus and Peter Naur, was used to describe ALGOL60. A BNF grammar can be used
to express context-free languages. Most constructs in modern
programming languages can be represented in BNF. For example, the grammar for
an expression that multiplies and adds numbers is
1 E -> E + E
2 E -> E * E
3 E -> id
Three productions have been
specified. Terms that appear on the left-hand side (lhs) of a production, such
as E
, are nonterminals. Terms such as id
(identifier)
are terminals (tokens returned by lex) and only appear on the right-hand side
(rhs) of a production. This grammar specifies that an expression may be the sum
of two expressions, the product of two expressions, or an identifier. We can
use this grammar to generate expressions:
E -> E * E (r2)
-> E * z (r3)
-> E + E * z (r1)
-> E + y * z (r3)
-> x + y * z (r3)
At each step we expanded a term and
replace the lhs of a production with the corresponding rhs. The numbers on the
right indicate which rule applied. To parse an expression we a need to do the
reverse operation. Instead of starting with a single nonterminal (start symbol)
and generating an expression from a grammar we need to reduce an expression to a single nonterminal.
This is known asbottom-up or shift-reduce parsing and uses a stack for storing
terms. Here is the same derivation but in reverse order:
1 . x + y * z shift
2 x . + y * z reduce(r3)
3 E . + y * z shift
4 E + . y * z shift
5 E + y . * z reduce(r3)
6 E + E . * z shift
7 E + E * . z shift
8 E + E * z . reduce(r3)
9 E + E * E . reduce(r2) emit multiply
10 E + E . reduce(r1) emit add
11 E . accept
Terms to the left of the dot are on
the stack while remaining input is to the right of the dot. We start by
shifting tokens onto the stack. When the top of the stack matches the rhs of a
production we replace the matched tokens on the stack with the lhs of the
production. In other words the matched tokens of the rhs are popped off the
stack, and the lhs of the production is pushed on the stack. The matched tokens
are known as a handle and we are reducing the handle to the lhs of the
production. This process continues until we have shifted all input to the stack
and only the starting nonterminal remains on the stack. In step 1 we shift the x
to
the stack. Step 2 applies rule r3 to the stack to change x
to E
. We continue shifting and reducing
until a single nonterminal, the start symbol, remains in the stack. In step 9,
when we reduce rule r2, we emit the multiply instruction. Similarly the add
instruction is emitted in step 10. Consequently multiply has a higher
precedence than addition.
Consider the shift at step 6.
Instead of shifting we could have reduced and apply rule r1. This would result
in addition having a higher precedence than multiplication. This is known as a shift-reduceconflict. Our
grammar is ambiguous because there is more than one
possible derivation that will yield the expression. In this case operator
precedence is affected. As another example, associativity in the rule
E -> E + E
is ambiguous, for we may recurse on
the left or the right. To remedy the situation, we could rewrite the grammar or
supply yacc with directives that indicate which operator has precedence. The
latter method is simpler and will be demonstrated in the practice section.
The following grammar has a reduce-reduce conflict. With an id
on
the stack we may reduce to T
or E
.
E -> T
E -> id
T -> id
Yacc takes a default action when
there is a conflict. For shift-reduce conflicts yacc will shift. For
reduce-reduce conflicts it will use the first rule in the listing. It also
issues a warning message whenever a conflict exists. The warnings may be
suppressed by making the grammar unambiguous. Several methods for removing
ambiguity will be presented in subsequent sections.
Yacc Practice, Part I
... definitions ...
%%
... rules ...
%%
... subroutines ...
|
Input to yacc is divided into three
sections. The definitions section consists of token declarations
and C code bracketed by "%{
" and "%}
". The BNF grammar is placed in
the rules section and user subroutines are added
in the subroutines section.
This is best illustrated by
constructing a small calculator that can add and subtract numbers. We’ll begin
by examining the linkage between lex and yacc. Here is the definitions section
for the yacc input file:
%token INTEGER
This definition declares an INTEGER
token. Yacc generates a parser in file y.tab.c
and an include file y.tab.h
:
#ifndef YYSTYPE
#define YYSTYPE int
#endif
#define INTEGER 258
extern YYSTYPE yylval;
Lex includes this file and utilizes
the definitions for token values. To obtain tokens yacc calls yylex
. Function yylex
has a return type of int
that
returns a token. Values associated with the token are returned by lex in
variable yylval
. For example,
[0-9]+ {
yylval = atoi(yytext);
return INTEGER;
}
would store the value of the
integer in yylval
, and return token INTEGER
to yacc. The type of yylval
is determined by YYSTYPE
. Since the default type is
integer this works well in this case. Token values 0-255 are reserved for
character values. For example, if you had a rule such as
[-+] return *yytext; /* return operator */
the character value for minus or
plus is returned. Note that we placed the minus sign first so that it wouldn’t
be mistaken for a range designator. Generated token values typically start
around 258 because lex reserves several values for end-of-file and error
processing. Here is the complete lex input specification for our calculator:
%{
#include "y.tab.h"
#include <stdlib.h>
void yyerror(char *);
%}
%%
[0-9]+ {
yylval = atoi(yytext);
return INTEGER;
}
[-+\n] return *yytext;
[ \t] ; /* skip whitespace */
. yyerror("invalid character");
%%
int yywrap(void) {
return 1;
}
Internally yacc maintains two
stacks in memory; a parse stack and a value stack. The parse stack contains
terminals and nonterminals that represent the current parsing state. The value
stack is an array of YYSTYPE
elements and associates a value with
each element in the parse stack. For example when lex returns an INTEGER
token yacc shifts this token to the
parse stack. At the same time the corresponding yylval
is shifted to the value stack. The
parse and value stacks are always synchronized so finding a value related to a
token on the stack is easily accomplished. Here is the yacc input specification
for our calculator:
%{
#include <stdio.h>
int yylex(void);
void yyerror(char *);
%}
%token INTEGER
%%
program:
program expr '\n' { printf("%d\n", $2); }
|
;
expr:
INTEGER { $$ = $1; }
| expr '+' expr { $$ = $1 + $3; }
| expr '-' expr { $$ = $1 - $3; }
;
%%
void yyerror(char *s) {
fprintf(stderr, "%s\n", s);
}
int main(void) {
yyparse();
return 0;
}
The rules section resembles the BNF
grammar discussed earlier. The left-hand side of a production, or nonterminal,
is entered left-justified and followed by a colon. This is followed by the
right-hand side of the production. Actions associated with a rule are entered
in braces.
With left-recursion we have
specified that a program consists of zero or more expressions. Each expression
terminates with a newline. When a newline is detected we print the value of the
expression. When we apply the rule
expr: expr '+' expr { $$ = $1 + $3; }
we replace the right-hand side of
the production in the parse stack with the left-hand side of the same
production. In this case we pop "expr '+' expr
" and push "expr
". We have reduced the stack
by popping three terms off the stack and pushing back one term. We may
reference positions in the value stack in our C code by specifying "$1
" for the first term on the
right-hand side of the production, "$2
" for the second, and so on.
"$$
"
designates the top of the stack after reduction has taken place. The above
action adds the value associated with two expressions, pops three terms off the
value stack, and pushes back a single sum. As a consequence the parse and value
stacks remain synchronized.
Numeric values are initially
entered on the stack when we reduce from INTEGER
to expr
. After INTEGER
is shifted to the stack we apply the
rule
expr: INTEGER { $$ = $1; }
The INTEGER
token is popped off the parse stack
followed by a push of expr
. For the value stack we pop the
integer value off the stack and then push it back on again. In other words we
do nothing. In fact this is the default action and need not be specified.
Finally, when a newline is encountered, the value associated with expr
is
printed.
In the event of syntax errors yacc
calls the user-supplied function yyerror
. If you need to modify the
interface to yyerror then alter the canned file that yacc includes to fit your
needs. The last function in our yacc specification is main
…
in case you were wondering where it was. This example still has an ambiguous
grammar. Although yacc will issue shift-reduce warnings it will still process
the grammar using shift as the default operation.
Yacc Practice, Part
II
In this section we will extend the
calculator from the previous section to incorporate some new functionality. New
features include arithmetic operators multiply and divide. Parentheses may be
used to over-ride operator precedence, and single-character variables may be
specified in assignment statements. The following illustrates sample input and
calculator output:
user: 3 * (4 + 5)
calc: 27
user: x = 3 * (4 + 5)
user: y = 5
user: x
calc: 27
user: y
calc: 5
user: x + 2*y
calc: 37
The lexical analyzer returns VARIABLE
and INTEGER
tokens. For variables yylval
specifies an index to the symbol table sym
. For this program sym
merely
holds the value of the associated variable. When INTEGER
tokens are returned, yylval
contains the number scanned. Here is
the input specification for lex:
%{
#include <stdlib.h>
#include "y.tab.h"
void yyerror(char *);
%}
%%
/* variables */
[a-z] {
yylval = *yytext - 'a';
return VARIABLE;
}
/* integers */
[0-9]+ {
yylval = atoi(yytext);
return INTEGER;
}
/* operators */
[-+()=/*\n] { return *yytext; }
/* skip whitespace */
[ \t] ;
/* anything else is an error */
. yyerror("invalid character");
%%
int yywrap(void) {
return 1;
}
The input specification for yacc
follows. The tokens for INTEGER
and VARIABLE
are utilized by yacc to create #defines
in y.tab.h
for use in lex. This is followed by
definitions for the arithmetic operators. We may specify %left
, for left-associative or %right
for right associative. The last
definition listed has the highest precedence. Consequently multiplication and
division have higher precedence than addition and subtraction. All four
operators are left-associative. Using this simple technique we are able to
disambiguate our grammar.
%token INTEGER VARIABLE
%left '+' '-'
%left '*' '/'
%{
void yyerror(char *);
int yylex(void);
int sym[26];
%}
%%
program:
program statement '\n'
|
;
statement:
expr { printf("%d\n", $1); }
| VARIABLE '=' expr { sym[$1] = $3; }
;
expr:
INTEGER
| VARIABLE { $$ = sym[$1]; }
| expr '+' expr { $$ = $1 + $3; }
| expr '-' expr { $$ = $1 - $3; }
| expr '*' expr { $$ = $1 * $3; }
| expr '/' expr { $$ = $1 / $3; }
| '(' expr ')' { $$ = $2; }
;
%%
void yyerror(char *s) {
fprintf(stderr, "%s\n", s);
}
int main(void) {
yyparse();
return 0;
}