參考資料
-《lex & yacc 2nd》:下載下傳位址參考 http://blog.csdn.net/a_flying_bird/article/details/52486815
本文即此書的學習筆記。
Lex
要點
擴充名
lex檔案通常使用的字尾名: .l, .ll, .lex。——實際上,可以是任意的名稱。
檔案結構
檔案内容分為三部分,各個部分之間以 %% 分隔:
%{
/* part 1: Definition Section. e.g.: Global declaration of C. */
%}
%%
/* part 2: Rules section. Rule = Pattern + Action. */
%%
part 3: C codes.
注意,%} 不要寫成 }% 了,否則 premature EOF。
%{ 和 %} 之間的内容會原封不動地拷貝到最後生成的c檔案中,是以這裡可以是任何合法的C代碼。通常而言,這裡放lex檔案後面C代碼要用到的一些東西。
lex檔案生成c檔案
使用lex指令,把lex檔案轉換成c檔案(lex.yy.c);在生成可執行檔案的時候,要連結庫檔案l。示例:
lex simplest.l
gcc lex.yy.c -ll -o test
示例
最簡單的例子
對應En Page 2.
代碼(simplest.l):
%%
.|\n ECHO;
%%
編譯(把lex檔案轉換成c檔案)&連結&運作:
$ ls
simplest.l
$ lex simplest.l
$ ls
lex.yy.c simplest.l
$ gcc lex.yy.c -ll -o test
$ ./test
The simplest lex program. ------ 鍵盤輸入内容
The simplest lex program. ------ 程式回顯結果
^C
$
識别單詞 Recognizing words
這個例子可以識别指定的這些單詞,其他的不認識的直接回顯。- 對應原書 ch1-02.l
代碼:
%{
/*
* this sample demonstrates (very) simple recognition:
* a verb, or not a verb.
*/
%}
%%
[\t ]+ /* ignore whitespace */ ;
is |
am |
are |
were |
was |
be |
being |
been |
do |
does |
did |
will |
would |
should |
can |
could |
has |
have |
had |
go {printf("%s: is a verb\n", yytext);}
[a-zA-Z]+ {printf("%s: is not verb\n", yytext);}
.|\n {ECHO; /* normal default anyway */ }
%%
int main()
{
yylex();
return 0;
}
運作:
$ lex recoginzing_word.l
$ gcc lex.yy.c -ll -o test
$ ./test
I am a student. You are a teacher. ------ 鍵盤輸入内容
I: is not verb
am: is a verb
a: is not verb
student: is not verb
.You: is not verb
are: is a verb
a: is not verb
teacher: is not verb
.
^C
$
要點
lex檔案的三部分:definition section, rules section, user subroutines section.
definition section可以有一段”%{“和”%}”,這中間用來放C代碼,比如#include,函數原型,全局變量等等。在由lex生成lex.yy.c的時候,這部分原封不動拷貝到C檔案中。
rules section: 每個規則由兩部分組成,即 pattern + action. 兩者由空格分開。其中pattern是正規表達式文法。lexer在識别到某個pattern後,就會執行其對應的action。——action: { C codes. }
user subroutines section: 拷貝到.c檔案的最後。
特殊的action:
- “;”: 同C語言的空餘句,即什麼也不做。——直接忽略這些輸入
- “ECHO;”: 預設行為,将比對的字元串列印到輸出檔案中(stdout,回顯)。
- “|”: 使用下一個pattern的action。——注意 | action的文法,會在pattern後面有一個空格。而作為正規表達式的|則不會有空格。
注意1: ;和ECHO;的差別:前者是忽略輸入,後者是列印到輸出。可以将示例中的ECHO;改成;後觀察輸出的變化情況。
注意2: | action不能像下面這種方法寫到同一行:
had | go {printf("%s: is a verb\n", yytext);}
變量:
- yytext: 存儲的是比對到的字元串,其類型可以在生成的.c中看到,即 extern char *yytext;
無歧義規則:每個輸入僅比對一次 + 最長比對。英文描述如下:
- Lex patterns only match a given input characer or string once.
- Lex executes th action for the longest possible match for the current input.
預設main:
這裡的例子中,定義的main()調用了yylex()。yylex()是lex定義的函數,預設情況下,如果lex檔案中沒有定義main()函數,lex系統有一個預設的main,也會去調用yylex()。
程式退出:
預設情況下,yylex()隻有處理了所有的輸入的時候,才會退出。對于控制台輸入,則要等到Ctrl+C。當然,使用者也可以主動return,即在action中增加return語句。為此,可以增加如下一個規則作驗證:
quit {printf("Program will exit normally.\n"); return 0;}
注意:這句話寫到a-zA-Z]+的前面,否則 warning, rule cannot be matched。
拓展
可以修改下面兩點,做對比分析:
[\t ]+ {printf("%s: white space\n", yytext);}
.|\n {printf("%s: Invalid word\n", yytext);}
示例:——注意觀察最後有一個換行符。
I am a student. You are a teacher. !@#$%^&*
I: is not verb
: white space
am: is a verb
: white space
a: is not verb
: white space
student: is not verb
.: Invalid word
: white space
You: is not verb
: white space
are: is a verb
: white space
a: is not verb
: white space
teacher: is not verb
.: Invalid word
: white space
!: Invalid word
@: Invalid word
#: Invalid word
$: Invalid word
%: Invalid word
^: Invalid word
&: Invalid word
*: Invalid word
: Invalid word
識别更多的單詞
對應 ch1-03.l
可以識别出動詞、副詞、介詞等等。——隻需要增加對應的rules即可。
代碼:
%{
/*
* this sample demonstrates (very) simple recognition:
* a verb, or not a verb.
*/
%}
%%
[\t ]+ {printf("%s: white space\n", yytext);}
is |
am |
are |
were |
was |
be |
being |
been |
do |
does |
did |
will |
would |
should |
can |
could |
has |
have |
had |
go {printf("%s: is a verb\n", yytext);}
very |
simple |
gently |
quietly |
calmly |
angrily {printf("%s: is an adverb\n", yytext);}
to |
from |
behind |
above |
below |
between {printf("%s: is a preposition\n", yytext);}
if |
then |
and |
but |
or {printf("%s: is a conjunction\n", yytext);}
their |
my |
your |
his |
her |
its {printf("%s: is a adjective\n", yytext);}
I |
you |
he |
she |
we |
they {printf("%s: is a pronoun\n", yytext);}
QUIT {
printf("Program will exit normally.\n");
return 0;
}
[a-zA-Z]+ {printf("%s: don't recognize\n", yytext);}
.|\n {printf("%s: Invalid word\n", yytext);}
%%
int main()
{
yylex();
return 0;
}
運作:
he is a student. and he is a teacher. QUIT (ENTER)
he: is a pronoun
: white space
is: is a verb
: white space
a: don't recognize
: white space
student: don't recognize
.: Invalid word
: white space
and: is a conjunction
: white space
he: is a pronoun
: white space
is: is a verb
: white space
a: don't recognize
: white space
teacher: don't recognize
.: Invalid word
: white space
Program will exit normally.
動态定義單詞表 lexer with symbol table
對應 ch1-03.l, 這個例子說明如何在lex中寫更複雜的C代碼。
前面的例子是把每個單詞都定義在lex檔案中,接下來對其優化。
比如,可以在檔案中按照特定文法來定義單詞的詞性:
noun dog cat horse cow
verb chew eat lick
即每行開頭一個單詞用來定義詞性,接下來的每個單詞都屬于該詞性。如此,可以在檔案中作這種定義。當然,具體到這裡的示例代碼,暫時不處理檔案輸入,而仍然從控制台輸入。這時,就有兩種輸入:
- 定義:即首字母表示詞性,接下來是一系列屬于該詞性的單詞;
- 識别:同前一個例子,要求識别出每個單詞的詞性。
代碼:
%{
#include <stdbool.h>
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
/*
* Word recognizer with a symbol table.
*/
enum {
LOOKUP = 0, /* default - looking rather than defining. */
VERB,
ADJ,
ADV,
NOUN,
PREP,
PRON,
CONJ
};
int state; // global variable, default to 0(LOOKUP).
bool add_word(int type, char *word);
int lookup_word(char *word);
%}
%%
[\t ]+ ; /* ignore whitespace */
\n {state = LOOKUP;} // end of line, return to default state.
/* Whenever a line starts with a reserved part of speech name */
/* start defining words of that type */
^verb {state = VERB;}
^adj {state = ADJ;}
^adv {state = ADV;}
^noun {state = NOUN;}
^prep {state = PREP;}
^pron {state = PRON;}
^conj {state = CONJ;}
/* a normal word, define it or look it up */
[a-zA-Z]+ {
if (state != LOOKUP) {
/* define the current word */
add_word(state, yytext);
} else {
switch(lookup_word(yytext)) {
case VERB: printf("%s: verb\n", yytext); break;
case ADJ: printf("%s: adjective\n", yytext); break;
case ADV: printf("%s: adverb\n", yytext); break;
case NOUN: printf("%s: noun\n", yytext); break;
case PREP: printf("%s: preposition\n", yytext); break;
case PRON: printf("%s: pronoun\n", yytext); break;
case CONJ: printf("%s: conjunction\n", yytext); break;
default:
printf("%s: don't recognize\n", yytext);
break;
}
}
}
[,:.] {printf("%s: punctuation, ignored.\n", yytext);}
. {printf("%s: invalid char\n", yytext);}
%%
int main()
{
yylex();
return 0;
}
/* define a linked list of words and types */
struct word {
char *word_name;
int word_type;
struct word *next;
};
struct word *word_list; /* first element in word list */
bool add_word(int type, char *word)
{
struct word *wp; // wp: word pointer
if (lookup_word(word) != LOOKUP) {
printf("!!! warning: word %s already defined.\n", word);
return false;
}
/* word not there, allocate a new entry and link it on the list */
wp = (struct word*)malloc(sizeof(struct word));
wp->next = word_list;
wp->word_name = (char*)malloc(strlen(word) + 1);
strcpy(wp->word_name, word);
wp->word_type = type;
word_list = wp;
return true;
}
int lookup_word(char *word)
{
struct word *wp = word_list;
for (; wp; wp = wp->next) {
if (strcmp(wp->word_name, word) == 0) {
return wp->word_type;
}
}
return LOOKUP;
}
這裡的枚舉值有兩個含義:
- 狀态。預設是LOOKUP狀态,即對目前輸入行的每個單詞,在詞庫/連結清單中查找其詞性(lookup_word),然後列印出來。但如果每一行的第一個單詞是noun/verb等保留字,則說明要進入defining狀态(細分為VERB等狀态),保留字後續的各個單詞将會添加到詞庫/連結清單中(add_word)。——在添加詞庫的時候,會先檢查該單詞是否已經入庫。
- 類型:詞庫中,每個單詞每個單詞對應的詞性用VERB等表示。
運作:
noun pet dog cat cats [ENTER]
verb is are [ENTER]
adj my his their [ENTER]
my pet is dog. their pets are cats. that's ok. [ENTER]
my: adjective
pet: noun
is: verb
dog: noun
.: punctuation, ignored.
their: adjective
pets: don't recognize
are: verb
cats: noun
.: punctuation, ignored.
that: don't recognize
': invalid char
s: don't recognize
ok: don't recognize
.: punctuation, ignored.
^C
yacc
前面的例子把一串字元串識别成了一個個單詞,接下來就是識别句子。
- 詞法分析:從輸入字元流中識别出一個個單詞,就是所謂的詞法分析,輸出是token。其關鍵就是定義詞法規則(正規表達式);
- 文法分析:在得到一個個單詞(包括詞性)之後,就是做更進階的分析,比如某些詞連在一起是否構成了一個正确的句子。——各個token如何組合或搭配在一起。對于不同的token 組合執行不同的action。
sentences
現在分析,如何由前面得到的noun&pronoun&verb等構造出句子(示例):
- 主語:(假定隻能是)名詞或代詞,即 subject -> noun | pronoun
- 賓語:(假定隻能是)名詞,即 object -> noun
- 句子(主謂賓):謂語隻支援動詞形式,即 sentence -> subject verb object.
這裡的subject&object&sentence就是基于詞法分析得到的noun&pronoun&verb等token而構造出來的新的symbol。
parser和lexer之間的通信
在yacc&lex系統中,詞法分析(lex/lexer)和文法分析(yacc/parser)是相對獨立的兩套子系統。詞法分析對應的(庫)函數是yylex(),這個函數對輸入的字元流做詞法分析,然後生成一個個token。文法分析對應的函數是parser(),其輸入是yylex()産生的token。是以,要把lex/lexer/yylex()的輸出作為yacc/parser/parser()的輸入。
yylex()的原型:
int yylex (void);
這裡的關鍵就在于yylex()的傳回值,其表示了目前識别的token的類别。當parser()需要一個token的時候,就調用yylex(),根據其傳回值,就知道這個token的類别,進而做進一步的處理。
需要注意的是,并非lexer要給parser傳回所有的token。比如,注釋部分或空白符号就不需要傳給parser,或者說parser對此不感興趣。這種情況下,lexer直接丢棄即可。
既然yacc和lex基于token通信,自然就需達成一緻的規定。這就是所謂的token codes,即每一類token規定一個token code。在yacc&lex系統中,是由yacc來定義token codes,然後lex的代碼include進來。具體地,
- 在yacc中用%token NOUN VERB文法定義token codes
- yacc -d test.y 會生成y.tab.c和y.tab.h兩個檔案,其中後者就包括了token codes的宏定義
- 在lex中include這個y.tab.h檔案。
注:取值為0的token code表示結束輸入(a logical end of input)。
示例
test.l
%{
/*
* We now build a lexical analyzer to be used by a higher-level parser.
*/
#include <stdbool.h>
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include "y.tab.h"
#define LOOKUP 0 /* default - looking rather than defining. */
int state; // global variable, default to 0(LOOKUP).
bool add_word(int type, char *word);
int lookup_word(char *word);
const char* get_word_type(int type);
%}
%%
[\t ]+ ; /* ignore whitespace */
\n {state = LOOKUP;} // end of line, return to default state.
\.\n {
state = LOOKUP;
return 0; // end of sentence.
}
/* Whenever a line starts with a reserved part of speech name */
/* start defining words of that type */
^verb {state = VERB;}
^adj {state = ADJECTIVE;}
^adv {state = ADVERB;}
^noun {state = NOUN;}
^prep {state = PREPOSITION;}
^pron {state = PRONOUN;}
^conj {state = CONJUNCTION;}
/* a normal word, define it or look it up */
[a-zA-Z]+ {
if (state != LOOKUP) {
/* define the current word */
add_word(state, yytext);
} else {
int type = lookup_word(yytext);
printf("%s: %s\n", yytext, get_word_type(type));
switch(type) {
case VERB:
case ADJECTIVE:
case ADVERB:
case NOUN:
case PRONOUN:
case PREPOSITION:
case CONJUNCTION:
return type;
default:
//printf("%s: don't recognize\n", yytext);
break; // don't return, just ignore it.
}
}
}
. {printf("%s: ----\n", yytext);} // ignore it
%%
/* define a linked list of words and types */
struct word {
char *word_name;
int word_type;
struct word *next;
};
struct word *word_list; /* first element in word list */
bool add_word(int type, char *word)
{
struct word *wp; // wp: word pointer
if (lookup_word(word) != LOOKUP) {
printf("!!! warning: word %s already defined.\n", word);
return false;
}
/* word not there, allocate a new entry and link it on the list */
wp = (struct word*)malloc(sizeof(struct word));
wp->next = word_list;
wp->word_name = (char*)malloc(strlen(word) + 1);
strcpy(wp->word_name, word);
wp->word_type = type;
word_list = wp;
return true;
}
int lookup_word(char *word)
{
struct word *wp = word_list;
for (; wp; wp = wp->next) {
if (strcmp(wp->word_name, word) == 0) {
return wp->word_type;
}
}
return LOOKUP;
}
const char* get_word_type(int type)
{
switch(type) {
case VERB: return "verb";
case ADJECTIVE: return "adjective";
case ADVERB: return "adverb";
case NOUN: return "noun";
case PREPOSITION: return "preposition";
case PRONOUN: return "pronoun";
case CONJUNCTION: return "conjunction";
default: return "unknown";
}
}
test.y
%{
/*
* A lexer for the basic grammer to use for recognizing English sentence.
*/
#include <stdio.h>
extern int yylex (void);
void yyerror(const char *s, ...);
%}
%token NOUN PRONOUN VERB ADVERB ADJECTIVE PREPOSITION CONJUNCTION
%%
sentence: subject VERB object {printf("Sentence is valid.\n");}
;
subject: NOUN
| PRONOUN
;
object: NOUN
;
%%
extern FILE *yyin;
int main()
{
//while(!feof(yyin)) {
yyparse();
//}
}
void yyerror(const char *s, ...)
{
fprintf(stderr, "%s\n", s);
}
y.tab.h
此檔案自動生成,如下:
/* A Bison parser, made by GNU Bison 2.3. */
/* Skeleton interface for Bison's Yacc-like parsers in C
...
This special exception was added by the Free Software Foundation in
version 2.2 of Bison. */
/* Tokens. */
#ifndef YYTOKENTYPE
# define YYTOKENTYPE
/* Put the tokens into the symbol table, so that GDB and other debuggers
know about them. */
enum yytokentype {
NOUN = 258,
PRONOUN = 259,
VERB = 260,
ADVERB = 261,
ADJECTIVE = 262,
PREPOSITION = 263,
CONJUNCTION = 264
};
#endif
/* Tokens. */
#define NOUN 258
#define PRONOUN 259
#define VERB 260
#define ADVERB 261
#define ADJECTIVE 262
#define PREPOSITION 263
#define CONJUNCTION 264
#if ! defined YYSTYPE && ! defined YYSTYPE_IS_DECLARED
typedef int YYSTYPE;
# define yystype YYSTYPE /* obsolescent; will be withdrawn */
# define YYSTYPE_IS_DECLARED 1
# define YYSTYPE_IS_TRIVIAL 1
#endif
extern YYSTYPE yylval;
運作
noun dogs
noun dog
verb is are
pron they it
it is dog.
it: pronoun
is: verb
dog: noun
Sentence is valid.
it is dog.
it: pronoun
syntax error
其他嘗試
增加一些列印
sentence: subject verb object {printf("Sentence is valid.\n");}
;
subject: NOUN {printf("subject of a noun.\n");}
| PRONOUN {printf("subject of a pronoun.\n");}
;
verb: VERB {printf("verb.\n");}
;
object: NOUN {printf("object of a noun.\n");}
;
運作:
noun dog
verb is
pron it
it is dog
it: pronoun
subject of a pronoun.
is: verb
verb.
dog: noun
object of a noun.
Sentence is valid.
或者:
noun dog dogs
verb is are
pron it they
it is dog they are dogs.
it: pronoun
subject of a pronoun.
is: verb
verb.
dog: noun
object of a noun.
Sentence is valid.
they: pronoun
syntax error
識别多個句子
extern FILE *yyin;
int main()
{
while(!feof(stdin/*yyin*/)) {
yyparse();
}
}
運作:
$ ./test
noun dog
verb is
pron it
it is dog.
it: pronoun
is: verb
dog: noun
Sentence is valid.
it is dog.
it: pronoun
is: verb
dog: noun
Sentence is valid.
noun dogs
verb are
pron they
they are dogs.
they: pronoun
are: verb
dogs: noun
Sentence is valid.
改成如下的代碼會運作錯誤:
int main()
{
//while(!feof(stdin/*yyin*/)) {
for (;;) {
yyparse();
}
}
運作:
$ ./test
noun dog
verb is
pron it
it is dog
it: pronoun
is: verb
dog: noun
Sentence is valid.
it is dog
it: pronoun
syntax error
is: verb
syntax error
dog: noun
從檔案中讀資料
要從檔案中讀取,需要使用全局變量yyin。如下這種方式無效:
//extern FILE *yyin;
int main()
{
FILE* f = NULL;
f = fopen("test.txt", "rb");
if (NULL == f) {
printf("Open file failed.\n");
return 1;
}
printf("Open file successfully.\n");
while(!feof(f)) {
yyparse();
}
}
注:在yy.lex.c中,使用的是yyin全局變量。該變量初始化為0(NULL)。如果使用者沒有更改yyin,會程式跑起來之後會自動設定為stdin。
正确代碼:
extern FILE *yyin;
int main()
{
yyin = fopen("test.txt", "rb");
if (NULL == yyin) {
printf("Open file failed.\n");
return 1;
}
printf("Open file successfully.\n");
while(!feof(yyin)) {
yyparse();
}
}
測試檔案test.txt的内容:
noun dog dogs
verb is are
pron it they
it is dog.
they are dogs.
運作:
$ ./test
Open file successfully.
it: pronoun
is: verb
dog: noun
Sentence is valid.
they: pronoun
syntax error
$
簡單語句和複合語句
對應 ch1-06.y
代碼
%{
/*
* A lexer for the basic grammer to use for recognizing English sentence.
*/
#include <stdio.h>
extern int yylex (void);
void yyerror(const char *s, ...);
%}
%token NOUN PRONOUN VERB ADVERB ADJECTIVE PREPOSITION CONJUNCTION
%%
sentence: simple_sentence { printf("Parsed a simple sentence.\n"); }
| compound_sentence { printf("Parsed a compound sentence.\n"); }
;
simple_sentence: subject verb object {printf("simple sentence of type 1.\n");}
| subject verb object prep_phrase {printf("simple sentence of type 2.\n");}
;
compound_sentence: simple_sentence CONJUNCTION simple_sentence {printf("compound sentence of type 1.\n");}
| compound_sentence CONJUNCTION simple_sentence {printf("compound sentence of type 2.\n");}
;
subject: NOUN
| PRONOUN
| ADJECTIVE subject
;
verb: VERB
| ADVERB VERB
| verb VERB
;
object: NOUN
| ADJECTIVE object
;
prep_phrase: PREPOSITION NOUN
;
%%
extern FILE *yyin;
int main()
{
yyin = fopen("test.txt", "rb");
if (NULL == yyin) {
printf("Open file failed.\n");
return 1;
}
printf("Open file successfully.\n");
while(!feof(yyin)) {
yyparse();
}
fclose(yyin);
yyin = NULL;
return 0;
}
void yyerror(const char *s, ...)
{
fprintf(stderr, "%s\n", s);
}
測試檔案
noun dog dogs China
verb is are
pron it they
adj pretty
conj and
prep in
it is a pretty dog and they are dogs in China and it is dog.
運作結果
Open file successfully.
it: pronoun
is: verb
a: unknown
pretty: adjective
dog: noun
and: conjunction
simple sentence of type 1.
they: pronoun
are: verb
dogs: noun
in: preposition
China: noun
simple sentence of type 2.
compound sentence of type 1.
and: conjunction
it: pronoun
is: verb
dog: noun
.: ----
simple sentence of type 1.
compound sentence of type 2.
Parsed a compound sentence.