We all know that YACC is a parser generator: it accepts only context-free grammars (CFGs), and it can of course generate a parser for C. However, C's grammar is not actually fully context-free (it is only close to being a CFG). See the following two articles.
原文: http://eli.thegreenplace.net/2007/11/24/the-context-sensitivity-of-cs-grammar
http://trevorjim.com/c-and-cplusplus-are-not-context-free/
The context sensitivity of C's grammar
November 24, 2007 at 08:16
Tags: Articles, C & C++, Compilation
Context free grammars (CFGs) are a valuable theoretical tool on which modern compilation theory relies for parsing the code of programming languages. For example, the most popular tool used for parsing – YACC – generates parsers for CFGs. What most people don't know [1] is that the vast majority of programming languages have grammars that are not context free.
C is a very good example, because it is one of the most popular languages in use and because its grammar is so almost context free that it serves as a good model to demonstrate what I'm talking about.
Now, a CFG has several definitions in relation to formal languages and programming languages. I don’t want to delve too deep into the nomenclature here, but here is a discussion by a bunch of clever guys picking nits off this matter. When I say that the grammar of C is not a CFG, what I mean is that a grammar given to YACC[2] is not enough to parse C correctly, without referring to some context information that comes from elsewhere. It’s time for some examples.
Consider this code:
{
    T (x);
    ...
}
Believe it or not, given that T is a type, this is actually a valid declaration of x of the type T in C. However, if T is not a known type, this is a call to the function T with the argument x. How can the C parser know which way to parse without knowing whether T was previously defined by a typedef?
I can hear you say "but this is contrived, who ever writes code like that?". OK, something more standard:
{
    T * x;
    ...
}
What is this, a declaration of x as a pointer to T, or a void multiplication of the variables T and x? There is no way to know without having the table of types defined by typedef in memory, and parsers aren't built to do that – this is context sensitive information.
Here’s another example:
func((T) * x);
If T is a type, the result of dereferencing x is cast to T and passed to func. If T isn't a type, the multiplication of T and x is passed to func.
In all these examples, the parser would be lost without having some information gathered on the code before the problematic statement is reached. Therefore, C cannot be parsed with a YACC grammar without mixing in some context sensitive information. This actually has a name in the compilation / C community – the "typedef-name: identifier" problem. Even K&R2 [3] has something to say about it, when presenting the grammar for C in the appendix:
With one further change, namely deleting the production typedef-name: identifier and making typedef-name a terminal symbol, this grammar is acceptable to the YACC parser-generator.
So, as you see, C is very close to having a CFG, but isn't quite there. Fortunately, this problem is very simple to fix. All that's needed is keeping a symbol table of types defined by typedef as the parsing goes. Whenever a new identifier is recognized in the lexer, it checks if this identifier is a defined type, and returns the correct token to the parser. As far as the parser is concerned, it has two distinct terminals – an identifier and a defined type. All that's left is updating the symbol table whenever a successful parse of a typedef statement completes. To show better how this works, I'll show the relevant portions of the C parser and lexer from c2c's code. Here is a portion of the Lex file:
identifier ([a-zA-Z_][0-9a-zA-Z_]*)

<INITIAL,C>{identifier} {
    GetCoord(&yylval.tok);
    yylval.n = MakeIdCoord(UniqueString(yytext), yylval.tok);
    if (IsAType(yylval.n->u.id.text))
        RETURN_TOKEN(TYPEDEFname);
    else
        RETURN_TOKEN(IDENTIFIER);
}
Without getting too much into the syntax of Lex here, what this basically says is that whenever an identifier is found, it is tested for being a type. If it is, the TYPEDEFname token is returned. Otherwise, IDENTIFIER is returned. For the Yacc grammar, these two are separate terminals.
[1] To be more precise, "most people" don't even care about things like this. By people I here refer to those who are interested in programming and computer science.
[2] YACC only accepts CFGs, since in every production rule V -> w, V is a single nonterminal symbol.
[3] "The C Programming Language, 2nd edition" by Kernighan and Ritchie
C and C++ are not context free
January 31, 2013
(Part 5 of a series: 1, 2, 3, 4)
It seems that the Internet is still confused about this, so let’s consider the question:
Are C and C++ context-free languages?
And the answer is
No, C and C++ are context-sensitive languages.
There are several reasons.
- To parse C and C++, you start by using a very powerful preprocessor. These preprocessors are inevitably written by hand (they are not based on a theoretic foundation like regular expressions or context-free grammars).
- C and C++ lexers require lexical feedback to differentiate between typedef names and identifiers. That is, the context-sensitive lexer needs help from the "context-free" parser to distinguish between an identifier "foo" and a typedef name "foo". In this snippet,

      int foo;
      typedef int foo;
      foo x;

  the first "foo" is an identifier while the second and third are typedef names. You need to parse typedef declarations to figure this out (and since types have nested parentheses, this is definitely at least as hard as parsing a context-free language).

  This means that the parser and lexer are mutually recursive, so it doesn't make sense to say that the parser is context free while the lexer is context sensitive.
- The grammar is ambiguous: it has LR conflicts, such as the if-then-else conflict. Parsers typically resolve this using context (an "else" matches the closest "if").
If you believe that these reasons are not sufficient to show that the languages are context sensitive, please consult my previous post on the subject.
C and C++ are important languages so it’s worth saying a bit more.
The C and C++ preprocessors are a real problem for the languages. Among other things, they make it hard to write tools that operate on programs. For example, suppose you would like to refactor your program by renaming a function. Preprocessors can introduce new tokens, so ideally your tool would work by first running the preprocessor, then the lexer/parser to find all occurrences of the function and replace them. But then, you have a refactored program that has already been preprocessed. At this point you would like to run the preprocessor in reverse, so that the result is both readable and similar to the original. Good luck with that.
C++ is much harder to parse than C. McPeak’s work on Elkhound and Elsa has a good discussion of the issues.
If you ever look closely at language specifications, you’ll see that they have a lot of problems. They have many ambiguities and even contradictions. Moreover, language implementations rarely meet the specification, and no two implementations work the same on all programs (cf. my posts on Postel’s Law). A Few Billion Lines of Code Later: Using Static Analysis to Find Bugs in the Real World is a delightful article by Coverity on the difficulties of writing static analyzers for C programs “in the wild.” The whole article is worth reading several times over, but here are a couple of relevant quotes:
Law: You can't check code you can't parse. Checking code deeply requires understanding the code's semantics. The most basic requirement is that you parse it. Parsing is considered a solved problem. Unfortunately, this view is naïve, rooted in the widely believed myth that programming languages exist.
The C language does not exist; neither does Java, C++, and C#. While a language may exist as an abstract idea, and even have a pile of paper (a standard) purporting to define it, a standard is not a compiler. What language do people write code in? The character strings accepted by their compiler. Further, they equate compilation with certification. A file their compiler does not reject has been certified as “C code” no matter how blatantly illegal its contents may be to a language scholar. Fed this illegal not-C code, a tool’s C front-end will reject it. This problem is the tool’s problem.
and
If divergence-induced parse errors are isolated events scattered here and there, then they don’t matter. An unsound tool can skip them. Unfortunately, failure often isn’t modular. In a sad, too-common story line, some crucial, purportedly “C” header file contains a blatantly illegal non-C construct. It gets included by all files. The no-longer-potential customer is treated to a constant stream of parse errors as your compiler rips through the customer’s source files, rejecting each in turn. The customer’s derisive stance is, “Deep source code analysis? Your tool can’t even compile code. How can it find bugs?” It may find this event so amusing that it tells many friends.
(Coverity, by the way, is now looking at HTML5 parsing to find security vulnerabilities, cf. this post and BEEP.)
I don't blame C for these problems. C is old, and it was created when there were few widely available parser generators. In fact, it looks to me like yacc and C were essentially born together! Here's an inspirational interview with Stephen Johnson that discusses those early days. It wasn't easy; Johnson says:
Over the next several years, I rewrote the program over a dozen times, speeding it up by a factor of 10,000 or so.
That’s the level of effort and craftsmanship you need to create one of the most influential programs ever written.