3.Lex Regular Expressions
A LEX regular expression is a word made of text characters (letters of the alphabet, digits, ...)
operators : " \ { } [ ] ^ $ < > ? . * + | () /
An operator can be used as a text character if it preceded with the escape operator (backslash).
The quotation marks indicate that whatever is contained between a pair of quotes is to be taken as text characters.
operators : " \ { } [ ] ^ $ < > ? . * + | () /
An operator can be used as a text character if it preceded with the escape operator (backslash).
The quotation marks indicate that whatever is contained between a pair of quotes is to be taken as text characters.
For instance
xyz"++"
matches the string xyz++.
xyz"++"
matches the string xyz++.
A Character Class
is a class of characters specified using the operator pair [ ]The expression [ab] matches the string a or b.
Within square brackets most operators are ignored except the three special characters \ - ^ which are used as follows
(a) the escape character \ as above,
(b) the minus character - which is used for ranges like in digit [0-9]
(c) the hat character ^ as first character after the opening square bracket, it is used for complemented matches like in
NOTabc [^abc]
(a) the escape character \ as above,
(b) the minus character - which is used for ranges like in digit [0-9]
(c) the hat character ^ as first character after the opening square bracket, it is used for complemented matches like in
NOTabc [^abc]
Repeated Expressions
Repetitions of patterns are indicated by the operators ?,* and +.
? zero or one occurrence
* zero or more occurrence
+ one or more occurrence
ab?c will match ac or abc ie; b? Will match zero or one occurrence of b.
The pattern a* matches any number of consecutive a characters (including zero).
ab* will match with a,ab,abb,abbb etc..
The pattern [a-z]+ is any positive number of consecutive lower-case alphabetic characters.
Hence we can recognize identifiers in a typical computer language with
[A-Za-z][A-Za-z0-9]*
Repetitions can also be obtained with the pair operator {}.
If {} encloses numbers, it specifies repetitions.
For instance a{1,5} matches 1 to 5 repetitions of a.
Note that if {} encloses a name, this name should be defined in the definition section. Then LEX substitutes the definition for the name.
Note that if {} encloses a name, this name should be defined in the definition section. Then LEX substitutes the definition for the name.
Alternating
The operator | indicates alternation. For instance
(ab|cd)
matches the language consisting of both words ab and cd.
Parentheses are used for grouping (when not clear). For instance
(ab|cd+)?(ef)*
denotes the language of the words that are either empty or
optionally starts with
ab or
c followed by any positive number of d
and continues with any number of repetition of ef
Context Sensitivity
LEX provides some support for contextual grammatical rules.
- If ^ is the first character in an expression, then this expression will only be matched at the beginning of a line.
- If $ is the last character in an expression, then this expression will only be matched at the end of a line.
- If r and s are two LEX regular expressions then r/s is another LEX regular expression.
Summary
x match the character ’x’
. any character (byte) except newline
[xyz] a "character class"; in this case, the pattern matches either an ’x’, a ’y’, or a ’z’
[abj-oZ] a "character class" with a range in it; matches an ’a’, a ’b’, any letter from ’j’ through ’o’,
or a ’Z’
[^A-Z] a "negated character class", i.e., any character but those in the class. In this case, any
character EXCEPT an uppercase letter.
[^A-Z\n] any character EXCEPT an uppercase letter or a newline
r* zero or more r’s, where r is any regular expression
r+ one or more r’s
r? zero or one r’s (that is, "an optional r")
r{2,5} anywhere from two to five r’s
r{2,} two or more r’s
r{4} exactly 4 r’s
{name} the expansion of the "name" definition
"[xyz]\"foo" the literal string: [xyz]"foo
\X if X is an ’a’, ’b’, ’f’, ’n’, ’r’, ’t’, or ’v’, then the ANSI-C interpretation of \x.
Otherwise, a literal ’X’ (used to escape operators such as ’*’)
\0 a NUL character (ASCII code 0)
\123 the character with octal value 123
\x2a the character with hexadecimal value 2a
(r) match an r; parentheses are used to override precedence
rs the regular expression r followed by the regular expression s; called "concatenation"
r|s either an r or an s
r/s an r but only if it is followed by an s. The text matched by s is included when determining whether this rule is the "longest match", but is then returned to the input before the action is executed. So the action only sees the text matched by r. This type
of pattern is called trailing context".
(There are some combinations of r/s that flex cannot match correctly; see notes in the Deficiencies / Bugs section below regarding "dangerous trailing context".)
^r an r, but only at the beginning of a line (i.e., which just starting to scan, or right after a newline has been scanned).
r$ an r, but only at the end of a line (i.e., just before a newline). Equivalent to "r/\n".
Note that flex’s notion of "newline" is exactly whatever the C compiler used to compile flex interprets ’\n’ as; in particular, on some DOS systems you must either filter out \r’s in the input yourself, or explicitly use r/\r\n for "r$".
<s>r an r, but only in start condition s <s1,s2,s3>r same, but in any of start conditions s1, s2, or s3
<*>r an r in any start condition, even an exclusive one.
<<EOF>> an end-of-file
<s1,s2><<EOF>> an end-of-file when in start condition s1 or s2
Note that inside of a character class, all regular expression operators lose their special meaning except escape (’\’) and the character class operators, ’-’, ’]’, and, at the beginning of the class, ’^’.
The regular expressions listed above are grouped according to precedence, from highest precedence at the top to lowest at the bottom.
Those grouped together have equal precedence. For example,
foo|bar*
is the same as
(foo)|(ba(r*))
since the ’*’ operator has higher precedence than concatenation, and concatenation higher than alternation (’|’). This pattern therefore matches either the string "foo" or the string "ba" followed by zero-or- more r’s.
x match the character ’x’
. any character (byte) except newline
[xyz] a "character class"; in this case, the pattern matches either an ’x’, a ’y’, or a ’z’
[abj-oZ] a "character class" with a range in it; matches an ’a’, a ’b’, any letter from ’j’ through ’o’,
or a ’Z’
[^A-Z] a "negated character class", i.e., any character but those in the class. In this case, any
character EXCEPT an uppercase letter.
[^A-Z\n] any character EXCEPT an uppercase letter or a newline
r* zero or more r’s, where r is any regular expression
r+ one or more r’s
r? zero or one r’s (that is, "an optional r")
r{2,5} anywhere from two to five r’s
r{2,} two or more r’s
r{4} exactly 4 r’s
{name} the expansion of the "name" definition
"[xyz]\"foo" the literal string: [xyz]"foo
\X if X is an ’a’, ’b’, ’f’, ’n’, ’r’, ’t’, or ’v’, then the ANSI-C interpretation of \x.
Otherwise, a literal ’X’ (used to escape operators such as ’*’)
\0 a NUL character (ASCII code 0)
\123 the character with octal value 123
\x2a the character with hexadecimal value 2a
(r) match an r; parentheses are used to override precedence
rs the regular expression r followed by the regular expression s; called "concatenation"
r|s either an r or an s
r/s an r but only if it is followed by an s. The text matched by s is included when determining whether this rule is the "longest match", but is then returned to the input before the action is executed. So the action only sees the text matched by r. This type
of pattern is called trailing context".
(There are some combinations of r/s that flex cannot match correctly; see notes in the Deficiencies / Bugs section below regarding "dangerous trailing context".)
^r an r, but only at the beginning of a line (i.e., which just starting to scan, or right after a newline has been scanned).
r$ an r, but only at the end of a line (i.e., just before a newline). Equivalent to "r/\n".
Note that flex’s notion of "newline" is exactly whatever the C compiler used to compile flex interprets ’\n’ as; in particular, on some DOS systems you must either filter out \r’s in the input yourself, or explicitly use r/\r\n for "r$".
<s>r an r, but only in start condition s <s1,s2,s3>r same, but in any of start conditions s1, s2, or s3
<*>r an r in any start condition, even an exclusive one.
<<EOF>> an end-of-file
<s1,s2><<EOF>> an end-of-file when in start condition s1 or s2
Note that inside of a character class, all regular expression operators lose their special meaning except escape (’\’) and the character class operators, ’-’, ’]’, and, at the beginning of the class, ’^’.
The regular expressions listed above are grouped according to precedence, from highest precedence at the top to lowest at the bottom.
Those grouped together have equal precedence. For example,
foo|bar*
is the same as
(foo)|(ba(r*))
since the ’*’ operator has higher precedence than concatenation, and concatenation higher than alternation (’|’). This pattern therefore matches either the string "foo" or the string "ba" followed by zero-or- more r’s.
To match "foo" or zero-or-more "bar"’s,
use: foo|(bar)*
and to match zero-or-more "foo"’s-or-"bar"’s:
use:(foo|bar)*
[:cntrl:] [:digit:] [:graph:]
[:lower:] [:print:] [:punct:]
[:space:] [:upper:] [:xdigit:]
and to match zero-or-more "foo"’s-or-"bar"’s:
use:(foo|bar)*
In addition to characters and ranges of characters, character classes can also contain character class expressions. These are expressions enclosed inside [: and :] delimiters (which themselves must appear between the ’[’ and ’]’ of the character class; other elements may occur inside the character class, too). The valid expressions are:
[:alnum:] [:alpha:] [:blank:] [:cntrl:] [:digit:] [:graph:]
[:lower:] [:print:] [:punct:]
[:space:] [:upper:] [:xdigit:]
Pattern Matching Primitives
Meta character
|
Matches
|
.
|
any character except newline
|
\n
|
newline
|
*
|
zero or more copies of the
preceding expression
|
+
|
one or more copies of the
preceding expression
|
?
|
zero or one copy of the
preceding expression
|
^
|
beginning of line
|
$
|
end of line
|
a|b
|
a or b
|
(ab)+
|
one or more copies of ab
(grouping)
|
"a+b"
|
literal "a+b" (C
escapes still work)
|
[]
|
character class
|
Pattern Matching Examples
Expression
|
Matches
|
abc
|
abc
|
abc*
|
ab abc abcc abccc ...
|
abc+
|
abc abcc abccc ...
|
a(bc)+
|
abc abcbc abcbcbc ...
|
a(bc)?
|
a abc
|
[abc]
|
one of: a, b, c
|
[a-z]
|
any letter, a-z
|
[a\-z]
|
one of: a, -, z
|
[-az]
|
one of: -, a, z
|
[A-Za-z0-9]+
|
one or more alphanumeric
characters
|
[ \t\n]+
|
whitespace
|
[^ab]
|
anything except: a, b
|
[a^b]
|
one of: a, ^, b
|
[a|b]
|
one of: a, |, b
|
a|b
|
one of: a, b
|
Comments
Post a Comment