bihai/poodle-lex

Lexical analyzer generator with multilingual support.

Poodle-Lex is a lexical analyzer generator, similar to lex and its variants. 
It produces source code which scans an input stream and outputs a stream of 
word units. Features include Unicode support, hierarchical patterns and a 
platform for supporting arbitrary programming languages through plugins.

Usage: Poodle-Lex RULES_FILE OUTPUT_DIR [Optional arguments]

Required arguments:
RULES_FILE: A file containing token names and pattern definitions.
OUTPUT_DIR: An existing directory to which the state machine code is written.

Optional arguments:
-e ENCODING,  --encoding ENCODING - Encoding of the input file. ENCODING must be an encoding codec name supported by Python.
-l LANGUAGE,  --language LANGUAGE - Output emitter language.
-c CLASS_NAME, --class-name CLASS_NAME - The name of the lexical analyzer class to generate.
-s NAMESPACE, --namespace NAMESPACE - The namespace of the lexical analyzer class to generate.
-f FILE_NAME, --file-name FILE_NAME - The base name of the generated source files.
-i ALGORITHM, --minimizer ALGORITHM - Algorithm to use for minimizing the DFA. Default is hopcroft; polynomial is also allowed.
-n DOT_FILE,  --print-nfa DOT_FILE - Print a GraphViz (.dot) file with the lexical analyzer's NFA representation.
-d DOT_FILE,  --print-dfa DOT_FILE - Print a GraphViz (.dot) file with the lexical analyzer's DFA representation.
-m DOT_FILE,  --print-min-dfa DOT_FILE - Print a GraphViz (.dot) file with the lexical analyzer's minimized DFA representation.

Commands:
Poodle-Lex list-languages: Prints a list of languages supported by the --language option.
Poodle-Lex list-minimizers: Prints a list of DFA minimization algorithms supported by the --minimizer option.
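
For example, the following invocation (with a hypothetical rules file named
tokens.rules) generates a C++ analyzer class named MyLexer:

    mkdir Output
    Poodle-Lex tokens.rules Output --language cpp --class-name MyLexer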

Rules file:
The rules file is a simple format which describes the lexical analyzer as a 
set of token definitions, one per line. The rules file consists of five
types of statements:
1) Token rules
2) Variable definitions
3) Token name reservations
4) Comments
5) Section blocks

1) Token rules
Token rules take one of the following forms:
- [COMMANDS] TOKEN_ID: [PATTERN] [SECTION_ACTION]
- Skip: PATTERN [SECTION_ACTION]

The first form is a rule which returns a token. TOKEN_ID is the name 
of the rule. PATTERN is a regular expression defining a pattern of 
text which should be matched to the rule, surrounded by either single 
or double quotes. The surrounding quotes can be escaped by adding them
twice ("" or '') inside of the pattern. If an "i" appears in front of the 
opening quote, then the pattern will be interpreted as case-insensitive 
for ASCII Latin characters.

The second form is a rule which is skipped and does not return a token.
Here, the same syntax applies, but the token ID may be omitted.

PATTERN may consist of multiple strings, concatenated by "+". If a line ends 
with "+", then the expression may extend over multiple lines. If a piece of 
text matches two rules, then the rule appearing first is used.
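
As an illustration, here are a few token rules using the syntax above (the
token names and patterns are invented for this example):

    # Match decimal integers
    Integer: "[0-9]+"
    # A case-insensitive keyword, built from two concatenated strings
    PrintKeyword: i"pri" +
        i"nt"
    # Match whitespace without producing a token
    Skip: "[ \t\r\n]+"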

SECTION_ACTION is a command which tells the lexical analyzer to enter or 
exit a different context, referred to as a "section", after matching the rule. 
It can be one of the following:
 - Enter SECTION_NAME
 - Switch SECTION_NAME
 - Exit Section
 - [Enter|Switch] Section [ATTRIBUTES]
       ...
   Exit Section
The first form instructs the lexical analyzer to enter an existing section, 
where SECTION_NAME is an identifier matching the section. For anonymous 
sections, the rule name is the section name. The second form instructs the
lexical analyzer to switch to a different section without remembering the 
current section. The third form instructs the lexical analyzer to exit the 
current section. The fourth form actually defines an anonymous section, and 
instructs the lexical analyzer to enter or switch to it.

Sections in other areas of the hierarchy can be referred to using scoped
section names. Scoped names are a set of names divided by a period. The 
current section is searched for a section named as the first name in the set. 
Then the found section is searched for the second name in the set, and so 
forth. If the name begins with a period, the root of the rules file is 
searched.

The lexical analyzer remembers the order in which each section is entered.
When a section is exited, the lexical analyzer returns to the section that 
it previously entered. If a section is switched, the current section 
changes, but the movement is not recorded.
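
For example, the fourth form might be used to tokenize a block comment (the
rule names here are invented):

    # After the opening delimiter, enter an anonymous section
    CommentStart: "/\*" Enter Section
        Skip: "[^\*]+"
        # Return to the previous section after the closing delimiter
        CommentEnd: "\*/" Exit Section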

COMMANDS is an optional, comma-separated list of words which describe actions 
to take regarding the token. Commands are case-insensitive, and valid commands are:
 - Skip: The pattern will be matched, but no token will be produced
 - Capture: The text of the match will be returned along with the token
 - Reserve: ID will be reserved, and optionally a pattern will be stored
 - Import: Use the same rule name, and optionally the same pattern as an existing rule.
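
For instance (rule names invented):

    # Return the matched text along with the token
    Capture Identifier: "[A-Za-z_][A-Za-z0-9_]*"
    # Match the pattern, but produce no token
    Skip Whitespace: "[ \t]+"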
 
2) Variable Definitions
Variable definitions take the following form:
Let VARIABLE_NAME = PATTERN

Variables can be substituted within a pattern, but are not themselves rules.
VARIABLE_NAME is the name of a variable to define. PATTERN is a pattern in the 
same format as with token rules.
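
For example, a variable can be defined once and substituted into later
patterns with the "{...}" syntax described under the regular expression rules:

    Let Digit = "[0-9]"
    Integer: "{Digit}+"
    Float: "{Digit}+\.{Digit}+"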

3) Token name reservations
Token name reservations take the following form:
Reserve TOKEN_ID[: PATTERN]

If used, this will add TOKEN_ID to the list of token ids. This can be used to 
reserve token ids, or to create ids that can be emitted by a preprocessor. A pattern
can optionally be stored with the token ID. When a reserved rule is later imported,
a pattern only needs to be specified if one was not given in the reservation.
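
For example (token ids invented):

    # Reserve an id with no pattern, e.g. for a token emitted by a preprocessor
    Reserve MacroExpansion
    # Reserve an id and store a pattern with it
    Reserve EndKeyword: i"end"
    # Later, import the reserved rule; the stored pattern is reused
    Import EndKeyword: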

4) Comments
Any text on a line after "#" is considered a comment. 

5) Section Blocks
Finally, different contexts, referred to as sections or section blocks, can
be defined. These are useful if, at certain points, the lexical analyzer should
use a different set of rules. A section can be defined in two ways: as a 
standalone section, or anonymously as part of a rule. See the section on 
rules for how to define a section as part of a rule.

Standalone sections take the following form:
Section SECTION_NAME [ATTRIBUTES]
    CONTENT
End Section

SECTION_NAME is the name of the section. CONTENT is a set of rules,
variable definitions, comments, or sub-sections that exist within the
context of the section. Sections can contain other sections for 
organizational purposes, though this only affects behavior if attributes
are defined for the section.

ATTRIBUTES is a comma-separated list of attributes which can be applied 
to the section. Each attribute defines the behavior of the lexical 
analyzer if a rule is not matched on the first character of a token.
The following attributes can be applied:
 - Inherits: If the first character does not fit a rule in this section, attempt
       to match the rule in the section containing this one.
 - Exits: If the first character does not fit a rule in this section, exit
       the section and return to the section the lexical analyzer was in 
       previously.
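
Putting this together, a standalone section using the "Exits" attribute might
look like the following (names invented for illustration):

    CommentStart: "--" Enter LineComment
    Section LineComment Exits
        # A line break fits no rule here, so the analyzer exits the section
        Skip: "[^\r\n]+"
    End Section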
       
Language plug-ins:
As packaged, Poodle-Lex supports the following language arguments:
 - freebasic: FreeBasic language emitter with full Unicode support
 - cpp: C++ emitter which consumes UTF-8 input
 
The default language is freebasic. More languages can be supported by 
installing plugins.
 
Regular expression rules:
Poodle-Lex patterns use a regular expression syntax similar to the IEEE POSIX 
ERE standard. The pattern will match the characters present, and the following 
special characters can be used to define more complex patterns:
".": Matches any Unicode codepoint.
"[...]": Matches one of any characters inside of the brackets
"[^...]": Matches all Unicode codepoints except any characters inside the brackets.
"{m, n}": Matches the previous item if it repeats a minimum of m times and a maximum of n times.
"*": Matches the prevous item if it appears zero or more times in a row.
"+": Matches the previous item if it appears one or more times in a row.
"?": Matches the previous item if it appears zero or one times.
"(...)": Matches the patern within the parenthesis.
"{...}": Matches a pattern defined by a variable, identified by the text between the curly braces
"\": Matches a special character or the next character literally
"|": Matches either the previous item or the next item.

Variable names within a pattern follow the same scoping rules as section names.
A scoped name is a set of names separated by a period. If there are multiple names,
all names but the last one refer to a list of sections within the hierarchy. The 
current section is searched for the first section name, then the matched section
is searched for the next section name, and so forth. Then, the last matched section
is searched for the variable name. If the scoped name begins with a period, the 
root of the document is searched.
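
For example (names invented), a variable defined inside a section can be
referenced from elsewhere using a scoped, root-anchored name:

    Section Numbers
        Let Digit = "[0-9]"
        Integer: "{Digit}+"
    End Section
    # Refer to Digit from outside its section
    Version: "{.Numbers.Digit}+\.{.Numbers.Digit}+"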

The following standard special characters are NOT accepted:
"$": In the POSIX ERE standard, matches text at the end of a line or input
"^": In the POSIX ERE standard, matches text at the start of a line or input

These symbols are not supported because they require additional context. Unless 
escaped, these characters are not allowed; they are reserved for future use.

The following special characters are supported:
"\t": Matches a tab (\x09) character
"\r": Matches a line feed (\x0a) character
"\n": Matches a carriage return (\x0d) characeter
"\x**": Matches a codepoint between 0 and 255, with "*" being any hexidecimal digit
"\u****": Matches a codepoint between 0 and 65535, with "*" being any hexidecimal digit
"\U******": Matches any Unicode codepoint, with "*" being any hexidecimal digit

The following named character classes are also supported within a character class:
"[:alnum:]" - Alphanumeric characters ([A-Za-z0-9])
"[:word:]" - Alphanumeric characters + "_" ([A-Za-z0-9_])
"[:alpha:]" - Alphabetical characters ([A-Za-z])
"[:blank:]" - Space and tab ([ \t])
"[:cntrl:]" - ASCII control characters ([\x01-\x1F])
"[:digit:]" - Numerical digits ([0-9])
"[:graph:]" - Visible characters ([\x21-\x7F])
"[:lower:]" - Lowercase characters ([a-z])
"[:print:]" - Printable characters ([\x20-\x7F])
"[:punct:]" - Punctuation ([\]\[\!\"\#\$\%\&\'\(\)\*\+\,\.\/\:\;\<\=\>\?\@\\\^\_\`\{\|\}\~\-])
"[:space:]" - Whitespace characters ([ \t\r\n\v\f])
"[:upper:]" - Uppercase characters ([A-Z])
"[:xdigit:]" - Hexadecimal digits ([A-Fa-f0-9])

Output:
When run, the program generates source code which provides an interface. The 
interface will take in a stream of text, and produce "tokens" in the form of
a structure. The structure will contain an enumerated identifier, and 
optionally a string containing the text which was grouped into the token.
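
As a rough sketch only, the structure described above might look like the
following in C++ terms; every name here is invented for illustration, and the
real names depend on the emitter and on the --class-name, --namespace, and
--file-name options:

    // Illustrative sketch of the token structure the generated code returns;
    // all names are invented, not the actual generated API.
    #include <iostream>
    #include <string>

    enum class TokenId { Identifier, Integer, EndOfStream };  // stands in for the generated enum

    struct Token {
        TokenId id;        // enumerated identifier of the matched rule
        std::string text;  // matched text, present for "Capture" rules
    };

    int main() {
        // A real program would obtain tokens from the generated analyzer class.
        Token token{TokenId::Identifier, "example"};
        std::cout << static_cast<int>(token.id) << " " << token.text << "\n";
        return 0;
    }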

Example:
The following generates, compiles, and executes a simple FreeBasic test 
application on Linux and Windows. Python 2.7 and FreeBASIC >= 0.90.1 should be 
installed prior to running. 

Windows:
1) Add the Poodle-Lex installation directory to your PATH environment variable
2) Copy any .rules files in the "Example" folder from the installation directory to any folder
3) Navigate to that folder in the command prompt
4) Enter the following, replacing RULES_FILE with the name of the rules file being used:
mkdir Output
Poodle-Lex RULES_FILE Output
cd Output\Demo
make_demo
Demo

Linux:
1) Sync the source
2) Install Python 2.7
3) Install blist (Ubuntu: sudo apt-get install python-blist)
4) Navigate to the source directory
5) Enter the following, replacing RULES_FILE with the name of the rules file being used:
mkdir Output
python __main__.py RULES_FILE Output
cd Output/Demo
./make_demo.sh
./Demo

Credits:
Parker Michaels - Primary developer
AGS - C lexer example, countless bug reports and tests

Thank you to everyone who helped test or provided feedback for this tool. This 
includes members of the FreeBasic.net forum such as TFS.
