Automatically exported from code.google.com/p/pytoken
License
zurgzurg/pytoken
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
What is pytoken Pytoken is a scanner generator. Given an input specification - a bunch of regular expressions - pytoken will generate x86 machine code that recognizes those regular expressions. Pytoken will be most useful for programmers that want to parse text files. Install ---------------------------------------------- To install pytoken: python ./setup.py install Documentation ---------------------------------------------- To build the documentation: cd doc make -f Makefile.docs html Getting Started ---------------------------------------------- Pytoken has separate objects for scanners and buffers. Here is a simple example: import pytoken lexer_obj = pytoken.lexer() lexer_obj.add_pattern("a", 1) lexer_obj.add_pattern("b", 2) lexer_obj.compile_to_machine_code() buf = pytoken.lexer_state() buf.set_input("ab") tok = lexer_obj.get_token(buf) assert tok == 1 tok = lexer_obj.get_token(buf) assert tok == 2 Pytoken requires only python developement libraries, make and a C compiler. Pytoken has been written in a portable fashion - it is designed to support multiple CPU types, even though only the x86 (32 bit) is supported now. Why use pytoken 1. Ease of use. Pytoken was created to make parsing files easy. The flex/lex based scanners that are traditionally used in C based solutions have complicated rules for buffer management. Pytoken follows the python philispophy and makes buffer manegement explicit. 2. Performance. pytoken was also written with the aim of creating very high speed lexers. Tokenization is expensive, by definition some amount of work needs to be done for each character in the input stream. Since pytoken generates machine code directly it can ensure that no extra work is done is processing the input stream. Pytoken aims to keep the overhead of executing inside the python interpreter to a minimum. Minimal benchmarking - see bencharm.py shows that pytoken is around 30% slower than flex. TODO ---------------------------------------------- For next full release: '|' as a regex is illegal some simple examples support \xx encoding of chars in hex Need to support matching NULL chars in users patterns. Fully working end of buffer detection. Before initial release: done make buffers work with file like objects - set_input() done make buffers work with file like objects - fill() done handle None as token id done support token id as callable, need to have clear done protocol on getting token info from buffer. Need done way to get buffer contents as well., also callable done objs must have a way of passing user defined args. done very long token - larger that 4k done fill buffers with varying token sizes - to ensure that tokens will be split accross buffer fills done benchmarking vs flex Other TODO Items: inprog Need a manual. inprog Need to support a richer set of REGEXP meta chars. Rework/Rewrite the IR representation. This needs to be done before better machine code can be generated. Emit better machine code. Need a frame work for doing traditional compiler style optimizations. ? Comments from users ? Info for developers ---------------------------------------------------- General info testing the assembler: Normally I just use gas or nasm to test create the expected output. Notes on x86 assembler ---------------------- +-----+-------+-------+ ModRM : | 7 6 | 5 4 3 | 2 1 0 | +-----+-------+-------+ mod reg r/m The AMD and Intel manuals specify some opcodes with a ModRM byte: /0 /1 /2 /3 /4 /5 /6 /7 The ModRM byte specifies only one register or memory operand. The digit is specified by the ModRM reg field. /r Indicates that the ModRM byte specifies both a register operand and a reg/mem (register or memory) operand. junk.s: ----------------------- l1: nop jmp l1 ----------------------- $ as -a junk.s GAS LISTING junk.s page 1 1 0000 90 l1: nop 2 0001 EBFD jmp l1 3 GAS LISTING junk.s page 2 DEFINED SYMBOLS junk.s:1 .text:00000000 l1 NO UNDEFINED SYMBOLS --------------------------------------------------------- GDT LDT from kernel requestor priv level cs == #x73 ==> 0_1110 0 11 14 ds == #x7b ==> 0_1111 0 11 15 From the AMD64 manual: Default Operand Size (D/B) Bit. Bit 22 of byte +4. The default operand-size bit is found in code- segment and data-segment descriptors but not in system-segment descriptors. Setting this bit to 1 indicates a 32-bit default operand size, and clearing it to 0 indicates a 16-bit default size. The effect this bit has on a segment depends on the segment-descriptor type. See "Code-Segment Default-Operand Size (D) Bit" on page 81 for a description of the D bit in code-segment descriptors. "Data-Segment Default Operand Size (D/B) Bit" on page 83 describes the D bit in data-segment descriptors, including stack segments, where the bit is referred to as the "B" bit. From linux kernel: arch/x86/kernel/cpu/common.c [GDT_ENTRY_DEFAULT_USER_CS] a = 0x0000ffff = b = 0x00cffa00 --> 0000_0000_1100_1111_1111_1010_0000_0000 bit22------------------------^ [GDT_ENTRY_DEFAULT_USER_DS] a = 0x0000ffff b = 0x00cff200 --> 0000_0000_1100_1111_1111_0010_0000_0000 bit22------------------------^ So it appears that the default operand size is set to 32 bits. --------------------------------------------------------- Additional stuff from the AMD64 manual: Each instruction has a syntax that includes the mnemonic and any operands that the instruction can take. Figure 2-14 shows an example of a syntax in which the instruction takes two operands. In most instructions that take two operands, the first (left-most) operand is both a source operand (the first source operand) and the destination operand. The second (right-most) operand serves only as a source, not a destination. The following notation is used to denote the size and type of source and destination operands: cReg—Control register. dReg—Debug register. imm8—Byte (8-bit) immediate. imm16—Word (16-bit) immediate. imm16/32—Word (16-bit) or doubleword (32-bit) immediate. imm32—Doubleword (32-bit) immediate. imm32/64—Doubleword (32-bit) or quadword (64-bit) immediate. imm64—Quadword (64-bit) immediate. mem—An operand of unspecified size in memory. mem8—Byte (8-bit) operand in memory. mem16—Word (16-bit) operand in memory. mem16/32—Word (16-bit) or doubleword (32-bit) operand in memory. mem32—Doubleword (32-bit) operand in memory. mem32/48—Doubleword (32-bit) or 48-bit operand in memory. mem48—48-bit operand in memory. mem64—Quadword (64-bit) operand in memory. mem128—Double quadword (128-bit) operand in memory. mem16:16—Two sequential word (16-bit) operands in memory. mem16:32—A doubleword (32-bit) operand followed by a word (16-bit) operand in memory. mem32real—Single-precision (32-bit) floating-point operand in memory. mem32int—Doubleword (32-bit) integer operand in memory. mem64real—Double-precision (64-bit) floating-point operand in memory. mem64int—Quadword (64-bit) integer operand in memory. mem80real—Double-extended-precision (80-bit) floating-point operand in memory. mem80dec—80-bit packed BCD operand in memory, containing 18 4-bit BCD digits. mem2env—16-bit x87 control word or x87 status word. mem14/28env—14-byte or 28-byte x87 environment. The x87 environment consists of the x87 control word, x87 status word, x87 tag word, last non-control instruction pointer, last data pointer, and opcode of the last non-control instruction completed. mem94/108env—94-byte or 108-byte x87 environment and register stack. mem512env—512-byte environment for 128-bit media, 64-bit media, and x87 instructions. mmx—Quadword (64-bit) operand in an MMX register. mmx1—Quadword (64-bit) operand in an MMX register, specified as the left-most (first) operand in the instruction syntax. mmx2—Quadword (64-bit) operand in an MMX register, specified as the right-most (second) operand in the instruction syntax. mmx/mem32—Doubleword (32-bit) operand in an MMX register or memory. mmx/mem64—Quadword (64-bit) operand in an MMX register or memory. mmx1/mem64—Quadword (64-bit) operand in an MMX register or memory, specified as the left- most (first) operand in the instruction syntax. mmx2/mem64—Quadword (64-bit) operand in an MMX register or memory, specified as the right- most (second) operand in the instruction syntax. moffset—Direct memory offset that specifies an operand in memory. moffset8—Direct memory offset that specifies a byte (8-bit) operand in memory. moffset16—Direct memory offset that specifies a word (16-bit) operand in memory. moffset32—Direct memory offset that specifies a doubleword (32-bit) operand in memory. moffset64—Direct memory offset that specifies a quadword (64-bit) operand in memory. pntr16:16—Far pointer with 16-bit selector and 16-bit offset. pntr16:32—Far pointer with 16-bit selector and 32-bit offset. reg—Operand of unspecified size in a GPR register. reg8—Byte (8-bit) operand in a GPR register. reg16—Word (16-bit) operand in a GPR register. reg16/32—Word (16-bit) or doubleword (32-bit) operand in a GPR register. reg32—Doubleword (32-bit) operand in a GPR register. reg64—Quadword (64-bit) operand in a GPR register. reg/mem8—Byte (8-bit) operand in a GPR register or memory. reg/mem16—Word (16-bit) operand in a GPR register or memory. reg/mem32—Doubleword (32-bit) operand in a GPR register or memory. reg/mem64—Quadword (64-bit) operand in a GPR register or memory. rel8off—Signed 8-bit offset relative to the instruction pointer. rel16off—Signed 16-bit offset relative to the instruction pointer. rel32off—Signed 32-bit offset relative to the instruction pointer. segReg or sReg—Word (16-bit) operand in a segment register. ST(0)—x87 stack register 0. ST(i)—x87 stack register i, where i is between 0 and 7. xmm—Double quadword (128-bit) operand in an XMM register. xmm1—Double quadword (128-bit) operand in an XMM register, specified as the left-most (first) operand in the instruction syntax. xmm2—Double quadword (128-bit) operand in an XMM register, specified as the right-most (second) operand in the instruction syntax. xmm/mem64—Quadword (64-bit) operand in a 128-bit XMM register or memory. xmm/mem128—Double quadword (128-bit) operand in an XMM register or memory. xmm1/mem128—Double quadword (128-bit) operand in an XMM register or memory, specified as the left-most (first) operand in the instruction syntax. xmm2/mem128—Double quadword (128-bit) operand in an XMM register or memory, specified as the right-most (second) operand in the instruction syntax. opcode syntax In addition to the notation shown above in "Mnemonic Syntax" on page 37, the following notation indicates the size and type of operands in the syntax of an instruction opcode: /digit Indicates that the ModRM byte specifies only one register or memory (r/m) operand. The digit is specified by the ModRM reg field and is used as an instruction-opcode extension. Valid digit values range from 0 to 7. /r Indicates that the ModRM byte specifies both a register operand and a reg/mem (register or memory) operand. cb, cw, cd, cp Specifies a code-offset value and possibly a new code-segment register value. The value following the opcode is either one byte (cb), two bytes (cw), four bytes (cd), or six bytes (cp). ib, iw, id, iq Specifies an immediate-operand value. The opcode determines whether the value is signed or unsigned. The value following the opcode, ModRM, or SIB byte is either one byte (ib), two bytes (iw), or four bytes (id). Word and doubleword values start with the low-order byte. +rb, +rw, +rd, +rq Specifies a register value that is added to the hexadecimal byte on the left, forming a one-byte opcode. The result is an instruction that operates on the register specified by the register code. Valid register-code values are shown in Table 2-2. m64 Specifies a quadword (64-bit) operand in memory. +i Specifies an x87 floating-point stack operand, ST(i). The value is used only with x87 floating-point instructions. It is added to the hexadecimal byte on the left, forming a one-byte opcode. Valid values range from 0 to 7. --------------------------------------------------------- How is it different from these packages: (This need more info). ----- http://pypi.python.org/pypi/pyggy/0.3 pyggy 0.3 pyggy includes a lexer and parser generator. The lexer is table driven. The generated scanner is written in python. Takes an input specification file reminiscant of flex. ----- http://pypi.python.org/pypi/PyLexer/0.3 PyLexer 0.3 Takes an input file in the general style of flex and generates a python based scanner. Appears to be similar to pyggy. ----- http://pypi.python.org/pypi/SilverCity/0.9.5 Python bindings for the SilverCity package. There is some kind of relationship to Scintilla but its not clear to me. ----- http://pypi.python.org/pypi/reflex/0.1 ? ----- http://pypi.python.org/pypi/Pyrr/0.2 home page link is dead -----
About
Automatically exported from code.google.com/p/pytoken
Resources
License
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published