Skip to content

zurgzurg/pytoken

Repository files navigation

What is pytoken

Pytoken is a scanner generator. Given an input specification - a
bunch of regular expressions - pytoken will generate x86 machine
code that recognizes those regular expressions. Pytoken will be
most useful for programmers that want to parse text files.

Install
----------------------------------------------
To install pytoken:

    python ./setup.py install


Documentation
----------------------------------------------
To build the documentation:
    cd doc
    make -f Makefile.docs html


Getting Started
----------------------------------------------
Pytoken has separate objects for scanners and buffers.

Here is a simple example:
  import pytoken

  lexer_obj = pytoken.lexer()
  lexer_obj.add_pattern("a", 1)
  lexer_obj.add_pattern("b", 2)
  lexer_obj.compile_to_machine_code()

  buf = pytoken.lexer_state()
  buf.set_input("ab")

  tok = lexer_obj.get_token(buf)
  assert tok == 1
  tok = lexer_obj.get_token(buf)
  assert tok == 2

Pytoken requires only python developement libraries, make and a C compiler.

Pytoken has been written in a portable fashion - it is designed to support
multiple CPU types, even though only the x86 (32 bit) is supported now.

Why use pytoken

1. Ease of use. Pytoken was created to make parsing files easy. The flex/lex
   based scanners that are traditionally used in C based solutions have
   complicated rules for buffer management. Pytoken follows the python
   philispophy and makes buffer manegement explicit.

2. Performance. pytoken was also written with the aim of creating very
   high speed lexers. Tokenization is expensive, by definition some amount
   of work needs to be done for each character in the input stream. Since
   pytoken generates machine code directly it can ensure that no extra
   work is done is processing the input stream. Pytoken aims to keep the
   overhead of executing inside the python interpreter to a minimum.
   Minimal benchmarking - see bencharm.py shows that pytoken is around
   30% slower than flex.

TODO
----------------------------------------------
For next full release:

	'|' as a regex is illegal

	some simple examples

	support \xx encoding of chars in hex

	Need to support matching NULL chars in users patterns.

	Fully working end of buffer detection.

Before initial release:

done	make buffers work with file like objects - set_input()
done	make buffers work with file like objects - fill()

done	handle None as token id

done	support token id as callable, need to have clear
done	protocol on getting token info from buffer. Need
done	way to get buffer contents as well., also callable
done	objs must have a way of passing user defined args.

done	very long token - larger that 4k
	
done	fill buffers with varying token sizes - to ensure
	that tokens will be split accross buffer fills


done	benchmarking vs flex


Other TODO Items:
inprog	Need a manual.

inprog	Need to support a richer set of REGEXP meta chars.

	Rework/Rewrite the IR representation. This needs to
	be done before better machine code can be generated.

	Emit better machine code. Need a frame work for doing
	traditional compiler style optimizations.

	? Comments from users ?


Info for developers
----------------------------------------------------
General info

testing the assembler: Normally I just use gas or nasm to
test create the expected output.

Notes on x86 assembler
----------------------

        +-----+-------+-------+
ModRM : | 7 6 | 5 4 3 | 2 1 0 |
        +-----+-------+-------+
         mod    reg     r/m


The AMD and Intel manuals specify some opcodes with a ModRM
byte:

  /0 /1 /2 /3 /4 /5 /6 /7
  The ModRM byte specifies only one register or memory
  operand. The digit is specified by the ModRM reg field.

  /r
  Indicates that the ModRM byte specifies both a register
  operand and a reg/mem (register or memory) operand.

junk.s:
-----------------------
l1:	nop
	jmp l1
-----------------------
$ as -a junk.s
GAS LISTING junk.s                      page 1


   1 0000 90            l1:     nop
   2 0001 EBFD                  jmp l1
   3

GAS LISTING junk.s                      page 2


DEFINED SYMBOLS
              junk.s:1      .text:00000000 l1

NO UNDEFINED SYMBOLS

---------------------------------------------------------
GDT LDT from kernel

                       requestor
                       priv
                       level
cs == #x73 ==> 0_1110 0 11
                 14

ds == #x7b ==> 0_1111 0 11
                 15


From the AMD64 manual:

Default Operand Size (D/B) Bit. Bit 22 of byte +4. The default
operand-size bit is found in code- segment and data-segment
descriptors but not in system-segment descriptors. Setting this bit to
1 indicates a 32-bit default operand size, and clearing it to 0
indicates a 16-bit default size. The effect this bit has on a segment
depends on the segment-descriptor type. See "Code-Segment
Default-Operand Size (D) Bit" on page 81 for a description of the D
bit in code-segment descriptors. "Data-Segment Default Operand Size
(D/B) Bit" on page 83 describes the D bit in data-segment descriptors,
including stack segments, where the bit is referred to as the "B" bit.


From linux kernel: arch/x86/kernel/cpu/common.c
[GDT_ENTRY_DEFAULT_USER_CS]
   a = 0x0000ffff = 
   b = 0x00cffa00 --> 0000_0000_1100_1111_1111_1010_0000_0000
    bit22------------------------^

[GDT_ENTRY_DEFAULT_USER_DS]
   a = 0x0000ffff
   b = 0x00cff200 --> 0000_0000_1100_1111_1111_0010_0000_0000
    bit22------------------------^


So it appears that the default operand size is set to 32 bits.


---------------------------------------------------------
Additional stuff from the AMD64 manual:

Each instruction has a syntax that includes the mnemonic and any
operands that the instruction can take. Figure 2-14 shows an example
of a syntax in which the instruction takes two operands. In most
instructions that take two operands, the first (left-most) operand is
both a source operand (the first source operand) and the destination
operand. The second (right-most) operand serves only as a source, not
a destination.


The following notation is used to denote the size and type of source
and destination operands:

   cReg—Control register.
   dReg—Debug register.
   imm8—Byte (8-bit) immediate.
   imm16—Word (16-bit) immediate.
   imm16/32—Word (16-bit) or doubleword (32-bit) immediate.
   imm32—Doubleword (32-bit) immediate.
   imm32/64—Doubleword (32-bit) or quadword (64-bit) immediate.
   imm64—Quadword (64-bit) immediate.
   mem—An operand of unspecified size in memory.
   mem8—Byte (8-bit) operand in memory.
   mem16—Word (16-bit) operand in memory.
   mem16/32—Word (16-bit) or doubleword (32-bit) operand in memory.
   mem32—Doubleword (32-bit) operand in memory.
   mem32/48—Doubleword (32-bit) or 48-bit operand in memory.
   mem48—48-bit operand in memory.


   mem64—Quadword (64-bit) operand in memory.
   mem128—Double quadword (128-bit) operand in memory.
   mem16:16—Two sequential word (16-bit) operands in memory.
   mem16:32—A doubleword (32-bit) operand followed by a word (16-bit) operand in memory.
   mem32real—Single-precision (32-bit) floating-point operand in memory.
   mem32int—Doubleword (32-bit) integer operand in memory.
   mem64real—Double-precision (64-bit) floating-point operand in memory.
   mem64int—Quadword (64-bit) integer operand in memory.
   mem80real—Double-extended-precision (80-bit) floating-point operand in memory.
   mem80dec—80-bit packed BCD operand in memory, containing 18 4-bit BCD digits.
   mem2env—16-bit x87 control word or x87 status word.
   mem14/28env—14-byte or 28-byte x87 environment. The x87 environment consists of the x87
   control word, x87 status word, x87 tag word, last non-control instruction pointer, last data pointer,
   and opcode of the last non-control instruction completed.
   mem94/108env—94-byte or 108-byte x87 environment and register stack.
   mem512env—512-byte environment for 128-bit media, 64-bit media, and x87 instructions.
   mmx—Quadword (64-bit) operand in an MMX register.
   mmx1—Quadword (64-bit) operand in an MMX register, specified as the left-most (first) operand
   in the instruction syntax.
   mmx2—Quadword (64-bit) operand in an MMX register, specified as the right-most (second)
   operand in the instruction syntax.
   mmx/mem32—Doubleword (32-bit) operand in an MMX register or memory.
   mmx/mem64—Quadword (64-bit) operand in an MMX register or memory.
   mmx1/mem64—Quadword (64-bit) operand in an MMX register or memory, specified as the left-
   most (first) operand in the instruction syntax.
   mmx2/mem64—Quadword (64-bit) operand in an MMX register or memory, specified as the right-
   most (second) operand in the instruction syntax.
   moffset—Direct memory offset that specifies an operand in memory.
   moffset8—Direct memory offset that specifies a byte (8-bit) operand in memory.
   moffset16—Direct memory offset that specifies a word (16-bit) operand in memory.
   moffset32—Direct memory offset that specifies a doubleword (32-bit) operand in memory.
   moffset64—Direct memory offset that specifies a quadword (64-bit) operand in memory.
   pntr16:16—Far pointer with 16-bit selector and 16-bit offset.
   pntr16:32—Far pointer with 16-bit selector and 32-bit offset.
   reg—Operand of unspecified size in a GPR register.
   reg8—Byte (8-bit) operand in a GPR register.
   reg16—Word (16-bit) operand in a GPR register.
   reg16/32—Word (16-bit) or doubleword (32-bit) operand in a GPR register.
   reg32—Doubleword (32-bit) operand in a GPR register.
   reg64—Quadword (64-bit) operand in a GPR register.
   reg/mem8—Byte (8-bit) operand in a GPR register or memory.
   reg/mem16—Word (16-bit) operand in a GPR register or memory.
   reg/mem32—Doubleword (32-bit) operand in a GPR register or memory.
   reg/mem64—Quadword (64-bit) operand in a GPR register or memory.
   rel8off—Signed 8-bit offset relative to the instruction pointer.
   rel16off—Signed 16-bit offset relative to the instruction pointer.
   rel32off—Signed 32-bit offset relative to the instruction pointer.
   segReg or sReg—Word (16-bit) operand in a segment register.
   ST(0)—x87 stack register 0.
   ST(i)—x87 stack register i, where i is between 0 and 7.
   xmm—Double quadword (128-bit) operand in an XMM register.
   xmm1—Double quadword (128-bit) operand in an XMM register, specified as the left-most (first)
   operand in the instruction syntax.
   xmm2—Double quadword (128-bit) operand in an XMM register, specified as the right-most
   (second) operand in the instruction syntax.
   xmm/mem64—Quadword (64-bit) operand in a 128-bit XMM register or memory.
   xmm/mem128—Double quadword (128-bit) operand in an XMM register or memory.
   xmm1/mem128—Double quadword (128-bit) operand in an XMM register or memory, specified as
   the left-most (first) operand in the instruction syntax.
   xmm2/mem128—Double quadword (128-bit) operand in an XMM register or memory, specified as
   the right-most (second) operand in the instruction syntax.



opcode syntax

In addition to the notation shown above in "Mnemonic Syntax" on page
37, the following notation indicates the size and type of operands in
the syntax of an instruction opcode:

    /digit Indicates that the ModRM byte specifies only
           one register or memory (r/m) operand. The digit is
           specified by the ModRM reg field and is used as an
           instruction-opcode extension. Valid digit values
           range from 0 to 7.

    /r     Indicates that the ModRM byte specifies both a
           register operand and a reg/mem (register or
           memory) operand.

    cb, cw, cd, cp
           Specifies a code-offset value and possibly a new
           code-segment register value. The value following the
           opcode is either one byte (cb), two bytes (cw),
           four bytes (cd), or six bytes (cp).

    ib, iw, id, iq
           Specifies an immediate-operand value. The opcode
           determines whether the value is signed or unsigned. The
           value following the opcode, ModRM, or SIB byte is either
           one byte (ib), two bytes (iw), or four bytes (id). Word
           and doubleword values start with the low-order byte.

    +rb, +rw, +rd, +rq
           Specifies a register value that is added to the
           hexadecimal byte on the left, forming a one-byte opcode.
           The result is an instruction that operates on the
           register specified by the register code. Valid register-code
           values are shown in Table 2-2.

  m64      Specifies a quadword (64-bit) operand in memory.

  +i       Specifies an x87 floating-point stack operand,
           ST(i). The value is used only with x87 floating-point
           instructions. It is added to the hexadecimal byte on
           the left, forming a one-byte opcode. Valid values
           range from 0 to 7.

---------------------------------------------------------




How is it different from these packages: (This need more info).

-----
http://pypi.python.org/pypi/pyggy/0.3
pyggy 0.3

pyggy includes a lexer and parser generator. The lexer is
table driven. The generated scanner is written in python. Takes
an input specification file reminiscant of flex.


-----
http://pypi.python.org/pypi/PyLexer/0.3
PyLexer 0.3

Takes an input file in the general style of flex and generates
a python based scanner. Appears to be similar to pyggy.

-----
http://pypi.python.org/pypi/SilverCity/0.9.5

Python bindings for the SilverCity package. There is some
kind of relationship to Scintilla but its not clear to me.


-----
http://pypi.python.org/pypi/reflex/0.1

?

-----
http://pypi.python.org/pypi/Pyrr/0.2
home page link is dead

-----

About

Automatically exported from code.google.com/p/pytoken

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published