
sre_yield

Quick Start

The goal of sre_yield is to efficiently generate all values that can match a given regular expression, or count possible matches efficiently. It uses the parsed regular expression, so you get a much more accurate result than trying to just split strings.

>>> s = 'foo|ba[rz]'
>>> s.split('|')  # bad
['foo', 'ba[rz]']

>>> import sre_yield
>>> list(sre_yield.AllStrings(s))  # better
['foo', 'bar', 'baz']

It does this by walking the tree constructed by sre_parse (the same parser used internally by the re module) and building chained/repeating iterators as appropriate. There may be duplicate results, depending on your input string -- these are cases that sre_parse does not optimize.

>>> import sre_yield
>>> list(sre_yield.AllStrings('.|a', charset='ab'))
['a', 'b', 'a']

...and this happens in simpler cases too:

>>> list(sre_yield.AllStrings('a|a'))
['a', 'a']
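The tree in question can be inspected directly with the stdlib parser -- a quick sketch (sre_parse is deprecated since Python 3.11 but still importable):

```python
# Peek at the parse tree that sre_yield walks; sre_parse is the parser
# the re module uses internally (deprecated since Python 3.11, still present).
import sre_parse

tree = sre_parse.parse('foo|ba[rz]')
# The whole pattern parses to a single BRANCH node holding the alternatives.
for op, arg in tree:
    print(op)  # prints: BRANCH
```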

Quirks

The membership check, 'abc' in values_obj, is by necessity a fullmatch -- it must cover the entire string. Imagine that the pattern has ^(...)$ around it. Because re.search can match anywhere in an arbitrarily long string, emulating it would produce a large number of junk matches -- probably not what you want. (If that is what you want, add a .* on either side.)
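A minimal stdlib-only sketch of the distinction, reusing a pattern from the quick start:

```python
import re

# Membership in sre_yield behaves like re.fullmatch, not re.search:
assert re.search('ba[rz]', 'rebar') is not None   # matches a substring
assert re.fullmatch('ba[rz]', 'rebar') is None    # must cover the whole string
# Emulating search by padding the pattern with .* on either side:
assert re.fullmatch('.*ba[rz].*', 'rebar') is not None
```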

Here's a quick example, using the presidents regex from http://xkcd.com/1313/

>>> s = 'bu|[rn]t|[coy]e|[mtg]a|j|iso|n[hl]|[ae]d|lev|sh|[lnd]i|[po]o|ls'

>>> import re
>>> re.search(s, 'kennedy') is not None  # note .search
True
>>> v = sre_yield.AllStrings(s)
>>> len(v)
23
>>> 'bu' in v
True
>>> v[:5]
['bu', 'rt', 'nt', 'ce', 'oe']

If you do want to emulate search, you end up with a large number of matches quickly. Limiting the repetition a bit helps, but it's still a very large number.

>>> v2 = sre_yield.AllStrings('.{,30}(' + s + ').{,30}')
>>> el = v2.__len__()  # len() raises OverflowError for values this large
>>> print(el)
57220492262913872576843611006974799576789176661653180757625052079917448874638816841926032487457234703154759402702651149752815320219511292208238103
>>> 'kennedy' in v2
True
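A rough upper bound on that count can be computed by hand, assuming the default charset of all 256 byte values (an assumption for illustration; overlapping matches make the true count somewhat smaller than this bound):

```python
# Each .{,30} can be any string of length 0..30 over a 256-character
# charset (an assumed default), and the presidents regex has 23 alternatives.
n = sum(256 ** k for k in range(31))  # number of strings of length 0..30
bound = n * n * 23                    # prefix choices * suffix choices * alternatives
print(len(str(bound)))                # 146 digits, same length as the count above
```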

Capturing Groups

If you're interested in extracting what the groups would match during generation of a value, use AllMatches instead of AllStrings to get Match objects.

>>> v = sre_yield.AllMatches(r'a(\d)b')
>>> m = v[0]
>>> m.group(0)
'a0b'
>>> m.group(1)
'0'

This even works for simple backreferences, used here to require matching quotes.

>>> v = sre_yield.AllMatches(r'(["\'])([01]{3})\1')
>>> m = v[0]
>>> m.group(0)
'"000"'
>>> m.groups()
('"', '000')
>>> m.group(1)
'"'
>>> m.group(2)
'000'

Anchors and Lookaround

Some very simple anchors are supported out of the box, to enable parsing patterns such as ^foo$ where the anchors are redundant given the fullmatch semantics. More complex anchoring should raise an exception at parse time, as should any use of lookaround. (\b is supported at the beginning/end, despite this being not quite correct.)

>>> list(sre_yield.AllStrings('foo$'))
['foo']
>>> list(sre_yield.AllStrings('^$'))
['']
>>> list(sre_yield.AllStrings('.\\b.'))  # doctest: +IGNORE_EXCEPTION_DETAIL
Traceback (most recent call last):
...
ParseError: Non-end-anchor None found at END state

Supporting these in a limited way is possible, but in 6+ years I haven't found the time (or a need) to implement it. If you do need them and don't mind (potentially many) extra matches, pass relaxed=True to pretend these constructs don't exist; the values returned by sre_yield will then be a superset of the true matches, which you can postprocess yourself.

>>> pattern = '.\\b.'
>>> values = list(sre_yield.AllStrings(pattern, charset='a-', relaxed=True))
>>> values
['aa', '-a', 'a-', '--']
>>> [v for v in values if re.fullmatch(pattern, v)]
['-a', 'a-']

Reporting Bugs, etc.

We welcome bug reports -- see our issue tracker on GitHub to see if it's been reported before. If you'd like to discuss anything, we have a Google Group as well.

We're aware of three similar modules, but each has a different goal.

xeger

Xeger was originally written in Java and ported to Python. It generates random entries, which may suffice if you just want a few matching values. This module and xeger differ statistically in how they handle repetitions:

>>> import random
>>> v = sre_yield.AllStrings('[abc]{1,4}')
>>> len(v)
120

# Now random.choice(v) has a 3/120 chance of choosing a single letter.
>>> len([x for x in v if len(x) == 1])
3

# xeger(v) has ~25% chance of choosing a single letter, because the length
# and match are chosen independently.
# (This doesn't run, so the doctests don't require xeger)
> from rstr import xeger
> sum([1 if len(xeger('[abc]{1,4}')) == 1 else 0 for _ in range(120)])
26
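That ~25% figure comes from choosing the repetition count uniformly and independently of the characters; a quick stdlib simulation (no xeger needed) reproduces the statistic:

```python
import random

random.seed(0)  # reproducible sketch
trials = 120_000
# xeger-style sampling: pick a length in 1..4 uniformly, then the characters.
short = sum(1 for _ in range(trials) if random.randint(1, 4) == 1)
print(round(short / trials, 2))  # ~0.25, versus 3/120 = 0.025 uniform-over-values
```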

In addition, xeger differs in that '.' by default matches only printable characters (behavior you can get in sre_yield by setting charset=string.printable, if desired).

sre_dump

Another module that walks sre_parse's tree is sre_dump, although it does not generate matches; it only reconstructs the string pattern (useful primarily if you hand-generate a tree). If you're interested in this space, it's a good read. http://www.dalkescientific.com/Python/sre_dump.html

jpetkau1

jpetkau1 can find matches by using randomization, so it sort of handles anchors -- not guaranteed, but it's another good look at the internals. http://web.archive.org/web/20071024164712/http://www.uselesspython.com/jpetkau1.py (a slightly older version appears in the announcement on python-list).

Differences between sre_yield and the re module

There are certainly valid regular expressions that sre_yield does not handle. These include lookarounds and more complex backreferences, but there are also a few other differences:

  • The maximum value for repeats is system-dependent -- in CPython's sre module there is a special value that is treated as infinite (either 2**16-1 or 2**32-1 depending on the build). In sre_yield, this value is taken literally rather than as infinite, thus (on a 2**16-1 platform):

    >>> len(sre_yield.AllStrings('a*')[-1])
    65535
    >>> import re
    >>> len(re.match('.*', 'a' * 100000).group(0))
    100000
  • The re module docs say "Regular expression pattern strings may not contain null bytes" yet this appears to work fine.
  • Order does not depend on greediness.
  • The regex is treated as fullmatch.
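The greediness point can be illustrated with the re module itself: greedy and lazy quantifiers match the same language but prefer different results, a distinction sre_yield's enumeration ignores (stdlib-only sketch):

```python
import re

# re's preferred match depends on greediness:
assert re.match('a?', 'a').group(0) == 'a'   # greedy: prefers the longer match
assert re.match('a??', 'a').group(0) == ''   # lazy: prefers the shorter match
# Both patterns match exactly the same set of strings, {'', 'a'}, which is
# what sre_yield enumerates regardless of greediness.
```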

