re2, a modern regular expression syntax

Beta

The rest of the readme will describe the final version, which is about two weeks from completion. Skip to the end for instructions on how to try it right now.

re2, a modern regular expression syntax

Regular Expressions are one of the best ideas in the programming world. However, Regular Expression syntax is a ^#.*! accident from the 70s. Let's fix it.

Now 100% less painful to migrate! (you heard that right: migration is not painful at all)

Should I Make The Switch?

If you're new to regular expressions, you should go ahead and learn the new version. It's easier to learn, easier to read and can be used in every existing tool that supports regular expressions (no need to install a new version, just pass it through the online translator).

If you use regular expressions in code (e.g. to specify HTTP routes, input validation, or string search patterns), re2 will make your codebase much more readable while keeping 100% backwards compatibility and requiring minimal effort to switch. Definitely switch.

If you use regular expressions heavily in a text editor, make sure before you switch that your editor has a plugin or external utility that enables re2 support, since going through the online translator becomes tiresome when done a lot. Plugins currently exist for vim and emacs.

Syntax

The traditional regex:

(\d+) Reasons To Switch To re2, The (\d+)th Made Me ([Ll][Aa][Uu][Gg][Hh]|[Cc][Rr][Yy])

May be written in re2 as:

[#save_num] Reasons To Switch To re2, The [#save_num]th Made Me [case_insensitive 'Laugh' | 'Cry'][#save_num=[capture 1+ #digit]]

Or, if you're in a hurry:

[c 1+ #d] Reasons To Switch To re2, The [c 1+ #d]th Made Me [ci 'Laugh' | 'Cry']

(and when you're done you can use our automatic tool to convert it to the more readable version and commit that instead.)
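For reference, this is how the traditional pattern above behaves in Python's standard `re` module. It is a sketch of the legacy side only and does not use the re2 library; the sample sentence is made up for illustration.

```python
import re

# The traditional regex from this section, unchanged.
LEGACY = re.compile(
    r"(\d+) Reasons To Switch To re2, The (\d+)th Made Me "
    r"([Ll][Aa][Uu][Gg][Hh]|[Cc][Rr][Yy])"
)

# Made-up sample input, not taken from the project.
match = LEGACY.search("10 Reasons To Switch To re2, The 7th Made Me LAUGH")
groups = match.groups()
print(groups)  # ('10', '7', 'LAUGH')
```

The three parenthesized groups are what the `[capture ...]` blocks express in the re2 version.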

Design criteria

Migration

Ease of migration trumps any other design consideration. Without a clear, painless migration path, there would be no adoption.

Capabilities should be exactly equivalent to those of legacy regex syntax
Provide a tool to translate between legacy and re2 syntax to aid in learning and porting existing code
Provide a tool to translate between short and long macro names (because typing [#start_line [1+ #letter] #end_line] instead of ^[a-zA-Z]+$ would be tedious)
Provide libraries for every common language with a function to convert re2 syntax to the language's legacy native syntax, and a factory that constructs compiled regex objects (since it returns a native regex engine object, no code changes will ever be required except for translating the patterns)
Provide a command line tool, e.g. $ grep -P "`re2 "[1+ #d] Reasons"`"
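The "translate, then hand the result to the native engine" idea from the list above can be sketched in a few lines of Python. This is a toy under stated assumptions, not the project's implementation: `toy_translate`, `toy_compile` and `TOY_MACROS` are hypothetical names, and the sketch understands only flat groups made of an optional `n+` followed by a few macros from this README.

```python
import re

# Hypothetical macro table: only a handful of names from this README.
TOY_MACROS = {
    "digit": r"\d", "d": r"\d",
    "letter": "[A-Za-z]", "l": "[A-Za-z]",
    "start_line": "^", "sl": "^",
    "end_line": "$", "el": "$",
}

# Matches one flat bracket group such as "[1+ #d]" (no nesting).
_GROUP = re.compile(r"\[([^\[\]]*)\]")


def _literal(text):
    """Escape literal text; stray brackets mean syntax this toy cannot handle."""
    if "[" in text or "]" in text:
        raise ValueError("nested or unbalanced brackets are not supported")
    return re.escape(text)


def _group(body):
    """Translate the inside of one group: an optional 'n+', then macros."""
    tokens = body.split()
    quantifier = ""
    if tokens and re.fullmatch(r"\d+\+", tokens[0]):
        count = int(tokens[0][:-1])
        quantifier = {0: "*", 1: "+"}.get(count, "{%d,}" % count)
        tokens = tokens[1:]
    parts = []
    for token in tokens:
        name = token[1:]
        if not token.startswith("#") or name not in TOY_MACROS:
            raise ValueError("unsupported token: %r" % token)
        parts.append(TOY_MACROS[name])
    if not parts:
        raise ValueError("empty group")
    inner = "".join(parts)
    return "(?:%s)%s" % (inner, quantifier) if quantifier else inner


def toy_translate(pattern):
    """Convert the supported subset of re2 syntax to legacy syntax."""
    out, pos = [], 0
    for m in _GROUP.finditer(pattern):
        out.append(_literal(pattern[pos:m.start()]))
        out.append(_group(m.group(1)))
        pos = m.end()
    out.append(_literal(pattern[pos:]))
    return "".join(out)


def toy_compile(pattern, flags=0):
    """Factory: returns an ordinary compiled `re` pattern object."""
    return re.compile(toy_translate(pattern), flags)


reasons = toy_compile("[1+ #d] Reasons")
word = toy_compile("[#sl][1+ #l][#el]")
print(reasons.search("I have 42 Reasons").group(0))  # 42 Reasons
```

Because the factory returns a plain `re` pattern object, calling code keeps using `search`, `sub` and friends unchanged; only the pattern strings differ.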

Syntax

Should be easy to read
Should be easy to teach
Should be easy to type (e.g. "between 3 and 5 times" is not a very good syntax)
Should minimize comic book cursing like ^[^#]*$
Should make simple expressions literals (i.e. /Yo, dawg!/ matches "Yo, dawg!" and no other string)
Should only have 1-2 "special characters" that make an expression be more than a simple literal
Should not rely on characters that need to be escaped in many use cases, e.g. " and \ in most languages' string literals, ` $ in bash (' is OK because every language that allows ' strings also allows " strings. Except for SQL. Sorry, SQL.)
Different things should look different, beware of Lisp-like parentheses forests
Should support macros (e.g. a way to write the IP address regex as something like /\bX.X.X.X\b/ where X is /(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]\d|\d)/)
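To make the macro criterion concrete, here is roughly what the IP address example looks like when done by hand in legacy syntax with Python's standard `re` module. It is a sketch of the motivation, not re2 code: `OCTET` and `IP` are illustrative names, and the dots are assumed to be meant literally, so they are escaped.

```python
import re

# One octet, 0-255 without leading zeros (the X from the text above).
OCTET = r"(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]\d|\d)"

# Build the full pattern by substituting the fragment four times.
IP = re.compile(r"\b{0}\.{0}\.{0}\.{0}\b".format(OCTET))

print(bool(IP.fullmatch("192.168.0.1")))  # True
print(bool(IP.fullmatch("256.1.1.1")))    # False
```

String substitution works, but the fragment lives outside the pattern language; a macro facility would let the pattern itself name and reuse it.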

Possible Names

re2 (regular expressions 2)
coffeex (coffee expressions)

Rejected Names

matchers
humexp (human expressions)
readex (readable expressions)
renex (renovated expressions)
modex (modern expressions)

Beta Installation

You can try it today with Python and/or grep, e.g.

$ pip install -e git+https://github.com/SonOfLilit/re2.git#egg=re2
$ echo "Trololo lolo" | grep -P "`re2 "[#sl]Tro[0+ #space | 'lo']lo[#el]"`"

import re2

INVALID = re2.compile("([0+ not ')'](")
STUFF_IN_PARENS = re2.compile("([0+ not ')'])")
def remove_parentheses(line):
    if INVALID.search(line):
        raise ValueError()
    return STUFF_IN_PARENS.sub('', line)
assert remove_parentheses('a(b)c(d)e') == 'ace'

(the original is from a hackathon project I participated in and looks like this:)

import re

def remove_parentheses(line):
    if re.search(r'\([^)]*\(', line):
        raise ValueError()
    return re.sub(r'\([^)]*\)', '', line)
assert remove_parentheses('a(b)c(d)e') == 'ace'

Tutorial

This is still in Beta, and we'd love to get your feedback on the syntax.

Anything outside of brackets is a literal:

This is a (short) literal :-)
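For contrast, the same text is not a valid literal in legacy syntax. A small sketch with Python's standard `re` module (legacy side only, no re2 involved) shows what has to be escaped:

```python
import re

text = "This is a (short) literal :-)"

# Legacy syntax: parentheses are grouping operators, so they must be escaped.
escaped = re.compile(r"This is a \(short\) literal :-\)")

# Unescaped, "(short)" becomes a capture group and the final ")" is unbalanced.
try:
    re.compile(text)
    unescaped_ok = True
except re.error:
    unescaped_ok = False

print(escaped.fullmatch(text) is not None, unescaped_ok)  # True False
```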

You can use macros like #digit (short: #d) or #not_linefeed (#nlf):

This is a [#lowercase #lc #lc #lc] regex :-)

You can repeat with n+ or n-m:

This is a [1+ #lc] regex :-)

If you want one of a few options, use |:

This is a ['Happy' | 'Short' | 'readable'] regex :-)

Capture with [capture <regex>]:

This is a [capture 1+ #letter | ' ' | ','] regex :-)

Define your own macros with #name=[<regex>]:

This is a [#trochee #trochee #trochee] regex :-)[
    #trochee=['Robot' | 'Ninja' | 'Pirate' | 'Doctor' | 'Laser' | 'Monkey' | 'XKCD856']]

Some macros you can use:

#any #a (but usually you want to use #nlf, see next line)
#linefeed #lf #not_linefeed #nlf
#carriage_return #cr #not_carriage_return #ncr
#tab #t #not_tab #nt
#digit #d #not_digit #nd
#letter #l #not_letter #nl
#lowercase #lc #not_lowercase #nlc
#uppercase #uc #not_uppercase #nuc
#space #s #not_space #ns
#word_character #wc
#word_boundary #wb

"[not 'a' | 'b']" => /[^ab]/
"[#digit | 'a' | 'b' | 'c' | 'd' | 'e' | 'f']" => /[0-9abcdef]/   ([a..f] syntax almost implemented)
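The legacy character classes on the right-hand side of these two translations can be tried out with Python's standard `re` module. This sketch exercises only the legacy forms; the sample strings are made up.

```python
import re

NOT_A_OR_B = re.compile(r"[^ab]")       # legacy form of [not 'a' | 'b']
HEX_DIGIT = re.compile(r"[0-9abcdef]")  # legacy form of [#digit | 'a' | ... | 'f']

non_ab = NOT_A_OR_B.findall("abcab-d")
hex_only = "".join(HEX_DIGIT.findall("0x1f, g7"))
print(non_ab, hex_only)  # ['c', '-', 'd'] 01f7
```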

Coming soon: #integer, #ip, ..., abc[ignore_case 'de' #lowercase] (which translates to abc[['D' | 'd'] ['E' | 'e'] [A-Za-z]]; today you just wouldn't try), [a..f], [0..255] (which translates to ['25' [0..5] | '2' [0..4] #d | '1' #d #d | [1..9] #d | #d]), [capture:name ...], [1+:fewest ...] (for non-greedy repeat), Unicode support, and full PCRE feature support (lookahead/lookbehind, some other stuff). See TODO.txt.
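The planned [1+:fewest ...] form corresponds to legacy non-greedy repetition. Since it is not implemented yet, this sketch shows only the legacy behavior it would presumably map to, using Python's standard `re` module and a made-up sample string.

```python
import re

text = "<b>bold</b>"

greedy = re.search(r"<.+>", text).group(0)   # + takes as much as possible
fewest = re.search(r"<.+?>", text).group(0)  # +? takes as little as possible
print(greedy)  # <b>bold</b>
print(fewest)  # <b>
```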

License

re2 is distributed under the MIT license:

Copyright (c) 2015, Aur Saraf
All rights reserved.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
