Differences between revisions 12 and 22 (spanning 10 versions)
Revision 12 as of 2006-01-28 11:19:02
Size: 5316
Editor: 84
Comment: = Missing unicode documentation =
Revision 22 as of 2015-08-30 10:11:18
Size: 5947
Editor: pythonguru
Comment:
Line 3: Line 3:
 * PythonLibraryReference, [http://www.python.org/doc/current/lib/module-re.html 4.2 re module]
 * PythonLibraryReference, [http://www.python.org/doc/current/lib/re-syntax.html 4.2.1 Regular Expression Syntax]


== Reference Diagrams ==

http://taoriver.net/img/for_pi/regex_characters.png
{{http://wiki.python.org/moin/RegularExpression?action=AttachFile&do=get&target=regex_characters.png}}
Line 13: Line 7:
http://taoriver.net/img/for_pi/regex_flags.png {{http://taoriver.net/img/for_pi/regex_flags.png}}
Line 15: Line 9:
(All images PD, released by me, author, LionKimbro.)

DavidMertz has also created [http://gnosis.cx/TPiP/regex_patterns.gif a regular expression reference.]
(All images PD, released by author, LionKimbro.)
Line 69: Line 61:
I don't know how much faster compiled forms are than non-compiled forms. The relative speed of compiled versus non-compiled patterns can be shown using the timeit module:
Line 71: Line 63:
= Links = {{{
python -m timeit -s 'import re' \
   'match_obj = re.match("<(.*?)>(.*?)</(.*?)>", "<h1>robot</h1>")'
1000000 loops, best of 3: 1.35 usec per loop
Line 73: Line 68:
 * [http://www.amk.ca/python/howto/regex/ Regular Expression HOWTO] - excellent Python-based regular expression tutorial, by [http://www.amk.ca/ A.M. Kuchling.] python -m timeit -s 'import re ; match_re = re.compile("<(.*?)>(.*?)</(.*?)>")' \
   'match_obj = match_re.match("<h1>robot</h1>")'
1000000 loops, best of 3: 0.572 usec per loop
}}}
Line 75: Line 73:
For those interested in visualization, you may also be interested in a [http://www.ozonehouse.com/mark/blog/code/PeriodicTable.pdf periodic table of PERL operators.] The above numbers were produced on an Intel Core i7 3770k running Python 2.7.6 (circa 2014). You should run the above commands on your Python version and specific hardware, and use patterns that represent your problem domain, for more representative results.
Line 77: Line 75:
= Missing unicode documentation =
There is no documentation on how to use re with Unicode in Python
http://www.amk.ca/python/howto/regex/
= Missing features =
== See Also ==

 * [[http://www.python.org/doc/current/lib/module-re.html|4.2 re module]] -- PythonLibraryReference
 * [[http://www.python.org/doc/current/lib/re-syntax.html|4.2.1 Regular Expression Syntax]] -- PythonLibraryReference
 * [[http://www.amk.ca/python/howto/regex/|Regular Expression HOWTO]] - excellent Python-based regular expression tutorial, by [[http://www.amk.ca/|A.M. Kuchling.]]
 * [[http://www.bitcetera.com/en/techblog/2008/04/01/regex-in-a-nutshell/|Regex in a Nutshell]] cheat sheet
 * [[http://www.regexbuddy.com/python.html|RegexBuddy]] - Handy tool to create and test Python regular expressions
 * [[http://gnosis.cx/TPiP/regex_patterns.gif|Summary of Regular Expression Patterns]] -- by DavidMertz
 * [[http://pythoncard.sourceforge.net/samples/redemo.html|redemo tool]] -- ships with Python (C:\\Python24\Tools\Scripts\redemo.py), indispensable when trying out regular expressions; ships with PythonCard as well
 * [[http://www.ozonehouse.com/mark/blog/code/PeriodicTable.pdf|periodic table of PERL operators]] -- for those who like visualization
 * [[http://thepythonguru.com/python-regular-expression/|Regular Expression Starter]] -- A simple guide for beginners
 * SVG source: [[http://taoriver.net/img/for_pi/regex_characters.svg|1,]] [[http://taoriver.net/img/for_pi/regex_flags.svg|2.]]


== Discussion ==

=== Requests ===

 * documentation on using re with [[Unicode]] ..?

=== Problem? ===
Line 91: Line 107:
-- anonymous
Line 92: Line 109:
= Discussion = I don't understand the problem. -- LionKimbro <<DateTime(2006-03-25T16:31:35Z)>>
Line 94: Line 111:
I've made a couple of diagrams, which I've linked at the top of the page here. === Visualization ===
Line 96: Line 113:
I have SVG links as well; [http://taoriver.net/img/for_pi/regex_characters.svg first] and [http://taoriver.net/img/for_pi/regex_flags.svg second.]

Damn the [http://visual.wiki.taoriver.net/moin.cgi/LongImageIncorporationProcess LongImageIncorporationProcess!] Damn it to hell! We'd have a billion pretty pictures here, if tablets were cheap, and we had protocols and implementations for saving and loading straight to and from the wiki.

I have another diagram I'd like to make and place; It's the pattern (RegexObject) and match (MatchObject) API, visualized, and arranged dense. We'll see if I get around to drawing it, but it looks like no. Too much else to do.

-- LionKimbro [[DateTime(2004-12-28T08:38:12Z)]]


  ''I like the Venn diagram in this image. However, one part of the image is confusing. It refers to Python strings, to "regex strings" (which are actually Python "raw" strings), and to something called "match strings" ... what are these "match strings"? -- JimD [[DateTime(2004-12-30T20:03:54Z)]]''
  ''I like the Venn diagram in this image. However, one part of the image is confusing. It refers to Python strings, to "regex strings" (which are actually Python "raw" strings), and to something called "match strings" ... what are these "match strings"? -- JimD <<DateTime(2004-12-30T20:03:54Z)>>''
Line 109: Line 117:
That said: The "match string" is the final product of either of the two above expressions. It is what the above two expressions will literally match. If you have a better phrase, or would like to correct "raw" to "regex," feel free to download the SVG, edit the text, place an image on the web, and link it from here. (The damn [http://visual.wiki.taoriver.net/moin.cgi/LongImageIncorporationProcess LongImageIncorporationProcess] strikes again.) I may eventually get around to it myself one day, but it seems there are higher priorities, and the diagram is "good enough." That said: The "match string" is the final product of either of the two above expressions. It is what the above two expressions will literally match. If you have a better phrase, or would like to correct "raw" to "regex," feel free to download the SVG, edit the text, place an image on the web, and link it from here. (The damn [[http://visual.wiki.taoriver.net/moin.cgi/LongImageIncorporationProcess|LongImageIncorporationProcess]] strikes again.) I may eventually get around to it myself one day, but it seems there are higher priorities, and the diagram is "good enough."
Line 111: Line 119:
That said, I appreciate the correction. -- LionKimbro [[DateTime(2005-01-01T00:22:14Z)]] That said, I appreciate the correction. -- LionKimbro <<DateTime(2005-01-01T00:22:14Z)>>
Line 113: Line 121:
I should mention: PythonCard comes with "'''redemo'''," which is simply indispensable if you are doing a bunch of regex work. === Image Hosting ===
Line 115: Line 123:
It lets you type in a regular expression, and highlights matches. You can edit either regex or text, the highlights adjust in realtime. And then, it prints out how the .groups() are identified in code. It's terribly useful.
-- LionKimbro [[DateTime(2005-07-19T23:34:53Z)]]
At the top of this page were two images/schemes about re. Is it possible to redraw them here somehow? That server is not working; maybe someone has them downloaded locally. Thanks a lot.
-- PavelKosina

I've uploaded one as an attachment, still need to upload the other... And, the source...

(Anyone can do this, though, when the computers are online.)

-- LionKimbro <<DateTime(2006-03-25T16:31:35Z)>>
----
CategoryDocumentation

http://wiki.python.org/moin/RegularExpression?action=AttachFile&do=get&target=regex_characters.png

flags when compiling:

http://taoriver.net/img/for_pi/regex_flags.png

(All images PD, released by author, LionKimbro.)
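As a rough illustration of passing flags when compiling, here is a minimal sketch. The pattern and the names `heading_re`, `flag_match` and `plain_match` are made up for this example; `re.VERBOSE` and `re.IGNORECASE` are standard flags in the `re` module.

```python
import re

# re.VERBOSE lets the pattern contain whitespace and comments;
# re.IGNORECASE makes letters match regardless of case.
heading_re = re.compile(r"""
    <h1>     # opening tag, written in lower case
    (.*?)    # heading text, as short as possible
    </h1>    # closing tag
    """, re.VERBOSE | re.IGNORECASE)

flag_match = heading_re.match("<H1>robot</H1>")
print(flag_match.group(1))   # robot

# Without the IGNORECASE flag the upper-case tags do not match.
plain_match = re.match("<h1>(.*?)</h1>", "<H1>robot</H1>")
print(plain_match)           # None
```

Flags are combined with `|`, so any number of them can be given in one call.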

Searching & Matching

You can search or match.

  • search -- find something anywhere in the string, and return it

  • match -- find something from the beginning of the string, and return it
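A small sketch of the difference between the two; the sample text and variable names are only illustrative:

```python
import re

text = "say <h1>robot</h1>"
tag_pattern = "<(.*?)>"

# search scans forward until the pattern matches somewhere in the string.
found_anywhere = re.search(tag_pattern, text)

# match only tries at position 0, and the text starts with "say", not "<".
found_at_start = re.match(tag_pattern, text)

print(found_anywhere.group(1))   # h1
print(found_at_start)            # None
```

Both functions return a match object on success and None on failure, so the result should be checked before calling methods such as .group() on it.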

You can also split on a pattern.

For example:

    import re
    # The outer parentheses form a capturing group, so each delimiter is kept in the result.
    split_up = re.split(r"(\(\([^)]+\)\))",
                        "This is a ((test)) of the ((emergency broadcasting station.))")

...which produces:

['This is a ', '((test))', ' of the ', '((emergency broadcasting station.))', '']

(The trailing empty string is there because the text ends with a match.)
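The delimiters show up in the result only because the pattern is wrapped in a capturing group. A quick sketch comparing the two forms (the variable names are just for illustration):

```python
import re

text = "This is a ((test)) of the ((emergency broadcasting station.))"

# With a capturing group, the matched delimiters are kept in the list.
with_delimiters = re.split(r"(\(\([^)]+\)\))", text)

# Without the group, only the text between the delimiters is returned.
without_delimiters = re.split(r"\(\([^)]+\)\)", text)

print(with_delimiters)
print(without_delimiters)   # ['This is a ', ' of the ', '']
```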

Compiling

If you use a regex a lot, compile it first.

Consider:

    import re
    # Match an opening tag, its body, and a closing tag at the start of the string.
    match_obj = re.match("<(.*?)>(.*?)</(.*?)>", "<h1>robot</h1>")
    print(match_obj.groups())

...which outputs: ('h1', 'robot', 'h1')

If you were going to do that match a lot, you could compile it, like so:

    import re
    # Compile once, then reuse the pattern object for every match.
    match_re = re.compile("<(.*?)>(.*?)</(.*?)>")
    match_obj = match_re.match("<h1>robot</h1>")
    print(match_obj.groups())

...which yields the same result.

The relative speed of compiled versus non-compiled patterns can be shown using the timeit module:

python -m timeit -s 'import re' \
   'match_obj = re.match("<(.*?)>(.*?)</(.*?)>", "<h1>robot</h1>")'
1000000 loops, best of 3: 1.35 usec per loop

python -m timeit -s 'import re ; match_re = re.compile("<(.*?)>(.*?)</(.*?)>")' \
   'match_obj = match_re.match("<h1>robot</h1>")'
1000000 loops, best of 3: 0.572 usec per loop

The above numbers were produced on an Intel Core i7 3770k running Python 2.7.6 (circa 2014). You should run the above commands on your Python version and specific hardware, and use patterns that represent your problem domain, for more representative results.
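The same comparison can also be run from inside a script. This is only a sketch: the helper names and the run count are arbitrary choices, and the numbers it prints will differ from the ones quoted above.

```python
import re
import timeit

PATTERN = "<(.*?)>(.*?)</(.*?)>"
TEXT = "<h1>robot</h1>"
compiled = re.compile(PATTERN)

def match_uncompiled():
    # Goes through the module-level function on every call.
    return re.match(PATTERN, TEXT)

def match_compiled():
    # Calls the method on the pattern object compiled above.
    return compiled.match(TEXT)

runs = 100000
uncompiled_seconds = timeit.timeit(match_uncompiled, number=runs)
compiled_seconds = timeit.timeit(match_compiled, number=runs)

print("re.match:         %.3f usec per call" % (uncompiled_seconds / runs * 1e6))
print("compiled pattern: %.3f usec per call" % (compiled_seconds / runs * 1e6))
```

The re module keeps an internal cache of recently compiled patterns, so the gap mostly reflects the cost of the cache lookup rather than of recompiling the pattern on every call.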

See Also

Discussion

Requests

  • documentation on using re with Unicode ..?

Problem?

The following features do not seem to work in Python:

For example, ICU regular expressions provide the following patterns:

  • \N{UNICODE CHARACTER NAME} Matches the named character.
  • \p{UNICODE PROPERTY NAME} Matches any character that has the specified Unicode property.
  • \P{UNICODE PROPERTY NAME} Matches any character that does not have the specified Unicode property.
  • \s Matches a separator character. A separator is defined as [\t\n\f\r\p{Z}].
  • \uhhhh Matches the character whose hex value is hhhh.
  • \Uhhhhhhhh Matches the character whose hex value is hhhhhhhh. Exactly eight hex digits must be supplied, even though the largest Unicode code point is \U0010ffff.

-- anonymous

I don't understand the problem. -- LionKimbro 2006-03-25 16:31:35
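For what it's worth, here is a rough sketch of how some of the items in that list can be approximated with the standard library, assuming Python 3 and str (not bytes) patterns. The standard re module has no \p{...} classes, so `chars_with_category` below is a hypothetical helper built on `unicodedata.category`, not part of re. The third-party regex package is another option for \p{...}.

```python
import re
import unicodedata

# \N{...} and \uhhhh are resolved by the string literal itself,
# so re receives the actual character.
alpha_by_name = "\N{GREEK SMALL LETTER ALPHA}"
alpha_by_hex = "\u03b1"
alpha_match = re.search(alpha_by_name, "x = \u03b1 + 1")

# For str patterns, \s also matches Unicode whitespace such as NO-BREAK SPACE.
nbsp_match = re.fullmatch(r"\s", "\u00a0")

def chars_with_category(text, category_prefix):
    # Hypothetical stand-in for a \p{...} class: keep the characters whose
    # Unicode general category starts with category_prefix (e.g. "Lu", "Z").
    return [ch for ch in text
            if unicodedata.category(ch).startswith(category_prefix)]

# "\u00c9t\u00e9 \u00e0 Paris" is "Été à Paris" written with escapes.
uppercase_letters = chars_with_category("\u00c9t\u00e9 \u00e0 Paris", "Lu")

print(alpha_match.group(0) == alpha_by_hex)   # True
print(nbsp_match is not None)                 # True
print(uppercase_letters)
```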

Visualization

  • I like the Venn diagram in this image. However, one part of the image is confusing. It refers to Python strings, to "regex strings" (which are actually Python "raw" strings), and to something called "match strings" ... what are these "match strings"? -- JimD 2004-12-30 20:03:54

The image isn't meant to be explanatory; it is meant as reference and refresher material.

That said: The "match string" is the final product of either of the two above expressions. It is what the above two expressions will literally match. If you have a better phrase, or would like to correct "raw" to "regex," feel free to download the SVG, edit the text, place an image on the web, and link it from here. (The damn LongImageIncorporationProcess strikes again.) I may eventually get around to it myself one day, but it seems there are higher priorities, and the diagram is "good enough."

That said, I appreciate the correction. -- LionKimbro 2005-01-01 00:22:14

Image Hosting

At the top of this page were two images/schemes about re. Is it possible to redraw them here somehow? That server is not working; maybe someone has them downloaded locally. Thanks a lot. -- PavelKosina

I've uploaded one as an attachment, still need to upload the other... And, the source...

(Anyone can do this, though, when the computers are online.)

-- LionKimbro 2006-03-25 16:31:35


CategoryDocumentation

RegularExpression (last edited 2015-08-30 10:11:18 by pythonguru)
