Escaping HTML
The cgi module that comes with Python has an escape() function:
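A minimal sketch of that call. It assumes Python 3, where cgi.escape() was removed in version 3.8; html.escape() with quote=False is used here as a stand-in because it replaces the same three characters.

```python
from html import escape  # stand-in for the removed cgi.escape()

# quote=False mirrors the default of cgi.escape(): only &, <, > are replaced.
escaped = escape("<a href='x'>Tom & \"Jerry\"</a>", quote=False)
print(escaped)
```

Note that both kinds of quotes pass through unchanged.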
However, by default it escapes only &, <, and >. Passing a true quote argument also escapes the double quote ("), but the apostrophe is never escaped.
Here's a small snippet that will let you escape quotes and apostrophes as well:
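One way such a snippet might look is sketched below. The table name html_escape_table is an assumption made for illustration, and the apostrophe is written as the numeric reference &#39; because &apos; is not defined in HTML 4.

```python
# Hypothetical lookup table: each special character maps to its replacement.
html_escape_table = {
    "&": "&amp;",
    '"': "&quot;",
    "'": "&#39;",   # numeric reference; "&apos;" is not an HTML 4 entity
    ">": "&gt;",
    "<": "&lt;",
}

def html_escape(text):
    """Replace each special character in text with its entity."""
    # Working character by character means "&" is never escaped twice.
    return "".join(html_escape_table.get(c, c) for c in text)

print(html_escape("It's \"1 < 2\" & more"))
```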
Unescaping HTML
Undoing the escaping performed by cgi.escape() isn't directly supported by the library. This can be accomplished using a fairly simple function, however:
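A sketch of such a function follows; the name unescape is an assumption. On Python 3.4 and later the library does offer html.unescape(), so this is mainly useful for seeing how the replacement order matters.

```python
def unescape(s):
    """Undo the three replacements made by cgi.escape()."""
    s = s.replace("&lt;", "<")
    s = s.replace("&gt;", ">")
    # This has to be last, so that "&amp;lt;" becomes "&lt;" and not "<".
    s = s.replace("&amp;", "&")
    return s

print(unescape("&lt;b&gt; &amp; &amp;lt;"))
```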
Note that this will undo exactly what cgi.escape() does; it's easy to extend this to undo what the html_escape() function above does. Note the comment that converting the & must be last; this avoids getting strings like "&amp;lt;" wrong. If & were converted first, "&amp;lt;" would become "&lt;" and then, incorrectly, "<".
This approach is simple and fairly efficient, but it handles only the entities listed in the function. A more thorough approach is to perform the same processing as an HTML parser. Using the HTML parser from the standard library is a little more expensive, but many more entity replacements are supported "out of the box." The table of supported entities can be found in the htmlentitydefs module from the library; this is not normally used directly, but the htmllib module uses it to support most common entities. It can be used very easily:
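The sketch below illustrates the parser-based idea under one stated substitution: htmllib exists only in Python 2, so html.parser.HTMLParser from Python 3 is swapped in. The class and function names are hypothetical.

```python
from html.parser import HTMLParser

class _TextCollector(HTMLParser):
    """Hypothetical helper that gathers the decoded text the parser reports."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode references itself.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def unescape_with_parser(s):
    """Decode entity and character references by running s through the parser."""
    parser = _TextCollector()
    parser.feed(s)
    parser.close()  # flush any text the parser is still buffering
    return "".join(parser.parts)

print(unescape_with_parser("&lt;b&gt; &amp; &#65;"))
```

Because only text data is collected, any real tags in the input are dropped, so this suits strings that contain escaped text rather than markup.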
This version has the additional advantage that it supports character references (things like &#65;) as well as entity references.
A more efficient implementation would simply parse the string for entity and character references directly (and would be a good candidate for the library, if there's really a need for it outside of HTML data).
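A rough sketch of such a direct scan, assuming Python 3 (where the entity table lives in html.entities); the names decode_references and _reference_re are made up for illustration. Python 3.4 and later ship html.unescape(), which covers this need in the library.

```python
import re
from html.entities import name2codepoint

# "&#65;" (decimal), "&#x41;" (hex), or a named entity such as "&amp;".
_reference_re = re.compile(r"&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")

def decode_references(s):
    """Replace character and entity references in a single pass."""
    def replace(match):
        ref = match.group(1)
        try:
            if ref[:2].lower() == "#x":
                return chr(int(ref[2:], 16))
            if ref.startswith("#"):
                return chr(int(ref[1:]))
            return chr(name2codepoint[ref])
        except (KeyError, ValueError, OverflowError):
            # Leave unknown names and out-of-range numbers untouched.
            return match.group(0)
    return _reference_re.sub(replace, s)

print(decode_references("&lt;&#65;&#x42;&amp;lt;"))
```

Since the string is scanned once, the output of one replacement is never decoded again, so no ordering rule for & is needed.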
Formal htmlentitydefs
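Yet another approach takes advantage of the name2codepoint table in htmlentitydefs (renamed html.entities in Python 3). It handles named entities only:

```python
import re
from html.entities import name2codepoint  # htmlentitydefs in Python 2

# Match any known named entity, e.g. "&amp;" or "&eacute;".
_entity_re = re.compile('&(%s);' % '|'.join(name2codepoint))

def htmlentitydecode(s):
    # chr() replaces Python 2's unichr().
    return _entity_re.sub(lambda m: chr(name2codepoint[m.group(1)]), s)
```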
See Also
XML entities are different from, if related to, HTML entities. This page hints at the details:
John J. Lee discusses still more refinements in implementation in [http://groups.google.com/group/comp.lang.python/msg/ce3fc3330cbbac0a this comp.lang.python follow-up].