Comment: wiki restore 2013-01-23
Escaping HTML
The cgi module that comes with Python has an escape() function. By default it escapes only &, <, and >. If it is used as cgi.escape(string_to_escape, quote=True), it also escapes ".
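As a quick illustration of the two modes: cgi.escape() is not available on Python 3.8 and later, so this sketch assumes html.escape() as a stand-in. With quote=False it escapes the same three characters; with quote=True it escapes " and, unlike cgi.escape(), also '. The sample string is made up for the example.

```python
import html

sample = '<a href="x">Tom & Jerry\'s</a>'

# quote=False: only &, < and > are escaped
basic = html.escape(sample, quote=False)

# quote=True (the default for html.escape): " and ' are escaped as well
quoted = html.escape(sample, quote=True)

print(basic)
print(quoted)
```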
Here's a small snippet that will let you escape quotes and apostrophes as well:
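A minimal sketch of such a snippet, assuming the conventional entity names (&amp;, &quot;, &apos;, &gt;, &lt;) as the replacement values:

```python
# Assumed mapping from each special character to its entity
html_escape_table = {
    "&": "&amp;",
    '"': "&quot;",
    "'": "&apos;",
    ">": "&gt;",
    "<": "&lt;",
}


def html_escape(text):
    """Produce entities within text."""
    # Characters without an entry in the table are passed through unchanged
    return "".join(html_escape_table.get(c, c) for c in text)
```

Because the text is processed one character at a time, an & that is already part of an entity is escaped again; the function is meant for raw text, not for text that is already escaped.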
You can also use escape() from xml.sax.saxutils to escape HTML. This function should execute faster. The unescape() function of the same module decodes a string when it is passed the inverse mapping.
from xml.sax.saxutils import escape, unescape

# escape() and unescape() take care of &, < and >.
html_escape_table = {
    '"': "&quot;",
    "'": "&apos;",
}
# Inverse mapping (entity -> character) for decoding
html_unescape_table = {v: k for k, v in html_escape_table.items()}

def html_escape(text):
    return escape(text, html_escape_table)

def html_unescape(text):
    return unescape(text, html_unescape_table)
Unescaping HTML
Undoing the escaping performed by cgi.escape() isn't directly supported by the library. This can be accomplished using a fairly simple function, however:
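A minimal sketch of such a function, assuming only the three entities produced by cgi.escape() without quote=True need to be reversed:

```python
def unescape(s):
    s = s.replace("&lt;", "<")
    s = s.replace("&gt;", ">")
    # this has to be last:
    s = s.replace("&amp;", "&")
    return s
```

For example, unescape("&amp;lt;") gives "&lt;", because the & is only restored after the other two replacements have run.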
or alternatively (before issue2927):
>>> from HTMLParser import HTMLParser
>>> HTMLParser.unescape.__func__(HTMLParser, 'ss&copy;')
u'ss\xa9'
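On Python 3.4 or later, the standard library covers this case directly: html.unescape() converts named and numeric character references. A short sketch (the sample strings are illustrative):

```python
import html

# Named entity reference
decoded = html.unescape("ss&copy;")

# Mixed named and numeric references; each reference is decoded only once
mixed = html.unescape("&lt;p&gt; &amp;amp; &#65;")

print(decoded)
print(mixed)
```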
Note that the simple unescape() function undoes exactly what cgi.escape() does; it's easy to extend it to undo what the html_escape() function above does. The & must be converted last; otherwise a string like "&amp;lt;" would be decoded twice and come out as "<" instead of "&lt;".
This approach is simple and fairly efficient, but it handles only the entities it explicitly replaces. A more thorough approach is to perform the same processing as an HTML parser. Using the HTML parser from the standard library is a little more expensive, but many more entity replacements are supported "out of the box." The table of supported entities can be found in the htmlentitydefs module from the library; this is not normally used directly, but the htmllib module uses it to support most common entities. It can be used very easily:
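The htmllib approach creates htmllib.HTMLParser(None), calls save_bgn(), feeds the string, and returns save_end(). htmllib exists only in Python 2, so the sketch below is a Python 3 stand-in built on html.parser.HTMLParser, not the original code. The class name _TextCollector is a hypothetical helper; note that any real tags in the input are dropped, since only text data is collected.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text data; the parser decodes references before handing it over."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode entity and
        # character references in text before calling handle_data()
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def unescape(s):
    p = _TextCollector()
    p.feed(s)
    p.close()  # flush any text still buffered by the parser
    return "".join(p.chunks)
```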
This version has the additional advantage that it supports character references (things like &#65;) as well as entity references.
A more efficient implementation would simply parse the string for entity and character references directly (and would be a good candidate for the library, if there's really a need for it outside of HTML data).
Formal htmlentitydefs
Yet another approach available with recent Python takes advantage of htmlentitydefs:
import re
from htmlentitydefs import name2codepoint

def htmlentitydecode(s):
    return re.sub('&(%s);' % '|'.join(name2codepoint),
                  lambda m: unichr(name2codepoint[m.group(1)]), s)
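In Python 3 the same table lives in html.entities, and unichr() is spelled chr(). The sketch below is one way to parse the string for entity and character references directly, covering numeric references as well as named ones. The regular expression and the helper name _replace_ref are illustrative choices, not part of any library.

```python
import re
from html.entities import name2codepoint

# Matches &name;, &#123; and &#x7B; style references
_REF = re.compile(r"&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")


def _replace_ref(match):
    ref = match.group(1)
    try:
        if ref.startswith(("#x", "#X")):
            return chr(int(ref[2:], 16))   # hexadecimal character reference
        if ref.startswith("#"):
            return chr(int(ref[1:]))       # decimal character reference
    except (ValueError, OverflowError):
        return match.group(0)              # out-of-range code point: leave as-is
    if ref in name2codepoint:
        return chr(name2codepoint[ref])    # named entity reference
    return match.group(0)                  # unknown name: leave as-is


def htmlentitydecode(s):
    return _REF.sub(_replace_ref, s)
```

Because re.sub() makes a single pass over the string, "&amp;lt;" decodes to "&lt;" rather than "<", so no special ordering of replacements is needed.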
Builtin HTML/XML escaping via ASCII encoding
A very easy way to transform non-ASCII characters, such as German umlauts or accented letters, into their HTML equivalents is to encode them from unicode to ASCII using the xmlcharrefreplace error handler:
>>> a = u"äöüßáà"
>>> a.encode('ascii', 'xmlcharrefreplace')
'&#228;&#246;&#252;&#223;&#225;&#224;'
Note that this transforms only non-ASCII characters and therefore leaves <, >, and & as they are. However, you can combine this technique with cgi.escape().
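One way to combine the two steps, sketched for Python 3 with html.escape() standing in for cgi.escape() (an assumption, since cgi.escape() is absent from current releases). The markup characters are escaped first, so the & of the generated character references is not escaped a second time. escape_to_ascii is a hypothetical helper name.

```python
import html


def escape_to_ascii(text):
    # Step 1: escape &, <, > (and quotes) as entities
    escaped = html.escape(text, quote=True)
    # Step 2: turn every remaining non-ASCII character into a numeric
    # character reference, then go back from bytes to str
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")
```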
See Also
XML entities are different from, if related to, HTML entities. This page hints at the details: EscapingXml.
John J. Lee discusses still more refinements in implementation in this comp.lang.python follow-up (http://groups.google.com/group/comp.lang.python/msg/ce3fc3330cbbac0a).