
Escaping HTML

The cgi module that comes with Python has an escape() function:

    import cgi

    s = cgi.escape("& < >")   # s = "&amp; &lt; &gt;"

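However, by default it doesn't escape characters beyond &, <, and >. Passing a true value for its optional quote argument also escapes the double quote ("), but the single quote (') is never escaped.

Here's a small snippet that escapes both quote characters as well: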

    html_escape_table = {
        "&": "&amp;",
        '"': "&quot;",
        "'": "&apos;",
        ">": "&gt;",
        "<": "&lt;",
    }

    def html_escape(text):
        """Produce entities within text."""
        # Look up each character; anything not in the table passes through.
        return "".join(html_escape_table.get(c, c) for c in text)
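For readers on a current Python: cgi.escape() was deprecated in Python 3.2 and later removed, and the html module provides html.escape() as its replacement. The following is a minimal sketch, assuming Python 3.2 or newer, of how it relates to the two approaches above. Note that it writes the single quote as &#x27; rather than &apos;.

```python
import html

# quote=False mirrors cgi.escape's default: only &, < and > are replaced.
basic = html.escape("& < >", quote=False)

# quote=True (the default) also replaces both quote characters.
quoted = html.escape("\"Tom\" & 'Jerry'")

print(basic)   # &amp; &lt; &gt;
print(quoted)  # &quot;Tom&quot; &amp; &#x27;Jerry&#x27;
```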

Unescaping HTML

Undoing the escaping performed by cgi.escape() isn't directly supported by the library. This can be accomplished using a fairly simple function, however:

    def unescape(s):
        s = s.replace("&lt;", "<")
        s = s.replace("&gt;", ">")
        # this has to be last:
        s = s.replace("&amp;", "&")
        return s

Note that this will undo exactly what cgi.escape() does; it's easy to extend this to undo what the html_escape() function above does. The &amp; replacement must come last, as the comment says: if it ran first, a string like "&amp;lt;" would become "&lt;" and then, wrongly, "<".
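One way the extension mentioned above might look is sketched here. The name html_unescape is only an illustrative choice, and the function handles just the five entities that html_escape() produces.

```python
def html_unescape(s):
    """Undo html_escape(): reverse the five replacements it makes."""
    s = s.replace("&lt;", "<")
    s = s.replace("&gt;", ">")
    s = s.replace("&quot;", '"')
    s = s.replace("&apos;", "'")
    # this has to be last, so an escaped ampersand is not decoded twice:
    s = s.replace("&amp;", "&")
    return s

print(html_unescape("&lt;a href=&quot;x&quot;&gt;Tom &amp; Jerry&apos;s&lt;/a&gt;"))
# <a href="x">Tom & Jerry's</a>
print(html_unescape("&amp;lt;"))
# &lt;
```

The second call shows why the order matters: the text "&lt;" that was escaped to "&amp;lt;" comes back as "&lt;", not as "<".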

This approach is simple and fairly efficient, but is limited to supporting the entities given in the list. A more thorough approach would be to perform the same processing as an HTML parser. Using the HTML parser from the standard library is a little more expensive, but many more entity replacements are supported "out of the box." The table of entities which are supported can be found in the htmlentitydefs module from the library; this is not normally used directly, but the htmllib module uses it to support most common entities. It can be used very easily:

    import htmllib

    def unescape(s):
        p = htmllib.HTMLParser(None)
        p.save_bgn()
        p.feed(s)
        return p.save_end()

This version has the additional advantage that it supports character references (things like &#65;) as well as entity references.
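The htmllib module was removed in Python 3.0, so the parser-based function above only runs on Python 2. As a hedged sketch for newer interpreters (assuming Python 3.4 or later), html.unescape() from the standard library handles both entity and character references:

```python
import html

# Named entities, decimal character references and plain text together.
decoded = html.unescape("&lt;p&gt; &amp; &#65; &eacute;")
print(decoded)  # <p> & A é

# Decoding happens in a single pass, so escaped ampersands are safe.
print(html.unescape("&amp;lt;"))  # &lt;
```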

A more efficient implementation would simply parse the string for entity and character references directly (and would be a good candidate for the library, if there's really a need for it outside of HTML data).
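A minimal sketch of such a direct approach follows, assuming Python 3, where the entity table lives in html.entities (the Python 2 equivalent is htmlentitydefs). The function name unescape_references and the regular expression are illustrative choices, not part of any library. References that are unknown or out of range are left untouched.

```python
import re
from html.entities import name2codepoint  # Python 2: htmlentitydefs

# Matches &name;, &#123; and &#x7B; style references.
_REFERENCE = re.compile(r"&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")

def unescape_references(s):
    """Replace entity and character references in a single pass."""
    def replace(match):
        ref = match.group(1)
        try:
            if ref[0] == "#":
                if ref[1] in "xX":
                    return chr(int(ref[2:], 16))   # hexadecimal reference
                return chr(int(ref[1:]))           # decimal reference
            return chr(name2codepoint[ref])        # named entity
        except (KeyError, ValueError, OverflowError):
            return match.group(0)                  # leave unknown text as-is
    return _REFERENCE.sub(replace, s)

print(unescape_references("&lt;b&gt; &amp;amp; &#65; &#x42; &bogus;"))
# <b> &amp; A B &bogus;
```

Because every reference is decoded exactly once, there is no ordering concern like the one for chained replace() calls.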

Discussion

LionKimbro: Is there anything in the standard library for going the other way? Is there something where you can give it "&amp;" and get back "&"? Perhaps in the XML libraries? I looked, but did not see anything. DOM, SAX- wouldn't be there. Not exactly XML-RPC either. (2005-06-10T16:35:16Z)

FredDrake: EscapingXml includes more than you probably wanted to know. And if you want, I can toss in an approach using xmlrpclib as well.

LionKimbro: Yyyy....ow. Thank you. Thank you very much. (2005-06-13T14:41:02Z)
