Differences between revisions 5 and 25 (spanning 20 versions)

Escaping HTML

The cgi module that comes with Python has an escape() function:

    import cgi

    s = cgi.escape( """& < >""" )   # s = "&amp; &lt; &gt;"

However, by default it escapes only &, <, and >. Passing quote=True also escapes the double quote ("), but the apostrophe (') is never escaped.
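
For readers on Python 3, here is a minimal sketch of the equivalent call. It assumes the html module, which is not mentioned in the original page: cgi.escape() was deprecated in Python 3.2 and removed in 3.8, and html.escape() took its place. Unlike cgi.escape(), it escapes both quote characters by default.

```python
import html

# html.escape() is the Python 3 counterpart of cgi.escape().
# With the default quote=True it also escapes " and '.
s = html.escape("""& < > " '""")
print(s)

# quote=False limits the escaping to &, < and >,
# which matches the default behaviour of cgi.escape().
t = html.escape("""& < > " '""", quote=False)
print(t)
```

Note that html.escape() writes the apostrophe as the numeric reference &#x27; rather than as a named entity.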

Here's a small snippet that will let you escape quotes and apostrophes as well:

    html_escape_table = {
        "&": "&amp;",
        '"': "&quot;",
        # &apos; is defined in XML and HTML5 but not in HTML 4;
        # "&#39;" is a more portable replacement if that matters.
        "'": "&apos;",
        ">": "&gt;",
        "<": "&lt;",
        }

    def html_escape(text):
        """Produce entities within text."""
        return "".join(html_escape_table.get(c,c) for c in text)
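
The same table can also drive str.translate(). The sketch below assumes Python 3, where str.maketrans() accepts a dictionary mapping single characters to replacement strings; the name html_escape_translate is made up for illustration and is not part of the original snippet.

```python
# Same table as in the snippet above.
html_escape_table = {
    "&": "&amp;",
    '"': "&quot;",
    "'": "&apos;",
    ">": "&gt;",
    "<": "&lt;",
}

# Python 3 only: build a translation table once and reuse it.
_translation = str.maketrans(html_escape_table)

def html_escape_translate(text):
    """Hypothetical variant of html_escape() built on str.translate()."""
    return text.translate(_translation)

print(html_escape_translate("""<a href="x">Tom & Jerry's</a>"""))
```

Because each character is looked up exactly once, the & inside a freshly produced entity is never escaped a second time, which is the same property the character-by-character join above relies on.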

Unescaping HTML

Undoing the escaping performed by cgi.escape() isn't directly supported by the library. This can be accomplished using a fairly simple function, however:

    def unescape(s):
        s = s.replace("&lt;", "<")
        s = s.replace("&gt;", ">")
        # this has to be last:
        s = s.replace("&amp;", "&")
        return s

or alternatively (before issue2927):

>>> from HTMLParser import HTMLParser
>>> HTMLParser.unescape.__func__(HTMLParser, 'ss&copy;')
u'ss\xa9'

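Note that this will undo exactly what cgi.escape() does; it's easy to extend this to undo what the html_escape() function above does. Note the comment that converting the &amp; must be last; this avoids getting strings like "&amp;lt;" wrong: replacing &amp; first would turn that string into "&lt;", which the later &lt; replacement would then wrongly convert to "<".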
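
The ordering constraint can be checked with a small experiment. In this sketch unescape() repeats the function above, and unescape_wrong_order() is a deliberately broken variant written only for the comparison.

```python
def unescape(s):
    s = s.replace("&lt;", "<")
    s = s.replace("&gt;", ">")
    s = s.replace("&amp;", "&")  # last, as in the function above
    return s

def unescape_wrong_order(s):
    # Hypothetical buggy variant: &amp; is converted first.
    s = s.replace("&amp;", "&")
    s = s.replace("&lt;", "<")
    s = s.replace("&gt;", ">")
    return s

escaped = "&amp;lt;"   # the escaped form of the literal text "&lt;"
print(unescape(escaped))              # &lt;
print(unescape_wrong_order(escaped))  # <
```

The buggy variant unescapes twice: the & it has just produced combines with the following "lt;" and is decoded again.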

This approach is simple and fairly efficient, but is limited to supporting the entities given in the list. A more thorough approach would be to perform the same processing as an HTML parser. Using the HTML parser from the standard library is a little more expensive, but many more entity replacements are supported "out of the box." The table of entities which are supported can be found in the htmlentitydefs module from the library; this is not normally used directly, but the htmllib module uses it to support most common entities. It can be used very easily:

    import htmllib

    def unescape(s):
        p = htmllib.HTMLParser(None)
        p.save_bgn()
        p.feed(s)
        return p.save_end()

This version has the additional advantage that it supports character references (things like &#65;) as well as entity references.

A more efficient implementation would simply parse the string for entity and character references directly (and would be a good candidate for the library, if there's really a need for it outside of HTML data).
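
Later Python versions did add such a function: html.unescape() is available from Python 3.4 onward. It is not part of the Python 2 library this page describes, so the following is only a sketch for newer interpreters.

```python
import html

# html.unescape() decodes named entities as well as decimal and
# hexadecimal character references in one call (Python 3.4+).
decoded = html.unescape("&lt;b&gt; &amp; &copy; &#65; &#x41;")
print(decoded)
```

Each reference is decoded once, so "&amp;lt;" comes back as "&lt;" rather than "<".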

Formal htmlentitydefs

Yet another approach, available since Python 2.3, uses the name2codepoint table from htmlentitydefs directly:

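import re
from htmlentitydefs import name2codepoint

def htmlentitydecode(s):
    """Replace named entity references (&copy;, &amp;, ...) in s."""
    # Build one alternation of every known entity name; unknown names and
    # numeric character references are left untouched.
    return re.sub('&(%s);' % '|'.join(name2codepoint),
            lambda m: unichr(name2codepoint[m.group(1)]), s)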
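
The function above decodes named entities only. A possible extension that also handles decimal and hexadecimal character references is sketched below. It is written for Python 3, where the table lives in html.entities and chr() replaces unichr(); the name decode_references and the choice to leave unknown references untouched are assumptions made for this example.

```python
import re
from html.entities import name2codepoint  # Python 3 name of htmlentitydefs

# Matches &name;, &#123; (decimal) and &#x7B; (hexadecimal).
_reference = re.compile(r"&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")

def decode_references(s):
    """Hypothetical helper: decode entity and character references in s."""
    def replace(match):
        ref = match.group(1)
        try:
            if ref[:2] in ("#x", "#X"):
                return chr(int(ref[2:], 16))   # hexadecimal reference
            if ref.startswith("#"):
                return chr(int(ref[1:]))       # decimal reference
            return chr(name2codepoint[ref])    # named entity
        except (KeyError, ValueError, OverflowError):
            # Unknown name or out-of-range code point: leave untouched.
            return match.group(0)
    return _reference.sub(replace, s)

print(decode_references("&lt;&amp;&copy;&#65;&#x42;&bogus;"))
```

Because re.sub() scans the string once and never rescans its own output, no special ordering of the replacements is needed here.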

- Revision 5 as of 2005-06-10 17:09:43
  Size: 1059
  Editor: pcp744441pcs
  Comment:
+ Revision 25 as of 2011-07-11 06:03:21
  Size: 3197
  Editor: 180
  Comment:
-Deletions are marked like this.
+Additions are marked like this.
 Line 3:
-The {{{cgi}}} module that comes with Python has an {{{escape}}} function:
+The {{{cgi}}} module that comes with Python has an {{{escape()}}} function:
-Line 14:
+Line 15:
-Here's a small snippet that will let you escape those as well:
+Here's a small snippet that will let you escape quotes and apostrophes as well:
-Line 18:
+Line 19:
-html_escape_table = \
    {"&": "&amp;",
     '"': "&quot;",
     "'": "&apos;",
     ">": "&gt;",
     "<": "&lt;"}
+html_escape_table = {
    "&": "&amp;",
    '"': "&quot;",
    "'": "&apos;",
    ">": "&gt;",
    "<": "&lt;",
    }
-Line 27:
+Line 29:
-    L=[]
    for c in text:
        L.append(html_escape_table.get(c,c))
    return "".join(L)
+    return "".join(html_escape_table.get(c,c) for c in text)
-Line 33:
+Line 32:
-== Discussion ==
+== Unescaping HTML ==
-Line 35:
+Line 34:
-LionKimbro: Is there anything in the standard library for going the other way? Is there something where you can give it "&amp;" and get back "&"? Perhaps in the XML libraries? I looked, but did not see anything. DOM, SAX- wouldn't be there. Not exactly XML-RPC either. Anyone know? [[Date(2005-06-10T16:35:16Z)]]
+Undoing the escaping performed by {{{cgi.escape()}}} isn't directly supported by the library.  This can
be accomplished using a fairly simple function, however:
 Line 37:
-FredDrake: Do you want this for XML or HTML?  The responses are different.
+{{{
#!python
def unescape(s):
    s = s.replace("&lt;", "<")
    s = s.replace("&gt;", ">")
    # this has to be last:
    s = s.replace("&amp;", "&")
    return s
}}}

or alternatively (before [[http://bugs.python.org/issue2927|issue2927]]):

{{{
>>> from HTMLParser import HTMLParser
>>> HTMLParser.unescape.__func__(HTMLParser, 'ss&copy;')
u'ss\xa9'
}}}

Note that this will undo exactly what {{{cgi.escape()}}} does; it's easy to extend this to undo what
the {{{html_escape()}}} function above does.  Note the comment that converting the {{{&amp;}}} must be last;
this avoids getting strings like {{{"&amp;lt;"}}} wrong.

This approach is simple and fairly efficient, but is limited to supporting the entities given in the list.
A more thorough approach would be to perform the same processing as an HTML parser.  Using the HTML parser from the standard library is a little more expensive, but many more entity replacements are supported
"out of the box."  The table of entities which are supported can be found in the {{{htmlentitydefs}}}
module from the library; this is not normally used directly, but the {{{htmllib}}} module uses it to
support most common entities.  It can be used very easily:

{{{
#!python
import htmllib

def unescape(s):
    p = htmllib.HTMLParser(None)
    p.save_bgn()
    p.feed(s)
    return p.save_end()
}}}

This version has the additional advantage that it supports character references (things like {{{&#65;}}})
as well as entity references.

A more efficient implementation would simply parse the string for entity and character references directly
(and would be a good candidate for the library, if there's really a need for it outside of HTML data).


== Formal htmlentitydefs ==
Yet another approach available with recent Python takes advantage 
of htmlentitydefs:
{{{
import re
from htmlentitydefs import name2codepoint
def htmlentitydecode(s):
    return re.sub('&(%s);' % '|'.join(name2codepoint), 
            lambda m: unichr(name2codepoint[m.group(1)]), s)
}}}

== See Also ==

XML entities are different from, if related to, HTML entities.
This page hints at the details:
 * EscapingXml

John J. Lee discusses still more refinements in implementation in
[[http://groups.google.com/group/comp.lang.python/msg/ce3fc3330cbbac0a|this comp.lang.python follow-up]].
