Revision 1 as of 2011-01-13 19:44:05


"""Beautiful Soup Elixir and Tonic "The Screen-Scraper's Friend" http://www.crummy.com/software/BeautifulSoup/

Beautiful Soup parses a (possibly invalid) XML or HTML document into a tree representation. It provides methods and Pythonic idioms that make it easy to navigate, search, and modify the tree.

A well-formed XML/HTML document yields a well-formed data structure. An ill-formed XML/HTML document yields a correspondingly ill-formed data structure. If your document is only locally well-formed, you can use this library to find and process the well-formed part of it.
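The idea of a tree that mirrors the document, ill-formed parts included, can be sketched as follows. This is an illustrative stand-in built on the standard library's html.parser under Python 3, not Beautiful Soup's own classes; Node, TreeBuilder, parse, find_all, and text are hypothetical names chosen for the example.

```python
# Illustrative stand-in only: standard-library html.parser, not Beautiful Soup.
# Node, TreeBuilder, and parse are hypothetical names for this sketch.
from html.parser import HTMLParser


class Node:
    def __init__(self, name, attrs=None, parent=None):
        self.name = name
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []  # child Nodes and text strings, in document order

    def find_all(self, name):
        # Depth-first search for descendant nodes with the given tag name.
        found = []
        for child in self.children:
            if isinstance(child, Node):
                if child.name == name:
                    found.append(child)
                found.extend(child.find_all(name))
        return found

    def text(self):
        # Concatenate all text below this node.
        return "".join(
            c.text() if isinstance(c, Node) else c for c in self.children
        )


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("[document]")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, parent=self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open tag with this name; ignore stray end tags.
        node = self.current
        while node is not None and node.name != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        self.current.children.append(data)


def parse(markup):
    builder = TreeBuilder()
    builder.feed(markup)
    builder.close()
    return builder.root


# The <b> tag is never closed, so the second link ends up nested inside it,
# but both well-formed <a> elements can still be found and read.
root = parse('<p>See <a href="/one">one</a> and <b>bold <a href="/two">two</a></p>')
links = root.find_all("a")
hrefs = [a.attrs["href"] for a in links]
```

The unclosed tag is not repaired here; the tree simply reflects it, which is the behaviour the paragraph above describes for ill-formed input.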

Beautiful Soup works with Python 2.2 and up. It has no external dependencies, but you'll have more success at converting data to UTF-8 if you also install these three packages:

* chardet, for auto-detecting character encodings

* cjkcodecs and iconv_codec, which add more encodings to the ones supported by stock Python.
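Treating these packages as optional usually comes down to attempting the import and falling back when it fails. A minimal sketch, assuming Python 3; optional_import and guess_encoding are hypothetical helpers written for this example, and only chardet.detect is an existing call from the chardet package.

```python
import importlib


def optional_import(name):
    # Return the named module, or None when it is not installed.
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


chardet = optional_import("chardet")    # may be None
json_module = optional_import("json")   # standard library, always present


def guess_encoding(data, default="utf-8"):
    # Ask chardet when it is available; otherwise fall back to the default.
    if chardet is not None:
        guess = chardet.detect(data).get("encoding")
        if guess:
            return guess
    return default
```

Code written this way keeps working without the extra packages and only loses detection accuracy.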

Beautiful Soup defines classes for two main parsing strategies:

* BeautifulStoneSoup, for parsing XML

* BeautifulSoup, for parsing HTML

Beautiful Soup also defines a class (UnicodeDammit) for autodetecting the encoding of an HTML or XML document, and converting it to Unicode. Much of this code is taken from Mark Pilgrim's Universal Feed Parser.
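Detecting an encoding and converting to Unicode can be sketched as a byte-order-mark check followed by a list of candidate encodings tried in order. This is a simplified illustration under Python 3, not the UnicodeDammit implementation; to_unicode, BOMS, and the default candidate list are assumptions made for the example.

```python
import codecs

# Byte-order marks, with UTF-32 checked before UTF-16 because the
# UTF-32-LE mark begins with the same two bytes as the UTF-16-LE mark.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def to_unicode(data, candidates=("utf-8", "windows-1252")):
    # Return (text, encoding) for a byte string.
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return data[len(bom):].decode(encoding), encoding
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Last resort: latin-1 maps every byte to a code point, so it cannot fail.
    return data.decode("latin-1"), "latin-1"
```

A statistical detector such as chardet could supply an extra candidate ahead of the fixed list; the sketch leaves that out to stay dependency-free.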

For more than you ever wanted to know about Beautiful Soup, see the documentation: http://www.crummy.com/software/BeautifulSoup/documentation.html

Here, have some legalese:

Copyright (c) 2004-2008, Leonard Richardson

All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE, DAMMIT.

""" from future import generators

__author__ = "Leonard Richardson (leonardr@segfault.org)"
__version__ = "3.0.7a"
__copyright__ = "Copyright (c) 2004-2008 Leonard Richardson"
__license__ = "New-style BSD"

from sgmllib import SGMLParser, SGMLParseError
import codecs
import markupbase
import types
import re
import sgmllib
try:

except ImportError:

try:

except NameError:

#These hacks make Beautiful Soup able to parse XML with namespaces
sgmllib.tagfind = re.compile('[a-zA-Z][-_.:a-zA-Z0-9]*')
markupbase._declname_match = re.compile(r'[a-zA-Z][-_.:a-zA-Z0-9]*\s*').match
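# The colon in the character class is what lets a prefixed name such as
# "xsl:template" match as a single tag name. The sketch below, written for
# Python 3, exercises the same pattern on its own. The colon-free pattern is
# an assumption about what an unpatched matcher would look like, included
# only for comparison; split_qualified_name is a hypothetical helper.

```python
import re

# The pattern from the hack above: a letter, then letters, digits, and -_.:
tagfind = re.compile('[a-zA-Z][-_.:a-zA-Z0-9]*')

# Hypothetical colon-free variant, for comparison only.
tagfind_no_colon = re.compile('[a-zA-Z][-_.a-zA-Z0-9]*')


def split_qualified_name(name):
    # Split "prefix:local" into its two parts; prefix is None when absent.
    match = tagfind.match(name)
    if match is None or match.end() != len(name):
        raise ValueError("not a tag name: %r" % name)
    prefix, sep, local = name.partition(":")
    return (prefix, local) if sep else (None, name)


with_colon = tagfind.match("xsl:template").group()
without_colon = tagfind_no_colon.match("xsl:template").group()
```

# With the colon allowed, the whole qualified name is matched; without it,
# the match stops at the prefix.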

DEFAULT_OUTPUT_ENCODING = "utf-8"

# First, the classes that represent markup elements.

class PageElement:

class NavigableString(unicode, PageElement):

class CData(NavigableString):

class ProcessingInstruction(NavigableString):

class Comment(NavigableString):

class Declaration(NavigableString):

class Tag(PageElement):

# Next, a couple classes to represent queries and their results.
class SoupStrainer:

class ResultSet(list):

# Now, some helper functions.

def isList(l):

def isString(s):

def buildTagMap(default, *args):

# Now, the parser classes.

class BeautifulStoneSoup(Tag, SGMLParser):

class BeautifulSoup(BeautifulStoneSoup):

class StopParsing(Exception):

class ICantBelieveItsBeautifulSoup(BeautifulSoup):

class MinimalSoup(BeautifulSoup):

class BeautifulSOAP(BeautifulStoneSoup):

#Enterprise class names! It has come to our attention that some people
#think the names of the Beautiful Soup parser classes are too silly
#and "unprofessional" for use in enterprise screen-scraping. We feel
#your pain! For such-minded folk, the Beautiful Soup Consortium And
#All-Night Kosher Bakery recommends renaming this file to
#"RobustParser.py" (or, in cases of extreme enterprisiness,
#"RobustParserBeanInterface.class") and using the following
#enterprise-friendly class aliases:
class RobustXMLParser(BeautifulStoneSoup):

class RobustHTMLParser(BeautifulSoup):

class RobustWackAssHTMLParser(ICantBelieveItsBeautifulSoup):

class RobustInsanelyWackAssHTMLParser(MinimalSoup):

class SimplifyingSOAPParser(BeautifulSOAP):

#
# Bonus library: Unicode, Dammit
#
# This class forces XML data into a standard format (usually to UTF-8
# or Unicode). It is heavily based on code from Mark Pilgrim's
# Universal Feed Parser. It does not rewrite the XML or HTML to
# reflect a new encoding: that happens in BeautifulStoneSoup.handle_pi
# (XML) and BeautifulSoup.start_meta (HTML).

# Autodetects character encodings.
# Download from http://chardet.feedparser.org/
try:

# import chardet.constants
# chardet.constants._debug = 1
except ImportError:

# cjkcodecs and iconv_codec make Python know about more character encodings.
# Both are available from http://cjkpython.i18n.org/
# They're built in if you use Python 2.4.
try:

except ImportError:

try:

except ImportError:

class UnicodeDammit:

#By default, act as an HTML pretty-printer.
if __name__ == '__main__':
