Size: 437
Comment: Create new page
|
Size: 1881
Comment:
|
Line 1: | Line 1: |
= Client-Side Web Programming = | |
Line 2: | Line 3: |
There is probably a large amount of useful material available from the people who are working actively with XML-RPC, BizTalk and other approaches to web services, and more from XML writers such as [http://uche.ogbuji.net/uche.ogbuji.net/ Uche Ogbuji], who has published a good deal on IBM's developerWorks site, among other places. | == Libraries == |
Line 4: | Line 5: |
Sadly nobody has categorised or classified it in the Wiki, so at the moment we have to scratch around. | * [[http://utidylib.berlios.de/|utidylib]] and [[http://www.egenix.com/files/python/mxTidy.html|mxTidy]] -- Python interfaces to the [[http://tidy.sourceforge.net/|HTML Tidy]] library for cleaning up HTML documents. * [[http://code.google.com/p/html5lib|html5lib]] -- an HTML5-compliant library for parsing arbitrarily broken HTML into a range of tree formats, including minidom, ElementTree (including lxml) and BeautifulSoup. * [[http://www.crummy.com/software/BeautifulSoup/|BeautifulSoup]] -- a permissive HTML parser. * Don't use [[http://python.org/doc/current/lib/module-HTMLParser.html|HTMLParser]] on HTML that might be invalid! That way lies pain. Either clean it up (using tidy), or use a different parser. * [[http://docs.python.org/library/urllib.html|urllib]], [[http://docs.python.org/library/urllib2.html|urllib2]], and [[http://docs.python.org/library/httplib.html|httplib]] in the standard library. * [[http://wwwsearch.sourceforge.net/ClientCookie/|ClientCookie]], [[http://wwwsearch.sourceforge.net/ClientForm/|ClientForm]], and [[http://wwwsearch.sourceforge.net/mechanize/|Mechanize]] are higher-level libraries for writing a web client. * [[http://www.python.org/pypi?:action=display&name=mechanoid&version=0.4.1|mechanoid]] -- a mechanize fork. * [[http://www.python.org/pypi/libxml2dom|libxml2dom]] can parse HTML by employing libxml2's liberal HTML parser. |
Line 6: | Line 14: |
== Resources == | |
Line 7: | Line 16: |
* [[http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/52199|Grab a document from the web]] - from the Python Cookbook * [[http://wwwsearch.sourceforge.net/bits/clientx.html|Python web-client programming general FAQs]]. * [[http://docs.python.org/library/urllib.html|urllib -- Open arbitrary resources by URL]] * [[http://docs.python.org/library/urllib2.html|urllib2 -- extensible library for opening URLs]] |
Client-Side Web Programming
Libraries
utidylib and mxTidy -- Python interfaces to the HTML Tidy library for cleaning up HTML documents.
html5lib -- an HTML5-compliant library for parsing arbitrarily broken HTML into a range of tree formats, including minidom, ElementTree (including lxml) and BeautifulSoup.
BeautifulSoup -- a permissive HTML parser.
Don't use HTMLParser on HTML that might be invalid! That way lies pain. Either clean it up (using tidy), or use a different parser.
ClientCookie, ClientForm, and Mechanize are higher-level libraries for writing a web client.
mechanoid -- a mechanize fork.
libxml2dom can parse HTML by employing libxml2's liberal HTML parser.
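
For a plain document fetch, the standard library alone is often enough before reaching for the higher-level client libraries listed above. The following is a minimal sketch, assuming Python 3, where the functionality of urllib2 lives in urllib.request. The function name fetch, the timeout default and the User-Agent string are illustrative choices, not part of any library listed here.

```python
import urllib.request


def fetch(url, timeout=10.0):
    """Fetch url and return (body_bytes, content_type).

    Hypothetical helper for illustration only; error handling
    (urllib.error.URLError, HTTP status codes) is left to the caller.
    """
    # Some servers reject requests that carry no User-Agent header,
    # so send a made-up one.
    request = urllib.request.Request(
        url, headers={"User-Agent": "example-client/0.1"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = response.read()
        content_type = response.headers.get("Content-Type", "")
    return body, content_type
```

The body comes back as bytes; hand it to one of the permissive parsers above rather than assuming it is well-formed HTML. In Python 2 the equivalent call is urllib2.urlopen.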