
Python RSS Code

Articles:

Libraries:

Feed Parser

[http://diveintomark.org/projects/feed_parser/ Feed Parser] is an awesome RSS parsing library.

Download it, and then start a Python prompt in the same directory.

    import feedparser

    python_wiki_rss_url = "http://www.python.org/cgi-bin/moinmoin/" \
                          "RecentChanges?action=rss_rc"

    feed = feedparser.parse(python_wiki_rss_url)

You now have the RSS feed data for the PythonInfo wiki!

Take a look at it; there's a lot of data there.

Of particular interest:

feed[ "bozo" ]

1 if the feed data isn't well-formed XML.

feed[ "url" ]

URL of the RSS feed itself

feed[ "version" ]

version of the RSS feed

feed[ "channel" ][ "title" ] 

"PythonInfo Wiki" - Title of the Feed.

feed[ "channel" ][ "description" ]

"RecentChanges at PythonInfo Wiki." - Description of the Feed

feed[ "channel" ][ "link" ]

Link to RecentChanges - Web page associated with the feed.

feed[ "channel" ][ "wiki_interwiki" ]

"Python``Info" - For wiki, the wiki's preferred InterWiki moniker.

feed[ "items" ]

A gigantic list of all of the RecentChanges items.

For each item in feed["items"], we have:

item[ "date" ]

"2004-02-13T22:28:23+08:00" - the item's date, in ISO 8601 format

item[ "date_parsed" ]

(2004, 2, 13, 14, 28, 23, 4, 44, 0) - the same date as a standard 9-element time tuple, normalized to UTC

item[ "title" ]

title for item

item[ "summary" ]

change summary

item[ "link" ]

URL to the page

item[ "wiki_diff" ]

for wiki, a link to the diff for the page

item[ "wiki_history" ]

for wiki, a link to the page history
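Putting the item fields above together: a minimal sketch that formats one entry for display. The item dict here is hypothetical sample data shaped like feedparser's output; real items come from feed["items"]. Since "date_parsed" is a standard time tuple, the time module can format it directly.

```python
import time

# Hypothetical sample data, shaped like one entry of feed["items"]
item = {
    "title": "FrontPage",
    "link": "http://www.python.org/cgi-bin/moinmoin/FrontPage",
    "summary": "fixed a typo",
    "date_parsed": (2004, 2, 13, 14, 28, 23, 4, 44, 0),
}

# "date_parsed" is a plain 9-element time tuple, so time.strftime
# can format it without any extra parsing
when = time.strftime("%Y-%m-%d %H:%M", item["date_parsed"])
print(when + "  " + item["title"] + "  (" + item["link"] + ")")
```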

Aggregating Feeds with Feed Parser

RawDog is a ready-made aggregator if you don't want to write your own.

If you're pulling down a lot of feeds, and aggregating them:

First, you may want to use [http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/84317 Future threads] to pull down your feeds. That way, you can send out 5 requests immediately and wait for them all to come back at once, rather than sending out a request, waiting for it to come back, sending out another, waiting again, and so on.

Usually, such performance is not necessary unless you have thousands of feeds to retrieve every hour. If you have fewer than a few hundred feeds an hour to retrieve, one at a time is probably better - why max out your processor/bandwidth?

Other things to help you be polite:

  • try to retrieve things a few times a day or a few times a week; don't request hourly updates unless you need them.
  • avoid updates on the hour or half hour. Poll at a random minute into the hour, like 27 or 33, or at an interval like 693 minutes rather than a round 600, so that you only rarely poll sites near the hour boundary. It is a problem for sites to get polled by hundreds of aggregators on the hour.
  • be sure to use HttpConditionalGetRequests and honour Expires response headers

  • be sure to support HttpCompression in the responses.

    from future import Future  # the Future class from the ASPN recipe linked above, saved as future.py

    hit_list = [ "http://...", "...", "..." ]  # list of feeds to pull down

    # start pulling down all feeds at once, each in its own thread
    future_calls = [Future(feedparser.parse, rss_url) for rss_url in hit_list]
    # block until they are all in
    feeds = [future_obj() for future_obj in future_calls]

Now that you have your feeds, extract all the entries.

    entries = []
    for feed in feeds:
        entries.extend(feed["items"])

...and sort them, by SortingListsOfDictionaries:

    # decorate-sort-undecorate; the index breaks ties so equal dates
    # never fall through to comparing the entry dicts themselves
    decorated = [(entry["date_parsed"], i, entry) for i, entry in enumerate(entries)]
    decorated.sort()
    decorated.reverse()  # most recent entries first
    sorted_entries = [entry for (date, i, entry) in decorated]

Congratulations! You've aggregated a bunch of changes!

Contributors

LionKimbro

Discussion

Getting the "author"/"contributor" out of most ModWiki RSS feeds with the feedparser module is a bit confusing. Right now (feedparser 3.3), it goes into the "rdf_value" attribute of the entry.

RssLibraries (last edited 2014-05-08 00:46:56 by DaleAthanasias)
