Is there an existing Python module that takes care of retrieving and caching web page contents?
Something like the following sketch, where webcache and get_page are made-up names for the kind of interface I have in mind:
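import webcache  # hypothetical module

# The first call fetches the page from the web and caches it; later
# calls within the cache lifetime would return the stored copy.
page = webcache.get_page("http://www.python.org/")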
Perhaps there are different options for where and how to store cache data.
I have written at least three programs that do this (an nLSD interpreter and two Local Names servers), and am about to embark on a fourth.
Has anyone created a standard module or interface for this sort of thing?
Some things that would be nice:
- Optional attention to HTTP cache directives.
- Specify directory to store cache entries in.
- Optional compression and decompression of cached data.
- Optional connection to a client-side Squid cache (pooling a web cache with other programs).
- Conceivably, a caching module could be a drop-in replacement for urllib; see the sketch after this list.
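Here is a minimal sketch of what I mean by a drop-in replacement. It keeps the cache in an in-memory dict with no expiry, and the module name cachingurllib is made up:

# cachingurllib.py -- hypothetical module name; the cache is in memory only
import urllib
from StringIO import StringIO

_cache = {}

def urlopen(url):
    """Cached stand-in for urllib.urlopen.

    Returns a file-like object, so callers that only read() the result
    work unchanged. A real module would add expiry, on-disk storage,
    and attention to the HTTP cache headers.
    """
    if url not in _cache:
        _cache[url] = urllib.urlopen(url).read()
    return StringIO(_cache[url])

Pooling with a client-side Squid would then just be a matter of passing a proxies argument to the inner urllib.urlopen call (for example, proxies={'http': 'http://localhost:3128/'} for a Squid on its default port).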
Some info for would-be cachers:
- Caching Tutorial for Web Authors and Webmasters -- explains the HTTP headers that control caching
- urllib.urlretrieve -- does some of what we want, though you have to do a lot of the maintenance yourself (see the example below)
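For instance, urllib.urlretrieve will save a page to a local file, but deciding whether the saved copy is still fresh is entirely up to you; the helper fetch and its ttl parameter below are made-up names:

import os
import time
import urllib

def fetch(url, filename, ttl=60*60):
    """Download url to filename unless a fresh copy is already there."""
    stale = (not os.path.exists(filename) or
             time.time() - os.path.getmtime(filename) > ttl)
    if stale:
        # urlretrieve always re-downloads; the freshness check is ours.
        urllib.urlretrieve(url, filename)
    return open(filename).read()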
-- LionKimbro 2005-03-29 06:45:28
Primitive Example
Here's a very simple example:
#!/usr/bin/python
"""Retrieve and cache web pages.

webcache retrieves and caches web pages. If a page has been retrieved
before and the cached copy has not expired, the cached version is used.

The module is primitive; it DOES NOT respect HTTP cache headers. Cached
pages are stored in a BSD database.

WebCache -- cache for web pages
"""

import time
import urllib
import optparse
import bsddb


class WebCache:

    """BSD DB cache for web pages.

    get_page -- retrieve a page from the cache or the web
    dump_page -- drop a cache entry
    clean -- purge expired cache entries
    """

    def __init__(self, page_db_filename, time_db_filename, cache_ttl):
        """Initialize the web cache.

        Berkeley databases are created if they don't already exist. The
        page database stores the contents of web pages. The time
        database stores the times that the pages were loaded.

        Times are stored in seconds since the epoch.

        page_db_filename -- filename of page database
        time_db_filename -- filename of load timestamp database
        cache_ttl -- cache time to live in seconds
        """
        self._page_db = bsddb.hashopen(page_db_filename)
        self._time_db = bsddb.hashopen(time_db_filename)
        self.cache_ttl = cache_ttl

    def get_page(self, url):
        """Retrieve a page from the web or the cache.

        get_page returns the page contents retrieved by urllib.urlopen.

        url -- URL of web page to retrieve
        """
        # Serve the cached copy if it hasn't outlived the TTL.
        now = time.time()
        if url in self._time_db:
            last_read = float(self._time_db[url])
            if now < last_read + self.cache_ttl:
                return self._page_db[url]

        contents = urllib.urlopen(url).read()

        # Record the page and its load time, then flush both databases.
        self._page_db[url] = contents
        self._time_db[url] = str(now)
        self._page_db.sync()
        self._time_db.sync()

        return contents

    def dump_page(self, url):
        """Drop a cache entry, forcing the next get_page to refetch."""
        del self._time_db[url]
        del self._page_db[url]
        self._time_db.sync()
        self._page_db.sync()

    def clean(self):
        """Purge expired entries from the cache.

        Note that the BSD database file may not actually get smaller.
        (Rather, older data will be overwritten by new data.)
        """
        now = time.time()
        for (url, last_read) in self._time_db.items():
            last_read = float(last_read)
            if now >= last_read + self.cache_ttl:
                del self._time_db[url]
                del self._page_db[url]
        self._time_db.sync()
        self._page_db.sync()


if __name__ == "__main__":
    parser = optparse.OptionParser("usage: %prog [options]\n"
                                   "cleans the cache if no URL is"
                                   " supplied")
    parser.add_option("-p", "--pages", dest="page_db_filename",
                      default="pages.db", type="string",
                      help="pages BSD database filename")
    parser.add_option("-t", "--times", dest="time_db_filename",
                      default="times.db", type="string",
                      help="timestamps BSD database filename")
    parser.add_option("-T", "--ttl", dest="ttl", default=60*60,
                      type="int", help="time to live (in seconds)")
    parser.add_option("-u", "--url", dest="url", type="string",
                      help="URL of page to retrieve and display")

    (options, args) = parser.parse_args()
    if len(args) > 0:
        parser.error("incorrect number of arguments")

    cache = WebCache(options.page_db_filename, options.time_db_filename,
                     options.ttl)
    if options.url is None:
        cache.clean()
    else:
        print cache.get_page(options.url)
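For example, to use the class from another program (assuming the module is saved as webcache.py; the database filenames and TTL are just sample values):

from webcache import WebCache

cache = WebCache("pages.db", "times.db", 60 * 60)   # one-hour TTL
print cache.get_page("http://www.python.org/")      # fetched from the web
print cache.get_page("http://www.python.org/")      # served from the BSD database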