python - Caching in urllib2?


This ActiveState Python recipe might be useful: http://code.activestate.com/recipes/491261/

Is there an easy way to cache things when using urllib2 that I am overlooking, or do I have to roll my own?

You could use a decorator such as:

    class cache(object):
        def __init__(self, fun):
            self.fun = fun
            self.cache = {}

        def __call__(self, *args, **kwargs):
            # Build the cache key from the call arguments
            key = str(args) + str(kwargs)
            try:
                return self.cache[key]
            except KeyError:
                # Not cached yet: call the function and remember the result
                self.cache[key] = rval = self.fun(*args, **kwargs)
                return rval
            except TypeError:
                # in case key isn't a valid key - don't cache
                return self.fun(*args, **kwargs)

and define a function along the lines of:

    import urllib

    @cache
    def get_url_src(url):
        return urllib.urlopen(url).read()

This assumes you are not paying attention to HTTP cache controls and just want to cache the page for the lifetime of the application.
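To make the effect of that decorator visible without touching the network, here is a minimal sketch written for Python 3 (where urllib2's functionality lives in urllib.request). It is a function-based variant of the same idea, not the class above; the fetch function is a hypothetical stand-in for a real urlopen call, and the `calls` list exists only to count how often the wrapped function actually runs.

```python
import functools


def cache(fun):
    """Memoize fun by its positional and keyword arguments."""
    store = {}

    @functools.wraps(fun)
    def wrapper(*args, **kwargs):
        key = (args, tuple(sorted(kwargs.items())))
        try:
            return store[key]
        except KeyError:
            # First call with these arguments: run the function and keep the result.
            store[key] = result = fun(*args, **kwargs)
            return result
        except TypeError:
            # Unhashable arguments cannot be dictionary keys: skip the cache.
            return fun(*args, **kwargs)

    return wrapper


calls = []  # records every real "fetch" so cache hits are visible


@cache
def get_url_src(url):
    # Hypothetical stand-in for urlopen(url).read(); no network access.
    calls.append(url)
    return "<html>%s</html>" % url


first = get_url_src("http://example.com/")
second = get_url_src("http://example.com/")
print(first == second, len(calls))
```

The second call returns the stored value, so the stand-in fetch runs only once. As with the class above, nothing ever expires: entries live as long as the process does.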

I was looking for something similar and came across "Recipe 491261: Caching and throttling for urllib2", which danivo posted. The problem is that I really dislike the caching code (lots of duplication, lots of manual joining of file paths instead of using os.path.join, use of staticmethods, not very PEP8'ish, and other things I try to avoid).

The code is a bit nicer (in my opinion anyway) and is functionally much the same, with a few additions, mainly the "recache" method (example usage can be seen here, or in the if __name__ == "__main__": section at the end of the code).

The latest version can be found at http://github.com/dbr/tvdb_api/blob/master/cache.py , and I'll paste it here for posterity (with my application-specific headers removed):

    #!/usr/bin/env python
    """
    urllib2 caching handler
    Modified from http://code.activestate.com/recipes/491261/ by dbr
    """

    import os
    import time
    import httplib
    import urllib2
    import StringIO
    from hashlib import md5


    def calculate_cache_path(cache_location, url):
        """Returns the paths [cache_location]/[hash_of_url].headers and .body
        """
        thumb = md5(url).hexdigest()
        header = os.path.join(cache_location, thumb + ".headers")
        body = os.path.join(cache_location, thumb + ".body")
        return header, body


    def check_cache_time(path, max_age):
        """Checks if a file has been created/modified in the [last max_age] seconds.
        False means the file is too old (or doesn't exist), True means it is
        up-to-date and valid"""
        if not os.path.isfile(path):
            return False
        cache_modified_time = os.stat(path).st_mtime
        time_now = time.time()
        if cache_modified_time < time_now - max_age:
            # Cache is old
            return False
        else:
            return True


    def exists_in_cache(cache_location, url, max_age):
        """Returns if header AND body cache file exist (and are up-to-date)"""
        hpath, bpath = calculate_cache_path(cache_location, url)
        if os.path.exists(hpath) and os.path.exists(bpath):
            return (
                check_cache_time(hpath, max_age)
                and check_cache_time(bpath, max_age)
            )
        else:
            # File does not exist
            return False


    def store_in_cache(cache_location, url, response):
        """Tries to store response in cache."""
        hpath, bpath = calculate_cache_path(cache_location, url)
        try:
            outf = open(hpath, "w")
            headers = str(response.info())
            outf.write(headers)
            outf.close()

            outf = open(bpath, "w")
            outf.write(response.read())
            outf.close()
        except IOError:
            return True
        else:
            # Callers pass this value on as set_cache_header: a response
            # that was just written came from the network, not the cache
            return False


    class CacheHandler(urllib2.BaseHandler):
        """Stores responses in a persistent on-disk cache.

        If a subsequent GET request is made for the same URL, the stored
        response is returned, saving time, resources and bandwidth
        """
        def __init__(self, cache_location, max_age=21600):
            """The location of the cache directory"""
            self.max_age = max_age
            self.cache_location = cache_location
            if not os.path.exists(self.cache_location):
                os.mkdir(self.cache_location)

        def default_open(self, request):
            """Handles GET requests, if the response is cached it returns it
            """
            # Compare strings with !=, not "is not" (identity is unreliable)
            if request.get_method() != "GET":
                return None  # let the next handler try to handle the request

            if exists_in_cache(
                self.cache_location, request.get_full_url(), self.max_age
            ):
                return CachedResponse(
                    self.cache_location,
                    request.get_full_url(),
                    set_cache_header=True
                )
            else:
                return None

        def http_response(self, request, response):
            """Gets a HTTP response, if it was a GET request and the status code
            starts with 2 (200 OK etc) it caches it and returns a CachedResponse
            """
            if (request.get_method() == "GET"
                    and str(response.code).startswith("2")):
                if 'x-local-cache' not in response.info():
                    # Response is not cached
                    set_cache_header = store_in_cache(
                        self.cache_location,
                        request.get_full_url(),
                        response
                    )
                else:
                    set_cache_header = True

                return CachedResponse(
                    self.cache_location,
                    request.get_full_url(),
                    set_cache_header=set_cache_header
                )
            else:
                return response


    class CachedResponse(StringIO.StringIO):
        """An urllib2.response-like object for cached responses.

        To determine if a response is cached or coming directly from
        the network, check the x-local-cache header rather than the object type.
        """
        def __init__(self, cache_location, url, set_cache_header=True):
            self.cache_location = cache_location
            hpath, bpath = calculate_cache_path(cache_location, url)

            StringIO.StringIO.__init__(self, open(bpath).read())

            self.url = url
            self.code = 200
            self.msg = "OK"
            headerbuf = open(hpath).read()
            if set_cache_header:
                headerbuf += "x-local-cache: %s\r\n" % (bpath)
            self.headers = httplib.HTTPMessage(StringIO.StringIO(headerbuf))

        def info(self):
            """Returns headers
            """
            return self.headers

        def geturl(self):
            """Returns original URL
            """
            return self.url

        def recache(self):
            """Fetches the URL again and replaces the cached copy"""
            new_request = urllib2.urlopen(self.url)
            set_cache_header = store_in_cache(
                self.cache_location,
                new_request.url,
                new_request
            )
            CachedResponse.__init__(self, self.cache_location, self.url, True)


    if __name__ == "__main__":
        def main():
            """Quick test/example of CacheHandler"""
            opener = urllib2.build_opener(CacheHandler("/tmp/"))
            response = opener.open("http://google.com")
            print response.headers
            print "Response:", response.read()

            response.recache()
            print response.headers
            print "After recache:", response.read()
        main()

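The freshness logic in the handler above rests on two small helpers: one maps a URL to a pair of file paths, the other compares a file's modification time with max_age. The following is a minimal sketch of just those two helpers ported to Python 3, not the author's code; the one substantive change is that hashlib needs bytes there, so the URL is encoded first. The temporary directory and the backdated timestamp are assumptions made only so the example runs without a network.

```python
import os
import tempfile
import time
from hashlib import md5


def calculate_cache_path(cache_location, url):
    """Return the .headers and .body paths for url inside cache_location."""
    # hashlib needs bytes in Python 3, so encode the URL first.
    thumb = md5(url.encode("utf-8")).hexdigest()
    return (os.path.join(cache_location, thumb + ".headers"),
            os.path.join(cache_location, thumb + ".body"))


def check_cache_time(path, max_age):
    """True if path exists and was modified within the last max_age seconds."""
    if not os.path.isfile(path):
        return False
    return os.stat(path).st_mtime >= time.time() - max_age


cache_dir = tempfile.mkdtemp()
hpath, bpath = calculate_cache_path(cache_dir, "http://example.com/")
with open(bpath, "w") as outf:
    outf.write("body")

fresh = check_cache_time(bpath, max_age=60)

# Backdate the file by two hours to simulate an expired entry.
old = time.time() - 7200
os.utime(bpath, (old, old))
stale = check_cache_time(bpath, max_age=60)

# The headers file was never written, so it counts as not cached.
missing = check_cache_time(hpath, max_age=60)
print(fresh, stale, missing)
```

Because the handler requires both the headers file and the body file to be fresh, a missing or expired file of either kind sends the request back to the network.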
This Yahoo Developer Network article - http://developer.yahoo.com/python/python-caching.html - describes how to cache HTTP calls made through urllib to either memory or disk.

I've always been torn between using httplib2, which does a solid job of handling HTTP caching and authentication, and urllib2, which is in the stdlib, has an extensible interface, and supports HTTP proxy servers.

The ActiveState recipe starts to add caching support to urllib2, but only in a very primitive fashion. It doesn't allow for extensibility in storage mechanisms, hard-coding the file-system-backed storage. It also does not honor HTTP cache headers.
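To illustrate what honoring HTTP cache headers involves at its simplest, here is a rough Python 3 sketch that derives a freshness lifetime from a Cache-Control header value. The function name freshness_lifetime is hypothetical, and treating no-cache the same as no-store is a simplification (strictly, no-cache means "revalidate before reuse"); real HTTP caching also involves Expires, ETag, Vary and more, none of which is handled here.

```python
def freshness_lifetime(cache_control):
    """Return max-age in seconds, 0 if caching is disallowed,
    or None if the header gives no explicit lifetime."""
    directives = {}
    for part in cache_control.split(","):
        part = part.strip().lower()
        if not part:
            continue
        name, _, arg = part.partition("=")
        directives[name.strip()] = arg.strip().strip('"')

    # Simplification: both directives are treated as "do not reuse".
    if "no-store" in directives or "no-cache" in directives:
        return 0
    if "max-age" in directives:
        try:
            return max(0, int(directives["max-age"]))
        except ValueError:
            # Malformed value: be conservative and do not reuse.
            return 0
    return None


print(freshness_lifetime("public, max-age=3600"))
print(freshness_lifetime("no-store"))
print(freshness_lifetime("public"))
```

A handler could use such a value in place of a single fixed max_age for every URL, falling back to a default when the function returns None.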

In an attempt to bring together the best features of httplib2 caching and urllib2 extensibility, I've adapted the ActiveState recipe to implement most of the same caching functionality found in httplib2. The module is in jaraco.net as jaraco.net.http.caching . The link points to the module as it exists at the time of this writing. While that module is currently part of the larger jaraco.net package, it has no intra-package dependencies, so feel free to pull the module out and use it in your own projects.

Alternatively, if you have Python 2.6 or later, you can easy_install "jaraco.net>=1.3" and then use the CacheHandler with something like the code in caching.quick_test() .

    """Quick test/example of CacheHandler"""
    import logging
    import urllib2
    from httplib2 import FileCache
    from jaraco.net.http.caching import CacheHandler

    logging.basicConfig(level=logging.DEBUG)
    store = FileCache(".cache")
    opener = urllib2.build_opener(CacheHandler(store))
    urllib2.install_opener(opener)
    response = opener.open("http://www.google.com/")
    print response.headers
    print "Response:", response.read()[:100], '...\n'

    response.reload(store)
    print response.headers
    print "After reload:", response.read()[:100], '...\n'

Note that jaraco.net.http.caching does not provide a specification for the backing store for the cache, but instead follows the interface used by httplib2. For this reason, httplib2.FileCache can be used directly with urllib2 and the CacheHandler. Other backing caches designed for httplib2 should also be usable by the CacheHandler.
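As an illustration of what a backing store of that kind can look like, here is a minimal in-memory sketch in Python 3. It assumes the store interface consists of get, set and delete methods keyed by a string, as in httplib2's FileCache; the class name MemoryCache is hypothetical, and whether CacheHandler accepts it unchanged has not been verified here.

```python
class MemoryCache(object):
    """Dictionary-backed store with a get/set/delete interface."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        # Return None on a miss instead of raising.
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value

    def delete(self, key):
        # Deleting a missing key is not an error.
        self._data.pop(key, None)


store = MemoryCache()
store.set("http://example.com/", b"cached response bytes")
hit = store.get("http://example.com/")
store.delete("http://example.com/")
miss = store.get("http://example.com/")
print(hit, miss)
```

Keeping the store behind such a small interface is what lets the same handler work with files, memory, or any other backend.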

If you don't mind working at a slightly lower level, httplib2 ( https://github.com/httplib2/httplib2 ) is an excellent HTTP library that includes caching functionality.

@dbr: you may also need to cache https responses by adding this method to CacheHandler:

    def https_response(self, request, response):
        return self.http_response(request, response)