"500 Internal Server Error" al combinar Scrapy sobre Splash con un proxy HTTP (1)
I'm trying to run a Scrapy spider in a Docker container using both Splash (to render JavaScript) and Tor via Privoxy (to provide anonymity). Here is the docker-compose.yml I'm using for this purpose:
version: '3'

services:
  scraper:
    build: ./apk_splash
    # environment:
    #   - http_proxy=http://tor-privoxy:8118
    links:
      - tor-privoxy
      - splash

  tor-privoxy:
    image: rdsubhas/tor-privoxy-alpine

  splash:
    image: scrapinghub/splash
where the scraper has the following Dockerfile:
FROM python:alpine
RUN apk --update add libxml2-dev libxslt-dev libffi-dev gcc musl-dev libgcc openssl-dev curl bash
RUN pip install scrapy scrapy-splash scrapy-fake-useragent
COPY . /scraper
WORKDIR /scraper
CMD ["scrapy", "crawl", "apkmirror"]
and the spider I'm trying to run is:
import scrapy
from scrapy_splash import SplashRequest
from apk_splash.items import ApkmirrorItem


class ApkmirrorSpider(scrapy.Spider):
    name = 'apkmirror'
    allowed_domains = ['apkmirror.com']
    start_urls = [
        'http://www.apkmirror.com/apk/cslskku/androbench-storage-benchmark/androbench-storage-benchmark-5-0-release/androbench-storage-benchmark-5-0-android-apk-download/',
    ]
    custom_settings = {'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'}

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url=url, callback=self.parse, endpoint='render.html', args={'wait': 0.5})

    def parse(self, response):
        item = ApkmirrorItem()
        item['url'] = response.url
        item['developer'] = response.css('.breadcrumbs').xpath('.//*[re:test(@href, "^/(?:[^/]+/){1}[^/]+/$")]/text()').extract_first()
        item['app'] = response.css('.breadcrumbs').xpath('.//*[re:test(@href, "^/(?:[^/]+/){2}[^/]+/$")]/text()').extract_first()
        item['version'] = response.css('.breadcrumbs').xpath('.//*[re:test(@href, "^/(?:[^/]+/){3}[^/]+/$")]/text()').extract_first()
        yield item
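The spider imports ApkmirrorItem from apk_splash/items.py, which isn't shown here; a minimal sketch of that file, inferred from the fields populated in parse() (the original post doesn't include it), would be:

import scrapy

class ApkmirrorItem(scrapy.Item):
    # One Field per key the spider fills in parse()
    url = scrapy.Field()
    developer = scrapy.Field()
    app = scrapy.Field()
    version = scrapy.Field()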
where I added the following to settings.py:
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPLASH_URL = 'http://splash:8050/'
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
With the environment section for the scraper container commented out, the scraper more or less works. I get logs containing the following:
scraper_1 | 2017-07-11 13:57:19 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.apkmirror.com/apk/cslskku/androbench-storage-benchmark/androbench-storage-benchmark-5-0-release/androbench-storage-benchmark-5-0-android-apk-download/ via http://splash:8050/render.html> (referer: None)
scraper_1 | 2017-07-11 13:57:19 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.apkmirror.com/apk/cslskku/androbench-storage-benchmark/androbench-storage-benchmark-5-0-release/androbench-storage-benchmark-5-0-android-apk-download/>
scraper_1 | {'app': 'Androbench (Storage Benchmark)',
scraper_1 | 'developer': 'CSL@SKKU',
scraper_1 | 'url': 'http://www.apkmirror.com/apk/cslskku/androbench-storage-benchmark/androbench-storage-benchmark-5-0-release/androbench-storage-benchmark-5-0-android-apk-download/',
scraper_1 | 'version': '5.0'}
scraper_1 | 2017-07-11 13:57:19 [scrapy.core.engine] INFO: Closing spider (finished)
scraper_1 | 2017-07-11 13:57:19 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
scraper_1 | {'downloader/request_bytes': 1508,
scraper_1 | 'downloader/request_count': 3,
scraper_1 | 'downloader/request_method_count/GET': 2,
scraper_1 | 'downloader/request_method_count/POST': 1,
scraper_1 | 'downloader/response_bytes': 190320,
scraper_1 | 'downloader/response_count': 3,
scraper_1 | 'downloader/response_status_count/200': 2,
scraper_1 | 'downloader/response_status_count/404': 1,
scraper_1 | 'finish_reason': 'finished',
scraper_1 | 'finish_time': datetime.datetime(2017, 7, 11, 13, 57, 19, 488874),
scraper_1 | 'item_scraped_count': 1,
scraper_1 | 'log_count/DEBUG': 5,
scraper_1 | 'log_count/INFO': 7,
scraper_1 | 'memusage/max': 49131520,
scraper_1 | 'memusage/startup': 49131520,
scraper_1 | 'response_received_count': 3,
scraper_1 | 'scheduler/dequeued': 2,
scraper_1 | 'scheduler/dequeued/memory': 2,
scraper_1 | 'scheduler/enqueued': 2,
scraper_1 | 'scheduler/enqueued/memory': 2,
scraper_1 | 'splash/render.html/request_count': 1,
scraper_1 | 'splash/render.html/response_count/200': 1,
scraper_1 | 'start_time': datetime.datetime(2017, 7, 11, 13, 57, 13, 788850)}
scraper_1 | 2017-07-11 13:57:19 [scrapy.core.engine] INFO: Spider closed (finished)
apksplashcompose_scraper_1 exited with code 0
However, if I comment the environment lines in docker-compose.yml back in, I get a 500 Internal Server Error:
scraper_1 | 2017-07-11 14:05:07 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://www.apkmirror.com/apk/cslskku/androbench-storage-benchmark/androbench-storage-benchmark-5-0-release/androbench-storage-benchmark-5-0-android-apk-download/ via http://splash:8050/render.html> (failed 3 times): 500 Internal Server Error
scraper_1 | 2017-07-11 14:05:07 [scrapy.core.engine] DEBUG: Crawled (500) <GET http://www.apkmirror.com/apk/cslskku/androbench-storage-benchmark/androbench-storage-benchmark-5-0-release/androbench-storage-benchmark-5-0-android-apk-download/ via http://splash:8050/render.html> (referer: None)
scraper_1 | 2017-07-11 14:05:07 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <500 http://www.apkmirror.com/apk/cslskku/androbench-storage-benchmark/androbench-storage-benchmark-5-0-release/androbench-storage-benchmark-5-0-android-apk-download/>: HTTP status code is not handled or not allowed
scraper_1 | 2017-07-11 14:05:07 [scrapy.core.engine] INFO: Closing spider (finished)
scraper_1 | 2017-07-11 14:05:07 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
scraper_1 | {'downloader/request_bytes': 3898,
scraper_1 | 'downloader/request_count': 7,
scraper_1 | 'downloader/request_method_count/GET': 4,
scraper_1 | 'downloader/request_method_count/POST': 3,
scraper_1 | 'downloader/response_bytes': 6839,
scraper_1 | 'downloader/response_count': 7,
scraper_1 | 'downloader/response_status_count/200': 1,
scraper_1 | 'downloader/response_status_count/500': 6,
scraper_1 | 'finish_reason': 'finished',
scraper_1 | 'finish_time': datetime.datetime(2017, 7, 11, 14, 5, 7, 866713),
scraper_1 | 'httperror/response_ignored_count': 1,
scraper_1 | 'httperror/response_ignored_status_count/500': 1,
scraper_1 | 'log_count/DEBUG': 10,
scraper_1 | 'log_count/INFO': 8,
scraper_1 | 'memusage/max': 49065984,
scraper_1 | 'memusage/startup': 49065984,
scraper_1 | 'response_received_count': 3,
scraper_1 | 'retry/count': 4,
scraper_1 | 'retry/max_reached': 2,
scraper_1 | 'retry/reason_count/500 Internal Server Error': 4,
scraper_1 | 'scheduler/dequeued': 4,
scraper_1 | 'scheduler/dequeued/memory': 4,
scraper_1 | 'scheduler/enqueued': 4,
scraper_1 | 'scheduler/enqueued/memory': 4,
scraper_1 | 'splash/render.html/request_count': 1,
scraper_1 | 'splash/render.html/response_count/500': 3,
scraper_1 | 'start_time': datetime.datetime(2017, 7, 11, 14, 4, 46, 717691)}
scraper_1 | 2017-07-11 14:05:07 [scrapy.core.engine] INFO: Spider closed (finished)
apksplashcompose_scraper_1 exited with code 0
In summary, when I use Splash to render JavaScript, I can't also successfully use HttpProxyMiddleware to route traffic through Tor via Privoxy. Can anyone see what's wrong here?
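(A side note: Splash's render.html endpoint also accepts a proxy argument, so an alternative to setting the http_proxy environment variable would be to tell Splash itself which proxy to render through. A minimal sketch of start_requests along those lines, assuming Privoxy is reachable as tor-privoxy:8118 as in the docker-compose.yml above:

def start_requests(self):
    for url in self.start_urls:
        # Ask Splash (not Scrapy) to fetch through Privoxy; the
        # Scrapy -> Splash request itself is then not proxied.
        yield SplashRequest(
            url=url,
            callback=self.parse,
            endpoint='render.html',
            args={'wait': 0.5, 'proxy': 'http://tor-privoxy:8118'},
        )

This is only a sketch of that alternative, not what the rest of this post uses.)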
Update
Following Paul's comment, I tried adapting the splash service as follows:
splash:
  image: scrapinghub/splash
  volumes:
    - ./splash/proxy-profiles:/etc/splash/proxy-profiles
where I added a 'splash' directory to the main directory like so:
.
├── apk_splash
├── docker-compose.yml
└── splash
└── proxy-profiles
└── proxy.ini
and proxy.ini reads:
[proxy]
host=tor-privoxy
port=8118
As I understand it, this should make the proxy always be used (that is, the whitelist defaults to ".*" and there is no blacklist).
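For reference, a proxy profile can also carry explicit rules; a sketch of what spelling the whitelist out would look like (equivalent, as far as I understand, to the default behavior above):

[proxy]
host=tor-privoxy
port=8118

[rules]
whitelist=
    .*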
However, if I docker-compose build and docker-compose up, I still get HTTP 500 errors. So the question remains: how do I solve this?
(Incidentally, this question seems similar to https://github.com/scrapy-plugins/scrapy-splash/issues/117; however, I'm not using Crawlera, so I'm not sure how to adapt the answer.)
Update 2
Following Paul's second comment, I checked that tor-privoxy resolves inside the container by doing the following (while it was still running):
~$ docker ps -l
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
04909e6ef5cb apksplashcompose_scraper "scrapy crawl apkm..." 2 hours ago Up 8 seconds apksplashcompose_scraper_1
~$ docker exec -it $(docker ps -lq) /bin/bash
bash-4.3# python
Python 3.6.1 (default, Jun 19 2017, 23:58:41)
[GCC 5.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import socket
>>> socket.gethostbyname('tor-privoxy')
'172.22.0.2'
As for how I'm running Splash, it's via a linked container, similar to the way described in https://splash.readthedocs.io/en/stable/install.html#docker-folder-sharing. I verified that /etc/splash/proxy-profiles/proxy.ini is present in the container:
~$ docker exec -it apksplashcompose_splash_1 /bin/bash
root@b091fbef4c78:/# cd /etc/splash/proxy-profiles
root@b091fbef4c78:/etc/splash/proxy-profiles# ls
proxy.ini
root@b091fbef4c78:/etc/splash/proxy-profiles# cat proxy.ini
[proxy]
host=tor-privoxy
port=8118
I'll try Aquarium, but the question remains: why doesn't the current setup work?
Update 3
Following the structure of the Aquarium project suggested by paul trmbrth, I found that it is essential to name the .ini file default.ini rather than proxy.ini (otherwise it doesn't get picked up automatically). I managed to get the scraper working this way (compare my self-answer to How to use Scrapy with Splash and Tor over Privoxy in Docker Compose).
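In other words, the layout that ended up working is the same as shown above, but with the profile file renamed:

.
├── apk_splash
├── docker-compose.yml
└── splash
    └── proxy-profiles
        └── default.ini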