Using Scrapy + Splash to return HTML

I'm trying to get to grips with Scrapy and Splash. As an exercise, I tried to click the button on the following JavaScript-heavy website: http://thestlbrowns.com/ and then return the HTML of the newly rendered page.

My code looks like this:

    import scrapy
    import json
    from scrapy import Request


    class MySpider(scrapy.Spider):
        name = 'spiderman'
        domain = ['web']
        start_urls = ['http://thestlbrowns.com/']

        def start_requests(self):
            script = """
            function main(splash)
                local url = splash.args.url
                assert(splash:go(url))
                assert(splash:wait(1))
                assert(splash:runjs("$('#title.play-ball > a:first-child').click()"))
                assert(splash:wait(1))
                -- return result as a JSON object
                return {
                    html = splash:html(),
                    -- we don't need screenshot or network activity
                    --png = splash:png(),
                    --har = splash:har(),
                }
            end
            """
            for url in self.start_urls:
                yield Request(url, self.parse, meta={
                    'splash': {
                        'args': {'lua_source': script},
                        'endpoint': 'execute',
                    }
                })

        def parse(self, response):
            splash_json = json.loads(response.body_as_unicode())

However, when I run this code I get the following output:

    $ scrapy crawl spiderman
    2017-01-12 14:19:03 [scrapy.utils.log] INFO: Scrapy 1.3.0 started (bot: myScrapingProject)
    2017-01-12 14:19:03 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'myScrapingProject', 'DOWNLOAD_DELAY': 0.25, 'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter', 'HTTPCACHE_STORAGE': 'scrapy_splash.SplashAwareFSCacheStorage', 'NEWSPIDER_MODULE': 'myScrapingProject.spiders', 'SPIDER_MODULES': ['myScrapingProject.spiders'], 'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64; rv:7.0.1) Gecko/20100101 Firefox/7.7'}
    2017-01-12 14:19:03 [scrapy.middleware] INFO: Enabled extensions:
    ['scrapy.extensions.corestats.CoreStats',
     'scrapy.extensions.telnet.TelnetConsole',
     'scrapy.extensions.logstats.LogStats']
    2017-01-12 14:19:03 [scrapy.middleware] INFO: Enabled downloader middlewares:
    ['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
     'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
     'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
     'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
     'scrapy.downloadermiddlewares.retry.RetryMiddleware',
     'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
     'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
     'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
     'scrapy_splash.SplashCookiesMiddleware',
     'scrapy_splash.SplashMiddleware',
     'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
     'scrapy.downloadermiddlewares.stats.DownloaderStats']
    2017-01-12 14:19:03 [scrapy.middleware] INFO: Enabled spider middlewares:
    ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
     'scrapy_splash.SplashDeduplicateArgsMiddleware',
     'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
     'scrapy.spidermiddlewares.referer.RefererMiddleware',
     'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
     'scrapy.spidermiddlewares.depth.DepthMiddleware']
    2017-01-12 14:19:03 [scrapy.middleware] INFO: Enabled item pipelines:
    []
    2017-01-12 14:19:03 [scrapy.core.engine] INFO: Spider opened
    2017-01-12 14:19:03 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2017-01-12 14:19:03 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
    2017-01-12 14:19:16 [scrapy_splash.middleware] WARNING: Bad request to Splash: {'error': 400, 'info': {'error': "bad argument #2 to 'assert' (string expected, got table)", 'line_number': 8, 'source': '[string "..."]', 'message': 'Lua error: [string "..."]:8: bad argument #2 to \'assert\' (string expected, got table)', 'type': 'LUA_ERROR'}, 'description': 'Error happened while executing Lua script', 'type': 'ScriptError'}
    2017-01-12 14:19:16 [scrapy.core.engine] DEBUG: Crawled (400) <POST http://localhost:8050/execute> (referer: None)
    2017-01-12 14:19:16 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 http://thestlbrowns.com/>: HTTP status code is not handled or not allowed
    2017-01-12 14:19:16 [scrapy.core.engine] INFO: Closing spider (finished)
    2017-01-12 14:19:16 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 1222,
     'downloader/request_count': 1,
     'downloader/request_method_count/POST': 1,
     'downloader/response_bytes': 471,
     'downloader/response_count': 1,
     'downloader/response_status_count/400': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2017, 1, 12, 13, 19, 16, 846242),
     'log_count/DEBUG': 2,
     'log_count/INFO': 8,
     'log_count/WARNING': 1,
     'response_received_count': 1,
     'scheduler/dequeued': 2,
     'scheduler/dequeued/memory': 2,
     'scheduler/enqueued': 2,
     'scheduler/enqueued/memory': 2,
     'splash/execute/request_count': 1,
     'splash/execute/response_count/400': 1,
     'start_time': datetime.datetime(2017, 1, 12, 13, 19, 3, 417278)}
    2017-01-12 14:19:16 [scrapy.core.engine] INFO: Spider closed (finished)

Q: Does anyone know how to fix this, or what I'm doing wrong?

EDIT: When I add script = quote(script) before passing the script to Splash, I get the following error output:

    Message: 'Bad request to Splash: {\'type\': \'ScriptError\', \'description\': \'Error happened while executing Lua script\', ' \
             '\'error\': 400, \'info\': {\'error\': "unexpected symbol near \'%\'", \'type\': \'LUA_INIT_ERROR\', \'line_number\': 1, ' \
             '\'source\': \'[string "%0A%20%20%20%20%20%20%20%20%20function%20main..."]\',' \
             ' \'message\': \'[string "%0A%20%20%20%20%20%20%20%20%20function%20main..."]:1: unexpected symbol near \\\'%\\\'\'}}'


The response from Splash contains some hints:

    {'description': 'Error happened while executing Lua script',
     'error': 400,
     'info': {'error': "bad argument #2 to 'assert' (string expected, got table)",
              'line_number': 8,
              'message': 'Lua error: [string "..."]:8: bad argument #2 to \'assert\' (string expected, got table)',
              'source': '[string "..."]',
              'type': 'LUA_ERROR'},
     'type': 'ScriptError'}
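To make the useful parts of that payload easier to spot, here is a minimal Python sketch that pulls them out of the error dictionary. The payload is copied from the response above; `summarize_splash_error` is a hypothetical helper written for this illustration and is not part of Scrapy or scrapy-splash.

```python
# Error payload copied from the Splash response quoted above.
splash_error = {
    "description": "Error happened while executing Lua script",
    "error": 400,
    "info": {
        "error": "bad argument #2 to 'assert' (string expected, got table)",
        "line_number": 8,
        "message": "Lua error: [string \"...\"]:8: bad argument #2 to 'assert' "
                   "(string expected, got table)",
        "source": "[string \"...\"]",
        "type": "LUA_ERROR",
    },
    "type": "ScriptError",
}


def summarize_splash_error(payload):
    """Hypothetical helper: condense a Splash ScriptError payload to one line."""
    info = payload.get("info", {})
    return "{} ({}) at Lua line {}: {}".format(
        payload.get("type"),
        info.get("type"),
        info.get("line_number"),
        info.get("error"),
    )


summary = summarize_splash_error(splash_error)
print(summary)
```

The fields that matter here are `info.type` (the failure happened while running the Lua script, not while loading it), `info.line_number`, and `info.error`, which names the `assert` call as the culprit.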

If you try your script in the Splash web interface (it's your friend!), you get the same error, which comes from this line:

    assert(splash:runjs("$('#title.play-ball > a:first-child').click()"))

If you change that Lua script a bit, you can capture the error instead of asserting on it (by the way, I think you meant .title.play-ball > a:first-child, because there is no element with id="title"):

    function main(splash)
        local url = splash.args.url
        assert(splash:go(url))
        assert(splash:wait(1))
        -- run the click and capture any error instead of asserting on it
        local ok, err = splash:runjs("$('.title.play-ball > a:first-child').click()")
        assert(splash:wait(1))
        -- return result as a JSON object
        return {
            html = splash:html(),
            error = err,
            -- we don't need screenshot or network activity
            --png = splash:png(),
            --har = splash:har(),
        }
    end

When you run it in the web interface, you get an "error" object in the response. It is a table rather than a string, which is why assert rejected it as its second argument. It shows:

    error: Object
        js_error: "ReferenceError: Can't find variable: $"
        js_error_message: "Can't find variable: $"
        js_error_type: "ReferenceError"
        message: "JS error: \"ReferenceError: Can't find variable: $\""
        splash_method: "runjs"
        type: "JS_ERROR"

It looks like the $ magic is not available on that website. You can use it in the Chrome console, for example, but with Splash you apparently need to load jQuery (or something similar) yourself, usually with splash:autoload. For example:

    function main(splash)
        -- make jQuery available to every page loaded after this call
        assert(splash:autoload("https://code.jquery.com/jquery-3.1.1.min.js"))
        local url = splash.args.url
        assert(splash:go(url))
        assert(splash:wait(1))
        -- run the click and capture any error instead of asserting on it
        local ok, err = splash:runjs("$('.title.play-ball > a:first-child').click()")
        assert(splash:wait(1))
        -- return result as a JSON object
        return {
            html = splash:html(),
            error = err,
            -- we don't need screenshot or network activity
            --png = splash:png(),
            --har = splash:har(),
        }
    end

Note that this JavaScript code did not work for me with Splash (the screenshot did not show the "History" section).

But I tried the following in the web interface, and got "History" showing (in the png screenshot, which is commented out here):

    function main(splash)
        -- no need to load jQuery when you use splash:select
        --assert(splash:autoload("https://code.jquery.com/jquery-3.1.1.min.js"))
        local url = splash.args.url
        assert(splash:go(url))
        assert(splash:wait(15))
        local element = splash:select('.title.play-ball > a:first-child')
        local bounds = element:bounds()
        assert(element:mouse_click{x=bounds.width/2, y=bounds.height/2})
        assert(splash:wait(5))
        -- return result as a JSON object
        return {
            html = splash:html(),
            -- we don't need screenshot or network activity
            --png = splash:png(),
            --har = splash:har(),
        }
    end

In fact, Splash 2.3 has helpers for this kind of interaction (for example, clicking on an element). See for example splash:select and element:mouse_click.

Also note that I increased the wait() values.
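To plug the working Lua script back into the spider from the question, the only Python-side pieces are the `meta` dictionary and the JSON decoding in `parse`. The sketch below builds the same `meta` shape the question's `Request` uses and decodes a response body. `build_splash_meta`, `extract_html` and the sample body are hypothetical names and data made up for illustration; a real body would come from Splash.

```python
import json

# The splash:select version of the Lua script shown above.
LUA_CLICK_SCRIPT = """
function main(splash)
    local url = splash.args.url
    assert(splash:go(url))
    assert(splash:wait(15))
    local element = splash:select('.title.play-ball > a:first-child')
    local bounds = element:bounds()
    assert(element:mouse_click{x=bounds.width/2, y=bounds.height/2})
    assert(splash:wait(5))
    return {html = splash:html()}
end
"""


def build_splash_meta(lua_source, endpoint="execute"):
    """Hypothetical helper: same meta layout as the Request in the question."""
    return {"splash": {"args": {"lua_source": lua_source}, "endpoint": endpoint}}


def extract_html(body_text):
    """Hypothetical helper: pull the 'html' key out of the JSON the script returns."""
    return json.loads(body_text)["html"]


meta = build_splash_meta(LUA_CLICK_SCRIPT)

# Made-up response body, standing in for what the Lua script would return.
sample_body = json.dumps({"html": "<html><body>History</body></html>"})
html = extract_html(sample_body)
print(html)
```

In the spider, `meta` would be passed as `Request(url, self.parse, meta=meta)`, and `parse` would call `extract_html` on the decoded response text.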


You need to "quote" your script before passing it to Splash:

    from urllib.parse import quote

    script = """Your script"""
    script = quote(script)  # 'Your%20script'
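As a quick check of what `quote` actually produces for a multi-line Lua script, here is a small sketch using only the standard library. The Lua body is a shortened stand-in, not the full script from the question.

```python
from urllib.parse import quote, unquote

# Shortened stand-in for the Lua script (assumption: any multi-line script behaves the same).
script = """
    function main(splash)
        return {html = splash:html()}
    end
"""

encoded = quote(script)

# Newlines become %0A and spaces become %20, so the result is no longer Lua source.
print(encoded[:30])

# The encoding is reversible, which is how a receiver would get the Lua text back.
assert unquote(encoded) == script
```

The encoded text starts with `%0A%20...function%20main`, which has the same form as the `source` string in the LUA_INIT_ERROR reported in the question's EDIT.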