Using Scrapy + Splash to return HTML
I'm trying to get the hang of Scrapy and Splash. As an exercise, I tried to click the button on the following JavaScript-heavy website: http://thestlbrowns.com/ and then return the HTML of the newly rendered page.
My code looks like this:
import scrapy
import json
from scrapy import Request


class MySpider(scrapy.Spider):
    name = 'spiderman'
    domain = ['web']
    start_urls = ['http://thestlbrowns.com/']

    def start_requests(self):
        script = """
        function main(splash)
            local url = splash.args.url
            assert(splash:go(url))
            assert(splash:wait(1))
            assert(splash:runjs("$('#title.play-ball > a:first-child').click()"))
            assert(splash:wait(1))
            -- return result as a JSON object
            return {
                html = splash:html(),
                -- we don't need screenshot or network activity
                --png = splash:png(),
                --har = splash:har(),
            }
        end
        """
        for url in self.start_urls:
            yield Request(url, self.parse, meta={
                'splash': {
                    'args': {'lua_source': script},
                    'endpoint': 'execute',
                }
            })

    def parse(self, response):
        splash_json = json.loads(response.body_as_unicode())
However, when I run this code I get the following output:
$ scrapy crawl spiderman
2017-01-12 14:19:03 [scrapy.utils.log] INFO: Scrapy 1.3.0 started (bot: myScrapingProject)
2017-01-12 14:19:03 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'myScrapingProject', 'DOWNLOAD_DELAY': 0.25, 'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter', 'HTTPCACHE_STORAGE': 'scrapy_splash.SplashAwareFSCacheStorage', 'NEWSPIDER_MODULE': 'myScrapingProject.spiders', 'SPIDER_MODULES': ['myScrapingProject.spiders'], 'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64; rv:7.0.1) Gecko/20100101 Firefox/7.7'}
2017-01-12 14:19:03 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2017-01-12 14:19:03 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy_splash.SplashCookiesMiddleware',
 'scrapy_splash.SplashMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-01-12 14:19:03 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy_splash.SplashDeduplicateArgsMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-01-12 14:19:03 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-01-12 14:19:03 [scrapy.core.engine] INFO: Spider opened
2017-01-12 14:19:03 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-01-12 14:19:03 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-01-12 14:19:16 [scrapy_splash.middleware] WARNING: Bad request to Splash: {'error': 400, 'info': {'error': "bad argument #2 to 'assert' (string expected, got table)", 'line_number': 8, 'source': '[string "..."]', 'message': 'Lua error: [string "..."]:8: bad argument #2 to \'assert\' (string expected, got table)', 'type': 'LUA_ERROR'}, 'description': 'Error happened while executing Lua script', 'type': 'ScriptError'}
2017-01-12 14:19:16 [scrapy.core.engine] DEBUG: Crawled (400) <POST http://localhost:8050/execute> (referer: None)
2017-01-12 14:19:16 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 http://thestlbrowns.com/>: HTTP status code is not handled or not allowed
2017-01-12 14:19:16 [scrapy.core.engine] INFO: Closing spider (finished)
2017-01-12 14:19:16 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1222,
 'downloader/request_count': 1,
 'downloader/request_method_count/POST': 1,
 'downloader/response_bytes': 471,
 'downloader/response_count': 1,
 'downloader/response_status_count/400': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 1, 12, 13, 19, 16, 846242),
 'log_count/DEBUG': 2,
 'log_count/INFO': 8,
 'log_count/WARNING': 1,
 'response_received_count': 1,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'splash/execute/request_count': 1,
 'splash/execute/response_count/400': 1,
 'start_time': datetime.datetime(2017, 1, 12, 13, 19, 3, 417278)}
2017-01-12 14:19:16 [scrapy.core.engine] INFO: Spider closed (finished)
Q: Does anyone know how to fix this, or what I'm doing wrong?
EDIT: When I add script = quote(script) before passing the script to Splash, I get the following error output:
Message: 'Bad request to Splash: {\'type\': \'ScriptError\', \'description\': \'Error happened while executing Lua script\', ' \
'\'error\': 400, \'info\': {\'error\': "unexpected symbol near \'%\'", \'type\': \'LUA_INIT_ERROR\', \'line_number\': 1, ' \
'\'source\': \'[string "%0A%20%20%20%20%20%20%20%20%20function%20main..."]\',' \
' \'message\': \'[string "%0A%20%20%20%20%20%20%20%20%20function%20main..."]:1: unexpected symbol near \\\'%\\\'\'}}'
The response from Splash contains some hints:
{'description': 'Error happened while executing Lua script',
 'error': 400,
 'info': {'error': "bad argument #2 to 'assert' (string expected, got table)",
          'line_number': 8,
          'message': 'Lua error: [string "..."]:8: bad argument #2 to \'assert\' (string expected, got table)',
          'source': '[string "..."]',
          'type': 'LUA_ERROR'},
 'type': 'ScriptError'}
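The fields worth reading in that payload are info.type, info.line_number and info.error. As a minimal sketch (the helper name summarize_splash_error is hypothetical, and the payload is simply the dict above copied into Python), they can be condensed into a single line:

```python
def summarize_splash_error(payload):
    """Return a one-line summary of a Splash ScriptError payload."""
    info = payload.get("info", {})
    return "{type} at Lua line {line}: {error}".format(
        type=info.get("type", "UNKNOWN"),
        line=info.get("line_number", "?"),
        # Fall back to the top-level description if there is no detailed error.
        error=info.get("error", payload.get("description", "")),
    )


# The payload reported by Splash above, copied as a Python dict.
payload = {
    "description": "Error happened while executing Lua script",
    "error": 400,
    "info": {
        "error": "bad argument #2 to 'assert' (string expected, got table)",
        "line_number": 8,
        "message": "Lua error: [string \"...\"]:8: bad argument #2 to "
                   "'assert' (string expected, got table)",
        "source": "[string \"...\"]",
        "type": "LUA_ERROR",
    },
    "type": "ScriptError",
}

print(summarize_splash_error(payload))
```

The line number refers to the Lua source as Splash received it, so it counts from the first line of the script string, not from the Python file.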
If you try your script in the Splash web interface (it is your friend!), you get the same error, which comes from this line:
assert(splash:runjs("$('#title.play-ball > a:first-child').click()"))
splash:runjs returns two values, ok and err, and when the JavaScript fails err is a table rather than a string, which is what assert complains about. If you change the Lua script a little, you can capture that error (by the way, I think you meant .title.play-ball > a:first-child, because there is no element with id="title"):
function main(splash)
    local url = splash.args.url
    assert(splash:go(url))
    assert(splash:wait(1))
    -- run the click, capture the error instead of asserting, then wait 1 second
    local ok, err = splash:runjs("$('.title.play-ball > a:first-child').click()")
    assert(splash:wait(1))
    -- return result as a JSON object
    return {
        html = splash:html(),
        error = err,
        -- we don't need screenshot or network activity
        --png = splash:png(),
        --har = splash:har(),
    }
end
When you run it in the web interface, you get an "error" object in the response, which shows:
error: Object
    js_error: "ReferenceError: Can't find variable: $"
    js_error_message: "Can't find variable: $"
    js_error_type: "ReferenceError"
    message: "JS error: \"ReferenceError: Can't find variable: $\""
    splash_method: "runjs"
    type: "JS_ERROR"
It looks like the $ shortcut is not available on that website. You can use it in the Chrome console, for example, but with Splash you apparently need to load jQuery (or something similar) yourself, usually with splash:autoload. For example:
function main(splash)
    -- make jQuery (and therefore $) available on every page load
    assert(splash:autoload("https://code.jquery.com/jquery-3.1.1.min.js"))
    local url = splash.args.url
    assert(splash:go(url))
    assert(splash:wait(1))
    -- run the click, capture the error instead of asserting, then wait 1 second
    local ok, err = splash:runjs("$('.title.play-ball > a:first-child').click()")
    assert(splash:wait(1))
    -- return result as a JSON object
    return {
        html = splash:html(),
        error = err,
        -- we don't need screenshot or network activity
        --png = splash:png(),
        --har = splash:har(),
    }
end
Note that this JavaScript code did not work for me with Splash (the screenshot did not show the "History" section).
But I tried the following in the web interface and did get the "History" section to show (in the PNG screenshot, which is commented out here):
function main(splash)
    -- no need to load jQuery when you use splash:select
    --assert(splash:autoload("https://code.jquery.com/jquery-3.1.1.min.js"))
    local url = splash.args.url
    assert(splash:go(url))
    assert(splash:wait(15))
    local element = splash:select('.title.play-ball > a:first-child')
    local bounds = element:bounds()
    assert(element:mouse_click{x=bounds.width/2, y=bounds.height/2})
    assert(splash:wait(5))
    -- return result as a JSON object
    return {
        html = splash:html(),
        -- we don't need screenshot or network activity
        --png = splash:png(),
        --har = splash:har(),
    }
end
In fact, Splash 2.3 has helpers for this kind of interaction (for example, clicking an element). See for example splash:select and element:mouse_click.
Also note that I increased the wait() values.
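On the Scrapy side, the table returned by the Lua script arrives as a JSON object, so the parse callback only has to decode it and pick out the keys. A minimal sketch, assuming the body text comes from response.body_as_unicode() (or response.text in Scrapy versions that provide it); the helper name extract_html and the sample body are made up for illustration:

```python
import json


def extract_html(body_text):
    """Decode the JSON body returned by the execute endpoint.

    Returns a (html, error) tuple; either value is None when the Lua
    script did not include that key in its result table.
    """
    data = json.loads(body_text)
    return data.get("html"), data.get("error")


# Hypothetical response body, shaped like the Lua return table.
sample_body = json.dumps({"html": "<html><body>History</body></html>"})

html, error = extract_html(sample_body)
if error is not None:
    print("JavaScript failed:", error)
else:
    print("got", len(html), "characters of rendered HTML")
```

Keeping the error key in the Lua result, as in the earlier scripts, lets the spider notice a failed click instead of silently parsing the unchanged page.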
You need to "quote" your script before passing it to Splash:
script = """Your script"""
from urllib.parse import quote
script = quote(script)
# 'Your%20script'
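Before choosing an encoding it helps to look at what each one does to the script text. The sketch below compares urllib.parse.quote with JSON encoding; the short Lua script is a stand-in, and the idea that the script travels to the execute endpoint inside a JSON request body is an assumption based on the POST request shown in the question's log. The percent-encoded prefix it prints has the same shape as the source string quoted in the LUA_INIT_ERROR from the question's edit.

```python
import json
from urllib.parse import quote, unquote

# Stand-in Lua script, not the one from the question.
lua_script = """
function main(splash)
  assert(splash:go(splash.args.url))
  return {html = splash:html()}
end
"""

# Percent-encoding rewrites newlines, spaces and parentheses, so the
# result is no longer valid Lua unless the receiver decodes it first.
encoded = quote(lua_script)
print(encoded[:18])

# JSON encoding escapes the script inside a string literal; decoding the
# body gives back exactly the original text.
body = json.dumps({"lua_source": lua_script, "url": "http://thestlbrowns.com/"})
decoded_script = json.loads(body)["lua_source"]
print(decoded_script == lua_script)
```

If the script is passed through a JSON body, no extra quoting step should be needed, because the JSON encoder already takes care of newlines and quote characters.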