
python - How to scrape a website with Sucuri protection




The site uses cookie- and User-Agent-based protection. You can verify this yourself: open DevTools in Chrome, navigate to the target page http://www.dwarozh.net/sport/, then, on the Network tab, right-click the page request and choose "Copy as cURL". Open a console and run the copied cURL command:

$ curl 'http://www.dwarozh.net/sport/all-hawal.aspx?cor=3&Nawnishan=%D9%88%DB%95%D8%B1%D8%B2%D8%B4%DB%95%DA%A9%D8%A7%D9%86%DB%8C%20%D8%AF%DB%8C%DA%A9%DB%95' \
  -H 'Accept-Encoding: gzip, deflate, sdch' \
  -H 'Accept-Language: ru-RU,ru;q=0.8,en-US;q=0.6,en;q=0.4,es;q=0.2' \
  -H 'Upgrade-Insecure-Requests: 1' \
  -H 'X-Compress: null' \
  -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36' \
  -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8' \
  -H 'Referer: http://www.dwarozh.net/sport/details.aspx?jimare=10505' \
  -H 'Cookie: __cfduid=dc9867; sucuri_cloudproxy_uuid_ce28bca9c=d36ad9; ASP.NET_SessionId=wqdo0v; __atuvc=1%7C49; sucuri_cloudproxy_uuid_0d5c=6ab0; _gat=1; __asc=7c0b5; __auc=35; _ga=GA1.2.19688' \
  -H 'Connection: keep-alive' --compressed

You will see the normal HTML. If you remove the cookies or the User-Agent from the request, you get the limit page instead.
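For comparison, the same URL fetched bare, without the Cookie and User-Agent headers, should return the "You are being redirected..." JavaScript challenge shown in the question below instead of the article HTML:

$ curl 'http://www.dwarozh.net/sport/'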

Let's check it in scrapy:

$ scrapy shell
>>> from scrapy import Request
>>> cookie_str = '''here; your; cookies; from; browser; go;'''
>>> cookies = dict(pair.split('=') for pair in cookie_str.split('; '))
>>> cookies  # check them
{'__auc': '999', '__cfduid': '796', '_gat': '1', '__atuvc': '1%7C49',
 'sucuri_cloudproxy_uuid_0d5c97a96': '6ab007eb19', 'ASP.NET_SessionId': 'u9',
 '_ga': 'GA1.2.1968.148', '__asc': 'sfsdf', 'sucuri_cloudproxy_uuid_ce2sfsdfs': 'sdfsdf'}
>>> r = Request(url='http://www.dwarozh.net/sport/', cookies=cookies,
...             headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/56 (KHTML, like Gecko) Chrome/54 Safari/5'})
>>> fetch(r)
>>> response.xpath('//div[@class="news-more-img"]/ul/li')
[<Selector xpath='//div[@class="news-more-img"]/ul/li' data='<li><a href="details.aspx?jimare=10507">'>,
 <Selector xpath='//div[@class="news-more-img"]/ul/li' data='<li><a href="details.aspx?jimare=10505">'>,
 <Selector xpath='//div[@class="news-more-img"]/ul/li' data='<li><a href="details.aspx?jimare=10504">'>,
 <Selector xpath='//div[@class="news-more-img"]/ul/li' data='<li><a href="details.aspx?jimare=10503">'>,
 <Selector xpath='//div[@class="news-more-img"]/ul/li' data='<li><a href="details.aspx?jimare=10323">'>]

Excellent! Let's make a spider:

I've modified yours because I don't have the source code for some of your components.

from scrapy import Spider, Request
from scrapy.selector import Selector
import scrapy
#from Stack.items import StackItem
#from bs4 import BeautifulSoup
from scrapy import log
from scrapy.utils.response import open_in_browser


class StackSpider(Spider):
    name = "dwarozh"
    start_urls = [
        "http://www.dwarozh.net/sport/",
    ]
    _cookie_str = '''__cfduid=dc986; sucuri_cloudproxy_uuid_ce=d36a; ASP.NET_SessionId=wq; __atuvc=1%7C49; sucuri_cloudproxy_uuid_0d5c97a96=6a; _gat=1; __asc=7c0b; __auc=3; _ga=GA1.2.196.14'''
    _user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/5 (KHTML, like Gecko) Chrome/54 Safari/5'

    def start_requests(self):
        # Attach the browser cookies and User-Agent to every start URL
        cookies = dict(pair.split('=') for pair in self._cookie_str.split('; '))
        return [Request(url=url, cookies=cookies,
                        headers={'User-Agent': self._user_agent})
                for url in self.start_urls]

    def parse(self, response):
        mItems = Selector(response).xpath('//div[@class="news-more-img"]/ul/li')
        for mItem in mItems:
            item = {}  # StackItem()
            item['title'] = mItem.xpath('a/h2/text()').extract_first()
            # Note: 'viewa/@href' is kept from the original spider; it yields
            # None in the run below ('a/@href' is probably what was meant)
            item['url'] = mItem.xpath('viewa/@href').extract_first()
            yield {'url': item['url'], 'title': item['title']}

Let's run it:

$ scrapy crawl dwarozh -o - -t csv --loglevel=DEBUG
/Users/el/Projects/scrap_woman/.env/lib/python3.4/importlib/_bootstrap.py:321: ScrapyDeprecationWarning: Module `scrapy.log` has been deprecated, Scrapy now relies on the builtin Python library for logging. Read the updated logging entry in the documentation to learn more.
  return f(*args, **kwds)
2016-12-10 00:18:55 [scrapy] INFO: Scrapy 1.2.1 started (bot: scrap1)
2016-12-10 00:18:55 [scrapy] INFO: Overridden settings: {'SPIDER_MODULES': ['scrap1.spiders'], 'FEED_FORMAT': 'csv', 'BOT_NAME': 'scrap1', 'FEED_URI': 'stdout:', 'NEWSPIDER_MODULE': 'scrap1.spiders', 'ROBOTSTXT_OBEY': True}
2016-12-10 00:18:55 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2016-12-10 00:18:55 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-12-10 00:18:55 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-12-10 00:18:55 [scrapy] INFO: Enabled item pipelines:
[]
2016-12-10 00:18:55 [scrapy] INFO: Spider opened
2016-12-10 00:18:55 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-12-10 00:18:55 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6024
2016-12-10 00:18:55 [scrapy] DEBUG: Crawled (200) <GET http://www.dwarozh.net/robots.txt> (referer: None)
2016-12-10 00:18:56 [scrapy] DEBUG: Crawled (200) <GET http://www.dwarozh.net/sport/> (referer: None)
2016-12-10 00:18:56 [scrapy] DEBUG: Scraped from <200 http://www.dwarozh.net/sport/>
{'url': None, 'title': '\nلیستی یاریزانانی ریاڵ مەدرید بۆ یاری سبەی ڕاگەیەنراو پێنج یاریزان دورخرانەوە'}
2016-12-10 00:18:56 [scrapy] DEBUG: Scraped from <200 http://www.dwarozh.net/sport/>
{'url': None, 'title': '\nهەواڵێکی ناخۆش بۆ هاندەرانی ریاڵ مەدرید'}
2016-12-10 00:18:56 [scrapy] DEBUG: Scraped from <200 http://www.dwarozh.net/sport/>
{'url': None, 'title': '\nگرنگترین مانشێتی ئەمرۆ هەینی رۆژنامەکانی ئیسپانیا'}
2016-12-10 00:18:56 [scrapy] DEBUG: Scraped from <200 http://www.dwarozh.net/sport/>
{'url': None, 'title': '\nبەفەرمی یۆفا پێكهاتەی نموونەی جەولەی شەشەم و کۆتایی چامپیۆنس لیگی بڵاو کردەوە'}
2016-12-10 00:18:56 [scrapy] DEBUG: Scraped from <200 http://www.dwarozh.net/sport/>
{'url': None, 'title': '\nكچە یاریزانێك دەبێتە هۆیی دروست بوونی تیپێكی تۆكمە'}
2016-12-10 00:18:56 [scrapy] INFO: Closing spider (finished)
2016-12-10 00:18:56 [scrapy] INFO: Stored csv feed (5 items) in: stdout:
2016-12-10 00:18:56 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 950,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 15121,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2016, 12, 9, 21, 18, 56, 271371),
 'item_scraped_count': 5,
 'log_count/DEBUG': 8,
 'log_count/INFO': 8,
 'response_received_count': 2,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2016, 12, 9, 21, 18, 55, 869851)}
2016-12-10 00:18:56 [scrapy] INFO: Spider closed (finished)
url,title
," لیستی یاریزانانی ریاڵ مەدرید بۆ یاری سبەی ڕاگەیەنراو پێنج یاریزان دورخرانەوە"
," هەواڵێکی ناخۆش بۆ هاندەرانی ریاڵ مەدرید"
," گرنگترین مانشێتی ئەمرۆ هەینی رۆژنامەکانی ئیسپانیا"
," بەفەرمی یۆفا پێكهاتەی نموونەی جەولەی شەشەم و کۆتایی چامپیۆنس لیگی بڵاو کردەوە"
," كچە یاریزانێك دەبێتە هۆیی دروست بوونی تیپێكی تۆكمە"

You will probably have to refresh the cookies from time to time. You can use PhantomJS for that.

UPDATE:

How to get the cookies using PhantomJS (a Python sketch for automating this follows the steps):

  1. Install PhantomJS.

  2. Make a script like this, dwarosh.js:

    var page = require('webpage').create();
    page.settings.userAgent = 'SpecialAgent';
    page.open('http://www.dwarozh.net/sport/', function(status) {
        console.log("Status: " + status);
        if (status === "success") {
            page.render('example.png');
            page.evaluate(function() {
                return document.title;
            });
        }
        // Print every cookie the page set, one "name value" pair per line
        for (var i = 0; i < page.cookies.length; i++) {
            var c = page.cookies[i];
            console.log(c.name, c.value);
        }
        phantom.exit();
    });

  3. Run the script:

    $ phantomjs --cookies-file=cookie.txt dwarosh.js
    TypeError: undefined is not an object (evaluating 'activeElement.position().left')

      http://www.dwarozh.net/sport/js/script.js:5
      https://code.jquery.com/jquery-1.10.2.min.js:4 in c
      https://code.jquery.com/jquery-1.10.2.min.js:4 in fireWith
      https://code.jquery.com/jquery-1.10.2.min.js:4 in ready
      https://code.jquery.com/jquery-1.10.2.min.js:4 in q
    Status: success
    __auc 250ab0a9158ee9e73eeeac78bba
    __asc 250ab0a9158ee9e73eeeac78bba
    _gat 1
    _ga GA1.2.260482211.1481472111
    ASP.NET_SessionId vs1utb1nyblqkxprxgazh0g2
    sucuri_cloudproxy_uuid_3e07984e4 26e4ab3...
    __cfduid d9059962a4c12e0f....1

  4. Take the sucuri_cloudproxy_uuid_3e07984e4 cookie and try to fetch the page with curl and the same User-Agent:

    $ curl -v http://www.dwarozh.net/sport/ -b sucuri_cloudproxy_uuid_3e07984e4=26e4ab377efbf766d4be7eff20328465 -A SpecialAgent
    *   Trying 104.25.209.23...
    * Connected to www.dwarozh.net (104.25.209.23) port 80 (#0)
    > GET /sport/ HTTP/1.1
    > Host: www.dwarozh.net
    > User-Agent: SpecialAgent
    > Accept: */*
    > Cookie: sucuri_cloudproxy_uuid_3e07984e4=26e4ab377efbf766d4be7eff20328465
    >
    < HTTP/1.1 200 OK
    < Date: Sun, 11 Dec 2016 16:17:04 GMT
    < Content-Type: text/html; charset=utf-8
    < Transfer-Encoding: chunked
    < Connection: keep-alive
    < Set-Cookie: __cfduid=d1646515f5ba28212d4e4ca562e2966311481473024; expires=Mon, 11-Dec-17 16:17:04 GMT; path=/; domain=.dwarozh.net; HttpOnly
    < Cache-Control: private
    < Vary: Accept-Encoding
    < Set-Cookie: ASP.NET_SessionId=srxyurlfpzxaxn1ufr0dvxc2; path=/; HttpOnly
    < X-AspNet-Version: 4.0.30319
    < X-XSS-Protection: 1; mode=block
    < X-Frame-Options: SAMEORIGIN
    < X-Content-Type-Options: nosniff
    < X-Sucuri-ID: 15008
    < Server: cloudflare-nginx
    < CF-RAY: 30fa3ea1335237b0-ARN
    <
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head><meta content="text/html; charset=utf-8" http-equiv="Content-Type"/><title>
    Dwarozh : Sport
    </title><meta content="دواڕۆژ سپۆرت هەواڵی ناوخۆ،هەواڵی جیهانی، وەرزشەکانی دیکە" name="description"/><meta property="fb:app_id" content="1713056075578566"/><meta content="initial-scale=1.0, width=device-width, maximum-scale=1.0, user-scalable=no" name="viewport"/><link href="wene/favicon.ico" rel="shortcut icon" type="image/x-icon"/><link href="wene/style.css" rel="stylesheet" type="text/css"/>
    <script src="js/jquery-2.1.1.js" type="text/javascript"></script>
    <script src="https://code.jquery.com/jquery-1.10.2.min.js" type="text/javascript"></script>
    <script src="js/script.js" type="text/javascript"></script>
    <link href="css/styles.css" rel="stylesheet"/>
    <script src="js/classie.js" type="text/javascript"></script>
    <script type="text/javascript">
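Putting the steps together, here is a minimal sketch (my own addition, not part of the original workflow) that runs the dwarosh.js script above from Python and turns its "name value" output lines into a cookies dict for the spider:

import subprocess

# Sketch only: run the dwarosh.js script above and collect the cookies
# it prints as "name value" pairs, one per line.
def refresh_cookies(phantomjs='phantomjs', script='dwarosh.js'):
    out = subprocess.check_output([phantomjs, script]).decode('utf-8')
    cookies = {}
    for line in out.splitlines():
        parts = line.split(' ', 1)
        # Naive filter: "Status: success", stack traces and URLs all contain ':'
        if len(parts) == 2 and ':' not in line:
            cookies[parts[0]] = parts[1]
    return cookies

In the spider above, start_requests() could then call refresh_cookies() instead of parsing the hard-coded _cookie_str.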

Following the Scrapy documentation, I want to crawl and scrape data from several websites. My code works correctly on ordinary websites, but when I try to crawl a website behind Sucuri I get no data; the Sucuri firewall seems to block my access to the site.

The target website is http://www.dwarozh.net/ and this is my spider snippet:

from scrapy import Spider
from scrapy.selector import Selector
import scrapy
from Stack.items import StackItem
from bs4 import BeautifulSoup
from scrapy import log
from scrapy.utils.response import open_in_browser


class StackSpider(Spider):
    name = "stack"
    start_urls = [
        "http://www.dwarozh.net/sport/",
    ]

    def parse(self, response):
        mItems = Selector(response).xpath('//div[@class="news-more-img"]/ul/li')
        for mItem in mItems:
            item = StackItem()
            item['title'] = mItem.xpath('a/h2/text()').extract_first()
            item['url'] = mItem.xpath('viewa/@href').extract_first()
            yield item

And this is the response I get:

<html><title>You are being redirected...</title>
<noscript>Javascript is required. Please enable javascript before you are allowed to see this page.</noscript>
<script>var s={},u,c,U,r,i,l=0,a,e=eval,w=String.fromCharCode,sucuri_cloudproxy_js='',S='cz0iMHNlYyIuc3Vic3RyKDAsMSkgKyAnNXlCMicuc3Vic3RyKDMsIDEpICsgJycgKycnKyIxIi5zbGljZSgwLDEpICsgJ2pQYycuY2hhckF0KDIpKyJmIiArICIiICsnbz1jJy5jaGFyQXQoMikrICcnICsgCiI0Ii5zbGljZSgwLDEpICsgJ0FvPzcnLnN1YnN0cigzLCAxKSArIjUiICsgU3RyaW5nLmZyb21DaGFyQ29kZSgxMDIpICsgIiIgKycxJyArICAgJycgKyAKIjFzZWMiLnN1YnN0cigwLDEpICsgICcnICsnJysnMycgKyAgImUiLnNsaWNlKDAsMSkgKyAiIiArImZzdSIuc2xpY2UoMCwxKSArICIiICsiMnN1Y3VyIi5jaGFyQXQoMCkrICcnICtTdHJpbmcuZnJvbUNoYXJDb2RlKDEwMCkgKyAgJycgKyI5c3UiLnNsaWNlKDAsMSkgKyAgJycgKycnKyI2IiArICdDYycuc2xpY2UoMSwyKSsiNnN1Ii5zbGljZSgwLDEpICsgJ2YnICsgICAnJyArIAonYScgKyAgIjAiICsgJ2YnICsgICI0IiArICI2c2VjIi5zdWJzdHIoMCwxKSArICAnJyArIAonWnBFMScuc3Vic3RyKDMsIDEpICsiMSIgKyBTdHJpbmcuZnJvbUNoYXJDb2RlKDB4MzgpICsgIiIgKyI1c3VjdXIiLmNoYXJBdCgwKSsiZnN1Ii5zbGljZSgwLDEpICsgJyc7ZG9jdW1lbnQuY29va2llPSdzc3VjJy5jaGFyQXQoMCkrICd1JysnJysnYycuY2hhckF0KDApKyd1c3VjdXInLmNoYXJBdCgwKSsgJ3JzdWMnLmNoYXJBdCgwKSsgJ3N1Y3VyaScuY2hhckF0KDUpICsgJ19zdScuY2hhckF0KDApICsnY3N1Y3VyJy5jaGFyQXQoMCkrICdsJysnbycrJ3UnLmNoYXJBdCgwKSsnZCcrJ3AnKycnKydyc3VjdScuY2hhckF0KDApICArJ3NvJy5jaGFyQXQoMSkrJ3gnKyd5JysnX3N1Y3VyaScuY2hhckF0KDApICsgJ3UnKyd1JysnaXN1Y3VyaScuY2hhckF0KDApICsgJ3N1Y3VkJy5jaGFyQXQoNCkrICdzXycuY2hhckF0KDEpKycxJysnOCcrJzEnKydzdWN1cmQnLmNoYXJBdCg1KSArICdlJy5jaGFyQXQoMCkrJzEnKydzdWN1cjEnLmNoYXJBdCg1KSArICcxc3VjdXJpJy5jaGFyQXQoMCkgKyAnMicrIj0iICsgcyArICc7cGF0aD0vO21heC1hZ2U9ODY0MDAnOyBsb2NhdGlvbi5yZWxvYWQoKTs=';L=S.length;U=0;r='';var A='ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';for(u=0;u<64;u++){s[A.charAt(u)]=u;}for(i=0;i<L;i++){c=s[S.charAt(i)];U=(U<<6)+c;l+=6;while(l>=8){((a=(U>>>(l-=8))&0xff)||(i<(L-2)))&&(r+=w(a));}}e(r);</script></html>

How can I bypass Sucuri with scrapy?


The general solution for scraping dynamic content is to first obtain the DOM/HTML with something capable of executing JavaScript (for example http://phantomjs.org/), then save the HTML and feed it to a parser.

This will also help bypass some JS-based protections.

phantomjs is a single executable; it loads a URI like a real browser, with all the JS evaluated. You can run it from Python with subprocess.call([phantomJsPath, jsProgramPath, url, htmlFileToSave]).
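As a minimal sketch of that call (the paths are assumptions, and save_html.js is a hypothetical helper: a small PhantomJS program that loads its first argument and writes page.content to its second, as the next two paragraphs describe):

import subprocess

from scrapy.selector import Selector

phantomJsPath = 'phantomjs'      # assumed to be on PATH
jsProgramPath = 'save_html.js'   # hypothetical helper script
url = 'http://www.dwarozh.net/sport/'
htmlFileToSave = 'page.html'

# Render the page with a real JS engine, then feed the saved HTML to a parser
subprocess.call([phantomJsPath, jsProgramPath, url, htmlFileToSave])

with open(htmlFileToSave, encoding='utf-8') as f:
    sel = Selector(text=f.read())
print(sel.xpath('//title/text()').extract_first())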

For an example jsProgram, see https://github.com/ariya/phantomjs/blob/master/examples/rasterize.js

To save the HTML from the JS program, use fs.write(htmlFileToSave, page.content, "w");

I tried this method on dwarozh.net and it worked, although you will have to figure out how to plug it into your Scrapy pipeline; one possible wiring is sketched below.
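A hedged sketch rather than a drop-in solution: a downloader middleware that short-circuits Scrapy's downloader and returns the PhantomJS-rendered HTML instead, reusing the hypothetical save_html.js helper from above:

import subprocess
import tempfile

from scrapy.http import HtmlResponse


class PhantomJSMiddleware(object):
    """Sketch: render each request with PhantomJS instead of downloading it.
    Enable it through DOWNLOADER_MIDDLEWARES in settings.py."""

    def process_request(self, request, spider):
        with tempfile.NamedTemporaryFile(suffix='.html', delete=False) as tmp:
            html_file = tmp.name
        # save_html.js is the hypothetical helper described earlier
        subprocess.call(['phantomjs', 'save_html.js', request.url, html_file])
        with open(html_file, 'rb') as f:
            body = f.read()
        # Returning a Response from process_request skips the normal download
        return HtmlResponse(url=request.url, body=body, encoding='utf-8',
                            request=request)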

Specifically for your example, you can try to parse the supplied JavaScript "manually" to extract the cookie details required to load the real page. Bear in mind that Sucuri can change its algorithm at any time, so any solution based on decoding the cookie or the JS will eventually break.
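If you do try the manual route, a hedged starting point: extract and base64-decode the S='...' payload from the challenge page (visible in the response above) so you can inspect the JS that assembles the sucuri_cloudproxy_uuid_* cookie. Executing that JS safely is left to you, and the format can change at any time:

import base64
import re

def decode_sucuri_payload(challenge_html):
    # The challenge embeds its cookie-building JS as a base64 string in S='...'
    match = re.search(r"S\s*=\s*'([A-Za-z0-9+/=]+)'", challenge_html)
    if not match:
        return None
    return base64.b64decode(match.group(1)).decode('utf-8', 'replace')

# Example: print(decode_sucuri_payload(open('challenge.html').read()))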