python - How do I list the resources loaded with Selenium/PhantomJS?

I want to load a web page and list all the resources (JavaScript / images / CSS) loaded for that page. I use this code to load the page:

    from selenium import webdriver

    # PhantomJS driver; needs a Selenium release that still includes it
    # (it was removed in Selenium 4)
    driver = webdriver.PhantomJS()
    driver.get('http://example.com')

The code above works perfectly and I can do some processing on the HTML page. The question is: how do I list all the resources loaded by that page? I want something like this:

    ['http://example.com/img/logo.png',
     'http://example.com/css/style.css',
     'http://example.com/js/jquery.js',
     'http://www.google-analytics.com/ga.js']

I am also open to other solutions, such as using the PySide.QWebView module. I just want to list the resources loaded by the page.

There is no function in webdriver that returns all the resources a web page has, but you could do something like this:

    from selenium.webdriver.common.by import By

    images = driver.find_elements(By.TAG_NAME, "img")

and the same for the script and link elements.
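The elements returned by find_elements are not URLs yet; each one still has to be asked for its src or href attribute. A minimal sketch of that step follows, assuming a helper name (collect_resource_urls) that is not part of Selenium. It only sees what is present in the DOM, so resources referenced from CSS or fetched by scripts are missed, and link elements that are not stylesheets (icons, canonical links) are included.

```python
def collect_resource_urls(driver):
    """Return the src/href values of the img, script and link elements.

    `driver` is any Selenium WebDriver that has already loaded a page.
    The helper name is hypothetical, not a Selenium API.
    """
    # (tag name, attribute that holds the URL)
    targets = [("img", "src"), ("script", "src"), ("link", "href")]
    urls = []
    for tag, attr in targets:
        # "tag name" is the string value of By.TAG_NAME
        for element in driver.find_elements("tag name", tag):
            url = element.get_attribute(attr)
            # inline scripts have no src; skip empty values and duplicates
            if url and url not in urls:
                urls.append(url)
    return urls
```

Calling collect_resource_urls(driver) after driver.get(...) returns the URLs grouped by tag, in the order images, scripts, links.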

This is not a Selenium solution, but it works well with Python and PhantomJS.

The idea is to do exactly what the 'Network' tab in Chrome Developer Tools does. To do that, we have to listen for every request made by the web page.

JavaScript / PhantomJS part

With PhantomJS, this can be done using the following script; adapt it as needed:

    // getResources.js
    // Usage:
    //   ./phantomjs --ssl-protocol=any --web-security=false getResources.js your_url
    // The ssl-protocol and web-security flags are added to dismiss SSL errors.
    var page = require('webpage').create();
    var system = require('system');
    var urls = [];

    // Check whether the requested resource is an image
    function isImg(url) {
        var acceptedExts = ['jpg', 'jpeg', 'png'];
        var baseUrl = url.split('?')[0];
        var ext = baseUrl.split('.').pop().toLowerCase();
        return acceptedExts.indexOf(ext) > -1;
    }

    // Check whether a URL has the given extension
    function isExt(url, ext) {
        var baseUrl = url.split('?')[0];
        var fileExt = baseUrl.split('.').pop().toLowerCase();
        return ext === fileExt;
    }

    // Listen for all requests made by the web page
    // (like the 'Network' tab of Chrome Developer Tools)
    // and add them to an array
    page.onResourceRequested = function (request, networkRequest) {
        // If the requested URL is the page itself, do nothing,
        // so that the other resource requests can follow
        if (system.args[1] == request.url) {
            return;
        } else if (isImg(request.url) || isExt(request.url, 'js') || isExt(request.url, 'css')) {
            // The URL is an image, CSS or JS file: add it to the array
            urls.push(request.url);
            // Abort the request for a better response time;
            // omit this to also collect asynchronously loaded files
            networkRequest.abort();
        }
    };

    // When all requests have been made, print the array to the console
    page.onLoadFinished = function (status) {
        console.log(JSON.stringify(urls));
        phantom.exit();
    };

    // If an error occurs, dismiss it
    page.onResourceError = function () { return false; };
    page.onError = function () { return false; };

    // Open the web page
    page.open(system.args[1]);
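The extension test used by isImg and isExt (drop the query string, take the text after the last dot, lowercase it) can also be written in Python, for example to filter the list afterwards if the script is changed to report every URL. This is only a sketch; the name resource_kind is hypothetical, and judging a resource by its extension alone will miss URLs that have none.

```python
import posixpath
from urllib.parse import urlsplit

# Same image extensions the PhantomJS script accepts
IMAGE_EXTS = {"jpg", "jpeg", "png"}


def resource_kind(url):
    """Return 'image', 'js', 'css' or None, based only on the path extension."""
    path = urlsplit(url).path  # leaves out the query string and fragment
    ext = posixpath.splitext(path)[1].lstrip(".").lower()
    if ext in IMAGE_EXTS:
        return "image"
    if ext in ("js", "css"):
        return ext
    return None
```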

Python part

Now call the script from Python with:

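    from subprocess import check_output
    import json

    your_url = 'http://example.com'  # the page to inspect

    # Run PhantomJS with the script and capture what it prints
    out = check_output(['./phantomjs', '--ssl-protocol=any',
                        '--web-security=false', 'getResources.js', your_url])
    # The script prints the URLs as a JSON array
    data = json.loads(out)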
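If this is needed for more than one page, the call above can be wrapped in a function. The sketch below assumes the phantomjs binary and getResources.js sit in the current directory, as in the call above; the function name list_resources and its run parameter are hypothetical, the latter added only so the command runner can be swapped out.

```python
import json
from subprocess import check_output


def list_resources(url, phantomjs="./phantomjs", script="getResources.js",
                   run=check_output):
    """Run getResources.js on `url` and return the list of URLs it prints."""
    out = run([phantomjs, "--ssl-protocol=any", "--web-security=false",
               script, url])
    # check_output returns bytes; the script prints the JSON array on one line
    lines = out.decode("utf-8").strip().splitlines()
    # Use the last line in case anything else was printed before the array
    return json.loads(lines[-1]) if lines else []
```

For example, list_resources('http://example.com') would return the same list as data above.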

Hope this helps.