Scraping JavaScript in R with RSelenium

It would be much easier to use the JSON data directly. Use the "Developer Tools" in almost any modern browser to trace the URLs the page loads; this one did not take long to find in that list:

    library(jsonlite)

    url <- "https://js.washingtonpost.com/graphics/policeshootings/policeshootings.json?d14385542"
    shootings <- fromJSON(url)

    dplyr::glimpse(shootings)
    ## Observations: 564
    ## Variables:
    ## $ id                 (int) 3, 4, 5, 8, 9, 11, 13, 15, 16, 17, 19, 21, ...
    ## $ date               (chr) "2015-01-02", "2015-01-02", "2015-01-03", "...
    ## $ description        (chr) "Elliot, who was on medication for depressi...
    ## $ blurb              (chr) "a 53-year-old man of Asian heritage armed ...
    ## $ name               (chr) "Tim Elliot", "Lewis Lee Lembke", "John Pau...
    ## $ age                (int) 53, 47, 23, 32, 39, 18, 22, 35, 34, 47, 25,...
    ## $ gender             (chr) "M", "M", "M", "M", "M", "M", "M", "M", "F"...
    ## $ race               (chr) "A", "W", "H", "W", "H", "W", "H", "W", "W"...
    ## $ armed              (chr) "gun", "gun", "unarmed", "toy weapon", "nai...
    ## $ city               (chr) "Shelton", "Aloha", "Wichita", "San Francis...
    ## $ state              (chr) "WA", "OR", "KS", "CA", "CO", "OK", "AZ", "...
    ## $ address            (chr) "600 block of E. Island Lake Drive", "4519 ...
    ## $ lat                (dbl) 47.24683, 45.48620, 37.69477, 37.76291, 40....
    ## $ lon                (dbl) -123.12159, -122.89128, -97.28055, -122.422...
    ## $ is_geocoding_exact (lgl) TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, T...
    ## $ mental             (lgl) TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, FAL...
    ## $ sources            (list) http://kbkw.com/local-news/329755, http://...
    ## $ photos             (list) NULL, NULL, 107, , , , //img.washingtonpos...
    ## $ videos             (list) NULL, NULL, NULL, NULL, NULL, NULL, NULL, ...
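Once the feed is parsed, the result is an ordinary data frame, so no browser automation is needed for filtering or type conversion. The following is only a minimal sketch: it parses a two-record inline sample (values copied from the `glimpse()` output above, with most fields omitted) rather than the live URL, and the names `sample_json`, `shootings_sample`, and `over_50` are made up for illustration.

```r
library(jsonlite)

# Two illustrative records in the same shape as the feed. The values come
# from the glimpse() output; the real feed carries many more fields.
sample_json <- '[
  {"id": 3, "date": "2015-01-02", "name": "Tim Elliot", "age": 53, "state": "WA"},
  {"id": 4, "date": "2015-01-02", "name": "Lewis Lee Lembke", "age": 47, "state": "OR"}
]'

# fromJSON() simplifies an array of objects into a data frame by default.
shootings_sample <- fromJSON(sample_json)

# Dates arrive as character strings, so convert them before date arithmetic.
shootings_sample$date <- as.Date(shootings_sample$date)

# Plain data-frame subsetting works as usual.
over_50 <- shootings_sample[shootings_sample$age > 50, c("name", "age", "state")]
```

The same steps should carry over to the full `shootings` object, although list columns such as `sources` and `photos` would need separate handling.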

I am trying to scrape the Washington Post's database of police shootings. Since the content is not plain HTML, I can't use rvest, so I used RSelenium and phantomjs.

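    library(RSelenium)

    # Download (if needed) and start the standalone Selenium server.
    # Note: later RSelenium releases made these two helpers defunct.
    checkForServer()
    startServer()

    # Point the driver at the local phantomjs binary.
    eCap <- list(phantomjs.binary.path = "C:/Program Files/Chrome Driver/phantomjs.exe")
    remDr <- remoteDriver(browserName = "phantomjs", extraCapabilities = eCap)

    # Open a session and load the page.
    remDr$open()
    remDr$navigate("http://www.washingtonpost.com/graphics/national/police-shootings/")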

Inspecting the source, it is clear that the elements I am interested in have the following id and class:

<div id="js-list-690" class="listWrapper cf">

I can access the text of an individual item:

remDr$findElement("css", "#js-list-691")$getElementText()

which returns

    [[1]]
    [1] "An unidentified person, a 47-year-old Hispanic man, was shocked with a stun gun and shot on July 30, 2015, in Whittier, Calif. Los Angeles County deputies were investigating a domestic disturbance when he threatened the officers and struck one of them with a metal rod.\nMALEDEADLY WEAPONHISPANIC45 TO 54\nCBS Los AngelesWhittier Daily News"

But if I try to get a list of all of these elements:

remDr$findElements("class name", "listWrapper cf")

it results in an error.

How can I

  1. Get a list of all elements that share the class listWrapper cf?
  2. Return a list of the text associated with each element?
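One likely cause of the error is that Selenium's "class name" locator accepts a single class, while `listWrapper cf` is two classes separated by a space. A compound CSS selector such as `.listWrapper.cf` matches elements carrying both. The sketch below is one possible approach, not a confirmed fix for this page: `class_to_css` is a hypothetical helper written for this example, and the Selenium calls only run if the `remDr` session from the question is still open.

```r
# Hypothetical helper: turn a space-separated class attribute into a
# compound CSS selector, e.g. "listWrapper cf" -> ".listWrapper.cf".
class_to_css <- function(class_attr) {
  classes <- strsplit(trimws(class_attr), "\\s+")[[1]]
  paste0(".", classes, collapse = "")
}

selector <- class_to_css("listWrapper cf")

# The part below needs the live session from the question, so it is
# skipped when no `remDr` object exists.
if (exists("remDr")) {
  # 1. All elements that carry both classes.
  items <- remDr$findElements(using = "css selector", value = selector)

  # 2. The text of each element; getElementText() returns a one-item list.
  texts <- vapply(items, function(el) el$getElementText()[[1]], character(1))
}
```

Searching with `using = "class name"` and the single value `"listWrapper"` may work as well, provided no unrelated elements use that class.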