tutorial - scraping with Java

Using Java to pull data from a webpage?

The Basics

Look at these to build a solution more or less from scratch:

Start from the basics: the Java Tutorial's chapter on Networking, including Working With URLs
Make things easier for yourself: Apache HttpComponents (including HttpClient)

The Easily Glued and Stitched Stuff

You always have the option of calling external tools from Java using exec() and similar methods. For instance, you could use wget or cURL.
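As a rough sketch of that approach, the snippet below uses ProcessBuilder (the more flexible sibling of Runtime.exec()) to run an external command and capture its output. The class name ExternalFetch, the helper run, and the example.com URL are made up for illustration, and it assumes curl is installed and on the PATH.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

class ExternalFetch {

    // Runs an external command and returns everything it printed.
    public static String run(List<String> command) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(command);
        pb.redirectErrorStream(true); // merge stderr into stdout so a single read drains both
        Process process = pb.start();

        String output;
        try (InputStream in = process.getInputStream()) {
            // Assumption: the tool emits UTF-8 text
            output = new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }

        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("Command failed with exit code " + exitCode + ": " + output);
        }
        return output;
    }

    public static void main(String[] args) throws Exception {
        // Assumption: curl is on the PATH; the URL is a placeholder.
        // -s silences the progress meter, -L follows redirects.
        String page = run(List.of("curl", "-sL", "https://example.com/"));
        System.out.println(page);
    }
}
```

Passing the command as a list of arguments, rather than one string, avoids shell quoting problems when the URL contains characters such as & or ?.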

The Hardcore Stuff

Then, if you want to go deeper, the need for automated web testing has thankfully given us very practical tools for this. Look at:

HtmlUnit (powerful and simple)
Selenium, Selenium-RC
WebDriver/Selenium2 (still in the works)
JBehave with JBehave Web

Some other libraries are written specifically with web scraping in mind:

Some Workarounds

Java is a language, but also a platform, with many other languages running on it. Some of them come with great syntactic sugar or libraries for building scrapers easily.

Check out:

Groovy (and its XmlSlurper)
or Scala (with great XML support)

If you know of a great library for Ruby (JRuby, with an article on scraping with JRuby and HtmlUnit) or Python (Jython), or you simply prefer these languages, then give their JVM ports a chance.

Some Supplements

Some other similar questions:

I'm attempting to write my first program in Java. The goal is to write a program that browses to a website and downloads a file for me. However, I don't know how to use Java to interact with the internet. Can anyone tell me what topics to look up or read about, or recommend some good resources?

Here is my solution using URL and a try-with-resources statement, with handlers for the exceptions.

    /**
     * Created by mona on 5/27/16.
     */
    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.net.MalformedURLException;
    import java.net.URL;

    public class ReadFromWeb {

        public static void readFromWeb(String webURL) throws IOException {
            // Create the URL and open the stream inside the try, so that a bad URL
            // reaches the handlers below and the stream is always closed.
            try (InputStream is = new URL(webURL).openStream();
                 BufferedReader br = new BufferedReader(new InputStreamReader(is))) {
                String line;
                while ((line = br.readLine()) != null) {
                    System.out.println(line);
                }
            } catch (MalformedURLException e) {
                e.printStackTrace();
                throw new MalformedURLException("URL is malformed!!");
            } catch (IOException e) {
                e.printStackTrace();
                // Keep the original exception as the cause instead of discarding it
                throw new IOException("Could not read " + webURL, e);
            }
        }

        public static void main(String[] args) throws IOException {
            String url = "https://madison.craigslist.org/search/sub";
            readFromWeb(url);
        }
    }

You can also save it to a file, depending on your needs, or parse it using XML or HTML libraries.
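To illustrate the parsing step, here is a minimal sketch that uses only the XML parser and XPath engine bundled with the JDK to pull every link out of a page. It assumes the input is well-formed XML/XHTML, which many real pages are not; for messy HTML a dedicated HTML parsing library is the safer choice. The class name LinkExtractor and the sample markup are invented for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class LinkExtractor {

    // Returns the href of every <a> element in a well-formed XML/XHTML document.
    public static List<String> extractLinks(String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));

        // XPath: select the href attribute of all <a> elements, at any depth
        NodeList hrefs = (NodeList) XPathFactory.newInstance()
                .newXPath()
                .evaluate("//a/@href", doc, XPathConstants.NODESET);

        List<String> links = new ArrayList<>();
        for (int i = 0; i < hrefs.getLength(); i++) {
            links.add(hrefs.item(i).getNodeValue());
        }
        return links;
    }

    public static void main(String[] args) throws Exception {
        // Sample markup standing in for a downloaded page
        String page = "<html><body>"
                + "<a href=\"/search/sub\">sublets</a>"
                + "<a href=\"/about\">about</a>"
                + "</body></html>";
        System.out.println(extractLinks(page));
    }
}
```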

The simplest solution (without depending on any third-party library or platform) is to create a URL instance pointing to the web page or link you want to download, and read the content using streams.

For example:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.net.URLConnection;

    public class DownloadPage {

        public static void main(String[] args) throws IOException {
            // Make a URL to the web page
            URL url = new URL("http://.com/questions/6159118/using-java-to-pull-data-from-a-webpage");

            // Get the input stream through the URL connection
            URLConnection con = url.openConnection();

            // Once you have the input stream, it's just plain old Java IO stuff.
            // For this case, since you are interested in getting a plain-text web page,
            // I'll use a reader and output the text content to System.out.
            // For binary content, it's better to read the bytes directly from the stream
            // and write them to the target file.
            // try-with-resources closes the reader and the underlying stream.
            try (BufferedReader br = new BufferedReader(new InputStreamReader(con.getInputStream()))) {
                String line;
                // read each line and write to System.out
                while ((line = br.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }
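For the binary case mentioned in the comments above, which is also what the question asks about (downloading a file), one possible sketch is to copy the raw bytes straight to disk with Files.copy. The class name DownloadFile, the example.com URL, and the file name report.pdf are placeholders, not part of the original answer.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class DownloadFile {

    // Copies the raw bytes behind a URL into the target file and returns the byte count.
    public static long download(URL source, Path target) throws IOException {
        try (InputStream in = source.openStream()) {
            // No Reader involved: bytes are written unchanged, so binary files stay intact
            return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical URL and file name, for illustration only
        URL url = new URL("https://example.com/files/report.pdf");
        long bytes = download(url, Path.of("report.pdf"));
        System.out.println("Saved " + bytes + " bytes");
    }
}
```

Skipping the Reader matters here: decoding binary data as text and re-encoding it can corrupt the file.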

Hope this helps.