Extracting text from HTML in Java


I am working on a program that downloads HTML pages, then picks out some of the information and writes it to another file.

I want to extract the information between the paragraph tags, but I can only get one line of the paragraph. My code is as follows:

I was trying to add another while loop that would tell the program to keep writing to the file until the line contains the </p> tag, like this:

    while ((s = br.readLine()) != null) {
        if (s.contains("<p>")) {
            while (!s.contains("</p>")) {
                try {
                    out.write(s);
                } catch (IOException e) {
                }
            }
        }
    }

But this doesn't work. Could someone help, please?

jsoup

Another HTML parser that I really liked using is jsoup. You can get all the <p> elements in two lines of code.

    Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
    Elements ps = doc.select("p");

Then write them to a file in one more line:

    out.write(ps.text()); // joins the text of all the p elements into one long string

Or, if you want them on separate lines, you can iterate over the elements and write each one separately.

I have had success using TagSoup and XPath to parse HTML.

http://home.ccil.org/~cowan/XML/tagsoup/

Try this (if you don't want to use an HTML parser library):

    import java.io.*;

    public class ParagraphCopier {
        public static void main(String[] args) throws IOException {
            // args[0] is the downloaded HTML file; the output goes to newFile.txt
            try (BufferedReader buffRd = new BufferedReader(new FileReader(args[0]));
                 BufferedWriter out = new BufferedWriter(new FileWriter("newFile.txt"))) {
                String s;
                boolean writeTo = false; // true while we are between <p> and </p>
                while ((s = buffRd.readLine()) != null) {
                    if (s.contains("<p>")) {
                        writeTo = true;
                    }
                    if (writeTo) {
                        out.write(s); // each line is written exactly once
                        out.newLine();
                    }
                    if (s.contains("</p>")) {
                        writeTo = false;
                    }
                }
            }
        }
    }

Try this.

    import java.io.IOException;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    public class PrintParagraphs {
        public static void main(String[] args) {
            String url = "http://en.wikipedia.org/wiki/Big_data";
            try {
                Document document = Jsoup.connect(url).get();
                Elements paragraphs = document.select("p");
                // Elements is a List<Element>, so a for-each loop visits every
                // paragraph and also copes with a page that has no <p> at all.
                for (Element p : paragraphs) {
                    System.out.println("* " + p.text());
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

You may simply be using the wrong tool for the job:

    perl -ne "print if m|<p>| .. m|</p>|" infile.txt >outfile.txt
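If you would rather stay in Java, a similar idea can be sketched with java.util.regex. This is not a literal port of the Perl one-liner: it assumes the whole page has already been read into a single string, so that a paragraph can span several lines, and it returns the text between the tags rather than whole lines. The class name RegexParagraphs and the method extract are made up for illustration. Like any regex approach, it will break on nested or malformed markup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexParagraphs {
    // DOTALL lets '.' match line breaks, so a paragraph may span several lines.
    // The reluctant (.*?) stops at the first closing tag after each opening tag.
    // \b keeps <pre> and similar tags from being mistaken for <p>.
    private static final Pattern PARAGRAPH = Pattern.compile(
            "<p\\b[^>]*>(.*?)</p>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: returns the raw content of every <p>...</p> pair.
    static List<String> extract(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = PARAGRAPH.matcher(html);
        while (m.find()) {
            result.add(m.group(1).trim());
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<h1>Title</h1>\n<p>first line\nsecond line</p>\n<p class=\"x\">other</p>";
        for (String p : extract(html)) {
            System.out.println("* " + p);
        }
    }
}
```

Any inline markup inside a paragraph (for example <b>) is returned as-is, which is one more reason to prefer a real parser for anything beyond a quick script.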

Use a ParserCallback. It is a simple class that ships with the JDK. It notifies you every time a new tag is encountered, and you can then extract the text of the tag. A simple example:

    import java.io.*;
    import java.net.*;
    import javax.swing.text.*;
    import javax.swing.text.html.*;
    import javax.swing.text.html.parser.*;

    public class ParserCallbackTest extends HTMLEditorKit.ParserCallback {
        private int tabLevel = 1; // current nesting depth, used for indentation
        private int line = 1;

        public void handleComment(char[] data, int pos) {
            displayData(new String(data));
        }

        public void handleEndOfLineString(String eol) {
            System.out.println(line++);
        }

        public void handleEndTag(HTML.Tag tag, int pos) {
            tabLevel--;
            displayData("/" + tag);
        }

        public void handleError(String errorMsg, int pos) {
            displayData(pos + ":" + errorMsg);
        }

        public void handleMutableTag(HTML.Tag tag, MutableAttributeSet a, int pos) {
            displayData("mutable:" + tag + ": " + pos + ": " + a);
        }

        public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet a, int pos) {
            displayData(tag + "::" + a);
            // tabLevel++;
        }

        public void handleStartTag(HTML.Tag tag, MutableAttributeSet a, int pos) {
            displayData(tag + ":" + a);
            tabLevel++;
        }

        public void handleText(char[] data, int pos) {
            displayData(new String(data));
        }

        // Prints the text indented by one tab per nesting level.
        private void displayData(String text) {
            for (int i = 0; i < tabLevel; i++)
                System.out.print("\t");
            System.out.println(text);
        }

        public static void main(String[] args) throws IOException {
            ParserCallbackTest parser = new ParserCallbackTest();
            // args[0] is the file to parse
            Reader reader = new FileReader(args[0]);
            // URLConnection conn = new URL(args[0]).openConnection();
            // Reader reader = new InputStreamReader(conn.getInputStream());
            try {
                new ParserDelegator().parse(reader, parser, true);
            } catch (IOException e) {
                System.out.println(e);
            }
        }
    }

So all you need to do is set a boolean flag when the paragraph tag is encountered. Then, in the handleText() method, you extract the text.
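A minimal sketch of that flag-based idea might look like the following. The class name ParagraphCallback, the extract helper, and the choice to collect each paragraph into a list are assumptions made for illustration; only the ParserCallback and ParserDelegator classes come from the JDK.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

final class ParagraphCallback extends HTMLEditorKit.ParserCallback {
    private final List<String> paragraphs = new ArrayList<>();
    private final StringBuilder current = new StringBuilder();
    private boolean inParagraph = false; // the boolean flag described above

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet a, int pos) {
        if (tag == HTML.Tag.P) {
            inParagraph = true;   // start collecting text
            current.setLength(0);
        }
    }

    @Override
    public void handleText(char[] data, int pos) {
        if (inParagraph) {
            current.append(data); // text of nested inline tags lands here too
        }
    }

    @Override
    public void handleEndTag(HTML.Tag tag, int pos) {
        if (tag == HTML.Tag.P && inParagraph) {
            inParagraph = false;  // stop collecting and store the paragraph
            paragraphs.add(current.toString());
        }
    }

    // Hypothetical helper: parses the input and returns the text of each <p>.
    static List<String> extract(Reader reader) throws IOException {
        ParagraphCallback callback = new ParagraphCallback();
        new ParserDelegator().parse(reader, callback, true);
        return callback.paragraphs;
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><body><h1>Title</h1><p>First paragraph.</p>"
                + "<p>Second <b>bold</b> one.</p></body></html>";
        for (String p : extract(new StringReader(html))) {
            System.out.println("* " + p);
        }
    }
}
```

To process a downloaded page instead of the inline sample, pass a FileReader for the file to extract and write each returned string to the output file.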

Jericho is one of several HTML parsers that could make this task easy and safe.

JTidy can represent an HTML document (even a malformed one) as a document model, which makes extracting the contents of a <p> tag considerably more elegant than processing the raw text by hand.