html groovy xmlslurper

How can I stop Groovy's XmlSlurper from refusing to parse HTML because of DOCTYPE and DTD restrictions?




I am trying to copy an element in an HTML coverage report, so that the coverage totals appear at the top of the report as well as at the bottom.

The HTML starts like this, and I believe it is well-formed:

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
    <head>
        <meta http-equiv="Content-Type" content="text/html;charset=UTF-8" />
        <link rel="stylesheet" href=".resources/report.css" type="text/css" />
        <link rel="shortcut icon" href=".resources/report.gif" type="image/gif" />
        <title>Unified coverage</title>
        <script type="text/javascript" src=".resources/sort.js"></script>
    </head>
    <body onload="initialSort(['breadcrumb', 'coveragetable'])">

Groovy's XmlSlurper complains as follows:

    doc = new XmlSlurper( /* false, false, false */ ).parse("index.html")
    [Fatal Error] index.html:1:48: DOCTYPE is disallowed when the feature "http://apache.org/xml/features/disallow-doctype-decl" set to true.
    DOCTYPE is disallowed when the feature "http://apache.org/xml/features/disallow-doctype-decl" set to true.

Enabling DOCTYPE:

    doc = new XmlSlurper(false, false, true).parse("index.html")
    [Fatal Error] index.html:1:148: External DTD: Failed to read external DTD 'xhtml1-strict.dtd', because 'http' access is not allowed due to restriction set by the accessExternalDTD property.
    External DTD: Failed to read external DTD 'xhtml1-strict.dtd', because 'http' access is not allowed due to restriction set by the accessExternalDTD property.

    doc = new XmlSlurper(false, true, true).parse("index.html")
    [Fatal Error] index.html:1:148: External DTD: Failed to read external DTD 'xhtml1-strict.dtd', because 'http' access is not allowed due to restriction set by the accessExternalDTD property.
    External DTD: Failed to read external DTD 'xhtml1-strict.dtd', because 'http' access is not allowed due to restriction set by the accessExternalDTD property.

    doc = new XmlSlurper(true, true, true).parse("index.html")
    External DTD: Failed to read external DTD 'xhtml1-strict.dtd', because 'http' access is not allowed due to restriction set by the accessExternalDTD property.

    doc = new XmlSlurper(true, false, true).parse("index.html")
    External DTD: Failed to read external DTD 'xhtml1-strict.dtd', because 'http' access is not allowed due to restriction set by the accessExternalDTD property.

So I think I have covered all the constructor options. There must be a way to make this work without resorting to regular expressions and risking the wrath of Tony The Pony.


Even though your HTML also happens to be well-formed XML, a more general solution for parsing HTML is to use a real HTML parser. I have used the TagSoup parser in the past, and it handles real-world HTML quite well.

TagSoup provides a parser that implements the org.xml.sax.XMLReader interface, so it can be passed to XmlSlurper in the constructor. Example:

    @Grab('org.ccil.cowan.tagsoup:tagsoup:1.2.1')
    import org.ccil.cowan.tagsoup.Parser

    def doc = new XmlSlurper(new Parser()).parse("index.html")


Tsk.

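    // Allow the DOCTYPE declaration, but do not fetch the external DTD it points to.
    parser = new XmlSlurper()
    parser.setFeature("http://apache.org/xml/features/disallow-doctype-decl", false)
    parser.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false)
    parser.parse(it)  // "it" is the file being processed by the enclosing closure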
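
For anyone who wants to try the two feature switches without a report file at hand, here is a minimal, self-contained sketch. It assumes Groovy 3 or later (on Groovy 2.x, drop the import, since groovy.util.XmlSlurper is imported by default) and the JDK's built-in Xerces-based parser. The inline document and the `total` id are made up for illustration and are not taken from the real coverage report.

```groovy
import groovy.xml.XmlSlurper

// Hypothetical stand-in for index.html: same XML declaration and DOCTYPE as in the question.
def xhtml = '''<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
  <head><title>Unified coverage</title></head>
  <body><p id="total">Total</p></body>
</html>'''

def parser = new XmlSlurper()
// Accept the DOCTYPE declaration instead of failing on it.
parser.setFeature("http://apache.org/xml/features/disallow-doctype-decl", false)
// Do not try to download the external DTD named in the DOCTYPE.
parser.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false)

def doc = parser.parseText(xhtml)

// The root node is <html>, so GPath expressions start below it.
assert doc.head.title.text() == 'Unified coverage'
assert doc.body.p.find { it.@id == 'total' }.text() == 'Total'
```

Because the DTD is never loaded, entities that XHTML defines there (such as `&nbsp;`) would presumably not resolve, so this approach fits reports that do not rely on them.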