
Can you provide examples of parsing HTML?

How do you parse HTML with a variety of languages and parsing libraries?

When answering:

Individual comments will be linked to in answers to questions about how to parse HTML with regexes, as a way of showing the right way to do things.

For the sake of consistency, I request that the example parse an HTML file for the href in anchor tags. To make this question easy to search, please follow this format:

Language: [language name]

Library: [library name]

[example code]

Please make the library name a link to the library's documentation. If you want to provide an example other than extracting links, please also include:

Purpose: [what the parse does]

Language: Perl
Library: HTML::LinkExtor

The beauty of Perl is that you have modules for very specific tasks. Like link extraction.

Whole program:

#!/usr/bin/perl -w
use strict;

use HTML::LinkExtor;
use LWP::Simple;

my $url     = 'http://www.google.com/';
my $content = get( $url );

my $p = HTML::LinkExtor->new( \&process_link, $url, );
$p->parse( $content );

exit;

sub process_link {
    my ( $tag, %attr ) = @_;

    return unless $tag eq 'a';
    return unless defined $attr{ 'href' };

    print "- $attr{'href'}\n";
    return;
}

Explanation:

use strict - turns on "strict" mode, which makes potential bugs easier to track down; not really relevant to the example
use HTML::LinkExtor - loads the interesting module
use LWP::Simple - just a simple way to get some HTML for testing
my $url = 'http://www.google.com/' - the page we will extract URLs from
my $content = get( $url ) - fetches the page's HTML
my $p = HTML::LinkExtor->new( \&process_link, $url ) - creates the LinkExtor object, giving it a reference to the function that will be used as a callback for every URL, and $url to use as the base URL for relative URLs
$p->parse( $content ) - pretty obvious, I guess
exit - end of the program
sub process_link - beginning of the function process_link
my ( $tag, %attr ) = @_ - gets the arguments, which are the tag name and its attributes
return unless $tag eq 'a' - skips processing if the tag is not <a>
return unless defined $attr{ 'href' } - skips processing if the <a> tag doesn't have an href attribute
print "- $attr{'href'}\n"; - pretty obvious, I guess :)
return; - finishes the function

That's all.
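
For readers who don't use Perl, the same idea (a callback invoked per tag, plus a base URL for resolving relative links) can be sketched in Python using only the standard library. This is an illustration rather than part of the original answer: the class name `LinkCollector` and the sample HTML are made up, and `urljoin` stands in for the role the base URL argument plays in HTML::LinkExtor.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects <a href> values, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Called once per start tag, much like the process_link callback.
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href is None:
            return  # <a> without a usable href attribute
        self.links.append(urljoin(self.base_url, href))


# Made-up sample input: one relative link, one anchor without href, one absolute link.
sample = (
    '<a href="/about">About</a>'
    '<a name="top">no href</a>'
    '<a href="http://example.org/">External</a>'
)
collector = LinkCollector("http://www.example.com/")
collector.feed(sample)
for link in collector.links:
    print("-", link)
```

Collecting the links in a list instead of printing inside the callback makes the result easier to reuse, but that is a design choice of this sketch, not something the Perl module requires.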

Language: Clojure
Library: Enlive (a selector-based (à la CSS) templating and transformation system for Clojure)

Selector expression:

(def test-select (html/select (html/html-resource (java.io.StringReader. test-html)) [:a]))

Now we can do the following at the REPL (I've added line breaks in the test-select output):

user> test-select
({:tag :a, :attrs {:href "http://foo.com/"}, :content ["foo"]}
 {:tag :a, :attrs {:href "http://bar.com/"}, :content ["bar"]}
 {:tag :a, :attrs {:href "http://baz.com/"}, :content ["baz"]})
user> (map #(get-in % [:attrs :href]) test-select)
("http://foo.com/" "http://bar.com/" "http://baz.com/")

You'll need the following to try it out:

Preamble:

(require '[net.cgrand.enlive-html :as html])

Test HTML:

(def test-html
  (apply str (concat ["<html><body>"]
                     (for [link ["foo" "bar" "baz"]]
                       (str "<a href=\"http://" link ".com/\">" link "</a>"))
                     ["</body></html>"])))

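Language: JavaScript
Library: DOM

var links = document.links;
// Index-based loop: for...in would also visit non-element properties such as length.
for (var i = 0; i < links.length; i++) {
    var href = links[i].href;
    if (href != null) console.debug(href);
}

(using Firebug's console.debug for output...)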

Language: JavaScript
Library: jQuery

$.each($('a[href]'), function(){
    console.debug(this.href);
});

(using Firebug's console.debug for output...)

And loading any HTML page:

$.get('http://.com/', function(page){
    $(page).find('a[href]').each(function(){
        console.debug(this.href);
    });
});

I used a different each function for this one; I think it's cleaner when chaining methods.

Language: Objective-C
Library: libxml2 + Matt Gallagher's libxml2 wrappers + Ben Copsey's ASIHTTPRequest

ASIHTTPRequest *request = [[ASIHTTPRequest alloc] initWithURL:[NSURL URLWithString:@"http://.com/questions/773340"]];
[request start];
NSError *error = [request error];
if (!error) {
    NSData *response = [request responseData];
    NSLog(@"Data: %@", [[self query:@"//a[@href]" withResponse:response] description]);
    [request release];
}
else
    @throw [NSException exceptionWithName:@"kMyHTTPRequestFailed" reason:@"Request failed!" userInfo:nil];

...

// Selector name matches the call above (query:withResponse:).
- (id) query:(NSString *)xpathQuery withResponse:(NSData *)resp {
    NSArray *nodes = PerformHTMLXPathQuery(resp, xpathQuery);
    if (nodes != nil)
        return nodes;
    return nil;
}

Language: Python
Library: HTQL

import htql

page = "<a href=a.html>1</a><a href=b.html>2</a><a href=c.html>3</a>"
query = "<a>:href,tx"
for url, text in htql.HTQL(page, query):
    print url, text

Simple and intuitive.

Language: Racket

Library: (planet ashinn/html-parser:1) and (planet clements/sxml2:1)

(require net/url
         (planet ashinn/html-parser:1)
         (planet clements/sxml2:1))
(define the-url (string->url "http://.com/"))
(define doc (call/input-url the-url get-pure-port html->sxml))
(define links ((sxpath "//a/@href/text()") doc))

The same example using packages from the new package system, html-parsing and sxml:

(require net/url
         html-parsing
         sxml)
(define the-url (string->url "http://.com/"))
(define doc (call/input-url the-url get-pure-port html->xexp))
(define links ((sxpath "//a/@href/text()") doc))

Note: Install the required packages with 'raco' from a command line, with:

raco pkg install html-parsing

raco pkg install sxml

Language: PHP
Library: DOM

<?php
$doc = new DOMDocument();
$doc->strictErrorChecking = false;
$doc->loadHTMLFile('http://.com/questions/773340');
$xpath = new DOMXpath($doc);

$links = $xpath->query('//a[@href]');
for ($i = 0; $i < $links->length; $i++)
    echo $links->item($i)->getAttribute('href'), "\n";

Sometimes it's useful to put the @ symbol before $doc->loadHTMLFile to suppress warnings about invalid HTML.

Language: C#
Library: System.XML (standard .NET)

using System.Collections.Generic;
using System.Xml;

// Wrapper class added so the two methods compile as a program.
public class Program
{
    public static void Main(string[] args)
    {
        List<string> matches = new List<string>();

        XmlDocument xd = new XmlDocument();
        xd.LoadXml("<html>...</html>");

        FindHrefs(xd.FirstChild, matches);
    }

    // Recursively collects the href attribute of every node that has one.
    static void FindHrefs(XmlNode xn, List<string> matches)
    {
        if (xn.Attributes != null && xn.Attributes["href"] != null)
            matches.Add(xn.Attributes["href"].InnerXml);

        foreach (XmlNode child in xn.ChildNodes)
            FindHrefs(child, matches);
    }
}

Language: C#
Library: HtmlAgilityPack

using System;
using HtmlAgilityPack;

class Program
{
    static void Main(string[] args)
    {
        var web = new HtmlWeb();
        var doc = web.Load("http://www..com");

        var nodes = doc.DocumentNode.SelectNodes("//a[@href]");

        foreach (var node in nodes)
        {
            Console.WriteLine(node.InnerHtml);
        }
    }
}

Language: ColdFusion 9.0.1+

Library: jSoup

<cfscript>
function parseURL(required string url){
    var res = [];
    var javaLoader = createObject("javaloader.JavaLoader").init([expandPath("./jsoup-1.7.3.jar")]);
    var jSoupClass = javaLoader.create("org.jsoup.Jsoup");
    //var dom = jSoupClass.parse(html); // if you already have some html to parse.
    var dom = jSoupClass.connect( arguments.url ).get();
    var links = dom.select("a");
    for(var a=1;a LT arrayLen(links);a++){
        var s={};s.href= links[a].attr('href'); s.text= links[a].text();
        if(s.href contains "http://" || s.href contains "https://") arrayAppend(res,s);
    }
    return res;
}

//writeoutput(writedump(parseURL(url)));
</cfscript>
<cfdump var="#parseURL("http://.com/questions/773340/can-you-provide-examples-of-parsing-html")#">

Returns an array of structs; each struct contains an HREF and a TEXT key.
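
The shape of that result (one record per link, holding the href and the link text, restricted to http/https URLs) can be illustrated in Python with the standard library. This is only a sketch under assumptions: the class name `AnchorRecords` and the sample HTML are invented here, and the filter uses a prefix test, which is slightly stricter than the `contains` check in the ColdFusion code.

```python
from html.parser import HTMLParser


class AnchorRecords(HTMLParser):
    """Hypothetical helper: builds {'href': ..., 'text': ...} records for http(s) links."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._current = None  # record for the <a> currently being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current = {"href": dict(attrs).get("href") or "", "text": ""}

    def handle_data(self, data):
        # Accumulate text, including text inside nested tags such as <b>.
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            record, self._current = self._current, None
            # Keep only absolute http(s) links.
            if record["href"].startswith(("http://", "https://")):
                self.records.append(record)


parser = AnchorRecords()
parser.feed(
    '<a href="http://foo.com/">foo</a>'
    '<a href="/local">skip</a>'
    '<a href="https://bar.com/">b<b>a</b>r</a>'
)
print(parser.records)
```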

Language: Java
Library: jsoup

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class HtmlTest {
    // Jsoup.parse(String) declares no checked exceptions, so no throws clause is needed.
    public static void main(final String[] args) {
        final Document document = Jsoup.parse("<html><body><ul><li><a href=\"http://google.com\">google</li><li><a HREF=\"http://reddit.org\" target=\"_blank\">reddit</a></li><li><a name=\"nothing\">nothing</a><li></ul></body></html>");
        final Elements links = document.select("a[href]");
        for (final Element element : links) {
            System.out.println(element.attr("href"));
        }
    }
}

Language: Java
Libraries: XOM, TagSoup

I've intentionally included malformed and inconsistent XML in this sample.

import java.io.IOException;

import nu.xom.Builder;
import nu.xom.Document;
import nu.xom.Element;
import nu.xom.Node;
import nu.xom.Nodes;
import nu.xom.ParsingException;
import nu.xom.ValidityException;

import org.ccil.cowan.tagsoup.Parser;
import org.xml.sax.SAXException;

public class HtmlTest {
    public static void main(final String[] args) throws SAXException, ValidityException, ParsingException, IOException {
        final Parser parser = new Parser();
        parser.setFeature(Parser.namespacesFeature, false);
        final Builder builder = new Builder(parser);
        final Document document = builder.build("<html><body><ul><li><a href=\"http://google.com\">google</li><li><a HREF=\"http://reddit.org\" target=\"_blank\">reddit</a></li><li><a name=\"nothing\">nothing</a><li></ul></body></html>", null);
        final Element root = document.getRootElement();
        final Nodes links = root.query("//a[@href]");
        for (int linkNumber = 0; linkNumber < links.size(); ++linkNumber) {
            final Node node = links.get(linkNumber);
            System.out.println(((Element) node).getAttributeValue("href"));
        }
    }
}

By default, TagSoup adds an XML namespace referencing XHTML to the document. I've chosen to suppress that in this sample. Using the default behavior would require the call to root.query to include a namespace, like so:

root.query("//xhtml:a[@href]", new nu.xom.XPathContext("xhtml", root.getNamespaceURI()))
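
The effect of that namespace on path queries is not specific to XOM. As a rough illustration, here is a Python sketch using the standard library's xml.etree.ElementTree; the sample document is made up, and unlike TagSoup this parser only accepts well-formed XML, so it demonstrates the namespace point rather than tag-soup handling.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Made-up, well-formed sample whose elements all live in the XHTML namespace.
doc = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<a href="http://google.com">google</a>'
    '<a name="nothing">nothing</a>'
    "</body></html>"
)
root = ET.fromstring(doc)

# An unprefixed query matches nothing, because "a" is not in the XHTML namespace.
unqualified = root.findall(".//a[@href]")

# Binding a prefix to the namespace URI makes the same query match.
qualified = root.findall(".//xhtml:a[@href]", {"xhtml": XHTML_NS})

print(len(unqualified))
print([a.get("href") for a in qualified])
```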

Language: JavaScript / Node.js

Library: Request and Cheerio

var request = require('request');
var cheerio = require('cheerio');

var url = "https://news.ycombinator.com/";
request(url, function (error, response, html) {
    if (!error && response.statusCode == 200) {
        var $ = cheerio.load(html);
        var anchorTags = $('a');

        anchorTags.each(function(i,element){
            console.log(element["attribs"]["href"]);
        });
    }
});

The Request library downloads the HTML document, and Cheerio lets you use jQuery-style CSS selectors to target elements in it.

Language: Common Lisp
Library: Closure HTML, Closure XML, CL-WHO

(shown using the DOM API, without using the XPATH or STP API)

(defvar *html*
  (who:with-html-output-to-string (stream)
    (:html
     (:body (loop
              for site in (list "foo" "bar" "baz")
              do (who:htm (:a :href (format nil "http://~A.com/" site))))))))

(defvar *dom*
  (chtml:parse *html* (cxml-dom:make-dom-builder)))

(loop
  for tag across (dom:get-elements-by-tag-name *dom* "a")
  collect (dom:get-attribute tag "href"))
=>
("http://foo.com/" "http://bar.com/" "http://baz.com/")

Language: PHP
Library: SimpleXML (and DOM)

<?php
$page = new DOMDocument();
$page->strictErrorChecking = false;
$page->loadHTMLFile('http://.com/questions/773340');
$xml = simplexml_import_dom($page);

$links = $xml->xpath('//a[@href]');
foreach($links as $link)
    echo $link['href']."\n";

Language: Perl
Library: HTML::Parser
Purpose: How can I remove unused, nested HTML span tags with a Perl regex?

Language: Perl
Library: pQuery

use strict;
use warnings;
use pQuery;

my $html = join '',
    "<html><body>",
    (map { qq(<a href="http://$_.com">$_</a>) } qw/foo bar baz/),
    "</body></html>";

pQuery( $html )->find( 'a' )->each(
    sub {
        my $at = $_->getAttribute( 'href' );
        print "$at\n" if defined $at;
    }
);

Language: Perl
Library: HTML::TreeBuilder

use strict;
use HTML::TreeBuilder;
use LWP::Simple;

my $content = get 'http://www..com';
my $document = HTML::TreeBuilder->new->parse($content)->eof;

for my $a ($document->find('a')) {
    print $a->attr('href'), "\n" if $a->attr('href');
}

Language: Ruby
Library: Nokogiri

#!/usr/bin/env ruby
require 'nokogiri'
require 'open-uri'

document = Nokogiri::HTML(open("http://google.com"))
document.css("html head title").first.content
=> "Google"
document.xpath("//title").first.content
=> "Google"

Using phantomjs, save this file as extract-links.js:

var page = new WebPage(),
    url = 'http://www.udacity.com';

page.open(url, function (status) {
    if (status !== 'success') {
        console.log('Unable to access network');
    } else {
        var results = page.evaluate(function() {
            var list = document.querySelectorAll('a'), links = [], i;
            for (i = 0; i < list.length; i++) {
                links.push(list[i].href);
            }
            return links;
        });
        console.log(results.join('\n'));
    }
    phantom.exit();
});

Run:

$ ../path/to/bin/phantomjs extract-links.js

Language: Perl
Library: HTML::Parser

#!/usr/bin/perl

use strict;
use warnings;

use HTML::Parser;

my $find_links = HTML::Parser->new(
    start_h => [
        sub {
            my ($tag, $attr) = @_;
            if ($tag eq 'a' and exists $attr->{href}) {
                print "$attr->{href}\n";
            }
        },
        "tag, attr"
    ]
);

my $html = join '',
    "<html><body>",
    (map { qq(<a href="http://$_.com">$_</a>) } qw/foo bar baz/),
    "</body></html>";

$find_links->parse($html);

Language: Perl
Library: XML::Twig

#!/usr/bin/perl
use strict;
use warnings;
use Encode ':all';

use LWP::Simple;
use XML::Twig;

#my $url = 'http://.com/questions/773340/can-you-provide-an-example-of-parsing-html-with-your-favorite-parser';
my $url = 'http://www.google.com';
my $content = get($url);
die "Couldn't fetch!" unless defined $content;

my $twig = XML::Twig->new();
$twig->parse_html($content);

my @hrefs = map {
    $_->att('href');
} $twig->get_xpath('//*[@href]');

print "$_\n" for @hrefs;

Caveat: you can get wide-character errors with pages like this one (switching the URL to the commented-out one triggers the error), but the HTML::Parser solution above doesn't share this problem.

Language: Ruby
Library: Nokogiri

#!/usr/bin/env ruby

require "nokogiri"
require "open-uri"

doc = Nokogiri::HTML(open('http://www.example.com'))
hrefs = doc.search('a').map{ |n| n['href'] }

puts hrefs

Which outputs:

/
/domains/
/numbers/
/protocols/
/about/
/go/rfc2606
/about/
/about/presentations/
/about/performance/
/reports/
/domains/
/domains/root/
/domains/int/
/domains/arpa/
/domains/idn-tables/
/protocols/
/numbers/
/abuse/
http://www.icann.org/
mailto:iana@iana.org?subject=General%20website%20feedback

This is a minor twist on the one above, producing output that can be used in a report. I return only the first and last elements in the list of hrefs:

#!/usr/bin/env ruby

require "nokogiri"
require "open-uri"

doc = Nokogiri::HTML(open('http://nokogiri.org'))
hrefs = doc.search('a[href]').map{ |n| n['href'] }

puts hrefs
  .each_with_index                     # add an array index
  .minmax{ |a,b| a.last <=> b.last }   # find the first and last element
  .map{ |h,i| '%3d %s' % [1 + i, h ] } # format the output

  1 http://github.com/tenderlove/nokogiri
100 http://yokolet.blogspot.com
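
The "number the links, then keep only the first and last" report step is independent of the parser, so it can be sketched on its own in Python. The function name `first_and_last` and the sample URLs below are invented for illustration; the output format mirrors the '%3d %s' pattern used in the Ruby code.

```python
# Made-up sample data standing in for a list of extracted hrefs.
hrefs = ["http://a.example/", "http://b.example/", "http://c.example/"]


def first_and_last(items):
    """Return formatted 'NNN item' lines for the first and last entries (1-based)."""
    if not items:
        return []
    picks = [(1, items[0])]
    if len(items) > 1:
        picks.append((len(items), items[-1]))
    return ["%3d %s" % (number, item) for number, item in picks]


for line in first_and_last(hrefs):
    print(line)
```

Indexing the list directly avoids sorting or comparing every element; for a one-element list this sketch emits a single line.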

Language: Ruby
Library: Hpricot

#!/usr/bin/ruby

require 'hpricot'

html = '<html><body>'
['foo', 'bar', 'baz'].each {|link| html += "<a href=\"http://#{link}.com\">#{link}</a>" }
html += '</body></html>'

doc = Hpricot(html)
doc.search('//a').each {|elm| puts elm.attributes['href'] }

Language: shell
Library: lynx (well, it's not a library, but in shell, every program is a kind of library)

lynx -dump -listonly http://news.google.com/

Language: Python
Library: lxml.html

import lxml.html

html = "<html><body>"
for link in ("foo", "bar", "baz"):
    html += '<a href="http://%s.com">%s</a>' % (link, link)
html += "</body></html>"

tree = lxml.html.document_fromstring(html)
for element, attribute, link, pos in tree.iterlinks():
    if attribute == "href":
        print link

lxml also has a CSS selector class for traversing the DOM, which can make using it very similar to using jQuery:

for a in tree.cssselect('a[href]'):
    print a.get('href')

Language: Python
Library: HTMLParser

#!/usr/bin/python

from HTMLParser import HTMLParser

class FindLinks(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)

    def handle_starttag(self, tag, attrs):
        at = dict(attrs)
        if tag == 'a' and 'href' in at:
            print at['href']


find = FindLinks()

html = "<html><body>"
for link in ("foo", "bar", "baz"):
    html += '<a href="http://%s.com">%s</a>' % (link, link)
html += "</body></html>"

find.feed(html)
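
The example above is written for Python 2. In Python 3 the module is named html.parser and print is a function, so a port might look like the sketch below. The `found` list is an addition made here so the results can be inspected afterwards; everything else follows the original example.

```python
from html.parser import HTMLParser  # Python 3 name of the Python 2 HTMLParser module


class FindLinks(HTMLParser):
    def __init__(self):
        super().__init__()
        self.found = []  # added in this sketch; not part of the original example

    def handle_starttag(self, tag, attrs):
        at = dict(attrs)
        if tag == "a" and "href" in at:
            self.found.append(at["href"])
            print(at["href"])


find = FindLinks()

html = "<html><body>"
for link in ("foo", "bar", "baz"):
    html += '<a href="http://%s.com">%s</a>' % (link, link)
html += "</body></html>"

find.feed(html)
```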

Language: Python
Library: BeautifulSoup

from BeautifulSoup import BeautifulSoup

html = "<html><body>"
for link in ("foo", "bar", "baz"):
    html += '<a href="http://%s.com">%s</a>' % (link, link)
html += "</body></html>"

soup = BeautifulSoup(html)
links = soup.findAll('a', href=True) # find <a> with a defined href attribute
print links

Output:

[<a href="http://foo.com">foo</a>, <a href="http://bar.com">bar</a>, <a href="http://baz.com">baz</a>]

Also possible:

for link in links:
    print link['href']

Output:

http://foo.com
http://bar.com
http://baz.com