c# - parser - htmlagilitypack nuget

Cómo usar el paquete de agilidad HTML (6)

Mi documento XHTML no es completamente válido. Por eso quise usarlo. ¿Cómo lo uso en mi proyecto? Mi proyecto está en C #.

Comenzando - HTML Agility Pack

// From File var doc = new HtmlDocument(); doc.Load(filePath); // From String var doc = new HtmlDocument(); doc.LoadHtml(html); // From Web var url = "http://html-agility-pack.net/"; var web = new HtmlWeb(); var doc = web.Load(url);

El código principal relacionado con HTMLAgilityPack es el siguiente

using System; using System.Net; using System.Web; using System.Web.Services; using System.Web.Script.Services; using System.Text.RegularExpressions; using HtmlAgilityPack; namespace GetMetaData { /// <summary> /// Summary description for MetaDataWebService /// </summary> [WebService(Namespace = "http://tempuri.org/")] [WebServiceBinding(ConformsTo = WsiProfiles.BasicProfile1_1)] [System.ComponentModel.ToolboxItem(false)] // To allow this Web Service to be called from script, using ASP.NET AJAX, uncomment the following line. [System.Web.Script.Services.ScriptService] public class MetaDataWebService: System.Web.Services.WebService { [WebMethod] [ScriptMethod(UseHttpGet = false)] public MetaData GetMetaData(string url) { MetaData objMetaData = new MetaData(); //Get Title WebClient client = new WebClient(); string sourceUrl = client.DownloadString(url); objMetaData.PageTitle = Regex.Match(sourceUrl, @ "/<title/b[^>]*/>/s*(?<Title>[/s/S]*?)/</title/>", RegexOptions.IgnoreCase).Groups["Title"].Value; //Method to get Meta Tags objMetaData.MetaDescription = GetMetaDescription(url); return objMetaData; } private string GetMetaDescription(string url) { string description = string.Empty; //Get Meta Tags var webGet = new HtmlWeb(); var document = webGet.Load(url); var metaTags = document.DocumentNode.SelectNodes("//meta"); if (metaTags != null) { foreach(var tag in metaTags) { if (tag.Attributes["name"] != null && tag.Attributes["content"] != null && tag.Attributes["name"].Value.ToLower() == "description") { description = tag.Attributes["content"].Value; } } } else { description = string.Empty; } return description; } } }

HtmlAgilityPack usa la sintaxis XPath, y aunque muchos argumentan que está mal documentado, no tuve problemas para usarlo con la ayuda de esta documentación de XPath: https://www.w3schools.com/xml/xpath_syntax.asp

Analizar

<h2> <a href="">Jack</a> </h2> <ul> <li class="tel"> <a href="">81 75 53 60</a> </li> </ul> <h2> <a href="">Roy</a> </h2> <ul> <li class="tel"> <a href="">44 52 16 87</a> </li> </ul>

Hice esto:

string url = "http://website.com"; var Webget = new HtmlWeb(); var doc = Webget.Load(url); foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//h2//a")) { names.Add(node.ChildNodes[0].InnerHtml); } foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//li[@class=''tel'']//a")) { phones.Add(node.ChildNodes[0].InnerHtml); }

No sé si esto será de alguna ayuda para usted, pero he escrito un par de artículos que presentan los conceptos básicos.

El siguiente artículo está completo en un 95%, solo tengo que escribir explicaciones de las últimas partes del código que he escrito. Si estás interesado, intentaré recordar publicar aquí cuando lo publique.

Primero, instale el paquete HTMLAgilityPack HTMLAgilityPack en su proyecto.

Entonces, como ejemplo:

HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument(); // There are various options, set as needed htmlDoc.OptionFixNestedTags=true; // filePath is a path to a file containing the html htmlDoc.Load(filePath); // Use: htmlDoc.LoadHtml(xmlString); to load from a string (was htmlDoc.LoadXML(xmlString) // ParseErrors is an ArrayList containing any errors from the Load statement if (htmlDoc.ParseErrors != null && htmlDoc.ParseErrors.Count() > 0) { // Handle any parse errors as required } else { if (htmlDoc.DocumentNode != null) { HtmlAgilityPack.HtmlNode bodyNode = htmlDoc.DocumentNode.SelectSingleNode("//body"); if (bodyNode != null) { // Do something with bodyNode } } }

(Nota: este código es solo un ejemplo y no es necesariamente el mejor / el único enfoque. No lo use a ciegas en su propia aplicación).

El método HtmlDocument.Load() también acepta un flujo que es muy útil en la integración con otras clases orientadas al flujo en el marco .NET. Mientras que HtmlEntity.DeEntitize() es otro método útil para procesar entidades html correctamente. (gracias Matthew)

HtmlDocument y HtmlNode son las clases que más utilizará. Similar a un analizador XML, proporciona los métodos selectSingleNode y selectNodes que aceptan expresiones XPath.

Preste atención al HtmlDocument.Option?????? propiedades booleanas. Estos controlan cómo los métodos Load y LoadXML procesarán su HTML / XHTML.

También hay un archivo de ayuda compilado llamado HtmlAgilityPack.chm que tiene una referencia completa para cada uno de los objetos. Esto normalmente está en la carpeta base de la solución.

public string HtmlAgi(string url, string key) { var Webget = new HtmlWeb(); var doc = Webget.Load(url); HtmlNode ourNode = doc.DocumentNode.SelectSingleNode(string.Format("//meta[@name=''{0}'']", key)); if (ourNode != null) { return ourNode.GetAttributeValue("content", ""); } else { return "not fount"; } }