c# - parser - HTML Agility Pack: remove unwanted tags without removing content?

I have seen some related questions here, but they do not address exactly the problem I am facing.

I want to use HTML Agility Pack to remove unwanted tags from my HTML without losing the content inside those tags.

For example, in my scenario I would like to keep the "b", "i" and "u" tags.

And for an input like:

my paragraph <div>and my div</div> are italic and bold

The resulting HTML should be:

my paragraph and my div are italic and bold

I tried using the HtmlNode Remove method, but it removes my content as well. Any suggestions?

How to recursively remove a given list of unwanted HTML tags from an HTML string

Building on @mathias's answer, I improved his extension method so that you can supply the tags to exclude as a List<string> (for example, {"a","p","hr"}). I also fixed the logic so that it works recursively:

    // Requires: using System; using System.Collections.Generic; using System.Linq; using HtmlAgilityPack;
    // As an extension method, this must be declared inside a static class.
    public static string RemoveUnwantedHtmlTags(this string html, List<string> unwantedTags)
    {
        if (String.IsNullOrEmpty(html))
        {
            return html;
        }

        var document = new HtmlDocument();
        document.LoadHtml(html);

        // Top-level elements and text nodes; SelectNodes returns null when nothing matches.
        HtmlNodeCollection tryGetNodes = document.DocumentNode.SelectNodes("./*|./text()");

        if (tryGetNodes == null || !tryGetNodes.Any())
        {
            return html;
        }

        var nodes = new Queue<HtmlNode>(tryGetNodes);

        while (nodes.Count > 0)
        {
            var node = nodes.Dequeue();
            var parentNode = node.ParentNode;

            var childNodes = node.SelectNodes("./*|./text()");

            // Always queue the children, so tags nested inside kept tags are visited too.
            if (childNodes != null)
            {
                foreach (var child in childNodes)
                {
                    nodes.Enqueue(child);
                }
            }

            if (unwantedTags.Any(tag => tag == node.Name))
            {
                // Move the children up in front of the unwanted node, then drop the node itself.
                if (childNodes != null)
                {
                    foreach (var child in childNodes)
                    {
                        parentNode.InsertBefore(child, node);
                    }
                }

                parentNode.RemoveChild(node);
            }
        }

        return document.DocumentNode.InnerHtml;
    }

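Before removing a node, get its parent and its InnerText, then remove the node and put the InnerText back into the parent as a text node.

    var parent = node.ParentNode;
    // Text of the node being removed; InnerText drops any markup nested inside it.
    var innerText = node.InnerText;
    // Swap the node for a plain text node so the text stays in the same position.
    parent.ReplaceChild(doc.CreateTextNode(innerText), node);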

I wrote an algorithm based on Oded's suggestions. Here it is. It works like a charm.

It removes all tags except strong, em, u and raw text nodes.

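    // Requires: using System; using System.Collections.Generic; using System.Linq; using HtmlAgilityPack;
    internal static string RemoveUnwantedTags(string data)
    {
        if (string.IsNullOrEmpty(data)) return string.Empty;

        var document = new HtmlDocument();
        document.LoadHtml(data);

        var acceptableTags = new String[] { "strong", "em", "u" };

        // SelectNodes returns null when nothing matches, so guard before building the queue.
        var topLevelNodes = document.DocumentNode.SelectNodes("./*|./text()");
        if (topLevelNodes == null) return data;

        var nodes = new Queue<HtmlNode>(topLevelNodes);
        while (nodes.Count > 0)
        {
            var node = nodes.Dequeue();
            var parentNode = node.ParentNode;

            // Children of accepted tags are not queued, so unwanted tags nested inside them survive.
            if (!acceptableTags.Contains(node.Name) && node.Name != "#text")
            {
                var childNodes = node.SelectNodes("./*|./text()");

                if (childNodes != null)
                {
                    foreach (var child in childNodes)
                    {
                        nodes.Enqueue(child);
                        parentNode.InsertBefore(child, node);
                    }
                }

                parentNode.RemoveChild(node);
            }
        }

        return document.DocumentNode.InnerHtml;
    }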

Try the following; you may find it a bit neater than the other proposed solutions:

    // Requires: using HtmlAgilityPack; extension methods must be declared inside a static class.

    // Unwraps every node matched by xPath and returns how many nodes were unwrapped.
    public static int RemoveNodesButKeepChildren(this HtmlNode rootNode, string xPath)
    {
        HtmlNodeCollection nodes = rootNode.SelectNodes(xPath);
        if (nodes == null)
            return 0;
        foreach (HtmlNode node in nodes)
            node.RemoveButKeepChildren();
        return nodes.Count;
    }

    // Moves the node's children up in front of it, then removes the node itself.
    public static void RemoveButKeepChildren(this HtmlNode node)
    {
        foreach (HtmlNode child in node.ChildNodes)
            node.ParentNode.InsertBefore(child, node);
        node.Remove();
    }

    public static bool TestYourSpecificExample()
    {
        string html = "my paragraph <div>and my div</div> are italic and bold";
        HtmlDocument document = new HtmlDocument();
        document.LoadHtml(html);
        document.DocumentNode.RemoveNodesButKeepChildren("//div");
        document.DocumentNode.RemoveNodesButKeepChildren("//p");
        return document.DocumentNode.InnerHtml == "my paragraph and my div are italic and bold";
    }

If you do not want to use HTML Agility Pack and still want to remove unwanted HTML tags, you can do it as shown below.

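    // Requires: using System; using System.Text.RegularExpressions; using System.Web;
    public static string RemoveHtmlTags(string strHtml)
    {
        // Strip every tag (not just selected ones), including tags that span several lines.
        string strText = Regex.Replace(strHtml, @"<(.|\n)*?>", String.Empty);
        strText = HttpUtility.HtmlDecode(strText);
        // Collapse each run of whitespace into a single space.
        strText = Regex.Replace(strText, @"\s+", " ");
        return strText;
    }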
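The regex approach above strips every tag, whereas the question asks to keep "b", "i" and "u". A rough sketch of an allow-list variant follows. The names TagStripper and StripExceptBasicFormatting are made up for illustration, and the sample markup in the usage note is an assumption, not the asker's exact input. A regular expression is not an HTML parser, so comments, scripts, or attribute values containing ">" may well trip it up.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of HTML Agility Pack or the .NET libraries.
public static class TagStripper
{
    // Matches an opening or closing tag whose name is anything other than b, i or u.
    // The negative lookahead rejects "b", "i" or "u" when followed by whitespace, "/" or ">".
    private static readonly Regex UnwantedTag = new Regex(
        @"</?(?!(?:b|i|u)[\s/>])[a-zA-Z][^>]*>",
        RegexOptions.IgnoreCase);

    // Removes the unwanted tags but leaves the text between them in place.
    public static string StripExceptBasicFormatting(string html)
    {
        if (string.IsNullOrEmpty(html))
        {
            return html;
        }

        return UnwantedTag.Replace(html, string.Empty);
    }
}
```

For instance, calling TagStripper.StripExceptBasicFormatting on `<p>my paragraph <div>and my <b>div</b></div> are <i>italic</i></p>` should yield `my paragraph and my <b>div</b> are <i>italic</i>`. For anything beyond simple, trusted markup, the parser-based answers above are likely the safer choice.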