c# - Convert (render) HTML to text with correct line breaks


The following code works correctly with the example provided and even handles some odd cases such as <div><br></div>. There is still room for improvement, but the basic idea is there. See the comments.

    using HtmlAgilityPack;

    // Extension methods need a static class; the wrapper name is illustrative.
    public static class HtmlLineBreakFormatter
    {
        public static string FormatLineBreaks(string html)
        {
            // First, remove all the existing '\n' from the HTML:
            // they mean nothing in HTML, but break our logic.
            html = html.Replace("\r", "").Replace("\n", " ");

            // Now create an Html Agility Pack document object.
            HtmlDocument doc = new HtmlDocument();
            doc.LoadHtml(html);

            // Remove comments, head, style and script tags.
            foreach (HtmlNode node in doc.DocumentNode.SafeSelectNodes("//comment() | //script | //style | //head"))
            {
                node.ParentNode.RemoveChild(node);
            }

            // Now remove all "meaningless" inline elements like "span".
            foreach (HtmlNode node in doc.DocumentNode.SafeSelectNodes("//span | //label")) // add "b", "i" if required
            {
                node.ParentNode.ReplaceChild(HtmlNode.CreateNode(node.InnerHtml), node);
            }

            // Block elements: convert to line breaks.
            foreach (HtmlNode node in doc.DocumentNode.SafeSelectNodes("//p | //div")) // you could add more tags here
            {
                // We add a "\n" ONLY if the node contains some plain text as a "direct" child,
                // meaning the text is not nested inside children, but only one level deep.

                // Use XPath to find direct "text" in the element.
                var txtNode = node.SelectSingleNode("text()");

                // No "direct" text: NOT adding the \n!
                if (txtNode == null || txtNode.InnerHtml.Trim() == "") continue;

                // "Surround" the node with line breaks.
                node.ParentNode.InsertBefore(doc.CreateTextNode("\r\n"), node);
                node.ParentNode.InsertAfter(doc.CreateTextNode("\r\n"), node);
            }

            // todo: might need to replace multiple "\n\n" into one here, I'm still testing...

            // Now BR tags: simply replace with "\n" and forget.
            foreach (HtmlNode node in doc.DocumentNode.SafeSelectNodes("//br"))
                node.ParentNode.ReplaceChild(doc.CreateTextNode("\r\n"), node);

            // Finally, return the text, which will have our inserted line breaks in it.
            return doc.DocumentNode.InnerText.Trim();

            // todo: you should probably add "&code;" processing, to decode all the &nbsp; and such
        }

        // Here's the extension method I use.
        private static HtmlNodeCollection SafeSelectNodes(this HtmlNode node, string selector)
        {
            return (node.SelectNodes(selector) ?? new HtmlNodeCollection(node));
        }
    }

I need to convert an HTML string to plain text (preferably using HTML Agility Pack), with proper whitespace and, especially, proper line breaks.

And by "proper line breaks" I mean that this code:

    <div>
        <div>
            <div>
                line1
            </div>
        </div>
    </div>
    <div>line2</div>

should be converted to

    line1
    line2

That is, with just one line break.

Most solutions I have seen simply convert every <div>, <br> and <p> to \n, which obviously s*cks.

Any suggestions for HTML-to-plain-text rendering logic in C#? It does not have to be complete code; even common-logic answers such as "replace every closing DIV with a line break, but only if the next sibling is not also a DIV" would really help.

Things I tried: simply reading the .InnerText property (obviously wrong), regex (slow, painful, lots of hacks; regexes were also 12 times slower than HtmlAgilityPack, I measured it), this solution and similar ones (they return more line breaks than needed).

The following code works for me:

    using System;
    using System.Linq;
    using System.Net;
    using System.Text;
    using HtmlAgilityPack;

    class Program
    {
        static void Main(string[] args)
        {
            StringBuilder sb = new StringBuilder();
            // Despite its name, "path" holds the downloaded HTML, not a file path.
            string path = new WebClient().DownloadString("https://www.google.com");
            HtmlDocument htmlDoc = new HtmlDocument();
            ////htmlDoc.LoadHtml(File.ReadAllText(path));
            htmlDoc.LoadHtml(path);

            var bodySegment = htmlDoc.DocumentNode.Descendants("body").FirstOrDefault();
            if (bodySegment != null)
            {
                foreach (var item in bodySegment.ChildNodes)
                {
                    // Skip top-level script elements entirely.
                    if (item.NodeType == HtmlNodeType.Element && string.Compare(item.Name, "script", true) != 0)
                    {
                        // ToList() snapshots the descendants so they can be emptied while looping.
                        foreach (var a in item.Descendants().ToList())
                        {
                            if (string.Compare(a.Name, "script", true) == 0 || string.Compare(a.Name, "style", true) == 0)
                            {
                                a.InnerHtml = string.Empty;
                            }
                        }
                        sb.AppendLine(item.InnerText.Trim());
                    }
                }
            }

            Console.WriteLine(sb.ToString());
            Console.Read();
        }
    }

The class below provides an alternative implementation of InnerText. It does not emit more than one newline for consecutive divs, because it only considers the tags that separate different text contents. The parent of each text node is evaluated to decide whether a newline or a space should be inserted. Any tag that contains no direct text is therefore ignored automatically.

The case you presented gives the result you wanted. In addition:

    <div>ABC<br>DEF<span>GHI</span></div>

evaluates to

    ABC
    DEF GHI

whereas

    <div>ABC<br>DEF<div>GHI</div></div>

evaluates to

    ABC
    DEF
    GHI

since div is a block tag. script and style elements are ignored completely. The utility method HttpUtility.HtmlDecode (in System.Web) is used to decode HTML-escaped text such as &amp;. Runs of whitespace (\s+) are replaced with a single space. Repeated br tags do not cause multiple newlines.

    using System;
    using System.Collections.Generic;
    using System.Text;
    using System.Text.RegularExpressions;
    using System.Web;
    using HtmlAgilityPack;

    static class HtmlTextProvider
    {
        private static readonly HashSet<string> InlineElementNames = new HashSet<string>
        {
            // from https://developer.mozilla.org/en-US/docs/Web/HTML/Inline_elemente
            "b", "big", "i", "small", "tt", "abbr", "acronym", "cite", "code", "dfn", "em", "kbd",
            "strong", "samp", "var", "a", "bdo", "br", "img", "map", "object", "q", "script", "span",
            "sub", "sup", "button", "input", "label", "select", "textarea"
        };

        private static readonly Regex WhitespaceNormalizer = new Regex(@"(\s+)", RegexOptions.Compiled);

        private static readonly HashSet<string> ExcludedElementNames = new HashSet<string>
        {
            "script", "style"
        };

        public static string GetFormattedInnerText(this HtmlDocument document)
        {
            var textBuilder = new StringBuilder();
            var root = document.DocumentNode;
            foreach (var node in root.Descendants())
            {
                if (node is HtmlTextNode && !ExcludedElementNames.Contains(node.ParentNode.Name))
                {
                    // Decode entities and collapse whitespace runs to a single space.
                    var text = HttpUtility.HtmlDecode(node.InnerText);
                    text = WhitespaceNormalizer.Replace(text, " ").Trim();
                    if (string.IsNullOrWhiteSpace(text)) continue;

                    // Text inside an inline parent is followed by a space, otherwise by a newline.
                    var whitespace = InlineElementNames.Contains(node.ParentNode.Name) ? " " : Environment.NewLine;

                    // A pending space before block-level text becomes a newline.
                    if (EndsWith(textBuilder, " ") && whitespace == Environment.NewLine)
                    {
                        textBuilder.Remove(textBuilder.Length - 1, 1);
                        textBuilder.AppendLine();
                    }

                    textBuilder.Append(text);
                    textBuilder.Append(whitespace);
                    if (!char.IsWhiteSpace(textBuilder[textBuilder.Length - 1]))
                    {
                        if (InlineElementNames.Contains(node.ParentNode.Name))
                        {
                            textBuilder.Append(' ');
                        }
                        else
                        {
                            textBuilder.AppendLine();
                        }
                    }
                }
                else if (node.Name == "br" && EndsWith(textBuilder, Environment.NewLine))
                {
                    textBuilder.AppendLine();
                }
            }
            return textBuilder.ToString().TrimEnd(Environment.NewLine.ToCharArray());
        }

        private static bool EndsWith(StringBuilder builder, string value)
        {
            return builder.Length > value.Length
                && builder.ToString(builder.Length - value.Length, value.Length) == value;
        }
    }

I don't think SO is about exchanging bounties for complete code solutions. I think the best answers are those that give you guidance and help you solve it yourself. In that spirit, here is a process that I think should work:

1. Replace any run of whitespace characters with a single space (this mirrors the standard HTML whitespace-processing rules)
2. Replace all instances of </div> with newlines
3. Collapse any multiple consecutive newlines into a single newline
4. Replace instances of </p>, <br> and <br/> with a newline
5. Remove any remaining opening/closing HTML tags
6. Expand any entities, e.g. &trade;, as required
7. Trim the result to remove leading and trailing spaces

Basically, you want a newline for each paragraph or line-break tag, but you want to collapse multiple div closings into a single newline, so handle the divs first.
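The steps above can be sketched roughly as follows. This is only an illustration under some assumptions: the class and method names are hypothetical, regular expressions stand in for real HTML parsing, newlines separated only by spaces are treated as consecutive (step 1 leaves a space between nested closing tags), and one extra tidy-up pass that is not in the list removes spaces around line breaks.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper; the names are illustrative only.
public static class StepwiseHtmlToText
{
    public static string Convert(string html)
    {
        // 1. Collapse every whitespace run (including source newlines) to one space.
        string text = Regex.Replace(html, @"\s+", " ");
        // 2. Turn each closing </div> into a newline.
        text = Regex.Replace(text, @"</div\s*>", "\n", RegexOptions.IgnoreCase);
        // 3. Collapse newlines that are separated only by spaces into one newline.
        text = Regex.Replace(text, @"\n(?: *\n)+", "\n");
        // 4. Closing </p> and <br> / <br/> each produce their own newline.
        text = Regex.Replace(text, @"</p\s*>|<br\s*/?>", "\n", RegexOptions.IgnoreCase);
        // 5. Drop all remaining opening and closing tags.
        text = Regex.Replace(text, @"<[^>]+>", "");
        // 6. Expand entities such as &amp;.
        text = WebUtility.HtmlDecode(text);
        // Assumption (not in the list): remove spaces left around line breaks.
        text = Regex.Replace(text, @" *\n *", "\n");
        // 7. Trim leading and trailing whitespace.
        return text.Trim();
    }
}
```

For the nested-div sample from the question, StepwiseHtmlToText.Convert returns line1 and line2 separated by a single "\n". The sketch always emits "\n"; swap in Environment.NewLine if platform line endings are needed.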

Finally, note that what you are really doing is HTML layout, and that depends on the CSS applied to the tags. The behaviour you see occurs because divs default to the block display/layout mode; CSS could change that. There is no easy general solution to this problem without a headless rendering/layout engine, i.e. something that can process CSS.

But for your simple example case, the approach above should be good enough.

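I don't know much about html-agility-pack, but here is a C# alternative.

    using System;
    using System.IO;
    using System.Net;
    using System.Text.RegularExpressions;

    // The wrapper class name is illustrative.
    public class PlainTextFetcher
    {
        public string GetPlainText()
        {
            WebRequest request = WebRequest.Create("URL for page you want to 'stringify'");
            WebResponse response = request.GetResponse();
            Stream data = response.GetResponseStream();
            string html = String.Empty;
            using (StreamReader sr = new StreamReader(data))
            {
                html = sr.ReadToEnd();
            }

            // Replace every tag with a line break.
            html = Regex.Replace(html, "<.*?>", "\n");
            // Mark every line break (real or escaped) with a '$' placeholder.
            html = Regex.Replace(html, @"\\r|\\n|\n|\r", @"$");
            // Drop spaces that follow a placeholder.
            html = Regex.Replace(html, @"\$ +", @"$");
            // Collapse runs of placeholders into a single newline.
            html = Regex.Replace(html, @"(\$)+", Environment.NewLine);
            return html;
        }
    }

If you intend to display this in an HTML page, replace Environment.NewLine with <br/>.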

Concerns:

Non-visible tags (script, style)
Block-level tags
Inline tags
The br tag
Collapsible whitespace (leading, trailing and multiple whitespace characters)
Hard (non-breaking) spaces
Entities

Algebraic solution:

    plain-text = Process(Plain(html))

    Plain(node-s) => Plain(node-0), Plain(node-1), ..., Plain(node-N)
    Plain(BR) => BR
    Plain(not-visible-element(child-s)) => nil
    Plain(block-element(child-s)) => BS, Plain(child-s), BE
    Plain(inline-element(child-s)) => Plain(child-s)
    Plain(text) => ch-0, ch-1, .., ch-N

    Process(symbol-s) => Process(start-line, symbol-s)

    Process(start-line, BR, symbol-s) => Print('\n'), Process(start-line, symbol-s)
    Process(start-line, BS, symbol-s) => Process(start-line, symbol-s)
    Process(start-line, BE, symbol-s) => Process(start-line, symbol-s)
    Process(start-line, hard-space, symbol-s) => Print(' '), Process(not-ws, symbol-s)
    Process(start-line, space, symbol-s) => Process(start-line, symbol-s)
    Process(start-line, common-symbol, symbol-s) => Print(common-symbol), Process(not-ws, symbol-s)

    Process(not-ws, BR|BS|BE, symbol-s) => Print('\n'), Process(start-line, symbol-s)
    Process(not-ws, hard-space, symbol-s) => Print(' '), Process(not-ws, symbol-s)
    Process(not-ws, space, symbol-s) => Process(ws, symbol-s)
    Process(not-ws, common-symbol, symbol-s) => Print(common-symbol), Process(not-ws, symbol-s)

    Process(ws, BR|BS|BE, symbol-s) => Print('\n'), Process(start-line, symbol-s)
    Process(ws, hard-space, symbol-s) => Print(' '), Print(' '), Process(not-ws, symbol-s)
    Process(ws, space, symbol-s) => Process(ws, symbol-s)
    Process(ws, common-symbol, symbol-s) => Print(' '), Print(common-symbol), Process(not-ws, symbol-s)

C# solution for HtmlAgilityPack and System.Xml.Linq:

    using System.Collections.Generic;
    using System.Text;
    using System.Xml.Linq;

    // The enclosing class name is illustrative; the original shows only the members.
    public static class PlainTextExtensions
    {
        // HtmlAgilityPack part
        public static string ToPlainText(this HtmlAgilityPack.HtmlDocument doc)
        {
            var builder = new System.Text.StringBuilder();
            var state = ToPlainTextState.StartLine;

            Plain(builder, ref state, new[] { doc.DocumentNode });
            return builder.ToString();
        }

        static void Plain(StringBuilder builder, ref ToPlainTextState state, IEnumerable<HtmlAgilityPack.HtmlNode> nodes)
        {
            foreach (var node in nodes)
            {
                if (node is HtmlAgilityPack.HtmlTextNode)
                {
                    var text = (HtmlAgilityPack.HtmlTextNode)node;
                    Process(builder, ref state, HtmlAgilityPack.HtmlEntity.DeEntitize(text.Text).ToCharArray());
                }
                else
                {
                    var tag = node.Name.ToLower();

                    if (tag == "br")
                    {
                        builder.AppendLine();
                        state = ToPlainTextState.StartLine;
                    }
                    else if (NonVisibleTags.Contains(tag))
                    {
                        // Non-visible elements produce no output.
                    }
                    else if (InlineTags.Contains(tag))
                    {
                        Plain(builder, ref state, node.ChildNodes);
                    }
                    else
                    {
                        // Block element: break the line before and after, unless already at a line start.
                        if (state != ToPlainTextState.StartLine)
                        {
                            builder.AppendLine();
                            state = ToPlainTextState.StartLine;
                        }
                        Plain(builder, ref state, node.ChildNodes);
                        if (state != ToPlainTextState.StartLine)
                        {
                            builder.AppendLine();
                            state = ToPlainTextState.StartLine;
                        }
                    }
                }
            }
        }

        // System.Xml.Linq part
        public static string ToPlainText(this IEnumerable<XNode> nodes)
        {
            var builder = new System.Text.StringBuilder();
            var state = ToPlainTextState.StartLine;

            Plain(builder, ref state, nodes);
            return builder.ToString();
        }

        static void Plain(StringBuilder builder, ref ToPlainTextState state, IEnumerable<XNode> nodes)
        {
            foreach (var node in nodes)
            {
                if (node is XElement)
                {
                    var element = (XElement)node;
                    var tag = element.Name.LocalName.ToLower();

                    if (tag == "br")
                    {
                        builder.AppendLine();
                        state = ToPlainTextState.StartLine;
                    }
                    else if (NonVisibleTags.Contains(tag))
                    {
                        // Non-visible elements produce no output.
                    }
                    else if (InlineTags.Contains(tag))
                    {
                        Plain(builder, ref state, element.Nodes());
                    }
                    else
                    {
                        if (state != ToPlainTextState.StartLine)
                        {
                            builder.AppendLine();
                            state = ToPlainTextState.StartLine;
                        }
                        Plain(builder, ref state, element.Nodes());
                        if (state != ToPlainTextState.StartLine)
                        {
                            builder.AppendLine();
                            state = ToPlainTextState.StartLine;
                        }
                    }
                }
                else if (node is XText)
                {
                    var text = (XText)node;
                    Process(builder, ref state, text.Value.ToCharArray());
                }
            }
        }

        // Common part
        public static void Process(System.Text.StringBuilder builder, ref ToPlainTextState state, params char[] chars)
        {
            foreach (var ch in chars)
            {
                if (char.IsWhiteSpace(ch))
                {
                    if (IsHardSpace(ch))
                    {
                        // Hard spaces are always printed, after any pending collapsed space.
                        if (state == ToPlainTextState.WhiteSpace)
                            builder.Append(' ');
                        builder.Append(' ');
                        state = ToPlainTextState.NotWhiteSpace;
                    }
                    else
                    {
                        // Ordinary whitespace is only remembered, never printed directly.
                        if (state == ToPlainTextState.NotWhiteSpace)
                            state = ToPlainTextState.WhiteSpace;
                    }
                }
                else
                {
                    if (state == ToPlainTextState.WhiteSpace)
                        builder.Append(' ');
                    builder.Append(ch);
                    state = ToPlainTextState.NotWhiteSpace;
                }
            }
        }

        static bool IsHardSpace(char ch)
        {
            return ch == 0xA0 || ch == 0x2007 || ch == 0x202F;
        }

        private static readonly HashSet<string> InlineTags = new HashSet<string>
        {
            // from https://developer.mozilla.org/en-US/docs/Web/HTML/Inline_elemente
            "b", "big", "i", "small", "tt", "abbr", "acronym", "cite", "code", "dfn", "em", "kbd",
            "strong", "samp", "var", "a", "bdo", "br", "img", "map", "object", "q", "script", "span",
            "sub", "sup", "button", "input", "label", "select", "textarea"
        };

        private static readonly HashSet<string> NonVisibleTags = new HashSet<string>
        {
            "script", "style"
        };

        public enum ToPlainTextState
        {
            StartLine = 0,
            NotWhiteSpace,
            WhiteSpace,
        }
    }

Examples:

    // <div> 1 </div> 2 <div> 3 </div>
    1
    2
    3

    // <div>1 <br/><br/>  <b> 2 </b> <div> </div><div> </div>  3</div>
    1

    2
    3

    // <span>1<style> text </style><i>2</i></span>3
    123

    //<div>
    //    <div>
    //        <div>
    //            line1
    //        </div>
    //    </div>
    //</div>
    //<div>line2</div>
    line1
    line2

I always use CsQuery for my projects. It is supposedly faster than HtmlAgilityPack and much easier to use, with CSS selectors instead of XPath.

    var html = @"<div>
        <div>
            <div>
                line1
            </div>
        </div>
    </div>
    <div>line2</div>";

    var lines = CQ.Create(html)
        .Text()
        .Replace("\r\n", "\n") // I like to do this before splitting on line breaks
        .Split('\n')
        .Select(s => s.Trim()) // Trim elements
        .Where(s => !string.IsNullOrWhiteSpace(s)) // Remove empty lines
        ;

    var result = string.Join(Environment.NewLine, lines);

The code above works as expected; if you have a more complex example with an expected result, it can easily be adapted.

If you want to preserve <br>, for example, you can replace it with something like "--- br ---" in the html variable and split on it again in the final result.
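A rough sketch of that placeholder idea follows. To stay self-contained it does not use CsQuery: a simple tag-stripping regex stands in for CQ.Create(html).Text(), which is an assumption and not how CsQuery works internally. The class and method names are hypothetical.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper; the names are illustrative only.
public static class BrPlaceholderSketch
{
    // Marker text from the suggestion above; any string that cannot occur in the content works.
    private const string Marker = "--- br ---";

    public static string Convert(string html)
    {
        // Protect <br> tags with the marker before the markup is discarded.
        string marked = Regex.Replace(html, @"<br\s*/?>", Marker, RegexOptions.IgnoreCase);
        // ASSUMPTION: stand-in for CQ.Create(marked).Text(); it just removes tags.
        string text = Regex.Replace(marked, @"<[^>]+>", "");
        var lines = text
            .Replace("\r\n", "\n")
            .Split('\n')
            .Select(s => s.Trim())                       // trim each line
            .Where(s => !string.IsNullOrWhiteSpace(s));  // drop empty lines
        // Rejoin, then turn each marker back into a line break.
        return string.Join("\n", lines).Replace(Marker, "\n");
    }
}
```

With this sketch, a div containing first<br>second comes out as two lines, while the empty lines produced by the surrounding markup are still dropped.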

Non-regex solution:

    while (text.IndexOf("\n\n") > -1 || text.IndexOf("\n \n") > -1)
    {
        text = text.Replace("\n\n", "\n");
        text = text.Replace("\n \n", "\n");
    }

Regex:

    text = Regex.Replace(text, @"^\s*$\n|\r", "", RegexOptions.Multiline).TrimEnd();
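To compare the two variants on the same input, they can be wrapped in small helper methods. This is only a sketch; the class and method names are hypothetical, and the loop uses ordinal comparison, which is an addition here to avoid culture-sensitive matching of newline characters.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper; the names are illustrative only.
public static class BlankLineCollapse
{
    // Loop-based variant: repeat until no empty or single-space line remains.
    public static string WithLoop(string text)
    {
        while (text.IndexOf("\n\n", StringComparison.Ordinal) > -1 ||
               text.IndexOf("\n \n", StringComparison.Ordinal) > -1)
        {
            text = text.Replace("\n\n", "\n");
            text = text.Replace("\n \n", "\n");
        }
        return text;
    }

    // Regex variant: remove whitespace-only lines and every '\r', then trim the end.
    public static string WithRegex(string text)
    {
        return Regex.Replace(text, @"^\s*$\n|\r", "", RegexOptions.Multiline).TrimEnd();
    }
}
```

For an input such as "line1\n\n \nline2\n\n\nline3", both methods return "line1\nline2\nline3". They are not equivalent in general: the regex variant also strips carriage returns, removes lines containing any amount of whitespace and trims the end, while the loop only handles empty lines and lines holding a single space.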

Also, as far as I remember,

    text = HtmlAgilityPack.HtmlEntity.DeEntitize(text);

does the trick.