extract - How can I convert HTML to text in C#?


I'm looking for C# code to convert an HTML document to plain text.

I'm not looking for simple tag stripping, but for something that outputs plain text with a reasonable preservation of the original layout.

The output should look like this:

Html2Txt @ W3C

I've looked at the HTML Agility Pack, but I don't think that's what I need. Does anyone have any other suggestions?

EDIT: I just downloaded the HTML Agility Pack from CodePlex and ran the Html2Txt project. What a disappointment (at least the module that does the HTML-to-text conversion)! All it did was strip the tags, flatten the tables, and so on. The output didn't look anything like what Html2Txt @ W3C produced. Too bad that source doesn't seem to be available. I was looking to see if there is a more "canned" solution available.

EDIT 2: Thanks, everyone, for your suggestions. FlySwat pointed me in the direction I wanted to go. I can use the System.Diagnostics.Process class to run lynx.exe with the "-dump" switch to send the text to standard output, and capture stdout with ProcessStartInfo.UseShellExecute = false and ProcessStartInfo.RedirectStandardOutput = true. I'll wrap all this in a C# class. This code will be called only occasionally, so I'm not too concerned about spawning a new process versus doing it in code. Plus, Lynx is FAST!
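The approach described in EDIT 2 could be sketched roughly as follows. This is only a minimal sketch, not the asker's actual class: the names LynxDump, BuildStartInfo and Convert are hypothetical, and it assumes the HTML has already been saved to a file and that the caller knows where lynx.exe lives.

```csharp
using System;
using System.Diagnostics;

public static class LynxDump
{
    // Builds the process settings described above: no shell, stdout redirected.
    // lynxPath and htmlFile are supplied by the caller (hypothetical parameters).
    public static ProcessStartInfo BuildStartInfo(string lynxPath, string htmlFile)
    {
        return new ProcessStartInfo
        {
            FileName = lynxPath,
            Arguments = "-dump \"" + htmlFile + "\"",
            UseShellExecute = false,
            RedirectStandardOutput = true,
            CreateNoWindow = true
        };
    }

    // Runs Lynx and returns whatever it wrote to standard output.
    public static string Convert(string lynxPath, string htmlFile)
    {
        using (Process process = Process.Start(BuildStartInfo(lynxPath, htmlFile)))
        {
            // Read before WaitForExit so a full pipe buffer cannot block the child.
            string text = process.StandardOutput.ReadToEnd();
            process.WaitForExit();
            return text;
        }
    }
}
```

A call might look like LynxDump.Convert(@"C:\tools\lynx.exe", @"C:\temp\page.html"), where both paths are placeholders.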

Have you tried http://www.aaronsw.com/2002/html2text/? It's Python, but open source.

Since I wanted conversion to plain text with line feeds and bullets, I found this nice solution on CodeProject, which covers many conversion use cases:

Convert HTML to Plain Text

Yes, it looks big, but it works fine.

In Genexus you can do it with Regex:

&pattern = '<[^>]+>'

&TSTRPNOT = &TSTRPNOT.ReplaceRegEx(&pattern, "")


I heard from a reliable source that, if you are parsing HTML in .NET, you should look at the HTML Agility Pack again.

http://www.codeplex.com/htmlagilitypack

Some sample on SO...

HTML Agility Pack: parsing tables

Here is another solution for converting HTML to text or RTF in C#:

SautinSoft.HtmlToRtf h = new SautinSoft.HtmlToRtf();
h.OutputFormat = HtmlToRtf.eOutputFormat.TextUnicode;
string text = h.ConvertString(htmlString);

This library is not free; it is a commercial product, and it is my own product.

I have used Detagger in the past. It does a pretty good job of formatting the HTML as text and is more than just a tag remover.

The easiest approach would probably be tag stripping combined with replacing some tags with text layout elements, such as dashes for list items (li) and line breaks for br's and p's. It shouldn't be too hard to extend this to tables.
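A minimal sketch of that idea, assuming regular expressions are acceptable for the markup at hand (they are not a real HTML parser, and tables are not handled). The class name SimpleHtmlToText and the choice of "- " as the bullet are illustrative assumptions.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class SimpleHtmlToText
{
    public static string Convert(string html)
    {
        // Drop script and style blocks entirely, including their content.
        string text = Regex.Replace(html, @"<(script|style)\b[^>]*>.*?</\1\s*>", "",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        // A line break for each <br> and for the end of each paragraph.
        text = Regex.Replace(text, @"<br\s*/?>|</p\s*>", "\n", RegexOptions.IgnoreCase);

        // A dash on a new line for each list item.
        text = Regex.Replace(text, @"<li\b[^>]*>", "\n- ", RegexOptions.IgnoreCase);

        // Strip whatever tags remain.
        text = Regex.Replace(text, @"<[^>]+>", "");

        // Decode entities such as &amp; last, so decoded '<' is not mistaken for a tag.
        return WebUtility.HtmlDecode(text).Trim();
    }
}
```

For example, SimpleHtmlToText.Convert("<p>Hello<br/>world</p><ul><li>one</li><li>two &amp; three</li></ul>") yields "Hello", "world", an empty line, "- one" and "- two & three" on separate lines.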

What you are looking for is a text-mode DOM renderer that outputs text, much like Lynx or other text browsers... This is much harder to do than you would expect.

I don't know C#, but here is a fairly small and easy-to-read Python html2txt script: http://www.aaronsw.com/2002/html2text/

You can use the WebBrowser control to render your HTML content in memory. After the LoadCompleted event fires...

IHTMLDocument2 htmlDoc = (IHTMLDocument2)webBrowser.Document;
string innerHTML = htmlDoc.body.innerHTML;
string innerText = htmlDoc.body.innerText;

You can use this:

// Needs System.Text.RegularExpressions (Regex) and System.Web (HttpUtility).
public static string StripHTML(string HTMLText, bool decode = true)
{
    Regex reg = new Regex("<[^>]+>", RegexOptions.IgnoreCase);
    var stripped = reg.Replace(HTMLText, "");
    return decode ? HttpUtility.HtmlDecode(stripped) : stripped;
}

Updated

Thanks for the comments; I have updated the answer to improve this function.

If you are using .NET Framework 4.5, you can use System.Net.WebUtility.HtmlDecode(), which takes an HTML-encoded string and returns a decoded string.

Documented on MSDN at: http://msdn.microsoft.com/en-us/library/system.net.webutility.htmldecode(v=vs.110).aspx

You can use this in a Windows Store app as well.
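As a small illustration of what that call does (the wrapper class DecodeExample is hypothetical), note that it only decodes entities and leaves any tags in place:

```csharp
using System;
using System.Net;

public static class DecodeExample
{
    // Thin wrapper around the framework call, kept only to make the sketch self-contained.
    public static string Decode(string encoded)
    {
        return WebUtility.HtmlDecode(encoded);
    }
}
```

DecodeExample.Decode("Fish &amp; chips &lt;fresh&gt;") returns "Fish & chips <fresh>", so it is usually combined with some form of tag removal rather than used as a converter on its own.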

Just a note about the HtmlAgilityPack for posterity. The project contains an example of converting HTML to text which, as the OP noted, does not handle whitespace at all the way anyone writing HTML would expect. There are full text-rendering solutions out there, noted by others in this question; this is not one of them (it cannot even handle tables in its current form), but it is lightweight and fast, which is all I wanted for creating a simple text version of HTML emails.

using System;
using System.IO;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

// Small but important modification to class
// https://github.com/zzzprojects/html-agility-pack/blob/master/src/Samples/Html2Txt/HtmlConvert.cs
public static class HtmlToText
{
    public static string Convert(string path)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.Load(path);
        return ConvertDoc(doc);
    }

    public static string ConvertHtml(string html)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);
        return ConvertDoc(doc);
    }

    public static string ConvertDoc(HtmlDocument doc)
    {
        using (StringWriter sw = new StringWriter())
        {
            ConvertTo(doc.DocumentNode, sw);
            sw.Flush();
            return sw.ToString();
        }
    }

    internal static void ConvertContentTo(HtmlNode node, TextWriter outText, PreceedingDomTextInfo textInfo)
    {
        foreach (HtmlNode subnode in node.ChildNodes)
        {
            ConvertTo(subnode, outText, textInfo);
        }
    }

    public static void ConvertTo(HtmlNode node, TextWriter outText)
    {
        ConvertTo(node, outText, new PreceedingDomTextInfo(false));
    }

    internal static void ConvertTo(HtmlNode node, TextWriter outText, PreceedingDomTextInfo textInfo)
    {
        string html;
        switch (node.NodeType)
        {
            case HtmlNodeType.Comment:
                // don't output comments
                break;
            case HtmlNodeType.Document:
                ConvertContentTo(node, outText, textInfo);
                break;
            case HtmlNodeType.Text:
                // script and style must not be output
                string parentName = node.ParentNode.Name;
                if ((parentName == "script") || (parentName == "style"))
                {
                    break;
                }
                // get text
                html = ((HtmlTextNode)node).Text;
                // is it in fact a special closing node output as text?
                if (HtmlNode.IsOverlappedClosingElement(html))
                {
                    break;
                }
                // check the text is meaningful and not a bunch of whitespaces
                if (html.Length == 0)
                {
                    break;
                }
                if (!textInfo.WritePrecedingWhiteSpace || textInfo.LastCharWasSpace)
                {
                    html = html.TrimStart();
                    if (html.Length == 0) { break; }
                    textInfo.IsFirstTextOfDocWritten.Value = textInfo.WritePrecedingWhiteSpace = true;
                }
                // collapse runs of whitespace into a single space
                outText.Write(HtmlEntity.DeEntitize(Regex.Replace(html.TrimEnd(), @"\s{2,}", " ")));
                if (textInfo.LastCharWasSpace = char.IsWhiteSpace(html[html.Length - 1]))
                {
                    outText.Write(' ');
                }
                break;
            case HtmlNodeType.Element:
                string endElementString = null;
                bool isInline;
                bool skip = false;
                int listIndex = 0;
                switch (node.Name)
                {
                    case "nav":
                        skip = true;
                        isInline = false;
                        break;
                    case "body":
                    case "section":
                    case "article":
                    case "aside":
                    case "h1":
                    case "h2":
                    case "header":
                    case "footer":
                    case "address":
                    case "main":
                    case "div":
                    case "p": // stylistic - adjust as you tend to use
                        if (textInfo.IsFirstTextOfDocWritten)
                        {
                            outText.Write("\r\n");
                        }
                        endElementString = "\r\n";
                        isInline = false;
                        break;
                    case "br":
                        outText.Write("\r\n");
                        skip = true;
                        textInfo.WritePrecedingWhiteSpace = false;
                        isInline = true;
                        break;
                    case "a":
                        if (node.Attributes.Contains("href"))
                        {
                            string href = node.Attributes["href"].Value.Trim();
                            if (node.InnerText.IndexOf(href, StringComparison.InvariantCultureIgnoreCase) == -1)
                            {
                                endElementString = "<" + href + ">";
                            }
                        }
                        isInline = true;
                        break;
                    case "li":
                        if (textInfo.ListIndex > 0)
                        {
                            outText.Write("\r\n{0}.\t", textInfo.ListIndex++);
                        }
                        else
                        {
                            outText.Write("\r\n*\t"); //using '*' as bullet char, with tab after, but whatever you want eg "\t->", if utf-8 0x2022
                        }
                        isInline = false;
                        break;
                    case "ol":
                        listIndex = 1;
                        goto case "ul";
                    case "ul": //not handling nested lists any differently at this stage - that is getting close to rendering problems
                        endElementString = "\r\n";
                        isInline = false;
                        break;
                    case "img": //inline-block in reality
                        if (node.Attributes.Contains("alt"))
                        {
                            outText.Write('[' + node.Attributes["alt"].Value);
                            endElementString = "]";
                        }
                        if (node.Attributes.Contains("src"))
                        {
                            outText.Write('<' + node.Attributes["src"].Value + '>');
                        }
                        isInline = true;
                        break;
                    default:
                        isInline = true;
                        break;
                }
                if (!skip && node.HasChildNodes)
                {
                    ConvertContentTo(node, outText, isInline ? textInfo : new PreceedingDomTextInfo(textInfo.IsFirstTextOfDocWritten) { ListIndex = listIndex });
                }
                if (endElementString != null)
                {
                    outText.Write(endElementString);
                }
                break;
        }
    }
}

internal class PreceedingDomTextInfo
{
    public PreceedingDomTextInfo(BoolWrapper isFirstTextOfDocWritten)
    {
        IsFirstTextOfDocWritten = isFirstTextOfDocWritten;
    }
    public bool WritePrecedingWhiteSpace { get; set; }
    public bool LastCharWasSpace { get; set; }
    public readonly BoolWrapper IsFirstTextOfDocWritten;
    public int ListIndex { get; set; }
}

internal class BoolWrapper
{
    public BoolWrapper() { }
    public bool Value { get; set; }
    public static implicit operator bool(BoolWrapper boolWrapper)
    {
        return boolWrapper.Value;
    }
    public static implicit operator BoolWrapper(bool boolWrapper)
    {
        return new BoolWrapper { Value = boolWrapper };
    }
}

As an example, the following HTML code...

<!DOCTYPE HTML>
<html>
    <head>
    </head>
    <body>
        <header>
            Whatever Inc.
        </header>
        <main>
            <p>
                Thanks for your enquiry. As this is the 1<sup>st</sup> time you have contacted us, we would like to clarify a few things:
            </p>
            <ol>
                <li>
                    Please confirm this is your email by replying.
                </li>
                <li>
                    Then perform this step.
                </li>
            </ol>
            <p>
                Please solve this <img alt="complex equation" src="http://upload.wikimedia.org/wikipedia/commons/8/8d/First_Equation_Ever.png"/>. Then, in any order, could you please:
            </p>
            <ul>
                <li>
                    a point.
                </li>
                <li>
                    another point, with a <a href="http://en.wikipedia.org/wiki/Hyperlink">hyperlink</a>.
                </li>
            </ul>
            <p>
                Sincerely,
            </p>
            <p>
                The whatever.com team
            </p>
        </main>
        <footer>
            Ph: 000 000 000<br/>
            mail: whatever st
        </footer>
    </body>
</html>

...will be transformed into:

Whatever Inc. Thanks for your enquiry. As this is the 1st time you have contacted us, we would like to clarify a few things: 1. Please confirm this is your email by replying. 2. Then perform this step. Please solve this [complex equation<http://upload.wikimedia.org/wikipedia/commons/8/8d/First_Equation_Ever.png>]. Then, in any order, could you please: * a point. * another point, with a hyperlink<http://en.wikipedia.org/wiki/Hyperlink>. Sincerely, The whatever.com team Ph: 000 000 000 mail: whatever st

...as opposed to:

Whatever Inc. Thanks for your enquiry. As this is the 1st time you have contacted us, we would like to clarify a few things: Please confirm this is your email by replying. Then perform this step. Please solve this . Then, in any order, could you please: a point. another point, with a hyperlink. Sincerely, The whatever.com team Ph: 000 000 000 mail: whatever st

Assuming you have well-formed HTML, you could also try an XSL transformation.

Here is an example:

using System;
using System.IO;
using System.Xml.Linq;
using System.Xml.XPath;
using System.Xml.Xsl;

class Html2TextExample
{
    public static string Html2Text(XDocument source)
    {
        var writer = new StringWriter();
        Html2Text(source, writer);
        return writer.ToString();
    }

    public static void Html2Text(XDocument source, TextWriter output)
    {
        Transformer.Transform(source.CreateReader(), null, output);
    }

    // Cached transform, compiled on first use.
    public static XslCompiledTransform _transformer;

    public static XslCompiledTransform Transformer
    {
        get
        {
            if (_transformer == null)
            {
                _transformer = new XslCompiledTransform();
                var xsl = XDocument.Parse(@"<?xml version='1.0'?>
<xsl:stylesheet version=""1.0"" xmlns:xsl=""http://www.w3.org/1999/XSL/Transform"" exclude-result-prefixes=""xsl"">
  <xsl:output method=""html"" indent=""yes"" version=""4.0"" omit-xml-declaration=""yes"" encoding=""UTF-8"" />
  <xsl:template match=""/"">
    <xsl:value-of select=""."" />
  </xsl:template>
</xsl:stylesheet>");
                _transformer.Load(xsl.CreateNavigator());
            }
            return _transformer;
        }
    }

    static void Main(string[] args)
    {
        var html = XDocument.Parse("<html><body><div>Hello world!</div></body></html>");
        var text = Html2Text(html);
        Console.WriteLine(text);
    }
}

I had some decoding issues with HtmlAgility and I didn't want to invest time investigating them.

Instead, I used this utility from the Microsoft Team Foundation API:

var text = HtmlFilter.ConvertToPlainText(htmlContent);

Another post suggests the HTML Agility Pack (http://www.codeplex.com/htmlagilitypack):

This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH or XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant of "real world" malformed HTML. The object model is very similar to what System.Xml proposes, but for HTML documents (or streams).

I recently blogged about a solution that worked for me, using a Markdown XSLT file to transform the HTML source. The HTML source will, of course, need to be valid XML first.

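Try the easy and usable way: just call StripHTML(WebBrowserControl_name);

// Needs a reference to Microsoft.mshtml (using mshtml;) and System.Windows.Forms.
public string StripHTML(WebBrowser webp)
{
    try
    {
        // Assumption: the original snippet never declared doc; for a WinForms
        // WebBrowser it can be obtained from the control's DOM document.
        IHTMLDocument2 doc = (IHTMLDocument2)webp.Document.DomDocument;
        doc.execCommand("SelectAll", true, null);
        IHTMLSelectionObject currentSelection = doc.selection;
        if (currentSelection != null)
        {
            IHTMLTxtRange range = currentSelection.createRange() as IHTMLTxtRange;
            if (range != null)
            {
                currentSelection.empty();
                return range.text;
            }
        }
    }
    catch (Exception ep)
    {
        //MessageBox.Show(ep.Message);
    }
    return "";
}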