Arabic characters from HTML content to PDF using iText


I'm having trouble displaying Arabic characters from HTML content when generating a PDF: they show up as "?".

I can display Arabic text that comes from a String variable. However, I can't render Arabic text that comes from an HTML string.

I want the PDF to have two columns, with English on the left and Arabic on the right.

This happens when I use the following program to convert to PDF. Please help me with this.

    try {
        Document document = new Document(PageSize.A4, 50, 50, 50, 50);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        PdfWriter writer = PdfWriter.getInstance(document, out);
        BaseFont bf = BaseFont.createFont("C://arial.ttf", BaseFont.IDENTITY_H, BaseFont.EMBEDDED);
        Font font = new Font(bf, 8);
        document.open();

        BufferedReader br = new BufferedReader(new FileReader("C://style.css"));
        StringBuffer fileContents = new StringBuffer();
        String line = br.readLine();
        while (line != null) {
            fileContents.append(line);
            line = br.readLine();
        }
        br.close();
        String styles = fileContents.toString(); //"p { font-family: Arial;}";

        Paragraph cirNoEn = null;
        Paragraph cirNoAr = null;
        String htmlContentEn = null;
        String htmlContentAr = null;
        PdfPCell contentEnCell = new PdfPCell();
        PdfPCell contentArCell = new PdfPCell();

        cirNoEn = new Paragraph("Circular No. (" + cirEnNo + ")", new Font(bf, 14, Font.BOLD | Font.UNDERLINE));
        cirNoAr = new Paragraph("رقم التعميم (" + cirArNo + ")", new Font(bf, 14, Font.BOLD | Font.UNDERLINE));
        htmlContentEn = " Dear….";
        htmlContentAr = " رقم التعميم رقم التعميم رقم التعميم ….";

        for (Element e : XMLWorkerHelper.parseToElementList(htmlContentEn, styles)) {
            for (Chunk c : e.getChunks()) {
                c.setFont(new Font(bf));
            }
            contentEnCell.addElement(e);
        }
        for (Element e : XMLWorkerHelper.parseToElementList(htmlContentAr, styles)) {
            for (Chunk c : e.getChunks()) {
                c.setFont(new Font(bf));
            }
            contentArCell.addElement(e);
        }

        PdfPCell emptyCell = new PdfPCell();
        PdfPCell cirNoEnCell = new PdfPCell(cirNoEn);
        PdfPCell cirNoArCell = new PdfPCell(cirNoAr);
        cirNoEnCell.setHorizontalAlignment(Element.ALIGN_CENTER);
        cirNoArCell.setHorizontalAlignment(Element.ALIGN_CENTER);
        emptyCell.setBorder(Rectangle.NO_BORDER);
        emptyCell.setFixedHeight(15);
        cirNoEnCell.setBorder(Rectangle.NO_BORDER);
        cirNoArCell.setBorder(Rectangle.NO_BORDER);
        contentEnCell.setBorder(Rectangle.NO_BORDER);
        contentArCell.setBorder(Rectangle.NO_BORDER);
        cirNoArCell.setRunDirection(PdfWriter.RUN_DIRECTION_RTL);
        contentArCell.setRunDirection(PdfWriter.RUN_DIRECTION_RTL);
        contentEnCell.setNoWrap(false);
        contentArCell.setNoWrap(false);

        PdfPTable circularInfoTable = null;
        emptyCell.setColspan(2);
        circularInfoTable = new PdfPTable(2);
        circularInfoTable.addCell(cirNoEnCell);
        circularInfoTable.addCell(cirNoArCell);
        circularInfoTable.addCell(emptyCell);
        circularInfoTable.addCell(emptyCell);
        circularInfoTable.addCell(emptyCell);
        circularInfoTable.addCell(contentEnCell);
        circularInfoTable.addCell(contentArCell);
        circularInfoTable.addCell(emptyCell);
        circularInfoTable.getDefaultCell().setBorder(PdfPCell.NO_BORDER);
        circularInfoTable.setWidthPercentage(100);
        document.add(circularInfoTable);
        document.close();
    } catch (Exception e) {
    }

Take a look at the ParseHtml7 and ParseHtml8 examples. They take HTML input with Arabic characters and create a PDF with the same Arabic text.

Before we look at the code, let me explain that it's not a good idea to use non-ASCII characters in source code. For instance, this is not done:

    htmlContentAr = " رقم التعميم رقم التعميم رقم التعميم ….";

You never know how a Java file containing such glyphs will be stored. If it isn't stored as UTF-8, the characters may end up looking like something completely different. Version control systems are known to have problems with non-ASCII characters, and even compilers can get the encoding wrong. If you really want to store hard-coded String values in your code, use the Unicode notation. Part of your problem is an encoding problem, and you can read more about this here: Can't get Czech characters while generating a PDF
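To illustrate the Unicode notation, here is a minimal sketch that spells the first word of the sample text (رقم) with escape sequences, so the source file itself stays pure ASCII. The class and constant names are made up for this example.

```java
public class UnicodeNotation {

    // The three letters reh, qaf, meem written as Unicode escapes.
    // The source file contains only ASCII, so it survives any file encoding.
    static final String RAQM = "\u0631\u0642\u0645";

    public static void main(String[] args) {
        // Print the code point of each character to show what the String holds.
        for (int i = 0; i < RAQM.length(); i++) {
            System.out.printf("U+%04X%n", (int) RAQM.charAt(i));
        }
        // prints U+0631, U+0642, U+0645 (one per line)
    }
}
```

The compiler turns the escapes into the real characters, so the String is the same as if the Arabic text had been typed directly.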

For the examples shown in the screenshots, I saved the following files using UTF-8 encoding:

This is what you'll find in the file arabic.html:

    <html>
    <body style="font-family: Noto Naskh Arabic">
    رقم التعميم رقم التعميم رقم التعميم
    </body>
    </html>

This is what you'll find in the file arabic2.html:

    <html>
    <body style="font-family: Noto Naskh Arabic">
    <table>
    <tr>
    <td dir="rtl">رقم التعميم رقم التعميم</td>
    <td dir="rtl">رقم التعميم</td>
    </tr>
    </table>
    </body>
    </html>
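Saving the files as UTF-8 only helps if the bytes are also decoded as UTF-8 when they are read. I can't tell from the question whether this is what happens in your case, but one common way for Arabic text to degrade into "?" is a conversion through a charset that has no Arabic characters. The sketch below uses only the JDK; the class and method names are invented for the illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {

    /** Encodes the text to bytes and decodes those bytes with the same charset. */
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        String arabic = "\u0631\u0642\u0645";

        // UTF-8 can represent every character, so the text comes back intact.
        System.out.println(roundTrip(arabic, StandardCharsets.UTF_8).equals(arabic)); // prints true

        // ISO-8859-1 has no Arabic letters; getBytes substitutes '?' for each one.
        System.out.println(roundTrip(arabic, StandardCharsets.ISO_8859_1)); // prints ???
    }
}
```

This is why the examples pass an explicit UTF-8 charset when parsing the HTML file rather than relying on the platform default.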

The second part of your problem concerns the font. It's important that you use a font that knows how to draw Arabic glyphs. I find it hard to believe that you have arial.ttf right in the root of your C: drive. That's not a good idea. I would expect you to use C:/windows/fonts/arialuni.ttf, which certainly knows Arabic glyphs.

Selecting the font isn't sufficient. Your HTML needs to know which font family to use. Because most of the examples in the documentation use Arial, I decided to use a NOTO font. I discovered these fonts by reading this question: iText pdf not displaying Chinese characters when using NOTO fonts or Source Hans. I really like these fonts because they are nice and (almost) every language is supported. For instance, I used NotoNaskhArabic-Regular.ttf, which means that I need to define the font family like this:

style="font-family: Noto Naskh Arabic"

I defined the style in the body tag of my HTML. Obviously, you can choose where to define it: in an external CSS file, in the styles section of the <head>, at the level of a <td> tag, and so on. That choice is entirely yours, but you have to define somewhere which font to use.

Of course, when XML Worker encounters font-family: Noto Naskh Arabic, iText doesn't know where to find the corresponding NotoNaskhArabic-Regular.ttf unless we register that font. We can do this by creating an instance of the FontProvider interface. I chose to use XMLWorkerFontProvider, but you're free to write your own FontProvider implementation:

    XMLWorkerFontProvider fontProvider = new XMLWorkerFontProvider(XMLWorkerFontProvider.DONTLOOKFORFONTS);
    fontProvider.register("resources/fonts/NotoNaskhArabic-Regular.ttf");

There is one more hurdle to take: Arabic is written from right to left. I see that you want to define the run direction at the level of the PdfPCell and that you add the HTML content to this cell using an ElementList. That's why I first wrote a similar example, named ParseHtml7:

    public void createPdf(String file) throws IOException, DocumentException {
        // step 1
        Document document = new Document();
        // step 2
        PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(file));
        // step 3
        document.open();
        // step 4
        // Styles
        CSSResolver cssResolver = new StyleAttrCSSResolver();
        XMLWorkerFontProvider fontProvider = new XMLWorkerFontProvider(XMLWorkerFontProvider.DONTLOOKFORFONTS);
        fontProvider.register("resources/fonts/NotoNaskhArabic-Regular.ttf");
        CssAppliers cssAppliers = new CssAppliersImpl(fontProvider);
        // HTML
        HtmlPipelineContext htmlContext = new HtmlPipelineContext(cssAppliers);
        htmlContext.setTagFactory(Tags.getHtmlTagProcessorFactory());
        // Pipelines
        ElementList elements = new ElementList();
        ElementHandlerPipeline pdf = new ElementHandlerPipeline(elements, null);
        HtmlPipeline html = new HtmlPipeline(htmlContext, pdf);
        CssResolverPipeline css = new CssResolverPipeline(cssResolver, html);
        // XML Worker
        XMLWorker worker = new XMLWorker(css, true);
        XMLParser p = new XMLParser(worker);
        p.parse(new FileInputStream(HTML), Charset.forName("UTF-8"));
        // Add the parsed elements to a cell with right-to-left run direction
        PdfPTable table = new PdfPTable(1);
        PdfPCell cell = new PdfPCell();
        cell.setRunDirection(PdfWriter.RUN_DIRECTION_RTL);
        for (Element e : elements) {
            cell.addElement(e);
        }
        table.addCell(cell);
        document.add(table);
        // step 5
        document.close();
    }

There is no table in the HTML, but we create our own PdfPTable, add the content from the HTML to a PdfPCell with run direction RTL, then add this cell to the table and the table to the document.

Maybe that's your actual requirement, but why would you do this in such a convoluted way? If you need a table, why not create that table in HTML and define some cells as RTL, like this:

<td dir="rtl">...</td>

That way, you don't have to create an ElementList; you can simply parse the HTML to PDF, as is done in the ParseHtml8 example:

    public void createPdf(String file) throws IOException, DocumentException {
        // step 1
        Document document = new Document();
        // step 2
        PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(file));
        // step 3
        document.open();
        // step 4
        // Styles
        CSSResolver cssResolver = new StyleAttrCSSResolver();
        XMLWorkerFontProvider fontProvider = new XMLWorkerFontProvider(XMLWorkerFontProvider.DONTLOOKFORFONTS);
        fontProvider.register("resources/fonts/NotoNaskhArabic-Regular.ttf");
        CssAppliers cssAppliers = new CssAppliersImpl(fontProvider);
        HtmlPipelineContext htmlContext = new HtmlPipelineContext(cssAppliers);
        htmlContext.setTagFactory(Tags.getHtmlTagProcessorFactory());
        // Pipelines
        PdfWriterPipeline pdf = new PdfWriterPipeline(document, writer);
        HtmlPipeline html = new HtmlPipeline(htmlContext, pdf);
        CssResolverPipeline css = new CssResolverPipeline(cssResolver, html);
        // XML Worker
        XMLWorker worker = new XMLWorker(css, true);
        XMLParser p = new XMLParser(worker);
        p.parse(new FileInputStream(HTML), Charset.forName("UTF-8"));
        // step 5
        document.close();
    }

Less code is needed in this example, and when you want to change the layout, it's enough to change the HTML. You don't need to change your Java code.
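If the cell text comes from application data instead of a static file, the same HTML can be assembled in Java before it is handed to the parser. What follows is only a sketch under a few assumptions: the class and method names are hypothetical, the font-family value is the one used in arabic.html, and the text is escaped because XML Worker expects well-formed markup.

```java
public class TwoColumnHtml {

    /** Escapes the characters that would otherwise break well-formed markup. */
    static String escape(String text) {
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;");
    }

    /** One row: English in the left cell, Arabic in the right cell marked dir="rtl". */
    static String row(String english, String arabic) {
        return "<tr><td>" + escape(english) + "</td>"
             + "<td dir=\"rtl\" style=\"font-family: Noto Naskh Arabic\">"
             + escape(arabic) + "</td></tr>";
    }

    /** Wraps all rows in a minimal html/body/table skeleton. */
    static String page(String[][] pairs) {
        StringBuilder sb = new StringBuilder("<html><body><table>");
        for (String[] pair : pairs) {
            sb.append(row(pair[0], pair[1]));
        }
        return sb.append("</table></body></html>").toString();
    }

    public static void main(String[] args) {
        // Sample pair: an English heading and its Arabic counterpart, written as escapes.
        String[][] pairs = {
            { "Circular No. (1)",
              "\u0631\u0642\u0645 \u0627\u0644\u062A\u0639\u0645\u064A\u0645 (1)" }
        };
        System.out.println(page(pairs));
    }
}
```

The resulting String would still have to be converted to a UTF-8 byte stream and parsed with the pipeline shown in ParseHtml8.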

One more example: in ParseHtml9, I create a table with an English name in one column ("Lawrence of Arabia") and the Arabic translation in the other column ("لورانس العرب"). Because I need different fonts for English and Arabic, I define the font at the <td> level:

    <table>
    <tr>
    <td>Lawrence of Arabia</td>
    <td dir="rtl" style="font-family: Noto Naskh Arabic">لورانس العرب</td>
    </tr>
    </table>

For the first column, the default font is used and no special settings are needed to write from left to right. For the second column, I define an Arabic font and set the run direction to "rtl".

The result looks like this:

That's much easier than what you're trying to do in your code.
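When it isn't known in advance which cells hold Arabic text, the JDK's java.text.Bidi class can help decide which value to write into the dir attribute. This is a small sketch outside of iText; the class and method names are invented, and it only looks at the base direction of the text.

```java
import java.text.Bidi;

public class RunDirection {

    /**
     * Returns "rtl" when the base direction of the text is right-to-left,
     * and "ltr" otherwise (including text without any strong directional character).
     */
    static String dirFor(String text) {
        Bidi bidi = new Bidi(text, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
        return bidi.baseIsLeftToRight() ? "ltr" : "rtl";
    }

    public static void main(String[] args) {
        System.out.println(dirFor("Lawrence of Arabia")); // prints ltr
        // The Arabic title from the table above, written as Unicode escapes.
        System.out.println(dirFor("\u0644\u0648\u0631\u0627\u0646\u0633 \u0627\u0644\u0639\u0631\u0628")); // prints rtl
    }
}
```

The returned value could then be placed in the dir attribute of the matching <td> when the HTML is generated.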