java - printer - Using PDFBox to determine the coordinates of words in a document


I am using PDFBox to extract the coordinates of words/strings in a PDF document, and so far I have managed to determine the position of individual characters. This is the code so far, taken from the PDFBox documentation:

    package printtextlocations;

    import java.io.*;
    import org.apache.pdfbox.exceptions.InvalidPasswordException;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.PDPage;
    import org.apache.pdfbox.pdmodel.common.PDStream;
    import org.apache.pdfbox.util.PDFTextStripper;
    import org.apache.pdfbox.util.TextPosition;
    import java.io.IOException;
    import java.util.List;

    public class PrintTextLocations extends PDFTextStripper {

        public PrintTextLocations() throws IOException {
            super.setSortByPosition(true);
        }

        public static void main(String[] args) throws Exception {
            PDDocument document = null;
            try {
                File input = new File("C:\\path\\to\\PDF.pdf");
                document = PDDocument.load(input);
                if (document.isEncrypted()) {
                    try {
                        document.decrypt("");
                    } catch (InvalidPasswordException e) {
                        System.err.println("Error: Document is encrypted with a password.");
                        System.exit(1);
                    }
                }
                PrintTextLocations printer = new PrintTextLocations();
                List allPages = document.getDocumentCatalog().getAllPages();
                for (int i = 0; i < allPages.size(); i++) {
                    PDPage page = (PDPage) allPages.get(i);
                    System.out.println("Processing page: " + i);
                    PDStream contents = page.getContents();
                    if (contents != null) {
                        printer.processStream(page, page.findResources(), page.getContents().getStream());
                    }
                }
            } finally {
                if (document != null) {
                    document.close();
                }
            }
        }

        /**
         * @param text The text to be processed
         */
        @Override /* this is questionable, not sure if needed... */
        protected void processTextPosition(TextPosition text) {
            System.out.println("String[" + text.getXDirAdj() + "," + text.getYDirAdj()
                    + " fs=" + text.getFontSize()
                    + " xscale=" + text.getXScale()
                    + " height=" + text.getHeightDir()
                    + " space=" + text.getWidthOfSpace()
                    + " width=" + text.getWidthDirAdj() + "]"
                    + text.getCharacter());
        }
    }

This produces a series of lines containing the position of each character, spaces included, that look like this:

String[202.5604,41.880127 fs=1.0 xscale=13.98 height=9.68814 space=3.8864403 width=9.324661]P

Where 'P' is the character. I have not been able to find a function in PDFBox that finds words, and I am not familiar enough with Java to reliably concatenate these characters back into words to search through, even though the spaces are included as well. Has anyone else been in a similar situation, and if so, how did you approach it? I really only need the coordinate of the first character of the word, which simplifies things, but how to match a string against that kind of output is beyond me.
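One way to match a string against per-character output like the line above is to concatenate the characters into a single string, search that string, and use the match index to look up the position of the first character. The following is only a minimal sketch of that idea in plain Java: `CharBox` and `FirstCharLocator` are hypothetical names, not PDFBox classes, and the sketch assumes each entry holds exactly one character so that string indices and list indices line up.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical holder for one character and its position (not a PDFBox class). */
final class CharBox {
    final String ch; // assumed to be exactly one character long
    final float x;
    final float y;

    CharBox(String ch, float x, float y) {
        this.ch = ch;
        this.x = x;
        this.y = y;
    }
}

/** Hypothetical helper that maps a text match back to its first character. */
final class FirstCharLocator {

    /**
     * Returns the CharBox of the first character of every case-insensitive
     * occurrence of needle in the concatenated characters.
     */
    static List<CharBox> locate(List<CharBox> chars, String needle) {
        List<CharBox> hits = new ArrayList<>();
        if (needle.isEmpty()) {
            return hits;
        }
        // Concatenate all characters, spaces included, in their given order.
        StringBuilder text = new StringBuilder();
        for (CharBox c : chars) {
            text.append(c.ch);
        }
        String haystack = text.toString();
        // Index i in the string is also index i in the list (one char per box).
        for (int i = 0; i + needle.length() <= haystack.length(); i++) {
            if (haystack.regionMatches(true, i, needle, 0, needle.length())) {
                hits.add(chars.get(i));
            }
        }
        return hits;
    }
}
```

In the code from the question, the list could be filled inside `processTextPosition` from `text.getCharacter()`, `text.getXDirAdj()` and `text.getYDirAdj()`, and searched once a page has been processed. If a character string can be longer than one character (a ligature, for example), the index mapping would need an extra lookup table.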

Based on the original idea, here is a version of the text search for PDFBox 2. The code itself is rough, but simple. It should get you started fairly quickly.

    import java.io.IOException;
    import java.io.Writer;
    import java.util.List;
    import java.util.Set;
    import lu.abac.pdfclient.data.PDFTextLocation;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;
    import org.apache.pdfbox.text.TextPosition;

    public class PrintTextLocator extends PDFTextStripper {

        private final Set<PDFTextLocation> locations;

        public PrintTextLocator(PDDocument document, Set<PDFTextLocation> locations) throws IOException {
            super.setSortByPosition(true);
            this.document = document;
            this.locations = locations;
            // Discard the extracted text; only the positions are of interest.
            this.output = new Writer() {
                @Override
                public void write(char[] cbuf, int off, int len) throws IOException {
                }

                @Override
                public void flush() throws IOException {
                }

                @Override
                public void close() throws IOException {
                }
            };
        }

        public Set<PDFTextLocation> doSearch() throws IOException {
            processPages(document.getDocumentCatalog().getPages());
            return locations;
        }

        @Override
        protected void writeString(String text, List<TextPosition> textPositions) throws IOException {
            super.writeString(text);
            String searchText = text.toLowerCase();
            for (PDFTextLocation textLoc : locations) {
                int start = searchText.indexOf(textLoc.getText().toLowerCase());
                if (start != -1) {
                    // found
                    TextPosition pos = textPositions.get(start);
                    textLoc.setFound(true);
                    textLoc.setPage(getCurrentPageNo());
                    textLoc.setX(pos.getXDirAdj());
                    textLoc.setY(pos.getYDirAdj());
                }
            }
        }
    }
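The `PDFTextLocation` class imported above is the answer author's own class and is not shown. As an illustration only, a stand-in could look like the sketch below. Its shape is an assumption inferred from the calls made in `writeString` (`getText`, `setFound`, `setPage`, `setX`, `setY`); the constructor, the extra getters and the equality rule are guesses, not the author's actual code.

```java
import java.util.Objects;

/** Hypothetical stand-in for lu.abac.pdfclient.data.PDFTextLocation. */
final class PDFTextLocation {
    private final String text; // the text to search for
    private boolean found;     // set once a match has been seen
    private int page;          // page number of the match
    private float x;           // x of the first matched character
    private float y;           // y of the first matched character

    PDFTextLocation(String text) {
        this.text = Objects.requireNonNull(text);
    }

    String getText() { return text; }

    boolean isFound() { return found; }
    void setFound(boolean found) { this.found = found; }

    int getPage() { return page; }
    void setPage(int page) { this.page = page; }

    float getX() { return x; }
    void setX(float x) { this.x = x; }

    float getY() { return y; }
    void setY(float y) { this.y = y; }

    // Equality uses only the immutable search text, so the object can be
    // mutated safely while it is stored in a HashSet.
    @Override
    public boolean equals(Object o) {
        return o instanceof PDFTextLocation && text.equals(((PDFTextLocation) o).text);
    }

    @Override
    public int hashCode() {
        return text.hashCode();
    }
}
```

With a class like this, a caller would put one `PDFTextLocation` per search term into a `Set`, pass it to the `PrintTextLocator` constructor, call `doSearch()`, and then read the page and coordinates back from the entries whose `isFound()` is true.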

I got this working in C# and .NET using the IKVM conversion, PDFBox.NET 1.8.9.

I eventually found that the character (glyph) coordinates are private to the .NET assembly, but they can be accessed using System.Reflection.

I posted a complete example of how to get the coordinates of WORDS and draw them over images of the PDF using SVG and HTML here: https://github.com/tsamop/PDF_Interpreter

For the example below you need PDFBox.NET: http://www.squarepdf.net/pdfbox-in-net , and you need to add references to it in your project.

It took me quite a while to figure this out, so I really hope it saves someone else some time!

If you only need to know where to look for the characters and coordinates, a shortened version would be:

    using System;
    using System.Reflection;
    using org.apache.pdfbox.pdmodel;
    using org.apache.pdfbox.util;

    // to test run pdfTest.RunTest(@"C:/temp/test_2.pdf");
    class pdfTest
    {
        // simple example for getting character (glyph) coordinates out of a pdf doc.
        // a more complete example is here: https://github.com/tsamop/PDF_Interpreter
        public static void RunTest(string sFilename)
        {
            // probably a better way to get page count, but I cut this out of a bigger project.
            PDDocument oDoc = PDDocument.load(sFilename);
            object[] oPages = oDoc.getDocumentCatalog().getAllPages().toArray();

            int iPageNo = 0; // 1's based!!
            foreach (object oPage in oPages)
            {
                iPageNo++;

                // feed the stripper a page.
                PDFTextStripper tStripper = new PDFTextStripper();
                tStripper.setStartPage(iPageNo);
                tStripper.setEndPage(iPageNo);
                tStripper.getText(oDoc);

                // This gets the "charactersByArticle" private object in PDF Box.
                FieldInfo charactersByArticleInfo = typeof(PDFTextStripper).GetField("charactersByArticle", BindingFlags.NonPublic | BindingFlags.Instance);
                object charactersByArticle = charactersByArticleInfo.GetValue(tStripper);

                object[] aoArticles = (object[])charactersByArticle.GetField("elementData");

                foreach (object oArticle in aoArticles)
                {
                    if (oArticle != null)
                    {
                        // THE CHARACTERS within the article
                        object[] aoCharacters = (object[])oArticle.GetField("elementData");

                        foreach (object oChar in aoCharacters)
                        {
                            /* properties I caught using reflection:
                             * endX, endY, font, fontSize, fontSizePt, maxTextHeight, pageHeight, pageWidth,
                             * rot, str, textPos, unicodeCP, widthOfSpace, widths, wordSpacing, x, y
                             */
                            if (oChar != null)
                            {
                                // this is a really quick test.
                                // for a more complete solution that pulls the characters into words and displays
                                // the word positions on the page, try this: https://github.com/tsamop/PDF_Interpreter

                                // the Y's appear to be the bottom of the char?
                                double mfMaxTextHeight = Convert.ToDouble(oChar.GetField("maxTextHeight")); // I think this is the height of the character/word

                                char mcThisChar = oChar.GetField("str").ToString().ToCharArray()[0];
                                double mfX = Convert.ToDouble(oChar.GetField("x"));
                                double mfY = Convert.ToDouble(oChar.GetField("y")) - mfMaxTextHeight;

                                // CALCULATE THE OTHER SIDE OF THE GLYPH
                                double mfWidth0 = ((Single[])oChar.GetField("widths"))[0];
                                double mfXend = mfX + mfWidth0; // Convert.ToDouble(oChar.GetField("endX"));

                                // CALCULATE THE BOTTOM OF THE GLYPH.
                                double mfYend = mfY + mfMaxTextHeight; // Convert.ToDouble(oChar.GetField("endY"));

                                double mfPageHeight = Convert.ToDouble(oChar.GetField("pageHeight"));
                                double mfPageWidth = Convert.ToDouble(oChar.GetField("pageWidth"));

                                System.Diagnostics.Debug.Print(@"add some stuff to test {0}, {1}, {2}", mcThisChar, mfX, mfY);
                            }
                        }
                    }
                }
            }
        }
    }

    /// <summary>
    /// To deal with the Java interface hiding necessary properties! ~mwr
    /// </summary>
    public static class GetField_Extension
    {
        public static object GetField(this object randomPDFboxObject, string sFieldName)
        {
            FieldInfo itemInfo = randomPDFboxObject.GetType().GetField(sFieldName, BindingFlags.NonPublic | BindingFlags.Instance);
            return itemInfo.GetValue(randomPDFboxObject);
        }
    }

Take a look at this, I think it is what you need.

https://jackson-brain.com/using-pdfbox-to-locate-text-coordinates-within-a-pdf-in-java/

Here is the code:

    import java.io.File;
    import java.io.IOException;
    import java.text.DecimalFormat;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import org.apache.pdfbox.exceptions.InvalidPasswordException;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.PDPage;
    import org.apache.pdfbox.pdmodel.common.PDStream;
    import org.apache.pdfbox.util.PDFTextStripper;
    import org.apache.pdfbox.util.TextPosition;

    public class PrintTextLocations extends PDFTextStripper {

        public static StringBuilder tWord = new StringBuilder();
        public static String seek;
        public static String[] seekA;
        public static List wordList = new ArrayList();
        public static boolean is1stChar = true;
        public static boolean lineMatch;
        public static int pageNo = 1;
        public static double lastYVal;

        public PrintTextLocations() throws IOException {
            super.setSortByPosition(true);
        }

        public static void main(String[] args) throws Exception {
            PDDocument document = null;
            seekA = args[1].split(",");
            seek = args[1];
            try {
                File input = new File(args[0]);
                document = PDDocument.load(input);
                if (document.isEncrypted()) {
                    try {
                        document.decrypt("");
                    } catch (InvalidPasswordException e) {
                        System.err.println("Error: Document is encrypted with a password.");
                        System.exit(1);
                    }
                }
                PrintTextLocations printer = new PrintTextLocations();
                List allPages = document.getDocumentCatalog().getAllPages();
                for (int i = 0; i < allPages.size(); i++) {
                    PDPage page = (PDPage) allPages.get(i);
                    PDStream contents = page.getContents();
                    if (contents != null) {
                        printer.processStream(page, page.findResources(), page.getContents().getStream());
                    }
                    pageNo += 1;
                }
            } finally {
                if (document != null) {
                    System.out.println(wordList);
                    document.close();
                }
            }
        }

        @Override
        protected void processTextPosition(TextPosition text) {
            String tChar = text.getCharacter();
            System.out.println("String[" + text.getXDirAdj() + "," + text.getYDirAdj()
                    + " fs=" + text.getFontSize()
                    + " xscale=" + text.getXScale()
                    + " height=" + text.getHeightDir()
                    + " space=" + text.getWidthOfSpace()
                    + " width=" + text.getWidthDirAdj() + "]"
                    + text.getCharacter());
            // Punctuation that ends a word.
            String REGEX = "[,.\\[\\](:;!?)/]";
            char c = tChar.charAt(0);
            lineMatch = matchCharLine(text);
            if ((!tChar.matches(REGEX)) && (!Character.isWhitespace(c))) {
                if ((!is1stChar) && (lineMatch == true)) {
                    appendChar(tChar);
                } else if (is1stChar == true) {
                    setWordCoord(text, tChar);
                }
            } else {
                endWord();
            }
        }

        protected void appendChar(String tChar) {
            tWord.append(tChar);
            is1stChar = false;
        }

        // Starts a new word, prefixed with the page number and the coordinates of its first character.
        protected void setWordCoord(TextPosition text, String tChar) {
            tWord.append("(").append(pageNo).append(")[")
                    .append(roundVal(Float.valueOf(text.getXDirAdj()))).append(" : ")
                    .append(roundVal(Float.valueOf(text.getYDirAdj()))).append("] ")
                    .append(tChar);
            is1stChar = false;
        }

        protected void endWord() {
            // Strip non-ASCII characters before comparing.
            String newWord = tWord.toString().replaceAll("[^\\x00-\\x7F]", "");
            String sWord = newWord.substring(newWord.lastIndexOf(' ') + 1);
            if (!"".equals(sWord)) {
                if (Arrays.asList(seekA).contains(sWord)) {
                    wordList.add(newWord);
                } else if ("SHOWMETHEMONEY".equals(seek)) {
                    wordList.add(newWord);
                }
            }
            tWord.delete(0, tWord.length());
            is1stChar = true;
        }

        // Returns true if this character is on the same line (same rounded Y) as the previous one.
        protected boolean matchCharLine(TextPosition text) {
            Double yVal = roundVal(Float.valueOf(text.getYDirAdj()));
            if (yVal.doubleValue() == lastYVal) {
                return true;
            }
            lastYVal = yVal.doubleValue();
            endWord();
            return false;
        }

        protected Double roundVal(Float yVal) {
            DecimalFormat rounded = new DecimalFormat("0.0'0'");
            Double yValDub = Double.valueOf(rounded.format(yVal));
            return yValDub;
        }
    }

Dependencies:

PDFBox, FontBox, Apache Commons Logging.

You can run it from the command line by typing:

    javac PrintTextLocations.java
    sudo java PrintTextLocations file.pdf WORD1,WORD2,....

The output looks similar to:

[(1)[190.3 : 286.8] WORD1, (1)[283.3 : 286.8] WORD2, ...]

There is no function in PDFBox that extracts words for you automatically. I am currently working on extracting data and gathering it into blocks, and here is my process:

I extract all the characters of the document (called glyphs) and store them in a list.
I analyze the coordinates of each glyph by looping over the list. If two glyphs overlap vertically (the top of the current glyph lies between the top and bottom of the preceding one, or the bottom of the current glyph lies between the top and bottom of the preceding one), I add the current glyph to the same line.
At this point I have extracted the different lines of the document (be careful: if your document has several columns, "line" here means all the glyphs that overlap vertically, that is, the text of every column that shares the same vertical coordinates).
Then you can compare the left coordinate of the current glyph with the right coordinate of the preceding one to decide whether they belong to the same word. The PDFTextStripper class provides a getSpacingTolerance() method that gives you a tolerance, arrived at by trial and error, for what counts as a "normal" space. If the gap between the right and left coordinates is smaller than this value, both glyphs belong to the same word.

I applied this method in my own work and it works well.
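The steps above can be sketched in plain Java, independent of PDFBox. This is only an illustration under stated assumptions: `Glyph` and `WordGrouper` are hypothetical names, the glyphs are assumed to arrive in reading order with y growing downward (so top is smaller than bottom), and the gap threshold is passed in as a plain distance in the same units as the coordinates rather than taken from `getSpacingTolerance()`.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical glyph box; y grows downward, so top is smaller than bottom. */
final class Glyph {
    final String ch;
    final float left, right, top, bottom;

    Glyph(String ch, float left, float right, float top, float bottom) {
        this.ch = ch;
        this.left = left;
        this.right = right;
        this.top = top;
        this.bottom = bottom;
    }
}

/** Hypothetical helper that groups glyphs into words as described above. */
final class WordGrouper {

    /** True if the top or the bottom of cur lies between the top and bottom of prev. */
    static boolean sameLine(Glyph prev, Glyph cur) {
        boolean topInside = cur.top >= prev.top && cur.top <= prev.bottom;
        boolean bottomInside = cur.bottom >= prev.top && cur.bottom <= prev.bottom;
        return topInside || bottomInside;
    }

    /** Splits glyphs into words; maxGap is the assumed threshold for a word break. */
    static List<List<Glyph>> words(List<Glyph> glyphs, float maxGap) {
        List<List<Glyph>> words = new ArrayList<>();
        List<Glyph> current = new ArrayList<>();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            boolean blank = g.ch.trim().isEmpty();
            // A new line or a wide horizontal gap ends the current word.
            boolean breakBefore = prev != null
                    && (!sameLine(prev, g) || g.left - prev.right >= maxGap);
            if ((blank || breakBefore) && !current.isEmpty()) {
                words.add(current);
                current = new ArrayList<>();
            }
            if (!blank) {
                current.add(g);
            }
            prev = blank ? null : g;
        }
        if (!current.isEmpty()) {
            words.add(current);
        }
        return words;
    }

    /** Concatenates the characters of one word. */
    static String text(List<Glyph> word) {
        StringBuilder sb = new StringBuilder();
        for (Glyph g : word) {
            sb.append(g.ch);
        }
        return sb.toString();
    }
}
```

The first glyph of each returned word carries the word's starting coordinate (`word.get(0).left` and `word.get(0).top`). As the answer warns, a multi-column page would first need its glyphs split by column, since this sketch treats everything at the same height as one line.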