How to extract text from a PDF file in Python?


If you are running Linux or Mac, you can call the ps2ascii command from your code:

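    import os

    # Renamed from `input` so the built-in function is not shadowed
    input_path = "someFile.pdf"
    output_path = "out.txt"

    # ps2ascii is part of Ghostscript and must be on the PATH
    os.system("ps2ascii %s %s" % (input_path, output_path))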
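As a possible variation on the snippet above, the same command can be run through `subprocess` instead of `os.system`, so that file names containing spaces are passed safely and a failed conversion raises an error. This is only a sketch: the helper names `ps2ascii_command` and `pdf_to_text` are made up for illustration, and it assumes `ps2ascii` accepts the input file followed by the output file, as in the command above.

```python
import shutil
import subprocess


def ps2ascii_command(pdf_path, txt_path):
    """Build the argument list: the program name, the input file, then the output file."""
    return ["ps2ascii", pdf_path, txt_path]


def pdf_to_text(pdf_path, txt_path):
    """Run ps2ascii on pdf_path and write the extracted text to txt_path."""
    # Hypothetical helper: fail early with a clear message if the tool is missing.
    if shutil.which("ps2ascii") is None:
        raise RuntimeError("ps2ascii not found; it is normally installed with Ghostscript")
    # A list of arguments bypasses the shell, so no quoting is needed.
    # check=True raises CalledProcessError if ps2ascii exits with an error.
    subprocess.run(ps2ascii_command(pdf_path, txt_path), check=True)


# Example call (requires Ghostscript and an existing PDF):
# pdf_to_text("someFile.pdf", "out.txt")
```

Passing the arguments as a list rather than a formatted string is the main design choice here: a path such as "my report.pdf" would be split into two arguments by the shell in the `os.system` version.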

How can I extract text from a PDF file in Python?

I tried the following:

    import sys
    import pyPdf

    def convertPdf2String(path):
        content = ""
        pdf = pyPdf.PdfFileReader(file(path, "rb"))
        # Concatenate the text of every page
        for i in range(0, pdf.getNumPages()):
            content += pdf.getPage(i).extractText() + " \n"
        # Replace non-breaking spaces and collapse runs of whitespace
        content = " ".join(content.replace(u"\xa0", u" ").strip().split())
        return content

    f = open('a.txt', 'w+')
    f.write(convertPdf2String(sys.argv[1]).encode("ascii", "xmlcharrefreplace"))
    f.close()

But the output looks like this, instead of readable text:

728; ˇˆ˜ ˚ˇˇ! "" ˘ˇˆ˙ˆ˝˛˛˛˛ˆ˜ˆ ˆ ˆ˘ˆ˛˙ˆ "ˆ˘" ˆˆˆ˜ # $ ˙ˆ˚ˆ% & ˆ ˘˛ˆ˜''''% ˝˛ˆˇ˙ ˜ˆˆ˜''ˆ ˇˆ # $% & (''% $ &)) $ $ +% #, -. + && ˝ ())) ˝ + ,, -. / 012) (˝) * ˝ +, - 3˙ˆ / 0245) 6 # 57 + 82,55) 6 # 57 +, + 2, + /! # !! & ˘˘1 "% ˘20˛˛3ˆ07% 4!" 6 ˛ ˆ ˝ˆ ˆ˘ & / & 4 "9ˆ% 6ˇ% 4% 4 y 5˘2) ˘˘˛%: 6 (