
Parsing broken XML with lxml.etree.iterparse

I am trying to parse a large XML file with lxml in a memory-efficient way (that is, streaming lazily from disk instead of loading the whole file into memory). Unfortunately, the file contains some bad ASCII characters that break the default parser. The parser works if I set recover=True, but the iterparse method does not take the recover parameter or a custom parser object. Does anyone know how to use iterparse to parse broken XML?

    import lxml.etree

    # This works, but loads the whole file into memory.
    parser = lxml.etree.XMLParser(recover=True)  # recovers from bad characters
    tree = lxml.etree.parse(filename, parser)

    # How do I do the equivalent with iterparse? (Using iterparse so the file
    # can be streamed lazily from disk.)
    context = lxml.etree.iterparse(filename, tag='RECORD')
    # RECORD contains 6 elements that I need to extract the text from

Thanks for your help!

EDIT - Here is an example of the kind of encoding errors I am running into:

    In [17]: data
    Out[17]: '\t<articletext>The cafeteria rang with excited voices. Our barbershop quartet, The Bell \r Tones was asked to perform at the local Home for the Blind in the next town. We, of course, were glad to entertain such a worthy group and immediately agreed . One wag joked, "Which uniform should we wear?" followed with, "Oh, that\'s right, they\'ll never notice." The others didn\'t respond to this, in fact, one said that we should wear the nicest outfit we had.A small stage was set up for us and a pretty decent P.A. system was donated for the occasion. The audience was made up of blind persons of every age, from the thirties to the nineties. Some sported sighted companions or nurses who stood or sat by their side, sharing the moment equally. I observed several German shepherds lying at their feet, adoration showing in their eyes as they wondered what was going on. After a short introduction in which we identified ourselves, stating our voice part and a little about our livelihood, we began our program. Some songs were completely familiar and others, called "Oh, yeah" songs, only the chorus came to mind. We didn\'t mind at all that some sang along \x1e they enjoyed it so much.In fact, a popular part of our program is when the audience gets to sing some of the old favorites. The harmony parts were quite evident as they tried their voices to the different parts. I think there was more group singing in the old days than there is now, but to blind people, sound and music is more important. We received a big hand at the finale and were made to promise to return the following year. Everyone was treated to coffee and cake, our quartet going around to the different circles of friends to sing a favorite song up close and personal. As we approached a new group, one blind lady amazed me by turning to me saying, "You\'re the baritone, aren\'t you?" 
    Previously no one had ever been able to tell which singer sang which part but this lady was listening with her whole heart.Retired portrait photographer. Main hobby - quartet singing.</articletext>\n'

    In [18]: lxml.etree.from
    lxml.etree.fromstring      lxml.etree.fromstringlist

    In [18]: lxml.etree.fromstring(data)
    ---------------------------------------------------------------------------
    XMLSyntaxError                            Traceback (most recent call last)

    /mnt/articles/<ipython console> in <module>()

    /usr/lib/python2.5/site-packages/lxml-2.2.4-py2.5-linux-i686.egg/lxml/etree.so in lxml.etree.fromstring (src/lxml/lxml.etree.c:48270)()

    /usr/lib/python2.5/site-packages/lxml-2.2.4-py2.5-linux-i686.egg/lxml/etree.so in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:71812)()

    /usr/lib/python2.5/site-packages/lxml-2.2.4-py2.5-linux-i686.egg/lxml/etree.so in lxml.etree._parseDoc (src/lxml/lxml.etree.c:70673)()

    /usr/lib/python2.5/site-packages/lxml-2.2.4-py2.5-linux-i686.egg/lxml/etree.so in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:67442)()

    /usr/lib/python2.5/site-packages/lxml-2.2.4-py2.5-linux-i686.egg/lxml/etree.so in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:63824)()

    /usr/lib/python2.5/site-packages/lxml-2.2.4-py2.5-linux-i686.egg/lxml/etree.so in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:64745)()

    /usr/lib/python2.5/site-packages/lxml-2.2.4-py2.5-linux-i686.egg/lxml/etree.so in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64088)()

    XMLSyntaxError: PCDATA invalid Char value 30, line 1, column 1190

    In [19]: chardet.detect(data)
    Out[19]: {'confidence': 1.0, 'encoding': 'ascii'}

As you can see, chardet thinks it is an ASCII file, but there is a "\x1e" right in the middle of this example that makes lxml raise an exception.

I solved the problem by creating a class with a file-like object interface. The class's read() method reads a line from the file and replaces any "bad characters" before returning the line to iterparse.

    import lxml.etree

    class myFile(object):
        def __init__(self, filename):
            # Binary mode, so the parser receives raw bytes.
            self.f = open(filename, 'rb')

        def read(self, size=None):
            # Return one cleaned line per call; b'' at end of file tells the parser to stop.
            line = next(self.f, b'')
            # Chain further .replace() calls here for other bad characters.
            return line.replace(b'\x1e', b'')

    # iterparse
    context = lxml.etree.iterparse(myFile('bigfile.xml'), tag='RECORD')

I had to edit the myFile class a few times, adding more replace() calls for a few other characters that were making lxml choke. I think lxml's SAX parsing would also have worked (it seems to support the recover option), but this solution worked like a charm!
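As a rough sketch of the same idea without a hand-maintained list of replace() calls: the wrapper below strips every control byte that XML 1.0 forbids before the parser sees it. It is an illustration, not the code from this answer. The names CleanedFile and iter_record_texts are made up, it uses the standard library's xml.etree.ElementTree.iterparse instead of lxml so that it has no third-party dependency, and it assumes the file is ASCII or UTF-8 (where bytes below 0x20 never occur inside a multi-byte sequence).

```python
import io
import re
import xml.etree.ElementTree as ET

# Control bytes that XML 1.0 forbids: everything below 0x20 except \t, \n and \r.
_BAD_BYTES = re.compile(rb'[\x00-\x08\x0b\x0c\x0e-\x1f]')


class CleanedFile(object):
    """Hypothetical file-like wrapper that drops invalid control bytes."""

    def __init__(self, raw):
        self.raw = raw  # any binary file object

    def read(self, size=-1):
        while True:
            chunk = self.raw.read(size)
            cleaned = _BAD_BYTES.sub(b'', chunk)
            # Only return b'' at real end of file, never because a chunk
            # happened to consist entirely of bad bytes.
            if cleaned or not chunk:
                return cleaned


def iter_record_texts(fileobj, tag='RECORD'):
    """Yield the child texts of each <tag> element, one record at a time."""
    for event, elem in ET.iterparse(CleanedFile(fileobj), events=('end',)):
        if elem.tag == tag:
            yield [child.text for child in elem]
            elem.clear()  # drop the record's children to keep memory use low


sample = io.BytesIO(
    b'<ROOT><RECORD><A>x\x1ey</A><B>z</B></RECORD>'
    b'<RECORD><A>q</A></RECORD></ROOT>'
)
records = list(iter_record_texts(sample))
```

With a real dump, something like iter_record_texts(open('bigfile.xml', 'rb')) would play the role of the myFile wrapper above.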

Edit:

This is an older answer and I would have done it differently today. And I'm not just referring to the dumb snark... since then, BeautifulSoup4 has become available and it's quite nice. I recommend it to anyone who stumbles in here.

The currently accepted answer is, well, not what one should do. The question itself also makes a bad assumption:

    parser = lxml.etree.XMLParser(recover=True)  # recovers from bad characters.

Actually, recover=True is for recovering from malformed XML. There is, however, an "encoding" option which would have fixed your problem.

    parser = lxml.etree.XMLParser(
        encoding='utf-8',  # Your encoding issue.
        recover=True,  # I assume you probably still want to recover from bad XML, it's quite nice. If not, remove.
    )

That's it, that's the solution.

By the way: this is for anyone struggling with parsing XML in Python, especially from third-party sources. I know, I know, the documentation is bad and there are a lot of SO red herrings; a lot of bad advice.

lxml.etree.fromstring()? - That's for perfectly formed XML, silly
BeautifulStoneSoup? - Slow, and has a way-stupid policy for self-closing tags
lxml.etree.HTMLParser()? - (because the XML is broken) Here's a secret - HTMLParser() is... a parser with recover=True
lxml.html.soupparser? - The encoding detection is supposed to be better, but it has the same failings as BeautifulSoup for self-closing tags. Perhaps you can combine XMLParser with BeautifulSoup's UnicodeDammit
UnicodeDammit and other cockamamie stuff to fix encodings? - Well, UnicodeDammit is kind of cute, I like the name and it's useful for stuff beyond XML, but things are usually fixed if you do the right thing with XMLParser()

You could try all kinds of things from what's available online. The lxml documentation could be better. The code above is what you need for 90% of your XML parsing cases. Here I'll restate it:

    from StringIO import StringIO  # Python 2 (io.BytesIO on Python 3)
    from lxml import etree

    magical_parser = etree.XMLParser(encoding='utf-8', recover=True)
    tree = etree.parse(StringIO(your_xml_string), magical_parser)  # or pass in an open file object

You're welcome. My headaches == your sanity. Plus, it has other features you might need for, you know, XML.

Edit your question, stating what happens (exact error message and traceback (copy/paste, don't type from memory)) to make you think that the problem is "bad Unicode".

Get chardet and feed it your MySQL dump. Tell us what it says.

Show us the first 200 to 300 bytes of your dump, using e.g. print repr(dump[:300])

Update You wrote """As you can see, chardet thinks it is an ascii file, but there is a "\x1e" right in the middle of this example that makes lxml raise an exception."""

I don't see any "bad unicode" here.

chardet is correct. What makes you think that "\x1e" is not ASCII? It is an ASCII character, a C0 control character named "RECORD SEPARATOR".

The error message says that you have an invalid character. That is also correct. The only control characters that are valid in XML are "\t", "\r" and "\n". MySQL should be complaining about that and/or offering you a way of escaping it, e.g. _x001e_ (yuk!)

Given the context, it looks like that character could be deleted with no loss. You may wish to fix your database, or you may wish to remove such characters from your dump (after checking that they are all removable), or you may wish to choose a less picky and less voluminous output format than XML.
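To make the "check, then remove" step concrete, here is a small sketch that takes stock of the characters XML 1.0 does not allow before stripping them. The function names are invented for illustration; the character ranges are the complement of the "Char" production in the XML 1.0 specification, which is wider than just the C0 controls mentioned above.

```python
import re
from collections import Counter

# Anything outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHAR = re.compile(
    r'[^\t\n\r\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]'
)


def invalid_xml_chars(text):
    """Count each character in text that XML 1.0 does not allow."""
    return Counter(_INVALID_XML_CHAR.findall(text))


def strip_invalid_xml_chars(text):
    """Return text with every XML-invalid character removed."""
    return _INVALID_XML_CHAR.sub('', text)


sample = 'The Bell \r Tones\tsang along \x1e they\n'
found = invalid_xml_chars(sample)         # which offenders, and how many
cleaned = strip_invalid_xml_chars(sample)
```

Looking at the counts first shows whether the offenders really are throwaway separators like "\x1e" or something that carries meaning and needs a different fix.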

Update 2 You presumably want iterparse() not because it is your ultimate goal but because you want to save memory. If you used a format like CSV, you wouldn't have a memory problem.
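A minimal sketch of that point, assuming the dump were exported as CSV with made-up column names id and articletext: csv.reader pulls one line at a time from the file object, so memory use stays flat, and a stray "\x1e" is just another character in a field.

```python
import csv
import io


def iter_rows(fileobj):
    """Yield one row at a time; csv.reader reads lazily from fileobj."""
    for row in csv.reader(fileobj):
        yield row


# In-memory stand-in for a CSV dump; a real file would be opened with
# open(path, newline='') as the csv module documentation recommends.
sample = io.StringIO('id,articletext\n1,sang along \x1e they enjoyed it\n')
rows = list(iter_rows(sample))
```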

Update 3 In response to a comment by @Purrell:

try it yourself, dude. pastie.org/3280965

Here are the contents of that pastie; it deserves preservation:

    from lxml.etree import etree

    data = '\t<articletext>The cafeteria rang with excited voices. Our barbershop quartet, The Bell \r Tones was asked to perform at the local Home for the Blind in the next town. We, of course, were glad to entertain such a worthy group and immediately agreed . One wag joked, "Which uniform should we wear?" followed with, "Oh, that\'s right, they\'ll never notice." The others didn\'t respond to this, in fact, one said that we should wear the nicest outfit we had.A small stage was set up for us and a pretty decent P.A. system was donated for the occasion. The audience was made up of blind persons of every age, from the thirties to the nineties. Some sported sighted companions or nurses who stood or sat by their side, sharing the moment equally. I observed several German shepherds lying at their feet, adoration showing in their eyes as they wondered what was going on. After a short introduction in which we identified ourselves, stating our voice part and a little about our livelihood, we began our program. Some songs were completely familiar and others, called "Oh, yeah" songs, only the chorus came to mind. We didn\'t mind at all that some sang along \x1e they enjoyed it so much.In fact, a popular part of our program is when the audience gets to sing some of the old favorites. The harmony parts were quite evident as they tried their voices to the different parts. I think there was more group singing in the old days than there is now, but to blind people, sound and music is more important. We received a big hand at the finale and were made to promise to return the following year. Everyone was treated to coffee and cake, our quartet going around to the different circles of friends to sing a favorite song up close and personal. As we approached a new group, one blind lady amazed me by turning to me saying, "You\'re the baritone, aren\'t you?" 
    Previously no one had ever been able to tell which singer sang which part but this lady was listening with her whole heart.Retired portrait photographer. Main hobby - quartet singing.</articletext>\n'

    magical_parser = etree.XMLParser(encoding='utf-8', recover=True)
    tree = etree.parse(StringIO(data), magical_parser)

To get it to run, one import needs to be fixed and another supplied. The data is monstrous. There is no output to show the result. Here is a replacement with the data cut down to the bare essentials. The 5 pieces of ASCII text (excluding < and >) that are all valid XML characters are replaced by t1, ..., t5. The offending \x1e is flanked by t2 and t3.

    [output wraps at column 80]
    Python 2.7.2 (default, Jun 12 2011, 15:08:59) [MSC v.1500 32 bit (Intel)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> from lxml import etree
    >>> from cStringIO import StringIO
    >>> data = '<article>t1t2\x1et3t4 t5</article>'
    >>> magical_parser = etree.XMLParser(encoding='utf-8', recover=True)
    >>> tree = etree.parse(StringIO(data), magical_parser)
    >>> print(repr(tree.getroot().text))
    't1t2t3/ppt4/ppt5/p'

Not what I would call "recovery"; after the bad character, the < and > characters disappear.

The pastie was in response to my question "What gives you the idea that encoding='utf-8' will solve his problem?". This was triggered by the statement 'There is, however, an "encoding" option which would have fixed your problem'. But encoding=ascii produces the same output. So does omitting the encoding arg. It is NOT an encoding problem. Case closed.
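The same conclusion can be checked without lxml. The sketch below is an illustration using the standard library's expat-based xml.etree.ElementTree, not the lxml session above: the byte decodes cleanly under both ASCII and UTF-8, yet the parser still rejects the document, and it accepts it once the character is removed. The helper name parses is made up.

```python
import xml.etree.ElementTree as ET

data = b'<article>t1t2\x1et3t4 t5</article>'

# The byte decodes cleanly under both codecs, so this is not a decoding error.
decoded = {codec: data.decode(codec) for codec in ('ascii', 'utf-8')}


def parses(xml_bytes):
    """Return True if the stdlib parser accepts the document."""
    try:
        ET.fromstring(xml_bytes)
        return True
    except ET.ParseError:
        return False


rejected_as_is = not parses(data)                             # invalid XML character
accepted_when_stripped = parses(data.replace(b'\x1e', b''))   # fine once removed
```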