Accessing values in an XML file with namespaces in Python 2.7 lxml

python etree lxml (1)

Use the xpath() method with the namespaces keyword argument:

    namespaces = {
        'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
        'dcat': 'http://www.w3.org/ns/dcat#',
        'dct': 'http://purl.org/dc/terms/'
    }

    print(doc.xpath('//rdf:RDF', namespaces=namespaces))
    print(doc.xpath('//dcat:Dataset', namespaces=namespaces))
    print(doc.xpath('//dct:identifier', namespaces=namespaces))

I am following this link to try to get values from several tags:

Parsing XML with namespace in Python via 'ElementTree'

In that link there is no problem accessing the root tag this way:

    import sys
    from lxml import etree as ET

    doc = ET.parse('file.xml')

    namespaces_rdf = {'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'}  # add more as needed
    namespaces_dcat = {'dcat': 'http://www.w3.org/ns/dcat#'}  # add more as needed
    namespaces_dct = {'dct': 'http://purl.org/dc/terms/'}

    print doc.findall('rdf:RDF', namespaces_rdf)
    print doc.findall('dcat:Dataset', namespaces_dcat)
    print doc.findall('dct:identifier', namespaces_dct)

OUTPUT:

    []
    [<Element {http://www.w3.org/ns/dcat#}Dataset at 0x2269b98>]
    []

I can only access dcat:Dataset, and I can't see how to access the value of rdf:about.

After that, I need to access dct:identifier.

Of course, once I have accessed this information, I also need to access the dcat:distribution information.
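A possible explanation for the two empty lists, sketched under a few assumptions: a relative path such as 'dct:identifier' passed to findall() only matches direct children of the root element, so the root itself (rdf:RDF) and the grandchild (dct:identifier) are not found, while dcat:Dataset is. The inline XML below is a trimmed copy of the sample file, the names NS, direct_hits and deep_hits are illustrative, and the import falls back to the standard library's xml.etree.ElementTree, which offers the same find/findall calls used here.

```python
try:
    from lxml import etree as ET          # library used in the question
except ImportError:
    import xml.etree.ElementTree as ET    # stdlib fallback, same findall() API

# Trimmed copy of the sample file (assumption: the structure is unchanged).
XML = """<rdf:RDF xmlns:dct="http://purl.org/dc/terms/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dcat="http://www.w3.org/ns/dcat#">
  <dcat:Dataset rdf:about="http://www.myweb.com/dataset/ec631628-2f46-4f17-a685-d62a37466c01">
    <dct:identifier>ec631628-2f46-4f17-a685-d62a37466c01</dct:identifier>
  </dcat:Dataset>
</rdf:RDF>"""

NS = {
    'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
    'dcat': 'http://www.w3.org/ns/dcat#',
    'dct': 'http://purl.org/dc/terms/',
}

root = ET.fromstring(XML)

# The root element *is* rdf:RDF, so it is read directly instead of searched for.
root_tag = root.tag

# Relative path: only direct children of the root are considered.
direct_hits = root.findall('dct:identifier', NS)     # empty: it is a grandchild

# './/' searches all descendants of the root.
deep_hits = root.findall('.//dct:identifier', NS)    # one element

# dcat:Dataset is a direct child, which is why it was the only hit.
datasets = root.findall('dcat:Dataset', NS)
```

The xpath('//...') form shown in the answer sidesteps the same issue, because '//' searches the whole document rather than only the children of the root.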

This is my sample file, generated with ckanext-dcat:

    <?xml version="1.0" encoding="utf-8"?>
    <rdf:RDF
      xmlns:dct="http://purl.org/dc/terms/"
      xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
      xmlns:dcat="http://www.w3.org/ns/dcat#">
      <dcat:Dataset rdf:about="http://www.myweb.com/dataset/ec631628-2f46-4f17-a685-d62a37466c01">
        <dct:identifier>ec631628-2f46-4f17-a685-d62a37466c01</dct:identifier>
        <dct:description>FOO-Description</dct:description>
        <dct:title>FOO-title</dct:title>
        <dcat:keyword>keyword1</dcat:keyword>
        <dcat:keyword>keyword2</dcat:keyword>
        <dct:issued rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2014-10-08T08:55:04.566618</dct:issued>
        <dct:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2015-06-25T11:04:10.328902</dct:modified>
        <dcat:distribution>
          <dcat:Distribution rdf:about="http://www.myweb.com/dataset/ec631628-2f46-4f17-a685-d62a37466c01/resource/f5707551-6bf3-468f-9a96-b4184cc51d1f">
            <dct:title>FOO-title-1</dct:title>
            <dct:description>FOO-Description-1</dct:description>
            <dcat:accessURL>http://www.myweb.com/dataset/ec631628-2f46-4f17-a685-d62a37466c01/resource/f5707551-6bf3-468f-9a96-b4184cc51d1f/download/myxls.xls</dcat:accessURL>
            <dct:format>XLS</dct:format>
          </dcat:Distribution>
        </dcat:distribution>
        <dcat:distribution>
          <dcat:Distribution rdf:about="http://www.myweb.com/dataset/ec631628-2f46-4f17-a685-d62a37466c01/resource/74c1acc8-b2b5-441b-afb2-d072d0d00a7f">
            <dct:format>XLS</dct:format>
            <dct:title>FOO-title-2</dct:title>
            <dct:description>FOO-Description-2</dct:description>
            <dcat:accessURL>http://www.myweb.com/dataset/ec631628-2f46-4f17-a685-d62a37466c01/resource/74c1acc8-b2b5-441b-afb2-d072d0d00a7f/download/myxls.xls</dcat:accessURL>
          </dcat:Distribution>
        </dcat:distribution>
      </dcat:Dataset>
    </rdf:RDF>

Any idea how to access this information? Thanks
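One way to read the dataset and its distributions from a file like the one above is sketched here. It assumes a single dcat:Dataset per file and uses a trimmed inline copy of the sample; the names NS, ABOUT and record are illustrative, not part of lxml. Only find(), findall(), findtext() and get() are used, which exist in both lxml.etree and the standard library's xml.etree.ElementTree.

```python
try:
    from lxml import etree as ET          # library used in the question
except ImportError:
    import xml.etree.ElementTree as ET    # stdlib fallback, same calls

NS = {
    'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
    'dcat': 'http://www.w3.org/ns/dcat#',
    'dct': 'http://purl.org/dc/terms/',
}
# Attributes are keyed in Clark notation: '{namespace-uri}localname'.
ABOUT = '{%s}about' % NS['rdf']

# Trimmed copy of the sample file (assumption: one Dataset, two Distributions).
XML = """<rdf:RDF xmlns:dct="http://purl.org/dc/terms/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dcat="http://www.w3.org/ns/dcat#">
  <dcat:Dataset rdf:about="http://www.myweb.com/dataset/ec631628-2f46-4f17-a685-d62a37466c01">
    <dct:identifier>ec631628-2f46-4f17-a685-d62a37466c01</dct:identifier>
    <dcat:distribution>
      <dcat:Distribution rdf:about="http://www.myweb.com/dataset/ec631628-2f46-4f17-a685-d62a37466c01/resource/f5707551-6bf3-468f-9a96-b4184cc51d1f">
        <dct:title>FOO-title-1</dct:title>
        <dcat:accessURL>http://www.myweb.com/dataset/ec631628-2f46-4f17-a685-d62a37466c01/resource/f5707551-6bf3-468f-9a96-b4184cc51d1f/download/myxls.xls</dcat:accessURL>
        <dct:format>XLS</dct:format>
      </dcat:Distribution>
    </dcat:distribution>
    <dcat:distribution>
      <dcat:Distribution rdf:about="http://www.myweb.com/dataset/ec631628-2f46-4f17-a685-d62a37466c01/resource/74c1acc8-b2b5-441b-afb2-d072d0d00a7f">
        <dct:format>XLS</dct:format>
        <dct:title>FOO-title-2</dct:title>
        <dcat:accessURL>http://www.myweb.com/dataset/ec631628-2f46-4f17-a685-d62a37466c01/resource/74c1acc8-b2b5-441b-afb2-d072d0d00a7f/download/myxls.xls</dcat:accessURL>
      </dcat:Distribution>
    </dcat:distribution>
  </dcat:Dataset>
</rdf:RDF>"""

root = ET.fromstring(XML)
dataset = root.find('dcat:Dataset', NS)   # direct child of rdf:RDF

record = {
    'about': dataset.get(ABOUT),
    'identifier': dataset.findtext('dct:identifier', namespaces=NS),
    'distributions': [],
}

# Each dcat:Distribution sits inside a dcat:distribution wrapper element.
for dist in dataset.findall('dcat:distribution/dcat:Distribution', NS):
    record['distributions'].append({
        'about': dist.get(ABOUT),
        'title': dist.findtext('dct:title', namespaces=NS),
        'format': dist.findtext('dct:format', namespaces=NS),
        'accessURL': dist.findtext('dcat:accessURL', namespaces=NS),
    })
```

Because the lookups are by name, the differing child order in the two distributions does not matter.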

UPDATE: Well, I need to access rdf:about in:

    <dcat:Dataset rdf:about="http://www.myweb.com/dataset/ec631628-2f46-4f17-a685-d62a37466c01">

so with this code, taken from:

Parse xml with lxml - extract element value

    for node in doc.xpath('//dcat:Dataset', namespaces=namespaces):
        # Iterate over attributes
        for attrib in node.attrib:
            print '@' + attrib + '=' + node.attrib[attrib]

I get this output:

    [<Element {http://www.w3.org/ns/dcat#}Dataset at 0x23d8ee0>]
    @{http://www.w3.org/1999/02/22-rdf-syntax-ns#}about=http://www.myweb.com/dataset/ec631628-2f46-4f17-a685-d62a37466c01

So, the question is:

How can I check whether the attribute is "about", so that I can take its value? In other files I have several tags.

UPDATE 2: Fixed how I get the value (Clark notation):

    for node in doc.xpath('//dcat:Dataset', namespaces=namespaces):
        # Iterate over attributes
        for attrib in node.attrib:
            if attrib.endswith('about'):
                pass  # do my jobs
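A stricter variant of the check above, offered as a sketch: endswith('about') also matches any other attribute whose name happens to end in "about", whereas asking for the full Clark-notation key with get() matches only rdf:about. The ex:roundabout attribute and its http://example.com/ns# namespace are hypothetical and exist only to show the difference; the names ABOUT and results are illustrative as well.

```python
try:
    from lxml import etree as ET          # library used in the question
except ImportError:
    import xml.etree.ElementTree as ET    # stdlib fallback, same calls

NS = {
    'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
    'dcat': 'http://www.w3.org/ns/dcat#',
}
# Clark notation for rdf:about: '{namespace-uri}localname'.
ABOUT = '{%s}about' % NS['rdf']

# ex:roundabout is a made-up attribute used only for this illustration.
XML = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dcat="http://www.w3.org/ns/dcat#"
    xmlns:ex="http://example.com/ns#">
  <dcat:Dataset
      rdf:about="http://www.myweb.com/dataset/ec631628-2f46-4f17-a685-d62a37466c01"
      ex:roundabout="unrelated"/>
</rdf:RDF>"""

root = ET.fromstring(XML)

results = []
for node in root.findall('dcat:Dataset', NS):
    # Suffix test: picks up both rdf:about and ex:roundabout.
    loose = [name for name in node.attrib if name.endswith('about')]
    # Exact lookup: returns the rdf:about value, or None if it is absent.
    exact = node.get(ABOUT)
    results.append((loose, exact))
```

With lxml, the same value should also be reachable through XPath, for example doc.xpath('//dcat:Dataset/@rdf:about', namespaces=namespaces), which returns the attribute values as a list.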

Well, almost done, but I have one last question: when I access a

    <dct:title>

I need to know which element it belongs to. I have:

    <dcat:Dataset rdf:about="http://www.myweb.com/dataset/ec631628-2f46-4f17-a685-d62a37466c01">
      <dct:title>FOO-title</dct:title>
      <dcat:Distribution rdf:about="http://www.myweb.com/dataset/ec631628-2f46-4f17-a685-d62a37466c01/resource/f5707551-6bf3-468f-9a96-b4184cc51d1f">
        <dct:title>FOO-title-1</dct:title>
      <dcat:Distribution rdf:about="http://www.myweb.com/dataset/ec631628-2f46-4f17-a685-d62a37466c01/resource/74c1acc8-b2b5-441b-afb2-d072d0d00a7f">
        <dct:title>FOO-title-2</dct:title>

If I do something like this, I get:

    for node in doc.xpath('//dct:title', namespaces=namespaces):
        print node.tag, node.text

    {http://purl.org/dc/terms/}title FOO-title
    {http://purl.org/dc/terms/}title FOO-title-1
    {http://purl.org/dc/terms/}title FOO-title-2
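One possible way to tell the titles apart is to turn the search around: walk the elements that can own a title (dcat:Dataset and dcat:Distribution) and ask each one for its direct dct:title child, pairing it with that element's rdf:about. This is a sketch over a trimmed copy of the sample file; OWNER_TAGS, ABOUT and pairs are illustrative names. With lxml specifically, calling node.getparent() on each dct:title element is another option.

```python
try:
    from lxml import etree as ET          # library used in the question
except ImportError:
    import xml.etree.ElementTree as ET    # stdlib fallback, same calls

NS = {
    'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
    'dcat': 'http://www.w3.org/ns/dcat#',
    'dct': 'http://purl.org/dc/terms/',
}
ABOUT = '{%s}about' % NS['rdf']
# Elements that may carry their own dct:title, in Clark notation.
OWNER_TAGS = {'{%s}Dataset' % NS['dcat'], '{%s}Distribution' % NS['dcat']}

# Trimmed copy of the sample file (assumption: same nesting as the original).
XML = """<rdf:RDF xmlns:dct="http://purl.org/dc/terms/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dcat="http://www.w3.org/ns/dcat#">
  <dcat:Dataset rdf:about="http://www.myweb.com/dataset/ec631628-2f46-4f17-a685-d62a37466c01">
    <dct:title>FOO-title</dct:title>
    <dcat:distribution>
      <dcat:Distribution rdf:about="http://www.myweb.com/dataset/ec631628-2f46-4f17-a685-d62a37466c01/resource/f5707551-6bf3-468f-9a96-b4184cc51d1f">
        <dct:title>FOO-title-1</dct:title>
      </dcat:Distribution>
    </dcat:distribution>
    <dcat:distribution>
      <dcat:Distribution rdf:about="http://www.myweb.com/dataset/ec631628-2f46-4f17-a685-d62a37466c01/resource/74c1acc8-b2b5-441b-afb2-d072d0d00a7f">
        <dct:title>FOO-title-2</dct:title>
      </dcat:Distribution>
    </dcat:distribution>
  </dcat:Dataset>
</rdf:RDF>"""

root = ET.fromstring(XML)

pairs = []  # (rdf:about of the owner, text of its own dct:title)
for elem in root.iter():                     # document order
    if elem.tag in OWNER_TAGS:
        # A relative path matches direct children only, so a Dataset
        # does not pick up the titles of its nested Distributions.
        title = elem.find('dct:title', NS)
        if title is not None:
            pairs.append((elem.get(ABOUT), title.text))

for about, text in pairs:
    print('%s -> %s' % (text, about))
```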

Thanks