How can you parse xml in Google Refine using jython/python ElementTree

How can you parse xml in Google Refine using jython/python ElementTree - python

I trying to parse some xml in Google Refine using Jython and ElementTree but I'm struggling to find any documentation to help me getting this working (probably not helped by not being a python coder)
Here's an extract of the XML I'm trying to parse. I'm trying to return a joined string of all the dc:indentifier:
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dc:creator>J. Koenig</dc:creator>
<dc:date>2010-01-13T15:47:38Z</dc:date>
<dc:date>2010-01-13T15:47:38Z</dc:date>
<dc:date>2010-01-13T15:47:38Z</dc:date>
<dc:identifier>CCTL0059</dc:identifier>
<dc:identifier>CCTL0059</dc:identifier>
<dc:identifier>http://open.jorum.ac.uk:80/xmlui/handle/123456789/335</dc:identifier>
<dc:format>application/pdf</dc:format>
</oai_dc:dc>
Here's the code I've got so far. This is a test to return anything as right now all I'm getting is 'Error: null'
from elementtree import ElementTree as ET
element = ET.parse(value)
namespace = "{http://www.openarchives.org/OAI/2.0/oai_dc/}"
e = element.findall('{0}identifier'.format(namespace))
for i in e:
count += 1
return count

You can use a GREL expression like this, try it:
forEach(value.parseHtml().select("dc|identifier"),v,v.htmlText()).join(",")
For each identifier found, give me the htmlText and join them all with commas.
parseHtml() uses Jsoup.org library and really just parses tags and structure. It also knows about parsing namespaces with the format of ns|identifier and is a nice way to get what your after in this case.

You've used the wrong namespace. This works on Jython 2.5.1:
from xml.etree import ElementTree as ET
element = ET.fromstring(value) # `value` is a string with the xml from question
namespace = "{http://purl.org/dc/elements/1.1/}"
for e in element.getiterator(namespace+'identifier'):
print e.text
Output
CCTL0059
CCTL0059
http://open.jorum.ac.uk:80/xmlui/handle/123456789/335

Here's a slight tweak on J.F. Sebastian's version which can be pasted directly into Google Refine:
from xml.etree import ElementTree as ET
element = ET.fromstring(value)
namespace = "{http://purl.org/dc/elements/1.1/}"
return ','.join([e.text for e in element.getiterator(namespace+'identifier')])
It returns a comma separated list, but you can change the delimiter used in the return statement.

Related

Using ElementTree to find a node - invalid predicate

I'm very new to this area so I'm sure it's just something obvious. I'm trying to change a python script so that it finds a node in a different way but I get an "invalid predicate" error.
import xml.etree.ElementTree as ET
tree = ET.parse("/tmp/failing.xml")
doc = tree.getroot()
thingy = doc.find(".//File/Diag[#id='53']")
print(thingy.attrib)
thingy = doc.find(".//File/Diag[BaseName = 'HTTPHeaders']")
print(thingy.attrib)
That should find the same node twice but the second find gets the error. Here is an extract of the XML:
<Diag id="53">
<Formatted>xyz</Formatted>
<BaseName>HTTPHeaders</BaseName>
<Column>17</Column>
I hope I've not cut it down too much. Basically, finding it with "#id" works but I want to search on that BaseName tag instead.
Actually, I want to search on a combination of tags so I have a more complicated expression lined up but I can't get the simple one to work!

The code in the question works when using Python 3.7. If the spaces before and after the equals sign in the predicate are removed, it also works with earlier Python versions.
thingy = doc.find(".//File/Diag[BaseName='HTTPHeaders']")
See https://bugs.python.org/issue31648.

Iterparse object in Python is not returning iter object

I'm working with the XML file in this link (Downloadable file of 40MB). In this file, I'm expecting data from 2 types of tags.
Those are: OpportunityForecastDetail_1_0 and OpportunitySynopsisDetail_1_0.
I wrote the following code for that:
ARTICLE_TAGS = ['OpportunitySynopsisDetail_1_0', 'OpportunityForecastDetail_1_0']
for _tag in ARTICLE_TAGS:
f = open(xml_f)
context = etree.iterparse(f, tag = _tag)
for _, e in context:
_id = e.xpath('.//OpportunityID/text()')
text = e.xpath('.//OpportunityTitle/text()')
f.close()
Then etree.iterparse(f, tag = _tag) is returning an object which is not iterable. I think this occurs when the tag is not found in the XML file.
So, I added name spaces to the iterable tag like this.
context = etree.iterparse(f, tag='{http://apply.grants.gov/system/OpportunityDetail-V1.0}'+_tag)
Now, it is creating an iterable object. But, I'm not getting any text. I tried other namespaces in that file. But, not working.
Please tell me the solution to this problem. This is a sample snippet of the XML file. OpportunityForecastDetail_1_0 and OpportunitySynopsisDetail_1_0 tags are repeated n number of times in the XML file.
<?xml version="1.0" encoding="UTF-8"?>
<Grants xsi:schemaLocation="http://apply.grants.gov/system/OpportunityDetail-V1.0 https://apply07.grants.gov/apply/system/schemas/OppotunityDetail-V1.0.xsd" xmlns="http://apply.grants.gov/system/OpportunityDetail-V1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instace">
<OpportunitySynopsisDetail_1_0>
<OpportunityID>262148</OpportunityID>
<OpportunityTitle>Establishment of the Edmund S. Muskie Graduate Internship Program</OpportunityTitle>
</OpportunitySynopsisDetail_1_0>
<OpportunityForecastDetail_1_0>
<OpportunityID>284765</OpportunityID>
<OpportunityTitle>PPHF 2015: Immunization Grants-CDC Partnership: Strengthening Public Health Laboratories-financed in part by 2015 Prevention and Public Health Funds</OpportunityTitle>
</OpportunityForecastDetail_1_0>
</Grants>

First, when you are parsing XML that contains namespaces, you must use those namespaces when looking at tag names.
Second, iterparse doesn't take an argument named tag, so I don't see how your code could have worked as posted.
Finally, the elements returned from iterparse don't have a member function called xpath, so that can't have worked either.
Here is an example of how to parse XML using iterparse:
NS='{http://apply.grants.gov/system/OpportunityDetail-V1.0}'
ARTICLE_TAGS = [NS+'OpportunitySynopsisDetail_1_0', NS+'OpportunityForecastDetail_1_0']
with open(xml_f, 'r') as f:
context = etree.iterparse(f)
for _, e in context:
if e.tag in ARTICLE_TAGS:
_id = e.find(NS+'OpportunityID')
text = e.find(NS+'OpportunityTitle')
print(_id.text, text.text)
As I said in my comment, the Python documentation is helpful, as is the Effbot page on ElementTree. There are lots of other resources available; put xml.etree.elementtree into Google and start reading!

Parsing XML with Python - accessing elements

I'm using lxml to parse some xml, but for some reason I can't find a specific element.
I'm trying to access the <Constant> elements.
Here's an xml snippet:
</rdf:Description>
</rdf:RDF>
</MiriamAnnotation>
<ListOfSubstrates>
<Substrate metabolite="Metabolite_5" stoichiometry="1"/>
</ListOfSubstrates>
<ListOfModifiers>
<Modifier metabolite="Metabolite_9" stoichiometry="1"/>
</ListOfModifiers>
<ListOfConstants>
<Constant key="Parameter_4344" name="Kcat" value="433.724"/>
<Constant key="Parameter_4343" name="km" value="479.617"/>
The code I'm using is like this:
>>> from lxml import etree as ET
>>> parsed = ET.parse('ct.cps')
>>> root = parsed.getroot()
>>> for a in root.findall(".//Constant"):
... print a.attrib['key']
...
>>> for a in root.findall('Constant'):
... print a.get('key')
...
>>> for a in root.findall('Constant'):
... print a.attrib['key']
...
As you can see, none of these things seem to work.
What am I doing wrong?
EDIT: I'm wondering if it has something to do with the fact that <Constant> elements are empty?
EDIT2: Source xml here: https://www.dropbox.com/s/i6hga7nvmcd6rxx/ct.cps?dl=0

Here is how you can get the values you are looking for:
from lxml import etree
parsed = etree.parse('ct.cps')
for a in parsed.findall("//{http://www.copasi.org/static/schema}Constant"):
print a.attrib["key"]
Output:
Parameter_4344
Parameter_4343
Parameter_4342
Parameter_4341
Parameter_4340
Parameter_4339
Parameter_4338
Parameter_4337
Parameter_4336
Parameter_4335
Parameter_4334
Parameter_4333
Parameter_4332
Parameter_4331
Parameter_4330
Parameter_4329
Parameter_4328
Parameter_4327
Parameter_4326
Parameter_4325
Parameter_4324
Parameter_4323
Parameter_4322
Parameter_4321
Parameter_4320
Parameter_4319
The important thing here is that the COPASI root element in your XML file (the real one at the Dropbox URL) declares a default namespace (http://www.copasi.org/static/schema). This means that the element and all its descendants, including Constant, belong to that namespace.
So instead of Constant elements, you need to look for {http://www.copasi.org/static/schema}Constant elements.
See http://lxml.de/tutorial.html#namespaces.
Here is how you could do it using XPath instead of findall:
from lxml import etree
NSMAP = {"c": "http://www.copasi.org/static/schema"}
parsed = etree.parse('ct.cps')
for a in parsed.xpath("//c:Constant", namespaces=NSMAP):
print a.attrib["key"]
See http://lxml.de/xpathxslt.html#namespaces-and-prefixes.

First, please disregard my comment. It turns out that xml.etree is much better than the standard xml.etree.ElementTree in that it takes care of the namespace. The problem you have is you want to search for '//Constant', which means the nodes can be at any level. However, the root element does not allow you to do it:
>>> root.findall('//Constant')
SyntaxError: cannot use absolute path on element
However, you can do that at higher level:
>>> parsed.findall('//Constant')
[<Element Constant at 0x10a7ce128>, <Element Constant at 0x10a7ce170>]
Update
I am posting here the full text. Since I don't have your full XML file, I make something up to fill in the blank.
from lxml import etree as ET
from StringIO import StringIO
xml_text = """<?xml version='1.0' encoding='utf-8' ?>
<rdf:root xmlns:rdf='http://foo.bar.com/rdf'>
<rdf:RDF>
<rdf:Description>
DescriptionX
</rdf:Description>
</rdf:RDF>
<rdf:foo>
<MiriamAnnotation>
bar
</MiriamAnnotation>
<ListOfSubstrates>
<Substrate metabolite="Metabolite_5" stoichiometry="1"/>
</ListOfSubstrates>
<ListOfModifiers>
<Modifier metabolite="Metabolite_9" stoichiometry="1"/>
</ListOfModifiers>
<ListOfConstants>
<Constant key="Parameter_4344" name="Kcat" value="433.724"/>
<Constant key="Parameter_4343" name="km" value="479.617"/>
</ListOfConstants>
</rdf:foo>
</rdf:root>
"""
buffer = StringIO(xml_text)
tree = ET.parse(buffer)
for constant_node in tree.findall('//Constant'):
print constant_node.attrib['key']

Don't use findall. It is has a limited featureset and is designed to be compatible with ElementTree.
Instead, use xpath, which supports namespaces. From the above, it appears that you probably want to say something like
# possibilities, you need to get these right...
ns_dict = {'atom':"http://www.w3.org/2005/Atom",,
"rdf":"http://www.w3.org/2000/01/rdf-schema#" }
root = parsed.getroot()
for a in root.xpath('.//rdf:Constant', namespaces=ns_dict):
print a.attrib['key']
Note that you must include a namespace prefix in your xpath expression whenever an element has a non-blank namespace, and they must map to one of the namespace URLs that match the same URLs in your document.
Update
Since you posted your original document, I see that there is no namespace assigned to the elements you are looking for. This will work, I just tried it with your source document:
for a in tree.xpath("//Constant"):
print a.attrib['key']
You don't need a namespace because there is no default namespace specified in the document itself.

XML Python Parsing Namespace [duplicate]

I have the following XML which I want to parse using Python's ElementTree:
<rdf:RDF xml:base="http://dbpedia.org/ontology/"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:owl="http://www.w3.org/2002/07/owl#"
xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
xmlns="http://dbpedia.org/ontology/">
<owl:Class rdf:about="http://dbpedia.org/ontology/BasketballLeague">
<rdfs:label xml:lang="en">basketball league</rdfs:label>
<rdfs:comment xml:lang="en">
a group of sports teams that compete against each other
in Basketball
</rdfs:comment>
</owl:Class>
</rdf:RDF>
I want to find all owl:Class tags and then extract the value of all rdfs:label instances inside them. I am using the following code:
tree = ET.parse("filename")
root = tree.getroot()
root.findall('owl:Class')
Because of the namespace, I am getting the following error.
SyntaxError: prefix 'owl' not found in prefix map
I tried reading the document at http://effbot.org/zone/element-namespaces.htm but I am still not able to get this working since the above XML has multiple nested namespaces.
Kindly let me know how to change the code to find all the owl:Class tags.

You need to give the .find(), findall() and iterfind() methods an explicit namespace dictionary:
namespaces = {'owl': 'http://www.w3.org/2002/07/owl#'} # add more as needed
root.findall('owl:Class', namespaces)
Prefixes are only looked up in the namespaces parameter you pass in. This means you can use any namespace prefix you like; the API splits off the owl: part, looks up the corresponding namespace URL in the namespaces dictionary, then changes the search to look for the XPath expression {http://www.w3.org/2002/07/owl}Class instead. You can use the same syntax yourself too of course:
root.findall('{http://www.w3.org/2002/07/owl#}Class')
Also see the Parsing XML with Namespaces section of the ElementTree documentation.
If you can switch to the lxml library things are better; that library supports the same ElementTree API, but collects namespaces for you in .nsmap attribute on elements and generally has superior namespaces support.

Here's how to do this with lxml without having to hard-code the namespaces or scan the text for them (as Martijn Pieters mentions):
from lxml import etree
tree = etree.parse("filename")
root = tree.getroot()
root.findall('owl:Class', root.nsmap)
UPDATE:
5 years later I'm still running into variations of this issue. lxml helps as I showed above, but not in every case. The commenters may have a valid point regarding this technique when it comes merging documents, but I think most people are having difficulty simply searching documents.
Here's another case and how I handled it:
<?xml version="1.0" ?><Tag1 xmlns="http://www.mynamespace.com/prefix">
<Tag2>content</Tag2></Tag1>
xmlns without a prefix means that unprefixed tags get this default namespace. This means when you search for Tag2, you need to include the namespace to find it. However, lxml creates an nsmap entry with None as the key, and I couldn't find a way to search for it. So, I created a new namespace dictionary like this
namespaces = {}
# response uses a default namespace, and tags don't mention it
# create a new ns map using an identifier of our choice
for k,v in root.nsmap.iteritems():
if not k:
namespaces['myprefix'] = v
e = root.find('myprefix:Tag2', namespaces)

Note: This is an answer useful for Python's ElementTree standard library without using hardcoded namespaces.
To extract namespace's prefixes and URI from XML data you can use ElementTree.iterparse function, parsing only namespace start events (start-ns):
>>> from io import StringIO
>>> from xml.etree import ElementTree
>>> my_schema = u'''<rdf:RDF xml:base="http://dbpedia.org/ontology/"
... xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
... xmlns:owl="http://www.w3.org/2002/07/owl#"
... xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
... xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
... xmlns="http://dbpedia.org/ontology/">
...
... <owl:Class rdf:about="http://dbpedia.org/ontology/BasketballLeague">
... <rdfs:label xml:lang="en">basketball league</rdfs:label>
... <rdfs:comment xml:lang="en">
... a group of sports teams that compete against each other
... in Basketball
... </rdfs:comment>
... </owl:Class>
...
... </rdf:RDF>'''
>>> my_namespaces = dict([
... node for _, node in ElementTree.iterparse(
... StringIO(my_schema), events=['start-ns']
... )
... ])
>>> from pprint import pprint
>>> pprint(my_namespaces)
{'': 'http://dbpedia.org/ontology/',
'owl': 'http://www.w3.org/2002/07/owl#',
'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
'rdfs': 'http://www.w3.org/2000/01/rdf-schema#',
'xsd': 'http://www.w3.org/2001/XMLSchema#'}
Then the dictionary can be passed as argument to the search functions:
root.findall('owl:Class', my_namespaces)

I've been using similar code to this and have found it's always worth reading the documentation... as usual!
findall() will only find elements which are direct children of the current tag. So, not really ALL.
It might be worth your while trying to get your code working with the following, especially if you're dealing with big and complex xml files so that that sub-sub-elements (etc.) are also included.
If you know yourself where elements are in your xml, then I suppose it'll be fine! Just thought this was worth remembering.
root.iter()
ref: https://docs.python.org/3/library/xml.etree.elementtree.html#finding-interesting-elements
"Element.findall() finds only elements with a tag which are direct children of the current element. Element.find() finds the first child with a particular tag, and Element.text accesses the element’s text content. Element.get() accesses the element’s attributes:"

To get the namespace in its namespace format, e.g. {myNameSpace}, you can do the following:
root = tree.getroot()
ns = re.match(r'{.*}', root.tag).group(0)
This way, you can use it later on in your code to find nodes, e.g using string interpolation (Python 3).
link = root.find(f"{ns}link")

This is basically Davide Brunato's answer however I found out that his answer had serious problems the default namespace being the empty string, at least on my python 3.6 installation. The function I distilled from his code and that worked for me is the following:
from io import StringIO
from xml.etree import ElementTree
def get_namespaces(xml_string):
namespaces = dict([
node for _, node in ElementTree.iterparse(
StringIO(xml_string), events=['start-ns']
)
])
namespaces["ns0"] = namespaces[""]
return namespaces
where ns0 is just a placeholder for the empty namespace and you can replace it by any random string you like.
If I then do:
my_namespaces = get_namespaces(my_schema)
root.findall('ns0:SomeTagWithDefaultNamespace', my_namespaces)
It also produces the correct answer for tags using the default namespace as well.

My solution is based on #Martijn Pieters' comment:
register_namespace only influences serialisation, not search.
So the trick here is to use different dictionaries for serialization and for searching.
namespaces = {
'': 'http://www.example.com/default-schema',
'spec': 'http://www.example.com/specialized-schema',
}
Now, register all namespaces for parsing and writing:
for name, value in namespaces.iteritems():
ET.register_namespace(name, value)
For searching (find(), findall(), iterfind()) we need a non-empty prefix. Pass these functions a modified dictionary (here I modify the original dictionary, but this must be made only after the namespaces are registered).
self.namespaces['default'] = self.namespaces['']
Now, the functions from the find() family can be used with the default prefix:
print root.find('default:myelem', namespaces)
but
tree.write(destination)
does not use any prefixes for elements in the default namespace.

Processing large XML file [duplicate]

I have the following XML which I want to parse using Python's ElementTree:
<rdf:RDF xml:base="http://dbpedia.org/ontology/"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:owl="http://www.w3.org/2002/07/owl#"
xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
xmlns="http://dbpedia.org/ontology/">
<owl:Class rdf:about="http://dbpedia.org/ontology/BasketballLeague">
<rdfs:label xml:lang="en">basketball league</rdfs:label>
<rdfs:comment xml:lang="en">
a group of sports teams that compete against each other
in Basketball
</rdfs:comment>
</owl:Class>
</rdf:RDF>
I want to find all owl:Class tags and then extract the value of all rdfs:label instances inside them. I am using the following code:
tree = ET.parse("filename")
root = tree.getroot()
root.findall('owl:Class')
Because of the namespace, I am getting the following error.
SyntaxError: prefix 'owl' not found in prefix map
I tried reading the document at http://effbot.org/zone/element-namespaces.htm but I am still not able to get this working since the above XML has multiple nested namespaces.
Kindly let me know how to change the code to find all the owl:Class tags.

You need to give the .find(), findall() and iterfind() methods an explicit namespace dictionary:
namespaces = {'owl': 'http://www.w3.org/2002/07/owl#'} # add more as needed
root.findall('owl:Class', namespaces)
Prefixes are only looked up in the namespaces parameter you pass in. This means you can use any namespace prefix you like; the API splits off the owl: part, looks up the corresponding namespace URL in the namespaces dictionary, then changes the search to look for the XPath expression {http://www.w3.org/2002/07/owl}Class instead. You can use the same syntax yourself too of course:
root.findall('{http://www.w3.org/2002/07/owl#}Class')
Also see the Parsing XML with Namespaces section of the ElementTree documentation.
If you can switch to the lxml library things are better; that library supports the same ElementTree API, but collects namespaces for you in .nsmap attribute on elements and generally has superior namespaces support.

Here's how to do this with lxml without having to hard-code the namespaces or scan the text for them (as Martijn Pieters mentions):
from lxml import etree
tree = etree.parse("filename")
root = tree.getroot()
root.findall('owl:Class', root.nsmap)
UPDATE:
5 years later I'm still running into variations of this issue. lxml helps as I showed above, but not in every case. The commenters may have a valid point regarding this technique when it comes merging documents, but I think most people are having difficulty simply searching documents.
Here's another case and how I handled it:
<?xml version="1.0" ?><Tag1 xmlns="http://www.mynamespace.com/prefix">
<Tag2>content</Tag2></Tag1>
xmlns without a prefix means that unprefixed tags get this default namespace. This means when you search for Tag2, you need to include the namespace to find it. However, lxml creates an nsmap entry with None as the key, and I couldn't find a way to search for it. So, I created a new namespace dictionary like this
namespaces = {}
# response uses a default namespace, and tags don't mention it
# create a new ns map using an identifier of our choice
for k,v in root.nsmap.iteritems():
if not k:
namespaces['myprefix'] = v
e = root.find('myprefix:Tag2', namespaces)

Note: This is an answer useful for Python's ElementTree standard library without using hardcoded namespaces.
To extract namespace's prefixes and URI from XML data you can use ElementTree.iterparse function, parsing only namespace start events (start-ns):
>>> from io import StringIO
>>> from xml.etree import ElementTree
>>> my_schema = u'''<rdf:RDF xml:base="http://dbpedia.org/ontology/"
... xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
... xmlns:owl="http://www.w3.org/2002/07/owl#"
... xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
... xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
... xmlns="http://dbpedia.org/ontology/">
...
... <owl:Class rdf:about="http://dbpedia.org/ontology/BasketballLeague">
... <rdfs:label xml:lang="en">basketball league</rdfs:label>
... <rdfs:comment xml:lang="en">
... a group of sports teams that compete against each other
... in Basketball
... </rdfs:comment>
... </owl:Class>
...
... </rdf:RDF>'''
>>> my_namespaces = dict([
... node for _, node in ElementTree.iterparse(
... StringIO(my_schema), events=['start-ns']
... )
... ])
>>> from pprint import pprint
>>> pprint(my_namespaces)
{'': 'http://dbpedia.org/ontology/',
'owl': 'http://www.w3.org/2002/07/owl#',
'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
'rdfs': 'http://www.w3.org/2000/01/rdf-schema#',
'xsd': 'http://www.w3.org/2001/XMLSchema#'}
Then the dictionary can be passed as argument to the search functions:
root.findall('owl:Class', my_namespaces)

I've been using similar code to this and have found it's always worth reading the documentation... as usual!
findall() will only find elements which are direct children of the current tag. So, not really ALL.
It might be worth your while trying to get your code working with the following, especially if you're dealing with big and complex xml files so that that sub-sub-elements (etc.) are also included.
If you know yourself where elements are in your xml, then I suppose it'll be fine! Just thought this was worth remembering.
root.iter()
ref: https://docs.python.org/3/library/xml.etree.elementtree.html#finding-interesting-elements
"Element.findall() finds only elements with a tag which are direct children of the current element. Element.find() finds the first child with a particular tag, and Element.text accesses the element’s text content. Element.get() accesses the element’s attributes:"

To get the namespace in its namespace format, e.g. {myNameSpace}, you can do the following:
root = tree.getroot()
ns = re.match(r'{.*}', root.tag).group(0)
This way, you can use it later on in your code to find nodes, e.g using string interpolation (Python 3).
link = root.find(f"{ns}link")

This is basically Davide Brunato's answer however I found out that his answer had serious problems the default namespace being the empty string, at least on my python 3.6 installation. The function I distilled from his code and that worked for me is the following:
from io import StringIO
from xml.etree import ElementTree
def get_namespaces(xml_string):
namespaces = dict([
node for _, node in ElementTree.iterparse(
StringIO(xml_string), events=['start-ns']
)
])
namespaces["ns0"] = namespaces[""]
return namespaces
where ns0 is just a placeholder for the empty namespace and you can replace it by any random string you like.
If I then do:
my_namespaces = get_namespaces(my_schema)
root.findall('ns0:SomeTagWithDefaultNamespace', my_namespaces)
It also produces the correct answer for tags using the default namespace as well.

My solution is based on #Martijn Pieters' comment:
register_namespace only influences serialisation, not search.
So the trick here is to use different dictionaries for serialization and for searching.
namespaces = {
'': 'http://www.example.com/default-schema',
'spec': 'http://www.example.com/specialized-schema',
}
Now, register all namespaces for parsing and writing:
for name, value in namespaces.iteritems():
ET.register_namespace(name, value)
For searching (find(), findall(), iterfind()) we need a non-empty prefix. Pass these functions a modified dictionary (here I modify the original dictionary, but this must be made only after the namespaces are registered).
self.namespaces['default'] = self.namespaces['']
Now, the functions from the find() family can be used with the default prefix:
print root.find('default:myelem', namespaces)
but
tree.write(destination)
does not use any prefixes for elements in the default namespace.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

How can you parse xml in Google Refine using jython/python ElementTree - python

Related

Using ElementTree to find a node - invalid predicate

Iterparse object in Python is not returning iter object

Parsing XML with Python - accessing elements

XML Python Parsing Namespace [duplicate]

Processing large XML file [duplicate]

Categories

Resources