I have a simple XML structure from which I want to extract the data then process it. I'm using Python and xml.etree.ElementTree, and it works well, except for a particular case. When I parse a particular XML, there is one node that returns None as values for the content of the elements.
Here is the code and the output:
import xml.etree.ElementTree as ET
sample = '<?xml version="1.0" encoding="UTF-8"?>\
<xliff xmlns="urn:oasis:names:tc:xliff:document:1.2" version="1.2">\
<file original="global" datatype="plaintext" source-language="en" target-language="fr-CA">\
<body>\
<trans-unit id="translations.amountMoreString" resname="e438d8a3237fefa5ace76eae98c157bd">\
<source xml:lang="en"><ph id="1" ctype="x-phrase-placeholder">{{ count }}</ph> more</source>\
<target xml:lang="fr-CA" state="signed-off"><ph id="1" ctype="x-phrase-placeholder">{{ count }}</ph> de plus</target>\
</trans-unit>\
<trans-unit id="translations.amountString" resname="7cdc0d444b4ee4ccc0a11819a3f96af2">\
<source xml:lang="en">Amount</source>\
<target xml:lang="fr-CA" state="signed-off">Quantité</target>\
</trans-unit>\
</body>\
</file>\
</xliff>'
xliff_root = ET.fromstring(sample)
translation_unit = {}  # holds the source/target text of the unit being read
for file_element in xliff_root:
    for body_element in file_element:
        for trans_unit_element in body_element:
            print('\ntrans-unit element', trans_unit_element)
            for text_element in trans_unit_element:
                print('\tText element:', text_element)
                if text_element.tag == '{urn:oasis:names:tc:xliff:document:1.2}source':
                    print('\t\tSource:', text_element.text)
                    translation_unit['source'] = text_element.text
                elif text_element.tag == '{urn:oasis:names:tc:xliff:document:1.2}target':
                    print('\t\tTarget:', text_element.text)
                    translation_unit['target'] = text_element.text
The output:
trans-unit element <Element '{urn:oasis:names:tc:xliff:document:1.2}trans-unit' at 0x00000163DBAA71D0>
    Text element: <Element '{urn:oasis:names:tc:xliff:document:1.2}source' at 0x00000163DBAA72C0>
        Source: None
    Text element: <Element '{urn:oasis:names:tc:xliff:document:1.2}target' at 0x00000163DBAA7450>
        Target: None

trans-unit element <Element '{urn:oasis:names:tc:xliff:document:1.2}trans-unit' at 0x00000163DBAA7590>
    Text element: <Element '{urn:oasis:names:tc:xliff:document:1.2}source' at 0x00000163DBAA75E0>
        Source: Amount
    Text element: <Element '{urn:oasis:names:tc:xliff:document:1.2}target' at 0x00000163DBAA7630>
        Target: Quantité
I'm getting None as values for the <source> and <target> elements for the first <trans-unit>, but the second one returned the correct values. I understand that there are other elements inside the <source> and <target> elements, but there is also textual content.
I would like to thank in advance anyone who could help me understand and correct this issue...
Kind regards,
JF
In the first <trans-unit> element, <source> does not contain only text; it contains a <ph> element. The structure is:
xliff
  body
    trans-unit
      source
        ph
          "text is here"
        "text is also here"
Compare the second trans-unit element, where the structure is:
xliff
  body
    trans-unit
      source
        "text is here"
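You can use the itertext method to iterate over all the text fragments inside an element, including text nested in child elements. For example, this code, which keeps only the last fragment of each element: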
import xml.etree.ElementTree as ET
sample='''<?xml version="1.0" encoding="UTF-8"?>
<xliff xmlns="urn:oasis:names:tc:xliff:document:1.2" version="1.2">
<file original="global" datatype="plaintext" source-language="en" target-language="fr-CA">
<body>
<trans-unit id="translations.amountMoreString" resname="e438d8a3237fefa5ace76eae98c157bd">
<source xml:lang="en"><ph id="1" ctype="x-phrase-placeholder">{{ count }}</ph> more</source>
<target xml:lang="fr-CA" state="signed-off"><ph id="1" ctype="x-phrase-placeholder">{{ count }}</ph> de plus</target>
</trans-unit>
<trans-unit id="translations.amountString" resname="7cdc0d444b4ee4ccc0a11819a3f96af2">
<source xml:lang="en">Amount</source>
<target xml:lang="fr-CA" state="signed-off">Quantité</target>
</trans-unit>
</body>
</file>
</xliff>
'''
xliff_root = ET.fromstring(sample)
translation_unit = {}
for trans_unit in xliff_root.findall('.//{urn:oasis:names:tc:xliff:document:1.2}trans-unit'):
    source = trans_unit.find('{urn:oasis:names:tc:xliff:document:1.2}source')
    source_text = [text.strip() for text in source.itertext()][-1]
    target = trans_unit.find('{urn:oasis:names:tc:xliff:document:1.2}target')
    target_text = [text.strip() for text in target.itertext()][-1]
    print('source:', source_text)
    print('target:', target_text)
Produces:
source: more
target: de plus
source: Amount
target: Quantité
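To make the text/tail distinction concrete, here is a minimal sketch that shows where ElementTree stores each fragment and how joining itertext() recovers the whole string, placeholder included. It assumes a trimmed-down, namespace-free copy of the first <source> element; the variable names are only illustrative.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for the first <source> element (namespace omitted for brevity)
source = ET.fromstring('<source><ph id="1">{{ count }}</ph> more</source>')
ph = source[0]

# Text before the first child lives in .text; there is none here, hence None
leading_text = source.text
# Text inside the child element belongs to the child
placeholder_text = ph.text
# Text after the child's closing tag is stored on the child as .tail
trailing_text = ph.tail
# itertext() yields all of these fragments in document order
full_text = "".join(source.itertext())

print(repr(leading_text))
print(repr(placeholder_text))
print(repr(trailing_text))
print(repr(full_text))
```

This is why source.text is None for the first trans-unit: the element starts with a child, so there is no leading text to report.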
Thanks to @larsks's answer and some digging in the documentation, I managed to get exactly what I need. I found the ET.tostring method, which returns the whole node as a text string, and I then use a few regular expressions to remove the unwanted parts from those strings.
Full code here:
import xml.etree.ElementTree as ET
import re
sample = '''<?xml version="1.0" encoding="UTF-8"?>\
<xliff xmlns="urn:oasis:names:tc:xliff:document:1.2" version="1.2">\
<file original="global" datatype="plaintext" source-language="en" target-language="fr-CA">\
<body>\
<trans-unit id="translations.amountMoreString" resname="e438d8a3237fefa5ace76eae98c157bd">\
<source xml:lang="en"><ph id="1" ctype="x-phrase-placeholder">{{ count }}</ph> more</source>\
<target xml:lang="fr-CA" state="signed-off"><ph id="1" ctype="x-phrase-placeholder">{{ count }}</ph> de plus</target>\
</trans-unit>\
<trans-unit id="translations.amountString" resname="7cdc0d444b4ee4ccc0a11819a3f96af2">\
<source xml:lang="en">Amount</source>\
<target xml:lang="fr-CA" state="signed-off">Quantité</target>\
</trans-unit>\
</body>\
</file>\
</xliff>
'''
xliff_root = ET.fromstring(sample)
translation_unit = {}
for trans_unit in xliff_root.findall('.//{urn:oasis:names:tc:xliff:document:1.2}trans-unit'):
    source = trans_unit.find('{urn:oasis:names:tc:xliff:document:1.2}source')
    source_text = ET.tostring(source, encoding='utf8').decode('utf8')
    # Remove the XML declaration line
    source_text = re.sub(r'^<\?(.*?)\?>\n', r'', source_text)
    # Remove the generated namespace prefixes (ns0:, ns1:, ...)
    source_text = re.sub(r'<(\/?)ns\d+:', r'<\1', source_text)
    # Remove the enclosing <source ...> and </source> tags
    source_text = re.sub(r'<(\/?)source(.*?)>', r'', source_text)
    print(source_text)
    target = trans_unit.find('{urn:oasis:names:tc:xliff:document:1.2}target')
    target_text = ET.tostring(target, encoding='utf8').decode('utf8')
    target_text = re.sub(r'^<\?(.*?)\?>\n', r'', target_text)
    target_text = re.sub(r'<(\/?)ns\d+:', r'<\1', target_text)
    target_text = re.sub(r'<(\/?)target(.*?)>', r'', target_text)
    print(target_text)
Results:
<ph id="1" ctype="x-phrase-placeholder">{{ count }}</ph> more
<ph id="1" ctype="x-phrase-placeholder">{{ count }}</ph> de plus
Amount
Quantité
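For comparison, the same inner markup can be produced without regular expressions. The following is only a sketch under a few assumptions: it uses a shortened copy of the sample (some attributes dropped), it strips the namespace from every tag in place before serializing, and inner_xml is a hypothetical helper name, not part of ElementTree.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample document (assumption: some attributes dropped)
sample = (
    '<xliff xmlns="urn:oasis:names:tc:xliff:document:1.2" version="1.2">'
    '<file original="global" datatype="plaintext"><body>'
    '<trans-unit id="translations.amountMoreString">'
    '<source xml:lang="en"><ph id="1" ctype="x-phrase-placeholder">{{ count }}</ph> more</source>'
    '<target xml:lang="fr-CA"><ph id="1" ctype="x-phrase-placeholder">{{ count }}</ph> de plus</target>'
    '</trans-unit>'
    '<trans-unit id="translations.amountString">'
    '<source xml:lang="en">Amount</source>'
    '<target xml:lang="fr-CA">Quantité</target>'
    '</trans-unit>'
    '</body></file></xliff>'
)


def inner_xml(element):
    """Hypothetical helper: serialize an element's content without its own tag."""
    parts = [element.text or '']
    for child in element:
        # tostring() includes the child's tail text by default
        parts.append(ET.tostring(child, encoding='unicode'))
    return ''.join(parts)


root = ET.fromstring(sample)

# Drop the '{namespace}' part of every tag so no ns0: prefixes get serialized
for el in root.iter():
    if el.tag.startswith('{'):
        el.tag = el.tag.split('}', 1)[1]

results = []
for trans_unit in root.iter('trans-unit'):
    pair = (inner_xml(trans_unit.find('source')),
            inner_xml(trans_unit.find('target')))
    results.append(pair)
    print(pair[0])
    print(pair[1])
```

The trade-off is that the parsed tree is modified (its tags lose their namespace), so this only suits cases where the tree is not written back out as XLIFF afterwards.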
A big thank you to @larsks as well for showing me a more efficient way to parse XML documents using XPath and the .findall and .find methods! :-)
JF
Related
I have an xml that looks like this.
<?xml version="1.0" encoding="UTF-8" ?>
<main_heading timestamp="20220113">
<details>
<offer id="11" parent_id="12">
<name>Alpha</name>
<pos>697</pos>
<kat_pis>
<pos kat="2">112</pos>
</kat_pis>
</offer>
<offer id="12" parent_id="31">
<name>Beta</name>
<pos>099</pos>
<kat_pis>
<pos kat="2">113</pos>
</kat_pis>
</offer>
</details>
</main_heading>
I am parsing it using BeautifulSoup. Upon doing this:
soup = BeautifulSoup(file, 'xml')
pos = []
for i in (soup.find_all('pos')):
    pos.append(i.text)
I get a list of all POS tag values, even the ones that are nested within the tag kat_pis.
So I get (697, 112, 099, 113).
However, I only want to get the POS values of the non-nested tags.
Expected desired result is (697, 099).
How can I achieve this?
Here is one way of getting those first level pos:
from bs4 import BeautifulSoup as bs
xml_doc = '''<?xml version="1.0" encoding="UTF-8" ?>
<main_heading timestamp="20220113">
<details>
<offer id="11" parent_id="12">
<name>Alpha</name>
<pos>697</pos>
<kat_pis>
<pos kat="2">112</pos>
</kat_pis>
</offer>
<offer id="12" parent_id="31">
<name>Beta</name>
<pos>099</pos>
<kat_pis>
<pos kat="2">113</pos>
</kat_pis>
</offer>
</details>
</main_heading>'''
soup = bs(xml_doc, 'xml')
pos = []
for i in (soup.select('offer > pos')):
    pos.append(i.text)
print(pos)
Result in terminal:
['697', '099']
I think the best solution would be to abandon BeautifulSoup for an XML parser with XPath support, like lxml. Using XPath expressions, you can ask for only those pos elements that are children of offer elements:
from lxml import etree
with open('data.xml') as fd:
    doc = etree.parse(fd)
pos = []
for ele in (doc.xpath('//offer/pos')):
    pos.append(ele.text)
print(pos)
Given your example input, the above code prints:
['697', '099']
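If installing lxml is not an option, the limited XPath subset in the standard library's xml.etree.ElementTree is enough for this particular query. A minimal sketch, assuming the XML from the question is held in a string (the variable names are illustrative):

```python
import xml.etree.ElementTree as ET

xml_doc = '''<main_heading timestamp="20220113">
<details>
<offer id="11" parent_id="12">
<name>Alpha</name>
<pos>697</pos>
<kat_pis>
<pos kat="2">112</pos>
</kat_pis>
</offer>
<offer id="12" parent_id="31">
<name>Beta</name>
<pos>099</pos>
<kat_pis>
<pos kat="2">113</pos>
</kat_pis>
</offer>
</details>
</main_heading>'''

root = ET.fromstring(xml_doc)

# 'offer/pos' matches only <pos> elements that are direct children of an <offer>
direct_pos = [el.text for el in root.findall('.//offer/pos')]
# './/pos' matches every <pos> at any depth, including the nested ones
all_pos = [el.text for el in root.findall('.//pos')]

print(direct_pos)
print(all_pos)
```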
I am using python's ElementTree library to parse an XML file which has the following structure. I am trying to get the xml string corresponding to entity with id = 192 with all its parents (folders) but without other entities
<catalog>
<folder name="entities">
<entity id="102">
</entity>
<folder name="newEntities">
<entity id="192">
</entity>
<entity id="2982">
</entity>
</folder>
</folder>
</catalog>
The required result should be
<catalog>
<folder name="entities">
<folder name="newEntities">
<entity id="192">
</entity>
</folder>
</folder>
</catalog>
Assuming the first XML string is stored in a variable called xml_string:
tree = ET.fromstring(xml_string)
id = "192"
required_element = tree.find(".//entity[@id='" + id + "']")
This gets the XML element for the required entity but not its parent folders. Is there a quick fix for this?
The challenge here is that ElementTree elements keep no reference to their parent. The solution is to build a parent_map, a dictionary that maps each child element to its parent:
import copy
import xml.etree.ElementTree as ET
import xml.dom.minidom as minidom
xml = '''<catalog>
<folder name="entities">
<entity id="102">
</entity>
<folder name="newEntities">
<entity id="192">
</entity>
<entity id="2982">
</entity>
</folder>
</folder>
</catalog>'''
def prettify(elem):
    """Return a pretty-printed XML string for the Element.
    """
    rough_string = ET.tostring(elem, 'utf-8')
    reparsed = minidom.parseString(rough_string)
    return reparsed.toprettyxml(indent="\t")

root = ET.fromstring(xml)
# Map every child element to its parent
parent_map = {c: p for p in root.iter() for c in p}
_id = 192
required_element = root.find(".//entity[@id='" + str(_id) + "']")
# Collect copies of the entity and each of its ancestors, innermost first
_path = [copy.deepcopy(required_element)]
while True:
    parent = parent_map.get(required_element)
    if parent is not None:
        _path.append(copy.deepcopy(parent))
        required_element = parent
    else:
        break
# Empty each ancestor copy and re-attach only the next element on the path
idx = len(_path) - 1
while idx >= 1:
    _path[idx].clear()  # clear() also drops the element's attributes
    _path[idx].append(_path[idx-1])
    idx -= 1
print(prettify(_path[-1]))
output
<?xml version="1.0" ?>
<catalog>
<folder>
<folder>
<entity id="192">
</entity>
</folder>
</folder>
</catalog>
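Note that the folders in this output have lost their name attributes, because clear() removes attributes along with children. A minimal sketch of a variant that keeps them, assuming it is acceptable to build fresh wrapper elements instead of clearing copies (with_ancestors is a hypothetical helper name):

```python
import copy
import xml.etree.ElementTree as ET

xml = '''<catalog>
<folder name="entities">
<entity id="102">
</entity>
<folder name="newEntities">
<entity id="192">
</entity>
<entity id="2982">
</entity>
</folder>
</folder>
</catalog>'''


def with_ancestors(root, target):
    """Hypothetical helper: new tree containing only `target` and its ancestors."""
    parent_map = {child: parent for parent in root.iter() for child in parent}
    node = copy.deepcopy(target)
    current = target
    while current in parent_map:
        parent = parent_map[current]
        # Fresh, empty element with the same tag and a copy of the attributes
        wrapper = ET.Element(parent.tag, dict(parent.attrib))
        wrapper.append(node)
        node = wrapper
        current = parent
    return node


root = ET.fromstring(xml)
target = root.find(".//entity[@id='192']")
pruned = with_ancestors(root, target)
print(ET.tostring(pruned, encoding='unicode'))
```

The original tree is left untouched; only the target entity is deep-copied, and each ancestor is recreated without its other children.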
I'm trying to export all the movie titles from an xml file but I can't seem to get the titles. The xml looks like:
<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<videodb>
<version>1</version>
<movie>
<title>2 Guns</title>
<originaltitle>2 Guns</originaltitle>
<ratings>
<rating name="themoviedb" max="10" default="true">
<value>6.500000</value>
<votes>1776</votes>
</rating>
</ratings>
I've seen lots of examples for values where xml has value="title" but can't find a guiding example that works when there is no value="title"
My code so far:
#Import required library
import xml.etree.cElementTree as ET
root = ET.parse('D:\\temp\\videodb.xml').getroot()
for type_text in root.findall('movie/title'):
    value = type_text.get ('text')
    print(value)
XML file:
<?xml version="1.0" encoding="utf-8"?>
<videodb>
<version>1</version>
<movie>
<title>2 Guns</title>
<originaltitle>2 Guns</originaltitle>
<ratings>
<rating name="themoviedb" max="10" default="true">
<value>6.500000</value>
<votes>1776</votes>
</rating>
</ratings>
</movie>
<movie>
<title>Top Gun</title>
<originaltitle>Top Gun</originaltitle>
<ratings>
<rating name="themoviedb" max="10" default="true">
<value>7.500000</value>
<votes>1566</votes>
</rating>
</ratings>
</movie>
<movie>
<title>Inception</title>
<originaltitle>Inceptions</originaltitle>
<ratings>
<rating name="themoviedb" max="10" default="true">
<value>9.500000</value>
<votes>177346</votes>
</rating>
</ratings>
</movie>
</videodb>
Code:
import xml.etree.ElementTree as ET
tree = ET.parse(r'E:\Python\DataFiles\movies.xml') # raw string keeps the backslashes literal; replace with your path
root = tree.getroot()
for aMovie in root.iter('movie'):
    print(aMovie.find('title').text)
Output:
2 Guns
Top Gun
Inception
Try replacing:
value = type_text.get ('text')
with
value = type_text.text
xml.etree uses Element.get to retrieve the value of an element's attribute.
You're after the element text; see Element.text.
For example, given this contrived XML:
<element some_attribute="Some Attribute">Some Text</element>
.get('some_attribute') would return Some Attribute while .text would return Some Text.
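A short runnable sketch of that difference, using the contrived element above (the variable names are illustrative):

```python
import xml.etree.ElementTree as ET

element = ET.fromstring(
    '<element some_attribute="Some Attribute">Some Text</element>'
)

# .get() looks up an attribute by name
attribute_value = element.get('some_attribute')
# .text holds the text content of the element
text_value = element.text
# There is no attribute called "text", so .get('text') returns None
missing = element.get('text')

print(attribute_value)
print(text_value)
print(missing)
```

The last line is what happens in the question: .get('text') looks for an attribute named text, finds none, and returns None for every title.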
Suppose you have an lxml.etree element with contents like:
<root>
<element1>
<subelement1>blabla</subelement1>
</element1>
<element2>
<subelement2>blibli</subelement2>
</element2>
</root>
I can use the find or xpath methods to get an element that renders like:
<element1>
<subelement1>blabla</subelement1>
</element1>
Is there a simple way to get:
<root>
<element1>
<subelement1>blabla</subelement1>
</element1>
</root>
i.e., the element of interest plus all its ancestors up to the document root?
I am not sure there is something built-in for it, but here is a terrible, "don't ever use it in real life" type of a workaround using the iterancestors() parent iterator:
from lxml import etree as ET
data = """<root>
<element1>
<subelement1>blabla</subelement1>
</element1>
<element2>
<subelement2>blibli</subelement2>
</element2>
</root>"""
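root = ET.fromstring(data)
element = root.find(".//subelement1")
# encoding="unicode" makes tostring() return str instead of bytes on Python 3
result = ET.tostring(element, encoding="unicode")
# Wrap the serialized element in each ancestor's tag, innermost first
for node in element.iterancestors():
    result = "<{name}>{text}</{name}>".format(name=node.tag, text=result)
print(ET.tostring(ET.fromstring(result), pretty_print=True, encoding="unicode"))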
Prints:
<root>
<element1>
<subelement1>blabla</subelement1>
</element1>
</root>
The following code removes elements that don't have any subelement1 descendants and are not named subelement1.
from lxml import etree
tree = etree.parse("input.xml") # First XML document in question
for elem in list(tree.iter()):  # iterate over a snapshot, since elements are removed in the loop
    if elem.xpath("not(.//subelement1)") and not(elem.tag == "subelement1"):
        if elem.getparent() is not None:
            elem.getparent().remove(elem)
print(etree.tostring(tree, encoding="unicode"))
Output:
<root>
<element1>
<subelement1>blabla</subelement1>
</element1>
</root>
Example xml:
<response version-api="2.0">
<value>
<books>
<book available="20" id="1" tags="">
<title></title>
<author id="1" tags="Joel">Manuel De Cervantes</author>
</book>
<book available="14" id="2" tags="Jane">
<title>Catcher in the Rye</title>
<author id="2" tags="">JD Salinger</author>
</book>
<book available="13" id="3" tags="">
<title></title>
<author id="3">Lewis Carroll</author>
</book>
<book available="5" id="4" tags="Harry">
<title>Don</title>
<author id="4">Manuel De Cervantes</author>
</book>
</books>
</value>
</response>
I want to append a string value of my choosing to all attributes called "tags". This is whether the "tags" attribute has a value or not and also the attributes are at different levels of the xml structure. I have tried the method findall() but I keep on getting an error "IndexError: list index out of range." This is the code I have so far which is a little short but I have run out of steam for what else I need to type...
splitter = etree.XMLParser(strip_cdata=False)
xmldoc = etree.parse(os.path.join(root, xml_file), splitter ).getroot()
for child in xmldoc:
    if child.tag != 'response':
        allDescendants = list(etree.findall())
        for child in allDescendants:
            if hasattr(child, 'tags'):
                child.attribute["tags"].value = "someString"
findall() is the right API to use. Here is an example:
from lxml import etree
import os
splitter = etree.XMLParser(strip_cdata=False)
xml_file = 'foo.xml'
root = '.'
xmldoc = etree.parse(os.path.join(root, xml_file), splitter ).getroot()
# ".//*[@tags]" selects every descendant element that has a tags attribute
for element in xmldoc.findall(".//*[@tags]"):
    element.attrib["tags"] += " KILROY!"
print(etree.tostring(xmldoc, encoding="unicode"))
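The same idea also works with the standard library's ElementTree. The sketch below assumes a shortened copy of the example XML held in a string and a placeholder suffix; it tests membership in .attrib, which is where XML attributes live (hasattr(element, 'tags') is False because tags is not a Python attribute of the element).

```python
import xml.etree.ElementTree as ET

# Shortened copy of the example XML (assumption: two books are enough to illustrate)
xml_doc = '''<response version-api="2.0">
<value>
<books>
<book available="20" id="1" tags="">
<title></title>
<author id="1" tags="Joel">Manuel De Cervantes</author>
</book>
<book available="14" id="2" tags="Jane">
<title>Catcher in the Rye</title>
<author id="2" tags="">JD Salinger</author>
</book>
</books>
</value>
</response>'''

root = ET.fromstring(xml_doc)
suffix = "someString"  # placeholder for the value you want to append

for element in root.iter():
    # Only touch elements that actually carry a tags attribute
    if 'tags' in element.attrib:
        existing = element.get('tags')
        # Avoid a leading space when the attribute was empty
        element.set('tags', existing + ' ' + suffix if existing else suffix)

updated = [el.get('tags') for el in root.iter() if 'tags' in el.attrib]
print(updated)
```

Elements without a tags attribute (the titles here) are left unchanged; whether they should gain one is a separate design decision.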