Parsing an XML file in Python

I have an XML file such as:
<?xml version="1.0" encoding="utf-8"?>
<result>
<data>
<_0>stream1</_0>
<_1>file</_1>
<_2>livestream1</_2>
</data>
</result>
I used
xmlTag = dom.getElementsByTagName('data')[0].toxml()
xmlData=xmlTag.replace('<data>','').replace('</data>','')
and I got xmlData:
<_0>stream1</_0>
<_1>file</_1>
<_2>livestream1</_2>
but I need the values stream1, file, livestream1, etc.
How can I do this?

I would suggest using ElementTree. It's faster than the usual DOM implementations, and I think it's more elegant as well.
from xml.etree import ElementTree
# assuming xml_string is your XML above
xml_etree = ElementTree.fromstring(xml_string)
# find() returns the first <data> child of the root element
data = xml_etree.find('data')
for elem in data:
    print(elem.text)
Output would be:
stream1
file
livestream1
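As a self-contained sketch of the ElementTree approach above, the snippet below embeds the XML from the question as a bytes literal (the name XML_BYTES is only an illustration, not part of the original answer) and collects the values into a list instead of printing them one by one.

```python
from xml.etree import ElementTree

# The XML from the question, embedded as bytes so the encoding
# declaration is honoured by the parser.
XML_BYTES = b"""<?xml version="1.0" encoding="utf-8"?>
<result>
<data>
<_0>stream1</_0>
<_1>file</_1>
<_2>livestream1</_2>
</data>
</result>"""

root = ElementTree.fromstring(XML_BYTES)

# Iterating over an element yields its direct children.
values = [elem.text for elem in root.find("data")]
print(values)
```

Collecting the text into a list makes it easy to join or index the values later, for example ",".join(values).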

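For your information, this is how to do it with lxml and XPath:
from lxml import etree
# xml_string should be bytes if it carries an encoding declaration;
# lxml rejects str input that includes one
doc = etree.fromstring(xml_string)
for elem in doc.xpath('//data/*'):
    print(elem.text)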
The output should be the same:
stream1
file
livestream1

Related

'lxml.etree._ElementTree' object has no attribute 'insert'

I am trying to loop over my .xml files using glob and then use etree to add more code to each .xml file. However, I keep getting an error when calling doc.insert that says the object has no attribute 'insert'. Does anyone know how I can effectively add code to my .xml?
import glob
from lxml import etree
path = "D:/Test/"
for xml_file in glob.glob(path + '/*/*.xml'):
    doc = etree.parse(xml_file)
    new_elem = etree.fromstring("""<new_code abortExpression=""
    elseExpression=""
    errorIfNoMatch="false"/>""")
    # this line raises the AttributeError: doc is an _ElementTree, not an element
    doc.insert(1,new_elem)
    new_elem.tail = "\n"
My original XML looks like this:
<data>
<assesslet index="Test" hash-uptodate="False" types="TriggerRuleType" verbose="True"/>
</data>
And I'd like to modify it to look like this:
<data>
<assesslet index="Test" hash-uptodate="False" types="TriggerRuleType" verbose="True"/>
<new_code abortExpression="" elseExpression="" errorIfNoMatch="false"/>
</data>
The problem is that you need to extract the root from your document before you can start modifying it: modify doc.getroot() instead of doc.
This works for me:
from lxml import etree
xml_file = "./doc.xml"
doc = etree.parse(xml_file)
new_elem = etree.fromstring("""<new_code abortExpression=""
elseExpression=""
errorIfNoMatch="false"/>""")
root = doc.getroot()
root.insert(1, new_elem)
new_elem.tail="\n"
To print the results to a file, you can use doc.write():
doc.write("doc-out.xml", encoding="utf8", xml_declaration=True)
Note the xml_declaration=True argument: it tells doc.write() to produce the <?xml version='1.0' encoding='UTF8'?> header.
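The same tree-versus-root distinction exists in the standard library's xml.etree.ElementTree, so the idea can be sketched without lxml. This is only an illustration under that assumption (the answer above uses lxml, and the SOURCE literal stands in for the file on disk): the tree object returned by parse() has no insert(), while the root element does.

```python
import io
import xml.etree.ElementTree as ET

# Stand-in for the file on disk, using the XML from the question.
SOURCE = b"""<data>
<assesslet index="Test" hash-uptodate="False" types="TriggerRuleType" verbose="True"/>
</data>"""

tree = ET.parse(io.BytesIO(SOURCE))

# The tree wrapper is not an element, so it has no insert() method.
tree_has_insert = hasattr(tree, "insert")

# The root element is a container of child elements and supports insert().
root = tree.getroot()
new_elem = ET.fromstring(
    '<new_code abortExpression="" elseExpression="" errorIfNoMatch="false"/>'
)
new_elem.tail = "\n"
root.insert(1, new_elem)

child_tags = [child.tag for child in root]
print(tree_has_insert, child_tags)
```

Because the root has a single child, inserting at index 1 places the new element right after it.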

How to access the tag below another tag in xml using xml.dom.minidom in python?

I am using Python 3.10.4 and am new to parsing XML files.
For example, let the XML file, named "test.xml", be:
<?xml version="1.0" encoding="UTF-8"?>
<xml>
<tag1 name="1">
<tag2 name="a"></tag2>
</tag1>
<tag1 name="2">
<tag2 name="b"></tag2>
</tag1>
</xml>
python code
import xml.dom.minidom
file = xml.dom.minidom.parse('test.xml')
list = []
tags=file.getElementsByTagName("tag1")
for tag in tags:
    if(tag.getAttribute("name")=="1"):
        print(tag.getAttribute("tag2"))
So here I want to access the tag2 of tag1 with name="1". How can I do it?
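One possible approach, as a minimal sketch: getAttribute() only reads attributes, so a child element has to be looked up with getElementsByTagName() called on the matching tag1 node rather than on the whole document. The sample below assumes the two tag1 elements sit inside a single root element (written here as <xml>, since well-formed XML needs exactly one root), and the variable names are only illustrative.

```python
import xml.dom.minidom

# Sample document, assumed to have a single enclosing root element.
XML_BYTES = b"""<?xml version="1.0" encoding="UTF-8"?>
<xml>
<tag1 name="1">
<tag2 name="a"></tag2>
</tag1>
<tag1 name="2">
<tag2 name="b"></tag2>
</tag1>
</xml>"""

document = xml.dom.minidom.parseString(XML_BYTES)

found_names = []
for tag1 in document.getElementsByTagName("tag1"):
    if tag1.getAttribute("name") == "1":
        # Search only below this tag1 node, not the whole document.
        for tag2 in tag1.getElementsByTagName("tag2"):
            found_names.append(tag2.getAttribute("name"))

print(found_names)
```

Note that getElementsByTagName() searches all descendants, not just direct children; to restrict the lookup to direct children, iterate over tag1.childNodes and check each node's nodeName instead.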

How to output XML declaration <?xml version="1.0"?> in Python/ElementTree

I'm trying to create an XML file for the Word reference source file, which is in XML. When I write to the file with only xml_declaration=True, it shows <?xml version='1.0' encoding='us-ascii'?>, but I want it in the form <?xml version="1.0"?>.
from xml.etree.ElementTree import ElementTree
from xml.etree.ElementTree import Element
import xml.etree.ElementTree as ET
import uuid
from lxml import etree
root=Element('b:sources')
root.set('SelectedStyle','')
root.set('xmlns:b','http://schemas.openxmlformats.org/officeDocument/2006/bibliography')
root.set('xmlns','http://schemas.openxmlformats.org/officeDocument/2006/bibliography')
#root.attrib=('SelectedStyle'='', 'xmlns:b'='"http://schemas.openxmlformats.org/officeDocument/2006/bibliography"', 'xmlns:b'='"http://schemas.openxmlformats.org/officeDocument/2006/bibliography"','xmlns'='"http://schemas.openxmlformats.org/officeDocument/2006/bibliography"')
source=ET.SubElement(root, 'b:source')
ET.SubElement(source,'b:Tag')
ET.SubElement(source,'b:SourceType').text='Misc'
ET.SubElement(source,'b:guid').text=str(uuid.uuid1())
Author=ET.SubElement(source,'b:Author')
Author2=ET.SubElement(Author,'b:Author')
ET.SubElement(Author2,'b:Corporate').text='Norsk olje og gass'
ET.SubElement(source, 'b:Title').text='R-002'
ET.SubElement(source, 'b:Year').text='2019'
ET.SubElement(source, 'b:Month').text='10'
ET.SubElement(source, 'b:Day').text='27'
tree=ElementTree(root)
tree.write('Sources.xml', xml_declaration=True, method='xml')
Answer:
When xml.etree.ElementTree writes the XML declaration, there is no way to keep the encoding attribute out of it. If you don't want an encoding attribute in the XML declaration at all, you need to use xml.dom.minidom, not xml.etree.ElementTree.
Here is a snippet to set up an example:
import xml.etree.ElementTree
a = xml.etree.ElementTree.Element('a')
tree = xml.etree.ElementTree.ElementTree(element=a)
root = tree.getroot()
Omit Encoding:
out = xml.etree.ElementTree.tostring(root, xml_declaration=True)
b"<?xml version='1.0' encoding='us-ascii'?>\n<a />"
Encoding us-ascii:
out = xml.etree.ElementTree.tostring(root, encoding='us-ascii', xml_declaration=True)
b"<?xml version='1.0' encoding='us-ascii'?>\n<a />"
Encoding unicode:
out = xml.etree.ElementTree.tostring(root, encoding='unicode', xml_declaration=True)
"<?xml version='1.0' encoding='UTF-8'?>\n<a />"
Using minidom:
Let's take the first example from above with the encoding omitted and use the variable out as the input to xml.dom.minidom and you will see the output that you're seeking.
import xml.dom.minidom
dom = xml.dom.minidom.parseString(out)
dom.toxml()
'<?xml version="1.0" ?><a/>'
There is also a pretty print option:
dom.toprettyxml()
'<?xml version="1.0" ?>\n<a/>\n'
Note
Take a look at the source code, and you can see that the encoding is hard coded in the output.
with _get_writer(file_or_filename, encoding) as (write, declared_encoding):
    if method == "xml" and (xml_declaration or
            (xml_declaration is None and
             declared_encoding.lower() not in ("utf-8", "us-ascii"))):
        write("<?xml version='1.0' encoding='%s'?>\n" % (
            declared_encoding,))
https://github.com/python/cpython/blob/550c44b89513ea96d209e2ff761302238715f082/Lib/xml/etree/ElementTree.py#L731-L736
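If the exact form <?xml version="1.0"?> matters (minidom emits a space before the closing ?>), one further option is to let ElementTree serialize the tree without any declaration and prepend the declaration by hand. This is not the technique from the answer above, only a minimal sketch of a workaround; the names DECLARATION and document_text are illustrative.

```python
import xml.etree.ElementTree as ET

root = ET.Element("a")

# encoding="unicode" returns a str; with xml_declaration left unset,
# no declaration is written for this encoding.
body = ET.tostring(root, encoding="unicode")

# Hand-written declaration without an encoding attribute. XML parsers
# then assume UTF-8 (or UTF-16 with a byte order mark), so the text
# should be saved as UTF-8.
DECLARATION = '<?xml version="1.0"?>\n'
document_text = DECLARATION + body
print(document_text)
```

To save the result, open the target file in text mode with encoding="utf-8" and write document_text to it.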

Python how to strip white-spaces from xml text nodes

I have an XML file as follows:
<Person>
<name>
My Name
</name>
<Address>My Address</Address>
</Person>
The name tag has extra newlines. Is there a quick Pythonic way to trim them and generate a new XML file?
I found this, but it trims only the whitespace between tags, not the whitespace inside the values:
https://skyl.org/log/post/skyl/2010/04/remove-insignificant-whitespace-from-xml-string-with-python/
Update 1 - Handle the following XML, which has tail whitespace in the <name> tag:
<Person>
<name>
My Name<shortname>My</shortname>
</name>
<Address>My Address</Address>
</Person>
The accepted answer handles both kinds of XML above.
Update 2 - I have posted my version in an answer below. I use it to remove all kinds of whitespace and write pretty-printed XML to a file with an XML declaration and encoding.
https://stackoverflow.com/a/19396130/973699
With lxml you can iterate over all elements and strip() the text of each one that has any:
from lxml import etree
tree = etree.parse('xmlfile')
root = tree.getroot()
for elem in root.iter('*'):
    if elem.text is not None:
        elem.text = elem.text.strip()
print(etree.tostring(root))
It yields:
<Person><name>My Name</name>
<Address>My Address</Address>
</Person>
UPDATE to strip tail text too:
from lxml import etree
tree = etree.parse('xmlfile')
root = tree.getroot()
for elem in root.iter('*'):
    if elem.text is not None:
        elem.text = elem.text.strip()
    if elem.tail is not None:
        elem.tail = elem.tail.strip()
print(etree.tostring(root, encoding="utf-8", xml_declaration=True))
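The same text-and-tail loop can be sketched with the standard library's xml.etree.ElementTree, in case lxml is not available. This is an adaptation rather than the answer's own code, and the XML from the question is embedded as a string (XML_TEXT) purely so the snippet runs on its own.

```python
import xml.etree.ElementTree as ET

# The XML from the question, embedded so the sketch is self-contained.
XML_TEXT = """<Person>
<name>
My Name
</name>
<Address>My Address</Address>
</Person>"""

root = ET.fromstring(XML_TEXT)

for elem in root.iter():
    # .text is the text before the first child; .tail is the text
    # that follows the element's closing tag.
    if elem.text is not None:
        elem.text = elem.text.strip()
    if elem.tail is not None:
        elem.tail = elem.tail.strip()

stripped = ET.tostring(root, encoding="unicode")
print(stripped)
```

Stripping .tail also removes the newlines between sibling elements, so the result comes out on a single line.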
The accepted answer given by Birei using lxml does the job perfectly, but I wanted to trim all kinds of whitespace and blank lines and regenerate pretty XML in an XML file.
The following code did what I wanted:
from lxml import etree
# discard strings which are entirely whitespace
myparser = etree.XMLParser(remove_blank_text=True)
root = etree.parse('xmlfile',myparser)
# from Birei's answer
for elem in root.iter('*'):
    if elem.text is not None:
        elem.text = elem.text.strip()
    if elem.tail is not None:
        elem.tail = elem.tail.strip()
# write the xml file with pretty print and xml encoding
root.write('xmlfile', pretty_print=True, encoding="utf-8", xml_declaration=True)
You have to do xml parsing for this one way or another, so maybe use xml.sax and copy to the output stream at each event (skipping ignorableWhitespace), and add tag markers as needed. Check the sample code here http://www.knowthytools.com/2010/03/sax-parsing-with-python.html.
You can use BeautifulSoup. Traverse all elements and, for each one that contains some text, replace it with its stripped version:
from bs4 import BeautifulSoup
soup = BeautifulSoup(open('xmlfile', 'r'), 'xml')
for elem in soup.find_all():
    if elem.string is not None:
        elem.string = elem.string.strip()
print(soup)
Assuming xmlfile with the content provided in the question, it yields:
<?xml version="1.0" encoding="utf-8"?>
<Person>
<name>My Name</name>
<Address>My Address</Address>
</Person>
I'm working with an older version of Python (2.3), and I'm currently stuck with the standard library. To show an answer that's broadly backwards compatible, I've written this with xml.dom.minidom functions.
import codecs
from xml.dom import minidom
# Read in the file to a DOM data structure.
original_document = minidom.parse("original_document.xml")
# Open a UTF-8 encoded file, because it's fairly standard for XML.
stripped_file = codecs.open("stripped_document.xml", "w", encoding="utf8")
# Write the document without adding any extra indentation or newlines.
original_document.writexml(stripped_file, indent="", addindent="", newl="")
stripped_file.close()
While it's not BeautifulSoup, this solution is pretty elegant and uses the full force of the lower-level API. Note that the actual formatting is just one line :)
Documentation of API calls used here:
minidom.parse
minidom.Node.writexml
codecs.open

Python ElementTree parsing unbound prefix error

I am learning ElementTree in Python. Everything seems fine except when I try to parse an XML file with a prefix:
test.xml:
<?xml version="1.0"?>
<abc:data>
<abc:country name="Liechtenstein" rank="1" year="2008">
</abc:country>
<abc:country name="Singapore" rank="4" year="2011">
</abc:country>
<abc:country name="Panama" rank="5" year="2011">
</abc:country>
</abc:data>
When I try to parse the xml:
import xml.etree.ElementTree as ET
tree = ET.parse('test.xml')
I got the following error:
xml.etree.ElementTree.ParseError: unbound prefix: line 2, column 0
Do I need to specify something in order to parse an XML file with a prefix?
Declare the abc namespace in your XML file:
<?xml version="1.0"?>
<abc:data xmlns:abc="your namespace">
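To see the effect of that declaration, here is a minimal sketch with xml.etree.ElementTree that parses a shortened version of the document twice, once as posted and once with the prefix bound. The namespace URI http://example.com/abc is a made-up placeholder; use whatever URI your data actually belongs to.

```python
import xml.etree.ElementTree as ET

# Shortened version of test.xml from the question (prefix not declared).
UNBOUND = """<?xml version="1.0"?>
<abc:data>
<abc:country name="Liechtenstein" rank="1" year="2008">
</abc:country>
</abc:data>"""

# Hypothetical namespace URI, used only for illustration.
NS_URI = "http://example.com/abc"
BOUND = UNBOUND.replace("<abc:data>", '<abc:data xmlns:abc="%s">' % NS_URI)

# Without the xmlns:abc declaration the parser rejects the document.
try:
    ET.fromstring(UNBOUND)
    error_message = None
except ET.ParseError as exc:
    error_message = str(exc)

# With the declaration, parsing succeeds and the prefix can be used
# in searches through a prefix-to-URI mapping.
root = ET.fromstring(BOUND)
countries = root.findall("abc:country", {"abc": NS_URI})
names = [country.get("name") for country in countries]
print(error_message, names)
```

After parsing, ElementTree stores tags in the expanded form {uri}localname, which is why findall() needs the mapping to resolve the abc prefix.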
I encountered the same issue while processing an XML file. You can use the code below to parse your XML file, which should resolve the issue:
from lxml import etree
# recover=True is an lxml parser option that tolerates the undeclared prefix
parser1 = etree.XMLParser(encoding="utf-8", recover=True)
tree1 = etree.parse('filename.xml', parser1)
See if this works:
from bs4 import BeautifulSoup
xml_file = "test.xml"
with open(xml_file, "r", encoding="utf8") as f:
    contents = f.read()
soup = BeautifulSoup(contents, "xml")
items = soup.find_all("country")
print(items)
The above will produce a list which you can then manipulate to achieve your aim (e.g. remove html tags etc.):
[<country name="Liechtenstein" rank="1" year="2008">
</country>, <country name="Singapore" rank="4" year="2011">
</country>, <country name="Panama" rank="5" year="2011">
</country>]
