I am trying to parse through my .xml file using glob and then use etree to add more code to my .xml. However, I keep getting an error when using doc insert that says object has no attribute insert. Does anyone know how I can effectively add code to my .xml?
from lxml import etree
path = "D:/Test/"
for xml_file in glob.glob(path + '/*/*.xml'):
doc = etree.parse(xml_file)
new_elem = etree.fromstring("""<new_code abortExpression=""
elseExpression=""
errorIfNoMatch="false"/>""")
doc.insert(1,new_elem)
new_elem.tail = "\n"
My original xml looks like this :
<data>
<assesslet index="Test" hash-uptodate="False" types="TriggerRuleType" verbose="True"/>
</data>
And I'd like to modify it to look like this:
<data>
<assesslet index="Test" hash-uptodate="False" types="TriggerRuleType" verbose="True"/>
<new_code abortExpression="" elseExpression="" errorIfNoMatch="false"/>
</data>
The problem is that you need to extract the root from your document before you can start modifying it: modify doc.getroot() instead of doc.
This works for me:
from lxml import etree
xml_file = "./doc.xml"
doc = etree.parse(xml_file)
new_elem = etree.fromstring("""<new_code abortExpression=""
elseExpression=""
errorIfNoMatch="false"/>""")
root = doc.getroot()
root.insert(1, new_elem)
new_elem.tail="\n"
To print the results to a file, you can use doc.write():
doc.write("doc-out.xml", encoding="utf8", xml_declaration=True)
Note the xml_declaration=True argument: it tells doc.write() to produce the <?xml version='1.0' encoding='UTF8'?> header.
I have a xml file as follows
<Person>
<name>
My Name
</name>
<Address>My Address</Address>
</Person>
The tag has extra new lines, Is there any quick Pythonic way to trim this and generate a new xml.
I found this but it trims only which are between tags not the value
https://skyl.org/log/post/skyl/2010/04/remove-insignificant-whitespace-from-xml-string-with-python/
Update 1 - Handle following xml which has tail spaces in <name> tag
<Person>
<name>
My Name<shortname>My</short>
</name>
<Address>My Address</Address>
</Person>
Accepted answer handle above both kind of xml's
Update 2 - I have posted my version in answer below, I am using it to remove all kind of whitespaces and generate pretty xml in file with xml encodings
https://stackoverflow.com/a/19396130/973699
With lxml you can iterate over all elements and check if it has text to strip():
from lxml import etree
tree = etree.parse('xmlfile')
root = tree.getroot()
for elem in root.iter('*'):
if elem.text is not None:
elem.text = elem.text.strip()
print(etree.tostring(root))
It yields:
<Person><name>My Name</name>
<Address>My Address</Address>
</Person>
UPDATE to strip tail text too:
from lxml import etree
tree = etree.parse('xmlfile')
root = tree.getroot()
for elem in root.iter('*'):
if elem.text is not None:
elem.text = elem.text.strip()
if elem.tail is not None:
elem.tail = elem.tail.strip()
print(etree.tostring(root, encoding="utf-8", xml_declaration=True))
Accepted answer given by Birei using lxml does the job perfectly, but I wanted to trim all kind of white/blank space, blank lines and regenerate pretty xml in a xml file.
Following code did what I wanted
from lxml import etree
#discard strings which are entirely white spaces
myparser = etree.XMLParser(remove_blank_text=True)
root = etree.parse('xmlfile',myparser)
#from Birei's answer
for elem in root.iter('*'):
if elem.text is not None:
elem.text = elem.text.strip()
if elem.tail is not None:
elem.tail = elem.tail.strip()
#write the xml file with pretty print and xml encoding
root.write('xmlfile', pretty_print=True, encoding="utf-8", xml_declaration=True)
You have to do xml parsing for this one way or another, so maybe use xml.sax and copy to the output stream at each event (skipping ignorableWhitespace), and add tag markers as needed. Check the sample code here http://www.knowthytools.com/2010/03/sax-parsing-with-python.html.
You can use beautifulsoup. Do traverse all elements and for each one that contains some text, replace it with its stripped version:
from bs4 import BeautifulSoup
soup = BeautifulSoup(open('xmlfile', 'r'), 'xml')
for elem in soup.find_all():
if elem.string is not None:
elem.string = elem.string.strip()
print(soup)
Assuming xmlfile with the content provided in the question, it yields:
<?xml version="1.0" encoding="utf-8"?>
<Person>
<name>My Name</name>
<Address>My Address</Address>
</Person>
I'm working with an older version of Python (2.3), and I'm currently stuck with the standard library. To show an answer that's greatly backwards compatible, I've written this with xml.dom and xml.minidom functions.
import codecs
from xml.dom import minidom
# Read in the file to a DOM data structure.
original_document = minidom.parse("original_document.xml")
# Open a UTF-8 encoded file, because it's fairly standard for XML.
stripped_file = codecs.open("stripped_document.xml", "w", encoding="utf8")
# Tell minidom to format the child text nodes without any extra whitespace.
original_document.writexml(stripped_file, indent="", addindent="", newl="")
stripped_file.close()
While it's not BeautifulSoup, this solution is pretty elegant and uses the full force of the lower-level API. Note that the actual formatting is just one line :)
Documentation of API calls used here:
minidom.parse
minidom.Node.writexml
codecs.open
I am generating an XML document in Python using an ElementTree, but the tostring function doesn't include an XML declaration when converting to plaintext.
from xml.etree.ElementTree import Element, tostring
document = Element('outer')
node = SubElement(document, 'inner')
node.NewValue = 1
print tostring(document) # Outputs "<outer><inner /></outer>"
I need my string to include the following XML declaration:
<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
However, there does not seem to be any documented way of doing this.
Is there a proper method for rendering the XML declaration in an ElementTree?
I am surprised to find that there doesn't seem to be a way with ElementTree.tostring(). You can however use ElementTree.ElementTree.write() to write your XML document to a fake file:
from io import BytesIO
from xml.etree import ElementTree as ET
document = ET.Element('outer')
node = ET.SubElement(document, 'inner')
et = ET.ElementTree(document)
f = BytesIO()
et.write(f, encoding='utf-8', xml_declaration=True)
print(f.getvalue()) # your XML file, encoded as UTF-8
See this question. Even then, I don't think you can get your 'standalone' attribute without writing prepending it yourself.
I would use lxml (see http://lxml.de/api.html).
Then you can:
from lxml import etree
document = etree.Element('outer')
node = etree.SubElement(document, 'inner')
print(etree.tostring(document, xml_declaration=True))
If you include the encoding='utf8', you will get an XML header:
xml.etree.ElementTree.tostring writes a XML encoding declaration with encoding='utf8'
Sample Python code (works with Python 2 and 3):
import xml.etree.ElementTree as ElementTree
tree = ElementTree.ElementTree(
ElementTree.fromstring('<xml><test>123</test></xml>')
)
root = tree.getroot()
print('without:')
print(ElementTree.tostring(root, method='xml'))
print('')
print('with:')
print(ElementTree.tostring(root, encoding='utf8', method='xml'))
Python 2 output:
$ python2 example.py
without:
<xml><test>123</test></xml>
with:
<?xml version='1.0' encoding='utf8'?>
<xml><test>123</test></xml>
With Python 3 you will note the b prefix indicating byte literals are returned (just like with Python 2):
$ python3 example.py
without:
b'<xml><test>123</test></xml>'
with:
b"<?xml version='1.0' encoding='utf8'?>\n<xml><test>123</test></xml>"
xml_declaration Argument
Is there a proper method for rendering the XML declaration in an ElementTree?
YES, and there is no need of using .tostring function. According to ElementTree Documentation, you should create an ElementTree object, create Element and SubElements, set the tree's root, and finally use xml_declaration argument in .write function, so the declaration line is included in output file.
You can do it this way:
import xml.etree.ElementTree as ET
tree = ET.ElementTree("tree")
document = ET.Element("outer")
node1 = ET.SubElement(document, "inner")
node1.text = "text"
tree._setroot(document)
tree.write("./output.xml", encoding = "UTF-8", xml_declaration = True)
And the output file is:
<?xml version='1.0' encoding='UTF-8'?>
<outer><inner>text</inner></outer>
I encounter this issue recently, after some digging of the code, I found the following code snippet is definition of function ElementTree.write
def write(self, file, encoding="us-ascii"):
assert self._root is not None
if not hasattr(file, "write"):
file = open(file, "wb")
if not encoding:
encoding = "us-ascii"
elif encoding != "utf-8" and encoding != "us-ascii":
file.write("<?xml version='1.0' encoding='%s'?>\n" %
encoding)
self._write(file, self._root, encoding, {})
So the answer is, if you need write the XML header to your file, set the encoding argument other than utf-8 or us-ascii, e.g. UTF-8
Easy
Sample for both Python 2 and 3 (encoding parameter must be utf8):
import xml.etree.ElementTree as ElementTree
tree = ElementTree.ElementTree(ElementTree.fromstring('<xml><test>123</test></xml>'))
root = tree.getroot()
print(ElementTree.tostring(root, encoding='utf8', method='xml'))
From Python 3.8 there is xml_declaration parameter for that stuff:
New in version 3.8: The xml_declaration and default_namespace
parameters.
xml.etree.ElementTree.tostring(element, encoding="us-ascii",
method="xml", *, xml_declaration=None, default_namespace=None,
short_empty_elements=True) Generates a string representation of an XML
element, including all subelements. element is an Element instance.
encoding 1 is the output encoding (default is US-ASCII). Use
encoding="unicode" to generate a Unicode string (otherwise, a
bytestring is generated). method is either "xml", "html" or "text"
(default is "xml"). xml_declaration, default_namespace and
short_empty_elements has the same meaning as in ElementTree.write().
Returns an (optionally) encoded string containing the XML data.
Sample for Python 3.8 and higher:
import xml.etree.ElementTree as ElementTree
tree = ElementTree.ElementTree(ElementTree.fromstring('<xml><test>123</test></xml>'))
root = tree.getroot()
print(ElementTree.tostring(root, encoding='unicode', method='xml', xml_declaration=True))
The minimal working example with ElementTree package usage:
import xml.etree.ElementTree as ET
document = ET.Element('outer')
node = ET.SubElement(document, 'inner')
node.text = '1'
res = ET.tostring(document, encoding='utf8', method='xml').decode()
print(res)
the output is:
<?xml version='1.0' encoding='utf8'?>
<outer><inner>1</inner></outer>
Another pretty simple option is to concatenate the desired header to the string of xml like this:
xml = (bytes('<?xml version="1.0" encoding="UTF-8"?>\n', encoding='utf-8') + ET.tostring(root))
xml = xml.decode('utf-8')
with open('invoice.xml', 'w+') as f:
f.write(xml)
I would use ET:
try:
from lxml import etree
print("running with lxml.etree")
except ImportError:
try:
# Python 2.5
import xml.etree.cElementTree as etree
print("running with cElementTree on Python 2.5+")
except ImportError:
try:
# Python 2.5
import xml.etree.ElementTree as etree
print("running with ElementTree on Python 2.5+")
except ImportError:
try:
# normal cElementTree install
import cElementTree as etree
print("running with cElementTree")
except ImportError:
try:
# normal ElementTree install
import elementtree.ElementTree as etree
print("running with ElementTree")
except ImportError:
print("Failed to import ElementTree from any known place")
document = etree.Element('outer')
node = etree.SubElement(document, 'inner')
print(etree.tostring(document, encoding='UTF-8', xml_declaration=True))
This works if you just want to print. Getting an error when I try to send it to a file...
import xml.dom.minidom as minidom
import xml.etree.ElementTree as ET
from xml.etree.ElementTree import Element, SubElement, Comment, tostring
def prettify(elem):
rough_string = ET.tostring(elem, 'utf-8')
reparsed = minidom.parseString(rough_string)
return reparsed.toprettyxml(indent=" ")
Including 'standalone' in the declaration
I didn't found any alternative for adding the standalone argument in the documentation so I adapted the ET.tosting function to take it as an argument.
from xml.etree import ElementTree as ET
# Sample
document = ET.Element('outer')
node = ET.SubElement(document, 'inner')
et = ET.ElementTree(document)
# Function that you need
def tostring(element, declaration, encoding=None, method=None,):
class dummy:
pass
data = []
data.append(declaration+"\n")
file = dummy()
file.write = data.append
ET.ElementTree(element).write(file, encoding, method=method)
return "".join(data)
# Working example
xdec = """<?xml version="1.0" encoding="UTF-8" standalone="no" ?>"""
xml = tostring(document, encoding='utf-8', declaration=xdec)
I have a XML file that starts like this:
<?xml version="1.0" encoding="utf-8"?>
<Recipe xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
I need to read it in, modify it, then write it back out. Here is a code snippet:
from xml.etree import ElementTree
with open('base.xml', 'rt') as f:
tree = ElementTree.parse(f)
recipe = tree.find('')
t = recipe.find('Targets_Params/Target_Table/Target_Name')
t.text = "new Value"
output_file = open('new.xml', 'w' )
output_file.write(ElementTree.tostring(recipe))
output_file.close()
My problem is that when I write the file out I do not get the first line at all, and the second line comes out with just:
<Recipe>
How I can read in the file, modify it, and write it out while preserving the original structure?
I am trying to extract the name elements under the sequence in xml files. I have pasted in the top of a sample xml to illustrate. With this I want to get the text from 01 Interview_been successful through mentorship and write it to a file. There are multiple sequence tags in the xml and I am trying to figure out how to go through it and extract it. I have tried to figure out how to use xml.etree and xml.dom.minidom but I can't seem to wrap my brain around it. I was able to get all of the id values from the sequence tags but not the name elements. I'm pasting in my code before the xml.
from xml.etree import ElementTree
file = open("xmldump.txt", "r")
filedata = file.read()
file.close()
with open('test.xml', 'rt') as f:
tree = ElementTree.parse(f)
for node in tree.iter('name'):
sequenceid = node.attrib.get('name')
print ' %s' % (sequenceid)
newLine = sequenceid + "\n"
file = open("xmldump.txt", "w")
file.write(newLine)
file.close()
Here is the XML:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xmeml>
<xmeml version="5">
<bin>
<uuid>0F5D72FA-54E4-4DE8-81D7-CC33F5C43836</uuid>
<updatebehavior>add</updatebehavior>
<name>Logged</name>
<children>
<sequence id="01 Interview_been successful through mentorship">
<uuid>12FB944D-83EA-4527-9A54-2130A42E3A06</uuid>
<updatebehavior>add</updatebehavior>
<name>01 Interview_been successful through mentorship</name>
<duration>1195</duration>
<rate>
<ntsc>TRUE</ntsc>
<timebase>24</timebase>
</rate>
<timecode>
Well, I'm not sure if you want the "id" attribute or the name tag(your code is confusing, it tries to extract a "name" attribute out of the "sequence" tag, but that tag only has an "id" attribute). Below is code that extract both, should help you get started on figuring out how ElementTree works
from xml.etree import ElementTree
with open('test.xml', 'rt') as f:
tree = ElementTree.parse(f)
for node in tree.iter('sequence'):
sequenceid = node.attrib.get('id')
name = node.findtext('name')