Python how to strip white-spaces from xml text nodes

Python how to strip white-spaces from xml text nodes - python

I have a xml file as follows
<Person>
<name>
My Name
</name>
<Address>My Address</Address>
</Person>
The tag has extra new lines, Is there any quick Pythonic way to trim this and generate a new xml.
I found this but it trims only which are between tags not the value
https://skyl.org/log/post/skyl/2010/04/remove-insignificant-whitespace-from-xml-string-with-python/
Update 1 - Handle following xml which has tail spaces in <name> tag
<Person>
<name>
My Name<shortname>My</short>
</name>
<Address>My Address</Address>
</Person>
Accepted answer handle above both kind of xml's
Update 2 - I have posted my version in answer below, I am using it to remove all kind of whitespaces and generate pretty xml in file with xml encodings
https://stackoverflow.com/a/19396130/973699

With lxml you can iterate over all elements and check if it has text to strip():
from lxml import etree
tree = etree.parse('xmlfile')
root = tree.getroot()
for elem in root.iter('*'):
if elem.text is not None:
elem.text = elem.text.strip()
print(etree.tostring(root))
It yields:
<Person><name>My Name</name>
<Address>My Address</Address>
</Person>
UPDATE to strip tail text too:
from lxml import etree
tree = etree.parse('xmlfile')
root = tree.getroot()
for elem in root.iter('*'):
if elem.text is not None:
elem.text = elem.text.strip()
if elem.tail is not None:
elem.tail = elem.tail.strip()
print(etree.tostring(root, encoding="utf-8", xml_declaration=True))

Accepted answer given by Birei using lxml does the job perfectly, but I wanted to trim all kind of white/blank space, blank lines and regenerate pretty xml in a xml file.
Following code did what I wanted
from lxml import etree
#discard strings which are entirely white spaces
myparser = etree.XMLParser(remove_blank_text=True)
root = etree.parse('xmlfile',myparser)
#from Birei's answer
for elem in root.iter('*'):
if elem.text is not None:
elem.text = elem.text.strip()
if elem.tail is not None:
elem.tail = elem.tail.strip()
#write the xml file with pretty print and xml encoding
root.write('xmlfile', pretty_print=True, encoding="utf-8", xml_declaration=True)

You have to do xml parsing for this one way or another, so maybe use xml.sax and copy to the output stream at each event (skipping ignorableWhitespace), and add tag markers as needed. Check the sample code here http://www.knowthytools.com/2010/03/sax-parsing-with-python.html.

You can use beautifulsoup. Do traverse all elements and for each one that contains some text, replace it with its stripped version:
from bs4 import BeautifulSoup
soup = BeautifulSoup(open('xmlfile', 'r'), 'xml')
for elem in soup.find_all():
if elem.string is not None:
elem.string = elem.string.strip()
print(soup)
Assuming xmlfile with the content provided in the question, it yields:
<?xml version="1.0" encoding="utf-8"?>
<Person>
<name>My Name</name>
<Address>My Address</Address>
</Person>

I'm working with an older version of Python (2.3), and I'm currently stuck with the standard library. To show an answer that's greatly backwards compatible, I've written this with xml.dom and xml.minidom functions.
import codecs
from xml.dom import minidom
# Read in the file to a DOM data structure.
original_document = minidom.parse("original_document.xml")
# Open a UTF-8 encoded file, because it's fairly standard for XML.
stripped_file = codecs.open("stripped_document.xml", "w", encoding="utf8")
# Tell minidom to format the child text nodes without any extra whitespace.
original_document.writexml(stripped_file, indent="", addindent="", newl="")
stripped_file.close()
While it's not BeautifulSoup, this solution is pretty elegant and uses the full force of the lower-level API. Note that the actual formatting is just one line :)
Documentation of API calls used here:
minidom.parse
minidom.Node.writexml
codecs.open

Related

How to determine what the root tag name is for a XML document

I was wonder how I would go about determining what the root tag for an XML document is using xml.dom.minidom.
<?xml version="1.0" encoding="UTF-8"?>
<root>
<child1></child1>
<child2></child2>
<child3></child3>
</root>
In the example XML above, my root tag could be 3 or 4 different things. All I want to do is pull the tag, and then use that value to get the elements by tag name.
def import_from_XML(self, file_name)
file = open(file_name)
document = file.read()
if re.compile('^<\?xml').match(document):
xml = parseString(document)
root = '' # <-- THIS IS WHERE IM STUCK
elements = xml.getElementsByTagName(root)
I tried searching through the documentation for xml.dom.minidom, but it is a little hard for me to wrap my head around, and I couldn't find anything that answered this question outright.
I'm using Python 3.6.x, and I would prefer to keep with the standard library if possible.

For the line you commented as Where I am stuck, the following should assign the value of the root tag of the XML document to the variable theNameOfTheRootElement:
theNameOfTheRootElement = xml.documentElement.tagName

this is what I did when I last processed xml. I didn't use the approach you used but I hope it will help you.
import urllib2
from xml.etree import ElementTree as ET
req = urllib2.Request(site)
file=None
try:
file = urllib2.urlopen(req)
except urllib2.URLError as e:
print e.reason
data = file.read()
file.close()
root = ET.fromstring(data)
print("root", root)
for child in root.findall('parent element'):
print(child.text, child.attrib)

Pretty formatting xml file in Python using lxml

I am trying to add a vhost entry to tomcat server.xml using python lxml
import io
from lxml import etree
newdoc = etree.fromstring('<Host name="getrailo.com" appBase="webapps"><Context path="" docBase="/var/sites/getrailo.org" /><Alias>www.getrailo.org</Alias><Alias>my.getrailo.org</Alias></Host>')
doc = etree.parse('/root/server.xml')
root = doc.getroot()
for node1 in root.iter('Service'):
for node2 in node1.iter('Engine'):
node2.append(newdoc)
doc.write('/root/server.xml')
The problem is that it is removing the
<?xml version='1.0' encoding='utf-8'?>
line on top of the file from the output and the vhost entry is all in one line .How can I add the xml element in a pretty way like
<Host name="getrailo.org" appBase="webapps">
<Context path="" docBase="/var/sites/getrailo.org" />
<Alias>www.getrailo.org</Alias>
<Alias>my.getrailo.org</Alias>
</Host>

First you need to parse existing file with remove_blank_text so that it's clean and with no extra spaces that I think is a problem in this case
parser = etree.XMLParser(remove_blank_text=True)
newdoc = etree.fromstring('/root/server.xml' parser=parser)
Then you're safe to write it back to disk with pretty_print and xml_declaration set in doc.write()
doc.write('/root/server.xml',
xml_declaration=True,
encoding='utf-8',
pretty_print=True)

Find and replacing text in elementtree

i am very new to programming and python. I am trying to find and replace a text in an xml file. Here is my xml file
<?xml version="1.0" encoding="UTF-8"?>
<!--Arbortext, Inc., 1988-2008, v.4002-->
<!DOCTYPE doc PUBLIC "-//MYCOMPANY//DTD XSEIF 1/FAD 110 05 R5//EN"
"XSEIF_R5.dtd">
<doc version="XSEIF R5"
xmlns="urn:x-mycompany:r2:reg-doc:1551-fad.110.05:en:*">
<meta-data></meta-data>
<front></front>
<body>
<chl1><title xml:id="id_881i">Installation</title>
<p>To install SDK, perform the tasks mentioned in the following
table.</p>
<p><input>ln -s /sim/<var>user_id</var>/.VirtualBox $home/.VirtualBox</input
></p>
</chl1>
</body>
</doc>
<?Pub *0000021917 0?>
I need to replace all entries of "virtual box" with "Xen". For this i tried Elementtree. But i dont know how to replace and write back to the file. Here is my try.
import xml.etree.ElementTree as ET
tree=ET.parse('C:/My_location/1_1531-CRA 119 1364_2.xml')
doc=tree.getroot()
iterator=doc.getiterator()
for body in iterator:
old_text=body.replace("Virtualbox", "Xen")
The texts are available in many sub tags under body.I got the method to remove the subelement and append a new element, but didnt get to replace only the texts.

Replace text, tail attributes.
import lxml.etree as ET
with open('1.xml', 'rb+') as f:
tree = ET.parse(f)
root = tree.getroot()
for elem in root.getiterator():
if elem.text:
elem.text = elem.text.replace('VirtualBox', 'Xen')
if elem.tail:
elem.tail = elem.tail.replace('VirtualBox', 'Xen')
f.seek(0)
f.write(ET.tostring(tree, encoding='UTF-8', xml_declaration=True))
f.truncate()

Probably the simplest way is to do:
ifile = open('input_file','r')
ofile = open('output_file','w')
for line in ifile.readlines():
ofile.write(line.replace('VirtualBox','Xen'))
ifile.close()
ofile.close()

How to add an element to xml file by using elementtree

I've a xml file, and I'm trying to add additional element to it.
the xml has the next structure :
<root>
<OldNode/>
</root>
What I'm looking for is :
<root>
<OldNode/>
<NewNode/>
</root>
but actually I'm getting next xml :
<root>
<OldNode/>
</root>
<root>
<OldNode/>
<NewNode/>
</root>
My code looks like that :
file = open("/tmp/" + executionID +".xml", 'a')
xmlRoot = xml.parse("/tmp/" + executionID +".xml").getroot()
child = xml.Element("NewNode")
xmlRoot.append(child)
xml.ElementTree(root).write(file)
file.close()
Thanks.

You opened the file for appending, which adds data to the end. Open the file for writing instead, using the w mode. Better still, just use the .write() method on the ElementTree object:
tree = xml.parse("/tmp/" + executionID +".xml")
xmlRoot = tree.getroot()
child = xml.Element("NewNode")
xmlRoot.append(child)
tree.write("/tmp/" + executionID +".xml")
Using the .write() method has the added advantage that you can set the encoding, force the XML prolog to be written if you need it, etc.
If you must use an open file to prettify the XML, use the 'w' mode, 'a' opens a file for appending, leading to the behaviour you observed:
with open("/tmp/" + executionID +".xml", 'w') as output:
output.write(prettify(tree))
where prettify is something along the lines of:
from xml.etree import ElementTree
from xml.dom import minidom
def prettify(elem):
"""Return a pretty-printed XML string for the Element.
"""
rough_string = ElementTree.tostring(elem, 'utf-8')
reparsed = minidom.parseString(rough_string)
return reparsed.toprettyxml(indent=" ")
e.g. the minidom prettifying trick.

Python ElementTree parsing unbound prefix error

I am learning ElementTree in python. Everything seems fine except when I try to parse the xml file with prefix:
test.xml:
<?xml version="1.0"?>
<abc:data>
<abc:country name="Liechtenstein" rank="1" year="2008">
</abc:country>
<abc:country name="Singapore" rank="4" year="2011">
</abc:country>
<abc:country name="Panama" rank="5" year="2011">
</abc:country>
</abc:data>
When I try to parse the xml:
import xml.etree.ElementTree as ET
tree = ET.parse('test.xml')
I got the following error:
xml.etree.ElementTree.ParseError: unbound prefix: line 2, column 0
Do I need to specify something in order to parse a xml file with prefix?

Add the abc namespace to your xml file.
<?xml version="1.0"?>
<abc:data xmlns:abc="your namespace">

I encountered the same issue while processing xml file. You can use below code before parse your XML file. This will resolve your issue.
parser1 = etree.XMLParser(encoding="utf-8", recover=True)
tree1 = ElementTree.parse('filename.xml', parser1)

See if this works:
from bs4 import BeautifulSoup
xml_file = "test.xml"
with open(xml_file, "r", encoding="utf8") as f:
contents = f.read()
soup = BeautifulSoup(contents, "xml")
items = soup.find_all("country")
print (items)
The above will produce an array which you can then manipulate to achieve your aim (e.g. remove html tags etc.):
[<country name="Liechtenstein" rank="1" year="2008">
</country>, <country name="Singapore" rank="4" year="2011">
</country>, <country name="Panama" rank="5" year="2011">
</country>]

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python how to strip white-spaces from xml text nodes - python

You have to do xml parsing for this one way or another, so maybe use xml.sax and copy to the output stream at each event (skipping ignorableWhitespace), and add tag markers as needed. Check the sample code here http://www.knowthytools.com/2010/03/sax-parsing-with-python.html.

Related

How to determine what the root tag name is for a XML document

Pretty formatting xml file in Python using lxml

Find and replacing text in elementtree

How to add an element to xml file by using elementtree

Python ElementTree parsing unbound prefix error

Categories

Resources