Adding attribute to child elements - python

I am trying to add an attribute to all child elements in all XML files in the current directory. This attribute should be equal to the length of each string. For example, the XML looks like this:
<?xml version="1.0" encoding="utf-8"?>
<RootElement>
<String Id="PythonLove">I love Python.</String>
</RootElement>
So, if this worked the way it should, it would leave the child opening tag looking like this:
<String Id="PythonLove" length="14">
I have read many forums and all point to either .set or .attrib to add attributes into an existing XML. Neither of these have any effect on the files though. My script currently looks like this:
for child in root:
    length_limit = len(child.text)
    child.set('length', length_limit)
I've also tried child.attrib['length'] = length_limit. This also doesn't work. What am I doing wrong?
Thanks

You need to convert the value to a string before passing it to set(); attribute values must be strings.
>>> xml = '''<?xml version="1.0" encoding="utf-8"?>
... <RootElement>
... <String Id="PythonLove">I love Python.</String>
... </RootElement>
... '''
>>> import xml.etree.ElementTree as ET
>>> root = ET.fromstring(xml)
>>> for child in root:
...     child.set('length', str(len(child.text))) # <---
...
>>> print(ET.tostring(root).decode())
<RootElement>
<String Id="PythonLove" length="14">I love Python.</String>
</RootElement>
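A minimal sketch of what goes wrong with a non-string value, assuming the standard-library xml.etree.ElementTree (the one-line sample document and the variable names are only for illustration). In this sketch set() accepts the integer without complaint, and the TypeError only appears when the tree is serialized:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<RootElement><String Id="PythonLove">I love Python.</String></RootElement>'
)
child = root[0]

# set() itself accepts the integer; the problem surfaces at serialization time
child.set('length', len(child.text))
try:
    ET.tostring(root)
    serialized_ok = True
except TypeError:
    serialized_ok = False

# converting to str first makes the tree serializable
child.set('length', str(len(child.text)))
result = ET.tostring(root).decode()

print(serialized_ok)
print(result)
```

If a child element can be empty, child.text is None, so len(child.text or '') may be a safer choice.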

Got it! Pretty elated because that was a couple weeks of struggles. I ended up just writing to 'infile' (used for iterating through the files in the cwd) and it worked to overwrite the existing XML (had to register the namespace first which was another little hump I ran into). Full code:
import os, glob
import xml.etree.ElementTree as ET
path = os.getcwd()
for infile in glob.glob(os.path.join(path, '*.xml')):
    tree = ET.parse(infile)
    root = tree.getroot()  # sets variable 'root' to the root element
    for child in root:
        string_length = str(len(child.text))
        child.set('length', string_length)
    # register the default namespace so it is written without a generated prefix
    ET.register_namespace('', "http://schemas.microsoft.com/wix/2006/XML")
    tree.write(infile)
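The namespace hump mentioned above can be illustrated with a small sketch. It assumes the standard-library ElementTree; the root element name Wix and the sample document are hypothetical, and only the namespace URI comes from the code above. Without registration the serializer invents a prefix such as ns0; after register_namespace('', uri) the same namespace is written as the default xmlns:

```python
import xml.etree.ElementTree as ET

WIX_NS = "http://schemas.microsoft.com/wix/2006/XML"
# hypothetical sample document that uses the namespace as its default
doc = '<Wix xmlns="%s"><String Id="PythonLove">I love Python.</String></Wix>' % WIX_NS

root = ET.fromstring(doc)
before = ET.tostring(root).decode()   # namespace gets a generated prefix such as ns0

ET.register_namespace('', WIX_NS)     # module-wide setting; affects later serialization
after = ET.tostring(root).decode()    # namespace is written as the default xmlns

print(before)
print(after)
```

Because the registration is global to the module, calling it once before the loop has the same effect as calling it for every file.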

Related

Parsing XML Attributes with Python

I am trying to parse out all the green-highlighted attributes (some sensitive things have been blacked out). I have a bunch of XML files, all with similar formats, and I already know how to loop through them individually, but I am having trouble parsing out the specific attributes.
XML Document
I need the text in the attributes: name="text1"
from
<project logLevel="verbose" version="2.0" mainModule="Main" name="text1">
destinationDir="/text2" from
<put label="Put Files" destinationDir="/Trigger/FPDMMT_INBOUND">
destDir="/text3" from
<copy disabled="false" version="1.0" label="Archive Files" destDir="/text3" suffix="">
I am using
import csv
import os
import re
import xml.etree.ElementTree as ET
tree = ET.parse(XMLfile_path)
item = tree.getroot()[0]
root = tree.getroot()
print (item.get("name"))
print (root.get("name"))
This outputs:
Main
text1
item.get reads from tree.getroot()[0], the first child of the root, which is the <module> element.
root.get reads from the root element itself, which is <project>.
I know there's a way to search for exactly the right part of the root/tree with something like:
test = root.find('./project/module/ftp/put')
print (test.get("destinationDir"))
I need to be able to jump directly to the thing I need and output the attributes I need.
Any help would be appreciated
Thanks.
Simplified copy of your XML:
xml = '''<project logLevel="verbose" version="2.0" mainModule="Main" name="hidden">
<module name="Main">
<createWorkspace version="1.0"/>
<ftp version="1.0" label="FTP connection to PRD">
<put label="Put Files" destinationDir="destination1">
</put>
</ftp>
<ftp version="1.0" label="FTP connection to PRD">
<put label="Put Files" destinationDir="destination2">
</put>
</ftp>
<copy disabled="false" destDir="destination3">
</copy>
</module>
</project>
'''
# solution using ETree
from xml.etree import ElementTree as ET
root = ET.fromstring(xml)
name = root.get('name')
ftp_destination_dir1 = root.findall('./module/ftp/put')[0].get('destinationDir')
ftp_destination_dir2 = root.findall('./module/ftp/put')[1].get('destinationDir')
copy_destination_dir = root.find('./module/copy').get('destDir')
print(name)
print(ftp_destination_dir1)
print(ftp_destination_dir2)
print(copy_destination_dir)
# solution using lxml
from lxml import etree as et
root = et.fromstring(xml)
name = root.get('name')
ftp_destination_dirs = root.xpath('./module/ftp/put/@destinationDir')
copy_destination_dir = root.xpath('./module/copy/@destDir')[0]
print(name)
print(ftp_destination_dirs[0])
print(ftp_destination_dirs[1])
print(copy_destination_dir)
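When the exact path to an element is not known in advance, a possible alternative with the standard library is to search the whole subtree. The sketch below uses a trimmed-down copy of the XML above (the variable names are made up for illustration) and relies on Element.iter() and the './/tag[@attr]' predicate that ElementTree's limited XPath supports:

```python
import xml.etree.ElementTree as ET

xml_text = '''<project name="hidden">
  <module name="Main">
    <ftp><put destinationDir="destination1"/></ftp>
    <ftp><put destinationDir="destination2"/></ftp>
    <copy destDir="destination3"/>
  </module>
</project>'''

root = ET.fromstring(xml_text)

# iter() walks the whole subtree, so the exact path to <put> need not be known
put_dirs = [put.get('destinationDir') for put in root.iter('put')]

# './/copy[@destDir]' matches any <copy> descendant that has a destDir attribute
copy_dirs = [c.get('destDir') for c in root.findall('.//copy[@destDir]')]

print(root.get('name'), put_dirs, copy_dirs)
```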

python ElementTree the text of element who has a child

When I try to read the text of an element that has a child, it gives None:
See the xml (say test.xml):
<?xml version="1.0"?>
<data>
<test><ref>MemoryRegion</ref> abcd</test>
</data>
and the python code that wants to read 'abcd':
import xml.etree.ElementTree as ET
tree = ET.parse('test.xml')
root = tree.getroot()
print(root.find("test").text)
When I run this Python code, it gives None rather than abcd.
How can I read abcd under this condition?
Use Element.tail attribute:
>>> import xml.etree.ElementTree as ET
>>> tree = ET.parse('test.xml')
>>> root = tree.getroot()
>>> print(root.find(".//ref").tail)
abcd
ElementTree has a rather different view of XML that is more suited for nested data. .text is the data right after a start tag. .tail is the data right after an end tag. So you want:
print(root.find('test/ref').tail)
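The text/tail split described above can be checked with a short sketch. It parses the same document from a string instead of test.xml, and it also shows itertext(), which may be handy when all the text inside an element is wanted regardless of where the child tags fall:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring('<data><test><ref>MemoryRegion</ref> abcd</test></data>')
test = root.find('test')
ref = test.find('ref')

own_text = test.text                  # None: nothing sits between <test> and <ref>
tail_text = ref.tail                  # ' abcd': the text that follows </ref>
all_text = ''.join(test.itertext())   # text and tails joined in document order

print(repr(own_text), repr(tail_text), repr(all_text))
```

Note that the tail keeps its leading space, so strip() may be needed before comparing it with 'abcd'.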

Adding xml prefix declaration with lxml in python

Short version :
How can I add the xmlns:xi="http://www.w3.org/2001/XInclude" prefix declaration to my root element in Python with lxml?
Context :
I have some XML files that include IDs to other files.
These IDs represent the referenced file names.
Using lxml I managed to replace these with the appropriate XInclude statement, but if I do not have the prefix declaration, my XML parser won't add the includes, which is normal.
Edit :
I won't include my code because it won't help in understanding the problem. I can process the document just fine; my problem is serialization.
So from this
<root>
<somechild/>
</root>
I want to get this
<root xmlns:xi="http://www.w3.org/2001/XInclude">
<somechild/>
</root>
in my output file.
For this I tried using
tree = ET.parse(fileName)
root = tree.getroot()
root.nsmap['xi'] = "http://www.w3.org/2001/XInclude"
tree.write('output.xml', xml_declaration=True, encoding="UTF-8", pretty_print=True)
Running your code gives me the error "Attribute nsmap is not writable".
Instead, you can register the namespace, save the root element's current attributes, call set() with a namespaced attribute name so that lxml adds the declaration, and then delete all attributes and restore the saved ones.
An example:
>>> root = etree.XML('<root a1="one" a2="two"> <somechild/> </root>')
>>> etree.register_namespace('xi', 'http://www.w3.org/2001/XInclude')
>>> etree.tostring(root)
b'<root a1="one" a2="two"> <somechild/> </root>'
>>> orig_attrib = dict(root.attrib)
>>> root.set('{http://www.w3.org/2001/XInclude}xi', '')
>>> for a in root.attrib: del root.attrib[a]
>>> for a in orig_attrib: root.attrib[a] = orig_attrib[a]
>>> etree.tostring(root)
b'<root xmlns:xi="http://www.w3.org/2001/XInclude" a1="one" a2="two"> <somechild/> </root>'
>>> root.nsmap
{'xi': 'http://www.w3.org/2001/XInclude'}

Programmatically clean/ignore namespaces in XML - python

I'm trying to write a simple program to read my financial XML files from GNUCash, and learn Python in the process.
The XML looks like this:
<?xml version="1.0" encoding="utf-8" ?>
<gnc-v2
xmlns:gnc="http://www.gnucash.org/XML/gnc"
xmlns:act="http://www.gnucash.org/XML/act"
xmlns:book="http://www.gnucash.org/XML/book"
{...}
xmlns:vendor="http://www.gnucash.org/XML/vendor">
<gnc:count-data cd:type="book">1</gnc:count-data>
<gnc:book version="2.0.0">
<book:id type="guid">91314601aa6afd17727c44657419974a</book:id>
<gnc:count-data cd:type="account">80</gnc:count-data>
<gnc:count-data cd:type="transaction">826</gnc:count-data>
<gnc:count-data cd:type="budget">1</gnc:count-data>
<gnc:commodity version="2.0.0">
<cmdty:space>ISO4217</cmdty:space>
<cmdty:id>BRL</cmdty:id>
<cmdty:get_quotes/>
<cmdty:quote_source>currency</cmdty:quote_source>
<cmdty:quote_tz/>
</gnc:commodity>
Right now, I'm able to iterate and get results using
import xml.etree.ElementTree as ET
r = ET.parse("file.xml").findall('.//')
after manually cleaning the namespaces, but I'm looking for a solution that could either read the entries regardless of their namespaces OR remove the namespaces before parsing.
Note that I'm a complete noob in python, and I've read: Python and GnuCash: Extract data from GnuCash files, Cleaning an XML file in Python before parsing and python: xml.etree.ElementTree, removing "namespaces" along with ElementTree docs and I'm still lost...
I've come up with this solution:
def strip_namespaces(self, tree):
    nspOpen = re.compile("<\w*:", re.IGNORECASE)
    nspClose = re.compile("<\/\w*:", re.IGNORECASE)
    for i in tree:
        start = re.sub(nspOpen, '<', tree.tag)
        end = re.sub(nspOpen, '<\/', tree.tag)
        # pprint(finaltree)
    return
But I'm failing to apply it. I can't seem to be able to retrieve the tag names as they appear on the file.
I think the Python code below will be helpful to you.
sample.xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<gnc:prodinfo xmlns:gnc="http://www.gnucash.org/XML/gnc"
xmlns:act="http://www.gnucash.org/XML/act"
xmlns:book="http://www.gnucash.org/XML/book"
xmlns:vendor="http://www.gnucash.org/XML/vendor">
<gnc:change>
<gnc:lastUpdate>2018-12-21
</gnc:lastUpdate>
</gnc:change>
<gnc:bill>
<gnc:billAccountNumber>1234</gnc:billAccountNumber>
<gnc:roles>
<gnc:id>111111</gnc:id>
<gnc:pos>2</gnc:pos>
<gnc:genid>15</gnc:genid>
</gnc:roles>
</gnc:bill>
<gnc:prodtyp>sales and service</gnc:prodtyp>
</gnc:prodinfo>
Python code to remove the namespace from the root tag:
import xml.etree.ElementTree as ET
def xmlns(tag):
    # drop every '{namespace}' part from an ElementTree tag name
    parts = tag.split('{')
    l = []
    for i in parts:
        if '}' in i:
            l.append(i.split('}')[1])
        else:
            l.append(i)
    var = ''.join(l)
    return var
tree = ET.parse('sample.xml')
root = tree.getroot()
print(root.tag)  # returns root tag with the namespace URI as prefix
print(xmlns(root.tag))  # returns root tag without the namespace prefix
Output:
{http://www.gnucash.org/XML/gnc}prodinfo
prodinfo
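The same idea can be extended from the root tag to a whole tree. The following is a sketch under stated assumptions rather than a definitive solution: strip_namespaces is a hypothetical helper, the sample is a cut-down fragment of the GnuCash file, and the '{*}' wildcard requires Python 3.8 or later. Attribute names with prefixes are left untouched here, and stripping can make tags from different namespaces collide.

```python
import xml.etree.ElementTree as ET

doc = '''<gnc:book xmlns:gnc="http://www.gnucash.org/XML/gnc"
          xmlns:book="http://www.gnucash.org/XML/book" version="2.0.0">
  <book:id type="guid">91314601aa6afd17727c44657419974a</book:id>
</gnc:book>'''

def strip_namespaces(root):
    """Rewrite every tag in place, dropping the leading '{uri}' part."""
    for element in root.iter():
        if isinstance(element.tag, str) and element.tag.startswith('{'):
            element.tag = element.tag.split('}', 1)[1]
    return root

root = ET.fromstring(doc)

# Python 3.8+: '{*}' matches a tag in any namespace, so no rewriting is needed
wildcard_hit = root.find('{*}id').text

root = strip_namespaces(root)
tags = [element.tag for element in root.iter()]

print(wildcard_hit, tags)
```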

How can i do replace a child element(s) in ElementTree

I want to replace child elements from one tree with those from another, based on some criteria. Can I do this using a comprehension? And how do we replace an element in an ElementTree?
You can't replace an element through the ElementTree object itself; you can only work with Element objects.
Even when you call ElementTree.find() it's just a shortcut for getroot().find().
So you really need to:
extract the parent element
use comprehension (or whatever you like) on that parent element
The extraction of the parent element can be easy if your target is a root sub-element (just call getroot()) otherwise you'll have to find it.
Unlike the DOM, etree has no explicit multi-document functions. However, you should be able to just move elements freely from one document to another. You may want to call _setroot after doing so.
By calling insert and then remove, you can replace a node in a document.
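The insert-then-remove approach can be sketched as follows, assuming the standard-library ElementTree. The two small documents are made up for illustration; copying the tail is an optional extra step that keeps any text that followed the old element:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<root><import ref="input2.xml"/><name awesome="true">Chuck</name></root>'
)
replacement = ET.fromstring('<foo><bar>blah blah</bar></foo>')

old = root.find('import')
index = list(root).index(old)      # position of the element being replaced
replacement.tail = old.tail        # keep whatever text followed the old element
root.insert(index, replacement)    # put the new element in the same slot
root.remove(old)                   # then drop the old one

result = ET.tostring(root).decode()
print(result)
```

Assigning root[index] = replacement would do the same in one step, since Element supports item assignment.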
I'm new to python, but I've found a dodgy way to do this:
Input file input1.xml:
<?xml version="1.0" encoding="UTF-8"?>
<root>
<import ref="input2.xml" />
<name awesome="true">Chuck</name>
</root>
Input file input2.xml:
<?xml version="1.0" encoding="UTF-8"?>
<foo>
<bar>blah blah</bar>
</foo>
Python code: (note, messy and hacky)
import os
import xml.etree.ElementTree as ElementTree
def getElementTree(xmlFile):
    print("-- Processing file: '%s' in: '%s'" % (xmlFile, os.getcwd()))
    with open(xmlFile, 'r') as xmlFH:
        xmlStr = xmlFH.read()
    et = ElementTree.fromstring(xmlStr)
    # map each child to its parent; Element objects keep no parent reference
    parent_map = dict((c, p) for p in et.iter() for c in p)
    # ref: https://stackoverflow.com/questions/2170610/access-elementtree-node-parent-node/2170994
    importList = et.findall('.//import[@ref]')
    for importPlaceholder in importList:
        old_dir = os.getcwd()
        new_dir = os.path.dirname(importPlaceholder.attrib['ref'])
        shallPushd = os.path.exists(new_dir)
        if shallPushd:
            print("  pushd: %s" % (new_dir))
            os.chdir(new_dir)  # pushd (for relative linking)
        # Recursing to import element from file reference
        importedElement = getElementTree(os.path.basename(importPlaceholder.attrib['ref']))
        # element replacement: assign into the parent at the placeholder's index
        parent = parent_map[importPlaceholder]
        index = list(parent).index(importPlaceholder)
        parent[index] = importedElement
        if shallPushd:
            print("  popd: %s" % (old_dir))
            os.chdir(old_dir)  # popd
    return et
xmlET = getElementTree("input1.xml")
print(ElementTree.tostring(xmlET).decode())
gives the output:
-- Processing file: 'input1.xml' in: 'C:\temp\testing'
-- Processing file: 'input2.xml' in: 'C:\temp\testing'
<root>
<foo>
<bar>blah blah</bar>
</foo><name awesome="true">Chuck</name>
</root>
this was concluded with information from:
stackoverflow answer: access ElementTree node parent node
accessing parents from effbot.org
