Short version:
How do I add the xmlns:xi="http://www.w3.org/2001/XInclude" prefix declaration to my root element in Python with lxml?
Context:
I have some XML files that contain IDs referring to other files.
These IDs are the names of the referenced files.
Using lxml I managed to replace them with the appropriate XInclude statements, but without the prefix declaration my XML parser won't process the includes, which is expected.
Edit :
I won't include my code because it won't help in understanding the problem. I can process the document just fine; my problem is serialization.
So from this:
<root>
  <somechild/>
</root>
I want to get this in my output file:
<root xmlns:xi="http://www.w3.org/2001/XInclude">
  <somechild/>
</root>
For this I tried using
`
from lxml import etree as ET
tree = ET.parse(fileName)
root = tree.getroot()
root.nsmap['xi'] = "http://www.w3.org/2001/XInclude"
tree.write('output.xml', xml_declaration=True, encoding="UTF-8", pretty_print=True)
`
When I try your code, I get the error "Attribute nsmap is not writable".
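You can register the namespace, save the root element's current attributes, use set() to add a temporary attribute in that namespace (which makes lxml emit the xmlns:xi declaration), then delete all attributes and restore the saved ones. The declaration stays on the root element.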
An example:
>>> from lxml import etree
>>> root = etree.XML('<root a1="one" a2="two"> <somechild/> </root>')
>>> etree.register_namespace('xi', 'http://www.w3.org/2001/XInclude')
>>> etree.tostring(root)
b'<root a1="one" a2="two"> <somechild/> </root>'
>>> orig_attrib = dict(root.attrib)
>>> root.set('{http://www.w3.org/2001/XInclude}xi', '')
>>> for a in root.attrib: del root.attrib[a]
>>> for a in orig_attrib: root.attrib[a] = orig_attrib[a]
>>> etree.tostring(root)
b'<root xmlns:xi="http://www.w3.org/2001/XInclude" a1="one" a2="two"> <somechild/> </root>'
>>> root.nsmap
{'xi': 'http://www.w3.org/2001/XInclude'}
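For comparison, here is a minimal sketch using the standard library's xml.etree.ElementTree rather than lxml (an assumption, since the question is about lxml). There, registering the prefix and adding an element in the XInclude namespace is enough for the xmlns:xi declaration to be written on the root element. The href value "other.xml" is a made-up placeholder.

```python
import xml.etree.ElementTree as ET

XI_NS = "http://www.w3.org/2001/XInclude"

# Ask ElementTree to use the prefix "xi" for this namespace when serializing.
# Note: this registration is global for the process.
ET.register_namespace("xi", XI_NS)

root = ET.fromstring('<root a1="one"><somechild/></root>')

# Add an include element in the XInclude namespace.
# "other.xml" is a hypothetical file name used only for illustration.
include = ET.SubElement(root, "{%s}include" % XI_NS)
include.set("href", "other.xml")

# ElementTree writes the declarations for all used namespaces on the root.
result = ET.tostring(root, encoding="unicode")
print(result)
```

The standard library only declares namespaces that are actually used somewhere in the tree, so this does not cover the case where the declaration is wanted without any xi: element.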
Related
My XML file looks like this:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Messages xmlns="URL/sampleMessages-v1">
<Header>
<TransactionId>0</TransactionId>
<RequestNo>41194812</RequestNo>
<VNo>6789</VNo>
<Source></Source>
</Header>
...
...
</Messages>
I want to read it and change the RequestNo value from
<RequestNo>41194812</RequestNo> to
<RequestNo>41194000</RequestNo>
I am currently using the ElementTree module on a Windows machine.
I want to update the value in the same file.
I have tried the code below:
for elem in root:
    for subelem in elem:
        # print(subelem.tag)
        if 'RequestNo' in subelem.tag:
            # print(subelem.text)
            subelem.text = "41194813"
But I cannot see the change, and I don't know how to write the new value subelem.text = "41194813" back to the existing XML file.
Your for loop does the job: it did replace the text correctly. The change is in your root variable. You can verify that by adding the following line right after the for loop:
ElementTree.dump(root)
Now that you have the XML updated, you will need to write that into a file:
tree.write('newfile.xml')
Where tree is the result of ElementTree.parse(). So, to put everything together:
from xml.etree import ElementTree

tree = ElementTree.parse('messages.xml')
root = tree.getroot()
for elem in root:
    for subelem in elem:
        if 'RequestNo' in subelem.tag:
            subelem.text = '41194813'
            break
tree.write('messages-new.xml')
Dealing with Namespaces
Your XML document contains namespaces, so if you plan to search for a tag, you need to include the namespaces in the tag names. Here is an alternative solution which deals with namespaces:
from xml.etree import ElementTree

tree = ElementTree.parse('messages.xml')
root = tree.getroot()
namespaces = {'xxx': 'URL/sampleMessages-v1'}
node = root.find('xxx:Header/xxx:RequestNo', namespaces)
if node is not None:
    node.text = '41194813'
tree.write('messages-new.xml')
In the example above, I gave your namespace the prefix 'xxx'. The prefix can be anything ('foo', 'bar', ...), but the same prefix must be used in the path passed to root.find().
Removing "ns0" from Output File
In order to remove "ns0" from output file, you need to register the namespace before writing:
ElementTree.register_namespace('', 'URL/sampleMessages-v1')
tree.write('messages-new.xml')
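The pieces above (namespace-aware lookup, update, and registering the default namespace) can be combined into one small sketch. It works on an in-memory copy of the sample document so it runs without a file; the prefix "m" and the variable names are arbitrary choices made for illustration.

```python
import xml.etree.ElementTree as ET

NS = "URL/sampleMessages-v1"

# Shortened in-memory version of the sample document from the question.
SAMPLE = (
    '<Messages xmlns="URL/sampleMessages-v1">'
    '<Header><TransactionId>0</TransactionId>'
    '<RequestNo>41194812</RequestNo></Header>'
    '</Messages>'
)

# Register the namespace as the default one so the output has no ns0: prefixes.
ET.register_namespace("", NS)

root = ET.fromstring(SAMPLE)

# "m" is an arbitrary prefix that only exists inside this lookup.
node = root.find("m:Header/m:RequestNo", {"m": NS})
if node is not None:
    node.text = "41194000"

updated = ET.tostring(root, encoding="unicode")
print(updated)
```

To update a file in place, the same steps apply with ElementTree.parse() and a tree.write() call that uses the original file name; passing xml_declaration=True and encoding="UTF-8" to tree.write() keeps an XML declaration in the output.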
I am trying to add an attribute to all child elements in all XML files in the current directory. This attribute should be equal to the length of each string. For example, the XML looks like this:
<?xml version="1.0" encoding="utf-8"?>
<RootElement>
  <String Id="PythonLove">I love Python.</String>
</RootElement>
So, if this worked the way it should, it would leave the child opening tag looking like this:
<String Id="PythonLove" length="14">
I have read many forums, and all point to either .set or .attrib to add attributes to an existing XML file. Neither has any effect on the files, though. My script currently looks like this:
for child in root:
    length_limit = len(child.text)
    child.set('length', length_limit)
I've also tried child.attrib['length'] = length_limit. This also doesn't work. What am I doing wrong?
Thanks
You need to convert the value to a string before calling set(); attribute values must be strings.
>>> xml = '''<?xml version="1.0" encoding="utf-8"?>
... <RootElement>
... <String Id="PythonLove">I love Python.</String>
... </RootElement>
... '''
>>> import xml.etree.ElementTree as ET
>>> root = ET.fromstring(xml)
>>> for child in root:
...     child.set('length', str(len(child.text)))  # <---
...
>>> print(ET.tostring(root).decode())
<RootElement>
<String Id="PythonLove" length="14">I love Python.</String>
</RootElement>
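A short sketch of why the integer version appears to do nothing: with xml.etree.ElementTree, set() accepts the integer without complaint, and the problem only surfaces when the tree is serialized. The variable names here are illustrative.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<RootElement><String Id="PythonLove">I love Python.</String></RootElement>'
)
child = root[0]

# set() itself does not check the value type, so this line succeeds.
child.set('length', len(child.text))

# Serializing the tree is where the non-string value is rejected.
try:
    ET.tostring(root)
    failed = False
except TypeError:
    failed = True

# With a string value, serialization works and the attribute is written.
child.set('length', str(len(child.text)))
fixed = ET.tostring(root, encoding='unicode')
print(failed, fixed)
```

The change also has to be written back with tree.write() before it shows up in the file on disk.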
Got it! Pretty elated because that was a couple weeks of struggles. I ended up just writing to 'infile' (used for iterating through the files in the cwd) and it worked to overwrite the existing XML (had to register the namespace first which was another little hump I ran into). Full code:
import glob
import os
import xml.etree.ElementTree as ET

# Register the default namespace once so the output is not rewritten with ns0: prefixes
ET.register_namespace('', "http://schemas.microsoft.com/wix/2006/XML")

path = os.getcwd()
for infile in glob.glob(os.path.join(path, '*.xml')):
    tree = ET.parse(infile)
    root = tree.getroot()  # the root element
    for child in root:
        # Attribute values must be strings; treat missing text as empty
        string_length = str(len(child.text or ''))
        child.set('length', string_length)
    tree.write(infile)  # overwrite the original file
I've been attempting to parse a list of xml files. I'd like to print specific values such as the userName value.
<?xml version="1.0" encoding="utf-8"?>
<Drives clsid="{8FDDCC1A-0C3C-43cd-A6B4-71A6DF20DA8C}"
disabled="1">
<Drive clsid="{935D1B74-9CB8-4e3c-9914-7DD559B7A417}"
name="S:"
status="S:"
image="2"
changed="2007-07-06 20:57:37"
uid="{4DA4A7E3-F1D8-4FB1-874F-D2F7D16F7065}">
<Properties action="U"
thisDrive="NOCHANGE"
allDrives="NOCHANGE"
userName=""
cpassword=""
path="\\scratch"
label="SCRATCH"
persistent="1"
useLetter="1"
letter="S"/>
</Drive>
</Drives>
My script is working fine collecting a list of xml files etc. However the below function is to print the relevant values. I'm trying to achieve this as suggested in this post. However I'm clearly doing something incorrectly as I'm getting errors suggesting that elm object has no attribute text. Any help would be appreciated.
Current Code
from lxml import etree as ET
def read_files(files):
    for fi in files:
        doc = ET.parse(fi)
        elm = doc.find('userName')
        print(elm.text)
doc.find looks for a tag with the given name. You are looking for an attribute with the given name.
elm.text is giving you an error because doc.find doesn't find any tags, so it returns None, which has no text property.
Read the lxml.etree docs some more, and then try something like this:
doc = ET.parse(fi)
root = doc.getroot()
prop = root.find(".//Properties") # finds the first <Properties> tag anywhere
elm = prop.attrib['userName']
userName is an attribute, not an element. Attributes don't have text nodes attached to them at all.
for el in doc.xpath('//*[@userName]'):
    print(el.attrib['userName'])
You can also get the element by its tag name and then read its attribute (userName is an attribute of Properties). Note that getElementsByTagName belongs to the DOM API in xml.dom.minidom, not to lxml.etree:
from xml.dom import minidom
def read_files(files):
    for fi in files:
        doc = minidom.parse(fi)
        props = doc.getElementsByTagName('Properties')
        elm = props[0].attributes['userName']
        print(elm.value)
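The attribute lookup described in these answers can also be sketched with the standard library's xml.etree.ElementTree, whose limited XPath support includes the [@attrib] predicate. The sample below is a shortened copy of the Drives document from the question, and the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Shortened version of the Drives document from the question.
SAMPLE = '''<Drives clsid="{8FDDCC1A-0C3C-43cd-A6B4-71A6DF20DA8C}" disabled="1">
  <Drive name="S:" status="S:">
    <Properties action="U" userName="" label="SCRATCH" letter="S"/>
  </Drive>
</Drives>'''

root = ET.fromstring(SAMPLE)

# Select every element that carries a userName attribute, whatever its tag.
with_user = root.findall(".//*[@userName]")

# get() returns the attribute value, or None when the attribute is absent.
user_names = [el.get("userName") for el in with_user]
labels = [el.get("label") for el in with_user]
print(user_names, labels)
```

Using get() instead of attrib[...] avoids a KeyError when an element does not have the attribute.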
When I try to read the text of an element that has a child, I get None.
Here is the XML (say test.xml):
<?xml version="1.0"?>
<data>
<test><ref>MemoryRegion</ref> abcd</test>
</data>
and the python code that wants to read 'abcd':
import xml.etree.ElementTree as ET
tree = ET.parse('test.xml')
root = tree.getroot()
print(root.find("test").text)
When I run this script, it prints None rather than abcd.
How can I read abcd under this condition?
Use the Element.tail attribute:
>>> import xml.etree.ElementTree as ET
>>> tree = ET.parse('test.xml')
>>> root = tree.getroot()
>>> print(root.find(".//ref").tail)
abcd
ElementTree has a rather different view of XML, one that is better suited to nested data. .text is the text right after an element's start tag, and .tail is the text right after its end tag. So you want:
print(root.find('test/ref').tail)
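A small sketch of the .text / .tail distinction, using the document from the question as an in-memory string. It also shows itertext(), which is one way to collect all the text inside an element regardless of where the child elements sit. The variable names are illustrative.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<data><test><ref>MemoryRegion</ref> abcd</test></data>'
)

test = root.find('test')
ref = test.find('ref')

# Nothing sits between <test> and <ref>, so .text of <test> is None.
text_before_child = test.text

# " abcd" follows </ref>, so it is stored as the tail of <ref>.
tail_after_ref = ref.tail

# itertext() walks the element and yields every text and tail inside it.
full_text = ''.join(test.itertext())

print(repr(text_before_child), repr(tail_after_ref), repr(full_text))
```

Calling .strip() on the tail removes the leading space if only the word itself is wanted.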
I'm trying to write a simple program to read my financial XML files from GNUCash, and learn Python in the process.
The XML looks like this:
<?xml version="1.0" encoding="utf-8" ?>
<gnc-v2
xmlns:gnc="http://www.gnucash.org/XML/gnc"
xmlns:act="http://www.gnucash.org/XML/act"
xmlns:book="http://www.gnucash.org/XML/book"
{...}
xmlns:vendor="http://www.gnucash.org/XML/vendor">
<gnc:count-data cd:type="book">1</gnc:count-data>
<gnc:book version="2.0.0">
<book:id type="guid">91314601aa6afd17727c44657419974a</book:id>
<gnc:count-data cd:type="account">80</gnc:count-data>
<gnc:count-data cd:type="transaction">826</gnc:count-data>
<gnc:count-data cd:type="budget">1</gnc:count-data>
<gnc:commodity version="2.0.0">
<cmdty:space>ISO4217</cmdty:space>
<cmdty:id>BRL</cmdty:id>
<cmdty:get_quotes/>
<cmdty:quote_source>currency</cmdty:quote_source>
<cmdty:quote_tz/>
</gnc:commodity>
Right now, I'm able to iterate and get results using
import xml.etree.ElementTree as ET
r = ET.parse("file.xml").findall('.//')
after manually cleaning the namespaces, but I'm looking for a solution that could either read the entries regardless of their namespaces OR remove the namespaces before parsing.
Note that I'm a complete noob in python, and I've read: Python and GnuCash: Extract data from GnuCash files, Cleaning an XML file in Python before parsing and python: xml.etree.ElementTree, removing "namespaces" along with ElementTree docs and I'm still lost...
I've come up with this solution:
def strip_namespaces(self, tree):
    nspOpen = re.compile("<\w*:", re.IGNORECASE)
    nspClose = re.compile("<\/\w*:", re.IGNORECASE)
    for i in tree:
        start = re.sub(nspOpen, '<', tree.tag)
        end = re.sub(nspOpen, '<\/', tree.tag)
        # pprint(finaltree)
    return
But I'm failing to apply it. I can't seem to be able to retrieve the tag names as they appear on the file.
I think the Python code below will be helpful to you.
sample.xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<gnc:prodinfo xmlns:gnc="http://www.gnucash.org/XML/gnc"
xmlns:act="http://www.gnucash.org/XML/act"
xmlns:book="http://www.gnucash.org/XML/book"
xmlns:vendor="http://www.gnucash.org/XML/vendor">
<gnc:change>
<gnc:lastUpdate>2018-12-21
</gnc:lastUpdate>
</gnc:change>
<gnc:bill>
<gnc:billAccountNumber>1234</gnc:billAccountNumber>
<gnc:roles>
<gnc:id>111111</gnc:id>
<gnc:pos>2</gnc:pos>
<gnc:genid>15</gnc:genid>
</gnc:roles>
</gnc:bill>
<gnc:prodtyp>sales and service</gnc:prodtyp>
</gnc:prodinfo>
Python code to remove the namespace from the root tag:
import xml.etree.ElementTree as ET

def xmlns(tag):
    # Drop every '{namespace}' part and keep the rest of the tag name
    parts = []
    for piece in tag.split('{'):
        if '}' in piece:
            parts.append(piece.split('}')[1])
        else:
            parts.append(piece)
    return ''.join(parts)

tree = ET.parse('sample.xml')
root = tree.getroot()
print(root.tag)         # root tag with the namespace URI as a prefix
print(xmlns(root.tag))  # root tag without the namespace URI
Output:
{http://www.gnucash.org/XML/gnc}prodinfo
prodinfo
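The same idea can be extended from the root tag to the whole tree. The following is a minimal sketch, not a definitive implementation: the helper name strip_namespaces is borrowed from the question, the sample is a shortened GnuCash-like document, and the URI used for the cd prefix is an assumption, because the question's XML elides that declaration.

```python
import xml.etree.ElementTree as ET

# Shortened sample; the "cd" namespace URI is assumed for illustration.
SAMPLE = (
    '<gnc-v2 xmlns:gnc="http://www.gnucash.org/XML/gnc" '
    'xmlns:book="http://www.gnucash.org/XML/book" '
    'xmlns:cd="http://www.gnucash.org/XML/cd">'
    '<gnc:count-data cd:type="book">1</gnc:count-data>'
    '<gnc:book version="2.0.0">'
    '<book:id type="guid">91314601aa6afd17727c44657419974a</book:id>'
    '</gnc:book>'
    '</gnc-v2>'
)

def strip_namespaces(root):
    """Remove the '{uri}' part from every tag and attribute name, in place."""
    for elem in root.iter():
        if isinstance(elem.tag, str) and elem.tag.startswith('{'):
            elem.tag = elem.tag.split('}', 1)[1]
        # Copy the keys first because the attribute dict is modified in the loop.
        for key in list(elem.attrib):
            if key.startswith('{'):
                elem.attrib[key.split('}', 1)[1]] = elem.attrib.pop(key)
    return root

root = strip_namespaces(ET.fromstring(SAMPLE))

book_id = root.find('book/id').text
count_type = root.find('count-data').get('type')
print(root.tag, book_id, count_type)
```

Stripping namespaces can make two different names collide (for example book:id and another prefix's id under the same parent), so this only suits documents where the local names are unambiguous.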