Python ElementTree parsing unbound prefix error

I am learning ElementTree in Python. Everything seems fine except when I try to parse an XML file whose elements use a prefix:
test.xml:
<?xml version="1.0"?>
<abc:data>
<abc:country name="Liechtenstein" rank="1" year="2008">
</abc:country>
<abc:country name="Singapore" rank="4" year="2011">
</abc:country>
<abc:country name="Panama" rank="5" year="2011">
</abc:country>
</abc:data>
When I try to parse the xml:
import xml.etree.ElementTree as ET
tree = ET.parse('test.xml')
I got the following error:
xml.etree.ElementTree.ParseError: unbound prefix: line 2, column 0
Do I need to specify something in order to parse an XML file with a prefix?

Add the abc namespace to your xml file.
<?xml version="1.0"?>
<abc:data xmlns:abc="your namespace">
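As a minimal sketch of this fix using only the standard library: once the prefix is bound to a URI, the file parses and the elements can be queried through a prefix map. The URI `http://example.com/abc` is a made-up placeholder, and the XML is inlined here only to keep the example self-contained.

```python
import xml.etree.ElementTree as ET

# The namespace URI below is a hypothetical placeholder; substitute the real one.
XML = """<?xml version="1.0"?>
<abc:data xmlns:abc="http://example.com/abc">
  <abc:country name="Liechtenstein" rank="1" year="2008"/>
  <abc:country name="Singapore" rank="4" year="2011"/>
  <abc:country name="Panama" rank="5" year="2011"/>
</abc:data>"""

root = ET.fromstring(XML)

# Map the prefix used in queries to the namespace URI declared in the document.
ns = {"abc": "http://example.com/abc"}
names = [country.get("name") for country in root.findall("abc:country", ns)]
print(names)
```

Internally ElementTree stores the tag as `{uri}localname`, so the prefix in the query only has to match the key in the `ns` dictionary, not the prefix used in the file.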

I encountered the same issue while processing an XML file. If you use lxml (not the standard-library xml.etree), you can create a parser with recover=True and pass it to parse(), which should let the file parse despite the undeclared prefix:
from lxml import etree
parser1 = etree.XMLParser(encoding="utf-8", recover=True)
tree1 = etree.parse('filename.xml', parser1)

See if this works:
from bs4 import BeautifulSoup
xml_file = "test.xml"
with open(xml_file, "r", encoding="utf8") as f:
    contents = f.read()
soup = BeautifulSoup(contents, "xml")
items = soup.find_all("country")
print (items)
The above will produce an array which you can then manipulate to achieve your aim (e.g. remove html tags etc.):
[<country name="Liechtenstein" rank="1" year="2008">
</country>, <country name="Singapore" rank="4" year="2011">
</country>, <country name="Panama" rank="5" year="2011">
</country>]

Related

'lxml.etree._ElementTree' object has no attribute 'insert'

I am trying to parse through my .xml file using glob and then use etree to add more code to my .xml. However, I keep getting an error when using doc insert that says object has no attribute insert. Does anyone know how I can effectively add code to my .xml?
import glob
from lxml import etree
path = "D:/Test/"
for xml_file in glob.glob(path + '/*/*.xml'):
    doc = etree.parse(xml_file)
    new_elem = etree.fromstring("""<new_code abortExpression=""
    elseExpression=""
    errorIfNoMatch="false"/>""")
    doc.insert(1,new_elem)  # this line raises the AttributeError
    new_elem.tail = "\n"
My original xml looks like this :
<data>
<assesslet index="Test" hash-uptodate="False" types="TriggerRuleType" verbose="True"/>
</data>
And I'd like to modify it to look like this:
<data>
<assesslet index="Test" hash-uptodate="False" types="TriggerRuleType" verbose="True"/>
<new_code abortExpression="" elseExpression="" errorIfNoMatch="false"/>
</data>

The problem is that you need to extract the root from your document before you can start modifying it: modify doc.getroot() instead of doc.
This works for me:
from lxml import etree
xml_file = "./doc.xml"
doc = etree.parse(xml_file)
new_elem = etree.fromstring("""<new_code abortExpression=""
elseExpression=""
errorIfNoMatch="false"/>""")
root = doc.getroot()
root.insert(1, new_elem)
new_elem.tail="\n"
To print the results to a file, you can use doc.write():
doc.write("doc-out.xml", encoding="utf8", xml_declaration=True)
Note the xml_declaration=True argument: it tells doc.write() to produce the <?xml version='1.0' encoding='UTF8'?> header.
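The same distinction between the tree wrapper and its root element applies to the standard-library xml.etree.ElementTree. The following is a minimal sketch under that assumption, with the XML inlined instead of read from a file; it is not the lxml code from the answer.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<data><assesslet index="Test" hash-uptodate="False" '
    'types="TriggerRuleType" verbose="True"/></data>'
)
tree = ET.ElementTree(root)

new_elem = ET.Element(
    "new_code",
    {"abortExpression": "", "elseExpression": "", "errorIfNoMatch": "false"},
)

# insert() is a method of Element (the root), not of the ElementTree wrapper.
tree.getroot().insert(1, new_elem)

result = ET.tostring(tree.getroot(), encoding="unicode")
print(result)
```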

How to output XML declaration <?xml version="1.0"?> in Python/ElementTree

I'm trying to create an XML file for the word reference source file, which is in XML. When I write to the file with only xml_declaration=True, it shows <?xml version='1.0' encoding='us-ascii'?>, but I want it in the form <?xml version="1.0"?>.
from xml.etree.ElementTree import ElementTree
from xml.etree.ElementTree import Element
import xml.etree.ElementTree as ET
import uuid
from lxml import etree
root=Element('b:sources')
root.set('SelectedStyle','')
root.set('xmlns:b','http://schemas.openxmlformats.org/officeDocument/2006/bibliography')
root.set('xmlns','http://schemas.openxmlformats.org/officeDocument/2006/bibliography')
#root.attrib=('SelectedStyle'='', 'xmlns:b'='"http://schemas.openxmlformats.org/officeDocument/2006/bibliography"', 'xmlns:b'='"http://schemas.openxmlformats.org/officeDocument/2006/bibliography"','xmlns'='"http://schemas.openxmlformats.org/officeDocument/2006/bibliography"')
source=ET.SubElement(root, 'b:source')
ET.SubElement(source,'b:Tag')
ET.SubElement(source,'b:SourceType').text='Misc'
ET.SubElement(source,'b:guid').text=str(uuid.uuid1())
Author=ET.SubElement(source,'b:Author')
Author2=ET.SubElement(Author,'b:Author')
ET.SubElement(Author2,'b:Corporate').text='Norsk olje og gass'
ET.SubElement(source, 'b:Title').text='R-002'
ET.SubElement(source, 'b:Year').text='2019'
ET.SubElement(source, 'b:Month').text='10'
ET.SubElement(source, 'b:Day').text='27'
tree=ElementTree(root)
tree.write('Sources.xml', xml_declaration=True, method='xml')

Answer:
When using xml.etree.ElementTree there is no way to avoid the inclusion of an encoding attribute in the declaration. If you don't want an encoding attribute in the XML declaration at all, you need to use xml.dom.minidom not xml.etree.ElementTree.
Here is a snippet to set up an example:
import xml.etree.ElementTree
a = xml.etree.ElementTree.Element('a')
tree = xml.etree.ElementTree.ElementTree(element=a)
root = tree.getroot()
Omit Encoding:
out = xml.etree.ElementTree.tostring(root, xml_declaration=True)
b"<?xml version='1.0' encoding='us-ascii'?>\n<a />"
Encoding us-ascii:
out = xml.etree.ElementTree.tostring(root, encoding='us-ascii', xml_declaration=True)
b"<?xml version='1.0' encoding='us-ascii'?>\n<a />"
Encoding unicode:
out = xml.etree.ElementTree.tostring(root, encoding='unicode', xml_declaration=True)
"<?xml version='1.0' encoding='UTF-8'?>\n<a />"
Using minidom:
Let's take the first example from above with the encoding omitted and use the variable out as the input to xml.dom.minidom and you will see the output that you're seeking.
import xml.dom.minidom
dom = xml.dom.minidom.parseString(out)
dom.toxml()
'<?xml version="1.0" ?><a/>'
There is also a pretty print option:
dom.toprettyxml()
'<?xml version="1.0" ?>\n<a/>\n'
Note
Take a look at the source code, and you can see that the encoding is hard coded in the output.
with _get_writer(file_or_filename, encoding) as (write, declared_encoding):
    if method == "xml" and (xml_declaration or
            (xml_declaration is None and
             declared_encoding.lower() not in ("utf-8", "us-ascii"))):
        write("<?xml version='1.0' encoding='%s'?>\n" % (
            declared_encoding,))
https://github.com/python/cpython/blob/550c44b89513ea96d209e2ff761302238715f082/Lib/xml/etree/ElementTree.py#L731-L736

Parse an elementTree to return a string in XML form - Python

Is there a way to parse an entire ElementTree from a file and return it as a string in Python? I would like to read the entire file into a single string value, for example by grabbing the entire output of dump(tree). Any help or advice would be greatly appreciated!
import xml.etree.ElementTree as ET
filename = input("Enter a filename: ")
tree = ET.parse(filename)
string = tree.tostring() ##is there a way to do something like this?
test.xml
<data>
<serial>
<serial name = "serial">SN001</serial>
</serial>
<items>
<item>Test1 = Failed</item>
<item>Test2 = Passed</item>
<item>Test3 = Passed</item>
</items>
</data>

tostring is a module function, not a method.
string = ET.tostring(tree.getroot())
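Note that in Python 3 ET.tostring() returns bytes by default. A short sketch of the difference, assuming a small inline document rather than the file from the question:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<data><items><item>Test1 = Failed</item></items></data>")
tree = ET.ElementTree(root)

# Default encoding is us-ascii, so the result is a bytes object.
as_bytes = ET.tostring(tree.getroot())

# encoding="unicode" asks for a str instead.
as_text = ET.tostring(tree.getroot(), encoding="unicode")

print(type(as_bytes).__name__, type(as_text).__name__)
print(as_text)
```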

How to write XML declaration using xml.etree.ElementTree

I am generating an XML document in Python using an ElementTree, but the tostring function doesn't include an XML declaration when converting to plaintext.
from xml.etree.ElementTree import Element, SubElement, tostring
document = Element('outer')
node = SubElement(document, 'inner')
node.NewValue = 1
print tostring(document) # Outputs "<outer><inner /></outer>"
I need my string to include the following XML declaration:
<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
However, there does not seem to be any documented way of doing this.
Is there a proper method for rendering the XML declaration in an ElementTree?

I am surprised to find that there doesn't seem to be a way with ElementTree.tostring(). You can however use ElementTree.ElementTree.write() to write your XML document to a fake file:
from io import BytesIO
from xml.etree import ElementTree as ET
document = ET.Element('outer')
node = ET.SubElement(document, 'inner')
et = ET.ElementTree(document)
f = BytesIO()
et.write(f, encoding='utf-8', xml_declaration=True)
print(f.getvalue()) # your XML file, encoded as UTF-8
Even then, I don't think you can get your 'standalone' attribute without prepending it yourself.

I would use lxml (see http://lxml.de/api.html).
Then you can:
from lxml import etree
document = etree.Element('outer')
node = etree.SubElement(document, 'inner')
print(etree.tostring(document, xml_declaration=True))

If you pass encoding='utf8', you will get an XML header: xml.etree.ElementTree.tostring writes an XML declaration when called with encoding='utf8'.
Sample Python code (works with Python 2 and 3):
import xml.etree.ElementTree as ElementTree
tree = ElementTree.ElementTree(
ElementTree.fromstring('<xml><test>123</test></xml>')
)
root = tree.getroot()
print('without:')
print(ElementTree.tostring(root, method='xml'))
print('')
print('with:')
print(ElementTree.tostring(root, encoding='utf8', method='xml'))
Python 2 output:
$ python2 example.py
without:
<xml><test>123</test></xml>
with:
<?xml version='1.0' encoding='utf8'?>
<xml><test>123</test></xml>
With Python 3 you will note the b prefix indicating byte literals are returned (just like with Python 2):
$ python3 example.py
without:
b'<xml><test>123</test></xml>'
with:
b"<?xml version='1.0' encoding='utf8'?>\n<xml><test>123</test></xml>"

xml_declaration Argument
Is there a proper method for rendering the XML declaration in an ElementTree?
Yes, and there is no need to use the tostring function. According to the ElementTree documentation, you can create the Element and SubElements, wrap the root element in an ElementTree object, and finally pass the xml_declaration argument to write() so that the declaration line is included in the output file.
You can do it this way:
import xml.etree.ElementTree as ET
document = ET.Element("outer")
node1 = ET.SubElement(document, "inner")
node1.text = "text"
# Pass the root element to the constructor instead of calling the private _setroot().
tree = ET.ElementTree(document)
tree.write("./output.xml", encoding="UTF-8", xml_declaration=True)
And the output file is:
<?xml version='1.0' encoding='UTF-8'?>
<outer><inner>text</inner></outer>

I encountered this issue recently. After some digging through the code, I found that the following snippet is the definition of the function ElementTree.write:
def write(self, file, encoding="us-ascii"):
    assert self._root is not None
    if not hasattr(file, "write"):
        file = open(file, "wb")
    if not encoding:
        encoding = "us-ascii"
    elif encoding != "utf-8" and encoding != "us-ascii":
        file.write("<?xml version='1.0' encoding='%s'?>\n" %
                   encoding)
    self._write(file, self._root, encoding, {})
So the answer is: if you need to write the XML header to your file, set the encoding argument to something other than the exact strings utf-8 or us-ascii. The comparison in this snippet is case-sensitive, so UTF-8 works.
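The snippet above appears to come from an older implementation. In current Python 3 releases the encoding name is lower-cased before the comparison, so the upper-case trick no longer triggers the declaration. A quick check, assuming Python 3.8 or later (needed for the xml_declaration argument of tostring):

```python
import xml.etree.ElementTree as ET

root = ET.Element("a")

# "UTF-8" is lower-cased before the check, so no declaration is written.
plain = ET.tostring(root, encoding="UTF-8")

# "utf8" is not one of the names that suppress the declaration, so it is written.
alias = ET.tostring(root, encoding="utf8")

# Python 3.8+: request the declaration explicitly, whatever the encoding name.
forced = ET.tostring(root, encoding="UTF-8", xml_declaration=True)

for label, data in (("UTF-8", plain), ("utf8", alias), ("forced", forced)):
    print(label, data)
```

On such versions, passing xml_declaration=True is the less surprising way to get the header, since it does not depend on how the encoding name is spelled.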

Easy
Sample for both Python 2 and 3 (encoding parameter must be utf8):
import xml.etree.ElementTree as ElementTree
tree = ElementTree.ElementTree(ElementTree.fromstring('<xml><test>123</test></xml>'))
root = tree.getroot()
print(ElementTree.tostring(root, encoding='utf8', method='xml'))
Since Python 3.8 there is an xml_declaration parameter for this:
New in version 3.8: The xml_declaration and default_namespace
parameters.
xml.etree.ElementTree.tostring(element, encoding="us-ascii",
method="xml", *, xml_declaration=None, default_namespace=None,
short_empty_elements=True) Generates a string representation of an XML
element, including all subelements. element is an Element instance.
encoding is the output encoding (default is US-ASCII). Use
encoding="unicode" to generate a Unicode string (otherwise, a
bytestring is generated). method is either "xml", "html" or "text"
(default is "xml"). xml_declaration, default_namespace and
short_empty_elements has the same meaning as in ElementTree.write().
Returns an (optionally) encoded string containing the XML data.
Sample for Python 3.8 and higher:
import xml.etree.ElementTree as ElementTree
tree = ElementTree.ElementTree(ElementTree.fromstring('<xml><test>123</test></xml>'))
root = tree.getroot()
print(ElementTree.tostring(root, encoding='unicode', method='xml', xml_declaration=True))

The minimal working example with ElementTree package usage:
import xml.etree.ElementTree as ET
document = ET.Element('outer')
node = ET.SubElement(document, 'inner')
node.text = '1'
res = ET.tostring(document, encoding='utf8', method='xml').decode()
print(res)
the output is:
<?xml version='1.0' encoding='utf8'?>
<outer><inner>1</inner></outer>

Another pretty simple option is to concatenate the desired header to the string of xml like this:
xml = (bytes('<?xml version="1.0" encoding="UTF-8"?>\n', encoding='utf-8') + ET.tostring(root))
xml = xml.decode('utf-8')
with open('invoice.xml', 'w+') as f:
    f.write(xml)

I would use ET:
try:
    from lxml import etree
    print("running with lxml.etree")
except ImportError:
    try:
        # Python 2.5
        import xml.etree.cElementTree as etree
        print("running with cElementTree on Python 2.5+")
    except ImportError:
        try:
            # Python 2.5
            import xml.etree.ElementTree as etree
            print("running with ElementTree on Python 2.5+")
        except ImportError:
            try:
                # normal cElementTree install
                import cElementTree as etree
                print("running with cElementTree")
            except ImportError:
                try:
                    # normal ElementTree install
                    import elementtree.ElementTree as etree
                    print("running with ElementTree")
                except ImportError:
                    print("Failed to import ElementTree from any known place")
document = etree.Element('outer')
node = etree.SubElement(document, 'inner')
print(etree.tostring(document, encoding='UTF-8', xml_declaration=True))

This works if you just want to print. Getting an error when I try to send it to a file...
import xml.dom.minidom as minidom
import xml.etree.ElementTree as ET
from xml.etree.ElementTree import Element, SubElement, Comment, tostring
def prettify(elem):
    rough_string = ET.tostring(elem, 'utf-8')
    reparsed = minidom.parseString(rough_string)
    return reparsed.toprettyxml(indent=" ")
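The error seen when writing to a file is not shown, so the cause is unknown; one common culprit is mixing str and bytes with the file mode. The sketch below is one way to write the pretty-printed result under that assumption. The helper name prettify_bytes and the temporary output path are hypothetical, chosen only for illustration.

```python
import os
import tempfile
import xml.dom.minidom as minidom
import xml.etree.ElementTree as ET


def prettify_bytes(elem):
    """Return pretty-printed XML as UTF-8 bytes, declaration included."""
    rough = ET.tostring(elem, encoding="utf-8")
    # Passing encoding= makes toprettyxml() return bytes instead of str.
    return minidom.parseString(rough).toprettyxml(indent="  ", encoding="utf-8")


root = ET.Element("outer")
ET.SubElement(root, "inner").text = "1"

# Hypothetical output location; a temp directory keeps the sketch self-contained.
path = os.path.join(tempfile.mkdtemp(), "pretty.xml")
with open(path, "wb") as f:  # binary mode matches the bytes returned above
    f.write(prettify_bytes(root))

with open(path, encoding="utf-8") as f:
    text = f.read()
print(text)
```

If a str is preferred, the original prettify() can be kept as is and the file opened in text mode with an explicit encoding="utf-8".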

Including 'standalone' in the declaration
I didn't find any alternative for adding the standalone attribute in the documentation, so I adapted the ET.tostring function to take the declaration as an argument.
from xml.etree import ElementTree as ET
# Sample
document = ET.Element('outer')
node = ET.SubElement(document, 'inner')
et = ET.ElementTree(document)
# Function that you need
def tostring(element, declaration, method=None):
    # Minimal file-like object that collects the serialized chunks
    class dummy:
        pass
    data = [declaration + "\n"]
    file = dummy()
    file.write = data.append
    # encoding="unicode" makes write() emit str chunks in Python 3, so they can
    # be joined with the declaration; no second declaration is written.
    ET.ElementTree(element).write(file, encoding="unicode", method=method)
    return "".join(data)
# Working example (encode the result as UTF-8 when saving it, to match the declaration)
xdec = """<?xml version="1.0" encoding="UTF-8" standalone="no" ?>"""
xml = tostring(document, declaration=xdec)

parsing XML file in python

I have an XML file such as:
<?xml version="1.0" encoding="utf-8"?>
<result>
<data>
<_0>stream1</_0>
<_1>file</_1>
<_2>livestream1</_2>
</data>
</result>
I used
xmlTag = dom.getElementsByTagName('data')[0].toxml()
xmlData=xmlTag.replace('<data>','').replace('</data>','')
and I got xmlData:
<_0>stream1</_0>
<_1>file</_1>
<_2>livestream1</_2>
but I need the values stream1, file, livestream1, etc.
How can I do this?

I would suggest using ElementTree. It's faster than the usual DOM implementations, and I think it's more elegant as well.
from xml.etree import ElementTree
# assuming xml_string is your XML above
xml_etree = ElementTree.fromstring(xml_string)
data = xml_etree.find('data')
for elem in data:
    print(elem.text)
Output would be:
stream1
file
livestream1
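If the values are needed as a list rather than printed, the same loop can be written as a comprehension. A minimal sketch, with the XML from the question inlined (without its declaration) as the xml_string variable:

```python
import xml.etree.ElementTree as ET

xml_string = """<result>
  <data>
    <_0>stream1</_0>
    <_1>file</_1>
    <_2>livestream1</_2>
  </data>
</result>"""

root = ET.fromstring(xml_string)

# Iterating over an element yields its direct children in document order.
values = [child.text for child in root.find("data")]
print(values)
```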

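For your information, this is how to do it with lxml and XPath:
from lxml import etree
# assuming xml_string holds the XML above as bytes
# (lxml rejects a str that carries an encoding declaration)
doc = etree.fromstring(xml_string)
for elem in doc.xpath('//data/*'):
    print(elem.text)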
The output should be the same:
stream1
file
livestream1
