Checking if XML declaration is present - python

I am trying to check whether an xml file contains the necessary xml declaration ("header"), let's say:
<?xml version="1.0" encoding="UTF-8"?>
...rest of xml file...
I am using xml ElementTree for reading and getting info out of the file, but it seems to load a file just fine even if it does not have the header.
What I tried so far is this:
import re
import sys
import xml.etree.ElementTree as ET
tree = ET.parse(someXmlFile)
try:
    xmlFile = ET.tostring(tree.getroot(), encoding='utf8').decode('utf8')
except Exception:
    sys.stderr.write("Wrong xml2 header\n")
    sys.exit(31)
if re.match(r"^\s*<\?xml version=\'1\.0\' encoding=\'utf8\'\?>\s+", xmlFile) is None:
    sys.stderr.write("Wrong xml1 header\n")
    sys.exit(31)
But the ET.tostring() function just "makes up" a header if it is not present in the file.
Is there any way to check for an XML header with ET? Or to somehow raise an error while loading the file with ET.parse if the file does not contain the XML header?

tl;dr
from xml.dom.minidom import parseString
def has_xml_declaration(xml):
    # version is None when the document has no XML declaration
    return parseString(xml).version
From Wikipedia's XML declaration
If an XML document lacks encoding specification, an XML parser assumes
that the encoding is UTF-8 or UTF-16, unless the encoding has already
been determined by a higher protocol.
...
The declaration may be optionally omitted because it declares as its
encoding the default encoding. However, if the document instead makes
use of XML 1.1 or another character encoding, a declaration is
necessary. Internet Explorer prior to version 7 enters quirks mode, if
it encounters an XML declaration in a document served as text/html
So even if the XML declaration is omitted in an XML document, the code-snippet:
if re.match(r"^<\?xml\s*version=\'1\.0\' encoding=\'utf8\'\s*\?>", xmlFile.decode('utf-8')) is None:
will find "the" default XML declaration in this XML document. Please note that I have used xmlFile.decode('utf-8') instead of xmlFile.
If you don't mind using minidom, you can use the following code snippet:
from xml.dom.minidom import parse
dom = parse('bookstore-003.xml')
print('<?xml version="{}" encoding="{}"?>'.format(dom.version, dom.encoding))
Here is a working fiddle
In bookstore-001.xml an XML declaration is present, in bookstore-002.xml no XML declaration is present, and in bookstore-003.xml an XML declaration different from the one in the first example is present. The print statement prints the version and the encoding accordingly:
<?xml version="1.0" encoding="UTF-8"?>
<?xml version="None" encoding="None"?>
<?xml version="1.0" encoding="ISO-8859-1"?>
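If building a DOM just to inspect the declaration feels heavy, a lower-level option is the expat parser's XmlDeclHandler callback, which is only called when a declaration is actually present. The following is a minimal sketch under that assumption; the helper name find_xml_declaration and the sample byte strings are made up for illustration:

```python
import xml.parsers.expat

def find_xml_declaration(data):
    """Return (version, encoding, standalone) if the document has an XML
    declaration, or None if the declaration is absent."""
    found = []
    parser = xml.parsers.expat.ParserCreate()
    # expat invokes this handler only when it parses a real declaration
    parser.XmlDeclHandler = lambda version, encoding, standalone: found.append(
        (version, encoding, standalone))
    parser.Parse(data, True)
    return found[0] if found else None

with_decl = b'<?xml version="1.0" encoding="UTF-8"?><root/>'
without_decl = b'<root/>'
print(find_xml_declaration(with_decl))
print(find_xml_declaration(without_decl))
```

Note that Parse raises xml.parsers.expat.ExpatError on malformed input, so a caller would probably want to handle that separately from the "no declaration" case.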

Related

Python XML parsing removing empty CDATA nodes

I'm using minidom from xml.dom to parse an xml document. I make some changes to it and then re-export it back to a new xml file. This file is generated by a program as an export and I use the changed document as an import. Upon importing, the program tells me that there are missing CDATA nodes and that it cannot import.
I simplified my code to test the process:
from xml.dom import minidom
filename = 'Test.xml'
dom = minidom.parse(filename)
with open(filename.replace('.xml', '_Generated.xml'), mode='w', encoding='utf8') as fh:
    fh.write(dom.toxml())
Using this for the Test.xml:
<?xml version="1.0" encoding="UTF-8"?>
<body>
<![CDATA[]]>
</body>
This is what the Test_Generated.xml file contains:
<?xml version="1.0" ?><body>
</body>
A simple solution is to first open the document and change all the empty CDATA nodes to include some value before parsing then removing the value from the new file after generation but this seems like unnecessary work and time for execution as some of these documents include tens of thousands of lines.
I partially debugged the issue down to expatbuilder.py and its parser. The parser is installed with custom callbacks. The callback that handles the data from the CDATA nodes is the character_data_handler_cdata method. The data that is supplied to this method is already missing after parsing.
Anyone know what is going on with this?
Unfortunately the XML specification is not 100% explicit about what counts as significant information in a document and what counts as noise. But there's a fairly wide consensus that CDATA tags serve no purpose other than to delimit text that hasn't been escaped: so % and &#37; and &#x25; and <![CDATA[%]]> are different ways of writing the same content, and whichever of these you use in your input, the XML parser will produce the same output. On that assumption, an empty <![CDATA[]]> represents "no content" and a parser will remove it.
If your document design attaches significance to CDATA tags then it's out of line with the usual practice followed by most XML tooling, and it would be a good idea to revise the design to use element tags instead.
Having said that, many XML parsers do have an option to report CDATA tags to the application, so you may be able to find a way around this, but it's still not a good design choice.
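To illustrate the equivalence described above, here is a small sketch using the standard library's ElementTree (not minidom, which the question uses). The element name a is arbitrary; the point is only that all four spellings arrive at the application as the same text:

```python
import xml.etree.ElementTree as ET

# Four ways of writing a percent sign as element content
variants = [
    "<a>%</a>",
    "<a>&#37;</a>",
    "<a>&#x25;</a>",
    "<a><![CDATA[%]]></a>",
]

# The parser reports plain character data in every case; the CDATA
# delimiters and character references are not visible in the result
texts = [ET.fromstring(v).text for v in variants]
print(texts)
```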

Beautifulsoup4, parsing Tableau XML file, and writing to file

I'm having an issue where I'm using beautifulsoup to parse the xml generated from a Tableau workbook and when I write the results to file it doesn't behave as expected. I chose bs4 and its standard XML parser because I find it easiest for my brain to comprehend and I don't need the speed of the lxml parser/package.
Background: I have a calculated field in my Tableau workbook that will programmatically change during publish depending on the server and site location that template workbook will go to. I've already gone through and built some functions and scripted out everything I need to get the data to do this, but when my script writes the xml to file it adds some encodings for ampersand. This results in the file being valid and able to be opened in Tableau, but the field is considered invalid, despite looking like it is valid. I'm thinking the XML is somehow getting malformed somewhere in my process.
Code so far for where I think the issue is occurring:
from bs4 import BeautifulSoup as bs
twb = open(Script_config['local_file_location'], 'r')
bs_content = bs(twb, 'xml')
# formula_final below comes from another script that handles getting the data I need to programmatically generate the formula I need.
# Here is what I use to generate the bulk of the formula for Tableau
# 'When &apos;[{}]&apos; then {} '.format(rows['Column_Name'], rows['Formatted_ColumnName']))
# Does some other stuff and slaps together the formula I need as a string that can be written into my XML
# Verified that my result is coming over correctly and only changes once I do the replacement here and/or the writing of the file.
for calculation in bs_content.find_all('column', {'caption': 'Group By', 'datatype': 'string', 'name': '[Calculation_12345678910]'}):
    calculation.find('calculation')['formula'] = formula_final
with open('test.twb', 'w') as file:
    file.write(str(bs_content))
Sample XML:
<?xml version="1.0" encoding="utf-8"?>
<workbook source-build="2021.1.4 (20211.21.0712.0907)" source-platform="win" version="18.1" xml:base="https://localhost" xmlns:user="http://www.tableausoftware.com/xml/user">
...
<column caption="Group By" datatype="string" name="[Calculation_12345678910]" role="dimension" type="nominal">
<calculation class="tableau" formula="Case [Parameters].[Location External ID Parameter] When &apos;[Territory]&apos; then [Territory] End"/>
</column>
Problem:
In the sample XML, Tableau is expecting the XML to be formatted without the &amp; in front of the apos;. It should just be reading as &apos;.
What I've tried:
Thinking that I could just escape the & character I put the necessary slashes in place to escape it before the apos; portion, but to no avail I can't figure out how to get my XML to be formed so that it doesn't always put the ampersand code as part of the other special characters in my XML.
Any help would be much appreciated!
Good problem description.
Your problem is known as 'double escaping'. Your program is reading data which has already been serialized by an XML processor. That's why it contains &apos;[{}]&apos; and not '[{}]'
I think your program reads that XML value from a file as a simple string and assigns it to an attribute value. But when BeautifulSoup's XML processor encounters the & in that value, it must replace it with &amp;. So you end up with &amp;apos; instead of &apos; in the XML output.
The quick and dirty solution is to write some code to replace all XML entities with the equivalent text. A better solution would be to read the XML data using an XML parser - that way, your program will receive the intended string value automatically.
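The effect can be reproduced without Tableau or BeautifulSoup. The sketch below uses the standard library's ElementTree as a stand-in serializer (an assumption: the answer describes bs4 as behaving the same way) and html.unescape for the "quick and dirty" entity replacement. The formula string is a shortened version of the one in the question:

```python
import html
import xml.etree.ElementTree as ET

# A formula that was already escaped for XML before reaching the serializer
pre_escaped = "When &apos;[Territory]&apos; then [Territory]"

elem = ET.Element("calculation")
elem.set("formula", pre_escaped)
# The serializer escapes the & again, producing &amp;apos;
double_escaped = ET.tostring(elem, encoding="unicode")

# Turn the entities back into plain text first and let the serializer
# do the only round of escaping
elem.set("formula", html.unescape(pre_escaped))
single = ET.tostring(elem, encoding="unicode")

print(double_escaped)
print(single)
```

In practice it is usually simpler to build the formula with literal apostrophes in the first place, so that no unescaping step is needed.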

how to parse xml with multiple root element

I need to parse both var & group root elements.
Code
import xml.etree.ElementTree as ET
tree_ownCloud = ET.parse('0020-syslog_rules.xml')
root = tree_ownCloud.getroot()
Error
xml.etree.ElementTree.ParseError: junk after document element: line 17, column 0
Sample XML
<var name="BAD_WORDS">core_dumped|failure|error|attack| bad |illegal |denied|refused|unauthorized|fatal|failed|Segmentation Fault|Corrupted</var>
<group name="syslog,errors,">
<rule id="1001" level="2">
<match>^Couldn't open /etc/securetty</match>
<description>File missing. Root access unrestricted.</description>
<group>pci_dss_10.2.4,gpg13_4.1,</group>
</rule>
<rule id="1002" level="2">
<match>$BAD_WORDS</match>
<options>alert_by_email</options>
<description>Unknown problem somewhere in the system.</description>
<group>gpg13_4.3,</group>
</rule>
</group>
I tried following couple of other questions on stackoverflow here, but none helped.
I know the reason, due to which it is not getting parsed, people have usually tried hacks. IMO it's a very common usecase to have multiple root elements in XML, and something must be there in ET parsing library to get this done.
As mentioned in the comment, an XML file cannot have multiple roots. Simple as that.
If you do receive or store data in this format, then it's not proper XML. You could consider a hack of surrounding what you have with a fake tag, e.g.
import xml.etree.ElementTree as ET
with open("0020-syslog_rules.xml", "r") as inputFile:
    fileContent = inputFile.read()
root = ET.fromstring("<fake>" + fileContent + "</fake>")
print(root)
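Once the fragment is wrapped, the original top-level elements are simply the children of the fake root. A minimal sketch, using a shortened in-memory copy of the sample data instead of the file (the abbreviated content is an assumption made to keep the example self-contained):

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the contents of 0020-syslog_rules.xml
fragment = """<var name="BAD_WORDS">error|failed</var>
<group name="syslog,errors,">
  <rule id="1001" level="2">
    <description>File missing.</description>
  </rule>
</group>"""

root = ET.fromstring("<fake>" + fragment + "</fake>")

# The former "roots" are now the direct children of <fake>
tags = [child.tag for child in root]
# Nested elements are reachable as usual
rule_ids = [rule.get("id") for rule in root.iter("rule")]
print(tags, rule_ids)
```

This only works if the file does not start with an XML declaration, since a declaration is not allowed after the opening wrapper tag.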
Actually, the example data is not a well-formed XML document, but it is a well-formed XML entity. Some XML parsers have an option to accept an entity rather than a document, and in XPath 3.1 you can parse this using the parse-xml-fragment() function.
Another way to parse a fragment like this is to create a wrapper document which references it as an external entity:
<!DOCTYPE wrapper [
<!ENTITY e SYSTEM "fragment.xml">
]>
<wrapper>&e;</wrapper>
and then supply this wrapper document as the input to your XML parser.

copying input xml file and write exactly with Python

Input xml file:
<?xml version="1.0"?>
<res:testcases xmlns:res="urn:testcases" id="a1e4bfdb-40a2-485c-a1ac-54d220056dd5" type="MODEL">
<mode>PRESSURE_CONTROL</mode>
<category>ADULT</category>
<testcase id="1" type="UNIQUE">
<parameter id="PEEP" value="1.0">true</parameter>
<parameter id="CMV_FREQ" value="4.0">true</parameter>
<parameter id="PRESS_ABOVE_PEEP" value="0.0">true</parameter>
<parameter id="I_E_RATIO" value="0.1">false</parameter>
</testcase>
</res:testcases>
Python Code:
import xml.etree.ElementTree as ET
tree = ET.parse('/home/AlAhAb65/Desktop/input.xml')
root = tree.getroot()
root.attrib['type'] = 'AVA'
tree.write('/home/AlAhAb65/Desktop/output1.xml')
Output xml file:
<ns0:testcases id="a1e4bfdb-40a2-485c-a1ac-54d220056dd5" type="AVA" xmlns:ns0="urn:testcases">
<mode>PRESSURE_CONTROL</mode>
<category>ADULT</category>
<testcase id="1" type="UNIQUE">
<parameter id="PEEP" value="1.0">true</parameter>
<parameter id="CMV_FREQ" value="4.0">true</parameter>
<parameter id="PRESS_ABOVE_PEEP" value="0.0">true</parameter>
<parameter id="I_E_RATIO" value="0.1">false</parameter>
</testcase>
</ns0:testcases>
The problem is when I am copying and writing the output xml file 3 unexpected things happen. They are given below:
1. The first line from the input xml file is removed automatically
2. In second line (in input), the text 'res' is replaced with 'ns0'. Same happens while closing the tag
3. The order of the attribute (of the second line of input) is changed.
But I want to write (as output) the exact copy of xml file that I got as an input. Please help me in this regard.
The W3C has defined a Canonical XML (C14N) standard. Documents written in this format can be faithfully round-tripped by any C14N-compliant toolchain.
In the case of lxml.etree (a more capable implementation of the ElementTree API with C14N support), this means that you need to do two things:
1. Convert your original input document into C14N form.
2. Use the ElementTree.write_c14n() call to generate your output document.
A C14N-form version of your input file will look like so (generated by the xmlstarlet c14n command):
<res:testcases xmlns:res="urn:testcases" id="a1e4bfdb-40a2-485c-a1ac-54d220056dd5" type="MODEL">
<mode>PRESSURE_CONTROL</mode>
<category>ADULT</category>
<testcase id="1" type="UNIQUE">
<parameter id="PEEP" value="1.0">true</parameter>
<parameter id="CMV_FREQ" value="4.0">true</parameter>
<parameter id="PRESS_ABOVE_PEEP" value="0.0">true</parameter>
<parameter id="I_E_RATIO" value="0.1">false</parameter>
</testcase>
</res:testcases>
...and an appropriately modified version of your code:
#!/usr/bin/env python
import lxml.etree
tree = lxml.etree.parse('input.xml')
root = tree.getroot()
root.attrib['type'] = 'AVA'
tree.write_c14n('output1.xml')
If you add an XML declaration (the <?xml version="1.0"?> line), you will be noncompliant with the C14N standard. As such, this is something you absolutely should not do. If you really, really want to do this wrongheaded thing...
Don't.
But if you must, you'd do it like so:
# write_c14n() emits bytes, so the file is opened in binary mode
with open('output1.xml', 'wb') as outfile:
    outfile.write(b'<?xml version="1.0"?>\n')
    tree.write_c14n(outfile)
From the documentation page, the XML declaration can be added like this:
tree.write('/home/AlAhAb65/Desktop/output1.xml', xml_declaration=True)
You should also add the encoding because the default one is us-ascii:
tree.write('/home/AlAhAb65/Desktop/output1.xml', encoding='utf-8', xml_declaration=True)
Or you can retrieve the encoding from the original file, but in any case you will get a different XML declaration, probably something like this:
<?xml version="1.0" encoding="UTF-8"?>
Or you can manually add the XML declaration. Anyway a slight declaration mismatch should not be a problem for any robust XML parser as long as the declared encoding is coherent with the real encoding.
Attribute order is not significant in XML, so the information is probably lost when the file is parsed within the API. There is probably no simple way to make this work when processing the file through the standard ElementTree API. You would probably be better off using lxml's C14N support if you only want to make minor changes to the file.
The namespace prefixes are changed by default in ElementTree. To prevent this behavior, you can switch to lxml which seems to preserve namespace prefixes by default:
Because etree is built on top of libxml2, which is namespace prefix aware, etree preserves namespaces declarations and prefixes while ElementTree tends to come up with its own prefixes (ns0, ns1, etc). When no namespace prefix is given, however, etree creates ElementTree style prefixes as well.
Switching to lxml is a good idea in any case, but the changes you observe should not be a problem if the program reading the file at the other end is XML compliant enough. Unfortunately a lot of XPath processors have issues with namespace prefixes changes...
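If staying with the standard library is preferred, the ns0 prefix can also be avoided by registering the desired prefix with ET.register_namespace before serializing. The following is a minimal sketch on a shortened, in-memory version of the input (the abbreviated document and the id value a1 are assumptions for illustration); it addresses only the prefix, not the XML declaration:

```python
import xml.etree.ElementTree as ET

source = ('<res:testcases xmlns:res="urn:testcases" id="a1" type="MODEL">'
          '<mode>PRESSURE_CONTROL</mode></res:testcases>')

# Tell the serializer which prefix to use for this namespace URI.
# The registration is global to the ElementTree module.
ET.register_namespace("res", "urn:testcases")

root = ET.fromstring(source)
root.set("type", "AVA")
output = ET.tostring(root, encoding="unicode")
print(output)
```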

Get some unexpected changes in xml file when use python/elementtree

Here is the original xml file:
<?xml version="1.0" encoding="UTF-8"?>
<TVAMain xml:lang="en-GB" xmlns="urn:tva:metadata:2010" xmlns:tva2="urn:tva:metadata:extended:2010" xmlns:yv="http://refdata.youview.com/schemas/Metadata/2012-10-16" xmlns:mpeg7="urn:tva:mpeg7:2008" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://refdata.youview.com/schemas/Metadata/2012-09-26 ../schemas/youview_metadata_2012-09-26.xsd">
<!-- -->
<ProgramDescription> .............................
I changed some of the content of the xml (but not the part I posted here, which should be unchanged), then wrote it to a new xml file, but the content of the new xml file became this:
<?xml version='1.0' encoding='UTF-8'?>
<TVAMain xmlns="urn:tva:metadata:2010" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://refdata.youview.com/schemas/Metadata/2012-09-26 ../schemas/youview_metadata_2012-09-26.xsd" xml:lang="en-GB">
<ProgramDescription>....................
You can see that some of the content is lost and the order has also changed. What should I do to avoid any changes to the xml?
Attributes on XML tags do not have a fixed order; changing their ordering doesn't change their meaning.
ElementTree will only write out namespace qualifiers for namespaces actually in use. Your example is very brief, but I suspect it doesn't make use of the yv and mpeg7 namespaces at all.
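The <!-- --> comment from the original file is also missing in the output, because ElementTree's default parser discards comments. On Python 3.8 or later, a TreeBuilder created with insert_comments=True keeps them. A minimal sketch on a reduced document (the namespace declarations are left out here for brevity, which is an assumption and not part of the original file):

```python
import xml.etree.ElementTree as ET

source = "<TVAMain><!-- note --><ProgramDescription/></TVAMain>"

# Default parser: the comment is dropped while parsing
default_root = ET.fromstring(source)
default_out = ET.tostring(default_root, encoding="unicode")

# Parser whose tree builder inserts comments into the tree (Python 3.8+)
parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
keeping_root = ET.fromstring(source, parser=parser)
keeping_out = ET.tostring(keeping_root, encoding="unicode")

print(default_out)
print(keeping_out)
```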
