Copying an input XML file and writing it back exactly with Python

Input xml file:
<?xml version="1.0"?>
<res:testcases xmlns:res="urn:testcases" id="a1e4bfdb-40a2-485c-a1ac-54d220056dd5" type="MODEL">
<mode>PRESSURE_CONTROL</mode>
<category>ADULT</category>
<testcase id="1" type="UNIQUE">
<parameter id="PEEP" value="1.0">true</parameter>
<parameter id="CMV_FREQ" value="4.0">true</parameter>
<parameter id="PRESS_ABOVE_PEEP" value="0.0">true</parameter>
<parameter id="I_E_RATIO" value="0.1">false</parameter>
</testcase>
</res:testcases>
Python Code:
import xml.etree.ElementTree as ET
tree = ET.parse('/home/AlAhAb65/Desktop/input.xml')
root = tree.getroot()
root.attrib['type'] = 'AVA'
tree.write('/home/AlAhAb65/Desktop/output1.xml')
Output xml file:
<ns0:testcases id="a1e4bfdb-40a2-485c-a1ac-54d220056dd5" type="AVA" xmlns:ns0="urn:testcases">
<mode>PRESSURE_CONTROL</mode>
<category>ADULT</category>
<testcase id="1" type="UNIQUE">
<parameter id="PEEP" value="1.0">true</parameter>
<parameter id="CMV_FREQ" value="4.0">true</parameter>
<parameter id="PRESS_ABOVE_PEEP" value="0.0">true</parameter>
<parameter id="I_E_RATIO" value="0.1">false</parameter>
</testcase>
</ns0:testcases>
When I copy the input file and write it back out, three unexpected things happen:
1. The first line of the input file (the XML declaration) is removed.
2. In the second line of the input, the prefix 'res' is replaced with 'ns0'. The same happens in the closing tag.
3. The order of the attributes on the second line of the input is changed.
I want the output to be an exact copy of the XML file I read as input. How can I achieve that?

The W3C has defined a Canonical XML (C14N) standard. Documents written in this format can be faithfully round-tripped by any C14N-compliant toolchain.
In the case of lxml.etree (a more capable implementation of the ElementTree API with C14N support), this means that you need to do two things:
Convert your original input document into C14N form.
Use the ElementTree.write_c14n() call to generate your output document.
A C14N-form version of your input file will look like so (generated by the xmlstarlet c14n command):
<res:testcases xmlns:res="urn:testcases" id="a1e4bfdb-40a2-485c-a1ac-54d220056dd5" type="MODEL">
<mode>PRESSURE_CONTROL</mode>
<category>ADULT</category>
<testcase id="1" type="UNIQUE">
<parameter id="PEEP" value="1.0">true</parameter>
<parameter id="CMV_FREQ" value="4.0">true</parameter>
<parameter id="PRESS_ABOVE_PEEP" value="0.0">true</parameter>
<parameter id="I_E_RATIO" value="0.1">false</parameter>
</testcase>
</res:testcases>
...and an appropriately modified version of your code:
#!/usr/bin/env python
import lxml.etree
tree = lxml.etree.parse('input.xml')
root = tree.getroot()
root.attrib['type'] = 'AVA'
tree.write_c14n('output1.xml')
If you add an XML declaration (the <?xml version="1.0"?> line), you will be noncompliant with the C14N standard. As such, this is something you absolutely should not do. If you really, really want to do this wrongheaded thing...
Don't.
But if you must, you'd do it like so:
with open('output1.xml', 'wb') as outfile:  # write_c14n() emits bytes, so open in binary mode
    outfile.write(b'<?xml version="1.0"?>\n')
    tree.write_c14n(outfile)

From the documentation page, the XML declaration can be added like this:
tree.write('/home/AlAhAb65/Desktop/output1.xml', xml_declaration=True)
You should also add the encoding because the default one is us-ascii:
tree.write('/home/AlAhAb65/Desktop/output1.xml', encoding='utf-8', xml_declaration=True)
Or you can retrieve the encoding from the original file, but in any case you will get a different XML declaration, probably something like this:
<?xml version="1.0" encoding="UTF-8"?>
Or you can manually add the XML declaration. Anyway a slight declaration mismatch should not be a problem for any robust XML parser as long as the declared encoding is coherent with the real encoding.
Attribute order is not significant in XML, so the information is probably lost when the file is parsed within the API. There is probably no simple way to preserve it when processing the file through the standard ElementTree API. You would probably be better off using lxml's C14N support if you only want to make minor changes to the file.
The namespace prefixes are changed by default in ElementTree. To prevent this behavior, you can switch to lxml which seems to preserve namespace prefixes by default:
Because etree is built on top of libxml2, which is namespace prefix aware, etree preserves namespaces declarations and prefixes while ElementTree tends to come up with its own prefixes (ns0, ns1, etc). When no namespace prefix is given, however, etree creates ElementTree style prefixes as well.
Switching to lxml is a good idea in any case, but the changes you observe should not be a problem if the program reading the file at the other end is XML compliant enough. Unfortunately, a lot of XPath processors have issues with namespace prefix changes...
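If switching to lxml is not an option, the standard library can usually keep the original prefix and emit a declaration as well. The following is only a minimal sketch: it assumes the prefix/URI pair is known in advance, and the shortened document and the `id` value `a1` are made up for illustration. The declaration it writes will still differ textually from the input one.

```python
import io
import xml.etree.ElementTree as ET

# Shortened, hypothetical stand-in for the input file from the question.
SOURCE = (
    '<?xml version="1.0"?>\n'
    '<res:testcases xmlns:res="urn:testcases" id="a1" type="MODEL">\n'
    '<mode>PRESSURE_CONTROL</mode>\n'
    '</res:testcases>\n'
)

# Tell the serializer which prefix to use for this namespace URI.
# Without this call ElementTree invents ns0, ns1, ...
ET.register_namespace("res", "urn:testcases")

root = ET.fromstring(SOURCE)
root.set("type", "AVA")

# Serialize into memory; a file path works the same way.
buffer = io.BytesIO()
ET.ElementTree(root).write(buffer, encoding="utf-8", xml_declaration=True)
output = buffer.getvalue().decode("utf-8")
print(output)
```

Note that register_namespace() is a global setting, so it affects every tree serialized afterwards in the same process.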

Related

Is there a way to pass parameters to an xml? Or modifying it?

I would like to pass a parameter to an XML document, so that instead of being raw XML with all values fixed at creation, one value can be set from a parameter (user input, for example).
Ideally, I was looking for something like <title> &param1 </title> so that I could later pass whatever parameter I like, but I guess that cannot be done.
Since passing a parameter does not seem possible (at least from what I have found), I thought about editing the XML after it is created.
I have searched mostly for BeautifulSoup solutions, because that is what I want to use (and what I am already using). This is only a small part of my project.
(For example, this and this are some of my research.)
This is the function I am trying to write:
We have an XML string, we find the part we want to edit, and we edit it. (I know that the index needs to be an integer, so pruebaEdit[anyString] is not correct.)
def editXMLTest():
    editTest="""<?xml version="1.0" ?>
<books>
<book>
<title>moon</title>
<author>louis</author>
<price>8.50</price>
</book>
</books>
"""
    soup =BeautifulSoup(editTest)
    for tag in soup.find_all('title'):
        print (tag.string, '\n')
        #tag.string='invented title'
        editTest[tag]='invented title' #I know it has to be an integer, not a string
    print()
    print(editTest)
My expected output should be in the xml: <title>invented title</title> instead of <title>moon</title>.
Edit: added this to my research
You have to modify the soup object and print it, not the original string editTest:
for tag in soup.find_all('title'):
    print (tag.string, '\n')
    tag.string='invented title'
print(soup)
Using entity references like &param; is the closest thing available in XML itself, but it's not very flexible because the entity expansions are defined in a DTD file rather than being supplied programmatically to the XML parser. Some parsers (I don't know the Python situation) allow you to supply an EntityResolver which can resolve entity references programmatically, but it wouldn't be my first choice of approach.
There are of course templating languages that allow XML to be constructed programmatically. XSLT is the most obvious choice; it probably does a lot more than you need, but that's not necessarily a drawback. Some other options are listed at https://en.wikipedia.org/wiki/Comparison_of_web_template_engines -- including a handful for the Python environment. Unfortunately many of these tools, in my experience, are not particularly well documented or supported, so do your research carefully.
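For a very small case like a single title, even the standard library's string.Template can act as a lightweight template engine. This is only a sketch under that assumption: the names `BOOK_TEMPLATE` and `render_book` are hypothetical, and the value is escaped so that characters such as & or < cannot break the markup.

```python
from string import Template
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

# Hypothetical template; $title is the placeholder filled in at run time.
BOOK_TEMPLATE = Template(
    "<books><book>"
    "<title>$title</title>"
    "<author>louis</author>"
    "<price>8.50</price>"
    "</book></books>"
)


def render_book(title):
    """Return the XML text with the escaped title substituted in."""
    return BOOK_TEMPLATE.substitute(title=escape(title))


xml_text = render_book("moon & stars")

# Parsing the result confirms it is still well-formed XML.
root = ET.fromstring(xml_text)
print(root.find("book/title").text)
```

A dedicated template engine or XSLT is likely a better fit once the document has more than a handful of placeholders.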
Python's lxml can run XSLT 1.0 scripts (it can also serve as the parsing engine for BeautifulSoup), so you can pass parameters to modify XML files as needed. Simply declare an <xsl:param> in the XSLT script and pass the value from Python via strparam:
XSLT (save as .xsl file, a special .xml file)
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output indent="yes" omit-xml-declaration="no"/>
<xsl:strip-space elements="*"/>
<!-- INITIALIZE PARAMETER -->
<xsl:param name="new_title" />
<!-- IDENTITY TRANSFORM -->
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<!-- REWRITE TITLE TEXT -->
<xsl:template match="title">
<xsl:copy>
<xsl:value-of select="$new_title"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
Python (see output below as comment)
import lxml.etree as et
txt = '''<books>
<book>
<title>moon</title>
<author>louis</author>
<price>8.50</price>
</book>
</books>'''
# LOAD XSL SCRIPT
xml = et.fromstring(txt)
xsl = et.parse('/path/to/XSLTScript.xsl')
transform = et.XSLT(xsl)
# PASS PARAMETER TO XSLT
n = et.XSLT.strparam('invented title')
result = transform(xml, new_title=n)
print(result)
# <?xml version="1.0"?>
# <books>
# <book>
# <title>invented title</title>
# <author>louis</author>
# <price>8.50</price>
# </book>
# </books>
# SAVE XML TO FILE
with open('Output.xml', 'wb') as f:
    f.write(result)
Use <xsl> tag to pass a parameter in xml

how to parse xml with multiple root element

I need to parse both var & group root elements.
Code
import xml.etree.ElementTree as ET
tree_ownCloud = ET.parse('0020-syslog_rules.xml')
root = tree_ownCloud.getroot()
Error
xml.etree.ElementTree.ParseError: junk after document element: line 17, column 0
Sample XML
<var name="BAD_WORDS">core_dumped|failure|error|attack| bad |illegal |denied|refused|unauthorized|fatal|failed|Segmentation Fault|Corrupted</var>
<group name="syslog,errors,">
<rule id="1001" level="2">
<match>^Couldn't open /etc/securetty</match>
<description>File missing. Root access unrestricted.</description>
<group>pci_dss_10.2.4,gpg13_4.1,</group>
</rule>
<rule id="1002" level="2">
<match>$BAD_WORDS</match>
<options>alert_by_email</options>
<description>Unknown problem somewhere in the system.</description>
<group>gpg13_4.3,</group>
</rule>
</group>
I tried following couple of other questions on stackoverflow here, but none helped.
I know the reason, due to which it is not getting parsed, people have usually tried hacks. IMO it's a very common usecase to have multiple root elements in XML, and something must be there in ET parsing library to get this done.
As mentioned in the comment, an XML file cannot have multiple roots. Simple as that.
If you do receive or store data in this format, it is not proper XML. You could consider the hack of surrounding what you have with a fake tag, e.g.
import xml.etree.ElementTree as ET
with open("0020-syslog_rules.xml", "r") as inputFile:
    fileContent = inputFile.read()
    root = ET.fromstring("<fake>" + fileContent + "</fake>")
    print(root)
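Once the fragment is wrapped, the original top-level elements are simply the children of the fake root. A self-contained sketch of that idea follows; the shortened fragment is made up for illustration and kept in memory rather than read from the rules file.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical fragment with two top-level elements.
FRAGMENT = """<var name="BAD_WORDS">error|failed</var>
<group name="syslog,errors,">
  <rule id="1002" level="2">
    <match>$BAD_WORDS</match>
  </rule>
</group>"""

# Wrap the fragment so the parser sees exactly one document element.
wrapper = ET.fromstring("<fake>" + FRAGMENT + "</fake>")

# The former "roots" are now direct children of the wrapper.
top_level_tags = [child.tag for child in wrapper]
variables = {var.get("name"): var.text for var in wrapper.findall("var")}
rule_ids = [rule.get("id") for rule in wrapper.iter("rule")]

print(top_level_tags, variables, rule_ids)
```

This approach stops working if the file starts with an XML declaration, because the declaration would then no longer be at the very start of the wrapped text.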
Actually, the example data is not a well-formed XML document, but it is a well-formed XML entity. Some XML parsers have an option to accept an entity rather than a document, and in XPath 3.1 you can parse this using the parse-xml-fragment() function.
Another way to parse a fragment like this is to create a wrapper document which references it as an external entity:
<!DOCTYPE wrapper [
<!ENTITY e SYSTEM "fragment.xml">
]>
<wrapper>&e;</wrapper>
and then supply this wrapper document as the input to your XML parser.

xml formatting with python Elementtree

XML File:
<testcases>
<mode>PRESSURE_CONTROL</mode>
<category>ADULT</category>
<testcase id="1">
<parameter id="PEEP" value="1.000000">false</parameter>
<parameter id="CMV_FREQ" value="4.0">false</parameter>
<parameter id="PRESS_ABOVE_PEEP" value="0.0">true</parameter>
<parameter id="I_E_RATIO" value="0.100000">false</parameter>
</testcase>
</testcases>
Python code:
import xml.etree.ElementTree as ET
tree = ET.parse('Results.xml')
root = tree.getroot()
mode = root.find('mode').text  # tag names are case-sensitive
category = root.find('category').text
self.tag_invalid = ET.SubElement(root, 'invalid') # For adding new tag with attributes and values
for v in self.final_result:
    self.tag_testcase = ET.SubElement(self.tag_invalid, 'testcase')
    self.tag_testcase.attrib['id'] = '5'  # attribute values must be strings
    self.tag_testcase.attrib['parameter'] = 'IE'
    self.tag_testcase.text = '100'  # element text must be a string as well
tree.write('/home/AlAhAb65/Desktop/test.xml')
Output:
<testcases>
<mode>PRESSURE_CONTROL</mode>
<category>ADULT</category>
<testcase id="1">
<parameter id="PEEP" value="1.000000">false</parameter>
<parameter id="CMV_FREQ" value="4.0">false</parameter>
<parameter id="PRESS_ABOVE_PEEP" value="0.0">true</parameter>
<parameter id="I_E_RATIO" value="0.100000">false</parameter>
</testcase>
<invalid><testcase id="5" parameter="I_E_RATIO">100.0</testcase></invalid></testcases> # Extra line after python code running
The extra line is added to the XML file, but the problem is that I cannot format it. That is, I cannot add '\n' or '\t' to maintain the hierarchy and format. Is there any rule for that? I tried the tree.write() and ET.Element() functions, but those do not provide the desired result.
If you'd like the indentation of the XML text file to visually represent the hierarchy of the XML document, you need to pretty-print it. One way to do that is with xmllint --format:
$ xmllint --format test.xml
<?xml version="1.0"?>
<testcases>
<mode>PRESSURE_CONTROL</mode>
<category>ADULT</category>
<testcase id="1">
<parameter id="PEEP" value="1.000000">false</parameter>
<parameter id="CMV_FREQ" value="4.0">false</parameter>
<parameter id="PRESS_ABOVE_PEEP" value="0.0">true</parameter>
<parameter id="I_E_RATIO" value="0.100000">false</parameter>
</testcase>
<invalid>
<testcase id="5" parameter="I_E_RATIO">100.0</testcase>
</invalid>
</testcases>
If you'd like to generate the text file already pretty-printed, try reparsing it with a different XML library, for example minidom:
>>> from xml.dom import minidom
>>> print(minidom.parseString(
...     ET.tostring(
...         tree.getroot(),
...         'utf-8')).toprettyxml(indent=" "))
But note that each of these solutions changes the XML document. Strictly speaking, the resulting text files are not equivalent to the original -- the text elements have extra spaces and newlines added.
You can control the text content of ElementTree elements using the attributes tail and text. E.g., try adding:
self.tag_invalid.text = "\n "
self.tag_invalid.tail = "\n "
Use that as a starting point, and try adding text/tail to the various other elements you create, print the results, and play around with it until it gives you what you want.
Here's an example showing what text and tail mean:
<A>TEXT_OF_A<B>TEXT_OF_B</B>TAIL_OF_B<C>TEXT_OF_C</C>TAIL_OF_C<D/>TAIL_OF_D</A>TAIL_OF_A
Alternatively, you can write a recursive function that walks through your xml tree, setting both text & tail attributes to properly indent it (relative to depth).
For more documentation on the text and tail attributes, see: http://docs.python.org/2/library/xml.etree.elementtree.html
EDIT: Take a look at http://effbot.org/zone/element-lib.htm#prettyprint to see an example of how you can recursively walk through the xml tree, setting text & tail so that all elements will be indented to their nesting depth.
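The recursive approach can be sketched roughly as follows. This is a minimal illustration, not the exact recipe from the linked page: the function name `indent` and its `space` parameter are choices made here, and whitespace is only written where an element has no real text of its own.

```python
import xml.etree.ElementTree as ET


def indent(elem, level=0, space="  "):
    """Set text/tail in place so nested elements are indented by depth."""
    pad = "\n" + level * space
    children = list(elem)
    if not children:
        return
    # Whitespace before the first child.
    if not elem.text or not elem.text.strip():
        elem.text = pad + space
    for child in children:
        indent(child, level + 1, space)
        # Whitespace before the next sibling.
        if not child.tail or not child.tail.strip():
            child.tail = pad + space
    # The last child is followed by the parent's closing tag,
    # so its tail steps back one level.
    last = children[-1]
    if not last.tail.strip():
        last.tail = pad


root = ET.fromstring(
    '<testcases><mode>PRESSURE_CONTROL</mode>'
    '<invalid><testcase id="5">100.0</testcase></invalid></testcases>'
)
indent(root)
output = ET.tostring(root, encoding="unicode")
print(output)
```

On Python 3.9 and later, the standard library also provides xml.etree.ElementTree.indent(), which does the same job without a hand-written helper.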
According to the ElementTree manual, the dump() function:
Writes an element tree or element structure to sys.stdout. This function should be used for debugging only.
The exact output format is implementation dependent. In this version, it’s written as an ordinary XML file.
But there are some fixes for that on google.

Get some unexpected changes in xml file when use python/elementtree

Here is the original xml file:
<?xml version="1.0" encoding="UTF-8"?>
<TVAMain xml:lang="en-GB" xmlns="urn:tva:metadata:2010" xmlns:tva2="urn:tva:metadata:extended:2010" xmlns:yv="http://refdata.youview.com/schemas/Metadata/2012-10-16" xmlns:mpeg7="urn:tva:mpeg7:2008" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://refdata.youview.com/schemas/Metadata/2012-09-26 ../schemas/youview_metadata_2012-09-26.xsd">
<!-- -->
<ProgramDescription> .............................
I changed some of the content of the XML (but not the part I posted here; those lines should be unchanged), then wrote it to a new XML file, but the new file's content became this:
<?xml version='1.0' encoding='UTF-8'?>
<TVAMain xmlns="urn:tva:metadata:2010" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://refdata.youview.com/schemas/Metadata/2012-09-26 ../schemas/youview_metadata_2012-09-26.xsd" xml:lang="en-GB">
<ProgramDescription>....................
You can see that some content is lost and the order has also changed. What should I do to avoid any changes to the XML?
Attributes on XML tags do not have a fixed order, changing their ordering doesn't change their meaning.
ElementTree will only write out namespace qualifiers for namespaces actually in use. Your example is very brief, but I suspect it doesn't make use of the yv and mpeg7 namespaces at all.
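The effect is easy to reproduce with a small document. The sketch below uses made-up namespace URIs (`urn:example:...`) purely for illustration: one declared namespace is used by an element, the other is never used, and only the former survives serialization.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: "unused" is declared but never referenced.
SOURCE = (
    '<root xmlns="urn:example:main" '
    'xmlns:unused="urn:example:unused" '
    'xmlns:used="urn:example:used">'
    '<used:item/>'
    '</root>'
)

# Keep the original prefixes for the namespaces that are in use.
ET.register_namespace("", "urn:example:main")
ET.register_namespace("used", "urn:example:used")

root = ET.fromstring(SOURCE)
output = ET.tostring(root, encoding="unicode")
print(output)
```

If the unused declarations must be kept byte for byte, a parser that preserves them (lxml, as suggested in the first question above) is probably the more practical route.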

Writing XML files with Python

I have created an XML document with the following contents.
<books>
<book id="1">
<title>Title01</title>
<authors/>
<pages>
<page>Page01</page>
<page>Page02</page>
<page>Page03</page>
<page>Page04</page>
<page>Page05</page>
</pages>
</book>
<book id="2">
<title>Title02</title>
<authors/>
<pages>
<page>Page01</page>
<page>Page02</page>
<page>Page03</page>
<page>Page04</page>
<page>Page05</page>
</pages>
</book>
</books>
I then use a Python script to split up and write the individual books into separate files; however, the resulting files are not XML files because they do not have the XML declaration. Is there a way of creating XML files in Python?
The idea is to ensure that each file has the XML declaration as shown below.
<?xml version="1.0" encoding="ISO-8859-1" ?>
<book id="1">
<title>Title01</title>
<authors/>
<pages>
<page>Page01</page>
<page>Page02</page>
<page>Page03</page>
<page>Page04</page>
<page>Page05</page>
</pages>
</book>
Why don't you write the xml declaration to each book file before you write the book entry?
Encode your files in UTF-8 instead of some legacy encoding like ISO-8859-1. Then you don't need an XML declaration.
You should look into the xml.etree.ElementTree module. The link is for Python 3, but it was included way before that. I use it in Python 2.5, so you should be ok.
Also, I have had good results with xml.dom.minidom. Once you have built a Document (by adding elements with createElement('ELEM_NAME')), you just write it to a stream with mydoc.toprettyxml().
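With xml.etree.ElementTree, the split and the declaration can be handled in one step, since write() accepts an xml_declaration flag. A minimal sketch, assuming a shortened version of the books document; the helper name `split_books` and the `book_<id>.xml` file names are hypothetical.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Shortened, hypothetical version of the books document.
BOOKS = """<books>
<book id="1"><title>Title01</title><authors/></book>
<book id="2"><title>Title02</title><authors/></book>
</books>"""


def split_books(xml_text, out_dir):
    """Write every <book> child to its own file and return the paths."""
    paths = []
    for book in ET.fromstring(xml_text).findall("book"):
        book.tail = "\n"  # replace whitespace inherited from the parent document
        path = os.path.join(out_dir, "book_%s.xml" % book.get("id"))
        # xml_declaration=True forces the <?xml ...?> line in each file.
        ET.ElementTree(book).write(
            path, encoding="ISO-8859-1", xml_declaration=True
        )
        paths.append(path)
    return paths


out_dir = tempfile.mkdtemp()
paths = split_books(BOOKS, out_dir)
with open(paths[0], "rb") as handle:
    first_file = handle.read()
print(first_file.decode("ISO-8859-1"))
```

ElementTree writes the declaration with single quotes, which is equivalent to the double-quoted form for any XML parser.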
