Issues when writing an xml file using xml.dom.minidom python - python

I have an XML file, and a Python script adds a new node to it. I used the xml.dom.minidom module to process the XML file. After processing with the Python module, my XML file looks like this:
<?xml version="1.0" ?><Project DefaultTargets="Build" ToolsVersion="4.0" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
<PostBuildEvent>
<Command>xcopy &quot;SourceLoc&quot; &quot;DestLoc&quot;</Command>
</PostBuildEvent>
<ImportGroup Label="ExtensionTargets">
</ImportGroup>
<Import Project="project.targets"/></Project>
What I actually need is shown below. The differences are a newline after the first line and before the last line, and '&quot;' converted to ".
<?xml version="1.0" ?>
<Project DefaultTargets="Build" ToolsVersion="4.0" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
<PostBuildEvent>
<Command>xcopy "SourceLoc" "DestLoc"</Command>
</PostBuildEvent>
<ImportGroup Label="ExtensionTargets">
</ImportGroup>
<Import Project="project.targets"/>
</Project>
The Python code I used is given below:
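import xml.dom.minidom

xmltree = xml.dom.minidom.parse(xmlFile)
Project = xmltree.documentElement  # the <Project> root element
# Build the new <Import> node and append it to the root.
newImport = xmltree.createElement("Import")
newImport.setAttribute("Project", "project.targets")
Project.appendChild(newImport)
with open(VcxProjFile, 'w') as out:
    xmltree.writexml(out)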
What should I update in my code to get the XML in the correct format?
Thanks,

From docs of minidom:
Node.toprettyxml([indent=""[, newl=""[, encoding=""]]])
Return a pretty-printed version of the document. indent specifies the indentation string and defaults to a tabulator; newl specifies the string emitted at the end of each line and defaults to \n.
That's all the customisation you get from minidom.
I tried inserting a Text node as a sibling of the root element to get the newline, but without success.
I recommend using regular expressions from the re module and inserting the newlines manually.
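A minimal sketch of that manual approach, assuming the document is serialized with toxml() first. The helper name add_outer_newlines is made up for illustration and is not part of minidom; it only handles the two newlines the question asks for (after the declaration and before the root's closing tag):

```python
import re
import xml.dom.minidom


def add_outer_newlines(doc):
    """Serialize a minidom Document with a newline after the XML
    declaration and before the root element's closing tag.

    Hypothetical helper, not part of minidom.
    """
    text = doc.toxml()
    # minidom writes the declaration and the root tag on one line.
    text = re.sub(r"^(<\?xml[^>]*\?>)(?!\n)", r"\1\n", text, count=1)
    # Put the closing root tag on its own line unless it already is.
    closing = "</%s>" % doc.documentElement.tagName
    if text.endswith(closing) and not text.endswith("\n" + closing):
        text = text[:-len(closing)] + "\n" + closing
    return text


doc = xml.dom.minidom.parseString(
    '<Project><Import Project="project.targets"/></Project>'
)
print(add_outer_newlines(doc))
```

The string returned by the helper can then be written to the project file with an ordinary file write instead of writexml().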
As for removing SGML entities, there's apparently an undocumented function for that in the Python standard library:
import HTMLParser
h = HTMLParser.HTMLParser()
unicode_string = h.unescape(string_with_entities)
Alternatively, you can do this manually, again using re, as all named entities and their corresponding code points are in the htmlentitydefs module (html.entities in Python 3).
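In Python 3 the same operation is documented as html.unescape() (available since Python 3.4). A small sketch using the command string from the question. Note that unescaping a whole serialized document this way would also turn &lt; and &amp; back into markup characters, so it is only safe on text that cannot contain those:

```python
import html

# The escaped command text as minidom may write it.
escaped = 'xcopy &quot;SourceLoc&quot; &quot;DestLoc&quot;'
unescaped = html.unescape(escaped)
print(unescaped)
```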

Related

Checking if XML declaration is present

I am trying to check whether an xml file contains the necessary xml declaration ("header"), let's say:
<?xml version="1.0" encoding="UTF-8"?>
...rest of xml file...
I am using xml ElementTree for reading and getting info out of the file, but it seems to load a file just fine even if it does not have the header.
What I tried so far is this:
import re
import sys
import xml.etree.ElementTree as ET

tree = ET.parse(someXmlFile)
try:
    xmlFile = ET.tostring(tree.getroot(), encoding='utf8').decode('utf8')
except Exception:
    sys.stderr.write("Wrong xml2 header\n")
    sys.exit(31)
if re.match(r"^\s*<\?xml version=\'1\.0\' encoding=\'utf8\'\?>\s+", xmlFile) is None:
    sys.stderr.write("Wrong xml1 header\n")
    sys.exit(31)
But the ET.tostring() function just "makes up" a header if it is not present in the file.
Is there any way to check for an XML header with ET? Or to somehow throw an error while loading the file with ET.parse if the file does not contain the XML header?
tl;dr
from xml.dom.minidom import parseString

def has_xml_declaration(xml):
    # version is None when the document has no XML declaration
    return parseString(xml).version
From Wikipedia's entry on the XML declaration:
If an XML document lacks encoding specification, an XML parser assumes
that the encoding is UTF-8 or UTF-16, unless the encoding has already
been determined by a higher protocol.
...
The declaration may be optionally omitted because it declares as its
encoding the default encoding. However, if the document instead makes
use of XML 1.1 or another character encoding, a declaration is
necessary. Internet Explorer prior to version 7 enters quirks mode, if
it encounters an XML declaration in a document served as text/html
So even if the XML declaration is omitted in an XML document, the code-snippet:
if re.match(r"^<\?xml\s*version=\'1\.0\' encoding=\'utf8\'\s*\?>", xmlFile.decode('utf-8')) is None:
will find "the" default XML declaration in this XML document. Note that I have used xmlFile.decode('utf-8') instead of xmlFile.
If you don't mind using minidom, you can use the following code snippet:
from xml.dom.minidom import parse
dom = parse('bookstore-003.xml')
print('<?xml version="{}" encoding="{}"?>'.format(dom.version, dom.encoding))
In bookstore-001.xml an XML declaration is present, in bookstore-002.xml no XML declaration is present, and in bookstore-003.xml a different XML declaration than in the first example is present. The print statement prints the version and the encoding accordingly:
<?xml version="1.0" encoding="UTF-8"?>
<?xml version="None" encoding="None"?>
<?xml version="1.0" encoding="ISO-8859-1"?>
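The same check can be sketched without files, assuming the XML is available as bytes. The function name declaration_info and the sample documents are made up for illustration; the only minidom features used are the version and encoding attributes shown above:

```python
from xml.dom.minidom import parseString


def declaration_info(xml_bytes):
    """Return (version, encoding) from the XML declaration, or
    (None, None) when the document has no declaration."""
    dom = parseString(xml_bytes)
    return dom.version, dom.encoding


with_decl = b'<?xml version="1.0" encoding="UTF-8"?><bookstore/>'
without_decl = b'<bookstore/>'

print(declaration_info(with_decl))
print(declaration_info(without_decl))
```

A caller that wants to reject files without a declaration can test whether the first element of the returned tuple is None.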

Pretty printing XML with a core install of Python 3.2

I've looked at: Pretty printing XML in Python
This doesn't suit my needs because, for various reasons, I've got a base install of Python 3.2 on Windows and no ability to add modules (user privileges).
The suggested solution is:
import xml.dom.minidom
xml = xml.dom.minidom.parse(xml_fname)
pretty_xml_as_string = xml.toprettyxml()
This turns:
<root>
<node>content</node>
<node attrib="value">more content</node>
</root>
Into:
<?xml version="1.0" ?>
<root>
<node>content</node>
<node attrib="value">more content</node>
</root>
It does work if you've got:
<root><node>content</node><node attrib="value">more content</node></root>
So I would assume it's preserving linefeeds but adding extra indentation/whitespace. I appreciate the answer is probably 'use a module', but what I'm looking for is something that can turn arbitrarily formatted XML into consistently pretty-printed XML
(using a core install module if at all possible).
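One standard-library-only approach that is sometimes used for this: drop the whitespace-only text nodes before calling toprettyxml(), so that existing line breaks and indentation no longer influence the result. A minimal sketch; the helper names are made up, and whitespace-only text inside mixed content would be lost as well:

```python
import xml.dom.minidom


def strip_whitespace_nodes(node):
    """Recursively remove text nodes that contain only whitespace."""
    for child in list(node.childNodes):
        if child.nodeType == child.TEXT_NODE and not child.data.strip():
            node.removeChild(child)
        else:
            strip_whitespace_nodes(child)


def pretty(xml_text):
    """Pretty-print XML regardless of how the input was laid out."""
    dom = xml.dom.minidom.parseString(xml_text)
    strip_whitespace_nodes(dom)
    return dom.toprettyxml(indent="  ")


compact = '<root><node>content</node><node attrib="value">more content</node></root>'
spread = """<root>
    <node>content</node>
    <node attrib="value">more content</node>
</root>"""

print(pretty(spread))
```

Both inputs produce the same DOM once the whitespace nodes are gone, so the pretty-printed output no longer depends on the input layout.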

copying input xml file and write exactly with Python

Input xml file:
<?xml version="1.0"?>
<res:testcases xmlns:res="urn:testcases" id="a1e4bfdb-40a2-485c-a1ac-54d220056dd5" type="MODEL">
<mode>PRESSURE_CONTROL</mode>
<category>ADULT</category>
<testcase id="1" type="UNIQUE">
<parameter id="PEEP" value="1.0">true</parameter>
<parameter id="CMV_FREQ" value="4.0">true</parameter>
<parameter id="PRESS_ABOVE_PEEP" value="0.0">true</parameter>
<parameter id="I_E_RATIO" value="0.1">false</parameter>
</testcase>
</res:testcases>
Python Code:
import xml.etree.ElementTree as ET
tree = ET.parse('/home/AlAhAb65/Desktop/input.xml')
root = tree.getroot()
root.attrib['type'] = 'AVA'
tree.write('/home/AlAhAb65/Desktop/output1.xml')
Output xml file:
<ns0:testcases id="a1e4bfdb-40a2-485c-a1ac-54d220056dd5" type="AVA" xmlns:ns0="urn:testcases">
<mode>PRESSURE_CONTROL</mode>
<category>ADULT</category>
<testcase id="1" type="UNIQUE">
<parameter id="PEEP" value="1.0">true</parameter>
<parameter id="CMV_FREQ" value="4.0">true</parameter>
<parameter id="PRESS_ABOVE_PEEP" value="0.0">true</parameter>
<parameter id="I_E_RATIO" value="0.1">false</parameter>
</testcase>
</ns0:testcases>
The problem is that when I copy and write the output XML file, three unexpected things happen:
1. The first line of the input XML file is removed automatically.
2. In the second line of the input, the prefix 'res' is replaced with 'ns0'. The same happens in the closing tag.
3. The order of the attributes in the second line of the input is changed.
But I want the output to be an exact copy of the XML file that I got as input. Please help me in this regard.
The W3C has defined a Canonical XML (C14N) standard. Documents written in this format can be faithfully round-tripped by any C14N-compliant toolchain.
In the case of lxml.etree (a more capable implementation of the ElementTree API with C14N support), this means that you need to do two things:
1. Convert your original input document into C14N form.
2. Use the ElementTree.write_c14n() call to generate your output document.
A C14N-form version of your input file will look like so (generated by the xmlstarlet c14n command):
<res:testcases xmlns:res="urn:testcases" id="a1e4bfdb-40a2-485c-a1ac-54d220056dd5" type="MODEL">
<mode>PRESSURE_CONTROL</mode>
<category>ADULT</category>
<testcase id="1" type="UNIQUE">
<parameter id="PEEP" value="1.0">true</parameter>
<parameter id="CMV_FREQ" value="4.0">true</parameter>
<parameter id="PRESS_ABOVE_PEEP" value="0.0">true</parameter>
<parameter id="I_E_RATIO" value="0.1">false</parameter>
</testcase>
</res:testcases>
...and an appropriately modified version of your code:
#!/usr/bin/env python
import lxml.etree
tree = lxml.etree.parse('input.xml')
root = tree.getroot()
root.attrib['type'] = 'AVA'
tree.write_c14n('output1.xml')
If you add an XML declaration (the <?xml version="1.0"?> line), you will be noncompliant with the C14N standard. As such, this is something you absolutely should not do. If you really, really want to do this wrongheaded thing...
Don't.
But if you must, you'd do it like so:
# write_c14n() emits bytes, so open the file in binary mode.
with open('output1.xml', 'wb') as outfile:
    outfile.write(b'<?xml version="1.0"?>\n')
    tree.write_c14n(outfile)
From the documentation page, the XML declaration can be added like this:
tree.write('/home/AlAhAb65/Desktop/output1.xml', xml_declaration=True)
You should also add the encoding because the default one is us-ascii:
tree.write('/home/AlAhAb65/Desktop/output1.xml', encoding='utf-8', xml_declaration=True)
Or you can retrieve the encoding from the original file, but in any case you will get a different XML declaration, probably something like this:
<?xml version="1.0" encoding="UTF-8"?>
Or you can add the XML declaration manually. In any case, a slight declaration mismatch should not be a problem for any robust XML parser, as long as the declared encoding matches the real encoding.
Attribute order is not significant in XML, so that information is probably lost when the file is parsed by the API. There is probably no simple way to preserve it when processing the file through the standard ElementTree API. You would probably be better off using lxml's C14N support if you only want to make minor changes to the file.
The namespace prefixes are changed by default in ElementTree. To prevent this behavior, you can switch to lxml, which seems to preserve namespace prefixes by default:
Because etree is built on top of libxml2, which is namespace prefix aware, etree preserves namespaces declarations and prefixes while ElementTree tends to come up with its own prefixes (ns0, ns1, etc). When no namespace prefix is given, however, etree creates ElementTree style prefixes as well.
Switching to lxml is a good idea in any case, but the changes you observe should not be a problem if the program reading the file at the other end is sufficiently XML compliant. Unfortunately, a lot of XPath processors have issues with namespace prefix changes...
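If you stay with the standard library, xml.etree.ElementTree.register_namespace() can keep the original prefix, provided the prefix and URI are known in advance. A minimal sketch using a shortened version of the input (the shortened document is an assumption made for brevity); it addresses only the 'ns0' prefix, not the declaration or byte-for-byte fidelity:

```python
import xml.etree.ElementTree as ET

# Tell ElementTree which prefix to use for this namespace URI when
# serializing; without this it falls back to ns0, ns1, ...
ET.register_namespace("res", "urn:testcases")

source = (
    '<res:testcases xmlns:res="urn:testcases" type="MODEL">'
    "<mode>PRESSURE_CONTROL</mode>"
    "</res:testcases>"
)
root = ET.fromstring(source)
root.set("type", "AVA")

out = ET.tostring(root, encoding="unicode")
print(out)
```

The registration is global to the ElementTree module, so it should be done once before any serialization that needs the prefix.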

Parse Solr output in python

I am trying to parse a solr output of the form:
<doc>
<str name="source">source:A</str>
<str name="url">URL:A</str>
<date name="p_date">2012-09-08T10:02:01Z</date>
</doc>
<doc>
<str name="source">source:B</str>
<str name="url">URL:B</str>
<date name="p_date">2012-08-08T11:02:01Z</date>
</doc>
I am keen on using Beautiful Soup (versions that have BeautifulStoneSoup; I think prior to BS4) for parsing the docs.
I have used Beautiful Soup for HTML parsing, but somehow I am not able to find an efficient way to extract the contents of the tags.
I have written:
for tags in soup('doc'):
    print tags.renderContents()
I sense that I could force my way to the output (say, by 'soup'ing it again), but I would appreciate an efficient solution to extract the data.
My output required is:
source:A
URL:A
2012-09-08T10:02:01Z
source:B
URL:B
2012-08-08T11:02:01Z
Thanks
Use an XML parser for this task instead; xml.etree.ElementTree is included with Python:
import urllib2
from xml.etree import ElementTree as ET

# `ET.fromstring()` expects a string containing XML to parse.
# tree = ET.fromstring(solrdata)
# Use `ET.parse()` for a filename or open file object, such as returned by urllib2:
tree = ET.parse(urllib2.urlopen(url))
for doc in tree.findall('.//doc'):
    for elem in doc:
        print elem.attrib['name'], elem.text
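A self-contained Python 3 variant of the same idea, as a sketch. It assumes the <doc> elements sit inside a single root element (named result here purely for illustration), since an XML document needs exactly one root:

```python
import xml.etree.ElementTree as ET

# Sample data from the question, wrapped in an assumed <result> root.
solrdata = """<result>
<doc>
<str name="source">source:A</str>
<str name="url">URL:A</str>
<date name="p_date">2012-09-08T10:02:01Z</date>
</doc>
<doc>
<str name="source">source:B</str>
<str name="url">URL:B</str>
<date name="p_date">2012-08-08T11:02:01Z</date>
</doc>
</result>"""

root = ET.fromstring(solrdata)
# Collect the text of every child element of every <doc>.
values = [elem.text for doc in root.findall(".//doc") for elem in doc]
for value in values:
    print(value)
```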
Do you have to use this particular output format? Solr supports a Python output format out of the box (in version 4 at least); just use wt=python in the query.

Writing XML files with Python

I have created an XML document with the following contents.
<books>
<book id="1">
<title>Title01</title>
<authors/>
<pages>
<page>Page01</page>
<page>Page02</page>
<page>Page03</page>
<page>Page04</page>
<page>Page05</page>
</pages>
</book>
<book id="2">
<title>Title02</title>
<authors/>
<pages>
<page>Page01</page>
<page>Page02</page>
<page>Page03</page>
<page>Page04</page>
<page>Page05</page>
</pages>
</book>
</books>
I then use a Python script to split up and write the individual books into separate files; however, the resulting files are not XML files because they do not have the XML declaration. Is there a way of creating XML files in Python?
The idea is to ensure that each file has the XML declaration, as shown below.
<?xml version="1.0" encoding="ISO-8859-1" ?>
<book id="1">
<title>Title01</title>
<authors/>
<pages>
<page>Page01</page>
<page>Page02</page>
<page>Page03</page>
<page>Page04</page>
<page>Page05</page>
</pages>
</book>
Why don't you write the xml declaration to each book file before you write the book entry?
Encode your files in UTF-8 instead of some legacy encoding like ISO-8859-1. Then you don't need an XML declaration.
You should look into the xml.etree.ElementTree module. The link is for Python 3, but it was included way before that. I use it in Python 2.5, so you should be ok.
Also, I have had good results with xml.dom.minidom. Once you have built a Document (by adding elements with createElement('ELEM_NAME')), you can serialize it with mydoc.toprettyxml() and write the result to a stream.
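A minimal sketch of the ElementTree route, assuming a shortened version of the books document. The function name split_books is made up, and the output is kept in memory instead of being written to files so the example stays self-contained:

```python
import io
import xml.etree.ElementTree as ET

books_xml = """<books>
<book id="1"><title>Title01</title><authors/></book>
<book id="2"><title>Title02</title><authors/></book>
</books>"""


def split_books(xml_text, encoding="ISO-8859-1"):
    """Return {book id: bytes}, one XML document per <book>,
    each starting with an XML declaration."""
    result = {}
    for book in ET.fromstring(xml_text).findall("book"):
        buf = io.BytesIO()
        # xml_declaration=True makes write() emit the <?xml ...?> line.
        ET.ElementTree(book).write(buf, encoding=encoding, xml_declaration=True)
        result[book.get("id")] = buf.getvalue()
    return result


docs = split_books(books_xml)
print(docs["1"].decode("ISO-8859-1"))
```

To produce files, the BytesIO buffer can be replaced by a file name or a file object opened in binary mode.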
