Modify XML declaration with python - python

I have an XML document for which I need to add a couple of things to the XML declaration using minidom. The declaration looks like this:
<?xml version="1.0"?>
And I need it to look like this:
<?xml version="1.0" encoding="UTF-16" standalone="no"?>
I know how to change or add attributes using minidom, which will not work here.
What is the easiest way of doing this? For reference, I am running python 3.3.3.

I'm not sure if this can be done with minidom. But you could try lxml.
from lxml import etree
tree = etree.parse("test.xml")
string = etree.tostring(tree.getroot(), pretty_print = True, xml_declaration = True, standalone = False, encoding = "UTF-16")
with open("test2.xml", "wb") as f:
f.write(string)
More or less taken from here.

Related

How to add comment after XML declaration using python

import xml.etree.ElementTree as ET
def addCommentInXml():
fileXml ='C:\\Users\\Documents\\config.xml'
tree = ET.parse(fileXml)
root = tree.getroot()
comment = ET.Comment('TEST')
root.insert(1, comment) # 1 is the index where comment is inserted
tree.write(fileXml, encoding='UTF-8', xml_declaration=True)
print("Done")
It is updating xml as below,Please suggest how to add right after xml declaration line:
<?xml version='1.0' encoding='UTF-8'?>
<ScopeConfig Checksum="5846AFCF4E5D02786">
<ExecutableName>STU</ExecutableName>
<!--TEST--><ZoomT2Encoder>-2230</ZoomT2Encoder>
The ElementTree XML API does not allow this. The documentation for the Comment factory function explicitly states:
An ElementTree will only contain comment nodes if they have been
inserted into to the tree using one of the Element methods.
but you would like to insert a comment outside the tree. The documentation for the TreeBuilder class is even more explicit:
When insert_comments and/or insert_pis is true, comments/pis will be
inserted into the tree if they appear within the root element (but not
outside of it)
So I would suggest writing out the XML file without the comment, using this API, and then reading the file as plain text (not parsed XML) to add your comment after the first line.

Writing XML formatted text to output file in Python

I am having issues while writing the below XML to output file.
<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet href="CoreNLP-to-HTML.xsl" type="text/xsl"?>
<root>
<document>
<sentences>
<sentence id="1">
<tokens>
<token id="1">
<word>
Pusheen
</word>
<CharacterOffsetBegin>
0
</CharacterOffsetBegin>
<CharacterOffsetEnd>
7
</CharacterOffsetEnd>
<POS>
NNP
</POS>
</token>
</tokens>
</sentence>
</sentences>
</document>
</root>
How to write this to output file in xml format? I tried using below write statement
tree.write(open('person.xml', 'w'), encoding='unicode').
But, I am getting the below error
AttributeError: 'str' object has no attribute 'write'
I don't have to build XML here as I already have the data in XML format. I just need it to write it to a XML file.
Assuming that tree is your XML, it is a string. You probably want something like:
with open("person.xml", "w", encoding="unicode") as outfile:
outfile.write(tree)
(It is good practice to use with for files; it automatically closes them after)
The error is caused by the fact that, since tree is a string, you can't write to it.
I recommend using the lxml module to check the format first and then write it to a file. I notice that you've got two elements with the same id, which caught my eye. It doesn't flag an error in XML, but it could cause trouble on an HTML page, where each id is supposed to be unique.
Here's the simple code to do what I described above:
from lxml import etree
try:
root = etree.fromstring(your_xml_data) # checks XML formatting, returns Element if good
if root is not None:
tree = etree.ElementTree(root) # convert the Element to ElementTree
tree.write('person.xml') # we needed the ElementTree for writing the file
except:
'Oops!'

How to remove xml header in beautifulsoup?

I have imported and modified some xml, but when I write out my xml using test.prettify(). It changes the top line of the xml from
<?xml version="1.0"?>
to
<?xml version="1.0" encoding="utf-8"?>
I don't want this change. How can I just keep the first line unchanged? What is the easiest way to do this?
If it matters, I'm using the xml parser.
soup = BeautifulSoup(r.text,'xml')
I'm sure there's a more elegant way to do this using BeautifulSoup's built-ins, but based on your comment, I'll give you the "strip it out" version:
xml_string = '<?xml version="1.0" encoding="utf-8"?>'
print xml_string[:xml_string.find("encoding")-1] + "?>"
This is general enough to strip out any encoding from the header (not just utf-8).
You could find the xml and use replaceWith() to replace it with the value you want.

lxml parsing with python: how to with objectify

I am trying to read xml behind an spss file, I would like to move from etree to objectify.
How can I convert this function below to return an objectify object? I would like to do this because objectify xml object would be easier for me (as a newbie) to work with as it is more pythonic.
def get_etree(path_file):
from lxml import etree
with open(path_file, 'r+') as f:
xml_text = f.read()
recovering_parser = etree.XMLParser(recover=True)
xml = etree.parse(StringIO(xml_text), parser=recovering_parser)
return xml
my failed attempt:
def get_etree(path_file):
from lxml import etree, objectify
with open(path_file, 'r+') as f:
xml_text = objectify.fromstring(xml)
return xml
but I get this error:
lxml.etree.XMLSyntaxError: xmlns:mdm: 'http://www.spss.com/mr/dm/metadatamodel/Arc 3/2000-02-04' is not a valid URI
The first, biggest mistake is to read a file into a string and feed that string to an XML parser.
Python will read the file as whatever your default file encoding is (unless you specify the encoding when you call read()), and that step will very likely break anything other than plain ASCII files.
XML files come in many encodings, you cannot predict them, and you really shouldn't make assumptions about them. XML files solve that problem with the XML declaration.
<?xml version="1.0" encoding="Windows-1252"?>
An XML parser will read that bit of information and configure itself correctly before reading the rest of the file. Make use of that facility. Never use open() and read() for XML files.
Luckily lxml makes it very easy:
from lxml import etree, objectify
def get_etree(path_file):
return etree.parse(path_file, parser=etree.XMLParser(recover=True))
def get_objectify(path_file):
return objectify.parse(path_file)
and
path = r"/path/to/your.xml"
xml1 = get_etree(path)
xml2 = get_objectify(path)
print xml1 # -> <lxml.etree._ElementTree object at 0x02A7B918>
print xml2 # -> <lxml.etree._ElementTree object at 0x02A7B878>
P.S.: Think hard if you really, positively must use a recovering parser. An XML file is a data structure. If it is broken (syntactically invalid, incomplete, wrongly decoded, you name it), would you really want to trust the (by definition undefined) result of an attempt to read it anyway or would you much rather reject it and display an error message?
I would do the latter. Using a recovering parser may cause nasty run-time errors later.

creating xml documents with whitespace with xml.etree.cElementTree

I'm working on a project to store various bits of text in xml files, but because people besides me are going to look at it and use it, it has to be properly indented and such. I looked at a question on how to generate xml files using cElement Tree here, and the guy says something about putting in info about making things pretty if people ask, but there isn't anything there (I guess because no one asked). So basically, is there a way to properly indent and whitespace using cElementTree, or should i just throw up my hands and go learn how to use lxml.
You can use minidom to prettify our xml string:
from xml.etree import ElementTree as ET
from xml.dom import minidom
# Return a pretty-printed XML string for the Element.
def prettify(xmlStr):
INDENT = " "
rough_string = ET.tostring(xmlStr, 'utf-8')
reparsed = minidom.parseString(rough_string)
return reparsed.toprettyxml(indent=INDENT)
# name of root tag
root = ET.Element("root")
child = ET.SubElement(root, 'child')
child.text = 'This is text of child'
prettified_xmlStr = prettify(root)
output_file = open("Output.xml", "w")
output_file.write(prettified_xmlStr)
output_file.close()
print("Done!")
Answering myself here:
Not with ElementTree. The best option would be to download and install the module for lxml, then simply enable the option
prettyprint = True
when generating new XML files.

Categories

Resources