I need to compress multiple xml files and I achieved this with lxml, zipfile and a for loop.
My problem is that every time I re run my function the content of the compressed files are repeating (being appended in the end) and getting longer. I believe that it has to do with the writing mode a+b. I thought that by using with open at the end of the code block the files would be deleted and no more content would be added to them. I was wrong and with the other modes I do not get the intended result.
Here is my code:
def compress_package_file(self):
bytes_buffer = BytesIO()
with zipfile.ZipFile(bytes_buffer, 'w') as invoices_package:
i = 1
for invoice in record.invoice_ids.sorted('sin_number'):
invoice_file_name = 'Invoice_' + invoice.number + '.xml'
with open(invoice_file_name, 'a+b') as invoice_file:
invoice_file.write(invoice._get_invoice_xml().getvalue())
invoices_package.write(invoice_file_name, compress_type=zipfile.ZIP_DEFLATED)
i += 1
compressed_package = bytes_buffer.getvalue()
encoded_compressed_file = base64.b64encode(compressed_package)
My xml generator is in another function and works fine. But the content repeats each time I run this function. For example if I run it two times, the content of the files in the compressed file look something like this (simplified content):
<?xml version='1.0' encoding='UTF-8'?>
<invoice xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="invoice.xsd">
<header>
<invoiceNumber>9</invoiceNumber>
</header>
</facturaComputarizadaCompraVenta><?xml version='1.0' encoding='UTF-8'?>
<invoice xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="invoice.xsd">
<header>
<invoiceNumber>9</invoiceNumber>
</header>
</facturaComputarizadaCompraVenta>
If I use w+b mode, the content of the files are blank.
How should my code look like to avoid this behavior?
I suggest you do use w+b mode, but move writing to zipfile after closing the invoice XML file.
From what you wrote it looks as you are trying to compress a file that is not yet flushed to disk, therefore with w+b it is still empty at time of compression.
So, try remove 1 level of indent for invoices_package.write line (I can't format code properly on mobile, so can't post whole section).
For generating resource XML file for ASP.NET, the third-party tool requires BOM (when migrating to a new version of the tool). At the same time, it requires the XML prolog like <?xml version='1.0' encoding='utf-8'?>.
The problem is that when using the ElementTree command...
tree.write(lang_resx_fpath, encoding='utf-8')
the resulting file does not contain BOM. When using the command...
tree.write(lang_resx_fpath, encoding='utf-8-sig')
the result does contain BOM; however, the XML prolog contains encoding='utf-8-sig'.
How should I generate the file to contain both BOM and encoding='utf-8'?
UPDATE:
I have worked around it by reading, replacing, and writing the file again, like this...
with open(lang_resx_fpath, 'r', encoding='utf-8-sig') as f:
content = f.read()
content = content.replace("encoding='utf-8-sig'", "encoding='utf-8'")
with open(lang_resx_fpath, 'w', encoding='utf-8-sig') as f:
f.write(content)
Anyway, is there any cleaner solution?
UPDATE: I have created the https://bugs.python.org/issue46598, and I have also written the fix (https://github.com/python/cpython/pull/31043).
Peek into sources of ElementTree.write shows that prolog is hardcoded there (https://github.com/python/cpython/blob/main/Lib/xml/etree/ElementTree.py or permalink https://github.com/python/cpython/blob/ee0ac328d38a86f7907598c94cb88a97635b32f8/Lib/xml/etree/ElementTree.py). Therefore probably using internals of ET is the only option (other than monkey-pathing module), to write required preamble and keep BOM in the file:
import xml.etree.ElementTree as ET
qnames, namespaces = ET._namespaces(tree._root, None)
with open(lang_resx_fpath,'w',encoding='utf-8-sig') as f:
f.write("<?xml version='1.0' encoding='utf-8'?>\n" )
ET._serialize_xml(f.write,
tree._root, qnames, namespaces,
short_empty_elements=False)
Probably it is not more elegant than your solution (and maybe it is even less elegant). The only advantage is that it does not require writing file twice, which would be minor benefit besides some huge XML files.
I parse a large xml file in python using
tree = ET.parse('test.xml')
#do my manipulation
How do I write back the xml file to disk exactly as I have read it, albeit with my modifications.
<?xml version="1.0" encoding="utf-16"?>
This was the first line of the input xml file
I added tree.write("output.sbp", encoding="utf-16") and now they are of the same size.
<book>
<title>sponge bob</title>
<author>Joe Doe</author>
<file>Tbase</file>
</book>
I have 2 files, one is a xml and the other is a base64 file. I would like to know how to insert and replace the string"Tbase" with the content of the base64 file using python.
Are you wanting to put the verbatim contents of the base64 file (still base64 encoded) into the XML file, in place of "Tbase"? If that's the case, you could just do something like:
xml = open("xmlfile.xml").read()
b64file = open("b64file.base64").read()
open("xmlfile.xml", "w").write(xml.replace("Tbase", b64file))
(If you're on Python 2.6 or later, you can do this a little bit cleaner using with statements, but that's another discussion.)
If you want to decode the base64 file first, and place the decoded contents into the XML file, then you'd replace b64file on the last line of the example above with b64file.decode("base64").
Of course, doing simple text replacement, as above, opens you up to the problems you'll have if, say, the title or author contain "Tbase" as well. A better way would be to use an actual XML parsing library, like so:
from xml.etree.ElementTree import fromstring, tostring
xml = fromstring(open("xmlfile.xml").read())
xml.find("file").text = open("b64file.base64").read()
open("xmlfile.xml", "w").write(tostring(xml))
This sets the contents of the <file> tag to be the contents of the file b64file.base64, regardless of what its former contents were and regardless of whether "Tbase" appears elsewhere in the XML document.
I have an XML document that I would like to update after it already contains data.
I thought about opening the XML file in "a" (append) mode. The problem is that the new data will be written after the root closing tag.
How can I delete the last line of a file, then start writing data from that point, and then close the root tag?
Of course I could read the whole file and do some string manipulations, but I don't think that's the best idea..
Using ElementTree:
import xml.etree.ElementTree
# Open original file
et = xml.etree.ElementTree.parse('file.xml')
# Append new tag: <a x='1' y='abc'>body text</a>
new_tag = xml.etree.ElementTree.SubElement(et.getroot(), 'a')
new_tag.text = 'body text'
new_tag.attrib['x'] = '1' # must be str; cannot be an int
new_tag.attrib['y'] = 'abc'
# Write back to file
#et.write('file.xml')
et.write('file_new.xml')
note: output written to file_new.xml for you to experiment, writing back to file.xml will replace the old content.
IMPORTANT: the ElementTree library stores attributes in a dict, as such, the order in which these attributes are listed in the xml text will NOT be preserved. Instead, they will be output in alphabetical order.
(also, comments are removed. I'm finding this rather annoying)
ie: the xml input text <b y='xxx' x='2'>some body</b> will be output as <b x='2' y='xxx'>some body</b>(after alphabetising the order parameters are defined)
This means when committing the original, and changed files to a revision control system (such as SVN, CSV, ClearCase, etc), a diff between the 2 files may not look pretty.
Useful Python XML parsers:
Minidom - functional but limited
ElementTree - decent performance, more functionality
lxml - high-performance in most cases, high functionality including real xpath support
Any of those is better than trying to update the XML file as strings of text.
What that means to you:
Open your file with an XML parser of your choice, find the node you're interested in, replace the value, serialize the file back out.
The quick and easy way, which you definitely should not do (see below), is to read the whole file into a list of strings using readlines(). I write this in case the quick and easy solution is what you're looking for.
Just open the file using open(), then call the readlines() method. What you'll get is a list of all the strings in the file. Now, you can easily add strings before the last element (just add to the list one element before the last). Finally, you can write these back to the file using writelines().
An example might help:
my_file = open(filename, "r")
lines_of_file = my_file.readlines()
lines_of_file.insert(-1, "This line is added one before the last line")
my_file.writelines(lines_of_file)
The reason you shouldn't be doing this is because, unless you are doing something very quick n' dirty, you should be using an XML parser. This is a library that allows you to work with XML intelligently, using concepts like DOM, trees, and nodes. This is not only the proper way to work with XML, it is also the standard way, making your code both more portable, and easier for other programmers to understand.
Tim's answer mentioned checking out xml.dom.minidom for this purpose, which I think would be a great idea.
While I agree with Tim and Oben Sonne that you should use an XML library, there are ways to still manipulate it as a simple string object.
I likely would not try to use a single file pointer for what you are describing, and instead read the file into memory, edit it, then write it out.:
inFile = open('file.xml', 'r')
data = inFile.readlines()
inFile.close()
# some manipulation on `data`
outFile = open('file.xml', 'w')
outFile.writelines(data)
outFile.close()
For the modification, you could use tag.text from xml. Here is snippet:
import xml.etree.ElementTree as ET
tree = ET.parse('country_data.xml')
root = tree.getroot()
for rank in root.iter('rank'):
new_rank = int(rank.text) + 1
rank.text = str(new_rank)
tree.write('output.xml')
The rank in the code is example of tag, which depending on your XML file contents.
What you really want to do is use an XML parser and append the new elements with the API provided.
Then simply overwrite the file.
The easiest to use would probably be a DOM parser like the one below:
http://docs.python.org/library/xml.dom.minidom.html
To make this process more robust, you could consider using the SAX parser (that way you don't have to hold the whole file in memory), read & write till the end of tree and then start appending.
You should read the XML file using specific XML modules. That way you can edit the XML document in memory and rewrite your changed XML document into the file.
Here is a quick start: http://docs.python.org/library/xml.dom.minidom.html
There are a lot of other XML utilities, which one is best depends on the nature of your XML file and in which way you want to edit it.
As Edan Maor explained, the quick and dirty way to do it (for [utc-16] encoded .xml files), which you should not do for the resons Edam Maor explained, can done with the following python 2.7 code in case time constraints do not allow you to learn (propper) XML parses.
Assuming you want to:
Delete the last line in the original xml file.
Add a line
substitute a line
Close the root tag.
It worked in python 2.7 modifying an .xml file named "b.xml" located in folder "a", where "a" was located in the "working folder" of python. It outputs the new modified file as "c.xml" in folder "a", without yielding encoding errors (for me) in further use outside of python 2.7.
pattern = '<Author>'
subst = ' <Author>' + domain + '\\' + user_name + '</Author>'
line_index =0 #set line count to 0 before starting
file = io.open('a/b.xml', 'r', encoding='utf-16')
lines = file.readlines()
outFile = open('a/c.xml', 'w')
for line in lines[0:len(lines)]:
line_index =line_index +1
if line_index == len(lines):
#1. & 2. delete last line and adding another line in its place not writing it
outFile.writelines("Write extra line here" + '\n')
# 4. Close root tag:
outFile.writelines("</phonebook>") # as in:
#http://tizag.com/xmlTutorial/xmldocument.php
else:
#3. Substitue a line if it finds the following substring in a line:
pattern = '<Author>'
subst = ' <Author>' + domain + '\\' + user_name + '</Author>'
if pattern in line:
line = subst
print line
outFile.writelines(line)#just writing/copying all the lines from the original xml except for the last.