I need to compress multiple XML files, and I achieved this with lxml, zipfile and a for loop.
My problem is that every time I re-run my function, the content of the compressed files is repeated (appended at the end) and the files keep growing. I believe it has to do with the write mode a+b. I thought that by using with open, the files would be deleted at the end of the code block and no more content would be added to them. I was wrong, and with the other modes I do not get the intended result.
Here is my code:
def compress_package_file(self):
    bytes_buffer = BytesIO()
    with zipfile.ZipFile(bytes_buffer, 'w') as invoices_package:
        i = 1
        for invoice in record.invoice_ids.sorted('sin_number'):
            invoice_file_name = 'Invoice_' + invoice.number + '.xml'
            with open(invoice_file_name, 'a+b') as invoice_file:
                invoice_file.write(invoice._get_invoice_xml().getvalue())
                invoices_package.write(invoice_file_name, compress_type=zipfile.ZIP_DEFLATED)
            i += 1
    compressed_package = bytes_buffer.getvalue()
    encoded_compressed_file = base64.b64encode(compressed_package)
My xml generator is in another function and works fine. But the content repeats each time I run this function. For example if I run it two times, the content of the files in the compressed file look something like this (simplified content):
<?xml version='1.0' encoding='UTF-8'?>
<invoice xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="invoice.xsd">
<header>
<invoiceNumber>9</invoiceNumber>
</header>
</facturaComputarizadaCompraVenta><?xml version='1.0' encoding='UTF-8'?>
<invoice xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="invoice.xsd">
<header>
<invoiceNumber>9</invoiceNumber>
</header>
</facturaComputarizadaCompraVenta>
If I use w+b mode, the files inside the archive are empty.
What should my code look like to avoid this behavior?
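I suggest you do use w+b mode, but move the write to the zip file so that it happens after the invoice XML file has been closed.
From what you wrote, it looks as though you are trying to compress a file that has not yet been flushed to disk, so with w+b it is still empty at the time of compression. With a+b the same thing happens, except that the file still holds what earlier runs left on disk, which is why the content keeps growing.
So try removing one level of indentation from the invoices_package.write line, so that it runs after the with open block has closed the file. (I can't format code properly on mobile, so I can't post the whole section.)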
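A minimal sketch of an alternative that sidesteps the temporary files entirely, assuming the XML for each invoice is already available as bytes. The function name build_invoice_package and the invoices mapping are hypothetical stand-ins for the records in the question. It uses ZipFile.writestr, which adds a member straight from memory, so nothing is left on disk to accumulate between runs. This is a swapped-in technique, not the asker's original file-based approach.

```python
import base64
import zipfile
from io import BytesIO


def build_invoice_package(invoices):
    """Zip a {invoice number: XML bytes} mapping in memory; return base64 bytes.

    `invoices` is a hypothetical stand-in for the invoice records.
    """
    bytes_buffer = BytesIO()
    with zipfile.ZipFile(bytes_buffer, "w", compression=zipfile.ZIP_DEFLATED) as package:
        for number, xml_bytes in sorted(invoices.items()):
            # writestr stores the bytes directly; no file is opened on disk
            package.writestr("Invoice_" + number + ".xml", xml_bytes)
    # The archive is complete only after the with block has closed it
    return base64.b64encode(bytes_buffer.getvalue())


sample = {"9": b"<invoice><invoiceNumber>9</invoiceNumber></invoice>"}
first = build_invoice_package(sample)
second = build_invoice_package(sample)  # a second run does not grow the content
```

If the XML files really are needed on disk, the same idea applies to the original code: open each file in w+b (or wb) mode and call invoices_package.write only after the inner with block has ended.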
I have a section of Python (Sigil) code:
for (id, href) in bk.text_iter():
    html = bk.readfile(id)
    html = re.sub(r'<title></title>', '<title>Mara’s Tale</title>', html)
    html = re.sub(r'<p>Mara’s Tale</p>', '<p class="title">Mara’s Tale</p>',html)
bk.writefile(id, html)
Ideally, I'd like to read the regular expressions in from an external text-file (or just read in that block of code). Any suggestions? I've done similar in Perl with a try, but I'm a Python-novice.
Also, quick supplementary question - shouldn't bk.writefile be indented? And, if so, why is my code working? It looks as though it's outside the for block, and therefore will only write to the final file, if that (it's an epub, so there are several html files), but it's updating all relevant files.
Regarding bk, my understanding is that this object is the whole epub, and what this code is doing is reading each html file that makes up an epub via text_iter, so id is each individual file.
EDIT TO ADD
Ah! That bk.writefile should indeed be indented. I got away with it because, at the point I run this code, I only have a single html file.
As for reading something from a file, it's easy. Assume you have the file 'my_file.txt' in the same folder where the script is saved:
f = open('my_file.txt', 'r')
content = f.read() # read the whole file into the string 'content'
lines = content.splitlines() # split it into the list 'lines' (a second f.read() would return an empty string)
f.close()
print(lines[0]) # first line
print(lines[1]) # second line
# etc
As for "shouldn't bk.writefile be indented?": yep, it seems the loop reassigns the variable html on every iteration but saves only the value from the last one. It looks weird. Perhaps it should be indented, but that's just a guess.
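To connect this to the original question, here is a rough sketch of loading the regular expressions from an external text file. The file format (one pattern and its replacement per line, separated by a tab, with # for comments) and the helper names load_rules and apply_rules are assumptions made for illustration, not anything Sigil defines.

```python
import os
import re
import tempfile


def load_rules(path):
    """Read (compiled pattern, replacement) pairs from a tab-separated file."""
    rules = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line or line.startswith("#"):
                continue  # skip blank lines and comment lines
            pattern, replacement = line.split("\t", 1)
            rules.append((re.compile(pattern), replacement))
    return rules


def apply_rules(text, rules):
    """Apply every rule in order and return the rewritten text."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text


# Demo with a throwaway rules file (hypothetical name rules.txt)
with tempfile.TemporaryDirectory() as tmp:
    rules_path = os.path.join(tmp, "rules.txt")
    with open(rules_path, "w", encoding="utf-8") as f:
        f.write("# pattern<TAB>replacement\n")
        f.write("<title></title>\t<title>Mara’s Tale</title>\n")
        f.write("<p>Mara’s Tale</p>\t<p class=\"title\">Mara’s Tale</p>\n")
    rules = load_rules(rules_path)

html = "<title></title><p>Mara’s Tale</p>"
result = apply_rules(html, rules)
```

Inside the Sigil loop, the two re.sub lines would then become a single html = apply_rules(html, rules), with bk.writefile(id, html) indented so that it runs for every file.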
I have written code based on a txt file but in production it will be an XML file. I need to be able to iterate over every single character in an XML file as if it were a standard text file.
I have the below as an XML file:
<?xml version="1.0" encoding="UTF-8"?>
-<openBranchData>
<phoneSwMngmtEnabled/>
-<phoneSwMngmtData>
<imgProvisioningEnabled/>
<startTime>02:00</startTime>
<stopTime>06:00</stopTime>
<centralPhoneSwServer/>
<maxParallelAccess>3</maxParallelAccess>
<phoneSwPullingEnabled/>
</phoneSwMngmtData>
<voiceMailEnabled/>
I've tried the below python code.
with open('osb_file.xml') as test:
    testing = test.read()
    for x in testing:
        print(x)
Do I need to convert to a txt file or is there a simpler way? It's been a while since I've looked at this project so apologies if I'm missing anything very obvious.
As Robert Harvey said, XML files are already text files. You can read them just like you read a standard text file in python using open(filename, 'r')
I parse a large xml file in python using
tree = ET.parse('test.xml')
#do my manipulation
How do I write the XML file back to disk exactly as I read it, apart from my modifications?
<?xml version="1.0" encoding="utf-16"?>
This was the first line of the input XML file.
I added tree.write("output.sbp", encoding="utf-16") and now the input and output files are the same size.
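A small sketch of that round trip with the standard library's xml.etree.ElementTree, assuming ET in the question refers to that module. The tiny input document and the edit made to it are invented for the demo. Note that ElementTree is not guaranteed to reproduce the input byte for byte: with the default parser, comments are dropped and details such as the quoting in the declaration can change, so matching the encoding gets the output close rather than identical.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "test.xml")
    dst = os.path.join(tmp, "output.sbp")

    # Create a small UTF-16 input file for the demo (text mode adds the BOM)
    with open(src, "w", encoding="utf-16") as f:
        f.write('<?xml version="1.0" encoding="utf-16"?><root><item>old</item></root>')

    tree = ET.parse(src)
    tree.getroot().find("item").text = "new"  # stand-in for "my manipulation"

    # Write back in the same encoding and keep the XML declaration
    tree.write(dst, encoding="utf-16", xml_declaration=True)

    with open(dst, "rb") as f:
        written = f.read().decode("utf-16")
```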
I am attempting to extract the XML code from a Word document with Python. Here's the code I tried:
def getXml(docxFilename):
    zip = zipfile.ZipFile(open(docxFilename,"rb"))
    xmlString= str(zip.read("word/document.xml"))
    return xmlString
I created a test document and ran the function getXml on it. Here's the result:
b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\r\n<w:document xmlns:ve="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml"><w:body><w:p w:rsidR="00971B91" w:rsidRPr="00971B91" w:rsidRDefault="00B52719"><w:pPr><w:rPr><w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman" w:cs="Times New Roman"/><w:sz w:val="24"/><w:szCs w:val="24"/></w:rPr></w:pPr><w:r><w:t>Test</w:t></w:r></w:p><w:sectPr w:rsidR="00971B91" w:rsidRPr="00971B91" w:rsidSect="009C4305"><w:pgSz w:w="12240" w:h="15840"/><w:pgMar w:top="1440" w:right="1440" w:bottom="1440" w:left="1440" w:header="708" w:footer="708" w:gutter="0"/><w:cols w:space="708"/><w:docGrid w:linePitch="360"/></w:sectPr></w:body></w:document>'
There are some obvious issues. One is that the XML code begins with an "b' " and ends with an apostrophe. Second, there is a "\r\n" right after the first set of angle brackets.
My ultimate goal is to modify the XML code to create a new Word document -- see this question -- but the anomalies with the extracted XML are preventing me from doing this.
Does anyone know why the extracted XML has these strange features and how I can remove them?
EDIT: I tried using the lxml module to parse this code but I only got different errors.
I created a new function getXmlTree:
from lxml import etree
def getXmlTree(xmlString):
    return etree.fromstring(xmlString)
I then ran the code etree.tostring(getXmlTree(getXml("test.docx")),pretty_print=True) and received much more sensible XML code.
The problems arise when I tried to create a new Word document. I created the following function to convert XML code into a Word document (shamelessly stolen from here):
import zipfile
from lxml import etree
import os
import tempfile
import shutil
def createNewDocx(originalDocx,xmlContent,newFilename):
    tmpDir = tempfile.mkdtemp()
    zip = zipfile.ZipFile(open(originalDocx,"rb"))
    zip.extractall(tmpDir)
    with open(os.path.join(tmpDir,"word/document.xml"),"w") as f:
        xmlString = etree.tostring(xmlContent,pretty_print=True)
        f.write(xmlString)
    filenames = zip.namelist()
    zipCopyFilename = newFilename
    with zipfile.ZipFile(zipCopyFilename,"w") as docx:
        for filename in filenames:
            docx.write(os.path.join(tmpDir,filename),filename)
    shutil.rmtree(tmpDir)
Before trying to create a new Word document, I wanted to see if I could create a copy of my original test document by substituting xmlContent = getXmlTree(getXml("test.docx")) as an argument in the above function. When I ran the code, however, I received an error message:
f.write(xmlString)
TypeError: must be str, not bytes
Using f.write(str(xmlString)) instead didn't help; it created a new word document, but Word would crash if I tried to open it.
EDIT2: tried running the above code with f.write(xmlString.decode("utf-8")) instead, but it didn't help; Word still crashed.
My guess is that the XML is not being encoded properly. First, write the document file as binary using "wb" as the mode. Second, tell etree.tostring() what the encoding is and to include the XML declaration.
with open(os.path.join(tmpDir, "word/document.xml"), "wb") as f:
    xmlBytes = etree.tostring(xmlContent, encoding="UTF-8", xml_declaration=True, pretty_print=True)
    f.write(xmlBytes)
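The same bytes-versus-text distinction also accounts for the b'...' wrapper and the literal \r\n in the question: ZipFile.read returns bytes, and calling str() on bytes yields their printable representation rather than the decoded text. The sketch below illustrates this with the standard library only, and copies an archive while replacing one member in memory instead of going through a temporary directory. The helper names read_member and copy_with_replacement are hypothetical, and the demo archive is a stand-in, not a real Word document, so it says nothing about whether Word will accept a given result.

```python
import os
import tempfile
import zipfile


def read_member(archive_path, member="word/document.xml"):
    """Return one archive member as raw bytes."""
    with zipfile.ZipFile(archive_path) as zf:
        return zf.read(member)


def copy_with_replacement(original, new_path, new_bytes, member="word/document.xml"):
    """Copy every member of `original` to `new_path`, swapping in `new_bytes`."""
    with zipfile.ZipFile(original) as src:
        with zipfile.ZipFile(new_path, "w", zipfile.ZIP_DEFLATED) as dst:
            for name in src.namelist():
                data = new_bytes if name == member else src.read(name)
                dst.writestr(name, data)  # bytes in, bytes out; no text mode


with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "original.zip")
    copy = os.path.join(tmp, "copy.zip")
    with zipfile.ZipFile(original, "w") as zf:  # stand-in archive for the demo
        zf.writestr("word/document.xml", b"<doc>Test</doc>")
        zf.writestr("other.xml", b"<other/>")

    xml_bytes = read_member(original)
    xml_text = xml_bytes.decode("utf-8")  # real text, no b'...' wrapper
    repr_text = str(xml_bytes)            # the repr, including the b'...' wrapper

    new_bytes = xml_text.replace("Test", "Changed").encode("utf-8")
    copy_with_replacement(original, copy, new_bytes)
    with zipfile.ZipFile(copy) as zf:
        names = zf.namelist()
        changed = zf.read("word/document.xml")
```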
<book>
<title>sponge bob</title>
<author>Joe Doe</author>
<file>Tbase</file>
</book>
I have two files: one is an XML file and the other is a base64 file. I would like to know how to replace the string "Tbase" with the content of the base64 file using Python.
Are you wanting to put the verbatim contents of the base64 file (still base64 encoded) into the XML file, in place of "Tbase"? If that's the case, you could just do something like:
xml = open("xmlfile.xml").read()
b64file = open("b64file.base64").read()
open("xmlfile.xml", "w").write(xml.replace("Tbase", b64file))
(If you're on Python 2.6 or later, you can do this a little bit cleaner using with statements, but that's another discussion.)
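If you want to decode the base64 file first, and place the decoded contents into the XML file, then you'd replace b64file on the last line of the example above with the decoded data. In Python 2 that is b64file.decode("base64"); in Python 3 the "base64" codec is no longer available on str, so use base64.b64decode(b64file) instead, and call .decode() on the result if the decoded content is text, since xml.replace needs a str.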
Of course, doing simple text replacement, as above, opens you up to the problems you'll have if, say, the title or author contain "Tbase" as well. A better way would be to use an actual XML parsing library, like so:
from xml.etree.ElementTree import fromstring, tostring
xml = fromstring(open("xmlfile.xml").read())
xml.find("file").text = open("b64file.base64").read()
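open("xmlfile.xml", "w").write(tostring(xml, encoding="unicode")) # Python 3: encoding="unicode" returns str; plain tostring(xml) returns bytes there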
This sets the contents of the <file> tag to be the contents of the file b64file.base64, regardless of what its former contents were and regardless of whether "Tbase" appears elsewhere in the XML document.
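For reference, a sketch of the same parser-based approach written for Python 3 with with statements, so that the files are closed promptly. The function name embed_base64 is hypothetical, and stripping the trailing newline from the base64 text is a choice made here, not something the question requires.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def embed_base64(xml_path, b64_path):
    """Replace the text of the <file> element with the base64 file's contents."""
    with open(b64_path, "r", encoding="ascii") as f:
        payload = f.read().strip()  # assumption: surrounding whitespace is not wanted
    tree = ET.parse(xml_path)
    tree.getroot().find("file").text = payload
    tree.write(xml_path, encoding="utf-8", xml_declaration=True)


# Demo with throwaway files
with tempfile.TemporaryDirectory() as tmp:
    xml_path = os.path.join(tmp, "xmlfile.xml")
    b64_path = os.path.join(tmp, "b64file.base64")
    with open(xml_path, "w", encoding="utf-8") as f:
        f.write("<book><title>sponge bob</title><author>Joe Doe</author>"
                "<file>Tbase</file></book>")
    with open(b64_path, "w", encoding="ascii") as f:
        f.write("aGVsbG8=\n")  # base64 for the bytes b"hello"

    embed_base64(xml_path, b64_path)
    book = ET.parse(xml_path).getroot()
    file_text = book.find("file").text
    title_text = book.find("title").text
```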