I parse a large XML file in Python using
import xml.etree.ElementTree as ET
tree = ET.parse('test.xml')
# do my manipulation
How do I write the XML file back to disk exactly as I read it, but with my modifications?
<?xml version="1.0" encoding="utf-16"?>
This was the first line of the input XML file.
After I added tree.write("output.sbp", encoding="utf-16"), the input and output files are the same size.
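A minimal sketch of that round trip, assuming the input really is UTF-16 as its declaration says. The helper name `rewrite_utf16` and the `modify` callback are hypothetical and only stand in for "do my manipulation". ElementTree does not promise byte-identical output (the declaration it writes uses single quotes, for example), so matching file sizes are a good sign rather than a guarantee.

```python
import xml.etree.ElementTree as ET


def rewrite_utf16(src_path, dst_path, modify):
    """Parse src_path, apply modify(root), and write the tree back as UTF-16."""
    tree = ET.parse(src_path)
    modify(tree.getroot())  # hypothetical hook for the caller's changes
    # Without an explicit encoding, write() falls back to US-ASCII and omits
    # the declaration; passing both arguments keeps the UTF-16 header.
    tree.write(dst_path, encoding="utf-16", xml_declaration=True)
```

Calling `rewrite_utf16('test.xml', 'output.sbp', my_changes)` would mirror the file names used in the question.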
Related
I need to compress multiple XML files, and I achieved this with lxml, zipfile and a for loop.
My problem is that every time I rerun my function, the content of the compressed files is repeated (appended at the end), so the files keep getting longer. I believe it has to do with the a+b write mode. I thought that using with open would delete the files at the end of the code block so that no more content would be added to them. I was wrong, and the other modes do not give the intended result either.
Here is my code:
def compress_package_file(self):
    bytes_buffer = BytesIO()
    with zipfile.ZipFile(bytes_buffer, 'w') as invoices_package:
        i = 1
        for invoice in record.invoice_ids.sorted('sin_number'):
            invoice_file_name = 'Invoice_' + invoice.number + '.xml'
            with open(invoice_file_name, 'a+b') as invoice_file:
                invoice_file.write(invoice._get_invoice_xml().getvalue())
                invoices_package.write(invoice_file_name, compress_type=zipfile.ZIP_DEFLATED)
            i += 1
    compressed_package = bytes_buffer.getvalue()
    encoded_compressed_file = base64.b64encode(compressed_package)
My XML generator is in another function and works fine, but the content repeats each time I run this function. For example, if I run it twice, the content of the files in the compressed file looks something like this (simplified):
<?xml version='1.0' encoding='UTF-8'?>
<invoice xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="invoice.xsd">
<header>
<invoiceNumber>9</invoiceNumber>
</header>
</facturaComputarizadaCompraVenta><?xml version='1.0' encoding='UTF-8'?>
<invoice xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="invoice.xsd">
<header>
<invoiceNumber>9</invoiceNumber>
</header>
</facturaComputarizadaCompraVenta>
If I use w+b mode, the files are blank.
What should my code look like to avoid this behavior?
I suggest you do use w+b mode, but move the write to the zip file to after the invoice XML file is closed.
From what you wrote, it looks like you are trying to compress a file that has not yet been flushed to disk, so with w+b it is still empty at the time of compression.
So, try removing one level of indentation from the invoices_package.write line (I can't format code properly on mobile, so I can't post the whole section).
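The suggestion above could be sketched roughly as follows. This is not the asker's exact code: `compress_xml_files`, `xml_by_name` and `work_dir` are hypothetical names, and a plain dict of file names to XML bytes stands in for the invoice records. Each file is opened in a truncating mode and closed before it is added to the archive.

```python
import base64
import os
import zipfile
from io import BytesIO


def compress_xml_files(xml_by_name, work_dir):
    """Write each XML payload to disk, then zip the closed files in memory."""
    bytes_buffer = BytesIO()
    with zipfile.ZipFile(bytes_buffer, 'w') as package:
        for file_name, xml_bytes in xml_by_name.items():
            path = os.path.join(work_dir, file_name)
            # 'wb' truncates the file, so reruns do not append old content.
            with open(path, 'wb') as xml_file:
                xml_file.write(xml_bytes)
            # The with block has ended: the file is flushed and closed here.
            package.write(path, arcname=file_name,
                          compress_type=zipfile.ZIP_DEFLATED)
    return base64.b64encode(bytes_buffer.getvalue())
```

If the files on disk are not needed afterwards, `ZipFile.writestr` can add the bytes to the archive directly and skip the temporary files.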
I have a dummy xml file,
<?xml version="1.0" encoding="UTF-8"?>
<hello xmlns="abc">
<inside>
<ok>xyz</ok>
</inside>
</hello>
<?xml version="1.0" encoding="UTF-8"?>
<xyz xmlns="acxd">
</xyz>
<?xml version="1.0" encoding="UTF-8"?>
<zz xmlns="zmrt">
</zz>
]]>]]>
I am trying to parse this XML file using the following code.
import xml.etree.ElementTree as ET
mytree = ET.parse(temp_xml)
The error I get is "ParseError: junk after document element: line 7, column 0".
I tried removing ']]>]]>' (that is, on line 7), but I still get the same error, now reported as "ParseError: junk after document element: line 8, column 0". Is there a way to deal with this error, or can we skip lines that contain junk data?
An XML document may have only a single root element. Yours has three and is therefore not well-formed. If you wish to parse it using XML tools, you'll first have to separate the root elements, manually or programmatically, into their own documents.
Note that an XML document can also have at most one XML declaration (<?xml version="1.0" encoding="UTF-8"?>), and if it exists, it must be at the top of the file.
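One way to do the separation programmatically is to split the text just before each XML declaration and parse every piece on its own. This is a minimal sketch, not the only approach: `parse_concatenated` is a hypothetical helper, and it assumes every document in the file starts with a declaration and that the trailing `]]>]]>` marker can simply be dropped.

```python
import re
import xml.etree.ElementTree as ET

END_MARKER = "]]>]]>"  # trailing delimiter seen in the question's file


def parse_concatenated(text):
    """Split text at each XML declaration and parse every piece separately."""
    text = text.replace(END_MARKER, "")
    # Zero-width split just before each "<?xml" (requires Python 3.7+).
    chunks = re.split(r"(?=<\?xml\s)", text)
    # A declaration must come first in its document, so strip whitespace.
    return [ET.fromstring(chunk.strip()) for chunk in chunks if chunk.strip()]
```

Each returned root carries its namespace in the tag, so the first document from the question comes back as `{abc}hello`.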
See also
Why must XML documents have a single root element?
How to parse invalid (bad / not well-formed) XML?
Are multiple XML declarations in a document well-formed XML?
Parse a xml file with multiple root element in python
I have written code based on a txt file but in production it will be an XML file. I need to be able to iterate over every single character in an XML file as if it were a standard text file.
I have the below as an XML file:
<?xml version="1.0" encoding="UTF-8"?>
-<openBranchData>
<phoneSwMngmtEnabled/>
-<phoneSwMngmtData>
<imgProvisioningEnabled/>
<startTime>02:00</startTime>
<stopTime>06:00</stopTime>
<centralPhoneSwServer/>
<maxParallelAccess>3</maxParallelAccess>
<phoneSwPullingEnabled/>
</phoneSwMngmtData>
<voiceMailEnabled/>
I've tried the below python code.
with open('osb_file.xml') as test:
    testing = test.read()
for x in testing:
    print(x)
Do I need to convert to a txt file or is there a simpler way? It's been a while since I've looked at this project so apologies if I'm missing anything very obvious.
As Robert Harvey said, XML files are already text files. You can read them just like any other text file in Python, using open(filename, 'r').
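For a large file, a generator can walk the characters without holding the whole file in memory. A minimal sketch, assuming the file is UTF-8 as its declaration states; `iter_characters` is a hypothetical helper name.

```python
def iter_characters(path, encoding="utf-8"):
    """Yield the characters of a text file one at a time."""
    with open(path, "r", encoding=encoding) as handle:
        for line in handle:
            # A str is itself iterable, so this yields each character in turn.
            yield from line
```

Usage would look like `for x in iter_characters('osb_file.xml'): print(x)`, which matches the loop in the question.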
I am having issues writing the XML below to an output file.
<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet href="CoreNLP-to-HTML.xsl" type="text/xsl"?>
<root>
<document>
<sentences>
<sentence id="1">
<tokens>
<token id="1">
<word>
Pusheen
</word>
<CharacterOffsetBegin>
0
</CharacterOffsetBegin>
<CharacterOffsetEnd>
7
</CharacterOffsetEnd>
<POS>
NNP
</POS>
</token>
</tokens>
</sentence>
</sentences>
</document>
</root>
How do I write this to an output file in XML format? I tried the write statement below:
tree.write(open('person.xml', 'w'), encoding='unicode')
But I get the following error:
AttributeError: 'str' object has no attribute 'write'
I don't have to build the XML here, as I already have the data in XML format. I just need to write it to an XML file.
Assuming that tree holds your XML, it is a string. You probably want something like:
with open("person.xml", "w", encoding="utf-8") as outfile:
    outfile.write(tree)
(It is good practice to use with for files; it closes them automatically afterwards.)
The error occurs because tree is a string, and a string has no write method.
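The distinction between an XML string and a parsed ElementTree can be made explicit in a small helper. This is a sketch using only the standard library; `save_xml` is a hypothetical name, and UTF-8 is assumed because that is what the declaration in the question states.

```python
import xml.etree.ElementTree as ET


def save_xml(xml_or_tree, path):
    """Save either an XML string or a parsed ElementTree to path."""
    if isinstance(xml_or_tree, str):
        # Parse once purely as a well-formedness check; raises ET.ParseError.
        ET.fromstring(xml_or_tree)
        with open(path, "w", encoding="utf-8") as outfile:
            outfile.write(xml_or_tree)  # the text is written unchanged
    else:
        # An ElementTree object knows how to serialise itself.
        xml_or_tree.write(path, encoding="utf-8", xml_declaration=True)
```

Writing the original string keeps details such as the xml-stylesheet processing instruction exactly as they were in the input.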
I recommend using the lxml module to check the format first and then write it to a file. I notice that you've got two elements with the same id, which caught my eye. It doesn't flag an error in XML, but it could cause trouble on an HTML page, where each id is supposed to be unique.
Here's the simple code to do what I described above:
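from lxml import etree

try:
    # Checks that the XML is well-formed and returns the root Element.
    # Pass bytes if your_xml_data starts with an encoding declaration;
    # lxml rejects str input that carries one.
    root = etree.fromstring(your_xml_data)
    tree = etree.ElementTree(root)  # wrap the Element in an ElementTree
    tree.write('person.xml')  # ElementTree provides write() for saving to a file
except etree.XMLSyntaxError as err:
    print('Oops!', err)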
I have one large document (400 MB) that contains hundreds of XML documents, each with its own declaration. I am trying to parse each document using ElementTree in Python, but I am having a lot of trouble splitting the XML documents apart in order to parse out the information. Here is an example of what the document looks like:
<?xml version="1.0"?>
<data>
<more>
<p></p>
</more>
</data>
<?xml version="1.0"?>
<different data>
<etc>
<p></p>
</etc>
</different data>
<?xml version="1.0"?>
<continues.....>
Ideally I would like to read through each XML declaration, parse the data, and continue on with the next XML document. Any suggestions will help.
You'll need to read in the documents separately; here is a generator function that'll yield complete XML documents from a given file object:
def xml_documents(fileobj):
    document = []
    for line in fileobj:
        if line.strip().startswith('<?xml') and document:
            yield ''.join(document)
            document = []
        document.append(line)
    if document:
        yield ''.join(document)
Then use ElementTree.fromstring() to load and parse these:
from xml.etree import ElementTree

with open('file_with_multiple_xmldocuments') as fileobj:
    for xml in xml_documents(fileobj):
        tree = ElementTree.fromstring(xml)
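With hundreds of documents in one file, a single malformed document would otherwise stop the whole loop (the `<different data>` placeholder in the question, for instance, is not a valid element name). One option is to catch parse errors per document. A minimal sketch; `parse_each` is a hypothetical helper that accepts any iterable of XML strings, such as the generator above.

```python
import xml.etree.ElementTree as ET


def parse_each(documents):
    """Yield (index, root, error) for every XML string in documents."""
    for index, xml_text in enumerate(documents):
        try:
            # strip() removes stray whitespace around the document.
            yield index, ET.fromstring(xml_text.strip()), None
        except ET.ParseError as error:
            # Report the failure and carry on with the next document.
            yield index, None, error
```

A caller can then log the index of each failed document and keep processing the rest.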