XML Python: XML code is duplicated after saving to file - python

I have code that is supposed to read the file content and wrap it in an additional import tag:
with open('oferta-empik.xml', 'r+', encoding='utf-8') as f:
    xml = '<import>' + f.read() + '</import>'
    print(xml)
    f.write(xml)
    f.close()
Unfortunately, after saving, the original content is still there unchanged, and the XML wrapped in the import tag is inserted after it.
In total, the file contains the XML twice: first the unchanged original, then the same content wrapped in the import tag, appended to the end of the file.
ORIGINAL CODE:
<offers>
    <offer>
        <leadtime-to-ship>1</leadtime-to-ship>
        <product-id-type>EAN</product-id-type>
        <state>11</state>
        <quantity>0</quantity>
        <price>146</price>
        <sku>B01.001.1.10</sku>
    </offer>
</offers>
AFTER CODE:
<offers>
    <offer>
        <leadtime-to-ship>1</leadtime-to-ship>
        <product-id-type>EAN</product-id-type>
        <state>11</state>
        <quantity>0</quantity>
        <price>146</price>
        <sku>B01.001.1.10</sku>
    </offer>
</offers>
<import><offers>
    <offer>
        <leadtime-to-ship>1</leadtime-to-ship>
        <product-id-type>EAN</product-id-type>
        <state>11</state>
        <quantity>0</quantity>
        <price>146</price>
        <sku>B01.001.1.10</sku>
    </offer>
</offers></import>

The issue is that you're appending the new text (the wrapped XML) to the end of the file. f.read() leaves the file position at the end of the file, so the f.write() that follows puts the modified XML after the original content.
There are two solutions:
Recommended: open the file for reading, read the XML, and close it. Then open it again for writing and write the entire thing (overwriting the initial content).
Not recommended: after reading, seek to the beginning of the file (with f.seek(0)) and write the new content. This is not recommended because if the new content is ever shorter than the original, the tail of the old content stays in the file and the result is inconsistent.
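Both options can be sketched roughly as follows. This is only a minimal illustration, not the asker's exact code: the helper names wrap_by_reopening and wrap_in_place are made up for this example, and the f.truncate() call in the second variant is an addition that is not part of the answer above. It cuts off leftover bytes if the new content happens to be shorter.

```python
def wrap_by_reopening(path, tag="import"):
    """Recommended variant: read, close, then reopen for writing."""
    with open(path, "r", encoding="utf-8") as f:
        content = f.read()
    # Mode "w" truncates the file, so the old content is replaced.
    with open(path, "w", encoding="utf-8") as f:
        f.write(f"<{tag}>{content}</{tag}>")


def wrap_in_place(path, tag="import"):
    """Alternative variant: rewind after reading and overwrite in place."""
    with open(path, "r+", encoding="utf-8") as f:
        content = f.read()   # the file position is now at the end
        f.seek(0)            # rewind, otherwise write() appends
        f.write(f"<{tag}>{content}</{tag}>")
        f.truncate()         # assumption: added to drop any leftover tail
```

Note that both variants still treat the XML as plain text, so an XML declaration at the top of the file would end up inside the wrapper tag.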

I have a code that in principle is to open the file content and wrap it with an additional import tag
Your current approach is wrong. Don't open XML files as text files, don't treat XML as text. Always use a parser.
This is a lot better:
import xml.etree.ElementTree as ET
# 1: load current document and top level element
old_tree = ET.parse('oferta-empik.xml')
old_root = old_tree.getroot()
# 2: create <import> element to serve as new top level
new_root = ET.Element('import')
# 3: insert current document root ("wrap it in <import>")
new_root.insert(0, old_root)
# 4: make new ElementTree and write it to file
new_tree = ET.ElementTree(new_root)
with open('output.xml', 'wb') as f:
    new_tree.write(f, encoding='utf-8', xml_declaration=True)
Compressed:
new_root = ET.Element('import')
new_root.insert(0, ET.parse('oferta-empik.xml').getroot())
with open('output.xml', 'wb') as f:
    ET.ElementTree(new_root).write(f, encoding='utf-8', xml_declaration=True)

Related

How to increase version number of a xml file after each change in the file using ETree

I'm trying to manipulate an XML file. I use a loop, and in each iteration I want the version number of the XML file to be increased. To manipulate the XML file I am using ETree. Here is what I have tried so far:
def main():
    import xml.etree.ElementTree as ET
    import os
    version = "0"
    while os.path.exists(f"/Users/tt/sumoTracefcdfile_{version}.xml"):
        #use parse() function to load and parse an xml file
        fileDirect="/Users/tt/sumoTracefcdfile_{version}.xml"
        version=int(version)
        version+=1
        doc = ET.parse(fileDirect)
        .....
        #at the end after adding some data to xml file, I do the following to write the changes into the xml file:
        save_path_file = "/Users/tt/sumoTracefcdfile_{version}.xml"
        b_xml = ET.tostring(valeurs)
        with open(save_path_file, "wb") as f:
            f.write(b_xml)
However I get the following error for the line 'doc = ET.parse(fileDirect)':
FileNotFoundError: [Errno 2] No such file or directory:
'/Users/tt/sumoTracefcdfile_{version}.xml'
It looks like you wanted to use f-strings and forgot the "f" in 2 lines.
Changing fileDirect="/Users/tt/sumoTracefcdfile_{version}.xml" to fileDirect = f"/Users/tt/sumoTracefcdfile_{version}.xml" and save_path_file = "/Users/tt/sumoTracefcdfile_{version}.xml" to save_path_file = f"/Users/tt/sumoTracefcdfile_{version}.xml" might solve your issues.
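A small sketch of one read-modify-save step with the f prefix in place. The function name save_next_version and the directory parameter are hypothetical, introduced only for this illustration; the file name pattern is taken from the question. The step is written as a function that runs once per call, because a while loop that keeps checking whether the next version exists would never stop once each pass creates that next version.

```python
import os
import xml.etree.ElementTree as ET


def save_next_version(directory, version):
    """Read version N of the file, then write it back as version N + 1."""
    # The f prefix makes Python substitute {version}; without it the
    # braces stay in the path literally and the file is not found.
    src = os.path.join(directory, f"sumoTracefcdfile_{version}.xml")
    doc = ET.parse(src)

    # ... modify doc.getroot() here ...

    new_version = version + 1
    dst = os.path.join(directory, f"sumoTracefcdfile_{new_version}.xml")
    doc.write(dst)
    return new_version
```

Calling save_next_version("/Users/tt", 0) would read sumoTracefcdfile_0.xml and write sumoTracefcdfile_1.xml, assuming the first file exists.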

Write ElementTree directly to zip with utf-8 encoding

I want to modify a large number of XMLs. They are stored in ZIP files. The source XMLs are UTF-8 encoded (at least according to the file tool on Linux) and have a correct XML declaration:
<?xml version='1.0' encoding='UTF-8'?>.
The target ZIPs and the XMLs contained therein should also have the correct XML declaration. However, the (at least to me) most obvious method (using ElementTree.tostring) fails.
Here is a self-contained example, that should work out of the box.
Short walkthrough:
imports
preparations (creating src.zip, these ZIPs are a given in my actual application)
actual work of program (modifying XMLs), starting at # read XMLs from zip
Please focus on the lower part, especially # APPROACH 1, APPROACH 2, APPROACH 3:
import os
import tempfile
import zipfile
from xml.etree.ElementTree import Element, ElementTree, parse, tostring
src_1 = os.path.join(tempfile.gettempdir(), "one.xml")
src_2 = os.path.join(tempfile.gettempdir(), "two.xml")
src_zip = os.path.join(tempfile.gettempdir(), "src.zip")
trgt_appr1_zip = os.path.join(tempfile.gettempdir(), "trgt_appr1.zip")
trgt_appr2_zip = os.path.join(tempfile.gettempdir(), "trgt_appr2.zip")
trgt_appr3_zip = os.path.join(tempfile.gettempdir(), "trgt_appr3.zip")
# file on hard disk that must be used due to ElementTree insufficiencies
tmp_xml_name = os.path.join(tempfile.gettempdir(), "curr_xml.tmp")
# prepare src.zip
tree1 = ElementTree(Element('hello', {'beer': 'good'}))
tree1.write(os.path.join(tempfile.gettempdir(), "one.xml"), encoding="UTF-8", xml_declaration=True)
tree2 = ElementTree(Element('scnd', {'äkey': 'a value'}))
tree2.write(os.path.join(tempfile.gettempdir(), "two.xml"), encoding="UTF-8", xml_declaration=True)
with zipfile.ZipFile(src_zip, 'a') as src:
    with open(src_1, 'r', encoding="utf-8") as one:
        string_representation = one.read()
    # write to zip
    src.writestr(zinfo_or_arcname="one.xml", data=string_representation.encode("utf-8"))
    with open(src_2, 'r', encoding="utf-8") as two:
        string_representation = two.read()
    # write to zip
    src.writestr(zinfo_or_arcname="two.xml", data=string_representation.encode("utf-8"))
os.remove(src_1)
os.remove(src_2)
# read XMLs from zip
with zipfile.ZipFile(src_zip, 'r') as zfile:
    updated_trees = []
    for xml_name in zfile.namelist():
        curr_file = zfile.open(xml_name, 'r')
        tree = parse(curr_file)
        # modify tree
        updated_tree = tree
        updated_tree.getroot().append(Element('new', {'newkey': 'new value'}))
        updated_trees.append((xml_name, updated_tree))
    for xml_name, updated_tree in updated_trees:
        # write to target file
        with zipfile.ZipFile(trgt_appr1_zip, 'a') as trgt1_zip, zipfile.ZipFile(trgt_appr2_zip, 'a') as trgt2_zip, zipfile.ZipFile(trgt_appr3_zip, 'a') as trgt3_zip:
            #
            # APPROACH 1 [DESIRED, BUT DOES NOT WORK]: write tree to zip-file
            # encoding in XML declaration missing
            #
            # create byte representation of elementtree
            byte_representation = tostring(element=updated_tree.getroot(), encoding='UTF-8', method='xml')
            # write XML directly to zip
            trgt1_zip.writestr(zinfo_or_arcname=xml_name, data=byte_representation)
            #
            # APPROACH 2 [WORKS IN THEORY, BUT DOES NOT WORK]: write tree to zip-file
            # encoding in XML declaration is faulty (is 'utf8', should be 'utf-8' or 'UTF-8')
            #
            # create byte representation of elementtree
            byte_representation = tostring(element=updated_tree.getroot(), encoding='utf8', method='xml')
            # write XML directly to zip
            trgt2_zip.writestr(zinfo_or_arcname=xml_name, data=byte_representation)
            #
            # APPROACH 3 [WORKS, BUT LACKS PERFORMANCE]: write to file, then read from file, then write to zip
            #
            # write to file
            updated_tree.write(tmp_xml_name, encoding="UTF-8", method="xml", xml_declaration=True)
            # read from file
            with open(tmp_xml_name, 'r', encoding="utf-8") as tmp:
                string_representation = tmp.read()
            # write to zip
            trgt3_zip.writestr(zinfo_or_arcname=xml_name, data=string_representation.encode("utf-8"))
            os.remove(tmp_xml_name)
APPROACH 3 works, but it is much more resource-intensive than the other two.
APPROACH 2 is the only way I could get an ElementTree object to be written with an actual XML declaration -- which then turns out to be invalid (utf8 instead of UTF-8/utf-8).
APPROACH 1 would be most desired -- but fails during reading later in the pipeline, as the XML declaration is missing.
Question: How can I get rid of writing the whole XML to disk first, only to read it afterwards, write it to the zip and delete it after being done with the zip? What am I missing?
You can use an io.BytesIO object.
This allows using ElementTree.write, while avoiding exporting the tree to disk:
import zipfile
from io import BytesIO
from xml.etree.ElementTree import ElementTree, Element
tree = ElementTree(Element('hello', {'beer': 'good'}))
bio = BytesIO()
tree.write(bio, encoding='UTF-8', xml_declaration=True)
with zipfile.ZipFile('/tmp/test.zip', 'w') as z:
    z.writestr('test.xml', bio.getvalue())
If you are using Python 3.6 or higher, there's an even shorter solution:
you can get a writable file object from the ZipFile object, which you can pass to ElementTree.write:
import zipfile
from xml.etree.ElementTree import ElementTree, Element
tree = ElementTree(Element('hello', {'beer': 'good'}))
with zipfile.ZipFile('/tmp/test.zip', 'w') as z:
    with z.open('test.xml', 'w') as f:
        tree.write(f, encoding='UTF-8', xml_declaration=True)
This also has the advantage that you don't store multiple copies of the tree in memory, which could be a relevant issue for large trees.
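Applied to the question's pipeline (read each XML from a source ZIP, modify it, write it to a target ZIP), the second variant might look like the sketch below. The function name copy_zip_with_update is hypothetical; the appended 'new' element mirrors the modification in the question, and Python 3.6 or higher is assumed for ZipFile.open(..., 'w').

```python
import zipfile
from xml.etree.ElementTree import Element, parse


def copy_zip_with_update(src_zip, dst_zip):
    """Parse each XML in src_zip, append a <new> element, write to dst_zip."""
    with zipfile.ZipFile(src_zip, "r") as src, \
            zipfile.ZipFile(dst_zip, "w") as dst:
        for xml_name in src.namelist():
            # Parse straight from the archive member, no temp file needed.
            with src.open(xml_name, "r") as member:
                tree = parse(member)

            # Same modification as in the question.
            tree.getroot().append(Element("new", {"newkey": "new value"}))

            # Write the tree, including the XML declaration, directly
            # into the target archive.
            with dst.open(xml_name, "w") as out:
                tree.write(out, encoding="UTF-8", xml_declaration=True)
```

Each tree is written as soon as it has been modified, so only one parsed document is held in memory at a time.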
The only thing really missing in approach 1 is the XML declaration. ElementTree.write(...) accepts an xml_declaration argument; unfortunately, in your Python version ElementTree.tostring does not yet.
Starting with Python 3.8, ElementTree.tostring does have an xml_declaration argument, see:
https://docs.python.org/3.8/library/xml.etree.elementtree.html
Even though that implementation is unavailable to you when using Python 3.6, you can easily copy the 3.8 implementation in your own Python file:
import io
from xml.etree.ElementTree import ElementTree
def tostring(element, encoding=None, method=None, *,
             xml_declaration=None, default_namespace=None,
             short_empty_elements=True):
    """Generate string representation of XML element.
    All subelements are included. If encoding is "unicode", a string
    is returned. Otherwise a bytestring is returned.
    *element* is an Element instance, *encoding* is an optional output
    encoding defaulting to US-ASCII, *method* is an optional output which can
    be one of "xml" (default), "html", "text" or "c14n", *default_namespace*
    sets the default XML namespace (for "xmlns").
    Returns an (optionally) encoded string containing the XML data.
    """
    stream = io.StringIO() if encoding == 'unicode' else io.BytesIO()
    ElementTree(element).write(stream, encoding,
                               xml_declaration=xml_declaration,
                               default_namespace=default_namespace,
                               method=method,
                               short_empty_elements=short_empty_elements)
    return stream.getvalue()
(See https://github.com/python/cpython/blob/v3.8.0/Lib/xml/etree/ElementTree.py#L1116)
In that case you can simply use approach one:
# create byte representation of elementtree
byte_representation = tostring(element=updated_tree.getroot(), encoding='UTF-8', method='xml', xml_declaration=True)
# write XML directly to zip
trgt1_zip.writestr(zinfo_or_arcname=xml_name, data=byte_representation)

lxml.etree: Start tag expected, '<' not found, line 1, column 1

I want to take some simple xml files and convert them all to CSV in one go (though this code is just for one at a time). It looks to me like there are no official name spaces, but I'm not sure.
I have this code (I used one header, SubmittingSystemVendor, but I really want to write all of them to CSV):
import csv
import lxml.etree
x = r'C:\Users\...\jh944.xml'
with open('output.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerow('SubmittingSystemVendor')
    root = lxml.etree.fromstring(x)
    writer.writerow(row)
Here is a sample of the XML file:
<?xml version="1.0" encoding="utf-8"?>
<EOYGeneralCollectionGroup SchemaVersionMajor="2014-2015" SchemaVersionMinor="1" CollectionId="157" SubmittingSystemName="MISTAR" SubmittingSystemVendor="WayneRESA" SubmittingSystemVersion="2014" xsi:noNamespaceSchemaLocation="http://cepi.state.mi.us/msdsxml/EOYGeneralCollection2014-20151.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<EOYGeneralCollection>
<SubmittingEntity>
<SubmittingEntityTypeCode>D</SubmittingEntityTypeCode>
<SubmittingEntityCode>82730</SubmittingEntityCode>
</SubmittingEntity>
The error is:
lxml.etree: Start tag expected, '<' not found, line 1, column 1
You are using lxml.etree.fromstring, but giving it a file path as the argument. This means it's trying to interpret the string 'C:\Users\...\jh944.xml' itself as the XML data to be parsed, and that string does not start with '<'.
Instead, you want to open the file containing this XML. You can simply replace the call to fromstring with lxml.etree.parse, which will accept a filename or open file object as the argument.
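A minimal sketch of the corrected flow, writing all attributes of the root element to CSV. To keep it runnable without extra packages it uses the standard library's xml.etree.ElementTree instead of lxml, which is a swap made for this example; the parse-versus-fromstring distinction is the same. The function name root_attributes_to_csv is hypothetical.

```python
import csv
import xml.etree.ElementTree as ET


def root_attributes_to_csv(xml_path, csv_path):
    """Write the root element's attributes as a header row and a value row."""
    # parse() takes a file name (or file object); fromstring() would
    # expect the XML text itself.
    root = ET.parse(xml_path).getroot()

    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        # writerow() expects a sequence of fields; a bare string would
        # be split into one column per character.
        writer.writerow(list(root.attrib.keys()))
        writer.writerow(list(root.attrib.values()))
```

Namespaced attributes such as xsi:noNamespaceSchemaLocation appear with their namespace URI in braces as the key.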

Modifying and rewriting XML file with Python ElementTree

I have a XML file that starts like this:
<?xml version="1.0" encoding="utf-8"?>
<Recipe xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
I need to read it in, modify it, then write it back out. Here is a code snippet:
from xml.etree import ElementTree
with open('base.xml', 'rt') as f:
    tree = ElementTree.parse(f)
recipe = tree.find('')
t = recipe.find('Targets_Params/Target_Table/Target_Name')
t.text = "new Value"
output_file = open('new.xml', 'w' )
output_file.write(ElementTree.tostring(recipe))
output_file.close()
My problem is that when I write the file out I do not get the first line at all, and the second line comes out with just:
<Recipe>
How can I read in the file, modify it, and write it out while preserving the original structure?

Parse each file in a directory with BeautifulSoup/Python, save out as new file

New to Python & BeautifulSoup. I have a Python program that opens a file called "example.html", runs a BeautifulSoup action on it, then runs a Bleach action on it, then saves the result as file "example-cleaned.html". So far it is working for all contents of "example.html".
I need to modify it so that it opens each file in folder "/posts/", runs the program on it, then saves it out as "/posts-cleaned/X-cleaned.html" where X is the original filename.
Here's my code, minimised:
from bs4 import BeautifulSoup
import bleach
import re
text = BeautifulSoup(open("posts/example.html"))
text.encode("utf-8")
tag_black_list = ['iframe', 'script']
tag_white_list = ['p','div']
attr_white_list = {'*': ['title']}
# Step one, with BeautifulSoup: Remove tags in tag_black_list, destroy contents.
[s.decompose() for s in text(tag_black_list)]
pretty = (text.prettify())
# Step two, with Bleach: Remove tags and attributes not in whitelists, leave tag contents.
cleaned = bleach.clean(pretty, strip="TRUE", attributes=attr_white_list, tags=tag_white_list)
fout = open("posts/example-cleaned.html", "w")
fout.write(cleaned.encode("utf-8"))
fout.close()
print "Done"
Assistance & pointers to existing solutions gladly received!
You can use os.listdir() to get a list of all files in a directory. If you want to recurse all the way down the directory tree, you'll need os.walk().
I would move all this code to handle a single file to function, and then write a second function to handle parsing the whole directory. Something like this:
import os
from bs4 import BeautifulSoup
import bleach

def clean_dir(directory):
    # join directory and filename so this also works for relative paths
    for filename in os.listdir(directory):
        clean_file(os.path.join(directory, filename))

def clean_file(filename):
    tag_black_list = ['iframe', 'script']
    tag_white_list = ['p','div']
    attr_white_list = {'*': ['title']}
    with open(filename, 'r') as fhandle:
        text = BeautifulSoup(fhandle)
    text.encode("utf-8")
    # Step one, with BeautifulSoup: Remove tags in tag_black_list, destroy contents.
    [s.decompose() for s in text(tag_black_list)]
    pretty = (text.prettify())
    # Step two, with Bleach: Remove tags and attributes not in whitelists, leave tag contents.
    cleaned = bleach.clean(pretty, strip=True, attributes=attr_white_list, tags=tag_white_list)
    # this appends -cleaned to the file;
    # relies on the file having a '.'
    dot_pos = filename.rfind('.')
    cleaned_filename = '{0}-cleaned{1}'.format(filename[:dot_pos], filename[dot_pos:])
    # binary mode, because the cleaned text is encoded to bytes before writing
    with open(cleaned_filename, 'wb') as fout:
        fout.write(cleaned.encode("utf-8"))
    print("Done")
Then you just call clean_dir('/posts') or what not.
I'm appending "-cleaned" to the files, but I think I like your idea of using a whole new directory better. That way you won't have to handle conflicts if -cleaned already exists for some file, etc.
I'm also using the with statement to open files here as it closes them and handles exceptions automatically.
Answer to my own question, for others who might find the Python docs for os.listdir a bit unhelpful:
from bs4 import BeautifulSoup
import bleach
import re
import os, os.path
tag_black_list = ['iframe', 'script']
tag_white_list = ['p','div']
attr_white_list = {'*': ['title']}
postlist = os.listdir("posts/")
for post in postlist:
    # HERE: you need to specify the directory again, the value of "post" is just the filename:
    text = BeautifulSoup(open("posts/"+post))
    text.encode("utf-8")
    # Step one, with BeautifulSoup: Remove tags in tag_black_list, destroy contents.
    [s.decompose() for s in text(tag_black_list)]
    pretty = (text.prettify())
    # Step two, with Bleach: Remove tags and attributes not in whitelists, leave tag contents.
    cleaned = bleach.clean(pretty, strip="TRUE", attributes=attr_white_list, tags=tag_white_list)
    fout = open("posts-cleaned/"+post, "w")
    fout.write(cleaned.encode("utf-8"))
    fout.close()
I cheated and made a separate folder called "posts-cleaned/" because saving files there was easier than splitting the filename, adding "cleaned", and re-joining it, although if anyone wants to show me a good way to do that, that would be even better.
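One possible way to do the split-and-rejoin, sketched with os.path from the standard library. The function name cleaned_path and its default arguments are hypothetical, and the f-string assumes Python 3.6 or higher, unlike the Python 2 code above.

```python
import os


def cleaned_path(src_path, out_dir="posts-cleaned", suffix="-cleaned"):
    """Build '<out_dir>/<name><suffix><ext>' from a source file path."""
    # splitext() separates 'example.html' into ('example', '.html');
    # for a name without a dot the extension is simply ''.
    stem, ext = os.path.splitext(os.path.basename(src_path))
    return os.path.join(out_dir, f"{stem}{suffix}{ext}")
```

For instance, cleaned_path("posts/example.html") gives "example-cleaned.html" inside the "posts-cleaned" directory, and the result can be passed straight to open().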
