How to generate XML, UTF-8 with BOM using Python Element Tree?

To generate a resource XML file for ASP.NET, a third-party tool requires a BOM (this came up when migrating to a new version of the tool). At the same time, it requires an XML prolog like <?xml version='1.0' encoding='utf-8'?>.
The problem is that when using the ElementTree command...
tree.write(lang_resx_fpath, encoding='utf-8')
the resulting file does not contain BOM. When using the command...
tree.write(lang_resx_fpath, encoding='utf-8-sig')
the result does contain BOM; however, the XML prolog contains encoding='utf-8-sig'.
How should I generate the file to contain both BOM and encoding='utf-8'?
UPDATE:
I have worked around it by reading, replacing, and writing the file again, like this...
with open(lang_resx_fpath, 'r', encoding='utf-8-sig') as f:
    content = f.read()
content = content.replace("encoding='utf-8-sig'", "encoding='utf-8'")
with open(lang_resx_fpath, 'w', encoding='utf-8-sig') as f:
    f.write(content)
Anyway, is there any cleaner solution?
UPDATE: I have filed https://bugs.python.org/issue46598 and have also written a fix (https://github.com/python/cpython/pull/31043).

A peek into the source of ElementTree.write shows that the prolog is hardcoded there (https://github.com/python/cpython/blob/main/Lib/xml/etree/ElementTree.py or permalink https://github.com/python/cpython/blob/ee0ac328d38a86f7907598c94cb88a97635b32f8/Lib/xml/etree/ElementTree.py). Therefore, using the internals of ET is probably the only option (other than monkey-patching the module) to write the required preamble and keep the BOM in the file:
import xml.etree.ElementTree as ET
# _namespaces and _serialize_xml are private helpers and may change between Python versions
qnames, namespaces = ET._namespaces(tree.getroot(), None)
with open(lang_resx_fpath, 'w', encoding='utf-8-sig') as f:
    f.write("<?xml version='1.0' encoding='utf-8'?>\n")
    ET._serialize_xml(f.write,
                      tree.getroot(), qnames, namespaces,
                      short_empty_elements=False)
It is probably no more elegant than your solution (and maybe even less so). The only advantage is that it does not require writing the file twice, which is a minor benefit except for huge XML files.
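Another possibility, not taken from the answer above, is to write the BOM bytes yourself and then let tree.write() serialize into the same binary file object with encoding='utf-8'. The sketch below assumes that approach; write_xml_with_bom is a hypothetical helper name, and the small tree is made-up sample data.

```python
import codecs
import io
import xml.etree.ElementTree as ET


def write_xml_with_bom(tree, fileobj):
    """Write `tree` to a binary file object as UTF-8, preceded by a BOM.

    Hypothetical helper: only public ElementTree API is used.
    """
    fileobj.write(codecs.BOM_UTF8)  # the three bytes EF BB BF
    # xml_declaration=True forces the prolog; it echoes the encoding argument.
    tree.write(fileobj, encoding='utf-8', xml_declaration=True)


# Made-up sample document, serialized into memory for demonstration.
tree = ET.ElementTree(ET.fromstring('<root><data name="greeting">hello</data></root>'))
buffer = io.BytesIO()
write_xml_with_bom(tree, buffer)
output = buffer.getvalue()
```

With a real file, the same helper would be called as write_xml_with_bom(tree, f) inside with open(lang_resx_fpath, 'wb') as f:. The file must be opened in binary mode so that the BOM and the serialized bytes go through unchanged.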

Related

ÙˆØµÙ„Ù‰ characters showing when writing the text obtained through web scraping into a csv file [duplicate]

I'm attempting to extract article information using the python newspaper3k package and then write to a CSV file. While the info is downloaded correctly, I'm having issues with the output to CSV. I don't think I fully understand unicode, despite my efforts to read about it.
from newspaper import Article, Source
import csv
first_article = Article(url="http://www.bloomberg.com/news/articles/2016-09-07/asian-stock-futures-deviate-as-s-p-500-ends-flat-crude-tops-46")
first_article.download()
if first_article.is_downloaded:
    first_article.parse()
    first_article.nlp()
article_array = []
collate = {}
collate['title'] = first_article.title
collate['content'] = first_article.text
collate['keywords'] = first_article.keywords
collate['url'] = first_article.url
collate['summary'] = first_article.summary
print(collate['content'])
article_array.append(collate)
keys = article_array[0].keys()
with open('bloombergtest.csv', 'w') as output_file:
    csv_writer = csv.DictWriter(output_file, keys)
    csv_writer.writeheader()
    csv_writer.writerows(article_array)
When I print collate['content'], which is first_article.text, the console outputs the article's content just fine. Everything shows up correctly, apostrophes and all. When I write to the CSV, the content cell text has odd characters in it. For example:
â€œAt the end of the day, Europeâ€™s economy isnâ€™t in great shape, inflation doesnâ€™t look exciting and there are a bunch of political risks to reckon with.
So far I have tried:
with open('bloombergtest.csv', 'w', encoding='utf-8') as output_file:
to no avail. I also tried utf-16 instead of utf-8, but that just resulted in the cells being written in an odd order: the cells weren't created correctly in the CSV, although the text itself looked correct. I've also tried calling .encode('utf-8') on various variables, but nothing has worked.
What's going on? Why would the console print the text correctly, while the CSV file has odd characters? How can I fix this?

Add encoding='utf-8-sig' to open(). Excel requires the UTF-8-encoded BOM code point (Byte Order Mark, U+FEFF) signature to interpret a file as UTF-8; otherwise, it assumes the default localized encoding.
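As a minimal sketch of that suggestion, assuming made-up sample rows and a temporary output path (write_csv_for_excel is a hypothetical helper name), the file can be written like this and then checked for the BOM bytes:

```python
import csv
import os
import tempfile

# Made-up sample data containing a typographic apostrophe (U+2019).
rows = [{'title': 'Europe\u2019s economy', 'url': 'http://example.com/a'}]


def write_csv_for_excel(path, rows):
    """Write dict rows as UTF-8 CSV with a leading BOM (hypothetical helper)."""
    # newline='' is what the csv module documentation recommends for csv files.
    with open(path, 'w', encoding='utf-8-sig', newline='') as output_file:
        writer = csv.DictWriter(output_file, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


csv_path = os.path.join(tempfile.mkdtemp(), 'bloombergtest.csv')
write_csv_for_excel(csv_path, rows)

# Read the raw bytes back to confirm the signature is present.
with open(csv_path, 'rb') as f:
    raw = f.read()
```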

Changing with open('bloombergtest.csv', 'w', encoding='utf-8') as output_file: to with open('bloombergtest.csv', 'w', encoding='utf-8-sig') as output_file:, worked, as recommended by Leon and Mark Tolonen.

That's most probably a problem with the software that you use to open or print the CSV file - it doesn't "understand" that CSV is encoded in UTF-8 and assumes ASCII, latin-1, ISO-8859-1 or a similar encoding for it.
You can aid that software in recognizing the CSV file's encoding by placing a BOM sequence in the beginning of your file (which, in general, is not recommended for UTF-8).
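The odd characters in the question can be reproduced directly. The sketch below assumes the viewer decoded the UTF-8 bytes as Windows-1252 (cp1252), which is a common default but is not stated in the question; the sample string is made up to mirror the quoted output.

```python
# A left double quote (U+201C) and a typographic apostrophe (U+2019).
original = '\u201cEurope\u2019s economy'

# What Python wrote to the file.
utf8_bytes = original.encode('utf-8')

# What a program sees if it assumes cp1252 instead of UTF-8.
misread = utf8_bytes.decode('cp1252')

# The damage is reversible as long as no bytes were lost along the way.
repaired = misread.encode('cp1252').decode('utf-8')
```

Here misread begins with the same "â€œ" and contains the same "â€™" sequences shown in the question, which suggests the file itself was fine and only the decoding on the reading side was wrong.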

lxml parsing with python: how to with objectify

I am trying to read xml behind an spss file, I would like to move from etree to objectify.
How can I convert this function below to return an objectify object? I would like to do this because objectify xml object would be easier for me (as a newbie) to work with as it is more pythonic.
def get_etree(path_file):
    from io import StringIO
    from lxml import etree
    with open(path_file, 'r') as f:
        xml_text = f.read()
    recovering_parser = etree.XMLParser(recover=True)
    xml = etree.parse(StringIO(xml_text), parser=recovering_parser)
    return xml
My failed attempt:
def get_etree(path_file):
    from lxml import etree, objectify
    with open(path_file, 'r+') as f:
        xml_text = objectify.fromstring(xml)
    return xml
but I get this error:
lxml.etree.XMLSyntaxError: xmlns:mdm: 'http://www.spss.com/mr/dm/metadatamodel/Arc 3/2000-02-04' is not a valid URI

The first, biggest mistake is to read a file into a string and feed that string to an XML parser.
Python will read the file using whatever your default file encoding is (unless you specify the encoding when you call open()), and that step will very likely break anything other than plain ASCII files.
XML files come in many encodings, you cannot predict them, and you really shouldn't make assumptions about them. XML files solve that problem with the XML declaration.
<?xml version="1.0" encoding="Windows-1252"?>
An XML parser will read that bit of information and configure itself correctly before reading the rest of the file. Make use of that facility. Never use open() and read() for XML files.
Luckily lxml makes it very easy:
from lxml import etree, objectify
def get_etree(path_file):
    return etree.parse(path_file, parser=etree.XMLParser(recover=True))
def get_objectify(path_file):
    return objectify.parse(path_file)
and
path = r"/path/to/your.xml"
xml1 = get_etree(path)
xml2 = get_objectify(path)
print(xml1) # -> <lxml.etree._ElementTree object at 0x02A7B918>
print(xml2) # -> <lxml.etree._ElementTree object at 0x02A7B878>
P.S.: Think hard if you really, positively must use a recovering parser. An XML file is a data structure. If it is broken (syntactically invalid, incomplete, wrongly decoded, you name it), would you really want to trust the (by definition undefined) result of an attempt to read it anyway or would you much rather reject it and display an error message?
I would do the latter. Using a recovering parser may cause nasty run-time errors later.
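To illustrate the point about the XML declaration, here is a small sketch that hands a file path straight to the parser. It uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra packages, and the file content and the ISO-8859-1 encoding are made up for the demonstration.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# A made-up document that is NOT UTF-8: the declaration says ISO-8859-1.
xml_bytes = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>'
    '<root>caf\u00e9</root>'
).encode('iso-8859-1')

with tempfile.NamedTemporaryFile(suffix='.xml', delete=False) as tmp:
    tmp.write(xml_bytes)
    path = tmp.name

try:
    # The parser reads the bytes itself and honors the declared encoding.
    parsed_text = ET.parse(path).getroot().text
finally:
    os.remove(path)
```

No encoding is passed anywhere in the parsing step; the parser picks it up from the declaration, which is the behavior the answer recommends relying on.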

Arabic, Unicode and files in python

I am trying to grab some text written in Arabic from YouTube, write it into a file, and read it again.
The source file to grab the text has:
#!/usr/bin/python
#encoding: utf-8
in the beginning of the file.
Writing the text is done like this:
f.write(comment + '\n' )
The file's contents are readable Arabic, so I assume the previous steps were correct.
But the problem appears when trying to read the contents from the file (and writing them, for example, into another file) like this:
infile = open('data_Pass1/EG', 'rb')  # 'in' is a reserved word, so a different name is used here
out.write(infile.read())
Which results in output file like this:
\xd8\xa7\xd9\x8a\xd9\x87
What is causing this?

In Python 3.x:
# 'in' is a reserved word in Python, so the file objects are named infile and outfile
infile = open('data_Pass1/EG', 'r', encoding='utf-8')
outfile = open('_file_name_', 'w', encoding='utf-8')
In Python 2.x:
import codecs
infile = codecs.open('data_Pass1/EG', 'r', encoding='utf-8')
outfile = codecs.open('_file_name_', 'w', encoding='utf-8')

You're opening the input file in binary ('rb') mode. Open the file to read as text ('r'). The #encoding declaration at the top of a .py file only tells Python how to decode the source file itself; it has no effect on file I/O. So if necessary, pass encoding='utf8' to open() for all your file I/O. In Python 2.7 the built-in open() does not accept an encoding argument, so use io.open() or codecs.open() there.
As Lee Daniel Crocker suggests, you'd probably be better off just opening both input and output files in binary mode ('rb' for the input file, 'wb' for the output) if you're passing the input directly to the output without doing any textual manipulation of it. (Though going by Andy's comment, in Python 2 it's better to open text files in binary mode and do explicit encoding/decoding anyway.)
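The escaped output in the question is simply the UTF-8 byte sequence of the Arabic text shown in its repr form. A short Python 3 sketch, using temporary files in place of the question's data_Pass1/EG path, shows both the decoding and a text-mode copy with an explicit encoding:

```python
import os
import tempfile

# The exact bytes quoted in the question.
raw = b'\xd8\xa7\xd9\x8a\xd9\x87'
decoded = raw.decode('utf-8')  # three Arabic letters

# Temporary stand-ins for the input and output files.
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, 'EG')
dst_path = os.path.join(workdir, 'EG_copy')
with open(src_path, 'wb') as f:
    f.write(raw + b'\n')

# Text-mode copy with an explicit encoding on both sides.
# newline='' keeps line endings untouched on every platform.
with open(src_path, 'r', encoding='utf-8', newline='') as infile, \
        open(dst_path, 'w', encoding='utf-8', newline='') as outfile:
    outfile.write(infile.read())

with open(dst_path, 'rb') as f:
    copied = f.read()
```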

lxml adds urlencoding in xml?

I'll preface this by indicating I'm using Python 2.7.3 (x64) on Windows 7, with lxml 2.3.6.
I have a little, odd problem I'm hoping somebody can help with. I haven't found a solution online; perhaps I'm not searching for the right thing.
Anyway, I have a problem where I'm programmatically building some XML with lxml, then outputting this to a text file. The problem is that lxml is converting carriage returns to the text &#13;, almost like urlencoding - but I'm not using HTML, I'm using XML.
For example, I have a simple text file created in Notepad, like this:
This
is
my
text
I then build some xml and add this text into the xml:
from lxml import etree
textstr = ""
fh = open("mytext.txt", "rb")
for line in fh:
    textstr += line
root = etree.Element("root")
a = etree.SubElement(root, "some_element")
a.text = textstr
print etree.tostring(root)
The problem here is the output of the print looks like this:
<root><some_element>This&#13;
is&#13;
my&#13;
text</some_element></root>
For my purposes the line breaks are fine, but the &#13; elements are not.
What I have been able to figure out is that this is happening because I'm opening the text file in binary mode "rb" (which I actually need to do as my app is indexing a large text file). If I open the file in text mode "r" instead, then the output does not contain &#13; (but of course, then my indexing doesn't work).
I've also tried changing the etree.tostring to:
print etree.tostring(root, method="xml")
However there is no difference in the output.
Now, I CAN dump the xml text to a string then do a replace of the &#13; artifacts; however, I was hoping for a more elegant solution, because the text files I parse are not under my control and I'm worried that other elements of the text file might be converted to url style encoding without my knowledge.
Does anyone know a way of preventing this encoding from happening?

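Windows uses \r\n to represent a line ending; Unix uses \n. The &#13; is the XML character reference for that leftover \r (it is not URL encoding): a literal carriage return would be normalized away when the XML is parsed again, so the serializer escapes it.
The code below removes the \r at the end of the line, if there is one (so it works with Unix text files too). It removes at most one \r, so an \r somewhere else in the line is preserved.
import re
textstr = b""
with open("mytext.txt", "rb") as fh:
    for line in fh:
        # bytes pattern, because the file is opened in binary mode
        textstr += re.sub(br'\r$', b'', line)
print(repr(textstr))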
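The claims about which carriage returns are removed can be checked on a few made-up lines. In this sketch strip_trailing_cr is a hypothetical helper wrapping the same regular expression, written for bytes because the file is read in binary mode:

```python
import re


def strip_trailing_cr(line):
    """Remove a single carriage return sitting directly before the line end."""
    return re.sub(br'\r$', b'', line)


windows_line = strip_trailing_cr(b'This\r\n')  # Windows line ending
unix_line = strip_trailing_cr(b'is\n')         # already a Unix line ending
inner_cr = strip_trailing_cr(b'a\rb\r\n')      # a \r in the middle of the line
double_cr = strip_trailing_cr(b'x\r\r\n')      # two \r before the newline
```

Only the \r immediately before the final newline is dropped; the inner one and the first of the doubled pair stay in place.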

How to update/modify an XML file in python?

I have an XML document that I would like to update after it already contains data.
I thought about opening the XML file in "a" (append) mode. The problem is that the new data will be written after the root closing tag.
How can I delete the last line of a file, then start writing data from that point, and then close the root tag?
Of course I could read the whole file and do some string manipulations, but I don't think that's the best idea..

Using ElementTree:
import xml.etree.ElementTree
# Open original file
et = xml.etree.ElementTree.parse('file.xml')
# Append new tag: <a x='1' y='abc'>body text</a>
new_tag = xml.etree.ElementTree.SubElement(et.getroot(), 'a')
new_tag.text = 'body text'
new_tag.attrib['x'] = '1' # must be str; cannot be an int
new_tag.attrib['y'] = 'abc'
# Write back to file
#et.write('file.xml')
et.write('file_new.xml')
Note: the output is written to file_new.xml so you can experiment; writing back to file.xml would replace the old content.
IMPORTANT: the ElementTree library stores attributes in a dict. Before Python 3.8, the order in which the attributes are listed in the XML text is NOT preserved; they are output in alphabetical order. Since Python 3.8, write() preserves the order in which the attributes were specified.
(Also, comments are removed by the default parser, which I find rather annoying.)
I.e., on Python versions before 3.8, the XML input text <b y='xxx' x='2'>some body</b> will be output as <b x='2' y='xxx'>some body</b> (the attributes are sorted alphabetically).
This means that when committing the original and changed files to a revision control system (such as SVN, CVS, ClearCase, etc.), a diff between the two files may not look pretty.
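On Python 3.8 or newer, comments can be kept by giving the parser a TreeBuilder created with insert_comments=True. The following is a minimal sketch under that version assumption, using a made-up in-memory document instead of file.xml:

```python
import xml.etree.ElementTree as ET

# Made-up input with a comment and attributes in non-alphabetical order.
source = "<root><!-- keep me --><b y='xxx' x='2'>some body</b></root>"

# insert_comments=True (Python 3.8+) makes the builder keep comment nodes.
parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
root = ET.fromstring(source, parser=parser)

# Append a new tag, as in the answer above.
new_tag = ET.SubElement(root, 'a', {'x': '1', 'y': 'abc'})
new_tag.text = 'body text'

result = ET.tostring(root, encoding='unicode')
```

For a file on disk, the same parser object can be passed as ET.parse('file.xml', parser=parser).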

Useful Python XML parsers:
Minidom - functional but limited
ElementTree - decent performance, more functionality
lxml - high-performance in most cases, high functionality including real xpath support
Any of those is better than trying to update the XML file as strings of text.
What that means to you:
Open your file with an XML parser of your choice, find the node you're interested in, replace the value, serialize the file back out.

The quick and easy way, which you definitely should not do (see below), is to read the whole file into a list of strings using readlines(). I write this in case the quick and easy solution is what you're looking for.
Just open the file using open(), then call the readlines() method. What you'll get is a list of all the strings in the file. Now, you can easily add strings before the last element (just add to the list one element before the last). Finally, you can write these back to the file using writelines().
An example might help:
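with open(filename, "r") as my_file:
    lines_of_file = my_file.readlines()
# writelines() does not add newlines, so the new line carries its own
lines_of_file.insert(-1, "This line is added one before the last line\n")
# reopen for writing; a file opened with "r" cannot be written to
with open(filename, "w") as my_file:
    my_file.writelines(lines_of_file)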
The reason you shouldn't be doing this is because, unless you are doing something very quick n' dirty, you should be using an XML parser. This is a library that allows you to work with XML intelligently, using concepts like DOM, trees, and nodes. This is not only the proper way to work with XML, it is also the standard way, making your code both more portable, and easier for other programmers to understand.
Tim's answer mentioned checking out xml.dom.minidom for this purpose, which I think would be a great idea.

While I agree with Tim and Oben Sonne that you should use an XML library, there are ways to still manipulate it as a simple string object.
I would probably not try to use a single file pointer for what you are describing; instead, I would read the file into memory, edit it, then write it out:
inFile = open('file.xml', 'r')
data = inFile.readlines()
inFile.close()
# some manipulation on `data`
outFile = open('file.xml', 'w')
outFile.writelines(data)
outFile.close()

For the modification, you can use the text attribute of an element. Here is a snippet:
import xml.etree.ElementTree as ET
tree = ET.parse('country_data.xml')
root = tree.getroot()
for rank in root.iter('rank'):
    new_rank = int(rank.text) + 1
    rank.text = str(new_rank)
tree.write('output.xml')
The rank in the code is an example of a tag; which tag you use depends on your XML file's contents.
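The same idea can be tried without any file on disk. This sketch assumes a tiny made-up document in place of country_data.xml:

```python
import xml.etree.ElementTree as ET

# Made-up stand-in for country_data.xml.
xml_text = (
    "<data>"
    "<country name='A'><rank>1</rank></country>"
    "<country name='B'><rank>4</rank></country>"
    "</data>"
)

root = ET.fromstring(xml_text)

# Element text is always a string, so convert, change, and convert back.
for rank in root.iter('rank'):
    rank.text = str(int(rank.text) + 1)

new_ranks = [rank.text for rank in root.iter('rank')]
```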

What you really want to do is use an XML parser and append the new elements with the API provided.
Then simply overwrite the file.
The easiest to use would probably be a DOM parser like the one below:
http://docs.python.org/library/xml.dom.minidom.html
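A minimal sketch of appending an element with minidom, assuming a made-up phonebook document held in a string:

```python
from xml.dom import minidom

# Made-up document; minidom.parse(filename) would be used for a real file.
doc = minidom.parseString("<phonebook><entry>Alice</entry></phonebook>")

# Build <entry>Bob</entry> and append it as the last child of the root.
new_entry = doc.createElement("entry")
new_entry.appendChild(doc.createTextNode("Bob"))
doc.documentElement.appendChild(new_entry)

result = doc.documentElement.toxml()
```

To overwrite the file afterwards, the whole document can be serialized with doc.toxml() and written back out.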

To make this process more robust, you could consider using the SAX parser (that way you don't have to hold the whole file in memory), read & write till the end of tree and then start appending.
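One way to sketch the SAX idea is to subclass xml.sax.saxutils.XMLGenerator, copy every event to the output, and emit the new element just before the root's end tag. The class name, the sample document, and the appended values are all made up for illustration; output goes to an in-memory buffer here, whereas a real run would stream to a second file.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLGenerator


class AppendBeforeRootClose(XMLGenerator):
    """Copy SAX events to `out`, adding one element just before the root closes."""

    def __init__(self, out, new_tag, new_text):
        super().__init__(out, encoding='utf-8')
        self._new_tag = new_tag
        self._new_text = new_text
        self._depth = 0  # how many elements are currently open

    def startElement(self, name, attrs):
        self._depth += 1
        super().startElement(name, attrs)

    def endElement(self, name):
        self._depth -= 1
        if self._depth == 0:
            # About to close the root element: emit the new child first.
            super().startElement(self._new_tag, {})
            self.characters(self._new_text)
            super().endElement(self._new_tag)
        super().endElement(name)


source = b"<phonebook><entry>Alice</entry></phonebook>"
out = io.StringIO()
xml.sax.parseString(source, AppendBeforeRootClose(out, "entry", "Bob"))
result = out.getvalue()
```

Because events are written as they arrive, the input is never held in memory as a whole tree, which is the benefit the answer points to.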

You should read the XML file using specific XML modules. That way you can edit the XML document in memory and rewrite your changed XML document into the file.
Here is a quick start: http://docs.python.org/library/xml.dom.minidom.html
There are a lot of other XML utilities, which one is best depends on the nature of your XML file and in which way you want to edit it.

As Edan Maor explained, there is a quick and dirty way to do it (for UTF-16 encoded .xml files), which you should not use for the reasons Edan Maor explained. It can be done with the following Python 2.7 code in case time constraints do not allow you to learn (proper) XML parsers.
Assuming you want to:
1. Delete the last line in the original xml file.
2. Add a line.
3. Substitute a line.
4. Close the root tag.
It worked in python 2.7 modifying an .xml file named "b.xml" located in folder "a", where "a" was located in the "working folder" of python. It outputs the new modified file as "c.xml" in folder "a", without yielding encoding errors (for me) in further use outside of python 2.7.
import io
# domain and user_name are assumed to be defined earlier as strings
pattern = '<Author>'
subst = ' <Author>' + domain + '\\' + user_name + '</Author>\n'
line_index = 0  # set line count to 0 before starting
file = io.open('a/b.xml', 'r', encoding='utf-16')
lines = file.readlines()
file.close()
outFile = open('a/c.xml', 'w')
for line in lines:
    line_index = line_index + 1
    if line_index == len(lines):
        # 1. & 2. drop the last line by not writing it, and add another line in its place
        outFile.writelines("Write extra line here" + '\n')
        # 4. Close root tag:
        outFile.writelines("</phonebook>")  # as in:
        # http://tizag.com/xmlTutorial/xmldocument.php
    else:
        # 3. Substitute a line if it contains the following substring:
        if pattern in line:
            line = subst
            print(line)
        outFile.writelines(line)  # copy all lines from the original xml except the last
outFile.close()
