I open two files in Python, change and replace some of their content, and write the new output into a third file.
My two input files are XML, encoded as 'UTF-8 without BOM', and they contain the German characters Ä, Ö, Ü and ß.
When I open my output XML file in Notepad++, the encoding is not specified (i.e. no encoding is checked in the 'Encoding' menu), and my Ä, Ö, Ü and ß are transformed into something like
ü
When I create the output in Python, I use
with open('file', 'w') as fout:
    fout.write(etree.tostring(tree.getroot()).decode('utf-8'))
What do I have to do instead?
I think this should work:
import codecs

with codecs.open("file.xml", 'w', "utf-8") as fout:
    # do stuff with the file pointer
To write an ElementTree object tree to a file named 'file' using the 'utf-8' character encoding:
tree.write('file', encoding='utf-8')
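As a minimal sketch (the file name and element contents here are made up):

```python
import xml.etree.ElementTree as ET

# Hypothetical tree containing German umlauts
root = ET.Element('gruss')
root.text = 'Grüße und Tschüß'
tree = ET.ElementTree(root)

# encoding='utf-8' makes ElementTree encode the text itself;
# xml_declaration=True also emits the <?xml ...?> header
tree.write('file.xml', encoding='utf-8', xml_declaration=True)

with open('file.xml', 'rb') as f:
    raw = f.read()
```

Notepad++ should then detect the file as UTF-8, since the umlauts are stored as their proper multi-byte UTF-8 sequences.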
When writing raw bytestrings, you want to open the file in binary mode:
with open('file', 'wb') as fout:
    fout.write(xyz)
Otherwise the open call opens the file in text mode and expects unicode strings instead, and will encode them for you.
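For example (the payload value below is a stand-in for whatever etree.tostring() returns):

```python
# utf-8 encoded bytes, as produced by e.g. etree.tostring()
payload = '<a>Grüße</a>'.encode('utf-8')

with open('out.bin', 'wb') as fout:
    fout.write(payload)        # bytes pass through unchanged

with open('out.bin', 'rb') as fin:
    round_trip = fin.read()
```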
Decoding means interpreting encoded bytes (UTF-8, say) to produce Unicode text. If you do want to decode first, specify an encoding when opening the file in text mode:
with open(file, 'w', encoding='utf-8') as fout:
    fout.write(xyz.decode('utf-8'))
If you don't specify an encoding, Python uses a platform-dependent default, which is usually a Bad Thing. Note that since you already have UTF-8 encoded byte strings to start with, decoding and re-encoding them is actually pointless.
Note that Python file operations never transform existing Unicode code points into XML character entities (such as ü); other code you have could do this, but you didn't share that with us.
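A quick way to check that claim (file name is arbitrary):

```python
with open('check.txt', 'w', encoding='utf-8') as f:
    f.write('ü')               # a single code point, U+00FC

with open('check.txt', 'rb') as f:
    raw = f.read()

# The file contains the two-byte UTF-8 sequence C3 BC,
# not a character entity like &#252;
```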
I found Joel Spolsky's article on Unicode invaluable when it comes to understanding encodings and unicode.
Some explanation of xml.etree.ElementTree for Python 2, and of its parse() function. The function takes the source as its first argument, which can be an open file object or a filename. The function creates an ElementTree instance, and then it passes the argument to tree.parse(...), which looks like this:
def parse(self, source, parser=None):
    if not hasattr(source, "read"):
        source = open(source, "rb")
    if not parser:
        parser = XMLParser(target=TreeBuilder())
    while 1:
        data = source.read(65536)
        if not data:
            break
        parser.feed(data)
    self._root = parser.close()
    return self._root
You can guess from the third line that if a filename was passed, the file is opened in binary mode. In that case, if the file content is UTF-8, you are processing elements with UTF-8 encoded binary content, and you should open the output file in binary mode as well.
Another possibility is to use codecs.open(filename, encoding='utf-8') for opening the input file and passing the open file object to xml.etree.ElementTree.parse(...). This way, the ElementTree instance will work with Unicode strings, and you should encode the result to UTF-8 when writing the content back. In that case you can use codecs.open(...) with UTF-8 for writing as well. You can pass the opened output file object to the mentioned tree.write(f), or you can let tree.write(filename, encoding='utf-8') open the file for you.
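For comparison, the same round trip in Python 3, where parse() decodes to str for you (the sample file content is made up):

```python
import xml.etree.ElementTree as ET

# Create a sample UTF-8 input file
with open('in.xml', 'wb') as f:
    f.write('<root><item>Grüße</item></root>'.encode('utf-8'))

# Passing a filename makes ElementTree open the file in binary mode itself
tree = ET.parse('in.xml')
item_text = tree.getroot().find('item').text   # already decoded to str

# Writing with an explicit encoding produces UTF-8 bytes again
tree.write('out.xml', encoding='utf-8')
```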
Related
I have a process which requires a csv to be created if it does not currently exist. I then open this csv and write some data to it.
with open("foo.csv", "w") as my_empty_csv:
    # now you have an empty file already
    # this is where I write my data to the file
This is the code I'm currently using, but I don't know what default encoding the file is created with if it doesn't already exist.
What would be the better way to create a file with UTF-8 encoding if the file doesn't exist?
The open function has an optional 'encoding' parameter that you can use to explicitly specify the encoding of the file:
with open("foo.csv", "w", encoding="utf-8") as my_empty_csv:
    ...
More specifically, the documentation says about this parameter:
encoding is the name of the encoding used to decode or encode the file. This should only be used in text mode. The default encoding is platform dependent (whatever locale.getpreferredencoding() returns), but any text encoding supported by Python can be used. See the codecs module for the list of supported encodings.
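You can inspect that platform-dependent default yourself:

```python
import locale

# The encoding open() falls back to when none is passed explicitly
default_encoding = locale.getpreferredencoding(False)
```

On Windows this is often a legacy code page such as cp1252, which is exactly why passing encoding='utf-8' explicitly is the safer choice.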
You should be able to do it this way.
with open("foo.csv", "w", newline='', encoding='utf-8') as my_empty_csv:
    # other logic
I want to write some strings to a file; they are not in English but in the Azeri language. Even if I use UTF-8 encoding I get the following error:
TypeError: write() argument must be str, not bytes
Even if I change the code to:
t_w = text_list[y].encode('utf-8')
new_file.write(t_w)
new_file.write('\n')
I get following error which is :
TypeError: write() argument must be str, not bytes
The reason why I don't open the file as 'wb' is that I am writing different strings and integers to the file.
If text_list contains Unicode strings, you should encode (not decode) them to byte strings before saving them to the file.
Try this instead:
t_w = text_list[y].encode('utf-8')
Also it could be helpful to look at codecs standard module https://docs.python.org/2/library/codecs.html. You could try this:
import codecs

with codecs.open('path/to/file', 'w', 'utf-8') as f:
    f.write(text_list[y])
    f.write(u'\n')
But note that codecs always opens files in binary mode.
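A round-trip sketch with codecs.open (the Azeri strings here are made-up samples):

```python
import codecs

text_list = [u'Bakı', u'Gəncə']        # hypothetical Azeri strings

with codecs.open('azeri.txt', 'w', 'utf-8') as f:
    for s in text_list:
        f.write(s)
        f.write(u'\n')

with codecs.open('azeri.txt', 'r', 'utf-8') as f:
    lines = f.read().splitlines()
```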
When using write() in text mode in Python 3 (I assume you use only Python 3, not Python 2), the strings are encoded for you, so do not encode them yourself; just pass encoding='utf-8' explicitly rather than relying on the platform-dependent default. Or open your file in binary mode and encode EVERYTHING you write to your file. I suggest NOT using binary mode in your case. So, your code will look like this:
with open('myfile.txt', 'w', encoding='utf-8') as new_file:
    t_w = text_list[y]
    new_file.write(t_w)
    new_file.write('\n')
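Since the question mentions writing both strings and integers, here is a sketch that converts everything to str first (file name and values are made up):

```python
values = ['söz', 42, 'mətn']           # mixed str and int

with open('mixed.txt', 'w', encoding='utf-8') as new_file:
    for v in values:
        new_file.write(str(v))         # integers must be converted explicitly
        new_file.write('\n')

with open('mixed.txt', encoding='utf-8') as f:
    read_back = f.read().splitlines()
```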
or for Python 2:
new_file = open('myfile.txt', 'wb')
t_w = text_list[y].encode('utf-8') # I assume you use Unicode strings
new_file.write(t_w)
new_file.write(b'\n')
new_file.close()
I need to get data from a JSON file to further send it in a POST request. Unfortunately, when I read the file, some unexplained Unicode symbols appear at the beginning.
path = '.\jsons_updated'
newpath = os.path.join(path, 'Totem Plus eT 00078-20140224_060406.ord.txt')
file = open(newpath, 'r')
#data = json.dumps(file.read())
data = file.read()
print('data= ', data)
file.close()
Data in the file starts with this:
{"PriceTableHash": [{"Hash": ...
I get the result:
data= п»ї{"PriceTableHash": [{"Hash": ...
or in case of data = json.dumps(file.read())
data= "\u043f\u00bb\u0457{\"PriceTableHash\": [{
So my request can't process this data.
The odd symbols are the same for all the files I have.
UPD:
If I copy the data manually into a new json or txt file, the problem disappears. But I have about 2.5k files, so that's not an option =)
The command open(newpath, 'r') opens the file with your system's default encoding (whichever that might be). So when you read encoded Unicode data, that mangles the encoding (instead of reading the UTF-8 encoded data with a UTF-8 decoder, Python will try cp1250 or something).
Use codecs.open() instead and specify the correct encoding of the data (i.e. the one which was used when the files were written).
The odd bytes you get look like a BOM header. You may want to change the code which writes those files to omit it and send you pure UTF-8. See also Reading Unicode file data with BOM chars in Python
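In Python 3 the simplest fix is the utf-8-sig codec, which strips the BOM if present (file name and content below recreate the symptom):

```python
import json

# Simulate a JSON file saved with a UTF-8 BOM, as Windows tools often do
with open('data.json', 'w', encoding='utf-8-sig') as f:
    f.write('{"PriceTableHash": [{"Hash": 1}]}')

# Reading with utf-8-sig silently removes the BOM
with open('data.json', 'r', encoding='utf-8-sig') as f:
    data = json.load(f)
```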
Running Windows 8 64-bit. I have a file where I store some data, saved with the UTF-8 encoding using Windows notepad. Supposing this is the content of the file:
1,some,data,here,0,-1
I'm reading it like this:
f = open("file.txt", "rb")
f.read()
f.close()
And f.read() returns this:
u"\xef\xbb\xbf1,some,data,here,0,-1"
I can just use f.read()[3:] but that's not a clean solution.
What are those characters at the beginning of the file?
Those first 3 bytes are the UTF-8 BOM, or Byte Order Mark. UTF-8 doesn't need the BOM (it has a fixed byte order unlike UTF-16 and UTF-32), but many tools (mostly Microsoft's) add it anyway to aid in file-encoding detection.
You can test for it and skip it safely, use codecs.BOM_UTF8 to handle it:
import codecs

data = f.read()
if data.startswith(codecs.BOM_UTF8):
    data = data[3:]
You could also use the io.open() function to open the file and have Python decode the file for you to Unicode, and tell it to use the utf_8_sig codec:
import io

with io.open('file.txt', encoding='utf_8_sig') as f:
    data = f.read()
That's the BOM (byte order mark). In reality, UTF-8 has only one valid byte order, but despite that, this 3-byte sequence can appear at the beginning of the file (or of data in general). If exactly these values appear as the first 3 bytes, ignore them.
Two questions here. I have a set of files which are usually UTF-8 with BOM. I'd like to convert them (ideally in place) to UTF-8 with no BOM. It seems like codecs.StreamRecoder(stream, encode, decode, Reader, Writer, errors) would handle this, but I don't really see any good examples of its usage. Would this be the best way to handle it?
source files:
Tue Jan 17$ file brh-m-157.json
brh-m-157.json: UTF-8 Unicode (with BOM) text
Also, it would be ideal if we could handle different input encodings without explicitly knowing them (I've seen ASCII and UTF-16). It seems like this should all be feasible. Is there a solution that can take any known Python encoding and output as UTF-8 without BOM?
edit 1: proposed sol'n from below (thanks!)
fp = open('brh-m-157.json','rw')
s = fp.read()
u = s.decode('utf-8-sig')
s = u.encode('utf-8')
print fp.encoding
fp.write(s)
This gives me the following error:
IOError: [Errno 9] Bad file descriptor
Newsflash
I'm being told in comments that the mistake is I open the file with mode 'rw' instead of 'r+'/'r+b', so I should eventually re-edit my question and remove the solved part.
Simply use the "utf-8-sig" codec:
fp = open("file.txt")
s = fp.read()
u = s.decode("utf-8-sig")
That gives you a unicode string without the BOM. You can then use
s = u.encode("utf-8")
to get a normal UTF-8 encoded string back in s. If your files are big, then you should avoid reading them all into memory. The BOM is simply three bytes at the beginning of the file, so you can use this code to strip them out of the file:
import os, sys, codecs

BUFSIZE = 4096
BOMLEN = len(codecs.BOM_UTF8)

path = sys.argv[1]
with open(path, "r+b") as fp:
    chunk = fp.read(BUFSIZE)
    if chunk.startswith(codecs.BOM_UTF8):
        i = 0
        chunk = chunk[BOMLEN:]
        while chunk:
            fp.seek(i)
            fp.write(chunk)
            i += len(chunk)
            fp.seek(BOMLEN, os.SEEK_CUR)
            chunk = fp.read(BUFSIZE)
        fp.seek(-BOMLEN, os.SEEK_CUR)
        fp.truncate()
It opens the file, reads a chunk, and writes it out to the file 3 bytes earlier than where it read it. The file is rewritten in place. An easier solution is to write the shorter content to a new file, like in newtover's answer. That would be simpler, but it uses twice the disk space for a short period.
As for guessing the encoding, you can just loop through the encodings from most to least specific:
def decode(s):
    for encoding in "utf-8-sig", "utf-16":
        try:
            return s.decode(encoding)
        except UnicodeDecodeError:
            continue
    return s.decode("latin-1")  # will always work
A UTF-16 encoded file won't decode as UTF-8, so we try UTF-8 first. If that fails, we try UTF-16. Finally, we use Latin-1; this will always work, since all 256 byte values are legal in Latin-1. You may want to return None instead in this case, since it's really a fallback and your code might want to handle it more carefully (if it can).
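A usage sketch (the helper is repeated here so the snippet runs on its own):

```python
def decode(s):
    for encoding in "utf-8-sig", "utf-16":
        try:
            return s.decode(encoding)
        except UnicodeDecodeError:
            continue
    return s.decode("latin-1")  # will always work

utf8_with_bom = 'Grüße'.encode('utf-8-sig')   # starts with EF BB BF
utf16 = 'Grüße'.encode('utf-16')              # starts with FF FE

a = decode(utf8_with_bom)    # BOM stripped, decoded as UTF-8
b = decode(utf16)            # UTF-8 fails on 0xFF, UTF-16 succeeds
```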
In Python 3 it's quite easy: read the file and rewrite it with utf-8 encoding:
s = open(bom_file, mode='r', encoding='utf-8-sig').read()
open(bom_file, mode='w', encoding='utf-8').write(s)
import codecs
import shutil
import sys

s = sys.stdin.read(3)
if s != codecs.BOM_UTF8:
    sys.stdout.write(s)
shutil.copyfileobj(sys.stdin, sys.stdout)
I found this question because I was having trouble with configparser.ConfigParser().read(fp) when opening files with a UTF-8 BOM header.
For those who are looking for a solution to remove the header so that ConfigParser can open the config file instead of reporting the error
File contains no section headers, please open the file like the following:
configparser.ConfigParser().read(config_file_path, encoding="utf-8-sig")
This could save you tons of effort by making removal of the file's BOM header unnecessary.
(I know this sounds unrelated, but hopefully this could help people struggling like me.)
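A self-contained sketch of that ConfigParser call (file name and keys are made up):

```python
import configparser

# Simulate a config file written with a UTF-8 BOM
with open('app.ini', 'w', encoding='utf-8-sig') as f:
    f.write('[section]\nkey = value\n')

parser = configparser.ConfigParser()
# encoding='utf-8-sig' makes read() skip the BOM instead of choking on it
parser.read('app.ini', encoding='utf-8-sig')
value = parser['section']['key']
```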
This is my implementation for converting any kind of encoding to UTF-8 without BOM, replacing Windows line endings with the universal format:
import codecs
import os

import chardet  # third-party: pip install chardet

def utf8_converter(file_path, universal_endline=True):
    '''
    Convert any type of file to UTF-8 without BOM
    and using universal endline by default.

    Parameters
    ----------
    file_path : string, file path.
    universal_endline : boolean (True),
        by default convert endlines to universal format.
    '''
    # Fix file path
    file_path = os.path.realpath(os.path.expanduser(file_path))
    # Read raw bytes from the file
    with open(file_path, 'rb') as file_open:
        raw = file_open.read()
    # Decode using the detected encoding
    raw = raw.decode(chardet.detect(raw)['encoding'])
    # Remove the BOM, which decodes to U+FEFF, if present
    if raw.startswith(u'\ufeff'):
        raw = raw[1:]
    # Remove Windows end lines
    if universal_endline:
        raw = raw.replace('\r\n', '\n')
    # Encode to UTF-8 and write back as bytes
    with open(file_path, 'wb') as file_open:
        file_open.write(raw.encode('utf8'))
    return 0
You can use codecs.
import codecs

with open("test.txt", 'r') as filehandle:
    content = filehandle.read()
if content[:3] == codecs.BOM_UTF8:
    content = content[3:]
print content.decode("utf-8")
In Python 3 you should add encoding='utf-8-sig':
with open(file_name, mode='a', encoding='utf-8-sig') as csvfile:
    csvfile.writelines(rows)
that's it.