I can't create a UTF-8 CSV file in Python.
I'm trying to read its docs, and in the examples section, it says:
For all other encodings the following UnicodeReader and UnicodeWriter classes can be used. They take an additional encoding parameter in their constructor and make sure that the data passes the real reader or writer encoded as UTF-8:
Ok. So I have this code:
values = (unicode("Ñ", "utf-8"), unicode("é", "utf-8"))
f = codecs.open('eggs.csv', 'w', encoding="utf-8")
writer = UnicodeWriter(f)
writer.writerow(values)
And I keep getting this error:
line 159, in writerow
self.stream.write(data)
File "/usr/lib/python2.6/codecs.py", line 686, in write
return self.writer.write(data)
File "/usr/lib/python2.6/codecs.py", line 351, in write
data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 22: ordinal not in range(128)
Can someone please shed some light on what I am doing wrong, since I set the encoding everywhere before calling the UnicodeWriter class?
class UnicodeWriter:
    """
    A CSV writer which will write rows to CSV file "f",
    which is encoded in the given encoding.
    """

    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        # Redirect output to a queue
        self.queue = cStringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        self.writer.writerow([s.encode("utf-8") for s in row])
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)
You don't have to use codecs.open; UnicodeWriter takes Unicode input and takes care of encoding everything into UTF-8. When UnicodeWriter writes into the file handle you passed to it, everything is already in UTF-8 encoding (therefore it works with a normal file you opened with open).
By using codecs.open, you essentially convert your Unicode objects to UTF-8 strings in UnicodeWriter, then try to re-encode these strings into UTF-8 again as if these strings contained Unicode strings, which obviously fails.
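For example, here is a minimal sketch (Python 2) of the working setup, reusing the UnicodeWriter class quoted above:
import csv, codecs, cStringIO          # required by the UnicodeWriter recipe above

values = (u"\u00d1", u"\u00e9")        # the u"Ñ", u"é" values as unicode objects
f = open('eggs.csv', 'wb')             # plain open; no codecs.open
writer = UnicodeWriter(f)
writer.writerow(values)                # UnicodeWriter does the UTF-8 encoding itself
f.close()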
As you have figured out, it works if you use plain open.
The reason is that you tried to encode to UTF-8 twice. Once in
f = codecs.open('eggs.csv', 'w', encoding="utf-8")
and then later in UnicodeWriter.writerow:
# ... and reencode it into the target encoding
data = self.encoder.encode(data)
To check that this works, use your original code and comment out that line.
I ran into the csv/unicode challenge a while back and tossed this up on Bitbucket: http://bitbucket.org/famousactress/dude_csv. It might work for you, if your needs are simple :)
You don't need to "double-encode" everything.
Your application should work entirely in Unicode.
Do your encoding only in the codecs.open to write UTF-8 bytes to an external file. Do no other encoding within your application.
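As a rough sketch of that idea (hand-rolling the CSV line instead of using the recipe above, and assuming the fields contain no delimiters or quotes that would need escaping):
# Python 2: everything stays unicode; codecs.open does the only encoding
import codecs

rows = [(u"\u00d1", u"\u00e9")]
with codecs.open('eggs.csv', 'w', encoding='utf-8') as f:
    for row in rows:
        f.write(u",".join(row) + u"\n")   # unicode in, UTF-8 bytes out via codecs.open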
I am currently reading the documentation for the io module: https://docs.python.org/3.5/library/io.html?highlight=stringio#io.TextIOBase
Maybe it is because I don't know Python well enough, but in most cases I just don't understand their documentation.
I need to save the data in addresses_list to a csv file and serve it to the user via https. So all of this must happen in-memory. This is the code for it and currently it is working fine.
addresses = Abonnent.objects.filter(exemplare__gt=0)
addresses_list = list(addresses.values_list(*fieldnames))
csvfile = io.StringIO()
csvwriter_unicode = csv.writer(csvfile)
csvwriter_unicode.writerow(fieldnames)
for a in addresses_list:
    csvwriter_unicode.writerow(a)
csvfile.seek(0)
export_data = io.BytesIO()
myzip = zipfile.ZipFile(export_data, "w", zipfile.ZIP_DEFLATED)
myzip.writestr("output.csv", csvfile.read())
myzip.close()
csvfile.close()
export_data.close()
# serve the file via https
Now the problem is that I need the content of the csv file to be encoded in cp1252 and not in utf-8. Traditionally I would just write f = open("output.csv", "w", encoding="cp1252") and then dump all the data into it. But with in-memory streams it doesn't work that way: neither io.StringIO() nor io.BytesIO() takes an encoding= parameter.
This is where I have trouble understanding the documentation:
The text stream API is described in detail in the documentation of TextIOBase.
And the documentation of TextIOBase says this:
encoding=
The name of the encoding used to decode the stream’s bytes into strings, and to encode strings into bytes.
But io.StringIO(encoding="cp1252") just throws: TypeError: 'encoding' is an invalid keyword argument for this function.
So how can I use TextIOBase's encoding parameter with StringIO? Or how does this work in general? I am so confused.
StringIO deals only with strings/text. It doesn't know anything about encodings or bytes. The easiest way to do what you want is probably something like:
f = StringIO()
f.write("Some text")
# Old-ish way:
f.seek(0)
my_bytes = f.read().encode("cp1252")
# Alternatively
my_bytes = f.getvalue().encode("cp1252")
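Applied to the zip example from the question, that could look roughly like this (a sketch; the field names are made up, and it assumes every value actually fits in cp1252):
import csv
import io
import zipfile

csvfile = io.StringIO()
writer = csv.writer(csvfile)
writer.writerow(["name", "city"])                  # hypothetical fieldnames
writer.writerow(["Müller", "Köln"])

export_data = io.BytesIO()
with zipfile.ZipFile(export_data, "w", zipfile.ZIP_DEFLATED) as myzip:
    # encode the finished text exactly once, at the boundary
    myzip.writestr("output.csv", csvfile.getvalue().encode("cp1252"))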
Reading text from io.BytesIO (in-memory streams) using io.TextIOWrapper, including encoding and error handling (Python 3). This does what io.StringIO can't.
Sample code:
>>> import io
>>> import chardet
>>> # my bytes, single german umlaut
... bts = b'\xf6'
>>>
>>> # try reading as utf-8 text and on error replace
... my_encoding = 'utf-8'
>>> fh_bytes = io.BytesIO(bts)
>>> fh = io.TextIOWrapper(fh_bytes, encoding=my_encoding, errors='replace')
>>> fh.read()
'�'
>>>
>>> # try reading as utf-8 text with strict error handling
... fh_bytes = io.BytesIO(bts)
>>> fh = io.TextIOWrapper(fh_bytes, encoding=my_encoding, errors='strict')
>>> # catch exception
... try:
...     fh.read()
... except UnicodeDecodeError as err:
...     print('"%s"' % err)
...     # try to get encoding
...     my_encoding = chardet.detect(err.object)['encoding']
...     print("correct encoding is %s" % my_encoding)
...
"'utf-8' codec can't decode byte 0xf6 in position 0: invalid start byte"
correct encoding is windows-1252
>>> # retry with detected encoding
... fh_bytes = io.BytesIO(bts)
>>> fh = io.TextIOWrapper(fh_bytes, encoding=my_encoding, errors='strict')
>>> fh.read()
'ö'
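The same wrapper works in the write direction too, which is what the in-memory cp1252 question above needs; a small sketch:
import csv
import io

buf = io.BytesIO()
# text layer that encodes to cp1252 as it writes into the bytes buffer
wrapper = io.TextIOWrapper(buf, encoding='cp1252', newline='')
writer = csv.writer(wrapper)
writer.writerow(['Müller', 'Köln'])
wrapper.flush()                       # push the encoded text through to buf
cp1252_bytes = buf.getvalue()         # b'M\xfcller,K\xf6ln\r\n'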
I was trying to read a file in Python 2.7, and it was read perfectly. The problem is that when I execute the same program in Python 3.4, this error appears:
'utf-8' codec can't decode byte 0xf2 in position 424: invalid continuation byte
Also, when I run the program on Windows (with Python 3.4), the error doesn't appear. The first line of the document is:
Codi;Codi_lloc_anonim;Nom
and the code of my program is:
def lectdict(filename, colkey, colvalue):
    f = open(filename, 'r')
    D = dict()
    for line in f:
        if line == '\n': continue
        D[line.split(';')[colkey]] = D.get(line.split(';')[colkey], []) + [line.split(';')[colvalue]]
    f.close
    return D

Traduccio = lectdict('Noms_departaments_centres.txt', 1, 2)
In Python2,
f = open(filename,'r')
for line in f:
reads lines from the file as bytes.
In Python3, the same code reads lines from the file as strings. Python3 strings are what Python2 calls unicode objects: bytes decoded according to some encoding. The default encoding in Python3 is utf-8.
The error message
'utf-8' codec can't decode byte 0xf2 in position 424: invalid continuation byte
shows Python3 is trying to decode the bytes as utf-8. Since there is an error, the file apparently does not contain utf-8 encoded bytes.
To fix the problem you need to specify the correct encoding of the file:
with open(filename, encoding=enc) as f:
    for line in f:
If you do not know the correct encoding, you could run this program to simply
try all the encodings known to Python. If you are lucky there will be an
encoding which turns the bytes into recognizable characters. Sometimes more
than one encoding may appear to work, in which case you'll need to check and
compare the results carefully.
# Python3
import pkgutil
import os
import encodings

def all_encodings():
    modnames = set(
        [modname for importer, modname, ispkg in pkgutil.walk_packages(
            path=[os.path.dirname(encodings.__file__)], prefix='')])
    aliases = set(encodings.aliases.aliases.values())
    return modnames.union(aliases)

filename = '/tmp/test'
encodings = all_encodings()
for enc in encodings:
    try:
        with open(filename, encoding=enc) as f:
            # print the encoding and the first 500 characters
            print(enc, f.read(500))
    except Exception:
        pass
Ok, I did the same as #unutbu told me. The result was a lot of encodings; one of them was cp1250. For that reason I changed:
f = open(filename,'r')
to
f = open(filename,'r', encoding='cp1250')
as #triplee suggested. And now I can read my files.
In my case I can't change the encoding because my file is really UTF-8 encoded. But some rows are corrupted and cause the same error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 7092: invalid continuation byte
My solution was to open the file in binary mode:
open(filename, 'rb')
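If you need text rather than raw bytes, another option (a sketch, assuming the file is mostly valid UTF-8) is to tell Python how to handle the bad bytes instead of raising:
# Python 3: substitute undecodable bytes with U+FFFD instead of raising
with open(filename, encoding='utf-8', errors='replace') as f:
    text = f.read()

# or drop them entirely
with open(filename, encoding='utf-8', errors='ignore') as f:
    text = f.read()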
I'm using the following code to unzip and save a CSV file:
with gzip.open(filename_gz) as f:
    file = open(filename, "w");
    output = csv.writer(file, delimiter = ',')
    output.writerows(csv.reader(f, dialect='excel', delimiter = ';'))
Everything seems to work, except for the fact that the first characters in the file are unexpected. Googling around seems to indicate that it is due to BOM in the file.
I've read that encoding the content in utf-8-sig should fix the issue. However, adding:
.read().encode('utf-8-sig')
to f in csv.reader fails with:
File "ckan_gz_datastore.py", line 16, in <module>
output.writerows(csv.reader(f.read().encode('utf-8-sig'), dialect='excel', delimiter = ';'))
File "/usr/lib/python2.7/encodings/utf_8_sig.py", line 15, in encode
return (codecs.BOM_UTF8 + codecs.utf_8_encode(input, errors)[0], len(input))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128)
How can I remove the BOM and just save the content in correct utf-8?
First, you need to decode the file contents, not encode them.
Second, the csv module doesn't like unicode strings in Python 2.7, so having decoded your data you need to convert back to utf-8.
Finally, csv.reader should be passed an iterable of the lines of the file, not one big string with linebreaks in it.
So:
csv.reader(f.read().decode('utf-8-sig').encode('utf-8').splitlines())
However, you might consider it simpler / more efficient just to remove the BOM manually:
def remove_bom(line):
    return line[3:] if line.startswith(codecs.BOM_UTF8) else line

csv.reader((remove_bom(line) for line in f), dialect = 'excel', delimiter = ';')
That is subtly different, since it removes a BOM from any line that starts with one, instead of just the first line. If you don't need to keep other BOMs that's OK, otherwise you can fix it with:
def remove_bom_from_first(iterable):
    f = iter(iterable)
    firstline = next(f, None)
    if firstline is not None:
        yield remove_bom(firstline)
    for line in f:
        yield line
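With that helper, the reader call from the question would become something like:
reader = csv.reader(remove_bom_from_first(f), dialect='excel', delimiter=';')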
This is the code I am using in order to replace special characters in text files and concatenate them into a single file.
# -*- coding: utf-8 -*-
import os
import codecs

dirpath = "C:\\Users\\user\\path\\to\\textfiles"
filenames = os.listdir(dirpath)

with codecs.open(r'C:\Users\user\path\to\output.txt', 'w', encoding='utf8') as outfile:
    for fname in filenames:
        currentfile = dirpath+"\\"+fname
        with codecs.open(currentfile, encoding='utf8') as infile:
            #print currentfile
            outfile.write(fname)
            outfile.write('\n')
            outfile.write('\n')
            for line in infile:
                line = line.replace(u"´ı", "i")
                line = line.replace(u"ï¬", "fi")
                line = line.replace(u"fl", "fl")
                outfile.write (line)
The first line.replace works fine while the others do not (which makes sense), and since no errors were generated, I thought there might be a problem of "visibility" (if that's the term). And so I made this:
import codecs
currentfile = 'textfile.txt'
with codecs.open('C:\\Users\\user\\path\\to\\output2.txt', 'w', encoding='utf-8') as outfile:
    with open(currentfile) as infile:
        for line in infile:
            if "ï¬" not in line: print "not found!"
which always returns "not found!", proving that those characters aren't read.
When changing to with codecs.open('C:\Users\user\path\to\output.txt', 'w', encoding='utf-8') as outfile: in the first script, I get this error:
Traceback (most recent call last):
  File "C:\path\to\concat.py", line 30, in <module>
    outfile.write(line)
  File "C:\Python27\codecs.py", line 691, in write
    return self.writer.write(data)
  File "C:\Python27\codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe7 in position 0: ordinal not in range(128)
Since I am not really experienced in Python, I can't figure it out from the sources already available: the Python documentation (1, 2) and relevant questions on Stack Overflow (1, 2).
I am stuck here. Any suggestions? All answers are welcome!
There is no point in using codecs.open() if you don't use an encoding. Either use codecs.open() with an encoding specified for both reading and writing, or forgo it completely. Without an encoding, codecs.open() is an alias for just open().
Here you really do want to specify the codec of the file you are opening, to process Unicode values. You should also use unicode literal values when straying beyond ASCII characters; specify a source file encoding or use unicode escape codes for your data:
# -*- coding: utf-8 -*-
import os
import codecs

dirpath = u"C:\\Users\\user\\path\\to\\textfiles"
filenames = os.listdir(dirpath)

with codecs.open(r'C:\Users\user\path\to\output.txt', 'w', encoding='utf8') as outfile:
    for fname in filenames:
        currentfile = os.path.join(dirpath, fname)
        with codecs.open(currentfile, encoding='utf8') as infile:
            outfile.write(fname + '\n\n')
            for line in infile:
                line = line.replace(u"´ı", u"i")
                line = line.replace(u"ï¬", u"fi")
                line = line.replace(u"fl", u"fl")
                outfile.write(line)
This tells the interpreter that you used the UTF-8 codec to save your source file, ensuring that the u"´ı" code points are correctly decoded to Unicode values. Specifying an encoding when opening files with codecs.open() makes sure that the lines you read are decoded to Unicode values, and that your Unicode values are written out to the output file as UTF-8.
Note that the dirpath value is a Unicode value as well. If you use a Unicode path, then os.listdir() returns Unicode filenames, which is essential if you have any non-ASCII characters in those filenames.
If you do not do all this, chances are your source code encoding does not match the data you read from the file, and you are trying to replace the wrong set of encoded bytes with a few ASCII characters.
AFAIK, the Python (v2.6) csv module can't handle unicode data by default, correct? In the Python docs there's an example on how to read from a UTF-8 encoded file. But this example only returns the CSV rows as a list.
I'd like to access the row columns by name, as csv.DictReader does, but with a UTF-8 encoded CSV input file.
Can anyone tell me how to do this in an efficient way? I will have to process CSV files that are hundreds of megabytes in size.
I came up with an answer myself:
def UnicodeDictReader(utf8_data, **kwargs):
    csv_reader = csv.DictReader(utf8_data, **kwargs)
    for row in csv_reader:
        yield {unicode(key, 'utf-8'): unicode(value, 'utf-8') for key, value in row.iteritems()}
Note: This has been updated so keys are decoded per the suggestion in the comments
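A usage sketch (Python 2; the file name and the 'name' column are just examples), assuming a UTF-8 encoded CSV with a header row:
with open('data.csv', 'rb') as f:               # byte strings in, as csv expects
    for row in UnicodeDictReader(f):
        print row[u'name']                      # keys and values are unicode objects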
For me, the key was not in manipulating the csv DictReader args, but the file opener itself. This did the trick:
with open(filepath, mode="r", encoding="utf-8-sig") as csv_file:
    csv_reader = csv.DictReader(csv_file)
No special class required. Now I can open files either with or without BOM without crashing.
First of all, use the 2.6 version of the documentation. It can change for each release. It says clearly that it doesn't support Unicode but it does support UTF-8. Technically, these are not the same thing. As the docs say:
The csv module doesn’t directly support reading and writing Unicode, but it is 8-bit-clean save for some problems with ASCII NUL characters. So you can write functions or classes that handle the encoding and decoding for you as long as you avoid encodings like UTF-16 that use NULs. UTF-8 is recommended.
The example below (from the docs) shows how to create two functions that correctly read UTF-8 encoded text as CSV. Note that csv.reader() returns each row as a plain list of cells; if you need dict-style access, see the DictReader-based answers here.
import csv

def unicode_csv_reader(unicode_csv_data, dialect=csv.excel, **kwargs):
    # csv.py doesn't do Unicode; encode temporarily as UTF-8:
    csv_reader = csv.reader(utf_8_encoder(unicode_csv_data),
                            dialect=dialect, **kwargs)
    for row in csv_reader:
        # decode UTF-8 back to Unicode, cell by cell:
        yield [unicode(cell, 'utf-8') for cell in row]

def utf_8_encoder(unicode_csv_data):
    for line in unicode_csv_data:
        yield line.encode('utf-8')
A class-based approach to #LMatter's answer. With this approach you still get all the benefits of DictReader, such as getting the fieldnames and the line number, plus it handles UTF-8:
import csv
class UnicodeDictReader(csv.DictReader, object):
    def next(self):
        row = super(UnicodeDictReader, self).next()
        return {unicode(key, 'utf-8'): unicode(value, 'utf-8') for key, value in row.iteritems()}
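Usage is then the same as with a plain DictReader (Python 2; 'data.csv' is just an example name, assumed to be UTF-8 encoded):
with open('data.csv', 'rb') as f:
    reader = UnicodeDictReader(f)
    print reader.fieldnames            # still available from the header row
    for row in reader:
        print reader.line_num, row     # row keys and values decoded to unicode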
That's easy with the unicodecsv package.
# pip install unicodecsv
import unicodecsv as csv
with open('your_file.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        print(row)
The csvw package has other functionality as well (for metadata-enriched CSV for the Web), but it defines a UnicodeDictReader class wrapping around its UnicodeReader class, which at its core does exactly that:
class UnicodeReader(Iterator):
    """Read Unicode data from a csv file."""
    […]

    def _next_row(self):
        self.lineno += 1
        return [
            s if isinstance(s, text_type) else s.decode(self._reader_encoding)
            for s in next(self.reader)]
It did catch me out a few times, but csvw.UnicodeDictReader really, really needs to be used in a with block and breaks otherwise. Other than that, the module is nicely generic and compatible with both py2 and py3.
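From that description, usage would look roughly like the sketch below; the exact import path and constructor arguments are assumptions on my part, so check the csvw docs before relying on them:
from csvw.dsv import UnicodeDictReader   # import path is an assumption

# must be used as a context manager (see the note above)
with UnicodeDictReader('data.csv', encoding='utf-8') as reader:
    for row in reader:
        print(row)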
The answer doesn't have the DictWriter methods, so here is the updated class:
class DictUnicodeWriter(object):

    def __init__(self, f, fieldnames, dialect=csv.excel, encoding="utf-8", **kwds):
        self.fieldnames = fieldnames    # list of keys for the dict
        # Redirect output to a queue
        self.queue = cStringIO.StringIO()
        self.writer = csv.DictWriter(self.queue, fieldnames, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        self.writer.writerow({k: v.encode("utf-8") for k, v in row.items()})
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)

    def writeheader(self):
        header = dict(zip(self.fieldnames, self.fieldnames))
        self.writerow(header)
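A usage sketch for the class (Python 2; the file name and field names are made up):
import csv, codecs, cStringIO          # required by the class above

fieldnames = ['name', 'city']
rows = [{'name': u'\u00d1and\u00fa', 'city': u'Bogot\u00e1'}]

f = open('out.csv', 'wb')
writer = DictUnicodeWriter(f, fieldnames)
writer.writeheader()
writer.writerows(rows)
f.close()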