AFAIK, the Python (v2.6) csv module can't handle unicode data by default, correct? In the Python docs there's an example on how to read from a UTF-8 encoded file. But this example only returns the CSV rows as a list.
I'd like to access the row columns by name as it is done by csv.DictReader but with UTF-8 encoded CSV input file.
Can anyone tell me how to do this in an efficient way? I will have to process CSV files that are hundreds of megabytes in size.
I came up with an answer myself:
import csv

def UnicodeDictReader(utf8_data, **kwargs):
    csv_reader = csv.DictReader(utf8_data, **kwargs)
    for row in csv_reader:
        yield {unicode(key, 'utf-8'): unicode(value, 'utf-8') for key, value in row.iteritems()}
Note: This has been updated so keys are decoded per the suggestion in the comments
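For completeness, a minimal usage sketch (Python 2), assuming a UTF-8 encoded file named data.csv with a hypothetical name column:

with open('data.csv', 'rb') as f:
    for row in UnicodeDictReader(f):
        print row[u'name']  # keys and values are now unicode objects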
For me, the key was not in manipulating the csv DictReader args, but the file opener itself. This did the trick:
with open(filepath, mode="r", encoding="utf-8-sig") as csv_file:
    csv_reader = csv.DictReader(csv_file)
No special class required. Now I can open files either with or without BOM without crashing.
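A minimal sketch of the full loop (Python 3); the filename and column name are placeholders, and newline="" is added because the csv docs recommend it when opening files for csv:

import csv

with open("data.csv", mode="r", encoding="utf-8-sig", newline="") as csv_file:
    csv_reader = csv.DictReader(csv_file)
    for row in csv_reader:
        print(row["name"])  # hypothetical column name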
First of all, use the 2.6 version of the documentation, since it can change with each release. It says clearly that the module doesn't support Unicode, but it does support UTF-8. Technically, these are not the same thing. As the docs say:
The csv module doesn’t directly support reading and writing Unicode, but it is 8-bit-clean save for some problems with ASCII NUL characters. So you can write functions or classes that handle the encoding and decoding for you as long as you avoid encodings like UTF-16 that use NULs. UTF-8 is recommended.
The example below (from the docs) shows how to create two functions that correctly read UTF-8 text as CSV. Note that csv.reader() returns each row as a plain list; to access columns by name you would still combine this with a DictReader-style approach, as in the other answers.
import csv

def unicode_csv_reader(unicode_csv_data, dialect=csv.excel, **kwargs):
    # csv.py doesn't do Unicode; encode temporarily as UTF-8:
    csv_reader = csv.reader(utf_8_encoder(unicode_csv_data),
                            dialect=dialect, **kwargs)
    for row in csv_reader:
        # decode UTF-8 back to Unicode, cell by cell:
        yield [unicode(cell, 'utf-8') for cell in row]

def utf_8_encoder(unicode_csv_data):
    for line in unicode_csv_data:
        yield line.encode('utf-8')
A class-based approach to @LMatter's answer. With this approach you still get all the benefits of DictReader, such as getting the fieldnames and the line number, plus it handles UTF-8:
import csv

class UnicodeDictReader(csv.DictReader, object):
    def next(self):
        row = super(UnicodeDictReader, self).next()
        return {unicode(key, 'utf-8'): unicode(value, 'utf-8') for key, value in row.iteritems()}
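A short usage sketch (Python 2); the file and column names are hypothetical:

with open('data.csv', 'rb') as f:
    reader = UnicodeDictReader(f)
    print reader.fieldnames  # header is still byte strings; values are decoded row by row
    for row in reader:
        print reader.line_num, row[u'name']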
That's easy with the unicodecsv package.
# pip install unicodecsv
import unicodecsv as csv

with open('your_file.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        print(row)
The csvw package has other functionality as well (for metadata-enriched CSV for the Web), but it defines a UnicodeDictReader class wrapping around its UnicodeReader class, which at its core does exactly that:
class UnicodeReader(Iterator):
    """Read Unicode data from a csv file."""

    […]

    def _next_row(self):
        self.lineno += 1
        return [
            s if isinstance(s, text_type) else s.decode(self._reader_encoding)
            for s in next(self.reader)]
It did catch me off guard a few times, though: csvw.UnicodeDictReader really, really needs to be used in a with block and breaks otherwise. Other than that, the module is nicely generic and compatible with both py2 and py3.
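For reference, a hedged sketch of how that reader is typically used; the csvw.dsv import path and the encoding keyword are assumptions based on the package documentation, so check them against the version you install:

from csvw.dsv import UnicodeDictReader  # assumed import path

with UnicodeDictReader('your_file.csv', encoding='utf-8') as reader:
    for row in reader:
        print(row)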
The answer doesn't have the DictWriter methods, so here is the updated class:
import codecs
import cStringIO
import csv

class DictUnicodeWriter(object):

    def __init__(self, f, fieldnames, dialect=csv.excel, encoding="utf-8", **kwds):
        self.fieldnames = fieldnames  # list of keys for the dict
        # Redirect output to a queue
        self.queue = cStringIO.StringIO()
        self.writer = csv.DictWriter(self.queue, fieldnames, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        self.writer.writerow({k: v.encode("utf-8") for k, v in row.items()})
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)

    def writeheader(self):
        header = dict(zip(self.fieldnames, self.fieldnames))
        self.writerow(header)
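A minimal usage sketch (Python 2) with made-up column names:

with open('out.csv', 'wb') as f:
    writer = DictUnicodeWriter(f, fieldnames=[u'name', u'city'])
    writer.writeheader()
    writer.writerow({u'name': u'Ñandú', u'city': u'Matalascañas'})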
The goal is to write code compatible with both Python 2.7 and Python >= 3.6.
The code below currently works on Python 2.7. It creates a GzipFile object and later writes lists to the gzip file as CSV rows. Finally, it uploads the gzip file to an S3 bucket.
Example Data: [[1, 2, 3], [4, 5, 6], ["a", 3, "iamastring"]]
def get_gzip_writer(path):
    with s3_reader.open(path) as s3_file:
        with gzip.GzipFile(fileobj=s3_file, mode="w") as gzip_file:
            yield csv.writer(gzip_file)
However, this code does not work on python3 due to csv giving str whereas gzip expects bytes. It's important to keep gzip in bytes due to how it's used/read later on. That means using io.TextIOWrapper does not work in this specific use case.
I have tried to create an adapter class.
import csv
import six

class BytesToBytes(object):
    def __init__(self, stream, dialect=csv.excel, encoding='utf-8', **kwargs):
        self.temp = six.StringIO()
        self.writer = csv.writer(self.temp, dialect, **kwargs)
        self.stream = stream
        self.encoding = encoding

    def writerow(self, row):
        self.writer.writerow([s.decode('utf-8') if hasattr(s, 'decode') else s for s in row])
        self.stream.write(six.ensure_binary(self.temp.getvalue(), self.encoding))
        self.temp.seek(0)
        self.temp.truncate(0)
With the updated code looking like:
def get_gzip_writer(path):
    with s3_reader.open(path) as s3_file:
        with gzip.GzipFile(fileobj=s3_file, mode="w") as gzip_file:
            yield BytesToBytes(gzip_file)
This works, but it seems excessive to have a full class for the purpose of this singular use case.
This is the code that calls the above:
def write_data(data, url):
    with get_gzip_writer(url) as writer:
        for row in data:
            writer.writerow(row)
    return url
What options are available for working with GzipFile (while maintaining bytes for read/write) without creating an entire adapter class?
I've read and considered your concern with keeping the GZip file in binary mode, and I think you can still use TextIOWrapper. My understanding is that its job is to provide an interface for writing bytes from text (my emphasis):
A buffered text stream providing higher-level access to a BufferedIOBase buffered binary stream.
I interpret that as "text in, bytes out"... which is what your GZip application needs, right? If so, then for Python3 we need to give the CSV writer something that accepts strings but ultimately writes bytes.
Enter TextIOWrapper with a UTF-8 encoding, accepting strings from csv.writer's writerow/s() methods and writing UTF-8-encoded bytes to gzip_file.
I've run this in Python2 and 3, and unzipped the file and it looks good:
import csv, gzip, io, six

def get_gzip_writer(path):
    with open(path, 'wb') as s3_file:
        with gzip.GzipFile(fileobj=s3_file, mode='wb') as gzip_file:
            if six.PY3:
                with io.TextIOWrapper(gzip_file, encoding='utf-8') as wrapper:
                    yield csv.writer(wrapper)
            elif six.PY2:
                yield csv.writer(gzip_file)
            else:
                raise ValueError('Neither Python2 or 3?!')

data = [[1, 2, 3], ['a', 'b', 'c']]
url = 'output.gz'

for writer in get_gzip_writer(url):
    for row in data:
        writer.writerow(row)
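If you'd rather keep the question's original "with get_gzip_writer(url) as writer:" calling convention, the same generator can be wrapped with contextlib.contextmanager. This is a sketch along the same lines, not part of the answer above:

import contextlib
import csv, gzip, io, six

@contextlib.contextmanager
def get_gzip_writer(path):
    with open(path, 'wb') as s3_file:
        with gzip.GzipFile(fileobj=s3_file, mode='wb') as gzip_file:
            if six.PY3:
                # wrap the binary gzip stream so csv.writer can hand it text
                with io.TextIOWrapper(gzip_file, encoding='utf-8') as wrapper:
                    yield csv.writer(wrapper)
            else:
                yield csv.writer(gzip_file)

with get_gzip_writer('output.gz') as writer:
    writer.writerow([1, 2, 3])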
I have read the documentation and a few additional posts on SO and other various places, but I can't quite figure out this concept:
When you call csvFilename = gzip.open(filename, 'rb') then reader = csv.reader(open(csvFilename)), is that reader not a valid csv file?
I am trying to solve the problem outlined below, and am getting a coercing to Unicode: need string or buffer, GzipFile found error on line 41 and 7 (highlighted below), leading me to believe that the gzip.open and csv.reader do not work as I had previously thought.
Problem I am trying to solve
I am trying to take a results.csv.gz and convert it to a results.csv so that I can turn the results.csv into a python dictionary and then combine it with another python dictionary.
File 1:
alertFile = payload.get('results_file')
alertDataCSV = rh.dataToDict(alertFile) # LINE 41
alertDataTotal = rh.mergeTwoDicts(splunkParams, alertDataCSV)
Calls File 2:
import gzip
import csv

def dataToDict(filename):
    csvFilename = gzip.open(filename, 'rb')
    reader = csv.reader(open(csvFilename))  # LINE 7
    alertData = {}
    for row in reader:
        alertData[row[0]] = row[1:]
    return alertData

def mergeTwoDicts(dictA, dictB):
    dictC = dictA.copy()
    dictC.update(dictB)
    return dictC
*edit: also forgive my non-PEP style of naming in Python
gzip.open returns a file-like object (the same kind of thing plain open returns), not the name of a decompressed file. Simply pass the result directly to csv.reader and it will work (csv.reader will receive the decompressed lines). csv does expect text though, so on Python 3 you need to open the gzip file in text mode (on Python 2, 'rb' is fine; the gzip module doesn't deal with encodings there, but then neither does the csv module). Simply change:
csvFilename = gzip.open(filename, 'rb')
reader = csv.reader(open(csvFilename))
to:
# Python 2
csvFile = gzip.open(filename, 'rb')
reader = csv.reader(csvFile) # No reopening involved
# Python 3
csvFile = gzip.open(filename, 'rt', newline='') # Open in text mode, not binary, no line ending translation
reader = csv.reader(csvFile) # No reopening involved
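Putting that into the question's dataToDict, a sketch of the corrected function (Python 2 shown; for Python 3 open with 'rt' as above):

import csv
import gzip

def dataToDict(filename):
    csvFile = gzip.open(filename, 'rb')   # file-like object, passed straight to csv.reader
    reader = csv.reader(csvFile)
    alertData = {}
    for row in reader:
        alertData[row[0]] = row[1:]
    return alertData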
The following worked for me on Python 3.7.9:

import gzip

my_filename = 'my_compressed_file.csv.gz'

with gzip.open(my_filename, 'rt') as gz_file:
    data = gz_file.read()  # read decompressed data

with open(my_filename[:-3], 'wt') as out_file:
    out_file.write(data)  # write decompressed data
my_filename[:-3] strips the trailing .gz so that the output keeps the original filename rather than getting an arbitrary one.
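For very large files, a sketch that streams the decompressed data in chunks instead of holding it all in memory:

import gzip
import shutil

# copyfileobj copies block by block, so memory use stays constant
with gzip.open(my_filename, 'rb') as gz_file, open(my_filename[:-3], 'wb') as out_file:
    shutil.copyfileobj(gz_file, out_file)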
I have read every thread related to unicode reading, but I can't seem to get it to work.
I'm trying to read a CSV which happens to have a UTF-8 BOM signature on it and is otherwise UTF-8 encoded.
So, after opening the file and reading it with the unicodecsv library, I've tried different things.
def _extract_gz(self):  # fd
    logging.info("Gz detected")
    self.fp = gzip.open(self.path)
    return unicodecsv.reader(self.fp.read().decode('utf-8-sig').splitlines(), encoding='utf-8')
Still fails at row 226. UnicodeEncodeError: 'ascii' codec can't encode character u'\xf1' in position 226: ordinal not in range(128)
Also tried this approach but failed as well.
def _extract_gz(self):  # fd
    logging.info("Gz detected")
    self.fp = gzip.open(self.path)
    self.f = self.unicode_csv_reader()
    return self.f

def unicode_csv_reader(self):
    csv_reader = csv.reader(self.fp.read().decode('utf-8-sig').splitlines())
    for row in csv_reader:
        yield [cell.encode('utf-8', 'replace') for cell in row]
What am I doing wrong?
Thanks everyone.
Version is Python 2.7.12
The built-in csv module does not support Unicode (assuming Python 2.x), but there is a drop-in replacement unicodecsv module which does (and which you've apparently tried, unsuccessfully) and it should be fairly straightforward:
import gzip
import unicodecsv as csv

def read_csv(filename, has_bom=True, **kwargs):
    with gzip.open(filename, "r") as f:
        if has_bom:
            f.seek(3)  # skip the 3-byte UTF-8 BOM
        reader = csv.reader(f, **kwargs)
        for row in reader:
            yield row

for row in read_csv("path/to/your.csv.gz", delimiter=";"):  # the file is semicolon-delimited
    print(row)  # or do whatever you want with it
Should do the trick.
UPDATE - The above code works with your uploaded file and doesn't throw any errors (since your file is semicolon-delimited I've added that as well). However, there is a bug in the unicodecsv module - it doesn't remove the quotes around the first column name when parsing a file with a BOM, so I've updated the code to reflect that.
When running it on your uploaded file you get the following output (YMMV, depends how your console prints unicode):
[u'Name', u'Ref', u'POS', u'POS', u'Status', u'City', u'']
[u'Hotel Flamero', u'3365', u'ES', u'0.27', u'No Change', u'Matalascañas', u'']
(the last empty entry is due to your CSV having the last entry as empty)
UPDATE#2 - Don't have a MySQL instance at hand, but you can check that it parses just fine using an in-memory SQLite DB:
import sqlite3

db = sqlite3.connect(":memory:")  # create an in-memory DB
c = db.cursor()
c.execute("CREATE TABLE test (Name TEXT, Ref TEXT, POS TEXT, Status TEXT, City TEXT)")

header = None
for row in read_csv("path/to/your.csv.gz", delimiter=";"):
    del row[-1]  # remove the last element as it's always empty
    if header is None:  # get the header first
        header = row
        continue
    query = u"INSERT INTO test ({}) VALUES ({})".format(
        u", ".join(header),
        u", ".join(u"'{}'".format(column) for column in row)  # quote each column entry
    )
    c.execute(query)

# now let's read our data from the DB
c.execute("SELECT * FROM test")
for row in c.fetchall():
    print(row)
which happily prints:
(u'Hotel Flamero', u'3365', u'ES', u'No Change', u'Matalascañas')
The following code worked until today, when I imported a file from a Windows machine and got this error:
new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
import csv

class CSV:
    def __init__(self, file=None):
        self.file = file

    def read_file(self):
        data = []
        file_read = csv.reader(self.file)
        for row in file_read:
            data.append(row)
        return data

    def get_row_count(self):
        return len(self.read_file())

    def get_column_count(self):
        new_data = self.read_file()
        return len(new_data[0])

    def get_data(self, rows=1):
        data = self.read_file()
        return data[:rows]
How can I fix this issue?
def upload_configurator(request, id=None):
    """
    A view that allows the user to configure the uploaded CSV.
    """
    upload = Upload.objects.get(id=id)
    csvobject = CSV(upload.filepath)

    upload.num_records = csvobject.get_row_count()
    upload.num_columns = csvobject.get_column_count()
    upload.save()

    form = ConfiguratorForm()

    row_count = csvobject.get_row_count()
    colum_count = csvobject.get_column_count()
    first_row = csvobject.get_data(rows=1)
    first_two_rows = csvobject.get_data(rows=5)
It would be good to see the CSV file itself, but this might work for you. Give it a try - replace:
file_read = csv.reader(self.file)
with:
file_read = csv.reader(self.file, dialect=csv.excel_tab)
Or, open a file with universal newline mode and pass it to csv.reader, like:
reader = csv.reader(open(self.file, 'rU'), dialect=csv.excel_tab)
Or, use splitlines(), like this:
def read_file(self):
    with open(self.file, 'r') as f:
        data = [row for row in csv.reader(f.read().splitlines())]
    return data
I realize this is an old post, but I ran into the same problem and don't see the correct answer, so I will give it a try.
Python Error:
_csv.Error: new-line character seen in unquoted field
Caused by trying to read Macintosh (pre OS X formatted) CSV files. These are text files that use CR for end of line. If using MS Office make sure you select either plain CSV format or CSV (MS-DOS). Do not use CSV (Macintosh) as save-as type.
My preferred EOL version would be LF (Unix/Linux/Apple), but I don't think MS Office provides the option to save in this format.
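If re-saving isn't an option, a sketch of reading such a file directly in Python 2 with universal-newline mode, which translates the bare CR line endings (the filename is hypothetical):

import csv

# 'rU' turns lone CR characters into normal newlines before csv sees them
with open('mac_export.csv', 'rU') as f:
    for row in csv.reader(f):
        print row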
For Mac OS X, save your CSV file in "Windows Comma Separated (.csv)" format.
If this happens to you on mac (as it did to me):
Save the file as CSV (MS-DOS Comma-Separated)
Run the following script
import csv

with open(csv_filename, 'rU') as csvfile:
    csvreader = csv.reader(csvfile)
    for row in csvreader:
        print ', '.join(row)
Try running dos2unix on your Windows-imported files first.
This is an error that I faced. I had saved the .csv file on Mac OS X.
While saving, save it as "Windows Comma Separated Values (.csv)", which resolved the issue.
This worked for me on OSX.
# allow strings to be opened as file-like objects
from io import StringIO
# library to map strange (accented) characters back into plain ASCII
from unidecode import unidecode

import csv

# cleanse input file with Windows formatting into a plain string
with open(filename, 'rb') as fID:
    uncleansedBytes = fID.read()

# decode the file using the correct encoding scheme
# (probably this old Windows one)
uncleansedText = uncleansedBytes.decode('Windows-1252')

# replace carriage-returns with new-lines
cleansedText = uncleansedText.replace('\r', '\n')

# map any remaining non-ASCII characters into ASCII
asciiText = unidecode(cleansedText)

# read each line of the csv file and store as an array of dicts,
# using the first line as field names for each dict
reader = csv.DictReader(StringIO(asciiText))
for line_entry in reader:
    # do something with your read data
    pass
I know this has been answered for quite some time, but it did not solve my problem. I am using DictReader and StringIO for my CSV reading due to some other complications. I was able to solve the problem more simply by fixing the line endings explicitly:
import urllib.request

with urllib.request.urlopen(q) as response:
    raw_data = response.read()
    encoding = response.info().get_content_charset('utf8')
    data = raw_data.decode(encoding)

if '\r\n' not in data:
    # probably a Windows-delimited thing... try to update it
    data = data.replace('\r', '\r\n')
Might not be reasonable for enormous CSV files, but worked well for my use case.
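Continuing that snippet, a sketch of feeding the normalized text into DictReader via StringIO; the column name here is made up:

import csv
from io import StringIO

reader = csv.DictReader(StringIO(data))
for row in reader:
    print(row['name'])  # hypothetical column name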
Alternative and fast solution: I faced the same error. I reopened the "weird" CSV file in Gnumeric on my Lubuntu machine and exported the file as a CSV file again. This corrected the issue.
I can't create a UTF-8 CSV file in Python.
I'm trying to read the csv module's docs, and in the examples section, it says:
For all other encodings the following UnicodeReader and UnicodeWriter classes can be used. They take an additional encoding parameter in their constructor and make sure that the data passes the real reader or writer encoded as UTF-8:
Ok. So I have this code:
values = (unicode("Ñ", "utf-8"), unicode("é", "utf-8"))
f = codecs.open('eggs.csv', 'w', encoding="utf-8")
writer = UnicodeWriter(f)
writer.writerow(values)
And I keep getting this error:
line 159, in writerow
self.stream.write(data)
File "/usr/lib/python2.6/codecs.py", line 686, in write
return self.writer.write(data)
File "/usr/lib/python2.6/codecs.py", line 351, in write
data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 22: ordinal not in range(128)
Can someone please shed some light so I can understand what the hell I am doing wrong, since I set the encoding everywhere before calling the UnicodeWriter class?
import codecs
import cStringIO
import csv

class UnicodeWriter:
    """
    A CSV writer which will write rows to CSV file "f",
    which is encoded in the given encoding.
    """

    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        # Redirect output to a queue
        self.queue = cStringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        self.writer.writerow([s.encode("utf-8") for s in row])
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)
You don't have to use codecs.open; UnicodeWriter takes Unicode input and takes care of encoding everything into UTF-8. When UnicodeWriter writes into the file handle you passed to it, everything is already in UTF-8 encoding (therefore it works with a normal file you opened with open).
By using codecs.open, you essentially convert your Unicode objects to UTF-8 strings in UnicodeWriter, then try to re-encode these strings into UTF-8 again as if these strings contained Unicode strings, which obviously fails.
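A minimal sketch of that fix, reusing the values from the question: open the file with the plain built-in open (binary mode on Python 2) and let UnicodeWriter do the encoding:

values = (unicode("Ñ", "utf-8"), unicode("é", "utf-8"))

with open('eggs.csv', 'wb') as f:   # plain open, no codecs wrapper
    writer = UnicodeWriter(f)
    writer.writerow(values)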
As you have figured out, it works if you use plain open.
The reason for this is that you tried to encode to UTF-8 twice. Once in
f = codecs.open('eggs.csv', 'w', encoding="utf-8")
and then again in UnicodeWriter.writerow:
# ... and reencode it into the target encoding
data = self.encoder.encode(data)
To check that this works, use your original code and comment out that line.
Greetz
I ran into the csv / unicode challenge a while back and tossed this up on bitbucket: http://bitbucket.org/famousactress/dude_csv .. might work for you, if your needs are simple :)
You don't need to "double-encode" everything.
Your application should work entirely in Unicode.
Do your encoding only in the codecs.open to write UTF-8 bytes to an external file. Do no other encoding within your application.