The goal is to write code that is compatible with Python 2.7 and Python >= 3.6.
This code currently works on Python 2.7. It creates a GzipFile object and later writes lists to the gzip file. Finally, it uploads the gzip file to an S3 bucket.
Example Data: [[1, 2, 3], [4, 5, 6], ["a", 3, "iamastring"]]
def get_gzip_writer(path):
    with s3_reader.open(path) as s3_file:
        with gzip.GzipFile(fileobj=s3_file, mode="w") as gzip_file:
            yield csv.writer(gzip_file)
However, this code does not work on Python 3, because csv produces str while gzip expects bytes. It's important to keep the gzip stream in bytes because of how it's used/read later on. That means using io.TextIOWrapper does not work in this specific use case.
I have tried to create an adapter class.
class BytesToBytes(object):
    def __init__(self, stream, dialect, encoding, **kwargs):
        self.temp = six.StringIO()
        self.writer = csv.writer(self.temp, dialect, **kwargs)
        self.stream = stream
        self.encoding = encoding

    def writerow(self, row):
        self.writer.writerow([s.decode('utf-8') if hasattr(s, 'decode') else s for s in row])
        self.stream.write(six.ensure_binary(self.temp.getvalue(), self.encoding))
        self.temp.seek(0)
        self.temp.truncate(0)
With the updated code looking like:
def get_gzip_writer(path):
    with s3_reader.open(path) as s3_file:
        with gzip.GzipFile(fileobj=s3_file, mode="w") as gzip_file:
            yield BytesToBytes(gzip_file)
This works, but it seems excessive to have a full class for this single use case.
This is the code that calls the above:
def write_data(data, url):
    with get_gzip_writer(url) as writer:
        for row in data:
            writer.writerow(row)
    return url
What options are available for working with GzipFile (while maintaining bytes for read/write) without creating an entire adapter class?
I've read and considered your concern with keeping the GZip file in binary mode, and I think you can still use TextIOWrapper. My understanding is that its job is to provide an interface for writing bytes from text (emphasis mine):
A buffered text stream providing higher-level access to a BufferedIOBase buffered binary stream.
I interpret that as "text in, bytes out"... which is what your GZip application needs, right? If so, then for Python3 we need to give the CSV writer something that accepts strings but ultimately writes bytes.
Enter TextIOWrapper with a UTF-8 encoding, accepting strings from csv.writer's writerow/s() methods and writing UTF-8-encoded bytes to gzip_file.
I've run this in Python2 and 3, and unzipped the file and it looks good:
import csv, gzip, io, six

def get_gzip_writer(path):
    with open(path, 'wb') as s3_file:
        with gzip.GzipFile(fileobj=s3_file, mode='wb') as gzip_file:
            if six.PY3:
                with io.TextIOWrapper(gzip_file, encoding='utf-8') as wrapper:
                    yield csv.writer(wrapper)
            elif six.PY2:
                yield csv.writer(gzip_file)
            else:
                raise ValueError('Neither Python2 or 3?!')

data = [[1, 2, 3], ['a', 'b', 'c']]
url = 'output.gz'

for writer in get_gzip_writer(url):
    for row in data:
        writer.writerow(row)
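If you want to keep the question's calling style (with get_gzip_writer(url) as writer:), the same generator can be decorated with contextlib.contextmanager. Below is a minimal Python 3-only sketch; plain open stands in for the question's s3_reader.open, and for Python 2 you would yield csv.writer(gzip_file) directly inside the same context manager, as in the answer above:

import contextlib
import csv
import gzip
import io

@contextlib.contextmanager
def get_gzip_writer(path):
    # Plain open stands in for s3_reader.open from the question; both provide
    # a writable binary stream.
    with open(path, 'wb') as s3_file:
        with gzip.GzipFile(fileobj=s3_file, mode='wb') as gzip_file:
            with io.TextIOWrapper(gzip_file, encoding='utf-8', newline='') as wrapper:
                yield csv.writer(wrapper)

def write_data(data, url):
    with get_gzip_writer(url) as writer:
        for row in data:
            writer.writerow(row)
    return url

write_data([[1, 2, 3], ['a', 'b', 'c']], 'output.gz')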
I'm working on a new library which will allow the user to parse any file (xlsx, csv, json, tar, zip, txt) into generators.
Now I'm stuck on zip archives: when I try to parse a CSV from one, I get io.UnsupportedOperation: seek immediately after elem.seek(0). The CSV file is a simple one, 4x4 rows and columns. If I parse the CSV using the csv_parser directly I get what I want, but trying to parse it from a zip archive... boom. Error!
with open("/Users/ro/Downloads/archive_file/csv.zip", 'r') as my_file_file:
asd = parse_zip(my_file_file)
print asd
Where parse_zip is:
def parse_zip(element):
    """Function for manipulating zip files"""
    try:
        my_zip = zipfile.ZipFile(element, 'r')
    except zipfile.BadZipfile:
        raise err.NestedArchives(element)
    else:
        my_file = my_zip.open('corect_csv.csv')
        # print my_file
        my_mime = csv_tsv_parser.parse_csv_tsv(my_file)
        print list(my_mime)
And parse_csv_tsv is:
def _csv_tsv_parser(element):
    """Helper function for csv and tsv files that returns a generator"""
    for row in element:
        if any(s for s in row):
            yield row

def parse_csv_tsv(elem):
    """Function for manipulating all the csv files"""
    dialect = csv.Sniffer().sniff(elem.readline())
    elem.seek(0)
    data_file = csv.reader(elem, dialect)
    read_data = _csv_tsv_parser(data_file)
    yield '', read_data
Where am I wrong? Is the way I'm opening the file OK or...?
ZipFile.open returns a file-like ZipExtFile object that inherits from io.BufferedIOBase, but in Python 2 (and in Python 3 before 3.7) ZipExtFile does not implement seek, hence the exception.
However, ZipExtFile does provide a peek method, which returns a number of bytes without moving the file pointer. So changing
dialect = csv.Sniffer().sniff(elem.readline())
elem.seek(0)
to
num_bytes = 128 # number of bytes to read
dialect = csv.Sniffer().sniff(elem.peek(n=num_bytes))
solves the problem.
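Putting it together, parse_csv_tsv could look like the sketch below. The 128-byte sample size is an arbitrary choice (peek may return somewhat more or fewer bytes than requested, but the sniffer only needs a representative sample), and _csv_tsv_parser is the question's helper, repeated here unchanged so the snippet is self-contained:

import csv

def _csv_tsv_parser(element):
    # Helper from the question, unchanged.
    for row in element:
        if any(s for s in row):
            yield row

def parse_csv_tsv(elem):
    """Sniff the dialect from a ZipExtFile without calling seek."""
    num_bytes = 128  # arbitrary sample size for the sniffer
    sample = elem.peek(num_bytes)  # bytes returned without moving the file pointer
    dialect = csv.Sniffer().sniff(sample)
    data_file = csv.reader(elem, dialect)
    read_data = _csv_tsv_parser(data_file)
    yield '', read_data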
How can I open a file on an FTP server in write mode? I know I can write/create a file directly (when I already have the data), but I want to first open it for writing and only then write to it, as you would do locally using a context manager.
The reasoning is that I want to create an interface with unified methods for working with transfer protocol servers, specifically SFTP and FTP.
With SFTP it's easy (using paramiko):
def open(sftp, fname, mode='r'):
    return sftp.open(fname, mode=mode)
Now I can do this:
with open(sftp, 'some_file.txt', 'w') as f:
    f.write(data)
And then I can read what was written
with open(sftp, 'some_file.txt', 'r') as f:
    print(f.read().decode('utf-8'))
How can I do the same implementation for FTP (using ftplib)?
The reading part for FTP I was able to implement, and I can open a file in read mode just like with SFTP. But how can I open it in write mode? ftplib's storbinary asks for the data to be provided "immediately", meaning I would already have to pass the data I want to write to the open method (which would defeat the purpose of a unified method).
import io

def open(ftp, filename, mode='r'):
    """Open a file on FTP server."""
    def handle_buffer(buffer_data):
        bio.write(buffer_data)

    # Reading implementation
    if mode == 'r':
        bio = io.BytesIO()
        ftp.retrbinary('RETR %s' % filename, callback=handle_buffer)
        bio.seek(0)
        return bio

    # Writing implementation.
    if mode == 'w':
        # how to open in write mode?
Update
Let's say we have an immediate-writing implementation for FTP:

bio = io.BytesIO()

# Write some data
data = csv.writer(bio)
data.writerows(data_to_export)
bio.seek(0)

# Store. So it looks like storbinary does not open the file in 'w' mode; it does everything in one go?
ftp.storbinary("STOR " + file_name, bio)
So the question is: how can I separate writing the data from just opening the file in write mode? Is it even possible with ftplib?
So after some struggle, I was able to make this work. The solution was to implement custom context managers for the open method, one for read mode and one for write mode. (I had to reimplement read mode as well, because it only worked for plain file reading and failed when, say, a csv reader was used on it.)
For read mode I chose to use tempfile, because with other approaches I was not able to properly read the data using different readers (plain file reader, csv reader, etc.). With an opened tempfile in read mode, everything works as expected.
For write mode I was able to use an in-memory buffer, io.BytesIO, so a tempfile was not necessary for writing.
import io
import tempfile

import six


class OpenRead(object):
    def _open_tempfile(self):
        self.tfile = tempfile.NamedTemporaryFile()
        # Write data to the tempfile.
        self.ftp.retrbinary('RETR %s' % self.filename, self.tfile.write)
        # Go back to the start of the file, so it can be read.
        self.tfile.seek(0)
        return open(self.tfile.name, 'r')

    def __init__(self, ftp, filename):
        self.ftp = ftp
        self.filename = filename
        self.tfile = None

    def __enter__(self):
        return self._open_tempfile()

    def __exit__(self, exception_type, exception_value, traceback):
        # Remove temporary file.
        self.tfile.close()


class OpenWrite(object):
    def __init__(self, ftp, filename):
        self.ftp = ftp
        self.filename = filename
        self.data = ''

    def __enter__(self):
        return self

    def __exit__(self, exception_type, exception_value, traceback):
        bio = io.BytesIO()
        if isinstance(self.data, six.string_types):
            self.data = self.data.encode()
        bio.write(self.data)
        bio.seek(0)
        res = self.ftp.storbinary('STOR %s' % self.filename, bio)
        bio.close()
        return res

    def write(self, data):
        self.data += data


def open(ftp, filename, mode='r'):
    """Open a file on FTP server."""
    if mode == 'r':
        return OpenRead(ftp, filename)
    if mode == 'w':
        return OpenWrite(ftp, filename)
P.S. This might not work properly without a context manager, but for now it is an OK solution for me. If anyone has a better implementation, they are more than welcome to share it.
Update
I decided to use the ftputil package instead of the standard ftplib. None of this hacking is needed, because ftputil takes care of it, and it uses many of the same method names as paramiko for the same operations, so it is much easier to unify the protocol usage.
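For reference, a minimal sketch of what that looks like with ftputil (the host name and credentials are placeholders). FTPHost.open behaves much like the built-in open and works as a context manager, which is exactly the unified interface wanted above:

import ftputil

# Placeholder connection details.
with ftputil.FTPHost('ftp.example.com', 'user', 'password') as host:
    # Write, then read back, using the same open-style interface as paramiko's sftp.open.
    with host.open('some_file.txt', 'w') as f:
        f.write(u'some data\n')
    with host.open('some_file.txt', 'r') as f:
        print(f.read())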
In a WSGI application we can read the raw input data from the wsgi.input field:

def application(env, start_response):
    .....
    data = env['wsgi.input'].read(num_bytes)
    .....
However, I want to wrap the file-like object using the new io module:
import io

def application(env, start_response):
    .....
    f = io.open(env['wsgi.input'], 'rb')
    buffer = bytearray(buff_size)
    read = f.readinto(buffer)
    .....
The problem is that io.open doesn't accept this kind of file object. Any idea how to do that? I need to read from env['wsgi.input'] into a buffer.
The io.open() function does not accept a file object as its first parameter.
However, it does accept an integer representing the file descriptor of an open file, so you may have some success with:

f = io.open(env['wsgi.input'].fileno(), 'rb')
Addendum:
The io module is written for Python 3, where string handling is quite different from Python 2. Calling read() on a file opened in binary mode returns a bytes object in Python 3 but a str in Python 2; when wrapping a file using the io module in binary mode, the io module expects read() to return bytes.
You can try fixing your original file object by making it return bytes:
def fix(file):
    # wrap 'func' to convert its return value to bytes using the specified encoding
    def wrap(func, encoding):
        def read(*args, **kwargs):
            return bytes(func(*args, **kwargs), encoding)
        return read
    file.read = wrap(file.read, 'ascii')

fix(env['wsgi.input'])
f = io.open(env['wsgi.input'].fileno(), 'rb')
The above function wraps the read() method, but it can be extended to wrap readline() as well. Also, a little additional work is required to wrap readlines()...
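An alternative that avoids monkey-patching: wrap wsgi.input in a small io.RawIOBase adapter and hand it to io.BufferedReader, which then provides readinto() directly. This is only a sketch; RawWSGIInput is a name made up here, and buff_size is assumed from the question:

import io

class RawWSGIInput(io.RawIOBase):
    """Adapter exposing a WSGI input stream as a raw, readable io stream."""

    def __init__(self, wsgi_input):
        self._wsgi_input = wsgi_input

    def readable(self):
        return True

    def readinto(self, b):
        # Read at most len(b) bytes from the WSGI stream and copy them into b.
        data = self._wsgi_input.read(len(b))
        b[:len(data)] = data
        return len(data)

def application(env, start_response):
    f = io.BufferedReader(RawWSGIInput(env['wsgi.input']))
    buff_size = 4096  # assumed, as in the question
    buffer = bytearray(buff_size)
    read = f.readinto(buffer)
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [bytes(buffer[:read])]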
AFAIK, the Python (v2.6) csv module can't handle Unicode data by default, correct? In the Python docs there's an example of how to read from a UTF-8-encoded file, but this example only returns the CSV rows as a list.
I'd like to access the row columns by name, as csv.DictReader does, but with a UTF-8-encoded CSV input file.
Can anyone tell me how to do this in an efficient way? I will have to process CSV files that are hundreds of MBytes in size.
I came up with an answer myself:
def UnicodeDictReader(utf8_data, **kwargs):
    csv_reader = csv.DictReader(utf8_data, **kwargs)
    for row in csv_reader:
        yield {unicode(key, 'utf-8'): unicode(value, 'utf-8') for key, value in row.iteritems()}
Note: This has been updated so keys are decoded per the suggestion in the comments
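Usage is the same as with a plain DictReader. A small sketch with a made-up file and column name (Python 2; the file should be opened in binary mode so the csv module sees UTF-8 bytes):

# Hypothetical example: 'people.csv' and its 'name' column are made up.
with open('people.csv', 'rb') as f:
    for row in UnicodeDictReader(f, delimiter=','):
        print row['name']  # keys and values are now unicode objects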
For me, the key was not in manipulating the csv DictReader args, but the file opener itself. This did the trick:
with open(filepath, mode="r", encoding="utf-8-sig") as csv_file:
csv_reader = csv.DictReader(csv_file)
No special class required. Now I can open files either with or without BOM without crashing.
First of all, use the 2.6 version of the documentation. It can change for each release. It says clearly that it doesn't support Unicode but it does support UTF-8. Technically, these are not the same thing. As the docs say:
The csv module doesn’t directly support reading and writing Unicode, but it is 8-bit-clean save for some problems with ASCII NUL characters. So you can write functions or classes that handle the encoding and decoding for you as long as you avoid encodings like UTF-16 that use NULs. UTF-8 is recommended.
The example below (from the docs) shows how to create two functions that correctly read UTF-8-encoded text as CSV. Note that csv.reader() returns each row as a plain list; if you need dict-style access by column name, combine this encode/decode approach with csv.DictReader, as the other answers here do.
import csv

def unicode_csv_reader(unicode_csv_data, dialect=csv.excel, **kwargs):
    # csv.py doesn't do Unicode; encode temporarily as UTF-8:
    csv_reader = csv.reader(utf_8_encoder(unicode_csv_data),
                            dialect=dialect, **kwargs)
    for row in csv_reader:
        # decode UTF-8 back to Unicode, cell by cell:
        yield [unicode(cell, 'utf-8') for cell in row]

def utf_8_encoder(unicode_csv_data):
    for line in unicode_csv_data:
        yield line.encode('utf-8')
A class-based approach to @LMatter's answer; with this approach you still get all the benefits of DictReader, such as getting the fieldnames and the line number, plus it handles UTF-8:
import csv

class UnicodeDictReader(csv.DictReader, object):
    def next(self):
        row = super(UnicodeDictReader, self).next()
        return {unicode(key, 'utf-8'): unicode(value, 'utf-8') for key, value in row.iteritems()}
That's easy with the unicodecsv package.
# pip install unicodecsv
import unicodecsv as csv

with open('your_file.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        print(row)
The csvw package has other functionality as well (for metadata-enriched CSV for the Web), but it defines a UnicodeDictReader class wrapping around its UnicodeReader class, which at its core does exactly that:
class UnicodeReader(Iterator):
    """Read Unicode data from a csv file."""
    […]

    def _next_row(self):
        self.lineno += 1
        return [
            s if isinstance(s, text_type) else s.decode(self._reader_encoding)
            for s in next(self.reader)]
It did catch me out a few times: csvw.UnicodeDictReader really, really needs to be used in a with block and breaks otherwise. Other than that, the module is nicely generic and compatible with both Python 2 and 3.
The UnicodeWriter answer doesn't include the DictWriter methods, so here is an updated class:
class DictUnicodeWriter(object):
    def __init__(self, f, fieldnames, dialect=csv.excel, encoding="utf-8", **kwds):
        self.fieldnames = fieldnames  # list of keys for the dict
        # Redirect output to a queue
        self.queue = cStringIO.StringIO()
        self.writer = csv.DictWriter(self.queue, fieldnames, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        self.writer.writerow({k: v.encode("utf-8") for k, v in row.items()})
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)

    def writeheader(self):
        header = dict(zip(self.fieldnames, self.fieldnames))
        self.writerow(header)
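A possible usage sketch (Python 2; the file name, field names, and row data are made up, and csv, codecs, and cStringIO are assumed to be imported as the class requires). The target file is opened in binary mode, since the class writes already-encoded bytes to it:

# Hypothetical example data.
rows = [
    {u'name': u'Jos\xe9', u'city': u'K\xf6ln'},
    {u'name': u'Anna', u'city': u'Oslo'},
]

with open('people.csv', 'wb') as f:
    writer = DictUnicodeWriter(f, fieldnames=[u'name', u'city'])
    writer.writeheader()
    writer.writerows(rows)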
I can't create a UTF-8 CSV file in Python.
I'm trying to read its docs, and in the examples section, it says:
For all other encodings the following UnicodeReader and UnicodeWriter classes can be used. They take an additional encoding parameter in their constructor and make sure that the data passes the real reader or writer encoded as UTF-8:
Ok. So I have this code:
values = (unicode("Ñ", "utf-8"), unicode("é", "utf-8"))
f = codecs.open('eggs.csv', 'w', encoding="utf-8")

writer = UnicodeWriter(f)
writer.writerow(values)
And I keep getting this error:
  line 159, in writerow
    self.stream.write(data)
  File "/usr/lib/python2.6/codecs.py", line 686, in write
    return self.writer.write(data)
  File "/usr/lib/python2.6/codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 22: ordinal not in range(128)
Can someone please give me a light so I can understand what the hell I'm doing wrong, since I set the encoding everywhere before calling the UnicodeWriter class?
class UnicodeWriter:
    """
    A CSV writer which will write rows to CSV file "f",
    which is encoded in the given encoding.
    """

    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        # Redirect output to a queue
        self.queue = cStringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        self.writer.writerow([s.encode("utf-8") for s in row])
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)
You don't have to use codecs.open; UnicodeWriter takes Unicode input and takes care of encoding everything into UTF-8. When UnicodeWriter writes to the file handle you passed to it, everything is already UTF-8-encoded (therefore it works with a normal file you opened with open).
By using codecs.open, you essentially convert your Unicode objects to UTF-8 strings inside UnicodeWriter, then try to re-encode those strings into UTF-8 again as if they were Unicode strings, which obviously fails.
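In other words, the minimal fix is to drop codecs.open and hand UnicodeWriter a plain binary file. A sketch, reusing the names from the question:

values = (unicode("Ñ", "utf-8"), unicode("é", "utf-8"))

with open('eggs.csv', 'wb') as f:  # plain binary file, no codecs wrapper
    writer = UnicodeWriter(f)
    writer.writerow(values)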
As you have figured out, it works if you use plain open.
The reason is that you tried to encode to UTF-8 twice. Once in
f = codecs.open('eggs.csv', 'w', encoding="utf-8")
and then later in UnicodeWriter.writerow
# ... and reencode it into the target encoding
data = self.encoder.encode(data)
To check that this works, take your original code and comment out that line.
I ran into the csv / unicode challenge a while back and tossed this up on bitbucket: http://bitbucket.org/famousactress/dude_csv .. might work for you, if your needs are simple :)
You don't need to "double-encode" everything.
Your application should work entirely in Unicode.
Do your encoding only in the codecs.open to write UTF-8 bytes to an external file. Do no other encoding within your application.