Wrapping Python's io.open Function - python

As you might know, file streams created by Python's io.open function in non-binary mode (e.g. io.open("myFile.txt", "w", encoding="utf-8")) do not accept non-unicode strings. For a reason that is somewhat hard to explain, I want to monkey-patch the write method of those file streams so that the string to be written is converted to unicode before the actual write method is called. My code currently looks like this:
import io
import types

def write(self, text):
    if not isinstance(text, unicode):
        text = unicode(text)
    return super(self.__class__, self).write(text)

def wopen(*args, **kwargs):
    f = io.open(*args, **kwargs)
    f.write = types.MethodType(write, f)
    return f
However, it does not work:
>>> with wopen(p, "w", encoding="utf-8") as f:
>>>     f.write("test")
UnsupportedOperation: write
Since I cannot really look into those built-in methods, I have no idea how to analyse the problem.
Thanks in advance!
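One likely culprit, not stated in the question: super(self.__class__, self) resolves past io.TextIOWrapper to the abstract base stream class, whose write raises UnsupportedOperation. Below is a minimal sketch of my own (not from the post) that sidesteps the super() call by capturing the original bound write in a closure; it assumes attribute assignment on the stream works, as it evidently does above:

import io

def wopen(*args, **kwargs):
    f = io.open(*args, **kwargs)
    original_write = f.write  # keep a reference to the real, bound write

    def write(text):
        # Coerce to unicode before delegating to the real method.
        if not isinstance(text, unicode):
            text = unicode(text)
        return original_write(text)

    f.write = write
    return f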

Related

Using Gzip in Python3 with Csv

The goal is to create code that is compatible with Python 2.7 and Python >= 3.6.
This code currently works on Python 2.7. It creates a GzipFile object and later writes lists to the gzip file. Finally, it uploads the gzip file to an S3 bucket.
Example Data: [[1, 2, 3], [4, 5, 6], ["a", 3, "iamastring"]]
from contextlib import contextmanager

@contextmanager
def get_gzip_writer(path):
    with s3_reader.open(path) as s3_file:
        with gzip.GzipFile(fileobj=s3_file, mode="w") as gzip_file:
            yield csv.writer(gzip_file)
However, this code does not work on python3 due to csv giving str whereas gzip expects bytes. It's important to keep gzip in bytes due to how it's used/read later on. That means using io.TextIOWrapper does not work in this specific use case.
I have tried to create an adapter class.
class BytesToBytes(object):
    def __init__(self, stream, dialect, encoding, **kwargs):
        self.temp = six.StringIO()
        self.writer = csv.writer(self.temp, dialect, **kwargs)
        self.stream = stream
        self.encoding = encoding

    def writerow(self, row):
        self.writer.writerow([s.decode('utf-8') if hasattr(s, 'decode') else s for s in row])
        self.stream.write(six.ensure_binary(self.temp.getvalue(), self.encoding))
        self.temp.seek(0)
        self.temp.truncate(0)
With the updated code looking like:
@contextmanager
def get_gzip_writer(path):
    with s3_reader.open(path) as s3_file:
        with gzip.GzipFile(fileobj=s3_file, mode="w") as gzip_file:
            yield BytesToBytes(gzip_file)
This works, but it seems excessive to have a full class for the purpose of this singular use case.
This is the code that calls the above:
def write_data(data, url):
    with get_gzip_writer(url) as writer:
        for row in data:
            writer.writerow(row)
    return url
What options are available for working with GzipFile (while maintaining bytes for read/write) without creating an entire adapter class?
I've read and considered your concern with keeping the GZip file in binary mode, and I think you can still use TextIOWrapper. My understanding is that its job is to provide an interface for writing bytes from text (my emphasis):
A buffered text stream providing higher-level access to a BufferedIOBase buffered binary stream.
I interpret that as "text in, bytes out"... which is what your GZip application needs, right? If so, then for Python3 we need to give the CSV writer something that accepts strings but ultimately writes bytes.
Enter TextIOWrapper with a UTF-8 encoding, accepting strings from csv.writer's writerow()/writerows() methods and writing UTF-8-encoded bytes to gzip_file.
I've run this in Python2 and 3, and unzipped the file and it looks good:
import csv, gzip, io, six

def get_gzip_writer(path):
    with open(path, 'wb') as s3_file:
        with gzip.GzipFile(fileobj=s3_file, mode='wb') as gzip_file:
            if six.PY3:
                with io.TextIOWrapper(gzip_file, encoding='utf-8') as wrapper:
                    yield csv.writer(wrapper)
            elif six.PY2:
                yield csv.writer(gzip_file)
            else:
                raise ValueError('Neither Python2 or 3?!')
data = [[1, 2, 3], ['a', 'b', 'c']]
url = 'output.gz'

for writer in get_gzip_writer(url):
    for row in data:
        writer.writerow(row)

Serializing a list of class instances in python

In Python, I am trying to store a list to a file. I've tried pickle, json, etc., but none of them support the class instances inside those lists. I can't sacrifice the lists or the classes; I must maintain both. How can I do it?
My current attempt:
import json

try:
    with open('file.json', 'r') as file:
        allcards = json.load(file)
except:
    allcards = []

def saveData(list):
    with open('file.json', 'w') as file:
        print(list)
        json.dump(list, file, indent=2)
saveData is called elsewhere, and I've done all the testing I can and have determined the error comes from trying to save the list due to its inclusion of class instances. It throws me the error
Object of type Card is not JSON serializable
whenever I use the JSON method, and the other methods don't even give errors but don't load the list when I reload the program.
Edit: As for the pickle method, here is what it looks like:
import pickle

try:
    with open('allcards.dat', 'rb') as file:
        allcards = pickle.load(file)
        print(allcards)
except:
    allcards = []

class Card():
    def __init__(self, owner, name, rarity, img, pack):
        self.owner = str(owner)
        self.name = str(name)
        self.rarity = str(rarity)
        self.img = img
        self.pack = str(pack)

def saveData(list):
    with open('allcards.dat', 'wb') as file:
        pickle.dump(list, file)
When I do this, all that happens is the code runs as normal, but the list is not saved. The print(allcards) does not trigger either, which makes me believe it's somehow not detecting the file or hitting some other error that sends it straight to the exception branch. Also, img is always supposed to be a link, in case that changes anything.
I have no other way I believe I can help solve this issue, but I can post more code if need be.
Please help, and thanks in advance.
Python's built-in pickle module does not support serializing a Python class, but there are libraries that extend the pickle module and provide this functionality. Dill and Cloudpickle both support serializing a Python class and have the exact same interface as the pickle module.
Dill: https://github.com/uqfoundation/dill
Cloudpickle: https://github.com/cloudpipe/cloudpickle
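As an illustrative sketch of my own (not from the answer): dill mirrors the pickle interface, so for the Card list above the switch is essentially just the import. The file name reuses the one from the question.

import dill

def saveData(cards):
    with open('allcards.dat', 'wb') as f:
        dill.dump(cards, f)  # same signature as pickle.dump

def loadData():
    try:
        with open('allcards.dat', 'rb') as f:
            return dill.load(f)  # same signature as pickle.load
    except (IOError, OSError):
        return []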
//EDIT
The article linked below is good, but I had written a bad example.
This time I've created a new snippet from scratch -- sorry for making it more complicated earlier than it needed to be.
import json

class Card(object):
    @classmethod
    def from_json(cls, data):
        return cls(**data)

    def __init__(self, figure, color):
        self.figure = figure
        self.color = color

    def __repr__(self):
        return f"<Card: [{self.figure} of {self.color}]>"

def save(cards):
    with open('file.json', 'w') as f:
        json.dump(cards, f, indent=4, default=lambda c: c.__dict__)

def load():
    with open('file.json', 'r') as f:
        obj_list = json.load(f)
    return [Card.from_json(obj) for obj in obj_list]

cards = []
cards.append(Card("1", "clubs"))
cards.append(Card("K", "spades"))

save(cards)
cards_from_file = load()
print(cards_from_file)
Source

Log everything printed into a file

I would like to create a function that keeps a record of every print command, storing each command's string into a new line in a file.
def log(line):
    with open('file.txt', "a") as f:
        f.write('\n' + line)
This is what I have, but is there any way to do what I said using Python?
Try replacing stdout with a custom class:
import sys

class LoggedStdout():
    def __init__(self, filename=None):
        self.filename = filename

    def write(self, text):
        sys.__stdout__.write(text)
        if self.filename is not None:
            self.log(text)

    def log(self, line):
        with open(self.filename, "a") as f:
            f.write('\n' + line)

sys.stdout = LoggedStdout('file.txt')
print 'Hello world!'
This would affect not only print, but also any other function that writes to stdout; that is often desirable anyway.
For production-grade logging it's much better to use something like the logging module, rather than home-made hooks over standard IO streams.
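For instance, here is a minimal sketch of my own (the file name is illustrative) using the standard logging module to record messages both to a file and to the console:

import logging

logger = logging.getLogger("printlog")
logger.setLevel(logging.INFO)
logger.addHandler(logging.FileHandler("file.txt"))  # append records to a file
logger.addHandler(logging.StreamHandler())          # echo them to the console

logger.info("Hello world!")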

Python CSV DictReader with UTF-8 data

AFAIK, the Python (v2.6) csv module can't handle unicode data by default, correct? In the Python docs there's an example on how to read from a UTF-8 encoded file. But this example only returns the CSV rows as a list.
I'd like to access the row columns by name as it is done by csv.DictReader but with UTF-8 encoded CSV input file.
Can anyone tell me how to do this in an efficient way? I will have to process CSV files that are hundreds of megabytes in size.
I came up with an answer myself:
def UnicodeDictReader(utf8_data, **kwargs):
csv_reader = csv.DictReader(utf8_data, **kwargs)
for row in csv_reader:
yield {unicode(key, 'utf-8'):unicode(value, 'utf-8') for key, value in row.iteritems()}
Note: This has been updated so keys are decoded per the suggestion in the comments
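A hypothetical usage sketch of the generator above (Python 2; the file name is illustrative), assuming the CSV file is UTF-8 encoded:

# Hypothetical usage of the UnicodeDictReader generator above (Python 2).
with open('data.csv', 'rb') as f:
    for row in UnicodeDictReader(f):
        print(row)  # each row maps unicode keys to unicode values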
For me, the key was not in manipulating the csv DictReader args, but the file opener itself. This did the trick:
with open(filepath, mode="r", encoding="utf-8-sig") as csv_file:
    csv_reader = csv.DictReader(csv_file)
No special class required. Now I can open files either with or without BOM without crashing.
First of all, use the 2.6 version of the documentation. It can change for each release. It says clearly that it doesn't support Unicode but it does support UTF-8. Technically, these are not the same thing. As the docs say:
The csv module doesn’t directly support reading and writing Unicode, but it is 8-bit-clean save for some problems with ASCII NUL characters. So you can write functions or classes that handle the encoding and decoding for you as long as you avoid encodings like UTF-16 that use NULs. UTF-8 is recommended.
The example below, adapted from the docs, shows how to create two functions that correctly read UTF-8-encoded text as CSV. The docs' original uses csv.reader and yields lists; here it is adapted to csv.DictReader so rows come back as dictionaries.
import csv

def utf_8_encoder(unicode_csv_data):
    # csv.py doesn't do Unicode; encode temporarily as UTF-8:
    for line in unicode_csv_data:
        yield line.encode('utf-8')

def unicode_csv_reader(unicode_csv_data, dialect=csv.excel, **kwargs):
    csv_reader = csv.DictReader(utf_8_encoder(unicode_csv_data),
                                dialect=dialect, **kwargs)
    for row in csv_reader:
        # decode UTF-8 back to Unicode, key by key and value by value:
        yield {unicode(key, 'utf-8'): unicode(value, 'utf-8')
               for key, value in row.iteritems()}
A class-based approach to @LMatter's answer; with this approach you still get all the benefits of DictReader, such as getting the fieldnames and getting the line number, plus it handles UTF-8:
import csv

class UnicodeDictReader(csv.DictReader, object):
    def next(self):
        row = super(UnicodeDictReader, self).next()
        return {unicode(key, 'utf-8'): unicode(value, 'utf-8') for key, value in row.iteritems()}
That's easy with the unicodecsv package.
# pip install unicodecsv
import unicodecsv as csv

with open('your_file.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        print(row)
The csvw package has other functionality as well (for metadata-enriched CSV for the Web), but it defines a UnicodeDictReader class wrapping around its UnicodeReader class, which at its core does exactly that:
class UnicodeReader(Iterator):
    """Read Unicode data from a csv file."""

    […]

    def _next_row(self):
        self.lineno += 1
        return [
            s if isinstance(s, text_type) else s.decode(self._reader_encoding)
            for s in next(self.reader)]
It did catch me out a few times, but csvw.UnicodeDictReader really, really needs to be used in a with block and breaks otherwise. Other than that, the module is nicely generic and compatible with both py2 and py3.
The answer doesn't have the DictWriter methods, so here is the updated class:
import codecs
import cStringIO
import csv

class DictUnicodeWriter(object):
    def __init__(self, f, fieldnames, dialect=csv.excel, encoding="utf-8", **kwds):
        self.fieldnames = fieldnames  # list of keys for the dict
        # Redirect output to a queue
        self.queue = cStringIO.StringIO()
        self.writer = csv.DictWriter(self.queue, fieldnames, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        self.writer.writerow({k: v.encode("utf-8") for k, v in row.items()})
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and re-encode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty the queue
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)

    def writeheader(self):
        header = dict(zip(self.fieldnames, self.fieldnames))
        self.writerow(header)
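A hypothetical usage sketch of the class above (Python 2; the field names, file name, and values are illustrative):

# -*- coding: utf-8 -*-
# Hypothetical usage of DictUnicodeWriter: values are unicode, output is bytes.
with open('people.csv', 'wb') as f:
    writer = DictUnicodeWriter(f, fieldnames=['name', 'city'])
    writer.writeheader()
    writer.writerow({'name': u'José', 'city': u'München'})
    writer.writerows([{'name': u'Zoë', 'city': u'Tromsø'}])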

This is my current way of writing to a file. However, I can't do UTF-8?

f = open("go.txt", "w")
f.write(title)
f.close()
What if "title" is in japanese/utf-8? How do I modify this code to be able to write "title" without having the ascii error?
Edit: Then, how do I read this file in UTF-8?
How to use UTF-8:
import codecs
# ...
# title is a unicode string
# ...
f = codecs.open("go.txt", "w", "utf-8")
f.write(title)
# ...
fileObj = codecs.open("go.txt", "r", "utf-8")
u = fileObj.read() # Returns a Unicode string from the UTF-8 bytes in the file
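A related sketch of my own (not from the answer): on Python 2.6+ and Python 3, the built-in io.open accepts the same encoding argument, so the codecs module isn't strictly required. It assumes title is a unicode string, as above.

import io

with io.open("go.txt", "w", encoding="utf-8") as f:
    f.write(title)  # title must be a unicode string

with io.open("go.txt", "r", encoding="utf-8") as f:
    u = f.read()  # a unicode string decoded from the UTF-8 bytes in the file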
It depends on whether you want to insert a UTF-8 byte order mark (BOM); the only way I know of doing that is to open a normal file in binary mode and write:
import codecs

f = open('go.txt', 'wb')
f.write(codecs.BOM_UTF8)
f.write(title.encode('utf-8'))
f.close()
Generally though, I don't want to add a UTF-8 BOM, and the following will suffice:
import codecs
f = codecs.open('go.txt', 'w', 'utf-8')
f.write(title)
f.close()
