In this first example we save two Unicode strings to a file, delegating the encoding to codecs.
# -*- coding: utf-8 -*-
import codecs

cities = [u'Düsseldorf', u'天津市']
with codecs.open("cities", "w", "utf-8") as f:
    for c in cities:
        f.write(c)
We now do the same thing, first saving the two names to redis, then reading them back and writing what we've read to a file. Because what we read back is already UTF-8 encoded, we skip decoding/encoding for that part.
# -*- coding: utf-8 -*-
import redis

r_server = redis.Redis('localhost')  # , decode_responses = True)
cities_tag = u'Städte'
cities = [u'Düsseldorf', u'天津市']
for city in cities:
    r_server.sadd(cities_tag.encode('utf8'),
                  city.encode('utf8'))
with open(u'someCities.txt', 'w') as f:
    while r_server.scard(cities_tag.encode('utf8')) != 0:
        city_utf8 = r_server.srandmember(cities_tag.encode('utf8'))
        f.write(city_utf8)
        r_server.srem(cities_tag.encode('utf8'), city_utf8)
How can I replace the line
r_server = redis.Redis('localhost')
with
r_server = redis.Redis('localhost', decode_responses = True)
to avoid the wholesale introduction of .encode/.decode when using redis?
I'm not sure that there is a problem.
If you remove all of the .encode('utf8') calls in your code it produces a correct file, i.e. the file is the same as the one produced by your current code.
>>> r_server = redis.Redis('localhost')
>>> r_server.keys()
[]
>>> r_server.sadd(u'Hauptstädte', u'東京', u'Godthåb',u'Москва')
3
>>> r_server.keys()
['Hauptst\xc3\xa4dte']
>>> r_server.smembers(u'Hauptstädte')
set(['Godth\xc3\xa5b', '\xd0\x9c\xd0\xbe\xd1\x81\xd0\xba\xd0\xb2\xd0\xb0', '\xe6\x9d\xb1\xe4\xba\xac'])
This shows that keys and values are UTF8 encoded, therefore .encode('utf8') is not required. The default encoding for the redis module is UTF8. This can be changed by passing an encoding when creating the client, e.g. redis.Redis('localhost', encoding='iso-8859-1'), but there's no reason to.
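To make this concrete, here is a minimal sketch of the question's second example with every .encode('utf8') call removed; redis-py encodes unicode arguments to UTF-8 itself, and the replies come back as UTF-8 byte strings that a plain file accepts as-is (Python 2, as in the question):
# -*- coding: utf-8 -*-
import redis

r_server = redis.Redis('localhost')
cities_tag = u'Städte'
for city in [u'Düsseldorf', u'天津市']:
    r_server.sadd(cities_tag, city)  # redis-py encodes the unicode for us

with open(u'someCities.txt', 'w') as f:
    while r_server.scard(cities_tag) != 0:
        city_utf8 = r_server.srandmember(cities_tag)  # UTF-8 byte string
        f.write(city_utf8)
        r_server.srem(cities_tag, city_utf8)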
If you enable response decoding with decode_responses=True then the responses will be converted to unicode using the client connection's encoding. This just means that you don't need to explicitly decode the returned data; redis will do it for you and hand back a unicode string:
>>> r_server = redis.Redis('localhost', decode_responses=True)
>>> r_server.keys()
[u'Hauptst\xe4dte']
>>> r_server.smembers(u'Hauptstädte')
set([u'Godth\xe5b', u'\u041c\u043e\u0441\u043a\u0432\u0430', u'\u6771\u4eac'])
So, in your second example where you write data retrieved from redis to a file, if you enable response decoding then you need to open the output file with the desired encoding. If this is the default encoding then you can just use open(). Otherwise you can use codecs.open() or manually encode the data before writing to the file.
import codecs

cities_tag = u'Hauptstädte'
with codecs.open('capitals.txt', 'w', encoding='utf8') as f:
    while r_server.scard(cities_tag) != 0:
        city = r_server.srandmember(cities_tag)
        f.write(city + '\n')
        r_server.srem(cities_tag, city)
I am currently reading the documentation for the io module: https://docs.python.org/3.5/library/io.html?highlight=stringio#io.TextIOBase
Maybe it is because I don't know Python well enough, but in most cases I just don't understand their documentation.
I need to save the data in addresses_list to a csv file and serve it to the user via https, so all of this must happen in memory. This is the code for it, and currently it works fine.
addresses = Abonnent.objects.filter(exemplare__gt=0)
addresses_list = list(addresses.values_list(*fieldnames))
csvfile = io.StringIO()
csvwriter_unicode = csv.writer(csvfile)
csvwriter_unicode.writerow(fieldnames)
for a in addresses_list:
    csvwriter_unicode.writerow(a)
csvfile.seek(0)
export_data = io.BytesIO()
myzip = zipfile.ZipFile(export_data, "w", zipfile.ZIP_DEFLATED)
myzip.writestr("output.csv", csvfile.read())
myzip.close()
csvfile.close()
export_data.close()
# serve the file via https
Now the problem is that I need the content of the csv file to be encoded in cp1252 and not in utf-8. Traditionally I would just write f = open("output.csv", "w", encoding="cp1252") and then dump all the data into it. But with in-memory streams it doesn't work that way: neither io.StringIO() nor io.BytesIO() takes an encoding= parameter.
This is where I have trouble understanding the documentation:
The text stream API is described in detail in the documentation of TextIOBase.
And the documentation of TextIOBase says this:
encoding=
The name of the encoding used to decode the stream’s bytes into strings, and to encode strings into bytes.
But io.StringIO(encoding="cp1252") just throws: TypeError: 'encoding' is an invalid keyword argument for this function.
So how can I use TextIOBase's encoding parameter with StringIO? Or how does this work in general? I am so confused.
StringIO deals only with strings/text. It doesn't know anything about encodings or bytes. The easiest way to do what you want is probably something like:
from io import StringIO

f = StringIO()
f.write("Some text")

# Old-ish way:
f.seek(0)
my_bytes = f.read().encode("cp1252")

# Alternatively:
my_bytes = f.getvalue().encode("cp1252")
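Applied to the question's zip workflow, that getvalue() call is the natural place to do the cp1252 conversion. A minimal sketch; the field names and rows here are placeholders, not the question's real data:
import csv
import io
import zipfile

fieldnames = ['name', 'street']           # placeholder columns
rows = [('Müller', 'Hauptstraße 1')]      # placeholder data

csvfile = io.StringIO()
writer = csv.writer(csvfile)
writer.writerow(fieldnames)
writer.writerows(rows)

export_data = io.BytesIO()
with zipfile.ZipFile(export_data, 'w', zipfile.ZIP_DEFLATED) as myzip:
    # Encode the accumulated text exactly once, at the byte boundary.
    myzip.writestr('output.csv', csvfile.getvalue().encode('cp1252'))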
Reading text from io.BytesIO (in-memory streams) using io.TextIOWrapper, including encoding and error handling (Python 3): this does what io.StringIO can't.
Sample code:
>>> import io
>>> import chardet
>>> # my bytes, single german umlaut
... bts = b'\xf6'
>>>
>>> # try reading as utf-8 text and on error replace
... my_encoding = 'utf-8'
>>> fh_bytes = io.BytesIO(bts)
>>> fh = io.TextIOWrapper(fh_bytes, encoding=my_encoding, errors='replace')
>>> fh.read()
'�'
>>>
>>> # try reading as utf-8 text with strict error handling
... fh_bytes = io.BytesIO(bts)
>>> fh = io.TextIOWrapper(fh_bytes, encoding=my_encoding, errors='strict')
>>> # catch exception
... try:
... fh.read()
... except UnicodeDecodeError as err:
... print('"%s"' % err)
... # try to get encoding
... my_encoding = chardet.detect(err.object)['encoding']
... print("correct encoding is %s" % my_encoding)
...
"'utf-8' codec can't decode byte 0xf6 in position 0: invalid start byte"
correct encoding is windows-1252
>>> # retry with detected encoding
... fh_bytes = io.BytesIO(bts)
>>> fh = io.TextIOWrapper(fh_bytes, encoding=my_encoding, errors='strict')
>>> fh.read()
'ö'
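The wrapper works for writing too: wrap a BytesIO and TextIOWrapper encodes text on the way in, which gives an encoding-aware in-memory stream (a sketch, in the same Python 3 session style):
>>> fh_bytes = io.BytesIO()
>>> fh = io.TextIOWrapper(fh_bytes, encoding='windows-1252')
>>> fh.write('ö')
1
>>> fh.flush()  # push the encoded bytes through to the BytesIO
>>> fh_bytes.getvalue()
b'\xf6'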
My requirement is to read some data from mysql database, and then write it in a JSON format to a file. However, while writing into file, the unicode data is garbled.
Actual Unicode Name: ぎぎぎは.exe
Name written to file: ã<81>Žã<81>Žã<81>Žã<81>¯.exe
My database has charset set as utf8. I am opening connection like below:
MySQLdb.connect (host = "XXXXX", user = "XXXXX", passwd = "XXXX", cursorclass=cursors.SSCursor,charset='utf8',use_unicode=True)
And the outfile is opened as below:
for r in data:
    with open("XX.json", 'w') as out:
        d = {}
        d['name'] = r[0]
        d['type'] = 'Work'
        out.write('%s\n' % json.dumps(d, indent=0, ensure_ascii=False).replace('\n', ''))
This is working, but as mentioned above unicode data is getting garbled.
If I do type(r[0]), it's coming as 'str'.
If your solution involves using the codecs.open function with encoding 'utf-8', then please help me add decode/encode wherever required. That method needs all data to be unicode.
I am lost in the plethora of solutions available online, but none of them works perfectly for me :(
OS Details: CentOS 6
>>> import sys
>>> sys.getfilesystemencoding()
'UTF-8'
Try io.open:
# -*- coding: utf8 -*-
import json
import io
with io.open('b.txt', 'at', encoding='utf8') as json_file:
    json_file.write(json.dumps({u'ぎぎぎは': 0}, ensure_ascii=False, encoding='utf8'))
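If r[0] still arrives as a UTF-8 byte string rather than unicode (the question reports type(r[0]) as str), decoding it once before json.dumps also works. A sketch under that assumption; data is the cursor result from the question:
# -*- coding: utf8 -*-
import json
import io

with io.open('XX.json', 'w', encoding='utf8') as out:
    for r in data:  # `data` comes from the MySQLdb cursor in the question
        d = {'name': r[0].decode('utf8'), 'type': 'Work'}
        out.write(u'%s\n' % json.dumps(d, ensure_ascii=False))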
I've been trying to scrape data from a website and write the data I find to a file. More than 90% of the time I don't run into Unicode errors, but when the data contains characters such as "Burger King®, Hans Café" it doesn't like writing them to the file, so my error handling prints them to the screen as-is, without any further errors.
I've tried the encode and decode functions and the various encodings but to no avail.
Please find an excerpt of the current code that I've written below:
import urllib2,sys
import re
import os
import urllib
import string
import time
from BeautifulSoup import BeautifulSoup,NavigableString, SoupStrainer
from string import maketrans
import codecs
f=codecs.open('alldetails7.txt', mode='w', encoding='utf-8', errors='replace')
...
soup5 = BeautifulSoup(html5)
enc_s5 = soup5.originalEncoding
for company in iter(soup5.findAll(height="20px")):
    stream = ""
    count_detail = 1
    for tag in iter(company.findAll('td')):
        if count_detail > 1:
            stream = stream + tag.text.replace(u',', u';')
            if count_detail < 4:
                stream = stream + ","
        count_detail = count_detail + 1
    stream.strip()
    try:
        f.write(str(stnum) + "," + br_name_addr + "," + stream.decode(enc_s5) + os.linesep)
    except:
        print "Unicode error ->" + str(storenum) + "," + branch_name_address + "," + stream
Your f.write() line doesn't make sense to me: stream will be a unicode, since it's built indirectly from tag.text and BeautifulSoup gives you Unicode, so you shouldn't call decode on stream. (You use decode to turn a str with a particular character encoding into a unicode.) You've opened the file for writing with codecs.open() and told it to use UTF-8, so you can just write() a unicode and that should work. So, instead I would try:
f.write(unicode(stnum)+br_name_addr+u","+stream+os.linesep)
... or, supposing that instead you had just opened the file with f=open('alldetails7.txt','w'), you would do:
line = unicode(stnum)+br_name_addr+u","+stream+os.linesep
f.write(line.encode('utf-8'))
Have you checked the encoding of the file you're writing to, and made sure the characters can be shown in the encoding you're trying to write to the file? Try setting the character encoding to UTF-8 or something else explicitly to have the characters show up.
I can't create an utf-8 csv file in Python.
I'm trying to read its docs, and in the examples section it says:
For all other encodings the following UnicodeReader and UnicodeWriter classes can be used. They take an additional encoding parameter in their constructor and make sure that the data passes the real reader or writer encoded as UTF-8:
Ok. So I have this code:
# -*- coding: utf-8 -*-
import codecs

values = (unicode("Ñ", "utf-8"), unicode("é", "utf-8"))
f = codecs.open('eggs.csv', 'w', encoding="utf-8")
writer = UnicodeWriter(f)  # UnicodeWriter class is shown below
writer.writerow(values)
And I keep getting this error:
line 159, in writerow
self.stream.write(data)
File "/usr/lib/python2.6/codecs.py", line 686, in write
return self.writer.write(data)
File "/usr/lib/python2.6/codecs.py", line 351, in write
data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 22: ordinal not in range(128)
Can someone please shed some light so I can understand what the hell I am doing wrong, since I set the encoding everywhere before calling the UnicodeWriter class?
class UnicodeWriter:
    """
    A CSV writer which will write rows to CSV file "f",
    which is encoded in the given encoding.
    """
    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        # Redirect output to a queue
        self.queue = cStringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        self.writer.writerow([s.encode("utf-8") for s in row])
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)
You don't have to use codecs.open; UnicodeWriter takes Unicode input and takes care of encoding everything into UTF-8. When UnicodeWriter writes into the file handle you passed to it, everything is already in UTF-8 encoding (therefore it works with a normal file you opened with open).
By using codecs.open, you essentially convert your Unicode objects to UTF-8 strings in UnicodeWriter, then try to re-encode these strings into UTF-8 again as if these strings contained Unicode strings, which obviously fails.
As you have figured out, it works if you use plain open.
The reason for this is that you tried to encode UTF-8 twice. Once in
f = codecs.open('eggs.csv', 'w', encoding="utf-8")
and then later in UnicodeWriter.writeRow
# ... and reencode it into the target encoding
data = self.encoder.encode(data)
To check that this works, take your original code and comment out that line.
I ran into the csv / unicode challenge a while back and tossed this up on bitbucket: http://bitbucket.org/famousactress/dude_csv .. might work for you, if your needs are simple :)
You don't need to "double-encode" everything.
Your application should work entirely in Unicode.
Do your encoding only in the codecs.open to write UTF-8 bytes to an external file. Do no other encoding within your application.
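Putting it together, a minimal sketch of the fixed snippet: a plain file handle plus the UnicodeWriter defined above, with the one and only encoding step happening inside the writer:
# -*- coding: utf-8 -*-
values = (unicode("Ñ", "utf-8"), unicode("é", "utf-8"))
with open('eggs.csv', 'wb') as f:  # plain open, no codecs wrapper
    writer = UnicodeWriter(f)      # UnicodeWriter as defined above
    writer.writerow(values)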
f = open("go.txt", "w")
f.write(title)
f.close()
What if "title" is in Japanese/UTF-8? How do I modify this code so it can write "title" without the ASCII error?
Edit: Then, how do I read this file in UTF-8?
How to use UTF-8:
import codecs
# ...
# title is a unicode string
# ...
f = codecs.open("go.txt", "w", "utf-8")
f.write(title)
# ...
fileObj = codecs.open("go.txt", "r", "utf-8")
u = fileObj.read() # Returns a Unicode string from the UTF-8 bytes in the file
It depends on whether you want to insert a UTF-8 byte order mark (BOM). The only way I know of to do that is to open a normal file and write it yourself:
import codecs
f = open('go.txt', 'wb')
f.write(codecs.BOM_UTF8)
f.write(title.encode('utf-8'))
f.close()
Generally, though, I don't want to add a UTF-8 BOM, and the following will suffice:
import codecs
f = codecs.open('go.txt', 'w', 'utf-8')
f.write(title)
f.close()