Python: CSV to JSON conversion when the CSV contains unicode

I'm trying to use the following code (within web2py) to read a csv file and convert it into a json object:
import csv
import json
from StringIO import StringIO  # needed for StringIO below (Python 2)

originalfilename, file_stream = db.tablename.file.retrieve(info.file)
file_contents = file_stream.read()
csv_reader = csv.DictReader(StringIO(file_contents))
json = json.dumps([x for x in csv_reader])
This produces the following error:
'utf8' codec can't decode byte 0xa0 in position 1: invalid start byte
Apparently, there is a problem handling the spaces in the .csv file. The problem appears to stem from the json.dumps() line. The traceback from that point on:
Traceback (most recent call last):
File ".../web2py/gluon/restricted.py", line 212, in restricted
exec ccode in environment
File ".../controllers/default.py", line 2345, in <module>
File ".../web2py/gluon/globals.py", line 194, in <lambda>
self._caller = lambda f: f()
File ".../web2py/gluon/tools.py", line 3021, in f
return action(*a, **b)
File ".../controllers/default.py", line 697, in generate_vis
request.vars.json = json.dumps(list(csv_reader))
File "/usr/local/lib/python2.7/json/__init__.py", line 243, in dumps
return _default_encoder.encode(obj)
File "/usr/local/lib/python2.7/json/encoder.py", line 207, in encode
chunks = self.iterencode(o, _one_shot=True)
File "/usr/local/lib/python2.7/json/encoder.py", line 270, in iterencode
return _iterencode(o, 0)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa0 in position 1: invalid start byte
Any suggestions regarding how to resolve this, or another way to get a csv file (which contains a header; using StringIO) into a json object that won't produce similar complications? Thank you.

The csv module (under Python 2) is purely byte-based; all strings you get out of it are bytes. However JSON is Unicode character-based, so there is an implicit conversion when you try to write out the bytes you got from CSV into JSON. Python guessed UTF-8 for this, but your CSV file wasn't UTF-8 - it was probably Windows code page 1252 (Western European - like ISO-8859-1 only not quite).
A quick fix would be to transcode your input (file_contents = file_contents.decode('windows-1252').encode('utf-8')), but you probably don't really want to rely on guessing one particular encoding.
Best would be to explicitly decode your strings at the point of reading them from CSV; then JSON will be able to cope with them fine. Unfortunately csv doesn't have built-in decoding (at least in this Python version), but you can do it manually:
class UnicodeDictReader(csv.DictReader):
    def __init__(self, f, encoding, *args, **kwargs):
        csv.DictReader.__init__(self, f, *args, **kwargs)
        self.encoding = encoding

    def next(self):
        # decode every key and value from bytes to unicode as each row is read
        return {
            k.decode(self.encoding): v.decode(self.encoding)
            for (k, v) in csv.DictReader.next(self).items()
        }
csv_reader = UnicodeDictReader(StringIO(file_contents), 'windows-1252')
json_output = json.dumps(list(csv_reader))
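Note that json.dumps escapes non-ASCII characters as \uXXXX sequences by default; if you'd rather keep the characters themselves, you can pass ensure_ascii=False (the result is then a unicode object that you may need to encode before writing out):
json_output = json.dumps(list(csv_reader), ensure_ascii=False)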
"it's not known in advance what sort of encoding will come up"
Well, that's more of a problem, since it's impossible to accurately guess what encoding a file is in. You would either have to specify a particular encoding, or give the user a way to signal what the encoding is, if you want to support non-ASCII characters properly.
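If you do need a best-effort guess, the chardet library can estimate the encoding from the raw bytes; it's only a heuristic, so treat the result as a fallback rather than a guarantee:
import chardet

guess = chardet.detect(file_contents)  # e.g. {'encoding': 'windows-1252', 'confidence': 0.73, ...}
encoding = guess['encoding'] or 'utf-8'  # detect() may return None for the encoding
csv_reader = UnicodeDictReader(StringIO(file_contents), encoding)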

Try replacing your final line with one that decodes the row contents: each x produced by DictReader is a dict of byte strings (so the dict itself has no encode/decode method), and the bytes need decoding with whatever encoding the file is actually in:
json = json.dumps([{k.decode('windows-1252'): v.decode('windows-1252')
                    for k, v in x.items()} for x in csv_reader])

Running unidecode over the file contents seems to do the trick (note that unidecode transliterates non-ASCII characters to rough ASCII equivalents, so accented characters are flattened rather than preserved):
from isounidecode import unidecode
...
file_contents = unidecode(file_stream.read())
...
Thanks, everyone!

Related

UnicodeEncodeError when reading a file

I am trying to read from the rockyou wordlist and write all words that are >= 8 chars to a new file.
Here is the code:
def main():
    with open("rockyou.txt", encoding="utf8") as in_file, open('rockout.txt', 'w') as out_file:
        for line in in_file:
            if len(line.rstrip()) < 8:
                continue
            print(line, file = out_file, end = '')
    print("done")

if __name__ == '__main__':
    main()
Some words are not utf-8.
Traceback (most recent call last):
File "wpa_rock.py", line 10, in <module>
main()
File "wpa_rock.py", line 6, in main
print(line, file = out_file, end = '')
File "C:\Python\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u0e45' in position 0: character maps to <undefined>
Update
def main():
    with open("rockyou.txt", encoding="utf8") as in_file, open('rockout.txt', 'w', encoding="utf8") as out_file:
        for line in in_file:
            if len(line.rstrip()) < 8:
                continue
            out_file.write(line)
    print("done")

if __name__ == '__main__':
    main()
Traceback (most recent call last):
File "wpa_rock.py", line 10, in <module>
main()
File "wpa_rock.py", line 3, in main
for line in in_file:
File "C:\Python\lib\codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 933: invalid continuation byte
Your UnicodeEncodeError: 'charmap' error occurs during writing to out_file (in print()).
By default, open() uses locale.getpreferredencoding(), which is the ANSI codepage on Windows (such as cp1252); that codepage can't represent all Unicode characters, and the '\u0e45' character in particular. cp1252 is a one-byte encoding that can represent at most 256 distinct characters, but there are over a million (1114111) Unicode code points, so it can't represent them all.
Pass an encoding that can represent all the desired data, e.g., encoding='utf-8' should work (as @robyschek suggested): if your code reads utf-8 data without any errors, then it should be able to write the same data as utf-8 too.
Your UnicodeDecodeError: 'utf-8' error occurs while reading in_file (for line in in_file). Not all byte sequences are valid utf-8; e.g., os.urandom(100).decode('utf-8') may fail. What to do depends on the application.
If you expect the file to be encoded as utf-8, you could pass the errors="ignore" parameter to open(), to skip the occasional invalid byte sequences, or use one of the other error handlers depending on your application.
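For example, a minimal variation of the updated code that simply drops undecodable byte sequences while reading (note this silently loses a little data):
def main():
    with open("rockyou.txt", encoding="utf8", errors="ignore") as in_file, \
            open('rockout.txt', 'w', encoding="utf8") as out_file:
        for line in in_file:
            if len(line.rstrip()) >= 8:
                out_file.write(line)
    print("done")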
If the actual character encoding used in the file is different, then you should pass that encoding instead. Bytes by themselves do not have any encoding; that metadata has to come from another source (though some encodings are more likely than others, and chardet can guess). E.g., if the file content is an http body, see "A good way to get the charset/encoding of an HTTP response in Python".
Sometimes broken software can generate mostly utf-8 byte sequences with some bytes in a different encoding; bs4.BeautifulSoup can handle some such special cases. You could also try the ftfy utility/library and see if it helps in your case, e.g., ftfy may fix some utf-8 variations.
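A minimal sketch of the latter, assuming ftfy is installed (pip install ftfy); fix_text() is its main entry point and takes a text string:
import ftfy

# attempts to repair mojibake, e.g. utf-8 bytes that were mis-decoded as cp1252
print(ftfy.fix_text(u'The costs will be â‚¬10'))  # -> 'The costs will be €10'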
I was having a similar issue. In the case of the rockyou.txt wordlist, I tried a number of the encodings Python has to offer and found that encoding='koi8_u' worked to read the file.

Python 3 itertools.islice continue despite UnicodeDecodeError

I have a python 3 program that monitors a log file. The log includes, among other things, chat messages written by users. The log is created by a third party application which I cannot change.
Today a user wrote "텋��텋��" and it caused the program to crash with the following error:
future: <Task finished coro=<updateConsoleLog() done, defined at /usr/local/src/bserver/logmonitor.py:48> exception=UnicodeDecodeError('utf-8',...
say "\xed\xa0\xbd\xed\xb1\x8c"\r\n', 7623, 7624, 'invalid continuation byte')>
Traceback (most recent call last):
File "/usr/lib/python3.4/asyncio/tasks.py", line 238, in _step
result = next(coro)
File "/usr/local/src/bserver/logmonitor.py", line 50, in updateConsoleLog
server_events = self.console.getUpdate()
File "/usr/local/src/bserver/console.py", line 79, in getUpdate
return self.read()
File "/usr/local/src/bserver/console.py", line 90, in read
for line in itertools.islice(log_file, log_no, None):
File "/usr/lib/python3.4/codecs.py", line 319, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 7623: invalid continuation byte
ERROR:asyncio:Task exception was never retrieved
Using 'file -i log.file' I determined that the log file is us-ascii. This shouldn't be an issue, as ascii is a subset of utf-8 (as far as I know).
Since this happens rarely and I don't mind losing whatever it is that this user typed, is it possible for me to ignore this line or the particular characters that can't be decoded and just keep on reading the rest of the file?
I considered using try: ... except UnicodeDecodeError as ..., but this would mean I can't read anything in the log file after the error.
Code
def read(self):
    log_no = self.last_log_no
    log_file = open(self.path, 'r')
    server_events = []
    starting_log_no = log_no
    for line in itertools.islice(log_file, log_no, None):  # ERROR raised here
        server_events.append(line)
        print(line.replace('\n', '').replace('\r', ''))
        log_no += 1
    self.last_log_no = log_no
    if starting_log_no < log_no:
        return server_events
    return False
Any help or advice would be appreciated!
The byte string \xed\xa0\xbd\xed\xb1\x8c is not valid utf-8. Neither is it us-ascii, since us-ascii characters can only be 7 bits long; i.e. \x8c is greater than 127.
Instead of ignoring the UnicodeDecodeError, try opening the file with an encoding that supports all 8-bits of a byte (e.g. latin-1):
log_file = open(self.path, 'r', encoding='latin-1')
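Alternatively, if you would rather keep reading the file as utf-8 and just tolerate the occasional bad byte, Python 3's open() also accepts an errors parameter (note that this silently replaces or drops the offending bytes):
log_file = open(self.path, 'r', encoding='utf-8', errors='replace')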

Going from URL encoded accented e to accented e in a .json text file using Python

I am copying strings containing the word cafe (but with an accented e) from a javascript source file into a python script where I need to do some processing over the data and then output some JSON. I am having some trouble getting my head around the encoding/decoding details though. This is perhaps best illustrated with an example:
$ python
>>> import urllib2, json
>>> the_name = "Tasty Caf%C3%E9"
>>> the_name
'Tasty Caf%C3%E9'
>>> the_name_unquoted = urllib2.unquote(the_name)
>>> the_name_unquoted
'Tasty Caf\xc3\xe9'
>>> json.dumps({'bla': the_name_unquoted})
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/json/__init__.py", line 231, in dumps
return _default_encoder.encode(obj)
File "/usr/lib/python2.7/json/encoder.py", line 201, in encode
chunks = self.iterencode(o, _one_shot=True)
File "/usr/lib/python2.7/json/encoder.py", line 264, in iterencode
return _iterencode(o, 0)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 9: invalid continuation byte
I've spent some time trying to understand how encodings work, though clearly I'm not getting it. Exactly what encoding/format (or whatever the appropriate terminology is) is the_name_unquoted in above, and what is it about it that utf8 cannot decode correctly?
That's because the unquoted byte string is not something the default codec can handle. You can fix this by working with unicode strings instead:
the_name = u'Tasty Caf%C3%E9'
Alternatively, if the byte string exists already, you can convert it:
the_name = 'Tasty Caf%C3%E9'
the_name = unicode(the_name)
# or..
the_name = the_name.decode('utf8')
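For what it's worth, %C3%E9 is not the UTF-8 percent-encoding of an accented e: UTF-8 encodes é as the byte pair C3 A9, so the unquoted bytes \xc3\xe9 are not a valid UTF-8 sequence (hence the "invalid continuation byte" in the traceback). With correctly encoded input, the round trip works:
>>> urllib2.unquote('Tasty Caf%C3%A9').decode('utf-8')
u'Tasty Caf\xe9'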

Unicode characters with BlobStore in App Engine

Is there a way to store unicode data with App Engine's BlobStore (in Python)?
I'm saving the data like this
file_name = files.blobstore.create(mime_type='application/octet-stream')
with files.open(file_name, 'a') as f:
    f.write('<as><a>' + '</a><a>'.join(stringInUnicode) + '</a></as>')
But on the production (not development) server I'm getting this error. It seems to be converting my Unicode into ASCII and I don't know why.
Why is it trying to convert back to ASCII? Can I avoid this?
Traceback (most recent call last):
File "/base/data/home/apps/myapp/1.349473606437967000/myfile.py", line 137, in get
f.write('<as><a>' + '</a><a>'.join(stringInUnicode) + '</a></as>')
File "/base/python_runtime/python_lib/versions/1/google/appengine/api/files/file.py", line 364, in write
self._make_rpc_call_with_retry('Append', request, response)
File "/base/python_runtime/python_lib/versions/1/google/appengine/api/files/file.py", line 472, in _make_rpc_call_with_retry
_make_call(method, request, response)
File "/base/python_runtime/python_lib/versions/1/google/appengine/api/files/file.py", line 226, in _make_call
rpc.make_call(method, request, response)
File "/base/python_runtime/python_lib/versions/1/google/appengine/api/apiproxy_stub_map.py", line 509, in make_call
self.__rpc.MakeCall(self.__service, method, request, response)
File "/base/python_runtime/python_lib/versions/1/google/appengine/api/apiproxy_rpc.py", line 115, in MakeCall
self._MakeCallImpl()
File "/base/python_runtime/python_lib/versions/1/google/appengine/runtime/apiproxy.py", line 161, in _MakeCallImpl
self.request.Output(e)
File "/base/python_runtime/python_lib/versions/1/google/net/proto/ProtocolBuffer.py", line 204, in Output
self.OutputUnchecked(e)
File "/base/python_runtime/python_lib/versions/1/google/appengine/api/files/file_service_pb.py", line 2390, in OutputUnchecked
out.putPrefixedString(self.data_)
File "/base/python_runtime/python_lib/versions/1/google/net/proto/ProtocolBuffer.py", line 432, in putPrefixedString
v = str(v)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 313: ordinal not in range(128)
A BLOB store contains binary data: bytes, not characters. So you're going to have to do an encode step of some sort. utf-8 seems as good an encoding as any.
f.write('<as><a>' + '</a><a>'.join(stringInUnicode) + '</a></as>')
This will go wrong if an item in stringInUnicode contains <, & or ]]> sequences. You'll want to do some escaping (either using a proper XML library to serialise the data, or manually):
with files.open(file_name, 'a') as f:
    f.write('<as>')
    for line in stringInUnicode:
        line = line.replace(u'&', u'&amp;').replace(u'<', u'&lt;').replace(u'>', u'&gt;')
        f.write('<a>%s</a>' % line.encode('utf-8'))
    f.write('</as>')
(This will still be ill-formed XML if the strings ever include control characters, but there's not so much you can do about that. If you need to store arbitrary binary in XML you'd need some ad-hoc encoding such as base-64 on top.)
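If you'd rather not hand-roll the escaping, here is a sketch along the same lines using the standard library's xml.sax.saxutils.escape, which escapes &, < and >:
from xml.sax.saxutils import escape

with files.open(file_name, 'a') as f:
    f.write('<as>')
    for line in stringInUnicode:
        # escape() handles &, < and >; encode to utf-8 before writing bytes out
        f.write('<a>%s</a>' % escape(line).encode('utf-8'))
    f.write('</as>')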

CSV, DictWriter, unicode and utf-8

I am having problems with the DictWriter and non-ascii characters. A short version of my problem:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import codecs
import csv
f = codecs.open("test.csv", 'w', 'utf-8')
writer = csv.DictWriter(f, ['field1'], delimiter='\t')
writer.writerow({'field1':u'å'.encode('utf-8')})
f.close()
Gives this Traceback:
Traceback (most recent call last):
File "test.py", line 10, in <module>writer.writerow({'field1':u'å'.encode('utf-8')})
File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/csv.py", line 124, in writerow
return self.writer.writerow(self._dict_to_list(rowdict))
File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/codecs.py", line 638, in write
return self.writer.write(data)
File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/codecs.py", line 303, in write data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
I am a bit lost, as DictWriter ought to be able to work with UTF-8 from what I have read in the documentation.
The object you obtain with codecs.open wants a unicode string in its write method -- that's the whole point. csv.DictWriter of course is calling that method with a utf8-encoded byte string instead, whence the exception.
Change f's creation to f = open("test.csv", 'wb') (taking codecs out of the picture) and things should work just fine.
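Putting that together, a minimal sketch of the suggested fix (under Python 2 the csv module wants a binary-mode file and byte strings, so encode each value to utf-8 yourself):
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import csv

f = open("test.csv", 'wb')
writer = csv.DictWriter(f, ['field1'], delimiter='\t')
writer.writerow({'field1': u'å'.encode('utf-8')})
f.close()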
