GAE Python - CSV Writer gives ascii codec error

I have a Google App Engine app that stores data in a number of languages, e.g. German and Russian. To do so, I need to store the strings as UTF-8, which luckily happens automatically since I use the webapp2 request handler. As a beginning programmer, this allowed me to avoid the complications of encoding and decoding the data.
However, now I am trying to write the contents to a CSV file, and it seems that for the csv.writer command the encoding matters after all. I am not sure whether I should be decoding or encoding, but the error I currently get is the following:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-9: ordinal not in range(128)
The code I am using is:
import csv, webapp2, codecs

class AdminShopExport(webapp2.RequestHandler):
    def get(self):
        shops = Shop.all()
        shops.order('name')
        self.response.headers['Content-Type'] = 'application/csv'
        writer = csv.writer(self.response.out)
        writer.writerow(["id", "name", "domain", "category"])
        for shop in shops:
            writer.writerow([shop.keyname, shop.name, shop.url, shop.category])
Regarding the contents, the error currently comes from a category given in Russian, but as said, any of these fields could contain UTF-8 characters. What would be the best way to handle this? Thanks for your help!

The Python 2.x csv module does not support Unicode. Try the unicodecsv package instead:
https://github.com/jdunck/python-unicodecsv
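A minimal sketch of what that drop-in replacement might look like inside the get() handler from the question, assuming the unicodecsv package is bundled with the app (App Engine requires third-party libraries to be vendored):

# Python 2 sketch: unicodecsv accepts unicode values and encodes them to UTF-8
import unicodecsv

writer = unicodecsv.writer(self.response.out, encoding='utf-8')
writer.writerow(["id", "name", "domain", "category"])
for shop in shops:
    writer.writerow([shop.keyname, shop.name, shop.url, shop.category])

Alternatively, the stock csv module works if every field is converted to UTF-8 bytes first, e.g. writer.writerow([unicode(f).encode('utf-8') for f in row]).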

Related

Special Character in Header line during text import

I'm trying to write a Python script to import a data file generated by data acquisition software (EC-lab). I would like to keep the column headers as they are in the file and not define them manually, since they are not uniform across all files (different techniques will generate data in different orders and will have a different number of headers). The problem is that the header text in the file contains forward slashes (e.g. "ox/red", "time/s").
I get an ASCII error when I try to load the data with the header line:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xb5 in position 19: ordinal not in range(128)
I've tried adding encoding as a keyword argument based on other answers, but that didn't yield a solution:
data = np.genfromtxt("20180611_bB_GCE-G.mpt", dtype=None, delimiter='\t', names=True, skip_header=61, encoding='utf-8')
I'm currently using genfromtxt as the data import technique
data = np.genfromtxt("filename.mpt", dtype=None, delimiter='\t', names=True, skip_header=61)
First, forward slashes in headers are not a problem for ASCII, for CSV files, or for NumPy.
My guess is that the real problem is that your CSV is in Latin-1, or a Latin-1-compatible encoding like Windows-1252, and that one of the headers includes the micro sign µ, which is 0xB5 in those encodings. Or that the headers aren't actually a problem at all, and you have µ characters in some of the data.
Either way, with the default encoding of ASCII, you get an error about 0xb5 not being in range(128), exactly like the one in your question.
If you try to fix this by explicitly specifying encoding='utf-8', that's the wrong encoding, and you just get a different error, about 0xb5 being an invalid start byte.
If you fix it by specifying encoding='latin-1', it should work.
More generally, you have to know what encoding your files are actually in, not just guess wildly. Especially if you're on Windows, where a lot of files are going to be in whatever encoding you have set as your OEM code page, while others will be in UTF-16-LE, while others will be in UTF-8 but with an illegal BOM, etc.
The program that generated them should document what encoding it uses, or have options to let you pick. If it doesn't, you need to try, e.g., viewing the file in a text editor that lets you select the encoding to try to figure out which one looks correct. Or you can use a tool like chardet to help you guess.
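For example, a hedged sketch of both steps, reusing the call from the question (the Latin-1 guess is exactly that, a guess, and chardet is a third-party package):

# Guess the encoding from a sample of the raw bytes
import chardet
with open("20180611_bB_GCE-G.mpt", "rb") as f:
    print(chardet.detect(f.read(200000)))

# Then load with whatever encoding that reports, e.g. Latin-1
import numpy as np
data = np.genfromtxt("20180611_bB_GCE-G.mpt", dtype=None, delimiter='\t',
                     names=True, skip_header=61, encoding='latin-1')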

decoding issue while parsing JSON [python]

I am reading a JSON file in Python which has lots of fields and values (~8000 records).
Environment: Windows 10, Python 3.6.4.
code:
import json
json_data = json.load(open('json_list.json'))
print (json_data)
With this I get an error. Below is the stack trace:
json_data = json.load(open('json_list.json'))
  File "C:\Program Files (x86)\Python36-32\lib\json\__init__.py", line 296, in load
    return loads(fp.read(),
  File "C:\Program Files (x86)\Python36-32\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 7977319: character maps to <undefined>
Along with this I have tried
import json
with open('json_list.json', encoding='utf-8') as fd:
    json_data = json.load(fd)
print (json_data)
With this, my program runs for a long time and then hangs with no output.
I have searched almost all related topics and could not find a solution.
Note: the JSON data is valid, as Postman and other REST clients display it without reporting any anomalies.
Any help with this, or an alternative way to load my JSON data (even by converting it to a string and back to JSON, etc.), would be of great help.
Here is what the file looks like around the reported error:
>>> from pprint import pprint
>>> f = open('C:/Users/c5242046/Desktop/test2/dblist_rest.json', 'rb')
>>> f.seek(7977319)
7977319
>>> pprint(f.read(100))
(b'\x81TICA EL ABGEN INGL\xc3\x83\xc2\x89S, S.A.","memory_size_gb":"64","since'
b'":"2017-04-10","storage_size_gb":"84.747')
The snippet you are asking about seems to have been double-encoded. Basically, whatever originally generated this data produced text in Latin-1 or some related encoding (Windows code page 1252?). It was then fed to a process which converts Latin-1 to UTF-8 ... twice.
Of course, "converting" data which is already UTF-8 but telling the computer that it's Latin-1 just produces mojibake.
The string INGL\xc3\x83\xc2\x89S suggests this analysis, if you can guess that it is supposed to say Inglés in upper case, and realize that the UTF-8 encoding for É is \xC3 \x89 and then examine which characters these two bytes encode in Latin-1 (or, as it happens, Unicode, which is a superset of Latin-1, though they are not compatible on the encoding level).
Notice that being able to guess which string a problematic sequence is supposed to represent is the crucial step here; it also explains why including a representative snippet of the problematic data - with enough context! - is vital for debugging.
Anyway, if the entire file has the same symptom, you should be able to undo the second, superfluous and incorrect round of re-encoding; though an error this far into the file makes me imagine it's probably a local problem with just one or a few records. Maybe they were merged from multiple input files, only one of which had this error. Then fixing it requires a fair bit of detective work, and manual editing, or identifying and fixing the erroneous source. A quick and dirty workaround is to simply manually remove any erroneous records.
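If it does turn out to be a file-wide double encoding, a rough sketch of reversing the superfluous round trip could look like this (file name taken from the question; characters outside Latin-1 will raise an error here, which usually means only some records are affected and need manual repair instead):

import json

with open('json_list.json', encoding='utf-8') as f:
    doubled = f.read()

# Undo the extra Latin-1 -> UTF-8 pass, then parse the repaired text
repaired = doubled.encode('latin-1').decode('utf-8')
json_data = json.loads(repaired)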

Python 3.4 hex to Japanese Characters

I am currently writing a script to pull information off my site which contains Japanese characters. So far I have my script pulling out the data off the site.
It has return as a string:
"\xe5\xb9\xb4\xe3\x81\xab\xe4\xb8\x80\xe5\xba\xa6\xe3\x81\xae\xe6\x99\xb4\xe3\x82\x8c\xe5\xa7\xbf"
Using an online hex-to-text tool, I get:
年に一度の晴れ姿
I know this phrase is correct, but my question is: how do I convert it in Python? When I run something like:
name = "\xe5\xb9\xb4\xe3\x81\xab\xe4\xb8\x80\xe5\xba\xa6\xe3\x81\xae\xe6\x99\xb4\xe3\x82\x8c\xe5\xa7\xbf"
print(name)
I get this:
å¹´ã«ä¸åº¦ã®æ´ã姿
I've tried to
name.decode("hex")
But it seems like Python 3.4 doesn't have str.decode(), so I tried to convert it to a bytes object and decode it that way, which still failed.
Edit 1:
Follow-up question, if you don't mind: the solution Martijn Pieters gave works:
name = "\xe2\x80\x9c\xe5\xa4\x8f\xe7\xa5\xad\xe3\x82\x8a\xe3\x83\x87\xe3\x83\xbc\xe3\x8‌​3\x88\xe2\x80\x9d\xe7\xb5\xa2\xe7\x80\xac \xe7\xb5\xb5\xe9\x87\x8c"
name = name.encode('latin1')
print(name.decode('Utf-8'))
However, if I put what's inside the quotes for name into a file and do this:
with open('0N.txt', mode='r', encoding='utf-8') as f:
    name = f.read()
name = name.encode('latin1')
print(name.decode('Utf-8'))
It doesn't work...any ideas?
You are confusing the Python representation with the contents. You are shown \xhh hex escapes used in Python string literals to keep the displayed value ASCII-safe and reproducible.
You have UTF-8 data here:
>>> name = b"\xe5\xb9\xb4\xe3\x81\xab\xe4\xb8\x80\xe5\xba\xa6\xe3\x81\xae\xe6\x99\xb4\xe3\x82\x8c\xe5\xa7\xbf"
>>> name.decode('utf8')
'\u5e74\u306b\u4e00\u5ea6\u306e\u6674\u308c\u59ff'
>>> print(name.decode('utf8'))
年に一度の晴れ姿
Note that I used a bytes() string literal there, using b'...'. If your data is not a bytes object you have a Mojibake and need to encode to bytes first:
name.encode('latin1').decode('utf8')
Latin-1 maps the codepoints 0-255 one-to-one to bytes, so it's usually a safe bet to use for such data. It could be that you have a Mojibake in a different codec; it depends on how you retrieved the data.
If you used open() to read the data from a file, you either specified the wrong encoding or relied on your platform default. Use open(filename, encoding='utf8') to remedy that.
If you used the requests library to load this from a website, take into account that the response.text attribute uses latin-1 as the default codec if a) the site didn't specify a codec and b) the response has a text/* mime-type. If this is sourced from HTML, usually the codec is part of the HTML headers instead. Use a library like BeautifulSoup to handle HTML (using the response.content raw bytes) and it'll detect such information for you.
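A short sketch of that route, with a hypothetical URL (requests and BeautifulSoup are third-party packages):

import requests
from bs4 import BeautifulSoup

resp = requests.get('https://example.com/page')    # hypothetical URL
soup = BeautifulSoup(resp.content, 'html.parser')  # pass raw bytes; BS4 sniffs the encoding
text = soup.get_text()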
If all else fails, the ftfy library may still be able to fix a Mojibake; it uses specially constructed codecs to reverse common errors.
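As for the follow-up edit: if 0N.txt already contains the actual Japanese text rather than the literal \xhh escapes, then reading it with the right encoding is all that is needed, and the extra encode/decode round trip will itself mangle the data. A minimal sketch, assuming the file really is UTF-8:

with open('0N.txt', mode='r', encoding='utf-8') as f:
    name = f.read()
print(name)  # already decoded text; no .encode('latin1') step required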

International characters in Python

I'm currently working on a Python script that takes a list of log files (from a search engine) and produces a file with all the queries within these, for later analysis.
Another feature of the script is that it removes the most common words, which I've also implemented, but I've run into a problem I can't seem to overcome. The removal of words works as intended, as long as the queries do not contain special characters. As the search logs are in Danish, the characters æ, ø and å appear regularly.
Searching on the topic I'm now aware that I need to encode these into UTF-8, which I'm doing when obtaining the query:
tmp = t_query.encode("UTF-8").lower().split()
t_query is the query and I split it up to later compare each word with my list of forbidden words. If I do not use the encoding I'll get the error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 1: ordinal not in range(128)
Edit: I also tried using the decode instead, but get the following error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa7' in position 3: ordinal not in range(128)
I loop through the words like this:
for i in tmp:
    if i in words_to_filter:
        tmp.remove(i)
As said, this works perfectly for words that do not include special characters. I've tried printing i along with the current forbidden word, and I get e.g.:
færdelsloven - færdelsloven
where the first word is the i-th element in tmp and the last is the one from the forbidden words. Obviously something has gone wrong, but I just can't manage to find a solution. I've tried many suggestions found on Google and on here, but nothing has worked so far.
Edit 2: in case it makes a difference, I've tried loading the log files both with and without the use of codecs:
with codecs.open(file_name, "r", "utf-8") as f_src:
    jlogs = map(json.loads, f_src.readlines())
I'm running Python 2.7.2 from a Windows environment, if it matters. The script should be able to run on other platforms (namely Linux and Mac OS).
I would really appreciate if one of you are able to help me out.
Best regards
Casper
If you are reading files, you want to decode them.
tmp = t_query.decode("UTF-8").lower().split()
Given a utf-8 file with json object per line, you could read all objects:
with open(filename) as file:
    jlogs = [json.loads(line) for line in file]
Except for the treatment of embedded newlines, the above code should produce the same result as yours:
with codecs.open(file_name, "r", "utf-8") as f_src:
    jlogs = map(json.loads, f_src.readlines())
At this point all strings in jlogs are Unicode, so you don't need to do anything special to handle "special" characters. Just make sure you are not mixing bytes and Unicode text in your code.
To get Unicode text from bytes: some_bytes.decode(character_encoding)
To get bytes from Unicode text: some_text.encode(character_encoding)
Don't encode bytes or decode Unicode text.
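A tiny sketch of keeping the comparison entirely in Unicode (Python 2), assuming words_to_filter also holds Unicode strings; building a new list also avoids the pitfall of removing items from tmp while iterating over it:

tmp = t_query.lower().split()  # t_query is already unicode when it comes from json.loads
filtered = [w for w in tmp if w not in words_to_filter]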
If the encoding is right and you just want to ignore unexpected characters, you can pass the errors='ignore' or errors='replace' parameter to the codecs.open function.
with codecs.open(file_name, encoding='utf-8', mode='r', errors='ignore') as f:
    jlogs = map(json.loads, f.readlines())
Details in docs:
http://docs.python.org/2/howto/unicode.html#reading-and-writing-unicode-data
I've finally solved it. As Lattyware suggested, Python 3.x handles this much better. After changing the Python version and saving the source file as UTF-8, it works as intended.

How to parse unicode strings with minidom?

I'm trying to parse a bunch of XML files with the library xml.dom.minidom to extract some data and put it in a text file. Most of the XMLs go well, but for some of them I get the following error when calling minidom.parseString():
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 5189: ordinal not in range(128)
It happens for some other non-ASCII characters too. My question is: what are my options here? Am I supposed to somehow strip or replace all those non-English characters before being able to parse the XML files?
Try to decode it:
> print u'abcdé'.encode('utf-8')
> abcdé
> print u'abcdé'.encode('utf-8').decode('utf-8')
> abcdé
In case your string is 'str':
xmldoc = minidom.parseString(u'{0}'.format(str).encode('utf-8'))
This worked for me.
Minidom doesn't directly support parsing Unicode strings; it's something that has historically had poor support and standardisation. Many XML tools recognise only byte streams as something an XML parser can consume.
If you have plain files, you should either read them in as byte strings (not Unicode!) and pass that to parseString(), or just use parse() which will read a file directly.
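A minimal sketch of the byte-oriented route (the file name is hypothetical):

from xml.dom import minidom

with open('example.xml', 'rb') as f:      # read raw bytes, not decoded text
    doc = minidom.parseString(f.read())

# or, equivalently, let minidom open the file itself:
doc = minidom.parse('example.xml')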
I know the O.P. asked about parsing strings, but I had the same exception upon writing the DOM model to a file via Document.writexml(...). In case people with that (related) problem land here, I will offer my solution.
My code which was throwing the UnicodeEncodeError looked like:
with tempfile.NamedTemporaryFile(delete=False) as fh:
    dom.writexml(fh, encoding="utf-8")
Note that the "encoding" param only effects the XML header and has no effect on the treatment of the data. To fix it, I changed it to:
with tempfile.NamedTemporaryFile(delete=False) as fh:
    fh = codecs.lookup("utf-8")[3](fh)
    dom.writexml(fh, encoding="utf-8")
This wraps the file handle with an instance of encodings.utf_8.StreamWriter, which handles the data as UTF-8 rather than ASCII, and the UnicodeEncodeError went away. I got the idea from reading the source of xml.dom.minidom.Node.toprettyxml(...).
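For what it's worth, the same wrapper can be spelled a little more readably with codecs.getwriter, which returns the same StreamWriter class as codecs.lookup("utf-8")[3]; a sketch reusing dom and the imports from above:

with tempfile.NamedTemporaryFile(delete=False) as fh:
    writer = codecs.getwriter("utf-8")(fh)
    dom.writexml(writer, encoding="utf-8")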
I've encountered this error a few times, and my hacky way of dealing with it is just to do this:
def getCleanString(word):
    clean = ""  # don't shadow the built-in str, which the original version did
    for character in word:
        try:
            clean = clean + str(character)
        except UnicodeEncodeError:
            pass  # this happens if character is non-ASCII unicode; drop it
    return clean
Of course, this is probably a dumb way of doing it, but it gets the job done for me, and doesn't cost me anything in speed.
