I've been reading all the questions on Stack Overflow about converting Unicode to CSV in Python, and I'm still lost. Every time, I get a "UnicodeEncodeError: 'ascii' codec can't encode character u'\xd1' in position 12: ordinal not in range(128)"
import cStringIO
import csv

buffer = cStringIO.StringIO()
writer = csv.writer(buffer, csv.excel)
cr.execute(query, query_param)  # cr is the database cursor from the surrounding code
while True:
    row = cr.fetchone()
    if row is None:
        break
    writer.writerow([s.encode('ascii', 'ignore') for s in row])
The value of row is
(56, u"LIMPIADOR BA\xd1O 1'5 L")
where the value of \xd1 in the database is ñ, an n with a tilde, used in Spanish. At first I tried to convert the value to something valid in ASCII, but after losing so much time I'm now trying just to ignore those characters (I suppose I'd have the same problem with accented vowels).
I'd like to save the value to the CSV, preferably with the ñ ("LIMPIADOR BAÑO 1'5 L"), but if not possible, at least be able to save it ("LIMPIADOR BAO 1'5 L").
Correct: ñ is not a valid ASCII character, so you can't encode it to ASCII. You can, as your code above does, ignore such characters. Another way, namely removing the accents, is described here:
What is the best way to remove accents in a Python unicode string?
But note that both techniques can have bad effects, like making words actually mean something different. So the best option is to keep the accents, which means you can't use ASCII, but you can use another encoding. UTF-8 is the safe bet. Latin-1 (ISO-8859-1) is a common one, but it includes only Western European characters. CP-1252 is common on Windows, etc.
So just switch "ascii" for whatever encoding you want.
Your actual code, according to your comment, is:
writer.writerow([s.encode('utf8') if type(s) is unicode else s for s in row])
where
row = (56, u"LIMPIADOR BA\xd1O 1'5 L")
Now, I believe that should work, but apparently it doesn't. I think a unicode object gets passed into the csv writer by mistake anyway. Unwrap that long line into its parts:
col1, col2 = row # Use the names of what is actually there instead
row = col1, col2.encode('utf8')
writer.writerow(row)
Now your real error will not be hidden by the fact that you stick everything on one line. This could also probably have been avoided if you had included a proper traceback.
Related
I'm using Python 3.7 and Django 2.0. I want to strip out non-UTF-8 characters from a string that I'm obtaining by reading this CSV file. I tried this ...
web_site = row['website'].strip().encode("utf-8", 'ignore').decode("utf-8")
but this doesn't seem to be doing the job, since I have a resulting string that looks like ...
web_site: "wbez.org<200e>"
Whatever this "<200e>" thing is, it's evidently a non-UTF-8 character, because when I try to insert this into a MySQL database (deployed as a Docker image), I get the following error ...
web_1 | django.db.utils.OperationalError: Problem installing fixture '/app/maps/fixtures/seed_data.yaml': Could not load maps.Coop(pk=191): (1366, "Incorrect string value: '\\xE2\\x80\\x8E' for column 'web_site' at row 1")
Your row['website'] is already a Unicode string. UTF-8 can support all valid Unicode code points, so .encode('utf8','ignore') doesn't typically ignore anything and encodes the entire string in UTF-8, and .decode('utf8') changes it back to a Unicode string again.
If you simply want to strip non-ASCII characters, use the following to filter only ASCII characters and ignore the rest.
row['website'].encode('ascii','ignore').decode('ascii')
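As a quick check, assuming the stray character is the left-to-right mark (U+200E) shown above, a Python 3 interpreter session shows the filtering in action:

>>> "wbez.org\u200e".encode('ascii', 'ignore').decode('ascii')
'wbez.org'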
I think you are confusing the encodings.
Python has a standard character set: Unicode
UTF-8 is just an encoding of Unicode: all characters in Unicode can be encoded in UTF-8, and all valid UTF-8 byte sequences can be decoded back into Unicode characters.
So you are encoding and then decoding Unicode strings, and the code should do nothing. (There are some truly exceptional cases: Python strings are effectively a superset of Unicode, so your code would remove non-Unicode characters; see surrogateescape. But that is an extremely rare case you will usually encounter only when reading sys.argv or os.environ.)
In any case, I think you are approaching this the wrong way. Search this site for the general question (e.g. "remove non-ascii characters"). It is often better to decompose (with NFKD, the compatibility decomposition), then remove the accents, and then remove the remaining non-ASCII characters, so that more characters get translated rather than dropped. There are various functions for creating slugs that do a better job, and there are libraries that translate more characters into "nearly equivalent" ASCII characters: Unicode has several representations of LETTER A, and you may want to translate Alpha, Aleph, and so on into A as well. That is better than discarding them, especially for text in a foreign language, where you might otherwise discard almost everything.
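A rough sketch of that decompose-then-strip approach (the function name is mine, and full coverage of every script is not guaranteed):

import unicodedata

def ascii_fold(text):  # hypothetical helper name
    # NFKD splits accented letters into base letter + combining mark,
    # so the ASCII step keeps the base letter and drops only the mark.
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')

print(ascii_fold("Gabriel García Márquez"))  # Gabriel Garcia Marquez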
I try to write a "string" to a file and get the following error message:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xcd' in position 6: ordinal not in range(128)
I tried the following methods:
print >>f, txt
print >>f, txt.decode('utf-8')
print >>f, txt.encode('utf-8')
None of them work. I have the same error message.
What is the idea behind encoding and decoding? If I have a unicode object, can I write it to the file directly, or do I need to transform it into a string?
How can I find out what encoding is used? How can I know if it is UTF-8 or ASCII or something else?
ADDED
I think I have just managed to save a string into a file. Neither print >>f, txt nor print >>f, txt.decode('utf-8') worked, but print >>f, txt.encode('utf-8') does. I get no error message and I see Chinese characters in my file.
I recently posted another answer that addresses this very issue. Key quote:
For a good overview of the difference, read one of Joel's articles, but the gist is that bytes are, well, bytes (groups of 8 bits without any further meaning attached), whereas characters are the things that make up strings of text. Encoding turns characters into bytes, and decoding turns bytes back into characters.
In Python 2, unicode objects are character strings. Regular str objects can be either character strings or byte strings. (Pro tip: use Python 3, it makes keeping track a lot easier.)
You should be passing character strings (not byte strings) to print, but you will need to be sure that those character strings can be encoded by the codec (such as ASCII or UTF-8) associated with the destination file object f. As part of the output process, Python encodes the string for you. If the string contains characters that cannot be encoded by the file object's codec, you will get errors like the one you're seeing.
Without knowing what is in your txt object I can't be more specific.
I think you need to use the codecs library:
import codecs

# codecs.open attaches the encoding to the file object; write() then
# accepts unicode objects and encodes them to UTF-8 for you.
f = codecs.open("test.txt", "w", "utf-8")
f.write(u'\xcd')
f.close()
Works fine.
The Story of Encoding/Decoding:
In the past, computers supported only a small set of characters (upper-case and lower-case letters, digits, and some special characters), so one byte was enough to assign a unique number to each. Assigning numbers to characters for storage in memory is called encoding, and the one-byte encoding that Python 2 uses by default is named ASCII.
As computing spread around the world, more letters and characters were needed, and one byte was no longer enough, so different encoding schemes appeared. Unicode, with encodings such as UTF-8, is the most famous. The character you are trying to store in your file is outside the ASCII range and needs more than one byte, so you must explicitly tell Python not to use the default encoding.
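A minimal sketch of that explicit step in Python 2 (the filename is just a placeholder):

# Encode the unicode object to UTF-8 bytes yourself before writing,
# so Python never falls back to its implicit ASCII codec.
with open('out.txt', 'wb') as f:
    f.write(u'\xcd'.encode('utf-8'))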
I have a function like this:
def convert_to_unicode(data):
    row = {}
    if data is None:
        return data
    try:
        for key, val in data.items():
            if isinstance(val, str):
                row[key] = unicode(val.decode('utf8'))
            else:
                row[key] = val
        return row
    except Exception, ex:
        log.debug(ex)
to which I feed a result set (obtained using MySQLdb.cursors.DictCursor) row by row, to transform all the string values to unicode (for example, {'column_1':'XXX'} becomes {'column_1':u'XXX'}).
Problem is, one of the rows has a value like {'column_1':'Gabriel García Márquez'}
and it does not get transformed. It throws this error:
'utf8' codec can't decode byte 0xed in position 12: invalid continuation byte
When I searched for this, it seemed to have something to do with ASCII encoding.
The solutions I tried are:
adding # -*- coding: utf-8 -*- at the beginning of my file ... does not help
changing the line row[key] = unicode(val.decode('utf8')) to row[key] = unicode(val.decode('utf8', 'ignore')) ... as expected it ignores the non-ascii character and returns {'column_1':u'Gabriel Garca Mrquez'}
changing the line row[key] = unicode(val.decode('utf8')) to row[key] = unicode(val.decode('latin-1')) ... does the job, but I am afraid it will support only Western European characters (as described here)
Can anybody point me in the right direction, please?
Firstly:
The data you're getting in your result set is clearly latin-1 encoded, or you wouldn't be observing this behavior. It is entirely correct that trying to decode a latin-1-encoded byte string as though it were utf-8-encoded blows up in your face. Once you have a latin-1-encoded byte string foo, if you want to convert it to the unicode type, foo.decode('latin1') is the right thing to do.
I noticed the expression unicode(val.decode('utf8')) in your code. This is equivalent to just val.decode('utf8'); calling the .decode method of a byte string converts it to unicode, so you're calling unicode() on a unicode string, which just returns the unicode string.
Secondly:
Your real problem here - if you want to be able to deal with characters not included in the character set supported by the latin-1 encoding - is not with Python's string types, per se, so much as it is with the MySQLdb library. I don't know this problem in intimate detail, but as I understand it, in ancient versions of MySQL, the default encoding used by MySQL databases was latin-1, but now it is utf-8 (and has been for many years). The MySQLdb library, however, still by default establishes latin-1-encoded connections with the database. There are literally dozens of StackOverflow questions relating to MySQL, Python, and string encoding, and while I don't fully understand them, one easy-to-use solution to all such problems that seems to work for people is this one:
http://www.dasprids.de/blog/2007/12/17/python-mysqldb-and-utf-8
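As I understand it, the essence of that fix is to request a UTF-8 connection up front. A minimal sketch, with placeholder connection parameters:

import MySQLdb
import MySQLdb.cursors

# Ask MySQLdb for a UTF-8 connection and unicode results, overriding
# the library's legacy latin-1 default.
conn = MySQLdb.connect(
    host='localhost', user='me', passwd='secret', db='mydb',  # placeholders
    charset='utf8', use_unicode=True,
    cursorclass=MySQLdb.cursors.DictCursor,
)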
I wish I could give you a more comprehensive and confident answer on the MySQLdb issue, but I've never even used MySQL and I don't want to risk posting anything untrue. Perhaps someone can come along and provide more detail. Nonetheless, I hope this helps you.
Your third solution - changing the encoding to "latin-1" - is correct. Your input data is encoded as Latin-1, so that's what you have to decode it as. Unless someone somewhere did something very silly, it should be impossible for that input data to contain invalid characters for that encoding.
I'm writing a Python script to process some music data. It's supposed to merge two separate databases by comparing their entries and matching them up. It's almost working, but fails when comparing strings containing special characters (i.e. accented letters). I'm pretty sure it's a ASCII vs. Unicode encoding issue, as I get the error:
"Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal"
I realize I could use regular expressions to remove the offending characters, but I'm processing a lot of data and relying too much on regexes makes my program grindingly slow. Is there a way to have Python properly compare these strings? What is going on here--is there a way to tell whether it's storing my strings as ASCII or Unicode?
EDIT 1: I'm using Python v2.6.6. After checking the types, I've discovered that one database gives me Unicode strings and the other gives ASCII. So that's probably the problem. I'm trying to convert the ASCII strings from the second database to Unicode with a line like
line = unicode(f.readline().decode(latin_1).encode(utf_8))
but this gives an error like:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 41: ordinal not in range(128)
I'm not sure why the 'ascii' codec is complaining since I'm trying to decode from ASCII. Can anyone help?
Unicode vs Bytes
First, some terminology. There are two types of strings, encoded and decoded:
Encoded. This is what's stored on disk. To Python, it's a bunch of 0's and 1's that you might treat like ASCII, but it could be anything -- binary data, a JPEG image, whatever. In Python 2.x, this is called a "string" variable. In Python 3.x, it's more accurately called a "bytes" variable.
Decoded. This is a string of actual characters. They could be encoded to 8-bit ASCII bytes, or to multi-byte Chinese characters, but until it's time to convert to an encoded variable, it's just a Unicode string of characters.
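A Python 2 interpreter session makes the round trip concrete:

>>> u'caf\xe9'.encode('utf8')     # characters -> bytes (encoding)
'caf\xc3\xa9'
>>> 'caf\xc3\xa9'.decode('utf8')  # bytes -> characters (decoding)
u'caf\xe9'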
What this means to you
So here's the thing. You said you were getting one ASCII variable and one Unicode variable. That's actually not true.
You have one variable that's a string of bytes -- ones and zeros, presumably in sets of 8. This is the variable you assumed, incorrectly, to be ASCII.
You have another variable that's Unicode data -- numbers, letters, and symbols.
Before you compare the string of bytes to a Unicode string of characters, you have to make some assumptions. In your case, Python (and you) assumed that the string of bytes was ASCII encoded. That worked fine until you came across a character that wasn't ASCII -- a character with an accent mark.
So you need to find out what that string of bytes is encoded as. It might be latin1. If it is, you want to do this:
if unicode_variable == string_variable.decode('latin1')
Latin1 is basically ASCII plus some extended characters like Ç and Â.
If your data is in Latin1, that's all you need to do. But if your string of bytes is encoded in something else, you'll need to figure out what encoding that is and pass it to decode().
The bottom line is, there's no easy answer, unless you know (or make some assumptions) about the encoding of your input data.
What I would do
Try running var.decode('latin1') on your string of bytes. That will give you a Unicode variable. If that works, and the data looks correct (ie, characters with accent marks look like they belong), roll with it.
Oh, and if latin1 doesn't parse or doesn't look right, try utf8 -- another common encoding.
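One way to wrap that trial-and-error in code, as a sketch (UTF-8 is tried first, because latin1 accepts any byte sequence and so can never fail):

def to_unicode(byte_string):  # hypothetical helper name
    # Strict UTF-8 first: genuinely latin1 data with accents will
    # raise here, and we fall back to the latin1 interpretation.
    try:
        return byte_string.decode('utf8')
    except UnicodeDecodeError:
        return byte_string.decode('latin1')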
You might need to preprocess the databases and convert everything into UTF-8. My guess is that you've got Latin-1 accented characters in some entries.
As to your question, the only way to know for sure is to look. Have your script spit out those that don't compare, and look up the character codes. Or just try string.decode('latin1').encode('utf8') and see what happens.
Converting both to unicode should help:
if unicode(str1) == unicode(str2):
    print "same"
To find out whether YOU (not it) are storing your strings as str objects or unicode objects, print type(your_string).
You can use print repr(your_string) to show yourself (and us) unambiguously what is in your string.
By the way, exactly what version of Python are you using, on what OS? If Python 3.x, use ascii() instead of repr().
I'm trying to translate an Excel spreadsheet to CSV using the Python xlrd and csv modules, but am getting hung up on encoding issues. Xlrd produces output from Excel in Unicode, and the CSV module requires UTF-8.
I imagine that this has nothing to do with the xlrd module: everything works fine outputting to stdout or other outputs that don't require a specific encoding.
The worksheet is encoded as UTF-16-LE, according to book.encoding
The simplified version of what I'm doing is:
from xlrd import *
import csv

b = open_workbook('file.xls')
s = b.sheet_by_name('Export')
bc = open('file.csv', 'w')
bcw = csv.writer(bc, csv.excel, b.encoding)
for row in range(s.nrows):
    this_row = []
    for col in range(s.ncols):
        this_row.append(s.cell_value(row, col))
    bcw.writerow(this_row)
This produces the following error, about 740 lines in:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xed' in position 5: ordinal not in range(128)
The value it seems to be getting hung up on is "516-777316" -- the text in the original Excel sheet is "516-7773167" (with a 7 on the end)
I'll be the first to admit that I have only a vague sense of how character encoding works, so most of what I've tried so far has been various fumbling permutations of .encode and .decode on the s.cell_value(row,col)
If someone could suggest a solution I would appreciate it -- even better if you could provide an explanation of what's not working and why, so that I can more easily debug these problems myself in the future.
Thanks in advance!
EDIT:
Thanks for the comments so far.
When I use this_row.append(s.cell(row,col)) (i.e. s.cell instead of s.cell_value), the entire document writes without errors.
The output isn't particularly desirable (text:u'516-7773167'), but it avoids the error even though the offending characters are still in the output.
This makes me think that the challenge might be in xlrd after all.
Thoughts?
I expect the cell_value return value is the unicode string that's giving you problems (please print its type() to confirm that), in which case you should be able to solve it by changing this one line:
this_row.append(s.cell_value(row,col))
to:
this_row.append(s.cell_value(row,col).encode('utf8'))
If cell_value is returning multiple different types, then you need to encode if and only if it's returning a unicode string; so you'd split this line into a few lines:
val = s.cell_value(row, col)
if isinstance(val, unicode):
    val = val.encode('utf8')
this_row.append(val)
You asked for explanations, but some of the phenomena are inexplicable without your help.
(A) Strings in XLS files created by Excel 97 onwards are encoded in Latin1 if possible otherwise in UTF16LE. Each string carries a flag telling which was used. Earlier Excels encoded strings according to the user's "codepage". In any case, xlrd produces unicode objects. The file encoding is of interest only when the XLS file has been created by 3rd party software which either omits the codepage or lies about it. See the Unicode section up the front of the xlrd docs.
(B) Unexplained phenomenon:
This code:
bcw = csv.writer(bc,csv.excel,b.encoding)
causes the following error with Python 2.5, 2.6 and 3.1: TypeError: expected at most 2 arguments, got 3 -- this is about what I'd expect given the docs on csv.writer; it's expecting a filelike object followed by either (1) nothing (2) a dialect or (3) one or more formatting parameters. You gave it a dialect, and csv.writer has no encoding argument, so splat. What version of Python are you using? Or did you not copy/paste the script that you actually ran?
(C) Unexplained phenomena around traceback and what the actual offending data was:
"the_script.py", line 40, in <module>
this_row.append(str(s.cell_value(row,col)))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xed' in position 5: ordinal not in range(128)
FIRSTLY, there's a str() in the offending code line that wasn't in the simplified script -- did you not copy/paste the script that you actually ran? In any case, you shouldn't use str in general -- you won't get the full precision on your floats; just let the csv module convert them.
SECONDLY, you say """The value is seems to be getting hung up on is "516-777316" -- the text in the original Excel sheet is "516-7773167" (with a 7 on the end)""" --- it's difficult to imagine how the 7 gets lost off the end. I'd use something like this to find out exactly what the problematic data was:
try:
    str_value = str(s.cell_value(row, col))
except:
    print "row=%d col=%d cell_value=%r" % (row, col, s.cell_value(row, col))
    raise
That %r saves you from typing cell_value=%s ... repr(s.cell_value(row, col)) ... the repr() produces an unambiguous representation of your data. Learn it. Use it.
How did you arrive at "516-777316"?
THIRDLY, the error message is actually complaining about a unicode character u'\xed' at offset 5 (i.e. the sixth character). U+00ED is LATIN SMALL LETTER I WITH ACUTE, and there's nothing like that at all in "516-7773167"
FOURTHLY, the error location seems to be a moving target -- you said in a comment on one of the solutions: "The error is on bcw.writerow." Huh?
(D) Why you got that error message (with str()): str(a_unicode_object) attempts to convert the unicode object to a str object and in the absence of any encoding information uses ascii, but you have non-ascii data, so splat. Note that your object is to produce a csv file encoded in utf8, but your simplified script doesn't mention utf8 anywhere.
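A two-line Python 2 session reproduces exactly that splat:

>>> str(u'\xed')
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'ascii' codec can't encode character u'\xed' in position 0: ordinal not in range(128)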
(E) """... s.cell(row,col)) (e.g. s.cell instead of s.cell_value) the entire document writes without errors. The output isn't particularly desirable (text:u'516-7773167')"""
That's happening because the csv writer is calling the __str__ method of your Cell object, and this produces <type>:<repr(value)> which may be useful for debugging but as you say not so great in your csv file.
(F) Alex Martelli's solution is great in that it got you going. However you should read the section on the Cell class in the xlrd docs: types of cell are text, number, boolean, date, error, blank and empty. If you have dates, you are going to want to format them as dates not numbers, so you can't use isinstance() (and you may not want the function call overhead anyway) ... this is what the Cell.ctype attribute and Sheet.cell_type() and Sheet.row_types() methods are for.
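A sketch of that ctype-based dispatch, using s and b from the script above (the constants and xldate_as_tuple are from xlrd's documented API; the date formatting is just illustrative):

import xlrd

ctype = s.cell_type(row, col)
val = s.cell_value(row, col)
if ctype == xlrd.XL_CELL_DATE:
    # Excel stores dates as floats; convert using the workbook's datemode.
    year, month, day, h, m, sec = xlrd.xldate_as_tuple(val, b.datemode)
    val = "%04d-%02d-%02d" % (year, month, day)
elif ctype == xlrd.XL_CELL_TEXT:
    val = val.encode('utf8')
this_row.append(val)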
(G) UTF8 is not Unicode. UTF16LE is not Unicode. UTF16 is not Unicode ... and the idea that individual strings would waste 2 bytes each on a UTF16 BOM is too preposterous for even MS to contemplate :-)
(H) Further reading (apart from the xlrd docs):
http://www.joelonsoftware.com/articles/Unicode.html
http://www.amk.ca/python/howto/unicode
Looks like you've got 2 problems.
There's something screwed up in that cell -- '7' should be encoded as u'\x37', I think, since it's within the ASCII range.
More importantly though, the fact that you're getting an error message specifying that the ascii codec can't be used suggests something's wrong with your encoding into unicode - it thinks you're trying to encode a value 0xed that can't be represented in ASCII, but you said you're trying to represent it in unicode.
I'm not smart enough to work out what particular line is causing the problem - if you edit your question to tell me what line's causing that error message I might be able to help a bit more (I guess it's either this_row.append(s.cell_value(row,col)) or bcw.writerow(this_row), but would appreciate you confirming).
There appear to be two possibilities. One is that you have not perhaps opened the output file correctly:
"If csvfile is a file object, it must be opened with the ‘b’ flag on platforms where that makes a difference." ( http://docs.python.org/library/csv.html#module-csv )
If that is not the problem, then another option for you is to use codecs.EncodedFile(file, input[, output[, errors]]) as a wrapper to output your .csv:
http://docs.python.org/library/codecs.html#module-codecs
This will allow you to have the file object filter from incoming UTF16 to UTF8. While both of them are technically "unicode", the way they encode is very different.
Something like this:
import codecs

rbc = open('file.csv', 'w')
bc = codecs.EncodedFile(rbc, "UTF16", "UTF8")
bcw = csv.writer(bc, csv.excel)
may resolve the problem for you, assuming I understood the problem right, and assuming that the error is thrown when writing to the file.