I'm trying to translate an Excel spreadsheet to CSV using the Python xlrd and csv modules, but am getting hung up on encoding issues. Xlrd produces output from Excel in Unicode, and the CSV module requires UTF-8.
I imagine that this has nothing to do with the xlrd module: everything works fine outputting to stdout or to other outputs that don't require a specific encoding.
The worksheet is encoded as UTF-16-LE, according to book.encoding
The simplified version of what I'm doing is:
from xlrd import *
import csv
b = open_workbook('file.xls')
s = b.sheet_by_name('Export')
bc = open('file.csv','w')
bcw = csv.writer(bc,csv.excel,b.encoding)
for row in range(s.nrows):
    this_row = []
    for col in range(s.ncols):
        this_row.append(s.cell_value(row,col))
    bcw.writerow(this_row)
This produces the following error, about 740 lines in:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xed' in position 5: ordinal not in range(128)
The value it seems to be getting hung up on is "516-777316" -- the text in the original Excel sheet is "516-7773167" (with a 7 on the end).
I'll be the first to admit that I have only a vague sense of how character encoding works, so most of what I've tried so far are various fumbling permutations of .encode and .decode on the s.cell_value(row,col)
If someone could suggest a solution I would appreciate it -- even better if you could provide an explanation of what's not working and why, so that I can more easily debug these problems myself in the future.
Thanks in advance!
EDIT:
Thanks for the comments so far.
When I use this_row.append(s.cell(row,col)) (i.e. s.cell instead of s.cell_value) the entire document writes without errors.
The output isn't particularly desirable (text:u'516-7773167'), but it avoids the error even though the offending characters are still in the output.
This makes me think that the challenge might be in xlrd after all.
Thoughts?
I expect the cell_value return value is the unicode string that's giving you problems (please print its type() to confirm that), in which case you should be able to solve it by changing this one line:
this_row.append(s.cell_value(row,col))
to:
this_row.append(s.cell_value(row,col).encode('utf8'))
If cell_value is returning multiple different types, then you need to encode if and only if it's returning a unicode string; so you'd split this line into a few lines:
val = s.cell_value(row, col)
if isinstance(val, unicode):
    val = val.encode('utf8')
this_row.append(val)
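Putting both suggestions together, the loop might look like this (just a sketch under the same assumptions; dates and other cell types need extra handling, as the next answer explains):

for row in range(s.nrows):
    this_row = []
    for col in range(s.ncols):
        val = s.cell_value(row, col)
        if isinstance(val, unicode):
            val = val.encode('utf8')   # the csv module wants byte strings, not unicode
        this_row.append(val)
    bcw.writerow(this_row)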
You asked for explanations, but some of the phenomena are inexplicable without your help.
(A) Strings in XLS files created by Excel 97 onwards are encoded in Latin1 if possible otherwise in UTF16LE. Each string carries a flag telling which was used. Earlier Excels encoded strings according to the user's "codepage". In any case, xlrd produces unicode objects. The file encoding is of interest only when the XLS file has been created by 3rd party software which either omits the codepage or lies about it. See the Unicode section up the front of the xlrd docs.
(B) Unexplained phenomenon:
This code:
bcw = csv.writer(bc,csv.excel,b.encoding)
causes the following error with Python 2.5, 2.6 and 3.1: TypeError: expected at most 2 arguments, got 3 -- this is about what I'd expect given the docs on csv.writer; it's expecting a filelike object followed by either (1) nothing (2) a dialect or (3) one or more formatting parameters. You gave it a dialect, and csv.writer has no encoding argument, so splat. What version of Python are you using? Or did you not copy/paste the script that you actually ran?
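For reference, a call that csv.writer will actually accept looks something like this (a sketch; since csv.writer has no encoding parameter, the UTF-8 conversion has to happen on each cell value instead):

bcw = csv.writer(bc, dialect=csv.excel)   # or simply csv.writer(bc, csv.excel)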
(C) Unexplained phenomena around traceback and what the actual offending data was:
"the_script.py", line 40, in <module>
this_row.append(str(s.cell_value(row,col)))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xed' in position 5: ordinal not in range(128)
FIRSTLY, there's a str() in the offending code line that wasn't in the simplified script -- did you not copy/paste the script that you actually ran? In any case, you shouldn't use str in general -- you won't get the full precision on your floats; just let the csv module convert them.
SECONDLY, you say """The value is seems to be getting hung up on is "516-777316" -- the text in the original Excel sheet is "516-7773167" (with a 7 on the end)""" --- it's difficult to imagine how the 7 gets lost off the end. I'd use something like this to find out exactly what the problematic data was:
try:
    str_value = str(s.cell_value(row, col))
except:
    print "row=%d col=%d cell_value=%r" % (row, col, s.cell_value(row, col))
    raise
That %r saves you from typing cell_value=%s ... repr(s.cell_value(row, col)) ... the repr() produces an unambiguous representation of your data. Learn it. Use it.
How did you arrive at "516-777316"?
THIRDLY, the error message is actually complaining about a unicode character u'\xed' at offset 5 (i.e. the sixth character). U+00ED is LATIN SMALL LETTER I WITH ACUTE, and there's nothing like that at all in "516-7773167"
FOURTHLY, the error location seems to be a moving target -- you said in a comment on one of the solutions: "The error is on bcw.writerow." Huh?
(D) Why you got that error message (with str()): str(a_unicode_object) attempts to convert the unicode object to a str object and in the absence of any encoding information uses ascii, but you have non-ascii data, so splat. Note that your object is to produce a csv file encoded in utf8, but your simplified script doesn't mention utf8 anywhere.
(E) """... s.cell(row,col)) (e.g. s.cell instead of s.cell_value) the entire document writes without errors. The output isn't particularly desirable (text:u'516-7773167')"""
That's happening because the csv writer is calling the __str__ method of your Cell object, and this produces <type>:<repr(value)> which may be useful for debugging but as you say not so great in your csv file.
(F) Alex Martelli's solution is great in that it got you going. However you should read the section on the Cell class in the xlrd docs: types of cell are text, number, boolean, date, error, blank and empty. If you have dates, you are going to want to format them as dates not numbers, so you can't use isinstance() (and you may not want the function call overhead anyway) ... this is what the Cell.ctype attribute and Sheet.cell_type() and Sheet.row_types() methods are for.
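A rough sketch of dispatching on ctype (using the b and s objects from the question's code; the date formatting shown here is just one possible choice, not the only one):

import xlrd

def csv_value(book, sheet, rowx, colx):
    ctype = sheet.cell_type(rowx, colx)
    value = sheet.cell_value(rowx, colx)
    if ctype == xlrd.XL_CELL_DATE:
        # dates come back as floats; convert using the workbook's datemode
        y, m, d, hh, mm, ss = xlrd.xldate_as_tuple(value, book.datemode)
        return "%04d-%02d-%02d" % (y, m, d)
    if ctype == xlrd.XL_CELL_TEXT:
        return value.encode('utf8')
    return value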
(G) UTF8 is not Unicode. UTF16LE is not Unicode. UTF16 is not Unicode ... and the idea that individual strings would waste 2 bytes each on a UTF16 BOM is too preposterous for even MS to contemplate :-)
(H) Further reading (apart from the xlrd docs):
http://www.joelonsoftware.com/articles/Unicode.html
http://www.amk.ca/python/howto/unicode
Looks like you've got 2 problems.
There's something screwed up in that cell - '7' should be encoded as u'\x37' I think, since it's within the ASCII range.
More importantly though, the fact that you're getting an error message specifying that the ascii codec can't be used suggests something's wrong with your encoding into unicode - it thinks you're trying to encode a value 0xed that can't be represented in ASCII, but you said you're trying to represent it in unicode.
I'm not smart enough to work out what particular line is causing the problem - if you edit your question to tell me what line's causing that error message I might be able to help a bit more (I guess it's either this_row.append(s.cell_value(row,col)) or bcw.writerow(this_row), but would appreciate you confirming).
There appear to be two possibilities. One is that you have not perhaps opened the output file correctly:
"If csvfile is a file object, it must be opened with the ‘b’ flag on platforms where that makes a difference." ( http://docs.python.org/library/csv.html#module-csv )
If that is not the problem, then another option for you is to use codecs.EncodedFile(file, input[, output[, errors]]) as a wrapper to output your .csv:
http://docs.python.org/library/codecs.html#module-codecs
This will allow you to have the file object filter from incoming UTF16 to UTF8. While both of them are Unicode encodings, the way they encode bytes is very different.
Something like this:
import codecs

rbc = open('file.csv','w')
bc = codecs.EncodedFile(rbc, "UTF16", "UTF8")
bcw = csv.writer(bc,csv.excel)
may resolve the problem for you, assuming I understood the problem right, and assuming that the error is thrown when writing to the file.
I have converted a kdb query into a dataframe and then uploaded that dataframe to a csv file. This caused an encoding error which I easily fixed by decoding to utf-8. However, there is one column which this did not work for.
"nameFid" is the column which isn't working correctly, it outputs on the CSV file as " b'STRING' "
I am running Python 3.7, any other information needed I will be happy to provide.
Here is my code which decodes the data in the dataframe I get from kdb
for ba in df.dtypes.keys():
    if df.dtypes[ba] == 'O':
        try:
            df[ba] = df[ba].apply(lambda x: x.decode('UTF-8'))
        except Exception as e:
            print(e)
return df
This worked for every column except "nameFid"
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xdc in position 6: invalid continuation byte -
This is one error I get but I thought this suggests that the data isn't encoded using UTF-8, which would surely mean all the columns wouldn't work?
When using the try except, it instead prints "'Series' object has no attribute 'decode'".
My goal is to remove the "b''" from the column values, which currently show
" b'STRING' "
I'm not sure what else I need to add. Let me know if you need anything.
Also sorry I am quite new to all of this.
Many encodings are partially compatible with one another. This is mostly due to the prevalence of ASCII, so a lot of them are backward compatible with ASCII but extend it differently. Hence if your other columns only contain things like numbers, they are likely ASCII-only and will work with a lot of different encodings.
The column that raises an error however contains some character outside the normal ASCII range and thus the encoding starts to matter. If you don't know the encoding of the file you can use chardet to try to guess it. Keep in mind that this is just guessing. Decoding using a different encoding may not raise any error however it could result in the wrong characters appearing in the final text so you should always know which encoding to use.
This said, if you are on Linux the standard file utility is often able to give you a rough guess of the encoding used, however for more advanced use cases something like chardet is necessary.
Once you have found the correct encoding, say you found it is latin-1 simply replace the decode('utf-8') with decode('latin-1').
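For instance, a rough sketch of guessing the encoding of that one column with chardet and then decoding with the guess (the column name comes from your question; the sampling and the fallback are just illustrative, and chardet's answer is only a guess):

import chardet

# sample raw bytes from the problem column and let chardet guess
sample = b''.join(x for x in df['nameFid'] if isinstance(x, bytes))
guess = chardet.detect(sample)
print(guess)   # e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73, ...}

enc = guess['encoding'] or 'latin-1'   # fall back if chardet cannot decide
df['nameFid'] = df['nameFid'].apply(
    lambda x: x.decode(enc) if isinstance(x, bytes) else x)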
When printing out DB2 query results I'm getting the following error on column 'F00002' which is a binary array.
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe3' in position 2: ordinal not in range(128)
I am using the following line:
print result[2].decode('cp037')
...just as I do the first two columns where the same code works fine. Why is this not working on the third column and what is the proper decoding/encoding?
Notice that the error is about encoding to ASCII, not about decoding from cp037. But you're not asking it to encode anywhere, so why is this happening?
Well, there are actually two possible places this could go wrong, and we can't know which of them it is without some help from you.
First, if your result[2] is already a unicode object, calling decode('cp037') on it will first try to encode it with sys.getdefaultencoding(), which is usually 'ascii', so that it has something to decode. So, instead of getting an error saying "Hey, bozo, I'm already decoded", you get an error about encoding to ASCII failing. (This may seem very silly, but it's useful for a handful of codecs that can decode unicode->unicode or unicode->str, like ROT13 and quoted-printable.)
If this is your problem, the solution is to not call decode. You've presumably already decoded the data somewhere along the way to this point, so don't try to do it again. (If you've decoded it wrong, you need to figure out where you decoded it and fix that to do it right; re-decoding it after it's already wrong won't help.)
Second, passing a Unicode string to print will automatically try to encode it with (depending on your Python version) either sys.getdefaultencoding() or sys.stdout.encoding. If Python has failed to guess the right encoding for your console (pretty common on Windows), or if you're redirecting your script's stdout to a file instead of printing to the console (which means Python can't possibly guess the right encoding), you can end up with 'ascii' even in sys.stdout.encoding.
If this is your problem, you have to explicitly specify the right encoding for your console (if you're lucky, it's in sys.stdout.encoding), or the encoding you want for the text file you're redirecting to (probably 'utf-8', but that's up to you), and explicitly encode everything you print.
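For example (a sketch, assuming result[2] is already a unicode object by this point; adjust the encoding to match wherever the output is going):

import sys

encoding = sys.stdout.encoding or 'utf-8'   # console encoding, or a sane default
print result[2].encode(encoding, 'replace')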
So, how do you know which one of these it is?
Simple. print type(result[2]) and see whether it's a unicode or a str. Or break it up into two pieces: x = result[2].decode('cp037') and then print x, and see which of the two raises. Or run in a debugger. You have all kinds of options for debugging this, but you have to do something.
Of course it's also possible that, once you fix the first one, you'll immediately run into the second one. But now you know how to deal with that too.
Also, note that cp037 is EBCDIC, one of the few encodings that Python knows about that isn't ASCII-compatible. In fact, '\xe3' is EBCDIC for the letter T.
It seems that your result[2] is already unicode:
>>> u'\xe3'.decode('cp037')
Traceback (most recent call last):
...
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe3' in position 0: ordinal not in range(128)
>>> u'\xe3'.encode('cp037')
'F'
In fact, as pointed out by #abarnert in the comments, in Python 2.x decode being called on a unicode object is performed in two steps:
encoding to string with sys.getdefaultencoding(),
then decoding back to unicode
i.e., your statement is translated as:
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> u'\xe3'.encode('ascii').decode('cp037')
and the error you get is from the first part of expression, u'\xe3'.encode('ascii')
All right, so as #abarnert established, you don't really have a Unicode problem, per se. The Unicode only enters the picture when trying to print. After looking at your data, I can see that there is actually not just EBCDIC character data in there, but arbitrary binary data as well. The data definitely seems columnar, so what we probably have here is a bunch of subfields all packed into the field called F00002 in your example. RPG programmers would refer to this as a data structure; it's akin to a C struct.
The F00001 and K00001 columns probably worked fine because they happen to contain only EBCDIC character data.
So if you want to extract the complete data from F00002, you'll have to find out (via documentation or some person who has the knowledge) what the subfields are. Normally, once you've found that out, you could just use Python's struct module to quickly and simply unpack the data, but since the data comes from an IBM i, you may be faced with converting its native data types into Python's types. (The most common of these would be packed decimal for numeric data.)
For now, you can still extract the character portions of F00002 by decoding as before, but then explicitly choosing a new encoding that works with your output (display or file), as #abarnert suggested. My recommendation is to write the values to a file, using result[2].decode('cp037').encode('utf-8') (which will produce a bunch of clearly not human-readable data interspersed with the text; you may be able to use that as-is, or you could use it to at least tell you where the text portions are for further processing).
Edit:
We don't have time to do all your work and research for you. Things you need to just read up on and work out for yourself:
IBM's packed decimal format (crash course: each digit takes up 4 bits using basic hexadecimal; with an additional 4 bits on the right for the sign, which is 'F' for positive and 'D' for negative; the whole thing zero-padded on the left if needed to fill out a whole number of bytes; decimal place is implied); a small decoding sketch follows this list
IBM's zoned decimal format (crash course: each digit is 1 byte and is identical to the EBCDIC representation of the corresponding character; except that on the rightmost digit, the upper 4 bits are used for the sign, 'F' for positive and 'D' for negative; decimal place is implied)
Python's struct module (doesn't automatically handle the above types; you have to use raw bytes for everything (type 's') and handle as needed)
Possibly pick up some ideas (and code) for handling IBM packed and zoned decimals from the add-on api2 module for iSeriesPython 2.7 (in particular, check out the iSeriesStruct class, which is a subclass of struct.Struct, keeping in mind that the whole module is designed to be running on the iSeries, using iSeriesPython, and thus is not necessarily usable as-is from regular Python communicating with the iSeries via pyodbc).
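For the packed decimal layout described above, a minimal decoding sketch might look like this (untested against real IBM i data; it assumes the field really is packed decimal and leaves the implied decimal place to you):

def unpack_packed_decimal(raw):
    # each byte holds two 4-bit digits; the final nibble is the sign
    nibbles = []
    for byte in raw:
        value = ord(byte)
        nibbles.append(value >> 4)
        nibbles.append(value & 0x0F)
    sign = nibbles.pop()          # 0xF (or 0xC) positive, 0xD negative
    number = 0
    for digit in nibbles:
        number = number * 10 + digit
    return -number if sign == 0x0D else number

print unpack_packed_decimal('\x12\x3d')   # -> -123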
I'm trying to get rid of diacritics in my textfile. I converted a PDF to text with a tool not made by myself, and I wasn't able to find out which encoding they use. The text is written in Nahuatl, orthographically similar to Spanish.
I transformed the text into a list of strings. Now I'm trying to do the following:
# check whether there is a non-ascii character in the item
def is_ascii(word):
    check = string.ascii_letters + "."
    if word not in check:
        return False
    return True

# if there is a non-ascii character, encode the string
def to_ascii(word):
    if is_ascii(word) == False:
        newWord = word.encode("utf8")
        return newWord
    return word
What I want to get is a unicode version of my string. It doesn't work so far, and I tried several encodings like latin1, cp1252, iso-8859-1. What I get is ... Can anybody tell me what I did wrong?
How can I find out the right encoding?
Thank you!
EDIT:
I wrote to the people that developed the converter (pdf-txt) and they said they were using unicode already. So John Machin was right with (1) in his answer.
As I wrote in some comment, that wasn't clear to me, because in the Eclipse debugger the list itself showed some signs in unicode, others not. And if I looked at the items separately they were all decoded in some way, so that I actually saw unicode.
Thank you for your help!
Edit your question to show the version of Python you are using. Guessing the version from your code is not possible. Whether you are using Python 3.X or 2.X matters a lot. Following remarks assume Python 2.x.
You already seem to have determined that you have UTF-8 encoded text. Try the_text.decode('utf8'). Note decode, NOT encode.
If decoding with UTF-8 does not raise UnicodeDecodeError and your text is not trivially short, then it is very close to certain that UTF-8 is the correct encoding.
If the above does not work, show us the result of print repr(the_text).
Note that it is counter-productive trying to check whether the file is encoded in ASCII -- ASCII is a subset of UTF-8. Leaving some data as str objects and others as unicode is messy in Python 2.x and won't work in Python 3.X.
In any case, your first function doesn't do what you think it does; it returns False for any input string whose length is 2 or more. Please consider unit-testing functions as you write them; it makes debugging much faster later on.
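For what it's worth, a per-character check (a sketch) would look more like this; 'word not in check' only tests whether the whole word is a substring of the alphabet string:

import string

def is_ascii(word):
    allowed = string.ascii_letters + "."
    return all(ch in allowed for ch in word)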
Note that latin1 and iso-8859-1 are the same encoding. Because latin1 maps all 256 byte values directly onto the first 256 Unicode codepoints, it is impossible for text.decode('latin1') to raise UnicodeDecodeError. "No error" in this case has exactly zero diagnostic value.
Update in response to this comment from OP:
I use Python 2.7. If I use text.decode("utf8") it raises the following
error: UnicodeEncodeError: 'latin-1' codec can't encode character
u'\u2014' in position 0: ordinal not in range(256).
That can happen two ways:
(1) In a single statement like foo = text.decode('utf8'), text is already a unicode object so Python 2.X tries to encode it using the default encoding (latin-1 ???).
(2) Possibly in two different statements, first foo = text.decode('utf8') where text is an str object encoded in UTF-8, and this statement doesn't raise an error, followed by something like print foo and your sys.stdout.encoding is latin-1 (???).
I can't imagine why you have "ticked" my answer as correct. Nobody knows what the question is yet!
Please edit your question to show your code (insert print repr(text) just before the text.decode("utf8") line), and the result of running it. Show the repr() result and the full traceback (so that we can determine which line is causing the error).
I ask again: can you make your file available for analysis?
By the way, u'\u2014' is an "EM DASH" and is a valid character in cp1252 (but not in latin-1, as you have seen from the error message). What version of what operating system are you using?
And to answer your last question, NO, you must NOT attempt to decode your text using every codec in the known universe. You are ALREADY getting plausible Unicode; something (your code?) is decoding something somehow -- the presence of u'\u2014' is enough evidence of that. Just show us your code and its results.
If you have read some bytes and want to interpret them as an unicode string, then you have to use .decode() rather than encode().
Like #delnan said in the comment, I hope you know the encoding. If not, the guesswork should go easy once you fix the function used.
BTW even if there are only ASCII characters in that word, why not .decode() it too? You'd have the same data type (unicode) everywhere, which will make your program simpler.
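A tiny illustration of the direction of the conversion (Python 2): bytes become unicode with decode, unicode becomes bytes with encode:

raw = 'ma\xc3\xb1ana'        # UTF-8 encoded bytes read from a file
text = raw.decode('utf8')    # -> u'ma\xf1ana', a unicode object
back = text.encode('utf8')   # -> the original byte string again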
I am working on a program (Python 2.7) that reads xls files (in MHTML format). One of the problems I have is that files contain symbols/characters that are not ascii. My initial solution was to read the files in using unicode
Here is how I am reading in a file:
theString=unicode(open(excelFile).read(),'UTF-8','replace')
I am then using lxml to do some processing. These files have many tables, and the first step of my processing requires that I find the right table. I can find the table based on words that are in the first cell of the first row. This is where it gets tricky. I had hoped to use a regular expression to test the text_content() of the cell, but discovered that there were too many variants of the words (in a test run of 3,200 files I found 91 different ways that the concept that defines just one of the tables was expressed). Therefore I decided to dump all of the text_contents of the particular cell out and use some algorithms in Excel to strictly identify all of the variants.
The code I used to write the text_content() was
headerDict['header_'+str(column+1)]=encode(string,'Latin-1','replace')
I did this based on previous answers to questions similar to mine here, where it seems the consensus was to read in the file using unicode and then encode it just before the file is written out.
So I processed the labels/words in excel - converted them all to lower case and got rid of the spaces and saved the output as a text file.
The text file has a column of all of the unique ways the table I am looking for is labeled
I then am reading in the file - and the first time I did I read it in using
labels=set([label for label in unicode(open('C:\\balsheetstrings-1.txt').read(),'UTF-8','replace').split('\n')])
I ran my program and discovered that some matches did not occur; investigating it, I discovered that unicode replaced certain characters with \ufffd, like in the example below
u'unauditedcondensedstatementsoffinancialcondition(usd\ufffd$)inthousands'
More research turns up that the replacement happens when unicode does not have a mapping for the character (probably not the exact explanation but that was my interpretation)
So then I tried (after thinking what do I have to lose) reading in my list of labels without using unicode. So I read it in using this code:
labels=set(open('C:\\balsheetstrings-1.txt').readlines())
now looking at the same label in the interpreter I see
'unauditedcondensedstatementsoffinancialcondition(usd\xa0$)inthousands'
I then try to use this set of labels to match and I get this error
Warning (from warnings module):
File "C:\FunctionsForExcel.py", line 128
if tableHeader in testSet:
UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
Now the frustrating thing is that the value for tableHeader is NOT in the test set. When I ask for the value of tableHeader after it broke I received this
'fairvaluemeasurements:'
And to add insult to injury when I type the test into Idle
tableHeader in testSet
it correctly returns false
I understand that the code '\xa0' is code for a non-breaking space. So does Python when I read it in without using unicode. I thought I had gotten rid of all the spaces in excel but to handle these I split them and then joined them
labels = [''.join(label.split()) for label in labels]
I still have not gotten to a question yet. Sorry, I am still trying to get my head around this. It seems to me that I am dealing with inconsistent behavior here. When I read the string in originally and used unicode and UTF-8, all the characters were preserved/transportable, if you will. I encoded them to write them out and they displayed fine in Excel; I then saved them as a txt file and they looked okay. But something is going on and I can't seem to figure out where.
If I could avoid writing the strings out to identify the correct labels I have a feeling my problem would go away but there are 20,000 or more labels. I can use a regular expression to cut my potential list down significantly but some of it just requires inspection.
As an aside I will note that the source files all specify the charset='UTF-8'
Recap: when I read the source document and list of labels in using unicode, I fail to make some matches because the labels have some characters replaced by \ufffd; and when I read the source document in using unicode and the list of labels in without any special handling, I get the warning.
I would like to understand what is going on so I can fix it but I have exhausted all the places I can think to look
You read (and write) encoded files like this:
import codecs
# read a utf8 encoded file and return the data as unicode
data = codecs.open(excelFile, 'rb', 'UTF-8').read()
The encoding you use does not matter as long as you do all the comparisons in unicode.
I understand that the code '\xa0' is code for a non-breaking space.
In a byte string, \xA0 is a byte representing non-breaking space in a few encodings; the most likely of those would be Windows code page 1252 (Western European). But it's certainly not UTF-8, where byte \xA0 on its own is invalid.
Use .decode('cp1252') to turn that byte string into Unicode instead of 'utf-8'. In general if you want to know what encoding an HTML file is in, look for the charset parameter in the <meta http-equiv="Content-Type"> tag; it is likely to differ depending on what exported it.
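So, a sketch of reading the label file under that assumption (path taken from your question; the only change is the codec):

raw = open('C:\\balsheetstrings-1.txt', 'rb').read()
labels = set(raw.decode('cp1252').split('\n'))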
Not exactly a solution, but something like xlrd would probably make a lot more sense than jumping through all those hoops.
I've been reading all questions regarding conversion from Unicode to CSV in Python here in StackOverflow and I'm still lost. Every time I receive a "UnicodeEncodeError: 'ascii' codec can't encode character u'\xd1' in position 12: ordinal not in range(128)"
buffer=cStringIO.StringIO()
writer=csv.writer(buffer, csv.excel)
cr.execute(query, query_param)
while (1):
    row = cr.fetchone()
    writer.writerow([s.encode('ascii','ignore') for s in row])
The value of row is
(56, u"LIMPIADOR BA\xd1O 1'5 L")
where the value of \xd1 at the database is ñ, an n with a diacritical tilde used in Spanish. At first I tried to convert the value to something valid in ASCII, but after losing so much time I'm trying only to ignore those characters (I suppose I'd have the same problem with accented vowels).
I'd like to save the value to the CSV, preferably with the ñ ("LIMPIADOR BAÑO 1'5 L"), but if not possible, at least be able to save it ("LIMPIADOR BAO 1'5 L").
Correct, ñ is not a valid ASCII character, so you can't encode it to ASCII. So you can, as your code does above, ignore them. Another way, namely to remove the accents, you can find here:
What is the best way to remove accents in a Python unicode string?
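For reference, the usual approach from that question looks roughly like this (a sketch using unicodedata; this loses information, as noted below):

import unicodedata

def strip_accents(u):
    # decompose accented characters, then drop the combining marks
    return ''.join(c for c in unicodedata.normalize('NFD', u)
                   if unicodedata.category(c) != 'Mn')

print strip_accents(u"LIMPIADOR BA\xd1O 1'5 L")   # -> LIMPIADOR BANO 1'5 L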
But note that both techniques can result in bad effects, like making words actually mean something different, etc. So the best is to keep the accents. And then you can't use ASCII, but you can use another encoding. UTF-8 is the safe bet. Latin-1 or ISO-8859-1 is a common one, but it includes only Western European characters. CP-1252 is common on Windows, etc, etc.
So just switch "ascii" for whatever encoding you want.
Your actual code, according to your comment is:
writer.writerow([s.encode('utf8') if type(s) is unicode else s for s in row])
where
row = (56, u"LIMPIADOR BA\xd1O 1'5 L")
Now, I believe that should work, but apparently it doesn't. I think unicode gets passed into the csv writer by mistake anyway. Unwrap that long line into its parts:
col1, col2 = row # Use the names of what is actually there instead
row = col1, col2.encode('utf8')
writer.writerow(row)
Now your real error will not be hidden by the fact that you stick everything in the same line. This could also probably have been avoided if you had included a proper traceback.