How to print German diacritic characters in Python 2.7?

Working with German words (sometimes containing umlaut characters) in an Excel 2007 spreadsheet (I use xlrd, xlwt and openpyxl), I read each value with:
var = str(ws.cell(row=i+k,column=0).value).encode('latin-1')
With print(var) I then get the word, e.g.:
'a word'
until it reaches a word containing umlaut characters, at which point I get:
Traceback (most recent call last):
File "C:\Users\cristina\Documents\horia\Linguistics3\px t3.py", line 68, in <module>
var = str(ws4.cell(row=i+k,column=0).value).encode('latin-1')
UnicodeEncodeError: 'ascii' codec can't encode character u'\xdf' in position 3:ordinal not in range(128)
And the program stops.
If I instead define var as:
var = u'str(ws4.cell(row=i+k,column=0).value)'.encode('latin-1')
then, when trying to print(var), I get:
var=str(ws.cell(row=i+k,column=0).value)
The program then runs normally to the end.
I can get the value of var in the Python shell, but not via print(var) in the program.
Can anybody give me a solution?

First of all, read this: http://www.joelonsoftware.com/articles/Unicode.html (seriously)
Then, understand that Python 2 has two distinct data types:
unicode, for "agnostic" handling of all possible characters, which cannot be used in input/output (such as print or writing to files) without first being encoded into the other data type: str.
Strings (str) are encoding-dependent.
What I am almost sure is going on there, given your error message, is that the ws4.cell(row=i+k,column=0).value call is returning a unicode value. (I can't test it in my non-Windows environment here.) To be sure instead of guessing, you may want to run this once:
print(type(ws4.cell(row=i+k,column=0).value))
just to confirm you are getting unicode values.
Thus, when you do str(ws4.(...).value) you are telling Python to convert unicode to str without specifying any encoding - that call (which implicitly uses ASCII) is what raises your error, not the subsequent encode('latin-1') call.
If that is what is going on, simply replace that str call with unicode:
var = unicode(ws4.cell(row=i+k,column=0).value).encode('latin-1')
That should fix your problem. I hope you've read the article I linked above - it is helpful.
Also, mark your Python source file with the encoding you are using - otherwise you will get an error on any non-ASCII character in your source code.
For example, write this on the very first line of your code:
# coding: latin1
(Although for any serious project you should be using utf-8 instead.)
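Put together, the whole flow looks like this (a minimal sketch; the cell value is hypothetical, standing in for whatever your worksheet actually returns):
# -*- coding: latin1 -*-
value = u'Stra\xdfe'            # xlrd/openpyxl hand you unicode objects; u'\xdf' is the ß from your traceback
var = value.encode('latin-1')   # explicit unicode -> bytes conversion
print(var)                      # prints Straße, assuming a latin-1 console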

Related

ValueError: Unpaired high surrogate when decoding 'string' on reading json file

I am currently working in Python 3.8.6. I am getting the following error when reading (thousands of) JSON files:
ValueError: Unpaired high surrogate when decoding 'string' on reading json file
I tried the following solutions from other Stack Overflow posts, but nothing worked:
1) import json
json.loads('{"":"\\ud800"}')
2) import simplejson
simplejson.loads('{"":"\\ud800"}')
The problem is that after getting this error the remaining json files are not read. Is there a way to get rid of this error so I can read all the json files?
I am not sure what all information is necessary to provide regarding the problem so please feel free to ask.
Unicode code point U+D800 may only occur as part of a surrogate pair (and then only in UTF-16 encoding). So that string inside the JSON is, after decoding the escape, not a valid Unicode string.
The JSON itself might or might not be valid. The spec doesn't mention the case of unmatched surrogate pairs, but does explicitly allow nonexistent code points:
To escape a code point that is not in the Basic Multilingual Plane, the character may be represented as a twelve-character sequence, encoding the UTF-16 surrogate pair corresponding to the code point. So for example, a string containing only the G clef character (U+1D11E) may be represented as "\uD834\uDD1E". However, whether a processor of JSON texts interprets such a surrogate pair as a single code point or as an explicit surrogate pair is a semantic decision that is determined by the specific processor.
Note that the JSON grammar permits code points for which Unicode does not currently provide character assignments.
Now, you can choose your friends, but you can't choose your family and you can't always choose your JSON either. So the next question is: how to parse this mess?
It looks like both the built-in json module in Python (version 3.9) and simplejson (version 3.17.2) have no problems parsing the JSON. The problem only occurs once you try to use the string. So this really doesn't have anything to do with JSON at all:
>>> bork = '\ud800'
>>> bork
'\ud800'
>>> print(bork)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 0: surrogates not allowed
Fortunately, we can encode the string manually and tell Python how to handle the error. For example, replace the erroneous code point with a question mark:
>>> bork.encode('utf-8', errors='replace')
b'?'
The documentation lists other possible options for the errors argument.
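For instance, with the same string (these handlers are standard, not specific to surrogates):
>>> bork.encode('utf-8', errors='ignore')
b''
>>> bork.encode('utf-8', errors='backslashreplace')
b'\\ud800'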
To fix up this broken string, we can encode (into bytes) and then decode (back into str):
>>> bork.encode('utf-8', errors='replace').decode('utf-8')
'?'
A Unicode surrogate in isolation does not correspond to anything. Every valid high surrogate code point needs to be immediately followed by a low surrogate code point before it can be meaningfully decoded.
The error message simply means that this code point in isolation does not have a well-defined meaning. It's like saying "take" without saying what we should take, or "look at" without the object of the sentence filled in.
You should not be using surrogates in data that is not UTF-16 anyway; they are reserved strictly for that encoding, which uses them to represent characters outside its natural 16-bit range by splitting them across two code units.
The simple and obvious fix is to supply the missing information, but we can't know what it is. Perhaps you have more context, and can fill in the correct low surrogate. For example, this works:
>>> json.loads('{"":"\\ud800\\udc00"}')
{'': '𐀀'}
It populates the JSON with the single code point U+10000, but of course we have no idea whether that's actually the code point your data should contain.
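If you just need the whole batch of files to load and remain printable, one approach (a sketch, assuming a hypothetical data/ directory and flat JSON objects with string values; adapt the scrubbing to your actual structure) is to clean each decoded string right after parsing:
import glob
import json

for path in glob.glob('data/*.json'):
    with open(path, encoding='utf-8') as f:
        obj = json.load(f)   # parsing succeeds; the lone surrogate survives inside the str
    cleaned = {key: text.encode('utf-8', errors='replace').decode('utf-8')
               for key, text in obj.items()}
    print(cleaned)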

Python UnicodeDecodeError on smart quotes

I have a python script and recently noticed that I was hitting some encoding errors on certain input. I noticed that "smart quotes" were causing problems. I'd like to know advice on how to overcome this. I am using Python 2, so need to tell my script that I want to encode everything in UTF-8.
I thought doing this was enough:
mystring.encode("utf-8")
and largely it worked, until I came across smart quotes (and there are possibly many other things that will cause problems, hence why I'm posting here.) For example:
mystring = "hi"
mystring.encode("utf-8")
output is
'hi'
But for this:
mystring2 = "’"
mystring2.encode("utf-8")
output is
UnicodeDecodeError
Traceback (most recent call last)
<ipython-input-21-f563327dcd27> in <module>()
----> 1 mystring2.encode("utf-8")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in
position 0: ordinal not in range(128)
I created a function to handle the JSON input I get (sometimes I get null/None values, and sometimes numeric values, although mostly unicode, hence why i have the couple of if statements):
def xstr(s):
    if s is None:
        return ''
    if isinstance(s, basestring):
        return str(s.encode("utf-8"))
    else:
        return str(s)
This has worked quite well (until this smart quotes issue)
The two questions I have are:
Why can't "smart quotes" be encoded in UTF-8, and are there other limitations of UTF-8 or am I completely misinterpreting what I am seeing?
Is the approach I have used (ie using my custom function) the best way to handle this? I tried using a try/except to catch the cases of smart quotes, but that didn't work.
Python cannot encode the string because it doesn't know its current encoding. You'll need to use u"’" in Python 2 to tell Python that this is a Unicode string. ("\xe2" happens to be the first byte of the UTF-8 encoding of this character, but Python doesn't know that it's in UTF-8 because you haven't told it. You could put a -*- coding: utf-8 -*- comment near the top of your file; or unambiguously represent the character as u"\u2019".)
Similarly, to convert a string you read from disk, you first have to decode it into Unicode so that you can then encode it as UTF-8.
print(s.decode('iso-8859-1').encode('utf-8'))
Here, of course, 'iso-8859-1' is just a random guess. You have to know the encoding, or risk getting incorrect output.
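Applied to the xstr helper above, a minimal sketch (assuming the incoming byte strings are UTF-8; substitute the real source encoding if you know it) might look like:
def xstr(s):
    if s is None:
        return ''
    if isinstance(s, unicode):
        return s.encode('utf-8')
    if isinstance(s, str):
        # already bytes: decode with the presumed source encoding, then re-encode
        return s.decode('utf-8', 'replace').encode('utf-8')
    return str(s)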

Why is Python's .decode('cp037') not working on specific binary array?

When printing out DB2 query results I'm getting the following error on column 'F00002' which is a binary array.
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe3' in position 2: ordinal not in range(128)
I am using the following line:
print result[2].decode('cp037')
...just as I do the first two columns where the same code works fine. Why is this not working on the third column and what is the proper decoding/encoding?
Notice that the error is about encoding to ASCII, not about decoding from cp037. But you're not asking it to encode anything, so why is this happening?
Well, there are actually two possible places this could go wrong, and we can't know which of them it is without some help from you.
First, if your result[2] is already a unicode object, calling decode('cp037') on it will first try to encode it with sys.getdefaultencoding(), which is usually 'ascii', so that it has something to decode. So, instead of getting an error saying "Hey, bozo, I'm already decoded", you get an error about encoding to ASCII failing. (This may seem very silly, but it's useful for a handful of codecs that can decode unicode->unicode or unicode->str, like ROT13 and quoted-printable.)
If this is your problem, the solution is to not call decode. You've presumably already decoded the data somewhere along the way to this point, so don't try to do it again. (If you've decoded it wrong, you need to figure out where you decoded it and fix that to do it right; re-decoding it after it's already wrong won't help.)
Second, passing a Unicode string to print will automatically try to encode it with (depending on your Python version) either sys.getdefaultencoding() or sys.stdout.encoding. If Python has failed to guess the right encoding for your console (pretty common on Windows), or if you're redirecting your script's stdout to a file instead of printing to the console (which means Python can't possibly guess the right encoding), you can end up with 'ascii' even in sys.stdout.encoding.
If this is your problem, you have to explicitly specify the right encoding for your console (if you're lucky, it's in sys.stdout.encoding), or the encoding you want for the text file you're redirecting to (probably 'utf-8', but that's up to you), and explicitly encode everything you print.
So, how do you know which one of these it is?
Simple. print type(result[2]) and see whether it's a unicode or a str. Or break it up into two pieces: x = result[2].decode('cp037') and then print x, and see which of the two raises. Or run in a debugger. You have all kinds of options for debugging this, but you have to do something.
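Concretely, those probes look like this (a sketch, using your result variable):
print type(result[2])            # <type 'unicode'> or <type 'str'>?
x = result[2].decode('cp037')    # if this line raises, it's the first problem...
print x                          # ...if this one raises, it's the second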
Of course it's also possible that, once you fix the first one, you'll immediately run into the second one. But now you know how to deal with that too.
Also, note that cp037 is EBCDIC, one of the few encodings that Python knows about that isn't ASCII-compatible. In fact, '\xe3' is EBCDIC for the letter T.
It seems that your result[2] is already unicode:
>>> u'\xe3'.decode('cp037')
Traceback (most recent call last):
...
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe3' in position 0: ordinal not in range(128)
>>> u'\xe3'.encode('cp037')
'F'
In fact, as pointed out by @abarnert in the comments, in Python 2.x, decode called on a unicode object is performed in two steps:
encoding to string with sys.getdefaultencoding(),
then decoding back to unicode
i.e., your statement is translated as:
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> u'\xe3'.encode('ascii').decode('cp037')
and the error you get is from the first part of expression, u'\xe3'.encode('ascii')
All right, so as @abarnert established, you don't really have a Unicode problem, per se. The Unicode only enters the picture when trying to print. After looking at your data, I can see that there is actually not just EBCDIC character data in there, but arbitrary binary data as well. The data definitely seems columnar, so what we probably have here is a bunch of subfields all packed into the field called F00002 in your example. RPG programmers would refer to this as a data structure; it's akin to a C struct.
The F00001 and K00001 columns probably worked fine because they happen to contain only EBCDIC character data.
So if you want to extract the complete data from F00002, you'll have to find out (via documentation or some person who has the knowledge) what the subfields are. Normally, once you've found that out, you could just use Python's struct module to quickly and simply unpack the data, but since the data comes from an IBM i, you may be faced with converting its native data types into Python's types. (The most common of these would be packed decimal for numeric data.)
For now, you can still extract the character portions of F00002 by decoding as before, but then explicitly choosing a new encoding that works with your output (display or file), as @abarnert suggested. My recommendation is to write the values to a file, using result[2].decode('cp037').encode('utf-8') (which will produce a bunch of clearly non-human-readable data interspersed with the text; you may be able to use that as-is, or you could use it to at least tell you where the text portions are for further processing).
Edit:
We don't have time to do all your work and research for you. Things you need to just read up on and work out for yourself:
IBM's packed decimal format (crash course: each digit takes up 4 bits, holding its plain binary value; an additional 4 bits on the right hold the sign, which is 'F' for positive and 'D' for negative; the whole thing is zero-padded on the left if needed to fill out a whole number of bytes; the decimal place is implied)
IBM's zoned decimal format (crash course: each digit is 1 byte and is identical to the EBCDIC representation of the corresponding character; except that on the rightmost digit, the upper 4 bits are used for the sign, 'F' for positive and 'D' for negative; decimal place is implied)
Python's struct module (doesn't automatically handle the above types; you have to use raw bytes for everything (type 's') and handle as needed)
Possibly pick up some ideas (and code) for handling IBM packed and zoned decimals from the add-on api2 module for iSeriesPython 2.7 (in particular, check out the iSeriesStruct class, which is a subclass of struct.Struct, keeping in mind that the whole module is designed to be running on the iSeries, using iSeriesPython, and thus is not necessarily usable as-is from regular Python communicating with the iSeries via pyodbc).
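As a flavor of what that unpacking involves, here is a minimal, hedged sketch of decoding a packed-decimal field by hand (the helper name and sample bytes are hypothetical; real field layouts and scales come from the documentation above):
from decimal import Decimal

def unpack_packed(data, scale=0):
    # split each byte into two 4-bit digits
    nibbles = []
    for byte in bytearray(data):
        nibbles.append(byte >> 4)
        nibbles.append(byte & 0x0F)
    # last nibble is the sign: 0xD negative, anything else (normally 0xF) positive
    sign = -1 if nibbles.pop() == 0x0D else 1
    value = 0
    for digit in nibbles:
        value = value * 10 + digit
    return Decimal(sign * value).scaleb(-scale)

print unpack_packed('\x12\x34\x5d', scale=2)    # -> -123.45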

Python, .format(), and UTF-8

My background is in Perl, but I'm giving Python plus BeautifulSoup a try for a new project.
In this example, I'm trying to extract and present the link targets and link text contained in a single page. Here's the source:
table_row = u'<tr><td>{}</td><td>{}</td></tr>'.encode('utf-8')
link_text = unicode(link.get_text()).encode('utf-8')
link_target = link['href'].encode('utf-8')
line_out = unicode(table_row.format(link_text, link_target))
All those explicit calls to .encode('utf-8') are my attempt to make this work, but they don't seem to help -- it's likely that I'm completely misunderstanding something about how Python 2.7 handles Unicode strings.
Anyway. This works fine up until it encounters U+2013 in a URL (yes, really). At that point it bombs out with:
Traceback (most recent call last):
File "./test2.py", line 30, in <module>
line_out = unicode(table_row.encode('utf-8').format(link_text, link_target.encode('utf-8')))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 79: ordinal not in range(128)
Presumably .format(), even applied to a Unicode string, is playing silly-buggers and trying to do a .decode() operation. And as ASCII is the default, it's using that, and of course it can't map U+2013 to an ASCII character, and thus...
The options seem to be to remove it or convert it to something else, but really what I want is to simply preserve it. Ultimately (this is just a little test case) I need to be able to present working clickable links.
The BS3 documentation suggests changing the default encoding from ASCII to UTF-8 but reading comments on similar questions that looks to be a really bad idea as it'll muck up dictionaries.
Short of using Python 3.2 instead (which means no Django, which we're considering for part of this project) is there some way to make this work cleanly?
First, note that your two code samples disagree on the text of the problematic line:
line_out = unicode(table_row.encode('utf-8').format(link_text, link_target.encode('utf-8')))
vs
line_out = unicode(table_row.format(link_text, link_target))
The first is the one from the traceback, so it's the one to look at. Assuming the rest of your first code sample is accurate, table_row is a byte string, because you took a unicode string and encoded it. Byte strings can't be encoded again directly, so Python 2 implicitly converts table_row from byte string to unicode by decoding it as ascii - hence the UnicodeDecodeError from the ascii codec.
You need to decide what strings will be byte strings and which will be unicode strings, and be disciplined about it. I recommend keeping all text as Unicode strings as much as possible.
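In practice that means dropping the intermediate .encode('utf-8') calls and encoding exactly once, at the output boundary - a minimal sketch (assuming link is a BeautifulSoup tag as in your code, and out is a hypothetical output file):
table_row = u'<tr><td>{}</td><td>{}</td></tr>'
line_out = table_row.format(link.get_text(), link['href'])  # everything stays unicode
out.write(line_out.encode('utf-8'))                         # bytes only at the boundary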
Here's a presentation I gave at PyCon that explains it all: Pragmatic Unicode, or, How Do I Stop The Pain?

Unicode to UTF8 for CSV Files - Python via xlrd

I'm trying to translate an Excel spreadsheet to CSV using the Python xlrd and csv modules, but am getting hung up on encoding issues. Xlrd produces output from Excel in Unicode, and the CSV module requires UTF-8.
I imagine that this has nothing to do with the xlrd module: everything works fine when outputting to stdout or other outputs that don't require a specific encoding.
The worksheet is encoded as UTF-16-LE, according to book.encoding
The simplified version of what I'm doing is:
from xlrd import *
import csv
b = open_workbook('file.xls')
s = b.sheet_by_name('Export')
bc = open('file.csv','w')
bcw = csv.writer(bc,csv.excel,b.encoding)
for row in range(s.nrows):
    this_row = []
    for col in range(s.ncols):
        this_row.append(s.cell_value(row,col))
    bcw.writerow(this_row)
This produces the following error, about 740 lines in:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xed' in position 5: ordinal not in range(128)
The value it seems to be getting hung up on is "516-777316" -- the text in the original Excel sheet is "516-7773167" (with a 7 on the end).
I'll be the first to admit that I have only a vague sense of how character encoding works, so most of what I've tried so far are various fumbling permutations of .encode and .decode on the s.cell_value(row,col)
If someone could suggest a solution I would appreciate it -- even better if you could provide an explanation of what's not working and why, so that I can more easily debug these problems myself in the future.
Thanks in advance!
EDIT:
Thanks for the comments so far.
When I use this_row.append(s.cell(row,col)) (i.e., s.cell instead of s.cell_value), the entire document writes without errors.
The output isn't particularly desirable (text:u'516-7773167'), but it avoids the error even though the offending characters are still in the output.
This makes me think that the challenge might be in xlrd after all.
Thoughts?
I expect the cell_value return value is the unicode string that's giving you problems (please print its type() to confirm that), in which case you should be able to solve it by changing this one line:
this_row.append(s.cell_value(row,col))
to:
this_row.append(s.cell_value(row,col).encode('utf8'))
If cell_value is returning multiple different types, then you need to encode if and only if it's returning a unicode string; so you'd split this line into a few lines:
val = s.cell_value(row, col)
if isinstance(val, unicode):
    val = val.encode('utf8')
this_row.append(val)
You asked for explanations, but some of the phenomena are inexplicable without your help.
(A) Strings in XLS files created by Excel 97 onwards are encoded in Latin1 if possible otherwise in UTF16LE. Each string carries a flag telling which was used. Earlier Excels encoded strings according to the user's "codepage". In any case, xlrd produces unicode objects. The file encoding is of interest only when the XLS file has been created by 3rd party software which either omits the codepage or lies about it. See the Unicode section up the front of the xlrd docs.
(B) Unexplained phenomenon:
This code:
bcw = csv.writer(bc,csv.excel,b.encoding)
causes the following error with Python 2.5, 2.6 and 3.1: TypeError: expected at most 2 arguments, got 3. This is about what I'd expect given the docs on csv.writer: it expects a file-like object, followed by either (1) nothing, (2) a dialect, or (3) one or more formatting parameters. You gave it a dialect, and csv.writer has no encoding argument, so splat. What version of Python are you using? Or did you not copy/paste the script that you actually ran?
(C) Unexplained phenomena around traceback and what the actual offending data was:
"the_script.py", line 40, in <module>
this_row.append(str(s.cell_value(row,col)))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xed' in position 5: ordinal not in range(128)
FIRSTLY, there's a str() in the offending code line that wasn't in the simplified script -- did you not copy/paste the script that you actually ran? In any case, you shouldn't use str in general -- you won't get the full precision on your floats; just let the csv module convert them.
SECONDLY, you say """The value it seems to be getting hung up on is "516-777316" -- the text in the original Excel sheet is "516-7773167" (with a 7 on the end)""" --- it's difficult to imagine how the 7 gets lost off the end. I'd use something like this to find out exactly what the problematic data was:
try:
    str_value = str(s.cell_value(row, col))
except:
    print "row=%d col=%d cell_value=%r" % (row, col, s.cell_value(row, col))
    raise
That %r saves you from typing cell_value=%s ... repr(s.cell_value(row, col)) ... the repr() produces an unambiguous representation of your data. Learn it. Use it.
How did you arrive at "516-777316"?
THIRDLY, the error message is actually complaining about a unicode character u'\xed' at offset 5 (i.e. the sixth character). U+00ED is LATIN SMALL LETTER I WITH ACUTE, and there's nothing like that at all in "516-7773167"
FOURTHLY, the error location seems to be a moving target -- you said in a comment on one of the solutions: "The error is on bcw.writerow." Huh?
(D) Why you got that error message (with str()): str(a_unicode_object) attempts to convert the unicode object to a str object, and in the absence of any encoding information it uses ascii; you have non-ascii data, so splat. Note that your goal is to produce a csv file encoded in utf8, yet your simplified script doesn't mention utf8 anywhere.
(E) """... s.cell(row,col)) (e.g. s.cell instead of s.cell_value) the entire document writes without errors. The output isn't particularly desirable (text:u'516-7773167')"""
That's happening because the csv writer is calling the __str__ method of your Cell object, and this produces <type>:<repr(value)> which may be useful for debugging but as you say not so great in your csv file.
(F) Alex Martelli's solution is great in that it got you going. However you should read the section on the Cell class in the xlrd docs: types of cell are text, number, boolean, date, error, blank and empty. If you have dates, you are going to want to format them as dates not numbers, so you can't use isinstance() (and you may not want the function call overhead anyway) ... this is what the Cell.ctype attribute and Sheet.cell_type() and Sheet.row_types() methods are for.
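A hedged sketch of that type-dispatched handling (reusing b, s and bcw from the question; the xlrd constants and xldate_as_tuple are real, the formatting choices are illustrative):
import datetime
import xlrd

for row in range(s.nrows):
    this_row = []
    for col in range(s.ncols):
        ctype = s.cell_type(row, col)
        val = s.cell_value(row, col)
        if ctype == xlrd.XL_CELL_DATE:
            # date cells come back as float day counts; convert using the book's datemode
            val = datetime.datetime(*xlrd.xldate_as_tuple(val, b.datemode))
        elif ctype == xlrd.XL_CELL_TEXT:
            val = val.encode('utf8')
        this_row.append(val)
    bcw.writerow(this_row)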
(G) UTF8 is not Unicode. UTF16LE is not Unicode. UTF16 is not Unicode ... and the idea that individual strings would waste 2 bytes each on a UTF16 BOM is too preposterous for even MS to contemplate :-)
(H) Further reading (apart from the xlrd docs):
http://www.joelonsoftware.com/articles/Unicode.html
http://www.amk.ca/python/howto/unicode
Looks like you've got 2 problems.
There's something screwed up in that cell - '7' should be encoded as u'\x37' I think, since it's within the ASCII range.
More importantly though, the fact that you're getting an error message specifying that the ascii codec can't be used suggests something's wrong with your encoding into unicode - it thinks you're trying to encode a value 0xed that can't be represented in ASCII, but you said you're trying to represent it in unicode.
I'm not smart enough to work out what particular line is causing the problem - if you edit your question to tell me what line's causing that error message I might be able to help a bit more (I guess it's either this_row.append(s.cell_value(row,col)) or bcw.writerow(this_row), but would appreciate you confirming).
There appear to be two possibilities. One is that you have not perhaps opened the output file correctly:
"If csvfile is a file object, it must be opened with the ‘b’ flag on platforms where that makes a difference." ( http://docs.python.org/library/csv.html#module-csv )
If that is not the problem, then another option for you is to use codecs.EncodedFile(file, input[, output[, errors]]) as a wrapper to output your .csv:
http://docs.python.org/library/codecs.html#module-codecs
This will allow you to have the file object filter from incoming UTF16 to UTF8. While both of them are technically "unicode", the way they encode is very different.
Something like this:
rbc = open('file.csv','w')
bc = codecs.EncodedFile(rbc, "UTF16", "UTF8")
bcw = csv.writer(bc,csv.excel)
may resolve the problem for you, assuming I understood the problem right, and assuming that the error is thrown when writing to the file.
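For comparison, the more common route is to skip the wrapper and encode each text cell yourself on the way out - a minimal sketch reusing s from the question (essentially the same fix as the answer above):
import csv

bc = open('file.csv', 'wb')   # binary mode, per the csv docs quoted above
bcw = csv.writer(bc, csv.excel)
for row in range(s.nrows):
    bcw.writerow([val.encode('utf8') if isinstance(val, unicode) else val
                  for val in (s.cell_value(row, col) for col in range(s.ncols))])
bc.close()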
