python wx unicode encoding

python wx unicode encoding - python

I have a problem with wx and python which seems to be a unicode one.
I'm using Portable python 2.7.2.1 and wx-2.8-msw-unicode.
My python code at the point of failure is this statement:
listbox.AppendText("\n " + dparser.parse(t['created_at']).strftime('%H-%M-%S') + " " +t['text'] + "\n")
t['text']
has a value:
"RT #WebbieBmx: “#AlexColebornBmx: http://t.co/cN6zSO69”watch this an #retweet"
which when printed in the DOS window from which I'm running python displays as:
'RT #WebbieBmx: \xe2\x80\x9c#AlexColebornBmx: http://t.co/cN6zSO69\xe2
\x80\x9dwatch this an #retweet'
The traceback is:
Traceback (most recent call last): File "myprogs\Search_db_dev.py",
line 713, in onSubmit
self.toField.GetLineText(0)) File "F:\Portable\Portable Python 2.7.2.1\App\myprogs\process_form2_dev.py", l ine 575, in display_Tweets
listbox.AppendText("\n " + dparser.parse(t['created_at']).strftime('%H-%M-%
S') + " " +t['text'] + "\n")
File "F:\Portable\Portable Python
2.7.2.1\App\lib\site-packages\wx-2.8-msw-uni code\wx_controls.py", line 1850, in AppendText
return _controls_.TextCtrl_AppendText(*args, **kwargs)
File "F:\Portable\Portable Python
2.7.2.1\App\lib\encodings\cp1252.py", line 1 5, in
decode return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position
73: chara cter maps to undefined
The UnicodeDecodeError seems to occur at the end of the right double quotation mark (\xe2\x80\x9d) but I can't see why. I would be grateful for any help.
It may be a simple encoding problem, I'm afraid

The reference to cp1252 kind of threw me when I looked at the traceback, because the text is utf8 (as one might expect when handling the text of tweets.) The utf8 sequence on the left (\xe2\x80\x9c) doesn't seem to cause a problem, but it appears there's a space after the \xe2 in the second hex sequence, which would keep it from being decoded from utf8 properly. When I remove that space, the decode problem goes away. So you've got some bad utf8, which I'm not sure how you would guard against other than an explicit decode inside a try statement when you receive it from the original source. Does this make sense?

Yes, it's a simple encoding problem.
The reason you don't see why is that your font isn't distinguishing between u'”' and u'"'. The former is a curly closed-quote mark, which is '\xe2\x80\x9d' in UTF-8. This most often happens when you edit a text file in an editor (like MS Word) that does "smart quotes".
But it's good that you found the problem now; otherwise, everything would seem to work until you gave your script to some Chinese user…
Anyway, the problem here is that you've got some code that's storing UTF-8 strings, and some other code that's trying to access them as if they were in the default encoding (your Windows OEM charset). Without seeing more of the code, it's hard to be sure what exactly you're doing wrong, but hopefully this is enough info for you to track it down.

Related

Python 2.7: Printing out an decoded string

I have an file that is called: Abrázame.txt
I want to decode this so that python understands what this 'á' char is so that it will print me Abrázame.txt
This is the following code i have on an Scratch file:
import os
s = os.path.join(r'C:\Test\AutoTest', os.listdir(r'C:\\Test\\AutoTest')[0])
print(unicode(s.decode(encoding='utf-16', errors='strict')))
The error i get from above is:
Traceback (most recent call last):
File "C:/Users/naythan_onfri/.PyCharmCE2017.2/config/scratches/scratch_3.py", line 12, in <module>
print(unicode(s.decode(encoding='utf-16', errors='strict')))
File "C:\Python27\lib\encodings\utf_16.py", line 16, in decode
return codecs.utf_16_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode byte 0x74 in position 28: truncated data
I have looked up the utf-16 character set and it does indeed have 'á' character in it. So why is it that this string cannot be decoded with Utf-16.
Also i know that 'latin-1' will work and produce the string im looking for however since this is for an automation project and i wanting to ensure that any filename with any registered character can be decoded and used for other things within the project for example:
"Opening up file explorer at the directory of the file with the file already selected."
Is looping through each of the codecs(Mind you i believe there is 93 codecs) to find whichever one can decode the string, the best way of getting the result I'm looking for? I figure there something far better than that solution.

You want to decode at the edge when you first read a string so that you don't have surprises later in your code. At the edge, you have some reasonable chance of guessing what that encoding is. For this code, the edge is
os.listdir(r'C:\\Test\\AutoTest')[0]
and you can get the current file system directory encoding. So,
import sys
fs_encoding = sys.getfilesystemencoding()
s = os.path.join(r'C:\Test\AutoTest',
os.listdir(r'C:\\Test\\AutoTest')[0].decode(encoding=fs_encodig, errors='strict')
print(s)
Note that once you decode you have a unicode string and you don't need to build a new unicode() object from it.
latin-1 works if that's your current code page. Its an interesting curiosity that even though Windows has supported "wide" characters with "W" versions of their API for many years, python 2 is single-byte character based and doesn't use them.
Long live python 3.

Encoding issues with CLD

After some issues getting Chrome Compact Language Detection library installed on Windows, I installed CLD from this easy_install.
I can now use CLD, but getting some encoding issues.
Background
Pulling Tweets into a python script, and after stripping out the hashtags and links, passing them to CLD to detect the language.
Following is a simplified version of my code:
s = "I am a tweet from Twitter"
clean_s = s.encode('utf-8')
lan = cld.detect(clean_s, pickSummaryLanguage=True, removeWeakMatches=True)
Problem
4 out of 5 times, this works as expected (get returned a response about what language it is).
However, I keep getting this error popping up:
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2019'
in position 15: character maps to undefined
I did read that:
"You must provide CLD clean (interchange-valid) UTF-8, so any encoding
issues must be sorted out before-hand."
However, I thought I had this covered with my statement to encode to UTF8?
I assume that I need to ensure that I pass a string to CLD that preserves fonts in languages such as arabic, asian, etc.
This is my first python project, so likely this is a rookie mistake. Can anyone point out my mistake and how to rectify?
Let me know in comments if I need to gather more info, and I will edit my Q to provide more info.
EDIT
If it helps, here is my rookie code (cut down to replicate issue).
I am running Python 2.7 32bit.
Running this code, after awhile, I get this error.
Let me know if I have not correctly implemented the error reporting.
Raw: Traceback (most recent call last):
File "LanguageTesting.py", line 71, in <module>
parse_tweet(tweet)
File "LanguageTesting.py", line 43, in parse_tweet
print "Raw:", raw
File "C:\Python27\ArcGIS10.1\lib\encodings\cp850.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 29-32: character maps to <undefined>

It looks like you are failing on the print statement right? This means Python cannot encode the unicode string into what it thinks the console's stdout encoding is ("print sys.getdefaultencoding()").
If python is wrong about what your terminal expects, you can set the env var ("export PYTHONIOENCODING=UTF-8") and it will encode your printed strings to utf-8. Alternatively, before printing, you can encode to whatever charset your terminal expects (and will likely have to ignore/replace errors to avoid exceptions like the one you hit)...

How to keep encoded characters as is

I am receiving web response in different encoding using python and my expected output should have to same as given on the web page
Ex : Marc Barbé
The last character should remain same after the parsing of html response.
Currently I am using following code for this
unicode.join(u'\n',map(unicode,item))
In some cases when there is no special encoding is given it is throwing following error :
Ex: Markus Rygaard, Alberte Blichfeldt, Flemming Quist, Møller
Traceback (most recent call last):
File "BFICrawl.py", line 20, in <module>
print attrName + " : " + attrValue
File "C:\Python27\LIB\encodings\cp437.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\xf8' in position 6
0: character maps to <undefined>
I really not able to find the reason for the same. Is there any alternate way available for getting the same encoding content from web.

You have successfully obtained unicode objects from the web. You should not need to do things like unicode.join(u'\n',map(unicode,item)). The problem is happening when you try to output the unicode.
You are running your script in a Windows "Command Prompt" window. The script is printing to the console. The console encoding is cp437. That is a very limited (8-bit) encoding. It can't handle the second character in Møller, and an enormous bunch of other characters
Remedy: Run your script in IDLE (supplied with your Python) or some other IDE.
Alternatively, if you are printing to the console for debug purposes only, instead of print foo use print repr(foo)

Codepage 437 (which is being encoded into) doesn't know the ø character, therefore your string can't be encoded for output. The error message does say all this.
So the question is: Why are you trying to encode the string into a codepage used by DOS console windows?

Python Encoding\Decoding for writing to a text file

I've honestly spent a lot of time on this, and it's slowly killing me. I've stripped content from a PDF and stored it in an array. Now I'm trying to pull it back out of the array and write it into a txt file. However, I do not seem to be able to make it happen because of encoding issues.
allTheNTMs.append(contentRaw[s1:].encode("utf-8"))
for a in range(len(allTheNTMs)):
kmlDescription = allTheNTMs[a]
print kmlDescription #this prints out fine
outputFile.write(kmlDescription)
The error i'm getting is "unicodedecodeerror: ascii codec can't decode byte 0xc2 in position 213:ordinal not in range (128).
I'm just messing around now, but I've tried all kinds of ways to get this stuff to write out.
outputFile.write(kmlDescription).decode('utf-8')
Please forgive me if this is basic, I'm still learning Python (2.7).
Cheers!
EDIT1: Sample data looks something like the following:
Chart 3686 (plan, Morehead City) [ previous update 4997/11 ] NAD83 DATUM
Insert the accompanying block, showing amendments to coastline,
depths and dolphins, centred on: 34° 41´·19N., 76° 40´·43W.
Delete R 34° 43´·16N., 76° 41´·64W.
When I add the print type(raw), I get
Edit 2: When I just try to write the data, I receive the original error message (ascii codec can't decode byte...)
I will check out the suggested thread and video. Thanks folks!
Edit 3: I'm using Python 2.7
Edit 4: agf hit the nail on the head in the comments below when (s)he noticed that I was double encoding. I tried intentionally double encoding a string that had previously been working and produced the same error message that was originally thrown. Something like:
text = "Here's a string, but imagine it has some weird symbols and whatnot in it - apparently latin-1"
textEncoded = text.encode('utf-8')
textEncodedX2 = textEncoded.encode('utf-8')
outputfile.write(textEncoded) #Works!
outputfile.write(textEncodedX2) #failed
Once I figured out I was trying to double encode, the solution was the following:
allTheNTMs.append(contentRaw[s1:].encode("utf-8"))
for a in range(len(allTheNTMs)):
kmlDescription = allTheNTMs[a]
kmlDescriptionDecode = kmlDescription.decode("latin-1")
outputFile.write(kmlDescriptionDecode)
It's working now, and I sure appreciate all of your help!!

My guess is that output file you have opened has been opened with latin1 or even utf-8 codec hence you are not able to write utf-8 encoded data to that because it tries to reconvert it, otherwise to a normally opened file you can write any arbitrary data string, here is an example recreating similar error
u = u'सच्चिदानन्द हीरानन्द वात्स्यायन '
s = u.encode('utf-8')
f = codecs.open('del.text', 'wb',encoding='latin1')
f.write(s)
output:
Traceback (most recent call last):
File "/usr/lib/wingide4.1/src/debug/tserver/_sandbox.py", line 1, in <module>
# Used internally for debug sandbox under external interpreter
File "/usr/lib/python2.7/codecs.py", line 691, in write
return self.writer.write(data)
File "/usr/lib/python2.7/codecs.py", line 351, in write
data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 0: ordinal not in range(128)
Solution:
this will work, if you don't set any codec
f = open('del.txt', 'wb')
f.write(s)
other option is to directly write to file without encoding the unicode strings, if outputFile has been opened with correct codec e.g.
f = codecs.open('del.text', 'wb',encoding='utf-8')
f.write(u)

Your error message doesn't seem to appear to relate to any of your Python syntax but actually the fact you're trying to decode a Hex value which has no equivalent in UTF-8.
HEX 0xc2 appears to represent a latin character - an uppercase A with an accent on the top. Therefore, instead of using "allTheNTMs.append(contentRaw[s1:].encode("utf-8"))", try:-
allTheNTMs.append(contentRaw[s1:].encode("latin-1"))
I'm not an expert in Python so this may not work but it would appear you're trying to encode a latin character. Given the error message you are receiving too, it would appear that when trying to encode in UTF-8, Python only looks through the first 128 entries given that your error appears to indicate that entry "0Xc2" is out of range which indeed it is out of the first 128 entries of UTF-8.

UnicodeDecodeError only with cx_freeze

I get the error: "UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 7338: ordinal not in range(128)" once I try to run the program after I freeze my script with cx_freeze. If I run the Python 3 script normally it runs fine, but only after I freeze it and try to run the executable does it give me this error. I would post my code, but I don't know exactly what parts to post so if there are any certain parts that will help just let me know and I will post them, otherwise it seems like I have had this problem once before and solved it, but it has been a while and I can't remember what exactly the problem was or how I fixed it so any help or pointers to get me going in the right direction will help greatly. Thanks in advance.

Tell us exactly which version of Python on what platform.
Show the full traceback that you get when the error happens. Look at it yourself. What is the last line of your code that appears? What do you think is the bytes string that is being decoded? Why is the ascii codec being used??
Note that automatic conversion of bytes to str with a default codec (e.g. ascii) is NOT done by Python 3.x. So either you are doing it explicitly or cx_freeze is.
Update after further info in comments.
Excel does not save csv files in ASCII. It saves them in what MS calls "the ANSI codepage", which varies by locale. If you don't know what yours is, it is probably cp1252. To check, do this:
>>> import locale; print(locale.getpreferredencoding())
cp1252
If Excel did save files in ASCII, your offending '\xa0' byte would have been replaced by '?' and you would not be getting a UnicodeDecodeError.
Saving your files in UTF-8 would need you to open your files with encoding='utf8' and would have the same problem (except that you'd get a grumble about 0xc2 instead of 0xa0).
You don't need to post all four of your csv files on the web. Just run this little script (untested):
import sys
for filename in sys.argv[1:]:
for lino, line in enumerate(open(filename), 1):
if '\xa0' in line:
print(ascii(filename), lino, ascii(line))
The '\xa0' is a NO-BREAK SPACE aka ... you may want to edit your files to change these to ordinary spaces.
Probably you will need to ask on the cx_freeze mailing list to get an answer to why this error is happening. They will want to know the full traceback. Get some practice -- show it here.
By the way, "offset 7338" is rather large -- do you expect lines that long in your csv file? Perhaps something is reading all of your file ...

That error itself indicates that you have a character in a python string that isn't a normal ASCII character:
>>> b'abc\xa0'.decode('ascii')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 3: ordinal not in range(128)
I certainly don't know why this would only happen when a script is frozen. You could wrap the whole script in a try/except and manually print out all or part of the string in question.
EDIT: here's how that might look
try:
# ... your script here
except UnicodeDecodeError as e:
print("Exception happened in string '...%s...'"%(e.object[e.start-50:e.start+51],))
raise

fix by set default coding:
reload(sys)
sys.setdefaultencoding("utf-8")

Use str.decode() function for that lines. And also you can specify encoding like myString.decode('cp1252').
Look also: http://docs.python.org/release/3.0.1/howto/unicode.html#unicode-howto

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.