How to keep encoded characters as is - python

I am receiving web responses in different encodings using Python, and my output should be the same as what is shown on the web page.
Ex : Marc Barbé
The last character should remain the same after parsing the HTML response.
Currently I am using the following code for this:
unicode.join(u'\n',map(unicode,item))
In some cases, when no special encoding is given, it throws the following error:
Ex: Markus Rygaard, Alberte Blichfeldt, Flemming Quist, Møller
Traceback (most recent call last):
File "BFICrawl.py", line 20, in <module>
print attrName + " : " + attrValue
File "C:\Python27\LIB\encodings\cp437.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\xf8' in position 60: character maps to <undefined>
I am really not able to find the reason for this. Is there an alternative way to get the content from the web with its encoding preserved?

You have successfully obtained unicode objects from the web. You should not need to do things like unicode.join(u'\n',map(unicode,item)). The problem is happening when you try to output the unicode.
You are running your script in a Windows "Command Prompt" window. The script is printing to the console. The console encoding is cp437. That is a very limited (8-bit) encoding. It can't handle the second character in Møller, nor an enormous number of other characters.
Remedy: Run your script in IDLE (supplied with your Python) or some other IDE.
Alternatively, if you are printing to the console for debug purposes only, instead of print foo use print repr(foo)
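If the output has to stay in the Command Prompt, another option is to encode explicitly and let the codec replace whatever the console code page cannot represent. This is a minimal sketch, assuming the console really is cp437 as the traceback indicates; the helper name console_bytes is made up for illustration.

```python
# -*- coding: utf-8 -*-
def console_bytes(text, encoding='cp437'):
    """Encode text for a limited console; unencodable characters become '?'."""
    return text.encode(encoding, 'replace')

ok = console_bytes(u'Marc Barb\xe9')   # e-acute exists in cp437, so it survives
lossy = console_bytes(u'M\xf8ller')    # o-slash does not exist in cp437, so it is replaced
```

In Python 2, printing the result of console_bytes(attrValue) should then no longer raise, at the cost of showing '?' for characters the console cannot display.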

Codepage 437 (which is being encoded into) doesn't know the ø character, therefore your string can't be encoded for output. The error message does say all this.
So the question is: Why are you trying to encode the string into a codepage used by DOS console windows?

Related

Python 2.7: Printing out an decoded string

I have a file that is called: Abrázame.txt
I want to decode this so that Python understands what this 'á' char is and prints Abrázame.txt
This is the code I have in a scratch file:
import os
s = os.path.join(r'C:\Test\AutoTest', os.listdir(r'C:\\Test\\AutoTest')[0])
print(unicode(s.decode(encoding='utf-16', errors='strict')))
The error i get from above is:
Traceback (most recent call last):
File "C:/Users/naythan_onfri/.PyCharmCE2017.2/config/scratches/scratch_3.py", line 12, in <module>
print(unicode(s.decode(encoding='utf-16', errors='strict')))
File "C:\Python27\lib\encodings\utf_16.py", line 16, in decode
return codecs.utf_16_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode byte 0x74 in position 28: truncated data
I have looked up the UTF-16 character set and it does indeed have the 'á' character in it. So why can't this string be decoded with UTF-16?
Also, I know that 'latin-1' will work and produce the string I'm looking for. However, this is for an automation project, and I want to ensure that any filename with any registered character can be decoded and used for other things within the project, for example:
"Opening up file explorer at the directory of the file with the file already selected."
Is looping through each of the codecs (I believe there are 93 of them) to find whichever one can decode the string the best way of getting the result I'm looking for? I figure there is something far better than that solution.
You want to decode at the edge when you first read a string so that you don't have surprises later in your code. At the edge, you have some reasonable chance of guessing what that encoding is. For this code, the edge is
os.listdir(r'C:\\Test\\AutoTest')[0]
and you can get the file system encoding from sys. So,
import os
import sys
fs_encoding = sys.getfilesystemencoding()
# decode the byte-string file name right where it enters the program
s = os.path.join(r'C:\Test\AutoTest',
    os.listdir(r'C:\\Test\\AutoTest')[0].decode(encoding=fs_encoding, errors='strict'))
print(s)
Note that once you decode you have a unicode string and you don't need to build a new unicode() object from it.
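To make the "decode at the edge" idea concrete, and to show why the utf-16 attempt fails, here is a small sketch. The helper to_text and the sample bytes are illustrative assumptions: the sample imitates what a single-byte file system encoding such as latin-1 or cp1252 would return for this path.

```python
import sys

def to_text(name, encoding=None):
    """Return name as text, decoding byte strings with the file system encoding."""
    if isinstance(name, bytes):
        return name.decode(encoding or sys.getfilesystemencoding(), 'strict')
    return name  # already text, nothing to decode

# 29 bytes: one byte per character, including 0xe1 for the a-acute
raw = b'C:\\Test\\AutoTest\\Abr\xe1zame.txt'

try:
    raw.decode('utf-16')
    utf16_ok = True
except UnicodeDecodeError:
    # utf-16 needs 2-byte units; an odd byte count leaves the last byte dangling
    utf16_ok = False

decoded = to_text(raw, 'latin-1')
```

The last byte of the 29-byte path is 't' (0x74) at position 28, which matches the "truncated data" message in the traceback.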
latin-1 works if that's your current code page. It's an interesting curiosity that even though Windows has supported "wide" characters with "W" versions of its API for many years, Python 2 is byte-string based by default and only uses them when you pass it unicode paths (os.listdir(u'C:\\Test\\AutoTest') returns unicode names).
Long live python 3.

Encoding issues with CLD

After some issues getting Chrome Compact Language Detection library installed on Windows, I installed CLD from this easy_install.
I can now use CLD, but I am getting some encoding issues.
Background
Pulling Tweets into a python script, and after stripping out the hashtags and links, passing them to CLD to detect the language.
Following is a simplified version of my code:
import cld
s = "I am a tweet from Twitter"
clean_s = s.encode('utf-8')
lan = cld.detect(clean_s, pickSummaryLanguage=True, removeWeakMatches=True)
Problem
4 out of 5 times, this works as expected (get returned a response about what language it is).
However, I keep getting this error popping up:
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2019'
in position 15: character maps to undefined
I did read that:
"You must provide CLD clean (interchange-valid) UTF-8, so any encoding
issues must be sorted out before-hand."
However, I thought I had this covered with my statement to encode to UTF8?
I assume that I need to ensure that I pass a string to CLD that preserves the characters of scripts such as Arabic and the Asian languages.
This is my first python project, so likely this is a rookie mistake. Can anyone point out my mistake and how to rectify?
Let me know in comments if I need to gather more info, and I will edit my Q to provide more info.
EDIT
If it helps, here is my rookie code (cut down to replicate issue).
I am running Python 2.7 32bit.
Running this code, after a while, I get this error.
Let me know if I have not correctly implemented the error reporting.
Raw: Traceback (most recent call last):
File "LanguageTesting.py", line 71, in <module>
parse_tweet(tweet)
File "LanguageTesting.py", line 43, in parse_tweet
print "Raw:", raw
File "C:\Python27\ArcGIS10.1\lib\encodings\cp850.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 29-32: character maps to <undefined>
It looks like you are failing on the print statement, right? This means Python cannot encode the unicode string into what it thinks the console's stdout encoding is (check it with "print sys.stdout.encoding"; sys.getdefaultencoding() reports something different, namely Python's default codec for implicit conversions).
If Python is wrong about what your terminal expects, you can set the env var ("export PYTHONIOENCODING=UTF-8", or "set PYTHONIOENCODING=UTF-8" in a Windows Command Prompt) and it will encode your printed strings to UTF-8. Alternatively, before printing, you can encode to whatever charset your terminal expects (and will likely have to ignore/replace errors to avoid exceptions like the one you hit)...
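The second suggestion could look roughly like this. It is only a sketch: the function name printable and the FakeConsole stand-in are hypothetical, and falling back to 'ascii' when the stream reports no encoding is an assumption that mirrors what Python 2 does on its own.

```python
import sys

def printable(text, stream=None):
    """Round-trip text through the stream's encoding, replacing what it cannot show."""
    if stream is None:
        stream = sys.stdout
    # encoding may be missing or None, for example when output is piped
    encoding = getattr(stream, 'encoding', None) or 'ascii'
    return text.encode(encoding, 'replace').decode(encoding)

class FakeConsole(object):
    """Stand-in for a console whose code page is cp850, as in the traceback."""
    encoding = 'cp850'

shown = printable(u'it\u2019s a tweet', FakeConsole())
```

The curly apostrophe u'\u2019' has no slot in cp850, so it comes back as '?' instead of raising UnicodeEncodeError. The text passed to CLD is unaffected, since only the copy that gets printed is altered.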

python wx unicode encoding

I have a problem with wx and python which seems to be a unicode one.
I'm using Portable python 2.7.2.1 and wx-2.8-msw-unicode.
My python code at the point of failure is this statement:
listbox.AppendText("\n " + dparser.parse(t['created_at']).strftime('%H-%M-%S') + " " +t['text'] + "\n")
t['text']
has a value:
"RT #WebbieBmx: “#AlexColebornBmx: http://t.co/cN6zSO69”watch this an #retweet"
which when printed in the DOS window from which I'm running python displays as:
'RT #WebbieBmx: \xe2\x80\x9c#AlexColebornBmx: http://t.co/cN6zSO69\xe2
\x80\x9dwatch this an #retweet'
The traceback is:
Traceback (most recent call last):
File "myprogs\Search_db_dev.py", line 713, in onSubmit
self.toField.GetLineText(0))
File "F:\Portable\Portable Python 2.7.2.1\App\myprogs\process_form2_dev.py", line 575, in display_Tweets
listbox.AppendText("\n " + dparser.parse(t['created_at']).strftime('%H-%M-%S') + " " +t['text'] + "\n")
File "F:\Portable\Portable Python 2.7.2.1\App\lib\site-packages\wx-2.8-msw-unicode\wx_controls.py", line 1850, in AppendText
return _controls_.TextCtrl_AppendText(*args, **kwargs)
File "F:\Portable\Portable Python 2.7.2.1\App\lib\encodings\cp1252.py", line 15, in decode
return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 73: character maps to undefined
The UnicodeDecodeError seems to occur at the end of the right double quotation mark (\xe2\x80\x9d) but I can't see why. I would be grateful for any help.
It may be a simple encoding problem, I'm afraid
The reference to cp1252 kind of threw me when I looked at the traceback, because the text is utf8 (as one might expect when handling the text of tweets.) The utf8 sequence on the left (\xe2\x80\x9c) doesn't seem to cause a problem, but it appears there's a space after the \xe2 in the second hex sequence, which would keep it from being decoded from utf8 properly. When I remove that space, the decode problem goes away. So you've got some bad utf8, which I'm not sure how you would guard against other than an explicit decode inside a try statement when you receive it from the original source. Does this make sense?
Yes, it's a simple encoding problem.
The reason you don't see why is that your font isn't distinguishing between u'”' and u'"'. The former is a curly closed-quote mark, which is '\xe2\x80\x9d' in UTF-8. This most often happens when you edit a text file in an editor (like MS Word) that does "smart quotes".
But it's good that you found the problem now; otherwise, everything would seem to work until you gave your script to some Chinese user…
Anyway, the problem here is that you've got some code that's storing UTF-8 byte strings, and some other code that's trying to decode them as if they were in the default Windows encoding (the ANSI code page, cp1252 according to your traceback). Without seeing more of the code, it's hard to be sure what exactly you're doing wrong, but hopefully this is enough info for you to track it down.
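A short sketch of what is going on with the two quotation marks, assuming the tweet text arrives as UTF-8 bytes as the printed repr suggests. It also shows why only the closing quote raises: its last byte, 0x9d, has no mapping in cp1252, while the opening quote decodes silently into the wrong characters.

```python
left = b'\xe2\x80\x9c'    # UTF-8 bytes of the opening curly quote
right = b'\xe2\x80\x9d'   # UTF-8 bytes of the closing curly quote

# Decoding with the codec the bytes were written in gives the real characters
as_utf8 = (left + right).decode('utf-8')

# Decoding the opening quote as cp1252 "works" but yields three wrong characters
mojibake = left.decode('cp1252')

try:
    right.decode('cp1252')
    right_fails = False
except UnicodeDecodeError:
    right_fails = True    # 0x9d is undefined in cp1252
```

So decoding explicitly before the text reaches wx, for example t['text'].decode('utf-8') in Python 2, would likely avoid the implicit cp1252 decode altogether.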

Converting from ascii to utf-8 with Python

I have an XMPP bot written in Python. One of its plugins is able to execute OS commands and send the output to the user. As far as I know, the output should be unicode to send it over the XMPP protocol. So I tried to handle it this way:
output = os.popen(cmd).read()
if not isinstance(output, unicode):
output = unicode(output,'utf-8','ignore')
bot.send(xmpp.Message(mess.getFrom(),output))
But when Russian symbols appear in output they aren't converted well.
sys.getdefaultencoding()
says that default command prompt encoding is 'ascii', but when I try to do
output.decode('ascii')
in python console I get
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 1:
ordinal not in range(128)
OS: Win XP, Python 2.5.4
PS: Sorry for my English :(
sys.getdefaultencoding() returns python's default encoding - which is ASCII unless you have changed it. ASCII doesn't support Russian characters.
You need to work out what encoding the actual text is, either manually, or using the locale module.
Typically something like:
import locale
encoding = locale.getpreferredencoding(do_setlocale=True)
ASCII has no defined character values above 127 (0x7F). Perhaps you mean the Cyrillic code page? It's 866.
See http://en.wikipedia.org/wiki/Code_page
edit: since this answer was marked correct, presumably 866 worked, but as other answers have pointed out, 866 is not the only Russian language code page. If you use a code page different from the one that was used when the Russian symbols were encoded, you will get the wrong result.
You say """sys.getdefaultencoding() says that default command prompt encoding is 'ascii'"""
sys.getdefaultencoding says NOTHING about the "command prompt" encoding.
On Windows, sys.stdout.encoding should do the job. On my machine, it contains cp850 when Python is run in a Command Prompt window, and cp1252 in IDLE. Yours should contain cp866 and cp1251 respectively.
Update You say that you still need cp866 in IDLE. Note this:
IDLE 2.6.4
>>> import os
>>> os.popen('chcp').read()
'Active code page: 850\n'
>>>
So when your app starts up, check if you are on Windows and if so, parse the result of os.popen('chcp').read(). The text before the : is probably locale-dependent. codepage = result.split()[-1] may be good enough "parsing". On Unix, which doesn't have a Windows/MS-DOS split personality, sys.stdout.encoding should be OK.
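The "parsing" described above can be sketched as follows. The function name codec_from_chcp is hypothetical, and stripping a trailing period is an assumption about how some localized Windows versions print the line.

```python
def codec_from_chcp(output):
    """Turn chcp output such as 'Active code page: 850' into the codec name 'cp850'."""
    # take the last whitespace-separated token, since the text before it is locale-dependent
    number = output.split()[-1].rstrip('.')  # assumption: some locales end with a period
    return 'cp' + number

codec = codec_from_chcp('Active code page: 850\n')
```

In the bot, the result would then be passed to unicode(output, codec) in place of the hard-coded 'utf-8'.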
In Python 'cp855', 'cp866', 'cp1251', 'iso8859_5', 'koi8_r' are differing Russian code pages. You'll need to use the right one to decode the output of popen. In the Windows console, the 'chcp' command lists the code page used by console commands. That won't necessarily be the same code page as Windows applications. On US Windows, 'cp437' is used for the console and 'cp1252' is used for applications like Notepad.
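To see why the choice of code page matters, here is a small sketch with a sample Russian word (the word and variable names are only illustrative). The same bytes decode without any error under two Cyrillic code pages, but only one of them gives back the original text.

```python
# -*- coding: utf-8 -*-
text = u'\u041f\u0440\u0438\u0432\u0435\u0442'   # the Russian word for "hello"
raw = text.encode('cp866')                        # bytes as a cp866 console would emit them

right = raw.decode('cp866')    # same code page: the original word
wrong = raw.decode('cp1251')   # different Cyrillic code page: no error, but garbage
```

Because the wrong code page fails silently here, looping over codecs until one "works" cannot tell you which one is correct.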

Unicode problems with web pages in Python's urllib

I seem to have the all-familiar problem of correctly reading and viewing a web page. It looks like Python reads the page in UTF-8 but when I try to convert it to something more viewable (iso-8859-1) I get this error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 2: ordinal not in range(128)
The code looks like this:
#!/usr/bin/python
from urllib import urlopen
import re
url_address = 'http://www.eurohockey.net/players/show_player.cgi?serial=4722'
finished = 0
begin_record = 0
col = 0
str = ''
for line in urlopen(url_address):
    if '</tr' in line:
        begin_record = 0
        print str
        str = ''
        continue
    if begin_record == 1:
        col = col + 1
        tmp_match = re.search('<td>(.+)</td>', line.strip())
        str = str + ';' + unicode(tmp_match.group(1), 'iso-8859-1')
    if '<tr class=\"even\"' in line or '<tr class=\"odd\"' in line:
        begin_record = 1
        col = 0
        continue
How should I handle the contents? Firefox at least thinks it's iso-8859-1 and it would make sense looking at the contents of that page. The error comes from the 'ä' character clearly.
And if I was to save that data to a database, should I not bother with changing the codec and then converting when showing it?
As noted by Lennart, your problem is not the decoding. It is trying to encode into "ascii", which is often a problem with print statements. I suspect the line
print str
is your problem. You need to encode the str into whatever your console is using to have that line work.
It doesn't look like Python is "reading it in UTF-8" at all. As already pointed out, you have an encoding problem, NOT a decoding problem. It is impossible for that error to have arisen from that line that you say. When asking a question like this, always give the full traceback and error message.
Kathy's suspicion is correct; in fact the print str line is the only possible source of that error, and that can only happen when sys.stdout.encoding is not set so Python punts on 'ascii'.
Variables that may affect the outcome are what version of Python you are using, what platform you are running on and exactly how you run your script -- none of which you have told us; please do.
Example: I'm using Python 2.6.2 on Windows XP and I'm running your script with some diagnostic additions:
(1) import sys; print sys.stdout.encoding up near the front
(2) print repr(str) before print str so that I can see what you've got before it crashes.
In a Command Prompt window, if I do \python26\python hockey.py it prints cp850 as the encoding and just works.
However if I do
\python26\python hockey.py | more
or
\python26\python hockey.py >hockey.txt
it prints None as the encoding and crashes with your error message on the first line with the a-with-diaeresis:
C:\junk>\python26\python hockey.py >hockey.txt
Traceback (most recent call last):
File "hockey.py", line 18, in <module>
print str
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 2: ordinal not in range(128)
If that fits your case, the fix in general is to explicitly encode your output with an encoding suited to the display mechanism you plan to use.
That text is indeed iso-8859-1, and I can decode it without a problem, and indeed your code runs without a hitch.
Your error, however, is an ENCODE error, not a decode error. And you don't do any explicit encoding in your code. Possibly you have confused encoding and decoding; it's a common problem.
You DECODE from Latin1 to Unicode. You ENCODE the other way. Remember that Latin1, UTF8 etc are called "encodings".
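The two directions can be sketched like this, using a made-up sample name rather than the actual page content. The last step reproduces the kind of error in the question: encoding to ASCII, which is what Python 2 falls back to when stdout has no encoding.

```python
raw = b'J\xe4rvinen'                  # hypothetical bytes as served in iso-8859-1

text = raw.decode('iso-8859-1')       # DECODE: bytes -> unicode
latin1 = text.encode('iso-8859-1')    # ENCODE: unicode -> bytes, same bytes as before
utf8 = text.encode('utf-8')           # ENCODE to a different encoding, different bytes

try:
    text.encode('ascii')              # ASCII has no a-with-diaeresis
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```

For the database part of the question, this suggests keeping the decoded unicode text internally and encoding only when writing it out.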
