I have a string variable blah that I am trying to send to Python's print and I get this error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 26892-26894: ordinal not in range(128)
The text is coming from a windows server via WMI. So I tried this:
try:
    print(blah, end='')
except:
    print(blah.encode('utf-8'), end='')
This catches the error, but now my blah variable looks like this:
b'line#1\nline#2\nline#3\nline#4'
However, I need to send the output of this Python program to a daemon that expects the original format,
line#1
line#2
line#3
line#4
Which is what I really want. I tried chaining it with string.encode('utf-8').decode('utf-8','errors=ignore') but apparently my version of Python doesn't accept this. Any suggestions?
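For reference, one workaround (a minimal sketch, assuming Python 3 and that the daemon simply consumes UTF-8 text from stdout) is to bypass the console codec entirely and write the encoded bytes to the underlying binary stream, which preserves the original line#1/line#2 layout instead of the b'...' repr:
import sys

try:
    print(blah, end='')
except UnicodeEncodeError:
    # Write raw UTF-8 bytes: no b'...' wrapper, and the newlines stay real newlines.
    sys.stdout.buffer.write(blah.encode('utf-8'))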
Related
I'm trying to read the list of available wifi connections from python. I use the following code to do so:
import re
import subprocess

ps = subprocess.Popen(('sudo', 'iwlist', 'wlan0', 'scan'), stdout=subprocess.PIPE)
output = subprocess.check_output(('grep', 'ESSID'), stdin=ps.stdout)
ssids = filter(None, sorted(dict.fromkeys(re.findall('"([^"]*)"', output))))
This produces a list of available ssids. Great. But if I have an ssid with an apostrophe in it, I get something like:
Ryan\xe2\x80\x99s iPhone
Cool... the only way I can get this to display correctly on my web app is if I run str.decode('string_escape') on each of the ssids. It then looks like:
Ryan’s iPhone
But then...when I try to write that value to wpa_supplicant.conf, I get:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position...
Ok...so I try to utf-8 encode it before it is submitted, and I get this error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0x32 in position...
I've tried utf-8 encoding the ssids initially (instead of using decode('string_escape')), but the character still just shows up and is passed as \xe2\x80\x99. I don't know what I can do to simply display and pass this like any other normal character.
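For reference, a sketch of one pattern that keeps the text intact end to end (assuming Python 2, since str.decode('string_escape') only exists there): decode the scan output from UTF-8 once, work with unicode throughout, and encode only when writing the config file. The wpa_supplicant stanza below is purely illustrative.
import codecs
import re

text = output.decode('utf-8')  # iwlist emits UTF-8 bytes
ssids = sorted(set(s for s in re.findall(u'"([^"]*)"', text) if s))

with codecs.open('/etc/wpa_supplicant/wpa_supplicant.conf', 'a', encoding='utf-8') as f:
    for ssid in ssids:
        f.write(u'network={\n    ssid="%s"\n}\n' % ssid)  # illustrative stanza only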
I'm fetching JSON with Requests from an API (using Python 3.5), and when I try to print (or use) the JSON, whether via response.text, json.loads(...) or response.json(), I get a UnicodeEncodeError.
print(response.text)
UnicodeEncodeError: 'ascii' codec can't encode character '\xc5' in position 676: ordinal not in range(128)
The JSON contains an array of dictionaries with country names and some of them contain special characters, e.g.: (just one dictionary in the binary array for example)
b'[{\n "name" : "\xc3\x85land Islands"\n}]'
I have no idea why there is an encoding problem, or why "ascii" is used when Requests detects a UTF-8 encoding (even setting it manually to UTF-8 doesn't change anything).
Edit2: The problem was Microsoft Visual Studio Code 1.4. It wasn't able to print the characters.
If your code is running within VS, then it sounds like Python can't work out the encoding of the built-in console, so it defaults to ASCII. If you try to print anything non-ASCII, Python throws an error rather than printing text that won't display.
You can force Python's encoding by using the PYTHONIOENCODING environment variable. Set it within the run configuration for the script.
Depending on Visual Studio's console, you may get away with:
PYTHONIOENCODING=utf-8
or you may have to use a typical 8-bit charset like:
PYTHONIOENCODING=windows-1252
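If it helps, a quick sanity check (a sketch; it assumes you launch the script from the same console the IDE uses) to confirm the variable took effect:
import sys

# With PYTHONIOENCODING=utf-8 set in the run configuration, this should report 'utf-8'
# rather than None or an ASCII-ish default.
print(sys.stdout.encoding)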
I have a script that reads an XML file and writes it into the Database.
When I run it through the browser (call it via a view) it works fine, but
when I created a Command for it (./manage.py importxmlfile) I get the following message:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 6: ordinal not in range(128)
I'm not sure why it would only happen when calling the import via command line.. any ideas?
Update
I'm trying to convert an lxml.etree._ElementUnicodeResult object to string and save it in the DB (utf8 collation) using str(result).
This produces the error mentioned above only on Command Line.
Ah, don't use str(result).
Instead, do:
result.encode('utf-8')
When you call str(result), Python will use the default system encoding (usually ascii) to try to encode the characters in result. This will break whenever a character's ordinal is not in range(128). Rather than relying on the ascii codec, call .encode() yourself and tell Python which codec to use.
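A tiny illustration of the difference (Python 2; u'M\xfcnchen' is just a stand-in for the lxml text node):
result = u'M\xfcnchen'   # stand-in for the _ElementUnicodeResult value
result.encode('utf-8')   # -> 'M\xc3\xbcnchen', safe to store in a utf8-collated column
str(result)              # raises UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc'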
Check out the Python Unicode HowTo for more information. You might also want to check out this related question or this excellent presentation on the subject.
I've honestly spent a lot of time on this, and it's slowly killing me. I've stripped content from a PDF and stored it in an array. Now I'm trying to pull it back out of the array and write it into a txt file. However, I do not seem to be able to make it happen because of encoding issues.
allTheNTMs.append(contentRaw[s1:].encode("utf-8"))
for a in range(len(allTheNTMs)):
    kmlDescription = allTheNTMs[a]
    print kmlDescription  # this prints out fine
    outputFile.write(kmlDescription)
The error I'm getting is "UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 213: ordinal not in range(128)".
I'm just messing around now, but I've tried all kinds of ways to get this stuff to write out.
outputFile.write(kmlDescription).decode('utf-8')
Please forgive me if this is basic, I'm still learning Python (2.7).
Cheers!
EDIT1: Sample data looks something like the following:
Chart 3686 (plan, Morehead City) [ previous update 4997/11 ] NAD83 DATUM
Insert the accompanying block, showing amendments to coastline,
depths and dolphins, centred on: 34° 41´·19N., 76° 40´·43W.
Delete R 34° 43´·16N., 76° 41´·64W.
When I add the print type(raw), I get
Edit 2: When I just try to write the data, I receive the original error message (ascii codec can't decode byte...)
I will check out the suggested thread and video. Thanks folks!
Edit 3: I'm using Python 2.7
Edit 4: agf hit the nail on the head in the comments below when (s)he noticed that I was double encoding. I tried intentionally double encoding a string that had previously been working and produced the same error message that was originally thrown. Something like:
text = "Here's a string, but imagine it has some weird symbols and whatnot in it - apparently latin-1"
textEncoded = text.encode('utf-8')
textEncodedX2 = textEncoded.encode('utf-8')
outputfile.write(textEncoded) #Works!
outputfile.write(textEncodedX2) #failed
Once I figured out I was trying to double encode, the solution was the following:
allTheNTMs.append(contentRaw[s1:].encode("utf-8"))
for a in range(len(allTheNTMs)):
    kmlDescription = allTheNTMs[a]
    kmlDescriptionDecode = kmlDescription.decode("latin-1")
    outputFile.write(kmlDescriptionDecode)
It's working now, and I sure appreciate all of your help!!
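For anyone landing here later, another way to sidestep the double encode entirely (a sketch, assuming contentRaw is already a unicode string once the PDF text has been extracted) is to keep unicode in the list and encode exactly once, at write time:
allTheNTMs.append(contentRaw[s1:])  # store the text as unicode, no encode yet

for kmlDescription in allTheNTMs:
    outputFile.write(kmlDescription.encode("utf-8"))  # encode once, at the boundary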
My guess is that the output file you have opened was opened with a latin1 (or even utf-8) codec, so you can't write utf-8 encoded data to it: the codec tries to re-convert it. To a normally opened file you can write any arbitrary byte string. Here is an example recreating a similar error:
import codecs

u = u'सच्चिदानन्द हीरानन्द वात्स्यायन '
s = u.encode('utf-8')
f = codecs.open('del.text', 'wb', encoding='latin1')
f.write(s)
output:
Traceback (most recent call last):
File "/usr/lib/wingide4.1/src/debug/tserver/_sandbox.py", line 1, in <module>
# Used internally for debug sandbox under external interpreter
File "/usr/lib/python2.7/codecs.py", line 691, in write
return self.writer.write(data)
File "/usr/lib/python2.7/codecs.py", line 351, in write
data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 0: ordinal not in range(128)
Solution:
This will work if you don't set any codec:
f = open('del.txt', 'wb')
f.write(s)
The other option is to write the unicode string directly to the file, without encoding it first, provided the file has been opened with the correct codec, e.g.:
f = codecs.open('del.text', 'wb',encoding='utf-8')
f.write(u)
Your error message doesn't appear to relate to your Python syntax; rather, you're trying to decode a byte value that the default codec can't handle.
Hex 0xc2 represents a Latin character in latin-1 (an uppercase A with an accent on top). Therefore, instead of using allTheNTMs.append(contentRaw[s1:].encode("utf-8")), try:
allTheNTMs.append(contentRaw[s1:].encode("latin-1"))
I'm not an expert in Python, so this may not work, but it appears you're dealing with a Latin character. Given the error message, the implicit 'ascii' codec only covers the first 128 values, and byte 0xc2 falls outside that range, which is why the conversion fails.
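A tiny demonstration (Python 2) of why that byte trips the default codec but decodes cleanly as latin-1:
raw = '\xc2'
print repr(raw.decode('latin-1'))  # u'\xc2' (a latin 'Â'), succeeds
raw.decode('ascii')                # raises UnicodeDecodeError: ordinal not in range(128)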
I seem to have the all-familiar problem of correctly reading and viewing a web page. It looks like Python reads the page in UTF-8 but when I try to convert it to something more viewable (iso-8859-1) I get this error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 2: ordinal not in range(128)
The code looks like this:
#!/usr/bin/python
from urllib import urlopen
import re

url_address = 'http://www.eurohockey.net/players/show_player.cgi?serial=4722'
finished = 0
begin_record = 0
col = 0
str = ''

for line in urlopen(url_address):
    if '</tr' in line:
        begin_record = 0
        print str
        str = ''
        continue
    if begin_record == 1:
        col = col + 1
        tmp_match = re.search('<td>(.+)</td>', line.strip())
        str = str + ';' + unicode(tmp_match.group(1), 'iso-8859-1')
    if '<tr class="even"' in line or '<tr class="odd"' in line:
        begin_record = 1
        col = 0
        continue
How should I handle the contents? Firefox at least thinks it's iso-8859-1, and that would make sense given the contents of the page. The error clearly comes from the 'ä' character.
And if I was to save that data to a database, should I not bother with changing the codec and then converting when showing it?
As noted by Lennart, your problem is not the decoding. It is trying to encode into "ascii", which is often a problem with print statements. I suspect the line
print str
is your problem. You need to encode str into whatever encoding your console is using for that line to work.
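A minimal sketch of that (Python 2; it assumes str here is the unicode variable from the question, and falls back to utf-8 when Python can't detect a console encoding):
import sys

console = sys.stdout.encoding or 'utf-8'  # None when output is piped or redirected
print str.encode(console, 'replace')      # characters the console can't show become '?'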
It doesn't look like Python is "reading it in UTF-8" at all. As already pointed out, you have an encoding problem, NOT a decoding problem. It is impossible for that error to have arisen from the line you say it did. When asking a question like this, always give the full traceback and error message.
Kathy's suspicion is correct; in fact the print str line is the only possible source of that error, and that can only happen when sys.stdout.encoding is not set so Python punts on 'ascii'.
Variables that may affect the outcome are what version of Python you are using, what platform you are running on and exactly how you run your script -- none of which you have told us; please do.
Example: I'm using Python 2.6.2 on Windows XP and I'm running your script with some diagnostic additions:
(1) import sys; print sys.stdout.encoding up near the front
(2) print repr(str) before print str so that I can see what you've got before it crashes.
In a Command Prompt window, if I do \python26\python hockey.py it prints cp850 as the encoding and just works.
However if I do
\python26\python hockey.py | more
or
\python26\python hockey.py >hockey.txt
it prints None as the encoding and crashes with your error message on the first line with the a-with-diaeresis:
C:\junk>\python26\python hockey.py >hockey.txt
Traceback (most recent call last):
File "hockey.py", line 18, in <module>
print str
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 2: ordinal not in range(128)
If that fits your case, the fix in general is to explicitly encode your output with an encoding suited to the display mechanism you plan to use.
That text is indeed iso-8859-1, and I can decode it without a problem; indeed, your code runs without a hitch.
Your error, however, is an ENCODE error, not a decode error, and you don't do any encoding in your code. Possibly you have gotten encoding and decoding confused; it's a common problem.
You DECODE from Latin1 to Unicode. You ENCODE the other way. Remember that Latin1, UTF8 etc are called "encodings".