Python woes with ssids, utf-8, and unicode - python

I'm trying to read the list of available wifi connections from python. I use the following code to do so:
import re
import subprocess
ps = subprocess.Popen(('sudo', 'iwlist', 'wlan0', 'scan'), stdout=subprocess.PIPE)
output = subprocess.check_output(('grep', 'ESSID'), stdin=ps.stdout)
ssids = filter(None, sorted(dict.fromkeys(re.findall('"([^"]*)"', output))))
This produces a list of available ssids. Great. But if I have an ssid with an apostrophe in it, I get something like:
Ryan\xe2\x80\x99s iPhone
Cool..the only way I can get this to display correctly on my web app is if I run str.decode('string_escape') on each of the ssids. It then looks like:
Ryan’s iPhone
But then...when I try to write that value to wpa_supplicant.conf, I get:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in
position...
Ok...so I try to utf-8 encode it before it is submitted, and I get this error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0x32 in
position...
I've tried utf-8 encoding the ssids initially (instead of using decode('string_escape')), but the character still just shows up and is passed as \xe2\x80\x99. I don't know what I can do to simply display and pass this like any other normal character.
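One possible way through this, sketched in Python 3 rather than the Python 2 used above: treat the \xNN sequences as escaped UTF-8 bytes, decode them to text exactly once, and pass an explicit encoding when writing the file. The helper name unescape_ssid and the temporary output path are hypothetical, and the sketch assumes the scan output contains literal backslash escapes (which is what the need for decode('string_escape') suggests).

```python
import io
import os
import tempfile


def unescape_ssid(raw):
    """Turn text containing literal \\xNN escapes into real Unicode text.

    Assumption: the escapes spell out UTF-8 bytes and the input is
    otherwise plain ASCII.
    """
    # Step 1: interpret the backslash escapes; each \xNN becomes one code point.
    code_points = raw.encode('latin-1').decode('unicode_escape')
    # Step 2: map those code points back to bytes and decode them as UTF-8.
    return code_points.encode('latin-1').decode('utf-8')


ssid = unescape_ssid(r'Ryan\xe2\x80\x99s iPhone')

# Hypothetical target file; a real wpa_supplicant.conf lives elsewhere.
path = os.path.join(tempfile.mkdtemp(), 'wpa_supplicant.conf')
with io.open(path, 'w', encoding='utf-8') as conf:
    conf.write(u'ssid="%s"\n' % ssid)
```

Opening the file with an explicit encoding avoids the implicit ascii codec that produces the UnicodeEncodeError shown above.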

Related

convert string in Python from ascii to utf-8 and back

I have a string variable blah that I am trying to send to Python's print and I get this error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 26892-26894: ordinal not in range(128)
The text is coming from a windows server via WMI. So I tried this:
try:
    print(blah, end='')
except:
    print(blah.encode('utf-8'), end='')
This catches the error, but now my blah variable looks like this:
b'line#1\nline#2\nline#3\nline#4'
However, I need to send the output of this Python program to a daemon that expects the original format,
line#1
line#2
line#3
line#4
Which is what I really want. I tried chaining it with string.encode('utf-8').decode('utf-8','errors=ignore') but apparently my version of Python doesn't accept this. Any suggestions?
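One option, assuming Python 3 (which print(blah, end='') suggests): write the UTF-8 bytes straight to a binary stream instead of printing the bytes object, so the newlines stay real newlines and no b'...' repr appears. The helper name write_utf8 is hypothetical, and a BytesIO stands in for the real output stream so the result can be inspected.

```python
import io


def write_utf8(text, stream):
    """Encode text as UTF-8 and write the raw bytes to a binary stream."""
    stream.write(text.encode('utf-8'))
    stream.flush()


blah = 'line#1\nline#2\nline#3\nline#4'

# In a real script the stream would be sys.stdout.buffer, the binary layer
# underneath print(); BytesIO is used here only to capture the output.
captured = io.BytesIO()
write_utf8(blah, captured)
```

In the actual program the call would look like write_utf8(blah, sys.stdout.buffer), with sys imported.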

Python encoding error Polish characters

I've got a .txt file that I want to read with Python and it contains Polish city names. I use this code (my script has # -*- coding: utf-8 -*- in the first line):
string='PL.txt'
country=io.open(string,mode='r', encoding='utf-8')
lezer=csv.reader(country,dialect='excel-tab')
my_dict=defaultdict(list)
for record in lezer:
    pc, gemeente= record[0], record[1]
    my_dict[pc].append(gemeente)
return my_dict
When I use the code it starts running and then the error appears:
return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u0144' in position 35: character maps to <undefined>
I've searched the internet and found various answers, but not exactly the one I need.
If I understand correctly, the problem is the character ń. The basic charmap codec doesn't contain this character, so it can't be encoded.
I also tried the utf16 codec, but then it maps to something strange. I also tried other codecs like latin-1, cp437, cp1252.
I also tried:
string='PL.txt'
country=io.open(string,mode='r', encoding='utf-8')
lezer=csv.reader(country,dialect='excel-tab')
my_dict=defaultdict(list)
for record in lezer:
    pc, gemeente= record[0], record[1].encode('utf16')
    my_dict[pc].append(gemeente)
return my_dict
When I check type(record[1]) it gives str and not unicode. It's the same with other Polish characters.
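The traceback ends in charmap_encode, which suggests the failure happens when the text is encoded for output (for example when printing to a Windows console), not while the file is being read. A minimal Python 3 sketch of that situation, assuming cp1252 as the output code page; the helper name console_safe and the sample city are made up for illustration.

```python
def console_safe(text, encoding='cp1252'):
    """Return text with characters the target encoding lacks replaced by '?'."""
    return text.encode(encoding, errors='replace').decode(encoding)


city = u'Gda\u0144sk'  # contains U+0144, the character from the traceback

try:
    city.encode('cp1252')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False  # cp1252 has no mapping for U+0144

printable = console_safe(city)
```

Replacing characters loses information, so this only makes sense for display; the dictionary itself can keep the original text.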

Django getting encoding error only on command line

I have a script that reads an XML file and writes it into the Database.
When I run it through the browser (call it via a view) it works fine, but
when I created a Command for it (./manage.py importxmlfile) I get the following message:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 6: ordinal not in range(128)
I'm not sure why it would only happen when calling the import via command line.. any ideas?
Update
I'm trying to convert an lxml.etree._ElementUnicodeResult object to string and save it in the DB (utf8 collation) using str(result).
This produces the error mentioned above only on Command Line.
Ah, don't use str(result).
instead, do:
result.encode('utf-8')
When you call str(result), Python uses the default system encoding (usually ascii) to encode the characters in result. This breaks as soon as a character's ordinal is not in range(128). Rather than relying on the ascii codec, call .encode() and tell Python which codec to use.
Check out the Python Unicode HowTo for more information. You might also want to check out this related question or this excellent presentation on the subject.
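The difference between the ascii codec and an explicit codec can be seen in a small Python 3 sketch. The sample value is made up; it merely contains the same u'\xfc' character as the error message.

```python
value = u'M\xfcnchen'  # made-up sample containing u'\xfc'

try:
    value.encode('ascii')
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = exc  # the ascii codec cannot represent u'\xfc'

# Naming the codec explicitly works regardless of the system default.
utf8_bytes = value.encode('utf-8')
```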

Figuring out unicode: 'ascii' codec can't decode

I currently use Sublime 2 and run my python code there.
When I try to run this code, I get this error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position
6: ordinal not in range(128)
# -*- coding: utf-8 -*-
s = unicode('abcdefö')
print s
I have been reading the python documentation on unicode and as far as I understand this should work. Or is it the console that's not working?
Edit: Using s = u'abcdefö' as a string produces almost the same result. The result I get is
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in
position 6: ordinal not in range(128)
What happens is that unicode('abcdefö') tries to decode the encoded string to unicode at runtime. The coding: utf-8 line only tells Python that the source file is encoded in utf8. By the time the script runs it has been compiled, and the string has been stored as an encoded byte string. So when Python tries to decode the string it uses ascii by default. As the string is actually utf8 encoded, this fails.
You can do s = u'abcdefö' which tells the compiler to decode the string with the encoding declared for the file and store it as unicode. s = unicode('abcdefö', 'utf8') or s = 'abcdefö'.decode('utf8') would do the same thing at runtime.
However, this does not necessarily mean that you can print s now. First, the internal unicode string has to be encoded in a character set that stdout (the console/editor/IDE) can actually display. Sadly, Python often fails to figure out the right character set and defaults to ascii again, so you get an error when the string contains non-ascii characters. The Python Wiki describes a few ways to set up stdout properly.
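The decoding failure described above can be reproduced with a short sketch, written for Python 3, where the byte string has to be spelled out explicitly. The variable names are illustrative only.

```python
encoded = b'abcdef\xc3\xb6'  # the UTF-8 bytes for 'abcdefö'

try:
    encoded.decode('ascii')
    failure = None
except UnicodeDecodeError as exc:
    failure = exc  # byte 0xc3 at position 6 is outside range(128)

# Naming the real encoding turns the seven-character text back out.
text = encoded.decode('utf8')
```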
You need to mark the string as a unicode string:
s = u'abcdefö'
s = 'abcdefö'
DO NOT TRY unicode() if string is already in unicode. i.e. unicode(s) is wrong.
IF type(s) == str but contains unicode characters:
First convert to unicode
str_val = unicode(s,'utf-8')
str_val = unicode(s,'utf-8','replace')
Finally encode to string
str_val.encode('utf-8')
Now you can print:
print s
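What the 'replace' argument changes can be sketched as follows, in Python 3 where bytes.decode plays the role of unicode(s, 'utf-8', 'replace'). The invalid sample bytes are made up for illustration.

```python
good = b'abcdef\xc3\xb6'
bad = b'abc\xff'  # 0xff never occurs in valid UTF-8

# Valid input decodes the same way with or without 'replace'.
text = good.decode('utf-8', 'replace')
# Invalid bytes become U+FFFD instead of raising UnicodeDecodeError.
repaired = bad.decode('utf-8', 'replace')
# Encoding the decoded text gives the original bytes back.
round_trip = text.encode('utf-8')
```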

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe7' in position 17710: ordinal not in range(128)

I'm trying to print a string from an archived web crawl, but when I do I get this error:
print page['html']
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe7' in position 17710: ordinal not in range(128)
When I try print unicode(page['html']) I get:
print unicode(page['html'],errors='ignore')
TypeError: decoding Unicode is not supported
Any idea how I can properly code this string, or at least get it to print? Thanks.
You need to encode the unicode you saved to display it, not decode it -- unicode is the unencoded form. You should always specify an encoding, so that your code will be portable. The "usual" pick is utf-8:
print page['html'].encode('utf-8')
If you don't specify an encoding, whether or not it works will depend on what you're printing to -- your editor, OS, terminal program, etc.
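The same distinction holds in Python 3, where str is the unencoded form: decoding it again is rejected, while encoding it is the step that produces printable bytes. A minimal sketch with a made-up stand-in for page['html']:

```python
html = u'gar\xe7on'  # made-up stand-in for page['html'], already decoded text

try:
    str(html, errors='ignore')
    decode_error = None
except TypeError as exc:
    decode_error = exc  # text cannot be decoded a second time

# Encoding with an explicit codec is the direction that works.
payload = html.encode('utf-8')
```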
