In Python, I have some text that is Unicode. It contains non-breaking spaces, which I want to convert to 'x'. A non-breaking space is equal to chr(160). I have the following code, which works fine when I run it as Django via Eclipse on localhost: no errors, and any non-breaking spaces are converted.
my_text = u"hello"
my_new_text = my_text.replace(chr(160), "x")
However when I run it any other way (Python command line, Django via runserver instead of Eclipse) I get an error:
'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)
I guess this error makes sense because it's trying to compare Unicode (my_text) to something that isn't Unicode. My questions are:
1. If chr(160) isn't Unicode, what is it?
2. How come this works when I run it from Eclipse? Understanding this would help me determine whether I need to change other parts of my code, since I have been testing it from Eclipse.
3. (Most important) How do I solve my original problem of removing the non-breaking spaces? my_text is definitely going to be Unicode.
In Python 2, chr(160) is a byte string of length one whose only byte has value 160, or hex a0. There's no meaning attached to it except in the context of a specific encoding.
I'm not familiar with Eclipse, but it may be playing encoding tricks of its own.
If you want the Unicode character NO-BREAK SPACE, i.e. code point 160, that's unichr(160).
E.g.,
>>> u"hello\u00a0world".replace(unichr(160), "X")
u'helloXworld'
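Applied to your original problem, then (assuming my_text is, as stated, a unicode object), the fix is to compare against the Unicode character rather than the byte:
>>> my_text = u"hello\xa0world"
>>> my_text.replace(unichr(160), u'x')
u'helloxworld'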
I am Python beginner so I hope this problem will be an easy fix.
I would like to print the value of an attribute as follows:
print (follower.city)
I receive the following error message:
File "C:\Python34\lib\encodings\cp850.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u0130' in position 0: character maps to <undefined>
I think the problem is that cp850.py does not contain the relevant character in the encoding table.
What would be the solution to this problem? There is no absolute need to display the character correctly, but the error must be avoided. Do I need to modify cp850.py?
Sorry if this question has been addressed before, but I was not able to figure it out using previous answers to this topic.
To print a string it must first be converted from pure Unicode to the byte sequences supported by your output device. This requires encoding to the proper character set, which Python has identified as cp850, the Windows console default.
Starting with Python 3.3 you can set the Windows console to use UTF-8 with the following command issued at the command prompt:
chcp 65001
This should fix your issue, as long as you've configured the window to use a font that contains the character.
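Alternatively, if you only need to avoid the error rather than render the character faithfully, one sketch (follower.city is the attribute from the question; the rest is an assumption about your setup) is to encode with a lossy error handler before printing:
import sys

# errors='replace' substitutes '?' for unencodable characters
# instead of raising UnicodeEncodeError.
encoding = sys.stdout.encoding or 'cp850'
print(follower.city.encode(encoding, errors='replace').decode(encoding))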
I have a web-server on which I try to submit a form containing Cyrillic letters. As a result I get the following error message:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
This message comes from the following line of the code:
ups = 'rrr {0}'.format(body.replace("'","''"))
(body contains Cyrillic letters). Strangely I cannot reproduce this error message in the python command line. The following works fine:
>>> body = 'ппп'
>>> ups = 'rrr {0}'.format(body.replace("'","''"))
It's working in the interactive prompt because your terminal is using your locale to determine what encoding to use. Directly from the Python docs:
Whereas the other file-like objects in Python always convert to ASCII unless you set them up differently, using print() to output to the terminal will use the user's locale to convert before sending the output to the terminal.
On the other hand, while your server is running the scripts, there is no such assumption. Everything read as a byte str from a file-like object is assumed to be ASCII unless otherwise specified. Your Cyrillic characters, presumably encoded as UTF-8, can't be converted; they're far beyond U+007F, the last code point where UTF-8 and ASCII coincide. (Unicode numbers its code points in hex; U+007F is 127 in decimal. ASCII has only 128 code points, 0 through 127, because it uses a single byte, and of that byte only the least-significant 7 bits; the most significant bit is always 0.)
Back to your problem. If you want to operate on the body of the file, you'll have to specify that it should be opened with a UTF-8 encoding. (Again, I'm assuming it's UTF-8 because it's information submitted from the web. If it's not -- well, it really should be.) The solution has already been given in other StackOverflow answers, so I'll just link to one of them rather than reiterate what's already been answered. The best answer may vary a little bit depending on your version of Python -- if you let me know in a comment I could give you a clearer recommendation.
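In outline, a minimal sketch of that solution (assuming Python 2 and UTF-8 input; the file name here is hypothetical):
import io

# io.open with an encoding decodes for you, so body is a unicode
# object rather than raw bytes ('submitted_form.txt' is a made-up path).
with io.open('submitted_form.txt', encoding='utf-8') as f:
    body = f.read()

ups = u'rrr {0}'.format(body.replace(u"'", u"''"))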
My background is in Perl, but I'm giving Python plus BeautifulSoup a try for a new project.
In this example, I'm trying to extract and present the link targets and link text contained in a single page. Here's the source:
table_row = u'<tr><td>{}</td><td>{}</td></tr>'.encode('utf-8')
link_text = unicode(link.get_text()).encode('utf-8')
link_target = link['href'].encode('utf-8')
line_out = unicode(table_row.format(link_text, link_target))
All those explicit calls to .encode('utf-8') are my attempt to make this work, but they don't seem to help -- it's likely that I'm completely misunderstanding something about how Python 2.7 handles Unicode strings.
Anyway. This works fine up until it encounters U+2013 in a URL (yes, really). At that point it bombs out with:
Traceback (most recent call last):
File "./test2.py", line 30, in <module>
line_out = unicode(table_row.encode('utf-8').format(link_text, link_target.encode('utf-8')))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 79: ordinal not in range(128)
Presumably .format(), even applied to a Unicode string, is playing silly-buggers and trying to do a .decode() operation. And as ASCII is the default, it's using that, and of course it can't map U+2013 to an ASCII character, and thus...
The options seem to be to remove it or convert it to something else, but really what I want is to simply preserve it. Ultimately (this is just a little test case) I need to be able to present working clickable links.
The BS3 documentation suggests changing the default encoding from ASCII to UTF-8 but reading comments on similar questions that looks to be a really bad idea as it'll muck up dictionaries.
Short of using Python 3.2 instead (which means no Django, which we're considering for part of this project) is there some way to make this work cleanly?
First, note that your two code samples disagree on the text of the problematic line:
line_out = unicode(table_row.encode('utf-8').format(link_text, link_target.encode('utf-8')))
vs
line_out = unicode(table_row.format(link_text, link_target))
The first is the one from the traceback, so it's the one to look at. Assuming the rest of your first code sample is accurate, table_row is a byte string, because you took a unicode string and encoded it. Byte strings can't be encoded directly, so Python 2 implicitly converts table_row from byte string to unicode first, by decoding it as ASCII. Hence the UnicodeDecodeError from the ascii codec.
You need to decide what strings will be byte strings and which will be unicode strings, and be disciplined about it. I recommend keeping all text as Unicode strings as much as possible.
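A minimal sketch of that approach applied to your code (assuming BeautifulSoup 4, where link.get_text() and link['href'] already return unicode):
table_row = u'<tr><td>{}</td><td>{}</td></tr>'   # stays unicode; no .encode()
link_text = link.get_text()                      # already unicode
link_target = link['href']                       # already unicode
line_out = table_row.format(link_text, link_target)

# Encode exactly once, at the output boundary:
print line_out.encode('utf-8')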
Here's a presentation I gave at PyCon that explains it all: Pragmatic Unicode, or, How Do I Stop The Pain?
So when I post a name or text in mod_python in my native language, I get:
македонија
And i also get:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-9: ordinal not in range(128)
When I use:
hparser = HTMLParser.HTMLParser()
req.write(hparser.unescape(text))
How can I decode it?
It's hard to explain UnicodeErrors if you don't understand the underlying mechanism. You should really read either or both of
Pragmatic Unicode (Ned Batchelder)
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) (Joel Spolsky)
In a (very small) nutshell, a Unicode code point is an abstract "thingy" representing one character[1]. Programmers like to work with these, because we like to think of strings as coming one character at a time. Unfortunately, it was decreed a long time ago that a character must fit in one byte of memory, so there can be at most 256 different characters. Which is fine for plain English, but doesn't work for anything else. There's a global list of code points -- thousands of them -- which are meant to hold every possible character, but clearly they don't fit in a byte.
The solution: there is a difference between the ordered list of code points that make a string, and its encoding as a sequence of bytes. You have to be clear whenever you work with a string which of these forms it should be in.
To convert between the forms you can .encode() a list of code points (a Unicode string) as a list of bytes, and .decode() bytes into a list of code points. To do so, you need to know how to map code points into bytes and vice versa, which is the encoding. If you don't specify one, Python 2.x will guess that you meant ASCII. If that guess is wrong, you will get a UnicodeError.
Note that Python 3.x is much better at handling Unicode strings, because the distinction between bytes and code points is much more clear cut.
[1] Sort of.
EDIT: I guess I should point out how this helps. But you really should read the above links! Just throwing in .encode()s and .decode()s everywhere is a terrible way to code, and one day you'll get bitten by a worse problem.
Anyway, if you step through what you're doing in the shell you'll see
>>> from HTMLParser import HTMLParser
>>> text = "македонија"
>>> hparser = HTMLParser()
>>> text = hparser.unescape(text)
>>> text
u'\u043c\u0430\u043a\u0435\u0434\u043e\u043d\u0438\u0458\u0430'
I'm using Python 2.7 here, so that's a Unicode string, i.e. a sequence of Unicode code points. We can encode them into a regular string (i.e. a list of bytes) like
>>> text.encode("utf-8")
'\xd0\xbc\xd0\xb0\xd0\xba\xd0\xb5\xd0\xb4\xd0\xbe\xd0\xbd\xd0\xb8\xd1\x98\xd0\xb0'
But we could also pick a different encoding!
>>> text.encode("utf-16")
'\xff\xfe<\x040\x04:\x045\x044\x04>\x04=\x048\x04X\x040\x04'
You'll need to decide what encoding you want to use.
What went wrong when you did it? Well, not every encoding understands every code point. In particular, the "ascii" encoding only understands the first 128! So if you try
>>> text.encode("ascii")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-9: ordinal not in range(128)
you just get an error, because you can't encode those code points in ASCII.
When you do req.write, you are trying to write a list of code points down the request. But an HTTP response doesn't understand code points: it just carries bytes. Python 2 will try to be helpful by automatically ASCII-encoding your Unicode strings, which is fine if they really are ASCII, but not if they aren't.
So you need to do req.write(hparser.unescape(text).encode("some-encoding")).
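For a page served as UTF-8, for example (assuming your response's Content-Type declares charset=utf-8), that would be:
req.write(hparser.unescape(text).encode("utf-8"))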
I'm writing a Python script to process some music data. It's supposed to merge two separate databases by comparing their entries and matching them up. It's almost working, but fails when comparing strings containing special characters (i.e. accented letters). I'm pretty sure it's an ASCII vs. Unicode encoding issue, as I get the error:
"Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal"
I realize I could use regular expressions to remove the offending characters, but I'm processing a lot of data and relying too much on regexes makes my program grindingly slow. Is there a way to have Python properly compare these strings? What is going on here--is there a way to tell whether it's storing my strings as ASCII or Unicode?
EDIT 1: I'm using Python v2.6.6. After checking the types, I've discovered that one database gives me Unicode strings and the other gives ASCII. So that's probably the problem. I'm trying to convert the ASCII strings from the second database to Unicode with a line like
line = unicode(f.readline().decode(latin_1).encode(utf_8))
but this gives an error like:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 41: ordinal not in range(128)
I'm not sure why the 'ascii' codec is complaining since I'm trying to decode from ASCII. Can anyone help?
Unicode vs Bytes
First, some terminology. There are two types of strings, encoded and decoded:
Encoded. This is what's stored on disk. To Python, it's a bunch of 0's and 1's that you might treat like ASCII, but it could be anything -- binary data, a JPEG image, whatever. In Python 2.x, this is called a "string" variable. In Python 3.x, it's more accurately called a "bytes" variable.
Decoded. This is a string of actual characters. They could be encoded as 8-bit ASCII, or as 32-bit Chinese characters, but until it's time to convert to an encoded variable, it's just a Unicode string of characters.
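In Python 2 terms, a minimal illustration (the byte values shown assume UTF-8 on disk):
s = 'caf\xc3\xa9'        # encoded: raw bytes, here the UTF-8 form of "café"
u = s.decode('utf-8')    # decoded: a unicode string of code points
print type(s), type(u)   # <type 'str'> <type 'unicode'>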
What this means to you
So here's the thing. You said you were getting one ASCII variable and one Unicode variable. That's actually not true.
You have one variable that's a string of bytes -- ones and zeros, presumably in sets of 8. This is the variable you assumed, incorrectly, to be ASCII.
You have another variable that's Unicode data -- numbers, letters, and symbols.
Before you compare the string of bytes to a Unicode string of characters, you have to make some assumptions. In your case, Python (and you) assumed that the string of bytes was ASCII encoded. That worked fine until you came across a character that wasn't ASCII -- a character with an accent mark.
So you need to find out what that string of bytes is encoded as. It might be latin1. If it is, you want to do this:
if unicode_variable == string_variable.decode('latin1'):
Latin1 is basically ASCII plus some extended characters like Ç and Â.
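For example, in Python 2:
>>> '\xc7'.decode('latin1')   # byte 0xC7 is Ç in Latin-1
u'\xc7'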
If your data is in Latin1, that's all you need to do. But if your string of bytes is encoded in something else, you'll need to figure out what encoding that is and pass it to decode().
The bottom line is, there's no easy answer, unless you know (or make some assumptions) about the encoding of your input data.
What I would do
Try running var.decode('latin1') on your string of bytes. That will give you a Unicode variable. If that works and the data looks correct (i.e., characters with accent marks look like they belong), roll with it.
Oh, and if latin1 doesn't parse or doesn't look right, try utf8 -- another common encoding.
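If you want to automate that try-latin1-then-utf8 advice, note that Latin-1 accepts every possible byte sequence, so it can only serve as the fallback; the stricter UTF-8 decode has to be attempted first. A sketch (the helper name is mine):
def to_unicode(byte_string):
    # UTF-8 fails loudly on non-UTF-8 input; Latin-1 never fails,
    # so it must come second.
    try:
        return byte_string.decode('utf8')
    except UnicodeDecodeError:
        return byte_string.decode('latin1')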
You might need to preprocess the databases and convert everything into UTF-8. My guess is that you've got Latin-1 accented characters in some entries.
As to your question, the only way to know for sure is to look. Have your script spit out those that don't compare, and look up the character codes. Or just try string.decode('latin1').encode('utf8') and see what happens.
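A quick diagnostic along those lines (the names here are hypothetical) might be:
for entry_a, entry_b in mismatched_pairs:   # the entries that failed to compare
    print repr(entry_a), repr(entry_b)      # repr exposes raw bytes vs. code points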
Converting both to unicode should help:
if unicode(str1) == unicode(str2):
    print "same"
To find out whether YOU (not it) are storing your strings as str objects or unicode objects, print type(your_string).
You can use print repr(your_string) to show yourself (and us) unambiguously what is in your string.
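For example, in Python 2:
>>> your_string = 'caf\xc3\xa9'
>>> print type(your_string)
<type 'str'>
>>> print repr(your_string)
'caf\xc3\xa9'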
By the way, exactly what version of Python are you using, on what OS? If Python 3.x, use ascii() instead of repr().