Replacing non-ASCII characters in an ASCII-encoded string - Python

I have this code fragment (Python 2.7):
from bs4 import BeautifulSoup
content = '&nbsp; foo bar'
soup = BeautifulSoup(content, 'html.parser')
w = soup.get_text()
At this point w has a byte with value 160 in it, but its encoding is ASCII.
How do I replace all of the \xa0 bytes with another character?
I've tried:
w = w.replace(chr(160), ' ')
w = w.replace('\xa0', ' ')
but I am getting the error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)
And why does BS return an ASCII encoded string with an invalid character in it?
Is there a way to convert w to a 'latin1' encoded string?

At this point w has a byte with value 160 in it, but its encoding is 'ascii'.
You have a Unicode string:
>>> w
u'\xa0 foo bar'
>>> type(w)
<type 'unicode'>
How do I replace all of the \xa0 bytes with another character?
>>> x = w.replace(u'\xa0', ' ')
>>> x
u' foo bar'
And why does BS return an 'ascii' encoded string with an invalid character in it?
As mentioned above, it is not an ASCII-encoded string but a Unicode string instance.
Is there a way to convert w to a 'latin1' encoded string?
Sure:
>>> w.encode('latin1')
'\xa0 foo bar'
(Note this last string is an encoded byte string, not a unicode object, and its representation is not prefixed by 'u' like the previous unicode objects.)
Notes (edited):
If you are typing strings into your source files, note that the encoding of your source files matters. Python 2 assumes source files are ASCII unless told otherwise. The interactive interpreter, on the other hand, assumes you are entering strings in your system's default encoding. You can override both of these defaults.
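For example, a minimal sketch of overriding the source-file default with a PEP 263 coding declaration on the first line (the literal here is made up for illustration):
# -*- coding: utf-8 -*-
# with the declaration above, Python 2 reads this file's bytes as UTF-8,
# so the non-ASCII literal below is legal
w = u'España'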
Avoid latin1 and use UTF-8 if possible, i.e. w.encode('utf8').
When encoding and decoding, you can tell Python to ignore errors, or to replace characters that cannot be encoded with some marker character. I don't recommend ignoring encoding errors (at least not without logging them), except for the hopefully rare cases when you know there are encoding errors, or when you need to encode text into a smaller character set and must replace the code points that cannot be represented (e.g. if you need to encode 'España' into ASCII, you definitely should replace the 'ñ'). Even for these cases there are, imho, better alternatives, and you should look into the magical unicodedata module (see https://stackoverflow.com/a/1207479/401656).
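For instance, a sketch of both error modes plus the unicodedata alternative (expected outputs shown in the comments):
# -*- coding: utf-8 -*-
import unicodedata

w = u'España'
print w.encode('ascii', 'replace')  # 'Espa?a' -- the unencodable 'ñ' becomes the marker '?'
print w.encode('ascii', 'ignore')   # 'Espaa'  -- the 'ñ' is silently dropped

# often nicer: decompose accented characters, then drop the combining accents
print unicodedata.normalize('NFKD', w).encode('ascii', 'ignore')  # 'Espana'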
There is a Python Unicode HOWTO: https://docs.python.org/2/howto/unicode.html

Related

Python: unwanted unicode type

I've got unicode strings (from an API query) that should have been encoded as regular ascii strings (as they already contain unicode representations). How can I change the encoding without actually changing the characters being encoded?
To wit:
string = '165\xc2\xba F' # What I want
print(string)
my_string = u'165\xc2\xba F' # What I have
print(my_string)
P.S. I realize \xc2\xba is actually the masculine ordinal indicator, not the degree sign (\xc2\xb0), but that's what I got.
What you have is not "unicode"; it is the byte sequence for the UTF-8 encoding of the string you want.
You can retrieve the text by using the "latin-1" codec to transparently transport your byte sequence out of the unicode string into a byte string, and
then decode it normally from UTF-8:
In[]: u'165\xc2\xba F'.encode("latin1").decode("utf-8")
Out[]: u'165º F'
Why the latin-1 codec is special, and why it works in this case, is described in the second paragraph here: https://docs.python.org/3/library/codecs.html#encodings-and-unicode
When you've got some minutes to spare, it is worth reading this nice article on Unicode to learn what codecs are and what Unicode text actually means.
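If this round trip comes up often, it can be wrapped in a small helper (a sketch; the name fix_mojibake is made up here, and the fallback assumes anything that fails the round trip should pass through untouched):
def fix_mojibake(text):
    # reverse UTF-8 bytes that were mistakenly decoded as Latin-1
    try:
        return text.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text

print fix_mojibake(u'165\xc2\xba F')  # prints: 165º F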

Python - ASCII encoding string in the unicode string; how to remove that 'u'?

When I use python module 'pygoogle' in chinese, I got url like u'http://zh.wikipedia.org/zh/\xe6\xb1\x89\xe8\xaf\xad'
It's a unicode string, but it holds what look like raw UTF-8 bytes. I tried to encode it back to UTF-8, but the bytes changed too:
a = u'http://zh.wikipedia.org/zh/\xe6\xb1\x89\xe8\xaf\xad'
a.encode('utf-8')
>>> 'http://zh.wikipedia.org/zh/\xc3\xa6\xc2\xb1\xc2\x89\xc3\xa8\xc2\xaf\xc2\xad'
Also I try to use :
str(a)
but I got error :
UnicodeEncodeError: 'ascii' codec can't encode characters in position 27-32: ordinal not in range(128)
How can I encode it to remove the 'u'?
By the way, if there is no 'u' I get the correct result:
s = 'http://zh.wikipedia.org/zh/\xe6\xb1\x89\xe8\xaf\xad'
print s
>>> http://zh.wikipedia.org/zh/汉语
You have a Mojibake; in this case those are UTF-8 bytes decoded as if they were Latin-1 bytes.
To reverse the process, encode to Latin-1 again:
>>> a = u'http://zh.wikipedia.org/zh/\xe6\xb1\x89\xe8\xaf\xad'
>>> a.encode('latin-1')
'http://zh.wikipedia.org/zh/\xe6\xb1\x89\xe8\xaf\xad'
>>> print a.encode('latin-1')
http://zh.wikipedia.org/zh/汉语
The print worked because my terminal is configured to handle UTF-8. You can get a unicode object again by decoding as UTF-8:
>>> a.encode('latin-1').decode('utf8')
u'http://zh.wikipedia.org/zh/\u6c49\u8bed'
The ISO-8859-1 (Latin-1) codec maps the first 256 Unicode code points one-to-one to bytes 0x00-0xFF, which is why the string contents look otherwise unchanged.
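A quick round-trip check (a sketch using Python 2's built-in unichr and chr) illustrates that one-to-one mapping:
>>> all(unichr(i).encode('latin-1') == chr(i) for i in range(256))
True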
You may want to use the ftfy library for jobs like these; it handles a wide variety of text issues, including Windows codepage Mojibake where some resulting 'codepoints' are not legally encodable to the codepage. The ftfy.fix_text() function takes Unicode input and repairs it:
>>> import ftfy
>>> ftfy.fix_text(a)
u'http://zh.wikipedia.org/zh/\u6c49\u8bed'

How can I change unicode to ascii and drop unrecognized characters

My file is in unicode. However, for some reason, I want to change it to plain ascii while dropping any characters that are not recognized in ascii. For example, I want to change u'This is a string�' to just 'This is a string'. Following is the code I use to do so.
ascii_str = unicode_str.encode('ascii', 'ignore')
However, I still get the following annoying error.
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf3 in position 0:
ordinal not in range(128)
How can I solve this problem? I am fine with plain ascii strings.
I assume that your unicode_str is a real unicode string.
>>> u"\xf3".encode("ascii", "ignore")
''
If not, use this:
>>> "\xf3".decode("ascii", "ignore").encode("ascii")
''
The best approach is always to find out which encoding you are dealing with and then decode it, so that you end up with a unicode string in the right form. That means unicode_str should either start out as a real unicode string or be read with the right codec. I assume a file is involved, so the very best would be:
import codecs
f = codecs.open('unicode.rst', encoding='utf-8')
for line in f:
    print repr(line)
Another desperate approach would be:
>>> import string
>>> a = "abc\xf3abc"
>>> "".join(b for b in a if b in string.printable)
'abcabc'
You need to decode it. If you have a file:
with open('example.csv', 'rb') as f:
    csv = f.read().decode("utf-8")
If you want to decode a string, you can do it this way:
data.decode('UTF-8')
UPDATE
You can use ord() to get the ASCII code of every character:
d = u'This is a string'
l = [ord(s) for s in d.encode('ascii', 'ignore')]
print l
If you need to concatenate them back into a string, convert each code back to a character and use join:
print "".join(chr(c) for c in l)
Since you have a replacement character (the symbol found in the Unicode standard at code point U+FFFD, in the Specials block) in your string, you need to mark the literal as unicode for the interpreter by adding a u prefix to the string:
>>> unicode_str=u'This is a string�'
>>> unicode_str.encode('ascii', 'ignore')
'This is a string'

Python Unencode unicode html hexadecimal

Suppose I have strings with lots of stuff like
&#226;&#128;&#156;words words words
Is there a way to convert these through python directly into the characters they represent?
I tried
h = HTMLParser.HTMLParser()
print h.unescape(x)
but got this error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
I also tried
print h.unescape(x).encode('utf-8')
but it encodes
&#226;&#128;&#156; as â
when it should be a quote
&#226;&#128;&#156; form a UTF-8 byte sequence, for the U+201C LEFT DOUBLE QUOTATION MARK character. Something is majorly mucked up there. The correct encoding would have been &#8220;.
You can use the HTML parser to unescape this, but you'll need to repair the resulting Mojibake:
>>> import HTMLParser
>>> h = HTMLParser.HTMLParser()
>>> x = '&#226;&#128;&#156;'
>>> h.unescape(x)
u'\xe2\x80\x9c'
>>> h.unescape(x).encode('latin1')
'\xe2\x80\x9c'
>>> h.unescape(x).encode('latin1').decode('utf8')
u'\u201c'
>>> print h.unescape(x).encode('latin1').decode('utf8')
“
If printing still gives you a UnicodeEncodeError, then your terminal or console is incorrectly configured and Python is inadvertently encoding to ASCII.
The problem is that you cannot decode the unicode result directly; you need to convert it away from unicode to a plain UTF-8 byte string first:
x = "&#226;&#128;&#156;words words words"
h = HTMLParser.HTMLParser()
msg = h.unescape(x)  # this converts it to a unicode string
downcast = "".join(chr(ord(c) & 0xff) for c in msg)  # convert it to a plain byte string (python2)
print downcast.decode("utf8")
there may be a better way to do this in the HTMLParser library ...
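For what it's worth, the latin-1 trick from the earlier answers expresses the same downcast more idiomatically (a sketch; like the loop above, it only makes sense when every code point in msg is below 256):
print h.unescape(x).encode('latin-1').decode('utf8')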

Why does chardet say my UTF-8-encoded string (originally decoded from ISO-8859-1) is ASCII?

I'm trying to convert ascii characters to utf-8. This little example below still returns ascii characters:
chunk = chunk.decode('ISO-8859-1').encode('UTF-8')
print chardet.detect(chunk[0:2000])
It returns:
{'confidence': 1.0, 'encoding': 'ascii'}
How come?
Quoting from Python's documentation:
UTF-8 has several convenient properties:
It can handle any Unicode code point.
A Unicode string is turned into a string of bytes containing no embedded zero bytes. This avoids byte-ordering issues, and means UTF-8 strings can be processed by C functions such as strcpy() and sent through protocols that can’t handle zero bytes.
A string of ASCII text is also a valid UTF-8 text.
All ASCII texts are also valid UTF-8 texts. (UTF-8 is a superset of ASCII)
To make it clear, check out this console session:
>>> s = 'test'
>>> s.encode('ascii') == s.encode('utf-8')
True
>>>
However, not every UTF-8-encoded string is a valid ASCII string:
>>> foreign_string = u"éâô"
>>> foreign_string.encode('utf-8')
'\xc3\xa9\xc3\xa2\xc3\xb4'
>>> foreign_string.encode('ascii') #This won't work, since it's invalid in ASCII encoding
Traceback (most recent call last):
File "<pyshell#9>", line 1, in <module>
foreign_string.encode('ascii')
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
>>>
So chardet is still right. Only if there is a character that is not ASCII would chardet be able to tell that the text is not ASCII-encoded.
Hope this simple explanation helps!
UTF-8 is a superset of ASCII. This means that every valid ASCII file (one that only uses the first 128 characters, not the extended characters) is also a valid UTF-8 file. Since the encoding is not stored explicitly but guessed each time, detection defaults to the simpler character set. However, if you encode anything beyond the basic 128 characters (like foreign text) in UTF-8, it is very likely to guess the encoding as UTF-8.
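To see this in action, a sketch (assuming the chardet package is installed; the guess and confidence for short inputs can vary, so the outputs shown are only indicative):
>>> import chardet
>>> chardet.detect('plain ascii text')
{'confidence': 1.0, 'encoding': 'ascii'}
>>> chardet.detect((u'éâô' * 10).encode('utf-8'))
{'confidence': 0.99, 'encoding': 'utf-8'}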
This is the reason why you got ascii:
https://github.com/erikrose/chardet/blob/master/chardet/universaldetector.py#L135
If all characters in the sequence are ASCII symbols, chardet considers the string's encoding to be ascii.
N.B.
The first 128 characters of Unicode, which correspond one-to-one with ASCII, are encoded using a single octet with the same binary value as ASCII, making valid ASCII text valid UTF-8-encoded Unicode as well.
