My file is in Unicode. However, for some reason, I want to change it to plain ASCII, dropping any characters that ASCII does not recognize. For example, I want to change u'This is a string�' to just 'This is a string'. Following is the code I use to do so:
ascii_str = unicode_str.encode('ascii', 'ignore')
However, I still get the following annoying error.
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf3 in position 0:
ordinal not in range(128)
How can I solve this problem? I am fine with plain ascii strings.
I assume that your unicode_str is a real unicode string.
>>> u"\xf3".encode("ascii", "ignore")
''
If not, use this:
>>> "\xf3".decode("ascii", "ignore").encode("ascii")
''
The best approach is always to find out which encoding you are dealing with and then decode it, so that you have a unicode string in the right format. That means unicode_str should either start out as a real unicode string or be read with the right codec. I assume the data comes from a file, so the very best would be:
import codecs
f = codecs.open('unicode.rst', encoding='utf-8')
for line in f:
    print repr(line)
Another desperate approach would be:
>>> import string
>>> a = "abc\xf3abc"
>>> "".join(b for b in a if b in string.printable)
'abcabc'
You need to decode it. If you have a file:
with open('example.csv', 'rb') as f:
    csv = f.read().decode("utf-8")
If you want to decode a string, you can do it this way:
data.decode('UTF-8')
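For example, here is a minimal sketch in Python 2, assuming data is a UTF-8 encoded byte string (the \xc3\xb3 bytes are a made-up example; they are the UTF-8 encoding of 'ó'):
data = 'This is a string\xc3\xb3'            # hypothetical UTF-8 bytes
text = data.decode('utf-8')                  # u'This is a string\xf3'
ascii_str = text.encode('ascii', 'ignore')   # 'This is a string'
print ascii_str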
UPDATE
You can use ord() to get the ASCII code of every character:
d=u'This is a string'
l=[ord(s) for s in d.encode('ascii', 'ignore')]
print l
If you need to concatenate them back into a string, convert the codes back to characters with chr() and use join:
print "".join(chr(i) for i in l)
As you have a replacement character (the symbol found in the Unicode standard at code point U+FFFD, in the Specials block) in your string, you need to tell the interpreter that this is a unicode string before encoding, by adding a u prefix to your string:
>>> unicode_str=u'This is a string�'
>>> unicode_str.encode('ascii', 'ignore')
'This is a string'
I have a txt file which contains a line:
' 6: "\\351\\231\\220\\346\\227\\266\\345\\205\\215\\350\\264\\271"'
The content in the double quotes is actually octal-escaped bytes, but with doubled backslashes.
After the line has been read in, I used regex to extract the contents in the double quotes.
c = re.search(r': "(.+)"', line).group(1)
After that, I have two problems:
First, I need to replace the doubled backslashes with single ones.
Second, I need to tell Python that the str object c is actually a bytes object.
I have managed to do neither.
I have tried:
re.sub('\\', '\', line)
re.sub(r'\\', '\', line)
re.sub(r'\\', r'\', line)
All failed.
A bytes object can easily be defined with a b prefix:
c = b'\351\231\220\346\227\266\345\205\215\350\264\271'
How can I change the type of a string to bytes? I don't think this is an encode-and-decode thing.
I googled a lot, but found no answers. Maybe I used the wrong keywords.
Does anyone know how to do these things, or another way to get what I want?
This is always a little confusing. I assume your bytes object should represent a string like:
b = b'\351\231\220\346\227\266\345\205\215\350\264\271'
b.decode()
# '限时免费'
To get that with your escaped string, you could use the codecs library and try:
import re
import codecs
line = ' 6: "\\351\\231\\220\\346\\227\\266\\345\\205\\215\\350\\264\\271"'
c = re.search(r': "(.+)"', line).group(1)
codecs.escape_decode(bytes(c, "utf-8"))[0].decode("utf-8")
# '限时免费'
giving the same result.
The string contains literal text for escape codes. You cannot just replace the literal backslashes with a single backslash as escape codes are used in source code to indicate a single character. Decoding is needed to change literal escape codes to the actual character, but only byte strings can be decoded.
Encoding a Unicode string to a byte string with the Latin-1 codec translates Unicode code points 1:1 to the corresponding byte, so it is the common way to directly convert a "byte-string-like" Unicode string to an actual byte string.
Step-by-Step:
>>> s = "\\351\\231\\220\\346\\227\\266\\345\\205\\215\\350\\264\\271"
>>> print(s) # Actual text of the string
\351\231\220\346\227\266\345\205\215\350\264\271
>>> s.encode('latin1') # Convert to byte string
b'\\351\\231\\220\\346\\227\\266\\345\\205\\215\\350\\264\\271'
>>> # decode the escape codes...result is latin-1 characters in Unicode
>>> s.encode('latin1').decode('unicode-escape')
'é\x99\x90æ\x97¶å\x85\x8dè´¹'
>>> # convert back to a byte string
>>> s.encode('latin1').decode('unicode-escape').encode('latin1')
b'\xe9\x99\x90\xe6\x97\xb6\xe5\x85\x8d\xe8\xb4\xb9'
>>> # data is UTF-8-encoded text so decode it correctly now
>>> s.encode('latin1').decode('unicode-escape').encode('latin1').decode('utf8')
'限时免费'
Your text example looks like part of a Python dictionary. You may be able to save some steps by using the ast module's literal_eval function to turn the dictionary directly into a Python object, and then just fix the mis-decoded value:
>>> # Python dictionary-like text
d='{6: "\\351\\231\\220\\346\\227\\266\\345\\205\\215\\350\\264\\271"}'
>>> import ast
>>> ast.literal_eval(d) # returns Python dictionary with value already decoded
{6: 'é\x99\x90æ\x97¶å\x85\x8dè´¹'}
>>> ast.literal_eval(d)[6] # but decoded incorrectly as Latin-1 text.
'é\x99\x90æ\x97¶å\x85\x8dè´¹'
>>> ast.literal_eval(d)[6].encode('latin1').decode('utf8') # undo Latin1, decode as UTF-8
'限时免费'
I have this code fragment (Python 2.7):
from bs4 import BeautifulSoup
content = '\xa0 foo bar';
soup = BeautifulSoup(content, 'html.parser')
w = soup.get_text()
At this point w has a byte with value 160 in it, but its encoding is ASCII.
How do I replace all of the \xa0 bytes with another character?
I've tried:
w = w.replace(chr(160), ' ')
w = w.replace('\xa0', ' ')
but I am getting the error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)
And why does BS return an ASCII encoded string with an invalid character in it?
Is there a way to convert w to a 'latin1' encoded string?
At this point w has a byte with value 160 in it, but its encoding is 'ascii'.
You have a unicode string:
>>> w
u'\xa0 foo bar'
>>> type(w)
<type 'unicode'>
How do I replace all of the \xa0 bytes with another character?
>>> x = w.replace(u'\xa0', ' ')
>>> x
u' foo bar'
And why does BS return an 'ascii' encoded string with an invalid character in it?
As mentioned above, it is not an ascii encoded string, but a unicode string instance.
Is there a way to convert w to a 'latin1' encoded string?
Sure:
>>> w.encode('latin1')
'\xa0 foo bar'
(Note this last string is an encoded string, not a unicode object, and its representation is not prefixed with u like the previous unicode objects.)
Notes (edited):
If you are typing strings into your source files, note that encoding of source files matters. Python will assume your source files are ASCII. The command line interpreter, on the other hand, will assume you are entering strings in your default system encoding. Of course you can override all this.
Avoid latin1; use UTF-8 if possible, i.e. w.encode('utf8').
When encoding and decoding you can tell Python to ignore errors, or to replace characters that cannot be encoded with some marker character. I don't recommend ignoring encoding errors (at least not without logging them), except for the hopefully rare cases when you know there are encoding errors, or when you need to encode text into a more reduced character set and must replace the code points that cannot be represented (i.e. if you need to encode 'España' into ASCII, you definitely should replace the 'ñ'). But for those cases there are, imho, better alternatives, and you should look into the magical unicodedata module (see https://stackoverflow.com/a/1207479/401656 and the sketch after these notes).
There is a Python Unicode HOWTO: https://docs.python.org/2/howto/unicode.html
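For instance, here is a minimal sketch of the unicodedata approach from that link, assuming the goal is an ASCII transliteration of u'España':
# -*- coding: utf-8 -*-
import unicodedata

s = u'España'
# decompose 'ñ' into 'n' plus a combining tilde, then drop what ASCII can't encode
print unicodedata.normalize('NFKD', s).encode('ascii', 'ignore')   # 'Espana'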
I am trying to remove double quotes from a text file in Python. The statement print re.sub(r'"', '', line) works in the interpreter, but not when I use it in a file. Why would this be?
from the interpreter directly:
>>>
>>> import re
>>> str = "bill"
>>> print re.sub(r'"', '', str)
bill
>>>
from my .py file:
import re

def remove_quotes(filename):
    with open(filename, 'rU') as file:
        print re.sub(r'"', '', file.read())
output:
“Bill”
“pretty good” bastante bien
“friendship” amistad
“teenager” adolescent
OK, as col6y pointed out, I am dealing with fancy L/R quotes. Trying to get rid of them:
>>> line
'\xe2\x80\x9cBill\xe2\x80\x9d\n'
text = line.replace(u'\xe2\x80\x9c', '')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)
Tried another character encoding:
text = line.replace(u"\u201c", '')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)
In your interpreter example, you say:
>>>
>>> import re
>>> str = "bill"
>>> print re.sub(r'"', '', str)
bill
>>>
However, the string "bill" does not contain any quotes, so this doesn't test anything. If you try print str, you'll see it never had quotes in the first place - this is because the quotes mark that str is a string, and are therefore not included. (You wouldn't always want quotes in your strings.) If you wanted to include quotes, you could say "\"bill\"" or '"bill"'.
However, this doesn't explain the real issue in your other program. To understand that, note the difference between “, ”, and ". They look similar, but they're slightly different, and are definitely different to the computer. In your file, you have “ and ”, but you are replacing the ". You'll want to replace the other two as well.
Also, as #MikeT pointed out, it would be easier to use file.read().replace(...) instead of re.sub(..., file.read()). re.sub is for regular expressions, but you don't need their power here.
You should also note that file.read() reads the entire file into memory at once; for large files, consider iterating over the file line by line instead.
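Putting that together, here is a sketch of remove_quotes that also strips the curly quotes, assuming the file is UTF-8 encoded:
# -*- coding: utf-8 -*-
import codecs

def remove_quotes(filename):
    # decode as UTF-8 so each curly quote is a single unicode character
    with codecs.open(filename, encoding='utf-8') as f:
        text = f.read()
    # strip straight quotes plus the left/right curly quotes
    for q in (u'"', u'\u201c', u'\u201d'):
        text = text.replace(q, u'')
    print text.encode('utf-8')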
Suppose I have strings with lots of stuff like
â&#128;&#156;words words words
Is there a way to convert these through python directly into the characters they represent?
I tried
h = HTMLParser.HTMLParser()
print h.unescape(x)
but got this error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
I also tried
print h.unescape(x).encode('utf-8')
but it encodes
â&#128;&#156; as â
when it should be a quote
â&#128;&#156; form a UTF-8 byte sequence, for the U+201C LEFT DOUBLE QUOTATION MARK character. Something is majorly mucked up there. The correct encoding would have been &#8220;.
You can use the HTML parser to unescape this, but you'll need to repair the resulting Mojibake:
>>> import HTMLParser
>>> h = HTMLParser.HTMLParser()
>>> x = u'â&#128;&#156;'
>>> h.unescape(x)
u'\xe2\x80\x9c'
>>> h.unescape(x).encode('latin1')
'\xe2\x80\x9c'
>>> h.unescape(x).encode('latin1').decode('utf8')
u'\u201c'
>>> print h.unescape(x).encode('latin1').decode('utf8')
“
If printing still gives you a UnicodeEncodeError, then your terminal or console is incorrectly configured and Python is inadvertently encoding to ASCII.
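A quick way to check is to inspect what encoding Python thinks your terminal uses; this is a diagnostic hint, not a definitive test:
import sys
# If this prints None or 'ANSI_X3.4-1968', Python falls back to ASCII when
# printing unicode strings, which produces the UnicodeEncodeError above.
print sys.stdout.encoding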
The problem is that you cannot decode the unicode string properly ... you need to convert it away from unicode to plain bytes first, then decode those as UTF-8:
x="âwords words words"
h = HTMLParser.HTMLParser()
msg=h.unescape(x) #this converts it to unicode string ..
downcast = "".join(chr(ord(c)&0xff) for c in msg) #convert it to normal string (python2)
print downcast.decode("utf8")
There may be a better way to do this in the HTMLParser library ...
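For what it's worth, when every code point in msg is below 256 (as here), the downcast loop does the same thing as encoding with Latin-1, the approach the previous answer used. A sketch with a made-up, entity-escaped input:
import HTMLParser

h = HTMLParser.HTMLParser()
x = '&#226;&#128;&#156;words'                # hypothetical input: entities for 0xE2 0x80 0x9C
msg = h.unescape(x)                          # u'\xe2\x80\x9cwords'
# latin1 maps code points 0-255 straight to bytes, like chr(ord(c) & 0xff)
print msg.encode('latin1').decode('utf8')    # prints: “words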
I have a definition that builds a string composed of UTF-8 encoded characters. The output files are opened using 'w+', "utf-8" arguments.
However, when I try to x.write(string) I get the UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 1: ordinal not in range(128)
I assume this is because normally, for example, you would do print(u'something'). But I need to use a variable, and the quotation marks in u'...' negate that...
Any suggestions?
EDIT: Actual code here:
source = codecs.open("actionbreak/" + target + '.csv','r', "utf-8")
outTarget = codecs.open("actionbreak/" + newTarget, 'w+', "utf-8")
x = str(actionT(splitList[0], splitList[1]))
outTarget.write(x)
Essentially all this is supposed to be doing is building me a large amount of strings that look similar to this:
[日木曜 Deliverables]= CASE WHEN things = 11
THEN C ELSE 0 END
Are you using codecs.open()? Python 2.7's built-in open() does not support a specific encoding, meaning you have to manually encode non-ascii strings (as others have noted), but codecs.open() does support that and would probably be easier to drop in than manually encoding all the strings.
As you are actually using codecs.open(), going by your added code, and after a bit of looking things up myself, I suggest opening the input and/or output file with encoding "utf-8-sig", which automatically handles the BOM for UTF-8 (see http://docs.python.org/2/library/codecs.html#encodings-and-unicode, near the bottom of the section). I would think that would only matter for the input file, but if none of those combinations (utf-8-sig/utf-8, utf-8/utf-8-sig, utf-8-sig/utf-8-sig) works, then the most likely situation is that your input file is encoded in a different Unicode format with a BOM, since Python's default UTF-8 codec interprets BOMs as regular characters: the input would not raise an issue, but the output could.
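A minimal sketch of that suggestion, with made-up file names standing in for the ones built from target and newTarget:
import codecs

# "utf-8-sig" strips a leading BOM (u'\ufeff') when reading and writes one
# when writing; plain "utf-8" passes a BOM through as an ordinary character.
source = codecs.open('actionbreak/input.csv', 'r', encoding='utf-8-sig')    # hypothetical names
outTarget = codecs.open('actionbreak/output.csv', 'w+', encoding='utf-8')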
Just noticed this, but... when you use codecs.open(), it expects a Unicode string, not an encoded one; try x = unicode(actionT(splitList[0], splitList[1])).
Your error can also occur when attempting to decode a unicode string (see http://wiki.python.org/moin/UnicodeEncodeError), but I don't think that should be happening unless actionT() or your list-splitting does something to the Unicode strings that causes them to be treated as non-Unicode strings.
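That pitfall is easy to demonstrate: in Python 2, calling .decode() on a unicode string first encodes it with the ASCII codec implicitly, so you get an encode error even though you asked for a decode:
>>> u = u'\ufeffabc'
>>> u.decode('utf-8')
UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 0:
ordinal not in range(128)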
In Python 2.x there are two types of string: byte strings and unicode strings. The first contains bytes, and the second contains unicode code points. It is easy to determine which type of string you have: a unicode string's repr starts with u:
# byte string
>>> 'abc'
'abc'
# unicode string:
>>> u'abc абв'
u'abc \u0430\u0431\u0432'
The 'abc' characters are the same, because they are in the ASCII range. \u0430 is a unicode code point; it is outside the ASCII range. A "code point" is Python's internal representation of a unicode character, and code points can't be saved to a file directly; they need to be encoded to bytes first. Here is what an encoded unicode string looks like (as soon as it is encoded, it becomes a byte string):
>>> s = u'abc абв'
>>> s.encode('utf8')
'abc \xd0\xb0\xd0\xb1\xd0\xb2'
This encoded string now can be written to file:
>>> s = u'abc абв'
>>> with open('text.txt', 'w+') as f:
...     f.write(s.encode('utf8'))
Now, it is important to remember what encoding we used when writing to the file, because to read the data back we need to decode the content. Here is what the data looks like without decoding:
>>> with open('text.txt', 'r') as f:
...     content = f.read()
>>> content
'abc \xd0\xb0\xd0\xb1\xd0\xb2'
You see, we've got encoded bytes, exactly the same as in s.encode('utf8'). To decode, we need to provide the codec name:
>>> content.decode('utf8')
u'abc \u0430\u0431\u0432'
After decoding, we've got our unicode string with unicode code points back.
>>> print content.decode('utf8')
abc абв
xgord is right, but for further edification it's worth noting exactly what \ufeff means. It's known as a BOM, or byte order mark, and it's basically a callback to the early days of unicode, when people couldn't agree which way they wanted their bytes to go. A unicode document may now be prefaced with a BOM; the character is always \ufeff, and if the document is read with the wrong byte order it shows up as \ufffe.
If you hit an error on that character in the first position, you can be sure the issue is that you are not decoding the file as UTF-8 (or as "utf-8-sig", which strips the BOM), and the file itself is probably fine.
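A small demonstration of the difference, assuming a UTF-8 file with a BOM: the "utf-8-sig" codec strips the leading BOM, while plain "utf-8" keeps it as u'\ufeff'.
>>> import codecs
>>> data = codecs.BOM_UTF8 + 'hello'
>>> data.decode('utf-8')
u'\ufeffhello'
>>> data.decode('utf-8-sig')
u'hello'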