I'm using Python 3.6 and I've been trying to figure out a thoroughly unintuitive Unicode/ASCII problem. I am saving the text from a webpage into a file so that I can parse it with a regex later.
When I try to read the file and parse it, I need to find the pattern:
Note 1 –
Which is apparently different from:
Note 1 -
I keep getting the error:
SyntaxError: Non-UTF-8 code starting with '\x96' in file C:\Users\Steve\eclipse-workspace\scraper\BeautifulSoupTest.py on line 28, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
on the regex that I am trying to run. This is really strange, since '\x96' is a Unicode character from what I've seen online. Something is going on with Unicode or ASCII and I have no clue what it is. I can't remove the '\x96' character with replace() either; that gives the same error. Can anyone help?
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re

def downloadCleanText(url, year):
    urlObject = urlopen(url)
    rawHTML = urlObject.read()
    cleanedText = BeautifulSoup(rawHTML, 'html.parser').body.getText()
    outputFile = open(str(year) + '.txt', 'w')
    outputFile.write(cleanedText)
    outputFile.close()

def pullNote1(year):
    inputFile = open(str(year) + '.txt', 'r')
    inData = inputFile.read()
    outData = re.findall('Note 1 –(.*?)Note 2 ', inData)
    print(outData)
    inputFile.close()

downloadCleanText('https://www.sec.gov/Archives/edgar/data/320193/000032019317000070/a10-k20179302017.htm#s2A826F0B8B5755F787D29B5B8C8C7D16', 2000)
pullNote1(2000)
No, 0x96 is not an ASCII codepoint. The ASCII standard defines only 7 bit codepoints, so from 0x00 through to 0x7F. Nor is 0x96 a valid UTF-8 byte sequence.
You most likely have saved your source code as Windows Codepage 1252, where 0x96 is the U+2013 EN DASH codepoint (all codepages between 1250 and 1258 do, but 1252 is the most widely used). So, following the exception message you can make the error go away by adding:
# encoding: cp1252
at the top of the file. Or you could configure your editor to save the file as UTF-8 instead (at which point the byte sequence 0xE2 0x80 0x93 will be written to represent that codepoint).
Alternatively, use only ASCII characters in your source code by using a \uhhhh escape sequence in your string literal:
outData = re.findall('Note 1 \u2013(.*?)Note 2 ', inData)
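Separately, the question's helper functions open the output and input files without an explicit encoding, so the en dash can still be mangled by the platform's default codec on Windows. A minimal sketch of the same two functions with explicit UTF-8 (an assumption, not part of the original answer):

from bs4 import BeautifulSoup
from urllib.request import urlopen
import re

def downloadCleanText(url, year):
    rawHTML = urlopen(url).read()
    cleanedText = BeautifulSoup(rawHTML, 'html.parser').body.getText()
    # Write the scraped text as UTF-8 instead of the platform default codec.
    with open(str(year) + '.txt', 'w', encoding='utf-8') as outputFile:
        outputFile.write(cleanedText)

def pullNote1(year):
    # Read it back with the same explicit encoding.
    with open(str(year) + '.txt', 'r', encoding='utf-8') as inputFile:
        inData = inputFile.read()
    print(re.findall('Note 1 \u2013(.*?)Note 2 ', inData))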
You may want to read up on Unicode and Python, I strongly recommend Ned Batchelder's Pragmatic Unicode.
Related
I'm working on an application that uses UTF-8 encoding. For debugging purposes I need to print the text. If I call print() directly on the variable containing my Unicode string, e.g. print(pred_str), I get this error:
UnicodeEncodeError: 'charmap' codec can't encode character '\ufeff' in position 0: character maps to <undefined>
So I tried print(pred_str.encode('utf-8')) and my output looks like this:
b'\xef\xbb\xbfpudgala-dharma-nair\xc4\x81tmyayo\xe1\xb8\xa5 apratipanna-vipratipann\xc4\x81n\xc4\x81m'
b'avipar\xc4\xabta-pudgala-dharma-nair\xc4\x81tmya-pratip\xc4\x81dana-artham'
b'tri\xe1\xb9\x83\xc5\x9bik\xc4\x81-vij\xc3\xb1apti-prakara\xe1\xb9\x87a-\xc4\x81rambha\xe1\xb8\xa5'
b'pudgala-dharma-nair\xc4\x81tmya-pratip\xc4\x81danam punar kle\xc5\x9ba-j\xc3\xb1eya-\xc4\x81vara\xe1\xb9\x87a-prah\xc4\x81\xe1\xb9\x87a-artham'
But, I want my output to look like this:
pudgala-dharma-nairātmyayoḥ apratipanna-vipratipannānām
aviparīta-pudgala-dharma-nairātmya-pratipādana-artham
triṃśikā-vijñapti-prakaraṇa-ārambhaḥ
pudgala-dharma-nairātmya-pratipādanam punar kleśa-jñeya-āvaraṇa-prahāṇa-artham
If I save my string to a file using:
with codecs.open('out.txt', 'w', 'UTF-8') as f:
    f.write(pred_str)
it saves string as expected.
Your data is encoded with the "UTF-8-SIG" codec, which is sometimes used in Microsoft environments.
This variant of UTF-8 prefixes encoded text with a byte order mark '\xef\xbb\xbf', to make it easier for applications to detect UTF-8 encoded text vs other encodings.
You can decode such bytestrings like this:
>>> bs = b'\xef\xbb\xbfpudgala-dharma-nair\xc4\x81tmyayo\xe1\xb8\xa5 apratipanna-vipratipann\xc4\x81n\xc4\x81m'
>>> text = bs.decode('utf-8-sig')
>>> print(text)
pudgala-dharma-nairātmyayoḥ apratipanna-vipratipannānām
To read such data from a file:
with open('myfile.txt', 'r', encoding='utf-8-sig') as f:
text = f.read()
Note that even after decoding from UTF-8-SIG, you may still be unable to print your data, because your console's default code page may not be able to encode other non-ASCII characters in the data. In that case you will need to adjust your console settings to support UTF-8.
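If you are on Python 3.7 or newer, one way to do that from inside the script rather than through console settings (an assumption about your Python version, not part of the original answer):

import sys

# Ask the standard output stream to emit UTF-8 regardless of the console code page
# (sys.stdout.reconfigure is available since Python 3.7).
sys.stdout.reconfigure(encoding='utf-8')
print(pred_str)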
Try this code:
if pred_str.startswith('\ufeff'):
    pred_str = pred_str.split('\ufeff')[1]
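An equivalent and arguably simpler alternative (my own sketch, not part of the original suggestion) is to strip the BOM directly:

# Remove a leading U+FEFF byte order mark, if present.
pred_str = pred_str.lstrip('\ufeff')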
I'm pulling data out of a Google doc, processing it, and writing it to a file (that eventually I will paste into a Wordpress page).
It has some non-ASCII symbols. How can I convert these safely to symbols that can be used in HTML source?
Currently I'm converting everything to Unicode on the way in, joining it all together in a Python string, then doing:
import codecs
f = codecs.open('out.txt', mode="w", encoding="iso-8859-1")
f.write(all_html.encode("iso-8859-1", "replace"))
There is an encoding error on the last line:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 12286: ordinal not in range(128)
Partial solution:
This Python runs without an error:
row = [unicode(x.strip()) if x is not None else u'' for x in row]
all_html = row[0] + "<br/>" + row[1]
f = open('out.txt', 'w')
f.write(all_html.encode("utf-8"))
But then if I open the actual text file, I see lots of symbols like:
Qur’an
Maybe I need to write to something other than a text file?
Deal exclusively with unicode objects as much as possible by decoding things to unicode objects when you first get them and encoding them as necessary on the way out.
If your string is actually a unicode object, you'll need to encode it to a byte string (for example as UTF-8) before writing it to a file:
foo = u'Δ, Й, ק, م, ๗, あ, 叶, 葉, and 말.'
f = open('test', 'w')
f.write(foo.encode('utf8'))
f.close()
When you read that file again, you'll get a UTF-8 encoded byte string that you can decode back to a unicode object:
f = file('test', 'r')
print f.read().decode('utf8')
In Python 2.6+, you could use io.open() that is default (builtin open()) on Python 3:
import io
with io.open(filename, 'w', encoding=character_encoding) as file:
    file.write(unicode_text)
It might be more convenient if you need to write the text incrementally (you don't need to call unicode_text.encode(character_encoding) multiple times). Unlike the codecs module, the io module has proper universal newline support.
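For example, writing several Unicode lines incrementally (a small sketch; the file name and lines are placeholders):

import io

lines = [u'first line \u2013 with an en dash', u'second line']
with io.open('out.txt', 'w', encoding='utf-8') as f:
    for line in lines:
        f.write(line + u'\n')  # no per-call .encode() needed; io encodes on write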
Unicode string handling is already standardized in Python 3.
Characters are already stored as Unicode code points in memory.
You only need to open the file in UTF-8
(the conversion from in-memory code points to variable-byte-length UTF-8 is performed automatically when writing to the file).
out1 = "(嘉南大圳 ㄐㄧㄚ ㄋㄢˊ ㄉㄚˋ ㄗㄨㄣˋ )"
fobj = open("t1.txt", "w", encoding="utf-8")
fobj.write(out1)
fobj.close()
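Reading it back works the same way; pass the encoding and you get decoded text (a minimal sketch matching the file written above):

fobj = open("t1.txt", "r", encoding="utf-8")
text = fobj.read()   # already a str of Unicode characters
print(text)
fobj.close()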
Preface: will your viewer work?
Make sure your viewer/editor/terminal (however you are interacting with your UTF-8 encoded file) can read the file. This is frequently an issue on Windows, with Notepad for example.
Writing Unicode text to a text file?
In Python 2, use open from the io module (this is the same as the builtin open in Python 3):
import io
Best practice, in general: use UTF-8 for writing to files (with UTF-8 we don't even have to worry about byte order).
encoding = 'utf-8'
utf-8 is the most modern and universally usable encoding - it works in all web browsers, most text-editors (see your settings if you have issues) and most terminals/shells.
On Windows, you might try utf-16le if you're limited to viewing output in Notepad (or another limited viewer).
encoding = 'utf-16le' # sorry, Windows users... :(
And just open it with the context manager and write your unicode characters out:
with io.open(filename, 'w', encoding=encoding) as f:
    f.write(unicode_object)
Example using many Unicode characters
Here's an example that attempts to map every possible character up to three bytes wide (4 bytes is the maximum, but that would be going a bit far) from its integer representation to an encoded printable output, along with its name, if possible (put this into a file called uni.py):
from __future__ import print_function
import io
from unicodedata import name, category
from curses.ascii import controlnames
from collections import Counter

try:  # use these if Python 2
    unicode_chr, range = unichr, xrange
except NameError:  # Python 3
    unicode_chr = chr

exclude_categories = set(('Co', 'Cn'))
counts = Counter()
control_names = dict(enumerate(controlnames))
with io.open('unidata', 'w', encoding='utf-8') as f:
    for x in range((2**8)**3):
        try:
            char = unicode_chr(x)
        except ValueError:
            continue  # can't map to unicode, try next x
        cat = category(char)
        counts.update((cat,))
        if cat in exclude_categories:
            continue  # get rid of noise & greatly shorten result file
        try:
            uname = name(char)
        except ValueError:  # probably control character, don't use actual
            uname = control_names.get(x, '')
            f.write(u'{0:>6x} {1} {2}\n'.format(x, cat, uname))
        else:
            f.write(u'{0:>6x} {1} {2} {3}\n'.format(x, cat, char, uname))

# may as well describe the types we logged.
for cat, count in counts.items():
    print('{0} chars of category, {1}'.format(count, cat))
This should run in the order of about a minute, and you can view the data file, and if your file viewer can display unicode, you'll see it. Information about the categories can be found here. Based on the counts, we can probably improve our results by excluding the Cn and Co categories, which have no symbols associated with them.
$ python uni.py
It will display the hexadecimal codepoint, the category, the symbol (omitted when the name can't be looked up, which usually means a control character), and the name of the symbol.
I recommend less on Unix or Cygwin (don't print/cat the entire file to your output):
$ less unidata
For example, it will display lines similar to the following, which I sampled from the file using Python 2 (Unicode 5.2):
0 Cc NUL
20 Zs SPACE
21 Po ! EXCLAMATION MARK
b6 So ¶ PILCROW SIGN
d0 Lu Ð LATIN CAPITAL LETTER ETH
e59 Nd ๙ THAI DIGIT NINE
2887 So ⢇ BRAILLE PATTERN DOTS-1238
bc13 Lo 밓 HANGUL SYLLABLE MIH
ffeb Sm → HALFWIDTH RIGHTWARDS ARROW
My Python 3.5 from Anaconda has Unicode 8.0; I would presume most Python 3 builds do.
The file opened by codecs.open is a file that takes unicode data, encodes it in iso-8859-1 and writes it to the file. However, what you try to write isn't unicode; you take unicode and encode it in iso-8859-1 yourself. That's what the unicode.encode method does, and the result of encoding a unicode string is a bytestring (a str type.)
You should either use normal open() and encode the unicode yourself, or (usually a better idea) use codecs.open() and not encode the data yourself.
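Both variants side by side (a sketch; all_html is the Unicode string from the question, and 'replace' mirrors the error handling it used):

import codecs

# Variant 1: plain open() in binary mode, encode the text yourself
with open('out.txt', 'wb') as f:
    f.write(all_html.encode('iso-8859-1', 'replace'))

# Variant 2 (usually better): let codecs.open() do the encoding
with codecs.open('out.txt', mode='w', encoding='iso-8859-1', errors='replace') as f:
    f.write(all_html)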
How to print unicode characters into a file:
Save this to file: foo.py:
#!/usr/bin/python -tt
# -*- coding: utf-8 -*-
import codecs
import sys
UTF8Writer = codecs.getwriter('utf8')
sys.stdout = UTF8Writer(sys.stdout)
print(u'e with obfuscation: é')
Run it and pipe output to file:
python foo.py > tmp.txt
Open tmp.txt and look inside, you see this:
el#apollo:~$ cat tmp.txt
e with obfuscation: é
Thus you have saved a Unicode e with an obfuscation mark on it to a file.
That error arises when you try to encode a non-unicode string: it tries to decode it, assuming it's in plain ASCII. There are two possibilities:
You're encoding it to a bytestring, but because you've used codecs.open, the write method expects a unicode object. So you encode it, and it tries to decode it again. Try: f.write(all_html) instead.
all_html is not, in fact, a unicode object. When you do .encode(...), it first tries to decode it.
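You can reproduce the underlying mechanism in a Python 2 session (a minimal illustration, not from the original answer):

>>> s = u'Qur\u2019an'.encode('utf-8')   # s is now a byte string (str)
>>> s.encode('iso-8859-1')               # an implicit ASCII decode happens first
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 3: ordinal not in range(128)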
In case of writing in Python 3:
>>> a = u'bats\u00E0'
>>> print(a)
batsà
>>> f = open("/tmp/test", "w")
>>> f.write(a)
>>> f.close()
>>> data = open("/tmp/test").read()
>>> data
'batsà'
In case of writing in Python 2:
>>> a = u'bats\u00E0'
>>> f = open("/tmp/test", "w")
>>> f.write(a)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 4: ordinal not in range(128)
To avoid this error you have to encode the string to bytes with the "utf-8" codec, like this:
>>> f.write(a.encode("utf-8"))
>>> f.close()
and decode the data with the "utf-8" codec while reading:
>>> data = open("/tmp/test").read()
>>> data.decode("utf-8")
u'bats\xe0'
And if you print the original unicode string a, it is encoded for your terminal automatically, like this:
>>> print a
batsà
I use Django, and in my view I need to send a request as XML containing some Unicode characters received from an HTML page via the POST method. I tried these (note that I save that input in the fname variable):
xml = r"""my XML code with unicode {0} """.format(fname)
And
fname = u"%s".encode('utf8') % (fname)
xml = r"""my XML code with unicode {0} """.format(fname)
And
fname = fname.encode('ascii', 'ignore').decode('ascii')
xml = r"""my XML code with unicode {0} """.format(fname)
And every time I got this error:
'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
You could reproduce the error with this code:
>>> "{0}".format(u"\U0001F384"*4)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
To fix this particular error, just use Unicode format string:
>>> u"{0}".format(u"\U0001F384"*4)
u'\U0001f384\U0001f384\U0001f384\U0001f384'
You could use the xml.etree.ElementTree module to build your XML document instead of string formatting. XML is a complex format; it is easy to get wrong. ElementTree will also serialize your Unicode string into bytes correctly, making sure that the character encoding in the XML declaration is consistent with the actual encoding used in the document.
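A minimal sketch of that approach (the element names here are invented for illustration; only fname comes from the question):

import xml.etree.ElementTree as ET

root = ET.Element('request')          # hypothetical root element
child = ET.SubElement(root, 'fname')
child.text = fname                    # Unicode text is fine here

# tostring() escapes special characters and encodes to bytes; the XML
# declaration it emits matches the encoding actually used.
xml_bytes = ET.tostring(root, encoding='utf-8')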
xml = r"""my XML code with unicode {0} """.format(fname)
The .format method always produces the same string type as the format string itself. In this case your format string is a byte string, r"""...""", so if fname is a Unicode string, Python tries to force it into being a byte string. If fname contains characters that do not exist in the default encoding (ASCII), then bang.
Note that this differs from the old string formatting operator %, which promotes the result to a Unicode string when either the format string or any of the arguments is Unicode; that would work in this case as long as the XML code itself was ASCII-compatible. This is a common problem when converting code that uses % over to .format().
This should work fine:
xml = ur"""my XML code with unicode {0} """.format(fname)
However the output will be a Unicode string so whatever you do next needs to cope with that (for example if you are writing it to a byte stream/file, you would probably want to .encode('utf-8') the whole thing). Alternatively encode it in place to get a byte string:
xml = r"""my XML code with unicode {0} """.format(fname.encode('utf-8'))
Note that this above:
fname = u"%s".encode('utf8') % (fname)
does not work because you are encoding the format string to bytes, not the fname argument. This is identical to saying just fname = '%s' % fname, which is effectively fname = fname.
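A quick Python 2 session makes that visible (a minimal illustration, with a made-up fname):

>>> fname = u'caf\u00e9'
>>> u"%s".encode('utf8')             # only the literal '%s' gets encoded
'%s'
>>> u"%s".encode('utf8') % (fname,)  # which is then just '%s' % fname again
u'caf\xe9'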
I solved that with this code:
fname = fname.encode('ascii', 'xmlcharrefreplace')
This smells bad. For input hello ☃, you are now generating hello &#9731; instead of the normal output hello ☃.
If both &#9731; and ☃ look the same to you in the output, then probably you are doing something like this:
xml = '<element>{0}</element>'.format(some_text)
which is broken for XML-special characters like & and <. When you are generating XML you should take care to escape special characters (& < > " ' to &amp;, &lt;, etc.), otherwise at best your output will break for these characters; at worst, when some_text includes user input, you have an XML-injection vulnerability which may break the logic of your system in a security-sensitive way.
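If you do build XML by hand, at minimum escape the text first. A small sketch using the standard library (some_text stands in for whatever you are interpolating):

from xml.sax.saxutils import escape

# escape() converts &, < and > to &amp;, &lt; and &gt;;
# pass an extra map if you also need to escape quotes for attribute values.
xml = u'<element>{0}</element>'.format(escape(some_text))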
As J F Sebastian said (+1), it's a good idea to use existing known-good XML serialisation libraries like etree instead of trying to roll your own.
You could do something like:
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
but it's an ugly hack that only works in Python 2.
or
fname.encode('GB18030').decode('utf-8')
It will bypass the error but may still look messy. If you're posting to an HTML file, then set the charset to UTF-8.
I Solved that with this code:
fname = fname.encode('ascii', 'xmlcharrefreplace')
xml = r"""my XML code with unicode {0} """.format(fname)
Thank you for your help.
Update :
And you can remove or replace special characters like >, & and < with this (thanks to #bobince for noticing this):
fname = fname.replace("<", "")
fname = fname.replace(">", "")
fname = fname.replace("&", "")
I'm trying to use string.replace('’','') to replace the dreaded weird single-quote character: ’ (aka \xe2, aka &#8217;). But when I run that line of code, I get this error:
SyntaxError: Non-ASCII character '\xe2' in file
EDIT: I get this error when trying to replace characters in a CSV file obtained remotely.
# encoding: utf-8
import urllib2
# read raw CSV data from URL
url = urllib2.urlopen('http://www.aaphoenix.org/meetings/aa_meetings.csv')
raw = url.read()
# replace bad characters
raw = raw.replace('’', "")
print(raw)
Even after the above code is executed, the unwanted character still exists in the print result. I tried the suggestions in the answers below as well. I'm pretty sure it's an encoding issue, but I just don't know how to fix it, so of course any help is much appreciated.
The problem here is with the encoding of the file you downloaded (aa_meetings.csv). The server doesn't declare an encoding in its HTTP headers, but the only non-ASCII¹ octet in the file has the value 0x92. You say that this is supposed to be "the dreaded weird single-quote character", therefore the file's encoding is windows-1252. But you're trying to search and replace for the UTF-8 encoding of U+2019, i.e. '\xe2\x80\x99', which is not what is in the file.
Fixing this is as simple as adding appropriate calls to encode and decode:
# encoding: utf-8
import urllib2
# read raw CSV data from URL
url = urllib2.urlopen('http://www.aaphoenix.org/meetings/aa_meetings.csv')
raw = url.read().decode('windows-1252')
# replace bad characters
raw = raw.replace(u'’', u"'")
print(raw.encode("ascii"))
1 by "ASCII" I mean "the character encoding which maps single octets with values 0x00 through 0x7F directly to U+0000 through U+007F, and does not define the meaning of octets with values 0x80 through 0xFF".
You have to declare the encoding of your source file.
Put this as one of the first two lines of your code:
# encoding: utf-8
If you are using an encoding other than UTF-8 (for example Latin-1), you have to put that instead.
This file is encoded in Windows-1252. The apostrophe U+2019 encodes to \x92 in this encoding. The proper thing is to decode the file to Unicode for processing:
data = open('aa_meetings.csv').read()
assert '\x92' in data
chars = data.decode('cp1252')
assert u'\u2019' in chars
fixed = chars.replace(u'\u2019', '')
assert u'\u2019' not in fixed
The problem was you were searching for a UTF-8 encoded U+2019, i.e. \xe2\x80\x99, which was not in the file. Converting to Unicode solves this.
Using unicode literals as I have here is an easy way to avoid this mistake. However, you can encode the character directly if you write it as u'’':
Python 2.7.1
>>> u'’'
u'\u2019'
>>> '’'
'\xe2\x80\x99'
You can do string.replace('\xe2', "'") to replace them with the normal single-quote.
I was getting such Non-ASCII character '\xe2' errors repeatedly with my Python scripts, despite replacing the single-quotes. It turns out the non-ASCII character really was a double en dash (−−). I replaced it with a regular double dash (--) and that fixed it. [Both will look the same on most screens. Depending on your font settings, the problematic one might look a bit longer.]
For anyone encountering the same issue in their Python scripts (in their lines of code, not in data loaded by your script):
Option 1: get rid of the problematic character
Re-type the line by hand. (To make sure you did not copy-paste the problematic character by mistake.)
Note that commenting the line out will not work.
Check whether the problematic character really is the one you think.
Option 2: change the encoding
Declare an encoding at the beginning of the script, as Roberto pointed out:
# encoding: utf-8
Hope this helps someone.