Python encoding error when searching a string

I get the following error while trying to search the string below
ERROR:
SyntaxError: Non-ASCII character '\xd8' in file Hadith_scraper.py on line 44, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
STRING:
دَّثَنَا عَبْدَانُ، قَالَ أَخْبَرَنَا عَبْ
CODE:
arabic_hadith = "دَّثَنَا عَبْدَانُ، قَالَ أَخْبَرَنَا عَبْ"
arabic_hadith.encode('utf8')
print arabic_hadith
if "الجمعة" in arabic_hadith:
    day = "5"
else:
    day = ""

You have a byte string, not a unicode value. Trying to encode a byte string in Python 2 means that Python will first try to decode it to unicode so that it can then encode.
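You can see that implicit decode step at the interactive prompt; calling encode on non-ASCII bytes fails with a *decode* error:
>>> '\xd8\xac'.encode('utf8')
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd8 in position 0: ordinal not in range(128)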
Use unicode values here instead, and make sure you set the codec at the top of the file first. See PEP 263 - Defining Python Source Code Encodings (which your error message pointed you to).
Note that there is no need to encode to UTF8 here, that'll only complicate text comparisons:
# encoding: utf8
arabic_hadith = u"دَّثَنَا عَبْدَانُ، قَالَ أَخْبَرَنَا عَبْ"
print arabic_hadith
if u"الجمعة" in arabic_hadith:
    day = "5"
else:
    day = ""
Rule of thumb: decode bytes from incoming sources (files, network data) to Unicode, process only Unicode in your program, and only encode again for any outgoing data.
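A minimal sketch of that rule in Python 2; the byte string here is a stand-in for any incoming data, such as a file read or a network response:
# -*- coding: utf-8 -*-
raw_bytes = '\xd8\xac\xd9\x85\xd8\xb9\xd8\xa9'  # incoming UTF-8 bytes

text = raw_bytes.decode('utf-8')     # 1. decode incoming bytes to unicode
found = u"جمعة" in text              # 2. process only unicode in between
outgoing = text.encode('utf-8')      # 3. encode again for outgoing data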
I urge you to read:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
The Python Unicode HOWTO
Pragmatic Unicode by Ned Batchelder
before you continue.


How to write a unicode object into a file in Python?

I try to write a "string" to a file and get the following error message:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xcd' in position 6: ordinal not in range(128)
I tried the following methods:
print >>f, txt
print >>f, txt.decode('utf-8')
print >>f, txt.encode('utf-8')
None of them work. I have the same error message.
What is the idea behind encoding and decoding? If I have a unicode object, can I write it to the file directly, or do I need to transform it to a string first?
How can I find out what encoding is used? How can I know whether it is utf-8 or ascii or something else?
ADDED
I think I have just managed to save a string into a file. print >>f, txt as well as print >>f, txt.decode('utf-8') did not work but print >>f, txt.encode('utf-8') works. I get no error message and I see Chinese characters in my file.
I recently posted another answer that addresses this very issue. Key quote:
For a good overview of the difference, read one of Joel's articles, but the gist is that bytes are, well, bytes (groups of 8 bits without any further meaning attached), whereas characters are the things that make up strings of text. Encoding turns characters into bytes, and decoding turns bytes back into characters.
In Python 2, unicode objects are character strings. Regular str objects can be either character strings or byte strings. (Pro tip: use Python 3, it makes keeping track a lot easier.)
You should be passing character strings (not byte strings) to print, but you will need to be sure that those character strings can be encoded by the codec (such as ASCII or UTF-8) associated with the destination file object f. As part of the output process, Python encodes the string for you. If the string contains characters that cannot be encoded by the file object's codec, you will get errors like the one you're seeing.
Without knowing what is in your txt object I can't be more specific.
I think you need to use the codecs library:
import codecs
f = codecs.open("test.txt", "w", "utf-8")  # the file object encodes for you
f.write(u'\xcd')
f.close()
Works fine.
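An equivalent sketch using io.open, which is the same file machinery Python 3 uses (available since Python 2.6); the with block also closes the file for you:
import io
with io.open("test.txt", "w", encoding="utf-8") as f:
    f.write(u'\xcd')    # encoded to UTF-8 by the file object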
The Story of Encoding/Decoding:
In the early days there were only about a hundred characters available in computers (upper-case and lower-case letters + numbers + some special characters), so one byte was enough to assign a unique number to each one. Assigning numbers to characters for storage in memory is called encoding. The one-byte encoding that Python 2 uses by default is named ASCII.
As computers spread across the world, we needed many more letters and characters, and one byte was no longer enough, so different encoding schemes appeared. Unicode is the famous one (strictly speaking, Unicode is a character set; encodings such as UTF-8 define how its characters are stored as bytes). The character you are trying to store in your file is a non-ASCII character that needs 2 bytes in UTF-8, so you must explicitly tell Python that you don't want to use the default ASCII encoding.
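An interactive illustration of that point (Python 2): the character Í fits in one byte in some encodings, needs two bytes in UTF-8, and cannot be encoded as ASCII at all:
>>> u'\xcd'.encode('latin1')    # Í, code point 205, one byte in Latin-1
'\xcd'
>>> u'\xcd'.encode('utf-8')     # two bytes in UTF-8
'\xc3\x8d'
>>> u'\xcd'.encode('ascii')     # the default codec cannot represent it
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'ascii' codec can't encode character u'\xcd' in position 0: ordinal not in range(128)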

Python: difficulty converting ascii to unicode

My goal: get the page source from a url and count all instances of a keyword within that page source
How I am doing it: getting the pagesource via urllib2, looping through each char of the page source and comparing it to the keyword
My problem: my keyword is encoded in utf-8 while the page source is in ascii... I am running into errors whenever I try conversions.
getting the page source:
import urllib2
response = urllib2.urlopen(myUrl)
return response.read()
comparing page source and keyword:
pageSource[i] == keyWord[j]
I need to convert one of these strings to the other's encoding. Intuitively I felt that ascii (the page source) to utf-8 (the key word) would be the best and easiest, so:
pageSource = unicode(pageSource)
UnicodeDecodeError: 'ascii' codec can't decode byte __ in position __: ordinal not in range(128)
When trying to work with text, don't leave your data as byte strings. Decode to Unicode early, encode back to bytes as late as possible.
Decode your downloaded network data:
import urllib2
response = urllib2.urlopen(myUrl)
# Latin-1 is the default for HTTP text/ responses, adjust as needed
codec = response.info().getparam('charset') or 'latin1'
return response.read().decode(codec)
and do the same for your keyWord data. If it is encoded as UTF-8, decode it as such, or use Unicode string literals.
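Put together, a sketch of the full count; myUrl and keyWord come from the question, and using unicode's count method is my substitution for the character-by-character comparison loop:
import urllib2

def count_keyword(my_url, keyword):
    # keyword must already be a unicode value, e.g. u"..."
    response = urllib2.urlopen(my_url)
    codec = response.info().getparam('charset') or 'latin1'
    page_source = response.read().decode(codec)
    return page_source.count(keyword)    # counts non-overlapping occurrences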
You may want to read up on Python and Unicode:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
Pragmatic Unicode by Ned Batchelder
The Python Unicode HOWTO
I'll assume your remote "source page" contains more than just ASCII; otherwise your comparison would already work as is (ASCII is a subset of UTF-8: 'A' is 0x41 in both).
You may find the Python Requests library easier, as it will automatically decode remote content to Unicode strings based on the server's headers (Unicode strings are encoding-neutral, so they can be compared without worrying about encoding).
import requests
resp = requests.get("http://www.example.com/utf8page.html")
resp.text
>> u'My unicode data €'
You will then need to decode your reference data:
keyWord[j] = "€".decode("UTF-8")
keyWord[j]
>> u'€'
If you're embedding non-ASCII in your source code, you need to define the encoding you're using. For example, at the top of your source code/script:
# coding=UTF-8

Python: increment special character Í

I want to read some words from an Excel file and extract some information.
Reading the file is no problem.
The point is that I want to increment the last character of a word. That is no problem for normal characters like 'A', but special characters like 'Í' are a problem.
I read the content with this:
val = val.encode('utf-8')
I put this value in a dictionary.
The next step is to iterate through the dict and get the saved information. info['streettype'] contains my val from before. Now I convert the value to upper case like this:
w2 = info['streettype'].decode('utf-8').upper().encode('utf-8')
That is needed because some characters are special, like I said (e.g. 'é', 'ž', 'í').
Now I want to increment the last character of the word, which can be a special character.
w3 = w2.decode('utf-8')[:-1].encode('utf-8')
lastLetter = w2.decode('utf-8')[-1].encode('utf-8')
Now I increment the character by using:
lastLetter2 = (chr(ord(lastLetter.decode('utf-8')) + 1))
Next I want to save it in a text file.
I want to save the original word and the edited word.
I think I need to reencode my lastLetter2, but it does not work.
When I just save my w2 and w3+lastLetter2 I get strange results because some are encoded, some are not.
For the word:
NÁBŘEŽÍ
my Result is:
"NÃBŘEŽÃ", "NÃBŘEŽÎÃ"
but I want:
"NÁBŘEŽÍ", "NÁBŘEŽÎ"
(Í is ascii 205, Î is ascii 206)
Can someone help me to solve this problem?
Stop encoding your data to UTF-8 all the time; keep your text as Unicode, it makes processing much easier. Leave encoding to the last minute, preferably by having the file object encode this for you.
Having the file encode Unicode means that in Python 2 you'd use io.open() rather than the standard built-in open() function; this is the same infrastructure Python 3 uses to handle Unicode and file I/O.
You managed to create a Mojibake by encoding and decoding at will here; your text is now a mix of UTF-8 data decoded with Windows codepage 1252 then encoded to UTF8 again, plus non-mangled data:
>>> print u'NÃBŘEŽÃ'
NÃBŘEŽÃ
>>> print u'NÃBŘEŽÃ'[3:-1].encode('cp1252').decode('utf8')
ŘEŽ
Note that the last character in the first string is invalid; it is missing a byte! That's because 'decoding' the last character's UTF-8 bytes should not have been possible with a proper CP1252 codec; I had to use the ftfy project's internal repair codecs to bypass that problem:
>>> print u'NÃBŘEŽÃ\x8d'[3:].encode('sloppy-cp1252').decode('utf8')
ŘEŽÍ
>>> u'Í'.encode('utf8').decode('cp1252')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/mpieters/Development/venvs/stackoverflow-2.7/lib/python2.7/encodings/cp1252.py", line 15, in decode
return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 1: character maps to <undefined>
>>> u'Í'.encode('utf8').decode('sloppy-cp1252')
u'\xc3\x8d'
>>> print u'Í'.encode('utf8').decode('sloppy-cp1252')
Ã
The only way to fix this is to a) ensure you read your data using the correct codecs, and b) then treat all text as Unicode throughout your code, and only encode at the last moment to the correct output codec.
Handling Unicode code points with ord() and unichr() (in Python 2) and chr() in Python 3 will then work as expected:
>>> lastletter = u'Î'
>>> ord(lastletter)
206
>>> unichr(ord(lastletter) + 1)
u'\xcf'
>>> print unichr(ord(lastletter) + 1)
Ï
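Putting it all together for this question, a short sketch; it assumes val is already a unicode value (decode it once, right where you read the Excel file, if your reader hands you bytes), and result.txt is a hypothetical output file:
# -*- coding: utf-8 -*-
import io

val = u'NÁBŘEŽÍ'                        # stand-in for the value read from Excel
w2 = val.upper()                        # unicode in, unicode out
w3 = w2[:-1] + unichr(ord(w2[-1]) + 1)  # increment the last code point

with io.open('result.txt', 'a', encoding='utf-8') as out:
    out.write(w2 + u';' + w3 + u'\n')   # encoded once, on write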
You may want to read up on Python and Unicode:
Pragmatic Unicode by Ned Batchelder
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
The Python Unicode HOWTO

How do I check whether I have encoded in utf-8 successfully

Given a string
u = 'abc'
which syntax is the right one to encode into utf8?
u.encode('utf-8')
or
u.encode('utf8')
And how do I know that I have already encoded in utf-8?
First of all, you need to make a distinction between Python 2 and Python 3, because unicode handling is one of the biggest differences between the two versions.
Python 2
unicode type contains text characters
str contains sequences of 8-bit bytes, sometimes representing text in some unspecified encoding
s.decode(encoding) takes a sequence of bytes and builds a text string out of it, once given the encoding used by the bytes. It goes from str to unicode; for example, "Citt\xe0".decode("iso8859-1") will give you the text "Città" (Italian for city), and the same will happen for "Citt\xc3\xa0".decode("utf-8"). The encoding may be omitted, in which case it means "use the default encoding".
u.encode(encoding) takes a text string and builds the byte sequence representing it in the given encoding, thus reversing the processing of decode. It goes from unicode to str. As above the encoding can be omitted.
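The same round trip at the interactive prompt (Python 2):
>>> "Citt\xc3\xa0".decode("utf-8")   # str -> unicode
u'Citt\xe0'
>>> u"Citt\xe0".encode("utf-8")      # unicode -> str
'Citt\xc3\xa0'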
Part of the confusion when handling unicode with Python is that the language tries to be a bit too smart and does things automatically.
For example, you can call encode on a str object too: Python first decodes the bytes using the default encoding to get text, then encodes that text using the encoding you specified (or the default, if you gave none).
Similarly, you can call decode on a unicode object: Python first encodes the text using the default encoding to get bytes, then decodes those bytes using the encoding you specified.
For example if I write
u"Citt\u00e0".decode("utf-8")
Python gives as error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 3: ordinal not in range(128)
NOTE: the error is about encoding, even though I asked for decoding. The reason is that I asked to decode text (nonsense, because that is already "decoded": it's text), and Python decided to first encode it using the "ascii" encoding, and that failed. IMO it would have been much better to simply not have decode on unicode objects and not have encode on str objects: the error messages would have been clearer.
Adding to the confusion, in Python 2 str is used for unencoded bytes, but it's also used everywhere for text; for example, string literals are str objects.
Python 3
To solve some of the issues Python 3 made a few key changes
str is for text and contains unicode characters, string literals are unicode text
unicode type doesn't exist any more
bytes type is used for 8-bit bytes sequences that may represent text in some unspecified encoding
For example in Python 3
'Città'.encode('iso8859-1') → b'Citt\xe0'
'Città'.encode('utf-8') → b'Citt\xc3\xa0'
also you cannot call decode on text strings and you cannot call encode on byte sequences.
Failures
Sometimes encoding text into bytes may fail, because the specified encoding cannot handle all of unicode. For example iso8859-1 cannot handle Chinese. These errors can be processed in a few ways like raising an exception (default), or replacing characters that cannot be encoded with something else.
The utf-8 encoding, however, is able to encode any unicode character, so encoding to utf-8 never fails. Thus it doesn't make sense to ask how to know whether encoding text into utf-8 was done correctly: it always succeeds (for utf-8).
Also decoding may fail, because the sequence of bytes may make no sense in the specified encoding. For example, the sequence of bytes 0x43 0x69 0x74 0x74 0xE0 cannot be interpreted as utf-8, because the final byte 0xE0 starts a multi-byte sequence but is not followed by the continuation bytes it requires.
There are encodings like iso8859-1, however, where decoding cannot fail, because every byte 0..255 has a meaning as a character. Most "local encodings" are of this type: they map all 256 possible 8-bit values to some character, but cover only a tiny fraction of the unicode characters.
Decoding using iso8859-1 will never raise an error (any byte sequence is valid), but of course it can give you nonsense text if the bytes were written using another encoding.
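A quick Python 3 demonstration of both failure modes described above (matching the Città examples):
>>> '你好'.encode('iso8859-1')             # encoding can fail
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-1: ordinal not in range(256)
>>> '你好'.encode('iso8859-1', 'replace')  # or substitute what cannot be encoded
b'??'
>>> b'Citt\xe0'.decode('utf-8')            # decoding can fail too
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe0 in position 4: unexpected end of data
>>> b'Citt\xe0'.decode('iso8859-1')        # iso8859-1 decoding never fails
'Città'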
First solution:
isinstance(u, unicode)
Second solution:
try:
    u.decode('utf-8')
    print "string is UTF-8, length %d bytes" % len(u)
except UnicodeError:
    print "string is not UTF-8"

string encoding and decoding?

Here are my attempts with the error messages. What am I doing wrong?
string.decode("ascii", "ignore")
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 37: ordinal not in range(128)
string.encode('utf-8', "ignore")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 37: ordinal not in range(128)
You can't decode a unicode, and you can't encode a str. Try doing it the other way around.
Guessing at all the things omitted from the original question, but assuming Python 2.x, the key is to read the error messages carefully: in particular, notice where you call 'encode' but the message says 'decode' (and vice versa), and also the types of the values included in the messages.
In the first example string is of type unicode and you attempted to decode it which is an operation converting a byte string to unicode. Python helpfully attempted to convert the unicode value to str using the default 'ascii' encoding but since your string contained a non-ascii character you got the error which says that Python was unable to encode a unicode value. Here's an example which shows the type of the input string:
>>> u"\xa0".decode("ascii", "ignore")
Traceback (most recent call last):
File "<pyshell#7>", line 1, in <module>
u"\xa0".decode("ascii", "ignore")
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 0: ordinal not in range(128)
In the second case you do the reverse attempting to encode a byte string. Encoding is an operation that converts unicode to a byte string so Python helpfully attempts to convert your byte string to unicode first and, since you didn't give it an ascii string the default ascii decoder fails:
>>> "\xc2".encode("ascii", "ignore")
Traceback (most recent call last):
File "<pyshell#6>", line 1, in <module>
"\xc2".encode("ascii", "ignore")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
Aside from getting decode and encode backwards, I think part of the answer here is actually don't use the ascii encoding. It's probably not what you want.
To begin with, think of str like you would a plain text file. It's just a bunch of bytes with no encoding actually attached to it. How it's interpreted is up to whatever piece of code is reading it. If you don't know what this paragraph is talking about, go read Joel's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets right now before you go any further.
Naturally, we're all aware of the mess that created. The answer is to, at least within memory, have a standard encoding for all strings. That's where unicode comes in. I'm having trouble tracking down exactly what encoding Python uses internally for sure, but it doesn't really matter just for this. The point is that you know it's a sequence of bytes that are interpreted a certain way. So you only need to think about the characters themselves, and not the bytes.
The problem is that in practice, you run into both. Some libraries give you a str, and some expect a str. Certainly that makes sense whenever you're streaming a series of bytes (such as to or from disk or over a web request). So you need to be able to translate back and forth.
Enter codecs: it's the translation library between these two data types. You use encode to generate a sequence of bytes (str) from a text string (unicode), and you use decode to get a text string (unicode) from a sequence of bytes (str).
For example:
>>> s = "I look like a string, but I'm actually a sequence of bytes. \xe2\x9d\xa4"
>>> codecs.decode(s, 'utf-8')
u"I look like a string, but I'm actually a sequence of bytes. \u2764"
What happened here? I gave Python a sequence of bytes, and then I told it, "Give me the unicode version of this, given that this sequence of bytes is in 'utf-8'." It did as I asked, and those bytes (a heart character) are now treated as a whole, represented by their Unicode codepoint.
Let's go the other way around:
>>> u = u"I'm a string! Really! \u2764"
>>> codecs.encode(u, 'utf-8')
"I'm a string! Really! \xe2\x9d\xa4"
I gave Python a Unicode string, and I asked it to translate the string into a sequence of bytes using the 'utf-8' encoding. So it did, and now the heart is just a bunch of bytes it can't print as ASCII; so it shows me the hexadecimal instead.
We can work with other encodings, too, of course:
>>> s = "I have a section \xa7"
>>> codecs.decode(s, 'latin1')
u'I have a section \xa7'
>>> codecs.decode(s, 'latin1')[-1] == u'\u00A7'
True
>>> u = u"I have a section \u00a7"
>>> u
u'I have a section \xa7'
>>> codecs.encode(u, 'latin1')
'I have a section \xa7'
('\xa7' is the section character, in both Unicode and Latin-1.)
So for your question, you first need to figure out what encoding your str is in.
Did it come from a file? From a web request? From your database? Then the source determines the encoding. Find out the encoding of the source and use that to translate it into a unicode.
s = [get from external source]
u = codecs.decode(s, 'utf-8') # Replace utf-8 with the actual input encoding
Or maybe you're trying to write it out somewhere. What encoding does the destination expect? Use that to translate it into a str. UTF-8 is a good choice for plain text documents; most things can read it.
u = u'My string'
s = codecs.encode(u, 'utf-8') # Replace utf-8 with the actual output encoding
[Write s out somewhere]
Are you just translating back and forth in memory for interoperability or something? Then just pick an encoding and stick with it; 'utf-8' is probably the best choice for that:
u = u'My string'
s = codecs.encode(u, 'utf-8')
newu = codecs.decode(s, 'utf-8')
In modern programming, you probably never want to use the 'ascii' encoding for any of this. It's an extremely small subset of all possible characters, and no system I know of uses it by default or anything.
Python 3 does its best to make this immensely clearer simply by changing the names. In Python 3, str was replaced with bytes, and unicode was replaced with str.
That's because your input string can’t be converted according to the encoding rules (strict by default).
I don't know, but I always decoded using the unicode() constructor directly; at least that's the way shown in the official documentation:
unicode(your_str, errors="ignore")
