Why is the below item failing? Why does it succeed with "latin-1" codec?
o = "a test of \xe9 char" #I want this to remain a string as this is what I am receiving
v = o.decode("utf-8")
Which results in:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 10: invalid continuation byte
I had the same error when I tried to open a CSV file with the pandas.read_csv method.
The solution was to change the encoding to latin-1:
pd.read_csv('ml-100k/u.item', sep='|', names=m_cols , encoding='latin-1')
In binary, 0xE9 looks like 1110 1001. If you read about UTF-8 on Wikipedia, you’ll see that such a byte must be followed by two of the form 10xx xxxx. So, for example:
>>> b'\xe9\x80\x80'.decode('utf-8')
u'\u9000'
But that’s just the mechanical cause of the exception. In this case, you have a string that is almost certainly encoded in Latin-1. You can see how UTF-8 and Latin-1 look different:
>>> u'\xe9'.encode('utf-8')
b'\xc3\xa9'
>>> u'\xe9'.encode('latin-1')
b'\xe9'
(Note, I'm using a mix of Python 2 and 3 representation here. The input is valid in any version of Python, but your Python interpreter is unlikely to actually show both unicode and byte strings in this way.)
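In other words, once you name the right encoding, the original string decodes cleanly (Python 2 shown, to match the question):
>>> 'a test of \xe9 char'.decode('latin-1')
u'a test of \xe9 char'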
It is invalid UTF-8. That character is the e-acute character in ISO-Latin1, which is why it succeeds with that codeset.
If you don't know the codeset you're receiving strings in, you're in a bit of trouble. It would be best if a single codeset (hopefully UTF-8) were chosen for your protocol/application; then you could simply reject inputs that didn't decode.
If you can't do that, you'll need heuristics.
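If you do have to guess, one common heuristic is the third-party chardet library (a minimal sketch; detection is probabilistic and can fail, and the Latin-1 fallback here is an assumption):
import chardet  # third-party: pip install chardet

raw = b"a test of \xe9 char"
guess = chardet.detect(raw)                        # e.g. {'encoding': ..., 'confidence': ...}
text = raw.decode(guess['encoding'] or 'latin-1')  # fall back if detection returns None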
It fails because UTF-8 is a multibyte encoding and no character's encoding starts with \xe9 followed by a space: 0xe9 opens a three-byte sequence, and the space that follows is not a valid continuation byte.
Why would you expect it to succeed in both utf-8 and latin-1?
Here is how the same sentence looks in UTF-8:
>>> o.decode('latin-1').encode("utf-8")
'a test of \xc3\xa9 char'
If this error arises when manipulating a file that was just opened, check to see if you opened it in 'rb' (binary) mode.
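For example, a minimal sketch (the filename and the Latin-1 guess are assumptions):
with open('data.bin', 'rb') as f:  # binary mode: read() returns raw bytes
    raw = f.read()                 # no implicit decoding happens here
text = raw.decode('latin-1')       # decode explicitly, naming the source encoding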
Use this if pandas reports a UTF-8 decode error:
pd.read_csv('File_name.csv', encoding='latin-1')
A UTF-8 decode error usually shows up when byte values fall outside the 0 to 127 ASCII range. The reason this exception is raised:
1) If the code point is < 128, each byte is the same as the value of the code point.
2) If the code point is 128 or greater, the Unicode string can't be represented in ASCII. (Python raises a UnicodeEncodeError exception in this case.)
To get around this there is a set of single-byte encodings; the most widely used is Latin-1, also known as ISO-8859-1.
The first 256 Unicode code points are identical to the Latin-1 values, so converting to this encoding simply requires converting code points to byte values; if a code point larger than 255 is encountered, the string can't be encoded into Latin-1.
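For example (Python 3):
>>> '\xe9'.encode('latin-1')    # code point 0xE9 is <= 255, so this works
b'\xe9'
>>> '\u2033'.encode('latin-1')  # DOUBLE PRIME, code point 0x2033 > 255
UnicodeEncodeError: 'latin-1' codec can't encode character '\u2033' in position 0: ordinal not in range(256)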
When this exception occurs while you are trying to load a data set, try this form:
df = pd.read_csv("top50.csv", encoding='ISO-8859-1')
Adding the encoding argument at the end of the call lets the data set load.
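If you would rather keep treating the file as UTF-8 and just surface the bad bytes, recent pandas versions also accept an encoding_errors argument (added in pandas 1.3, if I recall correctly; check your version):
df = pd.read_csv("top50.csv", encoding='utf-8', encoding_errors='replace')  # undecodable bytes become U+FFFD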
This type of error comes up when you are reading a particular file or data set into pandas, such as:
data = pd.read_csv('/kaggle/input/fertilizers-by-product-fao/FertilizersProduct.csv')
Then the error displays like this:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf4 in position 1: invalid continuation byte
This type of error can be avoided by adding an encoding argument:
data = pd.read_csv('/kaggle/input/fertilizers-by-product-fao/FertilizersProduct.csv', encoding='ISO-8859-1')
This happened to me as well, while I was reading text containing Hebrew from a .txt file.
I clicked File -> Save As and saved the file with UTF-8 encoding, which fixed it.
TLDR: I would recommend investigating the source of the problem in depth before switching encoders to silence the error.
I got this error as I was processing a large number of zip files with additional zip files in them.
My workflow was the following:
Read zip
Read child zip
Read text from child zip
At some point I was hitting the encoding error above. Upon closer inspection, it turned out that some child zips erroneously contained further zips. Reading these zips as text led to some funky character representation that I could silence with encoding="latin-1", but which in turn caused issues further down the line. Since I was working with international data, it was not completely foolish to assume it was an encoding problem (I had problems with 0xc2: Â), but in the end it was not the actual issue.
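One way to catch this class of problem early is to check whether a member is actually another zip before decoding it as text. A minimal sketch using the standard zipfile module (the file name and the helper are hypothetical):
import zipfile

def read_member_text(zf, member, encoding='utf-8'):
    """Read a zip member as text, refusing to decode nested zips."""
    raw = zf.read(member)
    if raw[:4] == b'PK\x03\x04':  # magic bytes that open a ZIP local file header
        raise ValueError('%s is a nested zip, not text' % member)
    return raw.decode(encoding)

with zipfile.ZipFile('parent.zip') as zf:
    for member in zf.namelist():
        text = read_member_text(zf, member)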
In my case, I tried to execute a .py script which loaded a path/file.sql.
My solution was to change the encoding of the file.sql to "UTF-8 without BOM", and it worked!
You can do it with Notepad++.
I will leave part of my code:
import sys
import psycopg2

con = psycopg2.connect(host=sys.argv[1], port=sys.argv[2],
                       dbname=sys.argv[3], user=sys.argv[4],
                       password=sys.argv[5])
cursor = con.cursor()
sqlfile = open(path, 'r')
I encountered this problem too, and it turned out that I had saved my CSV directly from a Google Sheets file. That is, while in the Google Sheet I chose "save a copy", and when my browser downloaded it I chose Open, then saved the CSV directly. That was the wrong move.
What fixed it for me was first saving the sheet as an .xlsx file on my local computer, and from there exporting the single sheet as a .csv. Then the error went away for pd.read_csv('myfile.csv').
I am reading a JSON file in Python which has lots of fields and values (~8000 records).
Environment: Windows 10, Python 3.6.4
Code:
import json
json_data = json.load(open('json_list.json'))
print (json_data)
With this I get an error. Below is the stack trace:
  json_data = json.load(open('json_list.json'))
  File "C:\Program Files (x86)\Python36-32\lib\json\__init__.py", line 296, in load
    return loads(fp.read(),
  File "C:\Program Files (x86)\Python36-32\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 7977319: character maps to <undefined>
Along with this, I have tried:
import json
with open('json_list.json', encoding='utf-8') as fd:
    json_data = json.load(fd)
print (json_data)
With this, my program runs for a long time and then hangs with no output.
I have searched almost all topics related to this and could not find a solution.
Note: The JSON data is a valid one as when I see it on Postman/any REST client it doesn't report any anomalies.
Any help on this, or an alternative way to load my JSON data (for instance by converting it to a string and back to JSON), would be of great help.
Here is what the file looks like around the reported error:
>>> from pprint import pprint
>>> f = open('C:/Users/c5242046/Desktop/test2/dblist_rest.json', 'rb')
>>> f.seek(7977319)
7977319
>>> pprint(f.read(100))
(b'\x81TICA EL ABGEN INGL\xc3\x83\xc2\x89S, S.A.","memory_size_gb":"64","since'
b'":"2017-04-10","storage_size_gb":"84.747')
The snippet you are asking about seems to have been double-encoded. Basically, whatever originally generated this data produced text in Latin-1 or some related encoding (Windows code page 1252?). It was then fed to a process which converts Latin-1 to UTF-8 ... twice.
Of course, "converting" data which is already UTF-8 but telling the computer that it's Latin-1 just produces mojibake.
The string INGL\xc3\x83\xc2\x89S suggests this analysis, if you can guess that it is supposed to say Inglés in upper case, and realize that the UTF-8 encoding for É is \xC3 \x89 and then examine which characters these two bytes encode in Latin-1 (or, as it happens, Unicode, which is a superset of Latin-1, though they are not compatible on the encoding level).
Notice that being able to guess which string a problematic sequence is supposed to represent is the crucial step here; it also explains why including a representative snippet of the problematic data - with enough context! - is vital for debugging.
Anyway, if the entire file has the same symptom, you should be able to undo the second, superfluous and incorrect round of re-encoding; though an error this far into the file makes me imagine it's probably a local problem with just one or a few records. Maybe they were merged from multiple input files, only one of which had this error. Then fixing it requires a fair bit of detective work, and manual editing, or identifying and fixing the erroneous source. A quick and dirty workaround is to simply manually remove any erroneous records.
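To make the double encoding concrete, here is a minimal round trip (Python 3) that reproduces the bytes seen in the file and then peels off the extra layer; treat it as a diagnostic sketch, not a blanket fix:
>>> s = 'INGLÉS'.encode('utf-8').decode('latin-1').encode('utf-8')  # encoded twice by mistake
>>> s
b'INGL\xc3\x83\xc2\x89S'
>>> s.decode('utf-8').encode('latin-1').decode('utf-8')  # undo the superfluous round
'INGLÉS'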
We're running into a problem (described at http://wiki.python.org/moin/UnicodeDecodeError -- read the second paragraph, '...Paradoxically...').
Specifically, we're trying to up-convert a string to unicode and we are receiving a UnicodeDecodeError.
Example:
>>> unicode('\xab')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xab in position 0: ordinal not in range(128)
But of course, this works without any problems
>>> unicode(u'\xab')
u'\xab'
Of course, this code is to demonstrate the conversion problem. In our actual code, we are not using string literals and we cannot just prepend the unicode 'u' prefix; instead we are dealing with strings returned from an os.walk(), and the file name includes the above value. Since we cannot coerce the value to unicode without calling the unicode() constructor, we're not sure how to proceed.
One really horrible hack that occurs to us is to write our own str2uni() method, something like:
def str2uni(val):
    r"""Brute-force coercion of str -> unicode."""
    try:
        return unicode(val)
    except UnicodeDecodeError:
        pass
    res = u''
    for ch in val:
        res += unichr(ord(ch))
    return res
But before we do this, we wanted to see if anyone else had any insight.
UPDATED
I see everyone is getting focused on HOW I got to the example I posted, rather than the result. Sigh -- ok, here's the code that caused me to spend hours reducing the problem to the simplest form I shared above.
for _, _, files in os.walk('/path/to/folder'):
    for fname in files:
        filename = unicode(fname)
That piece of code tosses a UnicodeDecodeError exception when the filename has the following value '3\xab Floppy (A).link'
To see the error for yourself, do the following:
>>> unicode('3\xab Floppy (A).link')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xab in position 1: ordinal not in range(128)
UPDATED
I really appreciate everyone trying to help. And I also appreciate that most people make some pretty simple mistakes related to string/unicode handling. But I'd like to underline the reference to the UnicodeDecodeError exception. We are getting this when calling the unicode() constructor!!!
I believe the underlying cause is described in the aforementioned Wiki article http://wiki.python.org/moin/UnicodeDecodeError. Read from the second paragraph on down about how "Paradoxically, a UnicodeDecodeError may happen when encoding...". The Wiki article very accurately describes what we are experiencing -- but while it elaborates on the causes, it makes no suggestions for resolutions.
As a matter of fact, the third paragraph starts with the following astounding admission "Unlike a similar case with UnicodeEncodeError, such a failure cannot be always avoided...".
Since I am not used to "can't get there from here" answers as a developer, I thought it would be interesting to cast about on Stack Overflow for the experiences of others.
I think you're confusing Unicode strings and Unicode encodings (like UTF-8).
os.walk(".") returns the filenames (and directory names etc.) as strings that are encoded in the current codepage. It will silently remove characters that are not present in your current codepage (see this question for a striking example).
Therefore, if your file/directory names contain characters outside of your encoding's range, then you definitely need to use a Unicode string to specify the starting directory, for example by calling os.walk(u"."). Then you don't need to (and shouldn't) call unicode() on the results any longer, because they already are Unicode strings.
If you don't do this, you first need to decode the filenames (as in mystring.decode("cp850")) which will give you a Unicode string:
>>> "\xab".decode("cp850")
u'\xbd'
Then you can encode that into UTF-8 or any other encoding.
>>> _.encode("utf-8")
'\xc2\xbd'
If you're still confused why unicode("\xab") throws a decoding error, maybe the following explanation helps:
"\xab" is an encoded string. Python has no way of knowing which encoding that is, but before you can convert it to Unicode, it needs to be decoded first. Without any specification from you, unicode() assumes that it is encoded in ASCII, and when it tries to decode it under this assumption, it fails because \xab isn't part of ASCII. So either you need to find out which encoding is being used by your filesystem and call unicode("\xab", encoding="cp850") or whatever, or start with Unicode strings in the first place.
for fname in files:
    filename = unicode(fname)
The second line will complain if fname is not ASCII. If you want to convert the string to Unicode, instead of unicode(fname) you should do fname.decode('<the encoding here>').
I would suggest an encoding, but you don't tell us what \xab is supposed to be in your .link file. You can search Google for the encoding anyway, so it would look like this:
for fname in files:
    filename = fname.decode('<encoding>')
UPDATE: For example, if the encoding of your filesystem's names is ISO-8859-1, then the \xab char would be "«". To read it into Python you should do:
for fname in files:
    filename = fname.decode('latin1')  # 'latin1' is a synonym for ISO-8859-1
Hope this helps!
As I understand it your issue is that os.walk(unicode_path) fails to decode some filenames to Unicode. This problem is fixed in Python 3.1+ (see PEP 383: Non-decodable Bytes in System Character Interfaces):
File names, environment variables, and command line arguments are
defined as being character data in POSIX; the C APIs however allow
passing arbitrary bytes - whether these conform to a certain encoding
or not. This PEP proposes a means of dealing with such irregularities
by embedding the bytes in character strings in such a way that allows
recreation of the original byte string.
Windows provides Unicode API to access filesystem so there shouldn't be this problem.
Python 2.7 (utf-8 filesystem on Linux):
>>> import os
>>> list(os.walk("."))
[('.', [], ['\xc3('])]
>>> list(os.walk(u"."))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/os.py", line 284, in walk
if isdir(join(top, name)):
File "/usr/lib/python2.7/posixpath.py", line 71, in join
path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: \
ordinal not in range(128)
Python 3.3:
>>> import os
>>> list(os.walk(b'.'))
[(b'.', [], [b'\xc3('])]
>>> list(os.walk(u'.'))
[('.', [], ['\udcc3('])]
Your str2uni() function tries to solve the same issue as the "surrogateescape" error handler on Python 3, but the names it produces are ambiguous. Use bytestrings for filenames on Python 2 if you are expecting filenames that can't be decoded using sys.getfilesystemencoding().
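For comparison, this is how the Python 3 "surrogateescape" handler preserves undecodable bytes losslessly (a minimal sketch):
raw = b'3\xab Floppy (A).link'
name = raw.decode('ascii', errors='surrogateescape')          # '3\udcab Floppy (A).link'
assert name.encode('ascii', errors='surrogateescape') == raw  # round-trips exactly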
'\xab' is a byte, number 171.
u'\xab' is a character, U+00AB Left-pointing double angle quotation mark («).
u'\xab' is a short-hand way of saying u'\u00ab'. It's not the same (not even the same datatype) as the byte '\xab'; it would probably have been clearer to always use the \u syntax in Unicode string literals IMO, but it's too late to fix that now.
To go from bytes to characters is known as a decode operation. To go from characters to bytes is known as an encode operation. For either direction, you need to know which encoding is used to map between the two.
>>> unicode('\xab')
UnicodeDecodeError
unicode is a character string, so there is an implicit decode operation when you pass bytes to the unicode() constructor. If you don't tell it which encoding you want you get the default encoding which is often ascii. ASCII doesn't have a meaning for byte 171 so you get an error.
>>> unicode(u'\xab')
u'\xab'
Since u'\xab' (or u'\u00ab') is already a character string, there is no implicit conversion in passing it to the unicode() constructor - you get an unchanged copy.
res = u''
for ch in val:
    res += unichr(ord(ch))
return res
The encoding that maps each input byte to the Unicode character with the same ordinal value is ISO-8859-1. Consequently you could replace this loop with just:
return unicode(val, 'iso-8859-1')
(However note that if Windows is in the mix, then the encoding you want is probably not that one but the somewhat-similar windows-1252.)
One really horrible hack that occurs is to write our own str2uni() method
This isn't generally a good idea. UnicodeErrors are Python telling you you've misunderstood something about string types; ignoring that error instead of fixing it at source means you're more likely to hide subtle failures that will bite you later.
filename = unicode(fname)
So this would be better replaced with: filename = unicode(fname, 'iso-8859-1') if you know your filesystem is using ISO-8859-1 filenames. If your system locales are set up correctly then it should be possible to find out the encoding your filesystem is using, and go straight to that:
filename = unicode(fname, sys.getfilesystemencoding())
Though actually if it is set up correctly, you can skip all the encode/decode fuss by asking Python to treat filesystem paths as native Unicode instead of byte strings. You do that by passing a Unicode character string into the os filename interfaces:
for _, _, files in os.walk(u'/path/to/folder'):  # note u'' string
    for fname in files:
        filename = fname  # nothing more to do!
PS. The character in 3″ Floppy should really be U+2033 Double Prime, but there is no encoding for that in ISO-8859-1. Better in the long term to use UTF-8 filesystem encoding so you can include any character.
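For example, with a UTF-8 filesystem encoding the double-prime name encodes without trouble (Python 2 shown):
>>> u'3\u2033 Floppy (A).link'.encode('utf-8')
'3\xe2\x80\xb3 Floppy (A).link'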
We are creating a website using Django 1.5. We have several large text files stored on the server that are rendered with the web page, depending on the country. The problem is that these text files contain the copyright symbol (c), and we keep getting a 'Non-ascii character' error, so the text does not load. Does anyone have any suggestions on how to successfully convert one to the other?
Selections of the Code:
# Open file, where filename is our variable
with open(filename) as f:
    # Append (it is in a loop, and we are only passing one document variable)
    document = document + f.read()
# the with block closes the file automatically; f.close() is not needed
We have tried using:
mark_safe (in Django)
smart_str
.encode('utf8')
But to no avail; the page continues to spit back an error saying there is an ASCII character that it cannot convert. Any ideas?
Here is the error we keep getting
UnicodeDecodeError at /<website-hidden>/
'ascii' codec can't decode byte 0x92 in position 950: ordinal not in range(128)
The issue is that the copyright symbol isn't a strict ASCII character, as its 8th (most significant) bit is 1. ASCII only uses 7 bits. You need to tell Python that the file isn't ASCII data, but something like "Extended ASCII", "ISO 8859-1" or "ISO Latin-1" data.
As such, you need to read it as bytes and then convert it to a string using that decoding. You can then re-encode it to anything you want, including UTF-8.
Exact handling for this depends on whether you are using Python 2.x or 3.x.
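A minimal sketch of that read-as-bytes-then-decode approach (the Latin-1 guess and the variable names are assumptions):
# Python 3.x
with open(filename, 'rb') as f:        # binary mode: no implicit decode
    raw = f.read()
document = raw.decode('latin-1')       # bytes -> str, naming the source encoding
utf8_bytes = document.encode('utf-8')  # re-encode to UTF-8 if needed
# Python 2.x works the same, except open() already returns byte strings ('str')
# and .decode('latin-1') yields a 'unicode' object.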
Ref
http://www.ascii-code.com/
https://en.wikipedia.org/wiki/Extended_ASCII
A restart of the computer and Eclipse seemed to do the trick. Perhaps it was a problem with the cache? Either way, strange error...
I've honestly spent a lot of time on this, and it's slowly killing me. I've stripped content from a PDF and stored it in an array. Now I'm trying to pull it back out of the array and write it into a txt file. However, I do not seem to be able to make it happen because of encoding issues.
allTheNTMs.append(contentRaw[s1:].encode("utf-8"))
for a in range(len(allTheNTMs)):
    kmlDescription = allTheNTMs[a]
    print kmlDescription  # this prints out fine
    outputFile.write(kmlDescription)
The error I'm getting is "UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 213: ordinal not in range(128)".
I'm just messing around now, but I've tried all kinds of ways to get this stuff to write out.
outputFile.write(kmlDescription).decode('utf-8')
Please forgive me if this is basic, I'm still learning Python (2.7).
Cheers!
EDIT1: Sample data looks something like the following:
Chart 3686 (plan, Morehead City) [ previous update 4997/11 ] NAD83 DATUM
Insert the accompanying block, showing amendments to coastline,
depths and dolphins, centred on: 34° 41´·19N., 76° 40´·43W.
Delete R 34° 43´·16N., 76° 41´·64W.
When I add the print type(raw), I get
Edit 2: When I just try to write the data, I receive the original error message (ascii codec can't decode byte...)
I will check out the suggested thread and video. Thanks folks!
Edit 3: I'm using Python 2.7
Edit 4: agf hit the nail on the head in the comments below when (s)he noticed that I was double encoding. I tried intentionally double encoding a string that had previously been working and produced the same error message that was originally thrown. Something like:
text = "Here's a string, but imagine it has some weird symbols and whatnot in it - apparently latin-1"
textEncoded = text.encode('utf-8')
textEncodedX2 = textEncoded.encode('utf-8')
outputfile.write(textEncoded) #Works!
outputfile.write(textEncodedX2) #failed
Once I figured out I was trying to double encode, the solution was the following:
allTheNTMs.append(contentRaw[s1:].encode("utf-8"))
for a in range(len(allTheNTMs)):
    kmlDescription = allTheNTMs[a]
    kmlDescriptionDecode = kmlDescription.decode("latin-1")
    outputFile.write(kmlDescriptionDecode)
It's working now, and I sure appreciate all of your help!!
My guess is that the output file you have opened was opened with a latin1 or even utf-8 codec, and hence you are not able to write utf-8-encoded data to it, because it tries to re-convert the data. A normally opened file will accept any arbitrary byte string. Here is an example recreating a similar error:
import codecs

u = u'सच्चिदानन्द हीरानन्द वात्स्यायन '
s = u.encode('utf-8')
f = codecs.open('del.text', 'wb', encoding='latin1')
f.write(s)
Output:
Traceback (most recent call last):
  File "/usr/lib/wingide4.1/src/debug/tserver/_sandbox.py", line 1, in <module>
    # Used internally for debug sandbox under external interpreter
  File "/usr/lib/python2.7/codecs.py", line 691, in write
    return self.writer.write(data)
  File "/usr/lib/python2.7/codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 0: ordinal not in range(128)
Solution:
This will work if you don't set any codec:
f = open('del.txt', 'wb')
f.write(s)
The other option is to write the unicode string directly, without encoding it, provided outputFile has been opened with the correct codec, e.g.:
f = codecs.open('del.text', 'wb', encoding='utf-8')
f.write(u)
Your error message doesn't actually relate to your Python syntax; it reflects the fact that you're trying to decode a byte value which has no equivalent in UTF-8.
Hex 0xc2 represents a Latin character (an uppercase A with an accent on top) in Latin-1. Therefore, instead of using allTheNTMs.append(contentRaw[s1:].encode("utf-8")), try:
allTheNTMs.append(contentRaw[s1:].encode("latin-1"))
I'm not an expert in Python, so this may not work, but it would appear you're trying to encode a Latin character. Given the error message you are receiving, it would appear that when trying to encode in UTF-8, Python only looks through the first 128 entries, and your error indicates that 0xc2 is out of that range, which indeed it is outside the first 128 entries of UTF-8.