I am trying to analyze data in CSV files whose names contain Chinese characters (e.g. "粗1 25g").
I am using Tkinter to choose the files like so:
selectedFiles = askopenfilenames(filetypes=[("xlsx","*"),("xls","*")]) # Use a Tkinter dialog window to choose files
selectedFiles = master.tk.splitlist(selectedFiles) # Create a list from the chosen files
I have attempted to convert the filenames to unicode in this way:
selectedFiles = [x.decode("utf-8") for x in selectedFiles]
Only to yield the error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xb4 in position 0: ordinal not in range(128)
I have also tried converting the filenames as the files are created with the following:
titles = [x.encode('utf-8') for x in titles]
Only to receive the error:
IOError: [Errno 22] invalid mode ('wb') or filename: 'C:\...\\data_division_files\\\xe7\xb2\x971 25g.csv'
I have also tried combinations of the above methods to no avail.
What can I do to allow these files to be read in Python?
(This question, while related, has not solved my problem: Obtain File size with os.path.getsize() in Python 2.7.5)
When you call decode on a unicode object, it first encodes it with sys.getdefaultencoding() so it can decode it for you, which is why you get an error about ASCII even though you didn't ask for ASCII anywhere.
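A minimal Python 2 sketch of that implicit round trip (the value here is hypothetical):
import sys
print(sys.getdefaultencoding())   # usually 'ascii'
name = u'\u7c971 25g'             # a unicode filename like the one above
name.decode('utf-8')              # really does name.encode('ascii').decode('utf-8'),
                                  # so it raises a UnicodeError about the 'ascii' codec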
So, where are you getting a unicode object from? From askopenfilenames. From a quick test, it looks like it always returns unicode values on Windows (presumably by getting the UTF-16 from the OS and decoding it), while on POSIX it returns some unicode and some str (I'd guess by leaving alone anything that fits into 7-bit ASCII and decoding anything else with your filesystem encoding). If you'd tried printing out the repr or type of selectedFiles, the problem would have been obvious.
Meanwhile, the encode('utf-8') shouldn't cause any UnicodeErrors… but it's likely that your filesystem encoding isn't UTF-8 on Windows, so it will probably cause a lot of IOErrors with errno 2 (trying to open files that don't exist, or to create files in directories that don't exist), errno 22 (trying to open files with illegal file or directory names on Windows), etc. And that looks like exactly what you're seeing. There's really no reason to do it; just pass the pathnames as-is to open and they'll be fine.
So, basically, if you removed all of your encode and decode calls, your code would probably just work.
However, there's an even easier solution: Just use askopenfile or asksaveasfile instead of askopenfilename or asksaveasfilename. Let Tk figure out how to use its pathnames and just hand you the file objects, instead of messing with the pathnames yourself.
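For example, a rough Python 2 sketch of that approach using the plural askopenfiles variant, since multiple files are being selected (the filetypes shown are only an assumption):
from tkFileDialog import askopenfiles

# Tk opens the files itself and hands back file objects; no pathname handling needed
files = askopenfiles(mode='rb', filetypes=[("Excel", "*.xlsx"), ("Excel", "*.xls")])
for f in files:
    data = f.read()
    f.close()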
Related
Why is the below item failing? Why does it succeed with "latin-1" codec?
o = "a test of \xe9 char" #I want this to remain a string as this is what I am receiving
v = o.decode("utf-8")
Which results in:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 10: invalid continuation byte
I had the same error when I tried to open a CSV file with the pandas.read_csv method.
The solution was to change the encoding to latin-1:
pd.read_csv('ml-100k/u.item', sep='|', names=m_cols, encoding='latin-1')
In binary, 0xE9 looks like 1110 1001. If you read about UTF-8 on Wikipedia, you’ll see that such a byte must be followed by two of the form 10xx xxxx. So, for example:
>>> b'\xe9\x80\x80'.decode('utf-8')
u'\u9000'
But that’s just the mechanical cause of the exception. In this case, you have a string that is almost certainly encoded in Latin-1. You can see how UTF-8 and Latin-1 look different:
>>> u'\xe9'.encode('utf-8')
b'\xc3\xa9'
>>> u'\xe9'.encode('latin-1')
b'\xe9'
(Note, I'm using a mix of Python 2 and 3 representation here. The input is valid in any version of Python, but your Python interpreter is unlikely to actually show both unicode and byte strings in this way.)
It is invalid UTF-8. That character is the e-acute character in ISO-Latin1, which is why it succeeds with that codeset.
If you don't know the codeset you're receiving strings in, you're in a bit of trouble. It would be best if a single codeset (hopefully UTF-8) were chosen for your protocol/application, and then you could simply reject inputs that don't decode.
If you can't do that, you'll need heuristics.
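One common heuristic is the third-party chardet package; a minimal sketch, with a hypothetical filename:
import chardet  # pip install chardet

with open('unknown.csv', 'rb') as f:
    raw = f.read()
guess = chardet.detect(raw)   # e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73, ...}
text = raw.decode(guess['encoding'] or 'latin-1')   # fall back to latin-1 if detection fails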
Because UTF-8 is a multi-byte encoding, and there is no valid UTF-8 sequence corresponding to your combination of \xe9 plus the following space.
Why should it succeed in both utf-8 and latin-1?
Here is how the same sentence looks in utf-8:
>>> o.decode('latin-1').encode("utf-8")
'a test of \xc3\xa9 char'
If this error arises when manipulating a file that was just opened, check whether you opened it in 'rb' (binary) mode.
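For instance, if you only need the raw bytes and no decoding at all, open the file in binary mode (the filename is hypothetical):
with open('some_file.csv', 'rb') as f:   # 'rb' = read bytes, no decoding attempted
    raw_bytes = f.read()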
If it shows a UTF-8 error, use this:
pd.read_csv('File_name.csv', encoding='latin-1')
This codec error usually comes up when values outside the range 0 to 127 are involved.
The reason the exception is raised:
1) If the code point is < 128, each byte is the same as the value of the code point.
2) If the code point is 128 or greater, the Unicode string can't be represented in this encoding (Python raises a UnicodeEncodeError exception in this case).
To overcome this there is a set of other encodings; the most widely used is Latin-1, also known as ISO-8859-1.
In ISO-8859-1, Unicode code points 0–255 are identical to the Latin-1 byte values, so converting to this encoding simply requires converting code points to byte values; if a code point larger than 255 is encountered, the string can't be encoded into Latin-1.
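A quick interactive illustration of that 0–255 boundary (Python 3 shown; the comments are mine):
>>> '\u00ff'.encode('latin-1')   # code point 255: still representable
b'\xff'
>>> '\u0100'.encode('latin-1')   # code point 256: cannot be encoded
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'latin-1' codec can't encode character '\u0100' in position 0: ordinal not in range(256)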
When this exception occurs while you are trying to load a data set, try this format:
df = pd.read_csv("top50.csv", encoding='ISO-8859-1')
Adding the encoding argument at the end of the call lets the data set load.
This type of error comes up when you are reading a particular file or data set into pandas, such as:
data = pd.read_csv('/kaggle/input/fertilizers-by-product-fao/FertilizersProduct.csv')
Then the error is displayed like this:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf4 in position 1: invalid continuation byte
This type of error can be avoided by adding an encoding argument:
data=pd.read_csv('/kaggle/input/fertilizers-by-product-fao/FertilizersProduct.csv', encoding='ISO-8859-1')
This happened to me too, while I was reading text containing Hebrew from a .txt file.
I clicked File -> Save As and saved the file with UTF-8 encoding.
TLDR: I would recommend investigating the source of the problem in depth before switching encodings to silence the error.
I got this error as I was processing a large number of zip files with additional zip files in them.
My workflow was the following:
Read zip
Read child zip
Read text from child zip
At some point I was hitting the encoding error above. Upon closer inspection, it turned out that some child zips erroneously contained further zips. Reading these zips as text led to some funky character representation that I could silence with encoding="latin-1", but which in turn caused issues further down the line. Since I was working with international data it was not completely foolish to assume it was an encoding problem (I had problems with 0xc2: Â), but in the end it was not the actual issue.
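A minimal sketch of the check that workflow needed, with hypothetical file names; the idea is to test whether an entry is itself a zip before decoding it as text:
import io
import zipfile

with zipfile.ZipFile('parent.zip') as parent:
    for child_name in parent.namelist():
        with zipfile.ZipFile(io.BytesIO(parent.read(child_name))) as child:
            for inner_name in child.namelist():
                inner_bytes = child.read(inner_name)
                if zipfile.is_zipfile(io.BytesIO(inner_bytes)):
                    continue   # a nested zip, not text: decoding it is what raised the error
                text = inner_bytes.decode('utf-8')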
In my case, I was trying to execute a .py script that runs a path/file.sql.
My solution was to change the encoding of the file.sql to "UTF-8 without BOM", and it works!
You can do it with Notepad++.
I will leave part of my code here.
import sys
import psycopg2

# Connection parameters come from the command line: host, port, dbname, user, password
con = psycopg2.connect(host=sys.argv[1], port=sys.argv[2], dbname=sys.argv[3],
                       user=sys.argv[4], password=sys.argv[5])
cursor = con.cursor()
sqlfile = open(path, 'r')  # path points to the .sql file re-saved as UTF-8 without BOM
I encountered this problem, and it turned out that I had saved my CSV directly from a Google Sheets file: I chose "Save a copy", and when my browser downloaded it I chose Open and then saved the CSV directly. That was the wrong move.
What fixed it for me was first saving the sheet as an .xlsx file on my local computer, and from there exporting the single sheet as .csv. After that, pd.read_csv('myfile.csv') ran without the error.
The solution was to change to "UTF-8 without BOM".
pythonNotes = open('E:\\Python Notes.docx','r')
read_it_now = pythonNotes.read()
print(read_it_now.encode('utf-16'))
When I try this code, I get:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 591: character maps to <undefined>
I am running this in Visual Studio with Python Tools, starting without debugging.
I have tried putting enc='utf-8' at the top and passing it in as a parameter; I've looked at other questions and just couldn't find a solution to this simple issue.
Please assist.
This error can occur when text that is already in utf-8 format is read in as an 8-bit encoding, and Python tries to "decode" it to Unicode: bytes that have no meaning in the supposed encoding throw a UnicodeDecodeError. But you'll always get an error if you try to read a file as utf-8 that is not in the utf-8 encoding.
In your case, the problem is that a docx file is not a regular text file; no single text encoding can meaningfully import it. See this SO answer for directions on how to read it on a low level, or use python-docx to get access to the document in a way that resembles what you see in Word.
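For example, a minimal sketch with the third-party python-docx package (install as python-docx, import as docx); the path is the one from the question:
import docx

doc = docx.Document('E:\\Python Notes.docx')
text = '\n'.join(paragraph.text for paragraph in doc.paragraphs)
print(text)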
I've got a script that basically aggregates students' code files into one file for plagiarism detection. It walks through a tree of files, copying all file contents into one file.
I've run the script on the exact same files on my Mac and my PC. On my PC, it works fine. On my Mac, it encounters 27 UnicodeDecodeErrors (probably 0.1% of all files I'm testing).
What could cause a UnicodeDecodeError on a Mac, but not on a PC?
If relevant, the code is:
originalFile = open(originalFilename, "r")
newFile = open(newFilename, "a")
newFile.write(originalFile.read())
Figure out what encoding was used when saving that file. A good first try is loading the file as 'utf-8': if that succeeds, it is likely to be the correct encoding.
# try utf-8. If this fails, all bets are off.
open(originalFilename, "r", encoding="utf-8")
Now, if students are sending you these files, it's likely they just use the default encoding on their system. It is not possible to reliably guess the encoding. If they were using an 8-bit codec, like one of the ISO-8859 character sets, it will be almost impossible to guess which one was used. What to do then depends on what kind of files you're processing.
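A hedged Python 3 sketch of that approach: try UTF-8 first, then fall back to a best-guess 8-bit codec (latin-1 is chosen here only as an example; it never fails to decode, but it may map some characters wrongly):
def read_student_file(path):
    # Try UTF-8 first; if the bytes are not valid UTF-8, fall back to latin-1.
    for encoding in ('utf-8', 'latin-1'):
        try:
            with open(path, 'r', encoding=encoding) as f:
                return f.read()
        except UnicodeDecodeError:
            continue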
It is incorrect to read Python source files using open(originalFilename, "r") on Python 3. open() uses locale.getpreferredencoding(False) by default. A Python source file may use a different character encoding; in the best case, it causes a UnicodeDecodeError -- usually, you just get mojibake silently.
To read Python source taking into account the encoding declaration (# -*- coding: ...), use tokenize.open(filename). If it fails, the input is not valid Python 3 source code.
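A minimal sketch of that suggestion (Python 3 standard library; the variable name is reused from the question):
import tokenize

with tokenize.open(originalFilename) as f:   # honours the coding declaration / BOM
    source = f.read()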
What could cause a UnicodeDecodeError on a Mac, but not on a PC?
locale.getpreferredencoding(False) is likely to be utf-8 on a Mac, and utf-8 doesn't accept an arbitrary sequence of bytes as utf-8 encoded text. The PC is likely to use an 8-bit character encoding that corrupts the input and silently produces mojibake instead of raising an error about the mismatched character encoding.
To read a text file, you should know its character encoding. If you don't know the character encoding then either read the file as a sequence of bytes ('rb' mode) or you could try to guess the encoding using chardet Python module (it would be only a guess but it might be good enough depending on your task).
I got the exact same problem. There seemed to be some characters in the file that gave a UnicodeDecodeError during readlines().
This only happened on my macbook, but not on a PC.
I solved the problem by simply skipping these characters:
with open(file_to_extract, errors='ignore') as f:
    reader = f.readlines()
We are creating a website using Django 1.5. We have several large text files stored on the server that are rendered with the web page, depending on the country. The problem is that these text files contain the copyright symbol (c), we keep getting a 'Non-ascii character' error, and the text does not load. Does anyone have any suggestions on how to successfully convert the files' encoding?
Selections of the Code:
# Open file, where filename is our variable
with open(filename) as f:
    # Append (it is in a loop, and we are only passing one document variable)
    document = document + f.read()
f.close()  # redundant: the with-block already closes the file
We have tried using:
mark_safe (in Django)
smart_str
.encode('utf8')
But to no avail; the page continues to spit back an error saying there is a character that the ascii codec cannot convert. Any ideas?
Here is the error we keep getting
UnicodeDecodeError at /<website-hidden>/
'ascii' codec can't decode byte 0x92 in position 950: ordinal not in range(128)
The issue is that the copyright symbol isn't a strict ASCII character, as its 8th (most significant) bit is 1; ASCII only uses 7 bits. You need to tell Python that the file isn't ASCII data, but something like "Extended ASCII", "ISO 8859-1" or "ISO Latin-1" data.
As such, you need to read it as bytes and then convert it to a string using that decoding. You can then re-encode it to anything you want, including UTF-8.
Exact handling for this depends on whether you are using Python 2.x or 3.x.
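On Python 3, a minimal sketch of that byte-level approach (filename and document are the question's variables; Latin-1 is assumed per the discussion above):
with open(filename, 'rb') as f:        # read raw bytes, no decoding yet
    raw = f.read()
document = document + raw.decode('latin-1')    # now a proper str
utf8_bytes = document.encode('utf-8')          # re-encode later if bytes are needed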
Ref
http://www.ascii-code.com/
https://en.wikipedia.org/wiki/Extended_ASCII
A restart of the computer and Eclipse seemed to do the trick. Perhaps it was a problem with the cache? Either way, a strange error...