Python - Reading CSV UnicodeError

I have exported a CSV from Kaggle - https://www.kaggle.com/ngyptr/python-nltk-sentiment-analysis. However, when I attempt to iterate through the file, I receive unicode errors concerning certain characters that cannot be encoded.
File "C:\Program Files\Python35\lib\encodings\cp850.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2026' in position 264: character maps to <undefined>
I have enabled utf-8 encoding while opening the file, which I assumed would handle the non-ASCII characters. Evidently not.
My Code:
with open("sentimentDataSet.csv", "r", encoding="utf-8", errors='ignore', newline='') as file:
    reader = csv.reader(file)
    for row in reader:
        if row:
            print(row)
            if row[sentimentCsvColumn] == sentimentScores(row[textCsvColumn]):
                accuracyCount += 1
print(accuracyCount)

That's an encode error as you're printing the row, and has little to do with reading the actual CSV.
Your Windows terminal is in CP850 encoding, which can't represent everything.
There are some things you can do here.
A simple way is to set the PYTHONIOENCODING environment variable so that Python replaces anything it can't represent. Running set PYTHONIOENCODING=cp850:replace before starting Python will have Python replace characters unrepresentable in CP850 with question marks.
Change your terminal encoding to UTF-8: chcp 65001 before running Python.
Encode the thing by hand before printing: print(str(data).encode('ascii', 'replace'))
Don't print the thing.
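As a sketch of the encode-by-hand option, the same replacement behavior can be made explicit by wrapping a byte stream the way the console is wrapped (the BytesIO buffer below simulates a CP850 Windows console and is only for illustration):

```python
import io

# Simulate a CP850 console: errors='replace' turns characters CP850
# cannot encode (like '\u2026') into '?' instead of raising.
buf = io.BytesIO()
console = io.TextIOWrapper(buf, encoding="cp850", errors="replace", newline="\n")
print("ellipsis: \u2026", file=console)
console.flush()
print(buf.getvalue())  # b'ellipsis: ?\n'

# On the real console (Python 3.7+), the equivalent is:
# import sys
# sys.stdout.reconfigure(errors="replace")
```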

Related

what encoding should I use to open a file with a big letter N with tilde character?

I'm trying to open a file with a big letter N tilde (http://graphemica.com/%C3%91) but I can't seem to figure it out. When I open the file in Notepad++ it shows the character as xD1; when I open it in gedit it shows \D1. When I open the file in Excel, it shows the character correctly.
Now I'm trying to open the file in python, it halts when it encounters the character. I'm aware that I can put in the encoding so the file can be opened properly but I'm not sure which encoding I should use. My error is
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd1 in position 0: invalid continuation byte
this is my code
with codecs.open('tsv.txt', 'r', 'utf8') as my_file:
    for line in my_file:
        print(line)
if it is not utf8, then what should I use? From the site above, it does not show which encoding 0xd1 is associated with.
You can find in tables how 'Ñ' gets encoded in different encodings.
You can also try it directly with Python:
>>> 'Ñ'.encode('utf8')
b'\xc3\x91'
>>> 'Ñ'.encode('latin1')
b'\xd1'
It seems that your file is encoded in latin-1.
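To confirm that guess in code, a small sketch (the sample file is created on the fly here; in the question, 'tsv.txt' already exists):

```python
# Assumption: the file is Latin-1, where the single byte 0xD1 is 'Ñ'.
raw = b'\xd1'                        # the byte Notepad++ shows as xD1
assert raw.decode('latin-1') == 'Ñ'

with open('tsv.txt', 'wb') as f:     # write a tiny sample for the demo
    f.write(raw + b'\n')

with open('tsv.txt', 'r', encoding='latin-1') as my_file:
    text = my_file.read()
```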

Python pandas load csv ANSI Format as UTF-8

I want to load a CSV File with pandas in Jupyter Notebooks which contains characters like ä,ö,ü,ß.
When I open the CSV file with Notepad++, here is one example row which causes trouble in ANSI format:
Empf„nger;Empf„ngerStadt;Empf„ngerStraáe;Empf„ngerHausnr.;Empf„ngerPLZ;Empf„ngerLand
The correct UTF-8 outcome for Empf„nger should be: Empfänger
Now when I load the CSV data in Python 3.6 pandas on Windows with the following code:
df_a = pd.read_csv('file.csv',sep=';',encoding='utf-8')
I get an error message:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position xy: invalid continuation byte
Position 'xy' is the position of the character that causes the error message.
When I use the ANSI format to load my CSV file it works, but displays the umlauts incorrectly.
Example code:
df_a = pd.read_csv('afile.csv',sep=';',encoding='ANSI')
Empfänger is represented as: Empf„nger
Note: I have tried to convert the file to UTF-8 in Notepad++ and load it afterwards with the pandas module, but I still get the same error.
I have searched online for a solution, but the provided solutions such as "change format in notepad++ to utf-8", "use encoding='UTF-8'", or 'latin1' (which gives me the same result as ANSI format), or
import chardet
with open('afile.csv', 'rb') as f:
    result = chardet.detect(f.readline())
df_a = pd.read_csv('afile.csv', sep=';', encoding=result['encoding'])
didn't work for me.
encoding='cp1252'
throws the following exception:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 2: character maps to <undefined>
I also tried to replace strings afterwards with str.replace(), but the character ü disappears completely after the file is loaded into a pandas DataFrame.
If you don't know your file's encoding, I think the fastest approach is to open the file in a text editor like Notepad++ to check how it is encoded.
Then you go to the Python documentation and look for the correct codec to use.
In your case, ANSI, the codec is 'mbcs', so your code will look like this:
df_a = pd.read_csv('file.csv',sep=';',encoding='mbcs')
When EmpfängerStraße shows up as Empf„ngerStraáe when decoded as "ANSI", or more correctly cp1250 in this case, then the actual encoding of the data is most likely cp850:
print 'Empf„ngerStraáe'.decode('utf8').encode('cp1250').decode('cp850')
Or Python 3, where literal strings are already unicode strings:
print("Empf„ngerStraáe".encode("cp1250").decode("cp850"))
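If that diagnosis holds, the round-trip can be checked directly, and the CSV can then be read with the cp850 codec (the read_csv line is a sketch using the question's filename):

```python
# Encode the garbled text back to the on-disk bytes (as cp1250/"ANSI"
# displayed them), then decode those bytes as CP850:
raw = "Empf„ngerStraáe".encode("cp1250")
fixed = raw.decode("cp850")
assert fixed == "EmpfängerStraße"

# With the encoding confirmed, read the file directly:
# df_a = pd.read_csv('file.csv', sep=';', encoding='cp850')
```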
I couldn't find a proper solution after trying out all the well-known encodings from ISO-8859-1 to ISO-8859-15, from UTF-8 to UTF-32, and from Windows-1250 to Windows-1258; nothing worked properly. So my guess is that the text encoding got corrupted during the export. My own solution is to load the text file into a DataFrame with Windows-1251, as it does not cut out special characters in my text file, and then replace all broken characters with the corresponding ones. It's a rather dissatisfying solution that takes a lot of time to compute, but it's better than nothing.
You could use the encoding value UTF-16LE to solve the problem
pd.read_csv("./file.csv", encoding="UTF-16LE")
The file.csv should be saved with encoding UTF-16LE in Notepad++ (option UCS-2 LE BOM).
cp1252 works on both Linux and Windows to decode latin1-encoded files.
df = pd.read_csv('data.csv',sep=';',encoding='cp1252')
Although, if you are running on a windows machine, I would recommend using
df = pd.read_csv('data.csv', sep=';', encoding='mbcs')
Ironically, using 'latin1' in the encoding does not always work. Especially if you want to convert file to a different encoding.

Encoding csv files on opening with Python

So i have this csv which has rows like these:
"41975","IT","Catania","2016-01-12T10:57:50+01:00",409.58
"538352","DE","Düsseldorf","2015-12-18T20:50:21+01:00",95.03
"V22211","GB","Nottingham","2015-12-31T11:17:59+00:00",872
In the current example the first and the third rows work fine, but the program crashes when it prints Düsseldorf; the ü is problematic.
I want to be able to get the information from this csv file and to be able to print it. Here is my code:
def load_sales(file_name):
    SALES_ID = 0
    SALES_COUNTRY = 1
    SALES_CITY = 2
    SALES_DATE = 3
    SALES_PRICE = 4
    with open(file_name, 'r', newline='', encoding='utf8') as r:
        reader = csv.reader(r)
        result = []
        for row in reader:
            sale = {}
            sale["id"] = row[SALES_ID]
            sale["country"] = row[SALES_COUNTRY]
            sale["city"] = row[SALES_CITY]
            sale["date"] = row[SALES_DATE]
            sale["price"] = float(row[SALES_PRICE])
            result.append(sale)
    return result
When I print the result I get:
File "C:\Anaconda3\lib\encodings\cp866.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\xfc' in position 384: character maps to <undefined>
So far I have tried: changing the encoding value in the open function with utf-8, UTF8 etc., and making a print function:
def write_uft8(data):
    print(data.encode('utf-8'))
But this is not a viable way when you have to print a list of dictionaries.
Someone told me that the problem is that my python is not set to encode to these messages to utf-8, is that true and how do I change it ?
The issue here is that when python writes to a stream, it attempts to write text in a fashion that is compatible with the encoding or character set of that stream.
In this case, it appears you are running the command in a Windows console that is set to display Cyrillic text (CP866). The Cyrillic codepage does not contain a corresponding character for ü, and thus the string cannot be encoded to an appropriate character for output.
Changing the active codepage of your windows cmd console to utf-8 should help:
chcp 65001
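If changing the codepage is not possible, another sketch is to print an ASCII-only serialization of the result; json.dumps escapes non-ASCII characters by default, so any console encoding can display it (the sample row is taken from the question's data):

```python
import json

# ensure_ascii=True (the default) escapes ü as \u00fc, so the output
# is safe even on a CP866 console.
sales = [{"id": "538352", "country": "DE", "city": "Düsseldorf", "price": 95.03}]
print(json.dumps(sales, ensure_ascii=True))
```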

Processing CSV with Python

I am having an issue processing CSVs. I get this error:
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 22: character maps to <undefined>
What do I need to do to fix this? I think there is an issue where the CSV matches with the user's input.
import csv

csvanswer = input("Type your issue here!").lower()
searchanswer = csvanswer.split()
userinput = open("task2csv.csv")
csv_task = csv.reader(userinput)
for row in csv_task:
    if row[0] in searchanswer:
        print(row)
        break
Your input file is probably in an encoding other than the default on your system. You can fix this by explicitly providing the correct file encoding to the open call (you should also pass newline='' to the open call to properly obey the csv module's requirements).
For example, if the file is UTF-8, you'd do:
userinput = open("task2csv.csv", encoding='utf-8', newline='')
If it's some other encoding (UTF-16 is common for files produced by Windows programs), you'd use that. If it's some terrible non-UTF encoding, you're going to have to figure out the locale settings on the machine that produced it, they could be any number of different encodings.
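A small sketch of that trial-and-error (the candidate list and demo file are assumptions; latin-1 goes last because it accepts any byte sequence and so never fails):

```python
import csv

CANDIDATES = ["utf-8", "utf-16", "cp1252", "latin-1"]

def sniff_encoding(path, candidates=CANDIDATES):
    # Return the first candidate codec that decodes the whole file
    # without error, or None if none fit.
    for enc in candidates:
        try:
            with open(path, encoding=enc, newline='') as f:
                f.read()
            return enc
        except UnicodeError:
            continue
    return None

# Hypothetical demo file standing in for task2csv.csv:
with open("demo.csv", "w", encoding="utf-16") as f:
    f.write("Düsseldorf,95.03\n")

enc = sniff_encoding("demo.csv")
with open("demo.csv", encoding=enc, newline='') as userinput:
    rows = list(csv.reader(userinput))
```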

encoding='utf-8' raise UnicodeEncodeError when opening utf-8 file with Chinese char

I can't open a file with any Chinese characters with encoding set to utf-8:
text = open('file.txt', mode='r', encoding='utf-8').read()
print(text)
UnicodeEncodeError: 'charmap' codec can't encode character '\u70e6' in position 0: character maps to <undefined>
The file is 100% utf-8.
http://asdfasd.net/423/file.txt
http://asdfasd.net/423/test.py
If I remove encoding='utf-8' everything is ok.
What is wrong here with encoding?
I always use encoding='utf-8' when opening files; I don't know what happened now.
The exception you see comes from printing your data. Printing requires that you encode the data to the encoding used by your terminal or Windows console.
You can see this from the exception (and from the traceback, but you didn't include that); if you have a problem with decoding data (which is what happens when you read from a file) then you would get a UnicodeDecodeError, you got a UnicodeEncodeError instead.
You need to either adjust your terminal or console encoding, or not print the data.
See http://wiki.python.org/moin/PrintFails for troubleshooting help.
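If adjusting the console is not an option, one sketch of the "don't print" route is to keep everything in UTF-8 and write the result to a file instead (filenames mirror the question; the input file is created here for the demo):

```python
# Create a sample input containing the character from the traceback (烦):
with open('file.txt', 'w', encoding='utf-8') as f:
    f.write('\u70e6')

text = open('file.txt', mode='r', encoding='utf-8').read()

# Writing to a UTF-8 file avoids the console encoding entirely:
with open('out.txt', 'w', encoding='utf-8') as out:
    out.write(text)
```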
