So I have this CSV which has rows like these:
"41975","IT","Catania","2016-01-12T10:57:50+01:00",409.58
"538352","DE","Düsseldorf","2015-12-18T20:50:21+01:00",95.03
"V22211","GB","Nottingham","2015-12-31T11:17:59+00:00",872
In the current example the first and the third rows work fine, but the program crashes when it prints Düsseldorf; the ü is the problem.
I want to be able to get the information from this csv file and to be able to print it. Here is my code:
import csv

def load_sales(file_name):
    SALES_ID = 0
    SALES_COUNTRY = 1
    SALES_CITY = 2
    SALES_DATE = 3
    SALES_PRICE = 4
    with open(file_name, 'r', newline='', encoding='utf8') as r:
        reader = csv.reader(r)
        result = []
        for row in reader:
            sale = {}
            sale["id"] = row[SALES_ID]
            sale["country"] = row[SALES_COUNTRY]
            sale["city"] = row[SALES_CITY]
            sale["date"] = row[SALES_DATE]
            sale["price"] = float(row[SALES_PRICE])
            result.append(sale)
    return result
When I print the result I get:
File "C:\Anaconda3\lib\encodings\cp866.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\xfc' in position 384: character maps to <undefined>
So far I have tried: changing the encoding value in the open function to utf-8, UTF8, etc., and making a print function:
def write_utf8(data):
    print(data.encode('utf-8'))
But this is not a viable approach when you have to print a list of dictionaries.
Someone told me that the problem is that my Python is not set to encode these messages to UTF-8. Is that true, and how do I change it?
The issue here is that when Python writes to a stream, it attempts to write text in a fashion that is compatible with the encoding or character set of that stream.
In this case, it appears you are running the command in a Windows console that is set to display Cyrillic text (CP866). The Cyrillic code page does not contain a corresponding character for ü, and thus the string cannot be encoded to an appropriate byte sequence for output.
Changing the active code page of your Windows cmd console to UTF-8 should help:
> chcp 65001
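If you would rather fix this from inside the script instead of the console, Python 3.7+ lets you reconfigure the standard streams; a minimal sketch, assuming Python 3.7 or newer:

```python
import sys

# Reconfigure stdout to emit UTF-8 regardless of the console code page.
# sys.stdout.reconfigure exists on Python 3.7+; guard for older versions.
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8", errors="replace")

print("Düsseldorf")  # no UnicodeEncodeError even under CP866
```

Setting the PYTHONUTF8=1 environment variable (also 3.7+) has a similar effect for all streams without touching the code.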
Related
I'm working on an application which is using UTF-8 encoding. For debugging purposes I need to print the text. If I use print() directly with a variable containing my Unicode string, e.g. print(pred_str),
I get this error:
UnicodeEncodeError: 'charmap' codec can't encode character '\ufeff' in position 0: character maps to <undefined>
So I tried print(pred_str.encode('utf-8')) and my output looks like this:
b'\xef\xbb\xbfpudgala-dharma-nair\xc4\x81tmyayo\xe1\xb8\xa5 apratipanna-vipratipann\xc4\x81n\xc4\x81m'
b'avipar\xc4\xabta-pudgala-dharma-nair\xc4\x81tmya-pratip\xc4\x81dana-artham'
b'tri\xe1\xb9\x83\xc5\x9bik\xc4\x81-vij\xc3\xb1apti-prakara\xe1\xb9\x87a-\xc4\x81rambha\xe1\xb8\xa5'
b'pudgala-dharma-nair\xc4\x81tmya-pratip\xc4\x81danam punar kle\xc5\x9ba-j\xc3\xb1eya-\xc4\x81vara\xe1\xb9\x87a-prah\xc4\x81\xe1\xb9\x87a-artham'
But, I want my output to look like this:
pudgala-dharma-nairātmyayoḥ apratipanna-vipratipannānām
aviparīta-pudgala-dharma-nairātmya-pratipādana-artham
triṃśikā-vijñapti-prakaraṇa-ārambhaḥ
pudgala-dharma-nairātmya-pratipādanam punar kleśa-jñeya-āvaraṇa-prahāṇa-artham
If I save my string to a file using:
with codecs.open('out.txt', 'w', 'UTF-8') as f:
    f.write(pred_str)
it saves the string as expected.
Your data is encoded with the "UTF-8-SIG" codec, which is sometimes used in Microsoft environments.
This variant of UTF-8 prefixes encoded text with a byte order mark '\xef\xbb\xbf', to make it easier for applications to detect UTF-8 encoded text vs other encodings.
You can decode such bytestrings like this:
>>> bs = b'\xef\xbb\xbfpudgala-dharma-nair\xc4\x81tmyayo\xe1\xb8\xa5 apratipanna-vipratipann\xc4\x81n\xc4\x81m'
>>> text = bs.decode('utf-8-sig')
>>> print(text)
pudgala-dharma-nairātmyayoḥ apratipanna-vipratipannānām
To read such data from a file:
with open('myfile.txt', 'r', encoding='utf-8-sig') as f:
    text = f.read()
Note that even after decoding from UTF-8-SIG, you may still be unable to print your data because your console's default code page may not be able to encode other non-ascii characters in the data. In that case you will need to adjust your console settings to support UTF-8.
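To see the difference between the two codecs end to end, here is a small round-trip sketch (the file name and temp directory are made up for illustration):

```python
import os
import tempfile

# Write text with a BOM via utf-8-sig, then read it back two ways to
# show that only the utf-8-sig codec strips the leading '\ufeff'.
text = "pudgala-dharma-nairātmyayoḥ"
path = os.path.join(tempfile.mkdtemp(), "bom.txt")

with open(path, "w", encoding="utf-8-sig") as f:
    f.write(text)

with open(path, encoding="utf-8") as f:
    raw = f.read()        # BOM survives as a leading '\ufeff'
with open(path, encoding="utf-8-sig") as f:
    clean = f.read()      # BOM stripped automatically

print(raw == "\ufeff" + text)    # True
print(clean == text)             # True
```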
Try this code:
if pred_str.startswith('\ufeff'):
    pred_str = pred_str.split('\ufeff')[1]
Here is the code:
s = 'Waitematā'
w = open('test.txt','w')
w.write(s)
w.close()
I get the following error.
UnicodeEncodeError: 'charmap' codec can't encode character '\u0101' in position 8: character maps to <undefined>
The string will print with the macron a, ā. However, I am not able to write it to a .txt or .csv file.
Am I able to swap out the macron a, ā, for one without a macron? Thanks for the help in advance.
Note that if you open a file with open('text.txt', 'w') and write a string to it, you are not writing the string itself to the file, but its encoded bytes. Which encoding is used depends on your LANG environment variable or other platform factors.
To force UTF-8, as you suggested in title, you can try this:
w = open('text.txt', 'wb') # note for binary
w.write(s.encode('utf-8')) # convert str into byte explicitly
w.close()
As documented in open:
In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding.
Not all encodings support all Unicode characters. Since the encoding is platform dependent when not specified, it is better and more portable to be explicit and call out the encoding when reading or writing a text file. UTF-8 supports all Unicode code points:
s = 'Waitematā'
with open('text.txt','w',encoding='utf8') as w:
    w.write(s)
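As a quick sanity check, the same explicit encoding on the read side round-trips the macron cleanly; a throwaway temp file is used here just for illustration:

```python
import os
import tempfile

# Round-trip sketch: write and read with an explicit utf8 encoding so
# the result does not depend on the platform's default code page.
s = 'Waitematā'
path = os.path.join(tempfile.mkdtemp(), 'text.txt')
with open(path, 'w', encoding='utf8') as w:
    w.write(s)
with open(path, encoding='utf8') as r:
    print(r.read() == s)   # True
```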
I have exported a CSV from Kaggle - https://www.kaggle.com/ngyptr/python-nltk-sentiment-analysis. However, when I attempt to iterate through the file, I receive unicode errors concerning certain characters that cannot be encoded.
File "C:\Program Files\Python35\lib\encodings\cp850.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2026' in position 264: character maps to <undefined>
I have enabled utf-8 encoding while opening the file, which I assumed would have handled the non-ASCII characters. Evidently not.
My Code:
with open("sentimentDataSet.csv", "r", encoding="utf-8", errors='ignore', newline='') as file:
    reader = csv.reader(file)
    for row in reader:
        if row:
            print(row)
            if row[sentimentCsvColumn] == sentimentScores(row[textCsvColumn]):
                accuracyCount += 1
print(accuracyCount)
That's an encode error raised as you're printing the row; it has little to do with reading the actual CSV.
Your Windows terminal is in CP850 encoding, which can't represent everything.
There are some things you can do here.
A simple way is to set the PYTHONIOENCODING environment variable to a combination that will trash things it can't represent. set PYTHONIOENCODING=cp850:replace before running Python will have Python replace characters unrepresentable in CP850 with question marks.
Change your terminal encoding to UTF-8: chcp 65001 before running Python.
Encode the thing by hand before printing: print(str(data).encode('ascii', 'replace'))
Don't print the thing.
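The effect of the replace error handler can also be seen directly, without touching environment variables; a small sketch, assuming your console really is CP850 as the traceback suggests:

```python
# '\u2026' (the '…' ellipsis) has no mapping in CP850; the 'replace'
# handler substitutes '?' instead of raising UnicodeEncodeError.
row = ["I love this movie\u2026"]
safe = str(row).encode("cp850", errors="replace").decode("cp850")
print(safe)   # ['I love this movie?']
```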
I'm reading in a CSV file that has UTF8 encoding:
ifile = open(fname, "r")
for row in csv.reader(ifile):
    name = row[0]
    print repr(row[0])
This works fine, and prints out what I expect it to print out; a UTF8 encoded str:
> '\xc3\x81lvaro Salazar'
> '\xc3\x89lodie Yung'
...
Furthermore, when I simply print the str (as opposed to its repr()) the output displays OK (which I don't understand either way; shouldn't this cause an error?):
> Álvaro Salazar
> Élodie Yung
but when I try to convert my UTF8 encoded strs to unicode:
ifile = open(fname, "r")
for row in csv.reader(ifile):
    name = row[0]
    print unicode(name, 'utf-8') # or name.decode('utf-8')
I get the infamous:
Traceback (most recent call last):
File "scripts/script.py", line 33, in <module>
print unicode(fullname, 'utf-8')
UnicodeEncodeError: 'ascii' codec can't encode character u'\xc1' in position 0: ordinal not in range(128)
So I looked at the unicode strings that are created:
ifile = open(fname, "r")
for row in csv.reader(ifile):
    name = row[0]
    unicode_name = unicode(name, 'utf-8')
    print repr(unicode_name)
and the output is
> u'\xc1lvaro Salazar'
> u'\xc9lodie Yung'
So now I'm totally confused as these seem to be mangled hex values. I've read this question:
Reading a UTF8 CSV file with Python
and it appears I am doing everything correctly, leading me to believe that my file is not actually UTF8; but when I initially print out the repr values of the cells, they appear to be correct UTF8 hex values. Can anyone either point out my problem or indicate where my understanding is breaking down (as I'm starting to get lost in the jungle of encodings)?
As an aside, I believe I could use codecs to open the file and read it directly into unicode objects, but the csv module doesn't support unicode natively, so I can't use that approach.
Your default encoding is ASCII. When you try to print a unicode object, the interpreter therefore tries to encode it using the ASCII codec, which fails because your text includes characters that don't exist in ASCII.
The reason that printing the UTF-8 encoded bytestring doesn't produce an error (which seems to confuse you, although it shouldn't) is that this simply sends the bytes to your terminal. It will never produce a Python error, although it may produce ugly output if your terminal doesn't know what to do with the bytes.
To print a unicode, use print some_unicode.encode('utf-8'). (Or whatever encoding your terminal is actually using).
As for the u'\xc1lvaro Salazar', nothing here is mangled. The character Á is at the Unicode codepoint C1 (which has nothing to do with its UTF-8 representation, but happens to be the same value as in Latin-1), and Python uses \x hex escapes instead of \u unicode codepoint notation for codepoints that would have 00 as the most significant byte, to save space (it could also have displayed this as \u00c1).
To get a good overview of how Unicode works in Python, I suggest http://nedbatchelder.com/text/unipain.html
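For contrast, here is a Python 3 sketch of the same read: the file object does the decoding, so csv hands you str (Unicode) rows directly and there is no manual decode step to get wrong (io.StringIO stands in for the CSV file here):

```python
import csv
import io

# In Python 3, open(..., encoding='utf-8') decodes for you, and
# csv.reader yields str rows that print without any .encode() dance.
data = io.StringIO("Álvaro Salazar,1\nÉlodie Yung,2\n")
for row in csv.reader(data):
    print(row[0])
```

This prints the two names as readable text, which is the behaviour the question was trying to get under Python 2.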
Edit: http://pastebin.com/W4iG3tjS - the file
I have a text file encoded in utf8 with some Cyrillic text it. To load it, I use the following code:
import codecs
fopen = codecs.open('thefile', 'r', encoding='utf8')
fread = fopen.read()
fread dumps the file on the screen all unicode-ish (escape sequences), while print fread displays it in readable form (ASCII, I guess).
I then try to split it and write it to an empty file with no encoding:
a = fread.split()
for l in a:
    print>>dasFile, l
But I get the following error message: UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-13: ordinal not in range(128)
Is there a way to dump fread.split() into a file? How can I get rid of this error?
Since you've opened and read the file via codecs.open(), it's been decoded to Unicode. So to output it you need to encode it again, presumably back to UTF-8.
for l in a:
    dasFile.write(l.encode('utf-8'))
print is going to use the default encoding, which is normally "ascii". So you see that error with print. But you can open a file and write directly to it.
a = fopen.readlines() # returns a list of lines already, with line endings intact
# do something with a
dasFile.writelines(a) # doesn't add line endings, expects them to be present already.
This assumes the lines in a have already been encoded to bytes.
PS. You should also investigate the io module.
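A minimal sketch of that io-module route (the file name is illustrative): io.open takes an encoding argument, so both reading and writing stay in Unicode and the encode step happens transparently.

```python
import io
import os
import tempfile

# io.open with an explicit encoding reads/writes Unicode text directly,
# so no manual .encode('utf-8') is needed before writing each line.
path = os.path.join(tempfile.mkdtemp(), 'cyrillic.txt')
with io.open(path, 'w', encoding='utf8') as f:
    f.write(u'привет мир\n')
with io.open(path, encoding='utf8') as f:
    print(f.read().split())
```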