CSV, DictWriter, unicode and utf-8 - python

I am having problems with the DictWriter and non-ascii characters. A short version of my problem:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import codecs
import csv
f = codecs.open("test.csv", 'w', 'utf-8')
writer = csv.DictWriter(f, ['field1'], delimiter='\t')
writer.writerow({'field1':u'å'.encode('utf-8')})
f.close()
Gives this Traceback:
Traceback (most recent call last):
  File "test.py", line 10, in <module>
    writer.writerow({'field1':u'å'.encode('utf-8')})
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/csv.py", line 124, in writerow
    return self.writer.writerow(self._dict_to_list(rowdict))
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/codecs.py", line 638, in write
    return self.writer.write(data)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/codecs.py", line 303, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
I am a bit lost, as DictWriter ought to be able to handle UTF-8 from what I have read in the documentation.

The file object you obtain from codecs.open expects a unicode string in its write method -- that's the whole point. csv.DictWriter, of course, is calling that method with a utf-8-encoded byte string instead, whence the exception.
Change f's creation to f = open("test.csv", 'wb') (taking codecs out of the picture) and things should work just fine.
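For reference, under Python 3 this whole dance goes away: the csv module writes unicode strings directly. A minimal sketch, assuming the same filename and field name as the question:

```python
import csv

# Python 3: open in text mode with an explicit encoding and newline=''
# (as the csv docs recommend); no manual .encode() is needed.
with open("test.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.DictWriter(f, ["field1"], delimiter="\t")
    writer.writerow({"field1": "å"})

with open("test.csv", encoding="utf-8", newline="") as f:
    print(repr(f.read()))  # 'å\r\n' -- the csv module's default line terminator
```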

Related

I get a python frameworks error while reading a csv file; when I try a different, simpler file it works fine

import csv
exampleFile = open('example.csv')
exampleReader = csv.reader(exampleFile)
for row in exampleReader:
    print('Row #' + str(exampleReader.line_num) + ' ' + str(row))
Traceback (most recent call last):
  File "/Users/jossan113/Documents/Python II/test.py", line 7, in <module>
    for row in exampleReader:
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0x89 in position 4627: ordinal not in range(128)
Does anyone have any idea why I get this error? I tried a very simple csv file from the internet and it worked just fine, but when I try the bigger file it doesn't.
The file contains non-ASCII characters, which were painful to deal with in old versions of Python. Since you are using 3.5, try opening the file as utf-8 and see if the issue goes away:
exampleFile = open('example.csv', encoding="utf-8")
From the docs:
Since open() is used to open a CSV file for reading, the file will by default be decoded into unicode using the system default encoding (see locale.getpreferredencoding()). To decode a file using a different encoding, use the encoding argument of open:
import csv
with open('some.csv', newline='', encoding='utf-8') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)
csv module docs

UnicodeEncodeError when reading a file

I am trying to read from rockyou wordlist and write all words that are >= 8 chars to a new file.
Here is the code -
def main():
    with open("rockyou.txt", encoding="utf8") as in_file, open('rockout.txt', 'w') as out_file:
        for line in in_file:
            if len(line.rstrip()) < 8:
                continue
            print(line, file=out_file, end='')
    print("done")

if __name__ == '__main__':
    main()
Some words are not utf-8.
Traceback (most recent call last):
  File "wpa_rock.py", line 10, in <module>
    main()
  File "wpa_rock.py", line 6, in main
    print(line, file=out_file, end='')
  File "C:\Python\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u0e45' in position 0: character maps to <undefined>
Update
def main():
    with open("rockyou.txt", encoding="utf8") as in_file, open('rockout.txt', 'w', encoding="utf8") as out_file:
        for line in in_file:
            if len(line.rstrip()) < 8:
                continue
            out_file.write(line)
    print("done")

if __name__ == '__main__':
    main()
Traceback (most recent call last):
  File "wpa_rock.py", line 10, in <module>
    main()
  File "wpa_rock.py", line 3, in main
    for line in in_file:
  File "C:\Python\lib\codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 933: invalid continuation byte
Your UnicodeEncodeError: 'charmap' error occurs during writing to out_file (in print()).
By default, open() uses locale.getpreferredencoding() that is ANSI codepage on Windows (such as cp1252) that can't represent all Unicode characters and '\u0e45' character in particular. cp1252 is a one-byte encoding that can represent at most 256 different characters but there are a million (1114111) Unicode characters. It can't represent them all.
Pass an encoding that can represent all the desired data, e.g., encoding='utf-8' should work (as @robyschek suggested); if your code reads utf-8 data without any errors then it should be able to write that data using utf-8 too.
Your UnicodeDecodeError: 'utf-8' error occurs during reading in_file (for line in in_file). Not all byte sequences are valid utf-8 e.g., os.urandom(100).decode('utf-8') may fail. What to do depends on the application.
If you expect the file to be encoded as utf-8; you could pass errors="ignore" open() parameter, to ignore occasional invalid byte sequences. Or you could use some other error handlers depending on your application.
If the actual character encoding used in the file is different then you should pass that encoding instead. Bytes by themselves do not have any encoding; that metadata should come from another source (though some encodings are more likely than others: chardet can guess). For example, if the file content is an HTTP body, see "A good way to get the charset/encoding of an HTTP response in Python".
Sometimes a broken software can generate mostly utf-8 byte sequences with some bytes in a different encoding. bs4.BeautifulSoup can handle some special cases. You could also try ftfy utility/library and see if it helps in your case e.g., ftfy may fix some utf-8 variations.
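A tiny demonstration of these error handlers (the filename and byte sequence below are made up for the example):

```python
# Write a file containing a byte that is invalid UTF-8, then read it
# back with different error handlers.
with open("mixed.txt", "wb") as f:
    f.write(b"caf\xf1 latte\n")  # 0xf1 is not a valid UTF-8 sequence here

with open("mixed.txt", encoding="utf-8", errors="ignore") as f:
    print(f.read())  # 'caf latte' -- the bad byte is silently dropped

with open("mixed.txt", encoding="utf-8", errors="replace") as f:
    print(f.read())  # 'caf\ufffd latte' -- the bad byte becomes U+FFFD
```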
Hey, I was having a similar issue. In the case of the rockyou.txt wordlist, I tried a number of encodings that Python has to offer and found that encoding='koi8_u' worked to read the file.

Python3 CSV reader Unicode Decode error

I have a huge csv file in utf8 encoding, but some columns have an encoding that differs from the main file's encoding. It looks like:
input.txt in UTF-8 encoding:
a,b,c
d,"e?",f
g,h,"kü"
same input.txt in win-1252
a,b,c
d,"eü",f
g,h,"kü
Code:
import csv
file = open("input.txt", encoding="...")
c = csv.reader(file, delimiter=';', quotechar='"')
for itm in c:
    print(itm)
and the standard python3 csv reader raises an encoding error on such lines. I cannot just skip reading these lines, because I still need the always well-encoded "someOther" column from them.
Is it possible, using the standard csv reader, to split the CSV data in some "bytes mode" and then convert each array element to a normal python unicode string, or should I implement my own csv reader?
Traceback:
Traceback (most recent call last):
  File "C:\Development\t.py", line 7, in <module>
    for itm in c:
  File "C:\Users\User\AppData\Local\Programs\Python\Python35-32\lib\codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 11: invalid start byte
How sure are you that your file is UTF8 encoded?
For the small sample that you've posted UTF8 decoding fails on the ü which is "LATIN SMALL LETTER U WITH DIAERESIS". When encoded as ISO-8859-1, ü is '\xfc'. Two other possibilities are that the CSV file is UTF-16 encoded (UTF-16 little endian is common on Windows), or even Windows-1252.
If your CSV file is encoded in one of the ISO-8859-X family of encodings; any of ISO 8859-1/3/4/9/10/14/15/16 encode ü as 0xfc.
To solve, use the correct encoding and open the file like this:
file = open("input.txt", encoding="iso-8859-1")
or, for Windows 1252:
file = open("input.txt", encoding="windows-1252")
or, for UTF-16:
file = open("input.txt", encoding="utf-16") # or utf-16-le or utf-16-be as required
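If you are unsure which of these applies, a quick way to narrow it down is to decode a sample of the raw bytes with each candidate encoding and see which ones succeed. A sketch, with a byte string that simulates the "kü" line from the question:

```python
# Try a few candidate encodings on raw bytes and report which decode cleanly.
raw = b'g,h,"k\xfc"\n'  # 0xfc is 'ü' in the ISO-8859-X / windows-1252 family

for enc in ["utf-8", "windows-1252", "iso-8859-1", "utf-16"]:
    try:
        print(enc, "->", raw.decode(enc).strip())
    except UnicodeDecodeError as exc:
        print(enc, "-> failed:", exc.reason)
```

In practice you would read the sample with `open("input.txt", "rb")` and decode a few kilobytes rather than the whole file.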

UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

I don't know exactly what the source of this error is or how to fix it. I am getting it by running this code.
Traceback (most recent call last):
  File "t1.py", line 86, in <module>
    write_results(results)
  File "t1.py", line 34, in write_results
    dw.writerows(results)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/csv.py", line 154, in writerows
    return self.writer.writerows(rows)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
Any explanation is really appreciated!
I changed the code and now I get this error:
  File "t1.py", line 88, in <module>
    write_results(results)
  File "t1.py", line 35, in write_results
    dw.writerows(results)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/csv.py", line 154, in writerows
    return self.writer.writerows(rows)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
Here's the change:
with codecs.open('results.csv', 'wb', 'utf-8') as f:
    dw = csv.DictWriter(f, fieldnames=fields, delimiter='|')
    dw.writer.writerow(dw.fieldnames)
    dw.writerows(results)
The error is raised by this part of the code:
with open('results.csv', 'w') as f:
    dw = csv.DictWriter(f, fieldnames=fields, delimiter='|')
    dw.writer.writerow(dw.fieldnames)
    dw.writerows(results)
You're opening an ASCII file, and then you're trying to write non-ASCII data to it. I guess that whoever wrote that script happened to never encounter a non-ASCII character during testing, so they never ran into an error.
But if you look at the docs for the csv module, you'll see that the module can't correctly handle Unicode strings (which is what Beautiful Soup returns), that CSV files always have to be opened in binary mode, and that only UTF-8 or ASCII are safe to write.
So you need to encode all the strings to UTF-8 before writing them. I first thought that it should suffice to encode the strings on writing, but the Python 2 csv module chokes on the Unicode strings anyway. So I guess there's no other way but to encode each string explicitly:
In parse_results(), change the line
results.append({'url': url, 'create_date': create_date, 'title': title})
to
results.append({'url': url, 'create_date': create_date, 'title': title.encode("utf-8")})
That might already be sufficient since I don't expect URLs or dates to contain non-ASCII characters.
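For completeness, under Python 3 none of this encoding juggling is needed: open the file in text mode with an explicit encoding and write the unicode strings directly. A sketch with made-up rows shaped like the question's results:

```python
import csv

# Hypothetical data mirroring the question's structure
fields = ["url", "create_date", "title"]
results = [{"url": "http://example.com", "create_date": "2015-01-01",
            "title": "Smörgåsbord"}]

with open("results.csv", "w", encoding="utf-8", newline="") as f:
    dw = csv.DictWriter(f, fieldnames=fields, delimiter="|")
    dw.writeheader()  # preferred over dw.writer.writerow(dw.fieldnames)
    dw.writerows(results)
```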
This should work; it works for me. Code snippet:
import csv
import sys
reload(sys)
sys.setdefaultencoding('utf8')

data = [["a", "b", u'\xe9']]

with open("output.csv", "w") as csv_file:
    writer = csv.writer(csv_file, quoting=csv.QUOTE_ALL)
    writer.writerows(data)

Write xml utf-8 file with utf-8 data with ElementTree

I'm trying to write an xml file with utf-8 encoded data using ElementTree like this:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import xml.etree.ElementTree as ET
import codecs
testtag = ET.Element('unicodetag')
testtag.text = u'Töreboda' # the o is really ö (o with two dots over it)
expfile = codecs.open('testunicode.xml',"w","utf-8-sig")
ET.ElementTree(testtag).write(expfile,encoding="UTF-8",xml_declaration=True)
expfile.close()
This blows up with the error
Traceback (most recent call last):
  File "unicodetest.py", line 10, in <module>
    ET.ElementTree(testtag).write(expfile,encoding="UTF-8",xml_declaration=True)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 815, in write
    serialize(write, self._root, encoding, qnames, namespaces)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 932, in _serialize_xml
    write(_escape_cdata(text, encoding))
  File "/usr/lib/python2.7/codecs.py", line 691, in write
    return self.writer.write(data)
  File "/usr/lib/python2.7/codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)
Using the "us-ascii" encoding instead works fine, but it doesn't preserve the unicode characters in the data. What is happening?
codecs.open expects Unicode strings to be written to the file object and it will handle encoding to UTF-8. ElementTree's write encodes the Unicode strings to UTF-8 byte strings before sending them to the file object. Since the file object wants Unicode strings, it is coercing the byte string back to Unicode using the default ascii codec and causing the UnicodeDecodeError.
Just do this:
#expfile = codecs.open('testunicode.xml',"w","utf-8-sig")
ET.ElementTree(testtag).write('testunicode.xml',encoding="UTF-8",xml_declaration=True)
#expfile.close()
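A quick round-trip check (same tag and filename as above, written the same way in Python 3 syntax) confirms that the non-ASCII text survives when ElementTree is given a filename and handles the encoding itself:

```python
import xml.etree.ElementTree as ET

# Re-create the element from the question and let ElementTree open
# and encode the file itself.
testtag = ET.Element('unicodetag')
testtag.text = u'Töreboda'
ET.ElementTree(testtag).write('testunicode.xml', encoding="UTF-8",
                              xml_declaration=True)

root = ET.parse('testunicode.xml').getroot()
print(root.text)  # Töreboda
```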
