SyntaxError: Non-ASCII character '\xfe' in file - Python

I want to read a TSV file and convert it into a CSV file. When I run this app, I get this error:
SyntaxError: Non-ASCII character '\xfe' in file
This is what I wrote:
# coding: shift_jis
import libraries as libraries
import DataCleaning
import csv
media = 'Google'
tsv = csv.reader(file(r"data/aaa.csv"), delimiter = '\t',encoding='UTF-16')
for row in tsv:
    print ", ".join(row)
I think the ASCII handling is wrong, but I do not know how to fix it.
My TSV file is Shift_JIS encoded, and I ultimately want to convert it to UTF-8. I think this error happens because I did not specify the encoding as UTF-16.

The csv module on Python 2 is not Unicode friendly. You can't pass encoding to it as an argument; it's not a recognized argument (only csv format parameters are accepted as keyword arguments). It can't work with the Py2 unicode type correctly, so using it involves reading in binary mode, and even then it only works properly when newlines are one byte per character. Per the csv module docs:
Note: This version of the csv module doesn’t support Unicode input. Also, there are currently some issues regarding ASCII NUL characters. Accordingly, all input should be UTF-8 or printable ASCII to be safe; see the examples in section Examples.
If at all possible, switch to Python 3, where the csv module works with Py3's Unicode-friendly str by default, bypassing all the issues from Python 2's csv module, and encoding can be passed to open correctly. In that case, your code simplifies to:
with open(r"data/aaa.csv", encoding='utf-16', newline='') as inf:
tsv = csv.reader(inf, delimiter='\t')
# Explicit encoding argument may be needed for TextIOWrapper;
# the rewrapping is done to ensure newline='' is used as the csv module requires
csv.writer(io.TextIOWrapper(sys.stdout.buffer, newline='')).writerows(tsv)
Or to write as CSV to a UTF-8 encoded file:
with open(r"data/aaa.csv", encoding='utf-16', newline='') as inf, open(outfilename, "w", encoding='utf-8', newline='') as outf:
tsv = csv.reader(inf, delimiter='\t')
csv.writer(outf).writerows(tsv)
Failing that, take a look at the unicodecsv module on PyPI, which should handle Unicode input properly on Python 2.
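For reference, a minimal Python 2 sketch with unicodecsv (a third-party package; the encoding keyword here is unicodecsv's own extension, and the files must be opened in binary mode):
import unicodecsv  # pip install unicodecsv

with open(r"data/aaa.csv", "rb") as inf, open("out.csv", "wb") as outf:
    tsv = unicodecsv.reader(inf, delimiter='\t', encoding='utf-16')
    out = unicodecsv.writer(outf, encoding='utf-8')
    out.writerows(tsv)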

Related

Which encodings does the built-in CSV parser in Python 3 support?

I need to know which encodings it supports, but it's not in the documentation:
https://docs.python.org/3/library/csv.html
Here are the formats I want to support:
ANSI
UTF-8
UNICODE
win1251
UTF-16LE
Is there an inclusive list that I can use to build my UI on?
EDIT: My files are on an external FTP server, uploaded by users, so they will not use my system default encoding. They can be in any format. I need to tell the user which encodings I support.
csv is not encoding-aware. Use open() for that.
From the docs you linked:
Since open() is used to open a CSV file for reading, the file will by default be decoded into unicode using the system default encoding (see locale.getpreferredencoding()). To decode a file using a different encoding, use the encoding argument of open:
import csv
with open('some.csv', newline='', encoding='utf-8') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)
The same applies to writing in something other than the system default encoding: specify the encoding argument when opening the output file.
For which encodings are supported, see the docs for open():
encoding is the name of the encoding used to decode or encode the file. This should only be used in text mode. The default encoding is platform dependent (whatever locale.getpreferredencoding() returns), but any text encoding supported by Python can be used. See the codecs module for the list of supported encodings.
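Since the full codec list is long and varies by Python version, one sketch using only the standard library is to validate a user-supplied encoding name with codecs.lookup:
import codecs

def is_supported_encoding(name):
    """Return True if this Python build recognizes `name` as a codec."""
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False

# 'utf-8', 'utf-16-le' and 'cp1251' (win1251) are standard codec names.
# 'ANSI' is a Windows label rather than a portable codec name, so map it
# to a concrete codec such as cp1252 yourself before calling open().
for name in ('utf-8', 'utf-16-le', 'cp1251', 'ansi'):
    print(name, is_supported_encoding(name))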

How can I fix "UnicodeDecodeError: 'utf-8' codec can't decode bytes..." in Python?

I need to read specified rows and columns of a CSV file and write them into a TXT file, but I get a UnicodeDecodeError.
import csv
with open('output.csv', 'r', encoding='utf-8') as f:
    reader = csv.reader(f)
    your_list = list(reader)
print(your_list)
The reason for this error is perhaps that your CSV file does not use UTF-8 encoding. Find out the original encoding used for your document.
First of all, try using the default encoding by leaving out the encoding parameter:
with open('output.csv', 'r') as f:
    ...
If that does not work, try alternative encoding schemes that are commonly used, for example:
with open('output.csv', 'r', encoding="ISO-8859-1") as f:
    ...
If you get a UnicodeDecodeError with this code, it is likely that the CSV file is not UTF-8 encoded... The correct fix is to find out which encoding was actually used and specify it.
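To find that encoding, one common approach (a sketch, assuming the third-party chardet package; its detection is heuristic and can guess wrong) is:
import chardet  # pip install chardet

with open('output.csv', 'rb') as f:
    raw = f.read(100000)  # sample the beginning of the file
print(chardet.detect(raw))
# e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73, 'language': ''}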
If you only want quick and dirty workarounds, Python offers the errors=... option of open. From the documentation of open function in the standard library:
'strict' to raise a ValueError exception if there is an encoding error. The default value of None has the same effect.
'ignore' ignores errors. Note that ignoring encoding errors can lead to data loss.
'replace' causes a replacement marker (such as '?') to be inserted where there is malformed data.
'surrogateescape' will represent any incorrect bytes as code points in the Unicode Private Use Area ranging from U+DC80 to U+DCFF. These private code points will then be turned back into the same bytes when the surrogateescape error handler is used when writing data. This is useful for processing files in an unknown encoding.
'xmlcharrefreplace' is only supported when writing to a file. Characters not supported by the encoding are replaced with the appropriate XML character reference &#nnn;.
'backslashreplace' replaces malformed data by Python’s backslashed escape sequences.
'namereplace' (also only supported when writing) replaces unsupported characters with \N{...} escape sequences.
I often use errors='replace', when I only want to know that there were erroneous bytes or errors='backslashreplace' when I want to know what they were.
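For example, a quick sketch of the decode-side handlers (the byte 0xE9 stands in for whatever malformed data your file contains; the same errors= values work with open()):
data = b'caf\xe9'  # Latin-1 bytes, invalid as UTF-8

print(data.decode('utf-8', errors='replace'))           # caf�
print(data.decode('utf-8', errors='backslashreplace'))  # caf\xe9
print(data.decode('utf-8', errors='ignore'))            # caf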

python encoding error with utf-8

I want to write some strings that are not in English (they are in the Azeri language) to a file. Even if I do UTF-8 encoding, I get the following error:
TypeError: write() argument must be str, not bytes
Even if I change the code to:
t_w = text_list[y].encode('utf-8')
new_file.write(t_w)
new_file.write('\n')
I get the following error:
TypeError: write() argument must be str, not bytes
The reason why I don't open the file as 'wb' is that I am writing various strings and integers to the file.
If text_list contains unicode strings, you should encode (not decode) them to byte strings before saving them to the file.
Try this instead:
t_w = text_list[y].encode('utf-8')
Also, it could be helpful to look at the codecs standard module: https://docs.python.org/2/library/codecs.html. You could try this:
import codecs
with codecs.open('path/to/file', 'w', 'utf-8') as f:
    f.write(text_list[y])
    f.write(u'\n')
But note that codecs always opens files in binary mode.
When using write in text mode in Python 3 (I assume you use only Python 3, not Python 2), strings are encoded for you with the encoding passed to open() (the platform default if you omit it), so do not encode the strings yourself. Alternatively, open your file in binary mode and encode EVERYTHING you write to it. I suggest NOT using binary mode in your case. So your code will look like this:
with open('myfile.txt', 'w', encoding='utf-8') as new_file:
    t_w = text_list[y]
    new_file.write(t_w)
    new_file.write('\n')
or for Python 2:
new_file = open('myfile.txt', 'wb')
t_w = text_list[y].encode('utf-8')  # I assume you use Unicode strings
new_file.write(t_w)
new_file.write(b'\n')
new_file.close()
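As a sketch of an alternative that behaves the same on Python 2 and 3, io.open (which is the built-in open in Python 3) accepts an encoding argument and lets you write unicode strings directly:
import io

with io.open('myfile.txt', 'w', encoding='utf-8') as new_file:
    new_file.write(text_list[y])  # must be a unicode string on Python 2
    new_file.write(u'\n')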

The Python CSV writer is adding letters to the beginning of each element and issues with encode

So I'm trying to parse out JSON files into a tab delimited file. The parsing seems to work fine and all the data is coming through. Although the oddest thing is happening on the output file. I told it to use a tab delimiter and on the output it does use tabs, but it still seems to keep the single quotes. And for some reason it also seems to be adding the letter B to the beginning. I manually typed in the header, and that works fine, but the data itself is acting weird. Here's an example of the output I'm getting.
id created text screen name name latitude longitude place name place type
b'1234567890' b'Thu Mar 14 19:39:07 +0000 2013' "b""I'm at Bank Of America (Wayne, MI) http://t.co/asdf""" b'userid' b'username' 42.28286837 -83.38487864 b'Bank Of America, Wayne' b'poi'
b'1234567891' b'Thu Mar 14 19:39:16 +0000 2013' b'here is a sample tweet \xf0\x9f\x8f\x80 #notingoodhands' b'userid2' b'username2'
Here is the code that I'm using to write the data out.
out = open(filename, 'w')
out.write('id\tcreated\ttext\tscreen name\tname\tlatitude\tlongitude\tplace name\tplace type')
out.write('\n')
rows = zip(ids, times, texts, screen_names, names, lats, lons, place_names, place_types)
from csv import writer
csv = writer(out, dialect='excel', delimiter='\t')
for row in rows:
    values = [(value.encode('utf-8') if hasattr(value, 'encode') else value) for value in row]
    csv.writerow(values)
out.close()
So here's the thing: if I did this without the utf-8 bit and just output it straight, the formatting would be exactly how I want it. But when people type in special characters, the program crashes and isn't able to handle them.
Traceback (most recent call last):
File "tweets.py", line 34, in <module>
csv.writerow(values)
File "C:\Python33\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\U0001f3c0' in position 153: character maps to <undefined>
Adding the utf-8 bit converts it to the type of output you see here, but then it adds all these characters to the output. Does anyone have any thoughts on this?
You are writing byte data instead of unicode to your files, because you are encoding the data yourself.
Remove the encode calls altogether and let Python handle this for you; open the file with the UTF8 encoding and the rest takes care of itself:
out = open(filename, 'w', encoding='utf8')
This is documented in the csv module documentation:
Since open() is used to open a CSV file for reading, the file will by default be decoded into unicode using the system default encoding (see locale.getpreferredencoding()). To decode a file using a different encoding, use the encoding argument of open:
import csv
with open('some.csv', newline='', encoding='utf-8') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)
The same applies to writing in something other than the system default encoding: specify the encoding argument when opening the output file.
You've got multiple things going on here, but first, let's clear up a bit of confusion.
Encoding non-ASCII characters to UTF-8 means you get multiple bytes. For example, the character 🏀 is \xf0\x9f\x8f\x80 in UTF-8. But that's still just one character, it's just a character that takes four bytes. If you write the string to a binary file, then look at that file in a UTF-8-compatible tool (Notepad or TextEdit, or just cat on a UTF-8-friendly terminal/shell), you'll see one 🏀, not four garbage characters.
Second, b'abc' is not a string with b added to the beginning, it's the repr representation of the byte-string abc. The b is no more a part of the string than the quotes are.
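A quick interactive demonstration of that point:
>>> data = 'abc'.encode('utf-8')
>>> data             # the repr shows a b prefix and quotes...
b'abc'
>>> len(data)        # ...but the object holds only three bytes
3
>>> data.decode('utf-8')
'abc'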
Finally, in Python 3, you can't open a file in text mode and then write byte strings to it. Either open it in text mode, with an encoding, and write normal unicode strings, or open it in binary mode and write encoded byte strings.
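Putting that together for your script, here is a sketch of the text-mode fix (filename and rows are the names from your code; newline='' is what the csv module expects for its output files):
import csv

with open(filename, 'w', encoding='utf-8', newline='') as out:
    writer = csv.writer(out, dialect='excel', delimiter='\t')
    writer.writerow(['id', 'created', 'text', 'screen name', 'name',
                     'latitude', 'longitude', 'place name', 'place type'])
    writer.writerows(rows)  # plain str values; the file object encodes them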

Python: How to preserve Ä,Ö,Ü when writing to file

I open 2 files in Python, change and replace some of their content and write the new output into a 3rd file.
My 2 input files are XMLs, encoded in 'UTF-8 without BOM' and they have German Ä,Ö,Ü and ß in them.
When I open my output XML file in Notepad++, the encoding is not specified (i.e. there's no encoding checked in the 'Encoding' tab). My Ä,Ö,Ü and ß are transformed into something like
&#252;
When I create the output in Python, I use
with open('file', 'w') as fout:
    fout.write(etree.tostring(tree.getroot()).decode('utf-8'))
What do I have to do instead?
I think this should work:
import codecs
with codecs.open("file.xml", 'w', "utf-8") as fout:
    ...  # do stuff with the file pointer
To write an ElementTree object tree to a file named 'file' using the 'utf-8' character encoding:
tree.write('file', encoding='utf-8')
When writing raw bytestrings, you want to open the file in binary mode:
with open('file', 'wb') as fout:
    fout.write(xyz)
Otherwise the open call opens the file in text mode and expects unicode strings instead, and will encode them for you.
To decode is to interpret an encoding (like UTF-8); the output is Unicode text. If you do want to decode first, specify an encoding when opening the file in text mode:
with open(file, 'w', encoding='utf-8') as fout:
    fout.write(xyz.decode('utf-8'))
If you don't specify an encoding, Python will use a default, which is usually a Bad Thing. Note that since you already have UTF-8 encoded byte strings to start with, decoding and re-encoding like this is actually useless.
Note that Python file operations never transform existing Unicode code points into XML character entities (such as &#252;); other code you have could do this, but you didn't share that with us.
I found Joel Spolsky's article on Unicode invaluable when it comes to understanding encodings and unicode.
Some explanation of xml.etree.ElementTree for Python 2 and its parse() function: the function takes the source as its first argument, which can be an open file object or a filename. The function creates an ElementTree instance and then passes the argument to tree.parse(...), which looks like this:
def parse(self, source, parser=None):
    if not hasattr(source, "read"):
        source = open(source, "rb")
    if not parser:
        parser = XMLParser(target=TreeBuilder())
    while 1:
        data = source.read(65536)
        if not data:
            break
        parser.feed(data)
    self._root = parser.close()
    return self._root
You can guess from the third line that if a filename was passed, the file is opened in binary mode. This way, if the file content is UTF-8, you are processing elements with UTF-8 encoded binary content. If this is the case, you should also open the output file in binary mode.
Another possibility is to use codecs.open(filename, encoding='utf-8') to open the input file and pass the open file object to xml.etree.ElementTree.parse(...). This way, the ElementTree instance will work with Unicode strings, and you should encode the result to UTF-8 when writing the content back. In that case, you can also use codecs.open(...) with UTF-8 for writing: pass the opened output file object to the mentioned tree.write(f), or let tree.write(filename, encoding='utf-8') open the file for you.
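To make the output self-describing so editors like Notepad++ can detect the encoding, a minimal sketch (assuming tree is your ElementTree instance) is to let ElementTree emit the XML declaration itself:
# Writes <?xml version='1.0' encoding='utf-8'?> followed by the tree,
# all encoded as UTF-8 bytes.
tree.write('file.xml', encoding='utf-8', xml_declaration=True)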
