Creating a new csv with UTF-8 encoding - python

I have a process which requires a csv to be created if it does not currently exist. I then open this csv and write some data to it.
with open("foo.csv", "w") as my_empty_csv:
# now you have an empty file already
# This is where I write my data to the file
This is the code I'm currently using, but I don't know what default encoding the file is created with if it doesn't already exist.
What would be a better way to create the file with UTF-8 encoding if it doesn't exist?

The open function has an optional 'encoding' parameter that you can use to explicitly specify the encoding of the file:
with open("foo.csv", "w", encoding="utf-8") as my_empty_csv:
...
More specifically, the documentation says about this parameter:
encoding is the name of the encoding used to decode or encode the file. This should only be used in text mode. The default encoding is platform dependent (whatever locale.getpreferredencoding() returns), but any text encoding supported by Python can be used. See the codecs module for the list of supported encodings.
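If you are curious what that platform-dependent default would have been on your machine, you can check it directly; a quick sketch (the printed value depends on your OS and locale settings):
import locale

# Prints the encoding open() would use when none is given,
# e.g. 'UTF-8' on most Linux/macOS systems or 'cp1252' on many Windows setups.
print(locale.getpreferredencoding(False))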

You should be able to do it this way:
with open("foo.csv", "w", newline='', encoding='utf-8') as my_empty_csv:
    ...  # other logic
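Putting the two answers together, here is a minimal sketch of creating the file (open in "w" mode creates it if it doesn't exist) and writing rows through the csv module; the rows are made up for illustration:
import csv

# Example data only; the non-ASCII characters round-trip as UTF-8.
rows = [["name", "city"], ["Åsa", "Malmö"]]

with open("foo.csv", "w", newline="", encoding="utf-8") as my_empty_csv:
    writer = csv.writer(my_empty_csv)
    writer.writerows(rows)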

Related

How to read languages in Python [duplicate]

When I read a file in python and print it to the screen, it does not read certain characters properly, however, those same characters hard coded into a variable print just fine. Here is an example where "test.html" contains the text "Hallå":
with open('test.html', 'r') as file:
    Str = file.read()
    print(Str)

Str = "Hallå"
print(Str)
This generates the following output:
hallå
Hallå
My guess is that there is something wrong with how the data in the file is being interpreted when it is read into Python; however, I am uncertain what it is, since Python 3.8.5 already uses UTF-8 encoding by default.
The open function does not use UTF-8 by default. As the documentation says:
In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding.
So, it depends, and to be certain, you have to specify the encoding yourself. If the file is saved in UTF-8, you should do this:
with open('test.html', 'r', encoding='utf-8') as file:
On the other hand, it is not clear whether the file is or is not saved in UTF-8 encoding. If it is not, you'll have to choose a different one.
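To see why the default matters, here is a small sketch of what happens when UTF-8 bytes are decoded with a different codec (latin-1 is only an example stand-in for a non-UTF-8 platform default):
data = "Hallå".encode("utf-8")
print(data.decode("latin-1"))  # 'HallÃ¥' -- mangled, wrong codec
print(data.decode("utf-8"))    # 'Hallå'  -- correct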

Which encoding does the built in CSV parser in Python 3 support?

I need to know which encodings it supports, but it's not in the documentation:
https://docs.python.org/3/library/csv.html
Here are the formats I want to support:
ANSI
UTF-8
UNICODE
win1251
UTF-16LE
Is there an inclusive list that I can use to build my UI on?
EDIT: My files are on an external FTP server, uploaded by users, so they will not use my system's default encoding. They can be in any format. I need to tell the user which encodings I support.
csv is not encoding-aware. Use open() for that.
From the docs you linked:
Since open() is used to open a CSV file for reading, the file will by default be decoded into unicode using the system default encoding (see locale.getpreferredencoding()). To decode a file using a different encoding, use the encoding argument of open:
import csv
with open('some.csv', newline='', encoding='utf-8') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)
The same applies to writing in something other than the system default encoding: specify the encoding argument when opening the output file.
For which encodings are supported, see the docs for open():
encoding is the name of the encoding used to decode or encode the file. This should only be used in text mode. The default encoding is platform dependent (whatever locale.getpreferredencoding() returns), but any text encoding supported by Python can be used. See the codecs module for the list of supported encodings.
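There is no separate list for the csv module, then; any codec Python knows about works. If you want to validate the choices you expose in your UI, codecs.lookup() is one way to do it. Here is a sketch with an assumed mapping from the labels in the question to Python codec names (treating "ANSI" as cp1252 is a guess that depends on your users' Windows locale, and "UNICODE" is taken to mean UTF-16):
import codecs

# Assumed mapping from UI labels to Python codec names.
ui_labels = {
    "ANSI": "cp1252",      # assumption: Western European Windows
    "UTF-8": "utf-8",
    "UNICODE": "utf-16",   # assumption: "UNICODE" means UTF-16 with BOM
    "win1251": "cp1251",
    "UTF-16LE": "utf-16-le",
}

for label, codec_name in ui_labels.items():
    try:
        print(label, "->", codecs.lookup(codec_name).name)
    except LookupError:
        print(label, "-> not supported by this Python")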

SyntaxError: Non-ASCII character '\xfe' in file error happens

A SyntaxError: Non-ASCII character '\xfe' in file error happens.
I want to read a TSV file and change it into a CSV file. When I run this app, this error happens.
I wrote:
# coding: shift_jis
import libraries as libraries
import DataCleaning
import csv
media = 'Google'
tsv = csv.reader(file(r"data/aaa.csv"), delimiter = '\t',encoding='UTF-16')
for row in tsv:
    print ", ".join(row)
I think ASCII is the problem, but I do not know how to fix this.
My TSV file is shift_jis, and in the end I want to convert it to UTF-8. But I think this error happens because I did not specify the encoding as UTF-16.
The csv module on Python 2 is not Unicode friendly. You can't pass encoding to it as an argument, it's not a recognized argument (only csv format parameters are accepted as keyword arguments). It can't work with the Py2 unicode type correctly, so using it involves reading in binary mode, and even then, it only works properly when newlines are one byte per character. Per the csv module docs:
Note: This version of the csv module doesn’t support Unicode input. Also, there are currently some issues regarding ASCII NUL characters. Accordingly, all input should be UTF-8 or printable ASCII to be safe; see the examples in section Examples.
If at all possible, switch to Python 3, where the csv module works with Py3's Unicode-friendly str by default, bypassing all the issues from Python 2's csv module, and encoding can be passed to open correctly. In that case, your code simplifies to:
with open(r"data/aaa.csv", encoding='utf-16', newline='') as inf:
tsv = csv.reader(inf, delimiter='\t')
# Explicit encoding argument may be needed for TextIOWrapper;
# the rewrapping is done to ensure newline='' is used as the csv module requires
csv.writer(io.TextIOWrapper(sys.stdout.buffer, newline='')).writerows(tsv)
Or to write as CSV to a UTF-8 encoded file:
with open(r"data/aaa.csv", encoding='utf-16', newline='') as inf, open(outfilename, "w", encoding='utf-8', newline='') as outf:
tsv = csv.reader(inf, delimiter='\t')
csv.writer(outf).writerows(tsv)
Failing that, take a look at the unicodecsv module on PyPI, which should handle Unicode input properly on Python 2.
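If the file really is Shift_JIS, as the question states, rather than UTF-16, the Python 3 conversion to a UTF-8 CSV would look roughly like this (the output filename is made up):
import csv

with open(r"data/aaa.csv", encoding="shift_jis", newline="") as inf, \
     open(r"data/aaa_utf8.csv", "w", encoding="utf-8", newline="") as outf:
    reader = csv.reader(inf, delimiter="\t")
    csv.writer(outf).writerows(reader)  # rewrite the TSV rows as a UTF-8 CSV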

How to avoid encoding parameter when opening file in Python3

When I am working on a .txt file on a Windows device I must save as either: ANSI, Unicode, Unicode big endian, or UTF-8. When I run Python3 on an OSX device and try to import and read the .txt file, I have to do something along the lines of:
with open('ships.txt', 'r', encoding='utf-8') as f:
    for line in f.readlines():
        print(line)
Is there a particular format I should use to encode the .txt file on the Windows device to avoid adding the encoding parameter when opening the file in Python?
Call locale.getpreferredencoding(False) on your OSX device. That's the default encoding used for reading a file on that device. Save in that encoding on Windows and you won't need to specify the encoding on your OSX device.
But as the Zen of Python says, "Explicit is better than implicit." Since you know the encoding, why not specify it?
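As an aside, if you do settle on saving from Windows as UTF-8, note that Notepad has historically prepended a byte order mark. Opening with the 'utf-8-sig' codec is a tolerant sketch that works whether or not a BOM is present:
# 'utf-8-sig' decodes plain UTF-8 as well, but strips a leading BOM if there is one.
with open('ships.txt', 'r', encoding='utf-8-sig') as f:
    for line in f:
        print(line, end='')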

Python: How to preserve Ä,Ö,Ü when writing to file

I open 2 files in Python, change and replace some of their content and write the new output into a 3rd file.
My 2 input files are XMLs, encoded in 'UTF-8 without BOM' and they have German Ä,Ö,Ü and ß in them.
When I open my output XML file in Notepad++, the encoding is not specified (i.e. there's no encoding checked in the 'Encoding' tab). My Ä, Ö, Ü and ß are transformed into something like
&#252;
When I create the output in Python, I use
with open('file', 'w') as fout:
    fout.write(etree.tostring(tree.getroot()).decode('utf-8'))
What do I have to do instead?
I think this should work:
import codecs
with codecs.open("file.xml", 'w', "utf-8") as fout:
    ...  # do stuff with the file pointer
To write an ElementTree object tree to a file named 'file' using the 'utf-8' character encoding:
tree.write('file', encoding='utf-8')
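For example, a tiny end-to-end sketch (element names made up) showing that tree.write() keeps the umlauts intact when asked for UTF-8:
import xml.etree.ElementTree as etree

root = etree.Element('words')
etree.SubElement(root, 'word').text = 'Übung mit Ä, Ö, Ü und ß'
tree = etree.ElementTree(root)

# Writes real UTF-8 bytes, not character references.
tree.write('file', encoding='utf-8', xml_declaration=True)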
When writing raw bytestrings, you want to open the file in binary mode:
with open('file', 'wb') as fout:
    fout.write(xyz)
Otherwise the open call opens the file in text mode and expects unicode strings instead, and will encode them for you.
To decode is to interpret bytes in some encoding (like UTF-8), and the output is unicode text. If you do want to decode first, specify an encoding when opening the file in text mode:
with open(file, 'w', encoding='utf-8') as fout:
    fout.write(xyz.decode('utf-8'))
If you don't specify an encoding, Python will use a default, which is usually a Bad Thing. Note that since you already have UTF-8 encoded byte strings to start with, this decode/re-encode roundtrip is actually redundant.
Note that Python file operations never transform existing unicode code points into XML character entities (such as &#252;); other code you have could do this, but you didn't share that with us.
I found Joel Spolsky's article on Unicode invaluable when it comes to understanding encodings and unicode.
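In fact, the character references most likely come from etree.tostring() itself: its default serialization is ASCII and turns non-ASCII text into character references, while passing an encoding produces real UTF-8 bytes. A short sketch with the standard library ElementTree:
import xml.etree.ElementTree as etree

root = etree.Element('word')
root.text = 'Übung'

# Default serialization is ASCII-safe: non-ASCII text becomes character references.
print(etree.tostring(root))  # b'<word>&#220;bung</word>'

# Asking for UTF-8 yields real UTF-8 bytes, suitable for a file opened in 'wb' mode.
with open('file', 'wb') as fout:
    fout.write(etree.tostring(root, encoding='utf-8'))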
Some explanation of xml.etree.ElementTree for Python 2 and its parse() function. The function takes the source as its first argument; it can be either an open file object or a filename. The function creates the ElementTree instance and then passes the argument to tree.parse(...), which looks like this:
def parse(self, source, parser=None):
    if not hasattr(source, "read"):
        source = open(source, "rb")
    if not parser:
        parser = XMLParser(target=TreeBuilder())
    while 1:
        data = source.read(65536)
        if not data:
            break
        parser.feed(data)
    self._root = parser.close()
    return self._root
You can see from the third line that if a filename was passed, the file is opened in binary mode. This way, if the file content is UTF-8, you are processing elements with UTF-8 encoded binary content. If this is the case, you should also open the output file in binary mode.
Another possibility is to use codecs.open(filename, encoding='utf-8') to open the input file and pass the open file object to xml.etree.ElementTree.parse(...). This way, the ElementTree instance will work with Unicode strings, and you should encode the result to UTF-8 when writing the content back. In that case, you can also use codecs.open(...) with UTF-8 for writing: pass the opened output file object to the aforementioned tree.write(f), or let tree.write(filename, encoding='utf-8') open the file for you.
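A minimal Python 2 sketch of that codecs.open approach, assuming the input XML is UTF-8 and using placeholder filenames:
import codecs
import xml.etree.ElementTree as etree

with codecs.open('input.xml', encoding='utf-8') as f:
    tree = etree.parse(f)

# ... modify the tree here ...

# Let ElementTree encode the output to UTF-8 itself.
tree.write('output.xml', encoding='utf-8')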
