python encoding error with utf-8

I want to write some strings to a file that are not in English; they are in the Azeri language. Even if I do UTF-8 encoding, I get the following error:
TypeError: write() argument must be str, not bytes
Even if I change the code to:
t_w = text_list[y].encode('utf-8')
new_file.write(t_w)
new_file.write('\n')
I get the following error, which is:
TypeError: write() argument must be str, not bytes
The reason why I don't open the file as 'wb' is that I am writing different strings and integers to the file.

If text_list contains Unicode strings, you should encode (not decode) them before saving them to the file.
Try this instead:
t_w = text_list[y].encode('utf-8')
Also, it could be helpful to look at the codecs standard module: https://docs.python.org/2/library/codecs.html. You could try this:
import codecs

with codecs.open('path/to/file', 'w', 'utf-8') as f:
    f.write(text_list[y])
    f.write(u'\n')
But note that codecs always opens files in binary mode.

When writing in text mode in Python 3 (I assume you use only Python 3, not Python 2), the file object encodes strings for you, so do not encode them yourself; the default encoding comes from the locale, so pass encoding='utf-8' explicitly to be safe. Or open your file in binary mode and encode EVERYTHING you write to your file. I suggest NOT using binary mode in your case. So, your code will look like this:
with open('myfile.txt', 'w', encoding='utf-8') as new_file:
    t_w = text_list[y]
    new_file.write(t_w)
    new_file.write('\n')
or for Python 2:
new_file = open('myfile.txt', 'wb')
t_w = text_list[y].encode('utf-8') # I assume you use Unicode strings
new_file.write(t_w)
new_file.write(b'\n')
new_file.close()
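Since the question mentions writing both strings and integers, here is a minimal Python 3 sketch (the values list below is a hypothetical example) that converts everything with str() so the file can stay in text mode:
values = ['Bakı', 'küçə', 42]  # hypothetical mix of Azeri strings and an integer

with open('myfile.txt', 'w', encoding='utf-8') as new_file:
    for v in values:
        new_file.write(str(v))  # str() handles both str and int values
        new_file.write('\n')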

Related

SyntaxError: Non-ASCII character '\xfe' in file error happen

I want to read a TSV file and convert it into a CSV file. When I run this app, this error happens.
I wrote:
# coding: shift_jis
import libraries as libraries
import DataCleaning
import csv
media = 'Google'
tsv = csv.reader(file(r"data/aaa.csv"), delimiter = '\t',encoding='UTF-16')
for row in tsv:
    print ", ".join(row)
I think ASCII is wrong, but I do not know how to fix this.
My TSV file is shift_jis, and finally I want to change it into UTF-8. But I think this error happens because I did not designate the encoding as UTF-16.
The csv module on Python 2 is not Unicode-friendly. You can't pass encoding to it as an argument; it's not a recognized argument (only csv format parameters are accepted as keyword arguments). It can't work with the Py2 unicode type correctly, so using it involves reading in binary mode, and even then it only works properly when newlines are one byte per character. Per the csv module docs:
Note: This version of the csv module doesn’t support Unicode input. Also, there are currently some issues regarding ASCII NUL characters. Accordingly, all input should be UTF-8 or printable ASCII to be safe; see the examples in section Examples.
If at all possible, switch to Python 3, where the csv module works with Py3's Unicode-friendly str by default, bypassing all the issues from Python 2's csv module, and encoding can be passed to open correctly. In that case, your code simplifies to:
import csv
import io
import sys

with open(r"data/aaa.csv", encoding='utf-16', newline='') as inf:
    tsv = csv.reader(inf, delimiter='\t')
    # Explicit encoding argument may be needed for TextIOWrapper;
    # the rewrapping is done to ensure newline='' is used as the csv module requires
    csv.writer(io.TextIOWrapper(sys.stdout.buffer, newline='')).writerows(tsv)
Or to write as CSV to a UTF-8 encoded file:
with open(r"data/aaa.csv", encoding='utf-16', newline='') as inf, open(outfilename, "w", encoding='utf-8', newline='') as outf:
    tsv = csv.reader(inf, delimiter='\t')
    csv.writer(outf).writerows(tsv)
Failing that, take a look at the unicodecsv module on PyPI, which should handle Unicode input properly on Python 2.
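A rough Python 2 sketch using unicodecsv, assuming (per its documentation) that its reader and writer accept an encoding keyword and binary-mode file objects; the output filename is a placeholder:
import unicodecsv

with open(r"data/aaa.csv", "rb") as inf, open("out.csv", "wb") as outf:
    tsv = unicodecsv.reader(inf, delimiter='\t', encoding='utf-16')
    unicodecsv.writer(outf, encoding='utf-8').writerows(tsv)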

How do I convert unicode to unicode-escaped text [duplicate]

This question already has an answer here:
How to encode Python 3 string using \u escape code?
I'm loading a file with a bunch of Unicode characters (e.g. \xe9\x87\x8b). I want to convert these characters to their escaped-unicode form (\u91cb) in Python. I've found a couple of similar questions here on Stack Overflow, including Evaluate UTF-8 literal escape sequences in a string in Python3, which does almost exactly what I want, but I can't work out how to save the data.
For example:
Input file:
\xe9\x87\x8b
Python Script
file = open("input.txt", "r")
text = file.read()
file.close()
encoded = text.encode().decode('unicode-escape').encode('latin1').decode('utf-8')
file = open("output.txt", "w")
file.write(encoded) # fails with a unicode exception
file.close()
Output File (That I would like):
\u91cb
You need to encode it again with unicode-escape encoding.
>>> br'\xe9\x87\x8b'.decode('unicode-escape').encode('latin1').decode('utf-8')
'釋'
>>> _.encode('unicode-escape')
b'\\u91cb'
Code modified (using binary mode to avoid unnecessary encodes/decodes):
with open("input.txt", "rb") as f:
    text = f.read().rstrip()  # rstrip to remove trailing spaces

decoded = text.decode('unicode-escape').encode('latin1').decode('utf-8')

with open("output.txt", "wb") as f:
    f.write(decoded.encode('unicode-escape'))
http://asciinema.org/a/797ruy4u5gd1vsv8pplzlb6kq
\xe9\x87\x8b is not a Unicode character. It looks like a representation of a bytestring that represents 釋 Unicode character encoded using utf-8 character encoding. \u91cb is a representation of 釋 character in Python source code (or in JSON format). Don't confuse the text representation and the character itself:
>>> b"\xe9\x87\x8b".decode('utf-8')
u'\u91cb' # repr()
>>> print(b"\xe9\x87\x8b".decode('utf-8'))
釋
>>> import unicodedata
>>> unicodedata.name(b"\xe9\x87\x8b".decode('utf-8'))
'CJK UNIFIED IDEOGRAPH-91CB'
To read text encoded as utf-8 from a file, specify the character encoding explicitly:
with open('input.txt', encoding='utf-8') as file:
    unicode_text = file.read()
It is exactly the same for saving Unicode text to a file:
with open('output.txt', 'w', encoding='utf-8') as file:
    file.write(unicode_text)
If you omit the explicit encoding parameter, then locale.getpreferredencoding(False) is used, which may produce mojibake if it does not correspond to the actual character encoding used to save the file.
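A quick way to see what default your system would use (this is the standard locale call mentioned above):
import locale

print(locale.getpreferredencoding(False))  # e.g. 'cp1252' on many Windows systems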
If your input file literally contains \xe9 (4 characters), then you should fix whatever software generates it. If you need to use 'unicode-escape', something is broken.
It looks as if your input file is UTF-8 encoded, so specify UTF-8 encoding when you open the file (Python 3 is assumed, as per your reference):
with open("input.txt", "r", encoding='utf8') as f:
    text = f.read()
text will contain the content of the file as a str (i.e. unicode string). Now you can write it in unicode escaped form directly to a file by specifying encoding='unicode-escape':
with open('output.txt', 'w', encoding='unicode-escape') as f:
    f.write(text)
The content of your file will now contain unicode-escaped literals:
$ cat output.txt
\u91cb

Ansi to UTF-8 using python causing error

While I was trying to write a Python program that converts ANSI to UTF-8, I found this:
https://stackoverflow.com/questions/14732996/how-can-i-convert-utf-8-to-ansi-in-python
which converts UTF-8 to ANSI.
I thought it would just work by reversing the order, so I coded:
file_path_ansi = "input.txt"
file_path_utf8 = "output.txt"
#open and encode the original content
file_source = open(file_path_ansi, mode='r', encoding='latin-1', errors='ignore')
file_content = file_source.read()
file_source.close
#write
file_target = open(file_path_utf8, mode='w', encoding='utf-8')
file_target.write(file_content)
file_target.close
But it causes an error:
TypeError: file() takes at most 3 arguments (4 given)
So I changed
file_source = open(file_path_ansi, mode='r', encoding='latin-1', errors='ignore')
to
file_source = open(file_path_ansi, mode='r', encoding='latin-1')
Then it causes another error:
TypeError: 'encoding' is an invalid keyword argument for this function
How should I fix my code to solve this problem?
You are trying to use the Python 3 version of the open() function, on Python 2. Between the major versions, I/O support was overhauled, supporting better encoding and decoding.
You can get the same new version in Python 2 as io.open() instead.
I'd use the shutil.copyfileobj() function to do the copying, so you don't have to read the whole file into memory:
import io
import shutil

with io.open(file_path_ansi, encoding='latin-1', errors='ignore') as source:
    with io.open(file_path_utf8, mode='w', encoding='utf-8') as target:
        shutil.copyfileobj(source, target)
Be careful, though; most people talking about ANSI refer to one of the Windows codepages; you may really have a file in CP (codepage) 1252, which is almost, but not quite, the same thing as ISO-8859-1 (Latin-1). If so, use cp1252 instead of latin-1 as the encoding parameter.
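To see why the distinction matters, here is a tiny illustration (Python 3 syntax) of one byte where the two encodings disagree:
# Byte 0x80 is the euro sign in cp1252, but an invisible control character in latin-1.
print(b'\x80'.decode('cp1252'))   # €
print(b'\x80'.decode('latin-1'))  # '\x80' control character, not a euro sign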

Python: How to preserve Ä,Ö,Ü when writing to file

I open 2 files in Python, change and replace some of their content and write the new output into a 3rd file.
My 2 input files are XMLs, encoded in 'UTF-8 without BOM' and they have German Ä,Ö,Ü and ß in them.
When I open my output XML file in Notepad++, the encoding is not specified (i.e. there's no encoding checked in the 'Encoding' tab). My Ä,Ö,Ü and ß are transformed into something like
ü
When I create the output in Python, I use
with open('file', 'w') as fout:
    fout.write(etree.tostring(tree.getroot()).decode('utf-8'))
What do I have to do instead?
I think this should work:
import codecs

with codecs.open("file.xml", 'w', "utf-8") as fout:
    # do stuff with the file pointer, e.g.:
    fout.write(etree.tostring(tree.getroot()).decode('utf-8'))
To write an ElementTree object tree to a file named 'file' using the 'utf-8' character encoding:
tree.write('file', encoding='utf-8')
When writing raw bytestrings, you want to open the file in binary mode:
with open('file', 'wb') as fout:
    fout.write(xyz)
Otherwise the open call opens the file in text mode and expects unicode strings instead, and will encode them for you.
To decode is to interpret an encoding (like UTF-8); the output is Unicode text. If you do want to decode first, specify an encoding when opening the file in text mode:
with open(file, 'w', encoding='utf-8') as fout:
    fout.write(xyz.decode('utf-8'))
If you don't specify an encoding, Python will use a default, which is usually a Bad Thing. Note that since you already have UTF-8 encoded byte strings to start with, this decode-and-re-encode step is actually pointless.
Note that Python file operations never transform existing Unicode code points into XML character entities (such as ü); other code you have could do this, but you didn't share it with us.
I found Joel Spolsky's article on Unicode invaluable when it comes to understanding encodings and unicode.
Some explanation of xml.etree.ElementTree for Python 2, and of its parse() function. The function takes the source as its first argument; it can be an open file object or a filename. The function creates the ElementTree instance, and then it passes the argument to tree.parse(...), which looks like this:
def parse(self, source, parser=None):
    if not hasattr(source, "read"):
        source = open(source, "rb")
    if not parser:
        parser = XMLParser(target=TreeBuilder())
    while 1:
        data = source.read(65536)
        if not data:
            break
        parser.feed(data)
    self._root = parser.close()
    return self._root
You can guess from the third line that if a filename was passed, the file is opened in binary mode. This way, if the file content is UTF-8, you are processing elements with UTF-8 encoded binary content. If this is the case, you should open the output file in binary mode as well.
Another possibility is to use codecs.open(filename, encoding='utf-8') for opening the input file, and to pass the open file object to xml.etree.ElementTree.parse(...). This way, the ElementTree instance will work with Unicode strings, and you should encode the result to UTF-8 when writing the content back. If this is the case, you can use codecs.open(...) with UTF-8 also for writing. You can pass the opened output file object to the mentioned tree.write(f), or you can let tree.write(filename, encoding='utf-8') open the file for you.
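A rough sketch of that second approach, following the description above (the 'in.xml' and 'out.xml' file names are placeholders):
import codecs
import xml.etree.ElementTree as ET

# Parse from a codecs-wrapped file object, then serialize back as UTF-8.
with codecs.open('in.xml', encoding='utf-8') as f:
    tree = ET.parse(f)

# ... change and replace content in the tree here ...

tree.write('out.xml', encoding='utf-8')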

Convert UTF-8 with BOM to UTF-8 with no BOM in Python

Two questions here. I have a set of files which are usually UTF-8 with BOM. I'd like to convert them (ideally in place) to UTF-8 with no BOM. It seems like codecs.StreamRecoder(stream, encode, decode, Reader, Writer, errors) would handle this, but I don't really see any good examples of its usage. Would this be the best way to handle this?
source files:
Tue Jan 17$ file brh-m-157.json
brh-m-157.json: UTF-8 Unicode (with BOM) text
Also, it would be ideal if we could handle different input encodings without explicitly knowing them (I have seen ASCII and UTF-16). It seems like this should all be feasible. Is there a solution that can take any known Python encoding and output as UTF-8 without BOM?
Edit 1: proposed solution from below (thanks!)
fp = open('brh-m-157.json','rw')
s = fp.read()
u = s.decode('utf-8-sig')
s = u.encode('utf-8')
print fp.encoding
fp.write(s)
This gives me the following error:
IOError: [Errno 9] Bad file descriptor
Newsflash: I'm being told in the comments that the mistake is that I open the file with mode 'rw' instead of 'r+'/'r+b', so I should eventually re-edit my question and remove the solved part.
Simply use the "utf-8-sig" codec:
fp = open("file.txt")
s = fp.read()
u = s.decode("utf-8-sig")
That gives you a unicode string without the BOM. You can then use
s = u.encode("utf-8")
to get a normal UTF-8 encoded string back in s. If your files are big, then you should avoid reading them all into memory. The BOM is simply three bytes at the beginning of the file, so you can use this code to strip them out of the file:
import os, sys, codecs

BUFSIZE = 4096
BOMLEN = len(codecs.BOM_UTF8)

path = sys.argv[1]
with open(path, "r+b") as fp:
    chunk = fp.read(BUFSIZE)
    if chunk.startswith(codecs.BOM_UTF8):
        i = 0
        chunk = chunk[BOMLEN:]
        while chunk:
            fp.seek(i)
            fp.write(chunk)
            i += len(chunk)
            fp.seek(BOMLEN, os.SEEK_CUR)
            chunk = fp.read(BUFSIZE)
        fp.seek(-BOMLEN, os.SEEK_CUR)
        fp.truncate()
It opens the file, reads a chunk, and writes it out to the file 3 bytes earlier than where it read it. The file is rewritten in place. An easier solution is to write the shorter content to a new file, as in newtover's answer. That would be simpler, but it uses twice the disk space for a short period.
As for guessing the encoding, you can just loop through the encodings from most to least specific:
def decode(s):
    for encoding in "utf-8-sig", "utf-16":
        try:
            return s.decode(encoding)
        except UnicodeDecodeError:
            continue
    return s.decode("latin-1")  # will always work
A UTF-16 encoded file won't decode as UTF-8, so we try UTF-8 first. If that fails, then we try UTF-16. Finally, we use Latin-1; this will always work since all 256 byte values are legal in Latin-1. You may want to return None instead in this case, since it's really a fallback and your code might want to handle it more carefully (if it can).
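A hedged usage sketch of the decode() helper above, tying it back to the original goal (read raw bytes, guess the encoding, rewrite as UTF-8 without a BOM):
with open("brh-m-157.json", "rb") as fp:
    u = decode(fp.read())  # guess-decode the raw bytes

with open("brh-m-157.json", "wb") as fp:
    fp.write(u.encode("utf-8"))  # plain UTF-8, no BOM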
In Python 3 it's quite easy: read the file and rewrite it with utf-8 encoding:
s = open(bom_file, mode='r', encoding='utf-8-sig').read()
open(bom_file, mode='w', encoding='utf-8').write(s)
import codecs
import shutil
import sys

# Python 2 filter: strip a leading UTF-8 BOM from stdin, then copy the rest to stdout unchanged.
s = sys.stdin.read(3)
if s != codecs.BOM_UTF8:
    sys.stdout.write(s)
shutil.copyfileobj(sys.stdin, sys.stdout)
I found this question because I was having trouble with configparser.ConfigParser().read(fp) when opening files with a UTF-8 BOM header.
For those who are looking for a solution to remove the header so that ConfigParser can open the config file instead of reporting the error "File contains no section headers", please open the file like the following:
configparser.ConfigParser().read(config_file_path, encoding="utf-8-sig")
This could save you tons of effort by making removal of the BOM header of the file unnecessary.
(I know this sounds unrelated, but hopefully this could help people struggling like me.)
This is my implementation to convert any kind of encoding to UTF-8 without BOM, replacing Windows line endings with the universal format:
import codecs
import os

import chardet  # third-party package used for encoding detection


def utf8_converter(file_path, universal_endline=True):
    '''
    Convert any type of file to UTF-8 without BOM
    and using universal endline by default.

    Parameters
    ----------
    file_path : string, file path.
    universal_endline : boolean (True),
        by default convert endlines to universal format.
    '''
    # Fix file path
    file_path = os.path.realpath(os.path.expanduser(file_path))
    # Read raw bytes from file
    file_open = open(file_path, 'rb')
    raw = file_open.read()
    file_open.close()
    # Decode using the detected encoding
    raw = raw.decode(chardet.detect(raw)['encoding'])
    # Remove Windows end of line
    if universal_endline:
        raw = raw.replace('\r\n', '\n')
    # Encode to UTF-8
    raw = raw.encode('utf8')
    # Remove BOM
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw.replace(codecs.BOM_UTF8, b'', 1)
    # Write bytes back to file
    file_open = open(file_path, 'wb')
    file_open.write(raw)
    file_open.close()
    return 0
You can use codecs.
import codecs

# Python 2: read raw bytes, drop a leading UTF-8 BOM if present, then decode.
with open("test.txt", 'rb') as filehandle:
    content = filehandle.read()
if content[:3] == codecs.BOM_UTF8:
    content = content[3:]
print content.decode("utf-8")
In Python 3 you should add encoding='utf-8-sig':
with open(file_name, mode='a', encoding='utf-8-sig') as csvfile:
    csvfile.writelines(rows)
That's it.
