UnicodeDecodeError: save to file in python

I want to read a file, find something in it, and save the result, but when I try to save it I get an error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 19: ordinal not in range(128)
Code to save to file:
fileout.write((key + ';' + nameDict[key]+ ';'+src + alt +'\n').decode('utf-8'))
What can I do to fix it?
Thank you

You are trying to concatenate unicode values with byte strings, then turn the result to unicode, to write it to a file object that most likely only takes byte strings.
Don't mix unicode and byte strings like that.
Open the file to write to with io.open() to automate encoding Unicode values, then handle only unicode in your code:
import io

with io.open(filename, 'w', encoding='utf8') as fileout:
    # code gathering stuff from BeautifulSoup
    fileout.write(u'{};{};{}{}\n'.format(key, nameDict[key], src, alt))
You may want to check out the csv module to handle writing out delimiter-separated values. If you do go that route, you'll have to explicitly encode your columns:
import csv

with open(filename, 'wb') as fileout:
    writer = csv.writer(fileout, delimiter=';')
    # code gathering stuff from BeautifulSoup
    row = [key, nameDict[key], src + alt]
    writer.writerow([c.encode('utf8') for c in row])
If some of this data comes from other files, make sure you also decode to Unicode first; again, io.open() to read these files is probably the best option, to have the data decoded to Unicode values for you as you read.
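As a sketch of that reading side (the file name and contents below are invented purely so the example is self-contained):

```python
import io

# Hypothetical demo file, written here only to make the example runnable.
with io.open('names.txt', 'w', encoding='utf8') as f:
    f.write(u'Jos\u00e9;12\nL\u00f8kke;34\n')

# io.open() decodes while reading, so every line is already a unicode value
# and can be mixed safely with other unicode strings in your code.
with io.open('names.txt', 'r', encoding='utf8') as f:
    lines = [line.strip() for line in f]

print(lines)
```

Because decoding happens at the file boundary, the rest of your code never has to juggle byte strings at all.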

Related

Opening .dat file UnicodeDecodeError: invalid start byte

I am using Python to convert a .dat file (which you can find here) to csv in order for me to use it later in numpy or csv reader.
import csv

# read flash.dat to a list of lists
datContent = [i.strip().split() for i in open("./i2019.dat").readlines()]

# write it as a new CSV file
with open("./i2019.csv", "wb") as f:
    writer = csv.writer(f)
    writer.writerows(datContent)
But this results in an error message of
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8d in position 68: invalid start byte
Any help would be appreciated!
It seems your .dat file uses the Shift JIS (Japanese) encoding, so you can pass shift_jis as the encoding argument to open():
datContent = [i.strip().split() for i in open("./i2019.dat", encoding='shift_jis').readlines()]
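To finish the conversion under that assumption, here is a sketch for Python 3 (the sample data is made up, and the CSV target is opened in text mode with newline='', which Python 3's csv module expects):

```python
import csv

# Hypothetical sample data, saved the way the original .dat is assumed
# to be encoded (Shift JIS).
with open('i2019.dat', 'w', encoding='shift_jis') as f:
    f.write(u'\u65e5\u4ed8 \u5024\n2019 100\n')

# Read with the correct encoding...
with open('i2019.dat', encoding='shift_jis') as dat:
    rows = [line.strip().split() for line in dat]

# ...then write a UTF-8 CSV (Python 3: text mode plus newline='').
with open('i2019.csv', 'w', encoding='utf-8', newline='') as out:
    csv.writer(out).writerows(rows)
```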

'utf-8' codec can't decode byte 0x8a in position 170: invalid start byte

I am trying to do this:
fh = request.FILES['csv']
fh = io.StringIO(fh.read().decode())
reader = csv.DictReader(fh, delimiter=";")
This always fails with the error in the title, and I have spent almost 8 hours on it.
Here is my understanding:
I am using Python 3, so the file fh is in bytes. I am decoding it into a string and putting it in memory via StringIO.
Then csv.DictReader() tries to read it as a dict into memory. It is failing there.
I also tried io.StringIO(fh.read().decode('utf-8')), but got the same error.
what am I missing? :/
The error occurs because the file contains bytes that are not valid in the codec being used, so they cannot be decoded. One simple way to avoid this error is to decode such byte strings explicitly with decode(), passing the encoding the file was actually saved in (if a is the byte string with the non-ASCII character):
a.decode('utf-8')
Also, you could try opening the file as:
with open('filename', 'r', encoding='utf-8') as f:
    # your code using f as file pointer
Use 'rb' if your file is binary.
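A sketch of that idea applied to the upload case in the question: try UTF-8 first and fall back to a legacy encoding. The cp1252 guess and the sample payload below are assumptions for illustration, not from the original post (0x8a is a valid cp1252 byte, which makes it a common culprit).

```python
import csv
import io

# Hypothetical upload payload; in the question it comes from
# request.FILES['csv'].read(). Here it is built in cp1252 to
# reproduce the "invalid start byte" situation.
raw = u'name;city\nSm\u00f8rg\u00e5s;Oslo\n'.encode('cp1252')

try:
    text = raw.decode('utf-8')
except UnicodeDecodeError:
    # Fall back to the encoding the exporter is assumed to have used.
    text = raw.decode('cp1252')

reader = csv.DictReader(io.StringIO(text), delimiter=';')
rows = list(reader)
print(rows)
```

If you control the exporting application, fixing it to emit UTF-8 is more robust than guessing encodings on the receiving side.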

Problems with unicode using pystache

Rendering a bunch of files, I get - with some of them - the following problem:
'ascii' codec can't decode byte 0xe5 in position 7128: ordinal not in range(128)
It seems some of those files are in unicode. How can I read those files so that pystache is able to render them? Currently I am reading those files as follows:
content = open(filename, 'r').read()
Is there an equivalent (simple) way of reading the full unicode file?
I've fixed this by changing
content = open(filename, 'r').read()
to
import codecs
content = codecs.open(filename, 'r', 'utf-8').read()
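For reference, io.open() does the same job and behaves identically on Python 2 and 3 (the file name and template contents below are made up so the example runs on its own):

```python
import io

# Hypothetical template file, created only for demonstration.
with io.open('template.mustache', 'w', encoding='utf-8') as f:
    f.write(u'Hello {{name}} \u00e5')

# Like codecs.open(), io.open() decodes to unicode as it reads,
# so pystache receives text rather than raw bytes.
with io.open('template.mustache', 'r', encoding='utf-8') as f:
    content = f.read()
```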

Python CSV file UTF-16 to UTF-8 print error

There is a number of topics on this problem around the web, but I can not seem to find the answer for my specific case.
I have a CSV file. I am not sure what was done to it, but when I try to open it, I get:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 0: invalid start byte
Here is a full Traceback:
Traceback (most recent call last):
  File "keywords.py", line 31, in <module>
    main()
  File "keywords.py", line 28, in main
    get_csv(file_full_path)
  File "keywords.py", line 19, in get_csv
    for row in reader:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u5a07' in position 10: ordinal not in range(128)
With the help of Stack Overflow, I got it open with:
reader = csv.reader(codecs.open(file_full_path, 'rU', 'UTF-16'), delimiter='\t', quotechar='"')
Now the problem is that when I am reading the file:
def get_csv(file_full_path):
    import csv, codecs
    reader = csv.reader(codecs.open(file_full_path, 'rU', 'UTF-16'), delimiter='\t', quotechar='"')
    for row in reader:
        print row
I get stuck on Asian symbols:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u5a07' in position 10: ordinal not in range(128)
I have tried decode(), encode(), and unicode() on the string containing that character, but it does not seem to help.
for row in reader:
    #decoded_row = [element_s.decode('UTF-8') for element_s in row]
    #print decoded_row
    encoded_row = [element_s.encode('UTF-8') for element_s in row]
    print encoded_row
At this point I do not really understand why. If I
>>> print u'\u5a07'
娇
or
>>> print '娇'
娇
it works, and it also works in the terminal. I have checked the default encoding in both the terminal and the Python shell; it is UTF-8 everywhere, and both print that symbol easily. I assume the problem has something to do with opening the file with codecs using UTF-16.
I am not sure where to go from here. Could anyone help out?
The csv module can not handle Unicode input. It says so specifically on its documentation page:
Note: This version of the csv module doesn’t support Unicode input. Also, there are currently some issues regarding ASCII NUL characters. Accordingly, all input should be UTF-8 or printable ASCII to be safe;
You need to convert your CSV file to UTF-8 so that the module can deal with it:
with codecs.open(file_full_path, 'rU', 'UTF-16') as infile:
    with open(file_full_path + '.utf8', 'wb') as outfile:
        for line in infile:
            outfile.write(line.encode('utf8'))
Alternatively, you can use the command-line utility iconv to convert the file for you.
Then use that re-coded file to read your data:
reader = csv.reader(open(file_full_path + '.utf8', 'rb'), delimiter='\t', quotechar='"')
for row in reader:
    print [c.decode('utf8') for c in row]
Note that the columns then need decoding to unicode manually.
Encode errors are what you get when you try to convert unicode characters to 8-bit byte sequences. So your first error is not raised while actually reading the file, but a bit later.
You probably get this error because the Python 2 csv module expects files to be opened in binary mode, while you opened yours in a way that returns unicode strings.
Change your opening to this:
reader = csv.reader(open(file_full_path, 'rb'), delimiter='\t', quotechar='"')
And you should be fine. Or even better:
with open(file_full_path, 'rb') as infile:
    reader = csv.reader(infile, delimiter='\t', quotechar='"')
    # CSV handling here.
However, you can't use UTF-16 (or UTF-32) directly, as the separator characters are two bytes wide in UTF-16 and the csv module will not handle them correctly, so you will need to convert the file to UTF-8 first.

Unicode error in python when printing a list

Edit: http://pastebin.com/W4iG3tjS - the file
I have a text file encoded in utf8 with some Cyrillic text it. To load it, I use the following code:
import codecs
fopen = codecs.open('thefile', 'r', encoding='utf8')
fread = fopen.read()
fread dumps the file on the screen all unicodish (escape sequences). print fread displays it in readable form (ASCII I guess).
I then try to split it and write it to an empty file with no encoding:
a = fread.split()
for l in a:
    print>>dasFile, l
But I get the following error message: UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-13: ordinal not in range(128)
Is there a way to dump fread.split() into a file? How can I get rid of this error?
Since you've opened and read the file via codecs.open(), it's been decoded to Unicode. So to output it you need to encode it again, presumably back to UTF-8.
for l in a:
    dasFile.write(l.encode('utf-8'))
print is going to use the default encoding, which is normally "ascii". So you see that error with print. But you can open a file and write directly to it.
a = fopen.readlines() # returns a list of lines already, with line endings intact
# do something with a
dasFile.writelines(a) # doesn't add line endings, expects them to be present already.
assuming the lines in a are encoded already.
PS. You should also investigate the io module.
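A minimal sketch of that io-based approach (the output file name and the sample Cyrillic text are invented for the demo):

```python
import io

# Stand-in for the decoded, split file contents from the question.
words = u'\u043f\u0440\u0438\u0432\u0435\u0442 \u043c\u0438\u0440'.split()

# Writing through io.open() encodes each unicode value on the way out,
# so no manual .encode('utf-8') calls are needed in the loop.
with io.open('out.txt', 'w', encoding='utf-8') as das_file:
    for word in words:
        das_file.write(word + u'\n')
```

Keeping encoding at the file boundary like this (decode on read, encode on write) is the same pattern the io module was designed around, and it carries over unchanged to Python 3.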
