I am doing some csv read/write operations over a huge flat file. According to the source, the contents of the file are UTF-8 encoded, but while trying to write to a .csv file I am getting the following error:
Traceback (most recent call last):
File " basic.py", line 12, in <module>
F.write(q)
File "C:\Program Files\Python36\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\x9a' in position 19: character maps to <undefined>
Quite possibly the file contains some multicultural symbols, given that it holds data representing global information.
But as it is huge, the offending characters can't be fixed manually one by one. So is there a way I can fix these errors, make the code ignore them, or ideally standardize these characters? After writing the csv file I will be uploading it to a SQL Server database.
I had the same error message when trying to write text to a .txt file.
I solved the problem by replacing
my_text = "text containing non-unicode characters"
text_file = open("Output.txt", "w")
text_file.write(my_text)
with
my_text = "text containing non-unicode characters"
text_file = open("Output.txt", "w", encoding='utf-8')
text_file.write(my_text)
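Since the question also asks about standardizing characters before the data reaches SQL Server, here is one possible approach (a sketch with placeholder text, not taken from the original script): decompose accented characters with unicodedata.normalize and drop the leftover combining marks, or simply keep everything by opening the output in UTF-8.

```python
import unicodedata

text = "Škoda cafe\u0301"  # placeholder row value with non-ASCII characters

# Standardize to plain ASCII: decompose accents, then drop combining marks
ascii_text = (unicodedata.normalize("NFKD", text)
              .encode("ascii", "ignore")
              .decode("ascii"))
print(ascii_text)  # Skoda cafe

# Alternatively, keep the characters intact and write the csv as UTF-8:
# open("out.csv", "w", encoding="utf-8", newline="")
```

This loses information (the accents are gone for good), so it is only appropriate when the target column really must be ASCII.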
Related
I'm trying to manipulate a text file with song names. I want to clean up the data, by changing all the spaces and tabs into +.
This is the code:
input = open('music.txt', 'r')
out = open("out.txt", "w")
for line in input:
    new_line = line.replace(" ", "+")
    new_line2 = new_line.replace("\t", "+")
    out.write(new_line2)
    #print(new_line2)
input.close()
out.close()
It gives me an error:
Traceback (most recent call last):
File "music.py", line 3, in <module>
for line in input:
File "C:\Users\nfeyd\AppData\Local\Programs\Python\Python36\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 2126: character maps to <undefined>
As music.txt is saved in UTF-8, I changed the first line to:
input = open('music.txt', 'r', encoding="utf8")
This gives another error:
UnicodeEncodeError: 'charmap' codec can't encode character '\u039b' in position 21: character maps to <undefined>
I tried other things with the out.write() but it didn't work.
This is the raw data of music.txt.
https://pastebin.com/FVsVinqW
I saved it in windows editor as UTF-8 .txt file.
On Windows, legacy versions of Python 3 do not default to UTF-8, so if your system's default encoding is not UTF-8 you need to configure it explicitly for both of the file handles you open.
with open('music.txt', 'r', encoding='utf-8') as infh, \
        open("out.txt", "w", encoding='utf-8') as outfh:
    for line in infh:
        line = line.replace(" ", "+").replace("\t", "+")
        outfh.write(line)
This demonstrates how you can use fewer temporary variables for the replacements; I also refactored to use a with context manager, and renamed the file handle variables to avoid shadowing the built-in input function.
Going forward, a better solution may be to upgrade your Python version: recent releases offer an opt-in UTF-8 mode (the -X utf8 option or the PYTHONUTF8=1 environment variable), and UTF-8 is slated to become the default on Windows as well.
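To check what your own interpreter does, this quick diagnostic sketch shows the encoding open() falls back to when none is passed, and whether UTF-8 mode is active:

```python
import locale
import sys

# Encoding used by open() when no encoding argument is passed
# (this is cp1252 on many Windows setups, hence the charmap errors)
default_enc = locale.getpreferredencoding(False)
print(default_enc)

# Non-zero when running with `python -X utf8` or PYTHONUTF8=1
print(sys.flags.utf8_mode)
```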
I have been trying to read a .docx file and copy its text to a .txt file
I started off by writing this script to achieve that:
if extension == 'docx':
    document = Document(filepath)
    for para in document.paragraphs:
        with open("C:/Users/prasu/Desktop/PySumm-resource/CodeSamples/output.txt", "w") as file:
            file.writelines(para.text)
The error that occurred is as follows:
Traceback (most recent call last):
File "input_script.py", line 27, in <module>
file.writelines(para.text)
File "C:\Python\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2265' in
position 0: character maps to <undefined>
I tried printing "para.text" with the help of print(), it works.
Now, I want to write "para.text" to a .txt file.
You could try using codecs.
Based on your error message it seems that the following character "≥" is causing issues. Outputting in utf-8 with codecs should hopefully solve your issue.
from docx import Document
import codecs

filepath = r"test.docx"
document = Document(filepath)
for para in document.paragraphs:
    with codecs.open('output.txt', 'a', "utf-8-sig") as o_file:
        o_file.write(para.text)
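Opening the output file once, outside the loop, avoids re-opening it per paragraph, and the built-in open() with an encoding argument works just as well as codecs.open here. A sketch, with a plain list standing in for document.paragraphs so it runs without python-docx:

```python
paragraphs = ["first line with \u2265", "second line"]  # stands in for document.paragraphs

# Open the file once; every paragraph is written inside the same block
with open("output.txt", "w", encoding="utf-8") as o_file:
    for text in paragraphs:
        o_file.write(text + "\n")
```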
I need to print a random line from the file "Long films".
My code is:
import random
with open('Long films') as f:
    lines = f.readlines()
    print(random.choice(lines))
But it prints this error:
Traceback (most recent call last):
line 3, in <module>
lines = f.readlines()
line 26, in decode
return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 36: ordinal not in range(128)
What do I need to do in order to avoid this error?
The problem is not with printing, it is with reading. It seems your file has some special characters. Try opening your file with a different encoding:
with open('Long films', encoding='latin-1') as f:
    ...
Also, have you changed any locale settings, or put an encoding declaration at the top of your file? By default Python 3 decodes text files with your locale's preferred encoding; with a UTF-8 locale you typically would not get this error.
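The difference between the codecs matters here: latin-1 assigns a character to every one of the 256 byte values, so decoding with it can never raise, while ascii rejects anything above 0x7F. A small sketch with made-up bytes:

```python
raw = b"\xe2 is not valid ASCII"

# latin-1 maps all 256 byte values to code points, so this cannot fail
print(raw.decode("latin-1"))

# errors="replace" is another escape hatch: bad bytes become U+FFFD
print(raw.decode("ascii", errors="replace"))
```

Note that latin-1 never failing does not mean it is correct: if the file is really UTF-8, decoding it as latin-1 silently produces mojibake instead of an error.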
I don't know exactly what the source of this error is or how to fix it. I am getting it by running this code:
Traceback (most recent call last):
File "t1.py", line 86, in <module>
write_results(results)
File "t1.py", line 34, in write_results
dw.writerows(results)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/csv.py", line 154, in writerows
return self.writer.writerows(rows)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
Any explanation is really appreciated!
I changed the code and now I get this error:
File "t1.py", line 88, in <module>
write_results(results)
File "t1.py", line 35, in write_results
dw.writerows(results)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/csv.py", line 154, in writerows
return self.writer.writerows(rows)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
Here's the change:
with codecs.open('results.csv', 'wb', 'utf-8') as f:
    dw = csv.DictWriter(f, fieldnames=fields, delimiter='|')
    dw.writer.writerow(dw.fieldnames)
    dw.writerows(results)
The error is raised by this part of the code:
with open('results.csv', 'w') as f:
    dw = csv.DictWriter(f, fieldnames=fields, delimiter='|')
    dw.writer.writerow(dw.fieldnames)
    dw.writerows(results)
You're opening an ASCII file, and then you're trying to write non-ASCII data to it. I guess that whoever wrote that script never happened to encounter a non-ASCII character during testing, so they never ran into an error.
But if you look at the docs for the csv module, you'll see that the module can't correctly handle Unicode strings (which is what Beautiful Soup returns), that CSV files always have to be opened in binary mode, and that only UTF-8 or ASCII are safe to write.
So you need to encode all the strings to UTF-8 before writing them. I first thought that it should suffice to encode the strings on writing, but the Python 2 csv module chokes on the Unicode strings anyway. So I guess there's no other way but to encode each string explicitly:
In parse_results(), change the line
results.append({'url': url, 'create_date': create_date, 'title': title})
to
results.append({'url': url, 'create_date': create_date, 'title': title.encode("utf-8")})
That might already be sufficient since I don't expect URLs or dates to contain non-ASCII characters.
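For reference, on Python 3 this whole problem disappears: the csv module works with str objects directly, so it is enough to open the file in text mode with an explicit encoding. A sketch, with placeholder field names and data matching the snippet above:

```python
import csv

# Placeholder rows; the non-ASCII title needs no manual .encode() on Python 3
results = [{"url": "http://example.com", "create_date": "2020-01-01",
            "title": "Grüße"}]

with open("results.csv", "w", encoding="utf-8", newline="") as f:
    dw = csv.DictWriter(f, fieldnames=["url", "create_date", "title"],
                        delimiter="|")
    dw.writeheader()
    dw.writerows(results)
```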
This works for me. Code snippet (Python 2):
import csv
import sys

reload(sys)
sys.setdefaultencoding('utf8')

data = [["a", "b", u'\xe9']]

with open("output.csv", "w") as csv_file:
    writer = csv.writer(csv_file, quoting=csv.QUOTE_ALL)
    writer.writerows(data)
I am using Python to extract data from an MSSQL database, using an ODBC connection. I am then trying to put the extracted data into an Excel file, using xlwt.
However this generates the following error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd8 in position 20: ordinal not in range(128)
I have run the script to just print the data and established that the offending character in the database is an O with a slash through it; in the Python output it shows as "\xd8".
The worksheet encoding for xlwt is set as UTF-8.
Is there any way to have this come straight through into Excel?
Edit
Full error message below:
C:\>python dbtest1.py
Traceback (most recent call last):
File "dbtest1.py", line 24, in <module>
ws.write(i,j,item)
File "build\bdist.win32\egg\xlwt\Worksheet.py", line 1032, in write
File "build\bdist.win32\egg\xlwt\Row.py", line 240, in write
File "build\bdist.win32\egg\xlwt\Workbook.py", line 309, in add_str
File "build\bdist.win32\egg\xlwt\BIFFRecords.py", line 25, in add_str
File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xd8 in position 20: invalid
continuation byte
Setting the workbook encoding to 'latin-1' seems to have achieved the same:
wb = xlwt.Workbook(encoding='latin-1')
(It was set at 'UTF-8' before)
The other answer didn't work in my case as there were other fields that were not strings.
The SQL extraction seems to be returning byte strings in a single-byte encoding such as latin-1. You can convert them to unicode with:
data = unicode(input_string, 'latin-1')
You can then put them into a spreadsheet with xlwt.
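For anyone hitting this on Python 3, where the unicode() built-in no longer exists, the equivalent is an explicit bytes-to-str decode (a sketch with a made-up byte string):

```python
raw = b"\xd8rsted"  # 0xD8 is the slashed O in latin-1

# bytes.decode replaces Python 2's unicode(input_string, 'latin-1')
data = raw.decode("latin-1")
print(data)
```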