CSV Module "UnicodeEncodeError" when using DictWriter.writerows - python

I'm setting up a prod environment on a new Mac server that should mirror my dev environment. The job runs without a hitch on my dev computer, but on the server I'm getting this traceback:
Traceback (most recent call last):
  File "/usr/local/share/Code/PycharmProjects/etl3/jira_scripts/jira_issues_incremental.py", line 189, in <module>
    writer.writerows(rows)
  File "/usr/local/bin/anaconda3/envs/etl3/lib/python3.5/csv.py", line 156, in writerows
    return self.writer.writerows(map(self._dict_to_list, rowdicts))
UnicodeEncodeError: 'ascii' codec can't encode character '\u2019' in position 1195: ordinal not in range(128)
This job is being run through the Run Shell Script terminal in the Automator app. I've checked sys.getdefaultencoding() in the Automator terminal, as well as on the machine itself. Everything says utf8. I've also checked the encoding in my PostgreSQL database, and that is also set to UTF8. Here is my open statement for the file that the DictWriter is writing to:
with open(loadfile, 'w') as outf:
    writer = csv.DictWriter(
        f=outf,
        delimiter='|',
        fieldnames=fieldnames,
        extrasaction='ignore',
        escapechar=r'/',
        quoting=csv.QUOTE_MINIMAL
    )
    writer.writerows(rows)
I'm a little stumped as to where to even start to track down this error since all the default encodings seem to be correct... I should mention that this file is then copied to a PostgreSQL database using the psycopg2.cursor.copy_from command after, so the file should be written in a mode compatible with that.

You did not specify an encoding for your file, so the default codec for your system is used. In the environment where the job runs, that is currently ASCII. See the open() documentation:
In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding.
Specify a different codec instead. UTF-8 would work:
with open(loadfile, 'w', encoding='utf8') as outf:
sys.getdefaultencoding() doesn't apply here; that's merely the default for unqualified str.encode() calls.
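
As a minimal sketch, here is the fix applied to a DictWriter call like the one in the question. The file path, field names, and row contents are made up for illustration; the relevant change is the explicit encoding argument. The newline='' argument is an addition that the csv module documentation recommends for files handed to a writer.

```python
import csv
import os
import tempfile


def write_rows(path, fieldnames, rows):
    """Write dict rows to a pipe-delimited file that is always UTF-8 encoded."""
    # Naming the encoding makes the output independent of whatever locale
    # the launching environment (Automator, cron, a login shell) provides.
    with open(path, 'w', encoding='utf8', newline='') as outf:
        writer = csv.DictWriter(
            outf,
            delimiter='|',
            fieldnames=fieldnames,
            extrasaction='ignore',
            escapechar='/',
            quoting=csv.QUOTE_MINIMAL,
        )
        writer.writerows(rows)


# Hypothetical sample data containing U+2019, the character from the traceback.
sample_rows = [{'key': 'ABC-1', 'summary': 'It\u2019s broken'}]
sample_path = os.path.join(tempfile.mkdtemp(), 'load.csv')
write_rows(sample_path, ['key', 'summary'], sample_rows)
```

Whatever later reads the file (for example a database bulk load) then has to be told, or already assume, that the file is UTF-8.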

Related

Can't read a .csv with python: "UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position..." [duplicate]

I'm trying to get a Python 3 program to do some manipulations with a text file filled with information. However, when trying to read the file I get the following error:
Traceback (most recent call last):
  File "SCRIPT LOCATION", line NUMBER, in <module>
    text = file.read()
  File "C:\Python31\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 2907500: character maps to <undefined>
The file in question is not using the CP1252 encoding. It's using another encoding. Which one you have to figure out yourself. Common ones are Latin-1 and UTF-8. Since 0x90 doesn't actually mean anything in Latin-1, UTF-8 (where 0x90 is a continuation byte) is more likely.
You specify the encoding when you open the file:
file = open(filename, encoding="utf8")
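
To illustrate why byte 0x90 points away from cp1252 and towards UTF-8, here is a small sketch. The sample character is an arbitrary choice (any character whose UTF-8 form contains 0x90 would do), and try_decode is a hypothetical helper.

```python
# U+0410 (CYRILLIC CAPITAL LETTER A) is two bytes in UTF-8: 0xD0 0x90.
raw = '\u0410'.encode('utf8')


def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid in that codec."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


as_utf8 = try_decode(raw, 'utf8')        # the original letter
as_cp1252 = try_decode(raw, 'cp1252')    # None: 0x90 is unassigned in Python's cp1252 codec
as_latin1 = try_decode(raw, 'latin-1')   # never fails: 0x90 becomes the control character U+0090
```

Note that Latin-1 decodes without an error even though the result is not meaningful text, so the absence of an exception is not proof that the encoding is right.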
If file = open(filename, encoding="utf-8") doesn't work, try
file = open(filename, errors="ignore") if you are happy to drop the characters that cannot be decoded.
Alternatively, if you don't need to decode the file, such as uploading the file to a website, use:
open(filename, 'rb')
where r = reading, b = binary
As an extension to @LennartRegebro's answer:
If you can't tell which encoding your file uses, the solution above does not work (the file is not UTF-8), and you find yourself merely guessing, there are online tools that can help identify the encoding. They aren't perfect, but they usually work just fine. Once you have figured out the encoding, you should be able to use the solution above.
EDIT: (Copied from comment)
A quite popular text editor Sublime Text has a command to display encoding if it has been set...
Go to View -> Show Console (or Ctrl+`)
Type into field at the bottom view.encoding() and hope for the best (I was unable to get anything but Undefined but maybe you will have better luck...)
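
If you would rather guess locally than with an online tool, one rough approach is to try a short list of candidate encodings in order. This is only a sketch: first_working_encoding and the candidate list are hypothetical, and a decode that succeeds does not prove the guess is right (latin-1 in particular accepts any byte sequence, so it only makes sense as the last resort).

```python
def first_working_encoding(raw, candidates=('utf-8', 'cp1252', 'latin-1')):
    """Return the first candidate that decodes the bytes without an error."""
    for encoding in candidates:
        try:
            raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # this candidate cannot be right; try the next one
        return encoding
    return None  # nothing in the list could decode the data


guess_utf8 = first_working_encoding('caf\u00e9'.encode('utf-8'))
guess_cp1252 = first_working_encoding(b'caf\xe9')   # invalid as UTF-8
guess_latin1 = first_working_encoding(b'\x90')      # invalid as UTF-8 and cp1252
```

To use it on a file, read the bytes with open(filename, 'rb') and pass the result of read() to the helper.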
TLDR: Try: file = open(filename, encoding='cp437')
Why? When one uses:
file = open(filename)
text = file.read()
Python assumes the file uses the same code page as the current environment (cp1252 in the case of the original post) and tries to decode its bytes into Unicode text. If the file contains byte values that are not defined in this code page (like 0x90), we get a UnicodeDecodeError. Sometimes we don't know the encoding of the file, sometimes Python has no codec for it (e.g. cp790), and sometimes the file contains mixed encodings.
If such characters are unneeded, one may decide to replace them with a replacement marker (U+FFFD, usually displayed as a question mark), with:
file = open(filename, errors='replace')
Another workaround is to use:
file = open(filename, errors='ignore')
The undecodable bytes are then silently dropped and the remaining characters are left intact, but other errors will be masked too.
A very good solution is to specify the encoding, but not just any encoding (like cp1252): pick one that has ALL byte values defined (like cp437):
file = open(filename, encoding='cp437')
Codepage 437 is the original DOS encoding. All codes are defined, so there are no errors while reading the file, no errors are masked out, the characters are preserved (not quite left intact but still distinguishable).
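
The three options above behave quite differently, which a short sketch on in-memory bytes can show. The sample bytes are made up: valid UTF-8 text followed by one stray 0x90 byte.

```python
raw = b'caf\xc3\xa9 \x90'  # UTF-8 for 'café ' followed by a stray 0x90 byte

# errors='replace': the stray byte becomes U+FFFD, the replacement character.
replaced = raw.decode('utf8', errors='replace')

# errors='ignore': the stray byte is dropped without any trace.
ignored = raw.decode('utf8', errors='ignore')

# cp437: every byte maps to some character, so decoding never fails,
# but the two bytes of 'é' turn into two unrelated characters.
as_cp437 = raw.decode('cp437')

# Because cp437 assigns one character to each byte value,
# the original bytes can be recovered by encoding again.
round_trip = as_cp437.encode('cp437')
```

In other words, cp437 avoids the exception and loses no bytes, but any non-ASCII text in the file will read as mojibake unless the file really is cp437.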
Stop wasting your time, just add the following encoding="cp437" and errors='ignore' to your code in both read and write:
open('filename.csv', encoding="cp437", errors='ignore')
open(file_name, 'w', newline='', encoding="cp437", errors='ignore')
Godspeed
For me, opening the file as UTF-16 worked:
file = open('filename.csv', encoding="utf16")
For those working in Anaconda on Windows: I had the same problem, and Notepad++ helped me solve it.
Open the file in Notepad++. In the bottom right it will tell you the current file encoding.
In the top menu, next to "View", locate "Encoding". In "Encoding", go to "Character sets" and patiently look for the encoding that you need. In my case, the encoding "Windows-1252" was found under "Western European".
Before you apply the suggested solution, you can check what is the Unicode character that appeared in your file (and in the error log), in this case 0x90: https://unicodelookup.com/#0x90/1 (or directly at Unicode Consortium site http://www.unicode.org/charts/ by searching 0x0090)
and then consider removing it from the file.
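def read_files(file_path):
    # Read the whole file as text, decoding it as UTF-8.
    with open(file_path, encoding='utf8') as f:
        text = f.read()
    return text
OR (AND)
def write_files(text, file_path):
    # Write the text as UTF-8 bytes, dropping characters that cannot be encoded.
    # The file must be opened for writing ('wb'); with 'rb', f.write() would fail.
    with open(file_path, 'wb') as f:
        f.write(text.encode('utf8', 'ignore'))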
In newer versions of Python (starting with 3.7), you can add the interpreter option -Xutf8, which should fix your problem. If you use PyCharm, just go to Run > Edit Configurations and, in the Configuration tab, set the Interpreter options field to -Xutf8.
Or, equivalently, you can just set the environment variable PYTHONUTF8 to 1.
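
One way to see the effect of the option is to ask a child interpreter which encoding it would use by default, with and without the flag. This is a sketch that assumes Python 3.7 or later; PROBE and preferred_encoding are hypothetical names.

```python
import subprocess
import sys

# One-liner run in the child process: print the encoding open() falls back to.
PROBE = 'import locale; print(locale.getpreferredencoding(False))'


def preferred_encoding(extra_args=()):
    """Start a child interpreter and return its preferred locale encoding."""
    result = subprocess.run(
        [sys.executable, *extra_args, '-c', PROBE],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return result.stdout.strip()


with_utf8_mode = preferred_encoding(['-X', 'utf8'])  # UTF-8 mode forced on
without_flag = preferred_encoding()                  # whatever the environment gives

print('with -X utf8:', with_utf8_mode)
print('without flag:', without_flag)
```

If the second value is not UTF-8 on your machine, that is the encoding a bare open(filename) uses there.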
For me, changing the MySQL character encoding to match the one used in my code solved the problem. photo=open('pic3.png',encoding='latin1')

Which encoding should Python open function use?

I'm getting an exception when reading a file that contains a RIGHT DOUBLE QUOTATION MARK Unicode symbol. It is encoded in UTF-8 (0xE2 0x80 0x9D). The minimal example:
import sys
print(sys.getdefaultencoding())
f = open("input.txt", "r")
f.readline()
This script fails reading the first line even if the right quotation mark is not on the first line. The exception looks like this:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Program Files\Python36\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 102: character maps to <undefined>
The input file is in utf-8 encoding, I've tried both with and without BOM. The default encoding returned by sys.getdefaultencoding() is utf-8.
This script fails on the machine with Python 3.6.5 but works well on another with Python 3.6.0. Both machines are Windows.
My questions are mostly theoretical, as this exception is thrown from external software that I cannot change, and it reads file that I don't wish to change. What should be the difference in these machines except the Python patch version? Why does vanilla open use cp1252 if the system default is utf-8?
As clearly stated in Python's open documentation:
In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding.
Windows defaults to a localized encoding (cp1252 on US and Western European versions). Linux typically defaults to utf-8.
Because it is platform-dependent, use the encoding parameter and specify the encoding of the file explicitly.
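
To see the difference between the two settings on a given machine, a small diagnostic like the following can help. It is a sketch; encoding_report is a hypothetical helper that opens a temporary file without an encoding argument and reports what Python picked.

```python
import locale
import os
import sys
import tempfile


def encoding_report():
    """Collect the encodings that are commonly confused with each other."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path) as fh:  # deliberately no encoding argument
            open_default = fh.encoding
    finally:
        os.remove(path)
    return {
        'sys.getdefaultencoding()': sys.getdefaultencoding(),
        'locale.getpreferredencoding(False)': locale.getpreferredencoding(False),
        'open() default': open_default,
    }


report = encoding_report()
for name, value in report.items():
    print(f'{name}: {value}')
```

On Python 3, sys.getdefaultencoding() reports utf-8 regardless of platform, so it says nothing about which codec open() will choose; the other two values are the ones to compare between machines.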

How to avoid encoding parameter when opening file in Python3

When I am working on a .txt file on a Windows device I must save as either: ANSI, Unicode, Unicode big endian, or UTF-8. When I run Python3 on an OSX device and try to import and read the .txt file, I have to do something along the lines of:
with open('ships.txt', 'r', encoding='utf-8') as f:
    for line in f.readlines():
        print(line)
Is there a particular format I should use to encode the .txt file on the Windows device to avoid adding the encoding parameter when opening the file in Python?
Call locale.getpreferredencoding(False) on your OSX device. That's the default encoding used for reading a file on that device. Save in that encoding on Windows and you won't need to specify the encoding on your OSX device.
But as the Zen of Python says, "Explicit is better than implicit." Since you know the encoding, why not specify it?

My python can only open the saved text file by notepad, why?

I am using Python 3.4.1 on Windows 7. I am trying to read a text file exported from another program, and it seems that Python cannot read it. However, I found that if I open the text file in Notepad, add a space anywhere, and save it, Python reads it fine.
I tried the same code and the same file on my Mac, and it has the same problem as on Windows.
To summarize: the original text file does not work; the file opened and saved in Windows Notepad works; the file opened and saved in Mac TextEdit does not work.
I suspect the original encoding of the text file might not be right.
Thanks
Python code
InputFileName=input("Please tell me the input file name:")
#StartLNum=int(input("Please tell me the start line number:"))
#EndLNum=int(input("Please tell me the end line number:"))
StartLNum=18
EndLNum=129
lnum=1
OutputName='out'+InputFileName
fw=open(OutputName,'w')
with open(InputFileName,"r") as fr:
    for line in fr:
        if (lnum >= StartLNum) & (lnum<=EndLNum):
            #print(line)
            fw.write(line)
        lnum+=1
fw.close()
Shell
>>> ================================ RESTART ================================
>>>
Please tell me the input file name:Jul-18-2014.txt
Traceback (most recent call last):
  File "C:\Users\Jeremy\Desktop\read.py", line 13, in <module>
    for line in fr:
UnicodeDecodeError: 'gbk' codec can't decode byte 0xb3 in position 4309: illegal multibyte sequence
>>> ================================ RESTART ================================
>>>
Please tell me the input file name:Jul-18-2014.txt
>>>
In addition, here is the error the same code reports on my Mac (Python 3.4.1, OS X 10.9):
Traceback (most recent call last):
  File "/Users/Jeremy/Desktop/read.py", line 14, in <module>
    for line in fr:
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xb3 in position 4174: ordinal not in range(128)
When you save the file in Notepad, the file is re-encoded and saved in the default file encoding of your Windows installation. Notepad auto-detected the original encoding when it opened the file, however.
Python opens files using that same system encoding by default, which is why you can now open the file. Quoting the open() function documentation:
encoding is the name of the encoding used to decode or encode the file. This should only be used in text mode. The default encoding is platform dependent (whatever locale.getpreferredencoding() returns), but any encoding supported by Python can be used.
You'll have to explicitly specify the correct encoding for the file if you wanted to open it directly in Python:
with open(InputFileName, "r", encoding='utf-8-sig') as fr:
I used 'utf-8-sig' as an example here, as that is a file encoding that Notepad can auto-detect. It could well be that the encoding is UTF-16 or plain UTF-8 or any number of other encodings, however.
If you think that the file is encoded with a specific ANSI code page, you still have to name the exact code page. Your system is configured to use code page 936 (GBK), but that is not the correct encoding for this file.
See the codecs module for a list of supported encodings.
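
As a small illustration of what 'utf-8-sig' does differently from plain 'utf-8', here is a sketch on in-memory bytes. The sample text is made up, and whether the file in the question actually starts with a byte order mark is unknown.

```python
import codecs

# A UTF-8 file that starts with a byte order mark (BOM).
raw = codecs.BOM_UTF8 + 'first line\n'.encode('utf-8')

# Plain utf-8 keeps the BOM as an invisible U+FEFF character at the start.
plain = raw.decode('utf-8')

# utf-8-sig strips the BOM when one is present...
stripped = raw.decode('utf-8-sig')

# ...and behaves like plain utf-8 when there is none.
no_bom = 'first line\n'.encode('utf-8').decode('utf-8-sig')
```

Passing encoding='utf-8-sig' to open() applies the same handling to files when reading.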

Changing the preferred encoding for Windows7 command prompt

Using Python 3.3.
I am running a Python script to do some CAD file conversion, but I am getting the error below:
Conversion Failed
Traceback (most recent call last):
  File "Start.py", line 141, in convertLib
    lib.writeLibrary(modFile,symFile)
  File "C:\Python33\Eagle2Kicad/Library\Library.py", line 67, in writeLibrary
    self.writeSymFile(symFile)
  File "C:\Python33\Eagle2Kicad/Library\Library.py", line 88, in writeSymFile
    devicepart.write(symFile)
  File "C:\Python33\Eagle2Kicad/Common\Symbol.py", line 51, in write
    symbol.write(symFile)
  File "C:\Python33\Eagle2Kicad/Common\Symbol.py", line 114, in write
    symFile.write(pin.symRep())
  File "C:\Python33\Lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\x96' in position 4: character maps to <undefined>
It seems the preferred encoding in my Windows7 command prompt is NOT cp1252 (typing chcp shows "Active code page 437"). How can I change it to cp1252?
Can anyone please suggest?
EDIT:
As the default preferred encoding of the python command line is cp1252, I tried running the script from python command line instead of the windows command prompt. But I am still getting the same error as above. Can anyone please suggest?
When writing to files, the console encoding is not taken into account; only the locale is. From the open() function documentation:
In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding.
On your system, evidently that call returns 'cp1252'. The remedy is to always name the encoding for files explicitly; pass in an encoding argument to set the desired encoding for files:
with open(filename, 'w', encoding='utf8') as symFile:
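
The failure from the traceback can be reproduced without any files, as in the sketch below. The sample string and the can_encode helper are made up. The last two lines show one possible origin of the '\x96' character (bytes decoded as Latin-1 that were really cp1252); that is a guess, not something the question confirms.

```python
text = 'pin \x96 name'  # contains U+0096, a C1 control character


def can_encode(value, encoding):
    """Return True if the text can be written using the given codec."""
    try:
        value.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


cp1252_ok = can_encode(text, 'cp1252')  # False: cp1252 has no slot for U+0096
utf8_ok = can_encode(text, 'utf8')      # True: UTF-8 can encode every code point

# Assumption about the origin: byte 0x96 is an en dash in cp1252,
# but decoding it as Latin-1 yields the control character instead.
as_latin1 = b'\x96'.decode('latin-1')
as_cp1252 = b'\x96'.decode('cp1252')
```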
