Write a UTF-8 text file with Python that the Windows Editor can read - python

I use Python 3.4 on Win 7 and have the following problem:
I'd like to write multiline Unicode text to a text file which the user can open with the standard Windows Editor (I know ...) without any special instructions. I already figured out that this editor apparently needs a BOM to understand that the encoding is actually UTF-8:
import codecs

with codecs.open(r'c:\configfile.txt', 'w', encoding='utf-8-sig') as cf:
    cf.write("""Test1
Test2 öäüß
Test3""")
Now I noticed that with this code all newlines are written as 0x0a instead of 0x0d 0x0a, which the Windows Editor doesn't recognize, so it shows everything in one single line.
Long story short: What is a safe way to write a multiline unicode text string to a file that can be opened and edited with the Windows Editor?

With Python 3, you can simply use
with open(r'c:\configfile.txt', 'w', encoding='utf-8-sig') as cf:
    ...
which will open the file in "text" mode. That will use the correct line ending for the OS on which you run the script.
io.open() works the same way. codecs.open() always uses binary mode, so no translation of line endings will happen.
In Python 2, you could achieve the same effect by using 'wt' as the mode.
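For illustration, here is a minimal sketch (assuming the same c:\configfile.txt path as above and that you run it on Windows) that writes in text mode and then reads the raw bytes back to confirm that the BOM and the \r\n line endings were actually written:
with open(r'c:\configfile.txt', 'w', encoding='utf-8-sig') as cf:
    cf.write("Test1\nTest2 öäüß\nTest3")
with open(r'c:\configfile.txt', 'rb') as raw:
    data = raw.read()
print(data.startswith(b'\xef\xbb\xbf'))  # True: the BOM written by utf-8-sig
print(b'\r\n' in data)                   # True on Windows: '\n' was translated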

I have found a solution myself ... simply using io.open instead of codecs.open with the same parameters fixes the newline character issue:
import io

with io.open(r'c:\configfile.txt', 'w', encoding='utf-8-sig') as cf:
    cf.write("""Test1
Test2 öäüß
Test3""")

Related

UTF-8 issue - Is there a way to convert strange-looking characters like Ã¤ to the proper German character ä in Python?

I have a .txt file which should contain German umlauts like ä, ö, ß, ü. But these characters don't appear as such; instead, what appears is Ã¤ instead of ä, Ãœ instead of Ü, and so on. This happens because the .txt file is stored in ANSI encoding. Now, when I import this file, with the respective columns as strings, into either SAS (DATA step) or Python (with .read_csv), these strange characters appear in the .sas7bdat file and the Python DataFrame as such, instead of the proper characters ä, ö, ü, ß.
One workaround to solve this issue is:
Open the file in standard Notepad.
Press 'Save As', and a window appears.
In the encoding drop-down, change the encoding to UTF-8.
Now, when you import the file into SAS or Python, everything is imported correctly.
But sometimes the .txt files that I have are very big (in GBs), so I cannot open them and apply this hack.
I could use the .replace() function to replace these strange characters with the real ones, but there could be combinations of strange characters that I am not aware of, which is why I want to avoid that.
Is there any Python library which can automatically translate these strange characters into their proper ones, so that Ã¤ gets translated to ä and so on?
Did you try to use the codecs library?
import codecs
your_file = codecs.open('your_file.extension', 'w', 'encoding_type')
If the file contains the correct code points, you just have to specify the correct encoding. Python 3 will default to UTF-8 on most sane platforms, but if you need your code to also run on Windows, you probably want to spell out the encoding.
with open(filename, 'r', encoding='utf-8') as f:
    ...  # do things with f
If the file actually contains mojibake, there is no simple way in the general case to revert every possible way of screwing up text, but a common mistake is assuming the text was in Latin-1 and converting it to UTF-8 when in fact the input was already UTF-8. What you can do then is say you want Latin-1 when reading, and then make sure you save it in the correct format as soon as you have read it:
with open(filename, 'r', encoding='latin-1') as inp, \
        open('newfile', 'w', encoding='utf-8') as outp:
    for line in inp:
        outp.write(line)
The ftfy library claims to be able to identify and correct a number of common mojibake problems.
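A minimal sketch of what that looks like (assuming ftfy is installed; fix_text is its main entry point):
import ftfy
# ftfy detects the Latin-1/UTF-8 double-encoding and undoes it.
print(ftfy.fix_text('Ã¤ Ã¶ Ã¼ ÃŸ'))  # expected output: 'ä ö ü ß'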

How to filter Unicode characters [duplicate]

I'm trying to get a Python 3 program to do some manipulations with a text file filled with information. However, when trying to read the file I get the following error:
Traceback (most recent call last):
  File "SCRIPT LOCATION", line NUMBER, in <module>
    text = file.read()
  File "C:\Python31\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 2907500: character maps to <undefined>
The file in question is not using the CP1252 encoding. It's using another encoding. Which one you have to figure out yourself. Common ones are Latin-1 and UTF-8. Since 0x90 doesn't actually mean anything in Latin-1, UTF-8 (where 0x90 is a continuation byte) is more likely.
You specify the encoding when you open the file:
file = open(filename, encoding="utf8")
If file = open(filename, encoding="utf-8") doesn't work, try
file = open(filename, errors="ignore"), if you want to remove unneeded characters. (docs)
Alternatively, if you don't need to decode the file, such as uploading the file to a website, use:
open(filename, 'rb')
where r = reading, b = binary
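For instance, a minimal sketch of working with the file purely as bytes (the file names here are placeholders), where no encoding is involved at any point:
with open('source.dat', 'rb') as src, open('copy.dat', 'wb') as dst:
    dst.write(src.read())  # byte-for-byte copy, nothing is decoded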
As an extension to @LennartRegebro's answer:
If you can't tell what encoding your file uses, the solution above does not work (it's not utf8), and you find yourself merely guessing, there are online tools that you can use to identify the encoding (a local alternative is sketched below). They aren't perfect, but they usually work just fine. After you figure out the encoding, you should be able to use the solution above.
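A local alternative to those online tools is the third-party chardet package (not part of the standard library); a minimal sketch, with the file name as a placeholder:
import chardet  # pip install chardet

with open('mystery.txt', 'rb') as f:
    raw = f.read()
guess = chardet.detect(raw)  # e.g. {'encoding': 'utf-8', 'confidence': 0.99, ...}
print(guess['encoding'], guess['confidence'])
text = raw.decode(guess['encoding'])  # then decode with the detected encoding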
EDIT: (Copied from comment)
The quite popular text editor Sublime Text has a command to display the encoding if it has been set:
Go to View -> Show Console (or Ctrl+`).
Type view.encoding() into the field at the bottom and hope for the best (I was unable to get anything but Undefined, but maybe you will have better luck...).
TLDR: Try: file = open(filename, encoding='cp437')
Why? When one uses:
file = open(filename)
text = file.read()
Python assumes the file uses the same codepage as the current environment (cp1252 in the case of the opening post) and tries to decode it to its own default UTF-8. If the file contains characters with values not defined in this codepage (like 0x90), we get a UnicodeDecodeError. Sometimes we don't know the encoding of the file, sometimes the file's encoding may be unhandled by Python (like e.g. cp790), and sometimes the file can contain mixed encodings.
If such characters are unneeded, one may decide to replace them by question marks, with:
file = open(filename, errors='replace')
Another workaround is to use:
file = open(filename, errors='ignore')
The undecodable characters are then simply dropped while the rest is left intact, but other errors will be masked too.
A very good solution is to specify an encoding, yet not just any encoding (like cp1252), but one which has ALL byte values defined (like cp437):
file = open(filename, encoding='cp437')
Codepage 437 is the original DOS encoding. All byte values are defined, so there are no errors while reading the file, no errors are masked, and the characters are preserved (not quite left intact, but still distinguishable).
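A quick way to see why cp437 never raises a decode error is that every possible byte value round-trips through it; a minimal sketch:
data = bytes(range(256))
# Every byte decodes under cp437 and encodes back unchanged, so reading
# with encoding='cp437' can never raise a UnicodeDecodeError.
assert data.decode('cp437').encode('cp437') == data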
Stop wasting your time; just add encoding="cp437" and errors='ignore' to your code for both reading and writing:
open('filename.csv', encoding="cp437", errors='ignore')
open(file_name, 'w', newline='', encoding="cp437", errors='ignore')
Godspeed
For me, encoding with utf16 worked:
file = open('filename.csv', encoding="utf16")
For those working with Anaconda on Windows: I had the same problem, and Notepad++ helped me solve it.
Open the file in Notepad++. In the bottom right it will tell you the current file encoding.
In the top menu, next to "View", locate "Encoding". Under "Encoding", go to "Character sets" and patiently look for the encoding that you need. In my case the encoding "Windows-1252" was found under "Western European".
Before you apply the suggested solution, you can check which Unicode character appeared in your file (and in the error log), in this case 0x90: https://unicodelookup.com/#0x90/1 (or directly at the Unicode Consortium site http://www.unicode.org/charts/ by searching for 0x0090), and then consider removing it from the file.
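If you do decide to strip it, a minimal sketch that removes the offending byte without ever decoding the file (the file names are placeholders):
with open('input.txt', 'rb') as src:
    data = src.read()
with open('cleaned.txt', 'wb') as dst:
    dst.write(data.replace(b'\x90', b''))  # drop every 0x90 byte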
def read_files(file_path):
    with open(file_path, encoding='utf8') as f:
        text = f.read()
    return text
OR (AND)
def write_files(text, file_path):
    with open(file_path, 'wb') as f:
        f.write(text.encode('utf8', 'ignore'))
In newer versions of Python (starting with 3.7), you can add the interpreter option -Xutf8, which should fix your problem. If you use PyCharm, just go to Run > Edit Configurations (in the Configuration tab, change the value of the Interpreter options field to -Xutf8).
Or, equivalently, you can just set the environment variable PYTHONUTF8 to 1.
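A quick way to confirm that UTF-8 mode is actually active (available since Python 3.7):
import sys
print(sys.flags.utf8_mode)  # 1 when running with -Xutf8 or PYTHONUTF8=1, else 0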
For me, changing the MySQL character encoding to match my code helped sort out the solution: photo = open('pic3.png', encoding='latin1')

Python 3: my unicode2shift-jis script works except writes ASCII file. Why?

I have a file with Unicode Japanese writing in it and I want to convert it to Shift-JIS and print it out to Shift-JIS encoded file. I do this:
with open("unikanji.txt", 'rb') as unikanjif:
unikanji = unikanjif.read()
sjskanji = unikanji.decode().encode('shift-jis')
with open("kanji.txt", 'wb') as sjskanjif:
sjskanjif.write(sjskanji)
It works except that when I open kanji.txt it always opens as an Ansi file, not Shift-JIS, and I see misc characters instead of Japanese. If I manually change the file encoding to Shift-JIS then the misc characters turn into the right Japanese characters. How do I make my program create the file as Shift-JIS to begin with?
"ANSI" is Microsoft's term for the default, localized encoding, which varies according to the localized version of Windows used. A Microsoft program like Notepad assumes "ANSI" for the encoding of a text file unless it starts with a byte order mark. Microsoft Notepad recogizes UTF-8, UTF-16LE and UTF-16BE BOMs.
Shift-JIS is a localized encoding, so you have to use an editor such as Notepad++ and manually configure it to Shift-JIS, as you have discovered. The file as you have written it is Shift-JIS-encoded, but unless the editor you use has some heuristic to detect the encoding it will have to be manually configured. You could also use Japanese Windows or change your localization default in your current Windows version and Shift-JIS might be the ANSI default.
By the way, converting encodings can be a little more straightforward. Below assumes the original file is UTF-8 and the target file will be shift-jis. utf-8-sig automatically handles and removes a byte order mark, if present.
with open('unikanji.txt', encoding='utf-8-sig') as f:
    text = f.read()
with open('kanji.txt', 'w', encoding='shift-jis') as f:
    f.write(text)

How to avoid encoding parameter when opening file in Python3

When I am working on a .txt file on a Windows device I must save as either: ANSI, Unicode, Unicode big endian, or UTF-8. When I run Python3 on an OSX device and try to import and read the .txt file, I have to do something along the lines of:
with open('ships.txt', 'r', encoding='utf-8') as f:
    for line in f.readlines():
        print(line)
Is there a particular format I should use to encode the .txt file on the Windows device to avoid adding the encoding parameter when opening the file in Python?
Call locale.getpreferredencoding(False) on your OSX device. That's the default encoding used for reading a file on that device. Save in that encoding on Windows and you won't need to specify the encoding on your OSX device.
But as the Zen of Python says, "Explicit is better than implicit." Since you know the encoding, why not specify it?
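For instance, a quick check on the OSX side of what that default actually is:
import locale
# The encoding open() falls back to when none is specified.
print(locale.getpreferredencoding(False))  # typically 'UTF-8' on OSX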

Writing unicode with python - what is wrong with this character

With Python 2.7 I am reading as unicode and writing as utf-16-le. Most characters are correctly interpreted. But some are not, for example, u'\u810a', also known as unichr(33034). The following code does not write correctly:
import codecs
with open('temp.txt','w') as temp:
    temp.write(codecs.BOM_UTF16_LE)
    text = unichr(33034)  # text = u'\u810a'
    temp.write(text.encode('utf-16-le'))
But either of these things, when replaced above, makes the code work:
unichr(33033) and unichr(33035) work correctly.
'utf-8' encoding (without BOM, byte-order mark).
How can I recognize characters that won't write correctly, and how can I write a 'utf-16-le' encoded file with BOM that either prints these characters or some replacement?
You are opening the file in text mode, which means that line-break characters/bytes will be translated to the local convention. Unfortunately the character you are trying to write includes a byte, 0A, that is interpreted as a line break and does not make it to the file correctly.
Open the file in binary mode instead:
open('temp.txt','wb')
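Putting it together, a sketch of the snippet from the question with only the mode changed to binary (Python 2, as in the question):
import codecs

with open('temp.txt', 'wb') as temp:
    temp.write(codecs.BOM_UTF16_LE)
    temp.write(unichr(33034).encode('utf-16-le'))  # bytes 0A 81, written untranslated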
@Joni's answer is the root of the problem, but if you use codecs.open instead, it always opens in binary mode, even if 'b' is not specified. Using the utf16 codec also automatically writes the BOM using native endianness:
import codecs
with codecs.open('temp.txt','w','utf16') as temp:
    temp.write(u'\u810a')
Hex dump of temp.txt:
FF FE 0A 81
Reference: codecs.open
You're already using the codecs library. When working with that file, you should swap open() for codecs.open() to transparently handle the encoding.
import codecs
with codecs.open('temp.txt', 'w', encoding='utf-16-le') as temp:
    temp.write(unichr(33033))
    temp.write(unichr(33034))
    temp.write(unichr(33035))
If you have a problem after that, you might have an issue with your viewer, not your Python script.
