I have a Python script that takes the info from a webpage and then saves this info to a text file. But the name of this text file changes from time to time, and it can change to Cyrillic letters sometimes, and sometimes Korean letters.
The problem is that if, say, I'm trying to save the file with the name "бореиская", then the name appears very weird when I view it in Windows.
I'm guessing I need to change the encoding somewhere. The name is being sent to the open() function:
server = "бореиская"
file = open("eu_" + server + ".lua", "w")
Earlier on, I take the server variable from an array that already has all the names in it. But, as previously mentioned, in Windows the names appear with some very weird characters.
tl;dr
Always use Unicode strings for file names and paths. E.g.:
io.open(u"myfile€.txt")
os.listdir(u"mycrazydirß")
In your case:
server = u"бореиская"
file = open(u"eu_" + server + ".lua", "w")
I assume server will come from another location, so you will need to ensure that it's decoded to a Unicode string correctly. See io.open().
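For instance, if the name arrives as bytes from the network, a minimal sketch might look like this (fetch_server_name and the UTF-8 source encoding are assumptions, not part of the original code):
raw_server = fetch_server_name()      # hypothetical: returns UTF-8 bytes
server = raw_server.decode('utf-8')   # now a unicode string
f = open(u"eu_" + server + u".lua", "w")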
Explanation
Windows
Windows stores filenames using UTF-16. The Windows I/O API and Python hide this detail, but they require Unicode strings; otherwise a byte string will only work if it happens to use the correct 8-bit code page.
Linux
Filenames can be made from any byte string, in any encoding, as long as it's not ASCII "." or "..". As each system user can have their own encoding, you really can't guarantee that the encoding one user used is the same as another's. The locale is used to configure each user's environment. The user's terminal encoding also needs to match the filename encoding for consistency.
The best that can be hoped is that a user hasn't changed their locale and all applications are using the same locale. For example, the default locale may be: en_GB.UTF-8, meaning the encoding of files and filenames should be UTF-8.
When Python encounters a Unicode filename, it will use the user's locale to decode/encode filenames. An encoded string will be passed directly to the kernel, meaning you may get lucky with using "UTF-8" filenames.
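As a quick illustration (assuming a UTF-8 locale; the filename is hypothetical), you can check which codec Python will use and pass it a Unicode name:
import sys
print sys.getfilesystemencoding()   # e.g. 'UTF-8' under en_GB.UTF-8
open(u"caf\xe9.txt", "w").close()   # encoded to café.txt with that codec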
OS X
OS X's filenames are always UTF-8 encoded, regardless of the user's locale. Therefore, a filename should be a Unicode string, but one encoded in the user's locale will be translated. As most users' locales are *.UTF-8, this means you can actually pass either a UTF-8 encoded string or a Unicode string.
Roundup
For best cross-platform compatibility, always use Unicode strings, as in most cases they will be translated to the correct encoding. It's really just Linux that has the most ambiguity, as some applications may choose to ignore the default locale, or a user may have changed their locale to a non-UTF-8 version.
I'm viewing it in windows. ...Using python 2.7
Use Unicode filenames on Windows; Python can use the Unicode API there.
Do not use non-ASCII characters in bytestring literals (this is explicitly forbidden on Python 3).
Use Unicode literals such as u'' or add from __future__ import unicode_literals at the top of the module.
Make sure the encoding declaration (# -*- coding: utf-8 -*-) is correct, i.e., your IDE/editor saves your Python source using the specified encoding.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
server = u"бореиская"
with open(u"eu_{server}.lua".format(**vars()), "w") as file:
...
On Windows you probably have to encode the filename to one of the cp125x code pages, though I don't know which one; for Cyrillic it is probably cp1251.
filename = "eu_" + server + ".lua"
filename = filename.encode('cp1251')
file = open(filename, 'w')
On Linux you should use utf-8.
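If you do go this route, here is a hedged sketch of the per-platform choice (the cp1251 guess only suits Cyrillic names on a Russian Windows install, and server is assumed to be a unicode string here):
import sys
codec = 'cp1251' if sys.platform == 'win32' else 'utf-8'
filename = (u"eu_" + server + u".lua").encode(codec)
file = open(filename, 'w')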
Related
When I read a file in Python and print it to the screen, it does not read certain characters properly; however, those same characters hard-coded into a variable print just fine. Here is an example where "test.html" contains the text "Hallå":
with open('test.html','r') as file:
Str = file.read()
print(Str)
Str = "Hallå"
print(Str)
This generates the following output:
HallÃ¥
Hallå
My guess is that there is something wrong with how the data in the file is being interpreted when it is read into Python; however, I am uncertain of what it is, since Python 3.8.5 already uses UTF-8 encoding by default.
The open function does not use UTF-8 by default. As the documentation says:
In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding.
So, it depends, and to be certain, you have to specify the encoding yourself. If the file is saved in UTF-8, you should do this:
with open('test.html', 'r', encoding='utf-8') as file:
On the other hand, it is not clear whether the file is or is not saved in UTF-8 encoding. If it is not, you'll have to choose a different one.
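A quick way to see what default your platform would otherwise use (a one-off check, not part of the fix):
import locale
print(locale.getpreferredencoding(False))  # e.g. 'cp1252' on many Windows setups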
In my small project I had to identify the types of files in a directory. So I went with the python-magic module and did the following:
import os
import magic
from Tkinter import Tk
from tkFileDialog import askdirectory

def getDirInput():
    root = Tk()
    root.withdraw()
    return askdirectory()

di = getDirInput()
print('Selected Directory: ' + di)
for f in os.listdir(di):
    m = magic.Magic(magic_file='magic')
    print 'Type of ' + f + ' --> ' + m.from_file(f)
But it seems that python-magic can't take Unicode filenames as-is when I pass them to the from_file() function. Here's a sample output:
Selected Directory: C:/Users/pruthvi/Desktop/vidrec/temp
Type of log.txt --> ASCII English text, with very long lines, with CRLF, CR line terminators
Type of TAEYEON 태연_ I (feat. Verbal Jint)_Music Video.mp4 --> cannot open `TAEYEON \355\234\227_ I (feat. Verbal Jint)_Music Video.mp4' (No such file or directory)
Type of test.py --> a python script text executable
You can observe that python-magic failed to identify the type of the second file, TAEYEON..., as it had Unicode characters in it. It shows the 태연 characters as \355\234\227 instead, while I passed the same string in both cases. How can I overcome this problem and find the type of files with Unicode characters as well? Thank you.
But It seems that python-magic can't take unicode filenames
Correct. In fact most cross-platform software on Windows can't handle non-ASCII characters in filenames.
This is because the C standard library uses byte strings for all filenames but Windows uses Unicode strings (technically, UTF-16 code unit strings, but the difference isn't important here). When software using the C standard library opens a file by byte-based string, the MS C runtime converts that to a Unicode string automatically, using an encoding (the confusingly-named ‘ANSI’ code page) that depends on the locale of the Windows installation. Your ANSI code page is probably 1252, which can't encode Korean characters, so it's impossible to use that filename. The ANSI code page is unfortunately never anything sensible like UTF-8, so it can never include all possible Unicode characters.
Python is special in that it has extra support for Windows Unicode filenames, which bypasses the C standard library and calls the underlying Win32 APIs directly for Unicode filenames. So you can pass a Unicode string to e.g. open() and it will work for all filenames.
However python-magic's from_file call doesn't open the file from Python. Instead it passes the filename to the libmagic library which is written in pure C. libmagic doesn't have the special Windows-filename code path for Unicode so this fails.
I suggest opening the file yourself from Python and using magic.from_buffer instead.
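A minimal sketch of that approach, reusing di and the magic database from the question (the 2048-byte sniff size is an assumption; any reasonably sized header should do):
import os
import magic

m = magic.Magic(magic_file='magic')  # create once, outside the loop
for f in os.listdir(di):             # `di` is the directory chosen earlier
    path = os.path.join(di, f)
    with open(path, 'rb') as fh:     # Python copes with the Unicode path
        print 'Type of ' + f + ' --> ' + m.from_buffer(fh.read(2048))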
The response from the magic module seems to show that your characters were incorrectly translated somewhere: only half the string is shown, and the byte order of 태 is wrong; it should be \355\227\234 at least.
As this is on Windows, this raises UTF-16 byte-order alarm bells.
It might be possible to work around this by encoding to UTF-16. As suggested by other commenters, you need to prefix the directory.
import locale
import os
import magic

input_encoding = locale.getpreferredencoding()
u_di = di.decode(input_encoding)
m = magic.Magic(magic_file='magic')  # only needs to be initialised once
for f in os.listdir(u_di):
    fq_f = os.path.join(u_di, f)
    utf16_fq_f = fq_f.encode("UTF-16LE")
    print m.from_file(utf16_fq_f)
I have a collection of files from an older Mac OS file store. I know that there are filename/path name issues with the collection. The issue stems from the inclusion of a codepoint in the path that I think was rendered as a dash in the original OS, but Windows struggles with the codepoint, and either includes a diacritic on the previous character or replaces it with a ?.
I'm trying to figure out a way of establishing a "truth" about the file structure, so I can be sure I'm accounting for every file.
I have explored the files with a few tools, and nothing has matching tallies. I believe the following demonstrates the problem.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import os

folder = "emails"
b = os.listdir(folder)
for f in b:
    print repr(f)
    print os.path.isfile(os.path.join(folder, f))
(I have to redact the actual filenames a little.)
Results in:
'file(1)'
True
'file(2)'
True
'file(3)?'
False
'file(4)'
True
The file name of interest is file(3)?, where the odd codepoint has been decoded as a ?, and which evaluates as not being a file (or even existing, via os.path.exists).
Note that print repr(string) shows that it's handling a properly UTF-8-encoded ?.
I can copy and paste the filename from the folder, and it appears as: file(3) (note the full stop).
I can paste the string into my editor (subl) and see that I now have an undisplayable codepoint glyph for the final codepoint:
a = "file(3)"
print a
print repr(a)
Gives me:
file(3)
'file(3)\xef\x80\xa9'
From this I can see that the odd code point is \xef\x80\xa9. Elsewhere in the set I also find the codepoint \xef\x80\xa8.
I must assume that os.listdir is not returning raw codepoint values but a (UTF-8?) encoded string, with a codepoint substitution, which means that when it tests for exists or isfile it's testing for the existence of the wrong filename, as the file with a substituted ? does not exist.
How do I work with these files safely? I have around 40 in a collection of around 700 files.
Try passing a unicode to os.listdir:
folder = u"emails"
b = os.listdir(folder)
Doing so will cause os.listdir to return a list of unicodes instead of strs.
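For example (assuming, per the repr output above, that the odd trailing bytes \xef\x80\xa9 are the UTF-8 encoding of U+F029), the two argument types behave quite differently:
import os
for name in os.listdir('emails'):    # byte strings: 'file(3)?' (lossy)
    print repr(name)
for name in os.listdir(u'emails'):   # unicodes: u'file(3)\uf029' (exact)
    print repr(name)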
Unfortunately, the more I think about this, the less I understand why this worked. Every filesystem ultimately stores its filenames in bytes using some encoding; HFS+, for instance, stores filenames in UTF-16. So it would make sense if os.listdir could return those raw bytes most easily, without adulteration. But instead, in this case, it looks like os.listdir can return unadulterated unicode, but not unadulterated bytes.
If someone could explain that mystery I would be most appreciative.
Did the files come from the Mac Roman encoding (presumably what classic Mac OS used), or the NFD normal form of UTF-8 that Mac OS X uses?
The concept of Unicode normal forms is one that every programmer ought to be familiar with... precious few are, though. I can't tell you what you need to know about this with regard to Python, though.
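As a tiny, self-contained illustration of why normal forms matter when comparing filenames (the name here is hypothetical):
import unicodedata
nfc = u'ma\xf1ana'                              # one codepoint for the n-tilde
nfd = unicodedata.normalize('NFD', nfc)         # 'n' plus a combining tilde
print nfc == nfd                                # False, despite looking identical
print unicodedata.normalize('NFC', nfd) == nfc  # True once both are normalized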
My Python script creates an XML file under Windows XP, but that file doesn't get the right encoding for Spanish characters such as 'ñ' or some accented letters.
First of all, the filename is read from an Excel cell with the following code (I use the xlrd library to read the Excel file):
filename = excelsheet.cell_value(rowx=first_row, colx=5)
Then I've tried some encodings, without success, to generate the file with the right encoding:
filename = filename[:-1].encode("utf-8")
filename = filename[:-1].encode("latin1")
filename = filename[:-1].encode("windows-1252")
Using "windows-1252" I get a bad encoding with letter 'ñ', 'í' and 'é'. For example, I got BAJO ARAGÓN_Alcañiz.xml instead of BAJO ARAGÓN_Alcañiz.xml
Thanks in advance for your help
You should use Unicode strings for your filenames. In general, operating systems support filenames that contain arbitrary Unicode characters. So if you do:
fn = u'ma\u00d1o' # maÑo
f = open(fn, "w")
f.close()
f = open(fn, "r")
f.close()
it should work just fine. A different thing is what you see in your terminal when you list the contents of the directory where that file lives. If the encoding of the terminal is UTF-8 you will see the filename maÑo, but if the encoding is, for instance, iso-8859-1 you will see something like maÃo (the second UTF-8 byte of Ñ maps to an unprintable control character in iso-8859-1). But even if you see these strange characters, you should be able to open the file from Python as described above.
In summary, do not encode the output of
filename = excelsheet.cell_value(rowx=first_row, colx=5)
instead make sure it is a unicode string.
Reading the Unicode filenames section of the Python Unicode HOWTO can be helpful for you.
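A minimal sketch of that check (the utf-8 fallback encoding is an assumption; xlrd normally hands back unicode for text cells already):
filename = excelsheet.cell_value(rowx=first_row, colx=5)
if isinstance(filename, str):            # a byte string slipped through
    filename = filename.decode('utf-8')  # assumed source encoding
f = open(filename[:-1], 'w')             # [:-1] as in the question; open() gets unicode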
Trying your answers, I found a fast solution: porting my script from Python 2.7 to Python 3.3. The reason to port my code is that Python 3 works in Unicode by default.
I had to make some little changes in my code, such as the import of the xlrd library (previously I had to install xlrd3):
import xlrd3 as xlrd
Also, I had to convert the content from 'bytes' to 'string' using str instead of encode():
filename = str(filename[:-1])
Now my script works perfectly and generates the files on Windows XP without strange characters.
First, if you have not already, please read http://www.joelonsoftware.com/articles/Unicode.html
Now, "latin-1" should work for Spanish encoding under Windows. There are two hypotheses here: one is that the strings you are trying to "encode" are not Unicode strings, but are already in some encoding. That, however, would more likely give you a UnicodeDecodeError than strange characters, although it might work out that way in some corner case.
The more likely case is that you are checking your files using the Windows prompt, AKA "CMD".
Well, for some reason, Microsoft Windows uses two different encodings for the system: one for "native" Windows programs, which should be compatible with latin1, and another one for legacy DOS programs, in which category it puts the command prompt. For Portuguese, this second encoding is "cp852" (looking around, cp852 does not define "ñ", but cp850 does).
So, this happens:
>>> print u"Aña".encode("latin1").decode("cp850")
A±a
>>>
So, if you want your filenames to appear correctly from the DOS prompt, you should encode them using "cp850"; if you want them to look right from Windows programs, encode them using "cp1252" (or "latin1" or "iso-8859-15"; they are almost the same, give or take the "€" symbol).
Of course, instead of trying to guess and picking one that looks good, which will fail if someone runs your program in Norway, Russia, or on a POSIX system, you should just do:
import sys
encoding = sys.getfilesystemencoding()
(This should return one of the above for you; again, the filename will look right if seen from a Windows program, not from a DOS shell.)
In Windows, the file system uses UTF-16, so no explicit encoding is required. Just use a Unicode string for the filename, and make sure to declare the encoding of the source file.
# coding: utf8
with open(u'BAJO ARAGÓN_Alcañiz.xml','w') as f:
f.write('test')
Also, even though, for example, Ó isn't supported by the cp437 encoding of my US Windows system, my console font supports the character and it still displays correctly on my console. The console supports Unicode, but non-Unicode programs can only read/write code page characters.
With a filename looking like:
filename = u"/direc/tories/español.jpg"
And using open() as:
fp = open(filename, "rb")
This will correctly open the file on OS X (10.7), but on Ubuntu 11.04 the open() function will try to open u"espa\xf1ol.jpg", and this will fail with an IOError.
Through the process of trying to fix this I've checked sys.getfilesystemencoding() on both systems; both are set to utf-8 (although Ubuntu reports it in uppercase, i.e. UTF-8; not sure if that is relevant). I've also set # -*- coding: utf-8 -*- in the Python file, but I'm sure this only affects the encoding within the file itself, not any external functions or how Python deals with system resources. The file exists on both systems, with the eñe correctly displayed.
The end question is: How do I open the español.jpg file on the Ubuntu system?
Edit:
The español.jpg string is actually coming out of a database via Django's ORM (ImageFileField), but by the time I'm dealing with it and seeing the difference in behaviour I have a single unicode string which is an absolute path to the file.
This one below should work in both cases:
import sys
fp = open(filename.encode(sys.getfilesystemencoding()), "rb")
It's not enough to simply set the file encoding at the top of your file. Make sure that your editor is using the same encoding and saving the text in that encoding. If necessary, re-type any non-ASCII characters to ensure that your editor is doing the right thing.
If your value is coming from e.g. a database, you will still need to ensure that nowhere along the line is it being encoded as a non-Unicode string.
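A hedged sketch of that final check at the point of use (the photo attribute is hypothetical; only the isinstance/decode pattern matters):
filename = photo.path                    # hypothetical ImageFileField path
if not isinstance(filename, unicode):
    filename = filename.decode('utf-8')  # assumed stored encoding
fp = open(filename, 'rb')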