python magic can't identify unicode filename - python

In my small project I had to identify the type of files in a directory. So I went with the python-magic module and did the following:
import os
import magic
from Tkinter import Tk
from tkFileDialog import askdirectory

def getDirInput():
    root = Tk()
    root.withdraw()
    return askdirectory()

di = getDirInput()
print('Selected Directory: ' + di)
for f in os.listdir(di):
    m = magic.Magic(magic_file='magic')
    print 'Type of ' + f + ' --> ' + m.from_file(f)
But it seems that python-magic can't take unicode filenames as-is when I pass them to the from_file() function. Here's a sample output:
Selected Directory: C:/Users/pruthvi/Desktop/vidrec/temp
Type of log.txt --> ASCII English text, with very long lines, with CRLF, CR line terminators
Type of TAEYEON 태연_ I (feat. Verbal Jint)_Music Video.mp4 --> cannot open `TAEYEON \355\234\227_ I (feat. Verbal Jint)_Music Video.mp4' (No such file or directory)
Type of test.py --> a python script text executable
You can observe that python-magic failed to identify the type of the second file, TAEYEON..., as it had Unicode characters in it. It shows the 태연 characters as \355\234\227 instead, while I passed the same name in both cases. How can I overcome this problem and find the type of files with Unicode characters as well? Thank you.

But it seems that python-magic can't take unicode filenames
Correct. In fact most cross-platform software on Windows can't handle non-ASCII characters in filenames.
This is because the C standard library uses byte strings for all filenames but Windows uses Unicode strings (technically, UTF-16 code unit strings, but the difference isn't important here). When software using the C standard library opens a file by byte-based string, the MS C runtime converts that to a Unicode string automatically, using an encoding (the confusingly-named ‘ANSI’ code page) that depends on the locale of the Windows installation. Your ANSI code page is probably 1252, which can't encode Korean characters, so it's impossible to use that filename. The ANSI code page is unfortunately never anything sensible like UTF-8, so it can never include all possible Unicode characters.
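You can see the failure directly in a Python 2 session (cp1252 here is an assumption standing in for whatever your ANSI code page actually is):
>>> tae_yeon = u'\ud0dc\uc5f0'   # the Korean characters 태연, as escapes
>>> tae_yeon.encode('cp1252')    # roughly what the MS C runtime has to do
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-1: character maps to <undefined>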
Python is special in that it has extra support for Windows Unicode filenames, which bypasses the C standard library and calls the underlying Win32 APIs directly. So you can pass a unicode string to e.g. open() and it will work for all filenames.
However python-magic's from_file call doesn't open the file from Python. Instead it passes the filename to the libmagic library which is written in pure C. libmagic doesn't have the special Windows-filename code path for Unicode so this fails.
I suggest opening the file yourself from Python and using magic.from_buffer instead.
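A minimal sketch of that approach (Python 2; it assumes python-magic's Magic class and the same 'magic' database file as in the question, and reads just the first couple of kilobytes, which is usually enough for libmagic):
# -*- coding: utf-8 -*-
import os
import magic

u_di = u'C:/Users/pruthvi/Desktop/vidrec/temp'   # a unicode directory path
m = magic.Magic(magic_file='magic')              # initialise once, reuse
for f in os.listdir(u_di):                       # unicode in, unicode names out
    path = os.path.join(u_di, f)
    with open(path, 'rb') as fh:                 # Python's open handles the Unicode path
        print u'Type of %s --> %s' % (f, m.from_buffer(fh.read(2048)))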

The response from the magic module seems to show that your characters were incorrectly translated somewhere - only half the string is shown, and the byte order of 태 is wrong - it should be \355\227\234 at least.
As this is on Windows, this raises UTF-16 byte-order alarm bells.
It might be possible to work around this by encoding to UTF-16. As suggested by other commenters, you also need to prefix the filename with its directory.
import os
import locale
import magic

input_encoding = locale.getpreferredencoding()
u_di = di.decode(input_encoding)     # make the directory path a unicode string
m = magic.Magic(magic_file='magic')  # only needs to be initialised once
for f in os.listdir(u_di):
    fq_f = os.path.join(u_di, f)
    utf16_fq_f = fq_f.encode("UTF-16LE")
    print m.from_file(utf16_fq_f)

Related

FPDF encoding error when reading a UTF8 txt file in Python [duplicate]

I'm having some brain failure in understanding reading and writing text to a file (Python 2.4).
# The string, which has an a-acute in it.
ss = u'Capit\xe1n'
ss8 = ss.encode('utf8')
repr(ss), repr(ss8)
("u'Capit\xe1n'", "'Capit\xc3\xa1n'")
print ss, ss8
print >> open('f1','w'), ss8
>>> file('f1').read()
'Capit\xc3\xa1n\n'
So I type Capit\xc3\xa1n into my favorite editor, in file f2.
Then:
>>> open('f1').read()
'Capit\xc3\xa1n\n'
>>> open('f2').read()
'Capit\\xc3\\xa1n\n'
>>> open('f1').read().decode('utf8')
u'Capit\xe1n\n'
>>> open('f2').read().decode('utf8')
u'Capit\\xc3\\xa1n\n'
What am I not understanding here? Clearly there is some vital bit of magic (or good sense) that I'm missing. What does one type into text files to get proper conversions?
What I'm truly failing to grok here, is what the point of the UTF-8 representation is, if you can't actually get Python to recognize it, when it comes from outside. Maybe I should just JSON dump the string, and use that instead, since that has an asciiable representation! More to the point, is there an ASCII representation of this Unicode object that Python will recognize and decode, when coming in from a file? If so, how do I get it?
>>> print simplejson.dumps(ss)
'"Capit\u00e1n"'
>>> print >> file('f3','w'), simplejson.dumps(ss)
>>> simplejson.load(open('f3'))
u'Capit\xe1n'
Rather than mess with .encode and .decode, specify the encoding when opening the file. The io module, added in Python 2.6, provides an io.open function, which allows specifying the file's encoding.
Supposing the file is encoded in UTF-8, we can use:
>>> import io
>>> f = io.open("test", mode="r", encoding="utf-8")
Then f.read returns a decoded Unicode object:
>>> f.read()
u'Capit\xe1n\n'
In 3.x, the io.open function is an alias for the built-in open function, which supports the encoding argument (it does not in 2.x).
We can also use open from the codecs standard library module:
>>> import codecs
>>> f = codecs.open("test", "r", "utf-8")
>>> f.read()
u'Capit\xe1n\n'
Note, however, that this can cause problems when mixing read() and readline().
In the notation u'Capit\xe1n\n' (should be just 'Capit\xe1n\n' in 3.x, and must be in 3.0 and 3.1), the \xe1 represents just one character. \x is an escape sequence, indicating that e1 is in hexadecimal.
Writing Capit\xc3\xa1n into the file in a text editor means that it actually contains \xc3\xa1 as literal characters - eight bytes of backslashes, letters and digits - and the code reads them all. We can see this by displaying the result:
# Python 3.x - reading the file as bytes rather than text,
# to ensure we see the raw data
>>> open('f2', 'rb').read()
b'Capit\\xc3\\xa1n\n'
# Python 2.x
>>> open('f2').read()
'Capit\\xc3\\xa1n\n'
Instead, just input characters like á in the editor, which should then handle the conversion to UTF-8 and save it.
In 2.x, a string that actually contains these backslash-escape sequences can be decoded using the string_escape codec:
# Python 2.x
>>> print 'Capit\\xc3\\xa1n\n'.decode('string_escape')
Capitán
The result is a str that is encoded in UTF-8 where the accented character is represented by the two bytes that were written \\xc3\\xa1 in the original string. To get a unicode result, decode again with UTF-8.
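Putting the two decodes together gives the whole round trip in one line (Python 2):
>>> 'Capit\\xc3\\xa1n\n'.decode('string_escape').decode('utf-8')
u'Capit\xe1n\n'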
In 3.x, the string_escape codec is replaced with unicode_escape, and it is strictly enforced that we can only encode from a str to bytes, and decode from bytes to str. unicode_escape needs to start with a bytes in order to process the escape sequences (the other way around, it adds them); and then it will treat the resulting \xc3 and \xa1 as character escapes rather than byte escapes. As a result, we have to do a bit more work:
# Python 3.x
>>> 'Capit\\xc3\\xa1n\n'.encode('ascii').decode('unicode_escape').encode('latin-1').decode('utf-8')
'Capitán\n'
Now all you need in Python 3 is open(filename, 'r', encoding='utf-8')
[Edit on 2016-02-10 for requested clarification]
Python 3 added the encoding parameter to its open function. The following information about the open function is gathered from here: https://docs.python.org/3/library/functions.html#open
open(file, mode='r', buffering=-1,
encoding=None, errors=None, newline=None,
closefd=True, opener=None)
Encoding is the name of the encoding used to decode or encode the
file. This should only be used in text mode. The default encoding is
platform dependent (whatever locale.getpreferredencoding()
returns), but any text encoding supported by Python can be used.
See the codecs module for the list of supported encodings.
So by adding encoding='utf-8' as a parameter to the open function, the file reading and writing is all done as UTF-8. (UTF-8 is also the default source encoding in Python 3, but note that, as quoted above, open() itself still defaults to the platform's preferred encoding.)
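As a quick sketch of that round trip (Python 3; 'f1' is just the hypothetical file name from the question):
# write and read non-ASCII text, stating the encoding explicitly
with open('f1', 'w', encoding='utf-8') as f:
    f.write('Capitán\n')
with open('f1', encoding='utf-8') as f:
    print(f.read())   # -> Capitán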
So, I've found a solution for what I'm looking for, which is:
print open('f2').read().decode('string-escape').decode("utf-8")
There are some unusual codecs that are useful here. This particular reading allows one to take UTF-8 representations from within Python, copy them into an ASCII file, and have them be read in to Unicode. Under the "string-escape" decode, the slashes won't be doubled.
This allows for the sort of round trip that I was imagining.
This works for reading a file with UTF-8 encoding in Python 3.2:
import codecs
f = codecs.open('file_name.txt', 'r', 'UTF-8')
for line in f:
    print(line)
# -*- encoding: utf-8 -*-
# converting a file of unknown encoding to utf-8
import codecs
import commands

file_location = "jumper.sub"
file_encoding = commands.getoutput('file -b --mime-encoding %s' % file_location)
file_stream = codecs.open(file_location, 'r', file_encoding)
file_output = codecs.open(file_location + "b", 'w', 'utf-8')
for l in file_stream:
    file_output.write(l)
file_stream.close()
file_output.close()
Aside from codecs.open(), io.open() can be used in both 2.x and 3.x to read and write text files. Example:
import io
text = u'á'
encoding = 'utf8'
with io.open('data.txt', 'w', encoding=encoding, newline='\n') as fout:
    fout.write(text)
with io.open('data.txt', 'r', encoding=encoding, newline='\n') as fin:
    text2 = fin.read()
assert text == text2
To read in a Unicode string and then send it to HTML, I did this:
fileline.decode("utf-8").encode('ascii', 'xmlcharrefreplace')
Useful for Python-powered HTTP servers.
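For example (Python 2, with a made-up UTF-8 byte string; á is code point 225, hence the &#225; entity):
line = 'Capit\xc3\xa1n'   # UTF-8 encoded bytes
html = line.decode('utf-8').encode('ascii', 'xmlcharrefreplace')
print html                # -> Capit&#225;n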
Well, your favorite text editor does not realize that \xc3\xa1 are supposed to be character literals, but it interprets them as text. That's why you get the double backslashes in the last line -- it's now a real backslash + xc3, etc. in your file.
If you want to read and write encoded files in Python, best use the codecs module.
Pasting text between the terminal and applications is difficult, because you don't know which program will interpret your text using which encoding. You could try the following:
>>> s = file("f1").read()
>>> print unicode(s, "Latin-1")
Capitán
Then paste this string into your editor and make sure that it stores it using Latin-1. Under the assumption that the clipboard does not garble the string, the round trip should work.
You have stumbled over the general problem with encodings: How can I tell in which encoding a file is?
Answer: You can't unless the file format provides for this. XML, for example, begins with:
<?xml encoding="utf-8"?>
This header was carefully chosen so that it can be read no matter the encoding. In your case, there is no such hint, hence neither your editor nor Python has any idea what is going on. Therefore, you must use the codecs module and use codecs.open(path,mode,encoding) which provides the missing bit in Python.
As for your editor, you must check if it offers some way to set the encoding of a file.
The point of UTF-8 is to be able to encode 21-bit characters (Unicode) as an 8-bit data stream (because that's the only thing all computers in the world can handle). But since most OSs predate the Unicode era, they don't have suitable tools to attach the encoding information to files on the hard disk.
The next issue is the representation in Python. This is explained perfectly in the comment by heikogerlach. You must understand that your console can only display ASCII. In order to display Unicode or anything >= charcode 128, it must use some means of escaping. In your editor, you must not type the escaped display string but what the string means (in this case, you must enter the umlaut and save the file).
That said, you can use the Python function eval() to turn an escaped string into a string:
>>> x = eval("'Capit\\xc3\\xa1n\\n'")
>>> x
'Capit\xc3\xa1n\n'
>>> x[5]
'\xc3'
>>> len(x[5])
1
As you can see, the string "\xc3" has been turned into a single character. This is now an 8-bit string, UTF-8 encoded. To get Unicode:
>>> x.decode('utf-8')
u'Capit\xe1n\n'
Gregg Lind asked: I think there are some pieces missing here: the file f2 contains: hex:
0000000: 4361 7069 745c 7863 335c 7861 316e Capit\xc3\xa1n
codecs.open('f2','rb', 'utf-8'), for example, reads them all in as separate chars (expected). Is there any way to write to a file in ASCII that would work?
Answer: That depends on what you mean. ASCII can't represent characters > 127. So you need some way to say "the next few characters mean something special" which is what the sequence "\x" does. It says: The next two characters are the code of a single character. "\u" does the same using four characters to encode Unicode up to 0xFFFF (65535).
So you can't directly write Unicode to ASCII (because ASCII simply doesn't contain the same characters). You can write it as string escapes (as in f2); in this case, the file can be represented as ASCII. Or you can write it as UTF-8, in which case, you need an 8-bit safe stream.
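To see the difference between the two escape forms in a Python 2 session (the last line assumes a terminal that can display the character):
>>> u'\x61'        # \x: two hex digits, one character
u'a'
>>> u'\u00e1'      # \u: four hex digits, code points up to 0xFFFF
u'\xe1'
>>> print u'\u00e1'
á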
Your solution using decode('string-escape') does work, but you must be aware how much memory you use: Three times the amount of using codecs.open().
Remember that a file is just a sequence of 8-bit bytes. Neither the bits nor the bytes have a meaning. It's you who says "65 means 'A'". Since \xc3\xa1 should become "á" but the computer has no means to know, you must tell it by specifying the encoding which was used when writing the file.
The \x.. sequence is something that's specific to Python. It's not a universal byte escape sequence.
How you actually enter in UTF-8-encoded non-ASCII depends on your OS and/or your editor. Here's how you do it in Windows. For OS X to enter a with an acute accent you can just hit option + E, then A, and almost all text editors in OS X support UTF-8.
You can also improve the original open() function to work with Unicode files by replacing it in place, using functools.partial. The beauty of this solution is that you don't need to change any old code. It's transparent.
import codecs
import functools
open = functools.partial(codecs.open, encoding='utf-8')
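After that replacement, untouched legacy code picks up the behaviour automatically; a minimal sketch, assuming a UTF-8 encoded file 'f1':
with open('f1') as f:      # old call site, unchanged
    data = f.read()        # now a unicode object, decoded as UTF-8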
I was trying to parse iCal using Python 2.7.9:
from icalendar import Calendar
But I was getting:
Traceback (most recent call last):
  File "ical.py", line 92, in parse
    print "{}".format(e[attr])
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 7: ordinal not in range(128)
and it was fixed with just:
print "{}".format(e[attr].encode("utf-8"))
(Now it can print liké á böss.)
I found the most simple approach by changing the default encoding of the whole script to be 'UTF-8':
import sys
reload(sys)
sys.setdefaultencoding('utf8')
Any open, print, or other statement will then just use UTF-8.
Works at least for Python 2.7.9.
Thanks go to https://markhneedham.com/blog/2015/05/21/python-unicodeencodeerror-ascii-codec-cant-encode-character-uxfc-in-position-11-ordinal-not-in-range128/ (look at the end).

Reading a multibyte text file in Windows - how does it detect newlines? (Python 2)

I thought this was a caveat of a Unicode world: you cannot correctly process a byte stream as text without knowing what the encoding is. If you assume an encoding, then you might get valid - but incorrect - characters showing up.
Here's a test - a file containing the text:
hi1
hi2
stored on disk with a 2-byte Unicode encoding (UTF-16).
Windows newline characters are \r\n, stored as the four-byte sequence 0D 00 0A 00. Opening it in Python 2 with default settings - I think it expects ASCII, one byte per character (or really just a stream of bytes) - it reads:
>>> open('d:/t/hi2.txt').readlines()
['\xff\xfeh\x00i\x001\x00\r\x00\n',
'\x00h\x00i\x002\x00']
It's not decoding two bytes into one character, yet the four byte line ending sequence has been detected as two characters, and the file has been correctly split into two lines.
Presumably, then, Windows opened the file in 'text mode', as described here: Difference between files writen in binary and text mode
and fed the lines to Python. But how did Windows know the file was multibyte encoded, and to look for four-bytes of newlines, without being told, as per the caveat at the top of the question?
Does Windows guess, with a heuristic - and therefore can be wrong?
Is there more cleverness in the design of Unicode, something which makes Windows newline patterns unambiguous across encodings?
Is my understanding wrong, and there is a correct way to process any text file without being told the encoding beforehand?
The result in this case has nothing to do with Windows or the standard I/O implementation of Microsoft's C runtime. You'll see the same result if you test this in Python 2 on a Linux system. It's just how file.readlines (2.7.12 source link) works in Python 2. See line 1717, p = (char *)memchr(buffer+nfilled, '\n', nread) and then line 1749, line = PyString_FromStringAndSize(q, p-q). It naively consumes up to a \n character, which is why the actual UTF-16LE \n\x00 sequence gets split up.
If you had opened the file using Python 2's universal newlines mode, e.g. open('d:/t/hi2.txt', 'U'), the \r\x00 sequences would naively be translated to \n\x00. The result of readlines would instead be ['\xff\xfeh\x00i\x001\x00\n', '\x00\n', '\x00h\x00i\x002\x00'].
Thus your initial supposition is correct. You need to know the encoding, or at least know to look for a Unicode BOM (byte order mark) at the start of the file, such as \xff\xfe, which indicates UTF-16LE (little endian). To that end I recommend using the io module in Python 2.7, since it properly handles newline translation. codecs.open, on the other hand, requires binary mode on the wrapped file and ignores universal newline mode:
>>> codecs.open('test.txt', 'U', encoding='utf-16').readlines()
[u'hi1\r\n', u'hi2']
io.open returns a TextIOWrapper that has built-in support for universal newlines:
>>> io.open('test.txt', encoding='utf-16').readlines()
[u'hi1\n', u'hi2']
Regarding Microsoft's CRT, it defaults to ANSI text mode. Microsoft's ANSI codepages are supersets of ASCII, so the CRT's newline translation will work for files encoded with an ASCII compatible encoding such as UTF-8. On the other hand, ANSI text mode doesn't work for a UTF-16 encoded file, i.e. it doesn't remove the UTF-16LE BOM (\xff\xfe) and doesn't translate newlines:
>>> open('test.txt').read()
'\xff\xfeh\x00i\x001\x00\r\x00\n\x00h\x00i\x002\x00'
Thus using standard I/O text mode for a UTF-16 encoded file requires the non-standard ccs flag, e.g. fopen("d:/t/hi2.txt", "rt, ccs=UNICODE"). Python doesn't support this Microsoft extension to the open mode, but it does make the CRT's low I/O (POSIX) _open and _read functions available in the os module. While it might surprise POSIX programmers, Microsoft's low I/O API also supports text mode, including Unicode. For example:
>>> O_WTEXT = 0x10000
>>> fd = os.open('test.txt', os.O_RDONLY | O_WTEXT)
>>> os.read(fd, 100)
'h\x00i\x001\x00\n\x00h\x00i\x002\x00'
>>> os.close(fd)
The O_WTEXT constant isn't made directly available in Windows Python because it's not safe to open a file descriptor with this mode as a Python file using os.fdopen. The CRT expects all wide-character buffers to be a multiple of the size of a wchar_t, i.e. a multiple of 2. Otherwise it invokes the invalid parameter handler that kills the process. For example (using the cdb debugger):
>>> fd = os.open('test.txt', os.O_RDONLY | O_WTEXT)
>>> os.read(fd, 7)
ntdll!NtTerminateProcess+0x14:
00007ff8`d9cd5664 c3 ret
0:000> k8
Child-SP RetAddr Call Site
00000000`005ef338 00007ff8`d646e219 ntdll!NtTerminateProcess+0x14
00000000`005ef340 00000000`62db5200 KERNELBASE!TerminateProcess+0x29
00000000`005ef370 00000000`62db52d4 MSVCR90!_invoke_watson+0x11c
00000000`005ef960 00000000`62db0cff MSVCR90!_invalid_parameter+0x70
00000000`005ef9a0 00000000`62db0e29 MSVCR90!_read_nolock+0x76b
00000000`005efa40 00000000`1e056e8a MSVCR90!_read+0x10d
00000000`005efaa0 00000000`1e0c3d49 python27!Py_Main+0x12a8a
00000000`005efae0 00000000`1e1146d4 python27!PyCFunction_Call+0x69
The same applies to _O_UTF8 and _O_UTF16.
First things first: open your file as text, indicating the correct encoding, and in explicit text mode.
If you are still using Python 2.7, use codecs.open instead of open. In Python 3.x, just use open:
import codecs
myfile = codecs.open('d:/t/hi2.txt', 'rt', encoding='utf-16')
And you should be able to work on it.
Second, here is what is likely going on: since you did not specify that you were opening the file in binary mode, Windows opens it in "text" mode. Windows does know about the encoding, and thus can find the \r\n sequences in the lines - it reads the lines separately, performing the end-of-line translation - using utf-16 - and passes those utf-16 bytes to Python.
On the Python side, you could use these values, just decoding them to text:
[line.decode("utf-16") for line in open('d:/t/hi2.txt')]
instead of
open('d:/t/hi2.txt').readlines()

Save file with Russian letters in the file name

I have this Python script that takes the info of a webpage and then saves this info to a text file. But the name of this text file changes from time to time, and it can change to Cyrillic letters sometimes, and sometimes Korean letters.
The problem is that if, say, I'm trying to save the file with the name "бореиская", then the name will appear very weird when I'm viewing it in Windows.
I'm guessing I need to change some encoding at some places. But the name is being sent to the open() function:
server = "бореиская"
file = open("eu_" + server + ".lua", "w")
I am, earlier on, taking the server variable from an array that already has all the names in it.
But as previously mentioned, in Windows, the names appear with some very weird characters.
tl;dr
Always use Unicode strings for file names and paths. E.g.:
io.open(u"myfile€.txt")
os.listdir(u"mycrazydirß")
In your case:
server = u"бореиская"
file = open(u"eu_" + server + ".lua", "w")
I assume server will come from another location, so you will need to ensure that it's decoded to a Unicode string correctly. See io.open().
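A sketch of that decoding step; the 'utf-8' is an assumption about wherever server really comes from, and the .lua content is a placeholder:
# -*- coding: utf-8 -*-
import io

raw = 'бореиская'              # a byte string in the source's encoding
server = raw.decode('utf-8')   # decode to a unicode string first
with io.open(u'eu_' + server + u'.lua', 'w', encoding='utf-8') as f:
    f.write(u'-- placeholder content\n')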
Explanation
Windows
Windows stores filenames using UTF-16. The Windows I/O API and Python hide this detail, but they require Unicode strings; otherwise, a byte string has to use the correct 8-bit code page.
Linux
Filenames can be made from any byte string, in any encoding, as long as it's not ASCII "." or "..". As each system user can have their own encoding, you really can't guarantee the encoding one user used is the same as another. The locale is used to configure each user's environment. The user's terminal encoding also needs to match the encoding for consistency.
The best that can be hoped is that a user hasn't changed their locale and all applications are using the same locale. For example, the default locale may be: en_GB.UTF-8, meaning the encoding of files and filenames should be UTF-8.
When Python encounters a Unicode filename, it will use the user's locale to decode/encode filenames. An encoded string will be passed directly to the kernel, meaning you may get lucky with using "UTF-8" filenames.
OS X
OS X's filenames are always UTF-8 encoded, regardless of the user's locale. Therefore, a filename should be a Unicode string, but it may also be encoded in the user's locale, and it will be translated. As most users' locales are *.UTF-8, this means you can actually pass a UTF-8 encoded string or a Unicode string.
Roundup
For best cross-platform compatibility, always use Unicode strings as in most cases they will be translated to the correct encoding. It's really just Linux that has the most ambiguity, as some applications may choose to ignore the default locale or a user may have changed their locale to a non-UTF-8 version.
I'm viewing it in windows. ...Using python 2.7
Use Unicode filenames on Windows; Python can use the Unicode API there.
Do not use non-ASCII characters in bytestring literals (it is explicitly forbidden in Python 3).
Use Unicode literals u'' or add from __future__ import unicode_literals at the top of the module.
Make sure the encoding declaration (# -*- coding: utf-8 -*-) is correct, i.e., your IDE/editor uses the specified encoding to save your Python source:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
server = u"бореиская"
with open(u"eu_{server}.lua".format(**vars()), "w") as file:
...
On Windows you probably have to encode the filename to one of the cp125x encodings, but I don't know which one - probably cp1251:
filename = "eu_" + server + ".lua"
filename = filename.encode('cp1251')
file = open(filename, 'w')
On Linux you should use utf-8.
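Rather than hard-coding either choice, you can ask Python for the filesystem encoding; a sketch (whether an encoded byte string or a unicode string is the better thing to pass is platform-dependent, as the other answers note):
# -*- coding: utf-8 -*-
import sys

server = u'бореиская'
filename = (u'eu_' + server + u'.lua').encode(sys.getfilesystemencoding())
file = open(filename, 'w')
file.close()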

Get properties of a file whose name contains special (non-ASCII) characters

I'm using Python and having some trouble reading the properties of a file when the filename includes non-ASCII characters.
One of the files for example is named:
0-Channel-https∺∯∯services.apps.microsoft.com∯browse∯6.2.9200-1∯615∯Channel.dat
When I run this:
list2 = os.listdir('C:\\Users\\James\\AppData\\Local\\Microsoft\\Windows Store\\Cache Medium IL\\0\\')
for data in list2:
    print str(os.path.getmtime(data)) + '\n'
I get the error:
WindowsError: [Error 123] The filename, directory name, or volume label syntax is incorrect: '0-Channel-https???services.apps.microsoft.com?browse?6.2.9200-1?615?Channel.dat'
I assume it's caused by the special chars, because the code works fine with other file names containing only ASCII chars.
Does anyone know of a way to query the filesystem properties of a file named like this?
If this is Python 2.x, it's an encoding issue. If you pass a unicode string to os.listdir, such as u'C:\\my\\pathname', it will return unicode strings, and they should have the non-ASCII chars decoded correctly. See Unicode Filenames in the docs.
Quoting the doc:
os.listdir(), which returns filenames, raises an issue: should it return the Unicode version of filenames, or should it return 8-bit strings containing the encoded versions? os.listdir() will do both, depending on whether you provided the directory path as an 8-bit string or a Unicode string. If you pass a Unicode string as the path, filenames will be decoded using the filesystem’s encoding and a list of Unicode strings will be returned, while passing an 8-bit path will return the 8-bit versions of the filenames. For example, assuming the default filesystem encoding is UTF-8, running the following program:
this code should work...
directory_name = u'C:\\Users\\James\\AppData\\Local\\Microsoft\\Windows Store\\Cache Medium IL\\0\\'
list2 = os.listdir(directory_name)
for data in list2:
    print data, os.path.getmtime(os.path.join(directory_name, data))
As you are on Windows, you could try the ntpath module instead of os.path:
from ntpath import getmtime
As I don't have Windows, I can't test it. Every OS has a different path convention, so Python provides a specific module for the most common operating systems.

Wrong encoding in filenames created on Windows XP by Python script

My Python script creates an XML file under Windows XP, but that file doesn't get the right encoding with Spanish characters such as 'ñ' or some accented letters.
First of all, the filename is read from an Excel cell with the following code (I use the xlrd library to read the Excel file):
filename = excelsheet.cell_value(rowx=first_row, colx=5)
Then I've tried some encodings, without success, to generate the file with the right encoding:
filename = filename[:-1].encode("utf-8")
filename = filename[:-1].encode("latin1")
filename = filename[:-1].encode("windows-1252")
Using "windows-1252" I get a bad encoding with letter 'ñ', 'í' and 'é'. For example, I got BAJO ARAGÓN_Alcañiz.xml instead of BAJO ARAGÓN_Alcañiz.xml
Thanks in advance for your help
You should use unicode strings for your filenames. In general operating systems support filenames that contain arbitrary Unicode characters. So if you do:
fn = u'ma\u00d1o' # maÑo
f = open(fn, "w")
f.close()
f = open(fn, "r")
f.close()
it should work just fine. A different thing is what you see in your terminal when you list the contents of the directory where that file lives. If the encoding of the terminal is UTF-8 you will see the filename maÑo, but if the encoding is for instance iso-8859-1 you will see maÃo. But even if you see these strange characters you should be able to open the file from python as described above.
In summary, do not encode the output of
filename = excelsheet.cell_value(rowx=first_row, colx=5)
instead make sure it is a unicode string.
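For instance, a minimal sketch, assuming xlrd (which returns text cells as unicode in Python 2) and the excelsheet/first_row variables from the question:
filename = excelsheet.cell_value(rowx=first_row, colx=5)
assert isinstance(filename, unicode)          # already unicode: no .encode() needed
with open(u'%s.xml' % filename[:-1], 'w') as f:
    f.write('<?xml version="1.0" encoding="utf-8"?>\n')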
Reading the Unicode filenames section of the Python Unicode HOWTO can be helpful for you.
Trying your answers, I found a fast solution: port my script from Python 2.7 to Python 3.3. The reason to port my code is that Python 3 works in Unicode by default.
I had to make some little changes to my code, like the import of the xlrd library (previously I had to install xlrd3):
import xlrd3 as xlrd
Also, I had to convert the content from 'bytes' to 'string' using str() instead of encode():
filename = str(filename[:-1])
Now, my script works perfect and generate the files on Windows XP without strange characters.
First, if you have not, please read http://www.joelonsoftware.com/articles/Unicode.html
Now, "latin-1" should work for Spanish encoding under Windows - but there are two hypotheses there: the strings you are trying to "encode" to either encoding may not be Unicode strings, but already in some encoding. That, however, would more likely give you a UnicodeDecodeError than strange characters, though it might work in some corner case.
The more likely case is that you are checking your files using the Windows prompt, AKA "CMD".
Well, for some reason, Microsoft Windows uses two different encodings for the system - one for "native" Windows programs, which should be compatible with latin1, and another for legacy DOS programs, a category that includes the command prompt. For Portuguese, this second encoding is "cp852" (looking around, cp852 does not define "ñ", but cp850 does).
So, this happens:
>>> print u"Aña".encode("latin1").decode("cp850")
A±a
>>>
So, if you want your filenames to appear correctly from the DOS prompt, you should encode them using "CP850" - if you want them to look right from Windows programs, do encode them using "cp1252" (or "latin1" or "iso-8859-15" - they are almost the same, give or take the "€" symbol)
Of course, instead of trying to guess and picking one that looks good - which will fail if someone runs your program in Norway, Russia, or on a POSIX system - you should just do:
import sys
encoding = sys.getfilesystemencoding()
(This should return one of the above for you - again, the filename will look right if seen from a Windows program, not from a DOS shell.)
In Windows, the file system uses UTF-16, so no explicit encoding is required. Just use a Unicode string for the filename, and make sure to declare the encoding of the source file.
# coding: utf8
with open(u'BAJO ARAGÓN_Alcañiz.xml','w') as f:
    f.write('test')
Also, even though, for example, Ó isn't supported by the cp437 encoding of my US Windows system, my console font supports the character and it still displays correctly on my console. The console supports Unicode, but non-Unicode programs can only read/write code page characters.
