Can't handle strings in Windows - Python

I have written Python 2.7 code on Linux and it worked fine.
It uses
os.listdir(os.getcwd())
to read folder names as variables and uses them later in some parts.
On Linux I used a simple conversion trick to manually replace the non-ASCII characters with ASCII ones.
str(str(tfile)[0:-4]).replace('\xc4\xb0', 'I').replace("\xc4\x9e", 'G').replace("\xc3\x9c", 'U').replace("\xc3\x87", 'C').replace("\xc3\x96", 'O').replace("\xc5\x9e", 'S') ,str(line.split(";")[0]).replace(" ", "").rjust(13, "0"),a))
This approach failed on Windows. I tried
udata = str(str(str(tfile)[0:-4])).decode("UTF-8")
asci = udata.encode("ascii","ignore")
which also failed with the following:
DEM¦-RTEPE # the error occurred at this string
Exception in Tkinter callback
Traceback (most recent call last):
File "C:\Python27\lib\lib-tk\Tkinter.py", line 1532, in __call__
return self.func(*args)
File "C:\Users\benhur.satir\workspace\Soykan\tkinter.py", line 178, in SparisDerle
udata = str(str(str(tfile)[0:-4])).decode("utf=8")
File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa6 in position 3: invalid start byte
How can I handle such characters in windows?
NOTE: Leaving them as UTF-8 causes the xlswriter module to fail, so I need to convert them to ASCII. Missing characters are undesirable but acceptable.

Windows does not like UTF-8. You probably get the folder names in the default system encoding, generally win1252 (a variant of ISO-8859-1).
That's why you could not find UTF-8 byte sequences in the file names. By the way, the exception says it found a character with code 0xa6, which in win1252 encoding would be ¦.
This does not say exactly what the encoding on your Windows system is, as it may depend on the localization, but it proves the data is not UTF-8 encoded.
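One way to get the transliteration the question hard-codes, without spelling out byte sequences per character, is Unicode normalization: decompose accented letters and drop the combining marks. A minimal Python 3 sketch (in Python 2 you would first decode the bytes to a unicode string):

```python
import unicodedata

def to_ascii(name):
    # NFKD splits e.g. 'Ü' into 'U' plus a combining diaeresis;
    # encoding to ASCII with errors='ignore' then drops the mark.
    decomposed = unicodedata.normalize('NFKD', name)
    return decomposed.encode('ascii', 'ignore').decode('ascii')

print(to_ascii('DEM\u0130RTEPE'))  # 'İ' -> 'I', so this prints DEMIRTEPE
```

This covers all of İ, Ğ, Ü, Ç, Ö, Ş from the original replace chain in one rule.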

How about this?
You can use these characters as a whitelist for an optional .replace().
The string module provides ready-made sets of characters:
>>> import string
>>> string.digits + string.punctuation
'0123456789!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
>>>

Related

Which encoding should Python open function use?

I'm getting an exception when reading a file that contains a RIGHT DOUBLE QUOTATION MARK Unicode symbol. It is encoded in UTF-8 (0xE2 0x80 0x9D). The minimal example:
import sys
print(sys.getdefaultencoding())
f = open("input.txt", "r")
f.readline()
This script fails reading the first line even if the right quotation mark is not on the first line. The exception looks like that:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Program Files\Python36\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 102: character maps to <undefined>
The input file is in utf-8 encoding, I've tried both with and without BOM. The default encoding returned by sys.getdefaultencoding() is utf-8.
This script fails on the machine with Python 3.6.5 but works well on another with Python 3.6.0. Both machines are Windows.
My questions are mostly theoretical, as this exception is thrown from external software that I cannot change, and it reads file that I don't wish to change. What should be the difference in these machines except the Python patch version? Why does vanilla open use cp1252 if the system default is utf-8?
As clearly stated in Python's open documentation:
In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding.
Windows defaults to a localized encoding (cp1252 on US and Western European versions). Linux typically defaults to utf-8.
Because it is platform-dependent, use the encoding parameter and specify the encoding of the file explicitly.
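A minimal sketch of the fix, assuming the file is UTF-8 (this snippet creates its own input.txt so it is self-contained):

```python
# Write a line containing U+201D (RIGHT DOUBLE QUOTATION MARK) as UTF-8.
with open("input.txt", "w", encoding="utf-8") as f:
    f.write("He said \u201chello\u201d\n")

# Passing encoding explicitly overrides locale.getpreferredencoding(False),
# so this behaves the same on Windows (cp1252 locale) and Linux (utf-8).
with open("input.txt", "r", encoding="utf-8") as f:
    line = f.readline()
print(line)
```

Without the encoding argument, the read on a cp1252 machine raises exactly the UnicodeDecodeError shown above.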

Python 2.7: Printing out a decoded string

I have a file called Abrázame.txt.
I want to decode this so that Python understands what the 'á' character is and will print Abrázame.txt.
This is the code I have in a scratch file:
import os
s = os.path.join(r'C:\Test\AutoTest', os.listdir(r'C:\\Test\\AutoTest')[0])
print(unicode(s.decode(encoding='utf-16', errors='strict')))
The error i get from above is:
Traceback (most recent call last):
File "C:/Users/naythan_onfri/.PyCharmCE2017.2/config/scratches/scratch_3.py", line 12, in <module>
print(unicode(s.decode(encoding='utf-16', errors='strict')))
File "C:\Python27\lib\encodings\utf_16.py", line 16, in decode
return codecs.utf_16_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode byte 0x74 in position 28: truncated data
I have looked up the UTF-16 character set and it does indeed contain the 'á' character. So why can't this string be decoded with UTF-16?
I also know that 'latin-1' will work and produce the string I'm looking for. However, since this is for an automation project, I want to ensure that any filename with any registered character can be decoded and used for other things within the project, for example:
"Opening up file explorer at the directory of the file with the file already selected."
Is looping through each of the codecs (mind you, I believe there are 93 codecs) to find whichever one can decode the string the best way of getting the result I'm looking for? I figure there is something far better than that solution.
You want to decode at the edge when you first read a string so that you don't have surprises later in your code. At the edge, you have some reasonable chance of guessing what that encoding is. For this code, the edge is
os.listdir(r'C:\\Test\\AutoTest')[0]
and you can get the current file system directory encoding. So,
import sys
fs_encoding = sys.getfilesystemencoding()
s = os.path.join(r'C:\Test\AutoTest',
                 os.listdir(r'C:\Test\AutoTest')[0].decode(encoding=fs_encoding, errors='strict'))
print(s)
Note that once you decode you have a unicode string and you don't need to build a new unicode() object from it.
latin-1 works if that's your current code page. It's an interesting curiosity that even though Windows has supported "wide" characters with "W" versions of its API for many years, Python 2 is single-byte-character based and doesn't use them.
Long live python 3.
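For comparison, a short Python 3 sketch of why the problem disappears there: pass a str path in, and os.listdir hands back names already decoded with the filesystem encoding.

```python
import os
import sys

# str in -> str out: Python 3 decodes the names for you.
names = os.listdir('.')
assert all(isinstance(n, str) for n in names)

print(sys.getfilesystemencoding())  # typically 'utf-8' on current systems
```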

Error Encoding non-BMP characters

I've developed a little program in Python 3.4, but when I try to run it, at the end it says:
File "C:\Python34\lib\idlelib\PyShell.py", line 1352, in write
return self.shell.write(s, self.tags)
UnicodeEncodeError: 'UCS-2' codec can't encode characters in position 39559-39559: Non-BMP character not supported in Tk
I've tried everything, but found nothing. Help, please!
I presume you did the equivalent of the following.
>>> print('\U00011111')
Traceback (most recent call last):
File "<pyshell#1>", line 1, in <module>
print('\U00011111')
File "C:\Programs\Python34\lib\idlelib\PyShell.py", line 1347, in write
return self.shell.write(s, self.tags)
UnicodeEncodeError: 'UCS-2' codec can't encode character '\U00011111' in position 0: Non-BMP character not supported in Tk
The problem is as stated: Idle uses the tkinter interface to tcl/tk and tk cannot display non-BMP supplementary chars (ord(char) > 0xFFFF).
Saving a string with non-BMP chars to a file will work fine as long as you encode with utf-8 (or -16, or -32).
On Windows, the console interpreter gives the same error with 'UCS-2' replaced by 'charmap'. The console interpreter is actually worse in that it raises an error even for some BMP chars, depending on the code page being used. I do not know what the situation is on other systems.
EDIT
I forgot the best alternative, at least on Windows. Either of the following will print any string on any ASCII terminal.
>>> repr('\U00011111')
"'\U00011111'"
>>> ascii('\U00011111')
"'\\U00011111'"
repr() does not double backslashes when echoed, ascii() does. These escape more chars than needed for Idle, but will not raise an exception at the >>> prompt. However, for reasons I do not understand, print(repr('\U00011111')) fails, so print(ascii(s)) is needed within a program to print s.
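If you need print() to survive arbitrary strings inside Idle, one workaround (my sketch, not part of the original answer) is to escape only the astral characters before printing, leaving everything Tk can display untouched:

```python
import re

def bmp_safe(s):
    # Replace each non-BMP character (ord > 0xFFFF) with its \UXXXXXXXX
    # escape; BMP characters pass through unchanged.
    return re.sub(r'[\U00010000-\U0010FFFF]',
                  lambda m: '\\U%08x' % ord(m.group()), s)

print(bmp_safe('ok \U00011111'))  # prints: ok \U00011111
```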

UnicodeDecodeError during encode?

We're running into a problem (which is described at http://wiki.python.org/moin/UnicodeDecodeError -- read the second paragraph, '...Paradoxically...').
Specifically, we're trying to up-convert a string to unicode and we are receiving a UnicodeDecodeError.
Example:
>>> unicode('\xab')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xab in position 0: ordinal not in range(128)
But of course, this works without any problems
>>> unicode(u'\xab')
u'\xab'
Of course, this code is only to demonstrate the conversion problem. In our actual code we are not using string literals and cannot just prepend the unicode u prefix; instead we are dealing with strings returned from os.walk(), and a file name includes the above value. Since we cannot coerce the value to unicode without calling the unicode() constructor, we're not sure how to proceed.
One really horrible hack that occurs is to write our own str2uni() method, something like:
def str2uni(val):
    r"""Brute-force coercion of str -> unicode."""
    try:
        return unicode(val)
    except UnicodeDecodeError:
        pass
    res = u''
    for ch in val:
        res += unichr(ord(ch))
    return res
But before we do this -- wanted to see if anyone else had any insight?
UPDATED
I see everyone is getting focused on HOW I got to the example I posted, rather than the result. Sigh -- ok, here's the code that caused me to spend hours reducing the problem to the simplest form I shared above.
for _, _, files in os.walk('/path/to/folder'):
    for fname in files:
        filename = unicode(fname)
That piece of code tosses a UnicodeDecodeError exception when the filename has the following value '3\xab Floppy (A).link'
To see the error for yourself, do the following:
>>> unicode('3\xab Floppy (A).link')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xab in position 1: ordinal not in range(128)
UPDATED
I really appreciate everyone trying to help. And I also appreciate that most people make some pretty simple mistakes related to string/unicode handling. But I'd like to underline the reference to the UnicodeDecodeError exception. We are getting this when calling the unicode() constructor!!!
I believe the underlying cause is described in the aforementioned Wiki article http://wiki.python.org/moin/UnicodeDecodeError. Read from the second paragraph on down about how "Paradoxically, a UnicodeDecodeError may happen when encoding...". The Wiki article very accurately describes what we are experiencing -- but while it elaborates on the causes, it makes no suggestions for resolutions.
As a matter of fact, the third paragraph starts with the following astounding admission: "Unlike a similar case with UnicodeEncodeError, such a failure cannot be always avoided...".
Since I am not used to "can't get there from here" information as a developer, I thought it would be interesting to cast about on Stack Overflow for the experiences of others.
I think you're confusing Unicode strings and Unicode encodings (like UTF-8).
os.walk(".") returns the filenames (and directory names etc.) as strings that are encoded in the current codepage. It will silently remove characters that are not present in your current codepage (see this question for a striking example).
Therefore, if your file/directory names contain characters outside of your encoding's range, then you definitely need to use a Unicode string to specify the starting directory, for example by calling os.walk(u"."). Then you don't need to (and shouldn't) call unicode() on the results any longer, because they already are Unicode strings.
If you don't do this, you first need to decode the filenames (as in mystring.decode("cp850")) which will give you a Unicode string:
>>> "\xab".decode("cp850")
u'\xbd'
Then you can encode that into UTF-8 or any other encoding.
>>> _.encode("utf-8")
'\xc2\xbd'
If you're still confused why unicode("\xab") throws a decoding error, maybe the following explanation helps:
"\xab" is an encoded string. Python has no way of knowing which encoding that is, but before you can convert it to Unicode, it needs to be decoded first. Without any specification from you, unicode() assumes that it is encoded in ASCII, and when it tries to decode it under this assumption, it fails because \xab isn't part of ASCII. So either you need to find out which encoding is being used by your filesystem and call unicode("\xab", encoding="cp850") or whatever, or start with Unicode strings in the first place.
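The same distinction, written out in Python 3 where it is explicit, may make this clearer (a sketch; cp850 is just one plausible Windows OEM code page):

```python
raw = b'\xab'              # one byte with no inherent meaning

try:
    raw.decode('ascii')    # the implicit default of Python 2's unicode()
except UnicodeDecodeError:
    pass                   # 0xAB is outside ASCII's 0-127 range

print(raw.decode('cp850'))    # prints ½  (U+00BD under code page 850)
print(raw.decode('latin-1'))  # prints «  (U+00AB under ISO-8859-1)
```

The byte only becomes a character once you pick a codec, and different codecs give different characters.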
for fname in files:
    filename = unicode(fname)
The second line will complain if fname is not ASCII. If you want to convert the string to Unicode, instead of unicode(fname) you should do fname.decode('<the encoding here>').
I would suggest an encoding, but you don't tell us what \xab is in your .link file. You can search Google for the encoding anyway, so it would stay like this:
for fname in files:
    filename = fname.decode('<encoding>')
UPDATE: For example, if the encoding of your filesystem's names is ISO-8859-1, then the \xab char would be "«". To read it into Python you should do:
for fname in files:
    filename = fname.decode('latin1')  # which is a synonym for ISO-8859-1
Hope this helps!
As I understand it your issue is that os.walk(unicode_path) fails to decode some filenames to Unicode. This problem is fixed in Python 3.1+ (see PEP 383: Non-decodable Bytes in System Character Interfaces):
File names, environment variables, and command line arguments are defined as being character data in POSIX; the C APIs however allow passing arbitrary bytes - whether these conform to a certain encoding or not. This PEP proposes a means of dealing with such irregularities by embedding the bytes in character strings in such a way that allows recreation of the original byte string.
Windows provides a Unicode API to access the filesystem, so this problem shouldn't occur there.
Python 2.7 (utf-8 filesystem on Linux):
>>> import os
>>> list(os.walk("."))
[('.', [], ['\xc3('])]
>>> list(os.walk(u"."))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/os.py", line 284, in walk
if isdir(join(top, name)):
File "/usr/lib/python2.7/posixpath.py", line 71, in join
path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: \
ordinal not in range(128)
Python 3.3:
>>> import os
>>> list(os.walk(b'.'))
[(b'.', [], [b'\xc3('])]
>>> list(os.walk(u'.'))
[('.', [], ['\udcc3('])]
Your str2uni() function tries to solve the same issue as the "surrogateescape" error handler on Python 3, but it introduces ambiguous names (distinct byte strings can map to the same Unicode string). Use bytestrings for filenames on Python 2 if you are expecting filenames that can't be decoded using sys.getfilesystemencoding().
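A short Python 3 sketch of surrogateescape round-tripping the undecodable name from the example above:

```python
raw = b'\xc3('  # invalid UTF-8: 0xC3 opens a 2-byte sequence that '(' can't finish

# surrogateescape smuggles the undecodable byte through as the
# lone surrogate U+DCC3, matching the os.walk(u'.') output above ...
text = raw.decode('utf-8', errors='surrogateescape')
assert text == '\udcc3('

# ... and encoding with the same handler restores the original bytes exactly.
assert text.encode('utf-8', errors='surrogateescape') == raw
```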
'\xab'
Is a byte, number 171.
u'\xab'
Is a character, U+00AB Left-pointing double angle quotation mark («).
u'\xab' is a short-hand way of saying u'\u00ab'. It's not the same (not even the same datatype) as the byte '\xab'; it would probably have been clearer to always use the \u syntax in Unicode string literals IMO, but it's too late to fix that now.
To go from bytes to characters is known as a decode operation. To go from characters to bytes is known as an encode operation. For either direction, you need to know which encoding is used to map between the two.
>>> unicode('\xab')
UnicodeDecodeError
unicode is a character string, so there is an implicit decode operation when you pass bytes to the unicode() constructor. If you don't tell it which encoding you want you get the default encoding which is often ascii. ASCII doesn't have a meaning for byte 171 so you get an error.
>>> unicode(u'\xab')
u'\xab'
Since u'\xab' (or u'\u00ab') is already a character string, there is no implicit conversion in passing it to the unicode() constructor - you get an unchanged copy.
res = u''
for ch in val:
    res += unichr(ord(ch))
return res
The encoding that maps each input byte to the Unicode character with the same ordinal value is ISO-8859-1. Consequently you could replace this loop with just:
return unicode(val, 'iso-8859-1')
(However note that if Windows is in the mix, then the encoding you want is probably not that one but the somewhat-similar windows-1252.)
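The claim that ISO-8859-1 is the identity mapping between bytes and the first 256 code points can be checked directly (Python 3 syntax):

```python
# Every byte value decodes to the Unicode code point with the same ordinal,
# which is exactly what the unichr(ord(ch)) loop was doing by hand.
data = bytes(range(256))
assert [ord(c) for c in data.decode('iso-8859-1')] == list(range(256))
```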
One really horrible hack that occurs is to write our own str2uni() method
This isn't generally a good idea. UnicodeErrors are Python telling you you've misunderstood something about string types; ignoring that error instead of fixing it at source means you're more likely to hide subtle failures that will bite you later.
filename = unicode(fname)
So this would be better replaced with: filename = unicode(fname, 'iso-8859-1') if you know your filesystem is using ISO-8859-1 filenames. If your system locales are set up correctly then it should be possible to find out the encoding your filesystem is using, and go straight to that:
filename = unicode(fname, sys.getfilesystemencoding())
Though actually if it is set up correctly, you can skip all the encode/decode fuss by asking Python to treat filesystem paths as native Unicode instead of byte strings. You do that by passing a Unicode character string into the os filename interfaces:
for _, _, files in os.walk(u'/path/to/folder'):  # note u'' string
    for fname in files:
        filename = fname  # nothing more to do!
PS. The character in 3″ Floppy should really be U+2033 Double Prime, but there is no encoding for that in ISO-8859-1. Better in the long term to use UTF-8 filesystem encoding so you can include any character.

A UnicodeDecodeError that occurs with json in python on Windows, but not Mac

On windows, I have the following problem:
>>> string = "Don´t Forget To Breathe"
>>> import json,os,codecs
>>> f = codecs.open("C:\\temp.txt","w","UTF-8")
>>> json.dump(string,f)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python26\lib\json\__init__.py", line 180, in dump
for chunk in iterable:
File "C:\Python26\lib\json\encoder.py", line 294, in _iterencode
yield encoder(o)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 3-5: invalid data
(Notice the non-ascii apostrophe in the string.)
However, my friend, on his mac (also using python2.6), can run through this like a breeze:
> string = "Don´t Forget To Breathe"
> import json,os,codecs
> f = codecs.open("/tmp/temp.txt","w","UTF-8")
> json.dump(string,f)
> f.close(); open('/tmp/temp.txt').read()
'"Don\\u00b4t Forget To Breathe"'
Why is this? I've also tried using UTF-16 and UTF-32 with json and codecs, but to no avail.
What does repr(string) show on each machine? On my Mac the apostrophe shows as \xc2\xb4 (utf8 coding, 2 bytes) so of course the utf8 codec can deal with it; on your Windows it clearly isn't doing that since it talks about three bytes being a problem - so on Windows you must have some other, non-utf8 encoding set for your console.
Your general problem is that, in Python pre-3, you should not enter a byte string ("...." as you used, rather than u"....") with non-ASCII content (except as explicit escape sequences): this may (depending on how the session is set up) fail directly, or produce bytes according to whatever codec is set as the default one, which may not be the exact bytes you expect (because you're not aware of the exact default codec in use). Use explicit Unicode literals
string = u"Don´t Forget To Breathe"
and you should be OK (or if you have any problem it will emerge right at the time of this assignment, at which point we may go into the issue of "how to I set a default encoding for my interactive sessions" if that's what you require).
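In Python 3 the text/bytes distinction is enforced, so the friend's behaviour is what you always get. A sketch using an in-memory buffer in place of C:\temp.txt:

```python
import io
import json

s = "Don\u00b4t Forget To Breathe"  # a text string; the u'' prefix is implicit

buf = io.StringIO()
json.dump(s, buf)      # default ensure_ascii=True escapes the acute accent
print(buf.getvalue())  # prints: "Don\u00b4t Forget To Breathe"
```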

Categories

Resources