I have some files in a Windows 10 directory that were named with encryption logic written in VB. The VB logic was originally written on Windows 7 with VB.NET, but the file names are exactly the same between the two versions of Windows, as expected. The problem I'm having is that when I try to decrypt those file names in a character-by-character loop in Python 3.7.4, what the ord() function returns doesn't match what VB's Asc() returns for that character.
All the letters match (up to ASCII character 126), but everything after that does not.
For example, in VB:
?asc("ƒ")
returns 131.
However, in Python 3.7.4:
ord('ƒ')
returns 402.
I've read a lot of great posts here discussing UTF-8 vs cp1252 encoding both for strings of data (within files) and filenames, but I haven't come across a solution for my problem.
When I run:
sys.getdefaultencoding()
I get 'utf-8'. This is, I believe, what would be used for file names and the functions that handle them, e.g., os.fsdecode(), os.listdir(), etc.
When I run:
locale.getpreferredencoding()
I get 'cp1252'.
One thing I noticed on the "other side of the fence" is that the values returned by Python's ord() DO match the VB equivalent AscW(), but altering all that code is going to be more problematic than moving forward with the rest of what we've done in Python so far.
Should I be altering the locale's preferredencoding or the sys's default encoding to solve this problem?
Thanks!
Note that the Python value is the Unicode code point (and is bigger than 255). If you already have the correct filenames in Python, just encode the strings with the appropriate encoding (apparently cp1252) and examine the byte values. (You could also call functions like os.listdir with bytes arguments to suppress the decoding in the first place.)
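For example (a minimal sketch; 'ƒ' is U+0192, which cp1252 maps to byte 0x83):
name = 'ƒ'                   # one character of a recovered filename
print(ord(name))             # 402 -- the Unicode code point (like VB's AscW)
raw = name.encode('cp1252')  # re-encode with the legacy code page
print(raw[0])                # 131 -- the byte value (like VB's Asc)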
Related
Say I have a source file encoded in UTF-8. When the Python interpreter loads that source file, will it convert the file content to unicode in memory and then try to evaluate the source code in unicode?
If I have a string with a non-ASCII char in it, like
astring = '中文'
and the file is encoded in gbk.
Running that file with Python 2, I found that the string is actually still in raw GBK bytes.
So I suspect the Python 2 interpreter does not convert source code to unicode, because if it did, the string content would be in unicode (I heard it is actually UTF-16).
Is that right? And if so, what about the Python 3 interpreter? Does it convert source code to unicode format?
Actually, I know how to define unicode and raw strings in both Python 2 and 3.
I'm just curious about one detail when the interpreter loads source code.
Will it convert the WHOLE raw source code (encoded bytes) to unicode at the very beginning and then try to interpret the unicode-format source code piece by piece?
Or instead, does it load the raw source piece by piece, and only decode what it thinks it should? For example, when it hits the statement u'中文', OK, decode to unicode; when it hits the statement b'中文', OK, no need to decode.
Which way the interpreter will go?
If your source file is encoded with GBK, put this line at the top of the file (first or second line):
# coding: gbk
This is required for both Python 2 and 3.
If you omit this encoding declaration, the interpreter will assume ASCII in the case of Python 2, and UTF-8 for Python 3.
The encoding declaration controls how the interpreter reads the bytes of the source file. This is mainly relevant for string literals (like in your example), but theoretically also applies to comments and even identifiers (it's probably not a good idea to use non-ASCII in identifiers, though).
As for the question whether you get byte strings or unicode strings: this depends on the syntax, not on the choice and declaration of encoding.
As pointed out in Ignacio's answer, if you want to have unicode strings in Python 2, you need to use the u'...' notation.
In Python 3, the u prefix is optional.
So, with a correct encoding declaration in the file header, it is sufficient to write astring = '中文' to get a correct unicode string in Python 3.
Update
In a comment, the OP asks about the interpretation of b'中文'.
In Python 3, this isn't allowed (byte strings can only contain ASCII characters), but you can test this yourself in Python 2.x:
# coding: gbk
x = b'中文'
y = u'中文'
print repr(x)
print repr(y)
This will produce:
'\xd6\xd0\xce\xc4'
u'\u4e2d\u6587'
The first line reflects the actual bytes contained in the source file (if you saved it with GBK, of course).
So there seems to be no decoding happening for b'中文'.
However, I don't know how the interpreter internally represents the source code with respect to encoding (that seems to be your question).
This is implementation-dependent anyway, so the answer might even be different for cPython, Jython, IronPython etc.
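(For completeness: in Python 3 the bytes literal from above is rejected at compile time; the exact message wording varies slightly between versions.)
>>> b'中文'
  File "<stdin>", line 1
SyntaxError: bytes can only contain ASCII literal characters.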
So I suspect the Python 2 interpreter does not convert source code to unicode.
It never does. If you want Unicode rather than bytes, then you need to use a unicode literal instead.
astring = u'中文'
Apart from literal strings, Python source is effectively plain ASCII, so the actual encoding does not matter except for literal strings, be they unicode strings or byte strings. Identifiers can use non-ASCII characters (IMHO it would be a very bad practice), but their meaning is normally internal to the Python interpreter, so the way it reads them is not really important.
Byte strings are always left unchanged. That means that normal strings in Python 2 and byte literal strings in Python 3 are never converted.
Unicode strings are always converted:
if the special string coding: charset_name exists in a comment on the first or second line, the original byte string is converted as it would be with decode(charset_name)
if no encoding is specified, Python 2 will assume ASCII and Python 3 will assume UTF-8
What function can I apply to a string variable that will cause the same result as prepending the b modifier to a string literal?
I've read in this question about the b modifier for string literals in Python 2 that prepending b to a string makes it a byte string (mainly for compatibility between Python 2 and Python 3 when using 2to3). The result I would like to obtain is the same, but applied to a variable, like so:
def is_binary_string_equal(string_variable):
    binary_string = b'this is binary'
    return convert_to_binary(string_variable) == binary_string
>>> is_binary_string_equal('this is binary')
True
What is the correct definition of convert_to_binary?
First, note that in Python 2.x, the b prefix actually does nothing. b'foo' and 'foo' are both exactly the same string literal. The b only exists to allow you to write code that's compatible with both Python 2.x and Python 3.x: you can use b'foo' to mean "I want bytes in both versions", and u'foo' to mean "I want Unicode in both versions", and just plain 'foo' to mean "I want the default str type in both versions, even though that's Unicode in 3.x and bytes in 2.x".
So, "the functional equivalent of prepending the 'b' character to a string literal in Python 2" is literally doing nothing at all.
But let's assume that you actually have a Unicode string (like what you get out of a plain literal or a text file in Python 3, even though in Python 2 you can only get these by explicitly decoding, or using some function that does it for you, like opening a file with codecs.open). Because then it's an interesting question.
The short answer is: string_variable.encode(encoding).
But before you can do that, you need to know what encoding you want. You don't need that with a literal string, because when you use the b prefix in your source code, Python knows what encoding you want: the same encoding as your source code file.* But everything other than your source code—files you open and read, input the user types, messages coming in over a socket—could be anything, and Python has no idea; you have to tell it.**
In many cases (especially if you're on a reasonably recent non-Windows machine and dealing with local data), it's safe to assume that the answer is UTF-8, so you can spell convert_to_binary(string_variable) as string_variable.encode('utf8'). But "many" isn't "all".*** This is why text editors and web browsers let the user select an encoding: sometimes only the user actually knows.
* See PEP 263 for how you can specify the encoding, and why you'd want to.
** You can also use bytes(s, encoding), which is a synonym for s.encode(encoding). And, in both cases, you can leave off the encoding argument—but then it defaults to something which is more likely to be ASCII than what you actually wanted, so don't do that.
*** For example, many older network protocols are defined as Latin-1. Many Windows text files are created in whatever the OEM charset is set to—usually cp1252 on American systems, but there are hundreds of other possibilities. Sometimes sys.getdefaultencoding() or locale.getpreferredencoding() gets what you want, but that obviously doesn't work when, say, you're processing a file that someone uploaded that's in his machine's preferred encoding, not yours.
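Putting that together, here's a minimal sketch of convert_to_binary, assuming UTF-8 is the encoding you actually want:
def convert_to_binary(string_variable, encoding='utf8'):
    # encode the text string into a byte string using the given encoding
    return string_variable.encode(encoding)

convert_to_binary(u'this is binary') == b'this is binary'  # True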
In the special case where the relevant encoding is "whatever this particular source file is in", you pretty much have to know that somehow out-of-band.* Once a script or module has been compiled and loaded, it's no longer possible to tell what encoding it was originally in.**
But there shouldn't be much reason to want that. After all, if two binary strings are equal, and in the same encoding, the Unicode strings are also equal, and vice-versa, so you could just write your code as:
def is_binary_string_equal(string_variable):
    binary_string = u'this is binary'
    return string_variable == binary_string
* The default is, of course, documented—it's UTF-8 for 3.0, ASCII or Latin-1 for 2.x depending on your version. But you can override that, as PEP 263 explains.
** Well, you could use the inspect module to find the source, then the importlib module to start processing it, etc.—but that only works if the file is still there and hasn't been edited since you last compiled it.
Note that in Python 3.7, executed on a Linux machine, using .encode('UTF-8') and a b'string' literal are not the same.
It caused a lot of pain in a project of mine, and to this day I have no clear understanding of why it happens, but doing this in Python 3.7
print('\xAD\x43\x48\x49\x44\x44\x49\x4E\x47\x53\x54\x4F\x4E\x45'.encode('UTF-8'))
print(b'\xAD\x43\x48\x49\x44\x44\x49\x4E\x47\x53\x54\x4F\x4E\x45')
prints this on the console:
b'\xc2\xadCHIDDINGSTONE'
b'\xadCHIDDINGSTONE'
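For what it's worth, the likely explanation: in a str literal, '\xAD' denotes the code point U+00AD (soft hyphen), which UTF-8 encodes as the two bytes C2 AD, whereas in a bytes literal, b'\xAD' is the single raw byte 0xAD. A quick check:
print('\xAD'.encode('utf-8'))    # b'\xc2\xad' -- U+00AD encoded as two bytes
print('\xAD'.encode('latin-1'))  # b'\xad' -- matches the bytes literal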
I spent a few angry hours looking for a problem with Unicode strings that boiled down to something that Python (2.7) hides from me and I still don't understand. First, I tried to use u".." strings consistently in my code, but that resulted in the infamous UnicodeEncodeError. I tried using .encode('utf8'), but that didn't help either. Finally, it turned out I shouldn't use either and it all works out automagically.
However (here I need to give credit to a friend who helped me), I did notice something weird while banging my head against the wall: sys.getdefaultencoding() returns ascii, while sys.stdout.encoding returns UTF-8. Case 1. in the code below works fine without any modifications to sys, while case 2. raises a UnicodeEncodeError. If I change the default system encoding with reload(sys).setdefaultencoding("utf8"), then case 2. works fine.
My question is why the two encoding variables are different in the first place, and how do I manage to use the wrong encoding in this simple piece of code? Please don't send me to the Unicode HOWTO; I've read that, obviously, in the tens of questions about UnicodeEncodeError.
# -*- coding: utf-8 -*-
import sys
class Token:
    def __init__(self, string, final=False):
        self.value = string
        self.final = final

    def __str__(self):
        return self.value

    def __repr__(self):
        return self.value
print(sys.getdefaultencoding())
print(sys.stdout.encoding)
# 1.
myString = "I need 20 000€."
tok = Token(myString)
print(tok)
reload(sys).setdefaultencoding("utf8")
# 2.
myString = u"I need 20 000€."
tok = Token(myString)
print(tok)
My question is why the two encoding variables are different in the first place
They serve different purposes.
sys.stdout.encoding should be the encoding that your terminal uses to interpret text otherwise you may get mojibake in the output. It may be utf-8 in one environment, cp437 in another, etc.
sys.getdefaultencoding() is used on Python 2 for implicit conversions (when the encoding is not set explicitly), i.e., Python 2 may mix ASCII-only bytestrings and Unicode strings together. For example, xml.etree.ElementTree stores text in the ASCII range as bytestrings, and json.dumps() returns an ASCII-only bytestring instead of Unicode in Python 2, perhaps for performance: bytes were cheaper than Unicode for representing ASCII characters. Implicit conversions are forbidden in Python 3.
sys.getdefaultencoding() is always 'ascii' on all systems in Python 2 unless you override it, which you should not do; it may hide bugs, and your data may be easily corrupted by implicit conversions that use a possibly wrong encoding for the data.
btw, there is another common encoding, sys.getfilesystemencoding(), that may be different from the two; it should be the encoding that is used to encode OS data (filenames, command-line arguments, environment variables).
The source code encoding declared using # -*- coding: utf-8 -*- may be different from all of the already-mentioned encodings.
Naturally, if you read data from a file or the network, it may use character encodings different from all of the above, e.g., if a file created in Notepad is saved using a Windows ANSI encoding such as cp1252, then on another system all the standard encodings can differ from it.
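A quick way to see these values side by side (a sketch; the output varies per system and Python version):
import sys, locale
print(sys.stdout.encoding)            # terminal/console encoding
print(sys.getdefaultencoding())       # 'ascii' on Python 2, 'utf-8' on Python 3
print(sys.getfilesystemencoding())    # encoding for filenames, argv, environ
print(locale.getpreferredencoding())  # locale-based preferred encoding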
The point being: there can be multiple encodings for reasons unrelated to Python. To avoid the headache, use Unicode to represent text: convert encoded text to Unicode as soon as possible on input, and encode it to bytes (possibly using a different encoding) as late as possible on output. This is the so-called Unicode sandwich.
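In code, the sandwich looks roughly like this (a minimal sketch, assuming UTF-8 files; io.open works on both Python 2 and 3):
import io

with io.open('input.txt', encoding='utf-8') as f:
    text = f.read()                # decode on input: unicode inside
result = text.upper()              # all processing happens on unicode
with io.open('output.txt', 'w', encoding='utf-8') as f:
    f.write(result)                # encode as late as possible on output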
how do I manage to use the wrong encoding in this simple piece of code?
Your first code example is not fine: you use non-ASCII literal characters in a byte string on Python 2, which you should not do. Use bytestring literals only for binary data (or for so-called native strings if necessary). The code may produce mojibake such as I need 20 000Γé¼. (notice the character noise) if you run it using Python 2 in any environment that does not use a utf-8-compatible encoding, such as the Windows console.
The second code example is OK, assuming reload(sys) is not part of it. If you don't want to prefix all string literals with u'', you could use from __future__ import unicode_literals.
Your actual issue is the UnicodeEncodeError, and reload(sys) is not the right solution!
The correct solution is to configure your locale properly on POSIX (LANG, LC_CTYPE), or to set the PYTHONIOENCODING envvar if the output is redirected to a pipe/file, or to install win-unicode-console to print Unicode to the Windows console.
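For example, when redirecting output to a file (a sketch; foo.py stands in for your script):
$ PYTHONIOENCODING=utf-8 python foo.py > output.txt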
I have noticed the same behaviour of some standard code (mailman library).
Thanks for your analysis, it helped me save some time. :-)
The problem is exactly the same. My system uses sys.getdefaultencoding() and gets ascii, which is inappropriate for handling a list of 1000 UTF-8 encoded names.
There is a mismatch between the stdin/stdout and even filesystem encodings (utf-8) on one hand and the "defaultencoding" (ascii) on the other. This thread: How to print UTF-8 encoded text to the console in Python < 3? seems to indicate that this is well known, and Changing default encoding of Python? contains some indication that a more homogeneous setup (like "utf-8 everywhere") would break other things, such as the hash implementation.
For that reason it is also not straightforward to change the defaultencoding. (See http://blog.ianbicking.org/illusive-setdefaultencoding.html for various ways to do so.) The setter is removed from the sys module by site.py.
I currently have serious problems with coding/encoding under Linux (Ubuntu). I never needed to deal with that before, so I don't have any idea why this actually doesn't work!
I'm parsing *.desktop files from /usr/share/applications/ and extracting information which is shown in the Web browser via a HTTPServer. I'm using jinja2 for templating.
First, I received UnicodeDecodeError at the call to jinja2.Template.render() which said that
utf-8 cannot decode character XXX at position YY [...]
So I made all values that come from my appfind module (which parses the *.desktop files) return only unicode strings.
The problem at that point was solved so far, but at some point I am writing a string returned by a function to the BaseHTTPServer.BaseHTTPRequestHandler.wfile slot, and I can't get this error fixed, no matter what encoding I use.
At this point, the string that is written to wfile comes from jinja2.Template.render() which, afaik, returns a unicode object.
The bizarre part is that it works on my Ubuntu 12.04 LTS but not on my friend's Ubuntu 11.04 LTS. However, that might not be the reason: he has a lot more applications, and maybe they use encodings in their *.desktop files that raise the error.
However, I properly checked for the encoding in the *.desktop files:
data = dict(parser.items('Desktop Entry'))
try:
    encoding = data.get('encoding', 'utf-8')
    result = {
        'name': data['name'].decode(encoding),
        'exec': DKENTRY_EXECREPL.sub('', data['exec']).decode(encoding),
        'type': data['type'].decode(encoding),
        'version': float(data.get('version', 1.0)),
        'encoding': encoding,
        'comment': data.get('comment', '').decode(encoding) or None,
        'categories': _filter_bool(data.get('categories', '')
                                   .decode(encoding).split(';')),
        'mimetypes': _filter_bool(data.get('mimetype', '')
                                  .decode(encoding).split(';')),
    }
    # ...
Can someone please enlighten me about how I can fix this error? I have read in an answer on SO that I should always use unicode(), but that would be so much pain to implement, and I don't think it would fix the problem when writing to wfile.
Thanks,
Niklas
This is probably obvious, but anyway: wfile is an ordinary byte stream, so everything written to it must be encoded first, via unicode.encode().
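Something along these lines (a sketch; render_page stands in for whatever produces the unicode result of Template.render()):
html = render_page()                    # unicode object from jinja2
self.wfile.write(html.encode('utf-8'))  # encode before writing to the byte stream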
Reading the OP, it is not clear to me what, exactly, is afoot. However, there are some tricks that I have found helpful for debugging encoding problems. I apologize in advance if this is stuff you have long since transcended.
cat -v on a file will output all non-ASCII characters as '^X', which is the only fool-proof way I have found to determine what encoding a file really has. UTF-8 non-ASCII characters are multi-byte, which means they will show up as sequences of more than one '^' entry in the cat -v output.
Shell environment (LC_ALL, et al) is in my experience the most common cause of problems. Make sure you have a system that has locales with both UTF-8 and e.g. latin-1 available. Always set your LC_ALL to a locale that explicitly names an encoding, e.g. LC_ALL=sv_SE.iso88591.
In bash and zsh, you can run a command with specific environment changes for that command, like so:
$ LC_ALL=sv_SE.utf8 python ./foo.py
This makes it a lot easier to test than having to export different locales, and you won't pollute the shell.
Don't assume that you have unicode strings internally. Write assert statements that verify that strings are unicode.
assert isinstance(foo, unicode)
Learn to recognize mangled/misrepresented versions of common characters in the encodings you are working with. E.g. '\xe4' is an a-diaeresis (ä) in latin-1, and 'Ã¤' is the two UTF-8 bytes that make up an a-diaeresis, mistakenly displayed as latin-1. I have found that knowing this sort of gorp cuts down debugging of encoding issues considerably.
You need to take a disciplined approach to your byte strings and Unicode strings. This explains it all: Pragmatic Unicode, or, How Do I Stop the Pain?
By default, when Python hits an encoding issue with unicode, it raises an error. However, this behavior can be modified, for instance when the error is expected or not important.
Say you are converting between two encodings that are supersets of ASCII. They both have mostly the same characters, but there is no one-to-one correspondence. Therefore, you would want to ignore errors.
To do so, use the errors argument of the encode function.
mystring = u'This is a test'
print mystring.encode('utf-8', 'ignore')
print mystring.encode('utf-8', 'replace')
print mystring.encode('utf-8', 'xmlcharrefreplace')
print mystring.encode('utf-8', 'backslashreplace')
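Note that with a pure-ASCII string and UTF-8 as the target, the handlers never actually fire; to see them differ, encode a character the target codec lacks (a sketch, Python 2):
mystring = u'I need 20 000\u20ac'                   # euro sign is not in ascii
print mystring.encode('ascii', 'ignore')            # 'I need 20 000'
print mystring.encode('ascii', 'replace')           # 'I need 20 000?'
print mystring.encode('ascii', 'xmlcharrefreplace') # 'I need 20 000&#8364;'
print mystring.encode('ascii', 'backslashreplace')  # 'I need 20 000\u20ac'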
There are lots of issues with unicode if the wrong encodings are used when reading/writing. Make sure that after you get the unicode string, you convert it to the form of unicode desired by jinja2.
If this doesn't help, could you please add the second error you see, with perhaps a code snippet to clarify what's going on?
Try using .encode(encoding) instead of .decode(encoding) in all its occurrences in the snippet.
I am working on a program (Python 2.7) that reads xls files (in MHTML format). One of the problems I have is that the files contain symbols/characters that are not ASCII. My initial solution was to read the files in as unicode.
Here is how I am reading in a file:
theString=unicode(open(excelFile).read(),'UTF-8','replace')
I am then using lxml to do some processing. These files have many tables, and the first step of my processing requires that I find the right table. I can find the table based on words that are in the first cell of the first row. This is where it gets tricky. I had hoped to use a regular expression to test the text_content() of the cell, but discovered that there were too many variants of the words (in a test run of 3,200 files I found 91 different ways that the concept that defines just one of the tables was expressed). Therefore I decided to dump all of the text_contents of the particular cell out and use some algorithms in Excel to strictly identify all of the variants.
The code I used to write the text_content() was
headerDict['header_'+str(column+1)] = string.encode('Latin-1', 'replace')
I did this based on previous answers to questions similar to mine here, where it seems the consensus was to read the file in using unicode and then encode it just before it is written out.
So I processed the labels/words in Excel: converted them all to lower case, got rid of the spaces, and saved the output as a text file.
The text file has a column of all of the unique ways the table I am looking for is labeled.
I am then reading in the file, and the first time I did, I read it in using
labels=set([label for label in unicode(open('C:\\balsheetstrings-1.txt').read(),'UTF-8','replace').split('\n')])
I ran my program and discovered that some matches did not occur. Investigating, I found that unicode replaced certain characters with \ufffd, as in the example below:
u'unauditedcondensedstatementsoffinancialcondition(usd\ufffd$)inthousands'
More research turns up that the replacement happens when unicode does not have a mapping for the character (probably not the exact explanation but that was my interpretation)
So then I tried (after thinking, what do I have to lose?) reading in my list of labels without using unicode. So I read it in using this code:
labels=set(open('C:\\balsheetstrings-1.txt').readlines())
now looking at the same label in the interpreter I see
'unauditedcondensedstatementsoffinancialcondition(usd\xa0$)inthousands'
I then try to use this set of labels to match, and I get this warning:
Warning (from warnings module):
File "C:\FunctionsForExcel.py", line 128
if tableHeader in testSet:
UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
Now the frustrating thing is that the value of tableHeader is NOT in the test set. When I asked for the value of tableHeader after it broke, I received this:
'fairvaluemeasurements:'
And to add insult to injury, when I type the test into IDLE,
tableHeader in testSet
it correctly returns False.
I understand that '\xa0' is the code for a non-breaking space, and that is what Python shows when I read the file in without using unicode. I thought I had gotten rid of all the spaces in Excel, but to handle these I split the labels and then joined them:
labels = [''.join(label.split()) for label in labels]
I still have not gotten to a question yet; sorry, I am still trying to get my head around this. It seems to me that I am dealing with inconsistent behavior here. When I read the string in originally, using unicode and UTF-8, all the characters were preserved/transportable, if you will. I encoded them to write them out and they displayed fine in Excel; I then saved them as a txt file and they looked okay. But something is going on and I can't seem to figure out where.
If I could avoid writing the strings out to identify the correct labels I have a feeling my problem would go away but there are 20,000 or more labels. I can use a regular expression to cut my potential list down significantly but some of it just requires inspection.
As an aside I will note that the source files all specify the charset='UTF-8'
Recap: when I read the source document and the list of labels in using unicode, I fail to make some matches because the labels have some characters replaced by \ufffd; and when I read the source document in using unicode but the list of labels in without any special handling, I get the warning.
I would like to understand what is going on so I can fix it, but I have exhausted all the places I can think to look.
You read (and write) encoded files like this:
import codecs
# read a utf8 encoded file and return the data as unicode
data = codecs.open(excelFile, 'rb', 'UTF-8').read()
The encoding you use does not matter as long as you do all the comparisons in unicode.
I understand that the code '\xa0' is code for a non-breaking space.
In a byte string, \xA0 is a byte representing non-breaking space in a few encodings; the most likely of those would be Windows code page 1252 (Western European). But it's certainly not UTF-8, where byte \xA0 on its own is invalid.
Use .decode('cp1252') to turn that byte string into Unicode instead of 'utf-8'. In general if you want to know what encoding an HTML file is in, look for the charset parameter in the <meta http-equiv="Content-Type"> tag; it is likely to differ depending on what exported it.
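A quick check (a sketch, Python 2):
raw = 'usd\xa0$'
print raw.decode('cp1252')  # u'usd\xa0$' -- \xa0 is a non-breaking space in cp1252
raw.decode('utf-8')         # raises UnicodeDecodeError: invalid start byte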
Not exactly a solution, but something like xlrd would probably make a lot more sense than jumping through all those hoops.