I wrote a script in Python 2 which is split into 4-5 modules. I use the Hungarian language in the script, which contains several unusual characters like öüóőúéáűí. I wrote the modules on Win7, originally with cp1250 encoding, and then I moved to Ubuntu raring, where the system default is UTF-8.
First, Tkinter left the labels containing the special letters blank, which I managed to fix by setting the encoding declaration at the beginning of every module to # -*- coding: utf-8 -*-.
The Entry widgets started to go mad too: their .get() methods raised UnicodeDecodeError: 'ascii' codec can't decode byte...
And finally, if for example module a.py has a dictionary dict = {'Sándor': 16} and module b.py has the line a.dict['Sándor'], it raises KeyError, as if dict didn't contain 'Sándor'. It doesn't do this with strings containing only ordinary characters, nor with the module's own dictionaries.
I wrote a script in python2... I use the Hungarian language in the script...
Did you use unicode literals? No, you did not. Rewrite your script to use and handle them properly.
{u'Sándor': 16}
Unicode In Python, Completely Demystified
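For illustration, a minimal Python 2 sketch of the fix, using the a.py/b.py layout from the question (the dictionary is renamed to ages here only to avoid shadowing the built-in dict):
# a.py
# -*- coding: utf-8 -*-
ages = {u'Sándor': 16}            # unicode literal as the key

# b.py
# -*- coding: utf-8 -*-
import a
print(a.ages[u'Sándor'])          # unicode key looked up with a unicode literal -> 16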
I have some files in a Windows 10 directory that were named with encryption logic written in VB. The VB logic was originally written on Windows 7 with VB.net, but the file names are exactly the same between the two versions of Windows, as expected. The problem I'm having is that when I try to decrypt those file names in a character-by-character loop in Python 3.7.4, what is returned from the ord() function doesn't match what the VB asc() designation is for that character.
All the letters match (up to ASCII character 126), but everything after that does not.
For example, in VB:
?asc("ƒ")
returns 131.
However, in Python 3.7.4:
ord('ƒ')
returns 402.
I've read a lot of great posts here discussing UTF-8 vs cp1252 encoding both for strings of data (within files) and filenames, but I haven't come across a solution for my problem.
When I run:
sys.getdefaultencoding()
I get 'utf-8'. This is what, I believe, would be used for file names and the functions that operate on them, e.g., os.fsdecode(), os.listdir(), etc.
When I run:
locale.getpreferredencoding()
I get 'cp1252'.
One thing I noticed on the "other side of the fence" is that the values returned by Python's ord() DO match the VB equivalent AscW(), but altering all that code is going to be more problematic than moving forward with the rest of what we've done in Python so far.
Should I be altering the locale's preferredencoding or the sys's default encoding to solve this problem?
Thanks!
Note that the Python value is the Unicode code point (and is bigger than 255). If you already have the correct filenames in Python, just encode the strings with the appropriate encoding (apparently cp1252) and examine the byte values. (You could also call functions like os.listdir with bytes arguments to suppress the decoding in the first place.)
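A short sketch of that suggestion in Python 3, assuming cp1252 really is the encoding the VB code works with:
name = 'ƒ'                      # one character of an already-decoded filename
raw = name.encode('cp1252')     # b'\x83'
print(raw[0])                   # 131, which matches VB's asc("ƒ")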
I got a location u'\u0107\x9d\xad\u013a\u02c7\x9e\u013a\xb8\x82', which actually should be '\xe6\x9d\xad\xe5\xb7\x9e\xe5\xb8\x82'. How can I decode something like this?
I suggest you read the Python 2.7 Unicode documentation.
u'\u0107\x9d\xad\u013a\u02c7\x9e\u013a\xb8\x82' does not equal '\xe6\x9d\xad\xe5\xb7\x9e\xe5\xb8\x82', so I suppose there is something wrong with your crawler code.
In Python 2.x, you should be careful with encoding problems. In Python 2 we have two text types:
str, which for all intents and purposes is limited to ASCII plus some undefined data above the 7-bit range, and unicode, which is equivalent to the Python 3 str type; there is also one byte type, bytearray, which it inherited from Python 3.
Python 2 provides a migration path from non-Unicode into Unicode by allowing coercion of byte-strings and non-byte-strings.
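A quick sketch of that coercion at a Python 2 prompt (the example strings are arbitrary):
>>> u'abc' + 'def'              # the bytestring is silently decoded with the ascii default
u'abcdef'
>>> u'abc' + '\xc3\xa1'         # non-ascii bytes cannot be coerced
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)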
You can check out More About Unicode in Python 2 and 3.
You can also add the following at the start of your script; it sets the system default encoding to utf-8. It's useful for testing programs, and it will fix your issue.
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
As a matter of fact, I don't suggest programmers use this in large programs. It might trigger other issues.
Encoding problems in Python 2.x are really discouraging, and if you want to avoid them, you should start to think seriously about switching to Python 3.
Hope this helps.
I spent a few angry hours chasing a problem with Unicode strings that boiled down to something that Python (2.7) hides from me, and I still don't understand it. First, I tried to use u".." strings consistently in my code, but that resulted in the infamous UnicodeEncodeError. I tried using .encode('utf8'), but that didn't help either. Finally, it turned out I shouldn't use either and it all works out automagically.

However, I did notice something weird while banging my head against the wall (here I need to give credit to a friend who helped me): sys.getdefaultencoding() returns ascii, while sys.stdout.encoding returns UTF-8. Case 1. in the code below works fine without any modifications to sys, while case 2. raises a UnicodeEncodeError. If I change the default system encoding with reload(sys).setdefaultencoding("utf8"), then case 2. works fine.

My question is why the two encoding variables are different in the first place, and how do I manage to use the wrong encoding in this simple piece of code? Please don't send me to the Unicode HOWTO; I've obviously read it, in the tens of questions about UnicodeEncodeError.
# -*- coding: utf-8 -*-
import sys
class Token:
    def __init__(self, string, final=False):
        self.value = string
        self.final = final

    def __str__(self):
        return self.value

    def __repr__(self):
        return self.value
print(sys.getdefaultencoding())
print(sys.stdout.encoding)
# 1.
myString = "I need 20 000€."
tok = Token(myString)
print(tok)
reload(sys).setdefaultencoding("utf8")
# 2.
myString = u"I need 20 000€."
tok = Token(myString)
print(tok)
My question is why the two encoding variables are different in the first place
They serve different purposes.
sys.stdout.encoding should be the encoding that your terminal uses to interpret text; otherwise you may get mojibake in the output. It may be utf-8 in one environment, cp437 in another, etc.
sys.getdefaultencoding() is used on Python 2 for implicit conversions (when the encoding is not set explicitly); i.e., Python 2 may mix ascii-only bytestrings and Unicode strings together. For example, xml.etree.ElementTree stores text in the ascii range as bytestrings, and json.dumps() returns an ascii-only bytestring instead of Unicode in Python 2, probably for performance reasons: bytes were cheaper than Unicode for representing ascii characters. Implicit conversions are forbidden in Python 3.
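For instance, at a Python 2 prompt:
>>> import json
>>> json.dumps({'name': u'Sándor'})
'{"name": "S\\u00e1ndor"}'               # an ascii-only bytestring (type str)
>>> u'value: ' + json.dumps({'n': 1})    # mixing it with unicode relies on the implicit ascii conversion
u'value: {"n": 1}'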
sys.getdefaultencoding() is always 'ascii' on all systems in Python 2 unless you override it, which you should not do: overriding it may hide bugs, and your data may easily be corrupted because the implicit conversions use a possibly wrong encoding for the data.
btw, there is another common encoding sys.getfilesystemencoding() that may be different from the two. sys.getfilesystemencoding() should be the encoding that is used to encode OS data (filenames, command-line arguments, environment variables).
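A tiny sketch printing all three (the commented values are only examples and vary per system):
import sys
print(sys.getdefaultencoding())      # 'ascii' on Python 2 unless overridden
print(sys.stdout.encoding)           # terminal encoding, e.g. 'UTF-8' or 'cp437'
print(sys.getfilesystemencoding())   # OS-data encoding, e.g. 'utf-8' or 'mbcs'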
The source code encoding declared using # -*- coding: utf-8 -*- may be different from all of the already-mentioned encodings.
Naturally, if you read data from a file or the network, it may use character encodings different from all of the above; e.g., if a file created in Notepad is saved using a Windows ANSI encoding such as cp1252, then on another system all the standard encodings can be different from it.
The point being: there can be multiple encodings for reasons unrelated to Python. To avoid the headache, use Unicode to represent text: convert encoded text to Unicode as soon as possible on input, and encode it to bytes (possibly using a different encoding) as late as possible on output. This is the so-called Unicode sandwich.
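A minimal sketch of that sandwich in Python 2; the file names and the utf-8 encodings are just placeholders:
# -*- coding: utf-8 -*-
import io

with io.open('names.txt', encoding='utf-8') as f:    # decode as early as possible
    names = [line.strip() for line in f]              # pure unicode inside the program

with io.open('out.txt', 'w', encoding='utf-8') as f:
    f.write(u'\n'.join(names))                        # encode as late as possible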
how do I manage to use the wrong encoding in this simple piece of code?
Your first code example is not fine: you use non-ascii literal characters in a byte string on Python 2, which you should not do. Use bytestring literals only for binary data (or for so-called native strings if necessary). The code may produce mojibake such as I need 20 000Γé¼. (notice the character noise) if you run it under Python 2 in any environment that does not use a utf-8-compatible encoding, such as the Windows console.
The second code example is OK, assuming reload(sys) is not part of it. If you don't want to prefix all string literals with u'', you could use from __future__ import unicode_literals.
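For example, a small sketch of the unicode_literals variant:
# -*- coding: utf-8 -*-
from __future__ import unicode_literals   # every plain string literal is now unicode

myString = "I need 20 000€."               # same as u"I need 20 000€."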
Your actual issue is the UnicodeEncodeError, and reload(sys) is not the right solution!
The correct solution is to configure your locale properly on POSIX (LANG, LC_CTYPE), or to set the PYTHONIOENCODING envvar if the output is redirected to a pipe/file, or to install win-unicode-console to print Unicode to the Windows console.
I have noticed the same behaviour of some standard code (mailman library).
Thanks for your analysis, it helped me save some time. :-)
The problem is exactly the same. My system uses sys.getdefaultencoding() and gets ascii, which is inappropriate for handling a list of 1000 UTF-8-encoded names.
There is a mismatch between the stdin/stdout and even filesystem encodings (utf-8) on one hand and the "defaultencoding" (ascii) on the other. The thread How to print UTF-8 encoded text to the console in Python < 3? seems to indicate that this is well known, and Changing default encoding of Python? contains some indication that a more homogeneous setup (like "utf-8 everywhere") would break other things such as the hash implementation.
For that reason it is also not straightforward to change the default encoding. (See http://blog.ianbicking.org/illusive-setdefaultencoding.html for various ways to do so.) setdefaultencoding is removed from the sys module in site.py.
I'm trying to start learning Python, but I got confused at the very first step.
I'm getting started with Hello, World, but when I try to run the script, I get:
Syntax Error: Non-UTF-8 code starting with '\xe9' in file C:\Documents and Settings\Home\workspace\Yassine frist stared\src\firstModule.py on line 5 but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details.
Add this as the first line of the file:
# -*- coding: utf-8 -*-
Put this as the first line of your program:
# coding: utf-8
See also Correct way to define Python source code encoding
First off, you should know what an encoding is. Read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
Now, the problem you are having is that most people write code in ASCII. Roughly speaking, that means that they use Latin letters, numerals and basic punctuation only in the code files themselves. You appear to have used a non-ASCII character code inside your program, which is confusing Python.
There are two ways to fix this. The first is to tell Python what encoding it should use to read the text file. You can do that by adding a # coding declaration at the top of the file. The second, and probably better, is to restrict yourself to ASCII in the code. Remember that you can always have whatever characters you like inside strings by writing them in their encoded form, e.g. \x00 or whatever.
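A tiny sketch of the second option in Python 3: the source file stays pure ASCII, and the character that the byte '\xe9' most likely stood for (é) is written as an escape:
greeting = 'caf\xe9'        # '\xe9' is the escaped form of the accented e; the file itself stays ASCII
print(greeting)             # prints the word with the accented e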
When you run Python through the interpreter, you must run it in this format: python filename.py (command line args) or you will also get this error. I made the comment because you mentioned you were a beginner.
I have searched and found some related problems but the way they deal with Unicode is different, so I can't apply the solutions to my problem.
I won't paste my whole code but I'm sure this isolated example code replicates the error:
(I'm also using wx for the GUI, so this is inside a class.)
#coding: utf-8
...
self.something = u'ЧЕТЫРЕ'
# show the Russian text in a Label on the GUI
self.ExampleLabel.SetValue(str(self.something))
In Eclipse everything works perfectly and it displays the Russian characters. However, when I try to run the file directly with Python, I get this error on the command line:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-11:
ordinal not in range(128)
I figured this has something to do with the command line not being able to output the Unicode characters, and Eclipse doing behind-the-scenes magic. Any help on how to make it work on its own?
When you call str() on something without specifying an encoding, the default encoding is used, which depends on the environment your program is running in. In Eclipse, that's different from the command line.
Don't rely on the default encoding, instead specify it explicitly:
self.ExampleLabel.SetValue(self.something.encode('utf-8'))
You may want to study the Python Unicode HOWTO to understand what encoding and str() do with unicode objects. The wxPython project has a page on Unicode usage as well.
Try self.something.encode('utf-8') instead.
If you use repr instead of str, it will handle the conversion for you and also cover the case where the object is not always a string, but you may find that it gives you an extra set of quotes or even the unicode u prefix in your context. repr is safer than str: str assumes ascii encoding, whereas repr shows your code points the same way you would see them in code, since wrapping with eval is supposed to convert it back to what it was; the repr has to be in a form Python source could be in, namely ascii-safe, since most Python code is written in ascii.
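A quick Python 2 illustration of that difference:
>>> s = u'ЧЕТЫРЕ'
>>> str(s)                  # implicit ascii encode -> UnicodeEncodeError
>>> repr(s)                 # ascii-safe escapes, wrapped in quotes with the u prefix
"u'\\u0427\\u0415\\u0422\\u042b\\u0420\\u0415'"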