I'm trying to start learning Python, but I became confused from the first step.
I'm getting started with Hello, World, but when I try to run the script, I get:
Syntax Error: Non-UTF-8 code starting with '\xe9' in file C:\Documents and Settings\Home\workspace\Yassine frist stared\src\firstModule.py on line 5 but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details.
Add this as the first line of your file:
# -*- coding: utf-8 -*-
Put this as the first line of your program:
# coding: utf-8
See also Correct way to define Python source code encoding
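For example, a minimal sketch of what the top of the file could look like (the accented character is only there to show that the declaration makes it legal):

# -*- coding: utf-8 -*-
# The declaration above tells the interpreter how to decode any
# non-ASCII bytes that appear in this source file.
print('café')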
First off, you should know what an encoding is. Read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
Now, the problem you are having is that most people write code in ASCII. Roughly speaking, that means they use only Latin letters, numerals and basic punctuation in the code files themselves. You appear to have used a non-ASCII character inside your program, which is confusing Python.
There are two ways to fix this. The first is to tell Python what encoding it should use to read the text file; you can do that by adding a # coding declaration at the top of the file. The second, and probably better, is to restrict yourself to ASCII in the source. Remember that you can always have whatever characters you like inside strings by writing them in their escaped form, e.g. \xe9.
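For example, instead of typing the accented character directly, you can keep the source pure ASCII (a sketch; \xe9 is the byte named in the error message above, LATIN SMALL LETTER E WITH ACUTE):

# No coding declaration needed: the non-ASCII character only
# exists in escaped form, so the file itself is pure ASCII.
greeting = u'caf\xe9'   # the same string as u'café'
print(greeting)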
When you run the script through the interpreter, you must run it in this format: python filename.py (command-line args), or you will also get this error. I make this comment because you mentioned you were a beginner.
Related
I have some files in a Windows 10 directory that were named with encryption logic written in VB. The VB logic was originally written on Windows 7 with VB.net, but the file names are exactly the same between the two versions of Windows, as expected. The problem I'm having is that when I try to decrypt those file names in a character-by-character loop in Python 3.7.4, the value returned by the ord() function doesn't match what the VB asc() function gives for that character.
All the letters match (up to ASCII character 126), but everything after that does not.
For example, in VB:
?asc("ƒ")
returns 131.
However, in Python 3.7.4:
ord('ƒ')
returns 402.
I've read a lot of great posts here discussing UTF-8 vs cp1252 encoding both for strings of data (within files) and filenames, but I haven't come across a solution for my problem.
When I run:
sys.getdefaultencoding()
I get 'utf-8'. This is what, I believe, would be used for file names and the functions that handle them, e.g. os.fsdecode(), os.listdir(), etc.
When I run:
locale.getpreferredencoding()
I get 'cp1252'.
One thing I noticed on the "other side of the fence" is that the values returned by Python's ord() DO match the VB equivalent AscW(), but altering all that code is going to be more problematic than moving forward with the rest of what we've done in Python so far.
Should I be altering the locale's preferredencoding or the sys's default encoding to solve this problem?
Thanks!
Note that the Python value is the Unicode code point (and is bigger than 255). If you already have the correct filenames in Python, just encode the strings with the appropriate encoding (apparently cp1252) and examine the byte values. (You could also call functions like os.listdir with bytes arguments to suppress the decoding in the first place.)
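A sketch of both approaches, assuming Python 3 (the directory argument is a placeholder for your actual folder):

import os

name = 'ƒ'                       # U+0192 LATIN SMALL LETTER F WITH HOOK
print(ord(name))                 # 402 -- the Unicode code point
encoded = name.encode('cp1252')  # b'\x83' -- the single cp1252 byte
print(encoded[0])                # 131 -- matches VB's asc()

# Or suppress the decoding entirely by passing bytes to os.listdir:
for raw_name in os.listdir(b'.'):
    print(raw_name)              # undecoded bytes objects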
I made a program which contains Chinese and Russian words, but when I ran it, I had a problem with the encoding.
In the code that I shared, a complete sentence with some Russian and Chinese characters is shown. With that variable assignment, the SyntaxError arises. But when I write sentence = input() and the user enters the same sentence, no error appears.
sentence='n紙sнo頭q愛z語u買gлd娜xтgлj鳥u買gлcхd娜u買 рj鳥pщi魚d娜gлh園d娜gлn紙r無z語 рr無pщl電pщv書kмz語u買gлkмu買o頭d娜r無n紙r無d娜o頭pщh園z語gлh園d娜gлpщcхo頭z語gлu買kмwзd娜cхgлsнgлz語r無kмd娜u買o頭pщh園z語gлpщgлz語aчi魚d娜o頭z語xтgлv書z語u買gлd娜cхgлv書j鳥pщcхgлn紙z語h園d娜l電z語xтgлv書r無d娜pщr無gлo頭z語h園z語gлo頭kмn紙z語gлh園d娜gлpщn紙cхkмv書pщv書kмz語u買d娜xтgлd娜u買o頭r無d娜gлxтj鳥xтgлh園kмwзd娜r無xтz語xтgлo頭kмn紙z語xтgлh園d娜gлd娜xтo頭r無j鳥v書o頭j鳥r無pщxтgлh園d娜gлh園pщo頭z語xтgлxтd娜gлd娜u買v書j鳥d娜u買o頭r無pщgлh園kмv書v書kмz語u買pщr無kмz語xтgлh園d娜gлh園pщo頭z語xтgлd娜u買gлd娜xтo頭d娜gлo頭j鳥o頭z語r無kмpщcхgлpщn紙r無d娜u買h園d娜r無d娜l電z語xтgлpщgлj鳥o頭kмcхkмñсpщr無gлd娜xтo頭pщgлd娜xтo頭r無j鳥v書o頭j鳥r無pщgлr無d娜wзkмxтpщu買h園z語gлxтj鳥xтgлl電d娜o頭z語h園z語xтgлl電pщxтgлj鳥o頭kмcхkмñсpщh園z語xт'
SyntaxError: Non-UTF-8 code starting with '\xe5' in file hjs.py on line 1, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
How can I solve it?
First of all, welcome to Stack Overflow!
Second, you could solve your problem by using Python 3 or, for Python 2, following what is said in this answer.
But why?
Well, according to the aforementioned PEP 263,
Python will default to ASCII as standard encoding if no other encoding hints are given.
And in the PEP you can see the same fix the mentioned answer gives: add the line # -*- coding: <encoding name> -*-
And why isn't Python 3 affected by this issue?
As said here,
Since Python 3.0, the language’s str type contains Unicode characters (...)
So there is no need for adding the coding magic comment.
For more on that, the full Unicode article linked above is a great read, and as it is a classic on Stack Overflow, please see this.
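For instance, a file like the following runs unmodified under Python 3, with no magic comment (a sketch; the variable is made up for illustration):

# Python 3 source defaults to UTF-8, and str holds Unicode code
# points, so mixed-script literals need no coding declaration.
sentence = 'привет 你好 γειά'
print(len(sentence))   # counts characters (code points), not bytes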
I wrote a script in Python 2 which is separated into 4-5 modules. I use the Hungarian language in the script, which contains several unusual characters like öüóőúéáűí. I wrote the modules on Win7 with the original cp1250 encoding, and then I moved to Ubuntu Raring, where the system default is UTF-8.
First, Tkinter left the labels containing the special letters blank, which I managed to debug by setting the coding at the beginning of every module to # -*- Utf-8 -*-.
The Entry widgets started to go mad too. Their .get() methods raised UnicodeDecodeError: 'ascii' codec can't decode byte...
And finally, if for example module a.py had a dictionary dict = {'Sándor': 16} and module b.py has the line a.dict['Sándor'], it raises KeyError, as if dict didn't contain 'Sándor'. It doesn't do this with keys containing only plain characters, nor does it do this with dictionaries defined in the module's own file.
I wrote a script in python2... I use the Hungarian language in the script...
Did you use unicode literals? No, you did not. Rewrite your script to use and handle them properly.
{u'Sándor': 16}
Unicode In Python, Completely Demystified
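A minimal sketch of the difference in Python 2 (the key names mirror the question; the decode call assumes the source file is saved as UTF-8):

# -*- coding: utf-8 -*-
ages = {u'Sándor': 16}                  # unicode key, via the u'' literal

print(u'Sándor' in ages)                # True: unicode matches unicode
print('Sándor' in ages)                 # False (plus a UnicodeWarning):
                                        # a byte string never equals unicode here
print('Sándor'.decode('utf-8') in ages) # True once decoded explicitly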
I needed to start dealing with foreign characters, and in doing so, I think I royally screwed up a file's encoding.
The error I'm getting is:
Lexical error at line 1, column 8. Encountered: "" (0), after : ""
The first line of the file is:
import xml.etree.cElementTree as ET
Also of note: when I pasted the line above into the textarea to ask this question and submitted it, an unknown character appeared between every character.
I have been unable to fix this issue by adding an explicit coding definition:
# -*- coding: utf-8 -*-
I have also been unable to revert the file (using Hg) to a previous version, to copy/paste the code into a new file, or to replace the broken file with copied/pasted code.
Please help!
If it is indeed a zero character in there, you may find you've injected some UTF-16/UCS-2 text. That particular Unicode encoding would have a zero byte in between every ASCII character.
The best way to find out is to do a hex dump of your file with something like od -xcb myfile.py.
If that is the case, then you'll need to edit the file with something that's able to see those characters, and fix them up.
vi would be my first choice (since that's what I'm used to) but I don't want to start any holy wars with the Emacs illuminati. In vi, they'll most likely show up as ^# characters.
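A sketch of the diagnosis and repair from Python itself (Python 3 syntax; 'broken.py' is a placeholder for the affected file):

# Read the raw bytes and look for NULs, the fingerprint of UTF-16/UCS-2.
with open('broken.py', 'rb') as f:
    data = f.read()
print(data.count(b'\x00'), 'NUL bytes found')
print(data[:16])   # b'\xff\xfe' at the start would be a UTF-16 BOM

# If the whole file turns out to be UTF-16, decode it and write it
# back as plain UTF-8:
if data.startswith((b'\xff\xfe', b'\xfe\xff')):
    with open('broken.py', 'w', encoding='utf-8') as f:
        f.write(data.decode('utf-16'))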
I am quite new to Python so my question might be silly, but even after reading through a lot of threads I didn't find an answer to my question.
I have a mixed-source document which contains HTML, XML, LaTeX and other text formats, and which I am trying to get into a LaTeX-only format.
Therefore, I have used Python to recognise the different commands with regular expressions and replace them with the appropriate LaTeX command. Everything has worked out fine so far.
Now I am left with some "raw" Unicode characters, such as the Greek letters. Unfortunately it is just about too much to do by hand, so I am looking for a way to do this the smart way too. Is there a way for Python to recognise / read them? And how do I tell Python to recognise / read e.g. pi written as a Greek letter?
A minimal example of the code I use is:
import re

# read the whole mixed-format source document
fh = open('SOURCE_DOCUMENT', 'r')
stuff = fh.read()
fh.close()

# replace a recognised pattern with the corresponding LaTeX command
new_stuff = re.sub('READ', 'REPLACE', stuff)

# write the converted text out
fh = open('LATEX_DOCUMENT', 'w')
fh.write(new_stuff)
fh.close()
I am not sure whether this is important information or not, but I am using Python 2.6 on Windows.
I would be really glad if someone could give me a hint, at least as to where to find the relevant information or how this might work, or tell me whether I am completely wrong and Python can't do this job ...
Many thanks in advance.
Cheers,
Britta
You talk of "raw" Unicode strings. What does that mean? Unicode itself is not an encoding, but there are different encodings to store Unicode characters (read this post by Joel).
The open function in Python 3.0 takes an optional encoding argument that lets you specify the encoding, e.g. UTF-8 (a very common way to encode Unicode). In Python 2.x, have a look at the codecs module, which also provides an open function that allows specifying the encoding of the file.
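A sketch of how the question's script might look with codecs.open in Python 2.x (the Greek-to-LaTeX mapping here is an illustrative assumption, not a complete table):

# -*- coding: utf-8 -*-
import codecs

fh = codecs.open('SOURCE_DOCUMENT', 'r', encoding='utf-8')
stuff = fh.read()          # a unicode object, already decoded
fh.close()

# Raw Greek characters can now be matched and replaced directly.
greek_to_latex = {u'π': u'\\pi ', u'α': u'\\alpha ', u'β': u'\\beta '}
for char, command in greek_to_latex.items():
    stuff = stuff.replace(char, command)

fh = codecs.open('LATEX_DOCUMENT', 'w', encoding='utf-8')
fh.write(stuff)
fh.close()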
Edit: alternatively, why not just let those poor characters be, and specify the encoding of your LaTeX file at the top:
\usepackage[utf8]{inputenc}
(I never tried this, but I figure it should work. You may need to replace utf8 by utf8x, though)
Please, first, read this:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Then, come back and ask questions.
You need to determine the encoding of the input document. Unicode can encode over a million characters, but files can only store 8-bit values (0-255), so the Unicode text must be encoded in some way.
If the document is XML, the encoding should be declared in the first line (encoding="..."; "utf-8" is the default if there is no encoding attribute). For HTML, look for the "charset" declaration.
If all else fails, open the document in an editor where you can set the encoding (jEdit, for example). Try them until the text looks right. Then use this value as the encoding parameter for codecs.open() in Python.
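If you would rather probe from Python than from an editor, a small sketch like this can help (the candidate list is an assumption; note that latin-1 accepts any byte sequence, so keep it last):

import codecs

# Try candidate encodings until one decodes the file without errors.
for enc in ('utf-8', 'cp1252', 'latin-1'):
    try:
        f = codecs.open('SOURCE_DOCUMENT', 'r', encoding=enc)
        text = f.read()
        f.close()
        print('decoded without errors as', enc)
        break
    except UnicodeDecodeError:
        print('not', enc)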