Can a Python package name contain an umlaut, i.e. "ä", "ü" or "ö"? Are there limitations and differences (encoding, OS, Python 2 vs 3)?
https://en.wikipedia.org/wiki/Diaeresis_(diacritic)
Python 2.x allows only ASCII letters, digits, and underscores in identifiers.
Python 3.x supports far more characters, including the umlaut and other letters with diaereses. However, using non-ASCII characters in identifier names is not recommended: it can make it difficult for other users to type your package name or read your identifiers.
https://www.python.org/dev/peps/pep-3131/
https://www.dcl.hpi.uni-potsdam.de/home/loewis/table-3131.html
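As a rough illustration of the Python 3 rule, the sketch below checks a few candidate names with `str.isidentifier()` and `keyword.iskeyword()`. The helper name `is_importable_name` and the sample names are made up for this example. It only tests the lexical rule; whether an import actually finds the file also depends on the file system and its encoding.

```python
import keyword


def is_importable_name(name: str) -> bool:
    """Return True if `name` could be written after `import` in Python 3.

    Hypothetical helper: checks only the identifier syntax, not whether
    a matching file exists or can be located on disk.
    """
    return name.isidentifier() and not keyword.iskeyword(name)


# Sample names chosen for illustration only.
for candidate in ["mueller", "müller", "größe", "2fast", "my-package", "class"]:
    print(f"{candidate!r}: {is_importable_name(candidate)}")
```

Names with umlauts pass the check, while names that start with a digit, contain a hyphen, or are keywords do not.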
Related
One thing that I quite like about working in Python inside a Jupyter notebook is that I can use some Unicode symbols to name my variables. For example, to use Greek letters, I type \alpha followed by tab and I get α.
I just ran into an unexpected behaviour when using a bold capital T, \bfT followed by tab, which results in 𝐓.
The experiment is the following. Inside a cell (running Python 3) type:
T = 1
𝐓 = 2
print(T) # prints 2
To my surprise, the second line is reassigning the variable T, but I would expect it to be different from 𝐓. Can somebody please explain what's the catch with using Unicode?
I don't know if it helps, but as another experiment, I can see that the same two symbols as strings are in fact different
'T'.encode('utf8'), '𝐓'.encode('utf8') # (b'T', b'\xf0\x9d\x90\x93')
How is the notebook processing my variable names?
This behaviour is defined in the Python language reference for identifiers (variable names). https://docs.python.org/3/reference/lexical_analysis.html#identifiers
2.3. Identifiers and keywords
Identifiers (also referred to as names) are described by the following lexical definitions.
The syntax of identifiers in Python is based on the Unicode standard
annex UAX-31, with elaboration and changes as defined below; see also
PEP 3131 for further details.
[...]
All identifiers are converted into the normal form NFKC while parsing;
comparison of identifiers is based on NFKC.
We can confirm that T and 𝐓 are equivalent under NFKC using the standard library unicodedata module.
>>> import unicodedata
>>> unicodedata.normalize('NFKC','𝐓') == 'T'
True
So you should avoid using such similar Unicode characters as distinct variable names in the same scope.
But there are still a lot of Unicode characters that are unique and can be safely used in identifiers:
>>> unicodedata.normalize('NFKC','π©Ξ»')
'π©Ξ»'
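To check whether two spellings would collide before using them, a small sketch along these lines may help. The helper name `same_identifier` is an assumption made for this example; the `exec` call only reproduces the notebook experiment from the question in a throwaway namespace.

```python
import unicodedata


def same_identifier(a: str, b: str) -> bool:
    """Hypothetical helper: True if Python would treat a and b as one name."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)


BOLD_T = "\U0001D413"     # MATHEMATICAL BOLD CAPITAL T, as in the question
GREEK_TAU = "\u03a4"      # GREEK CAPITAL LETTER TAU, looks similar but is distinct

print(same_identifier("T", BOLD_T))     # both normalize to plain "T"
print(same_identifier("T", GREEK_TAU))  # Tau has no compatibility mapping to "T"

# Reproduce the experiment: the parser normalizes the bold T to "T",
# so the second assignment overwrites the first.
namespace = {}
exec(f"T = 1\n{BOLD_T} = 2", namespace)
print(namespace["T"])
```

The bold T folds into the plain T, while the Greek Tau stays a separate name even though it looks almost identical on screen.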
This error is present in some of the main Python built-in libraries. For example:
"foo".startswith("bar") # instead of .starts_with
re.findall("[ab]", "foobar") # instead of .find_all
Is this just a compatibility issue? PEP 8 states that method names must be written in lowercase with words separated by underscores.
I've been going through Learn Python The Hard Way as a sort of refresher. Instead of naming each example ex#.py (where # is the number of the exercise), however, I've just been calling them #.py. This worked fine until I got to Exercise 25, which requires you to import a module you just created through the interpreter. When I try this the following happens:
>>> import 25
File "<stdin>", line 1
import 25
^
SyntaxError: invalid syntax
I tried renaming the file to ex25.py and it then worked as expected (>>> import ex25). What I'm wondering is what are the naming requirements for python modules? I had a look at the official documentation here but didn't see it mention any restrictions.
Edit: All three answers by iCodez, Pavel and BrenBarn give good resources and help answer different aspects of this question. I ended up picking iCodez's answer as the correct one simply because it was the first answer.
Modules that you import with the import statement must follow the same naming rules set for variable names (identifiers). Specifically, they must start with either a letter [1] or an underscore and then be composed entirely of letters, digits [2], and/or underscores.
You may also be interested in what PEP 8, the official style-guide for Python code, has to say about module names:
Modules should have short, all-lowercase names. Underscores can be
used in the module name if it improves readability. Python packages
should also have short, all-lowercase names, although the use of
underscores is discouraged.
[1] Letters are the ASCII characters A-Z and a-z.
[2] Digits are the ASCII characters 0-9.
The explicit rules for what is allowed to be a valid identifier (variable, module name etc.) can be found here: https://docs.python.org/dev/reference/lexical_analysis.html#identifiers
In your case, this is the relevant sentence:
Within the ASCII range (U+0001..U+007F), the valid characters for identifiers are the same as in Python 2.x: the uppercase and lowercase letters A through Z, the underscore _ and, except for the first character, the digits 0 through 9.
Strictly speaking, you can name a Python file anything you want. However, in order to import it using the import statement, the filename (without the .py extension) needs to be a valid Python identifier, that is, something you could use as a variable name. That means it must use only alphanumerics and underscores, and not start with a digit. This is because the grammar of the import statement requires the module name to be an identifier.
This is why you didn't see the problem until you got to an exercise that requires importing. You can run a Python script with a numeric name from the command line with python 123.py, but you won't be able to import that module.
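The restriction comes from the grammar of the import statement, not from the import machinery itself. As a sketch of that distinction, the example below writes a throwaway module named 25.py into a temporary directory and loads it with `importlib.import_module`, which takes the module name as a string. The file contents and the `ANSWER` variable are invented for illustration.

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Create a module whose name is not a valid identifier.
    Path(tmp, "25.py").write_text("ANSWER = 42\n")

    sys.path.insert(0, tmp)
    try:
        # `import 25` would be a SyntaxError; passing the name as a
        # string bypasses the grammar of the import statement.
        module = importlib.import_module("25")
    finally:
        sys.path.remove(tmp)

print(module.__name__, module.ANSWER)
```

This works, but the module still cannot be referred to by name in ordinary code, so renaming the file to something like ex25.py remains the simpler choice.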
After a half hour searching Google, I am surprised I cannot find any way to create a file on Windows with slashes in the name. The customer demands that file names have the following structure:
04/28/2012 04:07 PM 6,781 12Q1_C125_G_04-17.pdf
So far I haven't found any way to encode the slashes so they become part of the file name instead of the path.
Any Suggestions?
You can't.
The forward slash is one of the characters that are not allowed to be used in Windows file names, see
http://msdn.microsoft.com/en-us/library/windows/desktop/aa365247(v=vs.85).aspx
The following fundamental rules enable applications to create and
process valid names for files and directories, regardless of the file
system:
Use a period to separate the base file name from the extension in the name of a directory or file.
Use a backslash (\) to separate the components of a path. The backslash divides the file name from the path to it, and one directory name from another directory name in a path. You cannot use a backslash in the name for the actual file or directory because it is a reserved character that separates the names into components.
Use a backslash as required as part of volume names, for example, the "C:\" in "C:\path\file" or the "\server\share" in
"\server\share\path\file" for Universal Naming Convention (UNC)
names. For more information about UNC names, see the Maximum Path
Length Limitation section.
Do not assume case sensitivity. For example, consider the names OSCAR, Oscar, and oscar to be the same, even though some file systems (such as a POSIX-compliant file system) may consider them as
different. Note that NTFS supports POSIX semantics for case
sensitivity but this is not the default behavior. For more
information, see CreateFile.
Volume designators (drive letters) are similarly case-insensitive. For example, "D:\" and "d:\" refer to the same volume.
Use any character in the current code page for a name, including Unicode characters and characters in the extended character set (128–255), except for the following:
The following reserved characters:
< (less than)
> (greater than)
: (colon)
" (double quote)
/ (forward slash)
\ (backslash)
| (vertical bar or pipe)
? (question mark)
* (asterisk)
Integer value zero, sometimes referred to as the ASCII NUL character.
Characters whose integer representations are in the range from 1 through 31, except for alternate data streams where these characters are allowed. For more information about file streams, see File
Streams.
Any other character that the target file system does not allow.
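The character rules quoted above can be sketched as a small check. The names `RESERVED`, `invalid_characters`, and `sanitize` are assumptions made for this example, and the sketch covers only the reserved characters and control characters listed in the quote; it does not handle other Windows naming rules such as reserved device names.

```python
# Reserved characters from the quoted list: < > : " / \ | ? *
RESERVED = set('<>:"/\\|?*')


def _is_invalid(ch: str) -> bool:
    # Code points 0 through 31 (including NUL) are also disallowed.
    return ch in RESERVED or ord(ch) < 32


def invalid_characters(name: str) -> list:
    """Return the characters in `name` that the quoted rules disallow."""
    return [ch for ch in name if _is_invalid(ch)]


def sanitize(name: str, replacement: str = "-") -> str:
    """Replace every disallowed character with `replacement`."""
    return "".join(replacement if _is_invalid(ch) else ch for ch in name)


requested = "04/28/2012 04:07 PM"
print(invalid_characters(requested))
print(sanitize(requested))
```

For the date format in the question, both the slashes and the colon are flagged, so a substitute such as a hyphen is needed for each.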
At least, none of the Windows installations I've seen will let you create files with slashes in their names.
Even if it were somehow possible through some deep magic, it would probably break almost all applications, including Windows Explorer.
You could abuse Windows' Unicode capabilities, though.
Creating a file with ∕ (this is not a forward slash, it is "division slash", see http://www.fileformat.info/info/unicode/char/2215/index.htm ) in its name works just fine, for example.
Um... forward slash is not a legal character in a Windows file name?
http://msdn.microsoft.com/en-us/library/windows/desktop/aa365247(v=vs.85).aspx
I have a program that includes an embedded Python 2.6 interpreter. When I invoke the interpreter, I call PySys_SetPath() to set the interpreter's import-path to the subdirectories installed next to my executable that contain my Python script files... like this:
PySys_SetPath("/path/to/my/program/scripts/type1:/path/to/my/program/scripts/type2");
(except that the path strings are dynamically generated based on the current location of my program's executable, not hard-coded as in the example above)
This works fine... except when the clever user decides to install my program underneath a folder that has a colon in its name. In that case, my PySys_SetPath() command ends up looking like this (note the presence of a folder named "path:to"):
PySys_SetPath("/path:to/my/program/scripts/type1:/path:to/my/program/scripts/type2");
... and this breaks all my Python scripts, because now Python looks for script files in "/path" and "to/my/program/scripts/type1" instead of in "/path:to/my/program/scripts/type1", and so none of the import statements work.
My question is, is there any fix for this issue, other than telling the user to avoid colons in his folder names?
I looked at the makepathobject() function in Python/sysmodule.c, and it doesn't appear to support any kind of quoting or escaping to handle literal colons.... but maybe I am missing some nuance.
The problem you're running into is that the PySys_SetPath function parses the string you pass using a colon as the delimiter. That parser sees each : character as delimiting a path, and there isn't a way around this (the colon can't be escaped).
However, you can bypass this by creating a list of the individual paths (each of which may contain colons) and using PySys_SetObject to set sys.path:
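/* Build sys.path as a list so colons inside an entry are preserved. */
PyObject *path = PyList_New(0);
PyObject *entry = PyString_FromString("foo:bar");
PyList_Append(path, entry);      /* does not steal the reference */
Py_DECREF(entry);
PySys_SetObject("path", path);   /* does not steal the reference either */
Py_DECREF(path);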
Now the interpreter will see "foo:bar" as a distinct component of the sys.path.
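The difference between the two approaches can be sketched in Python. This is only a simulation of colon-delimited parsing using `str.split`, not the actual C parser, and the paths are the ones from the question.

```python
# The two directories the embedding program wants on sys.path.
entries = [
    "/path:to/my/program/scripts/type1",
    "/path:to/my/program/scripts/type2",
]

# Joining with ":" and splitting again mimics a colon-delimited parser:
# the colon inside "path:to" is indistinguishable from a separator.
joined = ":".join(entries)
reparsed = joined.split(":")
print(reparsed)

# Keeping the entries as a list never needs a delimiter, so nothing is lost.
print(entries == reparsed)
```

Once the entries are flattened into one string, the original boundaries cannot be recovered, which is why passing a list avoids the problem.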
Supporting colons in a file path opens up a huge can of worms on multiple operating systems: the colon is not a valid file name character on Windows, and on Mac OS X it is the legacy path separator, so it doesn't seem like a particularly reasonable thing to support in a scripting environment either. I'm actually a bit surprised that Linux allows colons in file names, especially since : is a very common path separator character.
You might try escaping the colon, i.e. converting /path:to/ to /path\:to/, and see if that works. Other than that, just tell the user to avoid using colons in their file names. They will run into all sorts of problems in quite a few different environments, and it's just a plain bad idea.