What is the best way to ensure a reliable path in Python?

I'm a bit puzzled by the way paths are handled in Python. Using common constructs like "~" or "." or "..", I often run into cases where a path is not recognized as valid or existing, especially if I pass the path on as an argument to a shell command; but all of my problems go away if I always do something like:
some_path = os.path.abspath(os.path.expanduser(some_path))
Is this a common — or perhaps even required — idiom, or am I just reinventing the wheel? Should I really expect that wherever I have some_path, I should have the above code before passing it to any (or at least most) functions that do anything with it?

Yes, most things you can call will expect a path that has been run through that idiom. When you use paths like that in the shell (e.g., when you do something like cat ~raxacoricofallapatorius/foo.txt), it is the shell itself - rather than cat or any other program you might run - that does the path normalisation.
You can verify this trivially - e.g.,
lvc@tiamat:~/Projects$ echo ~
/home/lvc
So this does mean that if you expect to get a path with those kinds of variables as input, you will need to do the preprocessing yourself. The alternative is to run the commands through a shell, and be ready to deal with all the problems that brings.
However, at least on unix-like systems (Windows may or may not behave the same way), you don't need to do this for . and .. - these are understood by the system calls, and the shell does not transform them - so, eg:
lvc@tiamat:~/Projects$ file ..
..: directory
lvc@tiamat:~/Projects$ file ~
/home/lvc: directory
Notice that file sees .. unchanged, but sees the expanded form of ~.
This means that if all you want is paths that will work directly in external programs, passing them through expanduser and possibly expandvars is sufficient. You will want to call abspath primarily if the program you are calling out to will run in a different working directory than yours.
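For illustration, a minimal sketch of that preprocessing when handing a path to an external program without a shell (the path and command here are just examples):
import os.path
import subprocess

raw = '~/Projects/foo.txt'
# With an argument list, subprocess does not involve a shell, so ~ and $VARs
# are NOT expanded for us - do it before passing the path on.
expanded = os.path.expandvars(os.path.expanduser(raw))
subprocess.call(['cat', expanded])   # ['cat', raw] would fail with "No such file"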

Yes, if you need an absolute path with $HOME resolved, you'll have to do that.
It should be easy enough to write a short helper function, if you require this functionality regularly. There are also path helper libraries available, like these:
https://github.com/jaraco/path.py
https://github.com/xando/python-shelltools
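For example, a short helper along those lines might look like this (the name ensure_path is just a placeholder):
import os.path

def ensure_path(path):
    # Expand ~ and ~user, then $VAR references, then make the result absolute.
    return os.path.abspath(os.path.expandvars(os.path.expanduser(path)))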

It's generally a good idea. os.path.abspath will resolve relative components like . and .., and os.path.expanduser takes care of ~. If you want your code to be portable across OSes, you should be using os.path instead of defining your own path handling, if you can - os.path always points to the correct path module for the OS you are on. If you define your own path functions, you lose the built-in cross-platform behavior of os.path.
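As a quick illustration of that cross-platform behavior:
import os.path

# os.path is an alias for the right module on each OS (posixpath or ntpath),
# so the separator matches the platform: / on Linux/Mac, \ on Windows.
print(os.path.join('data', 'file.txt'))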

Related

Get current directory - 'os' and 'subprocess' library are banned

I'm stuck between a rock and a hard place.
I have been given a very restricted Python 2/3 environment where both the os and subprocess libraries are banned. I am testing file upload functionality using Python Selenium; creating the file is straightforward, however, when I use the method driver.send_keys(xyz), send_keys expects a full path rather than just a file name. Any suggestions on how to work around this?
I do not know if it would work in a very restricted Python 2/3, but you might try the following:
Create an empty file, let's say empty.py, like so:
with open('empty.py', 'w') as f:
    pass
then do:
import empty
and:
pathtofile = empty.__file__
print(pathtofile)
This exploits the fact that an empty text file is a legal Python module (remember to pick a name that is not already in use) and that the import machinery generally sets the __file__ dunder, though this is not required, so you might end up with None. Keep in mind that if it is not None, it is the path to said file (empty.py), so you would need to further process it to get the directory itself.
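If pathlib itself is allowed (an assumption - the question only says os and subprocess are banned, and pathlib does use os under the hood), you could turn that value into the full path send_keys wants, e.g.:
import pathlib

# pathtofile is empty.__file__ from the snippet above
upload_dir = pathlib.Path(pathtofile).resolve().parent   # folder containing empty.py
full_path = str(upload_dir / 'myupload.txt')             # hypothetical file name to upload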
With no way of using the os module, it would seem you're SOL unless the placement of your script is static (i.e. you define the current working directory as a constant) and you then handle all the path/string operations within that directory yourself.
You won't be able to explore what's in the directory, but you can keep tabs on any files you create and just store the paths yourself; it will be tedious, but there's no reason why it shouldn't work.
You won't be able to delete files, I don't think, but you should be able to clear their contents.

Analyzing impact of adding an import-path

I have the following directory structure:
TestFolder:
    test.py
CommonFolder:
    common.py
And in test.py, I need to import common.py.
In order to do that, in test.py I add the path of CommonFolder to the system paths.
Here is what I started off with:
sys.path.append(os.path.join(os.path.dirname(os.path.dirname(__file__)), 'CommonFolder'))
Then I figured that / is a valid separator in pretty much every OS, so I changed it to:
sys.path.append(os.path.dirname(os.path.dirname(__file__)) + '/CommonFolder')
Then I figured that .. is also a valid syntax in pretty much every OS, so I changed it to:
sys.path.append(os.path.dirname(__file__) + '/../CommonFolder')
My questions:
Are my assumptions above correct, and will the code run correctly on every OS?
In my last change, I essentially add a slightly longer path to the system paths. More precisely - FullPath/TestFolder/../CommonFolder instead of FullPath/CommonFolder. Is there any runtime impact to this? I suppose that every import statement might be executed slightly slower, but even if so, that would be minor. Is there any good reason not to do it this way?
If you're writing code to span multiple operating systems, it's best not to try to construct the paths yourself. Between Linux and Windows you immediately run into the forward vs. backward slash issue, just as an example.
I'd recommend looking into the Python pathlib library. It handles generating paths for different operating systems.
https://docs.python.org/3/library/pathlib.html
This is a great blog about this subject and how to use the library:
https://medium.com/@ageitgey/python-3-quick-tip-the-easy-way-to-deal-with-file-paths-on-windows-mac-and-linux-11a072b58d5f
UPDATE:
Updating this with a more specific answer.
Regarding the directory paths, as long as you're building them with a utility such as pathlib rather than by hand, the paths you've created should be fine. Linux, Mac, and Windows all support relative paths (both Mac and Linux are Unix-based, of course).
As for whether it's efficient, unless you're frequently dynamically loading or reloading your source files (which is not common) most files are loaded into memory before the code is run, so there would be no performance impact on setting up the file paths in this way.
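For what it's worth, a minimal pathlib version of the sys.path line from the question might look like this (assuming CommonFolder really is a sibling of TestFolder):
import sys
from pathlib import Path

# __file__ is test.py; two .parent hops reach the folder that contains
# both TestFolder and CommonFolder.
common_dir = Path(__file__).resolve().parent.parent / 'CommonFolder'
sys.path.append(str(common_dir))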
Use os.path.join() for an OS-independent path separator instead of /
Example: os.path.join(os.path.dirname(__file__),"..","CommonFolder")
Or instead you can make CommonFolder a Python package by placing an empty file named __init__.py inside it (the parent directory of CommonFolder still needs to be on sys.path for the import to resolve). After that you can simply import common in test.py as:
from CommonFolder import common

Is there a way to combine a python project codebase that spans across different files into one file?

The reason I want to do this is that I want to use the tool pyobfuscate to obfuscate my Python code. But pyobfuscate can only obfuscate one file.
I've answered your direct question separately, but let me offer a different solution to what I suspect you're actually trying to do:
Instead of shipping obfuscated source, just ship bytecode files. These are the .pyc files that get created, cached, and used automatically, but you can also create them manually by just using the compileall module in the standard library.
A .pyc file with its .py file missing can be imported just fine. It's not human-readable as-is. It can of course be decompiled into Python source, but the result is… basically the same result you get from running an obfuscator on the original source. So, it's slightly better than what you're trying to do, and a whole lot easier.
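For instance, byte-compiling a whole tree is only a couple of lines (a rough sketch - 'myproject' is a placeholder, and legacy=True is the Python 3 option that writes each .pyc next to its .py so it can be shipped and imported without the source):
import compileall

# Compile every .py under myproject/; legacy=True puts the .pyc files
# next to their sources instead of under __pycache__.
compileall.compile_dir('myproject', legacy=True)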
You can't compile your top-level script this way, but that's easy to work around. Just write a one-liner wrapper script that does nothing but import the real top-level script. If you have if __name__ == '__main__': code in there, you'll also need to move that to a function, and the wrapper becomes a two-liner that imports the module and calls the function… but that's as hard as it gets. Alternatively, you could run pyobfuscate on just the top-level script, but really, there's no reason to do that.
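A hypothetical wrapper of that kind (module and function names are made up):
# run.py - the only file shipped as plain source
import realmain
realmain.main()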
In fact, many of the packager tools can optionally do all of this work for you automatically, except for writing the trivial top-level wrapper. For example, a default py2app build will stick compiled versions of your own modules, along with stdlib and site-packages modules you depend on, into a pythonXY.zip file in the app bundle, and set up the embedded interpreter to use that zipfile as its stdlib.
There are a definitely ways to turn a tree of modules into a single module. But it's not going to be trivial. The simplest thing I can think of is this:
First, you need a list of modules. This is easy to gather with the find command or a simple Python script that does an os.walk.
Then you need to use grep or Python re to get all of the import statements in each file, and use that to topologically sort the modules. If you only do absolute flat import foo statements at the top level, this is a trivial regex. If you also do absolute package imports, or from foo import bar (or from foo import *), or import at other levels, it's not much trickier. Relative package imports are a bit harder, but not that big of a deal. Of course if you do any dynamic importing, use the imp module, install import hooks, etc., you're out of luck here, but hopefully you don't.
Next you need to replace the actual import statements. With the same assumptions as above, this can be done with a simple sed or re.sub, something like import\s+(\w+) with \1 = sys.modules['\1'].
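A quick sketch of that substitution with re (it only handles the plain import foo form):
import re

line = 'import foo'
print(re.sub(r'import\s+(\w+)', r"\1 = sys.modules['\1']", line))
# prints: foo = sys.modules['foo']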
Now, for the hard part: you need to transform each module into something that creates an equivalent module object dynamically. This is the hard part. I think what you want to do is to escape the entire module code so that it can put into a triple-quoted string, then do this:
import sys
import types

mod_globals = {}
exec('''
# escaped version of original module source goes here
''', mod_globals)
mod = types.ModuleType(module_name)   # module_name: the dotted name of the original module
mod.__dict__.update(mod_globals)
sys.modules[module_name] = mod
Now just concatenate all of those transformed modules together. The result will be almost equivalent to your original code, except that it's doing the equivalent of import foo; del foo for all of your modules (in dependency order) right at the start, so the startup time could be a little slower.
You can make a tool that:
Reads through your source files and puts all identifiers in a set.
Subtracts all identifiers from recursively searched standard and third-party modules from that set (modules, classes, functions, attributes, parameters).
Subtracts some explicitly excluded identifiers from that list as well, as they may be used in getattr/setattr/exec/eval.
Replaces the remaining identifiers with gibberish.
Or you can use this tool I wrote that does exactly that.
To obfuscate multiple files, use it as follows:
For safety, backup your source code and valuable data to an off-line medium.
Put a copy of opy_config.txt in the top directory of your project.
Adapt it to your needs according to the remarks in opy_config.txt.
This file only contains plain Python and is exec’ed, so you can do anything clever in it.
Open a command window, go to the top directory of your project and run opy.py from there.
If the top directory of your project is e.g. ../work/project1 then the obfuscation result will be in ../work/project1_opy.
Further adapt opy_config.txt until you’re satisfied with the result.
Type ‘opy ?’ or ‘python opy.py ?’ (without the quotes) on the command line to display a help text.
I think you can try using the find command with the -exec option.
You can execute all Python scripts in a directory with the following command:
find . -name "*.py" -exec python {} ';'
Hope this helps.
EDIT:
Oh sorry, I overlooked that if you obfuscate the files separately they may not run properly, because it renames functions to different names in different files.

Most efficient way to traverse file structure Python

Is using os.walk in the way below the least time consuming way to recursively search through a folder and return all the files that end with .tnt?
for root, dirs, files in os.walk('C:\\data'):
    print "Now in root %s" % root
    for f in files:
        if f.endswith('.tnt'):
Yes, using os.walk is indeed the best way to do that.
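For completeness, a filled-in version of that loop (joining root and filename so the matches are usable paths):
import os

tnt_files = []
for root, dirs, files in os.walk('C:\\data'):
    for f in files:
        if f.endswith('.tnt'):
            tnt_files.append(os.path.join(root, f))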
As everyone has said, os.walk is almost certainly the best way to do it.
If you actually have a performance problem, and profiling has shown that it's caused by os.walk (and/or iterating the results with .endswith), your best answer is probably to step outside Python. Replace all of the code above with:
for f in sys.argv[1:]:
Now you need some outside tool that can gather the paths and run your script. (Ideally batching as many paths as possible into each script execution.)
If you can rely on Windows Desktop Search having indexed the drive, it should only need to do a quick database operation to find all files under a certain path with a certain extension. I have no idea how to write a batch file that runs that query and gets the results as a list of arguments to pass to a Python script (or a PowerShell file that runs the query and passes the results to IronPython without serializing it into a list of arguments), but it would be worth researching this before anything else.
If you can't rely on your platform's desktop search index, on any POSIX platform, it would almost certainly be fastest and simplest to use this one-liner shell script:
find /my/path -name '*.tnt' -exec myscript.py {} +
Unfortunately, you're not on a POSIX platform, you're on Windows, which doesn't come with the find tool, which is the thing that's doing all the heavy lifting here.
There are ports of find to native Windows, but you'll have to figure out the command-line intricacies to get everything quoted right and format the path and so on, so you can write the one-liner batch file. Alternatively, you could install cygwin and use the exact same shell script you'd use on a POSIX system. Or you could find a more Windows-y tool that does what you need.
This could conceivably be slower rather than faster—Windows isn't designed to execute lots of little processes with as little overhead as possible, and I believe it has smaller limits on command lines than platforms like linux or OS X, so you may spend more time waiting for the interpreter to start and exit than you save. You'd have to test to see. In fact, you probably want to test both native and cygwin versions (with both native and cygwin Python, in the latter case).
You don't actually have to move the find invocation into a batch/shell script; it's probably the simplest answer, but there are others, such as using subprocess to call find from within Python. This might solve performance problems caused by starting the interpreter too many times.
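A rough sketch of that subprocess variant (assumes a POSIX or cygwin find on the PATH):
import subprocess

out = subprocess.check_output(['find', '/my/path', '-name', '*.tnt'])
paths = out.decode().splitlines()   # one matching path per line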
Getting the right amount of parallelism may also help—spin off each invocation of your script to the background and don't wait for them to finish. (I believe on Windows, the shell isn't involved in this; instead there's a tool named something like "run" that kicks off a process detached from the shell. But I don't remember the details.)
If none of this works out, you may have to write a custom C extension that does the fastest possible Win32 or .NET thing (which also means you have to do the research to find out what that is…) so you can call that from within Python.

Faster way to get a directory listing than invoking "ls" in a subprocess

After a search and some test runs, both os.popen()+read() and subprocess.check_output() seem to be almost equivalent for reading out the contents of a folder. Is there a way to improve on either os.popen()+read() or subprocess.check_output()? I have to ls a number of folders and read the outputs, and using either of the above is similar, but represents the major bottleneck according to profiling results.
You are looking for os.listdir and/or os.walk, and perhaps also the os.stat family of functions. These are (Python bindings to) the same primitives that ls uses itself, so anything you can do by parsing the output of ls you can do with these. I recommend you read carefully through everything the os, os.path, and stat modules offer; there may be other things you don't need an external program to do.
You should probably also read the documentation for stat, the underlying system call -- it's C-oriented, but it'll help you understand what os.stat does.
Why don't you just read the directory contents directly with os.listdir? Why do you need to shell out to ls? If you need more information about files in addition to just the filenames, you can also use os.stat. It's much more efficient to do the system calls yourself than to create subprocesses to do it for you.
For doing full directory traversals, there's os.walk. The shutil module also has some useful functions.
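For example, a minimal in-process version of what the ls calls were doing (folder names are just examples):
import os

for folder in ['/var/log', '/tmp']:          # example folders
    for name in os.listdir(folder):
        full = os.path.join(folder, name)
        size = os.stat(full).st_size         # extra metadata without spawning a process
        print(full, size)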
Use glob:
>>> from glob import glob
>>> glob('*')
The syntax is the same.
e.g.
glob('*.txt') # the same as "ls *.txt"
