Most efficient way to traverse a file structure in Python

Is using os.walk as shown below the least time-consuming way to recursively search through a folder and return all the files that end with .tnt?
import os

for root, dirs, files in os.walk('C:\\data'):
    print "Now in root %s" % root
    for f in files:
        if f.endswith('.tnt'):
            print os.path.join(root, f)  # handle the matching file here

Yes, using os.walk is indeed the best way to do that.

As everyone has said, os.walk is almost certainly the best way to do it.
If you actually have a performance problem, and profiling has shown that it's caused by os.walk (and/or iterating the results with .endswith), your best answer is probably to step outside Python. Replace all of the code above with:
for f in sys.argv[1:]:
Now you need some outside tool that can gather the paths and run your script. (Ideally batching as many paths as possible into each script execution.)
If you can rely on Windows Desktop Search having indexed the drive, it should only need to do a quick database operation to find all files under a certain path with a certain extension. I have no idea how to write a batch file that runs that query and gets the results as a list of arguments to pass to a Python script (or a PowerShell file that runs the query and passes the results to IronPython without serializing it into a list of arguments), but it would be worth researching this before anything else.
If you can't rely on your platform's desktop search index, on any POSIX platform, it would almost certainly be fastest and simplest to use this one-liner shell script:
find /my/path -name '*.tnt' -exec myscript.py {} +
Unfortunately, you're not on a POSIX platform, you're on Windows, which doesn't come with the find tool, which is the thing that's doing all the heavy lifting here.
There are ports of find to native Windows, but you'll have to figure out the command-line intricacies to get everything quoted right and format the path and so on, so you can write the one-liner batch file. Alternatively, you could install cygwin and use the exact same shell script you'd use on a POSIX system. Or you could find a more Windows-y tool that does what you need.
This could conceivably be slower rather than faster—Windows isn't designed to execute lots of little processes with as little overhead as possible, and I believe it has smaller limits on command lines than platforms like linux or OS X, so you may spend more time waiting for the interpreter to start and exit than you save. You'd have to test to see. In fact, you probably want to test both native and cygwin versions (with both native and cygwin Python, in the latter case).
You don't actually have to move the find invocation into a batch/shell script; it's probably the simplest answer, but there are others, such as using subprocess to call find from within Python. This might solve performance problems caused by starting the interpreter too many times.
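For example, a minimal sketch of that subprocess variant, assuming a POSIX-style find (a native port or cygwin's) is on your PATH:

import subprocess

# One find invocation gathers every match; Python never restarts.
out = subprocess.check_output(['find', '/my/path', '-name', '*.tnt'])
for path in out.splitlines():
    pass  # process each matching path here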
Getting the right amount of parallelism may also help—spin off each invocation of your script to the background and don't wait for them to finish. (I believe on Windows, the shell isn't involved in this; instead there's a tool named something like "run" that kicks off a process detached from the shell. But I don't remember the details.)
If none of this works out, you may have to write a custom C extension that does the fastest possible Win32 or .NET thing (which also means you have to do the research to find out what that is…) so you can call that from within Python.

Related

How do I speed up repeated calls to a ruby program (github's linguist) from python?

I'm using github's linguist to identify unknown source code files. Running this from the command line after a gem install github-linguist is insanely slow. I'm using python's subprocess module to make a command-line call on a stock Ubuntu 14 installation.
Running against an empty file: linguist __init__.py takes about 2 seconds (similar results for other files). I assume this is completely from the startup time of Ruby. As @MartinKonecny points out, it seems that it is the linguist program itself.
Is there some way to speed this process up -- or a way to bundle the calls together?
One possibility is to just adapt the linguist program (https://github.com/github/linguist/blob/master/bin/linguist) to take multiple paths on the command-line. It requires mucking with a bit of Ruby, sure, but it would make it possible to pass multiple files without the startup overhead of Linguist each time.
A script this simple could suffice:
require 'linguist/file_blob'

ARGV.each do |path|
  blob = Linguist::FileBlob.new(path, Dir.pwd)
  # print out blob.name, blob.language, blob.sloc, etc.
end
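From the Python side you could then batch many paths into a single call. A sketch, assuming the Ruby snippet above is saved as batch_linguist.rb (a hypothetical name):

import subprocess

# Pass many files per invocation to amortize Ruby's startup cost.
files = ['__init__.py', 'foo.rb', 'bar.c']
print(subprocess.check_output(['ruby', 'batch_linguist.rb'] + files))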

Optimal way of matlab to python communication

So I am working on a Matlab application that has to do some communication with a Python script. The script that is called is a simple client. As a side note, if it were possible to have a Matlab client and a Python server communicating, that would solve this issue completely, but I haven't found a way to make that work.
Anyhow, after searching the web I have found two ways to call Python scripts: either via the system() command or by editing the perl.m file to call Python scripts instead. Both ways are too slow, though (tic/toc puts them at > 20 ms, and the call must run faster than 6 ms), as this call sits in a loop that is very time sensitive.
As a workaround, I figured I could instead save a file at a certain location and have my Python script continuously check for this file, executing the command when it appears. After timing each of these steps and summing them up, I found it to be dramatically faster (almost 100x, so certainly fast enough), and I can't really believe that, or rather I can't understand why calling Python scripts is so slow (not that I have more than a superficial knowledge of the subject). I also find this solution really messy and ugly, so I wanted to check: first, is it a good idea, and second, is there a better one?
Finally, I realize that Python's time.time() and Matlab's tic, toc might not be precise enough to measure time correctly on that scale, which is another reason I'm asking.
Spinning up new instances of the Python interpreter takes a while. If you spin up the interpreter once, and reuse it, this cost is paid only once, rather than for every run.
This is normal (expected) behaviour, since startup includes large numbers of allocations and imports. For example, on my machine, the startup time is:
$ time python -c 'import sys'
real 0m0.034s
user 0m0.022s
sys 0m0.011s
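One way to pay that startup cost only once is a long-running worker process that Matlab starts a single time and then feeds requests over stdin. A minimal sketch, assuming a line-per-request protocol (handle_request is a hypothetical stand-in for your client logic):

import sys

def handle_request(line):
    # Hypothetical stand-in for the real client logic.
    return line.upper()

for line in sys.stdin:
    sys.stdout.write(handle_request(line))
    sys.stdout.flush()  # flush so Matlab sees each reply immediately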

Start Another Program From Python >Separately<

I'm trying to run an external, separate program from Python. It wouldn't be a problem normally, but the program is a game, and has a Python interpreter built into it. When I use subprocess.Popen, it starts the separate program, but does so under the original program's Python instance, so that they share the first Python console. I can end the first program fine, but I would rather have separate consoles (mainly because I have the console start off hidden, but it gets shown when I start the program from Python with subprocess.Popen).
I would like it if I could start the second program wholly on its own, as though I just 'double-clicked on it'. Also, os.system won't work because I'm aiming for cross-platform compatibility, and that's only available on Windows.
I would like it if I could start the second program wholly on its own, as though I just 'double-clicked on it'.
As of 2.7 and 3.3, Python doesn't have a cross-platform way to do this. A new shutil.open method may be added in the future (possibly not under that name); see http://bugs.python.org/issue3177 for details. But until then, you'll have to write your own code for each platform you care about.
Fortunately, what you're trying to do is simpler and less general than what shutil.open is ultimately hoped to provide, which means it's not that hard to code:
On OS X, there's a command called open that does exactly what you want: "The open command opens a file (or a directory or URL), just as if you had double-clicked the file's icon." So, you can just popen open /Applications/MyGame.app.
On Windows, the equivalent command is start, but unfortunately, that's part of the cmd.exe shell rather than a standalone program. Fortunately, Python comes with a function os.startfile that does the same thing, so just os.startfile(r'C:\Program Files\MyGame\MyGame.exe').
On FreeDesktop-compatible *nix systems (which includes most modern linux distros, etc.), there's a very similar command called xdg-open: "xdg-open opens a file or URL in the user's preferred application." Again, just popen xdg-open /usr/local/bin/mygame.
If you expect to run on other platforms, you'll need to do a bit of research to find the best equivalent. Otherwise, for anything besides Mac and Windows, I'd just try to popen xdg-open, and throw an error if that fails.
See http://pastebin.com/XVp46f7X for an (untested) example.
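The core of it looks something like this (a sketch along the same lines, not a definitive implementation):

import os
import subprocess
import sys

def launch(path):
    # Open `path` as if it had been double-clicked; see the
    # caveats below about what this can and can't launch.
    if sys.platform == 'darwin':
        subprocess.Popen(['open', path])
    elif sys.platform == 'win32':
        os.startfile(path)
    else:
        subprocess.Popen(['xdg-open', path])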
Note that this will only work to run something that actually can be double-clicked to launch in Finder/Explorer/Nautilus/etc. For example, if you try to launch './script.py', depending on your settings, it may just fire up a text editor with your script in it.
Also, on OS X, you want to run the .app bundle, not the UNIX executable inside it. (In some cases, launching a UNIX executable—whether inside an .app bundle or standalone—may work, but don't count on it.)
Also, keep in mind that launching a program this way is not the same as running it from the command line—in particular, it will inherit its environment, current directory/drive, etc. from the Windows/Launch Services/GNOME/KDE/etc. session, not from your terminal session. If you need more control over the child process, you will need to look at the documentation for open, xdg-open, and os.startfile and/or come up with a different solution.
Finally, just because open/xdg-open/os.startfile succeeds doesn't actually mean that the game started up properly. For example, if it launches and then crashes before it can even create a window, it'll still look like success to you.
You may want to look around PyPI for libraries that do what you want. http://pypi.python.org/pypi/desktop looks like a possibility.
Or you could look through the patches in issue 3177, and pick the one you like best. As far as I know, they're all pure Python, and you can easily just drop the added function in your own module instead of in os or shutil.
As a quick hack, you may be able to (ab)use webbrowser.open. "Note that on some platforms, trying to open a filename using this function, may work and start the operating system’s associated program. However, this is neither supported nor portable." In particular, IIRC, it will not work on OS X 10.5+. However, I believe that making a file: URL out of the filename actually does work on OS X and Windows, and also works on linux for most, but not all, configurations. If so, it may be good enough for a quick&dirty script. Just keep in mind that it's not documented to work, it may break for some of your users, it may break in the future, and it's explicitly considered abuse by the Python developers, so I wouldn't count on it for anything more serious. And it will have the same problems launching 'script.py' or 'Foo.app/Contents/MacOS/foo', passing env variables, etc. as the more correct method above.
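In code, the entire hack is just (unsupported, as noted above):

import webbrowser

# May launch the OS-associated program, or may not; unsupported.
webbrowser.open('file:///usr/local/bin/mygame')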
Almost everything else in your question is both irrelevant and wrong:
It wouldn't be a problem normally, but the program is a game, and has a Python interpreter built into it.
That doesn't matter. If the game were writing to stdout from C code, it would do the exact same thing.
When I use subprocess.Popen, it starts the separate program, but does so under the original program's Python instance
No it doesn't. It starts an entirely new process, whose embedded Python interpreter is an entirely new instance of Python. You can verify that by, e.g., running a different version of Python than the game embeds.
so that they share the first Python console.
No they don't. They may share the same tty/cmd window, but that's not the same thing.
I can end the first program fine, but I would rather have separate consoles (mainly because I have the console start off hidden, but it gets shown when I start the program from Python with subprocess.Popen).
You could always pipe the child's stdout and stderr to, e.g., a logfile, which you could then view separately from the parent process's output, if you wanted to. But I think this is going off on a tangent that has nothing to do with what you actually care about.
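A minimal sketch of that redirection (the game path is a placeholder):

import subprocess

# Send the child's output to a log file you can inspect separately,
# instead of sharing the parent's console.
log = open('game.log', 'wb')
subprocess.Popen(['/path/to/game'], stdout=log, stderr=log)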
Also, os.system won't work because I'm aiming for cross-platform compatibility, and that's only available on Windows.
Wrong; os.system is available on "Unix, Windows", which is probably everywhere you care about. However, it won't work because it runs the child program in a subshell of your script, using the same tty. (And it's got lots of other problems—e.g., blocking until the child finishes.)
When I use subprocess.Popen, it starts the separate program, but does so under the original program's Python instance...
Incorrect.
... so that they share the first Python console.
This is the crux of your problem. If you want it to run in another console then you must run another console and tell it to run your program instead.
... I'm aiming for cross-platform compatibility ...
Sorry, there's no cross-platform way to do it. You'll need to run the console/terminal appropriate for the platform.

Grabbing output FILE from Python Popen process?

I have written a python program to interface with a compiled program (call it ProgramX) that has some idiosyncrasies that are proving difficult to deal with. I need to feed many thousands of input files to ProgramX via my python program. What I would like to do is to grab the output file that ProgramX creates with each run, and rename it something sensible, like inputfilename.output.
The problem comes in the output file that is written by ProgramX -- it is named via an unpredictable method, which will write, and "mercilessly overwrite", the output file if it already exists (which is the case the majority of the time). The saving grace probably comes with the fact that there is a standard prefix to the output files: think ProgramX.notQuiteRandomNumber.
The only thing I can think to do is something like this in my bash shell:
PROGRAMXOUTPUT=$(ls -ltr ProgramX* | tail -n -1 | awk '{print $8}')
mv $PROGRAMXOUTPUT input.output
Which does 90% of what I need, but before I turn all that bash into a series of Popen statements: is there a better way to do this? It feels like a problem people may have solved much more elegantly than what I'm thinking of.
Sidenote: I can grab the program's standard output without problems, however it's the output file that I need to grab.
Bonus: I was planning on running several instances of the program in the same directory, so my naive approach above may start to have unforeseen problems. So perhaps something fancy that watches the PID of ProgramX and follows its output.
To do what your shell script above does, assuming you've only got one ProgramX* in the current directory:
import glob
import os

programxoutput = glob.glob('ProgramX*')[0]  # assumes exactly one match
os.rename(programxoutput, 'input.output')
If you need to sort by time, etc., there are ways to do that too (look at os.stat), but using the most recent modification date is a recipe for nasty race conditions if you'll be running multiple copies of ProgramX concurrently.
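If you do want the most recent file anyway, a sketch (the race-condition caveat above still applies):

import glob
import os

# Pick the most recently modified match; racy with concurrent runs.
newest = max(glob.glob('ProgramX*'), key=lambda p: os.stat(p).st_mtime)
os.rename(newest, 'input.output')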
I'd suggest instead that you create and change to a new, perhaps temporary directory for each run of ProgramX, so the runs have no possibility of treading on each other. The tempfile module can help with this.
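A sketch of that per-run-directory idea (the ProgramX path and input name are placeholders):

import os
import subprocess
import tempfile

input_file = 'run1.in'  # placeholder input name
workdir = tempfile.mkdtemp(prefix='programx-')  # private scratch dir
subprocess.check_call(['/path/to/ProgramX', os.path.abspath(input_file)],
                      cwd=workdir)
output_name, = os.listdir(workdir)  # the single ProgramX.* output file
os.rename(os.path.join(workdir, output_name), input_file + '.output')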
Two options that I see:
You could use lsof to see which files ProgramX currently has open for writing.
A different approach would be to run ProgramX in a temporary directory (see the tempfile module for an easy way of setting up directories). Between runs of ProgramX you can clean that directory, or keep requesting new temp directories if you are planning on running multiple copies of ProgramX at the same time.
If there is only one ProgramX* file, then what about just:
mv ProgramX* input.output

Dangerous Python Keywords?

I am about to get a bunch of python scripts from an untrusted source.
I'd like to be sure that no part of the code can hurt my system, meaning:
(1) the code is not allowed to import ANY MODULE
(2) the code is not allowed to read or write any data, connect to the network etc
(the purpose of each script is to loop through a list, compute some data from input given to it and return the computed value)
before I execute such code, I'd like to have a script 'examine' it and make sure that there's nothing dangerous there that could hurt my system.
I thought of using the following approach: check that the word 'import' is not used (so we are guaranteed that no modules are imported)
Yet, it would still be possible for the user (if desired) to write code to read/write files etc. (say, using open).
Then here comes the question:
(1) where can I get a 'global' list of python methods (like open)?
(2) Is there some code that I could add to each script that is sent to me (at the top) that would make some 'global' methods invalid for that script (for example, any use of the keyword open would lead to an exception)?
I know that there are some solutions of python sandboxing. but please try to answer this question as I feel this is the more relevant approach for my needs.
EDIT: suppose that I make sure that no import is in the file, and that no possible hurtful methods (such as open, eval, etc) are in it. can I conclude that the file is SAFE? (can you think of any other 'dangerous' ways that built-in methods can be run?)
This point hasn't been made yet, and should be:
You are not going to be able to secure arbitrary Python code.
A VM is the way to go unless you want security issues up the wazoo.
You can still obfuscate import without using eval:
s = '__imp'
s += 'ort__'
f = globals()['__builtins__'].__dict__[s]
# ** BOOM ** (f is now the real __import__)
You'll need to check both built-in functions and keywords.
Note that you'll need to do things like look for both "file" and "open", as both can open files.
Also, as others have noted, this isn't 100% certain to stop someone determined to insert malicious code.
An approach that should work better than string matching is to use the ast module: parse the Python code, do your whitelist filtering on the tree (e.g. allow only basic operations), then compile and run the tree.
See this nice example by Andrew Dalke on manipulating ASTs.
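A minimal sketch of that approach (the whitelist here is illustrative, nowhere near complete):

import ast

user_code = "1 + 2 * 3"  # stand-in for the untrusted script

# Allow only a handful of expression nodes; reject everything else.
ALLOWED = (ast.Module, ast.Expr, ast.BinOp, ast.Constant,
           ast.Name, ast.Load, ast.Add, ast.Sub, ast.Mult, ast.Div)

tree = ast.parse(user_code)
if all(isinstance(node, ALLOWED) for node in ast.walk(tree)):
    exec(compile(tree, '<untrusted>', 'exec'))
else:
    raise ValueError('disallowed construct in untrusted code')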
Built-in functions/keywords to block:
eval
exec
__import__
open
file
input
execfile
print can be dangerous if you have one of those dumb shells that execute code on seeing certain output
stdin
__builtins__
globals() and locals() must be blocked, otherwise they can be used to bypass your rules
There's probably tons of others that I didn't think about.
Unfortunately, crap like this is possible...
object().__reduce__()[0].__globals__["__builtins__"]["eval"]("open('/tmp/l0l0l0l0l0l0l','w').write('pwnd')")
So it turns out that keywords, import restrictions, and in-scope-by-default symbols alone are not enough; you need to verify the entire object graph...
Use a Virtual Machine instead of running it on a system that you are concerned about.
Without a sandboxed environment, it is impossible to prevent a Python file from doing harm to your system aside from not running it.
It is easy to create a Cryptominer, delete/encrypt/overwrite files, run shell commands, and do general harm to your system.
If you are on Linux, you should be able to use docker to sandbox your code.
For more information, see this GitHub issue: https://github.com/raxod502/python-in-a-box/issues/2.
I did come across this on GitHub, so something like it could be used, but it has a lot of limitations.
Another approach would be to create another Python file which parses the original one, removes the bad code, and runs the file. However, that would still be hit-and-miss.
