Popularity of path hooks (PEP 302 custom import) - python

My project has the ability to run python functions remotely. Doing so requires transmitting modules a given function utilizes. Determining what to send is conducted via a modified modulefinder.
As I modify the modulefinder to support arbitrary path_hooks, I've started to get the impression that path_hooks are not all that popular. Quick google codesearching seems to only show the ZipImporter using them. I've noticed a minor project using it (and even then, its loader doesn't support the PEP 302 extension of get_code, which is needed by the modified modulefinder).
Has anyone come across or created projects that use custom path_hooks to access source code?

Yes, I've coded some path hooks (for one of the obvious purposes: access modules living in other forms of storage besides the filesystem and zipfiles), but never on an open-source project (and actually never needed to support modulefinder in them). What difficulties are you encountering? While I can't share my original code I think I can share the know-how developed with it (though offhand I can't recall any special difficulties -- it has been a while). As for "popular", I guess they will be in direct proportion to the need to site modules "elsewhere" (e.g. in some form of DB), though of course general "usermode file systems" built e.g. using fuse , macfuse and dokan may also allow this (and offer other advantages in terms of generality -- not sure how performance compares).

Related

Does it hurt performance to dynamically link a shared library that won't be used?

I have a C++ program (in fact it is a dll) which dynamically links to another shared library (python dll), this program has two occasions to be used.
In occasion A, the program will make function calls to that dynamically linked shared library while in occasion B, the program will not.
My question is if I build the program specially for occasion B without linking to the shared library, will I gain performance comparing to the case where I link the shared library without actually using it?
It's really dependent on several factors: What OS, what shared library, and what the application actually does. Possibly also how the shared library is built.
In general, it's not a particularly large penalty, since shared libraries are demand-loaded and use position independent addressing [PIC] (PC-relative, and similar). What this means is that the shared library is being loaded only when it actually is used, and that there's "no work" to load the library. This is something that OS designers and system architects think a lot about, because for many applications that are performance sensitive (for example compilers or web-services), a badly designed shared library system will make performance BAD.
It is of course possible to configure when building the shared library. At least the use of PIC aspect of this, so if the person/company configuring the build of the shared library "wants to", it could be badly configured and worse than zero effect.
To this you have to add any initialization that the shared library does. Well designed shared libraries does "on demand" or "lazy" initialization, in other words, doesn't do much initialization until it's actually required. Again, the details of exactly which library - including how that was configured when it was built - can make a huge difference here.
The only REAL way to tell, in any particular use-case, is to build "with" and "without" the extra shared library, and measure the actual performance.

Identifying "sensitive" code in your application

Looking to improve quality of a fairly large Python project. I am happy with the types of warnings PyLint gives me. However, they are just too numerous and hard to enforce across a large organization. Also I believe that some code is more critical/sensitive than others with respect to where the next bug may come. For example I would like to spend more time validating a library method that is used by 100 modules rather than a script that was last touched 2 years ago and may not be used in production. Also it would be interesting to know modules that are frequently updated.
Is anyone familiar with tools for Python or otherwise that help with this type of analysis?
You problem is similar to the one I answered over at SQA https://sqa.stackexchange.com/a/3082. This problem was associated with Java which made the tooling a bit easier, but I have a number of suggestions below.
A number of other answers suggest that there is no good runtime tools for Python. I disagree on this in several ways:
Coverage tools work very well
Based on my experience in tooling in Java, static and dynamic analysis tools in Python are weaker than in a strongly typed less dynamic language but will work more than well enough to give good heuristics for you here. Unless you use an unusually large pathological number of dynamic features (including adding and removing methods, intercepting method and property invocations, playing with import, manually modifying the namespace) - in which case any problems you have may well be associated with this dynamism...
Pylint picks up simpler problems, and will not detect problems with dynamic class/instance modifications and decorators - so it doesn't matter that the metric tools don't measure these
In any case, where you can usefully focus can be determined by much more than a dependency graph.
Heuristics for selecting code
I find that there are a number of different considerations for selecting code for improvement which work both individually and together. Remember that, at first, all you need to do is find a productive seam of work - you don't need to find the absolutely worst code before you start.
Use your judgement.
After a few cycles through the codebase, you will have a huge amount of information and be much better positioned to continue your work - if indeed more needs to be done.
That said, here are my suggestions:
High value to the business: For example any code that could cost your company a lot of money. Many of these may be obvious or widely known (because they are important), or they may be detected by running the important use cases on a system with the run-time profiler enabled. I use Coverage.
Static code metrics: There are a lot of metrics, but the ones that concern us are:
High afferent couplings. This is code that a lot of other files depends on. While I don't have a tool that directly outputs this, snakefood is a good way to dump the dependencies directly to file, one line per dependency, each being a tuple of afferent and efferent file. I hate to say it, but computing the afferent coupling value from this file is a simple exercise left to the reader.
High McCabe (cyclomatic) complexity: This is more complex code. PyMetrics seems to produce this measure although I have not used the tool.
Size: You can get a surprising amount of information by viewing the size of your project using a visualiser (eg https://superuser.com/questions/8248/how-can-i-visualize-the-file-system-usage-on-windows or https://superuser.com/questions/86194/good-program-to-visualize-file-system-usage-on-mac?lq=1. Linux has KDirStat at Filelight). Large files are a good place to start as fixing one file fixes many warnings.
Note that these tools are file-based. This is probably fine-enough resolution since you mention the project is itself has hundreds of modules (files).
Changes frequently: Code that changes frequently is highly suspect. The code may:
Historically have had many defects, and empirically may continue to do so
Be undergoing changes from feature development (high number of revisions in your VCS)
Find areas of change using a VCS visualisation tool such as those discussed later in this answer.
Uncovered code: Code not covered by tests.
If you run (or can run) your unit tests, your other automated tests and typical user tests with coverage, take a look at the packages and files with next to no coverage. There are two logical reasons why there is no coverage:
The code is needed (and important) but not tested at all (at least automatically). These areas are extremely high risk
The code may be unused and is a candidate for removal.
Ask other developers
You may be surprised at the 'smell' metrics you can gather by having a coffee with the longer-serving developers. I bet they will be very happy if someone cleans up a dirty area of the codebase where only the bravest souls will venture.
Visibility - detecting changes over time
I am assuming that your environment has a DVCS (such as Git or Mercurial) or at least a VCS (eg SVN). I hope that you are also using an issue or bug tracker of some kind. If so, there is a huge amount of information available. It's even better if developers have reliably checked in with comments and issue numbers. But how do you visualise it and use it?
While you can tackle the problem on a single desktop, it is probably a good idea to set up a Continuous Integration (CI) environment, perhaps using a tool like Jenkins. To keep the answer short, I will assume Jenkins from now on. Jenkins comes with a large number of plugins that really help with code analysis. I use:
py.test with JUnit test output picked up by the JUnit test report Jenkins plugin
Coverage with the Cobertura plugin
SLOCCount and SLOCCount plugin
Pylint and Violations plugin
Apparently there is a plugin for McCabe (cyclometric) complexity for Python, although I have not used it. It certainly looks interesting.
This gives me visibility of changes over time, and I can drill in from there. For example, suppose PyLint violations start increasing in a module - I have evidence of the increase, and I know the package or file in which this is occurring, so I can find out who's involved and go speak with them.
If you need historic data and you have just installed Jenkins, see if you can run a few manual builds that start at the beginning of the project and take a series of jumps forward in time until the present. You can choose milestone release tags (or dates) from the VCS.
Another important area, as mentioned above, is detecting the loci of changes in the code base. I have really liked Atlassian Fisheye for this. Besides being really good at searching for commit messages (eg bug id) or file contents at any point in time, it allows me to easily see metrics:
Linecount by directory and subdirectory
Committers at any point in time or in specific directories and/or files
Patterns of committal, both by time and also location in the source code
I'm afraid you are mostly on your own.
If you have decent set of tests, look at code coverage and dead code.
If you have a decent profiling setup, use that to get a glimpse of what's used more.
In the end, it seems you are more interested in fan-in/fan-out analysis, I'm not aware of any good tools for Python, primarily because static analysis is horribly unreliable against a dynamic language, and so far I didn't see any statistical analysis tools.
I reckon that this information is sort of available in JIT compilers -- whatever (function, argument types) is in cache (compiled) those are used the most. Whether or not you can get this data out of e.g. PyPy I really don't have a clue.
Source control tools can give a good indication of frequently updated modules - often indicating trouble spots.
If you don't have source control but the project is run from a shared location delete all the pycache folders or .pyc files. Over time/under use watch which files get recreated to indicate their use.
Analysing the Python imports printed when running from particular entry points with
python -v entry_point
may give some insight into which modules are being used. Although if you have known access points you should try the coverage module.
For a more intrusive solution, consider setting up project wide logging. You can log metrics easy enough, even over distributed programs.
I agree with the others, in that I have yet to come across a good runtime analysis tool for Python that will do this.
There are some ways to deal with it, but none are trivial.
The most robust, I think, would be to get the Python source and recompile the binaries with some sort of built-in runtime logging. That way you could just slide it into the existing environment without any code changes to your project.
Of course, that isn't exactly trivial to do, but it has the bonus that you might some day be able to get that merged back into the trunk for future generations and what not.
For non-recompile approaches, the first place I would look is the profile library's deterministic profiling section.
How you implement it will be heavily dependent on how your environment is set up. Do you have many individual scripts and projects run independently of one another, or just the one main script or module or package that gets used by everybody else, and you just want to know what parts of it can be trimmed out to make maintenance easier?
Is it a load once, run forever kind of set up, or a situation where you just run scripts atomically on some sort of schedule?
You could implement project-wide logging (as mentioned in #Hardbyte's answer), but that would require going through the project and adding the logging lines to all of your code. If you do that, you may as well just do it using the built-in profiler, I think.
Have a look at sys.setprofile: it allows you to install a profiler function.
Its usage is detailed in http://docs.python.org/library/profile.html#profile, for a jumpstart go here.
If you can not profile your application you will be bound to the cooverage approach.
Another thing you might have a look at is decorators, you can write a debugging decorator, and apply it to set of functions you suspect. Take alook here to see how to apply the decorator to an entire module.
You might also take a look at python call graph, while it will not generate quite what you want it shows you how often one function calls another:
If your code runs on user input this will be hard, since you would have to simulate 'typical' usage.
There is not more to tell you, just remember profiling as keyword.
Pylint sometimes gives warnings that (after careful consideration) are not justified. In which case it is useful to make use of the special #pylint: disable=X0123 comments (where X0123 is the actual error/warning message number) if the code cannot be refactored to not trigger the warning.
I'd like to second Hardbyte's mention of using your source control logs to see which files are most often changed.
If you are working on a system that has find, grep and sort installed, the following is a way to check which file imports what;
find . -name '*.py' -exec grep -EH '^import|^from .* import' {} \+| sort |less
To find the most popular imports across all files;
find . -name '*.py' -exec grep -Eh '^import|^from .* import' {} \+ | sort | less
These two commands should help you find the most-used modules from your project.

recommendations for report generator (Python or web service)

I maintain several web applications and I'd like to add some "nice" reporting/analytics pages. Building that once is simple enough (e.g. using flot or similar plotting libraries) but somehow it seems like there should be a report generation library out there which "just" generates the necessary graphs without much coding + offer some filtering ability.
There are some tools out there but for some reason there was never a good fit:
must work on Linux
open source preferred though closed source works as well as long the pricing model is also suitable for small installs
Python API required (or external services using standard web protocols)
I realize that this is not exactly a unique question but I couldn't find other stackoverflow questions with the same scope. Any pointers appreciated.
Update (2012-08-09, 15:10 UTC): I realized I did not state some more requirements/wishes:
web interface to access reports
access control: Each user can only get reports on his own data (simple to do with a library, might be hard with an external server)
filtering: I need interactive filtering of values based on some parameters (e.g. "only events in this time frame", "only in place X").
Windward* is one software company that offers a solution that seems to meet most of your needs. They offer a Python API through either Jython or a RESTful API (their Java Engine and Javelin, respectively), and their main strength is that template design is done in Microsoft Office, so reports can be very flexible and are easy to put together (most people already know how to use Word, so there's also much less learning curve than other solutions out there). You can add dynamic filters that take parameters at runtime or change on-the-fly, you can output to a variety of formats including HTML and PDF, and it works with pretty much every major datasource. For a web interface, you can either build your own and easily integrate reporting into it (Engine) or buy one pre-built and modify it to your specifications (Javelin).
On the downside, they are closed-source and without knowing more about your setup, it would be difficult for me to say whether their pricing would work. Might be worth a look, though--the links above and their documentation wiki are probably good places to start looking to see if you're a fit.
*Disclaimer: I work for Windward. I do believe they are one of the better reporting packages out there, but there are others that may fit your needs too.

Python modding - prevent dangerous scripts to be imported?

I want to allow users to make their own Python "mods" for my game, by placing their scripts in a special folder which the game "scans" for Python modules and imports.
What would be the simplest way to prevent "dangerous" scripts from being imported? I don't want people complaining to me that they used someone's mod and it erased their hard drive.
Things I would like to limit is accessing/modifying/creating any files outside of their folder and connecting to the internet/downloading/sending data. If you can thik of anything else, let me know.
So how can this be done?
Restricted Python seems to able to restrict functionality for code in a clean way and is compatible with python up to 2.7.
http://pypi.python.org/pypi/RestrictedPython/
e.g.
By supplying a different __builtins__ dictionary, we can rule out unsafe operations, such as opening files [...]
The obvious way to do it is to load the module as a string and exec it. This has just as many security risks, but might be easier to block by using custom globals and locals. Have a look at this question - it gives some really good guidance on this. As pointed out in Delnan's comments, this isn't completely secure though.
You could also try this. I haven't used it, but it seems to provide a safe environment for unsafe scripts.
There are some serious shortcomings for sandboxed python execution. aquavitae's answer links to some good discussion on the matter, especially this blog post. Read that first.
There is a kernel of secure execution within cPython. The fundamental idea is to replace the __builtins__ global (Note: not the __builtin__ module), which informs python to turn on some security features; making a handful of attributes on certain objects inaccessible, and removing most of the implementation objects from the interpreter when evaulating that bit of code.
You'll then need to write an actual implementation; in such a way that the protected modules are not the leaked into the sandbox. A fairly tested "file" replacement is provided in the linked blog. Getting a look on that might give you an idea of how involved and complex this problem is.
So now that you have understood that this is a challenge in python; you should take a look at languages with sandbox execution as a core feature, such as Lua, which is very popular in games.
Giving them python execution and trying to limit what they do is asking for trouble. See this SO question for discussion and a pointer to a good article. (You would presumably disable "eval", but it wouldn't make much difference in practice.
My suggestion: Turn the question around. Your goal is to provide them with scripting facilities so they can enhance the game. Find or define an interpreter for a suitable scripting language that has the features you need, and use it to execute their scripts. For example, you could support data persistence in a simple keystore model, without giving them file creation access. Or give them a command to create files but ensure it only accepts a path-less filename. The essential thing is to ensure that there is NO way for them to execute python commands directly.

How can I create a locked-down python environment?

I'm responsible for developing a large Python/Windows/Excel application used by a financial institution which has offices all round the world. Recently the regulations in one country have changed, and as a result we have been told that we need to create a "locked-down" version of our distribution.
After some frustrating conversations with my foreign counterpart, it seems that they are concerned that somebody might misuse the python interpreter on their computer to generate non-standard applications which might be used to circumvent security.
My initial suggestion was just to take away execute-rights on python.exe and pythonw.exe: Our application works as an Excel plugin which only uses the Python DLL. Those exe files are never actually used.
My counterpart remained concerned that somebody could make calls against the Python DLL - a hacker could exploit the "exec" function, for example from another programming language or virtual machine capable of calling functions in Windows DLLs, for example VBA.
Is there something we can do to prevent the DLL we want installed from being abused? At this point I ran out of ideas. I need to find a way to ensure that Python will only run our authorized programs.
Of course, there is an element of absurdity to this question: Since the computers all have Excel and Word they all have VBA which is a well-known scripting language somewhat equivalent in capability to Python.
It obviously does not make sense to worry about python when Excel's VBA is wide-open, however this is corporate politics and it's my team who are proposing to use Python, so we need to prove that our stuff can be made reasonably safe.
"reasonably safe" defined arbitrarily as "safer than Excel and VBA".
You can't win that fight. Because the fight is over the wrong thing. Anyone can use any EXE or DLL.
You need to define "locked down" differently -- in a way that you can succeed.
You need to define "locked down" as "cannot change the .PY files" or "we can audit all changes to the .PY files". If you shift the playing field to something you can control, you stand a small chance of winning.
Get the regulations -- don't trust anyone else to interpret them for you.
Be absolutely sure what the regulations require -- don't listen to someone else's interpretation.
Sadly, Python is not very safe in this way, and this is quite well known. Its dynamic nature combined with the very thin layer over standard OS functionality makes it hard to lock things down. Usually the best option is to ensure the user running the app has limited rights, but I expect that is not practical. Another is to run it in a virtual machine where potential damage is at least limited to that environment.
Failing that, someone with knowledge of Python's C API could build you a bespoke .dll that explicitly limits the scope of that particular Python interpreter, vetting imports and file writes etc., but that would require quite a thorough inspection of what functionality you need and don't need and some equally thorough testing after the fact.
One idea is to run IronPython inside an AppDomain on .NET. See David W.'s comment here.
Also, there is a module you can import to restrict the interpreter from writing files. It's called safelite.
You can use that, and point out that even the inventor of Python was unable to break the security of that module! (More than twice.)
I know that writing files is not the only security you're worried about. But what you are being asked to do is stupid, so hopefully the person asking you will see "safe Python" and be satisfied.
Follow the geordi way.
Each suspect program is run in its own jailed account. Monitor system calls from the process (I know how to do this in Windows, but it is beyond the scope of this question) and kill the process if needed.
I agree that you should examine the regulations to see what they actually require. If you really do need a "locked down" Python DLL, here's a way that might do it. It will probably take a while and won't be easy to get right. BTW, since this is theoretical, I'm waving my hands over work that varies from massive to trivial :-).
The idea is to modify Python.DLL so it only imports .py modules that have been signed by your private key. It verifies this by using the public key and a code signature stashed in a variable you add to each .py that you trust via a coding signing tool.
Thus you have:
Build a private build from the Python source.
In your build, change the import statement to check for a code signature on every module when imported.
Create a code signing tool that signs all the modules you use and assert are safe (aka trusted code). Something like __code_signature__ = "ABD03402340"; but cryptographically secure in each module's __init__.py file.
Create a private/public key pair for signing the code and guard the private key.
You probably don't want to sign any of the modules with exec() or eval() type capabilities.
.NET's code signing model might be helpful here if you can use IronPython.
I'm not sure its a great idea to pursue, but in the end you would be able to assert that your build of the Python.DLL would only run the code that you have signed with your private key.

Categories

Resources