I'm trying to figure out the best way to reliably discover at runtime the location on the file system of the py file for a given module. I need to do this because I plan to externalize some configuration data about some methods (in this case, schemas to be used for validation of responses from service calls for which interfaces are defined in a module) for cleanliness and ease of maintenance.
A simplified illustration of the system:
package
|
|-service.py
|
|-call1.scm
|
|-call2.scm
service.py (_call() is a method on the base class, though that's irrelevant to the question)
class FooServ(AbstractService):
    def call1(self, *args):
        result = self._call('/relative/uri/call1', *args)
        # additional call-specific processing
        return result
    def call2(self, *args):
        result = self._call('/relative/uri/call2', *args)
        # additional call-specific processing
        return result
call1.scm and call2.scm define the response schemas (in the current case, using the draft JSON schema format, though again, irrelevant to the question)
In another place in the codebase, when the service calls are actually made, I want to be able to detect the location of service.py so that I can traverse the file structure and find the scm files. At least on my system, I think that this will work:
import os
import sys

# I realize this is contrived here, but in my code, the method is accessed this way
method = FooServ().call1
module_path = sys.modules[method.__self__.__class__.__module__].__file__
schema_path = os.path.join(os.path.dirname(module_path), method.__name__ + '.scm')
However, I wanted to make sure this would be safe on all platforms and installation configurations, and while doing research I came across this, which made me concerned that doing it this way will not work reliably. Will this work universally, or will the fact that __file__ on a module object can return the location of the pyc file, which could be somewhere other than alongside the py file, make this solution ineffective? If it will make it ineffective, what, if anything, can I do instead?
In the PEP you link, it says:
In Python 3, when you import a module, its __file__ attribute points to its source py file (in Python 2, it points to the pyc file).
So in Python 3 you're fine because __file__ will always point to the .py file. In Python 2 it might point to the .pyc file, but that .pyc will only ever be in the same directory as the .py file for Python 2.
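If you want to be defensive about the Python 2 case anyway, a minimal sketch (the helper name is just illustrative) can normalize a .pyc/.pyo path back to the source file before taking the directory:
import os
import sys

def module_source_dir(obj):
    # Find the module that defines obj's class, then its file path.
    module = sys.modules[type(obj).__module__]
    path = module.__file__
    # On Python 2 __file__ may name the .pyc/.pyo; drop the trailing
    # character to recover the .py sitting alongside it.
    if path.endswith(('.pyc', '.pyo')):
        path = path[:-1]
    return os.path.dirname(os.path.abspath(path))

schema_dir = module_source_dir(FooServ())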
Okay, I think you're referring to this bit:
Because these distributions cannot share pyc files, elaborate mechanisms have been developed to put the resulting pyc files in non-shared locations while the source code is still shared. Examples include the symlink-based Debian regimes python-support [8] and python-central [9]. These approaches make for much more complicated, fragile, inscrutable, and fragmented policies for delivering Python applications to a wide range of users.
I believe those mechanisms are applied only to Python modules that are packaged by the distribution. I don't think they should affect modules installed manually outside of the distribution's packaging system. It would be the responsibility of whoever was packaging your module for that distribution to make sure that the module isn't broken by the mechanism.
Context:
I currently have a program which loads a set of plugins from their file paths (on a mapped network drive), using the method shown in another SO thread. These plugins are designed for rolling release, which means I need constant write access to them. The current mechanism locks the files, so I have to ask everyone to close the software whenever I need to update them.
The question:
I was wondering if there was a way to, possibly using a similar method to that linked above, import a file from an io.BytesIO object of the plugin's raw contents (hence unlocking the file for me to make changes as I please).
More generally:
Can I keep the raw module contents in memory without touching the physical disk? If that is not possible, is there at least a way to fully load these modules into memory so I can then unlock the files being imported?
As I have stated in my comment, I understand you can mount a virtual filesystem on a Linux-based OS (which could have solved my problem), though sadly I am developing for Windows, and Microsoft never makes your life easy! :-)
Note:
I am not asking about copying the files somewhere local (e.g. temp, cache, etc.) and importing from there.
I understand this is quite a specialist question, so any help is much appreciated.
While it does not import from an io.BytesIO object as I originally asked, I was able to import a module from its raw source after finding this incredibly helpful article. I have not copied the code here as it is quite large, but I was able to get it to successfully import the virtual module.
The following code is what I ended up with after modifying the article's loader to remove the common prefix. It creates the module by first executing the source, collecting the globals it defines, and finally using Python's built-in type to build a class that carries them.
It is not particularly pretty and definitely breaks some Python style recommendations, so I am definitely open to improvements!
source = """def hello():
print("I don't want to say hi to the world")"""
name = "my_module"
glo = {}
exec(source, glo)
injector = DependencyInjector()
injector.provide(name, type(name, (), glo))
injector.install()
foo = __import__(name)
foo.hello()
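For comparison, a leaner sketch that gets the same in-memory import with only the standard library, no injector required (the module name and source are the same placeholders):
import sys
import types

source = '''
def hello():
    print("I don't want to say hi to the world")
'''
name = "my_module"

# Create an empty module object and execute the source inside its namespace.
mod = types.ModuleType(name)
exec(source, mod.__dict__)

# Registering it makes a normal `import my_module` find it from now on.
sys.modules[name] = mod

import my_module
my_module.hello()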
The full explanation of what I want to do and why would take a while. Basically, I want to use a private SSL connection in a publicly distributed application without handing out my private SSL keys, because that would negate the purpose! I.e. I want secure remote database operations which no one can see into, inclusive of the client.
My core question is: how could I make the Python ssl module use PEM data held in memory instead of hard file system paths to the files?
The constructor for class SSLSocket calls load_verify_locations(ca_certs) and load_cert_chain(certfile, keyfile), which I can't trace into because they live in .pyd files. Inside those black boxes, I presume the files are read into memory. How might I short-circuit the process and pass the data directly? (Perhaps by swapping out the .pyd?)
Other thoughts I had were: I could use io.StringIO to create a virtual file, and then pass the file descriptor around. I've used that concept with classes that will take a descriptor rather than a path. Unfortunately, these classes aren't designed that way.
Or, maybe use a virtual file system / ram drive? That could be trouble though because I need this to be cross platform. Plus, that would probably negate what I'm trying to do if someone could access those paths from any external program...
I suppose I could keep them as real files, but "hide" them somewhere in the file system.
I can't be the first person to have this issue.
UPDATE
I found the source for the "black boxes"...
https://github.com/python/cpython/blob/master/Modules/_ssl.c
They work as expected. They just read the file contents from the paths, but you have to dig down into the C layer to get to this.
I can write C, but I've never tried to recompile the underlying Python source. It looks like I should follow the directions at https://devguide.python.org/ to pull the Python repo and make my changes there. I guess I can then submit my update to the Python community to see if they want to make the feature I'm describing standard... Lots of work ahead, it seems...
It took some effort, but I did, in fact, solve this in the manner I suggested. I revised the underlying code in the _ssl.c Python module/extension and rebuilt Python as a whole. After figuring out the process for building Python from source, I had to learn the details of how to pass variables between Python and C, and I needed to dig into the guts of OpenSSL (over which the Python module is a wrapper).
Fortunately, OpenSSL already has functions for this exact purpose, so it was just a matter of swapping out how Python passes file paths down into the C layer, bypassing the file-reading step and jumping straight to using the ca/cert/key data directly.
For the moment, I only did this for Windows. Since I'm ultimately creating a cross platform program, I'll have to repeat the build process for the other platforms I'll support - so that's a hassle. Consider how badly you want this, if you are going to pursue it yourself...
Note that when I rebuilt Python, I didn't use that as my actual Python installation. I just kept it off to the side.
One thing that was really nice about this process was that after that rebuild, all I needed to do was drop the single new _ssl.pyd into my working directory. With that file in place, I could pass my direct cert data. If I removed it, I could pass the normal file paths instead. It will use either the normal Python source, or implicitly use the override if the .pyd file is simply put in the program's directory.
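One footnote worth adding: for the CA-certificate side specifically, newer Python 3 versions (3.4+) need no rebuild at all, because load_verify_locations() accepts in-memory data through its cadata keyword. A minimal sketch (the certificate text is a placeholder):
import ssl

ca_pem = """-----BEGIN CERTIFICATE-----
(placeholder CA certificate data)
-----END CERTIFICATE-----"""

context = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
context.load_verify_locations(cadata=ca_pem)
# load_cert_chain() still requires real file paths, which is why the
# client cert/key side above needed the patched _ssl extension.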
I am using Emacs Org-mode Babel source code blocks to write and use some small functions. Now I want to do a little bit more. Say that after a while I find a function I wrote in Org-babel is valuable for reuse, and I want to put it into my personal Python package, e.g. my_tools.
So org-babel provides extraction of the source code; let's say I have it extracted into a file called examples.py, which contains func1 and func2. I want to add these functions to a Python file/module called my_functions.py. Is there a Python package or best practice for doing such a thing, so that the source code of func1 and func2 is inserted into the module?
To me this is something I have been trying to do for a while: usually, when working with Python, we write code for one-time usage; later on we may find some code/functions reused again and again, and so we want to save them to a package that can be easily installed and shared with others.
We could even add tags to the code so that, when extracting and inserting it into the package module, the tool knows where to insert based on the tag information. I am a little fuzzy on whether there is already a PyPI package for this scenario, or how I should architect such a package if I build one myself. I am not that experienced and would like to hear opinions on this.
This should be doable using "tangling" of source code into files and noweb syntax to gather up individual pieces into a larger whole. The following is meant as an illustration of the method:
* Individual code blocks
#+name: foo
#+BEGIN_SRC elisp
(princ "Hello")
#+END_SRC
and another one:
#+name: bar
#+BEGIN_SRC elisp
(princ "Goodbye")
#+END_SRC
* Combine them together
#+BEGIN_SRC elisp :tangle ./tangled/foo :noweb yes
(message "Package stuff")
<<foo>>
<<bar>>
#+END_SRC
Using C-c C-v C-t to tangle gets you a file named foo in the ./tangled subdirectory (which has to exist already), whose contents are:
(message "Package stuff")
(princ "Hello")
(princ "Goodbye")
The pythonization of this should be straightforward, but the more advanced aspects of what you describe (using tags to select functions e.g.) are certainly not addressed by this (and I'm not sure how to do them off the top of my head).
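For instance, a pythonized version of the same tangling setup might look like this (block names and the output path are only illustrative):
#+name: func1
#+BEGIN_SRC python
def func1():
    return "func1"
#+END_SRC

#+name: func2
#+BEGIN_SRC python
def func2():
    return "func2"
#+END_SRC

#+BEGIN_SRC python :tangle ./tangled/my_functions.py :noweb yes
"""Utility functions collected from the blocks above."""
<<func1>>

<<func2>>
#+END_SRC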
I'm a big fan of keeping things simple. If I understand your requirements
correctly, your primary interest is in generating python source files and
modules rather than executing python code and having the results used in or
copied back into the org file.
If this is the case, I think your best approach is to just have an org file
which represents your /tools/ module. When you find a function etc which you
keep using in different files/projects and which should go into your tools
module, add that function code block into the org file representing your tools
module (along with appropriate docs etc). Then, update your other org files
which represent the different code blocks of your program to load that module
and reference that function.
In the org file which represents your tool module, you could use some of Org's
functionality to execute the code to incorporate tests etc. This way, you can
load your org file and have it execute the tests to verify that all the utility
functions in your module are working.
In your other projects, just write your source blocks to source the functions
from your utility module. Don't worry about using org to try and do fancy
referencing or the like. Keep it simple. You can use org links to reference back
to your org file representing your toolbox modules to get documentation
references.
If on the other hand, you want to do something like a python lab book system,
where you run python code from within the org file and get results back which
you use for documentation or as input for other blocks, then you need to use
some of the advanced noweb features to handle more complex block references and
pass around arguments etc. You may also find the library of babel
useful.
14.6 Library of Babel
=====================
The "Library of Babel" is a collection of code blocks. Like a function
library, these code blocks can be called from other Org files. This
collection is in a repository file in Org mode format in the `doc'
directory of Org mode installation. For remote code block evaluation
syntax, *note Evaluating code blocks::.
For any user to add code to the library, first save the code in
regular `src' code blocks of an Org file, and then load the Org file
with `org-babel-lob-ingest', which is bound to `C-c C-v i'.
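Once ingested, a block can then be invoked by name from any other Org file with a #+CALL: line, e.g. (reusing the foo block from the earlier example):
#+CALL: foo()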
The reason I want to do this is that I want to use the tool pyobfuscate to obfuscate my Python code. But pyobfuscate can only obfuscate one file.
I've answered your direct question separately, but let me offer a different solution to what I suspect you're actually trying to do:
Instead of shipping obfuscated source, just ship bytecode files. These are the .pyc files that get created, cached, and used automatically, but you can also create them manually by just using the compileall module in the standard library.
A .pyc file with its .py file missing can be imported just fine. It's not human-readable as-is. It can of course be decompiled into Python source, but the result is… basically the same as what you'd get from running an obfuscator on the original source. So it's slightly better than what you're trying to do, and a whole lot easier.
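As a sketch of that workflow (the src/ path is a placeholder):
import compileall

# legacy=True writes each .pyc next to its source instead of into
# __pycache__, so the bytecode still imports after you delete the .py
# files from the copy you ship.
compileall.compile_dir('src/', legacy=True)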
You can't compile your top-level script this way, but that's easy to work around. Just write a one-liner wrapper script that does nothing but import the real top-level script. If you have if __name__ == '__main__': code in there, you'll also need to move that to a function, and the wrapper becomes a two-liner that imports the module and calls the function… but that's as hard as it gets. Alternatively, you could run pyobfuscate on just the top-level script, but really, there's no reason to do that.
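A sketch of such a wrapper (module and function names are placeholders):
# wrapper.py -- shipped as plain source; everything real lives in the
# compiled modules.
from realmain import main

if __name__ == '__main__':
    main()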
In fact, many of the packager tools can optionally do all of this work for you automatically, except for writing the trivial top-level wrapper. For example, a default py2app build will stick compiled versions of your own modules, along with stdlib and site-packages modules you depend on, into a pythonXY.zip file in the app bundle, and set up the embedded interpreter to use that zipfile as its stdlib.
There are definitely ways to turn a tree of modules into a single module, but it's not going to be trivial. The simplest thing I can think of is this:
First, you need a list of modules. This is easy to gather with the find command or a simple Python script that does an os.walk.
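For example, a minimal walk (the package root name is a placeholder):
import os

# Collect the path of every .py file under the package root.
modules = []
for dirpath, dirnames, filenames in os.walk('mypackage'):
    for filename in filenames:
        if filename.endswith('.py'):
            modules.append(os.path.join(dirpath, filename))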
Then you need to use grep or Python re to get all of the import statements in each file, and use that to topologically sort the modules. If you only do absolute flat import foo statements at the top level, this is a trivial regex. If you also do absolute package imports, or from foo import bar (or from foo import *), or import at other levels, it's not much trickier. Relative package imports are a bit harder, but not that big of a deal. Of course if you do any dynamic importing, use the imp module, install import hooks, etc., you're out of luck here, but hopefully you don't.
Next you need to replace the actual import statements. With the same assumptions as above, this can be done with a simple sed or re.sub, replacing something like import\s+(\w+) with \1 = sys.modules['\1'].
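For instance, a sketch of that rewrite with re.sub (module_source is a placeholder holding one module's text), handling only flat, top-level import foo statements:
import re

rewritten = re.sub(r'^import\s+(\w+)\s*$',
                   r"\1 = sys.modules['\1']",
                   module_source,
                   flags=re.MULTILINE)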
Now, for the hard part: you need to transform each module into something that creates an equivalent module object dynamically. I think what you want to do is escape the entire module's code so that it can be put into a triple-quoted string, then do this:
import sys
import types

# Run the module's source in a scratch namespace.
mod_globals = {}
exec('''
# escaped version of original module source goes here
''', mod_globals)

# Build a real module object from those globals and register it, so the
# rewritten sys.modules lookups elsewhere can find it.
mod = types.ModuleType(module_name)
mod.__dict__.update(mod_globals)
sys.modules[module_name] = mod
Now just concatenate all of those transformed modules together. The result will be almost equivalent to your original code, except that it's doing the equivalent of import foo; del foo for all of your modules (in dependency order) right at the start, so the startup time could be a little slower.
You can make a tool that:
Reads through your source files and puts all identifiers in a set.
Subtracts from that set all identifiers found by recursively searching standard and third-party modules (modules, classes, functions, attributes, parameters).
Also subtracts some explicitly excluded identifiers, since they may be used in getattr/setattr/exec/eval.
Replaces the remaining identifiers with gibberish.
Or you can use this tool I wrote that does exactly that.
To obfuscate multiple files, use it as follows:
For safety, backup your source code and valuable data to an off-line medium.
Put a copy of opy_config.txt in the top directory of your project.
Adapt it to your needs according to the remarks in opy_config.txt.
This file only contains plain Python and is exec’ed, so you can do anything clever in it.
Open a command window, go to the top directory of your project and run opy.py from there.
If the top directory of your project is e.g. ../work/project1 then the obfuscation result will be in ../work/project1_opy.
Further adapt opy_config.txt until you’re satisfied with the result.
Type ‘opy ?’ or ‘python opy.py ?’ (without the quotes) on the command line to display a help text.
I think you can try using the find command with the -exec option.
You can execute all the Python scripts in a directory with the following command:
find . -name "*.py" -exec python {} ';'
Hope this helps.
EDIT:
Oh, sorry, I overlooked that if you obfuscate the files separately they may not run properly, because the tool renames functions to different names in different files.
I have a Python module that is shared among several of my projects (the projects each have a different working directory). One of the functions in this shared module executes a script using os.spawn. The problem is, I'm not sure what pathname to give os.spawn, since I don't know what the current working directory will be when the function is called. How can I reference the file in a way that any caller can find it? Thanks!
So I just learned about the __file__ variable, which provides a solution to my problem. I can use __file__ to get a pathname that will be constant across all projects, and use that to reference the script I need to call, since the script will always be in the same location relative to __file__. However, I'm open to other/better methods if anyone has them.
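A minimal sketch of that approach (the scripts/myscript.py layout and function name are placeholders):
import os
import sys

# Anchor on this module's own location rather than the caller's cwd.
_HERE = os.path.dirname(os.path.abspath(__file__))

def run_helper_script(*args):
    script = os.path.join(_HERE, 'scripts', 'myscript.py')
    # Spawn the script with the same interpreter and wait for it to finish.
    return os.spawnv(os.P_WAIT, sys.executable,
                     [sys.executable, script] + list(args))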
Put it in a well known directory (/usr/lib/yourproject/ or ~/lib or something similar), or have it in a well known relative path based on the location of your source files that are using it.
The following piece of code will find the location of the calling module, which makes sense from a programmer's point of view:
## some magic to allow paths relative to the calling module
## (fragment from inside a method; assumes os and sys are imported)
if os.path.isabs(path):
    self.path = path
else:
    # sys._getframe(1) is the caller's stack frame; its __file__ global
    # tells us which module called us, so relative paths resolve there.
    frame = sys._getframe(1)
    base = os.path.dirname(frame.f_globals['__file__'])
    self.path = os.path.join(base, path)
I.e. if your project lives in /home/foo/project and you want to reference a script myscript in scripts/, you can simply pass 'scripts/myscript'. The snippet will figure out that the caller is in /home/foo/project, so the full path should be /home/foo/project/scripts/myscript.
Alternatively, you can always require the programmer to specify a full path, and check whether it exists using os.path.exists.
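A sketch of that stricter variant (the function name is just illustrative):
import os

def require_script_path(path):
    # Refuse anything that isn't an absolute path to an existing file.
    if not os.path.isabs(path):
        raise ValueError("expected an absolute path: %r" % path)
    if not os.path.exists(path):
        raise IOError("no such file: %r" % path)
    return path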
You might find the materials in this PyCon 2010 presentation on cross platform application development and distribution useful. One of the problems they solve is finding data files consistently across platforms and for installed vs development checkouts of the code.