combine source code from different files - python

I am using Emacs Org-mode Babel source code blocks to write and use some small functions. Now I want to do a little bit more. Say, after a while, I find that a function I wrote in Org-babel is valuable for reuse, and I want to put it into my personal Python package, e.g., my_tools.
Org-babel provides extraction of the source code; let's say I have the source code extracted into a file called examples.py, which contains func1 and func2. I want to add these functions to a Python file/module called my_functions.py. Is there a Python package or best practice for this, so that the source code of func1 and func2 gets inserted into the module?
This is something I have been trying to do for a while. Usually, when working with Python, we write code for one-time use; later we may find that some code/functions are reused again and again, so we want to save them to a package that can be easily installed and shared with others.
We could even add tags to the code so that, when extracting and inserting it into the package module, the tool knows where to insert based on the tag information. I am a little fuzzy on whether there is already a PyPI package for this scenario, or how I should architect such a package if I want to build one myself. I am not that experienced and would like to hear opinions on this.

This should be doable using "tangling" of source code into files and noweb syntax to gather up individual pieces into a larger whole. The following is meant as an illustration of the method:
* Individual code blocks
#+name: foo
#+BEGIN_SRC elisp
(princ "Hello")
#+END_SRC
and another one:
#+name: bar
#+BEGIN_SRC elisp
(princ "Goodbye")
#+END_SRC
* Combine them together
#+BEGIN_SRC elisp :tangle ./tangled/foo :noweb yes
(message "Package stuff")
<<foo>>
<<bar>>
#+END_SRC
Using C-c C-v C-t to tangle gets you a file named foo in the ./tangled subdirectory (which has to exist already), whose contents are:
(message "Package stuff")
(princ "Hello")
(princ "Goodbye")
The pythonization of this should be straightforward, but the more advanced aspects of what you describe (e.g. using tags to select functions) are certainly not addressed by this (and I'm not sure how to do them off the top of my head).
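A pythonized version might look something like this (a sketch only; the file and function names are taken from the question):
#+name: func1
#+BEGIN_SRC python
def func1():
    return "hello from func1"
#+END_SRC
#+name: func2
#+BEGIN_SRC python
def func2():
    return "hello from func2"
#+END_SRC
#+BEGIN_SRC python :tangle ./my_tools/my_functions.py :noweb yes
"""Utility functions collected from my org notes."""
<<func1>>
<<func2>>
#+END_SRC
Tangling with C-c C-v C-t then writes both definitions into ./my_tools/my_functions.py (again, the my_tools directory has to exist already).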

I'm a big fan of keeping things simple. If I understand your requirements
correctly, your primary interest is in generating python source files and
modules rather than executing python code and having the results used in or
copied back into the org file.
If this is the case, I think your best approach is to just have an org file
which represents your /tools/ module. When you find a function etc which you
keep using in different files/projects and which should go into your tools
module, add that function code block into the org file representing your tools
module (along with appropriate docs etc). Then, update your other org files
which represent the different code blocks of your program to load that module
and reference that function.
In the org file which represents your tool module, you could use some of Org's
functionality to execute the code to incorporate tests etc. This way, you can
load your org file and have it execute the tests to verify all the utility
functions in your module are working.
In your other projects, just write your source blocks to source the functions
from your utility module. Don't worry about using org to try and do fancy
referencing or the like. Keep it simple. You can use org links to reference back
to your org file representing your toolbox modules to get documentation
references.
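For example, a source block in one of your project org files might simply do this (a sketch, assuming the toolbox from the question ends up importable as my_tools.my_functions):
#+BEGIN_SRC python
from my_tools.my_functions import func1, func2

result = func1()
#+END_SRC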
If on the other hand, you want to do something like a python lab book system,
where you run python code from within the org file and get results back which
you use for documentation or as input for other blocks, then you need to use
some of the advanced noweb features to handle more complex block references and
pass around arguments etc. You may also find the library of babel
useful.
14.6 Library of Babel
=====================
The "Library of Babel" is a collection of code blocks. Like a function
library, these code blocks can be called from other Org files. This
collection is in a repository file in Org mode format in the `doc'
directory of the Org mode installation. For the remote code block evaluation
syntax, see "Evaluating code blocks".
For any user to add code to the library, first save the code in
regular `src' code blocks of an Org file, and then load the Org file
with `org-babel-lob-ingest', which is bound to `C-c C-v i'.
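As an illustration (the block and file names here are made up), a named block in your toolbox org file:
#+name: square
#+BEGIN_SRC python :var x=0
return x * x
#+END_SRC
After running org-babel-lob-ingest (C-c C-v i) on that file, any other Org file can call it with:
#+CALL: square(x=4)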

How to make Python ssl module use data in memory rather than pass file paths?

The full explanation of what I want to do and why would take a while to explain. Basically, I want to use a private SSL connection in a publicly distributed application, and not hand out my private SSL keys, because that negates the purpose! I.e. I want secure remote database operations which no one can see into - inclusive of the client.
My core question is : How could I make the Python ssl module use data held in memory containing the ssl pem file contents instead of hard file system paths to them?
The constructor for class SSLSocket calls load_verify_locations(ca_certs) and load_cert_chain(certfile, keyfile) which I can't trace into because they are .pyd files. In those black boxes, I presume those files are read into memory. How might I short circuit the process and pass the data directly? (perhaps swapping out the .pyd?...)
Other thoughts I had were: I could use io.StringIO to create a virtual file, and then pass the file descriptor around. I've used that concept with classes that will take a descriptor rather than a path. Unfortunately, these classes aren't designed that way.
Or, maybe use a virtual file system / ram drive? That could be trouble though because I need this to be cross platform. Plus, that would probably negate what I'm trying to do if someone could access those paths from any external program...
I suppose I could keep them as real files, but "hide" them somewhere in the file system.
I can't be the first person to have this issue.
UPDATE
I found the source for the "black boxes"...
https://github.com/python/cpython/blob/master/Modules/_ssl.c
They work as expected. They just read the file contents from the paths, but you have to dig down into the C layer to get to this.
I can write in C, but I've never tried to recompile the underlying Python source. It looks like maybe I should follow the directions here https://devguide.python.org/ to pull the Python repo and make changes to it. I guess I can then submit my update to the Python community to see if they want to make a new standardized feature like I'm describing... Lots of work ahead, it seems...
It took some effort, but I did, in fact, solve this in the manner I suggested. I revised the underlying code in the _ssl.c Python module / extension and rebuilt Python as a whole. After figuring out the process for building Python from source, I had to learn the details for how to pass variables between Python and C, and I needed to dig into guts of OpenSSL (over which the Python module is a wrapper).
Fortunately, OpenSSL already has functions for this exact purpose, so it was just a matter of swapping out how Python passes file paths into the C layer, bypassing the file-reading step and jumping straight to using the ca/cert/key data directly.
For the moment, I only did this for Windows. Since I'm ultimately creating a cross platform program, I'll have to repeat the build process for the other platforms I'll support - so that's a hassle. Consider how badly you want this, if you are going to pursue it yourself...
Note that when I rebuilt Python, I didn't use that as my actual Python installation. I just kept it off to the side.
One thing that was really nice about this process was that after that rebuild, all I needed to do was drop the single new _ssl.pyd into my working directory. With that file in place, I could pass my direct cert data. If I removed it, I could pass the normal file paths instead. It will use either the normal Python source, or implicitly use the override if the .pyd file is simply put in the program's directory.
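As a footnote for anyone who only needs the CA side from memory: the stock ssl module already accepts in-memory PEM text there via the cadata parameter of load_verify_locations; it's the cert/key pair in load_cert_chain that is file-path only, which is what the rebuild above works around. A minimal sketch:
import ssl

ca_pem = "-----BEGIN CERTIFICATE-----\n...\n-----END CERTIFICATE-----\n"  # PEM text held in memory

context = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
context.load_verify_locations(cadata=ca_pem)      # no file path involved
# context.load_cert_chain(certfile, keyfile)      # still requires real file paths in the stock module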

Is there a way to combine a python project codebase that spans across different files into one file?

The reason I want to do this is that I want to use the tool pyobfuscate to obfuscate my Python code, but pyobfuscate can only obfuscate one file.
I've answered your direct question separately, but let me offer a different solution to what I suspect you're actually trying to do:
Instead of shipping obfuscated source, just ship bytecode files. These are the .pyc files that get created, cached, and used automatically, but you can also create them manually by just using the compileall module in the standard library.
A .pyc file with its .py file missing can be imported just fine. It's not human-readable as-is. It can of course be decompiled into Python source, but the result is… basically the same result you get from running an obfuscator on the original source. So, it's slightly better than what you're trying to do, and a whole lot easier.
You can't compile your top-level script this way, but that's easy to work around. Just write a one-liner wrapper script that does nothing but import the real top-level script. If you have if __name__ == '__main__': code in there, you'll also need to move that to a function, and the wrapper becomes a two-liner that imports the module and calls the function… but that's as hard as it gets. Alternatively, you could run pyobfuscate on just the top-level script, but really, there's no reason to do that.
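A rough sketch of that workflow (all the module and file names here are made up for illustration):
import compileall

# compile every .py under myproject/, writing the .pyc files next to the
# sources (legacy layout) so they can be shipped without the .py files
compileall.compile_dir('myproject', legacy=True)
and the shipped wrapper is just:
# run.py -- the only plain-source file you distribute
import myproject.main          # hypothetical real top-level module, now shipped as a .pyc
myproject.main.main()          # the function your __main__ code was moved into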
In fact, many of the packager tools can optionally do all of this work for you automatically, except for writing the trivial top-level wrapper. For example, a default py2app build will stick compiled versions of your own modules, along with stdlib and site-packages modules you depend on, into a pythonXY.zip file in the app bundle, and set up the embedded interpreter to use that zipfile as its stdlib.
There are definitely ways to turn a tree of modules into a single module. But it's not going to be trivial. The simplest thing I can think of is this:
First, you need a list of modules. This is easy to gather with the find command or a simple Python script that does an os.walk.
Then you need to use grep or Python re to get all of the import statements in each file, and use that to topologically sort the modules. If you only do absolute flat import foo statements at the top level, this is a trivial regex. If you also do absolute package imports, or from foo import bar (or from foo import *), or import at other levels, it's not much trickier. Relative package imports are a bit harder, but not that big of a deal. Of course if you do any dynamic importing, use the imp module, install import hooks, etc., you're out of luck here, but hopefully you don't.
Next you need to replace the actual import statements. With the same assumptions as above, this can be done with a simple sed or re.sub, something like import\s+(\w+) with \1 = sys.modules['\1'].
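A sketch of those two steps (restricted to flat, top-level import foo statements, per the assumption above):
import os
import re

# gather the list of module files
module_files = []
for dirpath, dirnames, filenames in os.walk('myproject'):
    module_files.extend(os.path.join(dirpath, f)
                        for f in filenames if f.endswith('.py'))

# rewrite "import foo" lines as "foo = sys.modules['foo']"
def rewrite_imports(source):
    return re.sub(r'^import\s+(\w+)\s*$',
                  r"\1 = sys.modules['\1']",
                  source, flags=re.MULTILINE)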
Now, for the hard part: you need to transform each module into something that creates an equivalent module object dynamically. I think what you want to do is to escape the entire module code so that it can be put into a triple-quoted string, then do this:
import sys
import types

module_name = 'foo'   # whatever the original module was called
mod_globals = {}
exec('''
# escaped version of original module source goes here
''', mod_globals)
mod = types.ModuleType(module_name)
mod.__dict__.update(mod_globals)
sys.modules[module_name] = mod
Now just concatenate all of those transformed modules together. The result will be almost equivalent to your original code, except that it's doing the equivalent of import foo; del foo for all of your modules (in dependency order) right at the start, so the startup time could be a little slower.
You can make a tool that:
Reads through your source files and puts all identifiers in a set.
Subtracts all identifiers from recursively searched standard- and third party modules from that set (modules, classes, functions, attributes, parameters).
Subtracts some explicitly excluded identifiers from that set as well, since they may be used in getattr/setattr/exec/eval
Replaces the remaining identifiers with gibberish
Or you can use this tool I wrote that does exactly that.
To obfuscate multiple files, use it as follows:
For safety, backup your source code and valuable data to an off-line medium.
Put a copy of opy_config.txt in the top directory of your project.
Adapt it to your needs according to the remarks in opy_config.txt.
This file only contains plain Python and is exec’ed, so you can do anything clever in it.
Open a command window, go to the top directory of your project and run opy.py from there.
If the top directory of your project is e.g. ../work/project1 then the obfuscation result will be in ../work/project1_opy.
Further adapt opy_config.txt until you’re satisfied with the result.
Type ‘opy ?’ or ‘python opy.py ?’ (without the quotes) on the command line to display a help text.
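As a rough sketch of the first step of that approach (collecting every identifier used in a source file), the standard tokenize module is enough:
import io
import token
import tokenize

def identifiers(source):
    # return the set of all NAME tokens (identifiers and keywords) in the source
    names = set()
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == token.NAME:
            names.add(tok.string)
    return names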
I think you can try using the find command with the -exec option.
You can execute all Python scripts in a directory with the following command.
find . -name "*.py" -exec python {} ';'
Hope this helps.
EDIT:
Oh sorry, I overlooked that if you obfuscate the files separately they may not run properly, because the tool renames functions to different names in different files.

Reliable way to get path to py file of a module

I'm trying to figure out the best way to reliably discover at runtime the location on the file system of the py file for a given module. I need to do this because I plan to externalize some configuration data about some methods (in this case, schemas to be used for validation of responses from service calls for which interfaces are defined in a module) for cleanliness and ease of maintenance.
A simplified illustration of the system:
package
|
|-service.py
|
|-call1.scm
|
|-call2.scm
service.py (_call() is a method on the base class, though that's irrelevant to the question)
class FooServ(AbstractService):
    def call1(self, *args):
        result = self._call('/relative/uri/call1', *args)
        # additional call specific processing
        return result

    def call2(self, *args):
        result = self._call('/relative/uri/call2', *args)
        # additional call specific processing
        return result
call1.scm and call2.scm define the response schemas (in the current case, using the draft JSON schema format, though again, irrelevant to the question)
In another place in the codebase, when the service calls are actually made, I want to be able to detect the location of service.py so that I can traverse the file structure and find the scm files. At least on my system, I think that this will work:
import os
import sys

# I realize this is contrived here, but in my code, the method is accessed this way
method = FooServ().call1
module_path = sys.modules[method.__self__.__class__.__module__].__file__
schema_path = os.path.join(os.path.dirname(module_path), method.__name__ + '.scm')
However, I wanted to make sure this would be safe on all platforms and installation configurations, and I came across this while doing research, which made me concerned that trying to do it this way will not work reliably. Will this work universally, or is the fact that __file__ on a module object can return the location of the pyc file, which could be somewhere other than alongside the py file, going to make this solution ineffective? If it will make it ineffective, what, if anything, can I do instead?
In the PEP you link, it says:
In Python 3, when you import a module, its __file__ attribute points to its source py file (in Python 2, it points to the pyc file).
So in Python 3 you're fine because __file__ will always point to the .py file. In Python 2 it might point to the .pyc file, but that .pyc will only ever be in the same directory as the .py file for Python 2.
Okay, I think you're referring to this bit:
Because these distributions cannot share pyc files, elaborate mechanisms have been developed to put the resulting pyc files in non-shared locations while the source code is still shared. Examples include the symlink-based Debian regimes python-support [8] and python-central [9]. These approaches make for much more complicated, fragile, inscrutable, and fragmented policies for delivering Python applications to a wide range of users.
I believe those mechanisms are applied only to Python modules that are packaged by the distribution. I don't think they should affect modules installed manually outside of the distribution's packaging system. It would be the responsibility of whoever was packaging your module for that distribution to make sure that the module isn't broken by the mechanism.
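As a side note, if the .scm files ship inside the package, pkgutil.get_data sidesteps the __file__ question entirely; a sketch using the names from the example layout above:
import pkgutil

# bytes of package/call1.scm, wherever the package happens to be installed
schema_bytes = pkgutil.get_data('package', 'call1.scm')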

python modules missing in sage

I have Sage 4.7.1 installed and have run into an odd problem. Many of my older scripts that use functions like deepcopy() and uniq() no longer recognize them as global names. I have been able to fix this by importing the python modules one by one, but this is quite tedious. But when I start the command-line Sage interface, I can type "list2=deepcopy(list1)" without importing the copy module, and this works fine. How is it possible that the command line Sage can recognize global name 'deepcopy' but if I load my script that uses the same name it doesn't recognize it?
Oops, sorry, not familiar with Stack Overflow yet. I type "sage_4.7.1/sage" to start the command line interface; then, I type "load jbom.py" to load up all the functions I defined in a Python script. When I use one of the functions from the script, it runs for a few seconds (complex function) then hits a spot where I use some function that Sage normally has as a global name (deepcopy, uniq, etc.) but for some reason the script I loaded does not know what the function is. And to reiterate, my script jbom.py used to work the last time I was working on this particular research, just as I described.
It also makes no difference if I use 'load jbom.py' or 'import jbom'. Both methods get the functions I defined in my script (but I have to use jbom. in the second case) and both get the same error about 'deepcopy' not being a global name.
REPLY TO DSM: I have been sloppy about describing the problem, for which I am sorry. I have created a new script 'experiment.py' that has "import jbom" as its first line. Executing the function in experiment.py recognizes the functions in jbom.py but deepcopy is not recognized. I tried loading jbom.py as "load jbom.py" and I can use the functions just like I did months ago. So, is this all just a problem of layering scripts without proper usage of import/load etc?
SOLVED: I added "from sage.all import *" to the beginning of jbom.py and now I can load experiment.py and execute the functions calling jbom.py functions without any problems. From the Sage doc on import/load I can't really tell what I was doing wrong exactly.
Okay, here's what's going on:
You can only import files ending with .py (ignoring .py[co]). These are standard Python files and aren't preparsed, so 1/3 == int(0), not QQ(1)/QQ(3), and you don't have the equivalent of a from sage.all import * to play with.
You can load and attach both .py and .sage files (as well as .pyx and .spyx and .m). Both have access to Sage definitions but the .py files aren't preparsed (so y=17 makes y a Python int) while the .sage files are (so y=17 makes y a Sage Integer).
So import jbom here works just like it would in Python, and you don't get the access to what Sage has put in scope. load etc. are handy but they don't scale up to larger programs so well. I've proposed improving this in the past and making .sage scripts less second-class citizens, but there hasn't yet been the mix of agreement on what to do and energy to do it. In the meantime your best bet is to import from sage.all.
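In other words, the fix from the question's SOLVED note is just a one-line header in the imported module, e.g.:
# jbom.py
from sage.all import *   # brings deepcopy, uniq, etc. into scope when jbom is imported

def my_helper(data):     # hypothetical example function
    return deepcopy(data)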

Dangerous Python Keywords?

I am about to get a bunch of python scripts from an untrusted source.
I'd like to be sure that no part of the code can hurt my system, meaning:
(1) the code is not allowed to import ANY MODULE
(2) the code is not allowed to read or write any data, connect to the network etc
(the purpose of each script is to loop through a list, compute some data from input given to it and return the computed value)
Before I execute such code, I'd like to have a script 'examine' it and make sure that there's nothing dangerous there that could hurt my system.
I thought of using the following approach: check that the word 'import' is not used (so we are guaranteed that no modules are imported).
Yet, it would still be possible for the user (if desired) to write code to read/write files etc. (say, using open).
Then here comes the question:
(1) where can I get a 'global' list of python methods (like open)?
(2) Is there some code that I could add to each script that is sent to me (at the top) that would make some 'global' methods invalid for that script (for example, any use of the keyword open would lead to an exception)?
I know that there are some solutions of python sandboxing. but please try to answer this question as I feel this is the more relevant approach for my needs.
EDIT: suppose that I make sure that no import is in the file, and that no possible hurtful methods (such as open, eval, etc) are in it. can I conclude that the file is SAFE? (can you think of any other 'dangerous' ways that built-in methods can be run?)
This point hasn't been made yet, and should be:
You are not going to be able to secure arbitrary Python code.
A VM is the way to go unless you want security issues up the wazoo.
You can still obfuscate import without using eval:
s = '__imp'
s += 'ort__'
# look up the real __import__ without ever writing the word "import"
f = globals()['__builtins__'].__dict__[s]
** BOOM **
The built-in functions and the keywords are both listed in the Python documentation.
Note that you'll need to do things like look for both "file" and "open", as both can open files.
Also, as others have noted, this isn't 100% certain to stop someone determined to insert malicious code.
An approach that should work better than string matching is to use the ast module: parse the Python code, do your whitelist filtering on the tree (e.g. allow only basic operations), then compile and run the tree.
See this nice example by Andrew Dalke on manipulating ASTs.
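A minimal sketch of that idea (the node and name whitelist here is illustrative only, and as noted above it is still not a real security boundary):
import ast

FORBIDDEN_NAMES = {'open', 'eval', 'exec', 'compile', '__import__',
                   'globals', 'locals', 'getattr', 'setattr'}

def check_and_run(source):
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            raise ValueError('import statements are not allowed')
        if isinstance(node, ast.Name) and node.id in FORBIDDEN_NAMES:
            raise ValueError('forbidden name: ' + node.id)
    code = compile(tree, '<untrusted>', 'exec')
    exec(code, {'__builtins__': {}})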
Built-in functions/keywords:
eval
exec
__import__
open
file
input
execfile
print can be dangerous if you have one of those dumb shells that execute code on seeing certain output
stdin
__builtins__
globals() and locals() must be blocked otherwise they can be used to bypass your rules
There's probably tons of others that I didn't think about.
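If you do go the route of question (2) and try to disable names up front, the usual trick is to execute the code with a stripped-down __builtins__ dict, e.g. (a sketch only; as the __reduce__ example just below shows, this is not a real security boundary):
safe_builtins = {'len': len, 'range': range, 'abs': abs}   # whatever you decide to allow

with open('untrusted_script.py') as f:
    untrusted_source = f.read()

exec(compile(untrusted_source, 'untrusted_script.py', 'exec'),
     {'__builtins__': safe_builtins})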
Unfortunately, crap like this is possible...
object().__reduce__()[0].__globals__["__builtins__"]["eval"]("open('/tmp/l0l0l0l0l0l0l','w').write('pwnd')")
So it turns out that keywords, import restrictions, and the in-scope-by-default symbols alone are not enough to cover this; you need to verify the entire object graph...
Use a Virtual Machine instead of running it on a system that you are concerned about.
Without a sandboxed environment, it is impossible to prevent a Python file from doing harm to your system aside from not running it.
It is easy to create a Cryptominer, delete/encrypt/overwrite files, run shell commands, and do general harm to your system.
If you are on Linux, you should be able to use docker to sandbox your code.
For more information, see this GitHub issue: https://github.com/raxod502/python-in-a-box/issues/2.
I did come across this on GitHub, so something like it could be used, but that has a lot of limits.
Another approach would be to create another Python file which parses the original one, removes the bad code, and runs the file. However, that would still be hit-and-miss.
