Can python source code executed using PyRun_SimpleString be extracted?

Can python source code executed using PyRun_SimpleString be extracted? - python

I am developing a software based on embedded Python and C++. I want to secure some of my python code and prevent people from copying it.
For now I am using PyRun_SimpleString to execute the python code and the string is generated using my C++ code.
If I use this method, will it secure the Python code from being copied?

So, as I understand it the actual program will not exist in the executable, either in python source form, nor in 'marshalled' form (basically .pyc image) or compiled form, though I guess it will exist in some encrypted or obfuscated form which is converted into python source at run time.
This definitely makes it harder to extract the code, but an attacker who can trace the code as it runs will be able to catch calls to PyRun_SimpleString and obtain the plain source.
It's a question of degree - how hard you want to work to make the job harder for the attacker.
You might want to look into the 'frozen module' facility which is in the python source. This basically allows '.pyc' images to compiled in as byte arrays, and imported at runtime by un-marshalling them. So there's never any plain text source, but there is the .pyc image which is reasonably easy to analyze if you find it in the image. Take it one step further and obfuscate the pyc image - now attacker needs to analyze or trace past the de-obfuscation, and still won't see plain source code.

Related

How to protect Python source code, while making the file available for running?

So, I recently made a Python program that I want to send to someone with them being able to execute it, but not read the code I have typed in it. Any ideas how to do it?
BTW, I want it to be irreversible
In short, here are my Parameters:
Should remain a Python file
Can't be reversed
Code should not be readable
Should still have the ability to be run

The criteria you've posted are inconsistent.
Python is an interpreted language. The entity running the language (i.e. Python interpreter) is reading your code and executing it, line by line. If you wrap it up to send to someone, their Python interpreter must have read permissions on the file, whether it's source code or "compiled" Python (which is easily decompiled into equivalent source code).
If we take a wider interpretation of "send to someone", there may be a business solution that serves your needs. You would provide your functionality, rather than the code: deploy it as a service from some available server: your own, or rented space. To do this, you instead provide an interface to your functionality.
If this fulfills your needs, you now have your next research topic.

How to make Python ssl module use data in memory rather than pass file paths?

The full explanation of what I want to do and why would take a while to explain. Basically, I want to use a private SSL connection in a publicly distributed application, and not handout my private ssl keys, because that negates the purpose! I.e. I want secure remote database operations which no one can see into - inclusive of the client.
My core question is : How could I make the Python ssl module use data held in memory containing the ssl pem file contents instead of hard file system paths to them?
The constructor for class SSLSocket calls load_verify_locations(ca_certs) and load_cert_chain(certfile, keyfile) which I can't trace into because they are .pyd files. In those black boxes, I presume those files are read into memory. How might I short circuit the process and pass the data directly? (perhaps swapping out the .pyd?...)
Other thoughts I had were: I could use io.StringIO to create a virtual file, and then pass the file descriptor around. I've used that concept with classes that will take a descriptor rather than a path. Unfortunately, these classes aren't designed that way.
Or, maybe use a virtual file system / ram drive? That could be trouble though because I need this to be cross platform. Plus, that would probably negate what I'm trying to do if someone could access those paths from any external program...
I suppose I could keep them as real files, but "hide" them somewhere in the file system.
I can't be the first person to have this issue.
UPDATE
I found the source for the "black boxes"...
https://github.com/python/cpython/blob/master/Modules/_ssl.c
They work as expected. They just read the file contents from the paths, but you have to dig down into the C layer to get to this.
I can write in C, but I've never tried to recompile an underlying Python source. It looks like maybe I should follow the directions here https://devguide.python.org/ to pull the Python repo, and make changes to. I guess I can then submit my update to the Python community to see if they want to make a new standardized feature like I'm describing... Lots of work ahead it seems...

It took some effort, but I did, in fact, solve this in the manner I suggested. I revised the underlying code in the _ssl.c Python module / extension and rebuilt Python as a whole. After figuring out the process for building Python from source, I had to learn the details for how to pass variables between Python and C, and I needed to dig into guts of OpenSSL (over which the Python module is a wrapper).
Fortunately, OpenSSL already has functions for this exact purpose, so it was just a matter of swapping out the how Python is trying to pass file paths into the C, and instead bypassing the file reading process and jumping straight to the implementation of using the ca/cert/key data directly instead.
For the moment, I only did this for Windows. Since I'm ultimately creating a cross platform program, I'll have to repeat the build process for the other platforms I'll support - so that's a hassle. Consider how badly you want this, if you are going to pursue it yourself...
Note that when I rebuilt Python, I didn't use that as my actual Python installation. I just kept it off to the side.
One thing that was really nice about this process was that after that rebuild, all I needed to do was drop the single new _ssl.pyd into my working directory. With that file in place, I could pass my direct cert data. If I removed it, I could pass the normal file paths instead. It will use either the normal Python source, or implicitly use the override if the .pyd file is simply put in the program's directory.

How to run c code within python

How can I run c/c++ code within python in the form:
def run_c_code(code):
#Do something to run the code
code = """
Arbitrary code
"""
run_c_code(code)
It would be great if someone could provide an easy solution which does not involve installing packages. I know that C is not a scripting language but it would be great if it could do a 'mini'-compile that is able to run the code into the console. The code should run as it would compiled normally but this needs to be able to work on the fly as the rest of the code runs it and if possible, run as fast as normal and be able to create and edit variables so that python can use it. If necessary, the code can be pre-compiled into the code = """something""".
Sorry for all the requirements but if you can make the c code run in python then that would be great. Thanks in advance for all the answers..

As somebody else already pointed out, to run C/C++ code from "within" Python, you'd have to write said C/C++ code into an own file, compile it correctly, and then execute that program from your Python code.
You can't just type one command, compile it, and execute it. You always have to have the whole "framework" set up. You can't compile a program when you haven't yet written the } that ends the class/function/statement 20 lines later on. At this point you'd already have to write the whole C/C++ program for it to work. It's simply not meant to be interpreted on the run, line by line. You can do that with python, bash/dash/batch, and a few others. But C/C++ definitely isn't one of them.
With those come several issues. Firstly, the C/C++ part probably needs data from the Python part. I don't know of any way of doing it in RAM alone (maybe there is one, but I don't know), so the Python part would have to write it into a file, the C/C++ part would read and process it, then put the processed data into another file, and then the Python part would have to read that and continue.
Which brings another point up. Here we're already getting into multi-threading territory, because the moment you execute that C/C++ program you're dealing with a second thread. So, somehow, you'd have to coordinate those programs so that the Python part only continues once the C/C++ part is done. Shouldn't be a huge problem to get running, but it can be a nightmare to performance and RAM if done wrongly.
Without knowing to what extent you use that program, I also like to add that C/C++ isn't platform-independent like Python. You'll have to compile that program for every single different OS that you run it on. That may come with minor changes to the code and in general just a lot of work because you have to debug and test it for every single system.
To sum up, I think it may be better to find another solution. I don't know why you'd want to run this specific part in C/C++, but I'd recommend trying to get it done in one language. If there's absolutely no way you can get it done in Python (which I doubt, there's libraries for almost everything), you should get your Python to C/C++ instead.

If you want to run C/C++ code - you'll need either a C/C++ compiler, or a C/C++ interpreter.
The former is quite easy to arrange (though probably not suitable for an end user product) and you can just compile the code and run as required.
The latter requires that you attempt to process the code yourself and generate python code that you can then import. I'm not sure this one is worth the effort at all given that even websites that offer compilation tools wrap gcc/g++ rather than implement it in javascript.
I suspect that this is an XY problem; you may wish to take a couple of steps back and try to explain why you want to run c++ code from within a python script.

Dangerous Python Keywords?

I am about to get a bunch of python scripts from an untrusted source.
I'd like to be sure that no part of the code can hurt my system, meaning:
(1) the code is not allowed to import ANY MODULE
(2) the code is not allowed to read or write any data, connect to the network etc
(the purpose of each script is to loop through a list, compute some data from input given to it and return the computed value)
before I execute such code, I'd like to have a script 'examine' it and make sure that there's nothing dangerous there that could hurt my system.
I thought of using the following approach: check that the word 'import' is not used (so we are guaranteed that no modules are imported)
yet, it would still be possible for the user (if desired) to write code to read/write files etc (say, using open).
Then here comes the question:
(1) where can I get a 'global' list of python methods (like open)?
(2) Is there some code that I could add to each script that is sent to me (at the top) that would make some 'global' methods invalid for that script (for example, any use of the keyword open would lead to an exception)?
I know that there are some solutions of python sandboxing. but please try to answer this question as I feel this is the more relevant approach for my needs.
EDIT: suppose that I make sure that no import is in the file, and that no possible hurtful methods (such as open, eval, etc) are in it. can I conclude that the file is SAFE? (can you think of any other 'dangerous' ways that built-in methods can be run?)

This point hasn't been made yet, and should be:
You are not going to be able to secure arbitrary Python code.
A VM is the way to go unless you want security issues up the wazoo.

You can still obfuscate import without using eval:
s = '__imp'
s += 'ort__'
f = globals()['__builtins__'].__dict__[s]
** BOOM **

Built-in functions.
Keywords.
Note that you'll need to do things like look for both "file" and "open", as both can open files.
Also, as others have noted, this isn't 100% certain to stop someone determined to insert malacious code.

An approach that should work better than string matching us to use module ast, parse the python code, do your whitelist filtering on the tree (e.g. allow only basic operations), then compile and run the tree.
See this nice example by Andrew Dalke on manipulating ASTs.

built in functions/keywords:
eval
exec
__import__
open
file
input
execfile
print can be dangerous if you have one of those dumb shells that execute code on seeing certain output
stdin
__builtins__
globals() and locals() must be blocked otherwise they can be used to bypass your rules
There's probably tons of others that I didn't think about.
Unfortunately, crap like this is possible...
object().__reduce__()[0].__globals__["__builtins__"]["eval"]("open('/tmp/l0l0l0l0l0l0l','w').write('pwnd')")
So it turns out keywords, import restrictions, and in-scope by default symbols alone are not enough to cover, you need to verify the entire graph...

Use a Virtual Machine instead of running it on a system that you are concerned about.

Without a sandboxed environment, it is impossible to prevent a Python file from doing harm to your system aside from not running it.
It is easy to create a Cryptominer, delete/encrypt/overwrite files, run shell commands, and do general harm to your system.
If you are on Linux, you should be able to use docker to sandbox your code.
For more information, see this GitHub issue: https://github.com/raxod502/python-in-a-box/issues/2.
I did come across this on GitHub, so something like it could be used, but that has a lot of limits.
Another approach would be to create another Python file which parses the original one, removes the bad code, and runs the file. However, that would still be hit-and-miss.

Python/C "defs" file - what is it?

In the nautilus-python bindings, there is a file "nautilus.defs". It contains stanzas like
(define-interface MenuProvider
(in-module "Nautilus")
(c-name "NautilusMenuProvider")
(gtype-id "NAUTILUS_TYPE_MENU_PROVIDER")
)
or
(define-method get_mime_type
(of-object "NautilusFileInfo")
(c-name "nautilus_file_info_get_mime_type")
(return-type "char*")
)
Now I can see what most of these do (eg. that last one means that I can call the method "get_mime_type" on a "FileInfo" object). But I'd like to know: what is this file, exactly (ie. what do I search the web for to find out more info)? Is it a common thing to find in Python/C bindings? What is the format, and where is it documented? What program actually processes it?
(So far, I've managed to glean that it gets transformed into a C source file, and it looks a bit like lisp to me.)

To answer your "What program actually processes it?" question:
From Makefile.in in the src directory, the command that translates the .defs file into C is PYGTK_CODEGEN. To find out what PYGTK_CODEGEN is, look in the top-level configure.in file, which contains these lines:
AC_MSG_CHECKING(for pygtk codegen)
PYGTK_CODEGEN="$PYTHON `$PKG_CONFIG --variable=codegendir pygtk-2.0`/codegen.py"
AC_SUBST(PYGTK_CODEGEN)
AC_MSG_RESULT($PYGTK_CODEGEN)
So the program that processes it is a Python script called codegen.py, that apparently has some link with PyGTK. Now a Google search for PyGTK codegen gives me this link as the first hit, which says:
"PyGTK-Codegen is a system for automatically generating wrappers for interfacing GTK code with Python."
and also gives some examples.
As for: "What is the format, and where is it documented?". As others have said, the code looks a lot like simple Scheme. I couldn't find any documentation at all on codegen on the PyGTK site; this looks like one of those many dark corners of open source that isn't well documented. Your best bet would probably be to download a recent tarball for PyGTK, look through the sources for the codegen.py file and see if the file itself contains sufficient documentation.

All you need to create Python bindings for C code is to use the Python / C API. However, the API can be somewhat repetitive and redundant, and so various forms of automation may be used to create them. For example, you may have heard of swig. The LISP-like (Scheme) code that you see is simply a configuration file for PyGTK-Codegen, which is a similar automation program for creating bindings to Python.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.