I am about to get a bunch of python scripts from an untrusted source.
I'd like to be sure that no part of the code can hurt my system, meaning:
(1) the code is not allowed to import ANY MODULE
(2) the code is not allowed to read or write any data, connect to the network, etc.
(the purpose of each script is to loop through a list, compute some data from input given to it and return the computed value)
Before I execute such code, I'd like to have a script 'examine' it and make sure there's nothing dangerous in it that could hurt my system.
I thought of using the following approach: check that the word 'import' is not used, so we are guaranteed that no modules are imported.
Yet, it would still be possible for the user (if desired) to write code that reads/writes files, etc. (say, using open).
Then here comes the question:
(1) Where can I get a 'global' list of Python built-in functions (like open)?
(2) Is there some code that I could add at the top of each script sent to me that would make certain 'global' functions invalid for that script (for example, any use of the name open would lead to an exception)?
I know that there are some Python sandboxing solutions, but please try to answer the question as asked, as I feel this is the more relevant approach for my needs.
EDIT: Suppose I make sure that no import appears in the file, and that no potentially harmful built-ins (such as open, eval, etc.) appear in it. Can I conclude that the file is SAFE? (Can you think of any other 'dangerous' ways that built-in functions can be run?)
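For reference, here is a minimal sketch of the kind of scan the question proposes (the blocklist is illustrative, and the answers below show why this is not sufficient):

# Naive scan: reject scripts containing blocklisted words.
# This is NOT a security boundary; see the bypasses discussed below.
BLOCKLIST = ("import", "open", "eval", "exec", "file", "input")

def looks_dangerous(source):
    return any(word in source for word in BLOCKLIST)

print(looks_dangerous("total = 0\nfor x in [1, 2]:\n    total += x"))  # False
print(looks_dangerous("__import__('os')"))                             # True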
This point hasn't been made yet, and should be:
You are not going to be able to secure arbitrary Python code.
A VM is the way to go unless you want security issues up the wazoo.
You can still obfuscate import without using eval:
s = '__imp'
s += 'ort__'
f = globals()['__builtins__'].__dict__[s]   # f is now the real __import__
** BOOM **
Start with the documented lists of built-in functions and keywords.
Note that you'll need to do things like look for both "file" and "open", as both can open files (file is a built-in in Python 2).
Also, as others have noted, this isn't 100% certain to stop someone determined to insert malicious code.
An approach that should work better than string matching is to use the ast module: parse the Python code, apply your whitelist filtering to the tree (e.g. allow only basic operations), then compile and run the filtered tree.
See this nice example by Andrew Dalke on manipulating ASTs.
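A minimal sketch of that approach, assuming a deliberately tiny whitelist (the node set below is illustrative, not a vetted safe subset):

import ast

# Illustrative whitelist: module structure, assignment, names, literals,
# arithmetic, and for-loops. Anything else is rejected.
ALLOWED_NODES = (
    ast.Module, ast.Expr, ast.Assign, ast.Name, ast.Load, ast.Store,
    ast.Constant, ast.BinOp, ast.Add, ast.Sub, ast.Mult, ast.Div,
    ast.For, ast.List, ast.Tuple,
)

def compile_whitelisted(source):
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if not isinstance(node, ALLOWED_NODES):
            raise ValueError("disallowed syntax: " + type(node).__name__)
    return compile(tree, "<untrusted>", "exec")

# NOTE: a node whitelist alone is not sufficient; as this page shows,
# you must also control which names are reachable at run time.
code_obj = compile_whitelisted("total = 0\nfor x in [1, 2, 3]:\n    total = total + x\n")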
Built-in functions/keywords to look for:
eval
exec
__import__
open
file
input
execfile
print can be dangerous if you have one of those dumb shells that execute code on seeing certain output
stdin
__builtins__
globals() and locals() must be blocked, otherwise they can be used to bypass your rules
There's probably tons of others that I didn't think about.
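On the question's point (2), the closest you can get to "making global functions invalid" is to exec the script with an emptied __builtins__, exposing only names you choose. A hedged sketch (and, as the exploit below shows, object-graph tricks can still escape such filters, so treat this as a mitigation, not a boundary):

# Run the untrusted code with no built-ins; only sum_list is visible to it.
untrusted = "result = sum_list([1, 2, 3])"

def sum_list(items):
    total = 0
    for item in items:
        total += item
    return total

namespace = {"__builtins__": {}, "sum_list": sum_list}
exec(untrusted, namespace)
print(namespace["result"])   # 6; open/eval/__import__ raise NameError in there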
Unfortunately, crap like this is possible...
object().__reduce__()[0].__globals__["__builtins__"]["eval"]("open('/tmp/l0l0l0l0l0l0l','w').write('pwnd')")
So it turns out that blocking keywords, imports, and the names in scope by default is not enough; you need to verify the entire reachable object graph...
Use a Virtual Machine instead of running it on a system that you are concerned about.
Without a sandboxed environment, it is impossible to prevent a Python file from doing harm to your system aside from not running it.
It is easy to create a Cryptominer, delete/encrypt/overwrite files, run shell commands, and do general harm to your system.
If you are on Linux, you should be able to use docker to sandbox your code.
For more information, see this GitHub issue: https://github.com/raxod502/python-in-a-box/issues/2.
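As a rough illustration of the container route (the image name, paths, and limits below are placeholders), the untrusted script can be launched from Python with networking disabled and only the script itself mounted read-only:

import subprocess

result = subprocess.run(
    ["docker", "run", "--rm",
     "--network", "none",              # no network access
     "--memory", "256m",               # cap memory use
     "-v", "/path/to/script.py:/code/script.py:ro",
     "python:3.11-slim",
     "python", "/code/script.py"],
    capture_output=True, text=True, timeout=30,
)
print(result.stdout)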
I did come across this on GitHub, so something like it could be used, but that has a lot of limits.
Another approach would be to create another Python file which parses the original one, removes the bad code, and runs the file. However, that would still be hit-and-miss.
Related
I have a program that, for data security reasons, should never persist anything to local storage if deployed in the cloud. Instead, any input / output needs to be written to the connected (encrypted) storage instead.
To allow deployment locally as well as to multiple clouds, I am using the very useful fsspec. However, other developers are working on the project as well, and I need a way to make sure that they aren't accidentally using local File I/O methods - which may pass unit tests, but fail when deployed to the cloud.
For this, my idea is basically to mock/replace any I/O methods in pytest with ones that don't work, making the test fail. However, this is probably not straightforward to implement. I am wondering whether anyone else has had this problem as well, and whether best practices or a library already exist for this?
During my research, I found pyfakefs, which looks like it is very close to what I am trying to do, except I don't want to simulate another file system; I want there to be no local file system at all.
Any input appreciated.
You cannot use any pytest add-ons to make this secure. There will always be ways to get around it. Even if you patch everything in the standard Python library, the code can always use third-party C libraries, which can't be patched from the Python side.
Even if you somehow restrict every way the Python process can write a file, it will still be able to ask the OS or another process to write something for it.
The only real options are to run only trusted code, or to run the process inside some sandbox.
On Unix-like operating systems, a workable solution may be to create a chroot and run the program inside it.
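A minimal sketch of the chroot idea (Unix only, requires root; the jail path is a placeholder):

import os

os.chroot("/srv/jail")   # the jail contains only what the program needs
os.chdir("/")            # move the working directory inside the new root
# anything executed from here on can only see files under /srv/jail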
If you're OK with just preventing files from being opened via the open function, you can patch it in the builtins module (imports added for completeness):

import builtins
import pytest

_original_open = builtins.open

class FileSystemUsageError(Exception):
    pass

def patched_open(*args, **kwargs):
    raise FileSystemUsageError()

@pytest.fixture
def disable_fs():
    builtins.open = patched_open
    yield
    builtins.open = _original_open
I based this example on a pytest plugin written by the company I currently work for to prevent network use in pytests. You can see the full example here: https://github.com/best-doctor/pytest_network/blob/4e98d816fb93bcbdac4593710ff9b2d38d16134d/pytest_network.py
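For completeness, here is a hypothetical test using the fixture (mymodule and process are placeholders for your code under test):

def test_no_local_io(disable_fs):
    from mymodule import process        # hypothetical code under test
    with pytest.raises(FileSystemUsageError):
        process("anything that touches the local file system")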
So, I recently made a Python program that I want to send to someone so that they can execute it, but not read the code I have typed in it. Any ideas how to do it?
BTW, I want it to be irreversible.
In short, here are my parameters:
Should remain a Python file
Can't be reversed
Code should not be readable
Should still have the ability to be run
The criteria you've posted are inconsistent.
Python is an interpreted language. The entity running the language (i.e. Python interpreter) is reading your code and executing it, line by line. If you wrap it up to send to someone, their Python interpreter must have read permissions on the file, whether it's source code or "compiled" Python (which is easily decompiled into equivalent source code).
If we take a wider interpretation of "send to someone", there may be a business solution that serves your needs: provide your functionality rather than your code. Deploy it as a service on some available server (your own, or rented space) and give the recipient an interface to that functionality (a toy sketch follows below).
If this fulfills your needs, you now have your next research topic.
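As a toy sketch of that idea (every name here is illustrative), the logic you want to protect stays on your machine and the recipient only ever talks to an interface:

from http.server import BaseHTTPRequestHandler, HTTPServer

def secret_logic(x):
    return x * 2            # placeholder for the code you won't hand out

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        try:
            value = int(self.path.strip("/"))   # e.g. GET /21 -> "42"
        except ValueError:
            self.send_response(400)
            self.end_headers()
            return
        self.send_response(200)
        self.end_headers()
        self.wfile.write(str(secret_logic(value)).encode())

HTTPServer(("", 8000), Handler).serve_forever()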
The full explanation of what I want to do and why would take a while, so in short: I want to use a private SSL connection in a publicly distributed application without handing out my private SSL keys, because that negates the purpose! That is, I want secure remote database operations that no one can see into, including the client.
My core question is: how can I make the Python ssl module use data held in memory containing the SSL PEM file contents, instead of hard file system paths to them?
The constructor for class SSLSocket calls load_verify_locations(ca_certs) and load_cert_chain(certfile, keyfile) which I can't trace into because they are .pyd files. In those black boxes, I presume those files are read into memory. How might I short circuit the process and pass the data directly? (perhaps swapping out the .pyd?...)
Other thoughts I had were: I could use io.StringIO to create a virtual file, and then pass the file descriptor around. I've used that concept with classes that will take a descriptor rather than a path. Unfortunately, these classes aren't designed that way.
Or, maybe use a virtual file system / ram drive? That could be trouble though because I need this to be cross platform. Plus, that would probably negate what I'm trying to do if someone could access those paths from any external program...
I suppose I could keep them as real files, but "hide" them somewhere in the file system.
I can't be the first person to have this issue.
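One partial answer worth noting (hedged, since it covers only the CA side): since Python 3.4, SSLContext.load_verify_locations accepts a cadata argument holding PEM or DER certificate data directly in memory; load_cert_chain, however, still wants file paths in the stdlib:

import ssl

# Placeholder for CA certificate data held in memory.
CA_PEM = "-----BEGIN CERTIFICATE-----\n...\n-----END CERTIFICATE-----\n"

context = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
context.load_verify_locations(cadata=CA_PEM)   # no path touches the disk
# context.load_cert_chain(certfile=..., keyfile=...)  # still path-based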
UPDATE
I found the source for the "black boxes"...
https://github.com/python/cpython/blob/master/Modules/_ssl.c
They work as expected. They just read the file contents from the paths, but you have to dig down into the C layer to get to this.
I can write in C, but I've never tried to recompile the underlying Python source. It looks like I should follow the directions at https://devguide.python.org/ to pull the Python repo and make my changes there. I guess I can then submit my update to the Python community to see if they want to adopt a new standardized feature like the one I'm describing... Lots of work ahead, it seems...
It took some effort, but I did, in fact, solve this in the manner I suggested. I revised the underlying code in the _ssl.c Python module / extension and rebuilt Python as a whole. After figuring out the process for building Python from source, I had to learn the details of how to pass variables between Python and C, and I needed to dig into the guts of OpenSSL (over which the Python module is a wrapper).
Fortunately, OpenSSL already has functions for this exact purpose, so it was just a matter of swapping out how Python passes the file paths into the C layer, bypassing the file-reading step, and jumping straight to using the CA/cert/key data directly instead.
For the moment, I only did this for Windows. Since I'm ultimately creating a cross platform program, I'll have to repeat the build process for the other platforms I'll support - so that's a hassle. Consider how badly you want this, if you are going to pursue it yourself...
Note that when I rebuilt Python, I didn't use that as my actual Python installation. I just kept it off to the side.
One thing that was really nice about this process was that after that rebuild, all I needed to do was drop the single new _ssl.pyd into my working directory. With that file in place, I could pass my direct cert data. If I removed it, I could pass the normal file paths instead. It will use either the normal Python source, or implicitly use the override if the .pyd file is simply put in the program's directory.
Has anyone implemented CSV handling for Flyway? It was requested some time ago (Flyway specific migration with csv files). Flyway now documents it as a possibility via MigrationResolver and MigrationExecutor, but it does not seem to be implemented.
I've tried to do it myself with Flyway 4.2, but I'm not very good with Java. I got as far as creating my own jar using the sample and making it accessible to Flyway. But how does Flyway distinguish when to use the SqlMigrator and when to use my CsvMigrator? I thought I had to register my own prefix/suffix (as the question above writes), but FlywayConfiguration seems to be read-only; at least I did not see any API calls for setting this :(.
How do I connect the different resolvers to the different migration file types (.sql to the SQL migration, and .csv/.py to CSV loading and Python script execution)?
After some blood, sweat and tears, it looks like I came up with something for this. I can't make the whole code available because it is using a proprietary file format, but here are the main ideas:
implement ConfigurationAware as well, and use your setFlywayConfiguration implementation to catalog the extra files you want to handle (e.g. .csv). This is executed only once during the run.
during this cataloging I could not use the scanner or LoadableResources; there's some Java magic I do not understand. All the classes and methods seem to be available and accessible, even when using .getMethods() at runtime... but when trying to actually call them during a run it throws java.lang.NoSuchMethodError and java.lang.NoClassDefFoundError. I wasted a whole day on this; don't do that, just copy-paste the code from org.flywaydb.core.internal.util.scanner.filesystem.FileSystemScanner.
use Set<String> instead of LoadableResources[]; it's way easier to work with, especially since there's no access to LoadableResources anyway, and working with [] was a nightmare.
the python/shell call will go into execute(). Some tips:
any exception or faulty exit code needs to be translated to an SQLException.
the build enforces Java 1.6, so new ProcessBuilder(cmd).inheritIO() cannot be used. If you want to print STDOUT/STDERR, look at the solutions in "ProcessBuilder: Forwarding stdout and stderr of started processes without blocking the main thread".
to compile flyway including your custom module, clone the whole flyway repo from git, edit the main pom.xml to include your module as well and use this command to compile: "mvn install -P-CommercialDBTest -P-CommandlinePlatformAssemblies -DskipTests=true" (I found this in another stackoverflow question.)
What I haven't done yet is the checksum part; I don't know yet what that needs.
I plan to use these functions in a web environment, so my concern is whether they can be exploited and used to execute malicious software on the server.
Edit: I don't execute the result. I parse the AST tree and/or catch SyntaxError.
This is the code in question (a Python 2-era fragment; the compiler module was removed in Python 3. It is shown here wrapped in a function, with its imports, for completeness):

import compiler                  # Python 2 only, deprecated
from pyflakes import checker

def check_code(code_string, filename):
    try:
        # compile the code and check for syntax errors
        compile(code_string, filename, "exec")
    except SyntaxError, value:
        msg = value.args[0]
        (lineno, offset, text) = value.lineno, value.offset, value.text
        if text is None:
            return [{"line": 0, "offset": 0,
                     "message": u"Problem decoding source"}]
        else:
            line = text.splitlines()[-1]
            if offset is not None:
                offset = offset - (len(text) - len(line))
            else:
                offset = 0
            return [{"line": lineno, "offset": offset, "message": msg}]
    else:
        # no syntax errors, check it with pyflakes
        tree = compiler.parse(code_string)
        w = checker.Checker(tree, filename)
        w.messages.sort(lambda a, b: cmp(a.lineno, b.lineno))
checker.Checker is the pyflakes class that walks the parsed AST.
I think the more interesting question is what are you doing with the compiled functions? Running them is definitely unsafe.
I've tested the few exploits I could think of. Seeing as it's just a syntax check (you can't redefine classes/functions, etc.), I don't think there is any way to get Python to execute arbitrary code at compile time.
If the resulting code or AST object is never evaluated, I think you are only subject to denial-of-service attacks.
If you are evaluating user-supplied code, it is the same as giving every user shell access as the webserver user.
They are not safe, but it's not too hard to find a subset of Python that can be sandboxed to a point. If you want to go down that road, you need to parse that subset of Python yourself and intercept all calls, attribute lookups, and everything else involved. You also don't want to give users access to language constructs such as non-terminating loops and the like.
Still interested? Head over to jinja2.sandbox :)
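For a flavour of what the sandbox gives you (a restricted template/expression language, not full Python):

from jinja2.sandbox import SandboxedEnvironment

env = SandboxedEnvironment()
print(env.from_string("{{ x + 1 }}").render(x=41))   # 42
# Unsafe attribute access, e.g. {{ x.__class__ }}, is intercepted by the
# sandbox rather than handing templates a route into Python internals.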
compiler.parse and compile could most definitely be used for an attack if the attacker can control their input and the output is executed. In most cases, you are going to either eval or exec their output to make it run so those are still the usual suspects and compile and compiler.parse (deprecated BTW) are just adding another step between the malicious input and the execution.
EDIT: Just saw that you left a comment indicating that you are actually planning on using these on USER INPUT. Don't do that. Or at least, don't actually execute the result. That's a huge security hole for whoever ends up running that code. And if nobody's going to run it, why compile it? Since you clarified that you only want to check syntax, this should be fine. I would not store the output though as there's no reason to make anything easier for a potential attacker and being able to get arbitrary code onto your system is a first step.
If you do need to store it, I would probably favor a scheme similar to that commonly used for images where they are renamed in a non-predictable manner with the added step of making sure that it is not stored on the import path.
Yes, they can be maliciously exploited.
If you really want safe sandboxing, you could look at PyPy's sandboxing features, but be aware that sandboxing is not easy, and there may be better ways to accomplish whatever you are seeking.
Correction
Since you've updated your question to clarify that you're only parsing the untrusted input to AST, there is no need to sandbox anything: sandboxing is specifically about executing untrusted code (which most people probably assumed your goal was, by asking about sandboxing).
Using compile / compiler only for parsing this way should be safe: Python source parsing does not have any hooks into code execution. (Note that this is not necessarily true of all languages: for example, Perl cannot be (completely) parsed without code execution.)
The only other remaining risk is that someone may be able to craft some pathological Python source code that makes one of the parsers use runaway amounts of memory / processor time, but resource exhaustion attacks affect everything, so you'll just want to manage this as it becomes necessary. (For example, if your deployment is mission-critical and cannot afford a denial of service by an attacker armed with pathological source code, you can execute the parsing in a resource-limited subprocess).
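A hedged sketch of that mitigation on Unix (the limits, timeout, and untrusted_source value below are illustrative):

import resource
import subprocess
import sys

def limit_resources():
    # Cap address space at 256 MB and CPU time at 5 seconds (illustrative).
    resource.setrlimit(resource.RLIMIT_AS, (256 * 2**20, 256 * 2**20))
    resource.setrlimit(resource.RLIMIT_CPU, (5, 5))

untrusted_source = "x = 1 + 1"   # stands in for the user-supplied code

# Parse the untrusted source in a throwaway, resource-limited child process.
proc = subprocess.run(
    [sys.executable, "-c", "import ast, sys; ast.parse(sys.stdin.read())"],
    input=untrusted_source, text=True,
    preexec_fn=limit_resources,   # POSIX only
    capture_output=True, timeout=10,
)
syntax_ok = (proc.returncode == 0)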