What is the common practice for sanitizing a filename from an outside source (e.g. an XML file) before using it with a subprocess (shell=False)?
Update:
Before sending some parsed strings around I would like to make some basic security checks. The given example uses mpg123 (a command-line audio player) in remote mode to play a sound file.
filename = child.find("filename").text # e.g.: filename = "sound.mp3"
pid = subprocess.Popen(["mpg123", "-R"], stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
command = "L "+filename+"\n"
pid.stdin.write(command.encode())
There's a couple of things I can think of.
A lightweight verification can be done if the systems involved are tolerant. It may also be appropriate where there is little chance of data destruction or compromise of sensitive data. You can test whether the string you're given is an actual file using os.path.isfile.
A more classic "secure" programming design would have you index the acceptable files that can be played and do a lookup based on user input. That way you never actually pass user input along; it gets "filtered" by the lookup into already-validated data (the list of accepted playable files). A sketch of this follows below.
"Sanitizing" input is a black-listing type of technique, and black-listing is always less secure than a white-listing technique (above). If you have no choice but to "sanitize" the data, you have to understand how that data passes through your system and through any other systems you depend on, and then craft rules that account for the flaws and limitations of ALL of those systems. You would also have to cover classic malicious-input cases such as oversized input, unacceptable character encodings, and others.
Filenames don't need to be sanitized unless you are using a shell or executing anything. Python's open() will not execute any commands contained in the filename it is given.
For security checks, to avoid overwriting files, use the permission system of your OS: make sure that the user the program runs as can only overwrite and access the files it should be able to overwrite and access.
It's generally not a good idea to let any program that takes input from the net or from another process accept absolute path names. In this case the program should only be allowed to specify files under a defined music folder (see the sketch below). I don't think the mp3 player can cause damage if you give it the wrong file, but you can at least crash it, and that would be annoying.
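A minimal sketch of confining untrusted names to that folder (assumes Python 3.4+ for os.path.commonpath; MUSIC_DIR and the helper name are illustrative):

import os

MUSIC_DIR = os.path.realpath("/home/user/music")

def resolve_in_music_dir(untrusted_name):
    # Resolve symlinks and ".." components before checking containment.
    candidate = os.path.realpath(os.path.join(MUSIC_DIR, untrusted_name))
    if os.path.commonpath([candidate, MUSIC_DIR]) != MUSIC_DIR:
        raise ValueError("path escapes the music folder: %r" % untrusted_name)
    return candidate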
I am in the process of writing a program and need some guidance. Essentially, I am trying to determine whether a file has some marker or flag attached to it, sort of like the attributes in an HTTP header.
If such a marker exists, that file will be manipulated in some way (moved to another directory).
My question is:
Where exactly should I be storing this flag/marker? Do files have a system similar to HTTP headers? I don't want to access or manipulate the contents of the file, just some kind of property of the file that can be edited without corrupting the actual file, and it must be fairly universal among file types, as my potential domain of file types is unbounded. I have some experience with web APIs, so I am familiar with HTTP headers and JSON. Does any similar system exist for local files on Windows? I am especially interested in hearing from anyone with professional/industry knowledge of common techniques programmers use to store 'metadata' on files in order to access it later, or in pointers on what I should be researching.
For the record, I am going to write a program for Windows probably using Golang or Python. And the files I am going to manipulate will be potentially all common ones (.docx, .txt, .pdf, etc.)
Metadata you wish to add is best kept in a separate file or database covering all files, or in a companion file with the same name and a different extension or prefix, which you can make hidden.
Relying on the file system is very tricky: your data will be bound by the restrictions and capabilities of whatever file system the file is stored on.
And you cannot count on your data remaining intact, as any application may wish to change these flags.
Some of them also have a very specific, clearly defined use, such as creation time, modification time, access time...
See, if you only need to flag the document, you could abuse the creation time, which stays unchanged throughout the life of the document (until it is copied), to store your flags. :D
Very dirty business, unprofessional, unreliable and all that.
But it's a solution. A poor one, but it exists.
As far as I know, the FAT32 and NTFS file systems don't support any extra flag bits beyond those already used by the OS.
Unix's ext family of file systems does support some extra bits, and even then you should be careful in case some other important application makes use of them for something.
Mac OS may support some metadata by itself, but I am not 100% sure.
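On Linux, one related mechanism you can reach from Python is extended attributes (not the same thing as the ext flag bits, but they serve the same labelling purpose). A hedged sketch: os.setxattr and os.getxattr are Linux-only in the standard library, and "user.flag" is an arbitrary name in the unprivileged "user." namespace:

import os

os.setxattr("song.mp3", "user.flag", b"move-me")
print(os.getxattr("song.mp3", "user.flag"))  # b'move-me'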
On Windows, you have one more option for associating extra data with a file, but I wouldn't use that either.
The NTFS file system (FAT doesn't support this) has a feature called alternate data streams.
In essence, the same file can have multiple data streams under itself, i.e. more than one file's contents under the same file node.
To be clear: the same file contains two different files.
When you open the file normally, only the main stream is visible to the application. Applications must explicitly check whether other streams are present and choose the one they want to follow.
So you could choose to store metadata in a second stream of the file.
But what if the stream you want is already taken?
Even worse, anti-virus programs may block your access to that metadata out of paranoia, or at least ask for permission.
I don't know why MS included this feature, probably for file duplication or something, but attackers made use of the fact that you can store data under an existing regular file without anybody being aware of it.
Imagine a virus writing its copy into another stream of one of the programs already there.
All that is then needed for the virus to start instead of your old program, the next time you run it, is a batch script added to the task scheduler that flips the two streams, making the virus data the main one.
A nasty trick! When this feature started to be abused, anti-virus software began restricting files with multiple streams, so in practice it's as if the feature doesn't exist.
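For completeness, and despite all the caveats above, this is what alternate data streams look like from Python on an NTFS volume (the stream name "metadata" and the payload are arbitrary illustrations; this only works on Windows/NTFS):

# "name:stream" addresses an alternate data stream of the file "name".
with open("document.docx:metadata", "w") as stream:
    stream.write('{"flag": "move-me"}')

with open("document.docx:metadata") as stream:
    print(stream.read())  # the main document.docx contents are untouched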
If you want to add some metadata using an OS technology, there is also the Windows registry,
but even that is unwise.
What else can I tell you?
Don't add metadata to the files themselves. Organize a separate file, or index your data in companion files with the same name as the file you are referring to, in the same folder.
If you are dealing with binary files like docx and pdf, you're best off storing the metadata in separate files or in an SQLite database.
Metadata is usually stored separately from file contents, in data structures called inodes (at least on Unix systems; Windows probably has something similar). But you probably don't want to go that deep down the rabbit hole.
If your goal is to query the system based on metadata, then it is easier and more efficient to use something like SQLite (see the sketch below). Having the metadata inside the file would mean that you would need to open the file, read it into memory from disk, and then check the metadata, i.e. slower queries.
If you don't need to query based on metadata, then storing metadata in the file might make sense. It would reduce the dependencies in your application, but in order to access the contents of the file through Word or Adobe Reader, you'd need to strip the metadata before handing it off to the application. Not worth the hassle, usually.
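A minimal sketch of the SQLite approach, using only the standard library; the schema, the example path, and the flag value are illustrative assumptions:

import sqlite3

conn = sqlite3.connect("file_metadata.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS metadata (path TEXT PRIMARY KEY, flag TEXT)"
)

# Tag a file, then find every file carrying that flag.
conn.execute(
    "INSERT OR REPLACE INTO metadata VALUES (?, ?)",
    (r"C:\docs\report.docx", "move-me"),
)
conn.commit()

for (path,) in conn.execute(
    "SELECT path FROM metadata WHERE flag = ?", ("move-me",)
):
    print(path)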
I want to create a trial version of a program for my customer. I want to give him/her some time to test the program (7 days in this case).
I have this command in the application (in *.py file):
import os
from datetime import datetime

if os.path.isfile('some_chars.txt') or datetime.now() < datetime.strptime('30-8-2015', '%d-%m-%Y'):
    do_the_work()  # DO WHAT application HAS TO DO (placeholder)
else:
    print 'TRIAL EXPIRED'
    quit()
I'm curious whether this approach is enough for a common customer, or whether I have to change it. The thing is that the application has to find a file whose name is, let's say, 'some_chars.txt'. If the file is found, the application works as it should; if not, it returns the text 'Trial version expired'.
So the main question is: is this enough for a common customer? Can it be found somewhere, or is it compiled to machine code so that he would have to disassemble it?
EDIT: I forgot to mention a very important thing: I'm using py2exe to make an executable file (main) along with the accompanying files and folders.
Of course it has everything to do with the target audience you're aiming at: there are some cases where security is serious business (those involve lots of money, so it's not our case).
Let's take an example with escalating levels of protection:
Have the program read plain data from a file (or the registry, ...), e.g. the date; the program parses the date, does a comparison, and depending on the trial period either closes or lets the user go on.
Have everything from the previous step, but the data is not in plain text; it is encrypted (e.g. 1 is added to every char of the data so it is not immediately readable). A toy sketch of this follows the list.
Use a well-known encryption algorithm (which would make the data unreadable to the user).
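A toy sketch of that char-shift idea (note this is obfuscation, not security; anyone reading the code can reverse it, and the date string is just the example from the question):

def encode(text):
    return ''.join(chr(ord(c) + 1) for c in text)

def decode(text):
    return ''.join(chr(ord(c) - 1) for c in text)

stored = encode('30-8-2015')   # what you would write to the file or registry
print(decode(stored))          # '30-8-2015' when the program reads it back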
But no matter which method you choose, it's only a matter of time until it is broken.
A "hard to beat" way would be to have an existing server the client can connect to and "secretly talk" with (over an SSL connection, anyway), even for the trial period.
"Hiding the obvious info" (delivering a "compiled" .py script) is no longer the way: the most common Google search will point to a Python decompiler.
Python is interpreted, so all they have to do is look at the source code to see the time-limiting section.
There are some options for turning a Python script into an executable. I would try one of those, and I wouldn't use any external file to set the date; keep it in the script.
I would like users of my program to be able to define custom scripts in python without breaking the program. I am looking at something like this:
def call(script):
    code = "access modification code" + script
    exec(code)
Where "access modification code" defines a scope such that script can only access variables it instantiates itself. Is it possible to do this or something with similar functionality, such as creating a new python environment with its own scope and then receiving output from it?
Thank you for your time :)
Clarification Edit
"I want to prevent both active attacks and accidental interaction with the program variables outside the users script (hence hiding all globals). The user scripts are intended to be small and inputted as text. The return of the user script needs to be immediate, as though it were native to the program."
There are two separate problems you want to prevent in this scenario:
Preventing an outside attacker from running arbitrary code in the context of the OS user executing your program. This means preventing arbitrary code execution and privilege escalation.
Preventing a legitimate user of your program from changing the program's behavior in unintended ways.
For the first problem, you need to make sure that the source of the python code you execute is the user. You shouldn't accept input from a socket, or from a file that other users can write to. You need to make sure, somehow, that it was the user who is running the program that provided the input.
Actual solutions will depend on your OS, but you may want to consider setting up restrictive file permissions, if you're storing the code in a file.
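For instance, on POSIX systems a minimal sketch might look like this (the path reuses the example further down; whether this is sufficient depends on your threat model):

import os
import stat

# Owner may read and write (0o600); no other user can modify the code file.
os.chmod("/home/user/user_code_file.py", stat.S_IRUSR | stat.S_IWUSR)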
Don't ignore or downplay this problem, or your users will fall victim to virus/malware/hackers thanks to your program.
The way you solve the second problem depends on what exactly constitutes intended behaviour in your program. If you're happy with outputting simple data structures, you can run your user-inputted code in a separate process, and pass the result over a pipe in a serialization format such as JSON or YAML.
Here's a very simple example:
# remember to set restrictive file permissions on this file. This is OS-dependent
USER_CODE_FILE = "/home/user/user_code_file.py"
# absolute path to python binary (executable)
PYTHON_PATH = "/usr/bin/python"

import subprocess
import json

user_code = '''
import json
my_data = {"a": [1, 2, 3]}
print json.dumps(my_data)
'''

with open(USER_CODE_FILE, "wb") as f:
    f.write(user_code)

user_result_str = subprocess.check_output([PYTHON_PATH, USER_CODE_FILE])
user_result = json.loads(user_result_str)
print user_result
This is a fairly simple solution, and it has significant overhead. Don't use this if you need to run the user code many times in a short period of time.
Ultimately, this solution is only effective against unsophisticated attackers (users). Generally speaking, there's no way to protect any user process from the user itself - nor would it make much sense.
If you really want more assurance, and want to mitigate the first problem too, you should run the process as a separate ("guest") user. This, again, is OS-dependent.
Finally, a warning: avoid exec and eval to the best of your ability. They don't protect either your program or the user against the inputted code. There's a lot of information about this on the web; just search for "python secure eval".
Is there any way of keeping a result variable in memory so I don't have to recalculate it each time I run the beginning of my script?
I am doing a long (5-10 sec) series of the exact same operations on a data set (which I am reading from disk) every time I run my script.
This wouldn't be too much of a problem since I'm pretty good at using the interactive editor to debug my code in between runs; however sometimes the interactive capabilities just don't cut it.
I know I could write my results to a file on disk, but I'd like to avoid doing so if at all possible. This should be a solution which generates a variable the first time I run the script, and keeps it in memory until the shell itself is closed or until I explicitly tell it to fizzle out. Something like this:
# Check if variable already created this session
in_mem = var_in_memory()  # Returns pointer to var, or False if not in memory yet
if not in_mem:
    # Read data set from disk
    with open('mydata', 'r') as in_handle:
        mytext = in_handle.read()
    # Extract relevant results from data set
    mydata = parse_data(mytext)
    result = initial_operations(mydata)
    in_mem = store_persistent(result)
I have an inkling that the shelve module might be what I'm looking for here, but it looks like in order to open a shelve variable I would have to specify a file name for the persistent object, so I'm not sure it's quite what I'm looking for.
Any tips on getting shelve to do what I want it to do? Any alternative ideas?
You can achieve something like this using the reload built-in to re-execute your main script's code. Write a wrapper script that imports your main script, asks it for the variable it wants to cache, and keeps a copy of it within the wrapper script's module scope. Then, whenever you want to re-run (when you hit ENTER on stdin, or whatever), the wrapper calls reload(yourscriptmodule) and hands the cached object back to the script so it can bypass the expensive computation. Here's a quick example.
wrapper.py
import sys
import mainscript

part1Cache = None

if __name__ == "__main__":
    while True:
        if not part1Cache:
            part1Cache = mainscript.part1()
        mainscript.part2(part1Cache)
        print "Press enter to re-run the script, CTRL-C to exit"
        sys.stdin.readline()
        reload(mainscript)
mainscript.py
def part1():
    print "part1 expensive computation running"
    return "This was expensive to compute"

def part2(value):
    print "part2 running with %s" % value
While wrapper.py is running, you can edit mainscript.py, add new code to the part2 function, and run your new code against the pre-computed part1Cache.
To keep data in memory, the process must keep running. Memory belongs to the process running the script, NOT to the shell. The shell cannot hold memory for you.
So if you want to change your code and keep your process running, you'll have to reload the modules when they're changed. If any of the data in memory is an instance of a class that changes, you'll have to find a way to convert it to an instance of the new class. It's a bit of a mess. Not many languages were ever any good at this kind of hot patching (Common Lisp comes to mind), and there are a lot of chances for things to go wrong.
If you only want to persist one object (or object graph) for future sessions, the shelve module probably is overkill. Just pickle the object you care about. Do the work and save the pickle if you have no pickle-file, or load the pickle-file if you have one.
import os
import cPickle as pickle

pickle_filepath = "/path/to/picklefile.pickle"

if not os.path.exists(pickle_filepath):
    # Read data set from disk
    with open('mydata', 'r') as in_handle:
        mytext = in_handle.read()
    # Extract relevant results from data set
    mydata = parse_data(mytext)
    result = initial_operations(mydata)
    # Pickle files should be opened in binary mode
    with open(pickle_filepath, 'wb') as pickle_handle:
        pickle.dump(result, pickle_handle)
else:
    with open(pickle_filepath, 'rb') as pickle_handle:
        result = pickle.load(pickle_handle)
Python's shelve module is a persistence solution for pickled (serialized) objects and is file-based. The advantage is that it stores Python objects directly, so the API is pretty simple (a minimal sketch follows below).
If you really want to avoid the disk, the technology you are looking for is an "in-memory database." Several alternatives exist; see this SO question: in-memory database in Python.
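For reference, the shelve API really is about this small (the filename and the expensive_computation helper are illustrative; note it still creates a file on disk, which is exactly what the question is trying to avoid):

import shelve

with shelve.open("cache") as db:
    if "result" not in db:
        db["result"] = expensive_computation()  # assumed to exist
    result = db["result"]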
Weirdly, none of the earlier answers here mention simple text files. The OP says they don't like the idea, but as this is becoming a canonical for duplicates which might not have that constraint, this alternative deserves a mention. If all you need is for some text to survive between invocations of your script, save it in a regular text file.
def main():
    # Before start, read data from previous run
    try:
        with open('mydata.txt', encoding='utf-8') as statefile:
            data = statefile.read().rstrip('\n')
    except FileNotFoundError:
        data = "some default, or maybe nothing"
    updated_data = your_real_main(data)
    # When done, save new data for next run
    with open('mydata.txt', 'w', encoding='utf-8') as statefile:
        statefile.write(updated_data + '\n')
This easily extends to more complex data structures, though then you'll probably need to use a standard structured format like JSON or YAML (for serializing data with tree-like structures into text) or CSV (for a matrix of columns and rows containing text and/or numbers).
Ultimately, shelve and pickle are just glorified, generalized versions of the same idea. But if your needs are modest, a simple textual format has compelling benefits: you can inspect and update it in a regular text editor, read and manipulate it with ubiquitous standard tools, and easily copy and share it between different Python versions, other programming languages, version control systems, etc.
As an aside, character encoding issues are a complication which you need to plan for; but in this day and age, just use UTF-8 for all your text files.
Another caveat is that beginners are often confused about where to save the file. A common convention is to save it in the invoking user's home directory, though that obviously means multiple users cannot share this data. Another is to save it in a shared location, but this then requires an administrator to separately grant write access to this location (except I guess on Windows; but that then comes with its own tectonic plate of other problems).
The main drawback is that text is brittle if you need multiple processes to update the file in rapid succession, and slow to handle if you have lots of data and need to update parts of it frequently. For these use cases, maybe look at a database (probably starting with SQLite, which is robust and nimble, and included in the Python standard library; scale up to Postgres etc. if you have enterprise-grade needs).
And, of course, if you need to store native Python structures, shelve and pickle are still there.
This is an OS-dependent solution...
$ mkfifo inpipe
#!/usr/bin/python3
# first_process.py
complicated_calculation()
while True:
    with open('inpipe') as f:
        try:
            exec(f.read())
        except Exception as e:
            print(e)
$ ./first_process.py &
$ cat second_process.py > inpipe
This allows you to change and redefine variables in the first process without copying or recalculating anything, and it should be the most efficient solution compared to multiprocessing, memcached, pickle, the shelve module, or databases.
This is really nice if you want to iteratively edit and redefine second_process.py in your editor or IDE until you have it right, without having to wait for the first process (e.g. initializing a large dict) to execute each time you make a change.
You can do this, but you must use a Python shell. In other words, the shell that you use to start Python scripts must itself be a Python process. Then any global variables or classes will live until you close that shell.
Look at the cmd module, which makes it easy to write a shell program. You can even arrange it so that any commands not implemented in your shell get passed to the system shell for execution (without closing your shell). Then you would have to implement some kind of command, prun for instance, that runs a Python script using the runpy module.
http://docs.python.org/library/runpy.html
You would need to use the init_globals parameter to pass your special data into the program's namespace, ideally a dict or a single class instance. A minimal sketch putting these pieces together follows.
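In this sketch, cmd.Cmd and runpy.run_path are standard library, while the class name, the prun command, and the cache dict are illustrative assumptions:

import cmd
import runpy

class PersistentShell(cmd.Cmd):
    """A shell whose cache dict survives between script runs."""
    prompt = "(pshell) "

    def __init__(self):
        cmd.Cmd.__init__(self)
        self.cache = {}  # lives as long as the shell process

    def do_prun(self, scriptname):
        """prun <script.py>: run a script with the cache in its namespace."""
        result = runpy.run_path(scriptname, init_globals={"cache": self.cache})
        # Keep whatever the script left in its cache for the next run.
        self.cache = result.get("cache", self.cache)

    def do_quit(self, line):
        return True

if __name__ == "__main__":
    PersistentShell().cmdloop()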
You could run a persistent script on the server through the OS which loads/calculates, and even periodically reloads/recalculates, the SQL data into memory structures of some sort, and then access the in-memory data from your other script through a socket. A sketch of the idea follows.
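A hedged sketch of that idea using the standard library's multiprocessing.connection; the port, authkey, the query protocol, and the load_and_calculate_sql_data helper are illustrative assumptions:

# server.py -- loads once, then serves the in-memory data forever
from multiprocessing.connection import Listener

data = load_and_calculate_sql_data()  # assumed: the expensive one-time step

with Listener(("localhost", 6000), authkey=b"secret") as listener:
    while True:
        with listener.accept() as conn:
            key = conn.recv()         # another script sends a query key...
            conn.send(data.get(key))  # ...and gets the cached value back

# client.py -- any other script can fetch values without recalculating
from multiprocessing.connection import Client

with Client(("localhost", 6000), authkey=b"secret") as conn:
    conn.send("some_key")
    print(conn.recv())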
I am about to get a bunch of python scripts from an untrusted source.
I'd like to be sure that no part of the code can hurt my system, meaning:
(1) the code is not allowed to import ANY MODULE
(2) the code is not allowed to read or write any data, connect to the network etc
(the purpose of each script is to loop through a list, compute some data from input given to it and return the computed value)
before I execute such code, I'd like to have a script 'examine' it and make sure that there's nothing dangerous there that could hurt my system.
I thought of using the following approach: check that the word 'import' is not used (so we are guaranteed that no modules are imported)
Yet it would still be possible for the user (if desired) to write code to read/write files etc. (say, using open).
Then here comes the question:
(1) where can I get a 'global' list of python methods (like open)?
(2) Is there some code that I could add to each script that is sent to me (at the top) that would make some 'global' methods invalid for that script (for example, any use of the keyword open would lead to an exception)?
I know that there are some solutions of python sandboxing. but please try to answer this question as I feel this is the more relevant approach for my needs.
EDIT: suppose that I make sure that no import is in the file, and that no potentially harmful methods (such as open, eval, etc.) are in it. Can I conclude that the file is SAFE? (Can you think of any other 'dangerous' ways that built-in methods can be run?)
This point hasn't been made yet, and should be:
You are not going to be able to secure arbitrary Python code.
A VM is the way to go unless you want security issues up the wazoo.
You can still obfuscate import without using eval:
s = '__imp'
s += 'ort__'
f = globals()['__builtins__'].__dict__[s]
** BOOM **
Things you'd have to check for: built-in functions and keywords.
Note that you'll need to look for both "file" and "open", as both can open files.
Also, as others have noted, this isn't 100% certain to stop someone determined to insert malicious code.
An approach that should work better than string matching is to use the ast module: parse the Python code, do your whitelist filtering on the tree (e.g. allow only basic operations), then compile and run the tree. A sketch follows.
See this nice example by Andrew Dalke on manipulating ASTs.
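A hedged sketch of that idea, assuming Python 3.8+ (where literals parse as ast.Constant); the ALLOWED_NODES set is purely illustrative, and per the other answers AST whitelisting alone is not a complete sandbox:

import ast

ALLOWED_NODES = {
    ast.Module, ast.Expr, ast.Assign, ast.Name, ast.Load, ast.Store,
    ast.Constant, ast.BinOp, ast.UnaryOp, ast.Add, ast.Sub, ast.Mult,
    ast.Div, ast.For, ast.List, ast.Tuple,
}

def check_whitelist(source):
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if type(node) not in ALLOWED_NODES:
            raise ValueError("disallowed syntax: %s" % type(node).__name__)
    return tree

# Only basic assignments, arithmetic, and loops survive the filter.
tree = check_whitelist("x = 1\nfor i in [1, 2, 3]: x = x + i")
exec(compile(tree, "<user code>", "exec"))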
Built-in functions/keywords to watch for:
eval
exec
__import__
open
file
input
execfile
print can be dangerous if you have one of those dumb shells that execute code on seeing certain output
stdin
__builtins__
globals() and locals() must be blocked, otherwise they can be used to bypass your rules
There are probably tons of others that I didn't think about.
Unfortunately, crap like this is possible...
object().__reduce__()[0].__globals__["__builtins__"]["eval"]("open('/tmp/l0l0l0l0l0l0l','w').write('pwnd')")
So it turns out that keywords, import restrictions, and in-scope-by-default symbols alone are not enough; you need to verify the entire object graph...
Use a Virtual Machine instead of running it on a system that you are concerned about.
Without a sandboxed environment, it is impossible to prevent a Python file from doing harm to your system aside from not running it.
It is easy to create a Cryptominer, delete/encrypt/overwrite files, run shell commands, and do general harm to your system.
If you are on Linux, you should be able to use Docker to sandbox your code.
For more information, see this GitHub issue: https://github.com/raxod502/python-in-a-box/issues/2.
I did come across this on GitHub, so something like it could be used, but it has a lot of limitations.
Another approach would be to create another Python file which parses the original one, removes the bad code, and runs the file. However, that would still be hit-and-miss.