Say I store a password in plain text as a string in a variable called passWd.
How does Python release this variable once I discard it (for instance, with del passWd or passWd = 'new random data')?
Is the string stored as a byte array, meaning it can be overwritten in the memory location where it originally lived? Or is it fixed in a memory area that can't be modified, so that assigning a new value allocates a new memory area and the old area is discarded but never overwritten with nulls?
I'm asking how Python handles the safety of memory areas and would like to know more about it, mainly because I'm curious :)
From what I've gathered so far, using del (or __del__) does not cause the interpreter to release that variable's memory automatically, which can cause issues, and I'm also not sure that del is all that thorough about deleting the values. But that's just what I've gathered, nothing in black and white :)
The main reason for asking: I intend to write a hand-over application that gets a string, does some I/O, and passes it along to another subsystem (a bootloader for a Raspberry Pi, for instance), with the interface written in Python (however odd that must sound to some people). I'm not worried that the data is compromised during the I/O calculations, but that a memory dump might occur between the two subsystem hand-overs, or that the system is frozen (say, hibernated) 20 minutes after boot and the password is somehow still in memory despite me doing a del passWd as fast as I could :)
(P.S. I asked on Super User and they referred me here. Sorry for the poor grammar!)
Unless you use custom-coded input methods to get the password, it will be in many more places than just your immutable string. So don't worry too much.
The OS should take care that any data from your process is cleared before the memory is allocated to another process. This may of course fail if the page is copied to disk (swapped out or hibernated).
Secure password entry is not easy. Maybe you can find a special library or module that handles this.
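As a hedged sketch of what such handling can look like: keep the secret in a mutable bytearray rather than an immutable str, so the same memory can be overwritten in place when you're done (use_password is a hypothetical consumer; this does nothing about the extra copies made by input routines or the terminal driver):

passwd = bytearray(b"hunter2")      # in real code, fill this from your own reader
try:
    use_password(passwd)            # hypothetical consumer of the secret
finally:
    for i in range(len(passwd)):    # zero the buffer in place when done
        passwd[i] = 0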
I finally went with two solutions.
One was LD_PRELOAD, to replace the functionality of Python's string handling at a lower level.
The other, somewhat easier option was to develop my own C library with more functionality than Python offers through its standard string handling.
Mainly, the C code has a shred() function that overwrites the memory area where the string "was" stored, plus some other error checks.
However, @Ber gave me a good enough answer to start developing my own solution, since (as he pointed out) there is no secure method in Python: Python stores strings in way too many places and relies on the OS (which, on its own, isn't a bad thing, except when you don't trust the OS you are installing your relatively secure application on).
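For illustration, calling such a C helper from Python via ctypes might look like the sketch below; libsecurestr.so and its shred(buf, len) signature are made up for the example, not a real library:

import ctypes

lib = ctypes.CDLL("./libsecurestr.so")                 # hypothetical C library
lib.shred.argtypes = [ctypes.c_char_p, ctypes.c_size_t]

buf = ctypes.create_string_buffer(b"hunter2")          # mutable C buffer, not a Python str
# ... hand buf over to the subsystem ...
lib.shred(buf, ctypes.sizeof(buf))                     # overwrite the memory in place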
This question already has an answer here: Prevent RAM from paging to swap area (mlock)
I'm working on a password manager application for Linux and I'm using Python for it.
For security reasons, I want to call the mlock system call to avoid the password variable being swapped to the hard drive.
I noticed that Python itself doesn't wrap this function.
So, is there any way I can avoid swapping?
Thanks
For CPython, there is no good answer to this that doesn't involve writing a Python C extension, since mlock works on pages, not objects. The internals of the str object differ from version to version (in Py3.3 and higher, a str may actually hold several copies of the data in memory in different encodings, some inlined after the object structure, some dynamically allocated separately and linked by pointer). Even if you used ctypes to retrieve the necessary addresses and mlock-ed them all through ctypes mlock calls, you'd have a hell of a time determining when to mlock and when to munlock. Since mlock works on pages, you'd have to carefully track how many strings currently live in any given page: if you just mlock and munlock blindly and there is more than one thing to lock in a page, the first munlock would unlock all of them (mlock/munlock is a boolean flag; it doesn't count the locks and unlocks).
Even if you manage that, you would still have a race between password acquisition and mlock, during which the data could be written to swap. And because those cached alternate encodings are computed lazily, mlock-ing the non-NULL pointers at any given moment doesn't guarantee that new encodings won't be populated at other addresses later.
You could partially avoid these problems through careful use of the mmap module and memoryviews (mmap gives you pages of memory, memoryview references said memory without copying it, so ctypes could be used to mlock the page), but you'd have to build it all from scratch (can't use the getpass module because it would store as a str for a moment).
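For what it's worth, a rough POSIX-only sketch of that mmap/memoryview approach (it assumes Linux and libc.so.6; error handling and portability are glossed over):

import ctypes, mmap

libc = ctypes.CDLL("libc.so.6", use_errno=True)

page = mmap.mmap(-1, mmap.PAGESIZE)                    # one anonymous, page-aligned page
buf = (ctypes.c_char * mmap.PAGESIZE).from_buffer(page)
addr = ctypes.addressof(buf)
if libc.mlock(ctypes.c_void_p(addr), ctypes.c_size_t(mmap.PAGESIZE)) != 0:
    raise OSError(ctypes.get_errno(), "mlock failed")

secret = memoryview(page)                              # reference the locked page without copying
secret[:7] = b"hunter2"                                # fill from your own reader, never via a str
# ... use secret[:7] ...
secret[:mmap.PAGESIZE] = b"\x00" * mmap.PAGESIZE       # zero before unlocking
secret.release()
del buf                                                # drop buffer exports before closing
libc.munlock(ctypes.c_void_p(addr), ctypes.c_size_t(mmap.PAGESIZE))
page.close()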
In short, Python doesn't care about swapping or memory protection in the way you want; it trusts the swap file to be configured to your desired security (e.g. disabled or encrypted), neither providing additional protection nor providing the information you'd need to add it in.
I've considered storing the high scores for my game as variables in the code itself, rather than in a text file as I've done so far, because it means fewer additional files are required to run it and awarding yourself 999999 points becomes harder.
However, this would then require me to run self-modifying code to permanently overwrite the global variables representing the scores. I looked into that and, considering that all I really want to do is change global variables, all the material I found was too advanced.
I'd appreciate it if someone could explain how to write self-modifying Python code to do just that, preferably with an example too, as it aids understanding.
My first inclination is to say "don't do that". Self-modifying Python (really, any language) makes it extremely difficult to maintain a versioned library:
- You make a bug fix and need to redistribute: how do you merge data you stored via self-modification?
- It's very hard to authenticate packaging using a hash: once the local version is modified, it's hard to tell which version it originated from, because the SHAs won't match.
- It's unsafe. You could just save and load a Python class that's not stored with your package; however, if that file is user-writable, a foreign process could add arbitrary Python code to it to be evaluated. Kind of like SQL injection, but Python-style.
Python makes it so trivial to load and dump JSON files that, for simple things, I wouldn't think of anything else. Even CSV files are trivial, can be bound to maps, and can be more easily manipulated as data using your favorite spreadsheet editor.
My suggestion: don't use self-modifying Python unless you just want to experiment. It's not a practical solution in the real world, unless you're working in an embedded environment where disk and memory are at a premium.
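If it helps, here is a minimal sketch of the JSON route (the filename and data layout are illustrative):

import json

def load_scores(path="highscores.json"):
    try:
        with open(path) as f:
            return json.load(f)
    except (IOError, ValueError):            # missing or corrupt file
        return {}

def save_scores(scores, path="highscores.json"):
    with open(path, "w") as f:
        json.dump(scores, f, indent=4)

scores = load_scores()
scores["alice"] = max(scores.get("alice", 0), 4200)    # record a new high score
save_scores(scores)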
Before the question, I want to apologize: I'm not a native English speaker and my English is very poor.
I'm writing a Python program that reads a file into a string, analyzes it, and then passes it to some other programs (they can't work with streams).
Simple code like this:
content = open("file").read()
if passItToA(content):
    A.portal(content)
del content
The problem is that the string I read is never released; it usually lives until the end of the process.
I know this is a feature of dynamic languages, but it wastes a lot of memory when I run 1000 duplicate processes at the same time.
Can I release it on demand?
Python relies heavily on garbage collection. To mark a value as garbage (and let the collector do its work on it), just overwrite it:
content = ''
You also can delete the whole variable from the dictionary of variables:
del content
But as far as the string value is concerned, both work the same.
Just make sure that no other variable still holds a reference to that string. In your case, passItToA() and A.portal() must not keep long-lived references to it, or it can never be freed.
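One way to make that easy to guarantee is to confine the string to a function scope, so the reference dies at return (passItToA and A are the names from the question):

def handle_file(path):
    content = open(path).read()
    if passItToA(content):
        A.portal(content)
    # content goes out of scope here; with no other references left,
    # CPython frees the string immediately via reference counting

handle_file("file")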
I have a Python (2.7) script that acts as a server and it will therefore run for very long periods of time. This script has a bunch of values to keep track of which can change at any time based on client input.
What I'm ideally after is something that can keep a Python data structure (with values of types dict, list, unicode, int and float – JSON, basically) in memory, letting me update it however I want (except referencing any of the reference type instances more than once) while also keeping this data up-to-date in a human-readable file, so that even if the power plug was pulled, the server could just start up and continue with the same data.
I know I'm basically talking about a database, but the data I'm keeping will be very simple and probably less than 1 kB most of the time, so I'm looking for the simplest solution possible that can provide me with the described data integrity. Are there any good Python (2.7) libraries that let me do something like this?
Well, since you know we're basically talking about a database, albeit a very simple one, you probably won't be surprised that I suggest you have a look at the sqlite3 module.
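For instance, a minimal sketch (the table and key names are illustrative) that keeps a small JSON-able structure in SQLite, where every update is an atomic, durable transaction:

import sqlite3, json

conn = sqlite3.connect("state.db")
conn.execute("CREATE TABLE IF NOT EXISTS state (key TEXT PRIMARY KEY, value TEXT)")

def save(key, obj):
    with conn:                               # commits on success, rolls back on error
        conn.execute("INSERT OR REPLACE INTO state (key, value) VALUES (?, ?)",
                     (key, json.dumps(obj)))

def load(key, default=None):
    row = conn.execute("SELECT value FROM state WHERE key = ?", (key,)).fetchone()
    return json.loads(row[0]) if row else default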
I agree that you don't need a full-blown database, as it seems that all you want is atomic file writes. You need to solve this problem in two parts: serialisation/deserialisation, and the atomic writing.
For the first part, json or pickle are probably suitable formats for you. JSON has the advantage of being human-readable. It doesn't seem as though this is the primary problem you are facing, though.
Once you have serialised your object to a string, use the following procedure to write a file to disk atomically, assuming a single concurrent writer (at least on POSIX, see below):
import os, platform, json

backup_filename = "output.back.json"
filename = "output.json"

serialised_str = json.dumps(...)
with open(backup_filename, 'wb') as f:
    f.write(serialised_str)
if platform.system() == 'Windows':
    os.unlink(filename)
os.rename(backup_filename, filename)
While os.rename will overwrite an existing file and is atomic on POSIX, this is sadly not the case on Windows. On Windows, there is the possibility that os.unlink will succeed but os.rename will fail, leaving you with only backup_filename and no filename. If you are targeting Windows, you will need to consider this possibility when checking for the existence of filename.
If there is a possibility of more than one concurrent writer, you will have to consider a synchronisation construct.
Any reason for the human readable requirement?
I would suggest looking at sqlite for a simple database solution, or at pickle for a simple way to serialise objects and write them to disk. Neither is particularly human readable though.
Other options are JSON, or XML as you hinted at: use the built-in json module to serialize the objects, then write that to disk. When you start up, check for the presence of that file and load the data if required.
From the docs:
>>> import json
>>> print json.dumps({'4': 5, '6': 7}, sort_keys=True, indent=4)
{
    "4": 5,
    "6": 7
}
Since you mentioned your data is small, I'd go with a simple solution and use the pickle module, which lets you dump a Python object to a file very easily.
Then you just set up a thread that saves your object to a file at defined time intervals.
Not a "libraried" solution, but - if I understand your requirements - simple enough for you not to really need one.
EDIT: you mentioned you wanted to cover the case where a problem occurs during the write itself, effectively making it an atomic transaction. In this case, the traditional way to go is "log-based recovery". It essentially means writing a record to a log file saying "write transaction started", and writing "write transaction committed" when you're done. If a "started" has no corresponding "commit", you roll back.
In this case, I agree that you might be better off with a simple database like SQLite. It might be a slight overkill, but on the other hand, implementing atomicity yourself might be reinventing the wheel a little (and I didn't find any obvious libraries that do it for you).
If you do decide to go the crafty way, this topic is covered on the Process Synchronization chapter of Silberschatz's Operating Systems book, under the section "atomic transactions".
A very simple (though maybe not "transactionally perfect") alternative would be just to record to a new file every time, so that if one is corrupted you have a history. You can even add a checksum to each file to automatically determine whether it's broken.
You are asking how to implement a database which provides ACID guarantees, but you haven't provided a good reason why you can't use one off-the-shelf. SQLite is perfect for this sort of thing and gives you those guarantees.
However, there is KirbyBase. I've never used it and I don't think it makes ACID guarantees, but it does have some of the characteristics you're looking for.
What would be the best way to handle lightweight crash recovery for my program?
I have a Python program that runs a number of test cases and the results are stored in a dictionary which serves as a cache. If I could save (and then restore) each item that is added to the dictionary, I could simply run the program again and the caching would provide suitable crash recovery.
You may assume that the keys and values in the dictionary are easily convertible to strings ie. using either str or the pickle module.
I want this to be completely cross platform - well at least as cross platform as Python is
I don't want to simply write each value out to a file and load it back in, because my program might crash while I am writing the file.
UPDATE: This is intended to be a lightweight module so a DBMS is out of the question.
UPDATE: Alex is correct in that I don't actually need to protect against crashes while writing out, but there are circumstances where I would like to be able to manually terminate it in a recoverable state.
UPDATE: Added a highly limited solution using standard input below.
There's no good way to guard against "your program crashing while writing a checkpoint to a file", but why should you worry so much about that?! What ELSE is your program doing at that time BESIDES "saving checkpoint to a file", that could easily cause it to crash?!
It's hard to beat pickle (or cPickle) for portability of serialization in Python, but that's just about "turning your keys and values to strings". For saving key-value pairs (once stringified), few approaches are safer than just appending to a file (don't pickle to files if your crashes are far, far more frequent than normal, as you suggest they are).
If your environment is incredibly crash-prone for whatever reason (very cheap HW?-), just make sure you close the file (and flush, if the OS is also crash-prone;-), then reopen it for append. This way, the worst that can happen is that the very latest append will be incomplete (due to a crash in the middle of things) -- then you just catch the exception raised by unpickling that incomplete record and redo only the things that weren't saved (whether they weren't completed due to a crash, or were completed but not fully saved, comes to much the same thing in the end).
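A hedged sketch of that append-and-recover loop (the file name and record layout are illustrative):

import pickle

def record(key, value, path="checkpoints.pkl"):
    with open(path, "ab") as f:              # append one pickled record per completed item
        pickle.dump((key, value), f)
        f.flush()                            # add os.fsync(f.fileno()) if the OS is crash-prone

def recover(path="checkpoints.pkl"):
    cache = {}
    try:
        with open(path, "rb") as f:
            while True:
                try:
                    key, value = pickle.load(f)
                except EOFError:             # clean end of file
                    break
                except Exception:            # truncated final record from a crash
                    break
                cache[key] = value
    except IOError:                          # no checkpoint file yet
        pass
    return cache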
If you have the option of checkpointing to a database engine (instead of just doing so to files), consider it seriously! The DB engine will keep transaction logs and ensure ACID properties, making your application-side programming much easier IF you can count on that!-)
The pickle module supports serializing objects to a file (and loading from file):
http://docs.python.org/library/pickle.html
One possibility would be to create a number of smaller files ... each representing a subset of the state that you're trying to preserve and each with a checksum or tag indicating that it's complete as the last line/datum of the file (just before the file is closed).
If the checksum/tag is good, then the rest of the data can be considered valid ... though the program would then have to find all of these files, open and read all of them, and use the metadata you've provided (in their headers or their names?) to determine which ones constitute the most recent cohesive state representation (or checkpoint) from which to continue processing.
Without knowing more about the nature of the data that you're working with it's impossible to be more specific.
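To make the idea concrete, here is one possible shape for such checksum-tagged checkpoint files (the names and layout are illustrative):

import glob, hashlib, json, time

def write_checkpoint(data):
    payload = json.dumps(data)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    name = "checkpoint-%d.json" % int(time.time() * 1000)  # timestamp in the name
    with open(name, "w") as f:
        f.write(payload + "\n" + digest + "\n")            # checksum is the last line

def latest_valid_checkpoint():
    for name in sorted(glob.glob("checkpoint-*.json"), reverse=True):
        with open(name) as f:
            lines = f.read().splitlines()
        if len(lines) >= 2 and hashlib.sha256(lines[0].encode("utf-8")).hexdigest() == lines[1]:
            return json.loads(lines[0])                    # newest complete checkpoint
    return None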
You can use files, of course, or you could use a DBMS just about as easily. Any decent DBMS (PostgreSQL, or MySQL with the proper storage back-ends) can give you ACID guarantees and transactional support. So the data you read back should always be consistent with the constraints in your schema and/or with the transactions (BEGIN, COMMIT, ROLLBACK) that you processed.
A possible advantage of posting your serialized data to a DBMS is that you can host the DBMS on a separate system (which is unlikely to suffer the same instabilities as your test host at the same times).
Pickle/cPickle have problems.
I use the JSON module to serialize objects out. I like it because not only does it work on any OS, but it will work fine in other programming languages, too; many other languages and platforms have readily-accessible JSON deserialization support, which makes it easy to use the same objects in different programs.
Solution with severe restrictions
If I don't worry about crashes while writing out and only want to allow manual termination, I can use standard input to control this. Unfortunately, this can only terminate the program when a control point is reached. That could be solved by creating a new thread to read standard input. This thread could use a global lock to check whether the main thread is inside a critical section (writing to a file) and terminate the program only when it is not. A sketch follows after the list below.
Downsides:
- This is reasonably complex
- It adds an extra thread
- It stops me using standard input for anything else
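A sketch of that watcher thread (save_checkpoint and the termination protocol are illustrative):

import os, sys, threading

write_lock = threading.Lock()                # held by the main thread while writing a file

def watch_stdin():
    sys.stdin.readline()                     # block until the user asks to stop
    with write_lock:                         # wait until no write is in progress
        os._exit(0)                          # terminate at a safe point

watcher = threading.Thread(target=watch_stdin)
watcher.daemon = True
watcher.start()

# In the main loop, hold write_lock around each critical section:
#   with write_lock:
#       save_checkpoint()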