Exactly how long do Django/Python/FastCGI Processes last? - python

I have been working on a website in Django, served using FCGI set up using an autoinstaller and a custom templating system.
As I have it set up now, each View is an instance of a class that is bound to a template file at load time, not at execution time. That is, the class is bound to the template via a decorator:
@include("page/page.xtag")  # bind template to view class
class Page(Base):
    def main(self):  # main end-point to retrieve web page
        blah = get_some_stuff()
        return self.template.main(data=blah)  # evaluates template using some data
One thing I have noticed is that since FCGI does not create a new process and reload all the modules/classes on every request, changes to the template do not automatically appear on the website until after I force a restart (i.e. by editing/saving a Python file).
The web pages also contain lots of data that is stored in .txt files in the filesystem. For example, I will load big snippets of code from separate files rather than leaving them in the template (where they clutter it up) or in the database (where it is inconvenient to edit them). Knowing that the process is persistent, I created an ad-hoc memcache by saving the text I loaded in a static dictionary in one of my classes:
import os

class XLoad:
    rawCache = {}  # {name: (time, text)}

    @staticmethod
    def loadRaw(source):
        latestTime = os.stat(source).st_mtime
        # "<=" rather than "<": an unchanged file has the same mtime as the cached entry
        if source in XLoad.rawCache and latestTime <= XLoad.rawCache[source][0]:
            # if the cached version of the file is up to date, use it
            return XLoad.rawCache[source][1]
        else:
            # otherwise read it from disk, dump it in the cache and use that
            with open(source) as f:
                text = f.read()
            XLoad.rawCache[source] = (latestTime, text)
            return text
This sped everything up considerably, because the two dozen or so code snippets which I was loading one by one from the filesystem were now being taken directly from the process's memory. Every time I forced a restart, it would be slow for one request while the cache filled up, then become blazing fast again.
My question is, what exactly determines how/when the process gets restarted, the classes and modules reloaded, and the data I keep in my static dictionary purged? Is it dependent on my installation of Python, or Django, or Apache, or FastCGI? Is it deterministic, based on time, on number of requests, on load, or pseudo-random? And is it safe to do this sort of in-memory caching (which really is very easy and convenient!), or should I look into some proper way of caching these file reads?

It sounds like you already know this.
When you edit a Python file.
When you restart the server.
When there is a nonrecoverable error.
Also known as "only when it has to".
Caching like this is fine -- you're doing it whenever you store anything in a variable. Since the information is read only, how could this not be safe? Try not to write changes to a file right after you've restarted the server; but the worst thing that could happen is one page view gets messed up.
There is a simple way to confirm all this -- logging. Have your decorators log when they are called, and log when you have to load a file from disk.

In addition to the reasons already mentioned, Apache can be configured to terminate idle FastCGI processes after a specified timespan.

Related

Attribute system similar to HTTP Headers for local files

I am in the process of writing a program and need some guidance. Essentially, I am trying to determine if a file has some marker or flag attached to it, sort of like the attributes in an HTTP header.
If such a marker exists, that file will be manipulated in some way (moved to another directory).
My question is:
Where exactly should I be storing this flag/marker? Do files have a system similar to HTTP headers? I don't want to access or manipulate the contents of the file, just some kind of property of the file that can be edited without corrupting the actual file, and it must be fairly universal among file types, as my potential domain of file types is unbounded. I have some experience with Web APIs, so I am familiar with HTTP headers and JSON. Does any similar system exist for local files in Windows? I am especially interested in hearing from anyone who has professional/industry knowledge of common techniques that programmers use to store metadata in files in order to access it later. Or if anyone can point me in the right direction, as I am unsure what I should be researching.
For the record, I am going to write a program for Windows probably using Golang or Python. And the files I am going to manipulate will be potentially all common ones (.docx, .txt, .pdf, etc.)
Metadata you wish to add is best kept in a separate file or database for all files.
Or in another file with same name and different extension or prefix, that you can make hidden.
Relying on a file system is very tricky and your data will be bound by the restrictions and capabilities of the file system your file is stored on.
And, you cannot count on your data remaining intact as any application may wish to change these flags.
And some of those have very specific, clearly defined use, such as creation time, modification time, access time...
See, if you only need to flag the document, you could use the creation time, which stays unchanged throughout the life of the document (until it is copied), to store your flags. :D
Very dirty business, unprofessional, unreliable and all that.
But it's a solution. Poor one, but exists.
I do not know that FAT32 or NTFS file systems support any extra bits for flagging except those already used by the OS.
The Unix ext family of file systems does support some extra bits. Even then you should be careful in case some other important application makes use of them for something.
Mac OS may support some metadata by itself, but I am not 100% sure.
On Windows, you have one more option for associating extra data with a file, but I wouldn't use that either.
The NTFS file system (FAT does not support this) has a feature called alternate data streams.
In essence, the same file can have multiple data streams, i.e. more than one set of contents under the same file node.
To put it more plainly: the same file contains two different files.
When you open the file normally only main stream is visible to the application. Applications must check whether the other streams are present and choose the one they want to follow.
So, you may choose to store metadata under the second stream of the file.
But, what if all streams are taken?
What is more, anti-virus programs may block your access to the metadata out of paranoia, or at least ask for permission.
I don't know why MS included that option, probably for file duplication or something, but bad hackers made use of the fact that you can store some data under an existing regular file without anybody being aware of it.
Imagine a virus writing its copy into another stream of one of the programs already there.
All that is needed for it to start instead of your old program the next time you run it is a batch script, added to the task scheduler, that swaps the two streams and makes the virus data the main one.
Nasty trick! So when this feature started to be abused, anti-virus software started restricting files with multiple streams, so it's like this feature doesn't exist.
If you want to add some metadata using OS's technology, use Windows registry,
but even that is unwise.
What to tell you?
Don't add metadata to the files themselves; keep a separate file, or index your data in special files with the same name as the file you are referring to, in the same folder.
If you are dealing with binary files like docx and pdf, you're best off storing the metadata in separate files or in an SQLite file.
Metadata is usually stored separately from file contents, in data structures called inodes (at least on Unix systems; Windows probably has something similar). But you probably don't want to get that deep into the rabbit hole.
If your goal is to query the system based on metadata, then it would be easier and more efficient to use something like SQLite. Having the metadata in the file would mean that you would need to open the file, read it into memory from disk, and then check the metadata, i.e. slower queries.
If you don't need to query based on metadata, then storing metadata in the file might make sense. It would reduce the dependencies in your application, but in order to access the contents of the file through Word or Adobe Reader, you'd need to strip the metadata before handing it off to the application. Not worth the hassle, usually.
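As a rough illustration of the separate-database approach recommended above, here is a minimal sketch using Python's standard `sqlite3` module. The table name `file_flags` and the helper functions are hypothetical, chosen only for this example; the files themselves are never opened or modified.

```python
import sqlite3


def open_metadata_db(db_path):
    """Open (or create) a sidecar database that maps file paths to flags."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS file_flags ("
        " path TEXT NOT NULL,"
        " flag TEXT NOT NULL,"
        " PRIMARY KEY (path, flag))"
    )
    return conn


def set_flag(conn, path, flag):
    """Attach a flag to a file path; setting the same flag twice is a no-op."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR IGNORE INTO file_flags (path, flag) VALUES (?, ?)",
            (path, flag),
        )


def files_with_flag(conn, flag):
    """Return all paths carrying the given flag, sorted by path."""
    rows = conn.execute(
        "SELECT path FROM file_flags WHERE flag = ? ORDER BY path", (flag,)
    )
    return [row[0] for row in rows]
```

One design consequence to keep in mind: because the flag is keyed by path, the program that moves a file also has to update or delete its row, otherwise the database and the directory drift apart.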

Python dumbdbm, when will data be written back to disk?

I'm using Python2.7's dumbdbm, but this question also applies to Python3's dbm.dumb.
The documentation says:
dumbdbm.sync()
Synchronize the on-disk directory and data files. This method is called by the sync() method of Shelve objects.
I've got three questions:
If I don't call sync, will disk file get updated?
And does this function always write data back to disk, not inverse?
What if I call close?
One — perhaps the best if not only — way to answer questions like this that aren't specifically addressed in the documentation is to read the source code (when it's available, as it is here).
The dumbdbm.py file should be in your /Python/Lib directory and can also be viewed online in your browser through the Mercurial source code revision control system at:
https://hg.python.org/cpython/file/2.7/Lib/dumbdbm.py
The first thing to notice is the longish comment at the beginning of the private _Database class (which is what a dumbdbm database really is), because it deals with what seems to be the overall theme of your questions:
class _Database(UserDict.DictMixin):
    # The on-disk directory and data files can remain in mutually
    # inconsistent states for an arbitrarily long time (see comments
    # at the end of __setitem__). This is only repaired when _commit()
    # gets called. One place _commit() gets called is from __del__(),
    # and if that occurs at program shutdown time, module globals may
    # already have gotten rebound to None. Since it's crucial that
    # _commit() finish successfully, we can't ignore shutdown races
    # here, and _commit() must not reference any globals.
In-depth information about specific methods can be found by reading the source code for them. Given that, here's what I think the answers to your questions would be for version 2.7 of Python:
If I don't call sync, will disk file get updated?
From the preceding comment, it sounds like it will as long as your program shuts down gracefully.
Beyond that it depends on the methods that have been called. Some may, but only partially. For instance, it looks like __setitem__() does, depending on whether the item is for an entirely new key or an existing one. For the latter case there's a comment at the end of the part that deals with it that says (realizing that _commit() is just another name for sync()):
Note that _index may be out of synch with the directory file now:
_setval() and _addval() don't update the directory file. This also means that the on-disk directory and data files are in a mutually
inconsistent state, and they'll remain that way until _commit() is
called. Note that this is a disaster (for the database) if the
program crashes (so that _commit() never gets called).
And does this function always write data back to disk, not inverse?
sync() / _commit() does not appear to load any data back into memory from the disk.
What if I call close?
close() just calls _commit() and then sets all internal data structures to None, preventing any further database operations.
In conclusion, for a somewhat humorous take on the meta-subject here, I suggest you read Learn to Read the Source, Luke.
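To make the sync()/close() behaviour described above concrete, here is a small sketch using Python 3's `dbm.dumb` (the successor of `dumbdbm`). The function name `write_then_reopen` and the key and values are made up for illustration, and it demonstrates only the graceful path: it says nothing about what survives a crash between writes.

```python
import dbm.dumb


def write_then_reopen(base_path):
    """Write a key, sync, overwrite it, close, then read it back from disk."""
    db = dbm.dumb.open(base_path, "c")  # "c": create the files if missing
    db["greeting"] = b"hello"
    db.sync()                    # force the directory file to match the data file
    db["greeting"] = b"goodbye"  # changed again after the explicit sync
    db.close()                   # close() commits before releasing the database

    reopened = dbm.dumb.open(base_path, "r")
    try:
        return reopened["greeting"]
    finally:
        reopened.close()
```

Because close() commits, the value read back is the one written after the explicit sync(), which matches the reading of the source given above.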

Best way to avoid data loss in a high-load Django app?

Imagine a quite complex Django application with both frontend and backend parts. Some users modify some data on the frontend part. Some scripts modify the same data periodically on the backend part.
Example:
instance = SomeModel.objects.get(...)
# (long-running part where various fields are changed, takes from 3 to 20 seconds)
instance.field = 123
instance.another_field = 'abc'
instance.save()
If somebody (or something) changes the instance while that part is changing some fields, then those changes will be lost, because the instance will be saved later, overwriting them with the data held in the Python (Django) object. In other words, if code reads data, waits for some time, and then saves it back, only the last 'saver' keeps its data; all the earlier ones lose their changes.
It's a "high-load" app, the database load (we use Postgres) is quite high and I'd like to avoid anything that would cause a significant increase of the DB activity or memory taken.
Another issue: we have many signals attached, and even the save() method overridden, so I'd like to avoid anything that might break the signals or be incompatible with custom save() or update() methods.
What would you recommend in this situation? Any special app for that? Transactions? Anything else?
Thank you!
The correct way to protect against this is to use select_for_update to make sure that the data doesn't change between reading and writing. However this causes the row to be locked for updates so this might slow down your application significantly.
One solution might be to read the data and perform your long-running task first. Then, before saving, you start a transaction, read the data again, this time with select_for_update, and verify that the original data hasn't changed. If the data is still the same, you save. If the data has changed, you abort and re-run the long-running task. That way you hold the lock for as short a time as possible.
Something like:
from django.db import transaction

success = False
while not success:
    instance1 = SomeModel.objects.get(...)
    # (long-running part)
    with transaction.atomic():
        instance2 = SomeModel.objects.select_for_update().get(...)
        # (compare relevant data from instance1 vs instance2)
        if unchanged:
            # (make the changes on instance2)
            instance2.field = 123
            instance2.another_field = 'abc'
            instance2.save()
            success = True
Whether this is a viable approach depends on what exactly your long-running task is. And a user might still overwrite the data you save here.
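A related variant of the "verify before saving" idea is a version column with a compare-and-swap update, which avoids holding a row lock at all. The sketch below is a different technique from the select_for_update code above, shown with the standard `sqlite3` module so that it runs on its own; the table `some_model`, the `version` column and the function name are assumptions for illustration, not part of the asker's schema.

```python
import sqlite3


def save_if_unchanged(conn, row_id, expected_version, new_field):
    """Write only if the row still has the version we originally read.

    Returns True if the update was applied, False if someone else
    changed the row in the meantime (the caller should then re-read and retry).
    """
    with conn:  # one short transaction: commit on success, roll back on error
        cursor = conn.execute(
            "UPDATE some_model SET field = ?, version = version + 1 "
            "WHERE id = ? AND version = ?",
            (new_field, row_id, expected_version),
        )
    return cursor.rowcount == 1
```

In Django the same check could be written as a filtered queryset update, since QuerySet.update() returns the number of matched rows. Note, however, that update() does not call save() or send the pre_save/post_save signals, which matters given the signals and custom save() mentioned in the question.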

How to modify a large file remotely

I have a large XML file, ~30 MB.
Every now and then I need to update some of the values. I am using the ElementTree module to modify the XML. I am currently fetching the entire file, updating it and then uploading it again, so there is ~60 MB of data transfer every time. Is there a way I can update the file remotely?
I am using the following code to update the file.
import xml.etree.ElementTree as ET

tree = ET.parse("feed.xml")
root = tree.getroot()
skus = ["RUSSE20924", "PSJAI22443"]
qtys = [2, 3]
for child in root:
    sku = child.find("Product_Code").text.encode("utf-8")
    if sku in skus:
        print "found"
        i = skus.index(sku)
        child.find("Quantity").text = str(qtys[i])
        child.set('updated', 'yes')
tree.write("feed.xml")
Modifying a file directly via FTP without uploading the entire thing is not possible except when appending to a file.
The reason is that there are only three commands in FTP that actually modify a file (Source):
APPE: Appends to a file
STOR: Uploads a file
STOU: Creates a new file on the server with a unique name
What you could do
Track changes
Cache the remote file locally and track changes to the file using the MDTM command.
Pros:
Will halve the required data transfer in many cases.
Hardly requires any change to existing code.
Almost zero overhead.
Cons:
Other clients will have to download the entire thing every time something changes (no change from the current situation).
Split up into several files
Split up your XML into several files. (One per product code?)
This way you only have to download the data that you actually need.
Pros:
Less data to transfer
Allows all scripts that access the data to only download what they need
Combinable with suggestion #1
Cons:
All existing code has to be adapted
Additional overhead when downloading or updating all the data
Switch to a delta-sync protocol
If the storage server supports it, switching to a delta synchronization protocol like rsync would help a lot, because these transmit only the changes (with little overhead).
Pros:
Less data transfer
Requires little change to existing code
Cons:
Might not be available
Do it remotely
You already pointed out that you can't but it still would be the best solution.
What won't help
Switch to a network filesystem
As somebody in the comments already pointed out switching to a network file system (like NFS or CIFS/SMB) would not really help because you cannot actually change parts of the file unless the new data has the exact same length.
What to do
Unless you can do delta synchronization I'd suggest to implement some caching on the client side first and if it doesn't help enough to then split up your files.
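The client-side caching suggestion could be sketched roughly as follows with the standard `ftplib` module. This is only an outline under assumptions: the helper name `refresh_if_changed` and the `.mdtm` side file are made up for illustration, and it presumes the server supports the MDTM command, which not every FTP server does.

```python
import ftplib  # only needed for the usage shown in the comment at the bottom
import os


def refresh_if_changed(ftp, remote_name, local_path):
    """Download remote_name only if its MDTM timestamp differs from the cached one.

    Returns True if a download happened, False if the local copy was reused.
    """
    stamp_path = local_path + ".mdtm"  # hypothetical side file with the last timestamp
    # MDTM replies with something like "213 20240101120000"; keep the timestamp part.
    remote_stamp = ftp.sendcmd("MDTM " + remote_name).split(None, 1)[1].strip()

    if os.path.exists(local_path) and os.path.exists(stamp_path):
        with open(stamp_path) as handle:
            if handle.read().strip() == remote_stamp:
                return False  # local cache is still current, skip the download

    with open(local_path, "wb") as out:
        ftp.retrbinary("RETR " + remote_name, out.write)
    with open(stamp_path, "w") as handle:
        handle.write(remote_stamp)
    return True


# Usage (host and credentials are placeholders):
#   ftp = ftplib.FTP("ftp.example.com")
#   ftp.login("user", "password")
#   refresh_if_changed(ftp, "feed.xml", "feed.xml")
```

This only saves the download half of the round trip; after editing the local copy, the whole file still has to be uploaded with STOR.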

Keep persistent variables in memory between runs of Python script

Is there any way of keeping a result variable in memory so I don't have to recalculate it each time I run the beginning of my script?
I am doing a long (5-10 sec) series of the exact same operations on a data set (which I am reading from disk) every time I run my script.
This wouldn't be too much of a problem since I'm pretty good at using the interactive editor to debug my code in between runs; however sometimes the interactive capabilities just don't cut it.
I know I could write my results to a file on disk, but I'd like to avoid doing so if at all possible. This should be a solution which generates a variable the first time I run the script, and keeps it in memory until the shell itself is closed or until I explicitly tell it to fizzle out. Something like this:
# Check if variable already created this session
in_mem = var_in_memory()  # Returns pointer to var, or False if not in memory yet
if not in_mem:
    # Read data set from disk
    with open('mydata', 'r') as in_handle:
        mytext = in_handle.read()
    # Extract relevant results from data set
    mydata = parse_data(mytext)
    result = initial_operations(mydata)
    in_mem = store_persistent(result)
I've an inkling that the shelve module might be what I'm looking for here, but looks like in order to open a shelve variable I would have to specify a file name for the persistent object, and so I'm not sure if it's quite what I'm looking for.
Any tips on getting shelve to do what I want it to do? Any alternative ideas?
You can achieve something like this using the reload global function to re-execute your main script's code. You will need to write a wrapper script that imports your main script, asks it for the variable it wants to cache, caches a copy of that within the wrapper script's module scope, and then when you want (when you hit ENTER on stdin or whatever), it calls reload(yourscriptmodule) but this time passes it the cached object such that yourscript can bypass the expensive computation. Here's a quick example.
wrapper.py
import sys
import mainscript

part1Cache = None
if __name__ == "__main__":
    while True:
        if not part1Cache:
            part1Cache = mainscript.part1()
        mainscript.part2(part1Cache)
        print "Press enter to re-run the script, CTRL-C to exit"
        sys.stdin.readline()
        reload(mainscript)
mainscript.py
def part1():
    print "part1 expensive computation running"
    return "This was expensive to compute"

def part2(value):
    print "part2 running with %s" % value
While wrapper.py is running, you can edit mainscript.py, add new code to the part2 function and be able to run your new code against the pre-computed part1Cache.
To keep data in memory, the process must keep running. Memory belongs to the process running the script, NOT to the shell. The shell cannot hold memory for you.
So if you want to change your code and keep your process running, you'll have to reload the modules when they're changed. If any of the data in memory is an instance of a class that changes, you'll have to find a way to convert it to an instance of the new class. It's a bit of a mess. Not many languages were ever any good at this kind of hot patching (Common Lisp comes to mind), and there are a lot of chances for things to go wrong.
If you only want to persist one object (or object graph) for future sessions, the shelve module probably is overkill. Just pickle the object you care about. Do the work and save the pickle if you have no pickle-file, or load the pickle-file if you have one.
import os
import cPickle as pickle

pickle_filepath = "/path/to/picklefile.pickle"

if not os.path.exists(pickle_filepath):
    # Read data set from disk
    with open('mydata', 'r') as in_handle:
        mytext = in_handle.read()
    # Extract relevant results from data set
    mydata = parse_data(mytext)
    result = initial_operations(mydata)
    # Pickle files should be opened in binary mode
    with open(pickle_filepath, 'wb') as pickle_handle:
        pickle.dump(result, pickle_handle)
else:
    with open(pickle_filepath, 'rb') as pickle_handle:
        result = pickle.load(pickle_handle)
Python's shelve is a persistence solution for pickled (serialized) objects and is file-based. The advantage is that it stores Python objects directly, meaning the API is pretty simple.
If you really want to avoid the disk, the technology you are looking for is an "in-memory database." Several alternatives exist; see this SO question: in-memory database in Python.
Weirdly, none of the earlier answers here mention simple text files. The OP says they don't like the idea, but as this is becoming a canonical for duplicates which might not have that constraint, this alternative deserves a mention. If all you need is for some text to survive between invocations of your script, save it in a regular text file.
def main():
    # Before start, read data from previous run
    try:
        with open('mydata.txt', encoding='utf-8') as statefile:
            data = statefile.read().rstrip('\n')
    except FileNotFoundError:
        data = "some default, or maybe nothing"
    updated_data = your_real_main(data)
    # When done, save new data for next run
    with open('mydata.txt', 'w', encoding='utf-8') as statefile:
        statefile.write(updated_data + '\n')
This easily extends to more complex data structures, though then you'll probably need to use a standard structured format like JSON or YAML (for serializing data with tree-like structures into text) or CSV (for a matrix of columns and rows containing text and/or numbers).
Ultimately, shelve and pickle are just glorified generalized versions of the same idea; but if your needs are modest, the benefits of a simple textual format which you can inspect and update in a regular text editor, and read and manipulate with ubiquitous standard tools, and easily copy and share between different Python versions and even other programming languages as well as version control systems etc, are quite compelling.
As an aside, character encoding issues are a complication which you need to plan for; but in this day and age, just use UTF-8 for all your text files.
Another caveat is that beginners are often confused about where to save the file. A common convention is to save it in the invoking user's home directory, though that obviously means multiple users cannot share this data. Another is to save it in a shared location, but this then requires an administrator to separately grant write access to this location (except I guess on Windows; but that then comes with its own tectonic plate of other problems).
The main drawback is that text is brittle if you need multiple processes to update the file in rapid succession, and slow to handle if you have lots of data and need to update parts of it frequently. For these use cases, maybe look at a database (probably start with SQLite, which is robust and nimble, and included in the Python standard library; scale up to Postgres etc. if you have enterprise-grade needs).
And, of course, if you need to store native Python structures, shelve and pickle are still there.
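For the JSON variant mentioned above, a minimal sketch could look like this. The helper names `load_state` and `save_state` are hypothetical, and it assumes the state consists only of JSON-compatible types (dicts, lists, strings, numbers, booleans, None).

```python
import json


def load_state(path, default):
    """Return the state saved by a previous run, or default on the first run."""
    try:
        with open(path, encoding="utf-8") as handle:
            return json.load(handle)
    except FileNotFoundError:
        return default


def save_state(path, state):
    """Write the state as indented JSON so it stays readable in a text editor."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(state, handle, indent=2)
        handle.write("\n")
```

A script would call load_state() at startup, work with the returned structure, and call save_state() before exiting.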
This is an OS-dependent solution...
$ mkfifo inpipe
#!/usr/bin/python3
# first_process.py
complicated_calculation()
while True:
    with open('inpipe') as f:
        try:
            print(exec(f.read()))
        except Exception as e:
            print(e)
$ ./first_process.py &
$ cat second_process.py > inpipe
This will allow you to change and redefine variables in the first process without copying or recalculating anything. It should be the most efficient solution compared to multiprocessing, memcached, pickle, shelve modules or databases.
This is really nice if you want to edit and redefine second_process.py iteratively in your editor or IDE until you have it right without having to wait for the first process (e.g. initializing a large dict, etc.) to execute each time you make a change.
You can do this but you must use a Python shell. In other words, the shell that you use to start Python scripts must be a Python process. Then, any global variables or classes will live until you close the shell.
Look at the cmd module, which makes it easy to write a shell program. You can even arrange for any commands that are not implemented in your shell to be passed to the system shell for execution (without closing your shell). Then you would have to implement some kind of command, prun for instance, that runs a Python script by using the runpy module.
http://docs.python.org/library/runpy.html
You would need to use the init_globals parameter to pass your special data to the program's namespace, ideally a dict or a single class instance.
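A bare-bones sketch of such a shell is shown below, assuming Python 3. The class name `CacheShell`, the `prun` command and the injected global name `cache` are illustrative choices rather than anything prescribed by the cmd or runpy modules; only `cmd.Cmd`, `runpy.run_path` and its `init_globals` parameter come from the standard library.

```python
import cmd
import os
import runpy


class CacheShell(cmd.Cmd):
    """Tiny shell whose process keeps a dict alive between script runs."""

    prompt = "(cache) "

    def __init__(self):
        super().__init__()
        self.cache = {}  # lives as long as this shell process does

    def do_prun(self, arg):
        """prun SCRIPT: run a Python script with the shared cache injected."""
        try:
            # The script sees a global named "cache" that is this very dict,
            # so anything it stores there is available on the next run.
            runpy.run_path(arg, init_globals={"cache": self.cache},
                           run_name="__main__")
        except Exception as exc:
            print("script failed:", exc)

    def default(self, line):
        """Pass any unknown command on to the system shell."""
        os.system(line)

    def do_EOF(self, arg):
        """Exit the shell on end-of-file (Ctrl-D)."""
        return True


# To start the interactive loop: CacheShell().cmdloop()
```

A script run this way could begin with something like `if "result" not in cache: cache["result"] = expensive()`, so the expensive step runs only on the first prun.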
You could run a persistent script on the server, started through the OS, which loads/calculates, and even periodically reloads/recalculates, the SQL data into in-memory structures of some sort, and then access the in-memory data from your other script through a socket.
