Reloading global Python variables in a long-running process - python

I have Celery Python worker processes that are restarted every day or so. They execute Python/Django programs.
I have set certain quasi-global values that should persist in memory for the duration of the process. Namely, I have certain MySQL querysets that do not change often and are therefore evaluated once and stored as a CONSTANT as soon as the process starts (a bad example being PROFILE = Profile.objects.get(user_id=5)).
Let's say that I want to reset this value in the celery process without exec-ing a whole new program.
This value is imported (and used) in a number of different modules. I'm assuming I'd have to go through each module in sys.modules that imports the CONSTANT and delete/reset the key? Is that right?
This seems very hacky. I usually use external services like Memcached to coordinate memory among multiple processes, but every once in a while, I figure local memory is preferable to over-the-network calls to a NoSQL store.

It's a bit hard to say without seeing some code, but importing just binds a name to an object, exactly like variable assignment: if the underlying data changes, everything referring to it sees the change. Naturally, this only works if it's the parent context (the module) that you've imported; otherwise assignment rebinds your local name rather than updating the shared value.
In other words, if you do this:
from mypackage import mymodule
do_something_with(mymodule.MY_CONSTANT)
# elsewhere
mymodule.MY_CONSTANT = 'new_value'
then all references to mymodule.MY_CONSTANT will get the new value. But if you did this:
from mypackage.mymodule import MY_CONSTANT
# elsewhere
mymodule.MY_CONSTANT = 'new_value'
the original reference won't get the new value, because the MY_CONSTANT name you imported still points at the old object; only the module attribute mymodule.MY_CONSTANT was rebound.
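For the original question's celery case, a minimal sketch of the first pattern above (the refresh() helper and the import path are hypothetical; PROFILE and Profile.objects.get(user_id=5) come from the question):
# mymodule.py (hypothetical module owning the value)
from myapp.models import Profile  # hypothetical import path for the question's model

PROFILE = None

def refresh():
    # re-evaluate the expensive queryset and rebind the module attribute
    global PROFILE
    PROFILE = Profile.objects.get(user_id=5)

refresh()  # evaluated once when the process starts

# elsewhere: always access it through the module, never `from mymodule import PROFILE`
import mymodule
do_something_with(mymodule.PROFILE)

# later, inside the running celery process, to reset the value:
mymodule.refresh()
Because every consumer reads mymodule.PROFILE through the module object, the rebinding inside refresh() is visible everywhere in that process (each worker process still needs its own refresh call).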

Related

python reduce sqlite3 db lookups

I am trying to reduce sqlite3 DB lookups in Python. I am implementing this on a system with only 1 GB of RAM. I want to store the current DB values somewhere I can retrieve them from without consulting the DB again and again. One thing to keep in mind is that each of my Python scripts (processes) has a different trigger point; there is no master script, i.e. I am not controlling all of my scripts from one place.
What I have already ruled out:
I don't want to save/retrieve the data via a file, because I don't want to make read/write operations. In a nutshell, I don't want to go through a file at all (so no to the pickle and shelve Python modules).
I also cannot use in-memory cache modules like memcached and beaker, both because of the memory limit and because these modules are intended for server-side development, whereas I am working with standalone scripts (an IoT device).
I cannot use singleton classes because of the limitations of namespaces and scope. As soon as the scope of one script ends, the singleton instance vanishes, so I cannot keep one singleton instance alive across all of my Python scripts. I cannot use static variables or static methods either, because the instance does not stick around between scripts; everything is volatile and goes back to its initialized value, instead of the current DB values, every time I import the singleton class in another script.
Since the trigger point of each of my Python scripts is different, plain global variables are also impossible to use: a global variable has to be initialized with some fixed value, whereas I want it to hold the current DB values.
I also cannot do memory segmentation, as Python does not allow me to do that.
What else can I do?
Is there any Python library (or a library in another language) that lets me keep the current DB values somewhere so that I can get them from there instead of looking them up in the sqlite3 DB, without doing any read/write operation? (By read/write operation I mean not loading from the hard drive or SD card.)
Thanks in advance; any help is highly appreciated.

Storing and loading str:func Python dictionaries

Alright, so I want to know whether doing this would affect the code, and whether I'm doing it correctly.
So basically, let's say in one file I have a dictionary called commands (inside a class), and in another file an object of that class is created and the dictionary is used. During run-time, I edit the dictionary and add new functions. Now I need to reload the dictionary without restarting the whole script (because that would affect a lot of people using my services). Suppose I send a signal to the script (it's a socket server) indicating that the dictionary should be reloaded. How would I re-import the module after it has already been imported, mid-execution? And would re-importing it affect the objects made from it, or do I have to somehow reload the objects too? (Note that the objects contain an active socket, and I do not wish to kill that socket.)
It is better to store the data in a database like Redis, which supports dictionary-like data. This way you avoid the reloading problem altogether, as the database process makes sure the fetched data is always up to date.
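A minimal sketch of that idea, assuming a hypothetical handlers module and the redis-py client: keep the command-name to handler-name mapping in Redis and resolve the handler with getattr at dispatch time, so new commands can be registered without reloading the dispatching module.
import redis
import handlers  # hypothetical module containing the handler functions

r = redis.Redis()

def register(command, func_name):
    # store the mapping externally so every process sees updates immediately
    r.hset('commands', command, func_name)

def dispatch(command, *args):
    func_name = r.hget('commands', command)
    if func_name is None:
        raise KeyError(command)
    handler = getattr(handlers, func_name.decode())
    return handler(*args)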

Is it faster to create new instance of class/variable or set existing one?

In Python 2.7, (or in programming languages in general), is it faster to create a new instance of a class/variable or to set an existing one to something new?
For example, which is faster to create another_pic.png? This:
my_img = Image.open(cur_directory_path + '\\my_pic.png') # don't need this anymore
new_img = Image.open(cur_directory_path + '\\another_pic.png') # but need this new pic
or this:
my_img = Image.open(cur_directory_path + '\\my_pic.png') # don't need this anymore
my_img = Image.open(cur_directory_path + '\\another_pic.png') # but need this new pic
I ask because I have one Image variable which I "gets around" so to speak in my code, by constantly being reset to various things, and I am wondering if this affects performance at all.
In both cases, you're creating two completely new objects at the exact same speed, so to that end I don't think either one is faster than the other. You're never really "resetting" an object; you're just reassigning a name. All that's happening is you're changing an existing pointer to a new memory location, which is a fraction of a fraction of a second.
The main difference is that with the bottom option, you have left an unused object for the garbage collector to pick up, but deallocating memory is not a very speed-intensive task. It's possible (depending on the number of free objects you have lying around) that it won't even happen before your program ends. But you're also using more memory by keeping two objects around. So if you're constantly loading new images, to the degree that it may impact your memory, it's probably best to keep reassigning the same name. Or you could even invoke the garbage collector manually if you're concerned about running out of memory, but it doesn't sound like you are.
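A tiny illustration of the rebinding point above, using nothing beyond the built-in id():
a = [1, 2, 3]
print(id(a))   # identity of the first list

a = [4, 5, 6]  # rebinds the name; the first list loses its last reference
print(id(a))   # a different object; the old list is now garbage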
They're exactly the same. Both go through the process of importing the image. The variable assignment is only storing a reference to the object. The only difference is that the latter may begin garbage collecting the my_pic.png image sooner since there are no more references to the object.
Technically it is faster to reuse a variable, as long as it stores an object of the same type, than it is to constantly create a new one. This comes down to memory addressing: if you already have a variable (an address in memory associated with it), it is easy to access that slot in memory and update the object located there. The reason the object types should be the same is how memory is allocated for classes and objects when they are created at run time. Creating a new variable to store an object is slower because the interpreter has to find enough free space in memory for the object and then assign that address to the variable; this involves accessing address lookup tables and, depending on the table configuration, adds time. The difference is so small, though, that you shouldn't notice it in any normal application.

Python: How to add/initialize new global vars IN another module?

I looked up other posts on the topic and I couldn't find my situation exactly. It is in a Django app, although I believe it's purely a (newbie) Python question. Here's my situation:
Let's say I have mymodule.py where I have various constants and common functions, and at some point elsewhere in the program, I will want to add (and initialize) another attribute on mymodule (if it has not yet been added):
import mymodule

class UserView(View):
    # this method always gets called first..
    def get(self, request):
        try:
            # check if the attribute exists
            mymodule.user_data
        except AttributeError:
            # add it if it doesn't
            mymodule.user_data = mymodule.get_user_data()
        # continue on..

    # sometime later, this method is called..
    def post(self, request):
        print(mymodule.user_data)
My assumption was that once mymodule.user_data is added, it would persist as a global (module-level) variable. But even though I set it in the get() method first, when I try to read it in the post() method later, I get Error: 'module' object has no attribute 'account'
Does it need to be pre-initialized in mymodule.py, as some empty object? I may not necessarily know what type of object it will be -- how would I do it in Python? (Sorry, coming from JS -- don't shoot!)
You should not do this. Your proposed solution is very dangerous, as now all users will share the same data. You almost certainly don't want that.
For per-user data shared between requests, you should use the session.
Edit
There's no way to know if they are separate processes or not. Your server software (Apache, or whatever) will determine the number of processes to run (based on your settings), and automatically route requests between them. Each process could serve any number of requests before being killed and restarted. So, in all likelihood, two consecutive requests could indeed be served by the same process, in which case the data will indeed collide.
Note that the session data is stored on the server (only a key is stored in the user's cookie), so size shouldn't be a consideration. See the sessions documentation.
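A sketch of what the session-based approach could look like for the view above (assuming the value returned by the question's get_user_data() can be handled by your configured session serializer):
import mymodule
from django.views.generic import View

class UserView(View):
    def get(self, request):
        # compute once per user and keep it in the session between requests
        if 'user_data' not in request.session:
            request.session['user_data'] = mymodule.get_user_data()
        # continue on..

    def post(self, request):
        print(request.session['user_data'])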
You should not want to do that.
But it works as "expected": just do
mymodule.variable = value
anywhere in your code.
So, yes, your example code is setting the variable in the current running program -
but then you hit the part where I said: "you should not want to do that" :-)
Because Django, when running with production settings, will behave differently from a single-process, single-threaded Python application.
In this case, if the variable is not set in mymodule when you try to access it later, it may be because this access is happening in another process entirely (thus "global variables" (actually, in Python, module variables) won't work, since they are set per process).
In this particular case, since you have a function to retrieve your desired value, and you may be worried that it is expensive to compute, you should memoize it: check the documentation on django.utils.functional.memoize (which changes to django.utils.lru_cache.lru_cache in upcoming versions, see https://docs.djangoproject.com/en/dev/releases/1.7/). This way it will be called once per process in your application, even as requests are served from separate processes.
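For instance, with the standard-library equivalent rather than the Django helper (a sketch; the compute_user_data call is a hypothetical stand-in for the expensive part, and the cache is per process, which is exactly the once-per-process behaviour described above):
from functools import lru_cache

@lru_cache(maxsize=None)
def get_user_data():
    # expensive lookup, evaluated the first time it is called in each process
    return compute_user_data()  # hypothetical expensive call
Every later call to get_user_data() within the same process returns the cached result.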
My solution (for now):
In the module mymodule.py, I initialized a dictionary: data = {}
Then in my get() method:
if 'user' not in mymodule.data:
    mymodule.data['user'] = mymodule.get_user_data()
Subsequently, I'm able to retrieve the mymodule.data['user'] object in the post() method (and presumably elsewhere in my code). Seems to work but please let me know if it's an aberration!

How to avoid computation every time a python module is reloaded

I have a Python module that makes use of a huge dictionary global variable. Currently I put the computation code in the top section of the module, so every first import or reload of the module takes more than one minute, which is totally unacceptable. How can I save the computation result somewhere so that the next import/reload doesn't have to compute it? I tried cPickle, but loading the dictionary variable from a file (1.3M) takes approximately the same time as the computation.
To give more information about my problem,
FD = FreqDist(word for word in brown.words()) # this line of code takes 1 min
Just to clarify: the code in the body of a module is not executed every time the module is imported - it is run only once, after which future imports find the already created module, rather than recreating it. Take a look at sys.modules to see the list of cached modules.
However, if your problem is the time it takes for the first import after the program is run, you'll probably need to use some other method than a Python dict. Probably best would be to use an on-disk form, for instance a sqlite database or one of the dbm modules.
For a minimal change in your interface, the shelve module may be your best option: it puts a fairly transparent interface on top of the dbm modules that makes them act like an arbitrary Python dict, allowing any picklable value to be stored. Here's an example:
# Create dict with a million items:
import shelve
d = shelve.open('path/to/my_persistant_dict')
d.update(('key%d' % x, x) for x in xrange(1000000))
d.close()
Then in the next process, use it. There should be no large delay, as lookups are only performed for the key requested on the on-disk form, so everything doesn't have to get loaded into memory:
>>> d = shelve.open('path/to/my_persistant_dict')
>>> print d['key99999']
99999
It's a bit slower than a real dict, and it will still take a long time to load if you do something that requires all the keys (eg. try to print it), but may solve your problem.
Calculate your global var on the first use.
class Proxy:
    @property
    def global_name(self):
        # calculate your global var here, enable cache if needed
        ...

_proxy_object = Proxy()
GLOBAL_NAME = _proxy_object.global_name
Or better yet, access necessery data via special data object.
class Data:
    GLOBAL_NAME = property(...)

data = Data()
Example:
from some_module import data
print(data.GLOBAL_NAME)
See Django settings.
I assume you've pasted the dict literal into the source, and that's what's taking a minute? I don't know how to get around that, but you could probably avoid instantiating this dict upon import... You could lazily-instantiate it the first time it's actually used.
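A minimal sketch of that lazy instantiation (names are hypothetical; the expensive build runs the first time the dict is requested, not at import time):
_big_dict = None

def get_big_dict():
    global _big_dict
    if _big_dict is None:
        _big_dict = build_big_dict()  # hypothetical expensive construction
    return _big_dict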
You could try using the marshal module instead of (c)Pickle; it could be faster. This module is used by Python to store values in a binary format. Note especially the following paragraph, to see if marshal fits your needs:
Not all Python object types are supported; in general, only objects whose value is independent from a particular invocation of Python can be written and read by this module. The following types are supported: None, integers, long integers, floating point numbers, strings, Unicode objects, tuples, lists, sets, dictionaries, and code objects, where it should be understood that tuples, lists and dictionaries are only supported as long as the values contained therein are themselves supported; and recursive lists and dictionaries should not be written (they will cause infinite loops).
Just to be on the safe side, before unmarshalling the dict, make sure that the Python version that unmarshals the dict is the same as the one that did the marshal, since there are no guarantees for backwards compatibility.
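A sketch of using marshal for the question's FreqDist, converted to a plain dict first since marshal only handles built-in types, and written and read in binary mode with the same Python version on both ends:
import marshal

# after the expensive computation, write the plain-dict form once
with open('fd.marshal', 'wb') as f:
    marshal.dump(dict(FD), f)

# later imports load the precomputed result instead of recomputing it
with open('fd.marshal', 'rb') as f:
    FD = marshal.load(f)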
If the 'shelve' solution turns out to be too slow or fiddly, there are other possibilities:
shove
Durus
ZopeDB
pyTables
shelve gets really slow with large data sets. I've been using redis quite successfully, and wrote a FreqDist wrapper around it. It's very fast, and can be accessed concurrently.
You can use a shelve to store your data on disk instead of loading the whole data into memory. So startup time will be very fast, but the trade-off will be slower access time.
Shelve will pickle the dict values too, but it does the (un)pickling not at startup for all the items, but only at access time, for each item itself.
A couple of things that will help speed up imports:
You might try running Python with the -OO flag. This will do some optimizations that can reduce the import time of modules.
Is there any reason why you couldn't break the dictionary up into smaller dictionaries in separate modules that can be loaded more quickly?
As a last resort, you could do the calculations asynchronously so that they won't delay your program until it needs the results. Or maybe even put the dictionary in a separate process and pass data back and forth using IPC if you want to take advantage of multi-core architectures.
With that said, I agree that you shouldn't be experiencing any delay in importing modules after the first time you import it. Here are a couple of other general thoughts:
Are you importing the module within a function? If so, this can lead to performance problems since it has to check and see if the module is loaded every time it hits the import statement.
Is your program multi-threaded? I have seen occasions where executing code upon module import in a multi-threaded app can cause some wonkiness and application instability (most notably with the cgitb module).
If this is a global variable, be aware that global variable lookup times can be significantly longer than local variable lookup times. In this case, you can achieve a significant performance improvement by binding the dictionary to a local variable if you're using it multiple times in the same context.
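For that last point, a minimal sketch of binding the global dictionary to a local name inside a hot function (the counting logic is just a hypothetical example):
def count_rare_words(words, threshold=5):
    fd = FD  # one global lookup, then fast local lookups inside the loop
    return sum(1 for w in words if fd[w] < threshold)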
With that said, it's a tad bit difficult to give you any specific advice without a little bit more context. More specifically, where are you importing it? And what are the computations?
Factor the computationally intensive part into a separate module. Then at least on reload, you won't have to wait.
Try dumping the data structure using protocol 2. The call to try would be cPickle.dump(FD, f, 2), where f is a file object opened for writing in binary mode. From the docstring for cPickle.Pickler:
Protocol 0 is the only protocol that can be written to a file opened in text mode and read back successfully. When using a protocol higher than 0, make sure the file is opened in binary mode, both when pickling and unpickling.
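A sketch of that, with the file opened in binary mode as the docstring requires (cPickle is the Python 2 name; on Python 3 the same code works with import pickle as cPickle):
import cPickle

# dump once, after the expensive computation
with open('fd.pickle', 'wb') as f:
    cPickle.dump(FD, f, 2)  # protocol 2, so the file must be binary

# later runs load the result instead of recomputing it
with open('fd.pickle', 'rb') as f:
    FD = cPickle.load(f)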
I'm going through this same issue...
shelve, databases, etc. are all too slow for this type of problem. You'll need to take the hit once and insert the data into an in-memory key/value store like Redis. It will just live there in memory (warning: it can use up a good amount of memory, so you may want a dedicated box). You'll never have to reload it; you just look the keys up in memory:
from redis import Redis

r = Redis()
r.set(key, word)
word = r.get(key)
Expanding on the delayed-calculation idea, why not turn the dict into a class that supplies (and caches) elements as necessary?
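A sketch of that delayed-calculation wrapper (the compute callable is hypothetical; entries are computed on first access and cached for later lookups):
class LazyDict(object):
    def __init__(self, compute):
        self._compute = compute  # function mapping key -> value
        self._cache = {}

    def __getitem__(self, key):
        if key not in self._cache:
            self._cache[key] = self._compute(key)
        return self._cache[key]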
You might also use psyco to speed up overall execution...
Or you could just use a database for storing the values. Check out SQLObject, which makes it very easy to store stuff in a database.
There's another pretty obvious solution for this problem. When code is reloaded the original scope is still available.
So... doing something like this will make sure this code is executed only once.
try:
    FD
except NameError:
    FD = FreqDist(word for word in brown.words())
