I am seeing an odd error where a variable I create at the module scope -- as in, at the top of the file before any classes or functions are defined -- is behaving differently over time. This variable (let's call it _cache) gets pulled into my classes:
_cache = None

class XMLGenerator(object):
    def __init__(self, parms):
        global _cache
        if _cache is None:
            _cache = expensive_query(parms)
The results of this cache can differ depending on the context of the request coming into the web service, but I am seeing inconsistent XML output between calls to the same service: I can restart the server and everything is great, but eventually the anomalous behavior begins again.
Is uWSGI preserving state between requests somehow?
I wanted to circle back and explain what happened here. Global variables are, in fact, not "refreshed" between requests to the same service in uWSGI. Thus, if you create a module-level variable, it will carry state between multiple requests. This, obviously, was not what I intended, so I ended up passing a caching object around between the different calls into XMLGenerator. It made the API quite ugly, but it avoided the issue with module-level variables.
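A minimal sketch of that workaround, with a hypothetical QueryCache object standing in for the real one (the class and its get method are illustrative, not the original code):

import mymodule  # wherever expensive_query lives

class QueryCache:
    def __init__(self):
        self._results = {}

    def get(self, parms):
        key = tuple(sorted(parms.items()))  # assumes parms is a dict of hashables
        if key not in self._results:
            self._results[key] = expensive_query(parms)
        return self._results[key]

class XMLGenerator(object):
    def __init__(self, parms, cache):
        self.data = cache.get(parms)

Because the cache is now an explicit argument, each caller decides which cache instance a given request uses, instead of every request silently sharing one module-level value.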
If you are doing this with multiple workers, then you probably want to use uWSGI's CachingFramework:
http://projects.unbit.it/uwsgi/wiki/CachingFramework
Otherwise I believe _cache can be different across the workers.
Also, you could test with uwsgi --processes 1 to see if the problem goes away.
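If you do go the CachingFramework route, a hedged sketch of the Python-side API (this assumes a cache has been configured, e.g. via --cache2; the uwsgi module is only importable when running under uWSGI, and cached values must be byte strings that fit in a cache item, hence the pickling):

import pickle

import uwsgi  # only available inside a uWSGI process

def cached_query(parms):
    key = repr(sorted(parms.items()))
    raw = uwsgi.cache_get(key)  # returns None on a miss
    if raw is not None:
        return pickle.loads(raw)
    result = expensive_query(parms)
    uwsgi.cache_set(key, pickle.dumps(result))
    return result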
The apparent requirement to provide class definitions instead of instances causes very difficult problems. I have two different classes, and one of them needs a reference to the other:
app = tornado.web.Application([
    (r"/fusion.*", FusionListener),
    (r"/admin.*", AdminListener),
])
The AdminListener needs a reference to the FusionListener, since there are internal items that need to be managed. Sending messages would be an unacceptable additional complexity here. The current mechanism does not seem to afford that possibility.
What kind of pattern can get around this shortcoming in Tornado?
For my use case there is both persistent and in-memory state. We have Spark and Postgres repositories for the former. For the latter, I had already designed and written the application to use instance-level data structures. But I have gathered that instance attributes on Tornado-launched RequestHandler / WebHandler subclasses are not persistent.
The in-memory state wants to live in a class that manages it, but I am compelled to significantly redraw the boundaries due to this design of Tornado. Instead it will be necessary to push everything into global variables. Few would argue that this is a preferred design. I will be dumping Tornado as soon as I can get the time.
I am not sure what the alternative will be: I already reverted from CherryPy due to significant limitations of its own; here are a couple of my questions on it:
404 for path served by Cherrypy
How to specify the listening server instances using cherrypy tree.mount?
I got through those with some scars but still in one piece. There were additional issues that knocked me out: URLs were not being served, and there was no clear end to the mole-whacking. CherryPy also generally does not get a lot of attention and has confusing, outdated, or incomplete documentation. There are plenty of docs - that's why I got started on it - but the holes make for a series of rabbit-chasing episodes.
Flask and Django have their own issues. It seems that finding a functionally adequate but not super-heavyweight web server in Python is an illusory target. I am not certain yet which framework has the fewest gotchas.
Posting this as an answer in order to benefit from proper code formatting.
The paradigm I used for keeping track of existing instances of a RequestHandler is very simple:
from tornado.web import RequestHandler

class MyHandler(RequestHandler):
    _instances = set()

    def get(self):
        if needs_to_be_added(self.request):  # some condition can be used here
            if len(MyHandler._instances) > THRESHOLD:  # careful with memory usage
                return self.finish("some_error")
            MyHandler._instances.add(self)
        ...

    def post(self):
        if needs_to_be_removed(self.request):
            MyHandler._instances.discard(self)
        ...
Of course you might need to change when to add / discard elements.
Depending on how you want to refer to existing instances in the future (by some key for example) you could use a dict for keeping track of them.
I don't think you can use weak references here (as in the classes from the weakref module), because those only track live instances, which won't work with the way request handler instances are created and destroyed.
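If you do want keyed lookup, a minimal sketch of the dict variant might look like this (the client_id request argument is hypothetical; use whatever identifies the instance in your application):

class MyHandler(RequestHandler):
    _instances = {}

    def get(self):
        key = self.get_argument("client_id")  # hypothetical key
        MyHandler._instances[key] = self  # later: MyHandler._instances[key]
        ...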
I'm building a test framework using python + pytest + xdist + selenium grid. This framework needs to talk to a pre-existing custom logging system. As part of this logging process, I need to submit API calls to: set up each new test run, set up test cases within those test runs, and log strings and screenshots to those test cases.
The first step is to set up a new test run, and the API call for that returns (among other things) a Test Run ID. I need to keep this ID available for all test cases to read. I'd like to just stick it in a global variable somewhere, but running my tests with xdist causes the framework to lose track of the value.
I've tried:
Using a "globals" class; it forgot the value when using xdist.
Keeping a global variable inside my conftest.py file; same problem, the value gets dropped when using xdist. Also it seems wrong to import my conftest everywhere.
Putting a "globals" class inside the conftest; same thing.
At this point, I'm considering writing it to a temp file, but that seems primitive, and I think I'm overlooking a better solution. What's the most correct, pytest-style way to store and access global data across multiple xdist worker processes?
Might be worth looking into Proboscis, as it allows specific test dependencies and could be a possible solution.
You can try config.cache. E.g.:
request.config.cache.set('run_id', run_id)
Refer to the config.cache documentation.
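A minimal sketch of how that could look as a session-scoped fixture (create_test_run and the cache key are placeholders for your logging system's API):

import pytest

@pytest.fixture(scope="session")
def run_id(request):
    # config.cache is backed by a file under the rootdir, so the xdist
    # worker processes of a single invocation all see the same value
    run_id = request.config.cache.get("myframework/run_id", None)
    if run_id is None:
        run_id = create_test_run()  # hypothetical call to the logging system
        request.config.cache.set("myframework/run_id", run_id)
    return run_id

Two caveats: the cache also persists between runs, so you would need to clear or overwrite the key at the start of each run, and with xdist each worker runs its own session fixtures, so you may still want to guard against two workers creating a run simultaneously.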
I looked up other posts on the topic and I couldn't find my situation exactly. It is in a Django app, although I believe it's purely a (newbie) Python question. Here's my situation:
Let's say I have mymodule.py where I have various constants and common functions, and at some point elsewhere in the program I will want to add (and initialize) another attribute for mymodule (if it's not yet been added):
import mymodule

class UserView(View):
    # this method always gets called first..
    def get(self, request):
        try:
            # check if the attribute exists
            mymodule.user_data
        except AttributeError:
            # add it if it doesn't
            mymodule.user_data = mymodule.get_user_data()
        # continue on..

    # sometime later, this method is called..
    def post(self, request):
        print(mymodule.user_data)
My assumption was that once mymodule.user_data is added, it would persist as a global variable. But even though I do set it in the get() method first, when I try to read it in the post() method later, I get Error: 'module' object has no attribute 'account'.
Does it need to be pre-initialized in mymodule.py, as some empty object? I may not necessarily know what type of object it will be -- how would I do it in Python? (Sorry, coming from JS -- don't shoot!)
You should not do this. Your proposed solution is very dangerous, as now all users will share the same data. You almost certainly don't want that.
For per-user data shared between requests, you should use the session.
Edit
There's no way to know if they are separate processes or not. Your server software (Apache, or whatever) will determine the number of processes to run (based on your settings), and automatically route requests between them. Each process could serve any number of requests before being killed and restarted. So, in all likelihood, two consecutive requests could indeed be served by the same process, in which case the data will collide.
Note that the session data is stored on the server (only a key is stored in the user's cookie), so size shouldn't be a consideration. See the sessions documentation.
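For the view in the question, the session-based version might look like this (a sketch, assuming the user data is serializable by the session backend, which by default means JSON):

import mymodule
from django.views.generic import View

class UserView(View):
    def get(self, request):
        if 'user_data' not in request.session:
            request.session['user_data'] = mymodule.get_user_data()
        # continue on..

    def post(self, request):
        print(request.session['user_data'])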
You should not want to do that.
But it works as "expected": just do
mymodule.variable = value
anywhere in your code.
So, yes, your example code is setting the variable in the current running program -
but then you hit the part where I said: "you should not want to do that" :-)
That is because Django, when running with production settings, will behave differently from a single-process, single-threaded Python application.
In this case, if the variable is not set in mymodule when you try to access it later, it may be because this access is happening in another process entirely (thus, "global variables" (actually, in Python we have "module" variables) won't work, since they are set per process).
In this particular case, since you have a function to retrieve your desired value, and you may be worried that it is an expensive call, you should memoize it - check the documentation on django.utils.functional.memoize (which will change to django.utils.lru_cache.lru_cache in upcoming versions) - https://docs.djangoproject.com/en/dev/releases/1.7/ - this way it will be called once per process in your application, even when requests are served from separate processes.
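A minimal sketch of the same idea using the standard library's functools.lru_cache (the expensive_user_query call is a stand-in for whatever your function does):

from functools import lru_cache

@lru_cache(maxsize=1)
def get_user_data():
    # the expensive query runs once per process; subsequent calls in the
    # same process return the cached result
    return expensive_user_query()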
My solution (for now):
In the module mymodule.py, I initialized a dictionary: data = {}
Then in my get() method:
if 'user' not in mymodule.data:
    mymodule.data['user'] = mymodule.get_user_data()
Subsequently, I'm able to retrieve the mymodule.data['user'] object in the post() method (and presumably elsewhere in my code). It seems to work, but please let me know if it's an aberration!
So there has been a lot of hating on singletons in Python. I generally see that having a singleton is usually no good, but what about stuff that has side effects, like using/querying a database? Why would I make a new instance for every simple query, when I could reuse a connection that is already set up? What would be a Pythonic approach/alternative to this?
Thank you!
Normally, you have some kind of object representing the thing that uses a database (e.g., an instance of MyWebServer), and you make the database connection a member of that object.
If you instead have all your logic inside some kind of function, make the connection local to that function. (This isn't too common in many other languages, but in Python, there are often good ways to wrap up multi-stage stateful work in a single generator function.)
If you have all the database stuff spread out all over the place, then just use a global variable instead of a singleton. Yes, globals are bad, but singletons are just as bad, and more complicated. There are a few cases where they're useful, but very rare. (That's not necessarily true for other languages, but it is for Python.) And the way to get rid of the global is to rethink your design. There's a good chance you're effectively using a module as a (singleton) object, and if you think it through, you can probably come up with a good class or function to wrap it up in.
Obviously, just moving all of your globals into class attributes and @classmethods is just giving you globals under a different namespace. But moving them into instance attributes and methods is a different story. That gives you an object you can pass around, and, if necessary, an object you can have 2 of (or maybe even 0 under some circumstances), attach a lock to, serialize, etc.
In many types of applications, you're still going to end up with a single instance of something—every Qt GUI app has exactly one MyQApplication, nearly every web server has exactly one MyWebServer, etc. No matter what you call it, that's effectively a singleton or global. And if you want to, you can just move everything into attributes of that god object.
But just because you can do so doesn't mean you should. You've still got function parameters, local variables, globals in each module, other (non-megalithic) classes with their own instance attributes, etc., and you should use whatever is appropriate for each value.
For example, say your MyWebServer creates a new ClientConnection instance for each new client that connects to you. You could make the connections call MyWebServer.instance.db.execute whenever they want to execute a SQL query… but you could also just pass self.db to the ClientConnection constructor, and each connection then just does self.db.execute. So, which one is better? Well, if you do it the latter way, it makes your code a lot easier to extend and refactor. If you want to load-balance across 4 databases, you only need to change code in one place (where the MyWebServer initializes each ClientConnection) instead of 100 (every time the ClientConnection accesses the database). If you want to convert your monolithic web app into a WSGI container, you don't have to change any of the ClientConnection code except maybe the constructor. And so on.
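A minimal sketch of that pass-it-in style (handle_request and on_client_connect are illustrative names, not part of any framework):

class ClientConnection:
    def __init__(self, db):
        self.db = db  # injected, so it can be swapped without touching this class

    def handle_request(self, sql):
        return self.db.execute(sql)

class MyWebServer:
    def __init__(self, db):
        self.db = db

    def on_client_connect(self):
        # the one place to change if you later load-balance across databases
        return ClientConnection(self.db)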
If you're using an object oriented approach, then abamet's suggestion of attaching the database connection parameters as class attributes makes sense to me. The class can then establish a single database connection which all methods of the class refer to as self.db_connection, for example.
If you're not using an object oriented approach, a separate database connection module can provide a functional-style equivalent. Devote a module to establishing a database connection, and simply import that module everywhere you want to use it. Your code can then refer to the connection as db.connection, for example. Since modules are effectively singletons, and the module code is only run on the first import, you will be re-using the same database connection each time.
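A minimal sketch of such a module, using sqlite3 as a stand-in for whatever database driver you actually use:

# db.py
import sqlite3

connection = sqlite3.connect("app.db")  # module code runs only on first import

# any other module:
#     import db
#     db.connection.execute("SELECT 1")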
I have celery Python worker processes that are restarted every day or so. They execute Python/Django programs.
I have set certain quasi-global values that should persist in memory for the duration of the process. Namely, I have certain MySQL querysets that do not change often and are therefore evaluated one time and stored as a CONSTANT as soon as the process starts (a bad example being PROFILE = Profile.objects.get(user_id=5)).
Let's say that I want to reset this value in the celery process without exec-ing a whole new program.
This value is imported (and used) in a number of different modules. I'm assuming I'd have to go through each one in sys.modules that imports the CONSTANT and delete/reset the key? Is that right?
This seems very hacky. I usually use external services like Memcached to coordinate memory among multiple processes, but every once in a while I figure local memory is preferable to over-the-network calls to a NoSQL store.
It's a bit hard to say without seeing some code, but importing just sets a reference, exactly as with variable assignment: that is, if the data changes, the references see the change too. Naturally, though, this only works if it's the parent context that you've imported (otherwise assignment will change the reference rather than updating the value).
In other words, if you do this:
from mypackage import mymodule
do_something_with(mymodule.MY_CONSTANT)
# elsewhere
mymodule.MY_CONSTANT = 'new_value'
then all references to mymodule.MY_CONSTANT will get the new value. But if you did this:
from mypackage.mymodule import MY_CONSTANT
# elsewhere
mymodule.MY_CONSTANT = 'new_value'
the name you imported won't get the new value, because the assignment rebinds MY_CONSTANT inside mymodule, while your directly imported MY_CONSTANT is still pointing at the old value.
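To make the difference concrete, here is a small runnable demonstration (assuming a mymodule.py containing just MY_CONSTANT = 'old_value'):

import mymodule
from mymodule import MY_CONSTANT

mymodule.MY_CONSTANT = 'new_value'

print(mymodule.MY_CONSTANT)  # 'new_value': attribute lookup sees the rebinding
print(MY_CONSTANT)           # 'old_value': the directly imported name still
                             # points at the original object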