How do I disable history in python mechanize module? - python

I have a web scraping script that gets new data once every minute, but over the course of a couple of days, the script ends up using 200mb or more of memory, and I found out it's because mechanize is keeping an infinite browser history for the .back() function to use.
I have looked in the docstrings, and I found the clear_history() function of the browser class, and I invoke that each time I refresh, but I still get 2-3mb higher memory usage on each page refresh. edit: Hmm, seems as if it kept doing the same thing after I called clear_history, up until I got to about 30mb worth of memory usage, then it cleared back down to 10mb or so (which is the base amount of memory my program starts up with)...any way to force this behavior on a more regular basis?
How do I keep mechanize from storing all of this info? I don't need to keep any of it. I'd like to keep my python script below 15mb memory usage.

You can pass an argument history=whatever when you instantiate the Browser; the default value is None, which means the browser instantiates the History class itself (to allow back and reload). The simplest approach (which raises an AttributeError if you ever do call back or reload):

    import mechanize

    class NoHistory(object):
        def add(self, *a, **k): pass
        def clear(self): pass

    b = mechanize.Browser(history=NoHistory())

A cleaner approach would implement other methods in NoHistory to give clearer exceptions on erroneous use of the browser's back or reload, but this simple one should suffice otherwise.
Note that this is an elegant (though not well documented;-) use of the dependency injection design pattern: in a (bleah) "monkeypatching" world, the client code would be expected to overwrite b._history after the browser is instantiated, but with dependency injection you just pass in the "history" object you want to use. I've often maintained that Dependency Injection may be the most important DP that wasn't in the "gang of 4" book!-).
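A minimal sketch of the "cleaner approach" mentioned in this answer. It assumes the browser calls back(n, response) and close() on the history object, mirroring the methods of mechanize's own History class; those two signatures are assumptions and should be checked against the installed version. The choice of RuntimeError is purely illustrative.

```python
class NoHistory(object):
    """History stand-in that stores nothing and fails loudly on back()."""

    def add(self, request, response):
        # Called after each page load; deliberately keep nothing.
        pass

    def back(self, n, response):
        # Assumed signature, mirroring mechanize's own History.back.
        raise RuntimeError("history is disabled; back() is not available")

    def clear(self):
        # Nothing is stored, so there is nothing to clear.
        pass

    def close(self):
        # Assumed to be called when the browser is closed; nothing to release.
        pass


# Usage, assuming mechanize is installed:
#   import mechanize
#   b = mechanize.Browser(history=NoHistory())
```

With this variant a stray call to back should fail with an explicit message instead of an AttributeError.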

Related

Python multiprocessing.Pool(): am I limited in what I can return?

I am using Python's multi-processing pool. I have been told, although not experienced this myself so I cannot post the code, that one cannot just "return" anything from within the multiprocessing.Pool()-worker back to the multiprocessing.Pool()'s main process. Words like "pickling" and "lock" were being thrown around but I am not sure.
Is this correct, and if so, what are these limitations?
In my case, I have a function which generates a mutable class object and then returns it after it has done some work with it. I'd like to have 8 processes run this function, generate their own classes, and return each of them after they're done. Full code is NOT written yet, so I cannot post it.
Any issues I may run into?
My code is: res = pool.map(foo, list_of_parameters)
Q : "Is this correct, and if so, what are these limitations?"
It depends. It is correct, but the SER/DES (serialisation/deserialisation) processing is the problem here: a pair of disjoint processes has to "send" something to each other (there: a task specification with its parameters; back: the long-awaited result).
Early versions of the pickle module, the piece of the Python standard library responsible for this, were not able to SER-ialise some more complex types of objects, and even now class instances can fail if the class is not defined at the top level of an importable module.
Newer versions keep evolving, sure, yet this SER/DES step remains one of the SPoFs (single points of failure) that may prevent smooth code execution in such cases.
Next are the cases that end in a MemoryError: they request so much memory that the O/S simply rejects any new allocation, and the attempt to produce and send pickle.dumps( ... ) crashes the whole process unrecoverably.
Do we have any remedies available?
Well, maybe yes, maybe no. Mike McKerns's dill may help in some cases to better handle complex objects in SER/DES processing.
You may try import dill as pickle; pickle.dumps(...) and test whether your candidate Class() instances get SER/DES-ed successfully. If they do not, this low-hanging-fruit trick will not help.
Next, a less easy way would be to drop the dependence on hardwired multiprocessing.Pool() instantiations and their (above-mentioned) limited SER/comms/DES methods, and to design your processing strategy as a distributed-computing system based on a communicating-agents paradigm.
That way you benefit from a right-sized, just-enough communication interchange between intelligent-enough agents that know (as you have designed them to) what to tell one another, without sending any mastodon-sized BLOB(s) that may crash the processing in any of the SPoF(s) you can neither prevent nor salvage ex post.
I know of no better way forward, nor can I foresee one in 2020-Q4, for doing this safely and smartly.
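One way to probe the SER/DES limitation before writing the Pool code is to try pickling a candidate return value directly. This is a minimal sketch using only the standard pickle module; the helper name can_pickle and the two sample values are hypothetical.

```python
import pickle
import threading


def can_pickle(obj):
    """Return True if obj survives pickle.dumps, as a Pool result must."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, TypeError, AttributeError):
        return False
    return True


# Plain data structures are expected to pickle without trouble.
plain_result = {"values": [1, 2, 3], "label": "ok"}

# A value holding a lock is a typical example that cannot be pickled.
lock_result = {"guard": threading.Lock()}

print(can_pickle(plain_result))
print(can_pickle(lock_result))
```

Running the same check on an instance of your own class, in the module where the class is defined, should indicate whether pool.map could hand it back to the main process.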

Python: How to add/initialize new global vars IN another module?

I looked up other posts on the topic and I couldn't find my situation exactly. It is in a Django app, although I believe it's purely a (newbie) Python question. Here's my situation:
Let's say I have mymodule.py where I have various constants and common functions, and at some point elsewhere in the program, I will want to add (and initialize) another attribute for mymodule (if it's not yet been added):
    import mymodule
    from django.views.generic import View

    class UserView(View):
        # this method always gets called first..
        def get(self, request):
            try:
                # check if attribute exists
                mymodule.user_data
            except AttributeError:
                # add it if it doesn't
                mymodule.user_data = mymodule.get_user_data()
            # continue on..

        # sometime later, this method is called..
        def post(self, request):
            print(mymodule.user_data)
My assumption was that once mymodule.user_data is added, it will persist as a global variable? Even though I do set it in the get() method first, when I try to read it in the post() method later, I get Error: 'module' object has no attribute 'account'
Does it need to be pre-initialized in mymodule.py, as some empty object? I may not necessarily know what type of object it will be -- how would I do it in Python? (Sorry, coming from JS -- don't shoot!)
You should not do this. Your proposed solution is very dangerous, as now all users will share the same data. You almost certainly don't want that.
For per-user data shared between requests, you should use the session.
Edit
There's no way to know if they are separate processes or not. Your server software (Apache, or whatever) will determine the number of processes to run (based on your settings), and automatically route requests between them. Each process could serve any number of requests before being killed and restarted. So, in all likelihood, two consecutive requests could indeed be served by the same process, in which case the data will indeed collide.
Note that the session data is stored on the server (only a key is stored in the user's cookie), so size shouldn't be a consideration. See the sessions documentation.
You should not want to do that.
But it works as "expected": just do
mymodule.variable = value
anywhere in your code.
So, yes, your example code is setting the variable in the current running program -
but then you hit the part where I said: "you should not want to do that" :-)
Because Django, when running with production settings, will behave differently from a single-process, single-thread Python application.
In this case, if the variable is not set in mymodule when you try to access it later, it may be because this access is happening in another process entirely (thus, "global variables" (actually, in Python we have "module" variables) won't work, since they are set per process).
In this particular case, since you have a function to retrieve your desired value, and you may be worried that it is expensive to compute, you should memoize it. Check the documentation on django.utils.functional.memoize (which will change to django.utils.lru_cache.lru_cache in upcoming versions): https://docs.djangoproject.com/en/dev/releases/1.7/. This way it will be called once per process as your application serves requests from separate processes.
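The memoization idea can be sketched with the standard library's functools.lru_cache, used here as a stand-in for the Django helpers named above. The body of get_user_data and the call_log list are hypothetical, included only to show that the function body runs once per process.

```python
import functools

call_log = []  # hypothetical: records how often the real work happens


@functools.lru_cache(maxsize=None)
def get_user_data():
    # Stand-in for an expensive lookup; the returned value is made up.
    call_log.append("computed")
    return {"name": "example"}


first = get_user_data()   # computed on the first call
second = get_user_data()  # served from the cache afterwards

print(first is second, len(call_log))
```

Note that the cached value is still shared by every request handled in the same process, so this suits data that is the same for all users, not per-user state.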
My solution (for now):
In the module mymodule.py, I initialized a dictionary: data = {}
Then in my get() method:
    if 'user' not in mymodule.data:
        mymodule.data['user'] = mymodule.get_user_data()
Subsequently, I'm able to retrieve the mymodule.data['user'] object in the post() method (and presumably elsewhere in my code). Seems to work but please let me know if it's an aberration!

Memory leak in django when keeping a reference of all instances of forms

This is a followup to this thread.
I have implemented the method for keeping a reference to all my forms in an array, as suggested by the selected answer in that thread, but unfortunately I am getting a memory leak on each request to the Django host.
The code in question is as follows:
This is my custom form I am extending which has a function to keep reference of neighboring forms, and whenever I instantiate a new form, it just keeps getting added to the _instances stack.
    from django.forms import ModelForm

    class StepForm(ModelForm):
        TAGS = []
        _instances = []

        def __new__(cls, *args, **kwargs):
            instance = object.__new__(cls)
            cls._instances.append(instance)
            return instance
Even though this is more of a Python issue than a Django one, I decided it's better to show the full context in which I am encountering the problem.
As requested I am posting what I am trying to accomplish with this feat:
I have a JS applet with steps, and for each step there is a form, but in order to load the contents of each step dynamically through JS I need to execute some calls on the next form in line, and on the previous one as well. Therefore the only solution I could come up with is to keep a reference to all forms on each request and use the form functions I need.
Well, it's not only a Python issue: the execution context (here a Django app) is important too. As Ludwik Trammer rightly comments, you're in a long-running process, and as such anything at the module or class level will live for the duration of the process. Also, if you use more than one process to serve the app you may (and will) get inconsistent results from one request to another, since two subsequent requests from the same user can (and will) end up being served by different processes.
To make a long story short: the way to safely keep per-user persistent state in a web application is to use sessions. Please explain what problem you're trying to solve; there's very probably a more appropriate (and possibly existing and tested) solution.
EDIT: OK, what you're looking for is a "wizard". There are a couple of available implementations for Django, but most of them don't handle going back, which, from experience, can get tricky when each step depends on the previous one (and that's one of the driving points for using a wizard). What one usually does is have a Wizard class (a plain old Python object) with a set of forms.
The wizard takes care of
step to step navigation
instantiating forms
maintaining state (which includes storing and retrieving form's data for each step, revalidating etc).
FWIW I've had rather mixed success using Django's existing session-based wizard. We rolled our own for another project (with somewhat complex requirements) and while it works I wouldn't call it a success either. Having ajax and file uploads thrown into the mix doesn't help either. Anyway, you can try to start with an existing implementation, see how it fits your needs, and go for a custom solution if it doesn't; generic solutions sometimes make things harder than they have to be.
My 2 cents...
The leak is not just a side effect of your code - it's part of its core function. It is not possible to remove the leak without changing what the code does.
It does exactly what it is programmed to do: every time the form is displayed a new instance is created and added to the _instances list. It is never removed from the list. As a consequence, after 100 requests you will have a list with 100 instances, after 1 000 requests there will be 1 000 instances in the list, and so on, until all memory is exhausted and the program crashes.
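The growth described above can be reproduced without Django. In this minimal sketch, TrackedForm is a hypothetical stand-in for StepForm and handle_request simulates one request that builds a form and then discards its own reference to it.

```python
class TrackedForm(object):
    """Hypothetical stand-in for StepForm, without the Django dependency."""

    # Class attribute: shared by all instances and alive as long as the process.
    _instances = []

    def __new__(cls, *args, **kwargs):
        instance = object.__new__(cls)
        cls._instances.append(instance)  # this reference is never released
        return instance


def handle_request():
    # The local reference disappears on return, but _instances still holds one.
    TrackedForm()


for _ in range(100):
    handle_request()

print(len(TrackedForm._instances))
```

Because the class-level list keeps a reference to every instance, the garbage collector can never reclaim them, which is the leak seen on each request.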
What did you want to accomplish by keeping all instances of your form? And what else did you expect to happen?

Is there an established memoize on-disk decorator for python?

I have been searching a bit for a python module that offers a memoize decorator with the following capabilities:
Stores cache on disk to be reused among subsequent program runs.
Works for any pickle-able arguments, most importantly numpy arrays.
(Bonus) checks whether arguments are mutated in function calls.
I found a few small code snippets for this task and could probably implement one myself, but I would prefer having an established package for this task. I also found incpy, but that does not seem to work with the standard python interpreter.
Ideally, I would like to have something like functools.lru_cache plus cache storage on disk. Can someone point me to a suitable package for this?
I don't know of any memoize decorator that takes care of all that, but you might want to have a look at ZODB. It's a persistence system built on top of pickle that provides some additional features, including being able to move objects from memory to disk when they aren't being used and the ability to save only objects that have been modified.
Edit: As a follow-up for the comment. A memoization decorator isn't supported out of the box by ZODB. However, I think you can:
Implement your own persistent class
Use a memoization decorator in the methods you need (any standard implementation should work, but it probably needs to be modified to make sure that the dirty bit is set)
After that, if you create an object of that class and add it to a ZODB database, when you execute one of the memoized methods, the object will be marked as dirty and changes will be saved to the database in the next transaction commit operation.
I realize this is a 2-year-old question, and that this wouldn't count as an "established" decorator, but…
functools.lru_cache is simple enough that you really don't need to worry about only using established code. The module's docs link to the source because, in addition to being useful in its own right, it works as sample code.
So, what do you need to add? Add a filename parameter. At run time, pickle.load the filename into the cache, using {} if it fails. Add a cache_save function that just uses pickle.dump to write the cache to the file under the lock. Attach that function to wrapper the same as the existing ones (cache_info, etc.).
If you want to save the cache automatically, instead of leaving it up to the caller, that's easy; it's just a matter of when to do so. Any option you come up with—atexit.register, adding a save_every argument so it saves every save_every misses, …—is trivial to implement. In this answer I showed how little work it takes. Or you can get a complete working version (to customize, or to use as-is) on GitHub.
There are other ways you could extend it—put some save-related statistics (last save time, hits and misses since last save, …) in the cache_info, copy the cache and save it in a background thread instead of saving it inline, etc. But I can't think of anything that would be worth doing that wouldn't be easy.
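The steps described in this answer could be sketched roughly as follows. This is not the lru_cache source: a plain dict stands in for its internals, the decorator name disk_memoize is hypothetical, and keys are built with pickle.dumps so that unhashable but pickle-able arguments can be used (an assumption that equal arguments pickle to equal bytes).

```python
import functools
import os
import pickle
import tempfile
import threading


def disk_memoize(filename):
    """Hypothetical decorator: memoize in a dict that can be saved to disk."""
    def decorator(func):
        lock = threading.Lock()
        try:
            with open(filename, "rb") as f:
                cache = pickle.load(f)  # reuse results from an earlier run
        except (OSError, EOFError, pickle.UnpicklingError):
            cache = {}                  # fall back to an empty cache

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            key = pickle.dumps((args, sorted(kwargs.items())))
            with lock:
                if key in cache:
                    return cache[key]
            result = func(*args, **kwargs)
            with lock:
                cache[key] = result
            return result

        def cache_save():
            # Write the whole cache to the file while holding the lock.
            with lock:
                with open(filename, "wb") as f:
                    pickle.dump(cache, f)

        wrapper.cache_save = cache_save
        return wrapper
    return decorator


cache_path = os.path.join(tempfile.mkdtemp(), "square.cache")
calls = []


@disk_memoize(cache_path)
def square(x):
    calls.append(x)  # records real computations only
    return x * x


square(3)
square(3)            # answered from the in-memory cache
square.cache_save()  # persist for the next program run
```

To save automatically, one option named in the answer is to register the save function at exit, for example atexit.register(square.cache_save).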

How do I dump an entire Python process for later debugging inspection?

I have a Python application in a strange state. I don't want to do live debugging of the process. Can I dump it to a file and examine its state later? I know I've restored corefiles of C programs in gdb later, but I don't know how to examine a Python application in a useful way from gdb.
(This is a variation on my question about debugging memleaks in a production system.)
There is no builtin way other than aborting (with os.abort(), causing the coredump if resource limits allow it) -- although you can certainly build your own 'dump' function that dumps relevant information about the data you care about. There are no ready-made tools for it.
As for handling the corefile of a Python process, the Python source has a gdbinit file that contains useful macros. It's still a lot more painful than somehow getting into the process itself (with pdb or the interactive interpreter) but it makes life a little easier.
If you only care about storing the traceback object (which is all you need to start a debugging session), you can use debuglater (a fork of pydump). It works with recent versions of Python and has an IPython/Jupyter integration.
If you want to store the entire session, look at dill. It has dump_session and load_session functions.
Here are two other relevant projects:
python-checkpointing2
pycrunch-trace
If you're looking for a language agnostic solution, you want to create a core dump file. Here's an example with Python.
Someone above said that there is no builtin way to perform this, but that's not entirely true. For an example, you could take a look at the pylons debugging tools. When there is an exception, the exception handler saves the stack trace and prints a URL on the console that can be used to retrieve the debugging session over HTTP.
While they're probably keeping these sessions in memory, they're just python objects, so there's nothing to stop you from pickling a stack dump and restoring it later for inspection. It would mean some changes to the app, but it should be possible...
After some research, it turns out the relevant code is actually coming from Paste's EvalException module. You should be able to look there to figure out what you need.
It's also possible to write something that would dump all the data from the process, e.g.
Pickler that ignores the objects it can't pickle (replacing them with something else) (e.g. Python: Pickling a dict with some unpicklable items)
Method that recursively converts everything into serializable stuff (e.g. this, except it needs a check for infinitely recursing objects and do something with those; also it could try dir() and getattr() to process some of the unknown objects, e.g. extension classes).
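The first idea in the list above could be sketched like this: a hypothetical helper, picklable_snapshot, that copies a dict and replaces every value pickle rejects with a descriptive string. It only inspects top-level values; a real dump tool would need to recurse and guard against cycles, as noted in the list.

```python
import pickle
import threading


def picklable_snapshot(namespace):
    """Copy a dict, replacing unpicklable values with a repr placeholder."""
    snapshot = {}
    for name, value in namespace.items():
        try:
            pickle.dumps(value)
        except Exception:
            # Keep a readable trace of what was there instead of failing.
            value = "<unpicklable: %r>" % (value,)
        snapshot[name] = value
    return snapshot


# Hypothetical process state containing one value that cannot be pickled.
state = {"count": 3, "items": [1, 2], "lock": threading.Lock()}

blob = pickle.dumps(picklable_snapshot(state))  # could be written to a file
restored = pickle.loads(blob)                   # later, in another session

print(sorted(restored))
```

The restored dict keeps the ordinary values intact and shows a placeholder string where the lock used to be.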
But leaving a running process with manhole or pylons or something like that certainly seems more convenient when possible.
(also, I wonder if something more convenient was written since this question was first asked)
This answer suggests making your program core dump and then continuing execution on another sufficiently similar box.
