Simultaneously modify different keys in ZODB - python

I'm using ZODB as a persistent storage for objects that are going to be modified through a webservice.
Below is an example to which I reduced the issue.
The increment-function is what is called from multiple threads.
My problem is that when increment is called simultaneously from two threads, for different keys, I get a ConflictError.
I imagine it should be possible to resolve this properly, at least as long as different keys are modified?
If so, I didn't manage to find an example of how to do it... (the ZODB documentation seems to be somewhat spread across different sites :/ )
Glad about any ideas...
import time
import transaction
from ZODB.FileStorage import FileStorage
from ZODB.DB import DB
from ZODB.POSException import ConflictError
def test_db():
    store = FileStorage('zodb_storage.fs')
    return DB(store)
db_test = test_db()
# app here is a flask-app
@app.route('/increment/<string:key>')
def increment(key):
    '''increment the value of a certain key'''
    # open connection
    conn = db_test.open()
    # get the current value:
    root = conn.root()
    val = root.get(key,0)
    # calculate new value
    # in the real application this might take some seconds
    time.sleep(0.1)
    root[key] = val + 1
    try:
        transaction.commit()
        return '%s = %g' % (key, val)
    except ConflictError:
        transaction.abort()
        return 'ConflictError :-('

You have two options here: implement conflict resolution, or retry the commit with fresh data.
Conflict resolution only applies to custom types you store in the ZODB, and can only be applied if you know how to merge your change into the newly-changed state.
The ZODB looks for a _p_resolveConflict() method on custom types and calls that method with the old state, the saved state you are in conflict with, and the new state you tried to commit; you are supposed to return the merged state. For a simple counter, like in your example, that'd be as simple as updating the saved state with the change between the old and new states:
from persistent import Persistent
class Counter(Persistent):
    def __init__(self, start=0):
        self._count = start
    def increment(self):
        self._count += 1
        return self._count
    def _p_resolveConflict(self, old, saved, new):
        # default __getstate__ returns a dictionary of instance attributes
        saved['_count'] += new['_count'] - old['_count']
        return saved
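The merge arithmetic can be tried on plain dictionaries without a database. This is only an illustrative sketch: `resolve_counter` is a hypothetical stand-alone copy of the method body, and the three state dictionaries are made-up values standing in for what the ZODB would pass in.

```python
def resolve_counter(old, saved, new):
    """Merge two concurrent increments of the same counter state."""
    merged = dict(saved)  # start from what the other transaction committed
    # apply only our own delta (new - old) on top of the committed state
    merged['_count'] += new['_count'] - old['_count']
    return merged

old = {'_count': 5}    # state both transactions started from
saved = {'_count': 6}  # committed by the other transaction (one increment)
new = {'_count': 7}    # what we tried to commit (two increments)
merged = resolve_counter(old, saved, new)
print(merged)
```

Both sets of increments survive: the committed state keeps its one increment and ours adds two more.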
The other option is to retry the commit; you want to limit the number of retries, and you probably want to encapsulate this in a decorator on your method, but the basic principle is that you loop up to a limit, make your calculations based on ZODB data (which, after a conflict error, will auto-read fresh data where needed), then attempt to commit. If the commit is successful you are done:
max_retries = 10
def increment(key):
    retry = 0
    conn = db_test.open()
    root = conn.root()
    while retry < max_retries:
        val = root.get(key,0)
        time.sleep(0.1)
        root[key] = val + 1
        try:
            transaction.commit()
            return '%s = %g' % (key, val)
        except ConflictError:
            # discard the failed transaction so the next pass starts clean
            transaction.abort()
            retry += 1
    raise CustomExceptionIndicatingTooManyRetries
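The decorator mentioned above could look roughly like this. It is a sketch using only the standard library: `retry_on`, `RetryLimitExceeded` and the `on_failure` hook are hypothetical names, and a `KeyError` stands in for the conflict so the example runs without a ZODB.

```python
import functools

class RetryLimitExceeded(Exception):
    """Hypothetical exception raised when every attempt has failed."""

def retry_on(exc_type, max_retries=10, on_failure=None):
    """Re-run the decorated function whenever it raises exc_type."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for _ in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except exc_type:
                    if on_failure is not None:
                        on_failure()  # clean-up hook run before the next try
            raise RetryLimitExceeded(func.__name__)
        return wrapper
    return decorator

# Stand-in for a commit that conflicts twice before succeeding.
attempts = []

@retry_on(KeyError, max_retries=5)
def flaky_commit():
    attempts.append(1)
    if len(attempts) < 3:
        raise KeyError('simulated conflict')
    return 'committed'

result = flaky_commit()
```

With the ZODB one would presumably pass `ConflictError` as `exc_type` and `transaction.abort` as `on_failure`, and let the decorated function read, modify and commit in one go.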

Related

Python API Rate Limiting - How to Limit API Calls Globally

I'm trying to restrict the API calls in my code. I already found a nice python library ratelimiter==1.0.2.post0
https://pypi.python.org/pypi/ratelimiter
However, this library can only limit the rate in a local scope, i.e. within a single function or loop:
from ratelimiter import RateLimiter
# Decorator
@RateLimiter(max_calls=10, period=1)
def do_something():
    pass
# Context Manager
rate_limiter = RateLimiter(max_calls=10, period=1)
for i in range(100):
    with rate_limiter:
        do_something()
Because I have several functions, which make API calls, in different places, I want to limit the API calls in global scope.
For example, suppose I want to limit the API calls to one per second, and suppose I have functions x and y, each of which makes an API call.
@rate(...)
def x():
    ...
@rate(...)
def y():
    ...
By decorating the functions with the limiter, I'm able to limit the rate of each of the two functions.
However, if I execute the above two functions sequentially, the limiter loses track of the number of API calls in global scope because the two decorators are unaware of each other. So y will be called right after the execution of x without waiting another second, which violates the one-call-per-second restriction.
Is there any way or library that I can use to limit the rate globally in python?
I had the same problem: I had a bunch of different functions that call the same API and I wanted to make rate limiting work globally. What I ended up doing was to create an empty function with rate limiting enabled.
PS: I use a different rate limiting library found here: https://pypi.org/project/ratelimit/
from ratelimit import limits, sleep_and_retry
# 30 calls per minute
CALLS = 30
RATE_LIMIT = 60
@sleep_and_retry
@limits(calls=CALLS, period=RATE_LIMIT)
def check_limit():
    ''' Empty function just to check for calls to API '''
    return
Then I just call that function at the beginning of every function that calls the API:
def get_something_from_api(http_session, url):
    check_limit()
    response = http_session.get(url)
    return response
If the limit is reached, the program will sleep until the (in my case) 60 seconds have passed, and then resume normally.
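The same idea, one shared gate that every API-calling function passes through, can be sketched with the standard library alone. `GlobalRateLimiter`, `api_gate`, `x` and `y` are hypothetical names, and the sliding-window logic is a simplified stand-in rather than how the ratelimit package works internally.

```python
import threading
import time
from collections import deque

class GlobalRateLimiter:
    """Allow at most max_calls per period seconds across all callers."""
    def __init__(self, max_calls, period):
        self.max_calls = max_calls
        self.period = period
        self._calls = deque()          # monotonic timestamps of recent calls
        self._lock = threading.Lock()

    def wait(self):
        """Block until another call is allowed, then record it."""
        with self._lock:
            now = time.monotonic()
            # forget calls that have left the window
            while self._calls and now - self._calls[0] >= self.period:
                self._calls.popleft()
            if len(self._calls) >= self.max_calls:
                # sleep until the oldest recorded call leaves the window
                time.sleep(self.period - (now - self._calls[0]))
                self._calls.popleft()
            self._calls.append(time.monotonic())

api_gate = GlobalRateLimiter(max_calls=1, period=0.2)

def x():
    api_gate.wait()
    return 'x'

def y():
    api_gate.wait()
    return 'y'

start = time.monotonic()
results = [x(), y(), x()]
elapsed = time.monotonic() - start
```

Because the limiter sleeps while holding its lock, callers in other threads queue up behind it, which is the intended effect for a single global budget.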
In the end, I implemented my own Throttler class. By proxying every API request through its request method, we can keep track of all API requests. Because the API call is passed in as a function (the method parameter), the class can also cache the result in order to reduce API calls.
import datetime
import logging
import time
class TooManyRequestsError(Exception):
    def __str__(self):
        return "More than 30 requests have been made in the last five seconds."
class Throttler(object):
    cache = {}
    def __init__(self, max_rate, window, throttle_stop=False, cache_age=1800):
        # Dict of max number of requests of the API rate limit for each source
        self.max_rate = max_rate
        # Dict of duration of the API rate limit for each source
        self.window = window
        # Whether to throw an error (when True) if the limit is reached, or wait until another request
        self.throttle_stop = throttle_stop
        # The time, in seconds, for which to cache a response
        self.cache_age = cache_age
        # Initialization
        self.next_reset_at = dict()
        self.num_requests = dict()
        now = datetime.datetime.now()
        for source in self.max_rate:
            self.next_reset_at[source] = now + datetime.timedelta(seconds=self.window.get(source))
            self.num_requests[source] = 0
    def request(self, source, method, do_cache=False):
        now = datetime.datetime.now()
        # if cache exists, no need to make api call
        key = source + method.__name__
        if do_cache and key in self.cache:
            timestamp, data = self.cache.get(key)
            logging.info('{} exists in cached @ {}'.format(key, timestamp))
            # total_seconds() also counts whole days, unlike .seconds
            if (now - timestamp).total_seconds() < self.cache_age:
                logging.info('retrieved cache for {}'.format(key))
                return data
        # <--- MAKE API CALLS ---> #
        # reset the count if the period passed
        if now > self.next_reset_at.get(source):
            self.num_requests[source] = 0
            self.next_reset_at[source] = now + datetime.timedelta(seconds=self.window.get(source))
        # throttle request
        def halt(wait_time):
            if self.throttle_stop:
                raise TooManyRequestsError()
            else:
                # Wait the required time, plus a bit of extra padding time.
                time.sleep(wait_time + 0.1)
        # if exceed max rate, need to wait
        if self.num_requests.get(source) >= self.max_rate.get(source):
            logging.info('back off: {} until {}'.format(source, self.next_reset_at.get(source)))
            halt((self.next_reset_at.get(source) - now).total_seconds())
        self.num_requests[source] += 1
        response = method() # potential exception raise
        # cache the response
        if do_cache:
            self.cache[key] = (now, response)
            logging.info('cached instance for {}, {}'.format(source, method))
        return response
Many API providers constrain developers from making too many API calls.
The Python ratelimit package provides a function decorator that prevents a function from being called more often than the API provider allows.
from ratelimit import limits
import requests
TIME_PERIOD = 900 # time period in seconds
@limits(calls=15, period=TIME_PERIOD)
def call_api(url):
    response = requests.get(url)
    if response.status_code != 200:
        raise Exception('API response: {}'.format(response.status_code))
    return response
Note: This function will not be able to make more than 15 API calls within a 15-minute time period.
Adding to Sunil's answer: you need to add the @sleep_and_retry decorator, otherwise your code will raise an exception when it reaches the rate limit:
from ratelimit import limits, sleep_and_retry
import requests
@sleep_and_retry
@limits(calls=0.05, period=1)
def api_call(url, api_key):
    r = requests.get(
        url,
        headers={'X-Riot-Token': api_key}
    )
    if r.status_code != 200:
        raise Exception('API Response: {}'.format(r.status_code))
    return r
There are lots of fancy libraries that will provide nice decorators, and special safety features, but the below should work with django.core.cache or any other cache with a get and set method:
import logging
def hit_rate_limit(key, max_hits, max_hits_interval):
    '''Implement a basic rate throttler. Prevent more than max_hits occurring
    within max_hits_interval time period (seconds).'''
    # Use the django cache, but can be any object with get/set
    from django.core.cache import cache
    hit_count = cache.get(key) or 0
    logging.info("Rate Limit: %s --> %s", key, hit_count)
    # >= so that hit number max_hits + 1 is already rejected
    if hit_count >= max_hits:
        return True
    cache.set(key, hit_count + 1, max_hits_interval)
    return False
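To try the throttler without Django, a variant that takes the cache as a parameter can be run against a small in-memory object. This is a sketch under assumptions: `MemoryCache` is a hypothetical stand-in that offers only `get` and `set` with a timeout, and this variant treats `max_hits` as an inclusive limit.

```python
import time

class MemoryCache:
    """Tiny in-memory stand-in for a cache with get/set and a timeout."""
    def __init__(self):
        self._data = {}

    def get(self, key):
        value, expires_at = self._data.get(key, (None, 0.0))
        if time.monotonic() >= expires_at:
            self._data.pop(key, None)  # expired or never set
            return None
        return value

    def set(self, key, value, timeout):
        self._data[key] = (value, time.monotonic() + timeout)

def hit_rate_limit(cache, key, max_hits, max_hits_interval):
    """Return True once max_hits have already been recorded for key."""
    hit_count = cache.get(key) or 0
    if hit_count >= max_hits:
        return True
    # note: every accepted hit restarts the timeout for this key
    cache.set(key, hit_count + 1, max_hits_interval)
    return False

cache = MemoryCache()
outcomes = [hit_rate_limit(cache, 'user:42', 3, 60) for _ in range(5)]
print(outcomes)
```

The first three calls are accepted and the remaining two are reported as over the limit.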
Using the Python standard library:
import threading
from time import time, sleep
b = threading.Barrier(2)
def belay(s=1):
    """Block the main thread for `s` seconds."""
    while True:
        b.wait()
        sleep(s)
def request_something():
    b.wait()
    print(f'something at {time()}')
def request_other():
    b.wait()
    print(f'or other at {time()}')
if __name__ == '__main__':
    thread = threading.Thread(target=belay)
    thread.daemon = True
    thread.start()
    # request a lot of things
    i = 0
    while (i := i+1) < 5:
        request_something()
        request_other()
There's about s seconds between each timestamp printed. Because the main thread waits rather than sleeps, time it spends responding to requests is unrelated to the (minimum) time between requests.

Redis get and set decorator

I'm currently working with Python and Redis. I use Flask as my framework and am working with Blueprints.
While looking into implementing a cache with Redis for my APIs, I have tried Flask-Cache and redis-simple-cache.
The downside of Flask-Cache is that it caches the function's result even when you change the parameters of the function. It only saves it once per function.
As for redis-simple-cache, it saves its keys as SimpleCache-<key name>, which is not advisable on my end.
So my question is: how can you create a decorator which saves and retrieves a value, or checks whether a key exists, for a specific key?
I know a save decorator is possible. But is a retrieve or check decorator possible? Please correct me if I am wrong. Thank you.
You seem to not have read the Flask-Cache documentation very closely. Caching does not ignore parameters and the cache key is customisable. The decorators the project supplies already give you the functionality you seek.
From the Caching View Functions section:
This decorator will use request.path by default for the cache_key.
So the default cache key is request.path, but you can specify a different key. Since Flask view functions get their arguments from path elements, the default request.path makes a great key.
From the @cached() decorator documentation:
cached(timeout=None, key_prefix='view/%s', unless=None)
By default the cache key is view/request.path. You are able to use this decorator with any function by changing the key_prefix. If the token %s is located within the key_prefix then it will replace that with request.path. You are able to use this decorator with any function by changing the key_prefix.
and
key_prefix – [...] Can optionally be a callable which takes no arguments but returns a string that will be used as the cache_key.
So you can set key_prefix to a function, and it'll be called (without arguments) to produce the key.
Moreover:
The returned decorated function now has three function attributes assigned to it. These attributes are readable/writable:
[...]
make_cache_key
A function used in generating the cache_key used.
This function is passed the same arguments the view function is passed. In all, this allows you to produce any cache key you want; either use key_prefix and pull out more information from the request or g or other sources, or assign to view_function.make_cache_key and access the same arguments the view function receives.
Then there is the @memoize() decorator:
memoize(timeout=None, make_name=None, unless=None)
Use this to cache the result of a function, taking its arguments into account in the cache key.
So this decorator caches return values purely based on the arguments passed into the function. It too supports a make_cache_key function.
I've used both decorators to make a Google App Engine Flask project scale to double-digit millions of views per month for a CMS-backed site, storing results in the Google memcached structure. Doing this with Redis would only require setting a configuration option.
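For readers who want to see the mechanics rather than rely on Flask-Cache, a decorator that checks for a key before calling the function can be sketched as follows. It is not Flask-Cache's implementation: `get_or_set`, `DictCache` and the key format are assumptions made for illustration, and a real Redis client would need value serialization and a timeout added on top.

```python
import functools

class DictCache:
    """In-memory stand-in for any backend with get() and set()."""
    def __init__(self):
        self.store = {}

    def get(self, key):
        return self.store.get(key)

    def set(self, key, value):
        self.store[key] = value

def get_or_set(cache, prefix):
    """Return the cached value for these arguments, or compute and store it."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # sorted kwargs so f(a=1, b=2) and f(b=2, a=1) share a key
            key = '{}:{}:{!r}:{!r}'.format(
                prefix, func.__name__, args, sorted(kwargs.items()))
            hit = cache.get(key)
            if hit is not None:      # retrieve/check step
                return hit
            value = func(*args, **kwargs)
            cache.set(key, value)    # save step
            return value
        return wrapper
    return decorator

cache = DictCache()
calls = []

@get_or_set(cache, prefix='api')
def square(n, offset=0):
    calls.append(n)
    return n * n + offset

first = square(4, offset=1)
second = square(4, offset=1)   # served from the cache
third = square(5)
```

One limitation of this sketch is that a function returning None is never cached, because None is also what the backend returns for a missing key.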
You can use the cache decorator below. The cache object you create will have to be a Flask cache object instead of a Django one, i.e. it should support the cache.get and cache.set methods. It is pretty flexible in how you create cache keys:
- based on what parameters are passed to the method
- in what cases to cache/not cache the result
- use the same cache even if kwarg order is changed, i.e. same cache for my_method(a=1,b=2) and my_method(b=2,a=1) call.
__author__ = 'Dhruv Pathak'
import cPickle
import logging
import time
from functools import wraps
from django.conf import settings
import traceback
"""following imports are from datautils.py in this repo, datautils
also contains many other useful data utility methods/classes
"""
from datautils import mDict, mList, get_nested_ordered_dict, nested_path_get
"""following import is specific to django framework, and can be altered
based on what type of caching library your code uses"""
from django.core.cache import cache, caches
logger = logging.getLogger(__name__)
def cache_result(cache_key=None, cache_kwarg_keys=None, seconds=900, cache_filter=lambda x: True, cache_setup = "default"):
    def set_cache(f):
        @wraps(f)
        def x(*args, **kwargs):
            if settings.USE_CACHE is not True:
                result = f(*args, **kwargs)
                return result
            try:
                # cache_conn should be a cache object supporting get,set methods
                # can be from python-memcached, pylibmc, django, django-redis-cache, django-redis etc
                cache_conn = caches[cache_setup]
            except Exception as e:
                result = f(*args, **kwargs)
                return result
            final_cache_key = generate_cache_key_for_method(f, kwargs, args, cache_kwarg_keys, cache_key)
            try:
                result = cache_conn.get(final_cache_key)
            except Exception as e:
                result = None
                if settings.DEBUG is True:
                    raise
                else:
                    logger.exception("Cache get failed,k::{0},ex::{1},ex::{2}".format(final_cache_key, str(e), traceback.format_exc()))
            if result is not None and cache_filter(result) is False:
                result = None
                logger.debug("Cache had invalid result:{0},not returned".format(result))
            if result is None: # result not in cache
                result = f(*args, **kwargs)
                if isinstance(result, (mDict, mList)):
                    result.ot = int(time.time())
                if cache_filter(result) is True:
                    try:
                        cache_conn.set(final_cache_key, result, seconds)
                    except Exception as e:
                        if settings.DEBUG is True:
                            raise
                        else:
                            logger.exception("Cache set failed,k::{0},ex::{1},ex::{2},dt::{3}".format(final_cache_key, str(e), traceback.format_exc(), str(result)[0:100],))
                #else:
                #    logger.debug("Result :{0} failed,not cached".format(result))
            else: # result was from cache
                if isinstance(result, (mDict, mList)):
                    result.src = "CACHE_{0}".format(cache_setup)
            return result
        return x
    return set_cache
def generate_cache_key_for_method(method, method_kwargs, method_args, cache_kwarg_keys=None, cache_key=None):
    if cache_key is None:
        if cache_kwarg_keys is not None and len(cache_kwarg_keys) > 0:
            if len(method_args) > 0:
                raise Exception("cache_kwarg_keys mode needs set kwargs,args should be empty")
            method_kwargs = get_nested_ordered_dict(method_kwargs)
            cache_kwarg_values = [nested_path_get(method_kwargs, path_str=kwarg_key, strict=False) for kwarg_key in cache_kwarg_keys]
            if any([kwarg_value is not None for kwarg_value in cache_kwarg_values]) is False:
                raise Exception("all cache kwarg keys values are not set")
            final_cache_key = "{0}::{1}::{2}".format(str(method.__module__), str(method.__name__), hash(cPickle.dumps(cache_kwarg_values)))
        else:
            final_cache_key = "{0}::{1}".format(str(method.__module__), str(method.__name__))
            final_cache_key += "::{0}".format(str(hash(cPickle.dumps(method_args, 0)))) if len(method_args) > 0 else ''
            final_cache_key += "::{0}".format(str(hash(cPickle.dumps(method_kwargs, 0)))) if len(method_kwargs) > 0 else ''
    else:
        final_cache_key = "{0}::{1}::{2}".format(method.__module__, method.__name__, cache_key)
    return final_cache_key
The two or three utility methods are imported from datautils.py in the same repo; you can just put them in the same file.
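One caveat when porting the key generation to Python 3: the built-in hash() of strings and bytes is randomized per process by default, so keys built with it would not match between workers that share one cache. A digest avoids that. The following is a sketch, not part of the code above: `stable_cache_key` is a hypothetical helper, and JSON with sorted keys is just one possible canonical form.

```python
import hashlib
import json

def stable_cache_key(func, args, kwargs):
    """Build a key that is identical across processes and kwarg orderings."""
    # sort_keys makes the kwarg order irrelevant; default=repr is a crude
    # fallback for values that JSON cannot encode
    payload = json.dumps([args, kwargs], sort_keys=True, default=repr)
    digest = hashlib.sha1(payload.encode('utf-8')).hexdigest()
    return '{}::{}::{}'.format(func.__module__, func.__name__, digest)

def my_method(a, b):
    return a + b

key_1 = stable_cache_key(my_method, (), {'a': 1, 'b': 2})
key_2 = stable_cache_key(my_method, (), {'b': 2, 'a': 1})
key_3 = stable_cache_key(my_method, (), {'a': 1, 'b': 3})
```

Here `key_1` and `key_2` are the same string, while `key_3` differs because the argument values differ.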

How do I log multiple very similar events gracefully in python?

With Python's logging module, is there a way to collect multiple events into one log entry? An ideal solution would be an extension of Python's logging module or a custom formatter/filter for it, so that collecting logging events of the same kind happens in the background and nothing needs to be added in the code body (e.g. at every call of a logging function).
Here is an example that generates a large number of the same or very similar logging events:
import logging
for i in range(99999):
    try:
        asdf[i] # not defined!
    except NameError:
        logging.exception('foo') # generates large number of logging events
    else: pass
# ... more code with more logging ...
for i in range(88888): logging.info('more of the same %d' % i)
# ... and so on ...
So we have the same exception 99999 times and log it. It would be nice if the log just said something like:
ERROR:root:foo (occurred 99999 times)
Traceback (most recent call last):
  File "./exceptionlogging.py", line 10, in <module>
    asdf[i] # not defined!
NameError: name 'asdf' is not defined
INFO:root:foo more of the same (occurred 88888 times with various values)
You should probably be writing a message aggregate/statistics class rather than trying to hook onto the logging system's singletons but I guess you may have an existing code base that uses logging.
I'd also suggest that you should instantiate your loggers rather than always using the default root. The Python Logging Cookbook has extensive explanation and examples.
The following class should do what you are asking.
import logging
import atexit
import pprint
class Aggregator(object):
    logs = {}
    @classmethod
    def _aggregate(cls, record):
        id = '{0[levelname]}:{0[name]}:{0[msg]}'.format(record.__dict__)
        if id not in cls.logs: # first occurrence
            cls.logs[id] = [1, record]
        else: # subsequent occurrence
            cls.logs[id][0] += 1
    @classmethod
    def _output(cls):
        for count, record in list(cls.logs.values()):
            record.__dict__['msg'] += ' (occurred {} times)'.format(count)
            logging.getLogger(record.__dict__['name']).handle(record)
    @staticmethod
    def filter(record):
        # pprint.pprint(record)
        Aggregator._aggregate(record)
        return False
    @staticmethod
    def exit():
        # stop filtering first, otherwise the summary records are swallowed too
        logging.getLogger().removeFilter(Aggregator)
        Aggregator._output()
logging.getLogger().addFilter(Aggregator)
atexit.register(Aggregator.exit)
for i in range(99999):
    try:
        asdf[i] # not defined!
    except NameError:
        logging.exception('foo') # generates large number of logging events
    else: pass
# ... more code with more logging ...
for i in range(88888): logging.error('more of the same')
# ... and so on ...
Note that you don't get any logs until the program exits.
The result of running this is:
ERROR:root:foo (occurred 99999 times)
Traceback (most recent call last):
  File "C:\work\VEMS\python\logcount.py", line 38, in <module>
    asdf[i] # not defined!
NameError: name 'asdf' is not defined
ERROR:root:more of the same (occurred 88888 times)
Your question hides a subliminal assumption of how "very similar" is defined.
Log records can either be const-only (whose instances are strictly identical), or a mix of consts and variables (no consts at all is also considered a mix).
An aggregator for const-only log records is a piece of cake. You just need to decide whether to split your aggregation by process/thread or not.
For log records which include both consts and variables you'll need to decide whether to split your aggregation based on the variables you have in your record.
A dictionary-style counter (from collections import Counter) can serve as a cache, which will count your instances in O(1), but you may need some higher-level structure in order to write the variables down if you wish. Additionally, you'll have to manually handle the writing of the cache into a file - every X seconds (binning) or once the program has exited (risky - you may lose all in-memory data if something gets stuck).
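The Counter idea can be sketched as a small handler. This is an illustration only: `CountingHandler`, `summary` and the logger name `counter_demo` are hypothetical, and the bucket key (level, logger name, unformatted message) is one possible definition of "very similar".

```python
import logging
from collections import Counter

class CountingHandler(logging.Handler):
    """Count records by (level, logger name, unformatted message)."""
    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def emit(self, record):
        # record.msg is the template before % formatting, so records that
        # differ only in their arguments land in the same bucket
        self.counts[(record.levelname, record.name, record.msg)] += 1

    def summary(self):
        return ['{}:{}:{} (occurred {} times)'.format(level, name, msg, n)
                for (level, name, msg), n in self.counts.items()]

demo_log = logging.getLogger('counter_demo')
demo_log.setLevel(logging.INFO)
demo_log.propagate = False   # keep the demo away from the root handlers
counting = CountingHandler()
demo_log.addHandler(counting)

for i in range(1000):
    demo_log.info('more of the same %d', i)
demo_log.error('foo')

lines = counting.summary()
```

Passing the variable as a logging argument rather than formatting it into the string beforehand is what lets the thousand records collapse into one bucket; the individual values are not kept in this sketch.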
A framework for aggregation would look something like this (tested on Python v3.4):
import logging
from logging import Handler
from threading import RLock, Timer
from collections import defaultdict
class LogAggregatorHandler(Handler):
    _default_flush_timer = 300 # Number of seconds between flushes
    _default_separator = "\t" # Separator char between metadata strings
    _default_metadata = ["filename", "name", "funcName", "lineno", "levelname"] # metadata defining unique log records
    class LogAggregatorCache(object):
        """ Keeps whatever is interesting in log records aggregation. """
        def __init__(self, record=None):
            self.message = None
            self.counter = 0
            self.timestamp = list()
            self.args = list()
            if record is not None:
                self.cache(record)
        def cache(self, record):
            if self.message is None: # Only the first message is kept
                self.message = record.msg
            assert self.message == record.msg, "Non-matching log record" # note: will not work with string formatting for log records; e.g. "blah {}".format(i)
            self.timestamp.append(record.created)
            self.args.append(record.args)
            self.counter += 1
        def __str__(self):
            """ The string of this object is used as the default output of log records aggregation. For example: record message with occurrences. """
            return self.message + "\t (occurred {} times)".format(self.counter)
    def __init__(self, flush_timer=None, separator=None, add_process_thread=False):
        """
        Log record metadata will be concatenated to a unique string, separated by self._separator.
        Process and thread IDs will be added to the metadata if set to True; otherwise log records across processes/threads will be aggregated together.
        :param separator: str
        :param add_process_thread: bool
        """
        super().__init__()
        self._flush_timer = flush_timer or self._default_flush_timer
        self._cache = self.cache_factory()
        self._separator = separator or self._default_separator
        # copy the list so that += below does not modify the class-level default
        self._metadata = list(self._default_metadata)
        if add_process_thread is True:
            self._metadata += ["process", "thread"]
        self._aggregation_lock = RLock()
        self._store_aggregation_timer = self.flush_timer_factory()
        self._store_aggregation_timer.start()
        # Demo logger which outputs aggregations through a StreamHandler:
        self.agg_log = logging.getLogger("aggregation_logger")
        self.agg_log.addHandler(logging.StreamHandler())
        self.agg_log.setLevel(logging.DEBUG)
        self.agg_log.propagate = False
    def cache_factory(self):
        """ Returns an instance of a new caching object. """
        return defaultdict(self.LogAggregatorCache)
    def flush_timer_factory(self):
        """ Returns a threading.Timer daemon object which flushes the Handler aggregations. """
        timer = Timer(self._flush_timer, self.flush)
        timer.daemon = True
        return timer
    def find_unique(self, record):
        """ Extracts a unique metadata string from log records. """
        metadata = ""
        for single_metadata in self._metadata:
            value = getattr(record, single_metadata, "missing " + str(single_metadata))
            metadata += str(value) + self._separator
        return metadata[:-len(self._separator)]
    def emit(self, record):
        try:
            with self._aggregation_lock:
                metadata = self.find_unique(record)
                self._cache[metadata].cache(record)
        except Exception:
            self.handleError(record)
    def flush(self):
        self.store_aggregation()
    def store_aggregation(self):
        """ Write the aggregation data to file. """
        self._store_aggregation_timer.cancel()
        del self._store_aggregation_timer
        with self._aggregation_lock:
            temp_aggregation = self._cache
            self._cache = self.cache_factory()
        # ---> handle temp_aggregation and write to file <--- #
        for key, value in sorted(temp_aggregation.items()):
            self.agg_log.info("{}\t{}".format(key, value))
        # ---> re-create the store_aggregation Timer object <--- #
        self._store_aggregation_timer = self.flush_timer_factory()
        self._store_aggregation_timer.start()
Testing this Handler class with random log severity in a for-loop:
if __name__ == "__main__":
    import random
    import logging
    logger = logging.getLogger()
    handler = LogAggregatorHandler()
    logger.addHandler(handler)
    logger.addHandler(logging.StreamHandler())
    logger.setLevel(logging.DEBUG)
    logger.info("entering logging loop")
    for i in range(25):
        # Randomly choose log severity:
        severity = random.choice([logging.DEBUG, logging.INFO, logging.WARN, logging.ERROR, logging.CRITICAL])
        logger.log(severity, "test message number %s", i)
    logger.info("end of test code")
If you want to add more stuff, this is what a Python log record looks like:
{'args': ['()'],
'created': ['1413747902.18'],
'exc_info': ['None'],
'exc_text': ['None'],
'filename': ['push_socket_log.py'],
'funcName': ['<module>'],
'levelname': ['DEBUG'],
'levelno': ['10'],
'lineno': ['17'],
'module': ['push_socket_log'],
'msecs': ['181.387901306'],
'msg': ['Test message.'],
'name': ['__main__'],
'pathname': ['./push_socket_log.py'],
'process': ['65486'],
'processName': ['MainProcess'],
'relativeCreated': ['12.6709938049'],
'thread': ['140735262810896'],
'threadName': ['MainThread']}
One more thing to think about:
Most features you run depend on a flow of several consecutive commands (which will ideally report log records accordingly); e.g. a client-server communication will typically depend on receiving a request, processing it, reading some data from the DB (which requires a connection and some read commands), some kind of parsing/processing, constructing the response packet and reporting the response code.
This highlights one of the main disadvantages of using an aggregation approach: by aggregating log records you lose track of the time and order of the actions that took place. It will be extremely difficult to figure out what request was incorrectly structured if you only have the aggregation at hand.
My advice in this case is that you keep both the raw data and the aggregation (using two file handlers or something similar), so that you can investigate a macro-level (aggregation) and a micro-level (normal logging).
However, you are still left with the responsibility of finding out that things have gone wrong, and then manually investigating what caused it. When developing on your PC this is an easy enough task, but deploying your code on several production servers makes these tasks cumbersome, wasting a lot of your time.
Accordingly, there are several companies developing products specifically for log management. Most aggregate similar log records together, but others incorporate machine learning algorithms for automatic aggregation and learning your software's behavior. Outsourcing your log handling can then enable you to focus on your product, instead of on your bugs.
Disclaimer: I work for Coralogix, one such solution.
You can subclass the logger class and override the exception method to put your error types in a cache until they reach a certain counter before they are emitted to the log.
import logging
from collections import defaultdict
MAX_COUNT = 99999
class MyLogger(logging.getLoggerClass()):
    def __init__(self, name):
        super(MyLogger, self).__init__(name)
        self.cache = defaultdict(int)
    def exception(self, msg, *args, **kwargs):
        err = msg.__class__.__name__
        self.cache[err] += 1
        if self.cache[err] > MAX_COUNT:
            new_msg = "{err} occurred {count} times.\n{msg}"
            new_msg = new_msg.format(err=err, count=MAX_COUNT, msg=msg)
            self.log(logging.ERROR, new_msg, *args, **kwargs)
            self.cache[err] = 0 # reset the counter; None would break the next += 1
log = MyLogger('main')
try:
    raise TypeError("Useful error message")
except TypeError as err:
    log.exception(err)
Please note this isn't copy paste code.
You need to add your handlers (I recommend formatter, too) yourself.
https://docs.python.org/2/howto/logging.html#handlers
Have fun.
Create a counter and only log it for count=1, then increment thereafter and write out in a finally block (to ensure it gets logged no matter how bad the application crashes and burns). This could of course pose an issue if you have the same exception for different reasons, but you could always search for the line number to verify it's the same issue or something similar. A minimal example:
import logging
name_error_exception_count = 0
try:
    for i in range(99999):
        try:
            asdf[i] # not defined!
        except NameError:
            name_error_exception_count += 1
            if name_error_exception_count == 1:
                logging.exception('foo')
        else: pass
except Exception:
    pass # this is just to get the finally block, handle exceptions here too, maybe
finally:
    if name_error_exception_count > 0:
        # no exception is active here, so use error() rather than exception()
        logging.error('NameError exception occurred {} times.'.format(name_error_exception_count))

Threading memory profiling

So I hope this isn't a duplicate; however, I either haven't been able to find an adequate solution or I'm just not 100% sure what I'm looking for. I've written a program to thread lots of requests. I create a thread to:
1. Fetch responses from a number of APIs such as this: share.yandex.ru/gpp.xml?url=MY_URL, as well as scraping blogs
2. Parse the responses of all requests from the example above / JSON / using python-goose to extract articles
3. Return the parsed results back to the primary thread and insert them into a database.
It's all been going well until it needed to pull back larger amounts of data, which I hadn't tested before. The primary reason for this is that it takes me over my memory limit on a shared Linux server (512 MB), which triggers a kill. This should be enough as it's only a few thousand requests, although I could be wrong. I'm clearing all large data variables/objects within the main thread, but that doesn't seem to help either.
I ran memory_profiler on the primary function, which creates the threads using a thread class that looks like this:
class URLThread(Thread):
    def __init__(self,request):
        super(URLThread, self).__init__()
        self.url = request['request']
        self.post_id = request['post_id']
        self.domain_id = request['domain_id']
        self.post_data = request['post_params']
        self.type = request['type']
        self.code = ""
        self.result = ""
        self.final_results = ""
        self.error = ""
        self.encoding = ""
    def run(self):
        try:
            self.request = get_page(self.url,self.type)
            self.code = self.request['code']
            self.result = self.request['result']
            self.final_results = response_handler(dict(result=self.result,type=self.type,orig_url=self.url ))
            self.encoding = chardet.detect(self.result)
            self.error = self.request['error']
        except Exception as e:
            exc_type, exc_obj, exc_tb = sys.exc_info()
            fname = os.path.split(exc_tb.tb_frame.f_code.co_filename)[1]
            errors.append((exc_type, fname, exc_tb.tb_lineno,e,'NOW()'))
            pass
@profile
def multi_get(uris,timeout=2.0):
    def alive_count(lst):
        alive = map(lambda x : 1 if x.isAlive() else 0, lst)
        return reduce(lambda a,b : a + b, alive)
    threads = [ URLThread(uri) for uri in uris ]
    for thread in threads:
        thread.start()
    while alive_count(threads) > 0 and timeout > 0.0:
        timeout = timeout - UPDATE_INTERVAL
        sleep(UPDATE_INTERVAL)
    return [ {"request":x.url,
              "code":str(x.code),
              "result":x.result,
              "post_id":str(x.post_id),
              "domain_id":str(x.domain_id),
              "final_results":x.final_results,
              "error":str(x.error),
              "encoding":str(x.encoding),
              "type":x.type}
             for x in threads ]
And the results look like this on the first batch of requests I pump through it (FYI it's a link as the output text isn't readable in here; I can't paste an HTML table or embed an image until I get 2 more points):
http://tinypic.com/r/28c147d/8
And it doesn't seem to drop any of the memory in subsequent passes (I'm batching 100 requests/threads through at a time). By this I mean that once a batch of threads is complete they seem to stay in memory, and every time it runs another batch, memory is added as below:
http://tinypic.com/r/nzkeoz/8
Am I doing something really stupid here?
Python will generally free the memory taken up by an object when there are no references to that object left. Your multi_get function returns a list that contains references to every thread that you have created. So it's unlikely that Python would free that memory. But we would need to see what the code that is calling multi_get is doing in order to be sure.
To start freeing the memory you will need to stop returning references to the threads from this function. Or, if you want to continue to do that, at least delete them somewhere afterwards with del x.
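The effect of dropping references can be observed with weak references, which watch an object without keeping it alive. This is a sketch with made-up names (`Payload`, `run_batch`), and it relies on CPython's reference counting, where an object is freed as soon as the last reference to it disappears.

```python
import gc
import weakref

class Payload:
    """Stand-in for a thread object holding a large response body."""
    def __init__(self, size):
        self.result = bytearray(size)

def run_batch(n):
    workers = [Payload(1024) for _ in range(n)]
    probes = [weakref.ref(w) for w in workers]    # do not keep objects alive
    summaries = [len(w.result) for w in workers]  # keep only what is needed
    return summaries, probes

summaries, probes = run_batch(100)
gc.collect()
# a dead weak reference returns None when called
alive = sum(1 for p in probes if p() is not None)
```

Because `run_batch` returns only the small summaries, none of the `Payload` objects survives the call; returning the workers themselves, or anything that holds their large attributes, would keep the memory in use.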

SQLAlchemy Session error

I have a problem with the session in SQLAlchemy: when I add a row to the DB it works, but if I want to add another row without closing my app, it is not added.
This is the function in my model:
def add(self,name):
    self.slot_name = name
    our_slot = self.session_.query(Slot).filter_by(slot_name = str(self.slot_name)).first()
    if our_slot:
        return 0
    else:
        self.session_.add(self)
        self.session_.commit()
        return 1
The problem is that you commit your session. After committing a session, it is closed. Either you commit after you are done adding, or you open a new session after each commit. Also take a look at Session.commit(). You should probably read something about sessions in SQLAlchemy's documentation.
Furthermore, I suggest you do this:
from sqlalchemy.orm.exc import NoResultFound
def add(self,name):
    self.slot_name = name
    try:
        self.session_.query(Slot)\
            .filter_by(slot_name = str(self.slot_name)).one()
        return 0 # a slot with this name already exists
    except NoResultFound:
        self.session_.add(self)
        return 1
Of course, this only works if you expect at most one matching row; one() raises MultipleResultsFound otherwise. It is considered good practice to raise exceptions and catch them instead of making up conditions.
