Multithreading with Numpy Causes Segmentation Fault - python

I'm trying to generate a report that contains differing "groupings" of data. For each grouping I have to query postgres differently and apply different logic, which can take a fair amount of time (~1 hour).
To increase performance I've created a thread for each task, each with its own connection, since psycopg2 executes queries serially per connection. I'm using numpy to calculate the median and mean values of a portion of data (that is common to each group).
A short example of my code is the following:
# -*- coding: utf-8 -*-
import numpy

from postgres import Connection
from lookup import Lookup
from queries import QUERY1, QUERY2
from threading import Thread


class Report(object):

    def __init__(self, **credentials):
        self._credentials = credentials
        self.conn = self.__get_conn(**credentials)
        self._lookup = Lookup(self.conn)
        self.data = {}

    def __get_conn(self, **credentials):
        return Connection(**credentials)

    def _get_averages(self, data):
        return {
            'mean': numpy.mean(data),
            'median': numpy.median(data)
        }

    def method1(self):
        # Each thread gets its own connection, since psycopg2 runs queries
        # serially per connection.
        conn = self.__get_conn(**self._credentials)
        cursor = conn.get_cursor()
        data = cursor.execute(QUERY1)
        for row in data:
            # Logic specific to the results returned by the query.
            row['arg1'] = self._lookup.find_data_by_method_1(row)
            avgs = self._get_averages(row['data'])
            row['mean'] = avgs['mean']
            row['median'] = avgs['median']
        return data

    def method2(self):
        conn = self.__get_conn(**self._credentials)
        cursor = conn.get_cursor()
        data = cursor.execute(QUERY2)
        for row in data:
            # Logic specific to the results returned by the query.
            row['arg2'] = self._lookup.find_data_by_method_2(row)
            avgs = self._get_averages(row['data'])
            row['mean'] = avgs['mean']
            row['median'] = avgs['median']
        return data

    def lookup(self, arg):
        methods = {
            'arg1': self.method1,
            'arg2': self.method2
        }
        method = methods[arg]
        self.data[arg] = method()

    def lookup_args(self):
        return self._lookup.find_args()

    def do_something_with_data(self):
        print self.data


def main():
    creds = {
        'host': 'host',
        'user': 'postgres',
        'database': 'mydatabase',
        'password': 'mypassword'
    }
    reporter = Report(**creds)
    args = reporter.lookup_args()

    threads = []
    for arg in args:
        thread = Thread(target=reporter.lookup, args=(arg,))
        threads.append(thread)
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

    reporter.do_something_with_data()


if __name__ == '__main__':
    main()
The imported Connection class is a simple wrapper around psycopg2 that facilitates cursor creation and connecting to multiple postgres databases.
The imported Lookup class accepts a Connection instance and is used to perform short queries to find relevant data that would drastically decrease performance if incorporated into the larger query.
The data accepted by the example _get_averages method is a list of decimal.Decimal objects.
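For reference, a minimal sketch of what such a wrapper might look like (the Connection and get_cursor names come from the code above; the body below is an assumption, not the actual class):
import psycopg2
import psycopg2.extras


class Connection(object):
    """Hypothetical psycopg2 wrapper, sketched from the description above."""

    def __init__(self, **credentials):
        # credentials: host, user, database, password
        self._conn = psycopg2.connect(**credentials)

    def get_cursor(self):
        # DictCursor so rows can be treated like the dict-style rows used above
        return self._conn.cursor(cursor_factory=psycopg2.extras.DictCursor)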
When I run all threads simultaneously I get a segfault. If I run each thread independently the script finishes successfully.
Using gdb I find numpy to be the culprit:
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffedc8c700 (LWP 10997)]
0x00007ffff2ac33b7 in sortCompare (a=0x2e956298, b=0x2e956390) at numpy/core/src/multiarray/item_selection.c:1045
1045 numpy/core/src/multiarray/item_selection.c: No such file or directory.
in numpy/core/src/multiarray/item_selection.c
I'm aware of this bug with numpy, but that seems to only affect sorting lists containing both class instances and other numeric types. The objects in my lists are guaranteed to be decimal.Decimal instances (yes, I verified this).
What could cause numpy to segfault when used inside a thread, but behave as expected otherwise?
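One hedged workaround worth trying (an assumption on my part, not a confirmed fix): convert the Decimal values to a plain float array before handing them to numpy, so numpy.median sorts a native float array rather than an object array of Decimal instances; wrapping the numpy calls in a lock is another way to test whether concurrency is the trigger. A sketch:
import threading

import numpy

_numpy_lock = threading.Lock()  # serializes the numpy calls across threads


def get_averages(data):
    # Assumption: converting the decimal.Decimal values to float is acceptable here.
    values = numpy.array([float(d) for d in data], dtype=float)
    with _numpy_lock:
        return {
            'mean': numpy.mean(values),
            'median': numpy.median(values),
        }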

Related

Is it safe to instantiate a python class object populated directly with JSON data?

I am writing a p2p project in python which involves sending serialised data between untrusted nodes. Originally I used pickle for convenience, but this is insecure, so now I am converting my (sometimes very complex and highly nested) class objects into JSON for transmission and storage.
My question is whether it is safe to perform the following without risk of malicious code being run:
# class initially created:
import json
import jsonpickle


class MyClass():
    def __init__(self, cat, list_of_dogs, nested_list_of_zebras):
        self.cat = cat
        self.dogs = list_of_dogs
        self.zebras = nested_list_of_zebras


# cat, dog_list and zebra_list are defined elsewhere
p = MyClass(cat, dog_list, zebra_list)
json_obj = jsonpickle.encode(p)


# on the receiving end (or when bringing out of storage) we do
class ReCreateMyClass():
    def __init__(self, python_obj):
        self.cat = python_obj['cat']
        self.dogs = python_obj['dogs']
        self.zebras = python_obj['zebras']


def decode_js(python_obj):
    return ReCreateMyClass(json.loads(python_obj))


# is this safe?
class_obj = decode_js(json_obj)
I am aware it is possible to automate this better (see "Is parsing a json naively into a Python class or struct secure?"), but is there any risk of executing malicious code by handling JSON in this manner?
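To make the usual reasoning concrete, here is a small illustration of my own (not from the question): json.loads only ever produces plain built-in types (dict, list, str, numbers, bools, None), so nothing in the payload is executed during parsing; the only code that runs afterwards is your own __init__.
import json

payload = '{"cat": "Tom", "dogs": ["Rex"], "zebras": [["Marty"]]}'
decoded = json.loads(payload)

print(type(decoded))         # dict: plain data, no constructor from the payload was run
print(type(decoded['cat']))  # str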

Understanding flow of execution of Python code

I'm trying to do a homework assignment in Python from Data Manipulation at Scale: Systems and Algorithms on Coursera. Generally I have problems understanding the base code, which was presented as an example of the MapReduce algorithm. I would be grateful for help understanding it in two places; details below.
I tried to go step by step through the code flow of the two files below after running the command:
python wordcount.py 'data/books.json'
1. File wordcount.py is opened
2. mr = MapReduce.MapReduce() - an mr object is created
3. The def __init__(self): part from MapReduce.py is executed
4. We come back to wordcount.py
5. The functions def mapper(record): and def reducer(key,list_of_values): are created, but for the time being not executed
6. Python goes to if __name__ == '__main__':
7. inputdata = open(sys.argv[1]) - the json file is assigned to a variable
8. mr.execute(inputdata, mapper, reducer) - a call to the function from MapReduce.py.
And here is my first question: we haven't defined a mapper or reducer variable/object so far. Is it just null/no value being passed to this function, or did we somehow define these variables earlier and I missed it?
9. Later we move to def execute(self, data, mapper, reducer): in MapReduce.py
10. And there we have mapper(record).
So this is a reference to a function in wordcount.py, am I right? But if we reference a function from a different file, shouldn't we use an import at the beginning of the file and state which file this function comes from?
(...) further code execution
wordcount.py file:
import MapReduce
import sys

"""
Word Count Example in the Simple Python MapReduce Framework
"""

mr = MapReduce.MapReduce()

# =============================
# Do not modify above this line

def mapper(record):
    # key: document identifier
    # value: document contents
    key = record[0]
    value = record[1]
    words = value.split()
    for w in words:
        mr.emit_intermediate(w, 1)

def reducer(key, list_of_values):
    # key: word
    # value: list of occurrence counts
    total = 0
    for v in list_of_values:
        total += v
    mr.emit((key, total))

# Do not modify below this line
# =============================
if __name__ == '__main__':
    inputdata = open(sys.argv[1])
    mr.execute(inputdata, mapper, reducer)
MapReduce.py file:
import json

class MapReduce:
    def __init__(self):
        self.intermediate = {}
        self.result = []

    def emit_intermediate(self, key, value):
        self.intermediate.setdefault(key, [])
        self.intermediate[key].append(value)

    def emit(self, value):
        self.result.append(value)

    def execute(self, data, mapper, reducer):
        for line in data:
            record = json.loads(line)
            mapper(record)

        for key in self.intermediate:
            reducer(key, self.intermediate[key])

        #jenc = json.JSONEncoder(encoding='latin-1')
        jenc = json.JSONEncoder()
        for item in self.result:
            print jenc.encode(item)
Thank you in advance for help with that.
In Python everything is an object, and that includes functions, so you can pass functionA as an argument to another functionB (or to a class, or wherever). If functionB expects you to do that, it will assume you gave it a function with the right signature and proceed as normal.
In your case
mr.execute(inputdata, mapper, reducer)
here mapper and reducer are the previously defined functions that are passed as arguments to the execute method of the instance mr of the class MapReduce and, as you can see, that method uses them as the functions it expects.
Thanks to this you can, as that code shows, write generic code that does some calculation and can be reused in a similar way by many applications, by giving the user the option of supplying his/her own functions.
A much more generic example of this is the built-in function map: it receives a function that does something; map doesn't care what the function does or where it comes from, only that it accepts the arguments map passes to it and returns a value, which map uses to build a new list of results.
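A minimal illustration of the same idea (my own example, not from the course code):
def shout(text):
    return text.upper() + '!'


def apply_to_all(func, items):
    # func is just another object here; we call it without knowing where it was defined
    return [func(item) for item in items]


print(apply_to_all(shout, ['hello', 'world']))   # ['HELLO!', 'WORLD!']
print(list(map(shout, ['hello', 'world'])))      # the built-in map does the same job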

How do I log multiple very similar events gracefully in python?

With Python's logging module, is there a way to collect multiple events into one log entry? An ideal solution would be an extension of Python's logging module, or a custom formatter/filter for it, so that collecting logging events of the same kind happens in the background and nothing needs to be added to the code body (e.g. at every call of a logging function).
Here is an example that generates a large number of identical or very similar logging events:
import logging

for i in range(99999):
    try:
        asdf[i]  # not defined!
    except NameError:
        logging.exception('foo')  # generates large number of logging events
    else:
        pass

# ... more code with more logging ...
for i in range(88888):
    logging.info('more of the same %d' % i)
# ... and so on ...
So we have the same exception 99999 times and log it. It would be nice if the log just said something like:
ERROR:root:foo (occurred 99999 times)
Traceback (most recent call last):
  File "./exceptionlogging.py", line 10, in <module>
    asdf[i] # not defined!
NameError: name 'asdf' is not defined
INFO:root:more of the same (occurred 88888 times with various values)
You should probably be writing a message aggregate/statistics class rather than trying to hook onto the logging system's singletons, but I guess you may have an existing code base that uses logging.
I'd also suggest that you instantiate your loggers rather than always using the default root logger. The Python Logging Cookbook has extensive explanations and examples.
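For instance, a minimal named-logger setup (my own sketch, using only stock logging calls):
import logging

logger = logging.getLogger(__name__)        # named, module-level logger instead of root
logger.addHandler(logging.StreamHandler())
logger.setLevel(logging.INFO)

logger.info('this record carries the module name rather than "root"')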
The following class should do what you are asking.
import logging
import atexit
import pprint


class Aggregator(object):
    logs = {}

    @classmethod
    def _aggregate(cls, record):
        id = '{0[levelname]}:{0[name]}:{0[msg]}'.format(record.__dict__)
        if id not in cls.logs:              # first occurrence
            cls.logs[id] = [1, record]
        else:                               # subsequent occurrence
            cls.logs[id][0] += 1

    @classmethod
    def _output(cls):
        for count, record in cls.logs.values():
            record.__dict__['msg'] += ' (occurred {} times)'.format(count)
            logging.getLogger(record.__dict__['name']).handle(record)

    @staticmethod
    def filter(record):
        # pprint.pprint(record)
        Aggregator._aggregate(record)
        return False

    @staticmethod
    def exit():
        Aggregator._output()


logging.getLogger().addFilter(Aggregator)
atexit.register(Aggregator.exit)

for i in range(99999):
    try:
        asdf[i]  # not defined!
    except NameError:
        logging.exception('foo')  # generates large number of logging events
    else:
        pass

# ... more code with more logging ...
for i in range(88888):
    logging.error('more of the same')
# ... and so on ...
Note that you don't get any logs until the program exits.
The result of running this is:
ERROR:root:foo (occurred 99999 times)
Traceback (most recent call last):
  File "C:\work\VEMS\python\logcount.py", line 38, in <module>
    asdf[i] # not defined!
NameError: name 'asdf' is not defined
ERROR:root:more of the same (occurred 88888 times)
Your question hides an implicit assumption about how "very similar" is defined.
Log records can either be const-only (whose instances are strictly identical), or a mix of consts and variables (no consts at all is also considered a mix).
An aggregator for const-only log records is a piece of cake. You just need to decide whether process/thread boundaries will split your aggregation or not.
For log records which include both consts and variables, you'll need to decide whether to split your aggregation based on the variables you have in your record.
A dictionary-style counter (from collections import Counter) can serve as a cache which counts your instances in O(1), but you may need some higher-level structure if you also want to write the variables down (a quick sketch follows below). Additionally, you'll have to manually handle writing the cache to a file - every X seconds (binning) or once the program has exited (risky - you may lose all in-memory data if something gets stuck).
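A minimal sketch of that Counter-style cache (my own illustration, not the framework below; the filter swallows individual records and only prints totals at exit):
import atexit
import logging
from collections import Counter

counts = Counter()


class CountingFilter(logging.Filter):
    def filter(self, record):
        # record.msg is the pre-formatting message, so records logged lazily as
        # logging.info('x %s', i) aggregate together; pre-formatted strings will not.
        counts[(record.levelname, record.msg)] += 1
        return False  # swallow the individual records


def dump_counts():
    for (levelname, msg), n in counts.items():
        print('{}: {} (occurred {} times)'.format(levelname, msg, n))


logging.getLogger().addFilter(CountingFilter())
atexit.register(dump_counts)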
A framework for aggregation would look something like this (tested on Python v3.4):
import logging
from logging import Handler
from threading import RLock, Timer
from collections import defaultdict


class LogAggregatorHandler(Handler):

    _default_flush_timer = 300   # Number of seconds between flushes
    _default_separator = "\t"    # Separator char between metadata strings
    _default_metadata = ["filename", "name", "funcName", "lineno", "levelname"]  # metadata defining unique log records

    class LogAggregatorCache(object):
        """ Keeps whatever is interesting in log records aggregation. """
        def __init__(self, record=None):
            self.message = None
            self.counter = 0
            self.timestamp = list()
            self.args = list()
            if record is not None:
                self.cache(record)

        def cache(self, record):
            if self.message is None:  # Only the first message is kept
                self.message = record.msg
            assert self.message == record.msg, "Non-matching log record"  # note: will not work with string formatting for log records; e.g. "blah {}".format(i)
            self.timestamp.append(record.created)
            self.args.append(record.args)
            self.counter += 1

        def __str__(self):
            """ The string of this object is used as the default output of log records aggregation. For example: record message with occurrences. """
            return self.message + "\t (occurred {} times)".format(self.counter)

    def __init__(self, flush_timer=None, separator=None, add_process_thread=False):
        """
        Log record metadata will be concatenated to a unique string, separated by self._separator.
        Process and thread IDs will be added to the metadata if set to True; otherwise log records across processes/threads will be aggregated together.
        :param separator: str
        :param add_process_thread: bool
        """
        super().__init__()
        self._flush_timer = flush_timer or self._default_flush_timer
        self._cache = self.cache_factory()
        self._separator = separator or self._default_separator
        self._metadata = self._default_metadata
        if add_process_thread is True:
            self._metadata += ["process", "thread"]
        self._aggregation_lock = RLock()
        self._store_aggregation_timer = self.flush_timer_factory()
        self._store_aggregation_timer.start()
        # Demo logger which outputs aggregations through a StreamHandler:
        self.agg_log = logging.getLogger("aggregation_logger")
        self.agg_log.addHandler(logging.StreamHandler())
        self.agg_log.setLevel(logging.DEBUG)
        self.agg_log.propagate = False

    def cache_factory(self):
        """ Returns an instance of a new caching object. """
        return defaultdict(self.LogAggregatorCache)

    def flush_timer_factory(self):
        """ Returns a threading.Timer daemon object which flushes the Handler aggregations. """
        timer = Timer(self._flush_timer, self.flush)
        timer.daemon = True
        return timer

    def find_unique(self, record):
        """ Extracts a unique metadata string from log records. """
        metadata = ""
        for single_metadata in self._metadata:
            value = getattr(record, single_metadata, "missing " + str(single_metadata))
            metadata += str(value) + self._separator
        return metadata[:-len(self._separator)]

    def emit(self, record):
        try:
            with self._aggregation_lock:
                metadata = self.find_unique(record)
                self._cache[metadata].cache(record)
        except Exception:
            self.handleError(record)

    def flush(self):
        self.store_aggregation()

    def store_aggregation(self):
        """ Write the aggregation data to file. """
        self._store_aggregation_timer.cancel()
        del self._store_aggregation_timer
        with self._aggregation_lock:
            temp_aggregation = self._cache
            self._cache = self.cache_factory()

        # ---> handle temp_aggregation and write to file <--- #
        for key, value in sorted(temp_aggregation.items()):
            self.agg_log.info("{}\t{}".format(key, value))

        # ---> re-create the store_aggregation Timer object <--- #
        self._store_aggregation_timer = self.flush_timer_factory()
        self._store_aggregation_timer.start()
Testing this Handler class with random log severity in a for-loop:
if __name__ == "__main__":
    import random
    import logging

    logger = logging.getLogger()
    handler = LogAggregatorHandler()
    logger.addHandler(handler)
    logger.addHandler(logging.StreamHandler())
    logger.setLevel(logging.DEBUG)

    logger.info("entering logging loop")
    for i in range(25):
        # Randomly choose log severity:
        severity = random.choice([logging.DEBUG, logging.INFO, logging.WARN, logging.ERROR, logging.CRITICAL])
        logger.log(severity, "test message number %s", i)
    logger.info("end of test code")
If you want to add more stuff, this is what a Python log record looks like:
{'args': ['()'],
'created': ['1413747902.18'],
'exc_info': ['None'],
'exc_text': ['None'],
'filename': ['push_socket_log.py'],
'funcName': ['<module>'],
'levelname': ['DEBUG'],
'levelno': ['10'],
'lineno': ['17'],
'module': ['push_socket_log'],
'msecs': ['181.387901306'],
'msg': ['Test message.'],
'name': ['__main__'],
'pathname': ['./push_socket_log.py'],
'process': ['65486'],
'processName': ['MainProcess'],
'relativeCreated': ['12.6709938049'],
'thread': ['140735262810896'],
'threadName': ['MainThread']}
One more thing to think about:
Most features you run depend on a flow of several consecutive commands (which will ideally report log records accordingly); e.g. a client-server communication will typically depend on receiving a request, processing it, reading some data from the DB (which requires a connection and some read commands), some kind of parsing/processing, constructing the response packet and reporting the response code.
This highlights one of the main disadvantages of an aggregation approach: by aggregating log records you lose track of the time and order of the actions that took place. It will be extremely difficult to figure out which request was incorrectly structured if you only have the aggregation at hand.
My advice in this case is that you keep both the raw data and the aggregation (using two file handlers or something similar), so that you can investigate the macro level (aggregation) and the micro level (normal logging); a small sketch of such a setup follows below.
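A minimal sketch of that two-handler setup (the file name and format string are my own assumptions; LogAggregatorHandler is the class defined earlier in this answer):
import logging

root = logging.getLogger()
root.setLevel(logging.DEBUG)

# Micro level: every record, in order, with timestamps
raw_handler = logging.FileHandler('raw.log')
raw_handler.setFormatter(logging.Formatter('%(asctime)s %(levelname)s %(name)s: %(message)s'))
root.addHandler(raw_handler)

# Macro level: the aggregating handler defined above, flushing summaries periodically
root.addHandler(LogAggregatorHandler())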
However, you are still left with the responsibility of finding out that things have gone wrong and then manually investigating what caused it. When developing on your PC this is an easy enough task, but deploying your code on several production servers makes these tasks cumbersome and wastes a lot of your time.
Accordingly, there are several companies developing products specifically for log management. Most aggregate similar log records together, but others incorporate machine learning algorithms for automatic aggregation and for learning your software's behavior. Outsourcing your log handling can then enable you to focus on your product, instead of on your bugs.
Disclaimer: I work for Coralogix, one such solution.
You can subclass the logger class and override the exception method to put your error types in a cache until they reach a certain counter before they are emitted to the log.
import logging
from collections import defaultdict

MAX_COUNT = 99999


class MyLogger(logging.getLoggerClass()):
    def __init__(self, name):
        super(MyLogger, self).__init__(name)
        self.cache = defaultdict(int)

    def exception(self, msg, *args, **kwargs):
        err = msg.__class__.__name__
        self.cache[err] += 1
        if self.cache[err] > MAX_COUNT:
            new_msg = "{err} occurred {count} times.\n{msg}"
            new_msg = new_msg.format(err=err, count=MAX_COUNT, msg=msg)
            self.log(logging.ERROR, new_msg, *args, **kwargs)
            self.cache[err] = 0  # reset the counter instead of poisoning it with None


log = MyLogger('main')

try:
    raise TypeError("Useful error message")
except TypeError as err:
    log.exception(err)
Please note this isn't copy-paste code.
You need to add your handlers (I recommend a formatter, too) yourself.
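For example, a minimal handler and formatter setup for the logger above (my own sketch, not part of the original answer):
import logging

# 'log' is the MyLogger instance created above
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter('%(asctime)s %(levelname)s %(name)s: %(message)s'))
log.addHandler(handler)
log.setLevel(logging.DEBUG)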
https://docs.python.org/2/howto/logging.html#handlers
Have fun.
Create a counter and only log it for count=1, then increment thereafter and write it out in a finally block (to ensure it gets logged no matter how badly the application crashes and burns). This could of course pose an issue if you have the same exception for different reasons, but you could always search for the line number to verify it's the same issue, or something similar. A minimal example:
import logging

name_error_exception_count = 0
try:
    for i in range(99999):
        try:
            asdf[i]  # not defined!
        except NameError:
            name_error_exception_count += 1
            if name_error_exception_count == 1:
                logging.exception('foo')
        else:
            pass
except Exception:
    pass  # this is just to get the finally block; handle exceptions here too, maybe
finally:
    if name_error_exception_count > 0:
        # No exception is active at this point, so use error() rather than exception()
        logging.error('NameError exception occurred {} times.'.format(name_error_exception_count))

Threading memory profiling

So I hope this isn't a duplicate; however, I either haven't been able to find an adequate solution or I'm just not 100% sure what I'm looking for. I've written a program to thread lots of requests. I create a thread to:
1. Fetch responses from a number of APIs such as this: share.yandex.ru/gpp.xml?url=MY_URL, as well as scrape blogs
2. Parse the responses of all requests from the example above / JSON / using python-goose to extract articles
3. Return the parsed results back to the primary thread and insert them into a database.
It all went well until it needed to pull back larger amounts of data, which I hadn't tested before. The primary reason for this is that it takes me over my memory limit on a shared Linux server (512 MB), initiating a kill. This should be enough, as it's only a few thousand requests, although I could be wrong. I'm clearing all large data variables/objects within the main thread, but that doesn't seem to help either.
I ran a memory profile (via memory_profiler) on the primary function which creates the threads, with a thread class that looks like this:
class URLThread(Thread):
    def __init__(self, request):
        super(URLThread, self).__init__()
        self.url = request['request']
        self.post_id = request['post_id']
        self.domain_id = request['domain_id']
        self.post_data = request['post_params']
        self.type = request['type']
        self.code = ""
        self.result = ""
        self.final_results = ""
        self.error = ""
        self.encoding = ""

    def run(self):
        try:
            self.request = get_page(self.url, self.type)
            self.code = self.request['code']
            self.result = self.request['result']
            self.final_results = response_handler(dict(result=self.result, type=self.type, orig_url=self.url))
            self.encoding = chardet.detect(self.result)
            self.error = self.request['error']
        except Exception as e:
            exc_type, exc_obj, exc_tb = sys.exc_info()
            fname = os.path.split(exc_tb.tb_frame.f_code.co_filename)[1]
            errors.append((exc_type, fname, exc_tb.tb_lineno, e, 'NOW()'))
            pass


@profile
def multi_get(uris, timeout=2.0):
    def alive_count(lst):
        alive = map(lambda x: 1 if x.isAlive() else 0, lst)
        return reduce(lambda a, b: a + b, alive)

    threads = [URLThread(uri) for uri in uris]
    for thread in threads:
        thread.start()
    while alive_count(threads) > 0 and timeout > 0.0:
        timeout = timeout - UPDATE_INTERVAL
        sleep(UPDATE_INTERVAL)
    return [{"request": x.url,
             "code": str(x.code),
             "result": x.result,
             "post_id": str(x.post_id),
             "domain_id": str(x.domain_id),
             "final_results": x.final_results,
             "error": str(x.error),
             "encoding": str(x.encoding),
             "type": x.type}
            for x in threads]
And the results look like this on the first batch of requests I pump through it (FYI it's a link, as the output text isn't readable here; I can't paste an HTML table or embed an image until I get 2 more points):
http://tinypic.com/r/28c147d/8
And it doesn't seem to drop any of the memory on subsequent passes (I'm batching 100 requests/threads through at a time). By this I mean that once a batch of threads is complete, they seem to stay in memory, and every time another batch runs, memory is added, as below:
http://tinypic.com/r/nzkeoz/8
Am I doing something really stupid here?
Python will generally free the memory taken up by an object when there are no references to that object left. Your multi_get function returns a list that contains references to every thread that you have created, so it's unlikely that Python will free that memory. But we would need to see what the code that is calling multi_get is doing in order to be sure.
To start freeing the memory you will need to stop returning references to the threads from this function. Or, if you want to continue doing that, at least delete the references (del x) somewhere once you are done with them, as sketched below.
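For illustration, a hedged sketch of the calling side (the batch loop and the insert_into_database helper are my assumptions, since that code isn't shown in the question):
# Hypothetical caller of multi_get, showing how to drop references per batch.
for batch in batches:                  # 'batches' is assumed to be an iterable of request lists
    results = multi_get(batch)
    for item in results:
        insert_into_database(item)     # hypothetical persistence helper
    del results                        # drop the remaining references so the result
                                       # dicts (and their large payloads) can be freed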

Simultaneously modify different keys in ZODB

I'm using ZODB as a persistent storage for objects that are going to be modified through a webservice.
Below is an example to which I reduced the issue.
The increment function below is what is called from multiple threads.
My problem is that when increment is called simultaneously from two threads, for different keys, I'm getting a ConflictError.
I imagine it should be possible to resolve this in a proper way, at least as long as different keys are modified?
If so, I didn't manage to find an example of how to... (the ZODB documentation seems to be somewhat spread across different sites :/ )
I'd be glad for any ideas...
import time
import transaction

from ZODB.FileStorage import FileStorage
from ZODB.DB import DB
from ZODB.POSException import ConflictError


def test_db():
    store = FileStorage('zodb_storage.fs')
    return DB(store)


db_test = test_db()


# app here is a flask-app
@app.route('/increment/<string:key>')
def increment(key):
    '''increment the value of a certain key'''
    # open connection
    conn = db_test.open()
    # get the current value:
    root = conn.root()
    val = root.get(key, 0)
    # calculate new value
    # in the real application this might take some seconds
    time.sleep(0.1)
    root[key] = val + 1
    try:
        transaction.commit()
        return '%s = %g' % (key, val)
    except ConflictError:
        transaction.abort()
        return 'ConflictError :-('
You have two options here: implement conflict resolution, or retry the commit with fresh data.
Conflict resolution only applies to custom types you store in the ZODB, and can only be applied if you know how to merge your change into the newly-changed state.
The ZODB looks for a _p_resolveConflict() method on custom types and calls that method with the old state, the saved state you are in conflict with, and the new state you tried to commit; you are supposed to return the merged state. For a simple counter, like in your example, that'd be as simple as updating the saved state with the change between the old and new states:
from persistent import Persistent


class Counter(Persistent):
    def __init__(self, start=0):
        self._count = start

    def increment(self):
        self._count += 1
        return self._count

    def _p_resolveConflict(self, old, saved, new):
        # default __getstate__ returns a dictionary of instance attributes
        saved['_count'] += new['_count'] - old['_count']
        return saved
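For completeness, a hedged sketch of how the increment view might store such a Counter instead of plain integers (my own adaptation of the question's code; note that creating a brand-new key still modifies the root mapping itself, which can conflict on its own):
@app.route('/increment/<string:key>')
def increment(key):
    conn = db_test.open()
    root = conn.root()
    if key not in root:
        root[key] = Counter()   # the conflict-resolving type defined above
    time.sleep(0.1)             # stand-in for the real calculation
    val = root[key].increment()
    try:
        transaction.commit()
        return '%s = %g' % (key, val)
    except ConflictError:
        transaction.abort()
        return 'ConflictError :-('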
The other option is to retry the commit; you want to limit the number of retries, and you probably want to encapsulate this in a decorator on your method (a sketch of such a decorator follows at the end of this answer), but the basic principle is that you loop up to a limit, make your calculations based on ZODB data (which, after a conflict error, will automatically re-read fresh data where needed), then attempt to commit. If the commit is successful, you are done:
max_retries = 10
retry = 0
conn = db_test.open()
root = conn.root()
while retry < max_retries:
    val = root.get(key, 0)
    time.sleep(0.1)
    root[key] = val + 1
    try:
        transaction.commit()
        return '%s = %g' % (key, val)
    except ConflictError:
        transaction.abort()  # discard the failed transaction before retrying with fresh data
        retry += 1

raise CustomExceptionIndicatingTooManyRetries
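And a sketch of the decorator idea mentioned above (my own illustration; the decorator name, retry count and the simplified connection handling are assumptions, not an established API):
import functools

import transaction
from ZODB.POSException import ConflictError


def retry_on_conflict(max_retries=10):
    """Hypothetical decorator that retries a ZODB transaction on ConflictError."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    result = func(*args, **kwargs)
                    transaction.commit()
                    return result
                except ConflictError:
                    transaction.abort()   # fresh data is re-read on the next attempt
            raise RuntimeError('gave up after %d conflicting commits' % max_retries)
        return wrapper
    return decorator


@retry_on_conflict(max_retries=10)
def increment(key):
    # Simplified body based on the question; connection management is elided here.
    root = db_test.open().root()
    root[key] = root.get(key, 0) + 1
    return '%s = %g' % (key, root[key])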
