How to obtain a speedup using Python multiprocessing with cx_Oracle?

I took this code sample from "How to use threading in Python?":
from multiprocessing.dummy import Pool as ThreadPool
pool = ThreadPool(4)
results = pool.map(my_function, my_array)
It works perfectly with a function such as urllib.request.urlopen and gives roughly a 2-3x speed increase:
from urllib.request import urlopen
from multiprocessing.dummy import Pool as ThreadPool
from timeit import Timer

urls = [
    'http://www.python.org',
    'http://www.python.org/about/',
    'http://www.onlamp.com/pub/a/python/2003/04/17/metaclasses.html',
    'http://www.python.org/doc/',
    'http://www.python.org/download/',
    'http://www.python.org/getit/',
    'http://www.python.org/community/',
]

def mult(urls):
    pool = ThreadPool(8)
    results = pool.map(urlopen, urls)
    pool.close()
    pool.join()

def single(urls):
    [urlopen(url) for url in urls]

print(Timer(lambda: single(urls)).timeit(number=1))
print(Timer(lambda: mult(urls)).timeit(number=1))
But when calling DB procedures I did not notice any speedup from multiprocessing:
from multiprocessing.dummy import Pool as ThreadPool
import cx_Oracle as ora
import configparser

config = configparser.ConfigParser()
config.read('configuration.ini')
conf = config['sample_config']

dsn = ora.makedsn(conf['ip'], conf['port'], sid=conf['sid'])
connection = ora.Connection(user=conf['user'], password=conf['password'], dsn=dsn, threaded=True)
cursor = ora.Cursor(connection)

def test_function(params):
    cursor = connection.cursor()
    # call procedure
    cursor.callproc('Sample_PKG.my_procedure', keywordParameters=params)

dicts = [{'a': 'b'}, {'a': 'c'}]  # a lot of dictionaries, each containing about 30 items

pool = ThreadPool(4)
pool.map(test_function, dicts)
pool.close()
pool.join()
So why is that, and what can be done to speed the script up?
UPD
I tried using a session pool. This sample code works:
db_pool = ora.SessionPool(user=conf['user'], password=conf['password'], dsn=dsn, min=1, max=4, increment=1, threaded=True)
connection = ora.Connection(dsn=dsn, pool=db_pool)
cursor = connection.cursor()
cursor.execute("select 1 from dual")
result = cursor.fetchall()
cursor.close()
db_pool.release(connection)
But when I replace
cursor.execute("select 1 from dual")
with
cursor.callproc('Sample_PKG.my_procedure', keywordParameters=params)
the console hangs. Am I doing something wrong?

Each Oracle connection can only execute one statement at a time. Try using a session pool and use a different connection for each statement (and hope your DBA doesn't mind the extra connections).
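For what it's worth, here is a minimal sketch of that approach, reusing the conf, dsn and dicts definitions from the question: each worker thread acquires its own connection from the SessionPool, calls the procedure, and releases the connection when done.

from multiprocessing.dummy import Pool as ThreadPool
import cx_Oracle as ora

db_pool = ora.SessionPool(user=conf['user'], password=conf['password'], dsn=dsn,
                          min=1, max=4, increment=1, threaded=True)

def test_function(params):
    connection = db_pool.acquire()   # one pooled connection per in-flight call
    try:
        cursor = connection.cursor()
        cursor.callproc('Sample_PKG.my_procedure', keywordParameters=params)
        cursor.close()
    finally:
        db_pool.release(connection)  # hand the connection back to the pool

pool = ThreadPool(4)                 # no more worker threads than pooled sessions
pool.map(test_function, dicts)
pool.close()
pool.join()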

Related

Cannot use urllib when monkey patch of gevent was applied

I am new to the gevent package, which allows us to run things asynchronously. I found this example code on https://sdiehl.github.io/gevent-tutorial/ that makes async API calls. The original code is:
import gevent.monkey
gevent.monkey.patch_socket()

import gevent
import urllib2
import simplejson as json

def fetch(pid):
    response = urllib2.urlopen('http://json-time.appspot.com/time.json')
    result = response.read()
    json_result = json.loads(result)
    datetime = json_result['datetime']
    print('Process %s: %s' % (pid, datetime))
    return json_result['datetime']

def synchronous():
    for i in range(1, 10):
        fetch(i)

def asynchronous():
    threads = []
    for i in range(1, 10):
        threads.append(gevent.spawn(fetch, i))
    gevent.joinall(threads)

print('Synchronous:')
synchronous()

print('Asynchronous:')
asynchronous()
Because the time API used in the tutorial no longer works, I had to make some changes. This is the code after my edits:
import gevent
from urllib.request import urlopen  # this is the alternative for urllib2
import requests
#import simplejson as json

def fetch(pid):
    response = urlopen('https://just-the-time.appspot.com/')
    result = response.read()
    datetime = result.decode('utf-8')
    print('Process %s: %s' % (pid, datetime))
    return datetime

def synchronous():
    for i in range(1, 10):
        fetch(i)

def asynchronous():
    threads = []
    for i in range(1, 10):
        threads.append(gevent.spawn(fetch, i))
    gevent.joinall(threads)

print('Synchronous:')
synchronous()

print('Asynchronous:')
asynchronous()
I tried making requests to the time API directly and it works fine. But when I apply the monkey patch from gevent, it gives this error: TypeError: _wrap_socket() argument 'sock' must be _socket.socket, not SSLSocket
If I remove the monkey patch it works fine, but then I am not sure whether the async part actually runs asynchronously, because the synchronous and asynchronous results look almost identical. Can someone explain what is going on here?
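For reference, a minimal sketch of the ordering the gevent documentation recommends: monkey-patch everything (including ssl) before any module that touches sockets or SSL is imported. This is the commonly suggested approach for that _wrap_socket error, offered as a sketch rather than a verified fix for this exact setup:

from gevent import monkey
monkey.patch_all()  # patch socket, ssl, etc. before anything else is imported

import gevent
from urllib.request import urlopen

def fetch(pid):
    response = urlopen('https://just-the-time.appspot.com/')
    datetime = response.read().decode('utf-8')
    print('Process %s: %s' % (pid, datetime))
    return datetime

threads = [gevent.spawn(fetch, i) for i in range(1, 10)]
gevent.joinall(threads)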

Python: use external function in "apply_async"

I'm starting to learn Python and have to write a script which takes advantage of multiprocessing. It looks as follows:
import multiprocessing as mp
import os

abspath = os.path.abspath(__file__)
dname = os.path.dirname(abspath)
os.chdir(dname)

import PoseOptimization2 as opt
import telemMS_Vicon as extr
import Segmentation as seg

result_list = []

def log_result(result):
    # This is called whenever foo_pool(i) returns a result.
    # result_list is modified only by the main process, not the pool workers.
    result_list.append(result)
    print('Frame:', result[1])

def multiproc(seg):
    pool = mp.Pool(mp.cpu_count())
    for i in range(0, len(seg)):
        pool.apply_async(opt.optimize, args=(seg[0], seg[i], False, False, i), callback=log_result)
    pool.close()
    pool.join()

if __name__ == "__main__":
    # extract point cloud data (list of arrays)
    OutputObj = extr.extraction('MyoSuit Custom Feasibility')
    segments = seg.segmentation(OutputObj)

    multiproc(segments[0])
    list1 = result_list
    result_list = []
In principle, the variable "segments" contains several lists. The entries of each list are then to be modified using the function "optimize" from the "PoseOptimization2" script. To make this faster, I want to use multiprocessing.
But when I try to run the function "multiproc", I get the error "ModuleNotFoundError: No module named 'PoseOptimization2'" and the program has to be restarted before it works again.
How can I properly pass a function from another script to the multiprocessing function?
Thank you very much for your help!
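One workaround that is sometimes suggested for this kind of error (an assumption on my part, not an answer recorded in this thread) is to make sure the directory containing PoseOptimization2.py is on sys.path inside every worker process, for example with a Pool initializer; opt and log_result are the names from the script above:

import os
import sys
import multiprocessing as mp

SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))  # same directory as dname above

def add_script_dir(path):
    # Pool initializer: runs once in each worker process before any task,
    # so the worker can import PoseOptimization2 when it unpickles opt.optimize.
    if path not in sys.path:
        sys.path.insert(0, path)

def multiproc(seg):
    pool = mp.Pool(mp.cpu_count(),
                   initializer=add_script_dir,
                   initargs=(SCRIPT_DIR,))
    for i in range(len(seg)):
        pool.apply_async(opt.optimize, args=(seg[0], seg[i], False, False, i), callback=log_result)
    pool.close()
    pool.join()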

Lambda Python Pool.map and urllib2.urlopen : Retry only failing processes, log only errors

I have an AWS Lambda function which calls a set of URLs using pool.map. The problem is that if one of the URLs returns anything other than a 200, the Lambda function fails and immediately retries, and it retries the ENTIRE Lambda function. I'd like it to retry only the failed URLs, and if (after a second try) it still fails them, call a fixed URL to log an error.
This is the code as it currently sits (with some details removed), which works only when all URLs succeed:
from __future__ import print_function

import urllib2
from multiprocessing.dummy import Pool as ThreadPool
import hashlib
import datetime
import json

print('Loading function')

def lambda_handler(event, context):
    f = urllib2.urlopen("https://example.com/geturls/?action=something");
    data = json.loads(f.read());
    urls = [];
    for d in data:
        urls.append("https://"+d+".example.com/path/to/action");

    # Make the Pool of workers
    pool = ThreadPool(4);

    # Open the urls in their own threads
    # and return the results
    results = pool.map(urllib2.urlopen, urls);

    # close the pool and wait for the work to finish
    pool.close();
    return pool.join();
I tried reading the official documentation but it seems a bit lacking in its explanation of the map function, specifically of its return values.
Using the urlopen documentation I've tried modifying my code to the following:
from __future__ import print_function

import urllib2
from multiprocessing.dummy import Pool as ThreadPool
import hashlib
import datetime
import json

print('Loading function')

def lambda_handler(event, context):
    f = urllib2.urlopen("https://example.com/geturls/?action=something");
    data = json.loads(f.read());
    urls = [];
    for d in data:
        urls.append("https://"+d+".example.com/path/to/action");

    # Make the Pool of workers
    pool = ThreadPool(4);

    # Open the urls in their own threads
    # and return the results
    try:
        results = pool.map(urllib2.urlopen, urls);
    except URLError:
        try: # try once more before logging error
            urllib2.urlopen(URLError.url); # TODO: figure out which URL errored
        except URLError: # log error
            urllib2.urlopen("https://example.com/error/?url="+URLError.url);

    # close the pool and wait for the work to finish
    pool.close();
    return True; # always return true so we never duplicate successful calls
I'm not sure whether handling exceptions that way is correct, or even whether my Python exception syntax is right. Again, my goal is to retry only the failed URLs, and if (after a second try) it still fails them, call a fixed URL to log an error.
I figured out the answer thanks to a "lower-level" look at this question I posted here.
The answer was to create my own custom wrapper around the urllib2.urlopen function, since each thread itself needs to be wrapped in try/except rather than the pool.map call as a whole. That function looks like this:
from urllib2 import URLError  # needed so the except clause below can see URLError

def my_urlopen(url):
    try:
        return urllib2.urlopen(url)
    except URLError:
        urllib2.urlopen("https://example.com/log_error/?url="+url)
        return None
I put that above the lambda_handler function definition, and then I can replace the whole try/except inside it. From this:
try:
    results = pool.map(urllib2.urlopen, urls);
except URLError:
    try: # try once more before logging error
        urllib2.urlopen(URLError.url);
    except URLError: # log error
        urllib2.urlopen("https://example.com/error/?url="+URLError.url);
To this:
results = pool.map(my_urlopen, urls);
Q.E.D.
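Note that my_urlopen above logs to the error endpoint on the first failure, while the stated goal was to retry a failed URL once before logging. A minimal sketch of that variant, using the same hypothetical example.com endpoints:

from urllib2 import URLError
import urllib2

def my_urlopen(url):
    # Two attempts; if both fail, report the failing URL to the logging endpoint.
    for attempt in range(2):
        try:
            return urllib2.urlopen(url)
        except URLError:
            pass
    urllib2.urlopen("https://example.com/error/?url=" + url)
    return None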

Python: Using sqlite3 with multiprocessing

I have a SQLite3 DB. I need to parse 10000 files. I read some data from each file, and then query the DB with this data to get a result. My code works fine in a single-process environment, but I get an error when trying to use the multiprocessing Pool.
My approach without multiprocessing (works OK):
1. Open DB connection object
2. for f in files:
       foo(f, x1=x1, x2=x2, ..., db=DB)
3. Close DB
My approach with multiprocessing (does NOT work):
1. Open DB
2. pool = multiprocessing.Pool(processes=4)
3. pool.map(functools.partial(foo, x1=x1, x2=x2, ..., db=DB), [files])
4. pool.close()
5. Close DB
I get the following error: sqlite3.ProgrammingError: Base Cursor.__init__ not called.
My DB class is implemented as follows:
def open_db(sqlite_file):
    """Open SQLite database connection.

    Args:
        sqlite_file -- File path
    Return:
        Connection
    """
    log.info('Open SQLite database %s', sqlite_file)
    try:
        conn = sqlite3.connect(sqlite_file)
    except sqlite3.Error, e:
        log.error('Unable to open SQLite database %s', e.args[0])
        sys.exit(1)
    return conn

def close_db(conn, sqlite_file):
    """Close SQLite database connection.

    Args:
        conn -- Connection
    """
    if conn:
        log.info('Close SQLite database %s', sqlite_file)
        conn.close()

class MapDB:

    def __init__(self, sqlite_file):
        """Initialize.

        Args:
            sqlite_file -- File path
        """
        # 1. Open database.
        # 2. Setup to receive data as dict().
        # 3. Get cursor to execute queries.
        self._sqlite_file = sqlite_file
        self._conn = open_db(sqlite_file)
        self._conn.row_factory = sqlite3.Row
        self._cursor = self._conn.cursor()

    def close(self):
        """Close DB connection."""
        if self._cursor:
            self._cursor.close()
        close_db(self._conn, self._sqlite_file)

    def check(self):
        ...

    def get_driver_net(self, net):
        ...

    def get_cell_id(self, net):
        ...
Function foo() looks like this:
def foo(f, x1, x2, db):
    # extract some data from file f
    r1 = db.get_driver_net(...)
    r2 = db.get_cell_id(...)
The overall (not working) implementation is as follows:
mapdb = MapDB(sqlite_file)
log.info('Create NetInfo objects')
pool = multiprocessing.Pool(processes=4)
files = [get list of files to process]
pool.map(functools.partial(foo, x1=x1, x2=x2, db=mapdb), files)
pool.close()
mapdb.close()
To fix this, I think I need to create the MapDB() object inside each pool worker (so that there are 4 parallel/independent connections). But I'm not sure how to do this. Can someone show me an example of how to accomplish this with Pool?
What about defining foo like this:
def foo(f, x1, x2, db_path):
    mapdb = MapDB(db_path)   # open the DB inside the worker
    # ... process data ...
    mapdb.close()
and then change your pool.map call to:
pool.map(functools.partial(foo, x1=x1, x2=x2, db_path="path-to-sqlite3-db"), files)
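Opening a new MapDB for every file works, but it pays the connection cost 10000 times. A common refinement (sketched here under the assumption that each worker may keep one connection open for its lifetime) is to open the database once per worker with a Pool initializer and reuse it; the elided query arguments are left as in the question:

import functools
import multiprocessing

_worker_db = None  # one MapDB instance per worker process

def init_worker(db_path):
    # Runs once in each worker process when the pool starts.
    global _worker_db
    _worker_db = MapDB(db_path)

def foo(f, x1, x2):
    # ... extract data from file f, then query through the per-worker connection ...
    r1 = _worker_db.get_driver_net(...)
    r2 = _worker_db.get_cell_id(...)
    return r1, r2

pool = multiprocessing.Pool(processes=4,
                            initializer=init_worker,
                            initargs=("path-to-sqlite3-db",))
pool.map(functools.partial(foo, x1=x1, x2=x2), files)
pool.close()
pool.join()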
Update
Another option is to handle the worker threads yourself and distribute work via a Queue.
from Queue import Queue
from threading import Thread

q = Queue()

def worker():
    mapdb = ...  # open the sqlite database
    while True:
        item = q.get()
        if item[0] == "file":
            file = item[1]
            # ... process file ...
            q.task_done()
        else:
            q.task_done()
            break
    # ... close sqlite connection ...

# Start up the workers
nworkers = 4
for i in range(nworkers):
    t = Thread(target=worker)  # use a separate name so the worker function isn't shadowed
    t.daemon = True
    t.start()

# Place work on the Queue
for x in ...:  # list of files
    q.put(("file", x))

# Place termination tokens onto the Queue
for i in range(nworkers):
    q.put(("end",))

# Wait for all work to be done.
q.join()
The termination tokens are used to ensure that the sqlite connections are closed - in case that matters.

Python multiprocessing pool hangs on map call

I have a function that parses a file and inserts the data into MySQL using SQLAlchemy. I've been running the function sequentially on the result of os.listdir() and everything works perfectly.
Because most of the time is spent reading the file and writing to the DB, I wanted to use multiprocessing to speed things up. Here is my pseudocode, as the actual code is too long:
def parse_file(filename):
    f = open(filename, 'rb')
    data = f.read()
    f.close()
    soup = BeautifulSoup(data, features="lxml", from_encoding='utf-8')
    # parse file here
    db_record = MyDBRecord(parsed_data)
    session.add(db_record)
    session.commit()

pool = mp.Pool(processes=8)
pool.map(parse_file, ['my_dir/' + filename for filename in os.listdir("my_dir")])
The problem I'm seeing is that the script hangs and never finishes. I usually get 48 of 63 records into the database. Sometimes it's more, sometimes it's less.
I've tried using pool.close() in combination with pool.join(), and neither seems to help.
How do I get this script to complete? What am I doing wrong? I'm using Python 2.7.8 on a Linux box.
You need to put all code which uses multiprocessing inside its own function. This stops it from recursively launching new pools when multiprocessing re-imports your module in separate processes:
def parse_file(filename):
    ...

def main():
    pool = mp.Pool(processes=8)
    pool.map(parse_file, ['my_dir/' + filename for filename in os.listdir("my_dir")])

if __name__ == '__main__':
    main()
See the documentation about making sure your module is importable, and also the advice for running on Windows.
The problem was a combination of 2 things:
my pool code being called multiple times (thanks @Peter Wood)
my DB code opening too many sessions and/or sharing sessions
I made the following changes and everything works now:
Original File
def parse_file(filename):
    f = open(filename, 'rb')
    data = f.read()
    f.close()
    soup = BeautifulSoup(data, features="lxml", from_encoding='utf-8')
    # parse file here
    db_record = MyDBRecord(parsed_data)
    session = get_session()  # see below
    session.add(db_record)
    session.commit()

pool = mp.Pool(processes=8)
pool.map(parse_file, ['my_dir/' + filename for filename in os.listdir("my_dir")])
DB File
def get_session():
    engine = create_engine('mysql://root:root@localhost/my_db')
    Base.metadata.create_all(engine)
    Base.metadata.bind = engine
    db_session = sessionmaker(bind=engine)
    return db_session()
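A side note on the fix above, offered as a sketch rather than as part of the original answer: get_session() creates a new engine and re-runs create_all() for every file. Assuming the same Base and connection URL, the engine and session factory can be built once per worker process and reused, handing out a fresh session per call:

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

_engine = None
_Session = None

def get_session():
    # Lazily create one engine and session factory per process,
    # then return a new session for each call.
    global _engine, _Session
    if _engine is None:
        _engine = create_engine('mysql://root:root@localhost/my_db')
        Base.metadata.create_all(_engine)
        _Session = sessionmaker(bind=_engine)
    return _Session()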
