I have a function that queries a large table for the purposes of indexing it... It creates a server-side cursor named "all_accounts".
def get_all_accounts(self):
    cursor = self.get_cursor('all_accounts')
    cursor.execute("SELECT * FROM account_summary LIMIT 20000;")
I then process those 2,000 or so at a time to insert into a NoSQL solution:
def index_docs(self, cursor):
    while True:
        # consume the result over a series of iterations,
        # with each iteration fetching 2000 records
        record_count = cursor.rowcount
        records = cursor.fetchmany(size=2000)
        if not records:
            break
        for r in records:
            pass  # do stuff
I'd like the index_docs function to consume the cursor's fetchmany() calls in parallel, roughly 10 ways, since my bottleneck is not the target system but the single-threaded nature of my script. I have done a few async/worker things in the past, but the psycopg2 cursor seemed like it might be an issue. Thoughts?
I think you'll be safe if a single process/thread accesses the cursor and dishes out work to multiple worker processes that push to the other database. (At a quick glance, server-side cursors can't be shared between connections, but I could be wrong there.)
That is, something like this. Generally you'd use imap_unordered to iterate over a collection of single items (and use a higher chunksize than the default 1), but I think we can just as well use the batches here...
import multiprocessing

def get_batches(conn):
    cursor = conn.get_cursor('all_accounts')
    cursor.execute("SELECT * FROM account_summary LIMIT 20000;")
    while True:
        records = cursor.fetchmany(size=500)
        if not records:
            break
        yield list(records)

def process_batch(batch):
    # (this function is run in child processes)
    for r in batch:
        pass  # ...
    return "some arbitrary result"

def main():
    conn = connect...()
    with multiprocessing.Pool() as p:
        batch_generator = get_batches(conn)
        for result in p.imap_unordered(process_batch, batch_generator):
            print(result)  # doesn't really matter
I followed the code below in order to implement parallel select queries on a Postgres database:
https://tech.geoblink.com/2017/07/06/parallelizing-queries-in-postgresql-with-python/
My basic problem is that I have ~6k queries that need to be executed, and I am trying to optimise their execution. Initially it was a single query whose where id in (...) clause contained all 6k predicate IDs, but I ran into issues with that query using > 4GB of RAM on the machine it ran on, so I split it out into 6k individual queries, which keeps memory usage steady when run synchronously. However, it takes a lot longer to run, which is less of an issue for my use case; even so, I am trying to reduce the time as much as possible.
This is what my code looks like:
import logging
import multiprocessing
from itertools import chain

import psycopg2
from sqlalchemy import create_engine

LOGGER = logging.getLogger(__name__)

class PostgresConnector(object):
    def __init__(self, db_url):
        self.db_url = db_url
        self.engine = self.init_connection()
        self.pool = self.init_pool()

    def init_pool(self):
        CPUS = multiprocessing.cpu_count()
        return multiprocessing.Pool(CPUS)

    def init_connection(self):
        LOGGER.info('Creating Postgres engine')
        return create_engine(self.db_url)

    def run_parallel_queries(self, queries):
        results = []
        try:
            for i in self.pool.imap_unordered(self.execute_parallel_query, queries):
                results.append(i)
        except Exception as exception:
            LOGGER.error('Error whilst executing %s queries in parallel: %s', len(queries), exception)
            raise
        finally:
            self.pool.close()
            self.pool.join()

        LOGGER.info('Parallel query ran producing %s sets of results of type: %s', len(results), type(results))
        return list(chain.from_iterable(results))

    def execute_parallel_query(self, query):
        con = psycopg2.connect(self.db_url)
        cur = con.cursor()
        cur.execute(query)
        records = cur.fetchall()
        con.close()
        return list(records)
However whenever this runs, I get the following error:
TypeError: can't pickle _thread.RLock objects
I've read lots of similar questions regarding the use of multiprocessing and pickleable objects, but I can't for the life of me figure out what I am doing wrong.
The pool is generally one per process (which I believe is best practice) but is shared per instance of the connector class, so that it isn't creating a pool for each use of the run_parallel_queries method.
The top answer to a similar question:
Accessing a MySQL connection pool from Python multiprocessing
Shows an almost identical implementation to my own, except using MySql instead of Postgres.
Am I doing something wrong?
Thanks!
EDIT:
I've found this answer:
Python Postgres psycopg2 ThreadedConnectionPool exhausted
which is incredibly detailed and suggests I have misunderstood what multiprocessing.Pool gives me versus a connection pool such as ThreadedConnectionPool. However, the first link doesn't mention needing any connection pools etc. This solution seems good, but it seems like a lot of code for what I think is a fairly simple problem?
EDIT 2:
So the above link solves another problem, which I would likely have run into anyway, so I'm glad I found that, but it doesn't solve the initial issue of not being able to use imap_unordered due to the pickling error. Very frustrating.
Lastly, it's probably worth noting that this runs on Heroku, on a worker dyno, using Redis rq for scheduling and background tasks, with a hosted instance of Postgres as the database.
To put it simply, Postgres connections and the SQLAlchemy connection pool are thread-safe; however, they are not fork-safe.
If you want to use multiprocessing, you should initialize the engine in each child process after the fork.
You should use multithreading instead if you want to share engines.
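For example, a minimal sketch of that multithreading route, sharing one engine across a ThreadPoolExecutor (db_url and queries here are placeholders, not your actual values):

from concurrent.futures import ThreadPoolExecutor
from itertools import chain

from sqlalchemy import create_engine, text

db_url = "postgresql://user:password@localhost/mydb"  # placeholder
queries = ["SELECT 1", "SELECT 2"]  # placeholder for your ~6k queries

# One engine for the whole process; its internal connection pool is thread-safe.
engine = create_engine(db_url)

def execute_query(query):
    # Each thread checks a connection out of the shared pool and returns it on exit.
    with engine.connect() as conn:
        return conn.execute(text(query)).fetchall()

with ThreadPoolExecutor(max_workers=8) as executor:
    results = list(chain.from_iterable(executor.map(execute_query, queries)))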
Refer to Thread and process safety in the psycopg2 documentation:
libpq connections shouldn't be used by a forked processes, so when using a module such as multiprocessing or a forking web deploy method such as FastCGI make sure to create the connections after the fork.
If you are using multiprocessing.Pool, there is a keyword argument initializer which can be used to run code once on each child process. Try this:
class PostgresConnector(object):
    def __init__(self, db_url):
        self.db_url = db_url
        self.pool = self.init_pool()

    def init_pool(self):
        CPUS = multiprocessing.cpu_count()
        return multiprocessing.Pool(CPUS, initializer=self.init_connection(self.db_url))

    @classmethod
    def init_connection(cls, db_url):
        def _init_connection():
            LOGGER.info('Creating Postgres engine')
            cls.engine = create_engine(db_url)
        return _init_connection

    def run_parallel_queries(self, queries):
        results = []
        try:
            for i in self.pool.imap_unordered(self.execute_parallel_query, queries):
                results.append(i)
        except Exception as exception:
            LOGGER.error('Error whilst executing %s queries in parallel: %s', len(queries), exception)
            raise
        finally:
            pass
            # self.pool.close()
            # self.pool.join()

        LOGGER.info('Parallel query ran producing %s sets of results of type: %s', len(results), type(results))
        return list(chain.from_iterable(results))

    def execute_parallel_query(self, query):
        with self.engine.connect() as conn:
            with conn.begin():
                result = conn.execute(query)
                return result.fetchall()

    def __getstate__(self):
        # this is a hack; if you want to remove this method, you should
        # remove self.pool and just pass the pool explicitly
        self_dict = self.__dict__.copy()
        del self_dict['pool']
        return self_dict
Now, to address the XY problem.
Initially it was a single query whose where id in (...) clause contained all 6k predicate IDs, but I ran into issues with that query using > 4GB of RAM on the machine it ran on, so I split it out into 6k individual queries, which keeps memory usage steady when run synchronously.
What you may want to do instead is one of these options:
write a subquery that generates all 6000 IDs and use the subquery in your original bulk query.
as above, but write the subquery as a CTE
if your ID list comes from an external source (i.e. not from the database), then you can create a temporary table containing the 6000 IDs and then run your original bulk query against the temporary table
However, if you insist on running 6000 IDs through Python, then the fastest approach is likely neither to do all 6000 IDs in one go (which will run out of memory) nor to run 6000 individual queries. Instead, you may want to try chunking the queries: send 500 IDs at once, for example. You will have to experiment with the chunk size to determine the largest number of IDs you can send at one time while still staying comfortably within your memory budget.
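A rough sketch of that chunking idea with psycopg2 (the table and column names are invented for illustration; = ANY(%s) lets psycopg2 pass a Python list as a Postgres array):

import psycopg2

def fetch_in_chunks(db_url, ids, chunk_size=500):
    results = []
    with psycopg2.connect(db_url) as con:
        with con.cursor() as cur:
            for i in range(0, len(ids), chunk_size):
                chunk = ids[i:i + chunk_size]
                cur.execute("SELECT * FROM my_table WHERE id = ANY(%s)", (chunk,))
                results.extend(cur.fetchall())
    return results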
I am reading data from large CSV files, processing it, and loading it into a SQLite database. Profiling suggests 80% of my time is spent on I/O and 20% is processing input to prepare it for DB insertion. I sped up the processing step with multiprocessing.Pool so that the I/O code is never waiting for the next record. But, this caused serious memory problems because the I/O step could not keep up with the workers.
The following toy example illustrates my problem:
#!/usr/bin/env python
# Python 3.4.3
import time
from multiprocessing import Pool

def records(num=100):
    """Simulate generator getting data from large CSV files."""
    for i in range(num):
        print('Reading record {0}'.format(i))
        time.sleep(0.05)  # getting raw data is fast
        yield i

def process(rec):
    """Simulate processing of raw text into dicts."""
    print('Processing {0}'.format(rec))
    time.sleep(0.1)  # processing takes a little time
    return rec

def writer(records):
    """Simulate saving data to SQLite database."""
    for r in records:
        time.sleep(0.3)  # writing takes the longest
        print('Wrote {0}'.format(r))

if __name__ == "__main__":
    data = records(100)
    with Pool(2) as pool:
        writer(pool.imap_unordered(process, data, chunksize=5))
This code results in a backlog of records that eventually consumes all memory because I cannot persist the data to disk fast enough. Run the code and you'll notice that Pool.imap_unordered will consume all the data when writer is at the 15th record or so. Now imagine the processing step is producing dictionaries from hundreds of millions of rows and you can see why I run out of memory. Amdahl's Law in action perhaps.
What is the fix for this? I think I need some sort of buffer for Pool.imap_unordered that says "once there are x records that need insertion, stop and wait until there are less than x before making more." I should be able to get some speed improvement from preparing the next record while the last one is being saved.
I tried using NuMap from the papy module (which I modified to work with Python 3) to do exactly this, but it wasn't faster. In fact, it was worse than running the program sequentially; NuMap uses two threads plus multiple processes.
Bulk import features of SQLite are probably not suited to my task because the data need substantial processing and normalization.
I have about 85G of compressed text to process. I'm open to other database technologies, but picked SQLite for ease of use and because this is a write-once read-many job in which only 3 or 4 people will use the resulting database after everything is loaded.
As I was working on the same problem, I figured that an effective way to prevent the pool from overloading is to use a semaphore with a generator:
from multiprocessing import Pool, Semaphore

def produce(semaphore, from_file):
    with open(from_file) as reader:
        for line in reader:
            # Reduce semaphore by 1 or wait if 0
            semaphore.acquire()
            # Now deliver an item to the caller (pool)
            yield line

def process(item):
    result = (first_function(item),
              second_function(item),
              third_function(item))
    return result

def consume(semaphore, result):
    database_con.cur.execute("INSERT INTO ResultTable VALUES (?,?,?)", result)
    # Result is consumed, semaphore may now be increased by 1
    semaphore.release()

def main():
    global database_con
    semaphore_1 = Semaphore(1024)
    with Pool(2) as pool:
        for result in pool.imap_unordered(process, produce(semaphore_1, "workfile.txt"), chunksize=128):
            consume(semaphore_1, result)
See also:
K Hong - Multithreading - Semaphore objects & thread pool
Lecture from Chris Terman - MIT 6.004 L21: Semaphores
Since processing is fast, but writing is slow, it sounds like your problem is I/O-bound. Therefore there might not be much to be gained from using multiprocessing.
However, it is possible to peel off chunks of data, process the chunk, and wait until that data has been written before peeling off another chunk:
import itertools as IT

if __name__ == "__main__":
    data = records(100)
    with Pool(2) as pool:
        chunksize = ...
        for chunk in iter(lambda: list(IT.islice(data, chunksize)), []):
            writer(pool.imap_unordered(process, chunk, chunksize=5))
It sounds like all you really need is to replace the unbounded queues underneath the Pool with bounded (and blocking) queues. That way, if any side gets ahead of the rest, it'll just block until they're ready.
This would be easy to do by peeking at the source and subclassing or monkeypatching Pool, something like:
import multiprocessing.pool
import queue

class Pool(multiprocessing.pool.Pool):
    def _setup_queues(self):
        self._inqueue = self._ctx.Queue(5)
        self._outqueue = self._ctx.Queue(5)
        self._quick_put = self._inqueue._writer.send
        self._quick_get = self._outqueue._reader.recv
        self._taskqueue = queue.Queue(10)
But that's obviously not portable (even to CPython 3.3, much less to a different Python 3 implementation).
I think you can do it portably in 3.4+ by providing a customized context, but I haven't been able to get that right, so…
A simple workaround might be to use psutil to detect the memory usage in each process and say, if more than 90% of memory is taken, then just sleep for a while.
import time

import psutil

while psutil.virtual_memory().percent > 75:
    time.sleep(1)
    print("process paused for 1 second!")
The SQLite documentation says (here) that you can avoid checkpoint pauses in WAL-mode by running the checkpoints on a separate thread. I tried this, and it doesn't appear to work: the '-wal' file grows without bound, it is unclear whether anything is actually getting copied back into the main database file, and (most important) after the -wal file has gotten big enough (over a gigabyte) the main thread starts having to wait for the checkpointer.
In my application the main thread continuously does something essentially equivalent to this, where generate_data is going to spit out on the order of a million rows to be inserted:
db = sqlite3.connect("database.db")
cursor = db.cursor()
cursor.execute("PRAGMA wal_autocheckpoint = 0")
for datum in generate_data():
# It is a damned shame that there is no way to do this in one operation.
cursor.execute("SELECT id FROM strings WHERE str = ?", (datum.text,))
row = cursor.fetchone()
if row is not None:
id = row[0]
else:
cur.execute("INSERT INTO strings VALUES(NULL, ?)", (datum.text,))
id = cur.lastrowid
cursor.execute("INSERT INTO data VALUES (?, ?, ?)",
(id, datum.foo, datum.bar))
batch_size += 1
if batch_size > batch_limit:
db.commit()
batch_size = 0
and the checkpoint thread does this:
db = sqlite3.connect("database.db")
cursor = db.cursor()
cursor.execute("PRAGMA wal_autocheckpoint = 0")
while True:
time.sleep(10)
cursor.execute("PRAGMA wal_checkpoint(PASSIVE)")
(Being on separate threads, they have to have separate connections to the database, because pysqlite doesn't support sharing a connection among multiple threads.) Changing to a FULL or RESTART checkpoint does not help - then the checkpoints just fail.
How do I make this actually work? Desiderata are: 1) main thread never has to wait, 2) journal file does not grow without bound.
Checkpointing needs to lock the entire database, so all other readers and writers would have to be blocked.
(A passive checkpoint just aborts.)
So running checkpointing in a separate thread does not increase concurrency.
(The SQLite documentation suggests this only because the main thread might not be designed to handle checkpointing at idle moments.)
If you continuously access the database, you cannot checkpoint.
If your batch operations make the WAL file grow too big, you should insert explicit checkpoints into that loop (or rely on autocheckpointing).
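For instance, a small helper along these lines (a sketch, untested against your workload) that the writing thread could call in place of the bare db.commit():

import sqlite3

def commit_and_checkpoint(db):
    # Commit the current batch, then fold the WAL back into the main database file.
    db.commit()
    # A blocking checkpoint mode (FULL/RESTART/TRUNCATE) keeps the -wal file from
    # growing without bound; run from the writing connection right after a commit,
    # it is less likely to find a statement in progress.
    db.execute("PRAGMA wal_checkpoint(TRUNCATE)")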
I have a list of approximately 60,000 items. I would like to send queries to the database to check if they exist and, if they do, return some computed results. I ran an ordinary query while iterating through the list one by one; the query has been running for the last 4 days. I thought I could use the threading module to improve on this. I did something like this:
if __name__ == '__main__':
    for ra, dec in candidates:
        t = threading.Thread(target=search_sl, args=(ra, dec, q))
        t.start()
    t.join()
I tested with only 10 items and it worked fine, but when I submitted the whole list of 60k items, I ran into errors, i.e. "maximum number of sessions exceeded". What I want to do is create maybe 10 threads at a time. When the first bunch of threads has finished executing, I send another batch, and so on.
You could try using a process pool, which is available in the multiprocessing module. Here is the example from the python docs:
from multiprocessing import Pool

def f(x):
    return x*x

if __name__ == '__main__':
    pool = Pool(processes=4)            # start 4 worker processes
    result = pool.apply_async(f, [10])  # evaluate "f(10)" asynchronously
    print result.get(timeout=1)         # prints "100" unless your computer is *very* slow
    print pool.map(f, range(10))        # prints "[0, 1, 4,..., 81]"
http://docs.python.org/library/multiprocessing.html#using-a-pool-of-workers
Try increasing the number of processes until you reach the maximum your system can support.
Improve your queries before threading (premature optimization is the root of all evil!)
Your problem is having 60,000 different queries on a single database. Having a single query for each item means a lot of overhead for opening the connection and invoking a DB cursor session.
Threading those queries can speed up your process, but yields another set of problems like DB overload and max sessions allowed.
First approach: Load many item IDs into every query
Instead, try to improve your queries. Can you write a query that sends a long list of products and returns the matches? Perhaps something like:
SELECT item_id, *
FROM items
WHERE item_id IN (id1, id2, id3, id4, id5, ....)
Python gives you convenient interfaces for this kind of query, so that the IN clause can use a pythonic list. This way you can break your long list of items into, say, 60 queries with 1,000 IDs each.
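For instance, a rough sketch with a generic DB-API connection (the %s placeholder style and the items/item_id names depend on your driver and schema):

def fetch_by_ids(con, ids, chunk_size=1000):
    results = []
    cur = con.cursor()
    for i in range(0, len(ids), chunk_size):
        chunk = ids[i:i + chunk_size]
        placeholders = ', '.join(['%s'] * len(chunk))
        cur.execute("SELECT * FROM items WHERE item_id IN ({0})".format(placeholders), chunk)
        results.extend(cur.fetchall())
    return results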
Second approach: Use a temporary table
Another interesting approach is creating a temporary table on the database with your item IDs. Temporary tables last as long as the connection lives, so you won't have to worry about cleanups. Perhaps something like:
CREATE TEMPORARY TABLE item_ids_list (id INT PRIMARY KEY);  -- Remember indexing!
Insert the ids using an appropriate Python library:
INSERT INTO item_ids_list ...  -- Insert your 60,000 items here
Get your results:
SELECT * FROM items WHERE items.id IN (SELECT id FROM item_ids_list);
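In Python this could look roughly like the following (a sketch; con is an open DB-API connection, my_item_ids is your list of 60,000 IDs, and the %s placeholder style depends on your driver):

cur = con.cursor()
cur.execute("CREATE TEMPORARY TABLE item_ids_list (id INT PRIMARY KEY)")
cur.executemany("INSERT INTO item_ids_list (id) VALUES (%s)",
                [(item_id,) for item_id in my_item_ids])
cur.execute("SELECT * FROM items WHERE items.id IN (SELECT id FROM item_ids_list)")
rows = cur.fetchall()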
First of all, you join only the last thread. There is no guarantee that it will finish last. You should use something like this:
from time import sleep

delay = 0.5
tlist = [threading.Thread(target=search_sl, args=(ra, dec, q)) for ra, dec in candidates]
map(lambda t: t.start(), tlist)
while any(map(lambda t: t.isAlive(), tlist)):
    sleep(delay)
The second issue is that running 60K threads at the same moment requires really huge hardware resources :-) It's better to queue your tasks and then have them processed by workers. The number of worker threads must be limited. Like this (haven't tested the code, but the idea is clear, I hope):
from Queue import Queue, Empty
from threading import Thread
from time import sleep

tasks = Queue()
map(tasks.put, candidates)
maxthreads = 50
delay = 0.1

try:
    threads = [Thread(target=search_sl, args=tasks.get_nowait())
               for i in xrange(0, maxthreads)]
except Empty:
    pass

map(lambda t: t.start(), threads)

while not tasks.empty():
    threads = filter(lambda t: t.isAlive(), threads)
    while len(threads) < maxthreads:
        try:
            t = Thread(target=search_sl, args=tasks.get_nowait())
            t.start()
            threads.append(t)
        except Empty:
            break
    sleep(delay)

while any(map(lambda t: t.isAlive(), threads)):
    sleep(delay)
Since it's an IO-bound task, neither threads nor processes are a great fit; you use those when you need to parallelize computational tasks. So, be modern please ™, and use something like gevent for parallel IO-intensive tasks.
http://www.gevent.org/intro.html#example
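A sketch along those lines (this only helps if the DB driver talks over Python sockets that gevent can monkey-patch; search_sl, candidates and q are from the question above):

from gevent import monkey
monkey.patch_all()

from gevent.pool import Pool

pool = Pool(10)  # at most 10 queries in flight, well under the session limit

for ra, dec in candidates:
    pool.spawn(search_sl, ra, dec, q)
pool.join()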
On my local machine the script runs fine, but in the cloud it 500s all the time. This is a cron task, so I don't really mind if it takes 5 min...
<class 'google.appengine.runtime.DeadlineExceededError'>:
Any idea whether it's possible to increase the timeout?
Thanks,
rui
You cannot go beyond 30 secs, but you can indirectly increase the timeout by employing task queues - and writing tasks that gradually iterate through your data set and process it. Each such task run should of course fit into the timeout limit.
EDIT
To be more specific, you can use datastore query cursors to resume processing in the same place:
http://code.google.com/intl/pl/appengine/docs/python/datastore/queriesandindexes.html#Query_Cursors
introduced first in SDK 1.3.1:
http://googleappengine.blogspot.com/2010/02/app-engine-sdk-131-including-major.html
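A sketch of that pattern using the deferred library and the old db API (assuming the deferred handler is enabled in app.yaml; Foo and process() stand in for your own model and per-entity logic):

from google.appengine.ext import deferred

BATCH_SIZE = 100

def process_chunk(cursor=None):
    query = Foo.all()
    if cursor:
        query.with_cursor(cursor)
    batch = query.fetch(BATCH_SIZE)
    for entity in batch:
        process(entity)  # your per-entity work
    if len(batch) == BATCH_SIZE:
        # Re-enqueue with the cursor so each task run stays within the deadline.
        deferred.defer(process_chunk, query.cursor())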
The exact rules for DB query timeouts are complicated, but it seems that a query cannot live more than about 2 mins, and a batch cannot live more than about 30 seconds. Here is some code that breaks a job into multiple queries, using cursors to avoid those timeouts.
def make_query(start_cursor):
    query = Foo.all()  # Foo is the db.Model being scanned
    if start_cursor:
        query.with_cursor(start_cursor)
    return query

batch_size = 1000
start_cursor = None

while True:
    query = make_query(start_cursor)
    results_fetched = 0
    for resource in query.run(limit=batch_size):
        results_fetched += 1

        # Do something

        if results_fetched == batch_size:
            start_cursor = query.cursor()
            break
    else:
        break
Below is the code I use to solve this problem, by breaking up a single large query into multiple small ones. I use the google.appengine.ext.ndb library -- I don't know if that is required for the code below to work.
(If you are not using ndb, consider switching to it. It is an improved version of the db library and migrating to it is easy. For more information, see https://developers.google.com/appengine/docs/python/ndb.)
from google.appengine.datastore.datastore_query import Cursor

def ProcessAll():
    curs = Cursor()
    while True:
        records, curs, more = MyEntity.query().fetch_page(5000, start_cursor=curs)
        for record in records:
            # Run your custom business logic on record.
            RunMyBusinessLogic(record)
        if more and curs:
            # There are more records; do nothing here so we enter the
            # loop again above and run the query one more time.
            pass
        else:
            # No more records to fetch; break out of the loop and finish.
            break