multiprocessing pool imap_unordered not completing all tasks - python

I want to invoke a list of ~5000 API calls which return JSON data and write it to a file. I'm expecting the total response of all API calls to be quite large, so I want to avoid reading them into memory all at once. I'm attempting to use a multiprocessing.Pool to distribute the tasks across 8 processes (the number of available CPU cores). I expect this to take 20+ hours to complete, since each worker will be fetching paginated responses starting from these URLs.
The 8 workers are spun up and retrieve data for 2-3 hours (each producing logs from a different PID), and then at some point all the workers/processes move to the "sleeping" status. No more logs are produced and no exceptions are caught.
I'm using imap_unordered to process the responses as they become available and write the data to a file. Why are the processes sleeping after a set number of tasks? Is there an optimization I can make to the following code to ensure each task is completed?
from multiprocessing import Pool

urls_to_query = [...]  # list of ~5000 URL strings

def call_api(url):
    try:
        # each call fetches a (potentially long) paginated response
        return call_some_api(url)
    except Exception as exc:
        # return the exception so the parent loop can detect and handle it
        return exc

with Pool() as pool:  # 8 CPU cores available
    total = len(urls_to_query)
    # iterates results as tasks are completed, in the order they are completed
    for i, result in enumerate(pool.imap_unordered(call_api, urls_to_query)):
        print(f"Response {i + 1}/{total} - {result}")
        if isinstance(result, Exception):
            handle_failure(result)
        else:
            write_to_file(result)
If there's a much more optimized way to accomplish this task, I'm more than happy to try that out instead.
I've attempted to use other tools in the multiprocessing module, but no luck thus far.
UPDATE:
Each worker is now logging the available memory and I do not see any memory concerns; each worker hovers between 60-70% memory available.
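The per-worker check is roughly the following sketch (assuming psutil; the log format is illustrative, not part of the original code):

import os
import psutil

def log_available_memory():
    # log how much system memory is still available from this worker process
    vm = psutil.virtual_memory()
    available_pct = 100 * vm.available / vm.total
    print(f"pid={os.getpid()}: {available_pct:.0f}% memory available "
          f"({vm.available / 1e9:.1f} GB)")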

Related

Job, Worker, and Task in dask_jobqueue

I am using a SLURM cluster with Dask and don't quite understand the configuration part. The documentation talks of jobs and workers and even has a section on the difference:
In dask-distributed, a Worker is a Python object and node in a dask Cluster that serves two purposes, 1) serve data, and 2) perform computations. Jobs are resources submitted to, and managed by, the job queueing system (e.g. PBS, SGE, etc.). In dask-jobqueue, a single Job may include one or more Workers.
Problem is, I still don't get it. I use the word task to refer to a single function one submits using a client, i.e. with a client.submit(task, *params) call.
My understanding of how Dask works is that there are n_workers set up and that each task is submitted to a pool of said workers. Any worker works on one task at a given time, potentially using multiple threads and processes.
However, my understanding does not leave any room for the term job and is thus certainly wrong. Moreover, most configurations of the cluster (cores, memory, processes) are done on a per-job basis according to the docs.
So my question is: what is a job? Can anyone explain in simpler terms its relation to a task and a worker? And how do the cores, memory, processes, and n_workers configurations interact? (I have read the docs, I just don't understand and could use another explanation.)
Your understanding of tasks and workers is correct. Job is a concept specific to SLURM (and other HPC clusters where users submit jobs). A job consists of the instructions for what to execute and what resources are needed, so the typical workflow of a SLURM user is to write a script and then submit it for execution using salloc or sbatch.
One can submit a job with instructions to launch multiple dask-workers (there might be advantages to this due to latency, permissions, resource availability, etc., but this would need to be determined from the particular cluster configuration).
From the dask perspective what matters is the number of workers, but from the dask-jobqueue perspective the number of jobs also matters. For example, if the number of workers per job is 2, then to get 10 workers in total dask-jobqueue will submit 5 jobs to the SLURM scheduler.
This example, adapted from the docs, will result in 10 dask-workers, each with 24 cores:
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(
    queue='regular',
    project="myproj",
    cores=24,
    processes=1,
    memory="500 GB"
)
cluster.scale(jobs=10)  # ask for 10 jobs
If we specify multiple processes, then the total number of workers will be jobs * processes (assuming sufficient cores), so the following will give 100 workers with 2 cores and 50 GB each (note that the memory in the config is the total per job):
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(
    queue='regular',
    project="myproj",
    cores=20,
    processes=10,
    memory="500 GB"
)
cluster.scale(jobs=10)  # ask for 10 jobs

Python: how many processes can access a database (PostgreSQL) table at the same time?

This is a simplified version of my code, in which each process crawls a link, gets the data, and stores it in the database, all in parallel.
import sys
import time
import requests
from multiprocessing import Pool

def crawl_and_save_data(url):
    while True:
        res = requests.get(url)
        price_list = res.json()
        if len(price_list) == 0:
            sys.exit()
        # Save all data in DB HERE
        # for price in price_list:
        #     Save price in PostgreSQL Database table (same table)
        until_date = convert_date_format(price_list[len(price_list)-1]['candleDateTime'])
        time.sleep(1)

if __name__ == '__main__':
    # When executed with pure python
    pool = Pool()
    pool.map(
        crawl_and_save_data,
        get_bunch_of_url_list()
    )
The key point of this code is the section marked # Save all data in DB HERE, where each process accesses the same database table.
I wonder whether this kind of access pattern hurts the concurrency of the whole job. Or could I lose data because of the concurrent database accesses? Or would all queries be put in an I/O queue or something?
I need your advice. Thanks.
tl;dr - you should be fine, but the question doesn't include enough detail to answer definitively. You will need to run some tests, but you should expect to get a good amount of concurrency (a few dozen simultaneous writes) before things start to slow down.
Note, though: as currently written, it seems like your workers will fetch the same URL over and over again inside the while True loop. You do exit when the response list is empty, but does the URL track state somehow? Otherwise I would expect multiple, identical GETs to return the same data over and over...
As far as concurrency goes, that ultimately depends on:
The resources available to the database (memory, I/O, CPU)
The server-side resources consumed by each connection/operation
That second point includes memory, etc., but also whether independent operations are competing for the same resources (are 10 different connections trying to update the same set of rows in the database?). Updating the same table is fine, more or less, because the database can use row-level locks.
Also note the difference between concurrency (how many things happen at once) and throughput (how many things happen within a period of time). Concurrency and throughput can relate to each other in counter-intuitive ways: it's not uncommon to see a situation where 1 process can do N operations per second, but M processes sustain far less than M x N operations per second, possibly even bringing the whole thing to a screeching halt (e.g., via a deadlock).
Thinking about your code snippet, here are some observations:
You are using multiprocessing.Pool, which uses sub-processes for concurrency and will work well for your case if you...
Make sure you open your connections in the sub-process; trying to re-use a connection from the parent process will not work (see the sketch after this list)
If you do nothing else to your code, you will be using a number of sub-processes equal to the number of cores on your db client machine
This is a good starting point. If a function is CPU-bound, you really can't go higher. If your function is I/O-bound, the CPU will be idle waiting for I/O operations to return. You can start ramping up the worker count in this case.
Thus, each sub-process will have a connection to the database, with some amount of server memory per connection.
This also means that each insert should be in isolated transactions, with no additional work on your part.
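Here is a minimal sketch of the per-process connection pattern, assuming psycopg2; the DSN, the table, and the tradePrice field are illustrative, not taken from the question:

import psycopg2

conn = None  # one connection per worker process, created lazily

def save_prices(price_list):
    global conn
    if conn is None:
        # opened inside the child process, never inherited from the parent
        conn = psycopg2.connect('dbname=crawler user=crawler')
    with conn, conn.cursor() as cur:  # commits on success, rolls back on error
        for price in price_list:
            cur.execute(
                'INSERT INTO prices (candle_ts, price) VALUES (%s, %s)',
                (price['candleDateTime'], price['tradePrice']),
            )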
Given that, simple, append-only, row-by-row transactions should support relatively high concurrency and high throughput, again depending on how big and fast your DB server is.
Also, note that you are already queueing :) With no args, Pool() creates a number of child processes equal to os.cpu_count() (see the docs). If your URL collection is larger than that, the collection is a queue of sorts, just not a durable one: if your master process dies, the list of URLs is gone.
Unrelated - unless you are worried about your URL fetches getting throttled, from a db perspective, there is no need for the time.sleep(1) statement.
Hope this helps.

Google appengine: Task queue performance

I currently have an application running on appengine and I am executing a few jobs using the deferred library, some of these tasks run daily, while some of them are executed once a month. Most of these tasks query Datastore to retrieve documents and then store the entities in an index (Search API). Some of these tables are replaced monthly and I have to run these tasks on all entities (4~5M).
One example of such a task is:
def addCompaniesToIndex(cursor=None, n_entities=0, mindate=None):
    # get index
    BATCH_SIZE = 200
    cps, next_cursor, more = Company.query().\
        fetch_page(BATCH_SIZE, start_cursor=cursor)
    doc_list = []
    for i in range(0, len(cps)):
        cp = cps[i]
        # create an Index Document using the Datastore entity
        # this document has only about 5 text fields and one date field
        cp_doc = getCompanyDocument(cp)
        doc_list.append(cp_doc)
    index = search.Index(name='Company')
    index.put(doc_list)
    n_entities += len(doc_list)
    if more:
        logging.debug('Company: %d added to index', n_entities)
        # to_put[:] = []
        doc_list[:] = []
        deferred.defer(addCompaniesToIndex,
                       cursor=next_cursor,
                       n_entities=n_entities,
                       mindate=mindate)
    else:
        logging.debug('Finished Company index creation (%d processed)', n_entities)
When I run one task only, the execution takes around 4-5s per deferred task, so indexing my 5M entities would take about 35 hours.
Another thing: when I run an update on another index (e.g., one of the daily updates) using a different deferred task on the same queue, both run a lot slower, taking about 10-15 seconds per deferred call, which is just unbearable.
My question is: is there a way to do this faster, scaling the push queue to more than one job running at a time? Or should I use a different approach for this problem?
Thanks in advance,
By placing the if more statement at the end of the addCompaniesToIndex() function you're practically serializing the task execution: the next deferred task is not created until the current one has completed indexing its share of docs.
What you could do is move the if more statement right after the Company.query().fetch_page() call where you obtain (most of) the variables needed for the next deferred task execution.
This way the next deferred task is created and enqueued (long) before the current one completes, so their processing can overlap or be staggered. You will need some other modifications as well, for example handling the n_entities variable, which loses its current meaning in the updated scenario - but that's more or less cosmetic/informational, not essential to the actual doc indexing operation.
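A hedged sketch of that reordering (names are taken from the question's code; the exact n_entities bookkeeping is approximate):

def addCompaniesToIndex(cursor=None, n_entities=0, mindate=None):
    BATCH_SIZE = 200
    cps, next_cursor, more = Company.query().\
        fetch_page(BATCH_SIZE, start_cursor=cursor)
    if more:
        # enqueue the next batch immediately, before indexing this one,
        # so consecutive tasks can overlap
        deferred.defer(addCompaniesToIndex,
                       cursor=next_cursor,
                       n_entities=n_entities + len(cps),
                       mindate=mindate)
    doc_list = [getCompanyDocument(cp) for cp in cps]
    search.Index(name='Company').put(doc_list)
    logging.debug('Company: %d added to index', n_entities + len(doc_list))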
If the number of deferred tasks is very high there is a risk of queueing too many of them simultaneously, which could cause an "explosion" in the number of instances GAE would spawn to handle them. If that is not desired you can "throttle" the rate at which the deferred tasks are spawned by delaying their execution a bit; see https://stackoverflow.com/a/38958475/4495081.
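For example, using the task queue's countdown option (the 10-second delay here is just an illustrative value):

# stagger the next task's start to slow down instance spawning
deferred.defer(addCompaniesToIndex,
               cursor=next_cursor,
               n_entities=n_entities,
               mindate=mindate,
               _countdown=10)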
I think I finally managed to get around this issue by using two queues and the idea proposed in the previous answer.
On the first queue we only query the main entities (with keys_only) and launch a task on a second queue for those keys. The first task then relaunches itself on queue 1 with next_cursor.
The second queue gets the entity keys and does all the queries and inserts on Full Text Search/BigQuery/PubSub (this is the slow part, ~15 s per group of 100 keys).
I tried using only one queue as well, but the processing throughput was not as good. I believe this comes from having slow and fast tasks running on the same queue, where the scheduler might not work as well.
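A minimal sketch of that two-queue split (queue names, batch size, and helper names are illustrative; assumes ndb and the deferred library):

def scan_companies(cursor=None):
    # fast task on queue 1: fetches only keys, then relaunches itself
    keys, next_cursor, more = Company.query().fetch_page(
        100, start_cursor=cursor, keys_only=True)
    deferred.defer(index_companies, [k.urlsafe() for k in keys],
                   _queue='index-queue')  # slow work goes to queue 2
    if more:
        deferred.defer(scan_companies, cursor=next_cursor,
                       _queue='scan-queue')

def index_companies(urlsafe_keys):
    # slow task on queue 2: fetches full entities and writes index docs
    companies = ndb.get_multi([ndb.Key(urlsafe=k) for k in urlsafe_keys])
    docs = [getCompanyDocument(cp) for cp in companies]
    search.Index(name='Company').put(docs)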

limited number of user-initiated background processes

I need to allow users to submit requests for very, very large jobs. We are talking 100 gigabytes of memory and 20 hours of computing time. This costs our company a lot of money, so it was stipulated that only 2 jobs could be running at any time, and requests for new jobs when 2 are already running would be rejected (and the user notified that the server is busy).
My current solution uses an Executor from concurrent.futures, and requires setting the Apache server to run only one process, reducing responsiveness (current user count is very low, so it's okay for now).
If possible I would like to use Celery for this, but I did not see in the documentation any way to accomplish this particular setting.
How can I run up to a limited number of jobs in the background in a Django application, and notify users when jobs are rejected because the server is busy?
I have two solutions for this particular case: an out-of-the-box solution provided by Celery, and one that you implement yourself.
You can do something like this with Celery workers. In particular, you create only two worker processes with concurrency=1 (or one with concurrency=2, but those will be threads, not separate processes); this way, only two jobs can run asynchronously. Now you need a way to raise exceptions if both slots are occupied: use inspect to count the number of active tasks and throw exceptions if required. For an implementation, you can check out this SO post.
You might also be interested in rate limits.
You can do it all yourself, using a locking solution of your choice. In particular, a nice implementation that makes sure only two processes are running, using Redis (and redis-py), is as simple as the following (I'm assuming you know Redis, since you know Celery):
from redis import StrictRedis

redis = StrictRedis('localhost', 6379)
locks = ['compute:lock1', 'compute:lock2']

acquired = False
for key in locks:
    lock = redis.lock(key, blocking_timeout=5)
    acquired = lock.acquire()
    if acquired:
        do_huge_computation()
        lock.release()
        break
    print("Gonna try next possible slot")

if not acquired:
    raise SystemLimitsReached("Already at max capacity!")
This way you make sure only two running processes can exist in the system. A third process will block on the lock.acquire() line for blocking_timeout seconds; if the locking succeeds, acquired will be True, otherwise it's False and you'd tell your user to wait!
I had the same requirement some time in the past, and what I ended up coding was something like the solution above. In particular:
It has the fewest race conditions possible.
It's easy to read.
It doesn't depend on a sysadmin suddenly doubling the concurrency of workers under load and blowing up the whole system.
You can also implement the limit per user, meaning each user can have 2 simultaneously running jobs, by only changing the lock keys from compute:lock1 to compute:userId:lock1 and lock2 accordingly. You can't do this with vanilla Celery.
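A hedged sketch of that per-user variant (the helper name and key format are illustrative):

def acquire_user_slot(redis, user_id):
    # two lock keys per user => at most two concurrent jobs per user
    for key in ['compute:%s:lock1' % user_id, 'compute:%s:lock2' % user_id]:
        lock = redis.lock(key, blocking_timeout=5)
        if lock.acquire():
            return lock  # caller must release() when the job finishes
    return None  # both slots are taken; tell the user to wait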
First of all you need to limit concurrency on your worker (docs):
celery -A proj worker --loglevel=INFO --concurrency=2 -n <worker_name>
This will help to make sure that you do not have more than 2 active tasks even if you will have errors in the code.
Now you have 2 ways to implement task number validation:
You can use inspect to get the number of active and scheduled tasks:

from celery import current_app

def start_job():
    inspect = current_app.control.inspect()
    active_tasks = inspect.active() or {}
    scheduled_tasks = inspect.scheduled() or {}
    worker_key = 'celery@%s' % <worker_name>  # node names look like celery@<worker_name>
    worker_tasks = active_tasks.get(worker_key, []) + scheduled_tasks.get(worker_key, [])
    if len(worker_tasks) >= 2:
        raise MyCustomException('It is impossible to start more than 2 tasks.')
    else:
        my_task.delay()
You can store the number of currently executing tasks in a DB and validate task execution based on it.
The second approach could be better if you want to scale your functionality later - e.g., introduce premium users, or enforce per-user limits.
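A minimal sketch of a variation on the second approach, keeping the counter in Django's cache (backed by Redis or memcached, where incr is atomic) instead of a DB table; the names are illustrative:

from django.core.cache import cache

class ServerBusy(Exception):
    pass

MAX_JOBS = 2

def try_start_job(task, *args):
    cache.add('running_jobs', 0)               # no-op if the key already exists
    if cache.incr('running_jobs') > MAX_JOBS:  # atomic on redis/memcached backends
        cache.decr('running_jobs')
        raise ServerBusy('2 jobs are already running')
    task.delay(*args)  # the task must decr('running_jobs') when it finishes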
First
You need the first part of SpiXel's solution. According to him, "you only create two worker processes with concurrency=1".
Second
Set the timeout for tasks waiting in the queue via CELERY_EVENT_QUEUE_TTL, and set the queue length limit according to how to limit number of tasks in queue and stop feeding when full?.
Therefore, when two jobs are running and a task has been waiting in the queue for, say, 10 seconds (or whatever period you like), it will time out. And if the queue is full, newly arriving tasks will be dropped.
Third
You need something extra to deal with notifying "users when jobs are rejected because the server is busy".
Dead Letter Exchanges are what you need, for tasks that fail because of the queue length limit or a message timeout: "Messages will be dropped or dead-lettered from the front of the queue to make room for new messages once the limit is reached."
You can set "x-dead-letter-exchange" to route these to another queue; once that queue receives the dead-lettered message, you can send a notification message to the user.
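A hedged sketch of wiring this up with Kombu queue arguments (queue names, limits, and routing keys are illustrative; the x-arguments are standard RabbitMQ):

from kombu import Exchange, Queue

dead_letter_exchange = Exchange('jobs.dlx', type='direct')

task_queues = [
    Queue(
        'jobs',
        Exchange('jobs'),
        routing_key='jobs',
        queue_arguments={
            'x-max-length': 2,                     # queue length limit
            'x-message-ttl': 10000,                # 10 s wait limit, in ms
            'x-dead-letter-exchange': 'jobs.dlx',  # where rejected tasks go
            'x-dead-letter-routing-key': 'jobs.rejected',
        },
    ),
    # a consumer on this queue can notify users that their job was rejected
    Queue('jobs.rejected', dead_letter_exchange, routing_key='jobs.rejected'),
]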

Best practice to release memory after url fetch on appengine (python)

My problem is how best to release the memory used by the responses of asynchronous URL fetches on App Engine. Here is what I basically do in Python:
rpcs = []
for event in event_list:
    url = 'http://someurl.com'
    rpc = urlfetch.create_rpc()
    rpc.callback = create_callback(rpc)
    urlfetch.make_fetch_call(rpc, url)
    rpcs.append(rpc)

for rpc in rpcs:
    rpc.wait()
In my test scenario it does that for 1500 requests. But I need an architecture that can handle even more within a short amount of time.
Then there is a callback function, which adds a task to a queue to process the results:
def event_callback(rpc):
    result = rpc.get_result()
    data = json.loads(result.content)
    taskqueue.add(queue_name='name', url='url', params={'data': data})
My problem is that I make so many concurrent RPC calls that my instance exceeds its memory limit: "Exceeded soft private memory limit with 159.234 MB after servicing 975 requests total"
I already tried three things:
del result
del data
and
result = None
data = None
and I ran the garbage collector manually after the callback function.
gc.collect()
But nothing seems to release the memory directly after the callback function has added the task to a queue, and therefore the instance crashes. Is there any other way to do it?
That's the wrong approach: put these URLs into a (push) queue instead, increase its rate to the desired value (default: 5/sec), and let each task handle one URL fetch (or a group of them). Please note that there's a safety limit of 3000 URL-fetch API calls per minute (and one URL fetch might use more than one API call).
Use the task queue for urlfetch as well: fan out and avoid exhausting memory, register named tasks, and provide the event_list cursor to the next task. You might want to fetch+process in such a scenario instead of registering a new task for every processing step, especially if processing also includes datastore writes.
I also find ndb makes these async solutions more elegant.
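For example, a minimal ndb tasklet sketch (the function names are illustrative):

from google.appengine.ext import ndb

@ndb.tasklet
def fetch_one(url):
    # ndb's context performs the urlfetch and cooperates with its event loop
    ctx = ndb.get_context()
    result = yield ctx.urlfetch(url)
    raise ndb.Return(result.content)

def fetch_all(urls):
    futures = [fetch_one(u) for u in urls]
    return [f.get_result() for f in futures]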
Check out Brett Slatkin's talk on scalable apps, and perhaps pipelines.
