I currently have an application running on App Engine and I am executing a few jobs using the deferred library. Some of these tasks run daily, while some of them are executed once a month. Most of these tasks query Datastore to retrieve documents and then store the entities in an index (Search API). Some of these tables are replaced monthly and I have to run these tasks on all entities (4~5M).
One example of such a task is:
import logging

from google.appengine.api import search
from google.appengine.ext import deferred

# Company (an ndb model) and getCompanyDocument are defined elsewhere in the app.

def addCompaniesToIndex(cursor=None, n_entities=0, mindate=None):
    #get index
    BATCH_SIZE = 200
    cps, next_cursor, more = Company.query().\
                             fetch_page(BATCH_SIZE,
                                        start_cursor=cursor)
    doc_list = []
    for i in range(0, len(cps)):
        cp = cps[i]
        #create an Index Document using the Datastore entity
        #this document has only about 5 text fields and one date field
        cp_doc = getCompanyDocument(cp)
        doc_list.append(cp_doc)

    index = search.Index(name='Company')
    index.put(doc_list)

    n_entities += len(doc_list)

    if more:
        logging.debug('Company: %d added to index', n_entities)
        #to_put[:] = []
        doc_list[:] = []

        deferred.defer(addCompaniesToIndex,
                       cursor=next_cursor,
                       n_entities=n_entities,
                       mindate=mindate)
    else:
        logging.debug('Finished Company index creation (%d processed)', n_entities)
When I run one task only, the execution takes around 4-5s per deferred task, so indexing my 5M entities would take about 35 hours.
Another thing is that when I run an update on another index (e.g. one of the daily updates) using a different deferred task on the same queue, both are executed a lot slower and start taking about 10-15 seconds per deferred call, which is just unbearable.
My question is: is there a way to do this faster and scale the push queue to more than one job running each time? Or should I use a different approach for this problem?
Thanks in advance,
By placing the if more statement at the end of the addCompaniesToIndex() function you're practically serializing the task execution: the next deferred task is not created until the current deferred task has completed indexing its share of docs.
What you could do is move the if more statement right after the Company.query().fetch_page() call where you obtain (most of) the variables needed for the next deferred task execution.
This way the next deferred task would be created and enqueued (long) before the current one completes, so their processing can potentially be overlapping/staggered. You will need some other modifications as well, for example handling the n_entities variable which loses its current meaning in the updated scenario - but that's more or less cosmetic/informational, not essential to the actual doc indexing operation.
If the number of deferred tasks is very high there is a risk of queueing too many of them simultaneously, which could cause an "explosion" in the number of instances GAE would spawn to handle them. If that is not desired you can "throttle" the rate at which the deferred tasks are spawned by delaying their execution a bit, see https://stackoverflow.com/a/38958475/4495081.
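A minimal sketch of that reordering, keeping the names from the question (the _countdown value is only illustrative, and n_entities now counts entities dispatched by earlier tasks rather than entities already indexed):

def addCompaniesToIndex(cursor=None, n_entities=0, mindate=None):
    BATCH_SIZE = 200
    cps, next_cursor, more = Company.query().fetch_page(BATCH_SIZE,
                                                        start_cursor=cursor)

    # Enqueue the next batch right away so it can run while this one indexes.
    if more:
        deferred.defer(addCompaniesToIndex,
                       cursor=next_cursor,
                       n_entities=n_entities + len(cps),
                       mindate=mindate,
                       _countdown=2)  # optional throttle on how fast tasks pile up

    doc_list = [getCompanyDocument(cp) for cp in cps]
    search.Index(name='Company').put(doc_list)
    logging.debug('Company: %d entities dispatched so far', n_entities + len(cps))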
I think I finally managed to get around this issue by using two queues and the idea proposed by the previous answer.
On the first queue we only query the main entities (with keys_only) and launch a task on a second queue for those keys. The first task then relaunches itself on queue 1 with the next_cursor.
The second queue gets the entity keys and does all the queries and inserts on Full Text Search/BigQuery/PubSub (this is slow, ~15s per group of 100 keys).
I tried using only one queue as well but the processing throughput was not as good. I believe that this might come from the fact that we have slow and fast tasks running on the same queue and the scheduler might not work as well in this case.
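Roughly, the two-queue split looks like the sketch below; the queue names, the 100-key batch size and the index_company_keys worker are illustrative, not the exact code used:

from google.appengine.api import search
from google.appengine.ext import deferred, ndb

def fetch_company_keys(cursor=None):
    # Queue 1: fast task that only pages through the keys.
    keys, next_cursor, more = Company.query().fetch_page(100,
                                                         start_cursor=cursor,
                                                         keys_only=True)
    # Hand the slow work to the second queue.
    deferred.defer(index_company_keys, keys, _queue='index-workers')
    # Relaunch itself on the first queue with the next cursor.
    if more:
        deferred.defer(fetch_company_keys, cursor=next_cursor, _queue='fetch-keys')

def index_company_keys(keys):
    # Queue 2: slow task that loads the entities and writes the documents
    # (plus whatever BigQuery/PubSub work is needed).
    companies = ndb.get_multi(keys)
    docs = [getCompanyDocument(cp) for cp in companies]
    search.Index(name='Company').put(docs)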
I am using cx_Oracle and the schedule module in Python. The following is pseudocode:

import schedule
import cx_Oracle

def db_operation(query):
    '''
    Some DB operations like
    1. Get connection
    2. Execute query
    3. Commit result (in case of DML operations)
    '''

schedule.every().hour.at(":10").do(db_operation, query='some_query_1')  # Runs at the 10th minute of every hour
schedule.every().day.at("13:10").do(db_operation, query='some_query_2')  # Runs at 1:10 p.m. every day
Both of the above scheduled jobs call the same function (which does some DB operations) and will coincide at 13:10.
Questions:
So how does the scheduler handle this scenario, i.e. running 2 jobs at the same time? Does it put them in some sort of queue and run them one by one even though the time is the same, or do they run in parallel?
Which one gets picked first? And if I want the first job to have priority over the second, how do I do that?
Also, importantly, only one of these should be accessing the database at a time, otherwise it may lead to inconsistent data. How do I take care of this scenario? Is it possible to put some sort of lock around the function, or should the table be locked somehow?
I took a look at the code of schedule and I have come to the following conclusions:
The schedule library does not run jobs in parallel or concurrently, so jobs that are due are processed one after the other. They are sorted by their due time: the job whose due time lies furthest in the past is performed first.
If jobs are due at the same time, schedule executes them in FIFO order with respect to when the jobs were created. So in your example, some_query_1 would be executed before some_query_2.
Question three is therefore self-explanatory: since only one function can be executed at a time, the functions should not actually get in each other's way.
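You can see this with a toy script (a sketch, not your cx_Oracle code): both jobs become due at the same time, and run_pending() runs them one after the other in registration order:

import time
import schedule

def job(name):
    print('running', name)
    time.sleep(2)  # even if a job is slow, the next one simply waits its turn

schedule.every(5).seconds.do(job, name='job_1')  # registered first -> runs first
schedule.every(5).seconds.do(job, name='job_2')  # registered second -> runs second

while True:
    schedule.run_pending()  # due jobs are executed one by one, in this thread
    time.sleep(1)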
This is a simplified version of my code, in which each process crawls a link, gets the data and stores it in the database, all in parallel.
import sys
import time
from multiprocessing import Pool

import requests

# convert_date_format and get_bunch_of_url_list are defined elsewhere.

def crawl_and_save_data(url):
    while True:
        res = requests.get(url)
        price_list = res.json()

        if len(price_list) == 0:
            sys.exit()

        # Save all data in DB HERE
        # for price in price_list:
        #     Save price in PostgreSQL database table (same table)

        until_date = convert_date_format(price_list[len(price_list)-1]['candleDateTime'])
        time.sleep(1)

if __name__ == '__main__':
    # When executed with pure python
    pool = Pool()
    pool.map(
        crawl_and_save_data,
        get_bunch_of_url_list()
    )
The key point of this code is the commented section:

# Save all data in DB HERE
# for price in price_list:
#     Save price in PostgreSQL database table (same table)

where each process accesses the same database table.
I wonder whether this kind of task prevents concurrency across my whole workload.
Or could I lose data because of the concurrent database accesses?
Or would all queries be put in some kind of I/O queue?
I need your advice. Thanks.
tl;dr - you should be fine, but the question doesn't include enough detail to answer definitively. You will need to run some tests, but you should expect to get a good amount of concurrency (a few dozen simultaneous writes) before things start to slow down.
Note though - as currently written, seems like your workers will get the same URL over and over again, because of the while True loop that never breaks or exits. You detect if the list is empty, but does the URL track state somehow? I would expect multiple, identical GETs to return the same data over and over...
As far as concurrency, that ultimately depends on -
The resources available to the database (memory, I/O, CPU)
The server-side resources consumed by each connection/operation.
That second point includes memory, etc., but also whether independent operations are competing for the same resources (are 10 different connections trying to update the same set of rows in the database?). Updating the same table is fine, more or less, because the database can use row-level locks.
Also note the difference between concurrency (how many things happen at once) and throughput (how many things happen within a period of time).
Concurrency and throughput can relate to each other in counter-intuitive ways - it's not uncommon to see a situation where 1 process can do N operations per second, but M processes sustain far less than M x N operations per second, possibly even bringing the whole thing to a screeching halt (e.g., via a deadlock).
Thinking about your code snippet, here are some observations:
You are using multiprocessing.Pool, which uses sub-processes for concurrency and will work well for your case if you...
Make sure you open your connections in the sub-process; trying to re-use a connection from the parent process will not work
If you do nothing else to your code, you will be using a number of sub-processes equal to the number of cores on your db client machine
This is a good starting point. If a function is CPU-bound, you really can't go higher. If your function is I/O-bound, the CPU will be idle waiting for I/O operations to return. You can start ramping up the worker count in this case.
Thus, each sub-process will have a connection to the database, with some amount of server memory per connection.
This also means that each insert should be in isolated transactions, with no additional work on your part.
Given that, simple, append-only, row-by-row transactions should support relatively high concurrency and high throughput, again depending on how big and fast your DB server is.
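As a concrete sketch of the "connection per sub-process" point, assuming psycopg2 and a hypothetical price table (the column names, the tradePrice field and the connection string are made up; get_bunch_of_url_list is the function from your snippet, and the while True loop is omitted for brevity):

import os
from multiprocessing import Pool

import psycopg2
import requests

conn = None  # one connection per worker process

def init_worker():
    # Runs once inside each child process; never reuse a connection
    # that was created in the parent process.
    global conn
    conn = psycopg2.connect('dbname=prices user=crawler')

def crawl_and_save_data(url):
    res = requests.get(url)
    with conn:                       # one transaction per response
        with conn.cursor() as cur:
            for price in res.json():
                cur.execute('INSERT INTO price (candle_time, trade_price) VALUES (%s, %s)',
                            (price['candleDateTime'], price['tradePrice']))

if __name__ == '__main__':
    pool = Pool(processes=os.cpu_count(), initializer=init_worker)
    pool.map(crawl_and_save_data, get_bunch_of_url_list())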
Also, note that you are already queueing :) With no args, Pool() creates a number of child processes equal to os.cpu_count() (see the docs).
If the number of URLs in your collection is greater than that, the collection is a queue of sorts, just not a durable one. If your master process dies, the list of URLs is gone.
Unrelated - unless you are worried about your URL fetches getting throttled, from a db perspective, there is no need for the time.sleep(1) statement.
Hope this helps.
I want to query an API (which is time consuming) with lots of items (~100), but not all at once. Instead I want a little delay between the queries.
What I currently have is a task that gets executed asynchronously and iterates over the queries and after each iteration waits some time:
import time

from celery import shared_task

# query_api is defined elsewhere.

@shared_task
def query_api_multiple(values):
    delay_between_queries = 1

    query_results = []
    for value in values:
        time.sleep(delay_between_queries)

        response = query_api(value)
        if response['result']:
            query_results.append(response)

    return query_results
My question is: when multiple of those requests come in, will the second one get executed after the first has finished, or while the first is still running? And if they do get executed at the same time, how can I prevent that?
You should not use time.sleep but rate limit your task instead:
Task.rate_limit
Set the rate limit for this task type (limits the
number of tasks that can be run in a given time frame).
The rate limits can be specified in seconds, minutes or hours by
appending “/s”, “/m” or “/h” to the value. Tasks will be evenly
distributed over the specified time frame.
Example: “100/m” (hundred tasks a minute). This will enforce a minimum
delay of 600ms between starting two tasks on the same worker instance.
So if you want to limit it to 1 query per second, try this:
@shared_task(rate_limit='1/s')
def query_api_multiple(values):
    ...
Yes, if you create multiple tasks then they may run at the same time.
You can rate limit on a task-type basis with Celery if you want to limit the number of tasks that run per period of time. Alternatively, you could implement a rate limiting pattern using something like Redis, combined with Celery retries, if you need more flexibility than what Celery provides out of the box.
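If you do go that route, a simple sketch of the Redis-plus-retries idea (the key name, window size and task are made up; query_api is the function from the question) could look like this:

import redis
from celery import shared_task

redis_client = redis.Redis()

@shared_task(bind=True, max_retries=None)
def query_api_single(self, value):
    # Simple fixed-window limiter: allow at most one call per second overall.
    calls = redis_client.incr('api-call-window')
    if calls == 1:
        redis_client.expire('api-call-window', 1)  # the window lasts one second
    if calls > 1:
        # Over the limit: retry shortly instead of sleeping in the worker.
        raise self.retry(countdown=1)
    return query_api(value)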
I'm sorry if this question has in fact been asked before. I've searched around quite a bit and found pieces of information here and there but nothing that completely helps me.
I am building an app on Google App Engine in Python that lets a user upload a file, which is then processed by a piece of Python code, and the resulting processed file gets sent back to the user in an email.
At first I used a deferred task for this, which worked great. Over time I've come to realize that since the processing can take more than the 10 minutes I have before I hit the DeadlineExceededError, I need to be more clever.
I therefore started to look into task queues, wanting to make a queue that processes the file in chunks, and then piece everything together at the end.
My present code for creating the single deferred task looks like this:
_ = deferred.defer(transform_function, filename, from, to, email)
so that the transform_function code gets the values of filename, from, to and email and sets off to do the processing.
Could someone please enlighten me as to how I turn this into a linear chain of tasks that get acted on one after the other? I have read all the Google App Engine documentation I can think of, but it is unfortunately not detailed enough in terms of actual pieces of code.
I see references to things like:
taskqueue.add(url='/worker', params={'key': key})
but since I don't have a url for my task, but rather a transform_function() implemented elsewhere, I don't see how this applies to me…
Many thanks!
You can just keep calling deferred to run your task when you get to the end of each phase.
Other queues just allow you to control the scheduling and rate, but work the same.
I track the elapsed time in the task, and when I get near the end of the processing window the code stops what it is doing and calls defer for the next task in the chain, or continues where it left off, depending on whether it's a discrete set of steps or one continuous chunk of work. This was all written back when tasks could only run for 60 seconds.
However, the problem you will face (it doesn't matter if it's a normal task queue or deferred) is that each stage could fail for some reason and then be re-run, so each phase must be idempotent.
For long running chained tasks, I construct an entity in the datastore that holds the description of the work to be done and tracks the processing state for the job and then you can just keep rerunning the same task until completion. On completion it marks the job as complete.
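A rough sketch of that job-state pattern (JobState, process_next_chunk and send_result_email are made-up names, and the 9-minute budget is just an example):

import time

from google.appengine.ext import deferred, ndb

TIME_LIMIT = 9 * 60  # stop a bit before the 10-minute deadline

class JobState(ndb.Model):
    cursor = ndb.StringProperty()              # where to resume the processing
    done = ndb.BooleanProperty(default=False)

def transform_in_chunks(job_id, filename, email):
    start = time.time()
    job = JobState.get_by_id(job_id)
    while not job.done:
        # process_next_chunk must be idempotent: it may be re-run after a failure.
        job.cursor, job.done = process_next_chunk(filename, job.cursor)
        job.put()
        if not job.done and time.time() - start > TIME_LIMIT:
            # Out of time: defer the same task again; it resumes from the stored cursor.
            deferred.defer(transform_in_chunks, job_id, filename, email)
            return
    send_result_email(email, filename)         # the job is complete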
To avoid the 10-minute timeout you can direct the request to a backend or a B-type module using the "_target" param.
BTW, is there any reason you need to process the chunks sequentially? If all you need is some notification upon completion of all chunks (so you can "piece everything together at the end"), you can implement it in various ways. For example, each deferred task for a chunk can decrease a shared datastore counter (read the state, decrease it and update it, all in the same transaction) that was initialized with the number of chunks; if the datastore update was successful and the counter has reached zero, you can proceed with combining all the pieces together. An alternative to deferred that would simplify the suggested workflow is pipelines (https://code.google.com/p/appengine-pipeline/wiki/GettingStarted).
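The counter-based fan-in could look roughly like this (ChunkCounter, do_chunk_work and combine_all_pieces are illustrative names, not an existing API):

from google.appengine.ext import ndb

class ChunkCounter(ndb.Model):
    remaining = ndb.IntegerProperty()   # initialized to the total number of chunks

@ndb.transactional
def mark_chunk_done(counter_key):
    counter = counter_key.get()
    counter.remaining -= 1
    counter.put()
    return counter.remaining

def process_chunk(counter_key, chunk):
    do_chunk_work(chunk)                # the actual per-chunk processing
    if mark_chunk_done(counter_key) == 0:
        combine_all_pieces(counter_key) # last chunk finished: piece everything together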
I'm using deferred to put tasks in the default queue in an App Engine app, similar to this approach.
I'm naming the task with a timestamp that changes every 5 seconds. During that time a lot of calls are made to the queue with the same name, resulting in a TaskAlreadyExistsError, which is fine. The problem is that when I check the quotas, the "Task Queue API Calls" count increases for every call made, not only for those that are actually put in the queue.
I mean, compare the quota, Task Queue API Calls: 34,017 of 100,000, with the actual queue calls: /_ah/queue/deferred - 2.49K.
Here is the code that handles the queue:
try:
    deferred.defer(function_call, params, _name=task_name, _countdown=int(interval/2))
except (taskqueue.TaskAlreadyExistsError, taskqueue.TombstonedTaskError):
    pass
I suppose that is just the way it works. Is there a good way to solve the problem with the quota? Can I use memcache to store the task_name and check whether the task has been added, in addition to the above try/except? Or is there a way to check whether the task already exists without using Task Queue API calls?
Thanks for pointing out that this is the case, because I didn't realise, but the same problem must be affecting me.
As far as I can see, yes: throwing a flag into memcache keyed on the task name should work fine, and if you want to reduce those hits on memcache you can also store the flag locally within the instance.
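A sketch of the memcache guard (defer_once is a made-up wrapper; memcache.add is atomic and returns False if the key already exists, so repeated callers within the window only pay for a memcache call instead of a Task Queue API call):

from google.appengine.api import memcache, taskqueue
from google.appengine.ext import deferred

def defer_once(task_name, interval, function_call, params):
    # Only the first caller in the window wins the memcache.add and enqueues the task.
    if not memcache.add('deferred-' + task_name, True, time=interval):
        return
    try:
        deferred.defer(function_call, params,
                       _name=task_name, _countdown=int(interval / 2))
    except (taskqueue.TaskAlreadyExistsError, taskqueue.TombstonedTaskError):
        pass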
The "good way" to solve the quota problem is eliminating destined-to-fail calls to Task Queue API.
The _name you are using changes every 5 seconds, which might not be a bottleneck if you increase the execution rate of your Task Queue. But you also add tasks using _countdown.