I am using the cx_Oracle and schedule modules in Python. The following is the pseudocode.
import schedule, cx_Oracle

def db_operation(query):
    '''
    Some DB operations like
    1. Get connection
    2. Execute query
    3. Commit result (in case of DML operations)
    '''

schedule.every().hour.at(":10").do(db_operation, query='some_query_1')  # Runs at the 10th minute of every hour
schedule.every().day.at("13:10").do(db_operation, query='some_query_2')  # Runs at 1:10 p.m. every day
Both of the above scheduled jobs call the same function (which does some DB operations) and will coincide at 13:10.
Questions:
How does the scheduler handle this scenario, i.e. running two jobs at the same time? Does it put them in some sort of queue and run them one by one even though the scheduled time is the same, or do they run in parallel?
Which one gets picked first? And if I want the first job to have priority over the second, how do I do that?
Also, it is important that only one of these accesses the database at a time, otherwise it may lead to inconsistent data. How do I take care of this? For example, is it possible to put some sort of lock around the function, or should the table be locked somehow?
I took a look at the code of schedule and came to the following conclusions:
The schedule library does not run jobs in parallel or concurrently. Jobs that are due are therefore processed one after the other. They are sorted by their due time, and the job whose due time lies furthest in the past is executed first.
If jobs are due at the same time, schedule executes them in FIFO order with respect to when the jobs were created. So in your example, some_query_1 would be executed before some_query_2.
Question three is then self-explanatory: since only one job function can be executing at a time, the jobs should not get in each other's way.
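That said, if you ever move the jobs onto worker threads (for example with the run_threaded recipe from the schedule docs), an explicit lock around the database work guarantees that only one job touches the database at a time. Here is a minimal sketch reusing the function and query names from the question; in the purely serial case the lock simply adds no contention:

import threading
import time
import schedule

db_lock = threading.Lock()

def db_operation(query):
    # only one job at a time may hold the lock and touch the database
    with db_lock:
        print('executing', query)
        # get connection / execute query / commit would go here

schedule.every().hour.at(":10").do(db_operation, query='some_query_1')
schedule.every().day.at("13:10").do(db_operation, query='some_query_2')

while True:
    schedule.run_pending()  # due jobs run one by one, FIFO by creation order
    time.sleep(1)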
Related
I am using a MySQL database via Python for storing logs.
I was wondering if there is any efficient way to remove the oldest rows when the number of rows exceeds a limit.
I was able to do this by executing a query to count the total rows and then deleting the oldest ones by sorting them in ascending order and deleting from the top, but this method takes too much time. Is there a way to make this efficient by defining a rule when creating the table, so that MySQL itself takes care of it when the limit is exceeded?
Thanks in advance.
Well, there's no simple and built-in way to do this in MySQL.
Solutions that use triggers to delete old rows when you insert a new row are risky, because the trigger might fail. Or the transaction that spawned the trigger might be rolled back. In either of these cases, your intended deletion will not happen.
Also, putting the burden of deleting onto the thread that inserts new data adds extra work to the insert request, and usually we'd prefer not to make things slower for our current users.
It's more common to run an asynchronous job periodically to delete older data. This can be scheduled to run at off-hours, and run in batches. It also gives more flexibility to archive old data, or execute retries if the deletion or archiving fails or is interrupted.
MySQL does support an EVENT system, so you can run a stored routine based on a schedule. But you can only do tasks you can do in a stored routine, and it's not easy to make it do retries, or archive to any external system (e.g. cloud archive), or notify you when it's done.
Sorry there is no simple solution. There are just too many variations on how people would like it to work, and too many edge cases of potential failure.
The way I'd implement this is to use cron or else a timer thread in my web service to check the database, say once per hour. If it finds the number of rows is greater than the limit, it deletes the oldest rows in modestly sized batches (e.g. 1000 rows at a time) until the count is under the threshold.
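For illustration, here is a rough sketch of such an hourly pruning job in Python, assuming a logs table with an auto-increment primary key id, a row limit of 1,000,000, and a PyMySQL connection; the connection details, table and column names are placeholders to adapt to your schema:

import pymysql

MAX_ROWS = 1000000  # assumed limit on the number of log rows
BATCH = 1000        # delete in modestly sized batches

def prune_logs():
    conn = pymysql.connect(host='localhost', user='app', password='secret',
                           database='appdb', autocommit=True)
    try:
        with conn.cursor() as cur:
            cur.execute('SELECT COUNT(*) FROM logs')
            (count,) = cur.fetchone()
            while count > MAX_ROWS:
                # oldest rows have the smallest ids
                cur.execute('DELETE FROM logs ORDER BY id ASC LIMIT %s',
                            (min(BATCH, count - MAX_ROWS),))
                count -= cur.rowcount
    finally:
        conn.close()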
I like to write scheduled jobs in a way that can be easily controlled and monitored. So I can make it run immediately if I want, and I can disable or resume the schedule if I want, and I can view a progress report about how much it deleted the last time it ran, and how long until the next time it runs, etc.
I currently have an application running on App Engine and I am executing a few jobs using the deferred library; some of these tasks run daily, while some of them are executed once a month. Most of these tasks query Datastore to retrieve documents and then store the entities in an index (Search API). Some of these tables are replaced monthly and I have to run these tasks on all entities (4-5M).
One example of such a task is:
import logging
from google.appengine.api import search
from google.appengine.ext import deferred

def addCompaniesToIndex(cursor=None, n_entities=0, mindate=None):
    # get the next page of Company entities
    BATCH_SIZE = 200
    cps, next_cursor, more = Company.query().fetch_page(BATCH_SIZE,
                                                        start_cursor=cursor)
    doc_list = []
    for cp in cps:
        # create an Index Document from the Datastore entity
        # (this document has only about 5 text fields and one date field)
        cp_doc = getCompanyDocument(cp)
        doc_list.append(cp_doc)

    index = search.Index(name='Company')
    index.put(doc_list)

    n_entities += len(doc_list)

    if more:
        logging.debug('Company: %d added to index', n_entities)
        doc_list[:] = []
        deferred.defer(addCompaniesToIndex,
                       cursor=next_cursor,
                       n_entities=n_entities,
                       mindate=mindate)
    else:
        logging.debug('Finished Company index creation (%d processed)', n_entities)
When I run one task only, the execution takes around 4-5s per deferred task, so indexing my 5M entities would take about 35 hours.
Another thing is that when I run an update on another index (e.g. one of the daily updates) using a different deferred task on the same queue, both run a lot slower, taking about 10-15 seconds per deferred call, which is just unbearable.
My question is: is there a way to do this faster and scale the push queue to more than one job running each time? Or should I use a different approach for this problem?
Thanks in advance,
By placing the if more statement at the end of the addCompaniesToIndex() function you're practically serializing the task execution: the next deferred task is not created until the current deferred task has completed indexing its share of docs.
What you could do is move the if more statement right after the Company.query().fetch_page() call where you obtain (most of) the variables needed for the next deferred task execution.
This way the next deferred task would be created and enqueued (long) before the current one completes, so their processing can potentially be overlapping/staggered. You will need some other modifications as well, for example handling the n_entities variable which loses its current meaning in the updated scenario - but that's more or less cosmetic/informational, not essential to the actual doc indexing operation.
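A sketch of that reordering, keeping the Company, getCompanyDocument and batch-size names from the question and dropping n_entities (as noted above, it loses its meaning once tasks overlap):

def addCompaniesToIndex(cursor=None, mindate=None):
    BATCH_SIZE = 200
    cps, next_cursor, more = Company.query().fetch_page(BATCH_SIZE,
                                                        start_cursor=cursor)
    # enqueue the next batch immediately, before indexing this one,
    # so the tasks can overlap instead of running strictly one after another
    if more:
        deferred.defer(addCompaniesToIndex,
                       cursor=next_cursor,
                       mindate=mindate)

    doc_list = [getCompanyDocument(cp) for cp in cps]
    index = search.Index(name='Company')
    index.put(doc_list)
    logging.debug('Company: %d docs added to index', len(doc_list))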
If the number of deferred tasks is very high there is a risk of queueing too many of them simultaneously, which could cause an "explosion" in the number of instances GAE would spawn to handle them. If that is not desired, you can "throttle" the rate at which the deferred tasks are spawned by delaying their execution a bit; see https://stackoverflow.com/a/38958475/4495081.
I think I finally managed to get around this issue by using two queues and the idea proposed in the previous answer.
On the first queue we only query the main entities (with keys_only) and launch another task on a second queue for those keys. The first task then relaunches itself on queue 1 with the next_cursor.
The second queue gets the entity keys and does all the queries and inserts into Full Text Search / BigQuery / Pub/Sub (this is slow, ~15 s per group of 100 keys).
I tried using only one queue as well, but the processing throughput was not as good. I believe this might come from having slow and fast tasks running on the same queue, where the scheduler might not work as well.
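Roughly, the two-queue split can look like the sketch below; the queue names are placeholders, and the second task reuses getCompanyDocument from the question (the BigQuery / Pub/Sub work would slot in next to the index.put call):

from google.appengine.api import search
from google.appengine.ext import deferred, ndb

def scanCompanies(cursor=None):
    # queue 1: fast task that only pages over the keys
    keys, next_cursor, more = Company.query().fetch_page(
        100, start_cursor=cursor, keys_only=True)
    if keys:
        deferred.defer(processCompanyKeys,
                       [k.urlsafe() for k in keys],
                       _queue='slow-work')   # placeholder queue name
    if more:
        deferred.defer(scanCompanies, cursor=next_cursor,
                       _queue='key-scan')    # placeholder queue name

def processCompanyKeys(urlsafe_keys):
    # queue 2: slow task that does the real work for this group of keys
    companies = ndb.get_multi([ndb.Key(urlsafe=k) for k in urlsafe_keys])
    index = search.Index(name='Company')
    index.put([getCompanyDocument(cp) for cp in companies if cp])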
I'm sorry if this question has in fact been asked before. I've searched around quite a bit and found pieces of information here and there but nothing that completely helps me.
I am building an app on Google App Engine in Python that lets a user upload a file, which is then processed by a piece of Python code, and the resulting processed file gets sent back to the user in an email.
At first I used a deferred task for this, which worked great. Over time I've come to realize that since the processing can take more than the 10 minutes I have before I hit the DeadlineExceededError, I need to be more clever.
I therefore started to look into task queues, wanting to make a queue that processes the file in chunks, and then piece everything together at the end.
My present code for creating the single deferred task looks like this:
_ = deferred.defer(transform_function, filename, from_, to_, email)
so that the transform_function code gets the values of filename, from_, to_ and email and sets off to do the processing.
Could someone please enlighten me as to how I turn this into a linear chain of tasks that get acted on one after the other? I have read all the Google App Engine documentation I can think of, but it is unfortunately not written in enough detail in terms of actual pieces of code.
I see references to things like:
taskqueue.add(url='/worker', params={'key': key})
but since I don't have a URL for my task, just a transform_function() implemented elsewhere, I don't see how this applies to me…
Many thanks!
You can just keep calling deferred to run your task when you get to the end of each phase.
Other queues just allow you to control the scheduling and rate, but work the same.
I track the elapsed time in the task, and when I get near the end of the processing window the code stops what it is doing and calls defer for the next task in the chain, which either starts the next step or continues where the previous one left off, depending on whether it's a discrete set of steps or a continuous chunk of work. This was all written back when tasks could only run for 60 seconds.
However, the problem you will face (it doesn't matter if it's a normal task queue or deferred) is that each stage could fail for some reason and then be re-run, so each phase must be idempotent.
For long-running chained tasks, I construct an entity in the datastore that holds a description of the work to be done and tracks the processing state of the job; then you can just keep rerunning the same task until it finishes, at which point it marks the job as complete.
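As an illustration only, here is a sketch of that pattern with a hypothetical JobState entity and a placeholder process_chunk() step, which must be idempotent for the reasons given above:

import time
from google.appengine.ext import deferred, ndb

TIME_BUDGET = 9 * 60  # stop a little before the 10-minute task deadline

class JobState(ndb.Model):
    # hypothetical entity describing the work and tracking progress
    next_offset = ndb.IntegerProperty(default=0)
    done = ndb.BooleanProperty(default=False)

def run_job(job_key_urlsafe):
    started = time.time()
    job = ndb.Key(urlsafe=job_key_urlsafe).get()
    while not job.done:
        if time.time() - started > TIME_BUDGET:
            job.put()                                 # save progress so far
            deferred.defer(run_job, job_key_urlsafe)  # chain the next task
            return
        offset = process_chunk(job.next_offset)  # placeholder, must be idempotent
        job.done = offset is None
        if offset is not None:
            job.next_offset = offset
    job.put()  # mark the job as complete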
To avoid the 10-minute timeout you can direct the request to a backend or a B-type module using the "_target" param.
BTW, is there any reason you need to process the chunks sequentially? If all you need is a notification upon completion of all chunks (so you can "piece everything together at the end"), you can implement it in various ways. For example, each chunk's deferred task can decrement a shared datastore counter (read the state, decrement and update, all in the same transaction) that was initialized with the number of chunks; if the datastore update succeeds and the counter has reached zero, you can proceed with combining all the pieces. An alternative to deferred that would simplify the suggested workflow is pipelines (https://code.google.com/p/appengine-pipeline/wiki/GettingStarted).
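A sketch of that counter, with hypothetical transform() and combine_results() placeholders; the counter entity is created up front with remaining set to the number of chunks:

from google.appengine.ext import deferred, ndb

class ChunkCounter(ndb.Model):
    remaining = ndb.IntegerProperty()

@ndb.transactional
def mark_chunk_done(counter_key):
    # read state, decrement and update, all in the same transaction
    counter = counter_key.get()
    counter.remaining -= 1
    counter.put()
    return counter.remaining

def process_chunk(chunk, counter_urlsafe):
    transform(chunk)                        # placeholder for the per-chunk work
    counter_key = ndb.Key(urlsafe=counter_urlsafe)
    if mark_chunk_done(counter_key) == 0:
        deferred.defer(combine_results)     # last chunk: piece everything together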
We have hundreds of thousands of tasks that need to be run at a variety of arbitrary intervals, some every hour, some every day, and so on. The tasks are resource intensive and need to be distributed across many machines.
Right now tasks are stored in a database with an "execute at this time" timestamp. To find tasks that need to be executed, we query the database for jobs that are due to be executed, then update the timestamps when the task is complete. Naturally this leads to a substantial write load on the database.
As far as I can tell, we are looking for something to release tasks into a queue at a set interval. (Workers could then request tasks from that queue.)
What is the best way to schedule recurring tasks at scale?
For what it's worth we're largely using Python, although we have no problems using components (RabbitMQ?) written in other languages.
UPDATE: Right now we have about 350,000 tasks that run every half hour or so, with some variation. 350,000 tasks * 48 times per day is 16,800,000 tasks executed per day.
UPDATE 2: There are no dependencies. The tasks do not have to be executed in order and do not rely on previous results.
Since ACID isn't needed and you're okay with tasks potentially running twice, I wouldn't keep the timestamps in the database at all. For each task, create a list of [timestamp_of_next_run, task_id] and use a min-heap to store all of the lists. Python's heapq module can maintain the heap for you. You'll be able to very efficiently pop off the task with the soonest timestamp. When you need to run a task, use its task_id to look up in the database what the task needs to do. When a task completes, update the timestamp and put it back into the heap. (Just be careful not to change an item that's currently in the heap, as that will break the heap properties).
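A minimal sketch of that loop, assuming hypothetical load_schedule() and run_task() helpers (run_task looks the task up in the database, executes it, and returns the interval until its next run):

import heapq
import time

# heap of [timestamp_of_next_run, task_id] lists, smallest timestamp on top
heap = []
for task_id, next_run in load_schedule():   # hypothetical loader
    heapq.heappush(heap, [next_run, task_id])

while heap:
    next_run, task_id = heap[0]              # peek at the soonest task
    now = time.time()
    if next_run > now:
        time.sleep(min(next_run - now, 1.0))
        continue
    heapq.heappop(heap)                      # remove before modifying
    interval = run_task(task_id)             # hypothetical: read task from DB and run it
    heapq.heappush(heap, [now + interval, task_id])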
Use the database only to store information that you will still care about after a crash and reboot. If you won't need the information after a reboot, don't spend the time writing to disk. You will still have a lot of database read operations to load the information about a task that needs to run, but a read is much cheaper than a write.
If you don't have enough RAM to store all of the tasks in memory at the same time, you could go with a hybrid setup where you keep the tasks for the next 24 hours (for example) in RAM and everything else stays in the database. Alternately, you could rewrite the code in C or C++, which are less memory hungry.
If you don't want a database, you could store just the next-run timestamp and task id in memory, and keep the properties for each task in a file named [task_id].txt. You would need a data structure to hold all the tasks sorted by timestamp in memory; an AVL tree seems like it would work, and here's a simple one for Python: http://bjourne.blogspot.com/2006/11/avl-tree-in-python.html. Hopefully Linux (I assume that's what you are running on) can handle millions of files in a directory, otherwise you might need to hash on the task id to get a subfolder.
Your master server would just need to run a loop, popping off tasks out of the AVL tree until the next task's timestamp is in the future. Then you could sleep for a few seconds and start checking again. Whenever a task runs, you would update the next run timestamp in the task file and re-insert it into the AVL tree.
When the master server reboots, there would be the overhead of reloading all tasks id and next run timestamp back into memory, so that might be painful with millions of files. Maybe you just have one giant file and give each task 1K space in the file for properties and next run timestamp and then use [task_id] * 1K to get to the right offset for the task properties.
If you are willing to use a database, I am confident MySQL could handle whatever you throw at it given the conditions you describe, assuming you have 4GB+ RAM and several hard drives in RAID 0+1 on your master server.
Finally, if you really want to get complicated, Hadoop might work too: http://hadoop.apache.org/
If you're worried about writes, you can have a set of servers that dispatch the tasks (perhaps striping the servers to equalize load) and have each server write bulk checkpoints to the DB (this way, you will not have so many write queries). You still have to write enough to be able to recover if a scheduling server dies, of course.
In addition, if you don't have a clustered index on timestamp, you will avoid having a hot-spot at the end of the table.
350,000 tasks * 48 times per day is 16,800,000 tasks executed per day.
To schedule the jobs, you don't need a database.
Databases are for things that are updated. The only update visible here is a change to the schedule to add, remove or reschedule a job.
Cron does this in a totally scalable fashion with a single flat file.
Read the entire flat file into memory and start spawning jobs. Periodically, check the file's modification time (via fstat) to see if it has changed. Or, even better, wait for a HUP signal and use that to reread the file, i.e. use kill -HUP to tell the scheduler to reload.
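A toy sketch of that scheduler, assuming a flat file with one "interval_seconds command" pair per line; the file path and format are made up for the example:

import signal
import subprocess
import time

SCHEDULE_FILE = '/etc/myapp/schedule.txt'  # assumed path; one "interval_seconds command" per line
reload_requested = True                    # force an initial load
jobs = []                                  # list of [next_run, interval, command]

def request_reload(signum, frame):
    global reload_requested
    reload_requested = True

signal.signal(signal.SIGHUP, request_reload)  # kill -HUP <pid> triggers a reread

while True:
    if reload_requested:
        reload_requested = False
        jobs = []
        with open(SCHEDULE_FILE) as f:
            for line in f:
                if not line.strip():
                    continue
                interval, command = line.split(None, 1)
                jobs.append([time.time(), float(interval), command.strip()])
    now = time.time()
    for job in jobs:
        if job[0] <= now:
            subprocess.Popen(job[2], shell=True)  # spawn the job, don't wait for it
            job[0] = now + job[1]
    time.sleep(1)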
It's unclear what you're updating the database for.
If the database is used to determine the future schedule based on job completion, then a single database is a Very Bad Idea.
If you're using the database to do some analysis of job history, then you have a simple data warehouse.
Record completion information (start time, end time, exit status, all that stuff) in a simple flat log file.
Process the flat log files to create a fact table and dimension updates.
When someone has the urge to do some analysis, load relevant portions of the flat log files into a datamart so they can do queries and counts and averages and the like.
Do not directly record 17,000,000 rows per day into a relational database. No one wants all that data. They want summaries: counts and averages.
Why hundreds of thousands and not hundreds of millions ? :evil:
I think you need Stackless Python (http://www.stackless.com/), created by the genius Christian Tismer.
Quoting:
"Stackless Python is an enhanced version of the Python programming language. It allows programmers to reap the benefits of thread-based programming without the performance and complexity problems associated with conventional threads. The microthreads that Stackless adds to Python are a cheap and lightweight convenience which can, if used properly, give the following benefits: improved program structure, more readable code, and increased programmer productivity."
It is used for massive multiplayer games.
I'm developing software using the Google App Engine.
I have some considerations about the optimal design regarding the following issue: I need to create and save snapshots of some entities at regular intervals.
In the conventional relational db world, I would create db jobs which would insert new summary records.
For example, every hour a job would insert a record containing each active user's current score into the "userrank" table.
I'd like to know what's the best method to achieve this in Google App Engine. I know that there is the Cron service, but does it allow us to execute jobs which will insert/update thousands of records?
I think you'll find that snapshotting every user's state every hour isn't something that will scale well, no matter what your framework is. A more ordinary environment will disguise this by letting you have longer-running tasks, but you'll still reach the point where it's not practical to take a snapshot of every user's data every hour.
My suggestion would be this: add a 'last snapshot' field, and override the put() method of your model (assuming you're using Python; the same is possible in Java, but I don't know the syntax), such that whenever you update a record, it checks whether it's been more than an hour since the last snapshot, and if so, creates and writes a snapshot record.
In order to prevent concurrent updates creating two identical snapshots, you'll want to give the snapshots a key name derived from the time at which the snapshot was taken. That way, if two concurrent updates try to write a snapshot, one will harmlessly overwrite the other.
To get the snapshot for a given hour, simply query for the oldest snapshot newer than the requested period. As an added bonus, since inactive records aren't snapshotted, you're saving a lot of space, too.
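A sketch of that approach with ndb, using made-up UserScore / UserScoreSnapshot models and an hourly key name; the real field names would come from your own schema:

import datetime
from google.appengine.ext import ndb

SNAPSHOT_INTERVAL = datetime.timedelta(hours=1)

class UserScoreSnapshot(ndb.Model):
    score = ndb.IntegerProperty()
    taken_at = ndb.DateTimeProperty()

class UserScore(ndb.Model):
    score = ndb.IntegerProperty()
    last_snapshot = ndb.DateTimeProperty()

    def put(self, **kwargs):
        now = datetime.datetime.utcnow()
        if (self.key is not None and
                (self.last_snapshot is None or
                 now - self.last_snapshot >= SNAPSHOT_INTERVAL)):
            # key name derived from the hour, so two concurrent updates
            # simply overwrite each other's snapshot harmlessly
            snap_id = '%s-%s' % (self.key.id(), now.strftime('%Y%m%d%H'))
            UserScoreSnapshot(id=snap_id, score=self.score, taken_at=now).put()
            self.last_snapshot = now
        return super(UserScore, self).put(**kwargs)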
Have you considered using the remote API instead? That way you could get a shell to your datastore and avoid the timeouts. The Mapper class demonstrated there is quite useful, and I've used it successfully to do batch operations on ~1500 objects.
That said, cron should work fine too. You do have a limit on the time of each individual request so you can't just chew through them all at once, but you can use redirection to loop over as many users as you want, processing one user at a time. There should be an example of this in the docs somewhere if you need help with this approach.
I would use a combination of cron jobs and the looping URL fetch method detailed here: http://stage.vambenepe.com/archives/549. That way you can catch your timeouts and begin another request.
To summarize the article: the cron job calls your initial process; you catch the timeout error and call the process again, masked as a second URL. You have to ping-pong between two URLs to keep App Engine from thinking you are in an accidental loop. You also need to be careful not to loop infinitely: make sure there is an end state for your updating loop, since it would put you over your quotas pretty quickly if it never ended.
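As an illustration of the ping-pong pattern (not the article's exact code), here is a sketch with a placeholder process_next_batch() step; the hand-off to the other URL is done with a push task rather than a raw URL fetch, which serves the same purpose:

import webapp2
from google.appengine.api import taskqueue
from google.appengine.runtime import DeadlineExceededError

class UpdateHandler(webapp2.RequestHandler):
    # cron hits /update_a; on timeout we hand off to /update_b (and vice versa),
    # so App Engine does not flag the alternation as an accidental loop
    def get(self):
        cursor = self.request.get('cursor') or None
        try:
            more = True
            while more:  # end state: process_next_batch() reports no more work
                cursor, more = process_next_batch(cursor)  # placeholder for the real update
        except DeadlineExceededError:
            next_url = '/update_b' if self.request.path == '/update_a' else '/update_a'
            taskqueue.add(url=next_url, params={'cursor': cursor or ''}, method='GET')

app = webapp2.WSGIApplication([('/update_a', UpdateHandler),
                               ('/update_b', UpdateHandler)])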