We have hundreds of thousands of tasks that need to be run at a variety of arbitrary intervals, some every hour, some every day, and so on. The tasks are resource intensive and need to be distributed across many machines.
Right now tasks are stored in a database with an "execute at this time" timestamp. To find tasks that need to be executed, we query the database for jobs that are due to be executed, then update the timestamps when the task is complete. Naturally this leads to a substantial write load on the database.
As far as I can tell, we are looking for something to release tasks into a queue at a set interval. (Workers could then request tasks from that queue.)
What is the best way to schedule recurring tasks at scale?
For what it's worth we're largely using Python, although we have no problems using components (RabbitMQ?) written in other languages.
UPDATE: Right now we have about 350,000 tasks that run every half hour or so, with some variation. 350,000 tasks * 48 times per day is 16,800,000 tasks executed per day.
UPDATE 2: There are no dependencies. The tasks do not have to be executed in order and do not rely on previous results.
Since ACID isn't needed and you're okay with tasks potentially running twice, I wouldn't keep the timestamps in the database at all. For each task, create a list of [timestamp_of_next_run, task_id] and use a min-heap to store all of the lists. Python's heapq module can maintain the heap for you. You'll be able to very efficiently pop off the task with the soonest timestamp. When you need to run a task, use its task_id to look up in the database what the task needs to do. When a task completes, update the timestamp and put it back into the heap. (Just be careful not to change an item that's currently in the heap, as that will break the heap properties).
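A minimal sketch of that loop, assuming hypothetical run_task() and get_interval() callables that respectively load-and-execute a task from the database and return its recurrence interval in seconds:

import heapq
import time

def scheduler_loop(tasks, run_task, get_interval):
    # tasks: iterable of (next_run_timestamp, task_id) pairs.
    # run_task(task_id) and get_interval(task_id) are hypothetical helpers,
    # not part of the question's code.
    heap = list(tasks)
    heapq.heapify(heap)                      # O(n) build of the min-heap
    while heap:
        next_run, task_id = heap[0]          # peek at the soonest task
        now = time.time()
        if next_run > now:
            time.sleep(min(next_run - now, 1.0))
            continue
        heapq.heappop(heap)                  # take it out before touching it
        run_task(task_id)                    # read the task from the DB and execute it
        # Reschedule by pushing a fresh entry; never mutate an item already in the heap.
        heapq.heappush(heap, (time.time() + get_interval(task_id), task_id))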
Use the database only to store information that you will still care about after a crash and reboot. If you won't need the information after a reboot, don't spend the time writing to disk. You will still have a lot of database read operations to load the information about a task that needs to run, but a read is much cheaper than a write.
If you don't have enough RAM to store all of the tasks in memory at the same time, you could go with a hybrid setup where you keep the tasks for the next 24 hours (for example) in RAM and everything else stays in the database. Alternately, you could rewrite the code in C or C++, which are less memory hungry.
If you don't want a database, you could store just the next-run timestamp and task id in memory, and keep the properties for each task in a file named [task_id].txt. You would need a data structure to hold all the tasks in memory, sorted by timestamp; an AVL tree seems like it would work, and here's a simple one for Python: http://bjourne.blogspot.com/2006/11/avl-tree-in-python.html. Hopefully Linux (I assume that's what you are running on) can handle millions of files in a directory; otherwise you might need to hash on the task id to get a sub-folder.
Your master server would just need to run a loop, popping off tasks out of the AVL tree until the next task's timestamp is in the future. Then you could sleep for a few seconds and start checking again. Whenever a task runs, you would update the next run timestamp in the task file and re-insert it into the AVL tree.
When the master server reboots, there would be the overhead of reloading every task's id and next-run timestamp back into memory, which could be painful with millions of files. Maybe you could instead have one giant file, give each task 1 KB of space for its properties and next-run timestamp, and use [task_id] * 1K as the offset to seek to the task's properties.
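A rough sketch of that fixed-offset layout, assuming task ids are small consecutive integers and each task's properties fit in a 1 KB slot (both assumptions, not part of the original design):

RECORD_SIZE = 1024  # 1 KB per task, as suggested above

def read_task_record(f, task_id):
    # Seek directly to the task's slot and read its properties.
    f.seek(task_id * RECORD_SIZE)
    return f.read(RECORD_SIZE).rstrip(b"\x00")

def write_task_record(f, task_id, data):
    # data is bytes; pad the slot with NULs so every record stays 1 KB.
    if len(data) > RECORD_SIZE:
        raise ValueError("task record does not fit in its 1 KB slot")
    f.seek(task_id * RECORD_SIZE)
    f.write(data.ljust(RECORD_SIZE, b"\x00"))

# Usage: open the file once in binary read/write mode.
# with open("tasks.dat", "r+b") as f:
#     props = read_task_record(f, 42)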
If you are willing to use a database, I am confident MySQL could handle whatever you throw at it given the conditions you describe, assuming you have 4GB+ RAM and several hard drives in RAID 0+1 on your master server.
Finally, if you really want to get complicated, Hadoop might work too: http://hadoop.apache.org/
If you're worried about writes, you can have a set of servers that dispatch the tasks (maybe stripe the servers to equalize load) and have each server write bulk checkpoints to the DB (this way, you will not have so many write queries). You still have to write enough to be able to recover if a scheduling server dies, of course.
In addition, if you don't have a clustered index on timestamp, you will avoid having a hot-spot at the end of the table.
350,000 tasks * 48 times per day is 16,800,000 tasks executed per day.
To schedule the jobs, you don't need a database.
Databases are for things that are updated. The only update visible here is a change to the schedule to add, remove or reschedule a job.
Cron does this in a totally scalable fashion with a single flat file.
Read the entire flat file into memory, start spawning jobs. Periodically, check the fstat to see if the file changed. Or, even better, wait for a HUP signal and use that to reread the file. Use kill -HUP to signal the scheduler to reread the file.
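A hedged sketch of that pattern in Python; the schedule-file format and the spawn_due_jobs() dispatcher are assumptions, not something cron or the question prescribes:

import signal
import time

SCHEDULE_FILE = "schedule.txt"   # assumed format: one "<interval_seconds> <command>" per line
reload_requested = True          # force an initial load

def handle_hup(signum, frame):
    # "kill -HUP <pid>" flips this flag; the main loop rereads the file.
    global reload_requested
    reload_requested = True

def load_schedule():
    with open(SCHEDULE_FILE) as f:
        return [line.split(None, 1) for line in f if line.strip()]

def main(spawn_due_jobs):
    # spawn_due_jobs(schedule) is a hypothetical dispatcher that forks whatever is due.
    global reload_requested
    signal.signal(signal.SIGHUP, handle_hup)
    schedule = []
    while True:
        if reload_requested:
            schedule = load_schedule()
            reload_requested = False
        spawn_due_jobs(schedule)
        time.sleep(1)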
It's unclear what you're updating the database for.
If the database is used to determine future schedule based on job completion, then a single database is a Very Bad Idea.
If you're using the database to do some analysis of job history, then you have a simple data warehouse.
Record completion information (start time, end time, exit status, all that stuff) in a simple flat log file.
Process the flat log files to create a fact table and dimension updates.
When someone has the urge to do some analysis, load relevant portions of the flat log files into a datamart so they can do queries and counts and averages and the like.
Do not directly record 17,000,000 rows per day into a relational database. No one wants all that data. They want summaries: counts and averages.
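For illustration, a minimal sketch of the flat-log recording and the counts-and-averages summary described above; the CSV field layout is an assumption:

import csv
from collections import defaultdict

LOG_FILE = "completions.log"  # assumed: one CSV row per completed job

def record_completion(task_id, start_time, end_time, exit_status):
    # Append-only flat file: cheap, crash-friendly, no database writes.
    with open(LOG_FILE, "a", newline="") as f:
        csv.writer(f).writerow([task_id, start_time, end_time, exit_status])

def summarise():
    # Counts and average runtimes per task -- the kind of summary people actually want.
    counts, total_runtime = defaultdict(int), defaultdict(float)
    with open(LOG_FILE, newline="") as f:
        for task_id, start, end, status in csv.reader(f):
            counts[task_id] += 1
            total_runtime[task_id] += float(end) - float(start)
    return {tid: (counts[tid], total_runtime[tid] / counts[tid]) for tid in counts}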
Why hundreds of thousands and not hundreds of millions? :evil:
I think you need Stackless Python, http://www.stackless.com/, created by the genius of Christian Tismer.
Quoting:
Stackless Python is an enhanced version of the Python programming language. It allows programmers to reap the benefits of thread-based programming without the performance and complexity problems associated with conventional threads. The microthreads that Stackless adds to Python are a cheap and lightweight convenience which can, if used properly, give the following benefits: improved program structure, more readable code, and increased programmer productivity.
It is used for massively multiplayer games.
Related
I am using a MySQL database via Python for storing logs.
I was wondering if there is an efficient way to remove the oldest row when the number of rows exceeds a limit.
I was able to do this by running a query to count the total rows and then deleting the oldest ones, ordering by age ascending. But this method takes too much time. Is there a way to make this efficient through a rule set up when creating the table, so that MySQL itself takes care of it when the limit is exceeded?
Thanks in advance.
Well, there's no simple and built-in way to do this in MySQL.
Solutions that use triggers to delete old rows when you insert a new row are risky, because the trigger might fail. Or the transaction that spawned the trigger might be rolled back. In either of these cases, your intended deletion will not happen.
Also putting the burden of deleting on the thread that inserts new data causes extra work for the insert request, and usually we'd prefer not to make things slower for our current users.
It's more common to run an asynchronous job periodically to delete older data. This can be scheduled to run at off-hours, and run in batches. It also gives more flexibility to archive old data, or execute retries if the deletion or archiving fails or is interrupted.
MySQL does support an EVENT system, so you can run a stored routine based on a schedule. But you can only do tasks you can do in a stored routine, and it's not easy to make it do retries, or archive to any external system (e.g. cloud archive), or notify you when it's done.
Sorry there is no simple solution. There are just too many variations on how people would like it to work, and too many edge cases of potential failure.
The way I'd implement this is to use cron or else a timer thread in my web service to check the database, say once per hour. If it finds the number of rows is greater than the limit, it deletes the oldest rows in modestly sized batches (e.g. 1000 rows at a time) until the count is under the threshold.
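A sketch of that hourly cleanup against a DB-API connection; the driver (mysql-connector here), the table name, the timestamp column, and the limits are placeholders, not a definitive implementation:

import mysql.connector  # any DB-API driver would do; this choice is an assumption

ROW_LIMIT = 1000000   # keep at most this many log rows (placeholder value)
BATCH_SIZE = 1000     # delete in modest batches to avoid long-held locks

def trim_log_table(conn):
    cur = conn.cursor()
    cur.execute("SELECT COUNT(*) FROM logs")          # 'logs' and 'created_at' are assumed names
    (total,) = cur.fetchone()
    while total > ROW_LIMIT:
        cur.execute(
            "DELETE FROM logs ORDER BY created_at ASC LIMIT %s",
            (min(BATCH_SIZE, total - ROW_LIMIT),),
        )
        conn.commit()
        total -= cur.rowcount

# Usage: call trim_log_table() from cron or a timer thread, say once per hour.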
I like to write scheduled jobs in a way that can be easily controlled and monitored. So I can make it run immediately if I want, and I can disable or resume the schedule if I want, and I can view a progress report about how much it deleted the last time it ran, and how long until the next time it runs, etc.
I have some DAGs already defined in Airflow that perform queries on third party APIs, pull some data partitioned by date (for example the trending items for yesterday) and write them to the DB. They can also be triggered manually with a bunch of parameters to download the same items without the date-based logic. So far so good, this is a standard scenario for Airflow.
I now want to reuse and adapt some of these DAGs to perform special queries: in Airflow's terms this means receiving different job parameters. I could do it one by one manually, but clearly this is not ideal. The main reason is that these third-party APIs have daily quota thresholds that we don't want to cross, so we are not free to run everything every day and need to be considerate with the executions.
So let's say I want to download 100 entities, whose IDs I can fetch through a service call, and my quota is 10 per day. One solution would be to create a DAG that does the call and saves the IDs into a database along with the date on which they should be executed, but then I'm doing the Airflow scheduler's job, which doesn't seem right. There are many things that could go wrong.
I could do the same trick but with something that looks like a queue: one manually triggered DAG puts tasks in the queue and another, daily, DAG pulls from it. This one kind of works in my mind, but it seems like a lot of effort and I'm not sure what should keep track of the queue. Something like Celery seems like overkill, so I would probably have to use a database. Still, it feels like over-engineering and some kind of Airflow anti-pattern, but I don't have much experience with the tool, so feedback is welcome.
Are there other options? Is there some Airflow's feature that would solve this easily?
This is a simplified version of my code, in which each process crawls a link, gets the data, and stores it in the database, in parallel.
import sys
import time

import requests
from multiprocessing import Pool

# convert_date_format() and get_bunch_of_url_list() are defined elsewhere in my project.

def crawl_and_save_data(url):
    while True:
        res = requests.get(url)
        price_list = res.json()
        if len(price_list) == 0:
            sys.exit()

        # Save all data in DB HERE
        # for price in price_list:
        #     Save price in PostgreSQL database table (same table)

        until_date = convert_date_format(price_list[len(price_list) - 1]['candleDateTime'])
        time.sleep(1)

if __name__ == '__main__':
    # When executed with pure python
    pool = Pool()
    pool.map(
        crawl_and_save_data,
        get_bunch_of_url_list()
    )
The key point of this code is the section marked

# Save all data in DB HERE
# for price in price_list:
#     Save price in PostgreSQL database table (same table)

where each process writes to the same database table.
I wonder whether this kind of work prevents my tasks from running concurrently.
Could data be lost because of the concurrent database access?
Or are all the queries simply put into some kind of I/O queue?
I need your advice. Thanks.
tl;dr - you should be fine, but the question doesn't include enough detail to answer definitively. You will need to run some tests, but you should expect to get a good amount of concurrency (a few dozen simultaneous writes) before things start to slow down.
Note though - as currently written, it seems like your workers will fetch the same URL over and over again, because of the while True loop that never breaks or exits. You detect whether the returned list is empty, but does the URL track state somehow? I would expect multiple identical GETs to return the same data over and over...
As far as concurrency, that ultimately depends on -
The resources available to the database (memory, I/O, CPU)
The server-side resources consumed by each connection/operation.
That second point includes memory, etc., but also whether independent operations are competing for the same resources (are 10 different connections trying to update the same set of rows in the database?). Updating the same table is fine, more or less, because the database can use row-level locks.
Also note the difference between concurrency (how many things happen at once) and throughput (how many things happen within a period of time). Concurrency and throughput can relate to each other in counter-intuitive ways - it's not uncommon to see a situation where 1 process can do N operations per second, but M processes sustain far less than M x N operations per second, possibly even bringing the whole thing to a screeching halt (e.g., via a deadlock).
Thinking about your code snippet, here are some observations:
You are using multiprocessing.Pool, which uses sub-processes for concurrency and will work well for your case if you...
Make sure you open your connections in the sub-process; trying to re-use a connection from the parent process will not work (see the sketch after this list)
If you do nothing else to your code, you will be using a number of sub-processes equal to the number of cores on your db client machine
This is a good starting point. If a function is CPU-bound, you really can't go higher. If your function is I/O-bound, the CPU will be idle waiting for I/O operations to return. You can start ramping up the worker count in this case.
Thus, each sub-process will have a connection to the database, with some amount of server memory per connection.
This also means that each insert should be in isolated transactions, with no additional work on your part.
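A hedged sketch of what "open the connection in the sub-process" can look like with psycopg2; the DSN, table, columns, and dictionary keys are placeholders:

import psycopg2

def save_prices(url, price_list):
    # Open the connection inside the worker process: a connection created in the
    # parent cannot safely be shared across the fork boundary.
    conn = psycopg2.connect("dbname=prices user=crawler")  # DSN is a placeholder
    try:
        with conn, conn.cursor() as cur:   # the with-block commits one transaction
            for price in price_list:
                cur.execute(
                    "INSERT INTO price (source_url, candle_ts, value) VALUES (%s, %s, %s)",
                    (url, price["candleDateTime"], price["tradePrice"]),  # keys assumed
                )
    finally:
        conn.close()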
Given that, simple, append-only, row-by-row transactions should support relatively high concurrency and high throughput, again depending on how big and fast your DB server is.
Also, note that you are already queueing :) With no args, Pool() creates a number of child processes equal to os.cpu_count() (see the docs). If the number of URLs in your collection is greater than that, the collection is a queue of sorts, just not a durable one. If your master process dies, the list of URLs is gone.
Unrelated - unless you are worried about your URL fetches getting throttled, from a db perspective, there is no need for the time.sleep(1) statement.
Hope this helps.
I am using Ubuntu. I have some management commands which, when run, do a lot of database manipulation, so they take nearly 15 minutes.
My system monitor shows that my machine has 4 CPUs and 6 GB of RAM. But this process is not utilising all the CPUs; I think it is using only one of them, and very little RAM at that. I think that if I could make it use all the CPUs and most of the RAM, the process would complete in much less time.
I tried renice, setting the priority to -18 (meaning very high), but the speed is still low.
Details:
It's a Python script with roughly ten loops of nearly 10,000 iterations each, and every iteration saves to a Postgres database.
If you are looking to make this application run across multiple cpu's then there are a number of things you can try depending on your setup.
The most obvious thing that comes to mind is making the application use threads and multiple processes. This will allow the application to "do more" at once. The issue you might have here is concurrent database access, so you might need to use transactions (at which point you might lose the advantage of using multiple processes in the first place).
Secondly, make sure you are not opening and closing lots of database connections; ensure your application holds a connection open for as long as it needs it.
Thirdly, ensure the database is correctly indexed. If you are doing searches on large strings then things are going to be slow.
Fourthly, do everything you can in SQL, leaving little manipulation to Python; SQL is extremely quick at data manipulation if you let it. As soon as you start pulling data out of the database and into code, things slow down considerably.
Fifthly, make use of stored procedures which can be cached and optimized internally within the database. These can be a lot quicker than application built queries which cannot be optimized as easily.
Sixthly, don't save on each iteration of the program. Try to produce a batch-style job whereby you alter a number of records and then save all of them in one batch. This will reduce the amount of I/O per iteration and speed the process up massively.
Django does support a bulk update method, and there was also a question on Stack Overflow a while back about saving multiple Django objects at once (see the sketch after the links below):
Saving many Django objects with one big INSERT statement
Django: save multiple object signal once
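A minimal sketch of the batch approach, assuming a hypothetical LogEntry model; bulk_create issues a small number of INSERTs instead of one per object:

from myapp.models import LogEntry  # hypothetical model, used only for illustration

def save_batch(rows, batch_size=1000):
    # Build the objects in memory, then hit the database once per batch
    # instead of issuing one INSERT per row.
    objs = [LogEntry(value=row) for row in rows]
    LogEntry.objects.bulk_create(objs, batch_size=batch_size)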
Just in case, did you run the command renice -20 -p {pid} instead of renice --20 -p {pid}? In the first case it will be given the lowest priority.
I'm developing software using the Google App Engine.
I have some considerations about the optimal design regarding the following issue: I need to create and save snapshots of some entities at regular intervals.
In the conventional relational db world, I would create db jobs which would insert new summary records.
For example, a job would insert a record containing each active user's current score into the "userrank" table, say, every hour.
I'd like to know what's the best method to achieve this in Google App Engine. I know that there is the Cron service, but does it allow us to execute jobs which will insert/update thousands of records?
I think you'll find that snapshotting every user's state every hour isn't something that will scale well no matter what your framework. A more ordinary environment will disguise this by letting you have longer running tasks, but you'll still reach the point where it's not practical to take a snapshot of every user's data, every hour.
My suggestion would be this: add a 'last snapshot' field, and override the put() method of your model (assuming you're using Python; the same is possible in Java, but I don't know the syntax), so that whenever you update a record it checks whether it has been more than an hour since the last snapshot and, if so, creates and writes a snapshot record.
In order to prevent concurrent updates creating two identical snapshots, you'll want to give the snapshots a key name derived from the time at which the snapshot was taken. That way, if two concurrent updates try to write a snapshot, one will harmlessly overwrite the other.
To get the snapshot for a given hour, simply query for the oldest snapshot newer than the requested period. As an added bonus, since inactive records aren't snapshotted, you're saving a lot of space, too.
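A hedged sketch of that override using the old Python db API; the model, the property names, and the assumption that UserRank entities are created with a key_name (e.g. the user id) are all illustrative, not part of the answer above:

import time
from google.appengine.ext import db

SNAPSHOT_INTERVAL = 3600  # one hour, in seconds

class UserRankSnapshot(db.Model):
    score = db.IntegerProperty()
    taken_at = db.IntegerProperty()

class UserRank(db.Model):
    score = db.IntegerProperty()
    last_snapshot = db.IntegerProperty(default=0)

    def put(self):
        now = int(time.time())
        if now - self.last_snapshot > SNAPSHOT_INTERVAL:
            hour_bucket = now - (now % SNAPSHOT_INTERVAL)
            # Key name derived from the hour, so two concurrent updates just
            # overwrite each other with an identical snapshot.
            UserRankSnapshot(
                key_name="%s_%d" % (self.key().name(), hour_bucket),  # assumes a key_name
                score=self.score,
                taken_at=hour_bucket,
            ).put()
            self.last_snapshot = now
        return super(UserRank, self).put()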
Have you considered using the remote api instead? This way you could get a shell to your datastore and avoid the timeouts. The Mapper class they demonstrate in that link is quite useful and I've used it successfully to do batch operations on ~1500 objects.
That said, cron should work fine too. You do have a limit on the time of each individual request so you can't just chew through them all at once, but you can use redirection to loop over as many users as you want, processing one user at a time. There should be an example of this in the docs somewhere if you need help with this approach.
I would use a combination of Cron jobs and a looping url fetch method detailed here: http://stage.vambenepe.com/archives/549. In this way you can catch your timeouts and begin another request.
To summarize the article, the cron job calls your initial process; you catch the timeout error and call the process again, masked as a second URL. You have to ping back and forth between the two URLs to keep App Engine from thinking you are in an accidental loop. You also need to be careful that you do not loop infinitely: make sure there is an end state for your updating loop, since it would put you over your quotas pretty quickly if it never ended.
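A rough sketch of that ping-pong pattern with the old webapp framework; the handler paths and the process_some_users() helper are assumptions:

from google.appengine.ext import webapp
from google.appengine.runtime import DeadlineExceededError

def process_some_users():
    # Hypothetical worker: update a slice of users, return True when all are done.
    raise NotImplementedError

class UpdateHandler(webapp.RequestHandler):
    # Assumed to be mapped to both /update/a and /update/b; each unfinished pass
    # or timeout bounces the work to the twin URL so App Engine doesn't see a loop.
    def get(self):
        other = "/update/b" if self.request.path == "/update/a" else "/update/a"
        try:
            if not process_some_users():
                self.redirect(other)      # not finished yet: keep the loop going
        except DeadlineExceededError:
            self.redirect(other)          # timed out: hand off to the other URL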