I'm developing software using the Google App Engine.
I have some considerations about the optimal design regarding the following issue: I need to create and save snapshots of some entities at regular intervals.
In the conventional relational db world, I would create db jobs which would insert new summary records.
For example, a job would insert a record for every active user that would contain his current score to the "userrank" table, say, every hour.
I'd like to know what's the best method to achieve this in Google App Engine. I know that there is the Cron service, but does it allow us to execute jobs which will insert/update thousands of records?
I think you'll find that snapshotting every user's state every hour isn't something that will scale well no matter what your framework. A more ordinary environment will disguise this by letting you have longer running tasks, but you'll still reach the point where it's not practical to take a snapshot of every user's data, every hour.
My suggestion would be this: Add a 'last snapshot' field, and subclass the put() function of your model (assuming you're using Python; the same is possible in Java, but I don't know the syntax), such that whenever you update a record, it checks if it's been more than an hour since the last snapshot, and if so, creates and writes a snapshot record.
In order to prevent concurrent updates creating two identical snapshots, you'll want to give the snapshots a key name derived from the time at which the snapshot was taken. That way, if two concurrent updates try to write a snapshot, one will harmlessly overwrite the other.
To get the snapshot for a given hour, simply query for the oldest snapshot newer than the requested period. As an added bonus, since inactive records aren't snapshotted, you're saving a lot of space, too.
Have you considered using the remote api instead? This way you could get a shell to your datastore and avoid the timeouts. The Mapper class they demonstrate in that link is quite useful and I've used it successfully to do batch operations on ~1500 objects.
That said, cron should work fine too. You do have a limit on the time of each individual request so you can't just chew through them all at once, but you can use redirection to loop over as many users as you want, processing one user at a time. There should be an example of this in the docs somewhere if you need help with this approach.
I would use a combination of Cron jobs and a looping url fetch method detailed here: http://stage.vambenepe.com/archives/549. In this way you can catch your timeouts and begin another request.
To summarize the article, the cron job calls your initial process, you catch the timeout error and call the process again, masked as a second url. You have to ping between two URLs to keep app engine from thinking you are in a accidental loop. You also need to be careful that you do not loop infinitely. Make sure that there is an end state for your updating loop, since this would put you over your quotas pretty quickly if it never ended.
Related
I am using MySQL database via python for storing logs.
I was wondering if there is any efficient way to remove the oldest row if the num of rows exceeds the limit.
I was able to do this by executing a query to find total rows and then delete the older ones by arranging them in ascending and deleting. But this method is taking too much time. Is there a way to make this efficient by making a rule while creating a table, so that MySQL itself takes care if the limit exceeds?
Thanks in advance.
Well, there's no simple and built-in way to do this in MySQL.
Solutions that use triggers to delete old rows when you insert a new row are risky, because the trigger might fail. Or the transaction that spawned the trigger might be rolled back. In either of these cases, your intended deletion will not happen.
Also putting the burden of deleting on the thread that inserts new data causes extra work for the insert request, and usually we'd prefer not to make things slower for our current users.
It's more common to run an asynchronous job periodically to delete older data. This can be scheduled to run at off-hours, and run in batches. It also gives more flexibility to archive old data, or execute retries if the deletion or archiving fails or is interrupted.
MySQL does support an EVENT system, so you can run a stored routine based on a schedule. But you can only do tasks you can do in a stored routine, and it's not easy to make it do retries, or archive to any external system (e.g. cloud archive), or notify you when it's done.
Sorry there is no simple solution. There are just too many variations on how people would like it to work, and too many edge cases of potential failure.
The way I'd implement this is to use cron or else a timer thread in my web service to check the database, say once per hour. If it finds the number of rows is greater than the limit, it deletes the oldest rows in modestly sized batches (e.g. 1000 rows at a time) until the count is under the threshold.
I like to write scheduled jobs in a way that can be easily controlled and monitored. So I can make it run immediately if I want, and I can disable or resume the schedule if I want, and I can view a progress report about how much it deleted the last time it ran, and how long until the next time it runs, etc.
How can I refresh my python file in every five minutes in django while running because every hour the data I'm webscraping is changing and I need to change the value of the variable?
You need is a task that runs periodically and a cron job solve that. I recomment to you take a look to django cron or Celerity, both are excelent options to create scheduled tasks.
I recommend using database like sqlite3 is better solution than restart django(web service) per hours.
You can store some datas in database and django can get them like using variables.
The real problem here is that you're fetching the data at the start of the application and keeping it in memory. I see 2 possible better methods:
move the data scraping code to your view function. This means you'll re-scrape on every call ensuring you'll always have the freshest data but at the cost of speed (the time it takes to make the request to your target url).
better yet: same as above except you cache the results locally. This could also be kept in memory (although I'd use a file or database if you're running multiple django app instances to ensure they're all using the same data). With the least amount of change to what you have already, an in memory cache could be achieved with simply adding a timestamp variable that gets the current time on each fetch. If last fetch was more than X minutes ago: refetch your data.
Background
I have some dags that pull data from an 3rd-party api.
The accounts we need to pull can change over time. To determine which accounts to pull, depending on the process we may need to query a database or make an HTTP request.
Before airflow, we would just get the account list at the start of the python script. Then we would iterate through the account list and pull each account to file or whatever it was we needed to do.
But now, using airflow, it makes sense to define tasks at the account level and let airflow handle retry functionality and date range and parallel execution etc.
Thus my dag might look something like this:
Problem
Since each account is a task, the account list needs to be accessed with every dag parse. But since dag files are parsed frequently, you don't necessarily want to query the database or wait for a REST call with every dag parse from every machine all day long. This could be resource intensive, and could cost money.
Question
Is there a good way to cache this type of config information in a local file, ideally with a specified time-to-live?
Thoughts
I have thought about a couple different approaches:
write to csv or pickle file and use mtime to expire.
the concern with this is that i might get collisions if two processes try to expire the file at the same time. i don't know how likely this is or what the consequences would be but probably nothing terrible.
create a common sqlite DB for all such processes. should be auto created first time a variable is accessed. each config variable gets a row in table. use last_modified_datetime column to tell when to expire.
requires more elaborate code & dependencies.
use airflow variables
nice thing about this would be that it uses existing DB, so would be no $ per query and reasonable network lag, but it still requires network round trip.
has benefit of being identical across all nodes in a multi-node setup.
determining when to expire would probably be problematic so would probably create config manager dag to update the config variables periodically.
but then this would add complexity to deployment and devolpment process -- the variables need to be populated in order to define the DAGs properly -- all developers would need to manage this locally too, as opposed to a more create-on-read cacheing approach.
Subdags?
never used them, but I have a suspicion they could be used here. But the community seems to discourage their use anyway...
Have you dealt with this problem? Did you arrive at a good solution? None of these seems very good.
Airflow default DAG parsing interval is pretty forgiving: 5 minutes. But even that is quite a lot for most people, so it's quite reasonable to increase that if your deployment isn't too close to the due times for the new DAGs.
In general, I'd say it's not that bad to make a REST request at every DAG parse heartbeat. Also, nowadays the scheduling process is decoupled from the parsing process, so that won't affect how fast your tasks are scheduled. Airflow caches the DAG definition for you.
If you think you still have reasons to put your own cache on top of that, my suggestion is to cache at the definitions server, not on the Airflow side. For example, using cache headers on the REST endpoint and handling cache invalidation yourself when you need it. But that could be some premature optimization, so my advice is to start without it and implement it only if you measure convincing evidence that you need it.
EDIT: regarding Webserver and Worker
It's true that the Webserver will trigger DAG Parses as well, not sure about how frequent. Probably following the guicorn workers refresh interval (which is 30 seconds by default). Workers will do it also by default at the start of every task, but that can be saved if you activate pickling DAGs. Not sure if that's a good idea though, I've heard this is something destined to be deprecated.
One other thing you can try to do is to cache that in the Airflow process itself, memoizing the function that makes the expensive request. Python has a built-in functools for that (lru_cache) and together with pickling it might be enough and very very much easier than the other options.
I have the same exact scenario.
Have API call for multiple accounts. Initially created a python script to iterate the list.
When I started using Airflow thought about what you are planning to do. Tried 2 of the alternatives you listed. After some experimentation decided to handle retry logic within python with simple try-except blocks if HTTP calls fail. Reasons are
One script to maintain
Less Airflow objects
Restartability is easier with one script in place.
(restarting failed job in Airflow is not a breeze (no pun intended))
At the end it's up to you, that was my experience.
I have an "analytics dashboard" screen that is visible to my django web applications users that takes a really long time to calculate. It's one of these screens that goes through every single transaction in the database for a user and gives them metrics on it.
I would love for this to be a realtime operation, but calculation times can be 20-30 seconds for an active user (no paging allowed, it's giving averages on transactions.)
The solution that comes to mind is to calculate this in the backend via a manage.py batch command and then just display cached values to the user. Is there a Django design pattern to help facilitate these types of models/displays?
What you're looking for is a combination of offline processing and caching. By offline, I mean that the computation logic happens outside the request-response cycle. By caching, I mean that the result of your expensive calculation is sufficiently valid for X time, during which you do not need to recalculate it for display. This is a very common pattern.
Offline Processing
There are two widely-used approaches to work which needs to happen outside the request-response cycle:
Cron jobs (often made easier via a custom management command)
Celery
In relative terms, cron is simpler to setup, and Celery is more powerful/flexible. That being said, Celery enjoys fantastic documentation and a comprehensive test suite. I've used it in production on almost every project, and while it does involve some requirements, it's not really a bear to setup.
Cron
Cron jobs are the time-honored method. If all you need is to run some logic and store some result in the database, a cron job has zero dependencies. The only fiddly bits with cron jobs is getting your code to run in the context of your django project -- that is, your code must correctly load your settings.py in order to know about your database and apps. For the uninitiated, this can lead to some aggravation in divining the proper PYTHONPATH and such.
If you're going the cron route, a good approach is to write a custom management command. You'll have an easy time testing your command from the terminal (and writing tests), and you won't need to do any special hoopla at the top of your management command to setup a proper django environment. In production, you simply run path/to/manage.py yourcommand. I'm not sure if this approach works without the assistance of virtualenv, which you really ought to be using regardless.
Another aspect to consider with cronjobs: if your logic takes a variable amount of time to run, cron is ignorant of the matter. A cute way to kill your server is to run a two-hour cronjob like this every hour. You can roll your own locking mechanism to prevent this, just be aware of this—what starts out as a short cronjob might not stay that way when your data grows, or when your RDBMS misbehaves, etc etc.
In your case, it sounds like cron is less applicable because you'd need to calculate the graphs for every user every so often, without regards to who is actually using the system. This is where celery can help.
Celery
…is the bee's knees. Usually people are scared off by the "default" requirement of an AMQP broker. It's not terribly onerous setting up RabbitMQ, but it does require stepping outside of the comfortable world of Python a bit. For many tasks, I just use redis as my task store for Celery. The settings are straightforward:
CELERY_RESULT_BACKEND = "redis"
REDIS_HOST = "localhost"
REDIS_PORT = 6379
REDIS_DB = 0
REDIS_CONNECT_RETRY = True
Voilá, no need for an AMQP broker.
Celery provides a wealth of advantages over simple cron jobs. Like cron, you can schedule periodic tasks, but you can also fire off tasks in response to other stimuli without holding up the request/response cycle.
If you don't want to compute the chart for every active user every so often, you will need to generate it on-demand. I'm assuming that querying for the latest available averages is cheap, computing new averages is expensive, and you're generating the actual charts client-side using something like flot. Here's an example flow:
User requests a page which contains an averages chart.
Check cache -- is there a stored, nonexpired queryset containing averages for this user?
If yes, use that.
If not, fire off a celery task to recalculate it, requery and cache the result. Since querying existing data is cheap, run the query if you want to show stale data to the user in the meantime.
If the chart is stale. optionally provide some indication that the chart is stale, or do some ajax fanciness to ping django every so often and ask if the refreshed chart is ready.
You could combine this with a periodic task to recalculate the chart every hour for users that have an active session, to prevent really stale charts from being displayed. This isn't the only way to skin the cat, but it provides you with all the control you need to ensure freshness while throttling CPU load of the calculation task. Best of all, the periodic task and the "on demand" task share the same logic—you define the task once and call it from both places for added DRYness.
Caching
The Django cache framework provides you with all the hooks you need to cache whatever you want for as long as you want. Most production sites rely on memcached as their cache backend, I've lately started using redis with the django-redis-cache backend instead, but I'm not sure I'd trust it for a major production site yet.
Here's some code showing off usage of the low-level caching API to accomplish the workflow laid out above:
import pickle
from django.core.cache import cache
from django.shortcuts import render
from mytasks import calculate_stuff
from celery.task import task
#task
def calculate_stuff(user_id):
# ... do your work to update the averages ...
# now pull the latest series
averages = TransactionAverage.objects.filter(user=user_id, ...)
# cache the pickled result for ten minutes
cache.set("averages_%s" % user_id, pickle.dumps(averages), 60*10)
def myview(request, user_id):
ctx = {}
cached = cache.get("averages_%s" % user_id, None)
if cached:
averages = pickle.loads(cached) # use the cached queryset
else:
# fetch the latest available data for now, same as in the task
averages = TransactionAverage.objects.filter(user=user_id, ...)
# fire off the celery task to update the information in the background
calculate_stuff.delay(user_id) # doesn't happen in-process.
ctx['stale_chart'] = True # display a warning, if you like
ctx['averages'] = averages
# ... do your other work ...
render(request, 'my_template.html', ctx)
Edit: worth noting that pickling a queryset loads the entire queryset into memory. If you're pulling up a lot of data with your averages queryset this could be suboptimal. Testing with real-world data would be wise in any case.
Simplest and IMO correct solution for such scenarios is to pre-calculate everything as things are updated, so that when user sees dashboard you calculate nothing but just display already calculated values.
There can be various ways to do that, but generic concept is to trigger a calculate function in background when something on which calculation depends changes.
For triggering such calculation in background I usually use celery, so suppose user adds a item foo in view view_foo, we call a celery task update_foo_count which will be run in background and will update foo count, alternatively you can have a celery timer which will update count say every 10 minutes by checking if re-calculation need to be done, recalculate flag can be set at various places where user updates data.
You need to have a look at Django’s cache framework.
If the data that is slow to compute can be denormalised and stored when data is added, rather than when it is viewed, then you may be interested in django-denorm.
After talking with a friend of mine from Google, I'd like to implement some kind of Job/Worker model for updating my dataset.
This dataset mirrors a 3rd party service's data, so, to do the update, I need to make several remote calls to their API. I think a lot of time will be spent waiting for responses from this 3rd party service. I'd like to speed things up, and make better use of my compute hours, by parallelizing these requests and keeping many of them open at once, as they wait for their individual responses.
Before I explain my specific dataset and get into the problem, I'd like to clarify what answers I'm looking for:
Is this a flow that would be well suited to parallelizing with MapReduce?
If yes, would this be cost effective to run on Amazon's mapreduce module, which bills by the hour, and rounds hour's up when the job is complete? (I'm not sure exactly what counts as a "Job", so I don't know exactly how I'll be billed)
If no, Is there another system/pattern I should use? and Is there a library that will help me do this in python (On AWS, usign EC2 + EBS)?
Are there any problems you see with the way I've designed this job flow?
Ok, now onto the details:
The dataset consists of users who have favorite items and who follow other users. The aim is to be able to update each user's queue -- the list of items the user will see when they load the page, based on the favorite items of the users she follows. But, before I can crunch the data and update a user's queue, I need to make sure I have the most up-to-date data, which is where the API calls come in.
There are two calls I can make:
Get Followed Users -- Which returns all the users being followed by the requested user, and
Get Favorite Items -- Which returns all the favorite items of the requested user.
After I call get followed users for the user being updated, I need to update the favorite items for each user being followed. Only when all of the favorites are returned for all the users being followed can I start processing the queue for that original user. This flow looks like:
Jobs in this flow include:
Start Updating Queue for user -- kicks off the process by fetching the users followed by the user being updated, storing them, and then creating Get Favorites jobs for each user.
Get Favorites for user -- Requests, and stores, a list of favorites for the specified user, from the 3rd party service.
Calculate New Queue for user -- Processes a new queue, now that all the data has been fetched, and then stores the results in a cache which is used by the application layer.
So, again, my questions are:
Is this a flow that would be well suited to parallelizing with MapReduce? I don't know if it would let me start the process for UserX, fetch all the related data, and come back to processing UserX's queue only after that's all done.
If yes, would this be cost effective to run on Amazon's mapreduce module, which bills by the hour, and rounds hour's up when the job is complete? Is there a limit on how many "threads" I can have waiting on open API requests if I use their module?
If no, Is there another system/pattern I should use? and Is there a library that will help me do this in python (On AWS, usign EC2 + EBS?)?
Are there any problems you see with the way I've designed this job flow?
Thanks for reading, I'm looking forward to some discussion with you all.
Edit, in response to JimR:
Thanks for a solid reply. In my reading since I wrote the original question, I've leaned away from using MapReduce. I haven't decided for sure yet how I want to build this, but I'm beginning to feel MapReduce is better for distributing / parallelizing computing load when I'm really just looking to parallelize HTTP requests.
What would have been my "reduce" task, the part that takes all the fetched data and crunches it into results, isn't that computationally intensive. I'm pretty sure it's going to wind up being one big SQL query that executes for a second or two per user.
So, what I'm leaning towards is:
A non-MapReduce Job/Worker model, written in Python. A google friend of mine turned me onto learning Python for this, since it's low overhead and scales well.
Using Amazon EC2 as a compute layer. I think this means I also need an EBS slice to store my database.
Possibly using Amazon's Simple Message queue thingy. It sounds like this 3rd amazon widget is designed to keep track of job queues, move results from one task into the inputs of another and gracefully handle failed tasks. It's very cheap. May be worth implementing instead of a custom job-queue system.
The work you describe is probably a good fit for either a queue, or a combination of a queue and job server. It certainly could work as a set of MapReduce steps as well.
For a job server, I recommend looking at Gearman. The documentation isn't awesome, but the presentations do a great job documenting it, and the Python module is fairly self-explanatory too.
Basically, you create functions in the job server, and these functions get called by clients via an API. The functions can be called either synchronously or asynchronously. In your example, you probably want to asynchronously add the "Start update" job. That will do whatever preparatory tasks, and then asynchronously call the "Get followed users" job. That job will fetch the users, and then call the "Update followed users" job. That will submit all the "Get Favourites for UserA" and friend jobs together in one go, and synchronously wait for the result of all of them. When they have all returned, it will call the "Calculate new queue" job.
This job-server-only approach will initially be a bit less robust, since ensuring that you handle errors and any down servers and persistence properly is going to be fun.
For a queue, SQS is an obvious choice. It is rock-solid, and very quick to access from EC2, and cheap. And way easier to set up and maintain than other queues when you're just getting started.
Basically, you will put a message onto the queue, much like you would submit a job to the job server above, except you probably won't do anything synchronously. Instead of making the "Get Favourites For UserA" and so forth calls synchronously, you will make them asynchronously, and then have a message that says to check whether all of them are finished. You'll need some sort of persistence (a SQL database you're familiar with, or Amazon's SimpleDB if you want to go fully AWS) to track whether the work is done - you can't check on the progress of a job in SQS (although you can in other queues). The message that checks whether they are all finished will do the check - if they're not all finished, don't do anything, and then the message will be retried in a few minutes (based on the visibility_timeout). Otherwise, you can put the next message on the queue.
This queue-only approach should be robust, assuming you don't consume queue messages by mistake without doing the work. Making a mistake like that is hard to do with SQS - you really have to try. Don't use auto-consuming queues or protocols - if you error out, you might not be able to ensure that you put a replacement message back on the queue.
A combination of queue and job server may be useful in this case. You can get away with not having a persistence store to check job progress - the job server will allow you to track job progress. Your "get favourites for users" message could place all the "get favourites for UserA/B/C" jobs into the job server. Then, put a "check all favourites fetching done" message on the queue with a list of tasks that need to be complete (and enough information to restart any jobs that mysteriously disappear).
For bonus points:
Doing this as a MapReduce should be fairly easy.
Your first job's input will be a list of all your users. The map will take each user, get the followed users, and output lines for each user and their followed user:
"UserX" "UserA"
"UserX" "UserB"
"UserX" "UserC"
An identity reduce step will leave this unchanged. This will form the second job's input. The map for the second job will get the favourites for each line (you may want to use memcached to prevent fetching favourites for UserX/UserA combo and UserY/UserA via the API), and output a line for each favourite:
"UserX" "UserA" "Favourite1"
"UserX" "UserA" "Favourite2"
"UserX" "UserA" "Favourite3"
"UserX" "UserB" "Favourite4"
The reduce step for this job will convert this to:
"UserX" [("UserA", "Favourite1"), ("UserA", "Favourite2"), ("UserA", "Favourite3"), ("UserB", "Favourite4")]
At this point, you might have another MapReduce job to update your database for each user with these values, or you might be able to use some of the Hadoop-related tools like Pig, Hive, and HBase to manage your database for you.
I'd recommend using Cloudera's Distribution for Hadoop's ec2 management commands to create and tear down your Hadoop cluster on EC2 (their AMIs have Python set up on them), and use something like Dumbo (on PyPI) to create your MapReduce jobs, since it allows you to test your MapReduce jobs on your local/dev machine without access to Hadoop.
Good luck!
Seems that we're going with Node.js and the Seq flow control library. It was very easy to move from my map/flowchart of the process to a stubb of the code, and now it's just a matter of filling out the code to hook into the right APIs.
Thanks for the answers, they were a lot of help finding the solution I was looking for.
I am working with a similar problem that i need to solve. I was also looking at MapReduce and using the Elastic MapReduce service from Amazon.
I'm pretty convinced MapReduce will work for this problem. The implementation is where I'm getting hung up, becauase I'm not sure my reducer even needs to do anything.
I'll answer your questions as I understand your (and my) problem, and hopefully it helps.
Yes I think it'll be suited well. You could look at leveraging the Elastic MapReduce service's multiple steps option. You could use 1 Step to fetch a the people a user is following, and another step to compile a list of tracks for each of those followers, and the reducer for that 2nd step would probably be the one to build the cache.
Depends on how big your data-set is and how often you'll be running it. It's hard to say without knowing how big the data-set is (or is going to get) if it'll be cost effective or not. Initially, it'll probably be quite cost-effective, as you won't have to manage your own hadoop cluster, nor have to pay for EC2 instances (assuming that's what you use) to be up all the time. Once you reach the point where you're actually crunching this data for a long period of time, it probably will make less and less sense to use Amazon's MapReduce service, because you'll constantly have nodes online all the time.
A job is basically your MapReduce task. It can consist of multiple steps (each MapReduce task is a step). Once your data has been processed and all steps have been completed, your Job is done. So you're effectively paying for CPU time for each node in the Hadoop cluster. so, T*n where T is the Time (in hours) it takes to process your data, and n is the number of nodes you tell Amazon to spin up.
I hope this helps, good luck. I'd like to hear how you end up implementing your Mappers and Reducers, as I'm solving a very similar problem and I'm not sure my approach is really the best.