BOTO distribute scraping tasks among AWS - python

I have 200,000 URLs that I need to scrape from a website. The website has a very strict scraping policy, and you will get blocked if you exceed 10 requests per minute, so I need to control my pace. I am thinking about starting a few AWS instances (say 3) to run in parallel.
In this way, the estimated time to collect all the data will be:
200,000 URLs / (10 URLs/min) = 20,000 min ≈ 13.9 days (one instance only)
20,000 min / 3 ≈ 6,667 min ≈ 4.6 days (three instances)
which is an acceptable amount of time to get my work done.
However, I am thinking about building a framework using boto: I have one piece of scraping code and a queue of input (a list of URLs). At the same time, I don't want to do any damage to their website, so I only want to scrape during the night and on weekends, and all of this should be controlled from one master box.
And the code should look similar to this:
import threading
from queue import Queue

class Worker(threading.Thread):
    def __init__(self, url_queue, results):
        super().__init__()
        self.url_queue = url_queue    # shared queue of URLs to scrape
        self.results = results        # shared list of scraped results

    def run(self):
        while not self.url_queue.empty():
            url = self.url_queue.get()
            aws = AWSInstance()          # placeholder: some handle to an AWS instance
            result = aws.scrape(url)     # placeholder: run the scrape remotely
            self.results.append(result)
            self.url_queue.task_done()

url_queue, results = Queue(), []
workers = [Worker(url_queue, results) for _ in range(3)]
for w in workers:
    w.start()
The code above is still largely pseudocode (AWSInstance is a placeholder); my idea is to pass the actual work to AWS.
Question:
(1) How can I use boto to pass variables/arguments to another AWS instance, start a script there to work on those variables, and then use boto to retrieve the result back to the master box?
(2) What is the best way to schedule a job for a specific time period from inside Python code?
Say, only work from 6:00 pm to 6:00 am every day... I don't think the Linux crontab will fit my need in this situation.
Sorry if my question is more verbally descriptive and philosophical. Even a hint, or just the name of a package/library that meets my need, would be greatly appreciated!

Question: (1) How to use boto to pass the variable/argument to another AWS instance and start a script to work on those variables?
Use a shared data source, such as DynamoDB, or a messaging framework such as SQS.
...and use boto to retrieve the result back to the master box?
Again, a shared data source, or messaging.
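For example, a minimal sketch of the SQS route using boto3 (the current SDK); the queue names, list_of_urls, and the scrape() function are assumptions for illustration:

    import boto3

    sqs = boto3.resource("sqs")
    url_queue = sqs.get_queue_by_name(QueueName="scrape-urls")        # assumed queue name
    result_queue = sqs.get_queue_by_name(QueueName="scrape-results")  # assumed queue name

    # Master box: enqueue the URLs.
    for url in list_of_urls:
        url_queue.send_message(MessageBody=url)

    # Worker instance: pull a URL, scrape it, push the result back.
    for message in url_queue.receive_messages(MaxNumberOfMessages=1, WaitTimeSeconds=20):
        result = scrape(message.body)    # your own scraping function
        result_queue.send_message(MessageBody=result)
        message.delete()                 # delete only after the work succeeded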
(2) What is the best way to schedule a job only on a specific time period inside Python code? Say only work from 6:00 pm to 6:00 am every day... I don't think the Linux crontab will fit my need in this situation.
I think crontab fits well here.
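If you do want the time window enforced from inside Python rather than by cron, a minimal sketch of the 6 pm-6 am (plus weekend) check could look like this; scrape_next_batch() is a placeholder for your own work function:

    from datetime import datetime
    import time

    def in_scraping_window(now=None):
        """True between 18:00 and 06:00 local time, and all day on weekends."""
        now = now or datetime.now()
        return now.weekday() >= 5 or now.hour >= 18 or now.hour < 6

    while True:
        if in_scraping_window():
            scrape_next_batch()   # placeholder: pull a few URLs and scrape them
        else:
            time.sleep(300)       # outside the window: sleep and check again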

Related

Is there a way to execute distributed code in Python on AWS?

I created a scraper in Python that navigates a website. It pulls many links, and then it has to visit every link, pull the data, parse it, and store the result.
Is there an easy way to run that script distributed in the cloud (like AWS)?
Ideally, I would like something like this (it's probably more difficult, but just to give an idea):
run_in_the_cloud --number-of-instances 5 scraper.py
After the process is done, the instances are terminated, so they don't cost any more money.
I remember doing something similar with Hadoop and Java with MapReduce a long time ago.
If you can put your scraper in a docker image it's relatively trivial to run and scale dockerized applications using AWS ECS Fargate. Just create a task definition and point it at your container registry, then submit runTask requests for however many instances you want. AWS Batch is another tool you could use to trivially parallelize container instances too.
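A rough sketch of what that runTask call looks like with boto3; the cluster, task definition, and subnet values are placeholders you would replace with your own:

    import boto3

    ecs = boto3.client("ecs")

    response = ecs.run_task(
        cluster="scraper-cluster",         # placeholder cluster name
        taskDefinition="scraper-task:1",   # placeholder task definition
        launchType="FARGATE",
        count=5,                           # number of parallel task instances
        networkConfiguration={
            "awsvpcConfiguration": {
                "subnets": ["subnet-0123456789abcdef0"],   # placeholder subnet
                "assignPublicIp": "ENABLED",
            }
        },
    )
    print([task["taskArn"] for task in response["tasks"]])

Fargate tasks stop (and stop billing) when the container exits, which covers the "instances are killed when done" requirement.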

Using Django as GUI for long running python process

This is a question about architecture. Say I have a long-running process on a server, such as a machine-learning model in the middle of training. Since this runs on an external machine, I would like a tool to quickly check on the results from time to time. So I thought the best approach would be a website that connects to the process, for example via RPC, to display the results, since that lets me check in whenever I want. Now the question is how the Django view should gather the information from the server process:
1) Using RPC calls such as rpyc directly in the views?
2) Using some kind of message queue such as Celery?
3) Or in a completely different way I am not seeing?
There are at least two possible ways to do this.
Implement your data-refreshing function as a view and hit it with AJAX plus a JavaScript timer. As long as the page containing that JS is open, it will fetch your data silently and update the page. However, this solution does not work well when you need to record all the data at a given frequency; the AJAX call only fires while the web page is open.
Use a message queue like selcuk suggests. Alongside Celery, APScheduler is also a good choice because it's easier to install and use. You can implement a task queue (as a model) with a status field (queued/done/stopped/whatever), check the tasks at the frequency you want, save the data you retrieved, and do all the other stuff.
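For option 1, a minimal sketch of the polling view; the rpyc host/port and the get_status() method exposed by the training process are assumptions:

    # views.py
    import rpyc
    from django.http import JsonResponse

    def training_status(request):
        # Connect to the long-running process's rpyc server (assumed host/port).
        conn = rpyc.connect("training-host", 18861)
        try:
            status = conn.root.get_status()   # assumed method exposed by the service
            return JsonResponse({"status": status})
        finally:
            conn.close()

On the page itself, a small setInterval() that fetches this URL every few seconds is enough to keep the numbers fresh.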

How can I schedule or queue api calls to maintain rate limit?

I am trying to continuously crawl a large amount of information from a site using the REST API they provide. I have the following constraints:
Stay within api limit (5 calls/sec)
Utilising the full limit (making exactly 5 calls per second, i.e. 5*60 = 300 calls per minute)
Each call will be with different parameters (params will be fetched from db or in-memory cache)
Calls will be made from AWS EC2 (or GAE) and processed data will be stored in AWS RDS/DynamoDB
For now I am just using a scheduled task that runs a Python script every minute; the script makes 10-20 API calls, processes the responses, and stores the data in the DB. I want to scale this procedure up (to 5*60 = 300 calls per minute) and make it manageable via code (pushing new tasks, pausing/resuming them easily, monitoring failures, changing call frequency).
My question is- what are the best available tools to achieve this? Any suggestion/guidance/link is appreciated.
I do know the names of some task-queuing frameworks like Celery/RabbitMQ/Redis, but I do not know much about them. However, I am willing to learn one or more of them if they are the best tools to solve my problem; I want to hear from SO veterans before jumping in ☺
Also please let me know if there's any other AWS service I should look to use (SQS or AWS Data Pipeline?) to make any step easier.
You needn't add an external dependency just for rate-limiting, as your use case is rather straightforward.
I can think of two options:
Modify the script (that currently wakes up every minute and makes 10-20 API calls) to wake up every second and make 5 calls (sequentially or in parallel).
In your current design, your API calls might not be properly distributed across 1 minute, i.e. you might be making all your 10-20 calls in the first, say, 20 seconds.
If you change that script to run every second, your API call rate will be more balanced.
Change your Python script to a long running daemon, and use a Rate Limiter library, such as this. You can configure the latter to make 1 call per x seconds.
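If you'd rather not pull in a library at all, here is a plain-Python sketch of that second option, pacing calls to 5 per second inside a long-running loop (get_next_params and make_api_call are placeholders for your own functions):

    import time

    CALLS_PER_SECOND = 5
    INTERVAL = 1.0 / CALLS_PER_SECOND

    def run_forever(get_next_params, make_api_call):
        while True:
            started = time.monotonic()
            make_api_call(get_next_params())
            # Sleep off whatever is left of this call's time slot.
            elapsed = time.monotonic() - started
            if elapsed < INTERVAL:
                time.sleep(INTERVAL - elapsed)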

Should I learn/use MapReduce, or some other type of parallelization for this task?

After talking with a friend of mine from Google, I'd like to implement some kind of Job/Worker model for updating my dataset.
This dataset mirrors a 3rd party service's data, so, to do the update, I need to make several remote calls to their API. I think a lot of time will be spent waiting for responses from this 3rd party service. I'd like to speed things up, and make better use of my compute hours, by parallelizing these requests and keeping many of them open at once, as they wait for their individual responses.
Before I explain my specific dataset and get into the problem, I'd like to clarify what answers I'm looking for:
Is this a flow that would be well suited to parallelizing with MapReduce?
If yes, would this be cost effective to run on Amazon's MapReduce module, which bills by the hour and rounds hours up when the job is complete? (I'm not sure exactly what counts as a "Job", so I don't know exactly how I'll be billed)
If no, is there another system/pattern I should use? And is there a library that will help me do this in Python (on AWS, using EC2 + EBS)?
Are there any problems you see with the way I've designed this job flow?
Ok, now onto the details:
The dataset consists of users who have favorite items and who follow other users. The aim is to be able to update each user's queue -- the list of items the user will see when they load the page, based on the favorite items of the users she follows. But, before I can crunch the data and update a user's queue, I need to make sure I have the most up-to-date data, which is where the API calls come in.
There are two calls I can make:
Get Followed Users -- Which returns all the users being followed by the requested user, and
Get Favorite Items -- Which returns all the favorite items of the requested user.
After I call Get Followed Users for the user being updated, I need to update the favorite items for each user being followed. Only when all of the favorites are returned for all the users being followed can I start processing the queue for that original user. The jobs in this flow include:
Start Updating Queue for user -- kicks off the process by fetching the users followed by the user being updated, storing them, and then creating Get Favorites jobs for each user.
Get Favorites for user -- Requests, and stores, a list of favorites for the specified user, from the 3rd party service.
Calculate New Queue for user -- Processes a new queue, now that all the data has been fetched, and then stores the results in a cache which is used by the application layer.
So, again, my questions are:
Is this a flow that would be well suited to parallelizing with MapReduce? I don't know if it would let me start the process for UserX, fetch all the related data, and come back to processing UserX's queue only after that's all done.
If yes, would this be cost effective to run on Amazon's MapReduce module, which bills by the hour and rounds hours up when the job is complete? Is there a limit on how many "threads" I can have waiting on open API requests if I use their module?
If no, is there another system/pattern I should use? And is there a library that will help me do this in Python (on AWS, using EC2 + EBS)?
Are there any problems you see with the way I've designed this job flow?
Thanks for reading, I'm looking forward to some discussion with you all.
Edit, in response to JimR:
Thanks for a solid reply. In my reading since I wrote the original question, I've leaned away from using MapReduce. I haven't decided for sure yet how I want to build this, but I'm beginning to feel MapReduce is better for distributing / parallelizing computing load when I'm really just looking to parallelize HTTP requests.
What would have been my "reduce" task, the part that takes all the fetched data and crunches it into results, isn't that computationally intensive. I'm pretty sure it's going to wind up being one big SQL query that executes for a second or two per user.
So, what I'm leaning towards is:
A non-MapReduce Job/Worker model, written in Python. A google friend of mine turned me onto learning Python for this, since it's low overhead and scales well.
Using Amazon EC2 as a compute layer. I think this means I also need an EBS slice to store my database.
Possibly using Amazon's Simple Message queue thingy. It sounds like this 3rd amazon widget is designed to keep track of job queues, move results from one task into the inputs of another and gracefully handle failed tasks. It's very cheap. May be worth implementing instead of a custom job-queue system.
The work you describe is probably a good fit for either a queue, or a combination of a queue and job server. It certainly could work as a set of MapReduce steps as well.
For a job server, I recommend looking at Gearman. The documentation isn't awesome, but the presentations do a great job documenting it, and the Python module is fairly self-explanatory too.
Basically, you create functions in the job server, and these functions get called by clients via an API. The functions can be called either synchronously or asynchronously. In your example, you probably want to asynchronously add the "Start update" job. That will do whatever preparatory tasks, and then asynchronously call the "Get followed users" job. That job will fetch the users, and then call the "Update followed users" job. That will submit all the "Get Favourites for UserA" and friend jobs together in one go, and synchronously wait for the result of all of them. When they have all returned, it will call the "Calculate new queue" job.
This job-server-only approach will initially be a bit less robust, since ensuring that you handle errors and any down servers and persistence properly is going to be fun.
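A rough sketch of how that flow maps onto the python-gearman client/worker API; the job names, fetch_followed_users(), and the exact submit_multiple_jobs job format are assumptions based on the 2.x API:

    import gearman

    # Client side: kick off the flow asynchronously.
    client = gearman.GearmanClient(["localhost:4730"])
    client.submit_job("start_update", "UserX", background=True)

    # Worker side: register a handler for each job in the flow.
    def start_update(worker, job):
        followed = fetch_followed_users(job.data)                     # placeholder
        jobs = [dict(task="get_favourites", data=u) for u in followed]
        client.submit_multiple_jobs(jobs, wait_until_complete=True)   # wait for all of them
        client.submit_job("calculate_queue", job.data, background=True)
        return "ok"

    worker = gearman.GearmanWorker(["localhost:4730"])
    worker.register_task("start_update", start_update)
    worker.work()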
For a queue, SQS is an obvious choice. It is rock-solid, and very quick to access from EC2, and cheap. And way easier to set up and maintain than other queues when you're just getting started.
Basically, you will put a message onto the queue, much like you would submit a job to the job server above, except you probably won't do anything synchronously. Instead of making the "Get Favourites For UserA" and so forth calls synchronously, you will make them asynchronously, and then have a message that says to check whether all of them are finished. You'll need some sort of persistence (a SQL database you're familiar with, or Amazon's SimpleDB if you want to go fully AWS) to track whether the work is done - you can't check on the progress of a job in SQS (although you can in other queues). The message that checks whether they are all finished will do the check - if they're not all finished, don't do anything, and then the message will be retried in a few minutes (based on the visibility_timeout). Otherwise, you can put the next message on the queue.
This queue-only approach should be robust, assuming you don't consume queue messages by mistake without doing the work. Making a mistake like that is hard to do with SQS - you really have to try. Don't use auto-consuming queues or protocols - if you error out, you might not be able to ensure that you put a replacement message back on the queue.
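A sketch of that "check whether all of them are finished" message with boto3, leaving the message un-deleted so the visibility timeout re-delivers it; the queue name, message format, and the all_favourites_fetched() persistence check are assumptions:

    import json
    import boto3

    sqs = boto3.resource("sqs")
    queue = sqs.get_queue_by_name(QueueName="update-jobs")   # assumed queue name

    for message in queue.receive_messages(MaxNumberOfMessages=1, WaitTimeSeconds=20):
        body = json.loads(message.body)
        if body["type"] == "check_favourites_done":
            if all_favourites_fetched(body["user"]):         # your persistence check
                queue.send_message(MessageBody=json.dumps(
                    {"type": "calculate_queue", "user": body["user"]}))
                message.delete()
            # else: leave it alone; it reappears after the visibility timeout.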
A combination of queue and job server may be useful in this case. You can get away with not having a persistence store to check job progress - the job server will allow you to track job progress. Your "get favourites for users" message could place all the "get favourites for UserA/B/C" jobs into the job server. Then, put a "check all favourites fetching done" message on the queue with a list of tasks that need to be complete (and enough information to restart any jobs that mysteriously disappear).
For bonus points:
Doing this as a MapReduce should be fairly easy.
Your first job's input will be a list of all your users. The map will take each user, get the followed users, and output lines for each user and their followed user:
"UserX" "UserA"
"UserX" "UserB"
"UserX" "UserC"
An identity reduce step will leave this unchanged. This will form the second job's input. The map for the second job will get the favourites for each line (you may want to use memcached to prevent fetching favourites for UserX/UserA combo and UserY/UserA via the API), and output a line for each favourite:
"UserX" "UserA" "Favourite1"
"UserX" "UserA" "Favourite2"
"UserX" "UserA" "Favourite3"
"UserX" "UserB" "Favourite4"
The reduce step for this job will convert this to:
"UserX" [("UserA", "Favourite1"), ("UserA", "Favourite2"), ("UserA", "Favourite3"), ("UserB", "Favourite4")]
At this point, you might have another MapReduce job to update your database for each user with these values, or you might be able to use some of the Hadoop-related tools like Pig, Hive, and HBase to manage your database for you.
I'd recommend using Cloudera's Distribution for Hadoop's ec2 management commands to create and tear down your Hadoop cluster on EC2 (their AMIs have Python set up on them), and use something like Dumbo (on PyPI) to create your MapReduce jobs, since it allows you to test your MapReduce jobs on your local/dev machine without access to Hadoop.
Good luck!
Seems that we're going with Node.js and the Seq flow-control library. It was very easy to move from my map/flowchart of the process to a stub of the code, and now it's just a matter of filling out the code to hook into the right APIs.
Thanks for the answers, they were a lot of help finding the solution I was looking for.
I am working on a similar problem that I need to solve. I was also looking at MapReduce and at using the Elastic MapReduce service from Amazon.
I'm pretty convinced MapReduce will work for this problem. The implementation is where I'm getting hung up, because I'm not sure my reducer even needs to do anything.
I'll answer your questions as I understand your (and my) problem, and hopefully it helps.
Yes, I think it'll be well suited. You could look at leveraging the Elastic MapReduce service's multiple-steps option. You could use one step to fetch the people a user is following, and another step to compile a list of tracks for each of those followers, and the reducer for that second step would probably be the one to build the cache.
Depends on how big your data-set is and how often you'll be running it. It's hard to say without knowing how big the data-set is (or is going to get) if it'll be cost effective or not. Initially, it'll probably be quite cost-effective, as you won't have to manage your own hadoop cluster, nor have to pay for EC2 instances (assuming that's what you use) to be up all the time. Once you reach the point where you're actually crunching this data for a long period of time, it probably will make less and less sense to use Amazon's MapReduce service, because you'll constantly have nodes online all the time.
A job is basically your MapReduce task. It can consist of multiple steps (each MapReduce task is a step). Once your data has been processed and all steps have been completed, your job is done. So you're effectively paying for CPU time for each node in the Hadoop cluster: the cost is proportional to T*n, where T is the time (in hours) it takes to process your data, and n is the number of nodes you tell Amazon to spin up.
I hope this helps, good luck. I'd like to hear how you end up implementing your Mappers and Reducers, as I'm solving a very similar problem and I'm not sure my approach is really the best.

Google App Engine - design considerations about cron tasks

I'm developing software using the Google App Engine.
I have some considerations about the optimal design regarding the following issue: I need to create and save snapshots of some entities at regular intervals.
In the conventional relational db world, I would create db jobs which would insert new summary records.
For example, a job would insert a record for every active user that would contain his current score to the "userrank" table, say, every hour.
I'd like to know what's the best method to achieve this in Google App Engine. I know that there is the Cron service, but does it allow us to execute jobs which will insert/update thousands of records?
I think you'll find that snapshotting every user's state every hour isn't something that will scale well no matter what your framework. A more ordinary environment will disguise this by letting you have longer running tasks, but you'll still reach the point where it's not practical to take a snapshot of every user's data, every hour.
My suggestion would be this: add a 'last snapshot' field, and override the put() method of your model (assuming you're using Python; the same is possible in Java, but I don't know the syntax), so that whenever you update a record, it checks whether it's been more than an hour since the last snapshot, and if so, creates and writes a snapshot record.
In order to prevent concurrent updates creating two identical snapshots, you'll want to give the snapshots a key name derived from the time at which the snapshot was taken. That way, if two concurrent updates try to write a snapshot, one will harmlessly overwrite the other.
To get the snapshot for a given hour, simply query for the oldest snapshot newer than the requested period. As an added bonus, since inactive records aren't snapshotted, you're saving a lot of space, too.
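A rough sketch of that idea with the ndb Python API; the model names, properties, and the hour-based key name are placeholders (and it assumes the entity already has a key):

    from datetime import datetime, timedelta
    from google.appengine.ext import ndb

    class UserRankSnapshot(ndb.Model):
        score = ndb.IntegerProperty()
        taken_at = ndb.DateTimeProperty(auto_now_add=True)

    class UserRank(ndb.Model):
        score = ndb.IntegerProperty()
        last_snapshot = ndb.DateTimeProperty()

        def put(self, **kwargs):
            now = datetime.utcnow()
            if self.last_snapshot is None or now - self.last_snapshot > timedelta(hours=1):
                # Key name derived from the hour, so concurrent writes collide harmlessly.
                key_name = "%s-%s" % (self.key.id(), now.strftime("%Y%m%d%H"))
                UserRankSnapshot(id=key_name, score=self.score).put()
                self.last_snapshot = now
            return super(UserRank, self).put(**kwargs)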
Have you considered using the remote api instead? This way you could get a shell to your datastore and avoid the timeouts. The Mapper class they demonstrate in that link is quite useful and I've used it successfully to do batch operations on ~1500 objects.
That said, cron should work fine too. You do have a limit on the time of each individual request so you can't just chew through them all at once, but you can use redirection to loop over as many users as you want, processing one user at a time. There should be an example of this in the docs somewhere if you need help with this approach.
I would use a combination of Cron jobs and a looping url fetch method detailed here: http://stage.vambenepe.com/archives/549. In this way you can catch your timeouts and begin another request.
To summarize the article, the cron job calls your initial process; you catch the timeout error and call the process again, masked as a second URL. You have to ping-pong between two URLs to keep App Engine from thinking you are in an accidental loop. You also need to be careful that you do not loop infinitely: make sure that there is an end state for your updating loop, since it would put you over your quotas pretty quickly if it never ended.
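A very rough sketch of that catch-and-redirect pattern with webapp2; the handler paths, the cursor parameter, and process_some_users() are placeholders:

    import webapp2
    from google.appengine.runtime import DeadlineExceededError

    class UpdateHandler(webapp2.RequestHandler):
        def get(self):
            cursor = self.request.get("cursor") or None
            try:
                cursor = process_some_users(cursor)   # placeholder: do a chunk of work
                if cursor:                            # more work left: bounce to the twin URL
                    self.redirect("/update_b?cursor=%s" % cursor)
            except DeadlineExceededError:
                self.redirect("/update_b?cursor=%s" % cursor)

    app = webapp2.WSGIApplication([("/update", UpdateHandler)])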
