Running dozens of Scrapy spiders in a controlled manner

I'm trying to build a system to run a few dozen Scrapy spiders, save the results to S3, and let me know when it finishes. There are several similar questions on StackOverflow (e.g. this one and this other one), but they all seem to use the same recommendation (from the Scrapy docs): set up a CrawlerProcess, add the spiders to it, and hit start().
When I tried this method with all 325 of my spiders, though, it eventually locked up and failed because it tried to open too many file descriptors on the machine running it. I've tried a few things that haven't worked.
What is the recommended way to run a large number of spiders with Scrapy?
Edited to add: I understand I can scale up to multiple machines and pay for services to help coordinate (e.g. ScrapingHub), but I'd prefer to run this on one machine using some sort of process pool + queue so that only a small fixed number of spiders are ever running at the same time.

The simplest way to do this is to run them all from the command line. For example:
$ scrapy list | xargs -P 4 -n 1 scrapy crawl
This runs all your spiders, with up to 4 running in parallel at any time. You can then send a notification from a script once the command has completed.
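If you would rather keep the pool and the notification in one Python script, the same idea can be sketched with the standard library. This is a minimal sketch, not something Scrapy provides: `run_spiders`, `crawl_command`, `main`, and `max_parallel` are hypothetical names, and it assumes `scrapy list` and `scrapy crawl <name>` work from the current directory.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def crawl_command(spider_name):
    """Build the command line for one spider (assumes `scrapy` is on PATH)."""
    return ["scrapy", "crawl", spider_name]


def run_spiders(spider_names, max_parallel=4, build_command=crawl_command):
    """Run one subprocess per spider, at most `max_parallel` at a time.

    Returns a dict mapping spider name to its process exit code.
    """
    def run_one(name):
        # Each worker thread simply blocks on its own child process.
        completed = subprocess.run(build_command(name))
        return name, completed.returncode

    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return dict(pool.map(run_one, spider_names))


def main():
    """Crawl every spider reported by `scrapy list`, then report the outcome."""
    listing = subprocess.run(
        ["scrapy", "list"], capture_output=True, text=True, check=True
    )
    results = run_spiders(listing.stdout.split())
    failed = [name for name, code in results.items() if code != 0]
    # Swap this print for whatever notification you use (email, chat hook).
    print(f"{len(results)} spiders finished, {len(failed)} failed: {failed}")
```

Calling `main()` from the project directory runs everything. Each spider still gets its own process, so one crash or leak cannot take the others down.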
A more robust option is to use scrapyd. This comes with an API, a minimal web interface, etc. It will also queue the crawls and only run a certain (configurable) number at once. You can interact with it via the API to start your spiders and send notifications once they are all complete.
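As a rough sketch of what talking to scrapyd looks like, the snippet below queues a crawl through its `schedule.json` endpoint and polls `listjobs.json` to see whether anything is still pending or running. The base URL, the project name you pass in, and the helper names (`schedule_request`, `schedule`, `all_done`) are assumptions for illustration. It also assumes the project has already been deployed to scrapyd.

```python
import json
import urllib.parse
import urllib.request

SCRAPYD_URL = "http://localhost:6800"  # assumption: scrapyd's default address


def schedule_request(project, spider, base_url=SCRAPYD_URL):
    """Build (but do not send) the POST request for scrapyd's schedule.json."""
    data = urllib.parse.urlencode({"project": project, "spider": spider}).encode()
    return urllib.request.Request(f"{base_url}/schedule.json", data=data, method="POST")


def schedule(project, spider, base_url=SCRAPYD_URL):
    """Queue one crawl and return the job id reported by scrapyd."""
    with urllib.request.urlopen(schedule_request(project, spider, base_url)) as resp:
        return json.load(resp)["jobid"]


def all_done(project, base_url=SCRAPYD_URL):
    """True when listjobs.json reports no pending and no running jobs."""
    query = urllib.parse.urlencode({"project": project})
    with urllib.request.urlopen(f"{base_url}/listjobs.json?{query}") as resp:
        jobs = json.load(resp)
    return not jobs["pending"] and not jobs["running"]
```

A driver script would call `schedule()` once per spider, then poll `all_done()` every minute or so and send the notification when it returns true.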
Scrapy Cloud is a perfect fit for this [disclaimer: I work for Scrapinghub]. It lets you run only a certain number of jobs at once and has a queue of pending jobs (which you can modify, browse online, prioritize, etc.) and a more complete API than scrapyd.
You shouldn't run all your spiders in a single process. It will probably be slower, can introduce unforeseen bugs, and you may hit resource limits (like you did). If you run them separately using any of the options above, just run enough to max out your hardware resources (usually CPU/network). If you still get problems with file descriptors at that point you should increase the limit.
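On Linux and other Unix systems, the per-process limit on open files can be inspected and raised from Python with the standard `resource` module (the shell equivalent is `ulimit -n`). The sketch below is one way to do it. `raise_open_file_limit` and the `target` value are hypothetical, and the soft limit can only be raised as far as the hard limit allows without extra privileges.

```python
import resource  # Unix-only standard library module


def raise_open_file_limit(target=4096):
    """Raise the soft RLIMIT_NOFILE toward `target`, capped by the hard limit.

    Returns the (soft, hard) pair in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    soft_is_unlimited = soft == resource.RLIM_INFINITY
    if not soft_is_unlimited and soft < target:
        # Only the soft limit changes; the hard limit is left as it was.
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)
```

Child processes started afterwards inherit the new limit, so calling this once at the top of a launcher script also covers the crawls it spawns.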

it eventually locks up and fails because it attempts to open too many file descriptors on the system that runs it
That's probably a sign that you need multiple machines to execute your spiders: a scalability issue. You can also scale vertically by making your single machine more powerful, but that would hit a limit much sooner:
Difference between scaling horizontally and vertically for databases
Check out the Distributed Crawling documentation and the scrapyd project.
There is also a cloud-based distributed crawling service called ScrapingHub which would take away the scalability problems from you altogether (note that I am not advertising them as I have no affiliation to the company).

One solution, if the information is relatively static (based on your mention of the process "finishing"), is to simply set up a script to run the crawls sequentially or in batches. Wait for one to finish before starting the next one (or ten, or whatever the batch size is).
Another thing to consider if you're only using one machine and this error is cropping up: having too many files open isn't really a resource bottleneck. You might be better off running fewer spiders at a time and letting each one make 200 or so concurrent requests (Scrapy is asynchronous, so this is a concurrency setting rather than a thread count), so that network I/O (typically, though sometimes CPU) becomes the bottleneck. Each spider will finish faster on average than with your current solution, which executes them all at once and hits a "maximum file descriptor" limit rather than an actual resource limit.
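In Scrapy, per-spider concurrency is controlled through settings rather than a thread count. A hedged example of what such overrides might look like is below. The keys are standard Scrapy settings, but the dictionary name and the values are illustrative assumptions to be tuned against the sites being crawled.

```python
# Hypothetical per-spider overrides; the keys are standard Scrapy settings,
# the values are illustrative and should be tuned for the target sites.
HIGH_CONCURRENCY_SETTINGS = {
    "CONCURRENT_REQUESTS": 200,            # total in-flight requests for the spider
    "CONCURRENT_REQUESTS_PER_DOMAIN": 16,  # cap per target domain
    "DOWNLOAD_DELAY": 0.25,                # seconds between requests to the same site
}
```

Assigning this dictionary to a spider's `custom_settings` class attribute applies it to that spider only.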

Related

Python Selenium Web Scraping with Same URL but Multiple Instances (Multiprocessing?)

I'm running a Python Selenium script using ChromeDriver that goes to one (and only one) URL, enters data into a form, then scrapes the results. I would like to do this in parallel by entering different data into the same URL's form and getting the different results.
In researching multiprocessing and multithreading, I have found that multithreading is best for I/O-bound tasks and multiprocessing is best for CPU-bound tasks.
The overall amount of data I'm scraping is small (selected text only), so I don't believe the task is I/O bound. Does this sound correct? From what I've gathered, web scrapers are generally I/O intensive; maybe my scenario is just an exception?
Running my current (sequential, non-parallel) script, Resource Monitor shows the Chrome instance's CPU usage ramping up across all 4 cores. So is Chrome using multiprocessing by default, and is the advantage of multiprocessing within Python really in being able to apply the script's function to each Chrome instance? Maybe I've got this all wrong...
Also, is a script that opens multiple URLs at once and interacts with them inherently CPU bound because it runs a lot of Chrome instances? Assume the data scraped is small, and ignore headless mode for now.
The attached image shows CPU usage; the spike in the middle (across all 4 CPUs) is when Chrome is launched.
Any comments or advice are appreciated, including any pseudocode on how you might implement something like this. I didn't share my base code, since the question is more about the structure of all this.

Python / Celery / Selenium continuous task (avoid reopening browser)

The biggest issue I have with Selenium is the long time it takes to reopen the browser (I'm using it to scrape every few minutes). I am also using proxies and running multiple browsers with Python's threading, all starting and stopping every few minutes (whenever a new job comes in).
Threading also means only one CPU is used, so performance suffers.
I've been thinking about switching to Celery (out-of-the-box multi-core support) and having workers (each with a different proxy/browser) run indefinitely in a while loop, keeping Selenium browser instances open and waiting for URLs to scrape, fed via something like Redis.
Is it a good idea to run continuous tasks like this with Celery? Is there a better way to do it?

It's never a good idea to hold Selenium instances open indefinitely; best practice is to reopen the browser for each task. So, to answer your question: in my opinion it's not a good idea.
Let me offer you another architecture instead. Use Docker to run your Selenium machines: basically, create a Selenium Grid (first result in Google link) using Docker.
Once everything is set up correctly the task becomes easy: use multiprocessing to send all the jobs to your Selenium hub in parallel, and they will run simultaneously on as many containers as you need.
Once the jobs are done, you can destroy the containers and start fresh with the next cycle.
Using Docker will also let you scale your operation very easily.
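A minimal sketch of the client side of that setup might look like the following. It uses a thread pool rather than multiprocessing, since each worker mostly waits on the remote browser. The hub URL, the helper names (`make_remote_driver`, `scrape_title`, `scrape_all`), and the idea of returning the page title are all placeholders for illustration, and the sketch assumes a Selenium Grid hub is already running and the `selenium` package is installed.

```python
from concurrent.futures import ThreadPoolExecutor

HUB_URL = "http://localhost:4444/wd/hub"  # assumption: wherever your hub listens


def make_remote_driver(hub_url=HUB_URL):
    """Open a fresh Chrome session on the grid."""
    from selenium import webdriver  # imported lazily so the helpers below stay generic
    return webdriver.Remote(command_executor=hub_url, options=webdriver.ChromeOptions())


def scrape_title(url, driver_factory=make_remote_driver):
    """Placeholder job: load one page in a new browser session and return its title."""
    driver = driver_factory()
    try:
        driver.get(url)
        return driver.title
    finally:
        driver.quit()  # always release the grid slot, even on errors


def scrape_all(urls, driver_factory=make_remote_driver, max_parallel=4):
    """Run the jobs in parallel; results come back in the same order as `urls`."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(lambda u: scrape_title(u, driver_factory), urls))
```

Keeping `max_parallel` at or below the number of browser slots on the grid keeps sessions from queueing at the hub.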

How to run multithreaded Python scripts

I wrote a Python web scraper yesterday and ran it in my terminal overnight. It only got through 50k pages, so now I just have a bunch of terminals open, concurrently running the script with different start and end points. This works fine because the main lag is obviously opening web pages, not actual CPU load. Is there a more elegant way to do this, especially one that can be done locally?

You have an I/O bound process, so to speed it up you will need to send requests concurrently. This doesn't necessarily require multiple processors, you just need to avoid waiting until one request is done before sending the next.
There are a number of solutions for this problem. Take a look at this blog post or check out gevent, asyncio (backports to pre-3.4 versions of Python should be available) or another async IO library.
However, when scraping other sites, you must remember: you can send requests very fast with concurrent programming, but depending on what site you are scraping, this may be very rude. You could easily bring a small site serving dynamic content down entirely, forcing the administrators to block you. Respect robots.txt, try to spread your efforts between multiple servers at once rather than focusing your entire bandwidth on a single server, and carefully throttle your requests to single servers unless you're sure you don't need to.
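As one possible shape for this, the sketch below uses `asyncio` with a semaphore so that only a few requests are in flight at once, with a short pause after each one. It assumes Python 3.9+ (for `asyncio.to_thread`). The names `fetch`, `fetch_all`, `max_concurrent`, and `delay` are hypothetical, and the default fetcher simply runs the blocking `urllib` call in a worker thread instead of using a real async HTTP client.

```python
import asyncio
import urllib.request


async def fetch(url):
    """Fetch one page body; the blocking urllib call runs in a worker thread."""
    def blocking_get():
        with urllib.request.urlopen(url, timeout=30) as resp:
            return resp.read()
    return await asyncio.to_thread(blocking_get)


async def fetch_all(urls, max_concurrent=5, delay=0.5, fetch_one=fetch):
    """Fetch every URL with at most `max_concurrent` requests in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def polite_fetch(url):
        async with semaphore:
            body = await fetch_one(url)
            await asyncio.sleep(delay)  # hold the slot briefly to throttle
            return url, body

    pairs = await asyncio.gather(*(polite_fetch(u) for u in urls))
    return dict(pairs)
```

It would be driven with something like `asyncio.run(fetch_all(urls))`. Lowering `max_concurrent` or raising `delay` makes the crawl gentler on the target server.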

AWS and Python threading scalability

I have a service running on a local server, written using the Python threading library. Think of it as a kind of web crawler. It uses 50 threads. I want to deploy it on the Amazon Web Services cloud and scale it up so that it uses more threads.
Simply put, I have two queues: Qinput with URLs and Qoutput with page content. The threads pick URLs from Qinput, fetch the content of the web page, and put it into Qoutput.
Question: is it enough that I simply increase the number of threads to, say, 500, 5,000 or 50,000 and AWS + Python will handle it? Should I expect the service to run seamlessly, or are there some "standard" design pitfalls that I should be aware of when porting a multithreaded service to AWS?
I am aware of the Global Interpreter Lock, although it should not be an issue here, as the main task of the threads is to call outside the interpreter while crawling/scraping pages.
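For reference, the two-queue design described above could be sketched roughly as follows. This is an illustrative reconstruction, not the actual service: `run_workers` and the `fetch` callable are hypothetical, and the real code presumably runs continuously rather than draining a fixed list.

```python
import queue
import threading


def run_workers(urls, fetch, num_threads=50):
    """Drain a queue of URLs with `num_threads` threads; return {url: content}."""
    q_input, q_output = queue.Queue(), queue.Queue()
    for url in urls:
        q_input.put(url)

    def worker():
        while True:
            try:
                url = q_input.get_nowait()
            except queue.Empty:
                return  # nothing left to do
            try:
                q_output.put((url, fetch(url)))
            except Exception as exc:  # keep one bad page from killing the thread
                q_output.put((url, exc))

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # All workers have exited, so the output queue can be drained safely.
    results = {}
    while not q_output.empty():
        url, content = q_output.get()
        results[url] = content
    return results
```

The thread count is a single parameter here, which is what makes "just raise it to 5,000" tempting. Each thread still costs memory and a socket, so the limits discussed in the answer apply.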

Any single instance has its limit. You will probably be able to spawn quite a lot of threads on your instance, especially if you choose one of the larger ones. But you will get diminishing returns from the additional threads, until adding more no longer improves performance at all.
However, if you want your system to scale beyond the limits of a single instance, it is best to be able to run it on multiple instances. Then your decision is only operational, not technical. If you are running in the AWS environment, which gives you almost endless operational resources, you should look into it.
You can also check out SQS, which is basically a distributed queue system. It will allow you to coordinate the work of as many instances as you need.

Python CGI queue

I'm working on a fairly simple CGI with Python. I'm about to put it into Django, etc. The overall setup is pretty standard server side (i.e. computation is done on the server):
User uploads data files and clicks "Run" button
Server forks jobs in parallel behind the scenes, using lots of RAM and processor power. ~5-10 minutes later (average use case), the program terminates, having created a file of its output and some .png figure files.
Server displays web page with figures and some summary text
I don't think there are going to be hundreds or thousands of people using this at once. However, because the computation takes a fair amount of RAM and processor power (each instance forks the most CPU-intensive task using Python's Pool), I wondered whether it would be worth the trouble to use a queueing system. I came across a Python module called beanstalkc, but its page says it is an "in-memory" queueing system.
What does "in-memory" mean in this context? I worry about memory, not just CPU time, and so I want to ensure that only one job runs (or is held in RAM, whether it receives CPU time or not) at a time.
Also, I was trying to decide whether
the result page (served by the CGI) should tell you your position in the queue (until the job runs, and then display the actual results page)
OR
the user should submit their email address to the CGI, which will email them the link to the results page when it is complete.
What do you think is the appropriate design methodology for a light traffic CGI for a problem of this sort? Advice is much appreciated.

Definitely use Celery. You can run an AMQP server, or I think you can use the database as a queue for the messages. It lets you run tasks in the background, and it can use multiple worker machines to do the processing if you want. It can also run cron-style jobs that are stored in the database if you use django-celery.
It's as simple as this to run a task in the background:
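    from celery import shared_task

    @shared_task  # registers add() as a Celery task (older Celery releases used @task)
    def add(x, y):
        return x + y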
In one of my projects it distributes the work over 4 machines, and it works great.
