How do I improve scrapy's download speed?

I'm using scrapy to download pages from many different domains in parallel. I have hundreds of thousands of pages to download, so performance is important.
Unfortunately, as I've profiled scrapy's speed, I'm only getting a couple pages per second. Really, about 2 pages per second on average. I've previously written my own multithreaded spiders to do hundreds of pages per second -- I thought for sure scrapy's use of twisted, etc. would be capable of similar magic.
How do I speed scrapy up? I really like the framework, but this performance issue could be a deal-breaker for me.
Here's the relevant part of the settings.py file. Is there some important setting I've missed?
LOG_ENABLED = False
CONCURRENT_REQUESTS = 100
CONCURRENT_REQUESTS_PER_IP = 8
A few parameters:
Using scrapy version 0.14
The project is deployed on an EC2 large instance, so there should be plenty of memory, CPU, and bandwidth to play with.
I'm scheduling crawls using the JSON protocol, keeping the crawler topped up with a few dozen concurrent crawls at any given time.
As I said at the beginning, I'm downloading pages from many sites, so remote server performance and CONCURRENT_REQUESTS_PER_IP shouldn't be a worry.
For the moment, I'm doing very little post-processing. No xpath; no regex; I'm just saving the url and a few basic statistics for each page. (This will change later once I get the basic performance kinks worked out.)

I had this problem in the past, and I solved a large part of it with a 'dirty' old trick:
Run a local caching DNS server.
Most of the time, when you see high CPU usage while hitting many remote sites simultaneously, it is because scrapy is busy resolving the URLs' hostnames.
And please remember to change the DNS settings on your host (/etc/resolv.conf) to point to your LOCAL caching DNS server.
The first requests will be slow, but as soon as the cache starts filling up and resolution becomes more efficient you are going to see HUGE improvements.
I hope this helps with your problem!
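On the Scrapy side, a few settings also affect DNS handling. A minimal sketch of the relevant knobs, assuming a reasonably recent Scrapy (the names below come from the current docs and may not all exist in 0.14):
# settings.py -- DNS-related settings (verify against your Scrapy version's docs)
DNSCACHE_ENABLED = True          # keep resolved hostnames in an in-memory cache
DNSCACHE_SIZE = 10000            # a larger cache helps broad crawls over many domains
DNS_TIMEOUT = 60                 # seconds to wait for a DNS lookup
REACTOR_THREADPOOL_MAXSIZE = 20  # Twisted resolves DNS in a thread pool; raise it for broad crawls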

Related

Running dozens of Scrapy spiders in a controlled manner

I'm trying to build a system to run a few dozen Scrapy spiders, save the results to S3, and let me know when it finishes. There are several similar questions on StackOverflow (e.g. this one and this other one), but they all seem to use the same recommendation (from the Scrapy docs): set up a CrawlerProcess, add the spiders to it, and hit start().
When I tried this method with all 325 of my spiders, though, it eventually locks up and fails because it attempts to open too many file descriptors on the system that runs it. I've tried a few things that haven't worked.
What is the recommended way to run a large number of spiders with Scrapy?
Edited to add: I understand I can scale up to multiple machines and pay for services to help coordinate (e.g. ScrapingHub), but I'd prefer to run this on one machine using some sort of process pool + queue so that only a small fixed number of spiders are ever running at the same time.
The simplest way to do this is to run them all from the command line. For example:
$ scrapy list | xargs -P 4 -n 1 scrapy crawl
This will run all your spiders, with up to 4 running in parallel at any time. You can then send a notification from a script once the command has completed.
A more robust option is to use scrapyd. This comes with an API, a minimal web interface, etc. It will also queue the crawls and only run a certain (configurable) number at once. You can interact with it via the API to start your spiders and send notifications once they are all complete.
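For illustration, a minimal sketch of driving scrapyd through its HTTP API with the requests library; the project name 'myproject' and spider name 'example_spider' are placeholders:
import requests

SCRAPYD = "http://localhost:6800"

# queue a crawl; scrapyd itself limits how many jobs run at once (max_proc)
requests.post(SCRAPYD + "/schedule.json",
              data={"project": "myproject", "spider": "example_spider"})

# poll job state; when nothing is pending or running, everything is done
jobs = requests.get(SCRAPYD + "/listjobs.json",
                    params={"project": "myproject"}).json()
all_done = not jobs["pending"] and not jobs["running"]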
Scrapy Cloud is a perfect fit for this [disclaimer: I work for Scrapinghub]. It will allow you to run only a certain number at once, gives you a queue of pending jobs (which you can modify, browse online, prioritize, etc.), and has a more complete API than scrapyd.
You shouldn't run all your spiders in a single process. It will probably be slower, can introduce unforeseen bugs, and you may hit resource limits (like you did). If you run them separately using any of the options above, just run enough to max out your hardware resources (usually CPU/network). If you still get problems with file descriptors at that point you should increase the limit.
it eventually locks up and fails because it attempts to open too many file descriptors on the system that runs it
That's probably a sign that you need multiple machines to execute your spiders -- a scalability issue. You can also scale vertically to make your single machine more powerful, but that would hit a "limit" much sooner:
Difference between scaling horizontally and vertically for databases
Check out the Distributed Crawling documentation and the scrapyd project.
There is also a cloud-based distributed crawling service called ScrapingHub which would take the scalability problems off your hands altogether (note that I am not advertising them, as I have no affiliation with the company).
One solution, if the information is relatively static (based on your mention of the process "finishing"), is to simply set up a script that runs the crawls sequentially or in batches: wait for one batch to finish before starting the next (a batch of 1, 10, or whatever size you choose); see the sketch below.
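A minimal sketch of that batching approach, assuming the script is run from inside the Scrapy project directory and that the batch size of 10 is just a placeholder:
import subprocess

BATCH_SIZE = 10  # placeholder: how many spiders run at the same time

# 'scrapy list' prints one spider name per line (capture_output needs Python 3.7+)
spiders = subprocess.run(["scrapy", "list"], capture_output=True,
                         text=True).stdout.split()

for i in range(0, len(spiders), BATCH_SIZE):
    batch = spiders[i:i + BATCH_SIZE]
    # start one scrapy process per spider in this batch
    procs = [subprocess.Popen(["scrapy", "crawl", name]) for name in batch]
    # wait for the whole batch to finish before starting the next one
    for proc in procs:
        proc.wait()

print("all crawls finished")  # hook your notification here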
Another thing to consider if you're only using one machine and this error is cropping up: having too many files open isn't really a resource bottleneck in itself. You might be better off having each spider run 200 or so threads so that network IO (or occasionally CPU) becomes the bottleneck. Each spider will then finish faster on average than your current approach, which runs them all at once and hits an artificial "maximum file descriptors" limit rather than an actual resource limit.

Python & web scraping performance

I am trying to do some python based web scraping where execution time is pretty critical.
I've tried phantomjs, selenium, and pyqt4 now, and all three libraries have given me similar response times. I'd post example code, but my problem affects all three, so I believe the problem either lies in a shared dependency or outside of my code. At around 50 concurrent requests, we see a huge degradation in response time. It takes about 40 seconds to get back all 50 pages, and that time gets exponentially worse with greater page demands. Ideally I'm looking for ~200+ requests in about 10 seconds. I used multiprocessing to spawn each instance of phantomjs/pyqt4/selenium, so each url request gets its own instance and I'm not blocked by single threading.
I don't believe it's a hardware bottleneck: it's running on 32 dedicated CPU cores (64 threads total), and CPU usage doesn't typically spike over 10-12%. Bandwidth also sits comfortably at around 40-50% of my total throughput.
I've read about the GIL, which I believe I've addressed with using multiprocessing. Is webscraping just an inherently slow thing? Should I stop expecting to pull 200ish webpages in ~10 seconds?
My overall question is, what is the best approach to high performance web scraping, where evaluating js on the webpage is a requirement?
"evaluating js on the webpage is a requirement" <- I think this is your problem right here. Simply downloading 50 web pages is fairly trivially parallelized and should only take as long as the slowest server takes to respond.
Now, spawning 50 javascript engines in parallel (which is essentially what I guess you are doing) to run the scripts on every page is a different matter. Imagine firing up 50 Chrome browsers at the same time.
Anyway: profile and measure the parts of your application to find where the bottleneck lies. Only then can you see whether you're dealing with an I/O bottleneck (sounds unlikely), a CPU bottleneck (more likely), or a global lock somewhere that serializes things (also likely, but impossible to say without any code posted).
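Coming back to the first point above: plain downloads (no JS evaluation) parallelize with very little code. A minimal sketch, assuming the requests library and a placeholder list of URLs:
import requests
from concurrent.futures import ThreadPoolExecutor

urls = ["https://example.com/page/%d" % i for i in range(50)]  # placeholder URLs

def fetch(url):
    # no JS engine involved: each page is just an HTTP round-trip
    return requests.get(url, timeout=30).text

with ThreadPoolExecutor(max_workers=50) as pool:
    pages = list(pool.map(fetch, urls))
The total wall-clock time for such a run is roughly the response time of the slowest server, which is the point made above.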

Scrapy : preventive measures before running the scrape

I'm about to scrape some 50,000 records from a real estate website (with Scrapy).
The programming has been done and tested, and the database properly designed.
But I want to be prepared for unexpected events.
So how do I go about actually running the scrape flawlessly and with minimal risk of failure and loss of time?
More specifically:
Should I carry it out in phases (scraping in smaller batches)?
What and how should I log?
Which other points of attention should I take into account before launching?
First of all, study the following topics to have a general idea on how to be a good web-scraping citizen:
Web scraping etiquette
Screen scraping etiquette
In general, first, you need to make sure you are legally allowed to scrape this particular website and that you follow its Terms of Use. Also, check the website's robots.txt and respect the rules listed there (for example, there can be a Crawl-delay directive set). It is also a good idea to contact the website owners and let them know what you are going to do, or to ask for permission.
Identify yourself by explicitly specifying a User-Agent header.
See also:
Is this Anti-Scraping technique viable with Robots.txt Crawl-Delay?
What will happen if I don't follow robots.txt while crawling?
Should I carry it out in phases (scraping in smaller batches)?
This is what the DOWNLOAD_DELAY setting is about:
The amount of time (in secs) that the downloader should wait before
downloading consecutive pages from the same website. This can be used
to throttle the crawling speed to avoid hitting servers too hard.
CONCURRENT_REQUESTS_PER_DOMAIN and CONCURRENT_REQUESTS_PER_IP are also relevant.
Tweak these settings so that you don't hit the website's servers too hard or too often.
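A minimal settings.py sketch of these politeness knobs (the values are placeholders to tune per site; AutoThrottle is an optional extension available in recent Scrapy versions):
# settings.py -- politeness/throttling (values are placeholders)
USER_AGENT = "mybot (+https://example.com/contact)"  # identify yourself, per the point above
ROBOTSTXT_OBEY = True                 # respect robots.txt rules
DOWNLOAD_DELAY = 1.0                  # seconds between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 4    # parallel requests per domain
CONCURRENT_REQUESTS_PER_IP = 0        # 0 = fall back to the per-domain limit

# Optional: let Scrapy adapt the delay to the server's latency
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0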
What and how should I log?
The information that Scrapy prints to the console is already pretty extensive, but you may want to log all the errors and exceptions raised while crawling. I personally like the idea of listening for the spider_error signal to be fired, see:
how to process all kinds of exception in a scrapy project, in errback and callback?
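For illustration, a minimal sketch of hooking that signal from inside a spider, assuming a recent Scrapy version (the spider name and logging destination are placeholders):
import logging

import scrapy
from scrapy import signals


class MySpider(scrapy.Spider):
    name = "myspider"  # placeholder

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # spider_error fires whenever a callback raises an exception
        crawler.signals.connect(spider.on_spider_error, signal=signals.spider_error)
        return spider

    def on_spider_error(self, failure, response, spider):
        logging.error("Callback error on %s: %s", response.url, failure.getErrorMessage())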
Which other points of attention should I take into account before launching?
You still have several things to think about.
At some point, you may get banned. There is always a reason for this; the most obvious one is that you crawled them too hard and they didn't like it. There are certain techniques/tricks to avoid getting banned, like rotating IP addresses, using proxies, web scraping in the cloud, etc., see:
Avoiding getting banned
Another thing to worry about might be crawling speed and scaling; at this point you may want to think about distributing your crawling process. This is where scrapyd would help, see:
Distributed crawls
Still, make sure you are not crossing the line and that you stay on the legal side.

How to run multithreaded Python scripts

I wrote a Python web scraper yesterday and ran it in my terminal overnight, but it only got through 50k pages. So now I just have a bunch of terminals open, concurrently running the script with different start and end points. This works fine because the main lag is obviously opening web pages rather than actual CPU load. Is there a more elegant way to do this, especially if it can be done locally?
You have an I/O bound process, so to speed it up you will need to send requests concurrently. This doesn't necessarily require multiple processors, you just need to avoid waiting until one request is done before sending the next.
There are a number of solutions for this problem. Take a look at this blog post or check out gevent, asyncio (backports to pre-3.4 versions of Python should be available) or another async IO library.
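A minimal sketch of the asyncio route, assuming the third-party aiohttp library is installed and using a placeholder list of URLs:
import asyncio

import aiohttp

urls = ["https://example.com/page/%d" % i for i in range(100)]  # placeholders

async def fetch(session, url):
    # while this request waits on the network, other requests keep running
    async with session.get(url) as resp:
        return await resp.text()

async def main():
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, u) for u in urls))

pages = asyncio.run(main())  # Python 3.7+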
However, when scraping other sites, you must remember: you can send requests very fast with concurrent programming, but depending on what site you are scraping, this may be very rude. You could easily bring a small site serving dynamic content down entirely, forcing the administrators to block you. Respect robots.txt, try to spread your efforts between multiple servers at once rather than focusing your entire bandwidth on a single server, and carefully throttle your requests to single servers unless you're sure you don't need to.

Python, multi-threads, fetch webpages, download webpages

I want to batch-download web pages from one site. There are 5,000,000 URLs in my 'urls.txt' file, which is about 300 MB. How can I fetch these URLs with multiple threads and download the pages? Or how can I batch-download them?
My idea:
with open('urls.txt', 'r') as f:
    for el in f:
        # fetch these urls
or twisted?
Is there a good solution for it?
If this isn't part of a larger program, then notnoop's idea of using some existing tool to accomplish this is a pretty good one. If a shell loop invoking wget solves your problem, that'll be a lot easier than anything involving more custom software development.
However, if you need to fetch these resources as part of a larger program, then doing it with shell may not be ideal. In this case, I'll strongly recommend Twisted, which will make it easy to do many requests in parallel.
A few years ago I wrote up an example of how to do just this. Take a look at http://jcalderone.livejournal.com/24285.html.
Downloading 5M web pages in one go is definitely not a good idea, because you'll max out a lot of things, including your network bandwidth and your OS's file descriptors. I'd go in batches of 100-1000. You can use urllib.urlopen to get a socket and then just read() on several threads. You may be able to use select.select. If so, then go ahead and download all 1000 at once and distribute each file handle that select returns to, say, 10 worker threads. If select won't work, then limit your batches to 100 downloads and use one thread per download. Certainly you shouldn't start more than 100 threads, as your OS might blow up or at least slow down a bit.
First parse your file and push the URLs into a queue, then spawn 5-10 worker threads to pull URLs out of the queue and download them. Queues are your friend with this; a sketch follows below.
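A minimal sketch of that queue/worker pattern in modern Python, using the file name from the question and 10 workers as in the answer (error handling kept to a bare minimum):
import queue
import threading
import urllib.request

url_queue = queue.Queue()

def worker():
    while True:
        url = url_queue.get()
        if url is None:       # sentinel: no more work for this thread
            break
        try:
            data = urllib.request.urlopen(url, timeout=30).read()
            # ... save `data` to disk here ...
        except Exception as exc:
            print("failed:", url, exc)
        finally:
            url_queue.task_done()

threads = [threading.Thread(target=worker) for _ in range(10)]
for t in threads:
    t.start()

with open('urls.txt') as f:
    for line in f:
        url_queue.put(line.strip())

url_queue.join()              # block until every queued URL has been processed
for _ in threads:
    url_queue.put(None)       # tell each worker to exit
for t in threads:
    t.join()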
A wget script is probably simplest, but if you're looking for a python-twisted crawling solution, check out scrapy
