I'm about to scrape some 50,000 records from a real estate website (with Scrapy).
The programming has been done and tested, and the database properly designed.
But I want to be prepared for unexpected events.
So how do I go about actually running the scrape flawlessly and with minimal risk of failure and loss of time?
More specifically:
Should I carry it out in phases (scraping in smaller batches)?
What and how should I log?
Which other points of attention should I take into account before launching?
First of all, study the following topics to get a general idea of how to be a good web-scraping citizen:
Web scraping etiquette
Screen scraping etiquette
In general, first make sure you are legally allowed to scrape this particular website and that you follow its Terms of Use. Also check the website's robots.txt and respect the rules listed there (for example, there may be a Crawl-delay directive set). It's also a good idea to contact the website's owners and let them know what you are going to do, or to ask for permission.
Identify yourself by explicitly specifying a User-Agent header.
See also:
Is this Anti-Scraping technique viable with Robots.txt Crawl-Delay?
What will happen if I don't follow robots.txt while crawling?
Should I carry it out in phases (scraping in smaller batches)?
This is what DOWNLOAD_DELAY setting is about:
The amount of time (in secs) that the downloader should wait before downloading consecutive pages from the same website. This can be used to throttle the crawling speed to avoid hitting servers too hard.
CONCURRENT_REQUESTS_PER_DOMAIN and CONCURRENT_REQUESTS_PER_IP are also relevant.
Tweak these settings so you don't hit the website's servers too often.
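For instance, a minimal settings.py sketch along these lines (the numbers are placeholders to tune for the target site, the User-Agent string is made up, and AutoThrottle assumes a reasonably recent Scrapy version):

# settings.py -- illustrative values only, adjust for the site you are scraping
USER_AGENT = 'realestate-research-bot (you@example.com)'  # made-up identifier; say who you are
ROBOTSTXT_OBEY = True                  # let Scrapy honour the site's robots.txt
DOWNLOAD_DELAY = 2                     # seconds between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 1     # keep the load on a single site low
AUTOTHROTTLE_ENABLED = True            # adjust the delay based on server response times
AUTOTHROTTLE_MAX_DELAY = 10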
What and how should I log ?
The information that Scrapy prints to the console is quite extensive, but you may want to log all the errors and exceptions raised while crawling. I personally like the idea of listening for the spider_error signal to be fired, see:
how to process all kinds of exception in a scrapy project, in errback and callback?
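A minimal sketch of such a listener, written as a Scrapy extension (the class name is hypothetical, and you would still need to enable it via the EXTENSIONS setting):

import logging
from scrapy import signals

class SpiderErrorLogger:
    # hypothetical extension: logs every exception raised in a spider callback
    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        crawler.signals.connect(ext.on_spider_error, signal=signals.spider_error)
        return ext

    def on_spider_error(self, failure, response, spider):
        # 'failure' is a Twisted Failure wrapping the exception
        logging.error("Callback error on %s: %s", response.url, failure.getTraceback())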
Which other points of attention should I take into account before launching?
You still have several things to think about.
At some point, you may get banned. There is always a reason for this; the most obvious would be that you are still crawling them too hard and they don't like it. There are certain techniques/tricks to avoid getting banned, like rotating IP addresses, using proxies, web scraping in the cloud, etc., see:
Avoiding getting banned
Another thing to worry about might be crawling speed and scaling; at this point you may want to think about distributing your crawling process. This is where scrapyd would help, see:
Distributed crawls
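With scrapyd running, scheduling a crawl is a single call to its JSON API; a rough sketch (the project and spider names are made up, and scrapyd listens on port 6800 by default):

import requests

resp = requests.post('http://localhost:6800/schedule.json',
                     data={'project': 'realestate', 'spider': 'listings'})
print(resp.json())  # something like {'status': 'ok', 'jobid': '...'}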
Still, make sure you are not crossing the line and that you stay on the legal side.
Related
I recently got into Flask web programming and built a shopping website from scratch as an engineering school project; however, I got lost when it came to product ranking etc.
I had this idea for a dating website as an exercise, but as I see it, the server will have to run its own calculations to rank different possible couples in terms of compatibility, which is really the interesting part of the project.
I don't really see these ranking calculations only being processed upon request, as they may take some time, but maybe I am highly underestimating SQL processing speed. I believe the data processing and calculations need to be run continuously on the server. If this is in fact continuous server data processing, how would I go about doing that?
I hope the question makes sense, my English tends to be a bit dodgy as I don't live in an English-speaking country.
Regards
If you need background tasks that run without client requests, you can go for Celery (https://docs.celeryproject.org/en/stable/userguide/periodic-tasks.html). You can assign work to it and it will run in the background without interfering with the Django server.
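A minimal sketch of such a periodic task with Celery beat (the broker URL, the module name tasks.py, the task name, and the 30-minute interval are all assumptions):

from celery import Celery

# assumes this file is tasks.py and a Redis broker is running locally
app = Celery('tasks', broker='redis://localhost:6379/0')

@app.task
def recompute_compatibility_scores():
    # placeholder for the actual ranking/compatibility calculation
    pass

# have Celery beat trigger the task every 30 minutes
app.conf.beat_schedule = {
    'recompute-scores': {
        'task': 'tasks.recompute_compatibility_scores',
        'schedule': 30 * 60,  # seconds
    },
}

You would then start a worker with the beat scheduler (something like celery -A tasks worker -B) and have your web views simply read the precomputed results.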
I wrote a Python web scraper yesterday and ran it in my terminal overnight. It only got through 50k pages, so now I just have a bunch of terminals open, concurrently running the script with different start and end points. This works fine because the main lag is obviously opening web pages and not actual CPU load. Is there a more elegant way to do this, especially if it can be done locally?
You have an I/O bound process, so to speed it up you will need to send requests concurrently. This doesn't necessarily require multiple processors, you just need to avoid waiting until one request is done before sending the next.
There are a number of solutions for this problem. Take a look at this blog post or check out gevent, asyncio (backports to pre-3.4 versions of Python should be available) or another async IO library.
However, when scraping other sites, you must remember: you can send requests very fast with concurrent programming, but depending on what site you are scraping, this may be very rude. You could easily bring a small site serving dynamic content down entirely, forcing the administrators to block you. Respect robots.txt, try to spread your efforts between multiple servers at once rather than focusing your entire bandwidth on a single server, and carefully throttle your requests to single servers unless you're sure you don't need to.
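A rough sketch of that kind of throttled concurrency with asyncio (this assumes a modern Python 3 and the third-party aiohttp package; the URLs and the limit of 10 are placeholders):

import asyncio
import aiohttp  # third-party: pip install aiohttp

async def fetch(session, sem, url):
    # the semaphore caps how many requests are in flight at once
    async with sem:
        async with session.get(url) as resp:
            return url, resp.status, await resp.text()

async def main(urls, limit=10):
    sem = asyncio.Semaphore(limit)
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, sem, u) for u in urls))

urls = ['https://example.com/page/%d' % i for i in range(1, 51)]  # placeholder URLs
results = asyncio.run(main(urls))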
I need to build a web crawler that makes requests and brings back the responses, complete and quickly if possible.
I come from the Java language. I used two "frameworks" and neither fully satisfied my needs.
Jsoup made the request/response fast but returned incomplete data when the page had a lot of information. Apache HttpClient was exactly the opposite: reliable data but very slow.
I've looked over some Python modules and I'm testing Scrapy. In my searches, I was unable to conclude whether it is the fastest and returns the data consistently, or whether there is something better, even if more verbose or difficult.
Second, is Python a good language for this purpose?
Thank you in advance.
+1 votes for Scrapy. For the past several weeks I have been writing crawlers for massive car forums, and Scrapy is absolutely incredible, fast, and reliable.
Looking for something to "do requests and bring the responses complete and quickly" makes no sense.
A. Any HTTP library will give you the complete headers/body the server responds with.
B. How quickly a web request completes is generally dictated by your network connection and the server's response time, not by the client you are using.
So with those requirements, anything will do.
Check out the requests package. It is an excellent HTTP client library for Python.
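For instance, a bare-bones fetch with requests (the URL is a placeholder):

import requests

resp = requests.get('https://example.com/listing/1', timeout=10)  # placeholder URL
resp.raise_for_status()   # raise an exception on HTTP error statuses
html = resp.text          # the full body exactly as the server sent it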
I'm using scrapy to download pages from many different domains in parallel. I have hundreds of thousands of pages to download, so performance is important.
Unfortunately, as I've profiled scrapy's speed, I'm only getting a couple pages per second. Really, about 2 pages per second on average. I've previously written my own multithreaded spiders to do hundreds of pages per second -- I thought for sure scrapy's use of twisted, etc. would be capable of similar magic.
How do I speed scrapy up? I really like the framework, but this performance issue could be a deal-breaker for me.
Here's the relevant part of the settings.py file. Is there some important setting I've missed?
LOG_ENABLED = False
CONCURRENT_REQUESTS = 100
CONCURRENT_REQUESTS_PER_IP = 8
A few parameters:
Using scrapy version 0.14
The project is deployed on an EC2 large instance, so there should be plenty of memory, CPU, and bandwidth to play with.
I'm scheduling crawls using the JSON protocol, keeping the crawler topped up with a few dozen concurrent crawls at any given time.
As I said at the beginning, I'm downloading pages from many sites, so remote server performance and CONCURRENT_REQUESTS_PER_IP shouldn't be a worry.
For the moment, I'm doing very little post-processing. No xpath; no regex; I'm just saving the url and a few basic statistics for each page. (This will change later once I get the basic performance kinks worked out.)
I had this problem in the past...
And a large part of it I solved with a dirty old trick:
run a local caching DNS server.
Mostly, when you have this high CPU usage while accessing many remote sites simultaneously, it is because Scrapy is spending time resolving the URLs' hostnames.
And please remember to change the DNS settings on the host (/etc/resolv.conf) to point at your LOCAL caching DNS server.
The first requests will be slow, but as soon as it starts caching and resolving more efficiently, you are going to see HUGE improvements.
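As a complement to the system-level caching DNS server described above, newer Scrapy versions also ship an in-memory DNS cache you can make sure is turned on (these are real Scrapy settings; the values shown are just examples):

# settings.py -- Scrapy's built-in DNS cache
DNSCACHE_ENABLED = True   # on by default, but worth double-checking
DNSCACHE_SIZE = 10000     # number of hostnames to keep cached
DNS_TIMEOUT = 60          # seconds to wait for a DNS lookup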
I hope this will help you with your problem!
I have a big threaded feed retrieval script in python.
My question is, how can I load balance outgoing requests so that I don't hit any one host too often?
This is a big problem for feedburner, since a large percentage of sites proxy their RSS through feedburner, and to further complicate matters, many sites will alias a subdomain on their domain to feedburner to obscure the fact that they're using it (e.g. "mysite" sets its RSS URL to feeds.mysite.com/mysite, where feeds.mysite.com bounces to feedburner). Sometimes it blocks me for a while and redirects to their "automated requests" error page.
You should probably do a one-time request (per week/month, whatever fits) for each feed and follow redirects to get the "true" address. Regardless of your throttling situation at the time, you should be able to resolve all feeds, save that data, and then just do it once for every new feed you add to the list. You can look at urllib's geturl(), as it returns the final URL after the one you put in has been redirected. When you do ping the feeds, be sure to use the original URL (and keep the "real" one simply for load balancing) to make sure it still redirects properly if the user has moved the feed or similar.
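Roughly, with the standard library (the feed URL is a placeholder):

from urllib.request import urlopen

resp = urlopen('http://feeds.example.com/somefeed')  # placeholder feed URL
true_url = resp.geturl()  # the final URL after any redirects were followed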
Once that is done, you can simply devise a load mechanism such as only X requests per hour for a given domain, going through each feed and skipping feeds whose hosts have hit the limit. If feedburner keeps its limits public (not likely) you can use that for X, but otherwise you will just have to make a rough estimate that you know to be below the limit. Knowing Google, however, their limits might measure patterns rather than enforce a specific hard number.
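A rough sketch of that kind of per-host cap (the 50-per-hour limit is a made-up estimate):

import time
from collections import defaultdict
from urllib.parse import urlparse

MAX_PER_HOUR = 50              # made-up value; pick one you believe is below the real limit
recent = defaultdict(list)     # host -> timestamps of requests in the last hour

def allowed(url):
    host = urlparse(url).netloc
    cutoff = time.time() - 3600
    recent[host] = [t for t in recent[host] if t > cutoff]
    if len(recent[host]) >= MAX_PER_HOUR:
        return False           # skip this feed for now and come back to it later
    recent[host].append(time.time())
    return True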
If your problem is related to Feedburner "throttling you", it most certainly does this because of the source IP of your bot. The way to "load balance to Feedburner" would be to have multiple different source IPs to start from.
Now, there are numerous ways of achieving this, two of them being:
Multi-homed server: multiple IPs on the same machine
Multiple discrete machines
Of course, don't you go and put a NAT box in front of them now ;-)
The above takes care of the possible "throttling problems", now for the "scheduling part". You should maintain a "virtual scheduler" per "destination" and make sure not to exceed the parameters of the Web Service (e.g. Feedburner) in question. Now, the tricky part is to get hold of these "limits"... sometimes they are advertised and sometimes you need to figure them out experimentally.
I understand this is "high level architectural guidelines" but I am not ready to be coding this for you... I hope you forgive me ;-)
"how can I load balance outgoing requests so that I don't hit any one host too often?"
Generally, you do this by designing a better algorithm.
For example, randomly scramble your requests.
Or shuffle them 'fairly' so that you round-robin through the sources. That would be a simple list of queues where you dequeue one request from each host.
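A minimal version of that round-robin idea (the feed URLs are placeholders):

from collections import deque
from urllib.parse import urlparse

feeds = ['http://feeds.example.com/a', 'http://feeds.example.org/b']  # placeholders

queues = {}                      # one queue per host
for url in feeds:
    queues.setdefault(urlparse(url).netloc, deque()).append(url)

while any(queues.values()):
    for host, q in queues.items():
        if q:
            url = q.popleft()
            # fetch(url) would go here; each pass takes at most one URL per host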