Being a good citizen and web-scraping

Being a good citizen and web-scraping - python

I have a two part question.
First, I'm writing a web-scraper based on the CrawlSpider spider in Scrapy. I'm aiming to scrape a website that has many thousands (possible into the hundreds of thousands) of records. These records are buried 2-3 layers down from the start page. So basically I have the spider start on a certain page, crawl until it finds a specific type of record, and then parse the html. What I'm wondering is what methods exist to prevent my spider from overloading the site? Is there possibly a way to do thing's incrementally or put a pause in between different requests?
Second, and related, is there a method with Scrapy to test a crawler without placing undue stress on a site? I know you can kill the program while it runs, but is there a way to make the script stop after hitting something like the first page that has the information I want to scrape?
Any advice or resources would be greatly appreciated.

Is there possibly a way to do thing's incrementally
I'm using Scrapy caching ability to scrape site incrementaly
HTTPCACHE_ENABLED = True
Or you can use new 0.14 feature Jobs: pausing and resuming crawls
or put a pause in between different requests?
check this settings:
DOWNLOAD_DELAY
RANDOMIZE_DOWNLOAD_DELAY
is there a method with Scrapy to test a crawler without placing undue stress on a site?
You can try and debug your code in Scrapy shell
I know you can kill the program while it runs, but is there a way to make the script stop after hitting something like the first page that has the information I want to scrape?
Also, you can call scrapy.shell.inspect_response at any time in your spider.
Any advice or resources would be greatly appreciated.
Scrapy documentation is the best resource.

You have to start crawling and log everything. In case you get banned, you can add sleep() before pages request.
Changing User-Agent is a good practise, too (http://www.user-agents.org/ http://www.useragentstring.com/ )
If you get banned by ip, use proxy to bypass it. Cheers.

Related

Using a Single Web Crawler to Scrape Multiple websites in a predefined format with attachments?

I have a list of approx. 52 websites which lead to about approx. 150 webpages that i require scraping on. Based on my ignorance and lack of research i started building crawlers per webpage which is starting to become to difficult to complete and maintain.
Based on my analysis thus far I already know what information i want to scrape per webpage and it is clear that these websites have their own structure. On the plus side i noticed that each website has some commonalities in their web structure among their webpages.
My million dollar question, is there a single technique or single web crawler that i can use to scrape these sites? I already know the information that I want, these sites are rarely updated in terms of their web structure and most of these sites have documents that need to be downloaded.
Alternatively, is there a better solution to use that will reduce the amount of web crawlers that I need to build? additionally, these web crawlers will only be used to download the new information of the websites that i am aiming them at.

[…] i started building crawlers per webpage which is starting to become to difficult to complete and maintain […] it is clear that these websites have their own structure. […] these sites are rarely updated in terms of their web structure […]
If websites have different structures, having separate spiders makes sense, and should make maintenance easier in the long term.
You say completing new spiders (I assume you mean developing them, not crawling or something else) is becoming difficult, however if they are similar to an existing spider, you can simply copy-and-paste the most similar existing spider, and make only the necessary changes.
Maintenance should be easiest with separate spiders for different websites. If a single website changes, you can fix the spider for that website. If you have a spider for multiple websites, and only one of them changes, you need to make sure that your changes for the modified website do not break the rest of the websites, which can be a nightmare.
Also, since you say website structures do not change often, maintenance should not be that hard in general.
If you notice you are repeating a lot of code, you might be able to extract some shared code into a spider middleware, a downloader middleware, an extension, an item loader, or even a base spider class shared by two or more spiders. But I would not try to use a single Spider subclass to scrape multiple different websites that are likely to evolve separately.

I suggest you crawl specific tags such as body, h1,h2,h3,h4,h5, h6,p and... for each links. You can gather all p tags and append them into a specific link. It can be used for each tags you want to crawl them. Also, you can append related links of tags to your database.

Python tool to check broken links on a big urls list

I have a search engine in production serving around 700 000 url. The crawling is done using Scrapy, and all spiders are scheduled using DeltaFetch in order to get daily new links.
The difficulty I'm facing is handling broken links.
I have a hard time finding a good way to periodically scan, and remove broken links. I was thinking about a few solutions :
Developing a python script using requests.get, to check on every single url, and delete anything that returns a 404 status.
Using a third party tool like https://github.com/linkchecker/linkchecker, but not sure if it's the best solution since I only need to check up a list of url, not a website.
Using a scrapy spider to scrap this url list, and return any urls that are erroring out. I'm not really confident on that one since I know scrapy tends to timeout when scaning a lot of urls on different domains, this is why I rely so much on deltafetch
Do you have any recommendations / best practice to solve this problem?
Thanks a lot.
Edit : I forgot to give one precision : I'm looking to "validate" those 700k urls, not to crawl them. actually those 700k urls are the crawling result of around 2500k domains.

You could write a small script that just check the return http status like so:
for url in urls:
try:
urllib2.urlopen(url)
except urllib2.HTTPError, e:
# Do something when request fails
print e.code
This would be the same as your first point. You could also run this async in order to optimize the time it takes to run through your 700k links.

I would suggest using scrapy, since you're already looking up each URL with this tool and thus knows which URLs errors out. This means you don't have to check the URLs a second time.
I'd go about it like this:
Save every URL erroring out in a separate list/map with a counter (which is stored between runs).
Every time an URL errors out, increment the counter. If it doesn't, decrement the counter.
After running the Scrapy script, check this list/map for URLs with a high enough counter - let's say more than 10 faults, and remove them - or store them in a seperate list of links to check up on a later time (As a check if you accidentally removed a working URL because a server was down too long).
Since your third bullet is concerned about Scrapy being shaky with URL results, the same could be said for websites in general. If a site errors out on 1 try, it might not mean a broken link.

If you go for creating a script of our own check this solution
In addition an optimization that I suggest is to make heirarchy in your URL repository. If you get 404 from one of a parent URL you can avoid checking all it children URLs

First thought came into my mind is to request URLs with HEAD instead of any other method
Spawn multiple spiders at once assigning them batches like LIMIT 0,10000 and LIMIT 10000,10000
In your data pipeline, instead of running a MySQL DELETE query each time scraper finds 404 status, run DELETE FROM table WHERE link IN(link1,link2) query in bulk
I am sure you have INDEX on link column, if not add it

Passing a request to a different spider

I'm working on a web crawler (using scrapy) that uses 2 different spiders:
Very generic spider that can crawl (almost) any website using a bunch of heuristics to extract data.
Specialized spider capable of crawling a particular website A that can't be crawled with a generic spider because of website's peculiar structure (that website has to be crawled).
Everything works nicely so far but website A contains links to other, "ordinary" websites that should be scraped too (using spider 1). Is there a Scrappy way to pass the request to spider 1?
Solutions I thought about:
Moving all functionality to spider 1. But that might get really messy, spider 1 code is already very long and complicated, I'd like to keep this functionality separate, if possible.
Saving the links to the database like it was suggested in Pass scraped URL's from one spider to another
Is there a better way?

I met such a case, with a spyder retrieving in a first page the URL adresses and the second one being called from there to operate.
I don't know what is your control flow, but depending on it, I would merely call the first spyder just in time when scrapping a new url, or after scrapping all possible url.
Do you have the case where n°2 can retrieve URLs for the very same website? In this case, I would store all urls, sort them as list in a dict for either spider, and roll this again until there are not new element left to the lists to explore. That makes it better as it is more flexible, in my opinion.
Calling just in time might be ok, but depending on your flow, it could make performance poor as multiple calls to the same functions will probably lose lots of time initializing things.
You might also want to make analytical functions independent of the spider in order to make them available to both as you see fit. If your code is very long and complicated, it might help making it lighter and clearer. I know it is not always avoidable to do so, but that might be worth a try and you might end up being more efficient at code level.

how to scrawl file hosting website with scrapy in python?

Can anyone help me to figure out how to scrawl file hosting website like filefactory.com? I don't want to download all the file hosted but just to index all available files with scrapy.
I have read the tutorial and docs with respect to spider class for scrapy. If I only give the website main page as the begining url I wouldn't not scrawl the whole site, because the scrawling depends on links but the begining page seems not point to any file pages. That's the problem I am thinking and any help would be appreciated!

I have two pieces of advise. The first is to ensure that you are using Scrapy correctly, and the second pertains to the best way to collect a larger sample of the URLs.
First:
Make sure you are using the CrawlSpider to crawl the website. This is what most people use when they want to take all the links on a crawled page and turn them into new requests for Scrapy to crawl. See http://doc.scrapy.org/en/latest/topics/spiders.html for more information on the crawl spider.
If you build the crawl spider correctly, it should be able to find, and then crawl, the majority all the links that each page has.
However, if the pages that contain the download links are not themselves linked to by pages that Scrapy is encountering, then there is no way that Scrapy can know about them.
One way to counter this would be to use multiple entry points on the website, in the areas you know that Scrapy is having difficulty finding. You can do this by putting multiple initial urls in the start_urls variable.
Secondly
Since it is likely that this is already what you were doing, here is my next bit of advice.
If you go onto Google, and type site:www.filefactory.com , you will see a link to every page that Google has indexed for www.filefactory.com. Make sure you also check site:filefactory.com because there are some canonicalization issues. Now, when I did this, I saw that there were around 600,000 pages indexed. What you should do is crawl Google, and collect all of these indexed urls first, and store them in a database. Then, use all of these to seed further searches on the FileFactory.com website.
Also
If you have a membership to Filefactory.com, you can also program scrapy to submit forms or sign in. Doing this might allow you even further access.

Read all pages within a domain

I am using the urllib library to fetch pages. Typically I have the top-level domain name & I wish to extract some information from EVERY page within that domain. Thus, if I have xyz.com, I'd like my code to fetch the data from xyz.com/about etc. Here's what I am using:
import urllib,re
htmlFile = urllib.urlopen("http://www.xyz.com/"+r"(.*)")
html = htmlFile.read()
...............
This doe not do the trick for me though. Any ideas are appreciated.
Thanks.
-T

I don't know why you would expect domain.com/(.*) to work. You need to have a list of all the pages (dynamic or static) within that domain. Your python program cannot automatically know that. This knowledge you must obtain from elsewhere, either by following links or looking at the sitemap of the website.
As a footnote, scraping is a slightly shady business. Always make sure, no matter what method you employ, that you are not violating any terms and conditions.

You are trying to use a regular expression on the web server. Turns out, web servers don't actually support this kind of format, so it's failing.
To do what you're trying to, you need to implement a spider. A program that will download a page, find all the links within it, and decide which of them to follow. Then, downloads each of those pages, and repeats.
Some things to watch out for - looping, multiple links that end up pointing at the same page, links going outside of the domain, and getting banned from the webserver for spamming it with 1000s of requests.

In addition to #zigdon answer I recommend you to take a look at scrapy framework.
CrawlSpider will help you to implement crawling quite easily.

Scrapy has this functionality built in. No recursively getting links. It asynchronously automatically handles all the heavy lifting for you. Just specify your domain and search terms and how deep you want it to search in the page .ie the whole site.
http://doc.scrapy.org/en/latest/index.html

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.