I have built a crawling spider using Python Scrapy against a distributor's website. I am just trying to collect all the URLs under that domain and, for each page, the URLs listed on that page. Then I will probably use Gephi to visualize the network connections for that domain.
(1) How are crawled URLs stored (memory or disk), and what is the crawl limit?
The crawler has been running for about 4 days now and has crawled roughly 700K pages.
I know Scrapy will not crawl a page it has already crawled, but I am wondering: as the number of pages increases, is there a limit to how many pages Scrapy can "remember" as crawled? Do the crawled URLs stay in memory, or what is the mechanism behind this?
(2) Will crawling a single domain always come to an end? What if it doesn't?
BTW, should I stop the crawl right now? I don't know when this spider will finish, and I don't know whether the site has dynamic pages that make "domain crawling" an endless task. For example, they have a parametric search box, and every combination of search parameters leads to a new page (via a JavaScript call), which in practice creates a huge amount of redundancy.
Before I knew about Scrapy, I would first figure out the pattern in the URLs and generate all of them, then visit each URL and scrape it with urllib2 + bs4. So I am not quite sure this kind of "blind" crawling is actually controllable.
These may be "philosophical" questions rather than specific ones, but I would appreciate any thoughts or ideas.
Related
I have a list of approximately 52 websites, which lead to about 150 webpages that I need to scrape. Out of ignorance and a lack of research, I started building a crawler per webpage, which is becoming too difficult to complete and maintain.
Based on my analysis so far, I already know what information I want to scrape per webpage, and it is clear that each website has its own structure. On the plus side, I noticed that within each website, the webpages share some structural commonalities.
My million-dollar question: is there a single technique or a single web crawler that I can use to scrape these sites? I already know what information I want, the sites' structures rarely change, and most of the sites have documents that need to be downloaded.
Alternatively, is there a better solution that will reduce the number of web crawlers I need to build? Additionally, these crawlers will only be used to download new information from the websites I point them at.
[…] I started building a crawler per webpage, which is becoming too difficult to complete and maintain […] it is clear that each website has its own structure. […] the sites' structures rarely change […]
If websites have different structures, having separate spiders makes sense, and should make maintenance easier in the long term.
You say completing new spiders (I assume you mean developing them, not running the crawls) is becoming difficult. However, if a new website is similar to one you already handle, you can simply copy the most similar existing spider and make only the necessary changes.
Maintenance should be easiest with separate spiders for different websites. If a single website changes, you can fix the spider for that website. If you have a spider for multiple websites, and only one of them changes, you need to make sure that your changes for the modified website do not break the rest of the websites, which can be a nightmare.
Also, since you say website structures do not change often, maintenance should not be that hard in general.
If you notice you are repeating a lot of code, you might be able to extract some shared code into a spider middleware, a downloader middleware, an extension, an item loader, or even a base spider class shared by two or more spiders. But I would not try to use a single Spider subclass to scrape multiple different websites that are likely to evolve separately.
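For example, a shared base spider could look something like this minimal sketch (the class names, selectors, and URL here are hypothetical, not taken from your sites):
import scrapy

class BaseDocumentSpider(scrapy.Spider):
    # shared extraction logic; per-site spiders only override the selectors
    title_css = "h1::text"
    file_link_css = "a::attr(href)"

    def parse(self, response):
        yield {
            "url": response.url,
            "title": response.css(self.title_css).get(),
            "files": response.css(self.file_link_css).getall(),
        }

class SiteASpider(BaseDocumentSpider):
    name = "site_a"                               # hypothetical
    start_urls = ["https://site-a.example/docs"]  # hypothetical
    title_css = "h2.doc-title::text"              # only the selector differs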
I suggest you extract specific tags, such as body, h1, h2, h3, h4, h5, h6, p and so on, for each link you crawl. For example, you can gather all the p tags and associate them with the link they came from, and do the same for any other tags you care about. You can also store the related links for each tag in your database.
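If it helps, here is a rough sketch of that idea using Scrapy selectors (the start URL is a placeholder and the tag list is just an example):
import scrapy

class TagSpider(scrapy.Spider):
    name = "tags"
    start_urls = ["https://example.com/"]  # placeholder

    def parse(self, response):
        # store the text of the tags you care about, keyed by the page URL
        yield {
            "url": response.url,
            "headings": response.css("h1::text, h2::text, h3::text").getall(),
            "paragraphs": response.css("p::text").getall(),
        }
        # follow every link found on this page
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)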
I have used Beautiful Soup with great success when crawling single pages of a site, but I have a new project in which I have to check a large list of sites to see if they contain a mention of, or a link to, my site. Therefore, I need to check each site in its entirety.
With BS I just don't know yet how to tell my scraper that it is done with a site, so I'm hitting recursion limits. Is that something Scrapy handles out of the box?
Scrapy follows links to traverse a site until there are no new links left to visit. Once a page has been visited, its URL is recorded and Scrapy makes sure it is not visited again.
Assuming all of a website's pages are linked to from other pages, Scrapy will be able to visit every page of the site.
I've used Scrapy to traverse thousands of websites, mainly small businesses, and have had no problems. It's able to walk through the whole site.
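As a rough illustration (the domains and selector below are placeholders), a CrawlSpider like the following will follow every internal link while Scrapy's built-in duplicate filter skips URLs it has already seen:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MentionSpider(CrawlSpider):
    name = "mentions"
    allowed_domains = ["target-site.example"]      # placeholder
    start_urls = ["https://target-site.example/"]  # placeholder
    # follow all internal links; the dupefilter prevents revisiting pages
    rules = (Rule(LinkExtractor(), callback="parse_page", follow=True),)

    def parse_page(self, response):
        # report pages that link to your own site (placeholder domain)
        if response.css('a[href*="mysite.example"]'):
            yield {"page": response.url}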
I'm afraid no one can tell when an entire site has been crawled. Can you say when you would have crawled all of Facebook, for example? That is because of dynamically generated and cross-linked pages.
Setting a depth or recursion limit is the only way to define a boundary at which you will stop. But you can minimise the number of duplicate pages: use the page URL, or a CRC of the page text, as an identifier and check that it is unique.
You can do something like this in your parse method:
# set_of_all_page_ids could be a set() kept on the spider instance;
# some_id is e.g. the URL or a CRC of the page text
if some_id not in set_of_all_page_ids:
    set_of_all_page_ids.add(some_id)
    yield scrapy.Request(response.urljoin(next_page_url))
I'm working on a web crawler (using scrapy) that uses 2 different spiders:
Very generic spider that can crawl (almost) any website using a bunch of heuristics to extract data.
Specialized spider capable of crawling a particular website A that can't be crawled with the generic spider because of the website's peculiar structure (that website has to be crawled).
Everything works nicely so far, but website A contains links to other, "ordinary" websites that should be scraped too (using spider 1). Is there a Scrapy way to pass the request to spider 1?
Solutions I thought about:
Moving all functionality to spider 1. But that might get really messy, spider 1 code is already very long and complicated, I'd like to keep this functionality separate, if possible.
Saving the links to a database, as suggested in Pass scraped URL's from one spider to another
Is there a better way?
I ran into a similar case, with one spider retrieving the URL addresses from a first page and a second one being called from there to do the work.
I don't know what your control flow is, but depending on it, I would either call the first spider just in time when a new URL is scraped, or only after all possible URLs have been scraped.
Do you have the case where spider 2 can retrieve URLs for the very same website? In that case, I would store all the URLs, sort them into per-spider lists in a dict, and loop over this again until there are no new elements left in the lists to explore. That is better because it is more flexible, in my opinion.
Calling just in time might be okay, but depending on your flow it could hurt performance, since repeated calls to the same functions will probably waste a lot of time on initialization.
You might also want to make the analytical functions independent of the spiders, so that both can use them as needed, as in the sketch below. If your code is very long and complicated, this can help make it lighter and clearer. It is not always possible, but it is worth a try, and you may end up with more efficient code.
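For instance, the extraction helpers could live in their own module, as in this minimal sketch (the module and function names are made up):
# extractors.py -- parsing helpers that do not depend on any spider
def extract_title(response):
    return response.css("h1::text").get()

def extract_links(response):
    return response.css("a::attr(href)").getall()

# either spider can then do:
#   from extractors import extract_title, extract_links
#   yield {"title": extract_title(response), "links": extract_links(response)}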
Can anyone help me figure out how to crawl a file hosting website like filefactory.com? I don't want to download all the hosted files, just to index all the available files with Scrapy.
I have read the tutorial and the docs on Scrapy's spider classes. If I only give the website's main page as the beginning URL, I won't crawl the whole site, because crawling depends on links and the beginning page does not seem to point to any file pages. That is the problem I am facing, and any help would be appreciated!
I have two pieces of advice. The first is to ensure that you are using Scrapy correctly, and the second concerns the best way to collect a larger sample of URLs.
First:
Make sure you are using the CrawlSpider to crawl the website. This is what most people use when they want to take all the links on a crawled page and turn them into new requests for Scrapy to crawl. See http://doc.scrapy.org/en/latest/topics/spiders.html for more information on the crawl spider.
If you build the crawl spider correctly, it should be able to find, and then crawl, the majority of the links that each page has.
However, if the pages that contain the download links are not themselves linked to by pages that Scrapy is encountering, then there is no way that Scrapy can know about them.
One way to counter this would be to use multiple entry points on the website, in the areas you know Scrapy is having difficulty finding. You can do this by putting multiple initial URLs in the start_urls variable, as in the sketch below.
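For instance (the extra start URLs below are placeholders for listing pages you have found by other means):
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class FileIndexSpider(CrawlSpider):
    name = "file_index"
    allowed_domains = ["filefactory.com"]
    start_urls = [
        "https://www.filefactory.com/",
        # placeholders: add category or listing pages you already know about
        # "https://www.filefactory.com/<some-listing-page>",
    ]
    rules = (Rule(LinkExtractor(), callback="parse_item", follow=True),)

    def parse_item(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}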
Second:
Since it is likely that this is already what you were doing, here is my next bit of advice.
If you go onto Google and type site:www.filefactory.com, you will see a link to every page that Google has indexed for www.filefactory.com. Make sure you also check site:filefactory.com, because there are some canonicalization issues. When I did this, I saw that around 600,000 pages were indexed. What you should do is crawl Google, collect all of these indexed URLs first, and store them in a database. Then use them to seed further crawls of the FileFactory.com website.
Also:
If you have a membership to FileFactory.com, you can also program Scrapy to submit forms or sign in. Doing so might give you even further access (see the sketch below).
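A hedged sketch of what that could look like with FormRequest.from_response (the login URL and the form field names are assumptions; check the actual page):
import scrapy

class MemberSpider(scrapy.Spider):
    name = "member"
    # assumption: replace with the real sign-in page URL
    start_urls = ["https://www.filefactory.com/member/signin.php"]

    def parse(self, response):
        # field names are assumptions; inspect the login form for the real ones
        yield scrapy.FormRequest.from_response(
            response,
            formdata={"email": "you@example.com", "password": "secret"},
            callback=self.after_login,
        )

    def after_login(self, response):
        # crawl members-only pages from here
        yield {"logged_in_url": response.url}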
I have a two part question.
First, I'm writing a web scraper based on the CrawlSpider spider in Scrapy. I'm aiming to scrape a website that has many thousands (possibly into the hundreds of thousands) of records. These records are buried 2-3 layers down from the start page. So basically I have the spider start on a certain page, crawl until it finds a specific type of record, and then parse the HTML. What I'm wondering is what methods exist to prevent my spider from overloading the site. Is there possibly a way to do things incrementally or put a pause in between different requests?
Second, and related, is there a method with Scrapy to test a crawler without placing undue stress on a site? I know you can kill the program while it runs, but is there a way to make the script stop after hitting something like the first page that has the information I want to scrape?
Any advice or resources would be greatly appreciated.
Is there possibly a way to do things incrementally
I'm using Scrapy's caching ability to scrape sites incrementally:
HTTPCACHE_ENABLED = True
Or you can use the Jobs feature (new in 0.14): pausing and resuming crawls.
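Roughly, that means enabling the HTTP cache in settings.py and, for pause/resume, passing a job directory when you start the crawl (the spider name and directory below are placeholders):
# settings.py
HTTPCACHE_ENABLED = True        # responses are cached on disk and reused on later runs
HTTPCACHE_EXPIRATION_SECS = 0   # 0 means cached responses never expire
To be able to pause (Ctrl-C) and resume a crawl, start it with something like scrapy crawl myspider -s JOBDIR=crawls/myspider-1.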
or put a pause in between different requests?
Check these settings:
DOWNLOAD_DELAY
RANDOMIZE_DOWNLOAD_DELAY
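In settings.py, that could look like this (the delay value is just an example):
# settings.py
DOWNLOAD_DELAY = 2               # wait roughly 2 seconds between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True  # vary the actual delay between 0.5x and 1.5x DOWNLOAD_DELAY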
is there a method with Scrapy to test a crawler without placing undue stress on a site?
You can try and debug your code in the Scrapy shell.
I know you can kill the program while it runs, but is there a way to make the script stop after hitting something like the first page that has the information I want to scrape?
Also, you can call scrapy.shell.inspect_response at any time in your spider.
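For example, dropping this into a callback pauses the crawl and opens an interactive shell with that response loaded:
from scrapy.shell import inspect_response

def parse(self, response):
    # opens a shell here; exit it to let the crawl continue
    inspect_response(response, self)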
Any advice or resources would be greatly appreciated.
Scrapy documentation is the best resource.
You have to start crawling and log everything. If you get banned, you can add a delay (e.g. sleep()) before page requests.
Changing the User-Agent is good practice, too (http://www.user-agents.org/, http://www.useragentstring.com/).
If you get banned by IP, use a proxy to get around it. Cheers.
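A minimal sketch of both ideas (the User-Agent string and the proxy address are placeholders, not recommendations):
import scrapy

class MySpider(scrapy.Spider):
    name = "myspider"                      # placeholder
    start_urls = ["https://example.com/"]  # placeholder
    # set a custom User-Agent for this spider only
    custom_settings = {"USER_AGENT": "Mozilla/5.0 (compatible; MyCrawler/1.0)"}

    def start_requests(self):
        for url in self.start_urls:
            # route the request through a proxy (handled by HttpProxyMiddleware)
            yield scrapy.Request(url, meta={"proxy": "http://127.0.0.1:8080"})

    def parse(self, response):
        yield {"url": response.url}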