I have used Beautiful Soup with great success when crawling single pages of a site, but I have a new project in which I have to check a large list of sites to see whether they contain a mention of, or a link to, my site. Therefore, I need to crawl every page of each site.
With BS I just don't know yet how to tell my scraper that it is done with a site, so I'm hitting recursion limits. Is that something Scrapy handles out of the box?
Scrapy uses a link follower to traverse through a site, until the list of available links is gone. Once a page is visited, it's removed from the list and Scrapy makes sure that link is not visited again.
Assuming every page of a website is linked from at least one other page, Scrapy will be able to visit every page of that website.
I've used Scrapy to traverse thousands of websites, mainly small businesses, and have had no problems. It's able to walk through the whole site.
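The traversal described above can be sketched without Scrapy as a queue of pending links plus a set of links already scheduled. This is only an illustration of the idea, not Scrapy's actual scheduler; the `SITE` dictionary and the `crawl` function are hypothetical stand-ins for real pages and real HTTP requests.

```python
from collections import deque

# Hypothetical in-memory site: each URL maps to the links found on that page.
SITE = {
    "/": ["/about", "/products"],
    "/about": ["/", "/contact"],
    "/products": ["/products/1", "/"],
    "/products/1": ["/products"],
    "/contact": [],
    "/orphan": ["/"],  # linked from nowhere, so a crawl never reaches it
}


def crawl(start, get_links):
    """Visit every page reachable from `start` exactly once."""
    seen = {start}             # URLs already scheduled; never queued twice
    frontier = deque([start])  # URLs waiting to be visited
    visited = []
    while frontier:
        url = frontier.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


pages = crawl("/", lambda url: SITE.get(url, []))
```

The crawl ends on its own once the frontier is empty, and `/orphan` is never visited because no reachable page links to it.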
I am afraid no one can know when an entire site has been crawled. Can you say when you have crawled all of Facebook, for example? That is because pages are dynamically generated and cross-linked.
Setting a recursion (depth) limit is the only way to define a boundary at which the crawl stops. But you can minimise the number of duplicate pages: use the page link or a CRC of the page text as an identifier and check whether it is unique.
You can do something like this in your parse method:
if some_id not in set_of_all_page_ids:
    set_of_all_page_ids.add(some_id)
    yield scrapy.Request(response.urljoin(next_page_url))
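One possible way to compute such an identifier from the page text is sketched below using only the standard library. The names `page_id`, `seen_ids` and `is_new_page` are hypothetical, and collapsing whitespace before hashing is an assumption about what should count as "the same page".

```python
import zlib

seen_ids = set()  # identifiers of every page body seen so far


def page_id(text):
    """Return a CRC32 checksum of the page text with whitespace collapsed."""
    normalized = " ".join(text.split())
    return zlib.crc32(normalized.encode("utf-8"))


def is_new_page(text):
    """Return True the first time a page body is seen, False afterwards."""
    some_id = page_id(text)
    if some_id in seen_ids:
        return False
    seen_ids.add(some_id)
    return True
```

CRC32 is only 32 bits wide, so two different pages can occasionally collide; for very large crawls a longer digest from `hashlib` may be a safer identifier.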
I'm using scrapy/spyder to build my crawler, along with BeautifulSoup. I have been working on a crawler and believe it now works as expected on the few individual pages we have scraped, so my next challenge is to scrape the same site, but ONLY pages that are specific to a high-level category.
The only thing I have tried is using allowed_domains and start_urls, but when I did that, it was literally hitting every page it found, and we want to control which pages we scrape so we have a clean list of information.
I understand that each page has links that take you away from the current page and can end up elsewhere on the site, but what I'm trying to do is focus only on a few pages within each category.
# allowed_domains = ['dickssportinggoods.com']
# start_urls = ['https://www.dickssportinggoods.com/c/mens-top-trends-gear']
You can either base your spider on the Spider class and code the navigation yourself, or base it on the CrawlSpider class and use rules to control which pages get visited. From the information you provided, the latter approach seems more appropriate for your requirement. Check out the example to see how the rules work.
Is it possible to use Scrapy to generate a sitemap of a website including the URL of each page and its level/depth (the number of links I need to follow from the home page to get there)? The format of the sitemap doesn't have to be XML, it's just about the information. Furthermore I'd like to save the complete HTML source of the crawled pages for further analysis instead of scraping only certain elements from it.
Could somebody experienced in using Scrapy tell me whether this is a possible/reasonable scenario for Scrapy and give me some hints on how to find instructions? So far I could only find far more complex scenarios but no approach for this seemingly simple problem.
Addon for experienced webcrawlers: Given it is possible, do you think Scrapy is even the right tool for this? Or would it be easier to write my own crawler with a library like requests etc.?
Yes, it's possible to do what you're describing with Scrapy's LinkExtractor class. This will help you collect the URLs of all the pages on your site.
Once this is done, you can iterate through the URLs and fetch the HTML source of each page using Python's urllib library.
Then you can use regular expressions to find whatever patterns you're looking for within the HTML of each page in order to perform your analysis.
TL;DR Version:
I have only heard about web crawlers in intellectual conversations I'm not part of. All I want to know is whether they can follow a specific path like:
first page (has a lot of links) --> go to specified links --> go to links (specified, yes, again) --> go to a certain link --> reach the final page and download its source.
I have googled a bit and came across Scrapy. But I am not sure whether I fully understand web crawlers to begin with, or whether Scrapy can help me follow the specific path I want.
Long Version
I wanted to extract some text from a group of static web pages. These web pages are very simple, with just basic HTML. I used Python and urllib to access each URL, extract the text, and work with it. Pretty soon I realized that I would basically have to visit all these pages and copy and paste each URL into my program, which is tiresome. I wanted to know whether this is a better fit for a web crawler. I want to access this page, then select only a few organisms (I have a list of those). Clicking on one of them shows this page. Under the table "MTases active in the genome" there are enzymes, which are hyperlinks. Clicking on those leads to this page. On the right-hand side there is a link named Sequence Data. Once clicked, it leads to a page which has a small table on the lower right with yellow headers. Under it there is an entry DNA (FASTA STYLE). Clicking on view leads to the page I'm interested in and want to download the page source from.
I think you are definitely on the right track in looking at a web crawler to help you do this. You can also look at Norconex HTTP Collector, which I know lets you follow links on a page without storing that page if it is just a listing page to you. That crawler lets you filter out pages after their links have been extracted to be followed. Ultimately, you can configure the right filters so that only the pages matching the pattern you want get downloaded for you to process (whether it is based on crawl depth, URL pattern, content pattern, etc.).
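The idea of following links through listing pages without keeping them, and storing only the final pages, can be sketched in plain Python. This is not how Norconex is configured; `LinkCollector`, `follow_path` and the `fetch` callable are hypothetical names, and matching each level by link text is an assumption about how the wanted links can be recognised.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect (absolute URL, link text) pairs from an HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def follow_path(start_url, steps, fetch):
    """Apply one link-text pattern per level; return only the final pages."""
    urls = [start_url]
    for pattern in steps:
        next_urls = []
        for url in urls:
            collector = LinkCollector(url)
            collector.feed(fetch(url))  # intermediate pages are not stored
            next_urls += [u for u, text in collector.links
                          if re.search(pattern, text)]
        urls = next_urls
    return {url: fetch(url) for url in urls}
```

Here `fetch` is any function that takes a URL and returns its HTML as a string, for instance a thin wrapper around `urllib.request.urlopen`, and `steps` could be a list such as an organism-name pattern followed by `"Sequence Data"` and `"view"`.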
I have built a crawling spider using Python Scrapy against a distributor's website. I am just trying to collect all the URLs under that domain and, for each page, the URLs listed on that page. Then I will probably use Gephi to visualize the network connections for that domain.
(1) How are the crawled URLs stored (memory or disk), and what is the crawl limit?
The crawler has been running for about 4 days, I think, and it has crawled about 700K pages.
I know that Scrapy will not crawl a page it has already crawled, but I am wondering: as the number of pages increases, is there a limit to how many pages Scrapy can "remember" having crawled? Do the crawled URLs stay in memory, or what is the mechanism behind this?
(2) Will there always be an end to crawling a single domain? What if not?
BTW, should I stop crawling right now, since I don't know when this spider will finish? I don't know whether they have some dynamic pages that make "domain crawling" an endless task. For example, they have a parametric search box, and every combination of search parameters leads to a new page (via a JavaScript call), which results in huge redundancy.
Before I knew about Scrapy, I tried to figure out the pattern in the URLs first and then generate all the URLs; after that, I went to each URL and used urllib2+bs4 to scrape it. So I am not quite sure this kind of "blind" crawling is actually controllable.
There might be some "philosophical" questions here instead of specific questions, but I would appreciate any thoughts or ideas.
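One way to keep such a crawl bounded is to set explicit limits in the spider's settings. The sketch below uses Scrapy setting names that exist, but every value and the `BOUNDED_CRAWL_SETTINGS` name are hypothetical and would need tuning for the site in question.

```python
# Hypothetical values; tune them to the site being crawled.
BOUNDED_CRAWL_SETTINGS = {
    # Ignore links more than 6 hops away from a start URL.
    "DEPTH_LIMIT": 6,
    # Close the spider after this many responses have been crawled.
    "CLOSESPIDER_PAGECOUNT": 1_000_000,
    # Or close it after four days, expressed in seconds.
    "CLOSESPIDER_TIMEOUT": 4 * 24 * 3600,
    # Persist the request queue and seen-request fingerprints on disk
    # so the crawl can be paused and resumed.
    "JOBDIR": "crawls/distributor",
}
```

The dictionary can be assigned to a spider's `custom_settings` class attribute. As far as I know, Scrapy's default duplicate filter keeps request fingerprints in an in-memory set, so memory use grows with the number of distinct requests; setting `JOBDIR` additionally writes them to disk.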
Can anyone help me figure out how to crawl a file hosting website like filefactory.com? I don't want to download all the hosted files, just index all available files with Scrapy.
I have read the tutorial and the docs on the Spider class for Scrapy. If I only give the website's main page as the beginning URL, I won't crawl the whole site, because the crawling depends on links, but the beginning page does not seem to point to any file pages. That's the problem as I see it, and any help would be appreciated!
I have two pieces of advice. The first is to ensure that you are using Scrapy correctly, and the second pertains to the best way to collect a larger sample of the URLs.
First:
Make sure you are using the CrawlSpider to crawl the website. This is what most people use when they want to take all the links on a crawled page and turn them into new requests for Scrapy to crawl. See http://doc.scrapy.org/en/latest/topics/spiders.html for more information on the crawl spider.
If you build the crawl spider correctly, it should be able to find, and then crawl, the majority of the links that each page has.
However, if the pages that contain the download links are not themselves linked to by pages that Scrapy is encountering, then there is no way that Scrapy can know about them.
One way to counter this would be to use multiple entry points on the website, in the areas you know that Scrapy is having difficulty finding. You can do this by putting multiple initial urls in the start_urls variable.
Secondly
Since it is likely that this is already what you were doing, here is my next bit of advice.
If you go onto Google, and type site:www.filefactory.com , you will see a link to every page that Google has indexed for www.filefactory.com. Make sure you also check site:filefactory.com because there are some canonicalization issues. Now, when I did this, I saw that there were around 600,000 pages indexed. What you should do is crawl Google, and collect all of these indexed urls first, and store them in a database. Then, use all of these to seed further searches on the FileFactory.com website.
Also
If you have a membership to Filefactory.com, you can also program scrapy to submit forms or sign in. Doing this might allow you even further access.