How to crawl a file hosting website with Scrapy in Python?

Can anyone help me figure out how to crawl a file hosting website like filefactory.com? I don't want to download all the hosted files, just index all the available files with Scrapy.
I have read the tutorial and the docs on Scrapy's spider classes. If I only give the site's main page as the start URL, I won't crawl the whole site, because crawling depends on links and the start page does not seem to point to any file pages. That's the problem as I see it, and any help would be appreciated!

I have two pieces of advice. The first is to ensure that you are using Scrapy correctly, and the second pertains to the best way to collect a larger sample of the URLs.
First:
Make sure you are using the CrawlSpider to crawl the website. This is what most people use when they want to take all the links on a crawled page and turn them into new requests for Scrapy to crawl. See http://doc.scrapy.org/en/latest/topics/spiders.html for more information on the crawl spider.
If you build the crawl spider correctly, it should be able to find, and then crawl, the majority of the links on each page.
However, if the pages that contain the download links are not themselves linked to by pages that Scrapy is encountering, then there is no way that Scrapy can know about them.
One way to counter this would be to use multiple entry points on the website, in the areas you know that Scrapy is having difficulty finding. You can do this by putting multiple initial urls in the start_urls variable.
Secondly
Since it is likely that this is already what you were doing, here is my next bit of advice.
If you go onto Google, and type site:www.filefactory.com , you will see a link to every page that Google has indexed for www.filefactory.com. Make sure you also check site:filefactory.com because there are some canonicalization issues. Now, when I did this, I saw that there were around 600,000 pages indexed. What you should do is crawl Google, and collect all of these indexed urls first, and store them in a database. Then, use all of these to seed further searches on the FileFactory.com website.
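Because the www and non-www results overlap, it may help to normalize the collected URLs before using them as seeds. Here is a rough sketch using only the standard library. The function name canonical_seed_urls is made up, and treating the www and bare hosts as the same site is an assumption you should verify for the site in question.

```python
from urllib.parse import urlsplit, urlunsplit


def canonical_seed_urls(urls):
    """Drop a leading 'www.' and any fragment, then remove duplicates.

    The original order of first occurrences is preserved.
    """
    seen = set()
    result = []
    for url in urls:
        parts = urlsplit(url.strip())
        host = parts.netloc.lower()
        if host.startswith("www."):
            host = host[4:]
        # An empty path and "/" refer to the same page, so use "/".
        canonical = urlunsplit(
            (parts.scheme, host, parts.path or "/", parts.query, "")
        )
        if canonical not in seen:
            seen.add(canonical)
            result.append(canonical)
    return result
```

The returned list can then be stored in your database or assigned to start_urls.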
Also
If you have a membership to Filefactory.com, you can also program scrapy to submit forms or sign in. Doing this might allow you even further access.

Related

Using a Single Web Crawler to Scrape Multiple websites in a predefined format with attachments?

I have a list of approx. 52 websites which lead to approx. 150 webpages that I need to scrape. Out of ignorance and lack of research, I started building a crawler per webpage, which is becoming too difficult to complete and maintain.
Based on my analysis so far, I already know what information I want to scrape from each webpage, and it is clear that these websites each have their own structure. On the plus side, I noticed that each website has some commonalities in structure among its own webpages.
My million dollar question: is there a single technique or a single web crawler that I can use to scrape these sites? I already know the information that I want, these sites rarely change their web structure, and most of them have documents that need to be downloaded.
Alternatively, is there a better solution that will reduce the number of web crawlers that I need to build? Additionally, these web crawlers will only be used to download new information from the websites that I am aiming them at.

[…] I started building a crawler per webpage, which is becoming too difficult to complete and maintain […] it is clear that these websites each have their own structure. […] these sites rarely change their web structure […]
If websites have different structures, having separate spiders makes sense, and should make maintenance easier in the long term.
You say completing new spiders (I assume you mean developing them, not crawling or something else) is becoming difficult. However, if a new spider is similar to an existing one, you can simply copy and paste the most similar existing spider and make only the necessary changes.
Maintenance should be easiest with separate spiders for different websites. If a single website changes, you can fix the spider for that website. If you have a spider for multiple websites, and only one of them changes, you need to make sure that your changes for the modified website do not break the rest of the websites, which can be a nightmare.
Also, since you say website structures do not change often, maintenance should not be that hard in general.
If you notice you are repeating a lot of code, you might be able to extract some shared code into a spider middleware, a downloader middleware, an extension, an item loader, or even a base spider class shared by two or more spiders. But I would not try to use a single Spider subclass to scrape multiple different websites that are likely to evolve separately.

I suggest you extract specific tags such as body, h1 to h6, p, and so on for each link. You can gather all the p tags and associate them with the link they came from. The same approach works for any other tag you want to extract. You can also store the links found within those tags in your database.

Sitemap creation with Scrapy

Is it possible to use Scrapy to generate a sitemap of a website including the URL of each page and its level/depth (the number of links I need to follow from the home page to get there)? The format of the sitemap doesn't have to be XML, it's just about the information. Furthermore I'd like to save the complete HTML source of the crawled pages for further analysis instead of scraping only certain elements from it.
Could somebody experienced in using Scrapy tell me whether this is a possible/reasonable scenario for Scrapy and give me some hints on how to find instructions? So far I could only find far more complex scenarios but no approach for this seemingly simple problem.
Addon for experienced webcrawlers: Given it is possible, do you think Scrapy is even the right tool for this? Or would it be easier to write my own crawler with a library like requests etc.?

Yes, it's possible to do what you're trying with Scrapy's LinkExtractor class. This will help you collect the URLs of all of the pages on your site.
Once this is done, you can iterate through the URLs and fetch the source (HTML) of each page using Python's urllib library.
Then you can use regular expressions to find whatever patterns you're looking for within the HTML of each page in order to perform your analysis.
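For the depth part of the question, the level of each page falls out naturally from a breadth-first traversal. The following is only a sketch under assumptions: build_sitemap and fetch_links are hypothetical names, and fetch_links stands for whatever function you use to download a page and return the URLs it links to (with LinkExtractor, urllib, or anything else).

```python
from collections import deque


def build_sitemap(start_url, fetch_links, max_depth=3):
    """Return {url: depth}, where depth is the number of links followed
    from start_url to reach that page by the shortest path."""
    depths = {start_url: 0}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if depths[url] >= max_depth:
            # Record the page but do not expand it any further.
            continue
        for link in fetch_links(url):
            if link not in depths:  # first visit is the shortest path
                depths[link] = depths[url] + 1
                queue.append(link)
    return depths
```

If you stay within Scrapy instead, its depth middleware records the same number for each response in response.meta["depth"], and response.text holds the full HTML that you can save for later analysis.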

Does Scrapy 'know' when it has crawled an entire site?

I have used Beautiful Soup with great success when crawling single pages of a site, but I have a new project in which I have to check a large list of sites to see if they contain a mention or a link to my site. Therefore, I need to check the entire site of each site.
With BS I just don't know yet how to tell my scraper that it is done with a site, so I'm hitting recursion limits. Is that something Scrapy handles out of the box?

Scrapy uses a link follower to traverse through a site, until the list of available links is gone. Once a page is visited, it's removed from the list and Scrapy makes sure that link is not visited again.
Assuming all of the website's pages are linked from other pages, Scrapy would be able to visit every page of the website.
I've used Scrapy to traverse thousands of websites, mainly small businesses, and have had no problems. It's able to walk through the whole site.

I am afraid no one can know when an entire site has been crawled. Can you say when you have crawled all of Facebook, for example? That is because of dynamically generated and cross-linked pages.
Setting a recursion limit is the only way to define a boundary at which you stop. But you can minimise the number of duplicate pages: use the page link or a CRC of the page text as an identifier and check whether it is unique.
You can do something like this in your parse method:
if some_id not in set_of_all_page_ids:
    set_of_all_page_ids.add(some_id)
    yield scrapy.Request(response.urljoin(next_page_url))
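If you go with a CRC of the page text as the identifier, a minimal sketch using zlib.crc32 from the standard library could look like this. The names is_new_page and seen_checksums are illustrative. Keep in mind that a 32-bit checksum can collide, so two different pages may occasionally be treated as duplicates.

```python
import zlib


def is_new_page(page_text, seen_checksums):
    """Return True the first time a given page text is seen.

    seen_checksums is a set that is updated in place.
    """
    checksum = zlib.crc32(page_text.encode("utf-8"))
    if checksum in seen_checksums:
        return False
    seen_checksums.add(checksum)
    return True
```

In a spider, you would call this with response.text before yielding further requests from that page.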

Is a Web Crawler more suitable?

TL;DR Version :
I have only heard about web crawlers in intellectual conversations I'm not part of. All I want to know is whether they can follow a specific path like:
first page (has lot of links) -->go to links specified-->go to
links(specified, yes again)-->go to certain link-->reach final page
and download source.
I have googled a bit and came across Scrapy. But I am not sure if I fully understand web crawlers to begin with, or whether Scrapy can help me follow the specific path I want.
Long Version
I wanted to extract some text from a group of static web pages. These web pages are very simple, with just basic HTML. I used Python and urllib to access the URL, extract the text, and work with it. Pretty soon I realized that I would basically have to visit all these pages and copy-paste each URL into my program, which is tiresome. I wanted to know if this is more suitable for a web crawler. I want to access this page, then select only a few organisms (I have a list of those). Clicking on one of them shows this page. Under the table "MTases active in the genome" there are Enzymes, which are hyperlinks. Clicking on those leads to this page. On the right-hand side there is a link named Sequence Data. Once clicked, it leads to a page that has a small table on the lower right with yellow headers. Under it there is an entry DNA (FASTA STYLE). Clicking on view leads to the page I'm interested in and want to download the page source from.

I think you are definitely on the right track in looking at a web crawler to help you do this. You can also look at Norconex HTTP Collector, which I know lets you follow links on a page without storing that page if it is just a listing page to you. That crawler lets you filter out pages after their links have been extracted to be followed. Ultimately, you can configure the right filters so that only the pages matching the pattern you want get downloaded for you to process (whether it is based on crawl depth, URL pattern, content pattern, etc.).

Read all pages within a domain

I am using the urllib library to fetch pages. Typically I have the top-level domain name & I wish to extract some information from EVERY page within that domain. Thus, if I have xyz.com, I'd like my code to fetch the data from xyz.com/about etc. Here's what I am using:
import urllib,re
htmlFile = urllib.urlopen("http://www.xyz.com/"+r"(.*)")
html = htmlFile.read()
...............
This does not do the trick for me though. Any ideas are appreciated.
Thanks.
-T

I don't know why you would expect domain.com/(.*) to work. You need to have a list of all the pages (dynamic or static) within that domain. Your python program cannot automatically know that. This knowledge you must obtain from elsewhere, either by following links or looking at the sitemap of the website.
As a footnote, scraping is a slightly shady business. Always make sure, no matter what method you employ, that you are not violating any terms and conditions.

You are trying to use a regular expression on the web server. Turns out, web servers don't actually support this kind of format, so it's failing.
To do what you're trying to do, you need to implement a spider: a program that downloads a page, finds all the links within it, and decides which of them to follow. It then downloads each of those pages and repeats.
Some things to watch out for - looping, multiple links that end up pointing at the same page, links going outside of the domain, and getting banned from the webserver for spamming it with 1000s of requests.
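A rough sketch of the "find the links and decide which to follow" step, using only the standard library, might look like the following. LinkCollector and links_to_follow are made-up names, and the policy shown (same domain only, fragments ignored, nothing visited twice) is just one reasonable way to handle the pitfalls listed above.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def links_to_follow(page_url, html, visited):
    """Return same-domain links on the page that are not in visited."""
    collector = LinkCollector()
    collector.feed(html)
    domain = urlsplit(page_url).netloc
    result = []
    for href in collector.hrefs:
        # Resolve relative links and drop "#fragment" so that several
        # links to the same page collapse into one URL.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        if urlsplit(absolute).netloc != domain:
            continue  # link leaves the domain
        if absolute in visited or absolute in result:
            continue  # avoids loops and duplicates
        result.append(absolute)
    return result
```

The surrounding loop would add each returned URL to visited, fetch it, and pause between requests (for example with time.sleep) so that the server is not flooded.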

In addition to @zigdon's answer, I recommend you take a look at the Scrapy framework.
CrawlSpider will help you to implement crawling quite easily.

Scrapy has this functionality built in, so there is no need to collect links recursively yourself. It works asynchronously and handles all the heavy lifting for you. Just specify your domain, what you are looking for, and how deep you want it to crawl, i.e. the whole site.
http://doc.scrapy.org/en/latest/index.html
