Is it possible to use Scrapy to generate a sitemap of a website including the URL of each page and its level/depth (the number of links I need to follow from the home page to get there)? The format of the sitemap doesn't have to be XML, it's just about the information. Furthermore I'd like to save the complete HTML source of the crawled pages for further analysis instead of scraping only certain elements from it.
Could somebody experienced in using Scrapy tell me whether this is a possible/reasonable scenario for Scrapy and give me some hints on how to find instructions? So far I could only find far more complex scenarios but no approach for this seemingly simple problem.
Add-on for experienced web crawlers: given that it is possible, do you think Scrapy is even the right tool for this? Or would it be easier to write my own crawler with a library like requests, etc.?
Yes, it's possible to do what you're trying with Scrapy's LinkExtractor. It will help you record the URLs of all of the pages on the site.
Once this is done, you can iterate through the URLs and fetch the source (HTML) of each page using the urllib Python library.
Then you can use regular expressions to find whatever patterns you're looking for within the HTML of each page in order to perform your analysis.
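As a rough illustration, a minimal CrawlSpider along these lines might look like the sketch below (the domain and the output fields are placeholders, not from the question). The crawl depth comes from Scrapy's built-in DepthMiddleware, and the full page source is already available on each response, so you can store it directly or re-fetch pages later with urllib as described above.

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class SiteMapSpider(CrawlSpider):
    name = "sitemap_depth"
    allowed_domains = ["example.com"]          # placeholder domain
    start_urls = ["http://www.example.com/"]

    # Follow every internal link and hand each crawled page to parse_item
    rules = (Rule(LinkExtractor(), callback="parse_item", follow=True),)

    def parse_item(self, response):
        yield {
            "url": response.url,
            "depth": response.meta.get("depth", 0),  # hops from the start URL
            "html": response.text,                   # complete page source
        }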
Related
We know that most websites have sitemaps which contain all the major categories of the site. Now I have a list of different sitemaps' URLs (more than 100K) and I wish to extract a specific category's URL from all the different sitemaps that I have. For example, suppose I am visiting Microsoft's sitemap and there's a section called news; I can simply use xpath to get that URL, but this only works for one site. What if I have a huge number of sites and I want to extract all 'news' URLs from them, as long as they exist? My first thought was training a model to recognize news. However, I am very new to machine learning; if this is the way to solve it, can someone explain to me how to approach this? What steps would need to be taken? Or if this is not the best way, any other suggestions? Thank you.
If you are actually targeting news sites, there is a library for this called newspaper3k: https://github.com/codelucas/newspaper/
You can extract all news links with something like this.
response.css(':contains("News")::attr(href)').extract()
You can use xpath to make the above call a little better and ignore case if necessary.
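For instance, a case-insensitive variant using XPath's translate() might look like this (a sketch that, like the CSS call above, assumes the pages contain plain HTML anchor links):

# Match anchors whose text contains "news" in any combination of case
response.xpath(
    '//a[contains(translate(text(), "NEWS", "news"), "news")]/@href'
).extract()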
I imagine there are many other links, and you want to extract them from all of the sitemaps. You can use CrawlSpider and LinkExtractor rules to crawl these sitemaps.
See this answer Scrapy - Understanding CrawlSpider and LinkExtractor
Can anyone help me figure out how to crawl a file hosting website like filefactory.com? I don't want to download all the files hosted, just to index all available files with Scrapy.
I have read the tutorial and the docs on Scrapy's spider class. If I only give the website's main page as the beginning URL, it won't crawl the whole site, because the crawling depends on links, and the beginning page doesn't seem to point to any file pages. That's the problem I'm thinking about, and any help would be appreciated!
I have two pieces of advice. The first is to ensure that you are using Scrapy correctly, and the second pertains to the best way to collect a larger sample of the URLs.
First:
Make sure you are using the CrawlSpider to crawl the website. This is what most people use when they want to take all the links on a crawled page and turn them into new requests for Scrapy to crawl. See http://doc.scrapy.org/en/latest/topics/spiders.html for more information on the crawl spider.
If you build the crawl spider correctly, it should be able to find, and then crawl, the majority of the links that each page has.
However, if the pages that contain the download links are not themselves linked to by pages that Scrapy is encountering, then there is no way that Scrapy can know about them.
One way to counter this would be to use multiple entry points on the website, in the areas you know that Scrapy is having difficulty finding. You can do this by putting multiple initial urls in the start_urls variable.
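A rough sketch of such a spider (the extra start URLs and the URL pattern for file pages are placeholders; adjust them to the site's real structure):

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class FileIndexSpider(CrawlSpider):
    name = "filefactory"
    allowed_domains = ["filefactory.com"]
    # Multiple entry points into sections the crawler otherwise misses
    start_urls = [
        "http://www.filefactory.com/",
        "http://www.filefactory.com/some-section/",      # placeholder
        "http://www.filefactory.com/another-section/",   # placeholder
    ]

    rules = (
        # Index pages whose URL looks like a file page (placeholder pattern)
        Rule(LinkExtractor(allow=r"/file/"), callback="parse_file", follow=True),
        # Keep following every other internal link
        Rule(LinkExtractor(), follow=True),
    )

    def parse_file(self, response):
        yield {"url": response.url}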
Second:
Since it is likely that this is already what you were doing, here is my next bit of advice.
If you go onto Google, and type site:www.filefactory.com , you will see a link to every page that Google has indexed for www.filefactory.com. Make sure you also check site:filefactory.com because there are some canonicalization issues. Now, when I did this, I saw that there were around 600,000 pages indexed. What you should do is crawl Google, and collect all of these indexed urls first, and store them in a database. Then, use all of these to seed further searches on the FileFactory.com website.
Also
If you have a membership to Filefactory.com, you can also program Scrapy to submit forms or sign in. Doing this might allow you even further access.
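As a sketch of the sign-in part (the login URL and form field names here are invented; inspect the real login form to find them):

import scrapy

class MemberSpider(scrapy.Spider):
    name = "filefactory_member"
    start_urls = ["http://www.filefactory.com/member/login.php"]  # hypothetical login page

    def parse(self, response):
        # Fill in and submit the login form found on the page
        return scrapy.FormRequest.from_response(
            response,
            formdata={"email": "you@example.com", "password": "secret"},  # assumed field names
            callback=self.after_login,
        )

    def after_login(self, response):
        # From here on, every request carries the authenticated session cookies
        yield scrapy.Request("http://www.filefactory.com/", callback=self.parse_site)

    def parse_site(self, response):
        yield {"url": response.url}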
I am using the urllib library to fetch pages. Typically I have the top-level domain name & I wish to extract some information from EVERY page within that domain. Thus, if I have xyz.com, I'd like my code to fetch the data from xyz.com/about etc. Here's what I am using:
import urllib,re
htmlFile = urllib.urlopen("http://www.xyz.com/"+r"(.*)")
html = htmlFile.read()
...............
This does not do the trick for me though. Any ideas are appreciated.
Thanks.
-T
I don't know why you would expect domain.com/(.*) to work. You need to have a list of all the pages (dynamic or static) within that domain. Your python program cannot automatically know that. This knowledge you must obtain from elsewhere, either by following links or looking at the sitemap of the website.
As a footnote, scraping is a slightly shady business. Always make sure, no matter what method you employ, that you are not violating any terms and conditions.
You are effectively sending a regular expression to the web server. Web servers don't interpret URLs as patterns; the "(.*)" is just treated as a literal path, so the request fails.
To do what you're trying to, you need to implement a spider: a program that downloads a page, finds all the links within it, and decides which of them to follow. It then downloads each of those pages and repeats.
Some things to watch out for: loops, multiple links that end up pointing at the same page, links going outside of the domain, and getting banned from the web server for spamming it with thousands of requests.
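In outline, such a spider might look something like this bare-bones Python 3 sketch (single-threaded, with no politeness delay or robots.txt handling, so treat it as a starting point only):

from urllib.parse import urljoin, urlparse
from urllib.request import urlopen
from bs4 import BeautifulSoup

start = "http://www.xyz.com/"
domain = urlparse(start).netloc
seen, queue = set(), [start]

while queue:
    url = queue.pop(0)
    if url in seen:
        continue                      # avoid loops and duplicate pages
    seen.add(url)
    html = urlopen(url).read()
    # ... extract whatever data you need from `html` here ...
    for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
        link = urljoin(url, a["href"]).split("#")[0]
        if urlparse(link).netloc == domain:   # stay inside the domain
            queue.append(link)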
In addition to @zigdon's answer, I recommend you take a look at the Scrapy framework.
CrawlSpider will help you implement crawling quite easily.
Scrapy has this functionality built in. There's no need to recursively fetch links yourself; it asynchronously handles all the heavy lifting for you. Just specify your domain, your search terms, and how deep you want it to search, i.e. the whole site.
http://doc.scrapy.org/en/latest/index.html
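For example, limiting how deep the crawl goes is just a setting (a sketch; these lines go in settings.py or in a spider's custom_settings):

DEPTH_LIMIT = 3        # 0, the default, means no limit, i.e. the whole site
DOWNLOAD_DELAY = 1.0   # be polite and avoid hammering the server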
I'd like to build a webapp to help other students at my university create their schedules. To do that I need to crawl the master schedules (one huge HTML page) as well as the linked detailed description for each course into a database, preferably in Python. Also, I need to log in to access the data.
How would that work?
What tools/libraries can/should I use?
Are there good tutorials on that?
How do I best deal with binary data (e.g. pretty pdf)?
Are there already good solutions for that?
requests for downloading the pages.
Here's an example of how to login to a website and download pages: https://stackoverflow.com/a/8316989/311220
lxml for scraping the data.
If you want to use a powerful scraping framework there's Scrapy. It has some good documentation too. It may be a little overkill depending on your task though.
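A minimal sketch of that requests + lxml combination (the login URL, form field names, and XPath are placeholders for whatever your university's site actually uses):

import requests
from lxml import html

session = requests.Session()
# Log in once; the session object keeps the cookies for later requests
session.post(
    "https://university.example/login",                    # placeholder URL
    data={"username": "student", "password": "secret"},    # placeholder field names
)

page = session.get("https://university.example/schedule")  # placeholder URL
tree = html.fromstring(page.content)
# Adapt this XPath to the markup of the real schedule page
course_links = tree.xpath('//table[@id="schedule"]//a/@href')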
Scrapy is probably the best Python library for crawling. It can maintain state for authenticated sessions.
Dealing with binary data should be handled separately. For each file type, you'll have to handle it differently according to your own logic. For almost any kind of format, you'll probably be able to find a library. For instance take a look at PyPDF for handling PDFs. For Excel files you can try xlrd.
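A rough sketch of that kind of per-format dispatch (this assumes the pypdf package, the current successor of the older PyPDF, alongside xlrd):

from pypdf import PdfReader
import xlrd

def extract_text(path):
    # Pick a handler based on the file extension; extend as needed
    if path.lower().endswith(".pdf"):
        reader = PdfReader(path)
        return "\n".join(page.extract_text() or "" for page in reader.pages)
    if path.lower().endswith(".xls"):
        sheet = xlrd.open_workbook(path).sheet_by_index(0)
        return "\n".join(
            "\t".join(str(cell.value) for cell in sheet.row(r))
            for r in range(sheet.nrows)
        )
    raise ValueError("No handler for %s" % path)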
I liked using BeautifulSoup for extracting HTML data.
It's as easy as this:
from bs4 import BeautifulSoup          # current package name (formerly: from BeautifulSoup import BeautifulSoup)
from urllib.request import urlopen     # Python 3 replacement for urllib.urlopen

# Fetch and parse the RSS feed
ur = urlopen("http://pragprog.com/podcasts/feed.rss")
soup = BeautifulSoup(ur.read(), "html.parser")
# Collect the enclosure URL from every <item> in the feed
items = soup.find_all('item')
urls = [item.enclosure['url'] for item in items]
For this purpose there is a very useful tool called web-harvest
Link to their website http://web-harvest.sourceforge.net/
I use this to crawl webpages
I have recently started to work with Scrapy. I am trying to gather some info from a large list which is divided into several pages (about 50). I can easily extract what I want from the first page by including that page in the start_urls list. However, I don't want to add the links to all 50 pages to this list. I need a more dynamic way. Does anyone know how I can iteratively scrape web pages? Does anyone have any examples of this?
Thanks!
Use urllib2 to download a page. Then use either re (regular expressions) or BeautifulSoup (an HTML parser) to find the link to the next page you need. Download that with urllib2. Rinse and repeat.
Scrapy is great, but you don't need it to do what you're trying to do.
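A rough sketch of that loop, using Python 3's urllib.request in place of urllib2 (the start URL and the "Next" link text are placeholders for whatever the site actually uses):

from urllib.parse import urljoin
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://www.example.com/list?page=1"   # placeholder start page
while url:
    soup = BeautifulSoup(urlopen(url).read(), "html.parser")
    # ... pull the items you need out of `soup` here ...
    nxt = soup.find("a", string="Next")      # placeholder "next page" link
    url = urljoin(url, nxt["href"]) if nxt else None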
Why don't you want to add the links to all 50 pages? Are the URLs of the pages consecutive, like www.site.com/page=1, www.site.com/page=2, or are they all distinct? Can you show me the code that you have now?