Parsing a webpage for indexing - python

I am trying to understand/optimize the logic for indexing a site. I am new to the HTML/JS side of things, so I am learning as I go. While indexing a site, I recursively go deeper into it based on the links on each page. One problem is that pages contain repeating URLs and repeating text, such as the header and footer. For the URLs, I keep a list of URLs I have already processed. Is there something I can do to identify the text that repeats on each page? I hope my explanation is clear enough. I currently have the code (in Python) to get a list of useful URLs for the site. Now I am trying to index the content of these pages. Is there a preferred way to identify or skip repeating text on these pages (headers, footers, other blurb)? I am using BeautifulSoup plus the requests module.

I am not quite sure if this is what you are hoping for, but Readability is a popular service that parses just the "useful" content from a page. It is the service integrated into Safari on iOS.
It intelligently extracts the worthwhile content of the page while ignoring things like the footer, header, ads, etc.
There are open-source ports for Python, Ruby, PHP, and probably other languages.
https://github.com/buriy/python-readability
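For reference, a minimal sketch of how the Python port is typically combined with requests and BeautifulSoup (the URL below is a placeholder; install the port as readability-lxml):

import requests
from bs4 import BeautifulSoup
from readability import Document  # pip install readability-lxml

response = requests.get("https://example.com/some-article")  # placeholder URL
doc = Document(response.text)

print(doc.title())                # best-guess page title
main_html = doc.summary()         # HTML of the main content only
# Strip the remaining tags; headers/footers/ads are mostly gone already.
text = BeautifulSoup(main_html, "html.parser").get_text(separator="\n", strip=True)
print(text)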

Related

Python multiple web pages scraping with same starting url string

I am trying to read review data from the alexaskillstore.com website using BeautifulSoup. For this, I am specifying the target URL as https://www.alexaskillstore.com/Business-Leadership-Series/B078LNGS5T, where the string after Business-Leadership-Series/ keeps changing for the different skills.
I want to know how I can supply a regular expression or similar pattern as my input URL so that I am able to read every link that starts with https://www.alexaskillstore.com/Business-Leadership-Series/.
You can't. The web is client-server based, so unless the server is kind enough to map the content for you, you have no way to know which URLs will be responsive and which won't.
You may be able to scrape some index page(s) to find the keys (B078LNGS5T and the like) you need. Once you have them all, actually generating the URLs is a simple matter of string substitution.
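As an illustration of that last step, here is a rough sketch; the index page URL and the link-matching logic are assumptions, since the real listing page(s) would have to be identified first:

import requests
from bs4 import BeautifulSoup

BASE = "https://www.alexaskillstore.com/Business-Leadership-Series/"
index_url = "https://www.alexaskillstore.com/Business-Leadership-Series"  # assumed listing page

soup = BeautifulSoup(requests.get(index_url).text, "html.parser")

# Collect every link that contains the series prefix and pull out the key.
keys = set()
for a in soup.find_all("a", href=True):
    href = a["href"]
    if "Business-Leadership-Series/" in href:
        keys.add(href.rstrip("/").split("/")[-1])   # e.g. "B078LNGS5T"

# Generating the full URLs is then simple string substitution.
urls = [BASE + key for key in sorted(keys)]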

Python Scrapy - Scraping data from multiple website URLs

For one of my web projects I need to scrape data from different web sources. To keep it simple, I am explaining with an example.
Let's say I want to scrape the data about mobiles listed on their manufacturers' sites.
http://www.somebrand1.com/mobiles/
.
.
http://www.somebrand3.com/phones/
I have a huge list of URLs.
Every brand's page will have its own way of presenting the HTML to the browser.
How can I write a normalized script to traverse the HTML of those listing page URLs and scrape the data irrespective of the format they are in?
Or do I need to write a separate script to scrape data for every pattern?
This is called broad crawling and, generally speaking, it is not an easy thing to implement because of the different nature, structure, and loading mechanisms websites use.
The general idea is to have a generic spider and some sort of site-specific configuration with a mapping between item fields and the XPath expressions or CSS selectors used to retrieve the field values from the page. In real life, things are not as simple as they seem: some fields require post-processing, other fields need to be extracted after sending a separate request, etc. In other words, it is very difficult to stay generic and reliable at the same time.
The generic spider should receive a target site as a parameter, read the site-specific configuration and crawl the site according to it.
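A minimal sketch of that idea with Scrapy is shown below; the site names, start URLs, and selectors are made-up placeholders standing in for real per-site configuration:

import scrapy

SITE_CONFIGS = {
    "somebrand1": {
        "start_url": "http://www.somebrand1.com/mobiles/",
        "item_css": "div.product",                           # placeholder selectors
        "fields": {"name": "h2::text", "price": "span.price::text"},
    },
    "somebrand3": {
        "start_url": "http://www.somebrand3.com/phones/",
        "item_css": "li.phone",
        "fields": {"name": "a.title::text", "price": "em.cost::text"},
    },
}

class GenericBrandSpider(scrapy.Spider):
    name = "generic_brand"

    def __init__(self, site=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # e.g. run with: scrapy crawl generic_brand -a site=somebrand1
        self.config = SITE_CONFIGS[site]
        self.start_urls = [self.config["start_url"]]

    def parse(self, response):
        # Apply whichever field-to-selector mapping this site's config defines.
        for product in response.css(self.config["item_css"]):
            yield {
                field: product.css(selector).get()
                for field, selector in self.config["fields"].items()
            }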
Also see:
Broad Crawls

Is a Web Crawler more suitable?

TL;DR Version :
I have only heard about web crawlers in intellectual conversations I'm not part of. All I want to know is whether they can follow a specific path like:
first page (has lot of links) -->go to links specified-->go to
links(specified, yes again)-->go to certain link-->reach final page
and download source.
I have googled a bit and came across Scrapy. But I am not sure if I fully understand web crawlers to begin with, and whether Scrapy can help me follow the specific path I want.
Long Version
I wanted to extract some text from a group of static web pages. These web pages are very simple, with just basic HTML. I used Python and urllib to access the URLs, extract the text, and work with it. Pretty soon I realized that I would basically have to visit all these pages and copy-paste the URLs into my program, which is tiresome. I wanted to know if this is more suitable for a web crawler. I want to access this
page, then select only a few organisms (I have a list of those). Clicking on one of them takes you to this page. If you look under the table "MTases active in the genome", there are enzymes which are hyperlinks. Clicking on those leads to this page. On the right-hand side there is a link named Sequence Data. Once clicked, it leads to a page which has a small table on the lower right with yellow headers. Under it there is an entry, DNA (FASTA STYLE). Clicking on "view" leads to the page I'm interested in and want to download the page source from.
I think you are definitely on the right track in looking at a web crawler to help you do this. You can also look at Norconex HTTP Collector, which I know can let you follow links on a page without storing that page if it is just a listing page to you. That crawler lets you filter out pages after their links have been extracted to be followed. Ultimately, you can configure the right filters so that only the pages matching the pattern you want get downloaded for you to process (whether based on crawl depth, URL pattern, content pattern, etc.).
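To make the idea concrete, here is a rough sketch of following a fixed chain of links with requests and BeautifulSoup; the start URL and the anchor-text filters are placeholders for the actual pages described above:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def links_matching(url, text_filter):
    """Return absolute URLs of links on `url` whose anchor text contains `text_filter`."""
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    return [urljoin(url, a["href"])
            for a in soup.find_all("a", href=True)
            if text_filter.lower() in a.get_text().lower()]

start = "http://example.org/organism-list"                        # placeholder start page
for organism_url in links_matching(start, "organism name"):       # step 1: organisms on your list
    for enzyme_url in links_matching(organism_url, "enzyme"):      # step 2: the enzyme hyperlinks
        for seq_url in links_matching(enzyme_url, "Sequence Data"):    # step 3: Sequence Data link
            for fasta_url in links_matching(seq_url, "view"):          # step 4: DNA (FASTA) view
                page_source = requests.get(fasta_url).text             # the source you want to save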

How to stop infinite loops while creating a Web Site Crawler due to dynamic links?

I am doing a small project of creating a crawler which will extract all the links present on a website to the maximum possible depth.
I have shown a portion of the code below, which I am using to avoid erroneous links or links which take the crawler outside the target website.
Code Snippet :
# block anything that is neither an absolute http(s) URL nor a site-relative path
if not (url.startswith("http://") or url.startswith("https://") or url.startswith("/")):
    continue
# block absolute links that lead away from the target website
if url.startswith("http") and not url.startswith(seed):
    continue
# resolve site-relative links (e.g. "/index.php/...") against the seed URL
if url.startswith("/"):
    url = seed + url
The problem I am facing is that I encountered a link such as:
http://www.msit.in/index.php/component/jevents/day.listevents/2015/10/13/-?Itemid=1
This link keeps producing infinite results; the date portion of the link is what keeps changing.
Now when the crawler crawls this link, it gets into an infinite loop as follows. I checked on the website: even the link for 2050/10/13 exists, which means it would take a huge amount of time.
Few Output Sequences :
http://www.msit.in/index.php/component/jevents/day.listevents/2015/04/13/-?Itemid=1
http://www.msit.in/index.php/component/jevents/day.listevents/2015/05/13/-?Itemid=1
http://www.msit.in/index.php/component/jevents/day.listevents/2015/06/13/-?Itemid=1
http://www.msit.in/index.php/component/jevents/day.listevents/2015/07/13/-?Itemid=1
http://www.msit.in/index.php/component/jevents/day.listevents/2015/08/13/-?Itemid=1
http://www.msit.in/index.php/component/jevents/day.listevents/2015/09/13/-?Itemid=1
http://www.msit.in/index.php/component/jevents/day.listevents/2015/10/13/-?Itemid=1
http://www.msit.in/index.php/component/jevents/day.listevents/2015/11/13/-?Itemid=1
http://www.msit.in/index.php/component/jevents/day.listevents/2015/11/14/-?Itemid=1
http://www.msit.in/index.php/component/jevents/day.listevents/2015/11/15/-?Itemid=1
http://www.msit.in/index.php/component/jevents/day.listevents/2015/11/16/-?Itemid=1
http://www.msit.in/index.php/component/jevents/day.listevents/2015/11/17/-?Itemid=1
http://www.msit.in/index.php/component/jevents/day.listevents/2015/11/18/-?Itemid=1
http://www.msit.in/index.php/component/jevents/day.listevents/2015/11/19/-?Itemid=1
http://www.msit.in/index.php/component/jevents/day.listevents/2015/11/20/-?Itemid=1
http://www.msit.in/index.php/component/jevents/day.listevents/2015/11/21/-?Itemid=1
My Question:
How can I avoid this problem?
If you are writing your project for this site specifically, you can try to find out whether links point to past events by comparing the dates in the URL. However, this will most likely result in site-specific code, and if the project needs to be more general, it is probably not an option.
If this doesn't work for you, can you add some more information (what the project is for, whether there are time constraints, etc.)?
Edit: I missed the part about dynamic links, so this is not a finite set and the first part of my answer doesn't apply.
If the content of a site is stored in a database and pulled for display on pages on demand, dynamic URLs may be used. In that case the site basically serves as a template for the content. Usually, a dynamic URL looks something like this: http://code.google.com/p/google-checkout-php-sample-code/issues/detail?id=31.
You can spot dynamic URLs by looking for characters like ?, =, and &. Dynamic URLs have the disadvantage that different URLs can serve the same content, so different users might link to URLs with different parameters that resolve to the same page. That is one reason why webmasters sometimes want to rewrite their URLs as static ones.
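One way to act on that observation is to collapse such dynamic URLs onto a pattern key before enqueueing them, so the crawler visits at most a few pages per pattern instead of every date. This is only a sketch; the regex and the max_per_pattern cap are assumptions, not part of the original code:

import re
from urllib.parse import urlsplit, parse_qs

# Matches the per-day calendar URLs from the question, e.g. .../day.listevents/2015/10/13/...
CALENDAR_RE = re.compile(r"/day\.listevents/\d{4}/\d{2}/\d{2}/")

def should_crawl(url, seen_patterns, max_per_pattern=1):
    """Return True if this URL's 'shape' has not already been crawled too often."""
    if CALENDAR_RE.search(url):
        # Collapse every dated calendar URL onto a single pattern key.
        key = CALENDAR_RE.sub("/day.listevents/<date>/", url)
    else:
        # Otherwise key on the path plus the sorted query parameter names (values ignored).
        parts = urlsplit(url)
        key = parts.path + "?" + ",".join(sorted(parse_qs(parts.query)))
    seen_patterns[key] = seen_patterns.get(key, 0) + 1
    return seen_patterns[key] <= max_per_pattern

# usage inside the crawl loop:  seen = {} ... if should_crawl(url, seen): crawl(url)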

Read all pages within a domain

I am using the urllib library to fetch pages. Typically I have the top-level domain name, and I wish to extract some information from every page within that domain. Thus, if I have xyz.com, I'd like my code to fetch the data from xyz.com/about, etc. Here's what I am using:
import urllib,re
htmlFile = urllib.urlopen("http://www.xyz.com/"+r"(.*)")
html = htmlFile.read()
...............
This does not do the trick for me, though. Any ideas are appreciated.
Thanks.
-T
I don't know why you would expect domain.com/(.*) to work. You need to have a list of all the pages (dynamic or static) within that domain. Your Python program cannot automatically know that; you must obtain this knowledge from elsewhere, either by following links or by looking at the sitemap of the website.
As a footnote, scraping is a slightly shady business. Always make sure, no matter what method you employ, that you are not violating any terms and conditions.
You are trying to pass a regular expression to the web server. It turns out web servers don't actually support this kind of format, so the request fails.
To do what you're trying to do, you need to implement a spider: a program that downloads a page, finds all the links within it, and decides which of them to follow. It then downloads each of those pages and repeats.
Some things to watch out for: loops, multiple links that end up pointing at the same page, links going outside of the domain, and getting banned from the web server for spamming it with thousands of requests.
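A minimal sketch of such a spider using requests and BeautifulSoup (both assumed to be available), with a visited set and a page cap as basic safeguards; the domain is just the placeholder from the question:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse
from collections import deque

def crawl_domain(start_url, max_pages=200):
    """Breadth-first crawl of one domain; returns {url: html} for the pages visited."""
    domain = urlparse(start_url).netloc
    visited, queue, pages = set(), deque([start_url]), {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        pages[url] = html
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"]).split("#")[0]   # resolve relative links, drop fragments
            if urlparse(link).netloc == domain and link not in visited:
                queue.append(link)
    return pages

pages = crawl_domain("http://www.xyz.com/")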
In addition to @zigdon's answer, I recommend you take a look at the Scrapy framework.
CrawlSpider will help you implement crawling quite easily.
Scrapy has this functionality built in. There is no need to recursively fetch links yourself; it asynchronously handles all the heavy lifting for you. Just specify your domain and search terms and how deep you want it to search, i.e. the whole site.
http://doc.scrapy.org/en/latest/index.html
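For completeness, a sketch of what that might look like with Scrapy's CrawlSpider; the domain is again the placeholder from the question and the depth limit is optional:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class DomainSpider(CrawlSpider):
    name = "domain"
    allowed_domains = ["xyz.com"]
    start_urls = ["http://www.xyz.com/"]
    custom_settings = {"DEPTH_LIMIT": 3}   # optional: cap recursion depth

    rules = (
        # Follow every in-domain link and hand each page to parse_page.
        Rule(LinkExtractor(allow_domains=["xyz.com"]), callback="parse_page", follow=True),
    )

    def parse_page(self, response):
        yield {
            "url": response.url,
            "title": response.css("title::text").get(),
        }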
