Is a Web Crawler more suitable? - python

TL;DR version:
I have only heard about web crawlers in intellectual conversations I'm not part of. All I want to know is whether they can follow a specific path like:
first page (has lot of links) -->go to links specified-->go to
links(specified, yes again)-->go to certain link-->reach final page
and download source.
I have googled a bit and came across Scrapy. But I am not sure that I fully understand web crawlers to begin with, or whether Scrapy can help me follow the specific path I want.
Long version:
I want to extract some text from a group of static web pages. These pages are very simple, just basic HTML. I used Python and urllib to access each URL, extract the text, and work with it. Pretty soon I realized that I would basically have to visit all of these pages and copy-paste their URLs into my program, which is tiresome, so I want to know whether this is a better fit for a web crawler.
I want to access this page, then select only a few organisms (I have a list of those). Clicking one of them brings up this page. If you look under the table "MTases active in the genome", the enzymes are hyperlinks; clicking one of those leads to this page. On the right-hand side there is a link named "Sequence Data". Once clicked, it leads to a page with a small table on the lower right with yellow headers; under it is an entry "DNA (FASTA STYLE)". Clicking "view" leads to the page I'm interested in and want to download the source of.

I think you are definitely on the right track in looking at a web crawler to help you do this. You could also look at Norconex HTTP Collector, which I know can let you follow links on a page without storing that page if it is just a listing page to you. That crawler lets you filter out pages after their links have been extracted to be followed. Ultimately, you can configure the right filters so that only the pages matching the pattern you want get downloaded for you to process (whether that is based on crawl depth, URL pattern, content pattern, etc.).
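If you end up trying Scrapy, a fixed path like yours maps naturally onto a spider that chains one callback per step. Below is a minimal sketch; the start URL, the organism names, and the link-text/CSS selectors are placeholders, not taken from the real site, so you would substitute your own:
import scrapy


class PathSpider(scrapy.Spider):
    """Follows a fixed chain: listing -> organism -> enzyme -> Sequence Data -> FASTA view."""
    name = "path_spider"
    start_urls = ["http://example.org/organism-list"]  # hypothetical first page

    # hypothetical set of organisms to keep
    wanted_organisms = {"Organism A", "Organism B"}

    def parse(self, response):
        # step 1: on the first page, follow only the organism links you want
        for link in response.css("a"):
            if link.css("::text").get("").strip() in self.wanted_organisms:
                yield response.follow(link, callback=self.parse_organism)

    def parse_organism(self, response):
        # step 2: follow each enzyme hyperlink (placeholder selector for the MTases table)
        for href in response.css("table a::attr(href)").getall():
            yield response.follow(href, callback=self.parse_enzyme)

    def parse_enzyme(self, response):
        # step 3: follow the "Sequence Data" link
        href = response.xpath("//a[contains(text(), 'Sequence Data')]/@href").get()
        if href:
            yield response.follow(href, callback=self.parse_sequence)

    def parse_sequence(self, response):
        # step 4: follow the "view" link next to the DNA (FASTA STYLE) entry
        href = response.xpath("//a[contains(text(), 'view')]/@href").get()
        if href:
            yield response.follow(href, callback=self.save_page)

    def save_page(self, response):
        # final page: keep the raw source
        yield {"url": response.url, "html": response.text}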

Related

Suitable Python modules for navigating a website

I am looking for a Python module that will let me navigate the search bars, links, etc. of a website.
For context, I am looking to do a little web scraping of this website [https://www.realclearpolitics.com/].
I simply want to take information on each state (polling data etc.) in relation to the 2020 election and organize it all in a database.
Obviously there are a lot of states to go through, and each is on a separate webpage. So I'm looking for a way in Python to quickly navigate the site, take the data from each page, and update and add to existing data. A way of quickly navigating links and search bars with my own input would be very helpful.
Any suggestions would be greatly appreciated.
# a simple list that contains the names of each state
states = ["Alabama", "Alaska", "Arizona", "....."]

for state in states:
    # code to look up the state in the searchbar of the website
    # figures being taken from the website etc
    break
Here is the rough idea I have.
There are many options to accomplish this with Python. As #LD mentioned, you can use Selenium. Selenium is a good option if you need to interact with a website's UI via a headless browser, e.g. clicking a button or entering text into a search bar. If your needs aren't that complex, for instance if you just need to quickly scrape all the raw content from a web page and process it, then you should use the third-party requests module.
For processing the raw content from a crawl, I would recommend Beautiful Soup.
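For example, here is a minimal sketch with requests and Beautiful Soup that walks a list of states, fetches a page per state, and prints every table row. The state-page URL pattern is only a guess and would need to be checked against the real site (along with its terms of use):
import requests
from bs4 import BeautifulSoup

# Hypothetical URL pattern - verify the site's actual state-page URLs first.
BASE_URL = "https://www.realclearpolitics.com/epolls/2020/president/{}/"

states = ["Alabama", "Alaska", "Arizona"]  # extend to the full list

for state in states:
    url = BASE_URL.format(state.lower().replace(" ", "_"))
    response = requests.get(url, timeout=10)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")
    # Pull every table on the page; the polling numbers usually live in one of them.
    for table in soup.find_all("table"):
        for row in table.find_all("tr"):
            cells = [cell.get_text(strip=True) for cell in row.find_all(["td", "th"])]
            print(state, cells)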
Hope that helps!

Get method from requests library seems to return homepage rather than specific URL

I'm new to Python & object-oriented programming in general. I'm trying to build a simple web scraper to create data frames from NBA contract data on basketball-reference.com. I had planned to use the requests library together with BeautifulSoup. However, the get method seems to be returning the site's homepage rather than the page affiliated with the URL I give.
I give a URL to a team's contracts page (https://www.basketball-reference.com/contracts/IND.html), but when I print the html it looks like it belongs to the homepage.
I haven't been able to find any documentation on the web about anyone else having this problem...
I'm using the Spyder IDE.
# Import library
import requests
# Assign the URL for contract scraping
url = 'https://www.basketball-reference.com/contracts/IND.html'
# Pull contracts page
page = requests.get(url)
# Check that correct page is being pulled
print(page.text)
This seems like it should be very straightforward, so I'm not understanding why the console is displaying html that clearly doesn't pertain to the page I'm trying to point to. I'm not getting any errors, just html from the homepage.
After checking the code on repl.it and visiting the webpage myself, I can confirm you are pulling in the correct page's HTML. The page variable contains the tables of data, as well as their info... and also the page's advertisements, the contact info, the social media buttons and links, the adblock detection scripts, and everything else on the webpage. Your issue isn't that you're getting the wrong page, it's that you're getting the entire page, not just the data.
You'll want to pick out the exact bits you're interested in - maybe by selecting the table and its child elements? The table's HTML id is contracts - that should be a good place to start.
(Try visiting the page in your browser, right-clicking anywhere on the page, and clicking "view page source" - that's what your program is pulling in. There's a LOT more to a webpage than most people realize!)
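For example, a minimal sketch with Beautiful Soup that keeps only that table, assuming it is present in the initial HTML rather than injected later by JavaScript (and keeping in mind the data-use warning below):
import requests
from bs4 import BeautifulSoup

url = 'https://www.basketball-reference.com/contracts/IND.html'
page = requests.get(url)
page.raise_for_status()

soup = BeautifulSoup(page.text, "html.parser")

# The contracts table has id="contracts"
table = soup.find("table", id="contracts")

for row in table.find_all("tr"):
    cells = [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
    if cells:
        print(cells)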
As a word of warning, though, Sports Reference has a data use policy that precludes web crawlers / spiders on their site. I would recommend checking (and using) one of the free sites they link instead; you risk being IP banned otherwise.
Simply printing the result of the get request in the terminal won't be very helpful, as the HTML content returned is long and your terminal will truncate the printed response. I'm assuming that in your case the website reuses parts of the homepage on other pages as well, so the output can look confusing.
I recommend writing the response into a file and then opening the file in the browser. You will see that your code is pulling the right page.
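Something along these lines, reusing the URL from the question:
import requests

url = 'https://www.basketball-reference.com/contracts/IND.html'
page = requests.get(url)

# Write the full response to disk, then open contracts.html in a browser
# to compare it with what you see on the live site.
with open("contracts.html", "w", encoding="utf-8") as f:
    f.write(page.text)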

How to stop infinite loops while creating a Web Site Crawler due to dynamic links?

I am doing a small project: creating a crawler which will extract all the links present on a website, down to the maximum possible depth.
I have shown a portion of my code below, which I am using to avoid erroneous links or links which take the crawler outside the target website.
Code snippet:
# block all things that can't be urls
if not url.startswith("http") and not url.startswith("/"):
    continue
# block all links going away from the website
if not url.startswith(seed) and url.startswith("http"):
    continue
# relative links like /index.php/... need the seed prepended
if "php" in url.split('/')[1]:
    url = seed + url
The problem I am facing is that I encountered a link like this:
http://www.msit.in/index.php/component/jevents/day.listevents/2015/10/13/-?Itemid=1
This link keeps producing infinite results; the part of the link that keeps changing is the date.
When the crawler crawls this link, it gets into an infinite loop, as shown below. I checked on the website and even the link for 2050/10/13 exists, which means crawling it all would take a huge amount of time.
A few of the output URLs:
http://www.msit.in/index.php/component/jevents/day.listevents/2015/04/13/-?Itemid=1
http://www.msit.in/index.php/component/jevents/day.listevents/2015/05/13/-?Itemid=1
http://www.msit.in/index.php/component/jevents/day.listevents/2015/06/13/-?Itemid=1
http://www.msit.in/index.php/component/jevents/day.listevents/2015/07/13/-?Itemid=1
http://www.msit.in/index.php/component/jevents/day.listevents/2015/08/13/-?Itemid=1
http://www.msit.in/index.php/component/jevents/day.listevents/2015/09/13/-?Itemid=1
http://www.msit.in/index.php/component/jevents/day.listevents/2015/10/13/-?Itemid=1
http://www.msit.in/index.php/component/jevents/day.listevents/2015/11/13/-?Itemid=1
http://www.msit.in/index.php/component/jevents/day.listevents/2015/11/14/-?Itemid=1
http://www.msit.in/index.php/component/jevents/day.listevents/2015/11/15/-?Itemid=1
http://www.msit.in/index.php/component/jevents/day.listevents/2015/11/16/-?Itemid=1
http://www.msit.in/index.php/component/jevents/day.listevents/2015/11/17/-?Itemid=1
http://www.msit.in/index.php/component/jevents/day.listevents/2015/11/18/-?Itemid=1
http://www.msit.in/index.php/component/jevents/day.listevents/2015/11/19/-?Itemid=1
http://www.msit.in/index.php/component/jevents/day.listevents/2015/11/20/-?Itemid=1
http://www.msit.in/index.php/component/jevents/day.listevents/2015/11/21/-?Itemid=1
My question:
How can I avoid this problem?
If you are writing your project for this site specifically, you can try to detect these calendar links by comparing the dates in the URL. However, this will most likely result in site-specific code, and if the project needs to be more general it is probably not an option.
If this doesn't work for you, can you add some more information (what the project is for, whether there are time constraints, etc.)?
Edit: I missed the part about dynamic links; since this is not a finite set, the first part of my answer doesn't apply.
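Even so, putting an upper bound on the date in those calendar URLs keeps the crawl finite. A rough sketch, assuming the jevents URL format shown in the question and an arbitrary cutoff of one year ahead:
import re
from datetime import date

# Matches the calendar URLs from the question, e.g. .../day.listevents/2015/10/13/-?Itemid=1
JEVENTS_DATE = re.compile(r"day\.listevents/(\d{4})/(\d{2})/(\d{2})/")

def is_reasonable_event_url(url, max_days_ahead=365):
    """Return False for calendar pages dated more than max_days_ahead in the future."""
    match = JEVENTS_DATE.search(url)
    if not match:
        return True  # not a calendar URL, let the normal rules decide
    year, month, day = (int(part) for part in match.groups())
    try:
        event_date = date(year, month, day)
    except ValueError:
        return False  # malformed date, skip it
    return (event_date - date.today()).days <= max_days_ahead
Calling is_reasonable_event_url(url) before queuing a link stops the crawler from wandering off towards 2050.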
If the content of a site is stored in a database and pulled for display on pages on demand, dynamic URLs may be used. In that case the site basically serves as a template for the content. Usually a dynamic URL looks something like this: http://code.google.com/p/google-checkout-php-sample-code/issues/detail?id=31.
You can spot dynamic URLs by looking for characters like ?, = and &. Dynamic URLs have the disadvantage that different URLs can have the same content, so different users might link to URLs with different parameters which serve the same content. That is one reason why webmasters sometimes want to rewrite their URLs to static ones.
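A small sketch of both ideas, using only the standard library; whether it is safe to collapse URLs that differ only in their query string depends on the site, so treat this as a starting point:
from urllib.parse import urlsplit, urlunsplit

def is_dynamic(url):
    # A non-empty query string ("?", "=", "&") is the usual sign of a dynamic URL.
    return bool(urlsplit(url).query)

def canonicalise(url):
    # Drop the query string and fragment so URLs that differ only in
    # parameters collapse to a single entry in the "already seen" set.
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

seen = set()
for url in ["http://www.msit.in/index.php?Itemid=1", "http://www.msit.in/index.php?Itemid=2"]:
    key = canonicalise(url)
    if key in seen:
        continue  # already crawled an equivalent URL
    seen.add(key)
    print("would crawl:", url, "dynamic:", is_dynamic(url))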

Parsing a webpage for indexing

I am trying to understand/optimize the logic for indexing a site. I am new to the HTML/JS side of things, so I am learning as I go. While indexing a site, I recursively go deeper into it based on the links on each page. One problem is that pages have repeating URLs and repeating text, like the header and footer. For the URLs I keep a list of those I have already processed. Is there something I can do to identify the text that repeats on each page? I hope my explanation is clear enough. I currently have the code (in Python) to get a list of useful URLs for the site, and now I am trying to index the content of those pages. Is there a preferred way to identify or skip the repeating text on these pages (headers, footers, other blurb)? I am using BeautifulSoup and the requests module.
I am not quite sure if this is what you are hoping for, but Readability is a popular service that parses just the "useful" content out of a page. It is the service integrated into Safari for iOS.
It intelligently extracts the worthwhile content of the page while ignoring things like footers, headers, ads, etc.
There are open-source ports for Python, Ruby, PHP and probably other languages.
https://github.com/buriy/python-readability
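With the Python port (installed as readability-lxml), a minimal sketch looks like this; the article URL is just a placeholder:
import requests
from readability import Document  # pip install readability-lxml

response = requests.get("https://example.com/some-article")
doc = Document(response.text)

print(doc.short_title())  # page title with site-name cruft trimmed
print(doc.summary())      # main-content HTML, with headers/footers/ads stripped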

How to crawl a file hosting website with Scrapy in Python?

Can anyone help me figure out how to crawl a file hosting website like filefactory.com? I don't want to download all the files hosted there, just to index all the available files with Scrapy.
I have read the tutorial and the docs on Scrapy's spider class. If I only give the website's main page as the beginning URL, it won't crawl the whole site, because the crawling depends on links and the beginning page doesn't seem to point to any file pages. That's the problem I am thinking about, and any help would be appreciated!
I have two pieces of advice. The first is to ensure that you are using Scrapy correctly, and the second pertains to the best way to collect a larger sample of the URLs.
First:
Make sure you are using the CrawlSpider to crawl the website. This is what most people use when they want to take all the links on a crawled page and turn them into new requests for Scrapy to crawl. See http://doc.scrapy.org/en/latest/topics/spiders.html for more information on the crawl spider.
If you build the crawl spider correctly, it should be able to find, and then crawl, the majority of the links that each page has.
However, if the pages that contain the download links are not themselves linked to by pages that Scrapy is encountering, then there is no way that Scrapy can know about them.
One way to counter this is to use multiple entry points on the website, in the areas you know Scrapy is having difficulty finding. You can do this by putting multiple initial URLs in the start_urls variable, as in the sketch below.
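A minimal CrawlSpider along those lines might look like the following; the "/file/" URL pattern and the extra start URLs are assumptions you would replace with what you actually find on the site:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class FileFactorySpider(CrawlSpider):
    name = "filefactory"
    allowed_domains = ["filefactory.com"]
    # Several entry points help the spider reach sections that the
    # front page does not link to directly.
    start_urls = [
        "https://www.filefactory.com/",
        # add the category or listing pages you know about here
    ]

    rules = (
        # Pages that look like file pages (the "/file/" path is an assumption)
        # are passed to parse_file; everything else is only followed.
        Rule(LinkExtractor(allow=r"/file/"), callback="parse_file", follow=True),
        Rule(LinkExtractor(), follow=True),
    )

    def parse_file(self, response):
        yield {
            "url": response.url,
            "title": response.css("title::text").get(),
        }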
Second:
Since it is likely that this is already what you were doing, here is my next bit of advice.
If you go onto Google and type site:www.filefactory.com, you will see a link to every page that Google has indexed for www.filefactory.com. Make sure you also check site:filefactory.com, because there are some canonicalization issues. When I did this, I saw that there were around 600,000 pages indexed. What you should do is crawl Google, collect all of these indexed URLs first, and store them in a database. Then use them to seed further crawls of the FileFactory.com website.
Also
If you have a membership to FileFactory.com, you can also program Scrapy to submit forms or sign in. Doing this might allow you even further access.
