Scraping a website for specific data where URLs are inconsistent - python

I want to scrape http://www.narrpr.com/ for data, but I'm running into an issue. Most of the time, formatting URLs to access the specific pages you want to scrape is easy. However, in this instance, the URLs are formatted in the following fashion (for example):
http://www.narrpr.com/homes/mo/independence/64055/2412-s-ellison-way/38664800-summary.aspx
Where the number 38664800 appears to be some kind of unique ID.
When navigating the site manually, I enter 2412 S Ellison Way into the form, which then redirects me to the URL above.
How can I programmatically reach the correct page without needing to know that ID? Alternatively, how can I get that ID?
Thanks.
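One common approach, sketched below, is to replicate the address form's submission with requests, let it follow the redirect, and read the ID out of the final URL. The search endpoint and parameter name here are pure assumptions; inspect the form's network request in your browser's dev tools to find the real ones.

import re
import requests

# Hypothetical sketch: the endpoint and parameter name below are guesses,
# not the site's documented API.
SEARCH_URL = "http://www.narrpr.com/search"  # assumption

with requests.Session() as session:
    response = session.get(
        SEARCH_URL,
        params={"q": "2412 S Ellison Way"},  # assumed parameter name
        allow_redirects=True,
    )
    # After the redirect chain, response.url should be the summary page,
    # e.g. .../2412-s-ellison-way/38664800-summary.aspx
    final_url = response.url
    match = re.search(r"/(\d+)-summary\.aspx", final_url)
    if match:
        print("Property ID:", match.group(1))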

Related

Why does this search URL redirect to a different search URL when copied and pasted?

A web-scraping-adjacent question about URLs acting wacky.
If I go to the Glassdoor job search and enter six fields (Austin, "engineering manager", full-time, exact city, etc.), I get a results page with 38 results. This is the link I get. Ideally I'd like to save this link with its search criteria and reference it later.
https://www.glassdoor.com/Job/jobs.htm?sc.generalKeyword=%22engineering+manager%22&sc.locationSeoString=austin&locId=1139761&locT=C?jobType=fulltime&fromAge=30&radius=0&minRating=4.00
However, if I copy that exact link and paste it into a new tab, it doesn't behave as expected.
It redirects to this different link, maintaining some of the criteria but losing the location criteria, bringing up thousands of results from around the country instead of just Austin.
https://www.glassdoor.com/Job/jobs.htm?sc.generalKeyword=%22engineering+manager%22&fromAge=30&radius=0&minRating=4.0
I understand I could use Selenium to fill in all six fields; I'd just like to understand what's going on here and whether there is a solution that uses only a URL.
The change of URL appears to happen on the server handling the request. The server-side endpoint is presumably configured to trim out the extra parameters and redirect you to another URL. There's nothing you can do about this on the client side: however you pass the original URL, it will always resolve to the second form.
I have also tried a URL shortener, but the same behavior persists.
The only way around this is to use browser automation such as Selenium to select the same criteria and display the results you got from the first URL.
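A minimal Selenium sketch of that idea follows. The element locators are assumptions (Glassdoor changes its markup often), so treat them as placeholders to be replaced after inspecting the live page.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

# Hypothetical sketch: the element IDs below are placeholders, not
# Glassdoor's actual markup.
driver = webdriver.Chrome()
driver.get("https://www.glassdoor.com/Job/jobs.htm")

keyword_box = driver.find_element(By.ID, "sc.keyword")    # assumed ID
location_box = driver.find_element(By.ID, "sc.location")  # assumed ID
keyword_box.send_keys('"engineering manager"')
location_box.clear()
location_box.send_keys("Austin, TX")
location_box.send_keys(Keys.RETURN)

# The remaining filters (job type, radius, rating, posting age) would be
# applied by clicking the corresponding filter widgets on the results page.
print(driver.current_url)
driver.quit()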

Get method from requests library seems to return homepage rather than specific URL

I'm new to Python & object-oriented programming in general. I'm trying to build a simple web scraper to create data frames from NBA contract data on basketball-reference.com. I had planned to use the requests library together with BeautifulSoup. However, the get method seems to be returning the site's homepage rather than the page affiliated with the URL I give.
I give a URL to a team's contracts page (https://www.basketball-reference.com/contracts/IND.html), but when I print the html it looks like it belongs to the homepage.
I haven't been able to find any documentation on the web about anyone else having this problem...
I'm using the Spyder IDE.
# Import library
import requests
# Assign the URL for contract scraping
url = 'https://www.basketball-reference.com/contracts/IND.html'
# Pull contracts page
page = requests.get(url)
# Check that correct page is being pulled
print(page.text)
This seems like it should be very straightforward, so I'm not understanding why the console is displaying html that clearly doesn't pertain to the page I'm trying to point to. I'm not getting any errors, just html from the homepage.
After checking the code on repl.it and visiting the webpage myself, I can confirm you are pulling in the correct page's HTML. The page variable contains the tables of data, as well as their info... and also the page's advertisements, the contact info, the social media buttons and links, the adblock detection scripts, and everything else on the webpage. Your issue isn't that you're getting the wrong page, it's that you're getting the entire page, not just the data.
You'll want to pick out the exact bits you're interested in - maybe by selecting the table and its child elements? The table's HTML id is contracts - that should be a good place to start.
(Try visiting the page in your browser, right-clicking anywhere on the page, and clicking "view page source" - that's what your program is pulling in. There's a LOT more to a webpage than most people realize!)
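As a rough sketch of that suggestion, assuming BeautifulSoup is installed and using the id="contracts" table mentioned above, you might do something like:

import requests
from bs4 import BeautifulSoup

url = 'https://www.basketball-reference.com/contracts/IND.html'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')

# Select only the contracts table instead of printing the whole document
table = soup.find('table', id='contracts')
for row in table.find_all('tr'):
    cells = [cell.get_text(strip=True) for cell in row.find_all(['th', 'td'])]
    print(cells)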
As a word of warning, though, Sports Reference has a data use policy that precludes web crawlers / spiders on their site. I would recommend checking (and using) one of the free sites they link instead; you risk being IP banned otherwise.
Simply printing the result of the GET request in the terminal isn't very helpful, because the returned HTML is long and the terminal will cut off most of the printed response. I'm assuming that in your case the website reuses parts of the homepage on other pages as well, which makes the output look like the homepage.
I recommend writing the response into a file and then opening the file in the browser. You will see that your code is pulling the right page.
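For example, continuing from the code above (the file name is arbitrary):

# Save the fetched HTML so it can be opened in a browser for inspection
with open('contracts_IND.html', 'w', encoding='utf-8') as f:
    f.write(page.text)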

Script cannot fetch data from a web page

I am trying to write a program in Python that takes the name of a stock and its price and prints them. However, when I run it, nothing is printed; it seems the data is not being fetched from the website. I double-checked that the XPath from the web page is correct, but for some reason the text does not show up.
from lxml import html
import requests
page = requests.get('https://www.bloomberg.com/quote/UKX:IND?in_source=topQuotes')
tree = html.fromstring(page.content)
Prices = tree.xpath('//span[@class="priceText__1853e8a5"]/text()')
print('Prices:', Prices)
Here is the website I am trying to get the data from.
I have tried BeautifulSoup, but it has the same problem.
If you print the string page.content, you'll see that the website code it captures is actually for a captcha test, not the "real" destination page itself you see when you manually visit the website. It seems that the website was smart enough to see that your request to this URL was from a script and not manually from a human, and it effectively prevented your script from scraping any real content. So Prices is empty because there simply isn't a span tag of class "priceText__1853e8a5" on this special Captcha page. I get the same when I try scraping with urllib2.
As others have suggested, Selenium (actual web automation) might be able to launch the page and get you what you need. The ID looks dynamically generated, though I do get the same one when I manually look at the page. Another alternative is to simply find a different site that can give you the quote you need without blocking your script. I tried it with https://tradingeconomics.com/ukx:ind and that works, though of course you'll need a different XPath to find the cell you need.
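As a rough sketch of that alternative, the XPath below is a placeholder; inspect the page in your browser's dev tools to find the element that actually holds the quote.

from lxml import html
import requests

# Some sites reject the default requests user agent, so send a browser-like one
page = requests.get('https://tradingeconomics.com/ukx:ind',
                    headers={'User-Agent': 'Mozilla/5.0'})
tree = html.fromstring(page.content)

# Placeholder XPath: replace it with the real path to the price cell
prices = tree.xpath('//td[@id="price"]/text()')
print('Prices:', prices)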

Scrapy: Get data on page and following link

I have been using scrapy for a personal project. My problem is very similar to the question asked on the following page:
Scrapy: Follow link to get additional Item data?
The page I am scraping is the following:
http://www.tennisinsight.com/player_activity.php?player_id=51
This page has a list of matches in this form for eg:
Round of 16 Def. Ivan Dodig(37,2.41) (CRO) 6-3 6-3 Recap Match Stats $1.043
I have currently written Scrapy code that opens every link on the page that has a "Match Stats" link and scrapes the data on that page into an individual record.
In addition to this, I want to scrape the "Odds" column (which is the $1.043 above) and add this data to the record.
I have searched for an answer, and it seems that I have to use the Request meta field and pass this data along to the parse method. However, I am struggling to incorporate it into my code. The answer from the Stack Overflow question I linked above is: "To scrape additional fields which are on other pages, in a parse method extract URL of the page with additional info, create and return from that parse method a Request object with that URL and pass already extracted data via its meta parameter."
This makes perfect sense; however, the URLs I scrape come from the rules, so I don't know how to extract the required data.
Here is part of my code so far which will hopefully better explain my problem.
rules = (
    Rule(SgmlLinkExtractor(allow=r"match_stats_popup.php\?matchID=\d+",
                           restrict_xpaths='//td[@class="matchStyle"]',
                           tags='a', attrs='href', process_value=getPopupLink),
         callback='parse_match', follow=True),
)
The parse_match function parses the match stats into one item.
So what happens is that each of these match stats links is opened up, and there is no way for me to access the main page's Odds column.
Any help will be much appreciated.
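A minimal sketch of the meta-passing pattern from the quoted answer: one way is to drop the rule and write the requests yourself, pulling the odds out of each row of the activity page and handing them to parse_match through the request's meta dict. The row and odds XPaths below are assumptions about tennisinsight.com's markup and must be adjusted after inspecting the page.

import scrapy

class TennisInsightSpider(scrapy.Spider):
    name = "tennis_insight"
    start_urls = ["http://www.tennisinsight.com/player_activity.php?player_id=51"]

    def parse(self, response):
        # Assumed selectors: adjust after inspecting the page's actual markup.
        for row in response.xpath('//tr[td[@class="matchStyle"]]'):
            odds = row.xpath('.//td[last()]/text()').get()  # assumed odds cell
            stats_url = row.xpath('.//td[@class="matchStyle"]/a/@href').get()
            if stats_url:
                yield scrapy.Request(
                    response.urljoin(stats_url),
                    callback=self.parse_match,
                    meta={"odds": odds},  # carry the odds along to the stats page
                )

    def parse_match(self, response):
        item = {"odds": response.meta["odds"]}  # value passed from the activity page
        # ... extract the match stats into item here, as in your existing parse_match ...
        yield item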

How to stop infinite loops caused by dynamic links while creating a website crawler?

I am doing a small project: creating a crawler that will extract all the links present on a website to the maximum possible depth.
Below is the portion of the code I am using to avoid erroneous links, or links that take the crawler outside the target website.
Code snippet:
# block all things that can't be URLs
if not (url.startswith("http") or url.startswith("/")):
    continue
# block all links going away from the website
if url.startswith("http") and not url.startswith(seed):
    continue
# turn relative php links into absolute URLs
if "php" in url.split('/')[1]:
    url = seed + url
The problem I am facing is that I encountered a link like this:
http://www.msit.in/index.php/component/jevents/day.listevents/2015/10/13/-?Itemid=1
This link keeps producing new results without end; the date part of the link (2015/10/13) changes each time.
When the crawler crawls this link, it gets into an effectively infinite loop, as shown below. I checked on the website: even the link for 2050/10/13 exists, so crawling them all would take a huge amount of time.
A few of the output URLs:
http://www.msit.in/index.php/component/jevents/day.listevents/2015/04/13/-?Itemid=1
http://www.msit.in/index.php/component/jevents/day.listevents/2015/05/13/-?Itemid=1
http://www.msit.in/index.php/component/jevents/day.listevents/2015/06/13/-?Itemid=1
http://www.msit.in/index.php/component/jevents/day.listevents/2015/07/13/-?Itemid=1
http://www.msit.in/index.php/component/jevents/day.listevents/2015/08/13/-?Itemid=1
http://www.msit.in/index.php/component/jevents/day.listevents/2015/09/13/-?Itemid=1
http://www.msit.in/index.php/component/jevents/day.listevents/2015/10/13/-?Itemid=1
http://www.msit.in/index.php/component/jevents/day.listevents/2015/11/13/-?Itemid=1
http://www.msit.in/index.php/component/jevents/day.listevents/2015/11/14/-?Itemid=1
http://www.msit.in/index.php/component/jevents/day.listevents/2015/11/15/-?Itemid=1
http://www.msit.in/index.php/component/jevents/day.listevents/2015/11/16/-?Itemid=1
http://www.msit.in/index.php/component/jevents/day.listevents/2015/11/17/-?Itemid=1
http://www.msit.in/index.php/component/jevents/day.listevents/2015/11/18/-?Itemid=1
http://www.msit.in/index.php/component/jevents/day.listevents/2015/11/19/-?Itemid=1
http://www.msit.in/index.php/component/jevents/day.listevents/2015/11/20/-?Itemid=1
http://www.msit.in/index.php/component/jevents/day.listevents/2015/11/21/-?Itemid=1
My question: how can I avoid this problem?
If you are writing your project for this site specifically, you can try to filter out these event links by comparing the dates in the URL against a sensible range. However, this will most likely result in site-specific code, and if the project needs to be more general, it is probably not an option.
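For example, a rough sketch of that date check, assuming the jevents URL format shown in the output above (the cutoff is arbitrary and tunable):

import re
from datetime import date

# Matches the date segment in URLs like .../day.listevents/2015/10/13/-?Itemid=1
DATE_RE = re.compile(r"/day\.listevents/(\d{4})/(\d{2})/(\d{2})/")

def is_reasonable_event_url(url, max_days_ahead=365):
    match = DATE_RE.search(url)
    if not match:
        return True  # not a calendar link, crawl it normally
    year, month, day = (int(g) for g in match.groups())
    try:
        event_day = date(year, month, day)
    except ValueError:
        return False  # malformed date, skip it
    # Skip calendar pages more than a year in the future
    return (event_day - date.today()).days <= max_days_ahead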
If this doesn't work for you, can you add some more information (what is this project for, are there time constraints, etc.)?
Edit: I missed the part about dynamic links, so the set of links is not finite and the first part of my answer doesn't fully apply.
If the content of a site is stored in a database and pulled for display on demand, dynamic URLs may be used. In that case the site basically serves as a template for the content. A dynamic URL usually looks something like this: http://code.google.com/p/google-checkout-php-sample-code/issues/detail?id=31.
You can spot dynamic URLs by looking for characters like ?, =, and &. Dynamic URLs have the disadvantage that different URLs can serve the same content, so different users might link to the same page under different parameters. That's one reason webmasters sometimes rewrite their URLs as static ones.
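A generic mitigation, regardless of the site, is to canonicalize URLs (for example by stripping the query string and fragment) before checking them against a visited set, and to enforce a hard depth limit. A rough sketch, assuming the crawler tracks (url, depth) pairs:

from urllib.parse import urlsplit, urlunsplit

MAX_DEPTH = 5  # hard cap on crawl depth
visited = set()

def canonicalize(url):
    # Drop the query string and fragment so URLs that differ only in
    # parameters (e.g. ?Itemid=1) count as the same page.
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

def should_crawl(url, depth):
    if depth > MAX_DEPTH:
        return False
    key = canonicalize(url)
    if key in visited:
        return False
    visited.add(key)
    return True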
