How to get elements of webpage that load after initial webpage? - python

I'm trying to download stock option data from Yahoo Finance (here's Google as an example) with requests.get, which doesn't seem to be downloading everything. I'm trying to get the dropdown of dates with an XPath but even //option doesn't return anything even though Chrome DevTools says there are 13 instances!
I expect this has something to do with the fact that the parts of the site that actually matter are being loaded after all the navigation bars and such, and I don't know how to get all of it. Could you please suggest a method for getting the text of each item in the date dropdown menu?

If you open the developer console's Network tab and refresh the page (you may need to clear the cache first), you will see some requests of type xhr.
These are usually initiated by JavaScript and load data beyond what the initial HTML body provides.
Those requests are what you should look into.
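As a rough sketch of that idea: once you have found the xhr request in the Network tab, you can usually call the same URL directly and decode the JSON it returns. The endpoint URL and the "expirationDates" key below are made up for illustration, and the fake opener only exists so the example runs offline; copy the real URL and inspect the real payload in your browser.

```python
import io
import json
import urllib.request


def fetch_xhr_json(url, opener=urllib.request.urlopen, timeout=10):
    """Request the URL the page's JavaScript calls and decode its JSON body."""
    request = urllib.request.Request(
        url,
        headers={"User-Agent": "Mozilla/5.0", "Accept": "application/json"},
    )
    with opener(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def fake_opener(request, timeout=None):
    """Offline stand-in for urlopen so the sketch runs without network access."""
    return io.BytesIO(b'{"expirationDates": [1700000000, 1700604800]}')


# Hypothetical endpoint and payload shape; copy the real URL from the Network tab.
data = fetch_xhr_json("https://example.com/api/options?symbol=GOOG", opener=fake_opener)
print(data["expirationDates"])
```

To hit a real endpoint, drop the opener argument so the default urllib.request.urlopen is used.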

Related

How to extract the value of an input element from a web page using python?

I am working on a project to extract the ETAs of the next three upcoming buses from the Miami bus tracker website, but for some reason the site displays these times using input elements whose value is empty in the HTML. I am no expert on HTML, so I am sure they have their reasons, likely some sort of async JavaScript. When I do a GET request on this website from Python I can see the input fields, but their values are empty; the same is true when I inspect them in Chrome. I can see the values only after going into the Accessibility tab of Chrome's inspector. Is there any way to read these values using Python? The website is "https://www.miamidade.gov/transportation-publicworks/bustracker.asp?RouteID=37&Dir=Northbound&StopID=7535&Sequence=42" just for reference
here is what it looks like and here is where I can see the value
Check out the request being made by the page in your browser's debug tools network tab.
Make a request to https://www.miamidade.gov/transportation-publicworks/mobile/xml/bustracker/?RouteID=37&Dir=Northbound&StopID=7535&Sequence=42 instead and parse the returned XML.
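Parsing the returned XML could look roughly like the sketch below, using the standard library's xml.etree.ElementTree. I have not confirmed the real response layout, so the sample document and the tag names (Record, Time1, Time2, Time3) are assumptions; print the actual response first and adjust the names to match.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real feed's tag names may differ.
SAMPLE_XML = """<RecordSet>
  <Record>
    <Time1>5 min</Time1>
    <Time2>18 min</Time2>
    <Time3>31 min</Time3>
  </Record>
</RecordSet>"""


def element_texts(xml_text, tag_names):
    """Return {tag name: [text of every element with that tag]}."""
    root = ET.fromstring(xml_text)
    return {name: [el.text for el in root.iter(name)] for name in tag_names}


etas = element_texts(SAMPLE_XML, ["Time1", "Time2", "Time3"])
print(etas)
```

With a live request you would pass the response body (for example requests.get(url).text) in place of SAMPLE_XML.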

Get method from requests library seems to return homepage rather than specific URL

I'm new to Python & object-oriented programming in general. I'm trying to build a simple web scraper to create data frames from NBA contract data on basketball-reference.com. I had planned to use the requests library together with BeautifulSoup. However, the get method seems to be returning the site's homepage rather than the page affiliated with the URL I give.
I give a URL to a team's contracts page (https://www.basketball-reference.com/contracts/IND.html), but when I print the html it looks like it belongs to the homepage.
I haven't been able to find any documentation on the web about anyone else having this problem...
I'm using the Spyder IDE.
# Import library
import requests
# Assign the URL for contract scraping
url = 'https://www.basketball-reference.com/contracts/IND.html'
# Pull contracts page
page = requests.get(url)
# Check that correct page is being pulled
print(page.text)
This seems like it should be very straightforward, so I'm not understanding why the console is displaying html that clearly doesn't pertain to the page I'm trying to point to. I'm not getting any errors, just html from the homepage.
After checking the code on repl.it and visiting the webpage myself, I can confirm you are pulling in the correct page's HTML. The page variable contains the tables of data, as well as their info... and also the page's advertisements, the contact info, the social media buttons and links, the adblock detection scripts, and everything else on the webpage. Your issue isn't that you're getting the wrong page, it's that you're getting the entire page, not just the data.
You'll want to pick out the exact bits you're interested in - maybe by selecting the table and its child elements? The table's HTML id is contracts - that should be a good place to start.
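Here is a minimal sketch of picking out just that table. With BeautifulSoup you would look the table up by its id; to keep the example free of third-party dependencies I have used the standard library's html.parser instead, which is a substitution on my part. The sample HTML is invented and only mimics a page that has a table with id "contracts" surrounded by other content.

```python
from html.parser import HTMLParser


class TableById(HTMLParser):
    """Collect the cell text of every row inside the table with a given id."""

    def __init__(self, table_id):
        super().__init__()
        self.table_id = table_id
        self.depth = 0      # > 0 while we are inside the target table
        self.rows = []
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            if self.depth or dict(attrs).get("id") == self.table_id:
                self.depth += 1
        elif self.depth and tag == "tr":
            self._row = []
        elif self.depth and tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "table" and self.depth:
            self.depth -= 1
        elif tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Invented sample standing in for page.text.
SAMPLE_HTML = """
<div id="ads">Buy now</div>
<table id="other"><tr><td>ignore me</td></tr></table>
<table id="contracts">
  <tr><th>Player</th><th>Salary</th></tr>
  <tr><td>Player A</td><td>$1</td></tr>
</table>
"""

parser = TableById("contracts")
parser.feed(SAMPLE_HTML)
parser.close()
print(parser.rows)
```

The same parser can be fed page.text from the question's code once you have checked that scraping the site is permitted.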
(Try visiting the page in your browser, right-clicking anywhere on the page, and clicking "view page source" - that's what your program is pulling in. There's a LOT more to a webpage than most people realize!)
As a word of warning, though, Sports Reference has a data use policy that precludes web crawlers / spiders on their site. I would recommend checking (and using) one of the free sites they link instead; you risk being IP banned otherwise.
Simply printing the result of the get request on the terminal won't be very helpful, as the HTML page content returned is long - your terminal will truncate the printed response. I'm assuming in your case maybe the website has parts of the homepage reused in other pages as well, so it might get confusing.
I recommend writing the response into a file and then opening the file in the browser. You will see that your code is pulling the right page.
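Something along these lines should do it. The function names and the default file name page.html are my own choices, not anything from the question.

```python
import pathlib
import webbrowser


def save_response_text(html_text, path="page.html"):
    """Write the downloaded HTML to disk and return the absolute path."""
    target = pathlib.Path(path).resolve()
    target.write_text(html_text, encoding="utf-8")
    return target


def open_in_browser(saved_path):
    """Open the saved file in the default browser via a file:// URI."""
    return webbrowser.open(saved_path.as_uri())
```

With the question's code you would call saved = save_response_text(page.text) and then open_in_browser(saved). Note that the browser will run the page's scripts, so what you see may still differ slightly from the raw text.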

The HTML code I scrape seems to be incomplete in comparison to the full website. Could the HTML be changing dynamically?

I am currently scraping a website for work so that I can sort the data locally. However, the HTML I get seems to be incomplete, and I suspect the page adds more content as I scroll. Can this happen? And if so, how can I make sure I scrape the whole page for processing?
I currently only know some Python and HTML for web scraping, and I am looking into what else may be affecting this (JavaScript, ReactJS, etc.).
I expect to get a list of 50 names when scraping the website, but it only returns 13. I have downloaded the whole HTML file to go through it, and none of the other names exist in the file, which is why I think the page may be changing dynamically.
Yes, the content of the HTML can be dynamic, and JavaScript loading is the most likely cause. For Python, scrapy + splash may be a good choice to get started.
Depending on how the data is loaded, there are different ways to handle dynamic HTML content.

Script cannot fetch data from a web page

I am trying to write a program in Python that can take the name of a stock and its price and print it. However, when I run it, nothing is printed. It seems like there is a problem fetching the data from the website. I double-checked that the path from the web page is correct, but for some reason the text does not show up.
from lxml import html
import requests
page = requests.get('https://www.bloomberg.com/quote/UKX:IND?in_source=topQuotes')
tree = html.fromstring(page.content)
Prices = tree.xpath('//span[@class="priceText__1853e8a5"]/text()')
print('Prices:', Prices)
here is the website I am trying to get the data from
I have tried BeautifulSoup, but it has the same problem.
If you print the string page.content, you'll see that the website code it captures is actually for a captcha test, not the "real" destination page itself you see when you manually visit the website. It seems that the website was smart enough to see that your request to this URL was from a script and not manually from a human, and it effectively prevented your script from scraping any real content. So Prices is empty because there simply isn't a span tag of class "priceText__1853e8a5" on this special Captcha page. I get the same when I try scraping with urllib2.
As others have suggested, Selenium (actual web automation) might be able to launch the page and get you what you need. The ID looks dynamically generated, though I do get the same one when I manually look at the page. Another alternative is to simply find a different site that can give you the quote you need without blocking your script. I tried it with https://tradingeconomics.com/ukx:ind and that works. Though of course you'll need a different xpath to find the cell you need.
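To tell these cases apart in a script, you can check the response text before parsing it. The following is only a heuristic sketch: the function name, the list of block words, and the sample strings are my own assumptions, and a real block page may use different wording.

```python
def classify_response(html_text, expected_marker, block_words=("captcha", "are you a robot")):
    """Return 'ok' if the expected marker is present, 'blocked' if the page
    looks like a bot check, and 'unknown' otherwise."""
    lowered = html_text.lower()
    if expected_marker.lower() in lowered:
        return "ok"
    if any(word in lowered for word in block_words):
        return "blocked"
    return "unknown"


# Invented snippets standing in for page.text.
real_page = '<span class="priceText__1853e8a5">7,000.00</span>'
bot_check = "<html><body>Please complete the CAPTCHA to continue</body></html>"

print(classify_response(real_page, "priceText__1853e8a5"))
print(classify_response(bot_check, "priceText__1853e8a5"))
```

An 'unknown' result usually means the markup changed or the data is loaded by JavaScript, so it is worth saving the response and looking at it by hand.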

Issues downloading full HTML of webpage with Python

I'm working on a project where I need all of the game IDs found in the current scores section of http://www.nhl.com/ so that I can download content and parse stats for each game. I want to be able to get all current game IDs in one go, but for some reason I'm unable to download the full HTML of the page, no matter how I try. I'm using requests and beautifulsoup4.
Here's my problem:
I've determined that the particular tags I'm interested in are div's where the CSS class = 'scrblk'. So, I wrote a function to pass into BeautifulSoup.find_all() to give me, specifically, blocks with that CSS class. It looks like this:
def find_scrblk(css_class):
    return css_class is not None and css_class == 'scrblk'
so, when I actually went to the web page in Firefox and saved it, then loaded the saved file in beautifulsoup4, I did the following:
>>> soup = bs(open('nhl.html'))
>>> soup.find_all(class_=find_scrblk)
[<div class="scrblk" id="hsb2015010029"> <div class="defaultState"....]
and everything was all fine and dandy; I had all the info I needed. However, when I tried to download the page using any of several automated methods I know, this returned simply an empty list. Here's what I tried:
using requests.get() and saving the .text attribute in a file
using the iter_content() and iter_lines() methods of the request object to write to the file piece by piece
using wget to download the page (through subprocess.call()) and open the resultant file. For this option, I was sure to use the --page-requisites and --convert-links flags so I downloaded (or so I thought) all the necessary data.
With all of the above, I was unable to parse out the data that I need from the HTML files; it's as if they weren't being completely downloaded or something, but I have no idea what that something is or how to fix it. What am I doing wrong or missing here? I'm using python 2.7.9 on Ubuntu 15.04.
All of the files can be downloaded here:
https://www.dropbox.com/s/k6vv8hcxbkwy32b/nhl_html_examples.zip?dl=0
As the comments on your question state, you have to rethink your approach. What you see in the browser is not what the response contains. The site uses JavaScript to load the information you are after, so look closely at what the response actually contains before assuming the data is in it.
In the future, to diagnose such problems, disable JavaScript in Chrome's developer tools and reload the site. You will then see whether the values you are looking for are in the static HTML or are loaded by JavaScript.
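You can run the same check from Python by counting the elements you expect in the raw response. This is a small sketch with the standard library's html.parser; the two sample strings are invented stand-ins for the page saved from the browser and for the text returned by requests.get(), and only the scrblk class and the hsb2015010029 id come from the question.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags whose class attribute contains a given class name."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.count = 0

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.css_class in classes:
            self.count += 1


def count_class(html_text, css_class):
    parser = ClassCounter(css_class)
    parser.feed(html_text)
    parser.close()
    return parser.count


# Invented stand-ins: the saved browser page versus the raw response.
saved_from_browser = '<div id="scores"><div class="scrblk" id="hsb2015010029"></div></div>'
raw_response = '<div id="scores"></div><script src="scores.js"></script>'

print(count_class(saved_from_browser, "scrblk"))
print(count_class(raw_response, "scrblk"))
```

If the count is zero for the raw response but not for the saved page, the elements are being added by JavaScript after the initial load.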
And by the way, what you are doing is against the Terms of Service of the NHL website (according to Section 2, Prohibited Content and Activities):
Engage in unauthorized spidering, scraping, or harvesting of content or information, or use any other unauthorized automated means to compile information;
