Unable to find exact source code of my blog - python

I am working on a project that involves parsing the HTML of web pages. So I took my blog (a Blogger blog with a Dynamic template) and tried to read its content. Unfortunately, I could not get at the "actual" source of the blog's web page.
Here is what I observed:
1. I clicked "view source" on a random article of my blog and tried to find the content in it. I couldn't find any; it was all JavaScript.
2. So I saved the web page to my laptop and checked the source again. This time I found the content.
3. I also checked the source using the browser's developer tools, and again found the content in it.
4. Finally, I tried the Python way:
import urllib
from bs4 import BeautifulSoup
soup = BeautifulSoup( urllib.urlopen("my-webpage-address") )
print soup.prettify()
I still didn't find the content in the HTML it returned.
So why am I unable to find the content in the source code in cases 1 and 4?
How should I get the actual HTML code? I would like to hear about any Python library that would do the job.

The content is loaded via JavaScript (AJAX). It's not in the "source".
In step 2, you are saving the resulting page, not the original source. In step 3, you're seeing what's being rendered by the browser.
Steps 1 and 4 "don't work" because you're getting the page's source (which doesn't contain the content). You need to actually run the JavaScript, which isn't easy for a screen scraper to do.
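To make the distinction concrete, here is a minimal sketch (standard library only) that checks whether a phrase appears in the page's markup or only inside a script. The sample source and the `loadPost` call are made up for illustration; they are not taken from the blog in question.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text that is not inside <script> or <style> elements."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside <script>/<style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html_source):
    """Return the text found in the markup, ignoring scripts and styles."""
    parser = VisibleTextParser()
    parser.feed(html_source)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical raw source of a JavaScript-driven page: the article text
# exists only as a string inside a script, not as markup.
raw_source = """<html><head><title>My blog</title>
<script>loadPost({"body": "Hello from my first post"});</script>
</head><body><div id="main"></div></body></html>"""

print("Hello from my first post" in raw_source)                # True: it is in a script
print("Hello from my first post" in visible_text(raw_source))  # False: not in the markup
```

If the phrase only shows up in the first check (or in neither), the content is being produced or fetched by JavaScript, and a plain HTML parser will not see it.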

Related

Python Requests only pulling half of intended tags

I'm attempting to scrape a website, and pull each sheriff's name and county. I'm using devtools in chrome to identify the HTML tag needed to locate that information.
import requests
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
URL = 'https://oregonsheriffs.org/about-ossa/meet-your-sheriffs'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
sheriff_names = soup.find_all('a', class_ = 'eg-sheriff-list-skin-element-1')
sheriff_counties = soup.find_all(class_ = 'eg-sheriff-list-skin-element-2')
However, I'm finding that Requests is not pulling the entire page's HTML, even though the tag is at the end. If I scan page.content, I find that Sheriff Harrold is the last sheriff included, and that every sheriff from Curtis Landers onwards is missing (I tried pasting the full output of page.content but it's too long).
My best guess from reading this answer is that the website has JavaScript that loads the remaining part of the page upon interacting with it, which would imply that I need to use something like Selenium to interact with the page to get the rest of it to load first.
However, if you look at the website, it's very simple, so as a novice part of me is thinking that there has to be a way to scrape this basic website without using a more complex tool like Selenium. That said, I recognize that the website is WordPress-generated and WordPress can set up delayed JavaScript on even simple websites.
My questions are:
1) Do I really need to use Selenium to scrape a simple, WordPress-generated website like this? Or is there a way to get the full page to load with just Requests? Is there any way to tell when a web page will require a web driver and when Requests alone will be enough?
2) I'm thinking one step ahead here: if I want to scale up this project, how would I be able to tell that Requests has not returned the full website, without manually inspecting the results for every website?
Thanks!
Unfortunately, your initial instinct is almost certainly correct. If you look at the page source it seems that they have some sort of lazy loading going on, pulling content from an external source.
A quick look at the page source indicates that they're probably using the "Essential Grid" WordPress plugin to do this. I think this supports preloading. If you look at the requests that are made (for example, in the Network tab of your browser's developer tools), you might be able to ascertain how it's loading this and pull directly from that source (perhaps a REST call, AJAX, etc.).
In a generalized sense, I'm afraid that there really isn't any automated way to programmatically determine if a page has 'fully' loaded, as that behavior is defined in code and can be triggered by anything.
If you want to capture information from pages that load content as you scroll, though, I believe Selenium is the tool you'll have to use.
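There is no general test, but for a site you already know, a rough sanity check is possible: count the elements you expect in the static HTML and compare against a minimum you supply yourself. The sketch below uses only the standard library; `expected_min` and the helper names are assumptions for illustration, not part of Requests or BeautifulSoup.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags whose class attribute includes a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_class(html_source, class_name):
    """Return how many elements in html_source carry class_name."""
    counter = ClassCounter(class_name)
    counter.feed(html_source)
    counter.close()
    return counter.count


def looks_incomplete(html_source, class_name, expected_min):
    """Flag a page whose static HTML has fewer matches than you expect."""
    return count_class(html_source, class_name) < expected_min
```

For example, `looks_incomplete(page.text, 'eg-sheriff-list-skin-element-1', expected_min)` would flag the page whenever fewer names come back than the number you know should be listed. It only tells you something is missing, not why.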

Script cannot fetch data from a web page

I am trying to write a program in Python that can take the name of a stock and its price and print it. However, when I run it, nothing is printed. It seems that the data is not being fetched from the website. I double-checked that the XPath for the element on the web page is correct, but for some reason the text does not show up.
from lxml import html
import requests
page = requests.get('https://www.bloomberg.com/quote/UKX:IND?in_source=topQuotes')
tree = html.fromstring(page.content)
Prices = tree.xpath('//span[@class="priceText__1853e8a5"]/text()')
print ('Prices:' , Prices)
here is the website I am trying to get the data from
I have tried BeautifulSoup, but it has the same problem.
If you print the string page.content, you'll see that the website code it captures is actually for a captcha test, not the "real" destination page itself you see when you manually visit the website. It seems that the website was smart enough to see that your request to this URL was from a script and not manually from a human, and it effectively prevented your script from scraping any real content. So Prices is empty because there simply isn't a span tag of class "priceText__1853e8a5" on this special Captcha page. I get the same when I try scraping with urllib2.
As others have suggested, Selenium (actual web automation) might be able to launch the page and get you what you need. The ID looks dynamically generated, though I do get the same one when I manually look at the page. Another alternative is to simply find a different site that can give you the quote you need without blocking your script. I tried it with https://tradingeconomics.com/ukx:ind and that works. Though of course you'll need a different xpath to find the cell you need.
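The manual check described above (printing page.content and noticing a captcha) can be roughly automated. This is only a heuristic sketch: the keyword list is my own assumption, and a site can word its block page however it likes.

```python
def looks_like_block_page(html_source, expected_marker,
                          block_words=("captcha", "are you a robot")):
    """Heuristic: the expected content is missing AND a block-page keyword appears.

    expected_marker is any string you know appears on the real page,
    such as the class name used in your XPath. block_words is an
    assumed, non-exhaustive list of phrases.
    """
    lowered = html_source.lower()
    has_expected = expected_marker.lower() in lowered
    has_block_word = any(word in lowered for word in block_words)
    return (not has_expected) and has_block_word


# Made-up responses for illustration only.
blocked = "<html><body><h1>Please complete the CAPTCHA</h1></body></html>"
normal = '<html><body><span class="priceText__1853e8a5">...</span></body></html>'

print(looks_like_block_page(blocked, "priceText__1853e8a5"))  # True
print(looks_like_block_page(normal, "priceText__1853e8a5"))   # False
```

You would call it as `looks_like_block_page(page.text, "priceText__1853e8a5")` right after `requests.get(...)`, before trusting an empty XPath result.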

Issues downloading full HTML of webpage with Python

I'm working on a project where I need all of the game IDs found in the current scores section of http://www.nhl.com/ so that I can download content and parse stats for each game. I want to be able to get all current game IDs in one go, but for some reason I'm unable to download the full HTML of the page, no matter how I try. I'm using requests and beautifulsoup4.
Here's my problem:
I've determined that the particular tags I'm interested in are div's where the CSS class = 'scrblk'. So, I wrote a function to pass into BeautifulSoup.find_all() to give me, specifically, blocks with that CSS class. It looks like this:
def find_scrblk(css_class):
    return css_class is not None and css_class == 'scrblk'
so, when I actually went to the web page in Firefox and saved it, then loaded the saved file in beautifulsoup4, I did the following:
>>>soup = bs(open('nhl.html'))
>>>soup.find_all(class_=find_scrblk)
[<div class="scrblk" id="hsb2015010029"> <div class="defaultState"....]
and everything was all fine and dandy; I had all the info I needed. However, when I tried to download the page using any of several automated methods I know, this returned simply an empty list. Here's what I tried:
- using requests.get() and saving the .text attribute in a file
- using the iter_content() and iter_lines() methods of the request object to write to the file piece by piece
- using wget to download the page (through subprocess.call()) and opening the resultant file. For this option, I was sure to use the --page-requisites and --convert-links flags so I downloaded (or so I thought) all the necessary data.
With all of the above, I was unable to parse out the data that I need from the HTML files; it's as if they weren't being completely downloaded or something, but I have no idea what that something is or how to fix it. What am I doing wrong or missing here? I'm using python 2.7.9 on Ubuntu 15.04.
All of the files can be downloaded here:
https://www.dropbox.com/s/k6vv8hcxbkwy32b/nhl_html_examples.zip?dl=0
As the comments on your question state, you need to rethink your approach. What you see in the browser is not what the response contains. The site uses JavaScript to load the information you are after, so look more carefully at the response you actually receive to see what it contains.
In the future, to diagnose such problems, disable JavaScript in Chrome's developer tools and open the site that way. You will then see whether the content depends on JavaScript or whether the static HTML already contains the values you are looking for.
And by the way what you do is against the Terms of Service of the NHL website (according to Section 2. Prohibited Content and Activities)
Engage in unauthorized spidering, scraping, or harvesting of content or information, or use any other unauthorized automated means to compile information;

Using Python requests.get to parse html code that does not load at once

I am trying to write a Python script that will periodically check a website to see if an item is available. I have used requests.get, lxml.html, and xpath successfully in the past to automate website searches. In the case of this particular URL (http://www.anthropologie.com/anthro/product/4120200892474.jsp?cm_vc=SEARCH_RESULTS#/) and others on the same website, my code was not working.
import requests
from lxml import html
page = requests.get("http://www.anthropologie.com/anthro/product/4120200892474.jsp?cm_vc=SEARCH_RESULTS#/")
tree = html.fromstring(page.text)
html_element = tree.xpath(".//div[@class='product-soldout ng-scope']")
at this point, html_element should be a list of elements (I think in this case only 1), but instead it is empty. I think this is because the website is not loading all at once, so when requests.get() goes out and grabs it, it's only grabbing the first part. So my questions are
1: Am I correct in my assessment of the problem?
and
2: If so, is there a way to make requests.get() wait before returning the html, or perhaps another route entirely to get the whole page.
Thanks
Edit: Thanks to both responses. I used Selenium and got my script working.
You are not correct in your assessment of the problem.
You can check the results and see that there's a </html> right near the end. That means you've got the whole page.
And the response's .text attribute always contains the whole page; if you want to stream it a bit at a time, you have to do so explicitly.
Your problem is that the table doesn't actually exist in the HTML; it's built dynamically by client-side JavaScript. You can see that by actually reading the HTML that's returned. So, unless you run that JavaScript, you don't have the information.
There are a number of general solutions to that. For example:
Use selenium or similar to drive an actual browser to download the page.
Manually work out what the JavaScript code does and do equivalent work in Python.
Run a headless JavaScript interpreter against a DOM that you've built up.
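As a rough illustration of the second option: client-side code often just fetches JSON from some endpoint and renders it, in which case you can fetch the same JSON yourself and parse it. The sketch below covers only the parsing half, and the payload shape (`items`, `sku`, `soldOut`) is entirely hypothetical; the real endpoint and fields have to be found by watching the requests in your browser's developer tools.

```python
import json


def rows_from_payload(payload_text, items_key="items"):
    """Parse a JSON payload and return its list of row dicts.

    items_key is an assumed field name; the real one depends on the site.
    """
    payload = json.loads(payload_text)
    rows = payload.get(items_key, [])
    return [row for row in rows if isinstance(row, dict)]


# Hypothetical payload standing in for what the page's JavaScript receives.
sample = '{"items": [{"sku": "A1", "soldOut": true}, {"sku": "B2", "soldOut": false}]}'

sold_out = [row["sku"] for row in rows_from_payload(sample) if row.get("soldOut")]
print(sold_out)  # ['A1']
```

In practice `payload_text` would be the body of a `requests.get()` call to whatever URL the page's script requests.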
The page uses JavaScript to load the table, which has not been loaded when requests gets the HTML. So you are getting all of the HTML, just not the part that is generated by JavaScript. You could use Selenium combined with PhantomJS for headless browsing to get the rendered HTML:
from selenium import webdriver
browser = webdriver.PhantomJS()
browser.get("http://www.anthropologie.eu/anthro/index.jsp#/")
html = browser.page_source
print(html)

Ensure a page has downloaded correctly in Python

I am writing a basic screen scraping script using Mechanize and BeautifulSoup (BS) in Python. However, the problem I am running into is that for some reason the requested page does not download correctly every time. I am concluding this because when searching the downloaded pages using BS for present tags, I get an error. If I download the page again, it works.
Hence, I would like to write a small function that checks to see if the page has correctly downloaded and re-download if necessary (I could also solve it by figuring out what goes wrong, but that is probably too advanced for me). My question is how would I go about checking to see if the page has been downloaded correctly?
You can just check for a tag you expect to be there, and if it fails, repeat the download.
soup = BeautifulSoup(page)
while soup.body is None:
    # re-download the page into `page` here, then parse it again
    soup = BeautifulSoup(page)
# now you can use the data in soup
I think you can simply search for the closing html tag: if that tag is present, the page is valid.
The most generic solution is to check that the </html> closing tag exists. That will allow you to detect truncation of the page.
Anything else, and you will have to describe your failure mode more clearly.
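The closing-tag check combined with a retry could look something like the sketch below. `fetch` is a placeholder for whatever function downloads the page and returns its HTML as a string (a Mechanize call, for instance); both helper names are mine, not from any library.

```python
def is_complete_html(html_source):
    """True if the document contains a closing </html> tag."""
    return "</html>" in html_source.lower()


def fetch_until_complete(fetch, max_attempts=3):
    """Call fetch() until it returns complete HTML or attempts run out.

    fetch is any zero-argument callable returning the page as a string.
    """
    for _ in range(max_attempts):
        html_source = fetch()
        if is_complete_html(html_source):
            return html_source
    raise RuntimeError("page still truncated after %d attempts" % max_attempts)
```

Capping the number of attempts avoids looping forever when the page is consistently broken. Note that this only detects truncation; a complete page that is missing the tags you want will still pass.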
