I am writing a basic screen scraping script using Mechanize and BeautifulSoup (BS) in Python. However, the problem I am running into is that, for some reason, the requested page does not download correctly every time. I conclude this because when I search the downloaded pages with BS for tags that should be present, I get an error. If I download the page again, it works.
Hence, I would like to write a small function that checks whether the page has downloaded correctly and re-downloads it if necessary (I could also solve it by figuring out what goes wrong, but that is probably too advanced for me). My question is: how would I go about checking whether the page has been downloaded correctly?
You can just check for a tag you expect to be there, and if it fails, repeat the download.
page = BeautifulSoup(html)
while page.body is None:
    # re-download the page; download_page() stands in for whatever fetch code you already use
    html = download_page()
    page = BeautifulSoup(html)
# now you can use the data
I think you can simply search for the closing html tag: if </html> is in the page, it is a valid page.
The most generic solution is to check that the </html> closing tag exists. That will allow you to detect truncation of the page.
Anything else, and you will have to describe your failure mode more clearly.
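A minimal sketch of that check, assuming requests for the download (the question uses Mechanize, but the idea is the same); fetch_complete is a hypothetical helper name:
import requests

def fetch_complete(url, max_retries=3):
    # re-download until the closing tag arrives, i.e. the page
    # was not truncated mid-transfer
    for _ in range(max_retries):
        html = requests.get(url).text
        if "</html>" in html:
            return html
    raise RuntimeError("page still truncated after retries: " + url)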
I'm trying to iterate over some pages of a website. The different pages are marked with -or10, -or20, -or30 etc. in the URL, i.e.:
/Restaurant_Review
is the first page,
/Restaurant_Review-or10
is the second page,
/Restaurant_Review-or20
is the third page, etc.
The problem is that I get redirected from those URLs to the normal URL (the first one) if the -or- version doesn't exist. I'm currently looping over a range in a for loop and dynamically changing the -or- value.
def parse(self, response):
    reviewRange = range(10, 100, 10)
    for x in reviewRange:
        yield scrapy.Request(url + "-or" + str(x), callback=self.parse_page)

def parse_page(self, response):
    # do something
    # How can I tell the loop in parse() to stop from here?
    if oldurl == response.url:
        return  # this doesn't work: it only exits parse_page, not the loop
The problem is that I need to make the request even if the page doesn't exist, and this does not scale. I've tried comparing the URLs, but I still don't understand how I can return something from the parse_page() function that tells the parse() function to stop.
You can check what is in response.meta.get('redirect_urls'), for example. If there is something there, retry the original URL with dont_filter.
Or try to catch such cases with RetryMiddleware.
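A rough sketch of the first option (assuming a standard Scrapy spider; the parse_page name comes from the question):
def parse_page(self, response):
    # Scrapy's RedirectMiddleware records the redirect chain in response.meta
    if response.meta.get('redirect_urls'):
        # we were bounced back to the plain review page, so this
        # -orNN offset doesn't exist: stop following further pages
        return
    # ...otherwise, parse the reviews on this page as usual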
This is not an answer to the actual question, but rather an alternative solution that does not require redirect detection.
In the HTML you can already find all those pagination URLs by using:
response.css('.pageNum::attr(href)').getall()
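For instance, a sketch (again assuming a standard Scrapy spider) that follows the links the page itself advertises instead of guessing -orNN offsets:
def parse(self, response):
    # follow every pagination link present in the page's own HTML
    for href in response.css('.pageNum::attr(href)').getall():
        yield response.follow(href, callback=self.parse_page)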
Regarding @Anton's question in a comment about how I got this:
You can check this by opening a random restaurant review page with the Scrapy shell:
scrapy shell "https://www.tripadvisor.co.za/Restaurant_Review-g32655-d348825-Reviews-Brent_s_Delicatessen_Restaurant-Los_Angeles_California.html"
Inside the shell you can view the received HTML in your browser with:
view(response)
There you'll see that it includes the HTML (and that specific class) for the pagination links. The real website does use JavaScript to render the next page, but it does so by retrieving the full HTML for the next page based on the URL. Basically, it just replaces the entire page; there's very little additional processing involved. So if you open the link yourself, you get the full HTML too. Hence, the JavaScript issue is irrelevant here.
I am trying to write a program in Python that takes the name of a stock and its price and prints them. However, when I run it, nothing is printed. It seems like the data is having a problem being fetched from the website. I double-checked that the path from the web page is correct, but for some reason the text does not show up.
from lxml import html
import requests

page = requests.get('https://www.bloomberg.com/quote/UKX:IND?in_source=topQuotes')
tree = html.fromstring(page.content)
Prices = tree.xpath('//span[@class="priceText__1853e8a5"]/text()')
print('Prices:', Prices)
Here is the website I am trying to get the data from.
I have tried BeautifulSoup, but it has the same problem.
If you print the string page.content, you'll see that the website code it captures is actually for a captcha test, not the "real" destination page itself you see when you manually visit the website. It seems that the website was smart enough to see that your request to this URL was from a script and not manually from a human, and it effectively prevented your script from scraping any real content. So Prices is empty because there simply isn't a span tag of class "priceText__1853e8a5" on this special Captcha page. I get the same when I try scraping with urllib2.
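A quick way to confirm this yourself (a sketch):
import requests

page = requests.get('https://www.bloomberg.com/quote/UKX:IND?in_source=topQuotes')
# print the start of the body: if it's the bot check described above,
# you'll see the captcha markup instead of the quote page
print(page.text[:500])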
As others have suggested, Selenium (actual web automation) might be able to launch the page and get you what you need. The ID looks dynamically generated, though I do get the same one when I manually look at the page. Another alternative is to simply find a different site that can give you the quote you need without blocking your script. I tried it with https://tradingeconomics.com/ukx:ind and that works. Though of course you'll need a different xpath to find the cell you need.
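A rough sketch of the Selenium route (assuming the Selenium 3 API and a chromedriver install; the class name is copied from the question and, being auto-generated, may have changed):
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://www.bloomberg.com/quote/UKX:IND')
# class name taken from the question; treat it as fragile
spans = driver.find_elements_by_xpath('//span[@class="priceText__1853e8a5"]')
print('Prices:', [s.text for s in spans])
driver.quit()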
I'm working on a project where I require all of the game IDs found in the current scores section of http://www.nhl.com/ in order to download content and parse stats for each game. I want to be able to get all current game IDs in one go, but for some reason I'm unable to download the full HTML of the page, no matter what I try. I'm using requests and beautifulsoup4.
Here's my problem:
I've determined that the particular tags I'm interested in are divs whose CSS class is 'scrblk'. So, I wrote a function to pass into BeautifulSoup.find_all() to give me, specifically, blocks with that CSS class. It looks like this:
def find_scrblk(css_class):
    return css_class is not None and css_class == 'scrblk'
So, when I actually went to the web page in Firefox and saved it, then loaded the saved file into beautifulsoup4, I did the following:
>>> soup = bs(open('nhl.html'))
>>> soup.find_all(class_=find_scrblk)
[<div class="scrblk" id="hsb2015010029"> <div class="defaultState"....]
and everything was all fine and dandy; I had all the info I needed. However, when I tried to download the page using any of several automated methods I know, find_all() simply returned an empty list. Here's what I tried:
- using requests.get() and saving the .text attribute in a file
- using the iter_content() and iter_lines() methods of the request object to write to the file piece by piece
- using wget to download the page (through subprocess.call()) and opening the resultant file. For this option, I was sure to use the --page-requisites and --convert-links flags so that I downloaded (or so I thought) all the necessary data.
With all of the above, I was unable to parse out the data I need from the HTML files; it's as if they weren't being completely downloaded or something, but I have no idea what that something is or how to fix it. What am I doing wrong or missing here? I'm using Python 2.7.9 on Ubuntu 15.04.
All of the files can be downloaded here:
https://www.dropbox.com/s/k6vv8hcxbkwy32b/nhl_html_examples.zip?dl=0
As the comments on your question state, you have to rethink your approach. What you see in the browser is not what the response contains: the site uses JavaScript to load the information you are after, so you should look more carefully at the response you actually get to see what it contains.
In the future, to diagnose such problems, open Chrome's developer console, disable JavaScript, and load the site that way. Then you will see whether you are up against JS rendering or whether the raw page contains the values you are looking for.
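A quick sketch of that kind of check, assuming requests (as in the question): see whether the class you expect appears in the raw response at all.
import requests

# if the scoreboard is built by client-side JavaScript, as described
# above, the class will be missing from the raw HTML requests receives
html = requests.get('http://www.nhl.com/').text
print('scrblk' in html)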
And by the way, what you are doing is against the Terms of Service of the NHL website (according to Section 2, "Prohibited Content and Activities"):
Engage in unauthorized spidering, scraping, or harvesting of content or information, or use any other unauthorized automated means to compile information;
I am trying to write a Python script that will periodically check a website to see if an item is available. I have used requests.get, lxml.html, and xpath successfully in the past to automate website searches. In the case of this particular URL (http://www.anthropologie.com/anthro/product/4120200892474.jsp?cm_vc=SEARCH_RESULTS#/) and others on the same website, my code was not working.
import requests
from lxml import html
page = requests.get("http://www.anthropologie.com/anthro/product/4120200892474.jsp?cm_vc=SEARCH_RESULTS#/")
tree = html.fromstring(page.text)
html_element = tree.xpath(".//div[@class='product-soldout ng-scope']")
At this point, html_element should be a list of elements (I think in this case only one), but instead it is empty. I think this is because the website does not load all at once, so when requests.get() goes out and grabs it, it only grabs the first part. So my questions are:
1: Am I correct in my assessment of the problem?
and
2: If so, is there a way to make requests.get() wait before returning the html, or perhaps another route entirely to get the whole page.
Thanks
Edit: Thanks to both responses. I used Selenium and got my script working.
You are not correct in your assessment of the problem.
You can check the results and see that there's a </html> right near the end. That means you've got the whole page.
And requests' .text attribute always gives you the whole page; if you want to stream it a bit at a time, you have to do so explicitly.
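For example, a sketch of explicit streaming with requests, using the URL from the question:
import requests

url = "http://www.anthropologie.com/anthro/product/4120200892474.jsp?cm_vc=SEARCH_RESULTS#/"
# explicit streaming: fetch and write the body a chunk at a time
response = requests.get(url, stream=True)
with open("page.html", "wb") as f:
    for chunk in response.iter_content(chunk_size=8192):
        f.write(chunk)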
Your problem is that the table doesn't actually exist in the HTML; it's built dynamically by client-side JavaScript. You can see that by actually reading the HTML that's returned. So, unless you run that JavaScript, you don't have the information.
There are a number of general solutions to that. For example:
- Use selenium or similar to drive an actual browser to download the page.
- Manually work out what the JavaScript code does and do equivalent work in Python.
- Run a headless JavaScript interpreter against a DOM that you've built up.
The page uses JavaScript to load the table, and that JavaScript has not run when requests fetches the HTML. So you are getting all the HTML, just not the parts generated by JavaScript. You could use selenium combined with PhantomJS for headless browsing to get the rendered HTML:
from selenium import webdriver

# requires the phantomjs binary to be on your PATH
browser = webdriver.PhantomJS()
browser.get("http://www.anthropologie.eu/anthro/index.jsp#/")
html = browser.page_source  # the DOM after JavaScript has run
print(html)
I am working on a project that involves parsing the HTML of web pages. So, I took my blog (a Blogger blog with a dynamic template) and tried to read its content. Unfortunately, I failed to get at the "actual" source of the blog's webpage.
Here is what I observed:
I clicked View Source on a random article of my blog and tried to find the content in it, and I couldn't find any. It was all JavaScript.
So, I saved the webpage to my laptop and checked the source again; this time I found the content.
I also checked the source using the developer tools in browsers, and again found the content in it.
Now, I tried the Python way:
import urllib
from bs4 import BeautifulSoup

soup = BeautifulSoup(urllib.urlopen("my-webpage-address"))
print soup.prettify()
Even here, I did not find the content in the HTML.
Finally: why am I unable to find the content in the source code in cases 1 and 4?
How should I get the actual HTML code? I would like to hear about any Python library that would do the job.
The content is loaded via JavaScript (AJAX). It's not in the "source".
In step 2, you are saving the resulting page, not the original source. In step 3, you're seeing what's being rendered by the browser.
Steps 1 and 4 "don't work" because you're getting the page's source (which doesn't contain the content). You need to actually run the JavaScript, which isn't easy for a screen scraper to do.
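A minimal sketch of that route, assuming Selenium with Firefox (any driver works) and keeping the placeholder address from the question:
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("my-webpage-address")  # placeholder URL from the question
rendered = driver.page_source     # the DOM after the JavaScript has run
driver.quit()

soup = BeautifulSoup(rendered)
print(soup.prettify())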