I'm new to web scraping and have been using BeautifulSoup to scrape daily mortgage rates. However, a lot of the servicer sites that I am trying to scrape return None or an empty list. The tags that I'm trying to scrape definitely exist, and I just don't understand why this is happening. One example is below.
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('https://www.popular.com/en/mortgages/')
bs0bj = BeautifulSoup(html, 'html.parser')
rate = bs0bj.find('span', {'class': 'text-md text-popular-medium-blue'}).div
print(rate)
If you check the page source (for example via view-source: in Chrome or Firefox, or by writing your html string to a file) you'll see that the element you are looking for is not there. In fact, the rates are loaded dynamically:
<td>
<span class="text-md text-popular-medium-blue">
<script type="text/javascript" src = "https://apps.popular.com/navs/rates_wm_modx.php?id_rates=1&textcolor=3784D2&backgroundcolor=ffffff&t=1"></script>
</span>
</td>
You can follow the script URL and you'll see that the response is something like the following:
document.write('<div>2.75%</div>')
This response is probably regular enough to be able to use regexes on it.
To get the data you are after, you can use Selenium together with BeautifulSoup in Python, something like below:
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('https://www.popular.com/en/mortgages/')
soup = BeautifulSoup(driver.page_source,"lxml")
item = soup.select('.table-responsive')[0].select("span div")[0].text
print(item)
driver.quit()
Result:
2.75%
As rubik said, the rates are loaded dynamically using JS. Luckily, the structure of the content is relatively simple; here is how I analyzed it:
Open a new tab in Chrome (or another browser), right-click and choose Inspect to open the developer tools. Switch to the Network tab and check the Preserve log option.
Now, open the website https://www.popular.com/en/mortgages/. The loaded contents can be seen in the left panel.
Check each item and analyze its Preview content until you find the one you want to scrape. Here is what I found: the 2.75% matches the mortgage rate shown on the website.
Now, switch to the Headers tab and check the Request URL; this is the request actually sent to the server.
The next step is to analyze the Request URL https://apps.popular.com/navs/rates_wm_modx.php?id_rates=1&textcolor=3784D2&backgroundcolor=ffffff&t=1
I guessed that textcolor and backgroundcolor carry CSS information, so I removed them and found the URL is still valid.
Now we have a simpler URL: https://apps.popular.com/navs/rates_wm_modx.php?id_rates=1&t=1
It is obvious that id_rates indicates the order of the mortgage rates, without needing any analysis. The question is: what does t mean?
This can be answered by analyzing the other Preview contents to find the pattern. Here I want to skip the process and just give the conclusion:
t=1 indicates the annual interest, t=2 the APR, t=6 the P&I payment, and so on.
After doing this, you can scrape the content from the request URL directly:
from urllib2 import urlopen
import re
file=urlopen('https://apps.popular.com/navs/rates_wm_modx.php?id_rates=1&t=1')
annual_interest = re.findall(r"\d+\.\d+", str(file.read()))[0]
#the annual interest is 2.75
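Building on that, here is a small sketch (Python 3, using requests) that pulls several figures at once. The t-to-label mapping below is the guess from the analysis above, not anything documented by the site:
import re
import requests

# t-value labels guessed from the analysis above (an assumption, not documented)
T_LABELS = {1: 'Annual interest', 2: 'APR', 6: 'P&I Payment'}

for t, label in T_LABELS.items():
    url = 'https://apps.popular.com/navs/rates_wm_modx.php?id_rates=1&t={}'.format(t)
    body = requests.get(url).text
    # the response is a document.write('<div>...</div>') call,
    # so grab the first decimal number inside it
    match = re.search(r"\d+\.\d+", body)
    print(label, match.group(0) if match else 'not found')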
Use pip install html5lib, though with pip install bs4 (BeautifulSoup) everything should be installed automatically. If you are using PyCharm like me, after running pip install bs4 on the command line, open PyCharm, go to the interpreter settings and add beautifulsoup4 and html5lib. html5lib is a parser; it plays the same role as html.parser. For more info, here is the BeautifulSoup documentation on installing a parser: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser
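For example, once the parsers are installed, the only thing you change is the second argument to the BeautifulSoup constructor. A minimal sketch, assuming html5lib and lxml have both been pip-installed:
from bs4 import BeautifulSoup

html = "<html><body><p>hello</p></body></html>"

# pick whichever parser you have installed; all three parse the same markup
soup_builtin = BeautifulSoup(html, "html.parser")  # ships with Python
soup_html5 = BeautifulSoup(html, "html5lib")       # pip install html5lib
soup_lxml = BeautifulSoup(html, "lxml")            # pip install lxml

print(soup_html5.p.text)  # hello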
I'm attempting to scrape a website and pull each sheriff's name and county. I'm using DevTools in Chrome to identify the HTML tag needed to locate that information.
import requests
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup

URL = 'https://oregonsheriffs.org/about-ossa/meet-your-sheriffs'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
sheriff_names = soup.find_all('a', class_='eg-sheriff-list-skin-element-1')
sheriff_counties = soup.find_all(class_='eg-sheriff-list-skin-element-2')
However, I'm finding that Requests is not pulling the entire page's HTML, even though the tag is at the end. If I scan page.content, I find that Sheriff Harrold is the last sheriff included, and that every sheriff from Curtis Landers onwards is not included (I tried pasting the full output of page.content but it's too long).
My best guess from reading this answer is that the website has JavaScript that loads the remaining part of the page upon interacting with it, which would imply that I need to use something like Selenium to interact with the page to get the rest of it to load first.
However, if you look at the website, it's very simple, so as a novice part of me thinks there has to be a way to scrape this basic website without using a more complex tool like Selenium. That said, I recognize that the website is WordPress-generated, and WordPress can set up delayed JavaScript on even simple websites.
My questions are:
1) Do I really need to use Selenium to scrape a simple, WordPress-generated website like this? Or is there a way to get the full page to load with just Requests? Is there any way to tell when web pages will require a web driver and when Requests will not be enough?
2) I'm thinking one step ahead here: if I want to scale up this project, how would I be able to tell that Requests has not returned the full website, without manually inspecting the results of every website?
Thanks!
Unfortunately, your initial instinct is almost certainly correct. If you look at the page source it seems that they have some sort of lazy loading going on, pulling content from an external source.
A quick look at the page source indicates that they're probably using the "Essential Grid" WordPress theme to do this. I think this supports preloading. If you look at the requests that are made you might be able to ascertain how it's loading this and pull directly from that source (perhaps a REST call, AJAX, etc).
In a generalized sense, I'm afraid that there really isn't any automated way to programmatically determine if a page has 'fully' loaded, as that behavior is defined in code and can be triggered by anything.
If you want to capture information from pages that load content as you scroll, though, I believe Selenium is the tool you'll have to use.
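A minimal Selenium sketch for that scenario; it assumes the name links keep the eg-sheriff-list-skin-element-1 class from the question and that scrolling to the bottom is what triggers the lazy loading (both assumptions):
import time
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://oregonsheriffs.org/about-ossa/meet-your-sheriffs')

# scroll until the page height stops growing, i.e. nothing more gets lazy-loaded
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # give the grid time to load the next batch
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

soup = BeautifulSoup(driver.page_source, 'html.parser')
names = [a.get_text(strip=True)
         for a in soup.find_all('a', class_='eg-sheriff-list-skin-element-1')]
print(names)
driver.quit()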
Newbie here. I'm trying to scrape some sports statistics off a website using BeautifulSoup4. The script below does output a table, but it's not actually the specific data that appears in the browser (the data that appears in the browser is the data I'm after: goalscorer data for a season, not all-time records).
#import libraries
from urllib.request import urlopen
from bs4 import BeautifulSoup
import requests
#specify the url
stat_page = 'https://www.premierleague.com/stats/top/players/goals?se=79'
# query the website and return the html to the variable ‘page’
page = urlopen(stat_page)
#parse the html using beautiful soup and store in variable `soup`
soup = BeautifulSoup(page, 'html.parser')
# Take out the <div> of name and get its value
stats = soup.find('tbody', attrs={'class': 'statsTableContainer'})
name = stats.text.strip()
print(name)
It appears there is some filtering of data going on behind the scenes, but I am not sure how I can filter the output with BeautifulSoup4. It would appear there is some JavaScript filtering happening on top of the HTML.
I have tried to identify what this specific filter is, and it appears the filtering is done here.
<div class="current" data-dropdown-current="FOOTBALL_COMPSEASON" role="button" tabindex="0" aria-expanded="false" aria-labelledby="dd-FOOTBALL_COMPSEASON" data-listen-keypress="true" data-listen-click="true">2017/18</div>
I've had a read of the link below, but I'm not entirely sure how to apply it to my case (again, beginner here).
Having problems understanding BeautifulSoup filtering
I've tried installing, importing and applying the different parsers, but I always get the same error (Couldn't find a Tree Builder). Any suggestions on how I can pull data off a website that appears to be using a JS filter?
Thanks.
In these cases, it's usually useful to track the network requests using your browser's developer tools, since the data is usually retrieved using AJAX and then displayed in the browser with JS.
In this case, it looks like the data you're looking for can be accessed at:
https://footballapi.pulselive.com/football/stats/ranked/players/goals?page=0&pageSize=20&compSeasons=79&comps=1&compCodeForActivePlayer=EN_PR&altIds=true
It has a standard JSON format so you should be able to parse and extract the data with minimal effort.
However, note that this endpoint requires the Origin HTTP header to be set to https://www.premierleague.com in order for it to serve your request.
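A minimal sketch of that request; only what is visible in the URL and the Origin requirement above is assumed, so inspect the returned JSON once before relying on any particular field names:
import requests

url = ('https://footballapi.pulselive.com/football/stats/ranked/players/goals'
       '?page=0&pageSize=20&compSeasons=79&comps=1'
       '&compCodeForActivePlayer=EN_PR&altIds=true')

# the endpoint refuses requests that do not carry the right Origin header
headers = {'Origin': 'https://www.premierleague.com'}

response = requests.get(url, headers=headers)
response.raise_for_status()

data = response.json()
# print the structure once, then drill into the fields you actually need
print(data)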
I am attempting to retrieve player statistics from MLB.com for the 2016 season. I am using Beautiful Soup in Python, and I need to extract the information in the table seen here:
http://mlb.mlb.com/stats/sortable.jsp#elem=%5Bobject+Object%5D&tab_level=child&click_text=Sortable+Player+hitting&game_type='R'&season=2016&season_type=ANY&league_code='MLB'&sectionType=sp&statType=hitting&page=1&ts=1493672037085&playerType=ALL&sportCode='mlb'&split=&team_id=&active_sw=&position=&page_type=SortablePlayer&sortOrder='desc'&sortColumn=ab&results=&perPage=442&timeframe=&last_x_days=&extended=0
Here is what I have attempted:
import requests
from bs4 import BeautifulSoup

# url is the sortable.jsp link above
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
gdata = soup.find_all('div', {'id': 'datagrid'})
print(gdata)
This should return all of the subtags within the datagrid div, but it does not. This results in the following:
[<div id="datagrid"></div>]
Can anyone explain why this is not producing the contents of the table? Furthermore, what can I do to access the contents of the table?
Thanks
If you look at the source for the webpage, it looks like the datagrid div is actually empty and the stats are inserted dynamically as JSON from a separate URL. Maybe you can use that instead. To figure this out I looked at the page source to see that the div had no children, and then used the Chrome developer tools' Network tab to find the request where it pulled the data:
Open the web page
Open the chrome developer tools, Command+Option+I (Mac) or Control+Shift+I (Windows, Linux).
Refresh the web page with the tools open so it processes the network requests then wait for the page to load
(optional) Type xml in the filter/search bar of the Network tab to narrow the results to requests that are likely to have data
Click on each request and look at the preview of the response. At this point I just manually examined the responses to see which had your data. I got lucky and got yours on the first try since it has stats in the name.
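Once you have found the right request in the Network tab, you can usually fetch it directly with requests. The URL and key below are placeholders only, standing in for whatever endpoint and structure you actually find in the Preview pane:
import requests

# placeholder: substitute the request URL you found in the Network tab
json_url = 'https://example.com/stats/endpoint.json'

data = requests.get(json_url).json()

# drill into the structure you saw in the Preview pane; the key below is illustrative
for row in data.get('rows', []):
    print(row)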
I run this program but it only gives me "[]" instead of the web page data. Please help.
import urllib
import re
import lxml.html

start_link = "http://aepcindia.com/ApparelMarketplaces/detail"
html_string = urllib.urlopen(start_link)
dom = lxml.html.fromstring(html_string.read())
side_bar_link = dom.xpath("//*[@id='show_cont']/div/table/tr[2]/td[2]/text()")
print side_bar_link
file = open("next_page.txt", "w")
for link in side_bar_link:
    file.write(link)
    print link
file.close()
The HTML source you are downloading contains an empty content area: <div id="show_cont"></div>. This div is populated later by a JavaScript function showData(). When you look at the page in a browser, the JavaScript is executed first, which is not the case when you just download the HTML source using urllib.
To get the data you want, you can try to mimic the POST request in the showData() function or, preferably, scrape the website using a scriptable headless browser.
Update: While a headless browser would be a much more generally applicable approach, in this case it might be overkill. You will actually be better off reverse engineering the showData() function. The AJAX call in it is quite obvious: it delivers a plain HTML table, and you can also limit the search :)
http://aepcindia.com/ApparelMarketplaces/ajax_detail/search_type:/search_value:
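A short sketch of that reverse-engineered approach. It assumes the endpoint above answers a plain GET with an HTML table, as described; if the site expects a POST, mirror the form fields sent by showData() instead:
import requests
from bs4 import BeautifulSoup

url = 'http://aepcindia.com/ApparelMarketplaces/ajax_detail/search_type:/search_value:'

html = requests.get(url).text
soup = BeautifulSoup(html, 'html.parser')

# the response is said to be a plain HTML table, so walk its rows
for row in soup.find_all('tr'):
    cells = [td.get_text(strip=True) for td in row.find_all('td')]
    print(cells)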
Quite often I have to download PDFs from websites, but sometimes they are not all on one page.
The links are split across paginated pages, and I have to click through every page to get the links.
I am learning Python and I want to write a script where I can put in the web URL and it extracts the PDF links from that website.
I am new to Python, so can anyone please give me directions on how I can do it?
Pretty simple with urllib2, urlparse and lxml. I've commented things more verbosely since you're new to Python:
# modules we're using (you'll need to download lxml)
import lxml.html, urllib2, urlparse
# the url of the page you want to scrape
base_url = 'http://www.renderx.com/demos/examples.html'
# fetch the page
res = urllib2.urlopen(base_url)
# parse the response into an xml tree
tree = lxml.html.fromstring(res.read())
# construct a namespace dictionary to pass to the xpath() call
# this lets us use regular expressions in the xpath
ns = {'re': 'http://exslt.org/regular-expressions'}
# iterate over all <a> tags whose href ends in ".pdf" (case-insensitive)
for node in tree.xpath('//a[re:test(@href, "\.pdf$", "i")]', namespaces=ns):
    # print the href, joining it to the base_url
    print urlparse.urljoin(base_url, node.attrib['href'])
Result:
http://www.renderx.com/files/demos/examples/Fund.pdf
http://www.renderx.com/files/demos/examples/FundII.pdf
http://www.renderx.com/files/demos/examples/FundIII.pdf
...
If there are a lot of pages with links, you can try the excellent framework Scrapy (http://scrapy.org/).
It is pretty easy to understand how to use it, and it can download the PDF files you need.
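A rough Scrapy sketch; the start URL is just the example page from the previous answer, and the rel="next" pagination selector is an assumption you will need to adapt to the real site:
import scrapy

class PdfLinkSpider(scrapy.Spider):
    name = 'pdf_links'
    # placeholder start page; replace with the site you want to crawl
    start_urls = ['http://www.renderx.com/demos/examples.html']

    def parse(self, response):
        # yield every link whose href ends in .pdf
        for href in response.css('a::attr(href)').getall():
            if href.lower().endswith('.pdf'):
                yield {'pdf_url': response.urljoin(href)}

        # follow a "next page" link if the site paginates its listings
        next_page = response.css('a[rel="next"]::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
You can run a standalone spider like this with scrapy runspider pdf_links.py -o links.json and collect the results in one file.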
(Written on a phone, so maybe it is not very readable.)
If you are going to grab things from websites that are all static pages, you can easily grab the HTML with requests:
import requests

page_content = requests.get(url).text  # the page's HTML as a string
But if you grab things from somewhere like a social or communication website, there will be some anti-scraping measures (how to get around these noisy things will be the problem).
First way: make your requests look more like a browser (a human). A sketch covering the three points below follows after this list.
Add the headers (you can use the dev tools in Chrome, or Fiddler, to copy the headers).
Build the right POST form. This should copy exactly what the browser sends when you post the form.
Get the cookies and add them to your requests.
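A minimal sketch of those three points with requests; every header value, URL and form field below is a placeholder, so copy the real ones from your browser's dev tools:
import requests

session = requests.Session()  # a Session keeps cookies across requests

# placeholder headers: copy the real User-Agent etc. from the dev tools
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Referer': 'https://example.com/login',
}

# placeholder form fields: mirror exactly what the browser posts
payload = {'username': 'me', 'password': 'secret'}

session.post('https://example.com/login', headers=headers, data=payload)

# cookies set by the login are now sent automatically on later requests
page = session.get('https://example.com/protected-page', headers=headers)
print(page.status_code)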
Second way: use Selenium and a browser driver. Selenium drives a real browser (in my case, chromedriver).
Remember to add chromedriver to the PATH,
or use code to point at the driver executable:
from selenium import webdriver
driver = webdriver.Chrome(executable_path=path)  # path to chromedriver; not 100% sure this is the exact setup call
driver.get(url)
It truly surfs the URL with a real browser, so it decreases the difficulty of grabbing things.
Get the rendered page:
page = driver.page_source
Some websites will jump through several pages, which can cause errors. Make your script wait until a certain element shows up:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, 'idYouKnowIsThere')))
Or use an implicit wait: wait however long you like.
driver.implicitly_wait(5)  # seconds
And you can control the website through WebDriver. I am not going to describe that here; you can look up the module's documentation.