Why is the page source different between Selenium and BeautifulSoup? - python

As the title says, I am crawling data from a Vietnamese website (https://webgia.com/lai-suat/). I first used BeautifulSoup, but it does not return the data shown in the HTML source in Chrome: the numbers are hidden. When I switched to Selenium to get the HTML source, it returned the ideal result, with all the numbers shown.
The code is as below:
Using bs4:
import requests
from bs4 import BeautifulSoup
url = "https://webgia.com/lai-suat/"
req = requests.get(url)
soup = BeautifulSoup(req.text, "lxml")
table = soup.find_all('table', attrs={'class': 'table table-radius table-hover text-center'})
table_body = table[0].find('tbody')
rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    for col in cols:
        print(col)
The data is hidden; the result is:
<td class="text-left"><a class="bank-icon" href="https://webgia.com/lai-suat/abbank/" title="Lãi suất ABBank - Ngân hàng TMCP An Bình"><span class="bak-icon bi-abbank"></span><span>ABBank</span></a></td>
<td class="text-right lsd" nb="E3c7370616e20636c617C37B33d2B2746578742d6772H65I656e223e3A02c32303c2f7370616e3Ie"><small>web giá</small></td>
<td class="text-right lsd" nb="R3ZJ3YKJ2c3F635D"><small>xem tại webgia.com</small></td>
<td class="text-right lsd" nb="3c7370616e20636Fc61C73733d22746578742dC6772A65656e223e3S42cT303N03c2f7370616e3e"><small>webgia.com</small></td>
<td class="text-right lsd" nb="352cMA3Z6BE30"><small>web giá</small></td>
<td class="text-right lsd" nb="352cLXG3A7I30"><small>web giá</small></td>
But if I get html source by using Selenium, then using the same code above:
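from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
s = Service(executable_path=ChromeDriverManager().install())
driver = webdriver.Chrome(service=s)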
driver.maximize_window()
url = "https://webgia.com/lai-suat/"
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'lxml')
...
The result shows all the numbers:
<td class="text-right"><span class="text-green">0,20</span></td>
<td class="text-right">3,65</td>
<td class="text-right"><span class="text-green">4,00</span></td>
<td class="text-right">5,60</td>
<td class="text-right">5,70</td>
<td class="text-right">5,70</td>
...
So can anyone explain why the two differ like this? How can I get the same result using just bs4 instead of Selenium?
Thank you guys

The difference arises because most websites today ship not only HTML but also JS scripts that can modify the HTML when executed. Executing those scripts requires a JS engine, and that is exactly what web browsers provide (V8 in Chrome).
HTML fetched with requests and parsed with BeautifulSoup is the "raw" HTML, unmodified by any JS scripts, because there is no JS engine to execute them in the first place. Those JS scripts are what fetch the data and update the HTML with it.
HTML provided by Selenium, on the other hand, is the HTML after the JS scripts have been executed. Selenium can do this because it has an external webdriver execute the scripts for you, not because Selenium itself can execute JS.
Since you'll eventually need a JS engine to execute the JS scripts, I don't think BeautifulSoup alone can cut it.

The reason is that Selenium drives a browser that runs JavaScript, which can modify the contents of the page, whereas requests only returns the HTML initially sent in the response and does not execute the JavaScript.

The page source has that content obfuscated and placed inside the nb attribute of the relevant tds. When JavaScript runs in the browser the following script content runs which converts the obfuscated data into what you see on the page.
function gm(r) {
    r = r.replace(/A|B|C|D|E|F|G|H|I|J|K|L|M|N|O|P|Q|R|S|T|U|V|W|X|Y|Z/g, "");
    for (var n = [], t = 0; t < r.length - 1; t += 2) n.push(parseInt(r.substr(t, 2), 16));
    return String.fromCharCode.apply(String, n)
}
$(document).ready(function() {
    $("td.blstg").each(function() {
        var gtls = $(this).attr("nb");
        $(this).removeClass("blstg").removeAttr("nb");
        if (gtls) {
            $(this).html(gm(gtls));
        } else {
            $(this).html("-");
        }
    });
});
With requests this script doesn't run so you are left with the generic text.
To answer your question about how to use bs4 to get this, you could write your own custom function(s) to reproduce the logic of the script.
Additionally, the class of the target elements, whose nb attribute requires conversion, is dynamic, so that needs to be picked up as well. In the above JavaScript the dynamic class value was blstg at the time of viewing. In the code below, I use a regex to pick up the current value.
I have used thousands = None, as per this GitHub pandas issue, to preserve "," as the decimal point, as per source, when using read_html() to generate the final dataframe.
import re
from io import StringIO
import pandas as pd
import requests
from bs4 import BeautifulSoup as bs
def gm(r):
    # Strip the uppercase filler letters, then decode the remaining hex pairs
    r = re.sub(r'[A-Z]', '', r)
    n = []
    t = 0
    while t < len(r) - 1:
        n.append(int(r[t:t+2], 16))
        t += 2
    return ''.join(map(chr, n))
url = "https://webgia.com/lai-suat/"
req = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
soup = bs(req.text, "lxml")
# The class of the obfuscated cells changes, so read it from the inline script
dynamic_class = re.search(r'\$\("td\.([a-z]+)"', req.text).group(1)
for i in soup.select(f'td.{dynamic_class}'):
    replacement = i.get('nb')
    del i['class']  # not actually needed as I replace innerText
    del i['nb']  # not actually needed as I replace innerText
    if replacement:
        i.string.replace_with(bs(gm(replacement), 'lxml'))
    else:
        # Replace the cell text (not the cell itself) so the table keeps its columns
        i.string.replace_with('-')
# thousands=None keeps "," as the decimal separator, as in the source
df = pd.read_html(StringIO(str(soup.select_one(".table-radius"))), thousands=None)[0]
print(df)
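The decoding step can be checked offline against the nb values quoted in the question. The following is a minimal sketch using only the standard library; decode_nb is a name made up here, and it assumes the obfuscation scheme is still the one shown in the script above (uppercase filler letters mixed into hex pairs).

```python
import re


def decode_nb(value):
    """Decode an obfuscated nb attribute: drop A-Z filler, then read hex pairs."""
    hex_digits = re.sub(r"[A-Z]", "", value)
    # Mirror the JavaScript loop, which ignores a trailing unpaired digit
    if len(hex_digits) % 2:
        hex_digits = hex_digits[:-1]
    # Each byte value maps to one character, like String.fromCharCode
    return bytes.fromhex(hex_digits).decode("latin-1")


# nb values copied from the bs4 output in the question
print(decode_nb("R3ZJ3YKJ2c3F635D"))  # 3,65
print(decode_nb("352cMA3Z6BE30"))     # 5,60
print(decode_nb("352cLXG3A7I30"))     # 5,70
```

These match the values Selenium returned for the same row (3,65, 5,60 and 5,70).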

Expanding on the above answer, and generally speaking:
To tell whether specific data is fetched or generated by JS, or returned with the page HTML, you can disable JavaScript in Chrome DevTools (click Inspect, press F1 to open Settings, then tick "Disable JavaScript"). The setting applies while DevTools stays open. If you visit the page and the data is still there, that is a clear indication the data is delivered with the HTML.
If it is not, then it is either fetched or generated by JS.
If the data is fetched, inspect the network requests your browser makes while you visit the website. You should see the call that fetches the data, and you should be able to replicate it using the requests module.
If not, you have to reverse engineer the JS. Set a breakpoint on the page load event and refresh the page; JS execution will stop as the page loads. Then right-click the element the data is written to and choose Break on subtree modifications or attribute modifications. Remove the load breakpoint and refresh the page; Chrome will now break on the JS code responsible for generating the data.
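The same check can be approximated in code: take HTML obtained without a JS engine and see whether a value you can see in the browser appears in its text. This is only a rough sketch; visible_text_contains is a hypothetical helper, and the two HTML strings are cells copied from the question above (in practice the first argument would be the text of a requests response).

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects the text nodes of a document (inline script bodies included)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def visible_text_contains(raw_html, expected):
    collector = TextCollector()
    collector.feed(raw_html)
    collector.close()
    return expected in "".join(collector.chunks)


# One cell as returned by requests, and the same cell after JS has run
raw = '<td class="text-right lsd" nb="352cMA3Z6BE30"><small>web giá</small></td>'
rendered = '<td class="text-right">5,60</td>'
print(visible_text_contains(raw, "5,60"))       # False
print(visible_text_contains(rendered, "5,60"))  # True
```

A False result for the raw HTML suggests the value is fetched or generated by JS.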

Related

Scraping Data from Table with Multiple Pages

I am trying to scrape data from AGMARKNET website. The tables are split into 11 pages but all of the pages use the same url. I am very new to webscraping (or python in general), but AGMARKNET does not have a public API so scraping the page seems to be my only option. I am currently using BeautifulSoup to parse the HTML code and I am able to scrape the initial table, but that only contains the first 500 data points, but I want the entire 11 page data. I am stuck and frustrated. Link and my current code are below. Any direction would be helpful, thank you .
https://agmarknet.gov.in/SearchCmmMkt.aspx?Tx_Commodity=17&Tx_State=JK&Tx_District=0&Tx_Market=0&DateFrom=01-Oct-2004&DateTo=18-Oct-2022&Fr_Date=01-Oct-2004&To_Date=18-Oct-2022&Tx_Trend=2&Tx_CommodityHead=Apple&Tx_StateHead=Jammu+and+Kashmir&Tx_DistrictHead=--Select--&Tx_MarketHead=--Select--
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'https://agmarknet.gov.in/SearchCmmMkt.aspx?Tx_Commodity=17&Tx_State=JK&Tx_District=0&Tx_Market=0&DateFrom=01-Oct-2004&DateTo=18-Oct-2022&Fr_Date=01-Oct-2004&To_Date=18-Oct-2022&Tx_Trend=2&Tx_CommodityHead=Apple&Tx_StateHead=Jammu+and+Kashmir&Tx_DistrictHead=--Select--&Tx_MarketHead=--Select--'
response = requests.get(url)
# Use BeautifulSoup to parse the HTML code
soup = BeautifulSoup(response.content, 'html.parser')
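# Assumption: the line that created stat_table was missing from the post; find_all returns a ResultSet
stat_table = soup.find_all('table')
# changes stat_table from ResultSet to a Tag
stat_table = stat_table[0]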
# Convert html table to list
rows = []
for tr in stat_table.find_all('tr')[1:]:
    cells = []
    tds = tr.find_all('td')
    if len(tds) == 0:
        ths = tr.find_all('th')
        for th in ths:
            cells.append(th.text.strip())
    else:
        for td in tds:
            cells.append(td.text.strip())
    rows.append(cells)
# convert table to df
table = pd.DataFrame(rows)
The website you linked to seems to be using JavaScript to navigate to the next page. requests only downloads the HTML and BeautifulSoup only parses it, so neither can run JavaScript.
Instead of using them, you could try something like Selenium, which drives a real browser (so JavaScript, CSS, etc. all work). In fact, Selenium can even open a full browser window so you can see it in action as it navigates!
Here is a quick sample code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.firefox.options import Options
# If you prefer Chrome to Firefox, there is a driver available
# for that as well
# Set the URL
url = 'https://agmarknet.gov.in/SearchCmmMkt.aspx?Tx_Commodity=17&Tx_State=JK&Tx_District=0&Tx_Market=0&DateFrom=01-Oct-2004&DateTo=18-Oct-2022&Fr_Date=01-Oct-2004&To_Date=18-Oct-2022&Tx_Trend=2&Tx_CommodityHead=Apple&Tx_StateHead=Jammu+and+Kashmir&Tx_DistrictHead=--Select--&Tx_MarketHead=--Select--'
# Start the browser
opts = Options()
driver = webdriver.Firefox(options=opts)
driver.get(url)
Now you can use functions like driver.find_element(...) and driver.find_elements(...) to extract the data you want from this page, the same way you did with BeautifulSoup.
For your given link, the page number navigators seem to be running a function of the form,
__doPostBack('ctl00$cphBody$GridViewBoth','Page$2')
...replacing Page$2 with Page$3, Page$4, etc. depending on which page you want. So you can use Selenium to run that JavaScript function when you're ready to navigate.
driver.execute_script("__doPostBack('ctl00$cphBody$GridViewBoth','Page$2')")
A more generic solution is to just select which button you want and then run that button's click() function. General example (not necessarily for the current website):
btn = driver.find_element('id', 'next-button')
btn.click()
A final note: after the button is clicked, you might want to time.sleep(...) for a little while to make sure the page is fully loaded before you start processing the next set of data.
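If you would rather stay with requests: in ASP.NET WebForms pages, __doPostBack conventionally fills the hidden __EVENTTARGET and __EVENTARGUMENT form fields and submits the form. Below is a rough sketch of building that form payload from a page's hidden inputs. It has not been tested against this site; build_postback_payload and the sample HTML are made up for illustration, and the real page may need extra fields or cookies.

```python
from html.parser import HTMLParser


class HiddenInputParser(HTMLParser):
    """Collects name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def build_postback_payload(page_html, target, argument):
    """Copy the hidden fields and set the two fields __doPostBack would set."""
    parser = HiddenInputParser()
    parser.feed(page_html)
    payload = dict(parser.fields)
    payload["__EVENTTARGET"] = target
    payload["__EVENTARGUMENT"] = argument
    return payload


# Made-up stand-in for the page HTML; the real form has different values
sample_html = """
<form method="post">
  <input type="hidden" name="__VIEWSTATE" value="abc123" />
  <input type="hidden" name="__EVENTVALIDATION" value="xyz789" />
  <input type="text" name="search" value="ignored" />
</form>
"""
payload = build_postback_payload(sample_html, "ctl00$cphBody$GridViewBoth", "Page$2")
print(payload)
```

The payload would then be sent as form data in a POST to the same URL, ideally with the same requests.Session that fetched the page, and the hidden fields re-read from each response before requesting the next page.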

Beautiful Soup parse page table problem

I want to get the data (numbers) from this page. With those numbers I want to do some math.
My current code:
import requests
from bs4 import BeautifulSoup
result = requests.get("http://www.tsetmc.com/Loader.aspx?ParTree=151311&i=45050389997905274")
c = result.content
soup = BeautifulSoup(c , features='lxml')
cld=soup.select("#d03")
print(cld)
================
output : []
From the page-request I get this result:
<td id="d04" class="">2,105</td>
<td id="d03" class=""><span style="font-size:15px;font-weight:bold">2,147</span> <span style="font-size:11px;color:green">305 (16.56%)</span></td>
<td id="d05" class="">1,842</td>
From this result I only want the <td> ID's outputted.
The problem with that page is that its content is generated dynamically. At the time you fetch the HTML of the page, the actual elements have not been generated yet (I suppose they are filled in by the JavaScript on the page). There are two ways you can approach this.
Try using selenium which simulates a browser. You can in fact wait for the response to be generated and then fetch the html element you want.
The other way would be just to see any network requests being done by the page to fetch the data. If it was not loaded in the html, surely there must be another API call made to their servers to fetch the data.
On an initial look, I can see that the data you need is being fetched from this URL (http://www.tsetmc.com/tsev2/data/instinfodata.aspx?i=45050389997905274&c=57+). The response looks like this.
12:29:48,A ,2150,2147,2105,1842,2210,2105,2700,53654226,115204065144,1,20190814,122948;98/5/23 16:30:51,F,261391.50,<div class='pn'>4294.29</div>,9596315531133973,3376955600,11101143554708,345522,F,2046434489,11459858578563,282945,F,12927,3823488480,235,;8#240000#2148#2159#500#1,1#600#2145#2160#198067#2,10#1000000#2141#2161#2000#1,;61157,377398,660897;;;;0;
You can figure out the parsing logic in detail by going through their code, I suppose. But it looks like you only need the value 2147, which is the fourth comma-separated field.
Perhaps this might work:
result = requests.get("http://www.tsetmc.com/Loader.aspx?ParTree=151311&i=45050389997905274")
c = result.content
soup = BeautifulSoup(c, features='lxml')
for tag in soup.find_all('td')[0:2]:
    print(tag.get('id'))
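Alternatively, the numbers can be read straight from the data response shown above, with no HTML parsing at all. This is a minimal sketch on the start of that sample response; the mapping from field position to the td ids is an assumption inferred only from the matching values in the question (2147, 2105, 1842), and parse_first_section is a made-up name.

```python
# Start of the sample response quoted above (sections are separated by ';')
sample = (
    "12:29:48,A ,2150,2147,2105,1842,2210,2105,2700,53654226,"
    "115204065144,1,20190814,122948;98/5/23 16:30:51,F,261391.50"
)

# Hypothetical mapping from field position to the <td> ids in the question
FIELD_TO_ID = {3: "d03", 4: "d04", 5: "d05"}


def parse_first_section(text):
    """Split off the first ';' section and pick the fields of interest."""
    fields = text.split(";")[0].split(",")
    return {td_id: int(fields[idx]) for idx, td_id in FIELD_TO_ID.items()}


values = parse_first_section(sample)
print(values)  # {'d03': 2147, 'd04': 2105, 'd05': 1842}
print(values["d03"] - values["d05"])  # 305
```

The difference of 305 agrees with the change shown next to 2147 in the question's HTML, which supports this reading of the fields.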

Trouble Scraping site with BS4

usually I'm able to write a script that works for scraping, but I've been having some difficulty scraping this site for the table enlisted for this research project I'm working on. I'm planning to verify the script working on one State before entering the URL of my targeted states.
import requests
import bs4 as bs
url = ("http://programs.dsireusa.org/system/program/detail/284")
dsire_get = requests.get(url)
soup = bs.BeautifulSoup(dsire_get.text,'lxml')
table = soup.findAll('div', {'data-ng-controller': 'DetailsPageCtrl'})
print(table)
#I'm printing "Table" just to ensure that the table information I'm looking for is within this sections
I'm not sure if the site is attempting to block people from scraping, but all the info that I'm looking to grab is within "&quot"if you look what Table outputs.
The text is rendered with JavaScript.
First render the page with dryscrape
(If you don't want to use dryscrape see Web-scraping JavaScript page with Python )
Then the text can be extracted, after it has been rendered, from a different position on the page i.e the place it has been rendered to.
As an example this code will extract HTML from the summary.
import bs4 as bs
import dryscrape
url = ("http://programs.dsireusa.org/system/program/detail/284")
session = dryscrape.Session()
session.visit(url)
dsire_get = session.body()
soup = bs.BeautifulSoup(dsire_get,'html.parser')
table = soup.findAll('div', {'class': 'programSummary ng-binding'})
print(table[0])
Outputs:
<div class="programSummary ng-binding" data-ng-bind-html="program.summary"><p>
<strong>Eligibility and Availability</strong></p>
<p>
Net metering is available to all "qualifying facilities" (QFs), as defined by the federal <i>Public Utility Regulatory Policies Act of 1978</i> (PURPA), which pertains to renewable energy systems and combined heat and power systems up to 80 megawatts (MW) in capacity. There is no statewide cap on the aggregate capacity of net-metered systems.</p>
<p>
All utilities subject to Public ...
I finally managed to solve the issue and successfully grab the data from the JavaScript page. The following code worked for me, in case anyone encounters the same issue when trying to scrape a JavaScript-rendered page with Python on Windows (where dryscrape is incompatible).
import bs4 as bs
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
browser = webdriver.Chrome()
url = ("http://programs.dsireusa.org/system/program/detail/284")
browser.get(url)
html_source = browser.page_source
browser.quit()
soup = bs.BeautifulSoup(html_source, "html.parser")
table = soup.find('div', {'class': 'programOverview'})
data = []
for n in table.findAll("div", {"class": "ng-binding"}):
    trip = str(n.text)
    data.append(trip)

parse html tables with lxml

I have been trying to parse the table contents from here
i have tried a couple of alternatives, like
xpath('//table//tr/td//text()')
xpath('//div[@id="replacetext"]/table/tbody//tr/td/a//text()')
here is my last code:
import requests, lxml.html
url ='https://nseindia.com/products/content/derivatives/equities/fo_underlying_home.htm'
url = requests.get(url)
html = lxml.html.fromstring(url.content)
packages = html.xpath('//div[@id="replacetext"]/table/tbody//tr/td/a//text()') # get the text inside all "<tr><td><a ...>text</a></td></tr>"
however none of the alternatives seems to be working. In the past, i have scraped data with similar code (although not from this url!). Any guidance will be really helpful.
I tried your code. The problem is not caused by lxml; it is caused by how you load the webpage.
You use requests to get the content of the webpage, but the content you get from requests may differ from the content you see in the browser.
For this page, 'https://nseindia.com/products/content/derivatives/equities/fo_underlying_home.htm', print the content returned by requests.get and you will find that the source of the page contains no table! The table is loaded by an Ajax query.
So find a way to load the page that actually contains the data, then you can use lxml.
By the way, there are other things to keep in mind in web scraping, for example request headers. It is good practice to set your request headers when you make an HTTP request; some sites may block you if you do not provide a reasonable User-Agent header. That has nothing to do with your current problem, though.
Thanks.
In the HTML page, there is a namespace:
<html xmlns="http://www.w3.org/1999/xhtml">
So, you need to specify it:
NSMAP = {'html' : "http://www.w3.org/1999/xhtml"}
path = '//html:div[@id="replacetext"]/html:table/html:tbody//html:tr/html:td/html:a//text()'
packages = html.xpath(path, namespaces=NSMAP)
See http://lxml.de/xpathxslt.html#namespaces-and-prefixes
Interpreting Ajax call
import requests
from lxml import html
base_url = 'https://nseindia.com'
# simulate the JavaScript: request the URL that the Ajax call loads
url = base_url + "/products/content/derivatives/equities/fo_underlyinglist.htm"
url = requests.get(url)
content = url.content
# -> <table>
# <tr><th>S. No.</td>
# <th>Underlying</td>
# <th>Symbol</th></tr>
# <tr>
# <td style='text-align: right;' >1</td>
# <td class="normalText" ><a href=fo_INDIAVIX.htm>INDIA VIX</a></td>
# <td class="normalText" >INDIAVIX</td>
# </tr>
# ...
html = html.fromstring(content)
packages = html.xpath('//td/a//text()')
# -> ['INDIA VIX',
# 'INDIAVIX',
# 'Nifty 50',
# 'NIFTY',
# 'Nifty IT',
# 'NIFTYIT',
# 'Nifty Bank',
# 'BANKNIFTY',
# 'Nifty Midcap 50',

Extracting href links from within website source w/ Python

I've asked this question before to no avail. I am trying to figure out how to implement bs4 to grab the links to be used for downloading from within the website's source. The problem I can't figure out is the links are within a dynamic content library. I've removed previous html snippet, look below
We've been able to grab the links with this script only after manually grabbing the source code from the website:
import re
line = line.rstrip()
x = re.findall('href=[\'"]?([^\'" >]+)tif', line)
if len(x) > 0:
    result.write('tif">link</a><br>\n<a href="'.join(x))
result.write('tif">link</a><br>\n\n</html>\n</body>\n')
result.write("There are " + str(len(x)) + " links")
print("Download HTML page created.")
But only after going into the website ctrl + a -> view source -> select all & copy -> paste onto SourceCode.txt. I would like to remove the manual labor from all this.
I'd greatly appreciate any information/tips/advice!
EDIT
I wanted to add some more information regarding the website we are using, the Library content will only show up when it has been manually expanded. Otherwise, the content (i.e., the download links/href *.tif) are not visible. Here's an example of what we see:
Source Code of site without opening the library element.
<html><body>
Source Code after opening library element.
<html><body>
<h3>Library</h3>
<div id="libraryModalBody">
<div><table><tbody>
<tr>
<td>Tile12</td>
<td>Button</td>
</tr>
</tbody></table></div>
</div>
Source code after expanding all download options.
<html><body>
<h3>Library</h3>
<div id="libraryModalBody">
<div><table><tbody>
<tr>
<td>Tile12</td>
<td>Button</td>
</tr>
<tr>
<td>Tile12_Set1.tif</td>
<td>Button</td>
</tr>
<tr>
<td>Tile12_Set2.tif</td>
<td>Button</td>
</tr>
</tbody></table></div>
</div>
Our end goal would be to grab the download links by only having to input the website URL. The issue seems to be in the way the content is displayed (i.e., dynamic content only visible after manual expansion of the library).
Do not try and parse HTML with regular expressions. It's not possible and it won't work. Use BeautifulSoup4 instead:
from urllib.request import urlopen  # Python 3 replacement for urllib2
from bs4 import BeautifulSoup
url = "http://www.your-server.com/page.html"
document = urlopen(url)
soup = BeautifulSoup(document, "html.parser")
# look for all URLs:
found_urls = [link["href"] for link in soup.find_all("a", href=True)]
# look only for URLs to *.tif files:
found_tif_urls = [link["href"] for link in soup.find_all("a", href=True) if link["href"].endswith(".tif")]
You may as well take a look at PyQuery library, which uses the (sub)set of CSS selectors from JQuery:
from pyquery import PyQuery
# body is the HTML source of the page as a string
pq = PyQuery(body)
pq('div.content div#filter-container div.filter-section')
