BeautifulSoup page table parsing problem - Python

I want to get the data (numbers) from this page. With those numbers I want to do some math.
My current code:
import requests
from bs4 import BeautifulSoup
result = requests.get("http://www.tsetmc.com/Loader.aspx?ParTree=151311&i=45050389997905274")
c = result.content
soup = BeautifulSoup(c, features='lxml')
cld=soup.select("#d03")
print(cld)
Output: []
From the page request I get this result:
<td id="d04" class="">2,105</td>
<td id="d03" class=""><span style="font-size:15px;font-weight:bold">2,147</span> <span style="font-size:11px;color:green">305 (16.56%)</span></td>
<td id="d05" class="">1,842</td>
From this result I only want the <td> IDs output.

The problem with that page is that its content is generated dynamically. By the time you fetch the HTML of the page, the actual elements aren't generated yet (presumably they are filled in by the JavaScript on the page). There are two ways you can approach this.
Try using Selenium, which simulates a browser. You can wait for the response to be generated and then fetch the HTML element you want.
The other way is to look at the network requests the page makes to fetch the data. If the data was not loaded in the HTML, there must be another API call made to their servers to fetch it.
On an initial look, I can see that the data you need is being fetched from this URL: http://www.tsetmc.com/tsev2/data/instinfodata.aspx?i=45050389997905274&c=57+. The response looks like this:
12:29:48,A ,2150,2147,2105,1842,2210,2105,2700,53654226,115204065144,1,20190814,122948;98/5/23 16:30:51,F,261391.50,<div class='pn'>4294.29</div>,9596315531133973,3376955600,11101143554708,345522,F,2046434489,11459858578563,282945,F,12927,3823488480,235,;8#240000#2148#2159#500#1,1#600#2145#2160#198067#2,10#1000000#2141#2161#2000#1,;61157,377398,660897;;;;0;
You can figure out the parsing logic in detail by going through their code, I suppose, but it looks like you only need the second element, 2147.
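For illustration, here is a minimal sketch of that second approach, assuming the comma/semicolon layout seen in the sample response above (the field index is an assumption based on that sample):

import requests

# Fetch the data endpoint directly (URL taken from the network tab above).
url = "http://www.tsetmc.com/tsev2/data/instinfodata.aspx?i=45050389997905274&c=57+"
resp = requests.get(url)
# The payload is semicolon-separated sections; the first section holds
# comma-separated price fields. In the sample above, '2147' is at index 3.
fields = resp.text.split(";")[0].split(",")
print(fields[3])  # assumed position of the 2,147 value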

Perhaps this might work:
import requests
from bs4 import BeautifulSoup

result = requests.get("http://www.tsetmc.com/Loader.aspx?ParTree=151311&i=45050389997905274")
c = result.content
soup = BeautifulSoup(c, features='lxml')
for tag in soup.find_all('td')[0:2]:
    print(tag.get('id'))

Related

Why is the page source different between Selenium and BeautifulSoup?

As the title says, I am crawling data from a Vietnamese website (https://webgia.com/lai-suat/). I used BeautifulSoup at first and it does not return the data as the HTML source shows in Chrome; the numbers are hidden. However, when I changed to Selenium for getting the HTML source, it returns the ideal result, with all the numbers shown.
The code is as below:
Using bs4:
import requests
from bs4 import BeautifulSoup
url = "https://webgia.com/lai-suat/"
req = requests.get(url)
soup = BeautifulSoup(req.text, "lxml")
table = soup.find_all('table', attrs={'class': 'table table-radius table-hover text-center'})
table_body = table[0].find('tbody')
rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    for col in cols:
        print(col)
The data is hidden, as the result shows:
<td class="text-left"><a class="bank-icon" href="https://webgia.com/lai-suat/abbank/" title="Lãi suất ABBank - Ngân hàng TMCP An Bình"><span class="bak-icon bi-abbank"></span><span>ABBank</span></a></td>
<td class="text-right lsd" nb="E3c7370616e20636c617C37B33d2B2746578742d6772H65I656e223e3A02c32303c2f7370616e3Ie"><small>web giá</small></td>
<td class="text-right lsd" nb="R3ZJ3YKJ2c3F635D"><small>xem tại webgia.com</small></td>
<td class="text-right lsd" nb="3c7370616e20636Fc61C73733d22746578742dC6772A65656e223e3S42cT303N03c2f7370616e3e"><small>webgia.com</small></td>
<td class="text-right lsd" nb="352cMA3Z6BE30"><small>web giá</small></td>
<td class="text-right lsd" nb="352cLXG3A7I30"><small>web giá</small></td>
But if I get the HTML source by using Selenium and then use the same code above:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

s = Service(executable_path=ChromeDriverManager().install())
driver = webdriver.Chrome(service=s)
driver.maximize_window()
url = "https://webgia.com/lai-suat/"
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'lxml')
...
The result shows all the data numbers:
<td class="text-right"><span class="text-green">0,20</span></td>
<td class="text-right">3,65</td>
<td class="text-right"><span class="text-green">4,00</span></td>
<td class="text-right">5,60</td>
<td class="text-right">5,70</td>
<td class="text-right">5,70</td>
...
So can anyone explain why there is this difference? How can I get the same result by just using bs4 instead of Selenium?
Thank you guys!
The difference is because most websites today are shipped with not only HTML, but also JS scripts capable of modifying the HTML when executed. To execute those scripts, a JS engine is required and that's exactly what web browsers provide you with - a JS Engine (V8 for Chrome).
HTML contents fetched using BeautifulSoup are "raw" ones, unmodified by any JS scripts, because there's no JS engine to execute them in the first place. It is those JS scripts that are in charge of fetching data and updating the HTML with the fetched data.
HTML contents provided by Selenium, on the other hand, are the ones after JS scripts have been executed. Selenium can do this because it has an external webdriver execute the scripts for you, not because Selenium itself can execute JS scripts.
Since you'll eventually need a JS engine to execute the JS scripts, I don't think BeautifulSoup alone can cut it.
The reason is that selenium runs JavaScript, which can modify the contents of the page, whereas using requests to get the page only returns the HTML of the page that is initially sent in the request and does not execute the JavaScript.
The page source has that content obfuscated and placed inside the nb attribute of the relevant tds. When JavaScript runs in the browser, the following script runs, converting the obfuscated data into what you see on the page:
function gm(r) {
    r = r.replace(/A|B|C|D|E|F|G|H|I|J|K|L|M|N|O|P|Q|R|S|T|U|V|W|X|Y|Z/g, "");
    for (var n = [], t = 0; t < r.length - 1; t += 2) n.push(parseInt(r.substr(t, 2), 16));
    return String.fromCharCode.apply(String, n)
}
$(document).ready(function() {
    $("td.blstg").each(function() {
        var gtls = $(this).attr("nb");
        $(this).removeClass("blstg").removeAttr("nb");
        if (gtls) {
            $(this).html(gm(gtls));
        } else {
            $(this).html("-");
        }
    });
});
With requests this script doesn't run so you are left with the generic text.
To answer your question about how to use bs4 to get this, you could write your own custom function(s) to reproduce the logic of the script.
Additionally, the class of these target elements, whose nb attribute requires conversion, is dynamic, so that needs to be picked up as well. In the JavaScript above, the dynamic class value was blstg at the time of viewing. In the code below, I use a regex to pick up the correct current value.
I have used thousands = None, as per this GitHub pandas issue, to preserve "," as the decimal point, as per source, when using read_html() to generate the final dataframe.
import requests
from bs4 import BeautifulSoup as bs
import re
import pandas as pd

def gm(r):
    r = re.sub(r'A|B|C|D|E|F|G|H|I|J|K|L|M|N|O|P|Q|R|S|T|U|V|W|X|Y|Z', '', r)
    n = []
    t = 0
    while t < len(r) - 1:
        n.append(int(r[t:t+2], 16))
        t += 2
    return ''.join(map(chr, n))

url = "https://webgia.com/lai-suat/"
req = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
soup = bs(req.text, "lxml")
# Pick up the current dynamic class name from the inline script.
dynamic_class = re.search(r'\$\("td\.([a-z]+)"', req.text).group(1)
for i in soup.select(f'td.{dynamic_class}'):
    replacement = i['nb']
    del i['class']  # not actually needed as I replace innerText
    del i['nb']  # not actually needed as I replace innerText
    if replacement:
        i.string.replace_with(bs(gm(replacement), 'lxml'))
    else:
        i.replace_with('-')
df = pd.read_html(str(soup.select_one(".table-radius")), thousands=None)[0]
print(df)
Expanding on the above answer, and generally speaking: in order to tell whether specific data is fetched/generated by JS or returned with the page HTML, you can use a feature in Chrome DevTools that disables JS execution (click Inspect, then press F1 and enable "Disable JavaScript"). If you keep DevTools open when you visit the page and the data is still there, that is a clear indication the data is delivered with the HTML.
If it's not there, then it's either fetched or generated by JS.
If the data is fetched, simply inspecting the network requests your browser makes while you visit the website should reveal the call that fetches the data, and you should be able to replicate it using the requests module.
If not, then you have to reverse engineer the JS: set an on-page-load breakpoint and refresh the page, so JS execution stops as the page loads. Then right-click the element the data is set into and choose "Break on subtree modifications" or "Break on attribute modifications". Remove the on-page-load breakpoint and refresh the page again; Chrome will now break on the JS code responsible for generating the data.

Can't find <tr> data with BeautifulSoup

I am trying to access rows of the table from https://www.parliament.gov.za/hansard?sorts[date]=-1. I ultimately want to download the PDFs contained in each row, but I am having a hard time accessing the rows of the table. When I inspect a table row element, I see that it is under the <tbody> tag. However, I can't seem to access this data using BeautifulSoup. I have done a decent amount of web scraping, but this is the first time I've run into this issue. This is the code that I currently have:
import requests
from bs4 import BeautifulSoup as bs
url = 'https://www.parliament.gov.za/hansard?sorts[date]=-1'
request = requests.get(url)
soup = bs(request.text, 'html.parser')
table1 = soup.findAll('table')[0]
print(table1)
Output:
<table id="papers-table">
<thead>
<th>Name</th>
<th>House</th>
<th>Language</th>
<th>Date</th>
<th>Type</th>
<th data-dynatable-column="file_location" style="display:none">File Location</th>
</thead>
<tbody>
</tbody>
</table>
Clearly, there is nothing in the <tbody> tag even though this is where I believe the row data should be. In general, whenever I try to find the tr tags, which is where Chrome says the row data is stored, I can't find any of the ones with the PDFs. I am fairly certain that the issue has something to do with the fact that the source code is missing this data as well, but I have no idea how to find it. Since it's on the website, I assume that there must be a way, right? Thanks!
The data is loaded dynamically, so requests alone won't get it. However, the data is available by sending a GET request to the website's API:
https://www.parliament.gov.za/docsjson?queries%5Btype%5D=hansard&sorts%5Bdate%5D=-1&page=1&perPage=10&offset=0
There's no need to use BeautifulSoup; using just the requests library is enough:
import requests
URL = "https://www.parliament.gov.za/docsjson?queries%5Btype%5D=hansard&sorts%5Bdate%5D=-1&page=1&perPage=10&offset=0"
BASE_URL = "https://www.parliament.gov.za/storage/app/media/Docs/"
response = requests.get(URL).json()
for data in response["records"]:
    print(BASE_URL + data["file_location"])
Output:
https://www.parliament.gov.za/storage/app/media/Docs/hansard/3a888bc6-ffc7-46a1-9803-ffc148b07bfc.pdf
https://www.parliament.gov.za/storage/app/media/Docs/hansard/3eb3103c-2d3c-418f-bb24-494b17bdeb22.pdf
https://www.parliament.gov.za/storage/app/media/Docs/hansard/bf0afdf8-352c-4dde-a380-11ce0a038dad.pdf
https://www.parliament.gov.za/storage/app/media/Docs/hansard/285e1633-aaeb-4a0d-bd54-98a4d5ec5127.pdf
https://www.parliament.gov.za/storage/app/media/Docs/hansard/966926ce-4cfe-4f68-b4a1-f99a09433137.pdf
https://www.parliament.gov.za/storage/app/media/Docs/hansard/d4bdb2c2-e8c8-461f-bc0b-9ffff3403be3.pdf
https://www.parliament.gov.za/storage/app/media/Docs/hansard/daecc145-bb44-47f1-a3b2-9400437f71d8.pdf
https://www.parliament.gov.za/storage/app/media/Docs/hansard/4f204d7e-0a25-4b64-b5a7-46c8730abe91.pdf
https://www.parliament.gov.za/storage/app/media/Docs/hansard/f2863e16-b448-46e3-939d-e14859984513.pdf
https://www.parliament.gov.za/storage/app/media/Docs/hansard/cd30e289-2ff2-47f5-b2a7-77e496e52f3a.pdf
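Since the ultimate goal was to download the PDFs, here is a possible extension of the snippet above that saves each file locally (the hansard output directory and file naming are my own choices):

import os
import requests

URL = "https://www.parliament.gov.za/docsjson?queries%5Btype%5D=hansard&sorts%5Bdate%5D=-1&page=1&perPage=10&offset=0"
BASE_URL = "https://www.parliament.gov.za/storage/app/media/Docs/"

os.makedirs("hansard", exist_ok=True)
for data in requests.get(URL).json()["records"]:
    pdf_url = BASE_URL + data["file_location"]
    # Name each file after the last path segment of its storage location.
    filename = os.path.join("hansard", os.path.basename(data["file_location"]))
    with open(filename, "wb") as f:
        f.write(requests.get(pdf_url).content)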
You're trying to scrape dynamic content, whereas you're only loading the HTML for a static page with requests. Look for ways to get the dynamically generated HTML, or use Selenium.

Scraping href which includes ads information

I would like to count how many ads there are on this website: https://www.lastampa.it/?refresh_ce
I am using BeautifulSoup to do this. I would need to extract the info within the following:
<a id="aw0" target="_blank" href="https://googleads.g.doubleclick.net/pcs/click?xai=AKAOjssYz5VxTdwhxCBCrbtSi0dfGqGd25s7Ub6CCjsHLqd__OqfDKLyOWi6bKE3CL4XIJ0xDHy3ey-PGjm3_yVqTe0_IZ1g9AsvZmO1u8gciKpEKYMj1TIvl6KPivBuwgpfUDf8g2EvMyCD5r6tQ8Mx6Oa4G4yZoPYxFRN7ieFo7UbMr8FF2k6FL6R2qegawVLKVB5WHVAbwNQu4rVx4GE8KuxowGjcfecOnagp9uAHY2qiDE55lhdGqmXmuIEAK8UdaIKeRr6aBBVCR40LzY4&sig=Cg0ArKJSzEIRw7NDzCe7&adurl=https://track.adform.net/C/%3Fbn%3D38337867&nm=3&nx=357&ny=-4&mb=2" onfocus="ss('aw0')" onmousedown="st('aw0')" onmouseover="ss('aw0')" onclick="ha('aw0')"><img src="https://tpc.googlesyndication.com/simgad/5262715044200667305" border="0" width="990" height="30" alt="" class="img_ad"></a>
i.e. parts containing ads information.
The code that I am using is the following:
from bs4 import BeautifulSoup
import requests
from lxml import html
r = requests.get("https://www.lastampa.it/?refresh_ce")
soup = BeautifulSoup(r.content, "html.parser")
ads_div = soup.find('div')
if ads_div:
    for link in ads_div.find_all('a'):
        print(link['href'])
It does not scrape any information because I am considering the wrong tag/href. How could I get the ads information in order to count how many ads there are on that webpage?
How about using a regular expression to match "googleads" and counting how many matches you get?
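A rough sketch of that idea; note it only counts occurrences of "googleads" in the initial HTML, so ads injected later by JavaScript would be missed:

import re
import requests

r = requests.get("https://www.lastampa.it/?refresh_ce")
# Count raw occurrences of the ad-server hostname fragment in the HTML.
print(len(re.findall(r"googleads", r.text)))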
Recursively searching from the body gives you all the links in the whole page. If you want to search in a specific div you can supply parameters such as the class or id that you want to match as a dictionary.
You can filter the links once you obtain them.
body = soup.find('body')
if body:
    for link in body.find_all('a', href=True):  # href=True avoids a KeyError on anchors without href
        if "ad" in link['href']:
            print(link['href'])
Looking at the response I get, I notice there are no ads in it at all. This is likely because the ads are loaded via some script, which means they won't be rendered and requests won't download them. To get around this you can use a webdriver with Selenium. That should do the trick.
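A minimal sketch of that approach, assuming Chrome is installed (the "googleads" filter is borrowed from the regex suggestion above):

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get("https://www.lastampa.it/?refresh_ce")
# Parse the rendered DOM, where the ad scripts have had a chance to run.
# Note: ads rendered inside iframes won't appear in page_source; those would
# need driver.switch_to.frame(...) handling.
soup = BeautifulSoup(driver.page_source, "html.parser")
ads = [a["href"] for a in soup.find_all("a", href=True) if "googleads" in a["href"]]
print(len(ads))
driver.quit()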

Nothing returned in prompt when scraping product data using BS4 and requests (Python 3)

Hope you're all doing good.
I am trying to scrape a specific product from https://footlocker.fr in order to get product data such as the sizes available. The thing is, each time I try to run my script, nothing is returned.
Thank you in advance!
import requests
from bs4 import BeautifulSoup

url = 'https://www.footlocker.fr/fr/p/jordan-1-mid-bebe-chaussures-69677?v=316161155904'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
name_box = soup.find_all('div', attrs={'class': 'fl-product-details--headline'})
size = soup.find_all('div', attrs={'class': 'fl-size-316161155904-UE-21'})
for product in zip(name_box, size):
    name, size_el = product
    name_proper = name.text.strip()
    size_proper = size_el.text.strip()
    print(name_proper, '-', size_proper)
Okay. So I found a solution, but it is far from ideal. It is for the following link https://www.footlocker.fr/fr/p/jordan-1-mid-bebe-chaussures-69677?v=316160178204. If you look at the resulting html in page.content, you will obviously notice that the size details are not there. If you read through it a bit, you will see a bunch of references to AJAX leading me to believe it is making an AJAX call and pulling the information in, then parsing it. (This is expected behaviour as stock of items can change over time).
There are two ways to get your data.
You know the URL you are trying to fetch data from. The value after v= is the SKU of the product. For example, if the SKU is 316160178204 you can directly make a request to https://www.footlocker.fr/INTERSHOP/web/FLE/Footlocker-Footlocker_FR-Site/fr_FR/-/EUR/ViewProduct-ProductVariationSelect?BaseSKU=316160178204&InventoryServerity=ProductDetail
For each URL you request, you have to locate the DIV with class fl-load-animation and get its data-ajaxcontent-url attribute, which in this case is https://www.footlocker.fr/INTERSHOP/web/FLE/Footlocker-Footlocker_FR-Site/fr_FR/-/EUR/ViewProduct-ProductVariationSelect?BaseSKU=316160178204&InventoryServerity=ProductDetail
Now you make a request to this new URL, and somewhere in that JSON you will see values such as
<button class=\"fl-product-size--item fl-product-size--item__not-available\" type=\"button\"\n\n>\n<span>20</span>\n</button>
<button class=\"fl-product-size--item\" type=\"button\"\n\ndata-form-field-target=\"SKU\"\ndata-form-field-base-css-name=\"fl-product-size--item\"\ndata-form-field-value=\"316160178204050\"\ndata-form-field-unselect-group\n\ndata-testid=\"fl-size-316160178204-UE-21\"\ndata-product-size-select-item=\"316160178204050\"\n\n>\n<span>21</span>\n</button>
You will have to parse this snippet of data (I think you can use BeautifulSoup for it). You can see that it has a class of fl-product-size--item__not-available if it is not available, and the size value is in the span.
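For example, a minimal sketch of that parsing step, assuming snippet holds one of the button fragments extracted from the JSON response above:

from bs4 import BeautifulSoup

# One of the fragments shown above, with the \n escapes resolved.
snippet = ('<button class="fl-product-size--item" type="button" '
           'data-testid="fl-size-316160178204-UE-21">\n<span>21</span>\n</button>')
button = BeautifulSoup(snippet, 'html.parser').button
size = button.span.get_text(strip=True)
# Availability is signalled by the __not-available modifier class.
available = 'fl-product-size--item__not-available' not in button['class']
print(size, 'available' if available else 'not available')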
name_box is empty because you search for a <div>, while the element that contains the class fl-product-details--headline is an <h1>.
size is empty because, as @Sri pointed out, some AJAX requests insert that information into the page after the first request.

How to take table data from a website using bs4

I'm trying to scrape a website that has a table in it using bs4, but the content I'm getting is not as complete as what I see when I inspect the page: I cannot find the <tr> and <td> tags. How can I get the full content of that site, especially the tags for the table?
Here's my code:
from bs4 import BeautifulSoup
import requests
link = requests.get("https://pemilu2019.kpu.go.id/#/ppwp/hitung-suara/", verify = False)
src = link.content
soup = BeautifulSoup(src, "html.parser")
print(soup)
I expect the content to have the <tr> and <td> tags in it, because they do exist when I inspect the page, but I found none in the output.
Here's the image of the page where the <tr> and <td> tags appear:
You should dump the contents of the text you're trying to parse to a file and look at it. This will tell you for sure what is and isn't there. Like this:
from bs4 import BeautifulSoup
import requests

link = requests.get("https://pemilu2019.kpu.go.id/#/ppwp/hitung-suara/", verify=False)
src = link.content
with open("/tmp/content.html", "wb") as f:  # "wb" because link.content is bytes
    f.write(src)
soup = BeautifulSoup(src, "html.parser")
print(soup)
Run this code, and then look at the file /tmp/content.html (use a different path, obviously, if you're on Windows) to see what is actually in the file. You could probably do this with your browser, but this is the way to be the most sure you know what you are getting. You could, of course, also just add print(src), but if it were me, I'd dump it to a file.
If the HTML you're looking for is not in the initial HTML that you're getting back, then that HTML is coming from somewhere else. The table could be being built dynamically by JavaScript, or coming from another URL reference, possibly one that calls an HTTP API to grab the table's HTML via parameters passed to the API endpoint.
You will have to reverse engineer the site's design to find where that HTML comes from. If it comes from JavaScript, you may be stuck short of scripting the execution of a browser so you can gain access programmatically to the DOM in the browser's memory.
I would recommend running a debugging proxy that will show you each HTTP request being made by your browser. You'll be able to see the contents of each request and response. If you can do this, you can find the URL that actually returns the content you're looking for, if such a URL exists. You'll have to deal with SSL certificates and such because this is an https endpoint; debugging proxies usually make that pretty easy. We use Charles. The standard browser toolboxes might do this too, allowing you to see each request and response generated by a particular page load.
If you can discover the URL that actually returns the table HTML, then you can use that URL to grab it and parse it with BS.
