Unable to parse some links from a local html file - python

I'm trying to scrape the links attached to the "View Bill" buttons in a table in a local HTML file. Here is the file link.
This is what the first container looks like:
<tr class="clickable collapsed ng-isolate-scope blue-row" data-parent="#parent-table-body" data-target="#tr1" data-toggle="collapse" role="button">
<td>
<div class="accordion-icon" ng-enterclick="" tabindex="0"></div>
</td>
<td>09/18/2020</td>
<td>$183.47</td>
<td>10/02/2020</td>
<td>29</td>
<td>$0.00</td>
<td>
<form action="/my-account/view-bill" class="hiddenForm ng-pristine ng-valid" method="post" target="_blank">
<input name="actionType" type="hidden" value="View Bill" autocomplete="off">
<input name="billDate" type="hidden" value="2020-09-18" autocomplete="off">
<a class="link" data-action="View Bill " data-category="billing_payment_history" data-label="Billing & Payment History" href="https://www.duke-energy.com/?_ga=2.36159203.2592906.1601114887-735893428.1601114887" onclick="this.parentNode.submit(); return false;">View Bill </a>
</form>
</td>
</tr>
I've tried with:
from bs4 import BeautifulSoup
local_file = r"C:\Users\WCS\Desktop\htmlfile.html"
with open(local_file,"r") as f:
    page_content = f.read()
soup = BeautifulSoup(page_content,"lxml")
for item in soup.select("#parent-table-body a.link:contains('View Bill')"):
    print(item)
    break
Output I'm getting from the first container:
<a class="link" data-action="View Bill " data-category="billing_payment_history" data-label="Billing & Payment History" href="" onclick="this.parentNode.submit(); return false;">View Bill </a>
So, you can see that there is no link in the output above.
How can I parse the links from the view bill buttons?

from bs4 import BeautifulSoup
with open('htmlfile.html') as f:
    soup = BeautifulSoup(f, 'html.parser')
target = [x['href'] for x in soup.select("a[data-action^=View]")]
print(target)
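The ^= in that selector is a CSS attribute-prefix match: it keeps anchors whose data-action value starts with "View". Here is a minimal sketch of the same idea, assuming a trimmed-down page in which the second and third anchors are made up for illustration. It also skips anchors whose href is empty, which is the symptom described in the question:

```python
from bs4 import BeautifulSoup

# Only the first anchor resembles the question's markup; the other two
# are hypothetical, added to show what the selector keeps and drops.
html = """
<a class="link" data-action="View Bill " href="https://www.duke-energy.com/">View Bill </a>
<a class="link" data-action="View Bill " href="">View Bill </a>
<a class="link" data-action="Pay Bill" href="https://example.com/pay">Pay Bill</a>
"""

soup = BeautifulSoup(html, "html.parser")

# [data-action^="View"] matches attribute values that start with "View".
anchors = soup.select('a[data-action^="View"]')

# a.get("href") returns None or "" for missing/empty links, so filter on it.
links = [a["href"] for a in anchors if a.get("href")]
print(links)
```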

Try this:
import re
from bs4 import BeautifulSoup
html = """
<tr class="clickable collapsed ng-isolate-scope blue-row" data-parent="#parent-table-body" data-target="#tr1" data-toggle="collapse" role="button">
<td>
<div class="accordion-icon" ng-enterclick="" tabindex="0"></div>
</td>
<td>09/18/2020</td>
<td>$183.47</td>
<td>10/02/2020</td>
<td>29</td>
<td>$0.00</td>
<td>
<form action="/my-account/view-bill" class="hiddenForm ng-pristine ng-valid" method="post" target="_blank">
<input name="actionType" type="hidden" value="View Bill" autocomplete="off">
<input name="billDate" type="hidden" value="2020-09-18" autocomplete="off">
<a class="link" data-action="View Bill " data-category="billing_payment_history" data-label="Billing & Payment History" href="https://www.duke-energy.com/?_ga=2.36159203.2592906.1601114887-735893428.1601114887" onclick="this.parentNode.submit(); return false;">View Bill </a>
</form>
</td>
</tr>"""
soup = BeautifulSoup(html, "html.parser")
for anchor in soup.find_all(lambda t: t.name == 'a' and re.search(r'View Bill\s+', t.text)):
    print(f"{anchor.text.strip()} - {anchor.get('href')}")
Output:
View Bill - https://www.duke-energy.com/?_ga=2.36159203.2592906.1601114887-735893428.1601114887
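Note that in the posted markup the anchor's onclick handler submits the enclosing form and returns false, so clicking the button posts the hidden fields to the form's action rather than following the href. If that is the data you are after, a rough sketch along these lines may help. bill_requests is a hypothetical helper name, and the markup is a shortened copy of the row from the question:

```python
from bs4 import BeautifulSoup

# Shortened copy of the row from the question.
html = """
<table><tbody id="parent-table-body"><tr><td>
<form action="/my-account/view-bill" class="hiddenForm" method="post" target="_blank">
<input name="actionType" type="hidden" value="View Bill" autocomplete="off">
<input name="billDate" type="hidden" value="2020-09-18" autocomplete="off">
<a class="link" data-action="View Bill " href="" onclick="this.parentNode.submit(); return false;">View Bill </a>
</form>
</td></tr></tbody></table>
"""


def bill_requests(markup):
    """Collect the action, method and hidden fields of every form (hypothetical helper)."""
    soup = BeautifulSoup(markup, "html.parser")
    found = []
    for form in soup.select("form[action]"):
        # Hidden inputs carry the values the browser would post.
        fields = {
            inp["name"]: inp.get("value", "")
            for inp in form.select("input[type=hidden][name]")
        }
        found.append({
            "action": form["action"],
            "method": form.get("method", "get").lower(),
            "fields": fields,
        })
    return found


for request in bill_requests(html):
    print(request)
```

Each dictionary describes one request the page would make; whether the server accepts it outside a logged-in browser session is a separate question that the HTML alone cannot answer.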

Related

Finding certain element using bs4 beautifulSoup

I usually use Selenium but figured I would give bs4 a shot!
I am trying to find a specific piece of text on the website. In the example below, I want the last value: 189305014
<div class="info_container">
<div id="profile_photo">
<img src="https://pbs.twimg.com/profile_images/882103883610427393/vLTiH3uR_reasonably_small.jpg" />
</div>
<table class="profile_info">
<tr>
<td class="left_column">
<p>Twitter User ID:</p>
</td>
<td>
<p>189305014</p>
</td>
</tr>
Here is the script I am using -
TwitterID = soup.find('td',attrs={'class':'left_column'}).text
This returns
Twitter User ID:
You can find the <p> tag that contains "Twitter User ID:" and then take the next <p> tag after it:
from bs4 import BeautifulSoup
txt = '''<div class="info_container">
<div id="profile_photo">
<img src="https://pbs.twimg.com/profile_images/882103883610427393/vLTiH3uR_reasonably_small.jpg" />
</div>
<table class="profile_info">
<tr>
<td class="left_column">
<p>Twitter User ID:</p>
</td>
<td>
<p>189305014</p>
</td>
</tr>
'''
soup = BeautifulSoup(txt, 'html.parser')
print(soup.find('p', text='Twitter User ID:').find_next('p'))
Prints:
<p>189305014</p>
Or take the last <p> element inside class="profile_info":
print(soup.select('.profile_info p')[-1])
Or take the element immediately following class="left_column":
print(soup.select_one('.left_column + *').text)
Use the following code to get you the desired output:
TwitterID = soup.find('td',attrs={'class': None}).text
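To get only the digits from the second <p> tag, you can keep just the characters for which isdigit() is true. This works here because the ID is the only text in the container that contains digits: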
from bs4 import BeautifulSoup
html = """<div class="info_container">
<div id="profile_photo">
<img src="https://pbs.twimg.com/profile_images/882103883610427393/vLTiH3uR_reasonably_small.jpg" />
</div>
<table class="profile_info">
<tr>
<td class="left_column">
<p>Twitter User ID:</p>
</td>
<td>
<p>189305014</p>
</td>
</tr>"""
soup = BeautifulSoup(html, 'html.parser')
result = ''.join(
    [t for t in soup.find('div', class_='info_container').text if t.isdigit()]
)
print(result)
Output:
189305014
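If the table has more than one label/value row, it may be easier to read the whole table into a dictionary and look the value up by its label. This is only a sketch: the second row ("Screen Name") is invented for illustration, and it assumes every data row has exactly two cells.

```python
from bs4 import BeautifulSoup

# First row is from the question; the second row is hypothetical.
html = """<table class="profile_info">
<tr><td class="left_column"><p>Twitter User ID:</p></td><td><p>189305014</p></td></tr>
<tr><td class="left_column"><p>Screen Name:</p></td><td><p>example_user</p></td></tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")

profile = {}
for row in soup.select("table.profile_info tr"):
    cells = row.find_all("td")
    if len(cells) == 2:  # skip rows that are not label/value pairs
        label = cells[0].get_text(strip=True).rstrip(":")
        profile[label] = cells[1].get_text(strip=True)

print(profile["Twitter User ID"])
```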

BS4 cannot select correct 'span'

I have tried to scrape a price from a certain website, a small sample of the HTML code is below:
</div>
</div>
<div class="right custom">
<div class="description custom">
<aside>
<h4>Availability:</h4>
<div>
<span class="label green">In Stock</span>
</div>
</aside>
<aside>
<h4>Price:</h4>
<div>
<span class="label">£65.40</span>
</div>
</aside>
<aside>
<h4>Ex Tax:</h4>
<div>
<span class="label">£54.50</span>
</div>
</aside>
<div class="price">
£65.40 </div>
<section class="custom-order">
<div class="options">
<div class="option" id="option-276">
<span class="required">*</span>
<label>Type & Extras:</label><br/>
<select name="option[276]">
<option value=""> --- Please Select --- </option>
<option value="146">Each </option>
</select>
</div>
</div>
<div class="quantity custom">
<label>Quantity:</label><br/>
<input name="quantity" size="2" type="text" value="1"/>
</div>
</section>
<!-- -->
<div class="cart">
<div>
I am trying to select the price of £54.50 (which is the price without UK tax).
The code I have used is below:
import requests
from bs4 import BeautifulSoup
import pandas as pd
var1 = requests.get("https://www.website.co.uk",
                    headers = {'User-agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0'})
var2 = var1.content
soup=BeautifulSoup(var2, "html.parser")
span = soup.find("span", {"class":"label"})
price = span.text
price
Output: 'In Stock'
This 'In Stock' is located a few lines earlier in the HTML code.
<div>
<span class="label green">In Stock</span>
Can somebody please point me in the direction of picking up the correct span?
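soup.find("span", {"class":"label"}) returns the first span whose class list includes label, which is the "In Stock" span (class="label green"), so that is what you got. The value you want is in the third matching span, which you can get with span = soup.find_all("span", {"class":"label"}, limit=3)[2]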
You can use a CSS Selector nth-child():
from bs4 import BeautifulSoup
txt = """THE ABOVE HTML"""
soup = BeautifulSoup(txt, "html.parser")
print(soup.select_one("aside:nth-child(3) > div > span").text)
Output:
£54.50
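Another method, using the simplified_scrapy library. Note that start="Price:" selects the tax-inclusive price, as the result below shows: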
from simplified_scrapy.spider import SimplifiedDoc
html = '''your html
'''
doc = SimplifiedDoc(html) # create doc
span = doc.getElement('span', start="Price:")
print (span.text)
Result:
£65.40
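Selecting by position breaks if the page adds or reorders an aside. A sketch that instead finds the h4 heading by its text and takes the next label span, using only the markup shown in the question (value_for is a hypothetical helper name):

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup from the question.
html = """<div class="description custom">
<aside><h4>Availability:</h4><div><span class="label green">In Stock</span></div></aside>
<aside><h4>Price:</h4><div><span class="label">£65.40</span></div></aside>
<aside><h4>Ex Tax:</h4><div><span class="label">£54.50</span></div></aside>
</div>"""

soup = BeautifulSoup(html, "html.parser")


def value_for(soup, heading):
    """Return the text of the label span that follows the given h4 heading."""
    h4 = soup.find("h4", string=heading)
    if h4 is None:
        return None
    span = h4.find_next("span", class_="label")
    return span.get_text(strip=True) if span else None


print(value_for(soup, "Ex Tax:"))
```

The heading text has to match exactly, including the trailing colon.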

How do I extract the name, address, course and institute type from this source code?

I want to extract the name, address, course and institute type from this code. I am not able to do it, I guess because of the table. Every time I try, I get an empty list. I don't know what to do.
<div class="row">
<div class="col-md-12">
<div class="panel panel-default">
<div class="panel-body ">
<div class="row">
<div id="ContentPlaceHolder1_pnldefault">
<table id="ContentPlaceHolder1_dlstCollege" class="table table-bordered table responsive" cellspacing="0" style="border-collapse:collapse;">
<tr>
<td>
<input type="hidden" name="ctl00$ContentPlaceHolder1$dlstCollege$ctl00$hdnInstituteId" id="ContentPlaceHolder1_dlstCollege_hdnInstituteId_0" value="968 " />
<a id="ContentPlaceHolder1_dlstCollege_hlpkInstituteName_0" href="CollegeDetailedInformation.aspx?Inst=968 ">**A R INSTITUTE OF PHARMACY , BIJNOR (968)**</a>
<br />
<b>Location:</b>
<span id="ContentPlaceHolder1_dlstCollege_lblAddress_0">**TAJPUR** </span>
<br />
<b>Course:</b>
<span id="ContentPlaceHolder1_dlstCollege_lblCourse_0">**B.Pharm**,</span>
<br />
<b>Category:</b>
<span id="ContentPlaceHolder1_dlstCollege_lblInstituteType_0">**Private**</span>
<br />
<b>Web Address:</b>
<a id="lnkBtnWebURL" href='' target="_blank"></a>
<br />
</td>
</tr>
res = requests.get('http://kyc.aktu.ac.in/')
soup = BeautifulSoup(res.content, 'html.parser')
weblinks = soup.find_all('a', attrs = {'id':'ContentPlaceHolder1_dlstCollege_hlpkInstituteName_0'})
pagelinks = []
for link in weblinks:
    link = link.find('a')
    pagelinks.append(link.get('href'))
Try this:
from bs4 import BeautifulSoup as bs
html = '<div class="row"><div class="col-md-12"><div class="panel panel-default"><div class="panel-body "><div class="row"><div id="ContentPlaceHolder1_pnldefault"><table id="ContentPlaceHolder1_dlstCollege" class="table table-bordered table responsive" cellspacing="0" style="border-collapse:collapse;"><tr><td><input type="hidden" name="ctl00$ContentPlaceHolder1$dlstCollege$ctl00$hdnInstituteId" id="ContentPlaceHolder1_dlstCollege_hdnInstituteId_0" value="968 " /><a id="ContentPlaceHolder1_dlstCollege_hlpkInstituteName_0" href="CollegeDetailedInformation.aspx?Inst=968 ">**A R INSTITUTE OF PHARMACY , BIJNOR (968)**</a><br /><b>Location:</b><span id="ContentPlaceHolder1_dlstCollege_lblAddress_0">**TAJPUR** </span><br /><b>Course:</b><span id="ContentPlaceHolder1_dlstCollege_lblCourse_0">**B.Pharm**,</span><br /><b>Category:</b><span id="ContentPlaceHolder1_dlstCollege_lblInstituteType_0">**Private**</span><br /><b>Web Address:</b><a id="lnkBtnWebURL" href='' target="_blank"></a><br /></td></tr>'
soup = bs(html , 'lxml')
name = soup.find('a', id='ContentPlaceHolder1_dlstCollege_hlpkInstituteName_0').text.strip()
address = soup.find('span', id= 'ContentPlaceHolder1_dlstCollege_lblAddress_0').text.strip()
course = soup.find('span', id = 'ContentPlaceHolder1_dlstCollege_lblCourse_0').text.strip()
institute_type = soup.find('span', id = 'ContentPlaceHolder1_dlstCollege_lblInstituteType_0').text.strip()
print(name)
print(address)
print(course)
print(institute_type)
Output:
**A R INSTITUTE OF PHARMACY , BIJNOR (968)**
**TAJPUR**
**B.Pharm**,
**Private**
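Two further observations, offered as a sketch rather than a definitive fix. First, the loop in the question calls link.find('a') on a tag that is already the anchor, and that returns None. Second, the ids end in a row index (_0), so to handle every row you can match on the stable part of the id instead. In the sample below the first row is shortened from the question (without the ** markers) and the second row is invented; field is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# First row is shortened from the question; the second row is hypothetical.
html = """<table id="ContentPlaceHolder1_dlstCollege">
<tr><td>
<a id="ContentPlaceHolder1_dlstCollege_hlpkInstituteName_0" href="CollegeDetailedInformation.aspx?Inst=968 ">A R INSTITUTE OF PHARMACY , BIJNOR (968)</a>
<span id="ContentPlaceHolder1_dlstCollege_lblAddress_0">TAJPUR </span>
<span id="ContentPlaceHolder1_dlstCollege_lblCourse_0">B.Pharm,</span>
<span id="ContentPlaceHolder1_dlstCollege_lblInstituteType_0">Private</span>
</td></tr>
<tr><td>
<a id="ContentPlaceHolder1_dlstCollege_hlpkInstituteName_1" href="CollegeDetailedInformation.aspx?Inst=000">EXAMPLE COLLEGE (000)</a>
<span id="ContentPlaceHolder1_dlstCollege_lblAddress_1">EXAMPLE TOWN</span>
<span id="ContentPlaceHolder1_dlstCollege_lblCourse_1">B.Tech,</span>
<span id="ContentPlaceHolder1_dlstCollege_lblInstituteType_1">Government</span>
</td></tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")


def field(cell, id_part):
    """Text of the first tag in cell whose id contains id_part, or ''."""
    tag = cell.select_one(f'[id*="{id_part}"]')
    return tag.get_text(strip=True) if tag else ""


colleges = []
for cell in soup.select("#ContentPlaceHolder1_dlstCollege td"):
    colleges.append({
        "name": field(cell, "hlpkInstituteName"),
        "address": field(cell, "lblAddress"),
        "course": field(cell, "lblCourse").rstrip(","),  # drop trailing comma
        "type": field(cell, "lblInstituteType"),
    })

for college in colleges:
    print(college)
```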

Parse Table tag in Python

I am trying to extract data from a HTML file using python. I am trying to extract the table content from the file.
Below is the HTML content of the table:
<table class="radiobutton" id="ctl00_bodyPlaceHolder_ctl00_Reg_rblTypeDisplay" onclick="return false;">
<tbody>
<tr>
<td>
<input id="ctl00_bodyPlaceHolder_ctl00_Reg_rblTypeDisplay_0" name="ctl00$bodyPlaceHolder$ctl00$Reg$rblTypeDisplay" type="radio" value="1" />
<label for="ctl00_bodyPlaceHolder_ctl00_Reg_rblTypeDisplay_0">Fitting</label>
</td>
</tr>
<tr>
<td>
<input id="ctl00_bodyPlaceHolder_ctl00_Reg_rblTypeDisplay_1" name="ctl00$bodyPlaceHolder$ctl00$Reg$rblTypeDisplay" type="radio" value="2" />
<label for="ctl00_bodyPlaceHolder_ctl00_Reg_rblTypeDisplay_1">Material</label>
</td>
</tr>
<tr>
<td>
<input id="ctl00_bodyPlaceHolder_ctl00_Reg_rblTypeDisplay_2" name="ctl00$bodyPlaceHolder$ctl00$Reg$rblTypeDisplay" type="radio" value="4" />
<label for="ctl00_bodyPlaceHolder_ctl00_Reg_rblTypeDisplay_2">Appliance</label>
</td>
</tr>
<tr>
<td>
<input checked="checked" id="ctl00_bodyPlaceHolder_ctl00_Reg_rblTypeDisplay_3" name="ctl00$bodyPlaceHolder$ctl00$Reg$rblTypeDisplay" type="radio" value="8" />
<label for="ctl00_bodyPlaceHolder_ctl00_Reg_rblTypeDisplay_3">Apparatus</label>
</td>
</tr>
<tr>
<td>
<input id="ctl00_bodyPlaceHolder_ctl00_Reg_rblTypeDisplay_4" name="ctl00$bodyPlaceHolder$ctl00$Reg$rblTypeDisplay" type="radio" value="16" />
<label for="ctl00_bodyPlaceHolder_ctl00_Reg_rblTypeDisplay_4">Other procedures</label>
</td>
</tr>
<tr>
<td>
<input id="ctl00_bodyPlaceHolder_ctl00_Reg_rblTypeDisplay_5" name="ctl00$bodyPlaceHolder$ctl00$Reg$rblTypeDisplay" type="radio" value="32" />
<label for="ctl00_bodyPlaceHolder_ctl00_Reg_rblTypeDisplay_5">Alternative fuel oils</label>
</td>
</tr>
<tr>
<td>
<input id="ctl00_bodyPlaceHolder_ctl00_Reg_rblTypeDisplay_6" name="ctl00$bodyPlaceHolder$ctl00$Reg$rblTypeDisplay" type="radio" value="64" />
<label for="ctl00_bodyPlaceHolder_ctl00_Reg_rblTypeDisplay_6">Other compliance method:</label>
</td>
</tr>
</tbody>
</table>
Below is the python code to print the properties from the tags.
from bs4 import BeautifulSoup
from pyparsing import makeHTMLTags
with open(r'.\ABC.html', 'r') as read_file:
    data = read_file.read()
soup = BeautifulSoup(data, 'html.parser')
table = soup.find("table", attrs={"id":"ctl00_bodyPlaceHolder_ctl00_Reg_rblTypeDisplay"})
spotterTag, spotterEndTag = makeHTMLTags("input")
for spotter in spotterTag.searchString(table):
    print(spotter.checked)
    print(spotter.id)
How can I print the label of each radio button along with its checked property?
Example: for the label tag below, it should print: Fitting
And "checked" for the input tag below:
<label for="ctl00_bodyPlaceHolder_ctl00_Reg_rblTypeDisplay_0">Fitting</label>
<input checked="checked" id="ctl00_bodyPlaceHolder_ctl00_Reg_rblTypeDisplay_3" name="ctl00$bodyPlaceHolder$ctl00$Reg$rblTypeDisplay" type="radio" value="8"/>
The code below works, but I am looking for a better solution:
from bs4 import BeautifulSoup
from pyparsing import makeHTMLTags
with open(r'.\ABC.html', 'r') as read_file:
    data = read_file.read()
soup = BeautifulSoup(data, 'html.parser')
table = soup.find("table", attrs={"id":"ctl00_bodyPlaceHolder_ctl00_Reg_rblTypeDisplay"})
spotterTag, spotterEndTag = makeHTMLTags("input")
for spotter in spotterTag.searchString(table):
    if spotter.checked == 'checked':
        label = soup.find("label", attrs={"for":spotter.id})
        print(str(label)[str(label).find('>')+1:str(label).find('<',2)])
        print(spotter.checked)
Thanks in advance for help!
I'm not sure I understand you correctly, but do you want to pair each input with its label? If so, you can use the zip() function. For example (data is your HTML string):
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'html.parser')
print('{:^25} {:^15} {:^15}'.format('Text', 'Value', 'Checked'))
for inp, lbl in zip(soup.select('table#ctl00_bodyPlaceHolder_ctl00_Reg_rblTypeDisplay input'),
                    soup.select('table#ctl00_bodyPlaceHolder_ctl00_Reg_rblTypeDisplay label')):
    print('{:<25} {:^15} {:^15}'.format(lbl.text, inp['value'], 'checked' if 'checked' in inp.attrs else '-'))
Prints:
          Text                 Value          Checked
Fitting                          1               -
Material                         2               -
Appliance                        4               -
Apparatus                        8            checked
Other procedures                16               -
Alternative fuel oils           32               -
Other compliance method:        64               -
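zip() pairs inputs and labels by position, which relies on there being exactly one label per input in the same order. A sketch of an alternative that pairs them through the label's for attribute instead; the table id is the one from the question, while the short rb_N ids are hypothetical stand-ins for the long generated ones:

```python
from bs4 import BeautifulSoup

# Table id is from the question; the rb_N input ids are shortened stand-ins.
data = """<table id="ctl00_bodyPlaceHolder_ctl00_Reg_rblTypeDisplay">
<tr><td><input id="rb_0" name="rb" type="radio" value="1"/><label for="rb_0">Fitting</label></td></tr>
<tr><td><input id="rb_1" name="rb" type="radio" value="2"/><label for="rb_1">Material</label></td></tr>
<tr><td><input checked="checked" id="rb_3" name="rb" type="radio" value="8"/><label for="rb_3">Apparatus</label></td></tr>
</table>"""

soup = BeautifulSoup(data, "html.parser")
table = soup.find("table", id="ctl00_bodyPlaceHolder_ctl00_Reg_rblTypeDisplay")

# Map each label's "for" value to its text, then look inputs up by id.
labels = {lbl["for"]: lbl.get_text(strip=True) for lbl in table.select("label[for]")}

rows = []
for inp in table.select("input[type=radio]"):
    text = labels.get(inp.get("id"), "")
    rows.append((text, inp.get("value"), inp.has_attr("checked")))

checked = [text for text, _, is_checked in rows if is_checked]
print(checked)
```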

how to extract span info from div with soup

I have a piece of HTML code below:
<div class="user-tagline ">
<span class="username " data-avatar="aaaaaaa">player1</span>
<span class="user-rating">(1357)</span>
<span class="country-flag-small flag-113" tip="Portugal"></span>
</div>
<div class="user-tagline ">
<span class="username " data-avatar="bbbbbbb">player2</span>
<span class="user-rating">(1387)</span>
<span class="country-flag-small flag-70" tip="Indonesia"></span>
</div>
I want to extract "Portugal" from it. Note that the span class is dynamic: it is not always class="country-flag-small flag-113" but changes with the country generated for this div block.
To get player1 and 1357, I am using the following cumbersome code:
player1info = soup.findAll('div', attrs={'class':'user-tagline'})[0].text.split("\n")
player1 = player1info[1]
pscore1 = player1info[2].replace('(','').replace(')', '')
I would appreciate it if someone could share a better solution. Thank you in advance.
UPDATE:
Now that the information in the div can be extracted, I would like to go further and extract more from the entire row. Here is the row:
<tr board-popover="" fen="r1bk2r1/1p2n3/pN6/1B1qQp2/P2Pp2p/1P6/2P2PPP/R3K1R1 b Q -" flip-board="1" highlight-squares="c4b6">
<td>
<a class="clickable-link td-user" href="https://www.chess.com/live/game/2249663029?username=belemnarmada" target="_self">
<span class="time-control">
<i class="icon-rapid">
</i>
</span>
<div class="user-tagline ">
<span class="username " data-avatar="https://betacssjs.chesscomfiles.com/bundles/web/images/noavatar_l.1c5172d5.gif" data-country="Portugal" data-enabled="true" data-flag="113" data-joined="Joined Jun 19, 2016" data-logged="Online 6 hrs ago" data-membership="basic" data-name="Atikinounette" data-popup="hover" data-title="" data-username="Atikinounette">
Atikinounette
</span>
<span class="user-rating">
(1357)
</span>
<span class="country-flag-small flag-113" tip="Portugal">
</span>
</div>
<div class="user-tagline ">
<span class="username " data-avatar="https://images.chesscomfiles.com/uploads/v1/user/28196414.83e31ff1.50x50o.3a6f77e4aa44.jpeg" data-country="Indonesia" data-enabled="true" data-flag="70" data-joined="Joined May 15, 2016" data-logged="Online Nov 7, 2017" data-membership="basic" data-name="belemnarmada" data-popup="hover" data-title="" data-username="belemnarmada">
belemnarmada
</span>
<span class="user-rating">
(1387)
</span>
<span class="country-flag-small flag-70" tip="Indonesia">
</span>
</div>
</a>
</td>
<td>
<a class="clickable-link text-middle" href="https://www.chess.com/live/game/2249663029?username=belemnarmada" target="_self">
<div class="pull-left">
<span class="game-result">
1
</span>
<span class="game-result">
0
</span>
</div>
<div class="result">
<i class="icon-square-minus loss" tip="Lost">
</i>
</div>
</a>
</td>
<td class="text-center">
<a class="clickable-link" href="https://www.chess.com/live/game/2249663029?username=belemnarmada" target="_self">
30 min
</a>
</td>
<td class="text-right">
<a class="clickable-link text-middle moves" href="https://www.chess.com/live/game/2249663029?username=belemnarmada" target="_self">
25
</a>
</td>
<td class="text-right miniboard">
<a class="clickable-link archive-date" href="https://www.chess.com/live/game/2249663029?username=belemnarmada" target="_self">
Aug 9, 2017
</a>
</td>
<td class="text-center miniboard">
<input class="checkbox" game-checkbox="" game-id="2249663029" game-is-live="true" ng-model="model.gameIds[2249663029].checked" type="checkbox"/>
</td>
</tr>
The information I need is:
player's info (the answer provided by @balderman already covers that)
game-result (1, 0)
playing time (30 min in this row)
total moves (25)
playing date (Aug 9, 2017)
Thank you so much.
How about the code below?
The idea is that the user attributes are the three spans under each div, so the code selects those spans and extracts the data from them.
from bs4 import BeautifulSoup
html = '''<html><body> <div class="user-tagline ">
<span class="username " data-avatar="aaaaaaa">player1</span>
<span class="user-rating">(1357)</span>
<span class="country-flag-small flag-113" tip="Portugal"></span>
</div>
<div class="user-tagline ">
<span class="username " data-avatar="bbbbbbb">player2</span>
<span class="user-rating">(1387)</span>
<span class="country-flag-small flag-70" tip="Indonesia"></span>
</div></body></html>'''
soup = BeautifulSoup(html, 'html.parser')
users = soup.find_all('div', attrs={'class': 'user-tagline'})
for user in users:
    # Each div holds three spans, in order: name, rating, country flag.
    user_properties = user.find_all('span')
    for idx, prop in enumerate(user_properties):
        if idx == 0:
            print('user name: {}'.format(prop.text))
        elif idx == 1:
            print('user rating: {}'.format(prop.text))
        elif idx == 2:
            print('user country: {}'.format(prop.attrs['tip']))
Output
user name: player1
user rating: (1357)
user country: Portugal
user name: player2
user rating: (1387)
user country: Indonesia
This is a more readable solution:
div1 = soup.select("div.user-tagline")[0]
player1 = div1.select_one("span.username").text
pscore1 = div1.select_one("span.user-rating").text
country1 = div1.select_one("span.country-flag-small")["tip"]
To extract the data from all the divs, loop over the result of soup.select("div.user-tagline") instead of taking element 0.
If you are interested only in the first div, you can go with this (bsobj is the BeautifulSoup object):
res = bsobj.find('div', {'class':'user-tagline'}).findAll('span')
print(res[0].text, res[1].text, res[2]['tip'])
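For the rest of the row described in the update, a sketch that relies on the class names visible in the posted markup (game-result, moves, archive-date, and the text-center cell that holds the time control). The sample is a trimmed copy of that row with href values replaced by "#"; whether those class names are stable across the site is an assumption.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the row from the question; hrefs replaced with "#".
row_html = """<table><tr>
<td><a class="clickable-link td-user" href="#">
<div class="user-tagline "><span class="username " data-country="Portugal">
Atikinounette
</span><span class="user-rating">
(1357)
</span><span class="country-flag-small flag-113" tip="Portugal"></span></div>
<div class="user-tagline "><span class="username " data-country="Indonesia">
belemnarmada
</span><span class="user-rating">
(1387)
</span><span class="country-flag-small flag-70" tip="Indonesia"></span></div>
</a></td>
<td><a class="clickable-link text-middle" href="#"><div class="pull-left">
<span class="game-result">
1
</span><span class="game-result">
0
</span></div></a></td>
<td class="text-center"><a class="clickable-link" href="#">
30 min
</a></td>
<td class="text-right"><a class="clickable-link text-middle moves" href="#">
25
</a></td>
<td class="text-right miniboard"><a class="clickable-link archive-date" href="#">
Aug 9, 2017
</a></td>
</tr></table>"""

soup = BeautifulSoup(row_html, "html.parser")
row = soup.find("tr")

players = []
for tagline in row.select("div.user-tagline"):
    rating_text = tagline.select_one("span.user-rating").get_text(strip=True)
    players.append({
        "name": tagline.select_one("span.username").get_text(strip=True),
        "rating": int(rating_text.strip("()")),  # "(1357)" -> 1357
        "country": tagline.select_one("span.country-flag-small")["tip"],
    })

game = {
    "players": players,
    "result": [s.get_text(strip=True) for s in row.select("span.game-result")],
    "time_control": row.select_one("td.text-center > a.clickable-link").get_text(strip=True),
    "moves": int(row.select_one("a.moves").get_text(strip=True)),
    "date": row.select_one("a.archive-date").get_text(strip=True),
}
print(game)
```

To process a whole table, run the same extraction inside a loop over soup.find_all("tr").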
