How to find elements that match specific conditions in Beautiful Soup - python

I am learning and trying both Python (2.7) and Beautiful Soup (3.2.0). I already got some help here with my first problems (Beautiful Soup throws `IndexError`)
This is the Python code so far:
# Import the classes that are needed
import urllib2
from BeautifulSoup import BeautifulSoup
# URL to scrape and open it with the urllib2
url = 'http://www.wiziwig.tv/competition.php?competitionid=92&part=sports&discipline=football'
source = urllib2.urlopen(url)
# Turn the saced source into a BeautifulSoup object
soup = BeautifulSoup(source)
# From the source HTML page, search and store all <div class="date">...</div> and it's content
datesDiv = soup.findAll('div', { "class" : "date" })
# Loop through the tag and store only the needed information, being the actual date
dates = [tag.contents[0] for tag in datesDiv]
# From the source HTML page, search and store all <span class="time">...</span> and it's content
timesSpan = soup.findAll('span', { "class" : "time" })
# Loop through the tag and store only the needed information, being the actual times
times = [tag.contents[0] for tag in timesSpan]
# From the source HTML page, search and store all <td class="home">..</td> and it's content
hometeamsTd = soup.findAll('td', { "class" : "home" })
# Loop through the tag and store only the needed information, being the home team
# if tag.contents[1] != 'Dutch KNVB Beker' - Do a direct test if output is needed or not
hometeams = [tag.contents[1] for tag in hometeamsTd if tag.contents[1] != 'Dutch KNVB Beker']
# From the source HTML page, search and store all <td class="away">..</td> and it's content
# [1:] at the end meand slice the first one found
awayteamsTd = soup.findAll('td', { "class" : "away" })[1:]
# Loop through the tag and store only the needed information, being the away team
awayteams = [tag.contents[1] for tag in awayteamsTd]
# From the source HTML page, search and store all <a class="broadcast" href="...">..</a> and it's content
broadcastsA = soup.findAll('a', { "class" : "broadcast" })
# Loop through the tag and store only the needed information, being the the broadcast URL, where we can find the streams
broadcasts = [tag['href'] for tag in broadcastsA]
The problem I got is that the arrays are not equal to each other:
len(dates) #9, should be 6
len(times) #18, should be 12
len(hometeams) #6, is correct
len(awayteams) #6, is correct
len(broadcasts) #9, should be 6
Problem I have is that I do the following search for getting the dates array: soup.findAll('div', { "class" : "date" }). Which obviously gives me all the <div> elements with class="date". But the problem is, that I only need the date when there is also a <td> element with class="away".
See next part of the HTML that I am scraping:
<tr class="odd">
<td class="logo">
<img src="/gfx/disciplines/football.gif" alt="football"/>
</td>
<td>
Dutch Cup
<img src="/gfx/favourite_off.gif" class="fav off" alt="fav icon" id="comp-92"/>
</td>
<td>
<div class="date" rel="1380054900">Tuesday, September 24</div> <!-- This date is not needed, because within this <tr> there is no <td class="away"> -->
<span class="time" rel="1380054900">22:35</span> - <!-- This time is not needed, because within this <tr> there is no <td class="away"> -->
<span class="time" rel="1380058500">23:35</span> <!-- This time is not needed, because within this <tr> there is no <td class="away"> -->
</td>
<td class="home" colspan="3">
<img class="flag" src="/gfx/flags/nl.gif" alt="nl"/>Dutch KNVB Beker<img src="/gfx/favourite_off.gif" alt="fav icon" class="fav off" id="team-6758"/>
</td>
<td class="broadcast">
<a class="broadcast" href="/broadcast.php?matchid=221554&part=sports">Live</a> <!-- This href is not needed, because within this <tr> there is no <td class="away"> -->
</td>
</tr>
<tr class="even">
<td class="logo">
<img src="/gfx/disciplines/football.gif" alt="football"/>
</td>
<td>
Dutch Cup
<img src="/gfx/favourite_off.gif" class="fav off" alt="fav icon" id="comp-92"/>
</td>
<td>
<div class="date" rel="1380127500">Wednesday, September 25</div> <!-- This date we would like to have, because now all records are complete, there is a <td class="away"> in this <tr> -->
<span class="time" rel="1380127500">18:45</span> - <!-- This time we would like to have, because now all records are complete, there is a <td class="away"> in this <tr> -->
<span class="time" rel="1380134700">20:45</span> <!-- This date we would like to have, because now all records are complete, there is a <td class="away"> in this <tr> -->
</td>
<td class="home">
<img class="flag" src="/gfx/flags/nl.gif" alt="nl"/>PSV<img src="/gfx/favourite_off.gif" alt="fav icon" class="fav off" id="team-3"/>
</td>
<td>vs.</td>
<td class="away">
<img src="/gfx/favourite_off.gif" class="fav off" alt="fav icon" id="team-428"/>Stormvogels Telstar<img class="flag" src="/gfx/flags/nl.gif" alt="nl"/>
</td>
<td class="broadcast">
<a class="broadcast" href="/broadcast.php?matchid=221555&part=sports">Live</a> <!-- This href we would like to have, because now all records are complete, there is a <td class="away"> in this <tr> -->
</td>
</tr>

How about rethinking the way you scrape the data. You have a table with matches - then just iterate over the rows:
for tr in soup.findAll('tr', {'class': ['odd', 'even']}):
home_team = tr.find('td', {'class': 'home'}).text
if home_team == 'Dutch KNVB Beker':
continue
away_team = tr.find('td', {'class': 'away'}).text
date = ' - '.join([span.text for span in tr.findAll('span', {'class': 'time'})])
broadcast = tr.find('a', {'class': 'broadcast'})['href']
print home_team, away_team, date, broadcast
prints 5 rows:
RKC Waalwijk Heracles 20:45 - 22:45 /broadcast.php?matchid=221553&part=sports
PSV Stormvogels Telstar 18:45 - 20:45 /broadcast.php?matchid=221555&part=sports
Ajax FC Volendam 20:45 - 22:45 /broadcast.php?matchid=221556&part=sports
SC Heerenveen FC Twente 18:45 - 20:45 /broadcast.php?matchid=221558&part=sports
Feyenoord FC Dordrecht 20:45 - 22:45 /broadcast.php?matchid=221559&part=sports
Then, you can collect results into the list of dicts.

Related

Getting content of multiple tags with beautifulsoup in python

I have an HTML temp like below
<tr>
<td width="45">
<p style="text-align: center;"><strong>STT</strong></p>
</td>
<td width="204">
<p style="text-align: center;"><strong>Tên bệnh viện</strong></p>
</td>
<td width="364">
<p style="text-align: center;"><strong>Địa chỉ</strong></p>
</td>
</tr>,
<tr>
<td width="45"><strong> </strong>
<p><strong>1</strong></p>
</td>
<td width="204"><strong> </strong>
<h3><span id="list hospital"><strong> ABC HOSPITAL</strong></span></h3>
</td>
<td width="364">
<img alt="abc hospital" class="aligncenter size-full wp-image-5549" height="470" sizes="(max-width: 705px) 100vw, 705px" src="https://suckhoe2t.net/wp-content/uploads/2017/11/benh-vien-an-binh-suckhoe2t.jpg" srcset="https://suckhoe2t.net/wp-content/uploads/2017/11/benh-vien-an-binh-suckhoe2t.jpg 705w, https://suckhoe2t.net/wp-content/uploads/2017/11/benh-vien-an-binh-suckhoe2t-696x464.jpg 696w, https://suckhoe2t.net/wp-content/uploads/2017/11/benh-vien-an-binh-suckhoe2t-630x420.jpg 630w" width="705"/>
<p><iframe allowfullscreen="allowfullscreen" frameborder="0" height="450" src="https://www.google.com/maps/embed?pb=!1m18!221m12!1m3!1d3919.743463626014!2d106.66938211450157!3d10.75424379233658!2m3!1f0!2f0!3f0!3m2!1i1024!2i768!4f13.1!3m3!1m2!1s0x31752efc4039dee3%3A0x9157c2008d49be79!2sAn+Binh+Hospital!5e0!3m2!1sen!2s!4v1557202875759!5m2!1sen!2s" style="border: 0;" width="600"></iframe></p>
<ul>
<li>address: 1345 Golden View , LA</li>
<li>phonenumber: 3923 4260</li>
<li>Email: abc#hospital.com</li>
</ul>
<ul>
<li>Website: xxxxxxxxx</li>
I would like to have an output like this:
ABC HOSPITAL
address: 1345 Golden View , LA
phonenumber: 3923 4260
Email: abc#hospital.com
Because it has many li tags, I don't know how to get exactly all the fields I wish. Could you please help assist on this?
My code like below:
res = '''html code above'''
soup = BeautifulSoup(res, 'html.parser')
data = soup.find_all('tr')
for temp in data:
each = temp.find('h3')
print(each)
Output I got:
None
<h3><span id="list hospital"><strong> ABC HOSPITAL</strong></span></h3>
This should work.
soup = BeautifulSoup(res, 'html.parser')
data = soup.find_all('tr')
accepted_li = ('address', 'phonenumber', 'email') # tuple of "li" informations you want to get
for tr in data:
hospital_span = tr.find('span', {'id': 'list hospital'}) # get span of the hospital name
if hospital_span is not None:
print(hospital_span.find('strong').text.strip())
for li in tr.find_all('li'): # iterate over every li
if li.text.lower().startswith(accepted_li): # check if li element starts with any value in tuple
print(li.text)

Finding certain element using bs4 beautifulSoup

I usually use selenium but figured I would give bs4 a shot!
I am trying to find this specific text on the website, in the example below I want the last - 189305014
<div class="info_container">
<div id="profile_photo">
<img src="https://pbs.twimg.com/profile_images/882103883610427393/vLTiH3uR_reasonably_small.jpg" />
</div>
<table class="profile_info">
<tr>
<td class="left_column">
<p>Twitter User ID:</p>
</td>
<td>
<p>189305014</p>
</td>
</tr>
Here is the script I am using -
TwitterID = soup.find('td',attrs={'class':'left_column'}).text
This returns
Twitter User ID:
You can search for the next <p> tag to tag that contains "Twitter User ID:":
from bs4 import BeautifulSoup
txt = '''<div class="info_container">
<div id="profile_photo">
<img src="https://pbs.twimg.com/profile_images/882103883610427393/vLTiH3uR_reasonably_small.jpg" />
</div>
<table class="profile_info">
<tr>
<td class="left_column">
<p>Twitter User ID:</p>
</td>
<td>
<p>189305014</p>
</td>
</tr>
'''
soup = BeautifulSoup(txt, 'html.parser')
print(soup.find('p', text='Twitter User ID:').find_next('p'))
Prints:
<p>189305014</p>
Or last <p> element inside class="profile_info":
print(soup.select('.profile_info p')[-1])
Or first sibling to class="left_column":
print(soup.select_one('.left_column + *').text)
Use the following code to get you the desired output:
TwitterID = soup.find('td',attrs={'class': None}).text
To only get the digits from the second <p> tag, you can filter if the string isdigit():
from bs4 import BeautifulSoup
html = """<div class="info_container">
<div id="profile_photo">
<img src="https://pbs.twimg.com/profile_images/882103883610427393/vLTiH3uR_reasonably_small.jpg" />
</div>
<table class="profile_info">
<tr>
<td class="left_column">
<p>Twitter User ID:</p>
</td>
<td>
<p>189305014</p>
</td>
</tr>"""
soup = BeautifulSoup(html, 'html.parser')
result = ''.join(
[t for t in soup.find('div', class_='info_container').text if t.isdigit()]
)
print(result)
Output:
189305014

How to extract specific <td> from table

I'm working on a web scraping program using Python & BeautifulSoup. I encountered a problem when scraping a table.
My problem is, I need to extract selected <td> tags only and not the entire table.
I only need the numbers for 52 Week High, 52 Week Low, Earnings Per Share and Price to book value.
Is there anyway I can do that?
Sample Table
<table id="TABLE_1">
<tbody id="TBODY_2">
<tr id="TR_3">
<td id="TD_4">
<strong id="STRONG_5">52-Week High:</strong>
</td>
<td id="TD_6">
1,116.00
</td>
<td id="TD_7">
<strong id="STRONG_8">Earnings Per Share TTM (EPS):</strong>
</td>
<td id="TD_9">
47.87 (15.57%)
</td>
<td id="TD_10">
<strong id="STRONG_11">Price to Book Value (P/BV):</strong>
</td>
<td id="TD_12">
2.5481125565
</td>
</tr>
<tr id="TR_13">
<td id="TD_14">
<strong id="STRONG_15">52-Week Low:</strong>
</td>
<td id="TD_16">
867.50
</td>
<td id="TD_17">
<strong id="STRONG_18">Price-Earnings Ratio TTM (P/E):</strong>
</td>
<td id="TD_19">
20.8272404429
</td>
<td id="TD_20">
<strong id="STRONG_21">Return on Equity (ROE):</strong>
</td>
<td id="TD_22">
12.42%
</td>
</tr>
<tr id="TR_23">
<td id="TD_24">
<strong id="STRONG_25">Fair Value:</strong>
</td>
<td id="TD_26">
-
</td>
<td id="TD_27">
<strong id="STRONG_28">Dividends Per Share (DPS):</strong>
</td>
<td id="TD_29">
-
</td>
<td id="TD_30">
<strong id="STRONG_31">Recommendation:</strong>
</td>
<td id="TD_32">
None<span id="SPAN_33"></span>
</td>
</tr>
<tr id="TR_34">
<td id="TD_35">
<strong id="STRONG_36">Last Price:</strong>
</td>
<td id="TD_37">
<span id="SPAN_38"></span> <span id="SPAN_39">984.5</span>
</td>
</tr>
</tbody>
</table>
I also showed my codes for your reference.
Any help would be very much appreciated! Thank you!
from bs4 import BeautifulSoup as soup
from urllib.request import Request, urlopen
import pandas as pd
myurl = "https://www.investagrams.com/Stock/ac"
hdr = {'User-Agent': 'Mozilla/5.0'}
req = Request(myurl,headers=hdr)
# Open connection to website
uClient = urlopen(req)
# Offloads the content to variable
page_html = uClient.read()
#just closing it
uClient.close()
# html parser
page_soup = soup(page_html, "html.parser")
table = page_soup.find("div", {"id":"FundamentalAnalysisPanel"}).find("table")
print(table.text)
You can do it with findNextSibling method.
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.investagrams.com/Stock/ac')
soup = BeautifulSoup(r.text)
# specify table parameters for which you want to find values
parameters = ['52-Week High:', '52-Week Low:', 'Earnings Per Share TTM (EPS):', 'Price-Earnings Ratio TTM (P/E):', 'Price to Book Value (P/BV):']
# iterate all <td> tags and print text of the next sibling (with value),
# if this <td> contains specified parameter.
for td in soup.findAll('td'):
for p in parameters:
if td.find('strong', text=p) is not None:
print(td.findNextSibling().text.strip())
Result:
1,116.00
47.87 (15.57%)
2.5481125565
867.50
20.8272404429
This might be what you want
page_soup = soup(req.data.decode('utf-8'))
#tables = page_soup.find_all('table')
tables = page_soup.find_all('td')
df = pd.read_html(str(tables[i]))
where i is the table you want

Using Python + BeautifulSoup to pick up text in a table on webpage

I want to pick up a date on a webpage.
The original webpage source code looks like:
<TR class=odd>
<TD>
<TABLE class=zp>
<TBODY>
<TR>
<TD><SPAN>Expiry Date</SPAN>2016</TD></TR></TBODY></TABLE></TD>
<TD> </TD>
<TD> </TD></TR>
I want to pick up the ‘2016’ but I fail. The most I can do is:
page = urllib2.urlopen('http://www.thewebpage.com')
soup = BeautifulSoup(page.read())
a = soup.find_all(text=re.compile("Expiry Date"))
And I tried:
b = a[0].findNext('').text
print b
and
b = a[0].find_next('td').select('td:nth-of-type(1)')
print b
neither of them works out.
Any help? Thanks.
There are multiple options.
Option #1 (using CSS selector, being very explicit about the path to the element):
from bs4 import BeautifulSoup
data = """
<TR class="odd">
<TD>
<TABLE class="zp">
<TBODY>
<TR>
<TD>
<SPAN>
Expiry Date
</SPAN>
2016
</TD>
</TR>
</TBODY>
</TABLE>
</TD>
<TD> </TD>
<TD> </TD>
</TR>
"""
soup = BeautifulSoup(data)
span = soup.select('tr.odd table.zp > tbody > tr > td > span')[0]
print span.next_sibling.strip() # prints 2016
We are basically saying: get me the span tag that is directly inside the td that is directly inside the tr that is directly inside tbody that is directly inside the table tag with zp class that is inside the tr tag with odd class. Then, we are using next_sibling to get the text after the span tag.
Option #2 (find span by text; think it is more readable)
span = soup.find('span', text=re.compile('Expiry Date'))
print span.next_sibling.strip() # prints 2016
re.compile() is needed since there could be multi-lines and additional spaces around the text. Do not forget to import re module.
An alternative to the css selector is:
import bs4
data = """
<TR class="odd">
<TD>
<TABLE class="zp">
<TBODY>
<TR>
<TD>
<SPAN>
Expiry Date
</SPAN>
2016
</TD>
</TR>
</TBODY>
</TABLE>
</TD>
<TD> </TD>
<TD> </TD>
</TR>
"""
soup = bs4.BeautifulSoup(data)
exp_date = soup.find('table', class_='zp').tbody.tr.td.span.next_sibling
print exp_date # 2016
To learn about BeautifulSoup, I recommend you read the documentation.

Parsing html table with BeautifulSoup to python dictionary

This is an html code than I'm trying to parse with BeautifulSoup:
<table>
<tr>
<th width="100">menu1</th>
<td>
<ul class="classno1" style="margin-bottom:10;">
<li>Some data1</li>
<li>Foo1Bar1</li>
... (amount of this tags isn't fixed)
</ul>
</td>
</tr>
<tr>
<th width="100">menu2</th>
<td>
<ul class="classno1" style="margin-bottom:10;">
<li>Some data2</li>
<li>Foo2Bar2</li>
<li>Foo3Bar3</li>
<li>Some data3</li>
... (amount of this tags isn't fixed too)
</ul>
</td>
</tr>
</table>
The output I would like to get is a dictionary like this:
DICT = {
'menu1': ['Some data1','Foo1 Bar1'],
'menu2': ['Some data2','Foo2 Bar2','Foo3 Bar3','Some data3'],
}
As I already mentioned in the code, amount of <li> tags is not fixed. Additionally, there could be:
menu1 and menu2
just menu1
just menu2
no menu1 and menu2 (just <table></table>)
so e.g. it could looks just like this:
<table>
<tr>
<th width="100">menu1</th>
<td>
<ul class="classno1" style="margin-bottom:10;">
<li>Some data1</li>
<li>Foo1Bar1</li>
... (amount of this tags isn't fixed)
</ul>
</td>
</tr>
</table>
I was trying to use this example but with no success. I think it's because of that <ul> tags, I can't read proper data from table. Problem for me is also variable amount of menus and <li> tags.
So my question is how to parse this particular table to python dictionary?
I should mention that I already parsed some simple data with .text attribute of BeautifulSoup handler so it would be nice if I could just keep it as is.
request = c.get('http://example.com/somepage.html)
soup = bs(request.text)
and this is always the first table of the page, so I can get it with:
table = soup.find_all('table')[0]
Thank you in advance for any help.
html = """<table>
<tr>
<th width="100">menu1</th>
<td>
<ul class="classno1" style="margin-bottom:10;">
<li>Some data1</li>
<li>Foo1Bar1</li>
</ul>
</td>
</tr>
<tr>
<th width="100">menu2</th>
<td>
<ul class="classno1" style="margin-bottom:10;">
<li>Some data2</li>
<li>Foo2Bar2</li>
<li>Foo3Bar3</li>
<li>Some data3</li>
</ul>
</td>
</tr>
</table>"""
import BeautifulSoup as bs
soup = bs.BeautifulSoup(html)
table = soup.findAll('table')[0]
results = {}
th = table.findChildren('th')#,text=['menu1','menu2'])
for x in th:
#print x
results_li = []
li = x.nextSibling.nextSibling.findChildren('li')
for y in li:
#print y.next
results_li.append(y.next)
results[x.next] = results_li
print results
.
{
u'menu2': [u'Some data2', u'Foo2', u'Foo3', u'Some data3'],
u'menu1': [u'Some data1', u'Foo1']
}

Categories

Resources