Finding certain element using bs4 beautifulSoup - python

I usually use selenium but figured I would give bs4 a shot!
I am trying to extract a specific piece of text from a page. In the example below I want the last value, 189305014:
<div class="info_container">
<div id="profile_photo">
<img src="https://pbs.twimg.com/profile_images/882103883610427393/vLTiH3uR_reasonably_small.jpg" />
</div>
<table class="profile_info">
<tr>
<td class="left_column">
<p>Twitter User ID:</p>
</td>
<td>
<p>189305014</p>
</td>
</tr>
Here is the script I am using:
TwitterID = soup.find('td',attrs={'class':'left_column'}).text
This returns
Twitter User ID:

You can search for the <p> tag that contains "Twitter User ID:" and then take the next <p> tag after it:
from bs4 import BeautifulSoup
txt = '''<div class="info_container">
<div id="profile_photo">
<img src="https://pbs.twimg.com/profile_images/882103883610427393/vLTiH3uR_reasonably_small.jpg" />
</div>
<table class="profile_info">
<tr>
<td class="left_column">
<p>Twitter User ID:</p>
</td>
<td>
<p>189305014</p>
</td>
</tr>
'''
soup = BeautifulSoup(txt, 'html.parser')
print(soup.find('p', text='Twitter User ID:').find_next('p'))
Prints:
<p>189305014</p>
Or take the last <p> element inside class="profile_info":
print(soup.select('.profile_info p')[-1])
Or take the element that immediately follows class="left_column":
print(soup.select_one('.left_column + *').text)
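The lookups above assume the label is always present and raise an error when it is not. A minimal sketch of a guarded version, assuming the same markup as in the question; `value_after_label` is a hypothetical helper name, not part of bs4:

```python
from bs4 import BeautifulSoup


def value_after_label(soup, label):
    """Return the text of the first <p> after the <p> whose text equals `label`, or None."""
    label_tag = soup.find('p', string=label)
    if label_tag is None:
        return None  # label not found on this page
    value_tag = label_tag.find_next('p')
    return value_tag.get_text(strip=True) if value_tag else None


# Trimmed copy of the markup from the question
sample = '''<table class="profile_info">
<tr>
<td class="left_column"><p>Twitter User ID:</p></td>
<td><p>189305014</p></td>
</tr>
</table>'''

soup = BeautifulSoup(sample, 'html.parser')
print(value_after_label(soup, 'Twitter User ID:'))  # 189305014
print(value_after_label(soup, 'Missing label:'))    # None
```

Returning None instead of raising makes it easier to run the same lookup over many pages where some profiles may lack the field.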

The value cell is the only <td> here without a class attribute, so you can match on the missing class:
TwitterID = soup.find('td', attrs={'class': None}).text.strip()

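To get only the digits, you can filter the container's text and keep the characters for which isdigit() is true. Note that this collects every digit inside the container, so it relies on the ID being the only number in the visible text: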
from bs4 import BeautifulSoup
html = """<div class="info_container">
<div id="profile_photo">
<img src="https://pbs.twimg.com/profile_images/882103883610427393/vLTiH3uR_reasonably_small.jpg" />
</div>
<table class="profile_info">
<tr>
<td class="left_column">
<p>Twitter User ID:</p>
</td>
<td>
<p>189305014</p>
</td>
</tr>"""
soup = BeautifulSoup(html, 'html.parser')
result = ''.join(
    [t for t in soup.find('div', class_='info_container').text if t.isdigit()]
)
print(result)
Output:
189305014

Related

How to store a set of multiple p tag texts to a single variable in space delimit with BeautifulSoup in Python

How can I store texts from multiple HTML p tags in a single variable with space delimit with BeautifulSoup in the following example? I'm brand new to Python. Thank you!
from bs4 import BeautifulSoup
HTML = '''
<tr>
<td class="up">
<p class="pie_chart_val">+1.30%</p>
</td>
<td class="down">
<p class="pie_chart_val">-1.33%</p>
</td>
<td class="up">
<p class="pie_chart_val">+1.58%</p>
</td>
<td class="up">
<p class="pie_chart_val">+1.61%</p>
</td>
</tr>
'''
soup = BeautifulSoup(HTML, 'lxml')
values = soup.find_all('p', class_="pie_chart_val")
for value in values:
    value = value.text
    print(value)
You can pass end="," to print() to keep the output on one line:
from bs4 import BeautifulSoup
html= """<tr>
<td class="up">
<p class="pie_chart_val">+1.30%</p>
</td>
<td class="down">
<p class="pie_chart_val">-1.33%</p>
</td>
<td class="up">
<p class="pie_chart_val">+1.58%</p>
</td>
<td class="up">
<p class="pie_chart_val">+1.61%</p>
</td>
</tr>
"""
soup = BeautifulSoup(html, 'lxml')
values = soup.find_all('p', class_="pie_chart_val")
for value in values:
    print(value.text, end=",")
Output:
+1.30%,-1.33%,+1.58%,+1.61%,
Or collect the texts into a list and print them on one line:
lst=[i.get_text(strip=True) for i in values]
print(*lst,sep=",")
Output:
+1.30%,-1.33%,+1.58%,+1.61%
To store the result in a single variable:
x=",".join(lst)
print(x)
Output:
+1.30%,-1.33%,+1.58%,+1.61%
You can also do this with string concatenation:
from bs4 import BeautifulSoup
HTML = '''
<tr>
<td class="up">
<p class="pie_chart_val">+1.30%</p>
</td>
<td class="down">
<p class="pie_chart_val">-1.33%</p>
</td>
<td class="up">
<p class="pie_chart_val">+1.58%</p>
</td>
<td class="up">
<p class="pie_chart_val">+1.61%</p>
</td>
</tr>
'''
soup = BeautifulSoup(HTML, 'lxml')
values = soup.find_all('p', class_="pie_chart_val")
ans = ''
for value in values:
    ans += value.text.strip() + ' '
print(ans)
ans is a string containing the space-separated texts of the <p> tags (with a trailing space):
+1.30% -1.33% +1.58% +1.61%
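The same space-delimited string can be built without a trailing space by passing the texts to str.join(). This is a minimal sketch using the values from the question; the wrapping <table> and the 'html.parser' choice are assumptions made to keep the example self-contained:

```python
from bs4 import BeautifulSoup

html = '''<table><tr>
<td class="up"><p class="pie_chart_val">+1.30%</p></td>
<td class="down"><p class="pie_chart_val">-1.33%</p></td>
<td class="up"><p class="pie_chart_val">+1.58%</p></td>
<td class="up"><p class="pie_chart_val">+1.61%</p></td>
</tr></table>'''

soup = BeautifulSoup(html, 'html.parser')
values = soup.find_all('p', class_='pie_chart_val')

# join() puts the separator only between items, so nothing needs trimming afterwards
joined = ' '.join(p.get_text(strip=True) for p in values)
print(joined)  # +1.30% -1.33% +1.58% +1.61%
```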

How to extract pairs of (href, alt) with python scrapy

I have an html page (seed) of the form:
<div class="sth1">
<table cellspacing="6" width="600">
<tr>
<td>
<img alt="alt1" border="0" height="22" src="img1" width="92">
</td>
<td>
name1
</td>
<td>
<img alt="alt2" border="0" height="22" src="img2" width="92">
</td>
<td>
name2
</td>
</tr>
</table>
</div>
What I would like to do is loop over all <tr> elements and extract all (href, alt) pairs with python scrapy. In this example, I should get:
link1, alt1
link2, alt2
Here is an example from the Scrapy Shell:
$ scrapy shell index.html
In [1]: for cell in response.xpath("//div[@class='sth1']/table/tr/td"):
   ...:     href = cell.xpath("a/@href").extract()
   ...:     alt = cell.xpath("a/img/@alt").extract()
   ...:     print href, alt
[u'link1'] [u'alt1']
[u'link1'] []
[u'link2'] [u'alt2']
[u'link2'] []
where index.html contains the sample HTML provided in the question.
You could try Scrapy's built-in SelectorList combined with Python's zip(). Calling extract() turns the selectors into plain strings:
from scrapy.selector import SelectorList
xpq = '//div[@class="sth1"]/table/tr/td[./a/img]'
cells = SelectorList(response.xpath(xpq))
zip(cells.xpath('a/@href').extract(), cells.xpath('a/img/@alt').extract())
=> [('link1', 'alt1'), ('link2', 'alt2')]

Using Python + BeautifulSoup to pick up text in a table on webpage

I want to pick up a date on a webpage.
The original webpage source code looks like:
<TR class=odd>
<TD>
<TABLE class=zp>
<TBODY>
<TR>
<TD><SPAN>Expiry Date</SPAN>2016</TD></TR></TBODY></TABLE></TD>
<TD> </TD>
<TD> </TD></TR>
I want to pick up the ‘2016’ but I fail. The most I can do is:
page = urllib2.urlopen('http://www.thewebpage.com')
soup = BeautifulSoup(page.read())
a = soup.find_all(text=re.compile("Expiry Date"))
And I tried:
b = a[0].findNext('').text
print b
and
b = a[0].find_next('td').select('td:nth-of-type(1)')
print b
neither of them works out.
Any help? Thanks.
There are multiple options.
Option #1 (using CSS selector, being very explicit about the path to the element):
from bs4 import BeautifulSoup
data = """
<TR class="odd">
<TD>
<TABLE class="zp">
<TBODY>
<TR>
<TD>
<SPAN>
Expiry Date
</SPAN>
2016
</TD>
</TR>
</TBODY>
</TABLE>
</TD>
<TD> </TD>
<TD> </TD>
</TR>
"""
soup = BeautifulSoup(data, 'html.parser')
span = soup.select('tr.odd table.zp > tbody > tr > td > span')[0]
print(span.next_sibling.strip())  # prints 2016
We are basically saying: get me the span tag that is directly inside the td that is directly inside the tr that is directly inside tbody that is directly inside the table tag with zp class that is inside the tr tag with odd class. Then, we are using next_sibling to get the text after the span tag.
Option #2 (find the span by its text, which I think is more readable):
span = soup.find('span', text=re.compile('Expiry Date'))
print(span.next_sibling.strip())  # prints 2016
re.compile() is needed because the text inside the span may span multiple lines and be surrounded by extra whitespace, so an exact string match would fail. Do not forget to import the re module.
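For reference, a self-contained sketch of the text-based lookup with a guard for pages where the label is missing. It uses the `string=` argument, the newer spelling of `text=` in bs4 (an assumption about your installed version), and a trimmed copy of the markup:

```python
import re

from bs4 import BeautifulSoup

data = '''<table class="zp"><tbody><tr>
<td>
<span>
  Expiry Date
</span>
2016
</td>
</tr></tbody></table>'''

soup = BeautifulSoup(data, 'html.parser')

# The regex is searched inside the span's text, so surrounding whitespace does not matter
span = soup.find('span', string=re.compile(r'Expiry\s+Date'))

# next_sibling is the bare text node that follows the closing </span>
expiry = span.next_sibling.strip() if span and span.next_sibling else None
print(expiry)  # 2016
```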
An alternative to the CSS selector is to walk down the tree with attribute access:
import bs4
data = """
<TR class="odd">
<TD>
<TABLE class="zp">
<TBODY>
<TR>
<TD>
<SPAN>
Expiry Date
</SPAN>
2016
</TD>
</TR>
</TBODY>
</TABLE>
</TD>
<TD> </TD>
<TD> </TD>
</TR>
"""
soup = bs4.BeautifulSoup(data, 'html.parser')
exp_date = soup.find('table', class_='zp').tbody.tr.td.span.next_sibling
print(exp_date.strip())  # 2016
To learn about BeautifulSoup, I recommend you read the documentation.

extract class name from tag beautifulsoup python

I have the following HTML code:
<td class="image">
<a href="/target/tt0111161/" title="Target Text 1">
<img alt="target img" height="74" src="img src url" title="image title" width="54"/>
</a>
</td>
<td class="title">
<span class="wlb_wrapper" data-caller-name="search" data-size="small" data-tconst="tt0111161">
</span>
<a href="/target/tt0111161/">
Other Text
</a>
<span class="year_type">
(2013)
</span>
I am trying to use beautiful soup to parse certain elements into a tab-delimited file.
I got some great help and have:
for td in soup.select('td.title'):
    span = td.select('span.wlb_wrapper')
    if span:
        print span[0].get('data-tconst') # To get `tt0082971`
Now I want to get "Target Text 1".
I've tried adapting the code above, for example:
for td in soup.select('td.image'): #trying to select the <td class="image"> tag
    img = td.select('a.title') #from inside td I now try to look inside the a tag that also has the word title
    if img:
        print img[2].get('title') #if it finds anything, then I want to return the text in class 'title'
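If you want to handle each td according to its class (td class="image" versus td class="title"), you can read a tag's attributes as if the tag were a dictionary and branch on the class value.
The code below walks every row, prints the title attribute of the links inside td class="image", and prints data-tconst for each span class="wlb_wrapper" inside td class="title":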
from bs4 import BeautifulSoup
page = """
<table>
<tr>
<td class="image">
<a href="/target/tt0111161/" title="Target Text 1">
<img alt="target img" height="74" src="img src url" title="image title" width="54"/>
</a>
</td>
<td class="title">
<span class="wlb_wrapper" data-caller-name="search" data-size="small" data-tconst="tt0111161">
</span>
<a href="/target/tt0111161/">
Other Text
</a>
<span class="year_type">
(2013)
</span>
</td>
</tr>
</table>
"""
soup = BeautifulSoup(page, 'html.parser')
tbl = soup.find('table')
rows = tbl.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    for col in cols:
        if col.has_attr('class') and col['class'][0] == 'image':
            hrefs = col.find_all('a')
            for href in hrefs:
                print(href.get('title'))
        elif col.has_attr('class') and col['class'][0] == 'title':
            spans = col.find_all('span')
            for span in spans:
                if span.has_attr('class') and span['class'][0] == 'wlb_wrapper':
                    print(span.get('data-tconst'))
span.wlb_wrapper is a CSS selector that matches <span class="wlb_wrapper" data-caller-name="search" data-size="small" data-tconst="tt0111161">. See the documentation on CSS selectors for more information.
To see how the selector affects the result, change span = td.select('span.wlb_wrapper') in your code to span = td.select('span'), then to span = td.select('span.year_type'), and print what each one returns.
Inspecting what span holds in each case should show you how to select what you want.
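In CSS, a.title matches an <a> whose class is "title", whereas the title in the question is an attribute of the link. An attribute selector, a[title], targets it directly. A minimal sketch on a trimmed copy of the markup (the wrapping <table> and <tr> are assumptions added to keep it self-contained):

```python
from bs4 import BeautifulSoup

page = '''<table><tr>
<td class="image">
<a href="/target/tt0111161/" title="Target Text 1">
<img alt="target img" src="img src url" title="image title"/>
</a>
</td>
<td class="title">
<span class="wlb_wrapper" data-tconst="tt0111161"></span>
<a href="/target/tt0111161/">Other Text</a>
</td>
</tr></table>'''

soup = BeautifulSoup(page, 'html.parser')

# a[title] selects links that carry a title attribute, regardless of their class
titles = [a['title'] for a in soup.select('td.image a[title]')]
ids = [s['data-tconst'] for s in soup.select('td.title span.wlb_wrapper')]

print(titles)  # ['Target Text 1']
print(ids)     # ['tt0111161']
```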

How to find elements that match specific conditions in Beautiful Soup

I am learning and trying both Python (2.7) and Beautiful Soup (3.2.0). I already got some help here with my first problems (Beautiful Soup throws `IndexError`)
This is the Python code so far:
# Import the classes that are needed
import urllib2
from BeautifulSoup import BeautifulSoup
# URL to scrape and open it with the urllib2
url = 'http://www.wiziwig.tv/competition.php?competitionid=92&part=sports&discipline=football'
source = urllib2.urlopen(url)
# Turn the saved source into a BeautifulSoup object
soup = BeautifulSoup(source)
# From the source HTML page, search and store all <div class="date">...</div> and its content
datesDiv = soup.findAll('div', { "class" : "date" })
# Loop through the tag and store only the needed information, being the actual date
dates = [tag.contents[0] for tag in datesDiv]
# From the source HTML page, search and store all <span class="time">...</span> and its content
timesSpan = soup.findAll('span', { "class" : "time" })
# Loop through the tag and store only the needed information, being the actual times
times = [tag.contents[0] for tag in timesSpan]
# From the source HTML page, search and store all <td class="home">..</td> and its content
hometeamsTd = soup.findAll('td', { "class" : "home" })
# Loop through the tag and store only the needed information, being the home team
# if tag.contents[1] != 'Dutch KNVB Beker' - Do a direct test if output is needed or not
hometeams = [tag.contents[1] for tag in hometeamsTd if tag.contents[1] != 'Dutch KNVB Beker']
# From the source HTML page, search and store all <td class="away">..</td> and its content
# [1:] at the end means skip the first one found
awayteamsTd = soup.findAll('td', { "class" : "away" })[1:]
# Loop through the tag and store only the needed information, being the away team
awayteams = [tag.contents[1] for tag in awayteamsTd]
# From the source HTML page, search and store all <a class="broadcast" href="...">..</a> and its content
broadcastsA = soup.findAll('a', { "class" : "broadcast" })
# Loop through the tag and store only the needed information, being the broadcast URL, where we can find the streams
broadcasts = [tag['href'] for tag in broadcastsA]
The problem is that the resulting lists do not have matching lengths:
len(dates) #9, should be 6
len(times) #18, should be 12
len(hometeams) #6, is correct
len(awayteams) #6, is correct
len(broadcasts) #9, should be 6
The cause is the search I use to build the dates list: soup.findAll('div', { "class" : "date" }). This returns every <div> element with class="date", but I only need the date when the same <tr> also contains a <td> element with class="away".
See next part of the HTML that I am scraping:
<tr class="odd">
<td class="logo">
<img src="/gfx/disciplines/football.gif" alt="football"/>
</td>
<td>
Dutch Cup
<img src="/gfx/favourite_off.gif" class="fav off" alt="fav icon" id="comp-92"/>
</td>
<td>
<div class="date" rel="1380054900">Tuesday, September 24</div> <!-- This date is not needed, because within this <tr> there is no <td class="away"> -->
<span class="time" rel="1380054900">22:35</span> - <!-- This time is not needed, because within this <tr> there is no <td class="away"> -->
<span class="time" rel="1380058500">23:35</span> <!-- This time is not needed, because within this <tr> there is no <td class="away"> -->
</td>
<td class="home" colspan="3">
<img class="flag" src="/gfx/flags/nl.gif" alt="nl"/>Dutch KNVB Beker<img src="/gfx/favourite_off.gif" alt="fav icon" class="fav off" id="team-6758"/>
</td>
<td class="broadcast">
<a class="broadcast" href="/broadcast.php?matchid=221554&part=sports">Live</a> <!-- This href is not needed, because within this <tr> there is no <td class="away"> -->
</td>
</tr>
<tr class="even">
<td class="logo">
<img src="/gfx/disciplines/football.gif" alt="football"/>
</td>
<td>
Dutch Cup
<img src="/gfx/favourite_off.gif" class="fav off" alt="fav icon" id="comp-92"/>
</td>
<td>
<div class="date" rel="1380127500">Wednesday, September 25</div> <!-- This date we would like to have, because now all records are complete, there is a <td class="away"> in this <tr> -->
<span class="time" rel="1380127500">18:45</span> - <!-- This time we would like to have, because now all records are complete, there is a <td class="away"> in this <tr> -->
<span class="time" rel="1380134700">20:45</span> <!-- This date we would like to have, because now all records are complete, there is a <td class="away"> in this <tr> -->
</td>
<td class="home">
<img class="flag" src="/gfx/flags/nl.gif" alt="nl"/>PSV<img src="/gfx/favourite_off.gif" alt="fav icon" class="fav off" id="team-3"/>
</td>
<td>vs.</td>
<td class="away">
<img src="/gfx/favourite_off.gif" class="fav off" alt="fav icon" id="team-428"/>Stormvogels Telstar<img class="flag" src="/gfx/flags/nl.gif" alt="nl"/>
</td>
<td class="broadcast">
<a class="broadcast" href="/broadcast.php?matchid=221555&part=sports">Live</a> <!-- This href we would like to have, because now all records are complete, there is a <td class="away"> in this <tr> -->
</td>
</tr>
How about rethinking the way you scrape the data? You have a table of matches, so just iterate over its rows:
for tr in soup.findAll('tr', {'class': ['odd', 'even']}):
    home_team = tr.find('td', {'class': 'home'}).text
    if home_team == 'Dutch KNVB Beker':
        continue
    away_team = tr.find('td', {'class': 'away'}).text
    date = ' - '.join([span.text for span in tr.findAll('span', {'class': 'time'})])
    broadcast = tr.find('a', {'class': 'broadcast'})['href']
    print home_team, away_team, date, broadcast
This prints 5 rows:
RKC Waalwijk Heracles 20:45 - 22:45 /broadcast.php?matchid=221553&part=sports
PSV Stormvogels Telstar 18:45 - 20:45 /broadcast.php?matchid=221555&part=sports
Ajax FC Volendam 20:45 - 22:45 /broadcast.php?matchid=221556&part=sports
SC Heerenveen FC Twente 18:45 - 20:45 /broadcast.php?matchid=221558&part=sports
Feyenoord FC Dordrecht 20:45 - 22:45 /broadcast.php?matchid=221559&part=sports
Then you can collect the results into a list of dicts.
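A minimal sketch of collecting the rows into a list of dicts, written for bs4 on Python 3 rather than the Beautiful Soup 3 setup in the question (that swap is an assumption). Instead of comparing the team name, it skips any row that lacks a td class="away", which is the condition the question describes. The markup is a trimmed version of the two sample rows, and the dict keys are illustrative names:

```python
from bs4 import BeautifulSoup

html = '''<table>
<tr class="odd">
<td><span class="time">22:35</span> - <span class="time">23:35</span></td>
<td class="home" colspan="3">Dutch KNVB Beker</td>
<td class="broadcast"><a class="broadcast" href="/broadcast.php?matchid=221554&amp;part=sports">Live</a></td>
</tr>
<tr class="even">
<td><span class="time">18:45</span> - <span class="time">20:45</span></td>
<td class="home">PSV</td>
<td>vs.</td>
<td class="away">Stormvogels Telstar</td>
<td class="broadcast"><a class="broadcast" href="/broadcast.php?matchid=221555&amp;part=sports">Live</a></td>
</tr>
</table>'''

soup = BeautifulSoup(html, 'html.parser')
matches = []
for tr in soup.find_all('tr', class_=['odd', 'even']):
    home = tr.find('td', class_='home')
    away = tr.find('td', class_='away')
    if home is None or away is None:
        continue  # not a complete match row, so skip its times and link too
    times = [s.get_text(strip=True) for s in tr.find_all('span', class_='time')]
    matches.append({
        'home': home.get_text(strip=True),
        'away': away.get_text(strip=True),
        'time': ' - '.join(times),
        'broadcast': tr.find('a', class_='broadcast')['href'],
    })

for match in matches:
    print(match)
```

Because every field is read from the same <tr>, the values cannot drift out of step the way separately built lists can.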
