Scraping table with BeautifulSoup4

I am trying to scrape some particular rows inside a table, but I don't know how to access the information properly. Here is the HTML:
<tr class="even">
<td style="background: #F5645C; color: #F5645C;">1 </td>
<td>Michael</td>
<td class="right">57</td>
<td class="right">0</td>
<td class="right">5</td>
</tr>
<tr class="odd">
<td style="background: #8FB9B0; color: #8FB9B0;">1 </td>
<td>Clara</td>
<td class="right">48</td>
<td class="right">0</td>
<td class="right">5</td>
</tr>
<tr class="even">
<td style="background: #F5645C; color: #F5645C;">1 </td>
<td>Lisa</td>
<td class="right">44</td>
<td class="right">2</td>
<td class="right">5</td>
</tr>
<tr class="odd">
<td style="background: #8FB9B0; color: #8FB9B0;">0 </td>
<td>Joe</td>
<td class="right">43</td>
<td class="right">0</td>
<td class="right">13</td>
</tr>
<tr class="even">
<td style="background: #F5645C; color: #F5645C;">1 </td>
<td>John</td>
<td class="right">38</td>
<td class="right">3</td>
<td class="right">4</td>
</tr>
<tr class="odd">
<td style="background: #F5645C; color: #F5645C;">1 </td>
<td>Francesca</td>
<td class="right">35</td>
<td class="right">2</td>
<td class="right">5</td>
</tr>
<tr class="even">
<td style="background: #8FB9B0; color: #8FB9B0;">0 </td>
<td>Carlos</td>
<td class="right">27</td>
<td class="right">1</td>
<td class="right">2</td>
</tr>
What I want is the text of the td that comes right after every td whose style uses the color F5645C, but I am running into problems.
This is what I want the script to return:
Michael
Lisa
John
Francesca
Here is the code I currently have:
table = soup.find('table')
table_rows = table.find_all('tr')
for tr in table_rows:
    td = tr.find('td', style='background: #F5645C; color: #F5645C;').find_next_sibling('td').get_text()
    print(td)
Running the script raises: AttributeError: 'NoneType' object has no attribute 'find_next_sibling'

You can use a CSS selector to select all <td> tags whose style attribute contains the string color: #F5645C, and then call find_next() on each match:
for td in soup.select('td[style*="color: #F5645C"]'):
    print(td.find_next('td').text)
This prints:
Michael
Lisa
John
Francesca
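
The AttributeError in the question comes from tr.find(...) returning None for rows whose first cell has a different style; calling find_next_sibling on None then fails. The following is a minimal sketch, assuming a trimmed-down copy of the question's rows wrapped in a <table> and parsed with html.parser. It shows the CSS selector approach next to the original per-row loop with a None check added.

```python
from bs4 import BeautifulSoup

# Trimmed-down copy of the question's table (assumption: wrapped in <table>).
html = """
<table>
<tr class="even">
<td style="background: #F5645C; color: #F5645C;">1 </td>
<td>Michael</td>
<td class="right">57</td>
</tr>
<tr class="odd">
<td style="background: #8FB9B0; color: #8FB9B0;">1 </td>
<td>Clara</td>
<td class="right">48</td>
</tr>
<tr class="even">
<td style="background: #F5645C; color: #F5645C;">1 </td>
<td>Lisa</td>
<td class="right">44</td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Approach 1: CSS attribute selector, then the next <td> in document order.
names_css = [td.find_next("td").get_text(strip=True)
             for td in soup.select('td[style*="color: #F5645C"]')]

# Approach 2: keep the per-row loop, but skip rows where find() returns None.
names_loop = []
for tr in soup.find_all("tr"):
    marker = tr.find("td", style="background: #F5645C; color: #F5645C;")
    if marker is None:
        continue
    names_loop.append(marker.find_next_sibling("td").get_text(strip=True))

print(names_css)
print(names_loop)
```

Both lists should contain only the names from the rows marked with F5645C.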

data = BeautifulSoup(html, 'html.parser')
for tr in data.find_all('tr'):
    td = tr.find_all('td')
    print(td[1].text)
This prints the second cell of every row; from here you can filter on the style of the first cell.

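Use .find_next("td").text (findNext is the older spelling of the same method). Note that this prints the name from every row, without filtering by color.
Example:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
for tr in soup.find_all("tr"):
    print(tr.td.find_next("td").text)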
Output:
Michael
Clara
Lisa
Joe
John
Francesca
Carlos

You can use find_all with a filter on the style attribute. The string has to match the attribute value in the HTML exactly:
bs = BeautifulSoup(htmlcontent, 'html.parser')
bs.find_all('td', attrs={'style': 'background: #F5645C; color: #F5645C;'})
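
An exact string match on style only succeeds when the attribute text is identical character for character, so it can be brittle. As a hedged alternative, find_all also accepts a compiled regular expression as the attribute value. The sketch below assumes a two-row sample modelled on the question's table; the variable names are illustrative only.

```python
import re
from bs4 import BeautifulSoup

html = """
<table>
<tr><td style="background: #F5645C; color: #F5645C;">1 </td><td>John</td></tr>
<tr><td style="background: #8FB9B0; color: #8FB9B0;">0 </td><td>Carlos</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Match any style that sets the target colour, ignoring spacing and letter case.
red = re.compile(r"color:\s*#F5645C", re.IGNORECASE)
cells = soup.find_all("td", style=red)

# For each marked cell, read the text of the cell right after it.
names = [td.find_next_sibling("td").get_text(strip=True) for td in cells]
print(names)
```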

Related

Beautiful Soup, for loop to write data

I'm having trouble printing the results of this soup call in my IDE.
I have the following call:
row = soup.find_all('td', attrs = {'class': 'Table__TD'})
Here is a subset of what it returns:
[<td class="Table__TD">Sat 11/9</td>,
<td class="Table__TD"><span class="flex"><span class="pr2">vs</span><span class="pr2 TeamLink__Logo"><a class="AnchorLink v-mid" data-clubhouse-uid="s:40~l:46~t:6" href="/nba/team/_/name/dal/dallas-mavericks" title="Team - Dallas Mavericks"><img alt="DAL" class="v-mid" data-clubhouse-uid="s:40~l:46~t:6" height="20" src="data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7" title="DAL" width="20"/></a></span><span><a class="AnchorLink v-mid" data-clubhouse-uid="s:40~l:46~t:6" href="/nba/team/_/name/dal/dallas-mavericks" title="Team - Dallas Mavericks">DAL</a></span></span></td>,
<td class="Table__TD"><a class="AnchorLink" data-game-link="true" href="http://www.espn.com/nba/game?gameId=401160772"><span class="flex tl"><span class="pr2"><div class="ResultCell tl loss-stat">L</div></span><span>138-122</span></span></a></td>,
<td class="Table__TD">31</td>,
<td class="Table__TD">6-12</td>,
<td class="Table__TD">50.0</td>,
<td class="Table__TD">4-9</td>,
<td class="Table__TD">44.4</td>,
<td class="Table__TD">2-2</td>,
<td class="Table__TD">100.0</td>,
<td class="Table__TD">4</td>,
<td class="Table__TD">4</td>,
<td class="Table__TD">2</td>,
<td class="Table__TD">3</td>,
<td class="Table__TD">2</td>,
<td class="Table__TD">1</td>,
<td class="Table__TD">18</td>,
<td class="Table__TD">Fri 11/8</td>,
I am trying to use a for loop to write these out, but my console is not returning anything.
for data in row[0].find_all('td'):
    print(data.get_text())
Can anyone tell me what I am doing wrong? Thanks.

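The initial search already returns the td tags, so you don't need to call find_all on the tag name again. row[0] is a single td, and it contains no nested td elements, which is why nothing is printed.
Just iterate over the result directly:
for data in row:
    print(data.get_text())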
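
To make the difference concrete, here is a small sketch using three cells modelled on the question's output (the surrounding <table> and <tr> are assumptions). It shows that searching inside a single cell finds nothing, while iterating over the ResultSet visits every cell.

```python
from bs4 import BeautifulSoup

html = """
<table><tr>
<td class="Table__TD">Sat 11/9</td>
<td class="Table__TD">31</td>
<td class="Table__TD">6-12</td>
</tr></table>
"""
soup = BeautifulSoup(html, "html.parser")
cells = soup.find_all("td", attrs={"class": "Table__TD"})

# A <td> has no <td> descendants here, so searching inside cells[0] finds nothing.
nested = cells[0].find_all("td")

# Iterating over the ResultSet itself visits every matched cell.
texts = [cell.get_text(strip=True) for cell in cells]
print(nested)
print(texts)
```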

How do we select the child element tbody after extracting the entire HTML?

I'm still a Python beginner trying to learn BeautifulSoup. I looked at solutions on Stack Overflow but was unsuccessful. Please help me understand this better.
I have extracted the HTML shown below:
<table cellspacing="0" id="ContentPlaceHolder1_dlDetails"
style="width:100%;border-collapse:collapse;">
<tbody><tr>
<td>
<table border="0" cellpadding="5" cellspacing="0" width="70%">
<tbody><tr>
<td> </td>
<td> </td>
</tr>
<tr>
<td bgcolor="#4F95FF" class="listhead" width="49%">Location:</td>
<td bgcolor="#4F95FF" class="listhead" width="51%">On Site </td>
</tr>
<tr>
<td class="listmaintext">ATM ID: </td>
<td class="listmaintext">DAGR00401111111</td>
</tr>
<tr>
<td class="listmaintext">ATM Centre:</td>
<td class="listmaintext"></td>
</tr>
<tr>
<td class="listmaintext">Site Location: </td>
<td class="listmaintext">ADA Building - Agra</td>
</tr>
I tried to use find_all('tbody') but was unsuccessful:
#table = bs.find("table", {"id": "ContentPlaceHolder1_dlDetails"})
html = browser.page_source
soup = bs(html, "lxml")
table = soup.find_all('table', {'id':'ContentPlaceHolder1_dlDetails'})
table_body = table.find('tbody')
rows = table.select('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele])
I'm trying to save the values of the cells with the "listmaintext" class.
Error message:
AttributeError: ResultSet object has no attribute 'find'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?

Another way to do this is with next_sibling:
from bs4 import BeautifulSoup as bs
html ='''
<html>
<table cellspacing="0" id="ContentPlaceHolder1_dlDetails"
style="width:100%;border-collapse:collapse;">
<tbody><tr>
<td>
<table border="0" cellpadding="5" cellspacing="0" width="70%">
<tbody><tr>
<td> </td>
<td> </td>
</tr>
<tr>
<td bgcolor="#4F95FF" class="listhead" width="49%">Location:</td>
<td bgcolor="#4F95FF" class="listhead" width="51%">On Site </td>
</tr>
<tr>
<td class="listmaintext">ATM ID: </td>
<td class="listmaintext">DAGR00401111111</td>
</tr>
<tr>
<td class="listmaintext">ATM Centre:</td>
<td class="listmaintext"></td>
</tr>
<tr>
<td class="listmaintext">Site Location: </td>
<td class="listmaintext">ADA Building - Agra</td>
</tr>
</html>'''
soup = bs(html, 'lxml')
data = [
    ' '.join((item.text, item.next_sibling.next_sibling.text))
    for item in soup.select('#ContentPlaceHolder1_dlDetails tr .listmaintext:first-child')
    if item.text != ''
]
print(data)

from bs4 import BeautifulSoup
data = '''<table cellspacing="0" id="ContentPlaceHolder1_dlDetails"
style="width:100%;border-collapse:collapse;">
<tbody><tr>
<td>
<table border="0" cellpadding="5" cellspacing="0" width="70%">
<tbody><tr>
<td> </td>
<td> </td>
</tr>
<tr>
<td bgcolor="#4F95FF" class="listhead" width="49%">Location:</td>
<td bgcolor="#4F95FF" class="listhead" width="51%">On Site </td>
</tr>
<tr>
<td class="listmaintext">ATM ID: </td>
<td class="listmaintext">DAGR00401111111</td>
</tr>
<tr>
<td class="listmaintext">ATM Centre:</td>
<td class="listmaintext"></td>
</tr>
<tr>
<td class="listmaintext">Site Location: </td>
<td class="listmaintext">ADA Building - Agra</td>
</tr>'''
soup = BeautifulSoup(data, 'lxml')
s = soup.select('.listmaintext')
for td1, td2 in zip(s[::2], s[1::2]):
    print('{} [{}]'.format(td1.text.strip(), td2.text.strip()))
Prints:
ATM ID: [DAGR00401111111]
ATM Centre: []
Site Location: [ADA Building - Agra]
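
The AttributeError in the question comes from calling find() on the ResultSet that find_all() returns; find() gives back a single Tag instead. The following is a minimal sketch, assuming a shortened copy of the question's nested table and html.parser. It reads each label/value pair row by row into a dictionary (the name details is illustrative).

```python
from bs4 import BeautifulSoup

html = """
<table id="ContentPlaceHolder1_dlDetails">
<tr><td>
<table>
<tr><td class="listmaintext">ATM ID: </td><td class="listmaintext">DAGR00401111111</td></tr>
<tr><td class="listmaintext">ATM Centre:</td><td class="listmaintext"></td></tr>
<tr><td class="listmaintext">Site Location: </td><td class="listmaintext">ADA Building - Agra</td></tr>
</table>
</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns a single Tag (or None); find_all() returns a ResultSet.
table = soup.find("table", {"id": "ContentPlaceHolder1_dlDetails"})

details = {}
for row in table.find_all("tr"):
    # Only look at the row's own cells, not cells of nested tables.
    cols = row.find_all("td", class_="listmaintext", recursive=False)
    if len(cols) == 2:
        key = cols[0].get_text(strip=True).rstrip(":")
        details[key] = cols[1].get_text(strip=True)
print(details)
```

Going through the rows directly means there is no need to select tbody at all, which also helps when the tbody element is absent from the markup.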

Searching for a color tag in html (Python 3)

I am trying to grab elements from a table if a cell has a certain color. The only issue is that I have not managed to read the color attribute yet.
jump = []
for tr in site.findAll('tr'):
    for td in site.findAll('td'):
        if td == 'td bgcolor':
            jump.append(td)
print(jump)
This returns an empty list.
How do I grab just the color from the HTML below?
I need to get the color from the td tag (it would also be useful to get the color from the tr tag).
<tr bgcolor="#f4f4f4">
<td height="25" nowrap="NOWRAP"> CME_ES </td>
<td height="25" nowrap="NOWRAP"> 07:58:46 </td>
<td height="25" nowrap="NOWRAP"> Connected </td>
<td height="25" nowrap="NOWRAP"> 0 </td>
<td height="25" nowrap="NOWRAP"> 0 </td>
<td height="25" nowrap="NOWRAP"> 0 </td>
<td height="25" nowrap="NOWRAP"> 0 </td>
<td height="25" nowrap="NOWRAP"> 07:58:00 </td>
<td height="25" nowrap="NOWRAP" bgcolor="#55aa2a"> --:--:-- </td>
<td height="25" nowrap="NOWRAP"> 0 </td>
<td height="25" nowrap="NOWRAP"> 0 </td>
<td height="25" nowrap="NOWRAP"> 01:25:00 </td>
<td height="25" nowrap="NOWRAP"> 22:00:00 </td>
</tr>

How about this:
jump = []
for tr in site.find_all('tr'):
    # search the cells of this row only, not the whole document
    for td in tr.find_all('td'):
        if 'bgcolor' in td.attrs:
            jump.append(td.attrs['bgcolor'])
print(jump)

You can use has_attr to check whether an element has a certain attribute:
if td.has_attr('bgcolor'):
    jump.append(td)
If I misread your question and you only want to find tds of a certain color, use find_all:
tr.find_all("td", {"bgcolor": "#55aa2a"}) # returns list of matches
PS: if someone has a better docs snippet for has_attr, please edit this answer.
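
As a compact variant of the above, passing bgcolor=True to find_all matches every tag that carries the attribute, and indexing the tag returns its value. This sketch assumes a shortened copy of the question's row wrapped in a <table>.

```python
from bs4 import BeautifulSoup

html = """
<table>
<tr bgcolor="#f4f4f4">
<td height="25"> CME_ES </td>
<td height="25" bgcolor="#55aa2a"> --:--:-- </td>
<td height="25"> 0 </td>
</tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# bgcolor=True matches any tag that has the attribute, whatever its value.
cell_colors = [td["bgcolor"] for td in soup.find_all("td", bgcolor=True)]

# .get() returns None instead of raising when a row has no bgcolor.
row_colors = [tr.get("bgcolor") for tr in soup.find_all("tr")]
print(cell_colors)
print(row_colors)
```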

Beautiful soup web page scraper

I am trying to scrape a webpage with the following URL:
https://www.bseindia.com/corporates/shpSecurities.aspx?scripcd=500209&qtrid=96.00
I want to scrape the table with the following HTML. I have tried a few things but could not get the table into a CSV file. The <tr> tags of the data rows are not closed, so splitting the data into separate rows is the problem.
Thanks for the help
--J
<table border='0' width='900' align='center' cellspacing='1' cellpadding='4'>
<tr>
<td class='innertable_header1' rowspan='3'>Category of shareholder</td>
<td class='innertable_header1' rowspan='3'>Nos. of shareholders</td>
<td class='innertable_header1' rowspan='3'>No. of fully paid up equity shares held</td>
<td class='innertable_header1' rowspan='3'>No. of shares underlying Depository Receipts</td>
<td class='innertable_header1' rowspan='3'>Total nos. shares held</td>
<td class='innertable_header1' rowspan='3'>Shareholding as a % of total no. of shares (calculated as per SCRR, 1957)As a % of (A+B+C2)</td>
<td class='innertable_header1' rowspan='3'> Number of equity shares held in dematerialized form</td>
</tr>
<tr></tr>
<tr></tr>
<tr>
<td class='TTRow_left'>(A) Promoter & Promoter Group</td>
<td class='TTRow_right'>19</td>
<td class='TTRow_right'>28,17,02,889</td>
<td class='TTRow_right'></td>
<td class='TTRow_right'>28,17,02,889</td>
<td class='TTRow_right'>12.90</td>
<td class='TTRow_right'>28,17,02,889</td>
<tr>
<td class='TTRow_left'>(B) Public</td>
<td class='TTRow_right'>9,16,058</td>
<td class='TTRow_right'>1,87,81,45,362</td>
<td class='TTRow_right'>1,32,95,642</td>
<td class='TTRow_right'>1,89,14,41,004</td>
<td class='TTRow_right'>86.61</td>
<td class='TTRow_right'>1,88,74,40,959</td>
<tr>
<td class='TTRow_left'>(C1) Shares underlying DRs</td>
<td class='TTRow_right'></td>
<td class='TTRow_right'></td>
<td class='TTRow_right'></td>
<td class='TTRow_right'></td>
<td class='TTRow_right'>0.00</td>
<td class='TTRow_right'></td>
<tr>
<td class='TTRow_left'>(C2) Shares held by Employee Trust</td>
<td class='TTRow_right'>1</td>
<td class='TTRow_right'>1,08,05,896</td>
<td class='TTRow_right'></td>
<td class='TTRow_right'>1,08,05,896</td>
<td class='TTRow_right'>0.49</td>
<td class='TTRow_right'>1,08,05,896</td>
<tr>
<td class='TTRow_left'>(C) Non Promoter-Non Public</td>
<td class='TTRow_right'>1</td>
<td class='TTRow_right'>1,08,05,896</td>
<td class='TTRow_right'></td>
<td class='TTRow_right'>1,08,05,896</td>
<td class='TTRow_right'>0.49</td>
<td class='TTRow_right'>1,08,05,896</td>
<tr>
<td class='TTRow_left'>Grand Total</td>
<td class='TTRow_right'>9,16,078</td>
<td class='TTRow_right'>2,17,06,54,147</td>
<td class='TTRow_right'>1,32,95,642</td>
<td class='TTRow_right'>2,18,39,49,789</td>
<td class='TTRow_right'>100.00</td>
<td class='TTRow_right'>2,17,99,49,744</td>
</tr>
</table>

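You can try this (written for Python 3):
from urllib.request import urlopen
import re
from bs4 import BeautifulSoup as soup
url = 'https://www.bseindia.com/corporates/shpSecurities.aspx?scripcd=500209&qtrid=96.00'
s = soup(urlopen(url).read(), 'lxml')
# collapse line breaks and runs of whitespace in each cell
texts = [re.sub(r'[\n\r]+|\s{2,}', '', i.text) for i in s.find_all('td', {'class': re.compile('TTRow_right|TTRow_left')})]
# drop empty cells
results = [t for t in texts if t]
Output (from the original Python 2 run; Python 3 prints the same strings without the u prefix):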
[u'(A) Promoter & Promoter Group', u'19', u'28,17,02,889', u'28,17,02,889', u'12.90', u'28,17,02,889', u'(B) Public', u'9,16,058', u'1,87,81,45,362', u'1,32,95,642', u'1,89,14,41,004', u'86.61', u'1,88,74,40,959', u'(C1) Shares underlying DRs', u'0.00', u'(C2) Shares held by Employee Trust', u'1', u'1,08,05,896', u'1,08,05,896', u'0.49', u'1,08,05,896', u'(C) Non Promoter-Non Public', u'1', u'1,08,05,896', u'1,08,05,896', u'0.49', u'1,08,05,896', u'Grand Total', u'9,16,078', u'2,17,06,54,147', u'1,32,95,642', u'2,18,39,49,789', u'100.00', u'2,17,99,49,744']
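
The flat list above loses the row boundaries, and dropping empty cells makes them hard to recover. One possible way around the unclosed <tr> tags, sketched here under assumptions, is to keep the empty cells and slice the flat list by the known column count before writing CSV. The sample uses 3 columns and two shortened rows to stay brief (the real table has 7 columns), and N_COLS, rows and buffer are illustrative names.

```python
import csv
import io
import re
from bs4 import BeautifulSoup

# Shortened sample; the first data row is deliberately left without </tr>.
html = """
<table>
<tr>
<td class='TTRow_left'>(B) Public</td>
<td class='TTRow_right'>9,16,058</td>
<td class='TTRow_right'></td>
<tr>
<td class='TTRow_left'>Grand Total</td>
<td class='TTRow_right'>9,16,078</td>
<td class='TTRow_right'>100.00</td>
</tr>
</table>
"""
N_COLS = 3  # the real table has 7 columns; 3 here to keep the sample short

soup = BeautifulSoup(html, "html.parser")
cells = [td.get_text(strip=True)
         for td in soup.find_all("td", class_=re.compile(r"^TTRow_"))]

# Keep empty cells so every row has the same length, then slice into rows.
rows = [cells[i:i + N_COLS] for i in range(0, len(cells), N_COLS)]

buffer = io.StringIO()  # stand-in for an open CSV file
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```

Because find_all walks the cells in document order, the slicing does not depend on how the parser nests the unclosed rows.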

BeautifulSoup not parsing every tag of the html

I'm having a problem with BeautifulSoup not completely parsing the html received. I tried with both lxml and html5lib parsers and I had the same problem.
html = '<td style="vertical-align: top">1</td> <td style="vertical-align: top"><span class="ui-icon country flg-fr"></span>\t</td><td class="pn"><a class="player-link" href="/Players/25604">Hugo Lloris <span class="incident-wrapper"></span> </a><span class="player-meta-data">29</span><span class="player-meta-data">, GK </span></td> <td class="ShotsTotal ">0\t</td><td class="ShotOnTarget ">0\t</td><td class="KeyPassTotal ">0\t</td><td class="PassSuccessInMatch ">88\t</td><td class="DuelAerialWon ">0\t</td><td class="Touches ">35\t</td><td class="rating ">6.24</td> <td style="text-align: left"><span class="incident-wrapper"></span></td> '
ipdb> BeautifulSoup(html, 'html5lib')
<html><head></head><body>1 <span class="ui-icon country flg-fr"></span> <a class="player-link" href="/Players/25604">Hugo Lloris <span class="incident-wrapper"></span> </a><span class="player-meta-data">29</span><span class="player-meta-data">, GK </span> 0 0 0 88 0 35 6.24 <span class="incident-wrapper"></span> </body></html>

It works for me. I ran the following code (using beautifulsoup4==4.4.1):
from bs4 import BeautifulSoup
html = """
<td style="vertical-align: top">1</td>
<td style="vertical-align: top"><span class="ui-icon country flg-fr"></span>\t</td>
<td class="pn"><a class="player-link" href="/Players/25604">Hugo Lloris <span class="incident-wrapper"></span> </a><span
class="player-meta-data">29</span><span class="player-meta-data">, GK </span></td>
<td class="ShotsTotal ">0\t</td>
<td class="ShotOnTarget ">0\t</td>
<td class="KeyPassTotal ">0\t</td>
<td class="PassSuccessInMatch ">88\t</td>
<td class="DuelAerialWon ">0\t</td>
<td class="Touches ">35\t</td>
<td class="rating ">6.24</td>
<td style="text-align: left"><span class="incident-wrapper"></span></td>
"""
parsed_html = BeautifulSoup(html, 'html5lib')
print(html)
And I've got the following html printed:
<td style="vertical-align: top">1</td>
<td style="vertical-align: top"><span class="ui-icon country flg-fr"></span> </td>
<td class="pn"><a class="player-link" href="/Players/25604">Hugo Lloris <span class="incident-wrapper"></span> </a><span
class="player-meta-data">29</span><span class="player-meta-data">, GK </span></td>
<td class="ShotsTotal ">0 </td>
<td class="ShotOnTarget ">0 </td>
<td class="KeyPassTotal ">0 </td>
<td class="PassSuccessInMatch ">88 </td>
<td class="DuelAerialWon ">0 </td>
<td class="Touches ">35 </td>
<td class="rating ">6.24</td>
<td style="text-align: left"><span class="incident-wrapper"></span></td>
I don't see anything missing.
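
Note that the snippet above prints the input string html, not parsed_html, so it shows the unparsed markup. The asker's result is consistent with HTML5 parsing rules, under which <td> tags that appear outside a table are discarded while their contents are kept. The sketch below is a hedged illustration using only html.parser, which is lenient and keeps bare cells. It also shows the fragment wrapped in <table><tr>, which should give a stricter parser such as html5lib the context it expects (swap the parser name to try that; it is not exercised here).

```python
from bs4 import BeautifulSoup

fragment = '<td class="ShotsTotal ">0\t</td><td class="rating ">6.24</td>'

# html.parser does not apply HTML5 tree-construction rules, so bare cells survive.
loose = BeautifulSoup(fragment, "html.parser")
loose_cells = [td.get_text(strip=True) for td in loose.find_all("td")]

# Wrapping the fragment supplies the table context that stricter parsers expect.
wrapped = BeautifulSoup(f"<table><tr>{fragment}</tr></table>", "html.parser")
wrapped_cells = [td.get_text(strip=True) for td in wrapped.find_all("td")]

print(loose_cells)
print(wrapped_cells)
```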
