How to extract specific <td> from table

I'm working on a web scraping program using Python and BeautifulSoup, and I ran into a problem when scraping a table.
I need to extract only selected <td> tags, not the entire table.
I only need the numbers for 52-Week High, 52-Week Low, Earnings Per Share and Price to Book Value.
Is there any way I can do that?
Sample table:
<table id="TABLE_1">
<tbody id="TBODY_2">
<tr id="TR_3">
<td id="TD_4">
<strong id="STRONG_5">52-Week High:</strong>
</td>
<td id="TD_6">
1,116.00
</td>
<td id="TD_7">
<strong id="STRONG_8">Earnings Per Share TTM (EPS):</strong>
</td>
<td id="TD_9">
47.87 (15.57%)
</td>
<td id="TD_10">
<strong id="STRONG_11">Price to Book Value (P/BV):</strong>
</td>
<td id="TD_12">
2.5481125565
</td>
</tr>
<tr id="TR_13">
<td id="TD_14">
<strong id="STRONG_15">52-Week Low:</strong>
</td>
<td id="TD_16">
867.50
</td>
<td id="TD_17">
<strong id="STRONG_18">Price-Earnings Ratio TTM (P/E):</strong>
</td>
<td id="TD_19">
20.8272404429
</td>
<td id="TD_20">
<strong id="STRONG_21">Return on Equity (ROE):</strong>
</td>
<td id="TD_22">
12.42%
</td>
</tr>
<tr id="TR_23">
<td id="TD_24">
<strong id="STRONG_25">Fair Value:</strong>
</td>
<td id="TD_26">
-
</td>
<td id="TD_27">
<strong id="STRONG_28">Dividends Per Share (DPS):</strong>
</td>
<td id="TD_29">
-
</td>
<td id="TD_30">
<strong id="STRONG_31">Recommendation:</strong>
</td>
<td id="TD_32">
None<span id="SPAN_33"></span>
</td>
</tr>
<tr id="TR_34">
<td id="TD_35">
<strong id="STRONG_36">Last Price:</strong>
</td>
<td id="TD_37">
<span id="SPAN_38"></span> <span id="SPAN_39">984.5</span>
</td>
</tr>
</tbody>
</table>
I have also included my code for reference.
Any help would be very much appreciated! Thank you!
from bs4 import BeautifulSoup as soup
from urllib.request import Request, urlopen
import pandas as pd
myurl = "https://www.investagrams.com/Stock/ac"
hdr = {'User-Agent': 'Mozilla/5.0'}
req = Request(myurl,headers=hdr)
# Open connection to website
uClient = urlopen(req)
# Offloads the content to variable
page_html = uClient.read()
#just closing it
uClient.close()
# html parser
page_soup = soup(page_html, "html.parser")
table = page_soup.find("div", {"id":"FundamentalAnalysisPanel"}).find("table")
print(table.text)

You can do it with the find_next_sibling method (spelled findNextSibling in older BeautifulSoup code).
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.investagrams.com/Stock/ac')
# name the parser explicitly to avoid the "no parser specified" warning
soup = BeautifulSoup(r.text, 'html.parser')
# specify the table parameters whose values you want to find
parameters = ['52-Week High:', '52-Week Low:', 'Earnings Per Share TTM (EPS):', 'Price-Earnings Ratio TTM (P/E):', 'Price to Book Value (P/BV):']
# iterate over all <td> tags and print the text of the next sibling (the value)
# if this <td> contains one of the specified parameters.
for td in soup.find_all('td'):
    for p in parameters:
        if td.find('strong', string=p) is not None:
            print(td.find_next_sibling().text.strip())
Result:
1,116.00
47.87 (15.57%)
2.5481125565
867.50
20.8272404429
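If you want the values keyed by their labels rather than just printed, the same sibling lookup can fill a dictionary. This is only a sketch run against a trimmed copy of the sample table from the question; the names wanted and values are made up for illustration, and the live page may be structured differently.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample table from the question (ids omitted).
html = """
<table id="TABLE_1"><tbody>
<tr>
<td><strong>52-Week High:</strong></td>
<td>
1,116.00
</td>
<td><strong>Earnings Per Share TTM (EPS):</strong></td>
<td>
47.87 (15.57%)
</td>
</tr>
<tr>
<td><strong>52-Week Low:</strong></td>
<td>
867.50
</td>
<td><strong>Return on Equity (ROE):</strong></td>
<td>
12.42%
</td>
</tr>
</tbody></table>
"""

# Labels we care about (illustrative selection).
wanted = {"52-Week High:", "52-Week Low:", "Earnings Per Share TTM (EPS):"}

soup = BeautifulSoup(html, "html.parser")
values = {}
for strong in soup.find_all("strong"):
    label = strong.get_text(strip=True)
    if label in wanted:
        # The value sits in the <td> right after the label's own <td>.
        value_td = strong.find_parent("td").find_next_sibling("td")
        values[label] = value_td.get_text(strip=True)

print(values)
```

Looking the value up from the label, instead of from a fixed cell position, keeps the code working if the site reorders the rows.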

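This might be what you want. pandas can parse a <table> element directly:
from io import StringIO
# page_html is the raw page read in the question's code
page_soup = soup(page_html, "html.parser")
tables = page_soup.find_all('table')
# read_html returns a list of DataFrames, so take the first one
df = pd.read_html(StringIO(str(tables[i])))[0]
where i is the index of the table you want.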

Related

How to store a set of multiple p tag texts to a single variable in space delimit with BeautifulSoup in Python

How can I store the texts from multiple HTML p tags in a single space-delimited variable with BeautifulSoup, as in the following example? I'm brand new to Python. Thank you!
from bs4 import BeautifulSoup
HTML = '''
<tr>
<td class="up">
<p class="pie_chart_val">+1.30%</p>
</td>
<td class="down">
<p class="pie_chart_val">-1.33%</p>
</td>
<td class="up">
<p class="pie_chart_val">+1.58%</p>
</td>
<td class="up">
<p class="pie_chart_val">+1.61%</p>
</td>
</tr>
'''
soup = BeautifulSoup(HTML, 'lxml')
values = soup.find_all('p', class_="pie_chart_val")
for value in values:
    value = value.text
    print(value)

In the print call itself you can pass end="," as a parameter to keep the output on one line:
from bs4 import BeautifulSoup
html= """<tr>
<td class="up">
<p class="pie_chart_val">+1.30%</p>
</td>
<td class="down">
<p class="pie_chart_val">-1.33%</p>
</td>
<td class="up">
<p class="pie_chart_val">+1.58%</p>
</td>
<td class="up">
<p class="pie_chart_val">+1.61%</p>
</td>
</tr>
"""
soup = BeautifulSoup(html, 'lxml')
values = soup.find_all('p', class_="pie_chart_val")
for value in values:
    print(value.text, end=",")
Output:
+1.30%,-1.33%,+1.58%,+1.61%,
Or you can collect the texts in a list and print them on one line:
lst=[i.get_text(strip=True) for i in values]
print(*lst,sep=",")
Output:
+1.30%,-1.33%,+1.58%,+1.61%
To get them in a single variable:
x=",".join(lst)
print(x)
Output:
+1.30%,-1.33%,+1.58%,+1.61%

You can do it like this, using string concatenation.
from bs4 import BeautifulSoup
HTML = '''
<tr>
<td class="up">
<p class="pie_chart_val">+1.30%</p>
</td>
<td class="down">
<p class="pie_chart_val">-1.33%</p>
</td>
<td class="up">
<p class="pie_chart_val">+1.58%</p>
</td>
<td class="up">
<p class="pie_chart_val">+1.61%</p>
</td>
</tr>
'''
soup = BeautifulSoup(HTML, 'lxml')
values = soup.find_all('p', class_="pie_chart_val")
ans = ''
for value in values:
    ans += value.text.strip() + ' '
print(ans)
ans is a string that holds the space-separated texts of the <p> tags:
+1.30% -1.33% +1.58% +1.61%
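Concatenating in a loop leaves a trailing space at the end of the string. A small sketch of the same idea with str.join, which only puts the separator between items; the two-cell HTML is a shortened stand-in for the question's markup, and the names texts and joined are illustrative.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the markup in the question.
html = """<table><tr>
<td class="up"><p class="pie_chart_val">+1.30%</p></td>
<td class="down"><p class="pie_chart_val">-1.33%</p></td>
</tr></table>"""

soup = BeautifulSoup(html, "html.parser")

# Collect the stripped text of every matching <p>, then join with single spaces.
texts = [p.get_text(strip=True) for p in soup.find_all("p", class_="pie_chart_val")]
joined = " ".join(texts)

print(joined)
```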

how do we select the child element tbody after extracting the entire html?

I'm still a Python noob trying to learn BeautifulSoup. I looked at solutions on Stack Overflow but was unsuccessful. Please help me understand this better.
I have extracted the HTML shown below:
<table cellspacing="0" id="ContentPlaceHolder1_dlDetails"
style="width:100%;border-collapse:collapse;">
<tbody><tr>
<td>
<table border="0" cellpadding="5" cellspacing="0" width="70%">
<tbody><tr>
<td> </td>
<td> </td>
</tr>
<tr>
<td bgcolor="#4F95FF" class="listhead" width="49%">Location:</td>
<td bgcolor="#4F95FF" class="listhead" width="51%">On Site </td>
</tr>
<tr>
<td class="listmaintext">ATM ID: </td>
<td class="listmaintext">DAGR00401111111</td>
</tr>
<tr>
<td class="listmaintext">ATM Centre:</td>
<td class="listmaintext"></td>
</tr>
<tr>
<td class="listmaintext">Site Location: </td>
<td class="listmaintext">ADA Building - Agra</td>
</tr>
I tried to parse it with find_all('tbody') but was unsuccessful:
#table = bs.find("table", {"id": "ContentPlaceHolder1_dlDetails"})
html = browser.page_source
soup = bs(html, "lxml")
table = soup.find_all('table', {'id':'ContentPlaceHolder1_dlDetails'})
table_body = table.find('tbody')
rows = table.select('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele])
I'm trying to save the values in the "listmaintext" class.
Error message:
AttributeError: ResultSet object has no attribute 'find'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
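The error message points at the cause: find_all() returns a list-like ResultSet, so calling .find() on its result fails, while find() returns a single tag. A minimal sketch of the fix, assuming a cut-down copy of the HTML above (only two data rows are kept) and an explicitly initialised data list:

```python
from bs4 import BeautifulSoup

# Cut-down copy of the nested tables from the question.
html = """
<table id="ContentPlaceHolder1_dlDetails"><tbody><tr><td>
<table><tbody>
<tr><td class="listmaintext">ATM ID: </td><td class="listmaintext">DAGR00401111111</td></tr>
<tr><td class="listmaintext">ATM Centre:</td><td class="listmaintext"></td></tr>
</tbody></table>
</td></tr></tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns one Tag (or None); find_all() would return a ResultSet.
table = soup.find("table", {"id": "ContentPlaceHolder1_dlDetails"})

data = []
for row in table.find_all("tr"):
    # recursive=False looks only at this row's own cells, so the outer row
    # that wraps the nested table does not repeat the inner values.
    cells = row.find_all("td", class_="listmaintext", recursive=False)
    cols = [td.get_text(strip=True) for td in cells]
    if cols:
        data.append(cols)

print(data)
```

Note that the tables are nested, so a plain recursive search from the outer row would pick up every inner cell a second time.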

Another way to do this is with next_sibling:
from bs4 import BeautifulSoup as bs
html ='''
<html>
<table cellspacing="0" id="ContentPlaceHolder1_dlDetails"
style="width:100%;border-collapse:collapse;">
<tbody><tr>
<td>
<table border="0" cellpadding="5" cellspacing="0" width="70%">
<tbody><tr>
<td> </td>
<td> </td>
</tr>
<tr>
<td bgcolor="#4F95FF" class="listhead" width="49%">Location:</td>
<td bgcolor="#4F95FF" class="listhead" width="51%">On Site </td>
</tr>
<tr>
<td class="listmaintext">ATM ID: </td>
<td class="listmaintext">DAGR00401111111</td>
</tr>
<tr>
<td class="listmaintext">ATM Centre:</td>
<td class="listmaintext"></td>
</tr>
<tr>
<td class="listmaintext">Site Location: </td>
<td class="listmaintext">ADA Building - Agra</td>
</tr>
</html>'''
soup = bs(html, 'lxml')
data = [' '.join((item.text, item.next_sibling.next_sibling.text)) for item in soup.select('#ContentPlaceHolder1_dlDetails tr .listmaintext:first-child') if item.text !='']
print(data)

from bs4 import BeautifulSoup
data = '''<table cellspacing="0" id="ContentPlaceHolder1_dlDetails"
style="width:100%;border-collapse:collapse;">
<tbody><tr>
<td>
<table border="0" cellpadding="5" cellspacing="0" width="70%">
<tbody><tr>
<td> </td>
<td> </td>
</tr>
<tr>
<td bgcolor="#4F95FF" class="listhead" width="49%">Location:</td>
<td bgcolor="#4F95FF" class="listhead" width="51%">On Site </td>
</tr>
<tr>
<td class="listmaintext">ATM ID: </td>
<td class="listmaintext">DAGR00401111111</td>
</tr>
<tr>
<td class="listmaintext">ATM Centre:</td>
<td class="listmaintext"></td>
</tr>
<tr>
<td class="listmaintext">Site Location: </td>
<td class="listmaintext">ADA Building - Agra</td>
</tr>'''
soup = BeautifulSoup(data, 'lxml')
s = soup.select('.listmaintext')
for td1, td2 in zip(s[::2], s[1::2]):
    print('{} [{}]'.format(td1.text.strip(), td2.text.strip()))
Prints:
ATM ID: [DAGR00401111111]
ATM Centre: []
Site Location: [ADA Building - Agra]

python beautifulsoup parsing recursing

I'm a Python/BeautifulSoup beginner, and I'm trying to extract all the content in <td width="473" valign="top"> -> <strong>.
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="pl" lang="pl">
<head>
<title>MIEJSKI OŚRODEK KULTURY W ŻORACH Repertuar Kina Na Starówce</title>
</head>
<body>
<div class="page_content">
<p> </p>
<p>
<table style="width: 450px;" border="1" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td width="57" valign="top">
<p align="center"><strong>Data</strong></p>
</td>
<td width="473" valign="top">
<p align="center"><strong>Tytuł Filmu</strong></p>
</td>
<td width="95" valign="top">
<p align="center"><strong>Godzina</strong></p>
</td>
</tr>
<tr>
<td width="57" valign="top">
<p align="center"><strong> </strong></p>
</td>
<td width="473" valign="top">
<p align="center"><strong>1 - 5.05</strong></p>
</td>
<td width="95" valign="top">
<p align="center"> </p>
</td>
</tr>
<tr>
<td width="57" valign="top">
<p align="center"><strong>1</strong></p>
</td>
<td width="473" valign="top">
<p align="center"><strong>KINO POWTÓREK: ZWIERZOGRÓD </strong>USA/b.o cena 10 zł</p>
</td>
<td width="95" valign="top">
<p align="center">16:30</p>
</td>
</tr>
</tbody>
</table>
</p>
</body>
</html>
The furthest I can go is to get a list of all the tags with this code:
from bs4 import BeautifulSoup
soup = BeautifulSoup(open("zory1.html"), "html.parser")
y = soup.find_all(width="473")
newy = str(y)
newsoup = BeautifulSoup(newy ,"html.parser")
stronglist = newsoup.find_all('strong')
lasty = str(stronglist)
lastsoup = BeautifulSoup(lasty , "html.parser")
lst = soup.find_all('strong')
for item in lst:
    print(item)
How can I extract the content inside the tag, at a beginner's level?
Thanks

Use get_text() to get a node's text.
Complete working example where we go over all the rows and all the cells inside the table:
from bs4 import BeautifulSoup
data = """your HTML here"""
soup = BeautifulSoup(data, "html.parser")
for row in soup.find_all("tr"):
    print([cell.get_text(strip=True) for cell in row.find_all("td")])
Prints:
['Data', 'Tytuł Filmu', 'Godzina']
['', '1 - 5.05', '']
['1', 'KINO POWTÓREK: ZWIERZOGRÓDUSA/b.o\xa0 cena 10 zł', '16:30']
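In the last row the title and the text after it run together ("ZWIERZOGRÓDUSA") because strip=True trims each piece before they are joined with an empty separator. As a sketch, passing a separator as the first argument of get_text() keeps the pieces apart; the single-cell HTML below is a shortened stand-in for the question's table, and the names merged and spaced are illustrative.

```python
from bs4 import BeautifulSoup

# One cell from the question's table, shortened.
html = (
    '<table><tr><td><p align="center">'
    "<strong>KINO POWTÓREK: ZWIERZOGRÓD </strong>USA/b.o cena 10 zł"
    "</p></td></tr></table>"
)

soup = BeautifulSoup(html, "html.parser")
cell = soup.find("td")

# Default separator is "", so the stripped pieces are glued together.
merged = cell.get_text(strip=True)

# A one-space separator keeps the <strong> text apart from what follows it.
spaced = cell.get_text(" ", strip=True)

print(merged)
print(spaced)
```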

Here you are
from bs4 import BeautifulSoup
navigator = BeautifulSoup(open("zory1.html"), "html.parser")
tds = navigator.find_all("td", {"width":"473"})
resultList = [item.strong.get_text() for item in tds]
for item in resultList:
    print(item)
Result
$ python test.py
Tytuł Filmu
1 - 5.05
KINO POWTÓREK: ZWIERZOGRÓD

Trying to grab certain parts of the NFL stats table using BeautifulSoup

I am trying to grab each stat in the table. I have narrowed it down to the row for a single team, and just have to grab the actual numbers! The code I have is:
import requests
from bs4 import BeautifulSoup
url = 'http://espn.go.com/nfl/statistics/team/_/stat/defense/position/defense'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, "html.parser")
HoustonDefense = soup.find('tr', class_="oddrow team-28-34")
print (HoustonDefense.prettify())
This prints the HoustonDefense row as:
<tr align="right" class="oddrow team-28-34">
<td align="left">
1
</td>
<td align="left">
<a href="http://espn.go.com/nfl/team/_/name/hou/houston-texans">
Houston
</a>
</td>
<td>
539
</td>
<td>
272
</td>
<td class="sortcell">
811
</td>
<td>
22.0
</td>
<td>
136
</td>
<td>
65
</td>
<td>
9
</td>
<td>
102
</td>
<td>
38
</td>
<td>
1
</td>
<td>
17
</td>
<td>
5
</td>
<td>
2
</td>
</tr>
I want to grab the numbers inside each <td></td> and assign them to variables. Any help would be amazing! Thanks!

Use find_all() to find all td elements inside the tr and get the text of every td found except the first two cells (the ranking and the team name itself):
[td.text for td in HoustonDefense.find_all("td")[2:]]
Prints:
[u'539', u'272', u'811', u'22.0', u'136', u'65', u'9', u'102', u'38', u'1', u'17', u'5', u'2']
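The cell texts come back as strings. If you need real numbers in variables, one possible approach is to convert them after extraction. This sketch uses a shortened copy of the row printed in the question; the variable names team and stats are made up here, and it assumes every stat cell holds a plain number.

```python
from bs4 import BeautifulSoup

# Shortened copy of the row from the question (first four stat cells only).
html = """
<table><tr align="right" class="oddrow team-28-34">
<td align="left">1</td>
<td align="left"><a href="http://espn.go.com/nfl/team/_/name/hou/houston-texans">Houston</a></td>
<td>539</td><td>272</td><td class="sortcell">811</td><td>22.0</td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

# Matching on a single class token is enough to find the row.
row = soup.find("tr", class_="team-28-34")
cells = [td.get_text(strip=True) for td in row.find_all("td")]

team = cells[1]                              # second cell holds the team name
stats = [float(text) for text in cells[2:]]  # remaining cells are numeric

print(team, stats)
```

float() raises ValueError on cells that are not plain numbers (for example values with thousands separators), so real pages may need extra cleaning first.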

Get href Attribute Link from td tag BeautifulSoup Python

I am new to Python and someone suggested I use Beautiful Soup for scraping. I am stuck on a problem: I want to fetch the href attribute from the td tag in column 2, based on the year in column 4.
<table class="tableFile2" summary="Results">
<tr>
<th width="7%" scope="col">Filings</th>
<th width="10%" scope="col">Format</th>
<th scope="col">Description</th>
<th width="10%" scope="col">Filing Date</th>
<th width="15%" scope="col">File/Film Number</th>
</tr>
<tr>
<td nowrap="nowrap">8-K</td>
<td nowrap="nowrap"> Documents</td>
<td class="small" >Current report, items 8.01 and 9.01
<br />Acc-no: 0001193125</td>
<td>2013-05-03</td>
<td nowrap="nowrap">000-10030<br>13813281 </td>
</tr>
<tr class="blueRow">
<td nowrap="nowrap">424B2</td>
<td nowrap="nowrap"> Documents</td>
<td class="small" >Prospectus [Rule 424(b)(2)]<br />Acc-no: 0001193125</td>
<td>2013-05-01</td>
<td nowrap="nowrap">333-188191<br>13802405 </td>
</tr>
<tr>
<td nowrap="nowrap">FWP</td>
<td nowrap="nowrap"> Documents</td>
<td class="small" >Filing under Securities Act Rules 163/433 of free writing prospectuses<br />Acc-no: 0001193125-13-189053 (34 Act) Size: 52 KB </td>
<td>2013-05-01</td>
<td nowrap="nowrap">333-188191<br>13800170 </td>
</tr>
</table>
table = soup.find('table', class="tableFile2")
rows = table.findAll('tr')
for tr in rows:
    cols = tr.findAll('td')
    if "2013" in cols[3]
        link = cols[1].find('a').get('href')
        print

This works for me in Python 2.7:
table = soup.find('table', {'class': 'tableFile2'})
rows = table.findAll('tr')
for tr in rows:
    cols = tr.findAll('td')
    if len(cols) >= 4 and "2013" in cols[3].text:
        link = cols[1].find('a').get('href')
        print(link)
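A few issues with your previous code:
class is a reserved word in Python, so soup.find('table', class="tableFile2") is a syntax error. Pass a dictionary of attributes instead (e.g., {'class': 'tableFile2'}) or use the class_ keyword.
Not every row has at least 4 <td> cells (the header row contains only <th> cells), so you need to check the length before indexing cols[3].
cols[3] is a tag, not a string, so test for the year in cols[3].text.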
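For reference, here is a self-contained sketch of the same loop that also runs on Python 3. The sample table in the question has lost its <a> tags, so the rows and href values below are hypothetical placeholders, and the extra None check covers rows where column 2 has no link.

```python
from bs4 import BeautifulSoup

# Hypothetical rows: the hrefs and the 2012 row are invented for illustration.
html = """
<table class="tableFile2" summary="Results">
<tr><th>Filings</th><th>Format</th><th>Description</th><th>Filing Date</th></tr>
<tr><td>8-K</td><td><a href="/example/doc-1">Documents</a></td><td>Current report</td><td>2013-05-03</td></tr>
<tr><td>10-K</td><td><a href="/example/doc-2">Documents</a></td><td>Annual report</td><td>2012-02-10</td></tr>
<tr><td>FWP</td><td>Documents</td><td>No link in this row</td><td>2013-05-01</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# class_ (with the underscore) avoids the reserved word "class".
table = soup.find("table", class_="tableFile2")

links = []
for tr in table.find_all("tr"):
    cols = tr.find_all("td")
    # Skip the header row (no <td> cells) and rows from other years.
    if len(cols) < 4 or "2013" not in cols[3].get_text():
        continue
    anchor = cols[1].find("a")
    if anchor is not None:  # column 2 may not contain a link
        links.append(anchor.get("href"))

print(links)
```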
