Need help parsing through this HTML using BeautifulSoup and Python - python

I have the following HTML I would like to parse using BeautifulSoup:
<tr class="TrGameOdd">
<td align="center">
<a href="Schedule.aspx?WT=0&lg=778&id=,1583114">
<img border="0" src="/core/engine/App_Themes/Global/images/plus.gif">
</a>
</td>
<td align="left">Oct 20</td>
<td>777</td>
<td align="left" colspan="2">Cupcakes</td>
<td align="right">7+3
<input type="checkbox" value="0_1583114_-3440" name="text_">
</td>
<td align="right">a199
<input type="checkbox" value="2_1583114_-199.5_-110" name="text_">
</td>
</tr>
There are a whole bunch of lines like this, but I only need specifics out of it. For example, I want to parse 777, Cupcakes, 7+3, -3440, a199 out of all of this. How would I go about doing that? I'd like it to print side by side and I would have a few of these lines I want to parse, so when it prints it should be like this:
777 Cupcakes 7+3 -3440
X X X X
X X X X
etc

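from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(html)
# findAll returns every matching row; find would return only the first one
trs = soup.findAll("tr", {"class": "TrGameOdd"})
for tr in trs:
    tds = tr.findAll("td")
    print tds[1].string # Oct 20
    print tds[2].string # 777
    print tds[3].string # Cupcakes
    # ...
Continue in the same way for the remaining cells. The documentation covers the rest: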
http://www.crummy.com/software/BeautifulSoup/bs4/doc/
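A fuller sketch of the same idea, written for Python 3 and bs4 rather than the BeautifulSoup 3 import above. It assumes the fourth wanted value (-3440) is always the last underscore-separated field of the checkbox's value attribute, which is a guess based on the single sample row. The helper name parse_rows is hypothetical.

```python
from bs4 import BeautifulSoup

html = """
<table>
<tr class="TrGameOdd">
<td align="center">
<a href="Schedule.aspx?WT=0&lg=778&id=,1583114">
<img border="0" src="/core/engine/App_Themes/Global/images/plus.gif">
</a>
</td>
<td align="left">Oct 20</td>
<td>777</td>
<td align="left" colspan="2">Cupcakes</td>
<td align="right">7+3
<input type="checkbox" value="0_1583114_-3440" name="text_">
</td>
<td align="right">a199
<input type="checkbox" value="2_1583114_-199.5_-110" name="text_">
</td>
</tr>
</table>
"""


def parse_rows(html):
    """Return one (number, name, spread, odds) tuple per TrGameOdd row."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr", class_="TrGameOdd"):
        tds = tr.find_all("td")
        number = tds[2].get_text(strip=True)
        name = tds[3].get_text(strip=True)
        spread = tds[4].get_text(strip=True)
        # Assumption: the odds are the last "_"-separated field of the value.
        odds = tds[4].find("input")["value"].split("_")[-1]
        rows.append((number, name, spread, odds))
    return rows


rows = parse_rows(html)
for row in rows:
    # Print the fields of each row side by side.
    print(" ".join(row))
```

Each row is printed on its own line, so several matching rows come out as one line per row.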

Related

How to store a set of multiple p tag texts to a single variable in space delimit with BeautifulSoup in Python

How can I store texts from multiple HTML p tags in a single variable with space delimit with BeautifulSoup in the following example? I'm brand new to Python. Thank you!
from bs4 import BeautifulSoup
HTML = '''
</tr>
<td class="up">
<p class="pie_chart_val">+1.30%</p>
</td>
<td class="down">
<p class="pie_chart_val">-1.33%</p>
</td>
<td class="up">
<p class="pie_chart_val">+1.58%</p>
</td>
<td class="up">
<p class="pie_chart_val">+1.61%</p>
</td>
</tr>
'''
soup = BeautifulSoup(HTML, 'lxml')
values = soup.find_all('p', class_="pie_chart_val")
for value in values:
    value = value.text
    print(value)
You can pass end="," to print() so that all the values are printed on one line:
from bs4 import BeautifulSoup
html= """<tr>
<td class="up">
<p class="pie_chart_val">+1.30%</p>
</td>
<td class="down">
<p class="pie_chart_val">-1.33%</p>
</td>
<td class="up">
<p class="pie_chart_val">+1.58%</p>
</td>
<td class="up">
<p class="pie_chart_val">+1.61%</p>
</td>
</tr>
"""
soup = BeautifulSoup(html, 'lxml')
values = soup.find_all('p', class_="pie_chart_val")
for value in values:
    print(value.text, end=",")
Output:
+1.30%,-1.33%,+1.58%,+1.61%,
Or you can collect the values in a list and print them on one line:
lst=[i.get_text(strip=True) for i in values]
print(*lst,sep=",")
Output:
+1.30%,-1.33%,+1.58%,+1.61%
To get them into a single variable:
x=",".join(lst)
print(x)
Output:
+1.30%,-1.33%,+1.58%,+1.61%
You can also do this with string concatenation.
from bs4 import BeautifulSoup
HTML = '''
</tr>
<td class="up">
<p class="pie_chart_val">+1.30%</p>
</td>
<td class="down">
<p class="pie_chart_val">-1.33%</p>
</td>
<td class="up">
<p class="pie_chart_val">+1.58%</p>
</td>
<td class="up">
<p class="pie_chart_val">+1.61%</p>
</td>
</tr>
'''
soup = BeautifulSoup(HTML, 'lxml')
values = soup.find_all('p', class_="pie_chart_val")
ans = ''
for value in values:
    ans += value.text.strip() + ' '
print(ans)
ans is a string holding the space-separated texts of the <p> tags (with one trailing space):
+1.30% -1.33% +1.58% +1.61%
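Since the question asked for a space delimiter, here is a minimal sketch that combines the ideas from the answers above using str.join, which avoids the trailing separator. The variable name joined is only illustrative, and html.parser is used so that lxml is not required.

```python
from bs4 import BeautifulSoup

html = """<tr>
<td class="up">
<p class="pie_chart_val">+1.30%</p>
</td>
<td class="down">
<p class="pie_chart_val">-1.33%</p>
</td>
<td class="up">
<p class="pie_chart_val">+1.58%</p>
</td>
<td class="up">
<p class="pie_chart_val">+1.61%</p>
</td>
</tr>
"""

soup = BeautifulSoup(html, "html.parser")
values = soup.find_all("p", class_="pie_chart_val")

# join() puts the separator only between items, so nothing trails the last one.
joined = " ".join(p.get_text(strip=True) for p in values)
print(joined)
```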

Finding certain element using bs4 beautifulSoup

I usually use selenium but figured I would give bs4 a shot!
I am trying to find a specific piece of text on the website. In the example below I want the last value, 189305014.
<div class="info_container">
<div id="profile_photo">
<img src="https://pbs.twimg.com/profile_images/882103883610427393/vLTiH3uR_reasonably_small.jpg" />
</div>
<table class="profile_info">
<tr>
<td class="left_column">
<p>Twitter User ID:</p>
</td>
<td>
<p>189305014</p>
</td>
</tr>
Here is the script I am using -
TwitterID = soup.find('td',attrs={'class':'left_column'}).text
This returns
Twitter User ID:
You can find the <p> tag that contains "Twitter User ID:" and then search for the next <p> tag after it:
from bs4 import BeautifulSoup
txt = '''<div class="info_container">
<div id="profile_photo">
<img src="https://pbs.twimg.com/profile_images/882103883610427393/vLTiH3uR_reasonably_small.jpg" />
</div>
<table class="profile_info">
<tr>
<td class="left_column">
<p>Twitter User ID:</p>
</td>
<td>
<p>189305014</p>
</td>
</tr>
'''
soup = BeautifulSoup(txt, 'html.parser')
print(soup.find('p', text='Twitter User ID:').find_next('p'))
Prints:
<p>189305014</p>
Or last <p> element inside class="profile_info":
print(soup.select('.profile_info p')[-1])
Or first sibling to class="left_column":
print(soup.select_one('.left_column + *').text)
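Alternatively, match the first <td> that has no class attribute (the result includes the surrounding whitespace, so you may want to call .strip() on it):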
TwitterID = soup.find('td',attrs={'class': None}).text
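To keep only the digits, you can filter the container's text character by character with isdigit(). Note that this collects every digit in the container's text, so it only works when the ID is the only number there: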
from bs4 import BeautifulSoup
html = """<div class="info_container">
<div id="profile_photo">
<img src="https://pbs.twimg.com/profile_images/882103883610427393/vLTiH3uR_reasonably_small.jpg" />
</div>
<table class="profile_info">
<tr>
<td class="left_column">
<p>Twitter User ID:</p>
</td>
<td>
<p>189305014</p>
</td>
</tr>"""
soup = BeautifulSoup(html, 'html.parser')
result = ''.join(
    [t for t in soup.find('div', class_='info_container').text if t.isdigit()]
)
print(result)
Output:
189305014
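The approaches above can be folded into a small lookup that goes from a label cell to the value cell beside it. This is only a sketch: value_for_label is a hypothetical helper, and it assumes every label sits in a td with class left_column immediately followed by the td holding its value, as in the sample markup.

```python
from bs4 import BeautifulSoup

html = """<div class="info_container">
<table class="profile_info">
<tr>
<td class="left_column">
<p>Twitter User ID:</p>
</td>
<td>
<p>189305014</p>
</td>
</tr>
</table>
</div>"""


def value_for_label(soup, label):
    """Return the text of the cell that follows the given label cell, or None."""
    for cell in soup.find_all("td", class_="left_column"):
        if cell.get_text(strip=True) == label:
            value_cell = cell.find_next_sibling("td")
            if value_cell is not None:
                return value_cell.get_text(strip=True)
    return None


soup = BeautifulSoup(html, "html.parser")
twitter_id = value_for_label(soup, "Twitter User ID:")
print(twitter_id)
```

Matching on the label text rather than on position keeps the lookup working if more rows are added to the table.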

How to extract specific <td> from table

I'm working on a web scraping program using Python & BeautifulSoup. I encountered a problem when scraping a table.
My problem is, I need to extract selected <td> tags only and not the entire table.
I only need the numbers for 52 Week High, 52 Week Low, Earnings Per Share and Price to book value.
Is there any way I can do that?
Sample Table
<table id="TABLE_1">
<tbody id="TBODY_2">
<tr id="TR_3">
<td id="TD_4">
<strong id="STRONG_5">52-Week High:</strong>
</td>
<td id="TD_6">
1,116.00
</td>
<td id="TD_7">
<strong id="STRONG_8">Earnings Per Share TTM (EPS):</strong>
</td>
<td id="TD_9">
47.87 (15.57%)
</td>
<td id="TD_10">
<strong id="STRONG_11">Price to Book Value (P/BV):</strong>
</td>
<td id="TD_12">
2.5481125565
</td>
</tr>
<tr id="TR_13">
<td id="TD_14">
<strong id="STRONG_15">52-Week Low:</strong>
</td>
<td id="TD_16">
867.50
</td>
<td id="TD_17">
<strong id="STRONG_18">Price-Earnings Ratio TTM (P/E):</strong>
</td>
<td id="TD_19">
20.8272404429
</td>
<td id="TD_20">
<strong id="STRONG_21">Return on Equity (ROE):</strong>
</td>
<td id="TD_22">
12.42%
</td>
</tr>
<tr id="TR_23">
<td id="TD_24">
<strong id="STRONG_25">Fair Value:</strong>
</td>
<td id="TD_26">
-
</td>
<td id="TD_27">
<strong id="STRONG_28">Dividends Per Share (DPS):</strong>
</td>
<td id="TD_29">
-
</td>
<td id="TD_30">
<strong id="STRONG_31">Recommendation:</strong>
</td>
<td id="TD_32">
None<span id="SPAN_33"></span>
</td>
</tr>
<tr id="TR_34">
<td id="TD_35">
<strong id="STRONG_36">Last Price:</strong>
</td>
<td id="TD_37">
<span id="SPAN_38"></span> <span id="SPAN_39">984.5</span>
</td>
</tr>
</tbody>
</table>
My code is included below for reference.
Any help would be very much appreciated! Thank you!
from bs4 import BeautifulSoup as soup
from urllib.request import Request, urlopen
import pandas as pd
myurl = "https://www.investagrams.com/Stock/ac"
hdr = {'User-Agent': 'Mozilla/5.0'}
req = Request(myurl,headers=hdr)
# Open connection to website
uClient = urlopen(req)
# Offloads the content to variable
page_html = uClient.read()
#just closing it
uClient.close()
# html parser
page_soup = soup(page_html, "html.parser")
table = page_soup.find("div", {"id":"FundamentalAnalysisPanel"}).find("table")
print(table.text)
You can do it with the findNextSibling method.
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.investagrams.com/Stock/ac')
soup = BeautifulSoup(r.text, 'html.parser')
# specify table parameters for which you want to find values
parameters = ['52-Week High:', '52-Week Low:', 'Earnings Per Share TTM (EPS):', 'Price-Earnings Ratio TTM (P/E):', 'Price to Book Value (P/BV):']
# iterate all <td> tags and print text of the next sibling (with value),
# if this <td> contains specified parameter.
for td in soup.findAll('td'):
    for p in parameters:
        if td.find('strong', text=p) is not None:
            print(td.findNextSibling().text.strip())
Result:
1,116.00
47.87 (15.57%)
2.5481125565
867.50
20.8272404429
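If you want to look the values up by name rather than print them in page order, one option is to build a dictionary of label and value pairs first. The sketch below runs against a shortened copy of the sample table instead of the live page. label_value_pairs is a hypothetical helper, and it assumes each strong label sits in a td whose next sibling td holds the value.

```python
from bs4 import BeautifulSoup

html = """
<table id="TABLE_1">
<tr>
<td><strong>52-Week High:</strong></td>
<td>
1,116.00
</td>
<td><strong>Earnings Per Share TTM (EPS):</strong></td>
<td>
47.87 (15.57%)
</td>
</tr>
<tr>
<td><strong>52-Week Low:</strong></td>
<td>
867.50
</td>
</tr>
</table>
"""


def label_value_pairs(soup):
    """Map each <strong> label to the text of the cell that follows it."""
    pairs = {}
    for strong in soup.find_all("strong"):
        label_cell = strong.find_parent("td")
        if label_cell is None:
            continue  # label is not inside a table cell
        value_cell = label_cell.find_next_sibling("td")
        if value_cell is not None:
            pairs[strong.get_text(strip=True)] = value_cell.get_text(strip=True)
    return pairs


soup = BeautifulSoup(html, "html.parser")
pairs = label_value_pairs(soup)
for label in ["52-Week High:", "52-Week Low:", "Earnings Per Share TTM (EPS):"]:
    print(label, pairs.get(label))
```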
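Alternatively, pandas can parse a table directly. Reusing page_html, soup and pd from the question's code:
from io import StringIO
page_soup = soup(page_html, "html.parser")
tables = page_soup.find_all('table')
# read_html returns a list of DataFrames, one per table it finds
df = pd.read_html(StringIO(str(tables[i])))[0]
where i is the index of the table you want.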

Established html table line to python

Let's say I have an HTML table like this:
<tr>
<td class="Klasse gerade">12A<br></td>
<td class="Stunde gerade">4<br></td>
<td class="Fach gerade">GEO statt GE<br></td>
<td class="Lehrer gerade"><br></td>
<td class="Vertretung gerade">Herr Grieger<br></td>
<td class="Raum gerade">603<br></td>
<td class="Anmerkung gerade"><br></td>
</tr>
<tr>
<td class="Klasse gerade">10A<br></td>
<td class="Stunde gerade">2<br></td>
<td class="Fach gerade">MA statt GE<br></td>
<td class="Lehrer gerade"><br></td>
<td class="Vertretung gerade">Herr Grieger<br></td>
<td class="Raum gerade">406<br></td>
<td class="Anmerkung gerade"><br></td>
</tr>
If I fetch the HTML in Python (2.7) with:
link = "http://www.test.com/vplan.html"
f = urllib.urlopen(link)
vplan = f.read()
print vplan
how can I print the complete <tr> of the row whose <td> contains 10A?
Sorry for the awkward wording, but this seemed to me the easiest way to explain my question. Don't mind the German words (I'm German).
You need an HTML parser like BeautifulSoup. Assuming the table in question is the only one or the first one in the document, the program may look like this:
#!/usr/bin/env python
import urllib
from bs4 import BeautifulSoup
def main():
    link = 'http://www.test.com/vplan.html'
    soup = BeautifulSoup(urllib.urlopen(link), 'lxml')
    table = soup.find('table')
    # find every text node equal to '10A' and climb to its enclosing row
    rows = [x.find_parent('tr') for x in table.find_all(text='10A')]
    for row in rows:
        for cell in row.find_all('td'):
            print cell.text
        print '-' * 10
if __name__ == '__main__':
    main()
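For Python 3, a rough equivalent can be sketched against an inline copy of the sample rows rather than the live URL. It swaps in a slightly different technique: instead of searching for the text node, it compares the first cell of each row. rows_for_class is a hypothetical helper, and the surrounding table tag is assumed.

```python
from bs4 import BeautifulSoup

html = """
<table>
<tr>
<td class="Klasse gerade">12A<br></td>
<td class="Stunde gerade">4<br></td>
<td class="Fach gerade">GEO statt GE<br></td>
<td class="Lehrer gerade"><br></td>
<td class="Vertretung gerade">Herr Grieger<br></td>
<td class="Raum gerade">603<br></td>
<td class="Anmerkung gerade"><br></td>
</tr>
<tr>
<td class="Klasse gerade">10A<br></td>
<td class="Stunde gerade">2<br></td>
<td class="Fach gerade">MA statt GE<br></td>
<td class="Lehrer gerade"><br></td>
<td class="Vertretung gerade">Herr Grieger<br></td>
<td class="Raum gerade">406<br></td>
<td class="Anmerkung gerade"><br></td>
</tr>
</table>
"""


def rows_for_class(html, klass):
    """Return the cell texts of every row whose first cell equals klass."""
    soup = BeautifulSoup(html, "html.parser")
    matches = []
    for tr in soup.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells and cells[0] == klass:
            matches.append(cells)
    return matches


matches = rows_for_class(html, "10A")
for cells in matches:
    print(" | ".join(cells))
```

Comparing only the first cell avoids false hits if "10A" ever appears in another column, such as the remarks.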

Using Python + BeautifulSoup to pick up text in a table on webpage

I want to pick up a date on a webpage.
The original webpage source code looks like:
<TR class=odd>
<TD>
<TABLE class=zp>
<TBODY>
<TR>
<TD><SPAN>Expiry Date</SPAN>2016</TD></TR></TBODY></TABLE></TD>
<TD> </TD>
<TD> </TD></TR>
I want to pick up the ‘2016’ but I fail. The most I can do is:
page = urllib2.urlopen('http://www.thewebpage.com')
soup = BeautifulSoup(page.read())
a = soup.find_all(text=re.compile("Expiry Date"))
And I tried:
b = a[0].findNext('').text
print b
and
b = a[0].find_next('td').select('td:nth-of-type(1)')
print b
Neither of them works.
Any help? Thanks.
There are multiple options.
Option #1 (using CSS selector, being very explicit about the path to the element):
from bs4 import BeautifulSoup
data = """
<TR class="odd">
<TD>
<TABLE class="zp">
<TBODY>
<TR>
<TD>
<SPAN>
Expiry Date
</SPAN>
2016
</TD>
</TR>
</TBODY>
</TABLE>
</TD>
<TD> </TD>
<TD> </TD>
</TR>
"""
soup = BeautifulSoup(data)
span = soup.select('tr.odd table.zp > tbody > tr > td > span')[0]
print span.next_sibling.strip() # prints 2016
We are basically saying: get me the span tag that is directly inside the td that is directly inside the tr that is directly inside tbody that is directly inside the table tag with zp class that is inside the tr tag with odd class. Then, we are using next_sibling to get the text after the span tag.
Option #2 (find the span by its text; I think this is more readable):
span = soup.find('span', text=re.compile('Expiry Date'))
print span.next_sibling.strip() # prints 2016
re.compile() is needed because the text inside the span may be spread over several lines and carry extra whitespace, so an exact string match would fail. Do not forget to import the re module.
An alternative to the css selector is:
import bs4
data = """
<TR class="odd">
<TD>
<TABLE class="zp">
<TBODY>
<TR>
<TD>
<SPAN>
Expiry Date
</SPAN>
2016
</TD>
</TR>
</TBODY>
</TABLE>
</TD>
<TD> </TD>
<TD> </TD>
</TR>
"""
soup = bs4.BeautifulSoup(data)
exp_date = soup.find('table', class_='zp').tbody.tr.td.span.next_sibling
print exp_date.strip() # 2016
To learn about BeautifulSoup, I recommend you read the documentation.
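For reference, here is a Python 3 sketch of the find-by-text option, run against the compact markup from the question. It names html.parser explicitly and guards against the span having no text sibling. The outer TABLE wrapper and the variable names are assumptions added for illustration.

```python
import re
from bs4 import BeautifulSoup

data = """
<TABLE>
<TR class="odd">
<TD>
<TABLE class="zp">
<TBODY>
<TR>
<TD><SPAN>Expiry Date</SPAN>2016</TD></TR></TBODY></TABLE></TD>
<TD> </TD>
<TD> </TD></TR>
</TABLE>
"""

soup = BeautifulSoup(data, "html.parser")

# Match the label loosely so extra whitespace between the words is tolerated.
span = soup.find("span", string=re.compile(r"Expiry\s+Date"))

# The date is the text node right after the span; NavigableString is a str.
sibling = span.next_sibling if span is not None else None
expiry = sibling.strip() if isinstance(sibling, str) else None
print(expiry)
```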
