Trying to grab certain parts of the NFL stats table using BeautifulSoup - python

I am trying to grab certain stats from the table. I have narrowed it down to the row for a team, and just have to grab the actual numbers! The code I have is:
import requests
from bs4 import BeautifulSoup
url = 'http://espn.go.com/nfl/statistics/team/_/stat/defense/position/defense'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, "html.parser")
HoustonDefense = soup.find('tr', class_="oddrow team-28-34")
print (HoustonDefense.prettify())
This prints the HoustonDefense row as:
<tr align="right" class="oddrow team-28-34">
<td align="left">
1
</td>
<td align="left">
<a href="http://espn.go.com/nfl/team/_/name/hou/houston-texans">
Houston
</a>
</td>
<td>
539
</td>
<td>
272
</td>
<td class="sortcell">
811
</td>
<td>
22.0
</td>
<td>
136
</td>
<td>
65
</td>
<td>
9
</td>
<td>
102
</td>
<td>
38
</td>
<td>
1
</td>
<td>
17
</td>
<td>
5
</td>
<td>
2
</td>
</tr>
I want to grab those numbers between each <td></td> and assign them to a variable. Any help would be amazing! Thanks!

Use find_all() to find all td elements inside the tr and get the text of every td found except the first two cells (the ranking and the team name itself):
[td.text for td in HoustonDefense.find_all("td")[2:]]
Prints:
[u'539', u'272', u'811', u'22.0', u'136', u'65', u'9', u'102', u'38', u'1', u'17', u'5', u'2']
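To go one step further and bind each value to a name, the same list can be zipped with a list of column labels. A self-contained sketch on a trimmed copy of the row; note the label names here are placeholders for illustration, not ESPN's actual column headers:

```python
from bs4 import BeautifulSoup

# Minimal stand-in for the scraped row; the real page needs requests as above.
html = '''<tr class="oddrow team-28-34">
<td align="left">1</td>
<td align="left"><a href="#">Houston</a></td>
<td>539</td><td>272</td><td class="sortcell">811</td>
</tr>'''

row = BeautifulSoup(html, "html.parser").find("tr")
# Skip the first two cells (rank and team name), keep the numbers.
values = [td.get_text(strip=True) for td in row.find_all("td")[2:]]

# Pair each number with a label; these names are made up, not ESPN's.
labels = ["solo", "ast", "total"]
stats = dict(zip(labels, values))
print(stats)  # {'solo': '539', 'ast': '272', 'total': '811'}
```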

Related

Collect text using XPath

Is it possible to capture all EAN numbers in such a construct using XPath, or do I need to use regular expressions?
<table>
<tr>
<td>
EAN Giftbox
</td>
<td>
7350034654483
</td>
</tr>
<tr>
<td>
EAN Export Carton:
</td>
<td>
17350034643958
</td>
</tr>
</table>
I want to get a list of ['7350034654483', '17350034643958']
from lxml import html as lh
html = """<table>
<tr>
<td>
EAN Giftbox
</td>
<td>
7350034654483
</td>
</tr>
<tr>
<td>
EAN Export Carton:
</td>
<td>
17350034643958
</td>
</tr>
</table>
"""
root = lh.fragment_fromstring(html)
tds = root.xpath('//tr[*]/td[2]')
for td in tds:
    print(td.text.strip())
Output:
7350034654483
17350034643958
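If you prefer to stay with BeautifulSoup rather than lxml/XPath, the same second-column extraction can be done with a CSS selector; a sketch of that approach (requires bs4 4.7+ for `:nth-of-type`):

```python
from bs4 import BeautifulSoup

html = """<table>
<tr><td>EAN Giftbox</td><td>7350034654483</td></tr>
<tr><td>EAN Export Carton:</td><td>17350034643958</td></tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")
# The CSS selector td:nth-of-type(2) mirrors the XPath //tr/td[2].
eans = [td.get_text(strip=True) for td in soup.select("tr td:nth-of-type(2)")]
print(eans)  # ['7350034654483', '17350034643958']
```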

How to get text from nested html table with beautifulsoup?

Within each of the main tables, there are two nested tables, of which the first contains the data A_A_A_A that I want to extract into a pandas.DataFrame:
<table>
<tr valign="top">
<td> </td>
<td>
<br/>
<center>
<h2>asd</h2>
</center>
<h4>asd</h4>
<table>
<tr>
</tr>
</table>
<table border="0" cellpadding="0" cellspacing="0" class="tabcol" width="100%">
<tr>
<td> </td>
</tr>
<tr>
<td width="3%"> </td>
<td>
<table border="0" width="100%">
<tr>
<td width="2%"> </td>
<td> A_A_A_A <br/> A_A_A_A 111-222<br/> </td>
<td width="2%"> </td>
</tr>
</table>
</td>
<td width="3%"> </td>
</tr>
<tr>
<td width="3%"> </td>
<td>
<table border="0" cellpadding="0" cellspacing="0" width="100%">
<tr>
<td width="4%"> </td>
<td class="unique"> asd <br/> asd </td>
<td width="4%"> </td>
</tr>
</table>
</td>
<td width="3%"> </td>
</tr>
<tr>
<td> </td>
</tr>
</table>
<table border="0" cellpadding="0" cellspacing="0" class="tabcol" width="100%">
.
.
.
</table>
<br/>
<table>
</table>
</td>
</tr>
</table>
I figured that, given the limited availability of attributes, the only way forward would be to iterate over the td siblings with .next_siblings and, if needed, .next_elements:
data1 = []
for item in soup.find_all('td', attrs={'width': '2%'}):
    data = item.find_next_sibling().text
    data1.append(data)
This returns an empty list []. Now I don't know how to go forward, because I cannot identify any other helpful attributes/classes that would get me to the middle td that contains the information.
.find_next(name=None, attrs={}, text=None, **kwargs)
Returns the first item that matches the given criteria and appears after this Tag in the document. So in your case:
item = soup.find('td', attrs={'width': '2%'})
data = item.find_next('td').text
Note that I removed the for loop, since the desired data comes right after the first td with width: '2%'. After running this, data will be:
' A_A_A_A A_A_A_A 111-222 '
I took @Wiktor Stribiżew's answer from here: regex for loop over list in python, and merged it with yours, @Rustam Garayev:
item = soup.find_all('td', attrs={'width': '2%'})
data = [x.find_next('td').text for x in item]
since I needed not only the first A_A_A_A but also those from all the following tables. The code above gives this output:
['A_A_A_A',
'\xa0',
'A_A_A_A',
'\xa0', ...]
which is good enough for my purpose. I think the '\xa0' comes from trying to do the find_next on the third td sibling, which has no following td with data.
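If those '\xa0' (non-breaking space) entries are unwanted, they can be filtered out after the fact; a minimal sketch on a list shaped like the output above:

```python
# Drop entries that contain nothing but non-breaking spaces and blanks.
raw = ['A_A_A_A', '\xa0', 'A_A_A_A', '\xa0']
clean = [x for x in raw if x.strip('\xa0 ')]
print(clean)  # ['A_A_A_A', 'A_A_A_A']
```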

How to find a sibling HTML table element by specific href using Python Beautiful Soup

Using Beautiful Soup, I am trying to scrape data from HTML tables which look like the following:
<table class="ipl-zebra-list ipl-zebra-list--fixed-first release-dates-table-test-only">
<tr class="ipl-zebra-list__item release-date-item">
<td class="release-date-item__country-name"><a href="/calendar/?region=de">Germany
</a></td>
<td align="right" class="release-date-item__date">15 September 2017</td> <td align="left" class="release-date-item__attributes">(Oldenburg Film Festival)
</td>
</tr>
<tr class="ipl-zebra-list__item release-date-item">
<td class="release-date-item__country-name"><a href="/calendar/?region=gb">UK
</a></td>
<td align="right" class="release-date-item__date">23 March 2018</td> <td class="release-date-item__attributes--empty"></td>
</tr>
</table>
I am looking for the date which appears in the sibling element to the <td> element which includes the following href:
<a href="/calendar/?region=gb">UK
In the example above this is 23 March 2018, but the date is different for every instance in which the href occurs. However, the href is always identical.
To summarise, I am looking for the data which appears in the adjacent cell to href listed above.
Thanks!
So if you want to have the country name and the date linked to that country name you could create a dictionary like this:
html = '''<table class="ipl-zebra-list ipl-zebra-list--fixed-first release-dates-table-test-only">
<tr class="ipl-zebra-list__item release-date-item">
<td class="release-date-item__country-name"><a href="/calendar/?region=de">Germany
</a></td>
<td align="right" class="release-date-item__date">15 September 2017</td> <td align="left" class="release-date-item__attributes">(Oldenburg Film Festival)
</td>
</tr>
<tr class="ipl-zebra-list__item release-date-item">
<td class="release-date-item__country-name"><a href="/calendar/?region=gb">UK
</a></td>
<td align="right" class="release-date-item__date">23 March 2018</td> <td class="release-date-item__attributes--empty"></td>
</tr>
</table>'''
from bs4 import BeautifulSoup

html_code = BeautifulSoup(html, 'html.parser')
countries = html_code.find_all('td', class_='release-date-item__country-name')
dates = html_code.find_all('td', class_='release-date-item__date')
dates_as_dic = {}
for i in range(len(dates)):
    dates_as_dic[countries[i].text.strip()] = dates[i].text
print(dates_as_dic)
output:
{'Germany': '15 September 2017', 'UK': '23 March 2018'}
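Since the question asks for the cell adjacent to a specific href, an alternative is to target that href directly and walk to the sibling cell. A sketch over a trimmed copy of the sample HTML:

```python
from bs4 import BeautifulSoup

html = '''<table>
<tr class="release-date-item">
<td class="release-date-item__country-name"><a href="/calendar/?region=gb">UK</a></td>
<td align="right" class="release-date-item__date">23 March 2018</td>
</tr>
</table>'''

soup = BeautifulSoup(html, "html.parser")
# Locate the anchor by its exact href, climb to the enclosing <td>,
# then take the next <td> sibling, which holds the date.
link = soup.find("a", href="/calendar/?region=gb")
date = link.find_parent("td").find_next_sibling("td").get_text(strip=True)
print(date)  # 23 March 2018
```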

How to extract specific <td> from table

I'm working on a web scraping program using Python & BeautifulSoup. I encountered a problem when scraping a table.
My problem is, I need to extract selected <td> tags only and not the entire table.
I only need the numbers for 52 Week High, 52 Week Low, Earnings Per Share and Price to book value.
Is there anyway I can do that?
Sample Table
<table id="TABLE_1">
<tbody id="TBODY_2">
<tr id="TR_3">
<td id="TD_4">
<strong id="STRONG_5">52-Week High:</strong>
</td>
<td id="TD_6">
1,116.00
</td>
<td id="TD_7">
<strong id="STRONG_8">Earnings Per Share TTM (EPS):</strong>
</td>
<td id="TD_9">
47.87 (15.57%)
</td>
<td id="TD_10">
<strong id="STRONG_11">Price to Book Value (P/BV):</strong>
</td>
<td id="TD_12">
2.5481125565
</td>
</tr>
<tr id="TR_13">
<td id="TD_14">
<strong id="STRONG_15">52-Week Low:</strong>
</td>
<td id="TD_16">
867.50
</td>
<td id="TD_17">
<strong id="STRONG_18">Price-Earnings Ratio TTM (P/E):</strong>
</td>
<td id="TD_19">
20.8272404429
</td>
<td id="TD_20">
<strong id="STRONG_21">Return on Equity (ROE):</strong>
</td>
<td id="TD_22">
12.42%
</td>
</tr>
<tr id="TR_23">
<td id="TD_24">
<strong id="STRONG_25">Fair Value:</strong>
</td>
<td id="TD_26">
-
</td>
<td id="TD_27">
<strong id="STRONG_28">Dividends Per Share (DPS):</strong>
</td>
<td id="TD_29">
-
</td>
<td id="TD_30">
<strong id="STRONG_31">Recommendation:</strong>
</td>
<td id="TD_32">
None<span id="SPAN_33"></span>
</td>
</tr>
<tr id="TR_34">
<td id="TD_35">
<strong id="STRONG_36">Last Price:</strong>
</td>
<td id="TD_37">
<span id="SPAN_38"></span> <span id="SPAN_39">984.5</span>
</td>
</tr>
</tbody>
</table>
I have also included my code for reference.
Any help would be very much appreciated! Thank you!
from bs4 import BeautifulSoup as soup
from urllib.request import Request, urlopen
import pandas as pd
myurl = "https://www.investagrams.com/Stock/ac"
hdr = {'User-Agent': 'Mozilla/5.0'}
req = Request(myurl,headers=hdr)
# Open connection to website
uClient = urlopen(req)
# Offloads the content to variable
page_html = uClient.read()
#just closing it
uClient.close()
# html parser
page_soup = soup(page_html, "html.parser")
table = page_soup.find("div", {"id":"FundamentalAnalysisPanel"}).find("table")
print(table.text)
You can do it with the findNextSibling method.
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.investagrams.com/Stock/ac')
soup = BeautifulSoup(r.text, 'html.parser')
# specify table parameters for which you want to find values
parameters = ['52-Week High:', '52-Week Low:', 'Earnings Per Share TTM (EPS):', 'Price-Earnings Ratio TTM (P/E):', 'Price to Book Value (P/BV):']
# iterate all <td> tags and print text of the next sibling (with value),
# if this <td> contains specified parameter.
for td in soup.findAll('td'):
    for p in parameters:
        if td.find('strong', text=p) is not None:
            print(td.findNextSibling().text.strip())
Result:
1,116.00
47.87 (15.57%)
2.5481125565
867.50
20.8272404429
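The same sibling lookup can also build a label-to-value dictionary, so each number stays attached to its label. A self-contained sketch over a trimmed copy of the sample table:

```python
from bs4 import BeautifulSoup

html = '''<table><tr>
<td><strong>52-Week High:</strong></td><td>1,116.00</td>
<td><strong>Earnings Per Share TTM (EPS):</strong></td><td>47.87 (15.57%)</td>
</tr></table>'''

soup = BeautifulSoup(html, "html.parser")
# Map each <strong> label (minus the trailing colon) to the text of the
# <td> that follows the label's own cell.
stats = {s.text.rstrip(':'): s.find_parent('td').find_next_sibling('td').get_text(strip=True)
         for s in soup.find_all('strong')}
print(stats)
```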
This might be what you want:
page_soup = soup(page_html, 'html.parser')
tables = page_soup.find_all('table')
df = pd.read_html(str(tables[i]))
where i is the index of the table you want.

BeautifulSoup, get text of all td's (some text with commas) inside tr's

I'm currently working on a table that is created in ASP. It's very messy, but with some code help I think I'll get what I need from it.
I have HTML that I want turned into one array for each tr that has td's. I also do not want the "-" to be part of the output arrays.
Some td's have 2 commas, and some texts in the td's are separated by only an empty space " ".
The code is like this:
<tr bgcolor="#EFEFEF">
<td>
<a href="free.asp?detail=hide&c_id=4342141">
<img align="absmiddle" border="0" hspace="0" src="pic/bullet.gif" vspace="0"/>
</a>
</td>
<td>
4342141
</td>
<td width="10">
</td>
<td>
25.07.2018 09:00
</td>
<td width="10">
</td>
<td>
Golbasi Ankara, Turkey
</td>
<td width="10">
-
</td>
<td>
Konya Havalimani Turkey
</td>
<td colspan="2">
</td>
</tr>
<tr bgcolor="#EFEFEF" height="3">
<td colspan="10">
</td>
</tr>
<tr bgcolor="#FFFFFF" height="1">
<td colspan="10">
</td>
</tr>
<tr bgcolor="#DDDDDD" height="6">
<td colspan="10">
</td>
</tr>
<tr bgcolor="#FFFFFF" height="1">
<td colspan="10">
</td>
</tr>
<tr bgcolor="#DEE3E7" height="3">
<td colspan="10">
</td>
</tr>
<tr bgcolor="#DEE3E7">
<td>
<a href="free.asp?detail=hide&c_id=4134123">
<img align="absmiddle" border="0" hspace="0" src="pic/bullet.gif" vspace="0"/>
</a>
</td>
<td>
4134123
</td>
<td width="10">
</td>
<td>
26.07.2018 09:00
</td>
<td width="10">
</td>
<td>
Kucuktepe, Van, Turkey
</td>
<td width="10">
-
</td>
<td>
Maltepe, Istanbul, Turkey
</td>
<td colspan="2">
</td>
</tr>
The output I want is:
[['4342141', '25.07.2018', '09:00', 'Golbasi Ankara, Turkey', '-', 'Konya Havalimani Turkey', 'free.asp?detail=hide&c_id=4342141'], ['4134123', '26.07.2018', '09:00', 'Kucuktepe, Van, Turkey', '-', 'Maltepe, Istanbul, Turkey', 'free.asp?detail=hide&c_id=4134123']]
Assuming data will hold the HTML text:
from bs4 import BeautifulSoup
from pprint import pprint
soup = BeautifulSoup(data, 'lxml')
rows = []
for tr in soup.select('tr'):
    row = [td.text.strip() for td in tr.select('td') if td.text.strip() and td.text.strip() != '-']
    if row:
        rows.append(row)
pprint(rows, width=120)
This will print:
[['4342141', '25.07.2018 09:00', 'Golbasi Ankara, Turkey', 'Konya Havalimani Turkey'],
['4134123', '26.07.2018 09:00', 'Kucuktepe, Van, Turkey', 'Maltepe, Istanbul, Turkey']]
For writing the rows list to csv you can use this script:
import csv
with open('data.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(rows)
Then in data.csv file you will have:
4342141,25.07.2018 09:00,"Golbasi Ankara, Turkey",Konya Havalimani Turkey
4134123,26.07.2018 09:00,"Kucuktepe, Van, Turkey","Maltepe, Istanbul, Turkey"
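The desired output in the question also carries each row's free.asp?... link as the last element. The row-building loop above can be extended to append the href of the first anchor in each tr; a sketch over a trimmed sample row:

```python
from bs4 import BeautifulSoup

html = '''<tr bgcolor="#EFEFEF">
<td><a href="free.asp?detail=hide&amp;c_id=4342141"><img src="pic/bullet.gif"/></a></td>
<td>4342141</td>
<td>Golbasi Ankara, Turkey</td>
</tr>'''

soup = BeautifulSoup(html, 'html.parser')
rows = []
for tr in soup.select('tr'):
    # Keep non-empty cell texts, skipping the "-" placeholders.
    row = [td.text.strip() for td in tr.select('td')
           if td.text.strip() and td.text.strip() != '-']
    a = tr.find('a', href=True)
    if a:
        row.append(a['href'])  # append the link, as in the desired output
    if row:
        rows.append(row)
print(rows)
```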
