Beautiful Soup not finding specific table by ID - python

I am trying to parse a Basketball-Reference player page to extract one of its tables and work with the data from it. For some reason, though, Beautiful Soup cannot find the table in the page. I have searched for other tables in the page and it finds them successfully, but it will not find this specific one.
I have the following line which takes a link to the page of the specific player I am searching for and gets the BeautifulSoup version of it:
page_soup = BeautifulSoup(bball_ref_page.content, 'lxml')
I then search for the table with the following line:
table = page_soup.find('table', attrs={'id': 'per_poss'})
Whenever I try to print(table) it just comes out as None.
I have also tried searching for the contents by doing:
table = page_soup.find(attrs={'id': 'per_poss'})
Same result: None.
I have also tried searching for all tables in page_soup; it returns a list of tables that does not include the one I am looking for.
I have tried changing the parser in the page_soup assignment to html.parser and the result remains the same. I have also tried printing the contents of page_soup, and I can find the table in there:
<div class="table_container current" id="div_per_poss">
<table class="stats_table sortable row_summable" id="per_poss" data-cols-to-freeze="1,3"> <caption>Per 100 Poss Table</caption> <colgroup><col>....
Any ideas what might be causing this to happen?

The page stores the <table> data inside an HTML comment (<!-- -->), so normally BeautifulSoup doesn't see it. To load it as a pandas DataFrame you can use the following example:
import requests
import pandas as pd
from bs4 import BeautifulSoup, Comment
url = "https://www.basketball-reference.com/players/j/jordami01.html"
soup = BeautifulSoup(requests.get(url).content, "lxml")
# collect every HTML comment and re-parse the joined text, so the
# commented-out tables become regular, searchable markup
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
soup = BeautifulSoup("\n".join(comments), "lxml")
df = pd.read_html(str(soup.select_one("table#per_poss")))[0]
print(df.to_markdown())
Prints:
|    | Season     | Age | Tm  | Lg  | Pos | G    | GS   | MP    | FG   | FGA  | FG%   | 3P  | 3PA | 3P%   | 2P   | 2PA  | 2P%   | FT   | FTA  | FT%   | ORB | DRB | TRB | AST | STL | BLK | TOV | PF  | PTS  | Unnamed: 29 | ORtg | DRtg |
|----|------------|-----|-----|-----|-----|------|------|-------|------|------|-------|-----|-----|-------|------|------|-------|------|------|-------|-----|-----|-----|-----|-----|-----|-----|-----|------|-------------|------|------|
|  0 | 1984-85    | 21  | CHI | NBA | SG  | 82   | 82   | 3144  | 12.9 | 25   | 0.515 | 0.1 | 0.8 | 0.173 | 12.7 | 24.2 | 0.526 | 9.7  | 11.5 | 0.845 | 2.6 | 5.6 | 8.2 | 7.4 | 3   | 1.1 | 4.5 | 4.4 | 35.5 | nan         | 118  | 107  |
|  1 | 1985-86    | 22  | CHI | NBA | SG  | 18   | 7    | 451   | 16   | 35   | 0.457 | 0.3 | 1.9 | 0.167 | 15.7 | 33.1 | 0.474 | 11.2 | 13.3 | 0.84  | 2.5 | 4.4 | 6.8 | 5.7 | 3.9 | 2.2 | 4.8 | 4.9 | 43.5 | nan         | 109  | 107  |
|  2 | 1986-87    | 23  | CHI | NBA | SG  | 82   | 82   | 3281  | 16.8 | 34.8 | 0.482 | 0.2 | 1   | 0.182 | 16.6 | 33.8 | 0.491 | 12.7 | 14.8 | 0.857 | 2.5 | 4   | 6.6 | 5.8 | 3.6 | 1.9 | 4.2 | 3.6 | 46.4 | nan         | 117  | 104  |
|  3 | 1987-88    | 24  | CHI | NBA | SG  | 82   | 82   | 3311  | 16.2 | 30.3 | 0.535 | 0.1 | 0.8 | 0.132 | 16.1 | 29.5 | 0.546 | 11   | 13.1 | 0.841 | 2.1 | 4.7 | 6.8 | 7.4 | 3.9 | 2   | 3.8 | 4.1 | 43.6 | nan         | 123  | 101  |
|  4 | 1988-89    | 25  | CHI | NBA | SG  | 81   | 81   | 3255  | 14.7 | 27.3 | 0.538 | 0.4 | 1.5 | 0.276 | 14.3 | 25.8 | 0.553 | 10.2 | 12.1 | 0.85  | 2.3 | 7.6 | 9.9 | 9.9 | 3.6 | 1   | 4.4 | 3.8 | 40   | nan         | 123  | 103  |
|  5 | 1989-90    | 26  | CHI | NBA | SG  | 82   | 82   | 3197  | 16   | 30.5 | 0.526 | 1.4 | 3.8 | 0.376 | 14.6 | 26.7 | 0.548 | 9.2  | 10.8 | 0.848 | 2.2 | 6.6 | 8.8 | 8.1 | 3.5 | 0.8 | 3.8 | 3.7 | 42.7 | nan         | 123  | 106  |
|  6 | 1990-91    | 27  | CHI | NBA | SG  | 82   | 82   | 3034  | 16.4 | 30.4 | 0.539 | 0.5 | 1.5 | 0.312 | 15.9 | 28.9 | 0.551 | 9.4  | 11.1 | 0.851 | 2   | 6.2 | 8.1 | 7.5 | 3.7 | 1.4 | 3.3 | 3.8 | 42.7 | nan         | 125  | 102  |
|  7 | 1991-92    | 28  | CHI | NBA | SG  | 80   | 80   | 3102  | 15.5 | 29.8 | 0.519 | 0.4 | 1.6 | 0.27  | 15   | 28.2 | 0.533 | 8    | 9.7  | 0.832 | 1.5 | 6.9 | 8.4 | 8   | 3   | 1.2 | 3.3 | 3.3 | 39.4 | nan         | 121  | 102  |
|  8 | 1992-93    | 29  | CHI | NBA | SG  | 78   | 78   | 3067  | 16.8 | 33.9 | 0.495 | 1.4 | 3.9 | 0.352 | 15.4 | 30   | 0.514 | 8.1  | 9.6  | 0.837 | 2.3 | 6.5 | 8.8 | 7.2 | 3.7 | 1   | 3.5 | 3.2 | 43   | nan         | 119  | 102  |
|  9 | 1994-95    | 31  | CHI | NBA | SG  | 17   | 17   | 668   | 13   | 31.5 | 0.411 | 1.2 | 2.5 | 0.5   | 11.7 | 29   | 0.403 | 8.5  | 10.6 | 0.801 | 2   | 7.2 | 9.1 | 7   | 2.3 | 1   | 2.7 | 3.7 | 35.7 | nan         | 109  | 103  |
| 10 | 1995-96    | 32  | CHI | NBA | SG  | 82   | 82   | 3090  | 15.6 | 31.5 | 0.495 | 1.9 | 4.4 | 0.427 | 13.7 | 27.1 | 0.506 | 9.3  | 11.2 | 0.834 | 2.5 | 6.7 | 9.3 | 6   | 3.1 | 0.7 | 3.4 | 3.3 | 42.5 | nan         | 124  | 100  |
| 11 | 1996-97    | 33  | CHI | NBA | SG  | 82   | 82   | 3106  | 15.8 | 32.5 | 0.486 | 1.9 | 5.1 | 0.374 | 13.9 | 27.4 | 0.507 | 8.2  | 9.9  | 0.833 | 1.9 | 6.3 | 8.3 | 6   | 2.4 | 0.8 | 2.9 | 2.7 | 41.8 | nan         | 121  | 102  |
| 12 | 1997-98    | 34  | CHI | NBA | SG  | 82   | 82   | 3181  | 14.9 | 32.1 | 0.465 | 0.5 | 2.1 | 0.238 | 14.4 | 30   | 0.482 | 9.6  | 12.2 | 0.784 | 2.2 | 5.8 | 8.1 | 4.8 | 2.4 | 0.8 | 3.1 | 2.6 | 40   | nan         | 114  | 100  |
| 13 | 2001-02    | 38  | WAS | NBA | SF  | 60   | 53   | 2093  | 14.3 | 34.4 | 0.416 | 0.3 | 1.4 | 0.189 | 14   | 33   | 0.426 | 6.8  | 8.6  | 0.79  | 1.3 | 7.5 | 8.8 | 8   | 2.2 | 0.7 | 4.2 | 3.1 | 35.7 | nan         | 99   | 105  |
| 14 | 2002-03    | 39  | WAS | NBA | SF  | 82   | 67   | 3031  | 12.2 | 27.4 | 0.445 | 0.3 | 1   | 0.291 | 11.9 | 26.4 | 0.45  | 4.8  | 5.8  | 0.821 | 1.3 | 7.7 | 8.9 | 5.6 | 2.2 | 0.7 | 3.1 | 3.1 | 29.5 | nan         | 101  | 103  |
| 15 | Career     | nan | nan | NBA | nan | 1072 | 1039 | 41011 | 15.3 | 30.7 | 0.497 | 0.7 | 2.2 | 0.327 | 14.5 | 28.5 | 0.51  | 9.2  | 11   | 0.835 | 2.1 | 6.3 | 8.3 | 7   | 3.1 | 1.1 | 3.7 | 3.5 | 40.4 | nan         | 118  | 103  |
| 16 | nan        | nan | nan | nan | nan | nan  | nan  | nan   | nan  | nan  | nan   | nan | nan | nan   | nan  | nan  | nan   | nan  | nan  | nan   | nan | nan | nan | nan | nan | nan | nan | nan | nan  | nan         | nan  | nan  |
| 17 | 13 seasons | nan | CHI | NBA | nan | 930  | 919  | 35887 | 15.5 | 30.8 | 0.505 | 0.8 | 2.4 | 0.332 | 14.8 | 28.4 | 0.52  | 9.6  | 11.5 | 0.838 | 2.2 | 6.1 | 8.3 | 7.1 | 3.3 | 1.2 | 3.7 | 3.5 | 41.5 | nan         | 120  | 103  |
| 18 | 2 seasons  | nan | WAS | NBA | nan | 142  | 120  | 5124  | 13.1 | 30.3 | 0.431 | 0.3 | 1.1 | 0.241 | 12.8 | 29.1 | 0.439 | 5.6  | 7    | 0.805 | 1.3 | 7.6 | 8.9 | 6.6 | 2.2 | 0.7 | 3.6 | 3.1 | 32   | nan         | 100  | 104  |
To iterate over the rows of the DataFrame, you can use df.iterrows(), for example:
for index, row in df.iterrows():
    print(row["Season"], row["Age"])
Prints:
1984-85 21.0
1985-86 22.0
1986-87 23.0
1987-88 24.0
1988-89 25.0
...
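Joining every comment on the page and re-parsing works, but you can also target just the comment that wraps the table you want. A minimal sketch, assuming the commented-out table sits inside a wrapper div with id="all_per_poss" (Basketball-Reference wraps each commented table in an all_* div on similar pages; check the page source to confirm):
import requests
import pandas as pd
from bs4 import BeautifulSoup, Comment
url = "https://www.basketball-reference.com/players/j/jordami01.html"
soup = BeautifulSoup(requests.get(url).content, "lxml")
# the wrapper id below is an assumption based on similar pages
wrapper = soup.find("div", id="all_per_poss")
comment = wrapper.find(string=lambda text: isinstance(text, Comment))
# the comment body is the table's HTML, so pandas can parse it directly
df = pd.read_html(str(comment), attrs={"id": "per_poss"})[0]
print(df.head())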

Related

Trying to use the BeautifulSoup Python module to pull individual elements from table data

I am new to Python and currently using BeautifulSoup with Python to try and pull some table data. I cannot get the individual elements out of the td. What I have so far is:
from bs4 import BeautifulSoup
import requests
source = requests.get('https://gol.gg/teams/list/season-ALL/split-ALL/region-ALL/tournament-LCS%20Summer%202020/week-ALL/').text
soup = BeautifulSoup(source, 'lxml')
td = soup.find_all('td', {'class': 'text-center'})
print(td)
This does display all of the td elements that I want to extract, but I am unable to figure out how to get each individual element out of them.
Thank you in advance for the help; it is much appreciated.
Try this:
from bs4 import BeautifulSoup
import requests
source = requests.get('https://gol.gg/teams/list/season-ALL/split-ALL/region-ALL/tournament-LCS%20Summer%202020/week-ALL/').text
soup = BeautifulSoup(source, 'lxml')
td = soup.find_all('td', {'class': 'text-center'})
print(*[text.get_text(strip=True) + '\n' for text in td])
Prints:
S10
NA
14
35.7%
0.91
1744
-48
33:19
11.2
12.4
5.5
7.0
50.0
64.3
2.71
54.2
1.00
57.1
1.14
and so on....
The following script extracts the data and saves it to a CSV file.
import requests
from bs4 import BeautifulSoup
import pandas as pd
res = requests.get('https://gol.gg/teams/list/season-ALL/split-ALL/region-ALL/tournament-LCS%20Summer%202020/week-ALL/')
soup = BeautifulSoup(res.text, 'html.parser')
table = soup.find("table", class_="table_list playerslist tablesaw trhover")
columns = [i.get_text(strip=True) for i in table.find("thead").find_all("th")]
data = []
table.find("thead").extract()
for tr in table.find_all("tr"):
    data.append([td.get_text(strip=True) for td in tr.find_all("td")])
df = pd.DataFrame(data, columns=columns)
df.to_csv("data.csv", index=False)
Output:
Name Season Region Games Win rate K:D GPM GDM Game duration Kills / game Deaths / game Towers killed Towers lost FB% FT% DRAPG DRA% HERPG HER% DRA#15 TD#15 GD#15 NASHPG NASH% CSM DPM WPM VWPM WCPM
0 100 Thieves S10 NA 14 35.7% 0.91 1744 -48 33:19 11.2 12.4 5.5 7.0 50.0 64.3 2.71 54.2 1.00 57.1 1.14 0.4 -378 0.64 42.9 33.2 1937 3.0 1.19 1.31
1 CLG S10 NA 14 35.7% 0.81 1705 -120 35:25 10.6 13.2 4.9 7.9 28.6 28.6 1.93 31.5 0.57 28.6 0.64 -0.6 -1297 0.57 30.4 32.6 1826 3.2 1.17 1.37
2 Cloud9 S10 NA 14 78.6% 1.91 1922 302 28:52 15.0 7.9 8.3 3.1 64.3 64.3 3.07 72.5 1.43 71.4 1.29 0.7 2410 1.00 78.6 33.3 1921 3.0 1.10 1.26
3 Dignitas S10 NA 14 28.6% 0.86 1663 -147 32:44 8.9 10.4 3.9 8.1 42.9 35.7 2.14 41.7 0.57 28.6 0.79 -0.7 -796 0.36 25.0 32.5 1517 3.1 1.28 1.23
4 Evil Geniuses S10 NA 14 50.0% 0.85 1738 -0 34:09 11.1 13.1 6.5 6.0 64.3 57.1 2.36 48.5 1.00 53.6 1.00 0.5 397 0.50 46.5 32.3 1895 3.2 1.36 1.34
5 FlyQuest S10 NA 14 57.1% 1.28 1770 65 34:55 13.4 10.4 6.5 5.2 71.4 35.7 2.86 53.4 1.00 50.0 0.79 -0.1 69 0.71 69.2 32.7 1801 3.2 1.16 1.72
6 Golden Guardians S10 NA 14 50.0% 0.96 1740 6 36:13 10.7 11.1 6.3 6.1 50.0 35.7 3.29 62.8 0.86 42.9 1.43 0.1 711 0.50 43.6 33.7 1944 3.2 1.27 1.53
7 Immortals S10 NA 14 21.4% 0.54 1609 -246 33:54 7.5 14.0 4.3 7.9 35.7 35.7 2.29 39.9 1.00 53.6 0.79 -0.4 -1509 0.36 25.0 31.4 1734 3.3 1.37 1.47
8 Team Liquid S10 NA 14 78.6% 1.31 1796 135 35:07 11.4 8.6 7.9 4.4 42.9 64.3 2.36 43.6 0.93 50.0 1.14 0.2 522 1.21 78.6 33.1 1755 3.5 1.27 1.42
9 TSM S10 NA 14 64.3% 1.12 1768 52 34:20 11.6 10.4 7.2 5.7 50.0 78.6 2.79 51.9 1.21 64.3 0.93 0.1 -129 0.86 57.1 32.6 1729 3.2 1.33 1.33

bs4 not giving table

import requests
from bs4 import BeautifulSoup

URL = 'https://www.basketball-reference.com/leagues/NBA_2019.html'
content = requests.get(URL)
soup = BeautifulSoup(content.text, 'html.parser')
table = soup.find_all('table', {'class': 'sortable stats_table now_sortable'})
rows = table.find_all('td')
for i in rows:
    print(i.get_text())
I want to get the content of the table with team per-game stats from this website, but I get this error:
>>>AttributeError: 'NoneType' object has no attribute 'find_all'
The table that you want is dynamically loaded, meaning it is not loaded into the HTML when you first make a request to the page. So the table you are searching for does not yet exist.
To scrape sites that use JavaScript, you can look into using Selenium WebDriver and PhantomJS, as described in this post: https://stackoverflow.com/a/26440563/13275492 (a sketch follows).
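PhantomJS is no longer maintained, so headless Chrome is the usual substitute today. A minimal sketch of the Selenium approach, assuming chromedriver is installed and on your PATH:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

options = Options()
options.add_argument("--headless")  # run Chrome without opening a window
driver = webdriver.Chrome(options=options)
driver.get("https://www.basketball-reference.com/leagues/NBA_2019.html")
# the rendered page source can now be handed to BeautifulSoup as usual
soup = BeautifulSoup(driver.page_source, "html.parser")
table = soup.find("table", {"id": "team-stats-per_game"})
driver.quit()
(As the answers below note, on this particular site the table actually sits inside an HTML comment rather than being rendered by JavaScript, so a plain requests approach also works.)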
Actually you can use pandas.read_html(), which reads all the tables in a nice format. It returns the tables as a list, so you can access each one as a DataFrame by index, e.g. print(df[0]).
import pandas as pd
df = pd.read_html("https://www.basketball-reference.com/leagues/NBA_2019.html")
print(df)
The tables (with the exception of a few) in these sports reference sites are within the comments. You would need to pull out the comments, then render these tables with pandas.
import requests
from bs4 import BeautifulSoup, Comment
import pandas as pd
headers = {'User-Agent':
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
page = "https://www.basketball-reference.com/leagues/NBA_2019.html"
pageTree = requests.get(page, headers=headers)
pageSoup = BeautifulSoup(pageTree.content, 'html.parser')
comments = pageSoup.find_all(string=lambda text: isinstance(text, Comment))
for each in comments:
    if 'table' in each and 'id="team-stats-per_game"' in each:
        df = pd.read_html(each, attrs={'id': 'team-stats-per_game'})[0]
Output:
print (df)
Rk Team G MP FG ... STL BLK TOV PF PTS
0 1.0 Milwaukee Bucks* 82 241.2 43.4 ... 7.5 5.9 13.9 19.6 118.1
1 2.0 Golden State Warriors* 82 241.5 44.0 ... 7.6 6.4 14.3 21.4 117.7
2 3.0 New Orleans Pelicans 82 240.9 43.7 ... 7.4 5.4 14.8 21.1 115.4
3 4.0 Philadelphia 76ers* 82 241.5 41.5 ... 7.4 5.3 14.9 21.3 115.2
4 5.0 Los Angeles Clippers* 82 241.8 41.3 ... 6.8 4.7 14.5 23.3 115.1
5 6.0 Portland Trail Blazers* 82 242.1 42.3 ... 6.7 5.0 13.8 20.4 114.7
6 7.0 Oklahoma City Thunder* 82 242.1 42.6 ... 9.3 5.2 14.0 22.4 114.5
7 8.0 Toronto Raptors* 82 242.4 42.2 ... 8.3 5.3 14.0 21.0 114.4
8 9.0 Sacramento Kings 82 240.6 43.2 ... 8.3 4.4 13.4 21.4 114.2
9 10.0 Washington Wizards 82 243.0 42.1 ... 8.3 4.6 14.1 20.7 114.0
10 11.0 Houston Rockets* 82 241.8 39.2 ... 8.5 4.9 13.3 22.0 113.9
11 12.0 Atlanta Hawks 82 242.1 41.4 ... 8.2 5.1 17.0 23.6 113.3
12 13.0 Minnesota Timberwolves 82 241.8 41.6 ... 8.3 5.0 13.1 20.3 112.5
13 14.0 Boston Celtics* 82 241.2 42.1 ... 8.6 5.3 12.8 20.4 112.4
14 15.0 Brooklyn Nets* 82 243.7 40.3 ... 6.6 4.1 15.1 21.5 112.2
15 16.0 Los Angeles Lakers 82 241.2 42.6 ... 7.5 5.4 15.7 20.7 111.8
16 17.0 Utah Jazz* 82 240.9 40.4 ... 8.1 5.9 15.1 21.1 111.7
17 18.0 San Antonio Spurs* 82 241.5 42.3 ... 6.1 4.7 12.1 18.1 111.7
18 19.0 Charlotte Hornets 82 241.8 40.2 ... 7.2 4.9 12.2 18.9 110.7
19 20.0 Denver Nuggets* 82 240.6 41.9 ... 7.7 4.4 13.4 20.0 110.7
20 21.0 Dallas Mavericks 82 241.2 38.8 ... 6.5 4.3 14.2 20.1 108.9
21 22.0 Indiana Pacers* 82 240.3 41.3 ... 8.7 4.9 13.7 19.4 108.0
22 23.0 Phoenix Suns 82 242.4 40.1 ... 9.0 5.1 15.6 23.6 107.5
23 24.0 Orlando Magic* 82 241.2 40.4 ... 6.6 5.4 13.2 18.6 107.3
24 25.0 Detroit Pistons* 82 242.1 38.8 ... 6.9 4.0 13.8 22.1 107.0
25 26.0 Miami Heat 82 240.6 39.6 ... 7.6 5.5 14.7 20.9 105.7
26 27.0 Chicago Bulls 82 242.7 39.8 ... 7.4 4.3 14.1 20.3 104.9
27 28.0 New York Knicks 82 241.2 38.2 ... 6.8 5.1 14.0 20.9 104.6
28 29.0 Cleveland Cavaliers 82 240.9 38.9 ... 6.5 2.4 13.5 20.0 104.5
29 30.0 Memphis Grizzlies 82 242.4 38.0 ... 8.3 5.5 14.0 22.0 103.5
30 NaN League Average 82 241.6 41.1 ... 7.6 5.0 14.1 20.9 111.2
[31 rows x 25 columns]

To scrape the data from span tag using beautifulsoup

I am trying to scrape the webpage, where I need to decode the entire table into a dataframe. I am using beautiful soup for this purpose. In certain td tags, there are span tags which do not have any text. But the values are shown on the webpage in that particular span tag.
The following html code corresponds to that webpage,
<td>
<span class="nttu">::after</span>
<span class="ntbb">::after</span>
<span class="ntyc">::after</span>
<span class="nttu">::after</span>
</td>
But the value shown in this td tag is 23.8. I tried to scrape it, but I am getting an empty string.
How can I scrape this value using Beautiful Soup?
URL: https://en.tutiempo.net/climate/ws-432950.html
and my code for scraping the table is given below:
import requests
from bs4 import BeautifulSoup

http_url = "https://en.tutiempo.net/climate/01-2013/ws-432950.html"
retreived_data = requests.get(http_url).text
soup = BeautifulSoup(retreived_data, "lxml")
climate_table = soup.find("table", attrs={"class": "medias mensuales numspan"})
climate_data = climate_table.find_all("tr")
for data in climate_data[1:-2]:
    table_data = data.find_all("td")
    row_data = []
    for row in table_data:
        row_data.append(row.get_text())
    # climate_df is a DataFrame created earlier in the script
    climate_df.loc[len(climate_df)] = row_data
I misunderstood your question at first since you referenced two different URLs; I see now what you mean.
It is odd that in that second table they used CSS to fill in the content of some of those <td> tags. What you need to do is pull those special cases out of the <style> tag. Once you have them, you can replace the matching elements within the HTML source and finally parse it into a dataframe. I used pandas, as it uses BeautifulSoup under the hood to parse <table> tags. I believe this will get you what you want:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import re
http_url = "https://en.tutiempo.net/climate/01-2013/ws-432950.html"
retreived_data = requests.get(http_url).text
soup = BeautifulSoup(retreived_data, "lxml")
hiddenData = str(soup.find_all('style')[1])
hiddenSpan = {}
# map each span class to the text its CSS ::after rule injects
for group in re.findall(r'span\.(.+?)}', hiddenData):
    class_attr = group.split('span.')[-1].split('::')[0]
    content = group.split('"')[1]
    hiddenSpan[class_attr] = content
climate_table = str(soup.find("table", attrs={"class": "medias mensuales numspan"}))
# substitute the hidden values back into the table HTML before parsing
for k, v in hiddenSpan.items():
    climate_table = climate_table.replace('<span class="%s"></span>' % (k), v)
df = pd.read_html(climate_table)[0]
Output:
print (df.to_string())
Day T TM Tm SLP H PP VV V VM VG RA SN TS FG
0 1 23.4 30.3 19 - 59 0 6.3 4.3 5.4 - NaN NaN NaN NaN
1 2 22.4 30.3 16.9 - 57 0 6.9 3.3 7.6 - NaN NaN NaN NaN
2 3 24 31.8 16.9 - 51 0 6.9 2.8 5.4 - NaN NaN NaN NaN
3 4 24.2 32 17.4 - 53 0 6 3.3 5.4 - NaN NaN NaN NaN
4 5 23.8 32 18 - 58 0 6.9 3.1 7.6 - NaN NaN NaN NaN
5 6 23.3 31 18.3 - 60 0 6.9 5 9.4 - NaN NaN NaN NaN
6 7 22.8 30.2 17.6 - 55 0 7.7 3.7 7.6 - NaN NaN NaN NaN
7 8 23.1 30.6 17.4 - 46 0 6.9 3.3 5.4 - NaN NaN NaN NaN
8 9 22.9 30.6 17.4 - 51 0 6.9 3.5 3.5 - NaN NaN NaN NaN
9 10 22.3 30 17 - 56 0 6.3 3.3 7.6 - NaN NaN NaN NaN
10 11 22.3 29.4 17 - 53 0 6.9 4.3 7.6 - NaN NaN NaN NaN
11 12 21.8 29.4 15.7 - 54 0 6.9 2.8 3.5 - NaN NaN NaN NaN
12 13 22.3 30.1 15.7 - 43 0 6.9 2.8 5.4 - NaN NaN NaN NaN
13 14 21.8 30.6 14.8 - 41 0 6.9 1.9 5.4 - NaN NaN NaN NaN
14 15 21.6 30.6 14.2 - 43 0 6.9 3.1 7.6 - NaN NaN NaN NaN
15 16 21.1 29.9 15.4 - 55 0 6.9 4.1 7.6 - NaN NaN NaN NaN
16 17 20.4 28.1 15.4 - 59 0 6.9 5 11.1 - NaN NaN NaN NaN
17 18 21.2 28.3 14.5 - 53 0 6.9 3.1 7.6 - NaN NaN NaN NaN
18 19 21.6 29.6 16.4 - 58 0 6.9 2.2 3.5 - NaN NaN NaN NaN
19 20 21.9 29.6 16.6 - 58 0 6.9 2.4 5.4 - NaN NaN NaN NaN
20 21 22.3 29.9 17.5 - 55 0 6.9 3.1 5.4 - NaN NaN NaN NaN
21 22 21.9 29.9 15.1 - 46 0 6.9 4.3 7.6 - NaN NaN NaN NaN
22 23 21.3 29 15.2 - 50 0 6.9 3.3 5.4 - NaN NaN NaN NaN
23 24 21.3 28.8 14.6 - 45 0 6.9 3 5.4 - NaN NaN NaN NaN
24 25 21.6 29.1 15.5 - 47 0 7.7 4.8 7.6 - NaN NaN NaN NaN
25 26 21.8 29.2 14.6 - 41 0 6.9 2.8 3.5 - NaN NaN NaN NaN
26 27 22.3 30.1 15.6 - 40 0 6.9 2.4 5.4 - NaN NaN NaN NaN
27 28 22.4 30.3 16 - 51 0 6.9 2.8 3.5 - NaN NaN NaN NaN
28 29 23 30.3 16.9 - 53 0 6.6 2.8 5.4 - NaN NaN NaN o
29 30 23.1 30 17.8 - 54 0 6.9 5.4 7.6 - NaN NaN NaN NaN
30 31 22.1 29.8 17.3 - 54 0 6.9 5.2 9.4 - NaN NaN NaN NaN
31 Monthly means and totals: Monthly means and totals: Monthly means and totals: Monthly means and totals: Monthly means and totals: Monthly means and totals: Monthly means and totals: Monthly means and totals: Monthly means and totals: Monthly means and totals: Monthly means and totals: Monthly means and totals: Monthly means and totals: Monthly means and totals: Monthly means and totals:
32 NaN 22.3 30 16.4 - 51.6 0 6.9 3.5 6.3 NaN 0 0 0 1

Trying to scrape a webpage with multiple data tables, however only the first table is being extracted?

I am trying to extract data from basketball players off of Basketball-Reference for a project I am working on. On B-R, a player page has multiple tables of data and I want to grab all of it. However, when I try to grab the tables from the page, it only gives me the first instance of a table tag, i.e. only the first table.
I have searched through the html and found that outside the first instance of the table tag, all the table tags are under a comment block. When I parse their parent tag and try to search for the child tag that contains the table information, it returns nothing. Here is a link to an example page, and here is my code:
import requests
from bs4 import BeautifulSoup

url = 'https://www.basketball-reference.com/players/j/jamesle01.html'
get = requests.get(url)
soup = BeautifulSoup(get.text, 'html.parser')
per_36 = soup.find(id='all_per_minute')
table = per_36.find('table')
This returns nothing; however, if I instead look for the first table, it returns the contents. I don't understand what is going on, but I think it may have something to do with those comment blocks.
To scrape comments via BeautifulSoup, you could use this script:
import requests
from bs4 import BeautifulSoup, Comment
url = 'https://www.basketball-reference.com/players/j/jamesle01.html'
get = requests.get(url)
soup = BeautifulSoup(get.text, 'html.parser')
pl = soup.select_one('#all_per_minute .placeholder')
comments = pl.find_next(string=lambda text: isinstance(text, Comment))
soup = BeautifulSoup(comments, 'html.parser')
rows = []
for tr in soup.select('tr'):
    rows.append([td.get_text(strip=True) for td in tr.select('td, th')])
for row in rows:
    print(''.join('{: ^7}'.format(td) for td in row))
Prints:
Season Age Tm Lg Pos G GS MP FG FGA FG% 3P 3PA 3P% 2P 2PA 2P% FT FTA FT% ORB DRB TRB AST STL BLK TOV PF PTS
2003-04 19 CLE NBA SG 79 79 3122 7.2 17.2 .417 0.7 2.5 .290 6.4 14.7 .438 4.0 5.3 .754 1.1 3.8 5.0 5.4 1.5 0.7 3.1 1.7 19.1
2004-05 20 CLE NBA SF 80 80 3388 8.4 17.9 .472 1.1 3.3 .351 7.3 14.6 .499 5.1 6.8 .750 1.2 5.1 6.2 6.1 1.9 0.6 2.8 1.6 23.1
2005-06 21 CLE NBA SF 79 79 3361 9.4 19.5 .480 1.4 4.1 .335 8.0 15.5 .518 6.4 8.7 .738 0.8 5.2 6.0 5.6 1.3 0.7 2.8 1.9 26.5
2006-07 22 CLE NBA SF 78 78 3190 8.7 18.3 .476 1.1 3.5 .319 7.6 14.8 .513 5.5 7.9 .698 0.9 5.0 5.9 5.3 1.4 0.6 2.8 1.9 24.1
2007-08 23 CLE NBA SF 75 74 3027 9.4 19.5 .484 1.3 4.3 .315 8.1 15.3 .531 6.5 9.2 .712 1.6 5.5 7.0 6.4 1.6 1.0 3.0 2.0 26.8
2008-09 24 CLE NBA SF 81 81 3054 9.3 19.0 .489 1.6 4.5 .344 7.7 14.5 .535 7.0 9.0 .780 1.2 6.0 7.2 6.9 1.6 1.1 2.8 1.6 27.2
2009-10 25 CLE NBA SF 76 76 2966 9.3 18.5 .503 1.6 4.7 .333 7.8 13.8 .560 7.2 9.4 .767 0.9 5.9 6.7 7.9 1.5 0.9 3.2 1.4 27.4
2010-11 26 MIA NBA SF 79 79 3063 8.9 17.5 .510 1.1 3.3 .330 7.8 14.2 .552 5.9 7.8 .759 0.9 6.0 6.9 6.5 1.5 0.6 3.3 1.9 24.8
2011-12 27 MIA NBA SF 62 62 2326 9.6 18.1 .531 0.8 2.3 .362 8.8 15.8 .556 6.0 7.8 .771 1.5 6.2 7.6 6.0 1.8 0.8 3.3 1.5 26.0
...and so on.

python: import data from text

I tried importing float numbers from the P-I curve.txt file which contains my data; however, I get an error when converting the values to float. I used the following code:
import csv

with open('C:/Users/Kevin/Documents/4e Jaar/fotonica/Metingen/P-I curve.txt') as csvfile:
    data = csv.reader(csvfile, delimiter='\t')
    current = []
    P_15 = []
    P_20 = []
    P_25 = []
    P_30 = []
    P_35 = []
    P_40 = []
    P_45 = []
    P_50 = []
    for row in data:
        current.append(float(row[0].replace(',', '.')))
        P_15.append(float(row[2].replace(',', '.')))
        P_20.append(float(row[4].replace(',', '.')))
        P_25.append(float(row[6].replace(',', '.')))
        P_30.append(float(row[8].replace(',', '.')))
        P_35.append(float(row[10].replace(',', '.')))
        P_40.append(float(row[12].replace(',', '.')))
        P_45.append(float(row[14].replace(',', '.')))
        P_50.append(float(row[16].replace(',', '.')))
With this code I got the following error. I understand that row[2] is a string, but then why did this error not occur for row[0]? Is there any other way to import float numbers without using the csv module? I have copied and pasted the data from Excel into a .txt file.
returned error:
File "C:/Users/Kevin/Documents/Python Scripts/P-I curves.py", line 29, in <module>
P_15.append(float(row[2].replace(',','.')))
ValueError: could not convert string to float:
I then tried the following code:
import pandas as pd
df=pd.read_csv('C:/Users/Kevin/Documents/4e Jaar/fotonica/Metingen/P-I curve.txt', decimal=',', sep='\t',header=0,names=['current','15','20','25','30','35','40','45','50'] )
#curre=df['current']
print(current)
The txt file has a header, and the printed output looks like this:
1.8 1.9 0.4 1.9 0.4 1.9 0.4 1.9 0.4
3.8 1.9 1.3 1.9 1.3 1.9 1.3 1.9 1.2
5.8 2.0 2.5 2.0 2.4 2.0 2.3 2.0 2.2
7.8 2.0 3.7 2.0 3.6 2.0 3.5 2.0 3.4
9.8 2.1 5.2 2.0 5.1 2.0 4.9 2.0 4.7
11.8 2.1 6.9 2.1 6.7 2.1 6.4 2.1 6.1
13.8 2.1 9.0 2.0 8.6 2.1 8.2 2.1 7.8
15.8 2.1 11.5 2.1 10.8 2.1 10.2 2.1 9.7
17.8 2.2 14.7 2.2 13.7 2.2 12.7 2.2 11.8
19.8 2.2 19.5 2.2 17.5 2.2 15.9 2.2 14.5
21.8 2.2 28.9 2.2 23.6 2.2 20.3 2.2 17.9
23.8 2.3 125.8 2.2 38.4 2.2 27.8 2.2 22.8
25.8 2.3 1669.0 2.3 634.0 2.3 51.7 2.3 31.4
27.8 2.3 3142.0 2.3 2154.0 2.3 982.0 2.3 62.2
29.8 2.3 4560.0 2.3 3594.0 2.3 2460.0 2.3 1075.0
31.8 2.3 5950.0 2.3 5010.0 2.3 3872.0 2.3 2540.0
33.8 2.4 7320.0 2.4 6360.0 2.4 5230.0 2.3 3880.0
35.8 2.4 8670.0 2.4 7700.0 2.4 6550.0 2.4 5210.0
37.8 NaN NaN NaN NaN 2.4 7850.0 2.4 6480.0
39.8 NaN NaN NaN NaN NaN NaN NaN NaN
41.8 NaN NaN NaN NaN NaN NaN NaN NaN
Name: current, dtype: float64
Python seems to be returning everything instead of just the column I want by printing the header current. I only want to take this column so I can save it as an array. How do I specifically pull the column with header current out of the data?
I am not sure why it returned everything, but I think there is something wrong with the encoding because I copied and pasted the data from Excel.
Please look at the image of how the .txt file looks when copied from Excel.
I tried out another short piece of code (I also deleted the header manually from the .txt file!); see below:
import numpy as np

data = np.loadtxt('C:/Users/Kevin/Documents/4e Jaar/fotonica/Metingen/ttest.txt', delimiter='\t')
data = float(data.replace(',', '.'))
print(data[0])
With this code, I get the following error:
ValueError: could not convert string to float: b'1,8'
I find this weird. Are float() and replace() not enough for this?
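np.loadtxt parses each field before your replace() ever runs, so it fails on the comma decimals first. Per-column converters handle this; a minimal sketch, assuming a tab-separated file with nine columns and comma decimal separators (whether the converter receives bytes or str depends on the NumPy version, hence the defensive decode):
import numpy as np

def comma_float(s):
    # older NumPy versions pass bytes, newer ones pass str
    if isinstance(s, bytes):
        s = s.decode()
    s = s.strip()
    return float(s.replace(',', '.')) if s else np.nan

data = np.loadtxt('ttest.txt', delimiter='\t',
                  converters={i: comma_float for i in range(9)})
print(data[0])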
I think you need to omit header=0:
df=pd.read_csv('C:/Users/Kevin/Documents/4e Jaar/fotonica/Metingen/P-I curve.txt',
decimal=',',
sep='\t',
names=['current','15','20','25','30','35','40','45','50'])
EDIT:
df=pd.read_csv('ttest.txt',
decimal=',',
sep='\t',
names=['current','15','20','25','30','35','40','45','50'])
print (df)
current 15 20 25 30 35 40 45 50
0 1.8 0.4 0.4 0.4 0.4 0.4 0.4 0.3 0.3
1 3.8 1.3 1.3 1.3 1.2 1.2 1.1 1.1 1.1
2 5.8 2.5 2.4 2.3 2.2 2.2 2.1 2.0 1.9
3 7.8 3.7 3.6 3.5 3.4 3.3 3.1 3.0 2.9
4 9.8 5.2 5.1 4.9 4.7 4.5 4.3 4.1 4.0
5 11.8 6.9 6.7 6.4 6.1 5.9 5.6 5.3 5.1
6 13.8 9.0 8.6 8.2 7.8 7.4 7.0 6.6 6.3
7 15.8 11.5 10.8 10.2 9.7 9.1 8.6 8.0 7.6
8 17.8 14.7 13.7 12.7 11.8 11.0 10.3 9.6 9.0
9 19.8 19.5 17.5 15.9 14.5 13.3 12.2 11.3 10.5
10 21.8 28.9 23.6 20.3 17.9 16.0 14.5 13.2 12.2
11 23.8 125.8 38.4 27.8 22.8 19.6 17.2 15.4 14.1
12 25.8 1669.0 634.0 51.7 31.4 24.5 20.6 17.9 16.2
13 27.8 3142.0 2154.0 982.0 62.2 33.1 25.3 21.0 18.5
14 29.8 4560.0 3594.0 2460.0 1075.0 60.0 32.6 25.0 21.3
15 31.8 5950.0 5010.0 3872.0 2540.0 903.0 49.9 30.8 24.6
16 33.8 7320.0 6360.0 5230.0 3880.0 2294.0 387.0 40.9 28.8
17 35.8 8670.0 7700.0 6550.0 5210.0 3621.0 1733.0 71.0 34.8
18 37.8 NaN NaN 7850.0 6480.0 4880.0 3026.0 751.0 44.6
19 39.8 NaN NaN NaN NaN 6100.0 4240.0 1998.0 70.2
20 41.8 NaN NaN NaN NaN NaN NaN 3161.0 650.0
#list from column 15 with all values, including NaNs
L1 = df['15'].tolist()
print (L1)
[0.4, 1.3, 2.5, 3.7, 5.2, 6.9, 9.0, 11.5, 14.7, 19.5, 28.9, 125.8, 1669.0,
3142.0, 4560.0, 5950.0, 7320.0, 8670.0, nan, nan, nan]
#list from column 15 with NaNs removed
L2 = df['15'].dropna().tolist()
print (L2)
[0.4, 1.3, 2.5, 3.7, 5.2, 6.9, 9.0, 11.5, 14.7, 19.5, 28.9, 125.8, 1669.0,
3142.0, 4560.0, 5950.0, 7320.0, 8670.0]
#convert all NaNs in all columns to 0
df = df.fillna(0)
print (df)
current 15 20 25 30 35 40 45 50
0 1.8 0.4 0.4 0.4 0.4 0.4 0.4 0.3 0.3
1 3.8 1.3 1.3 1.3 1.2 1.2 1.1 1.1 1.1
2 5.8 2.5 2.4 2.3 2.2 2.2 2.1 2.0 1.9
3 7.8 3.7 3.6 3.5 3.4 3.3 3.1 3.0 2.9
4 9.8 5.2 5.1 4.9 4.7 4.5 4.3 4.1 4.0
5 11.8 6.9 6.7 6.4 6.1 5.9 5.6 5.3 5.1
6 13.8 9.0 8.6 8.2 7.8 7.4 7.0 6.6 6.3
7 15.8 11.5 10.8 10.2 9.7 9.1 8.6 8.0 7.6
8 17.8 14.7 13.7 12.7 11.8 11.0 10.3 9.6 9.0
9 19.8 19.5 17.5 15.9 14.5 13.3 12.2 11.3 10.5
10 21.8 28.9 23.6 20.3 17.9 16.0 14.5 13.2 12.2
11 23.8 125.8 38.4 27.8 22.8 19.6 17.2 15.4 14.1
12 25.8 1669.0 634.0 51.7 31.4 24.5 20.6 17.9 16.2
13 27.8 3142.0 2154.0 982.0 62.2 33.1 25.3 21.0 18.5
14 29.8 4560.0 3594.0 2460.0 1075.0 60.0 32.6 25.0 21.3
15 31.8 5950.0 5010.0 3872.0 2540.0 903.0 49.9 30.8 24.6
16 33.8 7320.0 6360.0 5230.0 3880.0 2294.0 387.0 40.9 28.8
17 35.8 8670.0 7700.0 6550.0 5210.0 3621.0 1733.0 71.0 34.8
18 37.8 0.0 0.0 7850.0 6480.0 4880.0 3026.0 751.0 44.6
19 39.8 0.0 0.0 0.0 0.0 6100.0 4240.0 1998.0 70.2
20 41.8 0.0 0.0 0.0 0.0 0.0 0.0 3161.0 650.0
#list from column 15
L3 = df['15'].tolist()
print (L3)
[0.4, 1.3, 2.5, 3.7, 5.2, 6.9, 9.0, 11.5, 14.7, 19.5, 28.9, 125.8, 1669.0,
3142.0, 4560.0, 5950.0, 7320.0, 8670.0, 0.0, 0.0, 0.0]
If importing the data from the .txt file with the csv module, the missing data must be filled in. So, after manually adding 0 for the missing cells in the .txt file and retrying this code:
with open('C:/Users/Kevin/Documents/4e Jaar/fotonica/Metingen/P-I curve.txt') as csvfile:
    data = csv.reader(csvfile, delimiter='\t')
    current = []
    P_15 = []
    P_20 = []
    P_25 = []
    P_30 = []
    P_35 = []
    P_40 = []
    P_45 = []
    P_50 = []
    for row in data:
        current.append(float(row[0].replace(',', '.')))
        P_15.append(float(row[2].replace(',', '.')))
print(P_15)
it works for any of the lists you print out. A sketch of a hands-off alternative follows.
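Instead of editing the data file by hand, a small helper can make the csv-based loop tolerant of empty cells. A sketch, assuming missing cells should become 0 as in the manual fix above (the helper name and shortened path are illustrative):
import csv

def cell_to_float(cell, missing=0.0):
    # cells copied from Excel use ',' as the decimal separator,
    # and missing values arrive as empty strings
    cell = cell.strip().replace(',', '.')
    return float(cell) if cell else missing

with open('P-I curve.txt') as csvfile:  # shortened path for the example
    reader = csv.reader(csvfile, delimiter='\t')
    rows = [[cell_to_float(c) for c in row] for row in reader]

current = [row[0] for row in rows]
P_15 = [row[2] for row in rows]
print(P_15)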
