Scraping a table from the web omits certain values - python

I am working on a little coding project to help learn how web scraping works, and decided to extract a table from a fantasy football website I like, which can be found here: https://fantasydata.com/nfl/fantasy-football-leaders?position=1&team=1&season=2018&seasontype=1&scope=1&subscope=1&scoringsystem=2&aggregatescope=1&range=1
When I attempt to grab the table, the first 10 rows come out okay, but starting with Brian Hill's row every value in my table comes up blank. I have inspected the web page as I usually do whenever I run into an issue, and the rows following Hill's seem to follow an identical structure to the ones before it. Any help resolving the issue, and potentially explaining why it is happening in the first place, would be much appreciated!
import pandas
from bs4 import BeautifulSoup
from selenium import webdriver
URLA = 'https://fantasydata.com/nfl/fantasy-football-leaders?position='
URLB = '&team='
URLC = '&season='
URLD = '&seasontype=1&scope=1&subscope=1&scoringsystem=2&aggregatescope=1&range=3'
POSITIONNUMBER = [1,6,7]
TEAMNUMBER = [1]
def buildStatsTable(year):
    fullDF = pandas.DataFrame()
    fullLength = 0
    position = 1
    headers = ['Name', 'Team', 'Pos', 'GMS', 'PassingYards', 'PassingTDs', 'PassingINTs',
               'RushingYDs', 'RushingTDs', 'ReceivingRECs', 'ReceivingYDs', 'ReceivingTDs',
               'FUM LST', 'PPG', 'FPTS']
    for team in TEAMNUMBER:
        currURL = URLA + str(position) + URLB + str(team) + URLC + str(year) + URLD
        driver = webdriver.Chrome()
        driver.get(currURL)
        soup = BeautifulSoup(driver.page_source, "lxml")
        driver.quit()
        tr = soup.findAll('tr', {'role': 'row'})
        length = len(tr)
        offset = length/2
        maxCap = int((length - 1)/2) + 1
        tableList = []
        for i, row in enumerate(tr[2:maxCap]):
            player = row.get_text().split('\n', 2)[1]
            player_row = [value.get_text() for value in tr[int(i + offset + 1)].contents]
            tableList.append([player] + player_row)
        teamDF = pandas.DataFrame(columns=headers, data=tableList)
        fullLength = fullLength + len(tableList)
        fullDF = fullDF.append(teamDF)
    fullDF.index = list(range(0, fullLength))
    return fullDF

falcons = buildStatsTable(2018)
Actual results (only showing the first few columns to keep the post shorter; the issue is consistent across every column):
Name Team Pos GMS PassingYards PassingTDs PassingINTs \
0 Matt Ryan ATL QB 16 4924 35 7
1 Julio Jones ATL WR 16 0 0 0
2 Calvin Ridley ATL WR 16 0 0 0
3 Tevin Coleman ATL RB 16 0 0 0
4 Mohamed Sanu ATL WR 16 5 1 0
5 Austin Hooper ATL TE 16 0 0 0
6 Ito Smith ATL RB 14 0 0 0
7 Justin Hardy ATL WR 16 0 0 0
8 Marvin Hall ATL WR 16 0 0 0
9 Logan Paulsen ATL TE 15 0 0 0
10 Brian Hill ATL RB
11 Devonta Freeman ATL RB
12 Russell Gage ATL WR
13 Eric Saubert ATL TE
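One quick way to narrow this down is to check whether the later rows are empty in the HTML that Selenium hands back, or only in the parsed output. Below is a minimal diagnostic sketch (not a fix), reusing the same URL and Selenium setup as the question:
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://fantasydata.com/nfl/fantasy-football-leaders?position=1&team=1'
           '&season=2018&seasontype=1&scope=1&subscope=1&scoringsystem=2'
           '&aggregatescope=1&range=1')
soup = BeautifulSoup(driver.page_source, 'lxml')
driver.quit()

# print the index, text length, and a short preview of every <tr role="row">,
# to see exactly where the fetched HTML itself goes blank
for i, row in enumerate(soup.find_all('tr', {'role': 'row'})):
    print(i, len(row.get_text(strip=True)), row.get_text(strip=True)[:40])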

Related

How to scrape this football page?

https://fbref.com/en/partidas/25d5b9bd/Coritiba-Cuiaba-2022Julho25-Serie-A
I want to scrape the Team Stats, such as Possession and Shots on Target, and also what's below them, like Fouls, Corners...
What I have now is very overcomplicated code that basically strips and splits this string multiple times to grab the values I want.
import time

import pandas as pd
import requests
from bs4 import BeautifulSoup

# getting a general info dataframe with all matches
championship_url = 'https://fbref.com/en/comps/24/1495/schedule/2016-Serie-A-Scores-and-Fixtures'
data = requests.get(championship_url)
time.sleep(3)
matches = pd.read_html(data.text, match="Resultados e Calendários")[0]

# putting stats info in each match entry (this is an example match to test)
match_url = 'https://fbref.com/en/partidas/25d5b9bd/Coritiba-Cuiaba-2022Julho25-Serie-A'
data = requests.get(match_url)
time.sleep(3)
soup = BeautifulSoup(data.text, features='lxml')

# ID the match to merge later on
home_team = soup.find("h1").text.split()[0]
round_week = float(soup.find("div", {'id': 'content'}).text.split()[18].strip(')'))

# collecting stats
stats = soup.find("div", {"id": "team_stats"}).text.split()[5:]              # first part of stats, with the progress bars
stats_extra = soup.find("div", {"id": "team_stats_extra"}).text.split()[2:]  # second part
all_stats = {'posse_casa': [], 'posse_fora': [], 'chutestotais_casa': [], 'chutestotais_fora': [],
             'acertopasses_casa': [], 'acertopasses_fora': [], 'chutesgol_casa': [], 'chutesgol_fora': [],
             'faltas_casa': [], 'faltas_fora': [], 'escanteios_casa': [], 'escanteios_fora': [],
             'cruzamentos_casa': [], 'cruzamentos_fora': [], 'contatos_casa': [], 'contatos_fora': [],
             'botedef_casa': [], 'botedef_fora': [], 'aereo_casa': [], 'aereo_fora': [],
             'defesas_casa': [], 'defesas_fora': [], 'impedimento_casa': [], 'impedimento_fora': [],
             'tirometa_casa': [], 'tirometa_fora': [], 'lateral_casa': [], 'lateral_fora': [],
             'bolalonga_casa': [], 'bolalonga_fora': [], 'Em casa': [home_team], 'Sem': [round_week]}

# not going to copy everything, but it is roughly like this for each stat
# stats = '\nEstatísticas do time\n\n\nCoritiba \n\n\n\t\n\n\n\n\n\n\n\n\n\n Cuiabá\n\nPosse\n\n\n\n42%\n\n\n\n\n\n58%\n\n\n\n\nChutes ao gol\n\n\n\n2 of 4\xa0—\xa050%\n\n\n\n\n\n0%\xa0—\xa00 of 8\n\n\n\n\nDefesas\n\n\n\n0 of 0\xa0—\xa0%\n\n\n\n\n\n50%\xa0—\xa01 of 2\n\n\n\n\nCartões\n\n\n\n\n\n\n\n\n\n\n\n\n\n'
# first grabbing 42% possession
all_stats['posse_casa'] = stats.replace('\n', '').replace('\t', '')[20:].split('Posse')[1][:5].split('%')[0]
# grabbing 58% possession
all_stats['posse_fora'] = stats.replace('\n', '').replace('\t', '')[20:].split('Posse')[1][:5].split('%')[1]

all_stats_df = pd.DataFrame.from_dict(all_stats)
championship_data = matches.merge(all_stats_df, on=['Em casa', 'Sem'])
There are a lot of stats in that dict because in previous championship years FBref has all of those stats; only in the current-year championship are there just 12 of them to fill. I intend to run the code over 5-6 different years, so I made a version with all the stats, and for current-year games I intend to fill in nothing when a stat is not on the page to scrape.
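As a side note, one way to "fill in nothing" when a stat is missing is to look each stat up by its label and fall back to None, using the same sibling lookup the answer below relies on. This is only a sketch: stat_or_none is a hypothetical helper name, and it assumes the home-team value sits in the tag just before the label, as in the answer's Fouls/Corners example.
import requests
from bs4 import BeautifulSoup

def stat_or_none(soup, label):
    # return the home-team value next to a labelled stat, or None when the label is absent
    tag = soup.find('div', string=label)
    return tag.previous_sibling.text.strip() if tag else None

url = 'https://fbref.com/en/partidas/25d5b9bd/Coritiba-Cuiaba-2022Julho25-Serie-A'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')
print(stat_or_none(soup, 'Fouls'))              # '16' for this match, per the answer below
print(stat_or_none(soup, 'Some missing stat'))  # None when that label is not on the page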
You can get Fouls, Corners and Offsides and 7 tables worth of data from that page with the following code:
from bs4 import BeautifulSoup
import requests
import pandas as pd
url = 'https://fbref.com/en/partidas/25d5b9bd/Coritiba-Cuiaba-2022Julho25-Serie-A'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
coritiba_fouls = soup.find('div', string='Fouls').previous_sibling.text.strip()
cuiaba_fouls = soup.find('div', string='Fouls').next_sibling.text.strip()
coritiba_corners = soup.find('div', string='Corners').previous_sibling.text.strip()
cuiaba_corners = soup.find('div', string='Corners').next_sibling.text.strip()
coritiba_offsides = soup.find('div', string='Offsides').previous_sibling.text.strip()
cuiaba_offsides = soup.find('div', string='Offsides').next_sibling.text.strip()
print('Coritiba Fouls: ' + coritiba_fouls, 'Cuiaba Fouls: ' + cuiaba_fouls)
print('Coritiba Corners: ' + coritiba_corners, 'Cuiaba Corners: ' + cuiaba_corners)
print('Coritiba Offsides: ' + coritiba_offsides, 'Cuiaba Offsides: ' + cuiaba_offsides)
dfs = pd.read_html(r.text)
print('Number of tables: ' + str(len(dfs)))
for df in dfs:
    print(df)
    print('___________')
This will print in the terminal:
Coritiba Fouls: 16 Cuiaba Fouls: 12
Coritiba Corners: 4 Cuiaba Corners: 4
Coritiba Offsides: 0 Cuiaba Offsides: 1
Number of tables: 7
Coritiba (4-2-3-1) Coritiba (4-2-3-1).1
0 23 Alex Muralha
1 2 Matheus Alexandre
2 3 Henrique
3 4 Luciano Castán
4 6 Egídio Pereira Júnior
5 9 Léo Gamalho
6 11 Alef Manga
7 25 Bernanrdo Lemes
8 78 Régis
9 97 Valdemir
10 98 Igor Paixão
11 Bench Bench
12 21 Rafael William
13 5 Guillermo de los Santos
14 15 Matías Galarza
15 16 Natanael
16 18 Guilherme Biro
17 19 Thonny Anderson
18 28 Pablo Javier García
19 32 Bruno Gomes
20 44 Márcio Silva
21 52 Adrián Martínez
22 75 Luiz Gabriel
23 88 Hugo
___________
Cuiabá (4-1-4-1) Cuiabá (4-1-4-1).1
0 1 Walter
1 2 João Lucas
2 3 Joaquim
3 4 Marllon Borges
4 5 Camilo
5 6 Igor Cariús
6 7 Alesson
7 8 João Pedro Pepê
8 9 Valdívia
9 10 Rodriguinho Marinho
10 11 Rafael Gava
11 Bench Bench
12 12 João Carlos
13 13 Daniel Guedes
14 14 Paulão
15 15 Marcão Silva
16 16 Cristian Rivas
17 17 Gabriel Pirani
18 18 Jenison
19 19 André
20 20 Kelvin Osorio
21 21 Jonathan Cafu
22 22 André Luis
23 23 Felipe Marques
___________
Coritiba Cuiabá
Possession Possession
0 42% 58%
1 Shots on Target Shots on Target
2 2 of 4 — 50% 0% — 0 of 8
3 Saves Saves
4 0 of 0 — % 50% — 1 of 2
5 Cards Cards
6 NaN NaN
_____________
[....]

Python: Replace multiple old values to new value Pandas

I've been searching around for a while now, but I can't seem to find the answer to this small problem.
I have this code for replacing values:
import pandas as pd

df = {'Name': ['al', 'el', 'naila', 'dori', 'jlo'],
      'living': ['Alvando', 'Georgia GG', 'Newyork NY', 'Indiana IN', 'Florida FL'],
      'sample2': ['malang', 'kaltim', 'ambon', 'jepara', 'sragen'],
      'output': ['KOTA', 'KAB', 'WILAYAH', 'KAB', 'DAERAH']}
df = pd.DataFrame(df)
df = df.replace(['KOTA', 'WILAYAH', 'DAERAH'], 0)
df = df.replace('KAB', 1)
But I am actually expecting this output, with simpler code that doesn't repeat replace:
Name living sample2 output
0 al Alvando malang 0
1 el Georgia GG kaltim 1
2 naila Newyork NY ambon 0
3 dori Indiana IN jepara 1
4 jlo Florida FL sragen 0
I've tried using np.where, but it doesn't give the desired result: all rows display 0, even where the original value should map to 1.
df['output'] = pd.DataFrame({'output':np.where(df == "KAB", 1, 0).reshape(-1, )})
This code should work for you:
df = df.replace(['KOTA', 'WILAYAH', 'DAERAH'], 0).replace('KAB', 1)
Output:
>>> df
Name living sample2 output
0 al Alvando malang 0
1 el Georgia GG kaltim 1
2 naila Newyork NY ambon 0
3 dori Indiana IN jepara 1
4 jlo Florida FL sragen 0
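If you want to avoid replace entirely, a vectorized alternative is np.where applied to just the output column; this is also where the attempt in the question went wrong, since it compared the whole DataFrame instead of that one column. A minimal sketch:
import numpy as np

# 1 where the value is 'KAB', 0 otherwise
df['output'] = np.where(df['output'] == 'KAB', 1, 0)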

scrape tennis results including tournament for each match row

I want to scrape tennis match results from this website.
The results table I want has the columns: tournament_name match_time player_1 player_2 player_1_score player_2_score
This is an example
tournament_name match_time player_1 player_2 p1_set1 p2_set1
Roma / Italy 11:00 Krajinovic Filip Auger Aliassime Felix 6 4
Iasi (IX) / Romania 10:00 Bourgue Mathias Martineau Matteo 6 1
I can't associate each tournament name (the id="main_tour" cell) with each match row (one logical row is two class="match" or two class="match1" table rows).
I tried this code:
import requests
from bs4 import BeautifulSoup
u = "http://www.tennisprediction.com/?year=2020&month=9&day=14"
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0'}
session = requests.Session()
r = session.get(u, timeout=30, headers=headers)
# print(r.status_code)
soup = BeautifulSoup(r.content, 'html.parser')
for table in soup.select('#main_tur'):
    tourn_value = [i.get_text(strip=True) for i in table.select('tr:nth-child(1)')][0].split('/')[0].strip()
    tourn_name = [i.get_text(strip=True) for i in table.select('tr td#main_tour')]
    row = [i.get_text(strip=True) for i in table.select('.match')]
    row2 = [i.get_text(strip=True) for i in table.select('.match1')]
    print(tourn_value, tourn_name)
You can use this script to save the table to CSV in your format:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'http://www.tennisprediction.com/?year=2020&month=9&day=14'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
all_data = []
for t in soup.select('.main_time'):
    p1 = t.find_next(class_='main_player')
    p2 = p1.find_next(class_='main_player')
    tour = t.find_previous(id='main_tour')
    scores1 = {'player_1_set{}'.format(i): s for i, s in enumerate((tag.get_text(strip=True) for tag in t.parent.select('.main_res')), 1)}
    scores2 = {'player_2_set{}'.format(i): s for i, s in enumerate((tag.get_text(strip=True) for tag in t.parent.find_next_sibling().select('.main_res')), 1)}
    all_data.append({
        'tournament_name': ' / '.join(a.text for a in tour.select('a')),
        'match_time': t.text,
        'player_1': p1.get_text(strip=True, separator=' '),
        'player_2': p2.get_text(strip=True, separator=' '),
    })
    all_data[-1].update(scores1)
    all_data[-1].update(scores2)

df = pd.DataFrame(all_data)
df.to_csv('data.csv')
print(df)
Saves data.csv:
EDIT: To add Odd, Prob columns for both players:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'http://www.tennisprediction.com/?year=2020&month=9&day=14'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
all_data = []
for t in soup.select('.main_time'):
    p1 = t.find_next(class_='main_player')
    p2 = p1.find_next(class_='main_player')
    tour = t.find_previous(id='main_tour')
    odd1 = t.find_next(class_='main_odds_m')
    odd2 = t.parent.find_next_sibling().find_next(class_='main_odds_m')
    prob1 = t.find_next(class_='main_perc')
    prob2 = t.parent.find_next_sibling().find_next(class_='main_perc')
    scores1 = {'player_1_set{}'.format(i): s for i, s in enumerate((tag.get_text(strip=True) for tag in t.parent.select('.main_res')), 1)}
    scores2 = {'player_2_set{}'.format(i): s for i, s in enumerate((tag.get_text(strip=True) for tag in t.parent.find_next_sibling().select('.main_res')), 1)}
    all_data.append({
        'tournament_name': ' / '.join(a.text for a in tour.select('a')),
        'match_time': t.text,
        'player_1': p1.get_text(strip=True, separator=' '),
        'player_2': p2.get_text(strip=True, separator=' '),
        'odd1': odd1.text,
        'prob1': prob1.text,
        'odd2': odd2.text,
        'prob2': prob2.text
    })
    all_data[-1].update(scores1)
    all_data[-1].update(scores2)
df = pd.DataFrame(all_data)
df.to_csv('data.csv')
print(df)
Andrej's solution is really nice and elegant. Accept his solution, but here was my go at it:
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = 'http://www.tennisprediction.com/?year=2020&month=9&day=14'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
rows=[]
for matchClass in ['match', 'match1']:
    matches = soup.find_all('tr', {'class': matchClass})
    for idx, match in enumerate(matches):
        if idx % 2 != 0:
            continue
        row = {}
        tourny = match.find_previous('td', {'id': 'main_tour'}).text
        time = match.find('td', {'class': 'main_time'}).text
        p1 = match.find('td', {'class': 'main_player'})
        player_1 = p1.text
        row.update({'tournament_name': tourny, 'match_time': time, 'player_1': player_1})
        sets = p1.find_previous('tr', {'class': matchClass}).find_all('td', {'class': 'main_res'})
        for idx, each_set in enumerate(sets):
            row.update({'p1_set%d' % (idx + 1): each_set.text})
        p2 = match.find_next('td', {'class': 'main_player'})
        player_2 = p2.text
        row.update({'player_2': player_2})
        sets = p2.find_next('tr', {'class': matchClass}).find_all('td', {'class': 'main_res'})
        for idx, each_set in enumerate(sets):
            row.update({'p2_set%d' % (idx + 1): each_set.text})
        rows.append(row)

df = pd.DataFrame(rows)
Output:
print(df.head(10).to_string())
tournament_name match_time player_1 p1_set1 p1_set2 p1_set3 p1_set4 p1_set5 player_2 p2_set1 p2_set2 p2_set3 p2_set4 p2_set5
0 Roma / Italy prize / money : 5791 000 USD 11:10 Krajinovic Filip (SRB) (26) 6 7 Krajinovic Filip (SRB) (26) 4 5
1 Roma / Italy prize / money : 5791 000 USD 13:15 Dimitrov Grigor (BGR) (20) 7 6 Dimitrov Grigor (BGR) (20) 5 1
2 Roma / Italy prize / money : 5791 000 USD 13:50 Coric Borna (HRV) (32) 6 6 Coric Borna (HRV) (32) 4 4
3 Roma / Italy prize / money : 5791 000 USD 15:30 Humbert Ugo (FRA) (42) 6 7 Humbert Ugo (FRA) (42) 3 6 (5)
4 Roma / Italy prize / money : 5791 000 USD 19:00 Nishikori Kei (JPN) (34) 6 7 Nishikori Kei (JPN) (34) 4 6 (3)
5 Roma / Italy prize / money : 5791 000 USD 22:00 Travaglia Stefano (ITA) (87) 6 7 Travaglia Stefano (ITA) (87) 4 6 (4)
6 Iasi (IX) / Romania prize / money : 100 000 USD 10:05 Menezes Joao (BRA) (189) 6 6 Menezes Joao (BRA) (189) 4 4
7 Iasi (IX) / Romania prize / money : 100 000 USD 12:05 Cretu Cezar (2001) (ROU) 2 6 6 Cretu Cezar (2001) (ROU) 6 3 4
8 Iasi (IX) / Romania prize / money : 100 000 USD 14:35 Zuk Kacper (POL) (306) 6 6 Zuk Kacper (POL) (306) 2 0
9 Roma / Italy prize / money : 3452 000 USD 11:05 Pavlyuchenkova Anastasia (RUS) (32) 6 6 6 Pavlyuchenkova Anastasia (RUS) (32) 4 7 (5) 1

bs4 find table by id, returning 'None'

Not sure why this isn't working :( I'm able to pull other tables from this page, just not this one.
import requests
from bs4 import BeautifulSoup as soup
url = requests.get("https://www.basketball-reference.com/teams/BOS/2018.html",
headers={'User-Agent': 'Mozilla/5.0'})
page = soup(url.content, 'html')
table = page.find('table', id='team_and_opponent')
print(table)
Appreciate the help.
The page is dynamic. So you have 2 options in this case.
Side note: If you see <table> tags, don't use BeautifulSoup, pandas can do that work for you (and it actually uses bs4 under the hood) by using pd.read_html()
1) Use Selenium to first render the page, and THEN use BeautifulSoup to pull out the <table> tags (a rough sketch of this is shown below)
2) Those tables are within the comment tags in the html. You can use BeautifulSoup to pull out the comments, then just grab the ones with 'table'.
I chose option 2.
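For reference, option 1 might look roughly like this (a sketch, assuming Selenium and a chromedriver matching your Chrome install are available):
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://www.basketball-reference.com/teams/BOS/2018.html')
page = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()

# once the page has been rendered, the table should be findable by its id
table = page.find('table', id='team_and_opponent')
print(table is not None)
The code below is the option 2 approach: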
import requests
from bs4 import BeautifulSoup
from bs4 import Comment
import pandas as pd
url = 'https://www.basketball-reference.com/teams/BOS/2018.html'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
tables = []
for each in comments:
    if 'table' in each:
        try:
            tables.append(pd.read_html(each)[0])
        except:
            continue
I don't know which particular table you want, but they are there in the list of tables
Output:
print (tables[1])
Unnamed: 0 G MP FG FGA ... STL BLK TOV PF PTS
0 Team 82.0 19805 3141 6975 ... 604 373 1149 1618 8529
1 Team/G NaN 241.5 38.3 85.1 ... 7.4 4.5 14.0 19.7 104.0
2 Lg Rank NaN 12 25 25 ... 23 18 15 17 20
3 Year/Year NaN 0.3% -0.9% -0.0% ... -2.1% 9.7% 5.6% -4.0% -3.7%
4 Opponent 82.0 19805 3066 6973 ... 594 364 1159 1571 8235
5 Opponent/G NaN 241.5 37.4 85.0 ... 7.2 4.4 14.1 19.2 100.4
6 Lg Rank NaN 12 3 12 ... 7 6 19 9 3
7 Year/Year NaN 0.3% -3.2% -0.9% ... -4.7% -14.4% 1.6% -5.6% -4.7%
[8 rows x 24 columns]
or
print (tables[18])
Rk Unnamed: 1 Salary
0 1 Gordon Hayward $29,727,900
1 2 Al Horford $27,734,405
2 3 Kyrie Irving $18,868,625
3 4 Jayson Tatum $5,645,400
4 5 Greg Monroe $5,000,000
5 6 Marcus Morris $5,000,000
6 7 Jaylen Brown $4,956,480
7 8 Marcus Smart $4,538,020
8 9 Aron Baynes $4,328,000
9 10 Guerschon Yabusele $2,247,480
10 11 Terry Rozier $1,988,520
11 12 Shane Larkin $1,471,382
12 13 Semi Ojeleye $1,291,892
13 14 Abdel Nader $1,167,333
14 15 Daniel Theis $815,615
15 16 Demetrius Jackson $92,858
16 17 Jarell Eddie $83,129
17 18 Xavier Silas $74,159
18 19 Jonathan Gibson $44,495
19 20 Jabari Bird $0
20 21 Kadeem Allen $0
There is no table with id team_and_opponent on that page; rather, there is a span tag with this id. You can get results by changing the id.
This data is probably loaded dynamically (with JavaScript).
You should take a look here: Web-scraping JavaScript page with Python
For that you can use Selenium or requests-html, which support JavaScript.
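A minimal sketch of the requests-html route (render() downloads a Chromium build the first time it runs; whether the table appears after rendering depends on the page's JavaScript injecting the commented-out markup):
from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://www.basketball-reference.com/teams/BOS/2018.html')
r.html.render()  # executes the page's JavaScript
table = r.html.find('table#team_and_opponent', first=True)
print(table is not None)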
import requests
import bs4
url = requests.get("https://www.basketball-reference.com/teams/BOS/2018.html",
headers={'User-Agent': 'Mozilla/5.0'})
soup=bs4.BeautifulSoup(url.text,"lxml")
page=soup.select(".table_outer_container")
for i in page:
    print(i.text)
You will get your desired output.

Setting values when iterating through a DataFrame

I have a dictionary of states (for example, IA: Idaho). I have loaded the dictionary into a DataFrame, byState_df.
Then I am importing a CSV with state deaths that I want to add to byState_df as I read the lines:
byState_df = pd.DataFrame(states.items())
byState_df['Deaths'] = 0
df['Deaths'] = df['Deaths'].convert_objects(convert_numeric=True)
print byState_df
for index, row in df.iterrows():
    if row['Area'] in states:
        byState_df[(byState_df[0] == row['Area'])]['Deaths'] = row['Deaths']
print byState_df
But byState_df is still all 0 afterwards:
0 1 Deaths
0 WA Washington 0
1 WI Wisconsin 0
2 WV West Virginia 0
3 FL Florida 0
4 WY Wyoming 0
5 NH New Hampshire 0
6 NJ New Jersey 0
7 NM New Mexico 0
8 NA National 0
I tested row['Deaths'] while it iterates and it's producing the correct values; it just seems to be setting the byState_df value incorrectly.
Try the following code, which uses .loc instead of chained [][] indexing. Chained indexing like byState_df[...]['Deaths'] = ... assigns to a temporary copy, so the original DataFrame is never updated; .loc sets the value on the original frame:
byState_df = pd.DataFrame(states.items())
byState_df['Deaths'] = 0
df['Deaths'] = df['Deaths'].convert_objects(convert_numeric=True)
print byState_df
for index, row in df.iterrows():
    if row['Area'] in states:
        byState_df.loc[byState_df[0] == row['Area'], 'Deaths'] = row['Deaths']
print byState_df
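As a side note, the loop can be avoided entirely with a vectorized map (a sketch, assuming df['Area'] holds the same state abbreviations that sit in column 0 of byState_df):
# map deaths by state abbreviation instead of iterating row by row
deaths_by_area = df.set_index('Area')['Deaths']
byState_df['Deaths'] = byState_df[0].map(deaths_by_area).fillna(0)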
