How to scrape this football page? - python

https://fbref.com/en/partidas/25d5b9bd/Coritiba-Cuiaba-2022Julho25-Serie-A
I want to scrape the Team Stats, such as Possession and Shots on Target, and also what's below them, like Fouls, Corners...
What I have now is overcomplicated code that basically strips and splits the page text multiple times to grab the values I want.
import time
import requests
import pandas as pd
from bs4 import BeautifulSoup
#getting a general info dataframe with all matches
championship_url = 'https://fbref.com/en/comps/24/1495/schedule/2016-Serie-A-Scores-and-Fixtures'
data = requests.get(championship_url)
time.sleep(3)
matches = pd.read_html(data.text, match="Resultados e Calendários")[0]
#putting stats info in each match entry (this is an example match to test)
match_url = 'https://fbref.com/en/partidas/25d5b9bd/Coritiba-Cuiaba-2022Julho25-Serie-A'
data = requests.get(match_url)
time.sleep(3)
soup = BeautifulSoup(data.text, features='lxml')
# ID the match to merge later on
home_team = soup.find("h1").text.split()[0]
round_week = float(soup.find("div", {'id': 'content'}).text.split()[18].strip(')'))
# collecting stats
stats = soup.find("div", {"id": "team_stats"}).text.split()[5:] #first part of stats with the progress bars
stats_extra = soup.find("div", {"id": "team_stats_extra"}).text.split()[2:] #second part
all_stats = {'posse_casa':[], 'posse_fora':[], 'chutestotais_casa':[], 'chutestotais_fora':[],
'acertopasses_casa':[], 'acertopasses_fora':[], 'chutesgol_casa':[], 'chutesgol_fora':[],
'faltas_casa':[], 'faltas_fora':[], 'escanteios_casa':[], 'escanteios_fora':[],
'cruzamentos_casa':[], 'cruzamentos_fora':[], 'contatos_casa':[], 'contatos_fora':[],
'botedef_casa':[], 'botedef_fora':[], 'aereo_casa':[], 'aereo_fora':[],
'defesas_casa':[], 'defesas_fora':[], 'impedimento_casa':[], 'impedimento_fora':[],
'tirometa_casa':[], 'tirometa_fora':[], 'lateral_casa':[], 'lateral_fora':[],
'bolalonga_casa':[], 'bolalonga_fora':[], 'Em casa':[home_team], 'Sem':[round_week]}
#not gonna copy everything but is kinda like this for each stat
#stats = '\nEstatísticas do time\n\n\nCoritiba \n\n\n\t\n\n\n\n\n\n\n\n\n\n Cuiabá\n\nPosse\n\n\n\n42%\n\n\n\n\n\n58%\n\n\n\n\nChutes ao gol\n\n\n\n2 of 4\xa0—\xa050%\n\n\n\n\n\n0%\xa0—\xa00 of 8\n\n\n\n\nDefesas\n\n\n\n0 of 0\xa0—\xa0%\n\n\n\n\n\n50%\xa0—\xa01 of 2\n\n\n\n\nCartões\n\n\n\n\n\n\n\n\n\n\n\n\n\n'
#first grabbing 42% possession
all_stats['posse_casa']=stats.replace('\n','').replace('\t','')[20:].split('Posse')[1][:5].split('%')[0]
#grabbing 58% possession
all_stats['posse_fora']=stats.replace('\n','').replace('\t','')[20:].split('Posse')[1][:5].split('%')[1]
all_stats_df = pd.DataFrame.from_dict(all_stats)
championship_data = matches.merge(all_stats_df, on=['Em casa','Sem'])
There are a lot of stats in that dict because FBref has all of those stats for previous championship years; only the current year's championship has just 12 of them to fill. I intend to run the code on 5-6 different years, so I made a version with all the stats, and for current-year games I intend to leave a stat empty when the page has nothing to scrape for it.

You can get Fouls, Corners and Offsides and 7 tables worth of data from that page with the following code:
from bs4 import BeautifulSoup
import requests
import pandas as pd
url = 'https://fbref.com/en/partidas/25d5b9bd/Coritiba-Cuiaba-2022Julho25-Serie-A'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
coritiba_fouls = soup.find('div', string='Fouls').previous_sibling.text.strip()
cuiaba_fouls = soup.find('div', string='Fouls').next_sibling.text.strip()
coritiba_corners = soup.find('div', string='Corners').previous_sibling.text.strip()
cuiaba_corners = soup.find('div', string='Corners').next_sibling.text.strip()
coritiba_offsides = soup.find('div', string='Offsides').previous_sibling.text.strip()
cuiaba_offsides = soup.find('div', string='Offsides').next_sibling.text.strip()
print('Coritiba Fouls: ' + coritiba_fouls, 'Cuiaba Fouls: ' + cuiaba_fouls)
print('Coritiba Corners: ' + coritiba_corners, 'Cuiaba Corners: ' + cuiaba_corners)
print('Coritiba Offsides: ' + coritiba_offsides, 'Cuiaba Offsides: ' + cuiaba_offsides)
dfs = pd.read_html(r.text)
print('Number of tables: ' + str(len(dfs)))
for df in dfs:
    print(df)
    print('___________')
This will print in the terminal:
Coritiba Fouls: 16 Cuiaba Fouls: 12
Coritiba Corners: 4 Cuiaba Corners: 4
Coritiba Offsides: 0 Cuiaba Offsides: 1
Number of tables: 7
Coritiba (4-2-3-1) Coritiba (4-2-3-1).1
0 23 Alex Muralha
1 2 Matheus Alexandre
2 3 Henrique
3 4 Luciano Castán
4 6 Egídio Pereira Júnior
5 9 Léo Gamalho
6 11 Alef Manga
7 25 Bernanrdo Lemes
8 78 Régis
9 97 Valdemir
10 98 Igor Paixão
11 Bench Bench
12 21 Rafael William
13 5 Guillermo de los Santos
14 15 Matías Galarza
15 16 Natanael
16 18 Guilherme Biro
17 19 Thonny Anderson
18 28 Pablo Javier García
19 32 Bruno Gomes
20 44 Márcio Silva
21 52 Adrián Martínez
22 75 Luiz Gabriel
23 88 Hugo
___________
Cuiabá (4-1-4-1) Cuiabá (4-1-4-1).1
0 1 Walter
1 2 João Lucas
2 3 Joaquim
3 4 Marllon Borges
4 5 Camilo
5 6 Igor Cariús
6 7 Alesson
7 8 João Pedro Pepê
8 9 Valdívia
9 10 Rodriguinho Marinho
10 11 Rafael Gava
11 Bench Bench
12 12 João Carlos
13 13 Daniel Guedes
14 14 Paulão
15 15 Marcão Silva
16 16 Cristian Rivas
17 17 Gabriel Pirani
18 18 Jenison
19 19 André
20 20 Kelvin Osorio
21 21 Jonathan Cafu
22 22 André Luis
23 23 Felipe Marques
___________
Coritiba Cuiabá
Possession Possession
0 42% 58%
1 Shots on Target Shots on Target
2 2 of 4 — 50% 0% — 0 of 8
3 Saves Saves
4 0 of 0 — % 50% — 1 of 2
5 Cards Cards
6 NaN NaN
_____________
[....]
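The Team Stats table returns cells as strings such as "2 of 4 — 50%", so they still need a small parsing step before they can go into numeric columns. Below is a minimal sketch of that step using only the standard library. The helper name `parse_stat_cell` is hypothetical, and the sample strings are copied from the output above (with the non-breaking spaces that show up in the raw page text).

```python
import re

def parse_stat_cell(cell):
    """Return (made, total, percent) from a cell such as '2 of 4 - 50%'.

    Any part that is missing from the cell comes back as None.
    """
    counts = re.search(r"(\d+)\s+of\s+(\d+)", cell)
    percent = re.search(r"(\d+)%", cell)
    made = int(counts.group(1)) if counts else None
    total = int(counts.group(2)) if counts else None
    pct = int(percent.group(1)) if percent else None
    return made, total, pct

# \u00a0 is a non-breaking space, \u2014 is the dash used on the page
home_shots = parse_stat_cell("2 of 4\u00a0\u2014\u00a050%")
away_shots = parse_stat_cell("0%\u00a0\u2014\u00a00 of 8")
home_saves = parse_stat_cell("0 of 0\u00a0\u2014\u00a0%")
home_possession = parse_stat_cell("42%")
```

Because missing parts come back as None, the same helper could be reused for seasons where a stat is absent from the page.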

Related

get the number of involved singer in a phase

I have a dataset like this
import pandas as pd
df = pd.read_csv("music.csv")
df
    name        date      singer                      language  phase
1   Yes or No   02.01.20  Benjamin Smith              en        1
2   Parabens    01.06.21  Rafael Galvao;Simon Murphy  pt;en     2
3   Love        12.11.20  Michaela Condell            en        1
4   Paz         11.07.19  Ana Perez; Eduarda Pinto    es;pt     3
5   Stop        12.01.21  Michael Conway;Gabriel Lee  en;en     1
6   Shalom      18.06.21  Shimon Cohen                hebr      1
7   Habibi      22.12.19  Fuad Khoury                 ar        3
8   viva        01.08.21  Veronica Barnes             en        1
9   Buznanna    23.09.20  Kurt Azzopardi              mt        1
10  Frieden     21.05.21  Gabriel Meier               dt        1
11  Uruguay     11.04.21  Julio Ramirez               es        1
12  Beautiful   17.03.21  Cameron Armstrong           en        3
13  Holiday     19.06.20  Bianca Watson               en        3
14  Kiwi        21.10.20  Lachlan McNamara            en        1
15  Amore       01.12.20  Vasco Grimaldi              it        1
16  La vie      28.04.20  Victor Dubois               fr        3
17  Yom         21.02.20  Ori Azerad; Naeem al-Hindi  hebr;ar   2
18  Elefthería  15.06.19  Nikolaos Gekas              gr        1
I convert it to 1NF.
import pandas as pd
import numpy as np
df = pd.read_csv("music.csv")
df['language']=df['language'].str.split(';')
df['singer']=df['singer'].str.split(";")
d = df.explode(['language','singer'])
d
And I create a dataframe. Now I would like to find out which phase has the most singers involved.
I used this
df= df.group.by('singer')
df['phase']. value_counts(). idxmax()
But I could not get a solution
The dataframe has 42 observations, so some singers occur again
Source: convert data to 1NF
You do not need to split/explode, you can directly count the number of ; per row and add 1:
df['singer'].str.count(';').add(1).groupby(df['phase']).sum()
If you want the classical split/explode:
(df.assign(singer=df['singer'].str.split(';'))
   .explode('singer')
   .groupby('phase')['singer'].count()
)
output:
phase
1 12
2 4
3 6
Name: singer, dtype: int64
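To get from these per-phase counts to the single phase with the most singers, which is what the question asks for, `idxmax()` on the resulting Series should be enough. A small sketch, assuming a cut-down frame built from the first five rows of the data above (the variable names are only illustrative):

```python
import pandas as pd

# First five rows of the question's data, reduced to the two relevant columns
df = pd.DataFrame({
    "singer": [
        "Benjamin Smith",
        "Rafael Galvao;Simon Murphy",
        "Michaela Condell",
        "Ana Perez; Eduarda Pinto",
        "Michael Conway;Gabriel Lee",
    ],
    "phase": [1, 2, 1, 3, 1],
})

# Number of singers per row = number of ';' separators + 1, summed per phase
singers_per_phase = df["singer"].str.count(";").add(1).groupby(df["phase"]).sum()

# Label (phase) of the largest count
top_phase = singers_per_phase.idxmax()
print(singers_per_phase)
print("Phase with most singers:", top_phase)
```

Note that `idxmax()` returns only the first label when several phases tie for the maximum.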

BeautifulSoup doesn't display the content

I want to scrape spot price data from MCX India website.
The HTML script as visible on inspecting an element is as follows:
<div class="contents spotmarketprice">
<div id="cont-1" style="display: block;">
<table class="mcx-table mrB20" width="100%" cellspacing="8" id="tblSMP">
<thead>
<tr>
<th class="symbol-head">
Commodity
</th>
<th>
Unit
</th>
<th class="left1">
Location
</th>
<th class="right1">
Spot Price (Rs.)
</th>
<th>
Up/Down
</th>
</tr>
</thead>
<tbody>
<tr>
<td class="symbol" style="width:30%;">ALMOND</td>
<td style="width:17%;">1 KGS</td>
<td align="left" style="width:17%;">DELHI</td>
<td align="right" style="width:17%;">558.00</td>
<td align="right" class="padR20" style="width:19%;">=</td>
</tr>
The code I have written is:
#import the required libraries
from bs4 import BeautifulSoup
import requests
#Getting data from website
source= requests.get('http://www.mcxindia.com/market-data/spot-market-price').text
#Getting the html code of the website
soup = BeautifulSoup(source, 'lxml')
#Navigating to the blocks where required content is present
division_1= soup.find('div', class_="contents spotmarketprice").div.table
#Displaying the results
print(division_1.tbody)
Output:
<tbody>
</tbody>
On the website, the content that I want to get is available in ... But, it is not showing any content here. Please, suggest a solution to this.
import requests
import re
import json
import pandas as pd
goal = ['EnSymbol', 'Unit', 'Location', 'TodaysSpotPrice']
def main(url):
    r = requests.get(url)
    match = json.loads(re.search(r'"Data":(\[.*?\])', r.text).group(1))
    allin = []
    for item in match:
        allin.append([item[x] for x in goal])
    df = pd.DataFrame(allin, columns=goal)
    print(df)
main("https://www.mcxindia.com/market-data/spot-market-price")
Output:
EnSymbol Unit Location TodaysSpotPrice
0 ALMOND 1 KGS DELHI 558.00
1 ALUMINIUM 1 KGS THANE 137.60
2 CARDAMOM 1 KGS VANDANMEDU 2525.00
3 CASTORSEED 100 KGS DEESA 3626.00
4 CHANA 100 KGS DELHI 4163.00
5 COPPER 1 KGS THANE 388.30
6 COTTON 1 BALES RAJKOT 15880.00
7 CPO 10 KGS KANDLA 635.90
8 CRUDEOIL 1 BBL MUMBAI 2418.00
9 GOLD 10 GRMS AHMEDABAD 40989.00
10 GOLDGUINEA 8 GRMS AHMEDABAD 32923.00
11 GOLDM 10 GRMS AHMEDABAD 40989.00
12 GOLDPETAL 1 GRMS MUMBAI 4129.00
13 GUARGUM 100 KGS JODHPUR 5880.00
14 GUARSEED 100 KGS JODHPUR 3660.00
15 KAPAS 20 KGS RAJKOT 927.50
16 LEAD 1 KGS CHENNAI 141.60
17 MENTHAOIL 1 KGS CHANDAUSI 1295.10
18 NATURALGAS 1 mmBtu HAZIRA 138.50
19 NICKEL 1 KGS THANE 892.00
20 PEPPER 100 KGS KOCHI 32700.00
21 RAW JUTE 100 KGS KOLKATA 4999.00
22 RBD PALMOLEIN 10 KGS KANDLA 700.40
23 REFSOYOIL 10 KGS INDORE 845.25
24 SILVER 1 KGS AHMEDABAD 36871.00
25 SILVERM 1 KGS AHMEDABAD 36871.00
26 SILVERMIC 1 KGS AHMEDABAD 36871.00
27 SUGARMDEL 100 KGS DELHI 3380.00
28 SUGARMKOL 100 KGS KOLHAPUR 3334.00
29 SUGARSKLP 100 KGS KOLHAPUR 3275.00
30 TIN 1 KGS MUMBAI 1160.50
31 WHEAT 100 KGS DELHI 1977.50
32 ZINC 1 KGS THANE 155.15
If you also want a symbol showing the direction of the change, here's a version that adds it:
import requests
import re
import json
import pandas as pd
goal = ['EnSymbol', 'Unit', 'Location', 'TodaysSpotPrice', 'Change']
def main(url):
    r = requests.get(url)
    match = json.loads(re.search(r'"Data":(\[.*?\])', r.text).group(1))
    allin = []
    for item in match:
        item = [item[x] for x in goal]
        item[-1] = '▲' if item[-1] > 0 else '▼' if item[-1] < 0 else "="
        allin.append(item)
    df = pd.DataFrame(allin, columns=goal)
    print(df)
main("https://www.mcxindia.com/market-data/spot-market-price")
Output:
EnSymbol Unit Location TodaysSpotPrice Change
0 ALMOND 1 KGS DELHI 558.00 =
1 ALUMINIUM 1 KGS THANE 137.60 =
2 CARDAMOM 1 KGS VANDANMEDU 2525.00 =
3 CASTORSEED 100 KGS DEESA 3626.00 =
4 CHANA 100 KGS DELHI 4163.00 =
5 COPPER 1 KGS THANE 388.30 =
6 COTTON 1 BALES RAJKOT 15880.00 ▲
7 CPO 10 KGS KANDLA 635.90 ▲
8 CRUDEOIL 1 BBL MUMBAI 2418.00 ▲
9 GOLD 10 GRMS AHMEDABAD 40989.00 =
10 GOLDGUINEA 8 GRMS AHMEDABAD 32923.00 =
11 GOLDM 10 GRMS AHMEDABAD 40989.00 =
12 GOLDPETAL 1 GRMS MUMBAI 4129.00 =
13 GUARGUM 100 KGS JODHPUR 5880.00 =
14 GUARSEED 100 KGS JODHPUR 3660.00 =
15 KAPAS 20 KGS RAJKOT 927.50 ▲
16 LEAD 1 KGS CHENNAI 141.60 =
17 MENTHAOIL 1 KGS CHANDAUSI 1295.10 =
18 NATURALGAS 1 mmBtu HAZIRA 138.50 ▲
19 NICKEL 1 KGS THANE 892.00 =
20 PEPPER 100 KGS KOCHI 32600.00 ▼
21 RAW JUTE 100 KGS KOLKATA 4999.00 =
22 RBD PALMOLEIN 10 KGS KANDLA 700.40 ▼
23 REFSOYOIL 10 KGS INDORE 845.25 =
24 SILVER 1 KGS AHMEDABAD 36871.00 =
25 SILVERM 1 KGS AHMEDABAD 36871.00 =
26 SILVERMIC 1 KGS AHMEDABAD 36871.00 =
27 SUGARMDEL 100 KGS DELHI 3380.00 ▼
28 SUGARMKOL 100 KGS KOLHAPUR 3334.00 ▲
29 SUGARSKLP 100 KGS KOLHAPUR 3275.00 ▼
30 TIN 1 KGS MUMBAI 1160.50 ▼
31 WHEAT 100 KGS DELHI 1977.50 ▲
32 ZINC 1 KGS THANE 155.15 =
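The two scripts above work because the table rows are already present in the raw page source as a JSON array after a "Data" key, so a regex plus `json.loads` can read them without running any JavaScript. Here is an offline sketch of that idea on a made-up page fragment. The variable name `vSMP`, the surrounding script tag and the `Change` values are assumptions for illustration; only the field names follow the code above.

```python
import json
import re

# Hypothetical fragment of page source (assumption, not copied from the site)
page_source = """
<script>
var vSMP = {"Data":[
  {"EnSymbol":"ALMOND","Unit":"1 KGS","Location":"DELHI","TodaysSpotPrice":558.0,"Change":0.0},
  {"EnSymbol":"COTTON","Unit":"1 BALES","Location":"RAJKOT","TodaysSpotPrice":15880.0,"Change":90.0}
],"Summary":null};
</script>
"""

def extract_rows(source, fields):
    """Pull the JSON array that follows "Data": and keep only the given fields."""
    # Non-greedy match up to the first ']' (breaks if records contain nested arrays);
    # re.DOTALL lets the array span several lines.
    found = re.search(r'"Data":(\[.*?\])', source, flags=re.DOTALL)
    if found is None:
        return []
    records = json.loads(found.group(1))
    return [[record.get(field) for field in fields] for record in records]

def direction(change):
    """Map a numeric change to the symbol shown on the site."""
    return "▲" if change > 0 else "▼" if change < 0 else "="

rows = extract_rows(page_source, ["EnSymbol", "TodaysSpotPrice", "Change"])
table = [[symbol, price, direction(change)] for symbol, price, change in rows]
print(table)
```

Returning an empty list when the pattern is missing makes it easier to notice if the site changes its markup, instead of failing with an AttributeError on `.group(1)`.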
It does seem like the data within the table is being loaded through JavaScript.
That's why, if you try to fetch this information using the requests library, you don't receive the table's data in return. requests simply doesn't execute JS. Therefore, the problem here isn't in BeautifulSoup.
To scrape JS-driven data, consider using selenium and chromedriver. The solution in this case will look like:
# import libraries
from bs4 import BeautifulSoup
from selenium import webdriver
# create a webdriver
chromedriver_path = 'C:\\path\\to\\chromedriver.exe'
driver = webdriver.Chrome(chromedriver_path)
# go to the page and get its source
driver.get('http://www.mcxindia.com/market-data/spot-market-price')
soup = BeautifulSoup(driver.page_source, 'html.parser')
# fetch mentioned data
table = soup.find('table', {'id': 'tblSMP'})
for tr in table.tbody.find_all('tr'):
    row = [td.text for td in tr.find_all('td')]
    print(row)
# close the webdriver
driver.quit()
The output of the above script is:
['ALMOND', '1 KGS', 'DELHI', '558.00', '=']
['ALUMINIUM', '1 KGS', 'THANE', '137.60', '=']
['CARDAMOM', '1 KGS', 'VANDANMEDU', '2,525.00', '=']
['CASTORSEED', '100 KGS', 'DEESA', '3,626.00', '▼']
['CHANA', '100 KGS', 'DELHI', '4,163.00', '▲']
['COPPER', '1 KGS', 'THANE', '388.30', '=']
['COTTON', '1 BALES', 'RAJKOT', '15,790.00', '▲']
['CPO', '10 KGS', 'KANDLA', '630.10', '▼']
['CRUDEOIL', '1 BBL', 'MUMBAI', '2,418.00', '▲']
['GOLD', '10 GRMS', 'AHMEDABAD', '40,989.00', '=']
['GOLDGUINEA', '8 GRMS', 'AHMEDABAD', '32,923.00', '=']
['GOLDM', '10 GRMS', 'AHMEDABAD', '40,989.00', '=']
['GOLDPETAL', '1 GRMS', 'MUMBAI', '4,129.00', '=']
['GUARGUM', '100 KGS', 'JODHPUR', '5,880.00', '=']
['GUARSEED', '100 KGS', 'JODHPUR', '3,660.00', '=']
UPD: I must specify that the code above answers the question of how to see this specific table. However, sometimes websites store data in 'application/json' or similar tags that can be reached with the 'requests' library (since they don't require JS).
As discovered by αԋɱҽԃ αмєяιcαη, the current website contains such a tag. Please check his answer. It is indeed better to use requests than selenium in this situation.

Need efficient Beautifulsoup webscrape with nested divs & spans into pandas dataframe python

I'm trying to scrape the results of a sports tournament into a pandas dataframe where each row is a different fighter's name.
Here is my code:
import re
import requests
from bs4 import BeautifulSoup
page = requests.get("http://www.bjjcompsystem.com/tournaments/1221/categories/1532871")
soup = BeautifulSoup(page.content, 'lxml')
body = list(soup.children)[1]
alldivs = list(body.children)[3]
sections = list(alldivs.children)[5]
division = list(sections.children)[1]
div_name = division.get_text().replace('\n','')
bracket = list(sections.children)[3]
import pandas as pd
data = []
div_name = division.get_text().replace('\n','')
bracket = list(sections.children)[3]
for i in bracket:
    bracket_title = [bt.get_text() for bt in bracket.select(".bracket-title")]
    location = [l.get_text() for l in bracket.select(".bracket-match-header__where")]
    time = [t.get_text() for t in bracket.select(".bracket-match-header__when")]
    fighter_rank = [fr.get_text() for fr in bracket.select(".match-card__competitor-n")]
    competitor_desc = [cd.get_text() for cd in bracket.select(".match-card__competitor-description")]
    loser_name = [ln.get_text() for ln in bracket.select(".match-competitor--loser")]
    data.append((div_name,bracket_title,location,time,fighter_rank,competitor_desc,loser_name))
df = pd.DataFrame(pd.DataFrame(data, columns=['Division','Bracket','Location','Time','Rank','Fighter','Loser']))
df
However, this results in each cell by row containing a list. I modified it to the following code:
import pandas as pd
data = []
div_name = division.get_text().replace('\n','')
bracket2 = soup.find_all('div', class_='tournament-category__brackets')
for i in bracket2:
    bracketNo = i.find_all('div', class_='bracket-title')
    section = i.find_all('div', class_='tournament-category__bracket tournament-category__bracket-15')
    for a in section:
        cats = a.find_all('div', class_='tournament-category__match')
        for j in cats:
            fight = j.find_all('div', class_='bracket-match-header')
            for k in fight:
                where = k.find('div', class_='bracket-match-header__where').get_text().replace('\n',' ')
                when = k.find('div', class_='bracket-match-header__when').get_text().replace('\n',' ')
            match = j.find_all('div', class_='match-card match-card--yellow')
            for b in match:
                rank = b.find_all('span', class_='match-card__competitor-n')
                fighter = b.find_all('div', class_='match-card__competitor-name')
                gym = b.find_all('div', class_='match-card__club-name')
                loser = b.find_all('span', class_='match-competitor--loser')
                data.append((div_name,bracketNo,when,where,rank,fighter,gym,loser,))
df1 = pd.DataFrame(pd.DataFrame(data, columns=['Division','Bracket','Time','Location','Rank','Fighter','Gym','Loser']))
df1
There is only 1 division, so this will be the same in every row. There are 5 bracket categories (1/4,2/4,3/4,4/4,finals). I want the corresponding time/location for each bracket. Each rank, fighter, and gym have two in each cell and I want this to be one per row. The sections in the dataframe are of different lengths, so that is causing some issues.
Ideally I want the dataframe to look like the following:
Division Bracket Time Location Rank Fighter Gym Loser
Master 1 Male BLACK Middle Bracket 1/4 Wed 08/21 at 10:08 AM FIGHT 1: Mat 5 16 Jeffery Bynum Hammon Caique Jiu-Jitsu None
Master 1 Male BLACK Middle Bracket 1/4 Wed 08/21 at 10:08 AM FIGHT 1: Mat 5 53 Fábio Junior Batista da Evolve MMA Fábio Junior Batista da Evolve MMA
Master 1 Male BLACK Middle Bracket 2/4 Wed 08/21 at 10:07 AM FIGHT 1: Mat 6 14 André Felipe Maciel Fre Carlson Gracie None
Master 1 Male BLACK Middle Bracket 2/4 Wed 08/21 at 10:07 AM FIGHT 1: Mat 6 50 Jerardo Linares Cleber Jiu Jitsu Jerardo Linares Cleber Jiu Jitsu
Any advice would be extremely helpful. I tried to create nested loops and follow the structure, but the HTML tree was rather complicated for me. The least amount of formatting in the df is ideal as I will later loop this over multiple pages. Thanks in advance!
EDIT: Next step - looping this program over multiple pages:
pages = [ #sample, no brackets
'http://www.bjjcompsystem.com/tournaments/1221/categories/1533466', #example of category__bracket-1
'http://www.bjjcompsystem.com/tournaments/1221/categories/1533387', #example of category__bracket-3
'http://www.bjjcompsystem.com/tournaments/1221/categories/1533372', #example of category__bracket-7
'http://www.bjjcompsystem.com/tournaments/1221/categories/1533022', #example of category__bracket-15
'http://www.bjjcompsystem.com/tournaments/1221/categories/1532847',
'http://www.bjjcompsystem.com/tournaments/1221/categories/1532871', #example of category__bracket-15 plus finals
'http://www.bjjcompsystem.com/tournaments/1221/categories/1532889', #example of bracket with two losers in a match, so throws an error in fight 32 on fighter a name
'http://www.bjjcompsystem.com/tournaments/1221/categories/1532856', #example of no winner on fight 11 so throws error on fight be name
]
First I define the list of links. This is a subset of 411 different divisions.
results = pd.DataFrame()
for page in pages:
    response = requests.get(page)
    soup = BeautifulSoup(response.text, 'html.parser')
    division = soup.find('span', {'class':'category-title__label category-title__age-division'}).text.strip()
    label = soup.find('i', {'class':'fa fa-mars'}).parent.text.strip()
    belt = soup.find('i', {'class':'fa fa-belt'}).parent.text.strip()
    weight = soup.find('i', {'class':'fa fa-weight'}).parent.text.strip()
    # PARSE BRACKETS
    brackets = soup.find_all(['div', {'class':'tournament-category__bracket tournament-category__bracket-15'},
                              'div', {'class':'tournament-category__bracket tournament-category__bracket-1'},
                              'div', {'class':'tournament-category__bracket tournament-category__bracket-3'},
                              'div', {'class':'tournament-category__bracket tournament-category__bracket-7'}])
    #results = pd.DataFrame()
    for bracket in brackets:
        ...etc
Is there a way to write into the programming how to account for different size divisions? The example at the top uses 4 brackets+finals and 15 match brackets. There are other divisions with 1 match, or 3, 7, or just 15 and not multiple brackets. Without segmenting out all links by size and re-writing the program, I'm wondering if there is an if/then statement I can add or try/except?
This was tricky, as some of the matches had the loser marked in the HTML attributes and, for some reason, others didn't. So I had to figure out a way to fill in those missing values.
Nonetheless, I think I managed to fill it all in correctly. I iterated through each match of each bracket, then appended them all into one table. To fill in the missing 'Loser' column, I sorted by fight number, looked at the rows with a missing "Loser", and checked which fighter fought in a later match. Obviously, if the fighter had another match later, then his opponent was the loser.
Code:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import natsort as ns
pages = [ #sample, no brackets
'http://www.bjjcompsystem.com/tournaments/1221/categories/1533466', #example of category__bracket-1
'http://www.bjjcompsystem.com/tournaments/1221/categories/1533387', #example of category__bracket-3
'http://www.bjjcompsystem.com/tournaments/1221/categories/1533372', #example of category__bracket-7
'http://www.bjjcompsystem.com/tournaments/1221/categories/1533022', #example of category__bracket-15
'http://www.bjjcompsystem.com/tournaments/1221/categories/1532847',
'http://www.bjjcompsystem.com/tournaments/1221/categories/1532871', #example of category__bracket-15 plus finals
'http://www.bjjcompsystem.com/tournaments/1221/categories/1532889', #example of bracket with two losers in a match, so throws an error in fight 32 on fighter a name
'http://www.bjjcompsystem.com/tournaments/1221/categories/1532856', #example of no winner on fight 11 so throws error on fight be name
]
for url in pages:
    try:
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        division = soup.find('span', {'class':'category-title__label category-title__age-division'}).text.strip()
        label = soup.find('i', {'class':'fa fa-mars'}).parent.text.strip()
        belt = soup.find('i', {'class':'fa fa-belt'}).parent.text.strip()
        weight = soup.find('i', {'class':'fa fa-weight'}).parent.text.strip()
        # PARSE BRACKETS
        #brackets = soup.find_all('div', {'class':'tournament-category__bracket tournament-category__bracket-15'})
        brackets = soup.select('div[class*="tournament-category__bracket tournament-category__bracket-"]')
        results = pd.DataFrame()
        for bracket in brackets:
            try:
                bracketTitle = bracket.find_previous_sibling('div').text
            except:
                bracketTitle = 'Bracket 1/1'
            rows = bracket.find_all('div', {'class':'row'})
            for row in rows:
                matches = row.find_all('div', {'class':'tournament-category__match'})
                for match in matches:
                    #match = matches[0]#delete
                    bye = False
                    try:
                        match.find("div", {"class": "match-card__bye"}).text
                        where = match.find("div", {"class": "match-card__bye"}).text
                        when = match.find("div", {"class": "match-card__bye"}).text
                        loser = match.find("div", {"class": "match-card__bye"}).text
                        fighter_b_name = match.find("div", {"class": "match-card__bye"}).text
                        fighter_b_rank = match.find("div", {"class": "match-card__bye"}).text
                        fighter_b_club = match.find("div", {"class": "match-card__bye"}).text
                        bye = True
                    except:
                        where = match.find('div',{'class':'bracket-match-header__where'}).text
                        when = match.find('div',{'class':'bracket-match-header__when'}).text
                    fighter_a_desc = match.find_all('div',{'class':'match-card__competitor'})[0]
                    try:
                        fighter_a_name = fighter_a_desc.find('div', {'class':'match-card__competitor-name'}).text
                    except:
                        fighter_a_name = 'UNKNOWN'
                    try:
                        fighter_a_rank = fighter_a_desc.find('span', {'class':'match-card__competitor-n'}).text
                    except:
                        fighter_a_rank = 'N/A'
                    try:
                        fighter_a_club = fighter_a_desc.find('div', {'class':'match-card__club-name'}).text
                    except:
                        fighter_a_club = 'N/A'
                    cols = ['Bracket Title','Divison','Label','Belt','Weight','Where','When','Rank','Fighter','Opponent', 'Opponent Rank' ,'Gym','Loser']
                    if bye == False:
                        fighter_b_desc = match.find_all('div',{'class':'match-card__competitor'})[1]
                        try:
                            fighter_b_name = fighter_b_desc.find('div', {'class':'match-card__competitor-name'}).text
                        except:
                            fighter_b_name = 'UNKNOWN'
                        try:
                            fighter_b_rank = fighter_b_desc.find('span', {'class':'match-card__competitor-n'}).text
                        except:
                            fighter_b_rank = 'N/A'
                        try:
                            fighter_b_club = fighter_b_desc.find('div', {'class':'match-card__club-name'}).text
                        except:
                            fighter_b_club = 'N/A'
                        try:
                            loser = match.find('span', {'class':'match-card__competitor-description match-competitor--loser'}).find('div', {'class':'match-card__competitor-name'}).text
                        except:
                            loser = None
                            #print ('Loser could not be idenetified by html class')
                    temp_df_b = pd.DataFrame([[bracketTitle,division, label, belt, weight, where, when, fighter_b_rank, fighter_b_name, fighter_a_name, fighter_a_rank, fighter_b_club ,loser]], columns=cols)
                    temp_df = pd.DataFrame([[bracketTitle,division, label, belt, weight, where, when, fighter_a_rank, fighter_a_name, fighter_b_name, fighter_b_rank, fighter_a_club ,loser]], columns=cols)
                    temp_df = temp_df.append(temp_df_b, sort=True)
                    results = results.append(temp_df, sort=True).reset_index(drop=True)
        # IDENTIFY LOSERS THAT WHERE NOT FOUND BY HTML ATTRIBUTES
        results['Fight Number'] = results['Where'].str.split('FIGHT ', expand=True)[1].str.split(':', expand=True)[0].fillna(0)
        results['Fight Number'] = pd.Categorical(results['Fight Number'], ordered=True, categories= ns.natsorted(results['Fight Number'].unique()))
        results = results.sort_values('Fight Number')
        results = results.drop_duplicates().reset_index(drop=True)
        for idx, row in results.iterrows():
            if row['Loser'] == None:
                idx_save = idx
                check = idx + 1
                fighter_check_name = row['Fighter']
                if fighter_check_name in list(results.loc[check:, 'Fighter']):
                    results.at[idx_save,'Loser'] = row['Opponent']
                else:
                    results.at[idx_save,'Loser'] = row['Fighter']
        print ('Processed url: %s' %url)
    except:
        print ('Error accessing url: %s' %url)
Output: I'm just showing the first 25 rows. 116 in total
print (results.head(25).to_string())
Belt Bracket Title Divison Fighter Gym Label Loser Opponent Opponent Rank Rank Weight When Where Fight Number
0 BLACK Bracket 2/4 Master 1 Marcelo França Mafra CheckMat Male BYE BYE BYE 4 Middle BYE BYE 0
1 BLACK Bracket 4/4 Master 1 Dealonzio Jerome Jackson Team Lloyd Irvin Male BYE BYE BYE 5 Middle BYE BYE 0
2 BLACK Bracket 2/4 Master 1 Oliver Leys Geddes Gracie Elite Team Male BYE BYE BYE 6 Middle BYE BYE 0
3 BLACK Bracket 1/4 Master 1 Gabriel Procópio da Fonseca Brazilian Top Team Male BYE BYE BYE 9 Middle BYE BYE 0
4 BLACK Bracket 2/4 Master 1 Igor Mocaiber Peralva de Mello Cicero Costha Internacional Male BYE BYE BYE 10 Middle BYE BYE 0
5 BLACK Bracket 1/4 Master 1 Sandro Gabriel Vieira Cantagalo Team Male BYE BYE BYE 1 Middle BYE BYE 0
6 BLACK Bracket 4/4 Master 1 Paulo Cesar Schauffler de Oliveira Gracie Elite Team Male BYE BYE BYE 8 Middle BYE BYE 0
7 BLACK Bracket 3/4 Master 1 Paulo César Ledesma Atos Jiu-Jitsu Male BYE BYE BYE 7 Middle BYE BYE 0
8 BLACK Bracket 3/4 Master 1 Vitor Henrique Silva Oliveira GF Team Male BYE BYE BYE 2 Middle BYE BYE 0
9 BLACK Bracket 4/4 Master 1 Clark Rouson Gracie Gracie Allegiance Male BYE BYE BYE 3 Middle BYE BYE 0
10 BLACK Bracket 4/4 Master 1 Phillip V. Fitzpatrick CheckMat Male Jonathan M. Perrine Jonathan M. Perrine 29 45 Middle Wed 08/21 at 10:06 AM FIGHT 1: Mat 8 1
11 BLACK Bracket 2/4 Master 1 André Felipe Maciel Freire Carlson Gracie Male Jerardo Linares Jerardo Linares 50 14 Middle Wed 08/21 at 10:07 AM FIGHT 1: Mat 6 1
12 BLACK Bracket 2/4 Master 1 Jerardo Linares Cleber Jiu Jitsu Male Jerardo Linares André Felipe Maciel Freire 14 50 Middle Wed 08/21 at 10:07 AM FIGHT 1: Mat 6 1
13 BLACK Bracket 1/4 Master 1 Fábio Junior Batista da Mata Evolve MMA Male Fábio Junior Batista da Mata Jeffery Bynum Hammond 16 53 Middle Wed 08/21 at 10:08 AM FIGHT 1: Mat 5 1
14 BLACK Bracket 4/4 Master 1 Jonathan M. Perrine Gracie Humaita Male Jonathan M. Perrine Phillip V. Fitzpatrick 45 29 Middle Wed 08/21 at 10:06 AM FIGHT 1: Mat 8 1
15 BLACK Bracket 1/4 Master 1 Jeffery Bynum Hammond Caique Jiu-Jitsu Male Fábio Junior Batista da Mata Fábio Junior Batista da Mata 53 16 Middle Wed 08/21 at 10:08 AM FIGHT 1: Mat 5 1
16 BLACK Bracket 3/4 Master 1 David Benzaken Teampact Male Evan Franklin Barrett Evan Franklin Barrett 54 15 Middle Wed 08/21 at 10:07 AM FIGHT 1: Mat 7 1
17 BLACK Bracket 3/4 Master 1 Evan Franklin Barrett Zenith BJJ - Las Vegas Male Evan Franklin Barrett David Benzaken 15 54 Middle Wed 08/21 at 10:07 AM FIGHT 1: Mat 7 1
18 BLACK Bracket 2/4 Master 1 Nathan S Santos Zenith BJJ - Las Vegas Male Nathan S Santos Jose A. Llanas-Campos 30 46 Middle Wed 08/21 at 10:16 AM FIGHT 2: Mat 6 2
19 BLACK Bracket 3/4 Master 1 Javier Arroyo Team Shawn Hammonds Male Javier Arroyo Kaisar Adilevich Saulebayev 43 27 Middle Wed 08/21 at 10:18 AM FIGHT 2: Mat 7 2
20 BLACK Bracket 4/4 Master 1 Manuel Ray Gonzales II Ralph Gracie Male Steven J. Patterson Steven J. Patterson 13 49 Middle Wed 08/21 at 10:10 AM FIGHT 2: Mat 8 2
21 BLACK Bracket 2/4 Master 1 Jose A. Llanas-Campos Ribeiro Jiu-Jitsu Male Nathan S Santos Nathan S Santos 46 30 Middle Wed 08/21 at 10:16 AM FIGHT 2: Mat 6 2
22 BLACK Bracket 4/4 Master 1 Steven J. Patterson Brasa CTA Male Steven J. Patterson Manuel Ray Gonzales II 49 13 Middle Wed 08/21 at 10:10 AM FIGHT 2: Mat 8 2
23 BLACK Bracket 3/4 Master 1 Kaisar Adilevich Saulebayev Charles Gracie Jiu-Jitsu Academy Male Javier Arroyo Javier Arroyo 27 43 Middle Wed 08/21 at 10:18 AM FIGHT 2: Mat 7 2
24 BLACK Bracket 1/4 Master 1 Matthew Romino Fox Team Lloyd Irvin Male Thiago Alves Cavalcante Rodrigues Thiago Alves Cavalcante Rodrigues 33 48 Middle Wed 08/21 at 10:15 AM FIGHT 2: Mat 5 2
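The rule used above to fill in missing losers (a fighter who shows up again in a later fight must have won, so the opponent lost) can be isolated from the scraping code. This is only a sketch on plain dicts with made-up fighter names; the function name `fill_missing_losers` is hypothetical, and the rows are assumed to be already sorted by fight number, with one row per fighter per match as in the table above.

```python
def fill_missing_losers(rows):
    """Fill rows whose 'Loser' is None, given rows sorted by fight number.

    If the row's fighter appears as 'Fighter' in any later row, the opponent
    is taken as the loser; otherwise the fighter is.
    """
    filled = []
    for position, row in enumerate(rows):
        row = dict(row)  # copy so the input list is left untouched
        if row["Loser"] is None:
            later_fighters = {later["Fighter"] for later in rows[position + 1:]}
            if row["Fighter"] in later_fighters:
                row["Loser"] = row["Opponent"]
            else:
                row["Loser"] = row["Fighter"]
        filled.append(row)
    return filled

# Made-up example: A beats B in fight 1 (loser not marked), then meets C in fight 2
sample_rows = [
    {"Fight": 1, "Fighter": "A", "Opponent": "B", "Loser": None},
    {"Fight": 1, "Fighter": "B", "Opponent": "A", "Loser": None},
    {"Fight": 2, "Fighter": "A", "Opponent": "C", "Loser": "C"},
    {"Fight": 2, "Fighter": "C", "Opponent": "A", "Loser": "C"},
]
filled_rows = fill_missing_losers(sample_rows)
print([row["Loser"] for row in filled_rows])
```

One limitation of this rule: in a last fight with no loser marked, neither fighter appears again, so each of the two rows names its own fighter as the loser.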

bs4 find table by id, returning 'None'

Not sure why this isn't working :( I'm able to pull other tables from this page, just not this one.
import requests
from bs4 import BeautifulSoup as soup
url = requests.get("https://www.basketball-reference.com/teams/BOS/2018.html",
headers={'User-Agent': 'Mozilla/5.0'})
page = soup(url.content, 'html')
table = page.find('table', id='team_and_opponent')
print(table)
Appreciate the help.
The page is dynamic. So you have 2 options in this case.
Side note: If you see <table> tags, don't use BeautifulSoup, pandas can do that work for you (and it actually uses bs4 under the hood) by using pd.read_html()
1) Use selenium to first render the page, and THEN you can use BeautifulSoup to pull out the <table> tags
2) Those tables are within the comment tags in the html. You can use BeautifulSoup to pull out the comments, then just grab the ones with 'table'.
I chose option 2.
import requests
from bs4 import BeautifulSoup
from bs4 import Comment
import pandas as pd
url = 'https://www.basketball-reference.com/teams/BOS/2018.html'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
tables = []
for each in comments:
    if 'table' in each:
        try:
            tables.append(pd.read_html(each)[0])
        except:
            continue
I don't know which particular table you want, but they are there in the list of tables
Output:
print (tables[1])
Unnamed: 0 G MP FG FGA ... STL BLK TOV PF PTS
0 Team 82.0 19805 3141 6975 ... 604 373 1149 1618 8529
1 Team/G NaN 241.5 38.3 85.1 ... 7.4 4.5 14.0 19.7 104.0
2 Lg Rank NaN 12 25 25 ... 23 18 15 17 20
3 Year/Year NaN 0.3% -0.9% -0.0% ... -2.1% 9.7% 5.6% -4.0% -3.7%
4 Opponent 82.0 19805 3066 6973 ... 594 364 1159 1571 8235
5 Opponent/G NaN 241.5 37.4 85.0 ... 7.2 4.4 14.1 19.2 100.4
6 Lg Rank NaN 12 3 12 ... 7 6 19 9 3
7 Year/Year NaN 0.3% -3.2% -0.9% ... -4.7% -14.4% 1.6% -5.6% -4.7%
[8 rows x 24 columns]
or
print (tables[18])
Rk Unnamed: 1 Salary
0 1 Gordon Hayward $29,727,900
1 2 Al Horford $27,734,405
2 3 Kyrie Irving $18,868,625
3 4 Jayson Tatum $5,645,400
4 5 Greg Monroe $5,000,000
5 6 Marcus Morris $5,000,000
6 7 Jaylen Brown $4,956,480
7 8 Marcus Smart $4,538,020
8 9 Aron Baynes $4,328,000
9 10 Guerschon Yabusele $2,247,480
10 11 Terry Rozier $1,988,520
11 12 Shane Larkin $1,471,382
12 13 Semi Ojeleye $1,291,892
13 14 Abdel Nader $1,167,333
14 15 Daniel Theis $815,615
15 16 Demetrius Jackson $92,858
16 17 Jarell Eddie $83,129
17 18 Xavier Silas $74,159
18 19 Jonathan Gibson $44,495
19 20 Jabari Bird $0
20 21 Kadeem Allen $0
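If only one table is needed, such as the team_and_opponent table from the question, the comment that contains it can be parsed on its own and searched by id. The following is a minimal sketch, assuming the table markup sits inside an HTML comment as described above. The inline html string is a made-up stand-in for the downloaded page, and find_commented_table is a hypothetical helper name, not part of bs4.

```python
from bs4 import BeautifulSoup, Comment


def find_commented_table(html, table_id):
    """Return the <table> with the given id from inside an HTML comment, or None."""
    soup = BeautifulSoup(html, "html.parser")
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        # Skip comments that cannot contain the table we are after.
        if table_id not in comment:
            continue
        # Parse the comment body as its own small document.
        inner = BeautifulSoup(str(comment), "html.parser")
        table = inner.find("table", id=table_id)
        if table is not None:
            return table
    return None


# Made-up stand-in for the real page: the table is hidden inside a comment.
html = """
<div id="all_team_and_opponent">
<!--
<table id="team_and_opponent">
  <tr><th>Row</th><th>PTS</th></tr>
  <tr><td>Team</td><td>8529</td></tr>
</table>
-->
</div>
"""

table = find_commented_table(html, "team_and_opponent")
# Flatten the table into a list of rows, one list of cell texts per <tr>.
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    for tr in table.find_all("tr")
]
print(rows)
```

On the real page, html would be response.text from the request above, and the returned tag could be handed to pd.read_html via str(table) if a DataFrame is preferred.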
There is no table with the id team_and_opponent on that page; rather, there is a span tag with this id. You can get results by changing the id.
This data is probably loaded dynamically (for example, by JavaScript).
You should take a look at "Web-scraping JavaScript page with Python".
For that you can use Selenium or requests-html, which supports JavaScript.
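Before reaching for a browser, it may be worth checking whether the table is truly missing from the response or only hidden from the parser. The following is a rough sketch of such a check, not a definitive test; locate_table is a hypothetical helper, and the three sample strings are made up for illustration.

```python
from bs4 import BeautifulSoup, Comment


def locate_table(html, table_id):
    """Classify where a table with the given id lives in a raw HTML response.

    Returns "parsed" if the parser sees it as a normal element,
    "comment" if it only appears inside an HTML comment, and
    "absent" if it is not in the response at all (which suggests
    it is added later, for example by JavaScript).
    """
    soup = BeautifulSoup(html, "html.parser")
    if soup.find("table", id=table_id) is not None:
        return "parsed"
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        # Only re-parse comments that actually contain table markup.
        if "<table" not in comment:
            continue
        inner = BeautifulSoup(str(comment), "html.parser")
        if inner.find("table", id=table_id) is not None:
            return "comment"
    return "absent"


# Made-up samples covering the three cases.
visible = '<table id="t"><tr><td>1</td></tr></table>'
hidden = '<div><!-- <table id="t"><tr><td>1</td></tr></table> --></div>'
missing = '<div id="t_placeholder"></div>'

print(locate_table(visible, "t"))
print(locate_table(hidden, "t"))
print(locate_table(missing, "t"))
```

Only the "absent" case points toward rendering the page with a tool such as Selenium; the "comment" case can be handled with requests and BeautifulSoup alone.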
import requests
import bs4
url = requests.get("https://www.basketball-reference.com/teams/BOS/2018.html",
                   headers={'User-Agent': 'Mozilla/5.0'})
soup = bs4.BeautifulSoup(url.text, "lxml")
# Each table on the page is wrapped in a div with this class
page = soup.select(".table_outer_container")
for i in page:
    print(i.text)
You will get your desired output.

Scraping a table from the web omits certain values

I am working on a little coding project to help learn how web scraping works, and decided to extract a table from a fantasy football website I like, which can be found here: https://fantasydata.com/nfl/fantasy-football-leaders?position=1&team=1&season=2018&seasontype=1&scope=1&subscope=1&scoringsystem=2&aggregatescope=1&range=1
When I attempt to grab the table, the first 10 rows come out okay, but starting at Brian Hill's row every value in my table comes up blank. I have inspected the web page as I usually do whenever I run into an issue, and the rows following Hill's seem to follow an identical structure to the ones before it. Any help both resolving the issue and explaining why it is happening in the first place would be much appreciated!
import pandas
from bs4 import BeautifulSoup
from selenium import webdriver
URLA = 'https://fantasydata.com/nfl/fantasy-football-leaders?position='
URLB = '&team='
URLC = '&season='
URLD = '&seasontype=1&scope=1&subscope=1&scoringsystem=2&aggregatescope=1&range=3'
POSITIONNUMBER = [1,6,7]
TEAMNUMBER = [1]
def buildStatsTable(year):
    fullDF = pandas.DataFrame()
    fullLength = 0
    position = 1
    headers = ['Name', 'Team', 'Pos', 'GMS', 'PassingYards', 'PassingTDs', 'PassingINTs',
               'RushingYDs', 'RushingTDs', 'ReceivingRECs', 'ReceivingYDs', 'ReceivingTDs',
               'FUM LST', 'PPG', 'FPTS']
    for team in TEAMNUMBER:
        currURL = URLA + str(position)+ URLB + str(team)+URLC+str(year)+URLD
        driver = webdriver.Chrome()
        driver.get(currURL)
        soup = BeautifulSoup(driver.page_source, "lxml")
        driver.quit()
        tr = soup.findAll('tr', {'role' : 'row'})
        length = len(tr)
        offset = length/2
        maxCap = int((length - 1)/2) + 1
        tableList = []
        for i, row in enumerate(tr[2:maxCap]):
            player = row.get_text().split('\n', 2)[1]
            player_row = [value.get_text() for value in tr[int(i + offset + 1)].contents]
            tableList.append([player] + player_row)
        teamDF = pandas.DataFrame(columns = headers, data = tableList)
        fullLength = fullLength + len(tableList)
        fullDF = fullDF.append(teamDF)
    fullDF.index = list(range(0,fullLength))
    return fullDF
falcons = buildStatsTable(2018)
Actual results (only the first few columns are shown to keep the post shorter; the issue is consistent across every column):
Name Team Pos GMS PassingYards PassingTDs PassingINTs \
0 Matt Ryan ATL QB 16 4924 35 7
1 Julio Jones ATL WR 16 0 0 0
2 Calvin Ridley ATL WR 16 0 0 0
3 Tevin Coleman ATL RB 16 0 0 0
4 Mohamed Sanu ATL WR 16 5 1 0
5 Austin Hooper ATL TE 16 0 0 0
6 Ito Smith ATL RB 14 0 0 0
7 Justin Hardy ATL WR 16 0 0 0
8 Marvin Hall ATL WR 16 0 0 0
9 Logan Paulsen ATL TE 15 0 0 0
10 Brian Hill ATL RB
11 Devonta Freeman ATL RB
12 Russell Gage ATL WR
13 Eric Saubert ATL TE
