I'm trying to scrape the results of a sports tournament into a pandas dataframe where each row is a different fighter's name.
Here is my code:
import re
import requests
from bs4 import BeautifulSoup
page = requests.get("http://www.bjjcompsystem.com/tournaments/1221/categories/1532871")
soup = BeautifulSoup(page.content, 'lxml')
body = list(soup.children)[1]
alldivs = list(body.children)[3]
sections = list(alldivs.children)[5]
division = list(sections.children)[1]
div_name = division.get_text().replace('\n','')
bracket = list(sections.children)[3]
import pandas as pd
data = []
div_name = division.get_text().replace('\n','')
bracket = list(sections.children)[3]
for i in bracket:
    bracket_title = [bt.get_text() for bt in bracket.select(".bracket-title")]
    location = [l.get_text() for l in bracket.select(".bracket-match-header__where")]
    time = [t.get_text() for t in bracket.select(".bracket-match-header__when")]
    fighter_rank = [fr.get_text() for fr in bracket.select(".match-card__competitor-n")]
    competitor_desc = [cd.get_text() for cd in bracket.select(".match-card__competitor-description")]
    loser_name = [ln.get_text() for ln in bracket.select(".match-competitor--loser")]
    data.append((div_name,bracket_title,location,time,fighter_rank,competitor_desc,loser_name))
df = pd.DataFrame(pd.DataFrame(data, columns=['Division','Bracket','Location','Time','Rank','Fighter','Loser']))
df
However, this results in each cell of a row containing a whole list rather than a single value. I modified it to the following code:
import pandas as pd
data = []
div_name = division.get_text().replace('\n','')
bracket2 = soup.find_all('div', class_='tournament-category__brackets')
for i in bracket2:
    bracketNo = i.find_all('div', class_='bracket-title')
    section = i.find_all('div', class_='tournament-category__bracket tournament-category__bracket-15')
    for a in section:
        cats = a.find_all('div', class_='tournament-category__match')
        for j in cats:
            fight = j.find_all('div', class_='bracket-match-header')
            for k in fight:
                where = k.find('div', class_='bracket-match-header__where').get_text().replace('\n',' ')
                when = k.find('div', class_='bracket-match-header__when').get_text().replace('\n',' ')
            match = j.find_all('div', class_='match-card match-card--yellow')
            for b in match:
                rank = b.find_all('span', class_='match-card__competitor-n')
                fighter = b.find_all('div', class_='match-card__competitor-name')
                gym = b.find_all('div', class_='match-card__club-name')
                loser = b.find_all('span', class_='match-competitor--loser')
                data.append((div_name,bracketNo,when,where,rank,fighter,gym,loser,))
df1 = pd.DataFrame(pd.DataFrame(data, columns=['Division','Bracket','Time','Location','Rank','Fighter','Gym','Loser']))
df1
There is only one division, so it will be the same in every row. There are five brackets (1/4, 2/4, 3/4, 4/4, and finals), and I want the corresponding time/location for each bracket. Rank, fighter, and gym each have two values per cell, and I want one per row. The sections in the dataframe are of different lengths, which is causing some issues.
Ideally I want the dataframe to look like the following:
Division Bracket Time Location Rank Fighter Gym Loser
Master 1 Male BLACK Middle Bracket 1/4 Wed 08/21 at 10:08 AM FIGHT 1: Mat 5 16 Jeffery Bynum Hammon Caique Jiu-Jitsu None
Master 1 Male BLACK Middle Bracket 1/4 Wed 08/21 at 10:08 AM FIGHT 1: Mat 5 53 Fábio Junior Batista da Evolve MMA Fábio Junior Batista da Evolve MMA
Master 1 Male BLACK Middle Bracket 2/4 Wed 08/21 at 10:07 AM FIGHT 1: Mat 6 14 André Felipe Maciel Fre Carlson Gracie None
Master 1 Male BLACK Middle Bracket 2/4 Wed 08/21 at 10:07 AM FIGHT 1: Mat 6 50 Jerardo Linares Cleber Jiu Jitsu Jerardo Linares Cleber Jiu Jitsu
Any advice would be extremely helpful. I tried to create nested loops and follow the structure, but the HTML tree was rather complicated for me. The less formatting needed in the dataframe the better, as I will later loop this over multiple pages. Thanks in advance!
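To make the shape I'm after concrete, here is a rough, untested sketch of the per-match expansion I have in mind. It reuses the CSS classes from my code above; match is assumed to be one tournament-category__match div with a normal (non-bye) header:
# Sketch only: expand one match card into one row per competitor,
# repeating the match-level fields (division, bracket, time, location).
def match_to_rows(match, div_name, bracket_title):
    where = match.find('div', class_='bracket-match-header__where').get_text(strip=True)
    when = match.find('div', class_='bracket-match-header__when').get_text(strip=True)
    rows = []
    for comp in match.find_all('div', class_='match-card__competitor'):
        rank = comp.find('span', class_='match-card__competitor-n')
        name = comp.find('div', class_='match-card__competitor-name')
        gym = comp.find('div', class_='match-card__club-name')
        rows.append({
            'Division': div_name,
            'Bracket': bracket_title,
            'Time': when,
            'Location': where,
            'Rank': rank.get_text(strip=True) if rank else None,
            'Fighter': name.get_text(strip=True) if name else None,
            'Gym': gym.get_text(strip=True) if gym else None,
        })
    return rows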
EDIT: Next step - looping this program over multiple pages:
pages = [ #sample, no brackets
'http://www.bjjcompsystem.com/tournaments/1221/categories/1533466', #example of category__bracket-1
'http://www.bjjcompsystem.com/tournaments/1221/categories/1533387', #example of category__bracket-3
'http://www.bjjcompsystem.com/tournaments/1221/categories/1533372', #example of category__bracket-7
'http://www.bjjcompsystem.com/tournaments/1221/categories/1533022', #example of category__bracket-15
'http://www.bjjcompsystem.com/tournaments/1221/categories/1532847',
'http://www.bjjcompsystem.com/tournaments/1221/categories/1532871', #example of category__bracket-15 plus finals
'http://www.bjjcompsystem.com/tournaments/1221/categories/1532889', #example of bracket with two losers in a match, so throws an error in fight 32 on fighter a name
'http://www.bjjcompsystem.com/tournaments/1221/categories/1532856', #example of no winner on fight 11 so throws error on fight be name
]
First I define the multiple links. This is a subset of the 411 different divisions.
results = pd.DataFrame()
for page in pages:
    response = requests.get(page)
    soup = BeautifulSoup(response.text, 'html.parser')
    division = soup.find('span', {'class':'category-title__label category-title__age-division'}).text.strip()
    label = soup.find('i', {'class':'fa fa-mars'}).parent.text.strip()
    belt = soup.find('i', {'class':'fa fa-belt'}).parent.text.strip()
    weight = soup.find('i', {'class':'fa fa-weight'}).parent.text.strip()
    # PARSE BRACKETS
    brackets = soup.find_all(['div', {'class':'tournament-category__bracket tournament-category__bracket-15'},
                              'div', {'class':'tournament-category__bracket tournament-category__bracket-1'},
                              'div', {'class':'tournament-category__bracket tournament-category__bracket-3'},
                              'div', {'class':'tournament-category__bracket tournament-category__bracket-7'}])
    #results = pd.DataFrame()
    for bracket in brackets:
        ...etc
Is there a way to account in the program for different division sizes? The example at the top uses four brackets plus finals, each a 15-match bracket. Other divisions have one match, or 3, or 7, or just a single 15-match bracket rather than multiple brackets. Without segmenting out all the links by size and rewriting the program, is there an if/then statement or try/except I can add?
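For example, would something along these lines work? Just a sketch, using a CSS substring selector so that one selector matches bracket-1, bracket-3, bracket-7, and bracket-15 alike:
import requests
from bs4 import BeautifulSoup

# Sketch: a class-substring match picks up every bracket size in one pass,
# however many brackets the division happens to have.
response = requests.get(pages[0])
soup = BeautifulSoup(response.text, 'html.parser')
brackets = soup.select('div[class*="tournament-category__bracket-"]')
print(len(brackets))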
This was tricky, as some of the match attributes included the loser of the match and, for some reason, others didn't, so I had to figure out a way to fill in those missing nulls.
Nonetheless, I think I managed to fill it all in correctly. I just iterated through each match of each bracket, then appended them all into one table. To fill in the missing 'Loser' column, I sorted by fight number, looked at the rows with a missing 'Loser', and checked which fighter fought in a later match: obviously, if a fighter had another match later, then his opponent was the loser.
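In isolation, the backfill rule is just this (a sketch, assuming results is the combined table already sorted by fight number; the full version is in the listing below):
# If a fighter with a missing 'Loser' appears again in a later fight,
# he won this one, so his opponent was the loser; otherwise he was.
for idx, row in results[results['Loser'].isnull()].iterrows():
    if row['Fighter'] in list(results.loc[idx + 1:, 'Fighter']):
        results.at[idx, 'Loser'] = row['Opponent']
    else:
        results.at[idx, 'Loser'] = row['Fighter']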
Code:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import natsort as ns
pages = [ #sample, no brackets
'http://www.bjjcompsystem.com/tournaments/1221/categories/1533466', #example of category__bracket-1
'http://www.bjjcompsystem.com/tournaments/1221/categories/1533387', #example of category__bracket-3
'http://www.bjjcompsystem.com/tournaments/1221/categories/1533372', #example of category__bracket-7
'http://www.bjjcompsystem.com/tournaments/1221/categories/1533022', #example of category__bracket-15
'http://www.bjjcompsystem.com/tournaments/1221/categories/1532847',
'http://www.bjjcompsystem.com/tournaments/1221/categories/1532871', #example of category__bracket-15 plus finals
'http://www.bjjcompsystem.com/tournaments/1221/categories/1532889', #example of bracket with two losers in a match, so throws an error in fight 32 on fighter a name
'http://www.bjjcompsystem.com/tournaments/1221/categories/1532856', #example of no winner on fight 11 so throws error on fight be name
]
for url in pages:
    try:
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        division = soup.find('span', {'class':'category-title__label category-title__age-division'}).text.strip()
        label = soup.find('i', {'class':'fa fa-mars'}).parent.text.strip()
        belt = soup.find('i', {'class':'fa fa-belt'}).parent.text.strip()
        weight = soup.find('i', {'class':'fa fa-weight'}).parent.text.strip()

        # PARSE BRACKETS
        #brackets = soup.find_all('div', {'class':'tournament-category__bracket tournament-category__bracket-15'})
        brackets = soup.select('div[class*="tournament-category__bracket tournament-category__bracket-"]')

        results = pd.DataFrame()
        for bracket in brackets:
            try:
                bracketTitle = bracket.find_previous_sibling('div').text
            except:
                bracketTitle = 'Bracket 1/1'

            rows = bracket.find_all('div', {'class':'row'})
            for row in rows:
                matches = row.find_all('div', {'class':'tournament-category__match'})
                for match in matches:
                    bye = False
                    # A bye match has no header or second competitor, so fill
                    # every match-level field with the bye text instead.
                    try:
                        match.find("div", {"class": "match-card__bye"}).text
                        where = match.find("div", {"class": "match-card__bye"}).text
                        when = match.find("div", {"class": "match-card__bye"}).text
                        loser = match.find("div", {"class": "match-card__bye"}).text
                        fighter_b_name = match.find("div", {"class": "match-card__bye"}).text
                        fighter_b_rank = match.find("div", {"class": "match-card__bye"}).text
                        fighter_b_club = match.find("div", {"class": "match-card__bye"}).text
                        bye = True
                    except:
                        where = match.find('div',{'class':'bracket-match-header__where'}).text
                        when = match.find('div',{'class':'bracket-match-header__when'}).text

                    fighter_a_desc = match.find_all('div',{'class':'match-card__competitor'})[0]
                    try:
                        fighter_a_name = fighter_a_desc.find('div', {'class':'match-card__competitor-name'}).text
                    except:
                        fighter_a_name = 'UNKNOWN'
                    try:
                        fighter_a_rank = fighter_a_desc.find('span', {'class':'match-card__competitor-n'}).text
                    except:
                        fighter_a_rank = 'N/A'
                    try:
                        fighter_a_club = fighter_a_desc.find('div', {'class':'match-card__club-name'}).text
                    except:
                        fighter_a_club = 'N/A'

                    cols = ['Bracket Title','Divison','Label','Belt','Weight','Where','When','Rank','Fighter','Opponent', 'Opponent Rank' ,'Gym','Loser']
                    if bye == False:
                        fighter_b_desc = match.find_all('div',{'class':'match-card__competitor'})[1]
                        try:
                            fighter_b_name = fighter_b_desc.find('div', {'class':'match-card__competitor-name'}).text
                        except:
                            fighter_b_name = 'UNKNOWN'
                        try:
                            fighter_b_rank = fighter_b_desc.find('span', {'class':'match-card__competitor-n'}).text
                        except:
                            fighter_b_rank = 'N/A'
                        try:
                            fighter_b_club = fighter_b_desc.find('div', {'class':'match-card__club-name'}).text
                        except:
                            fighter_b_club = 'N/A'
                        try:
                            loser = match.find('span', {'class':'match-card__competitor-description match-competitor--loser'}).find('div', {'class':'match-card__competitor-name'}).text
                        except:
                            loser = None
                            #print ('Loser could not be identified by html class')
                        temp_df_b = pd.DataFrame([[bracketTitle,division, label, belt, weight, where, when, fighter_b_rank, fighter_b_name, fighter_a_name, fighter_a_rank, fighter_b_club ,loser]], columns=cols)

                    temp_df = pd.DataFrame([[bracketTitle,division, label, belt, weight, where, when, fighter_a_rank, fighter_a_name, fighter_b_name, fighter_b_rank, fighter_a_club ,loser]], columns=cols)
                    # On bye matches a stale temp_df_b from the previous match is
                    # appended again; the drop_duplicates() below cleans those up.
                    temp_df = temp_df.append(temp_df_b, sort=True)
                    results = results.append(temp_df, sort=True).reset_index(drop=True)

        # IDENTIFY LOSERS THAT WERE NOT FOUND BY HTML ATTRIBUTES
        results['Fight Number'] = results['Where'].str.split('FIGHT ', expand=True)[1].str.split(':', expand=True)[0].fillna(0)
        results['Fight Number'] = pd.Categorical(results['Fight Number'], ordered=True, categories= ns.natsorted(results['Fight Number'].unique()))
        results = results.sort_values('Fight Number')
        results = results.drop_duplicates().reset_index(drop=True)
        for idx, row in results.iterrows():
            if row['Loser'] == None:
                idx_save = idx
                check = idx + 1
                fighter_check_name = row['Fighter']
                if fighter_check_name in list(results.loc[check:, 'Fighter']):
                    results.at[idx_save,'Loser'] = row['Opponent']
                else:
                    results.at[idx_save,'Loser'] = row['Fighter']
        print ('Processed url: %s' %url)
    except:
        print ('Error accessing url: %s' %url)
Output: I'm just showing the first 25 rows; there are 116 in total.
print (results.head(25).to_string())
Belt Bracket Title Divison Fighter Gym Label Loser Opponent Opponent Rank Rank Weight When Where Fight Number
0 BLACK Bracket 2/4 Master 1 Marcelo França Mafra CheckMat Male BYE BYE BYE 4 Middle BYE BYE 0
1 BLACK Bracket 4/4 Master 1 Dealonzio Jerome Jackson Team Lloyd Irvin Male BYE BYE BYE 5 Middle BYE BYE 0
2 BLACK Bracket 2/4 Master 1 Oliver Leys Geddes Gracie Elite Team Male BYE BYE BYE 6 Middle BYE BYE 0
3 BLACK Bracket 1/4 Master 1 Gabriel Procópio da Fonseca Brazilian Top Team Male BYE BYE BYE 9 Middle BYE BYE 0
4 BLACK Bracket 2/4 Master 1 Igor Mocaiber Peralva de Mello Cicero Costha Internacional Male BYE BYE BYE 10 Middle BYE BYE 0
5 BLACK Bracket 1/4 Master 1 Sandro Gabriel Vieira Cantagalo Team Male BYE BYE BYE 1 Middle BYE BYE 0
6 BLACK Bracket 4/4 Master 1 Paulo Cesar Schauffler de Oliveira Gracie Elite Team Male BYE BYE BYE 8 Middle BYE BYE 0
7 BLACK Bracket 3/4 Master 1 Paulo César Ledesma Atos Jiu-Jitsu Male BYE BYE BYE 7 Middle BYE BYE 0
8 BLACK Bracket 3/4 Master 1 Vitor Henrique Silva Oliveira GF Team Male BYE BYE BYE 2 Middle BYE BYE 0
9 BLACK Bracket 4/4 Master 1 Clark Rouson Gracie Gracie Allegiance Male BYE BYE BYE 3 Middle BYE BYE 0
10 BLACK Bracket 4/4 Master 1 Phillip V. Fitzpatrick CheckMat Male Jonathan M. Perrine Jonathan M. Perrine 29 45 Middle Wed 08/21 at 10:06 AM FIGHT 1: Mat 8 1
11 BLACK Bracket 2/4 Master 1 André Felipe Maciel Freire Carlson Gracie Male Jerardo Linares Jerardo Linares 50 14 Middle Wed 08/21 at 10:07 AM FIGHT 1: Mat 6 1
12 BLACK Bracket 2/4 Master 1 Jerardo Linares Cleber Jiu Jitsu Male Jerardo Linares André Felipe Maciel Freire 14 50 Middle Wed 08/21 at 10:07 AM FIGHT 1: Mat 6 1
13 BLACK Bracket 1/4 Master 1 Fábio Junior Batista da Mata Evolve MMA Male Fábio Junior Batista da Mata Jeffery Bynum Hammond 16 53 Middle Wed 08/21 at 10:08 AM FIGHT 1: Mat 5 1
14 BLACK Bracket 4/4 Master 1 Jonathan M. Perrine Gracie Humaita Male Jonathan M. Perrine Phillip V. Fitzpatrick 45 29 Middle Wed 08/21 at 10:06 AM FIGHT 1: Mat 8 1
15 BLACK Bracket 1/4 Master 1 Jeffery Bynum Hammond Caique Jiu-Jitsu Male Fábio Junior Batista da Mata Fábio Junior Batista da Mata 53 16 Middle Wed 08/21 at 10:08 AM FIGHT 1: Mat 5 1
16 BLACK Bracket 3/4 Master 1 David Benzaken Teampact Male Evan Franklin Barrett Evan Franklin Barrett 54 15 Middle Wed 08/21 at 10:07 AM FIGHT 1: Mat 7 1
17 BLACK Bracket 3/4 Master 1 Evan Franklin Barrett Zenith BJJ - Las Vegas Male Evan Franklin Barrett David Benzaken 15 54 Middle Wed 08/21 at 10:07 AM FIGHT 1: Mat 7 1
18 BLACK Bracket 2/4 Master 1 Nathan S Santos Zenith BJJ - Las Vegas Male Nathan S Santos Jose A. Llanas-Campos 30 46 Middle Wed 08/21 at 10:16 AM FIGHT 2: Mat 6 2
19 BLACK Bracket 3/4 Master 1 Javier Arroyo Team Shawn Hammonds Male Javier Arroyo Kaisar Adilevich Saulebayev 43 27 Middle Wed 08/21 at 10:18 AM FIGHT 2: Mat 7 2
20 BLACK Bracket 4/4 Master 1 Manuel Ray Gonzales II Ralph Gracie Male Steven J. Patterson Steven J. Patterson 13 49 Middle Wed 08/21 at 10:10 AM FIGHT 2: Mat 8 2
21 BLACK Bracket 2/4 Master 1 Jose A. Llanas-Campos Ribeiro Jiu-Jitsu Male Nathan S Santos Nathan S Santos 46 30 Middle Wed 08/21 at 10:16 AM FIGHT 2: Mat 6 2
22 BLACK Bracket 4/4 Master 1 Steven J. Patterson Brasa CTA Male Steven J. Patterson Manuel Ray Gonzales II 49 13 Middle Wed 08/21 at 10:10 AM FIGHT 2: Mat 8 2
23 BLACK Bracket 3/4 Master 1 Kaisar Adilevich Saulebayev Charles Gracie Jiu-Jitsu Academy Male Javier Arroyo Javier Arroyo 27 43 Middle Wed 08/21 at 10:18 AM FIGHT 2: Mat 7 2
24 BLACK Bracket 1/4 Master 1 Matthew Romino Fox Team Lloyd Irvin Male Thiago Alves Cavalcante Rodrigues Thiago Alves Cavalcante Rodrigues 33 48 Middle Wed 08/21 at 10:15 AM FIGHT 2: Mat 5 2
https://fbref.com/en/partidas/25d5b9bd/Coritiba-Cuiaba-2022Julho25-Serie-A
I want to scrape the Team Stats, such as Possession and Shots on Target, and also what's below them, like Fouls, Corners, and so on.
What I have now is very overcomplicated code, basically stripping and splitting a string multiple times to grab the values I want.
import time

import requests
import pandas as pd
from bs4 import BeautifulSoup

#getting a general info dataframe with all matches
championship_url = 'https://fbref.com/en/comps/24/1495/schedule/2016-Serie-A-Scores-and-Fixtures'
data = requests.get(championship_url)
time.sleep(3)
matches = pd.read_html(data.text, match="Resultados e Calendários")[0]
#putting stats info in each match entry (this is an example match to test)
match_url = 'https://fbref.com/en/partidas/25d5b9bd/Coritiba-Cuiaba-2022Julho25-Serie-A'
data = requests.get(match_url)
time.sleep(3)
soup = BeautifulSoup(data.text, features='lxml')
# ID the match to merge later on
home_team = soup.find("h1").text.split()[0]
round_week = float(soup.find("div", {'id': 'content'}).text.split()[18].strip(')'))
# collecting stats
stats = soup.find("div", {"id": "team_stats"}).text.split()[5:] #first part of stats with the progress bars
stats_extra = soup.find("div", {"id": "team_stats_extra"}).text.split()[2:] #second part
all_stats = {'posse_casa':[], 'posse_fora':[], 'chutestotais_casa':[], 'chutestotais_fora':[],
'acertopasses_casa':[], 'acertopasses_fora':[], 'chutesgol_casa':[], 'chutesgol_fora':[],
'faltas_casa':[], 'faltas_fora':[], 'escanteios_casa':[], 'escanteios_fora':[],
'cruzamentos_casa':[], 'cruzamentos_fora':[], 'contatos_casa':[], 'contatos_fora':[],
'botedef_casa':[], 'botedef_fora':[], 'aereo_casa':[], 'aereo_fora':[],
'defesas_casa':[], 'defesas_fora':[], 'impedimento_casa':[], 'impedimento_fora':[],
'tirometa_casa':[], 'tirometa_fora':[], 'lateral_casa':[], 'lateral_fora':[],
'bolalonga_casa':[], 'bolalonga_fora':[], 'Em casa':[home_team], 'Sem':[round_week]}
#not gonna copy everything but is kinda like this for each stat
#stats = '\nEstatísticas do time\n\n\nCoritiba \n\n\n\t\n\n\n\n\n\n\n\n\n\n Cuiabá\n\nPosse\n\n\n\n42%\n\n\n\n\n\n58%\n\n\n\n\nChutes ao gol\n\n\n\n2 of 4\xa0—\xa050%\n\n\n\n\n\n0%\xa0—\xa00 of 8\n\n\n\n\nDefesas\n\n\n\n0 of 0\xa0—\xa0%\n\n\n\n\n\n50%\xa0—\xa01 of 2\n\n\n\n\nCartões\n\n\n\n\n\n\n\n\n\n\n\n\n\n'
#first grabbing 42% possession
all_stats['posse_casa']=stats.replace('\n','').replace('\t','')[20:].split('Posse')[1][:5].split('%')[0]
#grabbing 58% possession
all_stats['posse_fora']=stats.replace('\n','').replace('\t','')[20:].split('Posse')[1][:5].split('%')[1]
all_stats_df = pd.DataFrame.from_dict(all_stats)
championship_data = matches.merge(all_stats_df, on=['Em casa','Sem'])
There are a lot of stats in that dict because in previous championship years FBref had all those stats; only in the current year's championship are there just 12 of them to fill. I do intend to run the code on 5-6 different years, so I made a version with all the stats, and for current-year games I intend to fill in nothing when a stat isn't on the page to scrape.
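The pattern I have in mind for that is to wrap each extraction so a missing stat just becomes None. A sketch with a hypothetical extract_stat helper, mirroring the string splitting above:
# Hypothetical helper: returns None when a stat label is absent, so
# seasons with more stats and seasons with fewer can share the same
# dict of columns.
def extract_stat(stats_text, label):
    # stats_text is assumed to be the raw div text (like the commented
    # example string above), not the split list.
    try:
        cleaned = stats_text.replace('\n', '').replace('\t', '')
        return cleaned.split(label)[1][:5].split('%')[0]
    except IndexError:
        return None

raw = soup.find('div', {'id': 'team_stats'}).text
all_stats['posse_casa'] = [extract_stat(raw, 'Posse')]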
You can get Fouls, Corners and Offsides and 7 tables worth of data from that page with the following code:
from bs4 import BeautifulSoup
import requests
import pandas as pd
url = 'https://fbref.com/en/partidas/25d5b9bd/Coritiba-Cuiaba-2022Julho25-Serie-A'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
coritiba_fouls = soup.find('div', string='Fouls').previous_sibling.text.strip()
cuiaba_fouls = soup.find('div', string='Fouls').next_sibling.text.strip()
coritiba_corners = soup.find('div', string='Corners').previous_sibling.text.strip()
cuiaba_corners = soup.find('div', string='Corners').next_sibling.text.strip()
coritiba_offsides = soup.find('div', string='Offsides').previous_sibling.text.strip()
cuiaba_offsides = soup.find('div', string='Offsides').next_sibling.text.strip()
print('Coritiba Fouls: ' + coritiba_fouls, 'Cuiaba Fouls: ' + cuiaba_fouls)
print('Coritiba Corners: ' + coritiba_corners, 'Cuiaba Corners: ' + cuiaba_corners)
print('Coritiba Offsides: ' + coritiba_offsides, 'Cuiaba Offsides: ' + cuiaba_offsides)
dfs = pd.read_html(r.text)
print('Number of tables: ' + str(len(dfs)))
for df in dfs:
    print(df)
    print('___________')
This will print in the terminal:
Coritiba Fouls: 16 Cuiaba Fouls: 12
Coritiba Corners: 4 Cuiaba Corners: 4
Coritiba Offsides: 0 Cuiaba Offsides: 1
Number of tables: 7
Coritiba (4-2-3-1) Coritiba (4-2-3-1).1
0 23 Alex Muralha
1 2 Matheus Alexandre
2 3 Henrique
3 4 Luciano Castán
4 6 Egídio Pereira Júnior
5 9 Léo Gamalho
6 11 Alef Manga
7 25 Bernanrdo Lemes
8 78 Régis
9 97 Valdemir
10 98 Igor Paixão
11 Bench Bench
12 21 Rafael William
13 5 Guillermo de los Santos
14 15 Matías Galarza
15 16 Natanael
16 18 Guilherme Biro
17 19 Thonny Anderson
18 28 Pablo Javier García
19 32 Bruno Gomes
20 44 Márcio Silva
21 52 Adrián Martínez
22 75 Luiz Gabriel
23 88 Hugo
___________
Cuiabá (4-1-4-1) Cuiabá (4-1-4-1).1
0 1 Walter
1 2 João Lucas
2 3 Joaquim
3 4 Marllon Borges
4 5 Camilo
5 6 Igor Cariús
6 7 Alesson
7 8 João Pedro Pepê
8 9 Valdívia
9 10 Rodriguinho Marinho
10 11 Rafael Gava
11 Bench Bench
12 12 João Carlos
13 13 Daniel Guedes
14 14 Paulão
15 15 Marcão Silva
16 16 Cristian Rivas
17 17 Gabriel Pirani
18 18 Jenison
19 19 André
20 20 Kelvin Osorio
21 21 Jonathan Cafu
22 22 André Luis
23 23 Felipe Marques
___________
Coritiba Cuiabá
Possession Possession
0 42% 58%
1 Shots on Target Shots on Target
2 2 of 4 — 50% 0% — 0 of 8
3 Saves Saves
4 0 of 0 — % 50% — 1 of 2
5 Cards Cards
6 NaN NaN
_____________
[....]
I am trying to append a value to a URL in Python to scrape details from the target URL.
I have the code below, but it seems to be scraping the data from url1 rather than URL.
I have scraped the team names from the NFL website without any issue. The issue is with the spotrac URL, where I am appending the team name I scraped from the NFL website.
import requests
from bs4 import BeautifulSoup
import pandas as pd

URL ='https://www.nfl.com/teams/'
page = requests.get(URL)
soup = BeautifulSoup(page.text, 'html.parser')
team_name = []
team_name_list = soup.find_all('h4',class_='d3-o-media-object__roofline nfl-c-custom-promo__headline')
for team in team_name_list:
    if team.find('p'):
        team_name.append(team.text)
for team in team_name:
    team = team.replace(" ", "-").lower()
    url1 = 'https://www.spotrac.com/nfl/rankings/'
    URL = url1 +str(team)
    print(URL)
    data = {
        'ajax': 'true',
        'mobile': 'false'
    }
    bs_soup = BeautifulSoup(requests.post(URL, data=data).content, 'html.parser')
    spotrac_df = pd.DataFrame(columns = ['Name', 'Salary'])
    for h3 in bs_soup.select('h3'):
        spotrac_df = spotrac_df.append(pd.DataFrame({'Name': str(h3.text), 'Salary' : str(h3.find_next(class_="rank-value").text)}, index=[0]), ignore_index=False)
I'm almost certain the problem is coming from the URL not appending properly. The scraping is taking the salaries etc from url1 rather than URL.
My console output (using Spyder IDE) is as below for print(URL)
The URL is appending correctly, but you have a leading white space in your team names. I also made a few other changes and noted them in the code.
Lastly (and I used to do this too), creating an empty dataframe and then appending to it after each iteration isn't the best method. It's better to construct your rows as lists/dictionaries and, when done, call on pandas once to construct the dataframe, so I changed that as well.
import requests
from bs4 import BeautifulSoup
import pandas as pd

url ='https://www.nfl.com/teams/'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
team_name = []
team_name_list = soup.find_all('h4',class_='d3-o-media-object__roofline nfl-c-custom-promo__headline')
for team in team_name_list:
    if team.find('p'):
        team_name.append(team.text.strip()) #<- remove leading/trailing white space

url1 = 'https://www.spotrac.com/nfl/rankings/' #<- since this is fixed, put it before the loop
spotrac_rows = []
for team in team_name:
    team = '-'.join(team.split()).lower() #<- changed to split in case there's 2 spaces between city and team
    url = url1 + str(team)
    print(url)
    data = {
        'ajax': 'true',
        'mobile': 'false'
    }
    bs_soup = BeautifulSoup(requests.post(url, data=data).content, 'html.parser')
    for h3 in bs_soup.select('h3'):
        spotrac_rows.append({'Name': str(h3.text), 'Salary' : str(h3.find_next(class_="rank-value").text.strip())}) #<- remove white space from the salary

spotrac_df = pd.DataFrame(spotrac_rows)
Output:
print(spotrac_df)
Name Salary
0 Chandler Jones $21,333,333
1 Patrick Peterson $13,184,588
2 D.J. Humphries $12,800,000
3 DeAndre Hopkins $12,500,000
4 Larry Fitzgerald $11,750,000
5 Jordan Hicks $10,500,000
6 Justin Pugh $10,500,000
7 Kenyan Drake $8,483,000
8 Kyler Murray $8,080,601
9 Robert Alford $7,500,000
10 J.R. Sweezy $6,500,000
11 Corey Peters $4,437,500
12 Haason Reddick $4,288,444
13 Jordan Phillips $4,000,000
14 Isaiah Simmons $3,757,101
15 Maxx Williams $3,400,000
16 Zane Gonzalez $3,259,000
17 Devon Kennard $2,500,000
18 Budda Baker $2,173,184
19 De'Vondre Campbell $2,000,000
20 Andy Lee $2,000,000
21 Byron Murphy $1,815,795
22 Christian Kirk $1,607,691
23 Aaron Brewer $1,168,750
24 Max Garcia $1,143,125
25 Andy Isabella $1,052,244
26 Mason Cole $977,629
27 Zach Allen $975,855
28 Chris Banjo $887,500
29 Jonathan Bullard $887,500
... ...
2530 Khari Blasingame $675,000
2531 Kenneth Durden $675,000
2532 Cody Hollister $675,000
2533 Joey Ivie $675,000
2534 Greg Joseph $675,000
2535 Kareem Orr $675,000
2536 David Quessenberry $675,000
2537 Derick Roberson $675,000
2538 Shaun Wilson $675,000
2539 Cole McDonald $635,421
2540 Chris Jackson $629,570
2541 Kobe Smith $614,333
2542 Aaron Brewer $613,333
2543 Cale Garrett $613,333
2544 Tommy Hudson $613,333
2545 Kristian Wilkerson $613,333
2546 Khaylan Kearse-Thomas $612,500
2547 Nick Westbrook $612,333
2548 Kyle Williams $611,833
2549 Mason Kinsey $611,666
2550 Tucker McCann $611,666
2551 Cameron Scarlett $611,666
2552 Teair Tart $611,666
2553 Brandon Kemp $611,333
2554 Wyatt Ray $610,000
2555 Josh Smith $610,000
2556 Logan Woodside $610,000
2557 Rashard Davis $610,000
2558 Avery Gennesy $610,000
2559 Parker Hesse $610,000
[2560 rows x 2 columns]
I want to count paragraphs from a dataframe. However, it turns out that my result ends up with zeros inside the list. Does anybody know how to fix it? Thank you so much.
Here is my code:
def count_paragraphs(df):
    paragraph_count = []
    linecount = 0
    for i in df.text:
        if i in ('\n','\r\n'):
            if linecount == 0:
                paragraphcount = paragraphcount + 1
    return paragraph_count
count_paragraphs(df)
df.text
0 On Saturday, September 17 at 8:30 pm EST, an e...
1 Story highlights "This, though, is certain: to...
2 Critical Counties is a CNN series exploring 11...
3 McCain Criticized Trump for Arpaio’s Pardon… S...
4 Story highlights Obams reaffirms US commitment...
5 Obama weighs in on the debate\n\nPresident Bar...
6 Story highlights Ted Cruz refused to endorse T...
7 Last week I wrote an article titled “Donald Tr...
8 Story highlights Trump has 45%, Clinton 42% an...
9 Less than a day after protests over the police...
10 I woke up this morning to find a variation of ...
11 Thanks in part to the declassification of Defe...
12 The Democrats are using an intimidation tactic...
13 Dolly Kyle has written a scathing “tell all” b...
14 The Haitians in the audience have some newswor...
15 The man arrested Monday in connection with the...
16 Back when the news first broke about the pay-t...
17 Chicago Environmentalist Scumbags\n\nLeftists ...
18 Well THAT’S Weird. If the Birther movement is ...
19 Former President Bill Clinton and his Clinton ...
Name: text, dtype: object
Use Series.str.count:
def count_paragraphs(df):
    return df.text.str.count(r'\n\n').tolist()
count_paragraphs(df)
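For example, on toy data (hypothetical texts, not the question's df):
import pandas as pd

# Each '\n\n' separates two paragraphs, so two separators -> count of 2.
toy = pd.DataFrame({'text': ['intro\n\nbody\n\nend', 'single block']})
print(count_paragraphs(toy))  # -> [2, 0]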
This is my answer, and it works!
def count_paragraphs(df):
    paragraph_count = []
    for i in range(len(df)):
        paragraph_count.append(df.text[i].count('\n\n'))
    return paragraph_count
count_paragraphs(df)
I am trying to create maps using Folium feature groups, where each feature group comes from a pandas dataframe row. I am able to achieve this when there is one row in the dataframe, but when there is more than one and I loop through them in a for loop, I am not able to achieve what I want. Please find the Python code below.
from folium import Map, FeatureGroup, Marker, LayerControl
mapa = Map(location=[35.11567262307692,-89.97423444615382], zoom_start=12,
tiles='Stamen Terrain')
feature_group1 = FeatureGroup(name='Tim')
feature_group2 = FeatureGroup(name='Andrew')
feature_group1.add_child(Marker([35.035075, -89.89969], popup='Tim'))
feature_group2.add_child(Marker([35.821835, -90.70503], popup='Andrew'))
mapa.add_child(feature_group1)
mapa.add_child(feature_group2)
mapa.add_child(LayerControl())
mapa
My dataframe contains the following:
Name Address
0 Dollar Tree #2020 3878 Goodman Rd.
1 Dollar Tree #2020 3878 Goodman Rd.
2 National Guard Products Inc 4985 E Raines Rd
3 434 SAVE A LOT C MID WEST 434 Kelvin 3240 Jackson Ave
4 WALGREENS 06765 108 E HIGHLAND DR
5 Aldi #69 4720 SUMMER AVENUE
6 Richmond, Christopher 1203 Chamberlain Drive
City State Zipcode Group
0 Horn Lake MS 38637 Johnathan Shaw
1 Horn Lake MS 38637 Tony Bonetti
2 Memphis TN 38118 Tony Bonetti
3 Memphis TN 38122 Tony Bonetti
4 JONESBORO AR 72401 Josh Jennings
5 Memphis TN 38122 Josh Jennings
6 Memphis TN 38119 Josh Jennings
full_address Color sequence \
0 3878 Goodman Rd.,Horn Lake,MS,38637,USA blue 1
1 3878 Goodman Rd.,Horn Lake,MS,38637,USA cadetblue 1
2 4985 E Raines Rd,Memphis,TN,38118,USA cadetblue 2
3 3240 Jackson Ave,Memphis,TN,38122,USA cadetblue 3
4 108 E HIGHLAND DR,JONESBORO,AR,72401,USA yellow 1
5 4720 SUMMER AVENUE,Memphis,TN,38122,USA yellow 2
6 1203 Chamberlain Drive,Memphis,TN,38119,USA yellow 3
Latitude Longitude
0 34.962637 -90.069019
1 34.962637 -90.069019
2 35.035367 -89.898428
3 35.165115 -89.952624
4 35.821835 -90.705030
5 35.148707 -89.903760
6 35.098829 -89.866838
But when I try the same thing by looping through the dataframe in a for loop, I am not able to achieve what I need:
from folium import Map, FeatureGroup, Marker, LayerControl, plugins

mapa = Map(location=[35.11567262307692,-89.97423444615382], zoom_start=12, tiles='Stamen Terrain')
#mapa.add_tile_layer()

for i in range(0,len(df_addresses)):
    feature_group = FeatureGroup(name=df_addresses.iloc[i]['Group'])
    feature_group.add_child(Marker([df_addresses.iloc[i]['Latitude'], df_addresses.iloc[i]['Longitude']],
                                   popup=('Address: ' + str(df_addresses.iloc[i]['full_address']) + '<br>'
                                          'Tech: ' + str(df_addresses.iloc[i]['Group'])),
                                   icon=plugins.BeautifyIcon(
                                       number=str(df_addresses.iloc[i]['sequence']),
                                       border_width=2,
                                       iconShape='marker',
                                       inner_icon_style='margin-top:2px',
                                       background_color=df_addresses.iloc[i]['Color'],
                                   )))
    mapa.add_child(feature_group)
mapa.add_child(LayerControl())
This is an example dataset because I didn't want to format your df. That said, I think you'll get the idea.
print(df_addresses)
Latitude Longitude Group
0 34.962637 -90.069019 B
1 34.962637 -90.069019 B
2 35.035367 -89.898428 A
3 35.165115 -89.952624 B
4 35.821835 -90.705030 A
5 35.148707 -89.903760 A
6 35.098829 -89.866838 A
After I create the map object (mapa), I perform a groupby on the Group column and then iterate through each group. I first create a FeatureGroup with the grp_name (A or B), and for each group I iterate through that group's dataframe, create Markers, and add them to the FeatureGroup.
import folium

mapa = folium.Map(location=[35.11567262307692,-89.97423444615382], zoom_start=12,
                  tiles='Stamen Terrain')

for grp_name, df_grp in df_addresses.groupby('Group'):
    feature_group = folium.FeatureGroup(grp_name)
    for row in df_grp.itertuples():
        folium.Marker(location=[row.Latitude, row.Longitude]).add_to(feature_group)
    feature_group.add_to(mapa)

folium.LayerControl().add_to(mapa)
mapa
Regarding the Stamen Terrain query: if you're referring to its appearance in the control box, you can remove it by declaring your map with tiles=None and adding the TileLayer separately with control set to false: folium.TileLayer('Stamen Terrain', control=False).add_to(mapa)
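A minimal sketch of that, assuming the same map centre as above:
import folium

# Declare the map with no base tiles, then add the tile layer with
# control=False so it doesn't show up in the LayerControl box.
mapa = folium.Map(location=[35.11567262307692, -89.97423444615382], zoom_start=12, tiles=None)
folium.TileLayer('Stamen Terrain', control=False).add_to(mapa)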
I converted this page (it's squad lists for different sports teams) from PDF to text using this code:
import PyPDF3
import sys
import tabula
import pandas as pd
#One method
pdfFileObj = open(sys.argv[1],'rb')
pdfReader = PyPDF3.PdfFileReader(pdfFileObj)
num_pages = pdfReader.numPages
count = 0
text = ""
while count < num_pages:
    pageObj = pdfReader.getPage(count)
    count += 1
    text += pageObj.extractText()

print(text)
The output looks like this:
2019 SEASON
PREMIER DIVISION SQUAD NUMBERS
CLUB: BOHEMIANS
1
James Talbot
GK
2
Derek Pender
DF
3
Darragh Leahy
DF
.... some more names....
2019 SEASON
PREMIER DIVISION SQUAD NUMBERS
CLUB: CORK CITY
1
Mark McNulty
GK
2
Colm Horgan
DF
3
Alan Bennett
DF
....some more names....
2019 SEASON
PREMIER DIVISION SQUAD NUMBERS
CLUB: DERRY CITY
1
Peter Cherrie
GK
2
Conor McDermott
DF
3
Ciaran Coll
DF
I wanted to transform this output to a tab delimited file with three columns: team name, player name, and number. So for the example I gave, the output would look like:
Bohemians James Talbot 1
Bohemians Derek Pender 2
Bohemians Darragh Leahy 3
Cork City Mark McNulty 1
Cork City Colm Horgan 2
Cork City Alan Bennett 3
Derry City Peter Cherrie 1
Derry City Conor McDermott 2
Derry City Ciaran Coll 3
I know I need to first (1) divide the file into sections based on team, and then (2) within each team section, combine each name and number into a pair so that each number is assigned to a name.
I wrote this little bit of code to parse the big file into each sports team:
import sys
fileopen = open(sys.argv[1])
recording = False
for line in fileopen:
    if not recording:
        if line.startswith('PREMI'):
            recording = True
    elif line.startswith('2019 SEA'):
        recording = False
    else:
        print(line)
But I'm stuck, because the above code won't divide the block of text up per team; I need the multiple blocks of text extracted into separate strings or lists. Can someone advise how to divide up the text file per team? In this example I should be left with three blocks of text, and then I can work on each team's block to pair the numbers and names.
Soooo, not necessarily true to form, and I didn't take into consideration the other libraries you used, but this was designed to give you a start. You can reformat it however you wish.
>>> string = '''2019 SEASON
PREMIER DIVISION SQUAD NUMBERS
CLUB: BOHEMIANS
1
James Talbot
GK
2
Derek Pender
DF
3
Darragh Leahy
DF
.... some more names....
2019 SEASON
PREMIER DIVISION SQUAD NUMBERS
CLUB: CORK CITY
1
Mark McNulty
GK
2
Colm Horgan
DF
3
Alan Bennett
DF
....some more names....
2019 SEASON
PREMIER DIVISION SQUAD NUMBERS
CLUB: DERRY CITY
1
Peter Cherrie
GK
2
Conor McDermott
DF
3
Ciaran Coll
DF'''
>>> def reorder(string):
        import re
        headers = ['Team', 'Name', 'Number']
        print('\n')
        print(headers)
        print()
        paragraphs = re.findall('2019[\S\s]+?(?=2019|$)', string)
        for paragraph in paragraphs:
            club = re.findall('(?i)CLUB:[\s]*([\S\s]+?)\n', paragraph)
            names_numbers = re.findall('(?i)([\d]+)[\n]{1,3}[\s]*([\S\ ]+)', paragraph)
            for i in range(len(names_numbers)):
                if len(club) == 1:
                    print(club[0]+' | '+names_numbers[i][1]+' | '+names_numbers[i][0])
>>> reorder(string)
['Team', 'Name', 'Number']
BOHEMIANS | James Talbot | 1
BOHEMIANS | Derek Pender | 2
BOHEMIANS | Darragh Leahy | 3
CORK CITY | Mark McNulty | 1
CORK CITY | Colm Horgan | 2
CORK CITY | Alan Bennett | 3
DERRY CITY | Peter Cherrie | 1
DERRY CITY | Conor McDermott | 2
DERRY CITY | Ciaran Coll | 3
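If you want the tab-delimited file the question asked for rather than printed rows, the same regexes can feed csv.writer. A sketch (hypothetical output path) reusing the patterns from reorder():
import csv
import re

def write_tsv(string, path='squads.tsv'):
    # Reuses the regex patterns from reorder() above to write
    # Team/Name/Number as tab-separated rows.
    with open(path, 'w', newline='') as f:
        writer = csv.writer(f, delimiter='\t')
        writer.writerow(['Team', 'Name', 'Number'])
        for paragraph in re.findall('2019[\S\s]+?(?=2019|$)', string):
            club = re.findall('(?i)CLUB:[\s]*([\S\s]+?)\n', paragraph)
            for number, name in re.findall('(?i)([\d]+)[\n]{1,3}[\s]*([\S\ ]+)', paragraph):
                if len(club) == 1:
                    writer.writerow([club[0], name, number])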