BeautifulSoup doesn't display the content - python

I want to scrape spot price data from MCX India website.
The HTML, as seen when inspecting the element, is as follows:
<div class="contents spotmarketprice">
<div id="cont-1" style="display: block;">
<table class="mcx-table mrB20" width="100%" cellspacing="8" id="tblSMP">
<thead>
<tr>
<th class="symbol-head">
Commodity
</th>
<th>
Unit
</th>
<th class="left1">
Location
</th>
<th class="right1">
Spot Price (Rs.)
</th>
<th>
Up/Down
</th>
</tr>
</thead>
<tbody>
<tr>
<td class="symbol" style="width:30%;">ALMOND</td>
<td style="width:17%;">1 KGS</td>
<td align="left" style="width:17%;">DELHI</td>
<td align="right" style="width:17%;">558.00</td>
<td align="right" class="padR20" style="width:19%;">=</td>
</tr>
The code I have written is:
# import the required libraries
from bs4 import BeautifulSoup
import requests

# getting data from the website
source = requests.get('http://www.mcxindia.com/market-data/spot-market-price').text

# parsing the html code of the website
soup = BeautifulSoup(source, 'lxml')

# navigating to the block where the required content is present
division_1 = soup.find('div', class_="contents spotmarketprice").div.table

# displaying the results
print(division_1.tbody)
Output:
<tbody>
</tbody>
On the website, the content that I want to get is available in ... but it is not showing any content here. Please suggest a solution to this.

The table is rendered by JavaScript, but the page embeds the same rows as JSON in its source, so you can extract them with a regular expression and feed them to pandas:

import requests
import re
import json
import pandas as pd

goal = ['EnSymbol', 'Unit', 'Location', 'TodaysSpotPrice']


def main(url):
    r = requests.get(url)
    # the page source contains a JSON array after "Data":
    match = json.loads(re.search(r'"Data":(\[.*?\])', r.text).group(1))
    allin = []
    for item in match:
        allin.append([item[x] for x in goal])
    df = pd.DataFrame(allin, columns=goal)
    print(df)


main("https://www.mcxindia.com/market-data/spot-market-price")
Output:
EnSymbol Unit Location TodaysSpotPrice
0 ALMOND 1 KGS DELHI 558.00
1 ALUMINIUM 1 KGS THANE 137.60
2 CARDAMOM 1 KGS VANDANMEDU 2525.00
3 CASTORSEED 100 KGS DEESA 3626.00
4 CHANA 100 KGS DELHI 4163.00
5 COPPER 1 KGS THANE 388.30
6 COTTON 1 BALES RAJKOT 15880.00
7 CPO 10 KGS KANDLA 635.90
8 CRUDEOIL 1 BBL MUMBAI 2418.00
9 GOLD 10 GRMS AHMEDABAD 40989.00
10 GOLDGUINEA 8 GRMS AHMEDABAD 32923.00
11 GOLDM 10 GRMS AHMEDABAD 40989.00
12 GOLDPETAL 1 GRMS MUMBAI 4129.00
13 GUARGUM 100 KGS JODHPUR 5880.00
14 GUARSEED 100 KGS JODHPUR 3660.00
15 KAPAS 20 KGS RAJKOT 927.50
16 LEAD 1 KGS CHENNAI 141.60
17 MENTHAOIL 1 KGS CHANDAUSI 1295.10
18 NATURALGAS 1 mmBtu HAZIRA 138.50
19 NICKEL 1 KGS THANE 892.00
20 PEPPER 100 KGS KOCHI 32700.00
21 RAW JUTE 100 KGS KOLKATA 4999.00
22 RBD PALMOLEIN 10 KGS KANDLA 700.40
23 REFSOYOIL 10 KGS INDORE 845.25
24 SILVER 1 KGS AHMEDABAD 36871.00
25 SILVERM 1 KGS AHMEDABAD 36871.00
26 SILVERMIC 1 KGS AHMEDABAD 36871.00
27 SUGARMDEL 100 KGS DELHI 3380.00
28 SUGARMKOL 100 KGS KOLHAPUR 3334.00
29 SUGARSKLP 100 KGS KOLHAPUR 3275.00
30 TIN 1 KGS MUMBAI 1160.50
31 WHEAT 100 KGS DELHI 1977.50
32 ZINC 1 KGS THANE 155.15
In case you also want the up/down symbol, here's a version that adds it:
import requests
import re
import json
import pandas as pd

goal = ['EnSymbol', 'Unit', 'Location', 'TodaysSpotPrice', 'Change']


def main(url):
    r = requests.get(url)
    match = json.loads(re.search(r'"Data":(\[.*?\])', r.text).group(1))
    allin = []
    for item in match:
        item = [item[x] for x in goal]
        # map the numeric change to an up/down/equal symbol
        item[-1] = '▲' if item[-1] > 0 else '▼' if item[-1] < 0 else "="
        allin.append(item)
    df = pd.DataFrame(allin, columns=goal)
    print(df)


main("https://www.mcxindia.com/market-data/spot-market-price")
Output:
EnSymbol Unit Location TodaysSpotPrice Change
0 ALMOND 1 KGS DELHI 558.00 =
1 ALUMINIUM 1 KGS THANE 137.60 =
2 CARDAMOM 1 KGS VANDANMEDU 2525.00 =
3 CASTORSEED 100 KGS DEESA 3626.00 =
4 CHANA 100 KGS DELHI 4163.00 =
5 COPPER 1 KGS THANE 388.30 =
6 COTTON 1 BALES RAJKOT 15880.00 ▲
7 CPO 10 KGS KANDLA 635.90 ▲
8 CRUDEOIL 1 BBL MUMBAI 2418.00 ▲
9 GOLD 10 GRMS AHMEDABAD 40989.00 =
10 GOLDGUINEA 8 GRMS AHMEDABAD 32923.00 =
11 GOLDM 10 GRMS AHMEDABAD 40989.00 =
12 GOLDPETAL 1 GRMS MUMBAI 4129.00 =
13 GUARGUM 100 KGS JODHPUR 5880.00 =
14 GUARSEED 100 KGS JODHPUR 3660.00 =
15 KAPAS 20 KGS RAJKOT 927.50 ▲
16 LEAD 1 KGS CHENNAI 141.60 =
17 MENTHAOIL 1 KGS CHANDAUSI 1295.10 =
18 NATURALGAS 1 mmBtu HAZIRA 138.50 ▲
19 NICKEL 1 KGS THANE 892.00 =
20 PEPPER 100 KGS KOCHI 32600.00 ▼
21 RAW JUTE 100 KGS KOLKATA 4999.00 =
22 RBD PALMOLEIN 10 KGS KANDLA 700.40 ▼
23 REFSOYOIL 10 KGS INDORE 845.25 =
24 SILVER 1 KGS AHMEDABAD 36871.00 =
25 SILVERM 1 KGS AHMEDABAD 36871.00 =
26 SILVERMIC 1 KGS AHMEDABAD 36871.00 =
27 SUGARMDEL 100 KGS DELHI 3380.00 ▼
28 SUGARMKOL 100 KGS KOLHAPUR 3334.00 ▲
29 SUGARSKLP 100 KGS KOLHAPUR 3275.00 ▼
30 TIN 1 KGS MUMBAI 1160.50 ▼
31 WHEAT 100 KGS DELHI 1977.50 ▲
32 ZINC 1 KGS THANE 155.15 =
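One caveat worth adding: re.search returns None when the pattern is absent, so if MCX ever changes the page layout, the .group(1) call above would raise an AttributeError. A small guard (a sketch, not part of the original answer) makes that failure explicit:

import re
import json
import requests

r = requests.get('https://www.mcxindia.com/market-data/spot-market-price')
m = re.search(r'"Data":(\[.*?\])', r.text)
if m is None:
    # the embedded JSON was not found; the page layout may have changed
    raise RuntimeError('no "Data" array in the page source')
rows = json.loads(m.group(1))
print(len(rows), 'rows')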

It does seem like the data within the table is loaded through JavaScript.
That's why, if you try to fetch this information with the requests library, you don't receive the table's data in return; requests simply doesn't execute JS. So the problem here isn't in BeautifulSoup.
To scrape JS-driven data, consider using selenium with chromedriver. The solution in this case would look like:
# import libraries
from bs4 import BeautifulSoup
from selenium import webdriver

# create a webdriver
chromedriver_path = 'C:\\path\\to\\chromedriver.exe'
driver = webdriver.Chrome(chromedriver_path)

# go to the page and get its source after the JS has run
driver.get('http://www.mcxindia.com/market-data/spot-market-price')
soup = BeautifulSoup(driver.page_source, 'html.parser')

# fetch the mentioned data
table = soup.find('table', {'id': 'tblSMP'})
for tr in table.tbody.find_all('tr'):
    row = [td.text for td in tr.find_all('td')]
    print(row)

# close the webdriver
driver.quit()
The output of the above script is:
['ALMOND', '1 KGS', 'DELHI', '558.00', '=']
['ALUMINIUM', '1 KGS', 'THANE', '137.60', '=']
['CARDAMOM', '1 KGS', 'VANDANMEDU', '2,525.00', '=']
['CASTORSEED', '100 KGS', 'DEESA', '3,626.00', '▼']
['CHANA', '100 KGS', 'DELHI', '4,163.00', '▲']
['COPPER', '1 KGS', 'THANE', '388.30', '=']
['COTTON', '1 BALES', 'RAJKOT', '15,790.00', '▲']
['CPO', '10 KGS', 'KANDLA', '630.10', '▼']
['CRUDEOIL', '1 BBL', 'MUMBAI', '2,418.00', '▲']
['GOLD', '10 GRMS', 'AHMEDABAD', '40,989.00', '=']
['GOLDGUINEA', '8 GRMS', 'AHMEDABAD', '32,923.00', '=']
['GOLDM', '10 GRMS', 'AHMEDABAD', '40,989.00', '=']
['GOLDPETAL', '1 GRMS', 'MUMBAI', '4,129.00', '=']
['GUARGUM', '100 KGS', 'JODHPUR', '5,880.00', '=']
['GUARSEED', '100 KGS', 'JODHPUR', '3,660.00', '=']
UPD: I must specify that the code above answers the question of scraping this specific table. However, websites sometimes store data in 'application/json' or similar tags that can be reached with the requests library (since reading them doesn't require executing JS).
As discovered by αԋɱҽԃ αмєяιcαη, the current website contains such a tag, so please check his answer. It is indeed better to use requests than selenium in this situation.
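For reference, a minimal sketch of that pattern, assuming the data sits in a <script type="application/json"> tag (the exact tag and its attributes vary from site to site):

import json
import requests
from bs4 import BeautifulSoup

html = requests.get('https://www.mcxindia.com/market-data/spot-market-price').text
soup = BeautifulSoup(html, 'lxml')

# read an embedded JSON payload; no JS execution is needed for this
script = soup.find('script', type='application/json')
if script is not None:
    payload = json.loads(script.string)
    print(payload)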

Related

How to scrape this football page?

https://fbref.com/en/partidas/25d5b9bd/Coritiba-Cuiaba-2022Julho25-Serie-A
I want to scrape the Team Stats, such as Possession and Shots on Target, and also what's below them, like Fouls and Corners.
What I have now is very overcomplicated code that basically strips and splits a single string multiple times to grab the values I want.
import time

import requests
import pandas as pd
from bs4 import BeautifulSoup

# getting a general info dataframe with all matches
championship_url = 'https://fbref.com/en/comps/24/1495/schedule/2016-Serie-A-Scores-and-Fixtures'
data = requests.get(championship_url)
time.sleep(3)
matches = pd.read_html(data.text, match="Resultados e Calendários")[0]

# putting stats info in each match entry (this is an example match to test)
match_url = 'https://fbref.com/en/partidas/25d5b9bd/Coritiba-Cuiaba-2022Julho25-Serie-A'
data = requests.get(match_url)
time.sleep(3)
soup = BeautifulSoup(data.text, features='lxml')

# ID the match to merge later on
home_team = soup.find("h1").text.split()[0]
round_week = float(soup.find("div", {'id': 'content'}).text.split()[18].strip(')'))

# collecting stats
# keep the raw text: the .replace()/.split() chains below operate on a string
stats = soup.find("div", {"id": "team_stats"}).text  # first part of stats, with the progress bars
stats_extra = soup.find("div", {"id": "team_stats_extra"}).text.split()[2:]  # second part

all_stats = {'posse_casa': [], 'posse_fora': [], 'chutestotais_casa': [], 'chutestotais_fora': [],
             'acertopasses_casa': [], 'acertopasses_fora': [], 'chutesgol_casa': [], 'chutesgol_fora': [],
             'faltas_casa': [], 'faltas_fora': [], 'escanteios_casa': [], 'escanteios_fora': [],
             'cruzamentos_casa': [], 'cruzamentos_fora': [], 'contatos_casa': [], 'contatos_fora': [],
             'botedef_casa': [], 'botedef_fora': [], 'aereo_casa': [], 'aereo_fora': [],
             'defesas_casa': [], 'defesas_fora': [], 'impedimento_casa': [], 'impedimento_fora': [],
             'tirometa_casa': [], 'tirometa_fora': [], 'lateral_casa': [], 'lateral_fora': [],
             'bolalonga_casa': [], 'bolalonga_fora': [], 'Em casa': [home_team], 'Sem': [round_week]}

# not gonna copy everything, but it is kinda like this for each stat
# stats = '\nEstatísticas do time\n\n\nCoritiba \n\n\n\t\n\n\n\n\n\n\n\n\n\n Cuiabá\n\nPosse\n\n\n\n42%\n\n\n\n\n\n58%\n\n\n\n\nChutes ao gol\n\n\n\n2 of 4\xa0—\xa050%\n\n\n\n\n\n0%\xa0—\xa00 of 8\n\n\n\n\nDefesas\n\n\n\n0 of 0\xa0—\xa0%\n\n\n\n\n\n50%\xa0—\xa01 of 2\n\n\n\n\nCartões\n\n\n\n\n\n\n\n\n\n\n\n\n\n'

# first grabbing 42% possession
all_stats['posse_casa'] = stats.replace('\n', '').replace('\t', '')[20:].split('Posse')[1][:5].split('%')[0]
# grabbing 58% possession
all_stats['posse_fora'] = stats.replace('\n', '').replace('\t', '')[20:].split('Posse')[1][:5].split('%')[1]

all_stats_df = pd.DataFrame.from_dict(all_stats)
championship_data = matches.merge(all_stats_df, on=['Em casa', 'Sem'])
There are a lot of stats in that dict because, for previous championship years, FBref has all of those stats; only in the current year's championship are there just 12 of them to fill. I intend to run the code on 5-6 different years, so I made a version with all the stats, and for current-year games I plan to fill with nothing when a stat isn't on the page to scrape.
You can get Fouls, Corners and Offsides, plus seven tables' worth of data, from that page with the following code:
from bs4 import BeautifulSoup
import requests
import pandas as pd

url = 'https://fbref.com/en/partidas/25d5b9bd/Coritiba-Cuiaba-2022Julho25-Serie-A'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

# each stat label sits in a div; its siblings hold the home/away values
coritiba_fouls = soup.find('div', string='Fouls').previous_sibling.text.strip()
cuiaba_fouls = soup.find('div', string='Fouls').next_sibling.text.strip()
coritiba_corners = soup.find('div', string='Corners').previous_sibling.text.strip()
cuiaba_corners = soup.find('div', string='Corners').next_sibling.text.strip()
coritiba_offsides = soup.find('div', string='Offsides').previous_sibling.text.strip()
cuiaba_offsides = soup.find('div', string='Offsides').next_sibling.text.strip()

print('Coritiba Fouls: ' + coritiba_fouls, 'Cuiaba Fouls: ' + cuiaba_fouls)
print('Coritiba Corners: ' + coritiba_corners, 'Cuiaba Corners: ' + cuiaba_corners)
print('Coritiba Offsides: ' + coritiba_offsides, 'Cuiaba Offsides: ' + cuiaba_offsides)

dfs = pd.read_html(r.text)
print('Number of tables: ' + str(len(dfs)))
for df in dfs:
    print(df)
    print('___________')
This will print in the terminal:
Coritiba Fouls: 16 Cuiaba Fouls: 12
Coritiba Corners: 4 Cuiaba Corners: 4
Coritiba Offsides: 0 Cuiaba Offsides: 1
Number of tables: 7
Coritiba (4-2-3-1) Coritiba (4-2-3-1).1
0 23 Alex Muralha
1 2 Matheus Alexandre
2 3 Henrique
3 4 Luciano Castán
4 6 Egídio Pereira Júnior
5 9 Léo Gamalho
6 11 Alef Manga
7 25 Bernanrdo Lemes
8 78 Régis
9 97 Valdemir
10 98 Igor Paixão
11 Bench Bench
12 21 Rafael William
13 5 Guillermo de los Santos
14 15 Matías Galarza
15 16 Natanael
16 18 Guilherme Biro
17 19 Thonny Anderson
18 28 Pablo Javier García
19 32 Bruno Gomes
20 44 Márcio Silva
21 52 Adrián Martínez
22 75 Luiz Gabriel
23 88 Hugo
___________
Cuiabá (4-1-4-1) Cuiabá (4-1-4-1).1
0 1 Walter
1 2 João Lucas
2 3 Joaquim
3 4 Marllon Borges
4 5 Camilo
5 6 Igor Cariús
6 7 Alesson
7 8 João Pedro Pepê
8 9 Valdívia
9 10 Rodriguinho Marinho
10 11 Rafael Gava
11 Bench Bench
12 12 João Carlos
13 13 Daniel Guedes
14 14 Paulão
15 15 Marcão Silva
16 16 Cristian Rivas
17 17 Gabriel Pirani
18 18 Jenison
19 19 André
20 20 Kelvin Osorio
21 21 Jonathan Cafu
22 22 André Luis
23 23 Felipe Marques
___________
Coritiba Cuiabá
Possession Possession
0 42% 58%
1 Shots on Target Shots on Target
2 2 of 4 — 50% 0% — 0 of 8
3 Saves Saves
4 0 of 0 — % 50% — 1 of 2
5 Cards Cards
6 NaN NaN
_____________
[....]
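If you need more of the paired team stats, the same sibling lookup generalizes to a loop. A sketch, reusing the soup object from the code above; extend the label list with whatever stat names the page actually shows:

stats = {}
for label in ['Fouls', 'Corners', 'Offsides']:
    div = soup.find('div', string=label)
    if div is not None:  # skip stats missing from this match page
        stats[label] = {'home': div.previous_sibling.text.strip(),
                        'away': div.next_sibling.text.strip()}
print(stats)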

How to scrape all data from first page to last page using beautifulsoup

I have been trying to scrape all data from the first page to the last page, but it returns only the first page as the output. How can I solve this? Below is my code:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
from time import sleep
from random import randint

pages = np.arange(2, 1589, 20)
for page in pages:
    page = requests.get("https://estateintel.com/app/projects/search?q=%7B%22sectors%22%3A%5B%22residential%22%5D%7D&page=" + str(page))
    sleep(randint(2, 10))
    soup = BeautifulSoup(page.content, 'html.parser')
    lists = soup.find_all('div', class_="project-card-vertical h-full flex flex-col rounded border-thin border-inactive-blue overflow-hidden pointer")
    for list in lists:
        title = list.find('p', class_="project-location text-body text-base mb-3").text.replace('\n', '').strip()
        location = list.find('span', class_="text-gray-1").text.replace('\n', '').strip()
        status = list.find('span', class_="text-purple-1 font-bold").text.replace('\n', '').strip()
        units = list.find('span', class_="text-body font-semibold").text.replace('\n', '').strip()
        info = [title, location, status, units]
        print(info)
The page is loaded dynamically through an API. Therefore, with a regular GET request you will always get only the first page. You need to study how the page communicates with the server and find the request you need. I wrote an example for review.
import json
import requests


def get_info(page):
    url = f"https://services.estateintel.com/api/v2/properties?type\\[\\]=residential&page={page}"
    headers = {
        'accept': 'application/json',
        'authorization': 'false',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.84 Safari/537.36'
    }
    response = requests.request("GET", url, headers=headers)
    json_obj = json.loads(response.text)
    for data in json_obj['data']:
        print(data['name'])
        print(data['area'], data['state'])
        print(data['status'])
        print(data['size']['value'], data['size']['unit'])
        print('------')


for page in range(1, 134):
    get_info(page)
You can choose the fields you need; this is just an example. You can also collect the rows into a DataFrame (see the sketch after the output below). Output:
Twin Oaks Apartment
Kilimani Nairobi
Completed
0 units
------
Duchess Park
Lavington Nairobi
Completed
62 units
------
Greenvale Apartments
Kileleshwa Nairobi
Completed
36 units
------
The Urban apartments & Suites
Osu Greater Accra
Completed
28 units
------
Chateau Towers
Osu Greater Accra
Completed
120 units
------
Cedar Haus Gardens
Oluyole Oyo
Under Construction
38 units
------
10 Agoro Street
Oluyole Oyo
Completed
1 units
..............
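To collect everything into a DataFrame, as suggested above, here is a sketch using the same endpoint and JSON fields as the example (the params form of the query mirrors the URL used there):

import requests
import pandas as pd

headers = {
    'accept': 'application/json',
    'authorization': 'false',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.84 Safari/537.36'
}
rows = []
for page in range(1, 134):
    resp = requests.get('https://services.estateintel.com/api/v2/properties',
                        params={'type[]': 'residential', 'page': page},
                        headers=headers)
    for d in resp.json()['data']:
        rows.append({'name': d['name'],
                     'location': f"{d['area']} {d['state']}",
                     'status': d['status'],
                     'units': f"{d['size']['value']} {d['size']['unit']}"})

df = pd.DataFrame(rows)
print(df)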
It seems to be working well, but it needs time to sleep between requests. Just in case, you could select your elements more specifically, e.g. with CSS selectors, and store the information in a list of dicts instead of just printing it.
Example
import pandas as pd
import requests
from bs4 import BeautifulSoup
from time import sleep
from random import randint

data = []
for page in range(1, 134):
    print(page)
    page = requests.get("https://estateintel.com/app/projects/search?q=%7B%22sectors%22%3A%5B%22residential%22%5D%7D&page=" + str(page))
    sleep(randint(2, 10))
    soup = BeautifulSoup(page.content, 'html.parser')
    for item in soup.select('div.project-grid > a'):
        data.append({
            'title': item.h3.text.strip(),
            'location': item.find('span', class_="text-gray-1").text.strip(),
            'status': item.find('span', class_="text-purple-1 font-bold").text.strip(),
            'units': item.find('span', class_="text-body font-semibold").text.strip()
        })

pd.DataFrame(data)
Output
    title                          location               status              units
0   Twin Oaks Apartment            Kilimani, Nairobi      Completed           Size: --
1   Duchess Park                   Lavington, Nairobi     Completed           Size: 62 units
2   Greenvale Apartments           Kileleshwa, Nairobi    Completed           Size: 36 units
3   The Urban apartments & Suites  Osu, Greater Accra     Completed           Size: 28 units
4   Chateau Towers                 Osu, Greater Accra     Completed           Size: 120 units
5   Cedar Haus Gardens             Oluyole, Oyo           Under Construction  Size: 38 units
6   10 Agoro Street                Oluyole, Oyo           Completed           Size: 1 units
7   Villa O                        Oluyole, Oyo           Completed           Size: 2 units
8   Avenue Road Apartments         Oluyole, Oyo           Completed           Size: 6 units
9   15 Alafia Street               Oluyole, Oyo           Completed           Size: 4 units
10  12 Saint Mary Street           Oluyole, Oyo           Nearing Completion  Size: 8 units
11  RATCON Estate                  Oluyole, Oyo           Completed           Size: --
12  1 Goodwill Road                Oluyole, Oyo           Completed           Size: 4 units
13  Anike's Court                  Oluyole, Oyo           Completed           Size: 3 units
14  9 Adeyemo Quarters             Oluyole, Oyo           Completed           Size: 4 units
15  Marigold Residency             Nairobi West, Nairobi  Under Construction  Size: --
16  Kings Distinction              Kilimani, Nairobi      Completed           Size: --
17  Riverview Apartments           Kyumvi, Machakos       Completed           Size: --
18  Serene Park                    Kyumvi, Machakos       Under Construction  Size: --
19  Gitanga Duplexes               Lavington, Nairobi     Under Construction  Size: 36 units
20  Westpointe Apartments          Upper Hill, Nairobi    Completed           Size: 254 units
21  10 Olaoluwa Street             Oluyole, Oyo           Under Construction  Size: 12 units
22  Rosslyn Grove                  Nairobi West, Nairobi  Under Construction  Size: 90 units
23  7 Kamoru Ajimobi Street        Oluyole, Oyo           Completed           Size: 2 units
Or, as another take on the API route, fetching the pages concurrently with httpx and trio:

# pip install trio httpx pandas
import trio
import httpx
import pandas as pd

allin = []
keys1 = ['name', 'area', 'state']
keys2 = ['value', 'unit']


async def scraper(client, page):
    client.params = client.params.merge({'page': page})
    r = await client.get('/properties')
    allin.extend([[i.get(k, 'N/A') for k in keys1] +
                  [i['size'].get(b, 'N/A') for b in keys2]
                  for i in r.json()['data']])


async def main():
    async with httpx.AsyncClient(timeout=None, base_url='https://services.estateintel.com/api/v2') as client, trio.open_nursery() as nurse:
        client.params = {
            'type[]': 'residential'
        }
        for page in range(1, 3):
            nurse.start_soon(scraper, client, page)
    # the nursery waits for all scraper tasks before we build the frame
    df = pd.DataFrame(allin, columns=keys1 + keys2)
    print(df)


if __name__ == "__main__":
    trio.run(main)
Output:
                             name          area          state value   unit
0              Cedar Haus Gardens       Oluyole            Oyo    38  units
1                 10 Agoro Street       Oluyole            Oyo     1  units
2                         Villa O       Oluyole            Oyo     2  units
3          Avenue Road Apartments       Oluyole            Oyo     6  units
4                15 Alafia Street       Oluyole            Oyo     4  units
5            12 Saint Mary Street       Oluyole            Oyo     8  units
6                   RATCON Estate       Oluyole            Oyo     0  units
7                 1 Goodwill Road       Oluyole            Oyo     4  units
8                   Anike's Court       Oluyole            Oyo     3  units
9              9 Adeyemo Quarters       Oluyole            Oyo     4  units
10             Marigold Residency  Nairobi West        Nairobi     0  units
11           Riverview Apartments        Kyumvi       Machakos     0  units
12        Socian Villa Apartments    Kileleshwa        Nairobi    36  units
13          Kings Pearl Residency     Lavington        Nairobi    55  units
14              Touchwood Gardens      Kilimani        Nairobi    32  units
15            Panorama Apartments    Upper Hill        Nairobi     0  units
16               Gitanga Duplexes     Lavington        Nairobi    36  units
17                    Serene Park        Kyumvi       Machakos    25  units
18              Kings Distinction      Kilimani        Nairobi    48  units
19            Twin Oaks Apartment      Kilimani        Nairobi     0  units
20                   Duchess Park     Lavington        Nairobi    70  units
21           Greenvale Apartments    Kileleshwa        Nairobi    36  units
22  The Urban apartments & Suites           Osu  Greater Accra    28  units
23                 Chateau Towers           Osu  Greater Accra   120  units

Cannot get response.get() to load full webpage

When I go to scrape https://www.onthesnow.com/epic-pass/skireport for the names of all the ski resorts listed, I'm running into an issue where some of the ski resorts don't show up in my output. Here's my current code:
import requests
url = "https://www.onthesnow.com/epic-pass/skireport"
response = requests.get(url)
response.text
The current output gives all resorts up to Mont Sainte Anne, but then it skips to the resorts at the bottom of the webpage under "closed resorts". I notice that when you scroll down the webpage in a browser, the missing resort names only load once you scroll to them. How do I make my requests.get() call obtain all of the HTML, even the HTML that still needs to load?
The data you see is loaded from an external URL in JSON form. To load it, you can use this example:
import json
import requests

url = "https://api.onthesnow.com/api/v2/region/1291/resorts/1/page/1?limit=999"
data = requests.get(url).json()

# uncomment to print all data:
# print(json.dumps(data, indent=4))

for i, d in enumerate(data["data"], 1):
    print(i, d["title"])
Prints:
1 Beaver Creek
2 Breckenridge
3 Brides les Bains
4 Courchevel
5 Crested Butte Mountain Resort
6 Fernie Alpine
7 Folgàrida - Marilléva
8 Heavenly
9 Keystone
10 Kicking Horse
11 Kimberley
12 Kirkwood
13 La Tania
14 Les Menuires
15 Madonna di Campiglio
16 Meribel
17 Mont Sainte Anne
18 Nakiska Ski Area
19 Nendaz
20 Northstar California
21 Okemo Mountain Resort
22 Orelle
23 Park City
24 Pontedilegno - Tonale
25 Saint Martin de Belleville
26 Snowbasin
27 Stevens Pass Resort
28 Stoneham
29 Stowe Mountain
30 Sun Valley
31 Thyon 2000
32 Vail
33 Val Thorens
34 Verbier
35 Veysonnaz
36 Whistler Blackcomb

Scrape College Football team recruiting rankings page

So I have been able to scrape the first 50 teams from the 247sports team rankings webpage.
I was able to get the following results:
index Rank Team Total Recruits Average Rating Total Rating
0 0 1 Ohio State 17 94.35 286.75
1 10 11 Alabama 10 94.16 210.61
2 8 9 Georgia 11 93.38 219.60
3 31 32 Clemson 8 92.02 161.74
4 3 4 LSU 14 91.92 240.57
5 4 5 Oklahoma 13 91.81 229.03
6 22 23 USC 9 91.60 174.69
7 11 12 Texas A&M 11 91.59 203.03
8 1 2 Notre Dame 18 91.01 250.35
9 2 3 Penn State 18 90.04 243.95
10 6 7 Texas 14 90.04 222.03
11 14 15 Missouri 12 89.94 196.37
12 7 8 Oregon 15 89.91 220.66
13 5 6 Florida State 15 89.88 224.51
14 25 26 Florida 10 89.15 167.89
15 37 38 North Carolina 9 88.94 152.79
16 9 10 Michigan 16 88.76 216.07
17 33 34 UCLA 10 88.49 160.00
18 23 24 Kentucky 11 88.46 173.12
19 12 13 Rutgers 14 88.44 198.56
20 19 20 Indiana 12 88.41 181.20
21 49 50 Washington 8 88.21 132.55
22 20 21 Oklahoma State 13 88.18 177.91
23 43 44 Ole Miss 10 87.80 143.35
24 44 45 California 9 87.78 141.80
25 17 18 Arkansas 15 87.75 188.64
26 16 17 South Carolina 15 87.61 190.84
27 32 33 Georgia Tech 11 87.30 161.33
28 35 36 Tennessee 11 87.25 157.77
29 39 40 NC State 11 87.18 150.18
30 46 47 SMU 9 87.08 138.50
31 36 37 Wisconsin 11 87.00 157.55
32 21 22 Mississippi State 15 86.96 177.33
33 24 25 West Virginia 13 86.78 171.72
34 30 31 Northwestern 14 86.76 162.66
35 40 41 Maryland 12 86.31 149.77
36 15 16 Virginia Tech 18 86.23 191.06
37 18 19 Baylor 19 85.90 184.68
38 13 14 Boston College 22 85.88 197.15
39 26 27 Michigan State 14 85.85 167.60
40 29 30 Cincinnati 14 85.68 164.90
41 34 35 Minnesota 13 85.55 159.35
42 28 29 Iowa State 14 85.54 166.50
43 48 49 Virginia 10 85.39 133.93
44 45 46 Arizona 11 85.27 140.90
45 41 42 Pittsburgh 12 85.10 147.58
46 47 48 Duke 13 85.02 137.40
47 27 28 Vanderbilt 16 85.01 166.77
48 38 39 Purdue 13 84.83 152.55
49 42 43 Illinois 13 84.15 143.86
From the following script:
import requests
import pandas as pd
from bs4 import BeautifulSoup

year = '2022'
url = 'https://247sports.com/Season/' + str(year) + '-Football/CompositeTeamRankings/'
print(url)

# Add the `user-agent`, otherwise we will get blocked when sending the request
headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36"}
response = requests.get(url, headers=headers).content
soup = BeautifulSoup(response, "html.parser")

data = []
for tag in soup.find_all("li", class_="rankings-page__list-item"):
    rank = tag.find('div', {'class': 'primary'}).text.strip()
    team = tag.find('div', {'class': 'team'}).find('a').text.strip()
    total_recruits = tag.find('div', {'class': 'total'}).find('a').text.split(' ')[0].strip()
    # five_stars = tag.find('div', {'class': 'gold'}).text.strip()
    # four_stars = tag.find('div', {'class': 'gold'}).text.strip()
    # three_stars = tag.find('div', {'class': 'metrics'}).text.strip()
    avg_rating = tag.find('div', {'class': 'avg'}).text.strip()
    total_rating = tag.find('div', {'class': 'points'}).text.strip()
    data.append(
        {
            "Rank": rank,
            "Team": team,
            "Total Recruits": total_recruits,
            # "Five-Star Recruits": five_stars,
            # "Four-Star Recruits": four_stars,
            # "Three-Star Recruits": three_stars,
            "Average Rating": avg_rating,
            "Total Rating": total_rating
        }
    )

df = pd.DataFrame(data)
df[['Rank', 'Total Recruits', 'Average Rating', 'Total Rating']] = df[['Rank', 'Total Recruits', 'Average Rating', 'Total Rating']].apply(pd.to_numeric)
df.sort_values('Average Rating', ascending=False).reset_index()
# soup
However, I would like to achieve three things:
1. Grab the data from the "5-stars", "4-stars", and "3-stars" columns on the webpage.
2. Get not just the first 50 schools, but tell the webpage to click "Load More" enough times to build the table with ALL schools in it.
3. Get not only the 2022 team rankings, but every team ranking that 247sports has to offer (2000 through 2024).
I tried to give it a go with the script below, but I constantly get the top-50 schools outputted in one loop in the print(row) portion of the code.
import datetime

import requests
import pandas as pd
from bs4 import BeautifulSoup

print(datetime.datetime.now().time())

# years = ['2000', '2001', '2002', '2003', '2004',
#          '2005', '2006', '2007', '2008', '2009',
#          '2010', '2011', '2012', '2013', '2014',
#          '2015', '2016', '2017', '2018', '2019',
#          '2020', '2021', '2022', '2023']
years = ['2022']
rows = []
page_totals = []
# recruits_final = []
for year in years:
    url = 'https://247sports.com/Season/' + str(year) + '-Football/CompositeTeamRankings/'
    print(url)
    headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Mobile Safari/537.36'}
    page = 0
    while True:
        page += 1
        payload = {'Page': '%s' % page}
        response = requests.get(url, headers=headers, params=payload)
        soup = BeautifulSoup(response.text, 'html.parser')
        tags = soup.find_all('li', {'class': 'rankings-page__list-item'})
        if len(tags) == 0:
            print('Page: %s' % page)
            page_totals.append(page)
            break
        continue_loop = True
        while continue_loop == True:
            for tag in tags:
                if tag.text.strip() == 'Load More':
                    continue_loop = False
                    continue
                # primary_rank = tag.find('div', {'class': 'rank-column'}).find('div', {'class': 'primary'}).text.strip()
                # try:
                #     other_rank = tag.find('div', {'class': 'rank-column'}).find('div', {'class': 'other'}).text.strip()
                # except:
                #     other_rank = ''
                rank = tag.find('div', {'class': 'primary'}).text.strip()
                team = tag.find('div', {'class': 'team'}).find('a').text.strip()
                total_recruits = tag.find('div', {'class': 'total'}).find('a').text.split(' ')[0].strip()
                # five_stars = tag.find('div', {'class': 'gold'}).text.strip()
                # four_stars = tag.find('div', {'class': 'gold'}).text.strip()
                # three_stars = tag.find('div', {'class': 'metrics'}).text.strip()
                avg_rating = tag.find('div', {'class': 'avg'}).text.strip()
                total_rating = tag.find('div', {'class': 'points'}).text.strip()
                # note: `athlete` is undefined here, so this always falls into the
                # except branch and blanks out the team scraped above
                try:
                    team = athlete.find('div', {'class': 'status'}).find('img')['title']
                except:
                    team = ''
                row = {'Rank': rank,
                       'Team': team,
                       'Total Recruits': total_recruits,
                       'Average Rating': avg_rating,
                       'Total Rating': total_rating,
                       'Year': year}
                print(row)
                rows.append(row)

recruits = pd.DataFrame(rows)
print(datetime.datetime.now().time())
Any assistance on this is truly appreciated. Thanks in advance.
First, you can extract the year ranges from the dropdown with BeautifulSoup (no need to click the button, as the dropdown is already on the page), then navigate to each year's link with selenium, using the latter to interact with the "Load More" toggle, and finally scrape the resulting tables:
from bs4 import BeautifulSoup as soup
from selenium import webdriver
import time, urllib.parse

d = webdriver.Chrome('path/to/chromedriver')
d.get((url := 'https://247sports.com/Season/2022-Football/CompositeTeamRankings/'))
result = {}
for i in soup(d.page_source, 'html.parser').select('.rankings-page__header-nav > .rankings-page__nav-block .flyout_cmp.year.tooltip li a'):
    if (y := int(i.get_text(strip=True))) > 1999:
        d.get(urllib.parse.urljoin(url, i['href']))
        # keep clicking "Load More" until the toggle disappears
        while d.execute_script("""return document.querySelector('a[data-js="showmore"]') != null"""):
            d.execute_script("""document.querySelector('a[data-js="showmore"]').click()""")
            time.sleep(1)
        result[y] = [{"Rank": i.select_one('div.wrapper .rank-column .other').get_text(strip=True),
                      "Team": i.select_one('.team').get_text(strip=True),
                      "Total": i.select_one('.total').get_text(strip=True).split()[0],
                      "5-Stars": i.select_one('.star-commits-list li:nth-of-type(1) div').get_text(strip=True),
                      "4-Stars": i.select_one('.star-commits-list li:nth-of-type(2) div').get_text(strip=True),
                      "3-Stars": i.select_one('.star-commits-list li:nth-of-type(3) div').get_text(strip=True),
                      "Ave": i.select_one('.avg').get_text(strip=True),
                      "Points": i.select_one('.points').get_text(strip=True)}
                     for i in soup(d.page_source, 'html.parser').select("""ul[data-js="rankings-list"].rankings-page__list li.rankings-page__list-item""")]
result stores all the team rankings for a given year, 2000-2024 (list(result) produces [2024, 2023, 2022, 2021, 2020, 2019, 2018, 2017, 2016, 2015, 2014, 2013, 2012, 2011, 2010, 2009, 2008, 2007, 2006, 2005, 2004, 2003, 2002, 2001, 2000]). To convert the results to a pandas.DataFrame:
import pandas as pd
df = pd.DataFrame([{'Year':a, **i} for a, b in result.items() for i in b])
print(df)
Output:
Year Rank Team Total 5-Stars 4-Stars 3-Stars Ave Points
0 2024 N/A Iowa 1 0 0 0 0.00 0.00
1 2024 N/A Florida State 3 0 0 0 0.00 0.00
2 2024 N/A BYU 1 0 0 0 0.00 0.00
3 2023 1 Georgia 4 0 4 0 93.86 93.65
4 2023 3 Notre Dame 2 1 1 0 95.98 51.82
... ... ... ... ... ... ... ... ... ...
3543 2000 N/A NC State 18 0 0 0 70.00 0.00
3544 2000 N/A Colorado State 14 0 0 0 70.00 0.00
3545 2000 N/A Oregon 27 0 0 0 70.00 0.00
3546 2000 N/A California 25 0 0 0 70.00 0.00
3547 2000 N/A Texas Tech 20 0 0 0 70.00 0.00
[3548 rows x 9 columns]
Edit: instead of using selenium, you can send requests to the API endpoints that the site uses to retrieve and display the ranking data:
import requests, pandas as pd
from bs4 import BeautifulSoup as soup


def extract_rankings(source):
    return [{"Rank": i.select_one('div.wrapper .rank-column .other').get_text(strip=True),
             "Team": i.select_one('.team').get_text(strip=True),
             "Total": i.select_one('.total').get_text(strip=True).split()[0],
             "5-Stars": i.select_one('.star-commits-list li:nth-of-type(1) div').get_text(strip=True),
             "4-Stars": i.select_one('.star-commits-list li:nth-of-type(2) div').get_text(strip=True),
             "3-Stars": i.select_one('.star-commits-list li:nth-of-type(3) div').get_text(strip=True),
             "Ave": i.select_one('.avg').get_text(strip=True),
             "Points": i.select_one('.points').get_text(strip=True)}
            for i in soup(source, 'html.parser').select("""li.rankings-page__list-item""")]


def year_rankings(year):
    page, results = 1, []
    vals = extract_rankings(requests.get(f'https://247sports.com/Season/{year}-Football/CompositeTeamRankings/?ViewPath=~%2FViews%2FSkyNet%2FInstitutionRanking%2F_SimpleSetForSeason.ascx&Page={page}', headers={'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Mobile Safari/537.36'}).text)
    while vals:
        results.extend(vals)
        page += 1
        vals = extract_rankings(requests.get(f'https://247sports.com/Season/{year}-Football/CompositeTeamRankings/?ViewPath=~%2FViews%2FSkyNet%2FInstitutionRanking%2F_SimpleSetForSeason.ascx&Page={page}', headers={'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Mobile Safari/537.36'}).text)
    return results


results = {y: year_rankings(y) for y in range(2000, 2025)}
df = pd.DataFrame([{'Year': a, **i} for a, b in results.items() for i in b])
print(df)

bs4 find table by id, returning 'None'

Not sure why this isn't working :( I'm able to pull other tables from this page, just not this one.
import requests
from bs4 import BeautifulSoup as soup
url = requests.get("https://www.basketball-reference.com/teams/BOS/2018.html",
headers={'User-Agent': 'Mozilla/5.0'})
page = soup(url.content, 'html')
table = page.find('table', id='team_and_opponent')
print(table)
Appreciate the help.
The page is dynamic, so you have 2 options in this case.
Side note: if you see <table> tags, you often don't need BeautifulSoup at all; pandas can do that work for you (and it actually uses bs4 under the hood) via pd.read_html().
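As a quick illustration of that side note (a sketch; on this particular page read_html only sees tables present in the static HTML, not the ones hidden inside comments):

import pandas as pd
import requests

r = requests.get('https://www.basketball-reference.com/teams/BOS/2018.html',
                 headers={'User-Agent': 'Mozilla/5.0'})
dfs = pd.read_html(r.text)  # parses every visible <table> into a DataFrame
print(len(dfs))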
1) Use selenium to first render the page, and THEN you can use BeautifulSoup to pull out the <table> tags
2) Those tables are within the comment tags in the html. You can use BeautifulSoup to pull out the comments, then just grab the ones with 'table'.
I chose option 2.
import requests
from bs4 import BeautifulSoup
from bs4 import Comment
import pandas as pd

url = 'https://www.basketball-reference.com/teams/BOS/2018.html'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# pull out every comment node, then parse the ones that contain a table
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
tables = []
for each in comments:
    if 'table' in each:
        try:
            tables.append(pd.read_html(each)[0])
        except:
            continue
I don't know which particular table you want, but they are all there in the list of tables.
Output:
print (tables[1])
Unnamed: 0 G MP FG FGA ... STL BLK TOV PF PTS
0 Team 82.0 19805 3141 6975 ... 604 373 1149 1618 8529
1 Team/G NaN 241.5 38.3 85.1 ... 7.4 4.5 14.0 19.7 104.0
2 Lg Rank NaN 12 25 25 ... 23 18 15 17 20
3 Year/Year NaN 0.3% -0.9% -0.0% ... -2.1% 9.7% 5.6% -4.0% -3.7%
4 Opponent 82.0 19805 3066 6973 ... 594 364 1159 1571 8235
5 Opponent/G NaN 241.5 37.4 85.0 ... 7.2 4.4 14.1 19.2 100.4
6 Lg Rank NaN 12 3 12 ... 7 6 19 9 3
7 Year/Year NaN 0.3% -3.2% -0.9% ... -4.7% -14.4% 1.6% -5.6% -4.7%
[8 rows x 24 columns]
or
print (tables[18])
Rk Unnamed: 1 Salary
0 1 Gordon Hayward $29,727,900
1 2 Al Horford $27,734,405
2 3 Kyrie Irving $18,868,625
3 4 Jayson Tatum $5,645,400
4 5 Greg Monroe $5,000,000
5 6 Marcus Morris $5,000,000
6 7 Jaylen Brown $4,956,480
7 8 Marcus Smart $4,538,020
8 9 Aron Baynes $4,328,000
9 10 Guerschon Yabusele $2,247,480
10 11 Terry Rozier $1,988,520
11 12 Shane Larkin $1,471,382
12 13 Semi Ojeleye $1,291,892
13 14 Abdel Nader $1,167,333
14 15 Daniel Theis $815,615
15 16 Demetrius Jackson $92,858
16 17 Jarell Eddie $83,129
17 18 Xavier Silas $74,159
18 19 Jonathan Gibson $44,495
19 20 Jabari Bird $0
20 21 Kadeem Allen $0
There is no table with id team_and_opponent on that page; rather, there is a span tag with this id. You can get results by changing the id.
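For example, a quick sketch of grabbing that span by the same id (how much useful content hangs off it depends on what is in the static HTML):

import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.basketball-reference.com/teams/BOS/2018.html',
                 headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(r.content, 'html.parser')
# same lookup as the question, but against the span carrying the id
print(soup.find('span', id='team_and_opponent'))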
This data is loaded dynamically (via JavaScript).
You should take a look at Web-scraping JavaScript page with Python.
For that you can use Selenium, or requests-html, which supports JavaScript.
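For instance, a minimal sketch with requests-html (the first call to render() downloads a headless Chromium):

from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://www.basketball-reference.com/teams/BOS/2018.html')
r.html.render()  # executes the page's JavaScript before parsing
# the id comes from the question; what it resolves to depends on the rendered page
element = r.html.find('#team_and_opponent', first=True)
print(element.text if element else 'not found')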
import requests
import bs4

url = requests.get("https://www.basketball-reference.com/teams/BOS/2018.html",
                   headers={'User-Agent': 'Mozilla/5.0'})
soup = bs4.BeautifulSoup(url.text, "lxml")
page = soup.select(".table_outer_container")
for i in page:
    print(i.text)
You will get your desired output.
