Scrape for Table with Limits - python

There is a website that pulls the data it displays from an API. The maximum number of rows per page is 100, and if you check the API URL for page 1, 2, 3, etc., it changes each time. So far I have been re-running the same script with the URL swapped out, but then I also have to save the output to a different Excel file each time or the previous data gets overwritten.
I'd like a script that can pull all of the information from this table and place it into a single Excel sheet without the values being overwritten.
The main page I'm using is http://www.nhl.com/stats/teams?aggregate=0&report=daysbetweengames&reportType=game&dateFrom=2021-10-12&dateTo=2021-11-30&gameType=2&filter=gamesPlayed,gte,1&sort=a_teamFullName,daysRest&page=0&pageSize=50 but please keep in mind that all the information on that page is being pulled from an API.
Here is the code I'm using:
import requests
import json
import pandas as pd
url = 'https://api.nhle.com/stats/rest/en/team/daysbetweengames?isAggregate=false&isGame=true&sort=%5B%7B%22property%22:%22teamFullName%22,%22direction%22:%22ASC%22%7D,%7B%22property%22:%22daysRest%22,%22direction%22:%22DESC%22%7D,%7B%22property%22:%22teamId%22,%22direction%22:%22ASC%22%7D%5D&start=0&limit=500&factCayenneExp=gamesPlayed%3E=1&cayenneExp=gameDate%3C=%222021-11-30%2023%3A59%3A59%22%20and%20gameDate%3E=%222021-10-12%22%20and%20gameTypeId=2'
resp = requests.get(url).text
resp = json.loads(resp)
df = pd.DataFrame(resp['data'])
df.to_excel('Master File.xlsx', sheet_name = 'Info')
Any help would be greatly appreciated.
Thanks!

The URL has a start=... parameter, so you can use a for loop that sets this value to 0, 100, 200, etc., run the request for each offset, and collect all of the rows into one DataFrame (collect them in a list and build the DataFrame once at the end; DataFrame.append is deprecated in recent pandas).
It is simpler if you put all of the arguments after the ? into a dictionary and pass it with get(url, params=...).
requests also has response.json(), so you don't need json.loads(response.text).
import requests
import pandas as pd
# --- before loop ---
url = 'https://api.nhle.com/stats/rest/en/team/daysbetweengames'
payload = {
    'isAggregate': 'false',
    'isGame': 'true',
    'start': 0,
    'limit': 100,
    'sort': '[{"property":"teamFullName","direction":"ASC"},{"property":"daysRest","direction":"DESC"},{"property":"teamId","direction":"ASC"}]',
    'factCayenneExp': 'gamesPlayed>=1',
    'cayenneExp': 'gameDate<="2021-11-30 23:59:59" and gameDate>="2021-10-12" and gameTypeId=2',
}
rows = []  # collect every page here and build the DataFrame once at the end
# --- loop ---
for start in range(0, 1000, 100):
    print('start:', start)
    payload['start'] = start                  # only the offset changes between requests
    response = requests.get(url, params=payload)
    data = response.json()                    # requests parses the JSON for you
    rows.extend(data['data'])
df = pd.DataFrame(rows)
# --- after loop ---
print(df)
df.to_excel('Master File.xlsx', sheet_name='Info')
Result:
daysRest faceoffWinPct gameDate ... ties timesShorthandedPerGame wins
0 4 0.47169 2021-10-13 ... None 5.0 1
1 3 0.50847 2021-11-22 ... None 4.0 0
2 2 0.45762 2021-10-26 ... None 1.0 0
3 2 0.56666 2021-11-05 ... None 2.0 1
4 2 0.54716 2021-11-14 ... None 1.0 1
.. ... ... ... ... ... ... ...
675 1 0.37209 2021-10-28 ... None 2.0 1
676 1 0.48000 2021-10-21 ... None 3.0 1
677 0 0.57692 2021-11-06 ... None 1.0 0
678 0 0.32727 2021-11-19 ... None 3.0 0
679 0 0.47169 2021-11-27 ... None 4.0 1
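If you don't want to hard-code the upper bound of range(0, 1000, 100), here is a minimal sketch of open-ended paging, assuming the same url and payload as above and that the API simply returns an empty data list once start runs past the last row:
import requests
import pandas as pd

# Sketch: keep requesting pages until the API returns no more rows.
# Assumes `url` and `payload` from the answer above; the stop condition
# (an empty 'data' list past the end) is an assumption about this API.
rows, start = [], 0
while True:
    payload['start'] = start
    batch = requests.get(url, params=payload).json().get('data', [])
    if not batch:                      # nothing left, stop paging
        break
    rows.extend(batch)
    start += payload['limit']          # advance by the page size

df = pd.DataFrame(rows)
print(len(df), 'rows in total')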

Related

How to web scrape Rotowire iframe table

I am trying to scrape tables from Rotowire. pd.read_html is only returning the headers.
import pandas as pd
url = pd.read_html("http://www.rotowire.com/daily/mlb/optimizer.htm?site=DraftKings&sport=MLB")
# for idx, table in enumerate(url):
# print("***************************")
# print(idx)
# print(table)
url[5]
Output:
Player Team Position Salary Fpts. Val Min. % Max. % Exposure
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
No idea what table you want, but you're not going to get anything from the static HTML response, as the page is rendered through JavaScript. They do have some data you can access, though; you'd have to work out the parameters:
import pandas as pd
import requests
url = 'https://www.rotowire.com/daily/tables/optimizer-mlb.php'
payload = {
    'siteID': '1',
    'slateID': '6441',
    'projSource': 'RotoWire',
    'rst': 'RotoWire'}
jsonData = requests.get(url, params=payload).json()
df = pd.DataFrame(jsonData)
Output:
print(df)
id playerID rotoPlayerID ... ie_green_lights ie_matchup_notes ie_volatility
0 12739 11095 12739 ... 0 0
1 10510 4081 10510 ... 0 0
2 16036 5163 16036 ... 0 0
3 14194 10827 14194 ... 0 0
4 14865 15463 14865 ... 0 0
.. ... ... ... ... ... ... ...
687 14444 11330 14444 ... 0 0
688 14440 18894 14440 ... 0 0
689 14439 18905 14439 ... 0 0
690 14435 5058 14435 ... 0 0
691 17921 18828 17921 ... 0 0
[692 rows x 99 columns]
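Once you have the frame, you would typically keep only the columns you need before exporting; a small sketch (the selection is illustrative, only playerID and rotoPlayerID are confirmed by the output above, so check print(df.columns) for the real names):
import requests
import pandas as pd

url = 'https://www.rotowire.com/daily/tables/optimizer-mlb.php'
payload = {'siteID': '1', 'slateID': '6441', 'projSource': 'RotoWire', 'rst': 'RotoWire'}
df = pd.DataFrame(requests.get(url, params=payload).json())

# Keep a subset of columns and write them out; the list is a placeholder,
# and only columns actually present in the response are kept.
wanted = ['playerID', 'rotoPlayerID']
df[[c for c in wanted if c in df.columns]].to_csv('rotowire_optimizer.csv', index=False)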

Scrape web with info from several years and create a csv file for each year

I have scraped information with the results of the 2016 Chess Olympiad, using the following code:
import requests
from bs4 import BeautifulSoup
import pandas as pd
#Imports the HTML into python
url = 'https://www.olimpbase.org/2016/2016te14.html'
requests.get(url)
page = requests.get(url)
print(page)
soup = BeautifulSoup(page.text, 'lxml')
#Subsets the HTML to only get the HTML of our table needed
table = soup.find('table', attrs = {'border': '1'})
print(table)
#Gets all the column headers of our table, but just for the first eleven columns in the webpage
table.find_all('td', class_= 'bog')[1:12]
headers = []
for i in table.find_all('td', class_= 'bog')[1:12]:
    title = i.text.strip()
    headers.append(title)
#Creates a dataframe using the column headers from our table
df = pd.DataFrame(columns = headers)
table.find_all('tr')[3:] #We grab data since the fourth row; the previous ones belong to the headers.
for j in table.find_all('tr')[3:]:
    row_data = j.find_all('td')
    row = [tr.text for tr in row_data][0:11]
    length = len(df)
    df.loc[length] = row
I want to do the same thing for the results of 2014 and 2012 (the Olympiad is normally played every two years), automatically. I have gotten about halfway with the code, but I really don't know how to continue. This is what I've done so far.
import requests
from bs4 import BeautifulSoup
import pandas as pd
#Imports the HTML into python
url = 'https://www.olimpbase.org/2016/2016te14.html'
requests.get(url)
page = requests.get(url)
print(page)
soup = BeautifulSoup(page.text, 'lxml')
#Subsets the HTML to only get the HTML of our table needed
table = soup.find('table', attrs = {'border': '1'})
print(table)
#Gets all the column headers of our table
table.find_all('td', class_= 'bog')[1:12]
headers = []
for i in table.find_all('td', class_= 'bog')[1:12]:
    title = i.text.strip()
    headers.append(title)
#Creates a dataframe using the column headers from our table
df = pd.DataFrame(columns = headers)
table.find_all('tr')[3:] #We grab data since the fourth row; the previous ones belong to the headers.
start_year = 2012
i = 2
end_year = 2016
def download_chess(start_year):
    url = f'https://www.olimpbase.org/{start_year}/{start_year}te14.html'
    response = requests.get(url)
    soup = BeautifulSoup(page.text, 'lxml')
    for j in table.find_all('tr')[3:]:
        row_data = j.find_all('td')
        row = [tr.text for tr in row_data][0:11]
        length = len(df)
        df.loc[length] = row
while start_year < end_year:
    download_chess(start_year)
    start_year += i
download_chess(start_year)
I don't have much experience so I don't quite understand the logic of writing filenames. I hope you can help me.
The following will retrieve information for a range of years (in this case, 2000 through 2018) and save each table to CSV as well:
import requests
import pandas as pd
years = range(2000, 2019, 2)
for y in years:
    try:
        df = pd.read_html(f'https://www.olimpbase.org/{y}/{y}te14.html')[1]
        new_header = df.iloc[2]
        df = df[3:]
        df.columns = new_header
        print(df)
        df.to_csv(f'chess_olympics_{y}.csv')
    except Exception as e:
        print(y, 'error', e)
This will print out the results table for each year:
   no.     team   Elo flag code pos.  pts   Buch  MP gms  nan   +  =  -  nan   +   =  -  nan     %  Eloav  Elop ind.medals
3    1   Russia  2685  nan  RUS    1   38  457.5  20  56  nan   8  4  2  nan  23  30  3  nan  67.9   2561  2694  1 - 0 - 2
4    2  Germany  2604  nan  GER    2   37  455.5  22  56  nan  10  2  2  nan  21  32  3  nan  66.1   2568  2685  0 - 0 - 2
5    3  Ukraine  2638  nan  UKR    3  35½  457.5  21  56  nan   8  5  1  nan  18  35  3  nan  63.4   2558  2653  1 - 0 - 0
6    4  Hungary  2661  nan  HUN    4  35½  455.5  21  56  nan   8  5  1  nan  22  27  7  nan  63.4   2570  2665  0 - 0 - 0
7    5   Israel  2652  nan  ISR    5  34½  463.5  20  56  nan   7  6  1  nan  17  35  4  nan  61.6   2562  2649  0 - 0 - 0
[...]
Relevant documentation for pandas: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html
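If you later want everything in a single file instead of one CSV per year, here is a minimal sketch that re-reads the files written above and stacks them (years whose file was never created are simply skipped):
import pandas as pd

frames = []
for y in range(2000, 2019, 2):
    try:
        d = pd.read_csv(f'chess_olympics_{y}.csv', index_col=0)
        d['year'] = y                      # remember which Olympiad each row came from
        frames.append(d)
    except FileNotFoundError:
        continue                           # that year was not downloaded

combined = pd.concat(frames, ignore_index=True)
combined.to_csv('chess_olympics_all_years.csv', index=False)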

Python BeautifulSoup html parser not working

Here I'm trying to read the page and create a CSV with the corresponding columns, but I'm unable to use find on the parsed data; the soup does not contain the data shown on the webpage.
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = "https://www.fancraze.com/marketplace/sales/mornemorkel1?tab=latest-sales"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
The site uses an API to get its data, so you can call it directly:
import pandas as pd
import requests
url = 'https://api.faze.app/v1/latestSalesInAGroup/mornemorkel1'
result = []
response = requests.get(url=url)
for data in response.json()['data']:
    data = {
        'id': data['momentId']['id'],
        'seller': data['sellerAddress']['userName'],
        'buyer': data['buyerAddress']['userName'],
        'price': data['price'],
        'created': data['createdAt']
    }
    result.append(data)
df = pd.DataFrame(result)
print(df)
OUTPUT:
id seller ... price created
0 1882 singal22 ... 8 2022-06-22T14:34:39.403Z
1 1737 olive_creepy2343 ... 7 2022-06-22T14:09:32.070Z
2 1256 tomato_wicked3294 ... 10 2022-06-22T13:49:20.895Z
3 1931 aquamarine_productive9244 ... 6 2022-06-22T13:41:49.153Z
4 1603 aquamarine_productive9244 ... 9 2022-06-22T13:28:01.624Z
.. ... ... ... ... ...
95 1026 olive_creepy2343 ... 7 2022-04-16T18:00:00.662Z
96 1719 Hhassan136 ... 5 2022-04-14T23:14:12.037Z
97 2054 Cricket101 ... 5 2022-04-14T21:30:13.185Z
98 1961 emzeden_9 ... 6 2022-04-14T18:02:05.194Z
99 1194 amaranth_curious1871 ... 5 2022-04-14T17:45:25.266Z
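An alternative sketch, if you would rather let pandas flatten the nested JSON itself; the dotted column names are assumptions based on the keys used above, since pd.json_normalize joins nested keys with a dot:
import requests
import pandas as pd

url = 'https://api.faze.app/v1/latestSalesInAGroup/mornemorkel1'
raw = requests.get(url).json()['data']

# Flatten nested dicts such as momentId and sellerAddress into dotted columns.
df = pd.json_normalize(raw)
wanted = ['momentId.id', 'sellerAddress.userName', 'buyerAddress.userName', 'price', 'createdAt']
print(df[[c for c in wanted if c in df.columns]].head())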

Webscraping jTable with hidden columns?

I am currently trying to setup a webscraper in Python for the following webpage:
https://understat.com/team/Juventus/2018
specifically for the 'team-players jTable'
I have managed to scrape the table successfully with BeautifulSoup and selenium, but there are hidden columns (accessible via the options popup window) that I can't initialize and include in my scraping.
Anyone know how to change this?
import urllib.request
from bs4 import BeautifulSoup
import lxml
import re
import requests
from selenium import webdriver
import pandas as pd
import re
import random
import datetime
base_url = 'https://understat.com/team/Juventus/2018'
url = base_url
data = requests.get(url)
html = data.content
soup = BeautifulSoup(html, 'lxml')
options = webdriver.ChromeOptions()
options.add_argument('headless')
driver = webdriver.Chrome('/Users/kylecaron/Desktop/souptest/chromedriver',options=options)
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'lxml')
headers = soup.find('div', attrs={'class':'players jTable'}).find('table').find_all('th',attrs={'class':'sort'})
headers_list = [header.get_text(strip=True) for header in headers]
body = soup.find('div', attrs={'class':'players jTable'}).table.tbody
all_rows_list = []
for tr in body.find_all('tr'):
    row = tr.find_all('td')
    current_row = []
    for item in row:
        current_row.append(item.get_text(strip=True))
    all_rows_list.append(current_row)
headers_list = ['№', 'Player', 'Positions', 'Apps', 'Min', 'G', 'A', 'Sh90', 'KP90', 'xG', 'xA', 'xG90', 'xA90']
xg_df = pd.DataFrame(all_rows_list, columns=headers_list)
If you navigate to the website, there are hidden table columns such as 'XGChain'. I want all of these hidden elements scraped, but having trouble doing it.
Best,
Kyle
Here you go. You could still use BeautifulSoup to iterate through the tr and td tags, but I always find pandas much easier to get tables, as it does the work for you.
from selenium import webdriver
import pandas as pd
url = 'https://understat.com/team/Juventus/2018'
driver = webdriver.Chrome()
driver.get(url)
# Click the Options button
driver.find_element_by_xpath('//*[@id="team-players"]/div[1]/button/i').click()
# Click the fields that are hidden
hidden = [7, 12, 14, 15, 17, 19, 20, 21, 22, 23, 24]
for val in hidden:
    x_path = '//*[@id="team-players"]/div[2]/div[2]/div/div[%s]/div[2]/label' % val
    driver.find_element_by_xpath(x_path).click()
# Apply the filter
driver.find_element_by_xpath('//*[@id="team-players"]/div[2]/div[3]/a[2]').click()
# Get the tables from the page source
tables = pd.read_html(driver.page_source)
data = tables[1]
data = data.rename(columns={'Unnamed: 22': "Yellow_Cards", "Unnamed: 23": "Red_Cards"})
driver.close()
Output:
print (data.columns)
Index(['№', 'Player', 'Pos', 'Apps', 'Min', 'G', 'NPG', 'A', 'Sh90', 'KP90',
'xG', 'NPxG', 'xA', 'xGChain', 'xGBuildup', 'xG90', 'NPxG90', 'xA90',
'xG90 + xA90', 'NPxG90 + xA90', 'xGChain90', 'xGBuildup90',
'Yellow_Cards', 'Red_Cards'],
dtype='object')
print (data)
№ Player ... Yellow_Cards Red_Cards
0 1.0 Cristiano Ronaldo ... 2 0
1 2.0 Mario Mandzukic ... 3 0
2 3.0 Paulo Dybala ... 1 0
3 4.0 Federico Bernardeschi ... 2 0
4 5.0 Blaise Matuidi ... 2 0
5 6.0 Rodrigo Bentancur ... 5 1
6 7.0 Juan Cuadrado ... 2 0
7 8.0 Leonardo Bonucci ... 1 0
8 9.0 Miralem Pjanic ... 4 0
9 10.0 Sami Khedira ... 0 0
10 11.0 Giorgio Chiellini ... 1 0
11 12.0 Medhi Benatia ... 2 0
12 13.0 Douglas Costa ... 2 1
13 14.0 Emre Can ... 2 0
14 15.0 Mattia Perin ... 1 0
15 16.0 Mattia De Sciglio ... 0 0
16 17.0 Wojciech Szczesny ... 0 0
17 18.0 Andrea Barzagli ... 0 0
18 19.0 Alex Sandro ... 3 0
19 20.0 Daniele Rugani ... 1 0
20 21.0 Moise Kean ... 0 0
21 22.0 João Cancelo ... 2 0
22 NaN NaN ... 36 2
[23 rows x 24 columns]
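Note that on Selenium 4 the find_element_by_xpath helpers no longer exist; the same clicks would be written with By.XPATH, roughly like this sketch:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://understat.com/team/Juventus/2018')

# Same element lookups as above, expressed with the Selenium 4 API
driver.find_element(By.XPATH, '//*[@id="team-players"]/div[1]/button/i').click()
for val in [7, 12, 14, 15, 17, 19, 20, 21, 22, 23, 24]:
    driver.find_element(
        By.XPATH,
        '//*[@id="team-players"]/div[2]/div[2]/div/div[%s]/div[2]/label' % val,
    ).click()
driver.find_element(By.XPATH, '//*[@id="team-players"]/div[2]/div[3]/a[2]').click()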

How to create pandas dataframe from web scrape?

I would like to use this web scrape to create a pandas DataFrame so that I can export the data to Excel. Is anyone familiar with this? I have seen different methods online and on this site but have been unable to successfully duplicate the results with this scrape.
Here is the code so far:
import requests
source = requests.get("https://api.lineups.com/nba/fetch/lineups/gateway").json()
for team in source['data']:
    print("\n%s players\n" % team['home_route'].capitalize())
    for player in team['home_players']:
        print(player['name'])
    print("\n%s players\n" % team['away_route'].capitalize())
    for player in team['away_players']:
        print(player['name'])
This site seems useful but the examples are different:
https://www.tutorialspoint.com/python_pandas/python_pandas_dataframe.htm
Here is another example from stackoverflow.com:
Loading web scraping results into Pandas DataFrame
I am new to coding/scraping so any help will be greatly appreciated. Thanks in advance for your time and effort!
I have added a solution that builds a team-wise DataFrame; I hope this helps. Updated code:
import requests
source = requests.get("https://api.lineups.com/nba/fetch/lineups/gateway").json()
players = []
teams = []
for team in source['data']:
    print("\n%s players\n" % team['home_route'].capitalize())
    teams.append(team['home_route'].capitalize())
    teams.append(team['away_route'].capitalize())
    temp = []
    temp1 = []
    for player in team['home_players']:
        print(player['name'])
        temp.append(player['name'])
    print("\n%s players\n" % team['away_route'].capitalize())
    for player in team['away_players']:
        print(player['name'])
        temp1.append(player['name'])
    players.append(temp)
    players.append(temp1)
import pandas as pd
df = pd.DataFrame(columns=teams)
for i in range(0, len(df.columns)):
    df[df.columns[i]] = players[i]
df
In order to export to excel, you can do
df.to_excel('result.xlsx')
Python requests conveniently renders the json as a dict so you can just use the dict in a pd.DataFrame constructor.
import pandas as pd
df = pd.DataFrame([dict1, dict2, dict3])
# Do your data processing here
df.to_csv("myfile.csv")
Pandas also has pd.io.json with helpers like json_normalize, so once your data is in a DataFrame you can flatten nested JSON into tabular data, and so on.
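A small sketch of what that looks like with the payload from the question (assuming source is the dict returned by the request above; nested keys come out as dotted column names):
import requests
import pandas as pd

source = requests.get("https://api.lineups.com/nba/fetch/lineups/gateway").json()

# Flatten the per-game dicts under 'data' into one row per game
games = pd.json_normalize(source['data'])
print(games.shape)
print(list(games.columns[:10]))    # inspect the flattened column names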
You can try it like below:
>>> import pandas as pd
>>> import json
>>> import requests
>>> source = requests.get("https://api.lineups.com/nba/fetch/lineups/gateway").json()
>>> df = pd.DataFrame.from_dict(source) # directly use source as itself is a dict
Now you can write the DataFrame to CSV with df.to_csv as follows:
>>> df.to_csv("nba_play.csv")
Below are just your columns, which you can process as desired:
>>> df.columns
Index(['bottom_header', 'bottom_paragraph', 'data', 'heading',
'intro_paragraph', 'page_title', 'twitter_link'],
dtype='object')
However, as Charles said, you can use json_normalize, which will give you a better view of the data in tabular form:
>>> from pandas.io.json import json_normalize
>>> json_normalize(df['data']).head()
away_bets.key away_bets.moneyline away_bets.over_under \
0 ATL 500 o232.0
1 POR 165 o217.0
2 SAC 320 o225.0
3 BKN 110 o216.0
4 TOR -140 o221.0
away_bets.over_under_moneyline away_bets.spread \
0 -115 11.0
1 -115 4.5
2 -105 9.0
3 -105 2.0
4 -105 -2.0
away_bets.spread_moneyline away_bets.total \
0 -110 121.50
1 -105 110.75
2 -115 117.00
3 -110 109.00
4 -115 109.50
away_injuries \
0 [{'name': 'J. Collins', 'profile_url': '/nba/p...
1 [{'name': 'M. Harkless', 'profile_url': '/nba/...
2 [{'name': 'K. Koufos', 'profile_url': '/nba/pl...
3 [{'name': 'T. Graham', 'profile_url': '/nba/pl...
4 [{'name': 'O. Anunoby', 'profile_url': '/nba/p...
away_players away_route \
0 [{'draftkings_projection': 30.04, 'yahoo_posit... atlanta-hawks
1 [{'draftkings_projection': 47.33, 'yahoo_posit... portland-trail-blazers
2 [{'draftkings_projection': 28.88, 'yahoo_posit... sacramento-kings
3 [{'draftkings_projection': 37.02, 'yahoo_posit... brooklyn-nets
4 [{'draftkings_projection': 45.2, 'yahoo_positi... toronto-raptors
... nav.matchup_season nav.matchup_time \
0 ... 2019 2018-10-29T23:00:00+00:00
1 ... 2019 2018-10-29T23:00:00+00:00
2 ... 2019 2018-10-29T23:30:00+00:00
3 ... 2019 2018-10-29T23:30:00+00:00
4 ... 2019 2018-10-30T00:00:00+00:00
nav.status.away_team_score nav.status.home_team_score nav.status.minutes \
0 None None None
1 None None None
2 None None None
3 None None None
4 None None None
nav.status.quarter_integer nav.status.seconds nav.status.status \
0 None Scheduled
1 None Scheduled
2 None Scheduled
3 None Scheduled
4 None Scheduled
nav.updated order
0 2018-10-29T17:51:05+00:00 0
1 2018-10-29T17:51:05+00:00 1
2 2018-10-29T17:51:05+00:00 2
3 2018-10-29T17:51:05+00:00 3
4 2018-10-29T17:51:05+00:00 4
[5 rows x 383 columns]
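Since the original goal was an Excel file rather than a CSV, here is a final sketch of the export step (assumes an Excel engine such as openpyxl is installed; pd.json_normalize is the modern spelling of the helper used above):
import requests
import pandas as pd

source = requests.get("https://api.lineups.com/nba/fetch/lineups/gateway").json()

# Flatten the nested 'data' records and write them to a single Excel sheet.
flat = pd.json_normalize(source['data'])
flat.to_excel('nba_lineups.xlsx', sheet_name='lineups', index=False)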
Hope this helps.
