How to web scrape Rotowire iframe table - python

I am trying to scrape tables from Rotowire. pd.read_html is only returning the headers.
import pandas as pd
url = pd.read_html("http://www.rotowire.com/daily/mlb/optimizer.htm?site=DraftKings&sport=MLB")
# for idx, table in enumerate(url):
#     print("***************************")
#     print(idx)
#     print(table)
url[5]
Output:
Player Team Position Salary Fpts. Val Min. % Max. % Exposure
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN

Not sure which table you want, but you're not going to get anything from the static HTML response, as the page is rendered through JavaScript. They do have some data you can access, though; you'd have to work out the parameters:
import pandas as pd
import requests
url = 'https://www.rotowire.com/daily/tables/optimizer-mlb.php'
payload = {
    'siteID': '1',
    'slateID': '6441',
    'projSource': 'RotoWire',
    'rst': 'RotoWire',
}
jsonData = requests.get(url, params=payload).json()
df = pd.DataFrame(jsonData)
Output:
print(df)
id playerID rotoPlayerID ... ie_green_lights ie_matchup_notes ie_volatility
0 12739 11095 12739 ... 0 0
1 10510 4081 10510 ... 0 0
2 16036 5163 16036 ... 0 0
3 14194 10827 14194 ... 0 0
4 14865 15463 14865 ... 0 0
.. ... ... ... ... ... ... ...
687 14444 11330 14444 ... 0 0
688 14440 18894 14440 ... 0 0
689 14439 18905 14439 ... 0 0
690 14435 5058 14435 ... 0 0
691 17921 18828 17921 ... 0 0
[692 rows x 99 columns]
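The response has far more columns than the on-page table, so you'll probably want to trim it down. A minimal sketch, assuming hypothetical column names (inspect df.columns for the field names the endpoint actually returns):
# Column names in 'wanted' are assumptions for illustration only;
# check df.columns to see what the endpoint really provides.
print(df.columns.tolist())
wanted = ['player', 'team', 'position', 'salary', 'points']  # hypothetical names
subset = df[[c for c in wanted if c in df.columns]]
print(subset.head())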

Scrape for Table with Limits

There is a website whose data is pulled from an API. The maximum number of rows per page is 100, and the API URL changes for page 1, 2, 3, and so on. So far I have run the same script for each page and just swapped out the URL, but then I also have to save each run to a different Excel file or the earlier data gets overwritten.
I'd like a script that pulls all the information from this table and places it into a single Excel sheet without any values being overwritten.
The main page I'm using is http://www.nhl.com/stats/teams?aggregate=0&report=daysbetweengames&reportType=game&dateFrom=2021-10-12&dateTo=2021-11-30&gameType=2&filter=gamesPlayed,gte,1&sort=a_teamFullName,daysRest&page=0&pageSize=50 but please keep in mind that all the information on that page is being pulled from an API.
Here is the code I'm using:
import requests
import json
import pandas as pd
url = ('https://api.nhle.com/stats/rest/en/team/daysbetweengames?isAggregate=false&isGame=true&sort=%5B%7B%22property%22:%22teamFullName%22,%22direction%22:%22ASC%22%7D,%7B%22property%22:%22daysRest%22,%22direction%22:%22DESC%22%7D,%7B%22property%22:%22teamId%22,%22direction%22:%22ASC%22%7D%5D&start=0&limit=500&factCayenneExp=gamesPlayed%3E=1&cayenneExp=gameDate%3C=%222021-11-30%2023%3A59%3A59%22%20and%20gameDate%3E=%222021-10-12%22%20and%20gameTypeId=2')
resp = requests.get(url).text
resp = json.loads(resp)
df = pd.DataFrame(resp['data'])
df.to_excel('Master File.xlsx', sheet_name = 'Info')
Any help would be greatly appreciated.
Thanks!
The url has start=..., so you can use a for loop that replaces this value with 0, 100, 200, etc., run the request for each url, and collect the results into one DataFrame.
It is simpler if you put all the arguments from the url (everything after the ? character) into a dictionary and pass it with get(url, params=...).
requests has response.json(), so you don't need json.loads(response.text).
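If you'd rather not retype the parameters by hand, here is a minimal sketch of one way to split a query string into such a dictionary (the URL below is shortened for illustration):
from urllib.parse import urlparse, parse_qsl

full_url = 'https://api.nhle.com/stats/rest/en/team/daysbetweengames?isAggregate=false&isGame=true&start=0&limit=100'
parsed = urlparse(full_url)
base_url = parsed._replace(query='').geturl()   # URL without the query string
payload = dict(parse_qsl(parsed.query))         # {'isAggregate': 'false', 'isGame': 'true', ...}
print(base_url)
print(payload)
With that in place, the full loop looks like this: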
import requests
import pandas as pd

# --- before loop ---
url = 'https://api.nhle.com/stats/rest/en/team/daysbetweengames'

payload = {
    'isAggregate': 'false',
    'isGame': 'true',
    'start': 0,
    'limit': 100,
    'sort': '[{"property":"teamFullName","direction":"ASC"},{"property":"daysRest","direction":"DESC"},{"property":"teamId","direction":"ASC"}]',
    'factCayenneExp': 'gamesPlayed>=1',
    'cayenneExp': 'gameDate<="2021-11-30 23:59:59" and gameDate>="2021-10-12" and gameTypeId=2',
}

frames = []

# --- loop ---
for start in range(0, 1000, 100):
    print('start:', start)
    payload['start'] = start
    response = requests.get(url, params=payload)
    data = response.json()
    if not data['data']:  # stop once the API returns no more rows
        break
    frames.append(pd.DataFrame(data['data']))

# --- after loop ---
df = pd.concat(frames, ignore_index=True)
print(df)
df.to_excel('Master File.xlsx', sheet_name='Info')
Result:
daysRest faceoffWinPct gameDate ... ties timesShorthandedPerGame wins
0 4 0.47169 2021-10-13 ... None 5.0 1
1 3 0.50847 2021-11-22 ... None 4.0 0
2 2 0.45762 2021-10-26 ... None 1.0 0
3 2 0.56666 2021-11-05 ... None 2.0 1
4 2 0.54716 2021-11-14 ... None 1.0 1
.. ... ... ... ... ... ... ...
675 1 0.37209 2021-10-28 ... None 2.0 1
676 1 0.48000 2021-10-21 ... None 3.0 1
677 0 0.57692 2021-11-06 ... None 1.0 0
678 0 0.32727 2021-11-19 ... None 3.0 0
679 0 0.47169 2021-11-27 ... None 4.0 1

Union of several DataFrames stored in the same variable

I have imported data for several stocks in a loop using the MetaTrader 5 module.
import MetaTrader5 as mt5
import pandas as pd

mt5.initialize()  # connect to the terminal before requesting data

tickers = ['Apple', 'Amazon', 'Facebook', 'Microsoft']
results = {}
for ticker in tickers:
    # inicio and fin are the start and end datetimes of the requested range (defined elsewhere)
    results[ticker] = mt5.copy_rates_range(ticker, mt5.TIMEFRAME_M1, inicio, fin)
    results[ticker] = pd.DataFrame(results[ticker]).set_index('time')
The data is stored in results[ticker]. For example, when ticker = 'Apple':
results['Apple']
open high low close tick_volume spread real_volume
time
1606149300 117.33 117.55 117.31 117.47 126 12 0
1606149360 117.48 117.54 117.31 117.39 134 12 0
1606149420 117.38 117.54 117.36 117.41 95 12 0
1606149480 117.43 117.47 117.32 117.33 90 12 0
1606149540 117.32 117.33 117.24 117.26 123 12 0
... ... ... ... ... ... ... ...
when ticker = 'Amazon'
results['Amazon']
open high low close tick_volume spread real_volume
time
1606149300 3114.25 3132.43 3114.25 3131.28 44 429 0
1606149360 3131.28 3133.25 3122.69 3131.52 83 450 0
1606149420 3131.52 3132.12 3122.69 3130.11 61 449 0
1606149480 3127.53 3135.92 3122.69 3127.05 80 448 0
1606149540 3129.77 3135.54 3123.50 3131.98 49 441 0
... ... ... ... ... ... ... ...
My question is: how can I join all these tables into a single DataFrame? For example, the 'close' column for each of the tickers in a single DataFrame, as in the example below:
CLOSE Apple Amazon Microsoft ETC...
time
1606149300 3114.25 3132.43 3114.25
1606149360 3131.28 3133.25 3122.69
1606149420 3131.52 3132.12 3122.69
1606149480 3127.53 3135.92 3122.69
1606149540 3129.77 3135.54 3123.50
... ... ... ... ... ... ... ...
Thanks in advance for the help
You could try the join function in pandas, renaming each 'close' column to its ticker so the column names don't collide:
merged_df = results[tickers[0]][['close']].rename(columns={'close': tickers[0]})
for t in tickers[1:]:
    merged_df = merged_df.join(
        results[t][['close']].rename(columns={'close': t}),
        how='outer',
    )
I hope it solves your problem!
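An alternative sketch that builds the same wide table in one call with pd.concat, using the tickers as column names:
import pandas as pd

# Take the 'close' Series of each ticker and line them up side by side on the time index
close_df = pd.concat(
    {ticker: results[ticker]['close'] for ticker in tickers},
    axis=1,
)
print(close_df.head())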

Returns a NoneType value while using find function in Beautiful Soup

I am extracting tables from the website using Beautiful Soup.
The find function returns None, and I have no idea how to continue extracting all the tables into pandas DataFrames.
import pandas as pd
import datetime as dt
import pandas_datareader as web
import matplotlib.pyplot as plt
from matplotlib import style
import matplotlib.ticker as ticker
from bs4 import BeautifulSoup
import requests
url='https://www.federalreserve.gov/monetarypolicy/bst_recenttrends_accessible.htm'
html_content=requests.get(url).content
soup = BeautifulSoup(html_content, "html.parser")
get_table = soup.find("table", class_='pubtables')
get_table_data = get_table.find_all("tr")
print(type(get_table_data))
The data you see in the tables is loaded from an external URL. You can use this example to load the tables into separate DataFrames:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://www.federalreserve.gov/data.xml'

soup = BeautifulSoup(requests.get(url).content, 'html.parser')

for chart in soup.select('chart'):
    series = {}
    index = []
    for s in chart.select('series'):
        series[s['description']] = []
        temp_index = []
        for o in s.select('observation'):
            temp_index.append(o['index'])
            series[s['description']].append(o['value'])
        if len(temp_index) > len(index):
            index = temp_index
    series['index'] = index
    max_len = len(max(series.values(), key=len))
    for k in series:
        series[k] = series[k] + ['No Data'] * (max_len - len(series[k]))
    df = pd.DataFrame(series).set_index('index')
    print(df)
    print('-' * 80)
Prints:
Total Assets
index
1-Aug-07 870261.00
8-Aug-07 865453.00
15-Aug-07 864931.00
22-Aug-07 862775.00
29-Aug-07 872873.00
... ...
12-Aug-20 6957277.00
19-Aug-20 7010637.00
26-Aug-20 6990418.00
2-Sep-20 7017492.00
9-Sep-20 7010614.00
[685 rows x 1 columns]
--------------------------------------------------------------------------------
Total Assets ... Support for Specific Institutions**
index ...
1-Aug-07 870261.00 ... 0
8-Aug-07 865453.00 ... 0
15-Aug-07 864931.00 ... 0
22-Aug-07 862775.00 ... 0
29-Aug-07 872873.00 ... 0
... ... ... ...
12-Aug-20 6957277.00 ... No Data
19-Aug-20 7010637.00 ... No Data
26-Aug-20 6990418.00 ... No Data
2-Sep-20 7017492.00 ... No Data
9-Sep-20 7010614.00 ... No Data
[685 rows x 4 columns]
--------------------------------------------------------------------------------
All Liquidity Facilities* ... Term Asset-Backed Securities Loan Facility
index ...
1-Aug-07 235.00 ... 0
8-Aug-07 255.00 ... 0
15-Aug-07 264.00 ... 0
22-Aug-07 2262.00 ... 0
29-Aug-07 1358.00 ... 0
... ... ... ...
12-Aug-20 116308.00 ... 1619.00
19-Aug-20 112435.00 ... 2266.00
26-Aug-20 107342.00 ... 2256.00
2-Sep-20 103978.00 ... 2639.00
9-Sep-20 85581.00 ... 2639.00
[685 rows x 5 columns]
--------------------------------------------------------------------------------
Total Support to AIG*** ... Maiden Lane II LLC Maiden Lane III LLC
index ...
1-Aug-07 0 0 ... 0 0
8-Aug-07 0 0 ... 0 0
15-Aug-07 0 0 ... 0 0
22-Aug-07 0 0 ... 0 0
29-Aug-07 0 0 ... 0 0
... ... ... ... ... ...
12-Feb-20 0 No Data ... 0 0
19-Feb-20 0 No Data ... 0 0
26-Feb-20 0 No Data ... 0 0
4-Mar-20 0 No Data ... 0 0
11-Mar-20 0 No Data ... 0 0
[659 rows x 5 columns]
--------------------------------------------------------------------------------
Currency in Circulation ... Treasury Balance
index ...
1-Aug-07 814159.00 ... 4769.00
8-Aug-07 814587.00 ... 4670.00
15-Aug-07 813042.00 ... 5109.00
22-Aug-07 811795.00 ... 5329.00
29-Aug-07 812431.00 ... 4924.00
... ... ... ...
12-Aug-20 2006160.00 ... 1635143.00
19-Aug-20 2009610.00 ... 1636393.00
26-Aug-20 2013933.00 ... 1607449.00
2-Sep-20 2021810.00 ... 1651823.00
9-Sep-20 2030151.00 ... 1570533.00
[685 rows x 3 columns]
--------------------------------------------------------------------------------

Getting date field from JSON url as pandas DataFrame

I am trying to load this API URL into a pandas DataFrame. I can get the values, but I still need to add the date as a column alongside the other values:
import pandas as pd
from pandas.io.json import json_normalize
import ssl
ssl._create_default_https_context = ssl._create_unverified_context
df = pd.read_json("https://covidapi.info/api/v1/country/DOM")
df = pd.DataFrame(df['result'].values.tolist())
print (df)
Getting this output:
confirmed deaths recovered
0 0 0 0
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
.. ... ... ...
72 1488 68 16
73 1488 68 16
74 1745 82 17
75 1828 86 33
76 1956 98 36
You need to pass the index from your dataframe as well as the data itself:
df = pd.DataFrame(index=df.index, data=df['result'].values.tolist())
The line above creates the same columns, but keeps the original date index from the API call.
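If you want the date as an ordinary column rather than the index, a small follow-up sketch (the column name 'date' is just a label choice here):
df = df.rename_axis('date').reset_index()   # turn the date index into a regular 'date' column
print(df.head())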

Webscraping jTable with hidden columns?

I am currently trying to setup a webscraper in Python for the following webpage:
https://understat.com/team/Juventus/2018
specifically for the 'team-players jTable'
I have managed to scrape the table successfully with BeautifulSoup and Selenium, but there are hidden columns (accessible via the options popup window) that I can't enable and include in my scraping.
Anyone know how to change this?
import urllib.request
from bs4 import BeautifulSoup
import lxml
import re
import requests
from selenium import webdriver
import pandas as pd
import random
import datetime

base_url = 'https://understat.com/team/Juventus/2018'
url = base_url
data = requests.get(url)
html = data.content
soup = BeautifulSoup(html, 'lxml')

options = webdriver.ChromeOptions()
options.add_argument('headless')
driver = webdriver.Chrome('/Users/kylecaron/Desktop/souptest/chromedriver', options=options)
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'lxml')

headers = soup.find('div', attrs={'class': 'players jTable'}).find('table').find_all('th', attrs={'class': 'sort'})
headers_list = [header.get_text(strip=True) for header in headers]
body = soup.find('div', attrs={'class': 'players jTable'}).table.tbody

all_rows_list = []
for tr in body.find_all('tr'):
    row = tr.find_all('td')
    current_row = []
    for item in row:
        current_row.append(item.get_text(strip=True))
    all_rows_list.append(current_row)

headers_list = ['№', 'Player', 'Positions', 'Apps', 'Min', 'G', 'A', 'Sh90', 'KP90', 'xG', 'xA', 'xG90', 'xA90']
xg_df = pd.DataFrame(all_rows_list, columns=headers_list)
If you navigate to the website, there are hidden table columns such as 'xGChain'. I want all of these hidden elements scraped, but I'm having trouble doing it.
Best,
Kyle
Here you go. You could still use BeautifulSoup to iterate through the tr and td tags, but I always find pandas much easier for getting tables, as it does the work for you.
from selenium import webdriver
from selenium.webdriver.common.by import By
import pandas as pd

url = 'https://understat.com/team/Juventus/2018'
driver = webdriver.Chrome()
driver.get(url)

# Click the options button
driver.find_element(By.XPATH, '//*[@id="team-players"]/div[1]/button/i').click()

# Click the fields that are hidden
hidden = [7, 12, 14, 15, 17, 19, 20, 21, 22, 23, 24]
for val in hidden:
    x_path = '//*[@id="team-players"]/div[2]/div[2]/div/div[%s]/div[2]/label' % val
    driver.find_element(By.XPATH, x_path).click()

# Apply the filter
driver.find_element(By.XPATH, '//*[@id="team-players"]/div[2]/div[3]/a[2]').click()

# Get the tables from the rendered source
tables = pd.read_html(driver.page_source)
data = tables[1]
data = data.rename(columns={'Unnamed: 22': "Yellow_Cards", "Unnamed: 23": "Red_Cards"})
driver.close()
Output:
print (data.columns)
Index(['№', 'Player', 'Pos', 'Apps', 'Min', 'G', 'NPG', 'A', 'Sh90', 'KP90',
'xG', 'NPxG', 'xA', 'xGChain', 'xGBuildup', 'xG90', 'NPxG90', 'xA90',
'xG90 + xA90', 'NPxG90 + xA90', 'xGChain90', 'xGBuildup90',
'Yellow_Cards', 'Red_Cards'],
dtype='object')
print (data)
№ Player ... Yellow_Cards Red_Cards
0 1.0 Cristiano Ronaldo ... 2 0
1 2.0 Mario Mandzukic ... 3 0
2 3.0 Paulo Dybala ... 1 0
3 4.0 Federico Bernardeschi ... 2 0
4 5.0 Blaise Matuidi ... 2 0
5 6.0 Rodrigo Bentancur ... 5 1
6 7.0 Juan Cuadrado ... 2 0
7 8.0 Leonardo Bonucci ... 1 0
8 9.0 Miralem Pjanic ... 4 0
9 10.0 Sami Khedira ... 0 0
10 11.0 Giorgio Chiellini ... 1 0
11 12.0 Medhi Benatia ... 2 0
12 13.0 Douglas Costa ... 2 1
13 14.0 Emre Can ... 2 0
14 15.0 Mattia Perin ... 1 0
15 16.0 Mattia De Sciglio ... 0 0
16 17.0 Wojciech Szczesny ... 0 0
17 18.0 Andrea Barzagli ... 0 0
18 19.0 Alex Sandro ... 3 0
19 20.0 Daniele Rugani ... 1 0
20 21.0 Moise Kean ... 0 0
21 22.0 João Cancelo ... 2 0
22 NaN NaN ... 36 2
[23 rows x 24 columns]
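If the scrape occasionally comes back with an empty or partially rendered table (the page is built by JavaScript), an explicit wait before reading the page source can help. A minimal sketch, assuming the table lives inside the team-players container used in the XPaths above:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

# Wait up to 10 seconds for the players table to be present before parsing
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, '#team-players table'))
)
tables = pd.read_html(driver.page_source)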
