I would like to use this web scrape to create a pandas dataframe so that I can export the data to Excel. Is anyone familiar with this? I have seen different methods online and on this site, but have been unable to successfully reproduce the results with this scrape.
Here is the code so far:
import requests

source = requests.get("https://api.lineups.com/nba/fetch/lineups/gateway").json()

for team in source['data']:
    print("\n%s players\n" % team['home_route'].capitalize())
    for player in team['home_players']:
        print(player['name'])
    print("\n%s players\n" % team['away_route'].capitalize())
    for player in team['away_players']:
        print(player['name'])
This site seems useful but the examples are different:
https://www.tutorialspoint.com/python_pandas/python_pandas_dataframe.htm
Here is another example from stackoverflow.com:
Loading web scraping results into Pandas DataFrame
I am new to coding/scraping, so any help will be greatly appreciated. Thanks in advance for your time and effort!
I have added a solution that builds a dataframe team-wise; I hope this helps. Updated code:
import requests
import pandas as pd

source = requests.get("https://api.lineups.com/nba/fetch/lineups/gateway").json()

players = []
teams = []
for team in source['data']:
    print("\n%s players\n" % team['home_route'].capitalize())
    teams.append(team['home_route'].capitalize())
    teams.append(team['away_route'].capitalize())
    temp = []
    temp1 = []
    for player in team['home_players']:
        print(player['name'])
        temp.append(player['name'])
    print("\n%s players\n" % team['away_route'].capitalize())
    for player in team['away_players']:
        print(player['name'])
        temp1.append(player['name'])
    players.append(temp)
    players.append(temp1)

# One column per team; note this assumes every roster has the same length,
# otherwise the column assignments below will raise a ValueError.
df = pd.DataFrame(columns=teams)
for i in range(0, len(df.columns)):
    df[df.columns[i]] = players[i]
df
In order to export to Excel, you can do:
df.to_excel('result.xlsx')
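Note that df.to_excel needs an Excel writer engine installed; for .xlsx files pandas uses openpyxl by default, so you may need to pip install openpyxl first.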
Python requests conveniently parses the JSON response into a dict, so you can pass your records straight to the pd.DataFrame constructor.
import pandas as pd

# dict1, dict2, dict3 stand in for your records, one dict per row
df = pd.DataFrame([dict1, dict2, dict3])
# Do your data processing here
df.to_csv("myfile.csv")
Pandas also has json_normalize under pd.io.json (exposed as pd.json_normalize in recent versions), so you can flatten nested JSON into tabular data, and so on.
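A minimal sketch, assuming the same endpoint and JSON shape as in the question, flattening the nested 'data' list:

import requests
import pandas as pd

source = requests.get("https://api.lineups.com/nba/fetch/lineups/gateway").json()
games = pd.json_normalize(source['data'])  # one row per game; nested keys become dotted column names
print(games.columns[:10])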
You can try it like below:
>>> import pandas as pd
>>> import json
>>> import requests
>>> source = requests.get("https://api.lineups.com/nba/fetch/lineups/gateway").json()
>>> df = pd.DataFrame.from_dict(source) # directly use source as itself is a dict
Now you can export the dataframe to CSV with df.to_csv as follows:
>>> df.to_csv("nba_play.csv")
Below are just your columns, which you can process as desired:
>>> df.columns
Index(['bottom_header', 'bottom_paragraph', 'data', 'heading',
'intro_paragraph', 'page_title', 'twitter_link'],
dtype='object')
However, as Charles said, you can use json_normalize, which will give you a better view of the data in tabular form:
>>> from pandas.io.json import json_normalize
>>> json_normalize(df['data']).head()
away_bets.key away_bets.moneyline away_bets.over_under \
0 ATL 500 o232.0
1 POR 165 o217.0
2 SAC 320 o225.0
3 BKN 110 o216.0
4 TOR -140 o221.0
away_bets.over_under_moneyline away_bets.spread \
0 -115 11.0
1 -115 4.5
2 -105 9.0
3 -105 2.0
4 -105 -2.0
away_bets.spread_moneyline away_bets.total \
0 -110 121.50
1 -105 110.75
2 -115 117.00
3 -110 109.00
4 -115 109.50
away_injuries \
0 [{'name': 'J. Collins', 'profile_url': '/nba/p...
1 [{'name': 'M. Harkless', 'profile_url': '/nba/...
2 [{'name': 'K. Koufos', 'profile_url': '/nba/pl...
3 [{'name': 'T. Graham', 'profile_url': '/nba/pl...
4 [{'name': 'O. Anunoby', 'profile_url': '/nba/p...
away_players away_route \
0 [{'draftkings_projection': 30.04, 'yahoo_posit... atlanta-hawks
1 [{'draftkings_projection': 47.33, 'yahoo_posit... portland-trail-blazers
2 [{'draftkings_projection': 28.88, 'yahoo_posit... sacramento-kings
3 [{'draftkings_projection': 37.02, 'yahoo_posit... brooklyn-nets
4 [{'draftkings_projection': 45.2, 'yahoo_positi... toronto-raptors
... nav.matchup_season nav.matchup_time \
0 ... 2019 2018-10-29T23:00:00+00:00
1 ... 2019 2018-10-29T23:00:00+00:00
2 ... 2019 2018-10-29T23:30:00+00:00
3 ... 2019 2018-10-29T23:30:00+00:00
4 ... 2019 2018-10-30T00:00:00+00:00
nav.status.away_team_score nav.status.home_team_score nav.status.minutes \
0 None None None
1 None None None
2 None None None
3 None None None
4 None None None
nav.status.quarter_integer nav.status.seconds nav.status.status \
0 None Scheduled
1 None Scheduled
2 None Scheduled
3 None Scheduled
4 None Scheduled
nav.updated order
0 2018-10-29T17:51:05+00:00 0
1 2018-10-29T17:51:05+00:00 1
2 2018-10-29T17:51:05+00:00 2
3 2018-10-29T17:51:05+00:00 3
4 2018-10-29T17:51:05+00:00 4
[5 rows x 383 columns]
Hope this helps!
Related
There is a website that pulls its data from an API. The maximum number of rows per page is 100, and the API URL changes from page 1 to page 2 to page 3, and so on. So far I have run the same script each time and just switched out the URL, but then I also have to save the results to a different Excel file every time or the data gets overwritten.
I'd like a script that can pull all the information from this table and place it into Excel on the same sheet without the values being overwritten.
The main page I'm using is http://www.nhl.com/stats/teams?aggregate=0&report=daysbetweengames&reportType=game&dateFrom=2021-10-12&dateTo=2021-11-30&gameType=2&filter=gamesPlayed,gte,1&sort=a_teamFullName,daysRest&page=0&pageSize=50 but please keep in mind that all the information on that page is being pulled from an API.
Here is the code I'm using:
import requests
import json
import pandas as pd
url = ('https://api.nhle.com/stats/rest/en/team/daysbetweengames?isAggregate=false&isGame=true&sort=%5B%7B%22property%22:%22teamFullName%22,%22direction%22:%22ASC%22%7D,%7B%22property%22:%22daysRest%22,%22direction%22:%22DESC%22%7D,%7B%22property%22:%22teamId%22,%22direction%22:%22ASC%22%7D%5D&start=0&limit=500&factCayenneExp=gamesPlayed%3E=1&cayenneExp=gameDate%3C=%222021-11-30%2023%3A59%3A59%22%20and%20gameDate%3E=%222021-10-12%22%20and%20gameTypeId=2')
resp = requests.get(url).text
resp = json.loads(resp)
df = pd.DataFrame(resp['data'])
df.to_excel('Master File.xlsx', sheet_name = 'Info')
Any help would be greatly appreciated.
Thanks!
The url has start=..., so you can use a for-loop that replaces this value with 0, 100, 200, etc., run the request for each url, and collect the pages into one DataFrame.
It is simpler if you put all the arguments from the url (after the ? character) into a dictionary and pass them with get(url, params=...).
Also, requests has response.json(), so you don't need json.loads(response.text).
import requests
import pandas as pd

# --- before loop ---

url = 'https://api.nhle.com/stats/rest/en/team/daysbetweengames'

payload = {
    'isAggregate': 'false',
    'isGame': 'true',
    'start': 0,
    'limit': 100,
    'sort': '[{"property":"teamFullName","direction":"ASC"},{"property":"daysRest","direction":"DESC"},{"property":"teamId","direction":"ASC"}]',
    'factCayenneExp': 'gamesPlayed>=1',
    'cayenneExp': 'gameDate<="2021-11-30 23:59:59" and gameDate>="2021-10-12" and gameTypeId=2',
}

frames = []

# --- loop ---

for start in range(0, 1000, 100):
    print('start:', start)
    payload['start'] = start
    response = requests.get(url, params=payload)
    data = response.json()
    frames.append(pd.DataFrame(data['data']))

# --- after loop ---

# DataFrame.append was removed in pandas 2.0, so collect the pages
# in a list and concatenate them once at the end.
df = pd.concat(frames, ignore_index=True)

print(df)

df.to_excel('Master File.xlsx', sheet_name='Info')
Result:
daysRest faceoffWinPct gameDate ... ties timesShorthandedPerGame wins
0 4 0.47169 2021-10-13 ... None 5.0 1
1 3 0.50847 2021-11-22 ... None 4.0 0
2 2 0.45762 2021-10-26 ... None 1.0 0
3 2 0.56666 2021-11-05 ... None 2.0 1
4 2 0.54716 2021-11-14 ... None 1.0 1
.. ... ... ... ... ... ... ...
675 1 0.37209 2021-10-28 ... None 2.0 1
676 1 0.48000 2021-10-21 ... None 3.0 1
677 0 0.57692 2021-11-06 ... None 1.0 0
678 0 0.32727 2021-11-19 ... None 3.0 0
679 0 0.47169 2021-11-27 ... None 4.0 1
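A small refinement, assuming this endpoint returns an empty data list once you page past the last row (an assumption about the API, not something confirmed above), is to stop the loop early instead of always requesting ten pages:

    data = response.json()
    if not data['data']:  # assumed: an empty page means no more rows
        break
    frames.append(pd.DataFrame(data['data']))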
I am using pandas to scrape this site https://www.mapsofworld.com/lat_long/poland-lat-long.html but I am only getting 3 elements. How can I get all the elements from the table?
import numpy as np
import pandas as pd
# for getting world map
import folium

# Retrieving latitude and longitude coordinates
info = pd.read_html("https://www.mapsofworld.com/lat_long/poland-lat-long.html", match='Augustow', skiprows=2)

# converting the table data into a DataFrame
coordinates = pd.DataFrame(info[0])
data = coordinates.head()
print(data)
It looks like installing html5lib and using it as your parser may fix your issue:
df = pd.read_html("https://www.mapsofworld.com/lat_long/poland-lat-long.html",attrs={"class":"tableizer-table"},skiprows=2,flavor="html5lib")
>>>df
[ 0 1 2
0 Locations Latitude Longitude
1 NaN NaN NaN
2 Augustow 53°51'N 23°00'E
3 Auschwitz/Oswiecim 50°02'N 19°11'E
4 Biala Podxlaska 52°04'N 23°06'E
.. ... ... ...
177 Zawiercie 50°30'N 19°24'E
178 Zdunska Wola 51°37'N 18°59'E
179 Zgorzelec 51°10'N 15°0'E
180 Zyrardow 52°3'N 20°28'E
181 Zywiec 49°42'N 19°10'E
[182 rows x 3 columns]]
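If needed, a possible cleanup step (a sketch based on the output above; html5lib itself is installed with pip install html5lib) is to promote the first row to the header and drop the all-NaN separator row:

import pandas as pd

tables = pd.read_html(
    "https://www.mapsofworld.com/lat_long/poland-lat-long.html",
    attrs={"class": "tableizer-table"}, skiprows=2, flavor="html5lib")
df = tables[0]
df.columns = df.iloc[0]  # first row holds Locations / Latitude / Longitude
df = df.iloc[1:].dropna(how="all").reset_index(drop=True)
print(df.head())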
I am trying to read data from http://dummy.restapiexample.com/api/v1/employees and output it in tabular format.
I am getting output, but the columns are not created from the JSON file.
How can I do this the right way?
Code:
import pandas as pd
import json
df1 = pd.read_json('http://dummy.restapiexample.com/api/v1/employees')
df1.to_csv('try.txt',sep='\t',index=False)
Expected Output:
employee_name employee_salary employee_age profile_image
Tiger Nixon 320800 61
(along with other rows)
You can read the data directly from the web, like you're doing, but you need to help pandas interpret your data with the orient parameter:
df = pd.read_json('http://dummy.restapiexample.com/api/v1/employees', orient='index')
Then there's a second step to focus on the data you want:
df1 = pd.DataFrame(df.loc['data', 0])
Now you can write your csv.
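For example, keeping the tab-separated format from the question:

df1.to_csv('try.txt', sep='\t', index=False)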
Here are the different steps (note: the records are in the 'data' array of the JSON response):
import json
import pandas as pd
import requests
res = requests.get('http://dummy.restapiexample.com/api/v1/employees')
data_str = res.content
data_dict = json.loads(data_str)
data_df = pd.DataFrame(data_dict['data'])
data_df.to_csv('try.txt', sep='\t', index=False)
You have to parse your JSON first.
import pandas as pd
import json
import requests
r = requests.get('http://dummy.restapiexample.com/api/v1/employees')
j = json.loads(r.text)
df = pd.DataFrame(j['data'])
Output:
id employee_name employee_salary employee_age profile_image
0 1 Tiger Nixon 320800 61
1 2 Garrett Winters 170750 63
2 3 Ashton Cox 86000 66
3 4 Cedric Kelly 433060 22
4 5 Airi Satou 162700 33
5 6 Brielle Williamson 372000 61
6 7 Herrod Chandler 137500 59
7 8 Rhona Davidson 327900 55
8 9 Colleen Hurst 205500 39
I have two dataframes as follows
transactions
buy_date buy_price
0 2018-04-16 33.23
1 2018-05-09 33.51
2 2018-07-03 32.74
3 2018-08-02 33.68
4 2019-04-03 33.58
and
cii
from_fy to_fy score
0 2001-04-01 2002-03-31 100
1 2002-04-01 2003-03-31 105
2 2003-04-01 2004-03-31 109
3 2004-04-01 2005-03-31 113
4 2005-04-01 2006-03-31 117
In the transactions dataframe I need to create a new column cii_score based on the following condition:
if transactions['buy_date'] is between cii['from_fy'] and cii['to_fy'], take the cii['score'] value for transactions['cii_score'].
I have tried a list comprehension, but with no luck.
I would appreciate your inputs on how to tackle this.
First, we set up your dataframes. Note I modified the dates in transactions in this short example to make it more interesting.
import pandas as pd
from io import StringIO
trans_data = StringIO(
"""
,buy_date,buy_price
0,2001-04-16,33.23
1,2001-05-09,33.51
2,2002-07-03,32.74
3,2003-08-02,33.68
4,2003-04-03,33.58
"""
)
cii_data = StringIO(
"""
,from_fy,to_fy,score
0,2001-04-01,2002-03-31,100
1,2002-04-01,2003-03-31,105
2,2003-04-01,2004-03-31,109
3,2004-04-01,2005-03-31,113
4,2005-04-01,2006-03-31,117
"""
)
tr_df = pd.read_csv(trans_data, index_col = 0)
tr_df['buy_date'] = pd.to_datetime(tr_df['buy_date'])
cii_df = pd.read_csv(cii_data, index_col = 0)
cii_df['from_fy'] = pd.to_datetime(cii_df['from_fy'])
cii_df['to_fy'] = pd.to_datetime(cii_df['to_fy'])
The main thing is the following calculation: for each row of tr_df, find the index of the row in cii_df that satisfies the condition. The list comprehension below computes this match; each element of the list is the appropriate row index of cii_df:
match = [
    [(f <= d) & (d <= e) for f, e in zip(cii_df['from_fy'], cii_df['to_fy'])].index(True)
    for d in tr_df['buy_date']
]
match
produces
[0, 0, 1, 2, 2]
now we can merge on this (note that numpy is needed here):
import numpy as np

tr_df.merge(cii_df, left_on=np.array(match), right_index=True)
so that we get
key_0 buy_date buy_price from_fy to_fy score
0 0 2001-04-16 33.23 2001-04-01 2002-03-31 100
1 0 2001-05-09 33.51 2001-04-01 2002-03-31 100
2 1 2002-07-03 32.74 2002-04-01 2003-03-31 105
3 2 2003-08-02 33.68 2003-04-01 2004-03-31 109
4 2 2003-04-03 33.58 2003-04-01 2004-03-31 109
and the score column is what you asked for.
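As an alternative sketch of my own (not part of the answer above): since the fiscal-year ranges do not overlap, you could build a pd.IntervalIndex over them and look each buy_date up directly. This assumes every date falls inside some interval, since get_indexer returns -1 for misses:

import pandas as pd

intervals = pd.IntervalIndex.from_arrays(cii_df['from_fy'], cii_df['to_fy'], closed='both')
idx = intervals.get_indexer(tr_df['buy_date'])  # position of the containing interval, -1 if none
tr_df['cii_score'] = cii_df['score'].to_numpy()[idx]  # assumes no -1 entries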
I'm only able to get the headers, not the data, from my JSON data.
I have tried to use json_normalize, which creates a DataFrame from JSON data, but when I loop and append the data the result is that I only get the headers.
import pandas as pd
import json
import requests
from pandas.io.json import json_normalize
import numpy as np

# importing json data
def get_json(file_path):
    r = requests.get('https://www.atg.se/services/racinginfo/v1/api/games/V75_2019-09-29_5_6')
    jsonResponse = r.json()
    with open(file_path, 'w', encoding='utf-8') as outfile:
        json.dump(jsonResponse, outfile, ensure_ascii=False, indent=None)

# Run the function and choose where to save the json file
get_json('../trav.json')

# Open the json file and print a list of the keys
with open('../trav.json', 'r') as json_data:
    d = json.load(json_data)
    print(list(d.keys()))
[Out]:
['#type', 'id', 'status', 'pools', 'races', 'currentVersion']
To get all the data for the starts in one race I can use the json_normalize function:
race_1_starts = json_normalize(d['races'][0]['starts'])
race_1_starts_df = race_1_starts.drop('videos', axis=1)
print(race_1_starts_df)
[Out]:
distance driver.birth ... result.prizeMoney result.startNumber
0 1640 1984 ... 62500 1
1 1640 1976 ... 11000 2
2 1640 1968 ... 500 3
3 1640 1953 ... 250000 4
4 1640 1968 ... 500 5
5 1640 1962 ... 18500 6
6 1640 1961 ... 7000 7
7 1640 1989 ... 31500 8
8 1640 1960 ... 500 9
9 1640 1954 ... 500 10
10 1640 1977 ... 125000 11
11 1640 1977 ... 500 12
Above we get a DataFrame with data on all starts from one race. However, when I try to loop through all the races in order to get data on all starts for all races, I only get the headers from each race and not the data on the starts:
all_starts = []
for t in range(len(d['races'])):
    all_starts.append([t+1, json_normalize(d['races'][t]['starts'])])

all_starts_df = pd.DataFrame(all_starts, columns=['race', 'starts'])
print(all_starts_df)
[Out]:
race starts
0 1 distance ... ...
1 2 distance ... ...
2 3 distance ... ...
3 4 distance ... ...
4 5 distance ... ...
5 6 distance ... ...
6 7 distance ... ...
In the output I want a DataFrame that merges the data on all starts from all races. Note that the number of columns can differ between races: if one race has 21 columns and another has 20, then all_starts_df should contain all the columns, and where a race has no data for a column it should say NaN.
Expected result:
[Out]:
race distance driver.birth ... result.column_20 result.column_22
1 1640 1984 ... 12500 1
1 1640 1976 ... 11000 2
2 2140 1968 ... NaN 1
2 2140 1953 ... NaN 2
3 3360 1968 ... 1500 NaN
3 3360 1953 ... 250000 NaN
If you want all columns you can try this. (I find a lot more than 20 columns, so I might have something wrong.)
all_starts = []
headers = []
for idx, race in enumerate(d['races']):
    df = json_normalize(race['starts'])
    df['race'] = idx
    all_starts.append(df.drop('videos', axis=1))
    headers.append(set(df.columns))

# Create set of all columns for all races
columns = set.union(*headers)

# If columns are missing from one dataframe add it (as np.nan)
for df in all_starts:
    for c in columns - set(df.columns):
        df[c] = np.nan

# Concatenate all dataframes for each race to make one dataframe
df_all_starts = pd.concat(all_starts, axis=0, sort=True)
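As a side note on the design: pd.concat already aligns differing columns across the frames and fills the gaps with NaN, so the manual column-filling loop above mainly makes that behaviour explicit.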
Alternatively, if you know the names of the columns you want to keep, try this:
columns = ['race', 'distance', 'driver.birth', 'result.prizeMoney']
all_starts = []
for idx, race in enumerate(d['races']):
    df = json_normalize(race['starts'])
    df['race'] = idx
    all_starts.append(df[columns])

# Concatenate all dataframes for each race to make one dataframe
df_all_starts = pd.concat(all_starts, axis=0)