Python version of dplyr R code commands for calculations

Python version of dplyr R code commands for calculations - python

I am trying to create a separate pandas DataFrame in python using pandas'.groupby function. I am working with basketball data and want to create a column that displays if the home and away teams are on the tail end of a back-to-back.
The 0 in the yesterday_home_team and yesterday_away_team columns indicates that the away team did not play the previous night.
Given that there are multiple games each night, the .groupby function should be used.
Input Data:
date home_team away_team
9/22/22 LAL DET
9/23/22 LAC LAL
Desired output:
date home_team away_team yesterday_home_team yesterday_away_team
9/21/22 LAL MIN 0 MIN
9/22/22 LAL DET DET 0
9/23/22 LAC LAL LAL LAC
Appreciate your assistance.

Your output example doesn't make sense to me. Do you need the team names in the 'yesterday_home_team' and 'yesterday_away_team'? Is it sufficient to simply just have a 1 if the home team is on the back to back, and 0 if the home team is not (and then also same logic for away team)? It's also tough when you don't provide a good sample dataset.
Anyways, here's my solution that just indicates a 1 or 0 if the given team is on the back end of the back to back:
import pandas as pd
import numpy as np
months = ['October', 'November', 'December', 'January', 'February', 'March', 'April', 'May', 'June']
dfs = []
for month in months:
month = month.lower()
url = f'https://www.basketball-reference.com/leagues/NBA_2022_games-{month}.html'
df = pd.read_html(url)[0]
df['Date'] = pd.to_datetime(df['Date'])
dfs.append(df)
df = pd.concat(dfs)
df = df.rename(columns={'Visitor/Neutral':'away_team', 'Home/Neutral':'home_team'})
df_melt = pd.melt(df, id_vars=['Date'],
value_vars=['away_team', 'home_team'],
var_name = 'Home_Away',
value_name = 'Team')
df_melt = df_melt.sort_values('Date').reset_index(drop=True)
df_melt['days_between'] = df_melt.groupby('Team')['Date'].diff().dt.days
df_melt['yesterday'] = np.where(df_melt['days_between'] == 1, 1, 0)
df_melt = df_melt.drop(['days_between', 'Home_Away'], axis=1)
df = df.merge(df_melt.rename(columns={'Team':'home_team', 'yesterday':'yesterday_home_team'}), how='left', left_on=['Date', 'home_team'], right_on=['Date', 'home_team'])
df = df.merge(df_melt.rename(columns={'Team':'away_team', 'yesterday':'yesterday_away_team'}), how='left', left_on=['Date', 'away_team'], right_on=['Date', 'away_team'])
df = df[['Date', 'home_team', 'away_team', 'yesterday_home_team', 'yesterday_away_team']]
Output:
print(df.head(30).to_string())
Date home_team away_team yesterday_home_team yesterday_away_team
0 2021-10-19 Milwaukee Bucks Brooklyn Nets 0 0
1 2021-10-19 Los Angeles Lakers Golden State Warriors 0 0
2 2021-10-20 Charlotte Hornets Indiana Pacers 0 0
3 2021-10-20 Detroit Pistons Chicago Bulls 0 0
4 2021-10-20 New York Knicks Boston Celtics 0 0
5 2021-10-20 Toronto Raptors Washington Wizards 0 0
6 2021-10-20 Memphis Grizzlies Cleveland Cavaliers 0 0
7 2021-10-20 Minnesota Timberwolves Houston Rockets 0 0
8 2021-10-20 New Orleans Pelicans Philadelphia 76ers 0 0
9 2021-10-20 San Antonio Spurs Orlando Magic 0 0
10 2021-10-20 Utah Jazz Oklahoma City Thunder 0 0
11 2021-10-20 Portland Trail Blazers Sacramento Kings 0 0
12 2021-10-20 Phoenix Suns Denver Nuggets 0 0
13 2021-10-21 Atlanta Hawks Dallas Mavericks 0 0
14 2021-10-21 Miami Heat Milwaukee Bucks 0 0
15 2021-10-21 Golden State Warriors Los Angeles Clippers 0 0
16 2021-10-22 Orlando Magic New York Knicks 0 0
17 2021-10-22 Washington Wizards Indiana Pacers 0 0
18 2021-10-22 Cleveland Cavaliers Charlotte Hornets 0 0
19 2021-10-22 Boston Celtics Toronto Raptors 0 0
20 2021-10-22 Philadelphia 76ers Brooklyn Nets 0 0
21 2021-10-22 Houston Rockets Oklahoma City Thunder 0 0
22 2021-10-22 Chicago Bulls New Orleans Pelicans 0 0
23 2021-10-22 Denver Nuggets San Antonio Spurs 0 0
24 2021-10-22 Los Angeles Lakers Phoenix Suns 0 0
25 2021-10-22 Sacramento Kings Utah Jazz 0 0
26 2021-10-23 Cleveland Cavaliers Atlanta Hawks 1 0
27 2021-10-23 Indiana Pacers Miami Heat 1 0
28 2021-10-23 Toronto Raptors Dallas Mavericks 1 0
29 2021-10-23 Chicago Bulls Detroit Pistons 1 0

Related

How to filter with conditions to add to new column

I was trying to work with dataframe that looks like:
home
away
home_score
away_score
Tampa Bay
Colorado
3
1
San Jose
Colombus
1
3
New England
San Jose
1
5
Colorado
Tampa Bay
2
0
New England
KC Wizards
2
1
My goal is to compare 'home_score' with 'away_score' and choose the string from 'home' or 'away' to store that value in to separate column based on which score was lower.
For example, for the first row, as away_score is 1 I should be able to add "Colorado" to a separate column.
Desired outcome:
home
away
home_score
away_score
lost_team
Tampa Bay
Colorado
3
1
Colorado
I tried to search for the task but I was not successful in finding methods.

You can use np.where
df['lost_team'] = np.where(df['home_score'] < df['away_score'], df['home'], df['away'])
print(df)
# Output
home away home_score away_score lost_team
0 Tampa Bay Colorado 3 1 Colorado
1 San Jose Colombus 1 3 San Jose
2 New England San Jose 1 5 New England
3 Colorado Tampa Bay 2 0 Tampa Bay
4 New England KC Wizards 2 1 KC Wizards
If a draw is possible, use np.select:
conds = [df['home_score'] < df['away_score'],
df['home_score'] > df['away_score']]
choices = [df['home'], df['away']]
draw = df[['home', 'away']].agg(list, axis=1)
df['lost_team'] = np.select(condlist=conds, choicelist=choices, default=draw).explode()
df = df.explode('lost_team')
print(df)
# Output
home away home_score away_score lost_team
0 Tampa Bay Colorado 3 1 Colorado
1 San Jose Colombus 1 3 San Jose
2 New England San Jose 1 5 New England
3 Colorado Tampa Bay 2 0 Tampa Bay
4 New England KC Wizards 2 1 KC Wizards
5 Team A Team B 0 0 Team A # Row 1
5 Team A Team B 0 0 Team B # Row 2

You can pandas.DataFrame.apply with axis=1 to check the condition on each row and save the result:
df['lost_team'] = df.apply(lambda row:
'Equal' if row['home_score'] == row['away_score'] else (
row['away'] if row['home_score'] > row['away_score'] else row['home']),
axis=1)
print(df)
home away home_score away_score lost_team
0 Tampa Bay Colorado 3 1 Colorado
1 San Jose Columbus 1 3 San Jose
2 New England San Jose 1 5 New England
3 Colorado Tampa Bay 2 0 Tampa Bay
4 New England KC Wizards 2 1 KC Wizards
5 Team A Team B 1 1 Equal

Scraping data from TeamRankings.com

I want to scrape some NBA data from TeamRankings.com for my program in python. Here is an example link:
https://www.teamrankings.com/nba/stat/effective-field-goal-pct?date=2023-01-03
I only need the "Last 3" column data. I want to be able to set the date to whatever I want with a constant variable. There are a few other data points I want that are on different links but I will be able to figure that part out if this gets figured out.
I have tried using https://github.com/tymiguel/TeamRankingsWebScraper but it is outdated and did not work for me.

The easiest way will be to use pandas.read_html:
import pandas as pd
url = 'https://www.teamrankings.com/nba/stat/effective-field-goal-pct?date=2023-01-03'
df = pd.read_html(url)[0]
print(df)
Prints:
Rank Team 2022 Last 3 Last 1 Home Away 2021
0 1 Brooklyn 58.8% 64.5% 68.3% 59.4% 58.1% 54.2%
1 2 Denver 57.8% 62.8% 52.2% 59.5% 56.4% 55.5%
2 3 Boston 56.8% 54.6% 51.1% 58.2% 55.1% 54.0%
3 4 Sacramento 56.3% 56.9% 48.4% 59.1% 53.4% 52.5%
4 5 Golden State 56.3% 53.2% 52.5% 56.9% 55.6% 55.4%
5 6 Dallas 56.0% 59.5% 50.0% 55.8% 56.2% 54.0%
6 7 Portland 55.5% 58.6% 65.5% 57.3% 54.3% 51.5%
7 8 Minnesota 55.3% 52.1% 59.2% 55.7% 54.9% 53.8%
8 9 Utah 55.3% 53.9% 53.7% 58.1% 53.0% 55.1%
9 10 Philadelphia 55.3% 57.3% 56.4% 54.5% 56.2% 53.6%
10 11 Cleveland 55.1% 57.7% 60.9% 56.7% 53.1% 53.7%
11 12 Washington 54.6% 61.4% 56.9% 54.7% 54.5% 53.2%
12 13 Chicago 54.6% 57.3% 54.7% 55.7% 53.5% 53.7%
13 14 Indiana 54.5% 60.3% 53.8% 56.1% 52.8% 53.1%
14 15 New Orleans 54.4% 52.5% 56.5% 56.2% 52.5% 51.8%
15 16 Phoenix 54.1% 51.6% 44.8% 54.8% 53.5% 55.0%
16 17 LA Clippers 54.1% 57.8% 52.2% 52.3% 55.8% 53.0%
17 18 LA Lakers 54.0% 56.6% 53.8% 53.7% 54.3% 53.7%
18 19 San Antonio 53.1% 54.6% 47.4% 53.4% 52.8% 52.7%
19 20 Orlando 52.9% 48.0% 44.5% 54.6% 50.9% 50.2%
20 21 Milwaukee 52.8% 45.5% 42.2% 55.0% 50.4% 54.0%
21 22 Memphis 52.8% 54.0% 51.0% 53.8% 51.8% 52.1%
22 23 Miami 52.6% 54.6% 52.9% 53.1% 52.1% 54.0%
23 24 New York 52.2% 51.4% 57.4% 53.9% 50.6% 51.3%
24 25 Atlanta 52.2% 51.5% 53.7% 51.5% 53.0% 54.2%
25 26 Okla City 52.2% 50.9% 44.6% 52.6% 51.7% 49.7%
26 27 Detroit 51.5% 52.3% 45.1% 52.7% 50.5% 49.4%
27 28 Toronto 51.1% 51.3% 52.7% 51.3% 50.8% 51.0%
28 29 Houston 51.0% 50.0% 51.8% 50.2% 51.6% 53.4%
29 30 Charlotte 50.3% 52.0% 51.1% 49.3% 51.2% 54.3%
If you want only Last 3 column:
print(df[['Team', 'Last 3']])
Prints:
Team Last 3
0 Brooklyn 64.5%
1 Denver 62.8%
2 Boston 54.6%
3 Sacramento 56.9%
...

inner join not working in pandas dataframes

I have the following 2 pandas dataframes:
city Population
0 New York City 20153634
1 Los Angeles 13310447
2 San Francisco Bay Area 6657982
3 Chicago 9512999
4 Dallas–Fort Worth 7233323
5 Washington, D.C. 6131977
6 Philadelphia 6070500
7 Boston 4794447
8 Minneapolis–Saint Paul 3551036
9 Denver 2853077
10 Miami–Fort Lauderdale 6066387
11 Phoenix 4661537
12 Detroit 4297617
13 Toronto 5928040
14 Houston 6772470
15 Atlanta 5789700
16 Tampa Bay Area 3032171
17 Pittsburgh 2342299
18 Cleveland 2055612
19 Seattle 3798902
20 Cincinnati 2165139
21 Kansas City 2104509
22 St. Louis 2807002
23 Baltimore 2798886
24 Charlotte 2474314
25 Indianapolis 2004230
26 Nashville 1865298
27 Milwaukee 1572482
28 New Orleans 1268883
29 Buffalo 1132804
30 Montreal 4098927
31 Vancouver 2463431
32 Orlando 2441257
33 Portland 2424955
34 Columbus 2041520
35 Calgary 1392609
36 Ottawa 1323783
37 Edmonton 1321426
38 Salt Lake City 1186187
39 Winnipeg 778489
40 San Diego 3317749
41 San Antonio 2429609
42 Sacramento 2296418
43 Las Vegas 2155664
44 Jacksonville 1478212
45 Oklahoma City 1373211
46 Memphis 1342842
47 Raleigh 1302946
48 Green Bay 318236
49 Hamilton 747545
50 Regina 236481
city W/L Ratio
0 Boston 2.500000
1 Buffalo 0.555556
2 Calgary 1.057143
3 Chicago 0.846154
4 Columbus 1.500000
5 Dallas–Fort Worth 1.312500
6 Denver 1.433333
7 Detroit 0.769231
8 Edmonton 0.900000
9 Las Vegas 2.125000
10 Los Angeles 1.655862
11 Miami–Fort Lauderdale 1.466667
12 Minneapolis-Saint Paul 1.730769
13 Montreal 0.725000
14 Nashville 2.944444
15 New York 1.517241
16 New York City 0.908870
17 Ottawa 0.651163
18 Philadelphia 1.615385
19 Phoenix 0.707317
20 Pittsburgh 1.620690
21 Raleigh 1.028571
22 San Francisco Bay Area 1.666667
23 St. Louis 1.375000
24 Tampa Bay 2.347826
25 Toronto 1.884615
26 Vancouver 0.775000
27 Washington, D.C. 1.884615
28 Winnipeg 2.600000
And I do a join like this:
result = pd.merge(df, nhl_df , on="city")
The result should have 28 rows, instead I have 24 rows.
One of the missing one is for example Miami-Fort Lauderdale
I have double checked on both dataframes and there are NO typographical errors. So, why isnt it in the end dataframe?
city Population W/L Ratio
0 New York City 20153634 0.908870
1 Los Angeles 13310447 1.655862
2 San Francisco Bay Area 6657982 1.666667
3 Chicago 9512999 0.846154
4 Dallas–Fort Worth 7233323 1.312500
5 Washington, D.C. 6131977 1.884615
6 Philadelphia 6070500 1.615385
7 Boston 4794447 2.500000
8 Denver 2853077 1.433333
9 Phoenix 4661537 0.707317
10 Detroit 4297617 0.769231
11 Toronto 5928040 1.884615
12 Pittsburgh 2342299 1.620690
13 St. Louis 2807002 1.375000
14 Nashville 1865298 2.944444
15 Buffalo 1132804 0.555556
16 Montreal 4098927 0.725000
17 Vancouver 2463431 0.775000
18 Columbus 2041520 1.500000
19 Calgary 1392609 1.057143
20 Ottawa 1323783 0.651163
21 Edmonton 1321426 0.900000
22 Winnipeg 778489 2.600000
23 Las Vegas 2155664 2.125000
24 Raleigh 1302946 1.028571

I think here is possible check if same chars by integer that represents the character in function ord, here are different – with code 150 and – with code 8211, so it is reason why values not matched:
a = df1.loc[10, 'city']
print (a)
Miami–Fort Lauderdale
print ([ord(x) for x in a])
[77, 105, 97, 109, 105, 150, 70, 111, 114, 116, 32, 76, 97, 117, 100, 101, 114, 100, 97, 108, 101]
b = df2.loc[11, 'city']
print (b)
Miami–Fort Lauderdale
print ([ord(x) for x in b])
[77, 105, 97, 109, 105, 8211, 70, 111, 114, 116, 32, 76, 97, 117, 100, 101, 114, 100, 97, 108, 101]
You can try copy values for replace for select correct - value:
#first – is copied from b, second – from a
df2['city'] = df2['city'].replace('–','–', regex=True)

Cumsum with groupby

I have a dataframe containing:
State Country Date Cases
0 NaN Afghanistan 2020-01-22 0
271 NaN Afghanistan 2020-01-23 0
... ... ... ... ...
85093 NaN Zimbabwe 2020-11-30 9950
85364 NaN Zimbabwe 2020-12-01 10129
I'm trying to create a new column of cumulative cases but grouped by Country AND State.
State Country Date Cases Total Cases
231 California USA 2020-01-22 5 5
342 California USA 2020-01-23 10 15
233 Texas USA 2020-01-22 4 4
322 Texas USA 2020-01-23 12 16
I have been trying to follow Pandas groupby cumulative sum and have tried things such as:
df['Total'] = df.groupby(['State','Country'])['Cases'].cumsum()
Returns a series of -1's
df['Total'] = df.groupby(['State', 'Country']).sum() \
.groupby(level=0).cumsum().reset_index()
Returns the sum.
df['Total'] = df.groupby(['Country'])['Cases'].apply(lambda x: x.cumsum())
Doesnt separate sums by state.
df_f['Total'] = df_f.groupby(['Region','State'])['Cases'].apply(lambda x: x.cumsum())
This one works exept when 'State' is NaN, 'Total' is also NaN.

arrays = [['California', 'California', 'Texas', 'Texas'],
['USA', 'USA', 'USA', 'USA'],
['2020-01-22','2020-01-23','2020-01-22','2020-01-23'], [5,10,4,12]]
df = pd.DataFrame(list(zip(*arrays)), columns = ['State', 'Country', 'Date', 'Cases'])
df
State Country Date Cases
0 California USA 2020-01-22 5
1 California USA 2020-01-23 10
2 Texas USA 2020-01-22 4
3 Texas USA 2020-01-23 12
temp = df.set_index(['State', 'Country','Date'], drop=True).sort_index( )
df['Total Cases'] = temp.groupby(['State', 'Country']).cumsum().reset_index()['Cases']
df
State Country Date Cases Total Cases
0 California USA 2020-01-22 5 5
1 California USA 2020-01-23 10 15
2 Texas USA 2020-01-22 4 4
3 Texas USA 2020-01-23 12 16

Label Merged DataFrame

I've combine two DataFrames into one but can't figure out how to label "state_x" and "state_y" tp "West Coast and "East Coast". I will be plotting them later.
What I have so far:
West_quakes = pd.DataFrame({'state': ['California', 'Oregon', 'Washington', 'Alaska'],
'Occurrences': [18108, 376, 973, 12326]})
East_quakes = pd.DataFrame({'state': ['Maine', 'New Hampshire', 'Massachusetts',
'Connecticut', 'New York', 'New Jersey', 'Pennsylvania', 'Maryland',
'Virginia', 'North Carolina', 'South Carolina', 'Georgia', 'Florida'],
'Occurrences': [36, 13, 10, 5, 35, 10, 14, 2, 28, 17, 32, 14, 1]})
West_quakes.reset_index(drop=True).merge(East_quakes.reset_index(drop=True), left_index=True, right_index=True)
Output:
state_x Occurrences_x state_y Occurrences_y
0 California 18108 Maine 36
1 Oregon 376 New Hampshire 13
2 Washington 973 Massachusetts 10
3 Alaska 12326 Connecticut 5
Other merging methods I've tried but results in syntax error such as:
West_quake.set_index('West Coast', inplace=True)
East_quake.set_index('East Coast', inplace=True)
I'm really lost after searching on Google and searching on here.
Any help would be greatly appreciated.
Thank you.

Maybe you are looking for concat instead:
pd.concat((West_quakes, East_quakes))
gives:
state Occurrences
0 California 18108
1 Oregon 376
2 Washington 973
3 Alaska 12326
0 Maine 36
1 New Hampshire 13
2 Massachusetts 10
3 Connecticut 5
4 New York 35
5 New Jersey 10
6 Pennsylvania 14
7 Maryland 2
8 Virginia 28
9 North Carolina 17
10 South Carolina 32
11 Georgia 14
12 Florida 1
Or:
pd.concat((West_quakes, East_quakes), keys=('West','East'))
which gives:
state Occurrences
West 0 California 18108
1 Oregon 376
2 Washington 973
3 Alaska 12326
East 0 Maine 36
1 New Hampshire 13
2 Massachusetts 10
3 Connecticut 5
4 New York 35
5 New Jersey 10
6 Pennsylvania 14
7 Maryland 2
8 Virginia 28
9 North Carolina 17
10 South Carolina 32
11 Georgia 14
12 Florida 1
Or:
pd.concat((West_quakes, East_quakes), axis=1, keys=('West','East'))
outputs:
West East
state Occurrences state Occurrences
0 California 18108.0 Maine 36
1 Oregon 376.0 New Hampshire 13
2 Washington 973.0 Massachusetts 10
3 Alaska 12326.0 Connecticut 5
4 NaN NaN New York 35
5 NaN NaN New Jersey 10
6 NaN NaN Pennsylvania 14
7 NaN NaN Maryland 2
8 NaN NaN Virginia 28
9 NaN NaN North Carolina 17
10 NaN NaN South Carolina 32
11 NaN NaN Georgia 14
12 NaN NaN Florida 1

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python version of dplyr R code commands for calculations - python

Related

How to filter with conditions to add to new column

Scraping data from TeamRankings.com

inner join not working in pandas dataframes

Cumsum with groupby

Label Merged DataFrame

Categories

Resources