How can I use `pivot` to track wins and losses? - python

Suppose I have some team data as a dataframe df.
home_team  home_score  away_team  away_score
A          3           C          1
B          1           A          0
C          3           B          2
I'd like to create a dataframe indicating how many times one team has beaten another. For instance, the entry at [1, 3] would be the number of times team 1 has beaten team 3, while the entry at [3, 1] would be the number of times team 3 has beaten team 1.
This sounds like something df.pivot should be able to do, but I can't seem to get it to do what I would like.
How can I accomplish this using pandas?
Here is the desired output:
   A  B  C
A  0  0  1
B  1  0  0
C  0  1  0

This will create a new dataframe with just the winners and losers. It can then be pivoted to create what you are looking for.
I made some additional data to fill in some of the pivot table values
import pandas as pd
data = {'home_team':['A','B','C','A','B','C','A','B','C'],
        'home_score':[3,1,3,0,1,2,0,4,0],
        'away_team':['C','A','B','B','C','B','C','A','A'],
        'away_score':[1,0,2,2,0,3,1,7,1]}
df = pd.DataFrame(data)
# create new dataframe
WL = pd.DataFrame()
WL['winner'] = pd.concat([df.home_team[df.home_score > df.away_score],
                          df.away_team[df.home_score < df.away_score]], axis=0)
WL['loser'] = pd.concat([df.home_team[df.home_score < df.away_score],
                         df.away_team[df.home_score > df.away_score]], axis=0)
WL['game'] = 1
# groupby to count the number of win/lose pairs
WL_gb = WL.groupby(['winner','loser']).count().reset_index()
# pivot the data
WL_piv = WL_gb.pivot(index='winner', columns='loser', values='game')
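The pivot will contain NaN for pairings that never occurred. If you want the zero-filled square table from the question, one small follow-up (not part of the original answer, just standard pandas) is to reindex on the full set of teams and fill the gaps:
teams = sorted(set(df.home_team) | set(df.away_team))
WL_piv = WL_piv.reindex(index=teams, columns=teams).fillna(0).astype(int)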

Count first and last occurrences in a dataset

I ultimately want to count the number of months in a given range for each user. For example, see below: user 1 has one range of data, from April 2021 to June 2021. Where I'm struggling is counting users that have multiple ranges (see users 3 & 4).
I have a pandas df with columns that look like this:
username  Jan_2021  Feb_2021  March_2021  April_2021  May_2021  June_2021  July_2021  Sum_of_Months
user 1    0         0         0           1           1         1          0          3
user 2    0         0         0           0           0         0          1          1
user 3    1         1         1           0           1         1          0          5
user 4    0         1         1           1           0         1          1          5
I'd like to be able to get summary columns that give the number of groups and their lengths.
For example:
By "num of groups" I mean the number of runs of consecutive 1s, and by "length of group" I mean the number of months in one such run, as if I drew a circle around each block of 1s. For example, user 1's length is 3 because there is a 1 in each of the columns April through June 2021.
username  Num_of_groups  Lenth_of_group
user 1    1              3
user 2    1              1
user 3    2              3,2
user 4    2              3,2
You can try the groupby function from itertools:
from itertools import groupby
df1 = df[[col for col in df.columns if "2021" in col]]
df["Lenth_of_group"] = df1.apply(lambda x: [sum(g) for i, g in groupby(x) if i == 1],axis=1)
df["Num_of_groups"] = df["Lenth_of_group"].apply(lambda x: len(x))
Hope this helps.
This solution uses staircase, and works by treating each user's data as a step function (of 1s and 0s).
setup
import pandas as pd
df = pd.DataFrame(
    {
        "username": ["user 1", "user 2", "user 3", "user 4"],
        "Jan_2021": [0, 0, 1, 0],
        "Feb_2021": [0, 0, 1, 0],
        "Mar_2021": [0, 0, 1, 1],
        "April_2021": [1, 0, 0, 1],
        "May_2021": [1, 0, 1, 0],
        "June_2021": [1, 0, 1, 1],
        "July_2021": [0, 1, 1, 0],
        "Sum_of_Months": [3, 1, 5, 5],
    }
)
solution
import staircase as sc
# trim down to month columns only, and transpose to make users correspond to columns, and months correspond to rows
data = df[["Jan_2021", "Feb_2021", "Mar_2021", "April_2021", "May_2021", "June_2021", "July_2021"]].transpose().reset_index(drop=True)
def extract_groups(series):
    return (
        sc.Stairs.from_values(initial_value=0, values=series)  # create step function for each user
        .clip(0, len(series))    # clip step function to region of interest
        .to_frame()              # represent data as start/stop intervals in a dataframe
        .query("value==1")       # filter for groups of 1s
        .eval("dist=end-start")  # calculate the length of each "group"
        ["dist"].to_list()       # convert the result from Series to list
    )
sfs = data.columns.to_series().apply(lambda c: extract_groups(data[c]))
sfs is a pandas.Series whose values are lists of the group lengths for each user (so the length of each list gives the number of groups). It looks like this:
0 [3]
1 [1]
2 [3, 3]
3 [2, 1]
dtype: object
You can use it to create the data you need, e.g.
df["Num_of_groups"] = sfs.apply(list.__len__)
adds the Num_of_groups column to your original dataframe
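Similarly, assuming sfs is in the same row order as df (which holds here, since the transposed columns follow the original rows), the group lengths themselves can be attached with, e.g.
df["Lenth_of_group"] = sfs.to_list()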
Disclaimer: I am author of staircase

Pandas generate a column of random numbers based on ID column

I'd like to generate random numbers from 1 to n based on the ID column in my DataFrame. Repeating values in the ID column should get the same random number. A random number can be assigned to more than one ID, but the number of IDs belonging to each random number should be equal, or as close to equal as possible. I'd also like a seed value so that I can replicate the results.
As a very simple example, say I have an ID column with the values A, B, C, D, E and I'd like to assign random numbers from 1 to 2. In this example, IDs A, B and E were assigned random number 1 and IDs C and D were assigned 2.
ID Random
A 1
C 2
A 1
B 1
E 1
D 2
Also, I have a very large DataFrame, so speed is very important.
Update: What I tried previously was getting a unique list of the IDs and then generating random numbers for each, but I made a DataFrame and tried to merge the two DataFrames, which was too time consuming.
Thanks to S3DEV, who suggested mapping a dictionary onto the column, which was a lot faster:
import numpy as np

ID_list = df['ID'].unique()
# np.random.randint's upper bound is exclusive, so use 3 to draw from {1, 2}
random_list = np.random.randint(1, 3, size=len(ID_list))
dic = {ID_list[i]: random_list[i] for i in range(len(ID_list))}
df['Random'] = df['ID'].map(dic)
To fix your approach (i.e. create a side dataframe):
import numpy as np

n = 10
ids = df[["ID"]].drop_duplicates()
ids["Random"] = np.random.randint(1, n, len(ids))
# index both frames by ID so each occurrence of an ID picks up the same number
ids.set_index("ID", inplace=True)
df.set_index("ID", inplace=True)
df["Random"] = ids["Random"]
df.reset_index(inplace=True)
Outputs:
ID Random
0 A 6
1 C 7
2 A 6
3 B 4
4 E 1
5 D 6
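If you also need the reproducible seed and the "as equal as possible" split from the question, one option (a sketch, not from either snippet above) is to shuffle the unique IDs with a seeded generator and hand out group numbers round-robin:
import numpy as np

n = 2                                  # number of groups to assign
rng = np.random.default_rng(42)        # fixed seed, so results can be replicated
ids = df['ID'].unique()
rng.shuffle(ids)                       # random but repeatable ordering of the IDs
mapping = {id_: i % n + 1 for i, id_ in enumerate(ids)}  # round-robin keeps group sizes as equal as possible
df['Random'] = df['ID'].map(mapping)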

Pandas DataFrames: Extract Information and Collapse Columns

I have a pandas DataFrame which contains information in columns which I would like to extract into a new column.
It is best explained visually:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Number Type 1':[1,2,np.nan],
                   'Number Type 2':[np.nan,3,4],
                   'Info':list('abc')})
The table shows the initial DataFrame with the Number Type 1 and Number Type 2 columns.
I would like to extract the types and create a new Type column, refactoring the DataFrame accordingly.
Basically, the numbers are collapsed into a single Number column, and the types are extracted into a new Type column. The information in the Info column stays bound to the numbers (e.g. 2 and 3 share the same information, b).
What is the best way to do this in Pandas?
Use melt with dropna:
df = df.melt('Info', value_name='Number', var_name='Type').dropna(subset=['Number'])
df['Type'] = df['Type'].str.extract(r'(\d+)')
df['Number'] = df['Number'].astype(int)
print (df)
  Info Type  Number
0    a    1       1
1    b    1       2
4    b    2       3
5    c    2       4
Another solution with set_index and stack:
df = df.set_index('Info').stack().rename_axis(('Info','Type')).reset_index(name='Number')
df['Type'] = df['Type'].str.extract(r'(\d+)')
df['Number'] = df['Number'].astype(int)
print (df)
  Info Type  Number
0    a    1       1
1    b    1       2
2    b    2       3
3    c    2       4
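A third option (not in the answers above, just a sketch starting again from the original df) is pd.wide_to_long, which is built for exactly this stub-plus-suffix column layout; it assumes here that the columns share the 'Number Type' stub separated from the digit by a space:
df = (pd.wide_to_long(df, stubnames='Number Type', i='Info', j='Type', sep=' ')
        .dropna()
        .reset_index()
        .rename(columns={'Number Type': 'Number'}))
df['Number'] = df['Number'].astype(int)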

Grouping data from multiple columns in data frame into summary view

I have a data frame as below and would like to create the summary information shown. Can you please help with how this can be done in pandas?
Data-frame:
import pandas as pd
ds = pd.DataFrame(
    [
        {"id": "1", "owner": "A", "delivery": "1-Jan", "priority": "High", "exception": "No Bill"},
        {"id": "2", "owner": "A", "delivery": "2-Jan", "priority": "Medium", "exception": ""},
        {"id": "3", "owner": "B", "delivery": "1-Jan", "priority": "High", "exception": "No Bill"},
        {"id": "4", "owner": "B", "delivery": "1-Jan", "priority": "High", "exception": "No Bill"},
        {"id": "5", "owner": "C", "delivery": "1-Jan", "priority": "High", "exception": ""},
        {"id": "6", "owner": "C", "delivery": "2-Jan", "priority": "High", "exception": ""},
        {"id": "7", "owner": "C", "delivery": "", "priority": "High", "exception": ""},
    ]
)
Result: a per-owner summary with counts per delivery date, a count of high-priority items, a count of "No Bill" exceptions, and the list of ids (see the output at the end of the answer below).
Use:
#crosstab and rename empty string column
df = pd.crosstab(ds['owner'], ds['delivery']).rename(columns={'':'No delivery Date'})
#change positions of columns - first one to last one
df = df[df.columns[1:].tolist() + df.columns[:1].tolist()]
#get counts by comparing and sum of True values
df['high_count'] = ds['priority'].eq('High').groupby(ds['owner']).sum().astype(int)
df['exception_count'] = ds['exception'].eq('No Bill').groupby(ds['owner']).sum().astype(int)
#convert id to string and join with ,
df['ids'] = ds['id'].astype(str).groupby(ds['owner']).agg(','.join)
#index to column
df = df.reset_index()
#remove the index name delivery
df.columns.name = None
print (df)
  owner  1-Jan  2-Jan  No delivery Date  high_count  exception_count    ids
0     A      1      1                 0           1                1    1,2
1     B      2      0                 0           2                2    3,4
2     C      1      1                 1           3                0  5,6,7
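If you prefer to build the count and id columns in one pass, a rough alternative (a sketch, not the original answer; named aggregation needs pandas 0.25+) is:
counts = (ds.assign(high=ds['priority'].eq('High'),
                    no_bill=ds['exception'].eq('No Bill'))
            .groupby('owner')
            .agg(high_count=('high', 'sum'),
                 exception_count=('no_bill', 'sum'),
                 ids=('id', ','.join)))
df = (pd.crosstab(ds['owner'], ds['delivery'])
        .rename(columns={'': 'No delivery Date'})
        .join(counts)
        .reset_index())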

Sort a column within groups in Pandas

I am new to pandas. I'm trying to sort a column within each group. So far, I have been able to group the first and second column values together and calculate the mean of the third column, but I am still struggling to sort the third column.
This is my input dataframe.
This is my dataframe after applying the groupby and mean functions.
I used the following line of code to group input dataframe,
df_o = df.groupby(by=['Organization Group','Department']).agg({'Total Compensation': 'mean'})
Please let me know how to sort the last column for each group in 1st column using pandas.
It seems you need sort_values:
# to return a DataFrame, add the parameter as_index=False
df_o = df.groupby(['Organization Group','Department'],
                  as_index=False)['Total Compensation'].mean()
df_o = df_o.sort_values(['Total Compensation','Organization Group'])
Sample:
df = pd.DataFrame({'Organization Group':['a','b','a','a'],
                   'Department':['d','f','a','a'],
                   'Total Compensation':[1,8,9,1]})
print (df)
  Department Organization Group  Total Compensation
0          d                  a                   1
1          f                  b                   8
2          a                  a                   9
3          a                  a                   1
df_o = df.groupby(['Organization Group','Department'],
                  as_index=False)['Total Compensation'].mean()
print (df_o)
  Organization Group Department  Total Compensation
0                  a          a                   5
1                  a          d                   1
2                  b          f                   8
df_o = df_o.sort_values(['Total Compensation','Organization Group'])
print (df_o)
  Organization Group Department  Total Compensation
1                  a          d                   1
0                  a          a                   5
2                  b          f                   8
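Note that sort_values(['Total Compensation','Organization Group']) orders by compensation across all groups; for this sample the printed result is the same, but if you specifically want the compensation sorted within each Organization Group, put the group key first, e.g.
df_o = df_o.sort_values(['Organization Group','Total Compensation'])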
