Calculate Win Rates in Python Using groupby and Lambda Functions

I'm trying to create a new df from race_dbs that's grouped by 'horse_id', showing the number of times 'place' = 1 as well as the total number of times that 'horse_id' occurs.
Some background on the dataset, if it's helpful:
race_dbs contains horse race data. There are 12 horses in a race, and for each one its odds, fire, place, time, and gate number are shown.
What I'm trying to achieve with this code is the calculation of win rates for each horse.
A win is denoted by 'place' = 1.
The total race count will be calculated from how many times a particular 'horse_id' occurs in the db.
race_dbs

race_id    horse_id  odds  fire  place  horse_time  gate
V14qANzi   398807    NaN   0     1      72.0191     7
xeieZak    191424    NaN   0     8      131.3010    10
xeieZak    139335    NaN   0     1      131.3713    9
xeieZak    137195    NaN   0     11     131.6310    11
xeieZak    398807    NaN   0     12     131.7886    2
...        ...       ...   ...   ...    ...         ...
From this simple table the output would look like the following, but please bear in mind my dataset is very large, containing 12,882,353 rows in total.
desired output

horse_id  wins  races  win rate
398807    1     2      50%
191424    0     1      0%
139335    1     1      100%
137195    0     1      0%
...       ...   ...    ...
It should be noted that I'm a complete coding beginner, so forgive me if this is an easy solve.
I have tried to use the pandas groupby and lambda functions, but I am struggling to combine them and believe there will be a much simpler way.
import pandas as pd
race_db = pd.read_csv('horse_race_data_db.csv')
race_db_2 = pd.read_csv('2_horse_race_data.csv')
frames = [race_db, race_db_2]
race_dbs = pd.concat(frames, ignore_index=True, sort=False)
race_dbs_horse_wins = race_dbs.groupby('horse_id')['place'].apply(lambda x: x[x == 1].count())
race_dbs_horse_sums = race_dbs.groupby('horse_id').aggregate({"horse_id":['sum']})
Thanks for the help!

To count True values, create a helper boolean column and aggregate with sum; for the win rate aggregate with mean, and for the race count use GroupBy.size, all via named aggregations in GroupBy.agg:
out = (race_dbs.assign(no1=race_dbs['place'].eq(1))
               .groupby('horse_id', sort=False, as_index=False)
               .agg(**{'wins': ('no1', 'sum'),
                       'races': ('horse_id', 'size'),
                       'win rate': ('no1', 'mean')}))
print(out)
   horse_id  wins  races  win rate
0    398807     1      2       0.5
1    191424     0      1       0.0
2    139335     1      1       1.0
3    137195     0      1       0.0
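The desired output shows the win rate as a percentage string rather than a fraction; a minimal follow-up sketch, assuming the out dataframe produced above:

# Render the fractional win rate as a percentage string such as '50%'
out['win rate'] = out['win rate'].map('{:.0%}'.format)
print(out)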

You can try this way:
Example code:
import pandas as pd

new_technologies = {
    'Courses': ["Python", "Java", "Python", "Ruby", "Ruby"],
    'Fees': [22000, 25000, 23000, 24000, 26000],
    'Duration': ['30days', '50days', '30days', '30days', '30days']
}
print('new_technologies:', new_technologies)
df = pd.DataFrame(new_technologies)
print('df:', df)

# Calculate percentage of aggregated functions
df2 = df.groupby(['Courses', 'Fees']).agg({'Fees': 'sum'})
print(df2)

# Percentage by lambda and DataFrame.apply() method
df3 = df2.groupby(level=0).apply(lambda x: 100 * x / float(x.sum()))
print(df3)
output (final df3, the fee percentages within each course):

                     Fees
Courses Fees
Java    25000  100.000000
Python  22000   48.888889
        23000   51.111111
Ruby    24000   48.000000
        26000   52.000000

Related

Given windows of start and stop times for each object, how can I count how many objects were on for each second?

I have the following pandas dataframe:
import pandas as pd
TurnedOn = pd.Series([1000.4,1200.5,1550.1,500.3])
TurnedOff = pd.Series([1400.2,1600.8,1570.3,74500.6])
df = pd.DataFrame(data=[TurnedOn,TurnedOff]).T
df.index = ['OBJ1','OBJ2','OBJ3','OBJ4']
I want to get a time based count in seconds throughout the day of how many lights were on at a 0.1 second sampling rate.
I've tried doing this by making a large dataframe from 0 to 864000 (seconds per day times 10), setting each object to True for each 0.1 second in the time window between TurnedOn and TurnedOff, and then counting them, but this is horribly inefficient for large dataframes.
Is there something in python that I can use to count how many lights are on each second?
For instance, the output would be:
500.3-1000.4: 1 light
1000.4-1200.5: 2 lights
1200.5-1400.2: 3 lights
1400.2-1550.1: 2 lights
1550.1-1570.3: 3 lights
1570.3-1600.8: 2 lights
1600.8-74500.6: 1 light
With the following toy dataframe:
import pandas as pd
TurnedOn = pd.Series([1000.4, 1200.5, 1550.1, 500.3])
TurnedOff = pd.Series([1400.2, 1600.8, 1570.3, 74500.6])
df = pd.DataFrame(data=[TurnedOn, TurnedOff]).T
df.columns = ["TurnedOn", "TurnedOff"]
print(df)
# Output
   TurnedOn  TurnedOff
0    1000.4     1400.2
1    1200.5     1600.8
2    1550.1     1570.3
3     500.3    74500.6
Here is one way to do it with Pandas unstack and cumsum:
# Prep data
df = (
    df.unstack()
    .reset_index()
    .drop(columns="level_1")
    .rename(columns={"level_0": "status", 0: "start"})
)
df = df.sort_values(by="start", ignore_index=True)
df["end"] = df["start"].shift(-1)

# Count how many lights are simultaneously on
df["num_lights_on"] = df.apply(lambda x: 1 if x["status"] == "TurnedOn" else -1, axis=1)
df["num_lights_on"] = df["num_lights_on"].cumsum()

# Cleanup
df = df.reindex(["start", "end", "num_lights_on"], axis=1).dropna()
Then:
print(df)
# Output
    start      end  num_lights_on
0   500.3   1000.4              1
1  1000.4   1200.5              2
2  1200.5   1400.2              3
3  1400.2   1550.1              2
4  1550.1   1570.3              3
5  1570.3   1600.8              2
6  1600.8  74500.6              1
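As a design note, this is a sweep line over the on/off events: each TurnedOn row contributes +1, each TurnedOff row -1, and the cumulative sum gives the count of lights on in each interval. On a large frame, the row-wise apply can also be replaced with a vectorized map at that same step; a minimal sketch, assuming df still has the status column from the prep step:

# Vectorized alternative to the row-wise apply: map each event to +1/-1
df["num_lights_on"] = df["status"].map({"TurnedOn": 1, "TurnedOff": -1}).cumsum()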

Find "most used items" per "level" in big csv file with Pandas

I have a rather big csv file and I want to find out which items are used the most at a certain player level.
So one column I'm looking at has all the player levels (from 1 to 30) another column has all the item names (e.g. knife_1, knife_2, etc.) and yet another column lists backpacks (backback_1, backpack_2, etc.).
Now I want to check which is the most used knife and backpack for player level 1, for player level 2, player level 3, etc.
What I tried was this, but when I verified it in Excel (with COUNTIFS) the results were different:
import pandas as pd
df = pd.read_csv('filename.csv')
#getting the columns I need:
df = df[["playerLevel", "playerKnife", "playerBackpack"]]
print(df.loc[df["playerLevel"] == 1].mode())
In my head, this should locate all the rows with playerLevel 1 and then only print out the most used items for that level. However, I wanted to double-check and used COUNTIFS in Excel, which gave me a different result.
Maybe I'm thinking too simply (or too complicated), so I hope you can either verify that my code is correct or point out the error.
I'm also looking for an easy way to then go through all levels automatically and print out the most used items for each level.
Thanks in advance.
Edit:
Dataframe example. Just imagine there are thousands of players that can range from level 1 to level 30. And especially on higher levels, they have access to a lot of knives and backpacks. So the combinations are limitless.
index  playerLevel  playerKnife  playerBackpack
0      1            knife_1      backpack_1
1      2            knife_2      backpack_1
2      3            knife_1      backpack_2
3      1            knife_2      backpack_1
4      2            knife_3      backpack_2
5      1            knife_1      backpack_1
6      15           knife_13     backpack_12
7      13           knife_10     backpack_9
8      1            knife_1      backpack_2
Try the following:
data = """\
index playerLevel playerKnife playerBackpack
0 1 knife_1 backpack_1
1 2 knife_2 backpack_1
2 3 knife_1 backpack_2
3 1 knife_2 backpack_1
4 2 knife_3 backpack_2
5 1 knife_1 backpack_1
6 15 knife_13 backpack_12
7 13 knife_10 backpack_9
8 1 knife_1 backpack_2
"""
import io
import pandas as pd
stream = io.StringIO(data)
df = pd.read_csv(stream, sep='\s+')
df = df.drop('index', axis='columns')
print(df.groupby('playerLevel').agg(pd.Series.mode))
yields
                    playerKnife            playerBackpack
playerLevel
1                       knife_1                backpack_1
2            [knife_2, knife_3]  [backpack_1, backpack_2]
3                       knife_1                backpack_2
13                     knife_10                backpack_9
15                     knife_13               backpack_12
Note that the result of df.groupby('playerLevel').agg(pd.Series.mode) is a DataFrame, so you can assign that result and use it as a normal dataframe.
For data read plainly from a CSV file, simply use
df = pd.read_csv('filename.csv')
df = df[['playerLevel', 'playerKnife', 'playerBackpack']]  # or whichever columns you want
stats = df.groupby('playerLevel').agg(pd.Series.mode)  # stats will be a dataframe as well
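Since stats is indexed by playerLevel, looking up the most used items for a single level is then a plain .loc; a small usage sketch against the example output above:

# Most used knife for level 1 players (knife_1 in the example data)
print(stats.loc[1, 'playerKnife'])

# Levels with ties, such as level 2 above, return an array of modes
print(stats.loc[2, 'playerKnife'])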

Dividing Dataframes with Different Dimensions

I prefer to use matrix multiplication when coding, because it's so much more efficient than iterating, but I'm curious how to do this when the dimensions are different.
I have two different dataframes.
A:

Orig_vintage
Q12018    185
Q22018    200

and B:

default_month  1   2   3
orig_vintage
Q12018         0  25  35
Q22018         0  15  45
Q32018         0  35  65
and I'm trying to divide A through the columns of B, so the B dataframe becomes (note I've rounded random percentages):

default_month  1    2    3
orig_vintage
Q12018         0  .03  .04
Q22018         0  .04  .05
Q32018         0  .06  .07
But bottom line want to divide the monthly defaults by the total origination figure to get to a monthly default %.
The first step is to get the data side by side with a right join().
Then divide all the columns by the required value (see: Divide multiple columns by another column in pandas).
The required value, as I understand it, is the sum whenever the join did not supply one.
import io
import pandas as pd

df1 = pd.read_csv(
    io.StringIO("""Orig_vintage,Unnamed: 1\nQ12018,185\nQ22018,200\n"""), sep=","
)
df2 = pd.read_csv(
    io.StringIO(
        """default_month,1,2,3\nQ12018,0.0,25.0,35.0\nQ22018,0.0,15.0,45.0\nQ32018,0.0,35.0,65.0\n"""
    ),
    sep=",",
)
df1.set_index("Orig_vintage").join(df2.set_index("default_month"), how="right").pipe(
    lambda d: d.div(d["Unnamed: 1"].fillna(d["Unnamed: 1"].sum()), axis=0)
)
               Unnamed: 1  1         2         3
default_month
Q12018                1.0  0  0.135135  0.189189
Q22018                1.0  0  0.075000  0.225000
Q32018                NaN  0  0.090909  0.168831
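If the two frames share the vintage labels in their indexes, a more direct route is DataFrame.div with axis=0, which broadcasts one total per row; a minimal sketch, assuming frames mirroring A and B from the question (the variable names here are illustrative):

import pandas as pd

# Totals per vintage (A) and monthly defaults (B)
A = pd.Series({'Q12018': 185, 'Q22018': 200}, name='total')
B = pd.DataFrame({1: [0, 0, 0], 2: [25, 15, 35], 3: [35, 45, 65]},
                 index=['Q12018', 'Q22018', 'Q32018'])

# Divide each row of B by the matching vintage total; vintages missing
# from A (Q32018 here) come out as NaN rather than silently wrong
rates = B.div(A, axis=0)
print(rates)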

How to avoid using a loop in a df when accessing previous lines

I use pandas to process transport data. I study the attendance of bus lines. I have 2 columns counting people getting on and off the bus at each stop, and I want to create one which counts the people currently on board. At the moment, I loop through the df, and for line n it does: current[n] = on[n] - off[n] + current[n-1], as shown in the following example:
for index, row in df.iterrows():
    if index == 0:
        df.loc[index, 'current'] = df.loc[index, 'on']
    else:
        df.loc[index, 'current'] = df.loc[index, 'on'] - df.loc[index, 'off'] + df.loc[index - 1, 'current']
Is there a way to avoid using a loop?
Thanks for your time!
You can use Series.cumsum(), which accumulates the numbers in a given Series.
a = pd.DataFrame([[3, 4], [6, 4], [1, 2], [4, 5]], columns=["off", "on"])
a["current"] = a["on"].cumsum() - a["off"].cumsum()

   off  on  current
0    3   4        1
1    6   4       -1
2    1   2        0
3    4   5        1
If I've understood the problem properly, you could calculate the difference between people getting on and off, then have a running total using Series.cumsum():
import pandas as pd
# Create dataframe for demo
d = {'Stop':['A','B','C','D'],'On':[3,2,3,2],'Off':[2,1,0,1]}
df = pd.DataFrame(data=d)
# Get difference between 'On' and 'Off' columns.
df['current'] = df['On']-df['Off']
# Get cumulative sum of column
df['Total'] = df['current'].cumsum()
# Same thing in one line
df['Total'] = (df['On']-df['Off']).cumsum()
  Stop  On  Off  Total
0    A   3    2      1
1    B   2    1      2
2    C   3    0      5
3    D   2    1      6
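One subtlety worth noting: the original loop treats the first line specially (current[0] = on[0], ignoring off[0]), while both cumsum versions subtract off at the first stop too. If the first row's off really should be ignored, a small hedged adjustment to the example above:

# Zero out the first stop's 'Off' before accumulating, mirroring the
# loop's special case where current[0] = on[0]
off = df['Off'].copy()
off.iloc[0] = 0
df['Total'] = (df['On'] - off).cumsum()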

How do I convert a row from a pandas DataFrame from a Series back to a DataFrame?

I am iterating through the rows of a pandas DataFrame, expanding each one out into N rows with additional info on each one (for simplicity I've made it a random number here):
from pandas import DataFrame
import pandas as pd
from numpy import random, arange

N = 3
x = DataFrame.from_dict({'farm': ['A', 'B', 'A', 'B'],
                         'fruit': ['apple', 'apple', 'pear', 'pear']})
out = DataFrame()
for i, row in x.iterrows():
    rows = pd.concat([row] * N).reset_index(drop=True)  # requires row to be a DataFrame
    out = out.append(rows.join(DataFrame({'iter': arange(N), 'value': random.uniform(size=N)})))
In this loop, row is a Series object, so the call to pd.concat doesn't work. How do I convert it to a DataFrame? (e.g. the difference between x.ix[0:0] and x.ix[0])
Thanks!
Given what you commented, I would try

def giveMeSomeRows(group):
    return random.uniform(low=group.low, high=group.high, size=N)

results = x.groupby(['farm', 'fruit']).apply(giveMeSomeRows)

This should give you a separate result dataframe. I have assumed that every farm-fruit combination is unique... there might be other ways if we knew more about your data.
Update
Running code example
def giveMeSomeRows(group):
    return random.uniform(low=group.low, high=group.high, size=N)

N = 3
df = pd.DataFrame(arange(0, 8).reshape(4, 2), columns=['low', 'high'])
df['farm'] = 'a'
df['fruit'] = arange(0, 4)
results = df.groupby(['farm', 'fruit']).apply(giveMeSomeRows)
df

   low  high farm  fruit
0    0     1    a      0
1    2     3    a      1
2    4     5    a      2
3    6     7    a      3

results

farm  fruit
a     0    [0.176124290969, 0.459726835079, 0.999564934689]
      1    [2.42920143009, 2.37484506501, 2.41474002256]
      2    [4.78918572452, 4.25916442343, 4.77440617104]
      3    [6.53831891152, 6.23242754976, 6.75141668088]
If instead you want a dataframe, you can update the function to

def giveMeSomeRows(group):
    return pd.DataFrame(random.uniform(low=group.low, high=group.high, size=N))
results

                     0
farm fruit
a    0     0  0.281088
           1  0.020348
           2  0.986269
     1     0  2.642676
           1  2.194996
           2  2.650600
     2     0  4.545718
           1  4.486054
           2  4.027336
     3     0  6.550892
           1  6.363941
           2  6.702316
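On the literal question of turning a single row back into a DataFrame: in modern pandas .ix is gone, and selecting with a list of labels keeps the DataFrame type; a minimal sketch:

# Selecting with a list keeps the DataFrame type; scalar indexing returns a Series
one_row_df = x.iloc[[0]]      # one-row DataFrame, like the old x.ix[0:0]
one_row_series = x.iloc[0]    # Series, like the old x.ix[0]

# An existing Series row can be turned back into a one-row DataFrame
also_df = one_row_series.to_frame().T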
