Fastest way to do cumulative totals in Pandas dataframe - python

I've got a pandas dataframe of golfers' round scores going back to 2003 (approx 300000 rows). It looks something like this:
Date----Golfer---Tournament-----Score---Player Total Rounds Played
2008-01-01---Tiger Woods----Invented Tournament R1---72---50
2008-01-01---Phil Mickelson----Invented Tournament R1---73---108
I want the 'Player Total Rounds Played' column to be a running total of the number of rounds (i.e. instance in the dataframe) that a player has played up to that date. Is there a quick way of doing it? My current solution (basically using iterrows and then a one-line function) works fine but will take approx 11hrs to run.
Thanks,
Tom

Here is one way:
df = df.sort_values('Date')
df['Rounds CumSum'] = df.groupby('Golfer')['Rounds'].cumsum()
For example:
import pandas as pd
df = pd.DataFrame([['A', 70, 50],
['B', 72, 55],
['A', 73, 45],
['A', 71, 60],
['B', 74, 55],
['A', 72, 65]],
columns=['Golfer', 'Rounds', 'Played'])
df['Rounds CumSum'] = df.groupby('Golfer')['Rounds'].cumsum()
#   Golfer  Rounds  Played  Rounds CumSum
# 0      A      70      50             70
# 1      B      72      55             72
# 2      A      73      45            143
# 3      A      71      60            214
# 4      B      74      55            146
# 5      A      72      65            286
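If what you need is the number of rounds played so far rather than a running sum of a numeric column, groupby(...).cumcount() may be closer to what the question asks for. A minimal sketch, assuming the Date and Golfer column names from the question and a few made-up rows:

```python
import pandas as pd

# Made-up sample data; only the column names come from the question.
rounds = pd.DataFrame({
    "Date": pd.to_datetime(["2008-01-03", "2008-01-01", "2008-01-01", "2008-01-02"]),
    "Golfer": ["Tiger Woods", "Tiger Woods", "Phil Mickelson", "Tiger Woods"],
    "Score": [70, 72, 73, 71],
})

# Sort chronologically so the count follows date order within each golfer.
rounds = rounds.sort_values("Date", kind="stable").reset_index(drop=True)

# cumcount() numbers each golfer's rows 0, 1, 2, ...; add 1 to include the current round.
rounds["Player Total Rounds Played"] = rounds.groupby("Golfer").cumcount() + 1
```

Like cumsum, this is vectorized per group, so it should avoid the row-by-row cost of iterrows.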

Related

How do I add a column based on selected row filter in pandas?

Hi I would like to give a final score to the students based on current Score + Score for their favourite subject.
import pandas as pd
new_data = [['tom', 31, 50, 30, 20, 'English'], ['nick', 30, 42, 23, 21, 'Math'], ['juli', 39, 14, 40, 38, 'Science']]
df = pd.DataFrame(new_data, columns = ['Name','Current_Score','English','Science','Math','Favourite_Subject'])
for subj in df['Favourite_Subject'].unique():
    mask = (df['Favourite_Subject'] == subj)
    df['Final_Score'] = df[mask].apply(lambda row: row['Current_Score'] + row[subj], axis=1)
Name Score English Science Math Favourite_Subject Final_Score
0 tom 31 50 30 20 English NaN
1 nick 30 42 23 21 Math NaN
2 juli 39 14 40 38 Science 79.0
When I apply the above function, I got NaN in the other 2 entries for 'Final_Score' column, how do I get the following result without overwriting with NaN? Thanks!
Name Score English Science Math Favourite_Subject Final_Score
0 tom 31 50 30 20 English 81
1 nick 30 42 23 21 Math 51
2 juli 39 14 40 38 Science 79
We can look up each row's score for its Favourite_Subject by position, then add it to Current_Score to calculate Final_Score (this relies on the default 0..n-1 row index):
i = df.columns.get_indexer(df['Favourite_Subject'])
df['Final_Score'] = df['Current_Score'] + df.values[df.index, i]
Name Current_Score English Science Math Favourite_Subject Final_Score
0 tom 31 50 30 20 English 81
1 nick 30 42 23 21 Math 51
2 juli 39 14 40 38 Science 79
You do not need a loop, you can apply this directly to the dataframe:
import pandas as pd
new_data = [['tom', 31, 50, 30, 20, 'English'], ['nick', 30, 42, 23, 21, 'Math'], ['juli', 39, 14, 40, 38, 'Science']]
df = pd.DataFrame(new_data, columns = ['Name','Current_Score','English','Science','Math','Favourite_Subject'])
df['Final_Score'] = df.apply(lambda x: x['Current_Score'] + x[x['Favourite_Subject']], axis=1)
You can use .apply() with axis=1 and treat each row's Favourite_Subject value as a column label, which picks that row's score from the matching column. Then add the result to the Current_Score column:
df['Final_Score'] = df['Current_Score'] + df.apply(lambda x: x[x['Favourite_Subject']], axis=1)
Result:
print(df)
Name Current_Score English Science Math Favourite_Subject Final_Score
0 tom 31 50 30 20 English 81
1 nick 30 42 23 21 Math 51
2 juli 39 14 40 38 Science 79
Seems like you are overwriting the previous values during each loop which is why you only have the Final score for the final row when the loop ends.
Here is my implementation:
import pandas as pd
new_data = [['tom', 31, 50, 30, 20, 'English'], ['nick', 30, 42, 23, 21, 'Math'], ['juli', 39, 14, 40, 38, 'Science']]
df = pd.DataFrame(new_data, columns = ['Name','Current_Score','English','Science','Math','Favourite_Subject'])
favsubj = df['Favourite_Subject'].to_list()
final_scores = []
for i in range(len(df)):
    final_scores.append(df['Current_Score'].iloc[i] + df[favsubj[i]].iloc[i])
df['Final_Score'] = final_scores
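For comparison, the loop from the question can also be kept and repaired by assigning only to the masked rows with .loc, so each pass leaves the other rows untouched. A sketch using the question's data (the zero placeholder column is my own addition):

```python
import pandas as pd

new_data = [['tom', 31, 50, 30, 20, 'English'],
            ['nick', 30, 42, 23, 21, 'Math'],
            ['juli', 39, 14, 40, 38, 'Science']]
df = pd.DataFrame(new_data, columns=['Name', 'Current_Score', 'English',
                                     'Science', 'Math', 'Favourite_Subject'])

# Placeholder column (an assumption of this sketch); every row is filled in below.
df['Final_Score'] = 0

for subj in df['Favourite_Subject'].unique():
    mask = df['Favourite_Subject'] == subj
    # Write only the rows whose favourite subject is `subj`.
    df.loc[mask, 'Final_Score'] = df.loc[mask, 'Current_Score'] + df.loc[mask, subj]
```

This loops once per distinct subject rather than once per row, which matters when there are many rows but few subjects.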

Adding calculated row in Pandas

gender math score reading score writing score
female 65 73 74
male 69 66 64
Given the dataframe (see above) how can we add a line that would calculate the difference between the row values in the following way :
gender math score reading score writing score
female 65 73 74
male 69 66 64
Difference -4 7 10
Or is there a more convenient way of expressing the difference between the rows?
Thank you in advance
Let:
import pandas as pd
df = pd.DataFrame({"A":[5, 10], "B":[9, 8], "gender": ["female", "male"]}).set_index("gender")
df.loc['Difference'] = df.apply(lambda x: x["female"]-x["male"])
In a one-liner with .loc[] and .diff():
df.loc['Difference'] = df.diff(-1).dropna().values.tolist()[0]
Another idea would be to work with a transposed dataframe and then transpose it back:
import pandas as pd
df = pd.DataFrame({'gender':['male','female'],'math score':[65,69],'reading score':[73,66],'writing score':[74,64]}).set_index('gender')
df = df.T
df['Difference'] = df.diff(axis=1)['female'].values
df = df.T
Output:
math score reading score writing score
gender
male 65.0 73.0 74.0
female 69.0 66.0 64.0
Difference 4.0 -7.0 -10.0
You can calculate the diff by selecting each row and then subtracting. But as you've correctly guessed, that is not the best way to do this. A more convenient way would be to transpose the df and then do subtraction:
import pandas as pd
df = pd.DataFrame([[65, 73, 74], [69, 66, 64]],
index=['female', 'male'],
columns=['math score', 'reading score', 'writing score'])
df_ = df.T
df_['Difference'] = df_['female'] - df_['male']
This is what you get:
female male Difference
math score 65 69 -4
reading score 73 66 7
writing score 74 64 10
If you want, you can transpose it again with df_.T to return it to its initial form.
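The "select each row and subtract" route mentioned above can be sketched without any transpose, assuming gender is the index as in the frame just built:

```python
import pandas as pd

df = pd.DataFrame([[65, 73, 74], [69, 66, 64]],
                  index=['female', 'male'],
                  columns=['math score', 'reading score', 'writing score'])

# Subtract the two rows column by column and append the result as a new row.
df.loc['Difference'] = df.loc['female'] - df.loc['male']
```

This keeps the original orientation, with Difference as a third row.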

Python DataFrames concat or append problem

I have a problem with dataframes in Python. I am trying to copy certain rows to a new dataframe but I can't figure it out.
There are 2 arrays:
pokemon_data
# HP Attack Defense Sp. Atk Sp. Def Speed
0 1 45 49 49 65 65 45
1 2 60 62 63 80 80 60
2 3 80 82 83 100 100 80
3 4 80 100 123 122 120 80
4 5 39 52 43 60 50 65
... ... ... ... ... ... ... ...
795 796 50 100 150 100 150 50
796 797 50 160 110 160 110 110
797 798 80 110 60 150 130 70
798 799 80 160 60 170 130 80
799 800 80 110 120 130 90 70
800 rows × 7 columns
combats_data
First_pokemon Second_pokemon Winner
0 266 298 1
1 702 701 1
2 191 668 1
3 237 683 1
4 151 231 0
... ... ... ...
49995 707 126 0
49996 589 664 0
49997 303 368 1
49998 109 89 0
49999 9 73 0
50000 rows × 3 columns
I created a third dataset with these columns:
output1
HP0 Attack0 Defense0 Sp. Atk0 Sp. Def0 Speed0 HP1 Attack1 Defense1 Sp. Atk1 Sp. Def1 Speed1 Winner
What I'm trying to do is copy attributes from pokemon_data to output1 in the order given by combats_data.
HP0 and HP1 are respectively the HP of the first Pokemon and the HP of the second Pokemon.
I want to use that data in neural networks with TensorFlow to predict what Pokemon would win.
For this type of wrangling, you should first "melt" or "tidy" the combats_data so each ID has its own row, then do a "join" or "merge" of the two dataframes.
You didn't provide a minimal reproducible example, so here's mine:
import pandas as pd
df1 = pd.DataFrame({'id': [1,2,3,4,5],
'var1': [10,20,30,40,50],
'var2': [15,25,35,45,55]})
df2 = pd.DataFrame({'id1': [1,2],
'id2': [3,4],
'outcome': [1,4]})
df2tidy = pd.melt(df2, id_vars=['outcome'], value_vars=['id1', 'id2'],
var_name='name', value_name='id')
df2tidy
# outcome name id
# 0 1 id1 1
# 1 4 id1 2
# 2 1 id2 3
# 3 4 id2 4
output = pd.merge(df2tidy, df1, on='id')
output
# outcome name id var1 var2
# 0 1 id1 1 10 15
# 1 4 id1 2 20 25
# 2 1 id2 3 30 35
# 3 4 id2 4 40 45
You could then train some sort of classifier on outcome using this table.
(Btw, you should make outcome a 0 or 1 (for pokemon1 vs pokemon2) instead of the actual ID of the winner.)
So I would like to create a new array based on these two arrays. For example:
#ids represent pokemons and their attributes
pokemons = pd.DataFrame({'id': [1,2,3,4,5],
'HP': [10,20,30,40,50],
'Attack': [15,25,35,45,55],
'Defense' : [25,15,45,15,35]})
#here 0 or 1 represents whether first or second pokemon won
combats = pd.DataFrame({'id1': [1,2],
'id2': [3,4],
'winner': [0,1]})
#in output data i want to replace ids with attributes, the order is based on combats array
output = pd.DataFrame({'HP1': [10,20],
'Attack1': [15,25],
'Defense1': [25,15],
'HP2': [30,40],
'Attack2': [35,45],
'Defense2': [45,15],
'winner': [0,1]})
I'm not sure if this is the right way to think about it. I want to train a neural network to figure out which pokemon will win.
This is the solution from user part on the 4programmers.net forum.
import pandas as pd
if __name__ == "__main__":
    pokemon_data = pd.DataFrame({
        "Id": [1, 2, 3, 4, 5],
        "HP": [45, 60, 80, 80, 39],
        "Attack": [49, 62, 82, 100, 52],
        "Defense": [49, 63, 83, 123, 43],
        "Sp. Atk": [65, 80, 100, 122, 60],
        "Sp. Def": [65, 80, 100, 120, 50],
        "Speed": [45, 60, 80, 80, 65]})
    combats_data = pd.DataFrame({
        "First_pokemon": [1, 2, 3],
        "Second_pokemon": [2, 3, 4],
        "Winner": [1, 0, 1]})
    output = pokemon_data.merge(combats_data, left_on="Id", right_on="First_pokemon")
    output = output.merge(pokemon_data, left_on="Second_pokemon", right_on="Id",
                          suffixes=("_pokemon1", "_pokemon2"))
    print(output)
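The same two-merge idea can be applied to the small example from the question to get the HP1 ... Defense2 layout directly. This is only a sketch: it assumes the 'Defense' spelling for the third attribute, and the names first and second are mine.

```python
import pandas as pd

pokemons = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                         'HP': [10, 20, 30, 40, 50],
                         'Attack': [15, 25, 35, 45, 55],
                         'Defense': [25, 15, 45, 15, 35]})
combats = pd.DataFrame({'id1': [1, 2], 'id2': [3, 4], 'winner': [0, 1]})

# Suffix every column so the two copies of the attributes stay distinct:
# id -> id1, HP -> HP1, ... and id -> id2, HP -> HP2, ...
first = pokemons.add_suffix('1')
second = pokemons.add_suffix('2')

# Left merges keep the rows in the order of `combats`.
output = (combats.merge(first, on='id1', how='left')
                 .merge(second, on='id2', how='left')
                 .drop(columns=['id1', 'id2']))

# Move `winner` to the end, as in the desired layout.
output = output[[c for c in output.columns if c != 'winner'] + ['winner']]
```

Renaming before merging avoids relying on the suffixes argument, which only renames columns that actually collide.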

Reformatting pandas table - do I want a pivot?

I'm sure this is quite simple, but my brain is frozen and there are so many different pivot and transpose methods. A hint would be nice at this stage.
I have this dataframe:
I want this:
I know how to get to here, if that helped, but I'm not sure if it does
FYI - The actual data has more columns and I need to separate out these four based on the "site" column, reformat everything, calculate some percentages, put the pieces back together, and eventually end up with something like this:
I'm hoping that if I can get on the right track for reformatting part of the data, I can repeat the process...
(then I need to figure out how to run a Chi-square test, but that's for later... :-(
The easiest solution is df.stack:
import pandas as pd
df = pd.DataFrame({'MIC-m': [138, 3, 22, 45],
                   'MIC-t': [34, 90, 30, 53],
                   'MIC-q': [73, 13, 53, 68],
                   'Total': [229, 229, 229, 229]}, index=['H', 'L', 'M', 'X'])
# Drop Total, because we need the sum of each column, not of each row
df = df.drop(columns='Total')
# Add the column sums as a 'Total' row (DataFrame.append was removed in pandas 2.0)
df.loc['Total'] = df.sum()
# Get final result
df = df.T.stack().to_frame('count')
yields:
count
MIC-m H 138
L 3
M 22
X 45
Total 208
MIC-t H 34
L 90
M 30
X 53
Total 207
MIC-q H 73
L 13
M 53
X 68
Total 207

Create continuous graph using matplotlib and pandas

I have a dataframe like this.
column1 column2 column3
MyIndexes
7 22 90 98
8 50 06 56
23 60 58 44
49 30 62 00
I am using df.plot to plot a line chart. The problem is that using df.plot() treats the index as categorical data and plots graph for each of them (7, 8, 23 and 49). However I want these to be treated as numeric values and have a graph with even xticks and then plot these points into the graph. How will I be able to do that?
When I construct the dataframe as such:
import pandas as pd
df = pd.DataFrame([[22, 90, 98],
                   [50, 6, 56],
                   [60, 58, 44],
                   [30, 62, 0]],
                  index=pd.Index([7, 8, 23, 49], name='MyIndexes'),
                  columns=['column1', 'column2', 'column3'])
print(df)
           column1  column2  column3
MyIndexes
7               22       90       98
8               50        6       56
23              60       58       44
49              30       62        0
then plot:
df.plot()
I suspect your index is not what you think it is.
To force your index to be integers do:
df.index = df.index.astype(int)
df.plot()
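One way to check that suspicion is to inspect the index dtype before and after the conversion. A sketch, assuming the index was read in as strings (a guess, since the question does not show how the frame was built):

```python
import pandas as pd

# Assumption: the index holds strings, e.g. after parsing a text file.
df = pd.DataFrame({'column1': [22, 50, 60, 30]},
                  index=pd.Index(['7', '8', '23', '49'], name='MyIndexes'))

was_numeric = pd.api.types.is_numeric_dtype(df.index)  # strings are not numeric

df.index = df.index.astype(int)
is_numeric = pd.api.types.is_numeric_dtype(df.index)   # integers are numeric

# With a numeric index, df.plot() should place the points at x = 7, 8, 23, 49
# rather than at evenly spaced categorical positions.
```

If was_numeric is already True for your real data, the cause is probably elsewhere and the astype(int) call will not change the plot.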
