gender math score reading score writing score
female 65 73 74
male 69 66 64
Given the dataframe (see above) how can we add a line that would calculate the difference between the row values in the following way :
gender math score reading score writing score
female 65 73 74
male 69 66 64
Difference -4 7 10
Or is there a more convenient way of expressing the difference between the rows?
Thank you in advance
Let's say:
df = pd.DataFrame({"A":[5, 10], "B":[9, 8], "gender": ["female", "male"]}).set_index("gender")
df.loc['Difference'] = df.apply(lambda x: x["female"]-x["male"])
In a one-liner with .loc[] and .diff():
df.loc['Difference'] = df.diff(-1).dropna().values.tolist()[0]
Another idea would be to work with a transposed dataframe and then transpose it back:
import pandas as pd
df = pd.DataFrame({'gender':['male','female'],'math score':[65,69],'reading score':[73,66],'writing score':[74,64]}).set_index('gender')
df = df.T
df['Difference'] = df.diff(axis=1)['female'].values
df = df.T
Output:
math score reading score writing score
gender
male 65.0 73.0 74.0
female 69.0 66.0 64.0
Difference 4.0 -7.0 -10.0
You can calculate the diff by selecting each row and then subtracting. But as you've correctly guessed, that is not the best way to do this. A more convenient way would be to transpose the df and then do subtraction:
import pandas as pd
df = pd.DataFrame([[65, 73, 74], [69, 66, 64]],
index=['female', 'male'],
columns=['math score', 'reading score', 'writing score'])
df_ = df.T
df_['Difference'] = df_['female'] - df_['male']
This is what you get:
female male Difference
math score 65 69 -4
reading score 73 66 7
writing score 74 64 10
If you want, you can transpose it again with df_.T to return it to its initial form.
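For comparison, the "select each row and subtract" approach mentioned above can be sketched without any transposing. This is only a minimal example on the question's data, assuming gender is the index:

```python
import pandas as pd

df = pd.DataFrame(
    [[65, 73, 74], [69, 66, 64]],
    index=["female", "male"],
    columns=["math score", "reading score", "writing score"],
)

# Subtract the two rows label-wise and store the result as a new row.
df.loc["Difference"] = df.loc["female"] - df.loc["male"]
print(df)
```

Because the subtraction is aligned on column names, the column order does not matter.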
My df looks as follows:
Roll Name Age Physics English Maths
0 A1 Max 16 87 79 90
1 A2 Lisa 15 47 75 60
2 A3 Luna 17 83 49 95
3 A4 Ron 16 86 79 93
4 A5 Silvia 15 57 99 91
I'd like to add the columns Physics, English, and Maths and display the results in a separate column 'Grade'.
I've tried the code:
df['Physics'] + df['English'] + df['Maths']
But it just concatenates. I haven't been taught lambda functions yet.
How do I go about this?
df['Grade'] = df['Physics'] + df['English'] + df['Maths']
If it concatenates, your data is probably stored as strings; convert the columns to float or integer first.
Check the data types first with df.dtypes.
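A small sketch of that check and conversion, assuming the marks were read in as text. The two sample rows are taken from the question, and pd.to_numeric is just one way to convert:

```python
import pandas as pd

# Marks stored as strings, which is what makes "+" concatenate.
df = pd.DataFrame({
    "Physics": ["87", "47"],
    "English": ["79", "75"],
    "Maths": ["90", "60"],
})
print(df.dtypes)  # a text dtype here means the columns are not numeric

cols = ["Physics", "English", "Maths"]
# Convert every subject column to a numeric dtype.
df[cols] = df[cols].apply(pd.to_numeric)

# Row-wise sum of the three subjects.
df["Grade"] = df[cols].sum(axis=1)
print(df)
```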
Try:
df["total"] = df[["Physics", "English", "Maths"]].sum(axis=1)
df
Check the code below. It is possible your columns are in string format; the following will solve that:
import pandas as pd
df = pd.DataFrame({"Physics":['1','2','3'],"English":['1','2','3'],"Maths":['1','2','3']})
df['Total'] = df['Physics'].astype('int') +df['English'].astype('int') +df['Maths'].astype('int')
df
How to find sum of values for different index level in multi level index table, and represent it in the indexes as a summation row
For example
Gender Age Marks
M. 20. 30
45
22. 46
33
F. 20. 44
46
22. 42
31
In this dataframe, how can I find the sum for F. and age 20, and show it as a row below the age-20 marks?
For example: sum 90
I am not sure if I have understood you correctly, but it seems that you want to group by Gender and Age:
import pandas as pd
df = pd.DataFrame({
"Gender": ['M.','M.','M.','M.','F.','F.','F.','F.'],
"Age":[20.,20.,22.,22.,20,20,22,22],
'Marks':[30,45,46,33,44,46,42,31] })
df.groupby(['Gender','Age'])['Marks'].sum()
Result:
Gender Age
F. 20.0 90
22.0 73
M. 20.0 75
22.0 79
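If the sum should also appear as its own row under each Gender/Age group, as the question asks, one possible sketch is to concatenate the group sums back onto the original rows. The extra "Row" index level and its "sum" label are made up here for illustration, and the gender labels are simplified to "M" and "F":

```python
import pandas as pd

df = pd.DataFrame({
    "Gender": ["M", "M", "M", "M", "F", "F", "F", "F"],
    "Age": [20, 20, 22, 22, 20, 20, 22, 22],
    "Marks": [30, 45, 46, 33, 44, 46, 42, 31],
})

# One sum per (Gender, Age) group, labelled so it can sit beside the marks.
sums = df.groupby(["Gender", "Age"], as_index=False)["Marks"].sum()
sums["Row"] = "sum"

# Number the original rows within each group: "0", "1", ...
marks = df.assign(Row=df.groupby(["Gender", "Age"]).cumcount().astype(str))

# Stack both pieces; sorting puts "sum" after the numbered rows of each group.
combined = (
    pd.concat([marks, sums])
    .set_index(["Gender", "Age", "Row"])
    .sort_index()
)
print(combined)
```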
I have a 4x4 dataframe (df). I created two child dataframes (4x1), (4x2). And updated both. In first case, the parent is updated, in second, it is not. How to ensure that the parent dataframe is updated when child dataframe is updated?
I have a 4x4 dataframe (df). From this parent I created two child dataframes: dfA with a single column (4x1) and dfB with two columns (4x2). Both subsets contain NaN values. When I call fillna on dfA and dfB, the NaN values in each are replaced with the given value, as expected. However, when I then check the parent dataframe, the updated values show up in the first case (4x1) but not in the second (4x2). Why is that, and what should I do so that changes to a child dataframe are reflected in the parent?
studentnames = ['Maths','English','Soc.Sci', 'Hindi', 'Science']
semisteronemarks = [15, 50, np.NaN, 50, np.NaN]
semistertwomarks = [25, 53, 45, 45, 54]
semisterthreemarks = [20, 50, 45, 15, 38]
semisterfourmarks = [26, 33, np.NaN, 35, 34]
semisters = ['Rakesh','Rohit', 'Sam', 'Sunil']
df1 = pd.DataFrame([semisteronemarks,semistertwomarks,semisterthreemarks,semisterfourmarks],semisters, studentnames)
# case 1
dfA = df['Soc.Sci']
dfA.fillna(value = 98, inplace = True)
print(dfA)
print(df)
# case 2
dfB = df[['Soc.Sci', 'Science']]
dfB.fillna(value = 99, inplace = True)
print(dfB)
print(df)
'''
## contents of parent df ->>
## Actual Output -
# case 1
Maths English Soc.Sci Hindi Science
Rakesh 15 50 98.0 50 NaN
Rohit 25 53 45.0 45 54.0
Sam 20 50 45.0 15 38.0
Sunil 26 33 98.0 35 34.0
# case 2
Maths English Soc.Sci Hindi Science
Rakesh 15 50 NaN 50 NaN
Rohit 25 53 45.0 45 54.0
Sam 20 50 45.0 15 38.0
Sunil 26 33 NaN 35 34.0
## Expected Output -
# case 1
Maths English Soc.Sci Hindi Science
Rakesh 15 50 98.0 50 NaN
Rohit 25 53 45.0 45 54.0
Sam 20 50 45.0 15 38.0
Sunil 26 33 98.0 35 34.0
# case 2
Maths English Soc.Sci Hindi Science
Rakesh 15 50 99.0 50 NaN
Rohit 25 53 45.0 45 54.0
Sam 20 50 45.0 15 38.0
Sunil 26 33 99.0 35 34.0
# note the difference in output for column Soc.Sci in case 2.
In your code, df1 is defined but df is not.
Keeping the approach you are using, assign the result back to the parent:
# case 1
dfA = df1['Soc.Sci'] # changed df to df1
dfA.fillna(value = 98, inplace = True)
df1['Soc.Sci'] = dfA # Because dfA is not a dataframe but a series
# To write df1['Soc.Sci'] = dfA['Soc.Sci'] instead, dfA has to be a
# DataFrame rather than a Series; double brackets give you one:
# dfA = df1[['Soc.Sci']]
# case 2
dfB = df1[['Soc.Sci', 'Science']] # changed df to df1
dfB.fillna(value = 99, inplace = True)
df1[['Soc.Sci','Science']] = dfB[['Soc.Sci','Science']]
print(df1)
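I would suggest just using fillna on the parent df and assigning the result back, since an in-place fillna on a selected column is not guaranteed to reach the parent:
df1['Soc.Sci'] = df1['Soc.Sci'].fillna(value=99)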
You should have seen a warning:
Warning (from warnings module):
...
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
It means that dfB may be a copy instead of a view, and according to the results it is. There is little that can be done here; in particular, you cannot force pandas to return a view. Whether you get a view or a copy depends on internals known only to pandas and its developers.
But it is always possible to assign to the columns of the parent DataFrame:
# case 2
df = pd.DataFrame([semisteronemarks,semistertwomarks,semisterthreemarks,semisterfourmarks],semisters, studentnames)
df[['Soc.Sci', 'Science']] = df[['Soc.Sci', 'Science']].fillna(value = 99)
print(df)
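As a variation on assigning back to the parent, fillna also accepts a dict that maps column names to fill values, so the parent can be filled without creating a child dataframe at all. A minimal sketch on the question's data; the variable names subjects and students are mine:

```python
import numpy as np
import pandas as pd

subjects = ["Maths", "English", "Soc.Sci", "Hindi", "Science"]
students = ["Rakesh", "Rohit", "Sam", "Sunil"]
df = pd.DataFrame(
    [[15, 50, np.nan, 50, np.nan],
     [25, 53, 45, 45, 54],
     [20, 50, 45, 15, 38],
     [26, 33, np.nan, 35, 34]],
    index=students,
    columns=subjects,
)

# Fill only the listed columns; fillna returns a new DataFrame,
# so the result is bound to a name instead of relying on inplace.
filled = df.fillna({"Soc.Sci": 99, "Science": 99})
print(filled)
```

Rebinding with df = df.fillna(...) would have the same effect as updating the parent.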
When creating a dataframe as below (instructions from here), the order of the columns changes from "Day, Visitors, Bounce Rate" to "Bounce Rate, Day, Visitors"
import pandas as pd
web_stats = {'Day':[1,2,3,4,5,6],
'Visitors':[43,34,65,56,29,76],
'Bounce Rate':[65,67,78,65,45,52]}
df = pd.DataFrame(web_stats)
Gives:
Bounce Rate Day Visitors
0 65 1 43
1 67 2 34
2 78 3 65
3 65 4 56
4 45 5 29
5 52 6 76
How can the order be kept intact? (i.e. Day, Visitors, Bounce Rate)
One approach is to pass the columns argument explicitly.
For example:
import pandas as pd
web_stats = {'Day':[1,2,3,4,5,6],
'Visitors':[43,34,65,56,29,76],
'Bounce Rate':[65,67,78,65,45,52]}
df = pd.DataFrame(web_stats, columns = ['Day', 'Visitors', 'Bounce Rate'])
print(df)
Output:
Day Visitors Bounce Rate
0 1 43 65
1 2 34 67
2 3 65 78
3 4 56 65
4 5 29 45
5 6 76 52
Dictionaries are not considered to be ordered in Python <3.7.
You can use collections.OrderedDict instead:
from collections import OrderedDict
web_stats = OrderedDict([('Day', [1,2,3,4,5,6]),
('Visitors', [43,34,65,56,29,76]),
('Bounce Rate', [65,67,78,65,45,52])])
df = pd.DataFrame(web_stats)
If you don't want to write out the column names, which becomes really inconvenient when you have many keys, you can use:
df = pd.DataFrame(web_stats, columns = web_stats.keys())
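For what it's worth, on Python 3.7+ with a reasonably recent pandas (0.23 or later, as far as I know), a plain dict keeps its insertion order and the DataFrame constructor respects it, so the original code may already give the expected column order. A quick check:

```python
import pandas as pd

web_stats = {
    "Day": [1, 2, 3, 4, 5, 6],
    "Visitors": [43, 34, 65, 56, 29, 76],
    "Bounce Rate": [65, 67, 78, 65, 45, 52],
}

df = pd.DataFrame(web_stats)

# On Python 3.7+ the columns follow the dict's insertion order.
print(list(df.columns))
```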
I've got a pandas dataframe of golfers' round scores going back to 2003 (approx 300000 rows). It looks something like this:
Date----Golfer---Tournament-----Score---Player Total Rounds Played
2008-01-01---Tiger Woods----Invented Tournament R1---72---50
2008-01-01---Phil Mickelson----Invented Tournament R1---73---108
I want the 'Player Total Rounds Played' column to be a running total of the number of rounds (i.e. instance in the dataframe) that a player has played up to that date. Is there a quick way of doing it? My current solution (basically using iterrows and then a one-line function) works fine but will take approx 11hrs to run.
Thanks,
Tom
Here is one way:
df = df.sort_values('Date')
df['Rounds CumSum'] = df.groupby('Golfer')['Rounds'].cumsum()
For example:
import pandas as pd
df = pd.DataFrame([['A', 70, 50],
['B', 72, 55],
['A', 73, 45],
['A', 71, 60],
['B', 74, 55],
['A', 72, 65]],
columns=['Golfer', 'Rounds', 'Played'])
df['Rounds CumSum'] = df.groupby('Golfer')['Rounds'].cumsum()
# Golfer Rounds Played Rounds CumSum
# 0 A 70 50 70
# 1 B 72 55 72
# 2 A 73 45 143
# 3 A 71 60 214
# 4 B 74 55 146
# 5 A 72 65 286
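Since the question asks for a running count of rounds rather than a running sum of a column, groupby(...).cumcount() may be a closer fit. A minimal sketch with invented rows; the column names follow the question's sample, and it assumes each row is one round:

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2008-01-01", "2008-01-01", "2008-01-02", "2008-01-03", "2008-01-03"]
    ),
    "Golfer": ["Tiger Woods", "Phil Mickelson", "Tiger Woods",
               "Tiger Woods", "Phil Mickelson"],
    "Score": [72, 73, 70, 71, 69],
})

# Sort chronologically first; a stable sort keeps same-day rows in their original order.
df = df.sort_values("Date", kind="mergesort")

# cumcount() numbers each golfer's rows 0, 1, 2, ...; add 1 to count the current round.
df["Player Total Rounds Played"] = df.groupby("Golfer").cumcount() + 1
print(df)
```

This is vectorised, so it should avoid the per-row cost of iterrows on a few hundred thousand rows.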