Sum rows of grouped data frame based on a specific column - python

I have a data frame and would like to create a new column, df["NEW_Salary"], from the sum of different rows within each group.
After grouping df by the columns Year, Month and Day, I want, for each group, to add the Salary of the rows where Combination is True to the matching rows where Combination is False.
import pandas as pd
data = {"Year":[2002,2002,2002,2002,2002,2010,2010,2010,2010,2010],
"Name":['Jason','Tom','KimJason','KimTom','Kim','Jason','Tom','KimJason','KimTom','Kim'],
"Combination":[False,False,True,True,False,False,False,True,True,False],
"Salary":[10,20,25,25,30,20,30,35,35,40]
}
df = pd.DataFrame(data)
Year Month Day Name Combination Salary
0 2002 1 15 Jason False 10
1 2002 1 15 Tom False 20
2 2002 1 15 KimJason True 25
3 2002 1 15 KimTom True 25
4 2002 1 15 Kim False 30
5 2010 3 20 Jason False 20
6 2010 3 20 Tom False 30
7 2010 3 20 KimJason True 35
8 2010 3 20 KimTom True 35
9 2010 3 20 Kim False 40
10 2002 4 5 Mary False 10
11 2002 4 5 MaryTom True 20
12 2002 4 5 Tom False 30
df["NEW_Salary"] would be created as follows:
The Salary of the row where Name is KimJason would be added to the Salary of the rows where Name is Kim and Jason.
The Salary of the row where Name is KimTom would likewise be added to the Salary of the rows where Name is Kim and Tom.
The KimTom and KimJason rows would have the same value in the new column NEW_Salary as in Salary.
The expected output:
Year Month Day Name Combination Salary NEW_Salary
0 2002 1 15 Jason False 10 35
1 2002 1 15 Tom False 20 45
2 2002 1 15 KimJason True 25 25
3 2002 1 15 KimTom True 25 25
4 2002 1 15 Kim False 30 80
5 2010 3 20 Jason False 20 55
6 2010 3 20 Tom False 30 65
7 2010 3 20 KimJason True 35 35
8 2010 3 20 KimTom True 35 35
9 2010 3 20 Kim False 40 110
10 2002 4 5 Mary False 10 30
11 2002 4 5 MaryTom True 20 20
12 2002 4 5 Tom False 30 50
Is there an easy way to achieve this output, no matter how many groups I have?

Here's how you can do it. As far as I can tell, it should also work for any number of groups.
First, extract all rows with combination names into a dictionary:
lookup = dict(tuple(df.loc[df['Combination']==True].groupby('Name')[['Year', 'Salary']]))
for key,value in lookup.items():
print(f"{key=}:\n{value}")
which looks like this (the value for each key is a DataFrame):
key='KimJason':
Year Salary
2 2002 25
7 2010 35
key='KimTom':
Year Salary
3 2002 25
8 2010 35
Then filter for the rows where Combination is False and, row by row, take the value of Salary and add every value found for that name and year in the lookup dictionary. Finally, update the df with the new Salary values.
s = (df.loc[df['Combination']==False]
.apply(lambda row:
row['Salary'] + sum(lookup[x].loc[lookup[x]['Year']==row['Year'], 'Salary'].squeeze()
for x in lookup
if row['Name'] in x)
,axis=1)
)
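# assign through .loc; df['Salary'].update(s) does not modify df when Copy-on-Write is enabled
df.loc[s.index, 'Salary'] = s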
print(df)
Output df:
Year Name Combination Salary
0 2002 Jason False 35
1 2002 Tom False 45
2 2002 KimJason True 25
3 2002 KimTom True 25
4 2002 Kim False 80
5 2010 Jason False 55
6 2010 Tom False 65
7 2010 KimJason True 35
8 2010 KimTom True 35
9 2010 Kim False 110
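The answer above matches on Year only and overwrites Salary. As a minimal sketch of the same idea applied to the question's full 13-row table, the code below groups by Year, Month and Day and writes the result to NEW_Salary. The helper name add_combination_salary is hypothetical, and the substring match on Name is an assumption carried over from the answer (a name such as "Tom" would also match a combination like "TomasKim").

```python
import pandas as pd

# The 13-row table from the question, including Month and Day
df = pd.DataFrame({
    "Year": [2002] * 5 + [2010] * 5 + [2002] * 3,
    "Month": [1] * 5 + [3] * 5 + [4] * 3,
    "Day": [15] * 5 + [20] * 5 + [5] * 3,
    "Name": ["Jason", "Tom", "KimJason", "KimTom", "Kim",
             "Jason", "Tom", "KimJason", "KimTom", "Kim",
             "Mary", "MaryTom", "Tom"],
    "Combination": [False, False, True, True, False,
                    False, False, True, True, False,
                    False, True, False],
    "Salary": [10, 20, 25, 25, 30, 20, 30, 35, 35, 40, 10, 20, 30],
})


def add_combination_salary(group):
    """Hypothetical helper: return the new salary for every row of one group."""
    combos = group[group["Combination"]]

    def total(row):
        if row["Combination"]:
            return row["Salary"]  # combination rows keep their own salary
        # add the salary of every combination row whose name contains this name
        hits = combos["Name"].str.contains(row["Name"], regex=False)
        return row["Salary"] + combos.loc[hits, "Salary"].sum()

    return group.apply(total, axis=1)


parts = [add_combination_salary(g) for _, g in df.groupby(["Year", "Month", "Day"])]
df["NEW_Salary"] = pd.concat(parts)  # aligned back to df by index
print(df)
```

Because each group is processed separately, the combination rows of one date never leak into another date, however many groups there are.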

Related

How to unpack a list of tuples of varying length in a pandas DataFrame?

ID LIST_OF_TUPLE (2col)
1 [('2012','12'), ('2012','33'), ('2014', '82')]
2 NA
3 [('2012','12')]
4 [('2012','12'), ('2012','33'), ('2014', '82'), ('2022', '67')]
Result:
ID TUP_1 TUP_2(3col)
1 2012 12
1 2012 33
1 2014 82
3 2012 12
4 2012 12
4 2012 33
4 2014 82
4 2022 67
Thanks in advance.
This can be done by exploding the column, building a DataFrame from the result, and joining it back:
s = df['LIST_OF_TUPLE'].explode()
out = (df[['ID']].join(pd.DataFrame(s.tolist(),index=s.index)
.add_prefix("TUP_")).reset_index(drop=True)) #you can chain a dropna if reqd
print(out)
ID TUP_0 TUP_1
0 1 2012 12
1 1 2012 33
2 1 2014 82
3 2 NaN None
4 3 2012 12
5 4 2012 12
6 4 2012 33
7 4 2014 82
8 4 2022 67
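The question's expected result leaves out ID 2, whose list is missing. A minimal sketch of one way to get there, assuming the missing entry is stored as NaN and that every tuple has exactly two elements, is to drop the empty rows right after exploding:

```python
import numpy as np
import pandas as pd

# Sample frame rebuilt from the question; NaN stands in for "NA" (an assumption)
df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "LIST_OF_TUPLE": [
        [("2012", "12"), ("2012", "33"), ("2014", "82")],
        np.nan,
        [("2012", "12")],
        [("2012", "12"), ("2012", "33"), ("2014", "82"), ("2022", "67")],
    ],
})

# One row per tuple; rows that had no list are dropped
exploded = df.explode("LIST_OF_TUPLE").dropna(subset=["LIST_OF_TUPLE"])

# Split each tuple into two columns and put the matching ID in front
out = pd.DataFrame(exploded["LIST_OF_TUPLE"].tolist(), columns=["TUP_1", "TUP_2"])
out.insert(0, "ID", exploded["ID"].to_numpy())
print(out)
```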

Replace last value(s) of group with NaN

My goal is to replace the last value (or the last several values) of each id with NaN. My real dataset is quite large and has groups of different sizes.
Example:
import pandas as pd
ids = [1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3]
year = [2000,2001,2002,2003,2004,2005,1990,1991,1992,1993,1994,1995,2010,2011,2012,2013,2014,2015]
percent = [120,70,37,40,50,110,140,100,90,5,52,80,60,40,70,60,50,110]
dictex ={"id":ids,"year":year,"percent [%]": percent}
dfex = pd.DataFrame(dictex)
print(dfex)
id year percent [%]
0 1 2000 120
1 1 2001 70
2 1 2002 37
3 1 2003 40
4 1 2004 50
5 1 2005 110
6 2 1990 140
7 2 1991 100
8 2 1992 90
9 2 1993 5
10 2 1994 52
11 2 1995 80
12 3 2010 60
13 3 2011 40
14 3 2012 70
15 3 2013 60
16 3 2014 50
17 3 2015 110
My goal is to replace the last 1, 2 or 3 values of the "percent [%]" column for each id (group) with NaN.
The result should look like this: (here: replace the last 2 values of each id)
id year percent [%]
0 1 2000 120
1 1 2001 70
2 1 2002 37
3 1 2003 40
4 1 2004 NaN
5 1 2005 NaN
6 2 1990 140
7 2 1991 100
8 2 1992 90
9 2 1993 5
10 2 1994 NaN
11 2 1995 NaN
12 3 2010 60
13 3 2011 40
14 3 2012 70
15 3 2013 60
16 3 2014 NaN
17 3 2015 NaN
I know there should be a relatively easy solution for this, but I'm new to Python and simply haven't been able to figure out an elegant way.
Thanks for the help!
Try using groupby, tail and index to find the index of the rows to modify, then use loc to change the values (df here stands for the question's dfex):
import numpy as np
nrows = 2
idx = df.groupby('id').tail(nrows).index
df.loc[idx, 'percent [%]'] = np.nan
#output
id year percent [%]
0 1 2000 120.0
1 1 2001 70.0
2 1 2002 37.0
3 1 2003 40.0
4 1 2004 NaN
5 1 2005 NaN
6 2 1990 140.0
7 2 1991 100.0
8 2 1992 90.0
9 2 1993 5.0
10 2 1994 NaN
11 2 1995 NaN
12 3 2010 60.0
13 3 2011 40.0
14 3 2012 70.0
15 3 2013 60.0
16 3 2014 NaN
17 3 2015 NaN
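An alternative sketch, not part of the answer above, counts each row's distance from the end of its group and masks the rows that fall within the last n. The shortened sample data and the variable names are illustrative assumptions:

```python
import pandas as pd

# Shortened sample with groups of different sizes
dfex = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2, 2],
    "percent [%]": [120, 70, 37, 140, 100, 90, 5],
})

n = 2  # how many trailing values per id to blank out

# 0 for the last row of each group, 1 for the one before it, and so on
from_end = dfex.groupby("id").cumcount(ascending=False)

# mask() replaces the flagged rows with NaN (the column becomes float)
dfex["percent [%]"] = dfex["percent [%]"].mask(from_end < n)
print(dfex)
```

Unlike the tail-based version, this does not depend on the index labels being unique, which may matter if the frame was built by concatenating several others.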

How can I get this series to a pandas dataframe?

I have some data and after using a groupby function I now have a series that looks like this:
year
1997 15
1998 22
1999 24
2000 24
2001 28
2002 11
2003 15
2004 19
2005 10
2006 10
2007 21
2008 26
2009 23
2010 16
2011 33
2012 19
2013 26
2014 25
How can I create a pandas dataframe from here with year as one column and the other column named sightings ?
I am a pandas novice so don't really know what I am doing. I have tried the reindex and unstack functions but haven't been able to get what I want...
You can use reset_index and rename columns:
print (df.reset_index())
index year
0 1997 15
1 1998 22
2 1999 24
3 2000 24
4 2001 28
5 2002 11
6 2003 15
7 2004 19
8 2005 10
9 2006 10
10 2007 21
11 2008 26
12 2009 23
13 2010 16
14 2011 33
15 2012 19
16 2013 26
17 2014 25
print (df.reset_index().rename(columns=({'index':'year','year':'sightings'})))
year sightings
0 1997 15
1 1998 22
2 1999 24
3 2000 24
4 2001 28
5 2002 11
6 2003 15
7 2004 19
8 2005 10
9 2006 10
10 2007 21
11 2008 26
12 2009 23
13 2010 16
14 2011 33
15 2012 19
16 2013 26
17 2014 25
Another solution is to set the column names from a list of names:
df1 = df.reset_index()
df1.columns = ['year','sightings']
print (df1)
year sightings
0 1997 15
1 1998 22
2 1999 24
3 2000 24
4 2001 28
5 2002 11
6 2003 15
7 2004 19
8 2005 10
9 2006 10
10 2007 21
11 2008 26
12 2009 23
13 2010 16
14 2011 33
15 2012 19
16 2013 26
17 2014 25
EDIT:
Sometimes it helps to add the parameter as_index=False to groupby so that it returns a DataFrame:
import pandas as pd
df = pd.DataFrame({'A':[1,1,3],
'B':[4,5,6]})
print (df)
A B
0 1 4
1 1 5
2 3 6
print (df.groupby('A')['B'].sum())
A
1 9
3 6
Name: B, dtype: int64
print (df.groupby('A', as_index=False)['B'].sum())
A B
0 1 9
1 3 6
s.rename('sightings').reset_index()
I've also used this method during the groupby stage to put the results straight into a dataframe:
df2 = df1.groupby(['Year']).count()
df3 = pd.DataFrame(df2).reset_index()
If your original dataframe, df1, had "Year" and "Sightings" as its two columns, then df3 should have each year listed under "Year" and the count (or sum, average, whatever) listed under "Sightings".
If not, you can change the column names by doing the following:
df3.columns = ['Year','Sightings']
or
df3 = df3.rename(columns={'oldname_A': 'Year', 'oldname_B': 'Sightings'})
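For the original question, Series.reset_index also accepts a name argument that labels the value column in one step. A minimal sketch, assuming the Series index is already named year as in the question's printout (only the first three rows are reproduced here):

```python
import pandas as pd

# First rows of the question's Series; the index carries the name "year"
s = pd.Series([15, 22, 24], index=pd.Index([1997, 1998, 1999], name="year"))

# The index becomes the "year" column, the values become "sightings"
df = s.reset_index(name="sightings")
print(df)
```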

pandas: conditionally return a column's value

I am trying to make a new column called 'wage_rate' that fills in the appropriate wage rate for the employee based on the year of the observation.
In other words, my list looks something like this:
eecode year w2011 w2012 w2013
1 2012 7 8 9
1 2013 7 8 9
2 2011 20 25 25
2 2012 20 25 25
2 2013 20 25 25
And I want to return, in a new column, 8 for the first row, 9 for the second, then 20, 25 and 25.
One way would be to use apply, constructing the column name for each row from the year, like 'w' + str(x.year).
In [41]: df.apply(lambda x: x['w' + str(x.year)], axis=1)
Out[41]:
0 8
1 9
2 20
3 25
4 25
dtype: int64
Details:
In [42]: df
Out[42]:
eecode year w2011 w2012 w2013
0 1 2012 7 8 9
1 1 2013 7 8 9
2 2 2011 20 25 25
3 2 2012 20 25 25
4 2 2013 20 25 25
In [43]: df['wage_rate'] = df.apply(lambda x: x['w' + str(x.year)], axis=1)
In [44]: df
Out[44]:
eecode year w2011 w2012 w2013 wage_rate
0 1 2012 7 8 9 8
1 1 2013 7 8 9 9
2 2 2011 20 25 25 20
3 2 2012 20 25 25 25
4 2 2013 20 25 25 25
values = [ row['w%s'% row['year']] for key, row in df.iterrows() ]
df['wage_rate'] = values # create the new columns
This solution uses an explicit loop, so it is likely slower than pure-pandas solutions, but on the other hand it is simple and readable.
You can rename the wage columns to match the values in the year column using re.sub:
In [70]:
import re
df.columns = [re.sub(r'w(?=\d{4}$)', '', column) for column in df.columns]
In [80]:
df.columns
Out[80]:
Index([u'eecode', u'year', u'2011', u'2012', u'2013', u'wage_rate'], dtype='object')
Then get the value using the following:
df['wage_rate'] = df.apply(lambda x : x[str(x.year)] , axis = 1)
Out[79]:
eecode year 2011 2012 2013 wage_rate
1 2012 7 8 9 8
1 2013 7 8 9 9
2 2011 20 25 25 20
2 2012 20 25 25 25
2 2013 20 25 25 25
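The row-wise approaches above can also be written without apply. The following is a sketch under the assumption that every column is numeric and that every year has a matching w<year> column; it looks up each row's column position and picks the values with NumPy indexing:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "eecode": [1, 1, 2, 2, 2],
    "year": [2012, 2013, 2011, 2012, 2013],
    "w2011": [7, 7, 20, 20, 20],
    "w2012": [8, 8, 25, 25, 25],
    "w2013": [9, 9, 25, 25, 25],
})

# Column name wanted for each row, e.g. "w2012"
wanted = "w" + df["year"].astype(str)

# Position of that column in df; get_indexer returns -1 for unknown names
col_pos = df.columns.get_indexer(wanted)
if (col_pos < 0).any():
    raise KeyError("some years have no matching wage column")

# Pick one value per row: row i, column col_pos[i]
df["wage_rate"] = df.to_numpy()[np.arange(len(df)), col_pos]
print(df)
```

The explicit check matters because an index of -1 would otherwise silently select the last column.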

unstack "multi-indexed" column in pandas

In pandas I have a dataframe as follows (the first line holds the column names; the second is currently just a row):
   2012  2013  2012  2013
  women women   men   men
0    14    43    24    45
1    34    54    35    65
and would like to get it like
        women  men
2012 0     14   24
2012 1     34   35
2013 0     43   45
2013 1     54   65
Using df.stack and df.unstack did not get me anywhere.
Is there an elegant solution?
In [5]: df
Out[5]:
   2012      2013
  women men women men
0     0   1     2   3
1     4   5     6   7
The idea is to first stack the first level of the columns into the index, and then swap the two index levels (pandas.DataFrame.swaplevel):
In [6]: df.stack(level=0).swaplevel(0,1,axis=0)
Out[6]:
        men  women
2012 0    1      0
2013 0    3      2
2012 1    5      4
2013 1    7      6
df.stack is most likely what you want. See below, you do need to specify that you want the first level.
In [79]: df = pd.DataFrame(0., index=[0,1], columns=pd.MultiIndex.from_product([[2012,2013], ['women','men']]))
In [83]: df.stack(level=0)
Out[83]:
        men  women
0 2012    0      0
  2013    0      0
1 2012    0      0
  2013    0      0
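Putting both answers together on the question's own numbers, a minimal sketch could look like the following. It assumes the year and the women/men labels form a two-level column MultiIndex. The final sort_index and column selection are additions to pin down the row and column order, which can differ between pandas versions; some versions may also emit a FutureWarning about the changed stack implementation.

```python
import pandas as pd

# Question's data: year on the first column level, women/men on the second
cols = pd.MultiIndex.from_arrays(
    [[2012, 2013, 2012, 2013], ["women", "women", "men", "men"]]
)
df = pd.DataFrame([[14, 43, 24, 45], [34, 54, 35, 65]], columns=cols)

out = (
    df.stack(level=0)        # move the year level from the columns into the index
      .swaplevel(0, 1)       # index becomes (year, original row)
      .sort_index()          # order rows by year, then original row
      [["women", "men"]]     # fix the column order explicitly
)
print(out)
```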
