pandas: conditionally return a column's value - python

I am trying to make a new column called 'wage_rate' that fills in the appropriate wage rate for the employee based on the year of the observation.
In other words, my list looks something like this:
eecode  year  w2011  w2012  w2013
     1  2012      7      8      9
     1  2013      7      8      9
     2  2011     20     25     25
     2  2012     20     25     25
     2  2013     20     25     25
And I want the new column to return 8 for the first row, 9 for the second, then 20, 25, 25.

One way would be to use apply, constructing the column name for each row from its year, i.e. 'w' + str(x.year).
In [41]: df.apply(lambda x: x['w' + str(x.year)], axis=1)
Out[41]:
0 8
1 9
2 20
3 25
4 25
dtype: int64
Details:
In [42]: df
Out[42]:
eecode year w2011 w2012 w2013
0 1 2012 7 8 9
1 1 2013 7 8 9
2 2 2011 20 25 25
3 2 2012 20 25 25
4 2 2013 20 25 25
In [43]: df['wage_rate'] = df.apply(lambda x: x['w' + str(x.year)], axis=1)
In [44]: df
Out[44]:
eecode year w2011 w2012 w2013 wage_rate
0 1 2012 7 8 9 8
1 1 2013 7 8 9 9
2 2 2011 20 25 25 20
3 2 2012 20 25 25 25
4 2 2013 20 25 25 25

values = [row['w%s' % row['year']] for key, row in df.iterrows()]
df['wage_rate'] = values  # create the new column
This solution uses an explicit loop, so it is likely slower than pure-pandas solutions, but on the other hand it is simple and readable.
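A fully vectorized alternative (a sketch that reconstructs the example frame from the question; the year-to-column mapping dict is an assumption of this sketch) maps each row's year to the position of its wage column and picks one value per row with NumPy integer indexing:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'eecode': [1, 1, 2, 2, 2],
    'year':   [2012, 2013, 2011, 2012, 2013],
    'w2011':  [7, 7, 20, 20, 20],
    'w2012':  [8, 8, 25, 25, 25],
    'w2013':  [9, 9, 25, 25, 25],
})

# Map each row's year to the positional index of its wage column,
# then pick one value per row with integer-array indexing.
wage_cols = ['w2011', 'w2012', 'w2013']
col_idx = df['year'].map({int(c[1:]): i for i, c in enumerate(wage_cols)})
df['wage_rate'] = df[wage_cols].to_numpy()[np.arange(len(df)), col_idx]
print(df['wage_rate'].tolist())  # [8, 9, 20, 25, 25]
```

This avoids the per-row Python call that apply(..., axis=1) incurs, which matters on large frames.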

You can rename the wage columns to bare years by stripping the leading w with re.sub:
In [70]:
import re
df.columns = [re.sub(r'^w(?=\d{4}$)', '', column) for column in df.columns]
In [80]:
df.columns
Out[80]:
Index([u'eecode', u'year', u'2011', u'2012', u'2013', u'wage_rate'], dtype='object')
then get the value using the following:
df['wage_rate'] = df.apply(lambda x: x[str(x.year)], axis=1)
Out[79]:
eecode year 2011 2012 2013 wage_rate
1 2012 7 8 9 8
1 2013 7 8 9 9
2 2011 20 25 25 20
2 2012 20 25 25 25
2 2013 20 25 25 25

Related

Sum rows of grouped data frame based on a specific column

I have a data frame in which I would like to create a new column, df["NEW_Salary"], from the sum of different rows within one group.
Grouping df by the columns Year, Month & Day, I want, for each group, to add the Salary of the rows where Combination is True to the rows where Combination is False.
import pandas as pd
data = {"Year":[2002,2002,2002,2002,2002,2010,2010,2010,2010,2010],
"Name":['Jason','Tom','KimJason','KimTom','Kim','Jason','Tom','KimJason','KimTom','Kim'],
"Combination":[False,False,True,True,False,False,False,True,True,False],
"Salary":[10,20,25,25,30,20,30,35,35,40]
}
df = pd.DataFrame(data)
Year Month Day Name Combination Salary
0 2002 1 15 Jason False 10
1 2002 1 15 Tom False 20
2 2002 1 15 KimJason True 25
3 2002 1 15 KimTom True 25
4 2002 1 15 Kim False 30
5 2010 3 20 Jason False 20
6 2010 3 20 Tom False 30
7 2010 3 20 KimJason True 35
8 2010 3 20 KimTom True 35
9 2010 3 20 Kim False 40
10 2002 4 5 Mary False 10
11 2002 4 5 MaryTom True 20
12 2002 4 5 Tom False 30
df["NEW_Salary"] would be created as follows:
The Salary of the row where Name is KimJason is added to the Salary of the rows where Name is Kim or Jason.
The Salary of the row where Name is KimTom is likewise added to the rows where Name is Kim or Tom.
The rows for KimTom & KimJason keep the same value in NEW_Salary as in Salary.
The expected output:
Year Month Day Name Combination Salary NEW_Salary
0 2002 1 15 Jason False 10 35
1 2002 1 15 Tom False 20 45
2 2002 1 15 KimJason True 25 25
3 2002 1 15 KimTom True 25 25
4 2002 1 15 Kim False 30 80
5 2010 3 20 Jason False 20 55
6 2010 3 20 Tom False 30 65
7 2010 3 20 KimJason True 35 35
8 2010 3 20 KimTom True 35 35
9 2010 3 20 Kim False 40 110
10 2002 4 5 Mary False 10 30
11 2002 4 5 MaryTom True 20 20
12 2002 4 5 Tom False 30 50
Is there an easy way to achieve this output, no matter how many groups I have?
Here's how you can do it; as far as I can tell, it should work for any number of groups.
First extract all rows with combination names to a dictionary
lookup = dict(tuple(df.loc[df['Combination']==True].groupby('Name')[['Year', 'Salary']]))
for key,value in lookup.items():
print(f"{key=}:\n{value}")
which looks like this: (value to each key is a df)
key='KimJason':
Year Salary
2 2002 25
7 2010 35
key='KimTom':
Year Salary
3 2002 25
8 2010 35
Then filter for the rows where Combination is False and, row by row, take the Salary value and add every value found for that name and year in the lookup dictionary. At the end, update the df with the new Salary values.
s = (df.loc[df['Combination']==False]
.apply(lambda row:
row['Salary'] + sum(lookup[x].loc[lookup[x]['Year']==row['Year'], 'Salary'].squeeze()
for x in lookup
if row['Name'] in x)
,axis=1)
)
df['Salary'].update(s)
print(df)
Output df:
Year Name Combination Salary
0 2002 Jason False 35
1 2002 Tom False 45
2 2002 KimJason True 25
3 2002 KimTom True 25
4 2002 Kim False 80
5 2010 Jason False 55
6 2010 Tom False 65
7 2010 KimJason True 35
8 2010 KimTom True 35
9 2010 Kim False 110
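For reference, a self-contained version of the approach above (a sketch: it uses only the four columns from the posted constructor, since the Month/Day columns in the printout are not in the dict, and it assigns the result to a NEW_Salary column via Series.where instead of updating Salary in place, to avoid chained-assignment issues):

```python
import pandas as pd

df = pd.DataFrame({
    "Year": [2002, 2002, 2002, 2002, 2002, 2010, 2010, 2010, 2010, 2010],
    "Name": ['Jason', 'Tom', 'KimJason', 'KimTom', 'Kim',
             'Jason', 'Tom', 'KimJason', 'KimTom', 'Kim'],
    "Combination": [False, False, True, True, False,
                    False, False, True, True, False],
    "Salary": [10, 20, 25, 25, 30, 20, 30, 35, 35, 40],
})

# One sub-frame per combination name, e.g. lookup['KimTom']
lookup = dict(tuple(df.loc[df['Combination']].groupby('Name')[['Year', 'Salary']]))

# For each non-combination row, add every combination salary for the same
# year whose combined name contains this row's name.
s = (df.loc[~df['Combination']]
       .apply(lambda row: row['Salary']
              + sum(lookup[x].loc[lookup[x]['Year'] == row['Year'], 'Salary'].squeeze()
                    for x in lookup if row['Name'] in x),
              axis=1))

# Combination rows keep their Salary; the others take the summed values.
df['NEW_Salary'] = df['Salary'].where(df['Combination'], s)
print(df['NEW_Salary'].tolist())  # [35, 45, 25, 25, 80, 55, 65, 35, 35, 110]
```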

pandas convert dataframe to pivot_table where index is the sorting values

I have the following dataframe:
site height_id height_meters
0 9 c3 24
1 9 c2 30
2 9 c1 36
3 3 c0 18
4 3 bf 24
5 3 be 30
6 4 10 18
7 4 0f 24
8 4 0e 30
I want to transform it so that the column index is the values of 'site', the values are 'height_meters', and the rows are indexed by the order of the values within each site (I looked on the internet and didn't find something similar; I tried groupby and making a pivot table without success):
9 3 4
0 24 18 18
1 30 24 24
2 36 30 30
the gaps between the numbers aren't important ...
here is the df
my_df = pd.DataFrame(dict(
site=[9, 9, 9, 3, 3, 3, 4, 4, 4],
height_id='c3,c2,c1,c0,bf,be,10,0f,0e'.split(','),
height_meters=[24, 30, 36, 18, 24, 30, 18, 24, 30]
))
You can use GroupBy.cumcount for counter of column site:
print (my_df.groupby('site').cumcount())
0 0
1 1
2 2
3 0
4 1
5 2
6 0
7 1
8 2
dtype: int64
You can convert it to index with site column and reshape by Series.unstack:
df = my_df.set_index([my_df.groupby('site').cumcount(), 'site'])['height_meters'].unstack()
print (df)
site 3 4 9
0 18 18 24
1 24 24 30
2 30 30 36
Similar solution with DataFrame.pivot and column created by cumcount:
df = my_df.assign(new=my_df.groupby('site').cumcount()).pivot(index='new', columns='site', values='height_meters')
print (df)
site 3 4 9
new
0 18 18 24
1 24 24 30
2 30 30 36
If order is important add DataFrame.reindex by unique values of column site:
df = (my_df.set_index([my_df.groupby('site').cumcount(), 'site'])['height_meters']
.unstack()
.reindex(my_df['site'].unique(), axis=1))
print (df)
site 9 3 4
0 24 18 18
1 30 24 24
2 36 30 30
Last for remove site (new) columns and index names is possible use DataFrame.rename_axis:
df = df.rename_axis(index=None, columns=None)
print (df)
3 4 9
0 18 18 24
1 24 24 30
2 30 30 36
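The individual steps above can be chained into a single pipeline (a sketch; keyword arguments are used for pivot, which newer pandas requires):

```python
import pandas as pd

my_df = pd.DataFrame(dict(
    site=[9, 9, 9, 3, 3, 3, 4, 4, 4],
    height_id='c3,c2,c1,c0,bf,be,10,0f,0e'.split(','),
    height_meters=[24, 30, 36, 18, 24, 30, 18, 24, 30],
))

out = (my_df.assign(new=my_df.groupby('site').cumcount())
            .pivot(index='new', columns='site', values='height_meters')
            .reindex(my_df['site'].unique(), axis=1)  # keep original site order: 9, 3, 4
            .rename_axis(index=None, columns=None))   # drop the 'new'/'site' axis names
print(out)
```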

How can I get this series to a pandas dataframe?

I have some data and after using a groupby function I now have a series that looks like this:
year
1997 15
1998 22
1999 24
2000 24
2001 28
2002 11
2003 15
2004 19
2005 10
2006 10
2007 21
2008 26
2009 23
2010 16
2011 33
2012 19
2013 26
2014 25
How can I create a pandas dataframe from here with year as one column and the other column named sightings ?
I am a pandas novice so don't really know what I am doing. I have tried the reindex and unstack functions but haven't been able to get what I want...
You can use reset_index and rename columns:
print (df.reset_index())
index year
0 1997 15
1 1998 22
2 1999 24
3 2000 24
4 2001 28
5 2002 11
6 2003 15
7 2004 19
8 2005 10
9 2006 10
10 2007 21
11 2008 26
12 2009 23
13 2010 16
14 2011 33
15 2012 19
16 2013 26
17 2014 25
print (df.reset_index().rename(columns=({'index':'year','year':'sightings'})))
year sightings
0 1997 15
1 1998 22
2 1999 24
3 2000 24
4 2001 28
5 2002 11
6 2003 15
7 2004 19
8 2005 10
9 2006 10
10 2007 21
11 2008 26
12 2009 23
13 2010 16
14 2011 33
15 2012 19
16 2013 26
17 2014 25
Another solution is set column names by list of names:
df1 = df.reset_index()
df1.columns = ['year','sightings']
print (df1)
year sightings
0 1997 15
1 1998 22
2 1999 24
3 2000 24
4 2001 28
5 2002 11
6 2003 15
7 2004 19
8 2005 10
9 2006 10
10 2007 21
11 2008 26
12 2009 23
13 2010 16
14 2011 33
15 2012 19
16 2013 26
17 2014 25
EDIT:
Sometimes it helps to add the parameter as_index=False to groupby, so that a DataFrame is returned:
import pandas as pd
df = pd.DataFrame({'A':[1,1,3],
'B':[4,5,6]})
print (df)
A B
0 1 4
1 1 5
2 3 6
print (df.groupby('A')['B'].sum())
A
1 9
3 6
Name: B, dtype: int64
print (df.groupby('A', as_index=False)['B'].sum())
A B
0 1 9
1 3 6
s.rename('sightings').reset_index()
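For example, with a short Series like the one produced by the asker's groupby (only the first few years shown here), naming the Series first makes reset_index label the value column directly:

```python
import pandas as pd

# A Series shaped like the asker's groupby result
s = pd.Series([15, 22, 24], index=pd.Index([1997, 1998, 1999], name='year'))

# rename gives the values a column name; reset_index turns the index into a column
df = s.rename('sightings').reset_index()
print(df)
#    year  sightings
# 0  1997         15
# 1  1998         22
# 2  1999         24
```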
I've also used this method during the groupby stage to put the results straight into a dataframe:
df2 = df1.groupby(['Year']).count()
df3 = pd.DataFrame(df2).reset_index()
If your original dataframe - df1 - had "Year" and "Sightings" as its two columns, then df3 should have each year listed under "Year" and the count (or sum, average, whatever) listed under "Sightings".
If not, you can change the column names by doing the following:
df3.columns = ['Year','Sightings']
or
df3 = df3.rename(columns={'oldname_A': 'Year', 'oldname_B': 'Sightings'})

Mean value based on another column group

I have a dataframe (2000 rows, 5 columns):
year month day GroupBy_Day
0 2013 11 6 3
1 2013 11 7 10
2 2013 11 8 4
3 2013 11 9 4
4 2013 11 10 4
...
24 2013 12 1 5
25 2013 12 2 4
26 2013 12 3 5
27 2013 12 4 2
28 2013 12 5 7
29 2013 12 6 1
I already grouped my elements and got the count for each day (column GroupBy_Day). I need to get the mean count by day of month (e.g. for all days numbered 6, we have a mean of (3+1)/2 = 2 occurrences), and subtract this value from GroupBy_Day in a new column.
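This question has no answer in the thread; a minimal sketch of one way to do it, using the rows visible in the printout and groupby(...).transform('mean') to broadcast the per-day mean back onto every row:

```python
import pandas as pd

# Rows taken from the question's printout (year/month/day/GroupBy_Day)
df = pd.DataFrame({
    'year':  [2013] * 9,
    'month': [11, 11, 11, 12, 12, 12, 12, 12, 12],
    'day':   [6, 7, 8, 1, 2, 3, 4, 5, 6],
    'GroupBy_Day': [3, 10, 4, 5, 4, 5, 2, 7, 1],
})

# Mean count for each day-of-month across all groups, aligned to the original rows
day_mean = df.groupby('day')['GroupBy_Day'].transform('mean')
# e.g. day 6 appears with counts 3 and 1, so its mean is 2.0
df['diff_from_day_mean'] = df['GroupBy_Day'] - day_mean
print(df['diff_from_day_mean'].tolist())  # [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -1.0]
```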

parsing a list of columns to a pandas dataframe to show only those columns

I know with pandas, if you have a dataframe (df), you can get the columns using df.columns.values, which returns an array that can be converted to a list.
If my dataframe has 10 columns and I know the names of the first three, can I create a list and pass it to the dataframe to show only those columns?
subset_columns = ['one', 'two', 'three']
df[subset_columns]
df OUT >>
one | two | three
1345 415 1654
13445 56576 76r76
You can convert the columns to a list either by casting or using the numpy tolist() function. You can then select from this by slicing in the normal manner:
In [5]:
import pandas as pd
import numpy as np
df = pd.DataFrame(dict(zip(list('abcdefghij'), [np.arange(10)] * 10)))
cols = df.columns.values.tolist()
# you can also do list(df.columns)
In [11]:
cols
Out[11]:
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
In [12]:
subcols = cols[2:5]
df[subcols]
Out[12]:
c d e
0 0 0 0
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
6 6 6 6
7 7 7 7
8 8 8 8
9 9 9 9
[10 rows x 3 columns]
In order to select multiple non-sequential columns you can do this:
In [36]
part1 = cols[0:3]
part2 = cols[6:8]
subcols = part1+part2
df[subcols]
Out[36]:
a b c g h
0 0 0 0 0 0
1 1 1 1 1 1
2 2 2 2 2 2
3 3 3 3 3 3
4 4 4 4 4 4
5 5 5 5 5 5
6 6 6 6 6 6
7 7 7 7 7 7
8 8 8 8 8 8
9 9 9 9 9 9
[10 rows x 5 columns]
Yes, you can select the required columns.
df = pd.read_csv("../SO/data.csv")
df.head()
card_number effective_date expiry_date grouping_name Ac. Year code
0 1206090 28 Sep 2012 21 Aug 2013 Dummy no.1 201213
1 1206090 21 Feb 2013 21 Aug 2013 Dummy no.2 201213
2 1206090 28 Sep 2012 30 Nov 2012 Dummy no.3 201213
3 1206090 03 Dec 2012 21 Aug 2013 Dummy no.3 201213
4 1206090 23 Apr 2013 31 Aug 2013 Dummy no.4 201213
req_cols below is the list of required columns:
req_cols = ['card_number', 'expiry_date', 'grouping_name']
df[req_cols].head()
card_number expiry_date grouping_name
0 1206090 21 Aug 2013 Dummy no.1
1 1206090 21 Aug 2013 Dummy no.2
2 1206090 30 Nov 2012 Dummy no.3
3 1206090 21 Aug 2013 Dummy no.3
4 1206090 31 Aug 2013 Dummy no.4
