Python melt dataframe - python

I am trying to convert a dataframe.
Currently I have something similar to this:
Material Revenue 2007 Revenue 2008 Revenue 2009 Profit 2007 Profit 2008 Profit 2009
Mat A 50 55 60 10 15 20
Mat B 45 50 55 5 10 35
Mat C 75 80 85 35 30 45
And this is the conversion I am trying to achieve:
Material Revenue Profit Period
Mat A 50 10 2007
Mat A 55 15 2008
Mat A 60 20 2009
Mat B 45 5 2007
Mat B 50 10 2008
Mat B 55 35 2009
Mat C 75 35 2007
Mat C 80 30 2008
Mat C 85 45 2009
From what I have gathered I most likely have to use melt but I am not able to get the code to work.
Edit:
This code does seem to work, but it is too complicated to use on the real dataframe.
df1 = df.melt(id_vars=['Material'],
value_vars=['Revenue 2007', 'Revenue 2008', 'Revenue 2009'],
var_name='Period', value_name='Revenue')
df1["Period"]=df1['Period'].str[-4:]
df2 = df.melt(id_vars=['Material'],
value_vars=['Profit 2007', 'Profit 2008', 'Profit 2009'],
var_name='Period', value_name='Profit')
df1["Profit"]=df2["Profit"]

Melt all of the value columns at once, split the melted column names on whitespace into a measure (Revenue or Profit) and a period, then group and unstack so that each measure gets its own column.
df1 = df.melt(id_vars=['Material'],
value_vars=['Revenue 2007', 'Revenue 2008', 'Revenue 2009','Profit 2007','Profit 2008','Profit 2009'],
var_name='Period', value_name='Revenue')
df2 = pd.concat([df1, df1['Period'].str.split(' ', expand=True)], axis=1).drop('Period', axis=1)
df2.rename(columns={0:'flg', 1:'Period'},inplace=True)
df2.groupby(['Material','Period','flg'])['Revenue'].sum().unstack().reset_index()
flg Material Period Profit Revenue
0 Mat A 2007 10 50
1 Mat A 2008 15 55
2 Mat A 2009 20 60
3 Mat B 2007 5 45
4 Mat B 2008 10 50
5 Mat B 2009 35 55
6 Mat C 2007 35 75
7 Mat C 2008 30 80
8 Mat C 2009 45 85

Is this what you're looking for?
left = df[[col for col in df.columns if col.startswith('Profit')] + ['Material']]\
.melt(id_vars='Material', var_name='Period', value_name='Profit')
left['Period'] = left['Period'].str.split(' ').str[1]
right = df[[col for col in df.columns if col.startswith('Revenue')] + ['Material']]\
.melt(id_vars='Material', var_name='Period', value_name='Revenue')
right['Period'] = right['Period'].str.split(' ').str[1]
print(left.merge(right).sort_values(by=['Material', 'Period']).reset_index(drop=True))
Output
Material Period Profit Revenue
0 Mat A 2007 10 50
1 Mat A 2008 15 55
2 Mat A 2009 20 60
3 Mat B 2007 5 45
4 Mat B 2008 10 50
5 Mat B 2009 35 55
6 Mat C 2007 35 75
7 Mat C 2008 30 80
8 Mat C 2009 45 85

df = pd.melt(df, id_vars=['Material'])
df['Period'] = df.variable.str.split(" ").str[1]
df['type'] = df.variable.str.split(" ").str[0]
df = df.drop('variable', axis=1)
df = (
df
.groupby(['Material','Period','type'])
.sum()
.unstack('type')
.reset_index()
)
df.columns = ["Material", "Period", "Profit", "Revenue"]
# only needed if Material holds 'A', 'B', 'C' without the 'Mat ' prefix; skip this line otherwise
df['Material'] = 'Mat ' + df['Material'].astype(str)
df = df[["Material","Revenue","Profit","Period"]]
df
Material Revenue Profit Period
0 Mat A 50 10 2007
1 Mat A 55 15 2008
2 Mat A 60 20 2009
3 Mat B 45 5 2007
4 Mat B 50 10 2008
5 Mat B 55 35 2009
6 Mat C 75 35 2007
7 Mat C 80 30 2008
8 Mat C 85 45 2009
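
Another option, offered as a minimal sketch rather than a definitive answer: pd.wide_to_long can split "Revenue 2007"-style column names in one step, assuming every value column follows the "<measure> <year>" pattern with a single space as separator. The frame below is rebuilt by hand from the question's sample data, and the name long_df is only illustrative.

```python
import pandas as pd

# Sample data rebuilt from the question
df = pd.DataFrame({
    "Material": ["Mat A", "Mat B", "Mat C"],
    "Revenue 2007": [50, 45, 75],
    "Revenue 2008": [55, 50, 80],
    "Revenue 2009": [60, 55, 85],
    "Profit 2007": [10, 5, 35],
    "Profit 2008": [15, 10, 30],
    "Profit 2009": [20, 35, 45],
})

# stubnames are the column prefixes; the numeric suffix after the
# separator becomes the new "Period" column
long_df = (
    pd.wide_to_long(df, stubnames=["Revenue", "Profit"],
                    i="Material", j="Period", sep=" ")
    .reset_index()
    .sort_values(["Material", "Period"])
    .reset_index(drop=True)
)
print(long_df)
```

If the real column names use a different separator or a non-numeric suffix, the sep and suffix arguments would need adjusting.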

Related

Pandas sum multi-index columns with same name

I know that I can sum two columns with:
df["name1"]+df["name2"]
But how does the sum work when the two column names are the same?
Given the following CSV:
,,College 1,,,,,,,,,,,,College 2,,,,,,,,,,,,College 3,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,Music,,,,Geography,,,,Business,,,,Mathematics,,,,Biology,,,,Geography,,,,Business,,,,Biology,,,,Technology,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,F,P,M,D,F,P,M,D,F,P,M,D,F,P,M,D,F,P,M,D,F,P,M,D,F,P,M,D,F,P,M,D,F,P,M,D
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Year 1,M,0,5,7,9,2,18,5,10,4,9,6,2,4,14,18,11,10,19,18,20,3,17,19,13,4,9,6,2,0,10,11,14,4,12,12,5
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,F,0,13,14,11,0,6,8,6,2,12,14,9,9,17,12,18,6,17,16,14,0,4,2,5,2,12,14,9,10,11,18,20,0,5,7,8
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Year 2,M,5,10,6,6,1,20,5,18,4,9,6,2,10,13,15,19,2,18,16,13,1,19,5,12,4,9,6,2,1,13,15,18,3,19,8,16
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,F,1,11,14,15,0,9,9,2,2,12,14,9,7,17,18,14,9,18,13,14,0,9,2,10,2,12,14,9,0,17,19,19,0,4,6,4
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Evening,M,4,10,6,5,3,13,19,5,4,9,6,2,8,17,10,18,3,11,20,11,4,18,17,20,4,9,6,2,8,12,16,13,4,19,18,7
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,F,4,12,12,13,0,9,3,8,2,12,14,9,0,18,11,18,1,13,13,10,0,6,2,8,2,12,14,9,9,16,20,13,0,10,5,6
I can clean the file and set up a multi-index with pandas and numpy:
df = pd.read_csv("CollegeGrades2.csv", index_col=[0,1], header=[0,1,2], skiprows=lambda x: x%2 == 1)
df.columns = pd.MultiIndex.from_frame(df.columns.to_frame().apply(lambda x: np.where(x.str.contains('Unnamed'), np.nan, x)).ffill())
df.index = pd.MultiIndex.from_frame(df.index.to_frame().ffill())
df.groupby(level=0, sort=False).sum()
However my issue is that I want to total the subjects e.g. College 1 Geography + College 3 Geography and display them in the following output:
I have tried separating them out into different data frames, summing them and then concatenating them but in doing so I lose the headings, for example:
music = df2["College 1", "Music"]
geography = df2["College 1", "Geography"] + df2["College 1", "Geography"]
pd.concat([music,geography], axis=1).groupby(level=0, sort=False).sum()
How can I sum the subjects while maintaining my desired output? Any help would be appreciated.
Thank you.
You can also group by the column:
df.groupby(level=[1, 2], axis=1).sum().groupby(level=0).sum()
Result:
1 Biology Business Geography Mathematics Music Technology
2 D F M P D F M P D F M P D F M P D F M P D F M P
0
Evening 47 21 69 52 22 12 40 42 41 7 41 46 36 8 21 35 18 8 18 22 13 4 23 29
Year 1 68 26 63 57 22 12 40 42 34 5 34 45 29 13 30 31 20 0 21 18 13 4 19 17
Year 2 64 12 63 66 22 12 40 42 42 2 21 57 33 17 33 30 21 6 20 21 20 3 14 23
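
Note that groupby(..., axis=1) is deprecated in recent pandas releases (2.1 and later). A sketch of the same idea that avoids it is to transpose, group on the column levels, and transpose back. The small frame below is a hand-made stand-in for the CSV (only a few columns and rows), so the numbers are illustrative rather than the question's full data.

```python
import pandas as pd

# Three column levels: college, subject, grade (a cut-down stand-in for the CSV)
cols = pd.MultiIndex.from_tuples([
    ("College 1", "Geography", "F"), ("College 1", "Geography", "P"),
    ("College 2", "Geography", "F"), ("College 2", "Geography", "P"),
    ("College 2", "Biology", "F"), ("College 2", "Biology", "P"),
])
idx = pd.MultiIndex.from_tuples([
    ("Year 1", "M"), ("Year 1", "F"), ("Year 2", "M"), ("Year 2", "F"),
])
df = pd.DataFrame(
    [[2, 18, 3, 17, 10, 19],
     [0, 6, 0, 4, 6, 17],
     [1, 20, 1, 19, 2, 18],
     [0, 9, 0, 9, 9, 18]],
    index=idx, columns=cols,
)

# Sum columns that share the same (subject, grade), ignoring the college level
by_subject = df.T.groupby(level=[1, 2]).sum().T

# Then total the M/F rows within each year
totals = by_subject.groupby(level=0, sort=False).sum()
print(totals)
```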

average of one wrt another or averageifs in python

I have a pandas df as displayed. I would like to calculate the average Rate by DC and by Brand, which is similar to AVERAGEIF in Excel.
I have tried methods like groupby with mean(), but that does not give the correct results.
Your question is not clear but you may be looking for:
df.groupby(['DC','Brand'])['Rate'].mean()
AVERAGEIF in Excel returns a column that is the same size as your original data, so I think you're looking for groupby(...).transform():
# Sample DF
Brand Rate
0 A 45
1 B 100
2 C 28
3 A 92
4 B 2
5 C 79
6 A 48
7 B 97
8 C 72
9 D 14
10 D 16
11 D 64
12 E 85
13 E 22
Result:
df['Avg Rate by Brand'] = df.groupby('Brand')['Rate'].transform('mean')
print(df)
Brand Rate Avg Rate by Brand
0 A 45 61.666667
1 B 100 66.333333
2 C 28 59.666667
3 A 92 61.666667
4 B 2 66.333333
5 C 79 59.666667
6 A 48 61.666667
7 B 97 66.333333
8 C 72 59.666667
9 D 14 31.333333
10 D 16 31.333333
11 D 64 31.333333
12 E 85 53.500000
13 E 22 53.500000
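
Since the question asks for the average by DC and by Brand, the same transform approach should extend to two grouping keys. This is a minimal sketch: the column names come from the question, but the values are made up because the original data was not posted as text.

```python
import pandas as pd

# Made-up sample; only the column names are taken from the question
df = pd.DataFrame({
    "DC": ["X", "X", "X", "Y", "Y"],
    "Brand": ["A", "A", "B", "A", "A"],
    "Rate": [10, 20, 30, 40, 60],
})

# transform returns one value per original row, like AVERAGEIFS filled down a column
df["Avg Rate by DC by Brand"] = df.groupby(["DC", "Brand"])["Rate"].transform("mean")
print(df)
```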

Getting a list of values of a column depending on the conditions of other columns

For the given dataframe df as:
Election Yr. Party Region Votes
0 2000 A a 50
1 2000 A b 30
2 2000 B a 40
3 2000 B b 50
4 2000 C a 30
5 2000 C c 40
6 2004 A a 20
7 2004 A b 30
8 2004 B a 40
9 2004 B b 50
10 2004 C a 60
11 2004 C b 40
12 2008 A a 30
13 2008 A c 30
14 2008 B a 80
15 2008 B b 50
16 2008 C a 60
17 2008 C b 40
How can I find the list of regions that have a different winner in every election? The winner is the party with the most votes in that region in a given year.
First you need to figure out the winner in every election for each region, essentially the party with the highest vote.
winners = df.groupby(['Election Yr.', 'Region']).apply(lambda x: x.set_index('Party').Votes.idxmax())
Then you can figure out for each region how many different winners there have been:
n_unique_winners = winners.groupby(['Region']).nunique()
You can also figure out how many elections have occurred in each region:
n_elections = winners.groupby(['Region']).size()
Entries with a true value in n_unique_winners == n_elections are the regions you are looking for.
To get a list of these regions, you can do n_unique_winners[n_unique_winners == n_elections].index.values
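
Put together, the steps above can be sketched end to end as follows. This variant picks the winning rows with idxmax on the Votes column instead of apply (a swapped-in technique, not the exact code above), and on a tie it keeps whichever row comes first. The data is the question's table typed in by hand.

```python
import pandas as pd

df = pd.DataFrame({
    "Election Yr.": [2000] * 6 + [2004] * 6 + [2008] * 6,
    "Party": list("AABBCC") * 3,
    "Region": list("ababac") + list("ababab") + list("acabab"),
    "Votes": [50, 30, 40, 50, 30, 40,
              20, 30, 40, 50, 60, 40,
              30, 30, 80, 50, 60, 40],
})

# Row with the most votes for each (year, region) pair
top_rows = df.loc[df.groupby(["Election Yr.", "Region"])["Votes"].idxmax()]
winners = top_rows.set_index(["Election Yr.", "Region"])["Party"]

# Compare the number of distinct winners with the number of elections per region
by_region = winners.groupby(level="Region")
n_unique_winners = by_region.nunique()
n_elections = by_region.size()

regions = n_unique_winners[n_unique_winners == n_elections].index.tolist()
print(regions)
```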

How to unpack lists of tuples of varying length in a pandas dataframe?

ID LIST_OF_TUPLE (2col)
1 [('2012','12'), ('2012','33'), ('2014', '82')]
2 NA
3 [('2012','12')]
4 [('2012','12'), ('2012','33'), ('2014', '82'), ('2022', '67')]
Result:
ID TUP_1 TUP_2(3col)
1 2012 12
1 2012 33
1 2014 82
3 2012 12
4 2012 12
4 2012 33
4 2014 82
4 2022 67
Thanks in advance.
Explode the list column, build a dataframe from the resulting tuples, and then join it back to the IDs:
s = df['LIST_OF_TUPLE'].explode()
out = (df[['ID']].join(pd.DataFrame(s.tolist(),index=s.index)
.add_prefix("TUP_")).reset_index(drop=True)) #you can chain a dropna if reqd
print(out)
ID TUP_0 TUP_1
0 1 2012 12
1 1 2012 33
2 1 2014 82
3 2 NaN None
4 3 2012 12
5 4 2012 12
6 4 2012 33
7 4 2014 82
8 4 2022 67
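
To match the requested result (no row for ID 2), one possible variation is to drop the missing entries right after exploding and use an inner join. This sketch assumes the "NA" cell is an actual NaN and that every tuple has exactly two elements; the column names TUP_1 and TUP_2 follow the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "LIST_OF_TUPLE": [
        [("2012", "12"), ("2012", "33"), ("2014", "82")],
        np.nan,  # assumption: "NA" in the question is a real missing value
        [("2012", "12")],
        [("2012", "12"), ("2012", "33"), ("2014", "82"), ("2022", "67")],
    ],
})

# One row per tuple; rows that had no list are dropped
s = df["LIST_OF_TUPLE"].explode().dropna()

# Split each tuple into two columns, keeping the original row labels for the join
parts = pd.DataFrame(s.tolist(), index=s.index, columns=["TUP_1", "TUP_2"])
out = df[["ID"]].join(parts, how="inner").reset_index(drop=True)
print(out)
```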

How can I get this series to a pandas dataframe?

I have some data and after using a groupby function I now have a series that looks like this:
year
1997 15
1998 22
1999 24
2000 24
2001 28
2002 11
2003 15
2004 19
2005 10
2006 10
2007 21
2008 26
2009 23
2010 16
2011 33
2012 19
2013 26
2014 25
How can I create a pandas dataframe from here with year as one column and the other column named sightings ?
I am a pandas novice so don't really know what I am doing. I have tried the reindex and unstack functions but haven't been able to get what I want...
You can use reset_index and rename columns:
print (df.reset_index())
index year
0 1997 15
1 1998 22
2 1999 24
3 2000 24
4 2001 28
5 2002 11
6 2003 15
7 2004 19
8 2005 10
9 2006 10
10 2007 21
11 2008 26
12 2009 23
13 2010 16
14 2011 33
15 2012 19
16 2013 26
17 2014 25
print (df.reset_index().rename(columns=({'index':'year','year':'sightings'})))
year sightings
0 1997 15
1 1998 22
2 1999 24
3 2000 24
4 2001 28
5 2002 11
6 2003 15
7 2004 19
8 2005 10
9 2006 10
10 2007 21
11 2008 26
12 2009 23
13 2010 16
14 2011 33
15 2012 19
16 2013 26
17 2014 25
Another solution is set column names by list of names:
df1 = df.reset_index()
df1.columns = ['year','sightings']
print (df1)
year sightings
0 1997 15
1 1998 22
2 1999 24
3 2000 24
4 2001 28
5 2002 11
6 2003 15
7 2004 19
8 2005 10
9 2006 10
10 2007 21
11 2008 26
12 2009 23
13 2010 16
14 2011 33
15 2012 19
16 2013 26
17 2014 25
EDIT:
Sometimes it helps to add the parameter as_index=False to groupby so that it returns a DataFrame:
import pandas as pd
df = pd.DataFrame({'A':[1,1,3],
'B':[4,5,6]})
print (df)
A B
0 1 4
1 1 5
2 3 6
print (df.groupby('A')['B'].sum())
A
1 9
3 6
Name: B, dtype: int64
print (df.groupby('A', as_index=False)['B'].sum())
A B
0 1 9
1 3 6
s.rename('sightings').reset_index()
I've also used this method during the groupby stage to put the results straight into a dataframe:
df2 = df1.groupby(['Year']).count()
df3 = pd.DataFrame(df2).reset_index()
If your original dataframe (df1) had "Year" and "Sightings" as its two columns, then df3 should have each year listed under "Year" and the count (or sum, average, whatever) listed under "Sightings".
If not, you can change the column names by doing the following:
df3.columns = ['Year','Sightings']
or
df3 = df3.rename(columns={'oldname_A': 'Year', 'oldname_B': 'Sightings'})
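
One more short option, as a sketch: Series.reset_index accepts a name argument, so the value column can be named in the same call. This assumes the index of the Series is already named "year", as in the printed output in the question; only the first three values are reproduced here.

```python
import pandas as pd

# First three rows of the Series from the question
s = pd.Series([15, 22, 24], index=pd.Index([1997, 1998, 1999], name="year"))

# The index becomes the "year" column; the values are named "sightings"
out = s.reset_index(name="sightings")
print(out)
```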
