Python melt dataframe - python
I am trying to convert a dataframe.
Currently I have something similar to this
Material Revenue 2007 Revenue 2008 Revenue 2009 Profit 2007 Profit 2008 Profit 2009
Mat A 50 55 60 10 15 20
Mat B 45 50 55 5 10 35
Mat C 75 80 85 35 30 45
And this is the conversion I am trying to achieve:
Material Revenue Profit Period
Mat A 50 10 2007
Mat A 55 15 2008
Mat A 60 20 2009
Mat B 45 5 2007
Mat B 50 10 2008
Mat B 55 35 2009
Mat C 75 35 2007
Mat C 80 30 2008
Mat C 85 45 2009
From what I have gathered I most likely have to use melt but I am not able to get the code to work.
Edit:
This code does seem to work, but it is too complicated to use on my real dataframe.
df1 = df.melt(id_vars=['Material'],
value_vars=['Revenue 2007', 'Revenue 2008', 'Revenue 2009'],
var_name='Period', value_name='Revenue')
df1["Period"]=df1['Period'].str[-4:]
df2 = df.melt(id_vars=['Material'],
value_vars=['Profit 2007', 'Profit 2008', 'Profit 2009'],
var_name='Period', value_name='Profit')
df1["Profit"]=df2["Profit"]
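Melt all of the value columns in a single melt() call. Split the resulting column of names on whitespace into a measure ('flg') and a period. Then group by material, period and measure, and unstack the measure into columns.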
df1 = df.melt(id_vars=['Material'],
value_vars=['Revenue 2007', 'Revenue 2008', 'Revenue 2009','Profit 2007','Profit 2008','Profit 2009'],
var_name='Period', value_name='Revenue')
df2 = pd.concat([df1, df1['Period'].str.split(' ', expand=True)], axis=1).drop('Period', axis=1)
df2.rename(columns={0:'flg', 1:'Period'},inplace=True)
df2.groupby(['Material','Period','flg'])['Revenue'].sum().unstack().reset_index()
flg Material Period Profit Revenue
0 Mat A 2007 10 50
1 Mat A 2008 15 55
2 Mat A 2009 20 60
3 Mat B 2007 5 45
4 Mat B 2008 10 50
5 Mat B 2009 35 55
6 Mat C 2007 35 75
7 Mat C 2008 30 80
8 Mat C 2009 45 85
Is this what you're looking for?
left = df[[col for col in df.columns if col.startswith('Profit')] + ['Material']]\
.melt(id_vars='Material', var_name='Period', value_name='Profit')
left['Period'] = left['Period'].str.split(' ').str[1]
right = df[[col for col in df.columns if col.startswith('Revenue')] + ['Material']]\
.melt(id_vars='Material', var_name='Period', value_name='Revenue')
right['Period'] = right['Period'].str.split(' ').str[1]
print(left.merge(right).sort_values(by=['Material', 'Period']).reset_index(drop=True))
Output
Material Period Profit Revenue
0 Mat A 2007 10 50
1 Mat A 2008 15 55
2 Mat A 2009 20 60
3 Mat B 2007 5 45
4 Mat B 2008 10 50
5 Mat B 2009 35 55
6 Mat C 2007 35 75
7 Mat C 2008 30 80
8 Mat C 2009 45 85
df = pd.melt(df, id_vars=['Material'])
df['Period'] = df.variable.str.split(" ").str[1]
df['type'] = df.variable.str.split(" ").str[0]
df = df.drop('variable', axis=1)
df = (
df
.groupby(['Material','Period','type'])
.sum()
.unstack('type')
.reset_index()
)
df.columns = ["Material", "Period", "Profit", "Revenue"]
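# Only needed if Material holds bare labels such as 'A'; skip it when the values already read 'Mat A'.
# df['Material'] = 'Mat ' + df['Material'].astype(str)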
df = df[["Material","Revenue","Profit","Period"]]
df
Material Revenue Profit Period
0 Mat A 50 10 2007
1 Mat A 55 15 2008
2 Mat A 60 20 2009
3 Mat B 45 5 2007
4 Mat B 50 10 2008
5 Mat B 55 35 2009
6 Mat C 75 35 2007
7 Mat C 80 30 2008
8 Mat C 85 45 2009
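A more compact alternative to the melt-based answers above is pd.wide_to_long, which is designed for columns that share a stub name followed by a suffix. The sketch below rebuilds the sample frame from the question and assumes every value column follows the "<measure> <year>" pattern with a single space as the separator; the name long_df is only illustrative.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "Material": ["Mat A", "Mat B", "Mat C"],
    "Revenue 2007": [50, 45, 75],
    "Revenue 2008": [55, 50, 80],
    "Revenue 2009": [60, 55, 85],
    "Profit 2007": [10, 5, 35],
    "Profit 2008": [15, 10, 30],
    "Profit 2009": [20, 35, 45],
})

# stubnames are the measure prefixes; the numeric suffix becomes the "Period" level.
long_df = (
    pd.wide_to_long(df, stubnames=["Revenue", "Profit"],
                    i="Material", j="Period", sep=" ")
    .reset_index()
    .sort_values(["Material", "Period"])
    .reset_index(drop=True)
)

# Put the columns in the order asked for in the question.
long_df = long_df[["Material", "Revenue", "Profit", "Period"]]
print(long_df)
```

Note that wide_to_long converts a numeric suffix to integers, so Period comes back as an integer column here rather than as strings.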
Related
Pandas sum multi-index columns with same name
I know that I can sum indexes by: df["name1"]+df["name2"]
But how does sum work when the two index names are the same? Given the following CSV:
,,College 1,,,,,,,,,,,,College 2,,,,,,,,,,,,College 3,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,Music,,,,Geography,,,,Business,,,,Mathematics,,,,Biology,,,,Geography,,,,Business,,,,Biology,,,,Technology,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,F,P,M,D,F,P,M,D,F,P,M,D,F,P,M,D,F,P,M,D,F,P,M,D,F,P,M,D,F,P,M,D,F,P,M,D
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Year 1,M,0,5,7,9,2,18,5,10,4,9,6,2,4,14,18,11,10,19,18,20,3,17,19,13,4,9,6,2,0,10,11,14,4,12,12,5
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,F,0,13,14,11,0,6,8,6,2,12,14,9,9,17,12,18,6,17,16,14,0,4,2,5,2,12,14,9,10,11,18,20,0,5,7,8
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Year 2,M,5,10,6,6,1,20,5,18,4,9,6,2,10,13,15,19,2,18,16,13,1,19,5,12,4,9,6,2,1,13,15,18,3,19,8,16
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,F,1,11,14,15,0,9,9,2,2,12,14,9,7,17,18,14,9,18,13,14,0,9,2,10,2,12,14,9,0,17,19,19,0,4,6,4
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Evening,M,4,10,6,5,3,13,19,5,4,9,6,2,8,17,10,18,3,11,20,11,4,18,17,20,4,9,6,2,8,12,16,13,4,19,18,7
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,F,4,12,12,13,0,9,3,8,2,12,14,9,0,18,11,18,1,13,13,10,0,6,2,8,2,12,14,9,9,16,20,13,0,10,5,6
I can clean the file and set up a multi-index with pandas and numpy:
df = pd.read_csv("CollegeGrades2.csv", index_col=[0,1], header=[0,1,2], skiprows=lambda x: x%2 == 1)
df.columns = pd.MultiIndex.from_frame(df.columns.to_frame().apply(lambda x: np.where(x.str.contains('Unnamed'), np.nan, x)).ffill())
df.index = pd.MultiIndex.from_frame(df.index.to_frame().ffill())
df.groupby(level=0, sort=False).sum()
However, my issue is that I want to total the subjects, e.g. College 1 Geography + College 3 Geography, and display them in the following output:
I have tried separating them out into different data frames, summing them and then concatenating them, but in doing so I lose the headings. For example:
music = df2["College 1", "Music"]
geography = df2["College 1", "Geography"] + df2["College 3", "Geography"]
pd.concat([music,geography], axis=1).groupby(level=0, sort=False).sum()
How can I sum the subjects while maintaining my desired output? Any help would be appreciated. Thank you.
You can also group by the column:
df.groupby(level=[1, 2], axis=1).sum().groupby(level=0).sum()
Result:
1        Biology     Business    Geography   Mathematics Music       Technology
2         D  F  M  P  D  F  M  P  D  F  M  P  D  F  M  P  D  F  M  P  D  F  M  P
0
Evening  47 21 69 52 22 12 40 42 41  7 41 46 36  8 21 35 18  8 18 22 13  4 23 29
Year 1   68 26 63 57 22 12 40 42 34  5 34 45 29 13 30 31 20  0 21 18 13  4 19 17
Year 2   64 12 63 66 22 12 40 42 42  2 21 57 33 17 33 30 21  6 20 21 20  3 14 23
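In recent pandas releases (2.1 and later, as far as I am aware) passing axis=1 to groupby raises a deprecation warning. A minimal sketch of the same column-wise grouping without axis=1 is to transpose, group on the column levels, and transpose back. The frame below is a small hand-built subset of the question's data (Music and Geography, grades F and P only), and the names by_subject and totals are illustrative.

```python
import pandas as pd

# Three column levels: college, subject, grade (a subset of the question's CSV).
cols = pd.MultiIndex.from_tuples([
    ("College 1", "Music", "F"), ("College 1", "Music", "P"),
    ("College 1", "Geography", "F"), ("College 1", "Geography", "P"),
    ("College 3", "Geography", "F"), ("College 3", "Geography", "P"),
])
idx = pd.MultiIndex.from_tuples([
    ("Year 1", "M"), ("Year 1", "F"), ("Year 2", "M"), ("Year 2", "F"),
])
df = pd.DataFrame(
    [[0, 5, 2, 18, 3, 17],
     [0, 13, 0, 6, 0, 4],
     [5, 10, 1, 20, 1, 19],
     [1, 11, 0, 9, 0, 9]],
    index=idx, columns=cols,
)

# Sum columns that share (subject, grade), ignoring the college level.
by_subject = df.T.groupby(level=[1, 2]).sum().T

# Then total the M and F rows within each year.
totals = by_subject.groupby(level=0, sort=False).sum()
print(totals)
```

Because the college level is dropped by the first groupby, Geography from College 1 and College 3 ends up in a single pair of columns while the subject and grade headings are kept.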
average of one wrt another or averageifs in python
I have a pandas df as displayed. I would like to calculate an Avg Rate by DC by Brand column, which is similar to AVERAGEIF in Excel. I have tried methods like groupby with mean(), but that does not give the correct results.
Your question is not clear but you may be looking for: df.groupby(['DC','Brand'])['Rate'].mean()
AVERAGEIF in Excel returns a column which is the same size as your original data. So I think you're looking for groupby(...).transform():
# Sample DF
Brand Rate
0 A 45
1 B 100
2 C 28
3 A 92
4 B 2
5 C 79
6 A 48
7 B 97
8 C 72
9 D 14
10 D 16
11 D 64
12 E 85
13 E 22
Result:
df['Avg Rate by Brand'] = df.groupby('Brand')['Rate'].transform('mean')
print(df)
Brand Rate Avg Rate by Brand
0 A 45 61.666667
1 B 100 66.333333
2 C 28 59.666667
3 A 92 61.666667
4 B 2 66.333333
5 C 79 59.666667
6 A 48 61.666667
7 B 97 66.333333
8 C 72 59.666667
9 D 14 31.333333
10 D 16 31.333333
11 D 64 31.333333
12 E 85 53.500000
13 E 22 53.500000
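Since the question asks for the average by DC and by Brand, the same transform idea can be applied with two grouping keys. The sketch below uses made-up sample data (the DC values "North" and "South" and the rates are assumptions, as the original frame was not shown) and is only meant to illustrate the shape of the call.

```python
import pandas as pd

# Hypothetical data: the question's real frame was only shown as an image.
df = pd.DataFrame({
    "DC": ["North", "North", "North", "South", "South", "South"],
    "Brand": ["A", "A", "B", "A", "B", "B"],
    "Rate": [45, 92, 100, 48, 2, 97],
})

# transform('mean') returns one value per original row, like AVERAGEIFS in Excel.
df["Avg Rate by DC and Brand"] = (
    df.groupby(["DC", "Brand"])["Rate"].transform("mean")
)
print(df)
```

Unlike groupby(...).mean(), which collapses each group to one row, transform keeps the original row count, so the result can be assigned straight back as a new column.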
Getting a list of values of a column depending on the conditions of other columns
For the given dataframe df as:
Election Yr. Party Region Votes
0 2000 A a 50
1 2000 A b 30
2 2000 B a 40
3 2000 B b 50
4 2000 C a 30
5 2000 C c 40
6 2004 A a 20
7 2004 A b 30
8 2004 B a 40
9 2004 B b 50
10 2004 C a 60
11 2004 C b 40
12 2008 A a 30
13 2008 A c 30
14 2008 B a 80
15 2008 B b 50
16 2008 C a 60
17 2008 C b 40
How can I find the list of regions that have a different winner in every election? The winner is decided by the total votes for a party in a year.
First you need to figure out the winner in every election for each region, essentially the party with the highest vote.
winners = df.groupby(['Election Yr.', 'Region']).apply(lambda x: x.set_index('Party').Votes.idxmax())
Then you can figure out for each region how many different winners there have been:
n_unique_winners = winners.groupby(['Region']).nunique()
You can also figure out how many elections have occurred in each region:
n_elections = winners.groupby(['Region']).size()
Entries with a true value in n_unique_winners == n_elections are the regions you are looking for. To get a list of these regions, you can do
n_unique_winners[n_unique_winners == n_elections].index.values
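The steps in this answer can be put together into a runnable sketch using the data from the question. This variant picks the winning row with idxmax on the Votes column instead of apply, which is a swapped-in technique rather than the answerer's exact code; it also keeps the answer's reading that the winner is decided per region and year. The names top, winners and regions are illustrative.

```python
import pandas as pd

# Data from the question.
df = pd.DataFrame({
    "Election Yr.": [2000] * 6 + [2004] * 6 + [2008] * 6,
    "Party": ["A", "A", "B", "B", "C", "C"] * 3,
    "Region": ["a", "b", "a", "b", "a", "c",
               "a", "b", "a", "b", "a", "b",
               "a", "c", "a", "b", "a", "b"],
    "Votes": [50, 30, 40, 50, 30, 40,
              20, 30, 40, 50, 60, 40,
              30, 30, 80, 50, 60, 40],
})

# Row with the most votes for each (year, region) pair.
top = df.loc[df.groupby(["Election Yr.", "Region"])["Votes"].idxmax()]
winners = top.set_index(["Election Yr.", "Region"])["Party"]

# A region qualifies when it has as many distinct winners as elections.
n_unique_winners = winners.groupby(level="Region").nunique()
n_elections = winners.groupby(level="Region").size()
regions = n_unique_winners[n_unique_winners == n_elections].index.tolist()
print(regions)
```

Ties are not handled specially: idxmax returns the first row with the maximum, so a tied election would silently pick whichever party appears first.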
How to unpack a list of tuple in various length in a panda dataframe?
ID LIST_OF_TUPLE (2col)
1 [('2012','12'), ('2012','33'), ('2014', '82')]
2 NA
3 [('2012','12')]
4 [('2012','12'), ('2012','33'), ('2014', '82'), ('2022', '67')]
Result:
ID TUP_1 TUP_2 (3col)
1 2012 12
1 2012 33
1 2014 82
3 2012 12
4 2012 12
4 2012 33
4 2014 82
4 2022 67
Thanks in advance.
Use explode, then build a dataframe from the exploded tuples and join it back to the ID column:
s = df['LIST_OF_TUPLE'].explode()
out = (df[['ID']].join(pd.DataFrame(s.tolist(),index=s.index)
.add_prefix("TUP_")).reset_index(drop=True))
# you can chain a dropna if required
print(out)
ID TUP_0 TUP_1
0 1 2012 12
1 1 2012 33
2 1 2014 82
3 2 NaN None
4 3 2012 12
5 4 2012 12
6 4 2012 33
7 4 2014 82
8 4 2022 67
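To match the requested result exactly (no row for ID 2 and columns named TUP_1 and TUP_2), one possible variant drops the missing entries before exploding and names the columns explicitly. This is a sketch that assumes the missing value is stored as None or NaN rather than the string "NA"; the names pairs and out are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "LIST_OF_TUPLE": [
        [("2012", "12"), ("2012", "33"), ("2014", "82")],
        None,  # assumed representation of the "NA" row
        [("2012", "12")],
        [("2012", "12"), ("2012", "33"), ("2014", "82"), ("2022", "67")],
    ],
})

# Drop missing lists first, then give each tuple its own row.
s = df["LIST_OF_TUPLE"].dropna().explode()

# Split each tuple into two named columns, keeping the original row labels.
pairs = pd.DataFrame(s.tolist(), index=s.index, columns=["TUP_1", "TUP_2"])

# The inner join on the index keeps only IDs that had at least one tuple.
out = df[["ID"]].join(pairs, how="inner").reset_index(drop=True)
print(out)
```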
How can I get this series to a pandas dataframe?
I have some data and after using a groupby function I now have a series that looks like this:
year
1997 15
1998 22
1999 24
2000 24
2001 28
2002 11
2003 15
2004 19
2005 10
2006 10
2007 21
2008 26
2009 23
2010 16
2011 33
2012 19
2013 26
2014 25
How can I create a pandas dataframe from here with year as one column and the other column named sightings? I am a pandas novice so don't really know what I am doing. I have tried the reindex and unstack functions but haven't been able to get what I want...
You can use reset_index and rename the columns:
print (df.reset_index())
index year
0 1997 15
1 1998 22
2 1999 24
3 2000 24
4 2001 28
5 2002 11
6 2003 15
7 2004 19
8 2005 10
9 2006 10
10 2007 21
11 2008 26
12 2009 23
13 2010 16
14 2011 33
15 2012 19
16 2013 26
17 2014 25
print (df.reset_index().rename(columns=({'index':'year','year':'sightings'})))
year sightings
0 1997 15
1 1998 22
2 1999 24
3 2000 24
4 2001 28
5 2002 11
6 2003 15
7 2004 19
8 2005 10
9 2006 10
10 2007 21
11 2008 26
12 2009 23
13 2010 16
14 2011 33
15 2012 19
16 2013 26
17 2014 25
Another solution is to set the column names from a list of names:
df1 = df.reset_index()
df1.columns = ['year','sightings']
print (df1)
year sightings
0 1997 15
1 1998 22
2 1999 24
3 2000 24
4 2001 28
5 2002 11
6 2003 15
7 2004 19
8 2005 10
9 2006 10
10 2007 21
11 2008 26
12 2009 23
13 2010 16
14 2011 33
15 2012 19
16 2013 26
17 2014 25
EDIT: Sometimes it helps to pass the parameter as_index=False to groupby so that it returns a DataFrame:
import pandas as pd
df = pd.DataFrame({'A':[1,1,3], 'B':[4,5,6]})
print (df)
A B
0 1 4
1 1 5
2 3 6
print (df.groupby('A')['B'].sum())
A
1 9
3 6
Name: B, dtype: int64
print (df.groupby('A', as_index=False)['B'].sum())
A B
0 1 9
1 3 6
s.rename('sightings').reset_index()
I've also used this method during the groupby stage to put the results straight into a dataframe:
df2 = df1.groupby(['Year']).count()
df3 = pd.DataFrame(df2).reset_index()
If your original dataframe, df1, had "Year" and "Sightings" as its two columns, then df3 should have each year listed under "Year" and the count (or sum, average, whatever) listed under "Sightings". If not, you can change the column names by doing the following:
df3.columns = ['Year','Sightings']
or
df3 = df3.rename(columns={'oldname_A': 'Year', 'oldname_B': 'Sightings'})
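For reference, here is a small self-contained sketch of the Series-to-DataFrame conversion discussed in these answers. It assumes the Series index is named "year", as in the question's printout, and uses only the first three values; the variable names are illustrative.

```python
import pandas as pd

# A short stand-in for the groupby result shown in the question.
s = pd.Series([15, 22, 24], index=pd.Index([1997, 1998, 1999], name="year"))

# Name the values, then move the index into a regular column.
df_a = s.rename("sightings").reset_index()

# Equivalent one-step form: reset_index accepts the value column's name.
df_b = s.reset_index(name="sightings")

print(df_a)
```

Both forms give a two-column frame with "year" and "sightings"; which one reads better is mostly a matter of taste.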