Enter Missing Year Amounts with Zeros After GroupBy in Pandas - python
I am grouping the following rows:

df = df.groupby(['id', 'year']).sum().sort_index(ascending=[True, False])
print(df)
         amount
id year
1  2009     120
   2008     240
   2007     240
   2006     240
   2005     240
2  2014     100
   2013      50
   2012      50
   2011     100
   2010      50
   2006     100
...         ...
Is there a way to add the missing years for each id, with amount set to zero, covering every year down to a specific one (here 2005), as I am showing below?
Expected Output:
         amount
id year
1  2015       0
   2014       0
   2013       0
   2012       0
   2011       0
   2010       0
   2009     120
   2008     240
   2007     240
   2006     240
   2005     240
2  2015       0
   2014     100
   2013      50
   2012      50
   2011     100
   2010      50
   2009       0
   2008       0
   2007       0
   2006     100
   2005       0
...         ...
Starting with your first DataFrame, this will add, for every id, all the years that occur for at least one id:

df = df.unstack().fillna(0).stack()
e.g.

In [16]: df
Out[16]:
         amt
id year
1  2001    1
   2002    2
   2003    3
2  2002    4
   2003    5
   2004    6

In [17]: df = df.unstack().fillna(0).stack()

In [18]: df
Out[18]:
         amt
id year
1  2001    1
   2002    2
   2003    3
   2004    0
2  2001    0
   2002    4
   2003    5
   2004    6
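Note that unstack/stack only adds years that already occur for at least one id; if no id has 2015, it still won't appear. A minimal sketch (with made-up rows) that guarantees the full 2005-2015 range for every id by reindexing against a complete (id, year) grid:

```python
import pandas as pd

# Sample grouped frame with a MultiIndex (id, year), as in the question.
df = pd.DataFrame(
    {"amount": [120, 240, 100, 50]},
    index=pd.MultiIndex.from_tuples(
        [(1, 2009), (1, 2008), (2, 2014), (2, 2013)], names=["id", "year"]
    ),
)

# Build the full (id, year) grid for 2005-2015 and reindex against it,
# filling the missing combinations with 0.
ids = df.index.get_level_values("id").unique()
full = pd.MultiIndex.from_product([ids, range(2005, 2016)], names=["id", "year"])
out = df.reindex(full, fill_value=0)
```

This works even when a year is absent from every id, which the unstack trick cannot cover.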
Related
How to create multiple triangles using another dataframe? [closed]
Below is my code:

triangle = cl.load_sample('genins')
# Use the bootstrap sampler to get resampled triangles
bootstrapdataframe = cl.BootstrapODPSample(n_sims=4, random_state=42).fit(triangle).resampled_triangles_
# Convert to a dataframe
resampledtriangledf = bootstrapdataframe.to_frame()
print(resampledtriangledf)

In the code above I set n_sims (the number of simulations) to 4, so it generates a long dataframe with one row per (simulation, origin year, development age, value), e.g.:

0 2001  12    254,926
0 2001  24    535,877
0 2001  36  1,355,613
0 2001  48  2,034,557
0 2001  60  2,311,789
...
3 2009  24  1,199,897
3 2010  12    266,359

Using this dataframe I have to create 4 triangles (one per simulation) based on the Total column, for example:

Row Labels       12         24         36         48         60         72         84         96        108        120  Grand Total
2001        254,926    535,877  1,355,613  2,034,557  2,311,789  2,539,807  2,724,773  3,187,095  3,498,646  3,586,037   22,029,119
2002        542,369  1,016,927  2,201,329  2,923,381  3,711,305  3,914,829  4,385,757  4,596,072  5,047,861              28,339,832
2003        235,361    960,355  1,661,972  2,643,370  3,372,684  3,642,605  4,160,583  4,480,332                        21,157,261
2004        764,553  1,703,557  2,498,418  3,198,358  3,524,562  3,884,971  4,268,241                                   19,842,659
2005        381,670  1,124,054  2,026,434  2,863,902  3,039,322  3,288,253                                              12,723,635
2006        320,332  1,022,323  1,830,842  2,676,710  3,375,172                                                          9,225,377
2007        330,361  1,463,348  2,771,839  4,003,745                                                                     8,569,294
2008        282,143  1,782,267  2,898,699                                                                                4,963,110
2009        362,726  1,277,750                                                                                           1,640,475
2010        321,247                                                                                                        321,247
Grand Total 3,795,687 10,886,456 17,245,147 20,344,022 19,334,833 17,270,466 15,539,355 12,263,499 8,546,507 3,586,037 128,812,009

Like this I need 4 triangles (4 being the number of simulations). If the user passes n_sims=900, then 900 sets of totals are generated and I have to create 900 triangles.
Use pivot_table and choose the aggregation function (sum here, but you can use mean or whatever else):

df = df.pivot_table(index="origin", columns="development", values="values", aggfunc="sum")
df = df.set_index(df.index.year)
df.loc["Grand Total"] = df.sum()
df.loc[:, "Grand Total"] = df.sum(axis=1)

>>> df
development            12            24  ...           120   Grand Total
origin
2001         1.356449e+09  4.695043e+09  ...  1.951240e+10  1.293306e+11
2002         1.887634e+09  6.573443e+09  ...           NaN  1.529823e+11
2003         1.866031e+09  6.531145e+09  ...           NaN  1.252261e+11
2004         1.842447e+09  6.411653e+09  ...           NaN  9.916214e+10
2005         1.688064e+09  5.876106e+09  ...           NaN  6.977932e+10
2006         1.762834e+09  6.154760e+09  ...           NaN  5.281948e+10
2007         1.968264e+09  6.855178e+09  ...           NaN  3.818962e+10
2008         2.344669e+09  8.218527e+09  ...           NaN  2.489507e+10
2009         1.955145e+09  6.813284e+09  ...           NaN  8.768429e+09
2010         1.716057e+09           NaN  ...           NaN  1.716057e+09
Grand Total  1.838759e+10  5.812914e+10  ...  1.951240e+10  7.028691e+11

The code above works for the following input data:

>>> df
          origin  development        values
0     2001-01-01           12  3.766810e+05
0     2001-01-01           24  1.025411e+06
0     2001-01-01           36  1.541503e+06
0     2001-01-01           48  2.155232e+06
0     2001-01-01           60  2.422287e+06
...          ...          ...           ...
4999  2008-01-01           24  2.403488e+06
4999  2008-01-01           36  3.100034e+06
4999  2009-01-01           12  3.747304e+05
4999  2009-01-01           24  1.262821e+06
4999  2010-01-01           12  2.469928e+05

[275000 rows x 3 columns]
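If you need one triangle per simulation rather than a single aggregate over all of them, one sketch (with a hypothetical sim column and made-up numbers) is to group on the simulation id first and pivot each group separately:

```python
import pandas as pd

# Hypothetical long-format frame: one row per (sim, origin, development).
df = pd.DataFrame({
    "sim": [0, 0, 0, 1, 1, 1],
    "origin": [2001, 2001, 2002, 2001, 2001, 2002],
    "development": [12, 24, 12, 12, 24, 12],
    "values": [100.0, 200.0, 50.0, 110.0, 210.0, 60.0],
})

# One pivoted triangle per simulation, keyed by the sim id; with n_sims=900
# this produces 900 entries.
triangles = {
    sim: g.pivot_table(index="origin", columns="development",
                       values="values", aggfunc="sum")
    for sim, g in df.groupby("sim")
}
```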
Sum by year and total_vehicles pandas dataframe
I have the following dataframe, lrdata3, and I would like to sum total_vehicles for every year instead of having multiple separate rows for the same year:

   year  total_vehicles
0  2000            2016
1  2000            1483
2  2000            1275
3  2000            1086
4  2000             816

When I do lrdata3.groupby('year')['total_vehicles'].sum() I get the following, which is a Series rather than a DataFrame:

year
2000    419587299
2001    425832533
2002    430480581
2003    434270003
2004    442680113
2005    443366960
2006    452086899
2007    452280161
2008    445462026
2009    443333980
2010    438827716
2011    440461505
2012    440073277
2013    441751395
2014    451394270
2015    460050397
2016    470256985
2017    474693803
2018    473765568

Any help please? Thanks
You can do it in one line and get a DataFrame back with this syntax. Some sample data:

   year  total_vehicles
0  2000            2016
1  2000            1483
2  2000            1275
3  2000            1086
4  2000             816
5  2001            2016
6  2001            1483
7  2001            1275
8  2002            1086
9  2002             816

df = pd.read_clipboard()
gb = df.groupby('year').agg({'total_vehicles': 'sum'})
print(gb)

      total_vehicles
year
2000            6676
2001            4774
2002            1902

print(type(gb))
<class 'pandas.core.frame.DataFrame'>
Your code is fine; just add a .reset_index() to it, like this:

lrdata3.groupby('year')['total_vehicles'].sum().reset_index()

This will get you what you want.
Use to_frame:

lrdata3.groupby('year')['total_vehicles'].sum().to_frame()

or groupby and transform, which keeps the original rows:

lrdata3['yearlytotal_vehicles'] = lrdata3.groupby('year')['total_vehicles'].transform('sum')
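As a quick check, all of these approaches hand back a DataFrame; a minimal sketch with made-up numbers:

```python
import pandas as pd

lrdata3 = pd.DataFrame({"year": [2000, 2000, 2001],
                        "total_vehicles": [10, 20, 30]})

a = lrdata3.groupby('year')['total_vehicles'].sum().reset_index()    # year becomes a column
b = lrdata3.groupby('year')['total_vehicles'].sum().to_frame()       # year stays as the index
c = lrdata3.groupby('year', as_index=False)['total_vehicles'].sum()  # same result as `a`
```

The only difference is whether year ends up as a column (`a`, `c`) or as the index (`b`).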
Finding all values in between specific values in data frame
I have this dataframe:

df
  name  timestamp  year
0    A       2004  1995
1    D       2008  2004
2    M       2005  2006
3    T       2003  2007
4    B       1995  2008
5    C       2007  2003
6    D       2005  2001
7    E       2009  2005
8    A       2018  2009
9    L       2016  2018

Based on the first two entries in df['timestamp'], I fetch all the values from df['year'] that fall between those two entries, which in this case is (2004-2008):

y1 = df['timestamp'].iloc[0]
y2 = df['timestamp'].iloc[1]
movies = df[df['year'].between(y1, y2, inclusive=True)]

movies
  name  timestamp  year
1    D       2008  2004
2    M       2005  2006
3    T       2003  2007
4    B       1995  2008
7    E       2009  2005

This works fine for me. But when the first entry holds the greater value and the second the lower one (e.g. 2008-2004), the result is empty:

df
  name  timestamp  year
0    A       2008  1995
1    D       2004  2004
2    M       2005  2006
3    T       2003  2007
4    B       1995  2008
5    C       2007  2003
6    D       2005  2001
7    E       2009  2005
8    A       2018  2009
9    L       2016  2018

In this case I fetch nothing. Expected outcome: whether the first value is greater or smaller, I should get the in-between values every time.
You could use Series.head and Series.agg:

y1, y2 = df['timestamp'].head(2).agg(['min', 'max'])
movies = df[df['year'].between(y1, y2, inclusive=True)]

[out]
  name  timestamp  year
1    D       2004  2004
2    M       2005  2006
3    T       2003  2007
4    B       1995  2008
7    E       2009  2005
You can fix that by changing just two lines of code:

y1 = min(df['timestamp'].iloc[0], df['timestamp'].iloc[1])
y2 = max(df['timestamp'].iloc[0], df['timestamp'].iloc[1])

This way y1 is always less than or equal to y2. However, as @ALollz pointed out, you can save both computation and coding time by using:

y1, y2 = np.sort(df['timestamp'].head(2))
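A minimal runnable sketch (with made-up data) of the order-insensitive version:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "timestamp": [2008, 2004, 2005],
    "year": [1995, 2004, 2006],
})

# Sorting the first two timestamps makes the bounds order-insensitive:
# y1 is always the smaller value, y2 the larger.
y1, y2 = np.sort(df["timestamp"].head(2))
movies = df[df["year"].between(y1, y2)]
```

The same filter now works whether the first timestamp is the larger or the smaller of the pair.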
How to add a column with the growth rate in a budget table in Pandas?
I would like to know how I can add a year-on-year growth rate to the following data in Pandas:

    Date  Total Managed Expenditure
0   2001                      503.2
1   2002                      529.9
2   2003                      559.8
3   2004                      593.2
4   2005                      629.5
5   2006                      652.1
6   2007                      664.3
7   2008                      688.2
8   2009                      732.0
9   2010                      759.2
10  2011                      769.2
11  2012                      759.8
12  2013                      760.6
13  2014                      753.3
14  2015                      757.6
15  2016                      753.9
Use Series.pct_change():

df['Total Managed Expenditure'].pct_change()

Out:
0          NaN
1     0.053060
2     0.056426
3     0.059664
4     0.061194
5     0.035902
6     0.018709
7     0.035978
8     0.063644
9     0.037158
10    0.013172
11   -0.012220
12    0.001053
13   -0.009598
14    0.005708
15   -0.004884
Name: Total Managed Expenditure, dtype: float64

To assign it back:

df['Growth Rate'] = df['Total Managed Expenditure'].pct_change()
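Note that pct_change returns a fraction (0.053 means 5.3%); if you want percentages, scale by 100. A small sketch using the first few rows of the question's data:

```python
import pandas as pd

df = pd.DataFrame({
    "Date": [2001, 2002, 2003],
    "Total Managed Expenditure": [503.2, 529.9, 559.8],
})

# pct_change gives a fraction; .mul(100) turns it into a percentage.
df["Growth Rate (%)"] = df["Total Managed Expenditure"].pct_change().mul(100).round(2)
```

The first row is NaN because there is no previous year to compare against.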
How can I get this series to a pandas dataframe?
I have some data, and after using a groupby function I now have a Series that looks like this:

year
1997    15
1998    22
1999    24
2000    24
2001    28
2002    11
2003    15
2004    19
2005    10
2006    10
2007    21
2008    26
2009    23
2010    16
2011    33
2012    19
2013    26
2014    25

How can I create a pandas DataFrame from here, with year as one column and the other column named sightings? I am a pandas novice, so I don't really know what I am doing. I have tried the reindex and unstack functions but haven't been able to get what I want.
You can use reset_index and rename the columns:

print(df.reset_index())

    index  year
0    1997    15
1    1998    22
2    1999    24
3    2000    24
4    2001    28
5    2002    11
6    2003    15
7    2004    19
8    2005    10
9    2006    10
10   2007    21
11   2008    26
12   2009    23
13   2010    16
14   2011    33
15   2012    19
16   2013    26
17   2014    25

print(df.reset_index().rename(columns={'index': 'year', 'year': 'sightings'}))

    year  sightings
0   1997         15
1   1998         22
...
17  2014         25

Another solution is to set the column names from a list of names:

df1 = df.reset_index()
df1.columns = ['year', 'sightings']
print(df1)

    year  sightings
0   1997         15
1   1998         22
...
17  2014         25

EDIT: Sometimes it helps to add the parameter as_index=False to groupby so that it returns a DataFrame directly:

import pandas as pd

df = pd.DataFrame({'A': [1, 1, 3], 'B': [4, 5, 6]})
print(df)
   A  B
0  1  4
1  1  5
2  3  6

print(df.groupby('A')['B'].sum())
A
1    9
3    6
Name: B, dtype: int64

print(df.groupby('A', as_index=False)['B'].sum())
   A  B
0  1  9
1  3  6
A one-liner, if your Series is s:

s.rename('sightings').reset_index()
I've also used this method during the groupby stage to put the results straight into a dataframe:

df2 = df1.groupby(['Year']).count()
df3 = pd.DataFrame(df2).reset_index()

If your original dataframe, df1, had "Year" and "Sightings" as its two columns, then df3 should have each year listed under "Year" and the count (or sum, average, whatever) listed under "Sightings". If not, you can change the column names by doing the following:

df3.columns = ['Year', 'Sightings']

or

df3 = df3.rename(columns={'oldname_A': 'Year', 'oldname_B': 'Sightings'})
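Another compact variant worth knowing (a sketch with made-up numbers): reset_index(name=...) names the values column in one step.

```python
import pandas as pd

# A groupby-style Series with the index named "year", as in the question.
s = pd.Series([15, 22, 24], index=pd.Index([1997, 1998, 1999], name="year"))

# reset_index(name=...) turns the index into a column and names the
# values column at the same time.
df = s.reset_index(name="sightings")
```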