Pandas transform function to do custom row manipulation - python
We want to create a column in the dataframe called feature_col that holds the range of the current value and the previous two values, i.e. the difference between their max and min, as shown in the table below. How can we calculate this in pandas?
There are several IDs in the dataset.
ID Year percentage
123 2009 0
123 2010 -27
123 2011 0
123 2012 -50
123 2013 3
123 2014 -3
123 2015 0
123 2016 -28
123 2017 -5
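For reproducibility, the sample frame can be built like this (a sketch assuming the percentage column is stored as strings with a trailing % sign, which is how it appears in the answer output below):

import pandas as pd

df = pd.DataFrame({
    'ID': [123] * 9,
    'Year': list(range(2009, 2018)),
    'percentage': ['0%', '-27%', '0%', '-50%', '3%', '-3%', '0%', '-28%', '-5%'],
})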
Use Series.rolling with the numpy function np.ptp (peak-to-peak, i.e. max minus min), but first, if necessary, strip the % sign and convert the values to numbers:
import numpy as np

df['feature_col'] = df['percentage'].str.strip('%').astype(int).rolling(3).apply(np.ptp)
print (df)
ID Year percentage feature_col
0 123 2009 0% NaN
1 123 2010 -27% NaN
2 123 2011 0% 27.0
3 123 2012 -50% 50.0
4 123 2013 3% 53.0
5 123 2014 -3% 53.0
6 123 2015 0% 6.0
7 123 2016 -28% 28.0
8 123 2017 -5% 28.0
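np.ptp is simply the peak-to-peak range (max minus min) of each window, so an equivalent that avoids apply, and is often faster on large frames, is a rolling max minus a rolling min. A minimal sketch (the intermediate name num is only for illustration):

num = df['percentage'].str.strip('%').astype(int)
df['feature_col'] = num.rolling(3).max() - num.rolling(3).min()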
If the output needs to keep the % sign, it is possible to use:
df['feature_col'] = (df['percentage'].str.strip('%')
.astype(int)
.rolling(3)
.apply(np.ptp)
.mask(lambda x: x.notna(), lambda x: x.astype('Int64').astype(str).add('%'))
)
print (df)
ID Year percentage feature_col
0 123 2009 0% NaN
1 123 2010 -27% NaN
2 123 2011 0% 27%
3 123 2012 -50% 50%
4 123 2013 3% 53%
5 123 2014 -3% 53%
6 123 2015 0% 6%
7 123 2016 -28% 28%
8 123 2017 -5% 28%
EDIT: If you need processing per group, by ID:
print (df)
ID Year percentage
0 123 2009 0%
1 123 2010 -27%
2 123 2011 0%
3 123 2012 -50%
4 123 2013 3%
5 124 2014 -3%
6 124 2015 0%
7 124 2016 -28%
8 124 2017 -5%
df['feature_col'] = (df['percentage'].str.strip('%')
.astype(int)
.groupby(df['ID'])
.rolling(3)
.apply(np.ptp)
.reset_index(level=0, drop=True))
print (df)
ID Year percentage feature_col
0 123 2009 0% NaN
1 123 2010 -27% NaN
2 123 2011 0% 27.0
3 123 2012 -50% 50.0
4 123 2013 3% 53.0
5 124 2014 -3% NaN
6 124 2015 0% NaN
7 124 2016 -28% 28.0
8 124 2017 -5% 28.0
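A small performance note: by default Rolling.apply passes each window to the function as a Series; in recent pandas versions you can pass raw=True so that np.ptp receives a plain numpy array instead, which is typically faster on large frames:

df['feature_col'] = (df['percentage'].str.strip('%')
                        .astype(int)
                        .groupby(df['ID'])
                        .rolling(3)
                        .apply(np.ptp, raw=True)
                        .reset_index(level=0, drop=True))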
Related
How to create multiple triangles using another dataframe? [closed]
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers. This question does not appear to be about programming within the scope defined in the help center. Closed 1 year ago. Improve this question Below is my code: triangle = cl.load_sample('genins') # Use bootstrap sampler to get resampled triangles bootstrapdataframe = cl.BootstrapODPSample(n_sims=4, random_state=42).fit(triangle).resampled_triangles_ #converting to dataframe resampledtriangledf = bootstrapdataframe.to_frame() print(resampledtriangledf) In above code i mentioned n_sims(number of simulation)=4. So it generates below datafame: 0 2001 12 254,926 0 2001 24 535,877 0 2001 36 1,355,613 0 2001 48 2,034,557 0 2001 60 2,311,789 0 2001 72 2,539,807 0 2001 84 2,724,773 0 2001 96 3,187,095 0 2001 108 3,498,646 0 2001 120 3,586,037 0 2002 12 542,369 0 2002 24 1,016,927 0 2002 36 2,201,329 0 2002 48 2,923,381 0 2002 60 3,711,305 0 2002 72 3,914,829 0 2002 84 4,385,757 0 2002 96 4,596,072 0 2002 108 5,047,861 0 2003 12 235,361 0 2003 24 960,355 0 2003 36 1,661,972 0 2003 48 2,643,370 0 2003 60 3,372,684 0 2003 72 3,642,605 0 2003 84 4,160,583 0 2003 96 4,480,332 0 2004 12 764,553 0 2004 24 1,703,557 0 2004 36 2,498,418 0 2004 48 3,198,358 0 2004 60 3,524,562 0 2004 72 3,884,971 0 2004 84 4,268,241 0 2005 12 381,670 0 2005 24 1,124,054 0 2005 36 2,026,434 0 2005 48 2,863,902 0 2005 60 3,039,322 0 2005 72 3,288,253 0 2006 12 320,332 0 2006 24 1,022,323 0 2006 36 1,830,842 0 2006 48 2,676,710 0 2006 60 3,375,172 0 2007 12 330,361 0 2007 24 1,463,348 0 2007 36 2,771,839 0 2007 48 4,003,745 0 2008 12 282,143 0 2008 24 1,782,267 0 2008 36 2,898,699 0 2009 12 362,726 0 2009 24 1,277,750 0 2010 12 321,247 1 2001 12 219,021 1 2001 24 755,975 1 2001 36 1,360,298 1 2001 48 2,062,947 1 2001 60 2,356,983 1 2001 72 2,781,187 1 2001 84 2,987,837 1 2001 96 3,118,952 1 2001 108 3,307,522 1 2001 120 3,455,107 1 2002 12 302,932 1 2002 24 1,022,459 1 2002 36 1,634,938 1 2002 48 2,538,708 1 2002 60 3,005,695 1 2002 72 3,274,719 1 2002 84 3,356,499 1 2002 96 3,595,361 1 2002 108 4,100,065 1 2003 12 489,934 1 2003 24 1,233,438 1 2003 36 2,471,849 1 2003 48 3,672,629 1 2003 60 4,157,489 1 2003 72 4,498,470 1 2003 84 4,587,579 1 2003 96 4,816,232 1 2004 12 518,680 1 2004 24 1,209,705 1 2004 36 2,019,757 1 2004 48 2,997,820 1 2004 60 3,630,442 1 2004 72 3,881,093 1 2004 84 4,080,322 1 2005 12 453,963 1 2005 24 1,458,504 1 2005 36 2,036,506 1 2005 48 2,846,464 1 2005 60 3,280,124 1 2005 72 3,544,597 1 2006 12 369,755 1 2006 24 1,209,117 1 2006 36 1,973,136 1 2006 48 3,034,294 1 2006 60 3,537,784 1 2007 12 477,788 1 2007 24 1,524,537 1 2007 36 2,170,391 1 2007 48 3,355,093 1 2008 12 250,690 1 2008 24 1,546,986 1 2008 36 2,996,737 1 2009 12 271,270 1 2009 24 1,446,353 1 2010 12 510,114 2 2001 12 170,866 2 2001 24 797,338 2 2001 36 1,663,610 2 2001 48 2,293,697 2 2001 60 2,607,067 2 2001 72 2,979,479 2 2001 84 3,127,308 2 2001 96 3,285,338 2 2001 108 3,574,272 2 2001 120 3,630,610 2 2002 12 259,060 2 2002 24 1,011,092 2 2002 36 1,851,504 2 2002 48 2,705,313 2 2002 60 3,195,774 2 2002 72 3,766,008 2 2002 84 3,944,417 2 2002 96 4,234,043 2 2002 108 4,763,664 2 2003 12 239,981 2 2003 24 983,484 2 2003 36 1,929,785 2 2003 48 2,497,929 2 2003 60 2,972,887 2 2003 72 3,313,868 2 2003 84 3,727,432 2 2003 96 4,024,122 2 2004 12 77,522 2 2004 24 729,401 2 2004 36 1,473,914 2 2004 48 2,376,313 2 2004 60 2,999,197 2 2004 72 3,372,020 2 2004 84 3,887,883 2 2005 12 321,598 2 2005 24 1,132,502 2 2005 36 1,710,504 2 2005 48 2,438,620 2 
2005 60 2,801,957 2 2005 72 3,182,466 2 2006 12 255,407 2 2006 24 1,275,141 2 2006 36 2,083,421 2 2006 48 3,144,579 2 2006 60 3,891,772 2 2007 12 338,120 2 2007 24 1,275,697 2 2007 36 2,238,715 2 2007 48 3,615,323 2 2008 12 310,214 2 2008 24 1,237,156 2 2008 36 2,563,326 2 2009 12 271,093 2 2009 24 1,523,131 2 2010 12 430,591 3 2001 12 330,887 3 2001 24 831,193 3 2001 36 1,601,374 3 2001 48 2,188,879 3 2001 60 2,662,773 3 2001 72 3,086,976 3 2001 84 3,332,247 3 2001 96 3,317,279 3 2001 108 3,576,659 3 2001 120 3,613,563 3 2002 12 358,263 3 2002 24 1,139,259 3 2002 36 2,236,375 3 2002 48 3,163,464 3 2002 60 3,715,130 3 2002 72 4,295,638 3 2002 84 4,502,105 3 2002 96 4,769,139 3 2002 108 5,323,304 3 2003 12 489,934 3 2003 24 1,570,352 3 2003 36 3,123,215 3 2003 48 4,189,299 3 2003 60 4,819,070 3 2003 72 5,306,689 3 2003 84 5,560,371 3 2003 96 5,827,003 3 2004 12 419,727 3 2004 24 1,308,884 3 2004 36 2,118,936 3 2004 48 2,906,732 3 2004 60 3,561,577 3 2004 72 3,934,400 3 2004 84 4,010,511 3 2005 12 389,217 3 2005 24 1,173,226 3 2005 36 1,794,216 3 2005 48 2,528,910 3 2005 60 3,474,035 3 2005 72 3,908,999 3 2006 12 291,940 3 2006 24 1,136,674 3 2006 36 1,915,614 3 2006 48 2,693,930 3 2006 60 3,375,601 3 2007 12 506,055 3 2007 24 1,684,660 3 2007 36 2,678,739 3 2007 48 3,545,156 3 2008 12 282,143 3 2008 24 1,536,490 3 2008 36 2,458,789 3 2009 12 271,093 3 2009 24 1,199,897 3 2010 12 266,359 Using above dataframe I have to create 4 triangles based on Toatal column: For example: Row Labels 12 24 36 48 60 72 84 96 108 120 Grand Total 2001 254,926 535,877 1,355,613 2,034,557 2,311,789 2,539,807 2,724,773 3,187,095 3,498,646 3,586,037 22,029,119 2002 542,369 1,016,927 2,201,329 2,923,381 3,711,305 3,914,829 4,385,757 4,596,072 5,047,861 28,339,832 2003 235,361 960,355 1,661,972 2,643,370 3,372,684 3,642,605 4,160,583 4,480,332 21,157,261 2004 764,553 1,703,557 2,498,418 3,198,358 3,524,562 3,884,971 4,268,241 19,842,659 2005 381,670 1,124,054 2,026,434 2,863,902 3,039,322 3,288,253 12,723,635 2006 320,332 1,022,323 1,830,842 2,676,710 3,375,172 9,225,377 2007 330,361 1,463,348 2,771,839 4,003,745 8,569,294 2008 282,143 1,782,267 2,898,699 4,963,110 2009 362,726 1,277,750 1,640,475 2010 321,247 321,247 Grand Total 3,795,687 10,886,456 17,245,147 20,344,022 19,334,833 17,270,466 15,539,355 12,263,499 8,546,507 3,586,037 128,812,009 . . . Like this i need 4 triangles (4 is number of simulation) using 1st dataframe. If user gives s_sims=900 then it creates 900 totals values based on this we have to create 900 triangles.
Use pivot_table and choose the aggregation function (sum here but you can use mean or whatever):

df = df.pivot_table(index="origin", columns="development", values="values", aggfunc="sum")
df = df.set_index(df.index.year)
df.loc["Grand Total"] = df.sum()
df.loc[:, "Grand Total"] = df.sum(axis=1)

>>> df
development            12            24            36            48            60            72            84            96           108           120   Grand Total
origin
2001         1.356449e+09  4.695043e+09  8.226504e+09  1.200121e+10  1.408404e+10  1.555555e+10  1.690673e+10  1.781579e+10  1.917689e+10  1.951240e+10  1.293306e+11
2002         1.887634e+09  6.573443e+09  1.150100e+10  1.671772e+10  1.960781e+10  2.164808e+10  2.352267e+10  2.480478e+10  2.671911e+10           NaN  1.529823e+11
2003         1.866031e+09  6.531145e+09  1.137408e+10  1.657377e+10  1.945944e+10  2.148353e+10  2.334087e+10  2.459720e+10           NaN           NaN  1.252261e+11
2004         1.842447e+09  6.411653e+09  1.120732e+10  1.633725e+10  1.917381e+10  2.117893e+10  2.301072e+10           NaN           NaN           NaN  9.916214e+10
2005         1.688064e+09  5.876106e+09  1.027445e+10  1.496756e+10  1.757424e+10  1.939891e+10           NaN           NaN           NaN           NaN  6.977932e+10
2006         1.762834e+09  6.154760e+09  1.076776e+10  1.569864e+10  1.843549e+10           NaN           NaN           NaN           NaN           NaN  5.281948e+10
2007         1.968264e+09  6.855178e+09  1.195292e+10  1.741326e+10           NaN           NaN           NaN           NaN           NaN           NaN  3.818962e+10
2008         2.344669e+09  8.218527e+09  1.433187e+10           NaN           NaN           NaN           NaN           NaN           NaN           NaN  2.489507e+10
2009         1.955145e+09  6.813284e+09           NaN           NaN           NaN           NaN           NaN           NaN           NaN           NaN  8.768429e+09
2010         1.716057e+09           NaN           NaN           NaN           NaN           NaN           NaN           NaN           NaN           NaN  1.716057e+09
Grand Total  1.838759e+10  5.812914e+10  8.963591e+10  1.097094e+11  1.083348e+11  9.926499e+10  8.678100e+10  6.721778e+10  4.589601e+10  1.951240e+10  7.028691e+11

The code above works for the following input data:

>>> df
           origin  development        values
Total
0      2001-01-01           12  3.766810e+05
0      2001-01-01           24  1.025411e+06
0      2001-01-01           36  1.541503e+06
0      2001-01-01           48  2.155232e+06
0      2001-01-01           60  2.422287e+06
...           ...          ...           ...
4999   2008-01-01           24  2.403488e+06
4999   2008-01-01           36  3.100034e+06
4999   2009-01-01           12  3.747304e+05
4999   2009-01-01           24  1.262821e+06
4999   2010-01-01           12  2.469928e+05

[275000 rows x 3 columns]
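If one triangle per simulation is wanted rather than a single triangle aggregated over all simulations, a minimal sketch is to group the long-format frame on its simulation index level before pivoting. This assumes, as in the input sample above, that the simulation number is the first index level; the names sim and triangles are only for illustration:

triangles = {}
for sim, g in df.groupby(level=0):          # df here is the long-format frame, not the pivoted one
    t = g.pivot_table(index="origin", columns="development", values="values", aggfunc="sum")
    t = t.set_index(t.index.year)           # show origin as a year, as above
    t.loc["Grand Total"] = t.sum()
    t.loc[:, "Grand Total"] = t.sum(axis=1)
    triangles[sim] = t

With n_sims=4 this yields four triangles, one per resampled simulation.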
Replace last value(s) of group with NaN
My goal is to replace the last value (or the last several values) of each id with NaN. My real dataset is quite large and has groups of different sizes.

Example:

import pandas as pd

ids = [1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3]
year = [2000,2001,2002,2003,2004,2005,1990,1991,1992,1993,1994,1995,2010,2011,2012,2013,2014,2015]
percent = [120,70,37,40,50,110,140,100,90,5,52,80,60,40,70,60,50,110]

dictex = {"id": ids, "year": year, "percent [%]": percent}
dfex = pd.DataFrame(dictex)
print(dfex)

    id  year  percent [%]
0    1  2000          120
1    1  2001           70
2    1  2002           37
3    1  2003           40
4    1  2004           50
5    1  2005          110
6    2  1990          140
7    2  1991          100
8    2  1992           90
9    2  1993            5
10   2  1994           52
11   2  1995           80
12   3  2010           60
13   3  2011           40
14   3  2012           70
15   3  2013           60
16   3  2014           50
17   3  2015          110

My goal is to replace the last 1 / 2 / 3 values of the "percent [%]" column for each id (group) with NaN. The result should look like this (here: replace the last 2 values of each id):

    id  year  percent [%]
0    1  2000          120
1    1  2001           70
2    1  2002           37
3    1  2003           40
4    1  2004          NaN
5    1  2005          NaN
6    2  1990          140
7    2  1991          100
8    2  1992           90
9    2  1993            5
10   2  1994          NaN
11   2  1995          NaN
12   3  2010           60
13   3  2011           40
14   3  2012           70
15   3  2013           60
16   3  2014          NaN
17   3  2015          NaN

I know there should be a relatively easy solution for this, but I'm new to Python and simply haven't been able to figure out an elegant way. Thanks for the help!
Try using groupby, tail and index to find the index of the rows that will be modified, and use loc to change the values:

nrows = 2

idx = df.groupby('id').tail(nrows).index
df.loc[idx, 'percent [%]'] = np.nan

#output
    id  year  percent [%]
0    1  2000        120.0
1    1  2001         70.0
2    1  2002         37.0
3    1  2003         40.0
4    1  2004          NaN
5    1  2005          NaN
6    2  1990        140.0
7    2  1991        100.0
8    2  1992         90.0
9    2  1993          5.0
10   2  1994          NaN
11   2  1995          NaN
12   3  2010         60.0
13   3  2011         40.0
14   3  2012         70.0
15   3  2013         60.0
16   3  2014          NaN
17   3  2015          NaN
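If the DataFrame index is not unique, collecting index labels with tail().index and writing through loc could touch more rows than intended. A purely positional alternative (a sketch reusing the same nrows as above) numbers the rows of each group from the end with a reverse cumcount and masks the last ones:

mask = df.groupby('id').cumcount(ascending=False) < nrows
df.loc[mask, 'percent [%]'] = np.nan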
Pandas map groupby with multiple columns in same dataframe
I have a dataframe df which I need to group by multiple columns based on a condition.

df

user_id  area_id  group_id  key  year  value    new
  10835    48299         1    5  2011      0      ?
  10835    48299         1    2  2010      0
  10835    48299         2  102  2013  13100
  10835    48299         2    5  2016      0
  10836    48299         1   78  2017  67100
  10836    48299         1    1  2012  54000
  10836    48299         1   12  2018      0
  10836    48752         1    7  2014      0
  10836    48752         2  103  2015   5000
  10837    48752         2  102  2016   5000
  10837    48752         1    3  2017      0
  10837    48752         1  103  2017      0
  10837    49226         1    2  2011   4000
  10837    49226         1   83  2011   4000
  10838    49226         2   16  2011      0
  10838    49226         1   75  2012      0
  10838    49226         1    2  2012   4000
  10838    49226         1   12  2013   1000
  10839    49226         1    3  2015   6500
  10839    49226         1  102  2016   7900
  10839    49226         1   16  2017      0
  10839    49226         2    6  2017   5500
  22489    49226         2   89  2017   5000
  22489    49226         1  102  2017   5000

My goal is to create a new column df['new'].

Current solution:

df['new'] = df['user_id'].map(df[df['key'].eq(102)].groupby(['user_id', 'area_id', 'group_id', 'year'])['value'].sum())

I get NaN for all df['new'] values. I'm guessing it is not possible to use the map function on a groupby over multiple columns this way. Is there a proper way to accomplish this? Thanks in advance for a tip in the right direction.
You can add as_index=False for a new DataFrame:

df1 = (df[df['key'].eq(102)]
          .groupby(['user_id', 'area_id', 'group_id', 'year'], as_index=False)['value']
          .sum())
print (df1)
   user_id  area_id  group_id  year  value
0    10835    48299         2  2013  13100
1    10837    48752         2  2016   5000
2    10839    49226         1  2016   7900
3    22489    49226         1  2017   5000

Then, if duplicated user_id values are possible, first get unique rows by DataFrame.drop_duplicates, create a Series by DataFrame.set_index and map:

df['new'] = df['user_id'].map(df1.drop_duplicates('user_id').set_index('user_id')['value'])

#if never duplicates
#df['new'] = df['user_id'].map(df1.set_index('user_id')['value'])

print (df)
    user_id  area_id  group_id  key  year  value      new
0     10835    48299         1    5  2011      0  13100.0
1     10835    48299         1    2  2010      0  13100.0
2     10835    48299         2  102  2013  13100  13100.0
3     10835    48299         2    5  2016      0  13100.0
4     10836    48299         1   78  2017  67100      NaN
5     10836    48299         1    1  2012  54000      NaN
6     10836    48299         1   12  2018      0      NaN
7     10836    48752         1    7  2014      0      NaN
8     10836    48752         2  103  2015   5000      NaN
9     10837    48752         2  102  2016   5000   5000.0
10    10837    48752         1    3  2017      0   5000.0
11    10837    48752         1  103  2017      0   5000.0
12    10837    49226         1    2  2011   4000   5000.0
13    10837    49226         1   83  2011   4000   5000.0
14    10838    49226         2   16  2011      0      NaN
15    10838    49226         1   75  2012      0      NaN
16    10838    49226         1    2  2012   4000      NaN
17    10838    49226         1   12  2013   1000      NaN
18    10839    49226         1    3  2015   6500   7900.0
19    10839    49226         1  102  2016   7900   7900.0
20    10839    49226         1   16  2017      0   7900.0
21    10839    49226         2    6  2017   5500   7900.0
22    22489    49226         2   89  2017   5000   5000.0
23    22489    49226         1  102  2017   5000   5000.0
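Since the final mapping is keyed on user_id alone anyway, a shorter sketch is to blank out the non-102 values and let groupby.transform broadcast the per-user sum back onto every row; min_count=1 keeps users without any key 102 as NaN, which matches the output above. Note the small semantic difference: this sums all of a user's key-102 rows, whereas the drop_duplicates approach keeps the sum of the first (user_id, area_id, group_id, year) group only:

df['new'] = (df['value'].where(df['key'].eq(102))
               .groupby(df['user_id'])
               .transform(lambda s: s.sum(min_count=1)))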
Panel data pandas, variation according to a certain condition
I am a Stata user trying to switch to Python, and I am having problems with some code. I have the following panel data:

id  year  quarter  fecha  jobs
 1  2007        1    220    10
 1  2007        2    221    12
 1  2007        3    222    12
 1  2007        4    223    12
 1  2008        1    224    12
 1  2008        2    225    13
 1  2008        3    226    14
 1  2008        4    227     9
 1  2009        1    228    12
 1  2009        2    229    15
 1  2009        3    230    18
 1  2009        4    231    15
 1  2010        1    232    15
 1  2010        2    233    16
 1  2010        3    234    17
 1  2010        4    235    18
 2  2007        1    220    10
 2  2007        2    221    12
 2  2007        3    222    12
 2  2007        4    223    12
 2  2008        1    224    12
 2  2008        2    225    13
 2  2008        3    226    14
 2  2008        4    227     9
 2  2009        1    228    12
 2  2009        2    229    15
 2  2009        3    230    18
 2  2009        4    231    15
 2  2010        1    232    15
 2  2010        2    233    16
 2  2010        4    235    18

(My panel data is much bigger than the example; this is just to illustrate my problem.) I want to calculate the variation of jobs relative to the same quarter three years before, so the result should look like this:

id  year  quarter  fecha  jobs  jobs_variation
 1  2007        1    220    10             NaN
 1  2007        2    221    12             NaN
 1  2007        3    222    12             NaN
 1  2007        4    223    12             NaN
 1  2008        1    224    12             NaN
 1  2008        2    225    13             NaN
 1  2008        3    226    14             NaN
 1  2008        4    227     9             NaN
 1  2009        1    228    12             NaN
 1  2009        2    229    15             NaN
 1  2009        3    230    18             NaN
 1  2009        4    231    15             NaN
 1  2010        1    232    15             0.5
 1  2010        2    233    16            0.33
 1  2010        3    234    17         0.30769
 1  2010        4    235    18             0.5
 2  2007        1    220    10             NaN
 2  2007        4    223    12             NaN
 2  2008        1    224    12             NaN
 2  2008        2    225    13             NaN
 2  2008        3    226    14             NaN
 2  2008        4    227     9             NaN
 2  2009        1    228    12             NaN
 2  2009        2    229    15             NaN
 2  2009        3    230    18             NaN
 2  2009        4    231    15             NaN
 2  2010        1    232    15             0.5
 2  2010        2    233    16             NaN
 2  2010        3    234    20             NaN
 2  2010        4    235    18             0.5

Note that for the second id, the calculation must not be made for the second and third quarters of 2010, because the id was not present in 2007Q2 and 2007Q3. In Stata the code would be:

bys id: gen jobs_variation=jobs/jobs[_n-12]-1 if fecha[_n-12]==fecha-12
IIUC, you need a groupby on id and quarter followed by apply:

df['jobs_variation'] = df.groupby(['id', 'quarter']).jobs\
                         .apply(lambda x: x / x.shift(3) - 1)

df
    id  year  quarter  fecha  jobs  jobs_variation
0    1  2007        1    220    10             NaN
1    1  2007        2    221    12             NaN
2    1  2007        3    222    12             NaN
3    1  2007        4    223    12             NaN
4    1  2008        1    224    12             NaN
5    1  2008        2    225    13             NaN
6    1  2008        3    226    14             NaN
7    1  2008        4    227     9             NaN
8    1  2009        1    228    12             NaN
9    1  2009        2    229    15             NaN
10   1  2009        3    230    18             NaN
11   1  2009        4    231    15             NaN
12   1  2010        1    232    15        0.500000
13   1  2010        2    233    16        0.333333
14   1  2010        3    234    17        0.416667
15   1  2010        4    235    18        0.500000
16   2  2007        1    220    10             NaN
17   2  2007        4    223    12             NaN
18   2  2008        1    224    12             NaN
19   2  2008        2    225    13             NaN
20   2  2008        3    226    14             NaN
21   2  2008        4    227     9             NaN
22   2  2009        1    228    12             NaN
23   2  2009        2    229    15             NaN
24   2  2009        3    230    18             NaN
25   2  2009        4    231    15             NaN
26   2  2010        1    232    15        0.500000
27   2  2010        2    233    16             NaN
28   2  2010        3    234    20             NaN
29   2  2010        4    235    18             0.500000

x / x.shift(3) will divide the current year's job count (for that quarter) by the corresponding value from 3 years ago.
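Note that shift(3) is positional within each (id, quarter) group: it matches the Stata condition here only because the missing quarters leave those groups with fewer than four rows, but with longer gaps it could pair the wrong years. A sketch that enforces the calendar condition explicitly, mirroring the Stata check fecha[_n-12]==fecha-12, is to self-merge on the same quarter three years earlier (the column name jobs_3y_ago is only for illustration):

prev = df[['id', 'quarter', 'year', 'jobs']].copy()
prev['year'] += 3                                     # align each row with the row three years later
prev = prev.rename(columns={'jobs': 'jobs_3y_ago'})
df = df.merge(prev, on=['id', 'quarter', 'year'], how='left')
df['jobs_variation'] = df['jobs'] / df['jobs_3y_ago'] - 1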
Enter Missing Year Amounts with Zeros After GroupBy in Pandas
I am grouping the following rows.

df = df.groupby(['id','year']).sum().sort(ascending=False)
print df

         amount
id year
1  2009     120
   2008     240
   2007     240
   2006     240
   2005     240
2  2014     100
   2013      50
   2012      50
   2011     100
   2010      50
   2006     100
...          ...

Is there a way to add years that do not have any values, with the amount equal to zero, down to a specific year, in this case 2005, as I am showing below?

Expected Output:

         amount
id year
1  2015       0
   2014       0
   2013       0
   2012       0
   2011       0
   2010       0
   2009     120
   2008     240
   2007     240
   2006     240
   2005     240
2  2015       0
   2014     100
   2013      50
   2012      50
   2011     100
   2010      50
   2009       0
   2008       0
   2007       0
   2006     100
   2005       0
...          ...
Starting with your first DataFrame, this will add all years that occur with some id to all ids.

df = df.unstack().fillna(0).stack()

e.g.

In [16]: df
Out[16]:
         amt
id year
1  2001    1
   2002    2
   2003    3
2  2002    4
   2003    5
   2004    6

In [17]: df = df.unstack().fillna(0).stack()

In [18]: df
Out[18]:
         amt
id year
1  2001    1
   2002    2
   2003    3
   2004    0
2  2001    0
   2002    4
   2003    5
   2004    6
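The unstack/stack trick only adds years that appear for at least one id; the expected output above also needs years (like 2015) that may be missing for every id, down to a fixed lower bound (2005). A sketch that reindexes against a full id x year grid instead, assuming id and year are the two index levels (the names years and full_idx are only for illustration):

import pandas as pd

years = range(2005, 2016)
ids = df.index.get_level_values('id').unique()
full_idx = pd.MultiIndex.from_product([ids, years], names=['id', 'year'])
df = df.reindex(full_idx, fill_value=0)
df = df.sort_index(ascending=[True, False])   # years descending within each id, as in the expected output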