Enter Missing Year Amounts with Zeros After GroupBy in Pandas - python

I am grouping the following rows.
df = df.groupby(['id', 'year']).sum().sort_index(ascending=[True, False])
print(df)
amount
id year
1 2009 120
2008 240
2007 240
2006 240
2005 240
2 2014 100
2013 50
2012 50
2011 100
2010 50
2006 100
... ...
Is there a way to add rows, with amount equal to zero, for the years that have no values, down to a specific year (in this case 2005), as shown below?
Expected Output:
amount
id year
1 2015 0
2014 0
2013 0
2012 0
2011 0
2010 0
2009 120
2008 240
2007 240
2006 240
2005 240
2 2015 0
2014 100
2013 50
2012 50
2011 100
2010 50
2009 0
2008 0
2007 0
2006 100
2005 0
... ...

Starting with your first DataFrame, this will add every year that occurs for at least one id to all ids.
df = df.unstack().fillna(0).stack()
e.g.
In [16]: df
Out[16]:
amt
id year
1 2001 1
2002 2
2003 3
2 2002 4
2003 5
2004 6
In [17]: df = df.unstack().fillna(0).stack()
In [18]: df
Out[18]:
amt
id year
1 2001 1
2002 2
2003 3
2004 0
2 2001 0
2002 4
2003 5
2004 6
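Note that the unstack/stack trick only adds years that already appear for some id. To pad every id out to a fixed range (2005 through 2015 in the question), one option is to reindex against the full MultiIndex; a minimal sketch with made-up data mirroring the question's shape:

```python
import pandas as pd

# Hypothetical data shaped like the question's frame
df = pd.DataFrame({
    'id':     [1, 1, 2, 2, 2],
    'year':   [2009, 2008, 2014, 2013, 2006],
    'amount': [120, 240, 100, 50, 100],
})
df = df.groupby(['id', 'year']).sum()

# Build the full (id, year) grid for 2005..2015 and reindex against it,
# filling the missing combinations with zero
years = range(2005, 2016)
full = pd.MultiIndex.from_product(
    [df.index.get_level_values('id').unique(), years],
    names=['id', 'year'])
df = df.reindex(full, fill_value=0).sort_index(ascending=[True, False])
print(df)
```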


How to create multiple triangle using another dataframe? [closed]

Below is my code:
triangle = cl.load_sample('genins')
# Use bootstrap sampler to get resampled triangles
bootstrapdataframe = cl.BootstrapODPSample(n_sims=4, random_state=42).fit(triangle).resampled_triangles_
#converting to dataframe
resampledtriangledf = bootstrapdataframe.to_frame()
print(resampledtriangledf)
In the code above I set n_sims (the number of simulations) to 4, so it generates the DataFrame below:
0 2001 12 254,926
0 2001 24 535,877
0 2001 36 1,355,613
0 2001 48 2,034,557
0 2001 60 2,311,789
0 2001 72 2,539,807
0 2001 84 2,724,773
0 2001 96 3,187,095
0 2001 108 3,498,646
0 2001 120 3,586,037
0 2002 12 542,369
0 2002 24 1,016,927
0 2002 36 2,201,329
0 2002 48 2,923,381
0 2002 60 3,711,305
0 2002 72 3,914,829
0 2002 84 4,385,757
0 2002 96 4,596,072
0 2002 108 5,047,861
0 2003 12 235,361
0 2003 24 960,355
0 2003 36 1,661,972
0 2003 48 2,643,370
0 2003 60 3,372,684
0 2003 72 3,642,605
0 2003 84 4,160,583
0 2003 96 4,480,332
0 2004 12 764,553
0 2004 24 1,703,557
0 2004 36 2,498,418
0 2004 48 3,198,358
0 2004 60 3,524,562
0 2004 72 3,884,971
0 2004 84 4,268,241
0 2005 12 381,670
0 2005 24 1,124,054
0 2005 36 2,026,434
0 2005 48 2,863,902
0 2005 60 3,039,322
0 2005 72 3,288,253
0 2006 12 320,332
0 2006 24 1,022,323
0 2006 36 1,830,842
0 2006 48 2,676,710
0 2006 60 3,375,172
0 2007 12 330,361
0 2007 24 1,463,348
0 2007 36 2,771,839
0 2007 48 4,003,745
0 2008 12 282,143
0 2008 24 1,782,267
0 2008 36 2,898,699
0 2009 12 362,726
0 2009 24 1,277,750
0 2010 12 321,247
1 2001 12 219,021
1 2001 24 755,975
1 2001 36 1,360,298
1 2001 48 2,062,947
1 2001 60 2,356,983
1 2001 72 2,781,187
1 2001 84 2,987,837
1 2001 96 3,118,952
1 2001 108 3,307,522
1 2001 120 3,455,107
1 2002 12 302,932
1 2002 24 1,022,459
1 2002 36 1,634,938
1 2002 48 2,538,708
1 2002 60 3,005,695
1 2002 72 3,274,719
1 2002 84 3,356,499
1 2002 96 3,595,361
1 2002 108 4,100,065
1 2003 12 489,934
1 2003 24 1,233,438
1 2003 36 2,471,849
1 2003 48 3,672,629
1 2003 60 4,157,489
1 2003 72 4,498,470
1 2003 84 4,587,579
1 2003 96 4,816,232
1 2004 12 518,680
1 2004 24 1,209,705
1 2004 36 2,019,757
1 2004 48 2,997,820
1 2004 60 3,630,442
1 2004 72 3,881,093
1 2004 84 4,080,322
1 2005 12 453,963
1 2005 24 1,458,504
1 2005 36 2,036,506
1 2005 48 2,846,464
1 2005 60 3,280,124
1 2005 72 3,544,597
1 2006 12 369,755
1 2006 24 1,209,117
1 2006 36 1,973,136
1 2006 48 3,034,294
1 2006 60 3,537,784
1 2007 12 477,788
1 2007 24 1,524,537
1 2007 36 2,170,391
1 2007 48 3,355,093
1 2008 12 250,690
1 2008 24 1,546,986
1 2008 36 2,996,737
1 2009 12 271,270
1 2009 24 1,446,353
1 2010 12 510,114
2 2001 12 170,866
2 2001 24 797,338
2 2001 36 1,663,610
2 2001 48 2,293,697
2 2001 60 2,607,067
2 2001 72 2,979,479
2 2001 84 3,127,308
2 2001 96 3,285,338
2 2001 108 3,574,272
2 2001 120 3,630,610
2 2002 12 259,060
2 2002 24 1,011,092
2 2002 36 1,851,504
2 2002 48 2,705,313
2 2002 60 3,195,774
2 2002 72 3,766,008
2 2002 84 3,944,417
2 2002 96 4,234,043
2 2002 108 4,763,664
2 2003 12 239,981
2 2003 24 983,484
2 2003 36 1,929,785
2 2003 48 2,497,929
2 2003 60 2,972,887
2 2003 72 3,313,868
2 2003 84 3,727,432
2 2003 96 4,024,122
2 2004 12 77,522
2 2004 24 729,401
2 2004 36 1,473,914
2 2004 48 2,376,313
2 2004 60 2,999,197
2 2004 72 3,372,020
2 2004 84 3,887,883
2 2005 12 321,598
2 2005 24 1,132,502
2 2005 36 1,710,504
2 2005 48 2,438,620
2 2005 60 2,801,957
2 2005 72 3,182,466
2 2006 12 255,407
2 2006 24 1,275,141
2 2006 36 2,083,421
2 2006 48 3,144,579
2 2006 60 3,891,772
2 2007 12 338,120
2 2007 24 1,275,697
2 2007 36 2,238,715
2 2007 48 3,615,323
2 2008 12 310,214
2 2008 24 1,237,156
2 2008 36 2,563,326
2 2009 12 271,093
2 2009 24 1,523,131
2 2010 12 430,591
3 2001 12 330,887
3 2001 24 831,193
3 2001 36 1,601,374
3 2001 48 2,188,879
3 2001 60 2,662,773
3 2001 72 3,086,976
3 2001 84 3,332,247
3 2001 96 3,317,279
3 2001 108 3,576,659
3 2001 120 3,613,563
3 2002 12 358,263
3 2002 24 1,139,259
3 2002 36 2,236,375
3 2002 48 3,163,464
3 2002 60 3,715,130
3 2002 72 4,295,638
3 2002 84 4,502,105
3 2002 96 4,769,139
3 2002 108 5,323,304
3 2003 12 489,934
3 2003 24 1,570,352
3 2003 36 3,123,215
3 2003 48 4,189,299
3 2003 60 4,819,070
3 2003 72 5,306,689
3 2003 84 5,560,371
3 2003 96 5,827,003
3 2004 12 419,727
3 2004 24 1,308,884
3 2004 36 2,118,936
3 2004 48 2,906,732
3 2004 60 3,561,577
3 2004 72 3,934,400
3 2004 84 4,010,511
3 2005 12 389,217
3 2005 24 1,173,226
3 2005 36 1,794,216
3 2005 48 2,528,910
3 2005 60 3,474,035
3 2005 72 3,908,999
3 2006 12 291,940
3 2006 24 1,136,674
3 2006 36 1,915,614
3 2006 48 2,693,930
3 2006 60 3,375,601
3 2007 12 506,055
3 2007 24 1,684,660
3 2007 36 2,678,739
3 2007 48 3,545,156
3 2008 12 282,143
3 2008 24 1,536,490
3 2008 36 2,458,789
3 2009 12 271,093
3 2009 24 1,199,897
3 2010 12 266,359
Using the above DataFrame, I have to create 4 triangles based on the Total column. For example:
Row Labels 12 24 36 48 60 72 84 96 108 120 Grand Total
2001 254,926 535,877 1,355,613 2,034,557 2,311,789 2,539,807 2,724,773 3,187,095 3,498,646 3,586,037 22,029,119
2002 542,369 1,016,927 2,201,329 2,923,381 3,711,305 3,914,829 4,385,757 4,596,072 5,047,861 28,339,832
2003 235,361 960,355 1,661,972 2,643,370 3,372,684 3,642,605 4,160,583 4,480,332 21,157,261
2004 764,553 1,703,557 2,498,418 3,198,358 3,524,562 3,884,971 4,268,241 19,842,659
2005 381,670 1,124,054 2,026,434 2,863,902 3,039,322 3,288,253 12,723,635
2006 320,332 1,022,323 1,830,842 2,676,710 3,375,172 9,225,377
2007 330,361 1,463,348 2,771,839 4,003,745 8,569,294
2008 282,143 1,782,267 2,898,699 4,963,110
2009 362,726 1,277,750 1,640,475
2010 321,247 321,247
Grand Total 3,795,687 10,886,456 17,245,147 20,344,022 19,334,833 17,270,466 15,539,355 12,263,499 8,546,507 3,586,037 128,812,009
.
.
.
Like this, I need 4 triangles (4 being the number of simulations) from the first DataFrame.
If the user passes n_sims=900, it creates 900 sets of values, and from those we have to create 900 triangles.
Use pivot_table and choose the aggregation function (sum here but you can use mean or whatever):
df = df.pivot_table(index="origin", columns="development",
                    values="values", aggfunc="sum")
df = df.set_index(df.index.year)
df.loc["Grand Total"] = df.sum()
df.loc[:, "Grand Total"] = df.sum(axis=1)
>>> df
development 12 24 36 48 60 72 84 96 108 120 Grand Total
origin
2001 1.356449e+09 4.695043e+09 8.226504e+09 1.200121e+10 1.408404e+10 1.555555e+10 1.690673e+10 1.781579e+10 1.917689e+10 1.951240e+10 1.293306e+11
2002 1.887634e+09 6.573443e+09 1.150100e+10 1.671772e+10 1.960781e+10 2.164808e+10 2.352267e+10 2.480478e+10 2.671911e+10 NaN 1.529823e+11
2003 1.866031e+09 6.531145e+09 1.137408e+10 1.657377e+10 1.945944e+10 2.148353e+10 2.334087e+10 2.459720e+10 NaN NaN 1.252261e+11
2004 1.842447e+09 6.411653e+09 1.120732e+10 1.633725e+10 1.917381e+10 2.117893e+10 2.301072e+10 NaN NaN NaN 9.916214e+10
2005 1.688064e+09 5.876106e+09 1.027445e+10 1.496756e+10 1.757424e+10 1.939891e+10 NaN NaN NaN NaN 6.977932e+10
2006 1.762834e+09 6.154760e+09 1.076776e+10 1.569864e+10 1.843549e+10 NaN NaN NaN NaN NaN 5.281948e+10
2007 1.968264e+09 6.855178e+09 1.195292e+10 1.741326e+10 NaN NaN NaN NaN NaN NaN 3.818962e+10
2008 2.344669e+09 8.218527e+09 1.433187e+10 NaN NaN NaN NaN NaN NaN NaN 2.489507e+10
2009 1.955145e+09 6.813284e+09 NaN NaN NaN NaN NaN NaN NaN NaN 8.768429e+09
2010 1.716057e+09 NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.716057e+09
Grand Total 1.838759e+10 5.812914e+10 8.963591e+10 1.097094e+11 1.083348e+11 9.926499e+10 8.678100e+10 6.721778e+10 4.589601e+10 1.951240e+10 7.028691e+11
The code above works for the following input data:
>>> df
origin development values
Total
0 2001-01-01 12 3.766810e+05
0 2001-01-01 24 1.025411e+06
0 2001-01-01 36 1.541503e+06
0 2001-01-01 48 2.155232e+06
0 2001-01-01 60 2.422287e+06
... ... ... ...
4999 2008-01-01 24 2.403488e+06
4999 2008-01-01 36 3.100034e+06
4999 2009-01-01 12 3.747304e+05
4999 2009-01-01 24 1.262821e+06
4999 2010-01-01 12 2.469928e+05
[275000 rows x 3 columns]
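The pivot above folds all simulations into a single aggregate triangle. To get one triangle per simulation, as the question asks, you could group on the simulation index first and pivot each group separately. A sketch, assuming the frame is shaped like the to_frame() output shown above (simulation index named "Total", columns origin/development/values; the data here is made up):

```python
import pandas as pd

# Hypothetical frame mimicking the resampled-triangle output:
# index = simulation number ("Total"), columns = origin, development, values
df = pd.DataFrame({
    'origin':      pd.to_datetime(['2001', '2001', '2002', '2001', '2002']),
    'development': [12, 24, 12, 12, 12],
    'values':      [100.0, 200.0, 300.0, 110.0, 310.0],
}, index=pd.Index([0, 0, 0, 1, 1], name='Total'))

triangles = {}
for sim, grp in df.groupby(level='Total'):
    # One run-off triangle per simulation
    tri = grp.pivot_table(index='origin', columns='development',
                          values='values', aggfunc='sum')
    tri = tri.set_index(tri.index.year)
    tri.loc[:, 'Grand Total'] = tri.sum(axis=1)  # row totals
    tri.loc['Grand Total'] = tri.sum()           # column totals
    triangles[sim] = tri
```

With n_sims=900 the same loop would simply produce 900 entries in `triangles`.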

Sum by year and total_vehicles pandas dataframe

I have the following dataframe lrdata3 and I would like to sum total_vehicles for every year, instead of having multiple separate rows for the same year.
year total_vehicles
0 2000 2016
1 2000 1483
2 2000 1275
3 2000 1086
4 2000 816
When I do this
lrdata3.groupby('year')['total_vehicles'].sum()
I get this, which is not even a DataFrame:
year
2000 419587299
2001 425832533
2002 430480581
2003 434270003
2004 442680113
2005 443366960
2006 452086899
2007 452280161
2008 445462026
2009 443333980
2010 438827716
2011 440461505
2012 440073277
2013 441751395
2014 451394270
2015 460050397
2016 470256985
2017 474693803
2018 473765568
Any help please?
Thanks
You can do it in one line and get a DataFrame with this syntax.
Some sample data:
year total_vehicles
0 2000 2016
1 2000 1483
2 2000 1275
3 2000 1086
4 2000 816
5 2001 2016
6 2001 1483
7 2001 1275
8 2002 1086
9 2002 816
df = pd.read_clipboard()
gb = df.groupby('year').agg({'total_vehicles': 'sum'})
print(gb)
total_vehicles
year
2000 6676
2001 4774
2002 1902
print(type(gb))
<class 'pandas.core.frame.DataFrame'>
Your code is fine, just add a .reset_index() to it. Like this:
lrdata3.groupby('year')['total_vehicles'].sum().reset_index()
This will get you what you want.
lrdata3.groupby('year')['total_vehicles'].sum().to_frame()
Or use groupby with transform to broadcast the sum back onto every row:
lrdata3['yearlytotal_vehicles']=lrdata3.groupby('year')['total_vehicles'].transform('sum')
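For completeness, passing as_index=False to groupby also returns a DataFrame directly, without a separate reset_index step; a small sketch on sample data:

```python
import pandas as pd

# Small sample mirroring the question's data
lrdata3 = pd.DataFrame({
    'year':           [2000, 2000, 2000, 2001, 2001],
    'total_vehicles': [2016, 1483, 1275, 1086, 816],
})

# as_index=False keeps 'year' as a regular column, so the
# result is a DataFrame rather than a Series
out = lrdata3.groupby('year', as_index=False)['total_vehicles'].sum()
print(out)
```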

Finding all values in between specific values in data frame

i have this dataframe.
df
name timestamp year
0 A 2004 1995
1 D 2008 2004
2 M 2005 2006
3 T 2003 2007
4 B 1995 2008
5 C 2007 2003
6 D 2005 2001
7 E 2009 2005
8 A 2018 2009
9 L 2016 2018
Based on the first two entries in df['timestamp'], I am fetching all the values from df['year'] that fall between those two entries, which in this case is 2004 to 2008.
y1 = df['timestamp'].iloc[0]
y2 = df['timestamp'].iloc[1]
movies = df[df['year'].between(y1, y2, inclusive=True)]
movies
name timestamp year
1 D 2008 2004
2 M 2005 2006
3 T 2003 2007
4 B 1995 2008
7 E 2009 2005
This works fine for me, but when the first entry is greater than the second (e.g. 2008 and 2004), the result is empty.
df
name timestamp year
0 A 2008 1995
1 D 2004 2004
2 M 2005 2006
3 T 2003 2007
4 B 1995 2008
5 C 2007 2003
6 D 2005 2001
7 E 2009 2005
8 A 2018 2009
9 L 2016 2018
In this case I fetch nothing.
Expected outcome:
Whether the first value is greater or smaller than the second, I should get the in-between values every time.
You could use Series.head and Series.agg:
y1, y2 = df['timestamp'].head(2).agg(['min', 'max'])
movies = df[df['year'].between(y1, y2, inclusive=True)]
[out]
name timestamp year
1 D 2004 2004
2 M 2005 2006
3 T 2003 2007
4 B 1995 2008
7 E 2009 2005
You can fix that by changing just two lines of code:
y1 = min(df['timestamp'].iloc[0], df['timestamp'].iloc[1])
y2 = max(df['timestamp'].iloc[0], df['timestamp'].iloc[1])
This way y1 is always less than or equal to y2.
However, as @ALollz pointed out, you can save both computation and coding time by using
y1,y2 = np.sort(df['timestamp'].head(2))
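Putting it together, a runnable sketch of the min/max fix against the reversed example from the question:

```python
import pandas as pd

# Reproducing the second (reversed) example from the question
df = pd.DataFrame({
    'name':      list('ADMTBCDEAL'),
    'timestamp': [2008, 2004, 2005, 2003, 1995, 2007, 2005, 2009, 2018, 2016],
    'year':      [1995, 2004, 2006, 2007, 2008, 2003, 2001, 2005, 2009, 2018],
})

# Order-independent bounds: min and max of the first two timestamps
y1, y2 = df['timestamp'].head(2).agg(['min', 'max'])

# between() is inclusive on both ends by default
movies = df[df['year'].between(y1, y2)]
print(movies)
```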

How to add a column with the growth rate in a budget table in Pandas?

I would like to know how I can add a year-over-year growth rate to the following data in Pandas.
Date Total Managed Expenditure
0 2001 503.2
1 2002 529.9
2 2003 559.8
3 2004 593.2
4 2005 629.5
5 2006 652.1
6 2007 664.3
7 2008 688.2
8 2009 732.0
9 2010 759.2
10 2011 769.2
11 2012 759.8
12 2013 760.6
13 2014 753.3
14 2015 757.6
15 2016 753.9
Use Series.pct_change():
df['Total Managed Expenditure'].pct_change()
Out:
0 NaN
1 0.053060
2 0.056426
3 0.059664
4 0.061194
5 0.035902
6 0.018709
7 0.035978
8 0.063644
9 0.037158
10 0.013172
11 -0.012220
12 0.001053
13 -0.009598
14 0.005708
15 -0.004884
Name: Total Managed Expenditure, dtype: float64
To assign it back:
df['Growth Rate'] = df['Total Managed Expenditure'].pct_change()
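If you prefer the rate as a percentage rather than a fraction, multiply by 100; a small sketch using the first few rows of the table:

```python
import pandas as pd

# First few rows of the question's budget table
df = pd.DataFrame({
    'Date': [2001, 2002, 2003, 2004],
    'Total Managed Expenditure': [503.2, 529.9, 559.8, 593.2],
})

# pct_change() returns a fraction; scale it to a percentage.
# The first row is NaN because there is no prior year to compare to.
df['Growth Rate (%)'] = df['Total Managed Expenditure'].pct_change() * 100
print(df)
```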

How can I get this series to a pandas dataframe?

I have some data and after using a groupby function I now have a series that looks like this:
year
1997 15
1998 22
1999 24
2000 24
2001 28
2002 11
2003 15
2004 19
2005 10
2006 10
2007 21
2008 26
2009 23
2010 16
2011 33
2012 19
2013 26
2014 25
How can I create a pandas DataFrame from here, with year as one column and the other column named sightings?
I am a pandas novice so don't really know what I am doing. I have tried the reindex and unstack functions but haven't been able to get what I want...
You can use reset_index and rename columns:
print (df.reset_index())
index year
0 1997 15
1 1998 22
2 1999 24
3 2000 24
4 2001 28
5 2002 11
6 2003 15
7 2004 19
8 2005 10
9 2006 10
10 2007 21
11 2008 26
12 2009 23
13 2010 16
14 2011 33
15 2012 19
16 2013 26
17 2014 25
print (df.reset_index().rename(columns=({'index':'year','year':'sightings'})))
year sightings
0 1997 15
1 1998 22
2 1999 24
3 2000 24
4 2001 28
5 2002 11
6 2003 15
7 2004 19
8 2005 10
9 2006 10
10 2007 21
11 2008 26
12 2009 23
13 2010 16
14 2011 33
15 2012 19
16 2013 26
17 2014 25
Another solution is set column names by list of names:
df1 = df.reset_index()
df1.columns = ['year','sightings']
print (df1)
year sightings
0 1997 15
1 1998 22
2 1999 24
3 2000 24
4 2001 28
5 2002 11
6 2003 15
7 2004 19
8 2005 10
9 2006 10
10 2007 21
11 2008 26
12 2009 23
13 2010 16
14 2011 33
15 2012 19
16 2013 26
17 2014 25
EDIT:
Sometimes it helps to add the parameter as_index=False to groupby so that it returns a DataFrame:
import pandas as pd
df = pd.DataFrame({'A': [1, 1, 3],
                   'B': [4, 5, 6]})
print (df)
A B
0 1 4
1 1 5
2 3 6
print (df.groupby('A')['B'].sum())
A
1 9
3 6
Name: B, dtype: int64
print (df.groupby('A', as_index=False)['B'].sum())
A B
0 1 9
1 3 6
s.rename('sightings').reset_index()
I've also used this method during the groupby stage to put the results straight into a dataframe:
df2 = df1.groupby(['Year']).count()
df3 = pd.DataFrame(df2).reset_index()
If your original dataframe - df1 - had "Year" and "Sightings" as its two columns, then df3 should have each year listed under "Year" and the count (or sum, average, whatever) listed under "Sightings".
If not, you can change the column names by doing the following:
df3.columns = ['Year','Sightings']
or
df3 = df3.rename(columns={'oldname_A': 'Year', 'oldname_B': 'Sightings'})
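Series.to_frame is another option, since it lets you name the values column in the same step before resetting the index; a minimal sketch:

```python
import pandas as pd

# A Series like the one the groupby produced
s = pd.Series([15, 22, 24],
              index=pd.Index([1997, 1998, 1999], name='year'))

# to_frame names the values column; reset_index then turns
# the 'year' index back into a regular column
out = s.to_frame('sightings').reset_index()
print(out)
```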
