How to create multiple triangle using another dataframe? [closed] - python

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 1 year ago.
Improve this question
Below is my code:
triangle = cl.load_sample('genins')
# Use bootstrap sampler to get resampled triangles
bootstrapdataframe = cl.BootstrapODPSample(n_sims=4, random_state=42).fit(triangle).resampled_triangles_
#converting to dataframe
resampledtriangledf = bootstrapdataframe.to_frame()
print(resampledtriangledf)
In above code i mentioned n_sims(number of simulation)=4. So it generates below datafame:
0 2001 12 254,926
0 2001 24 535,877
0 2001 36 1,355,613
0 2001 48 2,034,557
0 2001 60 2,311,789
0 2001 72 2,539,807
0 2001 84 2,724,773
0 2001 96 3,187,095
0 2001 108 3,498,646
0 2001 120 3,586,037
0 2002 12 542,369
0 2002 24 1,016,927
0 2002 36 2,201,329
0 2002 48 2,923,381
0 2002 60 3,711,305
0 2002 72 3,914,829
0 2002 84 4,385,757
0 2002 96 4,596,072
0 2002 108 5,047,861
0 2003 12 235,361
0 2003 24 960,355
0 2003 36 1,661,972
0 2003 48 2,643,370
0 2003 60 3,372,684
0 2003 72 3,642,605
0 2003 84 4,160,583
0 2003 96 4,480,332
0 2004 12 764,553
0 2004 24 1,703,557
0 2004 36 2,498,418
0 2004 48 3,198,358
0 2004 60 3,524,562
0 2004 72 3,884,971
0 2004 84 4,268,241
0 2005 12 381,670
0 2005 24 1,124,054
0 2005 36 2,026,434
0 2005 48 2,863,902
0 2005 60 3,039,322
0 2005 72 3,288,253
0 2006 12 320,332
0 2006 24 1,022,323
0 2006 36 1,830,842
0 2006 48 2,676,710
0 2006 60 3,375,172
0 2007 12 330,361
0 2007 24 1,463,348
0 2007 36 2,771,839
0 2007 48 4,003,745
0 2008 12 282,143
0 2008 24 1,782,267
0 2008 36 2,898,699
0 2009 12 362,726
0 2009 24 1,277,750
0 2010 12 321,247
1 2001 12 219,021
1 2001 24 755,975
1 2001 36 1,360,298
1 2001 48 2,062,947
1 2001 60 2,356,983
1 2001 72 2,781,187
1 2001 84 2,987,837
1 2001 96 3,118,952
1 2001 108 3,307,522
1 2001 120 3,455,107
1 2002 12 302,932
1 2002 24 1,022,459
1 2002 36 1,634,938
1 2002 48 2,538,708
1 2002 60 3,005,695
1 2002 72 3,274,719
1 2002 84 3,356,499
1 2002 96 3,595,361
1 2002 108 4,100,065
1 2003 12 489,934
1 2003 24 1,233,438
1 2003 36 2,471,849
1 2003 48 3,672,629
1 2003 60 4,157,489
1 2003 72 4,498,470
1 2003 84 4,587,579
1 2003 96 4,816,232
1 2004 12 518,680
1 2004 24 1,209,705
1 2004 36 2,019,757
1 2004 48 2,997,820
1 2004 60 3,630,442
1 2004 72 3,881,093
1 2004 84 4,080,322
1 2005 12 453,963
1 2005 24 1,458,504
1 2005 36 2,036,506
1 2005 48 2,846,464
1 2005 60 3,280,124
1 2005 72 3,544,597
1 2006 12 369,755
1 2006 24 1,209,117
1 2006 36 1,973,136
1 2006 48 3,034,294
1 2006 60 3,537,784
1 2007 12 477,788
1 2007 24 1,524,537
1 2007 36 2,170,391
1 2007 48 3,355,093
1 2008 12 250,690
1 2008 24 1,546,986
1 2008 36 2,996,737
1 2009 12 271,270
1 2009 24 1,446,353
1 2010 12 510,114
2 2001 12 170,866
2 2001 24 797,338
2 2001 36 1,663,610
2 2001 48 2,293,697
2 2001 60 2,607,067
2 2001 72 2,979,479
2 2001 84 3,127,308
2 2001 96 3,285,338
2 2001 108 3,574,272
2 2001 120 3,630,610
2 2002 12 259,060
2 2002 24 1,011,092
2 2002 36 1,851,504
2 2002 48 2,705,313
2 2002 60 3,195,774
2 2002 72 3,766,008
2 2002 84 3,944,417
2 2002 96 4,234,043
2 2002 108 4,763,664
2 2003 12 239,981
2 2003 24 983,484
2 2003 36 1,929,785
2 2003 48 2,497,929
2 2003 60 2,972,887
2 2003 72 3,313,868
2 2003 84 3,727,432
2 2003 96 4,024,122
2 2004 12 77,522
2 2004 24 729,401
2 2004 36 1,473,914
2 2004 48 2,376,313
2 2004 60 2,999,197
2 2004 72 3,372,020
2 2004 84 3,887,883
2 2005 12 321,598
2 2005 24 1,132,502
2 2005 36 1,710,504
2 2005 48 2,438,620
2 2005 60 2,801,957
2 2005 72 3,182,466
2 2006 12 255,407
2 2006 24 1,275,141
2 2006 36 2,083,421
2 2006 48 3,144,579
2 2006 60 3,891,772
2 2007 12 338,120
2 2007 24 1,275,697
2 2007 36 2,238,715
2 2007 48 3,615,323
2 2008 12 310,214
2 2008 24 1,237,156
2 2008 36 2,563,326
2 2009 12 271,093
2 2009 24 1,523,131
2 2010 12 430,591
3 2001 12 330,887
3 2001 24 831,193
3 2001 36 1,601,374
3 2001 48 2,188,879
3 2001 60 2,662,773
3 2001 72 3,086,976
3 2001 84 3,332,247
3 2001 96 3,317,279
3 2001 108 3,576,659
3 2001 120 3,613,563
3 2002 12 358,263
3 2002 24 1,139,259
3 2002 36 2,236,375
3 2002 48 3,163,464
3 2002 60 3,715,130
3 2002 72 4,295,638
3 2002 84 4,502,105
3 2002 96 4,769,139
3 2002 108 5,323,304
3 2003 12 489,934
3 2003 24 1,570,352
3 2003 36 3,123,215
3 2003 48 4,189,299
3 2003 60 4,819,070
3 2003 72 5,306,689
3 2003 84 5,560,371
3 2003 96 5,827,003
3 2004 12 419,727
3 2004 24 1,308,884
3 2004 36 2,118,936
3 2004 48 2,906,732
3 2004 60 3,561,577
3 2004 72 3,934,400
3 2004 84 4,010,511
3 2005 12 389,217
3 2005 24 1,173,226
3 2005 36 1,794,216
3 2005 48 2,528,910
3 2005 60 3,474,035
3 2005 72 3,908,999
3 2006 12 291,940
3 2006 24 1,136,674
3 2006 36 1,915,614
3 2006 48 2,693,930
3 2006 60 3,375,601
3 2007 12 506,055
3 2007 24 1,684,660
3 2007 36 2,678,739
3 2007 48 3,545,156
3 2008 12 282,143
3 2008 24 1,536,490
3 2008 36 2,458,789
3 2009 12 271,093
3 2009 24 1,199,897
3 2010 12 266,359
Using above dataframe I have to create 4 triangles based on Toatal column:
For example:
Row Labels 12 24 36 48 60 72 84 96 108 120 Grand Total
2001 254,926 535,877 1,355,613 2,034,557 2,311,789 2,539,807 2,724,773 3,187,095 3,498,646 3,586,037 22,029,119
2002 542,369 1,016,927 2,201,329 2,923,381 3,711,305 3,914,829 4,385,757 4,596,072 5,047,861 28,339,832
2003 235,361 960,355 1,661,972 2,643,370 3,372,684 3,642,605 4,160,583 4,480,332 21,157,261
2004 764,553 1,703,557 2,498,418 3,198,358 3,524,562 3,884,971 4,268,241 19,842,659
2005 381,670 1,124,054 2,026,434 2,863,902 3,039,322 3,288,253 12,723,635
2006 320,332 1,022,323 1,830,842 2,676,710 3,375,172 9,225,377
2007 330,361 1,463,348 2,771,839 4,003,745 8,569,294
2008 282,143 1,782,267 2,898,699 4,963,110
2009 362,726 1,277,750 1,640,475
2010 321,247 321,247
Grand Total 3,795,687 10,886,456 17,245,147 20,344,022 19,334,833 17,270,466 15,539,355 12,263,499 8,546,507 3,586,037 128,812,009
.
.
.
Like this i need 4 triangles (4 is number of simulation) using 1st dataframe.
If user gives s_sims=900 then it creates 900 totals values based on this we have to create 900 triangles.

Use pivot_table and choose the aggregation function (sum here but you can use mean or whatever):
df = df.pivot_table(index="origin", columns="development",
values="values", aggfunc="sum")
df = df.set_index(df.index.year)
df.loc["Grand Total"] = df.sum()
df.loc[:, "Grand Total"] = df.sum(axis=1)
>>> df
development 12 24 36 48 60 72 84 96 108 120 Grand Total
origin
2001 1.356449e+09 4.695043e+09 8.226504e+09 1.200121e+10 1.408404e+10 1.555555e+10 1.690673e+10 1.781579e+10 1.917689e+10 1.951240e+10 1.293306e+11
2002 1.887634e+09 6.573443e+09 1.150100e+10 1.671772e+10 1.960781e+10 2.164808e+10 2.352267e+10 2.480478e+10 2.671911e+10 NaN 1.529823e+11
2003 1.866031e+09 6.531145e+09 1.137408e+10 1.657377e+10 1.945944e+10 2.148353e+10 2.334087e+10 2.459720e+10 NaN NaN 1.252261e+11
2004 1.842447e+09 6.411653e+09 1.120732e+10 1.633725e+10 1.917381e+10 2.117893e+10 2.301072e+10 NaN NaN NaN 9.916214e+10
2005 1.688064e+09 5.876106e+09 1.027445e+10 1.496756e+10 1.757424e+10 1.939891e+10 NaN NaN NaN NaN 6.977932e+10
2006 1.762834e+09 6.154760e+09 1.076776e+10 1.569864e+10 1.843549e+10 NaN NaN NaN NaN NaN 5.281948e+10
2007 1.968264e+09 6.855178e+09 1.195292e+10 1.741326e+10 NaN NaN NaN NaN NaN NaN 3.818962e+10
2008 2.344669e+09 8.218527e+09 1.433187e+10 NaN NaN NaN NaN NaN NaN NaN 2.489507e+10
2009 1.955145e+09 6.813284e+09 NaN NaN NaN NaN NaN NaN NaN NaN 8.768429e+09
2010 1.716057e+09 NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.716057e+09
Grand Total 1.838759e+10 5.812914e+10 8.963591e+10 1.097094e+11 1.083348e+11 9.926499e+10 8.678100e+10 6.721778e+10 4.589601e+10 1.951240e+10 7.028691e+11
The code above works for the following input data:
>>> df
origin development values
Total
0 2001-01-01 12 3.766810e+05
0 2001-01-01 24 1.025411e+06
0 2001-01-01 36 1.541503e+06
0 2001-01-01 48 2.155232e+06
0 2001-01-01 60 2.422287e+06
... ... ... ...
4999 2008-01-01 24 2.403488e+06
4999 2008-01-01 36 3.100034e+06
4999 2009-01-01 12 3.747304e+05
4999 2009-01-01 24 1.262821e+06
4999 2010-01-01 12 2.469928e+05
[275000 rows x 3 columns]

Related

How to unpack a list of tuple in various length in a panda dataframe?

ID LIST_OF_TUPLE (2col)
1 [('2012','12'), ('2012','33'), ('2014', '82')]
2 NA
3 [('2012','12')]
4 [('2012','12'), ('2012','33'), ('2014', '82'), ('2022', '67')]
Result:
ID TUP_1 TUP_2(3col)
1 2012 12
1 2012 33
1 2014 82
3 2012 12
4 2012 12
4 2012 33
4 2014 82
4 2022 67
Thanks in advance.
This is explode then create a dataframe and then join:
s = df['LIST_OF_TUPLE'].explode()
out = (df[['ID']].join(pd.DataFrame(s.tolist(),index=s.index)
.add_prefix("TUP_")).reset_index(drop=True)) #you can chain a dropna if reqd
print(out)
ID TUP_0 TUP_1
0 1 2012 12
1 1 2012 33
2 1 2014 82
3 2 NaN None
4 3 2012 12
5 4 2012 12
6 4 2012 33
7 4 2014 82
8 4 2022 67

Replace last value(s) of group with NaN

My goal is to replace the last value (or the last several values) of each id with NaN. My real dataset is quite large and has groups of different sizes.
Example:
import pandas as pd
ids = [1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3]
year = [2000,2001,2002,2003,2004,2005,1990,1991,1992,1993,1994,1995,2010,2011,2012,2013,2014,2015]
percent = [120,70,37,40,50,110,140,100,90,5,52,80,60,40,70,60,50,110]
dictex ={"id":ids,"year":year,"percent [%]": percent}
dfex = pd.DataFrame(dictex)
print(dfex)
id year percent [%]
0 1 2000 120
1 1 2001 70
2 1 2002 37
3 1 2003 40
4 1 2004 50
5 1 2005 110
6 2 1990 140
7 2 1991 100
8 2 1992 90
9 2 1993 5
10 2 1994 52
11 2 1995 80
12 3 2010 60
13 3 2011 40
14 3 2012 70
15 3 2013 60
16 3 2014 50
17 3 2015 110
My goal is to replace the last 1 / or 2 / or 3 values of the "percent [%]" column for each id (group) with NaN.
The result should look like this: (here: replace the last 2 values of each id)
id year percent [%]
0 1 2000 120
1 1 2001 70
2 1 2002 37
3 1 2003 40
4 1 2004 NaN
5 1 2005 NaN
6 2 1990 140
7 2 1991 100
8 2 1992 90
9 2 1993 5
10 2 1994 NaN
11 2 1995 NaN
12 3 2010 60
13 3 2011 40
14 3 2012 70
15 3 2013 60
16 3 2014 NaN
17 3 2015 NaN
I know there should be a relatively easy solution for this, but i'm new to python and simply haven't been able to figure out an elegant way.
Thanks for the help!
try using groupby, tail and index to find the index of those rows that will be modified and use loc to change the values
nrows = 2
idx = df.groupby('id').tail(nrows).index
df.loc[idx, 'percent [%]'] = np.nan
#output
id year percent [%]
0 1 2000 120.0
1 1 2001 70.0
2 1 2002 37.0
3 1 2003 40.0
4 1 2004 NaN
5 1 2005 NaN
6 2 1990 140.0
7 2 1991 100.0
8 2 1992 90.0
9 2 1993 5.0
10 2 1994 NaN
11 2 1995 NaN
12 3 2010 60.0
13 3 2011 40.0
14 3 2012 70.0
15 3 2013 60.0
16 3 2014 NaN
17 3 2015 NaN

Panel data pandas, variation according to a certain condition

i am a stata user and i trying to switch to python and i having problem with some codes. If i have the following panel data
id year quarter fecha jobs
1 2007 1 220 10
1 2007 2 221 12
1 2007 3 222 12
1 2007 4 223 12
1 2008 1 224 12
1 2008 2 225 13
1 2008 3 226 14
1 2008 4 227 9
1 2009 1 228 12
1 2009 2 229 15
1 2009 3 230 18
1 2009 4 231 15
1 2010 1 232 15
1 2010 2 233 16
1 2010 3 234 17
1 2010 4 235 18
2 2007 1 220 10
2 2007 2 221 12
2 2007 3 222 12
2 2007 4 223 12
2 2008 1 224 12
2 2008 2 225 13
2 2008 3 226 14
2 2008 4 227 9
2 2009 1 228 12
2 2009 2 229 15
2 2009 3 230 18
2 2009 4 231 15
2 2010 1 232 15
2 2010 2 233 16
2 2010 4 235 18
(My panel data is much bigger than the example, is just to illustrate my problem). I want to calculate the variation of jobs of the same quarter and three year before
So result should look like these
id year quarter fecha jobs jobs_variation
1 2007 1 220 10 Nan
1 2007 2 221 12 Nan
1 2007 3 222 12 Nan
1 2007 4 223 12 Nan
1 2008 1 224 12 Nan
1 2008 2 225 13 Nan
1 2008 3 226 14 Nan
1 2008 4 227 9 Nan
1 2009 1 228 12 Nan
1 2009 2 229 15 Nan
1 2009 3 230 18 Nan
1 2009 4 231 15 Nan
1 2010 1 232 15 0.5
1 2010 2 233 16 0.33
1 2010 3 234 17 0.30769
1 2010 4 235 18 0.5
2 2007 1 220 10 Nan
2 2007 4 223 12 Nan
2 2008 1 224 12 Nan
2 2008 2 225 13 Nan
2 2008 3 226 14 Nan
2 2008 4 227 9 Nan
2 2009 1 228 12 Nan
2 2009 2 229 15 Nan
2 2009 3 230 18 Nan
2 2009 4 231 15 Nan
2 2010 1 232 15 0.5
2 2010 2 233 16 Nan
2 2010 3 234 20 Nan
2 2010 4 235 18 0.5
Check that in the second id year 2010 in the second and thir quarter calculation must not be me made because the id was not present at 2007Q2 and 2007Q3.
In stata the code would be,
bys id: gen jobs_variation=jobs/jobs[_n-12]-1 if fecha[_n-12]==fecha-12
IIUC, you need a groupby on id and quarter followed by apply:
df['jobs_variation'] = df.groupby(['id', 'quarter']).jobs\
.apply(lambda x: x / x.shift(3) - 1)
df
id year quarter fecha jobs jobs_variation
0 1 2007 1 220 10 NaN
1 1 2007 2 221 12 NaN
2 1 2007 3 222 12 NaN
3 1 2007 4 223 12 NaN
4 1 2008 1 224 12 NaN
5 1 2008 2 225 13 NaN
6 1 2008 3 226 14 NaN
7 1 2008 4 227 9 NaN
8 1 2009 1 228 12 NaN
9 1 2009 2 229 15 NaN
10 1 2009 3 230 18 NaN
11 1 2009 4 231 15 NaN
12 1 2010 1 232 15 0.500000
13 1 2010 2 233 16 0.333333
14 1 2010 3 234 17 0.416667
15 1 2010 4 235 18 0.500000
16 2 2007 1 220 10 NaN
17 2 2007 4 223 12 NaN
18 2 2008 1 224 12 NaN
19 2 2008 2 225 13 NaN
20 2 2008 3 226 14 NaN
21 2 2008 4 227 9 NaN
22 2 2009 1 228 12 NaN
23 2 2009 2 229 15 NaN
24 2 2009 3 230 18 NaN
25 2 2009 4 231 15 NaN
26 2 2010 1 232 15 0.500000
27 2 2010 2 233 16 NaN
28 2 2010 3 234 20 NaN
29 2 2010 4 235 18 0.500000
x / x.shift(3) will divide the current year's job count (for that quarter) by the corresponding value from 3 years ago.

Filter rows in a pandas DataFrame based on a value

I have DataFrame similar to the below (this is just a sample):
i TIME CITIES_LABEL Value lat_rounded long
2 2005 Tilburg 22 250 52.070498 4.300700
3 2005 Amsterdam 45 825 52.370216 4.895168
4 2005 Rotterdam 27 600 51.924420 4.477733
5 2005 Utrecht 12 915 52.090737 5.121420
6 2005 Eindhoven 9 165 51.441642 5.469722
7 2006 Tilburg 7 800 51.560596 5.091914
8 2005 Groningen 7 620 53.219383 6.566502
9 2005 Enschede 6 250 52.221537 6.893662
10 2005 Arnhem 6 025 51.985103 5.898730
11 2006 Utrecht 3 400 50.888174 5.979499
12 2006 Amsterdam 6 795 52.350785 5.264702
13 2005 Breda 8 565 51.571915 4.768323
14 2010 Groningen 6 325 51.812563 5.837226
15 2005 Apeldoorn 7 005 52.211157 5.969923
16 2007 Utrecht 3 785 53.201233 5.799913
17 2006 Rotterdam 7 130 52.387388 4.646219
18 2005 Zaanstad 6 060 52.457966 4.751042
19 2008 Tilburg 6 945 51.697816 5.303675
20 2007 Amsterdam 5 840 52.156111 5.387827
21 2005 Maastricht 5 220 50.851368 5.690972
Cities are repeated along the CITIES_LABEL field. I would like to filter the cities based on their highest TIME value. An example of the output I would like is:
i TIME CITIES_LABEL Value lat_rounded long
6 2005 Eindhoven 9 165 51.441642 5.469722
9 2005 Enschede 6 250 52.221537 6.893662
10 2005 Arnhem 6 025 51.985103 5.898730
13 2005 Breda 8 565 51.571915 4.768323
14 2010 Groningen 6 325 51.812563 5.837226
15 2005 Apeldoorn 7 005 52.211157 5.969923
16 2007 Utrecht 3 785 53.201233 5.799913
17 2006 Rotterdam 7 130 52.387388 4.646219
18 2005 Zaanstad 6 060 52.457966 4.751042
19 2008 Tilburg 6 945 51.697816 5.303675
20 2007 Amsterdam 5 840 52.156111 5.387827
21 2005 Maastricht 5 220 50.851368 5.690972
Any thoughts on how best to approach this issue in pandas?
EDIT
my question is different from Python : How can I get Rows which have the max value of the group to which they belong? because I am looking for a filter for both TIME and CITIES_LABEL while the previous question is only looking at filtering based to a (maximum) value of one field, and it does not care for duplicates in other fields
use groupby and idxmax
df.ix[df.groupby('CITIES_LABEL').TIME.idxmax()]

How can I get this series to a pandas dataframe?

I have some data and after using a groupby function I now have a series that looks like this:
year
1997 15
1998 22
1999 24
2000 24
2001 28
2002 11
2003 15
2004 19
2005 10
2006 10
2007 21
2008 26
2009 23
2010 16
2011 33
2012 19
2013 26
2014 25
How can I create a pandas dataframe from here with year as one column and the other column named sightings ?
I am a pandas novice so don't really know what I am doing. I have tried the reindex and unstack functions but haven't been able to get what I want...
You can use reset_index and rename columns:
print (df.reset_index())
index year
0 1997 15
1 1998 22
2 1999 24
3 2000 24
4 2001 28
5 2002 11
6 2003 15
7 2004 19
8 2005 10
9 2006 10
10 2007 21
11 2008 26
12 2009 23
13 2010 16
14 2011 33
15 2012 19
16 2013 26
17 2014 25
print (df.reset_index().rename(columns=({'index':'year','year':'sightings'})))
year sightings
0 1997 15
1 1998 22
2 1999 24
3 2000 24
4 2001 28
5 2002 11
6 2003 15
7 2004 19
8 2005 10
9 2006 10
10 2007 21
11 2008 26
12 2009 23
13 2010 16
14 2011 33
15 2012 19
16 2013 26
17 2014 25
Another solution is set column names by list of names:
df1 = df.reset_index()
df1.columns = ['year','sightings']
print (df1)
year sightings
0 1997 15
1 1998 22
2 1999 24
3 2000 24
4 2001 28
5 2002 11
6 2003 15
7 2004 19
8 2005 10
9 2006 10
10 2007 21
11 2008 26
12 2009 23
13 2010 16
14 2011 33
15 2012 19
16 2013 26
17 2014 25
EDIT:
Sometimes help add parameter as_index=False to groupby for returning DataFrame:
import pandas as pd
df = pd.DataFrame({'A':[1,1,3],
'B':[4,5,6]})
print (df)
A B
0 1 4
1 1 5
2 3 6
print (df.groupby('A')['B'].sum())
A
1 9
3 6
Name: B, dtype: int64
print (df.groupby('A', as_index=False)['B'].sum())
A B
0 1 9
1 3 6
s.rename('sightings').reset_index()
I've also used this method during the groupby stage to put the results straight into a dataframe:
df2 = df1.groupby(['Year']).count()
df3 = pd.DataFrame(df2).reset_index()
If your original dataframe - df1 - had "Year" and "Sightings" as it's two columns then df3 should have each year listed under "Year" and the count (or sum, average, whatever) listed under "Sightings".
If not, you can change the column names by doing the following:
df3.columns = ['Year','Sightings']
or
df3 = df3.rename(columns={'oldname_A': 'Year', 'oldname_B': 'Sightings'})

Categories

Resources