Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 1 year ago.
Improve this question
Below is my code:
triangle = cl.load_sample('genins')
# Use bootstrap sampler to get resampled triangles
bootstrapdataframe = cl.BootstrapODPSample(n_sims=4, random_state=42).fit(triangle).resampled_triangles_
#converting to dataframe
resampledtriangledf = bootstrapdataframe.to_frame()
print(resampledtriangledf)
In above code i mentioned n_sims(number of simulation)=4. So it generates below datafame:
0 2001 12 254,926
0 2001 24 535,877
0 2001 36 1,355,613
0 2001 48 2,034,557
0 2001 60 2,311,789
0 2001 72 2,539,807
0 2001 84 2,724,773
0 2001 96 3,187,095
0 2001 108 3,498,646
0 2001 120 3,586,037
0 2002 12 542,369
0 2002 24 1,016,927
0 2002 36 2,201,329
0 2002 48 2,923,381
0 2002 60 3,711,305
0 2002 72 3,914,829
0 2002 84 4,385,757
0 2002 96 4,596,072
0 2002 108 5,047,861
0 2003 12 235,361
0 2003 24 960,355
0 2003 36 1,661,972
0 2003 48 2,643,370
0 2003 60 3,372,684
0 2003 72 3,642,605
0 2003 84 4,160,583
0 2003 96 4,480,332
0 2004 12 764,553
0 2004 24 1,703,557
0 2004 36 2,498,418
0 2004 48 3,198,358
0 2004 60 3,524,562
0 2004 72 3,884,971
0 2004 84 4,268,241
0 2005 12 381,670
0 2005 24 1,124,054
0 2005 36 2,026,434
0 2005 48 2,863,902
0 2005 60 3,039,322
0 2005 72 3,288,253
0 2006 12 320,332
0 2006 24 1,022,323
0 2006 36 1,830,842
0 2006 48 2,676,710
0 2006 60 3,375,172
0 2007 12 330,361
0 2007 24 1,463,348
0 2007 36 2,771,839
0 2007 48 4,003,745
0 2008 12 282,143
0 2008 24 1,782,267
0 2008 36 2,898,699
0 2009 12 362,726
0 2009 24 1,277,750
0 2010 12 321,247
1 2001 12 219,021
1 2001 24 755,975
1 2001 36 1,360,298
1 2001 48 2,062,947
1 2001 60 2,356,983
1 2001 72 2,781,187
1 2001 84 2,987,837
1 2001 96 3,118,952
1 2001 108 3,307,522
1 2001 120 3,455,107
1 2002 12 302,932
1 2002 24 1,022,459
1 2002 36 1,634,938
1 2002 48 2,538,708
1 2002 60 3,005,695
1 2002 72 3,274,719
1 2002 84 3,356,499
1 2002 96 3,595,361
1 2002 108 4,100,065
1 2003 12 489,934
1 2003 24 1,233,438
1 2003 36 2,471,849
1 2003 48 3,672,629
1 2003 60 4,157,489
1 2003 72 4,498,470
1 2003 84 4,587,579
1 2003 96 4,816,232
1 2004 12 518,680
1 2004 24 1,209,705
1 2004 36 2,019,757
1 2004 48 2,997,820
1 2004 60 3,630,442
1 2004 72 3,881,093
1 2004 84 4,080,322
1 2005 12 453,963
1 2005 24 1,458,504
1 2005 36 2,036,506
1 2005 48 2,846,464
1 2005 60 3,280,124
1 2005 72 3,544,597
1 2006 12 369,755
1 2006 24 1,209,117
1 2006 36 1,973,136
1 2006 48 3,034,294
1 2006 60 3,537,784
1 2007 12 477,788
1 2007 24 1,524,537
1 2007 36 2,170,391
1 2007 48 3,355,093
1 2008 12 250,690
1 2008 24 1,546,986
1 2008 36 2,996,737
1 2009 12 271,270
1 2009 24 1,446,353
1 2010 12 510,114
2 2001 12 170,866
2 2001 24 797,338
2 2001 36 1,663,610
2 2001 48 2,293,697
2 2001 60 2,607,067
2 2001 72 2,979,479
2 2001 84 3,127,308
2 2001 96 3,285,338
2 2001 108 3,574,272
2 2001 120 3,630,610
2 2002 12 259,060
2 2002 24 1,011,092
2 2002 36 1,851,504
2 2002 48 2,705,313
2 2002 60 3,195,774
2 2002 72 3,766,008
2 2002 84 3,944,417
2 2002 96 4,234,043
2 2002 108 4,763,664
2 2003 12 239,981
2 2003 24 983,484
2 2003 36 1,929,785
2 2003 48 2,497,929
2 2003 60 2,972,887
2 2003 72 3,313,868
2 2003 84 3,727,432
2 2003 96 4,024,122
2 2004 12 77,522
2 2004 24 729,401
2 2004 36 1,473,914
2 2004 48 2,376,313
2 2004 60 2,999,197
2 2004 72 3,372,020
2 2004 84 3,887,883
2 2005 12 321,598
2 2005 24 1,132,502
2 2005 36 1,710,504
2 2005 48 2,438,620
2 2005 60 2,801,957
2 2005 72 3,182,466
2 2006 12 255,407
2 2006 24 1,275,141
2 2006 36 2,083,421
2 2006 48 3,144,579
2 2006 60 3,891,772
2 2007 12 338,120
2 2007 24 1,275,697
2 2007 36 2,238,715
2 2007 48 3,615,323
2 2008 12 310,214
2 2008 24 1,237,156
2 2008 36 2,563,326
2 2009 12 271,093
2 2009 24 1,523,131
2 2010 12 430,591
3 2001 12 330,887
3 2001 24 831,193
3 2001 36 1,601,374
3 2001 48 2,188,879
3 2001 60 2,662,773
3 2001 72 3,086,976
3 2001 84 3,332,247
3 2001 96 3,317,279
3 2001 108 3,576,659
3 2001 120 3,613,563
3 2002 12 358,263
3 2002 24 1,139,259
3 2002 36 2,236,375
3 2002 48 3,163,464
3 2002 60 3,715,130
3 2002 72 4,295,638
3 2002 84 4,502,105
3 2002 96 4,769,139
3 2002 108 5,323,304
3 2003 12 489,934
3 2003 24 1,570,352
3 2003 36 3,123,215
3 2003 48 4,189,299
3 2003 60 4,819,070
3 2003 72 5,306,689
3 2003 84 5,560,371
3 2003 96 5,827,003
3 2004 12 419,727
3 2004 24 1,308,884
3 2004 36 2,118,936
3 2004 48 2,906,732
3 2004 60 3,561,577
3 2004 72 3,934,400
3 2004 84 4,010,511
3 2005 12 389,217
3 2005 24 1,173,226
3 2005 36 1,794,216
3 2005 48 2,528,910
3 2005 60 3,474,035
3 2005 72 3,908,999
3 2006 12 291,940
3 2006 24 1,136,674
3 2006 36 1,915,614
3 2006 48 2,693,930
3 2006 60 3,375,601
3 2007 12 506,055
3 2007 24 1,684,660
3 2007 36 2,678,739
3 2007 48 3,545,156
3 2008 12 282,143
3 2008 24 1,536,490
3 2008 36 2,458,789
3 2009 12 271,093
3 2009 24 1,199,897
3 2010 12 266,359
Using above dataframe I have to create 4 triangles based on Toatal column:
For example:
Row Labels 12 24 36 48 60 72 84 96 108 120 Grand Total
2001 254,926 535,877 1,355,613 2,034,557 2,311,789 2,539,807 2,724,773 3,187,095 3,498,646 3,586,037 22,029,119
2002 542,369 1,016,927 2,201,329 2,923,381 3,711,305 3,914,829 4,385,757 4,596,072 5,047,861 28,339,832
2003 235,361 960,355 1,661,972 2,643,370 3,372,684 3,642,605 4,160,583 4,480,332 21,157,261
2004 764,553 1,703,557 2,498,418 3,198,358 3,524,562 3,884,971 4,268,241 19,842,659
2005 381,670 1,124,054 2,026,434 2,863,902 3,039,322 3,288,253 12,723,635
2006 320,332 1,022,323 1,830,842 2,676,710 3,375,172 9,225,377
2007 330,361 1,463,348 2,771,839 4,003,745 8,569,294
2008 282,143 1,782,267 2,898,699 4,963,110
2009 362,726 1,277,750 1,640,475
2010 321,247 321,247
Grand Total 3,795,687 10,886,456 17,245,147 20,344,022 19,334,833 17,270,466 15,539,355 12,263,499 8,546,507 3,586,037 128,812,009
.
.
.
Like this i need 4 triangles (4 is number of simulation) using 1st dataframe.
If user gives s_sims=900 then it creates 900 totals values based on this we have to create 900 triangles.
Use pivot_table and choose the aggregation function (sum here but you can use mean or whatever):
df = df.pivot_table(index="origin", columns="development",
values="values", aggfunc="sum")
df = df.set_index(df.index.year)
df.loc["Grand Total"] = df.sum()
df.loc[:, "Grand Total"] = df.sum(axis=1)
>>> df
development 12 24 36 48 60 72 84 96 108 120 Grand Total
origin
2001 1.356449e+09 4.695043e+09 8.226504e+09 1.200121e+10 1.408404e+10 1.555555e+10 1.690673e+10 1.781579e+10 1.917689e+10 1.951240e+10 1.293306e+11
2002 1.887634e+09 6.573443e+09 1.150100e+10 1.671772e+10 1.960781e+10 2.164808e+10 2.352267e+10 2.480478e+10 2.671911e+10 NaN 1.529823e+11
2003 1.866031e+09 6.531145e+09 1.137408e+10 1.657377e+10 1.945944e+10 2.148353e+10 2.334087e+10 2.459720e+10 NaN NaN 1.252261e+11
2004 1.842447e+09 6.411653e+09 1.120732e+10 1.633725e+10 1.917381e+10 2.117893e+10 2.301072e+10 NaN NaN NaN 9.916214e+10
2005 1.688064e+09 5.876106e+09 1.027445e+10 1.496756e+10 1.757424e+10 1.939891e+10 NaN NaN NaN NaN 6.977932e+10
2006 1.762834e+09 6.154760e+09 1.076776e+10 1.569864e+10 1.843549e+10 NaN NaN NaN NaN NaN 5.281948e+10
2007 1.968264e+09 6.855178e+09 1.195292e+10 1.741326e+10 NaN NaN NaN NaN NaN NaN 3.818962e+10
2008 2.344669e+09 8.218527e+09 1.433187e+10 NaN NaN NaN NaN NaN NaN NaN 2.489507e+10
2009 1.955145e+09 6.813284e+09 NaN NaN NaN NaN NaN NaN NaN NaN 8.768429e+09
2010 1.716057e+09 NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.716057e+09
Grand Total 1.838759e+10 5.812914e+10 8.963591e+10 1.097094e+11 1.083348e+11 9.926499e+10 8.678100e+10 6.721778e+10 4.589601e+10 1.951240e+10 7.028691e+11
The code above works for the following input data:
>>> df
origin development values
Total
0 2001-01-01 12 3.766810e+05
0 2001-01-01 24 1.025411e+06
0 2001-01-01 36 1.541503e+06
0 2001-01-01 48 2.155232e+06
0 2001-01-01 60 2.422287e+06
... ... ... ...
4999 2008-01-01 24 2.403488e+06
4999 2008-01-01 36 3.100034e+06
4999 2009-01-01 12 3.747304e+05
4999 2009-01-01 24 1.262821e+06
4999 2010-01-01 12 2.469928e+05
[275000 rows x 3 columns]
I'm working with a large dataset. The following is an example, calculated with a smaller dataset.
In this example i got the measurements of the pollution of 3 rivers for different timespans. Each year, the amount pollution of a river is measured at a measuring station downstream ("pollution"). It has already been calculated, in which year the river water was polluted upstream ("year_of_upstream_pollution"). My goal ist to create a new column ["result_of_upstream_pollution"], which contains the amount of pollution connected to the "year_of_upstream_pollution". For this, the data from the "pollution"-column has to be reassigned.
ids = [1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3]
year = [2000,2001,2002,2003,2004,2005,1990,1991,1992,1993,1994,1995,2000,2001,2002,2003,2004,2005]
y1 = [2002,2002,2003,2005,2005,np.NaN,1991,1992,1993,1994,np.NaN,np.NaN,2012,2012,2013,2014,2015,np.NaN]
poll = [10,14,20,11,8,11,
20,22,20,25,18,21,
30,19,15,10,26,28]
dictr1 ={"river_id":ids,"year":year,"pollution": poll,"year_of_upstream_pollution":y1}
dfr1 = pd.DataFrame(dictr1)
print(dfr1)
river_id year pollution year_of_upstream_pollution
0 1 2000 10 2002.0
1 1 2001 14 2002.0
2 1 2002 20 2003.0
3 1 2003 11 2005.0
4 1 2004 8 2005.0
5 1 2005 11 NaN
6 2 1990 20 1991.0
7 2 1991 22 1992.0
8 2 1992 20 1993.0
9 2 1993 25 1994.0
10 2 1994 18 NaN
11 2 1995 21 NaN
12 3 2000 30 2002.0
13 3 2001 19 2002.0
14 3 2002 15 2003.0
15 3 2003 10 2004.0
16 3 2004 26 2005.0
17 3 2005 28 NaN
Example: river_id = 1, year = 2000, year_of_upstream_pollution = 2002
value of the pollution-column in year 2002 = 20
Therefore: result_of_upstream_pollution = 20
The resulting column should look like this:
result_of_upstream_pollution
0 20.0
1 20.0
2 11.0
3 11.0
4 11.0
5 NaN
6 22.0
7 20.0
8 25.0
9 18.0
10 NaN
11 NaN
12 15.0
13 15.0
14 10.0
15 26.0
16 28.0
17 NaN
My own approach:
### My approach
# Split dfr1 in two
dfr3 = pd.DataFrame(dfr1, columns = ["river_id","year","pollution"])
dfr4 = pd.DataFrame(dfr1, columns = ["river_id","year_of_upstream_pollution"])
# Merge the two dataframes on the "year" and "year_of_upstream_pollution"-column
arrayr= dfr4.merge(dfr3, left_on = "year_of_upstream_pollution", right_on = "year", how = "left").pollution.values
listr = arrayr.tolist()
dfr1["result_of_upstream_pollution"] = listr
print(dfr1)
len(listr) # = 28
This results in the following ValueError:
"Length of values does not match length of index"
My explanation for this is, that the values in the "year"-column of "dfr3" are not unique, which leads to several numbers being assigned to each year and explains why: len(listr) = 28
I haven't been able to find a way around this error yet. Please keep in mind that the real dataset is much larger than this one. Any help would be much appreciated!
As you said in the title, this is merge on two column:
dfr1['result_of_upstream_pollution'] = dfr1.merge(dfr1, left_on=['river_id','year'],
right_on=['river_id','year_of_upstream_pollution'],
how='right')['pollution_x']
print(df)
Output:
result_of_upstream_pollution
0 20.0
1 20.0
2 11.0
3 11.0
4 11.0
5 NaN
6 22.0
7 20.0
8 25.0
9 18.0
10 NaN
11 NaN
12 15.0
13 15.0
14 10.0
15 26.0
16 28.0
17 NaN
I just realized that this solution doesn't seem to be working for me.
When i execute the code, this is what happens:
dfr1['result_of_upstream_pollution'] = dfr1.merge(dfr1, left_on=['river_id','year'],
right_on=['river_id','year_of_upstream_pollution'],
how='right')['pollution_x']
print(dfr1)
river_id year pollution year_of_upstream_pollution \
0 1 2000 10 2002.0
1 1 2001 14 2002.0
2 1 2002 20 2003.0
3 1 2003 11 2005.0
4 1 2004 8 2005.0
5 1 2005 11 NaN
6 2 1990 20 1991.0
7 2 1991 22 1992.0
8 2 1992 20 1993.0
9 2 1993 25 1994.0
10 2 1994 18 NaN
11 2 1995 21 NaN
12 3 2000 30 2002.0
13 3 2001 19 2002.0
14 3 2002 15 2003.0
15 3 2003 10 2004.0
16 3 2004 26 2005.0
17 3 2005 28 NaN
result_of_upstream_pollution
0 20.0
1 20.0
2 11.0
3 11.0
4 11.0
5 22.0
6 20.0
7 25.0
8 18.0
9 15.0
10 15.0
11 10.0
12 26.0
13 28.0
14 NaN
15 NaN
16 NaN
17 NaN
For some reason, this code doesn't seem to be handling the "NaN" values in the right way.
If there is an "NaN"-value (in the column: "year_of_upstream_pollution"), there shouldnt be a value in "result_of_upstream_pollution".
Equally, the ids 14,15 and 16 all have values for the "year_of_upstream_pollution" which has matching data in the "pollution-column" and therefore should also have values in the result-column.
On top of that, it seems that all values after the first "NaN" (at id = 5) are assigned the wrong values.
#Quang Hoang Thank you very much for trying to solve my problem! Could you maybe explain why my results differ from yours?
Does anyone know how i can get this code to work?