Python stack loses data

I'm trying to reorganise my data (the overarching goal is to convert an ASCII file to netCDF). One of the steps to get there is to take the data and stack the columns. My original data look like this:
import pandas as pd
import numpy as np
import xarray as xr
fname = 'data.out'
df = pd.read_csv(fname, header=0, delim_whitespace=True)
print(df)
gives
Lon Lat Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
0 150.25 -34.25 1851 0.027 -0.005 -0.010 -0.034 -0.029 -0.025 0.016 -0.049 -0.055 0.003 -0.029 0.060
1 150.25 -34.25 1852 0.021 -0.002 -0.050 0.071 0.066 0.001 0.021 -0.014 -0.072 -0.050 0.113 0.114
2 150.25 -34.25 1853 0.093 0.094 0.139 -0.019 0.015 0.003 0.018 -0.032 -0.024 -0.010 0.132 0.107
3 150.25 -34.25 1854 0.084 0.071 0.024 -0.004 -0.022 0.005 0.025 0.006 -0.040 -0.051 -0.067 -0.005
4 150.25 -34.25 1855 -0.030 -0.004 -0.035 -0.036 -0.035 -0.012 0.009 -0.017 -0.062 -0.068 -0.077 -0.084
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
707995 138.75 -19.25 2096 -0.044 -0.039 -0.068 -0.027 -0.023 -0.029 -0.031 -0.002 -0.005 0.018 -0.039 -0.094
707996 138.75 -19.25 2097 -0.041 -0.066 -0.006 -0.018 -0.005 -0.017 0.011 0.018 0.026 0.024 0.010 -0.086
707997 138.75 -19.25 2098 -0.033 -0.044 -0.032 -0.044 -0.046 -0.040 -0.021 -0.017 0.022 -0.011 -0.015 -0.032
707998 138.75 -19.25 2099 0.039 0.016 -0.009 0.001 -0.002 0.001 0.010 0.021 0.026 0.027 0.012 -0.017
707999 138.75 -19.25 2100 0.010 -0.022 -0.024 -0.037 -0.008 -0.020 0.002 0.011 0.011 0.033 0.020 -0.002
[708000 rows x 15 columns]
I then select the actual timesteps
months=list(df.columns)
months=months[3:]
and select all columns that have monthly data. This then returns the shape
print(df[months].shape)
(708000, 12). So far so good, but then when I stack the data
df_stack = df[months].stack()
print(df_stack.shape)
instead of the expected shape (8496000,) I get (8493000,). The weird thing is that the script runs fine on other files with the same shape as the data used in this example, and the problem doesn't occur there. It looks like I'm losing one Lon/Lat pixel for 250 years, but I don't understand why. This becomes a problem later when I try to convert the data to a netCDF file.
lons = np.unique(df.Lon)
lats = np.unique(df.Lat)
years = np.unique(df.Year)
nyears = len(years)
nrows = len(lats)
ncols = len(lons)
nmonths = 12
lons.sort()
lats.sort()
years.sort()
time = pd.date_range(start=f'01/{years[0]}',
                     end=f'01/{years[-1]+1}', freq='M')
dx = 0.5
Lon = xr.DataArray(np.arange(-180.+dx/2., 180., dx), dims=("Lon"),
                   attrs={"long_name":"longitude", "unit":"degrees_east"})
nlon = Lon.size
dy = 0.5
Lat = xr.DataArray(np.arange(-90.+dy/2., 90., dy), dims=("Lat"),
                   attrs={"long_name":"latitude", "unit":"degrees_north"})
nlat = Lat.size
out = xr.DataArray(np.zeros((nyears*nmonths, nlat, nlon)),
                   dims=("Time","Lat","Lon"),
                   coords=({"Lat":Lat, "Lon":Lon, "Time":time}))
for nr in range(0, len(df.index), nyears):
    rows = df[nr:nr+nyears]
    thislon = rows["Lon"].min()
    thislat = rows["Lat"].min()
    out.loc[dict(
        Lon=thislon,
        Lat=thislat)] = df_stack[nr*nmonths:(nr+nyears)*nmonths]
This gives me the error
ValueError: could not broadcast input array from shape (0,) into shape (3000,)
It's missing the 3000 values (250 years × 12 months, i.e. one full pixel's time series) that I'm losing while stacking the data. Does anyone know how to fix this?

Replace
df_stack = df[months].stack()
with
df_stack = df[months].stack(dropna=False)
By default DataFrame.stack() drops NaN entries (dropna=True), so every missing monthly value silently removes one element from the stacked result; keeping the NaNs preserves the expected 708000 × 12 = 8496000 values.
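A minimal sketch of the behaviour (a made-up two-row frame, assuming a pandas version in which stack() still accepts the dropna keyword):
import numpy as np
import pandas as pd

# Two rows, one missing value: stack() silently drops the NaN by default.
demo = pd.DataFrame({"Jan": [0.1, np.nan], "Feb": [0.2, 0.3]})
print(demo.stack().shape)              # (3,)  -> the NaN entry was dropped
print(demo.stack(dropna=False).shape)  # (4,)  -> all 2 x 2 values kept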

Related

Why is my regressor predicting counts much higher than the actual counts?

I'm building a regression model (statsmodels.discrete.count_model.ZeroInflatedPoisson) in which my goal is to predict the "count" variable, as you can see below. Any suggestion on how I can improve this model?
from patsy import dmatrices

# regression expression in Patsy notation: count depends on all these columns
expr = 'count ~ day_of_week + business_day + duration + Distance_KM + wind_speed + wind_deg + wind_gust + rain_1h + rain_3h + clouds_all + End_Station_Region_cat'
y_train, X_train = dmatrices(expr, df, return_type='dataframe')
y_test, X_test = dmatrices(expr, df, return_type='dataframe')
import statsmodels.discrete.count_model as cm
zip_training_results = cm.ZeroInflatedPoisson(endog=y_train, exog=X_train, exog_infl=X_train, inflation='logit', maxiter=1000, maxfun=500).fit_regularized()
Optimization terminated successfully (Exit mode 0)
Current function value: 0.030743127487834265
Iterations: 184
Function evaluations: 230
Gradient evaluations: 184
import statsmodels.discrete.count_model as cm
print(zip_training_results.summary())
ZeroInflatedPoisson Regression Results
===============================================================================
Dep. Variable: count No. Observations: 3099093
Model: ZeroInflatedPoisson Df Residuals: 3099081
Method: MLE Df Model: 11
Date: Sun, 04 Sep 2022 Pseudo R-squ.: 0.6252
Time: 17:17:42 Log-Likelihood: -95276.
converged: True LL-Null: -2.5418e+05
Covariance Type: nonrobust LLR p-value: 0.000
==================================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------------------------
inflate_Intercept 29.2202 488.934 0.060 0.952 -929.073 987.514
inflate_day_of_week 7.5087 11.894 0.631 0.528 -15.802 30.820
inflate_business_day 20.5779 59.774 0.344 0.731 -96.577 137.733
inflate_duration -85.6092 487.673 -0.176 0.861 -1041.431 870.212
inflate_Distance_KM -1.9937 46.122 -0.043 0.966 -92.391 88.404
inflate_wind_speed -1.0937 9.175 -0.119 0.905 -19.077 16.889
inflate_wind_deg -5.6348 48.297 -0.117 0.907 -100.296 89.026
inflate_wind_gust 0.8242 763.772 0.001 0.999 -1496.142 1497.790
inflate_rain_1h 0.5288 59.935 0.009 0.993 -116.942 118.000
inflate_rain_3h 0.9703 7.97e+04 1.22e-05 1.000 -1.56e+05 1.56e+05
inflate_clouds_all -0.1600 5.526 -0.029 0.977 -10.991 10.671
inflate_End_Station_Region_cat -0.3111 2.362 -0.132 0.895 -4.940 4.318
Intercept -0.5539 0.020 -28.229 0.000 -0.592 -0.515
day_of_week 0.0115 0.003 3.563 0.000 0.005 0.018
business_day 0.0560 0.014 4.127 0.000 0.029 0.083
duration 0.0149 0.000 80.515 0.000 0.015 0.015
Distance_KM 0.0972 0.002 49.338 0.000 0.093 0.101
wind_speed -0.9573 0.009 -109.888 0.000 -0.974 -0.940
wind_deg -0.0106 8.82e-05 -119.945 0.000 -0.011 -0.010
wind_gust 0.0001 0.015 0.008 0.994 -0.029 0.029
rain_1h -0.1926 0.019 -10.084 0.000 -0.230 -0.155
rain_3h -0.0724 0.026 -2.755 0.006 -0.124 -0.021
clouds_all 0.0006 0.000 5.288 0.000 0.000 0.001
End_Station_Region_cat 0.0034 0.005 0.693 0.489 -0.006 0.013
==================================================================================================
zip_predictions = zip_training_results.predict(X_test,exog_infl=X_test)
predicted_counts=np.round(zip_predictions)
actual_counts = y_test['count']
print('ZIP RMSE='+str(np.sqrt(np.sum(np.power(np.subtract(predicted_counts,actual_counts),2)))))
ZIP RMSE=195.05127530985283
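As a side note (my reading of intent, not from the original post): the expression above takes the square root of the sum of squared errors; the conventional RMSE divides by the number of samples first, e.g.
rmse = np.sqrt(np.mean(np.square(predicted_counts - actual_counts)))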
The plot below (a histogram of actual vs. predicted counts) makes it clear, I think, that the model in the code above ends up predicting counts much higher than the observed values. When Y is small the regressor makes good predictions; what I would like to fix is that it predicts much larger values of Y than the actual ones.
import matplotlib.pyplot as plt

plt.clf()
plt.hist([actual_counts, predicted_counts], log=True)
plt.legend(('orig', 'pred'))
plt.show()
A plain Poisson attempt:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import PoissonRegressor
from sklearn import metrics

pipeline = Pipeline([('model', PoissonRegressor())])
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
r2_test = metrics.r2_score(y_test, y_pred)
r2_test
-0.18668761012669255
y_pred_train = pipeline.predict(X_train)
r2_train = metrics.r2_score(y_train, y_pred_train)
r2_train
-0.10978290552023906
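Worth noting (an assumption about intent, not something stated in the post): both dmatrices calls above are built from the same df, so the test matrices are identical to the training ones. A minimal sketch of carving out a genuine hold-out set first:
from patsy import dmatrices
from sklearn.model_selection import train_test_split

# split the raw rows first, then build separate design matrices for each part
df_train, df_test = train_test_split(df, test_size=0.2, random_state=0)
y_train, X_train = dmatrices(expr, df_train, return_type='dataframe')
y_test, X_test = dmatrices(expr, df_test, return_type='dataframe')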

How to groupby and aggregate an operation to multiple columns?

I'm trying to compute means in a data frame grouped by two columns, but I'm getting the following error:
TypeError: 'numpy.float64' object is not callable
The dataframe:
date origin positive_score neutral_score negativity_score compound_score
2020-09-19 the verge 0.130 0.846 0.024 0.9833
2020-09-19 the verge 0.130 0.846 0.024 0.9833
2020-09-19 fool 0.075 0.869 0.056 0.8560
2020-09-19 seeking_alpha 0.067 0.918 0.015 0.9983
2020-09-19 seeking_alpha 0.171 0.791 0.038 0.7506
2020-09-19 seeking_alpha 0.095 0.814 0.091 0.9187
2020-09-19 seeking_alpha 0.113 0.801 0.086 0.9890
2020-09-19 seeking_alpha 0.094 0.869 0.038 0.9997
2020-09-19 wall street journal 0.000 1.000 0.000 0.0000
2020-09-19 seeking_alpha 0.179 0.779 0.042 0.9997
2020-09-19 seeking_alpha 0.178 0.704 0.117 0.7360
My code:
def mean_indicators(cls, df: pd.DataFrame):
    df_with_mean = df.groupby([DATE, ORIGIN], as_index=False).agg(
        {POSITIVE_SCORE: df[POSITIVE_SCORE].mean(),
         NEGATIVE_SCORE: df[NEGATIVE_SCORE].mean(),
         NEUTRAL_SCORE: df[NEUTRAL_SCORE].mean(),
         COMPOUND_SCORE: df[COMPOUND_SCORE].mean()
         })
    return df_with_mean
return df_with_mean
I think this should do what you want:
def mean_indicators(cls, df: pd.DataFrame):
    df_with_mean = df.groupby([DATE, ORIGIN], as_index=False).agg(
        {POSITIVE_SCORE: "mean",
         NEGATIVE_SCORE: "mean",
         NEUTRAL_SCORE: "mean",
         COMPOUND_SCORE: "mean",
         })
    return df_with_mean
You can alternatively use named aggregation syntax, sketched below.
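A sketch of the same aggregation using named aggregation (the output column names here are my own choice):
def mean_indicators(cls, df: pd.DataFrame):
    return df.groupby([DATE, ORIGIN], as_index=False).agg(
        positive_score=(POSITIVE_SCORE, "mean"),
        negative_score=(NEGATIVE_SCORE, "mean"),
        neutral_score=(NEUTRAL_SCORE, "mean"),
        compound_score=(COMPOUND_SCORE, "mean"),
    )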
The error comes from passing the result of an operation to .agg instead of the operation itself.
{POSITIVE_SCORE: df[POSITIVE_SCORE].mean()} is not correct.
{'positive_score': 'mean'} is correct.
Since you are taking the mean of all the non-grouped numeric columns, an aggregation dict is not necessary at all.
Use pandas.core.groupby.GroupBy.mean for one operation on the entire dataframe.
Use pandas.core.groupby.DataFrameGroupBy.aggregate to apply different operations per column (see "Applying multiple functions at once" in the pandas groupby docs).
# just groupby and mean
df_mean = df.groupby(['date', 'origin'], as_index=False).mean()
# display(df_mean)
date origin positive_score neutral_score negativity_score compound_score
2020-09-19 fool 0.075000 0.869000 0.056 0.856000
2020-09-19 seeking_alpha 0.128143 0.810857 0.061 0.913143
2020-09-19 the verge 0.130000 0.846000 0.024 0.983300
2020-09-19 wall street journal 0.000000 1.000000 0.000 0.000000

Why does using groupby make some ids disappear?

I was working on a machine learning project, and while extracting features I found that some consumers' LCLid values disappear from the data set when I group by LCLid.
Dataset: SmartMeter Energy Consumption Data in London Households
Here is the original data set, and here is the code that I used to extract some features:
LCLid = []
for i in range(68):
    LCLid.append('MAC0000' + str(228 + i))
consommation = data.groupby('LCLid')['KWH/hh'].sum()
consommation_min = data.groupby('LCLid')['KWH/hh'].min()
consommation_max = data.groupby('LCLid')['KWH/hh'].max()
consommation_mean = data.groupby('LCLid')['KWH/hh'].mean()
consommation_evening = data.groupby(['LCLid', 'period'])['KWH/hh'].mean()
# creation of the dataframe
list_of_tuples = list(zip(LCLid, consommation, consommation_min, consommation_max, consommation_mean))
data2 = pd.DataFrame(list_of_tuples, columns=['LCLid', 'Consumption', 'Consumption_min', 'Consumption_max', 'Consumption_mean'])
As you can see, after executing the code the resulting dataset stops at LCLid MAC000282, while the original one also contains the LCLids from MAC000283 to MAC000295.
Using low-carbon-london-data from SmartMeter Energy Consumption Data in London Households: the issue is that LCLid does not increment uniformly by 1 from MAC000228 to MAC000295.
print(data.LCLid.unique())
array(['MAC000228', 'MAC000229', 'MAC000230', 'MAC000231', 'MAC000232',
'MAC000233', 'MAC000234', 'MAC000235', 'MAC000237', 'MAC000238',
'MAC000239', 'MAC000240', 'MAC000241', 'MAC000242', 'MAC000243',
'MAC000244', 'MAC000245', 'MAC000246', 'MAC000248', 'MAC000249',
'MAC000250', 'MAC000251', 'MAC000252', 'MAC000253', 'MAC000254',
'MAC000255', 'MAC000256', 'MAC000258', 'MAC000260', 'MAC000262',
'MAC000263', 'MAC000264', 'MAC000267', 'MAC000268', 'MAC000269',
'MAC000270', 'MAC000271', 'MAC000272', 'MAC000273', 'MAC000274',
'MAC000275', 'MAC000276', 'MAC000277', 'MAC000279', 'MAC000280',
'MAC000281', 'MAC000282', 'MAC000283', 'MAC000284', 'MAC000285',
'MAC000287', 'MAC000289', 'MAC000291', 'MAC000294', 'MAC000295'],
dtype=object)
print(len(data.LCLid.unique()))
>>> 55
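Because the hand-built LCLid list assumes 68 consecutive ids while the data contains only 55, the zip in the original code silently truncates and mislabels the grouped results. A minimal illustration with made-up ids and sums:
constructed = ['MAC000228', 'MAC000229', 'MAC000230']  # assumes MAC000229 exists
grouped_ids = ['MAC000228', 'MAC000230']               # MAC000229 is missing from the data
sums = [5761.3, 8911.2]                                # groupby results, in grouped_ids order
print(list(zip(constructed, sums)))
# [('MAC000228', 5761.3), ('MAC000229', 8911.2)] -> 8911.2 is wrongly labelled MAC000229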
To resolve the issue
import pandas as pd
import numpy as np
df = pd.read_csv('Power-Networks-LCL-June2015(withAcornGps)v2.csv')
# determine the rows needed for the MAC000228 - MAC000295
df[df.LCLid == 'MAC000228'].iloc[0, :] # first row of 228
df[df.LCLid == 'MAC000295'].iloc[-1, :] # last row of 295
# create a dataframe with the desired data
data = df[['LCLid', 'DateTime', 'KWH/hh (per half hour) ']].iloc[6989700:9032044, :].copy()
# fix the data
data.DateTime = pd.to_datetime(data.DateTime)
data.rename(columns={'KWH/hh (per half hour) ': 'KWH/hh'}, inplace=True)
data['KWH/hh'] = data['KWH/hh'].str.replace('Null', 'NaN')
data['KWH/hh'].fillna(np.nan, inplace=True)
data['KWH/hh'] = data['KWH/hh'].astype('float')
data.reset_index(drop=True, inplace=True)
# aggregate your functions
agg_data = data.groupby('LCLid')['KWH/hh'].agg(['sum', 'min', 'max', 'mean']).reset_index()
print(agg_data)
agg_data
LCLid sum min max mean
0 MAC000228 5761.288000 0.021 1.616 0.146356
1 MAC000229 6584.866999 0.008 3.294 0.167456
2 MAC000230 8911.154000 0.029 2.750 0.226384
3 MAC000231 3174.314000 0.000 1.437 0.080663
4 MAC000232 2083.042000 0.005 0.736 0.052946
5 MAC000233 2241.591000 0.000 3.137 0.056993
6 MAC000234 9700.328001 0.029 2.793 0.246646
7 MAC000235 8473.999003 0.011 3.632 0.223194
8 MAC000237 22263.294998 0.036 4.450 0.598299
9 MAC000238 7814.889998 0.016 2.835 0.198781
10 MAC000239 6113.029000 0.015 1.346 0.155481
11 MAC000240 7280.662000 0.000 3.146 0.222399
12 MAC000241 4181.169999 0.024 1.733 0.194963
13 MAC000242 1654.336000 0.000 1.481 0.042088
14 MAC000243 11057.366999 0.009 3.588 0.281989
15 MAC000244 5894.271000 0.005 1.884 0.149939
16 MAC000245 22788.699005 0.037 4.743 0.580087
17 MAC000246 13787.060005 0.014 3.516 0.351075
18 MAC000248 10192.239001 0.000 4.351 0.259536
19 MAC000249 24401.468995 0.148 5.242 0.893042
20 MAC000250 5850.003000 0.000 2.185 0.148999
21 MAC000251 8400.234000 0.035 3.505 0.213931
22 MAC000252 21748.489004 0.135 4.171 0.554978
23 MAC000253 9739.408999 0.009 1.714 0.248201
24 MAC000254 9351.614001 0.009 2.484 0.238209
25 MAC000255 14142.974002 0.097 3.305 0.360220
26 MAC000256 20398.665001 0.049 3.019 0.520680
27 MAC000258 6646.485998 0.017 2.319 0.169666
28 MAC000260 5952.563001 0.006 2.192 0.151952
29 MAC000262 13909.603999 0.000 2.878 0.355181
30 MAC000263 3753.997000 0.015 1.060 0.095863
31 MAC000264 7022.967000 0.020 0.910 0.179432
32 MAC000267 8797.094000 0.029 2.198 0.224898
33 MAC000268 3734.252001 0.000 1.599 0.095359
34 MAC000269 2395.232000 0.000 1.029 0.061167
35 MAC000270 15569.711002 0.131 2.249 0.397501
36 MAC000271 7244.860000 0.028 1.794 0.184974
37 MAC000272 8703.658998 0.034 3.295 0.222446
38 MAC000273 3622.199002 0.005 5.832 0.092587
39 MAC000274 28724.718997 0.032 3.927 0.734422
40 MAC000275 5564.004999 0.012 1.840 0.161290
41 MAC000276 11060.774001 0.000 1.709 0.315724
42 MAC000277 8446.528999 0.027 1.938 0.241075
43 MAC000279 3444.160999 0.016 1.846 0.098354
44 MAC000280 12595.780001 0.125 1.988 0.360436
45 MAC000281 6282.568000 0.024 1.433 0.179538
46 MAC000282 4457.989001 0.030 1.830 0.127444
47 MAC000283 5024.917000 0.011 2.671 0.143627
48 MAC000284 1293.503000 0.000 0.752 0.047975
49 MAC000285 2399.018000 0.006 0.931 0.068567
50 MAC000287 1407.290000 0.000 2.372 0.045253
51 MAC000289 4767.490999 0.000 2.287 0.136436
52 MAC000291 13456.678999 0.072 3.354 0.385060
53 MAC000294 9477.966000 0.053 2.438 0.271264
54 MAC000295 7750.128000 0.010 1.839 0.221774

Extremely slow inference on MacOS for faster_rcnn_resnet50_fgvc_2018_07_19 trained on iNaturalist dataset

It takes more than 6 minutes per image to make an inference with the iNat models (2854 classes) on my reasonably new MBP. Other models are much faster but useless to me, as I want to box (and recognize) insects.
352.11440205574036
{'num_detections': 5, 'detection_boxes': array([[0.161314 , 0.07663244, 0.8760584 , 0.8353734 ],
[0.21500829, 0.15136112, 0.8591431 , 0.77994514],
[0.2097887 , 0.15024516, 0.798306 , 0.7945756 ],
[0.22616749, 0.14413752, 0.8274681 , 0.78175414],
[0.23031202, 0.1721791 , 0.8461484 , 0.7838437 ]], dtype=float32), 'detection_scores': array([0.49970442, 0.07632057, 0.07298005, 0.06298006, 0.03536102],
dtype=float32), 'detection_classes': array([ 2, 248, 9, 155, 198], dtype=uint8)}
29469691 function calls (29323558 primitive calls) in 471.230 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
2740/1 0.044 0.000 471.234 471.234 {built-in method builtins.exec}
1 0.319 0.319 471.227 471.227 detect_insect.py:1(<module>)
1 0.004 0.004 355.473 355.473 detect_insect.py:72(run_inference_for_single_image)
1 0.001 0.001 352.112 352.112 session.py:846(run)
1 0.002 0.002 352.111 352.111 session.py:1091(_run)
1 0.000 0.000 352.096 352.096 session.py:1318(_do_run)
1 0.000 0.000 352.096 352.096 session.py:1363(_do_call)
1 0.001 0.001 352.096 352.096 session.py:1346(_run_fn)
1 0.002 0.002 347.445 347.445 session.py:1439(_call_tf_sessionrun)
1 347.443 347.443 347.443 347.443 {built-in method _pywrap_tensorflow_internal.TF_SessionRun_wrapper}
1 0.441 0.441 56.288 56.288 request.py:1775(retrieve)
85287 0.176 0.000 52.912 0.001 tempfile.py:479(func_wrapper)
85287 0.589 0.000 52.735 0.001 client.py:436(read)
85286 0.328 0.000 51.999 0.001 client.py:468(readinto)
85286 0.487 0.000 51.645 0.001 {method 'readinto' of '_io.BufferedReader' objects}
159149 0.374 0.000 51.280 0.000 socket.py:575(readinto)
159149 50.610 0.000 50.610 0.000 {method 'recv_into' of '_socket.socket' objects}
25 0.000 0.000 35.959 1.438 deprecation.py:473(new_func)
1 0.000 0.000 35.956 35.956 importer.py:347(import_graph_def)
1 0.001 0.001 35.956 35.956 importer.py:415(_import_graph_def_internal)
1 0.489 0.489 19.450 19.450 importer.py:237(_ProcessNewOps)
1 0.290 0.290 13.701 13.701 ops.py:3542(_add_new_tf_operations)
1 0.106 0.106 11.004 11.004 ops.py:3560(<listcomp>)
1 10.931 10.931 10.931 10.931 {built-in method _pywrap_tensorflow_internal.TF_GraphImportGraphDefWithResults}
Any advice?

Obtaining the last value that equals or is nearest to a target in a dataframe column

I have an issue in my code; I'm computing cut points.
First, this is my DataFrame column:
In [23]: df['bad_%']
0 0.025
1 0.007
2 0.006
3 0.006
4 0.006
5 0.006
6 0.007
7 0.007
8 0.007
9 0.006
10 0.006
11 0.009
12 0.009
13 0.009
14 0.008
15 0.008
16 0.008
17 0.012
18 0.012
19 0.05
20 0.05
21 0.05
22 0.05
23 0.05
24 0.05
25 0.05
26 0.05
27 0.062
28 0.062
29 0.061
5143 0.166
5144 0.166
5145 0.166
5146 0.167
5147 0.167
5148 0.167
5149 0.167
5150 0.167
5151 0.05
5152 0.167
5153 0.167
5154 0.167
5155 0.167
5156 0.051
5157 0.052
5158 0.161
5159 0.149
5160 0.168
5161 0.168
5162 0.168
5163 0.168
5164 0.168
5165 0.168
5166 0.168
5167 0.168
5168 0.049
5169 0.168
5170 0.168
5171 0.168
5172 0.168
Name: bad%, Length: 5173, dtype: float64
I used this code to detect the value equal to or nearest to 0.05 (the value entered on the console):
error = 100  # margin of error
valuesA = []  # array to save data
pointCut = 0  # identify cut point
for index, row in df.iterrows():
    if abs(row['bad%'] - a) <= error:
        valuesA = row
        error = abs(row['bad%'] - a)
        # Variable "a" is entered on the console, in this case 0.05
        pointCut = index
This code returns the value 0.05 at index 5151, which at first glance looks right, because the 0.05 at index 5151 is the last 0.05.
Out [27]:
5151 0.05
But my objective is to obtain THE LAST VALUE IN THE COLUMN equal to or nearest to 0.05; in this case that value is the 0.049 at index 5168, and that is the value I need.
Is there an algorithm that allows this? Any solution or recommendation?
Thanks in advance.
Solutions if at least one matching value exists:
Use [::-1] to reverse the Series and idxmax to get the last matched index value:
a = 0.05
s = df['bad%']
b = s[[(s[::-1] <= a).idxmax()]]
print (b)
5168 0.049
Or:
b = s[(s <= a)].iloc[[-1]]
print (b)
5168 0.049
Name: bad%, dtype: float64
A solution that also works if the value does not exist (it then yields an empty Series): reverse the boolean mask and take its cumulative sum; the cumulative sum equals 1 exactly on the span ending at the last True, so ANDing it with the original mask isolates that last match.
a = 0.05
s = df['bad%']
m1 = (s <= a)
m2 = m1[::-1].cumsum().eq(1)
b = s[m1 & m2]
print (b)
5168 0.049
Name: bad%, dtype: float64
Sample data:
df = pd.DataFrame({'bad%': {5146: 0.16699999999999998, 5147: 0.16699999999999998, 5148: 0.16699999999999998, 5149: 0.049, 5150: 0.16699999999999998, 5151: 0.05, 5152: 0.16699999999999998, 5167: 0.168, 5168: 0.049, 5169: 0.168}})
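A quick check of the empty-Series claim against the sample frame above (my own example): with a threshold below every value, the mask is all False and an empty Series comes back instead of an error.
a = 0.001
s = df['bad%']
m1 = (s <= a)
m2 = m1[::-1].cumsum().eq(1)
print(s[m1 & m2])
# Series([], Name: bad%, dtype: float64)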
