Sagemaker Regression - ValueError: Cannot format input - python

I am new to SageMaker and Python.
I am trying to get a simple regression model going on AWS using Jupyter notebooks, with the Abalone dataset from the UCI data repository.
I would greatly appreciate some assistance or a link pointing me in the right direction.
Everything looks fine until I try to run:
regression_linear = sagemaker.estimator.Estimator(
    container,
    role=sagemaker.get_execution_role(),
    input_mode="File",
    instance_count=1,
    instance_type="ml.m4.xlarge",
    output_path=output_location,
    sagemaker_session=sess,
)

regression_linear.set_hyperparameters(
    feature_dim=8,
    epochs=16,
    wd=0.01,
    loss="absolute_loss",
    predictor_type="regressor",
    normalize_data=True,
    optimizer="adam",
    mini_batch_size=100,
    lr_scheduler_step=100,
    lr_scheduler_factor=0.99,
    lr_scheduler_minimum_lr=0.0001,
    learning_rate=0.1,
)

from time import gmtime, strftime

job_name = "DEMO-linear-learner-abalone-regression-" + strftime("%H-%M-%S", gmtime())
print("Training job: ", job_name)

regression_linear.fit(inputs={"train": train_data}, job_name=job_name)
Then I am getting the following error:
ValueError Traceback (most recent call last)
<ipython-input-101-82bd2950b590> in <module>
----> 1 regression_linear.fit(inputs={"train": train_data}, job_name=job_name)
2
3 # , "validation": test_data
ValueError: Cannot format input age sex length diameter height whole_weight shucked_weight \
449 18 0 0.565 0.455 0.150 0.8205 0.3650
1080 7 1 0.430 0.335 0.120 0.3970 0.1985
2310 13 0 0.435 0.350 0.110 0.3840 0.1430
3790 10 0 0.650 0.505 0.175 1.2075 0.5105
3609 9 0 0.555 0.405 0.120 0.9130 0.4585
... ... ... ... ... ... ... ...
2145 9 0 0.415 0.325 0.115 0.3455 0.1405
3815 8 -1 0.460 0.340 0.100 0.3860 0.1805
3534 6 -1 0.400 0.315 0.090 0.3300 0.1510
2217 13 0 0.515 0.415 0.130 0.7640 0.2760
3041 9 1 0.575 0.470 0.150 0.9785 0.4505
vicera_weight shell_weight
449 0.1590 0.2600
1080 0.0865 0.1035
2310 0.1005 0.1250
3790 0.2620 0.3900
3609 0.1960 0.2065
... ... ...
2145 0.0765 0.1100
3815 0.0875 0.0965
3534 0.0680 0.0800
2217 0.1960 0.2500
3041 0.1960 0.2760
[2923 rows x 9 columns]. Expecting one of str, TrainingInput, file_input or FileSystemInput
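The last line of the error is the key: fit() is being handed a pandas DataFrame, but the "train" channel must be an S3 URI string, a TrainingInput, a file_input, or a FileSystemInput. A minimal sketch of one way to fix this (bucket and prefix names are placeholders; for the built-in Linear Learner with CSV input the label, age here, goes in the first column and the file has no header):
import sagemaker
from sagemaker.inputs import TrainingInput

# write the DataFrame as a header-less CSV with the label (age) first
train_data.to_csv("abalone_train.csv", header=False, index=False)

# upload it to S3 (default bucket and key prefix are placeholders)
s3_train = sess.upload_data("abalone_train.csv",
                            bucket=sess.default_bucket(),
                            key_prefix="abalone/train")

# wrap the S3 location in a TrainingInput and pass that to fit()
train_input = TrainingInput(s3_train, content_type="text/csv")
regression_linear.fit(inputs={"train": train_input}, job_name=job_name)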

Related

Requests in multiprocessing fail python

I'm trying to query data from a website, but it takes 6 seconds to query. For all 3000 of my queries, I'd be sitting around for 5 hours. I heard there was a way to parallelize this stuff, so I tried using multiprocessing to do it. It didn't work, and I tried asyncio, but it gave much the same result. Here's the multiprocessing code since that's simpler.
I have 5+ URLs in a list that I want to request tables from:
https://irsa.ipac.caltech.edu/cgi-bin/Gator/nph-query?outfmt=1&searchForm=MO&spatial=cone&catalog=neowiser_p1bs_psd&moradius=5&mobj=smo&mobjstr=1&btime=2014+01+17+18:38:41&etime=2014+01+18+18:38:41&selcols=w1mpro,w1sigmpro,w1snr,w1rchi2,w1flux,w1sigflux,w1sky,w1mag_2,w1sigm,mjd
https://irsa.ipac.caltech.edu/cgi-bin/Gator/nph-query?outfmt=1&searchForm=MO&spatial=cone&catalog=neowiser_p1bs_psd&moradius=5&mobj=smo&mobjstr=2&btime=2014+05+18+23:10:01&etime=2014+05+19+23:10:01&selcols=w1mpro,w1sigmpro,w1snr,w1rchi2,w1flux,w1sigflux,w1sky,w1mag_2,w1sigm,mjd
https://irsa.ipac.caltech.edu/cgi-bin/Gator/nph-query?outfmt=1&searchForm=MO&spatial=cone&catalog=neowiser_p1bs_psd&moradius=5&mobj=smo&mobjstr=3&btime=2014+11+04+06:01:27&etime=2014+11+05+06:01:27&selcols=w1mpro,w1sigmpro,w1snr,w1rchi2,w1flux,w1sigflux,w1sky,w1mag_2,w1sigm,mjd
https://irsa.ipac.caltech.edu/cgi-bin/Gator/nph-query?outfmt=1&searchForm=MO&spatial=cone&catalog=neowiser_p1bs_psd&moradius=5&mobj=smo&mobjstr=4&btime=2014+07+14+10:01:45&etime=2014+07+15+10:01:45&selcols=w1mpro,w1sigmpro,w1snr,w1rchi2,w1flux,w1sigflux,w1sky,w1mag_2,w1sigm,mjd
https://irsa.ipac.caltech.edu/cgi-bin/Gator/nph-query?outfmt=1&searchForm=MO&spatial=cone&catalog=neowiser_p1bs_psd&moradius=5&mobj=smo&mobjstr=5&btime=2014+07+04+20:17:01&etime=2014+07+05+20:17:01&selcols=w1mpro,w1sigmpro,w1snr,w1rchi2,w1flux,w1sigflux,w1sky,w1mag_2,w1sigm,mjd
Here's my request code:
from astropy.io import fits
from astropy.io import ascii as astro_ascii
from astropy.time import Time

def get_real_meta(url):
    df = astro_ascii.read(url)
    df = df.to_pandas()
    print(df)
    return df

import multiprocessing as mp

pool = mp.Pool(processes=10)
results = pool.map(get_real_meta, urls)
When I run this, some of the results are failed requests.
Why is this happening?
This is the full result from the run:
col1 \
0 [struct stat="ERROR"
col2 \
0 msg="ERROR: Cannot process Moving Object Correctly. "
col3
0 logfile="https://irsa.ipac.caltech.edu:443/workspace/TMP_1YL1MV_20034/Gator/irsa/20034/log.20034.html"]
col1 \
0 [struct stat="ERROR"
col2 \
0 msg="ERROR: Cannot process Moving Object Correctly. "
col3
0 logfile="https://irsa.ipac.caltech.edu:443/workspace/TMP_gteaoW_20031/Gator/irsa/20031/log.20031.html"]
col1 \
0 [struct stat="ERROR"
col2 \
0 msg="ERROR: Cannot process Moving Object Correctly. "
col3
0 logfile="https://irsa.ipac.caltech.edu:443/workspace/TMP_yWl2vY_20037/Gator/irsa/20037/log.20037.html"]
cntr_u dist_x pang_x ra_u dec_u ra \
0 1 2.338323 -43.660587 153.298036 13.719689 153.297574
1 2 1.047075 96.475058 153.337711 13.730126 153.338009
2 3 1.709365 -159.072115 153.377399 13.740497 153.377224
3 4 0.903435 84.145005 153.377439 13.740491 153.377696
4 5 0.800164 99.321042 153.397283 13.745653 153.397509
5 6 0.591963 16.330683 153.417180 13.750790 153.417228
6 7 0.642462 63.761302 153.437090 13.755910 153.437255
7 8 1.020182 -123.497531 153.457013 13.761012 153.456770
8 9 1.051963 130.842143 153.476909 13.766102 153.477137
9 11 1.007540 -55.156815 153.516820 13.776216 153.516583
10 12 0.607295 118.463910 153.556744 13.786265 153.556897
11 13 0.227240 -79.964079 153.556784 13.786259 153.556720
12 14 1.526454 -113.268004 153.596760 13.796237 153.596359
dec clon clat w1mpro w1sigmpro w1snr w1rchi2 \
0 13.720159 10h13m11.42s 13d43m12.57s 7.282 0.014 78.4 307.90
1 13.730093 10h13m21.12s 13d43m48.34s 6.925 0.018 59.8 82.35
2 13.740054 10h13m30.53s 13d44m24.19s 6.354 0.012 91.3 203.90
3 13.740517 10h13m30.65s 13d44m25.86s 6.862 0.015 70.3 61.03
4 13.745617 10h13m35.40s 13d44m44.22s 7.005 0.016 68.0 62.28
5 13.750948 10h13m40.13s 13d45m03.41s 6.749 0.015 70.4 26.35
6 13.755989 10h13m44.94s 13d45m21.56s 7.031 0.019 57.5 37.56
7 13.760856 10h13m49.62s 13d45m39.08s 6.729 0.013 84.9 66.91
8 13.765911 10h13m54.51s 13d45m57.28s 6.944 0.022 49.0 44.22
9 13.776376 10h14m03.98s 13d46m34.95s 7.049 0.022 49.1 20.63
10 13.786185 10h14m13.66s 13d47m10.26s 6.728 0.018 58.9 14.40
11 13.786270 10h14m13.61s 13d47m10.57s 6.773 0.024 45.3 10.65
12 13.796069 10h14m23.13s 13d47m45.85s 7.126 0.015 72.2 219.50
w1flux w1sigflux w1sky w1mag_2 w1sigm mjd
0 248830.0 3173.5 24.057 7.719 0.013 56795.965297
1 345700.0 5780.5 27.888 8.348 0.006 56796.096965
2 584870.0 6406.8 24.889 7.986 0.006 56796.228504
3 366210.0 5206.7 27.876 7.653 0.006 56796.228632
4 321210.0 4725.7 26.150 7.867 0.009 56796.294338
5 406400.0 5771.4 25.240 7.711 0.006 56796.360172
6 313360.0 5449.7 26.049 7.988 0.006 56796.426005
7 414100.0 4877.9 25.581 8.022 0.006 56796.491839
8 339610.0 6935.9 25.564 8.029 0.007 56796.557545
9 308370.0 6285.2 25.491 8.331 0.006 56796.689212
10 414410.0 7035.5 27.656 7.851 0.007 56796.820752
11 397500.0 8773.4 27.628 8.015 0.006 56796.820880
12 287270.0 3980.2 24.825 8.310 0.006 56796.952419
cntr_u dist_x pang_x ra_u dec_u ra dec \
0 1 0.570817 137.605512 128.754979 4.242103 128.755086 4.241986
1 2 0.852021 14.819525 128.791578 4.225474 128.791639 4.225703
2 3 1.099000 -4.816139 128.828083 4.208860 128.828057 4.209164
3 4 1.207022 9.485091 128.864456 4.192260 128.864511 4.192591
4 5 0.323112 107.976317 128.882608 4.183966 128.882694 4.183938
5 6 0.627727 99.373708 128.882645 4.183967 128.882817 4.183939
6 7 0.489166 19.732971 128.900773 4.175676 128.900819 4.175804
7 8 0.231292 -139.425350 128.918877 4.167389 128.918835 4.167340
8 9 0.393206 -28.705753 128.936958 4.159106 128.936905 4.159202
9 10 0.466548 -35.199460 128.936995 4.159107 128.936920 4.159213
10 11 1.153921 -100.879703 128.955053 4.150828 128.954737 4.150767
11 12 1.078232 -38.005043 128.973087 4.142552 128.972902 4.142788
12 13 1.172329 -27.290606 128.991097 4.134280 128.990947 4.134569
13 15 1.399220 54.717544 129.027083 4.117750 129.027401 4.117974
clon clat w1mpro w1sigmpro w1snr w1rchi2 w1flux \
0 08h35m01.22s 04d14m31.15s 6.768 0.018 58.9 60.490 395130.0
1 08h35m09.99s 04d13m32.53s 6.706 0.018 59.1 30.780 418160.0
2 08h35m18.73s 04d12m32.99s 6.754 0.024 45.4 20.520 400280.0
3 08h35m27.48s 04d11m33.33s 6.667 0.024 44.9 34.090 433390.0
4 08h35m31.85s 04d11m02.18s 6.782 0.023 47.8 9.326 389870.0
5 08h35m31.88s 04d11m02.18s 6.710 0.035 31.4 11.360 416570.0
6 08h35m36.20s 04d10m32.89s 6.880 0.021 52.7 7.781 356410.0
7 08h35m40.52s 04d10m02.42s 6.653 0.023 46.8 18.900 439130.0
8 08h35m44.86s 04d09m33.13s 6.986 0.023 47.2 8.576 323350.0
9 08h35m44.86s 04d09m33.17s 6.917 0.019 58.5 25.720 344400.0
10 08h35m49.14s 04d09m02.76s 6.782 0.015 70.1 173.800 390170.0
11 08h35m53.50s 04d08m34.04s 6.671 0.016 69.6 70.490 431820.0
12 08h35m57.83s 04d08m04.45s 7.152 0.016 66.6 131.100 277440.0
13 08h36m06.58s 04d07m04.71s 6.436 0.017 63.3 86.350 536630.0
w1sigflux w1sky w1mag_2 w1sigm mjd
0 6711.7 21.115 7.988 0.008 56965.251016
1 7070.1 23.748 7.830 0.007 56965.382556
2 8812.6 20.930 8.456 0.007 56965.514096
3 9649.6 21.350 8.120 0.008 56965.645509
4 8161.1 19.988 7.686 0.007 56965.711215
5 13264.0 22.180 7.902 0.016 56965.711343
6 6769.0 22.962 8.023 0.008 56965.777049
7 9382.2 22.355 8.030 0.007 56965.842755
8 6847.6 23.531 8.024 0.007 56965.908462
9 5882.5 21.256 7.654 0.007 56965.908589
10 5568.7 21.926 8.051 0.007 56965.974295
11 6202.3 23.497 7.950 0.007 56966.040002
12 4165.8 20.094 8.091 0.010 56966.105708
13 8482.9 22.436 8.191 0.008 56966.237248
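The failed queries do not raise an exception; they come back as one-row error tables (generic columns col1/col2/col3 containing stat="ERROR"), while successful queries return the usual named columns. One way to probe whether the server simply objects to many concurrent hits is to retry just the failed URLs with a pause; a rough sketch, reusing get_real_meta from above:
import time

def get_with_retry(url, tries=3, delay=5):
    for attempt in range(tries):
        df = get_real_meta(url)
        # error responses carry a generic 'col1' column; real tables have named columns
        if "col1" not in df.columns:
            return df
        time.sleep(delay)  # back off before trying again
    return df              # give up and return the last (error) response

results = [get_with_retry(u) for u in urls]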

Python DataFrame manipulation: How to extract a set of columns in a fast way

I need to access and extract information from a DataFrame that is used by other colleagues in a research group.
The DataFrame structure is:
zee.loc[zee['layer']=='EMB2'].loc[zee['roi']==0]
e et eta phi deta dphi samp hash det layer roi eventNumber
2249 20.677443 20.675829 0.0125 -1.067651 0.025 0.024544 3 2030015444 2 EMB2 0 2
2250 21.635288 21.633598 0.0125 -1.043107 0.025 0.024544 3 2030015445 2 EMB2 0 2
2251 -29.408310 -29.406013 0.0125 -1.018563 0.025 0.024544 3 2030015446 2 EMB2 0 2
2252 43.127533 43.124165 0.0125 -0.994020 0.025 0.024544 3 2030015447 2 EMB2 0 2
2253 -3.025344 -3.025108 0.0125 -0.969476 0.025 0.024544 3 2030015448 2 EMB2 0 2
... ... ... ... ... ... ... ... ... ... ... ... ...
4968988 -5.825550 -5.309279 0.4375 -0.454058 0.025 0.024544 3 2030019821 2 EMB2 0 3955
4968989 39.750645 36.227871 0.4375 -0.429515 0.025 0.024544 3 2030019822 2 EMB2 0 3955
4968990 80.568573 73.428436 0.4375 -0.404971 0.025 0.024544 3 2030019823 2 EMB2 0 3955
4968991 -28.921751 -26.358652 0.4375 -0.380427 0.025 0.024544 3 2030019824 2 EMB2 0 3955
4968992 55.599472 50.672146 0.4375 -0.355884 0.025 0.024544 3 2030019825 2 EMB2 0 3955
So, I need to work only with the layer EMB2 and the columns et, eta, and phi. To pick up these columns, I'm using the following code:
EtEtaPhi, EventLens = [], []

events = set(zee.loc[zee['layer']=='EMB2']['eventNumber'].to_numpy())
roi = set(zee.loc[zee['layer']=='EMB2']['roi'].to_numpy())

for ee in events:
    for rr in roi:
        if len(zee.loc[zee['layer']=='EMB2'].loc[zee['eventNumber']==ee].loc[zee['roi']==rr]) == 0:
            break
        EtEtaPhi.append(zee[['et','eta','phi']].loc[zee['layer']=='EMB2'].loc[zee['eventNumber']==ee].loc[zee['roi']==rr].to_numpy())
        EventLens.append(len(EtEtaPhi[-1]))
But reading 4000 events takes a very long time, almost one second per event. That isn't good: almost one hour just to extract those columns!
Is there a way to extract columns from a DataFrame more efficiently and faster?
The code
zee[['et','eta','phi']].loc[zee['layer']=='EMB2']
which you already have somewhere in there should do what you asked for. The rest is not needed.
Just use .loc:
sample = zee.loc[zee["layer"].eq("EMB2"), ["et","eta","phi"]]
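If the per-event/per-ROI grouping from the original loop is still needed, a single groupby over the filtered frame avoids the nested loops and the repeated boolean masks. A sketch, assuming the same column names as above:
emb2 = zee.loc[zee['layer'] == 'EMB2', ['eventNumber', 'roi', 'et', 'eta', 'phi']]

# one array of (et, eta, phi) rows per (eventNumber, roi) pair
EtEtaPhi = [group[['et', 'eta', 'phi']].to_numpy()
            for _, group in emb2.groupby(['eventNumber', 'roi'])]
EventLens = [len(block) for block in EtEtaPhi]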

Python stack loses data

I'm trying to reorganise my data (the overarching goal is to convert an ASCII file to netCDF). One of the steps to get there is to take the data and stack the columns. My original data looks like this:
import pandas as pd
import numpy as np
import xarray as xr
fname = 'data.out'
df = pd.read_csv(fname, header=0, delim_whitespace=True)
print(df)
gives
Lon Lat Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
0 150.25 -34.25 1851 0.027 -0.005 -0.010 -0.034 -0.029 -0.025 0.016 -0.049 -0.055 0.003 -0.029 0.060
1 150.25 -34.25 1852 0.021 -0.002 -0.050 0.071 0.066 0.001 0.021 -0.014 -0.072 -0.050 0.113 0.114
2 150.25 -34.25 1853 0.093 0.094 0.139 -0.019 0.015 0.003 0.018 -0.032 -0.024 -0.010 0.132 0.107
3 150.25 -34.25 1854 0.084 0.071 0.024 -0.004 -0.022 0.005 0.025 0.006 -0.040 -0.051 -0.067 -0.005
4 150.25 -34.25 1855 -0.030 -0.004 -0.035 -0.036 -0.035 -0.012 0.009 -0.017 -0.062 -0.068 -0.077 -0.084
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
707995 138.75 -19.25 2096 -0.044 -0.039 -0.068 -0.027 -0.023 -0.029 -0.031 -0.002 -0.005 0.018 -0.039 -0.094
707996 138.75 -19.25 2097 -0.041 -0.066 -0.006 -0.018 -0.005 -0.017 0.011 0.018 0.026 0.024 0.010 -0.086
707997 138.75 -19.25 2098 -0.033 -0.044 -0.032 -0.044 -0.046 -0.040 -0.021 -0.017 0.022 -0.011 -0.015 -0.032
707998 138.75 -19.25 2099 0.039 0.016 -0.009 0.001 -0.002 0.001 0.010 0.021 0.026 0.027 0.012 -0.017
707999 138.75 -19.25 2100 0.010 -0.022 -0.024 -0.037 -0.008 -0.020 0.002 0.011 0.011 0.033 0.020 -0.002
[708000 rows x 15 columns]
I then select the actual timesteps
months=list(df.columns)
months=months[3:]
and select all columns that have monthly data. This then returns the shape
print(df[months].shape)
(708000, 12). So far so good, but then when I stack the data
df_stack = df[months].stack()
print(df_stack.shape)
instead of the expected shape (8496000,) I get (8493000,). The weird thing is that the script runs fine on other files with the same shape as the data in this example, and I don't have the problem there. It looks like I'm losing one Lon/Lat pixel for 250 years, but I don't understand why. This becomes a problem later when I try to convert the data to a netCDF file.
lons = np.unique(df.Lon)
lats = np.unique(df.Lat)
years = np.unique(df.Year)

nyears = len(years)
nrows = len(lats)
ncols = len(lons)
nmonths = 12

lons.sort()
lats.sort()
years.sort()

time = pd.date_range(start=f'01/{years[0]}',
                     end=f'01/{years[-1]+1}', freq='M')

dx = 0.5
Lon = xr.DataArray(np.arange(-180.+dx/2., 180., dx), dims=("Lon"),
                   attrs={"long_name": "longitude", "unit": "degrees_east"})
nlon = Lon.size

dy = 0.5
Lat = xr.DataArray(np.arange(-90.+dy/2., 90., dy), dims=("Lat"),
                   attrs={"long_name": "latitude", "unit": "degrees_north"})
nlat = Lat.size

out = xr.DataArray(np.zeros((nyears*nmonths, nlat, nlon)),
                   dims=("Time", "Lat", "Lon"),
                   coords=({"Lat": Lat, "Lon": Lon, "Time": time}))

for nr in range(0, len(df.index), nyears):
    rows = df[nr:nr+nyears]
    thislon = rows["Lon"].min()
    thislat = rows["Lat"].min()
    out.loc[dict(
        Lon=thislon,
        Lat=thislat)] = df_stack[nr*nmonths:(nr+nyears)*nmonths]
this gives me the error
ValueError: could not broadcast input array from shape (0,) into shape (3000,)
It's missing the 3000 values that I'm losing while stacking the data. Does anyone know how to fix this?
Replace:
df_stack = df[months].stack()
with
df_stack = df[months].stack(dropna=False)
By default, stack() drops NaN entries, so every missing monthly value silently removes one element from the stacked result.
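A small self-contained example (toy values) shows the difference:
import numpy as np
import pandas as pd

toy = pd.DataFrame({"Jan": [0.1, np.nan], "Feb": [0.2, 0.3]})
print(toy.stack().shape)               # (3,) -- the NaN is dropped
print(toy.stack(dropna=False).shape)   # (4,) -- the NaN is kept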

Why does using group by make some ids disappear

I was working on a machine learning project, and while extracting features I found that some consumers' LCLid values disappear from the dataset when I group by LCLid.
Dataset: SmartMeter Energy Consumption Data in London Households
Here is the code that I used to extract some features from the original dataset:
LCLid = []
for i in range(68):
    LCLid.append('MAC0000' + str(228 + i))

consommation = data.groupby('LCLid')['KWH/hh'].sum()
consommation_min = data.groupby('LCLid')['KWH/hh'].min()
consommation_max = data.groupby('LCLid')['KWH/hh'].max()
consommation_mean = data.groupby('LCLid')['KWH/hh'].mean()
consommation_evening = data.groupby(['LCLid', 'period'])['KWH/hh'].mean()

# create the dataframe
list_of_tuples = list(zip(LCLid, consommation, consommation_min, consommation_max, consommation_mean))
data2 = pd.DataFrame(list_of_tuples, columns=['LCLid', 'Consumption', 'Consumption_min', 'Consumption_max', 'Consumption_mean'])
As you can see, after executing the code the resulting dataset stops at LCLid 282, while the original dataset also contains the LCLids from 283 to 295.
Using low-carbon-london-data from SmartMeter Energy Consumption Data in London Households
The issue is LCLid does not uniformly increment by 1, from MAC000228 to MAC000295.
print(data.LCLid.unique())
array(['MAC000228', 'MAC000229', 'MAC000230', 'MAC000231', 'MAC000232',
'MAC000233', 'MAC000234', 'MAC000235', 'MAC000237', 'MAC000238',
'MAC000239', 'MAC000240', 'MAC000241', 'MAC000242', 'MAC000243',
'MAC000244', 'MAC000245', 'MAC000246', 'MAC000248', 'MAC000249',
'MAC000250', 'MAC000251', 'MAC000252', 'MAC000253', 'MAC000254',
'MAC000255', 'MAC000256', 'MAC000258', 'MAC000260', 'MAC000262',
'MAC000263', 'MAC000264', 'MAC000267', 'MAC000268', 'MAC000269',
'MAC000270', 'MAC000271', 'MAC000272', 'MAC000273', 'MAC000274',
'MAC000275', 'MAC000276', 'MAC000277', 'MAC000279', 'MAC000280',
'MAC000281', 'MAC000282', 'MAC000283', 'MAC000284', 'MAC000285',
'MAC000287', 'MAC000289', 'MAC000291', 'MAC000294', 'MAC000295'],
dtype=object)
print(len(data.LCLid.unique()))
>>> 55
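Because zip() stops at the shorter of its inputs, pairing the hand-built 68-name LCLid list with only 55 aggregated values both truncates the frame and shifts labels after the first missing id. A toy sketch (hypothetical names and values) illustrates the effect:
# hypothetical example: MAC000229 has no readings, so groupby returns only 2 groups
names = ['MAC000228', 'MAC000229', 'MAC000230']   # hand-built list of 3 names
sums = [5761.3, 8911.2]                           # sums for MAC000228 and MAC000230
print(list(zip(names, sums)))
# [('MAC000228', 5761.3), ('MAC000229', 8911.2)]  -- 230's sum is labelled 229,
# and MAC000230 disappears from the result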
To resolve the issue, aggregate directly on the grouped data instead of zipping against a hand-built list of ids:
import pandas as pd
import numpy as np
df = pd.read_csv('Power-Networks-LCL-June2015(withAcornGps)v2.csv')
# determine the rows needed for the MAC000228 - MAC000295
df[df.LCLid == 'MAC000228'].iloc[0, :] # first row of 228
df[df.LCLid == 'MAC000295'].iloc[-1, :] # last row of 295
# create a dataframe with the desired data
data = df[['LCLid', 'DateTime', 'KWH/hh (per half hour) ']].iloc[6989700:9032044, :].copy()
# fix the data
data.DateTime = pd.to_datetime(data.DateTime)
data.rename(columns={'KWH/hh (per half hour) ': 'KWH/hh'}, inplace=True)
data['KWH/hh'] = data['KWH/hh'].str.replace('Null', 'NaN')
data['KWH/hh'].fillna(np.nan, inplace=True)
data['KWH/hh'] = data['KWH/hh'].astype('float')
data.reset_index(drop=True, inplace=True)
# aggregate your functions
agg_data = data.groupby('LCLid')['KWH/hh'].agg(['sum', 'min', 'max', 'mean']).reset_index()
print(agg_data)
agg_data
LCLid sum min max mean
0 MAC000228 5761.288000 0.021 1.616 0.146356
1 MAC000229 6584.866999 0.008 3.294 0.167456
2 MAC000230 8911.154000 0.029 2.750 0.226384
3 MAC000231 3174.314000 0.000 1.437 0.080663
4 MAC000232 2083.042000 0.005 0.736 0.052946
5 MAC000233 2241.591000 0.000 3.137 0.056993
6 MAC000234 9700.328001 0.029 2.793 0.246646
7 MAC000235 8473.999003 0.011 3.632 0.223194
8 MAC000237 22263.294998 0.036 4.450 0.598299
9 MAC000238 7814.889998 0.016 2.835 0.198781
10 MAC000239 6113.029000 0.015 1.346 0.155481
11 MAC000240 7280.662000 0.000 3.146 0.222399
12 MAC000241 4181.169999 0.024 1.733 0.194963
13 MAC000242 1654.336000 0.000 1.481 0.042088
14 MAC000243 11057.366999 0.009 3.588 0.281989
15 MAC000244 5894.271000 0.005 1.884 0.149939
16 MAC000245 22788.699005 0.037 4.743 0.580087
17 MAC000246 13787.060005 0.014 3.516 0.351075
18 MAC000248 10192.239001 0.000 4.351 0.259536
19 MAC000249 24401.468995 0.148 5.242 0.893042
20 MAC000250 5850.003000 0.000 2.185 0.148999
21 MAC000251 8400.234000 0.035 3.505 0.213931
22 MAC000252 21748.489004 0.135 4.171 0.554978
23 MAC000253 9739.408999 0.009 1.714 0.248201
24 MAC000254 9351.614001 0.009 2.484 0.238209
25 MAC000255 14142.974002 0.097 3.305 0.360220
26 MAC000256 20398.665001 0.049 3.019 0.520680
27 MAC000258 6646.485998 0.017 2.319 0.169666
28 MAC000260 5952.563001 0.006 2.192 0.151952
29 MAC000262 13909.603999 0.000 2.878 0.355181
30 MAC000263 3753.997000 0.015 1.060 0.095863
31 MAC000264 7022.967000 0.020 0.910 0.179432
32 MAC000267 8797.094000 0.029 2.198 0.224898
33 MAC000268 3734.252001 0.000 1.599 0.095359
34 MAC000269 2395.232000 0.000 1.029 0.061167
35 MAC000270 15569.711002 0.131 2.249 0.397501
36 MAC000271 7244.860000 0.028 1.794 0.184974
37 MAC000272 8703.658998 0.034 3.295 0.222446
38 MAC000273 3622.199002 0.005 5.832 0.092587
39 MAC000274 28724.718997 0.032 3.927 0.734422
40 MAC000275 5564.004999 0.012 1.840 0.161290
41 MAC000276 11060.774001 0.000 1.709 0.315724
42 MAC000277 8446.528999 0.027 1.938 0.241075
43 MAC000279 3444.160999 0.016 1.846 0.098354
44 MAC000280 12595.780001 0.125 1.988 0.360436
45 MAC000281 6282.568000 0.024 1.433 0.179538
46 MAC000282 4457.989001 0.030 1.830 0.127444
47 MAC000283 5024.917000 0.011 2.671 0.143627
48 MAC000284 1293.503000 0.000 0.752 0.047975
49 MAC000285 2399.018000 0.006 0.931 0.068567
50 MAC000287 1407.290000 0.000 2.372 0.045253
51 MAC000289 4767.490999 0.000 2.287 0.136436
52 MAC000291 13456.678999 0.072 3.354 0.385060
53 MAC000294 9477.966000 0.053 2.438 0.271264
54 MAC000295 7750.128000 0.010 1.839 0.221774

Obtaining the last value in a DataFrame column that equals or is nearest to a target

I have an issue in my code where I am computing cut points.
First, this is my DataFrame column:
In [23]: df['bad_%']
0 0.025
1 0.007
2 0.006
3 0.006
4 0.006
5 0.006
6 0.007
7 0.007
8 0.007
9 0.006
10 0.006
11 0.009
12 0.009
13 0.009
14 0.008
15 0.008
16 0.008
17 0.012
18 0.012
19 0.05
20 0.05
21 0.05
22 0.05
23 0.05
24 0.05
25 0.05
26 0.05
27 0.062
28 0.062
29 0.061
5143 0.166
5144 0.166
5145 0.166
5146 0.167
5147 0.167
5148 0.167
5149 0.167
5150 0.167
5151 0.05
5152 0.167
5153 0.167
5154 0.167
5155 0.167
5156 0.051
5157 0.052
5158 0.161
5159 0.149
5160 0.168
5161 0.168
5162 0.168
5163 0.168
5164 0.168
5165 0.168
5166 0.168
5167 0.168
5168 0.049
5169 0.168
5170 0.168
5171 0.168
5172 0.168
Name: bad%, Length: 5173, dtype: float64
I used this code to detect the value equal to or nearest to 0.05 (the value introduced on the console):
error = 100   # margin of error
valuesA = []  # array to save data
pointCut = 0  # identify cut point

for index, row in df.iterrows():
    if abs(row['bad%'] - a) <= error:
        valuesA = row
        error = abs(row['bad%'] - a)
        # variable "a" is introduced from the console; in this case it is 0.05
        pointCut = index
This code returns the value 0.05 at index 5151. At first glance that looks good, because the 0.05 at index 5151 is the last 0.05:
Out [27]:
5151    0.05
But my objective is to obtain THE LAST VALUE IN THE COLUMN equal to or nearest to 0.05; in this case that value is 0.049 at index 5168, and that is the value I need.
Is there an algorithm that allows this? Any solution or recommendation?
Thanks in advance.
Solution if at least one matching value exists:
Use [::-1] to scan the values from the back and idxmax to get the last matched index value:
a = 0.05
s = df['bad%']
b = s[[(s[::-1] <= a).idxmax()]]
print (b)
5168 0.049
Or:
b = s[(s <= a)].iloc[[-1]]
print (b)
5168 0.049
Name: bad%, dtype: float64
Solution that also works if no such value exists (an empty Series is then returned):
a = 0.05
s = df['bad%']
m1 = (s <= a)
m2 = m1[::-1].cumsum().eq(1)
b = s[m1 & m2]
print (b)
5168 0.049
Name: bad%, dtype: float64
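To see why the reversed cumulative sum keeps only the last match, here is a small sketch on toy data (the values are made up):
import pandas as pd

s = pd.Series([0.05, 0.2, 0.049, 0.3])
m1 = s <= 0.05                  # True, False, True, False
m2 = m1[::-1].cumsum().eq(1)    # counts matches from the end; combined with m1 it keeps only the last one
print(s[m1 & m2])
# 2    0.049
# dtype: float64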
Sample data:
df = pd.DataFrame({'bad%': {5146: 0.16699999999999998, 5147: 0.16699999999999998, 5148: 0.16699999999999998, 5149: 0.049, 5150: 0.16699999999999998, 5151: 0.05, 5152: 0.16699999999999998, 5167: 0.168, 5168: 0.049, 5169: 0.168}})
