Find first row where value is less than threshold - python

I have a pandas DataFrame, and for each value in
lst = [0, 50, 100, 150, 200, 250, 500, 1000]
I am trying to find the ID of the first row whose left value is less than (or equal to) that value.
ID ST ... csum left
0 0 AK ... 4.293174e+05 760964.996900
1 1 AK ... 4.722491e+06 760535.679500
2 2 AK ... 8.586347e+06 760149.293900
3 3 AK ... 2.683233e+07 758324.695200
4 4 AK ... 2.962290e+07 758045.638900
.. ... ... ... ... ...
111 111 AK ... 7.609006e+09 107.329336
112 112 AK ... 7.609221e+09 85.863469
113 113 AK ... 7.609435e+09 64.397602
114 114 AK ... 7.609650e+09 42.931735
115 115 AK ... 7.610079e+09 0.000000
So I would end up with a list or dataframe looking like:
threshold ID
0 115
50 114
100 112
150 100
200 100
250 99
500 78
1000 77
How can I achieve this?

To match, for each threshold, the ID of the first row whose left value is less than or equal to it, you can use a merge_asof (both sides must be sorted on the key):
lst = [0, 50, 100, 150, 200, 250, 500, 1000]
pd.merge_asof(pd.Series(lst, name='threshold', dtype=df['left'].dtype),
              df.sort_values(by='left').rename(columns={'left': 'threshold'})[['threshold', 'ID']],
              on='threshold',
              # uncomment to exclude exact matches (strict inequality)
              # allow_exact_matches=False,
              )
Output:
threshold ID
0 0.0 115
1 50.0 114
2 100.0 112
3 150.0 111 # due to truncated input
4 200.0 111 #
5 250.0 111 #
6 500.0 111 #
7 1000.0 111 #

lst = [0, 50, 100, 150, 200, 250, 500, 1000]
df11 = pd.DataFrame(dict(threshold=lst))
df11.assign(ID=df11.threshold.map(lambda x: df.query("left <= @x").iloc[0, 0]))
Output:
threshold ID
0 0.0 115
1 50.0 114
2 100.0 112
3 150.0 111 # due to truncated input
4 200.0 111 #
5 250.0 111 #
6 500.0 111 #
7 1000.0 111 #
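
Because left decreases monotonically as ID increases in the sample, a vectorized alternative (a sketch, not from the answers above, assuming the question's frame is named df) is numpy.searchsorted on the reversed column:
import numpy as np
import pandas as pd

lst = [0, 50, 100, 150, 200, 250, 500, 1000]

rev_left = df['left'].to_numpy()[::-1]  # reversed, so ascending
rev_id = df['ID'].to_numpy()[::-1]

# rightmost insertion point counts values <= threshold (mirrors allow_exact_matches=True)
pos = np.searchsorted(rev_left, lst, side='right')
# note: a threshold below min(left) would give pos == 0 and wrap around; guard if needed
result = pd.DataFrame({'threshold': lst, 'ID': rev_id[pos - 1]})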

How to obtain a subset of a dataframe based on column values?

I have a dataframe, pop_b, with the following values:
CT (mm) A B C D adultos_perc min max class_center Y
0 100-110 40 0 0 0 0.000000 100 110 105 inf
1 110-120 72 0 0 0 0.000000 110 120 115 inf
2 120-130 108 12 0 0 0.100000 120 130 125 2.197225
3 130-140 112 41 7 0 0.300000 130 140 135 0.847298
4 140-150 92 70 18 4 0.500000 140 150 145 0.000000
5 150-160 60 98 34 7 0.698492 150 160 155 -0.840129
6 160-170 27 105 36 16 0.853261 160 170 165 -1.760409
7 170-180 0 87 38 21 1.000000 170 180 175 -inf
8 180-190 0 45 28 7 1.000000 180 190 185 -inf
9 190-200 0 15 9 6 1.000000 190 200 195 -inf
10 200-210 0 7 3 2 1.000000 200 210 205 -inf
11 210-220 0 4 2 2 1.000000 210 220 215 -inf
12 220-230 0 6 3 2 1.000000 220 230 225 -inf
13 230-240 0 8 3 2 1.000000 230 240 235 -inf
I want to create a new dataframe that has only the rows whose "Y" values aren't inf or -inf.
The dataframe has the following dtypes:
CT (mm) object
A int64
B int64
C int64
D int64
adultos_perc float64
min int64
max int64
class_center int64
Y float64
dtype: object
You could use between:
out = df[df['Y'].between(-float('inf'), float('inf'), inclusive='neither')]
or gt and lt wrappers chained together with &:
out = df[df['Y'].gt(-float('inf')) & df['Y'].lt(float('inf'))]
Output:
CT (mm) A B C D adultos_perc min max class_center Y
2 120-130 108 12 0 0 0.100000 120 130 125 2.197225
3 130-140 112 41 7 0 0.300000 130 140 135 0.847298
4 140-150 92 70 18 4 0.500000 140 150 145 0.000000
5 150-160 60 98 34 7 0.698492 150 160 155 -0.840129
6 160-170 27 105 36 16 0.853261 160 170 165 -1.760409
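
Since Y is float64, an equivalent filter (a minimal sketch) is numpy.isfinite, which additionally drops NaN if any is present:
import numpy as np

out = df[np.isfinite(df['Y'])]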

Filtering static/stationary areas

I am trying to filter my sensor data. My objective is to identify stretches where the data is more or less stationary over a period of time. Can anyone help me with this?
time sensor
1 121
2 115
3 122
4 123
5 116
6 117
7 113
8 116
9 113
10 114
11 115
12 112
13 116
14 129
15 123
16 125
17 130
18 120
19 121
20 122
This is sample data. I need to take the first reading and compare it to the next 20 seconds of data; if all 20 readings are within a range of +/- 10, I need to copy those 20 readings into another column, and then continue this filtering process.
Your question is not entirely clear, but my understanding is: within a 20-second window, if a sensor reading is within +/- 10 of the first reading, it should be appended to a new column, and anything outside that range should not be considered. I replicated your DataFrame, and you could proceed this way:
import pandas as pd
data = {'time': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23],
        'sensor': [121, 115, 122, 123, 116, 117, 113, 116, 113, 114, 115, 112, 116, 129, 123, 125, 130, 120, 121, 122, 123, 124, 144]}
df_new = pd.DataFrame(data)  # a duration of 23 seconds, where the 23rd reading is out of range: 144 - 121 > 10
df_new
time sensor
0 1 121
1 2 115
2 3 122
3 4 123
4 5 116
5 6 117
6 7 113
7 8 116
8 9 113
9 10 114
10 11 115
11 12 112
12 13 116
13 14 129
14 15 123
15 16 125
16 17 130
17 18 120
18 19 121
19 20 122
20 21 123
21 22 124
22 23 144
result = []
for i in range(len(df_new['sensor'])):
    # use 20 for your requirement; 23 is used here to demonstrate the out-of-range value 144
    if 0 <= df_new['time'][i] - df_new['time'][0] <= 23:
        if -10 < df_new['sensor'][0] - df_new['sensor'][i] < 10:
            result.append(df_new['sensor'][i])
        else:
            result.append('out of range')
    else:
        break
df_new['result'] = result
df_new
time sensor result
0 1 121 121
1 2 115 115
2 3 122 122
3 4 123 123
4 5 116 116
5 6 117 117
6 7 113 113
7 8 116 116
8 9 113 113
9 10 114 114
10 11 115 115
11 12 112 112
12 13 116 116
13 14 129 129
14 15 123 123
15 16 125 125
16 17 130 130
17 18 120 120
18 19 121 121
19 20 122 122
20 21 123 123
21 22 124 124
22 23 144 out of range
There was no machine-readable sample data, so I generated some. The filter on time could clearly be two datetimes; I've just picked certain hours. For "stable", the example selects values between the 45th and 55th percentiles.
import datetime as dt
import numpy as np
import pandas as pd

t = pd.date_range(dt.date(2021, 1, 10), dt.date(2021, 1, 11), freq="min")
df = pd.DataFrame({"time": t, "val": np.random.dirichlet(np.ones(len(t)), size=1)[0]})
# filter on hour and val: val between the 45th and 55th percentiles
df2 = df[df.time.dt.hour.between(3, 4) & df.val.between(df.val.quantile(.45), df.val.quantile(.55))]
Output:
time val
2021-01-10 03:13:00 0.000499
2021-01-10 03:41:00 0.000512
2021-01-10 04:00:00 0.000541
2021-01-10 04:39:00 0.000413
Rolling window
The question was updated to state that stable means the next window rows stay within +/- rng of the first, with the result in a new column.
Using this definition, we can use the rolling() capability with a lambda function that checks that all rows within the window are within the tolerance of the window's first observation. Any window that goes out of this range returns NaN. Note also that the last rows return NaN, as there are insufficient remaining rows to run the test.
import pandas as pd
import io
import datetime as dt
import numpy as np
from distutils.version import StrictVersion
df = pd.read_csv(io.StringIO("""sensor
121
115
122
123
116
117
113
116
113
114
115
112
116
129
123
125
130
120
121
122"""))
df["time"] = pd.date_range(dt.date(2021,1,10), freq="s", periods=len(df))
# how many rows to compare
window = 5
# +/- range
rng = 10

if StrictVersion(pd.__version__) < StrictVersion("1.0.0"):
    df["stable"] = df["sensor"].rolling(window).apply(
        lambda x: np.where(pd.Series(x).between(x[0] - rng, x[0] + rng).all(), x[0], np.nan)
    ).shift(-(window - 1))
else:
    df["stable"] = df.rolling(window).apply(
        lambda x: np.where(x.between(x.values[0] - rng, x.values[0] + rng).all(), x.values[0], np.nan)
    ).shift(-(window - 1))
Output:
sensor time stable
121 2021-01-10 00:00:00 121.0
115 2021-01-10 00:00:01 115.0
122 2021-01-10 00:00:02 122.0
123 2021-01-10 00:00:03 123.0
116 2021-01-10 00:00:04 116.0
117 2021-01-10 00:00:05 117.0
113 2021-01-10 00:00:06 113.0
116 2021-01-10 00:00:07 116.0
113 2021-01-10 00:00:08 113.0
114 2021-01-10 00:00:09 NaN
115 2021-01-10 00:00:10 NaN
112 2021-01-10 00:00:11 NaN
116 2021-01-10 00:00:12 NaN
129 2021-01-10 00:00:13 129.0
123 2021-01-10 00:00:14 123.0
125 2021-01-10 00:00:15 125.0
130 2021-01-10 00:00:16 NaN
120 2021-01-10 00:00:17 NaN
121 2021-01-10 00:00:18 NaN
122 2021-01-10 00:00:19 NaN
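
On recent pandas the distutils import and the version guard are unnecessary, and rolling over a frame that contains a datetime column no longer silently drops it, so a minimal sketch of the same idea (reusing the window and rng defined above, applied directly to the sensor column) would be:
def stable_first(x):
    # keep the window's first value only if every value stays within +/- rng of it
    first = x.iloc[0]
    return first if x.between(first - rng, first + rng).all() else np.nan

df["stable"] = (df["sensor"]
                .rolling(window)
                .apply(stable_first, raw=False)  # raw=False passes a Series to the function
                .shift(-(window - 1)))           # align each result with the window's first row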

Plotting in pivot table using label

My dataset:
df
Month 1 2 3 4 5 Label
Name
A 120 80.5 120 105.5 140 0
B 80 110 98.5 105 100 1
C 150 90.5 105 120 190 2
D 100 105 98.5 110 120 1
...
To draw a plot over Month, I transpose the dataframe:
df = df.T
df
Name A B C D
Month
1 120.0 80.0 150.0 100.0
2 80.5 110.0 90.5 105.0
3 120.0 98.5 105.0 98.5
4 105.5 105.0 120.0 110.0
5 140.0 100.0 190.0 120.0
Label 0.0 1.0 2.0 1.0
Ultimately, what I want to do is draw a plot where the x-axis is the month and the y-axis is the value.
But I have two questions.
Q1.
Transposing changes the data type of 'Label' (int -> float).
Can just the 'Label' row be kept as int?
Desired output:
df = df.T
df
Name A B C D
Month
1 120.0 80.0 150.0 100.0
2 80.5 110.0 90.5 105.0
3 120.0 98.5 105.0 98.5
4 105.5 105.0 120.0 110.0
5 140.0 100.0 190.0 120.0
Label 0 1 2 1
Q2.
Q1 is really in service of Q2.
When drawing the plot, I want to group by the label (like seaborn's hue).
When plotting from the pivot table above, is there a way to make such grouping possible?
(matplotlib or seaborn, either is fine.)
The label doesn't have to be int, so if possible you don't need to answer Q1.
Thank you for reading.
Q2: You need to reshape the values, e.g. with DataFrame.melt, so that hue can be used:
df1 = df.reset_index().melt(['Name','Label'])
print (df1)
sns.stripplot(data=df1,hue='Label',x='Name',y='value')
Q1: Pandas does not support this: a column holds a single dtype, so converting the 'Label' row does not change the values back from float:
df = df.T
df.loc['Label', :] = df.loc['Label', :].astype(int)
print (df)
Name A B C D
1 120.0 80.0 150.0 100.0
2 80.5 110.0 90.5 105.0
3 120.0 98.5 105.0 98.5
4 105.5 105.0 120.0 110.0
5 140.0 100.0 190.0 120.0
Label 0.0 1.0 2.0 1.0
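
If keeping Label as int matters, one workaround (a minimal sketch, starting from the original untransposed df rather than the transposed copy above) is to leave Label out of the transpose entirely, since a pandas column can only hold one dtype:
months = df.drop(columns='Label').T  # Month-indexed frame with the numeric values only
labels = df['Label']                 # stays int64, indexed by Name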
EDIT:
df1 = df.reset_index().melt(['Name','Label'], var_name='Month')
print (df1)
Name Label Month value
0 A 0 1 120.0
1 B 1 1 80.0
2 C 2 1 150.0
3 D 1 1 100.0
4 A 0 2 80.5
5 B 1 2 110.0
6 C 2 2 90.5
7 D 1 2 105.0
8 A 0 3 120.0
9 B 1 3 98.5
10 C 2 3 105.0
11 D 1 3 98.5
12 A 0 4 105.5
13 B 1 4 105.0
14 C 2 4 120.0
15 D 1 4 110.0
16 A 0 5 140.0
17 B 1 5 100.0
18 C 2 5 190.0
19 D 1 5 120.0
sns.lineplot(data=df1,hue='Label',x='Month',y='value')
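
If you prefer plain matplotlib over seaborn (a sketch, assuming the df1 produced by the melt above), you can group manually, e.g. plotting each label's mean value per month:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
for label, grp in df1.groupby('Label'):
    # one line per label: the mean value across the Names sharing that label
    grp.groupby('Month')['value'].mean().plot(ax=ax, label=f'Label {label}')
ax.set_xlabel('Month')
ax.set_ylabel('value')
ax.legend()
plt.show()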

How to calculate aggregate percentage in a dataframe grouped by a value in python?

I am new to Python, and I am trying to understand how to aggregate and manipulate data.
I have a dataframe:
df3
Out[122]:
SBK SSC CountRecs
0 99 22 9
1 99 12 10
2 99 121 11
3 99 138 12
4 99 123 8
... ... ...
160247 184 1318 1
160248 394 2659 1
160249 412 757 1
160250 357 1312 1
160251 202 106 1
I want to work out, across the entire data frame, what percentage each row's CountRecs is of the total CountRecs for its SBK. For example, I want to know what % 80618 is of the summed CountRecs of all rows with SBK 99; for the sample shown, row 0 would be 9/50 * 100. I want this automated for all rows. How can I go about this?
You need to:

1. Group by the column you want.
2. Merge the grouped sums back on that column (you can rename the new column if you like).
3. Add the percentage column.

a = df3.merge(df3.groupby('SBK')['CountRecs'].sum().reset_index(), on='SBK')
df3['percent'] = a['CountRecs_x'] / a['CountRecs_y'] * 100
df3
Use GroupBy.transform to get a Series the same length as the original DataFrame, filled with the per-SBK sums, so you can divide the original column by it:
df3['percent'] = df3['CountRecs'] / df3.groupby('SBK')['CountRecs'].transform('sum') * 100
print (df3)
SBK SSC CountRecs percent
0 99 22 9 18.0
1 99 12 10 20.0
2 99 121 11 22.0
3 99 138 12 24.0
4 99 123 8 16.0
160247 184 1318 1 100.0
160248 394 2659 1 100.0
160249 412 757 1 100.0
160250 357 1312 1 100.0
160251 202 106 1 100.0
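
As a quick sanity check (a sketch rebuilding only the SBK 99 rows shown above), the percentages within each group should sum to 100:
import pandas as pd

df3 = pd.DataFrame({'SBK': [99, 99, 99, 99, 99],
                    'SSC': [22, 12, 121, 138, 123],
                    'CountRecs': [9, 10, 11, 12, 8]})
df3['percent'] = df3['CountRecs'] / df3.groupby('SBK')['CountRecs'].transform('sum') * 100
print(df3.groupby('SBK')['percent'].sum())  # SBK 99 -> 100.0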

find the maximum value for each streak of numbers in another column in pandas

I have a dataframe like this:
df = pd.DataFrame({'dir': [1,1,1,1,0,0,1,1,1,0], 'price':np.random.randint(100,200,10)})
dir price
0 1 100
1 1 150
2 1 190
3 1 194
4 0 152
5 0 151
6 1 131
7 1 168
8 1 112
9 0 193
and I want a new column that shows the maximum price as long as the dir is 1 and reset if dir is 0.
My desired outcome looks like this:
dir price max
0 1 100 194
1 1 150 194
2 1 190 194
3 1 194 194
4 0 152 NaN
5 0 151 NaN
6 1 131 168
7 1 168 168
8 1 112 168
9 0 193 NaN
Use transform with max on the rows filtered to dir == 1:
#get unique groups for consecutive values
g = df['dir'].ne(df['dir'].shift()).cumsum()
#filter only 1
m = df['dir'] == 1
df['max'] = df[m].groupby(g)['price'].transform('max')
print (df)
dir price max
0 1 100 194.0
1 1 150 194.0
2 1 190 194.0
3 1 194 194.0
4 0 152 NaN
5 0 151 NaN
6 1 131 168.0
7 1 168 168.0
8 1 112 168.0
9 0 193 NaN
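
An equivalent one-liner (a sketch reusing the same g and m) avoids filtering rows out: masking price first makes the dir == 0 groups all-NaN, so their transform('max') comes out NaN on its own:
df['max'] = df['price'].where(m).groupby(g).transform('max')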
