I was trying to filter my sensor data. My objective is to keep the readings where the data is more or less stationary over a period of time. Can anyone help me with this?
time  sensor
1     121
2     115
3     122
4     123
5     116
6     117
7     113
8     116
9     113
10    114
11    115
12    112
13    116
14    129
15    123
16    125
17    130
18    120
19    121
20    122
This is sample data. I need to take the first reading and compare it to the next 20 seconds of data; if all 20 readings are within a range of +/- 10 from it, I need to copy those 20 readings to another column, and then continue this filtering process for the following windows.
Your question is not very clear, but from my understanding, what you want is: within a time duration of 20 seconds, if the sensor stays within the range of +10 to -10 from the first reading, you append those values to a new column, and anything above or below that range should not be considered. I tried replicating your DataFrame, and you could go about it this way:
import pandas as pd

data = {'time': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23],
        'sensor': [121, 115, 122, 123, 116, 117, 113, 116, 113, 114, 115, 112, 116, 129, 123, 125, 130, 120, 121, 122, 123, 124, 144]}
df_new = pd.DataFrame(data)  # a 23-second window where the 23rd reading is out of range, since 144 - 121 > 10
time sensor
0 1 121
1 2 115
2 3 122
3 4 123
4 5 116
5 6 117
6 7 113
7 8 116
8 9 113
9 10 114
10 11 115
11 12 112
12 13 116
13 14 129
14 15 123
15 16 125
16 17 130
17 18 120
18 19 121
19 20 122
20 21 123
21 22 124
22 23 144
result = []  # avoid calling this 'list', which would shadow the built-in
for i in range(len(df_new['sensor'])):
    # use 20 here for your requirement; 23 is used to demonstrate the out-of-range value 144
    if 0 <= df_new['time'][i] - df_new['time'][0] <= 23:
        # inclusive bounds: a reading exactly 10 away still counts as in range
        if -10 <= df_new['sensor'][0] - df_new['sensor'][i] <= 10:
            result.append(df_new['sensor'][i])
        else:
            result.append('out of range')
    else:
        break
df_new['result'] = result
df_new
time sensor result
0 1 121 121
1 2 115 115
2 3 122 122
3 4 123 123
4 5 116 116
5 6 117 117
6 7 113 113
7 8 116 116
8 9 113 113
9 10 114 114
10 11 115 115
11 12 112 112
12 13 116 116
13 14 129 129
14 15 123 123
15 16 125 125
16 17 130 130
17 18 120 120
18 19 121 121
19 20 122 122
20 21 123 123
21 22 124 124
22 23 144 out of range
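The question also asks to continue this process for the following windows. A minimal sketch of one way to do that (assuming one reading per second, so a 20-second window is 20 rows; the column name 'windowed' is just illustrative):

window = 20
rng = 10
results = []
for start in range(0, len(df_new), window):
    chunk = df_new['sensor'].iloc[start:start + window]
    first = chunk.iloc[0]
    # keep the whole chunk only if every reading is within +/- rng of its first value
    if chunk.between(first - rng, first + rng).all():
        results.extend(chunk.tolist())
    else:
        results.extend(['out of range'] * len(chunk))
df_new['windowed'] = results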
There is no sample data, so I generated some. Clearly the filter on time could be two datetimes; I've just picked certain hours. For "stable", this example selects values that fall between the 45th and 55th percentiles.
import pandas as pd
import numpy as np
import datetime as dt
t = pd.date_range(dt.date(2021,1,10), dt.date(2021,1,11), freq="min")
df = pd.DataFrame({"time":t, "val":np.random.dirichlet(np.ones(len(t)),size=1)[0]})
# filter on hour and val. val between 45th and 55th percentile
df2 = df[df.time.dt.hour.between(3,4) & df.val.between(df.val.quantile(.45), df.val.quantile(.55))]
output
time val
2021-01-10 03:13:00 0.000499
2021-01-10 03:41:00 0.000512
2021-01-10 04:00:00 0.000541
2021-01-10 04:39:00 0.000413
rolling window
The question was updated to state that "stable" means the next window rows are within +/- rng of the first, with the output in a new column.
Using this definition, we can use the rolling() capability with a lambda function to check that all subsequent rows within the window are within the tolerance of the first observation in the window. Any observation out of this range returns NaN. Also note that the last rows return NaN, as there are insufficient remaining rows to run the test.
import pandas as pd
import io
import datetime as dt
import numpy as np
from distutils.version import StrictVersion
df = pd.read_csv(io.StringIO("""sensor
121
115
122
123
116
117
113
116
113
114
115
112
116
129
123
125
130
120
121
122"""))
df["time"] = pd.date_range(dt.date(2021,1,10), freq="s", periods=len(df))
# how many rows to compare
window = 5
# +/- range
rng = 10
if StrictVersion(pd.__version__) < StrictVersion("1.0.0"):
    df["stable"] = df["sensor"].rolling(window).apply(lambda x: np.where(pd.Series(x).between(x[0]-rng, x[0]+rng).all(), x[0], np.nan)).shift(-(window-1))
else:
    df["stable"] = df["sensor"].rolling(window).apply(lambda x: np.where(x.between(x.iloc[0]-rng, x.iloc[0]+rng).all(), x.iloc[0], np.nan), raw=False).shift(-(window-1))
output
sensor time stable
121 2021-01-10 00:00:00 121.0
115 2021-01-10 00:00:01 115.0
122 2021-01-10 00:00:02 122.0
123 2021-01-10 00:00:03 123.0
116 2021-01-10 00:00:04 116.0
117 2021-01-10 00:00:05 117.0
113 2021-01-10 00:00:06 113.0
116 2021-01-10 00:00:07 116.0
113 2021-01-10 00:00:08 113.0
114 2021-01-10 00:00:09 NaN
115 2021-01-10 00:00:10 NaN
112 2021-01-10 00:00:11 NaN
116 2021-01-10 00:00:12 NaN
129 2021-01-10 00:00:13 129.0
123 2021-01-10 00:00:14 123.0
125 2021-01-10 00:00:15 125.0
130 2021-01-10 00:00:16 NaN
120 2021-01-10 00:00:17 NaN
121 2021-01-10 00:00:18 NaN
122 2021-01-10 00:00:19 NaN
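If only the stable rows are needed downstream, the NaN markers make the filter a one-liner (a small follow-up using the 'stable' column computed above):

stable_only = df.dropna(subset=["stable"])  # rows whose 5-row window stayed within +/- rng
print(stable_only[["sensor", "time", "stable"]])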
My dataframe (df) contains two inputs, UnitShrtDescr and SchShrtDescr. For a particular UnitShrtDescr and SchShrtDescr it must predict the next value. But my data contains lots of missing values (the output for in-between dates is 0).
During prediction, Prophet continuously predicts a value for each and every day without treating the in-between dates' output as empty. How can I resolve this?
df  # (main dataframe)
UnitShrtDescr SchShrtDescr y ds id
8110 50 93 1 2011-12-01 243
3437 29 87 1 2011-12-21 133
6867 43 75 1 2011-12-23 204
1102 8 23 1 2011-12-28 36
5271 36 14 1 2011-12-28 166
... ... ... ... ... ...
13138 83 0 1 2018-05-18 390
14424 92 3 1 2018-05-18 432
11556 69 0 1 2018-05-18 334
11767 69 5 1 2018-05-18 338
4458 30 102 1 2018-05-18 141
15950 rows × 5 columns
code:
model = Prophet(daily_seasonality=True)
model.add_regressor("UnitShrtDescr")
model.add_regressor("SchShrtDescr")
model.fit(df)
The regressor values I want to predict for are UnitShrtDescr=40 and SchShrtDescr=93, so I made the future dataframe with make_future_dataframe:
future = model.make_future_dataframe(periods=100, include_history=False)
future["UnitShrtDescr"]=40
future["SchShrtDescr"]=93
The previous values for UnitShrtDescr=40 and SchShrtDescr=93 were:
dfx[(dfx['UnitShrtDescr']==40) & (dfx['SchShrtDescr']==93)].tail(10)
UnitShrtDescr SchShrtDescr y ds id
6293 40 93 1 2018-02-27 189
6294 40 93 3 2018-02-28 189
6295 40 93 1 2018-03-17 189
6296 40 93 1 2018-03-29 189
6297 40 93 1 2018-03-30 189
6298 40 93 4 2018-03-31 189
6299 40 93 1 2018-04-26 189
6300 40 93 1 2018-04-27 189
6301 40 93 4 2018-04-30 189
6302 40 93 1 2018-05-16 189
Please note that the gaps between dates are much bigger, which means y is 0 for the in-between dates. So when I make a prediction, it should predict the in-between dates as 0 as well. But in this case it continuously predicts y without considering the in-between y values as 0:
output = model.predict(future)
output[['ds','yhat']].head(10)
ds yhat
0 2018-05-19 2.959505
1 2018-05-20 2.631181
2 2018-05-21 2.418850
3 2018-05-22 2.411914
4 2018-05-23 2.386383
5 2018-05-24 2.444841
6 2018-05-25 2.409294
7 2018-05-26 2.937428
8 2018-05-27 2.588136
9 2018-05-28 2.358953
Please suggest changes or a better alternative for my case.
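No answer was posted for this one, but a common approach (a hedged sketch, not a confirmed fix) is to make the implicit zeros explicit before fitting: reindex each (UnitShrtDescr, SchShrtDescr) series to a full daily range and fill the missing y values with 0, so Prophet actually sees the zero days instead of interpolating across the gaps. The helper name below is hypothetical; the column names follow the question:

import pandas as pd

def fill_missing_days(frame, unit, sch):
    # hypothetical helper: select one (UnitShrtDescr, SchShrtDescr) series and
    # materialise the implicit y=0 rows for every date missing from the history
    sub = frame[(frame['UnitShrtDescr'] == unit) & (frame['SchShrtDescr'] == sch)].copy()
    sub['ds'] = pd.to_datetime(sub['ds'])
    full = sub.sort_values('ds').set_index('ds').asfreq('D')  # insert the missing dates
    full['y'] = full['y'].fillna(0)  # a gap means "no output", i.e. y = 0
    full['UnitShrtDescr'] = unit
    full['SchShrtDescr'] = sch
    return full.reset_index()

# dense = fill_missing_days(df, 40, 93)
# model.fit(dense)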
Here is how you can generate a dummy version of my Pandas DataFrame:
import pandas as pd

usr_id = [121,121,121,121,135,135,135,135,135,135,135,135,135]
ses_id = [95,95,95,108,97,97,97,97,98,98,98,101,101]
que_id = [1,8,15,23,1,42,9,5,7,9,10,17,20]
df = pd.DataFrame(list(zip(usr_id, ses_id, que_id)),
                  columns=['usr_id', 'ses_id', 'que_id'])
usr_id  ses_id  que_id
121     95      1
121     95      8
121     95      15
121     108     23
135     97      1
135     97      42
135     97      9
135     97      5
135     98      7
135     98      9
135     98      10
135     101     17
135     101     20
A user can attempt multiple sessions, where each session can have a varying number of questions. I need to create two columns that number the sessions and questions (session number or question number 1, 2, 3...) for each individual user. Something like this:
usr_id  ses_id  que_id  ses_no  que_no
121     95      1       1       1
121     95      8       1       2
121     95      15      1       3
121     108     23      2       1
135     97      1       1       1
135     97      42      1       2
135     97      9       1       3
135     97      5       1       4
135     98      7       2       1
135     98      9       2       2
135     98      10      2       3
135     101     17      3       1
135     101     20      3       2
So ses_id 95 was the first session usr_id 121 attempted, within which they attempted three questions: que_id 1, 8 and 15. The next session attempted by the same user is ses_id 108, with only one question, que_id 23. Another user, usr_id 135, attempted their first session, recorded as ses_id 97, in which they attempted four questions: que_id 1, 42, 9 and 5. The second session from the same user is then ses_id 98, and so on.
I managed to generate the 'que_no' using the following:
df['que_no'] = df.groupby('ses_id').cumcount()+1
But I couldn't find a way to do the same for ses_no.
I also had the idea of using .shift() to compare whether there is a change in 'usr_id' and/or 'ses_id' and somehow apply counting logic to the output. Something like this:
i = df.usr_id
j = df.ses_id
i_shift_ne = i.ne(i.shift())
j_shift_ne = j.ne(j.shift())
I am not sure whether this idea will work, and I am pretty sure there has to be a smarter way of doing this. It would be great if we could make it happen using the pandas library itself.
IIUC, use a custom lambda function per usr_id with factorize:
df['ses_no'] = df.groupby('usr_id')['ses_id'].transform(lambda x: pd.factorize(x)[0]) + 1
#if values are sorted
#df['ses_no'] = df.groupby('usr_id')['ses_id'].rank(method='dense').astype(int)
df['que_no'] = df.groupby(['usr_id','ses_no']).cumcount()+1
print(df)
usr_id ses_id que_id ses_no que_no
0 121 95 1 1 1
1 121 95 8 1 2
2 121 95 15 1 3
3 121 108 23 2 1
4 135 97 1 1 1
5 135 97 42 1 2
6 135 97 9 1 3
7 135 97 5 1 4
8 135 98 7 2 1
9 135 98 9 2 2
10 135 98 10 2 3
11 135 101 17 3 1
12 135 101 20 3 2
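For what it's worth, the .shift() idea from the question can also be made to work: a change in ses_id marks a new session, and a per-user cumulative sum of those change flags numbers the sessions. A sketch, assuming rows are ordered so each session's rows are contiguous:

# flag rows where ses_id differs from the previous row within each user,
# then cumulatively count those flags to get a per-user session number
df['ses_no'] = df.groupby('usr_id')['ses_id'].transform(lambda s: s.ne(s.shift()).cumsum())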
I'm not sure whether this is a general coding question or not, but I hope this is the correct forum. Consider the following reduced example DataFrame df:
Department CustomerID Date Price MenswearDemand HomeDemand
0 Menswear 418089 2019-04-18 199 199 0
1 Menswear 613573 2019-04-24 199 199 0
2 Menswear 161840 2019-04-25 199 199 0
3 Menswear 2134926 2019-04-29 199 199 0
4 Menswear 984801 2019-04-30 19 19 0
5 Home 398555 2019-01-27 52 0 52
6 Menswear 682906 2019-02-03 97 97 0
7 Menswear 682906 2019-02-03 97 97 0
8 Menswear 923491 2019-02-09 80 80 0
9 Menswear 1098782 2019-02-25 258 258 0
10 Menswear 721696 2019-03-25 12 12 0
11 Menswear 695706 2019-04-10 129 129 0
12 Underwear 637026 2019-01-18 349 0 0
13 Underwear 205997 2019-01-25 279 0 0
14 Underwear 787984 2019-02-01 27 0 0
15 Underwear 318256 2019-02-01 279 0 0
16 Underwear 570454 2019-02-14 262 0 0
17 Underwear 1239118 2019-02-28 279 0 0
18 Home 1680791 2019-04-04 1398 0 1398
I want to group this data by 'CustomerID' and then:
Turn the purchase date 'Date' into the number of days until a cutoff date, here taken as the day after the most recent purchase in the data. This is just the time from the customer's most recent purchase until that cutoff.
Sum over all the remaining demand columns, in this example only 'MenswearDemand' and 'HomeDemand'.
The result I should get is this:
Date MenswearDemand HomeDemand
CustomerID
161840 6 199 0
205997 96 0 0
318256 89 0 0
398555 94 0 52
418089 13 199 0
570454 76 0 0
613573 7 199 0
637026 103 0 0
682906 87 194 0
695706 21 129 0
721696 37 12 0
787984 89 0 0
923491 81 80 0
984801 1 19 0
1098782 65 258 0
1239118 62 0 0
1680791 27 0 1398
2134926 2 199 0
This is how I managed to solve it:
import datetime as dt
import pandas as pd

df['Date'] = pd.to_datetime(df['Date'])
cutoffDate = df['Date'].max() + dt.timedelta(days=1)
newdf = df.groupby('CustomerID').agg({'Date': lambda x: (cutoffDate - x.max()).days,
                                      'MenswearDemand': lambda x: x.sum(),
                                      'HomeDemand': lambda x: x.sum()})
However, in reality I have about 15 million rows and 30 demand columns. I really don't want to write all those 'DemandColumn': lambda x: x.sum() entries in my aggregate function every time, since they should all be summed. Is there a better way of doing this, like passing in an array of the subset of columns that a particular operation should be applied to?
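No answer was recorded here, but since agg() accepts a plain dict, one option (a sketch assuming every demand column should simply be summed) is to build that dict programmatically instead of writing it out:

# build the aggregation spec instead of spelling out 30 lambdas;
# matching on the 'Demand' suffix is an assumption based on the example columns
demand_cols = [c for c in df.columns if c.endswith('Demand')]
agg_spec = {'Date': lambda x: (cutoffDate - x.max()).days,
            **{c: 'sum' for c in demand_cols}}
newdf = df.groupby('CustomerID').agg(agg_spec)

Using the built-in 'sum' string instead of a lambda also lets pandas take its optimized aggregation path, which matters at 15 million rows.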
The Scenario:
I have 2 dataframes, fc0 and yc0, where fc0 is a cluster and yc0 is another dataframe that needs to be merged into fc0.
The nature of the data is as follows:
fc0
uid 1 2 3 4 5 6
234 235 4.000000 4.074464 4.128026 3.973045 3.921663 4.024864
235 236 3.524208 3.125669 3.652112 3.626923 3.524318 3.650589
236 237 4.174080 4.226267 4.200133 4.150983 4.124157 4.200052
yc0
iid uid 1 2 5 6 9 15
0 944 5.0 3.0 4.0 3.0 3.0 5.0
The Twist
I have 1682 columns in fc0 and a few hundred values in yc0. Now I need yc0 to go into fc0.
In haste to resolve it, I even tried yc0.reset_index(inplace=True), but it wasn't really helpful.
Expected Output
uid 1 2 3 4 5 6
234 235 4.000000 4.074464 4.128026 3.973045 3.921663 4.024864
235 236 3.524208 3.125669 3.652112 3.626923 3.524318 3.650589
236 237 4.174080 4.226267 4.200133 4.150983 4.124157 4.200052
944 5.0 3.0 NaN NaN 4.0 3.0 3.0
References
Link1 Tried this, but ended up inserting NaN values for the first 16 columns, with the rest of the data shifted by that many columns.
Link2 Couldn't match column keys; besides, I tried it for rows.
Link3 Merging doesn't match the columns.
Link4 Concatenation doesn't work that way.
Link5 Same issues with join.
EDIT 1
fc0.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 235 entries, 234 to 468
Columns: 1683 entries, uid to 1682
dtypes: float64(1682), int64(1)
memory usage: 3.0 MB
and
yc0.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Columns: 336 entries, uid to 1007
dtypes: float64(335), int64(1)
memory usage: 2.7 KB
Here's an MVCE example. Does this small sample data show the functionality that you are expecting?
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randint(0, 100, (5, 4)), columns=list('ABCE'))
A B C E
0 81 57 54 88
1 63 63 74 10
2 13 89 88 66
3 90 81 3 31
4 66 93 55 4
df2 = pd.DataFrame(np.random.randint(0,100,(5,4)), columns=list('BCDE'))
B C D E
0 93 48 62 25
1 24 97 52 88
2 53 50 21 13
3 81 27 7 81
4 10 21 77 19
df_out = pd.concat([df1,df2])
print(df_out)
Output:
A B C D E
0 81.0 57 54 NaN 88
1 63.0 63 74 NaN 10
2 13.0 89 88 NaN 66
3 90.0 81 3 NaN 31
4 66.0 93 55 NaN 4
0 NaN 93 48 62.0 25
1 NaN 24 97 52.0 88
2 NaN 53 50 21.0 13
3 NaN 81 27 7.0 81
4 NaN 10 21 77.0 19
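Applied to the original frames, the same align-by-column-name behaviour should give the expected output, assuming yc0's column labels match fc0's where they overlap (a sketch, not verified against the real 1682-column data; any yc0-only columns such as iid could be dropped first):

# rows from yc0 are appended; columns missing from either frame become NaN
combined = pd.concat([fc0, yc0], ignore_index=True, sort=False)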
I have this sample table:
ID Date Days Volume/Day
0 111 2016-01-01 20 50
1 111 2016-02-01 25 40
2 111 2016-03-01 31 35
3 111 2016-04-01 30 30
4 111 2016-05-01 31 25
5 111 2016-06-01 30 20
6 111 2016-07-01 31 20
7 111 2016-08-01 31 15
8 111 2016-09-01 29 15
9 111 2016-10-01 31 10
10 111 2016-11-01 29 5
11 111 2016-12-01 27 0
0 112 2016-01-01 31 55
1 112 2016-02-01 26 45
2 112 2016-03-01 31 40
3 112 2016-04-01 30 35
4 112 2016-05-01 31 30
5 112 2016-06-01 30 25
6 112 2016-07-01 31 25
7 112 2016-08-01 31 20
8 112 2016-09-01 30 20
9 112 2016-10-01 31 15
10 112 2016-11-01 29 10
11 112 2016-12-01 31 0
I'm trying to make my final table look like this below, after grouping by ID and Date.
ID Date CumDays Volume/Day
0 111 2016-01-01 20 50
1 111 2016-02-01 45 40
2 111 2016-03-01 76 35
3 111 2016-04-01 106 30
4 111 2016-05-01 137 25
5 111 2016-06-01 167 20
6 111 2016-07-01 198 20
7 111 2016-08-01 229 15
8 111 2016-09-01 258 15
9 111 2016-10-01 289 10
10 111 2016-11-01 318 5
11 111 2016-12-01 345 0
0 112 2016-01-01 31 55
1 112 2016-02-01 57 45
2 112 2016-03-01 88 40
3 112 2016-04-01 118 35
4 112 2016-05-01 149 30
5 112 2016-06-01 179 25
6 112 2016-07-01 210 25
7 112 2016-08-01 241 20
8 112 2016-09-01 271 20
9 112 2016-10-01 302 15
10 112 2016-11-01 331 10
11 112 2016-12-01 362 0
Next, I want to be able to extract the first Volume/Day value per ID, all the CumDays values, and all the Volume/Day values per ID and Date, so I can use them for further computation and for plotting Volume/Day vs CumDays. For example, for ID 111 the first Volume/Day value will be 50, and for ID 112 it will be 55. All CumDays values for ID 111 will be 20, 45, ... and for ID 112 they will be 31, 57, ... All Volume/Day values for ID 111 will be 50, 40, ... and for ID 112 they will be 55, 45, ...
My solution:
def get_time_rate(grp_df):
    t = grp_df['Days'].cumsum()
    r = grp_df['Volume/Day']
    return t, r
vals = df.groupby(['ID','Date']).apply(get_time_rate)
vals
Doing this, the cumulative calculation doesn't take effect at all: it returns the original Days values. This didn't let me move further in extracting the first Volume/Day value, all the CumDays values, and all the Volume/Day values I need. Any advice or help on how to go about it will be appreciated. Thanks.
Group by ID only. (Grouping by both ID and Date makes each group a single row, so a cumulative sum has nothing to accumulate, which is why the original attempt returned the unchanged Days values.)
g = df.groupby('ID')
Compute columns with transform:
df['CumDays'] = g.Days.transform('cumsum')
df['First Volume/Day'] = g['Volume/Day'].transform('first')
df
ID Date Days Volume/Day CumDays First Volume/Day
0 111 2016-01-01 20 50 20 50
1 111 2016-02-01 25 40 45 50
2 111 2016-03-01 31 35 76 50
3 111 2016-04-01 30 30 106 50
4 111 2016-05-01 31 25 137 50
5 111 2016-06-01 30 20 167 50
6 111 2016-07-01 31 20 198 50
7 111 2016-08-01 31 15 229 50
8 111 2016-09-01 29 15 258 50
9 111 2016-10-01 31 10 289 50
10 111 2016-11-01 29 5 318 50
11 111 2016-12-01 27 0 345 50
0 112 2016-01-01 31 55 31 55
1 112 2016-02-01 26 45 57 55
2 112 2016-03-01 31 40 88 55
3 112 2016-04-01 30 35 118 55
4 112 2016-05-01 31 30 149 55
5 112 2016-06-01 30 25 179 55
6 112 2016-07-01 31 25 210 55
7 112 2016-08-01 31 20 241 55
8 112 2016-09-01 30 20 271 55
9 112 2016-10-01 31 15 302 55
10 112 2016-11-01 29 10 331 55
11 112 2016-12-01 31 0 362 55
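To pull out the per-ID values the question asks for (the first Volume/Day, all CumDays, all Volume/Day), one option is a small dict built from the same groupby (a sketch using the columns computed above):

per_id = {i: {'first_volume': g['Volume/Day'].iloc[0],
              'cum_days': g['CumDays'].tolist(),
              'volume': g['Volume/Day'].tolist()}
          for i, g in df.groupby('ID')}
per_id[111]['first_volume']  # 50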
If you want grouped plots, you can iterate over the groups after grouping by ID and plot each one onto a shared axis:
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 6))
for i, g in df.groupby('ID'):
    g.plot(x='CumDays', y='Volume/Day', ax=ax, label=str(i))
plt.show()