Is there a better way to get the system downtime period? The system takes a reading and transmits it every 20 minutes. The code below identifies runs of identical values; a reading of 0.00 means no data was measured. However, I would like to find out how long the data was lost over consecutive timestamps. For example, if 2:20am, 2:40am and 3:00am all have the same value, the total lost time is 1 hour. I need the start time and end time in the dataframe. Please help.
df_same_values = df.groupby(
    (df['Water Quality Sensor'].shift() != df['Water Quality Sensor']).cumsum()
).filter(lambda x: len(x) >= 3)
Water Quality Sensor
date
2022-01-01 02:20:00 2.500000
2022-01-01 02:40:00 2.500000
2022-01-01 03:00:00 2.500000
2022-01-01 09:00:00 0.000000
2022-01-01 09:20:00 0.000000
2022-01-01 09:40:00 0.000000
2022-01-01 10:00:00 0.000000
2022-01-01 10:20:00 0.000000
2022-01-01 10:40:00 0.000000
2022-01-01 12:40:00 0.000000
2022-01-01 13:00:00 0.000000
2022-01-01 13:20:00 0.000000
2022-01-01 18:00:00 2.500000
2022-01-01 18:20:00 2.500000
2022-01-01 18:40:00 2.500000
2022-01-01 19:00:00 2.500000
2022-01-02 03:00:00 2.500000
2022-01-02 03:20:00 2.500000
2022-01-02 03:40:00 2.500000
2022-01-02 04:00:00 2.500000
2022-01-02 04:20:00 2.500000
2022-01-02 04:40:00 2.500000
2022-01-02 05:00:00 2.500000
2022-01-02 05:20:00 2.500000
2022-01-02 18:00:00 2.500000
2022-01-02 18:20:00 2.500000
2022-01-02 18:40:00 2.500000
2022-01-02 19:00:00 2.500000
2022-01-02 19:20:00 2.500000
If your data are truly this small, why not just iterate over the rows while recording appropriate values?
# assumes a default integer index with 'date' as a column
# (use df = df.reset_index() first if 'date' is the index, as shown above)
down_flag = False
down_start = None
down_stop = None
for i, row in df.iterrows():
    if row['Water Quality Sensor'] == 0 and not down_flag:
        down_flag = True
        down_start = row['date']
    if row['Water Quality Sensor'] > 0 and down_flag:
        down_flag = False
        down_stop = df.loc[i - 1, 'date']
down_time = down_stop - down_start
print(down_time)
gives
0 days 04:20:00
as expected. Note this only records the last portion of downtime if there are multiple outages.
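If you do need every outage with its start and end, a vectorized sketch (assuming the datetime index shown above; consecutive zero rows count as one outage, and duration is measured from the first to the last zero reading, matching the 0 days 04:20:00 above):
import pandas as pd

zero = df['Water Quality Sensor'].eq(0)
run_id = (zero != zero.shift()).cumsum()  # label each run of equal flags
outages = (df[zero]
           .groupby(run_id[zero])
           .apply(lambda g: pd.Series({'start': g.index.min(),
                                       'end': g.index.max(),
                                       'duration': g.index.max() - g.index.min()})))
print(outages)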
I have a dataframe:
df =
T1 C1
01/01/2022 11:20 2
01/01/2022 15:40 8
01/01/2022 17:50 3
I want to expand it such that:
- I will have the value at the specific given times
- I will have a row for each round (whole-hour) timestamp

So if the given times are
l = ['01/01/2022 15:46', '01/01/2022 11:28']
I will have:
df_new =
T1 C1
01/01/2022 11:20 2
01/01/2022 11:28 2
01/01/2022 12:00 2
01/01/2022 13:00 2
01/01/2022 14:00 2
01/01/2022 15:00 2
01/01/2022 15:40 8
01/01/2022 15:46 8
01/01/2022 16:00 8
01/01/2022 17:00 8
01/01/2022 17:50 3
You can add the extra dates and ffill:
df['T1'] = pd.to_datetime(df['T1'])
extra = pd.date_range(df['T1'].min().ceil('H'), df['T1'].max().floor('H'), freq='1h')
(pd.concat([df, pd.DataFrame({'T1': extra})])
.sort_values(by='T1', ignore_index=True)
.ffill()
)
Output:
T1 C1
0 2022-01-01 11:20:00 2.0
1 2022-01-01 12:00:00 2.0
2 2022-01-01 13:00:00 2.0
3 2022-01-01 14:00:00 2.0
4 2022-01-01 15:00:00 2.0
5 2022-01-01 15:40:00 8.0
6 2022-01-01 16:00:00 8.0
7 2022-01-01 17:00:00 8.0
8 2022-01-01 17:50:00 3.0
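The hourly grid alone doesn't cover the specific times in l from the question; a sketch that also appends those (assuming the strings in l parse with pd.to_datetime):
import pandas as pd

l = ['01/01/2022 11:28', '01/01/2022 15:46']
extra = pd.date_range(df['T1'].min().ceil('H'), df['T1'].max().floor('H'), freq='1h')
extra = extra.append(pd.to_datetime(l))  # add the requested specific times

df_new = (pd.concat([df, pd.DataFrame({'T1': extra})])
          .sort_values(by='T1', ignore_index=True)
          .ffill())
print(df_new)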
Here is a way to do what your question asks that will ensure:
- there are no duplicate times in T1 in the output, even if any of the times in the original are round hours
- the results are of the same type as the values in the C1 column of the input (in this case, integers, not floats)
hours = pd.date_range(df.T1.min().ceil("H"), df.T1.max().floor("H"), freq="60min")
idx_new = df.set_index('T1').join(pd.DataFrame(index=hours), how='outer', sort=True).index
df_new = (df.set_index('T1')
            .reindex(index=idx_new, method='ffill')
            .reset_index()
            .rename(columns={'index': 'T1'}))
Output:
T1 C1
0 2022-01-01 11:20:00 2
1 2022-01-01 12:00:00 2
2 2022-01-01 13:00:00 2
3 2022-01-01 14:00:00 2
4 2022-01-01 15:00:00 2
5 2022-01-01 15:40:00 8
6 2022-01-01 16:00:00 8
7 2022-01-01 17:00:00 8
8 2022-01-01 17:50:00 3
Example of how round dates in the input are handled:
df = pd.DataFrame({
    #'T1': pd.to_datetime(['01/01/2022 11:20', '01/01/2022 15:40', '01/01/2022 17:50']),
    'T1': pd.to_datetime(['01/01/2022 11:00', '01/01/2022 15:40', '01/01/2022 17:00']),
    'C1': [2, 8, 3]})
Input:
T1 C1
0 2022-01-01 11:00:00 2
1 2022-01-01 15:40:00 8
2 2022-01-01 17:00:00 3
Output (no duplicates):
T1 C1
0 2022-01-01 11:00:00 2
1 2022-01-01 12:00:00 2
2 2022-01-01 13:00:00 2
3 2022-01-01 14:00:00 2
4 2022-01-01 15:00:00 2
5 2022-01-01 15:40:00 8
6 2022-01-01 16:00:00 8
7 2022-01-01 17:00:00 3
Another possible solution, based on pandas.DataFrame.resample. The final dropna removes the hourly slots that fall before the first observation, which ffill cannot fill:
df['T1'] = pd.to_datetime(df['T1'])
(pd.concat([df, df.set_index('T1').resample('1H').asfreq().reset_index()])
.sort_values('T1').ffill().dropna().reset_index(drop=True))
Output:
T1 C1
0 2022-01-01 11:20:00 2.0
1 2022-01-01 12:00:00 2.0
2 2022-01-01 13:00:00 2.0
3 2022-01-01 14:00:00 2.0
4 2022-01-01 15:00:00 2.0
5 2022-01-01 15:40:00 8.0
6 2022-01-01 16:00:00 8.0
7 2022-01-01 17:00:00 8.0
8 2022-01-01 17:50:00 3.0
I have two pandas dataframes in Python, df_main and df_aux.
df_main is a table that gathers events, with the datetime when each happened and a column "Description" that gives a codified location. It has the following structure:
Date                 Description
2022-01-01 13:45:23  A
2022-01-01 14:22:00  C
2022-01-01 16:15:33  D
2022-01-01 16:21:22  E
2022-01-02 13:21:56  B
2022-01-02 14:45:41  B
2022-01-02 15:11:34  C
df_aux is a table that gives the number of other events (say, people walking by between Initial_Date and Final_Date) happening at each location (A, B, C, D), with 1-hour granularity. The structure of df_aux is as follows:
Initial_Date         Final_Date           A  B  C  D
2022-01-01 12:00:00  2022-01-01 12:59:59  2  0  1  2
2022-01-01 13:00:00  2022-01-01 13:59:59  3  2  4  5
2022-01-01 14:00:00  2022-01-01 14:59:59  2  2  7  0
2022-01-01 15:00:00  2022-01-01 15:59:59  5  2  2  0
2022-01-02 12:00:00  2022-01-02 12:59:59  1  1  0  3
2022-01-02 13:00:00  2022-01-02 13:59:59  5  5  0  3
2022-01-02 14:00:00  2022-01-02 14:59:59  2  3  2  1
2022-01-02 15:00:00  2022-01-02 15:59:59  3  4  1  0
My problem is that I need to add a new column to df_main to account for the number of people who walked by in the hour before each event. For example, the first event happens at 13:45:23; in df_aux we look for the previous hour (12:45:23), which falls in the first row, since 12:45:23 is between 12:00:00 and 12:59:59. In that time range, column A has a value of 2, so we add a new column "People_prev_hour" to df_main with the value 2.
Following the same logic, the full df_main would be:
Date                 Description  People_prev_hour
2022-01-01 13:45:23  A            2
2022-01-01 14:22:00  C            4
2022-01-01 16:15:33  D            0
2022-01-01 16:21:22  E            NaN
2022-01-02 13:21:56  B            1
2022-01-02 14:45:41  B            5
2022-01-02 15:11:34  F            NaN
Datetimes will always be complete between both dfs, but the Description column may not be. As seen in the full df_main, two rows have Description values E and F, which are not in df_aux; in those cases a NaN must be present.
I can't think of a way to merge these two dfs into the desired output, as pd.merge uses common columns, and I haven't managed to get anywhere with pd.melt or pd.pivot. Any help is much appreciated!
A first idea is to use merge_asof, because the hourly intervals do not overlap. Shifting Initial_Date forward by one hour means the backward merge matches each event to the interval covering the hour before it:
df1 = pd.merge_asof(
    df_main,
    df_aux.assign(Initial_Date=df_aux['Initial_Date'] + pd.Timedelta(1, 'hour')),
    left_on='Date',
    right_on='Initial_Date')
Then use indexing lookup:
idx, cols = pd.factorize(df1['Description'])
df_main['People_prev_hour'] = (df1.reindex(cols, axis=1)
                                  .to_numpy()[np.arange(len(df1)), idx])
print(df_main)
Date Description People_prev_hour
0 2022-01-01 13:45:23 A 2.0
1 2022-01-01 14:22:00 C 4.0
2 2022-01-01 16:15:33 D 0.0
3 2022-01-01 16:21:22 E NaN
4 2022-01-02 13:21:56 B 1.0
5 2022-01-02 14:45:41 B 5.0
6 2022-01-02 15:11:34 C 2.0
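For context, a toy sketch of what the factorize/indexing-lookup step does (the data here is illustrative): for each row it picks the value from the column named in Description.
import numpy as np
import pandas as pd

# hypothetical data: pick, per row, the value from the column whose
# name appears in 'Description'
toy = pd.DataFrame({'Description': ['A', 'B', 'A'],
                    'A': [10, 20, 30],
                    'B': [1, 2, 3]})
idx, cols = pd.factorize(toy['Description'])  # idx=[0, 1, 0], cols=['A', 'B']
picked = toy.reindex(cols, axis=1).to_numpy()[np.arange(len(toy)), idx]
print(picked)  # [10  2 30]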
Another idea with IntervalIndex:
s = pd.IntervalIndex.from_arrays(df_aux.Initial_Date + pd.Timedelta(1, 'hour'),
                                 df_aux.Final_Date + pd.Timedelta(1, 'hour'),
                                 closed='both')
df1 = df_aux.set_index(s).loc[df_main.Date]
print(df1)
Initial_Date \
[2022-01-01 13:00:00, 2022-01-01 13:59:59] 2022-01-01 12:00:00
[2022-01-01 14:00:00, 2022-01-01 14:59:59] 2022-01-01 13:00:00
[2022-01-01 16:00:00, 2022-01-01 16:59:59] 2022-01-01 15:00:00
[2022-01-01 16:00:00, 2022-01-01 16:59:59] 2022-01-01 15:00:00
[2022-01-02 13:00:00, 2022-01-02 13:59:59] 2022-01-02 12:00:00
[2022-01-02 14:00:00, 2022-01-02 14:59:59] 2022-01-02 13:00:00
[2022-01-02 15:00:00, 2022-01-02 15:59:59] 2022-01-02 14:00:00
Final_Date A B C D
[2022-01-01 13:00:00, 2022-01-01 13:59:59] 2022-01-01 12:59:59 2 0 1 2
[2022-01-01 14:00:00, 2022-01-01 14:59:59] 2022-01-01 13:59:59 3 2 4 5
[2022-01-01 16:00:00, 2022-01-01 16:59:59] 2022-01-01 15:59:59 5 2 2 0
[2022-01-01 16:00:00, 2022-01-01 16:59:59] 2022-01-01 15:59:59 5 2 2 0
[2022-01-02 13:00:00, 2022-01-02 13:59:59] 2022-01-02 12:59:59 1 1 0 3
[2022-01-02 14:00:00, 2022-01-02 14:59:59] 2022-01-02 13:59:59 5 5 0 3
[2022-01-02 15:00:00, 2022-01-02 15:59:59] 2022-01-02 14:59:59 2 3 2 1
idx, cols = pd.factorize(df_main['Description'])
df_main['People_prev_hour'] = (df1.reindex(cols, axis=1)
                                  .to_numpy()[np.arange(len(df1)), idx])
print(df_main)
Date Description People_prev_hour
0 2022-01-01 13:45:23 A 2.0
1 2022-01-01 14:22:00 C 4.0
2 2022-01-01 16:15:33 D 0.0
3 2022-01-01 16:21:22 E NaN
4 2022-01-02 13:21:56 B 1.0
5 2022-01-02 14:45:41 B 5.0
6 2022-01-02 15:11:34 C 2.0
I have a dataset of New York taxi rides. There are some wrong values in pickup_datetime and dropoff_datetime, because the dropoff is before the pickup. How can I compare these two values and drop those rows?
You can do it with:
import pandas as pd
import numpy as np

df = pd.read_excel("Taxi.xlsx")

# convert to datetime
df["Pickup"] = pd.to_datetime(df["Pickup"])
df["Drop"] = pd.to_datetime(df["Drop"])

# flag the rows to remove (pickup later than drop)
df["ToRemove"] = np.where(df["Pickup"] > df["Drop"], 1, 0)
print(df)

# keep only the rows where pickup is not after drop
dfClean = df[df["ToRemove"] == 0]
print(dfClean)
Result:
df:
Pickup Drop Price ToRemove
0 2022-01-01 10:00:00 2022-01-01 10:05:00 5 0
1 2022-01-01 10:20:00 2022-01-01 10:25:00 8 0
2 2022-01-01 10:40:00 2022-01-01 10:45:00 3 0
3 2022-01-01 11:00:00 2022-01-01 10:05:00 10 1
4 2022-01-01 11:20:00 2022-01-01 11:25:00 5 0
5 2022-01-01 11:40:00 2022-01-01 08:45:00 8 1
6 2022-01-01 12:00:00 2022-01-01 12:05:00 3 0
7 2022-01-01 12:20:00 2022-01-01 12:25:00 10 0
8 2022-01-01 12:40:00 2022-01-01 12:45:00 5 0
9 2022-01-01 13:00:00 2022-01-01 13:05:00 8 0
10 2022-01-01 13:20:00 2022-01-01 13:25:00 3 0
dfClean:
Pickup Drop Price ToRemove
0 2022-01-01 10:00:00 2022-01-01 10:05:00 5 0
1 2022-01-01 10:20:00 2022-01-01 10:25:00 8 0
2 2022-01-01 10:40:00 2022-01-01 10:45:00 3 0
4 2022-01-01 11:20:00 2022-01-01 11:25:00 5 0
6 2022-01-01 12:00:00 2022-01-01 12:05:00 3 0
7 2022-01-01 12:20:00 2022-01-01 12:25:00 10 0
8 2022-01-01 12:40:00 2022-01-01 12:45:00 5 0
9 2022-01-01 13:00:00 2022-01-01 13:05:00 8 0
10 2022-01-01 13:20:00 2022-01-01 13:25:00 3 0
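If you don't need the helper column, the same cleanup can be a one-liner (a sketch, equivalent to the filter above):
# keep only the rows where pickup is not after drop
dfClean = df[df["Pickup"] <= df["Drop"]]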
I want to resample a dataframe and ffill the values; see the code below. However, the last value isn't forward filled.
df=pd.DataFrame(index=pd.date_range(start='1/1/2022',periods=5,freq='h'),data=range(0,5))
print(df.resample('15T').ffill())
results in:
0
2022-01-01 00:00:00 0
2022-01-01 00:15:00 0
2022-01-01 00:30:00 0
2022-01-01 00:45:00 0
2022-01-01 01:00:00 1
2022-01-01 01:15:00 1
2022-01-01 01:30:00 1
2022-01-01 01:45:00 1
2022-01-01 02:00:00 2
2022-01-01 02:15:00 2
2022-01-01 02:30:00 2
2022-01-01 02:45:00 2
2022-01-01 03:00:00 3
2022-01-01 03:15:00 3
2022-01-01 03:30:00 3
2022-01-01 03:45:00 3
2022-01-01 04:00:00 4
I would like the last entry to also occur 3 more times. Currently I handle this by adding an extra entry manually, resample and then drop the last value, but that seems cumbersome. I hope there is a more elegant way.
As @mozway mentioned, this is just the way resampling works in pandas. Alternatively, you can do the upsampling manually with a join.
df = pd.DataFrame(
    index=pd.date_range(start='1/1/2022', periods=5, freq='1h'),
    data=range(0, 5)
)

# build the full 15-minute index (5 hours x 4 slots) and join the hourly data
pd.DataFrame(
    index=pd.date_range(start='1/1/2022', periods=5*4, freq='15min')
).join(df).ffill()
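Another sketch along the same lines: build the padded target index explicitly and forward-fill with reindex. The 45-minute pad supplies the three extra 15-minute slots you wanted after the last hourly value:
import pandas as pd

df = pd.DataFrame(index=pd.date_range(start='1/1/2022', periods=5, freq='h'),
                  data=range(0, 5))

# pad the end of the index so the last value is repeated three more times
target = pd.date_range(df.index.min(),
                       df.index.max() + pd.Timedelta(minutes=45),
                       freq='15min')
out = df.reindex(target, method='ffill')
print(out.tail(4))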
I have 15-minute candles of stock data and a short signal. I want to create a new stop-loss column: if signal == 0, then stop-loss = the high of the second-next candle, i.e. df['high'].shift(-2).
open high low close signal
date
2020-01-01 09:15:00 1452.50 1457.00 1449.20 1452.50 NaN
2020-01-01 09:30:00 1452.30 1454.40 1450.00 1451.45 NaN
2020-01-01 09:45:00 1450.50 1454.80 1450.00 1453.75 NaN
2020-01-01 10:00:00 1453.70 1453.70 1450.10 1450.70 0.0
2020-01-01 10:15:00 1450.70 1453.00 1450.50 1452.20 NaN
2020-01-01 10:30:00 1452.00 1452.00 1446.75 1446.85 NaN
2020-01-01 10:45:00 1447.60 1449.00 1445.50 1447.10 NaN
2020-01-01 11:00:00 1446.75 1449.00 1446.55 1447.65 NaN
In this example, the stop-loss for the short signal at 2020-01-01 10:00:00 will be 1452.00, which is the high at 2020-01-01 10:30:00.
Let us try np.where(condition, value if condition is true, value if condition is false):
df['stop-loss'] = np.where(df.signal == 0, df.high.shift(-2), '')
You didn't specify what the value should be when the condition is false, so I used '' there.
open high low close signal stop-loss
date
2020-01-01 09:15:00 1452.50 1457.0 1449.20 1452.50 NaN
2020-01-01 09:30:00 1452.30 1454.4 1450.00 1451.45 NaN
2020-01-01 09:45:00 1450.50 1454.8 1450.00 1453.75 NaN
2020-01-01 10:00:00 1453.70 1453.7 1450.10 1450.70 0.0 1452.0
2020-01-01 10:15:00 1450.70 1453.0 1450.50 1452.20 NaN
2020-01-01 10:30:00 1452.00 1452.0 1446.75 1446.85 NaN
2020-01-01 10:45:00 1447.60 1449.0 1445.50 1447.10 NaN
2020-01-01 11:00:00 1446.75 1449.0 1446.55 1447.65 NaN
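One caveat: filling with '' mixes strings and floats, so the stop-loss column ends up with object dtype. If you want it to stay numeric, np.nan is a safer default (a sketch):
df['stop-loss'] = np.where(df.signal == 0, df.high.shift(-2), np.nan)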
Following your additional question in the comments, assume the dataframe is:
open high low close signal
date
2020-01-01 09:15:00 1452.50 1457.0 1449.20 1452.50 NaN
2020-01-01 09:30:00 1452.30 1454.4 1450.00 1451.45 NaN
2020-01-01 09:45:00 1450.50 1454.8 1450.00 1453.75 NaN
2020-01-01 10:00:00 1453.70 1453.7 1450.10 1450.70 0.0
2020-01-01 10:15:00 1450.70 1453.0 1450.50 1452.20 NaN
2020-01-01 10:30:00 1452.00 1452.0 1446.75 1446.85 1.0
2020-01-01 10:45:00 1447.60 1449.0 1445.50 1447.10 NaN
2020-01-01 11:00:00 1446.75 1449.0 1446.55 1447.65 NaN
Use np.select(conditions, choices, alternative):
conditions = [df.signal == 0, df.signal == 1]
choices = [df.high.shift(-2), df.low.shift(-2)]
df['stop-loss'] = np.select(conditions, choices, '')
open high low close signal stop-loss
date
2020-01-01 09:15:00 1452.50 1457.0 1449.20 1452.50 NaN
2020-01-01 09:30:00 1452.30 1454.4 1450.00 1451.45 NaN
2020-01-01 09:45:00 1450.50 1454.8 1450.00 1453.75 NaN
2020-01-01 10:00:00 1453.70 1453.7 1450.10 1450.70 0.0 1452.0
2020-01-01 10:15:00 1450.70 1453.0 1450.50 1452.20 NaN
2020-01-01 10:30:00 1452.00 1452.0 1446.75 1446.85 1.0 1446.55
2020-01-01 10:45:00 1447.60 1449.0 1445.50 1447.10 NaN
2020-01-01 11:00:00 1446.75 1449.0 1446.55 1447.65 NaN