Find number of gaps per hour in 10 second interval data - python

My data is organized in 10-second intervals for 24 hours:
2015-10-14 15:01:10 3956.58 0 19 6.21 105.99 42 59.24
2015-10-14 15:01:20 3956.58 0 1 0.81 121.57 42 59.24
2015-10-14 15:01:30 3956.58 0 47 8.29 115.53 42 59.24
2015-10-14 15:01:40 3956.58 0 79 12.19 107.64 42 59.24
..
..
..
2015-10-15 13:01:10 3956.58 0 79 8.02 107.64 42 59.24
2015-10-15 13:01:10 3956.58 0 79 7.95 108.98 42 59.24
2015-10-15 13:01:10 3956.58 0 79 7.07 110.58 42 59.24
I want to check if, for any hourly group, there are intervals that exceed 10 seconds. How do I get the gaps for each group and print it? So far I've the following:
df = pd.read_csv('convertcsv.csv', parse_dates = True, index_col=0,
                 names=['date', 'hole_depth', 'rop', 'rotary',
                        'torque', 'hook_load', 'azimuth', 'inclin'])
df['num_gaps'] = df.groupby(df.index.date)
df.groupby(df.index.time)['num_gaps'].sum()
I want the output to be:
timestamp, num_of_gaps
2015-10-15 06:00, 5
2015-10-15 07:00, 0
...

This is a great answer to get you started. Your case is different as you would like to first group by hour and then look for differences larger than 10 seconds (avoids the date difference problem mentioned in the answer).
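So, assuming your DataFrame has a DatetimeIndex, you could try:
import pandas as pd

df['tvalue'] = df.index
# pd.Grouper(freq='h') replaces pd.TimeGrouper('H'), which was removed in pandas 1.0
time_groups = df.groupby(pd.Grouper(freq='h'))
for hour, data in time_groups:
    data = data.copy()  # work on a copy so the new columns do not touch df
    data['delta'] = (data['tvalue'] - data['tvalue'].shift()).fillna(pd.Timedelta(0))
    # express each step in seconds (not in units of 10 seconds)
    data['delta_sec'] = data['delta'].dt.total_seconds()
    print(data[data.delta_sec > 10])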
Just saw your edit - you could of course also just count the rows per hour and check whether the count is lower than the 360 one would expect for 10-second data. In other words,
print(df.groupby(pd.Grouper(freq='h')).size())
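A minimal sketch of producing the requested timestamp, num_of_gaps table directly, assuming the index is a DatetimeIndex. The function name gaps_per_hour and the sample timestamps are made up for illustration. Unlike the per-group loop, it takes the differences over the whole series first, so a gap that straddles an hour boundary is still counted, in the hour where it ends. Hours with no rows at all do not appear in the result.

```python
import pandas as pd

def gaps_per_hour(index, expected=pd.Timedelta(seconds=10)):
    """Count steps longer than `expected`, grouped by the hour in which each step ends."""
    ts = index.to_series().sort_index()
    # The first difference is NaT, which compares as False, so it is never a gap.
    is_gap = ts.diff() > expected
    counts = is_gap.groupby(ts.index.floor("h")).sum()
    counts.index.name = "timestamp"
    return counts.rename("num_of_gaps")

# Hypothetical sample: one 30-second step inside hour 16, one long step into hour 17.
idx = pd.to_datetime([
    "2015-10-14 15:59:40", "2015-10-14 15:59:50",
    "2015-10-14 16:00:00", "2015-10-14 16:00:10",
    "2015-10-14 16:00:40", "2015-10-14 16:00:50",
    "2015-10-14 17:00:00", "2015-10-14 17:00:10",
])
result = gaps_per_hour(idx)
print(result)
```

Duplicate timestamps, like the last rows of the sample in the question, give a difference of zero and are therefore not counted as gaps.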

Related

Pandas: calculating mean value of multiple columns using datetime and Grouper removes columns or doesn't return correct Dataframe

As part of a larger task, I want to calculate the monthly mean values for each specific station. This is already difficult to do, but I am getting close.
The dataframe has many columns, but ultimately I only use the following information:
Date Value Station_Name
0 2006-01-03 18 2
1 2006-01-04 12 2
2 2006-01-05 11 2
3 2006-01-06 10 2
4 2006-01-09 22 2
... ... ...
3510 2006-12-23 47 45
3511 2006-12-24 46 45
3512 2006-12-26 35 45
3513 2006-12-27 35 45
3514 2006-12-30 28 45
I am running into two issues, using:
df.groupby(['Station_Name', pd.Grouper(freq='M')])['Value'].mean()
It results in something like:
Station_Name Date
2 2003-01-31 29.448387
2003-02-28 30.617857
2003-03-31 28.758065
2003-04-30 28.392593
2003-05-31 30.318519
...
45 2003-09-30 16.160000
2003-10-31 18.906452
2003-11-30 26.296667
2003-12-31 30.306667
2004-01-31 29.330000
Which I can't seem to use as a regular dataframe, and the datetime is messed up as it doesn't show the monthly mean but gives the last day back. Also the station name is a single index, and not for the whole column. Plus the mean value doesn't have a "column name" at all. This isn't a dataframe, but a pandas.core.series.Series. I can't convert this again because it's not correct, and using the .to_frame() method shows that it is still indeed a Dataframe. I don't get this part.
I found that, to get a normal dataframe back, I can pass
as_index = False
to the groupby method. But this results in the months not being shown:
df.groupby(['Station_Name', pd.Grouper(freq='M')], as_index = False)['Value'].mean()
Gives:
Station_Name Value
0 2 29.448387
1 2 30.617857
2 2 28.758065
3 2 28.392593
4 2 30.318519
... ... ...
142 45 16.160000
143 45 18.906452
144 45 26.296667
145 45 30.306667
146 45 29.330000
I can't just simply add the month later, as not every station has an observation in every month.
I've tried using other methods, such as
df.resample("M").mean()
But it doesn't seem possible to do this on multiple columns. It returns the mean value of everything.
Edit: This is ultimately what I would want.
Station_Name Date Value
0 2 2003-01 29.448387
1 2 2003-02 30.617857
2 2 2003-03 28.758065
3 2 2003-04 28.392593
4 2 2003-05 30.318519
... ... ...
142 45 2003-08 16.160000
143 45 2003-09 18.906452
144 45 2003-10 26.296667
145 45 2003-11 30.306667
146 45 2003-12 29.330000
OK, how about this (it assumes the Date column already has a datetime dtype):
df = df.groupby(['Station_Name',df['Date'].dt.to_period('M')])['Value'].mean().reset_index()
Output:
Station_Name Date Value
0 2 2006-01 14.6
1 45 2006-12 38.2
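A self-contained sketch of the same idea, using only the ten rows visible in the question. It assumes Date arrives as strings, so it converts the column with pd.to_datetime first; the variable name monthly is illustrative.

```python
import pandas as pd

# The ten rows shown in the question (the elided middle rows are left out).
df = pd.DataFrame({
    "Date": ["2006-01-03", "2006-01-04", "2006-01-05", "2006-01-06", "2006-01-09",
             "2006-12-23", "2006-12-24", "2006-12-26", "2006-12-27", "2006-12-30"],
    "Value": [18, 12, 11, 10, 22, 47, 46, 35, 35, 28],
    "Station_Name": [2] * 5 + [45] * 5,
})
df["Date"] = pd.to_datetime(df["Date"])  # .dt accessors need a datetime dtype

# Grouping by a monthly Period keeps a "2006-01"-style label instead of a month-end date,
# and reset_index() turns both group keys back into ordinary columns.
monthly = (
    df.groupby(["Station_Name", df["Date"].dt.to_period("M")])["Value"]
      .mean()
      .reset_index()
)
print(monthly)
```

Only station/month combinations that actually have observations produce a row, which covers the concern that not every station has data in every month.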

How can I iterate over a Pandas DataFrame and run a function over them

I have my CSV data saved as a dataframe and I want to take the values of a row then use them in a function. I'll try to show what I am looking for. I have tried sorting by amounts but I can't figure out how to separate out the data after that step. I am new to Pandas and I would appreciate any helpful and problem-relevant feedback.
UPDATE: If you suggest using .apply on the dataframe, could you show me a good way of applying a complex function? The Pandas documentation only shows simple functions, which I don't find useful given the context.
Here is the df
Date Amount
0 12/27/2019 NaN
1 12/27/2019 -14.00
2 12/27/2019 -15.27
3 12/30/2019 -1.00
4 12/30/2019 -35.01
5 12/30/2019 -9.99
6 01/02/2020 -7.57
7 01/03/2020 1225.36
8 01/03/2020 -40.00
9 01/03/2020 -59.90
10 01/03/2020 -9.52
11 01/06/2020 100.00
12 01/06/2020 -6.41
13 01/06/2020 -31.07
14 01/06/2020 -2.50
15 01/06/2020 -7.46
16 01/06/2020 -18.98
17 01/06/2020 -1.25
18 01/06/2020 -2.50
19 01/06/2020 -1.25
20 01/06/2020 -170.94
21 01/06/2020 -150.00
22 01/07/2020 -20.00
23 01/07/2020 -18.19
24 01/07/2020 -4.00
25 01/08/2020 -1.85
26 01/08/2020 -1.10
27 01/09/2020 -21.00
28 01/09/2020 -31.00
29 01/09/2020 -7.13
30 01/10/2020 -10.00
31 01/10/2020 -1.75
32 01/10/2020 -125.00
33 01/13/2020 -10.60
34 01/13/2020 -2.50
35 01/13/2020 -7.00
36 01/13/2020 -46.32
37 01/13/2020 -1.25
38 01/13/2020 -39.04
39 01/13/2020 -9.46
40 01/13/2020 -179.00
41 01/13/2020 -140.00
42 01/15/2020 -150.04
I want to take the amount value from a row, then look for a matching amount value. Once a matching value is found I want to run a timedelta between the two rows with a matching value.
Thus far, every time I have tried a conditional statement of some sort I get an error. Does anyone have any ideas how I might be able to accomplish this task?
Here is a bit of code I have started with.
amount_1 = df.loc[1, 'Amount']
amount_2 = df.loc[2, 'Amount']
print(amount_1, amount_2)
date_1 = df.loc[2, 'Date'] #skipping the first row.
x = 2
x += 1
date_2 = df.loc[x, 'Date']
## Not real code, but a logical flow I am aiming for
if amount_2 == amount_1:
    timed = date_2 - date_1
    print(timed, amount_2)
elif amount_2 != amount_1:
    # go to the next row and check
    pass
You could use something like this:
distinct_values = df["Amount"].unique() # Select all distinct values
for value_unique in distinct_values: # for each distinct value
    temp_df = df.loc[df["Amount"] == value_unique] # find rows of that value
    # You could iterate over that temp df to do your timedelta operations...
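One way the timedelta step could look without an explicit loop, sketched on a hand-picked subset of the rows in the question. The column name since_previous and the variable names are made up for illustration. It assumes that matching means exact equality of the float amounts and that Date is parsed as month/day/year.

```python
import pandas as pd

# A subset of the question's rows, including the NaN amount in row 0.
df = pd.DataFrame({
    "Date": ["12/27/2019", "12/27/2019", "12/30/2019",
             "01/06/2020", "01/06/2020", "01/13/2020", "01/13/2020"],
    "Amount": [None, -14.00, -1.00, -2.50, -1.25, -2.50, -1.25],
})
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")

# Keep only amounts that occur more than once, ordered so matches sit next to each other.
repeated = (
    df.dropna(subset=["Amount"])
      .loc[lambda d: d.duplicated("Amount", keep=False)]
      .sort_values(["Amount", "Date"])
      .copy()
)
# Time since the previous row with the same amount (NaT for the first occurrence).
repeated["since_previous"] = repeated["Date"] - repeated.groupby("Amount")["Date"].shift()
matches = repeated.dropna(subset=["since_previous"])
print(matches)
```

Each row of matches is the later row of a matching pair, together with the time elapsed since the earlier one.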

Data cleaning and preparation for Time-Series-LSTM

I need to prepare my data to feed it into an LSTM that predicts the next day.
My dataset is a time series sampled every second, but I only have 3-5 hours of data per day. (This is the only dataset I have, so I can't change it.)
It has a date-time column and a value column.
E.g.:
datetime..............Value
2015-03-15 12:00:00...1000
2015-03-15 12:00:01....10
.
.
I would like to write code that extracts e.g. 4 hours per day and, for specific months only, deletes the first extracted hour (because this data is faulty).
I managed to write code that extracts e.g. 2 hours for the x data (input) and y data (output).
I hope I could explain my problem to you.
The dataset covers 1 year of per-second data from 6pm to 11pm; the rest is missing.
In e.g. August-November the first hour is faulty data and needs to be deleted.
init = True
for day in np.unique(x_df.index.date):
    temp = x_df.loc[(day + pd.DateOffset(hours=18)):(day + pd.DateOffset(hours=20))]
    if len(temp) == 7201:
        if init:
            x_df1 = np.array([temp.values])
            init = False
        else:
            #print (temp.values.shape)
            x_df1 = np.append(x_df1, np.array([temp.values]), axis=0)
    #else:
        #if not temp.empty:
            #print (temp.index[0].date(), len(temp))
x_df1 = np.array(x_df1)
print('X-Shape:', x_df1.shape,
      'Y-Shape:', y_df1.shape)
#sample, timesteps and features for LSTM
X-Shape: (32, 7201, 6) Y-Shape: (32, 7201)
My expected result is a dataset of e.g. 4 hours a day where the first hour in e.g. August, September, and October is deleted.
I would also be very happy if someone could suggest nicer code for this.
Probably not the most efficient solution, but maybe it still fits.
First let's generate some random data for the first 4 months and 5 days per month:
import datetime
import random
import pandas as pd

frames = []
for month in range(1, 5):  # first 4 months
    for day in range(5, 10):  # 5 days
        hour = random.randint(18, 19)
        minute = random.randint(1, 59)
        dt = datetime.datetime(2018, month, day, hour, minute, 0)
        dti = pd.date_range(dt, periods=60*60*4, freq='s')
        values = [random.randrange(1, 101, 1) for _ in range(len(dti))]
        frames.append(pd.DataFrame(values, index=dti, columns=['Value']))
# DataFrame.append was removed in pandas 2.0, so concatenate once at the end
df = pd.concat(frames)
Now let's define a function to filter the first row per day:
def first_value_per_day(df):
    res_df = df.groupby(df.index.date).apply(lambda x: x.iloc[[0]])
    res_df.index = res_df.index.droplevel(0)
    return res_df
and print the results:
print(first_value_per_day(df))
Value
2018-01-05 18:31:00 85
2018-01-06 18:25:00 40
2018-01-07 19:54:00 52
2018-01-08 18:23:00 46
2018-01-09 18:08:00 51
2018-02-05 18:58:00 6
2018-02-06 19:12:00 16
2018-02-07 18:18:00 10
2018-02-08 18:32:00 50
2018-02-09 18:38:00 69
2018-03-05 19:54:00 100
2018-03-06 18:37:00 70
2018-03-07 18:58:00 26
2018-03-08 18:28:00 30
2018-03-09 18:34:00 71
2018-04-05 18:54:00 2
2018-04-06 19:16:00 100
2018-04-07 18:52:00 85
2018-04-08 19:08:00 66
2018-04-09 18:11:00 22
So, now we need a list of the specific months that should be processed, in this case 2 and 3. We then use the function defined above to get the first entry of each day, keep only the days in each selected month, and loop over those entries to find the index of every row from that first entry up to 1 hour later and drop them:
MONTHS_TO_MODIFY = [2,3]
HOURS_TO_DROP = 1
fvpd = first_value_per_day(df)
for m in MONTHS_TO_MODIFY:
    fvpdm = fvpd[fvpd.index.month == m]
    for idx, value in fvpdm.iterrows():
        start_dt = idx
        end_dt = idx + pd.Timedelta(hours=HOURS_TO_DROP)
        index_list = df[(df.index >= start_dt) & (df.index < end_dt)].index.tolist()
        df.drop(index_list, inplace=True)
result:
print(first_value_per_day(df))
Value
2018-01-05 18:31:00 85
2018-01-06 18:25:00 40
2018-01-07 19:54:00 52
2018-01-08 18:23:00 46
2018-01-09 18:08:00 51
2018-02-05 19:58:00 1
2018-02-06 20:12:00 42
2018-02-07 19:18:00 34
2018-02-08 19:32:00 34
2018-02-09 19:38:00 61
2018-03-05 20:54:00 15
2018-03-06 19:37:00 88
2018-03-07 19:58:00 36
2018-03-08 19:28:00 38
2018-03-09 19:34:00 42
2018-04-05 18:54:00 2
2018-04-06 19:16:00 100
2018-04-07 18:52:00 85
2018-04-08 19:08:00 66
2018-04-09 18:11:00 22
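The same filtering can also be sketched without the nested loops, by building one boolean mask. This is an alternative to the approach above, not part of the original answer. The function name drop_first_hours is hypothetical, and it assumes a DatetimeIndex in which each day's recording stays within one calendar date.

```python
import pandas as pd

def drop_first_hours(df, months, hours=1):
    """Drop the first `hours` of each day's data, but only in the given months."""
    ts = df.index.to_series()
    # First timestamp of each calendar day, broadcast back to every row of that day.
    day_start = ts.groupby(df.index.date).transform("min")
    in_window = ((ts - day_start) < pd.Timedelta(hours=hours)).to_numpy()
    in_month = df.index.month.isin(months)
    return df[~(in_window & in_month)]

# Hypothetical sample: one January day and one February day, 3 hours each at 1-minute steps.
jan = pd.date_range("2018-01-05 18:00", periods=180, freq="min")
feb = pd.date_range("2018-02-05 18:30", periods=180, freq="min")
df = pd.DataFrame({"Value": range(360)}, index=jan.append(feb))

cleaned = drop_first_hours(df, months=[2])
print(cleaned.groupby(cleaned.index.date).size())
```

Because the mask is computed once for the whole frame, this avoids calling drop repeatedly, which may matter for a full year of per-second data.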

Select specific days data for each month in a dataframe

I have a dataframe with daily data, for over 3 years.
I would like to construct another dataframe containing the data from the last 5 days of each month.
The rows of the 'date' column would be in this case (for the new constructed dataframe) :
2013-01-27
2013-01-28
2013-01-29
2013-01-30
2013-01-31
2013-02-23
2013-02-25
2013-02-26
2013-02-27
2013-02-28
Could someone tell me how I could manage that ?
Many thanks !
One way to do this is to use dt.day and dt.days_in_month with boolean indexing:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Date':pd.date_range('2010-01-01','2013-12-31',freq='D'),
                   'Value':np.random.rand(1461)})
# keep rows whose day of month falls within the last 5 calendar days
df_out = df[df['Date'].dt.day > df['Date'].dt.days_in_month-5]
print(df_out.head(20))
Output:
Date Value
26 2010-01-27 0.097695
27 2010-01-28 0.236572
28 2010-01-29 0.910922
29 2010-01-30 0.777657
30 2010-01-31 0.943031
54 2010-02-24 0.217144
55 2010-02-25 0.970090
56 2010-02-26 0.658967
57 2010-02-27 0.189376
58 2010-02-28 0.229299
85 2010-03-27 0.986992
86 2010-03-28 0.980633
87 2010-03-29 0.258102
88 2010-03-30 0.827310
89 2010-03-31 0.813219
115 2010-04-26 0.135519
116 2010-04-27 0.263941
117 2010-04-28 0.120624
118 2010-04-29 0.993652
119 2010-04-30 0.901466
Assuming that your column is named Date and the rows are sorted by date, you can take the last five rows of each month/year group:
df.groupby([df.Date.dt.month,df.Date.dt.year]).apply(lambda x: x[-5:]).reset_index(drop=True).sort_values('Date')
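The two answers differ when days are missing, as in the question's example where 2013-02-24 is absent. A small sketch of the difference, using made-up values and a groupby(...).tail(5) variant in place of the apply call:

```python
import pandas as pd

# February 2013 with the 24th missing, as in the question's expected output.
dates = pd.to_datetime([
    "2013-02-20", "2013-02-21", "2013-02-22", "2013-02-23",
    "2013-02-25", "2013-02-26", "2013-02-27", "2013-02-28",
])
df = pd.DataFrame({"Date": dates, "Value": range(len(dates))})

# Calendar rule: rows that fall within the last 5 calendar days of the month.
calendar_last5 = df[df["Date"].dt.day > df["Date"].dt.days_in_month - 5]

# Row rule: the last 5 available rows of each month, whatever their dates are.
df_sorted = df.sort_values("Date")
rows_last5 = df_sorted.groupby(df_sorted["Date"].dt.to_period("M")).tail(5)

print(calendar_last5["Date"].dt.day.tolist())
print(rows_last5["Date"].dt.day.tolist())
```

The row rule reproduces the dates listed in the question (23, 25, 26, 27, 28), while the calendar rule returns only four rows for that month.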

How to map a function in pandas which compares each record in a column to previous and next records

I have a time series of water levels for which I need to calculate monthly and annual statistics in relation to several arbitrary flood stages. Specifically, I need to determine the duration per month that the water exceeded flood stage, as well as the number of times these excursions occurred. Additionally, because of the noise associated with the dataloggers, I need to exclude floods that lasted less than 1 hour as well as floods with less than 1 hour between events.
Mock up data:
start = datetime.datetime(2014,9,5,12,00)
daterange = pd.date_range(start, periods = 10000, freq = '30min', name = "Datetime")
data = np.random.random_sample((len(daterange), 3)) * 10
columns = ["Pond_A", "Pond_B", "Pond_C"]
df = pd.DataFrame(data = data, index = daterange, columns = columns)
flood_stages = [('Stage_1', 4.0), ('Stage_2', 6.0)]
My desired output is:
Pond_A_Stage_1_duration Pond_A_Stage_1_events \
2014-09-30 12:00:00 35.5 2
2014-10-31 12:00:00 40.5 31
2014-11-30 12:00:00 100 16
2014-12-31 12:00:00 36 12
etc. for the duration and events at each flood stage, at each reservoir.
I've tried grouping by month, iterating through the ponds and then iterating through each row like:
grouper = pd.TimeGrouper(freq = "1MS")
month_groups = df.groupby(grouper)
for name, group in month_groups:
    flood_stage_a = group.sum()[1]
    flood_stage_b = group.sum()[2]
    inundation_a = False
    inundation_30_a = False
    inundation_hour_a = False
    change_inundation_a = 0
    for level in group.values:
        if level[1]:
            inundation_a = True
        else:
            inundation_a = False
        if inundation_hour_a == False and inundation_a == True and inundation_30_a == True:
            change_inundation_a += 1
        inundation_hour_a = inundation_30_a
        inundation_30_a = inundation_a
But this is a caveman solution and the heuristics are getting messy since I don't want to count a new event if a flood started in one month and continued into the next. This also doesn't combine events with less than one hour between their start and end. Is there a better way to compare a record to its previous and next records?
My other thought is to create new columns with the series shifted t+1, t+2, t-1, t-2, so I can evaluate each row once, but this still seems inefficient. Is there a smarter way to do this by mapping a function?
Let me give a quick, partial answer since no one has answered yet, and maybe someone else can do something better later on if this does not suffice for you.
You can compute the time spent above flood stage pretty easily. I divided by 48 (the number of 30-minute samples per day) so the units are days.
df[ df > 4 ].groupby(pd.Grouper( freq = "1MS" )).count() / 48
Pond_A Pond_B Pond_C
Datetime
2014-09-01 15.375000 15.437500 14.895833
2014-10-01 18.895833 18.187500 18.645833
2014-11-01 17.937500 17.979167 18.666667
2014-12-01 18.104167 18.354167 18.958333
2015-01-01 18.791667 18.645833 18.708333
2015-02-01 16.583333 17.208333 16.895833
2015-03-01 18.458333 18.458333 18.458333
2015-04-01 0.458333 0.520833 0.500000
Counting distinct events is a little harder, but something like this will get you most of the way. (Note that this produces an unrealistically high number of flooding events, but that's just because of how the sample data is set up and not reflective of a typical pond, though I'm not an expert on pond flooding!)
for c in df.columns:
    df[c+'_events'] = ((df[c] > 4) & (df[c].shift() <= 4))
df.iloc[:,-3:].groupby(pd.Grouper( freq = "1MS" )).sum()
Pond_A_events Pond_B_events Pond_C_events
Datetime
2014-09-01 306 291 298
2014-10-01 381 343 373
2014-11-01 350 346 357
2014-12-01 359 352 361
2015-01-01 355 335 352
2015-02-01 292 337 316
2015-03-01 344 360 386
2015-04-01 9 10 9
A couple things to note. First, an event can span months and this method will group it with the month where the event began. Second, I'm ignoring the duration of the event here, but you can adjust that however you want. For example, if you want to say the event doesn't start unless there are 2 consecutive periods below flood level followed by 2 consecutive periods above flood level, just change the relevant line above to:
df[c+'_events'] = ((df[c] > 4) & (df[c].shift(1) <= 4) &
(df[c].shift(-1) > 4) & (df[c].shift(2) <= 4))
That produces a pretty dramatic reduction in the count of distinct events:
Pond_A_events Pond_B_events Pond_C_events
Datetime
2014-09-01 70 71 72
2014-10-01 91 85 81
2014-11-01 87 75 91
2014-12-01 88 87 77
2015-01-01 91 95 94
2015-02-01 79 90 83
2015-03-01 83 78 85
2015-04-01 0 2 2
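The two noise rules from the question (ignore floods shorter than 1 hour, merge floods separated by less than 1 hour) can be sketched with run lengths instead of stacked shifts. This is an illustration under stated assumptions, not the answer's method: the helper names are hypothetical, the data is sampled every 30 minutes so 1 hour equals 2 samples, and short gaps are merged before short floods are dropped. An event is attributed to the month in which it starts, while its duration is counted in whichever months it covers.

```python
import pandas as pd

def _runs(flag):
    """Return (run id, run length) for every sample of a non-empty boolean Series."""
    run_id = flag.astype(int).diff().ne(0).cumsum()
    run_len = run_id.map(run_id.value_counts())
    return run_id, run_len

def clean_floods(above, min_samples=2):
    """Fill dry gaps shorter than min_samples, then drop floods shorter than min_samples."""
    run_id, run_len = _runs(above)
    # Only gaps with a flood on both sides are filled, so the first and last runs are excluded.
    interior = (run_id > run_id.iloc[0]) & (run_id < run_id.iloc[-1])
    filled = above | (~above & interior & (run_len < min_samples))
    _, filled_len = _runs(filled)
    return filled & (filled_len >= min_samples)

def monthly_flood_stats(level, stage, sample_hours=0.5, min_samples=2):
    flooded = clean_floods(level > stage, min_samples)
    # An event starts where a flooded sample follows a non-flooded one.
    starts = flooded & ~flooded.shift(1, fill_value=False).astype(bool)
    month = flooded.index.to_period("M")
    return pd.DataFrame({
        "duration_hours": flooded.groupby(month).sum() * sample_hours,
        "events": starts.groupby(month).sum(),
    })

# Hypothetical 30-minute series: 1.0 is below and 5.0 is above a stage of 4.0.
idx = pd.date_range("2014-09-05 12:00", periods=12, freq="30min", name="Datetime")
level = pd.Series([1, 5, 1, 5, 5, 1, 1, 5, 1, 1, 5, 5], index=idx, dtype=float)
stats = monthly_flood_stats(level, stage=4.0)
print(stats)
```

In this sample the single dry reading between the first two floods is merged, the isolated 30-minute flood is dropped, and two events remain. Applying the function per pond and per flood stage would give the columns in the desired output.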
