I need to prepare my data to feed into an LSTM for predicting the next day.
My dataset is a time series at one-second resolution, but there are only 3-5 hours of data per day. (This is the only dataset I have, so I can't change it.)
Each row has a date-time and a certain value.
E.g.:
datetime             Value
2015-03-15 12:00:00   1000
2015-03-15 12:00:01     10
...
I would like to write code that extracts e.g. 4 hours per day and, for specific months only, deletes the first extracted hour (because that data is faulty).
I have already managed to write code that extracts e.g. 2 hours for the x data (input) and y data (output).
I hope that explains my problem.
The dataset covers one year at one-second resolution, 6pm-11pm each day; the rest is missing.
In e.g. August-November the first hour is faulty data and needs to be deleted.
init = True
for day in np.unique(x_df.index.date):
    temp = x_df.loc[(day + pd.DateOffset(hours=18)):(day + pd.DateOffset(hours=20))]
    if len(temp) == 7201:
        if init:
            x_df1 = np.array([temp.values])
            init = False
        else:
            #print(temp.values.shape)
            x_df1 = np.append(x_df1, np.array([temp.values]), axis=0)
    #else:
    #    if not temp.empty:
    #        print(temp.index[0].date(), len(temp))

x_df1 = np.array(x_df1)
print('X-Shape:', x_df1.shape,
      'Y-Shape:', y_df1.shape)
#samples, timesteps and features for the LSTM
X-Shape: (32, 7201, 6) Y-Shape: (32, 7201)
My expected result is a dataset of e.g. 4 hours per day, where the first hour is deleted in e.g. August, September, and October.
I would also be very happy if someone could show me a nicer way to write this.
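For reference, a hedged sketch of what such an extraction could look like, assuming x_df has a one-second DatetimeIndex covering the 6pm-11pm span; FAULTY_MONTHS is an assumed stand-in for the months whose first hour must go:

import numpy as np
import pandas as pd

FAULTY_MONTHS = {8, 9, 10, 11}   # assumption: months with a faulty first hour
SECONDS_PER_HOUR = 3600

windows = []
for day, day_df in x_df.groupby(x_df.index.normalize()):
    window = day_df.between_time('18:00', '22:00')    # 4-hour window, ends inclusive
    if len(window) == 4 * SECONDS_PER_HOUR + 1:       # keep complete days only
        if day.month in FAULTY_MONTHS:
            window = window.iloc[SECONDS_PER_HOUR:]   # drop the faulty first hour
        windows.append(window.values)

# windows from faulty months are one hour shorter, so stack each length
# group separately (or pad them) before feeding the LSTM
x_full = np.stack([w for w in windows if len(w) == 4 * SECONDS_PER_HOUR + 1])
x_trimmed = np.stack([w for w in windows if len(w) == 3 * SECONDS_PER_HOUR + 1])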
Probably not the most efficient solution, but maybe it still fits.
First, let's generate some random data for the first 4 months, with 5 days per month:
import datetime
import random
import pandas as pd

df = pd.DataFrame()
for month in range(1, 5):      # first 4 months
    for day in range(5, 10):   # 5 days per month
        hour = random.randint(18, 19)
        minute = random.randint(1, 59)
        dt = datetime.datetime(2018, month, day, hour, minute, 0)
        dti = pd.date_range(dt, periods=60 * 60 * 4, freq='S')   # 4 hours of 1-second data
        values = [random.randrange(1, 101, 1) for _ in range(len(dti))]
        df = pd.concat([df, pd.DataFrame(values, index=dti, columns=['Value'])])
Now let's define a function that returns the first row of each day:
def first_value_per_day(df):
    # keep only the first row per calendar day
    res_df = df.groupby(df.index.date).apply(lambda x: x.iloc[[0]])
    res_df.index = res_df.index.droplevel(0)
    return res_df
and print the results:
print(first_value_per_day(df))
Value
2018-01-05 18:31:00 85
2018-01-06 18:25:00 40
2018-01-07 19:54:00 52
2018-01-08 18:23:00 46
2018-01-09 18:08:00 51
2018-02-05 18:58:00 6
2018-02-06 19:12:00 16
2018-02-07 18:18:00 10
2018-02-08 18:32:00 50
2018-02-09 18:38:00 69
2018-03-05 19:54:00 100
2018-03-06 18:37:00 70
2018-03-07 18:58:00 26
2018-03-08 18:28:00 30
2018-03-09 18:34:00 71
2018-04-05 18:54:00 2
2018-04-06 19:16:00 100
2018-04-07 18:52:00 85
2018-04-08 19:08:00 66
2018-04-09 18:11:00 22
So now we need a list of the specific months that should be processed, in this case 2 and 3. We use the function defined above to get the first entry per day, filter it to each selected month, and loop over those entries to find the indexes of all values from the first entry up to one hour later, and drop them:
MONTHS_TO_MODIFY = [2, 3]
HOURS_TO_DROP = 1

fvpd = first_value_per_day(df)
for m in MONTHS_TO_MODIFY:
    fvpdm = fvpd[fvpd.index.month == m]
    for idx, value in fvpdm.iterrows():
        start_dt = idx
        end_dt = idx + datetime.timedelta(hours=HOURS_TO_DROP)
        index_list = df[(df.index >= start_dt) & (df.index < end_dt)].index.tolist()
        df.drop(index_list, inplace=True)
result:
print(first_value_per_day(df))
Value
2018-01-05 18:31:00 85
2018-01-06 18:25:00 40
2018-01-07 19:54:00 52
2018-01-08 18:23:00 46
2018-01-09 18:08:00 51
2018-02-05 19:58:00 1
2018-02-06 20:12:00 42
2018-02-07 19:18:00 34
2018-02-08 19:32:00 34
2018-02-09 19:38:00 61
2018-03-05 20:54:00 15
2018-03-06 19:37:00 88
2018-03-07 19:58:00 36
2018-03-08 19:28:00 38
2018-03-09 19:34:00 42
2018-04-05 18:54:00 2
2018-04-06 19:16:00 100
2018-04-07 18:52:00 85
2018-04-08 19:08:00 66
2018-04-09 18:11:00 22
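For what it's worth, here is a hedged, more vectorized alternative to the nested loops above (same df, MONTHS_TO_MODIFY and HOURS_TO_DROP): compute each day's first timestamp once, then drop everything within the first hour using a single boolean mask:

ts = df.index.to_series()
first_per_day = ts.groupby(df.index.date).transform('min')   # each day's first timestamp
cutoff = first_per_day + pd.Timedelta(hours=HOURS_TO_DROP)
mask = (ts < cutoff) & df.index.month.isin(MONTHS_TO_MODIFY)
df = df[~mask]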
From the following DataFrame:
worktime = 1440
person = [11,22,33,44,55]
begin_date = '2019-10-01'
shift= [1,2,3,1,2]
pause = [90,0,85,70,0]
occu = [60,0,40,20,0]
time_u = [50,40,80,20,0]
time_a = [84.5,0.0,10.5,47.7,0.0]
time_p = 0
time_q = [35.9,69.1,0.0,0.0,84.4]
df = pd.DataFrame({'date': pd.date_range(begin_date, periods=len(person)),
                   'person': person, 'shift': shift, 'worktime': worktime,
                   'pause': pause, 'occu': occu, 'time_u': time_u,
                   'time_a': time_a, 'time_p ': time_p, 'time_q': time_q})
Output:
date person shift worktime pause occu time_u time_a time_p time_q
0 2019-10-01 11 1 1440 90 60 50 84.5 0 35.9
1 2019-10-02 22 2 1440 0 0 40 0.0 0 69.1
2 2019-10-03 33 3 1440 85 40 80 10.5 0 0.0
3 2019-10-04 44 1 1440 70 20 20 47.7 0 0.0
4 2019-10-05 55 2 1440 0 0 0 0.0 0 84.4
I am looking for a suitable function that takes the value already contained in a column, uses it in a calculation, and then overwrites it with the result of that calculation.
It concerns the columns time_u, time_a, time_p and time_q, and it should be applied according to the following principle:
time_u = worktime - pause - occu - (existing value of time_u)
time_a = (new value of time_u) - time_a
time_p = (new value of time_a) - time_p
time_q = (new value of time_p)- time_q
Is there a possible function that could be used here?
Using this formula manually, the output would look like this:
date person shift worktime pause occu time_u time_a time_p time_q
0 2019-10-01 11 1 1440 90 60 1240 1155.5 1155.5 1119.6
1 2019-10-02 22 2 1440 0 0 1400 1400 1400 1330.9
2 2019-10-03 33 3 1440 85 40 1235 1224.5 1224.5 1224.5
3 2019-10-04 44 1 1440 70 20 1330 1282.3 1282.3 1282.3
4 2019-10-05 55 2 1440 0 0 1440 1440 1440 1355.6
Unfortunately, this task is way beyond my skill level, so any help in setting up the appropriate function would be greatly appreciated.
Many thanks in advance
You can simply apply the relationships you supplied, one after the other. Or are you looking for something else? By the way, you put an extra space at the end of 'time_p ', so the snippet below renames that column first.
df = df.rename(columns={'time_p ': 'time_p'})  # fix the stray trailing space first
df['time_u'] = df['worktime'] - df['pause'] - df['occu'] - df['time_u']
df['time_a'] = df['time_u'] - df['time_a']
df['time_p'] = df['time_a'] - df['time_p']
df['time_q'] = df['time_p'] - df['time_q']
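As a quick sanity check against the expected output above:

# row 0 should now read time_u=1240, time_a=1155.5, time_p=1155.5, time_q=1119.6
print(df.loc[0, ['time_u', 'time_a', 'time_p', 'time_q']])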
I have my CSV data saved as a dataframe, and I want to take the values of a row and use them in a function. I'll try to show what I am looking for. I have tried sorting by amounts, but I can't figure out how to separate the data after that step. I am new to Pandas and would appreciate any helpful, problem-relevant feedback.
UPDATE: If you suggest using .apply on the dataframe, could you show me a good way of applying a complex function? The Pandas documentation only shows simple functions, which I don't find useful given the context.
Here is the df
Date Amount
0 12/27/2019 NaN
1 12/27/2019 -14.00
2 12/27/2019 -15.27
3 12/30/2019 -1.00
4 12/30/2019 -35.01
5 12/30/2019 -9.99
6 01/02/2020 -7.57
7 01/03/2020 1225.36
8 01/03/2020 -40.00
9 01/03/2020 -59.90
10 01/03/2020 -9.52
11 01/06/2020 100.00
12 01/06/2020 -6.41
13 01/06/2020 -31.07
14 01/06/2020 -2.50
15 01/06/2020 -7.46
16 01/06/2020 -18.98
17 01/06/2020 -1.25
18 01/06/2020 -2.50
19 01/06/2020 -1.25
20 01/06/2020 -170.94
21 01/06/2020 -150.00
22 01/07/2020 -20.00
23 01/07/2020 -18.19
24 01/07/2020 -4.00
25 01/08/2020 -1.85
26 01/08/2020 -1.10
27 01/09/2020 -21.00
28 01/09/2020 -31.00
29 01/09/2020 -7.13
30 01/10/2020 -10.00
31 01/10/2020 -1.75
32 01/10/2020 -125.00
33 01/13/2020 -10.60
34 01/13/2020 -2.50
35 01/13/2020 -7.00
36 01/13/2020 -46.32
37 01/13/2020 -1.25
38 01/13/2020 -39.04
39 01/13/2020 -9.46
40 01/13/2020 -179.00
41 01/13/2020 -140.00
42 01/15/2020 -150.04
I want to take the amount value from a row, then look for a matching amount value in another row. Once a matching value is found, I want to compute the timedelta between the two matching rows.
Thus far, every conditional statement I have tried gives me an error. Does anyone have any ideas how I might accomplish this task?
Here is a bit of code I have started with.
amount_1 = df.loc[1, 'Amount']
amount_2 = df.loc[2, 'Amount']
print(amount_1, amount_2)
date_1 = df.loc[2, 'Date'] #skipping the first row.
x = 2
x += 1
date_2 = df.loc[x, 'Date']
## Not real code, but the logical flow I am aiming for
if amount_2 == amount_1:
    timed = date_2 - date_1
    print(timed, amount_2)
elif amount_2 != amount_1:
    # go to the next row and check
    pass
You could use something like this:
distinct_values = df["Amount"].unique()             # all distinct values
for value_unique in distinct_values:                # for each distinct value
    temp_df = df.loc[df["Amount"] == value_unique]  # rows with that value
    # You could iterate over that temp_df to do your timedelta operations...
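In case it helps, a hedged sketch of one way to finish that idea: parse the dates, then report the gap between consecutive occurrences of every amount that appears more than once (column names as in the frame shown in the question):

import pandas as pd

df['Date'] = pd.to_datetime(df['Date'])   # strings -> datetimes, so subtraction works

for amount, group in df.groupby('Amount'):
    if len(group) > 1:                    # only amounts that occur more than once
        for gap in group['Date'].diff().dropna():
            print(amount, gap)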
I have a dataframe with daily data, for over 3 years.
I would like to construct another dataframe containing the data from the last 5 days of each month.
The rows of the 'date' column would be in this case (for the new constructed dataframe) :
2013-01-27
2013-01-28
2013-01-29
2013-01-30
2013-01-31
2013-02-23
2013-02-25
2013-02-26
2013-02-27
2013-02-28
Could someone tell me how I could manage that?
Many thanks!
One way to do this is to use dt.day and dt.days_in_month with boolean indexing:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Date': pd.date_range('2010-01-01', '2013-12-31', freq='D'),
                   'Value': np.random.rand(1461)})
df_out = df[df['Date'].dt.day > df['Date'].dt.days_in_month - 5]
print(df_out.head(20))
Output:
Date Value
26 2010-01-27 0.097695
27 2010-01-28 0.236572
28 2010-01-29 0.910922
29 2010-01-30 0.777657
30 2010-01-31 0.943031
54 2010-02-24 0.217144
55 2010-02-25 0.970090
56 2010-02-26 0.658967
57 2010-02-27 0.189376
58 2010-02-28 0.229299
85 2010-03-27 0.986992
86 2010-03-28 0.980633
87 2010-03-29 0.258102
88 2010-03-30 0.827310
89 2010-03-31 0.813219
115 2010-04-26 0.135519
116 2010-04-27 0.263941
117 2010-04-28 0.120624
118 2010-04-29 0.993652
119 2010-04-30 0.901466
Assuming that your column is named Date:
(df.groupby([df.Date.dt.month, df.Date.dt.year])
   .apply(lambda x: x[-5:])
   .reset_index(drop=True)
   .sort_values('Date'))
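An equivalent, slightly shorter spelling, assuming daily rows as in the question, is groupby().tail():

df_out = (df.sort_values('Date')
            .groupby([df.Date.dt.year, df.Date.dt.month])
            .tail(5))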
My data is organized in 10-second intervals for 24 hours:
2015-10-14 15:01:10 3956.58 0 19 6.21 105.99 42 59.24
2015-10-14 15:01:20 3956.58 0 1 0.81 121.57 42 59.24
2015-10-14 15:01:30 3956.58 0 47 8.29 115.53 42 59.24
2015-10-14 15:01:40 3956.58 0 79 12.19 107.64 42 59.24
..
..
..
2015-10-15 13:01:10 3956.58 0 79 8.02 107.64 42 59.24
2015-10-15 13:01:10 3956.58 0 79 7.95 108.98 42 59.24
2015-10-15 13:01:10 3956.58 0 79 7.07 110.58 42 59.24
I want to check if, for any hourly group, there are gaps between consecutive readings that exceed 10 seconds. How do I get the gaps for each group and print them? So far I have the following:
df = pd.read_csv('convertcsv.csv', parse_dates=True, index_col=0,
                 names=['date', 'hole_depth', 'rop', 'rotary',
                        'torque', 'hook_load', 'azimuth', 'inclin'])
df['num_gaps'] = df.groupby(df.index.date)
df.groupby(df.index.time)['num_gaps'].sum()
I want the output to be:
timestamp, num_of_gaps
2015-10-15 06:00, 5
2015-10-15 07:00, 0
...
This is a great answer to get you started. Your case is different, as you would like to first group by hour and then look for differences larger than 10 seconds (which avoids the date-difference problem mentioned in that answer).
So you could try, assuming your DataFrame comes with a DatetimeIndex:
import numpy as np
import pandas as pd

df['tvalue'] = df.index
time_groups = df.groupby(pd.Grouper(freq='H'))   # pd.TimeGrouper is deprecated
for hour, data in time_groups:
    data['delta'] = (data['tvalue'] - data['tvalue'].shift()).fillna(pd.Timedelta(0))
    # convert each delta to seconds, then keep only the gaps above 10 seconds
    data['delta_sec'] = data['delta'] / np.timedelta64(1, 's')
    print(data[data.delta_sec > 10])
Just saw your edit - you could of course also just count the values per hour and check where the count is lower than the 360 rows one would expect at 10-second intervals. In other words,
print(df.groupby(pd.Grouper(freq='H')).size())
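And to get exactly the timestamp, num_of_gaps output you sketched, a hedged variant of the same counting idea - a full hour at 10-second intervals has 360 rows, so the shortfall per hour approximates the number of missing slots:

per_hour = df.groupby(pd.Grouper(freq='H')).size()
num_gaps = (360 - per_hour).clip(lower=0)   # missing 10-second slots per hour
print(num_gaps)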
I am making a heat map that has Company Name on the x axis, months on the y-axis, and shaded regions as the number of calls.
I am taking a slice of data from a database for the past year in order to create the heat map. However, this means that if you hover over the current month, say for example today is July 13, you will get the calls of July 1-13 of this year, and the calls of July 13-31 from last year added together. In the current month, I only want to show calls from July 1-13.
# This section selects the last year of data
# convert strings to datetimes
df['recvd_dttm'] = pd.to_datetime(df['recvd_dttm'])

# only retrieve data before now (ignore typos that are future dates)
mask = df['recvd_dttm'] <= datetime.datetime.now()
df = df.loc[mask]

# get first and last datetime for the final year of data
range_max = df['recvd_dttm'].max()
range_min = range_max - datetime.timedelta(days=365)

# take the slice with the final year of data
df = df[(df['recvd_dttm'] >= range_min) &
        (df['recvd_dttm'] <= range_max)]
You can use pd.tseries.offsets.MonthEnd() to achieve your goal here.
import pandas as pd
import numpy as np
import datetime as dt
np.random.seed(0)
val = np.random.randn(600)
date_rng = pd.date_range('2014-01-01', periods=600, freq='D')
df = pd.DataFrame(dict(dates=date_rng,col=val))
print(df)
col dates
0 1.7641 2014-01-01
1 0.4002 2014-01-02
2 0.9787 2014-01-03
3 2.2409 2014-01-04
4 1.8676 2014-01-05
5 -0.9773 2014-01-06
6 0.9501 2014-01-07
7 -0.1514 2014-01-08
8 -0.1032 2014-01-09
9 0.4106 2014-01-10
.. ... ...
590 0.5433 2015-08-14
591 0.4390 2015-08-15
592 -0.2195 2015-08-16
593 -1.0840 2015-08-17
594 0.3518 2015-08-18
595 0.3792 2015-08-19
596 -0.4700 2015-08-20
597 -0.2167 2015-08-21
598 -0.9302 2015-08-22
599 -0.1786 2015-08-23
[600 rows x 2 columns]
print(df.dates.dtype)
datetime64[ns]
datetime_now = dt.datetime.now()
datetime_now_month_end = datetime_now + pd.tseries.offsets.MonthEnd(1)
print(datetime_now_month_end)
2015-07-31 03:19:18.292739
datetime_start = datetime_now_month_end - pd.tseries.offsets.DateOffset(years=1)
print(datetime_start)
2014-07-31 03:19:18.292739
print(df[(df.dates > datetime_start) & (df.dates < datetime_now)])
col dates
212 0.7863 2014-08-01
213 -0.4664 2014-08-02
214 -0.9444 2014-08-03
215 -0.4100 2014-08-04
216 -0.0170 2014-08-05
217 0.3792 2014-08-06
218 2.2593 2014-08-07
219 -0.0423 2014-08-08
220 -0.9559 2014-08-09
221 -0.3460 2014-08-10
.. ... ...
550 0.1639 2015-07-05
551 0.0963 2015-07-06
552 0.9425 2015-07-07
553 -0.2676 2015-07-08
554 -0.6780 2015-07-09
555 1.2978 2015-07-10
556 -2.3642 2015-07-11
557 0.0203 2015-07-12
558 -1.3479 2015-07-13
559 -0.7616 2015-07-14
[348 rows x 2 columns]
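Applied to your own frame, the same idea would read something like this (a sketch, assuming recvd_dttm has already been converted to datetimes as in your snippet):

now = dt.datetime.now()
range_min = now + pd.tseries.offsets.MonthEnd(1) - pd.tseries.offsets.DateOffset(years=1)
df = df[(df['recvd_dttm'] > range_min) & (df['recvd_dttm'] <= now)]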