Create Multiple DataFrames using Rolling Window from DataFrame Timestamps - python

I have one year's worth of time-series data at four-minute intervals. I need to load 24 hours of data at a time and run a function on that dataframe, stepping forward in eight-hour intervals. I need to repeat this process for all the data between 2021's start and end dates.
For example:
Load year_df containing ranges between 2021-01-01 00:00:00 and 2021-01-01 23:56:00 and run a function on this.
Load year_df containing ranges between 2021-01-01 08:00:00 and 2021-01-02 07:56:00 and run a function on this.
Load year_df containing ranges between 2021-01-01 16:00:00 and 2021-01-02 15:56:00 and run a function on this.
# Proxy DataFrame
import pandas as pd

start = pd.to_datetime('2021-01-01 00:00:00')
end = pd.to_datetime('2021-12-31 23:56:00')
myIndex = pd.date_range(start, end, freq='4T')
year_df = pd.DataFrame({'Timestamp': myIndex})
year_df.head()
Timestamp
0 2021-01-01 00:00:00
1 2021-01-01 00:04:00
2 2021-01-01 00:08:00
3 2021-01-01 00:12:00
4 2021-01-01 00:16:00

This approach avoids explicit for loops, but the apply method is essentially a for loop under the hood, so it is not especially efficient. Until more functionality based on rolling datetime windows is introduced to pandas, though, this might be the only option.
The example uses the mean of the timestamps; knowing exactly what function you want to apply may help with a better answer.
s = pd.Series(myIndex, index=myIndex)

def myfunc(e):
    temp = s[s.between(e, e + pd.Timedelta("24h"))]
    return temp.mean()

s.apply(myfunc)
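If you specifically need 24-hour windows whose start times step forward every eight hours, as described in the question, here is a minimal sketch (assuming year_df is built as in the proxy above, and a hypothetical my_function that accepts one 24-hour DataFrame) that generates the window starts explicitly:
# window starts every 8 hours across 2021; each window spans the following 24 hours
starts = pd.date_range('2021-01-01 00:00:00', '2021-12-31 23:56:00', freq='8H')

results = []
for t0 in starts:
    window = year_df[(year_df['Timestamp'] >= t0) &
                     (year_df['Timestamp'] < t0 + pd.Timedelta('24h'))]
    if not window.empty:
        results.append(my_function(window))  # my_function is a placeholder for your own function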

Related

Convert a column to a specific time format which contains different types of time formats in python

This is my data frame
df = pd.DataFrame({
    'Time': ['10:00PM', '15:45:00', '13:40:00AM', '5:00']
})
Time
0 10:00PM
1 15:45:00
2 13:40:00AM
3 5:00
I need to convert the times to a specific format; my expected output is given below.
Time
0 22:00:00
1 15:45:00
2 01:40:00
3 05:00:00
I tried using the str split and endswith functions, which makes for a complicated solution. Is there any better way to achieve this?
Thanks in advance!
Here you go. One thing to mention, though: 13:40:00AM will result in an error, since 13 is a) the wrong format, as AM/PM hours only run from 1 to 12, and b) 13 (which would be PM) cannot at the same time be AM :)
Cheers
import pandas as pd
df = pd.DataFrame({'Time': ['10:00PM', '15:45:00', '01:40:00AM', '5:00']})
df['Time'] = pd.to_datetime(df['Time'])
print(df['Time'].dt.time)
<<< 22:00:00
<<< 15:45:00
<<< 01:40:00
<<< 05:00:00
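If you also want the values back as zero-padded HH:MM:SS strings (as in the expected output), here is a minimal sketch, assuming the corrected input and a pandas version that accepts format='mixed' (pandas 2.0+; older versions infer mixed formats without it):
import pandas as pd

df = pd.DataFrame({'Time': ['10:00PM', '15:45:00', '01:40:00AM', '5:00']})

# parse the mixed formats, then format back to HH:MM:SS strings
df['Time'] = pd.to_datetime(df['Time'], format='mixed').dt.strftime('%H:%M:%S')
print(df)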

Most pythonic way to perform sub-section analysis on a data frame multiple times?

I have a large time-series dataframe at one-second intervals, and I need to perform some analysis on every grouping of 0-59 seconds.
The result is then used as a feature in a different time-series dataframe, keyed by the minute floor of the sub-section.
I feel like I'm missing something basic, but I am unsure of the wording to get it right.
ex:
timestamp,close
2021-06-01 00:00:00,37282.0
2021-06-01 00:00:01,37282.0
2021-06-01 00:00:02,37285.0
2021-06-01 00:00:03,37283.0
2021-06-01 00:00:04,37281.0
2021-06-01 00:00:05,37278.0
2021-06-01 00:00:06,37275.0
2021-06-01 00:00:07,37263.0
2021-06-01 00:00:08,37264.0
2021-06-01 00:00:09,37259.0
...
2021-06-01 00:00:59,37260.0
2021-06-01 00:01:00,37261.0 --> new analysis starts here
2021-06-01 00:01:01,37262.0
# and repeat
My current implementation works, but I have a feeling it is a really bad way of doing it.
df['last_update_hxro'] = df.apply(lambda x: 1 if x.timestamp.second == 59 else 0, axis=1)
df['hxro_close'] = df[df['last_update_hxro'] == 1].close
df['next_hxro_close'] = df['hxro_close'].shift(-60)
df['hxro_result'] = df[df['last_update_hxro'] == 1].apply(lambda x: 1 if x.next_hxro_close > x.hxro_close else 0, axis=1)
df['trade_number'] = df.last_update_hxro.cumsum() - df.last_update_hxro

unique_trades = df.trade_number.unique()
for x in unique_trades:
    temp_df = btc_df[btc_df['trade_number'] == x]
    new_df = generate_sub_min_features(temp_df)
    feature_df = feature_df.append(new_df)

def generate_sub_min_features(full_df):
    # do stuff here and return a series of length 1, with the minute floor of the subsection as the key
    ...
First set timestamp as the index:
df['timestamp'] = pd.to_datetime(df['timestamp'], format="%Y-%m-%d %H:%M:%S")
df = df.set_index('timestamp')
Then group by hour and save each group's dataframe in a list:
groups = [g for n, g in df.groupby(pd.Grouper(freq='H'))]
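Since the question actually groups each run of 0-59 seconds (i.e. each minute), the same idea at minute resolution can feed a feature function directly. A minimal sketch, assuming generate_sub_min_features from the question returns one row per group and that df is indexed by timestamp as above:
# group by the minute floor and apply the feature function to each group,
# replacing the explicit loop over trade_number
feature_df = df.groupby(pd.Grouper(freq='T')).apply(generate_sub_min_features)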
IMO, any pandas solution which includes a loop could be done better.
I feel I may be a little lost on your ask without a basic example, but it sounds like you're looking for a simple resample or groupby?
Example
Here's a sample using a df loaded with the distinct seconds from 1/1/21 - 1/2/21, paired with random integers from 0 - 10. We'll take the average of every minute.
import pandas as pd
import numpy as np
df_time = pd.DataFrame(
    {'val': np.random.randint(0, 10, size=(60 * 60 * 24) + 1)},
    index=pd.date_range('1/1/2021', '1/2/2021', freq='S')
)
df_mean = df_time.resample('T').apply(lambda x: x.mean())
print(df_mean)
Returns...
val
2021-01-01 00:00:00 4.566667
2021-01-01 00:01:00 5.000000
2021-01-01 00:02:00 4.316667
2021-01-01 00:03:00 4.800000
2021-01-01 00:04:00 4.533333
... ...
2021-01-01 23:56:00 4.916667
2021-01-01 23:57:00 4.450000
2021-01-01 23:58:00 4.883333
2021-01-01 23:59:00 4.316667
2021-01-02 00:00:00 2.000000
Notes
Note the use of T here as the offset alias for minutes; read more about Offset Aliases. Also note the use of the resample() method, since our timeseries also acts as the index. groupby() would have been equally valid here, with a slightly different approach, had our datetime information not been the index.
Customization
In application, one would replace the lambda with whatever function you'd like to apply to each distinct group of rows that share a truncated datetime minute.
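For example, here is a minimal sketch replacing the lambda with a hypothetical per-minute function (the spread between each minute's max and min), reusing df_time from above:
# a custom per-minute feature: the spread between the minute's max and min
def minute_spread(x):
    return x.max() - x.min()

df_spread = df_time.resample('T').apply(minute_spread)
print(df_spread.head())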

Python Datetime resample results suddenly in NaN Values

I have tried to resample my values to hourly. However, I had changed the date format in the csv file because months and days with low numbers were being swapped automatically (2003-04-01 suddenly became 2003-01-04). The date format now looks fine (when showing the csv file in Python), but after resampling the values appear as NaN.
df = pd.read_csv(r'C:\Users\water_level.csv',parse_dates=[0],index_col=0,decimal=",", delimiter=';')
hour_avg = df_2.resample('H').mean()
Sample of my data:
Raw data with time as index
Afterwards: even though time is a datetime, 99% of the data shows as NaN values (one value per day is shown)
Data with NaN values after resampling per hour
When I resampled to daily values, all values came back, so it seems there is a problem with the time.
When I use the format at the beginning, the error "The format doesn't fit" comes up.
I tried a different way before (not sure what was different) and resample worked per hour.
What do I need to change to be able to resample per hour again?
Can you share a sample of your data? Assuming that your data consists of a DateTime feature (i.e. yyyy-mm-dd hh:mm:ss) and some other features that you are trying to resample by hour, NaN values can occur for two reasons: incorrect parsing by pandas or missing hour values in the data.
(1) It is possible that pandas is not reading your dates correctly. Once you read the file, make sure the date column is in the right format (i.e. yyyy-mm-dd).
df = pd.read_csv(r'C:\Users\water_level.csv',parse_dates=[0],index_col=0,decimal=",", delimiter=';')
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d %H:%M:%S')
(2) If there are any gaps in your data, NaN values will pop up. For instance, assume the data is of this form:
2000-01-01 00:00:00 1
2000-01-01 00:01:00 1
2000-01-01 00:03:00 1
2000-01-01 00:04:00 1
2000-01-01 00:06:00 1
If you resample this minute-level data to its own frequency, e.g. df_2.resample('T').mean() (the same thing happens with 'H' when whole hours are missing), your output will look like:
2000-01-01 00:00:00 1
2000-01-01 00:01:00 1
2000-01-01 00:02:00 NaN
2000-01-01 00:03:00 1
2000-01-01 00:04:00 1
2000-01-01 00:05:00 NaN
2000-01-01 00:06:00 1
I suspect the problem is the latter. If it is the latter, you can simply remove the NaN values using df_2.dropna(). Otherwise, if you do need the hourly bins regardless of missing data, you can avoid the NaN values by padding the missing values first and then attempting to get the mean:
hour_pad = df_2.resample('H').ffill()   # ffill() (formerly pad()) forward-fills the gaps
hour_avg = hour_pad.resample('H').mean()
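As a minimal, self-contained sketch of the gap behaviour from point (2), using made-up two-hourly data so that every other hourly bin is empty:
import pandas as pd

# values only every two hours, so hourly resampling leaves empty bins in between
idx = pd.date_range('2000-01-01', periods=4, freq='2H')
df_2 = pd.DataFrame({'level': [1.0, 2.0, 3.0, 4.0]}, index=idx)

hour_avg = df_2.resample('H').mean()   # the in-between hours come out as NaN
print(hour_avg)
print(hour_avg.dropna())               # drop them, or fill first: df_2.resample('H').ffill()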

Calculate the sum between the fixed time range using Pandas

My dataset looks like this:
time Open
2017-01-01 00:00:00 1.219690
2017-01-01 01:00:00 1.688490
2017-01-01 02:00:00 1.015285
2017-01-01 03:00:00 1.357672
2017-01-01 04:00:00 1.293786
2017-01-01 05:00:00 1.040048
2017-01-01 06:00:00 1.225080
2017-01-01 07:00:00 1.145402
...., ....
2017-12-31 23:00:00 1.145402
I want to find the sum within a specified time range and save it to a new dataframe.
Let's say I want to find the sum between 2017-01-01 22:00:00 and 2017-01-02 04:00:00: that is the sum over the 6 hours spanning 2 days. I want to sum the data in a time range such as 10 PM to 4 AM the next day and put it in a different data frame, for example df_timerange_sum. Please note that we are summing across 2 different dates.
What did I do?
I used sum() on the time range like this: df[~df['time'].dt.hour.between(10, 4)].sum(), but it gives me the sum of the whole df, not of the time range I specified.
I also tried resample, but I cannot find a way to make it time-specific.
df['time'].dt.hour.between(10, 4) is always False, because no number can be at least 10 and at most 4 at the same time. What you want is to mark between(4, 21) and then negate that to get the other hours.
Here's what I would do:
# mark rows between 4AM and 10PM;
# the data we want is where s == False, i.e. ~s
s = df['time'].dt.hour.between(4, 21)

# s.cumsum() labels each consecutive block of False rows,
# which is what we will sum over
blocks = s.cumsum()

# again we only care about ~s
(df[~s].groupby(blocks[~s], as_index=False)  # we don't need the blocks as index
       .agg({'time': 'min', 'Open': 'sum'})  # time: min -- the beginning of each block
)                                            # Open: sum -- the sum of Open
Output for random data:
time Open
0 2017-01-01 00:00:00 1.282701
1 2017-01-01 22:00:00 2.766324
2 2017-01-02 22:00:00 2.838216
3 2017-01-03 22:00:00 4.151461
4 2017-01-04 22:00:00 2.151626
5 2017-01-05 22:00:00 2.525190
6 2017-01-06 22:00:00 0.798234
An alternative (in my opinion more straightforward) approach that accomplishes the same thing. There are definitely ways to reduce the code, but I am also relatively new to pandas.
df.set_index(['time'], inplace=True)  # make time the index col (not 100% necessary)

# new df that stores your desired output, plus start and end times if you need them
df2 = pd.DataFrame(columns=['start_time', 'end_time', 'sum_Open'])
df2['start_time'] = df[df.index.hour == 22].index  # gets/stores all start datetimes
df2['end_time'] = df[df.index.hour == 4].index     # gets/stores all end datetimes

for i, row in df2.iterrows():
    # set_value() has been removed from pandas; use .loc to assign the sum for each window
    df2.loc[i, 'sum_Open'] = df[(df.index >= row['start_time']) &
                                (df.index <= row['end_time'])]['Open'].sum()
You'd have to add an if statement or something to handle the last day, which ends at 11 PM.

Most efficient way to break up a dataframe using multiple DateTimeIndexes

I have a dataframe which contains prices for a security each minute over a long period of time.
I would like to extract a subset of the prices, 1 per day between certain hours.
Here is an example of brute-forcing it (using hourly for brevity):
import datetime
import numpy
import pandas

dates = pandas.date_range('20180101', '20180103', freq='H')
prices = pandas.DataFrame(index=dates,
                          data=numpy.random.rand(len(dates)),
                          columns=['price'])
I now have a DateTimeIndex for the hours within each day I want to extract:
start = datetime.datetime(2018,1,1,8)
end = datetime.datetime(2018,1,1,17)
day1 = pandas.date_range(start, end, freq='H')
start = datetime.datetime(2018,1,2,9)
end = datetime.datetime(2018,1,2,13)
day2 = pandas.date_range(start, end, freq='H')
days = [ day1, day2 ]
I can then use prices.index.isin with each of my DateTimeIndexes to extract the relevant day's prices:
daily_prices = [ prices[prices.index.isin(d)] for d in days]
This works as expected:
daily_prices[0]
daily_prices[1]
The problem is that as the length of each selection DateTimeIndex increases, and the number of days I want to extract increases, my list-comprehension slows down to a crawl.
Since I know each selection DateTimeIndex is fully inclusive of the hours it encompasses, I tried using loc and the first and last element of each index in my list comprehension:
daily_prices = [ prices.loc[d[0]:d[-1]] for d in days]
Whilst a bit faster, it is still exceptionally slow when the number of days is very large.
Is there a more efficient way to divide up a dataframe into begin and end time ranges like above?
If the hours are consistent from day to day as it seems like they might be, you can just filter the index, which should be pretty fast:
In [5]: prices.loc[prices.index.hour.isin(range(8,18))]
Out[5]:
price
2018-01-01 08:00:00 0.638051
2018-01-01 09:00:00 0.059258
2018-01-01 10:00:00 0.869144
2018-01-01 11:00:00 0.443970
2018-01-01 12:00:00 0.725146
2018-01-01 13:00:00 0.309600
2018-01-01 14:00:00 0.520718
2018-01-01 15:00:00 0.976284
2018-01-01 16:00:00 0.973313
2018-01-01 17:00:00 0.158488
2018-01-02 08:00:00 0.053680
2018-01-02 09:00:00 0.280477
2018-01-02 10:00:00 0.802826
2018-01-02 11:00:00 0.379837
2018-01-02 12:00:00 0.247583
....
EDIT: To your comment, working directly on the index and then doing a single lookup at the end will still probably be fastest even if it's not always consistent from day to day. Single day frames at the end will be easy with a groupby.
For example:
df = prices.loc[[i for i in prices.index if (i.hour in range(8, 18) and i.day in range(1,10)) or (i.hour in range(2,4) and i.day in range(11,32))]]
framelist = [frame for _, frame in df.groupby(df.index.date)]
will give you a list of dataframes with 1 day per list element, and will include 8:00-17:00 for days 1-9 of each month and 2:00-3:00 for days 11-31.
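If the window is the same clock time every day, pandas' built-in between_time selects by time of day directly on a DatetimeIndex. A minimal sketch, reusing the prices frame from the question:
# select rows whose time of day falls between 08:00 and 17:00 inclusive
subset = prices.between_time('08:00', '17:00')

# one frame per calendar day, as in the groupby approach above
framelist = [frame for _, frame in subset.groupby(subset.index.date)]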
