How to handle columns mixed with datetime.time and timestamp - python

I would like a dataframe in which every column is in the datetime.time format. My original dataframe looks like this:
Moorabbin Mordialloc Aspendale Edithvale Chelsea
0 04:48:00 05:00:00 05:05:00 05:10:00 05:15:00
1 06:45:00 06:57:00 07:02:00 07:07:00 07:12:00
2 1900-01-01 00:48:00 NaN 1900-01-01 01:03:00 1900-01-01 01:08:00 1900-01-01 01:13:00
3 05:09:00 NaN NaN 05:36:00 05:41:00
What I would like to get is
Moorabbin Mordialloc Aspendale Edithvale Chelsea
0 04:48:00 05:00:00 05:05:00 05:10:00 05:15:00
1 06:45:00 06:57:00 07:02:00 07:07:00 07:12:00
2 00:48:00 NaN 01:03:00 01:08:00 01:13:00
3 05:09:00 NaN NaN 05:36:00 05:41:00
The datatypes of those values are
> type(test_result.iloc[0,0])
datetime.time
> type(test_result.iloc[2,0])
pandas._libs.tslibs.timestamps.Timestamp
I tried to_datetime(format= "%H:%M:%S", error = "coerce"), datetime.strptime(test_result['Moorabbin'],"%H:%M:%S").time() and test_result.astype('datetime64[ns]', copy=True, errors='ignore'), but nothing worked. Could anyone please help?

One approach is as follows.
Make sure the column dtype is 'object' and that the cells are strings (the .str accessor returns NaN for non-string cells); you can convert to datetime after you have reduced each value to just its time part.
Then do df_new = df.apply(lambda x: x.str.split(' ').str[-1], axis=1)
Input
Moorabbin Mordialloc Aspendale Edithvale Chelsea
0 4:48:00 5:00:00 5:05:00 5:10:00 5:15:00
1 6:45:00 6:57:00 7:02:00 7:07:00 7:12:00
2 1/1/1900 0:48:00 NaN 1/1/1900 1:03:00 1/1/1900 1:08:00 1/1/1900 1:13:00
3 5:09:00 NaN NaN 5:36:00 5:41:00
output (df_new)
Moorabbin Mordialloc Aspendale Edithvale Chelsea
0 4:48:00 5:00:00 5:05:00 5:10:00 5:15:00
1 6:45:00 6:57:00 7:02:00 7:07:00 7:12:00
2 0:48:00 NaN 1:03:00 1:08:00 1:13:00
3 5:09:00 NaN NaN 5:36:00 5:41:00
Note: the result has dtype object, not datetime, but you can convert these columns to datetime with pd.to_datetime.
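The string split above only applies when the cells are strings. If the columns really hold a mix of datetime.time and pandas Timestamp objects, as the type() output in the question suggests, one possible sketch is to map every cell through a small helper. The helper name to_time and the two-row sample frame are assumptions made for illustration; they are not from the original post.

```python
import datetime as dt

import numpy as np
import pandas as pd


def to_time(value):
    """Return a datetime.time for Timestamp/datetime cells; leave others as-is."""
    # pd.Timestamp subclasses datetime.datetime, so one check covers both.
    if isinstance(value, dt.datetime):
        return value.time()
    return value


# Hypothetical sample mirroring the mixed columns from the question.
test_result = pd.DataFrame(
    {
        "Moorabbin": [dt.time(4, 48), pd.Timestamp("1900-01-01 00:48:00")],
        "Mordialloc": [dt.time(5, 0), np.nan],
    }
)

# Series.map applies the helper cell by cell; NaN cells pass through unchanged.
cleaned = test_result.apply(lambda col: col.map(to_time))
print(cleaned)
```

Existing datetime.time values and NaN are returned untouched, so only the Timestamp cells change.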

Related

From hours to String

I have this df:
Index Dates
0 2017-01-01 23:30:00
1 2017-01-12 22:30:00
2 2017-01-20 13:35:00
3 2017-01-21 14:25:00
4 2017-01-28 22:30:00
5 2017-08-01 13:00:00
6 2017-09-26 09:39:00
7 2017-10-08 06:40:00
8 2017-10-04 07:30:00
9 2017-12-13 07:40:00
10 2017-12-31 14:55:00
The goal was to create a new df that says "morning" for rows whose time falls between 5:00 and 11:59. To achieve this I converted those hours to booleans:
hour_morning=(pd.to_datetime(df['Dates']).dt.strftime('%H:%M:%S').between('05:00:00','11:59:00'))
and then passed them to a list with "morning" str
text_morning=[str('morning') for x in hour_morning if x==True]
The problem is in the last line: it only returns 'morning' string values, as if the x ignored the if condition. Why is this happening and how do I fix it?
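A trailing if in a list comprehension filters the elements: the False values are dropped, and every remaining (True) element produces 'morning'. To get one label per row, put a conditional expression before the for: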
text_morning=[str('morning') if x==True else 'not_morning' for x in hour_morning ]
You can also use np.where:
text_morning = np.where(hour_morning, 'morning', 'not morning')
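To make the difference concrete, here is a small sketch comparing the three forms on a three-row sample. The sample dates are taken from the question; the variable names filtered, labelled and labelled_np are only illustrative.

```python
import numpy as np
import pandas as pd

dates = pd.Series(
    ["2017-01-01 23:30:00", "2017-09-26 09:39:00", "2017-10-08 06:40:00"]
)
hour_morning = (
    pd.to_datetime(dates).dt.strftime("%H:%M:%S").between("05:00:00", "11:59:00")
)

# Trailing `if` filters: the False row is dropped, so only "morning" remains.
filtered = ["morning" for x in hour_morning if x]

# A conditional expression before `for` keeps one label per row.
labelled = ["morning" if x else "not morning" for x in hour_morning]

# Vectorised equivalent of the conditional expression.
labelled_np = np.where(hour_morning, "morning", "not morning")

print(filtered)     # ['morning', 'morning']
print(labelled)     # ['not morning', 'morning', 'morning']
```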
Given:
Dates values
0 2017-01-01 23:30:00 0
1 2017-01-12 22:30:00 1
2 2017-01-20 13:35:00 2
3 2017-01-21 14:25:00 3
4 2017-01-28 22:30:00 4
5 2017-08-01 13:00:00 5
6 2017-09-26 09:39:00 6
7 2017-10-08 06:40:00 7
8 2017-10-04 07:30:00 8
9 2017-12-13 07:40:00 9
10 2017-12-31 14:55:00 10
Doing:
# df.Dates = pd.to_datetime(df.Dates)
df = df.set_index("Dates")
Now we can use pd.DataFrame.between_time:
new_df = df.between_time('05:00:00','11:59:00')
print(new_df)
Output:
values
Dates
2017-09-26 09:39:00 6
2017-10-08 06:40:00 7
2017-10-04 07:30:00 8
2017-12-13 07:40:00 9
Or use it to update the original dataframe:
df.loc[df.between_time('05:00:00','11:59:00').index, 'morning'] = 'morning'
# Output:
values morning
Dates
2017-01-01 23:30:00 0 NaN
2017-01-12 22:30:00 1 NaN
2017-01-20 13:35:00 2 NaN
2017-01-21 14:25:00 3 NaN
2017-01-28 22:30:00 4 NaN
2017-08-01 13:00:00 5 NaN
2017-09-26 09:39:00 6 morning
2017-10-08 06:40:00 7 morning
2017-10-04 07:30:00 8 morning
2017-12-13 07:40:00 9 morning
2017-12-31 14:55:00 10 NaN

How to find occurrence of consecutive events in python timeseries data frame?

I have got a time series of meteorological observations with date and value columns:
df = pd.DataFrame({'date':['11/10/2017 0:00','11/10/2017 03:00','11/10/2017 06:00','11/10/2017 09:00','11/10/2017 12:00',
'11/11/2017 0:00','11/11/2017 03:00','11/11/2017 06:00','11/11/2017 09:00','11/11/2017 12:00',
'11/12/2017 00:00','11/12/2017 03:00','11/12/2017 06:00','11/12/2017 09:00','11/12/2017 12:00'],
'value':[850,np.nan,np.nan,np.nan,np.nan,500,650,780,np.nan,800,350,690,780,np.nan,np.nan],
'consecutive_hour': [ 3,0,0,0,0,3,6,9,0,3,3,6,9,0,0]})
With this DataFrame, I want a third column, consecutive_hour, such that if the value at a particular timestamp is less than 1000, the row gets 3 (hours), and consecutive such occurrences accumulate to 6, 9 and so on, as shown above.
Lastly, I want to summarize the table counting consecutive hours occurrence and number of days such that the summary table looks like:
df_summary = pd.DataFrame({'consecutive_hours':[3,6,9,12],
'number_of_day':[2,0,2,0]})
I tried several online solutions and methods like shift(), diff() etc., as mentioned in: How to groupby consecutive values in pandas DataFrame
and more, and spent several days on it, but no luck yet.
I would highly appreciate help on this issue.
Thanks!
Input data:
>>> df
date value
0 2017-11-10 00:00:00 850.0
1 2017-11-10 03:00:00 NaN
2 2017-11-10 06:00:00 NaN
3 2017-11-10 09:00:00 NaN
4 2017-11-10 12:00:00 NaN
5 2017-11-11 00:00:00 500.0
6 2017-11-11 03:00:00 650.0
7 2017-11-11 06:00:00 780.0
8 2017-11-11 09:00:00 NaN
9 2017-11-11 12:00:00 800.0
10 2017-11-12 00:00:00 350.0
11 2017-11-12 03:00:00 690.0
12 2017-11-12 06:00:00 780.0
13 2017-11-12 09:00:00 NaN
14 2017-11-12 12:00:00 NaN
The cumcount_reset function is adapted from this answer by @jezrael:
Python pandas cumsum with reset everytime there is a 0
# Assumes df["date"] is already datetime64, e.g. df["date"] = pd.to_datetime(df["date"])
cumcount_reset = \
lambda b: b.cumsum().sub(b.cumsum().where(~b).ffill().fillna(0)).astype(int)
df["consecutive_hour"] = (df.set_index("date")["value"] < 1000) \
.groupby(pd.Grouper(freq="D")) \
.apply(lambda b: cumcount_reset(b)).mul(3) \
.reset_index(drop=True)
Output result:
>>> df
date value consecutive_hour
0 2017-11-10 00:00:00 850.0 3
1 2017-11-10 03:00:00 NaN 0
2 2017-11-10 06:00:00 NaN 0
3 2017-11-10 09:00:00 NaN 0
4 2017-11-10 12:00:00 NaN 0
5 2017-11-11 00:00:00 500.0 3
6 2017-11-11 03:00:00 650.0 6
7 2017-11-11 06:00:00 780.0 9
8 2017-11-11 09:00:00 NaN 0
9 2017-11-11 12:00:00 800.0 3
10 2017-11-12 00:00:00 350.0 3
11 2017-11-12 03:00:00 690.0 6
12 2017-11-12 06:00:00 780.0 9
13 2017-11-12 09:00:00 NaN 0
14 2017-11-12 12:00:00 NaN 0
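To see what cumcount_reset does on its own, the sketch below rewrites the lambda as a named function with intermediate steps and runs it on a hypothetical boolean Series (the sample flags are not from the original data). The running total is reduced by the total recorded at the most recent False, so the count restarts after every gap.

```python
import pandas as pd


def cumcount_reset(b):
    """Count consecutive True values, restarting at 0 after each False."""
    running = b.cumsum()
    # Running total at the most recent False row, carried forward.
    baseline = running.where(~b).ffill().fillna(0)
    return running.sub(baseline).astype(int)


flags = pd.Series([True, True, False, True, True, True, False])
streak = cumcount_reset(flags)
print(streak.tolist())  # [1, 2, 0, 1, 2, 3, 0]
```

Multiplying the result by 3, as in the answer, converts the streak length into hours for 3-hourly observations.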
Summary table
df_summary = df.loc[df.groupby(pd.Grouper(key="date", freq="D"))["consecutive_hour"] \
.apply(lambda h: (h - h.shift(-1).fillna(0)) > 0),
"consecutive_hour"] \
.value_counts().reindex([3, 6, 9, 12], fill_value=0) \
.rename("number_of_day") \
.rename_axis("consecutive_hour") \
.reset_index()
>>> df_summary
consecutive_hour number_of_day
0 3 2
1 6 0
2 9 2
3 12 0

Finding maximum null values in stretch and generating flag

I have a dataframe with a datetime index and two columns. For a particular date, I have to find the longest stretch of consecutive null values in column 'X' and replace it with zero in both columns for that date. In addition, I have to create a third column named 'Flag' that carries 1 for every row where zeros were imputed in the other two columns and 0 otherwise. In the example below, the longest stretch of nulls on January 1st is 3 rows, so I have to replace those with zero. Similarly, I have to replicate the process for January 2nd.
Below is my sample data:
Datetime X Y
01-01-2018 00:00 1 1
01-01-2018 00:05 nan 2
01-01-2018 00:10 2 nan
01-01-2018 00:15 3 4
01-01-2018 00:20 2 2
01-01-2018 00:25 nan 1
01-01-2018 00:30 nan nan
01-01-2018 00:35 nan nan
01-01-2018 00:40 4 4
02-01-2018 00:00 nan nan
02-01-2018 00:05 2 3
02-01-2018 00:10 2 2
02-01-2018 00:15 2 5
02-01-2018 00:20 2 2
02-01-2018 00:25 nan nan
02-01-2018 00:30 nan 1
02-01-2018 00:35 3 nan
02-01-2018 00:40 nan nan
Below is the result that I am expecting:
Datetime X Y Flag
01-01-2018 00:00 1 1 0
01-01-2018 00:05 nan 2 0
01-01-2018 00:10 2 nan 0
01-01-2018 00:15 3 4 0
01-01-2018 00:20 2 2 0
01-01-2018 00:25 0 0 1
01-01-2018 00:30 0 0 1
01-01-2018 00:35 0 0 1
01-01-2018 00:40 4 4 0
02-01-2018 00:00 nan nan 0
02-01-2018 00:05 2 3 0
02-01-2018 00:10 2 2 0
02-01-2018 00:15 2 5 0
02-01-2018 00:20 2 2 0
02-01-2018 00:25 nan nan 0
02-01-2018 00:30 nan 1 0
02-01-2018 00:35 3 nan 0
02-01-2018 00:40 nan nan 0
This question is an extension of a previous question: Python - Find maximum null values in stretch and replacing with 0
First, label each run of consecutive nulls in each column with a unique group ID (Y's IDs are multiplied by len(df2) so that they do not collide with X's):
df1 = df.isna()
df2 = df1.ne(df1.groupby(df1.index.date).shift()).cumsum().where(df1)
df2['Y'] *= len(df2)
print (df2)
X Y
Datetime
2018-01-01 00:00:00 NaN NaN
2018-01-01 00:05:00 2.0 NaN
2018-01-01 00:10:00 NaN 36.0
2018-01-01 00:15:00 NaN NaN
2018-01-01 00:20:00 NaN NaN
2018-01-01 00:25:00 4.0 NaN
2018-01-01 00:30:00 4.0 72.0
2018-01-01 00:35:00 4.0 72.0
2018-01-01 00:40:00 NaN NaN
2018-02-01 00:00:00 6.0 108.0
2018-02-01 00:05:00 NaN NaN
2018-02-01 00:10:00 NaN NaN
2018-02-01 00:15:00 NaN NaN
2018-02-01 00:20:00 NaN NaN
2018-02-01 00:25:00 8.0 144.0
2018-02-01 00:30:00 8.0 NaN
2018-02-01 00:35:00 NaN 180.0
2018-02-01 00:40:00 10.0 180.0
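The labelling step relies on a common idiom: compare each flag with the previous row and take the cumulative sum of the changes. The sketch below shows it on a single hypothetical column covering one day, so the per-date groupby from the answer is left out; the sample values and variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

x = pd.Series([1, np.nan, 2, np.nan, np.nan, np.nan, 4])
is_na = x.isna()

# A new run starts wherever the flag differs from the previous row.
run_id = is_na.ne(is_na.shift()).cumsum()

# Keep run IDs only on null rows; non-null rows become NaN.
null_runs = run_id.where(is_na)

# value_counts sorts by frequency, so the first index is the longest null run.
longest = null_runs.value_counts().index[0]
mask = null_runs.eq(longest)

print(null_runs.tolist())  # [nan, 2.0, nan, 4.0, 4.0, 4.0, nan]
print(mask.tolist())       # [False, False, False, True, True, True, False]
```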
Then get groups with maximum count - here group 4:
a = df2.stack().value_counts().index[0]
print (a)
4.0
Build a mask of the rows that belong to that group, set those rows to 0, and cast the mask to integer to map True/False to 1/0 for the Flag column:
mask = df2.eq(a).any(axis=1)
df.loc[mask,:] = 0
df['Flag'] = mask.astype(int)
print (df)
X Y Flag
Datetime
2018-01-01 00:00:00 1.0 1.0 0
2018-01-01 00:05:00 NaN 2.0 0
2018-01-01 00:10:00 2.0 NaN 0
2018-01-01 00:15:00 3.0 4.0 0
2018-01-01 00:20:00 2.0 2.0 0
2018-01-01 00:25:00 0.0 0.0 1
2018-01-01 00:30:00 0.0 0.0 1
2018-01-01 00:35:00 0.0 0.0 1
2018-01-01 00:40:00 4.0 4.0 0
2018-02-01 00:00:00 NaN NaN 0
2018-02-01 00:05:00 2.0 3.0 0
2018-02-01 00:10:00 2.0 2.0 0
2018-02-01 00:15:00 2.0 5.0 0
2018-02-01 00:20:00 2.0 2.0 0
2018-02-01 00:25:00 NaN NaN 0
2018-02-01 00:30:00 NaN 1.0 0
2018-02-01 00:35:00 3.0 NaN 0
2018-02-01 00:40:00 NaN NaN 0
EDIT:
Added a new condition to match only dates from a list:
dates = df.index.floor('d')
filtered = ['2018-01-01','2019-01-01']
m = dates.isin(filtered)
df1 = df.isna() & m[:, None]
df2 = df1.ne(df1.groupby(dates).shift()).cumsum().where(df1)
df2['Y'] *= len(df2)
print (df2)
X Y
Datetime
2018-01-01 00:00:00 NaN NaN
2018-01-01 00:05:00 2.0 NaN
2018-01-01 00:10:00 NaN 36.0
2018-01-01 00:15:00 NaN NaN
2018-01-01 00:20:00 NaN NaN
2018-01-01 00:25:00 4.0 NaN
2018-01-01 00:30:00 4.0 72.0
2018-01-01 00:35:00 4.0 72.0
2018-01-01 00:40:00 NaN NaN
2018-02-01 00:00:00 NaN NaN
2018-02-01 00:05:00 NaN NaN
2018-02-01 00:10:00 NaN NaN
2018-02-01 00:15:00 NaN NaN
2018-02-01 00:20:00 NaN NaN
2018-02-01 00:25:00 NaN NaN
2018-02-01 00:30:00 NaN NaN
2018-02-01 00:35:00 NaN NaN
2018-02-01 00:40:00 NaN NaN
a = df2.stack().value_counts().index[0]
#solution working also if no NaNs per filtered rows (prevent IndexError: index 0 is out of bounds)
#a = next(iter(df2.stack().value_counts().index), -1)
mask = df2.eq(a).any(axis=1)
df.loc[mask,:] = 0
df['Flag'] = mask.astype(int)
print (df)
X Y Flag
Datetime
2018-01-01 00:00:00 1.0 1.0 0
2018-01-01 00:05:00 NaN 2.0 0
2018-01-01 00:10:00 2.0 NaN 0
2018-01-01 00:15:00 3.0 4.0 0
2018-01-01 00:20:00 2.0 2.0 0
2018-01-01 00:25:00 0.0 0.0 1
2018-01-01 00:30:00 0.0 0.0 1
2018-01-01 00:35:00 0.0 0.0 1
2018-01-01 00:40:00 4.0 4.0 0
2018-02-01 00:00:00 NaN NaN 0
2018-02-01 00:05:00 2.0 3.0 0
2018-02-01 00:10:00 2.0 2.0 0
2018-02-01 00:15:00 2.0 5.0 0
2018-02-01 00:20:00 2.0 2.0 0
2018-02-01 00:25:00 NaN NaN 0
2018-02-01 00:30:00 NaN 1.0 0
2018-02-01 00:35:00 3.0 NaN 0

Converting datetime64 to numeric

How can I convert timestamp column of dataframe to numeric value? The datatype of the below Time column in below dataframe 'df' is 'datetime64'.
Time Count
2018-05-15 00:00:00 4
2018-05-15 00:15:00 1
2018-05-15 00:30:00 5
2018-05-15 00:45:00 6
2018-05-15 01:15:00 3
2018-05-15 01:30:00 4
2018-05-15 02:30:00 5
2018-05-15 02:45:00 3
2018-05-15 03:15:00 2
2018-05-15 03:30:00 5
Use pd.to_numeric, which here returns the timestamps as nanoseconds since the Unix epoch:
pd.to_numeric(df.Time)
Out[218]:
0 1526342400000000000
1 1526343300000000000
2 1526344200000000000
3 1526345100000000000
4 1526346900000000000
5 1526347800000000000
6 1526351400000000000
7 1526352300000000000
8 1526354100000000000
9 1526355000000000000
Name: Time, dtype: int64
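If seconds are more convenient than nanoseconds, one option is to subtract the epoch and floor-divide by a one-second Timedelta, which does not depend on the stored resolution of the column. This is a sketch on the first two rows of the question's data; the column name seconds is an assumption.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Time": pd.to_datetime(["2018-05-15 00:00:00", "2018-05-15 00:15:00"]),
        "Count": [4, 1],
    }
)

# Elapsed time since the Unix epoch, expressed as whole seconds.
epoch = pd.Timestamp("1970-01-01")
df["seconds"] = (df["Time"] - epoch) // pd.Timedelta("1s")

print(df["seconds"].tolist())  # [1526342400, 1526343300]
```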

How to resample a df with datetime index to exactly n equally sized periods?

I've got a large dataframe with a datetime index and need to resample data to exactly 10 equally sized periods.
So far, I've tried finding the first and last dates to determine the total number of days in the data, dividing that by 10 to determine the size of each period, then resampling using that number of days, e.g.:
first = df.reset_index().timesubmit.min()
last = df.reset_index().timesubmit.max()
periodsize = str((last-first).days/10) + 'D'
df.resample(periodsize,how='sum')
This doesn't guarantee exactly 10 periods in the df after resampling since the periodsize is a rounded down int. Using a float doesn't work in the resampling. Seems that either there's something simple that I'm missing here, or I'm attacking the problem all wrong.
import numpy as np
import pandas as pd
n = 10
nrows = 33
index = pd.date_range('2000-1-1', periods=nrows, freq='D')
df = pd.DataFrame(np.ones(nrows), index=index)
print(df)
# 0
# 2000-01-01 1
# 2000-01-02 1
# ...
# 2000-02-01 1
# 2000-02-02 1
first = df.index.min()
last = df.index.max() + pd.Timedelta('1D')
secs = int((last-first).total_seconds()//n)
periodsize = '{:d}s'.format(secs)
result = df.resample(periodsize).sum()  # the how= argument was removed from resample
print('\n{}'.format(result))
assert len(result) == n
yields
0
2000-01-01 00:00:00 4
2000-01-04 07:12:00 3
2000-01-07 14:24:00 3
2000-01-10 21:36:00 4
2000-01-14 04:48:00 3
2000-01-17 12:00:00 3
2000-01-20 19:12:00 4
2000-01-24 02:24:00 3
2000-01-27 09:36:00 3
2000-01-30 16:48:00 3
The values in the 0-column indicate the number of rows that were aggregated, since the original DataFrame was filled with values of 1. The pattern of 4's and 3's is about as even as you can get since 33 rows can not be evenly grouped into 10 groups.
Explanation: Consider this simpler DataFrame:
n = 2
nrows = 5
index = pd.date_range('2000-1-1', periods=nrows, freq='D')
df = pd.DataFrame(np.ones(nrows), index=index)
# 0
# 2000-01-01 1
# 2000-01-02 1
# 2000-01-03 1
# 2000-01-04 1
# 2000-01-05 1
Using df.resample('2D').sum() gives the wrong number of groups
In [366]: df.resample('2D').sum()
Out[366]:
0
2000-01-01 2
2000-01-03 2
2000-01-05 1
Using df.resample('3D').sum() gives the right number of groups, but the
second group starts at 2000-01-04, which does not evenly divide the DataFrame
into two equally-spaced groups:
In [367]: df.resample('3D').sum()
Out[367]:
0
2000-01-01 3
2000-01-04 2
To do better, we need to work at a finer time resolution than in days. Since Timedeltas have a total_seconds method, let's work in seconds. So for the example above, the desired frequency string would be
In [374]: df.resample('216000s').sum()
Out[374]:
0
2000-01-01 00:00:00 3
2000-01-03 12:00:00 2
since there are 216000*2 seconds in 5 days:
In [373]: (pd.Timedelta(days=5) / pd.Timedelta('1S'))/2
Out[373]: 216000.0
Okay, so now all we need is a way to generalize this. We'll want the minimum and maximum dates in the index:
first = df.index.min()
last = df.index.max() + pd.Timedelta('1D')
We add an extra day because it makes the difference in days come out right. In
the example above, there are only 4 days between the Timestamps for 2000-01-05
and 2000-01-01,
In [377]: (pd.Timestamp('2000-01-05')-pd.Timestamp('2000-01-01')).days
Out[378]: 4
But as we can see in the worked example, the DataFrame has 5 rows representing 5
days. So it makes sense that we need to add an extra day.
Now we can compute the correct number of seconds in each equally-spaced group with:
secs = int((last-first).total_seconds()//n)
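Putting these steps together, the recipe can be wrapped in a small helper. This is a sketch under stated assumptions: the function name resample_into_n and its step parameter (the spacing of the original rows, one day here) are not from the original answer, and a Timedelta is passed to resample instead of a frequency string.

```python
import numpy as np
import pandas as pd


def resample_into_n(df, n, step=pd.Timedelta("1D")):
    """Sum df into n equally long periods spanning its DatetimeIndex."""
    first = df.index.min()
    # Add one step so the span covers the last row's own period.
    last = df.index.max() + step
    secs = int((last - first).total_seconds() // n)
    return df.resample(pd.Timedelta(seconds=secs)).sum()


index = pd.date_range("2000-01-01", periods=33, freq="D")
df = pd.DataFrame(np.ones(33), index=index)

result = resample_into_n(df, 10)
print(result)
```

The bin edges line up with the first row here because the index starts at midnight; for data starting at another time of day, the bin anchoring used by resample may need to be checked.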
Here is one way to ensure equal-size sub-periods by using np.linspace() on pd.Timedelta and then classifying each obs into different bins using pd.cut.
import pandas as pd
import numpy as np
# generate artificial data
np.random.seed(0)
df = pd.DataFrame(np.random.randn(100, 2), columns=['A', 'B'], index=pd.date_range('2015-01-01 00:00:00', periods=100, freq='8H'))
Out[87]:
A B
2015-01-01 00:00:00 1.7641 0.4002
2015-01-01 08:00:00 0.9787 2.2409
2015-01-01 16:00:00 1.8676 -0.9773
2015-01-02 00:00:00 0.9501 -0.1514
2015-01-02 08:00:00 -0.1032 0.4106
2015-01-02 16:00:00 0.1440 1.4543
2015-01-03 00:00:00 0.7610 0.1217
2015-01-03 08:00:00 0.4439 0.3337
2015-01-03 16:00:00 1.4941 -0.2052
2015-01-04 00:00:00 0.3131 -0.8541
2015-01-04 08:00:00 -2.5530 0.6536
2015-01-04 16:00:00 0.8644 -0.7422
2015-01-05 00:00:00 2.2698 -1.4544
2015-01-05 08:00:00 0.0458 -0.1872
2015-01-05 16:00:00 1.5328 1.4694
... ... ...
2015-01-29 08:00:00 0.9209 0.3187
2015-01-29 16:00:00 0.8568 -0.6510
2015-01-30 00:00:00 -1.0342 0.6816
2015-01-30 08:00:00 -0.8034 -0.6895
2015-01-30 16:00:00 -0.4555 0.0175
2015-01-31 00:00:00 -0.3540 -1.3750
2015-01-31 08:00:00 -0.6436 -2.2234
2015-01-31 16:00:00 0.6252 -1.6021
2015-02-01 00:00:00 -1.1044 0.0522
2015-02-01 08:00:00 -0.7396 1.5430
2015-02-01 16:00:00 -1.2929 0.2671
2015-02-02 00:00:00 -0.0393 -1.1681
2015-02-02 08:00:00 0.5233 -0.1715
2015-02-02 16:00:00 0.7718 0.8235
2015-02-03 00:00:00 2.1632 1.3365
[100 rows x 2 columns]
# cutoff points, 10 equal-size group requires 11 points
# measured by timedelta 1 hour
time_delta_in_hours = (df.index - df.index[0]) / pd.Timedelta('1h')
n = 10
ts_cutoff = np.linspace(0, time_delta_in_hours[-1], n+1)
# labels, time index
time_index = df.index[0] + np.array([pd.Timedelta(str(time_delta)+'h') for time_delta in ts_cutoff])
# create a categorical reference variable
df['start_time_index'] = pd.cut(time_delta_in_hours, bins=10, labels=time_index[:-1])
# for clarity, reassign labels using end-period index
df['end_time_index'] = pd.cut(time_delta_in_hours, bins=10, labels=time_index[1:])
Out[89]:
A B start_time_index end_time_index
2015-01-01 00:00:00 1.7641 0.4002 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-01 08:00:00 0.9787 2.2409 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-01 16:00:00 1.8676 -0.9773 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-02 00:00:00 0.9501 -0.1514 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-02 08:00:00 -0.1032 0.4106 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-02 16:00:00 0.1440 1.4543 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-03 00:00:00 0.7610 0.1217 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-03 08:00:00 0.4439 0.3337 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-03 16:00:00 1.4941 -0.2052 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-04 00:00:00 0.3131 -0.8541 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-04 08:00:00 -2.5530 0.6536 2015-01-04 07:12:00 2015-01-07 14:24:00
2015-01-04 16:00:00 0.8644 -0.7422 2015-01-04 07:12:00 2015-01-07 14:24:00
2015-01-05 00:00:00 2.2698 -1.4544 2015-01-04 07:12:00 2015-01-07 14:24:00
2015-01-05 08:00:00 0.0458 -0.1872 2015-01-04 07:12:00 2015-01-07 14:24:00
2015-01-05 16:00:00 1.5328 1.4694 2015-01-04 07:12:00 2015-01-07 14:24:00
... ... ... ... ...
2015-01-29 08:00:00 0.9209 0.3187 2015-01-27 09:36:00 2015-01-30 16:48:00
2015-01-29 16:00:00 0.8568 -0.6510 2015-01-27 09:36:00 2015-01-30 16:48:00
2015-01-30 00:00:00 -1.0342 0.6816 2015-01-27 09:36:00 2015-01-30 16:48:00
2015-01-30 08:00:00 -0.8034 -0.6895 2015-01-27 09:36:00 2015-01-30 16:48:00
2015-01-30 16:00:00 -0.4555 0.0175 2015-01-27 09:36:00 2015-01-30 16:48:00
2015-01-31 00:00:00 -0.3540 -1.3750 2015-01-30 16:48:00 2015-02-03 00:00:00
2015-01-31 08:00:00 -0.6436 -2.2234 2015-01-30 16:48:00 2015-02-03 00:00:00
2015-01-31 16:00:00 0.6252 -1.6021 2015-01-30 16:48:00 2015-02-03 00:00:00
2015-02-01 00:00:00 -1.1044 0.0522 2015-01-30 16:48:00 2015-02-03 00:00:00
2015-02-01 08:00:00 -0.7396 1.5430 2015-01-30 16:48:00 2015-02-03 00:00:00
2015-02-01 16:00:00 -1.2929 0.2671 2015-01-30 16:48:00 2015-02-03 00:00:00
2015-02-02 00:00:00 -0.0393 -1.1681 2015-01-30 16:48:00 2015-02-03 00:00:00
2015-02-02 08:00:00 0.5233 -0.1715 2015-01-30 16:48:00 2015-02-03 00:00:00
2015-02-02 16:00:00 0.7718 0.8235 2015-01-30 16:48:00 2015-02-03 00:00:00
2015-02-03 00:00:00 2.1632 1.3365 2015-01-30 16:48:00 2015-02-03 00:00:00
[100 rows x 4 columns]
df.groupby('start_time_index').agg('sum')
Out[90]:
A B
start_time_index
2015-01-01 00:00:00 8.6133 2.7734
2015-01-04 07:12:00 1.9220 -0.8069
2015-01-07 14:24:00 -8.1334 0.2318
2015-01-10 21:36:00 -2.7572 -4.2862
2015-01-14 04:48:00 1.1957 7.2285
2015-01-17 12:00:00 3.2485 6.6841
2015-01-20 19:12:00 -0.8903 2.2802
2015-01-24 02:24:00 -2.1025 1.3800
2015-01-27 09:36:00 -1.1017 1.3108
2015-01-30 16:48:00 -0.0902 -2.5178
Another, potentially shorter, way to do this is to pass the period length to resample as a time delta. The problem, as shown below, is that it delivers 11 sub-samples instead of 10. I believe the reason is that resample uses a left-inclusive/right-exclusive (or left-exclusive/right-inclusive) binning scheme, so the very last observation at '2015-02-03 00:00:00' ends up in a separate group. If we use pd.cut to do it ourselves, we can specify include_lowest=True so that it gives us exactly 10 sub-samples rather than 11.
n = 10
time_delta_str = str((df.index[-1] - df.index[0]) / (pd.Timedelta('1s') * n)) + 's'
df.resample(pd.Timedelta(time_delta_str)).sum()
Out[114]:
A B
2015-01-01 00:00:00 8.6133 2.7734
2015-01-04 07:12:00 1.9220 -0.8069
2015-01-07 14:24:00 -8.1334 0.2318
2015-01-10 21:36:00 -2.7572 -4.2862
2015-01-14 04:48:00 1.1957 7.2285
2015-01-17 12:00:00 3.2485 6.6841
2015-01-20 19:12:00 -0.8903 2.2802
2015-01-24 02:24:00 -2.1025 1.3800
2015-01-27 09:36:00 -1.1017 1.3108
2015-01-30 16:48:00 -2.2534 -3.8543
2015-02-03 00:00:00 2.1632 1.3365
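For comparison, the pd.cut route with explicit bin edges and include_lowest=True can be sketched as follows. It uses a frame of ones with the same 8-hourly index rather than the random data above, so that the group sizes are easy to read; the variable names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

index = pd.date_range("2015-01-01", periods=100, freq=pd.Timedelta(hours=8))
df = pd.DataFrame({"A": np.ones(100)}, index=index)

n = 10
# Hours elapsed since the first observation, as a plain float array.
hours = np.asarray((df.index - df.index[0]) / pd.Timedelta(hours=1))

# n equal-width bins need n + 1 edges spanning the full range.
edges = np.linspace(0, hours[-1], n + 1)

# Bins are right-inclusive; include_lowest=True also keeps the first row.
bin_id = pd.cut(hours, bins=edges, include_lowest=True, labels=False)

counts = df.groupby(bin_id)["A"].sum()
print(counts)
```

Because both the first and the last observation fall inside the outermost bins, this yields exactly 10 groups of 10 rows each.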
