I have two datetime columns and I want to calculate the formula below in order to get the mean values.
mean(24*(closed_time - created_time ))
In Excel, I applied the same logic and got the values below:
closed time created date mean(24*(closed_time - created_time ))
5/14/2022 8:35 5/11/2022 1:08 79.45
5/14/2022 8:12 5/13/2022 8:45 23.45
5/14/2022 8:34 5/13/2022 11:47 20.78333333
5/11/2022 11:21 5/9/2022 16:43 42.63333333
5/11/2022 11:30 5/8/2022 19:51 63.65
5/11/2022 11:22 5/6/2022 16:45 114.6166667
5/11/2022 11:25 5/9/2022 19:53 39.53333333
5/11/2022 11:28 5/9/2022 10:52 48.6
Any help would be appreciated!
I'm not sure about the mean, but for the sample data I get the same output by subtracting the columns and converting seconds to hours:
import pandas as pd

cols = ['closed time', 'created date']
df[cols] = df[cols].apply(pd.to_datetime)   # parse both columns as datetimes
df['mean1'] = df['closed time'].sub(df['created date']).dt.total_seconds().div(3600)   # difference in hours
print(df)
closed time created date mean mean1
0 2022-05-14 08:35:00 2022-05-11 01:08:00 79.450000 79.450000
1 2022-05-14 08:12:00 2022-05-13 08:45:00 23.450000 23.450000
2 2022-05-14 08:34:00 2022-05-13 11:47:00 20.783333 20.783333
3 2022-05-11 11:21:00 2022-05-09 16:43:00 42.633333 42.633333
4 2022-05-11 11:30:00 2022-05-08 19:51:00 63.650000 63.650000
5 2022-05-11 11:22:00 2022-05-06 16:45:00 114.616667 114.616667
6 2022-05-11 11:25:00 2022-05-09 19:53:00 39.533333 39.533333
7 2022-05-11 11:28:00 2022-05-09 10:52:00 48.600000 48.600000
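If the goal is the single mean value over all rows (the mean in the original formula), you can take the mean of that column, for example:
print(df['mean1'].mean())   # about 54.09 hours for the sample data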
The mean (midpoint) of the two datetime columns themselves can be computed with:
import numpy as np
df['mean'] = pd.to_datetime(df[['closed time', 'created date']].astype(np.int64).mean(axis=1))
print (df)
closed time created date mean
0 2022-05-14 08:35:00 2022-05-11 01:08:00 2022-05-12 16:51:30
1 2022-05-14 08:12:00 2022-05-13 08:45:00 2022-05-13 20:28:30
2 2022-05-14 08:34:00 2022-05-13 11:47:00 2022-05-13 22:10:30
3 2022-05-11 11:21:00 2022-05-09 16:43:00 2022-05-10 14:02:00
4 2022-05-11 11:30:00 2022-05-08 19:51:00 2022-05-10 03:40:30
5 2022-05-11 11:22:00 2022-05-06 16:45:00 2022-05-09 02:03:30
6 2022-05-11 11:25:00 2022-05-09 19:53:00 2022-05-10 15:39:00
7 2022-05-11 11:28:00 2022-05-09 10:52:00 2022-05-10 11:10:00
I have this df:
Index Dates
0 2017-01-01 23:30:00
1 2017-01-12 22:30:00
2 2017-01-20 13:35:00
3 2017-01-21 14:25:00
4 2017-01-28 22:30:00
5 2017-08-01 13:00:00
6 2017-09-26 09:39:00
7 2017-10-08 06:40:00
8 2017-10-04 07:30:00
9 2017-12-13 07:40:00
10 2017-12-31 14:55:00
The goal was that, for the time range 05:00 to 11:59, a new df would be created with data labeled 'morning'. To achieve this I converted those hours to booleans:
hour_morning=(pd.to_datetime(df['Dates']).dt.strftime('%H:%M:%S').between('05:00:00','11:59:00'))
and then built a list of 'morning' strings from them:
text_morning=[str('morning') for x in hour_morning if x==True]
The error is in the last line: it only returns 'morning' string values, as if the x ignored the if condition. Why is this happening and how do I fix it?
The trailing if in a list comprehension filters items out, so only the True values survive. Use a conditional expression instead:
text_morning = ['morning' if x else 'not_morning' for x in hour_morning]
You can also use np.where:
text_morning = np.where(hour_morning, 'morning', 'not morning')
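If you want the labels attached to the frame rather than kept in a separate list, the same mask can be written to a new column (the column name here is just an example):
df['daypart'] = np.where(hour_morning, 'morning', 'not_morning')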
Given:
Dates values
0 2017-01-01 23:30:00 0
1 2017-01-12 22:30:00 1
2 2017-01-20 13:35:00 2
3 2017-01-21 14:25:00 3
4 2017-01-28 22:30:00 4
5 2017-08-01 13:00:00 5
6 2017-09-26 09:39:00 6
7 2017-10-08 06:40:00 7
8 2017-10-04 07:30:00 8
9 2017-12-13 07:40:00 9
10 2017-12-31 14:55:00 10
Doing:
# df.Dates = pd.to_datetime(df.Dates)
df = df.set_index("Dates")
Now we can use pd.DataFrame.between_time:
new_df = df.between_time('05:00:00','11:59:00')
print(new_df)
Output:
values
Dates
2017-09-26 09:39:00 6
2017-10-08 06:40:00 7
2017-10-04 07:30:00 8
2017-12-13 07:40:00 9
Or use it to update the original dataframe:
df.loc[df.between_time('05:00:00','11:59:00').index, 'morning'] = 'morning'
# Output:
values morning
Dates
2017-01-01 23:30:00 0 NaN
2017-01-12 22:30:00 1 NaN
2017-01-20 13:35:00 2 NaN
2017-01-21 14:25:00 3 NaN
2017-01-28 22:30:00 4 NaN
2017-08-01 13:00:00 5 NaN
2017-09-26 09:39:00 6 morning
2017-10-08 06:40:00 7 morning
2017-10-04 07:30:00 8 morning
2017-12-13 07:40:00 9 morning
2017-12-31 14:55:00 10 NaN
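If the remaining NaN rows should carry a label too, they can be filled afterwards, for example (the fill value is just an illustration):
df['morning'] = df['morning'].fillna('not_morning')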
Data:
df:
ts_code
2018-01-01 A
2018-02-07 A
2018-03-11 A
2022-07-08 A
df_cal:
start_date end_date
2018-02-07 2018-03-12
2018-10-22 2018-11-16
2019-01-07 2019-03-08
2019-03-11 2019-04-22
2019-05-24 2019-07-02
2019-08-06 2019-09-09
2019-10-09 2019-11-05
2019-11-29 2020-01-14
2020-02-03 2020-02-21
2020-02-28 2020-03-05
2020-03-19 2020-04-28
2020-05-06 2020-07-13
2020-07-24 2020-08-31
2020-11-02 2021-01-13
2020-09-11 2020-10-13
2021-01-29 2021-02-18
2021-03-09 2021-04-30
2021-05-06 2021-07-22
2021-07-28 2021-09-14
2021-10-12 2021-12-13
2022-04-27 2022-06-30
Expected result:
ts_code col
2018-01-01 A 0
2018-02-07 A 1
2018-03-11 A 1
2022-07-08 A 0
Goal:
I want to assign values to a new column col: 1 if df.index falls within any of the df_cal date ranges, and 0 otherwise.
Reference:
I referred to this post, but it only works for one condition and I have many date ranges. I also don't want to use the DataFrame join method because it would break the index order.
You can check with numpy broadcasting:
# here df1 is the ranges frame (df_cal) and df2 is the frame with the DatetimeIndex (df)
df2['new'] = np.any((df1.end_date.values >= df2.index.values[:, None]) &
                    (df1.start_date.values <= df2.index.values[:, None]), axis=1).astype(int)
df2
Out[55]:
ts_code col new
2018-01-01 A 0 0
2018-02-07 A 1 1
2018-03-11 A 1 1
2022-07-08 A 0 0
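For readability, here is the same broadcasting idea spelled out with the question's names (df and df_cal), assuming the index and both range columns are already datetimes:
import numpy as np

dates = df.index.values[:, None]                      # shape (n_dates, 1)
in_range = ((dates >= df_cal['start_date'].values) &  # broadcasts to (n_dates, n_ranges)
            (dates <= df_cal['end_date'].values))
df['col'] = in_range.any(axis=1).astype(int)          # 1 if the date falls in any range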
I have the following dataframe with daily data:
day value
2017-08-04 0.832
2017-08-05 0.892
2017-08-06 0.847
2017-08-07 0.808
2017-08-08 0.922
2017-08-09 0.894
2017-08-10 2.332
2017-08-11 0.886
2017-08-12 0.973
2017-08-13 0.980
... ...
2022-03-21 0.821
2022-03-22 1.121
2022-03-23 1.064
2022-03-24 1.058
2022-03-25 0.891
2022-03-26 1.010
2022-03-27 1.023
2022-03-28 1.393
2022-03-29 2.013
2022-03-30 3.872
[1700 rows x 1 columns]
I need to generate pooled averages using moving windows. I explain it group by group:
The first group must contain the data from 2017-08-04 to 2017-08-08, but also the data from 2018-08-04 to 2018-08-08, and so on until the last year. As shown below:
2017-08-04 0.832
2017-08-05 0.892
2017-08-06 0.847
2017-08-07 0.808
2017-08-08 0.922
---------- -----
2018-08-04 2.125
2018-08-05 2.200
2018-08-06 2.339
2018-08-07 2.035
2018-08-08 1.953
... ...
2020-08-04 0.965
2020-08-05 0.941
2020-08-06 0.917
2020-08-07 0.922
2020-08-08 0.909
---------- -----
2021-08-04 1.348
2021-08-05 1.302
2021-08-06 1.272
2021-08-07 1.258
2021-08-08 1.281
The second group shifts the window by one day: data from 2017-08-05 to 2017-08-09, from 2018-08-05 to 2018-08-09, and so on until the last year. As shown below:
2017-08-05 0.892
2017-08-06 0.847
2017-08-07 0.808
2017-08-08 0.922
2017-08-09 1.823
---------- -----
2018-08-05 2.200
2018-08-06 2.339
2018-08-07 2.035
2018-08-08 1.953
2018-08-09 2.009
... ...
2020-08-05 0.941
2020-08-06 0.917
2020-08-07 0.922
2020-08-08 0.909
2020-08-09 1.934
---------- -----
2021-08-05 1.302
2021-08-06 1.272
2021-08-07 1.258
2021-08-08 1.281
2021-08-09 2.348
The following groups must follow the same pattern. Finally, I need to form a DataFrame whose index is the central date of each window (so the DataFrame will have 365 rows, one per day of the year) and whose values are the average of each of the groups described above.
I have been trying to use groupby and rolling at the same time, but any solution based on other pandas methods is completely valid.
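One possible sketch (assuming the frame is called df, is indexed by day and has a single value column) is to pool all rows by calendar day across years and then take a centered 5-day rolling mean over the pooled sums and counts:
import pandas as pd

# group every row by its calendar day (month-day), ignoring the year
by_day = df.groupby(df.index.strftime('%m-%d'))['value']
sums = by_day.sum()      # total of the values observed on that calendar day, all years
counts = by_day.count()  # number of observations on that calendar day, all years

# pooled mean of each centered 5-day window = window sum / window count
pooled = (sums.rolling(5, center=True, min_periods=1).sum()
          / counts.rolling(5, center=True, min_periods=1).sum())
# note: this does not wrap around the year boundary (Dec 31 / Jan 1)
This is only a starting point; it weights every observation in a window equally across years.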
I have a DataFrame with a DatetimeIndex and I need to create a column that contains the time difference between the rows of the DatetimeIndex, expressed in hours. This is what I have:
Datetime Numbers
2020-11-27 08:30:00 1
2020-11-27 13:00:00 2
2020-11-27 15:15:00 3
2020-11-27 20:45:00 4
2020-11-28 08:45:00 5
2020-11-28 10:45:00 6
2020-12-01 04:00:00 7
2020-12-01 08:15:00 8
2020-12-01 12:45:00 9
2020-12-01 14:45:00 10
2020-12-01 17:15:00 11
...
This is what I need:
Datetime Numbers Delta
2020-11-27 08:30:00 1 NaN
2020-11-27 13:00:00 2 4.5
2020-11-27 15:15:00 3 2.25
2020-11-27 20:45:00 4 5.5
2020-11-28 08:45:00 5 12
2020-11-28 10:45:00 6 2
2020-12-01 04:00:00 7 65.25
2020-12-01 08:15:00 8 4.25
2020-12-01 12:45:00 9 4.5
2020-12-01 14:45:00 10 2
2020-12-01 17:15:00 11 2.5
...
The DataFrame has thousands of rows, so I can't use a for loop. Thanks in advance!
EDIT: I found a solution:
import numpy as np
df = df.reset_index()
df['Time'] = df['Datetime'].astype(np.int64) // 10**9   # seconds since the epoch
df['Delta'] = df['Time'].diff() / 3600                   # hours between consecutive rows
df.drop(columns=['Time'], inplace=True)
df.set_index('Datetime', inplace=True)
I assume that Datetime is set as index:
df.reset_index(inplace=True)
df['Delta'] = df['Datetime'].diff().dt.total_seconds()/3600
df.set_index('Datetime', inplace=True)
OUTPUT:
Numbers Delta
Datetime
2020-11-27 08:30:00 1 NaN
2020-11-27 13:00:00 2 4.50
2020-11-27 15:15:00 3 2.25
2020-11-27 20:45:00 4 5.50
2020-11-28 08:45:00 5 12.00
2020-11-28 10:45:00 6 2.00
2020-12-01 04:00:00 7 65.25
2020-12-01 08:15:00 8 4.25
2020-12-01 12:45:00 9 4.50
2020-12-01 14:45:00 10 2.00
2020-12-01 17:15:00 11 2.50
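For reference, the same delta can also be computed without resetting the index, by diffing the index directly:
df['Delta'] = df.index.to_series().diff().dt.total_seconds() / 3600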
I am learning how to filter dates on a pandas DataFrame and need some help with the following, please. This is my original DataFrame (from this data):
data
Out[120]:
Open High Low Last Volume NumberOfTrades BidVolume AskVolume
Timestamp
2014-03-04 09:30:00 1783.50 1784.50 1783.50 1784.50 171 17 29 142
2014-03-04 09:31:00 1784.75 1785.75 1784.50 1785.25 28 21 10 18
2014-03-04 09:32:00 1785.00 1786.50 1785.00 1786.50 81 19 4 77
2014-03-04 09:33:00 1786.00 1786.00 1785.25 1785.25 41 14 8 33
2014-03-04 09:34:00 1785.00 1785.25 1784.75 1785.25 11 8 2 9
2014-03-04 09:35:00 1785.50 1786.75 1785.50 1785.75 49 27 13 36
2014-03-04 09:36:00 1786.00 1786.00 1785.25 1785.75 12 8 3 9
2014-03-04 09:37:00 1786.00 1786.25 1785.25 1785.25 15 8 10 5
2014-03-04 09:38:00 1785.50 1785.50 1784.75 1785.25 24 17 17 7
data.dtypes
Out[118]:
Open float64
High float64
Low float64
Last float64
Volume int64
NumberOfTrades int64
BidVolume int64
AskVolume int64
dtype: object
I then resampled to 5-minute intervals:
five_min = data.resample('5T').sum()
And looked for the high-volume days:
max_volume = five_min.Volume.at_time('9:30') > 65000
I then try to get the high-volume days as follows:
five_min.Volume = max_volume[max_volume == True]
for_high_vol = five_min.Volume.dropna()
for_high_vol
Timestamp
2014-03-21 09:30:00 True
2014-04-11 09:30:00 True
2014-04-16 09:30:00 True
2014-04-17 09:30:00 True
2014-07-18 09:30:00 True
2014-07-31 09:30:00 True
2014-09-19 09:30:00 True
2014-10-07 09:30:00 True
2014-10-10 09:30:00 True
2014-10-14 09:30:00 True
2014-10-15 09:30:00 True
2014-10-16 09:30:00 True
2014-10-17 09:30:00 True
I would like to use the index from "for_high_vol" to select all of the days from the original "data" Pandas dataframe.
I'm sure there are much better ways to approach this, so can someone please show me the simplest way to do it?
IIUC, you can do it this way:
x.loc[(x.groupby(pd.Grouper(key='Timestamp', freq='5T'))['Volume'].transform('sum') > 65000)
      & (x.Timestamp.dt.hour == 9)
      & (x.Timestamp.dt.minute >= 30) & (x.Timestamp.dt.minute <= 34)]
In order to set the index back:
x.loc[(x.groupby(pd.Grouper(key='Timestamp', freq='5T'))['Volume'].transform('sum') > 65000)
      & (x.Timestamp.dt.hour == 9)
      & (x.Timestamp.dt.minute >= 30) & (x.Timestamp.dt.minute <= 34)].set_index('Timestamp')
PS: Timestamp is a regular column in my DF, not an index.
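If Timestamp is the index in your frame (as in the question's data), a minimal sketch to move it into a column first might be:
x = data.reset_index()   # turn the Timestamp index into a regular column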
Explanation:
Resample / group the DF into 5-minute intervals, calculate the sum of Volume for each group, and assign this sum to all rows in the group. For example, in the output below, 332 is the sum of Volume in the first 5-minute group:
In [41]: (x.groupby(pd.Grouper(key='Timestamp', freq='5T'))['Volume'].transform('sum')).head(10)
Out[41]:
0 332
1 332
2 332
3 332
4 332
5 113
6 113
7 113
8 113
9 113
dtype: int64
Filter by time - the conditions are self-explanatory:
(x.Timestamp.dt.hour == 9) & (x.Timestamp.dt.minute >= 30) & (x.Timestamp.dt.minute <= 34)
And finally combine all conditions (filters) together, pass the result to the .loc[] indexer, and set the index back to Timestamp:
x.loc[(x.groupby(pd.Grouper(key='Timestamp', freq='5T'))['Volume'].transform('sum') > 65000)
      & (x.Timestamp.dt.hour == 9)
      & (x.Timestamp.dt.minute >= 30) & (x.Timestamp.dt.minute <= 34)].set_index('Timestamp')
Output:
Out[32]:
Timestamp Open High Low Last Volume NumberOfTrades BidVolume AskVolume
5011 2014-03-21 09:30:00 1800.75 1802.50 1800.00 1802.25 30181 6006 13449 16732
5012 2014-03-21 09:31:00 1802.50 1803.25 1802.25 1802.50 15588 3947 5782 9806
5013 2014-03-21 09:32:00 1802.50 1803.75 1802.25 1803.25 16409 3994 6867 9542
5014 2014-03-21 09:33:00 1803.00 1803.50 1802.75 1803.25 10790 3158 4781 6009
5015 2014-03-21 09:34:00 1803.25 1804.75 1803.25 1804.75 13377 3466 4690 8687
11086 2014-04-11 09:30:00 1744.75 1744.75 1743.00 1743.50 21504 5876 11178 10326
11087 2014-04-11 09:31:00 1743.50 1746.50 1743.25 1746.00 21582 6191 8830 12752
11088 2014-04-11 09:32:00 1746.00 1746.50 1744.25 1745.75 18961 5214 9521 9440
11089 2014-04-11 09:33:00 1746.00 1746.25 1744.00 1744.25 12832 3658 7219 5613
11090 2014-04-11 09:34:00 1744.25 1744.25 1742.00 1742.75 15478 4919 8912 6566
12301 2014-04-16 09:30:00 1777.50 1778.25 1776.25 1777.00 21178 5431 10775 10403
12302 2014-04-16 09:31:00 1776.75 1779.25 1776.50 1778.50 16456 4400 6351 10105
12303 2014-04-16 09:32:00 1778.50 1779.25 1777.25 1777.50 9956 3015 5810 4146
12304 2014-04-16 09:33:00 1777.50 1778.00 1776.25 1776.25 8724 2470 5326 3398
12305 2014-04-16 09:34:00 1776.25 1777.00 1775.50 1776.25 9566 2968 5098 4468
12706 2014-04-17 09:30:00 1781.50 1782.50 1781.25 1782.25 16474 4583 7510 8964
12707 2014-04-17 09:31:00 1782.25 1782.50 1781.00 1781.25 10328 2587 6310 4018
12708 2014-04-17 09:32:00 1781.25 1782.25 1781.00 1781.25 9072 2142 4618 4454
12709 2014-04-17 09:33:00 1781.00 1781.75 1780.25 1781.25 17866 3807 10665 7201
12710 2014-04-17 09:34:00 1781.50 1782.25 1780.50 1781.75 11322 2523 5538 5784
38454 2014-07-18 09:30:00 1893.50 1893.75 1892.50 1893.00 24864 5135 13874 10990
38455 2014-07-18 09:31:00 1892.75 1893.50 1892.75 1892.75 8003 1751 3571 4432
38456 2014-07-18 09:32:00 1893.00 1893.50 1892.75 1893.50 7062 1680 3454 3608
38457 2014-07-18 09:33:00 1893.25 1894.25 1893.00 1894.25 10581 1955 3925 6656
38458 2014-07-18 09:34:00 1894.25 1895.25 1894.00 1895.25 15309 3347 5516 9793
42099 2014-07-31 09:30:00 1886.25 1886.25 1884.25 1884.75 21668 5857 11910 9758
42100 2014-07-31 09:31:00 1884.50 1884.75 1882.25 1883.00 17487 5186 11403 6084
42101 2014-07-31 09:32:00 1883.00 1884.50 1882.50 1884.00 13174 3782 4791 8383
42102 2014-07-31 09:33:00 1884.25 1884.50 1883.00 1883.25 9095 2814 5299 3796
42103 2014-07-31 09:34:00 1883.25 1884.25 1883.00 1884.25 7593 2528 3794 3799
... ... ... ... ... ... ... ... ... ...
193508 2016-01-21 09:30:00 1838.00 1838.75 1833.00 1834.00 22299 9699 12666 9633
193509 2016-01-21 09:31:00 1834.00 1836.50 1833.00 1834.50 8851 4520 4010 4841
193510 2016-01-21 09:32:00 1834.25 1835.25 1832.50 1833.25 7957 3672 3582 4375
193511 2016-01-21 09:33:00 1833.00 1838.50 1832.00 1838.00 12902 5564 5174 7728
193512 2016-01-21 09:34:00 1838.00 1841.50 1837.75 1840.50 13991 6130 6799 7192
199178 2016-02-10 09:30:00 1840.00 1841.75 1839.00 1840.75 13683 5080 6743 6940
199179 2016-02-10 09:31:00 1840.75 1842.00 1838.75 1841.50 11753 4623 5616 6137
199180 2016-02-10 09:32:00 1841.50 1844.75 1840.75 1843.00 16402 6818 8226 8176
199181 2016-02-10 09:33:00 1843.00 1843.50 1841.00 1842.00 14963 5402 8431 6532
199182 2016-02-10 09:34:00 1842.25 1843.50 1840.00 1840.00 8397 3475 4537 3860
200603 2016-02-16 09:30:00 1864.00 1866.25 1863.50 1864.75 19585 6865 9548 10037
200604 2016-02-16 09:31:00 1865.00 1865.50 1863.75 1864.25 16604 5936 8095 8509
200605 2016-02-16 09:32:00 1864.25 1864.75 1862.75 1863.50 10126 4713 5591 4535
200606 2016-02-16 09:33:00 1863.25 1863.75 1861.50 1862.25 9648 3786 5824 3824
200607 2016-02-16 09:34:00 1862.25 1863.50 1861.75 1862.25 10748 4143 5413 5335
205058 2016-03-02 09:30:00 1952.75 1954.25 1952.00 1952.75 19812 6684 10350 9462
205059 2016-03-02 09:31:00 1952.75 1954.50 1952.25 1953.50 10163 4236 3884 6279
205060 2016-03-02 09:32:00 1953.50 1954.75 1952.25 1952.50 15771 5519 8135 7636
205061 2016-03-02 09:33:00 1952.75 1954.50 1952.50 1953.75 9556 3583 3768 5788
205062 2016-03-02 09:34:00 1953.75 1954.75 1952.25 1952.50 11898 4463 6459 5439
209918 2016-03-18 09:30:00 2027.50 2028.25 2026.50 2028.00 38092 8644 17434 20658
209919 2016-03-18 09:31:00 2028.00 2028.25 2026.75 2027.25 11631 3209 6384 5247
209920 2016-03-18 09:32:00 2027.25 2027.75 2027.00 2027.50 9664 3270 5080 4584
209921 2016-03-18 09:33:00 2027.50 2027.75 2026.75 2026.75 10610 3117 5358 5252
209922 2016-03-18 09:34:00 2026.75 2027.00 2026.00 2026.50 8076 3022 4670 3406
227722 2016-05-20 09:30:00 2034.25 2035.25 2033.50 2034.50 30272 7815 16098 14174
227723 2016-05-20 09:31:00 2034.75 2035.75 2034.50 2035.50 12997 3690 6458 6539
227724 2016-05-20 09:32:00 2035.50 2037.50 2035.50 2037.25 12661 3864 5233 7428
227725 2016-05-20 09:33:00 2037.25 2037.75 2036.50 2037.00 9057 2524 5190 3867
227726 2016-05-20 09:34:00 2037.00 2037.50 2036.75 2037.00 5190 1620 2748 2442
[255 rows x 9 columns]
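As a follow-up to the original goal (selecting all rows from the high-volume days, not only the 09:30-09:34 bars), a rough sketch, assuming data keeps its DatetimeIndex and for_high_vol is the Series built in the question:
high_vol_days = for_high_vol.index.normalize()                  # dates of the qualifying 09:30 bars
full_days = data[data.index.normalize().isin(high_vol_days)]    # every row from those days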