I have 7 columns of data indexed by datetime (30-minute frequency), starting at 2017-05-31 and ending at 2018-05-25. I want to plot the mean over specific date ranges (seasons). I have been trying groupby, but I can't get it to group by a specific range, and df.groupby(df.date.dt.month).mean() gives the wrong result.
A few lines from the dataset:
50 51 56 58
date
2017-05-31 00:00:00 200.213542 276.929198 242.879051 NaN
2017-05-31 00:30:00 200.215478 276.928229 242.879051 NaN
2017-05-31 01:00:00 200.215478 276.925324 242.878083 NaN
2017-06-01 01:00:00 200.221288 276.944691 242.827729 NaN
2017-06-01 01:30:00 200.221288 276.944691 242.827729 NaN
2017-08-31 09:00:00 206.961886 283.374453 245.041349 184.358250
2017-08-31 09:30:00 206.966727 283.377358 245.042317 184.360187
2017-12-31 09:00:00 212.925877 287.198416 247.455413 187.175144
2017-12-31 09:30:00 212.926846 287.196480 247.465097 187.179987
2018-03-31 23:00:00 213.304498 286.933093 246.469647 186.887548
2018-03-31 23:30:00 213.308369 286.938902 246.468678 186.891422
2018-04-30 23:00:00 215.496812 288.342024 247.522230 188.104749
2018-04-30 23:30:00 215.497781 288.340086 247.520294 188.103780
I have created these variables (these are the ranges I need):
increment_rates_winter = df.loc['2017-08-30'].mean() - df.loc['2017-06-01'].mean()
increment_rates_spring = df.loc['2017-11-30'].mean() - df.loc['2017-09-01'].mean()
increment_rates_summer = df.loc['2018-02-28'].mean() - df.loc['2017-12-01'].mean()
increment_rates_fall = df.loc['2018-05-24'].mean() - df.loc['2018-03-01'].mean()
Concatenated them:
df_seasons = pd.concat([increment_rates_winter, increment_rates_spring, increment_rates_summer, increment_rates_fall], axis=1)
and after plotting, I got this:
However, I've been trying to get this:
df_seasons
Out[664]:
Winter Spring Summer Fall
50 6.697123 6.948447 -1.961549 7.662622
51 6.428329 4.760650 -2.188402 5.927087
52 5.580953 6.667529 1.136889 12.939295
53 6.406259 2.506279 -2.105125 6.964549
54 4.332826 3.678492 -2.574769 6.569398
56 2.222032 3.359607 -2.694863 5.348258
58 NaN 1.388535 -0.035889 4.213046
The seasons should be on the x-axis, with the mean plotted for each column. The season ranges are:
Winter = df['2017-06-01':'2017-08-30']
Spring = df['2017-09-01':'2017-11-30']
Summer = df['2017-12-01':'2018-02-28']
Fall = df['2018-03-01':'2018-05-30']
Thank you in advance!
We can select a specific date range as follows; you can then define the ranges however you want and take the mean:
import pandas as pd
df = pd.read_csv('test.csv')
df['date'] = pd.to_datetime(df['date'])
start_date = "2017-12-31 09:00:00"
end_date = "2018-04-30 23:00:00"
mask = (df['date'] > start_date) & (df['date'] <= end_date)
f_df = df.loc[mask]
This gives the output
date 50 ... 58
8 2017-12-31 09:30:00 212.926846 ... 187.179987 NaN
9 2018-03-31 23:00:00 213.304498 ... 186.887548 NaN
10 2018-03-31 23:30:00 213.308369 ... 186.891422 NaN
11 2018-04-30 23:00:00 215.496812 ... 188.104749 NaN
Hope this helps
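Building on the masking idea, the four season ranges from the question can be looped over to build the seasons table in one step. This is only a sketch on synthetic data: the random values, the three column names and the `seasons` dictionary are assumptions for illustration, not the asker's data. It uses label slicing with .loc on a DatetimeIndex, which includes the whole end day.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the question's data (30-minute frequency, made-up values)
idx = pd.date_range("2017-05-31", "2018-05-25 23:30", freq="30min")
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.normal(size=(len(idx), 3)).cumsum(axis=0),
    index=idx,
    columns=["50", "51", "56"],
)

# Season ranges copied from the question
seasons = {
    "Winter": ("2017-06-01", "2017-08-30"),
    "Spring": ("2017-09-01", "2017-11-30"),
    "Summer": ("2017-12-01", "2018-02-28"),
    "Fall": ("2018-03-01", "2018-05-30"),
}

# One column per season, one row per original column
df_seasons = pd.DataFrame(
    {name: df.loc[start:end].mean() for name, (start, end) in seasons.items()}
)
print(df_seasons)

# To put the seasons on the x-axis (requires matplotlib):
# df_seasons.T.plot()
```

The resulting frame has the same layout as the table the question asks for, so transposing it before plotting puts the seasons on the x-axis.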
How about transposing it:
df_seasons.T.plot()
Output:
I have a dataframe with one column, timestamp (of type datetime), and some other columns whose content doesn't matter. I'm trying to group by 5-minute intervals and count, ignoring the date and caring only about the time of day.
One can generate an example dataframe using this code:
import numpy as np
import pandas as pd

def get_random_dates_df(
    n=10000,
    start=pd.to_datetime('2015-01-01'),
    period_duration_days=5,
    seed=None
):
    # Seed the generator; fall back to 0 when no seed is given (from piR's answer)
    np.random.seed(0 if seed is None else seed)
    n_seconds = int(period_duration_days * 3600 * 24)
    # Random offsets in seconds, spread uniformly over the whole period
    random_dates = pd.to_timedelta(n_seconds * np.random.rand(n), unit='s') + start
    return pd.DataFrame(data={"timestamp": random_dates}).reset_index()
df = get_random_dates_df()
it would look like this:
   index                     timestamp
0      0 2015-01-03 17:51:27.433696604
1      1 2015-01-04 13:49:21.806272885
2      2 2015-01-04 00:19:53.778462950
3      3 2015-01-03 17:23:09.535054659
4      4 2015-01-03 02:50:18.873314407
I think I have a working solution but it seems overly complicated:
gpd_df = df.groupby(pd.Grouper(key="timestamp", freq="5min")).agg(
    count=("index", "count")
).reset_index()
gpd_df["time_of_day"] = gpd_df["timestamp"].dt.time
# Sum only the count column; the timestamp column cannot be summed
res_df = gpd_df.groupby("time_of_day")[["count"]].sum()
Output:
count
time_of_day
00:00:00 38
00:05:00 39
00:10:00 48
00:15:00 33
00:20:00 27
... ...
23:35:00 34
23:40:00 38
23:45:00 37
23:50:00 41
23:55:00 41
[288 rows x 1 columns]
Is there a better way to solve this?
You could groupby the floored 5Min datetime's time portion:
df2 = df.groupby(df['timestamp'].dt.floor('5Min').dt.time)['index'].count()
I'd suggest something like this, to avoid trying to merge the results of two groupbys together:
gpd_df = df.copy()
gpd_df["time_of_day"] = gpd_df["timestamp"].apply(lambda x: x.replace(year=2000, month=1, day=1))
gpd_df = gpd_df.set_index("time_of_day")
res_df = gpd_df.resample("5min").size()
It works by setting the year/month/day to fixed values and applying the built-in resampling function.
What about flooring the datetimes to 5min, extracting the time only and using value_counts:
out = (df['timestamp']
       .dt.floor('5min')
       .dt.time.value_counts(sort=False)
       .sort_index()
      )
Output:
00:00:00 38
00:05:00 39
00:10:00 48
00:15:00 33
00:20:00 27
..
23:35:00 34
23:40:00 38
23:45:00 37
23:50:00 41
23:55:00 41
Name: timestamp, Length: 288, dtype: int64
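As a quick sanity check, the sketch below compares the two-step groupby from the question with the floor-based one-liner on freshly generated data. The data here is an assumption (uniform random timestamps over five days from a NumPy generator, not the exact frame from the question), so the counts will differ from the outputs shown above; the point is only that both routes agree with each other.

```python
import numpy as np
import pandas as pd

# Made-up timestamps spread uniformly over five days
rng = np.random.default_rng(0)
start = pd.Timestamp("2015-01-01")
seconds = rng.uniform(0, 5 * 24 * 3600, size=10_000)
df = pd.DataFrame({"timestamp": start + pd.to_timedelta(seconds, unit="s")}).reset_index()

# Two-step route: count per 5-minute bin, then sum the bins by time of day
per_bin = (
    df.groupby(pd.Grouper(key="timestamp", freq="5min"))["index"]
    .count()
    .rename("count")
    .reset_index()
)
two_step = per_bin.groupby(per_bin["timestamp"].dt.time)["count"].sum()

# One-step route: floor to 5 minutes, keep only the time, count directly
one_step = df.groupby(df["timestamp"].dt.floor("5min").dt.time)["index"].count()

print(one_step.head())
print((one_step.to_numpy() == two_step.to_numpy()).all())
```

The one-step version avoids the intermediate frame, which is why the answers above prefer it.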
I have a MySQL database in which each record is associated with the date and time it was recorded. When several values fall within a time range of 3 minutes, I want to take the mean of those values. I made a fake file to illustrate.
#dataSample.csv
;y;datetime
0;1.885539280369374;2020-12-18 00:16:59
1;88.87944658745302;2020-12-18 00:18:26
2;5.4934801892366645;2020-12-18 00:21:47
3;27.481240675960745;2020-12-22 02:22:43
4;78.20955112191257;2021-03-12 00:01:45
5;69.20174844202616;2021-03-12 00:03:01
6;92.452056802478;2021-03-12 00:04:10
7;65.44391665410022;2021-03-12 00:06:12
8;40.59036279552053;2021-03-13 11:07:40
9;97.28850548113896;2021-03-13 11:08:46
10;94.73214209590618;2021-03-13 11:09:52
11;15.032038741334246;2021-03-14 00:50:10
12;26.96629037360529;2021-03-14 00:51:17
13;57.257554884427755;2021-03-14 00:52:20
14;18.845976481042804;2021-03-17 13:52:00
15;57.19160644979182;2021-03-17 13:53:48
16;3.81419643210113;2021-03-17 13:54:50
17;46.65212265222033;2021-03-17 20:00:06
18;78.99788944141437;2021-03-17 20:01:28
19;72.57950242929162;2021-03-17 20:02:18
20;31.953619913660063;2021-03-20 16:40:04
21;71.03880579866258;2021-03-20 16:41:14
22;80.07721218822367;2021-03-20 16:42:03
23;84.4974927845413;2021-03-23 23:51:04
24;23.332882564418554;2021-03-23 23:52:37
25;24.84651458538292;2021-03-23 23:53:44
26;3.2905723920299073;2021-04-13 01:07:13
27;95.00543057651691;2021-04-13 01:08:53
28;46.02579988887248;2021-04-13 01:10:03
29;71.73362449536457;2021-04-13 07:54:22
30;93.17353939667422;2021-04-13 07:56:03
31;28.06669274690586;2021-04-13 07:57:04
32;10.733532291051478;2021-04-21 23:52:19
33;92.92374999199961;2021-04-21 23:53:02
34;59.68694726616824;2021-04-21 23:54:12
35;30.01172074266929;2021-11-29 00:21:09
36;34.905022198511915;2021-11-29 00:23:09
37;25.149590827473055;2021-11-29 00:24:13
38;82.09740354280564;2021-12-01 08:30:00
39;25.58339148753002;2021-12-01 08:32:00
40;72.7009145748645;2021-12-01 08:34:00
41;8.43474445404563;2021-12-01 13:18:58
42;57.95936012084567;2021-12-01 13:19:45
43;31.118114587376713;2021-12-01 13:21:19
44;42.082098854369576;2021-12-01 20:24:46
45;75.8402567179772;2021-12-01 20:25:45
46;55.29546227636972;2021-12-01 20:26:20
47;72.52918512264547;2021-12-02 08:35:42
48;77.81077056479849;2021-12-02 08:36:35
49;34.63717484559066;2021-12-02 08:37:22
50;71.65724478546072;2021-12-06 00:05:00
51;19.54082334014094;2021-12-06 00:08:00
52;48.28967362303979;2021-12-06 00:10:00
53;34.894095185290105;2021-12-03 08:36:00
54;58.187428474357375;2021-12-03 08:40:00
55;94.53441120864328;2021-12-03 08:45:00
56;12.272217150555866;2021-12-03 13:10:00
57;87.21292441413424;2021-12-03 13:11:00
58;86.35470090744712;2021-12-03 13:12:00
59;50.23396755270806;2021-12-06 23:46:00
60;73.30424413459407;2021-12-06 23:48:00
61;60.48531615320234;2021-12-06 23:49:00
62;56.10336877052336;2021-12-06 23:51:00
63;87.6451368964707;2021-12-07 08:37:00
64;11.902048844734905;2021-12-07 10:48:00
65;57.596744167099494;2021-12-07 10:58:00
66;61.77125104854312;2021-12-07 11:05:00
67;21.542193987296695;2021-12-07 11:28:00
68;91.64520146457525;2021-12-07 11:29:00
69;78.42486998655676;2021-12-07 16:06:00
70;79.51721853991806;2021-12-07 16:08:00
71;54.46969194684532;2021-12-07 16:09:00
72;56.092025088935785;2021-12-07 16:12:00
73;2.546437552510464;2021-12-07 18:35:00
74;11.598686235757118;2021-12-07 18:40:00
75;40.26003639570842;2021-12-07 18:45:00
76;30.697636730470848;2021-12-07 23:39:00
77;66.3177096178856;2021-12-07 23:42:00
78;73.16870525875022;2021-12-07 23:47:00
79;61.68994018242363;2021-12-08 13:47:00
80;38.06598256433572;2021-12-08 13:48:00
81;43.91268499464372;2021-12-08 13:49:00
82;33.166594417250735;2021-12-15 00:23:00
83;52.68422837459157;2021-12-15 00:24:00
84;86.01398356923765;2021-12-15 00:26:00
85;21.444108620566542;2021-12-15 00:31:00
86;86.6839608035921;2021-12-18 01:09:00
87;43.83047571188636;2022-01-06 00:24:00
Here is my code:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

fileName = "dataSample.csv"
df = pd.read_csv(fileName, sep=";", index_col=0)
df['datetime_object'] = df['datetime'].apply(datetime.fromisoformat)

def define_mask(d, delta_minutes):
    # True for rows whose time lies in [d, d + delta_minutes]
    return (d <= df["datetime_object"]) & (df["datetime_object"] <= d + timedelta(minutes=delta_minutes))

group = []
i = 0
while i < len(df):
    d = df.loc[i]["datetime_object"]
    mask = define_mask(d, 3)
    # Label every row in the window with the index of its first row
    for k in range(len(df[mask].index)):
        group.append(i)
    i += len(df[mask].index)
df["group"] = group
df_new = df.groupby("group").apply(np.mean)
It works well, but I am wondering whether this is good pandas practice.
I have 2 questions:
Is there another way to do this with pandas?
Is there an SQL command to do this directly?
You can use resample:
df = pd.read_csv('dataSample.csv', sep=';', index_col=0, parse_dates=['datetime'])
out = df.resample('3min', on='datetime').mean().dropna().reset_index()
print(out)
# Output
datetime y
0 2020-12-18 00:15:00 1.885539
1 2020-12-18 00:18:00 88.879447
2 2020-12-18 00:21:00 5.493480
3 2020-12-22 02:21:00 27.481241
4 2021-03-12 00:00:00 78.209551
.. ... ...
59 2021-12-15 00:21:00 33.166594
60 2021-12-15 00:24:00 69.349106
61 2021-12-15 00:30:00 21.444109
62 2021-12-18 01:09:00 86.683961
63 2022-01-06 00:24:00 43.830476
[64 rows x 2 columns]
Another way to get the first datetime value of a group of 3 minutes:
out = df.groupby(pd.Grouper(freq='3min', key='datetime'), as_index=False) \
.agg({'y': 'mean', 'datetime': 'first'}) \
.dropna(how='all').reset_index(drop=True)
print(out)
# Output
y datetime
0 1.885539 2020-12-18 00:16:59
1 88.879447 2020-12-18 00:18:26
2 5.493480 2020-12-18 00:21:47
3 27.481241 2020-12-22 02:22:43
4 78.209551 2021-03-12 00:01:45
.. ... ...
59 33.166594 2021-12-15 00:23:00
60 69.349106 2021-12-15 00:24:00
61 21.444109 2021-12-15 00:31:00
62 86.683961 2021-12-18 01:09:00
63 43.830476 2022-01-06 00:24:00
[64 rows x 2 columns]
Or
out = df.resample('3min', on='datetime') \
    .agg({'y': 'mean', 'datetime': 'first'}) \
    .dropna(how='all').reset_index(drop=True)
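Note that resample and pd.Grouper cut time into fixed 3-minute bins, so two records 90 seconds apart can still land in different bins (in the output above, 00:16:59 and 00:18:26 do). If the intent is instead to keep records together whenever they are close to each other, one alternative is to start a new group each time the gap to the previous record exceeds 3 minutes. This is a different technique from both the question's loop and the resample answer, shown here as a sketch on four rows with made-up y values; chained records can form a group longer than 3 minutes in total.

```python
import pandas as pd

# Four timestamps taken from the sample file, with made-up y values
df = pd.DataFrame({
    "y": [1.0, 3.0, 5.0, 7.0],
    "datetime": pd.to_datetime([
        "2020-12-18 00:16:59",
        "2020-12-18 00:18:26",
        "2020-12-18 00:21:47",
        "2020-12-22 02:22:43",
    ]),
})

df = df.sort_values("datetime")

# A new group starts when the gap to the previous row is larger than 3 minutes.
# The first row has no previous row (NaT), which compares as False.
starts_new_group = df["datetime"].diff() > pd.Timedelta(minutes=3)
df["group"] = starts_new_group.cumsum()

out = df.groupby("group").agg(y=("y", "mean"), datetime=("datetime", "first"))
print(out)
```

Here the first two rows are 87 seconds apart and share a group, while the third row follows after 3 minutes 21 seconds and starts a new one.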
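In MySQL you can achieve it like this (grouping by the floored timestamp itself, so that records from different hours or days do not end up in the same group):
SELECT
    FROM_UNIXTIME(FLOOR(UNIX_TIMESTAMP(`datetime`) / 180) * 180) AS 'datetime',
    AVG(`y`) AS 'y'
FROM `table`
GROUP BY
    FROM_UNIXTIME(FLOOR(UNIX_TIMESTAMP(`datetime`) / 180) * 180)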
AVG() is an aggregate function in MySQL that when used with 'GROUP BY' returns an aggregate result of the grouped rows.
One way to round the date to groups of 3 minute intervals would be to convert to a unix timestamp and utilize the FLOOR function:
UNIX_TIMESTAMP to convert the date to unix timestamp (number of seconds since 1970-01-01 00:00:00)
Divide by # of seconds to group by
FLOOR() function to get the closest integer value not greater than the input.
Multiply the result by # of seconds to convert back to a unix timestamp
FROM_UNIXTIME() to convert the unix timestamp back to a MySQL datetime
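The arithmetic in these steps can be checked outside the database. The sketch below reproduces it in Python for a couple of timestamps from the sample file; `floor_to_bucket` is a hypothetical helper name, and treating the values as UTC is an assumption made so the result does not depend on the local time zone (MySQL's UNIX_TIMESTAMP uses the session time zone).

```python
from datetime import datetime, timezone

def floor_to_bucket(dt, seconds=180):
    """Round a naive datetime down to the start of its bucket (default 3 minutes)."""
    # Step 1: convert to a Unix timestamp (seconds since 1970-01-01 00:00:00 UTC)
    ts = int(dt.replace(tzinfo=timezone.utc).timestamp())
    # Steps 2-4: divide by the bucket size, floor, and multiply back
    floored = (ts // seconds) * seconds
    # Step 5: convert the timestamp back to a naive datetime
    return datetime.fromtimestamp(floored, tz=timezone.utc).replace(tzinfo=None)

print(floor_to_bucket(datetime(2020, 12, 18, 0, 16, 59)))  # 2020-12-18 00:15:00
print(floor_to_bucket(datetime(2021, 3, 12, 0, 4, 10)))    # 2021-03-12 00:03:00
```

The first result matches the 00:15:00 bin that pandas resample reports for the same record.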
I'm trying to convert daily prices into weekly, monthly, quarterly, semesterly, yearly, but the code only works when I run it for one stock. When I add another stock to the list the code crashes and gives two errors. 'ValueError: Length of names must match number of levels in MultiIndex.' and 'TypeError: other must be a MultiIndex or a list of tuples.' I'm not experienced with MultiIndexing and have searched everywhere with no success.
This is the code:
import pandas as pd
import yfinance as yf
from pandas_datareader import data as pdr
symbols = ['AMZN', 'AAPL']
yf.pdr_override()
df = pdr.get_data_yahoo(symbols, start = '2014-12-01', end = '2021-01-01')
df = df.reset_index()
df.Date = pd.to_datetime(df.Date)
df.set_index('Date', inplace = True)
res = {'Open': 'first', 'Adj Close': 'last'}
dfw = df.resample('W').agg(res)
dfw_ret = (dfw['Adj Close'] / dfw['Open'] - 1)
dfm = df.resample('BM').agg(res)
dfm_ret = (dfm['Adj Close'] / dfm['Open'] - 1)
dfq = df.resample('Q').agg(res)
dfq_ret = (dfq['Adj Close'] / dfq['Open'] - 1)
dfs = df.resample('6M').agg(res)
dfs_ret = (dfs['Adj Close'] / dfs['Open'] - 1)
dfy = df.resample('Y').agg(res)
dfy_ret = (dfy['Adj Close'] / dfy['Open'] - 1)
print(dfw_ret)
print(dfm_ret)
print(dfq_ret)
print(dfs_ret)
print(dfy_ret)
This is what the original df prints:
```
           Adj Close                    Open
                AAPL         AMZN       AAPL         AMZN
Date
2014-12-01 26.122288 326.000000 29.702499 338.119995
2014-12-02 26.022408 326.309998 28.375000 327.500000
2014-12-03 26.317518 316.500000 28.937500 325.730011
2014-12-04 26.217640 316.929993 28.942499 315.529999
2014-12-05 26.106400 312.630005 28.997499 316.799988
... ... ... ... ...
2020-12-24 131.549637 3172.689941 131.320007 3193.899902
2020-12-28 136.254608 3283.959961 133.990005 3194.000000
2020-12-29 134.440399 3322.000000 138.050003 3309.939941
2020-12-30 133.294067 3285.850098 135.580002 3341.000000
2020-12-31 132.267349 3256.929932 134.080002 3275.000000
And this is what the different df_ret print when I go from daily
to weekly/monthly/etc but it can only do it for one stock and
the idea is to be able to do it for multiple stocks:
Date
2014-12-07 -0.075387
2014-12-14 -0.013641
2014-12-21 -0.029041
2014-12-28 0.023680
2015-01-04 0.002176
...
2020-12-06 -0.014306
2020-12-13 -0.012691
2020-12-20 0.018660
2020-12-27 -0.008537
2021-01-03 0.019703
Freq: W-SUN, Length: 318, dtype: float64
Date
2014-12-31 -0.082131
2015-01-30 0.134206
2015-02-27 0.086016
2015-03-31 -0.022975
2015-04-30 0.133512
...
2020-08-31 0.085034
2020-09-30 -0.097677
2020-10-30 -0.053569
2020-11-30 0.034719
2020-12-31 0.021461
Freq: BM, Length: 73, dtype: float64
Date
2014-12-31 -0.082131
2015-03-31 0.190415
2015-06-30 0.166595
2015-09-30 0.165108
2015-12-31 0.322681
2016-03-31 -0.095461
2016-06-30 0.211909
2016-09-30 0.167275
2016-12-31 -0.103026
2017-03-31 0.169701
2017-06-30 0.090090
2017-09-30 -0.011760
2017-12-31 0.213143
2018-03-31 0.234932
2018-06-30 0.199052
2018-09-30 0.190349
2018-12-31 -0.257182
2019-03-31 0.215363
2019-06-30 0.051952
2019-09-30 -0.097281
2019-12-31 0.058328
2020-03-31 0.039851
2020-06-30 0.427244
2020-09-30 0.141676
2020-12-31 0.015252
Freq: Q-DEC, dtype: float64
Date
2014-12-31 -0.082131
2015-06-30 0.388733
2015-12-31 0.538386
2016-06-30 0.090402
2016-12-31 0.045377
2017-06-30 0.277180
2017-12-31 0.202181
2018-06-30 0.450341
2018-12-31 -0.107405
2019-06-30 0.292404
2019-12-31 -0.039075
2020-06-30 0.471371
2020-12-31 0.180907
Freq: 6M, dtype: float64
Date
2014-12-31 -0.082131
2015-12-31 1.162295
2016-12-31 0.142589
2017-12-31 0.542999
2018-12-31 0.281544
2019-12-31 0.261152
2020-12-31 0.737029
Freq: A-DEC, dtype: float64
```
Without knowing what your df DataFrame looks like, I am assuming the issue is with correctly handling the resampling on a MultiIndex, similar to the one discussed in this question.
The solution listed there is to use pd.Grouper with the freq and level parameters filled out correctly.
# This is just from the listed solution, so I am not sure if this is the correct level to choose
df.groupby(pd.Grouper(freq='W', level=-1))
If this doesn't work, I think you would need to provide some more detail or a dummy data set to reproduce the issue.
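Given the printed frame in the question, where the columns (not the rows) carry two levels, another option is to select each price field first and resample it on its own. The sketch below assumes that layout and uses made-up prices rather than data downloaded from Yahoo; `period_returns` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Made-up daily prices with the same two-level column layout as in the question
dates = pd.bdate_range("2020-01-01", "2020-03-31")
cols = pd.MultiIndex.from_product([["Adj Close", "Open"], ["AAPL", "AMZN"]])
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.uniform(90, 110, size=(len(dates), 4)), index=dates, columns=cols)

def period_returns(prices, freq):
    """Return last adjusted close / first open - 1 per period, one column per ticker."""
    first_open = prices["Open"].resample(freq).first()
    last_close = prices["Adj Close"].resample(freq).last()
    return last_close / first_open - 1

dfw_ret = period_returns(df, "W")
print(dfw_ret.head())
```

Selecting prices["Open"] drops the outer column level, so both intermediate frames have plain ticker columns and the division lines up ticker by ticker. The same helper can be called with the other frequency strings used in the question.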
I have a dataframe containing time series with hourly measurements with the following structure: name, time, output. For each name the measurements come from more or less the same time period. I am trying to fill in the missing values, such that for each day all 24h appear in the time column.
So I'm expecting a table like this:
name time output
x 2018-02-22 00:00:00 100
...
x 2018-02-22 23:00:00 200
x 2018-02-24 00:00:00 300
...
x 2018-02-24 23:00:00 300
y 2018-02-22 00:00:00 100
...
y 2018-02-22 23:00:00 200
y 2018-02-25 00:00:00 300
...
y 2018-02-25 23:00:00 300
For this I groupby name and then try to apply a custom function that adds the missing timestamps in the corresponding dataframe.
def add_missing_hours(df):
    start_date = df.time.iloc[0].date()
    end_date = df.time.iloc[-1].date()
    dates_range = pd.date_range(start_date, end_date, freq = '1H')
    new_dates = set(dates_range) - set(df.time)
    name = df["name"].iloc[0]
    df = df.append(pd.DataFrame({'GSRN':[name]*len(new_dates), 'time': new_dates}))
    return df
For some reason the name column is dropped when I create the DataFrame, but I can't understand why. Does anyone know why or have a better idea how to fill in the missing timestamps?
Edit 1:
This is different from the linked question because they didn't need all 24 values/day -- resampling between 2pm and 10pm will only give the values in between.
Edit 2:
I found a (not great) solution by creating a multi index with all name-timestamps pairs and combining with the table. Code below for anyone interested, but still interested in a better solution:
start_date = datetime.datetime.combine(df.time.min().date(),datetime.time(0, 0))
end_date = datetime.datetime.combine(df.time.max().date(),datetime.time(23, 0))
new_idx = pd.date_range(start_date, end_date, freq = '1H')
mux = pd.MultiIndex.from_product([df['name'].unique(),new_idx], names=('name','time'))
df_complete = pd.DataFrame(index=mux).reset_index().combine_first(df)
df_complete = df_complete.groupby(["name",df_complete.time.dt.date]).filter(lambda g: (g["output"].count() > 0))
The last line removes any days that were completely missing for the specific name in the initial dataframe.
Try this: first create a dataframe that covers the range from the minimum to the maximum date at hourly intervals, then join it with the original data.
df.time = pd.to_datetime(df.time)
min_date = df.time.min()
max_date = df.time.max()
dates_range = pd.date_range(min_date, max_date, freq = '1H')
df.set_index('time', inplace=True)
df3=pd.DataFrame(dates_range).set_index(0)
df4 = df3.join(df)
df4:
name output
2018-02-22 00:00:00 x 100.0
2018-02-22 00:00:00 y 100.0
2018-02-22 01:00:00 NaN NaN
2018-02-22 02:00:00 NaN NaN
2018-02-22 03:00:00 NaN NaN
... ... ...
2018-02-25 19:00:00 NaN NaN
2018-02-25 20:00:00 NaN NaN
2018-02-25 21:00:00 NaN NaN
2018-02-25 22:00:00 NaN NaN
2018-02-25 23:00:00 y 300.0
98 rows × 2 columns
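The join above works on the whole time range at once, so it does not repeat every hour for each name. A per-name variant is sketched below: for each name it builds the 24 hourly timestamps of every day that name actually has data for, then reindexes. The input frame is a hypothetical four-row example, and the sketch assumes timestamps are unique within a name.

```python
import pandas as pd

# Hypothetical input: hourly readings with gaps for two names
df = pd.DataFrame({
    "name": ["x", "x", "x", "y"],
    "time": pd.to_datetime([
        "2018-02-22 00:00", "2018-02-22 23:00", "2018-02-24 05:00", "2018-02-22 12:00",
    ]),
    "output": [100, 200, 300, 100],
})

hours = [pd.Timedelta(hours=h) for h in range(24)]
pieces = []
for name, g in df.groupby("name"):
    # Days on which this name has at least one measurement
    days = g["time"].dt.normalize().drop_duplicates()
    # All 24 hourly timestamps for each of those days
    full_index = pd.DatetimeIndex([d + h for d in days for h in hours], name="time")
    # Missing hours become NaN rows
    filled = g.set_index("time")[["output"]].reindex(full_index)
    filled.insert(0, "name", name)
    pieces.append(filled.reset_index())

df_complete = pd.concat(pieces, ignore_index=True)[["name", "time", "output"]]
print(df_complete)
```

Days with no data for a given name (2018-02-23 for x here) are never generated, so no filtering step is needed afterwards.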
I have a dataframe with 3 columns:
file = glob.glob('InputFile.csv')
for i in file:
    df = pd.read_csv(i)
    df['Date'] = pd.to_datetime(df['Date'])
    print(df)
Date X Y
0 2020-02-13 00:11:59 -91.3900 -31.7914
1 2020-02-13 01:11:59 -87.1513 -34.6838
2 2020-02-13 02:11:59 -82.9126 -37.5762
3 2020-02-13 03:11:59 -79.3558 -40.2573
4 2020-02-13 04:11:59 -73.2293 -44.2463
... ... ... ...
2034 2020-05-04 18:00:00 -36.4645 -18.3421
2035 2020-05-04 19:00:00 -36.5767 -16.8311
2036 2020-05-04 20:00:00 -36.0170 -14.9356
2037 2020-05-04 21:00:00 -36.4354 -11.0533
2038 2020-05-04 22:00:00 -40.3424 -11.4000
[2039 rows x 3 columns]
print(converted_file.dtypes)
Date datetime64[ns]
xTilt float64
yTilt float64
dtype: object
I would like the output to be:
Date X Y X_Diff Y_Diff
0 2020-02-16 00:11:59 -38.46270 -70.8352 -38.46270 -70.8352
1 2020-02-23 00:11:59 -80.70250 -7.1893 -42.23980 63.6459
2 2020-03-01 00:11:59 -47.38980 -39.2652 33.31270 -32.0759
3 2020-03-08 00:00:00 -35.65350 -64.5058 11.73630 -25.2406
4 2020-03-15 00:00:00 -43.03290 -15.8425 -7.37940 48.6633
5 2020-03-22 00:00:00 -19.77130 -25.5298 23.26160 -9.6873
6 2020-03-29 00:00:00 -13.18940 12.4093 6.58190 37.9391
7 2020-04-05 00:00:00 -8.49098 27.8407 4.69842 15.4314
8 2020-04-12 00:00:00 -19.05360 20.0445 -10.56262 -7.7962
9 2020-04-26 00:00:00 -25.61330 31.6306 -6.55970 11.5861
10 2020-05-03 00:00:00 -46.09250 -30.3557 -20.47920 -61.9863
I would like to find in InputFile.csv all dates that fall on a Sunday and, for each Sunday, extract only the first entry of that day (not the later times) together with its X and Y values. These rows should be saved to a new dataframe in which I can do subtraction on X and Y: the very first X and Y are copied into X_Diff and Y_Diff, respectively, and for each following row X_Diff is the current X minus the previous X. The same goes for Y, until the end of the file.
Here is my solution.
1. Preparation: I will need to generate some random data to be worked on.
import pandas as pd
import numpy as np
df = pd.date_range('2020-02-13', '2020-05-04', freq='1H').to_frame(name='Date').reset_index(drop=True)
df['X'] = np.random.randn(df.shape[0]) * 100
df['Y'] = np.random.randn(df.shape[0]) * 100
The data is like this:
Date X Y
0 2020-02-13 00:00:00 -12.044751 165.962038
1 2020-02-13 01:00:00 63.537406 65.137176
2 2020-02-13 02:00:00 67.555256 114.186898
... ... ... ..
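2. Filter the dataframe to keep Sundays only, then add a date-only column for grouping. Note that pandas numbers Monday as 0 and Sunday as 6, so Sunday is dayofweek == 6; the sample outputs in this answer were generated with == 0 and therefore show Mondays.
df = df[df.Date.dt.dayofweek == 6].copy()
df['date_only'] = df.Date.dt.date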
Then, it looks like this.
Date X Y date_only
96 2020-02-17 00:00:00 26.632391 120.311315 2020-02-17
97 2020-02-17 01:00:00 -14.111209 21.543440 2020-02-17
98 2020-02-17 02:00:00 -11.941086 -51.303122 2020-02-17
99 2020-02-17 03:00:00 -48.612563 137.023917 2020-02-17
100 2020-02-17 04:00:00 133.843010 -47.168805 2020-02-17
... ... ... ... ...
1796 2020-04-27 20:00:00 -158.310600 30.149292 2020-04-27
1797 2020-04-27 21:00:00 170.212825 181.626611 2020-04-27
1798 2020-04-27 22:00:00 59.773796 11.262186 2020-04-27
1799 2020-04-27 23:00:00 -99.757428 83.529157 2020-04-27
1944 2020-05-04 00:00:00 -168.435315 245.884281 2020-05-04
3. Next step, sort the data frame by "Date". Then, group the dataframe by "date_only". After that, take the first row of each group.
df = df.sort_values(by=['Date'])
df = df.groupby('date_only').apply(lambda g: g.head(1)).reset_index(drop=True).drop(columns=['date_only'])
Results:
Date X Y
0 2020-02-17 4.196690 -205.843619
1 2020-02-24 -189.811351 -5.294274
2 2020-03-02 -231.596763 -46.989246
3 2020-03-09 76.561269 -40.188202
4 2020-03-16 -18.653363 52.376442
5 2020-03-23 106.758484 22.969963
6 2020-03-30 -133.601545 185.561830
7 2020-04-06 -57.748555 -187.878427
8 2020-04-13 57.648834 10.365917
9 2020-04-20 -47.959093 177.455676
10 2020-04-27 -30.527067 -37.046330
11 2020-05-04 -52.854252 -136.069205
4. Last step, get the difference for each X/Y value with their previous value.
df['X_Diff'] = df.X.diff()
df['Y_Diff'] = df.Y.diff()
Results:
Date X Y X_Diff Y_Diff
0 2020-02-17 4.196690 -205.843619 NaN NaN
1 2020-02-24 -189.811351 -5.294274 -194.008042 200.549345
2 2020-03-02 -231.596763 -46.989246 -41.785412 -41.694972
3 2020-03-09 76.561269 -40.188202 308.158031 6.801044
4 2020-03-16 -18.653363 52.376442 -95.214632 92.564644
5 2020-03-23 106.758484 22.969963 125.411847 -29.406479
6 2020-03-30 -133.601545 185.561830 -240.360029 162.591867
7 2020-04-06 -57.748555 -187.878427 75.852990 -373.440257
8 2020-04-13 57.648834 10.365917 115.397389 198.244344
9 2020-04-20 -47.959093 177.455676 -105.607927 167.089758
10 2020-04-27 -30.527067 -37.046330 17.432026 -214.502006
11 2020-05-04 -52.854252 -136.069205 -22.327185 -99.022874
5. If you are not happy with the "NaN" for the first row, then just fill it with the X/Y columns' original values.
df['X_Diff'] = df['X_Diff'].fillna(df.X)
df['Y_Diff'] = df['Y_Diff'].fillna(df.Y)
Final results:
Date X Y X_Diff Y_Diff
0 2020-02-17 4.196690 -205.843619 4.196690 -205.843619
1 2020-02-24 -189.811351 -5.294274 -194.008042 200.549345
2 2020-03-02 -231.596763 -46.989246 -41.785412 -41.694972
3 2020-03-09 76.561269 -40.188202 308.158031 6.801044
4 2020-03-16 -18.653363 52.376442 -95.214632 92.564644
5 2020-03-23 106.758484 22.969963 125.411847 -29.406479
6 2020-03-30 -133.601545 185.561830 -240.360029 162.591867
7 2020-04-06 -57.748555 -187.878427 75.852990 -373.440257
8 2020-04-13 57.648834 10.365917 115.397389 198.244344
9 2020-04-20 -47.959093 177.455676 -105.607927 167.089758
10 2020-04-27 -30.527067 -37.046330 17.432026 -214.502006
11 2020-05-04 -52.854252 -136.069205 -22.327185 -99.022874
Note: There is no time displayed in the "Date" field in the final result. This is because the data I generated for those dates are hourly. So, the first row of each Sunday is XXXX-XX-XX 00:00:00, and the time 00:00:00 will not be displayed in pandas, although they actually exist.
Here is the Colab Link. You can have all my code in a notebook here.
https://colab.research.google.com/drive/1ecSSvJW0waCU19KPoj5uiiYmHp9SSQOf?usp=sharing
I will create a dataframe as Christopher did:
import pandas as pd
import numpy as np
df = pd.date_range('2020-02-13', '2020-05-04', freq='1H').to_frame(name='Date').reset_index(drop=True)
df['X'] = np.random.randn(df.shape[0]) * 100
df['Y'] = np.random.randn(df.shape[0]) * 100
Dataframe view
First, set the datetime column as the index:
df = df.set_index('Date')
Second, keep only the rows for Sundays:
sunday_df= df[df.index.dayofweek == 6]
Third, resample the values to daily frequency, take the first value of each day (the question asks for the first entry of every Sunday) and remove the empty rows created for the days in between:
sunday_df = sunday_df.resample('D').first().dropna()
Lastly, do the subtraction:
sunday_df['X_Diff'] = sunday_df.X.diff()
sunday_df['Y_Diff'] = sunday_df.Y.diff()
The last view of the new dataframe
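Putting the pieces from both answers together, the whole task can be sketched in a few lines: keep Sundays, keep the first row of each Sunday, then take differences and fill the first row with the original values. The data is randomly generated as in the answers above (an assumption, not the asker's InputFile.csv), and `first_rows` is an illustrative name.

```python
import numpy as np
import pandas as pd

# Randomly generated hourly data standing in for InputFile.csv
rng = np.random.default_rng(0)
dates = pd.date_range("2020-02-13", "2020-05-04", freq="60min")
df = pd.DataFrame({
    "Date": dates,
    "X": rng.normal(size=len(dates)) * 100,
    "Y": rng.normal(size=len(dates)) * 100,
})

# Sunday is dayofweek == 6 (Monday is 0)
sundays = df[df["Date"].dt.dayofweek == 6].sort_values("Date")

# Keep only the first row of each calendar day
first_rows = sundays[~sundays["Date"].dt.normalize().duplicated()].reset_index(drop=True)

# Difference to the previous Sunday; the first row keeps its own value
for col in ["X", "Y"]:
    first_rows[f"{col}_Diff"] = first_rows[col].diff().fillna(first_rows[col])

print(first_rows)
```

Because the rows are sorted before duplicates are dropped, the kept row is always the earliest entry of that Sunday, even if the input file is not in chronological order.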