I have a dataframe as below. I want to calculate df['Close'].pct_change() in the slices between posizione==1 and posizione==-1. So in this example I have to calculate the pct_change in the slices:
between (2022-04-06 09:00:00, 2022-04-06 10:15:00);
between (2022-04-06 12:30:00, 2022-04-06 14:00:00);
between (2022-04-06 15:15:00, 2022-04-06 16:00:00);
so a formula like
df['between'] = np.where(df['posizione'].between(1, -1), df['Close'].pct_change(), 0)
where outside the 1/-1 slices the pct_change() is zero.
Is it possible? Thanks in advance.
Time Close buy_80_150 posizione
0 2022-04-06 07:30:00 1.053 0.0 0.0
1 2022-04-06 07:45:00 1.049 0.0 0.0
2 2022-04-06 08:00:00 1.046 0.0 0.0
3 2022-04-06 08:15:00 1.049 0.0 0.0
4 2022-04-06 08:30:00 1.048 0.0 0.0
5 2022-04-06 08:45:00 1.049 0.0 0.0
6 2022-04-06 09:00:00 1.047 1.0 1.0
7 2022-04-06 09:15:00 1.044 1.0 0.0
8 2022-04-06 09:30:00 1.044 1.0 0.0
9 2022-04-06 09:45:00 1.050 1.0 0.0
10 2022-04-06 10:00:00 1.041 1.0 0.0
11 2022-04-06 10:15:00 1.048 0.0 -1.0
12 2022-04-06 10:30:00 1.040 0.0 0.0
13 2022-04-06 10:45:00 1.032 0.0 0.0
14 2022-04-06 11:00:00 1.017 0.0 0.0
15 2022-04-06 11:15:00 1.018 0.0 0.0
16 2022-04-06 11:30:00 1.021 0.0 0.0
17 2022-04-06 11:45:00 1.023 0.0 0.0
18 2022-04-06 12:00:00 1.021 0.0 0.0
19 2022-04-06 12:15:00 1.024 0.0 0.0
20 2022-04-06 12:30:00 1.021 1.0 1.0
21 2022-04-06 12:45:00 1.014 1.0 0.0
22 2022-04-06 13:00:00 1.018 1.0 0.0
23 2022-04-06 13:15:00 1.024 1.0 0.0
24 2022-04-06 13:30:00 1.014 1.0 0.0
25 2022-04-06 13:45:00 1.011 1.0 0.0
26 2022-04-06 14:00:00 1.014 0.0 -1.0
27 2022-04-06 14:15:00 1.017 0.0 0.0
28 2022-04-06 14:30:00 1.019 0.0 0.0
29 2022-04-06 14:45:00 1.015 0.0 0.0
30 2022-04-06 15:00:00 1.009 0.0 0.0
31 2022-04-06 15:15:00 1.003 1.0 1.0
32 2022-04-06 15:30:00 1.007 1.0 0.0
33 2022-04-06 15:45:00 1.007 1.0 0.0
34 2022-04-06 16:00:00 1.002 0.0 -1.0
35 2022-04-06 16:15:00 0.994 0.0 0.0
36 2022-04-06 16:30:00 0.993 0.0 0.0
37 2022-04-06 16:45:00 0.992 0.0 0.0
38 2022-04-06 17:00:00 0.980 0.0 0.0
IIUC, you can compute a mask and use it on your data before pct_change:
# rows from each 1 up to (but not including) the next -1
m1 = df['posizione'].replace(0, float('nan')).ffill().eq(1)
# the -1 rows that close each slice
m2 = df['posizione'].eq(-1)
# hide Close outside the slices, then compute pct_change
df['pct_change'] = df['Close'].where(m1|m2).pct_change().fillna(0)
output:
Time Close buy_80_150 posizione pct_change
0 2022-04-06 07:30:00 1.053 0.0 0.0 0.000000
1 2022-04-06 07:45:00 1.049 0.0 0.0 0.000000
2 2022-04-06 08:00:00 1.046 0.0 0.0 0.000000
3 2022-04-06 08:15:00 1.049 0.0 0.0 0.000000
4 2022-04-06 08:30:00 1.048 0.0 0.0 0.000000
5 2022-04-06 08:45:00 1.049 0.0 0.0 0.000000
6 2022-04-06 09:00:00 1.047 1.0 1.0 0.000000
7 2022-04-06 09:15:00 1.044 1.0 0.0 -0.002865
8 2022-04-06 09:30:00 1.044 1.0 0.0 0.000000
9 2022-04-06 09:45:00 1.050 1.0 0.0 0.005747
10 2022-04-06 10:00:00 1.041 1.0 0.0 -0.008571
11 2022-04-06 10:15:00 1.048 0.0 -1.0 0.006724
12 2022-04-06 10:30:00 1.040 0.0 0.0 0.000000
13 2022-04-06 10:45:00 1.032 0.0 0.0 0.000000
14 2022-04-06 11:00:00 1.017 0.0 0.0 0.000000
15 2022-04-06 11:15:00 1.018 0.0 0.0 0.000000
16 2022-04-06 11:30:00 1.021 0.0 0.0 0.000000
17 2022-04-06 11:45:00 1.023 0.0 0.0 0.000000
18 2022-04-06 12:00:00 1.021 0.0 0.0 0.000000
19 2022-04-06 12:15:00 1.024 0.0 0.0 0.000000
20 2022-04-06 12:30:00 1.021 1.0 1.0 -0.025763
21 2022-04-06 12:45:00 1.014 1.0 0.0 -0.006856
22 2022-04-06 13:00:00 1.018 1.0 0.0 0.003945
23 2022-04-06 13:15:00 1.024 1.0 0.0 0.005894
24 2022-04-06 13:30:00 1.014 1.0 0.0 -0.009766
25 2022-04-06 13:45:00 1.011 1.0 0.0 -0.002959
26 2022-04-06 14:00:00 1.014 0.0 -1.0 0.002967
27 2022-04-06 14:15:00 1.017 0.0 0.0 0.000000
28 2022-04-06 14:30:00 1.019 0.0 0.0 0.000000
29 2022-04-06 14:45:00 1.015 0.0 0.0 0.000000
30 2022-04-06 15:00:00 1.009 0.0 0.0 0.000000
31 2022-04-06 15:15:00 1.003 1.0 1.0 -0.010848
32 2022-04-06 15:30:00 1.007 1.0 0.0 0.003988
33 2022-04-06 15:45:00 1.007 1.0 0.0 0.000000
34 2022-04-06 16:00:00 1.002 0.0 -1.0 -0.004965
35 2022-04-06 16:15:00 0.994 0.0 0.0 0.000000
36 2022-04-06 16:30:00 0.993 0.0 0.0 0.000000
37 2022-04-06 16:45:00 0.992 0.0 0.0 0.000000
38 2022-04-06 17:00:00 0.980 0.0 0.0 0.000000
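The masking idea can be checked on a tiny frame. This is a minimal sketch with made-up prices and signals (not the asker's data): prices outside each 1 ... -1 span are masked to NaN, so pct_change only sees values inside a trade.

```python
import pandas as pd

# Toy data: one span opens at row 1 (posizione == 1) and closes at row 3 (-1)
df = pd.DataFrame({
    'Close':     [1.00, 1.10, 1.21, 1.00, 1.00, 2.00],
    'posizione': [0.0,  1.0,  0.0, -1.0,  0.0,  0.0],
})

# Forward-fill the last non-zero signal; rows from a 1 until the next -1 map to True
m1 = df['posizione'].replace(0, float('nan')).ffill().eq(1)
# The closing -1 rows themselves
m2 = df['posizione'].eq(-1)

# Mask Close outside the active span, then take pct_change; masked rows become 0
df['pct_change'] = df['Close'].where(m1 | m2).pct_change().fillna(0)
print(df)
```

Rows 2 and 3 get the in-span returns (about 0.1 and -0.174); every row outside the span is 0.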
Each row in this DataFrame represents one minute, but some minutes are missing from the data pulled from the API (you'll see 09:51:00 is missing):
ticker date time vol vwap open high low close lbh lah trades
0 AACG 2022-01-06 09:30:00 33042 1.8807 1.8900 1.9200 1.8700 1.9017 0.0 0.0 68
1 AACG 2022-01-06 09:31:00 5306 1.9073 1.9100 1.9200 1.8801 1.9100 0.0 0.0 27
2 AACG 2022-01-06 09:32:00 3496 1.8964 1.9100 1.9193 1.8800 1.8900 0.0 0.0 17
3 AACG 2022-01-06 09:33:00 5897 1.9377 1.8900 1.9500 1.8900 1.9500 0.0 0.0 15
4 AACG 2022-01-06 09:34:00 1983 1.9362 1.9200 1.9499 1.9200 1.9200 0.0 0.0 9
5 AACG 2022-01-06 09:35:00 10725 1.9439 1.9400 1.9600 1.9201 1.9306 0.0 0.0 87
6 AACG 2022-01-06 09:36:00 5942 1.9380 1.9307 1.9400 1.9300 1.9400 0.0 0.0 48
7 AACG 2022-01-06 09:37:00 5759 1.9428 1.9659 1.9659 1.9400 1.9500 0.0 0.0 11
8 AACG 2022-01-06 09:38:00 4855 1.9424 1.9500 1.9500 1.9401 1.9495 0.0 0.0 10
9 AACG 2022-01-06 09:39:00 6275 1.9514 1.9500 1.9700 1.9450 1.9700 0.0 0.0 14
10 AACG 2022-01-06 09:40:00 13695 2.0150 1.9799 2.0500 1.9749 2.0200 0.0 0.0 59
11 AACG 2022-01-06 09:41:00 3252 2.0209 2.0275 2.0300 2.0200 2.0200 0.0 0.0 14
12 AACG 2022-01-06 09:42:00 12082 2.0117 2.0300 2.0400 1.9800 1.9900 0.0 0.0 41
13 AACG 2022-01-06 09:43:00 5148 1.9802 1.9800 1.9999 1.9750 1.9999 0.0 0.0 11
14 AACG 2022-01-06 09:44:00 2764 1.9927 1.9901 1.9943 1.9901 1.9943 0.0 0.0 5
15 AACG 2022-01-06 09:45:00 2379 1.9576 1.9601 1.9601 1.9201 1.9201 0.0 0.0 10
16 AACG 2022-01-06 09:46:00 8762 1.9852 1.9550 1.9900 1.9550 1.9900 0.0 0.0 35
17 AACG 2022-01-06 09:47:00 1343 1.9704 1.9700 1.9738 1.9700 1.9701 0.0 0.0 5
18 AACG 2022-01-06 09:48:00 17080 1.9696 1.9700 1.9800 1.9600 1.9600 0.0 0.0 9
19 AACG 2022-01-06 09:49:00 9004 1.9600 1.9600 1.9600 1.9600 1.9600 0.0 0.0 9
20 AACG 2022-01-06 09:50:00 9224 1.9603 1.9600 1.9613 1.9600 1.9613 0.0 0.0 4
21 AACG 2022-01-06 09:52:00 16914 1.9921 1.9800 2.0400 1.9750 2.0399 0.0 0.0 67
22 AACG 2022-01-06 09:53:00 4665 1.9866 1.9900 2.0395 1.9801 1.9900 0.0 0.0 37
23 AACG 2022-01-06 09:55:00 2107 2.0049 1.9900 2.0100 1.9900 2.0099 0.0 0.0 10
24 AACG 2022-01-06 09:56:00 3003 2.0028 2.0000 2.0099 2.0000 2.0099 0.0 0.0 23
25 AACG 2022-01-06 09:57:00 8489 2.0272 2.0100 2.0400 2.0100 2.0300 0.0 0.0 34
26 AACG 2022-01-06 09:58:00 6050 2.0155 2.0300 2.0300 2.0150 2.0150 0.0 0.0 6
27 AACG 2022-01-06 09:59:00 61623 2.0449 2.0300 2.0700 2.0300 2.0699 0.0 0.0 83
28 AACG 2022-01-06 10:00:00 19699 2.0856 2.0699 2.1199 2.0600 2.1100 0.0 0.0 54
I want to insert rows of empty values that carry only the missing time as a value:
missing_data = pd.DataFrame({'ticker': ['AACG'], 'date': ['2022-01-06'], 'time': ['09:51:00'],
'vol': [0], 'vwap': [0.0], 'open': [0.0], 'high': [0.0], 'low': [0.0],
'close': [0.0], 'lbh': [0.0], 'lah': [0.0], 'trades': [0]}, index=[21])
It would look something like this:
ticker date time vol vwap open high low close lbh lah trades
21 AACG 2022-01-06 09:51:00 0 0.00 0.00 0.00 0.00 0.00 0.0 0.0 0
With the help of someone, I've managed to isolate the areas that show me where the missing values are at:
time_in_minutes = pd.to_timedelta(df['time'].astype(str)).astype('timedelta64[m]')
indices_where_the_next_minute_is_missing = np.where(np.diff(time_in_minutes) != 1)[0]
out = df.loc[indices_where_the_next_minute_is_missing]
Simply adding 1 to time_in_minutes will give me the correction I need:
timeinminutesplus1 = pd.to_timedelta(out['time'].astype(str)).astype('timedelta64[m]') + 1
But how do I turn it back into a datetime.time datatype and insert it into the DataFrame?
Building off of my answer to your previous question, first expand your DataFrame to include NaN rows for missing minutes.
time = pd.to_timedelta(df['time'].astype(str)).astype('timedelta64[m]')
out = df.set_index(time).reindex(np.arange(time.iloc[0], time.iloc[-1] + 1)).reset_index(drop=True)
then given your missing data DataFrame
missing_data = pd.DataFrame({'ticker': ['AACG'], 'date': ['2022-01-06'], 'time': ['09:51:00'],
'vol': [0], 'vwap': [0.0], 'open': [0.0], 'high': [0.0], 'low': [0.0],
'close': [0.0], 'lbh': [0.0], 'lah': [0.0], 'trades': [0]}, index=[21])
which looks like:
ticker date time vol vwap open high low close lbh lah trades
21 AACG 2022-01-06 09:51:00 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0
you can update out:
out.update(missing_data)
Then out becomes:
ticker date time vol vwap open high low close lbh lah trades
0 AACG 2022-01-06 09:51:00 0.0 0.0000 0.0000 0.0000 0.0000 0.0000 0.0 0.0 0.0
1 AACG 2022-01-06 09:31:00 5306.0 1.9073 1.9100 1.9200 1.8801 1.9100 0.0 0.0 27.0
2 AACG 2022-01-06 09:32:00 3496.0 1.8964 1.9100 1.9193 1.8800 1.8900 0.0 0.0 17.0
3 AACG 2022-01-06 09:33:00 5897.0 1.9377 1.8900 1.9500 1.8900 1.9500 0.0 0.0 15.0
...
20 AACG 2022-01-06 09:50:00 9224.0 1.9603 1.9600 1.9613 1.9600 1.9613 0.0 0.0 4.0
21 AACG 2022-01-06 09:51:00 0.0 0.0000 0.0000 0.0000 0.0000 0.0000 0.0 0.0 0.0
22 AACG 2022-01-06 09:52:00 16914.0 1.9921 1.9800 2.0400 1.9750 2.0399 0.0 0.0 67.0
23 AACG 2022-01-06 09:53:00 4665.0 1.9866 1.9900 2.0395 1.9801 1.9900 0.0 0.0 37.0
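The reindex step can be sketched end to end on toy minute data (not the asker's CSV): convert 'time' to integer minutes, reindex over the full minute range so each missing minute becomes a NaN row, then fill those rows with defaults.

```python
import numpy as np
import pandas as pd

# Toy data: 09:51 is missing between 09:50 and 09:52
df = pd.DataFrame({
    'time': ['09:49:00', '09:50:00', '09:52:00'],
    'vol':  [9004, 9224, 16914],
})

# Minutes since midnight as integers (589, 590, 592)
minutes = pd.to_timedelta(df['time']).dt.total_seconds().div(60).astype(int)

# Reindex over every minute in the range; the gap becomes a NaN row
out = (df.set_index(minutes)
         .reindex(np.arange(minutes.iloc[0], minutes.iloc[-1] + 1))
         .reset_index(drop=True))

# Rebuild the time string for the new rows from their minute offset, zero the rest
gap = out['time'].isna()
out.loc[gap, 'time'] = [str(pd.Timedelta(minutes=int(m))).split()[-1]
                        for m in np.where(gap)[0] + minutes.iloc[0]]
out.loc[gap, 'vol'] = 0
print(out)
```

The result has four rows, with a freshly built '09:51:00' row of zeros in position 2.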
I used the code that you provided and then iterated over the result in order to add the missing rows. The final result is then sorted again to get the order and indices correctly.
import datetime
# Reading your dataframe
df = pd.read_csv('missing_minute.csv', sep=';', index_col='index')
# Define a default row to add for missing rows
default_row = {'ticker':'AACG', 'date':'2022-01-06', 'time': '00:00:00', 'vol':0.0, 'vwap':0.0, 'open':0.0, 'high':0.0, 'low':0.0, 'close':0.0, 'lbh':0.0, 'lah':0.0, 'trades':0.0}
# Your logic to find the rows before the missing
time_in_minutes = pd.to_timedelta(df['time'].astype(str)).astype('timedelta64[m]')
indices_where_the_next_minute_is_missing = np.where(np.diff(time_in_minutes) != 1)[0]
out = df.loc[indices_where_the_next_minute_is_missing]
# Iterating over the rows
for i, e in out.iterrows():
    # Extract the time of the row before the gap and parse it
    time_of_previous_row = datetime.datetime.strptime(e['time'], '%H:%M:%S')
    # Add one minute for the new entry
    time_of_new_row = (time_of_previous_row + datetime.timedelta(minutes=1)).strftime("%H:%M:%S")
    # Set the new time on the default row and append it to the dataframe
    # (DataFrame.append was removed in pandas 2.0, hence pd.concat)
    default_row['time'] = time_of_new_row
    df = pd.concat([df, pd.DataFrame([default_row])], ignore_index=True)
# Sort the dataframe by the time column and reset the index
df = df.sort_values(by='time').reset_index(drop=True)
df
Output:
ticker date time vol vwap open high low close lbh lah trades
0 AACG 2022-01-06 09:30:00 33042.0 1.8807 1.89 1.92 1.87 1.9017 0.0 0.0 68.0
1 AACG 2022-01-06 09:31:00 5306.0 1.9073 1.91 1.92 1.8801 1.91 0.0 0.0 27.0
2 AACG 2022-01-06 09:32:00 3496.0 1.8964 1.91 1.9193 1.88 1.89 0.0 0.0 17.0
3 AACG 2022-01-06 09:33:00 5897.0 1.9377 1.89 1.95 1.89 1.95 0.0 0.0 15.0
4 AACG 2022-01-06 09:34:00 1983.0 1.9362 1.92 1.9499 1.92 1.92 0.0 0.0 9.0
5 AACG 2022-01-06 09:35:00 10725.0 1.9439 1.94 1.96 1.9201 1.9306 0.0 0.0 87.0
6 AACG 2022-01-06 09:36:00 5942.0 1.938 1.9307 1.94 1.93 1.94 0.0 0.0 48.0
7 AACG 2022-01-06 09:37:00 5759.0 1.9428 1.9659 1.9659 1.94 1.95 0.0 0.0 11.0
8 AACG 2022-01-06 09:38:00 4855.0 1.9424 1.95 1.95 1.9401 1.9495 0.0 0.0 10.0
9 AACG 2022-01-06 09:39:00 6275.0 1.9514 1.95 1.97 1.945 1.97 0.0 0.0 14.0
10 AACG 2022-01-06 09:40:00 13695.0 2.015 1.9799 2.05 1.9749 2.02 0.0 0.0 59.0
11 AACG 2022-01-06 09:41:00 3252.0 2.0209 2.0275 2.03 2.02 2.02 0.0 0.0 14.0
12 AACG 2022-01-06 09:42:00 12082.0 2.0117 2.03 2.04 1.98 1.99 0.0 0.0 41.0
13 AACG 2022-01-06 09:43:00 5148.0 1.9802 1.98 1.9999 1.975 1.9999 0.0 0.0 11.0
14 AACG 2022-01-06 09:44:00 2764.0 1.9927 1.9901 1.9943 1.9901 1.9943 0.0 0.0 5.0
15 AACG 2022-01-06 09:45:00 2379.0 1.9576 1.9601 1.9601 1.9201 1.9201 0.0 0.0 10.0
16 AACG 2022-01-06 09:46:00 8762.0 1.9852 1.955 1.99 1.955 1.99 0.0 0.0 35.0
17 AACG 2022-01-06 09:47:00 1343.0 1.9704 1.97 1.9738 1.97 1.9701 0.0 0.0 5.0
18 AACG 2022-01-06 09:48:00 17080.0 1.9696 1.97 1.98 1.96 1.96 0.0 0.0 9.0
19 AACG 2022-01-06 09:49:00 9004.0 1.96 1.96 1.96 1.96 1.96 0.0 0.0 9.0
20 AACG 2022-01-06 09:50:00 9224.0 1.9603 1.96 1.9613 1.96 1.9613 0.0 0.0 4.0
21 AACG 2022-01-06 09:51:00 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
22 AACG 2022-01-06 09:52:00 16914.0 1.9921 1.98 2.04 1.975 2.0399 0.0 0.0 67.0
23 AACG 2022-01-06 09:53:00 4665.0 1.9866 1.99 2.0395 1.9801 1.99 0.0 0.0 37.0
24 AACG 2022-01-06 09:54:00 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
25 AACG 2022-01-06 09:55:00 2107.0 2.0049 1.99 2.01 1.99 2.0099 0.0 0.0 10.0
26 AACG 2022-01-06 09:56:00 3003.0 2.0028 2.0 2.0099 2.0 2.0099 0.0 0.0 23.0
27 AACG 2022-01-06 09:57:00 8489.0 2.0272 2.01 2.04 2.01 2.03 0.0 0.0 34.0
28 AACG 2022-01-06 09:58:00 6050.0 2.0155 2.03 2.03 2.015 2.015 0.0 0.0 6.0
29 AACG 2022-01-06 09:59:00 61623.0 2.0449 2.03 2.07 2.03 2.0699 0.0 0.0 83.0
30 AACG 2022-01-06 10:00:00 19699.0 2.0856 2.0699 2.1199 2.06 2.11 0.0 0.0 54.0
P.S.: This only works if a single entry is missing at a time, not when several consecutive entries are missing (e.g. both 09:51 and 09:52). That case could be handled by checking how many rows in a sequence are missing.
Also, if you have data spanning multiple days, the code has to be adapted slightly: set the date inside the loop, and sort by date and time at the end.
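The multi-gap extension mentioned in the P.S. can be sketched like this (toy data, not the asker's CSV, with both 09:51 and 09:52 missing): walk each pair of consecutive rows and append one default row per missing minute.

```python
import datetime
import pandas as pd

# Toy data with a two-minute gap between 09:50 and 09:53
df = pd.DataFrame({'time': ['09:49:00', '09:50:00', '09:53:00'],
                   'vol':  [10, 20, 30]})

fmt = '%H:%M:%S'
times = [datetime.datetime.strptime(t, fmt) for t in df['time']]
new_rows = []
for prev, nxt in zip(times, times[1:]):
    t = prev + datetime.timedelta(minutes=1)
    while t < nxt:                           # one default row per missing minute
        new_rows.append({'time': t.strftime(fmt), 'vol': 0})
        t += datetime.timedelta(minutes=1)

# DataFrame.append was removed in pandas 2.0, so use pd.concat instead
df = (pd.concat([df, pd.DataFrame(new_rows)], ignore_index=True)
        .sort_values('time').reset_index(drop=True))
print(df['time'].tolist())
# ['09:49:00', '09:50:00', '09:51:00', '09:52:00', '09:53:00']
```

Both gap minutes are filled with zero rows, and the frame ends up sorted by time.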
Similar to enke's answer.
It sounds like you may want to create a new data frame with the proper date and time index that you desire (without missing rows). Then fill in the new data frame with the data that you have.
You can do this using pandas.DataFrame.update()
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.update.html
Here is an example:
start_datetime = '2022-01-06 09:30:00'
end_datetime = '2022-01-06 10:00:00'
cols = ['ticker','date','time','vol','vwap','open','high','low','close','lbh','lah','trades']
new_df = pd.DataFrame(columns=cols,
                      index=pd.date_range(start=start_datetime,
                                          end=end_datetime, freq='min'))
new_df['date'] = [d.date() for d in new_df.index]
new_df['time'] = [d.time() for d in new_df.index]
new_df = new_df.fillna(0.0)  # fillna returns a copy, so reassign
start_datetime = '2022-01-06 09:35:00'
end_datetime = '2022-01-06 9:40:00'
cols = ['ticker','date','time','vol','vwap','open','high','low','close','lbh','lah','trades']
other_df = pd.DataFrame(columns=cols,
                        index=pd.date_range(start=start_datetime,
                                            end=end_datetime, freq='min'))
other_df['date'] = [d.date() for d in other_df.index]
other_df['time'] = [d.time() for d in other_df.index]
other_df = other_df.fillna(3)  # fillna returns a copy, so reassign
final_df = new_df.copy()
final_df.update(other_df)
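The semantics this answer relies on can be verified in isolation. This is a small sketch with made-up values: update() aligns on the index and overwrites only where the other frame has non-NaN values, leaving everything else untouched.

```python
import pandas as pd

# A minute-indexed frame of zeros, and a one-row "patch" for 09:31
base = pd.DataFrame({'vol': [0.0, 0.0, 0.0]},
                    index=pd.date_range('2022-01-06 09:30:00', periods=3, freq='min'))
patch = pd.DataFrame({'vol': [5306.0]},
                     index=pd.date_range('2022-01-06 09:31:00', periods=1, freq='min'))

# update() works in place, matching rows by index label
base.update(patch)
print(base['vol'].tolist())  # [0.0, 5306.0, 0.0]
```

Only the 09:31 row changes; rows missing from the patch keep their original values.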
I have a timeseries with data related to the irradiance of the sun. I have data for every hour during a year, but every month has data from a different year. For example, the data taken in March can be from 2012 and the data taken in January can be from 2014.
T2m RH G(h) Gb(n) Gd(h) IR(h) WS10m WD10m SP Hour Month
time(UTC)
2012-01-01 00:00:00 16.00 81.66 0.0 -0.0 0.0 310.15 2.56 284.0 102252.0 0 1
2012-01-01 01:00:00 15.97 82.42 0.0 -0.0 0.0 310.61 2.49 281.0 102228.0 1 1
2012-01-01 02:00:00 15.93 83.18 0.0 -0.0 0.0 311.06 2.41 278.0 102205.0 2 1
2012-01-01 03:00:00 15.89 83.94 0.0 -0.0 0.0 311.52 2.34 281.0 102218.0 3 1
2012-01-01 04:00:00 15.85 84.70 0.0 -0.0 0.0 311.97 2.26 284.0 102232.0 4 1
... ... ... ... ... ... ... ... ... ... ... ...
2011-12-31 19:00:00 16.19 77.86 0.0 -0.0 0.0 307.88 2.94 301.0 102278.0 19 12
2011-12-31 20:00:00 16.15 78.62 0.0 -0.0 0.0 308.33 2.86 302.0 102295.0 20 12
2011-12-31 21:00:00 16.11 79.38 0.0 -0.0 0.0 308.79 2.79 297.0 102288.0 21 12
2011-12-31 22:00:00 16.08 80.14 0.0 -0.0 0.0 309.24 2.71 292.0 102282.0 22 12
2011-12-31 23:00:00 16.04 80.90 0.0 -0.0 0.0 309.70 2.64 287.0 102275.0 23 12
My question is: is there a way I can set all the data to a certain year?
For example, set all data to 2014:
T2m RH G(h) Gb(n) Gd(h) IR(h) WS10m WD10m SP Hour Month
time(UTC)
2014-01-01 00:00:00 16.00 81.66 0.0 -0.0 0.0 310.15 2.56 284.0 102252.0 0 1
2014-01-01 01:00:00 15.97 82.42 0.0 -0.0 0.0 310.61 2.49 281.0 102228.0 1 1
2014-01-01 02:00:00 15.93 83.18 0.0 -0.0 0.0 311.06 2.41 278.0 102205.0 2 1
2014-01-01 03:00:00 15.89 83.94 0.0 -0.0 0.0 311.52 2.34 281.0 102218.0 3 1
2014-01-01 04:00:00 15.85 84.70 0.0 -0.0 0.0 311.97 2.26 284.0 102232.0 4 1
... ... ... ... ... ... ... ... ... ... ... ...
2014-12-31 19:00:00 16.19 77.86 0.0 -0.0 0.0 307.88 2.94 301.0 102278.0 19 12
2014-12-31 20:00:00 16.15 78.62 0.0 -0.0 0.0 308.33 2.86 302.0 102295.0 20 12
2014-12-31 21:00:00 16.11 79.38 0.0 -0.0 0.0 308.79 2.79 297.0 102288.0 21 12
2014-12-31 22:00:00 16.08 80.14 0.0 -0.0 0.0 309.24 2.71 292.0 102282.0 22 12
2014-12-31 23:00:00 16.04 80.90 0.0 -0.0 0.0 309.70 2.64 287.0 102275.0 23 12
Thanks in advance.
Use offsets.DateOffset with year (without the s) to set the same year across the whole DatetimeIndex:
rng = pd.date_range('2009-04-03', periods=10, freq='350D')
df = pd.DataFrame({ 'a': range(10)}, rng)
print (df)
a
2009-04-03 0
2010-03-19 1
2011-03-04 2
2012-02-17 3
2013-02-01 4
2014-01-17 5
2015-01-02 6
2015-12-18 7
2016-12-02 8
2017-11-17 9
df.index += pd.offsets.DateOffset(year=2014)
print (df)
a
2014-04-03 0
2014-03-19 1
2014-03-04 2
2014-02-17 3
2014-02-01 4
2014-01-17 5
2014-01-02 6
2014-12-18 7
2014-12-02 8
2014-11-17 9
Another idea with Index.map and replace:
df.index = df.index.map(lambda x: x.replace(year=2014))
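Both approaches can be checked side by side on a tiny index. This is a quick sketch with made-up dates: DateOffset(year=2014) with the singular year *sets* the year (the plural years would *add* 2014 years instead), and it agrees with Timestamp.replace via Index.map.

```python
import pandas as pd

idx = pd.DatetimeIndex(['2012-01-01 00:00:00', '2011-12-31 23:00:00'])

# Singular "year" keyword: replace the year, keep month/day/time
a = idx + pd.offsets.DateOffset(year=2014)
# Same thing with Timestamp.replace applied element-wise
b = idx.map(lambda x: x.replace(year=2014))

print(list(a))  # [Timestamp('2014-01-01 00:00:00'), Timestamp('2014-12-31 23:00:00')]
```

Note that the operation is element-wise, so ordering by date is no longer guaranteed afterwards (as the answer's 2014-04-03 ... 2014-11-17 output shows).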
What's the most succinct way to resample this dataframe:
>>> uneven = pd.DataFrame({'a': [0, 12, 19]}, index=pd.DatetimeIndex(['2020-12-08', '2020-12-20', '2020-12-27']))
>>> print(uneven)
a
2020-12-08 0
2020-12-20 12
2020-12-27 19
...into this dataframe:
>>> daily = pd.DataFrame({'a': range(20)}, index=pd.date_range('2020-12-08', periods=3*7-1, freq='D'))
>>> print(daily)
a
2020-12-08 0
2020-12-09 1
...
2020-12-19 11
2020-12-20 12
2020-12-21 13
...
2020-12-27 19
NB: 12 days between the 8th and 20th Dec, 7 days between the 20th and 27th.
Also, to give clarity of the kind of interpolation/resampling I want to do:
>>> print(daily.diff())
a
2020-12-08 NaN
2020-12-09 1.0
2020-12-10 1.0
...
2020-12-19 1.0
2020-12-20 1.0
2020-12-21 1.0
...
2020-12-27 1.0
The actual data is hierarchical and has multiple columns, but I wanted to start with something I could get my head around:
first_dose second_dose
date areaCode
2020-12-08 E92000001 0.0 0.0
N92000002 0.0 0.0
S92000003 0.0 0.0
W92000004 0.0 0.0
2020-12-20 E92000001 574829.0 0.0
N92000002 16068.0 0.0
S92000003 60333.0 0.0
W92000004 24056.0 0.0
2020-12-27 E92000001 267809.0 0.0
N92000002 14948.0 0.0
S92000003 34535.0 0.0
W92000004 12495.0 0.0
2021-01-03 E92000001 330037.0 20660.0
N92000002 9669.0 1271.0
S92000003 21446.0 44.0
W92000004 14205.0 27.0
I think you need:
df = df.reset_index('areaCode').groupby('areaCode')[['first_dose','second_dose']].resample('D').interpolate()
print (df)
first_dose second_dose
areaCode date
E92000001 2020-12-08 0.000000 0.000000
2020-12-09 47902.416667 0.000000
2020-12-10 95804.833333 0.000000
2020-12-11 143707.250000 0.000000
2020-12-12 191609.666667 0.000000
... ...
W92000004 2020-12-30 13227.857143 11.571429
2020-12-31 13472.142857 15.428571
2021-01-01 13716.428571 19.285714
2021-01-02 13960.714286 23.142857
2021-01-03 14205.000000 27.000000
[108 rows x 2 columns]
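The groupby-resample-interpolate chain can be seen on a minimal example (toy numbers, one group, two known dates three days apart): the daily resample inserts NaN rows, and interpolate() fills them linearly.

```python
import pandas as pd

# Toy data: one areaCode with values known on 2020-12-08 and 2020-12-11
df = pd.DataFrame(
    {'areaCode': ['E', 'E'], 'first_dose': [0.0, 30.0]},
    index=pd.to_datetime(['2020-12-08', '2020-12-11']))
df.index.name = 'date'

# Per group: upsample to daily frequency, then fill the gaps linearly
out = (df.groupby('areaCode')['first_dose']
         .resample('D')
         .interpolate())
print(out.tolist())  # [0.0, 10.0, 20.0, 30.0]
```

The two missing days get evenly spaced values between the known endpoints, which matches the constant daily increments the question asks for.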
I have a timeseries
date
2009-12-23 0.0
2009-12-28 0.0
2009-12-29 0.0
2009-12-30 0.0
2009-12-31 0.0
2010-01-04 0.0
2010-01-05 0.0
2010-01-06 0.0
2010-01-07 0.0
2010-01-08 0.0
2010-01-11 0.0
2010-01-12 0.0
2010-01-13 0.0
2010-01-14 0.0
2010-01-15 0.0
2010-01-18 0.0
2010-01-19 0.0
2010-01-20 0.0
2010-01-21 0.0
2010-01-22 0.0
2010-01-25 0.0
2010-01-26 0.0
2010-01-27 0.0
2010-01-28 0.0
2010-01-29 0.0
2010-02-01 0.0
2010-02-02 0.0
I would like to set the value to 1 based on the following rule:
If the constant is set to 9, this means the 9th of each month. Because
2010-01-09 doesn't exist in the series, I would like to set the next date
that does exist, which is 2010-01-11 above, to 1.
I have tried to create two series: one (series1) with day < 9 set to 1 and one (series2) with day > 9 set to 1, and then series1.shift(1) * series2.
It works in the middle of the month, but not if day is set to 1, because the last date of the previous month is set to 0 in series1.
Assume your timeseries is s with a DatetimeIndex.
First, create a groupby object over each month, flagging index values whose day is greater than or equal to 9 (pd.TimeGrouper has since been removed from pandas; pd.Grouper(freq='M') is its replacement):
g = s.index.to_series().dt.day.ge(9).groupby(pd.Grouper(freq='M'))
Then check that each month has at least one day >= 9, grab the first such date, and assign it the value 1:
s.loc[g.idxmax()[g.any()]] = 1
s
date
2009-12-23 1.0
2009-12-28 0.0
2009-12-29 0.0
2009-12-30 0.0
2009-12-31 0.0
2010-01-04 0.0
2010-01-05 0.0
2010-01-06 0.0
2010-01-07 0.0
2010-01-08 0.0
2010-01-11 1.0
2010-01-12 0.0
2010-01-13 0.0
2010-01-14 0.0
2010-01-15 0.0
2010-01-18 0.0
2010-01-19 0.0
2010-01-20 0.0
2010-01-21 0.0
2010-01-22 0.0
2010-01-25 0.0
2010-01-26 0.0
2010-01-27 0.0
2010-01-28 0.0
2010-01-29 0.0
2010-02-01 0.0
2010-02-02 0.0
Name: val, dtype: float64
Note that 2009-12-23 was also assigned a 1, as it satisfies the requirement as well.
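The idxmax trick is easiest to see on a small series. This is a minimal sketch with toy dates (grouping by to_period('M') here, which avoids the removed pd.TimeGrouper entirely): per month, the first date with day >= 9 gets a 1.

```python
import pandas as pd

# Toy series: January is missing the 9th and 10th, February is missing the 2nd-8th
s = pd.Series(0.0, index=pd.to_datetime(
    ['2010-01-04', '2010-01-08', '2010-01-11', '2010-02-01', '2010-02-09']))

# Per month, a boolean series marking days >= 9
g = s.index.to_series().dt.day.ge(9).groupby(s.index.to_period('M'))

# idxmax gives the first True per month; g.any() drops months with no such day
s.loc[g.idxmax()[g.any()]] = 1

print(s.tolist())  # [0.0, 0.0, 1.0, 0.0, 1.0]
```

2010-01-11 is flagged because the 9th and 10th are absent, and 2010-02-09 because it exists and is the first qualifying day of February.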
I would like to plot a pandas DataFrame that has only a few non-zero entries in each column as a bar graph. This works, but not only does it have the wrong y-axis limits, it also makes the x ticks so closely spaced that the graph is useless. I would like to change the tick step to about one per week and display only day, month and year. I have the following DataFrame:
Observed WRF
2014-06-28 12:00:00 0.0 0.0
2014-06-28 13:00:00 0.0 0.0
2014-06-28 14:00:00 0.0 0.0
2014-06-28 15:00:00 0.0 0.0
2014-06-28 16:00:00 0.0 0.0
2014-06-28 17:00:00 0.0 0.0
2014-06-28 18:00:00 0.0 0.0
2014-06-28 19:00:00 0.0 0.0
2014-06-28 20:00:00 0.0 0.0
2014-06-28 21:00:00 0.0 0.0
2014-06-28 22:00:00 0.0 0.0
2014-06-28 23:00:00 0.0 0.0
2014-06-29 00:00:00 0.0 0.0
2014-06-29 01:00:00 0.0 0.0
2014-06-29 02:00:00 0.0 0.0
2014-06-29 03:00:00 0.0 0.0
2014-06-29 04:00:00 0.0 0.0
2014-06-29 05:00:00 0.0 0.0
2014-06-29 06:00:00 0.0 0.0
2014-06-29 07:00:00 0.0 0.0
2014-06-29 08:00:00 0.0 0.0
2014-06-29 09:00:00 0.0 0.0
2014-06-29 10:00:00 0.0 0.0
2014-06-29 11:00:00 0.0 0.0
2014-06-29 12:00:00 0.0 0.0
2014-06-29 13:00:00 0.0 0.0
2014-06-29 14:00:00 0.0 0.0
2014-06-29 15:00:00 0.0 0.0
2014-06-29 16:00:00 0.0 0.0
2014-06-29 17:00:00 0.0 0.0
... ...
2014-07-04 02:00:00 0.0002 0.0
2014-07-04 03:00:00 0.2466 0.0
2014-07-04 04:00:00 0.7103 0.0
2014-07-04 05:00:00 0.9158 1.93521e-13
2014-07-04 06:00:00 0.6583 0.0
2014-07-04 07:00:00 0.3915 0.0
2014-07-04 08:00:00 0.1249 0.0
2014-07-04 09:00:00 0.0 0.0
... ...
2014-08-30 07:00:00 0.0 0.0
2014-08-30 08:00:00 0.0 0.0
2014-08-30 09:00:00 0.0 0.0
2014-08-30 10:00:00 0.0 0.0
2014-08-30 11:00:00 0.0 0.0
2014-08-30 12:00:00 0.0 0.0
2014-08-30 13:00:00 0.0 0.0
2014-08-30 14:00:00 0.0 0.0
2014-08-30 15:00:00 0.0 0.0
2014-08-30 16:00:00 0.0 0.0
2014-08-30 17:00:00 0.0 0.0
2014-08-30 18:00:00 0.0 0.0
2014-08-30 19:00:00 0.0 0.0
2014-08-30 20:00:00 0.0 0.0
2014-08-30 21:00:00 0.0 0.0
2014-08-30 22:00:00 0.0 0.0
2014-08-30 23:00:00 0.0 0.0
2014-08-31 00:00:00 0.0 0.0
2014-08-31 01:00:00 0.0 0.0
2014-08-31 02:00:00 0.0 0.0
2014-08-31 03:00:00 0.0 0.0
2014-08-31 04:00:00 0.0 0.0
2014-08-31 05:00:00 0.0 0.0
2014-08-31 06:00:00 0.0 0.0
2014-08-31 07:00:00 0.0 0.0
2014-08-31 08:00:00 0.0 0.0
2014-08-31 09:00:00 0.0 0.0
2014-08-31 10:00:00 0.0 0.0
2014-08-31 11:00:00 0.0 0.0
2014-08-31 12:00:00 0.0 0.0
And the following code to plot it:
df4.plot(kind='bar',edgecolor='none',figsize=(16,8),linewidth=2, color=((1,0.502,0),'black'))
plt.legend(prop={'size':16})
plt.subplots_adjust(left=.1, right=0.9, top=0.9, bottom=.1)
plt.title('Five Day WRF Model Comparison Near %.2f,%.2f' %(lat,lon),fontsize=24)
plt.ylabel('Hourly Accumulated Precipitation [mm]',fontsize=18,color='black')
ax4=plt.gca()
maxs4=df4.max()
ax4.set_ylim([0, maxs4.max()])
ax4.xaxis_date()
ax4.xaxis.set_label_coords(0.5, -0.05)
plt.xlabel('Time',fontsize=18,color='black')
plt.show()
The y-axis starts at 0, but continues to about double the maximum value of the y-limit. The x-axis counts by hours, which is what I separated the data by, so that makes sense. However, it is not a helpful display.
Look at this code:
import pandas as pd
import numpy as np
from datetime import datetime
import matplotlib.pylab as plt
from matplotlib.dates import DateFormatter
# Sample data
df_origin = pd.DataFrame(pd.date_range(datetime(2014,6,28,12,0,0),
                                       datetime(2014,8,30,12,0,0), freq='1H'),
                         columns=['Valid Time'])
df_origin = df_origin.set_index('Valid Time')
df_origin['Precipitation'] = np.random.uniform(low=0., high=10., size=len(df_origin.index))
df_origin.iloc[20:100, 0] = 0.       # positional slicing needs iloc on a DatetimeIndex
df_origin.iloc[168:168*2, 0] = 0.    # second week has to be dry
# Plotting
df_origin.plot(y='Precipitation',kind='bar',edgecolor='none',figsize=(16,8),linewidth=2, color=((1,0.502,0)))
plt.legend(prop={'size':16})
plt.subplots_adjust(left=.1, right=0.9, top=0.9, bottom=.1)
plt.title('Precipitation (WRF Model)',fontsize=24)
plt.ylabel('Hourly Accumulated Precipitation [mm]',fontsize=18,color='black')
ax = plt.gca()
plt.gcf().autofmt_xdate()
# skip ticks for X axis
ax.set_xticklabels([dt.strftime('%Y-%m-%d') for dt in df_origin.index])
for i, tick in enumerate(ax.xaxis.get_major_ticks()):
    if i % (24*7) != 0:  # 24 hours * 7 days = 1 week
        tick.set_visible(False)
plt.xlabel('Time',fontsize=18,color='black')
plt.show()