Changing year of DatetimeIndex in Pandas - python

I have a time series with data related to the irradiance of the sun. I have data for every hour during a year, but every month has data from a different year. For example, the data taken in March can be from 2012 and the data taken in January can be from 2014.
T2m RH G(h) Gb(n) Gd(h) IR(h) WS10m WD10m SP Hour Month
time(UTC)
2012-01-01 00:00:00 16.00 81.66 0.0 -0.0 0.0 310.15 2.56 284.0 102252.0 0 1
2012-01-01 01:00:00 15.97 82.42 0.0 -0.0 0.0 310.61 2.49 281.0 102228.0 1 1
2012-01-01 02:00:00 15.93 83.18 0.0 -0.0 0.0 311.06 2.41 278.0 102205.0 2 1
2012-01-01 03:00:00 15.89 83.94 0.0 -0.0 0.0 311.52 2.34 281.0 102218.0 3 1
2012-01-01 04:00:00 15.85 84.70 0.0 -0.0 0.0 311.97 2.26 284.0 102232.0 4 1
... ... ... ... ... ... ... ... ... ... ... ...
2011-12-31 19:00:00 16.19 77.86 0.0 -0.0 0.0 307.88 2.94 301.0 102278.0 19 12
2011-12-31 20:00:00 16.15 78.62 0.0 -0.0 0.0 308.33 2.86 302.0 102295.0 20 12
2011-12-31 21:00:00 16.11 79.38 0.0 -0.0 0.0 308.79 2.79 297.0 102288.0 21 12
2011-12-31 22:00:00 16.08 80.14 0.0 -0.0 0.0 309.24 2.71 292.0 102282.0 22 12
2011-12-31 23:00:00 16.04 80.90 0.0 -0.0 0.0 309.70 2.64 287.0 102275.0 23 12
My question is: is there a way I can set all the data to a certain year?
For example, setting all data to 2014:
T2m RH G(h) Gb(n) Gd(h) IR(h) WS10m WD10m SP Hour Month
time(UTC)
2014-01-01 00:00:00 16.00 81.66 0.0 -0.0 0.0 310.15 2.56 284.0 102252.0 0 1
2014-01-01 01:00:00 15.97 82.42 0.0 -0.0 0.0 310.61 2.49 281.0 102228.0 1 1
2014-01-01 02:00:00 15.93 83.18 0.0 -0.0 0.0 311.06 2.41 278.0 102205.0 2 1
2014-01-01 03:00:00 15.89 83.94 0.0 -0.0 0.0 311.52 2.34 281.0 102218.0 3 1
2014-01-01 04:00:00 15.85 84.70 0.0 -0.0 0.0 311.97 2.26 284.0 102232.0 4 1
... ... ... ... ... ... ... ... ... ... ... ...
2014-12-31 19:00:00 16.19 77.86 0.0 -0.0 0.0 307.88 2.94 301.0 102278.0 19 12
2014-12-31 20:00:00 16.15 78.62 0.0 -0.0 0.0 308.33 2.86 302.0 102295.0 20 12
2014-12-31 21:00:00 16.11 79.38 0.0 -0.0 0.0 308.79 2.79 297.0 102288.0 21 12
2014-12-31 22:00:00 16.08 80.14 0.0 -0.0 0.0 309.24 2.71 292.0 102282.0 22 12
2014-12-31 23:00:00 16.04 80.90 0.0 -0.0 0.0 309.70 2.64 287.0 102275.0 23 12
Thanks in advance.

Use offsets.DateOffset with year (without s) to set the same year for every timestamp in the DatetimeIndex:
import pandas as pd

rng = pd.date_range('2009-04-03', periods=10, freq='350D')
df = pd.DataFrame({'a': range(10)}, rng)
print(df)
a
2009-04-03 0
2010-03-19 1
2011-03-04 2
2012-02-17 3
2013-02-01 4
2014-01-17 5
2015-01-02 6
2015-12-18 7
2016-12-02 8
2017-11-17 9
df.index += pd.offsets.DateOffset(year=2014)
print(df)
a
2014-04-03 0
2014-03-19 1
2014-03-04 2
2014-02-17 3
2014-02-01 4
2014-01-17 5
2014-01-02 6
2014-12-18 7
2014-12-02 8
2014-11-17 9
Another idea is to use Index.map with replace:
df.index = df.index.map(lambda x: x.replace(year=2014))
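Applied to an hourly index like the one in the question, both approaches look like this (a small sketch with made-up timestamps; see the leap-day caveat in the comments):
import pandas as pd

# Hourly sample spanning a year boundary, similar in shape to the question's index
idx = pd.date_range('2011-12-31 22:00', periods=5, freq='H')
df = pd.DataFrame({'G(h)': [0.0] * 5}, index=idx)

# Option 1: DateOffset(year=...) applied to the whole index
df_offset = df.copy()
df_offset.index = df_offset.index + pd.offsets.DateOffset(year=2014)

# Option 2: element-wise replace(); note that ts.replace(year=2014) raises ValueError
# for Feb 29 rows taken from a leap year (2014-02-29 does not exist), whereas
# DateOffset should clamp those to the end of February instead.
df_replace = df.copy()
df_replace.index = df_replace.index.map(lambda ts: ts.replace(year=2014))

print(df_offset.index)
print(df_replace.index)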

Related

How do I solve this NaN error from this function?

Input:
# Fixed-mono-cell temperature
parameters = pvlib.temperature.TEMPERATURE_MODEL_PARAMETERS['sapm']['open_rack_glass_glass']  # extract the specific model parameters
cell_temperature_mono_fixed = pvlib.temperature.sapm_cell(effective_irrad_mono_fixed,
                                                          df['T_a'],
                                                          df['W_s'],
                                                          **parameters)
cell_temperature_mono_fixed
Output:
2005-01-01 01:00:00 NaN
2005-01-01 02:00:00 NaN
2005-01-01 03:00:00 NaN
2005-01-01 04:00:00 NaN
2005-01-01 05:00:00 NaN
..
8755 NaN
8756 NaN
8757 NaN
8758 NaN
8759 NaN
Length: 17520, dtype: float64
cell_temperature_mono_fixed.plot()
Output:
/Users/charlielinck/opt/anaconda3/lib/python3.9/site-packages/pandas/core/indexes/base.py:4024: RuntimeWarning: '
Extra data information:
df: dataframe
date_time Sun_Az Sun_alt GHI DHI DNI T_a W_s
0 2005-01-01 01:00:00 17.9 90.0 0.0 0.0 0.0 15.5 13.3
1 2005-01-01 02:00:00 54.8 90.0 0.0 0.0 0.0 17.0 14.5
2 2005-01-01 03:00:00 73.7 90.0 0.0 0.0 0.0 16.7 14.0
3 2005-01-01 04:00:00 85.7 90.0 0.0 0.0 0.0 16.7 14.2
4 2005-01-01 05:00:00 94.9 90.0 0.0 0.0 0.0 16.7 14.1
5 2005-01-01 06:00:00 103.5 90.0 0.0 0.0 0.0 16.6 14.3
6 2005-01-01 07:00:00 111.6 90.0 0.0 0.0 0.0 16.5 13.8
7 2005-01-01 08:00:00 120.5 89.6 1.0 1.0 0.0 16.6 16.0
8 2005-01-01 09:00:00 130.5 79.9 27.0 27.0 0.0 16.8 16.5
9 2005-01-01 10:00:00 141.8 71.7 55.0 55.0 0.0 16.9 16.9
10 2005-01-01 11:00:00 154.9 65.5 83.0 83.0 0.0 17.0 17.2
11 2005-01-01 12:00:00 169.8 61.9 114.0 114.0 0.0 17.4 17.9
12 2005-01-01 13:00:00 185.2 61.4 110.0 110.0 0.0 17.5 18.0
13 2005-01-01 14:00:00 200.4 64.0 94.0 94.0 0.0 17.5 17.8
14 2005-01-01 15:00:00 214.3 69.5 70.0 70.0 0.0 17.5 17.6
15 2005-01-01 16:00:00 226.3 77.2 38.0 38.0 0.0 17.2 17.0
16 2005-01-01 17:00:00 236.5 86.4 4.0 4.0 0.0 16.7 16.3
17 2005-01-01 18:00:00 245.5 90.0 0.0 0.0 0.0 16.0 14.5
18 2005-01-01 19:00:00 254.2 90.0 0.0 0.0 0.0 14.9 13.0
19 2005-01-01 20:00:00 262.3 90.0 0.0 0.0 0.0 16.0 14.1
20 2005-01-01 21:00:00 271.3 90.0 0.0 0.0 0.0 15.1 13.3
21 2005-01-01 22:00:00 282.1 90.0 0.0 0.0 0.0 15.5 13.2
22 2005-01-01 23:00:00 298.1 90.0 0.0 0.0 0.0 15.6 13.0
23 2005-01-02 00:00:00 327.5 90.0 0.0 0.0 0.0 15.8 13.1
df['T_a'] is temperature data,
df['W_s'] is windspeed data
effective_irrad_mono_fixed.head(24)
date_time
2005-01-01 01:00:00 0.000000
2005-01-01 02:00:00 0.000000
2005-01-01 03:00:00 0.000000
2005-01-01 04:00:00 0.000000
2005-01-01 05:00:00 0.000000
2005-01-01 06:00:00 0.000000
2005-01-01 07:00:00 0.000000
2005-01-01 08:00:00 0.936690
2005-01-01 09:00:00 25.168996
2005-01-01 10:00:00 51.165091
2005-01-01 11:00:00 77.354266
2005-01-01 12:00:00 108.002486
2005-01-01 13:00:00 103.809820
2005-01-01 14:00:00 88.138705
2005-01-01 15:00:00 65.051870
2005-01-01 16:00:00 35.390518
2005-01-01 17:00:00 3.742581
2005-01-01 18:00:00 0.000000
2005-01-01 19:00:00 0.000000
2005-01-01 20:00:00 0.000000
2005-01-01 21:00:00 0.000000
2005-01-01 22:00:00 0.000000
2005-01-01 23:00:00 0.000000
2005-01-02 00:00:00 0.000000
Question: I don't understand why I only get NaN values when I simply run the function; might it have something to do with the timestamps?
I believe this is also what causes the RuntimeWarning when I try to plot the result.
This is not really a pvlib issue, more a pandas issue. The problem is that your input time series objects are not on a consistent index: the irradiance input has a pandas.DatetimeIndex while the temperature and wind speed inputs have pandas.RangeIndex (see the index printed out from your df). Math operations on Series are done by aligning index elements and substituting NaN where things don't line up. For example see how only the shared index elements correspond to non-NaN values here:
In [46]: a = pd.Series([1, 2, 3], index=[1, 2, 3])
...: b = pd.Series([2, 3, 4], index=[2, 3, 4])
...: a*b
Out[46]:
1 NaN
2 4.0
3 9.0
4 NaN
dtype: float64
If you examine the index of your cell_temperature_mono_fixed, you'll see it has both timestamps (from the irradiance input) and integers (from the other two), so it's taking the union of the indexes but only filling in values for the intersection (which is empty in this case).
So to fix your problem, you should make sure all the inputs are on a consistent index. The easiest way to do that is probably at the dataframe level, i.e. df = df.set_index('date_time').
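As a minimal sketch of that fix (toy numbers, not the question's data), putting both Series on the same DatetimeIndex makes the element-wise math behave:
import pandas as pd

idx = pd.date_range('2005-01-01 01:00', periods=3, freq='H')
irrad = pd.Series([0.0, 0.0, 1.0], index=idx)   # DatetimeIndex, like the irradiance input
temps = pd.Series([15.5, 17.0, 16.7])           # RangeIndex, like df['T_a'] before set_index

print((irrad * temps).isna().all())             # True: the two indexes never overlap

aligned = temps.set_axis(idx)                   # or df = df.set_index('date_time') on the frame
print(irrad * aligned)                          # element-wise as intended, no NaN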

How to insert missing row into pandas dataframe?

Each row in this database represents 1 minute, but some minutes are missing after pulling the data from the API (you'll see 09:51:00 is missing):
ticker date time vol vwap open high low close lbh lah trades
0 AACG 2022-01-06 09:30:00 33042 1.8807 1.8900 1.9200 1.8700 1.9017 0.0 0.0 68
1 AACG 2022-01-06 09:31:00 5306 1.9073 1.9100 1.9200 1.8801 1.9100 0.0 0.0 27
2 AACG 2022-01-06 09:32:00 3496 1.8964 1.9100 1.9193 1.8800 1.8900 0.0 0.0 17
3 AACG 2022-01-06 09:33:00 5897 1.9377 1.8900 1.9500 1.8900 1.9500 0.0 0.0 15
4 AACG 2022-01-06 09:34:00 1983 1.9362 1.9200 1.9499 1.9200 1.9200 0.0 0.0 9
5 AACG 2022-01-06 09:35:00 10725 1.9439 1.9400 1.9600 1.9201 1.9306 0.0 0.0 87
6 AACG 2022-01-06 09:36:00 5942 1.9380 1.9307 1.9400 1.9300 1.9400 0.0 0.0 48
7 AACG 2022-01-06 09:37:00 5759 1.9428 1.9659 1.9659 1.9400 1.9500 0.0 0.0 11
8 AACG 2022-01-06 09:38:00 4855 1.9424 1.9500 1.9500 1.9401 1.9495 0.0 0.0 10
9 AACG 2022-01-06 09:39:00 6275 1.9514 1.9500 1.9700 1.9450 1.9700 0.0 0.0 14
10 AACG 2022-01-06 09:40:00 13695 2.0150 1.9799 2.0500 1.9749 2.0200 0.0 0.0 59
11 AACG 2022-01-06 09:41:00 3252 2.0209 2.0275 2.0300 2.0200 2.0200 0.0 0.0 14
12 AACG 2022-01-06 09:42:00 12082 2.0117 2.0300 2.0400 1.9800 1.9900 0.0 0.0 41
13 AACG 2022-01-06 09:43:00 5148 1.9802 1.9800 1.9999 1.9750 1.9999 0.0 0.0 11
14 AACG 2022-01-06 09:44:00 2764 1.9927 1.9901 1.9943 1.9901 1.9943 0.0 0.0 5
15 AACG 2022-01-06 09:45:00 2379 1.9576 1.9601 1.9601 1.9201 1.9201 0.0 0.0 10
16 AACG 2022-01-06 09:46:00 8762 1.9852 1.9550 1.9900 1.9550 1.9900 0.0 0.0 35
17 AACG 2022-01-06 09:47:00 1343 1.9704 1.9700 1.9738 1.9700 1.9701 0.0 0.0 5
18 AACG 2022-01-06 09:48:00 17080 1.9696 1.9700 1.9800 1.9600 1.9600 0.0 0.0 9
19 AACG 2022-01-06 09:49:00 9004 1.9600 1.9600 1.9600 1.9600 1.9600 0.0 0.0 9
20 AACG 2022-01-06 09:50:00 9224 1.9603 1.9600 1.9613 1.9600 1.9613 0.0 0.0 4
21 AACG 2022-01-06 09:52:00 16914 1.9921 1.9800 2.0400 1.9750 2.0399 0.0 0.0 67
22 AACG 2022-01-06 09:53:00 4665 1.9866 1.9900 2.0395 1.9801 1.9900 0.0 0.0 37
23 AACG 2022-01-06 09:55:00 2107 2.0049 1.9900 2.0100 1.9900 2.0099 0.0 0.0 10
24 AACG 2022-01-06 09:56:00 3003 2.0028 2.0000 2.0099 2.0000 2.0099 0.0 0.0 23
25 AACG 2022-01-06 09:57:00 8489 2.0272 2.0100 2.0400 2.0100 2.0300 0.0 0.0 34
26 AACG 2022-01-06 09:58:00 6050 2.0155 2.0300 2.0300 2.0150 2.0150 0.0 0.0 6
27 AACG 2022-01-06 09:59:00 61623 2.0449 2.0300 2.0700 2.0300 2.0699 0.0 0.0 83
28 AACG 2022-01-06 10:00:00 19699 2.0856 2.0699 2.1199 2.0600 2.1100 0.0 0.0 54
I want to insert rows for the missing minutes, with zero/empty values in every column except the time:
missing_data = pd.DataFrame({'ticker': ['AACG'], 'date': ['2022-01-06'], 'time': ['09:51:00'],
                             'vol': [0], 'vwap': [0.0], 'open': [0.0], 'high': [0.0], 'low': [0.0],
                             'close': [0.0], 'lbh': [0.0], 'lah': [0.0], 'trades': [0]}, index=[21])
It would look something like this:
ticker date time vol vwap open high low close lbh lah trades
21 AACG 2022-01-06 09:51:00 0 0.00 0.00 0.00 0.00 0.00 0.0 0.0 0
With someone's help, I've managed to isolate the rows right before the missing values:
time_in_minutes = pd.to_timedelta(df['time'].astype(str)).astype('timedelta64[m]')
indices_where_the_next_minute_is_missing = np.where(np.diff(time_in_minutes) != 1)[0]
out = df.loc[indices_where_the_next_minute_is_missing]
Simply adding 1 to time_in_minutes will give me the correction I need:
timeinminutesplus1 = pd.to_timedelta(out['time'].astype(str)).astype('timedelta64[m]') + 1
But how do I turn it back into a datetime.time data type and insert it into the DataFrame?
Building off of my answer to your previous question, first expand your DataFrame to include NaN rows for missing minutes.
time = pd.to_timedelta(df['time'].astype(str)).astype('timedelta64[m]')
out = df.set_index(time).reindex(np.arange(time[0], time.iloc[len(df)-1]+1)).reset_index(drop=True)
then given your missing data DataFrame
missing_data = pd.DataFrame({'ticker': ['AACG'], 'date': ['2022-01-06'], 'time': ['09:51:00'],
                             'vol': [0], 'vwap': [0.0], 'open': [0.0], 'high': [0.0], 'low': [0.0],
                             'close': [0.0], 'lbh': [0.0], 'lah': [0.0], 'trades': [0]}, index=[21])
which looks like:
ticker date time vol vwap open high low close lbh lah trades
21 AACG 2022-01-06 09:51:00 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0
you can update out:
out.update(missing_data)
Then out becomes:
ticker date time vol vwap open high low close lbh lah trades
0 AACG 2022-01-06 09:30:00 33042.0 1.8807 1.8900 1.9200 1.8700 1.9017 0.0 0.0 68.0
1 AACG 2022-01-06 09:31:00 5306.0 1.9073 1.9100 1.9200 1.8801 1.9100 0.0 0.0 27.0
2 AACG 2022-01-06 09:32:00 3496.0 1.8964 1.9100 1.9193 1.8800 1.8900 0.0 0.0 17.0
3 AACG 2022-01-06 09:33:00 5897.0 1.9377 1.8900 1.9500 1.8900 1.9500 0.0 0.0 15.0
...
20 AACG 2022-01-06 09:50:00 9224.0 1.9603 1.9600 1.9613 1.9600 1.9613 0.0 0.0 4.0
21 AACG 2022-01-06 09:51:00 0.0 0.0000 0.0000 0.0000 0.0000 0.0000 0.0 0.0 0.0
22 AACG 2022-01-06 09:52:00 16914.0 1.9921 1.9800 2.0400 1.9750 2.0399 0.0 0.0 67.0
23 AACG 2022-01-06 09:53:00 4665.0 1.9866 1.9900 2.0395 1.9801 1.9900 0.0 0.0 37.0
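Side note on the last part of the question, turning the minute numbers back into datetime.time values (my own sketch, not part of the answer above; 570/571/591 are just illustrative minute-of-day floats):
import pandas as pd

minutes = pd.Series([570.0, 571.0, 591.0])            # 09:30, 09:31, 09:51 as minutes of the day
as_time = pd.to_datetime(minutes, unit='m').dt.time   # minutes since the epoch's midnight -> time of day
print(as_time.tolist())   # [datetime.time(9, 30), datetime.time(9, 31), datetime.time(9, 51)]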
I used the code that you provided and then iterated over the result in order to add the missing rows. The final result is then sorted again to get the order and indices correctly.
import datetime

import numpy as np
import pandas as pd

# Reading your dataframe
df = pd.read_csv('missing_minute.csv', sep=';', index_col='index')

# Define a default row to use for missing rows
default_row = {'ticker': 'AACG', 'date': '2022-01-06', 'time': '00:00:00', 'vol': 0.0,
               'vwap': 0.0, 'open': 0.0, 'high': 0.0, 'low': 0.0, 'close': 0.0,
               'lbh': 0.0, 'lah': 0.0, 'trades': 0.0}

# Your logic to find the rows right before a missing minute
time_in_minutes = pd.to_timedelta(df['time'].astype(str)).astype('timedelta64[m]')
indices_where_the_next_minute_is_missing = np.where(np.diff(time_in_minutes) != 1)[0]
out = df.loc[indices_where_the_next_minute_is_missing]

# Iterating over those rows
for i, e in out.iterrows():
    # Extract the time of the previous row and parse it
    time_of_previous_row = datetime.datetime.strptime(e['time'], '%H:%M:%S')
    # Add one minute for the new entry
    time_of_new_row = (time_of_previous_row + datetime.timedelta(minutes=1)).strftime("%H:%M:%S")
    # Set the new time on the default row and append it to the dataframe
    default_row['time'] = time_of_new_row
    df = df.append(default_row, ignore_index=True)  # note: on pandas >= 2.0 use pd.concat instead

# Sort the dataframe by the time column and reset the index
df = df.sort_values(by='time').reset_index(drop=True)
df
Output:
ticker date time vol vwap open high low close lbh lah trades
0 AACG 2022-01-06 09:30:00 33042.0 1.8807 1.89 1.92 1.87 1.9017 0.0 0.0 68.0
1 AACG 2022-01-06 09:31:00 5306.0 1.9073 1.91 1.92 1.8801 1.91 0.0 0.0 27.0
2 AACG 2022-01-06 09:32:00 3496.0 1.8964 1.91 1.9193 1.88 1.89 0.0 0.0 17.0
3 AACG 2022-01-06 09:33:00 5897.0 1.9377 1.89 1.95 1.89 1.95 0.0 0.0 15.0
4 AACG 2022-01-06 09:34:00 1983.0 1.9362 1.92 1.9499 1.92 1.92 0.0 0.0 9.0
5 AACG 2022-01-06 09:35:00 10725.0 1.9439 1.94 1.96 1.9201 1.9306 0.0 0.0 87.0
6 AACG 2022-01-06 09:36:00 5942.0 1.938 1.9307 1.94 1.93 1.94 0.0 0.0 48.0
7 AACG 2022-01-06 09:37:00 5759.0 1.9428 1.9659 1.9659 1.94 1.95 0.0 0.0 11.0
8 AACG 2022-01-06 09:38:00 4855.0 1.9424 1.95 1.95 1.9401 1.9495 0.0 0.0 10.0
9 AACG 2022-01-06 09:39:00 6275.0 1.9514 1.95 1.97 1.945 1.97 0.0 0.0 14.0
10 AACG 2022-01-06 09:40:00 13695.0 2.015 1.9799 2.05 1.9749 2.02 0.0 0.0 59.0
11 AACG 2022-01-06 09:41:00 3252.0 2.0209 2.0275 2.03 2.02 2.02 0.0 0.0 14.0
12 AACG 2022-01-06 09:42:00 12082.0 2.0117 2.03 2.04 1.98 1.99 0.0 0.0 41.0
13 AACG 2022-01-06 09:43:00 5148.0 1.9802 1.98 1.9999 1.975 1.9999 0.0 0.0 11.0
14 AACG 2022-01-06 09:44:00 2764.0 1.9927 1.9901 1.9943 1.9901 1.9943 0.0 0.0 5.0
15 AACG 2022-01-06 09:45:00 2379.0 1.9576 1.9601 1.9601 1.9201 1.9201 0.0 0.0 10.0
16 AACG 2022-01-06 09:46:00 8762.0 1.9852 1.955 1.99 1.955 1.99 0.0 0.0 35.0
17 AACG 2022-01-06 09:47:00 1343.0 1.9704 1.97 1.9738 1.97 1.9701 0.0 0.0 5.0
18 AACG 2022-01-06 09:48:00 17080.0 1.9696 1.97 1.98 1.96 1.96 0.0 0.0 9.0
19 AACG 2022-01-06 09:49:00 9004.0 1.96 1.96 1.96 1.96 1.96 0.0 0.0 9.0
20 AACG 2022-01-06 09:50:00 9224.0 1.9603 1.96 1.9613 1.96 1.9613 0.0 0.0 4.0
21 AACG 2022-01-06 09:51:00 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
22 AACG 2022-01-06 09:52:00 16914.0 1.9921 1.98 2.04 1.975 2.0399 0.0 0.0 67.0
23 AACG 2022-01-06 09:53:00 4665.0 1.9866 1.99 2.0395 1.9801 1.99 0.0 0.0 37.0
24 AACG 2022-01-06 09:54:00 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
25 AACG 2022-01-06 09:55:00 2107.0 2.0049 1.99 2.01 1.99 2.0099 0.0 0.0 10.0
26 AACG 2022-01-06 09:56:00 3003.0 2.0028 2.0 2.0099 2.0 2.0099 0.0 0.0 23.0
27 AACG 2022-01-06 09:57:00 8489.0 2.0272 2.01 2.04 2.01 2.03 0.0 0.0 34.0
28 AACG 2022-01-06 09:58:00 6050.0 2.0155 2.03 2.03 2.015 2.015 0.0 0.0 6.0
29 AACG 2022-01-06 09:59:00 61623.0 2.0449 2.03 2.07 2.03 2.0699 0.0 0.0 83.0
30 AACG 2022-01-06 10:00:00 19699.0 2.0856 2.0699 2.1199 2.06 2.11 0.0 0.0 54.0
P.S.: This only works if a single entry is missing at a time, not several in a row (e.g. both 09:51 and 09:52 missing). Support for that could be added by checking how many rows in a sequence are missing.
Also, if you have data over multiple days, the code has to be adapted a little: first, set the date inside the loop; second, sort by date and time at the end.
Similar to enke's answer.
It sounds like you may want to create a new data frame with the proper date and time index that you desire (without missing rows). Then fill in the new data frame with the data that you have.
You can do this using pandas.DataFrame.update()
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.update.html
Here is an example:
import pandas as pd

start_datetime = '2022-01-06 09:30:00'
end_datetime = '2022-01-06 10:00:00'
cols = ['ticker','date','time','vol','vwap','open','high','low','close','lbh','lah','trades']

# Template frame with one row per minute, zero-filled
new_df = pd.DataFrame(columns=cols,
                      index=pd.date_range(start=start_datetime,
                                          end=end_datetime, freq='min'))
new_df['date'] = [d.date() for d in new_df.index]
new_df['time'] = [d.time() for d in new_df.index]
new_df = new_df.fillna(0.0)

# A second frame standing in for the data you actually have
start_datetime = '2022-01-06 09:35:00'
end_datetime = '2022-01-06 09:40:00'
other_df = pd.DataFrame(columns=cols,
                        index=pd.date_range(start=start_datetime,
                                            end=end_datetime, freq='min'))
other_df['date'] = [d.date() for d in other_df.index]
other_df['time'] = [d.time() for d in other_df.index]
other_df = other_df.fillna(3)

final_df = new_df.copy()
final_df.update(other_df)
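If you then want the question's actual rows in the template rather than the dummy 3s, the real frame just needs the same kind of DatetimeIndex first (a usage sketch; I'm assuming the question's string 'date' and 'time' columns):
# assumes `df` is the question's frame and `new_df` the zero-filled template above
df_indexed = df.set_index(pd.to_datetime(df['date'].astype(str) + ' ' + df['time'].astype(str)))
final_df = new_df.copy()
final_df.update(df_indexed)   # rows present in df overwrite the zero-filled minutes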

resampling a pandas dataframe from almost-weekly to daily

What's the most succinct way to resample this dataframe:
>>> uneven = pd.DataFrame({'a': [0, 12, 19]}, index=pd.DatetimeIndex(['2020-12-08', '2020-12-20', '2020-12-27']))
>>> print(uneven)
a
2020-12-08 0
2020-12-20 12
2020-12-27 19
...into this dataframe:
>>> daily = pd.DataFrame({'a': range(20)}, index=pd.date_range('2020-12-08', periods=3*7-1, freq='D'))
>>> print(daily)
a
2020-12-08 0
2020-12-09 1
...
2020-12-19 11
2020-12-20 12
2020-12-21 13
...
2020-12-27 19
NB: 12 days between the 8th and 20th Dec, 7 days between the 20th and 27th.
Also, to clarify the kind of interpolation/resampling I want to do:
>>> print(daily.diff())
a
2020-12-08 NaN
2020-12-09 1.0
2020-12-10 1.0
...
2020-12-19 1.0
2020-12-20 1.0
2020-12-21 1.0
...
2020-12-27 1.0
The actual data is hierarchical and has multiple columns, but I wanted to start with something I could get my head around:
first_dose second_dose
date areaCode
2020-12-08 E92000001 0.0 0.0
N92000002 0.0 0.0
S92000003 0.0 0.0
W92000004 0.0 0.0
2020-12-20 E92000001 574829.0 0.0
N92000002 16068.0 0.0
S92000003 60333.0 0.0
W92000004 24056.0 0.0
2020-12-27 E92000001 267809.0 0.0
N92000002 14948.0 0.0
S92000003 34535.0 0.0
W92000004 12495.0 0.0
2021-01-03 E92000001 330037.0 20660.0
N92000002 9669.0 1271.0
S92000003 21446.0 44.0
W92000004 14205.0 27.0
I think you need:
df = (df.reset_index('areaCode')
        .groupby('areaCode')[['first_dose','second_dose']]
        .resample('D')
        .interpolate())
print(df)
first_dose second_dose
areaCode date
E92000001 2020-12-08 0.000000 0.000000
2020-12-09 47902.416667 0.000000
2020-12-10 95804.833333 0.000000
2020-12-11 143707.250000 0.000000
2020-12-12 191609.666667 0.000000
... ...
W92000004 2020-12-30 13227.857143 11.571429
2020-12-31 13472.142857 15.428571
2021-01-01 13716.428571 19.285714
2021-01-02 13960.714286 23.142857
2021-01-03 14205.000000 27.000000
[108 rows x 2 columns]
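For the simpler single-index frame at the top of the question, the same idea reduces to a plain resample plus interpolate (a quick sketch, not part of the answer above):
import pandas as pd

uneven = pd.DataFrame({'a': [0, 12, 19]},
                      index=pd.DatetimeIndex(['2020-12-08', '2020-12-20', '2020-12-27']))
daily = uneven.resample('D').interpolate()   # linear interpolation between the observations
print(daily.diff().dropna()['a'].unique())   # [1.] -> a constant 1-per-day increase, as desired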

How to create rolling window variables without skipping months when using a multiIndex?

Currently I have a df with a location_key & year_month MultiIndex. I want to create a sum using a rolling 3-month window.
(pd.DataFrame(df.groupby(['LOCATION_KEY','YEAR_MONTH'])['SALES'].count())
   .sort_index()
   .groupby(level=0)
   .apply(lambda x: x.rolling(window=3).sum()))
The window is working properly; the issue is that in months where there were no sales, instead of counting an empty month it counts another month.
E.g. in the data below, the 2016-03 sales figure is the sum of 2016-03, 2016-01 and 2015-12, as opposed to what I would like: 2016-03, 2016-02, 2016-01.
LOCATION_KE YEAR_MONTH SALES
A 2015-10 NaN
2015-11 NaN
2015-12 200
2016-01 220
2016-03 180
B 2015-04 NaN
2015-05 NaN
2015-06 119
2015-07 120
Basically you have to get your index set up how you want so the rolling window has zeros to process.
df
LOCATION_KE YEAR_MONTH SALES
0 A 2015-10-01 NaN
1 A 2015-11-01 NaN
2 A 2015-12-01 200.0
3 A 2016-01-01 220.0
4 A 2016-03-01 180.0
5 B 2015-04-01 NaN
6 B 2015-05-01 NaN
7 B 2015-06-01 119.0
8 B 2015-07-01 120.0
df['SALES'] = df['SALES'].fillna(0)
df.index = [df["LOCATION_KE"], df["YEAR_MONTH"]]
df
LOCATION_KE YEAR_MONTH SALES
LOCATION_KE YEAR_MONTH
A 2015-10-01 A 2015-10-01 0.0
2015-11-01 A 2015-11-01 0.0
2015-12-01 A 2015-12-01 200.0
2016-01-01 A 2016-01-01 220.0
2016-03-01 A 2016-03-01 180.0
B 2015-04-01 B 2015-04-01 0.0
2015-05-01 B 2015-05-01 0.0
2015-06-01 B 2015-06-01 119.0
2015-07-01 B 2015-07-01 120.0
df = df.reindex(pd.MultiIndex.from_product([df['LOCATION_KE'],
                                            pd.date_range("20150101", periods=24, freq='MS')],
                                           names=['location', 'month']))
df['SALES'].fillna(0).reset_index(level=0).groupby('location').rolling(3).sum().fillna(0)
location SALES
location month
A 2015-01-01 A 0.0
2015-02-01 A 0.0
2015-03-01 A 0.0
2015-04-01 A 0.0
2015-05-01 A 0.0
2015-06-01 A 0.0
2015-07-01 A 0.0
2015-08-01 A 0.0
2015-09-01 A 0.0
2015-10-01 A 0.0
2015-11-01 A 0.0
2015-12-01 A 200.0
2016-01-01 A 420.0
2016-02-01 A 420.0
2016-03-01 A 400.0
2016-04-01 A 180.0
2016-05-01 A 180.0
2016-06-01 A 0.0
2016-07-01 A 0.0
2016-08-01 A 0.0
2016-09-01 A 0.0
2016-10-01 A 0.0
2016-11-01 A 0.0
2016-12-01 A 0.0
2015-01-01 A 0.0
2015-02-01 A 0.0
2015-03-01 A 0.0
2015-04-01 A 0.0
2015-05-01 A 0.0
2015-06-01 A 0.0
... ... ...
B 2016-07-01 B 0.0
2016-08-01 B 0.0
2016-09-01 B 0.0
2016-10-01 B 0.0
2016-11-01 B 0.0
2016-12-01 B 0.0
2015-01-01 B 0.0
2015-02-01 B 0.0
2015-03-01 B 0.0
2015-04-01 B 0.0
2015-05-01 B 0.0
2015-06-01 B 119.0
2015-07-01 B 239.0
2015-08-01 B 239.0
2015-09-01 B 120.0
2015-10-01 B 0.0
2015-11-01 B 0.0
2015-12-01 B 0.0
2016-01-01 B 0.0
2016-02-01 B 0.0
2016-03-01 B 0.0
2016-04-01 B 0.0
2016-05-01 B 0.0
2016-06-01 B 0.0
2016-07-01 B 0.0
2016-08-01 B 0.0
2016-09-01 B 0.0
2016-10-01 B 0.0
2016-11-01 B 0.0
2016-12-01 B 0.0
I think if you have an up-to-date pandas you can leave out the reset_index.
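A variation on the same idea (my own sketch, with made-up sample data): build the full grid from the unique location keys, so each location block appears only once, and pivot the locations to columns before rolling. That sidesteps the duplicated blocks visible in the output above.
import pandas as pd

# Sample data shaped like the question (values are illustrative)
df = pd.DataFrame({
    'LOCATION_KE': ['A'] * 5 + ['B'] * 4,
    'YEAR_MONTH': pd.to_datetime(['2015-10-01', '2015-11-01', '2015-12-01',
                                  '2016-01-01', '2016-03-01',
                                  '2015-04-01', '2015-05-01', '2015-06-01', '2015-07-01']),
    'SALES': [None, None, 200, 220, 180, None, None, 119, 120],
})

# Full (location, month) grid built from the *unique* keys
full_idx = pd.MultiIndex.from_product(
    [df['LOCATION_KE'].unique(), pd.date_range('2015-01-01', periods=24, freq='MS')],
    names=['location', 'month'])

monthly = (df.set_index(['LOCATION_KE', 'YEAR_MONTH'])['SALES']
             .reindex(full_idx)
             .fillna(0))

# Pivot locations to columns and roll down the complete monthly index
rolled = monthly.unstack('location').rolling(3).sum()
print(rolled.loc['2016-03-01', 'A'])   # 400.0 = 220 (2016-01) + 0 (2016-02) + 180 (2016-03)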

Plotting a Datetime Bar Graph with Pandas with different xlabels

I would like to plot a pandas DataFrame that has only a few entries of data in each column as a bar graph. This works, but not only does it have the wrong y-axis limits, it also makes the x ticks so closely spaced that the graph is useless. I would like to change the tick step to about one per week and display only day, month and year. I have the following DataFrame:
Observed WRF
2014-06-28 12:00:00 0.0 0.0
2014-06-28 13:00:00 0.0 0.0
2014-06-28 14:00:00 0.0 0.0
2014-06-28 15:00:00 0.0 0.0
2014-06-28 16:00:00 0.0 0.0
2014-06-28 17:00:00 0.0 0.0
2014-06-28 18:00:00 0.0 0.0
2014-06-28 19:00:00 0.0 0.0
2014-06-28 20:00:00 0.0 0.0
2014-06-28 21:00:00 0.0 0.0
2014-06-28 22:00:00 0.0 0.0
2014-06-28 23:00:00 0.0 0.0
2014-06-29 00:00:00 0.0 0.0
2014-06-29 01:00:00 0.0 0.0
2014-06-29 02:00:00 0.0 0.0
2014-06-29 03:00:00 0.0 0.0
2014-06-29 04:00:00 0.0 0.0
2014-06-29 05:00:00 0.0 0.0
2014-06-29 06:00:00 0.0 0.0
2014-06-29 07:00:00 0.0 0.0
2014-06-29 08:00:00 0.0 0.0
2014-06-29 09:00:00 0.0 0.0
2014-06-29 10:00:00 0.0 0.0
2014-06-29 11:00:00 0.0 0.0
2014-06-29 12:00:00 0.0 0.0
2014-06-29 13:00:00 0.0 0.0
2014-06-29 14:00:00 0.0 0.0
2014-06-29 15:00:00 0.0 0.0
2014-06-29 16:00:00 0.0 0.0
2014-06-29 17:00:00 0.0 0.0
... ...
2014-07-04 02:00:00 0.0002 0.0
2014-07-04 03:00:00 0.2466 0.0
2014-07-04 04:00:00 0.7103 0.0
2014-07-04 05:00:00 0.9158 1.93521e-13
2014-07-04 06:00:00 0.6583 0.0
2014-07-04 07:00:00 0.3915 0.0
2014-07-04 08:00:00 0.1249 0.0
2014-07-04 09:00:00 0.0 0.0
... ...
2014-08-30 07:00:00 0.0 0.0
2014-08-30 08:00:00 0.0 0.0
2014-08-30 09:00:00 0.0 0.0
2014-08-30 10:00:00 0.0 0.0
2014-08-30 11:00:00 0.0 0.0
2014-08-30 12:00:00 0.0 0.0
2014-08-30 13:00:00 0.0 0.0
2014-08-30 14:00:00 0.0 0.0
2014-08-30 15:00:00 0.0 0.0
2014-08-30 16:00:00 0.0 0.0
2014-08-30 17:00:00 0.0 0.0
2014-08-30 18:00:00 0.0 0.0
2014-08-30 19:00:00 0.0 0.0
2014-08-30 20:00:00 0.0 0.0
2014-08-30 21:00:00 0.0 0.0
2014-08-30 22:00:00 0.0 0.0
2014-08-30 23:00:00 0.0 0.0
2014-08-31 00:00:00 0.0 0.0
2014-08-31 01:00:00 0.0 0.0
2014-08-31 02:00:00 0.0 0.0
2014-08-31 03:00:00 0.0 0.0
2014-08-31 04:00:00 0.0 0.0
2014-08-31 05:00:00 0.0 0.0
2014-08-31 06:00:00 0.0 0.0
2014-08-31 07:00:00 0.0 0.0
2014-08-31 08:00:00 0.0 0.0
2014-08-31 09:00:00 0.0 0.0
2014-08-31 10:00:00 0.0 0.0
2014-08-31 11:00:00 0.0 0.0
2014-08-31 12:00:00 0.0 0.0
And the following code to plot it:
df4.plot(kind='bar',edgecolor='none',figsize=(16,8),linewidth=2, color=((1,0.502,0),'black'))
plt.legend(prop={'size':16})
plt.subplots_adjust(left=.1, right=0.9, top=0.9, bottom=.1)
plt.title('Five Day WRF Model Comparison Near %.2f,%.2f' %(lat,lon),fontsize=24)
plt.ylabel('Hourly Accumulated Precipitation [mm]',fontsize=18,color='black')
ax4=plt.gca()
maxs4=df4.max()
ax4.set_ylim([0, maxs4.max()])
ax4.xaxis_date()
ax4.xaxis.set_label_coords(0.5, -0.05)
plt.xlabel('Time',fontsize=18,color='black')
plt.show()
The y-axis starts at 0 but extends to about double the maximum value I set as the y-limit. The x-axis counts by hours, which is how I separated the data, so that makes sense; however, it is not a helpful display.
Look at this code:
import pandas as pd
import numpy as np
from datetime import datetime
import matplotlib.pylab as plt

# Sample data
df_origin = pd.DataFrame(pd.date_range(datetime(2014, 6, 28, 12, 0, 0),
                                       datetime(2014, 8, 30, 12, 0, 0), freq='1H'),
                         columns=['Valid Time'])
df_origin = df_origin.set_index('Valid Time')
df_origin['Precipitation'] = np.random.uniform(low=0., high=10., size=len(df_origin.index))
df_origin.iloc[20:100, df_origin.columns.get_loc('Precipitation')] = 0.
df_origin.iloc[168:168*2, df_origin.columns.get_loc('Precipitation')] = 0.  # second week has to be dry

# Plotting
df_origin.plot(y='Precipitation', kind='bar', edgecolor='none', figsize=(16, 8),
               linewidth=2, color=(1, 0.502, 0))
plt.legend(prop={'size': 16})
plt.subplots_adjust(left=.1, right=0.9, top=0.9, bottom=.1)
plt.title('Precipitation (WRF Model)', fontsize=24)
plt.ylabel('Hourly Accumulated Precipitation [mm]', fontsize=18, color='black')
ax = plt.gca()
plt.gcf().autofmt_xdate()

# Skip ticks on the x axis: keep one label per week of hourly bars
ax.set_xticklabels([dt.strftime('%Y-%m-%d') for dt in df_origin.index])
for i, tick in enumerate(ax.xaxis.get_major_ticks()):
    if i % (24 * 7) != 0:  # 24 hours * 7 days = 1 week
        tick.set_visible(False)
plt.xlabel('Time', fontsize=18, color='black')
plt.show()
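An alternative I sometimes use instead of the set_xticklabels plus visibility loop above (my own variation, applied to the same ax and df_origin before plt.show()): pandas bar plots place the bars at integer positions 0..N-1, so a MultipleLocator and FuncFormatter from matplotlib.ticker can thin and label the ticks in one go.
import matplotlib.ticker as mticker

step = 24 * 7  # one labelled tick per week of hourly bars
ax.xaxis.set_major_locator(mticker.MultipleLocator(step))
ax.xaxis.set_major_formatter(mticker.FuncFormatter(
    lambda x, pos: df_origin.index[int(x)].strftime('%Y-%m-%d')
    if 0 <= int(x) < len(df_origin.index) else ''))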
