How to insert missing row into pandas dataframe?

Each row in this DataFrame represents one minute, but some minutes are missing when I pull the data from the API (for example, you'll see that 09:51:00 is missing).
ticker date time vol vwap open high low close lbh lah trades
0 AACG 2022-01-06 09:30:00 33042 1.8807 1.8900 1.9200 1.8700 1.9017 0.0 0.0 68
1 AACG 2022-01-06 09:31:00 5306 1.9073 1.9100 1.9200 1.8801 1.9100 0.0 0.0 27
2 AACG 2022-01-06 09:32:00 3496 1.8964 1.9100 1.9193 1.8800 1.8900 0.0 0.0 17
3 AACG 2022-01-06 09:33:00 5897 1.9377 1.8900 1.9500 1.8900 1.9500 0.0 0.0 15
4 AACG 2022-01-06 09:34:00 1983 1.9362 1.9200 1.9499 1.9200 1.9200 0.0 0.0 9
5 AACG 2022-01-06 09:35:00 10725 1.9439 1.9400 1.9600 1.9201 1.9306 0.0 0.0 87
6 AACG 2022-01-06 09:36:00 5942 1.9380 1.9307 1.9400 1.9300 1.9400 0.0 0.0 48
7 AACG 2022-01-06 09:37:00 5759 1.9428 1.9659 1.9659 1.9400 1.9500 0.0 0.0 11
8 AACG 2022-01-06 09:38:00 4855 1.9424 1.9500 1.9500 1.9401 1.9495 0.0 0.0 10
9 AACG 2022-01-06 09:39:00 6275 1.9514 1.9500 1.9700 1.9450 1.9700 0.0 0.0 14
10 AACG 2022-01-06 09:40:00 13695 2.0150 1.9799 2.0500 1.9749 2.0200 0.0 0.0 59
11 AACG 2022-01-06 09:41:00 3252 2.0209 2.0275 2.0300 2.0200 2.0200 0.0 0.0 14
12 AACG 2022-01-06 09:42:00 12082 2.0117 2.0300 2.0400 1.9800 1.9900 0.0 0.0 41
13 AACG 2022-01-06 09:43:00 5148 1.9802 1.9800 1.9999 1.9750 1.9999 0.0 0.0 11
14 AACG 2022-01-06 09:44:00 2764 1.9927 1.9901 1.9943 1.9901 1.9943 0.0 0.0 5
15 AACG 2022-01-06 09:45:00 2379 1.9576 1.9601 1.9601 1.9201 1.9201 0.0 0.0 10
16 AACG 2022-01-06 09:46:00 8762 1.9852 1.9550 1.9900 1.9550 1.9900 0.0 0.0 35
17 AACG 2022-01-06 09:47:00 1343 1.9704 1.9700 1.9738 1.9700 1.9701 0.0 0.0 5
18 AACG 2022-01-06 09:48:00 17080 1.9696 1.9700 1.9800 1.9600 1.9600 0.0 0.0 9
19 AACG 2022-01-06 09:49:00 9004 1.9600 1.9600 1.9600 1.9600 1.9600 0.0 0.0 9
20 AACG 2022-01-06 09:50:00 9224 1.9603 1.9600 1.9613 1.9600 1.9613 0.0 0.0 4
21 AACG 2022-01-06 09:52:00 16914 1.9921 1.9800 2.0400 1.9750 2.0399 0.0 0.0 67
22 AACG 2022-01-06 09:53:00 4665 1.9866 1.9900 2.0395 1.9801 1.9900 0.0 0.0 37
23 AACG 2022-01-06 09:55:00 2107 2.0049 1.9900 2.0100 1.9900 2.0099 0.0 0.0 10
24 AACG 2022-01-06 09:56:00 3003 2.0028 2.0000 2.0099 2.0000 2.0099 0.0 0.0 23
25 AACG 2022-01-06 09:57:00 8489 2.0272 2.0100 2.0400 2.0100 2.0300 0.0 0.0 34
26 AACG 2022-01-06 09:58:00 6050 2.0155 2.0300 2.0300 2.0150 2.0150 0.0 0.0 6
27 AACG 2022-01-06 09:59:00 61623 2.0449 2.0300 2.0700 2.0300 2.0699 0.0 0.0 83
28 AACG 2022-01-06 10:00:00 19699 2.0856 2.0699 2.1199 2.0600 2.1100 0.0 0.0 54
I want to insert a row for each missing minute that keeps the ticker, date, and missing time, with every other value set to zero.
missing_data = pd.DataFrame({'ticker': ['AACG'], 'date': ['2022-01-06'], 'time': ['09:51:00'],
'vol': [0], 'vwap': [0.0], 'open': [0.0], 'high': [0.0], 'low': [0.0],
'close': [0.0], 'lbh': [0.0], 'lah': [0.0], 'trades': [0]}, index=[21])
It would look something like this:
ticker date time vol vwap open high low close lbh lah trades
21 AACG 2022-01-06 09:51:00 0 0.00 0.00 0.00 0.00 0.00 0.0 0.0 0
With someone's help, I've managed to isolate the rows that come just before each missing minute:
time_in_minutes = pd.to_timedelta(df['time'].astype(str)).dt.total_seconds() // 60
indices_where_the_next_minute_is_missing = np.where(np.diff(time_in_minutes) != 1)[0]
out = df.loc[indices_where_the_next_minute_is_missing]
Simply adding 1 to time_in_minutes gives me the minute that is missing:
timeinminutesplus1 = pd.to_timedelta(out['time'].astype(str)).dt.total_seconds() // 60 + 1
But how do I turn it back into a datetime.time value and insert it into the DataFrame?

Building off of my answer to your previous question, first expand your DataFrame to include NaN rows for missing minutes.
time = pd.to_timedelta(df['time'].astype(str)).dt.total_seconds() // 60
out = df.set_index(time).reindex(np.arange(time.iloc[0], time.iloc[-1] + 1)).reset_index(drop=True)
then given your missing data DataFrame
missing_data = pd.DataFrame({'ticker': ['AACG'], 'date': ['2022-01-06'], 'time': ['09:51:00'],
'vol': [0], 'vwap': [0.0], 'open': [0.0], 'high': [0.0], 'low': [0.0],
'close': [0.0], 'lbh': [0.0], 'lah': [0.0], 'trades': [0]}, index=[21])
which looks like:
ticker date time vol vwap open high low close lbh lah trades
21 AACG 2022-01-06 09:51:00 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0
you can update out:
out.update(missing_data)
Then out becomes:
ticker date time vol vwap open high low close lbh lah trades
0 AACG 2022-01-06 09:30:00 33042.0 1.8807 1.8900 1.9200 1.8700 1.9017 0.0 0.0 68.0
1 AACG 2022-01-06 09:31:00 5306.0 1.9073 1.9100 1.9200 1.8801 1.9100 0.0 0.0 27.0
2 AACG 2022-01-06 09:32:00 3496.0 1.8964 1.9100 1.9193 1.8800 1.8900 0.0 0.0 17.0
3 AACG 2022-01-06 09:33:00 5897.0 1.9377 1.8900 1.9500 1.8900 1.9500 0.0 0.0 15.0
...
20 AACG 2022-01-06 09:50:00 9224.0 1.9603 1.9600 1.9613 1.9600 1.9613 0.0 0.0 4.0
21 AACG 2022-01-06 09:51:00 0.0 0.0000 0.0000 0.0000 0.0000 0.0000 0.0 0.0 0.0
22 AACG 2022-01-06 09:52:00 16914.0 1.9921 1.9800 2.0400 1.9750 2.0399 0.0 0.0 67.0
23 AACG 2022-01-06 09:53:00 4665.0 1.9866 1.9900 2.0395 1.9801 1.9900 0.0 0.0 37.0
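The question also asks how to turn the minute counts back into datetime.time values. A minimal sketch, assuming the minute-of-day numbers are already known (the values 590 and 593 below are hypothetical stand-ins for 09:50 and 09:53, the rows just before each gap):

```python
import pandas as pd

# Hypothetical sample: minute-of-day values for the rows just before each gap.
minutes_before_gap = pd.Series([590, 593])  # 09:50 and 09:53

# Add one minute to land on the missing timestamps.
missing_minutes = minutes_before_gap + 1

# Anchor the offsets to an arbitrary midnight so that .dt.time can be used.
anchor = pd.Timestamp("2022-01-06")
missing_times = (anchor + pd.to_timedelta(missing_minutes, unit="min")).dt.time

print([str(t) for t in missing_times])
```

The anchor date is only there to make the arithmetic possible; any midnight would give the same time-of-day values.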

I used the code that you provided and then iterated over the result in order to add the missing rows. The final result is then sorted again to get the order and indices correctly.
import datetime
import numpy as np
import pandas as pd
# Reading your dataframe
df = pd.read_csv('missing_minute.csv', sep=';', index_col='index')
# Define a default row to add for missing rows
default_row = {'ticker':'AACG', 'date':'2022-01-06', 'time': '00:00:00', 'vol':0.0, 'vwap':0.0, 'open':0.0, 'high':0.0, 'low':0.0, 'close':0.0, 'lbh':0.0, 'lah':0.0, 'trades':0.0}
# Your logic to find the rows before the missing
time_in_minutes = pd.to_timedelta(df['time'].astype(str)).dt.total_seconds() // 60
indices_where_the_next_minute_is_missing = np.where(np.diff(time_in_minutes) != 1)[0]
out = df.loc[indices_where_the_next_minute_is_missing]
# Iterating over the rows
new_rows = []
for i, e in out.iterrows():
    # Extract the time of the previous row and convert it to a datetime
    time_of_previous_row = datetime.datetime.strptime(e['time'], '%H:%M:%S')
    # Add one minute for the new entry
    time_of_new_row = (time_of_previous_row + datetime.timedelta(minutes=1)).strftime("%H:%M:%S")
    # Copy the default row with the new time and collect it
    new_rows.append({**default_row, 'time': time_of_new_row})
# Add the collected rows in one step with pd.concat
df = pd.concat([df, pd.DataFrame(new_rows)], ignore_index=True)
# Sort the dataframe by the time column and reset the index
df = df.sort_values(by='time').reset_index(drop=True)
df
Output:
ticker date time vol vwap open high low close lbh lah trades
0 AACG 2022-01-06 09:30:00 33042.0 1.8807 1.89 1.92 1.87 1.9017 0.0 0.0 68.0
1 AACG 2022-01-06 09:31:00 5306.0 1.9073 1.91 1.92 1.8801 1.91 0.0 0.0 27.0
2 AACG 2022-01-06 09:32:00 3496.0 1.8964 1.91 1.9193 1.88 1.89 0.0 0.0 17.0
3 AACG 2022-01-06 09:33:00 5897.0 1.9377 1.89 1.95 1.89 1.95 0.0 0.0 15.0
4 AACG 2022-01-06 09:34:00 1983.0 1.9362 1.92 1.9499 1.92 1.92 0.0 0.0 9.0
5 AACG 2022-01-06 09:35:00 10725.0 1.9439 1.94 1.96 1.9201 1.9306 0.0 0.0 87.0
6 AACG 2022-01-06 09:36:00 5942.0 1.938 1.9307 1.94 1.93 1.94 0.0 0.0 48.0
7 AACG 2022-01-06 09:37:00 5759.0 1.9428 1.9659 1.9659 1.94 1.95 0.0 0.0 11.0
8 AACG 2022-01-06 09:38:00 4855.0 1.9424 1.95 1.95 1.9401 1.9495 0.0 0.0 10.0
9 AACG 2022-01-06 09:39:00 6275.0 1.9514 1.95 1.97 1.945 1.97 0.0 0.0 14.0
10 AACG 2022-01-06 09:40:00 13695.0 2.015 1.9799 2.05 1.9749 2.02 0.0 0.0 59.0
11 AACG 2022-01-06 09:41:00 3252.0 2.0209 2.0275 2.03 2.02 2.02 0.0 0.0 14.0
12 AACG 2022-01-06 09:42:00 12082.0 2.0117 2.03 2.04 1.98 1.99 0.0 0.0 41.0
13 AACG 2022-01-06 09:43:00 5148.0 1.9802 1.98 1.9999 1.975 1.9999 0.0 0.0 11.0
14 AACG 2022-01-06 09:44:00 2764.0 1.9927 1.9901 1.9943 1.9901 1.9943 0.0 0.0 5.0
15 AACG 2022-01-06 09:45:00 2379.0 1.9576 1.9601 1.9601 1.9201 1.9201 0.0 0.0 10.0
16 AACG 2022-01-06 09:46:00 8762.0 1.9852 1.955 1.99 1.955 1.99 0.0 0.0 35.0
17 AACG 2022-01-06 09:47:00 1343.0 1.9704 1.97 1.9738 1.97 1.9701 0.0 0.0 5.0
18 AACG 2022-01-06 09:48:00 17080.0 1.9696 1.97 1.98 1.96 1.96 0.0 0.0 9.0
19 AACG 2022-01-06 09:49:00 9004.0 1.96 1.96 1.96 1.96 1.96 0.0 0.0 9.0
20 AACG 2022-01-06 09:50:00 9224.0 1.9603 1.96 1.9613 1.96 1.9613 0.0 0.0 4.0
21 AACG 2022-01-06 09:51:00 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
22 AACG 2022-01-06 09:52:00 16914.0 1.9921 1.98 2.04 1.975 2.0399 0.0 0.0 67.0
23 AACG 2022-01-06 09:53:00 4665.0 1.9866 1.99 2.0395 1.9801 1.99 0.0 0.0 37.0
24 AACG 2022-01-06 09:54:00 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
25 AACG 2022-01-06 09:55:00 2107.0 2.0049 1.99 2.01 1.99 2.0099 0.0 0.0 10.0
26 AACG 2022-01-06 09:56:00 3003.0 2.0028 2.0 2.0099 2.0 2.0099 0.0 0.0 23.0
27 AACG 2022-01-06 09:57:00 8489.0 2.0272 2.01 2.04 2.01 2.03 0.0 0.0 34.0
28 AACG 2022-01-06 09:58:00 6050.0 2.0155 2.03 2.03 2.015 2.015 0.0 0.0 6.0
29 AACG 2022-01-06 09:59:00 61623.0 2.0449 2.03 2.07 2.03 2.0699 0.0 0.0 83.0
30 AACG 2022-01-06 10:00:00 19699.0 2.0856 2.0699 2.1199 2.06 2.11 0.0 0.0 54.0
P.S.: This only works if only one entry is missing at a time and not multiple entries in one sequence (e.g. 09:51 and 09:52 are missing). This could be added if you check how many rows in a sequence are missing.
Also, if you have data over multiple days, the code has to be adapted a little bit. First, set the date in the loop. Second, sort by date and time in the end.
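The limitation described in the P.S. (several consecutive minutes missing) can be sidestepped by reindexing against a complete minute range instead of adding one minute per gap. The following is only a sketch under assumptions: the miniature frame and its values are hypothetical, and only two measurement columns are included to keep it short.

```python
import pandas as pd

# Hypothetical miniature of the question's data with 09:51 and 09:52 both missing.
df = pd.DataFrame({
    "ticker": ["AACG"] * 3,
    "date": ["2022-01-06"] * 3,
    "time": ["09:50:00", "09:53:00", "09:54:00"],
    "vol": [9224, 4665, 2107],
    "close": [1.9613, 1.9900, 2.0099],
})

# Combine date and time into one timestamp and use it as the index.
stamp = pd.to_datetime(df["date"] + " " + df["time"])
full_range = pd.date_range(stamp.min(), stamp.max(), freq="min")
out = df.set_index(stamp).reindex(full_range)

# Rebuild the key columns from the index, then zero-fill the measurements.
out["ticker"] = out["ticker"].ffill()
out["date"] = out.index.strftime("%Y-%m-%d")
out["time"] = out.index.strftime("%H:%M:%S")
out[["vol", "close"]] = out[["vol", "close"]].fillna(0)
out = out.reset_index(drop=True)
print(out)
```

For data spanning several days, the same reindex would presumably need to be applied per date (for instance inside a groupby on the date column) so that overnight minutes are not added; that part is not shown here.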

Similar to enke's answer.
It sounds like you may want to create a new data frame with the proper date and time index that you desire (without missing rows). Then fill in the new data frame with the data that you have.
You can do this using pandas.DataFrame.update()
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.update.html
Here is an example:
import pandas as pd
# Build an empty frame that covers every minute in the desired range
start_datetime = '2022-01-06 09:30:00'
end_datetime = '2022-01-06 10:00:00'
cols = ['ticker','date','time','vol','vwap','open','high','low','close','lbh','lah','trades']
new_df = pd.DataFrame(columns=cols,
                      index=pd.date_range(start=start_datetime,
                                          end=end_datetime, freq='min'))
new_df['date'] = [d.date() for d in new_df.index]
new_df['time'] = [d.time() for d in new_df.index]
new_df = new_df.fillna(0.0)  # fillna returns a new frame, so keep the result
# Stand-in for the data you actually have: a shorter range filled with 3
start_datetime = '2022-01-06 09:35:00'
end_datetime = '2022-01-06 09:40:00'
other_df = pd.DataFrame(columns=cols,
                        index=pd.date_range(start=start_datetime,
                                            end=end_datetime, freq='min'))
other_df['date'] = [d.date() for d in other_df.index]
other_df['time'] = [d.time() for d in other_df.index]
other_df = other_df.fillna(3)
# Copy the known values into the full-range frame
final_df = new_df.copy()
final_df.update(other_df)

Related

dataframe calculate pct_change() in slices between two specific values

I have a dataframe as below. I want to calculate df['Close'].pct_change() in the slices between posizione==1 and posizione==-1. So in this example I have to calculate the pct_change in the slices:
between (2022-04-06 09:00:00, 2022-04-06 10:15:00);
between (2022-04-06 12:30:00, 2022-04-06 14:00:00);
between (2022-04-06 15:15:00, 2022-04-06 16:00:00);
so a formula like
df['between']=np.where(df['posizione'].between(1,-1),df['posizione'].pct_change(),0)
Outside 1 and -1 the pct_change() will be zero.
Is it possible? Thanks in advance.
Time Close buy_80_150 posizione
0 2022-04-06 07:30:00 1.053 0.0 0.0
1 2022-04-06 07:45:00 1.049 0.0 0.0
2 2022-04-06 08:00:00 1.046 0.0 0.0
3 2022-04-06 08:15:00 1.049 0.0 0.0
4 2022-04-06 08:30:00 1.048 0.0 0.0
5 2022-04-06 08:45:00 1.049 0.0 0.0
6 2022-04-06 09:00:00 1.047 1.0 1.0
7 2022-04-06 09:15:00 1.044 1.0 0.0
8 2022-04-06 09:30:00 1.044 1.0 0.0
9 2022-04-06 09:45:00 1.050 1.0 0.0
10 2022-04-06 10:00:00 1.041 1.0 0.0
11 2022-04-06 10:15:00 1.048 0.0 -1.0
12 2022-04-06 10:30:00 1.040 0.0 0.0
13 2022-04-06 10:45:00 1.032 0.0 0.0
14 2022-04-06 11:00:00 1.017 0.0 0.0
15 2022-04-06 11:15:00 1.018 0.0 0.0
16 2022-04-06 11:30:00 1.021 0.0 0.0
17 2022-04-06 11:45:00 1.023 0.0 0.0
18 2022-04-06 12:00:00 1.021 0.0 0.0
19 2022-04-06 12:15:00 1.024 0.0 0.0
20 2022-04-06 12:30:00 1.021 1.0 1.0
21 2022-04-06 12:45:00 1.014 1.0 0.0
22 2022-04-06 13:00:00 1.018 1.0 0.0
23 2022-04-06 13:15:00 1.024 1.0 0.0
24 2022-04-06 13:30:00 1.014 1.0 0.0
25 2022-04-06 13:45:00 1.011 1.0 0.0
26 2022-04-06 14:00:00 1.014 0.0 -1.0
27 2022-04-06 14:15:00 1.017 0.0 0.0
28 2022-04-06 14:30:00 1.019 0.0 0.0
29 2022-04-06 14:45:00 1.015 0.0 0.0
30 2022-04-06 15:00:00 1.009 0.0 0.0
31 2022-04-06 15:15:00 1.003 1.0 1.0
32 2022-04-06 15:30:00 1.007 1.0 0.0
33 2022-04-06 15:45:00 1.007 1.0 0.0
34 2022-04-06 16:00:00 1.002 0.0 -1.0
35 2022-04-06 16:15:00 0.994 0.0 0.0
36 2022-04-06 16:30:00 0.993 0.0 0.0
37 2022-04-06 16:45:00 0.992 0.0 0.0
38 2022-04-06 17:00:00 0.980 0.0 0.0

IIUC, you can compute a mask and use it on your data before pct_change:
m1 = df['posizione'].replace(0, float('nan')).ffill().eq(1)
m2 = df['posizione'].eq(-1)
df['pct_change'] = df['Close'].where(m1|m2).pct_change().fillna(0)
output:
Time Close buy_80_150 posizione pct_change
0 2022-04-06 07:30:00 1.053 0.0 0.0 0.000000
1 2022-04-06 07:45:00 1.049 0.0 0.0 0.000000
2 2022-04-06 08:00:00 1.046 0.0 0.0 0.000000
3 2022-04-06 08:15:00 1.049 0.0 0.0 0.000000
4 2022-04-06 08:30:00 1.048 0.0 0.0 0.000000
5 2022-04-06 08:45:00 1.049 0.0 0.0 0.000000
6 2022-04-06 09:00:00 1.047 1.0 1.0 0.000000
7 2022-04-06 09:15:00 1.044 1.0 0.0 -0.002865
8 2022-04-06 09:30:00 1.044 1.0 0.0 0.000000
9 2022-04-06 09:45:00 1.050 1.0 0.0 0.005747
10 2022-04-06 10:00:00 1.041 1.0 0.0 -0.008571
11 2022-04-06 10:15:00 1.048 0.0 -1.0 0.006724
12 2022-04-06 10:30:00 1.040 0.0 0.0 0.000000
13 2022-04-06 10:45:00 1.032 0.0 0.0 0.000000
14 2022-04-06 11:00:00 1.017 0.0 0.0 0.000000
15 2022-04-06 11:15:00 1.018 0.0 0.0 0.000000
16 2022-04-06 11:30:00 1.021 0.0 0.0 0.000000
17 2022-04-06 11:45:00 1.023 0.0 0.0 0.000000
18 2022-04-06 12:00:00 1.021 0.0 0.0 0.000000
19 2022-04-06 12:15:00 1.024 0.0 0.0 0.000000
20 2022-04-06 12:30:00 1.021 1.0 1.0 -0.025763
21 2022-04-06 12:45:00 1.014 1.0 0.0 -0.006856
22 2022-04-06 13:00:00 1.018 1.0 0.0 0.003945
23 2022-04-06 13:15:00 1.024 1.0 0.0 0.005894
24 2022-04-06 13:30:00 1.014 1.0 0.0 -0.009766
25 2022-04-06 13:45:00 1.011 1.0 0.0 -0.002959
26 2022-04-06 14:00:00 1.014 0.0 -1.0 0.002967
27 2022-04-06 14:15:00 1.017 0.0 0.0 0.000000
28 2022-04-06 14:30:00 1.019 0.0 0.0 0.000000
29 2022-04-06 14:45:00 1.015 0.0 0.0 0.000000
30 2022-04-06 15:00:00 1.009 0.0 0.0 0.000000
31 2022-04-06 15:15:00 1.003 1.0 1.0 -0.010848
32 2022-04-06 15:30:00 1.007 1.0 0.0 0.003988
33 2022-04-06 15:45:00 1.007 1.0 0.0 0.000000
34 2022-04-06 16:00:00 1.002 0.0 -1.0 -0.004965
35 2022-04-06 16:15:00 0.994 0.0 0.0 0.000000
36 2022-04-06 16:30:00 0.993 0.0 0.0 0.000000
37 2022-04-06 16:45:00 0.992 0.0 0.0 0.000000
38 2022-04-06 17:00:00 0.980 0.0 0.0 0.000000
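In the output above, the first row of each later slice (for example 12:30:00, with -0.025763) carries a change measured against the last row of the previous slice. If that is not wanted, one possible variation is to number the slices and compute the change within each group. This is a sketch on a hypothetical miniature frame, not the answerer's method:

```python
import pandas as pd

# Hypothetical miniature of the question's frame: two slices opened by 1 and closed by -1.
df = pd.DataFrame({
    "Close": [1.00, 1.10, 1.21, 1.00, 2.00, 3.00, 1.00],
    "posizione": [0, 1, -1, 0, 1, -1, 0],
})

# Rows inside a slice: from each 1 through the matching -1 (inclusive).
opened = df["posizione"].replace(0, float("nan")).ffill().eq(1)
inside = opened | df["posizione"].eq(-1)

# Number the slices so that changes are computed within each one only.
slice_id = df["posizione"].eq(1).cumsum()
changes = df.loc[inside, "Close"].groupby(slice_id[inside]).pct_change()

# Rows outside any slice, and the first row of each slice, end up as 0.
df["pct_change"] = changes.reindex(df.index).fillna(0)
print(df)
```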

How do I solve this NaN error by this function?

Input:
# Fixed-mono-cell temperature
parameters = pvlib.temperature.TEMPERATURE_MODEL_PARAMETERS['sapm']['open_rack_glass_glass']  # extract the specific parameter set
cell_temperature_mono_fixed = pvlib.temperature.sapm_cell(effective_irrad_mono_fixed,
                                                          df['T_a'],
                                                          df['W_s'],
                                                          **parameters)
cell_temperature_mono_fixed
Output:
2005-01-01 01:00:00 NaN
2005-01-01 02:00:00 NaN
2005-01-01 03:00:00 NaN
2005-01-01 04:00:00 NaN
2005-01-01 05:00:00 NaN
..
8755 NaN
8756 NaN
8757 NaN
8758 NaN
8759 NaN
Length: 17520, dtype: float64
cell_temperature_mono_fixed.plot()
Output:
/Users/charlielinck/opt/anaconda3/lib/python3.9/site-packages/pandas/core/indexes/base.py:4024: RuntimeWarning: '
Extra data information:
df: dataframe
date_time Sun_Az Sun_alt GHI DHI DNI T_a W_s
0 2005-01-01 01:00:00 17.9 90.0 0.0 0.0 0.0 15.5 13.3
1 2005-01-01 02:00:00 54.8 90.0 0.0 0.0 0.0 17.0 14.5
2 2005-01-01 03:00:00 73.7 90.0 0.0 0.0 0.0 16.7 14.0
3 2005-01-01 04:00:00 85.7 90.0 0.0 0.0 0.0 16.7 14.2
4 2005-01-01 05:00:00 94.9 90.0 0.0 0.0 0.0 16.7 14.1
5 2005-01-01 06:00:00 103.5 90.0 0.0 0.0 0.0 16.6 14.3
6 2005-01-01 07:00:00 111.6 90.0 0.0 0.0 0.0 16.5 13.8
7 2005-01-01 08:00:00 120.5 89.6 1.0 1.0 0.0 16.6 16.0
8 2005-01-01 09:00:00 130.5 79.9 27.0 27.0 0.0 16.8 16.5
9 2005-01-01 10:00:00 141.8 71.7 55.0 55.0 0.0 16.9 16.9
10 2005-01-01 11:00:00 154.9 65.5 83.0 83.0 0.0 17.0 17.2
11 2005-01-01 12:00:00 169.8 61.9 114.0 114.0 0.0 17.4 17.9
12 2005-01-01 13:00:00 185.2 61.4 110.0 110.0 0.0 17.5 18.0
13 2005-01-01 14:00:00 200.4 64.0 94.0 94.0 0.0 17.5 17.8
14 2005-01-01 15:00:00 214.3 69.5 70.0 70.0 0.0 17.5 17.6
15 2005-01-01 16:00:00 226.3 77.2 38.0 38.0 0.0 17.2 17.0
16 2005-01-01 17:00:00 236.5 86.4 4.0 4.0 0.0 16.7 16.3
17 2005-01-01 18:00:00 245.5 90.0 0.0 0.0 0.0 16.0 14.5
18 2005-01-01 19:00:00 254.2 90.0 0.0 0.0 0.0 14.9 13.0
19 2005-01-01 20:00:00 262.3 90.0 0.0 0.0 0.0 16.0 14.1
20 2005-01-01 21:00:00 271.3 90.0 0.0 0.0 0.0 15.1 13.3
21 2005-01-01 22:00:00 282.1 90.0 0.0 0.0 0.0 15.5 13.2
22 2005-01-01 23:00:00 298.1 90.0 0.0 0.0 0.0 15.6 13.0
23 2005-01-02 00:00:00 327.5 90.0 0.0 0.0 0.0 15.8 13.1
df['T_a'] is temperature data,
df['W_s'] is windspeed data
effective_irrad_mono_fixed.head(24)
date_time
2005-01-01 01:00:00 0.000000
2005-01-01 02:00:00 0.000000
2005-01-01 03:00:00 0.000000
2005-01-01 04:00:00 0.000000
2005-01-01 05:00:00 0.000000
2005-01-01 06:00:00 0.000000
2005-01-01 07:00:00 0.000000
2005-01-01 08:00:00 0.936690
2005-01-01 09:00:00 25.168996
2005-01-01 10:00:00 51.165091
2005-01-01 11:00:00 77.354266
2005-01-01 12:00:00 108.002486
2005-01-01 13:00:00 103.809820
2005-01-01 14:00:00 88.138705
2005-01-01 15:00:00 65.051870
2005-01-01 16:00:00 35.390518
2005-01-01 17:00:00 3.742581
2005-01-01 18:00:00 0.000000
2005-01-01 19:00:00 0.000000
2005-01-01 20:00:00 0.000000
2005-01-01 21:00:00 0.000000
2005-01-01 22:00:00 0.000000
2005-01-01 23:00:00 0.000000
2005-01-02 00:00:00 0.000000
Question: I don't understand why I only get NaN values when I simply run the function. Might it have something to do with the timestamp?
I believe this also causes the RuntimeWarning when I try to plot the result.

This is not really a pvlib issue, more a pandas issue. The problem is that your input time series objects are not on a consistent index: the irradiance input has a pandas.DatetimeIndex while the temperature and wind speed inputs have pandas.RangeIndex (see the index printed out from your df). Math operations on Series are done by aligning index elements and substituting NaN where things don't line up. For example see how only the shared index elements correspond to non-NaN values here:
In [46]: a = pd.Series([1, 2, 3], index=[1, 2, 3])
...: b = pd.Series([2, 3, 4], index=[2, 3, 4])
...: a*b
Out[46]:
1 NaN
2 4.0
3 9.0
4 NaN
dtype: float64
If you examine the index of your cell_temperature_mono_fixed, you'll see it has both timestamps (from the irradiance input) and integers (from the other two), so it's taking the union of the indexes but only filling in values for the intersection (which is empty in this case).
So to fix your problem, you should make sure all the inputs are on a consistent index. The easiest way to do that is probably at the dataframe level, i.e. df = df.set_index('date_time').
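A small sketch of that fix, using hypothetical stand-ins for the question's inputs (three hourly values instead of a full year, and a plain addition instead of the pvlib call):

```python
import pandas as pd

# Hypothetical stand-ins for the question's inputs.
times = pd.to_datetime(["2005-01-01 01:00", "2005-01-01 02:00", "2005-01-01 03:00"])
irradiance = pd.Series([0.0, 25.0, 50.0], index=times)              # DatetimeIndex
df = pd.DataFrame({"date_time": times, "T_a": [15.5, 17.0, 16.7]})  # RangeIndex

# Mismatched indexes: the union is taken and every value becomes NaN.
mismatched = irradiance + df["T_a"]

# Put the frame on the same DatetimeIndex and the values line up.
df = df.set_index("date_time")
aligned = irradiance + df["T_a"]
print(aligned)
```

Here mismatched has six all-NaN entries (three timestamps plus three integers), which mirrors the doubled length of 17520 seen in the question's output.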

Calculating sum of up to the current row in pandas while iterating on each row in a time series data

Suppose I have the following code that calculates how many products I can purchase given my budget:
import math
import pandas as pd
data = [['2021-01-02', 5.5], ['2021-02-02', 10.5], ['2021-03-02', 15.0], ['2021-04-02', 20.0]]
df = pd.DataFrame(data, columns=['Date', 'Current_Price'])
df.Date = pd.to_datetime(df.Date)
mn = df.Date.min()
mx = df.Date.max()
dr = pd.date_range(mn - pd.tseries.offsets.MonthBegin(), mx + pd.tseries.offsets.MonthEnd(), name="Date")
df = df.set_index("Date").reindex(dr).reset_index()
df['Current_Price'] = df.groupby(
pd.Grouper(key='Date', freq='1M'))['Current_Price'].ffill().bfill()
# The dataframe below shows the current price of the product
# I'd like to buy at the specific date_range
print(df)
# Create 'Day' column to know which day of the month
df['Day'] = pd.to_datetime(df['Date']).dt.day
# Create 'Deposit' column to record how much money is
# deposited in, say, my bank account to buy the product.
# 'Withdrawal' column is to record how much I spent in
# buying product(s) at the current price on a specific date.
# 'Num_of_Products_Bought' shows how many items I bought
# on that specific date.
#
# Please note that the calculation below takes into account
# the leftover money, which remains after I've purchased a
# product, for future purchases. For example, if you look at
# the resulting dataframe at the end of this code, you'll
# notice that I was able to purchase 7 products on March 1, 2021
# although my deposit on that day was $100. That is because
# on the days leading up to March 1, 2021, I had been saving
# the spare change from previous purchases, and that
# extra money allows me to buy an extra product on March 1, 2021
# even though a budget of $100 alone would only cover
# 6 products.
df[['Deposit', 'Withdrawal', 'Num_of_Products_Bought']] = 0.0
# Suppose I save $100 at the beginning of every month in my bank account
df.loc[df['Day'] == 1, 'Deposit'] = 100.0
for index, row in df.iterrows():
    if df.loc[index, 'Day'] == 1:
        # num_prod_bought = (sum_of_deposit_so_far - sum_of_withdrawal)/current_price
        df.loc[index, 'Num_of_Products_Bought'] = math.floor(
            (sum(df.iloc[0:(index + 1)]['Deposit'])
             - sum(df.iloc[0:(index + 1)]['Withdrawal']))
            / df.loc[index, 'Current_Price'])
        # Record how much I spent buying the product on specific date
        df.loc[index, 'Withdrawal'] = df.loc[index, 'Num_of_Products_Bought'] * df.loc[index, 'Current_Price']
print(df)
# The code above works as intended,
# but how can I make it more efficient/pandas-like?
# In particular, I don't like the idea of having to
# iterate over the rows and recalculate
# the running sum of the deposits and
# the running sum of the withdrawals.
As mentioned in the comment in the code, I would like to know how to accomplish the same without having to iterate the rows one by one and calculating the sum of the rows up to the current row in my iteration (I read around StackOverflow and saw cumsum() function, but I don't think cumsum has the notion of current row in the iteration).
Thank you very much in advance for your suggestions/answers!

A solution using .apply:
def fn():
    leftover = 0
    amount, deposit = yield
    while True:
        new_amount, new_deposit = yield (deposit + leftover) // amount
        leftover = (deposit + leftover) % amount
        amount, deposit = new_amount, new_deposit
df = df.set_index("Date")
s = fn()
next(s)
m = df.index.day == 1
df.loc[m, "Deposit"] = 100
df.loc[m, "Num_of_Products_Bought"] = df.loc[
    m, ["Current_Price", "Deposit"]
].apply(lambda x: s.send((x["Current_Price"], x["Deposit"])), axis=1)
df.loc[m, "Withdrawal"] = (
    df.loc[m, "Num_of_Products_Bought"] * df.loc[m, "Current_Price"]
)
print(df.fillna(0).reset_index())
Prints:
Date Current_Price Deposit Num_of_Products_Bought Withdrawal
0 2021-01-01 5.5 100.0 18.0 99.0
1 2021-01-02 5.5 0.0 0.0 0.0
2 2021-01-03 5.5 0.0 0.0 0.0
3 2021-01-04 5.5 0.0 0.0 0.0
4 2021-01-05 5.5 0.0 0.0 0.0
5 2021-01-06 5.5 0.0 0.0 0.0
6 2021-01-07 5.5 0.0 0.0 0.0
7 2021-01-08 5.5 0.0 0.0 0.0
8 2021-01-09 5.5 0.0 0.0 0.0
9 2021-01-10 5.5 0.0 0.0 0.0
10 2021-01-11 5.5 0.0 0.0 0.0
11 2021-01-12 5.5 0.0 0.0 0.0
12 2021-01-13 5.5 0.0 0.0 0.0
13 2021-01-14 5.5 0.0 0.0 0.0
14 2021-01-15 5.5 0.0 0.0 0.0
15 2021-01-16 5.5 0.0 0.0 0.0
16 2021-01-17 5.5 0.0 0.0 0.0
17 2021-01-18 5.5 0.0 0.0 0.0
18 2021-01-19 5.5 0.0 0.0 0.0
19 2021-01-20 5.5 0.0 0.0 0.0
20 2021-01-21 5.5 0.0 0.0 0.0
21 2021-01-22 5.5 0.0 0.0 0.0
22 2021-01-23 5.5 0.0 0.0 0.0
23 2021-01-24 5.5 0.0 0.0 0.0
24 2021-01-25 5.5 0.0 0.0 0.0
25 2021-01-26 5.5 0.0 0.0 0.0
26 2021-01-27 5.5 0.0 0.0 0.0
27 2021-01-28 5.5 0.0 0.0 0.0
28 2021-01-29 5.5 0.0 0.0 0.0
29 2021-01-30 5.5 0.0 0.0 0.0
30 2021-01-31 5.5 0.0 0.0 0.0
31 2021-02-01 10.5 100.0 9.0 94.5
32 2021-02-02 10.5 0.0 0.0 0.0
33 2021-02-03 10.5 0.0 0.0 0.0
34 2021-02-04 10.5 0.0 0.0 0.0
35 2021-02-05 10.5 0.0 0.0 0.0
36 2021-02-06 10.5 0.0 0.0 0.0
37 2021-02-07 10.5 0.0 0.0 0.0
38 2021-02-08 10.5 0.0 0.0 0.0
39 2021-02-09 10.5 0.0 0.0 0.0
40 2021-02-10 10.5 0.0 0.0 0.0
41 2021-02-11 10.5 0.0 0.0 0.0
42 2021-02-12 10.5 0.0 0.0 0.0
43 2021-02-13 10.5 0.0 0.0 0.0
44 2021-02-14 10.5 0.0 0.0 0.0
45 2021-02-15 10.5 0.0 0.0 0.0
46 2021-02-16 10.5 0.0 0.0 0.0
47 2021-02-17 10.5 0.0 0.0 0.0
48 2021-02-18 10.5 0.0 0.0 0.0
49 2021-02-19 10.5 0.0 0.0 0.0
50 2021-02-20 10.5 0.0 0.0 0.0
51 2021-02-21 10.5 0.0 0.0 0.0
52 2021-02-22 10.5 0.0 0.0 0.0
53 2021-02-23 10.5 0.0 0.0 0.0
54 2021-02-24 10.5 0.0 0.0 0.0
55 2021-02-25 10.5 0.0 0.0 0.0
56 2021-02-26 10.5 0.0 0.0 0.0
57 2021-02-27 10.5 0.0 0.0 0.0
58 2021-02-28 10.5 0.0 0.0 0.0
59 2021-03-01 15.0 100.0 7.0 105.0
60 2021-03-02 15.0 0.0 0.0 0.0
61 2021-03-03 15.0 0.0 0.0 0.0
62 2021-03-04 15.0 0.0 0.0 0.0
63 2021-03-05 15.0 0.0 0.0 0.0
64 2021-03-06 15.0 0.0 0.0 0.0
65 2021-03-07 15.0 0.0 0.0 0.0
66 2021-03-08 15.0 0.0 0.0 0.0
67 2021-03-09 15.0 0.0 0.0 0.0
68 2021-03-10 15.0 0.0 0.0 0.0
69 2021-03-11 15.0 0.0 0.0 0.0
70 2021-03-12 15.0 0.0 0.0 0.0
71 2021-03-13 15.0 0.0 0.0 0.0
72 2021-03-14 15.0 0.0 0.0 0.0
73 2021-03-15 15.0 0.0 0.0 0.0
74 2021-03-16 15.0 0.0 0.0 0.0
75 2021-03-17 15.0 0.0 0.0 0.0
76 2021-03-18 15.0 0.0 0.0 0.0
77 2021-03-19 15.0 0.0 0.0 0.0
78 2021-03-20 15.0 0.0 0.0 0.0
79 2021-03-21 15.0 0.0 0.0 0.0
80 2021-03-22 15.0 0.0 0.0 0.0
81 2021-03-23 15.0 0.0 0.0 0.0
82 2021-03-24 15.0 0.0 0.0 0.0
83 2021-03-25 15.0 0.0 0.0 0.0
84 2021-03-26 15.0 0.0 0.0 0.0
85 2021-03-27 15.0 0.0 0.0 0.0
86 2021-03-28 15.0 0.0 0.0 0.0
87 2021-03-29 15.0 0.0 0.0 0.0
88 2021-03-30 15.0 0.0 0.0 0.0
89 2021-03-31 15.0 0.0 0.0 0.0
90 2021-04-01 20.0 100.0 5.0 100.0
91 2021-04-02 20.0 0.0 0.0 0.0
92 2021-04-03 20.0 0.0 0.0 0.0
93 2021-04-04 20.0 0.0 0.0 0.0
94 2021-04-05 20.0 0.0 0.0 0.0
95 2021-04-06 20.0 0.0 0.0 0.0
96 2021-04-07 20.0 0.0 0.0 0.0
97 2021-04-08 20.0 0.0 0.0 0.0
98 2021-04-09 20.0 0.0 0.0 0.0
99 2021-04-10 20.0 0.0 0.0 0.0
100 2021-04-11 20.0 0.0 0.0 0.0
101 2021-04-12 20.0 0.0 0.0 0.0
102 2021-04-13 20.0 0.0 0.0 0.0
103 2021-04-14 20.0 0.0 0.0 0.0
104 2021-04-15 20.0 0.0 0.0 0.0
105 2021-04-16 20.0 0.0 0.0 0.0
106 2021-04-17 20.0 0.0 0.0 0.0
107 2021-04-18 20.0 0.0 0.0 0.0
108 2021-04-19 20.0 0.0 0.0 0.0
109 2021-04-20 20.0 0.0 0.0 0.0
110 2021-04-21 20.0 0.0 0.0 0.0
111 2021-04-22 20.0 0.0 0.0 0.0
112 2021-04-23 20.0 0.0 0.0 0.0
113 2021-04-24 20.0 0.0 0.0 0.0
114 2021-04-25 20.0 0.0 0.0 0.0
115 2021-04-26 20.0 0.0 0.0 0.0
116 2021-04-27 20.0 0.0 0.0 0.0
117 2021-04-28 20.0 0.0 0.0 0.0
118 2021-04-29 20.0 0.0 0.0 0.0
119 2021-04-30 20.0 0.0 0.0 0.0
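Because each month's purchase depends on the leftover from the month before, the calculation is a recurrence rather than a plain cumulative sum. A simpler alternative to the generator, sketched here under the assumption that only the month-start rows matter, is to loop over just those four rows and carry the leftover in a variable. The frame below is a hypothetical reduction of the question's data to its month-start prices:

```python
import pandas as pd

# Hypothetical month-start prices matching the question's example.
monthly = pd.DataFrame(
    {"Current_Price": [5.5, 10.5, 15.0, 20.0]},
    index=pd.to_datetime(["2021-01-01", "2021-02-01", "2021-03-01", "2021-04-01"]),
)
deposit = 100.0  # saved at the beginning of every month

bought, spent = [], []
leftover = 0.0
for price in monthly["Current_Price"]:
    budget = deposit + leftover
    units = budget // price            # whole products affordable this month
    leftover = budget - units * price  # carried into the next month
    bought.append(units)
    spent.append(units * price)

monthly["Num_of_Products_Bought"] = bought
monthly["Withdrawal"] = spent
print(monthly)
```

The loop touches one row per month instead of every daily row, and the results can be written back to the daily frame by index if needed.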

Changing year of DataTimeIndex in Pandas

I have a time series with data related to the irradiance of the sun. I have data for every hour of a year, but each month's data comes from a different year. For example, the data taken in March can be from 2012 and the data taken in January can be from 2014.
T2m RH G(h) Gb(n) Gd(h) IR(h) WS10m WD10m SP Hour Month
time(UTC)
2012-01-01 00:00:00 16.00 81.66 0.0 -0.0 0.0 310.15 2.56 284.0 102252.0 0 1
2012-01-01 01:00:00 15.97 82.42 0.0 -0.0 0.0 310.61 2.49 281.0 102228.0 1 1
2012-01-01 02:00:00 15.93 83.18 0.0 -0.0 0.0 311.06 2.41 278.0 102205.0 2 1
2012-01-01 03:00:00 15.89 83.94 0.0 -0.0 0.0 311.52 2.34 281.0 102218.0 3 1
2012-01-01 04:00:00 15.85 84.70 0.0 -0.0 0.0 311.97 2.26 284.0 102232.0 4 1
... ... ... ... ... ... ... ... ... ... ... ...
2011-12-31 19:00:00 16.19 77.86 0.0 -0.0 0.0 307.88 2.94 301.0 102278.0 19 12
2011-12-31 20:00:00 16.15 78.62 0.0 -0.0 0.0 308.33 2.86 302.0 102295.0 20 12
2011-12-31 21:00:00 16.11 79.38 0.0 -0.0 0.0 308.79 2.79 297.0 102288.0 21 12
2011-12-31 22:00:00 16.08 80.14 0.0 -0.0 0.0 309.24 2.71 292.0 102282.0 22 12
2011-12-31 23:00:00 16.04 80.90 0.0 -0.0 0.0 309.70 2.64 287.0 102275.0 23 12
My question is: is there a way to set all the data to a certain year?
For example, setting all the data to 2014:
T2m RH G(h) Gb(n) Gd(h) IR(h) WS10m WD10m SP Hour Month
time(UTC)
2014-01-01 00:00:00 16.00 81.66 0.0 -0.0 0.0 310.15 2.56 284.0 102252.0 0 1
2014-01-01 01:00:00 15.97 82.42 0.0 -0.0 0.0 310.61 2.49 281.0 102228.0 1 1
2014-01-01 02:00:00 15.93 83.18 0.0 -0.0 0.0 311.06 2.41 278.0 102205.0 2 1
2014-01-01 03:00:00 15.89 83.94 0.0 -0.0 0.0 311.52 2.34 281.0 102218.0 3 1
2014-01-01 04:00:00 15.85 84.70 0.0 -0.0 0.0 311.97 2.26 284.0 102232.0 4 1
... ... ... ... ... ... ... ... ... ... ... ...
2014-12-31 19:00:00 16.19 77.86 0.0 -0.0 0.0 307.88 2.94 301.0 102278.0 19 12
2014-12-31 20:00:00 16.15 78.62 0.0 -0.0 0.0 308.33 2.86 302.0 102295.0 20 12
2014-12-31 21:00:00 16.11 79.38 0.0 -0.0 0.0 308.79 2.79 297.0 102288.0 21 12
2014-12-31 22:00:00 16.08 80.14 0.0 -0.0 0.0 309.24 2.71 292.0 102282.0 22 12
2014-12-31 23:00:00 16.04 80.90 0.0 -0.0 0.0 309.70 2.64 287.0 102275.0 23 12
Thanks in advance.

Use offsets.DateOffset with year (without the s) to set the same year across the whole DatetimeIndex:
rng = pd.date_range('2009-04-03', periods=10, freq='350D')
df = pd.DataFrame({ 'a': range(10)}, rng)
print (df)
a
2009-04-03 0
2010-03-19 1
2011-03-04 2
2012-02-17 3
2013-02-01 4
2014-01-17 5
2015-01-02 6
2015-12-18 7
2016-12-02 8
2017-11-17 9
df.index += pd.offsets.DateOffset(year=2014)
print (df)
a
2014-04-03 0
2014-03-19 1
2014-03-04 2
2014-02-17 3
2014-02-01 4
2014-01-17 5
2014-01-02 6
2014-12-18 7
2014-12-02 8
2014-11-17 9
Another idea with Index.map and replace:
df.index = df.index.map(lambda x: x.replace(year=2014))
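One caveat with the replace approach: the question's data includes 2012, a leap year, and replace(year=2014) raises a ValueError for 29 February because that date does not exist in 2014. A minimal sketch of one way to handle it, assuming it is acceptable to drop the leap day (the three-row frame is hypothetical):

```python
import pandas as pd

# Hypothetical index spanning a leap day.
idx = pd.to_datetime(["2012-02-28", "2012-02-29", "2012-03-01"])
df = pd.DataFrame({"a": range(3)}, index=idx)

# Drop 29 February first, since 2014 has no such date and replace() would raise.
is_leap_day = (df.index.month == 2) & (df.index.day == 29)
df = df[~is_leap_day]

# Now every remaining timestamp can be moved to 2014.
df.index = df.index.map(lambda ts: ts.replace(year=2014))
print(df)
```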

Set value based on day in month in pandas timeseries

I have a timeseries
date
2009-12-23 0.0
2009-12-28 0.0
2009-12-29 0.0
2009-12-30 0.0
2009-12-31 0.0
2010-01-04 0.0
2010-01-05 0.0
2010-01-06 0.0
2010-01-07 0.0
2010-01-08 0.0
2010-01-11 0.0
2010-01-12 0.0
2010-01-13 0.0
2010-01-14 0.0
2010-01-15 0.0
2010-01-18 0.0
2010-01-19 0.0
2010-01-20 0.0
2010-01-21 0.0
2010-01-22 0.0
2010-01-25 0.0
2010-01-26 0.0
2010-01-27 0.0
2010-01-28 0.0
2010-01-29 0.0
2010-02-01 0.0
2010-02-02 0.0
I would like to set the value to 1 based on the following rule:
If the constant is set to 9, this means the 9th of each month. Because 2010-01-09 doesn't exist in the series, I would like to set the next date that does exist to 1, which is 2010-01-11 above.
I have tried creating two series, one (series1) with day < 9 set to 1 and one (series2) with day > 9 set to 1, and then computing series1.shift(1) * series2.
It works in the middle of the month, but not if the day is set to 1, because the last date of the previous month is set to 0 in series1.

Assume your time series is s, with a DatetimeIndex.
I want to create a groupby object that flags, month by month, the index values whose day is greater than or equal to 9.
g = s.index.to_series().dt.day.ge(9).groupby(s.index.to_period('M'))
Then I'll check that each month has at least one day >= 9 and grab the first among them. With those, I'll assign the value of 1.
s.loc[g.idxmax()[g.any()]] = 1
s
date
2009-12-23 1.0
2009-12-28 0.0
2009-12-29 0.0
2009-12-30 0.0
2009-12-31 0.0
2010-01-04 0.0
2010-01-05 0.0
2010-01-06 0.0
2010-01-07 0.0
2010-01-08 0.0
2010-01-11 1.0
2010-01-12 0.0
2010-01-13 0.0
2010-01-14 0.0
2010-01-15 0.0
2010-01-18 0.0
2010-01-19 0.0
2010-01-20 0.0
2010-01-21 0.0
2010-01-22 0.0
2010-01-25 0.0
2010-01-26 0.0
2010-01-27 0.0
2010-01-28 0.0
2010-01-29 0.0
2010-02-01 0.0
2010-02-02 0.0
Name: val, dtype: float64
Note that 2009-12-23 also was assigned a 1 as it satisfies this requirement as well.
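The same rule can also be written by first selecting the dates on or after the chosen day and then taking the earliest one per month. This is an alternative sketch, not the answerer's code, and the six-date series is a hypothetical sample:

```python
import pandas as pd

# Hypothetical short series of business days, all zeros.
idx = pd.to_datetime(["2010-01-07", "2010-01-08", "2010-01-11", "2010-01-12",
                      "2010-02-01", "2010-02-02"])
s = pd.Series(0.0, index=idx, name="val")

day = 9  # the constant from the question

# Keep only the dates on or after the chosen day of the month.
candidates = s.index[s.index.day >= day].to_series()

# The earliest candidate in each calendar month is the date to flag.
first_per_month = candidates.groupby(candidates.index.to_period("M")).min()
s.loc[first_per_month.tolist()] = 1
print(s)
```

Months with no date on or after the chosen day (February in this sample) simply produce no candidate, so nothing is flagged for them.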
