Reindexing Pandas based on daterange - python

I am trying to reindex the dates in pandas. This is because there are dates which are missing, such as weekends or national holidays.
To do this I am using the following code:
import pandas as pd
import yfinance as yf
import datetime
start = datetime.date(2015,1,1)
end = datetime.date.today()
df = yf.download('F', start, end, interval ='1d', progress = False)
df.index = df.index.strftime('%Y-%m-%d')
full_dates = pd.date_range(start, end)
df.reindex(full_dates)
This code is producing this dataframe:
Open High Low Close Adj Close Volume
2015-01-01 NaN NaN NaN NaN NaN NaN
2015-01-02 NaN NaN NaN NaN NaN NaN
2015-01-03 NaN NaN NaN NaN NaN NaN
2015-01-04 NaN NaN NaN NaN NaN NaN
2015-01-05 NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ...
2023-01-13 NaN NaN NaN NaN NaN NaN
2023-01-14 NaN NaN NaN NaN NaN NaN
2023-01-15 NaN NaN NaN NaN NaN NaN
2023-01-16 NaN NaN NaN NaN NaN NaN
2023-01-17 NaN NaN NaN NaN NaN NaN
Could you please advise why is it not reindexing the data and showing NaN values instead?
=== Edit ===
Could it be a Python version issue? I ran the same code in Python 3.7 and 3.10 (screenshots). In Python 3.10 the index is a DatetimeIndex, as the image shows: yf.download('F', start, end, interval='1d', progress=False) returns a datetime index before the strftime call.
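The mismatch can be reproduced without yfinance at all; a minimal sketch with made-up prices (the numbers and column name are illustrative only):

```python
import pandas as pd

# a small frame indexed by date *strings*, mimicking df after strftime
df = pd.DataFrame(
    {"Close": [15.36, 14.76]},
    index=["2015-01-02", "2015-01-05"],
)

# reindex with Timestamps: no string equals a Timestamp, so everything is NaN
full_dates = pd.date_range("2015-01-01", "2015-01-05")
all_nan = df.reindex(full_dates)
print(all_nan["Close"].isna().all())   # True

# reindex with matching *strings* instead: the existing rows survive
kept = df.reindex(full_dates.strftime("%Y-%m-%d"))
print(kept.loc["2015-01-02", "Close"])   # 15.36
```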

Remove the conversion of the DatetimeIndex to strings (df.index = df.index.strftime('%Y-%m-%d')), so that you can reindex by datetimes:
df = yf.download('F', start, end, interval ='1d', progress = False)
full_dates = pd.date_range(start, end)
df = df.reindex(full_dates)
print (df)
Open High Low Close Adj Close Volume
2015-01-01 NaN NaN NaN NaN NaN NaN
2015-01-02 15.59 15.65 15.18 15.36 10.830517 24777900.0
2015-01-03 NaN NaN NaN NaN NaN NaN
2015-01-04 NaN NaN NaN NaN NaN NaN
2015-01-05 15.12 15.13 14.69 14.76 10.407450 44079700.0
... ... ... ... ... ... ...
2023-01-13 12.63 12.82 12.47 12.72 12.720000 96317800.0
2023-01-14 NaN NaN NaN NaN NaN NaN
2023-01-15 NaN NaN NaN NaN NaN NaN
2023-01-16 NaN NaN NaN NaN NaN NaN
2023-01-17 NaN NaN NaN NaN NaN NaN
[2939 rows x 6 columns]
print (df.index)
DatetimeIndex(['2015-01-01', '2015-01-02', '2015-01-03', '2015-01-04',
'2015-01-05', '2015-01-06', '2015-01-07', '2015-01-08',
'2015-01-09', '2015-01-10',
...
'2023-01-08', '2023-01-09', '2023-01-10', '2023-01-11',
'2023-01-12', '2023-01-13', '2023-01-14', '2023-01-15',
'2023-01-16', '2023-01-17'],
dtype='datetime64[ns]', length=2939, freq='D')
EDIT: There is a timezone difference; to remove it, use DatetimeIndex.tz_convert:
df = yf.download('F', start, end, interval ='1d', progress = False)
df.index= df.index.tz_convert(None)
full_dates = pd.date_range(start, end)
df = df.reindex(full_dates)
print (df)
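The timezone variant of the same mismatch can be sketched without yfinance; here the index is assumed to be tz-aware UTC, which is an assumption, not what every yfinance version returns:

```python
import pandas as pd

# tz-aware index (UTC here), similar to what newer yfinance versions return
idx = pd.date_range("2015-01-02", periods=2, tz="UTC")
df = pd.DataFrame({"Close": [15.36, 14.76]}, index=idx)

naive = pd.date_range("2015-01-01", "2015-01-05")

# tz-aware labels never equal tz-naive ones, so every row comes back NaN
mismatched = df.reindex(naive)
print(mismatched["Close"].isna().all())   # True

# dropping the timezone makes the labels comparable again
df.index = df.index.tz_convert(None)
fixed = df.reindex(naive)
print(fixed.loc["2015-01-02", "Close"])   # 15.36
```

Note that for a non-UTC index, tz_convert(None) first converts to UTC, which shifts the wall-clock times; tz_localize(None) drops the zone while keeping the local times.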

You need to use strings in reindex to keep a homogeneous type; otherwise pandas doesn't match the string (e.g., '2015-01-02') with the Timestamp (e.g., pd.Timestamp('2015-01-02')):
df.reindex(full_dates.astype(str))
#or
df.reindex(full_dates.strftime('%Y-%m-%d'))
Output:
Open High Low Close Adj Close Volume
2015-01-01 NaN NaN NaN NaN NaN NaN
2015-01-02 15.59 15.65 15.18 15.36 10.830517 24777900.0
2015-01-03 NaN NaN NaN NaN NaN NaN
2015-01-04 NaN NaN NaN NaN NaN NaN
2015-01-05 15.12 15.13 14.69 14.76 10.407451 44079700.0
... ... ... ... ... ... ...
2023-01-13 12.63 12.82 12.47 12.72 12.720000 96317800.0
2023-01-14 NaN NaN NaN NaN NaN NaN
2023-01-15 NaN NaN NaN NaN NaN NaN
2023-01-16 NaN NaN NaN NaN NaN NaN
2023-01-17 NaN NaN NaN NaN NaN NaN
[2939 rows x 6 columns]

Related

Forward fill column one year after last observation

I forward fill values in the following df using:
df = (df.resample('d')  # ensure data is daily time series
        .ffill()
        .sort_index(ascending=True))
df before forward fill
id a b c d
datadate
1980-01-31 NaN NaN NaN NaN
1980-02-29 NaN 2 NaN NaN
1980-03-31 NaN NaN NaN NaN
1980-04-30 1 NaN 3 4
1980-05-31 NaN NaN NaN NaN
... ... ... ... ...
2019-08-31 NaN NaN NaN NaN
2019-09-30 NaN NaN NaN NaN
2019-10-31 NaN NaN NaN NaN
2019-11-30 NaN NaN NaN NaN
2019-12-31 NaN NaN 20 33
However, I wish to only forward fill one year after (the date is a datetime) the last observation, and then the remaining rows should simply be NaN. I am not sure of the best way to introduce this criterion into the task. Any help would be super!
Thanks
If I understand you correctly, you want to forward-fill the values on Dec 31, 2019 to the next year. Try this:
end_date = df.index.max()
new_end_date = end_date + pd.offsets.DateOffset(years=1)
new_index = df.index.append(pd.date_range(end_date, new_end_date, closed='right'))
df = df.reindex(new_index)
df.loc[end_date:, :] = df.loc[end_date:, :].ffill()
Result:
a b c d
1980-01-31 NaN NaN NaN NaN
1980-02-29 NaN 2.0 NaN NaN
1980-03-31 NaN NaN NaN NaN
1980-04-30 1.0 NaN 3.0 4.0
1980-05-31 NaN NaN NaN NaN
2019-08-31 NaN NaN NaN NaN
2019-09-30 NaN NaN NaN NaN
2019-10-31 NaN NaN NaN NaN
2019-11-30 NaN NaN NaN NaN
2019-12-31 NaN NaN 20.0 33.0
2020-01-01 NaN NaN 20.0 33.0
2020-01-02 NaN NaN 20.0 33.0
...
2020-12-31 NaN NaN 20.0 33.0
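The approach above still works in current pandas, with one API change: date_range's closed= keyword became inclusive= in pandas 2.0. A minimal sketch on toy data (only one column, made-up values):

```python
import pandas as pd

# toy frame: the last observation sits on the final row
df = pd.DataFrame(
    {"c": [None, 20.0]},
    index=pd.to_datetime(["2019-11-30", "2019-12-31"]),
)

end_date = df.index.max()
new_end_date = end_date + pd.DateOffset(years=1)
# pandas >= 2.0 spells the old closed='right' as inclusive='right'
new_index = df.index.append(pd.date_range(end_date, new_end_date, inclusive="right"))
out = df.reindex(new_index)
out.loc[end_date:] = out.loc[end_date:].ffill()
print(out.loc["2020-12-31", "c"])   # 20.0
```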
One solution is to forward fill using the limit parameter, but this won't handle leap years:
df.fillna(method='ffill', limit=365)
The second solution is to define a more robust function to do the forward fill in the 1-year window:
from pandas.tseries.offsets import DateOffset

def fun(serie_df):
    serie = serie_df.copy()
    indexes = serie[~serie.isnull()].index
    for idx in indexes:
        mask = (serie.index >= idx) & (serie.index < idx + DateOffset(years=1))
        serie.loc[mask] = serie[mask].fillna(method='ffill')
    return serie

df_filled = df.apply(fun, axis=0)
If a column has multiple non-NaN values in the same 1-year window, the first solution's fill will stop once the most recent value is encountered. The second solution treats consecutive values as if they were independent.
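A runnable sketch of the windowed fill on toy data; it uses .ffill() since fillna(method=...) is deprecated in recent pandas, but it is otherwise the same idea:

```python
import pandas as pd

def fill_one_year(s):
    """Forward-fill each observed value, but only within one year of it."""
    s = s.copy()
    for idx in s.dropna().index:
        mask = (s.index >= idx) & (s.index < idx + pd.DateOffset(years=1))
        s.loc[mask] = s[mask].ffill()
    return s

idx = pd.to_datetime(["2019-06-30", "2019-12-31", "2020-06-30", "2021-06-30"])
s = pd.Series([1.0, None, None, None], index=idx)
out = fill_one_year(s)
print(out.tolist())   # [1.0, 1.0, nan, nan]
```

The 2019-12-31 row is filled because it falls within a year of the observation; the later rows stay NaN.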

Pandas: datetime indexed series to time indexed date columns dataframe

I have a datetime indexed series like this:
2018-08-27 17:45:01 1
2018-08-27 16:01:12 1
2018-08-27 13:48:47 1
2018-08-26 22:26:40 2
2018-08-26 20:10:42 1
2018-08-26 18:20:32 1
2018-08-25 23:07:51 1
2018-08-25 01:46:08 1
2018-09-18 14:08:23 1
2018-09-17 19:38:38 1
2018-09-15 22:40:45 1
What is an elegant way to reformat this into a time indexed dataframe whose columns are dates? For example:
2018-10-24 2018-06-28 2018-10-23
15:16:41 1.0 NaN NaN
15:18:16 1.0 NaN NaN
15:21:42 1.0 NaN NaN
23:35:00 NaN NaN 1.0
23:53:13 NaN 1.0 NaN
Current approach:
from collections import defaultdict
from functools import partial

time_date_dict = defaultdict(partial(defaultdict, int))
for i in series.iteritems():
    datetime = i[0]
    value = i[1]
    time_date_dict[datetime.time()][datetime.date()] = value
time_date_df = pd.DataFrame.from_dict(time_date_dict, orient='index')
Use pivot:
df1 = pd.pivot(s.index.time, s.index.date, s)
#if want strings index and columns names
#df1 = pd.pivot(s.index.strftime('%H:%M:%S'), s.index.strftime('%Y-%m-%d'), s)
print (df1)
date 2018-08-25 2018-08-26 2018-08-27 2018-09-15 2018-09-17 \
date
01:46:08 1.0 NaN NaN NaN NaN
13:48:47 NaN NaN 1.0 NaN NaN
14:08:23 NaN NaN NaN NaN NaN
16:01:12 NaN NaN 1.0 NaN NaN
17:45:01 NaN NaN 1.0 NaN NaN
18:20:32 NaN 1.0 NaN NaN NaN
19:38:38 NaN NaN NaN NaN 1.0
20:10:42 NaN 1.0 NaN NaN NaN
22:26:40 NaN 2.0 NaN NaN NaN
22:40:45 NaN NaN NaN 1.0 NaN
23:07:51 1.0 NaN NaN NaN NaN
date 2018-09-18
date
01:46:08 NaN
13:48:47 NaN
14:08:23 1.0
16:01:12 NaN
17:45:01 NaN
18:20:32 NaN
19:38:38 NaN
20:10:42 NaN
22:26:40 NaN
22:40:45 NaN
23:07:51 NaN
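In current pandas, pd.pivot no longer accepts positional Series arguments; the same reshape can be sketched by lifting the index apart into time/date columns first (toy values modeled on the sample):

```python
import datetime
import pandas as pd

s = pd.Series(
    [1, 2, 1],
    index=pd.to_datetime(
        ["2018-08-27 17:45:01", "2018-08-26 22:26:40", "2018-08-26 20:10:42"]
    ),
)

# split the datetime index into time/date columns, then pivot
frame = pd.DataFrame(
    {"time": s.index.time, "date": s.index.date, "value": s.to_numpy()}
)
out = frame.pivot(index="time", columns="date", values="value")
print(out.loc[datetime.time(22, 26, 40), datetime.date(2018, 8, 26)])   # 2.0
```

Missing (time, date) combinations become NaN, which is why the values come back as floats.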

Python Pandas Datetime and dataframe indexing issue

I have a datetime issue where I am trying to match up a dataframe
with dates as index values.
For example, I have dr which is an array of numpy.datetime.
dr = [numpy.datetime64('2014-10-31T00:00:00.000000000'),
numpy.datetime64('2014-11-30T00:00:00.000000000'),
numpy.datetime64('2014-12-31T00:00:00.000000000'),
numpy.datetime64('2015-01-31T00:00:00.000000000'),
numpy.datetime64('2015-02-28T00:00:00.000000000'),
numpy.datetime64('2015-03-31T00:00:00.000000000')]
Then I have dataframe with returndf with dates as index values
print(returndf)
1 2 3 4 5 6 7 8 9 10
10/31/2014 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
11/30/2014 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Please ignore the missing values
Whenever I try to match date in dr and dataframe returndf, using the following code for just 1 month returndf.loc[str(dr[1])],
I get an error
KeyError: 'the label [2014-11-30T00:00:00.000000000] is not in the [index]'
I would appreciate if someone can help with me on how to convert numpy.datetime64('2014-10-31T00:00:00.000000000') into 10/31/2014 so that I can match it to the data frame index value.
Thank you,
Your index for returndf is not a DatetimeIndex. Make it so:
returndf = returndf.set_index(pd.to_datetime(returndf.index))
Your dr is a list of NumPy datetime64 objects; convert it as well:
dr = pd.to_datetime(dr)
Your sample data clearly shows that the index of returndf does not include all the items in dr. In that case, use reindex:
returndf.reindex(dr)
1 2 3 4 5 6 7 8 9 10
2014-10-31 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2014-11-30 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2014-12-31 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2015-01-31 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2015-02-28 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2015-03-31 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
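Put together, the fix can be sketched end to end on toy data (the ret column and its values are made up):

```python
import numpy as np
import pandas as pd

dr = [np.datetime64("2014-10-31"), np.datetime64("2014-11-30")]
returndf = pd.DataFrame({"ret": [0.1, 0.2]}, index=["10/31/2014", "11/30/2014"])

# string labels can't match datetime64 keys, so convert both sides
returndf.index = pd.to_datetime(returndf.index)
dr = pd.to_datetime(dr)
print(returndf.loc[dr[1], "ret"])   # 0.2
```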

Select closest time to a certain hour in python pandas

I'm looking to see if there is a way to select the closest time to a certain hour. I have the following. The file contains 10 years' worth of data and I've narrowed it down to some time series that I'd want to keep.
import pandas as pd
from pandas import DataFrame
import matplotlib.pyplot as plt
from matplotlib.pyplot import *
import datetime
import numpy as np
dateparse = lambda x: pd.datetime.strptime(x, "%d:%m:%Y %H:%M:%S")
aeronet = pd.read_csv('somefile', skiprows = 4, na_values = ['N/A'], parse_dates={'times':[0,1]}, date_parser=dateparse)
aeronet = aeronet.set_index('times')
del aeronet['Julian_Day']
aeronet.between_time('06:00:00', '07:00:00'), aeronet.between_time('12:00:00', '13:00:00')
I've selected a snippet of such. Is there a way to select just the time closest to 06:00 or 12:00 (and its contents), discard/ignore the rest from the pandas series, and do this for the entirety of the file?
times AOT_1640 AOT_1020 AOT_870 AOT_675 AOT_667 AOT_555 ...
2000-08-07 06:49:10 NaN 0.380411 0.406041 0.445789 NaN NaN
2000-08-07 06:57:36 NaN 0.353378 0.377769 0.420168 NaN NaN
2000-08-08 06:31:00 NaN 0.322402 0.338164 0.364679 NaN NaN
2000-08-08 06:33:28 NaN 0.337819 0.353995 0.381201 NaN NaN
2000-08-08 06:36:26 NaN 0.347656 0.361839 0.390342 NaN NaN
2000-08-08 06:51:50 NaN 0.306449 0.325672 0.351885 NaN NaN
2000-08-08 06:54:23 NaN 0.336512 0.355386 0.380230 NaN NaN
2000-08-08 06:57:20 NaN 0.330028 0.345679 0.373780 NaN NaN
2000-08-09 06:34:56 NaN 0.290533 0.306911 0.336597 NaN NaN
2000-08-09 06:41:53 NaN 0.294413 0.311553 0.343473 NaN NaN
2000-08-09 06:49:45 NaN 0.311042 0.332054 0.360999 NaN NaN
2000-08-09 06:52:15 NaN 0.319396 0.339932 0.369617 NaN NaN
2000-08-09 06:55:20 NaN 0.327440 0.349084 0.378345 NaN NaN
2000-08-09 06:58:23 NaN 0.323247 0.345273 0.373879 NaN NaN
2000-08-12 06:30:01 NaN 0.465173 0.471528 0.483079 NaN NaN
2000-08-12 06:33:05 NaN 0.460013 0.465674 0.479500 NaN NaN
2000-08-12 06:35:59 NaN 0.433161 0.438488 0.453779 NaN NaN
2000-08-12 06:42:12 NaN 0.406479 0.415580 0.432160 NaN NaN
2000-08-12 06:50:06 NaN 0.414227 0.424330 0.439448 NaN NaN
2000-08-12 06:57:21 NaN 0.396034 0.404258 0.423866 NaN NaN
2000-08-12 06:59:47 NaN 0.372097 0.380798 0.401600 NaN NaN
[6200 rows x 42 columns]
...
times AOT_1640 AOT_1020 AOT_870 AOT_675 AOT_667 AOT_555 ...
2000-01-01 12:23:54 NaN 0.513307 0.557325 0.653497 NaN NaN
2000-01-03 12:24:49 NaN 0.439142 0.494118 0.593997 NaN NaN
2000-01-03 12:39:49 NaN 0.429130 0.477874 0.577334 NaN NaN
2000-01-03 12:54:48 NaN 0.437720 0.489006 0.586224 NaN NaN
2000-01-04 12:10:30 NaN 0.325203 0.362335 0.426348 NaN NaN
2000-01-04 12:25:15 NaN 0.323978 0.356274 0.423620 NaN NaN
2000-01-04 12:40:15 NaN 0.325356 0.361138 0.427271 NaN NaN
2000-01-04 12:55:14 NaN 0.326595 0.363519 0.431527 NaN NaN
2000-01-06 12:11:08 NaN 0.282777 0.307676 0.369811 NaN NaN
2000-01-06 12:26:09 NaN 0.285853 0.314178 0.374832 NaN NaN
2000-01-06 12:41:08 NaN 0.258836 0.289263 0.346880 NaN NaN
2000-01-08 12:12:04 NaN 0.165473 0.185018 0.235770 NaN NaN
2000-01-08 12:42:01 NaN 0.143540 0.164647 0.216335 NaN NaN
2000-01-08 12:57:01 NaN 0.142760 0.164886 0.215461 NaN NaN
2000-01-10 12:12:52 NaN 0.192453 0.225909 0.310540 NaN NaN
2000-01-10 12:27:53 NaN 0.202532 0.238400 0.322692 NaN NaN
2000-01-10 12:42:52 NaN 0.199996 0.235561 0.320756 NaN NaN
2000-01-10 12:57:52 NaN 0.208046 0.245054 0.331214 NaN NaN
2000-01-11 12:13:19 NaN 0.588879 0.646470 0.750459 NaN NaN
2000-01-11 12:28:17 NaN 0.621813 0.680442 0.788457 NaN NaN
2000-01-11 12:43:17 NaN 0.626547 0.685880 0.790631 NaN NaN
2000-01-11 12:58:16 NaN 0.631142 0.689125 0.796060 NaN NaN
2000-01-12 12:28:42 NaN 0.535105 0.584593 0.688904 NaN NaN
2000-01-12 12:43:41 NaN 0.518697 0.571025 0.676406 NaN NaN
2000-01-12 12:58:40 NaN 0.528318 0.583229 0.687795 NaN NaN
2000-01-13 12:14:20 NaN 0.382645 0.419463 0.496089 NaN NaN
2000-01-13 12:29:05 NaN 0.376186 0.414921 0.491920 NaN NaN
2000-01-13 12:44:05 NaN 0.387845 0.424576 0.501968 NaN NaN
2000-01-13 12:59:04 NaN 0.386237 0.423254 0.503163 NaN NaN
2000-01-14 12:14:43 NaN 0.400024 0.425522 0.485719 NaN NaN
[6672 rows x 42 columns])
Such a way that the aeronet dataframe looks similar to this when I print it out? I'm hoping to either still do some calculation with it still or export it to excel.
times AOT_1640 AOT_1020 AOT_870 AOT_675 AOT_667 AOT_555 ...
2000-08-07 06:49:10 NaN 0.380411 0.406041 0.445789 NaN NaN
2000-08-08 06:31:00 NaN 0.322402 0.338164 0.364679 NaN NaN
2000-08-09 06:34:56 NaN 0.290533 0.306911 0.336597 NaN NaN
2000-08-12 06:30:01 NaN 0.465173 0.471528 0.483079 NaN NaN
....
2000-01-01 12:23:54 NaN 0.513307 0.557325 0.653497 NaN NaN
2000-01-03 12:24:49 NaN 0.439142 0.494118 0.593997 NaN NaN
2000-01-04 12:10:30 NaN 0.325203 0.362335 0.426348 NaN NaN
2000-01-06 12:11:08 NaN 0.282777 0.307676 0.369811 NaN NaN
2000-01-08 12:12:04 NaN 0.165473 0.185018 0.235770 NaN NaN
2000-01-10 12:12:52 NaN 0.192453 0.225909 0.310540 NaN NaN
2000-01-11 12:13:19 NaN 0.588879 0.646470 0.750459 NaN NaN
2000-01-12 12:28:42 NaN 0.535105 0.584593 0.688904 NaN NaN
2000-01-13 12:14:20 NaN 0.382645 0.419463 0.496089 NaN NaN
2000-01-14 12:14:43 NaN 0.400024 0.425522 0.485719 NaN NaN
Probably a more efficient way to do this, but this gets the job done I think.
First, add fields for date and time:
aeronet['date'] = aeronet.times.dt.date
aeronet['time'] = aeronet.times.dt.time
Now, aeronet.date.unique() gets you a list of the unique dates. You'll need it later.
dates = aeronet.date.unique()
Create a column that gives absolute distance from 6 am
from datetime import date, datetime, time

sixam = time(6, 0, 0, 0)

def fromsix(t):
    return abs(datetime.combine(date.min, t) - datetime.combine(date.min, sixam))

aeronet['fromsix'] = aeronet.time.apply(fromsix)
datetime.combine is necessary because apparently you can't just subtract two time objects.
And now, finally,
pd.concat([aeronet[aeronet.date == date][aeronet.fromsix == aeronet[aeronet.date == date].fromsix.min()] for date in dates])
This uses a list comprehension to slice the dataframe into individual dates, finds the element with minimal distance from sixam, and concatenates the results together.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.merge_asof.html
That's the way to go, buddy: efficient, simple, fast.
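A sketch of that merge_asof idea on a few rows modeled on the sample, picking the observation nearest each day's 06:00 (the per-day target construction is my own assumption about the intent):

```python
import pandas as pd

obs = pd.DataFrame(
    {
        "times": pd.to_datetime(
            ["2000-08-07 06:49:10", "2000-08-07 06:57:36", "2000-08-08 06:31:00"]
        ),
        "AOT_1020": [0.380411, 0.353378, 0.322402],
    }
).sort_values("times")

# one 06:00 target per day present in the data
targets = pd.DataFrame(
    {"times": obs["times"].dt.normalize().drop_duplicates() + pd.Timedelta(hours=6)}
).sort_values("times")

# for each target, grab the observation nearest in time
nearest = pd.merge_asof(targets, obs, on="times", direction="nearest")
print(nearest["AOT_1020"].tolist())   # [0.380411, 0.322402]
```

Both frames must be sorted on the key; a tolerance= argument could cap how far from 06:00 a match may be.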

"ValueError: cannot reindex from a duplicate axis"

I have the following df:
Timestamp A B C ...
2014-11-09 00:00:00 NaN 1 NaN NaN
2014-11-09 00:00:00 2 NaN NaN NaN
2014-11-09 00:00:00 NaN NaN 3 NaN
2014-11-09 08:24:00 NaN NaN 1 NaN
2014-11-09 08:24:00 105 NaN NaN NaN
2014-11-09 09:19:00 NaN NaN 23 NaN
And I would like to make the following:
Timestamp A B C ...
2014-11-09 00:00:00 2 1 3 NaN
2014-11-09 00:01:00 NaN NaN NaN NaN
2014-11-09 00:02:00 NaN NaN NaN NaN
... NaN NaN NaN NaN
2014-11-09 08:23:00 NaN NaN NaN NaN
2014-11-09 08:24:00 105 NaN 1 NaN
2014-11-09 08:25:00 NaN NaN NaN NaN
2014-11-09 08:26:00 NaN NaN NaN NaN
2014-11-09 08:27:00 NaN NaN NaN NaN
... NaN NaN NaN NaN
2014-11-09 09:18:00 NaN NaN NaN NaN
2014-11-09 09:19:00 NaN NaN 23 NaN
That is: I would like to merge the rows that share the same Timestamp (the df has 17 columns), resample at 1-minute granularity, and have NaN for the columns with no values.
I started in the following ways:
df.groupby('Timestamp').sum()
and
df = df.resample('1Min', how='max')
but I obtained the following error:
ValueError: cannot reindex from a duplicate axis
How can I solve this problem? I'm just learning Python so I don't have experience at all.
Thank you!
Assuming that you have your Timestamp as the index to begin with, you need to do the resample first, and reset_index before doing a groupby. Here's a working sample:
import pandas as pd
df
A B C ...
Timestamp
2014-11-09 00:00:00 NaN 1 NaN NaN
2014-11-09 00:00:00 2 NaN NaN NaN
2014-11-09 00:00:00 NaN NaN 3 NaN
2014-11-09 08:24:00 NaN NaN 1 NaN
2014-11-09 08:24:00 105 NaN NaN NaN
2014-11-09 09:19:00 NaN NaN 23 NaN
df.resample('1Min', how='max').reset_index().groupby('Timestamp').sum()
A B C ...
Timestamp
2014-11-09 00:00:00 2 1 3 NaN
2014-11-09 00:01:00 NaN NaN NaN NaN
2014-11-09 00:02:00 NaN NaN NaN NaN
2014-11-09 00:03:00 NaN NaN NaN NaN
2014-11-09 00:04:00 NaN NaN NaN NaN
...
2014-11-09 09:17:00 NaN NaN NaN NaN
2014-11-09 09:18:00 NaN NaN NaN NaN
2014-11-09 09:19:00 NaN NaN 23 NaN
Hope this helps.
Updated:
As said in the comment, your 'Timestamp' isn't a datetime but probably a string, so you cannot resample by DatetimeIndex. Just reset_index and convert it, something like this:
df = df.reset_index()
df['ts'] = pd.to_datetime(df['Timestamp'])
# 'ts' is now datetime of 'Timestamp', you just need to set it to index
df = df.set_index('ts')
...
Now just run the previous code again but replace 'Timestamp' with 'ts' and you should be OK.
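With current pandas the how= keyword is gone from resample; one way to sketch the same result is to collapse duplicate timestamps with groupby first and then resample (note this reverses the answer's order, which also sidesteps the duplicate-axis error):

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [None, 2, None], "B": [1, None, None], "C": [None, None, 3]},
    index=pd.to_datetime(["2014-11-09 00:00:00"] * 3),
)

# collapse the duplicate timestamps first, then resample;
# .max() replaces the old how='max' keyword in current pandas
out = df.groupby(level=0).max().resample("1min").max()
print(out.iloc[0].tolist())   # [2.0, 1.0, 3.0]
```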
