How to add missing dates in pandas - python

I have the following dataframe:
data
Out[120]:
High Low Open Close Volume Adj Close
Date
2018-01-02 12.66 12.50 12.52 12.66 20773300.0 10.842077
2018-01-03 12.80 12.67 12.68 12.76 29765600.0 10.927719
2018-01-04 13.04 12.77 12.78 12.98 37478200.0 11.116128
2018-01-05 13.22 13.04 13.06 13.20 46121900.0 11.304538
2018-01-08 13.22 13.11 13.21 13.15 33828300.0 11.261715
... ... ... ... ... ...
2020-06-25 6.05 5.80 5.86 6.03 73612700.0 6.030000
2020-06-26 6.07 5.81 6.04 5.91 118435400.0 5.910000
2020-06-29 6.07 5.81 5.91 6.01 58208400.0 6.010000
2020-06-30 6.10 5.90 5.98 6.08 61909300.0 6.080000
2020-07-01 6.18 5.95 6.10 5.98 62333600.0 5.980000
[629 rows x 6 columns]
Some of the dates are missing from the Date column. I know I can generate all the dates with:
pd.date_range(start, end, freq ='D')
Out[121]:
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
'2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08',
'2018-01-09', '2018-01-10',
...
'2020-06-23', '2020-06-24', '2020-06-25', '2020-06-26',
'2020-06-27', '2020-06-28', '2020-06-29', '2020-06-30',
'2020-07-01', '2020-07-02'],
dtype='datetime64[ns]', length=914, freq='D')
How can I compare all the dates with the index and add just the dates which are missing?

Use DataFrame.reindex, which also works if you need a custom start and end datetime:
df = df.reindex(pd.date_range(start, end, freq ='D'))
Or DataFrame.asfreq to add the missing datetimes between the existing data:
df = df.asfreq('D')
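A minimal sketch of the difference between the two on toy data (the column name is illustrative): reindex lets you choose any start and end, while asfreq only fills gaps between the first and last existing timestamps.

```python
import pandas as pd

# Toy frame with a gap: 2018-01-06/07 (a weekend) are missing.
df = pd.DataFrame(
    {"Close": [12.66, 12.76, 12.98, 13.20, 13.15]},
    index=pd.to_datetime(
        ["2018-01-02", "2018-01-03", "2018-01-04", "2018-01-05", "2018-01-08"]
    ),
)

# reindex accepts any range, even outside the existing index.
full = df.reindex(pd.date_range("2018-01-01", "2018-01-08", freq="D"))

# asfreq only fills between the existing first and last timestamps.
filled = df.asfreq("D")

print(len(full), len(filled))        # 8 7
print(full["Close"].isna().sum())    # 3 missing values introduced
```

The newly inserted rows are NaN in both cases; chain `.ffill()` or `.fillna(...)` if you need them populated.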

Related

How to merge two DataFrames with DatetimeIndex preserved in pandas?

I have 2 DataFrames, df1, and df2.
df1 has the following contents:
Adj Close Close High Low \
GBTC QQQ GBTC QQQ GBTC QQQ GBTC
Date
2019-01-29 4.02 159.342209 4.02 161.570007 4.07 163.240005 3.93
2019-01-30 4.06 163.395538 4.06 165.679993 4.09 166.279999 4.01
2019-01-31 3.99 165.841370 3.99 168.160004 4.06 168.990005 3.93
2019-02-01 4.02 165.141129 4.02 167.449997 4.07 168.600006 3.93
2019-02-04 3.96 167.192474 3.96 169.529999 4.00 169.529999 3.93
... ... ... ... ... ... ... ...
2019-02-25 4.65 171.127441 4.65 173.520004 4.78 174.660004 4.50
2019-02-26 4.36 171.304947 4.36 173.699997 4.74 174.250000 4.36
2019-02-27 4.30 171.196487 4.30 173.589996 4.50 173.800003 4.30
2019-02-28 4.46 170.802002 4.46 173.190002 4.65 173.809998 4.40
2019-03-01 4.58 171.985443 4.58 174.389999 4.64 174.649994 4.45
Open Volume
QQQ GBTC QQQ GBTC QQQ
Date
2019-01-29 160.990005 3.970 163.199997 975200 30784200
2019-01-30 162.889999 4.035 163.399994 770700 41346500
2019-01-31 166.470001 4.040 166.699997 1108700 37258400
2019-02-01 166.990005 4.000 167.330002 889100 32143700
2019-02-04 167.330002 3.990 167.479996 871800 26718800
... ... ... ... ... ...
2019-02-25 173.399994 4.625 174.210007 2891200 32608800
2019-02-26 172.809998 4.625 173.100006 2000100 21939700
2019-02-27 171.759995 4.400 172.899994 1537000 25162000
2019-02-28 172.699997 4.420 173.050003 1192600 25085500
2019-03-01 173.179993 4.470 174.440002 948500 31431200
[23 rows x 12 columns]
And here's the contents of df2:
Adj Close Close High Low \
GBTC QQQ GBTC QQQ GBTC QQQ GBTC
Date
2019-02-25 4.65 171.127441 4.65 173.520004 4.78 174.660004 4.50
2019-02-26 4.36 171.304947 4.36 173.699997 4.74 174.250000 4.36
2019-02-27 4.30 171.196487 4.30 173.589996 4.50 173.800003 4.30
2019-02-28 4.46 170.802002 4.46 173.190002 4.65 173.809998 4.40
2019-03-01 4.58 171.985443 4.58 174.389999 4.64 174.649994 4.45
... ... ... ... ... ... ... ...
2019-03-28 4.54 176.171432 4.54 178.309998 4.68 178.979996 4.51
2019-03-29 4.78 177.505249 4.78 179.660004 4.83 179.830002 4.55
2019-04-01 4.97 179.856705 4.97 182.039993 5.03 182.259995 4.85
2019-04-02 5.74 180.538437 5.74 182.729996 5.83 182.910004 5.52
2019-04-03 6.19 181.575836 6.19 183.779999 6.59 184.919998 5.93
Open Volume
QQQ GBTC QQQ GBTC QQQ
Date
2019-02-25 173.399994 4.625 174.210007 2891200 32608800
2019-02-26 172.809998 4.625 173.100006 2000100 21939700
2019-02-27 171.759995 4.400 172.899994 1537000 25162000
2019-02-28 172.699997 4.420 173.050003 1192600 25085500
2019-03-01 173.179993 4.470 174.440002 948500 31431200
... ... ... ... ... ...
2019-03-28 177.240005 4.650 178.360001 2104400 30368200
2019-03-29 178.589996 4.710 179.690002 2937400 35205500
2019-04-01 180.770004 4.850 181.509995 2733600 30969500
2019-04-02 181.779999 5.660 182.240005 6062000 22645200
2019-04-03 183.210007 5.930 183.759995 10002400 31633500
[28 rows x 12 columns]
As you can see from the above, df1 and df2 have overlapping Dates.
How can I create a merged DataFrame df that contains dates from 2019-01-29 to 2019-04-03 with no overlapping Date?
I've tried running df = df1.merge(df2, how='outer'). However, this command returns a DataFrame with the Date index dropped, which is not desirable:
> df
Adj Close Close High Low \
GBTC QQQ GBTC QQQ GBTC QQQ GBTC
0 4.02 159.342209 4.02 161.570007 4.07 163.240005 3.93
1 4.06 163.395538 4.06 165.679993 4.09 166.279999 4.01
2 3.99 165.841370 3.99 168.160004 4.06 168.990005 3.93
3 4.02 165.141129 4.02 167.449997 4.07 168.600006 3.93
4 3.96 167.192474 3.96 169.529999 4.00 169.529999 3.93
.. ... ... ... ... ... ... ...
41 4.54 176.171432 4.54 178.309998 4.68 178.979996 4.51
42 4.78 177.505249 4.78 179.660004 4.83 179.830002 4.55
43 4.97 179.856705 4.97 182.039993 5.03 182.259995 4.85
44 5.74 180.538437 5.74 182.729996 5.83 182.910004 5.52
45 6.19 181.575836 6.19 183.779999 6.59 184.919998 5.93
Open Volume
QQQ GBTC QQQ GBTC QQQ
0 160.990005 3.970 163.199997 975200 30784200
1 162.889999 4.035 163.399994 770700 41346500
2 166.470001 4.040 166.699997 1108700 37258400
3 166.990005 4.000 167.330002 889100 32143700
4 167.330002 3.990 167.479996 871800 26718800
.. ... ... ... ... ...
41 177.240005 4.650 178.360001 2104400 30368200
42 178.589996 4.710 179.690002 2937400 35205500
43 180.770004 4.850 181.509995 2733600 30969500
44 181.779999 5.660 182.240005 6062000 22645200
45 183.210007 5.930 183.759995 10002400 31633500
[46 rows x 12 columns]
It seems that I should find a way to merge df1.index and df2.index, then attach the merged DatetimeIndex to df.
For the convenience of debugging, you can run the following code to get the same data as mine.
import yfinance as yf
symbols = ['QQQ', 'GBTC']
df1 = yf.download(symbols, start="2019-01-29", end="2019-03-01")
df2 = yf.download(symbols, start="2019-02-25", end="2019-04-03")
Taken from the docs:
The join is done on columns or indexes. If joining columns on columns, the DataFrame indexes will be ignored. Otherwise if joining indexes on indexes or indexes on a column or columns, the index will be passed on. When performing a cross merge, no column specifications to merge on are allowed.
So I believe that if you refer to the index in the merge with on='Date', you should be OK:
df1.merge(df2, how='outer', on='Date')
However, for the problem you are trying to solve, merge is not the correct tool. What you need to do is append the dataframes together and then drop the duplicated rows:
df1.append(df2).drop_duplicates()
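Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0. The same idea with pd.concat, sketched on hypothetical single-column frames with overlapping dates (the real data has MultiIndex columns, but the mechanics are identical):

```python
import pandas as pd

# Two frames whose date ranges overlap, as in the question.
idx1 = pd.date_range("2019-01-29", periods=4, freq="D", name="Date")
idx2 = pd.date_range("2019-01-31", periods=4, freq="D", name="Date")
df1 = pd.DataFrame({"Close": [4.02, 4.06, 3.99, 4.02]}, index=idx1)
df2 = pd.DataFrame({"Close": [3.99, 4.02, 3.96, 4.65]}, index=idx2)

# Stack the frames, then keep the first row for each duplicated date.
merged = pd.concat([df1, df2])
merged = merged[~merged.index.duplicated(keep="first")].sort_index()

print(len(merged))  # 6 unique dates: 2019-01-29 .. 2019-02-03
```

Deduplicating on the index (rather than drop_duplicates, which compares row values) guarantees one row per date even if the overlapping rows differ slightly.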

How to loop through dates column and assign values according to a certain condition?

I have a df as follows
dates winter summer rest Final
2020-01-01 00:15:00 65.5 71.5 73.0 NaN
2020-01-01 00:30:00 62.6 69.0 70.1 NaN
2020-01-01 00:45:00 59.6 66.3 67.1 NaN
2020-01-01 01:00:00 57.0 63.5 64.5 NaN
2020-01-01 01:15:00 54.8 60.9 62.3 NaN
2020-01-01 01:30:00 53.1 58.6 60.6 NaN
2020-01-01 01:45:00 51.7 56.6 59.2 NaN
2020-01-01 02:00:00 50.5 55.1 57.9 NaN
2020-01-01 02:15:00 49.4 54.2 56.7 NaN
2020-01-01 02:30:00 48.5 53.7 55.6 NaN
2020-01-01 02:45:00 47.9 53.4 54.7 NaN
2020-01-01 03:00:00 47.7 53.3 54.2 NaN
2020-01-01 03:15:00 47.9 53.1 54.1 NaN
2020-01-01 03:30:00 48.7 53.2 54.6 NaN
2020-01-01 03:45:00 50.2 54.1 55.8 NaN
2020-01-01 04:00:00 52.3 56.1 57.9 NaN
2020-04-28 12:30:00 225.1 200.0 209.8 NaN
2020-04-28 12:45:00 215.7 193.8 201.9 NaN
2020-04-28 13:00:00 205.6 186.9 193.4 NaN
2020-04-28 13:15:00 195.7 179.9 185.0 NaN
2020-04-28 13:30:00 186.7 173.4 177.4 NaN
2020-04-28 13:45:00 179.2 168.1 170.9 NaN
2020-04-28 14:00:00 173.8 164.4 166.3 NaN
2020-04-28 14:15:00 171.0 163.0 163.9 NaN
2020-04-28 14:30:00 170.7 163.5 163.6 NaN
2020-12-31 21:15:00 88.5 90.2 89.2 NaN
2020-12-31 21:30:00 85.2 88.5 87.2 NaN
2020-12-31 21:45:00 82.1 86.3 85.0 NaN
2020-12-31 22:00:00 79.4 84.1 83.2 NaN
2020-12-31 22:15:00 77.6 82.4 82.1 NaN
2020-12-31 22:30:00 76.4 81.2 81.7 NaN
2020-12-31 22:45:00 75.6 80.3 81.6 NaN
2020-12-31 23:00:00 74.7 79.4 81.3 NaN
2020-12-31 23:15:00 73.7 78.4 80.6 NaN
2020-12-31 23:30:00 72.3 77.2 79.5 NaN
2020-12-31 23:45:00 70.5 75.7 77.9 NaN
2021-01-01 00:00:00 68.2 73.8 75.7 NaN
The dates column has dates starting from 2020-01-01 00:15:00 till 2021-01-01 00:00:00 split at every 15 mins.
I also have the following date range conditions:
Winter: 01.11 - 20.03
Summer: 15.05 - 14.09
Rest: 21.03 - 14.05 & 15.09 - 31.10
What I want to do is to create a new column named season that checks every date in the dates column and assigns winter if the date is in Winter range, summer if it is in Summer range and rest if it is the Rest range.
Then, based on the value in the season column, the Final column must be filled. If the value in season column is 'winter', then the values from winter column must be placed, if the value in season column is 'summer', then the values from summer column must be placed and so on.
How can this be done?
The idea is to normalize the datetimes to the same year, then filter with Series.between and set the new column with numpy.select:
import numpy as np
import pandas as pd

d = pd.to_datetime(df['dates'].dt.strftime('%m-%d-2020'))
m1 = d.between('2020-11-01','2020-12-31') | d.between('2020-01-01','2020-03-20')
m2 = d.between('2020-05-15','2020-09-14')
df['Final'] = np.select([m1, m2], ['Winter','Summer'], default='Rest')
print (df)
dates winter summer rest Final
0 2020-01-01 00:15:00 65.5 71.5 73.0 Winter
1 2020-06-15 00:30:00 62.6 69.0 70.1 Summer
2 2020-12-25 00:45:00 59.6 66.3 67.1 Winter
3 2020-10-10 01:00:00 57.0 63.5 64.5 Rest

Pandas DataFrame column without labels

I have the following data set in a pandas DataFrame:
print data
Result:
Open High Low Close Adj Close Volume
Date
2018-05-25 12.70 12.73 12.48 12.61 12.610000 1469800
2018-05-24 12.99 13.08 12.93 12.98 12.980000 814800
2018-05-23 13.19 13.30 13.06 13.12 13.120000 1417500
2018-05-22 13.46 13.57 13.25 13.27 13.270000 1189000
2018-05-18 13.41 13.44 13.36 13.38 13.380000 986300
2018-05-17 13.19 13.42 13.19 13.40 13.400000 1056200
2018-05-16 13.01 13.14 13.01 13.12 13.120000 481300
If I just want to print a single column, e.g. Low, it shows up with the Date index:
print data.Low
Result:
Date
2018-05-25 12.48
2018-05-24 12.93
2018-05-23 13.06
2018-05-22 13.25
2018-05-18 13.36
2018-05-17 13.19
2018-05-16 13.01
Is there a way to slice/print just the prices without the index? So the output would be like:
12.48
12.93
13.06
13.25
13.36
13.19
13.01
In pandas, a Series or DataFrame always needs an index.
A default RangeIndex can be created by:
print data.reset_index(drop=True).Low
But if you need to write only the values to a file, as a column with no index and no header:
data.Low.to_csv(file, index=False, header=False)
If you need to convert the column to a list:
print data.Low.tolist()
[12.48, 12.93, 13.06, 13.25, 13.36, 13.19, 13.01]
And for a 1d numpy array:
print data.Low.values
[12.48 12.93 13.06 13.25 13.36 13.19 13.01]
If you want an Mx1 column array:
print (data[['Low']].values)
[[12.48]
[12.93]
[13.06]
[13.25]
[13.36]
[13.19]
[13.01]]
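On recent pandas versions, Series.to_numpy() is the recommended spelling of .values, and Series.to_string(index=False) prints just the values; a small sketch:

```python
import pandas as pd

data = pd.DataFrame(
    {"Low": [12.48, 12.93, 13.06]},
    index=pd.to_datetime(["2018-05-25", "2018-05-24", "2018-05-23"]),
)

# Print the column values only, one per line, without the Date index.
print(data["Low"].to_string(index=False))

# 1-D numpy array of the values.
arr = data["Low"].to_numpy()
print(arr.shape)  # (3,)
```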

Having trouble plotting this data frame of mutual funds

First off, here is my dataframe:
Date 2012-09-04 00:00:00 2012-09-05 00:00:00 2012-09-06 00:00:00 2012-09-07 00:00:00 2012-09-10 00:00:00 2012-09-11 00:00:00 2012-09-12 00:00:00 2012-09-13 00:00:00 2012-09-14 00:00:00 2012-09-17 00:00:00 ... 2017-08-22 00:00:00 2017-08-23 00:00:00 2017-08-24 00:00:00 2017-08-25 00:00:00 2017-08-28 00:00:00 2017-08-29 00:00:00 2017-08-30 00:00:00 2017-08-31 00:00:00 2017-09-01 00:00:00 Type
AABTX 9.73 9.73 9.83 9.86 9.83 9.86 9.86 9.96 9.98 9.96 ... 11.44 11.45 11.44 11.46 11.46 11.47 11.47 11.51 11.52 Hybrid
AACTX 9.66 9.65 9.77 9.81 9.78 9.81 9.82 9.92 9.95 9.93 ... 12.32 12.32 12.31 12.33 12.34 12.34 12.35 12.40 12.41 Hybrid
AADTX 9.71 9.70 9.85 9.90 9.86 9.89 9.91 10.02 10.07 10.05 ... 13.05 13.04 13.03 13.05 13.06 13.06 13.08 13.14 13.15 Hybrid
AAETX 9.92 9.91 10.07 10.13 10.08 10.12 10.14 10.26 10.32 10.29 ... 13.84 13.84 13.82 13.85 13.86 13.86 13.89 13.96 13.98 Hybrid
AAFTX 9.85 9.84 10.01 10.06 10.01 10.05 10.07 10.20 10.26 10.23 ... 14.09 14.08 14.07 14.09 14.11 14.11 14.15 14.24 14.26 Hybrid
That is a bit hard to read, but essentially these are just closing prices for several mutual funds (638 of them), with the Type label in the last column. I'd like to plot all of these on a single plot and have a legend labeling the type of each fund.
I'd like to see how many potential clusters I may need. This was my first thought for visualizing the data, but if you have any other recommendations, feel free to suggest them.
Also, in my first attempt, I tried:
parallel_coordinates(closing_data, 'Type', alpha=0.2, colormap=dark2_cmap)
plt.show()
It just shows up as a black blob, and after some research I found that parallel_coordinates doesn't handle large numbers of features that well.
My suggestion is to transpose the dataframe, as timestamp comes more naturally as an index and you will be able to address individual time series as df.AABTX or df['AABTX'].
With a smaller number of time series you could have tried df.plot(), but when it is rather large you should not be surprised to see some mess initially.
Try plotting a subset of your data, but please make sure the time is in index, not columns names.
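A sketch of that suggestion, using a small hypothetical wide frame (the tickers and prices are made up) and matplotlib's non-interactive Agg backend:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, safe without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical wide frame as in the question: funds as rows, dates as columns.
dates = pd.date_range("2012-09-04", periods=5, freq="B")
wide = pd.DataFrame(
    np.random.default_rng(0).normal(10, 1, (3, 5)),
    index=["AABTX", "AACTX", "AADTX"],
    columns=dates,
)

# Transpose so time becomes the index and each fund becomes a column.
tall = wide.T

# Plot a small subset rather than all 638 series at once.
ax = tall[["AABTX", "AACTX"]].plot(alpha=0.7)
ax.figure.savefig("funds.png")
print(tall.shape)  # (5, 3)
```

With the real data, drop the Type column before transposing so the frame stays numeric.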
You may be looking for something like the silhouette analysis which is implemented in the scikit-learn machine learning library. It should allow to find an optimal number of clusters to consider for your data.
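A minimal sketch of silhouette analysis with scikit-learn, on synthetic data with two obvious clusters (everything here is illustrative, not the question's fund data):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic feature matrix: 40 samples, two well-separated groups.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 10)), rng.normal(5, 1, (20, 10))])

# Score a few candidate cluster counts; a higher silhouette is better.
scores = {}
for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(k, round(scores[k], 3))
```

For the mutual-fund data, the rows of X would be one (scaled) price series per fund.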

Filter a timeseries with some predefined dates in Pandas

I have this code:
close1 = 'Close'; start = '12/18/2015 00:00:00'
end = '3/1/2016 00:00:00'; freq = '1d0h00min'
datefilter = pd.date_range(start=start, end=end, freq=freq).values
close[close['Datetime'].isin(datefilter)]  # only dates in the range
But, strangely, some columns come back with NaN:
Datetime ENTA KITE BSTC SAGE AGEN MGNX ESPR FPRX
2015-12-18 31.73 63.38 16.34 56.88 12.24 NaN NaN 38.72
2015-12-21 32.04 63.60 16.26 56.75 12.18 NaN NaN 42.52
Just wondering why, and how can it be remedied?
Original :
Datetime ENTA KITE BSTC SAGE AGEN MGNX ESPR FPRX
0 2013-03-21 17.18 29.0 20.75 30.1 11.52 11.52 38.72
1 2013-03-22 16.81 30.53 21.25 30.0 11.64 11.52 39.42
2 2013-03-25 16.83 32.15 20.8 27.59 11.7 11.52 42.52
3 2013-03-26 17.09 29.55 20.6 27.5 11.76 11.52 11.52
EDIT: it seems to be related to the hh:mm:ss part of the datetimes when filtering.
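If that is the cause, an exact isin comparison silently misses any timestamp whose time-of-day is not midnight; a sketch of normalizing first (the names follow the question, the data is made up):

```python
import pandas as pd

close = pd.DataFrame({
    "Datetime": pd.to_datetime(["2015-12-18 09:30:00", "2015-12-21 09:30:00"]),
    "ENTA": [31.73, 32.04],
})

datefilter = pd.date_range("2015-12-18", "2016-03-01", freq="D")

# Exact comparison fails: 09:30:00 does not equal midnight ...
miss = close[close["Datetime"].isin(datefilter)]

# ... so strip the time-of-day before comparing.
hit = close[close["Datetime"].dt.normalize().isin(datefilter)]

print(len(miss), len(hit))  # 0 2
```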
