First off, here is my dataframe:
Date 2012-09-04 00:00:00 2012-09-05 00:00:00 2012-09-06 00:00:00 2012-09-07 00:00:00 2012-09-10 00:00:00 2012-09-11 00:00:00 2012-09-12 00:00:00 2012-09-13 00:00:00 2012-09-14 00:00:00 2012-09-17 00:00:00 ... 2017-08-22 00:00:00 2017-08-23 00:00:00 2017-08-24 00:00:00 2017-08-25 00:00:00 2017-08-28 00:00:00 2017-08-29 00:00:00 2017-08-30 00:00:00 2017-08-31 00:00:00 2017-09-01 00:00:00 Type
AABTX 9.73 9.73 9.83 9.86 9.83 9.86 9.86 9.96 9.98 9.96 ... 11.44 11.45 11.44 11.46 11.46 11.47 11.47 11.51 11.52 Hybrid
AACTX 9.66 9.65 9.77 9.81 9.78 9.81 9.82 9.92 9.95 9.93 ... 12.32 12.32 12.31 12.33 12.34 12.34 12.35 12.40 12.41 Hybrid
AADTX 9.71 9.70 9.85 9.90 9.86 9.89 9.91 10.02 10.07 10.05 ... 13.05 13.04 13.03 13.05 13.06 13.06 13.08 13.14 13.15 Hybrid
AAETX 9.92 9.91 10.07 10.13 10.08 10.12 10.14 10.26 10.32 10.29 ... 13.84 13.84 13.82 13.85 13.86 13.86 13.89 13.96 13.98 Hybrid
AAFTX 9.85 9.84 10.01 10.06 10.01 10.05 10.07 10.20 10.26 10.23 ... 14.09 14.08 14.07 14.09 14.11 14.11 14.15 14.24 14.26 Hybrid
That is a bit hard to read, but essentially these are just closing prices for several mutual funds (638 of them), with the Type label in the last column. I'd like to plot all of these on a single plot and have a legend indicating each fund's type.
I'd like to see how many potential clusters I may need. This was my first thought for visualizing the data, but if you have any other recommendations, feel free to suggest them.
Also, in my first attempt, I tried:
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

parallel_coordinates(closing_data, 'Type', alpha=0.2, colormap=dark2_cmap)
plt.show()
It just shows up as a black blob, and after some research I found that parallel_coordinates doesn't handle a large number of features very well.
My suggestion is to transpose the dataframe, as a timestamp comes more naturally as an index, and you will then be able to address individual time series as df.AABTX or df['AABTX'].
With a smaller number of time series you could have tried df.plot(), but when it is rather large you should not be surprised to see some mess initially.
Try plotting a subset of your data, but please make sure the time is in the index, not in the column names.
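For instance, a minimal sketch of transposing and plotting a subset (the prices below are made up, since the full dataframe isn't reproduced here):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy stand-in for the closing-price dataframe: funds as rows,
# dates as columns, plus a Type column (values are made up).
dates = pd.date_range('2012-09-04', periods=5, freq='B')
closing_data = pd.DataFrame(
    [[9.73, 9.73, 9.83, 9.86, 9.83],
     [9.66, 9.65, 9.77, 9.81, 9.78]],
    index=['AABTX', 'AACTX'], columns=dates)
closing_data['Type'] = 'Hybrid'

# Transpose so timestamps become the index; drop the Type row first.
df = closing_data.drop(columns='Type').T

# Individual series are now addressable as df.AABTX or df['AABTX'],
# and a subset plots cleanly.
df[['AABTX', 'AACTX']].plot(figsize=(10, 5))
plt.show()
```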
You may be looking for something like silhouette analysis, which is implemented in the scikit-learn machine learning library. It should help you find an optimal number of clusters to consider for your data.
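A minimal sketch of how that can look, using make_blobs as a stand-in for the (funds × dates) price matrix:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the real (n_funds, n_dates) matrix.
X, _ = make_blobs(n_samples=200, centers=3, cluster_std=0.5,
                  random_state=0)

# Fit k-means for a range of k and record each silhouette score
# (ranges from -1 to 1; higher means better-separated clusters).
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The k with the highest score is a reasonable cluster count.
best_k = max(scores, key=scores.get)
print(best_k)
```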
I have 2 questions on fbprophet.
First, for 2022-01, my model is greatly over-shooting the actual value. I would like to bring this model prediction down by making it put more weight on the 2021-01 actual data point and less weight on more historical January values (since 2021 Jan had a much lower increase relative to past Januaries). I tried to mess with the Fourier coefficient on seasonality (commented piece of code) but this did not help. Would you have any ideas on what hyper-parameter tuning could help me achieve this?
My 2nd question is: why is the yearly seasonality plot wrong? As can be seen from my graph, January has a clear and distinct peak, but this is not reflected at all in the yearly seasonality graph fbprophet produces. Note the forecast variable has a column "yearly" that produces a much better seasonality graph; shouldn't my plot_components call be using that column?
Please let me know if either question, the code, or the data provided is confusing. Thanks a lot for the help.
Attached are my data and the code used. Note I had some issues getting fbprophet to import, so I had to write a unique pip line that you may not need.
Code
#restart kernel after running this
!pip install pystan
#restart kernel after running this
!pip install prophet --no-cache-dir
#Needed libraries
import pandas as pd
from prophet import Prophet
import datetime
import math
from matplotlib import pyplot as plt
#Read in training and testing data
df_train = pd.read_csv("train_data.csv")
df_test = pd.read_csv("test_data.csv", index_col = 'ds')
prophet = Prophet(yearly_seasonality = True)
prophet.fit(df_train)
#tried adding a custom yearly seasonality with a higher fourier order to react quicker to seasonality trends, but it didn't work
#(note: add_seasonality must be called before prophet.fit, not after)
#prophet.add_seasonality(name='yearly_seasonality_quick', period=365.25, fourier_order=50)
#create a future data frame projecting 3 months out
future = prophet.make_future_dataframe(periods=3, freq='MS')
forecast = prophet.predict(future)
fig = plt.figure(figsize=(12, 8))
ax = fig.gca()
#plot
prophet.plot(forecast, ax=ax)
#plot testing data points on plot
df_test.index = pd.to_datetime(df_test.index)
df_test.plot(color = 'green', marker='o', ax=ax)
#plot trend and seasonality
fig2 = prophet.plot_components(forecast)
training data
ds y
1/1/2016 53.55
2/1/2016 33.95
3/1/2016 25.15
4/1/2016 19.5
5/1/2016 15.35
6/1/2016 16.8
7/1/2016 11.2
8/1/2016 16.55
9/1/2016 13.3
10/1/2016 10.3
11/1/2016 10.1
12/1/2016 5.85
1/1/2017 45.4
2/1/2017 25.9
3/1/2017 18.55
4/1/2017 13.55
5/1/2017 16.7
6/1/2017 15.65
7/1/2017 10.4
8/1/2017 14.4
9/1/2017 10.55
10/1/2017 10.75
11/1/2017 10.1
12/1/2017 4.55
1/1/2018 34.8
2/1/2018 20.25
3/1/2018 14.6
4/1/2018 14.95
5/1/2018 15.8
6/1/2018 14.95
7/1/2018 12.8
8/1/2018 15
9/1/2018 9.9
10/1/2018 14.1
11/1/2018 10.6
12/1/2018 5.6
1/1/2019 33.8
2/1/2019 18.65
3/1/2019 15.1
4/1/2019 19.35
5/1/2019 17.4
6/1/2019 13.9
7/1/2019 16.45
8/1/2019 15.55
9/1/2019 14.15
10/1/2019 15.6
11/1/2019 10.95
12/1/2019 8.7
1/1/2020 28.85
2/1/2020 16.45
3/1/2020 5.5
4/1/2020 -2.1
5/1/2020 5.4
6/1/2020 14.15
7/1/2020 11.6
8/1/2020 10.8
9/1/2020 12.35
10/1/2020 10.35
11/1/2020 7.45
12/1/2020 6.35
1/1/2021 16.35
2/1/2021 9.8
3/1/2021 16.05
4/1/2021 14.05
5/1/2021 11.2
6/1/2021 16.05
7/1/2021 10.95
8/1/2021 11.5
9/1/2021 10.85
10/1/2021 9.35
11/1/2021 9.95
12/1/2021 6.8
testing data
ds y
1/1/2022 16.75
2/1/2022 13.25
3/1/2022 13.9
I have the following dataframe:
data
Out[120]:
High Low Open Close Volume Adj Close
Date
2018-01-02 12.66 12.50 12.52 12.66 20773300.0 10.842077
2018-01-03 12.80 12.67 12.68 12.76 29765600.0 10.927719
2018-01-04 13.04 12.77 12.78 12.98 37478200.0 11.116128
2018-01-05 13.22 13.04 13.06 13.20 46121900.0 11.304538
2018-01-08 13.22 13.11 13.21 13.15 33828300.0 11.261715
... ... ... ... ... ...
2020-06-25 6.05 5.80 5.86 6.03 73612700.0 6.030000
2020-06-26 6.07 5.81 6.04 5.91 118435400.0 5.910000
2020-06-29 6.07 5.81 5.91 6.01 58208400.0 6.010000
2020-06-30 6.10 5.90 5.98 6.08 61909300.0 6.080000
2020-07-01 6.18 5.95 6.10 5.98 62333600.0 5.980000
[629 rows x 6 columns]
Some of the dates are missing in the Date column. I know I can get all the dates with:
pd.date_range(start, end, freq ='D')
Out[121]:
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
'2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08',
'2018-01-09', '2018-01-10',
...
'2020-06-23', '2020-06-24', '2020-06-25', '2020-06-26',
'2020-06-27', '2020-06-28', '2020-06-29', '2020-06-30',
'2020-07-01', '2020-07-02'],
dtype='datetime64[ns]', length=914, freq='D')
How can I compare all the dates with the index and add just the dates which are missing?
Use DataFrame.reindex, which also works if you need a custom start and end datetime:
df = df.reindex(pd.date_range(start, end, freq ='D'))
Or DataFrame.asfreq to add the missing datetimes between the existing data:
df = df.asfreq('d')
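Both options in a small runnable sketch (toy dates and values, standing in for the stock data):

```python
import pandas as pd

# Toy frame with a gap: 2018-01-06/07 (a weekend) are missing.
idx = pd.to_datetime(['2018-01-02', '2018-01-03', '2018-01-04',
                      '2018-01-05', '2018-01-08'])
df = pd.DataFrame({'Close': [12.66, 12.76, 12.98, 13.20, 13.15]},
                  index=idx)

# Option 1: reindex against an explicit daily range, so the result
# can start/end outside the existing data (here from 2018-01-01).
full = df.reindex(pd.date_range('2018-01-01', '2018-01-08', freq='D'))

# Option 2: asfreq only fills between the first and last existing dates.
filled = df.asfreq('D')

print(len(full), len(filled))
```

The inserted rows get NaN values, which you can then fill with ffill(), interpolate(), etc. as appropriate.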
I'm still a newbie to matplotlib. Currently, I have the dataset below for plotting:
Date Open High Low Close
Trade_Date
2018-01-02 736696.0 42.45 42.45 41.45 41.45
2018-01-03 736697.0 41.60 41.70 40.70 40.95
2018-01-04 736698.0 40.90 41.05 40.20 40.25
2018-01-05 736699.0 40.35 41.60 40.35 41.50
2018-01-08 736702.0 40.20 40.20 37.95 38.00
2018-01-09 736703.0 37.15 39.00 37.15 38.00
2018-01-10 736704.0 38.70 38.70 37.15 37.25
2018-01-11 736705.0 37.50 37.50 36.55 36.70
2018-01-12 736706.0 37.00 37.40 36.90 37.20
2018-01-15 736709.0 37.50 37.70 37.15 37.70
2018-01-16 736710.0 37.80 38.25 37.45 37.95
2018-01-17 736711.0 38.00 38.05 37.65 37.75
2018-01-18 736712.0 38.00 38.20 37.70 37.75
2018-01-19 736713.0 36.70 37.10 35.30 36.45
2018-01-22 736716.0 36.25 36.25 35.50 36.10
2018-01-23 736717.0 36.20 36.30 35.65 36.00
2018-01-24 736718.0 35.80 36.00 35.60 36.00
2018-01-25 736719.0 36.10 36.10 35.45 35.45
2018-01-26 736720.0 35.50 35.75 35.00 35.00
2018-01-29 736723.0 34.80 35.00 33.65 33.70
2018-01-30 736724.0 33.70 34.45 33.65 33.90
I've converted the date values to numbers using mdates.date2num.
After that, I tried to plot a candlestick graph with the code below:
# candlestick_ohlc lives in mplfinance.original_flavor in current releases
# (older versions imported it from mpl_finance)
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from mplfinance.original_flavor import candlestick_ohlc

f1, ax = plt.subplots(figsize=(10, 5))
candlestick_ohlc(ax, ohlc.values, width=.6, colorup='red', colordown='green')
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
plt.show()
However, I'm still getting a graph with gaps.
I've tried the possible solution from How do I plot only weekdays using Python's matplotlib candlestick?
However, I was not able to solve my problem with the solution above.
Can anyone kindly help me with this issue?
Thanks!
I have the following data set in a pandas dataframe:
print(data)
Result:
Open High Low Close Adj Close Volume
Date
2018-05-25 12.70 12.73 12.48 12.61 12.610000 1469800
2018-05-24 12.99 13.08 12.93 12.98 12.980000 814800
2018-05-23 13.19 13.30 13.06 13.12 13.120000 1417500
2018-05-22 13.46 13.57 13.25 13.27 13.270000 1189000
2018-05-18 13.41 13.44 13.36 13.38 13.380000 986300
2018-05-17 13.19 13.42 13.19 13.40 13.400000 1056200
2018-05-16 13.01 13.14 13.01 13.12 13.120000 481300
If I just want to print a single column, e.g. Low, it shows up with the date index:
print(data.Low)
Result:
Date
2018-05-25 12.48
2018-05-24 12.93
2018-05-23 13.06
2018-05-22 13.25
2018-05-18 13.36
2018-05-17 13.19
2018-05-16 13.01
Is there a way to slice/print just the price values, so the output will be like:
12.48
12.93
13.06
13.25
13.36
13.19
13.01
In pandas, a Series and a DataFrame always need some index values.
A default RangeIndex can be created by:
print(data.reset_index(drop=True).Low)
But if you need to write only the values to a file, as a column with no index and no header:
data.Low.to_csv(file, index=False, header=False)
If you need to convert the column to a list:
print(data.Low.tolist())
[12.48, 12.93, 13.06, 13.25, 13.36, 13.19, 13.01]
And for a 1-D numpy array:
print(data.Low.values)
[12.48 12.93 13.06 13.25 13.36 13.19 13.01]
If you want an M×1 array:
print (data[['Low']].values)
[[12.48]
[12.93]
[13.06]
[13.25]
[13.36]
[13.19]
[13.01]]
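All of the variants above in one runnable sketch (the dataframe is rebuilt from the values shown, so it can be copy-pasted):

```python
import pandas as pd

# Rebuild the Low column with its date index from the question.
data = pd.DataFrame(
    {'Low': [12.48, 12.93, 13.06, 13.25, 13.36, 13.19, 13.01]},
    index=pd.to_datetime(['2018-05-25', '2018-05-24', '2018-05-23',
                          '2018-05-22', '2018-05-18', '2018-05-17',
                          '2018-05-16']))
data.index.name = 'Date'

as_list = data.Low.tolist()        # plain Python list
as_array = data.Low.values         # 1-D numpy array, shape (7,)
as_column = data[['Low']].values   # 2-D numpy array, shape (7, 1)

print(as_list)
```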
I have this code (reordered here so the definitions come before the filter):
close1 = 'Close'; start = '12/18/2015 00:00:00'
end = '3/1/2016 00:00:00'; freq = '1d0h00min'
datefilter = pd.date_range(start=start, end=end, freq=freq).values
close[close['Datetime'].isin(datefilter)]  # only dates in the range
But, strangely, some columns are given back with Nan:
Datetime ENTA KITE BSTC SAGE AGEN MGNX ESPR FPRX
2015-12-18 31.73 63.38 16.34 56.88 12.24 NaN NaN 38.72
2015-12-21 32.04 63.60 16.26 56.75 12.18 NaN NaN 42.52
Just wondering about the reason, and how can we remedy it?
Original :
Datetime ENTA KITE BSTC SAGE AGEN MGNX ESPR FPRX
0 2013-03-21 17.18 29.0 20.75 30.1 11.52 11.52 38.72
1 2013-03-22 16.81 30.53 21.25 30.0 11.64 11.52 39.42
2 2013-03-25 16.83 32.15 20.8 27.59 11.7 11.52 42.52
3 2013-03-26 17.09 29.55 20.6 27.5 11.76 11.52 11.52
EDIT:
It seems to be related to the datetime hh:mm:ss filtering.
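If that is indeed the cause, one sketch of a fix (on toy data, not the asker's frame) is to compare on normalized timestamps: dt.normalize() strips the time-of-day component, so rows whose Datetime carries an intraday time still match the midnight-based date range:

```python
import pandas as pd

# Toy frame whose Datetime column carries a time-of-day component.
close = pd.DataFrame({
    'Datetime': pd.to_datetime(['2015-12-18 09:30:00',
                                '2015-12-21 09:30:00',
                                '2016-03-02 09:30:00']),
    'ENTA': [31.73, 32.04, 30.00],
})

datefilter = pd.date_range('2015-12-18', '2016-03-01', freq='D')

# Raw isin() misses rows because 09:30:00 != 00:00:00 ...
raw = close[close['Datetime'].isin(datefilter)]

# ... while normalizing to midnight first matches on the date alone.
fixed = close[close['Datetime'].dt.normalize().isin(datefilter)]

print(len(raw), len(fixed))
```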