I have 2 questions on fbprophet.
First, for 2022-01, my model is greatly overshooting the actual value. I would like to bring this prediction down by making the model put more weight on the 2021-01 actual data point and less weight on more historical January values (since January 2021 had a much lower increase relative to past Januaries). I tried adjusting the Fourier order of the yearly seasonality (the commented-out line in the code), but this did not help. Do you have any ideas on what hyperparameter tuning could help me achieve this?
My second question is: why is the yearly seasonality plot wrong? As can be seen from my graph, January has a clear and distinct peak, but this is not reflected at all in the yearly seasonality plot that fbprophet produces. Note that the forecast variable has a column "yearly" which produces a much better seasonality graph, but shouldn't my plot_components call be using that column?
Please let me know if either question, the code, or the data provided is confusing. Thanks a lot for the help.
Attached are the data and code I used. Note that I had some issues getting fbprophet to import, so I had to use a somewhat unusual pip install line that you may not need.
Code
#restart kernel after running this
!pip install pystan
#restart kernel after running this
!pip install prophet --no-cache-dir
#Needed libraries
import pandas as pd
from prophet import Prophet
import datetime
import math
from matplotlib import pyplot as plt
#Read in training and testing data
df_train = pd.read_csv("train_data.csv")
df_test = pd.read_csv("test_data.csv", index_col = 'ds')
prophet = Prophet(yearly_seasonality = True)
prophet.fit(df_train)
#tried to add a custom yearly seasonality with a higher Fourier order to react more quickly to seasonality trends, but it didn't work
#prophet.add_seasonality(name='yearly_seasonality_quick', period=365.25, fourier_order=50)
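#note: if uncommented here, this call would come after prophet.fit(), which Prophet rejects; add_seasonality has to be set up before fitting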
plt.figure()
#create a future data frame projecting 3 months out
future = prophet.make_future_dataframe(periods=3, freq='MS')
forecast = prophet.predict(future)
fig = plt.figure(figsize=(12, 8))
ax = fig.gca()
#plot
prophet.plot(forecast, ax=ax)
#plot testing data points on plot
df_test.index = pd.to_datetime(df_test.index)
df_test.plot(color = 'green', marker='o', ax=ax)
#plot trend and seasonality
fig2 = prophet.plot_components(forecast)
training data
ds y
1/1/2016 53.55
2/1/2016 33.95
3/1/2016 25.15
4/1/2016 19.5
5/1/2016 15.35
6/1/2016 16.8
7/1/2016 11.2
8/1/2016 16.55
9/1/2016 13.3
10/1/2016 10.3
11/1/2016 10.1
12/1/2016 5.85
1/1/2017 45.4
2/1/2017 25.9
3/1/2017 18.55
4/1/2017 13.55
5/1/2017 16.7
6/1/2017 15.65
7/1/2017 10.4
8/1/2017 14.4
9/1/2017 10.55
10/1/2017 10.75
11/1/2017 10.1
12/1/2017 4.55
1/1/2018 34.8
2/1/2018 20.25
3/1/2018 14.6
4/1/2018 14.95
5/1/2018 15.8
6/1/2018 14.95
7/1/2018 12.8
8/1/2018 15
9/1/2018 9.9
10/1/2018 14.1
11/1/2018 10.6
12/1/2018 5.6
1/1/2019 33.8
2/1/2019 18.65
3/1/2019 15.1
4/1/2019 19.35
5/1/2019 17.4
6/1/2019 13.9
7/1/2019 16.45
8/1/2019 15.55
9/1/2019 14.15
10/1/2019 15.6
11/1/2019 10.95
12/1/2019 8.7
1/1/2020 28.85
2/1/2020 16.45
3/1/2020 5.5
4/1/2020 -2.1
5/1/2020 5.4
6/1/2020 14.15
7/1/2020 11.6
8/1/2020 10.8
9/1/2020 12.35
10/1/2020 10.35
11/1/2020 7.45
12/1/2020 6.35
1/1/2021 16.35
2/1/2021 9.8
3/1/2021 16.05
4/1/2021 14.05
5/1/2021 11.2
6/1/2021 16.05
7/1/2021 10.95
8/1/2021 11.5
9/1/2021 10.85
10/1/2021 9.35
11/1/2021 9.95
12/1/2021 6.8
testing data
ds y
1/1/2022 16.75
2/1/2022 13.25
3/1/2022 13.9
I want to calculate a positive streak for the numbers in each row, working in reverse (from the last score column backwards).
I tried using cumsum() but that isn't helping me.
The DataFrame looks as follows, with the expected output:
country score_1 score_2 score_3 score_4 score_5 expected_streak
U.S. 12.4 13.6 19.9 22 28.7 4
Africa 11.1 15.5 9.2 7 34.2 1
India 13.9 6.6 16.3 21.8 30.9 3
Australia 25.4 36.9 18.9 29 NaN 0
Malaysia 12.8 NaN -6.2 28.6 31.7 2
Argentina 40.7 NaN 16.3 20.1 39 2
Canada 56.4 NaN NaN -2 -1 1
So, basically, score_5 should be greater than score_4, score_4 greater than score_3, and so on, to build up the streak count; as soon as a score is greater than the one to its right, the streak ends.
One way using diff with cummin:
# reverse the score columns so we walk backwards from score_5
df2 = df.filter(like="score_").loc[:, ::-1]
# backward differences > 0 mark increases; cummin cuts the run at the first non-increase; sum counts the streak
df["expected"] = df2.diff(-1, axis=1).gt(0).cummin(1).sum(1)
print(df)
Output:
country score_1 score_2 score_3 score_4 score_5 expected
0 U.S. 12.4 13.6 19.9 22.0 28.7 4
1 Africa 11.1 15.5 9.2 7.0 34.2 1
2 India 13.9 6.6 16.3 21.8 30.9 3
3 Australia 25.4 36.9 18.9 29.0 NaN 0
4 Malaysia 12.8 NaN -6.2 28.6 31.7 2
5 Argentina 40.7 NaN 16.3 20.1 39.0 2
6 Canada 56.4 NaN NaN -2.0 -1.0 1
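For reference, a self-contained version of the above that rebuilds the sample frame from the question (values copied from the table; nothing else assumed):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "country": ["U.S.", "Africa", "India", "Australia", "Malaysia", "Argentina", "Canada"],
    "score_1": [12.4, 11.1, 13.9, 25.4, 12.8, 40.7, 56.4],
    "score_2": [13.6, 15.5, 6.6, 36.9, np.nan, np.nan, np.nan],
    "score_3": [19.9, 9.2, 16.3, 18.9, -6.2, 16.3, np.nan],
    "score_4": [22.0, 7.0, 21.8, 29.0, 28.6, 20.1, -2.0],
    "score_5": [28.7, 34.2, 30.9, np.nan, 31.7, 39.0, -1.0],
})

df2 = df.filter(like="score_").loc[:, ::-1]
df["expected"] = df2.diff(-1, axis=1).gt(0).cummin(1).sum(1)
print(df)  # expected: 4, 1, 3, 0, 2, 2, 1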
I want to find the correlation between cities and Rainfall. Note that 'city' is categorical, not numerical.
I want to compare their rainfall.
How do I go about it? I haven't seen anything on here that talks about how to deal with duplicate cities with different data,
like:
Date Location MinTemp MaxTemp Rainfall
12/1/2008 Albury 13.4 22.9 0.6
12/2/2008 Albury 7.4 25.1 0
12/3/2008 Albury 12.9 25.7 0
12/5/2008 Brisbane 20.5 29 9.6
12/6/2008 Brisbane 22.1 33.4 7.8
12/7/2008 Brisbane 22.6 33.4 12.4
12/8/2008 Brisbane 21.9 26.7 0
12/9/2008 Brisbane 19.5 27.6 0.2
12/10/2008 Brisbane 22.1 30.3 0.6
3/30/2011 Tuggeranong 9.8 25.2 0.4
3/31/2011 Tuggeranong 10.3 18.5 2.8
5/1/2011 Tuggeranong 5.5 20.8 0
5/2/2011 Tuggeranong 11 16.1 0
5/3/2011 Tuggeranong 7.3 17.5 0.6
8/29/2016 Woomera 15 22.9 0
8/30/2016 Woomera 12.5 22.1 12.8
8/31/2016 Woomera 8 20 0
9/1/2016 Woomera 11.6 21.4 0
9/2/2016 Woomera 11.2 19.6 0.3
9/3/2016 Woomera 7.1 20.4 0
9/4/2016 Woomera 6.5 18.6 0
9/5/2016 Woomera 7.3 21.5 0
One possible solution, if I understood you correctly (based on the title of the OP), is:
Step 1
Prepare a dataset with Locations as columns and Rainfall observations as rows (note that you will lose information here, down to the length of the shortest rainfall series).
df2 = df.groupby("Location")[["Location", "Rainfall"]].head(3)  # head(3) takes the first 3 observations per Location
df2.loc[:, "col"] = 4 * ["x1", "x2", "x3"]  # 4 is the number of unique cities
df3 = df2.pivot_table(index="col",columns="Location",values="Rainfall")
df3
Location Albury Brisbane Tuggeranong Woomera
col
x1 0.6 9.6 0.4 0.0
x2 0.0 7.8 2.8 12.8
x3 0.0 12.4 0.0 0.0
Step 2
Compute the correlation matrix on the dataset obtained above:
df3.corr()
Location Albury Brisbane Tuggeranong Woomera
Location
Albury 1.000000 -0.124534 -0.381246 -0.500000
Brisbane -0.124534 1.000000 -0.869799 -0.797017
Tuggeranong -0.381246 -0.869799 1.000000 0.991241
Woomera -0.500000 -0.797017 0.991241 1.000000
An alternative, slightly more involved solution would be to keep the longest series and impute the missing values with the mean or median.
But even though you would feed more data into your algorithm, that would not cure the main problem: your data seem to be misaligned. What I mean by this is that, to do correlation analysis properly, you should make sure you compare comparable values, e.g. summer rainfall in one city with summer rainfall in another city. To do the analysis this way, you should make sure you have an equal number of comparable rainfall observations for each city: e.g. winter, spring, summer, autumn; or January, February, ..., December.
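A minimal sketch of that month-by-month alignment, assuming df is the frame shown in the question with Date, Location and Rainfall columns (the aggregation choice here is illustrative only):
import pandas as pd

df["Date"] = pd.to_datetime(df["Date"])
# average rainfall per calendar month, one column per city, so like months are compared with like
monthly = (
    df.assign(month=df["Date"].dt.month)
      .pivot_table(index="month", columns="Location", values="Rainfall", aggfunc="mean")
)
print(monthly.corr())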
First off, here is my dataframe:
Date 2012-09-04 00:00:00 2012-09-05 00:00:00 2012-09-06 00:00:00 2012-09-07 00:00:00 2012-09-10 00:00:00 2012-09-11 00:00:00 2012-09-12 00:00:00 2012-09-13 00:00:00 2012-09-14 00:00:00 2012-09-17 00:00:00 ... 2017-08-22 00:00:00 2017-08-23 00:00:00 2017-08-24 00:00:00 2017-08-25 00:00:00 2017-08-28 00:00:00 2017-08-29 00:00:00 2017-08-30 00:00:00 2017-08-31 00:00:00 2017-09-01 00:00:00 Type
AABTX 9.73 9.73 9.83 9.86 9.83 9.86 9.86 9.96 9.98 9.96 ... 11.44 11.45 11.44 11.46 11.46 11.47 11.47 11.51 11.52 Hybrid
AACTX 9.66 9.65 9.77 9.81 9.78 9.81 9.82 9.92 9.95 9.93 ... 12.32 12.32 12.31 12.33 12.34 12.34 12.35 12.40 12.41 Hybrid
AADTX 9.71 9.70 9.85 9.90 9.86 9.89 9.91 10.02 10.07 10.05 ... 13.05 13.04 13.03 13.05 13.06 13.06 13.08 13.14 13.15 Hybrid
AAETX 9.92 9.91 10.07 10.13 10.08 10.12 10.14 10.26 10.32 10.29 ... 13.84 13.84 13.82 13.85 13.86 13.86 13.89 13.96 13.98 Hybrid
AAFTX 9.85 9.84 10.01 10.06 10.01 10.05 10.07 10.20 10.26 10.23 ... 14.09 14.08 14.07 14.09 14.11 14.11 14.15 14.24 14.26 Hybrid
That is a bit hard to read, but essentially these are just closing prices for several mutual funds (638 of them), with the Type label in the last column. I'd like to plot all of these on a single plot and have a legend labeling the type of each series.
I'd like to see how many potential clusters I may need. This was my first thought for visualizing the data, but if you have any other recommendations, feel free to suggest them.
Also, in my first attempt, I tried:
parallel_coordinates(closing_data, 'Type', alpha=0.2, colormap=dark2_cmap)
plt.show()
It just shows up as a black blob, and after some research I found that parallel_coordinates doesn't handle a large number of features that well.
My suggestion is to transpose the dataframe, as a timestamp comes more naturally as an index, and you will then be able to address individual time series as df.AABTX or df['AABTX'].
With a smaller number of time series you could have tried df.plot(), but when the frame is this large you should not be surprised to see a mess initially.
Try plotting a subset of your data, but please make sure the time is in the index, not in the column names.
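For example, a minimal sketch of that suggestion, assuming closing_data is the frame from the question (tickers as the index, dates as columns, plus the trailing Type column); the ticker subset is just an illustration:
import pandas as pd
from matplotlib import pyplot as plt

prices = closing_data.drop(columns="Type").T       # transpose so the dates form the index
prices.index = pd.to_datetime(prices.index)        # make sure the index is a DatetimeIndex
prices[["AABTX", "AACTX", "AADTX"]].plot(figsize=(12, 6))  # start with a small subset
plt.show()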
You may be looking for something like silhouette analysis, which is implemented in the scikit-learn machine learning library. It should help you find an optimal number of clusters to consider for your data.
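A sketch of what that could look like, assuming X is a numeric array with one row per fund (e.g. the question's frame without the Type column); the clustering setup is illustrative, not prescribed by scikit-learn:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = closing_data.drop(columns="Type").to_numpy(dtype=float)
for k in range(2, 10):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))  # a higher average silhouette suggests a better k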
I have a pandas DataFrame of statistics for NBA games. Here's a sample of the data for away teams:
away_team away_efg away_drb away_score
date
2000-10-31 19:00:00 Los Angeles Clippers 0.522 74.4 94
2000-10-31 19:00:00 Milwaukee Bucks 0.434 63.0 93
2000-10-31 19:30:00 Minnesota Timberwolves 0.523 73.8 106
2000-10-31 19:30:00 Charlotte Hornets 0.605 77.1 106
2000-10-31 19:30:00 Seattle SuperSonics 0.429 73.1 88
There are many more numeric columns other than the away_score column, and also analogous columns for the home team.
What I would like is, for each row, to replace the numeric columns (other than score) with the mean of the previous three observations, partitioned by team. I can almost get what I want by doing the following:
home_df.groupby("team").apply(lambda x: x.rolling(window=3).mean())
This returns, for example,
>>> home_avg[home_avg["team"]=="Utah Jazz"].head()
3par ast blk drb efg ftr orb
0 NaN NaN NaN NaN NaN NaN NaN
50 NaN NaN NaN NaN NaN NaN NaN
81 0.146667 71.600000 9.4 74.666667 0.512000 0.347667 25.833333
Taking this, along with
>>> home_df[home_df["team"]=="Utah Jazz"].head()
3par ast blk drb efg ftr orb stl team tov trb
0 0.118 76.7 7.1 64.7 0.535 0.365 25.6 11.5 Utah Jazz 10.8 42.9
50 0.100 63.9 9.1 80.5 0.536 0.414 27.6 2.2 Utah Jazz 20.2 58.6
81 0.222 74.2 12.0 78.8 0.465 0.264 24.3 7.3 Utah Jazz 13.9 50.0
122 0.119 81.8 11.3 75.0 0.515 0.642 25.0 12.2 Utah Jazz 21.8 52.5
135 0.129 76.7 17.8 75.9 0.650 0.400 37.9 5.7 Utah Jazz 18.8 62.7
demonstrates that it is including the current row in the calculation of the mean. I want to avoid this. More specifically, the desired output for row 81 would be all NaNs (because there haven't been three games yet), and the entry in the 3par column for row 122 would be .146667 (the average of the values in that column for rows 0, 50, and 81).
So, my question is, how can I exclude the current row in the rolling mean calculation?
You can use shift here, which shifts the values down by a given number of rows, so that your rolling window uses the last three values excluding the current one:
import numpy as np
import pandas as pd

#create a dummy data frame with numeric values
df = pd.DataFrame({"numeric_col": np.random.randint(0, 100, size=5)})
print(df)
numeric_col
0 66
1 60
2 74
3 41
4 83
df["mean"] = df["numeric_col"].shift(1).rolling(window=3).mean()
print(df)
numeric_col mean
0 66 NaN
1 60 NaN
2 74 NaN
3 41 66.666667
4 83 58.333333
Accordingly, change your apply function to lambda x: x.shift(1).rolling(window=3).mean() to make it work in your specific example.
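Spelled out against the grouped frame from the question, that would be (a sketch reusing the question's names):
home_avg = home_df.groupby("team").apply(lambda x: x.shift(1).rolling(window=3).mean())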
Can someone please tell me how I can fill in the missing values of my dataframe? The missing values don't come up as NaN or anything common; instead they show up as two dots, like '..'. How would I go about filling them in with the mean of the row they are in?
1971 1990 1999 2000 2001 2002
Estonia .. 17.4 8.3 8.5 8.5 8.6
Spain 61.6 151.2 205.9 222.2 233.2 241.6
SlovakRepublic 10.9 25.5 28.1 30.8 31.9 32.2
Slovenia .. 12.4 13.3 13.6 14.5 14.6
My headers are the years and my index is the countries.
It seems you can use mask: compare the underlying numpy array (values) with '..', replace those cells with the row means, and finally cast all columns to float:
print (df.mean(axis=1))
Estonia 10.26
Spain 210.82
SlovakRepublic 29.70
Slovenia 13.68
df = df.mask(df.values == '..', df.mean(axis=1), axis=0).astype(float)
print (df)
1971 1990 1999 2000 2001 2002
Estonia 10.26 17.4 8.3 8.5 8.5 8.6
Spain 61.6 151.2 205.9 222.2 233.2 241.6
SlovakRepublic 10.9 25.5 28.1 30.8 31.9 32.2
Slovenia 13.68 12.4 13.3 13.6 14.5 14.6
You should be able to use .set_value.
Try df_name.set_value('index', 'column', value), something like:
df_name.set_value('Estonia','1971', 50)
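Note that in recent pandas versions set_value has been removed; the .at indexer is the standard single-cell replacement (a general pandas note, assuming the year headers are strings as shown):
df_name.at['Estonia', '1971'] = 50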