A KeyError when trying to forecast using ExponentialSmoothing - python

I'm trying to forecast some data about my city in terms of population. I have a table showing the population of my city from 1950 till 2021. Using pandas and ExponentialSmoothing, I'm trying to forecast the population for the next 10 years. I'm stuck here:
from statsmodels.tsa.holtwinters import ExponentialSmoothing

train_data = df.iloc[:60]
test_data = df.iloc[59:]
fitted = ExponentialSmoothing(train_data["Population"],
                              trend="add",
                              seasonal="add",
                              seasonal_periods=12).fit()
fitted.forecast(10)
However, I get this message:
'The start argument could not be matched to a location related to the index of the data.'
Update: here is some code from my work:
Jeddah_tb = pd.read_html("https://www.macrotrends.net/cities/22421/jiddah/population", match ="Jiddah - Historical Population Data", parse_dates=True)
df['Year'] = pd.to_datetime(df['Year'], format="%Y")
df.set_index("Year", inplace=True)
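Note that pd.read_html returns a list of the matching tables, while the lines above operate on df, so presumably a step in between picks one table out; a minimal sketch of that assumed step:
# read_html returns a list of DataFrames; take the first matching table
df = Jeddah_tb[0]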
and here is the index:
DatetimeIndex(['2021-01-01', '2020-01-01', '2019-01-01', '2018-01-01',
'2017-01-01', '2016-01-01', '2015-01-01', '2014-01-01',
'2013-01-01', '2012-01-01', '2011-01-01', '2010-01-01',
'2009-01-01', '2008-01-01', '2007-01-01', '2006-01-01',
'2005-01-01', '2004-01-01', '2003-01-01', '2002-01-01',
'2001-01-01', '2000-01-01', '1999-01-01', '1998-01-01',
'1997-01-01', '1996-01-01', '1995-01-01', '1994-01-01',
'1993-01-01', '1992-01-01', '1991-01-01', '1990-01-01',
'1989-01-01', '1988-01-01', '1987-01-01', '1986-01-01',
'1985-01-01', '1984-01-01', '1983-01-01', '1982-01-01',
'1981-01-01', '1980-01-01', '1979-01-01', '1978-01-01',
'1977-01-01', '1976-01-01', '1975-01-01', '1974-01-01',
'1973-01-01', '1972-01-01', '1971-01-01', '1970-01-01',
'1969-01-01', '1968-01-01', '1967-01-01', '1966-01-01',
'1965-01-01', '1964-01-01', '1963-01-01', '1962-01-01',
'1961-01-01', '1960-01-01', '1959-01-01', '1958-01-01',
'1957-01-01', '1956-01-01', '1955-01-01', '1954-01-01',
'1953-01-01', '1952-01-01', '1951-01-01', '1950-01-01'],
dtype='datetime64[ns]', name='Year', freq='-1AS-JAN')

I didn't face any issue while trying to reproduce your code. However, before doing time series forecasting, make sure your data is in ascending order of dates: df = df.sort_values(by='Year', ascending=True). In your case, train_data runs from 2021 back to 1962 and test_data from 1962 back to 1950, so you are training on recent data but testing on the past. Sort your dataframe in ascending order. Also make test_data = df.iloc[60:], because 1962 is currently present in both train_data and test_data.
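A minimal sketch of the full fix, assuming the dataframe indexed by Year shown above (dropping the seasonal component is my assumption: yearly population data has no 12-period seasonality, so seasonal_periods=12 would be meaningless here):
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# sort so the index runs 1950 -> 2021 and carries an explicit yearly frequency
df = df.sort_index(ascending=True).asfreq('AS-JAN')

train_data = df.iloc[:60]   # 1950-2009
test_data = df.iloc[60:]    # 2010-2021, no overlap with the training set

# a plain additive-trend model; the seasonal terms are dropped on purpose
fitted = ExponentialSmoothing(train_data['Population'], trend='add').fit()
print(fitted.forecast(10))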

Related

How to extract data month-wise for a certain period of time using a python DataFrame

df = pd.DataFrame({'Date': pd.date_range(start='2019-07-01', end='2020-06-30')})
df['Date'].groupby(df['Date'].dt.to_period("M"))
df
I have tried:
df[['Date ', 'Sector ', 'Total']].groupby(pd.Grouper(key='Datel', freq='M')).sum()
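A minimal sketch of the monthly aggregation the question seems to be after, assuming the columns are really named 'Date', 'Sector', and 'Total' (the trailing spaces and the 'Datel' key in the attempt look like typos):
import pandas as pd

# hypothetical frame standing in for the question's data
df = pd.DataFrame({'Date': pd.date_range(start='2019-07-01', end='2020-06-30')})
df['Sector'] = 'A'   # hypothetical column so the example runs
df['Total'] = 1      # hypothetical column so the example runs

# strip stray whitespace from column names, then sum month by month
df.columns = df.columns.str.strip()
monthly = df.groupby(pd.Grouper(key='Date', freq='M'))[['Total']].sum()
print(monthly)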

12-month moving average in python

df_close['MA'] = df_close.rolling(window=12).mean()
I keep getting this error; can anyone help please?
ValueError: Wrong number of items passed 20, placement implies 1
My assignment:
Pull 20 years of monthly stock price and trading volume data for any 20 stocks of your pick from Yahoo Finance.
Calculate the monthly returns and 12-month moving average of each stock.
Other parts of the code:
import datetime as dt
import pandas_datareader.data as web

start = dt.datetime(2000,1,1)
end = dt.datetime(2020,2,1)
df = web.DataReader(['AAPL', 'MSFT', 'ROKU', 'GS', 'GOOG', 'KO', 'ULTA', 'JNJ', 'ZM', 'AMZN', 'NFLX', 'TSLA', 'CMG', 'ATVI', 'LOW', 'BA', 'SYY', 'SNAP', 'BYND', 'OSTK'], 'yahoo', start, end)
df['Adj Close']
df['Volume']
data1 = df[['Adj Close', 'Volume']].copy()
data1['date1'] = data1.index
print(data1)
data2 = data1.merge(data1, left_on='date1', right_on='date1')
data2
df.sort_index()
df
df_monthly_returns = df['Adj Close'].ffill().pct_change()
df_monthly_returns.sort_index()
print(df_monthly_returns.tail())
df_close['MA'] = df_close.rolling(window=12).mean()
df_close
ValueError: Wrong number of items passed 20, placement implies 1 means that you are attempting to put too many elements into too little space.
df_close['MA'] = df_close.rolling(window=12).mean()
You are pushing 20 "things" into a container that allows only one.
So if you want to put an element with 20 "columns" into a single dataframe, iterate over df_close.rolling(window=12).mean() and store the results column by column instead of assigning them all to df_close['MA'] at once.
And please update your question with more code so that I can give you an exact solution.
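A minimal sketch of one way around the error, assuming df_close holds the 'Adj Close' prices with one column per ticker (which is what the DataReader call above produces):
# rolling(12).mean() on a 20-column frame returns another 20-column frame,
# so give each moving average its own column instead of a single 'MA' column
df_close = df['Adj Close']
ma = df_close.rolling(window=12).mean()
ma.columns = [f'{ticker}_MA' for ticker in df_close.columns]
df_close = df_close.join(ma)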

Converting string to date-time pandas

I am fetching data from an API into a pandas dataframe whose index values are as follows:
df.index = ['Q1-2013', 'Q1-2014', 'Q1-2015', 'Q1-2016', 'Q1-2017', 'Q1-2018',
            'Q2-2013', 'Q2-2014', 'Q2-2015', 'Q2-2016', 'Q2-2017', 'Q2-2018',
            'Q3-2013', 'Q3-2014', 'Q3-2015', 'Q3-2016', 'Q3-2017', 'Q3-2018',
            'Q4-2013', 'Q4-2014', 'Q4-2015', 'Q4-2016', 'Q4-2017', 'Q4-2018']
It is a list of string values. Is there a way to convert this to pandas datetime?
I explored a few Q&As, and they are about using pd.to_datetime, which works when the index holds standard date strings. In this example, the index values are quarter strings that pd.to_datetime cannot parse directly.
Expected output:
new_df=magic_function(df.index)
print(new_df.index[0])
01-2013
Wondering how to build "magic_function". Thanks in advance.
Q1 is quarter 1, which starts in January; Q2 is quarter 2, which starts in April; Q3 is quarter 3, which starts in July; and Q4 is quarter 4, which starts in October.
With a bit of manipulation to get the parsing to work, you can use pd.PeriodIndex and format as wanted (the reason being that the format %Y%q is expected):
df.index = [''.join(s.split('-')[::-1]) for s in df.index]
df.index = pd.PeriodIndex(df.index, freq='Q').to_timestamp().strftime('%m-%Y')
print(df.index)
Index(['01-2013', '01-2014', '01-2015', '01-2016', '01-2017', '01-2018',
'04-2013', '04-2014', '04-2015', '04-2016', '04-2017', '04-2018',
'07-2013', '07-2014', '07-2015', '07-2016', '07-2017', '07-2018',
'10-2013', '10-2014', '10-2015', '10-2016', '10-2017', '10-2018'],
dtype='object')
We could also get the required format using str.replace:
df.index = df.index.str.replace(r'(Q\d)-(\d+)', r'\2\1')
df.index = pd.PeriodIndex(df.index, freq='Q').to_timestamp().strftime('%m-%Y')
You can map a function to the index with pandas.Index.map:
quarter_months = {
    'Q1': 1,
    'Q2': 4,
    'Q3': 7,
    'Q4': 10,
}

def quarter_to_month_year(quarter_year):
    quarter, year = quarter_year.split('-')
    month_year = '%s-%s' % (quarter_months[quarter], year)
    return pd.to_datetime(month_year, format='%m-%Y')

df.index = df.index.map(quarter_to_month_year)
This would produce the following result:
DatetimeIndex(['2013-01-01', '2014-01-01', '2015-01-01', '2016-01-01',
'2017-01-01', '2018-01-01', '2013-04-01', '2014-04-01',
'2015-04-01', '2016-04-01', '2017-04-01', '2018-04-01',
'2013-07-01', '2014-07-01', '2015-07-01', '2016-07-01',
'2017-07-01', '2018-07-01', '2013-10-01', '2014-10-01',
'2015-10-01', '2016-10-01', '2017-10-01', '2018-10-01'],
dtype='datetime64[ns]', name='index', freq=None)
to_datetime() function: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html
The index is a datetime64 object after applying to_datetime(); to_period() turns it into a period object, and further modifications like to_timestamp().strftime('%m-%Y') turn the index items into strings:
import pandas as pd
df = pd.DataFrame(index=['Q1-2013', 'Q1-2014', 'Q1-2015', 'Q1-2016', 'Q1-2017', 'Q1-2018',
                         'Q2-2013', 'Q2-2014', 'Q2-2015', 'Q2-2016', 'Q2-2017', 'Q2-2018',
                         'Q3-2013', 'Q3-2014', 'Q3-2015', 'Q3-2016', 'Q3-2017', 'Q3-2018',
                         'Q4-2013', 'Q4-2014', 'Q4-2015', 'Q4-2016', 'Q4-2017', 'Q4-2018'])
# df_new = pd.DataFrame(index=pd.to_datetime(['-'.join(s.split('-')[::-1]) for s in df.index]))
df_new = pd.DataFrame(index=pd.to_datetime(['-'.join(s.split('-')[::-1]) for s in df.index]).to_period('M'))
# df_new = pd.DataFrame(index=pd.to_datetime(['-'.join(s.split('-')[::-1]) for s in df.index]).to_period('M').to_timestamp().strftime('%m-%Y'))
print(df_new.index)
PeriodIndex(['2013-01', '2014-01', '2015-01', '2016-01', '2017-01', '2018-01',
'2013-04', '2014-04', '2015-04', '2016-04', '2017-04', '2018-04',
'2013-07', '2014-07', '2015-07', '2016-07', '2017-07', '2018-07',
'2013-10', '2014-10', '2015-10', '2016-10', '2017-10', '2018-10'],
dtype='period[M]', freq='M')

How to filter stocks in a large dataset - python

I have a large dataset of over 20,000 stocks from 1964-2018. (It's CRSP data I got from my university). I now want to apply the following filter technique according to Amihud (2002):
1. include all stocks that have a price greater than $5 at end of year t-1
2. include all stocks that have data for at least 200 days at end of year t-1
3. the stocks have information about market capitalization at end of year t-1
I'm quite stuck on this since I've never worked with such a large dataset. Any suggestions where I can find ideas on how to solve this problem? Many thanks.
I already tried to filter on a monthly basis: I created a new dataframe that included those stocks whose prices were above $5 in December, and then got stuck. The graph shows the number of stocks over time before and after applying the first filter.
# number of stocks over time
df['month'] = pd.DatetimeIndex(df.index).month
df2 = df[(df.month == 12) & (df.prc >= 5)]
EDIT:
I created a sample dataframe that looks like my original dataframe
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'date': ['2010-05-12', '2010-05-13', '2010-05-13',
                             '2011-11-13', '2011-11-14', '2011-03-30', '2011-12-01',
                             '2011-12-02', '2011-12-01', '2011-12-02'],
                    'stock': ['stock_1', 'stock_1', 'stock_2', 'stock_3',
                              'stock_3', 'stock_3', 'stock_1', 'stock_1', 'stock_2',
                              'stock_2'],
                    'price': [100, 102, 300, 51, 49, 45, 101, 104, 301, 299],
                    'volume': [1000, 1020, np.nan, 510, 490, 450, 1010, 1040,
                               np.nan, 2990],
                    'return': [0.01, 0.03, 0.02, np.nan, 0.02, -0.04, -0.08,
                               -0.01, np.nan, -0.01]})
df1 = df1.set_index(pd.DatetimeIndex(df1['date']))
pivot_df = df1.pivot_table(index=[df1.index, 'stock'],
                           values=['price', 'volume', 'return'])
The resulting dataframe is basically panel data. I want to check whether each stock has return and volume data (not NaN) each day. Then I want to remove all stocks that have return and volume data for fewer than 200 days in a given year. Since the original dataframe contains nearly 20,000 stocks from 1964-2018, I want to do this in an efficient way.
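A minimal sketch of the year-end screens, assuming a long-format frame like df1 above (filter 3 is only sketched in a comment, since the sample frame has no market-capitalization column):
df = df1.sort_index()
by_stock_year = [df['stock'], df.index.year]

# filter 2: days with both return and volume present, per stock and year
valid_days = (df['return'].notna() & df['volume'].notna()).groupby(by_stock_year).sum()

# filter 1: last observed price per stock and year as the year-end price
year_end_price = df.groupby(by_stock_year)['price'].last()

# filter 3 would test a (hypothetical) market-cap column the same way
keep = (year_end_price > 5) & (valid_days >= 200)

# (stock, year) pairs passing the screens at the end of year t-1;
# these are the stocks to include in year t
eligible = keep[keep].index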

Efficiently calculating remaining useful lifetime with pandas

I have a pandas dataframe that contains multiple rows with a datetime and a sensor value. My goal is to add a column that calculates the days until the sensor value will exceed the threshold the next time.
For instance, for the data <2019-01-05 11:00:00, 200>, <2019-01-06 12:00:00, 250>, <2019-01-07 13:00:00, 300> I would want the additional column to look like [1 day, 0 days, 0 days] for thresholds between 200 and 250 and [2 days, 1 day, 0 days] for thresholds between 250 and 300.
I tried subsampling the dataframe with df_sub = df[df[sensor_value] >= threshold], iterating over both dataframes, and calculating the next timestamp in df_sub given the current timestamp in df. However, this solution seems to be very inefficient, and I think that pandas might have some optimized way of calculating what I need.
In the following example code, I tried what I described above.
import pandas as pd

data = [{'time': '2019-01-05 11:00:00', 'sensor_value': 200},
        {'time': '2019-01-05 14:37:52', 'sensor_value': 220},
        {'time': '2019-01-05 17:55:12', 'sensor_value': 235},
        {'time': '2019-01-06 12:00:00', 'sensor_value': 250},
        {'time': '2019-01-07 13:00:00', 'sensor_value': 300},
        {'time': '2019-01-08 14:00:00', 'sensor_value': 250},
        {'time': '2019-01-09 15:00:00', 'sensor_value': 320}]
df = pd.DataFrame(data)
df['time'] = pd.to_datetime(df['time'])

def calc_rul(df, threshold):
    # collect all datetimes where the threshold is reached or exceeded
    df_sub = sorted(df[df['sensor_value'] >= threshold]['time'].tolist())
    # variable to store all days
    remaining_days = []
    for v1 in df['time'].tolist():
        for v2 in df_sub:
            # if the exceeding date is the first one in the future,
            # calculate the difference in days
            if v2 > v1:
                remaining_days.append((v2 - v1).days)
                break
            elif v2 == v1:
                remaining_days.append(0)
                break
    df['RUL'] = pd.Series(remaining_days)

calc_rul(df, 300)
Expected output (output of the above sample):
Here's what I would do for one threshold:
import numpy as np

def calc_rul(df, thresh):
    # mark all the rows whose sensor value reaches the threshold
    markers = df['sensor_value'].ge(thresh)
    # copy the dates of the marked rows
    df['last_day'] = np.nan
    df.loc[markers, 'last_day'] = df['time']
    # back-fill those dates
    df['last_day'] = df['last_day'].bfill().astype('datetime64[ns]')
    df['RUL'] = (df['last_day'] - df['time']).dt.days
    # drop the helper column if necessary;
    # remove this line to better see how the code works
    df.drop('last_day', axis=1, inplace=True)

calc_rul(df, 300)
Instead of splitting the dataframe, you can use .loc, which lets you filter and assign in place while iterating through your thresholds in the same way:
df['RUL'] = '[2 days, 1 day, 0 days]'
for threshold in threshold_list:
    df.loc[df['sensor_value'] > <your_rule>, 'RUL'] = '[1 day, 0 days, 0 days]'
This technique avoids splitting the dataframe.
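For completeness, a vectorized sketch for a single threshold, assuming the frame built in the question; np.searchsorted finds, for each row, the first exceedance time at or after it (the code assumes at least one row reaches the threshold):
import numpy as np
import pandas as pd

def calc_rul_vectorized(df, threshold):
    times = df['time'].to_numpy()
    exceed = np.sort(df.loc[df['sensor_value'] >= threshold, 'time'].to_numpy())
    # index of the first exceedance at or after each row's time
    pos = np.searchsorted(exceed, times, side='left')
    capped = np.minimum(pos, len(exceed) - 1)
    rul = pd.Series(exceed[capped] - times).dt.days
    # rows after the last exceedance have no remaining useful lifetime
    df['RUL'] = rul.where(pos < len(exceed))

calc_rul_vectorized(df, 300)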
