df_close['MA'] = df_close.rolling(window=12).mean()
I keep getting this error. Can anyone help, please?
ValueError: Wrong number of items passed 20, placement implies 1
My assignment:
Pull 20 years of monthly stock price and trading volume data for any 20 stocks of your choice from Yahoo Finance.
Calculate the monthly returns and 12-month moving average of each stock.
Other parts of the code:
start = dt.datetime(2000,1,1)
end = dt.datetime(2020,2,1)
df = web.DataReader(['AAPL', 'MSFT', 'ROKU', 'GS', 'GOOG', 'KO', 'ULTA', 'JNJ', 'ZM', 'AMZN', 'NFLX', 'TSLA', 'CMG', 'ATVI', 'LOW', 'BA', 'SYY', 'SNAP', 'BYND', 'OSTK'], 'yahoo',start,end)
df['Adj Close']
df['Volume']
data1 = df[['Adj Close', 'Volume']].copy()
data1['date1'] = data1.index
print(data1)
data2 = data1.merge(data1, left_on='date1', right_on='date1')
data2
df.sort_index()
df
df_monthly_returns = df['Adj Close'].ffill().pct_change()
df_monthly_returns.sort_index()
print(df_monthly_returns.tail())
df_close['MA'] = df_close.rolling(window=12).mean()
df_close
ValueError: Wrong number of items passed 20, placement implies 1 means that you are attempting to put too many elements into too little space.
df_close['MA'] = df_close.rolling(window=12).mean()
You are pushing 20 "things" into a container that allows only one.
So if you want to put an element with 20 "columns" into a single dataframe column, try iterating over df_close.rolling(window=12).mean() and storing the results in df_close['MA'].
And please update your question with more code so that I can give you an exact solution.
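For example, a minimal sketch of the per-ticker moving average, assuming df_close holds the 'Adj Close' prices with one column per ticker (as in the code above):
df_close = df['Adj Close']
# rolling().mean() on a 20-column frame returns 20 columns,
# so keep the result as its own DataFrame rather than forcing it into one column
df_ma = df_close.rolling(window=12).mean()
# or, to keep the moving averages next to the prices:
df_close_with_ma = df_close.join(df_ma.add_suffix(' 12M MA'))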
I have a Python project:
df_testR with columns={'Name', 'City','Licence', 'Amount'}
df_testF with columns={'Name', 'City','Licence', 'Amount'}
I want to compare both df's. The result should be a df where I see the Name, City, Licence and the Amount. Normally, df_testR and df_testF should be exactly the same.
In case it is not the same, I want to see the difference in Amount_R vs Amount_F.
I referred to: Diff between two dataframes in pandas
But I receive a table with TRUE and FALSE only:
Name   City   Licence   Amount
True   True   True      False
But I'd like to get a table that lists ONLY the lines where differences occur and shows them in a form such as:
Name   City   Licence   Amount_R   Amount_F
Paul   NY     YES       200        500
Here, both tables contain Paul, NY and Licence = YES, but table R contains 200 as the Amount and table F contains 500. I want to receive a table from my analysis that captures only the lines where such differences occur.
Could someone help?
import copy
import pandas as pd
data1 = {'Name': ['A', 'B', 'C'], 'City': ['SF', 'LA', 'NY'], 'Licence': ['YES', 'NO', 'NO'], 'Amount': [100, 200, 300]}
data2 = copy.deepcopy(data1)
data2.update({'Amount': [500, 200, 300]})
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
df2.drop(1, inplace=True)
First find the missing rows and print them:
matching = df1.isin(df2)
meta_data_columns = ['Name', 'City', 'Licence']
metadata_match = matching[meta_data_columns]
metadata_match['check'] = metadata_match.apply(all, 1, raw=True)
missing_rows = list(metadata_match.index[~metadata_match['check']])
if missing_rows:
    print('Some rows are missing from df2:')
    print(df1.iloc[missing_rows, :])
Then drop these rows and merge:
df3 = pd.merge(df2, df1.drop(missing_rows), on=meta_data_columns)
Now remove the rows that have the same amount:
df_different_amounts = df3.loc[df3['Amount_x'] != df3['Amount_y'], :]
I assumed the DFs are sorted.
If you're dealing with very large DFs it might be better to first filter the DFs to make the merge faster.
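For the Amount_R / Amount_F layout asked for in the question, a merge with suffixes is another option; a minimal sketch using the df1, df2 and meta_data_columns defined above (the suffix names are my choice):
# align on the metadata columns and keep both Amount columns side by side
merged = pd.merge(df1, df2, on=meta_data_columns, suffixes=('_R', '_F'))
# keep only the rows where the amounts differ
diff = merged.loc[merged['Amount_R'] != merged['Amount_F']]
print(diff)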
I have a script like this in pandas:
dfmi['Time'] = pd.to_datetime(dfmi['Time'], format='%H:%M:%S')
dfmi['hours'] = dfmi['Time'].dt.hour
sum_dh = dfmi.groupby(['Date','hours']).agg({'Amount': 'sum', 'Price':'sum'})
dfdhsum = pd.DataFrame(sum_dh)
dfdhsum.columns = ['Amount', 'Gas Sales']
dfdhsum
And the output:
I want a sum with a distinct GROUP BY, and the final result should look like this:
What is the pandas code solution for this?
I don't understand exactly what you want, but this instruction will sum hours, Amount and Gas Sales for each date:
dfmi.groupby("Date").agg({'hours': 'sum', 'Amount': 'sum', 'Gas Sales': 'sum'})
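A self-contained sketch of both levels of grouping, with made-up sample data standing in for dfmi (column names follow the question):
import pandas as pd

dfmi = pd.DataFrame({
    'Date':   ['2020-01-01', '2020-01-01', '2020-01-01', '2020-01-02'],
    'Time':   ['08:15:00', '08:45:00', '09:10:00', '10:00:00'],
    'Amount': [10, 20, 30, 40],
    'Price':  [100, 200, 300, 400],
})
dfmi['hours'] = pd.to_datetime(dfmi['Time'], format='%H:%M:%S').dt.hour

# per date and hour, as in the question
per_hour = dfmi.groupby(['Date', 'hours']).agg({'Amount': 'sum', 'Price': 'sum'})
# per date only, as suggested above
per_date = dfmi.groupby('Date').agg({'Amount': 'sum', 'Price': 'sum'})
print(per_hour)
print(per_date)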
I'm trying to forecast some data about my city in terms of population. I have a table showing the population of my city from 1950 till 2021. Using pandas and ExponentialSmoothing, I'm trying to forecast how much population my city will have over the next 10 years. I'm stuck here:
train_data = df.iloc[:60]
test_data = df.iloc[59:]
fitted = ExponentialSmoothing(train_data["Population"],
                              trend="add",
                              seasonal="add",
                              seasonal_periods=12).fit()
fitted.forecast(10)
However, I get this message:
'The start argument could not be matched to a location related to the index of the data.'
Update: here is some of the code from my work:
Jeddah_tb = pd.read_html("https://www.macrotrends.net/cities/22421/jiddah/population", match ="Jiddah - Historical Population Data", parse_dates=True)
df['Year'] = pd.to_datetime(df['Year'], format="%Y")
df.set_index("Year", inplace=True)
and here is the index:
DatetimeIndex(['2021-01-01', '2020-01-01', '2019-01-01', '2018-01-01',
'2017-01-01', '2016-01-01', '2015-01-01', '2014-01-01',
'2013-01-01', '2012-01-01', '2011-01-01', '2010-01-01',
'2009-01-01', '2008-01-01', '2007-01-01', '2006-01-01',
'2005-01-01', '2004-01-01', '2003-01-01', '2002-01-01',
'2001-01-01', '2000-01-01', '1999-01-01', '1998-01-01',
'1997-01-01', '1996-01-01', '1995-01-01', '1994-01-01',
'1993-01-01', '1992-01-01', '1991-01-01', '1990-01-01',
'1989-01-01', '1988-01-01', '1987-01-01', '1986-01-01',
'1985-01-01', '1984-01-01', '1983-01-01', '1982-01-01',
'1981-01-01', '1980-01-01', '1979-01-01', '1978-01-01',
'1977-01-01', '1976-01-01', '1975-01-01', '1974-01-01',
'1973-01-01', '1972-01-01', '1971-01-01', '1970-01-01',
'1969-01-01', '1968-01-01', '1967-01-01', '1966-01-01',
'1965-01-01', '1964-01-01', '1963-01-01', '1962-01-01',
'1961-01-01', '1960-01-01', '1959-01-01', '1958-01-01',
'1957-01-01', '1956-01-01', '1955-01-01', '1954-01-01',
'1953-01-01', '1952-01-01', '1951-01-01', '1950-01-01'],
dtype='datetime64[ns]', name='Year', freq='-1AS-JAN')
I didn't face any issue while trying to reproduce your code. However, before doing time series forecasting, make sure your data is in ascending order of dates: df = df.sort_values(by='Year', ascending=True). In your case, train_data runs from 2021 back to 1962 and test_data runs from 1962 back to 1950, so you are training on recent data but testing on the past. Sort your dataframe in ascending order. Also make test_data = df.iloc[60:], because 1962 is currently present in both train_data and test_data.
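A minimal sketch of the suggested fix, keeping the question's model parameters (note that a 12-period seasonality is questionable for yearly data, but it is left as in the question):
from statsmodels.tsa.holtwinters import ExponentialSmoothing

df = df.sort_index()            # Year is the index, so this sorts oldest first (1950 ... 2021)

train_data = df.iloc[:60]       # 1950-2009
test_data = df.iloc[60:]        # 2010-2021, no overlap with the training set

fitted = ExponentialSmoothing(train_data['Population'],
                              trend='add',
                              seasonal='add',
                              seasonal_periods=12).fit()
print(fitted.forecast(10))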
I am working on some portfolio analysis and am trying to get a working function for pulling data for stocks, using a list of Ticker Symbols. Here is my list:
Ticker_List={'Tickers':['SPY', 'AAPL', 'TSLA', 'AMZN', 'BRK.B', 'DAL', 'EURN', 'AMD',
'NVDA', 'SPG', 'DIS', 'SBUX', 'MMP', 'USFD', 'CHEF', 'SYY',
'GOOGL', 'MSFT']}
I'm passing the list through this function like so:
Port=kit.d(Ticker_List)
def d(Ticker_List):
    x=[]
    for i in Ticker_List['Tickers']:
        x.append(Closing_price_alltime(i))
    return x
def Closing_price_alltime(Ticker):
    Closedf=td_client.get_price_history(Ticker, period_type='year', period=20, frequency_type='daily', frequency=1)
    return Closedf
Which pulls data from TDAmeritrade and gives me back:
[{'candles': [{'open': 147.46875,'high': 148.21875,
'low': 146.875,'close': 147.125,
'volume': 6998100,'datetime': 960181200000},
{'open': 146.625,'high': 147.78125,
'low': 145.90625,'close': 146.46875,
'volume': 4858900,'datetime': 960267600000},
...],
'symbol': 'MSFT',
'empty': False}]
(This is just a sample of course)
Finally, I'm cleaning up with:
Port=pd.DataFrame(Port)
Port=pd.DataFrame.drop(Port, columns='empty')
Which gives the DataFrame:
candles symbol
0 [{'open': 147.46875, 'high': 148.21875, 'low': 146.875, 'close': 147.125, 'volume': 6998100, 'datetime': 960181200000}, {'open': 146.625, 'high': ...} SPY
1 [{'open': 3.33259, 'high': 3.401786, 'low': 3.203126, 'close': 3.261161, 'volume': 80917200, 'datetime': 960181200000}, {'open': 3.284599, 'high':...} AAPL
How can I get the close price out of the nested dictionary in each row and set those as the columns, with the ticker symbols (currently in their own column) as the headers of the closing price columns? Also, how do I extract the datetime from each nested dictionary and set it as the index?
EDIT: More info
My original method of building this DataFrame was:
SPY_close=kit.Closing_price_alltime('SPY')
AAPL_close=kit.Closing_price_alltime('AAPL')
TSLA_close=kit.Closing_price_alltime('TSLA')
AMZN_close=kit.Closing_price_alltime('AMZN')
BRKB_close=kit.Closing_price_alltime('BRK.B')
DAL_close=kit.Closing_price_alltime('DAL')
EURN_close=kit.Closing_price_alltime('EURN')
AMD_close=kit.Closing_price_alltime('AMD')
NVDA_close=kit.Closing_price_alltime('NVDA')
SPG_close=kit.Closing_price_alltime('SPG')
DIS_close=kit.Closing_price_alltime('DIS')
SBUX_close=kit.Closing_price_alltime('SBUX')
MMP_close=kit.Closing_price_alltime('MMP')
USFD_close=kit.Closing_price_alltime('USFD')
CHEF_close=kit.Closing_price_alltime('CHEF')
SYY_close=kit.Closing_price_alltime('SYY')
GOOGL_close=kit.Closing_price_alltime('GOOGL')
MSFT_close=kit.Closing_price_alltime('MSFT')
def Closing_price_alltime(Ticker):
    """
    Gets Closing Price for Past 20 Years w/ Daily Intervals
    and Formats it to correct Date and single 'Closing Price'
    column.
    """
    Raw_close=td_client.get_price_history(Ticker,
        period_type='year', period=20, frequency_type='daily', frequency=1)
    #Closedf = pd.DataFrame(Raw_close['candles']).set_index('datetime')
    #Closedf=pd.DataFrame.drop(Closedf, columns=['open', 'high',
    #                          'low', 'volume'])
    #Closedf.index = pd.to_datetime(Closedf.index, unit='ms')
    #Closedf.index.names=['Date']
    #Closedf.columns=[f'{Ticker} Close']
    #Closedf=Closedf.dropna()
    return Closedf
SPY_pct=kit.pct_change(SPY_close)
AAPL_pct=kit.pct_change(AAPL_close)
TSLA_pct=kit.pct_change(TSLA_close)
AMZN_pct=kit.pct_change(AMZN_close)
BRKB_pct=kit.pct_change(BRKB_close)
DAL_pct=kit.pct_change(DAL_close)
EURN_pct=kit.pct_change(EURN_close)
AMD_pct=kit.pct_change(AMD_close)
NVDA_pct=kit.pct_change(NVDA_close)
SPG_pct=kit.pct_change(SPG_close)
DIS_pct=kit.pct_change(DIS_close)
SBUX_pct=kit.pct_change(SBUX_close)
MMP_pct=kit.pct_change(MMP_close)
USFD_pct=kit.pct_change(USFD_close)
CHEF_pct=kit.pct_change(CHEF_close)
SYY_pct=kit.pct_change(SYY_close)
GOOGL_pct=kit.pct_change(GOOGL_close)
MSFT_pct=kit.pct_change(MSFT_close)
def pct_change(Ticker_ClosingValues):
    """
    Takes Closing Values and Finds Percent Change.
    Closing Value Column must be named 'Closing Price'.
    """
    return_pct=Ticker_ClosingValues.pct_change()
    return_pct=return_pct.dropna()
    return return_pct
Portfolio_hist_rets=[SPY_pct, AAPL_pct, TSLA_pct, AMZN_pct,
BRKB_pct, DAL_pct, EURN_pct, AMD_pct,
NVDA_pct, SPG_pct, DIS_pct, SBUX_pct,
MMP_pct, USFD_pct, CHEF_pct, SYY_pct,
GOOGL_pct, MSFT_pct]
Which returned exactly what I wanted:
SPY Close AAPL Close TSLA Close AMZN Close BRK.B Close
Date
2000-06-06 05:00:00 -0.004460 0.017111 NaN -0.072248 -0.002060
2000-06-07 05:00:00 0.006934 0.039704 NaN 0.024722 0.013416
2000-06-08 05:00:00 -0.003920 -0.018123 NaN 0.001206 -0.004073
This method is obviously much less efficient than just using a for loop to create a DataFrame from a list of tickers.
In short, I'm asking what changes can be made to my new code (above the edit) to achieve the same end result as my old code (below the edit): a well-formatted and labeled DataFrame.
Closing_price_alltime return value:
d = [{'candles': [{'open': 147.46875,'high': 148.21875,
'low': 146.875,'close': 147.125,
'volume': 6998100,'datetime': 960181200000},
{'open': 146.625,'high': 147.78125,
'low': 145.90625,'close': 146.46875,
'volume': 4858900,'datetime': 960267600000}
],
'symbol': 'MSFT',
'empty': False}]
You could extract symbol, datetime, and close like this:
import operator
import pandas as pd
data = operator.itemgetter('datetime','close')
symbol = d[0]['symbol']
candles = d[0]['candles']
dt, closing = zip(*map(data, candles))
# for loop equivalent to zip(*map...)
#dt = []
#closing = []
#for candle in candles:
# dt.append(candle['datetime'])
# closing.append(candle['close'])
s = pd.Series(data=closing,index=dt,name=symbol)
This will create a DataFrame of closing prices for each symbol in the list.
results = []
for ticker in Ticker_List['Tickers']:
    d = Closing_price_alltime(ticker)
    symbol = d[0]['symbol']
    candles = d[0]['candles']
    dt, closing = zip(*map(data, candles))
    results.append(pd.Series(data=closing, index=dt, name=symbol))
df = pd.concat(results, axis=1)
pandas.DataFrame.pct_change
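As a final step, the epoch-millisecond index can be converted and the returns computed on the combined frame; a short sketch (here the column labels are the plain ticker symbols rather than 'AAPL Close'):
df.index = pd.to_datetime(df.index, unit='ms')    # candle datetimes are epoch milliseconds
returns = df.pct_change().dropna(how='all')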
This is the final function I wrote which accomplishes my goal:
def Port_consol(Ticker_List):
    """
    Consolidates Ticker Symbol Returns and Returns
    a Single Portfolio
    """
    Port=[]
    Port_=[]
    for i in Ticker_List['Tickers']:
        Port.append(Closing_price_alltime(i))
    n_assets = len(Ticker_List['Tickers'])   # number of tickers pulled above
    for i in range(n_assets):
        data = operator.itemgetter('datetime', 'close')
        symbol = Port[i]['symbol']
        candles = Port[i]['candles']
        dt, closing = zip(*map(data, candles))
        s = pd.Series(data=closing, index=dt, name=symbol)
        s = pd.DataFrame(s)
        s.index = pd.to_datetime(s.index, unit='ms')
        Port_.append(s)
    Portfolio = pd.concat(Port_, axis=1, sort=False)
    return Portfolio
I can now pass a list of tickers to this function; the data is pulled from TDAmeritrade's API (using the Python package td-ameritrade-python-api) and a DataFrame is formed with historical closing prices for the stocks whose tickers I passed through.
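A usage sketch under the same assumptions (an authenticated td_client and the Ticker_List defined above):
Portfolio = Port_consol(Ticker_List)              # one closing-price column per ticker
Portfolio_returns = Portfolio.pct_change().dropna(how='all')
print(Portfolio_returns.tail())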
I have a pandas dataframe that contains multiple rows with a datetime and a sensor value. My goal is to add a column that calculates the days until the sensor value will exceed the threshold the next time.
For instance, for the data <2019-01-05 11:00:00, 200>, <2019-01-06 12:00:00, 250>, <2019-01-07 13:00:00, 300> I would want the additional column to look like [1 day, 0 days, 0 days] for thresholds between 200 and 250 and [2 days, 1 day, 0 days] for thresholds between 250 and 300.
I tried subsampling the dataframe with df_sub = df[df['sensor_value'] >= threshold], iterating over both dataframes and calculating the next timestamp in df_sub given the current timestamp in df. However, this solution seems to be very inefficient, and I think that pandas might have an optimized way of calculating what I need.
In the following example code, I tried what I described above.
import pandas as pd
data = [{'time': '2019-01-05 11:00:00', 'sensor_value' : 200},
{'time': '2019-01-05 14:37:52', 'sensor_value' : 220},
{'time': '2019-01-05 17:55:12', 'sensor_value' : 235},
{'time': '2019-01-06 12:00:00', 'sensor_value' : 250},
{'time': '2019-01-07 13:00:00', 'sensor_value' : 300},
{'time': '2019-01-08 14:00:00', 'sensor_value' : 250},
{'time': '2019-01-09 15:00:00', 'sensor_value' : 320}]
df = pd.DataFrame(data)
df['time'] = pd.to_datetime(df['time'])
def calc_rul(df, threshold):
    # calculate all datetimes where the threshold is exceeded
    df_sub = sorted(df[df['sensor_value'] >= threshold]['time'].tolist())
    # variable to store all days
    remaining_days = []
    for v1 in df['time'].tolist():
        for v2 in df_sub:
            # if the exceeding date is the first one in the future, calculate the day difference
            if v2 > v1:
                remaining_days.append((v2 - v1).days)
                break
            elif v2 == v1:
                remaining_days.append(0)
                break
    df['RUL'] = pd.Series(remaining_days)
calc_rul(df, 300)
Expected output (output of the above sample):
Here's what I would do for one threshold:
import numpy as np

def calc_rul(df, thresh):
    # mark all the values greater than or equal to thresh
    markers = df.value.ge(thresh)
    # copy the dates of the marked rows
    df['last_day'] = np.nan
    df.loc[markers, 'last_day'] = df.timestamp
    # back fill those dates
    df['last_day'] = df['last_day'].bfill().astype('datetime64[ns]')
    df['RUL'] = (df.last_day - df.timestamp).dt.days
    # drop the helper column if necessary;
    # remove this line to better see how the code works
    df.drop('last_day', axis=1, inplace=True)
calc_rul(df, 300)
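The sketch above assumes columns named timestamp and value; adapted to the question's time and sensor_value columns (my adaptation, same back-fill idea), it would look roughly like this:
def calc_rul(df, thresh):
    # timestamp of the next row at or above the threshold, back-filled to earlier rows
    next_exceed = df['time'].where(df['sensor_value'].ge(thresh)).bfill()
    df['RUL'] = (next_exceed - df['time']).dt.days

calc_rul(df, 300)
print(df)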
Instead of splitting the dataframe, you can use .loc, which lets you filter and iterate over your thresholds in the same way:
df['RUL'] = '[2 days, 1 day, 0 days]'
for threshold in threshold_list:
    df.loc[df['sensor_value'] > <your_rule>, 'RUL'] = '[1 day, 0 days, 0 days]'
This technique avoids splitting the dataframe.