real = MAVP(close, periods, minperiod=2, maxperiod=30, matype=0)
I am trying to use this method, but it raises an error because of the periods parameter.
How can I use this method with a DataFrame like this?
The periods parameter must be passed as an array holding one period per row. As an example, I downloaded Apple's stock price from Yahoo Finance and computed a moving average over the whole range:
import yfinance as yf
data = yf.download("AAPL", start="2020-01-01", end="2021-01-01")
data.head()
Open High Low Close Adj Close Volume
Date
2020-01-02 74.059998 75.150002 73.797501 75.087502 74.333511 135480400
2020-01-03 74.287498 75.144997 74.125000 74.357498 73.610840 146322800
2020-01-06 73.447502 74.989998 73.187500 74.949997 74.197395 118387200
2020-01-07 74.959999 75.224998 74.370003 74.597504 73.848442 108872000
2020-01-08 74.290001 76.110001 74.290001 75.797501 75.036385 132079200
import talib as ta
import numpy as np
data.reset_index(drop=False, inplace=True)
# MAVP needs a float64 array with one period per row; values outside
# [minperiod, maxperiod] are clamped, so this is a constant 30-day lookback
periods = np.full(len(data), 30.0)
real = ta.MAVP(data.Close, periods, minperiod=2, maxperiod=30, matype=0)
real
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
...
248 131.465004
249 134.330002
250 135.779999
251 134.294998
252 133.205002
Length: 253, dtype: float64
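MAVP averages each row over its own lookback, clamped into [minperiod, maxperiod]. If TA-Lib is not installed, the same behaviour can be emulated in plain pandas; this is a hypothetical helper named mavp written for illustration, not the library call:

```python
import numpy as np
import pandas as pd

def mavp(close, periods, minperiod=2, maxperiod=30):
    """Variable-period simple moving average: each row uses its own lookback."""
    p = np.clip(np.asarray(periods, dtype=float).round().astype(int),
                minperiod, maxperiod)
    vals = np.asarray(close, dtype=float)
    out = np.full(len(vals), np.nan)
    for i in range(len(vals)):
        n = p[i]
        if i + 1 >= n:                      # enough history for this row's period
            out[i] = vals[i + 1 - n:i + 1].mean()
    return pd.Series(out, index=getattr(close, "index", None))

close = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
periods = [2, 2, 3, 3, 2, 4]
print(mavp(close, periods).tolist())   # [nan, 1.5, 2.0, 3.0, 4.5, 4.5]
```

Passing a constant array reproduces an ordinary SMA; varying the entries gives a different lookback per bar, which is the point of MAVP.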
Related
I have tested a few different ways to calculate technical indicators for a large dataframe and am unsure how to determine the most efficient and pythonic way to go about it. The data is stock data (date, price, volume). The goal is to iterate through the dataframe, per ticker, calculating multiple technical indicators, and then sending the result back into the source (SQL db).
The data contains about 4,200 stock symbols with daily price data from 2000 to date (roughly 13m rows x 8 columns).
For testing, I've limited the data to just 2021 date range.
Here is a sample of the data:
Date Open High Low Close Adj_close Volume Tick
529326 2021-01-04 3270.00 3272.00 3144.02 3186.63 3186.63 4411400 AMZN
521846 2021-01-05 3166.01 3223.38 3165.06 3218.51 3218.51 2655500 AMZN
521691 2021-01-06 3146.48 3197.51 3131.16 3138.38 3138.38 4394800 AMZN
514195 2021-01-07 3157.00 3208.54 3155.00 3162.16 3162.16 3514500 AMZN
514038 2021-01-08 3180.00 3190.64 3142.20 3182.70 3182.70 3537700 AMZN
506535 2021-01-11 3148.01 3156.38 3110.00 3114.21 3114.21 3683400 AMZN
506376 2021-01-12 3120.00 3142.14 3086.00 3120.83 3120.83 3514600 AMZN
498871 2021-01-13 3128.44 3189.95 3122.08 3165.89 3165.89 3321200 AMZN
498706 2021-01-14 3167.52 3178.00 3120.59 3127.47 3127.47 3070900 AMZN
491194 2021-01-15 3123.02 3142.55 3095.17 3104.25 3104.25 4244000 AMZN
491037 2021-01-19 3107.00 3145.00 3096.00 3120.76 3120.76 3305100 AMZN
483504 2021-01-20 3181.99 3279.80 3175.00 3263.38 3263.38 5309800 AMZN
483351 2021-01-21 3293.00 3348.55 3289.57 3306.99 3306.99 4936100 AMZN
475802 2021-01-22 3304.31 3321.91 3283.16 3292.23 3292.23 2821900 AMZN
475649 2021-01-25 3328.50 3363.89 3243.15 3294.00 3294.00 3749800 AMZN
468087 2021-01-26 3296.36 3338.00 3282.87 3326.13 3326.13 2955200 AMZN
467939 2021-01-27 3341.49 3346.52 3207.08 3232.58 3232.58 4660200 AMZN
460368 2021-01-28 3235.04 3301.68 3228.69 3237.62 3237.62 3149200 AMZN
460219 2021-01-29 3230.00 3236.99 3184.55 3206.20 3206.20 4293600 AMZN
452618 2021-02-01 3242.36 3350.26 3235.03 3342.88 3342.88 4160200 AMZN
I'm not sure how to fully code dummy data, but here are two snippets (they should just need NumPy) that create random price and ticker data; I am just unsure how to merge them all into one dataframe. To simulate the same dataframe, there would be 4,200 symbols and 134 days of data.
letters = ('a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z')
x = np.random.randint(500, size=134)                      # <<< generates 134 random prices
y = ''.join(np.random.choice(letters) for i in range(4))  # <<< generates a random 4-character ticker
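One way to merge those two pieces into a single dummy dataframe is to build one small frame per ticker and concatenate. This is only a sketch under assumed column names (Date, Adj_close, Tick, matching the sample above), with 50 tickers instead of 4,200 to keep it quick:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
letters = list('abcdefghijklmnopqrstuvwxyz')
n_days, n_ticks = 134, 50          # the real data has 4,200 tickers
dates = pd.date_range('2021-01-04', periods=n_days, freq='B')

frames = []
for _ in range(n_ticks):
    tick = ''.join(np.random.choice(letters) for _ in range(4)).upper()
    close = np.random.randint(500, size=n_days).astype(float)
    frames.append(pd.DataFrame({'Date': dates, 'Adj_close': close, 'Tick': tick}))

# one long frame, sorted by date like the real table
sc = pd.concat(frames, ignore_index=True).sort_values('Date')
print(sc.shape)   # (6700, 3)
```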
Here are all the imports being used:
#imports
from datetime import datetime, timedelta, date
import time
import sqlalchemy as sa
import pandas as pd
import numpy as np
import yfinance as yf
import pyodbc
import pandas_ta as ta
import talib
Dataframe 'sc' is referenced, with data in the following format:
Date Open High Low Close Adj_close Volume Tick
529377 2021-01-04 38.68 38.69 37.18 37.88 37.88 647700 ACIW
526834 2021-01-04 29.72 29.94 28.68 29.10 29.10 1527600 GOOS
526833 2021-01-04 15.35 15.40 14.92 15.01 14.39 421400 ETV
526832 2021-01-04 42.22 42.36 41.13 41.46 40.84 204000 HMN
526831 2021-01-04 13.94 15.72 13.75 15.38 15.38 880500 GATO
Then I want to iterate through the dataframe (which is sorted by Date) for each ticker and calculate a number of technical indicators. For now I am starting with just two moving average calculations. I've tried 3 different methods and compared the times below.
Talib package: 10 minutes
start_time = time.time()
ticks = pd.unique(sc['Tick'].tolist()) # <<< 4,200 unique tickers
ndf = [] # <<< initialize df
for tick in ticks:
# store ID(symbol), Date, Close(adj_close), and two indicators (SMA,EMA) in variables to
# concat into a temporary df, and then append outside of loop. Not sure if this is most
# efficient/pythonic way to do this.
ID = sc[sc["Tick"]==tick]["Tick"]
DATE = sc[sc["Tick"]==tick]["Date"]
CLOSE = sc[sc["Tick"]==tick]["Adj_close"]
SMA = round(talib.SMA(sc[sc["Tick"]==tick]['Adj_close']),2)
EMA = round(talib.EMA(sc[sc["Tick"]==tick]['Adj_close']),2)
#concat into one df
tempdf = pd.concat([ID, DATE, CLOSE, SMA, EMA], axis=1)
#append into main df outside of loop
ndf.append(tempdf)
print("Completed Indicators for "+tick)
# Concat everything in -ndf into a flattened df (-df)
df = pd.concat(ndf)
df['t_id'] = df['Tick']+'-'+df['Date']
df.rename(columns={'Adj_close':'Close', 0: "SMA", 1: "EMA"},inplace=True)
df=df.sort_values(by='Date')
print("--- %s seconds ---" % (time.time() - start_time))
df.tail(20)
Pandas TA: 13.6 minutes
#Swapped the "talib" lines for Pandas-TA package
SMA5 = ta.sma(sc[sc["Tick"]==tick]['Adj_close'], length=5)
SMA15 = ta.sma(sc[sc["Tick"]==tick]['Adj_close'], length=15)
Rolling method / Pandas: 14 minutes
# Swapped the "talib" lines for Rolling():
SMA5 = sc[sc["Tick"]==tick]['Adj_close'].rolling(5,min_periods=1).mean()
SMA15 = sc[sc["Tick"]==tick]['Adj_close'].rolling(15,min_periods=1).mean()
I am unsure how to gauge what an efficient time would be (is 10 minutes generally good or bad, or is it just dependent on personal requirements?), and whether looping through each ticker, storing each indicator separately, then concatenating and finally appending back into a master dataframe is an appropriately pythonic approach. The final code will insert the complete dataframe back into a SQL table.
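For comparison, the per-ticker loop above (and its repeated sc[sc["Tick"]==tick] masks, each of which rescans the whole frame) can usually be replaced by a single groupby pass. A minimal sketch of the rolling-mean variant, with column names taken from the sample and a tiny dummy frame standing in for sc:

```python
import pandas as pd

# tiny stand-in for the real sc dataframe
sc = pd.DataFrame({
    'Tick': ['AMZN'] * 4 + ['ACIW'] * 4,
    'Date': pd.date_range('2021-01-04', periods=4).tolist() * 2,
    'Adj_close': [10.0, 12.0, 14.0, 16.0, 1.0, 2.0, 3.0, 4.0],
})

# compute each ticker's SMA in one pass; transform keeps the original row order
sc['SMA2'] = (sc.groupby('Tick')['Adj_close']
                .transform(lambda s: s.rolling(2, min_periods=1).mean()))
print(sc)
```

Because groupby visits each ticker's rows once, this tends to scale far better than 4,200 separate boolean masks, though the exact speedup depends on the indicator being computed.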
I have a DataFrame result from the yfinance api.
stock = yf.download('PETR4.SA', start='2019-01-01', end='2020-12-31')
stock.head()
Open High Low Close Adj Close Volume
2019-01-02 22.549999 24.200001 22.280001 24.059999 23.284782 104534800
2019-01-03 23.959999 24.820000 23.799999 24.650000 23.855774 95206400
2019-01-04 24.850000 24.940001 24.469999 24.719999 23.923517 72119800
2019-01-07 24.850000 25.920000 24.700001 25.110001 24.300953 1217119
I would like to compare the High of a given day with the High of the previous day, or of 2 or 3 days before or after. How can I access the previous rows in a function?
What about using pandas.DataFrame.shift ?
>>> stock.shift(periods=1).head() # every row now holds the previous day's values
Open High Low Close Adj Close Volume
2019-01-02 NaN NaN NaN NaN NaN NaN
2019-01-03 22.549999 24.200001 22.280001 24.059999 23.284782 104534800
2019-01-04 23.959999 24.820000 23.799999 24.650000 23.855774 95206400
2019-01-07 24.850000 24.940001 24.469999 24.719999 23.923517 72119800
2019-01-08 24.850000 25.920000 24.700001 25.110001 24.300953 1217119
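With shift, the comparison becomes a plain column expression. A minimal sketch on dummy data (the High column name follows the question; shift(-1) would look one day ahead instead):

```python
import pandas as pd

stock = pd.DataFrame({'High': [24.20, 24.82, 24.94, 25.92]},
                     index=pd.to_datetime(['2019-01-02', '2019-01-03',
                                           '2019-01-04', '2019-01-07']))

# True where today's High exceeds yesterday's (first row compares against NaN)
higher_than_yesterday = stock['High'] > stock['High'].shift(1)
print(higher_than_yesterday.tolist())   # [False, True, True, True]
```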
pandas is indispensable when working with data frames.
import pandas_datareader as pdr
from datetime import datetime
ibm = pdr.get_data_yahoo(symbols='IBM', start=datetime(2000, 1, 1), end=datetime(2012, 1, 1))
print(ibm['Adj Close'])
In order to install the datareader, please follow this link: https://github.com/pydata/pandas-datareader
hopefully a basic one for most.
I have created two datasets using random data one for days of the year and other for energy per day:
import numpy as np
import pandas as pd
np.random.seed(2)
start2018 = pd.Timestamp(2018, 1, 1)
end2018 = pd.Timestamp(2018, 12, 31)
dates2018 = pd.date_range(start2018, end2018, freq='d')
synEne2018 = np.random.normal(loc=66.883795, scale=5.448145, size=365)
syn2018data = pd.DataFrame({'Date': [dates2018], 'Total Daily Energy': [synEne2018]})
syn2018data
When I run this code, I was hoping to get the daily energy for each date on a separate row. However, what I get is a single row similar to the one below:
Date Total Daily Energy
0 DatetimeIndex(['2018-01-01', '2018-01-02', '20... [64.61323781744713, 66.57724516658102, 55.2454...
Can someone suggest an edit to get this to display as described above?
Remove the square brackets around dates2018 and synEne2018. Putting brackets around them wraps each array in a one-element list, so the entire array ends up in a single cell. Leave them as they are and you should be good to go:
syn2018data = pd.DataFrame({'Date': dates2018, 'Total Daily Energy': synEne2018})
Prints:
Date Total Daily Energy
0 2018-01-01 64.613238
1 2018-01-02 66.577245
2 2018-01-03 55.245489
3 2018-01-04 75.820228
4 2018-01-05 57.112898
.. ... ...
360 2018-12-27 73.685533
361 2018-12-28 60.096896
362 2018-12-29 65.973035
363 2018-12-30 63.742335
364 2018-12-31 69.150342
[365 rows x 2 columns]
I'm still a novice with Python and I'm having trouble trying to group some data to find the record with the highest (maximum) date. The dataframe is as follows:
...
I am trying the following:
df_2 = df.max(axis = 0)
df_2 = df.periodo.max()
df_2 = df.loc[df.groupby('periodo').periodo.idxmax()]
And it gives me back:
Timestamp('2020-06-01 00:00:00')
periodo 2020-06-01 00:00:00
valor 3.49136
Although the value for 'periodo' is correct, the one for 'valor' is not: I need the complete corresponding record ('periodo' and 'valor'), not the maximum of each column. I have tried other ways but can't get what I want.
What do I need to do?
Thank you in advance, I will be attentive to your answers!
Regards!
# import packages we need, seed random number generator
import pandas as pd
import datetime
import random
random.seed(1)
Create example dataframe
start_date = datetime.date(2020, 1, 1)   # assumed start for the example
day_count = 10
dates = [start_date + datetime.timedelta(n) for n in range(day_count)]
values = [random.randint(1,1000) for _ in dates]
df = pd.DataFrame(zip(dates,values),columns=['dates','values'])
ie df will be:
dates values
0 2020-01-01 389
1 2020-01-02 808
2 2020-01-03 215
3 2020-01-04 97
4 2020-01-05 500
5 2020-01-06 30
6 2020-01-07 915
7 2020-01-08 856
8 2020-01-09 400
9 2020-01-10 444
Select rows with highest entry in each column
You can do:
df[df['dates'] == df['dates'].max()]
(Or, if you want to use idxmax, you can do: df.loc[[df['dates'].idxmax()]])
Returning:
dates values
9 2020-01-10 444
ie this is the row with the latest date
&
df[df['values'] == df['values'].max()]
(Or, if you want to use idxmax again, you can do: df.loc[[df['values'].idxmax()]] - as in Scott Boston's answer.)
and
dates values
6 2020-01-07 915
ie this is the row with the highest value in the values column.
I think you need something like:
df.loc[[df['valor'].idxmax()]]
Where you use idxmax on the 'valor' column. Then use that index to select that row.
MVCE:
import pandas as pd
import numpy as np
np.random.seed(123)
df = pd.DataFrame({'periodo':pd.date_range('2018-07-01', periods = 600, freq='d'),
'valor':np.random.random(600)+3})
df.loc[[df['valor'].idxmax()]]
Output:
periodo valor
474 2019-10-18 3.998918
I'm learning Python & pandas and practicing with different stock calculations. I've searched for help with this but haven't found a response similar enough, or I didn't understand how to adapt the correct approach from the previous responses.
I have read stock data for a given time frame with datareader into dataframe df. In df I have Date, Volume and Adj Close columns, which I want to use to create a new column "OBV" based on the criteria below. OBV is a cumulative value that adds or subtracts today's volume to or from the previous day's OBV, depending on the adjusted close price.
The calculation of OBV is simple:
If Adj Close is higher today than Adj Close of yesterday then add the Volume of today to the (cumulative) volume of yesterday.
If Adj Close is lower today than Adj Close of yesterday then subtract the Volume of today from the (cumulative) volume of yesterday.
On day 1 the OBV = 0
This is then repeated along the time frame and OBV gets accumulated.
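The three rules above can be sketched directly with diff, sign and a cumulative sum. A minimal illustration on dummy numbers (not the full GOOG download), where the first day and flat days contribute 0:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Adj Close': [10.0, 11.0, 10.5, 10.5, 12.0],
                   'Volume':    [100,  200,  150,  120,  300]})

# +1 when price rose, -1 when it fell, 0 on the first day or a flat day
direction = np.sign(df['Adj Close'].diff()).fillna(0)
df['OBV'] = (direction * df['Volume']).cumsum()
print(df['OBV'].tolist())   # [0.0, 200.0, 50.0, 50.0, 350.0]
```

This avoids apply entirely: shift/diff-style operations work on whole columns, which is why the row-by-row attempts below fail.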
Here's the basic imports and start
import numpy as np
import pandas as pd
import pandas_datareader
import datetime
from pandas_datareader import data, wb
start = datetime.date(2012, 4, 16)
end = datetime.date(2017, 4, 13)
# Reading in Yahoo Finance data with DataReader
df = data.DataReader('GOOG', 'yahoo', start, end)
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
#This is what I cannot get to work, and I've tried two different ways.
#ATTEMPT1
def obv1(column):
if column["Adj Close"] > column["Adj close"].shift(-1):
val = column["Volume"].shift(-1) + column["Volume"]
else:
val = column["Volume"].shift(-1) - column["Volume"]
return val
df["OBV"] = df.apply(obv1, axis=1)
#ATTEMPT 2
def obv1(df):
if df["Adj Close"] > df["Adj close"].shift(-1):
val = df["Volume"].shift(-1) + df["Volume"]
else:
val = df["Volume"].shift(-1) - df["Volume"]
return val
df["OBV"] = df.apply(obv1, axis=1)
Both give me an error.
Consider the dataframe df
np.random.seed([3,1415])
df = pd.DataFrame(dict(
Volume=np.random.randint(100, 200, 10),
AdjClose=np.random.rand(10)
))
print(df)
AdjClose Volume
0 0.951710 111
1 0.346711 198
2 0.289758 174
3 0.662151 190
4 0.171633 115
5 0.018571 155
6 0.182415 113
7 0.332961 111
8 0.150202 113
9 0.810506 126
Multiply the Volume by -1 when the change in AdjClose is non-positive (the first row, with no previous close, counts as positive), then take the cumulative sum:
(df.Volume * (~df.AdjClose.diff().le(0) * 2 - 1)).cumsum()
0 111
1 -87
2 -261
3 -71
4 -186
5 -341
6 -228
7 -117
8 -230
9 -104
dtype: int64
Include this alongside the rest of the df:
df.assign(new=(df.Volume * (~df.AdjClose.diff().le(0) * 2 - 1)).cumsum())
AdjClose Volume new
0 0.951710 111 111
1 0.346711 198 -87
2 0.289758 174 -261
3 0.662151 190 -71
4 0.171633 115 -186
5 0.018571 155 -341
6 0.182415 113 -228
7 0.332961 111 -117
8 0.150202 113 -230
9 0.810506 126 -104
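The expression ~df.AdjClose.diff().le(0) * 2 - 1 just maps the up/down flag to +1/-1. The same sign vector can be written more explicitly with np.where; this sketch rebuilds the dummy frame above and checks that both forms agree:

```python
import numpy as np
import pandas as pd

np.random.seed([3, 1415])
df = pd.DataFrame(dict(Volume=np.random.randint(100, 200, 10),
                       AdjClose=np.random.rand(10)))

change = df.AdjClose.diff()
# +1 for a rise (and for the first row, whose diff is NaN), -1 otherwise
sign = np.where(change.gt(0) | change.isna(), 1, -1)
new = (df.Volume * sign).cumsum()
print(new.tolist())
```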