I want to compute the positive volume index (PVI) on a stock time series (https://www.investopedia.com/terms/p/pvi.asp), and I'd like to do it in one shot (without a for loop).
The input is a classic stock dataframe with OHLC and Volume as columns and datetime as index.
This is what I was thinking:
df['PVI'] = 0
df['Volume'].diff() > 0
prev_c = df['Close'].shift()
prev_pvi = df['PVI'].shift()
df['PVI'] = np.where(df['Volume'].diff() > 0,
                     prev_pvi + (df['Close'] - prev_c / prev_c * df['PVI'].shift()),
                     df['PVI'].shift())
But I get a ValueError: cannot set a row with mismatched columns.
I can easily split it into a for loop with an if condition on volume, but I wanted to write more idiomatic pandas code.
In order to have a working sample just:
import yfinance as yf
df = yf.download(tickers='AAPL', start='2019-01-01', end='2021-01-01', interval='1d')
Thanks for any help/suggestion
ADDITION:
the for loop approach could be (given a df.reset_index()):
df['pvi'] = 0.
for index, row in df.iterrows():
    if index > 0:
        prev_pvi = df.at[index-1, 'pvi']
        prev_close = df.at[index-1, 'Close']
        if row['Volume'] > df.at[index-1, 'Volume']:
            pvi = prev_pvi + (row['Close'] - prev_close) / prev_close * prev_pvi
        else:
            pvi = prev_pvi
    else:
        pvi = 1000  # dummy value
    df.at[index, 'pvi'] = pvi  # set_value was removed in pandas 1.0
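Since each PVI step multiplies the previous value by (1 + percent change of close) on volume-up days and carries it forward unchanged otherwise, the whole series is a cumulative product, which pandas can compute without a loop. A sketch (the function name and the 1000 starting value are assumptions taken from the loop above):

```python
import numpy as np
import pandas as pd

def pvi_vectorized(close, volume, start=1000.0):
    # On volume-up days PVI grows by the close's percent change;
    # otherwise the factor is 1 and PVI carries forward unchanged.
    factor = np.where(volume.diff() > 0, 1 + close.pct_change(), 1.0)
    return start * pd.Series(factor, index=close.index).cumprod()
```

The first factor is 1 because diff() is NaN on the first row (NaN > 0 is False), so the series starts at the given start value.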
In my data, I have this column "price_range".
Dummy dataset:
df = pd.DataFrame({'price_range': ['€4 - €25', '€3 - €14', '€25 - €114', '€112 - €146', 'No pricing available']})
I am using pandas. What is the most efficient way to get the upper and lower bound of the price range in separate columns?
Alternatively, you can parse the string accordingly (if you want the limits for each row rather than the total range):
df = pd.DataFrame({'price_range': ['€4 - €25', '€3 - €14', '€25 - €114', '€112 - €146']})
def get_lower_limit(some_string):
    a = some_string.split(' - ')
    return int(a[0].split('€')[-1])

def get_upper_limit(some_string):
    a = some_string.split(' - ')
    return int(a[1].split('€')[-1])
df['lower_limit'] = df.price_range.apply(get_lower_limit)
df['upper_limit'] = df.price_range.apply(get_upper_limit)
Output:
price_range lower_limit upper_limit
0 €4 - €25 4 25
1 €3 - €14 3 14
2 €25 - €114 25 114
3 €112 - €146 112 146
You can do the following. First create two extra columns lower and upper which contain the lower bound and the upper bound from each row. Then find the minimum from the lower column and maximum from the upper column.
df = pd.DataFrame({'price_range': ['€4 - €25', '€3 - €14', '€25 - €114', '€112 - €146', 'No pricing available']})
df.loc[df.price_range != 'No pricing available', 'lower'] = df['price_range'].str.split('-').str[0]
df.loc[df.price_range != 'No pricing available', 'upper'] = df['price_range'].str.split('-').str[1]
df['lower'] = df.lower.str.replace('€', '').astype(float)
df['upper'] = df.upper.str.replace('€', '').astype(float)
price_range = [df.lower.min(), df.upper.max()]
Output:
>>> price_range
[3.0, 146.0]
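A more compact alternative is to pull both bounds out with a single regex via str.extract; rows that don't match the pattern, such as 'No pricing available', come back as NaN automatically. A sketch, assuming the bounds are always integers written as €N - €N:

```python
import pandas as pd

df = pd.DataFrame({'price_range': ['€4 - €25', '€3 - €14', '€25 - €114',
                                   '€112 - €146', 'No pricing available']})
# One capture group per bound; non-matching rows yield NaN in both columns.
bounds = df['price_range'].str.extract(r'€(\d+)\s*-\s*€(\d+)').astype(float)
df['lower'] = bounds[0]
df['upper'] = bounds[1]
price_range = [df['lower'].min(), df['upper'].max()]
```

min() and max() skip NaN by default, so the unpriced row is ignored without any explicit filtering.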
I have stock price data like this:
ticker date close volume type target_date
0 NVDA 1999-01-22 1.6086 18469934 STOCK
1 NVDA 1999-01-25 1.6270 3477722 STOCK
2 NVDA 1999-01-26 1.6822 2342848 STOCK
3 NVDA 1999-01-27 1.5439 1678315 STOCK
4 NVDA 1999-01-28 1.5349 1554613 STOCK
I need to fill the 'target_date' column with the first date on which the close price is greater than or equal to three times the current row's close.
I tried that:
df['target_date'] = df[df.close >= df.close * 3].drop_duplicates('ticker')['date']
But got NaT values in the whole column
Upd.1
I write that
target_date = []
for i in df.itertuples():
    close = i.close
    date = i.date
    f1 = df.date > date
    f2 = df.close > close
    f = f1 & f2
    result = df[f].drop_duplicates('ticker')['date']
    target_date.append(result.iloc[0])
and got "IndexError: single positional indexer is out-of-bounds"
UPD2
I think I did it
target_date = []
for i in df.itertuples():
    close = i.close
    date = i.date
    f1 = df.date > date
    f2 = df.close > close
    f = f1 & f2
    result = df[f].drop_duplicates('ticker')['date']
    try:
        target_date.append(result.iloc[0])
    except IndexError:
        target_date.append(pd.NaT)
df['target_date'] = target_date
But is there a more elegant way to do it?
Assuming the df dataframe contains your data and a target_date column, this code should do the trick:
for i, row in df.iterrows():
    rest = df.iloc[i+1:]  # the rest of the rows (the following ones)
    x = rest[rest.close >= 3*row.close]
    df.loc[i, 'target_date'] = np.nan if len(x) == 0 else x.iloc[0].date
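For a few thousand rows, the same quadratic comparison can also be expressed as one NumPy broadcast instead of a row-by-row loop. A sketch, assuming a single ticker per dataframe and a datetime64 'date' column (the sample values are mine, for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(['2020-01-01', '2020-01-02',
                                           '2020-01-03', '2020-01-04']),
                   'close': [1.0, 2.0, 3.5, 1.0]})
close = df['close'].to_numpy()
dates = df['date'].to_numpy()
# hit[i, j] is True when a strictly later row j reaches triple row i's close.
hit = close[None, :] >= 3 * close[:, None]
hit &= np.triu(np.ones(hit.shape, dtype=bool), k=1)  # keep only future rows
first = hit.argmax(axis=1)  # column index of the first True per row (0 if none)
df['target_date'] = pd.Series(dates[first], index=df.index).where(hit.any(axis=1))
```

Note the n-by-n boolean matrix makes this memory-hungry; for very long series the loop (or a per-ticker groupby) is the safer choice.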
I'm looking to optimize the time taken by a function with a for loop. The code below is fine for smaller dataframes, but for larger ones it takes too long. The function creates a new column from calculations using other column values and parameters; the calculation also uses the previous row's value of one of the columns. I've read that the most efficient approach is pandas vectorization, but I'm struggling to see how to apply it when the loop needs the previous row's value of one column to populate the new column on the current row. I'm a complete novice and have looked around but can't find anything that fits this specific problem, though I'm searching from a position of relative ignorance, so I may have missed something.
The function is below and I've created a test dataframe and random parameters too. it would be great if someone could point me in the right direction to get the processing time down. Thanks in advance.
def MODE_Gain(Data, rated, MODELim1, MODEin, Normalin, NormalLim600, NormalLim1):
    print('Calculating Gains')
    df = Data
    df.fillna(0, inplace=True)
    df['MODE'] = ""
    df['Nominal'] = ""
    df.iloc[0, df.columns.get_loc('MODE')] = 0
    for i in range(1, len(df.index)):
        print('Computing Status {i}/{r}'.format(i=i, r=len(df.index)))
        if (df['MODE'].loc[i-1] == 1) & (df['A'].loc[i] > Normalin):
            df['MODE'].loc[i] = 1
        elif ((df['MODE'].loc[i-1] == 0) & (df['A'].loc[i] > NormalLim600)) | ((df['B'].loc[i] > NormalLim1) & (df['B'].loc[i] < MODELim1)):
            df['MODE'].loc[i] = 1
        else:
            df['MODE'].loc[i] = 0
    df[''] = df['C'] / 6
    for i in range(len(df.index)):
        print('Computing MODE Gains {i}/{r}'.format(i=i, r=len(df.index)))
        if (df['A'].loc[i] > MODEin) & (df['A'].loc[i] < NormalLim600) & (df['B'].loc[i] < NormalLim1):
            df['Nominal'].loc[i] = rated/6
        else:
            df['Nominal'].loc[i] = 0
    df["Upgrade"] = df[""] - df["Nominal"]
    return df
A = np.random.randint(0,28,size=(8000))
B = np.random.randint(0,45,size=(8000))
C = np.random.randint(0,2300,size=(8000))
df = pd.DataFrame()
df['A'] = pd.Series(A)
df['B'] = pd.Series(B)
df['C'] = pd.Series(C)
MODELim600 = 32
MODELim30 = 28
MODELim1 = 39
MODEin = 23
Normalin = 20
NormalLim600 = 25
NormalLim1 = 32
rated = 2150
finaldf = MODE_Gain(df, rated, MODELim1, MODEin, Normalin,NormalLim600,NormalLim1)
Your second loop doesn't evaluate the prior row, so you should be able to use this instead
df['Nominal'] = 0
df.loc[(df['A'] > MODEin) & (df['A'] < NormalLim600) & (df['B'] < NormalLim1), 'Nominal'] = rated/6
For your first loop, the elif also sets MODE to 1 whenever (df['B'] > NormalLim1) & (df['B'] < MODELim1) holds, regardless of the other condition, so you can pull that case out and vectorize it. I didn't try it, but this should do it:
df.loc[(df['B'] > NormalLim1) & (df['B'] < MODELim1), 'MODE'] = 1
then you may be able to collapse the other conditions into one statement using |
Not sure how much all that will save you, but you should cut the time in half getting rid of the 2nd loop.
For vectorizing it, I suggest you first shift your column into another one:
df['MODE_1'] = df['MODE'].shift(1)
and then use :
(df['MODE_1'].loc[i] == 1)
After that you should be able to vectorize
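Strictly speaking, the MODE recurrence depends on its own previous value, so it cannot be written as a single elementwise expression. However, precomputing the three conditions as NumPy arrays and running a plain Python loop over them (instead of chained pandas .loc indexing per row) already removes most of the overhead. A sketch (the function name is mine; the logic is copied from the first loop in the question):

```python
import numpy as np

def compute_mode(A, B, Normalin, NormalLim600, NormalLim1, MODELim1):
    A = np.asarray(A)
    B = np.asarray(B)
    hold = A > Normalin                         # keeps MODE at 1 once it is on
    set_from_0 = A > NormalLim600               # turns MODE on when it was 0
    b_cond = (B > NormalLim1) & (B < MODELim1)  # turns MODE on regardless
    mode = np.zeros(len(A), dtype=np.int64)
    for i in range(1, len(A)):
        if (mode[i-1] == 1 and hold[i]) or (mode[i-1] == 0 and set_from_0[i]) or b_cond[i]:
            mode[i] = 1
    return mode
```

The conditions are vectorized up front; only the one-state recurrence stays in the loop, which is cheap on plain arrays.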
I have sales data till Jul-2020 and want to predict the next 3 months using a recovery rate.
This is the dataframe:
test = pd.DataFrame({'Country': ['USA', 'USA', 'USA', 'USA', 'USA'],
                     'Month': [6, 7, 8, 9, 10],
                     'Sales': [100, 200, 0, 0, 0],
                     'Recovery': [0, 1, 1.5, 2.5, 3]})
Now, I want to add a "Predicted" column to this dataframe. The first predicted value, 300 at row 3, is basically (200 * 1.5/1). This becomes our base value going ahead, so the next value, 500, is (300 * 2.5/1.5), and so on.
How do I iterate over every row, starting from row 3 onwards? I tried using shift() but couldn't iterate over the rows.
You could do it like this:
import pandas as pd

test = pd.DataFrame({'Country': ['USA', 'USA', 'USA', 'USA', 'USA'],
                     'Month': [6, 7, 8, 9, 10],
                     'Sales': [100, 200, 0, 0, 0],
                     'Recovery': [0, 1, 1.5, 2.5, 3]})
test['Prediction'] = test['Sales']
for i in range(1, len(test)):
    # prevent division by zero
    if test.loc[i-1, 'Recovery'] != 0:
        test.loc[i, 'Prediction'] = test.loc[i-1, 'Prediction'] * test.loc[i, 'Recovery'] / test.loc[i-1, 'Recovery']
The sequence you want is just Recovery times the base level (the last non-zero Sales, 200), because the base month's Recovery is 1 and the ratios telescope.
You can compute that sequence like this:
valid_sales = test.Sales > 0
prediction = (test.Recovery * test.Sales[valid_sales].iloc[-1]).rename("Predicted")
And then combine by index, insert column or concat:
pd.concat([test, prediction], axis=1)
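If the base month's Recovery were not exactly 1, the same telescoping still applies: the loop's product of ratios collapses to base sales * Recovery_i / Recovery_base. A sketch under that assumption (column and variable names are mine):

```python
import pandas as pd

test = pd.DataFrame({'Month': [6, 7, 8, 9, 10],
                     'Sales': [100, 200, 0, 0, 0],
                     'Recovery': [0, 1, 1.5, 2.5, 3]})
base = test.index[test['Sales'] > 0][-1]  # last row with actual sales
after = test.index > base
pred = test['Sales'].astype(float).copy()
# Predicted_i = base sales * Recovery_i / Recovery_base (the loop telescopes)
pred[after] = test.loc[base, 'Sales'] * test.loc[after, 'Recovery'] / test.loc[base, 'Recovery']
test['Predicted'] = pred
```

This keeps the actual sales for historical months and fills only the future rows, with no per-row iteration.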
I am doing some styling of pandas columns where I want to highlight values green or red when they are more than 2*std above or below the corresponding column's mean, but when I loop over to the next column, the previous work is essentially deleted and only the last column shows any changes.
Function:
def color_outliers(value):
    if value <= (mean - (2*std)):
        # print(mean)
        # print(std)
        color = 'red'
    elif value >= (mean + (2*std)):
        # print(mean)
        # print(std)
        color = 'green'
    else:
        color = 'black'
    return 'color: %s' % color
Code:
comp_holder = []
titles = []
i = 0
for value in names:
    titles.append(names[i])
    i += 1

# Number of Articles and Days of search
num_days = len(page_list[0]['items']) - 2
num_arts = len(titles)
arts = 0
days = 0
# print(num_days)
# print(num_arts)

# Sets index of dataframe to be timestamps of articles
for days in range(num_days):
    comp_dict = {}
    comp_dict = {'timestamp(YYYYMMDD)': int(int(page_list[0]['items'][days]['timestamp'])/100)}
    # Adds each article from current day in loop to dictionary for row append
    for arts in range(num_arts):
        comp_dict[titles[arts]] = page_list[arts]['items'][days]['views']
    comp_holder.append(comp_dict)
comp_df = pd.DataFrame(comp_holder)

arts = 0
days = 0
outliers = comp_df
for arts in range(num_arts):
    mean = comp_df[titles[arts]].mean()
    std = comp_df[titles[arts]].std()
    outliers = comp_df.style.applymap(color_outliers, subset=[titles[arts]])
Each time I go through this for loop, the 'outliers' styling data frame resets itself and only works on the current subset, but if I remove the subset, it uses one mean and std for the entire data frame. I have tried style.apply with axis=0 but I can't get it to work.
My data frame consists of 21 columns, the first being the timestamp and the next twenty being columns of ints based upon input files. I also have two lists indexed from 0 to 19 of means and std of each column.
I would apply on the whole column instead of using applymap. I'm not sure I can follow your code since I don't know what your data looks like, but this is what I would do:
# sample data
np.random.seed(1)
df = pd.DataFrame(np.random.randint(1, 100, [10, 3]))

# compute the statistics
stats = df.agg(['mean', 'std'])

# format function on columns
def color_outlier(col, thresh=2):
    # extract mean and std of the column
    mean, std = stats[col.name]
    return np.select((col <= mean - std*thresh, col >= mean + std*thresh),
                     ('color: red', 'color: green'),
                     'color: black')

# thresh changed for demonstration, remove when used
df.style.apply(color_outlier, thresh=0.5)