multidimensional computation using pandas dataframe

multidimensional computation using pandas dataframe - python

We are currently using 3 different dataframes to store product,performance and assortments data.
The foreign key relationship is maintained between all the dimensions.
I need to update the cost column in the performance by doing the below math operation
performance['cost'] = performance['coulmn1']+sin(product['column3'])+2*Assortment['column2']
I need this operation to be performed for each row of the performance dataframe.
Please suggest any approach to make the calculations faster.
The performance dataframe consists of 1 million records.
Can we use any other approach rather than dataframe??

One approach could be to pre-calculate the sin function for the entire column. I used 1million records for it and it was pretty fast for me.
import pandas as pd
import math
import numpy as np
product = pd.DataFrame(columns=['col3', 'sincol3'])
product['col3'] = np.random.randn(1000000)
s = pd.Series(product['col3'])
product['sincol3'] = np.sin(s)
performance = pd.DataFrame(columns=['col1', 'cost'])
performance['col1'] = np.random.randn(1000000)
assortments = pd.DataFrame(columns=['col2'])
assortments['col2'] = np.random.randn(1000000)
performance['cost'] = performance['col1']+ product['sincol3'] + 2*assortments['col2']
print(performance)
It gives you output like:
col1 cost
0 0.194011 -1.940931
1 0.535375 1.891468
Edit after comments:
You have to understand that the expression on its own do not take a lot of time to calculate. If the expression is the only thing you are doing at run time (given your data frames already have values). Lets compare an example with 50 runs.
Example:
import pandas as pd
import math
import numpy as np
import time
def cal():
performance['cost'] = performance['col1']+ product['sincol3'] + 2*assortments['col2']
execution_time = []
product = pd.DataFrame(columns=['col3', 'sincol3'])
product['col3'] = np.random.randn(1000000)
s = pd.Series(product['col3'])
product['sincol3'] = np.sin(s)
performance = pd.DataFrame(columns=['col1', 'cost'])
performance['col1'] = np.random.randn(1000000)
assortments = pd.DataFrame(columns=['col2'])
assortments['col2'] = np.random.randn(1000000)
for i in range(0,50):
start_time = time.time()
cal()
execution_time.append(time.time() - start_time)
print('average time taken:\n')
print(np.mean(execution_time))
Gives me: average time taken: 0.15080997943878174
At the same time:
import pandas as pd
import math
import numpy as np
import time
def cal():
execution_time = []
product = pd.DataFrame(columns=['col3', 'sincol3'])
product['col3'] = np.random.randn(1000000)
s = pd.Series(product['col3'])
product['sincol3'] = np.sin(s)
performance = pd.DataFrame(columns=['col1', 'cost'])
performance['col1'] = np.random.randn(1000000)
assortments = pd.DataFrame(columns=['col2'])
assortments['col2'] = np.random.randn(1000000)
performance['cost'] = performance['col1']+ product['sincol3'] + 2*assortments['col2']
for i in range(0,50):
start_time = time.time()
cal()
execution_time.append(time.time() - start_time)
print('average time taken:\n')
print(np.mean(execution_time))
Gives me average time taken: 0.5624121456611447

Related

Reduce runtime by simplifying a piece of code which I am using at many places in my solution?

I am trying to improve my code performance. I am using the following piece of code at multiple locations -
import pandas as pd
import random
import numpy as np
x = pd.DataFrame({'x':['id_1']*10,
'lag_01':[random.randrange(1, 50, 1) for i in range(10)],
'lag_02':[random.randrange(1, 50, 1) for i in range(10)]
})
x['p'] = [np.nan]*10
LAG_COLS = ['lag_01', 'lag_02']
## Part I want to optimize
for row_num, idx in x.iterrows():
p = random.randint(70, 80)
x.at[row_num, 'p'] = p
i = row_num+1
for lag in LAG_COLS:
if i in x.index:
x.at[i, lag] = p
i = i + 1
The code updates the value of the lag columns for subsequent records with the value of 'p' which is calculated for each row. For example, for the following scenario -
The output would look like this -
step-1
step-2
How can I optimize this code?

heatmap of values grouped by time - seaborn

I'm plotting the counts of a variable grouped by time as a heatmap. However, when including both hour and minute, the counts are quite low so the resulting heatmap doesn't really provide any real insight. Is it possible to group the counts in a bigger block of time? I'm hoping to test some different periods (5, 10 mins).
I'm also hoping to plot time on the x-axis. Similar to the output attached.
import seaborn as sns
import pandas as pd
from datetime import datetime
from datetime import timedelta
start = datetime(1900,1,1,10,0,0)
end = datetime(1900,1,1,13,0,0)
seconds = (end - start).total_seconds()
step = timedelta(minutes = 1)
array = []
for i in range(0, int(seconds), int(step.total_seconds())):
array.append(start + timedelta(seconds=i))
array = [i.strftime('%Y-%m-%d %H:%M%:%S') for i in array]
df2 = pd.DataFrame(array).rename(columns = {0:'Time'})
df2['Count'] = np.random.uniform(0.0, 0.5, size = len(df2))
df2['Count'] = df2['Count'].round(1)
df2['Time'] = pd.to_datetime(df2['Time'])
df2['Hour'] = df2['Time'].dt.hour
df2['Min'] = df2['Time'].dt.minute
g = df2.groupby(['Hour','Min','Count'])
count_df = g['Count'].nunique().unstack()
count_df.fillna(0, inplace = True)
sns.heatmap(count_df)

To deal with such cases, I think it would be easy to use data downsampling. It is also easy to change the thresholds. The axis labels in the output graph will need to be modified, but we recommend this method.
import seaborn as sns
import pandas as pd
import numpy as np
from datetime import datetime
from datetime import timedelta
start = datetime(1900,1,1,10,0,0)
end = datetime(1900,1,1,13,0,0)
seconds = (end - start).total_seconds()
step = timedelta(minutes = 1)
array = []
for i in range(0, int(seconds), int(step.total_seconds())):
array.append(start + timedelta(seconds=i))
array = [i.strftime('%Y-%m-%d %H:%M:%S') for i in array]
df2 = pd.DataFrame(array).rename(columns = {0:'Time'})
df2['Count'] = np.random.uniform(0.0, 0.5, size = len(df2))
df2['Count'] = df2['Count'].round(1)
df2['Time'] = pd.to_datetime(df2['Time'])
df2['Hour'] = df2['Time'].dt.hour
df2['Min'] = df2['Time'].dt.minute
df2.set_index('Time', inplace=True)
count_df = df2.resample('10min')['Count'].value_counts().unstack()
count_df.fillna(0, inplace = True)
sns.heatmap(count_df.T)

The way you could achieve this is by creating a column with numbers that have repeating elements for the number of minutes.
For example:
minutes = 3
x = [0,1,2]
np.repeat(x, repeats=minutes, axis=0)
>>>> [0,0,0,1,1,1,2,2,2]
and then group your data using this column.
So your code would look like:
...
minutes = 5
x = [i for i in range(int(df2.shape[0]/5))]
df2['group'] = np.repeat(x, repeats=minutes, axis=0)
g = df2.groupby(['Min', 'Count'])
count_df = g['Count'].nunique().unstack()
count_df.fillna(0, inplace = True)

Dataframe with Monte Carlo Simulation calculation next row Problem

I want to build up a Dataframe from scratch with calculations based on the Value before named Barrier option. I know that i can use a Monte Carlo simulation to solve it but it just wont work the way i want it to.
The formula is:
Value in row before * np.exp((r-sigma**2/2)*T/TradingDays+sigma*np.sqrt(T/TradingDays)*z)
The first code I write just calculates the first column. I know that I need a second loop but can't really manage it.
The result should be, that for each simulation it will calculate a new value using the the value before, for 500 Day meaning S_1 should be S_500 with a total of 1000 simulations. (I need to generate new columns based on the value before using the formular.)
similar to this:
So for the 1. Simulations 500 days, 2. Simulation 500 day and so on...
import numpy as np
import pandas as pd
from scipy.stats import norm
import random as rd
import math
simulation = 0
S_0 = 42
T = 2
r = 0.02
sigma = 0.20
TradingDays = 500
df = pd.DataFrame()
for i in range (0,TradingDays):
z = norm.ppf(rd.random())
simulation = simulation + 1
S_1 = S_0*np.exp((r-sigma**2/2)*T/TradingDays+sigma*np.sqrt(T/TradingDays)*z)
df = df.append ({
'S_1':S_1,
'S_0':S_0
}, ignore_index=True)
df = df.round ({'Z':6,
'S_T':2
})
df.index += 1
df.index.name = 'Simulation'
print(df)
I found another possible code which i found here and it does solve the problem but just for one row, the next row is just the same calculation. Generate a Dataframe that follow a mathematical function for each column / row
If i just replace it with my formular i get the same problem.
replacing:
exp(r - q * sqrt(sigma))*T+ (np.random.randn(nrows) * sqrt(deltaT)))
with:
exp((r-sigma**2/2)*T/nrows+sigma*np.sqrt(T/nrows)*z))
import numpy as np
import pandas as pd
from scipy.stats import norm
import random as rd
import math
S_0 = 42
T = 2
r = 0.02
sigma = 0.20
TradingDays = 50
Simulation = 100
df = pd.DataFrame({'s0': [S_0] * Simulation})
for i in range(1, TradingDays):
z = norm.ppf(rd.random())
df[f's{i}'] = df.iloc[:, -1] * np.exp((r-sigma**2/2)*T/TradingDays+sigma*np.sqrt(T/TradingDays)*z)
print(df)
I would work more likely with the last code and solve the problem with it.

How about just overwriting the value of S_0 by the new value of S_1 while you loop and keeping all simulations in a list?
Like this:
import numpy as np
import pandas as pd
import random
from scipy.stats import norm
S_0 = 42
T = 2
r = 0.02
sigma = 0.20
trading_days = 50
output = []
for i in range(trading_days):
z = norm.ppf(random.random())
value = S_0*np.exp((r - sigma**2 / 2) * T / trading_days + sigma * np.sqrt(T/trading_days) * z)
output.append(value)
S_0 = value
df = pd.DataFrame({'simulation': output})
Perhaps I'm missing something, but I don't see the need for a second loop.
Also, this eliminates calling df.append() in a loop, which should be avoided. (See here)

Solution based on the the answer of bartaelterman, thank you very much!
import numpy as np
import pandas as pd
from scipy.stats import norm
import random as rd
import math
#Dividing the list in chunks to later append it to the dataframe in the right order
def chunk_list(lst, chunk_size):
for i in range(0, len(lst), chunk_size):
yield lst[i:i + chunk_size]
def blackscholes():
d1 = ((math.log(S_0/K)+(r+sigma**2/2)*T)/(sigma*np.sqrt(2)))
d2 = ((math.log(S_0/K)+(r-sigma**2/2)*T)/(sigma*np.sqrt(2)))
preis_call_option = S_0*norm.cdf(d1)-K*np.exp(-r*T)*norm.cdf(d2)
return preis_call_option
K = 40
S_0 = 42
T = 2
r = 0.02
sigma = 0.2
U = 38
simulation = 10000
trading_days = 500
trading_days = trading_days -1
#creating 2 lists for the first and second loop
loop_simulation = []
loop_trading_days = []
#first loop calculates the first column in a list
for j in range (0,simulation):
print("Progressbar_1_2 {:2.2%}".format(j / simulation), end="\n\r")
S_Tag_new = 0
NORM_S_INV = norm.ppf(rd.random())
S_Tag = S_0*np.exp((r-sigma**2/2)*T/trading_days+sigma*np.sqrt(T/trading_days)*NORM_S_INV)
S_Tag_new = S_Tag
loop_simulation.append(S_Tag)
#second loop calculates the the rows for the columns in a list
for i in range (0,trading_days):
NORM_S_INV = norm.ppf(rd.random())
S_Tag = S_Tag_new*np.exp((r-sigma**2/2)*T/trading_days+sigma*np.sqrt(T/trading_days)*NORM_S_INV)
loop_trading_days.append(S_Tag)
S_Tag_new = S_Tag
#values from the second loop will be divided in number of Trading days per Simulation
loop_trading_days_chunked = list(chunk_list(loop_trading_days,trading_days))
#First dataframe with just the first results from the firstloop for each simulation
df1 = pd.DataFrame({'S_Tag 1': loop_simulation})
#Appending the the chunked list from the second loop to a second dataframe
df2 = pd.DataFrame(loop_trading_days_chunked)
#Merging both dataframe into one
df3 = pd.concat([df1, df2], axis=1)

Getting a python error "cannot reindex from a duplicate index" and a warning too "setting with copy warning" in the below code

import os
import sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import plot_roc_curve
from sklearn.metrics import accuracy_score, classification_report
#kindly download the data FIRST which is required and then update the path accordingly for the variables you have to give the path
# variable1 = pd.read_csv(r"give the path to the data")
variable1 = pd.read_csv(r"C:/Users/hp/Desktop/NIFTY/TCS.csv")
variable2 = pd.read_csv(r"C:/Users/hp/Desktop/NIFTY/WIPRO.csv")
variable3 = pd.read_csv(r"C:/Users/hp/Desktop/NIFTY/HDFC.csv")
variable4 = pd.read_csv(r"C:/Users/hp/Desktop/NIFTY/ITC.csv")
frames = [variable1,variable2,variable3,variable4]
all = pd.concat(frames)
print(all)
price_data = all [['Symbol','Date','Close','High','Low','Open','Volume']]
First, for average investors, the return of an asset is a complete and scale–free summary of the investment opportunity. Second, return series are easier to handle than prices series as they have more attractive statistical properties
# sort the values by symbol and then date
price_data.sort_values(by = ['Symbol','Date'], inplace = True)
# calculate the change in price
price_data['change_in_price'] = price_data['Close'].diff()
# identify rows where the symbol changes
mask = price_data['Symbol'] != price_data['Symbol'].shift(1)
# For those rows, let's make the value null
price_data['change_in_price'] = np.where(mask == True, np.nan, price_data['change_in_price'])
# print the rows that have a null value, should only be 5
price_data[price_data.isna().any(axis = 1)]
days_out = 30
# Group by symbol, then apply the rolling function and grab the Min and Max.
price_data_smoothed = price_data.groupby(['Symbol'])
[['Close','Low','High','Open','Volume']].transform(lambda x: x.ewm(span = days_out).mean())
# Join the smoothed columns with the symbol and datetime column from the old data frame.
smoothed_df = pd.concat([price_data[['Symbol','Date']], price_data_smoothed], axis=1, sort=False)
smoothed_df
days_out = 30
# create a new column that will house the flag, and for each group calculate the diff compared to 30 days ago. Then use Numpy to define the sign.
smoothed_df['Signal_Flag'] = smoothed_df.groupby('Symbol')['Close'].transform(lambda x :
np.sign(x.diff(days_out)))
# print the first 50 rows
smoothed_df.head(50)
up to here it is working but when i execute the below code then it throws an error cannot reindex from a duplicate axis
n = 14
# First make a copy of the data frame twice
up_df, down_df = price_data[['Symbol','change_in_price']].copy(),
price_data[['Symbol','change_in_price']].copy()
# For up days, if the change is less than 0 set to 0.
up_df.loc['change_in_price'] = up_df.loc[(up_df['change_in_price'] < 0), 'change_in_price'] = 0
# For down days, if the change is greater than 0 set to 0.
down_df.loc['change_in_price'] = down_df.loc[(down_df['change_in_price'] > 0), 'change_in_price']
= 0
# We need change in price to be absolute.
down_df['change_in_price'] = down_df['change_in_price'].abs()
# Calculate the EWMA (Exponential Weighted Moving Average), meaning older values are given less weight compared to newer values.
ewma_up = up_df.groupby('Symbol')['change_in_price'].transform(lambda x: x.ewm(span = n).mean())
ewma_down = down_df.groupby('Symbol')['change_in_price'].transform(lambda x: x.ewm(span =
n).mean())
# Calculate the Relative Strength
relative_strength = ewma_up / ewma_down
# Calculate the Relative Strength Index
relative_strength_index = 100.0 - (100.0 / (1.0 + relative_strength))
# Add the info to the data frame.
price_data['down_days'] = down_df['change_in_price']
price_data['up_days'] = up_df['change_in_price']
price_data['RSI'] = relative_strength_index
# Display the head.
price_data.head(30)

Pandas rolling standard deviation

Is anyone else having trouble with the new rolling.std() in pandas? The deprecated method was rolling_std(). The new method runs fine but produces a constant number that does not roll with the time series.
Sample code is below. If you trade stocks, you may recognize the formula for Bollinger bands. The output I get from rolling.std() tracks the stock day by day and is obviously not rolling.
This in in pandas 0.19.1. Any help would be appreciated.
import datetime
import pandas as pd
import pandas_datareader.data as web
start = datetime.datetime(2012,1,1)
end = datetime.datetime(2012,12,31)
g = web.DataReader(['AAPL'], 'yahoo', start, end)
stocks = g['Close']
stocks['Date'] = pd.to_datetime(stocks.index)
stocks['AAPL_LO'] = stocks['AAPL'] - stocks['AAPL'].rolling(20).std() * 2
stocks['AAPL_HI'] = stocks['AAPL'] + stocks['AAPL'].rolling(20).std() * 2
stocks.dropna(axis=0, how='any', inplace=True)

import pandas as pd
from pandas_datareader import data as pdr
import numpy as np
import datetime
end = datetime.date.today()
begin=end-pd.DateOffset(365*10)
st=begin.strftime('%Y-%m-%d')
ed=end.strftime('%Y-%m-%d')
data = pdr.get_data_yahoo("AAPL",st,ed)
def bollinger_strat(data, window, no_of_std):
rolling_mean = data['Close'].rolling(window).mean()
rolling_std = data['Close'].rolling(window).std()
df['Bollinger High'] = rolling_mean + (rolling_std * no_of_std)
df['Bollinger Low'] = rolling_mean - (rolling_std * no_of_std)
bollinger_strat(data,20,2)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

multidimensional computation using pandas dataframe - python

Related

Reduce runtime by simplifying a piece of code which I am using at many places in my solution?

heatmap of values grouped by time - seaborn

Dataframe with Monte Carlo Simulation calculation next row Problem

Getting a python error "cannot reindex from a duplicate index" and a warning too "setting with copy warning" in the below code

Pandas rolling standard deviation

Categories

Resources