Python's `.loc` is really slow on selecting subsets of Data

I have a large DataFrame df with a (y, t) MultiIndex and a single value column. Currently, I select a subset via df.loc[(Y,T), :] and create a dictionary from it. The following MWE works, but the selection is very slow for large subsets.
import numpy as np
import pandas as pd
# Full DataFrame
y_max = 50
Y_max = range(1, y_max+1)
t_max = 100
T_max = range(1, t_max+1)
idx_max = tuple((y,t) for y in Y_max for t in T_max)
df = pd.DataFrame(np.random.sample(y_max*t_max), index=idx_max, columns=['Value'])
# Create Dictionary of Subset of Data
y1 = 4
yN = 10
Y = range(y1, yN+1)
t1 = 5
tN = 9
T = range(t1, tN+1)
idx_sub = tuple((y,t) for y in Y for t in T)
data_sub = df.loc[(Y,T), :] #This is really slow
dict_sub = dict(zip(idx_sub, data_sub['Value']))
# result, e.g. (y,t) = (5,7)
dict_sub[5,7] == df.loc[(5,7), 'Value']
I was thinking of using df.loc[(y1,t1):(yN,tN), :], but that does not work properly, because the second index level is only bounded in the final year yN.
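To illustrate why a tuple slice is not enough here, the following minimal sketch rebuilds the example frame and inspects what such a slice returns. It assumes the index is built with pd.MultiIndex.from_product (not used in the original MWE) so that it is guaranteed to be sorted; the variable name sliced is made up for this illustration. The slice is lexicographic, so intermediate years keep every t.

```python
import numpy as np
import pandas as pd

# Rebuild the example frame; from_product yields a sorted (y, t) MultiIndex.
# (Assumption: the original MWE builds the index from a tuple of tuples.)
index = pd.MultiIndex.from_product(
    [range(1, 51), range(1, 101)], names=["y", "t"]
)
df = pd.DataFrame({"Value": np.random.sample(len(index))}, index=index)

y1, yN, t1, tN = 4, 10, 5, 9

# Lexicographic slice: every row between the two corner tuples, inclusive.
sliced = df.loc[(y1, t1):(yN, tN), :]

# t is bounded below only in the first year and above only in the last year;
# the years in between keep all of their t values.
print(sliced.loc[y1].index.min())  # smallest t kept in year y1
print(len(sliced.loc[5]))          # number of t values kept in year 5
print(sliced.loc[yN].index.max())  # largest t kept in year yN
```

The wanted subset has 7 * 5 rows, so comparing that with len(sliced) shows how many extra rows the slice lets through.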

One idea is to use Index.isin with itertools.product for boolean indexing:
from itertools import product
idx_sub = tuple(product(Y, T))
dict_sub = df.loc[df.index.isin(idx_sub),'Value'].to_dict()
print (dict_sub)
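As a possible alternative, and only as a sketch: when Y and T are contiguous ranges and the MultiIndex is sorted, pd.IndexSlice can bound both levels at once. The frame is rebuilt with pd.MultiIndex.from_product here (an assumption, to guarantee a sorted index), and the names dict_isin and dict_slice are made up for the comparison. Whether this is faster than the isin mask on your data would need to be measured.

```python
from itertools import product

import numpy as np
import pandas as pd

# Rebuild the example frame with a sorted (y, t) MultiIndex (assumption).
y_max, t_max = 50, 100
index = pd.MultiIndex.from_product(
    [range(1, y_max + 1), range(1, t_max + 1)], names=["y", "t"]
)
df = pd.DataFrame({"Value": np.random.sample(y_max * t_max)}, index=index)

y1, yN, t1, tN = 4, 10, 5, 9
Y = range(y1, yN + 1)
T = range(t1, tN + 1)

# Variant from the answer: boolean mask built with Index.isin.
idx_sub = list(product(Y, T))
dict_isin = df.loc[df.index.isin(idx_sub), "Value"].to_dict()

# Alternative: label slices on both levels (inclusive on both ends).
dict_slice = df.loc[pd.IndexSlice[y1:yN, t1:tN], "Value"].to_dict()

print(dict_isin == dict_slice)
```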

Related

Dataframe with Monte Carlo Simulation calculation next row Problem

I want to build a DataFrame from scratch in which each value is calculated from the previous one, in order to price a barrier option. I know that I can use a Monte Carlo simulation for this, but I can't get it to work the way I want.
The formula is:
Value in row before * np.exp((r-sigma**2/2)*T/TradingDays+sigma*np.sqrt(T/TradingDays)*z)
The first code I wrote calculates only the first column. I know that I need a second loop, but I can't quite manage it.
The result should be that, for each simulation, a new value is calculated from the previous value for 500 days (S_1 through S_500), with 1000 simulations in total. (I need to generate new columns based on the previous value using the formula.)
Similar to this:
So 500 days for the 1st simulation, 500 days for the 2nd simulation, and so on.
import numpy as np
import pandas as pd
from scipy.stats import norm
import random as rd
import math
simulation = 0
S_0 = 42
T = 2
r = 0.02
sigma = 0.20
TradingDays = 500
df = pd.DataFrame()
for i in range(0, TradingDays):
    z = norm.ppf(rd.random())
    simulation = simulation + 1
    S_1 = S_0*np.exp((r-sigma**2/2)*T/TradingDays+sigma*np.sqrt(T/TradingDays)*z)
    df = df.append({
        'S_1': S_1,
        'S_0': S_0
    }, ignore_index=True)
df = df.round({'Z': 6,
               'S_T': 2
               })
df.index += 1
df.index.name = 'Simulation'
print(df)
I found another piece of code here that almost solves the problem, but only for one row; every following row is just the same calculation: Generate a Dataframe that follow a mathematical function for each column / row
If I just replace its formula with mine, I get the same problem.
replacing:
exp(r - q * sqrt(sigma))*T+ (np.random.randn(nrows) * sqrt(deltaT)))
with:
exp((r-sigma**2/2)*T/nrows+sigma*np.sqrt(T/nrows)*z))
import numpy as np
import pandas as pd
from scipy.stats import norm
import random as rd
import math
S_0 = 42
T = 2
r = 0.02
sigma = 0.20
TradingDays = 50
Simulation = 100
df = pd.DataFrame({'s0': [S_0] * Simulation})
for i in range(1, TradingDays):
    z = norm.ppf(rd.random())
    df[f's{i}'] = df.iloc[:, -1] * np.exp((r-sigma**2/2)*T/TradingDays+sigma*np.sqrt(T/TradingDays)*z)
print(df)
I would prefer to work with the last code and solve the problem with it.

How about just overwriting the value of S_0 by the new value of S_1 while you loop and keeping all simulations in a list?
Like this:
import numpy as np
import pandas as pd
import random
from scipy.stats import norm
S_0 = 42
T = 2
r = 0.02
sigma = 0.20
trading_days = 50
output = []
for i in range(trading_days):
    z = norm.ppf(random.random())
    value = S_0*np.exp((r - sigma**2 / 2) * T / trading_days + sigma * np.sqrt(T/trading_days) * z)
    output.append(value)
    S_0 = value
df = pd.DataFrame({'simulation': output})
Perhaps I'm missing something, but I don't see the need for a second loop.
Also, this eliminates calling df.append() in a loop, which should be avoided. (See here)

Solution based on the answer of bartaelterman, thank you very much!
import numpy as np
import pandas as pd
from scipy.stats import norm
import random as rd
import math
# Split the list into chunks so it can later be added to the DataFrame in the right order
def chunk_list(lst, chunk_size):
    for i in range(0, len(lst), chunk_size):
        yield lst[i:i + chunk_size]
def blackscholes():
    # The denominator is sigma*sqrt(T); with T = 2 this equals the original sigma*np.sqrt(2)
    d1 = ((math.log(S_0/K)+(r+sigma**2/2)*T)/(sigma*np.sqrt(T)))
    d2 = ((math.log(S_0/K)+(r-sigma**2/2)*T)/(sigma*np.sqrt(T)))
    preis_call_option = S_0*norm.cdf(d1)-K*np.exp(-r*T)*norm.cdf(d2)
    return preis_call_option
K = 40
S_0 = 42
T = 2
r = 0.02
sigma = 0.2
U = 38
simulation = 10000
trading_days = 500
trading_days = trading_days -1
#creating 2 lists for the first and second loop
loop_simulation = []
loop_trading_days = []
#first loop calculates the first column in a list
for j in range (0,simulation):
    print("Progressbar_1_2 {:2.2%}".format(j / simulation), end="\n\r")
    S_Tag_new = 0
    NORM_S_INV = norm.ppf(rd.random())
    S_Tag = S_0*np.exp((r-sigma**2/2)*T/trading_days+sigma*np.sqrt(T/trading_days)*NORM_S_INV)
    S_Tag_new = S_Tag
    loop_simulation.append(S_Tag)
    #second loop calculates the rows for the columns in a list
    for i in range (0,trading_days):
        NORM_S_INV = norm.ppf(rd.random())
        S_Tag = S_Tag_new*np.exp((r-sigma**2/2)*T/trading_days+sigma*np.sqrt(T/trading_days)*NORM_S_INV)
        loop_trading_days.append(S_Tag)
        S_Tag_new = S_Tag
#values from the second loop are split into one chunk of trading days per simulation
loop_trading_days_chunked = list(chunk_list(loop_trading_days,trading_days))
#first dataframe with just the first results from the first loop for each simulation
df1 = pd.DataFrame({'S_Tag 1': loop_simulation})
#putting the chunked list from the second loop into a second dataframe
df2 = pd.DataFrame(loop_trading_days_chunked)
#merging both dataframes into one
df3 = pd.concat([df1, df2], axis=1)
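For comparison, the same recursion can probably be written without Python loops, since each day's value is the previous value times an independent growth factor. The sketch below is not the poster's method: it draws all normal shocks at once with NumPy's random generator (instead of norm.ppf(rd.random())) and uses np.cumprod along the day axis. The seed, the column names and the smaller simulation count are assumptions made for the illustration.

```python
import numpy as np
import pandas as pd

S_0 = 42
T = 2
r = 0.02
sigma = 0.2
simulations = 1000    # assumption: smaller than the 10000 above, for a quick run
trading_days = 500

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
dt = T / trading_days

# One standard normal shock per simulation (row) and trading day (column).
z = rng.standard_normal((simulations, trading_days))

# Daily growth factor from the formula in the question.
growth = np.exp((r - sigma**2 / 2) * dt + sigma * np.sqrt(dt) * z)

# The cumulative product along the day axis applies "value before * factor".
paths = S_0 * np.cumprod(growth, axis=1)

df = pd.DataFrame(
    paths,
    index=pd.RangeIndex(1, simulations + 1, name="Simulation"),
    columns=[f"S_{day}" for day in range(1, trading_days + 1)],
)
print(df.shape)
```

Each row is one simulated path, so the column S_500 holds the terminal values across all simulations.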

Pandas subsetting returning different results to numpy

I am trying to subset a pandas dataframe using two conditions. However, I am not getting the same results as when done with numpy. What am I doing wrong?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(20,120,101)
y = np.linspace(-45,25,101)
xs,ys = np.meshgrid(x,y)
idx = (xs >=100) & (ys >= 0)
plt.scatter(xs,ys,s=2,c='b')
plt.scatter(xs[idx],ys[idx],s=2,c='r')
I need to remove the red block from my dataset, which I can do with numpy by using:
plt.scatter(xs[~idx],ys[~idx],s=2,c='b')
How do I replicate this with a pandas dataframe?
I've tried using the same logic as I used above:
data = {'x':x,'y':y}
df = pd.DataFrame(data)
mask = (df.x >=100) & (df.y >= 0)
df2 = df[~mask]
I've also tried using loc:
df.loc[(df.x >=100) & (df.y >= 0),['x','y']] = np.nan
Both of these methods give the following result:
How do I replicate the results from numpy?
Many thanks.

You don't get the same result because you didn't create all the coordinate pairs before passing them to pandas. Here is a quick solution:
data = {'x':xs.flatten(),'y':ys.flatten()}
df = pd.DataFrame(data)
mask = (df.x >=100) & (df.y >= 0)
df2 = df[~mask]
plt.scatter(df2.x,df2.y,s=2,c='b')
flatten reshapes your arrays to one dimension, so that they can be used to construct a DataFrame containing coordinate pairs rather than lists.
Edit: the same result, starting from a DataFrame that contains x and y.
Split the DataFrame into chunks:
data_x = np.linspace(20,120,101)
data_y = np.linspace(-45,25,101)
dataframe = pd.DataFrame({'x':data_x,'y':data_y})
chunk_size = 25
dfs = [dataframe[i:i+chunk_size] for i in range(0,dataframe.shape[0],chunk_size)]
Define a generator that yields the points you are interested in. Two loops are needed because you need every combination of x and y values:
def generatorPoints(dfs):
    for i in range(len(dfs)):
        x = dfs[i].x
        for j in range(len(dfs)):
            y = dfs[j].y
            xs, ys = np.meshgrid(x,y)
            idx = (xs >=100) & (ys >= 0)
            yield xs[~idx], ys[~idx]
x, y = [], []
for xs, ys in generatorPoints(dfs):
    x.extend(xs), y.extend(ys)
plt.scatter(x,y,s=2,c='b')
This gives the same result as the previous code. There is certainly room for optimization, but this is a start for your request :).
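If the starting point is a DataFrame with one x column and one y column, the coordinate pairs can also be built inside pandas. The sketch below uses a cross join, which assumes pandas 1.2 or newer (merge with how="cross"); the names grid and df2 are only illustrative, and the NumPy mask is recomputed purely to compare the counts.

```python
import numpy as np
import pandas as pd

x = np.linspace(20, 120, 101)
y = np.linspace(-45, 25, 101)
df = pd.DataFrame({"x": x, "y": y})

# Cross join: every x value paired with every y value (101 * 101 rows).
# Assumption: pandas >= 1.2, where how="cross" is available.
grid = df[["x"]].merge(df[["y"]], how="cross")

# Same two conditions as in the question, now applied to the full grid.
mask = (grid.x >= 100) & (grid.y >= 0)
df2 = grid[~mask]

# Reference count from the NumPy meshgrid approach.
xs, ys = np.meshgrid(x, y)
idx = (xs >= 100) & (ys >= 0)
print(len(df2), int((~idx).sum()))
```

Plotting df2.x against df2.y should then show the same picture as the meshgrid version, with the block removed.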

python: increase performance of finding the best timeshift for a correlation between each X column and y

I have a dataframe X with several columns and a dataframe y with only one column (series). The rows in X represent timesteps and I want to find the interval I need to shift each column of X to obtain the highest correlation with y. I wrote a function that loops over all columns and then loops over all timesteps and correlates the X column with y. If the R² is better than before I store the timestep. However, with over 300 columns this routine is really taking some time and I need to increase the performance. Is there a nice way to simplify this code?
(In the example I used the iris data set which is of course not a timeseries...)
from sklearn import datasets
import pandas as pd
import numpy as np
#import matplotlib.pyplot as plt
from copy import deepcopy
def get_best_shift(dfX, dfy, ti=60, maxt=1440):
    """
    determines the best correlation for the last maxt minutes based on a
    timestep of ti minutes. Creates a dataframe with the shifted variables based on the
    best match (strongest correlation).
    """
    df_out = deepcopy(dfX)
    for xcol in dfX:
        bestshift = 0
        Rmax = 0
        for ishift in range(0, int(maxt / ti)):
            xvals = dfX[xcol].iloc[0:(dfX.shape[0] - ishift)].values
            yvals = np.array([val[0] for val in dfy.iloc[ishift:dfy.shape[0]].values])
            selector = np.array([str(val)!="nan" for val in (xvals*yvals)],dtype=bool)
            xvals = xvals[selector]
            yvals = yvals[selector]
            R = np.corrcoef(xvals,yvals)[0][1]
            # plt.figure()
            # plt.plot(xvals,yvals,'k.')
            # plt.show()
            if R ** 2 > Rmax:
                Rmax = R ** 2
                # print(Rmax)
                bestshift = ishift
        df_out[xcol] = list(np.zeros(bestshift)) + list(dfX[xcol].iloc[0:dfX.shape[0] - bestshift].values)
        df_out = df_out.rename(columns={xcol: ''.join([str(xcol), '_t-', str(bestshift)])})
    return df_out
iris = datasets.load_iris()
X = pd.DataFrame(iris.data)
y = pd.DataFrame(iris.target)
df = get_best_shift(X,y)
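One direction that might reduce the overhead, offered only as a sketch: pairing X[i] with y[i + shift] is the same as correlating X.shift(shift) with y, and Series.corr already skips rows where either value is NaN, so the manual selector and the list comprehensions can be dropped. The helper name best_shifts, its max_shift parameter and the synthetic test data are all assumptions for this illustration; unlike the original it takes y as a Series and only returns the best shift per column.

```python
import numpy as np
import pandas as pd


def best_shifts(dfX, y, max_shift):
    """Return, per column of dfX, the row shift with the highest R^2 against y.

    Hypothetical helper: y is a Series aligned with dfX, and shifts are
    counted in rows (0 .. max_shift - 1), not in minutes.
    """
    result = {}
    for col in dfX:
        # Series.corr uses only the rows where both values are present.
        r2 = [dfX[col].shift(k).corr(y) ** 2 for k in range(max_shift)]
        result[col] = int(np.nanargmax(r2))
    return pd.Series(result)


# Synthetic check: y lags column "a" by exactly 3 rows.
rng = np.random.default_rng(0)
x = pd.Series(rng.standard_normal(200))
X = pd.DataFrame({"a": x})
y = x.shift(3)

shifts = best_shifts(X, y, max_shift=10)
print(shifts)
```

The shifted output frame from the original function could then be rebuilt with dfX[col].shift(shifts[col]) per column.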

Modifying a numpy array after conversion from pandas dataframe

I have the following code, which I am writing as part of a simple movie recommender in Python so I can mimic the results I get in Coursera's Machine Learning course taught by Andrew Ng.
I want to modify the numpy.ndarray that I get after calling as_matrix() on the pandas DataFrame and add a column vector to it, like we can in MATLAB:
Y = [ratings Y]
Following is my Python code:
dataFile='/filepath/'
userItemRatings = pd.read_csv(dataFile, sep="\t", names=['userId', 'movieId', 'rating','timestamp'])
movieInfoFile = '/filepath/'
movieInfo = pd.read_csv(movieInfoFile, sep="|", names=['movieId','Title','Release Date','Video Release Date','IMDb URL','Unknown','Action','Adventure','Animation','Childrens','Comedy','Crime','Documentary','Drama','Fantasy','Film-Noir','Horror','Musical','Mystery','Romance','Sci-Fi','Thriller','War','Western'], encoding = "ISO-8859-1")
userMovieMatrix=pd.merge(userItemRatings, movieInfo, left_on='movieId', right_on='movieId')
userMovieSubMatrix = userMovieMatrix[['userId', 'movieId', 'rating','timestamp','Title']]
Y = pd.pivot_table(userMovieSubMatrix, values='rating', index=['movieId'], columns=['userId'])
Y.fillna(0,inplace=True)
movies = Y.shape[0]
users = Y.shape[1] +1
ratings = np.zeros((1682, 1))
ratings[0] = 4
ratings[6] = 3
ratings[11] = 5
ratings[53] = 4
ratings[63] = 5
ratings[65] = 3
ratings[68] = 5
ratings[97] = 2
ratings[182] = 4
ratings[225] = 5
ratings[354] = 5
features = 10
theta = pd.DataFrame(np.random.rand(users,features))# users 943*3
X = pd.DataFrame(np.random.rand(movies,features))# movies 1682 * 3
X = X.as_matrix()
theta = theta.as_matrix()
Y = Y.as_matrix()
"""want to insert a column vector into this Y to get a new Y of dimension
1682*944, but only seeing 1682*943 after the following statement
"""
np.insert(Y, 0, ratings, axis=1)
R = Y.copy()
R[R!=0] = 1
Ymean = np.zeros((movies, 1))
Ynorm = np.zeros((movies, users))
for i in range(movies):
    idx = np.where(R[i,:] == 1)[0]
    Ymean[i] = Y[i,idx].mean()
    Ynorm[i,idx] = Y[i,idx] - Ymean[i]
print(type(Ymean), type(Ynorm), type(Y), Y.shape)
Ynorm[np.isnan(Ynorm)] = 0.
Ymean[np.isnan(Ymean)] = 0.
I left an inline comment in the code. My problem is that when I create a new numpy array and call insert on it, it works just fine. However, it doesn't work on the numpy array I get from calling as_matrix() on the pandas DataFrame produced by pivot_table(). Is there any alternative?

np.insert does not operate in place; you need to assign its output to a variable. Try:
Y = np.insert(Y, 0, ratings, axis=1)
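A small, self-contained check of that behaviour, using toy shapes instead of the 1682 x 943 matrix. Two choices here are assumptions of this sketch rather than part of the answer: the column vector is flattened to 1-D before being passed to np.insert together with a scalar position, and np.hstack is shown alongside because it mirrors MATLAB's [ratings Y] most directly.

```python
import numpy as np

# Toy stand-ins for the real data: 4 movies, 3 users.
Y = np.arange(12.0).reshape(4, 3)
ratings = np.zeros((4, 1))
ratings[0] = 4
ratings[2] = 3

# Calling np.insert without assigning the result leaves Y unchanged.
np.insert(Y, 0, ratings.ravel(), axis=1)
print(Y.shape)

# Assigning the returned array gives the widened matrix.
Y_insert = np.insert(Y, 0, ratings.ravel(), axis=1)
print(Y_insert.shape)

# Equivalent of MATLAB's [ratings Y]: stack the column in front of Y.
Y_stack = np.hstack([ratings, Y])
print(np.array_equal(Y_insert, Y_stack))
```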

Appending function created column to an existing data frame

I currently have a dataframe as below:
and wish to add a column, E, that is calculated based on the following function.
def geometric_brownian_motion(T = 1, N = 100, mu = 0.1, sigma = 0.01, S0 = 20):
    dt = float(T)/N
    t = np.linspace(0, T, N)
    W = np.random.standard_normal(size = N)
    W = np.cumsum(W)*np.sqrt(dt) ### standard brownian motion ###
    X = (mu-0.5*sigma**2)*t + sigma*W
    S = S0*np.exp(X) ### geometric brownian motion ###
    return S
(originating from here)
How do I create a time series for all of the dates contained within the DataFrame and append it?
The function input parameters are as follows:
T = (#days between df row 1 and df last)/365
N = # rows in data frame
S0 = 100

As I understand it, the essence of the question is how to apply a method to every row, given that calculating a new value requires the index of the DataFrame:
I suggest extracting the index as a separate column and using apply as usual.
from functools import partial
df['index'] = df.index
T = # precalculate T here
N = df.shape[0]
applying_method = partial(geometric_brownian_motion,T=T,N=N, S0=100)
df['E'] = df.apply(lambda row: applying_method(*row),axis=1)
Or, if you rename the columns of the DataFrame according to your function's arguments:
df['E'] = df.apply(lambda row: applying_method(**row),axis=1)
Hope that helps.
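Another reading of the question, sketched here under stated assumptions, is that a single simulated path with one value per row is wanted. Since the function already returns an array of length N, it can be called once and assigned directly as column E. The example frame, its daily DatetimeIndex and the placeholder column A are hypothetical, because the original DataFrame is not shown.

```python
import numpy as np
import pandas as pd


def geometric_brownian_motion(T=1, N=100, mu=0.1, sigma=0.01, S0=20):
    dt = float(T) / N
    t = np.linspace(0, T, N)
    W = np.random.standard_normal(size=N)
    W = np.cumsum(W) * np.sqrt(dt)  # standard Brownian motion
    X = (mu - 0.5 * sigma**2) * t + sigma * W
    return S0 * np.exp(X)  # geometric Brownian motion


# Hypothetical stand-in for the DataFrame from the question.
dates = pd.date_range("2020-01-01", periods=250, freq="D")
df = pd.DataFrame({"A": np.arange(250)}, index=dates)

# Parameters as described in the question.
T = (df.index[-1] - df.index[0]).days / 365  # days between first and last row
N = len(df)                                  # number of rows

# One simulated value per row, aligned by position with the dates.
df["E"] = geometric_brownian_motion(T=T, N=N, S0=100)
print(df.head())
```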
