Using Truncated Poisson in Pandas data frame - python

Withe help of # Welcome to Stack Overflow, I managed to truncate the Poisson distribution using upper limit.When I used the function so called truncated Poisson, which is user defined function, it worked with single value entry, which I have shown in the code below:
import scipy.stats as sct
import pandas as pd
def truncated_Poisson(mu, max_value, size):
temp_size = size
while True:
temp_size *= 2
temp = sct.poisson.rvs(mu, size=temp_size)
truncated = temp[temp <= max_value]
if len(truncated) >= size:
return truncated[:size]
mu = 2.5
max_value = 10
print(truncated_Poisson(mu, max_value, 1))
Unfortunately, I threw me an error when I applied it in the data frame as follow:
data = pd.DataFrame()
data['Name'] = ['A','B','C','D','E']
data ['mu'] = [0.5,1.2,2,2.5,2.8]
max_value = 5
size = 1
data ['Pos'] = truncated_Poisson(data.mu,max_value,size = 1)
The error statement is
ValueError: size does not match the broadcast shape of the parameters.
Can anyone advise me how to use that function in dataframe?
Thanks
Zep.

As I understand, you want to call truncated_Poisson with the same parameter and each of the mu from the data. You can do this for example by using .apply:
data['Pos'] = data.mu.apply(lambda mu: truncated_Poisson(mu, max_value, size=1))
>>> data
Name mu Pos
0 A 0.5 [0]
1 B 1.2 [0]
2 C 2.0 [3]
3 D 2.5 [4]
4 E 2.8 [3]

Related

Dataframe with Monte Carlo Simulation calculation next row Problem

I want to build up a Dataframe from scratch with calculations based on the Value before named Barrier option. I know that i can use a Monte Carlo simulation to solve it but it just wont work the way i want it to.
The formula is:
Value in row before * np.exp((r-sigma**2/2)*T/TradingDays+sigma*np.sqrt(T/TradingDays)*z)
The first code I write just calculates the first column. I know that I need a second loop but can't really manage it.
The result should be, that for each simulation it will calculate a new value using the the value before, for 500 Day meaning S_1 should be S_500 with a total of 1000 simulations. (I need to generate new columns based on the value before using the formular.)
similar to this:
So for the 1. Simulations 500 days, 2. Simulation 500 day and so on...
import numpy as np
import pandas as pd
from scipy.stats import norm
import random as rd
import math
simulation = 0
S_0 = 42
T = 2
r = 0.02
sigma = 0.20
TradingDays = 500
df = pd.DataFrame()
for i in range (0,TradingDays):
z = norm.ppf(rd.random())
simulation = simulation + 1
S_1 = S_0*np.exp((r-sigma**2/2)*T/TradingDays+sigma*np.sqrt(T/TradingDays)*z)
df = df.append ({
'S_1':S_1,
'S_0':S_0
}, ignore_index=True)
df = df.round ({'Z':6,
'S_T':2
})
df.index += 1
df.index.name = 'Simulation'
print(df)
I found another possible code which i found here and it does solve the problem but just for one row, the next row is just the same calculation. Generate a Dataframe that follow a mathematical function for each column / row
If i just replace it with my formular i get the same problem.
replacing:
exp(r - q * sqrt(sigma))*T+ (np.random.randn(nrows) * sqrt(deltaT)))
with:
exp((r-sigma**2/2)*T/nrows+sigma*np.sqrt(T/nrows)*z))
import numpy as np
import pandas as pd
from scipy.stats import norm
import random as rd
import math
S_0 = 42
T = 2
r = 0.02
sigma = 0.20
TradingDays = 50
Simulation = 100
df = pd.DataFrame({'s0': [S_0] * Simulation})
for i in range(1, TradingDays):
z = norm.ppf(rd.random())
df[f's{i}'] = df.iloc[:, -1] * np.exp((r-sigma**2/2)*T/TradingDays+sigma*np.sqrt(T/TradingDays)*z)
print(df)
I would work more likely with the last code and solve the problem with it.
How about just overwriting the value of S_0 by the new value of S_1 while you loop and keeping all simulations in a list?
Like this:
import numpy as np
import pandas as pd
import random
from scipy.stats import norm
S_0 = 42
T = 2
r = 0.02
sigma = 0.20
trading_days = 50
output = []
for i in range(trading_days):
z = norm.ppf(random.random())
value = S_0*np.exp((r - sigma**2 / 2) * T / trading_days + sigma * np.sqrt(T/trading_days) * z)
output.append(value)
S_0 = value
df = pd.DataFrame({'simulation': output})
Perhaps I'm missing something, but I don't see the need for a second loop.
Also, this eliminates calling df.append() in a loop, which should be avoided. (See here)
Solution based on the the answer of bartaelterman, thank you very much!
import numpy as np
import pandas as pd
from scipy.stats import norm
import random as rd
import math
#Dividing the list in chunks to later append it to the dataframe in the right order
def chunk_list(lst, chunk_size):
for i in range(0, len(lst), chunk_size):
yield lst[i:i + chunk_size]
def blackscholes():
d1 = ((math.log(S_0/K)+(r+sigma**2/2)*T)/(sigma*np.sqrt(2)))
d2 = ((math.log(S_0/K)+(r-sigma**2/2)*T)/(sigma*np.sqrt(2)))
preis_call_option = S_0*norm.cdf(d1)-K*np.exp(-r*T)*norm.cdf(d2)
return preis_call_option
K = 40
S_0 = 42
T = 2
r = 0.02
sigma = 0.2
U = 38
simulation = 10000
trading_days = 500
trading_days = trading_days -1
#creating 2 lists for the first and second loop
loop_simulation = []
loop_trading_days = []
#first loop calculates the first column in a list
for j in range (0,simulation):
print("Progressbar_1_2 {:2.2%}".format(j / simulation), end="\n\r")
S_Tag_new = 0
NORM_S_INV = norm.ppf(rd.random())
S_Tag = S_0*np.exp((r-sigma**2/2)*T/trading_days+sigma*np.sqrt(T/trading_days)*NORM_S_INV)
S_Tag_new = S_Tag
loop_simulation.append(S_Tag)
#second loop calculates the the rows for the columns in a list
for i in range (0,trading_days):
NORM_S_INV = norm.ppf(rd.random())
S_Tag = S_Tag_new*np.exp((r-sigma**2/2)*T/trading_days+sigma*np.sqrt(T/trading_days)*NORM_S_INV)
loop_trading_days.append(S_Tag)
S_Tag_new = S_Tag
#values from the second loop will be divided in number of Trading days per Simulation
loop_trading_days_chunked = list(chunk_list(loop_trading_days,trading_days))
#First dataframe with just the first results from the firstloop for each simulation
df1 = pd.DataFrame({'S_Tag 1': loop_simulation})
#Appending the the chunked list from the second loop to a second dataframe
df2 = pd.DataFrame(loop_trading_days_chunked)
#Merging both dataframe into one
df3 = pd.concat([df1, df2], axis=1)

Statsmodels OLS with rolling window problem

I would like to do a regression with a rolling window, but I got only one parameter back after the regression:
rolling_beta = sm.OLS(X2, X1, window_type='rolling', window=30).fit()
rolling_beta.params
The result:
X1 5.715089
dtype: float64
What could be the problem?
Thanks in advance, Roland
I think the problem is that the parameters window_type='rolling' and window=30 simply do not do anything. First I'll show you why, and at the end I'll provide a setup I've got lying around for linear regressions on rolling windows.
1. The problem with your function:
Since you haven't provided some sample data, here's a function that returns a dataframe of a desired size with some random numbers:
# Function to build synthetic data
import numpy as np
import pandas as pd
import statsmodels.api as sm
from collections import OrderedDict
def sample(rSeed, periodLength, colNames):
np.random.seed(rSeed)
date = pd.to_datetime("1st of Dec, 1999")
cols = OrderedDict()
for col in colNames:
cols[col] = np.random.normal(loc=0.0, scale=1.0, size=periodLength)
dates = date+pd.to_timedelta(np.arange(periodLength), 'D')
df = pd.DataFrame(cols, index = dates)
return(df)
Output:
X1 X2
2018-12-01 -1.085631 -1.294085
2018-12-02 0.997345 -1.038788
2018-12-03 0.282978 1.743712
2018-12-04 -1.506295 -0.798063
2018-12-05 -0.578600 0.029683
.
.
.
2019-01-17 0.412912 -1.363472
2019-01-18 0.978736 0.379401
2019-01-19 2.238143 -0.379176
Now, try:
rolling_beta = sm.OLS(df['X2'], df['X1'], window_type='rolling', window=30).fit()
rolling_beta.params
Output:
X1 -0.075784
dtype: float64
And this at least represents the structure of your output too, meaning that you're expecting an estimate for each of your sample windows, but instead you get a single estimate. So I looked around for some other examples using the same function online and in the statsmodels docs, but I was unable to find specific examples that actually worked. What I did find were a few discussions talking about how this functionality was deprecated a while ago. So then I tested the same function with some bogus input for the parameters:
rolling_beta = sm.OLS(df['X2'], df['X1'], window_type='amazing', window=3000000).fit()
rolling_beta.params
Output:
X1 -0.075784
dtype: float64
And as you can see, the estimates are the same, and no error messages are returned for the bogus input. So I suggest that you take a look at the function below. This is something I've put together to perform rolling regression estimates.
2. A function for regressions on rolling windows of a pandas dataframe
df = sample(rSeed = 123, colNames = ['X1', 'X2', 'X3'], periodLength = 50)
def RegressionRoll(df, subset, dependent, independent, const, win, parameters):
"""
RegressionRoll takes a dataframe, makes a subset of the data if you like,
and runs a series of regressions with a specified window length, and
returns a dataframe with BETA or R^2 for each window split of the data.
Parameters:
===========
df: pandas dataframe
subset: integer - has to be smaller than the size of the df
dependent: string that specifies name of denpendent variable
inependent: LIST of strings that specifies name of indenpendent variables
const: boolean - whether or not to include a constant term
win: integer - window length of each model
parameters: string that specifies which model parameters to return:
BETA or R^2
Example:
========
RegressionRoll(df=df, subset = 50, dependent = 'X1', independent = ['X2'],
const = True, parameters = 'beta', win = 30)
"""
# Data subset
if subset != 0:
df = df.tail(subset)
else:
df = df
# Loopinfo
end = df.shape[0]
win = win
rng = np.arange(start = win, stop = end, step = 1)
# Subset and store dataframes
frames = {}
n = 1
for i in rng:
df_temp = df.iloc[:i].tail(win)
newname = 'df' + str(n)
frames.update({newname: df_temp})
n += 1
# Analysis on subsets
df_results = pd.DataFrame()
for frame in frames:
#print(frames[frame])
# Rolling data frames
dfr = frames[frame]
y = dependent
x = independent
if const == True:
x = sm.add_constant(dfr[x])
model = sm.OLS(dfr[y], x).fit()
else:
model = sm.OLS(dfr[y], dfr[x]).fit()
if parameters == 'beta':
theParams = model.params[0:]
coefs = theParams.to_frame()
df_temp = pd.DataFrame(coefs.T)
indx = dfr.tail(1).index[-1]
df_temp['Date'] = indx
df_temp = df_temp.set_index(['Date'])
if parameters == 'R2':
theParams = model.rsquared
df_temp = pd.DataFrame([theParams])
indx = dfr.tail(1).index[-1]
df_temp['Date'] = indx
df_temp = df_temp.set_index(['Date'])
df_temp.columns = [', '.join(independent)]
df_results = pd.concat([df_results, df_temp], axis = 0)
return(df_results)
df_rolling = RegressionRoll(df=df, subset = 50, dependent = 'X1', independent = ['X2'], const = True, parameters = 'beta',
win = 30)
Output: A dataframe with beta estimates for OLS of X2 on X1 for each 30 period window of the data.
const X2
Date
2018-12-30 0.044042 0.032680
2018-12-31 0.074839 -0.023294
2019-01-01 -0.063200 0.077215
.
.
.
2019-01-16 -0.075938 -0.215108
2019-01-17 -0.143226 -0.215524
2019-01-18 -0.129202 -0.170304

How to use scipy.minimize with multiple parameters in error function?

I have two sets of frequencies data from experiment and from theoretical formula. I want to use minimize function of scipy.
Here's my code snippet.
where g is coupling which I want to find out.
Ad ind is inductance for plotting on x-axis.
from scipy.optimize import minimize
def eigenfreq1_func(ind,w_q,w_r,g):
return (w_q+w_r)+np.sqrt((w_q+w_r)**2.0-4*(w_q+w_r-g**2.0))/2
def eigenfreq2_func(ind,w_q,w_r,g):
return (w_q+w_r)-np.sqrt((w_q+w_r)**2.0-4*(w_q+w_r-g**2))/2.0
def err_func(y1,y1_fit,y2,y2_fit):
return np.sqrt((y1-y1_fit)**2+(y2-y2_fit)**2)
g_init=80e6
res1=eigenfreq1_func(ind,qubit_freq,readout_freq,g_init)
print res1
res2=eigenfreq2_func(ind,qubit_freq,readout_freq,g_init)
print res2
fit=minimize(err_func,args=[qubit_freq,res1,readout_freq,res2])
But it's showing the following error :
"TypeError: minimize() takes at least 2 arguments (2 given)"
First, the indentation in your example is messed up. Hope you don't try and run this
Second, here is a baby example to minimize the chi2 with the function scipy.optimize.minimize (note you can minimize what you want: likelihood, |chi|**?, toto, etc.):
import numpy as np
import scipy.optimize as opt
def functionyouwanttofit(x,y,z,t,u):
return np.array([x+y+z+t+u , x+y+z+t-u , x+y+z-t-u , x+y-z-t-u ]) # baby test here but put what you want
def calc_chi2(parameters):
x,y,z,t,u = parameters
data = np.array([100,250,300,500])
chi2 = sum( (data-functiontofit(x,y,z,t,u))**2 )
return chi2
# baby example for init, min & max values
x_init = 0
x_min = -1
x_max = 10
y_init = 1
y_min = -2
y_max = 9
z_init = 2
z_min = 0
z_max = 1000
t_init = 10
t_min = 1
t_max = 100
u_init = 10
u_min = 1
u_max = 100
parameters = [x_init,y_init,z_init,t_init,u_init]
bounds = [[x_min,x_max],[y_min,y_max],[z_min,z_max],[t_min,t_max],[u_min,u_max]]
result = opt.minimize(calc_chi2,parameters,bounds=bounds)
In your example you don't give initial values... This with the indentation... Were you waiting for someone doing the job for you ?
Third, note the optimization processes proposed by scipy are not always adapted to your needs. You may prefer minimizers such as lmfit

Modifying a numpy array after conversion from pandas dataframe

I have the following code which I am writing as part of a simple movie recommender in python so I can mimic the results I get as part of coursera's Machine Learning Course taught by Andrew NG.
I want to modify the numpy.ndarray that I get after calling as_matrix() on the pandas dataframe and add a column vector to it like we can in MATLAB
Y = [ratings Y]
Following is my python code
dataFile='/filepath/'
userItemRatings = pd.read_csv(dataFile, sep="\t", names=['userId', 'movieId', 'rating','timestamp'])
movieInfoFile = '/filepath/'
movieInfo = pd.read_csv(movieInfoFile, sep="|", names=['movieId','Title','Release Date','Video Release Date','IMDb URL','Unknown','Action','Adventure','Animation','Childrens','Comedy','Crime','Documentary','Drama','Fantasy','Film-Noir','Horror','Musical','Mystery','Romance','Sci-Fi','Thriller','War','Western'], encoding = "ISO-8859-1")
userMovieMatrix=pd.merge(userItemRatings, movieInfo, left_on='movieId', right_on='movieId')
userMovieSubMatrix = userMovieMatrix[['userId', 'movieId', 'rating','timestamp','Title']]
Y = pd.pivot_table(userMovieSubMatrix, values='rating', index=['movieId'], columns=['userId'])
Y.fillna(0,inplace=True)
movies = Y.shape[0]
users = Y.shape[1] +1
ratings = np.zeros((1682, 1))
ratings[0] = 4
ratings[6] = 3
ratings[11] = 5
ratings[53] = 4
ratings[63] = 5
ratings[65] = 3
ratings[68] = 5
ratings[97] = 2
ratings[182] = 4
ratings[225] = 5
ratings[354] = 5
features = 10
theta = pd.DataFrame(np.random.rand(users,features))# users 943*3
X = pd.DataFrame(np.random.rand(movies,features))# movies 1682 * 3
X = X.as_matrix()
theta = theta.as_matrix()
Y = Y.as_matrix()
"""want to insert a column vector into this Y to get a new Y of dimension
1682*944, but only seeing 1682*943 after the following statement
"""
np.insert(Y, 0, ratings, axis=1)
R = Y.copy()
R[R!=0] = 1
Ymean = np.zeros((movies, 1))
Ynorm = np.zeros((movies, users))
for i in range(movies):
idx = np.where(R[i,:] == 1)[0]
Ymean[i] = Y[i,idx].mean()
Ynorm[i,idx] = Y[i,idx] - Ymean[i]
print(type(Ymean), type(Ynorm), type(Y), Y.shape)
Ynorm[np.isnan(Ynorm)] = 0.
Ymean[np.isnan(Ymean)] = 0.
There is an inline comment inserted, but my problem is when I create a new numpy array and call insert, it works just fine. However the numpy array I get after calling as_matrix() on pandas dataframe on which pivot_table() is called doesn't work. Is there any alternative?
insert does not operate in place, you need to assign the output to a variable. Try:
Y = np.insert(Y, 0, ratings, axis=1)

iteration through pandas dataframes and numpy

I am coming from a java background and new to numpy and pandas.
I want to translate the following pseudo code into python.
theta[0...D] - numpy
input[1...D][0...N-1] - Pandas data frame
PSEUDO CODE:
mean = theta[0]
for(row = 0 to N-1)
for(col = 1 to D)
mean += theta[col] * input[row][col]
Implementation:
class simulator:
theta = np.array([])
stddev = 0
def __init__(self, v_coefficents, v_stddev):
self.theta = v_coefficents
self.stddev = v_stddev
def sim( self, input ):
mean = self.theta[0]
D = input.shape[0]
N = input.shape[1]
for index, row in input.iterrows():
mean = self.theta[0]
for i in range(D):
mean += self.theta[i+1] *row['y']
I am concerned with iteration in the last line of code:
mean += self.theta[i+1] *row['y'].
Since you are working with NumPy, I would suggest extracting the pandas dataframe as an array and then we would have the luxury of working with theta and the extracted version of input both as arrays.
Thus, starting off we would have the array as -
input_arr = input.values
Then, the translation of the pseudo code would be -
mean = theta[0]
for row in range(N):
for col in range(1,D+1):
mean += theta[col] * input_arr[row,col]
To perform the sum-reductions, with NumPy supporting vectorized operations and broadcasting, we would have the output with simply -
mean = theta[0] + (theta[1:D+1]*input_arr[:,1:D+1]).sum()
This could be optimized further with np.dot as a matrix-multiplication, like so -
mean = theta[0] + np.dot(input_arr[:,1:D+1], theta[1:D+1]).sum()
Please note that if you meant that input has a length of D-1, then we need few edits :
Loopy code would have : input_arr[row,col-1] instead of input_arr[row,col].
Vectorized codes would have : input_arr instead of input_arr[:,1:D+1].
Sample run based on comments -
In [71]: df = {'y' : [1,2,3,4,5]}
...: data_frame = pd.DataFrame(df)
...: test_coefficients = np.array([1,2,3,4,5,6])
...:
In [79]: input_arr = data_frame.values
...: theta = test_coefficients
...:
In [80]: theta[0] + np.dot(input_arr[:,0], theta[1:])
Out[80]: 71

Categories

Resources