Python: Only exponentiate array values that are not equal to zero

I have the function below, in which I run several regressions. Some estimated coefficients are output as 0s, and naturally they turn into 1s when exponentiated.
Ideally, I would have sm.OLS() output blanks rather than zeros in those cases where the estimated coefficient is zero, but I've tried and this doesn't seem possible.
So, alternatively, I would prefer to keep zeros rather than 1s. This would require not exponentiating the zeros in this line of the code: exp_coefficients = np.exp(results.params)
How could I do this?
import numpy as np
import pandas as pd
import statsmodels.api as sm

df_index = []
coef_mtr = []  # start with an empty list
for x in df_main.project_x.unique():
    df_holder = df_main[df_main.project_x == x]
    X = df_holder.drop(['unneeded1', 'unneeded2', 'unneeded3'], axis=1)
    X['constant'] = 1
    Y = df_holder['sales']
    eq = sm.OLS(Y, X)  # note: Y, not y (the variable defined above is Y)
    results = eq.fit()
    exp_coefficients = np.exp(results.params)
    # print(exp_coefficients)
    coef_mtr.append(exp_coefficients)
    df_index.append(x)

coef_mtr = np.array(coef_mtr)
# create a dataframe with this data
df_columns = [f'coef_{n}' for n in range(coef_mtr.shape[1])]
df_matrix = pd.DataFrame(data=coef_mtr, index=df_index, columns=df_columns)

The cleanest approach is probably the where keyword argument (not the np.where function), as in
out = np.exp(in_,where=in_!=0)
This will skip all zero values. But skipped really means skipped: the corresponding values in out are left uninitialized. We therefore need to preset them to zero:
out = np.zeros_like(in_)
np.exp(in_,where=in_!=0,out=out)
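Applied to the loop in the question, a minimal sketch might look like this (assuming results.params is a pandas Series, which is what statsmodels returns; converting to a plain array keeps the ufunc's where/out machinery straightforward):

import numpy as np
import pandas as pd

params = results.params.to_numpy()         # Series -> ndarray
exp_coefficients = np.zeros_like(params)   # preset to zero, so zeros stay zeros
np.exp(params, where=params != 0, out=exp_coefficients)
# optionally restore the coefficient labels:
exp_coefficients = pd.Series(exp_coefficients, index=results.params.index)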


How to get value from nsmallest instead of .core.series.Series

Pretty new to Python, so any advice is always welcome.
I am trying to map data from multiple sets of coordinates to one set, and am trying to use bilinear interpolation to do it.
I have a set of DataFrames I iterate over and am trying to find the nearest neighbors for my interpolation.
Since my grids may not be uniformly spaced, I am sorting by Y position first:
for i in range(0, len(df_x['X'])):
    x_pos = df_x._get_value(i, 'X')  # pull x coord, y coord
    y_pos = df_y._get_value(i, 'Y')
    for n in data_list:
        df = data_list[n]
        d_y = abs(df['Y'] - y_pos)  # array of distances from y_pos
        d_y.drop_duplicates()       # remove duplicates
        nn_y1 = d_y.nsmallest(1)    # finds closest row
        nn_y2 = d_y.nsmallest(2).iloc[-1]  # finds next closest row
        print(type(nn_y1))
        d_x_y1 = df[df['DesignY'] == nn_y1]  # creates list of X at closest row
I think this should provide me with my upper and lower bounds nearest my points.
However, when I then sort by X position, I get an error:
ValueError: Can only compare identically-labeled Series objects
I think this is because nn_y1 comes out as <class 'pandas.core.series.Series'>.
Any advice on how to get the value instead of the Series? I could create a DataFrame with one element, but that seems hacky. I tried some combinations of _get_value(), but to no avail.
nsmallest returns:
"The n smallest values in the Series, sorted in increasing order." (Type Series)
In this case the simple way is to unpack from nsmallest(2) since both values are needed:
nn_y1, nn_y2 = d_y.nsmallest(2)
To modify the code directly, iloc is needed to get the first value from the Series:
nn_y1 = d_y.nsmallest(1).iloc[0]
Alternatively, d_y.nsmallest(2) could be stored once and indexed with iloc to get both values:
smallest = d_y.nsmallest(2)
nn_y1 = smallest.iloc[0]
nn_y2 = smallest.iloc[1]
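A tiny self-contained demonstration (with made-up distances):

import pandas as pd

d_y = pd.Series([3.2, 0.5, 1.7, 0.9])
nn_y1, nn_y2 = d_y.nsmallest(2)
print(nn_y1, nn_y2)  # 0.5 0.9 -- plain floats now, not Series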

Finding Euclidean distance from multiple mean vectors

This is what I am trying to do; I was able to do steps 1 to 4 and need help with step 5 onward.
Basically, for each data point I would like to find the Euclidean distance from all mean vectors, based upon column y:
1. Take the data.
2. Separate out the non-numerical columns.
3. Find the mean vectors by the y column.
4. Save the means.
5. Subtract each mean vector from each row, based upon the y value.
6. Square each column.
7. Add all the columns.
8. Join back to the numerical dataset, and then join the non-numerical columns.
import pandas as pd

data = [['Alex',10,5,0],['Bob',12,4,1],['Clarke',13,6,0],['brke',15,1,0]]
df = pd.DataFrame(data, columns=['Name','Age','weight','class'], dtype=float)
print(df)
df_numeric = df.select_dtypes(include='number')
df_non_numeric = df.select_dtypes(exclude='number')
means = df_numeric.groupby('class').mean()
For each row of means, subtract that row from each row of df_numeric. Then take the square of each column in the output, and for each row add up all the columns. Then join this data back to df_numeric and df_non_numeric.
-------------- Update 1 --------------
I added code as below. My questions have changed; the updated questions are at the end.
import numpy as np

def calculate_distance(row):
    return np.sum(np.square(row - means.head(1)), 1)

def calculate_distance2(row):
    return np.sum(np.square(row - means.tail(1)), 1)

df_numeric2 = df_numeric.drop("class", axis=1)
#np.sum(np.square(df_numeric2.head(1)-means.head(1)),1)
df_numeric2['distance0'] = df_numeric.apply(calculate_distance, axis=1)
df_numeric2['distance1'] = df_numeric.apply(calculate_distance2, axis=1)
print(df_numeric2)
final = pd.concat([df_non_numeric, df_numeric2], axis=1)
final["class"] = df["class"]
Could anyone confirm that this is a correct way to achieve the results? I am mainly concerned about the last two statements. Would the second-to-last statement do a correct join? Would the final statement assign the original class? I would like to confirm that Python won't do the concat and the class assignment in a random order, and that it maintains the order in which the rows appear:
final = pd.concat([df_non_numeric, df_numeric2], axis=1)
final["class"]=df["class"]
I think this is what you want:
import pandas as pd
import numpy as np
data = [['Alex',10,5,0],['Bob',12,4,1],['Clarke',13,6,0],['brke',15,1,0]]
df = pd.DataFrame(data,columns=['Name','Age','weight','class'],dtype=float)
print (df)
df_numeric = df.select_dtypes(include='number')
# Make df_non_numeric a copy and not a view
df_non_numeric=df.select_dtypes(exclude='number').copy()
# Subtract mean (calculated using the transform function which preserves the
# number of rows) for each class to create distance to mean
df_dist_to_mean = df_numeric[['Age', 'weight']] - df_numeric[['Age', 'weight', 'class']].groupby('class').transform('mean')
# Finally calculate the euclidean distance (hypotenuse)
df_non_numeric['euc_dist'] = np.hypot(df_dist_to_mean['Age'], df_dist_to_mean['weight'])
df_non_numeric['class'] = df_numeric['class']
# If you want a separate dataframe named 'final' with the end result
df_final = df_non_numeric.copy()
print(df_final)
It is probably possible to write this even more densely, but this way you'll see what's going on.
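As for the concat-order concern in the update: pd.concat(..., axis=1) aligns rows by index label, not by position, so as long as neither frame's index was reset or reordered, the rows pair up correctly (and final["class"] = df["class"] likewise assigns by index). A small illustration with made-up frames:

import pandas as pd

a = pd.DataFrame({'x': [1, 2]}, index=[0, 1])
b = pd.DataFrame({'y': [20, 10]}, index=[1, 0])  # deliberately out of positional order
print(pd.concat([a, b], axis=1))
#    x   y
# 0  1  10
# 1  2  20   <- rows matched by index label, not position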
I'm sure there is a better way to do this, but I iterated through depending on the class and followed the exact steps:
1. Assigned 'class' as the index.
2. Rotated so that 'class' was in the columns.
3. Performed the subtraction of the means that corresponded with df_numeric.
4. Squared the values.
5. Summed the rows.
6. Concatenated the dataframes back together.
import numpy as np
import pandas as pd

data = [['Alex',10,5,0],['Bob',12,4,1],['Clarke',13,6,0],['brke',15,1,0]]
df = pd.DataFrame(data, columns=['Name','Age','weight','class'], dtype=float)
#print(df)
df_numeric = df.select_dtypes(include='number')
df_non_numeric = df.select_dtypes(exclude='number')
means = df_numeric.groupby('class').mean().T
# Changed index
df_numeric.index = df_numeric['class']
df_numeric.drop('class', axis=1, inplace=True)
# Rotated the numeric data sideways so the class was in the columns
df_numeric = df_numeric.T
# Iterated through the values in means and saw which df_numeric values matched
store = []  # start with an empty list
for j in means:
    sto = df_numeric[j]
    if isinstance(sto, pd.Series):  # a single-sample class comes out as a pd.Series
        sto = sto.to_frame()        # need to convert to DataFrame type
    # subtract the class mean, aligned on the feature index
    # (was sto - j, which subtracted the class label itself)
    store.append(sto.sub(means[j], axis=0))
values = [s**2 for s in store]  # squaring the values (kept as a list; shapes differ per class)
# Summing the rows
summed = []
for i in values:
    summed.append(i.sum(axis=1))
df_new = pd.concat(summed, axis=1)
print(df_new.T)

Trying to create a dataframe from another dataframe with certain restrictions

I am trying to write a VAR model in Python (where I am not allowed to use pre-made functions such as VAR within statsmodels).
For this I need the matrix of the dependent variables.
I have a data set of 3 government bonds, all with different maturities.
The data is imported and treated as follows
import numpy as np
import pandas as pd

# importing file
df = pd.read_csv("C://Users/raymond/Desktop/Econometrie3/us_tbills_8019.csv")
# dropping years > 1999
df = df.iloc[:240]
# calculating log differences
Dates = pd.to_datetime(df['DATE'], format='%Y-%m-%d')
mData = df[['GS10','GS5','GS1']]
mData.index = pd.DatetimeIndex(Dates)
AllData = mData
logdif = np.log(mData).diff().shift(1).dropna()
To create the matrix Y, which is the matrix of dependent variables, I want to take values of logdif with range i = 1:K and j = P+1:T-1
I tried the following to create my matrix:
# variables
K = 3
T = df.shape[0]
P = 4
# matrix of dependent variables
Y = logdif

def functionY():
    for i in range(1, 3, 1):
        for j in range(P+1, T-1, 1):
            Y[i][j-1] = logdif[i][j]
    return Y
I have tried other ways to find the matrix as well, with none working.
Any tips for creating my matrix?
I cannot understand all of your code (so there may be other problems), but I notice that you're assigning your DataFrame logdif to a second variable Y and then writing values into Y. Be aware that this modifies logdif as well, since the assignment only binds a second name to the same object. (This doesn't look like what you want, since you're reading from logdif at the same time.)
Use Y = logdif.copy() to avoid that.
Edit: Also, range(1,3,1) is identical to range(1,3).
Edit: Also, you mention that you are trying to create a matrix; the variable you're using right now is still a DataFrame (np.log preserves the type of its argument here). You can use Y = logdif.to_numpy() (or np.asarray(logdif)) to get a plain NumPy array; np.matrix(logdif) also works, but np.matrix is deprecated.
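A minimal illustration of the name-binding point (made-up data):

import numpy as np
import pandas as pd

logdif = pd.DataFrame(np.zeros((3, 2)), columns=['a', 'b'])
Y = logdif                 # second name for the SAME object
Y.iloc[0, 0] = 99.0
print(logdif.iloc[0, 0])   # 99.0 -- logdif was modified too

Y_safe = logdif.copy()     # independent copy
Y_safe.iloc[0, 0] = -1.0
print(logdif.iloc[0, 0])   # still 99.0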

Vectorize an operation in Numpy

I am trying to do the following in NumPy without using a loop:
I have a matrix X of dimensions N*d and a vector y of dimension N.
y contains integers ranging from 1 to K.
I am trying to get a matrix M of size K*d, where M[i,:]=np.mean(X[y==i,:],0)
Can I achieve this without using a loop?
With a loop, it would go something like this.
import numpy as np

N = 3
d = 3
K = 2
X = np.eye(N)
y = np.random.randint(1, K+1, N)
M = np.zeros((K, d))
for i in np.arange(0, K):
    line = X[y == i+1, :]
    if line.size == 0:
        M[i, :] = np.zeros(d)
    else:
        M[i, :] = np.mean(line, 0)  # was mp.mean, a typo
Thank you in advance.
The code is basically collecting specific rows off X and adding them up, for which we have a NumPy builtin in np.add.reduceat. So, with that in focus, the steps to solve it in a vectorized way are listed next:
# Get sort indices of y
sidx = y.argsort()
# Collect rows off X based on their IDs so that they come in consecutive order
Xr = X[sidx]
# Get unique row IDs, start positions of each unique ID
# and their counts to be used for average calculations
unq, startidx, counts = np.unique((y-1)[sidx], return_index=True, return_counts=True)
# Add rows off Xr based on the slices signified by the start positions
vals = np.true_divide(np.add.reduceat(Xr, startidx, axis=0), counts[:, None])
# Set up the output array and set the averaged values into it at the unique IDs' row positions
out = np.zeros((K, d))
out[unq] = vals
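A self-contained sanity check of this recipe against the loop version (made-up data, labels 1..K as in the question):

import numpy as np

rng = np.random.default_rng(0)
K, N, d = 3, 8, 4
X = rng.standard_normal((N, d))
y = rng.integers(1, K + 1, N)   # labels in 1..K

sidx = y.argsort()
unq, startidx, counts = np.unique((y - 1)[sidx], return_index=True, return_counts=True)
vals = np.add.reduceat(X[sidx], startidx, axis=0) / counts[:, None]
out = np.zeros((K, d))
out[unq] = vals

# loop reference
for i in range(K):
    rows = X[y == i + 1]
    ref = rows.mean(0) if rows.size else np.zeros(d)
    assert np.allclose(out[i], ref)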
This also solves the question, but it creates an intermediate K×N boolean matrix and doesn't use the built-in mean function, which may hurt performance or numerical stability in some cases. I'm letting the class labels range from 0 to K-1 rather than 1 to K.
import numpy as np

# Define constants
K, N, d = 10, 1000, 3
# Sample data
Y = np.random.randint(0, K-1, N)  # K-1 to omit one class, to test the no-examples case
X = np.random.randn(N, d)
# Calculate means for each class, vectorized
# Map samples to labels by taking a logical "outer product"
mark = Y[None, :] == np.arange(0, K)[:, None]
# Count the number of examples in each class
count = mark.sum(1)
# Avoid divide by zero if there are no examples
count += count == 0
# Sum within each class and normalize
M = (np.dot(mark, X).T / count).T
print(M, M.shape, mark.shape)

Find average of all columns of matrix filtered on last column

I'm new to NumPy. I have an N×4 matrix, and I want to find the mean of each column over the rows where the last column equals 1, for instance. In MATLAB I would do something like mean1 = mean(data(data(:,4) == 1, :)). This returns a vector with the mean of each column; the mean of column 4 would be equal to 1. I can't find any specific documentation on how to handle this. I have seen how to filter the matrix, but I shouldn't have to reassign the matrix to a new variable, doubling the storage size. Thanks in advance.
import numpy as np

# make artificial data to match the problem
data = np.random.random((100, 4))
print(id(data))
data[:, 3] = data[:, 3] < 0.5
print(id(data))  # same object (memory location), so no extra storage
# get the filter
dfilter = data[:, 3].astype(np.bool_)
# find the means
means = data[dfilter].mean(axis=0)
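If the temporary copy made by the boolean indexing in data[dfilter] is the concern, np.mean also accepts a where= mask (a sketch, assuming NumPy 1.20 or newer), which avoids materializing the filtered rows:

import numpy as np

data = np.random.random((100, 4))
data[:, 3] = data[:, 3] < 0.5

mask = data[:, 3] == 1  # rows whose last column equals 1
means = np.mean(data, axis=0, where=mask[:, None])  # mask broadcast across the 4 columns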
