I tried to build the optimal model using Backward Elimination while practicing a machine learning course, and I think I've done something wrong in my function definition. Please help me find the problem.
The function is supposed to delete the columns whose p-values are greater than 0.05.
I tried my idea on the console step by step, and it worked, but when I transformed the idea into a function definition something went wrong.
import numpy as np
import pandas as pd
import statsmodels.api as sm  # sm.OLS lives in statsmodels.api
# Automatic Backward Elimination
def backwardelimination(x, sl):
    regressor_OLS = sm.OLS(y, x).fit()
    for i in range(0, len(x[0])):
        maxp = max(regressor_OLS.pvalues)  # find the max p-value
        if maxp > sl:  # delete the column whose p-value is greater than SL
            x = np.delete(x, maxp, axis=1)
            regressor_OLS = sm.OLS(y, x).fit()
    return x
X=np.append(arr = np.ones((50,1)).astype(int) , values = X ,axis = 1)
X_opt=X[:,[0,1,2,3,4,5]]
SL=0.05 #Significance Level
X_Modeled=backwardelimination(X_opt,SL)
This is the original matrix; I expected the result below it, but somehow I got something else.
X_opt = X[:, [0, 1, 2, 3, 4, 5]]   # original matrix
X_opt = X[:, [0, 3]]               # what it should be (after deleting the columns whose p-values are greater than 0.05)!!
You need to pass the index of the maximum p-value to the np.delete() function, not the value itself. To find it, first convert the p-values to a list; then it's really easy.
def backwardelimination(x, sl):
    regressor_OLS = sm.OLS(y, x).fit()
    for i in range(0, len(x[0])):
        c = list(regressor_OLS.pvalues)              # <-- list conversion
        j = c.index(max(c))                          # <-- getting the index of the max value
        maxp = max(regressor_OLS.pvalues).astype(float)
        if maxp > sl:
            x = np.delete(x, j, axis=1)              # <-- pass j to np.delete as an index
            regressor_OLS = sm.OLS(y, x).fit()
    print(regressor_OLS.summary())
    return x
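For what it's worth, here is a slightly more idiomatic sketch of the same idea using np.argmax instead of the list round-trip. It assumes, as in the question, that y and sm are already defined in the surrounding scope:

def backward_elimination(x, sl):
    regressor_OLS = sm.OLS(y, x).fit()
    for _ in range(len(x[0])):
        j = int(np.argmax(regressor_OLS.pvalues))  # column with the largest p-value
        if regressor_OLS.pvalues[j] > sl:
            x = np.delete(x, j, axis=1)            # drop that column
            regressor_OLS = sm.OLS(y, x).fit()     # refit on the reduced matrix
    return x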
I want to implement MLE (maximum likelihood estimation) with the gekko package in Python. Suppose we have a DataFrame that contains two columns, ['Loss', 'Target'], and its length is 500.
First we have to import packages that we need:
from gekko import GEKKO
import numpy as np
import pandas as pd
Then we simply create the DataFrame like this:
My_DataFrame = pd.DataFrame({"Loss":np.linspace(-555.795 , 477.841 , 500) , "Target":0.0})
My_DataFrame = My_DataFrame.sort_values(by=["Loss"] , ascending=False).reset_index(drop=True)
My_DataFrame
It will look like this:
Some components of the ['Target'] column should be calculated with a formula that I wrote down in the picture below (the rest of them remain zero; I explain more below, please keep reading) so you can see it perfectly. The two main elements of the formula are 'Kasi' and 'Betaa'. I want to find the best values for them, the ones that maximize the sum of My_DataFrame['Target']. So you get the idea of what is going to happen!
Now let me show you how I wrote the code for this purpose. First I define my objective function:
def obj_function(Array):
    """
    [Purpose]:
        + It calculates each component of My_DataFrame["Target"], so that I can
          maximize sum(My_DataFrame["Target"]) and find the best 'Kasi' and 'Betaa'.
    [Parameters]:
        + An Array that contains 'Kasi' and 'Betaa'.
          Array[0] represents 'Kasi' and Array[1] represents 'Betaa'.
    [Returns]:
        + A pandas Series: the new components of My_DataFrame["Target"].
    """
    # If you don't know what `qw` is, look at the next code cell (the next section).
    # np.where(My_DataFrame["Loss"] == item)[0][0] gives the row index of item.
    for item in My_DataFrame[My_DataFrame["Loss"] > 160]['Loss']:
        My_DataFrame.iloc[np.where(My_DataFrame["Loss"] == item)[0][0], 1] = \
            qw.log10((1/Array[1])*(1 + (Array[0]*(item-160)/Array[1])**((-1/Array[0]) - 1)))
    return My_DataFrame["Target"]
If you're confused about what's happening in the for loop in obj_function, check the picture below; it contains a brief example. If not, skip this part:
Then we just need to run the optimization. I use the gekko package for this purpose. Note that I want to find the best values of 'Kasi' and 'Betaa', so I have two main variables and no constraints of any kind.
So let’s get started:
# I have 2 variables, 'Kasi' and 'Betaa', so I set nd = 2
nd = 2
qw = GEKKO()
# Specify the variables ('Kasi' and 'Betaa') with initial values Kasi = 0.7 and Betaa = 20.0
x = qw.Array(qw.Var, nd, value=[0.7, 20])
# So I guess x[0] now represents 'Kasi' and x[1] represents 'Betaa'
qw.Maximize(np.sum(obj_function(x)))
And then when I want to solve the optimization with qw.solve():
qw.solve()
But I got this error:
Exception: This steady-state IMODE only allows scalar values.
How can I fix this problem? (The complete script is gathered in the next section for convenience.)
from gekko import GEKKO
import numpy as np
import pandas as pd
My_DataFrame = pd.DataFrame({"Loss":np.linspace(-555.795 , 477.841 , 500) , "Target":0.0})
My_DataFrame = My_DataFrame.sort_values(by=["Loss"] , ascending=False).reset_index(drop=True)
def obj_function(Array):
    """
    [Purpose]:
        + It calculates each component of My_DataFrame["Target"], so that I can
          maximize sum(My_DataFrame["Target"]) and find the best 'Kasi' and 'Betaa'.
    [Parameters]:
        + An Array that contains 'Kasi' and 'Betaa'.
          Array[0] represents 'Kasi' and Array[1] represents 'Betaa'.
    [Returns]:
        + A pandas Series: the new components of My_DataFrame["Target"].
    """
    # np.where(My_DataFrame["Loss"] == item)[0][0] gives the row index of item.
    for item in My_DataFrame[My_DataFrame["Loss"] > 160]['Loss']:
        My_DataFrame.iloc[np.where(My_DataFrame["Loss"] == item)[0][0], 1] = \
            qw.log10((1/Array[1])*(1 + (Array[0]*(item-160)/Array[1])**((-1/Array[0]) - 1)))
    return My_DataFrame["Target"]
# I have 2 variables, 'Kasi' and 'Betaa', so I set nd = 2
nd = 2
qw = GEKKO()
# Specify the variables ('Kasi' and 'Betaa') with initial values Kasi = 0.7 and Betaa = 20.0
x = qw.Array(qw.Var, nd)
for i, xi in enumerate([0.7, 20]):
    x[i].value = xi
# So I guess x[0] now represents 'Kasi' and x[1] represents 'Betaa'
qw.Maximize(qw.sum(obj_function(x)))
The proposed potential script is here:
from gekko import GEKKO
import numpy as np
import pandas as pd
My_DataFrame = pd.read_excel("[<FILE_PATH_IN_YOUR_MACHINE>]\\Losses.xlsx")
# I'll put a link to the "Losses.xlsx" file at the end of my explanation
# so you can download it from my google drive.
loss = My_DataFrame["Loss"]
def obj_function(x):
    k, b = x
    target = []
    for iloss in loss:
        if iloss > 160:
            t = qw.log((1/b)*(1+(k*(iloss-160)/b)**((-1/k)-1)))
            target.append(t)
    return target
qw = GEKKO(remote=False)
nd = 2
x = qw.Array(qw.Var,nd)
# initial values --> Kasi = 0.7 and Betaa = 20.0
for i, xi in enumerate([0.7, 20]):
    x[i].value = xi
# bounds
k,b = x
k.lower=0.1; k.upper=0.8
b.lower=10; b.upper=500
qw.Maximize(qw.sum(obj_function(x)))
qw.options.SOLVER = 1
qw.solve()
print('k = ',k.value[0])
print('b = ',b.value[0])
python output:
objective function = -1155.4861315885942
b = 500.0
k = 0.1
Note that in the python output, b represents "Betaa" and k represents "Kasi".
The output seems a bit strange, so I decided to test it. For this purpose I used the Microsoft Excel Solver (I put the link to the excel file at the end of my explanation so you can check it yourself if you want). As you can see in the picture below, the optimization by excel has been done and an optimal solution has been found successfully (see the picture below for the optimization result).
excel output:
objective function = -108.21
Betaa = 32.53161
Kasi = 0.436246
As you can see, there is a huge difference between the python output and the excel output, and it seems that excel performs pretty well! So I guess the problem still stands and the proposed python script is not performing well...
The Implementation_in_Excel.xls file of the optimization by Microsoft Excel is available here (you can also see the optimization options in the Data tab --> Analysis --> Solver).
The data used for the optimization in excel and python are the same, and they're available here (it's pretty simple: 501 rows and 1 column).
*If you can't download the files, let me know and I'll update them.
The initialization is applying the values [0.7, 20] to each parameter. A scalar should be used to initialize each value instead, such as:
x = qw.Array(qw.Var , nd)
for i, xi in enumerate([0.7, 20]):
    x[i].value = xi
Another issue is that gekko needs to use special functions to perform automatic differentiation for the solvers. For the objective function, switch to the gekko version of summation as:
qw.Maximize(qw.sum(obj_function(x)))
If loss is computed by changing the values of x, then the objective function contains logical expressions that need special treatment for solution with gradient-based solvers. Try the if3() function for a conditional statement, or else slack variables (preferred). The objective function is evaluated once to build symbolic expressions that are then compiled to byte-code and solved with one of the solvers. The symbolic expressions can be found in m.path in the gk0_model.apm file.
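As a minimal standalone sketch of the if3() approach (a toy model, not the question's code):

from gekko import GEKKO

m = GEKKO(remote=False)
x = m.Var(value=1)
# y equals -x when x < 0, and x**2 when x >= 0;
# if3(condition, x1, x2) selects x1 for condition < 0 and x2 otherwise
y = m.if3(x, -x, x**2)
m.Minimize(y)
m.options.SOLVER = 1  # APOPT, since if3 introduces a binary switching variable
m.solve(disp=False)
print(x.value[0], y.value[0])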
Response to Edit
Thanks for posting an edit with the complete code. Here is a potential solution:
from gekko import GEKKO
import numpy as np
import pandas as pd
loss = np.linspace(-555.795 , 477.841 , 500)
def obj_function(x):
    k, b = x
    target = []
    for iloss in loss:
        if iloss > 160:
            t = qw.log((1/b)*(1+(k*(iloss-160)/b)**((-1/k)-1)))
            target.append(t)
    return target
qw = GEKKO(remote=False)
nd = 2
x = qw.Array(qw.Var,nd)
# initial values --> Kasi = 0.7 and Betaa = 20.0
for i, xi in enumerate([0.7, 20]):
    x[i].value = xi
# bounds
k,b = x
k.lower=0.6; k.upper=0.8
b.lower=10; b.upper=30
qw.Maximize(qw.sum(obj_function(x)))
qw.options.SOLVER = 1
qw.solve()
print('k = ',k.value[0])
print('b = ',b.value[0])
The solver reaches bounds at the solution. The bounds may need to be widened so that arbitrary limits are not the solution.
Update
Here is a final solution. The objective function in the code had a problem, so it needed to be fixed. Here is the correct script:
from gekko import GEKKO
import numpy as np
import pandas as pd
My_DataFrame = pd.read_excel("<FILE_PATH_IN_YOUR_MACHINE>\\Losses.xlsx")
loss = My_DataFrame["Loss"]
def obj_function(x):
    k, b = x
    q = (-1/k) - 1
    target = []
    for iloss in loss:
        if iloss > 160:
            t = qw.log(1/b) + q*(qw.log(b + k*(iloss-160)) - qw.log(b))
            target.append(t)
    return target
qw = GEKKO(remote=False)
nd = 2
x = qw.Array(qw.Var,nd)
# initial values --> Kasi = 0.7 and Betaa = 20.0
for i, xi in enumerate([0.7, 20]):
    x[i].value = xi
qw.Maximize(qw.sum(obj_function(x)))
qw.solve()
print('Kasi = ',x[0].value)
print('Betaa = ',x[1].value)
Output:
The final value of the objective function is 108.20609317143486
---------------------------------------------------
Solver : IPOPT (v3.12)
Solution time : 0.031200000000000006 sec
Objective : 108.20609317143486
Successful solution
---------------------------------------------------
Kasi = [0.436245842]
Betaa = [32.531632983]
Results are close to the optimization result from Microsoft Excel.
qw.Maximize() only sets the objective of the optimization, you still need to call solve() on your model.
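In other words, with the question's model object qw:

qw.Maximize(qw.sum(obj_function(x)))  # registers the objective; nothing runs yet
qw.solve()                            # actually invokes the solver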
If I see correctly, My_DataFrame has been defined in the global scope.
The problem is that obj_function tries to access it (which succeeds) and then modify its value (which fails).
This is because you can't modify global variables from a local scope by default.
Fix:
At the beginning of the obj_function, add a line:
def obj_function(Array):
    # comments
    global My_DataFrame
    for item ....  # remains the same
This should fix your problem.
Additional note:
If you just wanted to access My_DataFrame, it would work without any errors, and you wouldn't need the global keyword.
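A minimal illustration of that difference, with hypothetical names:

counter = 0

def read_counter():
    print(counter)    # reading a global name needs no declaration

def bump_counter():
    global counter    # required before rebinding the global name
    counter += 1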
Also, I just wanted to appreciate the effort you put into this. There's a proper explanation of what you want to do, relevant background information, an excellent diagram (the whiteboard is pretty great too), and even a minimal working example.
This is how all SO questions should be; it would make everyone's lives easier.
I'm facing a problem vectorizing a function so that it applies efficiently to a numpy array.
My program's inputs:
A pos_part 2D array of Nb_particles rows and 3 columns (basically x, y, z coordinates; only z is relevant for the part that bothers me). Nb_particles can be up to several hundred thousand.
A prop_part 1D array with Nb_particles values. This part I have covered; it is created with some nice numpy functions. Here I just use a basic distribution that resembles the real values.
A z_distances 1D array, a simple np.arange between z=0 and z=z_max.
Then comes the calculation that takes time, because I can't find a way to do it properly with only numpy array operations. What I want to do is:
For every distance z_i in z_distances, sum all values from prop_part whose corresponding particle coordinate satisfies z_particle < z_i. This returns a 1D array the same length as z_distances.
My ideas so far:
Version 0: a for loop, enumerate, and np.where to retrieve the indices of the values I need to sum. Obviously quite slow.
Version 1: a mask on a new array (a combination of z coordinates and particle properties), then a sum over the masked array. Seems better than v0.
Version 2: another mask and np.vectorize, though I understand that's not efficient since vectorize is basically a for loop. Still seems better than v0.
Version 3: I'm trying to use a mask in a function that I can apply directly to z_distances, but it's not working so far.
So, here I am. There may be something to do with a sort and a cumulative sum, but I don't know how to do it, so any help would be greatly appreciated. Please find the code below to make things clearer.
Thanks in advance.
import numpy as np
import time
import matplotlib.pyplot as plt
# Creation of particles' positions
Nb_part = 150_000
pos_part = 10*np.random.rand(Nb_part,3)
pos_part[:,0] = pos_part[:,1] = 0
# useful property creation
beta = 1/1.5
prop_part = (1/beta)*np.exp(-pos_part[:,2]/beta)
z_distances = np.arange(0,10,0.1)
# my version 0
t0 = time.time()
result = np.empty(len(z_distances))
for index_dist, val_dist in enumerate(z_distances):
    positions = np.where(pos_part[:,2] < val_dist)[0]
    result[index_dist] = sum(prop_part[i] for i in positions)
print("v0 :", time.time()-t0)
#A graph to help understand
plt.figure()
plt.plot(z_distances,result, c="red")
plt.ylabel("Sum of particles' useful property for particles with z-pos<d")
plt.xlabel("d")
# version 1
t1 = time.time()
combi = np.column_stack((pos_part[:,2], prop_part))
result2 = np.empty(len(z_distances))
for index_dist, val_dist in enumerate(z_distances):
    mask = (combi[:,0] < val_dist)
    result2[index_dist] = sum(combi[:,1][mask])
print("v1 :", time.time()-t1)
plt.plot(z_distances,result2, c="blue")
# version 2
t2 = time.time()
def themask(a):
    mask = (combi[:,0] < a)
    return sum(combi[:,1][mask])
thefunc = np.vectorize(themask)
result3 = thefunc(z_distances)
print("v2 :", time.time()-t2)
plt.plot(z_distances,result3, c="green")
### This does not work so far
# version 3
# =============================
# t3 = time.time()
# def thesum(a):
#     mask = combi[combi[:,0] < a]
#     return sum(mask[:,1])
# result4 = thesum(z_distances)
# print("v3 :", time.time()-t3)
# =============================
You can get a lot more performance by writing your first version completely in numpy: replace Python's built-in sum with np.sum, and instead of the for i in positions generator expression, simply index with the boolean mask you are creating anyway.
Indeed, the np.where is not necessary, and my best version looks like:
# my version 0, fully in numpy
t0 = time.time()
result = np.empty(len(z_distances))
for index_dist, val_dist in enumerate(z_distances):
    positions = pos_part[:, 2] < val_dist
    result[index_dist] = np.sum(prop_part[positions])
print("v0 :", time.time()-t0)
# out: v0 : 0.06322097778320312
If z_distances is very long, you can get a bit faster still by using numba.
Calling calc for the first time incurs some compilation overhead, which we can avoid by first running the function on a small slice of z_distances.
The code below achieves roughly a factor-of-two speedup over pure numpy on my laptop.
import numba as nb

@nb.njit(parallel=True)
def calc(result, z_distances):
    n = z_distances.shape[0]
    for ii in nb.prange(n):
        pos = pos_part[:, 2] < z_distances[ii]
        result[ii] = np.sum(prop_part[pos])
    return result

result4 = np.zeros_like(result)
# warm-up call on a small slice to trigger the JIT compilation before timing
calc(result4, z_distances[:10])
t3 = time.time()
result4 = calc(result4, z_distances)
print("v3 :", time.time()-t3)
plt.plot(z_distances, result4)
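For completeness, here is an untested sketch of the sort-and-cumulative-sum idea mentioned in the question, reusing pos_part, prop_part, and z_distances from above. Sorting once and doing one binary search per threshold makes this O(N log N) rather than O(N * len(z_distances)):

order = np.argsort(pos_part[:, 2])
z_sorted = pos_part[order, 2]               # particle z-coordinates, ascending
prop_cumsum = np.cumsum(prop_part[order])   # running sum of the sorted properties

# For each threshold d, count the particles with z < d ...
counts = np.searchsorted(z_sorted, z_distances, side='left')
# ... and read their property sum straight off the cumulative sum.
result5 = np.where(counts > 0, prop_cumsum[np.maximum(counts - 1, 0)], 0.0)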
So I'm trying to build a coupled map lattice on my computer.
A coupled map lattice (CML) is given by this equation:
x_{n+1}^{(i)} = (1 - \varepsilon)\, f(x_n^{(i)}) + \frac{\varepsilon}{2} \left[ f(x_n^{(i+1)}) + f(x_n^{(i-1)}) \right]
where the function f(x_n) is a logistic map:
f(x) = r x (1 - x)
with x taking values in [0, 1], and r = 4 for this CML.
Note: 'n' can be thought of as time, and 'i' as space.
I have spent a lot of time understanding the iterations, and I came up with the code below; however, I'm not sure it iterates the equation correctly.
Note: I have used 2D numpy arrays, where rows are 'n' and columns are 'i', as is obvious from the code.
So basically, I want to develop code to simulate this equation, and here is my take on that.
Don't jump straight to the code; you won't understand what's happening without looking at the equations first.
import numpy as np
import matplotlib.pyplot as plt
'''The definitions created below are all similar and only vary in their indexing. They
exist only because of the if conditions I have put in the for loop.'''
def logInit(r, x):
    y[n,0] = r*x[n,0]*(1-x[n,0])
    return y[n,0]

def logPresent(r, x):
    y[n,i] = r*x[n,i]*(1-x[n,i])
    return y[n,i]

def logLast(r, x):
    y[n,L-1] = r*x[n,L-1]*(1-x[n,L-1])
    return y[n,L-1]

def logNext(r, x):
    y[n,i+1] = r*x[n,i+1]*(1-x[n,i+1])
    return y[n,i+1]

def logPrev(r, x):
    y[n,i-1] = r*x[n,i-1]*(1-x[n,i-1])
    return y[n,i-1]
# 2d array with 4 rows, 3 cols. I created this because I want to store the
# evaluated values of the logistic function in this y[n,i] array
y = np.ones(12).reshape(4,3)
# creating an array of random numbers between 0-1 with 4 rows 3 columns
np.random.seed(0)
x=np.random.random((4,3))
L=3
r=4
eps=0.5
for n in range(3):
    for i in range(L):
        if i == 0:
            x[n+1,i] = (1-eps)*logPresent(r,x) + 0.5*eps*(logLast(r,x) + logNext(r,x))
        elif i == L-1:
            x[n+1,i] = (1-eps)*logPresent(r,x) + 0.5*eps*(logPrev(r,x) + logInit(r,x))
        elif i > 0 and i < L-1:
            x[n+1,i] = (1-eps)*logPresent(r,x) + 0.5*eps*(logPrev(r,x) + logNext(r,x))
    print(x)
This does give an output. Here it is:
[[0.5488135 0.71518937 0.60276338]
[0.94538775 0.82547604 0.64589411]
[0.43758721 0.891773 0.96366276]
[0.38344152 0.79172504 0.52889492]]
[[0.5488135 0.71518937 0.60276338]
[0.94538775 0.82547604 0.92306303]
[0.2449672 0.49731638 0.96366276]
[0.38344152 0.79172504 0.52889492]]
[[0.5488135 0.71518937 0.60276338]
[0.94538775 0.82547604 0.92306303]
[0.2449672 0.49731638 0.29789622]
[0.75613708 0.93368134 0.52889492]]
But I'm fairly sure this is not what I'm looking for.
Could you please figure out a correct way to iterate and loop the CML equation in code, and suggest the changes I should make? Thank you very much!
You'll have to think about the iterations and looping needed to simulate this equation. It might be tedious, but that's the only way to suggest changes to my code.
Your calculations seem fine to me. You could improve the speed by using vectorization along the space dimension and by reusing your intermediate results y. I restructured your program a little, but in essence it does the same thing as before. For me the results look plausible. The image shows the random initial vector in the first row, and as time goes on (top to bottom) the coupling comes into play and little islands and patterns form.
import numpy as np
import matplotlib.pyplot as plt
L = 128 # grid size
N = 128 # time steps
r = 4
eps = 0.5
# Create random values for the initial time step
np.random.seed(0)
x = np.zeros((N+1, L))
x[0, :] = np.random.random(L)
# Create a helper matrix to save and reuse part of the calculations
y = np.zeros((N, L))
# Indices for previous, present, next position for every point on the grid
idx_present = np.arange(L) # 0, 1, ..., L-2, L-1
idx_next = (idx_present + 1) % L # 1, 2, ..., L-1, 0
idx_prev = (idx_present - 1) % L # L-1, 0, ..., L-3, L-2
def log_vector(rr, xx):
    return rr * xx * (1 - xx)
# Loop over the time steps
for n in range(N):
    # Compute y once for the whole time step and reuse it
    # to build the next time step with coupling to the neighbours
    y[n, :] = log_vector(rr=r, xx=x[n, :])
    x[n+1, :] = (1-eps)*y[n, idx_present] + 0.5*eps*(y[n, idx_prev] + y[n, idx_next])
# Plot the results
plt.imshow(x)
I have test and train sets with the following dimensions, where all features (i.e. columns) are integers.
X_train.shape
(990188L, 19L)
X_test.shape
(424367L, 19L)
I want to find the euclidean distance between all the rows of the train set and all the rows of the test set.
I also have to remove from the train set the rows that lie within a distance threshold of 0.005 of a test row.
I have the following linear code, which is too slow but works fine.
for a in range(X_test.shape[0]):
    a_test = np_Test[a]
    for b in range(X_train.shape[0]):
        a_train = np_Train[b]
        if a != b:
            dst = distance.euclidean(a_test, a_train)
            if dst <= 0.005:
                train.append(b)
where I note down the indices of the rows that lie within the distance threshold.
Is there any way to parallelize this code?
I tried using from sklearn.metrics.pairwise import euclidean_distances,
but as the data set is huge, I get a memory error.
I tried to apply euclidean_distances in batches, but somehow I think the following code is not working correctly.
Please help me if there is any way to parallelize it.
rows = X_train.shape[0]
rem = rows % 1000
no = rows // 1000
i = 0
while i <= no*1000:
    dst_mat = euclidean_distances(X_train[i:i+1000, :], X_test)
    condition = np.any(dst_mat <= 0.005, axis=1)
    index = np.where(condition)[0]
    index = np.add(index, i)
    print(index)
    print(dst_mat)
    i += 1000
Use scipy.spatial.cdist. This will calculate the pairwise distance.
Thanks to Warren Weckesser for pointing out this solution.
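A hedged sketch of how cdist could be combined with the chunking idea so the full distance matrix never has to fit in memory (the chunk size here is an assumption; each block takes roughly chunk * n_test * 8 bytes, so tune it to your RAM):

import numpy as np
from scipy.spatial.distance import cdist

threshold = 0.005
chunk = 100
close_rows = []  # train-row indices within `threshold` of any test row
for start in range(0, X_train.shape[0], chunk):
    block = cdist(X_train[start:start + chunk], X_test)  # pairwise Euclidean distances
    hit = np.any(block <= threshold, axis=1)
    close_rows.extend(np.nonzero(hit)[0] + start)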
I have a symmetric matrix (1877 x 1877); here is the matrix file. I am trying to rescale the values to between 0 and 1. After I apply this method, the matrix is no longer symmetric. Any help is appreciated.
from sklearn import preprocessing  # needed for MinMaxScaler

print((dist.transpose() == dist).all())  # this prints 'True'

def sci_minmax(X):
    minmax_scale = preprocessing.MinMaxScaler()
    return minmax_scale.fit_transform(X)

sci_dist_scaled = sci_minmax(dist)
(sci_dist_scaled.transpose() == sci_dist_scaled).all()  # this prints 'False'
sci_dist_scaled.dtype, dist.dtype  # (dtype('float64'), dtype('float64'))
Looking at this description, MinMaxScaler appears to work column by column, so naturally you can't expect it to preserve symmetry.
What's best to do in your case depends a bit on what you are trying to achieve, really. If having the values between 0 and 1 is all you require, you can rescale by hand:
mn, mx = dist.min(), dist.max()
dist01 = (dist - mn) / (mx - mn)
but depending on your ultimate problem this may be too simplistic...
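For instance, a quick check on a small random symmetric matrix (a hypothetical example, not the 1877 x 1877 matrix from the question) shows that the global rescale preserves symmetry, since every entry goes through the same affine map:

import numpy as np

a = np.random.rand(5, 5)
dist = (a + a.T) / 2                  # make it symmetric
mn, mx = dist.min(), dist.max()
dist01 = (dist - mn) / (mx - mn)      # global min-max rescale
print(np.allclose(dist01, dist01.T))  # True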