Discrepancies between R optim and SciPy optimize: Nelder-Mead

I wrote a script that I believe should produce the same results in Python and R, but they are producing very different answers. Each attempts to fit a model to simulated data by minimizing deviance using Nelder-Mead. Overall, optim in R is performing much better. Am I doing something wrong? Are the algorithms implemented in R and SciPy different?
Python result:
>>> res = minimize(choiceProbDev, sparams, (stim, dflt, dat, N), method='Nelder-Mead')
final_simplex: (array([[-0.21483287, -1. , -0.4645897 , -4.65108495],
[-0.21483909, -1. , -0.4645915 , -4.65114839],
[-0.21485426, -1. , -0.46457789, -4.65107337],
[-0.21483727, -1. , -0.46459331, -4.65115965],
[-0.21484398, -1. , -0.46457725, -4.65099805]]), array([107.46037865, 107.46037868, 107.4603787 , 107.46037875,
107.46037875]))
fun: 107.4603786452194
message: 'Optimization terminated successfully.'
nfev: 349
nit: 197
status: 0
success: True
x: array([-0.21483287, -1. , -0.4645897 , -4.65108495])
R result:
> res <- optim(sparams, choiceProbDev, stim=stim, dflt=dflt, dat=dat, N=N,
method="Nelder-Mead")
$par
[1] 0.2641022 1.0000000 0.2086496 3.6688737
$value
[1] 110.4249
$counts
function gradient
329 NA
$convergence
[1] 0
$message
NULL
I've checked over my code and as far as I can tell this appears to be due to some difference between optim and minimize because the function I'm trying to minimize (i.e., choiceProbDev) operates the same in each (besides the output, I've also checked the equivalence of each step within the function). See for example:
Python choiceProbDev:
>>> choiceProbDev(np.array([0.5, 0.5, 0.5, 3]), stim, dflt, dat, N)
143.31438613033876
R choiceProbDev:
> choiceProbDev(c(0.5, 0.5, 0.5, 3), stim, dflt, dat, N)
[1] 143.3144
I've also tried to play around with the tolerance levels for each optimization function, but I'm not entirely sure how the tolerance arguments match up between the two. Either way, my fiddling so far hasn't brought the two into agreement. Here is the entire code for each.
Python:
# load modules
import math
import numpy as np
from scipy.optimize import minimize
from scipy.stats import binom
# initialize values
dflt = 0.5
N = 1
# set the known parameter values for generating data
b = 0.1
w1 = 0.75
w2 = 0.25
t = 7
theta = [b, w1, w2, t]
# generate stimuli
stim = np.array(np.meshgrid(np.arange(0, 1.1, 0.1),
np.arange(0, 1.1, 0.1))).T.reshape(-1,2)
# starting values
sparams = [-0.5, -0.5, -0.5, 4]
# generate probability of accepting proposal
def choiceProb(stim, dflt, theta):
    utilProp = theta[0] + theta[1]*stim[:,0] + theta[2]*stim[:,1] # proposal utility
    utilDflt = theta[1]*dflt + theta[2]*dflt # default utility
    choiceProb = 1/(1 + np.exp(-1*theta[3]*(utilProp - utilDflt))) # probability of choosing proposal
    return choiceProb
# calculate deviance
def choiceProbDev(theta, stim, dflt, dat, N):
    # restrict b, w1, w2 weights to between -1 and 1
    if any([x > 1 or x < -1 for x in theta[:-1]]):
        return 10000
    # initialize
    nDat = dat.shape[0]
    dev = np.array([np.nan]*nDat)
    # for each trial, calculate deviance
    p = choiceProb(stim, dflt, theta)
    lk = binom.pmf(dat, N, p)
    for i in range(nDat):
        if math.isclose(lk[i], 0):
            dev[i] = 10000
        else:
            dev[i] = -2*np.log(lk[i])
    return np.sum(dev)
# simulate data
probs = choiceProb(stim, dflt, theta)
# randomly generated data based on the calculated probabilities
# dat = np.random.binomial(1, probs, probs.shape[0])
dat = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1,
0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1,
0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1,
0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
# fit model
res = minimize(choiceProbDev, sparams, (stim, dflt, dat, N), method='Nelder-Mead')
R:
library(tidyverse)
# initialize values
dflt <- 0.5
N <- 1
# set the known parameter values for generating data
b <- 0.1
w1 <- 0.75
w2 <- 0.25
t <- 7
theta <- c(b, w1, w2, t)
# generate stimuli
stim <- expand.grid(seq(0, 1, 0.1),
seq(0, 1, 0.1)) %>%
dplyr::arrange(Var1, Var2)
# starting values
sparams <- c(-0.5, -0.5, -0.5, 4)
# generate probability of accepting proposal
choiceProb <- function(stim, dflt, theta){
utilProp <- theta[1] + theta[2]*stim[,1] + theta[3]*stim[,2] # proposal utility
utilDflt <- theta[2]*dflt + theta[3]*dflt # default utility
choiceProb <- 1/(1 + exp(-1*theta[4]*(utilProp - utilDflt))) # probability of choosing proposal
return(choiceProb)
}
# calculate deviance
choiceProbDev <- function(theta, stim, dflt, dat, N){
# restrict b, w1, w2 weights to between -1 and 1
if (any(theta[1:3] > 1 | theta[1:3] < -1)){
return(10000)
}
# initialize
nDat <- length(dat)
dev <- rep(NA, nDat)
# for each trial, calculate deviance
p <- choiceProb(stim, dflt, theta)
lk <- dbinom(dat, N, p)
for (i in 1:nDat){
if (dplyr::near(lk[i], 0)){
dev[i] <- 10000
} else {
dev[i] <- -2*log(lk[i])
}
}
return(sum(dev))
}
# simulate data
probs <- choiceProb(stim, dflt, theta)
# same data as in python script
dat <- c(0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1,
0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1,
0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1,
0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
# fit model
res <- optim(sparams, choiceProbDev, stim=stim, dflt=dflt, dat=dat, N=N,
method="Nelder-Mead")
UPDATE:
After printing the estimates at each iteration, it now appears to me that the discrepancy might stem from differences in 'step sizes' that each algorithm takes. Scipy appears to take smaller steps than optim (and in a different initial direction). I haven't figured out how to adjust this.
Python:
>>> res = minimize(choiceProbDev, sparams, (stim, dflt, dat, N), method='Nelder-Mead')
[-0.5 -0.5 -0.5 4. ]
[-0.525 -0.5 -0.5 4. ]
[-0.5 -0.525 -0.5 4. ]
[-0.5 -0.5 -0.525 4. ]
[-0.5 -0.5 -0.5 4.2]
[-0.5125 -0.5125 -0.5125 3.8 ]
...
R:
> res <- optim(sparams, choiceProbDev, stim=stim, dflt=dflt, dat=dat, N=N, method="Nelder-Mead")
[1] -0.5 -0.5 -0.5 4.0
[1] -0.1 -0.5 -0.5 4.0
[1] -0.5 -0.1 -0.5 4.0
[1] -0.5 -0.5 -0.1 4.0
[1] -0.5 -0.5 -0.5 4.4
[1] -0.3 -0.3 -0.3 3.6
...
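The two traces above differ in the starting simplex: SciPy perturbs each coordinate by 5% (-0.5 becomes -0.525, 4 becomes 4.2), while optim adds the same step of 0.4 to every coordinate. Below is a minimal sketch of how one might hand SciPy an optim-like starting simplex through the `initial_simplex` option of its Nelder-Mead method. The step rule (0.1 times the largest absolute starting value) is only inferred from the R trace and is an assumption, and both helper names are hypothetical.

```python
import numpy as np

def r_style_initial_simplex(x0, step=None):
    """Return an (n+1, n) simplex: x0, plus one vertex per coordinate shifted by `step`."""
    x0 = np.asarray(x0, dtype=float)
    if step is None:
        # Assumption inferred from the R trace: step = 0.1 * max|x0| (0.4 for this start)
        step = 0.1 * np.max(np.abs(x0))
        if step == 0.0:
            step = 0.1
    simplex = np.tile(x0, (x0.size + 1, 1))
    simplex[1:] += step * np.eye(x0.size)
    return simplex

def minimize_with_r_style_simplex(fun, x0, args=()):
    # SciPy is imported here so the helper above works with NumPy alone
    from scipy.optimize import minimize
    return minimize(fun, x0, args=args, method="Nelder-Mead",
                    options={"initial_simplex": r_style_initial_simplex(x0)})

# Reproduces the first five points of the R trace
print(r_style_initial_simplex([-0.5, -0.5, -0.5, 4.0]))
```

Matching the starting simplex does not guarantee identical paths afterwards, since the two implementations may also differ in their termination tests.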

This isn't exactly an answer of "what are the optimizer differences", but I want to contribute some exploration of the optimization problem here. A few take-home points:
- The surface is smooth, so derivative-based optimizers might work better, even without an explicitly coded gradient function (i.e. falling back on a finite-difference approximation); they'd be even better with a gradient function.
- The surface is symmetric, so it has multiple optima (apparently two), but it's not highly multimodal or rough, so I don't think a stochastic global optimizer would be worth the trouble.
- For optimization problems that aren't too high-dimensional or expensive to compute, it's feasible to visualize the global surface to understand what's going on.
- For optimization with bounds, it's generally better either to use an optimizer that explicitly handles bounds or to change the scale of the parameters to an unconstrained scale.
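As an illustration of the last point, here is a minimal Python sketch of moving the three bounded weights to an unconstrained scale with a tanh transform, so that the optimizer never sees the hard penalty of 10000. The function names are hypothetical, and tanh is just one possible choice of transform.

```python
import numpy as np

def to_constrained(u):
    """Map an unconstrained vector u to theta with b, w1, w2 between -1 and 1; t stays free."""
    u = np.asarray(u, dtype=float)
    theta = u.copy()
    theta[:3] = np.tanh(u[:3])
    return theta

def to_unconstrained(theta):
    """Inverse map, used to convert starting values to the unconstrained scale."""
    theta = np.asarray(theta, dtype=float)
    u = theta.copy()
    u[:3] = np.arctanh(theta[:3])
    return u

def on_unconstrained_scale(objective):
    """Wrap objective(theta, *args) so that an optimizer can search over u instead."""
    def wrapped(u, *args):
        return objective(to_constrained(u), *args)
    return wrapped

u0 = to_unconstrained([-0.5, -0.5, -0.5, 4.0])
print(u0, to_constrained(u0))
```

The wrapped objective can then be passed to any unconstrained optimizer, and the fitted u is mapped back with `to_constrained`.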
Here's a picture of the whole surface:
The red contours mark deviance (-2 log-likelihood) values of 110, 115 and 120; the best fit I could get had a deviance of 105.7. The best points are in the second column, third row (achieved by L-BFGS-B) and fifth column, fourth row (true parameter values). (I haven't inspected the objective function to see where the symmetries come from, but I think it would probably be clear.) Python's Nelder-Mead and R's Nelder-Mead do approximately equally badly.
parameters and problem setup
## initialize values
dflt <- 0.5; N <- 1
# set the known parameter values for generating data
b <- 0.1; w1 <- 0.75; w2 <- 0.25; t <- 7
theta <- c(b, w1, w2, t)
# generate stimuli
stim <- expand.grid(seq(0, 1, 0.1), seq(0, 1, 0.1))
# starting values
sparams <- c(-0.5, -0.5, -0.5, 4)
# same data as in python script
dat <- c(0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1,
0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1,
0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1,
0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
objective functions
Note the use of built-in functions (plogis(), dbinom(..., log=TRUE)) where possible.
# generate probability of accepting proposal
choiceProb <- function(stim, dflt, theta){
utilProp <- theta[1] + theta[2]*stim[,1] + theta[3]*stim[,2] # proposal utility
utilDflt <- theta[2]*dflt + theta[3]*dflt # default utility
choiceProb <- plogis(theta[4]*(utilProp - utilDflt)) # probability of choosing proposal
return(choiceProb)
}
# calculate deviance
choiceProbDev <- function(theta, stim, dflt, dat, N){
# restrict b, w1, w2 weights to between -1 and 1
if (any(theta[1:3] > 1 | theta[1:3] < -1)){
return(10000)
}
## for each trial, calculate deviance
p <- choiceProb(stim, dflt, theta)
lk <- dbinom(dat, N, p, log=TRUE)
return(sum(-2*lk))
}
# simulate data
probs <- choiceProb(stim, dflt, theta)
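For readers following along in Python, the same log-scale idea can be sketched with NumPy alone. This is an assumption-laden counterpart, not a translation of the R code: the names `choice_logit` and `bernoulli_deviance` are hypothetical, it only covers the N = 1 case used in the question, and it leaves out the penalty for weights outside [-1, 1].

```python
import numpy as np

def choice_logit(stim, dflt, theta):
    """Log-odds of choosing the proposal: t * (utilProp - utilDflt)."""
    b, w1, w2, t = theta
    util_prop = b + w1 * stim[:, 0] + w2 * stim[:, 1]
    util_dflt = w1 * dflt + w2 * dflt
    return t * (util_prop - util_dflt)

def bernoulli_deviance(theta, stim, dflt, dat):
    """-2 * log-likelihood for 0/1 data, computed without forming probabilities."""
    z = choice_logit(stim, dflt, theta)
    log_p = -np.logaddexp(0.0, -z)   # log(sigmoid(z))
    log_q = -np.logaddexp(0.0, z)    # log(1 - sigmoid(z))
    return -2.0 * np.sum(dat * log_p + (1 - dat) * log_q)
```

Working on the log scale means no likelihood is ever rounded to zero, so the special case that replaces a zero likelihood with 10000 is no longer needed.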
model fitting
# fit model
res <- optim(sparams, choiceProbDev, stim=stim, dflt=dflt, dat=dat, N=N,
method="Nelder-Mead")
## try derivative-based, box-constrained optimizer
res3 <- optim(sparams, choiceProbDev, stim=stim, dflt=dflt, dat=dat, N=N,
lower=c(-1,-1,-1,-Inf), upper=c(1,1,1,Inf),
method="L-BFGS-B")
py_coefs <- c(-0.21483287, -0.4645897 , -1, -4.65108495) ## transposed?
true_coefs <- c(0.1, 0.25, 0.75, 7) ## transposed?
## start from python coeffs
res2 <- optim(py_coefs, choiceProbDev, stim=stim, dflt=dflt, dat=dat, N=N,
method="Nelder-Mead")
explore log-likelihood surface
cc <- expand.grid(seq(-1,1,length.out=51),
seq(-1,1,length.out=6),
seq(-1,1,length.out=6),
seq(-8,8,length.out=51))
## utility function for combining parameter values
bfun <- function(x,grid_vars=c("Var2","Var3"),grid_rng=seq(-1,1,length.out=6),
type=NULL) {
if (is.list(x)) {
v <- c(x$par,x$value)
} else if (length(x)==4) {
v <- c(x,NA)
}
res <- as.data.frame(rbind(setNames(v,c(paste0("Var",1:4),"z"))))
for (v in grid_vars)
res[,v] <- grid_rng[which.min(abs(grid_rng-res[,v]))]
if (!is.null(type)) res$type <- type
res
}
resdat <- rbind(bfun(res3,type="R_LBFGSB"),
bfun(res,type="R_NM"),
bfun(py_coefs,type="Py_NM"),
bfun(true_coefs,type="true"))
cc$z <- apply(cc,1,function(x) choiceProbDev(unlist(x), dat=dat, stim=stim, dflt=dflt, N=N))
library(ggplot2)
library(viridisLite)
ggplot(cc,aes(Var1,Var4,fill=z))+
geom_tile()+
facet_grid(Var2~Var3,labeller=label_both)+
scale_fill_viridis_c()+
scale_x_continuous(expand=c(0,0))+
scale_y_continuous(expand=c(0,0))+
theme(panel.spacing=grid::unit(0,"lines"))+
geom_contour(aes(z=z),colour="red",breaks=seq(105,120,by=5),alpha=0.5)+
geom_point(data=resdat,aes(colour=type,shape=type))+
scale_colour_brewer(palette="Set1")
ggsave("liksurf.png",width=8,height=8)

'Nelder-Mead' has always been a problematic optimization method, and its coding in optim is not up-to-date. We will try three other implementations available in R packages.
To avoid passing the other parameters, let's define a function fn as
fn <- function(theta)
choiceProbDev(theta, stim=stim, dflt=dflt, dat=dat, N=N)
Then the solvers dfoptim::nmk(), adagio::neldermead(), and pracma::anms() will all return the same minimum value, 105.7843, but at different positions, for instance
dfoptim::nmk(sparams, fn)
## $par
## [1] 0.1274937 0.6671353 0.1919542 8.1731618
## $value
## [1] 105.7843
These are real local minima while, for example, the Python solution 107.46038 at c(-0.21483287,-1.0,-0.4645897,-4.65108495) is not. Your problem data are obviously not sufficient for fitting the model.
You might try a global optimizer to possibly find a global optimum within certain bounds. To me it looks like all local minima have the same minimum value.


Optimal combination of linked-buckets

Let's say I have the following (always binary) options:
import numpy as np
a=np.array([1, 1, 0, 0, 1, 1, 1])
b=np.array([1, 1, 0, 0, 1, 0, 1])
c=np.array([1, 0, 0, 1, 0, 0, 0])
d=np.array([1, 0, 1, 1, 0, 0, 0])
And I want to find the optimal combination of the above that gets me to at least the following requirement, with minimal excess:
req = np.array([50,50,20,20,100,40,10])
For example:
final = X1*a + X2*b + X3*c + X4*d
Does this map to a known operations research problem? Or does it fall under mathematical programming?
Is this NP-hard, or exactly solvable in a reasonable amount of time? (I've assumed it's combinatorially impossible to solve exactly.)
Are there known solutions to this?
Note: The actual length of arrays are longer - think ~50, and the number of options are ~20
My current research has led me to some combination of the assignment problem and knapsack, but not too sure.
It's a covering problem, easily solvable using an integer program solver (I used OR-Tools below). If the X variables can be fractional, substitute NumVar for IntVar. If the X variables are 0--1, substitute BoolVar.
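Written out, the model can be read as the following integer program. This is a sketch of the formulation with symbols introduced here for illustration (they are not from the question): A_{ij} is the j-th entry of option i, r_j is the j-th requirement, x_i is the multiplier of option i, and y_j is the excess in position j.

```latex
\begin{aligned}
\min_{x,\,y}\quad & \sum_{j} y_j \\
\text{s.t.}\quad & y_j = \sum_{i} A_{ij}\, x_i - r_j && \text{for all } j, \\
& y_j \ge 0 && \text{for all } j, \\
& x_i \in \mathbb{Z}_{\ge 0} && \text{for all } i.
\end{aligned}
```

Requiring y_j to be nonnegative is what enforces "at least req" in every position, and the objective sums the excess.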
import numpy as np
a = np.array([1, 1, 0, 0, 1, 1, 1])
b = np.array([1, 1, 0, 0, 1, 0, 1])
c = np.array([1, 0, 0, 1, 0, 0, 0])
d = np.array([1, 0, 1, 1, 0, 0, 0])
opt = [a, b, c, d]
req = np.array([50, 50, 20, 20, 100, 40, 10])
from ortools.linear_solver import pywraplp
solver = pywraplp.Solver.CreateSolver("SCIP")
x = [solver.IntVar(0, solver.infinity(), "x{}".format(i)) for i in range(len(opt))]
extra = [solver.NumVar(0, solver.infinity(), "y{}".format(j)) for j in range(len(req))]
for j, (req_j, extra_j) in enumerate(zip(req, extra)):
    solver.Add(extra_j == sum(opt_i[j] * x_i for (opt_i, x_i) in zip(opt, x)) - req_j)
solver.Minimize(sum(extra))
status = solver.Solve()
if status == pywraplp.Solver.OPTIMAL:
    print("Solution:")
    print("Objective value =", solver.Objective().Value())
    for i, x_i in enumerate(x):
        print("x{} = {}".format(i, x_i.solution_value()))
else:
    print("The problem does not have an optimal solution.")
Output:
Solution:
Objective value = 210.0
x0 = 40.0
x1 = 60.0
x2 = -0.0
x3 = 20.0
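The reported solution can be checked by hand with a few lines of NumPy. This is only a sanity check of the numbers printed above, not part of the solver code; the variable names are chosen here for illustration.

```python
import numpy as np

options = np.array([
    [1, 1, 0, 0, 1, 1, 1],   # a
    [1, 1, 0, 0, 1, 0, 1],   # b
    [1, 0, 0, 1, 0, 0, 0],   # c
    [1, 0, 1, 1, 0, 0, 0],   # d
])
req = np.array([50, 50, 20, 20, 100, 40, 10])
x = np.array([40, 60, 0, 20])   # multipliers reported by the solver

final = x @ options             # X1*a + X2*b + X3*c + X4*d
surplus = final - req
print(final.tolist())           # [120, 100, 20, 20, 100, 40, 100]
print(int(surplus.sum()))       # 210
```

Every entry of `final` is at least the matching entry of `req`, and the total excess equals the objective value of 210.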

Compute precision and accuracy using numpy

Suppose two lists true_values = [1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0] and predictions = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0]. How can I compute the accuracy and the precision using numpy?
import numpy as np
true_values = np.array([[1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0]])
predictions = np.array([[1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0]])
N = true_values.shape[1]
accuracy = (true_values == predictions).sum() / N
TP = ((predictions == 1) & (true_values == 1)).sum()
FP = ((predictions == 1) & (true_values == 0)).sum()
precision = TP / (TP+FP)
This is the most concise way I came up with (assuming no sklearn), there might be even shorter though!
This is what sklearn, which uses numpy behind the curtain, is for:
from sklearn.metrics import precision_score, accuracy_score
accuracy_score(true_values, predictions), precision_score(true_values, predictions)
Output:
(0.3333333333333333, 0.375)
I'm only going to answer for precision, because I posted a duplicate for accuracy and there should be one question per thread:
sum(map(lambda x, y: x == y == 1, true_values, predictions))/sum(predictions)
0.375
Use np.sum if you absolutely want to use Numpy
And for the accuracy, take the mean of the element-wise matches:
np.equal(true_values, predictions).mean()
If you really want to calculate it yourself instead of using a library like in Quang Hoang's answer, just count the number of (true|false) (positives|negatives) and plug the values into your formula:
tp = 0
tn = 0
fp = 0
fn = 0
for t, p in zip(true_values, predictions):
    if t == p:
        if p == 1:
            tp += 1
        else:
            tn += 1
    else:
        if p == 1:
            fp += 1  # predicted 1, actual 0
        else:
            fn += 1  # predicted 0, actual 1
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
This wikipedia article has a nice info box with formulas that use exactly these values.

How to implement stdp in tensorflow?

I'm trying to implement STDP (Spike-Timing Dependent Plasticity) in tensorflow. It's a bit complicated. Any ideas (to get running entirely within a tensorflow graph)?
It works like this: say I have 2 input neurons, and they connect to 3 output neurons, via this matrix: [[1.0, 1.0, 0.0], [0.0, 0.0, 1.0]] (input neuron 0 connects to output neurons 0 and 1...).
Say I have these spikes for the input neurons (2 neurons, 7 timesteps):
Input Spikes:
[[0, 0, 1, 1, 0, 1, 0],
[1, 1, 0, 0, 0, 0, 1]]
And these spikes for the output neurons (3 neurons, 7 timesteps):
Output Spikes:
[[0, 0, 0, 1, 0, 0, 1],
[1, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 1, 1, 1]]
Now, for each non-zero weight, I want to compute a dw. For instance, for input neuron 0 connecting to output neuron 0:
The time stamps of the spikes for input neuron 0 are [2, 3, 5], and the timestamps for output neuron 0 are [3, 6]. Now, I compute all the delta times:
Delta Times = [ 2-3, 2-6, 3-3, 3-6, 5-3, 5-6 ] = [ -1, -4, 0, -3, 2, -1 ]
Then, I compute some function (the actual STDP function, which isn't important for this question - some exponential thing)
dw = SUM [ F(-1), F(-4), F(0), F(-3), F(2), F(-1) ]
And that's the dw for the weight connecting input neuron 0 to output neuron 0. Repeat for all non-zero weights.
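For reference, the delta-time step for a single pair of neurons can be sketched in NumPy as an outer subtraction of the two timestamp lists. The helper name `delta_times` is hypothetical; the spike trains are the ones given above.

```python
import numpy as np

input_spikes = np.array([[0, 0, 1, 1, 0, 1, 0],
                         [1, 1, 0, 0, 0, 0, 1]])
output_spikes = np.array([[0, 0, 0, 1, 0, 0, 1],
                          [1, 0, 0, 0, 0, 0, 0],
                          [0, 0, 0, 0, 1, 1, 1]])

def delta_times(in_row, out_row):
    """All pairwise differences (input spike time - output spike time)."""
    t_in = np.flatnonzero(in_row)     # e.g. [2, 3, 5]
    t_out = np.flatnonzero(out_row)   # e.g. [3, 6]
    return np.subtract.outer(t_in, t_out).ravel()

print(delta_times(input_spikes[0], output_spikes[0]).tolist())
# [-1, -4, 0, -3, 2, -1]
```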
So I can do all this in numpy, but I'd like to be able to do it entirely within a single tensorflow graph. In particular, I'm stuck on computing the delta times. And how to do all this for all non-zero weights, in parallel.
This is the actual stdp function, btw (the constants can be parameters):
def stdp_f(x):
    return tf.where(
        x == 0, np.zeros(x.shape), tf.where(
            x > 0, 1.0 * tf.exp(-1.0 * x / 10.0), -1.0 * 1.0 * tf.exp(x / 10.0)))
A note on performance: the method given by @jdehesa, below, is both correct and clever. But it also turns out to be slow. In particular, for a real neural network of 784 input neurons feeding into 400 neurons, over 500 time steps, the spike_match = step multiplies (784, 1, 500, 1) and (1, 400, 1, 500) tensors, which broadcasts to a (784, 400, 500, 500) result.
I am not familiar with STDP, so I hope I understood correctly what you meant. I think this does what you describe:
import tensorflow as tf
def f(x):
    # STDP function
    return x * 1
def stdp(input_spikes, output_spikes):
    input_shape = tf.shape(input_spikes)
    t = input_shape[-1]
    # Compute STDP function for all possible time difference values
    stdp_values = f(tf.cast(tf.range(-t + 1, t), dtype=input_spikes.dtype))
    # Arrange in matrix such that position [i, j] contains f(i - j)
    matrix_idx = tf.expand_dims(tf.range(t - 1, 2 * t - 1), 1) + tf.range(0, -t, -1)
    stdp_matrix = tf.gather(stdp_values, matrix_idx)
    # Find spike matches
    spike_match = (input_spikes[:, tf.newaxis, :, tf.newaxis] *
                   output_spikes[tf.newaxis, :, tf.newaxis, :])
    # Sum values where there are spike matches
    return tf.reduce_sum(spike_match * stdp_matrix, axis=(2, 3))
# Test
input_spikes = [[0, 0, 1, 1, 0, 1, 0],
[1, 1, 0, 0, 0, 0, 1]]
output_spikes = [[0, 0, 0, 1, 0, 0, 1],
[1, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 1, 1, 1]]
with tf.Graph().as_default(), tf.Session() as sess:
    ins = tf.placeholder(tf.float32, [None, None])
    outs = tf.placeholder(tf.float32, [None, None])
    res = stdp(ins, outs)
    res_val = sess.run(res, feed_dict={ins: input_spikes, outs: output_spikes})
    print(res_val)
# [[ -7. 10. -15.]
# [-13. 7. -24.]]
Here I assume that f is probably expensive (and that its value is the same for every pair of neurons), so I compute it only once for every possible time delta and then redistribute the computed values in a matrix, so I can multiply at the pairs of coordinates where the input and output spikes happen.
I used the identity function for f as a placeholder, so the resulting values are actually just the sum of time differences in this case.
EDIT: Just for reference, replacing f with the STDP function you included:
def f(x):
    return tf.where(x == 0,
                    tf.zeros_like(x),
                    tf.where(x > 0,
                             1.0 * tf.exp(-1.0 * x / 10.0),
                             -1.0 * 1.0 * tf.exp(x / 10.0)))
The result is:
[[-3.4020822 2.1660795 -5.694256 ]
[-2.974073 0.45364904 -3.1197631 ]]
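Since the sum over spike matches is linear in both spike trains, the same quantity can also be written as two matrix products, which avoids materialising the 4-D intermediate. The following is a NumPy sketch of that rearrangement, not the method from the answer above; `stdp_dw` is a hypothetical name, and `f` is any function applied element-wise to the matrix of time differences. Like the code above, it does not mask by the connectivity matrix.

```python
import numpy as np

def stdp_dw(input_spikes, output_spikes, f):
    """dw[a, b] = sum over i, j of in[a, i] * f(i - j) * out[b, j]."""
    input_spikes = np.asarray(input_spikes, dtype=float)     # (n_in, T)
    output_spikes = np.asarray(output_spikes, dtype=float)   # (n_out, T)
    t = np.arange(input_spikes.shape[1])
    f_matrix = f((t[:, None] - t[None, :]).astype(float))    # f_matrix[i, j] = f(i - j)
    return input_spikes @ f_matrix @ output_spikes.T          # (n_in, n_out)

input_spikes = [[0, 0, 1, 1, 0, 1, 0],
                [1, 1, 0, 0, 0, 0, 1]]
output_spikes = [[0, 0, 0, 1, 0, 0, 1],
                 [1, 0, 0, 0, 0, 0, 0],
                 [0, 0, 0, 0, 1, 1, 1]]
# With the identity as a placeholder for f, this gives the sums of time differences
print(stdp_dw(input_spikes, output_spikes, lambda d: d))
```

The same two products can be expressed in TensorFlow with tf.matmul; whether this is faster in practice for the 784 x 400 x 500 case would need to be measured.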

Multi layer perceptron weights not changing

I am fairly new to machine learning and started off with the book Machine Learning: An Algorithmic Perspective. I am trying to make a logistic classifier that distinguishes malicious programs from benign ones by tweaking the code given on the book's website. However, the weights associated with the hidden layer and output layer are not changing even after 100000 epochs.
I have tried running the algorithm with the complete dataset as well as a partial version of it, still no luck.
Here is my MLP class
import numpy as np
class mlp:
    def __init__(self, inputs, targets, nhidden, beta=1, momentum=0.9, outtype='logistic'):
        """ Constructor """
        # Set up network size
        self.nin = np.shape(inputs)[1]
        self.nout = np.shape(targets)[1]
        self.ndata = np.shape(inputs)[0]
        self.nhidden = nhidden
        self.beta = beta
        self.momentum = momentum
        self.outtype = outtype
        # Initialise network
        self.weights1 = (np.zeros((self.nin + 1, self.nhidden), dtype=float) - 0.5) * 2 / np.sqrt(self.nin)
        self.weights2 = (np.zeros((self.nhidden + 1, self.nout), dtype=float) - 0.5) * 2 / np.sqrt(self.nhidden)
    def earlystopping(self, inputs, targets, valid, validtargets, eta, niterations=100):
        valid = np.concatenate((valid, -np.ones((np.shape(valid)[0], 1))), axis=1)
        old_val_error1 = 100002
        old_val_error2 = 100001
        new_val_error = 100000
        count = 0
        while (((old_val_error1 - new_val_error) > 0.001) or ((old_val_error2 - old_val_error1) > 0.001)):
            count += 1
            print(count)
            self.mlptrain(inputs, targets, eta, niterations)
            old_val_error2 = old_val_error1
            old_val_error1 = new_val_error
            validout = self.mlpfwd(valid)
            new_val_error = 0.5 * np.sum((validtargets - validout) ** 2)
        print("Stopped", new_val_error, old_val_error1, old_val_error2)
        return new_val_error
    def mlptrain(self, inputs, targets, eta, niterations):
        """ Train the thing """
        # Add the inputs that match the bias node
        inputs = np.concatenate((inputs, -np.ones((self.ndata, 1))), axis=1)
        change = range(self.ndata)
        print(self.weights2)
        updatew1 = np.zeros((np.shape(self.weights1)))
        updatew2 = np.zeros((np.shape(self.weights2)))
        for n in range(niterations):
            self.outputs = self.mlpfwd(inputs)
            #error = 0.5 * np.sum((self.outputs - targets) ** 2)
            if (np.mod(n, 100) == 0):
                print("Iteration: ", n, " Weight2: ", self.weights2)
            # Different types of output neurons
            if self.outtype == 'linear':
                deltao = (self.outputs - targets) / self.ndata
            elif self.outtype == 'logistic':
                deltao = self.beta * (self.outputs - targets) * self.outputs * (1.0 - self.outputs)
            elif self.outtype == 'softmax':
                deltao = (self.outputs - targets) * (self.outputs * (-self.outputs) + self.outputs) / self.ndata
            else:
                print("error")
            deltah = self.hidden * self.beta * (1.0 - self.hidden) * (np.dot(deltao, np.transpose(self.weights2)))
            updatew1 = eta * (np.dot(np.transpose(inputs), deltah[:, :-1])) + self.momentum * updatew1
            updatew2 = eta * (np.dot(np.transpose(self.hidden), deltao)) + self.momentum * updatew2
            self.weights1 -= updatew1
            self.weights2 -= updatew2
            # Randomise order of inputs (not necessary for matrix-based calculation)
            # np.random.shuffle(change)
            # inputs = inputs[change,:]
            # targets = targets[change,:]
        print(self.weights2)
    def mlpfwd(self, inputs):
        """ Run the network forward """
        self.hidden = np.dot(inputs, self.weights1)
        self.hidden = 1.0 / (1.0 + np.exp(-self.beta * self.hidden))
        self.hidden = np.concatenate((self.hidden, -np.ones((np.shape(inputs)[0], 1))), axis=1)
        outputs = np.dot(self.hidden, self.weights2)
        # Different types of output neurons
        if self.outtype == 'linear':
            return outputs
        elif self.outtype == 'logistic':
            return 1.0 / (1.0 + np.exp(-self.beta * outputs))
        elif self.outtype == 'softmax':
            normalisers = np.sum(np.exp(outputs), axis=1) * np.ones((1, np.shape(outputs)[0]))
            return np.transpose(np.transpose(np.exp(outputs)) / normalisers)
        else:
            print("error")
    def confmat(self, inputs, targets):
        """Confusion matrix"""
        # Add the inputs that match the bias node
        inputs = np.concatenate((inputs, -np.ones((np.shape(inputs)[0], 1))), axis=1)
        outputs = self.mlpfwd(inputs)
        nclasses = np.shape(targets)[1]
        if nclasses == 1:
            nclasses = 2
            outputs = np.where(outputs > 0.5, 1, 0)
        else:
            # 1-of-N encoding
            outputs = np.argmax(outputs, 1)
            targets = np.argmax(targets, 1)
        cm = np.zeros((nclasses, nclasses))
        for i in range(nclasses):
            for j in range(nclasses):
                cm[i, j] = np.sum(np.where(outputs == j, 1, 0) * np.where(targets == i, 1, 0))
        print(outputs)
        print(targets)
        print("Confusion matrix is:")
        print(cm)
        print("Percentage Correct: ", np.trace(cm) / np.sum(cm) * 100)
Here is my calling code that supplies data
import mlp
import numpy as np
apk_train_data = np.array([
[4, 1, 6, 29, 0, 3711, 1423906, 0],
[20, 1, 5, 24, 0, 4082, 501440, 0],
[3, 0, 1, 6, 0, 5961, 2426358, 0],
[0, 0, 2, 27, 0, 6074, 28762, 0],
[12, 1, 3, 17, 0, 4066, 505, 0],
[1, 0, 2, 5, 0, 1284, 38504, 0],
[2, 0, 2, 10, 0, 2421, 5827165, 0],
[5, 0, 17, 97, 0, 25095, 7429, 0],
[1, 1, 3, 22, 6, 4539, 9100705, 0],
[2, 0, 4, 15, 0, 2054, 264563, 0],
[3, 1, 6, 19, 0, 3562, 978171, 0],
[8, 0, 5, 12, 3, 1741, 1351990, 0],
[9, 0, 5, 12, 2, 1660, 2022743, 0],
[9, 0, 5, 12, 2, 1664, 2022743, 0],
[10, 4, 11, 70, 8, 43944, 51488321, 1],
[6, 0, 3, 18, 0, 8511, 19984102, 1],
[11, 2, 6, 44, 0, 61398, 32139, 1],
[0, 0, 0, 0, 0, 1008, 23872, 1],
[7, 1, 1, 16, 3, 46792, 94818, 1],
[3, 2, 1, 13, 2, 8263, 208820, 1],
[0, 0, 0, 2, 0, 2749, 3926, 1],
[10, 0, 1, 9, 0, 5220, 2275848, 1],
[1, 1, 3, 34, 6, 50030, 814322, 1],
[2, 2, 4, 48, 7, 86406, 12895, 1],
[0, 1, 5, 45, 2, 63060, 803121, 1],
[1, 0, 2, 11, 7, 7602, 1557, 1],
[3, 0, 1, 15, 3, 20813, 218352, 1]
])
apk_test_data = np.array([
[0, 0, 1, 9, 0, 4317, 118082, 0],
[8, 0, 5, 12, 3, 1742, 1351990, 0],
[8, 0, 5, 12, 3, 1744, 1351990, 0],
[0, 0, 1, 11, 2, 17630, 326164, 1],
[10, 2, 6, 45, 7, 22668, 30257520, 1],
[1, 0, 1, 8, 0, 9317, 33000349, 1],
[3, 0, 1, 15, 3, 20813, 218352, 1]
])
p = mlp.mlp(apk_train_data[:, 0:7], apk_train_data[:, 7:], 9)
p.mlptrain(apk_train_data[:, 0:7], apk_train_data[:, 7:], 0.25, 100000)
p.confmat(apk_test_data[:, 0:7], apk_test_data[:, 7:])
Each row has 7 features, followed by the target as the last entry.
Here is the full text file containing the dataset
https://drive.google.com/open?id=1q_aGNgHxTBh_mmVAzVXKBa27NTJKeKV8
Please tell me what I am doing wrong. If there is an easy-to-use library that does the same thing, please suggest it.
As mentioned in the comments, initializing the network weights randomly should make the network train.
# Initialise network
self.weights1 = (np.random.rand(self.nin+1,self.nhidden)-0.5)*2/np.sqrt(self.nin)
self.weights2 = (np.random.rand(self.nhidden+1,self.nout)-0.5)*2/np.sqrt(self.nhidden)
Next, the features in your data are on very different scales (some are single digits, others run into the millions), so the gradient updates will be dominated by a single feature. One remedy is to standardize the data.
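from sklearn.preprocessing import StandardScaler
# cast to float so the scaled values are not truncated back to integers
apk_train_data = apk_train_data.astype(float)
apk_test_data = apk_test_data.astype(float)
# fit on the training features only (the last column is the target)
scaler = StandardScaler().fit(apk_train_data[:, :-1])
apk_train_data[:, :-1] = scaler.transform(apk_train_data[:, :-1])
apk_test_data[:, :-1] = scaler.transform(apk_test_data[:, :-1])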
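To see why unscaled features stall this network, recall that the backpropagation step in mlptrain multiplies by hidden * (1 - hidden), the slope of the sigmoid. The sketch below evaluates that slope for a single hidden unit on four rows taken from the training data above, before and after standardizing. The weights of 0.1 are hypothetical, and the standardization is done with plain NumPy rather than StandardScaler to keep the example self-contained.

```python
import numpy as np

# Four feature rows copied from apk_train_data (target column dropped)
X = np.array([
    [4, 1, 6, 29, 0, 3711, 1423906],
    [3, 0, 1, 6, 0, 5961, 2426358],
    [1, 1, 3, 22, 6, 4539, 9100705],
    [10, 4, 11, 70, 8, 43944, 51488321],
], dtype=float)

w = np.full(X.shape[1], 0.1)   # hypothetical weights of one hidden unit

def sigmoid_slope(inputs, weights):
    """h * (1 - h) for h = sigmoid(inputs @ weights), the factor used in backprop."""
    h = 1.0 / (1.0 + np.exp(-(inputs @ weights)))
    return h * (1.0 - h)

X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)   # zero mean, unit variance per column

slope_raw = sigmoid_slope(X, w)
slope_scaled = sigmoid_slope(X_scaled, w)
print(slope_raw)      # saturated: the slope is zero for every row
print(slope_scaled)   # clearly non-zero after standardizing
```

With a zero slope the hidden-layer updates vanish regardless of the learning rate, which is consistent with weights that never seem to change.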
And last but not least, an eta of 0.25 is far too large. I'll illustrate by using the opposite extreme:
p.mlptrain(apk_train_data[:, 0:7], apk_train_data[:, 7:], 0.0001, 100000)
p.confmat(apk_test_data[:, 0:7], apk_test_data[:, 7:])
# >> Percentage Correct: 71.4285714286
p.confmat(apk_train_data[:,0:7], apk_train_data[:,7:])
# >> Percentage Correct: 88.8888888889

Solving linear regression equation for a sparse matrix

I'd like to solve a multivariate linear regression equation for a vector X with m elements, given n observations Y. Assuming the measurements have Gaussian random errors, how can I solve this problem using Python? My problem looks like this:
A simple example of W when m=5, is given as follows:
P.S. I would like to account for the errors; specifically, I want to estimate their standard deviation.
You can do it like this
import numpy as np
def myreg(W, Y):
    from numpy.linalg import pinv
    m, n = Y.shape
    k = W.shape[1]
    X = pinv(W.T.dot(W)).dot(W.T).dot(Y)
    Y_hat = W.dot(X)
    Residuals = Y_hat - Y
    # note: m - 2 is the residual degrees of freedom of a two-parameter fit;
    # with k fitted coefficients the usual choice is m - k
    MSE = np.square(Residuals).sum(axis=0) / (m - 2)
    X_var = (MSE[:, None] * pinv(W.T.dot(W)).flatten()).reshape(n, k, k)
    Tstat = X / np.sqrt(X_var.diagonal(axis1=1, axis2=2)).T
    return X, Tstat
demo
W = np.array([
[ 1, -1, 0, 0, 1],
[ 0, -1, 1, 0, 0],
[ 1, 0, 0, 1, 0],
[ 0, 1, -1, 0, 1],
[ 1, -1, 0, 0, -1],
])
Y = np.array([2, 4, 1, 5, 3])[:, None]
X, V = myreg(W, Y)
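Since the question asks specifically for the standard deviation of the errors, here is a minimal alternative sketch built on np.linalg.lstsq. The demo matrix above has as many observations as unknowns, which leaves little or nothing for estimating the error spread, so this example uses a made-up taller W. The name `fit_with_error_sd` and all data values are hypothetical.

```python
import numpy as np

def fit_with_error_sd(W, y):
    """Least-squares estimate of x in y = W x + e, plus an estimate of sd(e)."""
    W = np.asarray(W, dtype=float)
    y = np.asarray(y, dtype=float)
    x_hat, _, rank, _ = np.linalg.lstsq(W, y, rcond=None)
    residuals = y - W @ x_hat
    dof = W.shape[0] - rank            # observations minus fitted parameters
    if dof <= 0:
        raise ValueError("need more observations than fitted parameters")
    sigma_hat = np.sqrt(residuals @ residuals / dof)
    return x_hat, sigma_hat

# Made-up example: 6 observations of 2 unknowns with Gaussian noise (sd 0.5)
rng = np.random.default_rng(0)
W = np.array([[1, 0], [0, 1], [1, 1], [1, -1], [2, 1], [1, 2]], dtype=float)
x_true = np.array([2.0, -1.0])
y = W @ x_true + rng.normal(scale=0.5, size=W.shape[0])
x_hat, sigma_hat = fit_with_error_sd(W, y)
print(x_hat, sigma_hat)
```

With so few observations the estimate of the error standard deviation is itself quite noisy; it tightens as the number of rows in W grows.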
