Neural net with Pybrain will not converge - python

I am trying to build a simple neural network using Python and Pybrain package.
As I am just starting to learn both the method and the Pybrain package, I tried to build a very simple neural net with some real data that I have available!
I know there is an underlying relationship in my data; however, the code does not converge at all, and the results after training are basically the same for any set of real validation data that I put in. Below is my code and a small part of the data. I have over 5000 lines of data with known g available to train my network, but the number of points added to the training makes no difference.
from pybrain.tools.shortcuts import buildNetwork as bld
from pybrain.datasets import SupervisedDataSet as spds
from pybrain.supervised.trainers import BackpropTrainer as bpt
import numpy as np
u,g,r,i,z = np.loadtxt("dataset.dat",unpack=True)
data = spds(4,1)
net = bld(4,1000,1)
# loop index renamed so it does not clash with the i-band column loaded above
for k in range(0, len(u)):
    data.addSample((u[k], r[k], i[k], z[k]), (g[k],))
trainer = bpt(net,data)
trainer.trainUntilConvergence(dataset=data,maxEpochs=300,validationProportion=0.5)
p = net.activate([17.136,15.812,15.693,15.675])
print p
# expected result: 16.225
p = net.activate([19.382,17.684,17.511,17.435])
print p
# expected result: 18.195
18.14981 15.10829 13.96468 -10.8685 13.20411
16.84580 15.17839 14.61974 14.44930 14.44493
16.70895 15.57959 15.28097 15.16538 15.19260
18.44166 16.32709 15.45345 15.14938 15.04544
18.03881 16.49129 15.96768 15.78446 15.77211
21.15679 18.66248 17.46381 16.97513 16.75475
19.25665 17.80023 17.18956 16.97563 16.94967
17.01522 16.08040 15.85172 15.81930 15.92262
19.21695 17.72263 17.17900 16.98280 16.97201
19.98507 18.56911 17.98143 17.80738 17.81714
16.94824 15.97417 15.70555 15.59221 15.64357
21.20893 19.40982 18.68114 18.46647 18.43065
18.72652 17.38880 16.93716 16.73246 16.75096
20.57421 19.55045 19.15475 18.99772 19.02503
22.48833 20.07709 18.68276 17.60561 17.09613
22.27604 20.34056 19.66521 19.37319 19.30457
20.58372 19.18035 18.64691 18.43370 18.39288
22.25103 20.74570 20.16532 19.94144 19.78580
22.49646 19.63043 18.39409 17.97594 17.77803
19.22686 17.55373 16.97127 16.76445 16.70418
20.44500 19.34502 18.96556 18.80437 18.78767
22.69331 21.19628 19.89190 19.39628 19.11377
19.51075 18.02397 17.46963 17.31436 17.27759
19.92604 18.49456 17.97421 17.83519 17.80557
19.18904 18.22256 17.84221 17.70319 17.64457
20.23186 18.43468 17.81423 17.60103 17.54677
19.86590 18.32822 17.75089 17.57386 17.53067
20.84188 19.78345 19.42506 19.27895 19.34572
22.14103 21.86670 21.74832 21.61244 21.99680
18.02018 16.69380 16.23947 16.12869 16.09864
19.92574 18.63316 18.15877 17.95703 17.90224

Generally speaking, I get better results if I have scaled my data to be between 0 and 1, or better yet between 0.1 and 0.9. The neuron output is usually going to be between 0 and 1. You might try scaling your inputs and outputs to be within this range, and see if you get better results.
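As an illustration, a minimal sketch of such min-max scaling applied to the u, r, i, z, g arrays from the question (the scale/unscale helpers and the example activation values are mine, not part of Pybrain):
import numpy as np

def scale(col, lo=0.1, hi=0.9):
    # min-max scale a column into [lo, hi]
    cmin, cmax = col.min(), col.max()
    return lo + (hi - lo) * (col - cmin) / (cmax - cmin), cmin, cmax

def unscale(val, cmin, cmax, lo=0.1, hi=0.9):
    # map a scaled network output back to the original magnitude range
    return cmin + (val - lo) * (cmax - cmin) / (hi - lo)

# scale every input column and the target before filling the dataset
u_s, _, _ = scale(u)
r_s, _, _ = scale(r)
i_s, _, _ = scale(i)
z_s, _, _ = scale(z)
g_s, g_min, g_max = scale(g)

data = spds(4, 1)
for k in range(len(u_s)):
    data.addSample((u_s[k], r_s[k], i_s[k], z_s[k]), (g_s[k],))

# ... train as before, then invert the scaling on any prediction:
p = net.activate([0.42, 0.37, 0.36, 0.35])  # scaled inputs, illustrative values only
print unscale(p[0], g_min, g_max)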

Related

How to successfully run an ML algorithm with a medium-sized data set on a mediocre laptop?

I have a Lenovo IdeaPad laptop with 8 GB of RAM and an Intel Core i5 processor. I have 60k data points, each 100-dimensional. I want to do k-NN, and for that I am running the LMNN algorithm to find a Mahalanobis metric.
The problem is that after 2 hours of running, a blank screen appears on my Ubuntu machine. I don't understand what the problem is! Is my memory getting full, or is it something else?
Is there some way to optimize my code?
My dataset: data
My LMNN implementation:
import numpy as np
import sys
from modshogun import LMNN, RealFeatures, MulticlassLabels
from sklearn.datasets import load_svmlight_file
def main():
    # Get training file name from the command line
    traindatafile = sys.argv[1]

    # The training file is in libSVM format
    tr_data = load_svmlight_file(traindatafile)

    Xtr = tr_data[0].toarray()  # Converts sparse matrices to dense
    Ytr = tr_data[1]            # The training labels

    # Cast data to Shogun format to work with LMNN
    features = RealFeatures(Xtr.T)
    labels = MulticlassLabels(Ytr.astype(np.float64))

    # Number of target neighbours per example - tune this using validation
    k = 18

    # Initialize the LMNN package
    lmnn = LMNN(features, labels, k)
    init_transform = np.eye(Xtr.shape[1])

    # Choose an appropriate timeout
    lmnn.set_maxiter(200000)
    lmnn.train(init_transform)

    # Let LMNN do its magic and return a linear transformation
    # corresponding to the Mahalanobis metric it has learnt
    L = lmnn.get_linear_transform()
    M = np.matrix(np.dot(L.T, L))

    # Save the model for use in testing phase
    # Warning: do not change this file name
    np.save("model.npy", M)

if __name__ == '__main__':
    main()
Exact k-NN has scalability problems.
Scikit-learn has a documentation page (scaling strategies) on what to do in such a situation (many algorithms have a partial_fit method, but unfortunately kNN doesn't).
If you are willing to trade some accuracy for speed, you can run something like approximate nearest neighbors.
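For instance, a rough sketch using the Annoy library (the library choice, the tree count, and the stand-in data are assumptions, not something the answer prescribes):
import numpy as np
from annoy import AnnoyIndex  # pip install annoy

# stand-in for the dense 60k x 100 training matrix Xtr from the question
Xtr = np.random.rand(60000, 100)

index = AnnoyIndex(Xtr.shape[1], 'euclidean')
for i, vec in enumerate(Xtr):
    index.add_item(i, vec)
index.build(10)  # more trees give better accuracy at the cost of build time

# approximate 18 nearest neighbours of the first training point
print index.get_nns_by_item(0, 18)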

Pattern recognition with Pybrain

Is there a method for training pybrain to recognize multiple patterns within a single neural net? For example, I've added several permutations of two different patterns:
First pattern:
(200[1-9], 200[1-9]),(400[1-9],400[1-9])
Second pattern:
(900[1-9], 900[1-9]),(100[1-9],100[1-9])
Then for my unsupervised data set I added (90002, 90009), for which I was hoping it would return [100[1-9],100[1-9]] (the second pattern); however, it returns [25084, 25084]. I realize that it's trying to find the best value given ALL the inputs, but I'm trying to have it distinguish certain patterns within the set, if that makes sense.
This is the example I'm working from:
Request for example: Recurrent neural network for predicting next value in a sequence
from pybrain.tools.shortcuts import buildNetwork
from pybrain.supervised.trainers import BackpropTrainer
from pybrain.datasets import SupervisedDataSet,UnsupervisedDataSet
from pybrain.structure import LinearLayer
from pybrain.datasets import ClassificationDataSet
from pybrain.structure.modules.sigmoidlayer import SigmoidLayer
import random
ds = ClassificationDataSet(2, 1)
tng_dataset_size = 1000
unseen_dataset_size = 100
print 'training dataset size is ', tng_dataset_size
print 'unseen dataset size is ', unseen_dataset_size
print 'adding data..'
for x in range(tng_dataset_size):
    rand1 = random.randint(1,9)
    rand2 = random.randint(1,9)
    pattern_one_0 = int('2000'+str(rand1))
    pattern_one_1 = int('2000'+str(rand2))
    pattern_two_0 = int('9000'+str(rand1))
    pattern_two_1 = int('9000'+str(rand2))
    ds.addSample((pattern_one_0, pattern_one_1), (0))  # pattern 1, maps to 0
    ds.addSample((pattern_two_0, pattern_two_1), (1))  # pattern 2, maps to 1
unsupervised_results = []
net = buildNetwork(2, 1, 1, outclass=LinearLayer,bias=True, recurrent=True)
print 'training ...'
trainer = BackpropTrainer(net, ds)
trainer.trainEpochs(500)
ts = UnsupervisedDataSet(2,)
print 'adding pattern 2 to unseen data'
for x in xrange(unseen_dataset_size):
    pattern_two_0 = int('9000'+str(rand1))
    pattern_two_1 = int('9000'+str(rand1))
    ts.addSample((pattern_two_0, pattern_two_1))  # adding first part of pattern 2 to unseen data
    a = [int(i) for i in net.activateOnDataset(ts)[0]]  # should map to 1
    unsupervised_results.append(a[0])
print 'total hits for pattern 1 ', unsupervised_results.count(0)
print 'total hits for pattern 2 ', unsupervised_results.count(1)
[[EDIT]] added categorical variable and ClassificationDataSet.
[[EDIT 1]] added larger training set and unseen set
Yes, there is. The problem here is the representation you are choosing. You are training the network to output real numbers, so your NN is a function that approximates, to a certain degree, the function you sampled and provided in the dataset. Hence the result is some value between 10000 and 40000.
It looks more like you are looking for a classifier.
Given your description, I am assuming you have a clearly defined set of patterns that you are looking for. Then you must map your patterns to a categorical variable: for instance, the pattern 1 you mention, (200[1-9], 200[1-9]),(400[1-9],400[1-9]), would be 0, pattern 2 would be 1, and so on.
Then you train the network to output the class (0, 1, ...) to which the input pattern belongs.
Arguably, given the structure of your patterns, rule-based classification is probably more adequate than ANNs.
Concerning the amount of data, you need much more of it. Typically, the most basic approach is to split the dataset into two groups (70-30, for instance). You use 70% of the samples for training, and the remaining 30% as unseen data (test data) to assess the generalization/over-fitting of the model. You might want to read about cross-validation once you get the basics running.
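For illustration, a rough sketch of that classifier setup following PyBrain's standard classification pattern (the layer sizes, epoch count, and 70/30 split are assumptions; scaling the large raw inputs down would likely also help):
from pybrain.datasets import ClassificationDataSet
from pybrain.tools.shortcuts import buildNetwork
from pybrain.structure.modules import SoftmaxLayer
from pybrain.supervised.trainers import BackpropTrainer
from pybrain.utilities import percentError
import random

# two inputs, one target class label, two classes (pattern 1 -> 0, pattern 2 -> 1)
ds = ClassificationDataSet(2, 1, nb_classes=2)
for _ in range(1000):
    r1, r2 = random.randint(1, 9), random.randint(1, 9)
    ds.addSample((int('2000%d' % r1), int('2000%d' % r2)), (0,))
    ds.addSample((int('9000%d' % r1), int('9000%d' % r2)), (1,))

# 70/30 train/test split, then one-of-many (one-hot) targets for classification
test_ds, train_ds = ds.splitWithProportion(0.3)
train_ds._convertToOneOfMany()
test_ds._convertToOneOfMany()

# softmax output layer so the network outputs class probabilities
net = buildNetwork(train_ds.indim, 5, train_ds.outdim, outclass=SoftmaxLayer)
trainer = BackpropTrainer(net, dataset=train_ds)
trainer.trainEpochs(50)

# evaluate on the unseen 30%
predicted = trainer.testOnClassData(dataset=test_ds)
print 'test error: %.2f%%' % percentError(predicted, test_ds['class'])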

How to monitor convergence of Gensim LDA model?

I can't seem to find it, or perhaps my knowledge of statistics and its terminology is the problem here, but I want to achieve something similar to the graph found at the bottom of the page for the LDA lib on PyPI, and observe the uniformity/convergence of the lines. How can I achieve this with Gensim LDA?
You are right to wish to plot the convergence of your model fitting.
Gensim unfortunately does not seem to make this very straightforward.
Run the model in such a way that you will be able to analyze the output of the model fitting function. I like to set up a log file.
import logging
logging.basicConfig(filename='gensim.log',
                    format="%(asctime)s:%(levelname)s:%(message)s",
                    level=logging.INFO)
Set the eval_every parameter in LdaModel. The lower this value is, the better resolution your plot will have. However, computing the perplexity can slow down your fit a lot!
lda_model = LdaModel(corpus=corpus,
                     id2word=id2word,
                     num_topics=30,
                     eval_every=10,
                     passes=40,
                     iterations=5000)
Parse the log file and make your plot.
import re
import matplotlib.pyplot as plt
p = re.compile(r"(-*\d+\.\d+) per-word .* (\d+\.\d+) perplexity")
matches = [p.findall(l) for l in open('gensim.log')]
matches = [m for m in matches if len(m) > 0]
tuples = [t[0] for t in matches]
perplexity = [float(t[1]) for t in tuples]
likelihood = [float(t[0]) for t in tuples]
iter = list(range(0, len(tuples) * 10, 10))
plt.plot(iter, likelihood, c="black")
plt.ylabel("log likelihood")
plt.xlabel("iteration")
plt.title("Topic Model Convergence")
plt.grid()
plt.savefig("convergence_likelihood.pdf")
plt.close()

Using pybrain optimization algorithm to solve search problems

I recently started using the pybrain library for classification problems with neural networks, and with some struggle and the documentation I made it work.
Now, I would like to use blackbox optimization algorithms from the same library, but not applied to classification.
Basically, I am trying to reproduce a result from Randy Olson's blog: http://www.randalolson.com/2015/02/03/heres-waldo-computing-the-optimal-search-strategy-for-finding-waldo/.
So, as a first step, I constructed supervised dataset with the following snippet:
ds = SupervisedDataSet(2, 2)
for row in range(len(waldo_df)):
    ds.addSample(inp=waldo_df.iloc[row][['Book', 'Page']], target=waldo_df.iloc[row][['X', 'Y']])
return ds
Now, one sample from the dataset looks like:
ds.getSample()
[array([ 5., 8.]), array([ 3.51388889, 4.31944444])]
On the next step I would like to use HillClimber algorithm to find the optimal path:
ef = ds.evaluateModuleMSE
init_value = ds.getSample()
learner = HillClimber(evaluator=ef, initEvaluable=init_value, minimize=True)
learner.learn()
What I get back is an exception:
/Users/maestro/anaconda/lib/python2.7/site-packages/pybrain/datasets/supervised.pyc in evaluateModuleMSE(self, module, averageOver, **args)
96 res = 0.
97 for dummy in range(averageOver):
---> 98 module.reset()
99 res += self.evaluateMSE(module.activate, **args)
100 return res/averageOver
AttributeError: 'numpy.ndarray' object has no attribute 'reset'
Can someone help me figure out what I am doing wrong? The documentation on this is very sparse and even searching through the code base did not help.
Thanks
P.S. If I am reading the API correctly
class pybrain.optimization.HillClimber(evaluator=None, initEvaluable=None, **kwargs)
The simplest kind of stochastic search: hill-climbing in the fitness landscape.
the optimization algorithm only needs an evaluator, which in my case would be ds.evaluateModuleMSE
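For reference, a minimal sketch of how that evaluator seems intended to be paired with a network module rather than a raw sample, judging from the traceback's module.reset()/module.activate calls (the network shape here is an arbitrary assumption, and ds is the SupervisedDataSet built above):
from pybrain.tools.shortcuts import buildNetwork
from pybrain.optimization import HillClimber

# the evaluable that HillClimber mutates must be something evaluateModuleMSE
# can call .reset() and .activate() on, i.e. a network module, not a sample
net = buildNetwork(2, 4, 2)  # 2 inputs (Book, Page) -> 2 outputs (X, Y); hidden size arbitrary
learner = HillClimber(evaluator=ds.evaluateModuleMSE,
                      initEvaluable=net,
                      minimize=True)
learner.learn()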
Update
The whole code snippet is:
import pandas as pd
from pybrain.optimization import HillClimber
from pybrain.datasets import SupervisedDataSet
waldo_df = pd.read_csv('whereis-waldo-locations.csv')
ds = SupervisedDataSet(2, 2)
for row in range(len(waldo_df)):
    ds.addSample(inp=waldo_df.iloc[row][['Book', 'Page']], target=waldo_df.iloc[row][['X', 'Y']])
learner = HillClimber(evaluator=ds.evaluateModuleMSE, initEvaluable=ds.getSample(), minimize=True)

unexpected poor performance of GMM from sklearn

I'm trying to model some simulated data using the DPGMM classifier from scikit-learn, but I'm getting poor performance. Here is the example I'm using:
from sklearn import mixture
import numpy as np
import matplotlib.pyplot as plt
clf = mixture.DPGMM(n_components=5, init_params='wc')
s = 0.1
a = np.random.normal(loc=1, scale=s, size=(1000,))
b = np.random.normal(loc=2, scale=s, size=(1000,))
c = np.random.normal(loc=3, scale=s, size=(1000,))
d = np.random.normal(loc=4, scale=s, size=(1000,))
e = np.random.normal(loc=7, scale=s*2, size=(5000,))
noise = np.random.random(500)*8
data = np.hstack([a,b,c,d,e,noise]).reshape((-1,1))
clf.means_ = np.array([1,2,3,4,7]).reshape((-1,1))
clf.fit(data)
labels = clf.predict(data)
plt.scatter(data.T, np.random.random(len(data)), c=labels, lw=0, alpha=0.2)
plt.show()
I would think that this would be exactly the kind of problem that gaussian mixture models would work for. I've tried playing around with alpha, using gmm instead of dpgmm, changing the number of starting components, etc. I can't seem to get a reliable and accurate classification. Is there something I'm just missing? Is there another model that would be more appropriate?
Because you didn't iterate long enough for it to converge.
Check the value of
clf.converged_
and try increasing n_iter to 1000.
Note, however, that the DPGMM still fails miserably (IMHO) on this data set, eventually decreasing the number of clusters to just 2.
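For illustration, a minimal variant of the fit from the question with a larger iteration budget and a convergence check (n_iter=1000 is just an example value; DPGMM is the older sklearn.mixture API used in the question):
from sklearn import mixture
import numpy as np

# same simulated data as in the question, built compactly
s = 0.1
data = np.hstack([np.random.normal(loc=m, scale=s, size=1000) for m in (1, 2, 3, 4)]
                 + [np.random.normal(loc=7, scale=s * 2, size=5000),
                    np.random.random(500) * 8]).reshape((-1, 1))

clf = mixture.DPGMM(n_components=5, init_params='wc', n_iter=1000)
clf.means_ = np.array([1, 2, 3, 4, 7]).reshape((-1, 1))
clf.fit(data)

print clf.converged_             # check whether the fit actually converged
print np.round(clf.means_.ravel(), 2)  # inspect the fitted component means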
