Pattern recognition with PyBrain - Python

Is there a method for training pybrain to recognize multiple patterns within a single neural net? For example, I've added several permutations of two different patterns:
First pattern:
(200[1-9], 200[1-9]),(400[1-9],400[1-9])
Second pattern:
(900[1-9], 900[1-9]),(100[1-9],100[1-9])
Then for my unsupervised data set I added (90002, 90009), for which I was hoping it would return [100[1-9],100[1-9]] (second pattern); however, it returns [25084, 25084]. I realize that it's trying to find the best value given ALL the inputs, but I'm trying to have it distinguish certain patterns within the set, if that makes sense.
This is the example I'm working from:
Request for example: Recurrent neural network for predicting next value in a sequence
from pybrain.tools.shortcuts import buildNetwork
from pybrain.supervised.trainers import BackpropTrainer
from pybrain.datasets import SupervisedDataSet, UnsupervisedDataSet
from pybrain.structure import LinearLayer
from pybrain.datasets import ClassificationDataSet
from pybrain.structure.modules.sigmoidlayer import SigmoidLayer
import random

ds = ClassificationDataSet(2, 1)
tng_dataset_size = 1000
unseen_dataset_size = 100
print 'training dataset size is ', tng_dataset_size
print 'unseen dataset size is ', unseen_dataset_size
print 'adding data..'
for x in range(tng_dataset_size):
    rand1 = random.randint(1, 9)
    rand2 = random.randint(1, 9)
    pattern_one_0 = int('2000' + str(rand1))
    pattern_one_1 = int('2000' + str(rand2))
    pattern_two_0 = int('9000' + str(rand1))
    pattern_two_1 = int('9000' + str(rand2))
    ds.addSample((pattern_one_0, pattern_one_1), (0))  # pattern 1, maps to 0
    ds.addSample((pattern_two_0, pattern_two_1), (1))  # pattern 2, maps to 1

unsupervised_results = []
net = buildNetwork(2, 1, 1, outclass=LinearLayer, bias=True, recurrent=True)
print 'training ...'
trainer = BackpropTrainer(net, ds)
trainer.trainEpochs(500)

ts = UnsupervisedDataSet(2,)
print 'adding pattern 2 to unseen data'
for x in xrange(unseen_dataset_size):
    pattern_two_0 = int('9000' + str(rand1))
    pattern_two_1 = int('9000' + str(rand1))
    ts.addSample((pattern_two_0, pattern_two_1))  # adding first part of pattern 2 to unseen data
    a = [int(i) for i in net.activateOnDataset(ts)[0]]  # should map to 1
    unsupervised_results.append(a[0])

print 'total hits for pattern 1 ', unsupervised_results.count(0)
print 'total hits for pattern 2 ', unsupervised_results.count(1)
[[EDIT]] added categorical variable and ClassificationDataSet.
[[EDIT 1]] added larger training set and unseen set

Yes, there is. The problem here is the representation you are choosing. You are training the network to output real numbers, so your NN is a function that approximates to a certain degree the function you sampled and provided in the dataset. Hence the result of some value between 10000 and 40000.
It looks more like you are looking for a classifier.
Given your description, I am assuming you have a clearly defined set of patterns that you are looking for. Then you must map your patterns to a categorical variable. For instance, the pattern 1 you mention, (200[1-9], 200[1-9]),(400[1-9],400[1-9]), would be 0, pattern 2 would be 1, and so on.
Then, you train the network to output the class (0,1,...) to which the input pattern belongs.
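For illustration, a minimal sketch of that setup in PyBrain might look like the following (the layer sizes and epoch count are arbitrary choices; the inputs are also scaled down, since raw values in the tens of thousands tend to saturate sigmoid units):

from pybrain.datasets import ClassificationDataSet
from pybrain.tools.shortcuts import buildNetwork
from pybrain.structure.modules import SoftmaxLayer
from pybrain.supervised.trainers import BackpropTrainer
import random

ds = ClassificationDataSet(2, 1, nb_classes=2)
for _ in range(1000):
    r1, r2 = random.randint(1, 9), random.randint(1, 9)
    # pattern 1 -> class 0, pattern 2 -> class 1; inputs scaled to roughly [0, 1]
    ds.addSample((int('2000%d' % r1) / 1e5, int('2000%d' % r2) / 1e5), [0])
    ds.addSample((int('9000%d' % r1) / 1e5, int('9000%d' % r2) / 1e5), [1])

ds._convertToOneOfMany()  # one output unit per class
net = buildNetwork(ds.indim, 5, ds.outdim, outclass=SoftmaxLayer, bias=True)
BackpropTrainer(net, ds).trainEpochs(50)

# the predicted class is the index of the largest output unit
print(net.activate((90002 / 1e5, 90009 / 1e5)).argmax())  # expect 1

With the classes encoded this way, the output is a distribution over the two patterns rather than a regression value somewhere in between them.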
Arguably, given the structure of your patterns, rule-based classification is probably more adequate than ANNs.
Concerning the amount of data, you need much more of it. Typically, the most basic approach is to split the dataset into two groups (70-30, for instance): you use 70% of the samples for training, and the remaining 30% as unseen data (test data) to assess the generalization/over-fitting of the model. You might want to read about cross-validation once you get the basics running.
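With PyBrain's dataset objects, that split is a one-liner; as a sketch:

train_ds, test_ds = ds.splitWithProportion(0.70)  # 70% for training, 30% held out as unseen test data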

Related

Gensim Doc2vec model: how to compute similarity on a corpus obtained using a pre-trained doc2vec model?

I have a model based on doc2vec trained on multiple documents. I would like to use that model to infer the vectors of another document, which I want to use as the corpus for comparison. So, when I look for the most similar sentence to one I introduce, it uses these new document vectors instead of the trained corpus.
Currently, I am using infer_vector() to compute a vector for each sentence of the new document, but I can't use the most_similar() function with the list of vectors I obtain; it has to be a KeyedVectors instance.
I would like to know if there's any way to compute these vectors for the new document that will allow the use of the most_similar() function, or if I have to compute the similarity between each sentence of the new document and the sentence I introduce individually (in that case, is there any implementation in Gensim that allows me to compute the cosine similarity between two vectors?).
I am new to Gensim and NLP, and I'm open to your suggestions.
I cannot provide the complete code, since it is a project for the university, but here are the main parts where I'm having problems.
After doing some pre-processing of the data, this is how I train my model:
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(train_data)]
assert gensim.models.doc2vec.FAST_VERSION > -1
cores = multiprocessing.cpu_count()
doc2vec_model = Doc2Vec(vector_size=200, window=5, workers=cores)
doc2vec_model.build_vocab(documents)
doc2vec_model.train(documents, total_examples=doc2vec_model.corpus_count, epochs=30)
I try to compute the vectors for the new document this way:
questions = [doc2vec_model.infer_vector(line) for line in lines_4]
And then I try to compute the similarity between the new document vectors and an input phrase:
text = str(input('Me: '))
tokens = text.split()
new_vector = doc2vec_model.infer_vector(tokens)
index = questions[i].most_similar([new_vector])
A dirty solution I used about a month ago with gensim==3.2.0 (the syntax might have changed since).
You can save your inferred vectors in KeyedVectors format.
from gensim.models import KeyedVectors
from gensim.models.doc2vec import Doc2Vec
vectors = dict()
# y_names = doc2vec_model.docvecs.doctags.keys()
y_names = range(len(questions))
for name in y_names:
    # vectors[str(name)] = doc2vec_model.docvecs[name]
    vectors[str(name)] = questions[name]

# write the vectors in word2vec text format: a "count dimensionality" header,
# then one "name v1 v2 ... vn" line per vector
f = open("question_vectors.txt", "w")
f.write("{} {}\n".format(len(vectors), doc2vec_model.vector_size))
for name, vector in vectors.items():
    f.write("{} {}\n".format(name, " ".join(vector.astype(str))))
f.close()
Then you can load the file and use the most_similar function:
keyed_model = KeyedVectors.load_word2vec_format("question_vectors.txt")
keyed_model.most_similar(str(list(y_names)[0]))
Another solution (especially if the number of questions is not that high) would be to convert questions to a np.array and compute cosine similarities directly, e.g.:
import numpy as np

questions = np.array(questions)
# cosine similarity: dot products divided by the product of the vector norms
texts_norm = np.linalg.norm(questions, axis=1)[np.newaxis].T
norm = texts_norm * texts_norm.T
product = np.matmul(questions, questions.T)
product = product / norm
# zero the diagonal, otherwise each item is closest to itself
for j in range(len(questions)):
    product[j, j] = 0
# indices of the 10 items most similar to the 0th question (unordered)
top10 = np.argpartition(product[0], -10)[-10:]
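Note that np.argpartition only guarantees which indices land in that top block, not their order; if you want a ranked list, sort the block afterwards:

top10_sorted = top10[np.argsort(product[0][top10])[::-1]]  # highest similarity first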

doc2vec How to cluster DocvecsArray

I've pieced together the following code from examples I've found around the web:
# gensim modules
from gensim import utils
from gensim.models.doc2vec import LabeledSentence
from gensim.models import Doc2Vec
from sklearn.cluster import KMeans
# random
from random import shuffle

# classifier
class LabeledLineSentence(object):
    def __init__(self, sources):
        self.sources = sources
        flipped = {}
        # make sure that keys are unique
        for key, value in sources.items():
            if value not in flipped:
                flipped[value] = [key]
            else:
                raise Exception('Non-unique prefix encountered')

    def __iter__(self):
        for source, prefix in self.sources.items():
            with utils.smart_open(source) as fin:
                for item_no, line in enumerate(fin):
                    yield LabeledSentence(utils.to_unicode(line).split(), [prefix + '_%s' % item_no])

    def to_array(self):
        self.sentences = []
        for source, prefix in self.sources.items():
            with utils.smart_open(source) as fin:
                for item_no, line in enumerate(fin):
                    self.sentences.append(LabeledSentence(utils.to_unicode(line).split(), [prefix + '_%s' % item_no]))
        return self.sentences

    def sentences_perm(self):
        shuffle(self.sentences)
        return self.sentences

sources = {'test.txt': 'DOCS'}
sentences = LabeledLineSentence(sources)

model = Doc2Vec(min_count=1, window=10, size=100, sample=1e-4, negative=5, workers=8)
model.build_vocab(sentences.to_array())
for epoch in range(10):
    model.train(sentences.sentences_perm())

print(model.docvecs)
My test.txt file contains a paragraph per line. The code runs fine and generates a DocvecsArray entry for each line of text.
My goal is to have an output like so:
cluster 1: [DOC_5,DOC_100,...DOC_N]
cluster 2: [DOC_0,DOC_1,...DOC_N]
I have found the following answer, but its output is:
cluster 1: [word,word...word]
cluster 2: [word,word...word]
How can I alter the code and get document clusters?
So it looks like you're almost there.
You are outputting a set of vectors. For the sklearn package, you have to put those into a numpy array; using numpy.asarray() (or np.stack(), as below) would probably be best. The documentation for KMeans is really stellar, and it's good across the whole library.
A note for you: I have had much better luck with DBSCAN than with KMeans; both are contained in the same sklearn library. DBSCAN doesn't require you to specify how many clusters you want on the output.
There are well-commented code examples in both links.
In my case I used:

import numpy as np
from sklearn.cluster import KMeans, DBSCAN

# infer one vector per document and collect them in a list
doc_vecs = [model.infer_vector(doc.split()) for doc in docs]
# creating a matrix from the list of vectors
mat = np.stack(doc_vecs)

# clustering with KMeans
km_model = KMeans(n_clusters=5)
km_model.fit(mat)
# get the cluster assignment labels
labels = km_model.labels_

# clustering with DBSCAN
dbscan_model = DBSCAN()
labels = dbscan_model.fit_predict(mat)
Here model is the pre-trained Doc2Vec model. In my case I didn't need to cluster the training documents themselves, but new documents saved in the docs list.
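To get the cluster 1: [DOC_5, DOC_100, ...] style of output from the question, one way (a sketch; doc_tags is a hypothetical list of document tags parallel to the rows of mat) is to group the tags by their cluster label:

from collections import defaultdict

# hypothetical: one tag per document, in the same order as the rows of mat
doc_tags = ['DOCS_%d' % n for n in range(len(labels))]

clusters = defaultdict(list)
for tag, label in zip(doc_tags, labels):
    clusters[label].append(tag)

for label, tags in sorted(clusters.items()):
    print('cluster %s: %s' % (label, tags))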

Neural Network predictions are always the same while testing an fMRI dataset with pyBrain. Why?

I am quite new to fMRI analysis. I am trying to determine which object (out of 9 objects) a person is thinking about just by looking at their brain images. I am using the dataset at https://openfmri.org/dataset/ds000105/. So, I am using a neural network, inputting 2D slices of brain images to get one of the 9 objects as output. There are details about every step and the images in the code below.
import os, mvpa2, pybrain
import numpy as np
from os.path import join as opj
from mvpa2.datasets.sources import OpenFMRIDataset
from pybrain.datasets import SupervisedDataSet, classification

path = opj(os.getcwd(), 'datasets', 'ds105')
of = OpenFMRIDataset(path)
# 12th run of the 1st subject
ds = of.get_model_bold_dataset(model_id=1, subj_id=1, run_ids=[12])
# Get the unique list of 8 objects (scissors, ...) and 'None'.
target_list = np.unique(ds.sa.targets).tolist()
# Returns a Nibabel Image instance
img = of.get_bold_run_image(subj=1, task=1, run=12)
# Getting the actual image from the proxy image
img_data = img.get_data()
# Get the middle voxels of the brain samples
mid_brain_slices = [x/2 for x in img_data.shape]
# Each image in img_data is a 3D image of 40 x 64 x 64 voxels,
# and there are 121 such samples taken periodically every 2.5 seconds.
# Thus, a single person's brain is scanned for about 300 seconds (121 x 2.5).
# This is a 4D array with 3 dimensions of space and 1 dimension of time,
# which forms a matrix of (40 x 64 x 64 x 121).
# I only want to extract the slices of the brain in its top view,
# i.e. a series of 2D images of 40 x 64.
# So, I take the middle slice of the brain, hence compute mid_brain_slices.
DS = classification.ClassificationDataSet(40*64, class_labels=target_list)
# Loop over every brain image
for i in range(0, 121):
    # Image of the brain at the i-th time interval
    brain_instance = img_data[:, :, :, i]
    # We will slice the brain to create 2D plots and use those 'pixels'
    # as the features
    slice_0 = img_data[mid_brain_slices[0], :, :, i]  # 64 x 64
    slice_1 = img_data[:, mid_brain_slices[1], :, i]  # 40 x 64
    slice_2 = img_data[:, :, mid_brain_slices[2], i]  # 40 x 64
    # Note: we may actually only need one of these slices (the one with the top view)
    X = slice_2  # possibly the top view
    # Reshape X from 40 x 64 to a 1D vector of 2560 elements
    X = np.reshape(X, 40*64)
    # Get the target at this instance (y)
    y = ds.sa.targets[i]
    y = target_list.index(y)
    DS.appendLinked(X, y)

print DS.calculateStatistics()
print DS.classHist
print DS.nClasses
print DS.getClass(1)

# Generate y as a 9 x 1 matrix with eight 0's and only one 1 (in this training set)
DS._convertToOneOfMany(bounds=[0, 1])

# Split into train and test sets
test_data, train_data = DS.splitWithProportion(0.25)
# Note: I think splitWithProportion will also internally shuffle the data

# Build the neural network
from pybrain.tools.shortcuts import buildNetwork
from pybrain.structure.modules import SoftmaxLayer
nn = buildNetwork(train_data.indim, 64, train_data.outdim, outclass=SoftmaxLayer)

from pybrain.supervised.trainers import BackpropTrainer
trainer = BackpropTrainer(nn, dataset=train_data, momentum=0.1, learningrate=0.01, verbose=True, weightdecay=0.01)
trainer.trainUntilConvergence(maxEpochs=20)
The line nn.activate(X_test[i]) should take the 2560 inputs and generate a probability output in the predicted y vector (shape 9 x 1), right?
So, I assume the highest of the 9 values should be taken as the answer. But that is not the case when I verify it against y_test[i]. Furthermore, I get similar output values for every test sample. Why is this so?
# Just splitting the test and train set
X_train = train_data.getField('input')
y_train = train_data.getField('target')
X_test = test_data.getField('input')
y_test = test_data.getField('target')

# Testing the network
for i in range(0, len(X_test)):
    print nn.activate(X_test[i])
    print y_test[i]
When I include the code above, here are some of the values I get:
.
.
.
nn.activated = [ 0.44403205 0.06144328 0.04070154 0.09399672 0.08741378 0.05695479 0.08178353 0.0623408 0.07133351]
y_test [0 1 0 0 0 0 0 0 0]
nn.activated = [ 0.44403205 0.06144328 0.04070154 0.09399672 0.08741378 0.05695479 0.08178353 0.0623408 0.07133351]
y_test [1 0 0 0 0 0 0 0 0]
nn.activated = [ 0.44403205 0.06144328 0.04070154 0.09399672 0.08741378 0.05695479 0.08178353 0.0623408 0.07133351]
y_test [0 0 0 0 0 0 1 0 0]
.
.
.
So the probability of the test sample being index 0 is 44.4% in every case, irrespective of the sample value. The actual values keep varying, though.
print 'print predictions: ', trainer.testOnClassData(dataset=test_data)
x = []
for item in y_test:
    x.extend(np.where(item == 1)[0])
print 'print actual: ', x
Here, the output comparison is :
print predictions: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print actual: [7, 0, 4, 8, 2, 0, 2, 1, 0, 6, 1, 4]
All the predictions are for the first item. I don't know what the problem is. The total error seems to be decreasing, which is a good sign though:
Total error: 0.0598287764931
Total error: 0.0512272330797
Total error: 0.0503835076374
Total error: 0.0486402801867
Total error: 0.0498354140541
Total error: 0.0495447833038
Total error: 0.0494208449895
Total error: 0.0491162599037
Total error: 0.0486775862084
Total error: 0.0486638648161
Total error: 0.0491337891419
Total error: 0.0486965691406
Total error: 0.0490016912735
Total error: 0.0489939195858
Total error: 0.0483910986235
Total error: 0.0487459940103
Total error: 0.0485516142106
Total error: 0.0477407360102
Total error: 0.0490661144891
Total error: 0.0483103097669
Total error: 0.0487965594586
I can't be sure -- because I haven't used all of these tools together before, or worked specifically on this kind of project -- but I would look at the documentation and make sure that your nn is being created as you expect it to be.
Specifically, it mentions here:
http://pybrain.org/docs/api/tools.html?highlight=buildnetwork#pybrain.tools.shortcuts.buildNetwork
that "If the recurrent flag is set, a RecurrentNetwork will be created, otherwise a FeedForwardNetwork.", and you can read here:
http://pybrain.org/docs/api/structure/networks.html?highlight=feedforwardnetwork
that "FeedForwardNetworks are networks that do not work for sequential data. Every input is treated as independent of any previous or following inputs.".
Did you mean to create a "FeedForward" network object?
You're testing by looping over an index and activating each "input" field that's based off the instantiation of a FeedForwardNetwork object, which the documentation suggests are treated as independent of other inputs. This may be why you're getting such similar results each time, when you are expecting better convergences.
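If it helps to check, here is a quick sketch (with layer sizes mirroring the question's 2560-input, 9-class setup) of what buildNetwork returns with and without the recurrent flag:

from pybrain.tools.shortcuts import buildNetwork
from pybrain.structure.modules import SoftmaxLayer

# default: a FeedForwardNetwork; every sample is treated independently
ff_net = buildNetwork(2560, 64, 9, outclass=SoftmaxLayer)
print(type(ff_net).__name__)  # FeedForwardNetwork

# with the flag set: a RecurrentNetwork, which keeps state between activations
rec_net = buildNetwork(2560, 64, 9, outclass=SoftmaxLayer, recurrent=True)
print(type(rec_net).__name__)  # RecurrentNetwork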
You initialize your dataset ds object with the parameters model_id=1, subj_id=1, run_ids=[12], suggesting that you're only looking at a single subject and model, and only at run 12 from that subject under that model, right?
Most likely there's nothing syntactically or semantically wrong with your code, but rather a general confusion arising from the PyBrain library's presumed and assumed models, parameters, and algorithms. So don't tear your hair out looking for code "errors"; this is definitely a common difficulty with under-documented libraries.
Again, I may be off base, but in my experience with similar tools and libraries, it's most often that the benefit of taking an extremely complicated process and simplifying it to just a couple dozen lines of code, comes with a TON of completely opaque and fixed assumptions.
My guess is that you're essentially re-running "new" tests on "new" or independent training data, without all the actual information and parameters that you thought you had setup in the previous code lines. You are exactly correct that the highest value (read: largest probability) is the "most likely" (that's exactly what each value is, a "likeliness") answer, especially if your probability array represents a unimodal distribution.
Because there are no obvious code syntax errors -- like accidentally looping over a range iterator equivalent to the list [0,0,0,0,0,0], which you can rule out because you reuse the index i both for y_test (whose printout varies) and for nn.activate(X_test[i]) (whose printout doesn't) -- most likely what's happening is that you're basically restarting your test every time. That would explain why you're getting an identical result, not just similar but identical, for every printout of that nn.activate(...) call.
This is a complex, but very well written and well illustrated question, but unfortunately I don't think there will be a simple or blatantly obvious solution.
Again, you're getting the benefits of PyBrain's simplification of neural networks, data training, heuristics, data reading, sampling, statistical modelling, classification, and so on and so forth, all reduced into one- or two-line commands. There are assumptions being made, TONS of them. That's what the documentation needs to illuminate, and we have to be very, very careful when we use tools like these: it's not just a matter of correct syntax, but of an actually correct (read: expected) algorithm, assumptions and all.
Good luck!
(P.S. -- Open source libraries also, despite a lack of documentation, give you the benefit of checking the source code to see [assumptions and all] what they're actually doing: https://github.com/pybrain/pybrain )

How to perform text classification with naive bayes using sklearn library?

I am trying text classification using a naive Bayes text classifier.
My data is in the format below; based on the question and excerpt, I have to decide the topic of the question. The training data has more than 20K records. I know SVM would be a better option here, but I want to go with Naive Bayes using the sklearn library.
{[{"topic":"electronics","question":"What is the effective differencial effective of this circuit","excerpt":"I'm trying to work out, in general terms, the effective capacitance of this circuit (see diagram: http://i.stack.imgur.com/BS85b.png). \n\nWhat is the effective capacitance of this circuit and will the ...\r\n "},
{"topic":"electronics","question":"Outlet Installation--more wires than my new outlet can use [on hold]","excerpt":"I am replacing a wall outlet with a Cooper Wiring USB outlet (TR7745). The new outlet has 3 wires coming out of it--a black, a white, and a green. Each one needs to be attached with a wire nut to ...\r\n "}]}
This is what I have tried so far:
import numpy as np
import json
from sklearn.naive_bayes import *

topic = []
question = []
excerpt = []
with open('training.json') as f:
    for line in f:
        data = json.loads(line)
        topic.append(data["topic"])
        question.append(data["question"])
        excerpt.append(data["excerpt"])

unique_topics = list(set(topic))
new_topic = [x.encode('UTF8') for x in topic]
numeric_topics = [name.replace('gis', '1').replace('security', '2').replace('photo', '3').replace('mathematica', '4').replace('unix', '5').replace('wordpress', '6').replace('scifi', '7').replace('electronics', '8').replace('android', '9').replace('apple', '10') for name in new_topic]
numeric_topics = [float(i) for i in numeric_topics]

x1 = np.array(question)
x2 = np.array(excerpt)
X = zip(*[x1, x2])
Y = np.array(numeric_topics)
print X[0]

clf = BernoulliNB()
clf.fit(X, Y)
print "Prediction:", clf.predict(['hello'])
But as expected, I am getting ValueError: could not convert string to float. My question is: how can I create a simple classifier to classify the question and excerpt into a related topic?
All classifiers in sklearn require input to be represented as vectors of some fixed dimensionality. For text there are CountVectorizer, HashingVectorizer and TfidfVectorizer, which can transform your strings into vectors of floating-point numbers.
vect = TfidfVectorizer()
X = vect.fit_transform(X)
Obviously, you'll need to vectorize your test set in the same way:
clf.predict( vect.transform(['hello']) )
See a tutorial on using sklearn with textual data.
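Putting the pieces together, a minimal end-to-end sketch might look like this (with a toy in-memory dataset standing in for training.json; MultinomialNB is swapped in here, since it tends to pair better with TF-IDF features than BernoulliNB, and sklearn accepts string labels directly, so the manual replace-based topic numbering isn't needed):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# toy stand-ins for the parsed JSON fields
question = ['What is the effective capacitance of this circuit',
            'Outlet installation with more wires than the outlet can use']
excerpt = ['I am trying to work out the effective capacitance ...',
           'I am replacing a wall outlet with a USB outlet ...']
topic = ['electronics', 'electronics']

# concatenate question and excerpt into a single text field per record
texts = [q + ' ' + e for q, e in zip(question, excerpt)]

vect = TfidfVectorizer()
X = vect.fit_transform(texts)

clf = MultinomialNB()
clf.fit(X, topic)
print(clf.predict(vect.transform(['capacitance of a circuit'])))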

Neural net with Pybrain will not converge

I am trying to build a simple neural network using Python and the PyBrain package.
As I am just starting to learn both the method and the PyBrain package, I tried to make a very simple neural net with some real data that I have available.
I know there is an underlying connection in my data, however the code does not converge at all, and the results after training are basically the same for any set of real validation data I put in. Below is my code and a small part of the data. I have over 5000 lines of data with known g available to train my network, but the number of points added to the training makes no difference.
from pybrain.tools.shortcuts import buildNetwork as bld
from pybrain.datasets import SupervisedDataSet as spds
from pybrain.supervised.trainers import BackpropTrainer as bpt
import numpy as np

u, g, r, i, z = np.loadtxt("dataset.dat", unpack=True)

data = spds(4, 1)
net = bld(4, 1000, 1)

# note: the loop variable must not shadow the i column loaded above
for k in range(len(u)):
    data.addSample((u[k], r[k], i[k], z[k]), (g[k],))

trainer = bpt(net, data)
trainer.trainUntilConvergence(dataset=data, maxEpochs=300, validationProportion=0.5)

p = net.activate([17.136, 15.812, 15.693, 15.675])
print p
# expected result: 16.225

p = net.activate([19.382, 17.684, 17.511, 17.435])
print p
# expected result: 18.195
18.14981 15.10829 13.96468 -10.8685 13.20411
16.84580 15.17839 14.61974 14.44930 14.44493
16.70895 15.57959 15.28097 15.16538 15.19260
18.44166 16.32709 15.45345 15.14938 15.04544
18.03881 16.49129 15.96768 15.78446 15.77211
21.15679 18.66248 17.46381 16.97513 16.75475
19.25665 17.80023 17.18956 16.97563 16.94967
17.01522 16.08040 15.85172 15.81930 15.92262
19.21695 17.72263 17.17900 16.98280 16.97201
19.98507 18.56911 17.98143 17.80738 17.81714
16.94824 15.97417 15.70555 15.59221 15.64357
21.20893 19.40982 18.68114 18.46647 18.43065
18.72652 17.38880 16.93716 16.73246 16.75096
20.57421 19.55045 19.15475 18.99772 19.02503
22.48833 20.07709 18.68276 17.60561 17.09613
22.27604 20.34056 19.66521 19.37319 19.30457
20.58372 19.18035 18.64691 18.43370 18.39288
22.25103 20.74570 20.16532 19.94144 19.78580
22.49646 19.63043 18.39409 17.97594 17.77803
19.22686 17.55373 16.97127 16.76445 16.70418
20.44500 19.34502 18.96556 18.80437 18.78767
22.69331 21.19628 19.89190 19.39628 19.11377
19.51075 18.02397 17.46963 17.31436 17.27759
19.92604 18.49456 17.97421 17.83519 17.80557
19.18904 18.22256 17.84221 17.70319 17.64457
20.23186 18.43468 17.81423 17.60103 17.54677
19.86590 18.32822 17.75089 17.57386 17.53067
20.84188 19.78345 19.42506 19.27895 19.34572
22.14103 21.86670 21.74832 21.61244 21.99680
18.02018 16.69380 16.23947 16.12869 16.09864
19.92574 18.63316 18.15877 17.95703 17.90224
Generally speaking, I get better results if I scale my data to be between 0 and 1, or better yet between 0.1 and 0.9. The neuron output is usually going to be between 0 and 1. You might try scaling your inputs and outputs to be within this range and see if you get better results.
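As a sketch, that scaling applied to the question's setup might look like this (the hidden-layer size and epoch count are arbitrary choices):

import numpy as np
from pybrain.tools.shortcuts import buildNetwork
from pybrain.datasets import SupervisedDataSet
from pybrain.supervised.trainers import BackpropTrainer

u, g, r, i, z = np.loadtxt("dataset.dat", unpack=True)
cols = np.column_stack((u, r, i, z))

# rescale each input column and the target linearly into [0.1, 0.9]
lo, hi = 0.1, 0.9
cmin, cmax = cols.min(axis=0), cols.max(axis=0)
gmin, gmax = g.min(), g.max()
X = lo + (hi - lo) * (cols - cmin) / (cmax - cmin)
y = lo + (hi - lo) * (g - gmin) / (gmax - gmin)

data = SupervisedDataSet(4, 1)
for k in range(len(y)):
    data.addSample(X[k], (y[k],))

net = buildNetwork(4, 10, 1)  # far fewer hidden units than the question's 1000
trainer = BackpropTrainer(net, data)
trainer.trainUntilConvergence(maxEpochs=100, validationProportion=0.25)

# scale a new input the same way, then map the prediction back to the g scale
sample = lo + (hi - lo) * (np.array([17.136, 15.812, 15.693, 15.675]) - cmin) / (cmax - cmin)
p = net.activate(sample)[0]
print(gmin + (p - lo) * (gmax - gmin) / (hi - lo))  # expected around 16.225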
