scikit Mixtypes of Y error - python

Hi, I'm a scikit-learn newbie. I'm trying to train a classifier that, given an array of floats, decides between 3 classes. I labelled the classes as 0, 0.5, and 1. I also tried 0, 1.0, and 2.0. I still get the following error:
File "/Library/Python/2.7/site-packages/sklearn/utils/multiclass.py", line 85, in unique_labels
raise ValueError("Mix type of y not allowed, got types %s" % ys_types)
ValueError: Mix type of y not allowed, got types set(['continuous', 'multiclass'])
I have no idea what that error means.

Try using integer types for your target labels. Or, perhaps better, use string labels like ['a', 'b', 'c'] but with more descriptive names.
If you check the code for this file multiclass.py (code is here) and look for the function type_of_target, you'll see that it is well-documented for this case.
Because some of the data are treated as float type (when 0.5 is included), it will believe you've got continuous-valued outputs, which won't do for multiclass discrete classification.
On the other hand, it will look at [0, 1.0, 2.0] as one integer and two floats, which is why you get both continuous and multiclass. Switching the last example to [0, 1, 2] should work. The documentation also makes it sound like switching to [0.0, 1.0, 2.0] would work, but be careful and test that first.
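If you want to turn labels such as 0, 0.5 and 1 into integer class codes, a minimal sketch with NumPy could look like the following. The array y_raw is made-up example data and the variable names are my own; nothing here is required by scikit-learn.

```python
import numpy as np

# Made-up float labels standing in for the three classes 0, 0.5 and 1.
y_raw = np.array([0.0, 0.5, 1.0, 0.5, 0.0])

# np.unique returns the sorted distinct labels; return_inverse gives, for each
# element, the index of its label in that sorted list (an integer class code).
classes, y_codes = np.unique(y_raw, return_inverse=True)

print(classes)   # the original label values, one per class
print(y_codes)   # integer codes that can be used as classification targets
```

Indexing with classes[y_codes] maps the codes back to the original label values if you need them later.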

It's hard to tell for sure without the code, but my guess is that the shape of your y data is not what is expected.
For example, when my code threw this error it was because I was trying to pass y data into classification_report in the shape of (60000, 10, 2) when it was expecting it to be in the shape of (60000, 10).
I was re-running cells where I called to_categorical(y_test) more than once... When I loaded my code into a proper script and ran it it worked fine :)
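To see how encoding twice produces that extra dimension, here is a small sketch. The one_hot helper is a hypothetical stand-in I wrote for illustration (it is not to_categorical itself); it assumes integer labels and infers the number of classes as max label + 1.

```python
import numpy as np

def one_hot(labels):
    """Hypothetical stand-in for a one-hot encoder (not to_categorical itself)."""
    labels = np.asarray(labels, dtype=int)
    num_classes = labels.max() + 1          # inferred from the data
    return np.eye(num_classes, dtype=int)[labels]

y = np.array([3, 0, 9, 1])   # made-up integer class labels, 10 classes

once = one_hot(y)       # one row of 10 zeros/ones per sample
twice = one_hot(once)   # encoding the 0/1 entries again adds an axis of size 2

print(once.shape)
print(twice.shape)

# Checking the shape before encoding, or decoding with argmax, avoids the trap.
recovered = np.argmax(once, axis=1)
```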

Related

HMMLearn: Too Many Values to Unpack

I'm trying to use hmmlearn to get the most likely hidden state sequence from a Hidden Markov Model, given start probabilities, transition probabilities, and emission probabilities.
I have two hidden states and four possible emission values, so I'm doing this:
num_states = 2
num_observations = 4
start_probs = np.array([0.2, 0.8])
trans_probs = np.array([[0.75, 0.25], [0.1, 0.9]])
emission_probs = np.array([[0.3, 0.2, 0.2, 0.3], [0.3, 0.3, 0.3, 0.1]])
model = hmm.MultinomialHMM(n_components=num_states)
model.startprob_ = start_probs
model.transmat_ = trans_probs
model.emissionprob_ = emission_probs
seq = np.array([[3, 3, 2, 2]]).T
model.fit(seq)
log_prob, state_seq = model.decode(seq)
My stack trace points to the decode call and throws this error:
ValueError: too many values to unpack (expected 2)
I thought decode (looking at the docs) returns a log probability and the state sequence, so I'm confused.
Any idea?
Thanks!
The call model.fit(seq) requires seq to be a list of lists, which is how you correctly set it up.
However, model.decode(seq) requires seq to be a flat list, not a list of lists. Thus,
model.fit([[3, 3, 2, 2]])
log_prob, state_seq = model.decode([3, 3, 2, 2])
should work without throwing an error.
See also here.
The error ValueError: too many values to unpack (expected 2) is thrown several calls deep inside decode. So the error does not mean that decode returned the wrong number of objects; it comes from unpacking framelogprob.shape somewhere inside base.py. A more meaningful error message would make life easier here.
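To illustrate how a shape mismatch can surface as this message, here is a tiny sketch. inner_helper is a hypothetical function of my own, not hmmlearn's actual code; it only mimics the pattern of unpacking a 2-D shape.

```python
import numpy as np

def inner_helper(framelogprob):
    # Hypothetical stand-in for library internals: assumes a 2-D array.
    n_samples, n_components = framelogprob.shape
    return n_samples, n_components

# A 2-D array unpacks cleanly into two values.
ok = inner_helper(np.zeros((4, 2)))

# A 3-D array has three shape entries, so the two-name unpacking fails.
try:
    inner_helper(np.zeros((1, 4, 2)))
    message = None
except ValueError as exc:
    message = str(exc)

print(ok)
print(message)
```

The exception is raised at the unpacking line inside the helper, which is why the traceback points at the outer call even though its return value is never reached.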
I had the same issue and it drove me crazy. Hope my post helps somebody.
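If you just want to cross-check the most likely state sequence without hmmlearn, a plain NumPy Viterbi sketch is below. It uses the probabilities and observations from the question; the viterbi function and its variable names are my own and make no claim to match hmmlearn's internals.

```python
import numpy as np

def viterbi(obs, start_probs, trans_probs, emission_probs):
    """Return (log probability, most likely state path) for a discrete HMM."""
    log_start = np.log(start_probs)
    log_trans = np.log(trans_probs)
    log_emit = np.log(emission_probs)
    n_steps, n_states = len(obs), len(start_probs)

    # delta[t, j]: best log probability of any path ending in state j at time t
    delta = np.empty((n_steps, n_states))
    # backptr[t, j]: previous state on that best path
    backptr = np.zeros((n_steps, n_states), dtype=int)

    delta[0] = log_start + log_emit[:, obs[0]]
    for t in range(1, n_steps):
        # scores[i, j]: be in state i at t-1, then move to state j at t
        scores = delta[t - 1][:, None] + log_trans
        backptr[t] = np.argmax(scores, axis=0)
        delta[t] = np.max(scores, axis=0) + log_emit[:, obs[t]]

    # Trace the best path backwards from the best final state.
    path = [int(np.argmax(delta[-1]))]
    for t in range(n_steps - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    path.reverse()
    return float(np.max(delta[-1])), path

start_probs = np.array([0.2, 0.8])
trans_probs = np.array([[0.75, 0.25], [0.1, 0.9]])
emission_probs = np.array([[0.3, 0.2, 0.2, 0.3], [0.3, 0.3, 0.3, 0.1]])

log_prob, state_seq = viterbi([3, 3, 2, 2], start_probs, trans_probs, emission_probs)
print(log_prob, state_seq)
```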

Error in acorr_ljungbox from statsmodel

So I am trying to do a Ljung-Box test on a residual, but I am getting a strange error and am not able to figure out why.
x = diag.acorr_ljungbox(np.random.random(20))
I tried doing the same with a random array also, still the same error:
ValueError: operands could not be broadcast together with shapes (19,) (40,)
This looks like a bug in the default lag setting, which is set to 40 independent of the length of the data.
As a workaround, and to get a proper statistic, the lags need to be restricted, e.g. to 5 lags as below.
>>> import numpy as np
>>> from statsmodels.stats import diagnostic as diag
>>> diag.acorr_ljungbox(np.random.random(50))[0].shape
(40,)
>>> diag.acorr_ljungbox(np.random.random(20), lags=5)
(array([ 0.36718151, 1.02009595, 1.23734092, 3.75338034, 4.35387236]),
array([ 0.54454461, 0.60046677, 0.74406305, 0.44040973, 0.49966951]))
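To see why the number of lags has to stay below the sample size, here is a sketch of the Ljung-Box statistic Q(h) = n(n+2) * sum over k=1..h of r_k^2 / (n-k), written with plain NumPy. The function ljung_box_q is my own illustration, not statsmodels' implementation, and it leaves out the p-values.

```python
import numpy as np

def ljung_box_q(x, lags):
    """Ljung-Box Q statistics for lags 1..lags (illustrative sketch only)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    if lags >= n:
        # With lags >= n there is no data left to estimate the autocorrelation.
        raise ValueError("lags must be smaller than the number of observations")
    centered = x - x.mean()
    denom = np.sum(centered ** 2)
    # Sample autocorrelation r_k at lags 1..lags
    acf = np.array([np.sum(centered[k:] * centered[:-k]) / denom
                    for k in range(1, lags + 1)])
    k = np.arange(1, lags + 1)
    # Cumulative sum gives Q(1), Q(2), ..., Q(lags)
    return n * (n + 2) * np.cumsum(acf ** 2 / (n - k))

rng = np.random.RandomState(0)          # fixed seed, arbitrary choice
q_stats = ljung_box_q(rng.random_sample(20), lags=5)
print(q_stats)
```

With 20 observations, asking for 40 lags cannot work, which is consistent with the shape mismatch in the error message.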

Neural Network predictions are always the same while testing an fMRI dataset with pyBrain. Why?

I am quite new to fMRI analysis. I am trying to determine which object (out of 9 objects) a person is thinking about just by looking at their brain images. I am using the dataset at https://openfmri.org/dataset/ds000105/ . I feed 2D slices of the brain images into a neural network and want the output to be 1 of the 9 objects. Details about every step and the images are in the code below.
import os, mvpa2, pybrain
import numpy as np
from os.path import join as opj
from mvpa2.datasets.sources import OpenFMRIDataset
from pybrain.datasets import SupervisedDataSet,classification
path = opj(os.getcwd() , 'datasets','ds105')
of = OpenFMRIDataset(path)
#12th run of the 1st subject
ds = of.get_model_bold_dataset(model_id=1, subj_id=1,run_ids=[12])
#Get the unique list of 8 objects (scissors, ...) and 'None'.
target_list = np.unique(ds.sa.targets).tolist()
#Returns Nibabel Image instance
img = of.get_bold_run_image(subj=1,task=1,run=12)
# Getting the actual image from the proxy image
img_data = img.get_data()
#Get the middle voxels of the brain samples
mid_brain_slices = [x/2 for x in img_data.shape]
# Each image in the img_data is a 3D image of 40 x 64 x 64 voxels,
# and there are 121 such samples taken periodically every 2.5 seconds.
# Thus, a single person's brain is scanned for about 300 seconds (121 x 2.5).
# This is a 4D array of 3 dimensions of space and 1 dimension of time,
# which forms a matrix of (40 x 64 x 64 x 121)
# I only want to extract a 2D slice of the brain in its top view,
# i.e. a series of 2D images of 40 x 64.
# So I take the middle slice of the brain, hence mid_brain_slices.
DS = classification.ClassificationDataSet(40*64, class_labels=target_list)
# Loop over every brain image
for i in range(0,121):
    #Image of brain at i th time interval
    brain_instance = img_data[:,:,:,i]
    # We will slice the brain to create 2D plots and use those 'pixels'
    # as the features
    slice_0 = img_data[mid_brain_slices[0],:,:,i] #64 x 64
    slice_1 = img_data[:,mid_brain_slices[1],:,i] #40 x 64
    slice_2 = img_data[:,:,mid_brain_slices[2],i] #40 x 64
    #Note : we may actually only need one of these slices (the one with top view)
    X = slice_2 #Possibly top view
    # Reshape X from 40 x 64 to 1D vector 2560 x 1
    X = np.reshape(X,40*64)
    #Get the target at this instance (y)
    y = ds.sa.targets[i]
    y = target_list.index(y)
    DS.appendLinked(X,y)
print DS.calculateStatistics()
print DS.classHist
print DS.nClasses
print DS.getClass(1)
# Generate y as a 9 x 1 matrix with eight 0's and only one 1 (in this training set)
DS._convertToOneOfMany(bounds=[0, 1])
#Split into Train and Test sets
test_data, train_data = DS.splitWithProportion( 0.25 )
#Note : I think splitWithProportion will also internally shuffle the data
#Build neural network
from pybrain.tools.shortcuts import buildNetwork
from pybrain.structure.modules import SoftmaxLayer
nn = buildNetwork(train_data.indim, 64, train_data.outdim, outclass=SoftmaxLayer)
from pybrain.supervised.trainers import BackpropTrainer
trainer = BackpropTrainer(nn, dataset=train_data, momentum=0.1, learningrate=0.01 , verbose=True, weightdecay=0.01)
trainer.trainUntilConvergence(maxEpochs = 20)
The line nn.activate(X_test[i]) should take the 2560 inputs and generate a probability output in the predicted y vector (shape 9 x 1), right?
So I assume the index of the highest of the 9 values should be the predicted class. But that is not what I see when I compare it with y_test[i]. Furthermore, I get nearly the same output for every test sample. Why is this so?
#Just splitting the test and trainset
X_train = train_data.getField('input')
y_train = train_data.getField('target')
X_test = test_data.getField('input')
y_test = test_data.getField('target')
#Testing the network
for i in range(0,len(X_test)):
    print nn.activate(X_test[i])
    print y_test[i]
When I include the code above, here are some of the outputs for X_test:
.
.
.
nn.activated = [ 0.44403205 0.06144328 0.04070154 0.09399672 0.08741378 0.05695479 0.08178353 0.0623408 0.07133351]
y_test [0 1 0 0 0 0 0 0 0]
nn.activated = [ 0.44403205 0.06144328 0.04070154 0.09399672 0.08741378 0.05695479 0.08178353 0.0623408 0.07133351]
y_test [1 0 0 0 0 0 0 0 0]
nn.activated = [ 0.44403205 0.06144328 0.04070154 0.09399672 0.08741378 0.05695479 0.08178353 0.0623408 0.07133351]
y_test [0 0 0 0 0 0 1 0 0]
.
.
.
So the probability of the test sample being index 0 is 44.4% in every case, irrespective of the sample value. The actual values keep varying, though.
print 'print predictions: ' , trainer.testOnClassData (dataset=test_data)
x = []
for item in y_test:
    x.extend(np.where(item == 1)[0])
print 'print actual: ' , x
Here, the output comparison is :
print predictions: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print actual: [7, 0, 4, 8, 2, 0, 2, 1, 0, 6, 1, 4]
All the predictions are for the first item. I don't know what the problem is. The total error seems to be decreasing, which is a good sign though :
Total error: 0.0598287764931
Total error: 0.0512272330797
Total error: 0.0503835076374
Total error: 0.0486402801867
Total error: 0.0498354140541
Total error: 0.0495447833038
Total error: 0.0494208449895
Total error: 0.0491162599037
Total error: 0.0486775862084
Total error: 0.0486638648161
Total error: 0.0491337891419
Total error: 0.0486965691406
Total error: 0.0490016912735
Total error: 0.0489939195858
Total error: 0.0483910986235
Total error: 0.0487459940103
Total error: 0.0485516142106
Total error: 0.0477407360102
Total error: 0.0490661144891
Total error: 0.0483103097669
Total error: 0.0487965594586
I can't be sure -- because I haven't used all of these tools together before, or worked specifically in this kind of project -- but I would look at the documentation and be sure that your nn is being created as you expect it to.
Specifically, it mentions here:
http://pybrain.org/docs/api/tools.html?highlight=buildnetwork#pybrain.tools.shortcuts.buildNetwork
that "If the recurrent flag is set, a RecurrentNetwork will be created, otherwise a FeedForwardNetwork.", and you can read here:
http://pybrain.org/docs/api/structure/networks.html?highlight=feedforwardnetwork
that "FeedForwardNetworks are networks that do not work for sequential data. Every input is treated as independent of any previous or following inputs.".
Did you mean to create a "FeedForward" network object?
You're testing by looping over an index and activating each "input" field that's based off the instantiation of a FeedForwardNetwork object, which the documentation suggests are treated as independent of other inputs. This may be why you're getting such similar results each time, when you are expecting better convergences.
You initialize your dataset ds object with the parameters model_id=1, subj_id=1,run_ids=[12], suggesting that you're only looking at a single subject and model, but 12 "runs" from that subject under that model, right?
Most likely there's nothing semantically or grammatically wrong with your code, but a general confusion from the PyBrain library's presumed and assumed models, parameters, and algorithms. So don't tear your hair out looking for code "errors"; this is definitely a common difficulty with under-documented libraries.
Again, I may be off base, but in my experience with similar tools and libraries, it's most often that the benefit of taking an extremely complicated process and simplifying it to just a couple dozen lines of code, comes with a TON of completely opaque and fixed assumptions.
My guess is that you're essentially re-running "new" tests on "new" or independent training data, without all the actual information and parameters that you thought you had setup in the previous code lines. You are exactly correct that the highest value (read: largest probability) is the "most likely" (that's exactly what each value is, a "likeliness") answer, especially if your probability array represents a unimodal distribution.
Because there are no obvious code syntax errors -- like accidentally looping over a range iterator equivalent to the list [0,0,0,0,0,0]; which you can verify because you reuse the i index integer in printing y_test which varies and the result of nn.activate(X_test[i]) which isn't varying -- then most likely what's happening is that you're basically restarting your test every time and that's why you're getting an identical result, not just similar but identical for every printout of that nn.activate(...) method's results.
This is a complex, but very well written and well illustrated question, but unfortunately I don't think there will be a simple or blatantly obvious solution.
Again, you're getting the benefits of PyBrain's simplification of neural networks, data training, heuristics, data reading, sampling, statistical modelling, classification, and so on, all reduced to one-line or two-line commands. There are assumptions being made, TONS of them. That's what the documentation needs to be illuminating, and we have to be very careful when we use tools like these that it's not just a matter of correct syntax, but an actually correct (read: expected) algorithm, assumptions and all.
Good luck!
(P.S. -- Open source libraries also, despite a lack of documentation, give you the benefit of checking the source code to see [assumptions and all] what they're actually doing: https://github.com/pybrain/pybrain )
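For what it's worth, the "highest value wins" reading of the output vector mentioned above can be checked with a few lines of NumPy. The numbers are copied from the first printout in the question; the variable names are my own.

```python
import numpy as np

# Network output and one-hot target copied from the printout in the question.
activation = np.array([0.44403205, 0.06144328, 0.04070154, 0.09399672,
                       0.08741378, 0.05695479, 0.08178353, 0.0623408,
                       0.07133351])
target = np.array([0, 1, 0, 0, 0, 0, 0, 0, 0])

# The index of the largest entry is the predicted / actual class.
predicted_class = int(np.argmax(activation))
actual_class = int(np.argmax(target))

print(predicted_class, actual_class, predicted_class == actual_class)
```

Because the activation vector is identical for every test sample, this rule will keep returning the same class no matter what the input is.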

How to get scipy.stats.chisquare to function properly

I have 2 input files of identical size/shape, however the data they contain has a different resolution and I am looking to perform a chi squared test on them.
The input files are 500 lines long and contain 4 columns delimited by spaces. I am trying to test the second column of each input file against the other.
My code is as follows:
# Import statements
C = pl.loadtxt("input_1.txt")
D = pl.loadtxt("input_2.txt")
col2_C = C[:,1]
col2_D = D[:,1]
f_obs = np.array([col2_C])
f_exp = np.array([col2_D])
chisquare(f_obs, f_exp)
This gives me an error saying:
ValueError: df <= 0
I don't even understand what it is complaining about here.
I have tried several other syntaxes within the script, each of which also resulted in various errors:
This one was found here.
chisquare = f_obs=[col2_C], f_exp=[col2_D])
TypeError: chisquare() takes at least one positional argument
Then I tried
chisquare = f_obs(col2_C), F_exp=[col2_D)
NameError: name 'f_obs' is not defined
I also tried several other syntactical tweaks but nothing to any avail. If anybody could please help me get this running I would appreciate it greatly.
Thanks in advance.
First, be sure you are importing chisquare from scipy.stats. Numpy has the function numpy.random.chisquare, but that does not do a statistical test. It generates samples from a chi-square probability distribution.
So be sure you use:
from scipy.stats import chisquare
There is a second problem.
As slices of the two-dimensional array returned by loadtxt, col2_C and col2_D are one-dimensional numpy arrays, so there is no need to use, for example, np.array([col2_C]) when you pass these to chisquare. Just use col2_C and col2_D directly:
chisquare(col2_C, col2_D)
Wrapping the arrays with np.array like you did is causing the problem. chisquare accepts multidimensional arrays and an axis argument. When you do f_obs = np.array([col2_C]) (with the extra square brackets), f_obs is actually a two-dimensional array with shape (1, 500). Similarly, f_exp has shape (1, 500). The default axis argument of chisquare is 0. So when you called chisquare(f_obs, f_exp), you were asking chisquare to perform 500 chi-square tests, each with a single observed and expected value, which leaves zero degrees of freedom.
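Here is a small NumPy-only sketch of the shape issue. The four-element arrays are made-up stand-ins for the 500-row columns, and chi2_stat_and_df is a hypothetical helper of my own that only mimics the statistic and the degrees of freedom along an axis; it is not scipy's chisquare.

```python
import numpy as np

# Made-up stand-ins for the second columns of the two files (same total).
col2_C = np.array([10.0, 20.0, 30.0, 40.0])   # observed
col2_D = np.array([12.0, 18.0, 33.0, 37.0])   # expected

wrapped = np.array([col2_C])   # the extra brackets add a leading axis

def chi2_stat_and_df(f_obs, f_exp, axis=0):
    """Pearson chi-square statistic and degrees of freedom along one axis."""
    f_obs = np.asarray(f_obs, dtype=float)
    f_exp = np.asarray(f_exp, dtype=float)
    stat = np.sum((f_obs - f_exp) ** 2 / f_exp, axis=axis)
    df = f_obs.shape[axis] - 1   # number of categories minus one
    return stat, df

# 1-D input: one test over all categories.
stat_1d, df_1d = chi2_stat_and_df(col2_C, col2_D)

# 2-D input of shape (1, n): n separate tests with one category each.
stat_2d, df_2d = chi2_stat_and_df(np.array([col2_C]), np.array([col2_D]))

print(wrapped.shape, df_1d, df_2d, stat_2d.shape)
```

With a single category per test, the degrees of freedom drop to zero, which matches the df <= 0 complaint.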

numpy, h5py: How do I make an array sorted by one of its columns from a bigger one saved with h5py?

I'd like to give you some background info so you understand my problem better.
From the results of an experiment I fill a big hdf5 table with lots of columns using h5py. Once all my measurements are done, I need to plot and fit some results. This is already working but when I get to the point when I want to plot the fitting function, as my data is not sorted by the column with the 'x' axis data, instead of a single line I get an ugly back-and-forth line (I'd show it to you but I don't have enough reputation yet).
So my first thought was to sort the arrays before plotting and fitting. I tried following several guides I found here, but my joined array had the wrong shape, and that was when I thought there might be a better way of doing it.
So my question is, What's the best way of getting an array sorted by one of its columns from a bigger array saved in an hdf5 file using h5py?
This is how I'm currently doing it:
Let's say I already extracted the columns from the hdf5 file (even though maybe this could be improved!), now I'm making them up.
import numpy as np
x_d = np.array([5, 2, 10, 4])
y_d = np.array([0.2, 1.0, 4.1, 0.1])
wtype = np.dtype([('x', x_d.dtype), ('y', y_d.dtype)])
w = np.empty(len(x_d), dtype=wtype)
w['x'] = x_d
w['y'] = y_d
w.sort(order='x')
Something along these lines should work:
import h5py
import numpy
f = h5py.File('myfile.hdf5','r')
x_d = f['x_axis'][:]
y_d = f['values'][:]
sorted_y = y_d[numpy.argsort(x_d)]
or if you want to have the reverse order:
sorted_y = y_d[numpy.argsort(x_d)[::-1]]
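For plotting you will usually want x sorted with the same index array, so both columns stay aligned. A minimal sketch, using the made-up values from the question instead of reading from an HDF5 file:

```python
import numpy as np

# Made-up columns from the question; in practice these come from the HDF5 file.
x_d = np.array([5, 2, 10, 4])
y_d = np.array([0.2, 1.0, 4.1, 0.1])

# argsort gives the indices that put x in ascending order.
order = np.argsort(x_d)

# Applying the same indices to both arrays keeps each (x, y) pair together.
x_sorted = x_d[order]
y_sorted = y_d[order]

print(x_sorted)
print(y_sorted)
```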
