Sklearn Multiclass Dataset Loading - python

For a multiclass problem I use Scikit-Learn. I can find very few examples of how to load a custom dataset with multiple classes. The sklearn.datasets.load_files method does not seem suitable, as files would need to be stored multiple times. I now have the following structure:
X => Python list with lists of features (in text).
y => Python list with lists of classes (in text).
How do I transform this to a structure Scikit-Learn can use in a classifier?

import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
X = np.loadtxt('samples.csv', delimiter=",")
y_aux = np.loadtxt('targets.csv', delimiter=",")
y = MultiLabelBinarizer().fit_transform(y_aux)
Code explanation: let's say you have all your features stored in a file called samples.csv and the multiclass labels in another file called targets.csv (they could of course be stored in the same file and you'd just need to split the columns). For clarity, in this example my files contain:
samples.csv
4.0,3.2,5.5
6.8,5.6,3.3
targets.csv
1,4 <-- sample one belongs to classes 1 and 4
2,3 <-- sample two belongs to classes 2 and 3
MultiLabelBinarizer encodes the output targets in such a way that the y variable is ready to be fed into multiclass classifiers. The output of the code is:
y = array([[1, 0, 0, 1],
           [0, 1, 1, 0]])
meaning sample one belongs to classes 1 and 4 and sample two belongs to 2 and 3.
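As a quick follow-up (not part of the original answer), one estimator that accepts X together with this indicator matrix directly is OneVsRestClassifier; a minimal sketch:

from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# X is the (n_samples, n_features) array loaded from samples.csv,
# y is the binary indicator matrix produced by MultiLabelBinarizer above.
clf = OneVsRestClassifier(LogisticRegression())
clf.fit(X, y)
print(clf.predict(X))  # one row of 0/1 indicators per sample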

Related

how does voting work between two Tree-Based classifiers that are combined using hard voting in scikit?

For a classification task, I am using sklearn's VotingClassifier to ensemble a Random Forest and an Extra-Trees classifier with the parameter voting='hard'. I don't understand how this works, since both tree-based models already produce their final prediction using a voting technique. How can they work in combination with hard voting? And what happens if there is a tie between the two models?
Can anyone explain this with an example?
You can look that up in the source code of the voting classifier.
In short, it doesn't make much sense to use two classifiers with hard voting; rather use soft voting.
The reason is that in hard-voting mode, the sklearn VotingClassifier takes the majority vote, and with only two voters it only gets interesting when there is a tie. If there are as many zeros as ones in a binary classification, the hard-voting classifier will vote for 0.
You can test this yourself by running the code it executes.
First set up the data for the experiment:
import numpy as np
# create a random int array with values 0 and 1
# with 20 rows (20 predictions) of 10 voters (10 columns)
a = np.random.randint(0, 2, size=(20,10))
# then produce some tie-lines with different combinations
a[0,:] = [0]*5 + [1]*5 # the first 5 predict 0 the next 5 predict 1
a[1,:] = [1]*5 + [0]*5 # vice versa
a[2,:] = [0,1]*5 # alternating predictions, starting with 0
a[3,:] = [1,0]*5 # the same, just starting with 1
# if you want to check, how many ones you have, do:
a.sum(axis=1)
Now see what the voting code does with this. The code for hard voting is the following (it simulates the case where you have equally weighted classifiers, weights=[1]*10):
np.apply_along_axis(
    lambda x: np.argmax(
        np.bincount(x, weights=[1.0]*10)),
    axis=1, arr=a)
The result is:
array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
You can see that for the first four entries (the ties we introduced manually), the result is 0. So in the case of a tie, the voting classifier will always choose 0 (if you give each classifier the same weight). Note that the weights are not only used to resolve ties: you could give one classifier double the weight of another, so you might even get a tie with 3 classifiers this way. But whenever the summed weight of all 0-predicting classifiers equals the summed weight of all 1-predicting classifiers, the voting classifier will predict 0 rather than 1.
Here is the relevant code:
Sklearn Voting code and Description of numpy.argmax
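For reference, a minimal sketch of the ensemble being discussed, switched to soft voting as suggested (this is illustrative and not from the original question; the toy dataset is a placeholder):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, VotingClassifier

X, y = make_classification(n_samples=200, random_state=0)
# With only two estimators, soft voting averages the predicted class
# probabilities instead of counting two hard votes (which can tie).
ensemble = VotingClassifier(
    estimators=[('rf', RandomForestClassifier(n_estimators=100, random_state=0)),
                ('et', ExtraTreesClassifier(n_estimators=100, random_state=0))],
    voting='soft')
ensemble.fit(X, y)
print(ensemble.predict(X[:5]))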

How to select feature sizes

I'm trying to replicate an experiment from a paper that uses SVM, to improve my knowledge of machine learning. In this paper, the author extracts the features and chooses the feature sizes. He then shows a table where F represents the size of the feature vector and N represents the number of face images.
He then works with the parameters F >= 9 and N >= 15.
Now, what I want to do is actually grab the features I extract, as he does in the paper.
Basically, this is how I extract the features:
def load_image_files(fullpath, dimension=(64, 64)):
    descr = "A image classification dataset"
    images = []
    flat_data = []
    target = []
    dimension = (64, 64)
    for category in CATEGORIES:
        path = os.path.join(DATADIR, category)
        for person in os.listdir(path):
            personfolder = os.path.join(path, person)
            for imgname in os.listdir(personfolder):
                class_num = CATEGORIES.index(category)
                fullpath = os.path.join(personfolder, imgname)
                img_resized = resize(skimage.io.imread(fullpath), dimension, anti_aliasing=True, mode='reflect')
                flat_data.append(img_resized.flatten())
                images.append(skimage.io.imread(fullpath))
                target.append(class_num)
    flat_data = np.array(flat_data)
    target = np.array(target)
    images = np.array(images)
    print(CATEGORIES)
    return Bunch(data=flat_data,
                 target=target,
                 target_names=category,
                 images=images,
                 DESCR=descr)
How do I select the number of features extracted and stored? Or how do I manually store a vector with the number of features that I need, for instance a feature vector of size 9?
I'm trying to separate my features this way:
X_train, X_test, y_train, y_test = train_test_split(
    image_dataset.data, image_dataset.target, test_size=0.3, random_state=109)
model = ExtraTreesClassifier(n_estimators=10)
model.fit(X_train, y_train)
print(model.feature_importances_)
However, my output is:
[0. 0. 0. ... 0. 0. 0.]
For SVM classification, I'm trying to use OneVsRestClassifier:
model_to_set = OneVsRestClassifier(SVC(kernel="poly"))
parameters = {
    "estimator__C": [1, 2, 4, 8],
    "estimator__kernel": ["poly", "rbf"],
    "estimator__degree": [1, 2, 3, 4],
}
model_tunning = GridSearchCV(model_to_set, param_grid=parameters)
model_tunning
model_tunning.fit(X_train, y_train)
prediction = model_tunning.best_estimator_.predict(X_test)
Then, once I call prediction, I get:
Out[29]:
array([1, 0, 4, 2, 1, 3, 3, 0, 1, 1, 3, 4, 1, 1, 0, 3, 2, 2, 2, 0, 4, 2,
       2, 4])
So you've got two arrays of image information (one unprocessed, the other resized and flattened) as well as a list of corresponding class values (which we usually call labels). There are currently 2 things not quite right with the setup, however:
1) What's missing here are multiple features - these might include specific arrays from data associated with feature extraction from morphological/computer vision processes of your images, or they may be ancillary data like a list of preferences, behaviors, purchases. Basically, anything that can act as an array in either a numerical or categorical format. Technically speaking, your resized images are a second feature, but I don't think this will add much if any improvement in model performance.
2) target_names=category in your function return will store only the last iteration of category in CATEGORIES. I don't know if this is what you want.
Going back to your table, N would refer to the number of images in the dataset, and F would be the number of corresponding feature arrays associated with that image. By way of example, let's say we have fifty individual wines and five features (colour, taste, alcohol content, pH, optical density). N of 5 would be five of those wines, and F of 2 would be, say, colour, taste.
If I had to guess at what your features would be, they would in fact be a single feature - the image data itself. Looking at your data structure, every label/category you have will have multiple individuals (people) each with multiple examples of images of that person. Note that multiple individuals are not separate features - the way you're structuring the data, the individuals are grouped together under a single category.
So, where to from here? Without knowing what paper you're reading it's hard to suggest what to do, but I would go back and see if you can perhaps provide us with more information about the problem.
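That said, if the goal is just to end up with a fixed-size feature vector (say F = 9) from the flattened image data, one generic option (a sketch, not necessarily what the paper does) is a dimensionality-reduction step such as PCA:

from sklearn.decomposition import PCA

# image_dataset is the Bunch returned by load_image_files above;
# its .data attribute holds one flattened image per row.
pca = PCA(n_components=9)          # 9 mirrors the paper's F >= 9
X_reduced = pca.fit_transform(image_dataset.data)
print(X_reduced.shape)             # (number_of_images, 9)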

Sklearn inverse_transform return only one column when fit to many

Is there a way to inverse_transform one column with sklearn, when the initial transformer was fit on the whole data set? Below is an example of what I am after.
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
# Setting up a dummy pipeline
pipes = []
pipes.append(('scaler', MinMaxScaler()))
transformation_pipeline = Pipeline(pipes)
# Random data.
df = pd.DataFrame(
    {'data1': [1, 2, 3, 1, 2, 3],
     'data2': [1, 1, 1, 2, 2, 2],
     'Y': [1, 4, 1, 2, 2, 2]
     }
)
# Fitting the transformation pipeline
test = transformation_pipeline.fit_transform(df)
# Pulling the scaler function from the pipeline.
scaler = transformation_pipeline.named_steps['scaler']
# This is what I thought may work.
predicted_transformed = scaler.inverse_transform(test['Y'])
# The output would look something like this,
# essentially overlooking that the scaler was fit on 3 variables
# and inverting only the last one, or any column I need.
predicted_transformed = [1, 4, 1, 2, 2, 2]
I need to be able to fit the whole dataset as part of a data-prep process. But then I am loading the scaler later into another instance with sklearn.externals.joblib. In this new instance the predicted values are the only thing that exists, so I need to apply just the inverse scaler for the Y column to get back the originals.
I am aware that I could fit one transformer for the X variables and another for the Y variable; however, I would like to avoid this. That approach would add to the complexity of moving the scalers around and maintaining both of them in future projects.
A bit late but I think this code does what you are looking for:
# - scaler = the scaler object (it needs an inverse_transform method)
# - data = the data to be inverse transformed as a Series, ndarray, ...
#   (a 1d object you can assign to a df column)
# - colName = the name of the column to which the data belongs
# - colNames = all column names of the data on which scaler was fit
#   (necessary because the scaler will only accept a df of the same shape as the one it was fit on)
def invTransform(scaler, data, colName, colNames):
    dummy = pd.DataFrame(np.zeros((len(data), len(colNames))), columns=colNames)
    dummy[colName] = data
    dummy = pd.DataFrame(scaler.inverse_transform(dummy), columns=colNames)
    return dummy[colName].values
Note that you need to provide enough information for the helper to use the inverse_transform method of the scaler object behind the scenes.
Similar problem here. I have a multidimensional time series as input (a quantity plus 'exogenous' variables), and one dimension (a quantity) as output. I am unable to invert the scaling to compare the forecast to the original test set, since the scaler expects a multidimensional input.
One solution I can think of is using separate scalers for the quantity and the exogenous columns.
Another solution I can think of is to give the scaler sufficient 'junk' columns just to fill out the dimensions of the array to be unscaled, then only look at the first column of the output.
Then, once I forecast, I can invert the scaling on the forecast to get values which I can compare to the test set.
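A minimal sketch of the first option, using separate scalers for the target and the remaining columns (the names below are illustrative, reusing the df from the question):

from sklearn.preprocessing import MinMaxScaler

scaler_X = MinMaxScaler()
scaler_y = MinMaxScaler()
X_scaled = scaler_X.fit_transform(df[['data1', 'data2']])
y_scaled = scaler_y.fit_transform(df[['Y']])
# ... train a model and forecast in scaled space ...
# then invert only the target scaling on the forecast:
# forecast_original = scaler_y.inverse_transform(forecast_scaled)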
Improving on what Willem said, this version works with less input:
def invTransform(scaler, data):
    dummy = pd.DataFrame(np.zeros((len(data), scaler.n_features_in_)))
    dummy[0] = data
    dummy = pd.DataFrame(scaler.inverse_transform(dummy), columns=dummy.columns)
    return dummy[0].values
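A usage sketch for Willem's four-argument helper, continuing the MinMaxScaler example from the question (note that the shorter two-argument variant above implicitly treats the data as the first column the scaler was fit on):

# test is the array returned by fit_transform above; 'Y' is its third column
scaler = transformation_pipeline.named_steps['scaler']
y_scaled = test[:, 2]
y_original = invTransform(scaler, y_scaled, 'Y', list(df.columns))
print(y_original)   # array([1., 4., 1., 2., 2., 2.])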

Neural Network predictions are always the same while testing an fMRI dataset with pyBrain. Why?

I am quite new to fMRI analysis. I am trying to determine which object (out of 9 objects) a person is thinking about just by looking at their brain images. I am using the dataset at https://openfmri.org/dataset/ds000105/ . So, I am using a neural network, feeding in 2D slices of brain images and getting one of the 9 objects as output. There are details about every step and about the images in the code below.
import os, mvpa2, pyBrain
import numpy as np
from os.path import join as opj
from mvpa2.datasets.sources import OpenFMRIDataset
from pybrain.datasets import SupervisedDataSet,classification
path = opj(os.getcwd() , 'datasets','ds105')
of = OpenFMRIDataset(path)
#12th run of the 1st subject
ds = of.get_model_bold_dataset(model_id=1, subj_id=1,run_ids=[12])
#Get the unique list of 8 objects (scissors, ...) plus 'None'.
target_list = np.unique(ds.sa.targets).tolist()
#Returns Nibabel Image instance
img = of.get_bold_run_image(subj=1,task=1,run=12)
# Getting the actual image from the proxy image
img_data = img.get_data()
#Get the middle voxelds of the brain samples
mid_brain_slices = [x/2 for x in img_data.shape]
# Each image in the img_data is a 3D image of 40 x 64 x 64 voxels,
# and there are 121 such samples taken periodically every 2.5 seconds.
# Thus, a single person's brain is scanned for about 300 seconds (121 x 2.5).
# This is a 4D array of 3 dimensions of space and 1 dimension of time,
# which forms a matrix of (40 x 64 x 64 x 121)
# I only want to extract the slice of the 2D images brain in it's top view
# i.e. a series of 2D images 40 x 64
# So, i take the middle slice of the brain, hence compute the middle_brain_slices
DS = classification.ClassificationDataSet(40*64, class_labels=target_list)
# Loop over every brain image
for i in range(0,121):
    # Image of the brain at the i-th time interval
    brain_instance = img_data[:,:,:,i]
    # We will slice the brain to create 2D plots and use those 'pixels'
    # as the features
    slice_0 = img_data[mid_brain_slices[0],:,:,i] # 64 x 64
    slice_1 = img_data[:,mid_brain_slices[1],:,i] # 40 x 64
    slice_2 = img_data[:,:,mid_brain_slices[2],i] # 40 x 64
    # Note: we may actually only need one of these slices (the one with the top view)
    X = slice_2 # possibly the top view
    # Reshape X from 40 x 64 to a 1D vector of length 2560
    X = np.reshape(X, 40*64)
    # Get the target at this instance (y)
    y = ds.sa.targets[i]
    y = target_list.index(y)
    DS.appendLinked(X, y)
print DS.calculateStatistics()
print DS.classHist
print DS.nClasses
print DS.getClass(1)
# Generate y as a 9 x 1 matrix with eight 0's and only one 1 (in this training set)
DS._convertToOneOfMany(bounds=[0, 1])
#Split into Train and Test sets
test_data, train_data = DS.splitWithProportion( 0.25 )
#Note : I think splitWithProportion will also internally shuffle the data
#Build neural network
from pybrain.tools.shortcuts import buildNetwork
from pybrain.structure.modules import SoftmaxLayer
nn = buildNetwork(train_data.indim, 64, train_data.outdim, outclass=SoftmaxLayer)
from pybrain.supervised.trainers import BackpropTrainer
trainer = BackpropTrainer(nn, dataset=train_data, momentum=0.1, learningrate=0.01 , verbose=True, weightdecay=0.01)
trainer.trainUntilConvergence(maxEpochs = 20)
The line nn.activate(X_test[i]) should take the 2560 inputs and generate a probability output in the predicted y vector (of shape 9 x 1), right?
So, I assume the highest of the 9 values should be taken as the answer. But that is not the case when I verify it against y_test[i]. Furthermore, I get similar activations for every test sample. Why is this so?
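To be explicit, the comparison I have in mind is roughly this (just a sketch, with nn, X_test and y_test as defined below):

# take the index of the largest activation as the predicted class,
# and the position of the single 1 in the one-hot target as the actual class
predicted_class = np.argmax(nn.activate(X_test[i]))
actual_class = np.argmax(y_test[i])
print predicted_class, actual_class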
#Just splitting the test and trainset
X_train = train_data.getField('input')
y_train = train_data.getField('target')
X_test = test_data.getField('input')
y_test = test_data.getField('target')
#Testing the network
for i in range(0,len(X_test)):
    print nn.activate(X_test[i])
    print y_test[i]
When I include the code above, here are some of the values printed:
.
.
.
nn.activated = [ 0.44403205 0.06144328 0.04070154 0.09399672 0.08741378 0.05695479 0.08178353 0.0623408 0.07133351]
y_test [0 1 0 0 0 0 0 0 0]
nn.activated = [ 0.44403205 0.06144328 0.04070154 0.09399672 0.08741378 0.05695479 0.08178353 0.0623408 0.07133351]
y_test [1 0 0 0 0 0 0 0 0]
nn.activated = [ 0.44403205 0.06144328 0.04070154 0.09399672 0.08741378 0.05695479 0.08178353 0.0623408 0.07133351]
y_test [0 0 0 0 0 0 1 0 0]
.
.
.
So the predicted probability of the test sample being class index 0 is 44.4% in every case, irrespective of the sample. The actual targets keep varying, though.
print 'print predictions: ', trainer.testOnClassData(dataset=test_data)
x = []
for item in y_test:
    x.extend(np.where(item == 1)[0])
print 'print actual: ', x
Here, the output comparison is :
print predictions: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print actual: [7, 0, 4, 8, 2, 0, 2, 1, 0, 6, 1, 4]
All the predictions are for the first class. I don't know what the problem is. The total error seems to be decreasing, which is a good sign, though:
Total error: 0.0598287764931
Total error: 0.0512272330797
Total error: 0.0503835076374
Total error: 0.0486402801867
Total error: 0.0498354140541
Total error: 0.0495447833038
Total error: 0.0494208449895
Total error: 0.0491162599037
Total error: 0.0486775862084
Total error: 0.0486638648161
Total error: 0.0491337891419
Total error: 0.0486965691406
Total error: 0.0490016912735
Total error: 0.0489939195858
Total error: 0.0483910986235
Total error: 0.0487459940103
Total error: 0.0485516142106
Total error: 0.0477407360102
Total error: 0.0490661144891
Total error: 0.0483103097669
Total error: 0.0487965594586
I can't be sure -- because I haven't used all of these tools together before, or worked specifically on this kind of project -- but I would look at the documentation and make sure that your nn is being created as you expect it to be.
Specifically, it mentions here:
http://pybrain.org/docs/api/tools.html?highlight=buildnetwork#pybrain.tools.shortcuts.buildNetwork
that "If the recurrent flag is set, a RecurrentNetwork will be created, otherwise a FeedForwardNetwork.", and you can read here:
http://pybrain.org/docs/api/structure/networks.html?highlight=feedforwardnetwork
that "FeedForwardNetworks are networks that do not work for sequential data. Every input is treated as independent of any previous or following inputs.".
Did you mean to create a "FeedForward" network object?
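If it helps to make that explicit, the equivalent network could also be constructed by hand with PyBrain's structure API (a sketch mirroring the layer sizes used above; this is what buildNetwork is expected to assemble when the recurrent flag is not set):

from pybrain.structure import FeedForwardNetwork, LinearLayer, SigmoidLayer, FullConnection
from pybrain.structure.modules import SoftmaxLayer

net = FeedForwardNetwork()
in_layer = LinearLayer(train_data.indim)        # 2560 inputs (40 x 64 pixels)
hidden_layer = SigmoidLayer(64)
out_layer = SoftmaxLayer(train_data.outdim)     # 9 classes
net.addInputModule(in_layer)
net.addModule(hidden_layer)
net.addOutputModule(out_layer)
net.addConnection(FullConnection(in_layer, hidden_layer))
net.addConnection(FullConnection(hidden_layer, out_layer))
net.sortModules()                               # finalizes the network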
You're testing by looping over an index and activating each "input" field that's based off the instantiation of a FeedForwardNetwork object, which the documentation suggests are treated as independent of other inputs. This may be why you're getting such similar results each time, when you are expecting better convergences.
You initialize your dataset object ds with the parameters model_id=1, subj_id=1, run_ids=[12], suggesting that you're only looking at a single subject and model, and only run 12 from that subject under that model, right?
Most likely there's nothing semantically or syntactically wrong with your code; the difficulty is the general confusion that comes from the PyBrain library's presumed and assumed models, parameters, and algorithms. So don't tear your hair out looking for code "errors"; this is definitely a common difficulty with under-documented libraries.
Again, I may be off base, but in my experience with similar tools and libraries, the benefit of taking an extremely complicated process and simplifying it to just a couple dozen lines of code most often comes with a TON of completely opaque and fixed assumptions.
My guess is that you're essentially re-running "new" tests on "new" or independent training data, without all the actual information and parameters that you thought you had set up in the previous lines of code. You are exactly correct that the highest value (read: largest probability) is the "most likely" answer (that's exactly what each value is, a "likeliness"), especially if your probability array represents a unimodal distribution.
Because there are no obvious code syntax errors -- like accidentally looping over a range iterator equivalent to the list [0,0,0,0,0,0], which you can rule out because you reuse the same index i when printing y_test (which varies) and the result of nn.activate(X_test[i]) (which doesn't) -- most likely what's happening is that you're basically restarting your test every time, and that's why you're getting an identical result, not just similar but identical, for every printout of that nn.activate(...) call.
This is a complex, but very well written and well illustrated question, but unfortunately I don't think there will be a simple or blatantly obvious solution.
Again, you're getting the benefits of PyBrain's simplification of neural networks, data training, heuristics, data reading, sampling, statistical modelling, classification, and so on and so forth, all reduced into one- or two-line commands. There are assumptions being made, TONS of them. That's what the documentation needs to illuminate, and we have to be very, very careful when we use tools like these that it's not just a matter of correct syntax, but of an actually correct (read: expected) algorithm, assumptions and all.
Good luck!
(P.S. -- Open source libraries also, despite a lack of documentation, give you the benefit of checking the source code to see [assumptions and all] what they're actually doing: https://github.com/pybrain/pybrain )

Getting feature names in addition to values - SciKitLearn+Pandas

I generate a set of features for input, which I store as a table using pandas and the CSV format.
(Each column header represents a feature name, except for the first, blank column, which is where the class labels are stored for each row.)
My next step is reading the table from the CSV file into scikit-learn. (I'm currently doing this with pandas again.) However, after training and experimenting with my models using different feature selection methods (and different initially generated features), I want the NAMES of the selected features.
I assume this should be trivial, but I just haven't found how to do it.
(Note: I am NOT working on standard text documents, so "CountVectorizer" and "NaiveBayes"/nltk and the like do not help me).
I need a method to get the selected features, (and preferably something to drop the unselected ones, for when I apply the models and selected features on new "test" data).
Thank you very much!
My data is currently loaded like this:
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, LabelBinarizer
from sklearn.linear_model import LogisticRegression
def load_data(filename="Feat_normalized.csv"):
    df = pd.read_csv(filename, index_col=0)
    lb = LabelEncoder()
    labels = lb.fit_transform((df.index.values))
    features = df.values
    feature_names = list(df.columns)
    feature_names.pop(0) # Remove index.
    return (features, labels, lb)

features, labels, lb_encoder = load_data(filename)
X, y = features, labels
clf_logit = LogisticRegression(penalty="l1", dual=False, class_weight='auto')
X_reduced = clf_logit.fit_transform(X, y)
print('New sparse (filtered) features matrix size:')
print(X_reduced.shape)
#Then fit to various models: Random Forests, SVM, etc.
Truncated Example of the first 2 rows in the input data/csv:
AA_C AA__D AA__E AA_F AA__G AA_H AA_I AA_K AA_L AA_M
Mammal_sequence_1.0.fasta 3.838099345 0.456591162 3.764884604 3.620232638 3.460992571 3.858487012 2.69247235 3.18710619 3.671029774 4.625996297 1.542632799
(AA_* = feature name. Mammal_sequence_1.0.fasta = class name/label; one per row, stored in the first column with the empty header.)
Thank you very much!
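One thing I have considered (just a sketch, not sure it is the right approach) is to look at which coefficients of the L1 logistic regression above remain non-zero, and use that mask against the column names; any selector's get_support() mask could presumably be used the same way:

import numpy as np

# assuming feature_names (built inside load_data above) is also available
# and lines up with the columns of X
selected_mask = np.any(clf_logit.coef_ != 0, axis=0)
selected_names = np.array(feature_names)[selected_mask]
print(selected_names)
# the same mask could drop unselected columns from new "test" data:
# X_new_reduced = X_new[:, selected_mask]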
