OneClassSVM scikit learn - python

I have two data sets, training and test. They have labels "1" and "0". I need to evaluate these data sets using the OneClassSVM algorithm with an "rbf" kernel in scikit-learn. I loaded the training data set, but I have no idea how to evaluate it with the test data set. Below is my code:
from sklearn import svm
import numpy as np

input_file_data = "/home/anuradha/TrainData.csv"
dataset = np.loadtxt(input_file_data, delimiter=",")
X = dataset[:, 0:4]   # first four columns are the features
y = dataset[:, 4]     # fifth column is the label
estimator = svm.OneClassSVM(nu=0.1, kernel="rbf", gamma=0.1)
Can someone please help me solve this problem?

It's as simple as adding the following two lines of code at the end of your script:
estimator.fit(X_train)
y_pred_test = estimator.predict(X_test)
The first line tells the SVM which training data to use, and the second one makes predictions on the test set (be sure to load both datasets and to change variable names accordingly).
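Putting both pieces together, a minimal sketch, assuming a TestData.csv with the same five-column layout as your training file (that path and layout are guesses, not from your post):

from sklearn import svm
import numpy as np

train = np.loadtxt("/home/anuradha/TrainData.csv", delimiter=",")
test = np.loadtxt("/home/anuradha/TestData.csv", delimiter=",")
X_train, y_train = train[:, 0:4], train[:, 4]
X_test, y_test = test[:, 0:4], test[:, 4]

estimator = svm.OneClassSVM(nu=0.1, kernel="rbf", gamma=0.1)
estimator.fit(X_train)                   # one-class fit: labels are not used
y_pred_test = estimator.predict(X_test)  # returns +1 (inlier) or -1 (outlier)

Note that OneClassSVM predicts +1/-1, so map your 1/0 labels accordingly before comparing them to y_pred_test.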
Here is a complete example of how to use OneClassSVM, and here is the class reference.

Related

How to use scikit-learn linear regression without using split?

I have 2 CSV files named train.csv and test.csv.
Both files have the same structure, and I want to use train.csv as train data and test.csv as test data.
The thing is, I can't find anywhere how to use scikit-learn linear regression without using split. Every tutorial/documentation I find uses the function train_test_split(), but if I understand correctly it's used to split a single file (let's say data.csv) into both train and test data.
Is it even possible? If not, what alternative can I use?
If you have separate train and test data, define X_train and y_train.
X_train is the features, excluding the target variable:
# Pseudo code
X_train = train.drop(target, axis=1)
y_train is the target variable:
# Pseudo code
y_train = train[target]
Then fit as usual:
from sklearn.linear_model import LinearRegression
reg = LinearRegression().fit(X_train, y_train)
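A fuller hedged sketch, assuming both files share the same structure and using "target" as a placeholder for your actual target column name:

import pandas as pd
from sklearn.linear_model import LinearRegression

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

target = "target"  # placeholder: replace with your target column name
X_train, y_train = train.drop(target, axis=1), train[target]
X_test, y_test = test.drop(target, axis=1), test[target]

reg = LinearRegression().fit(X_train, y_train)
print(reg.score(X_test, y_test))  # R^2 on the held-out test file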

Why does the line not cut across the data?

I am using a linear regression model to predict my data.
Orig Data
When I use an sns plot, I am able to see the line cut through all the data points.
Using seaborn.lmplot
But when I use the train_test_split function:
The coefficient and intercept are as below:
Weight = [0.20504568]
Intercept = -1.0383656275693958
But the graph is totally off:
graph using train test split
How can I fix this?
My bad... I was using the wrong X value for the line graph.
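For anyone who hits the same thing, a minimal sketch of the fix (variable names are assumed, not from the original post): the line has to be drawn against the same X that was used for the prediction.

import matplotlib.pyplot as plt

plt.scatter(X_test, y_test, label="data")
plt.plot(X_test, model.predict(X_test), color="red", label="fit")  # same X for line and prediction
plt.legend()
plt.show()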

Handling larger tensorflow dataset

I am relatively new to TensorFlow and have been putting together some model training based on a tutorial I found on the TensorFlow website. I have been able to put together something functional that satisfies my preliminary requirements.
I am reading a local CSV file that provides links to images, associated with labels written on the same CSV row. My code roughly looks like this:
def map_func(*row):
    img = process_img(img_filename)  # img_filename presumably comes from the row (snippet is abridged)
    output = read(row)
    return img, output
dataset = tf.data.experimental.CsvDataset(CSV_FILE, default_format, header=True)
dataset = dataset.map(map_func)
dataset = dataset.shuffle(buffer_size=shuffle_buffer_size)
dataset = dataset.batch(NB_IMG)
dataset = dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
X, y = next(iter(dataset))
X_train, X_test = tf.split(X, split, axis=0)
y_train, y_test = tf.split(y, split, axis=0)
model = create_model()
model.compile(optimizer=OPTIMIZER, loss='mse')
model.fit(x=X_train, y=y_train, epochs=EPOCHS, validation_data=(X_test, y_test))
NB_IMG is the total number of images I have. EPOCHS is arbitrarily fixed here to a given value (in general 20 or 40) and the split is a ratio applied to NB_IMG.
All my images are stored locally on my computer, and with that code my GPU can currently manage roughly up to 50000 images. The training fails with more images (GPU memory exhausted). I understand that this is because I am reading all the data at once, but I am a bit stuck on how to take the next step here and manage a bigger dataset.
This part below is the one that needs improvement, I guess:
X, y = next(iter(dataset))
Could someone here help me move forward and point me towards some examples or snippets where I can train the model on a bigger dataset? I am a bit lost on the next move and not sure where to focus in the TensorFlow documentation. I did not really find a clear example online that suits my needs. How should I loop over different batches? How is the iterator coded?
Thanks!
Well, can you give more details about the two functions process_img and read?
During my experiments, I have noticed that the shuffle function can be slow when you have a lot of data and the buffer size is big. Try commenting out that line and check whether it runs faster. If so, you can use pandas to load your CSV file, then shuffle it, and use tf.data.Dataset.from_tensor_slices.
Tensorflow has a great tool now to profile models and the dataset pipeline (https://www.tensorflow.org/tensorboard/tensorboard_profiling_keras).
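A quick sketch of that suggestion (the "filename" and "label" column names are assumptions):

import pandas as pd
import tensorflow as tf

df = pd.read_csv(CSV_FILE).sample(frac=1.0)  # shuffle all rows up front
dataset = tf.data.Dataset.from_tensor_slices(
    (df["filename"].values, df["label"].values))
dataset = dataset.map(lambda f, y: (process_img(f), y))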
process_img and read are very simple functions:
def process_img(filename):
    img = tf.io.read_file(filename)
    return tf.image.decode_jpeg(img, channels=3)

def read(row):
    return row[1]
The shuffle part of my code is slow, but it does not seem to be the cause of the failure; I can remove it and shuffle the data directly from the CSV. It seems to fail at the X, y = next(iter(dataset)) line if the dataset is too big.
Thanks for your suggestion to profile the code, I will give it a go. Is there any other possible approach to split and iterate over the dataset?
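For the record, a hedged sketch of the usual fix (BATCH_SIZE is a new, assumed constant): instead of materialising everything with next(iter(dataset)), split the dataset with take/skip and pass the batched datasets straight to model.fit, so TensorFlow streams small batches rather than loading all images at once.

n_train = int(NB_IMG * split)  # 'split' assumed here to be the train ratio

dataset = tf.data.experimental.CsvDataset(CSV_FILE, default_format, header=True)
dataset = dataset.map(map_func)

train_ds = (dataset.take(n_train)
            .shuffle(buffer_size=shuffle_buffer_size)
            .batch(BATCH_SIZE)
            .prefetch(buffer_size=tf.data.experimental.AUTOTUNE))
test_ds = (dataset.skip(n_train)
           .batch(BATCH_SIZE)
           .prefetch(buffer_size=tf.data.experimental.AUTOTUNE))

model = create_model()
model.compile(optimizer=OPTIMIZER, loss='mse')
model.fit(train_ds, epochs=EPOCHS, validation_data=test_ds)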

Sign Language Glove project from GitHub: Help in understanding code

I'm very new to Python and I'm trying to replicate this Sign Language Glove project here with my own hardware as a first practice in Machine Learning. I could already write data to CSV files from my accelerometers, but I can't understand the process. The file named 'modeling' confuses me. Can anyone help me understand the processes happening here?
import numpy as np
from sklearn import svm
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
import pandas as pd
df= pd.read_csv("final.csv") ##This I understand. I've successfully created csv files with data
#########################################################################
## These below, I do not.
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size = 0.2)
train_features = train[['F1','F2','F3','F4','F5','X','Y','Z','C1','C2']]
train_label = train.cl
test_features = test[['F1','F2','F3','F4','F5','X','Y','Z','C1','C2']]
test_label = test.cl
## SVM
model = svm.SVC(kernel='linear', gamma=1, C=1)   # linear-kernel support vector classifier
model.fit(train_features, train_label)           # train on the 80% split
model.score(train_features, train_label)         # accuracy on the training data itself
predicted_svm = model.predict(test_features)     # predicted labels for the held-out 20%
print("svm")
print(accuracy_score(test_label, predicted_svm))
cn = confusion_matrix(test_label, predicted_svm)
Welcome to the community. That looks like a nice way to start off.
Like #hilverts_drinking_problem suggested, I would recommend looking at the sklearn documentation. But here's a quick explanation of what's going on.
The train_test_split function randomly splits the dataset into two datasets for the sake of training and testing. test_size = 0.2 means 20% of the data will be in the test set and the remaining 80% in the train set.
The next two lines just separate out the inputs (features) and outputs (targets) for training. The same happens for the test set in the two lines after that.
Finally, you create an SVM object, train the model using model.fit, and get its score on the training data using .score. You then use the model to predict labels for the test set and print the accuracy score for that set, along with its confusion matrix.
If you need me to clarify/detail something, let me know!
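A tiny follow-up illustration (not from the project; it reuses the variables above): the .score call measures accuracy on the training data, which tends to be optimistic, so it is worth printing the test-set numbers too.

print("train accuracy:", model.score(train_features, train_label))
print("test accuracy:", accuracy_score(test_label, predicted_svm))
print("confusion matrix:\n", cn)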

Tensorflow OneHotEncoder on large file for Linear Regression

I need to run a simple linear regression on a large (30 GB) dataset that can't be loaded into memory. The features are mostly categorical. I've already built a prototype in scikit-learn that works just fine, but only on a subsample of the data.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import linear_model
data = pd.read_csv('datafile.csv', nrows=5_000_000)
""" data['categorical_feature'] is a text field containing comma-separated categories. An example of the structure is shown below.
categorical_feature
1 1671,1293
2 1293
3 1233,1671
"""
cat_vec = CountVectorizer(min_df=2)
m_cat = cat_vec.fit_transform(data['categorical_feature'])
lm = linear_model.Ridge()
lm.fit(m_cat, data['target'])
How would I write this in TensorFlow? I've looked around but didn't find much that can replicate the behaviour of CountVectorizer in scikit-learn.
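A hedged sketch of one way to do it (the column names come from the prototype above; everything else is an assumption): Keras's TextVectorization layer with output_mode='count' behaves much like CountVectorizer, and tf.data.experimental.make_csv_dataset streams the file in batches, so the 30GB never has to fit in memory.

import tensorflow as tf

# Stream the CSV in batches, reading only the needed columns.
raw = tf.data.experimental.make_csv_dataset(
    'datafile.csv', batch_size=1024, num_epochs=1,
    select_columns=['categorical_feature', 'target'],
    label_name='target')

# Comma-split token counts, similar to CountVectorizer.
vectorizer = tf.keras.layers.TextVectorization(
    standardize=None,                       # keep commas intact until the split
    split=lambda s: tf.strings.split(s, ','),
    output_mode='count', max_tokens=20000)  # vocabulary cap is an arbitrary choice
vectorizer.adapt(raw.map(lambda feats, label: feats['categorical_feature']))

# Linear regression is a single Dense unit on top of the counts; an L2
# kernel_regularizer would mimic Ridge.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(1,), dtype=tf.string),
    vectorizer,
    tf.keras.layers.Dense(1),
])
model.compile(optimizer='sgd', loss='mse')

train = raw.map(lambda feats, label:
                (tf.expand_dims(feats['categorical_feature'], -1), label))
model.fit(train, epochs=1)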
