I have two CSV files named train.csv and test.csv.
Both files have the same structure, and I want to use train.csv as training data and test.csv as test data.
The thing is, I can't find anywhere how to use scikit-learn linear regression without splitting. Every tutorial and piece of documentation I find uses train_test_split(), but if I understand correctly, that function splits a single file (say, data.csv) into both train and test data.
Is it even possible? If not, what alternative can I use?
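If you have separate train and test data, you don't need train_test_split at all. Define X_train and y_train directly from the training file.
X_train is the features, excluding the target variable:
# Pseudocode: train is the DataFrame loaded from train.csv, target is the name of the target column
X_train = train.drop(target, axis=1)
y_train is the target variable:
# Pseudocode
y_train = train[target]
Then fit the model on the training data:
from sklearn.linear_model import LinearRegression
reg = LinearRegression().fit(X_train, y_train)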
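Putting the steps above together, here is a minimal end-to-end sketch. The column names (x1, x2, y) and the tiny in-memory CSVs are assumptions made only so the example runs on its own; with real files you would pass "train.csv" and "test.csv" to pd.read_csv instead.

```python
import io

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# In-memory stand-ins for train.csv and test.csv (hypothetical data, y = 2*x1 + 3*x2).
train_csv = io.StringIO("x1,x2,y\n1,2,8\n2,1,7\n3,4,18\n4,3,17\n5,5,25\n")
test_csv = io.StringIO("x1,x2,y\n6,1,15\n2,6,22\n")

target = "y"  # hypothetical name of the target column

train = pd.read_csv(train_csv)
test = pd.read_csv(test_csv)

# Split each file into features and target; no train_test_split involved.
X_train, y_train = train.drop(columns=[target]), train[target]
X_test, y_test = test.drop(columns=[target]), test[target]

# Fit on the training file only, then evaluate on the held-out test file.
reg = LinearRegression().fit(X_train, y_train)
y_pred = reg.predict(X_test)

print("MSE:", mean_squared_error(y_test, y_pred))
print("R^2:", r2_score(y_test, y_pred))
```

The test file is never passed to fit; it is only used for predict and for the metrics.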
I'm very new to Python and I'm trying to replicate this Sign Language Glove project here with my own hardware as a first exercise in machine learning. I can already write data from my accelerometers to CSV files, but I can't understand the process. The file named 'modeling' confuses me. Can anyone help me understand what is happening in it?
import numpy as np
from sklearn import svm
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
import pandas as pd
df= pd.read_csv("final.csv") ##This I understand. I've successfully created csv files with data
#########################################################################
## These below, I do not.
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size = 0.2)
train_features = train[['F1','F2','F3','F4','F5','X','Y','Z','C1','C2']]
train_label = train.cl
test_features = test[['F1','F2','F3','F4','F5','X','Y','Z','C1','C2']]
test_label = test.cl
## SVM
model = svm.SVC(kernel='linear', gamma=1, C=1)
model.fit(train_features, train_label)
model.score(train_features, train_label)
predicted_svm = model.predict(test_features)
print("svm")
print(accuracy_score(test_label, predicted_svm))
cn = confusion_matrix(test_label, predicted_svm)
Welcome to the community. That looks like a nice way to start off.
Like #hilverts_drinking_problem suggested, I would recommend looking at the sklearn documentation. But here's a quick explanation of what's going on.
train_test_split randomly splits the dataset into two datasets, one for training and one for testing. test_size = 0.2 means 20% of the rows go into the test set and the remaining 80% into the training set.
The next two lines separate the inputs (features) from the outputs (targets, here the cl column) for training. The two lines after that do the same for the test set.
Finally, you create an SVM object, train it with model.fit, and get its accuracy on the training data with .score. You then use the model to predict labels for the test set, print the accuracy score for those predictions, and compute the confusion matrix.
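Here is the same flow as a small self-contained sketch in Python 3. The column names come from the question; the random data standing in for final.csv, the rule used to generate the cl labels, and the random_state values are assumptions added only so the example runs and is reproducible.

```python
import numpy as np
import pandas as pd
from sklearn import svm
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

feature_cols = ["F1", "F2", "F3", "F4", "F5", "X", "Y", "Z", "C1", "C2"]

# Synthetic stand-in for final.csv: 100 rows of sensor readings, two classes.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, len(feature_cols))), columns=feature_cols)
df["cl"] = (df["F1"] + df["X"] > 0).astype(int)  # made-up labelling rule

# 1. Random 80/20 split of the rows.
train, test = train_test_split(df, test_size=0.2, random_state=0)

# 2. Separate inputs (features) from outputs (labels) in each part.
train_features, train_label = train[feature_cols], train["cl"]
test_features, test_label = test[feature_cols], test["cl"]

# 3. Train on the training part only.
model = svm.SVC(kernel="linear", C=1)
model.fit(train_features, train_label)

# 4. Predict on the unseen test part and compare with the true labels.
predicted_svm = model.predict(test_features)
test_accuracy = accuracy_score(test_label, predicted_svm)
cn = confusion_matrix(test_label, predicted_svm)

print("train accuracy:", model.score(train_features, train_label))
print("test accuracy:", test_accuracy)
print(cn)
```

Each row of the confusion matrix is a true class and each column a predicted class, so the diagonal counts the correct predictions.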
If you need me to clarify/detail something, let me know!
I need to run a simple linear regression on a large dataset (30 GB) that can't be loaded into memory. The features are mostly categorical. I've already built a prototype in scikit-learn that works just fine, but only on a subsample of the data.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import linear_model
data = pd.read_csv('datafile.csv', nrows=5e6)
""" data['categorical_feature'] it's a text field, which has categories comma separated. Example of structure is shown below.
categorical_feature
1 1671,1293
2 1293
3 1233,1671
"""
cat_vec = CountVectorizer(min_df=2)
m_cat = cat_vec.fit_transform(data['categorical_feature'])
lm = linear_model.Ridge()
lm.fit(m_cat, data['target'])
How would I write this in TensorFlow? I've looked around but didn't find much that can replicate the behaviour of CountVectorizer in scikit-learn.
I have downloaded the MNIST files from:
http://yann.lecun.com/exdb/mnist/index.html
because I want to train an SVM like the following:
clf_svm = LinearSVC()
clf_svm.fit(X_train, y_train)
but I see that the data I downloaded is divided into separate image and label files. How can I load them into numpy arrays to form the X_train variable?
I have tried the following:
path_train="D:\\Anaconda\\t10k-images-idx3-ubyte\\t10k-images.idx3-ubyte"
f=open(path_train,"rb")
train_data=cPickle.load(f)
but I got the following error:
train_data=cPickle.load(f)
EOFError
So the question is: how do I form X_train from these files?
Thanks
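The MNIST files are not pickles, so cPickle.load can't read them. They use the IDX binary format described on the download page: a 4-byte magic number (two zero bytes, a data-type code, the number of dimensions), then one big-endian 32-bit size per dimension, then the raw data. A minimal sketch of a reader, assuming unsigned-byte data as in MNIST; the function name parse_idx and the fake two-image data are made up for illustration.

```python
import struct

import numpy as np


def parse_idx(raw: bytes) -> np.ndarray:
    """Parse the bytes of an unsigned-byte IDX file into a numpy array."""
    # Magic number: 2 zero bytes, 1 byte data-type code, 1 byte dimension count.
    zeros, dtype_code, ndim = struct.unpack(">HBB", raw[:4])
    if zeros != 0 or dtype_code != 0x08:
        raise ValueError("not an unsigned-byte IDX file")
    # One big-endian 32-bit integer per dimension follows the magic number.
    header_size = 4 + 4 * ndim
    shape = struct.unpack(">" + "I" * ndim, raw[4:header_size])
    return np.frombuffer(raw, dtype=np.uint8, offset=header_size).reshape(shape)


# Tiny fake files so the sketch runs without the real download:
# two blank 28x28 images and their two labels.
fake_images = struct.pack(">HBBIII", 0, 0x08, 3, 2, 28, 28) + bytes(2 * 28 * 28)
fake_labels = struct.pack(">HBBI", 0, 0x08, 1, 2) + bytes([5, 0])

images = parse_idx(fake_images)   # shape (2, 28, 28)
y_train = parse_idx(fake_labels)  # shape (2,)

# Flatten each image into one row so scikit-learn estimators accept it.
X_train = images.reshape(len(images), -1)
print(X_train.shape, y_train)
```

With the real data you would read each file in binary mode, for example parse_idx(open(path, "rb").read()), after unzipping it if it is still a .gz archive. Note that the t10k files in the path above are the test set; the training set is in the files whose names start with train.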
I have two data sets, training and test. They have labels "1" and "0". I need to evaluate these data sets using the OneClassSVM algorithm with an "rbf" kernel in scikit-learn. I have loaded the training data set, but I have no idea how to evaluate the model with the test data set. Below is my code:
from sklearn import svm
import numpy as np
input_file_data = "/home/anuradha/TrainData.csv"
dataset = np.loadtxt(input_file_data, delimiter=",")
X = dataset[:,0:4]
y = dataset[:,4]
estimator= svm.OneClassSVM(nu=0.1, kernel="rbf", gamma=0.1)
Can someone please help me solve this problem?
It's as simple as adding the following two lines of code at the end of your script:
estimator.fit(X_train)
y_pred_test = estimator.predict(X_test)
The first line tells the SVM which training data to use, and the second makes predictions on the test set (be sure to load both datasets and change the variable names accordingly). Note that predict returns +1 for inliers and -1 for outliers, not 1 and 0.
Here is a complete example of how to use OneClassSVM, and here is the class reference.
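To actually score the predictions against "1"/"0" labels, the +1/-1 output has to be mapped onto the same label scheme first. A rough sketch, assuming that label 1 marks the normal class and 0 the outliers; the synthetic arrays stand in for the two CSV files and are not the asker's data.

```python
import numpy as np
from sklearn import svm
from sklearn.metrics import accuracy_score, confusion_matrix

rng = np.random.default_rng(0)

# Synthetic stand-ins for the CSV files: 4 feature columns each.
# Training data contains only "normal" points; OneClassSVM.fit ignores labels.
X_train = rng.normal(loc=0.0, scale=1.0, size=(200, 4))

# Test data: 40 normal points (label 1) and 10 far-away outliers (label 0).
X_test = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(40, 4)),
    rng.normal(loc=6.0, scale=1.0, size=(10, 4)),
])
y_test = np.array([1] * 40 + [0] * 10)

estimator = svm.OneClassSVM(nu=0.1, kernel="rbf", gamma=0.1)
estimator.fit(X_train)

# predict returns +1 for inliers and -1 for outliers.
y_pred_test = estimator.predict(X_test)

# Map {+1, -1} onto the {1, 0} labels used in the data (assumption: 1 = normal).
y_pred_01 = np.where(y_pred_test == 1, 1, 0)

print("accuracy:", accuracy_score(y_test, y_pred_01))
print(confusion_matrix(y_test, y_pred_01))
```

If the real training file mixes both labels, one option is to fit only on the rows labelled as normal, since the model is meant to learn what the normal class looks like.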
I'm trying to train a neural network using sknn. I have preprocessed my data in a pandas DataFrame. The preprocessing works fine when I call fit(x_train, y_train) on standard sklearn estimators, but with sknn it throws this AttributeError:
anaconda/envs/py3k/lib/python3.4/site-packages/pandas/core/generic.py", line 2360, in __getattr__
(type(self).__name__, name))
AttributeError: 'DataFrame' object has no attribute 'todense'
or this error:
/anaconda/envs/py3k/lib/python3.4/site-packages/pandas/core/indexing.py", line 1750, in maybe_convert_indices
raise IndexError("indices are out-of-bounds")
IndexError: indices are out-of-bounds
Which of the two I get seems random (it changes between runs, without changing anything).
The relevant piece of code looks like this:
x_train, x_test, y_train, y_test = cross_validation.train_test_split(X_data, Y_data, test_size=1/kfold)
regr = linear_model.LinearRegression(copy_X=True,fit_intercept=True)
abr = AdaBoostRegressor(base_estimator=tree.DecisionTreeRegressor(max_depth=max_depth_gridsearch_values[max_depth_counter]), n_estimators = n_estimators_gridsearch_values[n_estimators_counter])
nn=nn_simple_regressor
x_train_numeric = x_train.iloc[:,2:]
x_test_numeric = x_test.iloc[:,2:]
regr.fit(x_train_numeric, y_train)
abr.fit(x_train_numeric, y_train)
nn.fit(x_train_numeric,y_train)
And the regressor is defined as
nn_simple_regressor = Regressor(
layers=[
Layer("Rectifier", units=100),
Layer("Linear")],
learning_rate=0.02,
n_iter=10)
I cannot understand why this is happening, and there seems to be very little support for sknn. I suspect the issue is actually with the preprocessing, but I don't understand why it works for the first two estimators but not my NN. Any ideas?
As of February 2016, sknn does not support pandas. To fix the issues described in the question, convert the DataFrame into a numpy array before calling fit. The .as_matrix() method in pandas is the easiest way to do so.
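A small sketch of that conversion. It uses .to_numpy(), which replaces .as_matrix() in current pandas (as_matrix was removed in pandas 1.0; .values also works). The toy DataFrame and its column names are made up, and the sknn call is left as a comment because nn is the Regressor defined in the question.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the preprocessed data: two non-numeric columns, then features.
X_data = pd.DataFrame({
    "id": ["a", "b", "c"],
    "group": ["g1", "g2", "g1"],
    "f1": [0.1, 0.2, 0.3],
    "f2": [1.0, 2.0, 3.0],
})
Y_data = pd.Series([10.0, 20.0, 30.0])

# Same column selection as in the question: drop the first two columns.
x_train_numeric = X_data.iloc[:, 2:]

# Convert to plain numpy arrays before handing them to sknn.
x_train_array = x_train_numeric.to_numpy(dtype=float)
y_train_array = Y_data.to_numpy(dtype=float)

print(type(x_train_array), x_train_array.shape)

# nn.fit(x_train_array, y_train_array)  # nn is the sknn Regressor from the question
```

Converting also discards the DataFrame index, so any positional indexing done on the arrays afterwards no longer goes through pandas index lookups.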