data dimension of scikit learn linear regression - python

I just started using the Python scikit-learn package to do linear regression, and I am confused about the data dimensions it requires. For example, I want to regress X on Y using the following code:
from sklearn import linear_model
x=[0,1,2]
y=[0,1,2]
regr = linear_model.LinearRegression()
regr.fit(x, y)
print('Coefficients: \n', regr.coef_)
The system returned the error: tuple index out of range.
According to the scikit-learn website, valid arrays should look like:
x=[[0,0],[1,1],[2,2]]
y=[0,1,2]
(http://scikit-learn.org/stable/modules/linear_model.html#ordinary-least-squares)
from sklearn import linear_model
x=[[0,0],[1,1],[2,2]]
y=[0,1,2]
regr = linear_model.LinearRegression()
regr.fit(x, y)
print('Coefficients: \n', regr.coef_)
So does this mean the package cannot regress X[i] on Y[i] when both are single numbers? Must each element of X be an array mapped to a number, like [0,0] in X mapping to 0 in Y?
Thanks in advance.

You can.
Simply reshape your data to be x = [[0], [1], [2]].
In this case, every point in your data will have a single feature: a single number.

Scikit-learn requires your x to be a 2-dimensional array. It need not be a numpy array; you can always use a simple Python list.
If your x is a 1-dimensional array, as in your question, you can simply do the following:
x = [[value] for value in [0, 1, 2]]
This stores a 2D version of your 1D array in x, i.e. every individual value of your list is wrapped in its own list.

x can also be converted into a numpy array, and then reshaped as follows:
import numpy as np
x = np.array(x).reshape(-1, 1)
This converts your data into a 2D array that you can use to fit the linear regression model from sklearn. The reshaped x looks like:
array([[0],
       [1],
       [2]])
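
Putting it all together, a minimal working sketch of the original single-feature regression (the expected coefficient in the comment follows from the data, where y equals x exactly):

import numpy as np
from sklearn import linear_model

# Wrap each scalar in its own row so x is 2-D: (n_samples, n_features) = (3, 1)
x = np.array([0, 1, 2]).reshape(-1, 1)
y = [0, 1, 2]

regr = linear_model.LinearRegression()
regr.fit(x, y)
print('Coefficients:', regr.coef_)  # expected: [1.]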

Related

Found input variables with inconsistent numbers of samples: [14559, 1455900]

I am facing some problems when I try to fit the model. This happens when I use LogisticRegression, Naive Bayes, or SVM models, but I get results with random forest or decision tree models.
The error says:
ValueError: y should be a 1d array, got an array of shape (20799, 100) instead.
The suggested fix is to use y_train.ravel() when fitting the model, but then the following error appears:
Found input variables with inconsistent numbers of samples: [14559, 1455900]
Here's my code:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
df=pd.read_csv('../input/filteredbymany.csv',low_memory=False,usecols=['county','crashalcoh','drvrsex','developmen','lightcond','drvrvehtyp','drvrage','pedage','city','crashloc','crashtype','pedpos'])
df.dropna(inplace=True)
dummies= pd.get_dummies(df)
merged=pd.concat([df,dummies],axis='columns')
X = merged
X = X.drop(['county','crashalcoh','city','developmen','drvrage','drvrsex','drvrvehtyp','lightcond','pedage','crashloc','crashtype','pedpos'],axis='columns')
y = X.loc[:, X.columns.str.startswith('county')]
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=0)
model = LogisticRegression()
model.fit(X_train,y_train.values.ravel())
model.predict(X_test)
I have been struggling with this for around 80 hours or so. Please help.
The problem
You want an array X with N rows, where each row is a sample and each column is a feature of these samples. Then you want an array y with N values; the i-th value of y is the value (the "label") you want to predict for the i-th row of X.
The first error
Your y is two-dimensional (shape (N, 100)), but it should be one-dimensional (shape (N,)). You have 100 labels for each instance in X, but the model you chose can only predict one label per instance.
The second error
Then you ravel it into a one-dimensional array of shape (100*N,). Now you have one dimension, but still 100 times too many values, which is exactly the mismatch in the second error: 14559 samples in X versus 1455900 = 14559 * 100 values in y.
Solution
Look at your tables X and y and see which column of y you actually want.
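
A minimal sketch of that fix, assuming the original 'county' column is the single label you want (the column names come from the question's CSV):

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

y = df['county']                                   # 1-D: one label per row
X = dummies.loc[:, ~dummies.columns.str.startswith('county')]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000)          # max_iter raised to help convergence
model.fit(X_train, y_train)                        # y is already 1-D, no ravel() needed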

Python Linear Regression Predict Error - Array Issue

When I try to use .predict on my linear regression, I get the following error:
ValueError: Expected 2D array, got scalar array instead:
array=80.
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
I don't really understand the reshape feature and why it's needed. Can somebody please explain what this does, and how to apply it to get a prediction for my model?
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
x = np.array([95,85,80,70,60])
y = np.array([85,95,70,65,70])
x = x.reshape(-1,1)
y = y.reshape(-1,1)
plt.scatter(x,y)
plt.show()
reg = LinearRegression()
reg.fit(x,y)
reg.predict(80)
The input to predict() must be a 2D array; you are passing a scalar, which is why you get the error. You need to pass 80 as a 2D list, [[80]]:
reg.predict([[80]])
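
For intuition about the two reshape options the error message mentions, a quick sketch of what each produces (the -1 tells numpy to infer that dimension from the array's length):

import numpy as np

a = np.array([95, 85, 80, 70, 60])

print(a.reshape(-1, 1))  # shape (5, 1): 5 samples with 1 feature each
print(a.reshape(1, -1))  # shape (1, 5): 1 sample with 5 features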

How can I replace svm.SVR 'rbf' kernel in sklearn using my own RBF function?

I have developed the code below as the start of a project using the SVM method:
import numpy as np
import pandas as pd
from sklearn import svm
from sklearn.datasets import load_boston
from sklearn.metrics import mean_absolute_error
housing = load_boston()
df = pd.DataFrame(np.c_[housing['data'], housing['target']],
                  columns=np.append(housing['feature_names'], ['target']))
features = df.columns.tolist()
label = features[-1]
features = features[:-1]
x_train = df[features].iloc[:400]
y_train = df[label].iloc[:400]
x_test = df[features].iloc[400:]
y_test = df[label].iloc[400:]
svr = svm.SVR(kernel='rbf')
svr.fit(x_train, y_train)
y_pred = svr.predict(x_test)
print(mean_absolute_error(y_pred, y_test))
Now I want to use my customized rbf kernel which is:
def my_rbf(feat, lbl):
    #feat = feat.values
    #lbl = lbl.values
    ans = np.array([])
    gamma = 0.000005
    for i in range(len(feat)):
        ans = np.append(ans, np.exp(-gamma * np.dot(feat[i]-lbl[i], feat[i]-lbl[i])))
    return ans
Then I changed it to svm.SVR(kernel=my_rbf), but I get plenty of errors no matter how I modify it. I also tried a simple function like np.dot(feat-lbl, feat-lbl), which worked fine in SVR.fit, but svr.predict raised an error saying the shape of the input matrix has to be [n_samples_test, n_samples_train].
I'm stymied by these errors. Can anyone help me make this code work?
The custom kernel method my_rbf you coded uses both X (features) and y (labels). You cannot evaluate this function during prediction, since you have no access to labels then. The custom kernel is flawed.
Background
The RBF kernel is defined as (from the Wikipedia article):
K(x, x') = exp(-||x - x'||^2 / (2*sigma^2))
where x and x' are two feature (X) vectors.
Let H(X) be a function that transforms a vector X into another space (normally a very high-dimensional one). SVM needs to calculate the dot product between all combinations of the feature vectors, i.e. all the H(X)'s. So if H(X1) . H(X2) = K(X1, X2), then K is called the kernel function, or the kernelization of H. Instead of transforming the points X1 and X2 into a very high dimension and calculating the dot product there, K calculates it directly from X1 and X2.
Conclusion
my_rbf is not a valid kernel function simply because it uses the labels (the y's). A kernel should operate only on the feature vectors.
According to this source, the RBF kernel function I was looking for (which takes two feature matrices X and Y as inputs and outputs a Gram matrix of shape [n_samples_X, n_samples_Y], as explained more thoroughly in the docs) is something like this:
def my_kernel(X, Y):
    K = np.zeros((X.shape[0], Y.shape[0]))
    for i, x in enumerate(X):
        for j, y in enumerate(Y):
            K[i, j] = np.exp(-1 * np.linalg.norm(x - y)**2)
    return K

clf = SVR(kernel=my_kernel)
which gives results exactly equal to:
clf = SVR(kernel="rbf", gamma=1)
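
A quick way to check that equivalence is to compare the custom kernel against sklearn's built-in pairwise helper; a sketch:

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

X = np.random.rand(5, 3)
print(np.allclose(my_kernel(X, X), rbf_kernel(X, X, gamma=1)))  # True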
In terms of speed, it is not as efficient as the default svm library's rbf. Using Cython's static typing for the loop indexes, and memory-views for the numpy arrays, could speed it up a little; a vectorized numpy version, sketched below, also avoids the Python-level double loop.
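
The vectorized sketch, assuming gamma=1 to match the comparison above; it computes all pairwise squared distances at once with broadcasting:

import numpy as np

def my_kernel_vec(X, Y, gamma=1.0):
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    # ||x - y||^2 = ||x||^2 + ||y||^2 - 2*x.y, computed for all pairs at once
    sq_dists = ((X**2).sum(axis=1)[:, None]
                + (Y**2).sum(axis=1)[None, :]
                - 2.0 * X @ Y.T)
    return np.exp(-gamma * sq_dists)

This returns the same [n_samples_X, n_samples_Y] Gram matrix as my_kernel, so it can be passed to SVR(kernel=my_kernel_vec) directly.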

Scikit-learn f1_score for list of strings

Is there any way to compute f1_score for a list of string labels regardless of their order?
f1_score(['a','b','c'],['a','c','b'],average='macro')
I want this to return 1 instead of 0.33333333333.
I know I could vectorize the labels, but this syntax would be far easier in my case, since I am dealing with many labels.
What you need is the f1_score for a multilabel classification task, and for that you need 2-d matrices for y_true and y_pred of shape [n_samples, n_labels].
You are currently supplying only a 1-d array, so it is treated as a multi-class problem, not a multilabel one.
The official documentation provides the necessary details.
And for it to be scored correctly, you need to convert y_true and y_pred to label-indicator matrices, as documented here:
y_true : 1d array-like, or label indicator array / sparse matrix
y_pred : 1d array-like, or label indicator array / sparse matrix
So you need to change the code like this:
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import f1_score
y_true = [['a','b','c']]
y_pred = [['a','c','b']]
binarizer = MultiLabelBinarizer()
# This should be your original approach
#binarizer.fit(your actual true output consisting of all labels)
# In this case, I am considering only the given labels.
binarizer.fit(y_true)
f1_score(binarizer.transform(y_true),
         binarizer.transform(y_pred),
         average='macro')
Output: 1.0
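
To see why this scores 1.0, it helps to look at the indicator matrix the binarizer produces; a quick sketch:

from sklearn.preprocessing import MultiLabelBinarizer

binarizer = MultiLabelBinarizer()
binarizer.fit([['a', 'b', 'c']])

print(binarizer.classes_)                       # ['a' 'b' 'c']
print(binarizer.transform([['a', 'c', 'b']]))   # [[1 1 1]] -- order no longer matters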
You can have a look at more examples of MultiLabelBinarizer here:
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html
https://stackoverflow.com/a/42392689/3374996

using x.reshape on a 1D array in sklearn

I tried to use sklearn's simple decision tree classifier, and it complained that using a 1D array is now deprecated and that I must use X.reshape(1, -1). I did, but that turned my labels list into a list of lists with only one element, so the numbers of labels and samples no longer match. In other words, my list labels=[0,0,1,1] turns into [[0 0 1 1]]. Thanks.
This is the simple code that I used:
from sklearn import tree
import numpy as np
features =[[140,1],[130,1],[150,0],[170,0]]
labels=[0,0,1,1]
labels = np.array(labels).reshape(1,-1)
clf = tree.DecisionTreeClassifier()
clf = clf.fit(features,labels)
print clf.predict([150,0])
You are reshaping the wrong thing. Reshape the data you are predicting on, not your labels.
>>> clf.predict(np.array([150,0]).reshape(1,-1))
array([1])
Your labels have to align with your training data (features), so the lengths of both arrays should be the same. If labels is reshaped, you are right: it becomes a list of lists with a length of 1, which no longer matches the length of your features.
You have to reshape your test data because prediction needs an array that looks like your training data, i.e. each row needs to be an example with the same number of features as in training. You'll see that the following two commands return a list of lists and just a list, respectively.
>>> np.array([150,0]).reshape(1,-1)
array([[150, 0]])
>>> np.array([150,0])
array([150, 0])
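
Putting it together, a minimal corrected version of the question's script (labels left 1-D, only the prediction input reshaped; the expected output in the comment follows from the training data above):

from sklearn import tree
import numpy as np

features = [[140, 1], [130, 1], [150, 0], [170, 0]]
labels = [0, 0, 1, 1]          # one label per sample, kept 1-D

clf = tree.DecisionTreeClassifier()
clf.fit(features, labels)

# predict() expects a 2-D array: one row per sample to predict
print(clf.predict(np.array([150, 0]).reshape(1, -1)))  # -> [1]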
