Found input variables with inconsistent numbers of samples: [14559, 1455900] - python

I am facing problems when I try to fit a model. This happens with LogisticRegression, Naive Bayes, and SVM models, but I get results with random forest regression or a decision tree.
The error says:
ValueError: y should be a 1d array, got an array of shape (20799, 100) instead.
The suggested fix is to use y_train.ravel() when fitting the model, but then the following error appears:
Found input variables with inconsistent numbers of samples: [14559, 1455900]
Here's my code:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

cols = ['county','crashalcoh','drvrsex','developmen','lightcond','drvrvehtyp',
        'drvrage','pedage','city','crashloc','crashtype','pedpos']
df = pd.read_csv('../input/filteredbymany.csv', low_memory=False, usecols=cols)
df.dropna(inplace=True)

dummies = pd.get_dummies(df)                        # one-hot encode every column
merged = pd.concat([df, dummies], axis='columns')
X = merged.drop(cols, axis='columns')               # keep only the dummy columns
y = X.loc[:, X.columns.str.startswith('county')]    # all county_* dummies: shape (N, 100)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = LogisticRegression()
model.fit(X_train, y_train.values.ravel())          # raises the error above
model.predict(X_test)
I have been struggling with this for around 80 hours. Please help.

The problem
You want to have an array X with N rows. Each row is a sample of something and each column is a feature of these samples. And then you want to have an array y with N values. The i'th value of y is the value ("label") you want to predict for the i'th row of X.
The first error
Your y is two-dimensional (shape is (N, 100)), but it should be one-dimensional (shape (N,)). So you have 100 labels for each instance in X, but the model you chose can only predict one label per instance.
The second error
Then you ravel it into a one-dimensional array with shape (100*N,). Now y has one dimension, but 100 times as many entries as X has rows, which is exactly the mismatch in the second error: 14559 * 100 = 1455900.
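A tiny illustration with the shapes from the question:
import numpy as np
y = np.zeros((14559, 100))   # 2-D labels: 100 label columns per sample
y.ravel().shape              # (1455900,): 1-D now, but 100 times too many values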
Solution
Look at your tables X and y and see which column of y you actually want.
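For example, here is a minimal sketch of one possible fix, assuming the label you actually want is the original county value (one label per row) rather than its 100 one-hot dummy columns; it reuses df, dummies, and the imports from the question:
y = df['county']                                                # shape (N,): one label per sample
X = dummies.loc[:, ~dummies.columns.str.startswith('county')]   # drop the county dummies from X
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = LogisticRegression(max_iter=1000)                       # extra iterations to help convergence
model.fit(X_train, y_train)                                     # no ravel needed: y is already 1-D
model.predict(X_test)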

Related

Fit a Normalizer with an array, then transform another in python with sklearn

I'm not sure if I'm doing something wrong, or if this is not the correct way to do this.
I'm encoding variables in a dataset for a model. I'm using a Normalizer() from sklearn.preprocessing to normalize one of my variables, which is numerical.
My dataset is split in two, one part for training and one for inference. My goal is to normalize this numerical variable (let's call it column x) in the training subset, and then use the fitted normalization parameters to normalize the same variable in the inference subset. The two subsets don't have the same number of entries, so what I'm doing is:
from sklearn.preprocessing import Normalizer

nr = Normalizer()
nr.fit([df1.x])                 # note: [df1.x] is one row of 697 "features"
new_col = nr.transform(df1.x)
Now, the problem is that when I try to use the same normalizer on column x in the inference subset, which has a different number of rows:
new_col1 = nr.transform(df2.x)
I get:
X has 10 features, but Normalizer is expecting 697 features as input.
I'm not sure if it's a reshape problem or if Normalizer() shouldn't be used this way, so any advice would be more than welcome.
Normalizer is used to normalize rows, whereas StandardScaler is used to normalize columns. From your question, it seems that you want to scale columns, so you should use StandardScaler.
scikit-learn transformers expect a 2D array of shape (n_samples, n_features) as input, but a pandas.Series is a one-dimensional ndarray with axis labels.
You can fix that by passing a pandas.DataFrame to the transformer, as follows:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df1 = pd.DataFrame({'x': np.random.uniform(low=0, high=10, size=1000)})
df2 = pd.DataFrame({'x': np.random.uniform(low=0, high=10, size=850)})

scaler = StandardScaler()
new_col = scaler.fit_transform(df1[['x']])   # fit on the training column; df1[['x']] is 2-D
new_col1 = scaler.transform(df2[['x']])      # reuse the fitted mean/std on the inference column
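To make the row/column distinction concrete, here is a tiny illustration on toy data (not from the question):
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

X = np.array([[3.0, 4.0],
              [6.0, 8.0]])
Normalizer().fit_transform(X)       # scales each ROW to unit norm: [[0.6, 0.8], [0.6, 0.8]]
StandardScaler().fit_transform(X)   # standardizes each COLUMN:     [[-1., -1.], [ 1.,  1.]]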

Python Linear Regression Predict Error - Array Issue

When I try to use .predict on my linear regression, I get the following error:
ValueError: Expected 2D array, got scalar array instead:
array=80.
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
I don't really understand the reshape function and why it's needed. Can somebody explain what it does, and how to apply it to get a prediction from my model?
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
x = np.array([95,85,80,70,60])
y = np.array([85,95,70,65,70])
x = x.reshape(-1,1)
y = y.reshape(-1,1)
plt.scatter(x,y)
plt.show()
reg = LinearRegression()
reg.fit(x,y)
reg.predict(80)   # raises: Expected 2D array, got scalar array instead
The input to predict() must be a 2D array; you are passing a scalar, which is why you get the error. You need to pass 80 as a 2D list, [[80]]:
reg.predict([[80]])
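For reference, here is a quick illustration of what the two reshape calls from the error message do; scikit-learn estimators always want the 2D (n_samples, n_features) layout, and -1 tells NumPy to infer that dimension:
import numpy as np

a = np.array([95, 85, 80, 70, 60])   # shape (5,): a plain 1D array
a.reshape(-1, 1)                     # shape (5, 1): five samples with one feature each
a.reshape(1, -1)                     # shape (1, 5): one sample with five features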

Python PolynomialFeatures transforms data into different shape from the original one

I'm using sklearn's PolynomialFeatures to preprocess data into various degree transformations in order to compare their model fit.
Below is my code:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split

np.random.seed(0)

# x and y are the original data
n = 100
x = np.linspace(0, 10, n) + np.random.randn(n)/5
y = np.sin(x) + n/6 + np.random.randn(n)/10

# use PolynomialFeatures and fit_transform to transform the original data to degree 2
poly1 = PolynomialFeatures(degree=2)
x_D2_poly = poly1.fit_transform(x)   # problem: x is 1-D, shape (100,)

# check their dimensions
x.shape
x_D2_poly.shape
However, the above transformation returned an array of shape (1, 5151) from the original x of (100, 1). This is not what I expected, and I couldn't figure out what's wrong with my code. It would be great if someone could point out the error in my code or a misconception on my part.
Should I use an alternative method to transform the original data instead?
Thank you.
[update]
So after I used x = x.reshape(-1, 1) to reshape the original x, poly1.fit_transform(x) does give me output with the desired 100 rows. However, when I did a train_test_split, fitted the data, and tried to obtain predicted values:
x_poly1_train, x_poly1_test, y_train, y_test = train_test_split(x_poly1, y, random_state=0)
linreg = LinearRegression().fit(x_poly1_train, y_train)
poly_predict = LinearRegression().predict(x)   # problem: a new, unfitted model and untransformed x
Python returned an error message:
shapes (1,100) and (2,) not aligned: 100 (dim 1) != 2 (dim 0)
Apparently I got the dimensions wrong somewhere again. Could anyone shed some light on this?
Thank you.
I think you need to reshape your x:
x = x.reshape(-1, 1)
Your x had shape (100,), not (100, 1), and fit_transform expects two dimensions.
The reason you were getting 5151 features is that you were seeing one feature for each distinct pair of features (100*99/2 = 4950), one for each feature squared (100), one for the first power of each feature (100), and one for the 0th power (1).
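You can verify that count directly; this is a quick sanity check that mimics how the 1-D x was interpreted, i.e. as one sample with 100 features:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

one_row = np.random.randn(1, 100)   # one sample, 100 features
PolynomialFeatures(degree=2).fit_transform(one_row).shape   # (1, 5151) = 4950 + 100 + 100 + 1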
Response to your edited question:
You need to call transform to convert the data you wish to predict on.
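A minimal sketch of that fix, reusing poly1 and linreg from your code and reshaping x first as described above:
x_2d = x.reshape(-1, 1)                                 # (100, 1)
poly_predict = linreg.predict(poly1.transform(x_2d))    # transform with the SAME fitted poly1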

using x.reshape on a 1D array in sklearn

I tried to use sklearn's simple decision tree classifier, and it complained that using a 1D array is now deprecated and that I must use X.reshape(1,-1). So I did, but it turned my labels list into a list of lists with only one element, so the number of labels and samples no longer match. In other words, my list of labels=[0,0,1,1] turns into [[0 0 1 1]]. Thanks.
This is the simple code that I used:
from sklearn import tree
import numpy as np

features = [[140, 1], [130, 1], [150, 0], [170, 0]]
labels = [0, 0, 1, 1]
labels = np.array(labels).reshape(1, -1)   # this reshape is the problem
clf = tree.DecisionTreeClassifier()
clf = clf.fit(features, labels)
print(clf.predict([150, 0]))
You are reshaping the wrong thing. Reshape the data you are predicting on, not your labels.
>>> clf.predict(np.array([150,0]).reshape(1,-1))
array([1])
Your labels have to align with your training data (features), so the lengths of both arrays should be the same. If labels is reshaped, you are right: it becomes a list of lists with length 1, which does not equal the length of your features.
You have to reshape your test data because prediction needs an array that looks like your training data, i.e. each row needs to be a sample with the same number of features as in training. You'll see that the following two commands return a nested array and a flat array, respectively.
>>> np.array([150,0]).reshape(1,-1)
array([[150, 0]])
>>> np.array([150,0])
array([150, 0])
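Putting both points together, here is a minimal corrected version of the snippet above: leave the labels 1D and reshape only the sample being predicted.
from sklearn import tree
import numpy as np

features = [[140, 1], [130, 1], [150, 0], [170, 0]]
labels = [0, 0, 1, 1]                     # 1D: one label per sample, aligned with features

clf = tree.DecisionTreeClassifier()
clf.fit(features, labels)
print(clf.predict(np.array([150, 0]).reshape(1, -1)))   # [1]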

data dimension of scikit learn linear regression

I just started using the Python scikit-learn package to do linear regression, and I am confused about the dimensions the data set requires. For example, I want to regress X on Y using the following code:
from sklearn import linear_model
x=[0,1,2]
y=[0,1,2]
regr = linear_model.LinearRegression()
regr.fit (x,y)
print('Coefficients: \n', regr.coef_)
The system returned the error: tuple index out of range.
According to the scikit-learn website, valid arrays should look like
x=[[0,0],[1,1],[2,2]]
y=[0,1,2]
(http://scikit-learn.org/stable/modules/linear_model.html#ordinary-least-squares)
from sklearn import linear_model
x=[[0,0],[1,1],[2,2]]
y=[0,1,2]
regr = linear_model.LinearRegression()
regr.fit (x,y)
print('Coefficients: \n', regr.coef_)
So does that mean the package cannot regress X[i] on Y[i] as two single numbers? Must each entry of X be an array rather than a single number, like [0,0] in X mapping to 0 in Y?
Thanks in advance.
You can.
Simply reshape your data to be x = [[0], [1], [2]].
In this case, every point in your data will have a single feature: a single number.
Scikit requires your x to be a 2-dimensional array. It need not be a numpy array. You can always use a simple python list.
If you have your x as a 1-dimensional array, as mentioned in your question, you can simply do the following:
x = [[value] for value in [0,1,2]]
This stores a 2D version of your 1D list in x, i.e. every individual value of your list is wrapped in its own list.
x can also be converted into a numpy array, and then reshaped as follows:
import numpy as np
x = np.array(x).reshape(-1, 1)
This converts your data into a 2D array so that you can use it for fitting the linear regression model from sklearn.
array([[0],
       [1],
       [2]])
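Tying it together, a minimal end-to-end version of the example with the reshaped x:
import numpy as np
from sklearn import linear_model

x = np.array([0, 1, 2]).reshape(-1, 1)   # shape (3, 1): 3 samples, 1 feature each
y = [0, 1, 2]
regr = linear_model.LinearRegression()
regr.fit(x, y)
print('Coefficients:', regr.coef_)       # [1.]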
