I am creating a LogisticRegression classifier with the following code:
regressor = LogisticRegression()
regressor.fit(x_train, y_train)
The shapes of both x_train and y_train are
<class 'tuple'>: (32383,)
x_train contains values in the range [0..1], and y_train contains only 0s and 1s.
Unfortunately, fit fails with the error
ValueError: Found input variables with inconsistent numbers of samples: [1, 32383]
Transposing the arguments doesn't help.
To continue the solution I proposed in my comment: the problem is the shape of x_train, so we need to reshape it.
From the documentation:
X : {array-like, sparse matrix}, shape (n_samples, n_features)
y : array-like, shape (n_samples,)
Example using scikit-learn and numpy:
from sklearn.linear_model import LogisticRegression
import numpy as np
# create tuple data like in the question
x_train = tuple(range(32383))
# convert the tuple to a numpy array
x_train = np.asarray(x_train)
# same for y_train (binary labels, as in the question)
y_train = tuple(i % 2 for i in range(32383))
y_train = np.asarray(y_train)
# reshape x_train to (n_samples, n_features) = (32383, 1)
x_train = x_train.reshape(32383, 1)
# check that y_train's shape is still (32383,)
y_train.shape
#create the model
lg = LogisticRegression()
#Fit the model
lg.fit(x_train, y_train)
This should work fine.
Hope it helps!
I guess a little reshaping is necessary. I tried it like this:
from sklearn.linear_model import LogisticRegression
import numpy as np
#x_train = np.random.randn(10,1)
x_train = np.asarray(x_train).reshape(32383,1)
con = np.ones_like(x_train)
x_train = np.concatenate((con,x_train), axis =1)
#y = np.random.randn(10,1)
#y_train = np.where(y<0.5,1,0)
y_train = np.asarray(y_train).ravel()  # sklearn expects y with shape (n_samples,)
regressor = LogisticRegression()
regressor.fit(x_train,y_train)
The comments are just what I did to create some data. Note that sklearn's LogisticRegression fits an intercept by default (fit_intercept=True), but statsmodels' Logit does not, so there you need to add a constant column yourself, like in the example. Statsmodels could also be helpful to you if you are interested in some statistical tests and a pretty print of the results:
from statsmodels.api import Logit
logit = Logit(y_train, x_train)
result = logit.fit()
print(result.summary())
That will give you a little more statistical intel without much effort.
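Equivalently, statsmodels can build the constant column for you. A minimal sketch, assuming x_train is the raw single-column feature array (before the ones column is concatenated):
import numpy as np
import statsmodels.api as sm
# add_constant prepends a column of ones, like the np.ones_like trick above
x_with_const = sm.add_constant(np.asarray(x_train).reshape(-1, 1))
result = sm.Logit(y_train, x_with_const).fit()
print(result.summary())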
I want to combine text and numeric features at the same time for duplicate question pair detection. But when I pass the data to classifier.fit(), the error
setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (3,) + inhomogeneous part.
occurs. How can I solve this? The snippet of code:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
X=train_df.drop(['is_duplicate', 'Unnamed: 0', 'id', 'qid1',
'qid2','word_mover_distance','jaccard_sim'], axis=1)
y=train_df['is_duplicate'].values
X_num = train_df[['word_mover_distance', 'jaccard_sim']].values
scaler = StandardScaler()
X_num = scaler.fit_transform(X_num)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_num_train, X_num_test = train_test_split(X_num, test_size=0.2, random_state=42)
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(strip_accents=None,
                        lowercase=True,
                        preprocessor=None,  # applied in Data Cleaning
                        use_idf=True,
                        norm='l2',
                        smooth_idf=True)
train_q1 = tfidf.fit_transform(X_train['question1'])
train_q2 = tfidf.fit_transform(X_train['question2'])
from sklearn.linear_model import LogisticRegression
log_clf = LogisticRegression(C=0.5, max_iter=1000)
log_clf.fit([train_q1, train_q2, X_num_train], y_train)  # Error arises here
LogisticRegression's .fit() method expects (from the documentation):
X : {array-like, sparse matrix} of shape (n_samples, n_features)
Passing the features as a list of arrays means X effectively has shape [3, n_samples, n_features], and this is why it is not working.
You can stack the list into a single 2D matrix to avoid this problem; since the TF-IDF outputs are sparse matrices, use scipy.sparse.hstack() (np.concatenate() does not handle sparse matrices).
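A minimal sketch of that stacking, assuming train_q1 and train_q2 are the sparse TF-IDF matrices from above and X_num_train is a dense 2D array:
from scipy.sparse import hstack
# stack the three feature blocks side by side into one (n_samples, total_features) matrix
X_combined = hstack([train_q1, train_q2, X_num_train]).tocsr()
log_clf.fit(X_combined, y_train)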
I'm facing a problem I can't solve: I'm trying to create an LSTM model with Keras, but I don't understand what the input data format should be.
My training data and test data look like this:
date/value/value/value/value/value_i_want_to_predict
I've seen some people doing this:
import numpy as np
from sklearn.preprocessing import MinMaxScaler

sc = MinMaxScaler(feature_range=(0, 1))
training_set_scaled = sc.fit_transform(training_set)

X_train = []
y_train = []
for i in range(60, len(training_set_scaled)):
    # each sample is a window of the previous 60 scaled values of column 0
    X_train.append(training_set_scaled[i-60:i, 0])
    # the target is the value immediately after the window
    y_train.append(training_set_scaled[i, 0])
X_train, y_train = np.array(X_train), np.array(y_train)
But if I do that, how do I make predictions without modifying the test data set?
I have a hard time understanding why we do this. Moreover, I would like to use the values of the other columns to predict the target in the last column. With this method I feel like I have to change the format of the test data, and it's important that I can test the model on test data that are different and that I don't have to change.
Can someone help me?
EDIT
scaler.fit(df_train_x)
X_train = scaler.fit_transform(df_train_x)
X_test = scaler.transform(df_test_x)
y_train = np.array(df_train_y)
y_train = np.insert(y_train, 0, 0)
y_train = np.delete(y_train, -1)
The shape of the data is: (2420, 7)
That's what I did, but the shape still remains 2D. So I used:
generator = TimeseriesGenerator(X_train, y_train, length=n_input, batch_size=32)
And the input shape of first layer is:
model.add(LSTM(150, activation='relu', return_sequences=True,input_shape=(2419, 7)))
but when i fit the generator to the model:
ValueError: Error when checking target: expected dense_10 to have 3 dimensions, but got array with shape (1, 1)
I really don't understand.
I'm not sure I fully understand your question, but I will try my best.
I think the code you provided is problem-specific, meaning it may not be suitable for your implementation.
For an LSTM (and for pretty much any neural network) you always want to scale your data before feeding it to the model. This helps avoid having completely different value ranges across your features. The MinMaxScaler scales your features to the range provided. For an explanation of why you need scaling, you can have a look at this article.
Usually, you want to first split your dataset into training and testing sets, using for example the train_test_split function of sklearn, and then scale your features.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
X = data.drop("feature_I_want_to_predict",axis=1)
y = data["feature_I_want_to_predict"]
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
That way, X_train represents your training data and y_train the corresponding labels (and similarly for the test data).
Here I used the StandardScaler instead of the MinMaxScaler. The standard scaler subtracts the mean of each feature and then divides by its standard deviation.
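As a quick sanity check, here is a minimal sketch (with made-up data) showing that StandardScaler computes exactly (x - mean) / std for each feature:
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])
scaled = StandardScaler().fit_transform(X)
# StandardScaler uses the population standard deviation (ddof=0),
# which matches numpy's default
manual = (X - X.mean(axis=0)) / X.std(axis=0)
assert np.allclose(scaled, manual)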
I am working on a Multi-Target (binary) classification. There are 11 targets and I am using sklearn's MultiOutputClassifier. I am having difficulty with the predict_proba function. See a snippet of the dataset and the code below:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.multioutput import MultiOutputClassifier
data = pd.read_csv("123.csv")
[image: snippet of the dataset]
target = ['H67BC97','H67GC93','H67LC63','H67WC103','H67RC91','H67YC73','H67RC92','H67GC94','H67LC64','H67NC60','H67YC72']
train, test = train_test_split(data, test_size=0.2)
X_train = train.drop(target + ['FORMULA_NUMBER'], axis=1)
X_test = test.drop(target + ['FORMULA_NUMBER'], axis=1)
Y_train = train[target]
Y_test = test[target]
model = MultiOutputClassifier(GradientBoostingClassifier())
model.fit(X_train, Y_train)
target_probabilities = model.predict_proba(X_test)
print(target_probabilities)
[image: the printed probabilities]
The probabilities output does not seem to be in the correct form. I get 11 arrays of shape 565x2 (565 is the length of my test set). I'd like to save target_probabilities to a CSV file, but I get the error ValueError: Expected 1D or 2D array, got 3D array instead. My question is essentially the same as this one:
https://datascience.stackexchange.com/questions/22762/understanding-predict-proba-from-multioutputclassifier, but the answer there only explains why the output is a set of arrays.
EDIT: I have simplified the problem.
target_probabilities = np.array(target_probabilities)
Now target_probabilities is an (11, 565, 2) array. I need to change it to shape (565, 11), where row i is target_probabilities[:, i][:, 1], for i in range(565).
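A minimal sketch of that transformation (reusing the target list from the question for the column names):
import numpy as np
import pandas as pd

probs = np.array(target_probabilities)  # shape (11, 565, 2)
# take the class-1 probability for every target, then transpose to (565, 11)
class1_probs = probs[:, :, 1].T
# one row per test sample, one column per target
pd.DataFrame(class1_probs, columns=target).to_csv("target_probabilities.csv", index=False)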
I'm relatively new to using sklearn and python for data analysis and am trying to run some linear regression on a dataset that I loaded from a .csv file.
I have loaded my data into train_test_split without any issues, but when I try to fit my training data I receive the error ValueError: Expected 2D array, got 1D array instead: ... Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
The error occurs at model = lm.fit(X_train, y_train).
Because I'm new to working with these packages, I'm trying to determine whether this is the result of not converting my imported CSV to a pandas DataFrame before running the regression, or whether it has to do with something else.
My CSV is in the format of:
Month,Date,Day of Week,Growth,Sunlight,Plants
7,7/1/17,Saturday,44,611,26
7,7/2/17,Sunday,30,507,14
7,7/5/17,Wednesday,55,994,25
7,7/6/17,Thursday,50,1014,23
7,7/7/17,Friday,78,850,49
7,7/8/17,Saturday,81,551,50
7,7/9/17,Sunday,59,506,29
Here is how I set up the regression:
import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt
organic = pd.read_csv("linear-regression.csv")
organic.columns
Index(['Month', 'Date', 'Day of Week', 'Growth', 'Sunlight', 'Plants'], dtype='object')
# Set the dependent (Growth) and independent (Sunlight) variables
y = organic['Growth']
X = organic['Sunlight']
# Test train split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print (X_train.shape, X_test.shape)
print (y_train.shape, y_test.shape)
(192,) (49,)
(192,) (49,)
lm = linear_model.LinearRegression()
model = lm.fit(X_train, y_train)
# Error pointing to an array with values from Sunlight [611, 507, 994, ...]
You just need to adjust your last lines to
lm = linear_model.LinearRegression()
model = lm.fit(X_train.values.reshape(-1,1), y_train)
and the model will fit. The reason for this is that the linear model from sklearn expects
X : numpy array or sparse matrix of shape [n_samples, n_features]
so the training data must be 2D, in this particular case of shape (192, 1), since there is a single feature.
You are only using one feature, so it tells you what to do within the error:
Reshape your data either using array.reshape(-1, 1) if your data has a single feature.
The data always has to be 2D in scikit-learn.
(Also double-check the column name: a typo like X = organic['Sunglight'] would raise a KeyError.)
Once you load the data into train_test_split(X, y, test_size=0.2), it returns the Pandas Series X_train and X_test with dimensions (192,) and (49,). As mentioned in the previous answers, sklearn expects matrices of shape [n_samples, n_features] as the X_train and X_test data. You can simply convert the Pandas Series X_train and X_test to Pandas DataFrames to change their dimensions to (192, 1) and (49, 1).
lm = linear_model.LinearRegression()
model = lm.fit(X_train.to_frame(), y_train)
I am working with TensorFlow. I have a training dataset X_train and I want to create polynomial features from it. Specifically, I want to use PolynomialFeatures from sklearn.preprocessing. I'm getting the error
ValueError: setting an array element with a sequence.
The code that I am trying to run is
import tensorflow as tf
from sklearn.preprocessing import PolynomialFeatures

n_features = 3
n_samples = 1000
X_train = tf.placeholder(tf.float32, shape=(n_samples, n_features))
poly = PolynomialFeatures(degree=3)
poly.fit_transform(X_train)
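For reference, a minimal sketch (not from the original post) of what PolynomialFeatures expects: concrete numeric data such as a numpy array, not a symbolic placeholder:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_samples, n_features = 1000, 3
X_train = np.random.rand(n_samples, n_features).astype(np.float32)
poly = PolynomialFeatures(degree=3)
X_poly = poly.fit_transform(X_train)
print(X_poly.shape)  # (1000, 20): all monomials of degree <= 3 in 3 features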