Value Error faced during my logistic regression code - python

I am getting a ValueError related to the shape of the input when I train a logistic regression model:
titanic_data = pd.read_csv("E:\\Python\\CSV\\train.csv")
titanic_data.drop('Cabin', axis=1, inplace=True)
titanic_data.dropna(inplace=True)
#print(titanic_data.head(10))
new_sex = pd.get_dummies(titanic_data['Sex'],drop_first=True)
new_embarked = pd.get_dummies(titanic_data['Embarked'],drop_first=True)
new_pcl = pd.get_dummies(titanic_data['Pclass'],drop_first=True)
titanic_data = pd.concat([titanic_data,new_sex,new_embarked,new_pcl],axis=1)
titanic_data.drop(['PassengerId','Pclass','Name','Sex','Ticket','Embarked','Age','Fare'],axis=1,inplace=True)
X = titanic_data.drop(['Survived'],axis=1)
y = titanic_data['Survived']
print(X)
print(y)
X_train, y_train, X_test, y_test = train_test_split(X,y,test_size=0.3, random_state=1)
logreg = LogisticRegression()
logreg.fit(X_train,y_train)
Error
raise ValueError("bad input shape {0}".format(shape))
ValueError: bad input shape (214, 7)

You are unpacking the split data into the wrong variables. train_test_split returns both X splits first and then both y splits, so the order should be as follows:
X_train, X_test, y_train, y_test = train_test_split(...)
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
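To make the return order concrete, here is a minimal sketch on synthetic data. The array sizes are an assumption: 712 rows and 7 columns were chosen only so that the test split has 214 rows, matching the shape in the error message; the real Titanic frame is not used.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Titanic features (assumed sizes: 712 rows, 7 columns).
rng = np.random.default_rng(0)
X = rng.normal(size=(712, 7))
y = rng.integers(0, 2, size=712)

# train_test_split returns the two splits of each input in order:
# (X_train, X_test) first, then (y_train, y_test).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

print(X_train.shape, y_train.shape)  # (498, 7) (498,)
print(X_test.shape, y_test.shape)    # (214, 7) (214,)
```

With the original unpacking order, the name y_train would be bound to the (214, 7) test features, which is the shape that fit() complains about.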

Related

sklearn ValueError: dtype='numeric' is not compatible with arrays of bytes/strings

I am working on a sklearn project for spam/ham prediction. I am having an issue when trying to calculate the auc_score:
X_train, X_test, y_train, y_test = train_test_split(spam_data['text'],
                                                    spam_data['target'],
                                                    test_size=0.3,
                                                    random_state=0)
vect = CountVectorizer().fit(X_train)
X_train_vectorized = vect.transform(X_train)
mnb_clf_model = MultinomialNB(alpha=1.0, class_prior=None,
                              fit_prior=True)
mnb_clf_model.fit(X_train_vectorized, y_train)
test_predictions = mnb_clf_model.predict(vect.transform(X_test))
auc = roc_auc_score(y_test, test_predictions)
The error I am getting is:
ValueError: dtype='numeric' is not compatible with arrays of bytes/strings.Convert your data to numeric values explicitly instead.
What do I need to convert here - predictions or the y_test classes?
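One possible reading, assuming spam_data['target'] holds string labels such as 'spam' and 'ham' (the post does not show the data): predict then returns strings as well, and roc_auc_score needs numeric scores as its second argument. Under that assumption both sides need attention: the labels can be mapped to 0/1 before fitting, and the score can be taken from predict_proba rather than from hard predictions. A minimal sketch with made-up toy texts:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data; the real spam_data is not shown in the question.
texts = [
    "win a free prize now", "free cash offer click now",
    "claim your free reward", "urgent prize waiting for you",
    "cheap loans click here", "are we meeting for lunch today",
    "please send the report", "see you at the meeting",
    "thanks for the notes", "call me when you are home",
]
labels = ["spam"] * 5 + ["ham"] * 5

# Map the string labels to 0/1 so that targets and scores are numeric.
y = [1 if label == "spam" else 0 for label in labels]

X_train, X_test, y_train, y_test = train_test_split(
    texts, y, test_size=0.4, random_state=0, stratify=y
)

vect = CountVectorizer().fit(X_train)
model = MultinomialNB().fit(vect.transform(X_train), y_train)

# Column 1 of predict_proba belongs to model.classes_[1], i.e. label 1 (spam).
spam_scores = model.predict_proba(vect.transform(X_test))[:, 1]
auc = roc_auc_score(y_test, spam_scores)
print(auc)
```

Using probabilities instead of hard 0/1 predictions also gives a more informative AUC, since the ranking of the test samples is what the metric measures.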

Why do I receive this numpy error when using statsmodels to predict test values?

I am getting an error when trying to use statsmodels .predict to predict my test values.
Code:
X_train, X_test, y_train, y_test = train_test_split(X_new_np, y, test_size=0.2, random_state=42)
logit = sm.Logit(y_train, X_train)
reg = logit.fit_regularized(start_params=None, method='l1_cvxopt_cp', maxiter= 1000, full_output=1, disp=1, callback=None, alpha=.01, trim_mode='auto', auto_trim_tol=0.01, size_trim_tol=0.0001, qc_tol=0.03)
reg.summary()
y_pred_test = logit.predict(X_test)
Error:
ValueError: shapes (1000,61) and (251,61) not aligned: 61 (dim 1) != 251 (dim 0)
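You are calling predict on the wrong object. reg is the fitted result, so you should use reg.predict. logit is the unfitted model, and its predict method expects the parameters as its first argument, so X_test is interpreted as params, which leads to the shape mismatch. The following code runs without error (I used your fit_regularized parameters).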
from sklearn.model_selection import train_test_split
import numpy as np
from statsmodels.api import Logit
x = np.random.randn(100,50)
y = np.random.randint(0,2,100).astype(bool)
print(x.shape, y.shape)
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=.2)
logit = Logit(y_train, X_train)
reg = logit.fit_regularized(start_params=None, method='l1_cvxopt_cp',
                            maxiter=1000, full_output=1, disp=1, callback=None,
                            alpha=.01, trim_mode='auto', auto_trim_tol=0.01,
                            size_trim_tol=0.0001, qc_tol=0.03)
print(reg.summary())
y_pred_test = reg.predict(X_test)

How to solve "ValueError: y should be a 1d array, got an array of shape (3, 5) instead." for naive Bayes?

from sklearn.model_selection import train_test_split
X = data.drop('Vickers Hardness\n(HV0.5)', axis=1)
y = data['Vickers Hardness\n(HV0.5)']
X_train, y_train, X_test, y_test = train_test_split(X, y, test_size = 0.3)
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)
ValueError: y should be a 1d array, got an array of shape (3, 5) instead.
How can I fix this error in naive Bayes? How can I make y a 1D array?
The return values of train_test_split are assigned in the wrong order. Use:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)

How to solve sklearn error: "Found input variables with inconsistent numbers of samples"?

I have a problem with the sklearn 70/30 split. I receive an error on this line:
X_train, X_test, y_train, y_test = train_test_split(X_smote, y_smote, test_size=0.3, stratify=y)
The error is:
Found input variables with inconsistent numbers of samples
Context
from imblearn.over_sampling import SMOTE
sm = SMOTE(k_neighbors = 1)
X = data.drop('cluster',axis=1)
y = data['cluster']
X_smote, y_smote= sm.fit_sample(X,y)
data_bal = pd.DataFrame(columns=X.columns.values, data=X_smote)
data_bal['cluster']=y_smote
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_smote, y_smote, test_size=0.3, stratify=y)
y_train.value_counts().plot(kind='bar')
Edit
I solved the error: I just had to change stratify=y to stratify=y_smote.
Just an observation in your line of code:
X_train, X_test, y_train, y_test = train_test_split(X_smote, y_smote, test_size=0.3, stratify=y)
This error typically means that one input does not have the same number of samples as the others. Here, stratify=y passes the original y, which no longer has the same length as X_smote and y_smote after oversampling.
Check the length and/or dimensions of X_smote, y_smote and y to see if they are all as expected.
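A minimal sketch of that length check follows. To keep it free of imblearn, SMOTE is replaced by a plain duplication of minority rows; that substitution, the array sizes and the names X_bal and y_bal are all assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = np.array([0] * 15 + [1] * 5)

# Stand-in for SMOTE (an assumption, not real SMOTE): duplicate
# randomly chosen minority rows until both classes have 15 samples.
minority = np.where(y == 1)[0]
extra = rng.choice(minority, size=10, replace=True)
X_bal = np.vstack([X, X[extra]])
y_bal = np.concatenate([y, y[extra]])

print(len(X_bal), len(y_bal), len(y))  # 30 30 20

# Stratifying on the original y fails: it has 20 entries, the data has 30.
error_message = None
try:
    train_test_split(X_bal, y_bal, test_size=0.3, stratify=y)
except ValueError as err:
    error_message = str(err)
    print(error_message)

# Stratifying on the resampled labels works.
X_train, X_test, y_train, y_test = train_test_split(
    X_bal, y_bal, test_size=0.3, stratify=y_bal, random_state=0
)
print(np.bincount(y_train), np.bincount(y_test))
```

The array passed to stratify has to describe the same rows that are being split, so it must have one entry per row of the first two arguments.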
I got the same issue, but when I changed
x_train,y_train,x_test,y_test = train_test_split(x,y,test_size=0.25,random_state=42)
to
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.25,random_state=42)
the error went away.

Sklearn | LinearRegression | Fit

I'm having a few issues with the LinearRegression algorithm in scikit-learn. I have trawled through the forums and Googled a lot, but for some reason I haven't managed to get past the error. I am using Python 3.5.
Below is what I've attempted, but I keep getting a ValueError: "Found input variables with inconsistent numbers of samples: [403, 174]"
X = df[["Impressions", "Clicks", "Eligible_Impressions", "Measureable_Impressions", "Viewable_Impressions"]].values
y = df["Total_Conversions"].values.reshape(-1,1)
print ("The shape of X is {}".format(X.shape))
print ("The shape of y is {}".format(y.shape))
The shape of X is (577, 5)
The shape of y is (577, 1)
X_train, y_train, X_test, y_test = train_test_split(X, y, test_size=0.3, random_state = 42)
linreg = LinearRegression()
linreg.fit(X_train, y_train)
y_pred = linreg.predict(X_test)
print (y_pred)
print ("The shape of X_train is {}".format(X_train.shape))
print ("The shape of y_train is {}".format(y_train.shape))
print ("The shape of X_test is {}".format(X_test.shape))
print ("The shape of y_test is {}".format(y_test.shape))
The shape of X_train is (403, 5)
The shape of y_train is (174, 5)
The shape of X_test is (403, 1)
The shape of y_test is (174, 1)
Am I missing something glaringly obvious?
Any help would be greatly appreciated.
Kind Regards,
Adrian
It looks like your train and test sets contain different numbers of rows for X and y. That is because you are storing the return values of train_test_split() in the wrong order.
Change this:
X_train, y_train, X_test, y_test = train_test_split(X, y, test_size=0.3, random_state = 42)
To this:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state = 42)
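The shapes printed in the question line up exactly with the return order of train_test_split. Here is a small sketch using zero-filled arrays of the same shapes as in the question (the contents are placeholders, only the shapes matter):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the shapes reported in the question.
X = np.zeros((577, 5))
y = np.zeros((577, 1))

splits = train_test_split(X, y, test_size=0.3, random_state=42)

# Returned order: X_train, X_test, y_train, y_test
print([part.shape for part in splits])
# [(403, 5), (174, 5), (403, 1), (174, 1)]

X_train, X_test, y_train, y_test = splits

# Cheap sanity checks that catch a wrong unpacking order early.
assert X_train.shape[0] == y_train.shape[0]
assert X_test.shape[0] == y_test.shape[0]
```

In the question, the second returned array, (174, 5), was bound to the name y_train, which is why fit() saw 403 samples for X and 174 for y.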
