Pandas and scikit-learn - train_test_split dimensions of X, y

I have a pandas DataFrame with the following info:
RangeIndex: 920 entries, 0 to 919 Data columns (total 41 columns)
X = df[df.columns[:-1]]
Y = df['my_Target']
train_X,train_y,test_X, test_y =train_test_split(X,Y,test_size=0.33,shuffle = True, random_state=45)
The last column is the target, and the rest is the data.
The shape is the following:
print(train_X.shape,train_y.shape,test_X.shape, test_y.shape)
(616, 40) (304, 40) (616,) (304,)
However when I train a model:
model=svm.SVC(kernel='linear',C=0.1,gamma=0.1)
model.fit(train_X,train_Y)
prediction2=model.predict(test_X)
print('Accuracy for linear SVM is',metrics.accuracy_score(prediction2,test_Y))
it gives the following error:
model.fit(train_X,train_Y)
ValueError: Found input variables with inconsistent numbers of
samples: [616, 2]
Anyone got a hint about what is going on?

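Your variables are unpacked in the wrong order. Per the docs, train_test_split returns X_train, then X_test, then y_train, and then y_test:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)
You have:
train_X,train_y,test_X, test_y
so train_y actually holds the test features and test_X holds the training labels.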
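A minimal sketch of the corrected unpacking, assuming a synthetic frame of random numbers in place of the asker's data (920 rows, 40 feature columns plus a last column named my_Target). The random values and the integer column names are illustrative assumptions; only the shapes are meant to match the question.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the asker's frame: 920 rows, 40 features + 1 target.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((920, 41)))
df = df.rename(columns={40: "my_Target"})

X = df[df.columns[:-1]]
y = df["my_Target"]

# Return order: train/test for the first input, then train/test for the second.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, shuffle=True, random_state=45
)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
# (616, 40) (304, 40) (616,) (304,)
```

With this order, X_train and y_train both have 616 rows, so fit() receives consistent inputs.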

Related

How to split a tuple using train_test_split?

X = (569,30)
y = (569,)
X_train, X_test, y_train, y_test = train_test_split(np.asarray(X),np.asarray(y),test_size = 0.25, random_state=0)
I am expecting output as below:
X_train has shape (426, 30)
X_test has shape (143, 30)
y_train has shape (426,)
y_test has shape (143,)
But I am getting the following error:
ValueError: Found input variables with inconsistent numbers of samples: [2, 1]
I know I can get the desired output another way. All the similar problems I found online turned out to have X and y of different lengths, but that is not the problem in my case.

It seems that you're misunderstanding what train_test_split does. It does not expect the shapes of the input arrays; it splits the input arrays themselves into train and test sets. So you must feed it the actual arrays, for instance:
X = np.random.rand(569,30)
y = np.random.randint(0,2,(569))
X_train, X_test, y_train, y_test = train_test_split(np.asarray(X),np.asarray(y),test_size = 0.25, random_state=0)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
(426, 30)
(143, 30)
(426,)
(143,)
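The numbers in the error message can be traced back to the tuples themselves. The short check below uses only NumPy and the two tuples from the question, and shows that converting them to arrays yields 2 and 1 elements, which is presumably what scikit-learn counted as "samples".

```python
import numpy as np

X = (569, 30)   # a tuple describing a shape, not the data itself
y = (569,)

X_arr = np.asarray(X)
y_arr = np.asarray(y)

print(X_arr.shape, y_arr.shape)  # (2,) (1,)
print(len(X_arr), len(y_arr))    # 2 1
```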

Value Error faced during my logistic regression code

I am getting a ValueError related to the shape of the input when I am training a logistic regression model:
titanic_data = pd.read_csv("E:\\Python\\CSV\\train.csv")
titanic_data.drop('Cabin', axis=1, inplace=True)
titanic_data.dropna(inplace=True)
#print(titanic_data.head(10))
new_sex = pd.get_dummies(titanic_data['Sex'],drop_first=True)
new_embarked = pd.get_dummies(titanic_data['Embarked'],drop_first=True)
new_pcl = pd.get_dummies(titanic_data['Pclass'],drop_first=True)
titanic_data = pd.concat([titanic_data,new_sex,new_embarked,new_pcl],axis=1)
titanic_data.drop(['PassengerId','Pclass','Name','Sex','Ticket','Embarked','Age','Fare'],axis=1,inplace=True)
X = titanic_data.drop(['Survived'],axis=1)
y = titanic_data['Survived']
print(X)
print(y)
X_train, y_train, X_test, y_test = train_test_split(X,y,test_size=0.3, random_state=1)
logreg = LogisticRegression()
logreg.fit(X_train,y_train)
Error
raise ValueError("bad input shape {0}".format(shape))
ValueError: bad input shape (214, 7)

You are unpacking your split data into the wrong variables. The order should be as follows:
X_train, X_test, y_train, y_test = train_test_split(...)
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
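A hedged sketch of the fixed flow with a cheap row-count guard before fitting. The data is synthetic: 712 rows and 7 binary columns with hypothetical names f0..f6 are assumptions chosen so that the test split has the (214, 7) shape seen in the error message; it is not the Titanic data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 712 rows, 7 binary feature columns (hypothetical names).
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.integers(0, 2, size=(712, 7)),
                 columns=[f"f{i}" for i in range(7)])
y = pd.Series(rng.integers(0, 2, size=712), name="Survived")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# Cheap guard: features and labels must have the same number of rows.
assert len(X_train) == len(y_train)
assert len(X_test) == len(y_test)

logreg = LogisticRegression()
logreg.fit(X_train, y_train)
predictions = logreg.predict(X_test)

print(X_test.shape)  # (214, 7)
```

In the original code the (214, 7) block of test features had landed in y_train, which is why fit() complained about a bad input shape.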

Python SKLearn: 'Bad input shape' error when predicting a sequence

I have an Excel file that stores a sequence in each column (reading from top cell to bottom cell), and the trend of the sequence is similar to the previous column. So I'd like to predict the sequence for the nth column in this dataset.
A sample of my data set:
See that each column has a set of values / sequence, and they sort of progress as we move to the right, so I want to predict e.g. the values in the Z column.
Here's my code so far:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# Read the Excel file in rows
df = pd.read_excel(open('vec_sol2.xlsx', 'rb'),
header=None, sheet_name='Sheet1')
print(type(df))
length = len(df.columns)
# Get the sequence for each row
x_train, x_test, y_train, y_test = train_test_split(
np.reshape(range(0, length - 1), (-1, 1)), df, test_size=0.25, random_state=0)
print("y_train shape: ", y_train.shape)
pred_model = LogisticRegression()
pred_model.fit(x_train, y_train)
print(pred_model)
I'll explain the logic as much as possible:
x_train and x_test will just be the index / column number that is associated with a sequence.
y_train is an array of sequences.
There is a total of 51 columns, so splitting it with 25% being test data results in 37 train sequences and 13 test sequences.
I've managed to get the shapes of each var when debugging, they are:
x_train : (37, 1)
x_test : (13, 1)
y_train : (37, 51)
y_test : (13, 51)
But right now, running the program gives me this error:
ValueError: bad input shape (37, 51)
What is my mistake here?

I don't understand why you are using this:
x_train, x_test, y_train, y_test = train_test_split(
np.reshape(range(0, length - 1), (-1, 1)), df, test_size=0.25, random_state=0)
You have data here in df. Extract X and y from it and then split it to train and test.
Try this:
X = df.iloc[:,:-1]
y = df.iloc[:, -1:]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
Otherwise, the shapes you shared show that you are trying to predict a 51-column output from a single feature, which is odd if you think about it.
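The two iloc slices above give differently shaped results, which matters because most scikit-learn estimators expect a 1-D target for a single output. The sketch below assumes a random 50 x 51 frame as a stand-in for the spreadsheet (the row count is inferred from the shapes in the question) and compares a column slice with a single-column index.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the spreadsheet: 50 rows, 51 columns.
df = pd.DataFrame(np.random.default_rng(0).random((50, 51)))

X = df.iloc[:, :-1]      # all but the last column -> (50, 50)
y_2d = df.iloc[:, -1:]   # a slice keeps a DataFrame -> (50, 1)
y_1d = df.iloc[:, -1]    # a single index gives a Series -> (50,)
print(X.shape, y_2d.shape, y_1d.shape)

X_train, X_test, y_train, y_test = train_test_split(
    X, y_1d, test_size=0.20, random_state=0
)
print(X_train.shape, y_train.shape)  # (40, 50) (40,)
```

Passing the (n, 1) version usually still works but tends to trigger a column-vector warning in classifiers, so the 1-D form is the safer default.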

Sklearn | LinearRegression | Fit

I'm having a few issues with LinearRegression algorithm in Scikit Learn - I have trawled through the forums and Googled a lot, but for some reason, I haven't managed to bypass the error. I am using Python 3.5
Below is what I've attempted, but keep getting a value error:"Found input variables with inconsistent numbers of samples: [403, 174]"
X = df[["Impressions", "Clicks", "Eligible_Impressions", "Measureable_Impressions", "Viewable_Impressions"]].values
y = df["Total_Conversions"].values.reshape(-1,1)
print ("The shape of X is {}".format(X.shape))
print ("The shape of y is {}".format(y.shape))
The shape of X is (577, 5)
The shape of y is (577, 1)
X_train, y_train, X_test, y_test = train_test_split(X, y, test_size=0.3, random_state = 42)
linreg = LinearRegression()
linreg.fit(X_train, y_train)
y_pred = linreg.predict(X_test)
print (y_pred)
print ("The shape of X_train is {}".format(X_train.shape))
print ("The shape of y_train is {}".format(y_train.shape))
print ("The shape of X_test is {}".format(X_test.shape))
print ("The shape of y_test is {}".format(y_test.shape))
The shape of X_train is (403, 5)
The shape of y_train is (174, 5)
The shape of X_test is (403, 1)
The shape of y_test is (174, 1)
Am I missing something glaringly obvious?
Any help would be greatly appreciated.
Kind Regards,
Adrian

It looks like your train and test sets contain different numbers of rows for X and y. That is because you're storing the return values of train_test_split() in the wrong order.
Change this
X_train, y_train, X_test, y_test = train_test_split(X, y, test_size=0.3, random_state = 42)
To this
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state = 42)
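To see the return order directly, one option is to keep the result as a list and print each element's shape. The sketch below uses zero-filled arrays with the same shapes as in the question (an assumption; the values are irrelevant here).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the shapes reported in the question.
X = np.zeros((577, 5))
y = np.zeros((577, 1))

# For each input, in order, the function returns its train part then its test part.
splits = train_test_split(X, y, test_size=0.3, random_state=42)
shapes = [s.shape for s in splits]
print(shapes)
# [(403, 5), (174, 5), (403, 1), (174, 1)]
```

These are the same four shapes the question printed, which confirms that the second returned value is the test part of X rather than the training labels.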

Python ValueError: non-broadcastable output operand with shape (124,1) doesn't match the broadcast shape (124,13)

I would like to normalize a training and test data set using MinMaxScaler in sklearn.preprocessing. However, the package does not appear to be accepting my test data set.
import pandas as pd
import numpy as np
# Read in data.
df_wine = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data',
header=None)
df_wine.columns = ['Class label', 'Alcohol', 'Malic acid', 'Ash',
'Alcalinity of ash', 'Magnesium', 'Total phenols',
'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins',
'Color intensity', 'Hue', 'OD280/OD315 of diluted wines',
'Proline']
# Split into train/test data.
from sklearn.model_selection import train_test_split
X = df_wine.iloc[:, 1:].values
y = df_wine.iloc[:, 0].values
X_train, y_train, X_test, y_test = train_test_split(X, y, test_size=0.3,
random_state = 0)
# Normalize features using min-max scaling.
from sklearn.preprocessing import MinMaxScaler
mms = MinMaxScaler()
X_train_norm = mms.fit_transform(X_train)
X_test_norm = mms.transform(X_test)
When executing this, I get a DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample. along with a ValueError: operands could not be broadcast together with shapes (124,) (13,) (124,).
Reshaping the data still yields an error.
X_test_norm = mms.transform(X_test.reshape(-1, 1))
This reshaping yields an error ValueError: non-broadcastable output operand with shape (124,1) doesn't match the broadcast shape (124,13).
Any input on how to fix this error would be helpful.

The names on the left-hand side must follow the order in which train_test_split() returns the pieces: for each input array, in the order the arrays were passed, first its train part and then its test part.
When the order was specified as X_train, y_train, X_test, y_test, the variable y_train received the test features (len(y_train)=54) and X_test received the training labels (len(X_test)=124), which resulted in the ValueError.
Instead, you must:
# Split into train/test data.
# X_train and X_test come from X; y_train and y_test come from y.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
# (or)
# y_train, y_test, X_train, X_test = train_test_split(y, X, test_size=0.3, random_state=0)
# Normalize features using min-max scaling.
from sklearn.preprocessing import MinMaxScaler
mms = MinMaxScaler()
X_train_norm = mms.fit_transform(X_train)
X_test_norm = mms.transform(X_test)
produces:
X_train_norm[0]
array([ 0.72043011, 0.20378151, 0.53763441, 0.30927835, 0.33695652,
0.54316547, 0.73700306, 0.25 , 0.40189873, 0.24068768,
0.48717949, 1. , 0.5854251 ])
X_test_norm[0]
array([ 0.72849462, 0.16386555, 0.47849462, 0.29896907, 0.52173913,
0.53956835, 0.74311927, 0.13461538, 0.37974684, 0.4364852 ,
0.32478632, 0.70695971, 0.60566802])
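A self-contained variant of the above that avoids the download, as a sketch: it assumes the copy of the wine data bundled with scikit-learn (load_wine) can stand in for the UCI file, and it only checks shapes and the scaled range rather than reproducing the exact values printed above.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Bundled wine data: 178 samples, 13 features.
X, y = load_wine(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

mms = MinMaxScaler()
X_train_norm = mms.fit_transform(X_train)  # learn per-feature min/max on train only
X_test_norm = mms.transform(X_test)        # reuse the train min/max on test

print(X_train.shape, X_test.shape)  # (124, 13) (54, 13)
print(X_train_norm.min(), X_train_norm.max())
```

Because the scaler is fitted on the training split only, test values can fall slightly outside the [0, 1] range; that is expected and not an error.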
