How to use train_test_split? Fix error n_samples = 0 - python

I'm trying to split the data I am working with into training and testing sets but I get the error that n_samples = 0 when I use the train_test_split function.
Here's my code:
X_train, X_test, y_train, y_test = model_selection.train_test_split(summary, labels, test_size=0.35)
summary and labels are lists and after converting them to arrays this is the shape I get:
(1248,)
(1248,)
They both have 1248 values. Can someone tell me why it's not working? Thanks.
Error Message:
With n_samples=0, test_size=0.35 and train_size=None, the resulting train set will be empty. Adjust any of the aforementioned parameters

Works for me, check if this works for you:
from sklearn.model_selection import train_test_split
import numpy as np
# dummy examples
summary, labels = np.arange(0,1248), np.arange(0,1248)
X_train, X_test, y_train, y_test = train_test_split(summary, labels, test_size=0.35)
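It also works with lists of strings:
summary, labels = ["hello"]*1248, ["test"]*1248
X_train, X_test, y_train, y_test = train_test_split(summary, labels, test_size=0.35)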
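Since 1248-element inputs split fine, one possible explanation is that summary or labels was actually empty at the point of the call (for example, reassigned between the shape check and the split). The question does not confirm this, so treat the following as a sketch of how to reproduce the message and check for it, not as a diagnosis:

```python
from sklearn.model_selection import train_test_split

# Empty inputs reproduce the error message from the question.
try:
    train_test_split([], [], test_size=0.35)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# Check the lengths immediately before splitting.
summary, labels = ["hello"] * 1248, ["test"] * 1248
print(len(summary), len(labels))
X_train, X_test, y_train, y_test = train_test_split(summary, labels, test_size=0.35)
print(len(X_train), len(X_test))
```

If the printed lengths are 0 in your own script, the problem lies in how summary and labels are built, not in train_test_split.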

Related

Converting predicted random forest results into dataframe

This question may look silly, but I couldn't figure it out, so I need your help.
I used a random forest to predict the result and wrote the following code:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
from sklearn.ensemble import RandomForestRegressor
# create regressor object
regressor = RandomForestRegressor(n_estimators = 100, random_state = 0)
# fit the regressor with x and y data
regressor.fit(X_train, y_train)
Y_pred = regressor.predict(X_test)
Y_pred is the result for a given X_test. Now, I would like to create a data frame of my Y_pred and y_test data and save it into CSV format.
Any idea how to do this?
It turned out to be quite simple; the answer just occurred to me. It can be done this way:
import pandas as pd
df_new = pd.DataFrame({'x':Y_pred, 'y':y_test})
df_new.head()
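The question also asks about saving to CSV. A minimal sketch using DataFrame.to_csv follows; the values, the column names "actual" and "predicted", and the file name predictions.csv are made up for illustration:

```python
import numpy as np
import pandas as pd

# Stand-ins for the question's y_test (a Series) and Y_pred (an array).
y_test = pd.Series([3.0, 1.5, 2.2], index=[7, 2, 5])
Y_pred = np.array([2.8, 1.7, 2.0])

# Reusing y_test's index keeps each prediction next to its original row label.
df_new = pd.DataFrame({"actual": y_test, "predicted": Y_pred}, index=y_test.index)

# Write to disk, then read back to confirm the round trip.
df_new.to_csv("predictions.csv")
df_check = pd.read_csv("predictions.csv", index_col=0)
print(df_check)
```

Pass index=False to to_csv if the original row labels are not needed in the file.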

How to solve sklearn error: "Found input variables with inconsistent numbers of samples"?

I'm having trouble with a 70/30 split in sklearn. I get an error on this line:
X_train, X_test, y_train, y_test = train_test_split(X_smote, y_smote, test_size=0.3, stratify=y)
The error is:
Found input variables with inconsistent numbers of samples
Context
from imblearn.over_sampling import SMOTE
sm = SMOTE(k_neighbors = 1)
X = data.drop('cluster',axis=1)
y = data['cluster']
X_smote, y_smote= sm.fit_sample(X,y)
data_bal = pd.DataFrame(columns=X.columns.values, data=X_smote)
data_bal['cluster']=y_smote
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_smote, y_smote, test_size=0.3, stratify=y)
y_train.value_counts().plot(kind='bar')
Edit
I solved the error: I just had to change stratify=y to stratify=y_smote.
Just an observation about this line of your code:
X_train, X_test, y_train, y_test = train_test_split(X_smote, y_smote, test_size=0.3, stratify=y)
This error is typically raised when one input does not have the same length (number of samples) as the other inputs.
Check the lengths and shapes of X_smote, y_smote and y to see whether they all match. Here y still has the original, pre-resampling length, while X_smote and y_smote are longer.
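A small sketch of that length check. To avoid depending on imblearn, it uses a naive duplicate-the-minority oversampling as a stand-in for SMOTE; the data and the names X_res and y_res are made up:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 2))
y = np.array([0] * 9 + [1] * 3)

# Stand-in for SMOTE: duplicate minority rows until the classes are balanced.
minority = np.flatnonzero(y == 1)
extra = rng.choice(minority, size=6, replace=True)
X_res = np.vstack([X, X[extra]])
y_res = np.concatenate([y, y[extra]])
print(len(X_res), len(y_res), len(y))  # y is shorter than the resampled arrays

# Stratifying on the original y fails because its length differs.
try:
    train_test_split(X_res, y_res, test_size=0.3, stratify=y)
    mismatch_error = None
except ValueError as exc:
    mismatch_error = str(exc)
print(mismatch_error)

# Stratifying on the resampled labels works.
X_train, X_test, y_train, y_test = train_test_split(
    X_res, y_res, test_size=0.3, stratify=y_res, random_state=0
)
```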
I got the same error, but it went away when I changed
x_train,y_train,x_test,y_test = train_test_split(x,y,test_size=0.25,random_state=42)
to
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.25,random_state=42)

How do I properly fit a scikit-learn model using a pandas dataframe?

I am trying to create a machine learning program in scikit-learn. I am using a CSV file to store data, and have decided to use a pandas data frame to import and format this data. I cannot figure out how to fit the model with this data frame.
My CSV file has one feature, age, and one target, weight. I am using a linear regression algorithm to predict the weight from the age. I do realize this isn't the best algorithm to use with this data.
When I run this code I get the error "ValueError: Found input variables with inconsistent numbers of samples: [10, 40]".
Here is my code:
# Imports
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
# Load And Split Data
data = pd.read_csv("awd.csv")
feature_cols = ['Ages']
X = data.loc[:, feature_cols]
y = data.loc[:, "Weights"]
X_train, y_train, X_test, y_test = train_test_split(X, y, random_state=0, train_size=0.2)
# Train Model
lr = LinearRegression()
lr.fit(X_train, y_train)
# Scores
print(f"Test set score: {round(lr.score(X_test, y_test), 3)}")
print(f"Training set score: {round(lr.score(X_train, y_train), 3)}")
The first 5 lines of my CSV file:
Ages,Weights
1,19
1,21
2,26
2,32
You're unpacking the return values in the wrong order. train_test_split returns the train and test parts of X first, then the train and test parts of y:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, train_size=0.2)
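You should correct the order of X_train, X_test, y_train and y_test like this (note that this version also uses test_size=0.2, an 80/20 split, instead of the question's train_size=0.2):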
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
See the relevant documentation for details.
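To see the return order concretely, here is a sketch on made-up data. The 50 rows are an assumption inferred from the [10, 40] in the error message combined with train_size=0.2:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 50 rows, one feature, standing in for the Ages/Weights CSV.
X = np.arange(50).reshape(50, 1)
y = np.arange(50)

# Correct order: X_train, X_test, y_train, y_test.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, train_size=0.2)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```

With the question's order, the name y_train would be bound to the 40-row test part of X, so fit would receive 10 feature rows and 40 "targets", which matches the reported [10, 40].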

ValueError : x and y must be the same size

I have a dataset on which I'm trying to run a linear regression using sklearn.
The dataset I'm using is a ready-made one, so there shouldn't be any problems with it.
I have used train_test_split to split my data into train and test groups.
When I try to use matplotlib to create a scatter plot of my test values against my predictions, I get the following error:
ValueError: x and y must be the same size
This is my code:
y=data['Yearly Amount Spent']
x=data[['Avg. Session Length','Time on App','Time on Website','Length of Membership','Yearly Amount Spent']]
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=101)
#training the model
from sklearn.linear_model import LinearRegression
lm=LinearRegression()
lm.fit(x_train,y_train)
lm.coef_
predictions=lm.predict(X_test)
#here the problem starts:
plt.scatter(y_test,predictions)
Why does this error occur?
I have seen previous posts here where the suggestion was to use x.shape and y.shape, but I'm not sure what the purpose of that is.
Thanks
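It seems that you are using the EcommerceCustomers.csv dataset.
In your original post the column 'Yearly Amount Spent' is used as y and is also included in x, which is wrong: the target must not be one of the features. Note also that your code defines x_test but calls lm.predict(X_test); if X_test is a leftover variable with a different number of rows, that alone would make predictions and y_test differ in size.
The following should work fine: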
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
data = pd.read_csv("EcommerceCustomers.csv")
y = data['Yearly Amount Spent']
X = data[['Avg. Session Length', 'Time on App','Time on Website', 'Length of Membership']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
# ## Training the Model
lm = LinearRegression()
lm.fit(X_train,y_train)
# The coefficients
print('Coefficients: \n', lm.coef_)
# ## Predicting Test Data
predictions = lm.predict( X_test)
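On the purpose of checking .shape: plt.scatter needs both arguments to have the same number of elements, so comparing shapes before plotting shows which variable is off. A sketch on synthetic data (the real CSV is not assumed to be available; the coefficients are made up):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows, 4 features, a linear target plus noise.
rng = np.random.default_rng(101)
X = rng.normal(size=(100, 4))
y = X @ np.array([1.0, 2.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=100)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
lm = LinearRegression().fit(X_train, y_train)
predictions = lm.predict(X_test)

# Both shapes must match before calling plt.scatter(y_test, predictions).
print(y_test.shape, predictions.shape)
shapes_match = y_test.shape == predictions.shape
```

If the two shapes differ, predictions was computed from a different array than the one y_test belongs to.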

ValueError: could not convert string to float: '?'

I tried to run an SVM program and got the above error. The code is below. Please point out the error in it.
import numpy as np
import pandas as pd
from sklearn import svm
from sklearn.model_selection import train_test_split
data = pd.read_csv('risk_factors_cervical_cancer.csv')
X = np.array(data[[#some data elements]])
y = np.array(data[#some data elements])
print(X)
print(y)
print(X.shape)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=30)
classifier = svm.SVC()
classifier.fit(X_train, y_train) #the error occurs here
y_pred = svm.predict(X_test)
acc = accuracy_score(y_test, y_pred)
As @Guimoute wrote, you need to preprocess your data before training a machine learning model on it. Try data.head(10) to get a first look at the data you are using. Your error occurs because one or more of the columns in X contains the value "?", which cannot be converted to a number. Replace it with a reasonable value, e.g. the mean of the column, so that the model can be fitted.
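A minimal sketch of that replacement, assuming "?" is the only missing-value marker. The inline CSV and its column names are made up and stand in for the real file. The sketch also calls classifier.predict and imports accuracy_score, where the question has svm.predict and no import:

```python
import io
import pandas as pd
from sklearn import svm
from sklearn.metrics import accuracy_score

# Tiny made-up CSV in which "?" marks missing values.
csv_text = "Age,Smokes,Label\n18,0,0\n?,1,1\n34,?,0\n52,1,1\n"

# na_values="?" makes pandas read "?" as NaN, so the columns stay numeric.
data = pd.read_csv(io.StringIO(csv_text), na_values="?")
print(data.isna().sum())

# Fill each missing value with the mean of its column.
features = data[["Age", "Smokes"]]
features = features.fillna(features.mean())

classifier = svm.SVC()
classifier.fit(features, data["Label"])
y_pred = classifier.predict(features)
acc = accuracy_score(data["Label"], y_pred)
```

Mean imputation is only one option; dropping the affected rows or using the column median are common alternatives.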
