I split the data into training and test sets without using train_test_split.
My function:
def split(X, y):
    arr_rand = np.random.rand(X.shape[0])
    split = arr_rand < np.percentile(arr_rand, 75)
    X_train = X[split]
    y_train = y[split]
    X_test = X[~split]
    y_test = y[~split]
    # print(len(X_train), len(y_train), len(X_test), len(y_test))
    return X_train, y_train, X_test, y_test
My problem is that when I output X_train, the display reports that it has 76 rows x 8 columns, but when I print X_test this shape information is missing. This is what it looks like. My df is loaded from a CSV file.
I needed to split it into X and y labels, which I did with this approach: X, y = df.iloc[:, 0:8], df.iloc[:, 8:9], and later X_train, y_train, X_test, y_test = split(X, y).
This is the output; why is the shape info missing?
Results:
When all the rows are shown in the result cell (in your example X_test has only 26 rows), the shape information is not shown. By default, the maximum number of rows displayed is 60 (unless you change pandas.options.display.max_rows), and pandas appends the shape line only when the output is truncated, so if X_test has fewer than 60 rows the shape information is not shown.
Try X_test.shape to see the shape.
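A quick way to see this behavior, as a minimal sketch with synthetic data (the 76 x 8 frame mirrors your example):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(76, 8))
small = df.iloc[:26]
print(small)        # all 26 rows fit, so no "[26 rows x 8 columns]" line
print(small.shape)  # (26, 8) -- the shape is always available as an attribute
pd.set_option("display.max_rows", 10)
print(small)        # now truncated, so pandas appends the shape line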
EDIT:
I am quite new to coding and looping still gives me a headache. What I am trying to do is shift the indices for X_train, y_train, X_test, and y_test forward by one step at a time in a pandas DataFrame.
My dataset has 250 rows and 12 columns.
For instance, in period 0 I would like to create a loop that generates intervals like the ones below:
X_train = X.iloc[0:80]
y_train = y.iloc[0:80]
X_test = X.iloc[81:]
y_test = y.iloc[81:]
In period 1:
X_train = X.iloc[1:81]
y_train = y.iloc[1:81]
X_test = X.iloc[82:]
y_test = y.iloc[82:]
etc.
My attempt is copied in below:
i = 0
for i in range(i, len(X)):
    i = i + 1
    X_train = X.iloc[i:80 + i]
    y_train = y.iloc[i:80 + i]
    X_test = X.iloc[81 + i:]
    y_test = y.iloc[81 + i:]
    reg = regr.fit(X_train, y_train)
    print(i, reg.coef_)
    print(i, reg.intercept_)
    print(i, reg.score(X_train, y_train))
My preferred output is to print the coefficients, intercept, and R² for each interval window in a regression model.
For instance, in period 0, run a regression on the data in rows [0:80], in period 1 on rows [1:81] etc.
The current error is:
ValueError: Found array with 0 sample(s) (shape=(0, 11)) while a minimum of 1 is required.
I would appreciate any help/suggestion I can get :)
Thanks!
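For reference, the error occurs because the loop runs i all the way to len(X), so the slice X.iloc[i:80 + i] eventually becomes empty before fit is called. A minimal sketch of a bounded loop, assuming scikit-learn's LinearRegression for regr and the 80-row window from the example:

from sklearn.linear_model import LinearRegression

regr = LinearRegression()
window = 80  # training window size from the example
# Stop before the window runs past the end of the data,
# so X_train is never empty when fit is called.
for i in range(len(X) - window):
    X_train = X.iloc[i:window + i]
    y_train = y.iloc[i:window + i]
    X_test = X.iloc[window + i + 1:]  # matches the [81 + i:] slices above
    y_test = y.iloc[window + i + 1:]
    reg = regr.fit(X_train, y_train)
    print(i, reg.coef_, reg.intercept_, reg.score(X_train, y_train))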
Suppose X, Y = load_mnist(), where X and Y are the tensors that contain the whole MNIST dataset. Now I want a smaller proportion of the data to make my code run faster, but I need to keep all 10 classes there, and in a balanced manner. Is there an easy way to do this?
scikit-learn's train_test_split is meant to split the data into train and test classes, but you can use it to create a "balanced" subset of your dataset via the stratify argument. You just specify the train/test size proportion you desire and thereby obtain a smaller, stratified sample of your data. In your case:
from sklearn.model_selection import train_test_split
X_1, X_2, Y_1, Y_2 = train_test_split(X, Y, stratify=Y, test_size=0.5)
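To confirm that the half you keep (X_1, Y_1) is balanced, a quick sanity check (a sketch; if Y_1 is a torch tensor, call .numpy() on it first):

import numpy as np

# each class should appear in roughly its original proportion
labels, counts = np.unique(Y_1, return_counts=True)
print(dict(zip(labels, counts)))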
If you want to do this with more control, you can use np.random.choice to generate indices of the size of the subset and index the original arrays with them, as in the following piece of code (note that a plain random sample like this is not guaranteed to be class-balanced):
# input data, assume that you've 10K samples
In [77]: total_samples = 10000
In [78]: X, Y = np.random.random_sample((total_samples, 784)), np.random.randint(0, 10, total_samples)
# out of these 10K, we want to pick only 500 samples as a subset
In [79]: subset_size = 500
# generate `subset_size` distinct indices, uniformly at random
# (replace=False avoids sampling the same row twice)
In [80]: subset_idx = np.random.choice(total_samples, subset_size, replace=False)
# simply index into the original arrays to obtain the subsets
In [81]: X_subset, Y_subset = X[subset_idx], Y[subset_idx]
In [82]: X_subset.shape, Y_subset.shape
Out[82]: ((500, 784), (500,))
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.33, random_state=42)
Passing the label array to stratify ensures the class proportions are preserved in both splits.
If you want several stratified splits (note that StratifiedShuffleSplit draws independent random splits rather than disjoint K folds), then:
from sklearn.model_selection import StratifiedShuffleSplit

sss = StratifiedShuffleSplit(n_splits=5, test_size=0.5, random_state=0)
for train_index, test_index in sss.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
Check the sklearn documentation for details.
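For true K-Fold behavior, where each sample lands in the test set exactly once, scikit-learn's StratifiedKFold is the usual choice; a minimal sketch:

from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_index, test_index in skf.split(X, y):
    # folds are disjoint, and each preserves the class proportions
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]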
I have an Excel file that stores a sequence in each column (read from top cell to bottom cell), and the trend of each sequence is similar to that of the previous column. So I'd like to predict the sequence in the nth column of this dataset.
A sample of my data set:
See that each column has a set of values / sequence, and they sort of progress as we move to the right, so I want to predict e.g. the values in the Z column.
Here's my code so far:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# Read the Excel file in rows
df = pd.read_excel(open('vec_sol2.xlsx', 'rb'),
                   header=None, sheet_name='Sheet1')
print(type(df))
length = len(df.columns)
# Get the sequence for each row
x_train, x_test, y_train, y_test = train_test_split(
    np.reshape(range(0, length - 1), (-1, 1)), df, test_size=0.25, random_state=0)
print("y_train shape: ", y_train.shape)
pred_model = LogisticRegression()
pred_model.fit(x_train, y_train)
print(pred_model)
I'll explain the logic as much as possible:
x_train and x_test will just be the index / column number that is associated with a sequence.
y_train is an array of sequences.
There is a total of 51 columns, so splitting it with 25% being test data results in 37 train sequences and 13 test sequences.
I've managed to get the shapes of each var when debugging, they are:
x_train : (37, 1)
x_test : (13, 1)
y_train : (37, 51)
y_test : (13, 51)
But right now, running the program gives me this error:
ValueError: bad input shape (37, 51)
What is my mistake here?
I don't understand why you are using this:
x_train, x_test, y_train, y_test = train_test_split(
    np.reshape(range(0, length - 1), (-1, 1)), df, test_size=0.25, random_state=0)
You have the data here in df. Extract X and y from it and then split them into train and test sets.
Try this:
X = df.iloc[:,:-1]
y = df.iloc[:, -1:]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
Otherwise, the shapes you shared show that you are trying to produce a 51-column output from a single feature, which is odd if you think about it; scikit-learn's LogisticRegression expects y to be a 1-D array with one label per sample, which is why fit rejects a (37, 51) target.
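If the intent really is to predict all 51 values of a column at once, note that LogisticRegression is a classifier; a regressor such as scikit-learn's LinearRegression accepts a 2-D y natively. A sketch under that assumption, reusing the question's variables:

from sklearn.linear_model import LinearRegression

# LinearRegression fits one output per target column
pred_model = LinearRegression()
pred_model.fit(x_train, y_train)         # x_train: (37, 1), y_train: (37, 51)
print(pred_model.predict(x_test).shape)  # (13, 51)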
I wrote a function to split numpy ndarrays x_data and y_data into training and test data based on a percentage of the total size.
Here is the function:
def split_data_into_training_testing(x_data, y_data, percentage_split):
    number_of_samples = x_data.shape[0]
    p = int(number_of_samples * percentage_split)
    x_train = x_data[0:p]
    y_train = y_data[0:p]
    x_test = x_data[p:]
    y_test = y_data[p:]
    return x_train, y_train, x_test, y_test
In this function, the top part of the data goes to the training dataset and the bottom part of the data samples go to the testing dataset based on percentage_split. How can this data split be made more randomized before being fed to the machine learning model?
Assuming there's a reason you're implementing this yourself instead of using sklearn.model_selection.train_test_split, you can shuffle an array of indices (which leaves the original data untouched) and index with it.
def split_data_into_training_testing(x_data, y_data, split, shuffle=True):
    idx = np.arange(len(x_data))
    if shuffle:
        np.random.shuffle(idx)
    p = int(len(x_data) * split)
    x_train = x_data[idx[:p]]
    x_test = x_data[idx[p:]]
    y_train = y_data[idx[:p]]  # the same shuffled indices keep x and y aligned
    y_test = y_data[idx[p:]]
    return x_train, x_test, y_train, y_test
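Hypothetical usage with synthetic arrays (the names and shapes here are illustrative):

import numpy as np

x_data, y_data = np.random.rand(100, 4), np.random.rand(100)
x_tr, x_te, y_tr, y_te = split_data_into_training_testing(x_data, y_data, 0.8)
print(x_tr.shape, x_te.shape)  # (80, 4) (20, 4)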
You can pick p indices at random and index the arrays with them. I would do this by shuffling an array of the available indices:
ind = np.arange(number_of_samples)
np.random.shuffle(ind)
ind_train = np.sort(ind[:p])
ind_test = np.sort(ind[p:])
x_train = x_data[ind_train]
y_train = y_data[ind_train]
x_test = x_data[ind_test]
y_test = y_data[ind_test]
Sorting the indices is only necessary if your original data is monotonically increasing or decreasing in x and you'd like to keep it that way. Otherwise, ind_train = ind[:p] is just fine.
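As a minor variant, np.random.permutation builds the shuffled index array in a single call:

# one-call equivalent of np.arange followed by np.random.shuffle
ind = np.random.permutation(number_of_samples)
ind_train, ind_test = np.sort(ind[:p]), np.sort(ind[p:])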
I have a pandas DataFrame with the following info:
RangeIndex: 920 entries, 0 to 919
Data columns (total 41 columns)
X = df[df.columns[:-1]]
Y = df['my_Target']
train_X, train_y, test_X, test_y = train_test_split(X, Y, test_size=0.33, shuffle=True, random_state=45)
The last column is the target, and the rest is the data.
The shape is the following:
print(train_X.shape,train_y.shape,test_X.shape, test_y.shape)
(616, 40) (304, 40) (616,) (304,)
However when I train a model:
model = svm.SVC(kernel='linear', C=0.1, gamma=0.1)
model.fit(train_X, train_Y)
prediction2 = model.predict(test_X)
print('Accuracy for linear SVM is', metrics.accuracy_score(prediction2, test_Y))
it gives the following error:
model.fit(train_X,train_Y)
ValueError: Found input variables with inconsistent numbers of
samples: [616, 2]
Anyone got a hint about what is going on?
Your variables are in the wrong order:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)
Per the docs, the return order is X_train, then X_test, then y_train, then y_test.
You have:
train_X, train_y, test_X, test_y
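So with the names from the question, the unpacking should read (keeping the original arguments):

train_X, test_X, train_y, test_y = train_test_split(
    X, Y, test_size=0.33, shuffle=True, random_state=45)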