Dividing a big dataset in Python

My dataset's features have shape (80102, 2592) and the labels have shape (80102, 2). I want to use only a few rows for training because training the CNN model is taking a lot of time. How can I divide the dataset in Python and use only a few rows for both training and testing?

If your data is in the form of arrays, let X be the array containing the data and y be the array containing the labels. You can use sklearn's train_test_split function to create a new sample of the data, per the code below.
from sklearn.model_selection import train_test_split

percent = 0.1  # fraction of the data you want to use, in this case 10%
X_data, X_dummy, y_labels, y_dummy = train_test_split(X, y, train_size=percent, random_state=123, shuffle=True)
X_data will contain 10% of the original data and will be shuffled
y_labels will contain 10% of the corresponding labels.
If you want to set the number of samples explicitly, set train_size to an integer value. If you need further information, the documentation is located here. If your data is a pandas DataFrame, you can use the pandas function pandas.DataFrame.sample. Documentation for that is here. Assume your data frame is called data. The code below will produce a new data frame with a specified percent of the original rows:
percent = 0.1
new_data = data.sample(n=None, frac=percent, replace=False, weights=None, random_state=123, axis=0)
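As mentioned above, train_size also accepts an absolute row count. A minimal sketch, assuming X and y are the arrays from the question and 5000 is just an illustrative number of rows:
from sklearn.model_selection import train_test_split

n_rows = 5000  # hypothetical row count, purely for illustration
X_small, _, y_small, _ = train_test_split(X, y, train_size=n_rows, random_state=123, shuffle=True)
print(X_small.shape, y_small.shape)  # e.g. (5000, 2592) and (5000, 2) for the shapes in the question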

Related

How to find the number of rows and columns in a MinMaxScaler object?

I made a DataFrame from a csv file and passed it into train_test_split, then used MinMaxScaler to scale the X and y data, but now I want to know the number of rows and columns and can't.
df=pd.read_csv("cancer_classification.csv")
from sklearn.model_selection import train_test_split
X = df.drop("benign_0__mal_1",axis=1).values
y = df["benign_0__mal_1"].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train = scaler.fit(X_train)
X_test = scaler.fit(X_test)
X_train.shape
This is throwing the following error:
AttributeError Traceback (most recent call last)
in ()
----> 1 X_train.shape
AttributeError: 'MinMaxScaler' object has no attribute 'shape'
I read the documentation and was able to find the number of rows using scale_, but not the number of columns.
This is how the answer should look, but I was not able to find an attribute that can help.
MinMaxScaler is an object that can fit itself to certain data and also transform that data. There are three relevant methods:
- The fit method fits the scaler's parameters to the data, then returns the MinMaxScaler object itself.
- The transform method transforms data based on the scaler's fitted parameters, then returns the transformed data.
- The fit_transform method first fits the scaler to the data, then transforms it and returns the transformed version of the data.
In your example, you are treating the MinMaxScaler object itself as the data! (see 1st bullet point)
The same MinMaxScaler shouldn't be fitted twice on different datasets, since its internal values will change. You should never fit a MinMaxScaler on the test dataset, since that is a way of leaking test data into your model. What you should be doing is fit_transform() on the training data and transform() on the test data.
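Applied to the code in the question, a minimal sketch of that pattern (reusing the asker's variable names):
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)  # fit the scaler on the training data only, then transform it
X_test = scaler.transform(X_test)        # reuse the fitted parameters to transform the test data
X_train.shape                            # X_train is a NumPy array again, so .shape works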
The answer here may also help with this explanation: fit-transform on training data and transform on test data
When you call StandardScaler.fit(X_train), it calculates the mean and variance of the values in X_train. Then calling .transform() will transform all of the features by subtracting the mean and dividing by the standard deviation (the square root of that variance). For convenience, these two function calls can be done in one step using fit_transform().
The reason you want to fit the scaler using only the training data is because you don't want to bias your model with information from the test data.
If you fit() to your test data, you'd compute a new mean and variance for each feature. In theory these values may be very similar if your test and train sets have the same distribution, but in practice this is typically not the case.
Instead, you want to only transform the test data by using the parameters computed on the training data.
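A small sketch to illustrate that the fitted statistics come from the training split alone (the numbers are toy values, purely illustrative):
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])  # toy training data
X_test = np.array([[10.0]])                # toy test data

scaler = StandardScaler().fit(X_train)
print(scaler.mean_, scaler.var_)  # [2.] [0.66666667] -- computed from X_train only
print(scaler.transform(X_test))   # [[9.79795897]] -- scaled with the training mean and std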

Can I use StandardScaler() on whole data set, or should I calculate on train and test sets separately?

I'm developing an SVR for ~100 continuous features and a continuous label.
For scaling the data, I wrote:
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Read in
df = pd.read_csv(data_path, sep='\t')
features = df.iloc[:, 1:-1]  # 100 features
target = df.iloc[:, -1]      # The label
names = df.iloc[:, 0]        # Column names
# Scale features
scaler = StandardScaler()
scaled_df = scaler.fit_transform(features)
# fit_transform returns a NumPy array, so rebuild a DataFrame with the original column names
scaled_df = pd.DataFrame(scaled_df, columns=features.columns)
So now I have a scaled data frame, and my next step was to split into train and test, and then develop a model (SVR):
X_train, X_test, y_train, y_test = train_test_split(scaled_df, target, test_size=0.2)
model = SVR()
...and then I fit the model to the data.
But I noticed other people don't fit the StandardScaler() to the whole data frame, but they split the dataframe into train and test first, and then apply StandardScaler() to each separately.
Is there a difference between whether you apply the StandardScaler to the whole data frame, or train and test separately?
The previous answer says that you should separate the training and testing set when scaling, otherwise the testing one might bias the transformation of the training one. This is half correct and half wrong.
If you do the transformation separately, then it might well be that the test set will get scaled to the wrong proportions (e.g. if it comes from a narrow continuous time range, and thus only covers a subset of the range of values). You will end up having wrong values for the variables of the test set.
In general, what you must do is scale on the training set and transfer the scale over to the testing set. This is done by using the methods fit and transform separately, as seen in the documentation.
You need to fit the StandardScaler on the training set only, to prevent the distribution of the test set from leaking into the model. If you fit the scaler on the full dataset before splitting, information from the test set is used to transform the training set, which is then used to train the model.
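In terms of the code in the question, a minimal sketch of that order of operations (split first, then fit the scaler on the training portion only):
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics come from the training split only
X_test_scaled = scaler.transform(X_test)        # test split reuses those statistics

model = SVR()
model.fit(X_train_scaled, y_train)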

How to split the fixed number of rows in a data into Xtest, Xtrain , Ytrain and Ytest without train_test_split function in python

I have a data set with 80 columns. In Python I want to split the data into the first 60 rows as train data and 13 rows as test data. The data gets split randomly if I use the train_test_split function. I don't want random data for train.
E.g: Data set columns looks like the below:
Date | dependent_variable | independent_variable_1 | independent_variable_2
train = data[:80]
test = data[13:]
From this, how do I split the dependent variable and independent variables (Xtrain, Xtest, Ytrain and Ytest)?
Thanks in advance.
The data gets split randomly if I use train_test_split function. I don't want random data for train.
By default it's random, yes, but you can make it NOT random.
Call the function as train_test_split(X, y, test_size=0.33, shuffle=False) and notice the parameter shuffle:
Whether or not to shuffle the data before splitting
You will achieve your objective of splitting without random splits.
Finally, train_test_split splits your dataset rows using the test_size, so if you want to do it manually, keep in mind that you should split the rows and not the columns, and keep the respective columns for the X and the y.
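A minimal sketch of the manual split, assuming data is a pandas DataFrame with the column layout shown in the question and that the first 60 rows should be the training portion:
train = data.iloc[:60]   # first 60 rows for training
test = data.iloc[60:]    # remaining rows for testing

X_train = train.drop(columns=["Date", "dependent_variable"])
y_train = train["dependent_variable"]
X_test = test.drop(columns=["Date", "dependent_variable"])
y_test = test["dependent_variable"]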

I would like to consider a feature set(vector) for a data in python for my machine learning algorithm. How can I do it?

I have data in the following form
Class Feature set list
classlabel1 - [size,time] example:[6780.3,350.00]
classlabel2 - [size,time]
classlabel3 - [size,time]
classlabel4 - [size,time]
How do I save this data in excel sheet and how can I train the model using this feature set? Currently I am working on SVM classifier.
I have tried saving the feature set list in a dataframe and saving this dataframe to a csv file. But the size and time are getting split into two different columns.
The dataframe is getting saved in csv file in the following way:
col 0 col1 col2
62309 396.5099154 label1
I would like to train and test on the feature vector [size,time] combined. Is it possible and is this a right way? If it is possible, how can I do it?
Firstly responding to your question:
I would like to train and test on the feature vector [size,time]
combined. Is it possible and is this a right way? If it is possible,
how can I do it?
Combining the two is not the right thing to do, because they are on two different scales (if they are actually what their names suggest), and combining them would also lose the information that each of them provides on its own. They are two totally independent features for any supervised ML algorithm, so I would suggest treating these two features separately rather than combining them into one.
Now let's move on to the next section:
How do I save this data in excel sheet and how can I train the model
using this feature set? Currently I am working on SVM classifier.
Storing data: In my opinion you can store the data in whichever format you want, but I would prefer csv format as it is convenient and loading the data file is faster.
sample_data.csv
size,time,class_label
100,150,label1
200,250,label2
240,180,label1
Below is the code for reading the data from csv and training an SVM:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
# loading data
data = pd.read_csv("sample_data.csv")
# Dividing into dependent and independent features
Y = data.class_label.values
X = data.drop("class_label", axis=1).values
# encode the class column values
label_encoded_Y = preprocessing.LabelEncoder().fit_transform(list(Y))
# split training and testing data
x_train, x_test, y_train, y_test = train_test_split(X, label_encoded_Y,
                                                     train_size=0.8,
                                                     test_size=0.2)
# Now use whichever training algorithm you want
clf = SVC(gamma='auto')
clf.fit(x_train, y_train)
# Using the predictor
y_pred = clf.predict(x_test)
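accuracy_score is imported above but never used; for example, to evaluate the predictions on the held-out test split:
# Evaluating the predictions against the true test labels
print(accuracy_score(y_test, y_pred))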
Since size and time are different features, you should separate them into 2 different columns so your model can assign a separate weight to each of them, i.e.
# data.csv
size time label
6780.3 350.00 classLabel1
...
If you want to transform the data you have into the format above, you could use pandas.read_excel and use ast to convert the string list into a Python list object.
import pandas as pd
import ast
df = pd.read_excel("data.xlsx")
size_time = [(ast.literal_eval(x)[0], ast.literal_eval(x)[1]) for x in df["Feature set list"]]
size = [x[0] for x in size_time]
time = [x[1] for x in size_time]
label = df["Class"]
new_df = pd.DataFrame({"size":size, "time":time, "label":label})
# This will result in the DataFrame below.
# size time label
# 6780.3 350.0 classlabel1
# Save DataFrame to csv
new_df.to_csv("data_fix.csv")
# Use it
x = new_df.drop("label", axis=1)
y = new_df.label
# Further data preparation, such as split the dataset
# into train and test set, etc.
...
Hope this helps

Splitting dataset into two non-redundant numpy arrays?

I have a numpy array "my_data". I am trying to split this dataset randomly. However, when I do this using the following code, I get a "train" array and a "test" array. The train array and test array have some rows in common.
training_idx = np.random.randint(my_data.shape[0], size=split_size)
test_idx = np.random.randint(my_data.shape[0], size=len(my_data)-split_size)
train, test = my_data[training_idx,:], my_data[test_idx,:]
My intention is to pick the train array randomly first, and then have whatever rows of my_data are left over, i.e. not in the train array, become the test array.
Is there a way in numpy to do so? (I am refraining from using sklearn to split my data.)
I referred to the post below to get where I am with my dataset.
How to split/partition a dataset into training and test datasets for, e.g., cross validation?
If I code per this post's logic, I end up getting train and test data sets that have some redundant rows in them. I intend to make train and test datasets where no rows are shared.
Following this answer you can do:
# pick split_size distinct row indices for the training set (no repeats)
train_idx = np.random.choice(my_data.shape[0], size=split_size, replace=False)
# build a 1-D boolean mask over the rows (not over the full 2-D array)
mask = np.ones(my_data.shape[0], dtype=bool)
mask[train_idx] = False
train, test = my_data[~mask], my_data[mask]
Although, a more natural way would be to slice a permutation of your data, as Poojan suggested.
permuted = np.random.permutation(my_data)
train, test = permuted[:split_size], permuted[split_size:]
