I have a dataset of around 15,500 rows. The dataset consists of two columns: a text column (independent variable) and an output column (dependent variable). Output is binary (0 or 1). Around 9,500 rows have a value in the output column, so I can use them for training; the remaining 6,000 rows have no output value, and I want to use them for testing. All 15,500 rows are in one single file. I created a model definition file in which I used the parallel_CNN encoder for the text column. I used the following command to train and test on the dataset:
ludwig experiment --dataset dataset_name.csv --config_file model_definitions.yml
Now the problem is that I never tell the program to use the first 9,500 rows for training and the remaining rows for testing. Is there any way in Ludwig to pass an argument that specifies which rows to use for training and which rows to use for testing? Or is there a better way of doing the same task?
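One way to make the split explicit before any tool sees the data is to separate the labeled rows from the unlabeled ones yourself. The following is a minimal pandas sketch, not a Ludwig feature: the inline CSV is a made-up stand-in for dataset_name.csv, and the file names train.csv and predict.csv are hypothetical.

```python
import io

import pandas as pd

# Made-up stand-in for dataset_name.csv: rows with an empty `output` are unlabeled.
csv_text = """text,output
good product,1
bad service,0
will buy again,1
not sure yet,
needs review,
"""
df = pd.read_csv(io.StringIO(csv_text))

# Rows with a label can be used for training.
labeled = df[df["output"].notna()].copy()
labeled["output"] = labeled["output"].astype(int)

# Rows without a label can only be predicted on, so the empty column is dropped.
unlabeled = df[df["output"].isna()].drop(columns=["output"])

# Hypothetical file names; uncomment to write the two files to disk.
# labeled.to_csv("train.csv", index=False)
# unlabeled.to_csv("predict.csv", index=False)
```

Since the 6,000 unlabeled rows have no ground truth, they can be used to generate predictions but not to compute test metrics; for metrics, part of the labeled rows would have to be held out. How to pass separate files to Ludwig depends on the version, so check its documentation.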
Related
I'm trying to get into machine learning, and I've been following this tutorial:
https://www.analyticsvidhya.com/blog/2021/05/classification-algorithms-in-python-heart-attack-prediction-and-analysis/
Near the end, we split the dataset into training and testing using train_test_split
x = data3.drop("output", axis=1)
y = data3["output"]
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.3)
That is, we use the same dataset for training and testing, 70% for training and 30% for testing.
But how can I use another dataset to test my model?
One scenario came to mind: "You trained your model on 250 patients, now test it against these 3 patients that we have, so we can see the chances of them having a heart attack".
How can I, instead of splitting the data, use another CSV/dataframe as the test set? Assume this test data has the same format as the training data, just with fewer rows.
train_test_split(x,y,test_size=0.3) only divides the data into a training set and a testing set. After training the model on the training data, you can test it on any other data as well; you just have to make sure the columns and their types are the same as in the training data. To test on 3 patients, pass their data to the model.predict() function as a dataframe or an array, depending on how the model was trained.
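A minimal sketch of predicting for a few new patients follows. The data is made up, the column names age and chol are placeholders rather than the tutorial's full feature set, and LogisticRegression is just one possible classifier.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Made-up training data; the column names are placeholders.
train = pd.DataFrame({
    "age": [63, 37, 41, 56, 57, 44, 52, 60],
    "chol": [233, 250, 204, 236, 354, 263, 199, 280],
    "output": [1, 1, 0, 1, 0, 0, 0, 1],
})
x_train = train.drop("output", axis=1)
y_train = train["output"]

model = LogisticRegression(max_iter=1000)
model.fit(x_train, y_train)

# New patients: same feature columns in the same order, no "output" column.
new_patients = pd.DataFrame({"age": [50, 61, 39], "chol": [210, 300, 240]})

predictions = model.predict(new_patients)                 # predicted class per patient
probabilities = model.predict_proba(new_patients)[:, 1]   # estimated probability of class 1
```

predict_proba is what answers the "chances of a heart attack" part of the question; predict only returns the 0/1 class.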
Just as you load one dataframe from a file:
data1 = pd.read_csv("heart.csv")
If you had two separate data files, you'd load them into separate dataframes and skip the train_test_split step.
train_df = pd.read_csv("heart_train.csv")
test_df = pd.read_csv("heart_test.csv")
Since the two dataframes are already separate, you just have to make sure you apply the same cleaning and pre-processing steps to both of them, including separating the target variable (y) from the features.
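As a rough sketch of that workflow: the inline CSV strings stand in for heart_train.csv and heart_test.csv, the columns are made up, and StandardScaler is only an example of a pre-processing step that is fitted on the training data and then reused on the test data.

```python
import io

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Inline stand-ins for heart_train.csv and heart_test.csv (made-up values).
train_csv = """age,chol,output
63,233,1
37,250,1
41,204,0
56,236,1
57,354,0
44,263,0
"""
test_csv = """age,chol,output
52,199,0
60,280,1
39,240,0
"""
train_df = pd.read_csv(io.StringIO(train_csv))
test_df = pd.read_csv(io.StringIO(test_csv))

# Separate the target from the features in both dataframes.
x_train, y_train = train_df.drop("output", axis=1), train_df["output"]
x_test, y_test = test_df.drop("output", axis=1), test_df["output"]

# Fit the pre-processing on the training data only, then reuse it on the test data.
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)

model = LogisticRegression().fit(x_train_scaled, y_train)
accuracy = model.score(x_test_scaled, y_test)  # fraction of correct test predictions
```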
I have this dataset for agriculture raw materials from 1990 to 2017, and I am trying to make some price predictions for sake of learning:
Here are all the columns:
Now I want to split the dataset into a training set and a test set so I can apply some machine learning models for prediction. However, it is not clear to me what my target variable y should be, considering that each column has its own prices and they are all independent of each other. How should I split this dataset if I want to make price predictions?
From your data, I can see several raw material prices that could be predicted. Since these prices are independent of each other, you can build a dataset with just one dependent variable (for example Copra_Price) plus the independent variables, removing the other price-related variables from the data. Once you have this dataset, you can split it into train and test with Copra_Price as the target. This can be repeated for each of the price variables.
One more consideration: if none of the price variables contains anomalies, you could use any one of them to split the data, since a random selection of rows based on one of them would most likely be a random selection across the whole group.
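A small sketch of the one-target-at-a-time idea is below. Only Copra_Price comes from the question; the other column names (Cotton_Price, Rubber_Price, Year, Month), the random values, and the helper make_xy are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: three price columns plus two non-price feature columns.
rng = np.random.default_rng(0)
n_rows = 20
data = pd.DataFrame({
    "Year": np.repeat([1990, 1991], n_rows // 2),
    "Month": np.tile(np.arange(1, n_rows // 2 + 1), 2),
    "Copra_Price": rng.uniform(200, 900, n_rows),
    "Cotton_Price": rng.uniform(1, 3, n_rows),
    "Rubber_Price": rng.uniform(0.5, 4, n_rows),
})
price_columns = ["Copra_Price", "Cotton_Price", "Rubber_Price"]


def make_xy(frame, target):
    """Keep one price column as y and drop every price column from X."""
    x = frame.drop(columns=price_columns)
    y = frame[target]
    return x, y


# Repeat this for each entry in price_columns to get one dataset per target.
x, y = make_xy(data, "Copra_Price")
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=0
)
```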
I'm using OneHotEncoding to generate dummies for a classification problem. When used on the training data, I get ~300 dummy columns, which is fine. However, when I input new data (which has fewer rows), the OneHotEncoding only generates ~250 dummies. That isn't surprising considering the smaller dataset, but I then can't use the new data with the model because the features don't align.
Is there a way to retain the OneHotEncoding schema to use on new incoming data?
I think you are using fit_transform on both the training and the test dataset. That is not the right approach, because the encoding schema has to be consistent across both datasets for the model to interpret the features correctly.
The correct way is to use:
fit_transform on the training data
transform on the test data
This way, you get a consistent number of columns.
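A minimal sketch with scikit-learn's OneHotEncoder and made-up data is below. The handle_unknown="ignore" option is an addition that the answer does not mention; without it, transform raises an error when the new data contains a category that was absent from training.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up data: the new data lacks some training categories and adds an unseen one.
train = pd.DataFrame({"city": ["Paris", "Rome", "Berlin", "Rome"]})
new = pd.DataFrame({"city": ["Rome", "Madrid"]})

# handle_unknown="ignore" encodes unseen categories as all zeros.
encoder = OneHotEncoder(handle_unknown="ignore")

train_encoded = encoder.fit_transform(train)  # learns the categories from training data
new_encoded = encoder.transform(new)          # reuses the learned categories

# Both results are sparse matrices with one column per training category.
n_train_columns = train_encoded.shape[1]
n_new_columns = new_encoded.shape[1]
```

To reuse the same schema in a later session, the fitted encoder object can be saved (for example with joblib or pickle) alongside the model.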
I've been working with the scikit-learn library for machine learning.
I have a problem with dummy variables when using regression. I have two sets of samples: a training set and a test set. The program uses the training set to build the prediction model, then the test set to check the score. If the shapes are equal, it runs fine. But creating dummy variables changes the shapes, and the two sets end up with different shapes.
Example
Training set: 130 rows * 3 columns
Test set: 60 rows * 3 columns
After converting columns 1 and 2 to dummies, the shapes change:
Training set: 130 rows * 15 columns
Test set: 60 rows * 12 columns
Is there any solution to this problem?
Or is it possible to proceed even though the data shapes are different?
Sample Program: https://www.dropbox.com/s/tcc1ianmljf5i8c/Dummy_Error.py?dl=0
If I understand your code correctly, you are using pd.get_dummies to create the dummy variables and are passing your entire data frame to the function.
Pandas will then create a dummy variable for every value of every categorical column it finds. Here, it looks like more category values exist in training than in test, which is why you end up with more columns in training than in test.
A better approach is to combine everything in one dataframe, create categorical variables in the combined data set and then split your data into train and test.
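The combine-then-split approach could look roughly like this, with made-up frames and column names. The last statement shows an alternative that the answer does not mention: reindexing the test dummies to the training columns, which avoids looking at the test data when defining the columns.

```python
import pandas as pd

# Made-up data: the test set is missing some categories present in training.
train = pd.DataFrame({
    "color": ["red", "green", "blue"],
    "size": ["S", "M", "L"],
    "value": [1.0, 2.0, 3.0],
})
test = pd.DataFrame({
    "color": ["red", "red"],
    "size": ["S", "M"],
    "value": [4.0, 5.0],
})

# Combine with an outer key so the two parts can be separated again afterwards.
combined = pd.concat([train, test], keys=["train", "test"])
dummies = pd.get_dummies(combined, columns=["color", "size"])

train_dummies = dummies.loc["train"]
test_dummies = dummies.loc["test"]

# Alternative (an addition, not from the answer): encode the test set alone,
# then align it to the training columns, filling absent dummies with 0.
alt_test_dummies = pd.get_dummies(test, columns=["color", "size"]).reindex(
    columns=train_dummies.columns, fill_value=0
)
```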
I have a dataset with 19 features. I need to impute missing values, encode the categorical variables using scikit-learn's OneHotEncoder, and then run a machine learning algorithm.
My question is: should I do all of the above on the whole dataset and then split it using scikit-learn's train_test_split, or should I first split into train and test and then do the missing value imputation and encoding on each set?
My concern is that if I split first and then do the imputation and encoding on the two resulting sets, the test set might be missing some values of a categorical variable, resulting in fewer dummies. For example, if the original data had 3 levels for a categorical variable, then even with random sampling, isn't there a chance that the test set will not contain all three levels, giving only two dummies instead of three?
What's the right approach: splitting first and then doing all of the above on train and test, or doing the imputation and encoding on the whole dataset first and then splitting?
I would first split the data into a training and testing set. Your missing value imputation strategy should be fitted on the training data and applied both on the training and testing data.
For instance, suppose you intend to replace missing values with the most frequent value or the median. This knowledge (the median, the most frequent value) must be obtained without having seen the testing set; otherwise, your missing value imputation will be biased. If some values of a feature are unseen in the training data, you can, for instance, increase your overall number of samples or use a missing value imputation strategy that is robust to outliers.
Here is an example of how to perform missing value imputation using a scikit-learn pipeline and imputer: