I've been working with the scikit-learn library for machine learning and ran into a problem with dummy variables while doing regression. I have two samples, a training set and a test set. The program uses the training set to build a prediction model and then uses the test set to check the score. If the shapes of the two sets are equal, everything runs fine, but creating dummy variables changes the shapes and makes them differ.
Example
Training set: 130 rows * 3 columns
Test set: 60 rows * 3 columns
After turning columns 1 and 2 into dummies, the shapes change:
Training set: 130 rows * 15 columns
Test set: 60 rows * 12 columns
Is there any solution to this problem?
Or is it even possible to continue when the data shapes are different?
Sample Program: https://www.dropbox.com/s/tcc1ianmljf5i8c/Dummy_Error.py?dl=0
If I understand your code correctly, you are using pd.get_dummies to create the dummy variables and are passing your entire data frame to the function.
In that case, pandas creates a dummy variable for every value of every categorical column it finds, and it looks like more category values exist in the training set than in the test set. That is why you end up with more columns in training than in test.
A better approach is to combine everything into one dataframe, create the dummy variables on the combined data, and then split it back into train and test.
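A minimal sketch of that approach (train_df, test_df and the column names are placeholders):
import pandas as pd

# concatenate the two original data sets so both see the same category values
combined = pd.concat([train_df, test_df], keys=['train', 'test'])

# encode the categorical columns once on the combined frame
combined = pd.get_dummies(combined, columns=['cat_col_1', 'cat_col_2'])

# split back; the two frames now have identical columns
train_encoded = combined.xs('train')
test_encoded = combined.xs('test')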
I have a dataset of around 15,500 rows. It consists of two columns: a text column (the independent variable) and an output column (the dependent variable). The output has binary values (0 and 1). Around 9,500 rows have a value in the output column (so I can use them for training), and I want to use the remaining 6,000 rows (which have no output value) for testing. All 15,500 rows are in a single file. I created a model definition file in which I used the parallel_CNN encoder for the text column, and I used the following command to train and test on the dataset:
ludwig experiment --dataset dataset_name.csv --config_file model_definitions.yml
The problem is that I don't know how to tell the program to use the first 9,500 rows for training and the remaining rows for testing. Is there any way in Ludwig to pass an argument specifying which rows should be used for training and which for testing? Or is there a better way of doing the same task?
Say I have created a random forest regression model using the train/test data available to me.
This involved feature scaling and categorical data encoding.
Now, if I get a new dataset on a new day and need to use this model to predict the outcome for that dataset and compare the predictions with the outcomes I actually received, do I need to apply feature scaling and categorical data encoding to this dataset as well?
For example: on day 1 I have 10K rows with 6 features and 1 label -- a regression problem.
I built a model using this data.
On day 2, I get 2K rows with the same features and label, but of course the values within them are different.
Now I first want to use the model on the day-2 data to predict what the label should be according to my model.
Secondly, I want to compare the model's output against the original day-2 labels that I have.
So, in order to do this, when I pass the day-2 features to the model as the test set, do I need to first apply feature scaling and categorical data encoding to them?
This is partly about making predictions and validating them against the received data in order to assess its quality.
You always need to pass data to the model in the format it expects. If the model has been trained on scaled, encoded, etc. data, you need to perform all of these transformations every time you push new data into the trained model (for whatever reason).
The easiest solution is to use sklearn's Pipeline to create a pipeline with all those transformations included, and then use it instead of the model itself to make predictions for new entries, so that all the transformations are applied automatically.
Example - automatically applying a StandardScaler to the features before they reach the model:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipe = Pipeline([('scaler', StandardScaler()), ('svc', SVC())])

# then
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)
pipe.predict(X_new)
The same holds for the dependent variable. If you scaled it before training your model, you will either need to scale the new values as well, or apply the inverse transformation to the model's output before comparing it with the original dependent variable values.
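For example, a minimal sketch assuming the label was scaled with its own StandardScaler (all variable names here are placeholders); scikit-learn's TransformedTargetRegressor can automate the same bookkeeping:
import numpy as np
from sklearn.preprocessing import StandardScaler

# scale the training labels and keep the fitted scaler
y_scaler = StandardScaler()
y_train_scaled = y_scaler.fit_transform(np.asarray(y_train).reshape(-1, 1)).ravel()

# ... train the model on (X_train_transformed, y_train_scaled) ...

# bring day-2 predictions back to the original units before comparing them
y_pred_scaled = model.predict(X_day2_transformed)
y_pred = y_scaler.inverse_transform(y_pred_scaled.reshape(-1, 1)).ravel()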
I am training Random Forests with two sets of "true" y values (empirical). I can easily tell which one is better.
However, I was wondering whether there is a simple method, other than brute force, to pick the values from each set that would produce the best model. In other words, I would like to automatically mix both y sets to produce a new, ideal one.
Say, for instance, biological activity: different experiments and different databases provide different values. This is a simple example showing two different sets of y values in columns 3 and 4.
4a50,DQ7,47.6,45.4
3atu,ADP,47.7,30.7
5i9i,5HV,47.7,41.9
5jzn,GUI,47.7,34.2
4bjx,73B,48.0,44.0
4a6c,QG9,48.1,45.5
I know that column 3 is better because I have already trained different models against each of them, and because I checked a few articles to verify which value is correct; column 3 is right more often than column 4. However, I have thousands of rows and cannot read thousands of papers.
So I would like to know whether there is an algorithm that, for instance, would use column 3 as the base for the true y values but would pick values from column 4 when doing so improves the model.
It would be useful if it reported the final y column and could handle more than 2 sets, but I think I can figure that part out.
The idea now is to find out if there is already a solution out there so that I don't need to reinvent the wheel.
Best,
Miro
NOTE: The features (x) are in a different file.
The problem is that an algorithm alone doesn't know which label is better.
What you could do: train a classifier on data which you know is correct, use the classifier to predict a value for each datapoint, compare this value to the two lists of labels you already have, and choose the label which is closer.
This solution obviously isn't perfect, since the result depends on the quality of the classifier that predicts the value, and you still need enough labeled data to train it. Additionally, there is a chance that the classifier itself predicts a better value than either of your two lists of labels.
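A rough sketch of that idea, using a RandomForestRegressor since the example values look continuous (X, y_col3, y_col4, trusted_idx and y_verified are placeholders for your data):
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# reference model fitted only on the rows whose labels you have verified by hand
ref_model = RandomForestRegressor(n_estimators=200, random_state=0)
ref_model.fit(X[trusted_idx], y_verified)

# predict every row, then keep whichever of the two candidate labels is closer
pred = ref_model.predict(X)
y_mixed = np.where(np.abs(y_col3 - pred) <= np.abs(y_col4 - pred), y_col3, y_col4)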
Use column 3 and column 4 together as the target (y) values when fitting the random forest model, and predict both with your result. That way, the algorithm keeps track of both y values and their correlation with the predicted values. Your problem seems to be a multi-output problem, where there are multiple target variables (multiple y values), as you suggest.
Random forest supports this kind of multi-output fitting: the fit(X, y) method accepts y as an array-like of shape [n_samples, n_outputs].
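A minimal sketch of a multi-output fit; a RandomForestRegressor is used here because the example values look continuous, but RandomForestClassifier accepts the same y shape (X, y_col3 and y_col4 are placeholders):
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# two target columns: the values from column 3 and from column 4
y_multi = np.column_stack([y_col3, y_col4])   # shape (n_samples, 2)

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X, y_multi)

pred = model.predict(X)                       # one column of predictions per target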
See the scikit-learn references on multioutput classification and sklearn.ensemble.RandomForestClassifier.fit, and check the documentation on multiclass and multi-output classification.
Quite new to SciKit and linear algebra/machine learning with Python in general, so I can't seem to solve the following:
I have a training set and a test set of data, containing both continuous and discrete/categorical values. The CSV files are loaded into Pandas DataFrames and match in shape, being (1460,81) and (1459,81).
However, after using Pandas' get_dummies, the shapes of the DataFrames change to (1460, 306) and (1459, 294). So, when I do linear regression with the SciKit Linear Regression module, it builds a model with 306 variables and then tries to predict with only 294 of them. This then, naturally, leads to the following error:
ValueError: shapes (1459,294) and (306,1) not aligned: 294 (dim 1) != 306 (dim 0)
How could I tackle such a problem? Could I somehow reshape the (1459, 294) to match the other one?
Thanks and I hope I've made myself clear :)
This is an extremely common problem when dealing with categorical data. There are differing opinions on how to best handle this.
One possible approach is to apply a function to categorical features that limits the set of possible options. For example, if your feature contained the letters of the alphabet, you could encode features for A, B, C, D, and 'Other/Unknown'. In this way, you could apply the same function at test time and abstract from the issue. A clear downside, of course, is that by reducing the feature space you may lose meaningful information.
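A minimal sketch of that kind of mapping (the allowed values, the 'letter' column and the train_data/test_data frames are made up for illustration):
import pandas as pd

# fixed whitelist of category values; anything else becomes 'Other'
allowed = ['A', 'B', 'C', 'D']

def limit_categories(s):
    s = s.where(s.isin(allowed), 'Other')
    # a fixed Categorical dtype makes get_dummies emit the same columns every time
    return pd.Categorical(s, categories=allowed + ['Other'])

train_data['letter'] = limit_categories(train_data['letter'])
test_data['letter'] = limit_categories(test_data['letter'])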
Another approach is to build a model on your training data, with whichever dummies are naturally created, and treat that as the baseline for your model. When you predict with the model at test time, you transform your test data in the same way your training data is transformed. For example, if your training set had the letters of the alphabet in a feature, and the same feature in the test set contained a value of 'AA', you would ignore that in making a prediction. This is the reverse of your current situation, but the premise is the same. You need to create the missing features on the fly. This approach also has downsides, of course.
The second approach is what you mention in your question, so I'll go through it with pandas.
By using get_dummies you're encoding the categorical features into multiple one-hot encoded features. What you could do is force your test data to match your training data by using reindex, like this:
test_encoded = pd.get_dummies(test_data, columns=['your columns'])
test_encoded_for_model = test_encoded.reindex(columns = training_encoded.columns,
fill_value=0)
This will encode the test data in the same way as your training data, filling in 0 for dummy features that weren't created by encoding the test data but were created during the training process.
You could just wrap this into a function, and apply it to your test data on the fly. You don't need the encoded training data in memory (which I access with training_encoded.columns) if you create an array or list of the column names.
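For example, a small helper along those lines (training_columns would be the saved list of training column names, and categorical_cols the columns you dummy-encode):
import pandas as pd

def encode_like_training(df, training_columns, categorical_cols):
    """One-hot encode df and align its columns with the training layout."""
    encoded = pd.get_dummies(df, columns=categorical_cols)
    return encoded.reindex(columns=training_columns, fill_value=0)

# e.g. training_columns = list(training_encoded.columns), saved at training time
test_ready = encode_like_training(test_data, training_columns, ['your columns'])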
For anyone interested: I ended up merging the train and test set, then generating the dummies, and then splitting the data again at exactly the same fraction. That way there wasn't any issue with different shapes anymore, as it generated exactly the same dummy data.
This works for me:
Initially, I was getting this error message:
shapes (15754,3) and (4, ) not aligned
I found out that I was creating a model using 3 variables in my train data, but when I add a constant with X_train = sm.add_constant(X_train), a constant column is automatically created, so there are now 4 variables in total.
When you test this model, the test data by default still has only 3 variables, so the error pops up because of the dimension mismatch.
So, the trick is to add the same constant column to X_test as well:
`X_test = sm.add_constant(X_test)`
Although this is just a constant column, it resolves the whole issue.
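Putting it together, a minimal statsmodels sketch assuming an ordinary OLS model (the variable names are placeholders):
import statsmodels.api as sm

X_train_const = sm.add_constant(X_train)
X_test_const = sm.add_constant(X_test)

model = sm.OLS(y_train, X_train_const).fit()
y_pred = model.predict(X_test_const)   # shapes line up: both sides include the constant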
I have a dataset with 19 features. I need to do missing value imputation, then encode the categorical variables using scikit-learn's OneHotEncoder, and then run a machine learning algorithm.
My question is whether I should do all of the above on the whole dataset and only then split it with scikit-learn's train_test_split, or whether I should first split into train and test and then do the missing value imputation and encoding on each set separately.
My concern is that if I split first and then do the missing value imputation and encoding on the two resulting sets, the test set might be missing some values of a categorical variable, resulting in fewer dummies. For example, if the original data had 3 levels for a categorical variable, then even with random sampling isn't there a chance that the test set does not contain all three levels of that variable, producing only two dummies instead of three?
What's the right approach: splitting first and then doing all of the above on train and test, or doing the missing value imputation and encoding on the whole dataset first and then splitting?
I would first split the data into a training and a testing set. Your missing value imputation strategy should be fitted on the training data and then applied to both the training and the testing data.
For instance, suppose you intend to replace missing values with the most frequent value or the median. This knowledge (the median, the most frequent value) must be obtained without having seen the testing set; otherwise, your missing value imputation will be biased. If some values of a feature are unseen in the training data, you can, for instance, increase your overall number of samples or use a missing value imputation strategy that is robust to outliers.
Here is an example of how to perform missing value imputation using a scikit-learn pipeline and imputer:
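A minimal sketch of such a pipeline (the column lists and the final estimator are placeholders, and X_train/X_test/y_train/y_test come from a prior train_test_split):
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

numeric_cols = ['num_feature_1', 'num_feature_2']        # placeholder names
categorical_cols = ['cat_feature_1', 'cat_feature_2']    # placeholder names

preprocess = ColumnTransformer([
    ('num', SimpleImputer(strategy='median'), numeric_cols),
    ('cat', Pipeline([
        ('impute', SimpleImputer(strategy='most_frequent')),
        ('encode', OneHotEncoder(handle_unknown='ignore')),
    ]), categorical_cols),
])

model = Pipeline([('preprocess', preprocess), ('clf', RandomForestClassifier())])

# fit on the training split only: the imputation statistics and category lists
# are learned from the training data and then reused on the test split
model.fit(X_train, y_train)
model.score(X_test, y_test)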