Getting 'ValueError: shapes not aligned' on SciKit Linear Regression - python

Quite new to SciKit and linear algebra/machine learning with Python in general, so I can't seem to solve the following:
I have a training set and a test set of data, containing both continuous and discrete/categorical values. The CSV files are loaded into Pandas DataFrames and match in shape, being (1460,81) and (1459,81).
However, after using Pandas' get_dummies, the shapes of the DataFrames change to (1460, 306) and (1459, 294). So when I do linear regression with SciKit's LinearRegression module, it builds a model with 306 variables and then tries to predict with only 294. This, naturally, leads to the following error:
ValueError: shapes (1459,294) and (306,1) not aligned: 294 (dim 1) != 306 (dim 0)
How could I tackle such a problem? Could I somehow reshape the (1459, 294) to match the other one?
Thanks and I hope I've made myself clear :)

This is an extremely common problem when dealing with categorical data. There are differing opinions on how to best handle this.
One possible approach is to apply a function to categorical features that limits the set of possible options. For example, if your feature contained the letters of the alphabet, you could encode features for A, B, C, D, and 'Other/Unknown'. In this way, you could apply the same function at test time and abstract from the issue. A clear downside, of course, is that by reducing the feature space you may lose meaningful information.
Another approach is to build a model on your training data, with whichever dummies are naturally created, and treat that as the baseline for your model. When you predict with the model at test time, you transform your test data in the same way your training data is transformed. For example, if your training set had the letters of the alphabet in a feature, and the same feature in the test set contained a value of 'AA', you would ignore that in making a prediction. This is the reverse of your current situation, but the premise is the same. You need to create the missing features on the fly. This approach also has downsides, of course.
The second approach is what you mention in your question, so I'll go through it with pandas.
By using get_dummies you're encoding the categorical features into multiple one-hot encoded features. What you could do is force your test data to match your training data by using reindex, like this:
test_encoded = pd.get_dummies(test_data, columns=['your columns'])
test_encoded_for_model = test_encoded.reindex(columns=training_encoded.columns,
                                              fill_value=0)
This will encode the test data in the same way as your training data, filling in 0 for dummy features that weren't created when encoding the test data but were created during the training process.
You could just wrap this into a function, and apply it to your test data on the fly. You don't need the encoded training data in memory (which I access with training_encoded.columns) if you create an array or list of the column names.
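As a sketch of that wrapper (the column names and data here are made up for illustration), the reindex step can be put in a reusable function:

```python
import pandas as pd

def encode_like_train(df, train_columns, categorical_cols):
    """One-hot encode df, then force its columns to match the training layout.

    Dummy columns that exist only in training are added and filled with 0;
    dummy columns that exist only in df are dropped.
    """
    encoded = pd.get_dummies(df, columns=categorical_cols)
    return encoded.reindex(columns=train_columns, fill_value=0)

# Hypothetical example: 'color' has an extra value ('green') in the test data.
train = pd.DataFrame({"color": ["red", "blue", "red"], "price": [1, 2, 3]})
test = pd.DataFrame({"color": ["blue", "green"], "price": [4, 5]})

train_encoded = pd.get_dummies(train, columns=["color"])
test_encoded = encode_like_train(test, train_encoded.columns, ["color"])

print(list(test_encoded.columns) == list(train_encoded.columns))  # True
```

The unseen 'green' category simply disappears and the missing 'red' dummy is filled with zeros, so the test matrix lines up with whatever model was fit on the training columns.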

For anyone interested: I ended up merging the train and test set, then generating the dummies, and then splitting the data again at exactly the same fraction. That way there wasn't any issue with different shapes anymore, as it generated exactly the same dummy data.
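A minimal sketch of that merge-encode-split approach, with toy frames standing in for the real train/test CSVs:

```python
import pandas as pd

# Hypothetical small frames standing in for the real train/test data.
train = pd.DataFrame({"city": ["NY", "LA", "NY"], "price": [1, 2, 3]})
test = pd.DataFrame({"city": ["LA", "SF"], "price": [4, 5]})

# Encode once on the combined data so both halves see every category.
combined = pd.concat([train, test], ignore_index=True)
combined_encoded = pd.get_dummies(combined, columns=["city"])

# Split back at the original boundary: both halves share one column layout.
train_encoded = combined_encoded.iloc[:len(train)]
test_encoded = combined_encoded.iloc[len(train):]

print(list(train_encoded.columns) == list(test_encoded.columns))  # True
```

Note that this only works when the test data is available up front; for genuinely new data arriving later, the reindex approach above is more robust.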

This works for me:
Initially, I was getting this error message:
shapes (15754,3) and (4, ) not aligned
I found out that I was creating a model using 3 variables in my train data. But when I add a constant with X_train = sm.add_constant(X_train), the constant variable gets created automatically, so in total there are now 4 variables.
When you test this model, the test data by default still has only 3 variables, so the error pops up because of the dimension mismatch.
So I used the same trick on the test data as well:
`X_test = sm.add_constant(X_test)`
Though this looks like a useless extra column, it solves the whole issue.
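All sm.add_constant does is prepend a column of ones; a minimal numpy sketch of the same fix (in practice you would just call the statsmodels function on both sets, as above):

```python
import numpy as np

def add_constant(X):
    """Prepend a column of ones, mimicking statsmodels' sm.add_constant."""
    X = np.asarray(X, dtype=float)
    return np.column_stack([np.ones(len(X)), X])

# Toy data: 3 features per row, as in the question.
X_train = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
X_test = np.array([[7.0, 8.0, 9.0]])

# After adding the constant to both, each has 4 columns, so shapes align.
print(add_constant(X_train).shape, add_constant(X_test).shape)  # (2, 4) (1, 4)
```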

Related

sklearn2pmml not giving the same prediction output as sklearn pickle after loading it with pypmml

When testing a PMML object previously converted from a Pickle file (dumped from a fitted sklearn object), I am unable to reproduce the same results as with the pickle model. With sklearn I obtain [0 1 0] as the classes for the input given in X, whereas with PMML I would approximate the probabilities to [1 1 1]. Is there anything I am doing wrong? This model behaviour is not what I would expect when converting the pickle file.
PMML models identify input columns by name, not by position. Therefore, one should not use "anonymized" data stores (such as numpy.ndarray) because, by definition, it will be impossible to identify input columns correctly.
Here, the order of input columns is problematic, because the RF model is trained using a small set of randomly generated data (a 3 x 4 numpy.array). It is likely that the RF model uses only one or two input columns, and ignores the rest (say, "x1" and "x3" are significant, and "x2" and "x4" are not). During conversion the SkLearn2PMML package removes all redundant features. In the current example, the RF model would expect a two-column input data store, where the first column corresponds to "x1" and the second column to "x3"; you're passing a four-column input data store instead, where the second column is occupied by "x2".
TLDR: When working with PMML models, do the following:
Don't use anonymized data stores! Train your model using a pandas.DataFrame (instead of numpy.ndarray), and also make predictions using the same data store class. This way the column mapping will always come out correct, even if the SkLearn2PMML package decided to eliminate some redundant columns.
Use the JPMML-Evaluator-Python package instead of PyPMML. Specifically, stay away from PyPMML's predict(X: numpy.ndarray) method!
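A pandas-only sketch of why name-based column mapping survives column elimination while positional mapping does not (the column names x1–x4 and the "kept" set are the hypothetical ones from the answer above):

```python
import pandas as pd

# Toy input mirroring the answer's example: four columns, of which the
# (hypothetical) converter kept only "x1" and "x3" as significant.
X = pd.DataFrame({"x1": [1, 2], "x2": [9, 9], "x3": [3, 4], "x4": [0, 0]})
kept = ["x1", "x3"]

# Name-based selection stays correct no matter which columns were dropped...
by_name = X[kept]
print(list(by_name.columns))  # ['x1', 'x3']

# ...whereas positional selection on the raw array silently grabs "x2".
by_position = X.to_numpy()[:, :2]
print(by_position[0].tolist())  # [1, 9]
```

This is exactly the failure mode described above: a two-column PMML model fed the first two positions of a four-column ndarray ends up reading "x2" where it expects "x3".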

Using class encoding for prediction?

I was wondering if you can use class encoding, specifically OneHotEncoder in Python, for prediction, if you do not know all the future feature values?
To give more context. I am predicting whether or not a fine will be paid in the future based upon the location, issuing office & amount (and potentially other features if I can get it to work). When I do onehotencoding on my training set it works great (For 100k rows of data my test accuracy is around 92%, using a 75/25 split).
However, when I then introduce the new data, there are locations and 'offices' the encoder never saw. Therefore, new features were not created. This means that in my training set, I had 2302 columns when I built my model (random forest), while when predicting using the real data, I have 3330 columns, therefore, the model I built is no longer valid. (note, I am also looking at other models as the data is so sparse)
How do you handle such a problem when class encoding? Can you only class encode if you have tight control on your future feature values?
Any help would be much appreciated. Apologies if my terminology is wrong, I am new to this and this is my first post on stackoverflow.
I can add code if it helps however I think it is more the theory which is relevant here.
There are two things to keep in mind when using OneHotEncoding.
The number of classes in a column is important. If a class is missing from the test set but present in the train set, it won't be a problem. But if a class is missing from the train set and present in the test set, the encoding will not be able to recognize the new class. This seems to be the problem in your case.
Secondly, you should use the same encoder to encode the train and test splits. This way the number of columns in the train and test splits will be the same (avoiding the 2302 vs. 3330 column mismatch), and in the case of any additional classes in the test set, the user can specify how to deal with the unknown values. Have a look at the documentation.
A possible way to deal with your issue would be to do the OneHotEncoding on the entire dataset and then split the data 75/25. This will work provided you won't have any new training data later.
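A sketch of the fit-once-on-train alternative with scikit-learn's OneHotEncoder: with handle_unknown="ignore", a category never seen at fit time (the made-up office "Z" below) is simply encoded as an all-zero row instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data: one categorical column; "Z" never appears in training.
train = pd.DataFrame({"office": ["A", "B", "A"]})
test = pd.DataFrame({"office": ["B", "Z"]})

# Fit on training data only; unknown test categories become all-zero rows.
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train[["office"]])

encoded_test = enc.transform(test[["office"]]).toarray()
print(encoded_test.tolist())  # [[0.0, 1.0], [0.0, 0.0]]
```

Because the encoder is fitted once, the test matrix always has exactly as many columns as the training matrix, regardless of what shows up later.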

Structure of data for multilabel Classification problem

I am working on a prediction problem where a user has access to multiple targets, and each access is a separate row. Below is the data:
df = pd.DataFrame({"ID": [12567, 12567, 12567, 12568, 12568],
                   "UnCode": ["LLLLLLL", "LLLLLLL", "LLLLLLL", "KKKKKK", "KKKKKK"],
                   "CoCode": [1000, 1000, 1000, 1111, 1111],
                   "CatCode": [1, 1, 1, 2, 2],
                   "RoCode": ["KK", "KK", "KK", "MM", "MM"],
                   "Target": [12, 4, 6, 1, 6]})
**Here ID is unique but can repeat if a user has accessed multiple targets, and a target can repeat as well if accessed by different IDs.**
I have converted this data to one-hot encoding and used it for prediction with binary relevance, where my X is constant and the target varies.
The problem I am facing with this approach is that the data becomes sparse, and the number of features in my original data is around 1300.
Can someone suggest whether this approach is correct or not, and what other methods/approaches I can use for this type of problem? Also, can this problem be treated as multilabel classification?

SkLearn - Why LabelEncoder().fit only to training data

I may be missing something but after following for quite a long time now the suggestion (of some senior data scientists) to LabelEncoder().fit only to training data and not also to test data then I start to think why is this really necessary.
Specifically, at SkLearn if I want to LabelEncoder().fit only to training data then there are two different scenarios:
The test set has some new labels in relation to the training set. For example, the training set has only the labels ['USA', 'UK'] while the test set has the labels ['USA', 'UK', 'France']. Then, as has been reported elsewhere (e.g. Getting ValueError: y contains new labels when using scikit learn's LabelEncoder), you get an error if you try to transform the test set according to this LabelEncoder() because it encounters exactly that new label.
The test set has the same labels as the training set. For example, both the training and the test set have the labels ['USA', 'UK', 'France']. However, then fitting LabelEncoder() only to the training data is essentially redundant, since the test set has the same known values as the training set.
Hence, what is the point of fitting LabelEncoder() only to the training data and then using LabelEncoder().transform on both the training and the test data, if in case (1) this throws an error and in case (2) it is redundant?
Let me clarify that the (pretty knowledgeable) senior data scientists whom I have seen fit LabelEncoder() only to training data justified this by saying that the test set should be entirely new to even the simplest model, like an encoder, and should not be mixed into any fitting with the training data. They did not mention anything about any production or out-of-vocabulary purposes.
The main reason to do so is because in inference/production time (not testing) you might encounter labels that you have never seen before (and you won't be able to call fit() even if you wanted to).
In scenario 2 where you are guaranteed to always have the same labels across folds and in production it is indeed redundant. But are you still guaranteed to see the same in production?
In scenario 1 you need to find a solution to handle unknown labels. One popular approach is to map every unknown label to an unknown token. In natural language processing this is called the "out of vocabulary" problem, and the above approach is often used.
To do so and still use LabelEncoder() you can pre-process your data and perform the mapping yourself.
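A minimal sketch of that pre-processing step (the "<UNK>" token and the toy label values are arbitrary choices for illustration):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

train_labels = pd.Series(["USA", "UK", "USA"])
test_labels = pd.Series(["USA", "UK", "France"])  # "France" is unseen

# Reserve an explicit out-of-vocabulary token at fit time...
le = LabelEncoder()
le.fit(list(train_labels) + ["<UNK>"])

# ...and map anything the encoder has never seen onto it before transforming.
known = set(train_labels)
test_mapped = test_labels.where(test_labels.isin(known), "<UNK>")
print(le.transform(test_mapped).tolist())  # [2, 1, 0]
```

The unseen "France" is folded into the "<UNK>" class, so transform never raises, in training, testing, or production.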
It's hard to guess why the senior data scientists gave you that advice without context, but I can think of at least one reason they may have had in mind.
If you are in the first scenario, where the training set does not contain the full set of labels, then it is often helpful to know this and so the error message is useful information.
Random sampling can often miss rare labels and so taking a fully random sample of all of your data is not always the best way to generate a training set. If France does not appear in your training set, then your algorithm will not be learning from it, so you may want to use a randomisation method that ensures your training set is representative of minority cases. On the other hand, using a different randomisation method may introduce new biases.
Once you have this information, it will depend on your data and problem to be solved as to what the best approach to solve it will be, but there are cases where it is important to have all labels present. A good example would be identifying the presence of a very rare illness. If your training data doesn't include the label indicating that the illness is present, then you better re-sample.

Python Scikit Learn, LinearRegression, Dummy Variable lead to different in shape

I've worked around with Scikit Learn Library for Machine Learning purpose.
I got a problem related to dummy variables while using regression. I have two sets of samples, a training set and a test set. The program uses the training set to create a prediction model, then the test set to check the score. While running the program, if the shapes are equal, it's fine. But the dummy variables change the shape and lead to a shape mismatch.
Example:
Training set: 130 rows * 3 columns
Test set: 60 rows * 3 columns
After making columns 1 and 2 into dummies, the shapes change:
Training set: 130 rows * 15 columns
Test set: 60 rows * 12 columns
Any solution to this problem?
Is it possible or not to proceed even though the data shapes are different?
Sample Program: https://www.dropbox.com/s/tcc1ianmljf5i8c/Dummy_Error.py?dl=0
If I understand your code correctly, you are using pd.get_dummies to create the dummy variables and are passing your entire data frame to the function.
In this case, pandas will create a dummy variable for every value in every category it finds. In this case, it looks like more category values exist in training than in test. This is why you end up with more columns in training than in test.
A better approach is to combine everything in one dataframe, create categorical variables in the combined data set and then split your data into train and test.
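A sketch of that combine-then-split approach; using keys= in pd.concat (with made-up frames and labels) makes the split back into train and test explicit:

```python
import pandas as pd

# Hypothetical stand-ins for the real train/test frames.
train = pd.DataFrame({"city": ["NY", "LA"], "y": [1, 0]})
test = pd.DataFrame({"city": ["SF", "NY"], "y": [0, 1]})

# Tag the rows before combining so they can be separated again afterwards.
combined = pd.concat([train, test], keys=["train", "test"])
encoded = pd.get_dummies(combined, columns=["city"])

train_encoded = encoded.xs("train")
test_encoded = encoded.xs("test")
print(list(train_encoded.columns) == list(test_encoded.columns))  # True
```

Every category seen anywhere in the data gets a column in both halves, so the 15-vs-12-column mismatch from the question cannot occur.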
