Structure of data for multilabel Classification problem - python

I am working on a prediction problem where a user can have access to multiple targets, and each access is stored as a separate row. Below is the data:
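import pandas as pd

# UnCode values are strings, so they must be quoted
df = pd.DataFrame({"ID": [12567, 12567, 12567, 12568, 12568],
                   "UnCode": ["LLLLLLL", "LLLLLLL", "LLLLLLL", "KKKKKK", "KKKKKK"],
                   "CoCode": [1000, 1000, 1000, 1111, 1111], "CatCode": [1, 1, 1, 2, 2],
                   "RoCode": ["KK", "KK", "KK", "MM", "MM"], "Target": [12, 4, 6, 1, 6]})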
**Here ID identifies a user; it is repeated when the user has accessed multiple targets, and a target can also be repeated when it is accessed by different IDs.**
I one-hot encoded this data and used it for prediction with binary relevance, where X stays constant and the target varies.
The problem I am facing with this approach is that the data becomes sparse; my original data has around 1300 features.
Can someone tell me whether this approach is correct, and what other methods I could use for this type of problem? Also, can this problem be treated as multilabel classification?
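One way to look at the multilabel question is to collapse the rows so that each ID appears once and carries the set of targets it accessed. The sketch below is only an illustration on the sample data from the question; the choice of grouping columns and the use of scikit-learn's MultiLabelBinarizer are assumptions, not something the original post prescribes.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.DataFrame({
    "ID": [12567, 12567, 12567, 12568, 12568],
    "UnCode": ["LLLLLLL", "LLLLLLL", "LLLLLLL", "KKKKKK", "KKKKKK"],
    "CoCode": [1000, 1000, 1000, 1111, 1111],
    "CatCode": [1, 1, 1, 2, 2],
    "RoCode": ["KK", "KK", "KK", "MM", "MM"],
    "Target": [12, 4, 6, 1, 6],
})

# Assumption: every non-target column is constant per ID, as in the sample
feature_cols = ["ID", "UnCode", "CoCode", "CatCode", "RoCode"]

# One row per user, with the accessed targets gathered into a list
grouped = df.groupby(feature_cols)["Target"].agg(list).reset_index()

# One indicator column per distinct target (a multilabel target matrix)
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(grouped["Target"])
```

With this layout, `grouped[feature_cols]` holds one feature row per user and `Y` holds one indicator column per target. If the number of distinct targets is large, `MultiLabelBinarizer(sparse_output=True)` returns a sparse matrix instead of a dense one.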
Below is the input data for model


sklearn2pmml not giving the same prediction output as sklearn pickle after loading it with pypmml

When testing a PMML object previously converted from a pickle file (dumped from a fitted sklearn object), I am unable to reproduce the same results as with the pickled model. In sklearn I obtain [0 1 0] as the classes for the input given in X. However, in PMML I would approximate the probabilities to [1 1 1]. Is there anything I am doing wrong? This model behaviour is not what I would expect when converting the pickle file.
PMML models identify input columns by name, not by position. Therefore, one should not use "anonymized" data stores (such as numpy.ndarray) because, by definition, it will be impossible to identify input columns correctly.
Here, the order of input columns is problematic, because the RF model is trained using a small set of randomly generated data (a 3 x 4 numpy.array). It is likely that the RF model uses only one or two input columns, and ignores the rest (say, "x1" and "x3" are significant, and "x2" and "x4" are not). During conversion the SkLearn2PMML package removes all redundant features. In the current example, the RF model would expect a two-column input data store, where the first column corresponds to "x1" and the second column to "x3"; you're passing a four-column input data store instead, where the second column is occupied by "x2".
TLDR: When working with PMML models, do the following:
Don't use anonymized data stores! Train your model using a pandas.DataFrame (instead of numpy.ndarray), and also make predictions using the same data store class. This way the column mapping will always come out correct, even if the SkLearn2PMML package decided to eliminate some redundant columns.
Use the JPMML-Evaluator-Python package instead of PyPMML. Specifically, stay away from PyPMML's predict(X: numpy.ndarray) method!

Convolutional neural network (CNN) on decimal values

I have many CSV files, each containing approximately 1000 rows and 2 columns, where the data looks like this:
21260.35679 0.008732499
21282.111 0.008729349
21303.86521 0.008721652
21325.61943 0.008708224
These two columns are the features, and the output is a device name. Each CSV file holds data from one specific device recorded at different times, and there are many devices. What I am trying to do is train on this data and then classify the device name using a CNN. If any incoming data falls outside the trained observations, it should be classified as an anomaly.
I am trying to convert those values to an image matrix so that I can train a CNN on this data. What concerns me is that the second column contains float values that are less than 1 and close to zero. If I convert them to integers they become zero, and if all the values become zero the data no longer carries any information.
How can I solve this? And is it even possible to use a CNN on these datasets?
From your description, your problem seems to be a sequence classification problem.
You have many temporal sequences. Each sequence has the same number of 2D elements and is associated with a device. Given a sequence as input, you want to predict the corresponding device.
These kinds of temporal dependencies are better captured by RNNs. I would suggest looking at LSTMs.
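As a rough sketch of how the data could be prepared for a sequence model, the values can stay as floats: there is no need to cast them to integers. The example below uses synthetic numbers shaped like the rows in the question (the file count, value ranges, and variable names are assumptions) and standardizes each column so that both features end up on a comparable scale.

```python
import numpy as np

rng = np.random.default_rng(0)
n_files, n_rows = 6, 1000  # assumed: 6 CSV files of 1000 rows each

# Synthetic stand-in for the loaded CSV files: one (1000, 2) array per file
sequences = np.stack([
    np.column_stack([
        np.linspace(21000.0, 43000.0, n_rows),  # first column, large values
        rng.uniform(0.008, 0.009, n_rows),      # second column, close to zero
    ])
    for _ in range(n_files)
])  # shape: (n_files, n_rows, 2) = (samples, timesteps, features)

# Standardize each feature over all samples and timesteps; values stay floats
mean = sequences.mean(axis=(0, 1), keepdims=True)
std = sequences.std(axis=(0, 1), keepdims=True)
scaled = (sequences - mean) / std
```

The resulting (samples, timesteps, features) layout is the shape that recurrent layers in common deep learning libraries typically expect. In practice the mean and standard deviation would be computed on the training files only and reused for new data.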

Encoding large number of categorical features [duplicate]

I have a question regarding random forests. Imagine that I have data on users interacting with items. The number of items is large, around 10 000. My output of the random forest should be the items that the user is likely to interact with (like a recommender system). For any user, I want to use a feature that describes the items that the user has interacted with in the past. However, mapping the categorical product feature as a one-hot encoding seems very memory inefficient as a user interacts with no more than a couple of hundred of the items at most, and sometimes as little as 5.
How would you go about constructing a random forest when one of the input features is a categorical variable with ~10 000 possible values and the output is a categorical variable with ~10 000 possible values? Should I use CatBoost with the features as categorical? Or should I use one-hot encoding, and if so, do you think XGBoost or CatBoost does better?
You could also try entity embeddings to reduce hundreds of boolean features into vectors of small dimension.
It is similar to word embeddings, applied to categorical features. In practical terms, you define an embedding of your discrete feature space into a low-dimensional vector space. It can improve your results and save memory. The downside is that you need to train a neural network model beforehand to define the embedding.
Check this article for more information.
XGBoost doesn't support categorical features directly; you need to preprocess them before using it with categorical features. For example, you could use one-hot encoding. One-hot encoding usually works well if your categorical feature has some frequent values.
CatBoost does support categorical features, both through one-hot encoding and through the calculation of different statistics on categorical features. To use one-hot encoding you need to enable it with the one_hot_max_size parameter; by default, statistics are calculated. Statistics usually work better for categorical features with many values.
Assuming you have enough domain expertise, you could create a new categorical column from the existing column.
For example, if your column has the values below:
A,B,C,D,E,F,G,H
and you know that A,B,C are similar, D,E,F are similar, and G,H are similar, your new column would be:
Z,Z,Z,Y,Y,Y,X,X
In your random forest model you should remove the previous column and include only this new column. Be aware that transforming your features like this costs you some explainability of your model.
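The grouping above can be written as a plain lookup table. This is a small sketch using pandas; the column name "category" and the dictionary name are made up for the example, while the letters and groups are the ones from the example above.

```python
import pandas as pd

# The original high-cardinality column (name is hypothetical)
s = pd.Series(list("ABCDEFGH"), name="category")

# Domain knowledge expressed as a mapping from each value to its group
group_of = {
    "A": "Z", "B": "Z", "C": "Z",
    "D": "Y", "E": "Y", "F": "Y",
    "G": "X", "H": "X",
}

# Values missing from the mapping would become NaN, which makes gaps visible
grouped = s.map(group_of)
```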

Getting 'ValueError: shapes not aligned' on SciKit Linear Regression

Quite new to SciKit and linear algebra/machine learning with Python in general, so I can't seem to solve the following:
I have a training set and a test set of data, containing both continuous and discrete/categorical values. The CSV files are loaded into Pandas DataFrames and match in shape, being (1460,81) and (1459,81).
However, after using Pandas' get_dummies, the shapes of the DataFrames change to (1460, 306) and (1459, 294). So, when I do linear regression with the SciKit Linear Regression module, it builds a model for 306 variables and it tries to predict one with only 294 with it. This then, naturally, leads to the following error:
ValueError: shapes (1459,294) and (306,1) not aligned: 294 (dim 1) != 306 (dim 0)
How could I tackle such a problem? Could I somehow reshape the (1459, 294) to match the other one?
Thanks and I hope I've made myself clear :)
This is an extremely common problem when dealing with categorical data. There are differing opinions on how to best handle this.
One possible approach is to apply a function to categorical features that limits the set of possible options. For example, if your feature contained the letters of the alphabet, you could encode features for A, B, C, D, and 'Other/Unknown'. In this way, you could apply the same function at test time and abstract from the issue. A clear downside, of course, is that by reducing the feature space you may lose meaningful information.
Another approach is to build a model on your training data, with whichever dummies are naturally created, and treat that as the baseline for your model. When you predict with the model at test time, you transform your test data in the same way your training data is transformed. For example, if your training set had the letters of the alphabet in a feature, and the same feature in the test set contained a value of 'AA', you would ignore that in making a prediction. This is the reverse of your current situation, but the premise is the same. You need to create the missing features on the fly. This approach also has downsides, of course.
The second approach is what you mention in your question, so I'll go through it with pandas.
By using get_dummies you're encoding the categorical features into multiple one-hot encoded features. What you could do is force your test data to match your training data by using reindex, like this:
test_encoded = pd.get_dummies(test_data, columns=['your columns'])
test_encoded_for_model = test_encoded.reindex(columns=training_encoded.columns, fill_value=0)
This will encode the test data in the same way as your training data, filling in 0 for dummy features that weren't created by encoding the test data but were created during the training process. Dummy features that appear only in the test data are dropped.
You could just wrap this into a function, and apply it to your test data on the fly. You don't need the encoded training data in memory (which I access with training_encoded.columns) if you create an array or list of the column names.
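A minimal sketch of such a wrapper is shown below. The function name, the toy frames, and the column names are all made up for illustration; only get_dummies and reindex come from the answer itself.

```python
import pandas as pd

def encode_like_training(df, categorical_columns, training_columns):
    """One-hot encode df and align its columns to the training layout."""
    encoded = pd.get_dummies(df, columns=categorical_columns)
    # Missing dummies are filled with 0; dummies unseen in training are dropped
    return encoded.reindex(columns=training_columns, fill_value=0)

# Toy data: the test set lacks "B" and "C" and contains an unseen "D"
train = pd.DataFrame({"size": [1, 2, 3], "letter": ["A", "B", "C"]})
test = pd.DataFrame({"size": [4, 5], "letter": ["A", "D"]})

train_encoded = pd.get_dummies(train, columns=["letter"])
training_columns = list(train_encoded.columns)  # only this list needs keeping

test_encoded = encode_like_training(test, ["letter"], training_columns)
```

Storing `training_columns` alongside the fitted model is enough to encode new data later without keeping the training frame in memory.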
For anyone interested: I ended up merging the train and test set, then generating the dummies, and then splitting the data again at exactly the same fraction. That way there wasn't any issue with different shapes anymore, as it generated exactly the same dummy data.
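The merge-then-split idea can be sketched as follows. The toy frames and column names are invented for the example, and using the keys argument of pd.concat to recover the two parts is one possible choice, not necessarily how the poster did it.

```python
import pandas as pd

# Toy data: the test set contains a category ("D") absent from training
train = pd.DataFrame({"size": [1, 2, 3], "letter": ["A", "B", "C"]})
test = pd.DataFrame({"size": [4, 5], "letter": ["A", "D"]})

# Label each part so the frames can be separated again after encoding
combined = pd.concat([train, test], keys=["train", "test"])
combined_encoded = pd.get_dummies(combined, columns=["letter"])

train_encoded = combined_encoded.loc["train"]
test_encoded = combined_encoded.loc["test"]
```

Both parts now share the same columns. Note that with this approach categories seen only in the test data also become columns of the training data.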
This works for me:
Initially, I was getting this error message:
shapes (15754,3) and (4, ) not aligned
I found out that I was building the model with 3 variables in my training data. But when I add a constant with X_train = sm.add_constant(X_train), a constant column is created automatically, so in total there are now 4 variables.
When you test this model, the test data still has only 3 variables by default, so the error pops up because of the dimension mismatch.
So I used the trick of adding the same constant column to X_test as well:
`X_test = sm.add_constant(X_test)`
Though this variable looks useless, it is the intercept column the model expects, and adding it resolves the issue.

Mixed parameter types for machine learning

I wish to fit a logistic regression model with a set of parameters. The parameters that I have include three distinct types of data:
Binary data [0,1]
Categorical data which has been encoded to integers [0,1,2,3,...]
Continuous data
I have two questions regarding pre-processing the parameter data before fitting a regression model:
For the categorical data, I've seen two ways to handle this. The first method is to use a one hot encoder, thus giving a new parameter for each category. The second method, is to just encode the categories with integers within a single parameter variable [0,1,2,3,4,...]. I understand that using a one hot encoder creates more parameters and therefore increases the risk of over-fitting the model; however, other than that, are there any reasons to prefer one method over the other?
I would like to normalize the parameter data to account for the large differences between the continuous and binary data. Is it generally acceptable to normalize the binary and categorical data? Should I normalize the categorical and the continuous parameters but not the binary parameters, or can I just normalize all the parameter data types?
I realize I could fit this data with a random forest model and not have to worry much about pre-processing, but I'm curious how this applies with a regression type model.
Thank you in advance for your time and consideration.
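One common arrangement for this kind of mixed data, offered here only as a sketch and not as the single correct answer, is to one-hot encode the integer-coded categorical column (so the model does not treat the codes as ordered magnitudes), standardize the continuous column, and pass the binary column through unchanged. The column names and values below are invented for the example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with one column of each type (all names are hypothetical)
X = pd.DataFrame({
    "is_member": [0, 1, 0, 1, 1, 0],                  # binary
    "region": [0, 1, 2, 0, 1, 2],                     # integer-coded category
    "income": [31000.0, 52000.0, 47000.0,
               88000.0, 61000.0, 29000.0],            # continuous
})
y = [0, 1, 0, 1, 1, 0]

preprocess = ColumnTransformer(
    [
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["region"]),
        ("scale", StandardScaler(), ["income"]),
    ],
    remainder="passthrough",  # the binary column is left as it is
)

model = make_pipeline(preprocess, LogisticRegression())
model.fit(X, y)

# 3 one-hot columns + 1 scaled column + 1 passthrough column
n_features = model[:-1].transform(X).shape[1]
```

Keeping the preprocessing inside the pipeline means the same transformations are applied automatically whenever the model is asked to predict on new data.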
