One Hot Encoding (mapping values issue) - python

I am working on a multilabel text classification problem.
I am one-hot encoding my training and testing labels. First I created a list containing all 8921 unique labels, and then I do the one-hot encoding with the help of that list, as follows:
Note for the following code:
b is my list of 8921 labels,
and df['LABELS'] looks as follows:
b=['865.09','482.1','860.4','31.29', ......, '76.74', '76.92', '79.32']
LABELS
[532.40,493.20,V45.81,412,401.9,44.43]
[211.3,427.31,578.9,560.1,496,584.9,428.0,276.5]
[440.22, 492.8, 401.9, 714.0, 39.29, 88.48]
my code:
for label in b:
    df[label] = np.where(df['LABELS'] == label, 1, 0)
df[['LABELS'] + b].head()
Output that I get:
  LABELS                                         038.9  785.59  584.9  427.5  410.71  ...  428.0  682.6  425.4
0 [038.9, 493.20, V45.81, 682.6, 401.9, 44.43]       0       0      0      0       0  ...      0      0      0
1 [472.5, 428.0, 578.9, 560.1, 496, 584.9]           0       0      0      0       0  ...      0      0      0
Desired output
  LABELS                                         038.9  785.59  584.9  427.5  410.71  ...  428.0  682.6  425.4
0 [038.9, 493.20, V45.81, 682.6, 401.9, 44.43]       1       0      0      0       0  ...      0      1      0
1 [472.5, 428.0, 578.9, 560.1, 496, 584.9]           0       0      1      1       0  ...      1      0      0
Kindly help me find where I am making a mistake while iterating over my df['LABELS'] values.

It seems like df['LABELS'] is a Series that contains a list in each row, so np.where compares that whole list with a single label. For example, it checks whether [532.40,493.20,V45.81,412,401.9,44.43] == '865.09', which obviously returns False.
Apart from that, the labels in b are strings whereas the labels in df['LABELS'] don't seem to be. This also results in False, since '865.09' == 865.09 is not True.
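Because each row holds a list, you need a per-row membership test instead of an elementwise equality check. Here is a minimal sketch of two possible fixes, assuming every label is compared as a string (the toy b and df below just stand in for the ones from the question):
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy stand-ins for the question's b (8921 labels) and df
b = ['038.9', '493.20', 'V45.81', '682.6', '401.9', '44.43',
     '584.9', '428.0', '472.5', '578.9', '560.1', '496']
df = pd.DataFrame({'LABELS': [['038.9', '493.20', 'V45.81', '682.6', '401.9', '44.43'],
                              ['472.5', '428.0', '578.9', '560.1', '496', '584.9']]})

# Fix 1: convert every label to a string and test membership per row
label_sets = df['LABELS'].apply(lambda row: {str(x) for x in row})
for label in b:
    df[label] = label_sets.apply(lambda s: 1 if label in s else 0)

# Fix 2: build the whole indicator matrix in one shot with scikit-learn
mlb = MultiLabelBinarizer(classes=b)
onehot = pd.DataFrame(mlb.fit_transform(label_sets), columns=mlb.classes_, index=df.index)

print(df[['LABELS'] + b].head())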

Related

Should the background dataset for shap be standardized?

So I am trying to explain a basic SVM model using SHAP. However, the inputs to the SVM model are standardized (I used StandardScaler().fit() and then transformed the data points with it so that they can be used by the SVM model).
My question: when using SHAP, I need to give it a background distribution. Usually the input for this background distribution looks like this:
background_distribution = KMeans(n_clusters=10,random_state=0).fit(xtrain).cluster_centers_
However, I want to use my own custom background distribution, which contains selected data points. Does this mean those data points need to be standardized as well, i.e. instead of looking like
[ 1 0 1 31 24 4817 2 3 1 1 1 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 1 1]
they look like this
[ 0.67028006 -0.18887347 0.90860212 -0.41342579 0.26204266 0.55080012
-0.85479154 0.13743146 -0.70749448 -0.42919754 1.21628074 -0.71418983
-0.26726124 -0.52247913 -0.34755864 0.31234752 -0.23208655 -0.63565412
-0.40904178 0. 4.89897949 -0.23473314 0.64082627 -0.46852129
-0.26726124 -0.44542354 1.15657353 0.53795751]
For clarity: I am asking whether, after retrieving my points, I need to standardize the background data set, since my original data points are scaled for use in the model while my background distribution contains unscaled data points.
The model training looks like this:
ss = StandardScaler().fit(X)
xtrain = ss.transform(xtrain) #Changes values to make them ML compatible -not needed for trees
xtest = ss.transform(xtest)
support_vector_classifier = SVC(kernel='rbf')
support_vector_classifier.fit(xtrain,ytrain)
y_pred_svc = support_vector_classifier.predict(xtest)
Option A:
background_distribution = np.array([[1, 0, 1, 31, 24, 4817, 2, 3, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1]])  # raw, unscaled points
shap.KernelExplainer(support_vector_classifier.predict, background_distribution)
Option B:
background_distribution = np.array([[1, 0, 1, 31, 24, 4817, 2, 3, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1]])
ss = StandardScaler().fit(background_distribution)
background_distribution = ss.transform(background_distribution)
shap.KernelExplainer(support_vector_classifier.predict, background_distribution)
Option B is close. Your background should be preprocessed in the same way as your training data.
This is the case in any situation in ML where you preprocess data: whether you split your data into train, test, and validation sets, or feed data to a trained model for prediction, you always apply the same transformations to all parts of your data, sometimes manually, sometimes through a pipeline. SHAP is not an exception to this principle.
However, keep one more thing in mind: your scaler should be fitted on the training data before being applied to test or background data. You can't fit it on test, validation, or background data, because that would amount to looking at the data you are supposed to predict before making the prediction ("data leakage", as it is called in ML).
This means, you can't:
ss = StandardScaler().fit(background_distribution)
background_distribution = ss.transform(background_distribution)
Rather:
ss = StandardScaler().fit(X_train)
background_distribution = ss.transform(background_distribution)
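Putting the answer's recommendation together, here is a minimal end-to-end sketch. The toy data, the background_points selection, and the variable names are illustrative assumptions, not the asker's actual data:
import shap
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy data standing in for the question's X, y
X, y = make_classification(n_samples=200, n_features=28, random_state=0)
xtrain, xtest, ytrain, ytest = train_test_split(X, y, random_state=0)

# Fit the scaler on the training data only, then reuse it everywhere
ss = StandardScaler().fit(xtrain)
xtrain_scaled = ss.transform(xtrain)
xtest_scaled = ss.transform(xtest)

support_vector_classifier = SVC(kernel='rbf').fit(xtrain_scaled, ytrain)

# Hypothetical hand-picked raw (unscaled) background rows, scaled with the same fitted scaler
background_points = xtrain[:10]
background_distribution = ss.transform(background_points)

explainer = shap.KernelExplainer(support_vector_classifier.predict, background_distribution)
shap_values = explainer.shap_values(xtest_scaled[:5])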

DL: when a prediction SHOULD be considered false in a multiclass multilabel problem

I am currently working on getting the metrics (classification report, confusion matrix) for a DL model and I have stumbled across a problem.
My y_true is something like [1 0 0 0 1 0 0 1 0 0] (multilabel), where the ones indicate the correct labels (e.g. RED, BLUE, GREEN).
My model outputs with sigmoid activation, so every score lies between 0 and 1. So far so good.
Now, the way I proceed is by getting the maximum score of the prediction with torch.max(outputs), then its index/position in the array, and then one-hot encoding it so that it resembles y_true.
My question is: if y_true has 2 or more ones (labels), then even if I manage to predict one of them correctly, my one-hot encoding will always be counted as false (because [1 0 0 0 1 0 0 1 0 0] != [1 0 0 0 0 0 0 0 0 0]). I could take more than one score, but then again I don't know how many labels y_true has each time.
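For concreteness, here is a small sketch of the mismatch described above, contrasted with thresholding each sigmoid score independently (a common convention for multilabel outputs; the tensors below are made-up examples):
import torch

y_true = torch.tensor([1, 0, 0, 0, 1, 0, 0, 1, 0, 0])
outputs = torch.tensor([0.9, 0.2, 0.1, 0.3, 0.7, 0.1, 0.2, 0.8, 0.1, 0.4])  # sigmoid scores

# Keeping only the single best score collapses the prediction to one label...
pred_argmax = torch.zeros_like(y_true)
pred_argmax[torch.argmax(outputs)] = 1           # [1 0 0 0 0 0 0 0 0 0]
print(torch.equal(pred_argmax, y_true))          # False, even though label 0 was right

# ...whereas a per-label threshold keeps every label whose score is high enough
pred_thresh = (outputs >= 0.5).long()            # [1 0 0 0 1 0 0 1 0 0]
print(torch.equal(pred_thresh, y_true))          # True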
What is the right way to proceed with this?

AttributeError: 'Series' object has no attribute 'to_coo'

I am trying to use a Naive Bayes classifier from the sklearn module to classify whether movie reviews are positive. I am using a bag of words as the features for each review and a large dataset with sentiment scores attached to reviews.
df_bows = pd.DataFrame.from_records(bag_of_words)
df_bows = df_bows.fillna(0).astype(int)
This code creates a pandas dataframe which looks like this:
The Rock is destined to ... Staggeringly ’ ve muttering dissing
0 1 1 1 1 2 ... 0 0 0 0 0
1 2 0 1 0 0 ... 0 0 0 0 0
2 0 0 0 0 0 ... 0 0 0 0 0
3 0 0 1 0 4 ... 0 0 0 0 0
4 0 0 0 0 0 ... 0 0 0 0 0
I then try to fit this data frame with the sentiment of each review using this code:
nb = MultinomialNB()
nb = nb.fit(df_bows, movies.sentiment > 0)
However I get an error which says
AttributeError: 'Series' object has no attribute 'to_coo'
This is what the df movies looks like.
sentiment text
id
1 2.266667 The Rock is destined to be the 21st Century's ...
2 3.533333 The gorgeously elaborate continuation of ''The...
3 -0.600000 Effective but too tepid biopic
4 1.466667 If you sometimes like to go to the movies to h...
5 1.733333 Emerges as something rare, an issue movie that...
Can you help with this?
When you try to fit your MultinomialNB model, sklearn's routine checks whether the input df_bows is sparse or not. If it is, just like in our case, the dataframe needs to be converted to the 'Sparse' dtype. Here is the way I fixed it:
df_bows = pd.DataFrame.from_records(bag_of_words)
# Keep NaN values and convert to Sparse type
sparse_bows = df_bows.astype('Sparse')
nb = nb.fit(sparse_bows, movies['sentiment'] > 0)
Link to Pandas doc : pandas.Series.sparse.to_coo

pythonic way of making dummy column from sum of two values

I have a dataframe with one column called label which has the values [0,1,2,3,4,5,6,8,9].
I would like to make dummy columns out of this, but I would like some labels to be joined together, so for example I want dummy_012 to be 1 if the observation has either label 0, 1 or 2.
If I use the command df2 = pd.get_dummies(df, columns=['label']), it will create 9 columns, one for each label.
I know I can use df2['dummy_012'] = df2['dummy_0'] + df2['dummy_1'] + df2['dummy_2'] afterwards to turn them into one joint column, but I want to know if there's a more pythonic way of doing it (or some function where I can just change the parameters of the joins).
Maybe this approach can give an idea:
groups = ['012', '345', '6789']
for gp in groups:
    df.loc[df['Label'].isin([int(x) for x in gp]), 'Label_Group'] = f'dummies_{gp}'
Output:
Label Label_Group
0 0 dummies_012
1 1 dummies_012
2 2 dummies_012
3 3 dummies_345
4 4 dummies_345
5 5 dummies_345
6 6 dummies_6789
7 8 dummies_6789
8 9 dummies_6789
And then apply get_dummies:
df_dummies = pd.get_dummies(df['Label_Group'])
dummies_012 dummies_345 dummies_6789
0 1 0 0
1 1 0 0
2 1 0 0
3 0 1 0
4 0 1 0
5 0 1 0
6 0 0 1
7 0 0 1
8 0 0 1
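For reference, here is a self-contained consolidation of the two steps above, which also joins the group dummies back onto the original frame (the groups list is just the example grouping from this answer):
import pandas as pd

df = pd.DataFrame({'Label': [0, 1, 2, 3, 4, 5, 6, 8, 9]})

# Map each label to its group name, then one-hot encode the group column
groups = ['012', '345', '6789']
for gp in groups:
    df.loc[df['Label'].isin([int(x) for x in gp]), 'Label_Group'] = f'dummies_{gp}'

df = df.join(pd.get_dummies(df['Label_Group']))
print(df.head())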
I don't know that this is pythonic, because a more elegant solution might exist, but it does allow you to change parameters and it's vectorized. I've read that get_dummies() can be a bit slow with large amounts of data, and vectorizing pandas code is good practice in general. So I vectorized this function and had it do its calculations with numpy arrays. Compared to similar functions, it should give you a boost in performance as the dataset increases in size.
This function will take your dataframe and a list of numbers as strings and will return your dataframe with the column you wanted.
def get_dummy(df, column_nos):
    new_col_name = 'dummy_' + ''.join([i for i in column_nos])
    vector_sum = sum([df[i].values for i in column_nos])
    df[new_col_name] = [1 if i > 0 else 0 for i in vector_sum]
    return df
In case you'd prefer the input to be integers rather than strings, you can tweak the above function as shown below.
def get_dummy(df, column_nos):
    column_names = ['dummy_' + str(i) for i in column_nos]
    new_col_name = 'dummy_' + ''.join([str(i) for i in sorted(column_nos)])
    vector_sum = sum([df[col].values for col in column_names])
    df[new_col_name] = [1 if i > 0 else 0 for i in vector_sum]
    return df
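A hypothetical usage example for the integer version; it assumes the per-label columns were created with a 'dummy' prefix, matching the column names used in the question:
import pandas as pd

df = pd.DataFrame({'label': [0, 1, 2, 3, 4, 5, 6, 8, 9]})
df = df.join(pd.get_dummies(df['label'], prefix='dummy'))  # dummy_0 ... dummy_9

df = get_dummy(df, [0, 1, 2])  # adds a combined dummy_012 column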

Alternative to scikit learn labelBinarizer in R

I'm doing machine learning for time series prediction and I need to transform dates to vectors of zeros and ones.
If I decide that the relevant information of the date is the day of the week on which the observation was made, I'd like to have a time series of vectors of length 7 that contain only one "1", placed in the first slot if it's a Monday, the second if it's a Tuesday, etc.
I'd like, for example, an input like "2015-12-22 22:48:00" to be transformed into
0 1 0 0 0 0 0
if the relevant information is that it's a Tuesday, or
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
if the relevant information is that it's 10 p.m.
The LabelBinarizer() from sklearn.preprocessing does that nicely in Python, and I've looked for the equivalent in R but haven't found it. Do any of you guys happen to know what I'm looking for?
Here is the LabelBinarizer(): http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelBinarizer.html
Right now I'm doing this in Python, where Hour is a time series of the exact hours at which my observations were made:
import sklearn.preprocessing as pp
lbday = pp.LabelBinarizer()
lbday.fit(list(range(24)))
pp.LabelBinarizer(neg_label=0, pos_label=1)
Hour = lbday.transform(Hour)
Then I export a CSV of the binarized dates that I read with R.
Thank you!
Try this:
binarizer <- function(levels){
  f = function(v){
    m = matrix(0, nrow=length(v), ncol=length(levels))
    vf = as.numeric(factor(v, levels=levels))
    m[cbind(1:length(v), vf)] = 1
    colnames(m) = levels
    m
  }
  f
}
Example:
> ab = binarizer(letters[1:5]) # valid values a to e
> ab(c("a","e","a"))
a b c d e
[1,] 1 0 0 0 0
[2,] 0 0 0 0 1
[3,] 1 0 0 0 0
