I am trying to use a Naive Bayes classifier from the sklearn module to classify whether movie reviews are positive. I am using a bag of words as the features for each review and a large dataset with sentiment scores attached to reviews.
df_bows = pd.DataFrame.from_records(bag_of_words)
df_bows = df_bows.fillna(0).astype(int)
This code creates a pandas dataframe which looks like this:
The Rock is destined to ... Staggeringly ’ ve muttering dissing
0 1 1 1 1 2 ... 0 0 0 0 0
1 2 0 1 0 0 ... 0 0 0 0 0
2 0 0 0 0 0 ... 0 0 0 0 0
3 0 0 1 0 4 ... 0 0 0 0 0
4 0 0 0 0 0 ... 0 0 0 0 0
I then try to fit the model on this data frame and the sentiment of each review using this code:
nb = MultinomialNB()
nb = nb.fit(df_bows, movies.sentiment > 0)
However, I get an error which says:
AttributeError: 'Series' object has no attribute 'to_coo'
This is what the movies df looks like:
sentiment text
id
1 2.266667 The Rock is destined to be the 21st Century's ...
2 3.533333 The gorgeously elaborate continuation of ''The...
3 -0.600000 Effective but too tepid biopic
4 1.466667 If you sometimes like to go to the movies to h...
5 1.733333 Emerges as something rare, an issue movie that...
Can you help with this?
When you try to fit your MultinomialNB model, sklearn's routine checks whether the input df_bows is sparse or not. If it is, as in our case, the DataFrame needs to be converted to the 'Sparse' dtype. Here is how I fixed it:
df_bows = pd.DataFrame.from_records(bag_of_words)
# Keep NaN values and convert to Sparse type
sparse_bows = df_bows.astype('Sparse')
nb = nb.fit(sparse_bows, movies['sentiment'] > 0)
Link to the pandas docs: pandas.Series.sparse.to_coo
So I am trying to explain a basic SVM model using SHAP. However, the inputs to the SVM model are standardized (I used StandardScaler().fit() and then transformed the data points with the fitted scaler so that they can be used by the SVM model).
My question is now when using SHAP I need to give it a background distribution. Usually the input to this background distribution looks like this:
background_distribution = KMeans(n_clusters=10,random_state=0).fit(xtrain).cluster_centers_
However, I wanted to use my own custom background distribution, which contains selected data points. Does this mean the data points need to be standardized as well? I.e., instead of looking like
[ 1 0 1 31 24 4817 2 3 1 1 1 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 1 1]
they look like this
[ 0.67028006 -0.18887347 0.90860212 -0.41342579 0.26204266 0.55080012
-0.85479154 0.13743146 -0.70749448 -0.42919754 1.21628074 -0.71418983
-0.26726124 -0.52247913 -0.34755864 0.31234752 -0.23208655 -0.63565412
-0.40904178 0. 4.89897949 -0.23473314 0.64082627 -0.46852129
-0.26726124 -0.44542354 1.15657353 0.53795751]
For clarity: I am asking whether, after retrieving my points, I need to standardize the background data set, since my original data points are scaled for use in the model, whereas my background distribution contains unscaled data points.
The model training looks like this:
ss = StandardScaler().fit(X)
xtrain = ss.transform(xtrain) #Changes values to make them ML compatible -not needed for trees
xtest = ss.transform(xtest)
support_vector_classifier = SVC(kernel='rbf')
support_vector_classifier.fit(xtrain,ytrain)
y_pred_svc = support_vector_classifier.predict(xtest)
Option A:
background_distribution= [ 1 0 1 31 24 4817 2 3 1 1 1 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 1 1]
shap.KernelExplainer(support_vector_classifier.predict,background_distribution)
Option B:
background_distribution= [ 1 0 1 31 24 4817 2 3 1 1 1 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 1 1]
ss = StandardScaler().fit(background_distribution)
background_distribution = ss.transform(background_distribution)
shap.KernelExplainer(support_vector_classifier.predict,background_distribution)
Option B is close: your background should be preprocessed in the same way as your training data.
This is the case in any ML situation where you preprocess data -- whether you split your data into train, test, and validation sets, or feed new data to a trained model for prediction -- you always apply the same transformations to every part of the data, sometimes manually, sometimes through a pipeline. SHAP is no exception to this principle.
However, you may think about the following as well: your scaler should be fitted on the training data before being applied to test or background data. You can't fit it on test, validation, or background data, because that would be like asking to see the future before predicting it ("data leakage", as it is called in ML).
This means, you can't:
ss = StandardScaler().fit(background_distribution)
background_distribution = ss.transform(background_distribution)
Rather:
ss = StandardScaler().fit(X_train)
background_distribution = ss.transform(background_distribution)
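Putting it together, here is a minimal sketch of the whole flow, assuming the xtrain, xtest, ytrain and background_distribution names from the question (the reshape is only there because the background in the question is a single row):

import numpy as np
import shap
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Fit the scaler on the training data only, then reuse it everywhere
ss = StandardScaler().fit(xtrain)
xtrain_scaled = ss.transform(xtrain)
xtest_scaled = ss.transform(xtest)

support_vector_classifier = SVC(kernel='rbf')
support_vector_classifier.fit(xtrain_scaled, ytrain)

# The background must pass through the same fitted scaler as the training data
background = ss.transform(np.asarray(background_distribution).reshape(1, -1))

explainer = shap.KernelExplainer(support_vector_classifier.predict, background)
shap_values = explainer.shap_values(xtest_scaled)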
I am working on a multilabel text classification problem.
I am doing one-hot encoding for my training and testing labels: first I created a list containing all 8921 unique labels, and then I do the one-hot encoding with the help of that list, as follows:
Note, for the following code:
b is my list of 8921 labels,
and df['LABELS'] looks like this:
b=['865.09','482.1','860.4','31.29', ......, '76.74', '76.92', '79.32']
LABELS
[532.40,493.20,V45.81,412,401.9,44.43]
[211.3,427.31,578.9,560.1,496,584.9,428.0,276.5]
[440.22, 492.8, 401.9, 714.0, 39.29, 88.48]
My code:
for label in b:
    df[label] = np.where(df['LABELS'] == label, 1, 0)
df[['LABELS']+b].head()
Output that I get:
LABELS 038.9 785.59 584.9 427.5 410.71 ..... 428.0 682.6 425.4
0 [038.9, 0 0 0 0 0 0 0 0
493.20,
V45.81,
682.6,
401.9,
44.43]
1 [472.5, 0 0 0 0 0 ..... 0 0 0
428.0,
578.9,
560.1,
496,
584.9]
Desired output
LABELS 038.9 785.59 584.9 427.5 410.71 ..... 428.0 682.6 425.4
0 [038.9, 1 0 0 0 0 0 1 0
493.20,
V45.81,
682.6,
401.9,
44.43]
1 [472.5, 0 0 1 1 0 ..... 1 0 0
428.0,
578.9,
560.1,
496,
584.9]
Kindly help me find where I am making a mistake while iterating over my df['LABELS'] values.
It seems like df['LABELS'] is a Series that contains a list in each row, so np.where compares that whole list with a single label. For example, it checks whether [532.40, 493.20, V45.81, 412, 401.9, 44.43] == '865.09', which obviously returns False.
Apart from that, the labels in b are strings whereas the labels in df['LABELS'] don't seem to be. This will also result in False, since '865.09' and 865.09 are not equal.
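A minimal sketch of one way to address both points, assuming each row of df['LABELS'] is a list and b holds the label strings -- it tests membership in the list instead of equality, and compares everything as strings:

for label in b:
    df[label] = df['LABELS'].apply(
        lambda labels: int(label in {str(x) for x in labels})
    )
df[['LABELS'] + b].head()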
I am trying to write Python code to expand a matrix in the way shown below:
Given Matrix:
1 2
3 4
Now I want to convert it to the following:
1 0 0 2 0 0
0 0 0 0 0 0
0 0 0 0 0 0
3 0 0 4 0 0
0 0 0 0 0 0
0 0 0 0 0 0
I want to do the same for a matrix of dimensions 60x80. I tried numpy.insert(), but for a larger matrix I cannot apply the same approach (it requires too much hardcoding), so I need some suggestions for doing this kind of interpolation.
You can use the step part of a slice to achieve this, if you preallocate the result yourself:
repeat = 3
result = np.zeros((arr.shape[0]*repeat, arr.shape[1]*repeat))
result[::repeat,::repeat] = arr
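For example, with the 2x2 matrix from the question (a quick check, assuming arr is a NumPy array):

import numpy as np

arr = np.array([[1, 2],
                [3, 4]])
repeat = 3
# dtype=arr.dtype keeps the result integer instead of the default float
result = np.zeros((arr.shape[0] * repeat, arr.shape[1] * repeat), dtype=arr.dtype)
result[::repeat, ::repeat] = arr
print(result)
# [[1 0 0 2 0 0]
#  [0 0 0 0 0 0]
#  [0 0 0 0 0 0]
#  [3 0 0 4 0 0]
#  [0 0 0 0 0 0]
#  [0 0 0 0 0 0]]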
I have the list of comments in the following format:
Comments = [['hello', 'world'], ['would', 'hard', 'press'], ['find', 'place', 'less']]
wordset={'hello','world','hard','would','press','find','place','less'}
I wish to have a table or dataframe which has wordset as the index and the individual word counts for each comment in Comments.
I worked out the following code, which produces the required dataframe, but it takes a long time and I am looking for a more efficient implementation. Since the corpus is large, this has a huge impact on the efficiency of our ranking algorithm.
result = pd.DataFrame()
for comment in Comments:
    worddict_terms = dict.fromkeys(wordset, 0)
    for items in comment:
        worddict_terms[items] += 1
    df_comment = pd.DataFrame.from_dict([worddict_terms])
    frames = [result, df_comment]
    result = pd.concat(frames)
Comments_raw_terms = result.transpose()
The result we expect is:
0 1 2
hello 1 0 0
world 1 0 0
would 0 1 0
press 0 1 0
find 0 0 1
place 0 0 1
less 0 0 1
hard 0 1 0
I think your nested for loop is what increases the complexity. The code below replaces the two for loops with a single map call. It only goes as far as producing, for each comment, the count dictionary for 'hello' and 'world'; please add the remaining pandas code that builds the table.
from collections import Counter
from funcy import project

def fun(comment):
    wordset = {'hello', 'world'}
    temp_dict_comment = Counter(comment)
    temp_dict_comment = dict(temp_dict_comment)
    final_dict = project(temp_dict_comment, wordset)
    print(final_dict)

Comments = [['hello', 'world'], ['would', 'hard', 'press'], ['find', 'place', 'less', 'excitingit', 'wors', 'watch', 'paint', 'dri']]
list(map(fun, Comments))  # map is lazy in Python 3, so wrap it in list() to run fun on every comment
This should help, as it uses a single map instead of two for loops.
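If it helps, here is one possible way to finish the table-building step -- just a sketch, assuming fun returns its dictionary instead of printing it and that wordset is the full set from the question:

import pandas as pd
from collections import Counter
from funcy import project

wordset = {'hello', 'world', 'would', 'hard', 'press', 'find', 'place', 'less'}

def fun(comment):
    counts = project(dict(Counter(comment)), wordset)
    # Fill in zeros for the words that do not occur in this comment
    return {w: counts.get(w, 0) for w in wordset}

rows = list(map(fun, Comments))
result = pd.DataFrame(rows).transpose()  # wordset as index, one column per comment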
Try this approach:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
text = pd.Series(Comments).str.join(' ')
X = vect.fit_transform(text)
r = pd.DataFrame(X.toarray(), columns=vect.get_feature_names())
Result:
In [49]: r
Out[49]:
find hard hello less place press world would
0 0 0 1 0 0 0 1 0
1 0 1 0 0 0 1 0 1
2 1 0 0 1 1 0 0 0
In [50]: r.T
Out[50]:
0 1 2
find 0 0 1
hard 0 1 0
hello 1 0 0
less 0 0 1
place 0 0 1
press 0 1 0
world 1 0 0
would 0 1 0
Pure Pandas solution:
In [61]: pd.get_dummies(text.str.split(expand=True), prefix_sep='', prefix='')
Out[61]:
find hello would hard place world less press
0 0 1 0 0 0 1 0 0
1 0 0 1 1 0 0 0 1
2 1 0 0 0 1 0 1 0
I'm doing machine learning for time series prediction and I need to transform dates to vectors of zeros and ones.
If I decide that the relevant information of the date is the day of the week on which the observation was made, I'd like to have a time series of vectors of length 7, each containing a single "1" placed in the first slot if it's a Monday, the second if it's a Tuesday, etc.
I'd like, for example, an input like "2015-12-22 22:48:00" to be transformed into
0 1 0 0 0 0 0
if the relevant information is that it's a Tuesday. Or a
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
if the relevant information is that it's 10 p.m. (a length-24 vector with the 1 in the slot for hour 22).
The LabelBinarizer() from sklearn.preprocessing does this nicely in Python, and I've looked for the equivalent in R but haven't found it. Do any of you happen to know what I'm looking for?
Here is the LabelBinarizer(): http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelBinarizer.html
Right now I'm doing this in Python, where Hour is a time series of the exact hours at which my observations were made:
import sklearn.preprocessing as pp
lbday = pp.LabelBinarizer()
lbday.fit(list(range(24)))
pp.LabelBinarizer(neg_label=0, pos_label=1)
Hour = lbday.transform(Hour)
Then I export a CSV of the binarized dates, which I read with R.
Thank you!
Try this:
binarizer <- function(levels){
  f = function(v){
    m = matrix(0, nrow=length(v), ncol=length(levels))
    vf = as.numeric(factor(v, levels=levels))
    m[cbind(1:length(v), vf)] = 1
    colnames(m) = levels
    m
  }
  f
}
Example:
> ab = binarizer(letters[1:5]) # valid values a to e
> ab(c("a","e","a"))
a b c d e
[1,] 1 0 0 0 0
[2,] 0 0 0 0 1
[3,] 1 0 0 0 0