I have a list of words (a lexicon) and I want to use it for a BOW (bag-of-words) classification. In sklearn it is possible to use CountVectorizer and TfidfVectorizer, but both approaches build the vocabulary they use from the training data. In my case, however, I have already built a list of words (a dictionary) that can be used to discriminate between the classes for text classification.
Is there any library or package I can use in Python?
Check out the vocabulary parameter of the CountVectorizer here in the docs.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
twenty_train = fetch_20newsgroups(subset='train', categories=['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med'], shuffle=True, random_state=42)
my_vocabulary = ['aristotle',
'arithmetic',
'arizona',
'arkansas'
]
count_vect = CountVectorizer(vocabulary=my_vocabulary)
X_train_counts = count_vect.fit_transform(twenty_train.data)
df = pd.DataFrame.sparse.from_spmatrix(X_train_counts)
And you will see in the output that only the words from your vocabulary are used:
Out[16]:
0 1 2 3
0 0 0 0 0
1 0 0 0 0
2 0 0 0 0
3 0 0 0 0
4 0 0 0 0
... .. .. .. ..
2252 0 0 0 0
2253 0 0 0 0
2254 0 0 0 0
2255 0 0 0 0
2256 0 0 0 0
[2257 rows x 4 columns]
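If you also want the columns labeled with the vocabulary terms, here is a small follow-up sketch (my addition, not part of the original answer; get_feature_names_out() assumes scikit-learn >= 1.0):
df = pd.DataFrame.sparse.from_spmatrix(
    X_train_counts, columns=count_vect.get_feature_names_out()
)
# column-wise sums show how often each vocabulary word occurs in the training corpus
print(df.sum())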
Of late, there have been many questions around multi-class, multi-label text classification. Please check if this article helps.
I receive an invalid syntax error for this code... Any help would be greatly appreciated.
#show the vectors for each sentence
print(X.toarray())
[[0 0 0 0 0 1 1 1 1 0 0 0 0 1 0 0 0]
[0 0 0 0 0 2 0 1 1 0 1 0 0 1 0 0 0]
[0 0 0 0 0 0 0 1 0 1 0 1 1 1 0 1 0]
[1 1 1 1 1 0 0 1 0 0 0 0 0 1 1 0 1]]
Mate, simply try this with NumPy:
import numpy as np
arr = np.array(X)
print(arr)
If you're solving a text-based problem like converting words to vectors, I'd recommend using the Tokenizer from the TensorFlow library. Use the following code:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
sentences = [
'This is sample one',
'This is sample two'
]
tokenizer = Tokenizer(num_words = 100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(word_index)
Try this.
Hope this helps.
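As a quick sanity check, here is a sketch of what you get back (my own example, continuing from the snippet above, not part of the original answer): word_index maps each word to an integer, and texts_to_sequences() turns the sentences into lists of those integers.
print(tokenizer.word_index)
# e.g. {'this': 1, 'is': 2, 'sample': 3, 'one': 4, 'two': 5}

# convert the sentences to integer sequences using the learned word_index
sequences = tokenizer.texts_to_sequences(sentences)
print(sequences)
# e.g. [[1, 2, 3, 4], [1, 2, 3, 5]]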
I applied RFECV with a Random Forest, among other ML models, to a churn dataset.
While Logistic Regression, SVC, Gradient Boosting, and Decision Trees worked well on the data (all using RFECV),
the Random Forest RFECV decided that only one feature was important and eliminated all the other features.
Code:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import train_test_split

# Create feature variable X and target variable y
y = churn_dataset['Churn']
X = churn_dataset.drop(['Churn'], axis = 1)
#RFECV
rfecv = RFECV(RandomForestClassifier(), cv=10, scoring='f1')
rfecv = rfecv.fit(X, y)
print('Optimal number of features :', rfecv.n_features_)
print('Best features :', X.columns[rfecv.support_])
print(np.where(rfecv.support_ == False)[0])
#drop columns
X.drop(X.columns[np.where(rfecv.support_ == False)[0]], axis=1, inplace=True)
rfecv.estimator_.feature_importances_
#train test split
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.20,
random_state=8)
#fit model
random_forest = rfecv.fit(X_train, y_train)
The following error is returned:
ValueError: Found array with 1 feature(s) (shape=(1622, 1)) while a minimum of 2 is required.
Output of churn_dataset.head()
name gender churn last_purchase_in_days order_count purchase_quantity ...
2 ACKLE 0 1 0.317604 -0.453647 2 -0.368683 1.173058 0.291104 0 ... 0 0 0 0 0 0 1 0 0 1.00
4 ADNAN 1 1 0.250814 -0.453647 2 -0.368683 -0.431351 -0.418023 0 ... 0 0 0 0 0 0 1 0 0 1.00
5 ADY 0 1 -1.143415 -0.453647 2 -0.368683 0.190767 -0.117630 0 ... 0 0 0 0 0 0 1 0 0 1.00
6 ANDY 0 1 0.768432 -0.453647 2 -0.368683 -0.752232 -0.559952 0 ... 0 0 0 0 0 0 1 0 0 1.00
7 AGIE 0 0 -1.669381 3.048875 8 -0.368683 0.520653 4.251851 0 ... 0 0 0 0 0 0 1 0 0 0.16
churn_dataset.columns
Index(['name', 'gender', 'Churn', 'last_purchase_in_days',
'order_count', 'quantity', 'disc_code',
'AOV', 'sales',
'channel_Paid Advertising','channel_Recurring Payment',
'channel_Search Engine',
'channel_Social Media', 'country_Denmark', 'country_France',
'country_Germany', 'country_Italy',
'country_Luxembourg', 'country_Others', 'country_Switzerland',
'country_United Kingdom', 'city_Düsseldorf', 'city_Frankfurt',
'city_Hamburg', 'city_Hannover', 'city_Köln', 'city_Leipzig',
'city_Munich', 'city_Others', 'city_Stuttgart', 'city_Wien',
'Probability_of_Churn'],
dtype='object')
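A hedged reading of the traceback: after the column drop, X holds a single feature, and RFECV's input check requires at least two features, so calling rfecv.fit() again on the reduced data fails. A minimal sketch of one possible way around it (my own, under that assumption): train a final model on the selected features instead of re-running the selection.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# keep only the features RFECV selected (assumes X is still the full, un-dropped frame)
X_selected = X.loc[:, rfecv.support_]

X_train, X_test, y_train, y_test = train_test_split(X_selected, y,
                                                    test_size=0.20,
                                                    random_state=8)

# fit a fresh classifier on the selected features instead of calling rfecv.fit() again
random_forest = RandomForestClassifier()
random_forest.fit(X_train, y_train)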
I am trying to use a Naive Bayes classifier from the sklearn module to classify whether movie reviews are positive. I am using a bag of words as the features for each review and a large dataset with sentiment scores attached to reviews.
import pandas as pd

df_bows = pd.DataFrame.from_records(bag_of_words)
df_bows = df_bows.fillna(0).astype(int)
This code creates a pandas dataframe which looks like this:
The Rock is destined to ... Staggeringly ’ ve muttering dissing
0 1 1 1 1 2 ... 0 0 0 0 0
1 2 0 1 0 0 ... 0 0 0 0 0
2 0 0 0 0 0 ... 0 0 0 0 0
3 0 0 1 0 4 ... 0 0 0 0 0
4 0 0 0 0 0 ... 0 0 0 0 0
I then try to fit this data frame with the sentiment of each review using this code:
from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB()
nb = nb.fit(df_bows, movies.sentiment > 0)
However I get an error which says
AttributeError: 'Series' object has no attribute 'to_coo'
This is what the df movies looks like.
sentiment text
id
1 2.266667 The Rock is destined to be the 21st Century's ...
2 3.533333 The gorgeously elaborate continuation of ''The...
3 -0.600000 Effective but too tepid biopic
4 1.466667 If you sometimes like to go to the movies to h...
5 1.733333 Emerges as something rare, an issue movie that...
Can you help with this?
When you try to fit your MultinomialNB model, sklearn's routine checks whether the input df_bows is sparse or not. If it is, as in this case, the dataframe's dtype needs to be changed to 'Sparse'. Here is how I fixed it:
df_bows = pd.DataFrame.from_records(bag_of_words)
# Keep NaN values and convert to Sparse type
sparse_bows = df_bows.astype('Sparse')
nb = nb.fit(sparse_bows, movies['sentiment'] > 0)
Link to the pandas docs: pandas.Series.sparse.to_coo
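An alternative sketch (my own assumption, not the answerer's fix): hand sklearn a dense NumPy array instead of the DataFrame, provided the bag-of-words matrix fits in memory.
from sklearn.naive_bayes import MultinomialNB

# densify the bag-of-words frame before fitting (only viable if it fits in memory)
X_dense = df_bows.fillna(0).astype(int).to_numpy()

nb = MultinomialNB()
nb = nb.fit(X_dense, movies['sentiment'] > 0)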
I am trying to teach myself how to grid-search the number of neurons in a basic multi-layer neural network. I am using GridSearchCV and KerasClassifier in Python, as well as Keras. The code below works very well for other data sets, but I could not make it work for the Iris dataset for some reason, and I cannot find why; I am missing something here. The result I get is:
Best: 0.000000 using {'n_neurons': 3}
0.000000 (0.000000) with: {'n_neurons': 3}
0.000000 (0.000000) with: {'n_neurons': 5}
from pandas import read_csv
import numpy
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from keras.wrappers.scikit_learn import KerasClassifier
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import np_utils
from sklearn.model_selection import GridSearchCV
dataframe=read_csv("iris.csv", header=None)
dataset=dataframe.values
X=dataset[:,0:4].astype(float)
Y=dataset[:,4]
seed=7
numpy.random.seed(seed)
#encode class values as integers
encoder = LabelEncoder()
encoder.fit(Y)
encoded_Y = encoder.transform(Y)
#one-hot encoding
dummy_y = np_utils.to_categorical(encoded_Y)
#scale the data
scaler = StandardScaler()
X = scaler.fit_transform(X)
def create_model(n_neurons=1):
    # create model
    model = Sequential()
    model.add(Dense(n_neurons, input_dim=X.shape[1], activation='relu'))  # hidden layer
    model.add(Dense(3, activation='softmax'))  # output layer
    # compile model
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
model = KerasClassifier(build_fn=create_model, epochs=100, batch_size=10, initial_epoch=0, verbose=0)
# define the grid search parameters
neurons=[3, 5]
# this does 3-fold cross-validation by default; one can change k
param_grid = dict(n_neurons=neurons)
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1)
grid_result = grid.fit(X, dummy_y)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
print("%f (%f) with: %r" % (mean, stdev, param))
For the purpose of illustration and computational efficiency, I search over only two values. I sincerely apologize for asking such a simple question. I am new to Python; I switched from R, by the way, because I realized that the deep learning community is using Python.
Haha, this is probably the funniest thing I ever experienced on Stack Overflow :) Check:
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1, cv=5)
and you should see different behavior. The reason your model gets a perfect score (in terms of cross-entropy, 0 is equivalent to the best possible model) is that you haven't shuffled your data, and because Iris consists of three balanced classes, each of your folds had a single class as its target:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 (first fold ends here) 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 (second fold ends here)2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2]
Such problems are really easy for any model to solve - that's why you got a perfect match.
Try shuffling your data beforehand - this should result in the expected behavior.
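A minimal sketch of that shuffling suggestion (my own addition, reusing the variables defined in the question): pass a shuffled KFold as cv so that every fold sees all three classes.
from sklearn.model_selection import KFold

# shuffle within cross-validation so each fold contains examples of all three classes
cv = KFold(n_splits=5, shuffle=True, random_state=seed)
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1, cv=cv)
grid_result = grid.fit(X, dummy_y)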
I'm doing machine learning for time series prediction and I need to transform dates into vectors of zeros and ones.
If I decide that the relevant information of the date is the day of the week on which the observation was made, I'd like to have a time series of vectors of length 7 that contain only one "1", placed in the first slot if it's a Monday, the second if it's a Tuesday, etc.
For example, I'd like an input like "2015-12-22 22:48:00" to be transformed into
0 1 0 0 0 0 0
if the relevant information is that it's a Tuesday, or
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
if the relevant information is that it's 10 p.m.
The LabelBinarizer() from sklearn.preprocessing does that nicely in Python, and I've looked for the equivalent in R but haven't found it. Do any of you happen to know what I'm looking for?
Here is LabelBinarizer(): http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelBinarizer.html
Right now I'm doing this in Python, where Hour is a time series of the exact hours at which my observations were made:
import sklearn.preprocessing as pp
lbday = pp.LabelBinarizer()
lbday.fit(list(range(24)))
pp.LabelBinarizer(neg_label=0, pos_label=1)
Hour = lbday.transform(Hour)
Then I export a CSV of the binarized dates that I read with R.
Thank you!
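For the day-of-week case described in the question, a similar sketch in Python (my own, assuming pandas is available for parsing the timestamps):
import pandas as pd
import sklearn.preprocessing as pp

timestamps = pd.to_datetime(["2015-12-22 22:48:00", "2015-12-23 09:15:00"])

lbweekday = pp.LabelBinarizer()
lbweekday.fit(list(range(7)))               # 0 = Monday, ..., 6 = Sunday
day_vectors = lbweekday.transform(timestamps.weekday)
print(day_vectors)                          # 2015-12-22 is a Tuesday -> 1 in the second slot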
Try this:
binarizer <- function(levels){
  f = function(v){
    m = matrix(0, nrow=length(v), ncol=length(levels))
    vf = as.numeric(factor(v, levels=levels))
    m[cbind(1:length(v), vf)] = 1
    colnames(m) = levels
    m
  }
  f
}
Example:
> ab = binarizer(letters[1:5]) # valid values a to e
> ab(c("a","e","a"))
a b c d e
[1,] 1 0 0 0 0
[2,] 0 0 0 0 1
[3,] 1 0 0 0 0