I have done text classification with the scikit-learn Python library, importing these classifiers:
from sklearn.linear_model import RidgeClassifier
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import Perceptron
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import NearestCentroid
from sklearn.ensemble import RandomForestClassifier
The input text was in the form:
('some text 1', 'class1')
('some text 2', 'class2')
('some text 3', 'class3')
...
And everything was OK. But what I want to know is, if I have multi-labeled text like:
('some text 1', 'class1', 'class3')
('some text 2', 'class2', 'class1')
('some text 3', 'class3')
...
is it possible to use these classifiers, or should I use some other ones?
All classifiers capable of multiclass or multilabel classification are listed on this page.
Based on it, only 2 of your models can be used directly for multilabel classification:
RandomForestClassifier
KNeighborsClassifier
What I've done (in an exercise) is to use a one-vs-rest scheme with another compatible classifier, then extract the top N labels, or all labels whose probability is above some threshold X% (the more labels you have, the lower the threshold should be, since the probabilities sum to 1). It's not the cleanest thing you can do, but it works (I compared it with the results of a multilabel classifier and it was pretty close or identical).
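For illustration, here is a minimal sketch of that kind of setup; the vectorizer, the base classifier and the 0.3 threshold are assumptions on my part, not part of the original exercise:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

texts = ['some text 1', 'some text 2', 'some text 3']
labels = [['class1', 'class3'], ['class2', 'class1'], ['class3']]

# Turn the label lists into a binary indicator matrix, one column per label
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)

X = TfidfVectorizer().fit_transform(texts)
clf = OneVsRestClassifier(LogisticRegression()).fit(X, Y)

# One probability per label; keep every label above the threshold
probs = clf.predict_proba(X)
threshold = 0.3
predicted = [list(mlb.classes_[row > threshold]) for row in probs]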
I hope it helps,
Nicolas
I am trying to use the example given in this article https://towardsdatascience.com/machine-learning-nlp-text-classification-using-scikit-learn-python-and-nltk-c52b92a7c73a except that, instead of the 20newsgroups dataset the tutorial uses, I am using my own data: text files under /home/pi/train/, where each subdirectory under train is a label, e.g. /home/pi/train/FOOTBALL/ and /home/pi/train/BASKETBALL/. I test one document at a time by putting it in either /home/pi/test/FOOTBALL/ or /home/pi/test/BASKETBALL/ and running the program.
# -*- coding: utf-8 -*-
import sklearn
from pprint import pprint
from sklearn.datasets import load_files

# Load the training documents; each subdirectory name becomes a category
docs_to_train = sklearn.datasets.load_files("/home/pi/train/", description=None, categories=None, load_content=True, shuffle=True, encoding=None, decode_error='strict', random_state=0)
pprint(list(docs_to_train.target_names))

from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

# Build a bag-of-words count matrix from the raw documents
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(docs_to_train.data)
X_train_counts.shape

from sklearn.feature_extraction.text import TfidfTransformer

# Re-weight the raw counts with tf-idf
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Chain vectorizer, tf-idf and classifier into a single pipeline
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB()),])
text_clf = text_clf.fit(docs_to_train.data, docs_to_train.target)

import numpy as np

# Load the test documents the same way and predict their categories
docs_to_test = sklearn.datasets.load_files("/home/pi/test/", description=None, categories=None, load_content=True, shuffle=True, encoding=None, decode_error='strict', random_state=0)
predicted = text_clf.predict(docs_to_test.data)

# Fraction of test documents whose predicted label matches the folder label
np.mean(predicted == docs_to_test.target)
pprint(np.mean(predicted == docs_to_test.target))
If I put a football text document in the /home/pi/test/FOOTBALL/ folder and run the program I get:
['FOOTBALL', 'BASKETBALL']
1.0
If I move the same document about football into the /home/pi/test/BASKETBALL/ folder and run the program I get:
['FOOTBALL', 'BASKETBALL']
0.0
Is this how np.mean is supposed to work? Does anyone know what it is trying to tell me?
Having read through the docs on sklearn's load_files, maybe the problem is in the call X_train_counts = count_vect.fit_transform(docs_to_train.data). You may have to explore the structure of the docs_to_train.data object to see how to access the underlying document data. Unfortunately, the docs aren't all that helpful about data's structure:
Dictionary-like object, the interesting attributes are: either data, the raw text data to learn, or ‘filenames’, the files holding it, ‘target’, the classification labels (integer index), ‘target_names’, the meaning of the labels, and ‘DESCR’, the full description of the dataset.
It may also be that CountVectorizer() expects a single filepath or text object, and not a container holding many sub-objects.
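A quick way to explore that structure before vectorizing is to print a few attributes of the Bunch object that load_files returns (a minimal sketch):
# Inspect what load_files actually returned before vectorizing
print(type(docs_to_train.data))     # list of raw documents (bytes here, since encoding=None)
print(len(docs_to_train.data))      # number of documents loaded
print(docs_to_train.data[0][:200])  # start of the first document
print(docs_to_train.target[:10])    # integer labels, indices into target_names
print(docs_to_train.target_names)   # e.g. ['FOOTBALL', 'BASKETBALL']
As for the np.mean question itself: predicted == docs_to_test.target is a boolean array, and its mean is simply the fraction of test documents whose predicted label matches the label implied by the folder they sit in. That is why moving the same football document into the BASKETBALL folder flips the score from 1.0 to 0.0: the prediction stays the same, but the "true" label changes.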
I'm very new to Python and I'm trying to replicate this Sign Language Glove project here with my own hardware as a first practice project in machine learning. I can already write data from my accelerometers into CSV files, but I can't understand the process. The file named 'modeling' confuses me. Can anyone help me understand the processes happening in it?
import numpy as np
from sklearn import svm
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
import pandas as pd
df= pd.read_csv("final.csv") ##This I understand. I've successfully created csv files with data
#########################################################################
## These below, I do not.
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size = 0.2)
train_features = train[['F1','F2','F3','F4','F5','X','Y','Z','C1','C2']]
train_label = train.cl
test_features = test[['F1','F2','F3','F4','F5','X','Y','Z','C1','C2']]
test_label = test.cl
## SVM
model = svm.SVC(kernel='linear', gamma=1, C=1)
model.fit(train_features, train_label)
model.score(train_features, train_label)
predicted_svm = model.predict(test_features)
print "svm"
print accuracy_score(test_label, predicted_svm)
cn =confusion_matrix(test_label, predicted_svm)
Welcome to the community. That looks like a nice way to start off.
Like #hilverts_drinking_problem suggested, I would recommend looking at sklearn documentation. But here's a quick explanation of what's going on.
The train_test_split function randomly splits the dataset into two parts, one for training and one for testing. test_size = 0.2 means 20% of the data goes into the test set and the remaining 80% into the train set.
The next two lines are just separating out the inputs (features) and outputs (targets) for training. Same for test in the next two lines.
Next, you create an SVM object, train the model using model.fit, and get its training score using .score. You then use the model to make predictions for the test set. Finally, you print the accuracy score for your test set, along with its confusion matrix.
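To make the confusion matrix part concrete, here is a toy illustration (made-up labels, nothing to do with your glove data):
from sklearn.metrics import confusion_matrix

# Toy example: 6 true labels vs. 6 predictions for two classes
y_true = ['A', 'A', 'A', 'B', 'B', 'B']
y_pred = ['A', 'A', 'B', 'B', 'B', 'A']
print(confusion_matrix(y_true, y_pred, labels=['A', 'B']))
# [[2 1]
#  [1 2]]
# Row i is the true class, column j the predicted class:
# two 'A's predicted correctly, one 'A' misread as 'B', and so on.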
If you need me to clarify/detail something, let me know!
I need to run a simple linear regression on a large dataset (30GB) that can't be loaded into memory. The features are mostly categorical. I've already built a prototype in scikit-learn that works just fine, but only on a subsample of the data.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import linear_model
data = pd.read_csv('datafile.csv', nrows=5e6)
""" data['categorical_feature'] it's a text field, which has categories comma separated. Example of structure is shown below.
categorical_feature
1 1671,1293
2 1293
3 1233,1671
"""
cat_vec = CountVectorizer(min_df=2)
m_cat = cat_vec.fit_transform(data['categorical_feature'])
lm = linear_model.Ridge()
lm.fit(m_cat, data['target'])
How would I write this in TensorFlow? I've looked around but didn't find much that replicates the behaviour of CountVectorizer in scikit-learn.
I have a decision tree classifier from sklearn and I use pydotplus to show it.
However, I don't really like it when there is a lot of information on each node for my presentation (entropy, samples and value).
To make it easier to explain to people, I would like to keep only the decision and the class on each node.
How can I modify the code to do this?
Thank you.
According to the documentation, it is not possible to leave out the additional information inside the boxes. The only thing you can directly omit is the impurity parameter.
However, I have done it another, explicit way, which is somewhat hacky. First, I save the .dot file with impurity set to False. Then I open it, convert it to a string, use a regex to strip out the redundant labels, and resave it.
The code goes like this:
import re
import pydotplus  # install it via pip install pydotplus
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz
from sklearn.datasets import load_iris

data = load_iris()
clf = DecisionTreeClassifier()
clf.fit(data.data, data.target)

# Save the tree with the impurity information already suppressed
export_graphviz(clf, out_file='tree.dot', impurity=False, class_names=True)

PATH = '/path/to/dotfile/tree.dot'
f = pydotplus.graph_from_dot_file(PATH).to_string()

# Strip the "samples = ..." and "value = [...]" labels from every node
f = re.sub('(\\\\nsamples = [0-9]+)(\\\\nvalue = \[[0-9]+, [0-9]+, [0-9]+\])', '', f)
f = re.sub('(samples = [0-9]+)(\\\\nvalue = \[[0-9]+, [0-9]+, [0-9]+\])\\\\n', '', f)

with open('tree_modified.dot', 'w') as file:
    file.write(f)
Here are the images before and after modification:
In your case there seem to be more parameters in the boxes, so you may want to tweak the regex a little.
I hope that helps!
I have two datasets, training and test, with labels "1" and "0". I need to evaluate them using the OneClassSVM algorithm with an RBF kernel in scikit-learn. I have loaded the training dataset, but I have no idea how to evaluate it against the test dataset. Below is my code:
from sklearn import svm
import numpy as np
input_file_data = "/home/anuradha/TrainData.csv"
dataset = np.loadtxt(input_file_data, delimiter=",")
X = dataset[:,0:4]  # first four columns are features
y = dataset[:,4]    # fifth column is the label
estimator = svm.OneClassSVM(nu=0.1, kernel="rbf", gamma=0.1)
Can someone please help me solve this problem?
It's as simple as adding the following two lines of code at the end of your script:
estimator.fit(X_train)
y_pred_test = estimator.predict(X_test)
The first line tells the SVM which training data to use, and the second one makes predictions on the test set (be sure to load both datasets and change the variable names accordingly).
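For instance, here is a minimal end-to-end sketch; the test file path, and the mapping of your 0/1 labels onto OneClassSVM's -1/+1 output, are assumptions on my part:
from sklearn import svm
import numpy as np

train = np.loadtxt("/home/anuradha/TrainData.csv", delimiter=",")
test = np.loadtxt("/home/anuradha/TestData.csv", delimiter=",")  # hypothetical path

X_train, y_train = train[:, 0:4], train[:, 4]
X_test, y_test = test[:, 0:4], test[:, 4]

estimator = svm.OneClassSVM(nu=0.1, kernel="rbf", gamma=0.1)
estimator.fit(X_train)

# OneClassSVM predicts +1 for inliers and -1 for outliers,
# so map the 0/1 labels to -1/+1 before comparing
y_pred_test = estimator.predict(X_test)
y_test_mapped = np.where(y_test == 1, 1, -1)
print(np.mean(y_pred_test == y_test_mapped))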
Here is a complete example of how to use OneClassSVM, and here is the class reference.