Converting a Python SVM text classifier to a TensorFlow model

I have written Python code for a text classifier using SVM (multi-class), and now I want to run this code in an Android application. From what I have read, TensorFlow Lite is useful in this scenario. How should I proceed to convert my Python code to TensorFlow Lite? What steps do I need to follow?
Below is the code for the SVM classifier:
import pandas as pd
import numpy as np
import tensorflow as tf
from collections import Counter
from sklearn import model_selection, preprocessing, linear_model, naive_bayes, metrics, svm
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.svm import SVC
column_names = ['text', 'labels']
data = pd.read_csv("newdataset.csv", names = column_names, index_col = False)
train_x, test_x, train_y, test_y = model_selection.train_test_split(data.text,data.labels,test_size = 0.5 ,random_state = 0)
count_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}',max_features=100)
count_vect.fit(data.text)
xtrain_count = count_vect.transform(train_x)
xtest_count = count_vect.transform(test_x)
tfidf_vect = TfidfTransformer()
xtrain_tfidf = tfidf_vect.fit_transform(xtrain_count)
# Use transform (not fit_transform) on the test set so it reuses the IDF weights fitted on the training set
xtest_tfidf = tfidf_vect.transform(xtest_count)
clf = svm.SVC(kernel='linear')
clf.fit(xtrain_tfidf, train_y)
predicted = clf.predict(xtest_tfidf)
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
print(confusion_matrix(test_y,predicted))
print(classification_report(test_y,predicted))
print(accuracy_score(test_y,predicted))
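A scikit-learn SVM cannot be handed to the TFLite converter directly, since the converter only understands TensorFlow models. One hedged way forward: a linear SVM's decision function is just scores = W·x + b, so the fitted classifier can be re-expressed as a one-layer Keras model and that model converted. The sketch below is one approach under stated assumptions, not the official procedure: it assumes you train a LinearSVC (one-vs-rest, more than two classes) instead of SVC(kernel='linear'), whose one-vs-one coef_ layout does not map onto a single dense layer, and it reuses xtrain_tfidf and train_y from the code above.
# Minimal sketch: express a linear one-vs-rest SVM (argmax of W.x + b)
# as a single Dense layer, then convert the Keras model to TFLite.
from sklearn.svm import LinearSVC
lin_clf = LinearSVC()
lin_clf.fit(xtrain_tfidf, train_y)
n_features = xtrain_tfidf.shape[1]  # 100 here, given max_features=100 above
keras_model = tf.keras.Sequential([
    tf.keras.Input(shape=(n_features,)),
    tf.keras.layers.Dense(len(lin_clf.classes_)),
])
# Copy the learned weights and intercepts into the Dense layer.
keras_model.layers[0].set_weights([
    lin_clf.coef_.T.astype(np.float32),
    lin_clf.intercept_.astype(np.float32),
])
converter = tf.lite.TFLiteConverter.from_keras_model(keras_model)
with open("svm_classifier.tflite", "wb") as f:
    f.write(converter.convert())
On the Android side, the class with the highest output score is the prediction. Note that the CountVectorizer/TF-IDF featurization must also be reproduced on-device (for example by exporting count_vect.vocabulary_ and the transformer's idf_ array and re-implementing the lookup in Java/Kotlin), because TFLite only runs the converted model.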

Related

ValueError: Iterable over raw text documents expected, string object received

I have this model and I tried to make a simple interface for it using Streamlit. It follows the same transformation steps that were undertaken during the training phase, so I don't understand what's wrong here. I suppose it has to do with the Streamlit input and that I need to transform my input somehow, but I couldn't figure it out. Any help will be appreciated, thanks!
Here is the code:
import streamlit as st
import numpy as np
import pickle
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
import joblib
import pandas as pd
data=pd.read_csv('IMDB Dataset.csv')
train, test= train_test_split(data, test_size=0.2, random_state=42)
Xtrain, ytrain = train['review'], train['sentiment']
Xtest, ytest = test['review'], test['sentiment']
model = joblib.load('model.pkl')
st.title('Analisis Sentimen')  # Indonesian: "Sentiment Analysis"
txt = st.text_input('masukkan teks yang ingin dianalisis')  # Indonesian: "enter the text to analyze"
tf = TfidfVectorizer()
tfdf = tf.fit_transform(Xtrain)
vect = pd.DataFrame(tf.transform(txt).toarray())
txt = pd.DataFrame(vect)
pred = model.predict(txt)
print(pred)
st.write(pred)
You have to guard the rest of the execution with an if statement on txt, otherwise you will keep encountering a ValueError even after this error is fixed (note that st.text_input returns an empty string until the user types, so test the string's truthiness rather than comparing against None). Now look at your vect variable: transform() expects raw documents, meaning an iterable of strings. Since the input is a str by default, convert it into a list containing that single string and pass that to transform() as the parameter.
import streamlit as st
import numpy as np
import pickle
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
import joblib
import pandas as pd
data = pd.read_csv('IMDB Dataset.csv')
train, test = train_test_split(data, test_size=0.2, random_state=42)
Xtrain, ytrain = train['review'], train['sentiment']
Xtest, ytest = test['review'], test['sentiment']
model = joblib.load('model.pkl')
st.title('Analisis Sentimen')
txt = st.text_input('masukkan teks yang ingin dianalisis')
tf = TfidfVectorizer()
tfdf = tf.fit_transform(Xtrain)
if txt:
    # transform() expects an iterable of documents, so wrap the string in a list
    raw_doc = [txt]
    vect = pd.DataFrame(tf.transform(raw_doc).toarray())
    txt = pd.DataFrame(vect)
    pred = model.predict(txt)
    print(pred)
    st.write(pred)
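A side note on the design: refitting TfidfVectorizer on Xtrain at every Streamlit rerun is slow, and the vocabulary learned here is not guaranteed to match whatever vectorizer produced model.pkl. A safer pattern (a sketch; vectorizer.pkl is a hypothetical artifact you would save with joblib.dump at training time) is to persist the fitted vectorizer next to the model and load both:
# Hypothetical artifact saved at training time: joblib.dump(fitted_vectorizer, 'vectorizer.pkl')
tf = joblib.load('vectorizer.pkl')
model = joblib.load('model.pkl')
if txt:
    pred = model.predict(tf.transform([txt]))
    st.write(pred)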

How to classify new data using a pre-trained model - Python Text Classification (NLTK and Scikit)

I am very new to text classification and I am trying to classify each line of a dataset composed of Twitter comments according to some pre-defined topics.
I have used the code below in Jupyter Notebook to build and train a model with a training dataset. I chose a supervised approach in Python with NLTK and Scikit, as unsupervised ones (like LDA) were not giving me good results.
I followed these steps so far:
Manually categorised the topics of a training dataset;
Applied the training dataset to the code below and trained it, resulting in an accuracy of approx. 82%.
Now, I want to use this model to automatically categorise the topics of another dataset (i.e., my test dataset). Most posts only cover the training part, so it's quite frustrating for a newcomer to understand how to get the trained model and actually use it.
Hence, the question is: with the code below, how can I now use the trained model to classify a new dataset?
I appreciate your help.
This tutorial is very good, and I used it as a reference for the code below: https://medium.com/@ishan16.d/text-classification-in-python-with-scikit-learn-and-nltk-891aa2d0ac4b
My model building and training code:
#Do library and methods import
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from nltk.tokenize import RegexpTokenizer
from nltk import WordNetLemmatizer
from nltk.stem import PorterStemmer
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import nltk as nltk
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
import regex as re
import requests
# Import dataset
df = pd.read_csv(r'C:\Users\user_name\Downloads\Train_data.csv', delimiter=';')
# Tokenize
def tokenize(x):
    tokenizer = RegexpTokenizer(r'\w+')
    return tokenizer.tokenize(x)
df['tokens'] = df['Tweet'].map(tokenize)
# Stem and Lemmatize
nltk.download('wordnet')
nltk.download('omw-1.4')
def stemmer(x):
    stemmer = PorterStemmer()
    return ' '.join([stemmer.stem(word) for word in x])
def lemmatize(x):
    lemmatizer = WordNetLemmatizer()
    return ' '.join([lemmatizer.lemmatize(word) for word in x])
df['lemma'] = df['tokens'].map(lemmatize)
df['stems'] = df['tokens'].map(stemmer)
# set up feature matrix and target column
X = df['lemma']
y = df['Topic']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 13)
# Create out pipeline with a vectorizer and our naive Bayes classifier
pipe_mnnb = Pipeline(steps = [('tf', TfidfVectorizer()), ('mnnb', MultinomialNB())])
# Create parameter grid
pgrid_mnnb = {
    'tf__max_features': [1000, 2000, 3000],
    'tf__stop_words': ['english', None],
    'tf__ngram_range': [(1,1), (1,2)],
    'tf__use_idf': [True, False],
    'mnnb__alpha': [0.1, 0.5, 1]
}
# Set up the grid search and fit the model
gs_mnnb = GridSearchCV(pipe_mnnb,pgrid_mnnb,cv=5,n_jobs=-1)
gs_mnnb.fit(X_train, y_train)
# Check the score
gs_mnnb.score(X_train, y_train)
gs_mnnb.score(X_test, y_test)
# Check the parameters
gs_mnnb.best_params_
# Get predictions
preds_mnnb = gs_mnnb.predict(X)
df['preds'] = preds_mnnb
# Print resulting dataset
print(df.shape)
df.head()
It seems that after training you just have to do the same thing as in your validation step: use the grid searcher directly, which in sklearn also serves as a model after fitting, refit with the best hyperparameters it found.
So take an X containing whatever you want to classify and run
preds_mnnb = gs_mnnb.predict(X)
preds_mnnb should contain what you expect
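Concretely, the new dataset has to pass through the same preprocessing (tokenize, then lemmatize) before calling predict, because the fitted pipeline only covers the TF-IDF and classifier steps. A minimal sketch; the file name and the 'Tweet' column are assumptions mirroring the training file:
# Hypothetical test file; adjust the path and delimiter to your data
new_df = pd.read_csv(r'C:\Users\user_name\Downloads\Test_data.csv', delimiter=';')
new_df['tokens'] = new_df['Tweet'].map(tokenize)
new_df['lemma'] = new_df['tokens'].map(lemmatize)
# The fitted grid search applies the best TfidfVectorizer + MultinomialNB it found
new_df['preds'] = gs_mnnb.predict(new_df['lemma'])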

All intermediate steps should be transformers and implement fit and transform or be the string 'passthrough'

Can I ask, when I run this code, it produces an output without error:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
from sklearn.model_selection import cross_val_score, cross_val_predict,cross_validate
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import chi2, f_regression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_selection import SelectKBest
#from xgboost import XGBRegressor
from sklearn.feature_selection import f_regression
from sklearn.feature_selection import mutual_info_regression
from sklearn.feature_selection import mutual_info_classif
from sklearn import metrics
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import SelectKBest
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import BayesianRidge
from sklearn.pipeline import Pipeline
from scipy.stats import spearmanr
from sklearn.svm import SVR
from sklearn.svm import SVC
from sklearn.metrics import roc_curve, precision_recall_curve, auc, make_scorer, recall_score, accuracy_score, precision_score, confusion_matrix,classification_report
import pickle
import numpy as np
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import make_scorer
from sklearn.metrics import precision_score,recall_score
from sklearn.datasets import make_classification
#Generate fake data
X, y = make_classification(n_samples=5000, n_classes=2, n_features=20, n_redundant=0,random_state=0) #fake data
X_train = X[:4500] #.iloc for df
y_train = y[:4500]
X_test = X[4500:]#.reset_index(drop=True,inplace=True)
y_test = y[4500:]
scorers = {
    'precision_score': make_scorer(precision_score),
    'recall_score': make_scorer(recall_score),
    'accuracy_score': make_scorer(accuracy_score)
}
def run_SVC(X_train, y_train, X_test, y_test, output_file, data_name, refit_score='precision_score'):
    '''
    run SVC algorithm, with CV and hyperparameter tuning.
    '''
    short_dataname = data_name.strip().split('/')
    file_model_name = output_file + '_svc_' + short_dataname[-1]
    clf = SVC()
    skf = StratifiedKFold(n_splits=2, random_state=42, shuffle=True)
    #fs = SelectKBest(score_func = mutual_info_classif)
    pipeline = Pipeline(steps=[('svc', clf)]) #,('sel',fs)
    print(pipeline.get_params().keys())
    search = GridSearchCV(
        pipeline,
        param_grid={
            'svc__C': [0.01, 0.1, 10, 1000],  ## Regularization
            'svc__gamma': [0.0001, 0.01, 1, 10],
            'svc__kernel': ['linear', 'rbf'],
        },
        return_train_score=True,
        verbose=3,
        refit=refit_score,
        scoring=scorers,
        cv=skf,
        n_jobs=-1,
    )
    search.fit(X_train, y_train)
    # make the predictions
    y_pred = search.predict(X_test)
    print('Best params for {}'.format(refit_score))
    print(search.best_params_)
    print(classification_report(y_test, y_pred))  # labels=['neg','pos']
    return

print(run_SVC(X_train, y_train, X_test, y_test, 'test.txt', 'dataset'))
When I uncomment the only two lines that are commented out (#fs = SelectKBest(score_func = mutual_info_classif) and the fs entry in the line after it), I get the error:
TypeError: All intermediate steps should be transformers and implement fit and transform or be the string 'passthrough' 'SVC()' (type <class 'sklearn.svm._classes.SVC'>) doesn't
I can see that other people have addressed this on SO before, e.g. here, so I tried to follow that person's answer, but my SelectKBest is already before my pipeline - when I move the line with 'fs' to be higher in my code (which I thought was what the answer was saying), I get the same error.
Could someone show me where I'm going wrong here and what I'm meant to change to remove this error?
The order of the steps in a Pipeline matters, and only the last step can be a non-transformer like your svc.
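In other words, the commented-out variant appends ('sel', fs) after ('svc', clf), making the SVC an intermediate step, and SVC has no transform method. A sketch of the corrected ordering inside run_SVC (the svc__ names in the parameter grid keep working):
# Feature selection is a transformer, so it must come before the estimator;
# only the last step of a Pipeline may be a non-transformer.
fs = SelectKBest(score_func=mutual_info_classif)
pipeline = Pipeline(steps=[('sel', fs), ('svc', clf)])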

How to add a confusion matrix and 10-fold cross-validation in sentiment analysis

I want to add model evaluation using k-fold cross-validation (k = 10) and a confusion matrix, but I'm confused about how to do it.
dataset : https://github.com/fadholifh/dats/blob/master/cpas.txt
Using Python 3.7
import sklearn.metrics
import sen
import csv
import os
import re
import nltk
import scipy
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import svm
from sklearn.externals import joblib
from sklearn.pipeline import Pipeline
from sklearn import model_selection
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
from Sastrawi.StopWordRemover.StopWordRemoverFactory import StopWordRemoverFactory
factorys = StemmerFactory()
stemmer = factorys.create_stemmer()
if __name__ == "__main__":
    main()
The desired result is a confusion matrix and, for the k-fold, each fold's F1-score, precision, and recall.
df = pd.read_csv("cpas.txt", header=None, delimiter="\t")
X = df[1].values
y = df[0].values
stop_words = stopwords.words('english')
stemmer = PorterStemmer()
def clean_text(text, stop_words, stemmer):
return " ".join([stemmer.stem(word) for word in word_tokenize(text)
if word not in stop_words and not word.isnumeric()])
X = np.array([clean_text(text, stop_words, stemmer) for text in X])
kfold = KFold(3, shuffle=True, random_state=33)
i = 1
for train_idx, test_idx in kfold.split(X):
X_train = X[train_idx]
y_train = y[train_idx]
X_test = X[test_idx]
y_test = y[test_idx]
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)
model = LinearSVC()
model.fit(X_train, y_train)
print ("Fold : {0}".format(i))
i += 1
print (classification_report(y_test, model.predict(X_test)))
The reason you use cross-validation is for parameter tuning when data is scarce. One can use grid search with CV to do this:
df = pd.read_csv("cpas.txt", header=None, delimiter="\t")
X = df[1].values
labels = df[0].values
text = np.array([clean_text(text, stop_words, stemmer) for text in X])
idx = np.arange(len(text))
np.random.shuffle(idx)
text = text[idx]
labels = labels[idx]
pipeline = Pipeline([
('vectorizer', TfidfVectorizer()),
('svm', LinearSVC())])
params = {
'vectorizer__ngram_range' : [(1,1),(1,2),(2,2)],
'vectorizer__lowercase' : [True, False],
'vectorizer__norm' : ['l1','l2']}
model = GridSearchCV(pipeline, params, cv=3, verbose=1)
model.fit(text, y)
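To get the k = 10 evaluation with a confusion matrix that the question asks for, one option (a sketch building on the fitted grid search above) is cross_val_predict, which collects out-of-fold predictions across all folds so an aggregate confusion matrix and report can be printed:
from sklearn.model_selection import KFold, cross_val_predict
# Out-of-fold predictions for every document across 10 folds,
# using the best pipeline found by the grid search.
preds = cross_val_predict(model.best_estimator_, text, labels,
                          cv=KFold(10, shuffle=True, random_state=33))
print(confusion_matrix(labels, preds))
print(classification_report(labels, preds))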

sklearn classifier gets ValueError: bad input shape (3529, 12)

I have a JSON file containing data that has already been preprocessed and converted to vectors. How do I train on this data using the SVM classification method?
Vector is the name of one column; the other columns hold the values, which are the genres for each vector.
import pickle
from nltk.corpus import stopwords
import string
from nltk.stem import SnowballStemmer
import pandas as pd
from sklearn import svm
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn import metrics
stopwords=set(stopwords.words("english"))
exclude = set(string.punctuation)
snow=SnowballStemmer("english")
tvec = pickle.load(open("dataPackage/tfidf.pickle", 'rb'))
data=pd.read_json("dataPackage/finalData.json",orient = 'split')
inputLen = len(data["Vector"].iloc[0])
X = list(data["Vector"])
y = list(data.drop(["Vector"],axis = 1).values)
np.shape(X)
np.shape(y)
X_train, X_test, y_train, y_test = train_test_split(np.array(X), np.array(y), test_size=0.3,random_state=109)
model = svm.SVC(kernel='linear')
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
