ValueError: Iterable over raw text documents expected, string object received - python

I have this model and I tried to build a simple interface for it using Streamlit. It follows the same transformation steps that were used during training, so I don't understand what's wrong here. I suppose it has to do with the Streamlit input and that I need to transform my input somehow, but I couldn't figure it out. Any help will be appreciated, thanks!
Here is the code:
import streamlit as st
import numpy as np
import pickle
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
import joblib
import pandas as pd
data=pd.read_csv('IMDB Dataset.csv')
train, test= train_test_split(data, test_size=0.2, random_state=42)
Xtrain, ytrain = train['review'], train['sentiment']
Xtest, ytest = test['review'], test['sentiment']
model = joblib.load('model.pkl')
st.title('Analisis Sentimen')
txt = st.text_input('masukkan teks yang ingin dianalisis')
tf = TfidfVectorizer()
tfdf = tf.fit_transform(Xtrain)
vect = pd.DataFrame(tf.transform(txt).toarray())
txt = pd.DataFrame(vect)
pred = model.predict(txt)
print(pred)
st.write(pred)

Guard the rest of the script with an if statement on txt before proceeding; otherwise you may run into another ValueError once this one is fixed. Now look at your vect variable: transform() expects raw documents, meaning an iterable of strings, which here contains a single element. The input is a str by default, so wrap it in a list of strings and pass that list to transform() as the parameter.
import streamlit as st
import numpy as np
import pickle
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
import joblib
import pandas as pd
data = pd.read_csv('IMDB Dataset.csv')
train, test = train_test_split(data, test_size=0.2, random_state=42)
Xtrain, ytrain = train['review'], train['sentiment']
Xtest, ytest = test['review'], test['sentiment']
model = joblib.load('model.pkl')
st.title('Analisis Sentimen')
txt = st.text_input('masukkan teks yang ingin dianalisis')
tf = TfidfVectorizer()
tfdf = tf.fit_transform(Xtrain)
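if txt:  # st.text_input returns an empty string until the user types something
    raw_doc = [txt]  # transform() needs an iterable of documents, not a bare string
    vect = pd.DataFrame(tf.transform(raw_doc).toarray())
    txt = pd.DataFrame(vect)
    pred = model.predict(txt)
    print(pred)
    st.write(pred)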
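To see the difference in isolation, here is a minimal sketch that leaves out Streamlit and the saved model. The three-sentence corpus and the variable names are made up for illustration; only the TfidfVectorizer calls mirror the code above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny made-up corpus standing in for the IMDB training reviews.
corpus = ["a great movie", "a terrible movie", "great acting and a great plot"]
tf = TfidfVectorizer()
tf.fit(corpus)

text = "great movie"  # what a text box would hand back: a plain str

# Passing the bare string reproduces the error from the question.
error_message = None
try:
    tf.transform(text)
except ValueError as err:
    error_message = str(err)
print(error_message)

# Wrapping the string in a list gives one row of features for one document.
features = tf.transform([text])
print(features.shape)
```

The result has one row per document in the list, so a single input string always produces a single-row matrix that can be passed on to predict().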

Related

Invalid Syntax Error in a certain line of code in python Decision Tree algorithm

The following is my code.
I am running it in IDLE with Python 3.8.
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction import DictVectorizer
from sklearn import trees
from sklearn.metrics import accuracy_score,classification_report
import warnings
from sklearn.preprocessing import StandardScalar
from sklearn.neural_networks import MLPClassifier
warnings.filterwarnings(action='ignore',category=DeprecationWarning)
data=pd.read_csv('data.csv')
cols_to_retain=[]
x-feature=data[cols_to_retain]
x_dict=x_feature.T.to_dict.values()
vect=DictVectorizer(sparse=False)
x_vector=vect.fit_transform(x_dict)
print(x_vector)
x_train=[:-1]
x_test=[-1:]
print('Train set')
print(x_train)
print('Test set')
print(x_test)
le=LabelEncoder
y_train=le.fit_transform(data['Goal'][:-1])
clf=tree.DecisionTreeClassifier(criteron='entropy')
clf=clf.fit_transform(x_train,y_train)
print('Test Data')
print(le.inverse_transform(clf.predict(x_test)))
It reports an error on the following lines, and the message only says "invalid syntax":
x_train=[:-1]
x_test=[-1:]
The packages are imported correctly.
Your code contains multiple issues:
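The import should be StandardScaler, not StandardScalar,
MLPClassifier is imported but never used (and its module is sklearn.neural_network, not sklearn.neural_networks),
from sklearn import trees should be from sklearn import tree,
cols_to_retain is empty. Thus, data[cols_to_retain] will return an empty data frame,
to_dict should be to_dict(),
variable names x-feature and x_feature do not match (x-feature is not a valid identifier),
LabelEncoder is missing brackets (),
criteron should be criterion,
clf.fit_transform should be clf.fit, since the classifier is trained with fit,
x_train=[:-1] and x_test=[-1:] are not valid. You probably wanted to select a subset like x_train = x_vector[:-1] or x_test = x_vector[-1:]. Please add additional sample data if you need help with this selection.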
Here is an updated version of your code:
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier
data = pd.read_csv("data.csv")
print(data)
cols_to_retain = []
x_feature = data[cols_to_retain]
x_dict = x_feature.T.to_dict().values()
vect = DictVectorizer(sparse=False)
x_vector = vect.fit_transform(x_dict)
print(x_vector)
x_train = x_vector[:-1]
x_test = x_vector[-1:]
print("Train set")
print(x_train)
print("Test set")
print(x_test)
le = LabelEncoder()
y_train = le.fit_transform(data["Goal"][:-1])
clf = DecisionTreeClassifier(criterion="entropy")
clf = clf.fit(x_train, y_train)
print("Test Data")
print(le.inverse_transform(clf.predict(x_test)))
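The syntax error itself comes from slicing nothing: a slice such as [:-1] has to be applied to an object. A small sketch with a made-up list of rows (standing in for x_vector) shows what the two slices select:

```python
# Made-up feature rows standing in for x_vector.
rows = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]

# Everything except the last row becomes the training set.
x_train = rows[:-1]

# The last row only, kept as a list of rows so it is still two-dimensional.
x_test = rows[-1:]

print(len(x_train))
print(x_test)
```

Using [-1:] rather than [-1] keeps the test set as a sequence of rows, which is the shape predict() expects.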

Converting python SVM text classifier to Tensorflow model

I have written Python code for a multi-class text classifier using SVM, and now I want to run it in an Android application. From what I have read, TensorFlow Lite is useful in this scenario. How should I proceed to convert my Python code to TensorFlow Lite? What steps do I need to follow?
Below is the code for the SVM classifier:
import pandas as pd
import numpy as np
import tensorflow as tf
from collections import Counter
from sklearn import model_selection, preprocessing, linear_model, naive_bayes, metrics, svm
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.svm import SVC
column_names = ['text', 'labels']
data = pd.read_csv("newdataset.csv", names = column_names, index_col = False)
train_x, test_x, train_y, test_y = model_selection.train_test_split(data.text,data.labels,test_size = 0.5 ,random_state = 0)
count_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}',max_features=100)
count_vect.fit(data.text)
xtrain_count = count_vect.transform(train_x)
xtest_count = count_vect.transform(test_x)
tfidf_vect = TfidfTransformer()
xtrain_tfidf = tfidf_vect.fit_transform(xtrain_count)
xtest_tfidf = tfidf_vect.transform(xtest_count)  # reuse the IDF weights fitted on the training counts
clf = svm.SVC(kernel='linear')
clf.fit(xtrain_tfidf, train_y)
predicted = clf.predict(xtest_tfidf)
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
print(confusion_matrix(test_y,predicted))
print(classification_report(test_y,predicted))
print(accuracy_score(test_y,predicted))

sklearn classifier get ValueError: bad input shape (3529, 12)

I have a JSON file containing preprocessed data that has already been converted to vectors. How can I train on this data using the SVM classification method?
"Vector" is the name of one column.
The remaining columns hold values: the genres for each entry in the Vector column.
import pickle
from nltk.corpus import stopwords
import string
from nltk.stem import SnowballStemmer
import pandas as pd
from sklearn import svm
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn import metrics
stopwords=set(stopwords.words("english"))
exclude = set(string.punctuation)
snow=SnowballStemmer("english")
tvec = pickle.load(open("dataPackage/tfidf.pickle", 'rb'))
data=pd.read_json("dataPackage/finalData.json",orient = 'split')
inputLen = len(data["Vector"].iloc[0])
X = list(data["Vector"])
y = list(data.drop(["Vector"],axis = 1).values)
np.shape(X)
np.shape(y)
X_train, X_test, y_train, y_test = train_test_split(np.array(X), np.array(y), test_size=0.3,random_state=109)
model = svm.SVC(kernel='linear')
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

SHAP KernelExplainer error on textual data using pipeline

I was looking through the SHAP package for Python and found no examples using KernelExplainer to explain predictions on textual data, so I decided to test it out using the dataset I found on https://www.superdatascience.com/machine-learning/.
I encountered a problem in the KernelExplainer part at the last bit, where I believe the problem is the way I input the data and model into the explainer.
AttributeError: 'numpy.ndarray' object has no attribute 'lower'
Can anyone advise me on what I should revise so as to make the explainer work? I spent hours on this last bit but to no avail. Any help or advice is greatly appreciated. With much thanks!
Dataset: https://drive.google.com/file/d/1-pzY7IQVyB_GmT5dT0yRx3hYzOFGrZSr/view?usp=sharing
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
import re
import nltk
#Load the data
os.chdir('C:\\Users\\Win\\Desktop\\MyLearning\\Explainability\\SHAP')
review = pd.read_csv('Restaurant_Reviews.tsv', sep='\t')
#Clean the data
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
def clean_text(df_text_column, data):
    corpus = []
    for i in range(0, len(data)):
        text = re.sub('[^a-zA-Z]', ' ', df_text_column[i])
        text = text.lower()
        text = text.split()
        ps = PorterStemmer()
        text = [ps.stem(word) for word in text if not word in set(stopwords.words('english'))]
        text = ' '.join(text)
        corpus.append(text)
    return corpus
X = pd.DataFrame({'Review':clean_text(review['Review'],review)})['Review']
y = review['Liked']
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
# Creating the pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer()
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
from sklearn.pipeline import make_pipeline
np.random.seed(0)
rf_pipe = make_pipeline(vect, rf)
rf_pipe.steps
rf_pipe.fit(X_train, y_train)
y_pred = rf_pipe.predict(X_test)
y_prob = rf_pipe.predict_proba(X_test)
#Performance Metrics
from sklearn import metrics
metrics.accuracy_score(y_test, y_pred) #Accuracy
metrics.roc_auc_score(y_test, y_prob[:, 1]) #ROC-AUC score
# use Kernel SHAP to explain test set predictions
import shap
explainer = shap.KernelExplainer(rf_pipe.predict_proba, X_train, link="logit")
shap_values = explainer.shap_values(X_test, nsamples=100)
# plot the SHAP values
shap.force_plot(explainer.expected_value[0], shap_values[0][0,:], X_test.iloc[0,:], link="logit")

Get pandas Series from CSV

I am totally new to machine learning and am currently playing with MNIST, using RandomForestClassifier.
I use sklearn and pandas.
I have a training CSV data set.
import pandas as pd
import numpy as np
from sklearn import model_selection
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
train = pd.read_csv("train.csv")
features = train.columns[1:]
X = train[features]
y = train['label']
user_train = pd.read_csv("input.csv")
user_features = user_train.columns[1:]
y_train = user_train[user_features]
user_y = user_train['label']
X_train, X_test, y_train, y_test = model_selection.train_test_split(X/255.,y,test_size=1,random_state=0)
clf_rf = RandomForestClassifier()
clf_rf.fit(X_train, y_train)
y_pred_rf = clf_rf.predict(X_test)
acc_rf = accuracy_score(y_test, y_pred_rf)
print("pred : ", y_pred_rf)
print("random forest accuracy: ",acc_rf)
The code above works well. It takes the training set, splits it, holds out one element for testing, and runs the prediction.
What I want now is to take the test data from an input file: I have a new CSV called "input.csv", and I want to predict the values inside it.
How can I replace model_selection.train_test_split with my input data?
I am sure the answer is very obvious, but I didn't find anything.
The following part of your code is unused:
user_train = pd.read_csv("input.csv")
user_features = user_train.columns[1:]
y_train = user_train[user_features]
user_y = user_train['label']
If input.csv has the same structure as train.csv, you may want to:
train a classifier and test it on a split of the input.csv dataset (see http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html for how to set the test size):
input_train = pd.read_csv("input.csv")
input_features = input_train.columns[1:]
input_data = input_train[input_features]
input_labels = input_train['label']
data_train, data_test, labels_train, labels_test = model_selection.train_test_split(input_data/255.,input_labels,test_size=1,random_state=0)
clf_rf = RandomForestClassifier()
clf_rf.fit(data_train, labels_train)
labels_pred_rf = clf_rf.predict(data_test)
acc_rf = accuracy_score(labels_test, labels_pred_rf)
test the previously trained classifier on the whole input.csv file (scaling the pixels the same way as during training):
input_train = pd.read_csv("input.csv")
input_features = input_train.columns[1:]
input_data = input_train[input_features]
input_labels = input_train['label']
labels_pred_rf = clf_rf.predict(input_data/255.)
acc_rf = accuracy_score(input_labels, labels_pred_rf)
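For reference, the second option can be sketched end to end without the CSV files. The arrays below are randomly generated stand-ins for train.csv and input.csv, and the "bright image" labelling rule is invented purely so the example has something to learn; only the fit/predict flow and the shared /255 scaling are the point.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Stand-in for train.csv: 200 rows of 16 pixel values in 0..255.
# Made-up rule: label 1 when the row is bright on average, else 0.
X = rng.integers(0, 256, size=(200, 16))
y = (X.mean(axis=1) > 127).astype(int)

# Stand-in for input.csv: new rows with the same columns.
input_data = rng.integers(0, 256, size=(20, 16))
input_labels = (input_data.mean(axis=1) > 127).astype(int)

# Train on the whole training set; no train_test_split is needed here.
clf_rf = RandomForestClassifier(random_state=0)
clf_rf.fit(X / 255.0, y)

# Apply the same scaling to the new data before predicting.
labels_pred_rf = clf_rf.predict(input_data / 255.0)
acc_rf = accuracy_score(input_labels, labels_pred_rf)
print("pred :", labels_pred_rf)
print("random forest accuracy:", acc_rf)
```

If input.csv has no label column, the accuracy_score line simply drops out and labels_pred_rf holds the predictions.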
