Text prediction with Python

Text prediction with Python - python

My sample data are as follows.
I have a bigger dataset.
In this dataset I want to predict the next date text.
The codes I want to run are like this.
import pandas as pd
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from sklearn.model_selection import train_test_split,GridSearchCV,cross_val_score
from sklearn.metrics import mean_squared_error,r2_score
from sklearn import model_selection
from sklearn.neighbors import KNeighborsRegressor
from warnings import filterwarnings
from sklearn.metrics import mean_squared_error, mean_absolute_error
from tensorflow.keras.callbacks import EarlyStopping
filterwarnings('ignore')
data = pd.read_excel("datasets.xlsx")
data.head()
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.2,random_state=10)
model = Sequential()
model.add(Dense(30,activation="relu"))
model.add(Dense(15,activation="relu"))
model.add(Dense(15,activation="relu"))
model.add(Dense(15,activation="relu"))
model.add(Dense(1))
model.compile(optimizer="adam",loss="mse")
model.fit(x=x_train, y = y_train,validation_data=(x_test,y_test),batch_size=50,epochs=300)
I don't know exactly how to separate x and y here.I tried separating x and y as columns before, but
I was not successful in this dataset.
In this dataset, I need to separate x and y as rows.
As you can see I could only create the model
Also, do I need to use a tokenizer here? Anyone who can help with a sample code?

Related

ValueError: Iterable over raw text documents expected, string object received

I have this model and I tried to make a simple interface for it using streamlit. It follows the same transformation steps that were undertaken during the training phase so I don't understand what's wrong here. I supose it has to do with streamlit input and that I need to transform my input somehow, but I couldn't figure it out. Any help will be appreciated, thanks!
here is the code:
import streamlit as st
import numpy as np
import pickle
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
import joblib
import pandas as pd
data=pd.read_csv('IMDB Dataset.csv')
train, test= train_test_split(data, test_size=0.2, random_state=42)
Xtrain, ytrain = train['review'], train['sentiment']
Xtest, ytest = test['review'], test['sentiment']
model = joblib.load('model.pkl')
st.title('Analisis Sentimen')
txt = st.text_input('masukkan teks yang ingin dianalisis')
tf = TfidfVectorizer()
tfdf = tf.fit_transform(Xtrain)
vect = pd.DataFrame(tf.transform(txt).toarray())
txt = pd.DataFrame(vect)
pred = model.predict(txt)
print(pred)
st.write(pred)

You have to pass an if statement to txt before proceeding with the rest of the execution otherwise you will always encounter ValueError after this error is fixed. Now visit your vect variable, transform() is expecting raw document, meaning an iterable input
which contains a single element. So you will have to convert the input which is a str by default into a list of string and after that, pass it to transform() as the parameter.
import streamlit as st
import numpy as np
import pickle
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
import joblib
import pandas as pd
data = pd.read_csv('IMDB Dataset.csv')
train, test = train_test_split(data, test_size=0.2, random_state=42)
Xtrain, ytrain = train['review'], train['sentiment']
Xtest, ytest = test['review'], test['sentiment']
model = joblib.load('model.pkl')
st.title('Analisis Sentimen')
txt = st.text_input('masukkan teks yang ingin dianalisis')
tf = TfidfVectorizer()
tfdf = tf.fit_transform(Xtrain)
if txt is not None:
raw_doc = [txt]
vect = pd.DataFrame(tf.transform(raw_doc).toarray())
txt = pd.DataFrame(vect)
pred = model.predict(txt)
print(pred)
st.write(pred)

UnimplementedError: Graph execution error, in google Collab with keras

Hi guys im trying to do some AI text classification with Keras and is giving me this error. Probably my layers are bad or something like that but dont really know the "Unimplemented" error.
This is my code:
history = model.fit(X_train, y_train,
epochs=100,
verbose=True,
validation_data=(X_test, y_test),
batch_size=10)
The error is:
`
UnimplementedError: Graph execution error:
Detected at node 'binary_crossentropy/Cast' defined at (most recent call last)
`
Don´t know why is this happening.
Rest of the code:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import RandomizedSearchCV
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras import layers
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier
import os
# print(os.listdir("../input"))
plt.style.use('ggplot')
filepath_dict = {'films': 'reviews_filmaffinity.scv'}
df_list = []
for source, filepath in filepath_dict.items():
df = pd.read_table('reviews_filmaffinity.csv', sep='\|\|', header=0, engine='python')
df['source'] = source
df_list.append(df)
df = pd.concat(df_list)
df_films = df[df['source'] == 'films']
df_films['texto'] = df_films['review_title'] + ' ' + df_films['review_text']
sentences = df_films['texto'].values
df_films['polaridad'] = df['review_rate'].apply(lambda x: 'positivo' if x > 6
else ('negativo' if x < 4
else 'neutro'))
y = df_films['polaridad'].values
sentences_train, sentences_test, y_train, y_test = train_test_split(sentences, y, test_size=0.2, random_state=0)
vectorizer = CountVectorizer()
vectorizer.fit(sentences_train)
X_train = vectorizer.transform(sentences_train)
X_test = vectorizer.transform(sentences_test)
X_train
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
score = classifier.score(X_test, y_test)
print("Accuracy:", score)
input_dim = X_train.shape[1] # Number of features
model = Sequential()
model.add(layers.Dense(10, input_dim=input_dim, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
I searched online but i dont figured out how to fix that... its driving me crazy

Invalid Syntax Error in a certain line of code in python Decision Tree algorithm

Following is my code
I am running it on IDLE python 3.8
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction import DictVectorizer
from sklearn import trees
from sklearn.metrics import accuracy_score,classification_report
import warnings
from sklearn.preprocessing import StandardScalar
from sklearn.neural_networks import MLPClassifier
warnings.filterwarnings(action='ignore',category=DeprecationWarning)
data=pd.read_csv('data.csv')
cols_to_retain=[]
x-feature=data[cols_to_retain]
x_dict=x_feature.T.to_dict.values()
vect=DictVectorizer(sparse=False)
x_vector=vect.fit_transform(x_dict)
print(x_vector)
x_train=[:-1]
x_test=[-1:]
print('Train set')
print(x_train)
print('Test set')
print(x_test)
le=LabelEncoder
y_train=le.fit_transform(data['Goal'][:-1])
clf=tree.DecisionTreeClassifier(criteron='entropy')
clf=clf.fit_transform(x_train,y_train)
print('Test Data')
print(le.inverse_transform(clf.predict(x_test)))
It shows me error for these particular lines
It only says invalid syntax error
x_train=[:-1]
x_test=[-1:]
packages are imported correctly

Your code contains multiple issues:
The import should be StandardScaler not StandardScalar,
You got unused imports like MLPClassifier,
cols_to_retrain is empty. Thus, data[cols_to_retrain] will return an empty data frame,
to_dict should be to_dict(),
variable names x-feature and x_feature do not match,
LabelEncoder is missing brackets (),
x_train=[:-1] and x_test=[-1:] is not valid. You probably wanted to select a subset like x_train = x_vector[:-1] or x_test = x_vector[-1:]. Please add additional sample data, if you need help with this selection.
Here is an updated version of your code:
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier
data = pd.read_csv("data.csv")
print(data)
cols_to_retain = []
x_feature = data[cols_to_retain]
x_dict = x_feature.T.to_dict().values()
vect = DictVectorizer(sparse=False)
x_vector = vect.fit_transform(x_dict)
print(x_vector)
x_train = x_vector[:-1]
x_test = x_vector[-1:]
print("Train set")
print(x_train)
print("Test set")
print(x_test)
le = LabelEncoder()
y_train = le.fit_transform(data["Goal"][:-1])
clf = DecisionTreeClassifier(criteron="entropy")
clf = clf.fit_transform(x_train, y_train)
print("Test Data")
print(le.inverse_transform(clf.predict(x_test)))

n_splits=10 cannot be greater than the number of members in each class

I'm trying to perform a model selection between KNN and Logistic Regression using sampling technique as 10-fold cross validation, but keep getting the above error after the last part. Could I please be advised on what I'm doing wrong? Thanks.
Here is my code:
import pandas as pd
import numpy as np
import sklearn.model_selection
from sklearn.model_selection import cross_val_score
#Load data
mtcars_df = pd.read_csv('mtcars.csv')
mtcars_df.head()
X = mtcars_df.iloc[:,1:].values
Y = mtcars_df['model'].values
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)
print(cross_val_score(knn,X,Y, cv=10, scoring='accuracy').mean())

How to improve accuracy_score in machine learning python for this regression problem?

I am a beginner to machine learning and as part of learning I choose student performance dataset from UCI. I want to predict the final result of a student based on the features given.
I first tried using two main and highly correlated features G1 and G2 that are grades of two exams. I used LinearRegression algorithm and got an accuracy of 0.4 or less.
Then I tried feature engineering on all the features that are objects in dataframe and still the accuracy is same.
How can I improve accuracy score ?
My code as a Python notebook
from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import ElasticNet
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, mean_absolute_error, median_absolute_error,accuracy_score
df = pd.read_csv('student-mat.csv',sep=';')
df2 = pd.read_csv('student-por.csv',sep=';')
df = [df,df2]
df = pd.concat(df)
df = pd.get_dummies(df)
X = df.drop('G3',axis=1)
y = df['G3']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.1, random_state=42)
model = LinearRegression()
model.fit(X_train,y_train)
y_pred = model.predict(X_test)
y_pred = [int(round(i)) for i in y_pred]
accuracy_score(y_test,y_pred)

The accuracy calculated on continous variables is not very useful. You can use the mean squared error instead, which is relevant for continuous output.
As for improving your model, you can try to use the different tools at your disposal to identify the most relevant features. I recommend statsmodels API (https://www.statsmodels.org/stable/regression.html) to get a more in-depth analysis.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Text prediction with Python - python

Related

ValueError: Iterable over raw text documents expected, string object received

UnimplementedError: Graph execution error, in google Collab with keras

Invalid Syntax Error in a certain line of code in python Decision Tree algorithm

n_splits=10 cannot be greater than the number of members in each class

How to improve accuracy_score in machine learning python for this regression problem?

Categories

Resources