I am trying to make a script that takes a JSON file (pizza-train.json) from this Kaggle competition. I want to extract the request_text field from each dictionary in the list and construct a bag-of-words representation of each string (string to count-list).
The next step is to train a logistic regression classifier to predict the variable "requester_received_pizza". I want to train on 90% of the data and predict the remaining 10%. The problem is that I don't know how to predict the 10%. Any advice would be really helpful!
import json
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

# Load the data and pull out the text field and the target label
f_json = json.load(open('pizza-train.json'))
request_text = []
y = []
for item in f_json[:100]:
    request_text.append(item['request_text'])
    y.append(item['requester_received_pizza'])

# Bag-of-words representation of the request texts
vectorizer = CountVectorizer(min_df=1, lowercase=True, stop_words='english')
train_data_features = vectorizer.fit_transform(request_text)
train_data_features = train_data_features.toarray()
print('Shape = ')
print(train_data_features.shape)
vocab = vectorizer.get_feature_names_out()
print('\nVocab = ')
print(vocab)

# Hold out 10% of the data for testing
x_train, x_test, y_train, y_test = train_test_split(train_data_features, y, test_size=0.10)
You might do it like this:
from sklearn.linear_model import LogisticRegression

alg = LogisticRegression()
alg.fit(x_train, y_train)
test_score = alg.score(x_test, y_test)
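score gives you the accuracy on the held-out 10%. If you want the predicted labels (or probabilities) themselves, here is a small sketch reusing the variables above:

# predicted labels for the held-out 10%
predictions = alg.predict(x_test)
# per-request class probabilities, if you prefer a confidence score
probabilities = alg.predict_proba(x_test)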
You should read the sklearn docs on logistic regression and cross-validation, which are very good and describe more sophisticated methods for validating your models. This tutorial for the Kaggle Titanic competition might also be useful.
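For instance, a minimal sketch of k-fold cross-validation with cross_val_score, assuming the train_data_features and y built above:

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

# 5-fold cross-validated accuracy instead of a single 90/10 split
scores = cross_val_score(LogisticRegression(max_iter=1000), train_data_features, y, cv=5)
print('CV accuracy: %.3f +/- %.3f' % (scores.mean(), scores.std()))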
I'm a data science newbie and I'm trying to use TfidfVectorizer with RandomForestClassifier to predict a binary "yes/no" outcome on a string like so:
import pandas as pd
from nltk.corpus import stopwords
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

df = pd.read_csv('~/Downloads/New_Query_2019_12_04.csv', usecols=['statement', 'result'])
df = df.head(100)

# remove non-values
df = df.dropna()

tfidfconverter = TfidfVectorizer(
    max_features=1500,
    min_df=5,
    max_df=0.7,
    stop_words=stopwords.words('english'))
X = tfidfconverter.fit_transform(df['statement']).toarray()
y = df['result'].values

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=0)

classifier = RandomForestClassifier(n_estimators=1000, random_state=0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
All of this appears to work great, but I'm stuck on how to predict a phrase against the model. When I do something like:
good_string = preprocess_string('This is a good sentence')
tfidfconverter = TfidfVectorizer()
X = tfidfconverter.fit_transform([good_string]).toarray()
y_pred = classifier.predict(X)
I get the error "Number of features of the model must match the input."
I also tried fitting the string with my previous TfidfVectorizer:
tfidfconverter = TfidfVectorizer(
    max_features=1500,
    min_df=5,
    max_df=0.7,
    stop_words=stopwords.words('english'))
X = tfidfconverter.fit_transform([good_string]).toarray()
but I got the error "max_df corresponds to < documents than min_df". I think I'm just a bit confused about how to make the feature array of the single string match the number of features in my model. Any help would be greatly appreciated.
The issue was that I was running it through a different vectorizer with the same constructor params:
tfidfconverter = TfidfVectorizer(
    max_features=1500,
    min_df=5,
    max_df=0.7,
    stop_words=stopwords.words('english'))
instead of using the same vectorizer I used when fitting the documents here:
X = tfidfconverter.fit_transform(df['statement']).toarray()
I also should not have been attempting to fit the data I was trying to predict, but ONLY transform it.
X = tfidfconverter.transform([good_string]).toarray()
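Putting both fixes together, here is a minimal sketch; it reuses the tfidfconverter and classifier fitted on the training data above, plus the preprocess_string helper from the question:

# fit the vectorizer once, on the training documents only
X = tfidfconverter.fit_transform(df['statement']).toarray()
classifier.fit(X, df['result'].values)

# at prediction time: transform (never fit) with the same vectorizer
good_string = preprocess_string('This is a good sentence')
features = tfidfconverter.transform([good_string]).toarray()
y_pred = classifier.predict(features)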
I am classifying text into 2 categories: one is imperatives, the other is non-imperatives. I prepared my text the way the Naive Bayes classifier needs, but now I also need to use SVM. What should I do here? (I need to classify the text and calculate the accuracy, too.) Thank you for reading and trying to answer my questions.
all_words_list = [word for (sent, cat) in train for word in sent]
all_words = nltk.FreqDist(all_words_list)
word_items = all_words.most_common(1000)
word_features = [word for (word, count) in word_items]

def document_features(document, word_features):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features

featuresets = [(document_features(d, word_features), c) for (d, c) in train]
train_set, test_set = featuresets[360:], featuresets[:360]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
I would suggest first dividing your dataset into train and test sets properly. X contains the feature variables and Y the response variable, and we split them 70%/30%:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state=101, test_size=0.3)
then:
from sklearn import svm
from sklearn import metrics

# in the sklearn docs you can find more about SVM parameters
model = svm.SVC(kernel='rbf', C=10000.0, gamma='auto')
model = model.fit(X_train, y_train)
print('Accuracy is ', round(metrics.accuracy_score(model.predict(X_test), y_test), 2))
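Since your features are NLTK-style dicts, you first need numeric arrays for sklearn. A minimal sketch, assuming the featuresets variable from your code; DictVectorizer turns the boolean feature dicts into a numeric matrix:

from sklearn.feature_extraction import DictVectorizer

# split the (features, label) pairs and vectorize the dicts
dv = DictVectorizer(sparse=True)
X = dv.fit_transform([feats for (feats, label) in featuresets])
Y = [label for (feats, label) in featuresets]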
Getting straight to the point:
1) My goal was to apply NLP and a machine learning algorithm to classify a dataset of sentences into 5 different categories (numeric). For example: "I want to know details of my order" -> 1.
Code:
import numpy as np
import pandas as pd

dataset = pd.read_csv('Ecom.tsv', delimiter = '\t', quoting = 3)

import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

corpus = []
for i in range(0, len(dataset)):
    review = re.sub('[^a-zA-Z]', ' ', dataset['User'][i])
    review = review.lower()
    review = review.split()
    ps = PorterStemmer()
    review = [ps.stem(word) for word in review if not word in set(stopwords.words('english'))]
    review = ' '.join(review)
    corpus.append(review)

# Creating the Bag of Words model
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
X = cv.fit_transform(corpus).toarray()
y = dataset.iloc[:, 1].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)

# Fitting Naive Bayes to the Training set
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
Everything works fine here; the model trains well and predicts correct results for the test data.
2) Now I wanted to use this trained model to predict the category of a new sentence, so I pre-processed the text the same way I did for my dataset.
Code:
#Pre processing the new input
new_text = "Please tell me the details of this order"
new_text = new_text.split()
ps = PorterStemmer()
processed_text = [ps.stem(word) for word in new_text if not word in set(stopwords.words('english'))]
vect = CountVectorizer()
Z = vect.fit_transform(processed_text).toarray()
classifier.predict(Z)
ValueError: operands could not be broadcast together with shapes (4,4) (33,)
The only thing I am able to understand is that when I transformed my corpus the first time I trained my model, the shape of the numpy array was (18, 33). The second time, when I try to predict for a new input and transform processed_text using fit_transform(), the numpy array shape is (4, 4).
I am not able to figure out which step I applied incorrectly. What could be the resolution? Thanks in advance! :)
You got the problem right!
Say you have a corpus made of 33 different words; then your bag of words at training time will have 33 columns. Now you are using another corpus which has only 4 different words, so you end up with a matrix with 4 columns, and the model won't like that! Hence you need to map the second corpus onto the same bag-of-words matrix you had at the beginning, with 33 columns. There are different ways to do this, well explained here.
For example, one way is to save the vectorizer object you fit() at training time and then apply it at test time with transform() only!
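A minimal sketch, reusing the cv and classifier objects fitted above; note the stemmed words also need to be joined back into one string, because the vectorizer expects whole documents, not individual words:

# join the stemmed tokens into a single document string
new_doc = ' '.join(processed_text)
# transform (do NOT fit) with the vectorizer fitted on the training corpus
Z = cv.transform([new_doc]).toarray()   # shape (1, 33), matching the model
print(classifier.predict(Z))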
The code below generates a word2vec model and uses it to train a naive Bayes classifier.
I am able to generate the word2vec model and use the similarity functions successfully. As a next step I want to use the word2vec vectors to train the naive Bayes classifier. Currently the code gives an error when I try to split the data into test and training sets. How do I convert the word2vec model into an array so that it can be used as training data?
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import gensim

# Importing the dataset
dataset = pd.read_csv('Restaurant_Reviews.tsv', delimiter = '\t', quoting = 3)

# Cleaning the texts
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

corpus = []
for i in range(0, 1000):
    review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i])
    review = review.lower()
    review = review.split()
    ps = PorterStemmer()
    review = [ps.stem(word) for word in review if not word in set(stopwords.words('english'))]
    # for word2vec we want an array of vectors
    corpus.append(review)
#print(corpus)

X = gensim.models.Word2Vec(corpus, min_count=1, size=1000)
#print (X.most_similar("love"))

#embedding_matrix = np.zeros(len(X.wv.vocab), dtype='float32')
#for i in range(len(X.wv.vocab)):
#    embedding_vector = X.wv[X.wv.index2word[i]]
#    if embedding_vector is not None:
#        embedding_matrix[i] = embedding_vector

# Creating the Bag of Words model
#from sklearn.feature_extraction.text import CountVectorizer
#cv = CountVectorizer(max_features = 1500)
#X = cv.fit_transform(corpus).toarray()

y = dataset.iloc[:, 1].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)

# Fitting Naive Bayes to the Training set
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
It gives an error on these lines:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)
TypeError: Expected sequence or array-like, got <class 'gensim.models.word2vec.Word2Vec'>
Word2Vec provides word embeddings only. If you want to characterize documents by embeddings, you'll need to perform an averaging/summing/max operation over the embeddings of all the words in each document to get one D-dimensional vector per document that can be used for classification. See here and there for further information on this.
Otherwise, you can use a Doc2Vec model to directly produce document embeddings, for which gensim also provides a very good implementation.
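For example, a minimal sketch of the averaging approach, assuming the corpus (a list of token lists) and the Word2Vec model X from the question, and the gensim 4.x API (where the size parameter is called vector_size):

import numpy as np

def document_vector(model, tokens):
    # average the embeddings of the words the model knows about
    vectors = [model.wv[w] for w in tokens if w in model.wv]
    if not vectors:
        return np.zeros(model.vector_size)
    return np.mean(vectors, axis=0)

# one D-dimensional row per review; this array can go into train_test_split
X_features = np.array([document_vector(X, doc) for doc in corpus])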
You have vectors for each word; now you have two approaches to move forward. One could be to simply take the average of all the word vectors in a sentence to get the sentence vector; another could be to weight them with tf-idf.
I implemented the average approach in one of my ongoing projects and I am sharing the GitHub link; please go to the heading "text vectorization(word2vec)" and you will find the code there.
https://github.com/abhibhargav29/SentimentAnalysis/blob/master/SentimentAnalysis.ipynb. I would, however, suggest you read the data cleaning part before as well, to understand it better.
One important piece of advice: do not split the data into train, cv, and test after vectorization; do it before vectorization, or you will overfit the model.
I'm trying to train a logistic classifier. My dataset has the following columns:
name, review, rating, reviews_cleaned, word_count, sentiment
The sentiment is either +1 or -1 based on whether the rating is greater than 3 or not. word_count contains a dict of words with their occurrences, and reviews_cleaned is just the reviews stripped of punctuation.
This is my code to train a LogisticClassifier.
train_data, test_data = train_test_split(products, test_size = 0.2)
sentiment_model = LogisticRegression(penalty='l2', C=1)
sentiment_model.fit(products['sentiment'], products['word_count'])
I get the following error,
ValueError: Found input variables with inconsistent numbers of samples: [1, 166752]
PS: The equivalent statement using GraphLab Create is
sentiment_model = graphlab.logistic_classifier.create(train_data,
                                                      target='sentiment',
                                                      features=['word_count'],
                                                      validation_set=None)
What am I doing wrong?
Your training data looks like it's a 1-dimensional vector, but sklearn requires it to be 2-dimensional; if you reshape it you should be okay. Note also that fit expects fit(X, y), so you have the arguments swapped. Finally, you make your train/test split but you're not actually using the data that you produce (fit with train_data instead).
Using GraphLab in that course is very irritating to say the least. Give this a whirl:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer

df = pd.read_csv('amazon_baby.csv', header = 0)
df.dropna(how="any", inplace= True)
products = df[df['rating'] != 3]  # drop the products with a 3-star rating
products['sentiment'] = products['rating'] >= 4

X_train, X_test, y_train, y_test = train_test_split(products['review'], products['sentiment'], test_size = .2, random_state = 0)

# fit the vectorizer on the training reviews only, then transform the test reviews
vect = CountVectorizer()
X_train = vect.fit_transform(X_train.values)
X_test = vect.transform(X_test.values)

model = LogisticRegression(penalty ='l2', C = 1)
model.fit(X_train, y_train)
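To classify a brand-new review later, transform it with the same fitted vectorizer; a small sketch with a made-up example string:

# transform only -- the vocabulary was fixed when vect was fit on X_train
new_review = vect.transform(['This product is fantastic'])
print(model.predict(new_review))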
I'm not sure what the direct translation between Sklearn/Pandas and GraphLab is, but this looks like it's what they are doing.
When I score the model, I get:
model.score(X_test, y_test)
> .93155480
Let me know what results you get or if this works for you.