I am trying to do sentiment analysis for text. I have 909 phrases commonly used in emails, and I scored each of them out of ten for how angry it is when read in isolation. Now, I upload this .csv file to a Jupyter Notebook, where I import the following modules:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
Now, I read in both columns, naming them 'Phrase' and 'Anger':
df=pd.read_csv('Book14.csv', names=['Phrase', 'Anger'])
df_x = df['Phrase']
df_y = df['Anger']
Subsequently, I split this data such that 20% is used for testing and 80% is used for training:
x_train, x_test, y_train, y_test = train_test_split(df_x, df_y, test_size=0.2, random_state=4)
Now, I convert the words in x_train to numerical data using TfidfVectorizer:
tfidfvectorizer = TfidfVectorizer(analyzer='word', stop_words='english')
x_traincv = tfidfvectorizer.fit_transform(x_train.astype('U'))
Now, I convert x_traincv to an array:
a = x_traincv.toarray()
I also vectorize x_test and convert it to a numerical array:
x_testcv=tfidfvectorizer.fit_transform(x_test)
x_testcv = x_testcv.toarray()
Now, I fit a Multinomial Naive Bayes model and evaluate it on the test phrases:
mnb = MultinomialNB()
b = np.array(y_test)
error_score = 0
for i in range(len(x_test)):
    mnb.fit(x_testcv, y_test)
    testmessage = x_test.iloc[i]
    predictions = mnb.predict(x_testcv[i].reshape(1, -1))
    error_score = error_score + (predictions - int(b[i]))**2
    print(testmessage)
    print(predictions)
print(error_score/len(x_test))
However, here is a sample of the results I get:
Bring it back
[0]
It is greatly appreciatd when
[0]
Apologies in advance
[0]
Can you please
[0]
See you then
[0]
I hope this email finds you well.
[0]
Thanks in advance
[0]
I am sorry to inform
[0]
You’re absolutely right
[0]
I am deeply regretful
[0]
Shoot me through
[0]
I’m looking forward to
[0]
As I already stated
[0]
Hello
[0]
We expect all students
[0]
If it’s not too late
[0]
and this repeats on a large scale, even for phrases that are obviously very angry. When I removed all rows containing a 0 from the .csv file, the new modal value (10) became the only prediction for my sentences. Why is this happening? Is it some weird way of minimising error? Are there any inherent flaws in my code? Should I take a different approach?
Two things. First, you are fitting the MultinomialNB with the test set: in your loop you have mnb.fit(x_testcv,y_test), but you should do mnb.fit(x_traincv,y_train).
Second, when performing the pre-processing you should call fit_transform only on the training data; on the test data you should call only the transform method, so the test phrases are encoded with the vocabulary learned from the training set.
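A minimal sketch of the corrected flow, assuming the same Book14.csv file, column names, and 80/20 split as in the question:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

df = pd.read_csv('Book14.csv', names=['Phrase', 'Anger'])
x_train, x_test, y_train, y_test = train_test_split(
    df['Phrase'], df['Anger'], test_size=0.2, random_state=4)

vectorizer = TfidfVectorizer(analyzer='word', stop_words='english')
x_traincv = vectorizer.fit_transform(x_train.astype('U'))  # fit on training data only
x_testcv = vectorizer.transform(x_test.astype('U'))        # reuse the fitted vocabulary

mnb = MultinomialNB()
mnb.fit(x_traincv, y_train)          # train on the training split
predictions = mnb.predict(x_testcv)  # predict on the held-out split
print(np.mean((predictions - np.array(y_test, dtype=int)) ** 2))  # mean squared error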
I want to use RandomForestClassifier for sentiment classification. The x column contains text strings, so I used LabelEncoder to convert the strings into numbers. The y column already contains numbers. My code is this:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
from sklearn.preprocessing import LabelEncoder
data = pd.read_csv('data.csv')
x = data['Reviews']
y = data['Ratings']
le = LabelEncoder()
x_encoded = le.fit_transform(x)
x_train, x_test, y_train, y_test = train_test_split(x_encoded,y, test_size = 0.2)
x_train = x_train.reshape(-1,1)
x_test = x_test.reshape(-1,1)
clf = RandomForestClassifier(n_estimators=100)
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)
Then I printed out the accuracy like below:
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
And here's the output:
Accuracy: 0.5975
I have read that random forests have high accuracy because of the number of decision trees participating in the process. But I think the accuracy here is much lower than it should be. I have looked at some similar questions on Stack Overflow, but I couldn't find a solution to my problem.
Is there any problem in my code using the random forest library? Or are there exceptional cases to be aware of when using random forests?
It is not a problem with random forests or the library; it is rather a problem with how you transform your text input into a feature vector.
What LabelEncoder does is: given some labels like ["a", "b", "c"], it transforms those labels into numeric values between 0 and n-1, with n being the number of distinct input labels. However, I assume Reviews contains free text rather than pure labels, so to speak. This means all your reviews (unless 100% identical) are transformed into different labels, which effectively leaves your classifier doing random things with that input. So you need something different to transform your textual input into numeric input that random forests can work with.
As a simple start, you can try something like TF-IDF or a simple count vectorizer. Those are available in sklearn (https://scikit-learn.org/stable/modules/feature_extraction.html, section 6.2.3, "Text feature extraction"). There are more sophisticated ways of transforming texts into numeric vectors, but that should be a good start for understanding what has to happen conceptually.
A last important note is that you should fit those vectorizers only on the training set, not on the full dataset. Otherwise you might leak information from training into evaluation/testing. A good way of doing this is to build an sklearn pipeline that consists of a feature-transformation step followed by the classifier.
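A minimal sketch of such a pipeline, assuming the same data.csv with Reviews and Ratings columns as in the question (the TfidfVectorizer defaults here are only a starting point):
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

data = pd.read_csv('data.csv')
x_train, x_test, y_train, y_test = train_test_split(
    data['Reviews'], data['Ratings'], test_size=0.2)

# the pipeline fits the vectorizer on the training fold only,
# then applies the fitted vocabulary to whatever it predicts on
clf = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('forest', RandomForestClassifier(n_estimators=100)),
])
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))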
I want to create a text classifier that looks at research abstracts and determines whether they are focused on access to care, based on a labeled dataset I have. The data source is an Excel spreadsheet with three fields (project_number, abstract, and accessclass) and 326 rows of abstracts. The accessclass is 1 for access-related and 0 for not access-related (not sure if this is relevant). Anyway, I tried following along with a tutorial but wanted to make it relevant by adding my own data, and I'm having some issues with my X and Y arrays. Any help is appreciated.
import pandas as pd
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn import naive_bayes
from sklearn.metrics import roc_auc_score
df = pd.read_excel("accessclasses.xlsx")
df.head()
#TFIDF vectorizer
stopset = set(stopwords.words('english'))
vectorizer = TfidfVectorizer(use_idf=True, lowercase=True,
strip_accents='ascii', stop_words=stopset)
y = df.accessclass
x = vectorizer.fit_transform(df)
print(x.shape)
print(y.shape)
#above and below seem to be where the issue is.
x_train, x_test, y_train, y_test = train_test_split(x, y)
You are using your whole dataframe to encode your predictor. Remember to use only the abstract column in the transformation (you could also fit the word dictionary on the corpus first and then transform it afterwards).
Here's a solution:
y = df.accessclass
x = vectorizer.fit_transform(df.abstract)
The rest looks ok.
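If you also want to avoid fitting the vectorizer on the test fold (as mentioned in parentheses above), here is a minimal sketch, assuming df and stopset are defined as in the question:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import naive_bayes
from sklearn.metrics import roc_auc_score

x_train, x_test, y_train, y_test = train_test_split(df.abstract, df.accessclass)

vectorizer = TfidfVectorizer(use_idf=True, lowercase=True,
                             strip_accents='ascii', stop_words=stopset)
x_train_vec = vectorizer.fit_transform(x_train)  # learn the vocabulary on the training fold
x_test_vec = vectorizer.transform(x_test)        # reuse it on the test fold

clf = naive_bayes.MultinomialNB()
clf.fit(x_train_vec, y_train)
print(roc_auc_score(y_test, clf.predict_proba(x_test_vec)[:, 1]))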
I am using scikit-learn to build a classifier that predicts if two sentences are paraphrases or not (e.g. paraphrases: How tall was Einstein vs. What was Albert Einstein's length).
My data consists of 2 columns with strings (phrase pairs) and 1 target column with 0's and 1's (= no paraphrase, paraphrase). I want to try different algorithms.
I expect the last line of code below to fit the model. Instead, the pre-processing Pipeline keeps producing an error I cannot solve: "AttributeError: 'numpy.ndarray' object has no attribute 'lower'."
The code is below and I have isolated the error happening in the last line shown (for brevity I have excluded the rest). I suspect it is because the target column contains 0s and 1s, which cannot be turned lowercase.
I have tried the answers to similar questions on stackoverflow, but no luck so far.
How can you work around this?
question1                                           question2                                     is_paraphrase
How long was Einstein?                              How tall was Albert Einstein?                 1
Does society place too much importance on sports?   How do sports contribute to the society?      0
What is a narcissistic personality disorder?        What is narcissistic personality disorder?    1
======
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
para = "paraphrases.tsv"
df = pd.read_csv(para, usecols = [3, 5], nrows = 100, header=0, sep="\t")
y = df["is_paraphrase"].values
X = df.drop("is_paraphrase", axis=1).values
X = X.astype(str) # I have tried this
X = np.char.lower(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3,
random_state = 21, stratify = y)
text_clf = Pipeline([('vect', CountVectorizer()),('tfidf', TfidfTransformer()),
('clf', MultinomialNB())])
text_clf.fit(X_train, y_train)
The error is not because of the target column; it is because your X dataset contains two columns, question1 and question2. This means each row of X_train is an array of two values, so when CountVectorizer tries to convert a row to lower case it fails, since a numpy.ndarray has no lower function.
To overcome this problem you need to split X_train into two parts, say X_train_pt1 and X_train_pt2. Then apply CountVectorizer to each of them individually, followed by TfidfTransformer on each individual result. Also ensure that you use the same fitted objects to transform the corresponding test data.
Finally, you stack the two resulting arrays together and give them as input to your classifier. You can find a similar implementation here.
Update:
I think the following should be of some help (I admit this code can be further improved for more efficiency):
def flat_list(my_list):
    return [str(item) for sublist in my_list for item in sublist]

def transform_data(trans_obj_list, dataset_splits):
    # flatten each split so that every question becomes its own string
    splits = [flat_list(split.astype(str)) for split in dataset_splits]
    # fit each transformer on the training split only, then apply it to all splits
    for trfs in trans_obj_list:
        transformer = trfs().fit(splits[0])
        splits = [transformer.transform(split) for split in splits]
    return splits

new_X_train, new_X_test = transform_data([CountVectorizer, TfidfTransformer], [X_train, X_test])
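For the stacking step described above, a minimal sketch with hypothetical helper names, assuming X_train and X_test hold question1 in column 0 and question2 in column 1:
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

def fit_column(texts):
    # one vectorizer pair per question column, fitted on the training fold only
    cv = CountVectorizer().fit(texts)
    tfidf = TfidfTransformer().fit(cv.transform(texts))
    return cv, tfidf

def encode(X, col, cv, tfidf):
    return tfidf.transform(cv.transform(X[:, col]))

cv1, tf1 = fit_column(X_train[:, 0])
cv2, tf2 = fit_column(X_train[:, 1])

# stack the two encoded question columns side by side
X_train_enc = hstack([encode(X_train, 0, cv1, tf1), encode(X_train, 1, cv2, tf2)])
X_test_enc = hstack([encode(X_test, 0, cv1, tf1), encode(X_test, 1, cv2, tf2)])

clf = MultinomialNB().fit(X_train_enc, y_train)
print(clf.score(X_test_enc, y_test))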
I have a large numpy array, and when I run scikit-learn's train_test_split to split the array into training and test data, I always run into memory errors. What would be a more memory-efficient method of splitting into train and test, and why does train_test_split cause this?
The following code results in a memory error and causes a crash:
import numpy as np
from sklearn.model_selection import train_test_split
X = np.random.random((10000,70000))
Y = np.random.random((10000,))
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.33, random_state=42)
One method that I've tried which works is to store X in a pandas dataframe and shuffle
X = X.reindex(np.random.permutation(X.index))
since I arrive at the same memory error when I try
np.random.shuffle(X)
Then, I convert the pandas dataframe back to a numpy array, and using the function below, I can obtain a train-test split:
#test_proportion of 3 means 1/3 so 33% test and 67% train
def shuffle(matrix, target, test_proportion):
    ratio = int(matrix.shape[0]/test_proportion) # should be int
    X_train = matrix[ratio:, :]
    X_test = matrix[:ratio, :]
    Y_train = target[ratio:]
    Y_test = target[:ratio]
    return X_train, X_test, Y_train, Y_test
X_train, X_test, Y_train, Y_test = shuffle(X, Y, 3)
This works for now, and when I want to do k-fold cross-validation, I can iteratively loop k times and shuffle the pandas dataframe. While this suffices for now, why do numpy's and scikit-learn's implementations of shuffle and train_test_split result in memory errors for big arrays?
Another way to use the sklearn split method with reduced memory usage is to generate an index vector of X and split on this vector. Afterwards you can select your entries and e.g. write training and test splits to the disk.
import h5py
import numpy as np
from sklearn.model_selection import train_test_split
X = np.random.random((10000,70000))
Y = np.random.random((10000,))
x_ids = list(range(len(X)))
x_train_ids, x_test_ids, Y_train, Y_test = train_test_split(x_ids, Y, test_size = 0.33, random_state=42)
# Write
f = h5py.File('dataset/train.h5py', 'w')
f.create_dataset("inputs", data=X[x_train_ids])
f.create_dataset("labels", data=Y_train)
f.close()
f = h5py.File('dataset/test.h5py', 'w')
f.create_dataset("inputs", data=X[x_test_ids])
f.create_dataset("labels", data=Y_test)
f.close()
# Read
f = h5py.File('dataset/train.h5py', 'r')
X_train = np.array(f.get('inputs'))
Y_train = np.array(f.get('labels'))
f.close()
f = h5py.File('dataset/test.h5py', 'r')
X_test = np.array(f.get('inputs'))
Y_test = np.array(f.get('labels'))
f.close()
I came across a similar problem.
As mentioned by @user1879926, I think shuffling is a main cause of the memory exhaustion.
And although 'shuffle' was cited as an invalid parameter for model_selection.train_test_split, train_test_split in sklearn 0.19 has an option for disabling the shuffle.
So, I think you can escape the memory error just by adding the shuffle=False option.
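For example, a minimal sketch assuming sklearn >= 0.19, where train_test_split accepts the shuffle argument:
from sklearn.model_selection import train_test_split

# disabling the shuffle skips the in-memory permutation step
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33, shuffle=False)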
I faced the same problem with my code. I was using a dense array like you and ran out of memory. I converted my training data to sparse (I am doing document classification) and solved my issue.
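A minimal sketch of that idea, using scipy sparse matrices (assuming the features are mostly zeros, as in bag-of-words counts):
import numpy as np
from scipy import sparse
from sklearn.model_selection import train_test_split

# a mostly-zero feature matrix stored in CSR form, so only non-zero entries use memory
X_sparse = sparse.random(10000, 70000, density=0.001, format='csr')
Y = np.random.random((10000,))

# train_test_split accepts sparse matrices and keeps the resulting splits sparse
X_train, X_test, Y_train, Y_test = train_test_split(X_sparse, Y, test_size=0.33, random_state=42)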
I suppose a more "memory efficient" way would be to iteratively select instances for training and testing (although, as is typical in computer science, you sacrifice the efficiency inherent in using matrices).
What you could do is iterate over the array and, for each instance, 'flip a coin' (use the random package) to decide whether to use the instance for training or testing and, depending on the outcome, store the instance in the appropriate numpy array.
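A minimal sketch of that coin-flip idea, collecting rows in Python lists first (repeatedly appending to numpy arrays would copy them each time):
import random
import numpy as np

train_rows, train_labels, test_rows, test_labels = [], [], [], []
for row, label in zip(X, Y):
    if random.random() < 0.33:   # roughly a third of the instances go to the test set
        test_rows.append(row)
        test_labels.append(label)
    else:
        train_rows.append(row)
        train_labels.append(label)

X_train, Y_train = np.array(train_rows), np.array(train_labels)
X_test, Y_test = np.array(test_rows), np.array(test_labels)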
This iterative method shouldn't be bad for only 10000 instances. What is curious though is that 10000 X 70000 isn't all that large; what type of machine are you running? Makes me wonder whether it is a Python/numpy/scikit issue or a machine issue...
Anyway, hope that helps!
I am getting peculiar differences in results between WEKA and scikit while using the same RandomForest technique and the same dataset. With scikit I am getting an AUC around 0.62 (all the time, for I did extensive testing). However, with WEKA I'm getting results close to 0.79. That's a huge difference!
The dataset I tested the algorithms on is KC1.arff, of which I put a copy in my public dropbox folder https://dl.dropbox.com/u/30688032/KC1.arff. For WEKA, I simply downloaded the .jar file from http://www.cs.waikato.ac.nz/ml/weka/downloading.html. In WEKA, I set the cross-validation parameter to 10-fold, the dataset to KC1.arff, and the algorithm to "RandomForest -l 19 -K 0 -S 1", then ran it. Once you generate the results in WEKA, they can be saved as a .csv or .arff file. Read that file and check the column 'Area_under_ROC'; it should be somewhat close to 0.79.
Below is the code for scikit's RandomForest:
import numpy as np
from pandas import *
from sklearn.ensemble import RandomForestClassifier

def read_arff(f):
    from scipy.io import arff
    data, meta = arff.loadarff(f)
    return DataFrame(data)

def kfold(clr, X, y, folds=10):
    from sklearn.cross_validation import StratifiedKFold
    from sklearn import metrics
    auc_sum = 0
    kf = StratifiedKFold(y, folds)
    for train_index, test_index in kf:
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        clr.fit(X_train, y_train)
        pred_test = clr.predict(X_test)
        print metrics.auc_score(y_test, pred_test)
        auc_sum += metrics.auc_score(y_test, pred_test)
    print 'AUC: ', auc_sum/folds
    print "----------------------------"

# read the dataset
X = read_arff('KC1.arff')
y = X['Defective']

# change N and Y to 0 and 1 respectively
s = np.unique(y)
mapping = Series([x[0] for x in enumerate(s)], index=s)
y = y.map(mapping)
del X['Defective']

# initialize random forests (by default it is set to 10 trees)
rf = RandomForestClassifier()

# run the algorithm
kfold(rf, np.array(X), y)
# You will get an average AUC around 0.62 as opposed to 0.79 in WEKA
Please keep in mind that the real AUC value, as shown in the experimental results of the relevant papers, is around 0.79, so the problem lies in my implementation that uses the scikit random forests.
Your kind help will be highly appreciated!!
Thank you very much!
After posting the question on the scikit-learn issue tracker, I got feedback that the problem is in the "predict" function I used. It should have been pred_test = clr.predict_proba(X_test)[:, 1] instead of pred_test = clr.predict(X_test), since the classification problem is binary: either 0 or 1.
After implementing the change, the results turned out to be the same for WEKA's and scikit's random forest :)
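For reference, a minimal sketch of the corrected fold loop with that change applied, written against the current API (StratifiedKFold from sklearn.model_selection and roc_auc_score instead of the old metrics.auc_score):
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
from sklearn.ensemble import RandomForestClassifier

def kfold_auc(clr, X, y, folds=10):
    auc_sum = 0
    for train_index, test_index in StratifiedKFold(n_splits=folds).split(X, y):
        clr.fit(X[train_index], y[train_index])
        # score with the predicted probability of the positive class, not the hard 0/1 label
        pred_test = clr.predict_proba(X[test_index])[:, 1]
        auc_sum += roc_auc_score(y[test_index], pred_test)
    print('AUC:', auc_sum / folds)

kfold_auc(RandomForestClassifier(), np.array(X), np.array(y))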