I am using scikit-learn to build a classifier that predicts if two sentences are paraphrases or not (e.g. paraphrases: How tall was Einstein vs. What was Albert Einstein's length).
My data consists of 2 columns with strings (the phrase pairs) and 1 target column with 0s and 1s (0 = no paraphrase, 1 = paraphrase). I want to try different algorithms.
I expect the last line of the code below to fit the model. Instead, the pre-processing Pipeline keeps producing an error I cannot solve: "AttributeError: 'numpy.ndarray' object has no attribute 'lower'."
The code is below, and I have isolated the error to the last line shown (for brevity I have excluded the rest). I suspect it is because the target column contains 0s and 1s, which cannot be lowercased.
I have tried the answers to similar questions on Stack Overflow, but no luck so far.
How can I work around this?
question1                                           question2                                     is_paraphrase
How long was Einstein?                              How tall was Albert Einstein?                 1
Does society place too much importance on sports?   How do sports contribute to the society?      0
What is a narcissistic personality disorder?        What is narcissistic personality disorder?    1
======
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
para = "paraphrases.tsv"
df = pd.read_csv(para, usecols = [3, 5], nrows = 100, header=0, sep="\t")
y = df["is_paraphrase"].values
X = df.drop("is_paraphrase", axis=1).values
X = X.astype(str) # I have tried this
X = np.char.lower(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=21, stratify=y)
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])
text_clf.fit(X_train, y_train)
The error is not because of the last column; it is because your training dataset X contains two columns, question1 and question2. As a result, each row of your X_train is a list of values, so when the CountVectorizer tries to convert it to lower case it fails, since a numpy.ndarray does not have a lower function.
To overcome this problem you need to split the dataset X_train into two parts, say X_train_pt1 and X_train_pt2. Then apply CountVectorizer to each individually, followed by TfidfTransformer on each individual result. Also ensure that you use the same fitted objects to transform both parts.
Finally, you stack the two resulting arrays together and give that as input to your classifier. You can find a similar implementation here.
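A minimal sketch of that idea (the details here are my own assumptions, not part of the linked implementation: the vocabulary is fitted on both phrase columns of the training data, and scipy.sparse.hstack does the stacking):
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

vect = CountVectorizer()
tfidf = TfidfTransformer()

# Fit the vocabulary and idf weights on all training phrases (both columns)
counts_all = vect.fit_transform(np.concatenate([X_train[:, 0], X_train[:, 1]]))
tfidf.fit(counts_all)

# Transform each column with the same fitted objects, then stack side by side
q1 = tfidf.transform(vect.transform(X_train[:, 0]))
q2 = tfidf.transform(vect.transform(X_train[:, 1]))
X_train_stacked = hstack([q1, q2])

clf = MultinomialNB().fit(X_train_stacked, y_train)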
Update:
I think the following should be of some help (I admit this code can be improved further for efficiency):
def flat_list(my_list):
    # Flatten a 2-D array of phrase pairs into one flat list of strings
    return [str(item) for sublist in my_list for item in sublist]

def transform_data(trans_obj_list, dataset_splits):
    # Flatten every split so each phrase becomes its own document
    dataset_splits = [flat_list(split.astype(str)) for split in dataset_splits]
    for trfs in trans_obj_list:
        # Fit each transformer on the training split only, then apply the
        # same fitted object to every split
        transformed_vector = trfs().fit(dataset_splits[0])
        dataset_splits = [transformed_vector.transform(split) for split in dataset_splits]
    return dataset_splits

new_X_train, new_X_test = transform_data([CountVectorizer, TfidfTransformer], [X_train, X_test])
Related
I am trying to do sentiment analysis for text. I have 909 phrases commonly used in emails, and I scored each of them out of ten for how angry it is in isolation. Now, I upload this .csv file to a Jupyter Notebook, where I import the following modules:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
Now, I name the two columns 'Phrase' and 'Anger':
df=pd.read_csv('Book14.csv', names=['Phrase', 'Anger'])
df_x = df['Phrase']
df_y = df['Anger']
Subsequently, I split this data such that 20% is used for testing and 80% is used for training:
x_train, x_test, y_train, y_test = train_test_split(df_x, df_y, test_size=0.2, random_state=4)
Now, I convert the words in x_train to numerical data using TfidfVectorizer:
tfidfvectorizer = TfidfVectorizer(analyzer='word', stop_words='english')
x_traincv = tfidfvectorizer.fit_transform(x_train.astype('U'))
Now, I convert x_traincv to an array:
a = x_traincv.toarray()
I also convert x_testcv to a numerical array:
x_testcv=tfidfvectorizer.fit_transform(x_test)
x_testcv = x_testcv.toarray()
Now, I have
mnb = MultinomialNB()
b = np.array(y_test)
error_score = 0
for i in range(len(x_test)):
    mnb.fit(x_testcv, y_test)
    testmessage = x_test.iloc[i]
    predictions = mnb.predict(x_testcv[i].reshape(1, -1))
    error_score = error_score + (predictions - int(b[i]))**2
    print(testmessage)
    print(predictions)
print(error_score / len(x_test))
However, examples of the results I get are:
Bring it back
[0]
It is greatly appreciatd when
[0]
Apologies in advance
[0]
Can you please
[0]
See you then
[0]
I hope this email finds you well.
[0]
Thanks in advance
[0]
I am sorry to inform
[0]
You’re absolutely right
[0]
I am deeply regretful
[0]
Shoot me through
[0]
I’m looking forward to
[0]
As I already stated
[0]
Hello
[0]
We expect all students
[0]
If it’s not too late
[0]
and this repeats on a large scale, even for phrases that are obviously very angry. When I removed all data containing a '0' from the .csv file, the new modal value (a 10) became the only prediction for my sentences. Why is this happening? Is it some weird way of minimising error? Are there any inherent flaws in my code? Should I take a different approach?
Two things. First, you are fitting the MultinomialNB with the test set: in your loop you have mnb.fit(x_testcv, y_test), but you should do mnb.fit(x_traincv, y_train).
Second, when performing pre-processing you should call fit_transform only on the training data; on the test data you should call only the transform method.
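A minimal sketch of the corrected flow, reusing the variable names from the question (the evaluation loop is left out):
# Fit the vectorizer on the training data only, then reuse it on the test set
tfidfvectorizer = TfidfVectorizer(analyzer='word', stop_words='english')
x_traincv = tfidfvectorizer.fit_transform(x_train.astype('U'))
x_testcv = tfidfvectorizer.transform(x_test.astype('U'))

# Train on the training split, predict on the held-out test split
mnb = MultinomialNB()
mnb.fit(x_traincv, y_train)
predictions = mnb.predict(x_testcv)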
I want to create a text classifier that looks at research abstracts and determines whether they are focused on access to care, based on a labeled dataset I have. The data source is an Excel spreadsheet with three fields (project_number, abstract, and accessclass) and 326 rows of abstracts. The accessclass is 1 for access-related and 0 for not access-related (not sure if this is relevant). Anyway, I tried following along with a tutorial but wanted to make it relevant by adding my own data, and I'm having some issues with my X and Y arrays. Any help is appreciated.
import pandas as pd
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn import naive_bayes
from sklearn.metrics import roc_auc_score
df = pd.read_excel("accessclasses.xlsx")
df.head()
#TFIDF vectorizer
stopset = set(stopwords.words('english'))
vectorizer = TfidfVectorizer(use_idf=True, lowercase=True,
strip_accents='ascii', stop_words=stopset)
y = df.accessclass
x = vectorizer.fit_transform(df)
print(x.shape)
print(y.shape)
#above and below seem to be where the issue is.
x_train, x_test, y_train, y_test = train_test_split(x, y)
You are using your whole dataframe to encode your predictor. Remember to use only the abstract column in the transformation (you could also fit the vectorizer's vocabulary on the corpus first and then transform it afterwards).
Here's a solution:
y = df.accessclass
x = vectorizer.fit_transform(df.abstract)
The rest looks ok.
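One further refinement (my own suggestion, not part of the answer above): fit the vectorizer only on the training abstracts after the split, so no test-set vocabulary leaks into the features:
x_train, x_test, y_train, y_test = train_test_split(df.abstract, df.accessclass)

# Fit the vocabulary on the training abstracts only, reuse it for the test set
x_train_vec = vectorizer.fit_transform(x_train)
x_test_vec = vectorizer.transform(x_test)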
I am trying to create a logistic regression model using scikit-learn with the code below. I am using 9 columns for the features (X) and one for the label (Y). When trying to fit, I get the error "ValueError: Found input variables with inconsistent numbers of samples: [9, 560000]", even though the lengths of X and Y were previously the same; if I use x.transpose() I get a different error, "AttributeError: 'int' object has no attribute 'lower'". I am assuming this has to do with the TfidfVectorizer; I am using it because 3 of the columns contain single words, and without it this wasn't working. Is this the right way to be doing this, or should I be converting the words in the columns separately and then using train_test_split? If not, why am I getting the errors and how can I fix them? Here's an example of the csv.
df = pd.read_csv("UNSW-NB15_1.csv",header=None, names=cols, encoding = "UTF-8",low_memory=False)
df.to_csv('netraf.csv')
csv = 'netraf.csv'
my_df = pd.read_csv(csv)
x_features = my_df.columns[1:10]
x_data = my_df[x_features]
Y = my_df["Label"]
x_train, x_validation, y_train, y_validation = model_selection.train_test_split(
    x_data, Y, test_size=0.2, random_state=7)
tfidf_vectorizer = TfidfVectorizer()
lr = LogisticRegression()
tfidf_lr_pipe = Pipeline([('tfidf', tfidf_vectorizer), ('lr', lr)])
tfidf_lr_pipe.fit(x_train, y_train)
What you are trying to do is unusual, because TfidfVectorizer is designed to extract numerical features from text. But if you don't really care and just want to make your code work, one way to do it is to convert your numerical data to strings and configure TfidfVectorizer to accept tokenized data:
import pandas as pd
from sklearn import model_selection
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
cols = ['srcip','sport','dstip','dsport','proto','service','smeansz','dmeansz','attack_cat','Label']
df = pd.read_csv("UNSW-NB15_1.csv",header=None, names=cols, encoding = "UTF-8",low_memory=False)
df.to_csv('netraf.csv')
csv = 'netraf.csv'
my_df = pd.read_csv(csv)
# convert all columns to string like we don't care
for col in my_df.columns:
    my_df[col] = my_df[col].astype(str)
# replace nan with empty string like we don't care
for col in my_df.columns[my_df.isna().any()].tolist():
    my_df.loc[:, col].fillna('', inplace=True)
x_features = my_df.columns[1:10]
x_data = my_df[x_features]
Y = my_df["Label"]
x_train, x_validation, y_train, y_validation = model_selection.train_test_split(
    x_data.values, Y.values, test_size=0.2, random_state=7)
# configure TfidfVectorizer to accept tokenized data
# reference http://www.davidsbatista.net/blog/2018/02/28/TfidfVectorizer/
tfidf_vectorizer = TfidfVectorizer(
    analyzer='word',
    tokenizer=lambda x: x,
    preprocessor=lambda x: x,
    token_pattern=None)
lr = LogisticRegression()
tfidf_lr_pipe = Pipeline([('tfidf', tfidf_vectorizer), ('lr', lr)])
tfidf_lr_pipe.fit(x_train, y_train)
That being said, I'd recommend you use another method to do feature engineering on your dataset. For example, you can try to encode your nominal data (e.g. IP, port) as numerical values, as sketched below.
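A hedged sketch of that alternative, assuming the column names from the cols list above (note it undoes the str cast on the numeric size columns so they stay numbers):
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# One-hot encode the nominal fields; pass the numeric columns through unchanged
nominal_cols = ['srcip', 'sport', 'dstip', 'dsport', 'proto', 'service']
numeric_cols = ['smeansz', 'dmeansz']
preprocess = ColumnTransformer(
    [('onehot', OneHotEncoder(handle_unknown='ignore'), nominal_cols)],
    remainder='passthrough')

pipe = Pipeline([('prep', preprocess), ('lr', LogisticRegression())])

X_cat_num = my_df[nominal_cols + numeric_cols].copy()
X_cat_num[numeric_cols] = X_cat_num[numeric_cols].astype(float)  # undo the str cast
pipe.fit(X_cat_num, my_df['Label'])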
I am trying to add predicted data back to my original dataset in Python. I think I'm supposed to use pandas with assign and pd.DataFrame, but I have no clue how to write this after reading all the documentation (sorry, I'm new to all this and just started learning to code recently). I've written my code below and just need help with the code for adding my predictions back to the dataset. Thanks for the help!
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25,
random_state = 0)
# Feature Scaling X_train and X_test
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
#Feature scaling the all independent variables used to build the model
whole_dataset = sc.transform(X)
# Fitting classifier to the Training set
# Create your Naive Bayes here
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)
# Predicting the Test set results
y_pred = classifier.predict_proba(X_test)
# Predicting the results for the whole dataset
y_pred2 = classifier.predict_proba(whole_dataset)
# Add y_pred2 predictions back to the dataset
???
You can just do dataset['prediction'] = y_pred to add a new column.
Pandas supports a simple syntax for adding new columns, here it will add a new column and probably take a view on the numpy array returned from sklearn so it should be nice and fast.
EDIT
Looking at your code and the data, you're misunderstanding what train_test_split does: it splits your original 400-row dataset into 3/4 and 1/4 parts, so your X train data contains 300 rows and the test data 100 rows. You're then trying to assign back to your original dataset, which has 400 rows. Firstly, the numbers of rows don't match; secondly, what is returned from predict_proba is a matrix of per-class probabilities. So what you want to do after training is to predict on the original dataset and assign this back as 2 columns, by sub-selecting each column:
y_pred = classifier.predict_proba(X)
now assign this back :
dataset['predict_class_1'], dataset['predict_class_2'] = y_pred[:, 0], y_pred[:, 1]
There are several solutions; EdChurm's answer mentioned one.
As far as I know, pandas has two other methods for this:
df.insert()
df.assign()
Since you didn't provide the data in use, here's a pretty simple example.
import pandas as pd
import numpy as np
np.random.seed(1)
df = pd.DataFrame(np.random.randn(10), columns=['raw'])
df = df.assign(cube_raw=df['raw']**3)
df.insert(1, 'square_raw', df['raw']**2)
df
raw square_raw cube_raw
0 1.624345 2.638498 4.285832
1 -0.611756 0.374246 -0.228947
2 -0.528172 0.278965 -0.147342
3 -1.072969 1.151262 -1.235268
4 0.865408 0.748930 0.648130
5 -2.301539 5.297080 -12.191435
6 1.744812 3.044368 5.311849
7 -0.761207 0.579436 -0.441071
8 0.319039 0.101786 0.032474
9 -0.249370 0.062186 -0.015507
Just keep in mind that df.assign() doesn't work in place, so you should reassign the result to your variable.
Personally, I prefer df.insert(), since it lets you choose the position at which the new column is inserted (via the loc parameter).
I'm trying to train a logistic classifier. My dataset has the following columns.
name, review, rating, reviews_cleaned, word_count, sentiment
The sentiment is either +1 or -1, based on whether the rating is greater than 3 or not. The word_count column contains a dict of words with their occurrences, and reviews_cleaned is the review text stripped of punctuation.
This is my code to train a LogisticClassifier.
train_data, test_data = train_test_split(products, test_size = 0.2)
sentiment_model = LogisticRegression(penalty='l2', C=1)
sentiment_model.fit(products['sentiment'], products['word_count'])
I get the following error,
ValueError: Found input variables with inconsistent numbers of samples: [1, 166752]
PS: The equivalent statment using graphLab create is
sentiment_model = graphlab.logistic_classifier.create(train_data,
target = 'sentiment',
features=['word_count'],
validation_set=None)
What am I doing wrong?
Your training data looks like a 1-dimensional vector, but sklearn requires it to be 2-dimensional; if you reshape it, you should be okay. Also, you make your train/test split but then you're not actually using the data you produce (fit with train_data instead).
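To illustrate the reshape point (this snippet is mine, not part of the original answer):
import numpy as np

# sklearn estimators expect a 2-D feature matrix of shape (n_samples, n_features);
# reshape(-1, 1) turns a 1-D vector into a single-column matrix
features = np.array([0.5, 1.2, 3.4])
features_2d = features.reshape(-1, 1)  # shape (3, 1)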
Using GraphLab in that course is very irritating to say the least. Give this a whirl:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
df = pd.read_csv('amazon_baby.csv', header=0)
df.dropna(how="any", inplace=True)
products = df[df['rating'] != 3] #drop the products with 3-star rating
products['sentiment'] = products['rating'] >= 4
X_train, X_test, y_train, y_test = train_test_split(products['review'], products['sentiment'], test_size = .2 ,random_state = 0)
vect = CountVectorizer()
X_train = vect.fit_transform(X_train.values)
X_test = vect.transform(X_test.values)
model = LogisticRegression(penalty ='l2', C = 1)
model.fit(X_train, y_train)
I'm not sure what the direct translation between Sklearn/Pandas and GraphLab is, but this looks like it's what they are doing.
When I score the model, I get:
model.score(X_test, y_test)
> .93155480
Let me know what results you get or if this works for you.