scikit-learn logistic regression model with TfidfVectorizer - Python

I am trying to create a logistic regression model using scikit-learn with the code below. I am using 9 columns for the features (X) and one for the label (Y). When trying to fit, I get the error "ValueError: Found input variables with inconsistent numbers of samples: [9, 560000]", even though earlier the lengths of X and Y are the same. If I use x.transpose() I get a different error: "AttributeError: 'int' object has no attribute 'lower'". I am assuming this has to do with the TfidfVectorizer; I am using it because 3 of the columns contain single words and it wasn't working otherwise. Is this the right way to be doing this, or should I be converting the words in those columns separately and then using train_test_split? If not, why am I getting these errors and how can I fix them? Here's an example of the CSV.
import pandas as pd
from sklearn import model_selection
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

df = pd.read_csv("UNSW-NB15_1.csv", header=None, names=cols, encoding="UTF-8", low_memory=False)
df.to_csv('netraf.csv')
csv = 'netraf.csv'
my_df = pd.read_csv(csv)
x_features = my_df.columns[1:10]
x_data = my_df[x_features]
Y = my_df["Label"]
x_train, x_validation, y_train, y_validation = model_selection.train_test_split(
    x_data, Y, test_size=0.2, random_state=7)
tfidf_vectorizer = TfidfVectorizer()
lr = LogisticRegression()
tfidf_lr_pipe = Pipeline([('tfidf', tfidf_vectorizer), ('lr', lr)])
tfidf_lr_pipe.fit(x_train, y_train)

What you are trying to do is unusual, because TfidfVectorizer is designed to extract numerical features from text. But if you don't really care and just want to make your code work, one way to do it is to convert your numerical data to strings and configure TfidfVectorizer to accept tokenized data:
import pandas as pd
from sklearn import model_selection
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
cols = ['srcip','sport','dstip','dsport','proto','service','smeansz','dmeansz','attack_cat','Label']
df = pd.read_csv("UNSW-NB15_1.csv", header=None, names=cols, encoding="UTF-8", low_memory=False)
df.to_csv('netraf.csv')
csv = 'netraf.csv'
my_df = pd.read_csv(csv)
# replace NaN with empty string like we don't care
# (fill NaN before astype(str), otherwise NaN becomes the literal string 'nan')
for col in my_df.columns[my_df.isna().any()].tolist():
    my_df[col] = my_df[col].fillna('')
# convert all columns to string like we don't care
for col in my_df.columns:
    my_df[col] = my_df[col].astype(str)
x_features = my_df.columns[1:10]
x_data = my_df[x_features]
Y = my_df["Label"]
x_train, x_validation, y_train, y_validation = model_selection.train_test_split(
    x_data.values, Y.values, test_size=0.2, random_state=7)
# configure TfidfVectorizer to accept tokenized data
# reference http://www.davidsbatista.net/blog/2018/02/28/TfidfVectorizer/
tfidf_vectorizer = TfidfVectorizer(
    analyzer='word',
    tokenizer=lambda x: x,
    preprocessor=lambda x: x,
    token_pattern=None)
lr = LogisticRegression()
tfidf_lr_pipe = Pipeline([('tfidf', tfidf_vectorizer), ('lr', lr)])
tfidf_lr_pipe.fit(x_train, y_train)
That being said, I'd recommend using another method to do feature engineering on your dataset. For example, you can try to encode your nominal data (e.g. IP, port) as numerical values.
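A minimal sketch of that idea, assuming scikit-learn 0.20+ (which provides ColumnTransformer) and the column names from the cols list above; which columns to one-hot encode is an assumption:
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
# one-hot encode the nominal columns, pass the numeric ones through unchanged
preprocess = ColumnTransformer(
    [('onehot', OneHotEncoder(handle_unknown='ignore'),
      ['srcip', 'dstip', 'proto', 'service'])],
    remainder='passthrough')
pipe = Pipeline([('prep', preprocess), ('lr', LogisticRegression())])
# x_train_df (hypothetical name): a train split of the raw DataFrame,
# without the astype(str) conversion above
pipe.fit(x_train_df, y_train)
Be aware that high-cardinality columns like sport/dsport can blow up the one-hot dimensionality, so you may want to bucket or drop them.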

Related

How to use a new data set on a trained model?

I am trying to use a new data set on a previously trained model to see how accurate the model is. I use the following code and receive the error below. Would another method solve this problem? Thanks.
import pandas as pd
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
%matplotlib inline
df = pd.read_excel('xxxx.xlsx')
enc = LabelEncoder()
X = df[df.columns[1:]]
Y = df[df.columns[0]].values.ravel()
Y2 = enc.fit_transform(Y)
df.insert(0, "Unit Status", Y2, True)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y2, random_state = 0, test_size = 0.25)
clf = LinearSVC(random_state=0,dual=False, tol=1e-5)
clf.fit(X, Y2)
Y_pred = clf.predict(X_test)
confusion_matrix(Y_test, Y_pred)
classifier_predictions = clf.predict(X_test)
print(accuracy_score(Y_test, classifier_predictions)*100)
df2 = pd.read_excel('xxxx_v2.xlsx')
y_pred=clf.predict(df2)
ValueError: could not convert string to float: '20-002'
The data in the new dataframe must all be floats, or at least convertible to float. The first and second columns contain string data which cannot be converted to numbers, so the model cannot train or predict on this data. From looking at the data, you could use LabelEncoder on the second column and decide whether or not to use OneHotEncoder, but it looks to me like the first column doesn't contain categorical data. If the model needs the first column's data, then you need to convert it to numbers somehow; otherwise, just drop the column.
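A rough sketch of that suggestion; the column positions are assumptions based on the error message, and ideally the encoder would be fitted on the training data so the integer codes match between training and prediction:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
df2 = pd.read_excel('xxxx_v2.xlsx')
X2 = df2.drop(columns=[df2.columns[0]])  # assumed: the first column holds '20-002'-style IDs; drop it
cat_col = X2.columns[0]                  # assumed: the next column is the categorical one
cat_enc = LabelEncoder()                 # hypothetical second encoder, separate from enc above
X2[cat_col] = cat_enc.fit_transform(X2[cat_col])  # better: fit on the training data, transform here
y_pred = clf.predict(X2)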

When predicting on a single sentence, receive the error "Number of features of the model must match the input."

I'm a data science newbie and I'm trying to use TfidfVectorizer with RandomForestClassifier to predict a binary "yes/no" outcome on a string like so:
import pandas as pd
from nltk.corpus import stopwords
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

df = pd.read_csv('~/Downloads/New_Query_2019_12_04.csv', usecols=['statement', 'result'])
df = df.head(100)
# remove non-values
df = df.dropna()
tfidfconverter = TfidfVectorizer(
    max_features=1500,
    min_df=5,
    max_df=0.7,
    stop_words=stopwords.words('english'))
X = tfidfconverter.fit_transform(df['statement']).toarray()
y = df['result'].values
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
classifier = RandomForestClassifier(n_estimators=1000, random_state=0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
All of this appears to work great, but I'm stuck on how to predict a phrase against the model. When I do something like:
good_string = preprocess_string('This is a good sentence')
tfidfconverter = TfidfVectorizer()
X = tfidfconverter.fit_transform([good_string]).toarray()
y_pred = classifier.predict(X)
I get the error "Number of features of the model must match the input."
I also tried fitting the string with my previous TfidfVectorizer:
tfidfconverter = TfidfVectorizer(
    max_features=1500,
    min_df=5,
    max_df=0.7,
    stop_words=stopwords.words('english'))
X = tfidfconverter.fit_transform([good_string]).toarray()
but I got the error "max_df corresponds to < documents than min_df". I think I'm just a bit confused as to how to fit the array of features from the single string to match the number of features in my model. Any help would be greatly appreciated.
The issue was that I was running it through a different vectorizer with the same constructor params:
tfidfconverter = TfidfVectorizer(
    max_features=1500,
    min_df=5,
    max_df=0.7,
    stop_words=stopwords.words('english'))
instead of using the same vectorizer I used when fitting the documents here:
X = tfidfconverter.fit_transform(df['statement']).toarray()
I also should not have been attempting to fit the data I was trying to predict, but ONLY transform it.
X = tfidfconverter.transform([good_string]).toarray()
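A way to avoid this class of mistake entirely is to wrap the vectorizer and classifier in a single Pipeline, so the fitted vocabulary is always the one applied at predict time. A minimal sketch under the question's assumptions (df with 'statement' and 'result' columns; stop_words omitted for brevity):
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(
    TfidfVectorizer(max_features=1500, min_df=5, max_df=0.7),
    RandomForestClassifier(n_estimators=1000, random_state=0))
pipe.fit(df['statement'], df['result'])             # fits the vectorizer and the forest together
y_pred = pipe.predict(['This is a good sentence'])  # raw string in; transform happens inside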

Fitting MultinomialNB on multiple columns of data

Given a table of data containing 100 rows, such as:
Place | Text | Value | Text_Two
europe | some random text | 3.2 | some more random text
america | the usa | 4.1 | the white house
...
I am trying to classify with the following:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

df = pd.read_csv('data.csv')
mnb = MultinomialNB()
tf = TfidfVectorizer()
df.loc[df['Place'] == 'europe','Place'] = 0
df.loc[df['Place'] == 'america','Place'] = 1
X = df[['Text', 'Value', 'Text_Two']]
y = df['Place']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)
X_train_tf = tf.fit_transform(X_train)
mnb.fit(X_train_tf, y_train)
The above produces the following error:
ValueError: Found input variables with inconsistent numbers of samples: [3, 100]
So from what I understand, it's only seeing the columns set with X = df[['Text', 'Value', 'Text_Two']], not the data within those columns.
The code above works if I only specify X for one category, such as:
X = df['Text']
Is it possible to fit the MultinomialNB on multiple categories of data?
This has nothing to do with MultinomialNB. It can handle multiple columns fine. The problem is TfidfVectorizer.
TfidfVectorizer only works on an iterable of a single dimension (a single column of your dataframe) and will not do any kind of check on the shape or type of the input data.
It will only do this:
for doc in raw_documents:
    ...
When you pass a dataframe to it (be it a single column or multiple columns), iterating with for doc in raw_documents: yields only the column names, not the actual data. The X you pass has three columns, so only those three column names are used as documents, hence the error
ValueError: Found input variables with inconsistent numbers of samples: [3, 100]
because your y has length 100, while your X, after going through TfidfVectorizer, effectively has length 3.
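You can see this with a toy dataframe; iterating over a DataFrame yields its column names, not its rows:
import pandas as pd
toy = pd.DataFrame({'Text': ['a b', 'c d'], 'Value': [1, 2], 'Text_Two': ['e f', 'g h']})
for doc in toy:
    print(doc)  # prints Text, Value, Text_Two -- the column names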
So to solve this, you have two options:
1) You need to do individual tf-idf vectorization for each text column (Text, Text_Two) and then combine the resulting matrices to form the feature matrix used with MultinomialNB (see the sketch after this list).
2) You can combine the two text columns into a single column as #âńōŋŷxmoůŜ has suggested and then do tf-idf on that single column.
Both options will result in different feature vectors, so you need to first understand what each one does and choose what you want.
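For option 1, here is a minimal sketch, assuming scikit-learn 0.20+ (which provides ColumnTransformer) and the df from the question; the transformer names are arbitrary:
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
# one TfidfVectorizer per text column; a scalar column selector hands each
# vectorizer the 1-D iterable of documents it expects
features = ColumnTransformer([
    ('tfidf_text', TfidfVectorizer(), 'Text'),
    ('tfidf_text_two', TfidfVectorizer(), 'Text_Two'),
])
pipe = make_pipeline(features, MultinomialNB())
pipe.fit(X_train, y_train)  # X_train must be a DataFrame that still has both text columns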
I would rather combine the Text and Text_Two columns into one column and then build the classifier from there, since the vectorizer accepts only a single sequence of documents. Below is code that combines Text and Text_Two into one column.
You might be interested in multi-class or multi-label classification, but that refers to the target variables (Y) rather than the dependent variables (X):
http://scikit-learn.org/stable/modules/multiclass.html. Hope it helps.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
df = pd.read_csv('data.csv', header=0, sep='|')
df.columns = [x.strip() for x in df.columns]
#df.loc[df['Place'] == 'europe','Place'] = 0
#df.loc[df['Place'] == 'america','Place'] = 1
#X = df[['Text', 'Value', 'Text_Two']]
X = df.Text + df.Text_Two
y = df['Place']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)
pipe = make_pipeline(TfidfVectorizer(), MultinomialNB())
pipe.fit(X_train, y_train)
pipe.predict(X_test)

Training a Logistic Classifier

I'm trying to train a logistic classifier. My dataset has the following columns:
name, review, rating, reviews_cleaned, word_count, sentiment
The sentiment is either +1 or -1 depending on whether the rating is greater than 3 or not. word_count contains a dict of words with occurrence counts, and reviews_cleaned is the review text stripped of punctuation.
This is my code to train a LogisticClassifier.
train_data, test_data = train_test_split(products, test_size = 0.2)
sentiment_model = LogisticRegression(penalty='l2', C=1)
sentiment_model.fit(products['sentiment'], products['word_count'])
I get the following error,
ValueError: Found input variables with inconsistent numbers of samples: [1, 166752]
PS: The equivalent statement using GraphLab Create is
sentiment_model = graphlab.logistic_classifier.create(train_data,
                                                      target='sentiment',
                                                      features=['word_count'],
                                                      validation_set=None)
What am I doing wrong?
Your training data looks like a 1-dimensional vector, but sklearn requires it to be 2-dimensional; if you reshape it you should be okay. Also note that fit takes the features first and the target second, and that you make your train/test split but never actually use the data you produce (fit with train_data instead).
Using GraphLab in that course is very irritating to say the least. Give this a whirl:
import pandas as pd
from sklearn.model_selection import train_test_split
# for older sklearn versions: from sklearn.cross_validation import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer

df = pd.read_csv('amazon_baby.csv', header=0)
df.dropna(how="any", inplace= True)
products = df[df['rating'] != 3].copy()  # drop the products with a 3-star rating
products['sentiment'] = products['rating'] >= 4
X_train, X_test, y_train, y_test = train_test_split(products['review'], products['sentiment'], test_size = .2 ,random_state = 0)
vect = CountVectorizer()
X_train = vect.fit_transform(X_train.values)
X_test = vect.transform(X_test.values)
model = LogisticRegression(penalty ='l2', C = 1)
model.fit(X_train, y_train)
I'm not sure what the direct translation between Sklearn/Pandas and GraphLab is, but this looks like it's what they are doing.
When I score the model, I get:
model.score(X_test, y_test)
> .93155480
Let me know what results you get or if this works for you.

Preparing CSV file data for Scikit-Learn Using Pandas?

I have a csv file without headers which I'm importing into python using pandas. The last column is the target class, while the rest of the columns are pixel values for images. How can I go ahead and split this dataset into a training set and a testing set using pandas (80/20)?
Also, once that is done how would I also split each of those sets so that I can define x (all columns except the last one), and y (the last column)?
I've imported my file using:
dataset = pd.read_csv('example.csv', header=None, sep=',')
Thanks
I'd recommend using sklearn's train_test_split
from sklearn.model_selection import train_test_split
# for older versions import from sklearn.cross_validation
# from sklearn.cross_validation import train_test_split
X, y = dataset.iloc[:, :-1], dataset.iloc[:, -1]
kwargs = dict(test_size=0.2, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, **kwargs)
You can try this.
Separating the target class from the rest:
pixel_values = dataset[dataset.columns[:-1]]
target_class = dataset[dataset.columns[-1]]
Now to create training and test samples, I would just use numpy's rand:
import numpy as np
mask = np.random.rand(len(pixel_values)) < 0.8
train = pixel_values[mask]
test = pixel_values[~mask]
Now you have training and test samples in train and test with an 80:20 ratio.
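Note that this splits only the features; to keep features and target aligned, apply the same mask to both:
X_train, X_test = pixel_values[mask], pixel_values[~mask]
y_train, y_test = target_class[mask], target_class[~mask]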
You can simply do:
import numpy as np
choices = np.in1d(dataset.index, np.random.choice(dataset.index, int(0.8*len(dataset)), replace=False))
training = dataset[choices]
testing = dataset[np.invert(choices)]
Then, to pass it as x and y to Scikit-Learn:
scikit_func(x=training.iloc[:,0:-1], y=training.iloc[:,-1])
Let me know if this doesn't work.
