Naive Bayes MultinomialNB scikit-learn/sklearn - Python

I am building a naive Bayes classifier, following the tutorial on the scikit-learn website.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import time
import csv
import string
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
# Importing dataset
data = pd.read_csv("test.csv", quotechar='"', delimiter=',', quoting=csv.QUOTE_ALL, skipinitialspace=True, error_bad_lines=False)
df2 = data.set_index("name", drop = False)
df2['sentiment'] = df2['rating'].apply(lambda rating : +1 if rating > 3 else -1)
train, test = train_test_split(df2, test_size=0.2)
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(train['review'])
test_matrix = count_vect.transform(test['review'])
clf = MultinomialNB().fit(X_train_counts, train['sentiment'])
The first argument is the document-term matrix returned by the vectorizer.
What should the second argument be, twenty_train.target?
Edit: Data example
name,review,rating
film1,......,1
film2, the film is....,5
film3, film about..., 4
With this instruction I created a new column: if the rating is > 3 the review is positive, otherwise it is negative.
df2['sentiment'] = df2['rating'].apply(lambda rating : +1 if rating > 3 else -1)

The fit method of MultinomialNB expects x and y as input.
Here, x should be the training vectors (training data) and y should be the target values.
clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)
In more detail:
X : {array-like, sparse matrix}, shape = [n_samples, n_features]
Training vectors, where n_samples is the number of samples and n_features is
the number of features.
y : array-like, shape = [n_samples]
Target values.
Note: make sure that x has shape [n_samples, n_features] and y has shape [n_samples]. Otherwise, fit will throw an error.
Toy example:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
categories = ['alt.atheism', 'talk.religion.misc',
              'comp.graphics', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train',
                                      categories=categories)
vectorizer = TfidfVectorizer()
# the following will be the training data
vectors = vectorizer.fit_transform(newsgroups_train.data)
vectors.shape
newsgroups_test = fetch_20newsgroups(subset='test',
categories=categories)
# this is the test data
vectors_test = vectorizer.transform(newsgroups_test.data)
clf = MultinomialNB(alpha=.01)
# the fitting is done using the TRAINING data
# Check the shapes before fitting
vectors.shape
#(2034, 34118)
newsgroups_train.target.shape
#(2034,)
# fit the model using the TRAINING data
clf.fit(vectors, newsgroups_train.target)
# the PREDICTION is done using the TEST data
pred = clf.predict(vectors_test)
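To evaluate the prediction (this is what the metrics import above is for), compare pred against the test labels; a short sketch:
# score the PREDICTION against the TEST labels; the exact value
# depends on the downloaded data
print(metrics.f1_score(newsgroups_test.target, pred, average='macro'))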
EDIT:
newsgroups_train.target is just a numpy array that contains the labels (also called targets or classes).
import numpy as np
newsgroups_train.target
array([1, 3, 2, ..., 1, 0, 1])
np.unique(newsgroups_train.target)
array([0, 1, 2, 3])
So in this example we have 4 different classes/targets.
This variable is needed in order to fit a classifier.
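Applied to your own data, the same pattern would look like this (a sketch assuming the train/test split and the review/sentiment columns from your question):
count_vect = CountVectorizer()
# x: the document-term matrix built from the training reviews
X_train_counts = count_vect.fit_transform(train['review'])
test_matrix = count_vect.transform(test['review'])
# y: the sentiment labels you created from the ratings
clf = MultinomialNB().fit(X_train_counts, train['sentiment'])
pred = clf.predict(test_matrix)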

Related

NLP classification with sparse and numerical features crashes

I have a dataset of 10 million English shows, which has been cleaned and lemmatized, together with their classification into different category types such as comedy, documentary, action, etc.
I also have a feature called duration, which is the length of the TV show.
Data can be found here
I perform tf-idf vectorization on the titles, which returns a sparse matrix, and normalization on the duration column.
Then I want to feed the data to a logistic regression classifier.
Side question: I want to know if there's a better way to handle combining a sparse matrix and a numerical column (a sketch of one alternative follows the code below).
When I try to do the combination using todense() or toarray(), it works.
But when I pass the result to the logistic regression function, the notebook crashes. If I don't have the duration column, which means I don't have to apply toarray() or todense(), it works perfectly. Is this a memory issue?
This is my code:
import os
import pandas as pd
from sklearn import metrics
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
def normalize(df, col = ''):
    mms = MinMaxScaler()
    mms_col = mms.fit_transform(df[[col]])
    return mms_col
def tfidf(X, col = ''):
    tfidf_vectorizer = TfidfVectorizer(max_df = 0.8, max_features = 10000)
    return tfidf_vectorizer.fit_transform(X[col])
def get_training_data(df):
    df = shuffle(pd.read_csv(df).dropna())
    data = df[['name_title', 'Duration']]
    X_duration = normalize(data, col = 'Duration')
    X_sparse = tfidf(data, col = 'name_title')
    X = pd.DataFrame(X_sparse.toarray())
    X['Duration'] = X_duration
    y = df['target']
    return X, y
def logistic_regression(X, y):
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    lr = LogisticRegression(C = 100.0, random_state = 1, solver = 'lbfgs', multi_class = 'ovr')
    lr.fit(X_train, y_train)
    y_predict = lr.predict(X_test)
    print(y_predict)
    print("Logistic Regression Accuracy %.3f" % metrics.accuracy_score(y_test, y_predict))
data_path = '../data/'
X, y = get_training_data(os.path.join(data_path, 'podcasts_en_processed.csv'))
print(X.shape) # this prints (971426, 10001)
logistic_regression(X, y)
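A common way to sidestep the dense conversion (a sketch, not the poster's code) is scipy.sparse.hstack, which appends the numeric column while keeping everything sparse; LogisticRegression accepts sparse input directly:
from scipy.sparse import hstack, csr_matrix
# replaces the X_sparse.toarray() step in get_training_data above
X = hstack([X_sparse, csr_matrix(X_duration)]).tocsr()
# the 971426 x 10001 matrix stays sparse, so the logistic regression
# fit never has to materialise a dense array in memory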

scikitlearn SVR giving same values for predictions (have tried scaling)

I have a battery dataframe with rows representing various cycles and a set of features for each cycle.
As an example, row 1:
df = pd.DataFrame(columns=['Ecell_V', 'I_mA', 'EnergyCharge_W_h', 'QCharge_mA_h',
'EnergyDischarge_W_h', 'QDischarge_mA_h', 'Temperature__C',
'cycleNumber', 'SOH', 'Cell'])
df.loc[0] = [3.730646, 2988.8713, 0.185061, 49.724845, 0.0, 0.0, 27.5, 2, 0.99, 'VAH11']
There are 600,000 rows.
I am trying to predict the value of SOH as follows:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LinearRegression # for building a linear regression model
from sklearn.svm import SVR # for building SVR model
from sklearn.preprocessing import MinMaxScaler
train_data = pd.read_csv("train_data.csv")
train_cell = train_data.pop('Cell')
# reduce size of df train for comp purposes
train_data = train_data.iloc[::20, :]
train_data = train_data.reset_index(drop=True)
#remove unwanted features
train_data.pop('Ns')
train_data.pop('time_s')
#scale the data
scaler = MinMaxScaler()
train_data_scaled = scaler.fit_transform(train_data)
#return to df
train_data_scaled = pd.DataFrame(train_data_scaled, columns=['Ecell_V', 'I_mA', 'EnergyCharge_W_h', 'QCharge_mA_h',
'EnergyDischarge_W_h', 'QDischarge_mA_h', 'Temperature__C',
'cycleNumber', 'SOH'])
train_data_scaled
#unscale target
train_data_scaled['SOH'] = train_data['SOH']
train_data_scaled
#split target and input
X = train_data_scaled.drop('SOH', axis=1)
y = train_data_scaled['SOH'].values
#model
model = SVR(kernel='rbf', C=100, epsilon=1)
svr = model.fit(X, y)
#predict model
pred = model.predict(X)
Now printing pred gives the same prediction for each row:
array([0.89976814, 0.89976814, 0.89976814, ..., 0.89976814, 0.89976814,
0.89976814])
Why is this happening?
Using StandardScaler() on both the X and y data corrected this issue, with inverse_transform called afterwards to return the predictions to their original values.
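For reference, a minimal sketch of that fix (assuming the X and y from the question; the epsilon value is illustrative):
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
x_scaler = StandardScaler()
y_scaler = StandardScaler()
X_scaled = x_scaler.fit_transform(X)
# StandardScaler expects 2D input, so reshape y before scaling
y_scaled = y_scaler.fit_transform(y.reshape(-1, 1)).ravel()
# with standardized targets the default epsilon=0.1 is sensible; the
# original epsilon=1 tube was wider than the spread of SOH, which is
# one way SVR ends up predicting a single constant value
model = SVR(kernel='rbf', C=100, epsilon=0.1)
model.fit(X_scaled, y_scaled)
# invert the scaling to return predictions to the original SOH units
pred = y_scaler.inverse_transform(model.predict(X_scaled).reshape(-1, 1)).ravel()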

Value Error while classifying for multidimensional output classes using SVMs

I am trying to fit and classify my data using SVMs.
My input data consists of 11 features (dimensions) with 1335 samples, and the output data consists of 17 classes (1335x17).
from sklearn.svm import SVC
svclassifier = SVC(kernel='linear')
svccl = svclassifier.fit(x_train, y_train)
I get the following error (the same happens for kernel = 'poly'):
ValueError: y should be a 1d array, got an array of shape (934, 17) instead.
The same error comes when I try to classify using a Naive Bayes classifier:
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB().fit(x_train, y_train)
gnb_predictions = gnb.predict(x_test)
Where am I wrong in my approach?
SVC and GaussianNB do not support multi-target classification, so they will not accept anything other than a 1d array for y. To tackle that, you need to fit one classifier per target.
There is already an API for this: multioutput classification (MultiOutputClassifier).
You can combine it with any classifier you want.
Combining MultiOutputClassifier with SVC:
from sklearn.multioutput import MultiOutputClassifier
from sklearn.svm import SVC
import numpy as np
X = np.random.rand(934, 100)
Y = np.random.randint(17, size = [934, 17])
n_samples, n_features = X.shape
svc = SVC()
multi_target_svc = MultiOutputClassifier(svc, n_jobs=-1)
multi_target_svc.fit(X, Y).predict(X)
Combining MultiOutputClassifier with GaussianNB:
from sklearn.multioutput import MultiOutputClassifier
from sklearn.naive_bayes import GaussianNB
import numpy as np
X = np.random.rand(934, 100)
Y = np.random.randint(17, size = [934, 17])
n_samples, n_features = X.shape
gnb = GaussianNB()
multi_target_gnb = MultiOutputClassifier(gnb, n_jobs=-1)
multi_target_gnb.fit(X, Y).predict(X)

Expected 2d array but got scalar array instead

I am getting this error
ValueError: Expected 2D array, got scalar array instead: array=6.5.
Reshape your data either using array.reshape(-1, 1) if your data has a
single feature or array.reshape(1, -1) if it contains a single sample.
while executing this code
# SVR
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.svm import SVR
# Load dataset
dataset = pd.read_csv('Position_Salaries.csv')
X = dataset.iloc[:, 1:2].values
y = dataset.iloc[:, 2].values
# Fitting the SVR to the data set
regressor = SVR(kernel = 'rbf', gamma = 'auto')
regressor.fit(X, y)
# Predicting a new result
y_pred = regressor.predict(6.5)
You need to understand how SVM works. Your training data is a matrix of shape (n_samples, n_features). That means your SVM operates in a feature space of n_features dimensions. Hence, it cannot predict a value for a scalar input unless n_features is 1. You can only predict values for vectors of dimension n_features. So, if your data set has 5 columns, you can predict values for an arbitrary row vector with 5 columns. See the example below.
import numpy as np
from sklearn.svm import SVR
# Data: 200 instances of 5 features each
X = np.random.randint(1, 100, size=(200, 5))
y = np.random.randint(0, 2, size=200)
reg = SVR()
reg.fit(X, y)
y_test = np.array([[0, 1, 2, 3, 4]]) # Input to .predict must be 2-dimensional
reg.predict(y_test)
# The same applies to linear and polynomial regression; here lin_reg,
# lin_reg_2 and poly_reg are assumed to be models fitted elsewhere.
# Predicting a new result with Linear Regression
X_test = np.array([[6.5]])
print(lin_reg.predict(X_test))
# Predicting a new result with Polynomial Regression
print(lin_reg_2.predict(poly_reg.fit_transform(X_test)))
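Applied to the SVR from the question, the failing call becomes:
# one sample with one feature: a 2D array of shape (1, 1)
y_pred = regressor.predict(np.array([[6.5]]))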

Print predict ValueError: Expected 2D array, got 1D array instead

The error shows up in my last two lines of code.
ValueError: Expected 2D array, got 1D array instead: array=[0 1].
Reshape your data either using array.reshape(-1, 1) if your data has a
single feature or array.reshape(1, -1) if it contains a single sample.
import numpy as np
import pandas as pd
from sklearn.model_selection import ShuffleSplit
%matplotlib inline
df = pd.read_csv('.......csv')
df.drop(['Company'], axis=1, inplace=True)
x = pd.DataFrame(df.drop(['R&D Expense'],1))
y = pd.DataFrame(df['R&D Expense'])
X_test = x.index[[0,1]]
y_test = y.index[[0,1]]
X_train = x.drop(x.index[[0,1]])
y_train = y.drop(y.index[[0,1]])
from sklearn.metrics import r2_score
def performance_metric(y_true, y_predict):
    score = r2_score(y_true, y_predict)
    return score
from sklearn.metrics import make_scorer
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV
def fit_model_shuffle(x, y):
    cv_sets = ShuffleSplit(n_splits = 10, test_size = 0.20, random_state = 0)
    regressor = KNeighborsRegressor()
    params = {'n_neighbors': range(3, 10)}
    scoring_fnc = make_scorer(performance_metric)
    grid = GridSearchCV(regressor, param_grid=params, scoring=scoring_fnc, cv=cv_sets)
    grid = grid.fit(x, y)
    return grid.best_estimator_
reg = fit_model_shuffle(X_train, y_train)
for i, y_predict in enumerate(reg.predict(X_test), 1):
    print(i, y_predict)
The error message is self-explanatory. The library expects the input to be a 2D matrix, with one sample per row. So, if you are doing regression with just one input feature, reshape the data before passing it to the regressor:
my_data = my_data.reshape(-1, 1)
to make a 2x1 matrix.
On the other hand (unlikely here), if you have a single sample [0, 1], do
my_data = my_data.reshape(1, -1)
to make a 1x2 matrix.
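A quick demonstration of the two reshapes (a minimal sketch):
import numpy as np
a = np.array([0, 1])  # 1D array, shape (2,)
a.reshape(-1, 1)      # shape (2, 1): two samples with one feature each
a.reshape(1, -1)      # shape (1, 2): one sample with two features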
