IndexError while getting feature importance in logistic regression using weights - python

Doing some sentiment analysis, I am trying to get the feature importance using logistic regression. I found a reference here (How to get feature importance in logistic regression using weights?) describing how to do it, but when I implement it I get an error, and I don't know why or how to solve it.
Can someone help me?
Here is my code:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import StandardScaler
## Creating Training data
Independent_var = df_final.tweet # the features
Dependent_var = df_final.sent_binary # the sentiment (positive, negative, neutral)
# Logistic regression
cv = CountVectorizer(min_df=2, max_df=0.50, ngram_range = (1,2), max_features=50)
text_count_vector = cv.fit_transform(Independent_var)
#standardized_data = StandardScaler(with_mean=False).fit_transform(text_count_vector)
feature_names = np.array(cv.get_feature_names())
#feature_names
## Splitting in the given training data for our training and testing
X_tr, X_test, y_tr, y_test = train_test_split(text_count_vector, Dependent_var, test_size=0.3, random_state=225)
LogReg = LogisticRegression(solver='lbfgs', multi_class='multinomial')
LogReg_clf = LogReg.fit(X_tr, y_tr)
#coefs = np.abs(LogReg_clf.coef_)
coefs = LogReg_clf.coef_
#get the sorting indices
sorted_index = np.argsort(coefs)[::-1]
# check if the sorting indices are correct
print(coefs[sorted_index])
#get the index of the top-20 features
top_20 = sorted_index[:20]
#get the names of the top 20 most important features
print(feature_names[top_20])
The error I get:
IndexError Traceback (most recent call last)
<ipython-input-103-b566f1c5a21c> in <module>
22 print(sorted_index)
23 # check if the sorting indices are correct
---> 24 print(coefs[sorted_index])
25
26 #get the index of the top-20 features
IndexError: index 23 is out of bounds for axis 0 with size 3
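A likely cause of this traceback: with multi_class='multinomial' and three sentiment classes, LogReg_clf.coef_ has shape (3, 50), one row of 50 coefficients per class. np.argsort(coefs) therefore returns a (3, 50) array of column indices, and coefs[sorted_index] uses those values (which run up to 49) to index axis 0, which only has size 3; that is exactly the reported "index 23 is out of bounds for axis 0 with size 3". A minimal sketch of a per-class fix, assuming the goal is the top features for each sentiment separately:
# coefs has shape (n_classes, n_features) for a multinomial model
for class_label, class_coefs in zip(LogReg_clf.classes_, coefs):
    # sort this class's coefficients in descending order
    top_20 = np.argsort(class_coefs)[::-1][:20]
    print(class_label, feature_names[top_20])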

Related

How to solve Feature Shape Mismatch error in XGBoost Feature Selection?

I'm trying to use the following code to get feature importances for feature selection in XGBoost, but I keep getting an error saying "Feature shape mismatch, expected: 91, got 78".
select_X_train has 78 features (columns), as does select_X_test, so I think the issue is that the thresholds array must still contain the 91 original variables? But I can't figure out a way to fix it. Can you please advise?
Here is the code:
from numpy import loadtxt
from numpy import sort
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import classification_report, confusion_matrix
# load data
dataset = df_train
# split data into X and y
X_train = df[df.columns.difference(['Date','IsDeceased','IsTotal','Deceased','Sick','Injured','Displaced','Homeless','MissingPeople','Other','Total'])]
y_train = df['IsDeceased'].values
X_test = df_test[df_test.columns.difference(['Date','IsDeceased','IsTotal','Deceased','Sick','Injured','Displaced','Homeless','MissingPeople','Other','Total'])]
y_test = df_test['IsDeceased'].values
# fit model on all training data
model = XGBClassifier()
model.fit(X_train, y_train)
# make predictions for test data and evaluate
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
# fit model using each importance as a threshold
thresholds = sort(model.feature_importances_)
for thresh in thresholds:
    # select features using threshold
    selection = SelectFromModel(model, threshold=thresh, prefit=True)
    select_X_train = selection.transform(X_train)
    # train model on the selected features
    selection_model = XGBClassifier()
    selection_model.fit(select_X_train, y_train)
    print(thresh)
    # eval model
    select_X_test = selection.transform(X_test)
    y_pred = model.predict(select_X_test)
    report = classification_report(y_test, y_pred)
    print("Thresh= {} , n= {}\n {}".format(thresh, select_X_train.shape[1], report))
    cm = confusion_matrix(y_test, y_pred)
    print(cm)
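The error message points at one line in particular: inside the loop, the predictions come from model, which was fit on all 91 training columns, while select_X_test holds only the 78 columns that survived the threshold. A minimal sketch of the evaluation step using selection_model instead (my assumption about the intended fix):
# evaluate with the model trained on the reduced feature set
select_X_test = selection.transform(X_test)
y_pred = selection_model.predict(select_X_test)
print(classification_report(y_test, y_pred))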

ValueError: Found array with 0 feature(s) (shape=(124, 0)) while a minimum of 1 is required

I am trying to apply PCA (principal component analysis) to a dataset with 124 rows and 13 features. I'm trying to see how many features to use (via logistic regression) to get the most accurate prediction. I have this code here:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
df_wine = pd.read_csv('https://archive.ics.uci.edu/ml/'
                      'machine-learning-databases/wine/wine.data', header=None)
from sklearn.model_selection import train_test_split
X, y = df_wine.iloc[:, 1:].values, df_wine.iloc[:, 0].values
X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)
# standardize the features
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)
X_test_std = sc.transform(X_test)
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA
# initializing the PCA transformer and
# logistic regression estimator:
pca = PCA()  # prof recommends getting rid of n_components = 3
lr = LogisticRegression()
# dimensionality reduction:
X_train_pca = pca.fit_transform(X_train_std)
X_test_pca = pca.transform(X_test_std)
"""
rows = len(X_train_pca)
columns = len(X_train_pca[0])
print(rows)
print(columns)
"""
# fitting the logistic regression model on the reduced dataset:
for i in range(12):
    lr.fit(X_train_pca[:, :i], y_train)
    y_train_pca = lr.predict(X_train_pca[:, :i])
    print('Training accuracy:', lr.score(X_train_pca[:, :i], y_train))
I get the error message:
ValueError: Found array with 0 feature(s) (shape=(124, 0)) while a minimum of 1 is required.
To my understanding, the for-loop range of 12 is correct because it will go through all 13 features (0 through 12). I am trying to have the loop go through all the features (run the logistic regression with one feature, then two, then three, and so on, until all 13 features) and then compare the accuracies to see how many features work best.
To your error:
X_train_pca[:, :i] when i=0 will give you an empty array, which is invalid as an input to .fit().
How to solve:
If you want to fit the model with only the intercept, you can explicitly set fit_intercept=False in LogisticRegression() and add one extra column (at the leftmost position) to your X, filled with 1s, to act as the intercept.
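If the goal is simply to sweep from one component up to all 13, a minimal sketch of the loop without the empty slice (assuming the intercept-only model is not needed):
# start at 1 so every fit sees at least one principal component
for i in range(1, X_train_pca.shape[1] + 1):
    lr.fit(X_train_pca[:, :i], y_train)
    print(i, 'component(s), training accuracy:',
          lr.score(X_train_pca[:, :i], y_train))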

ValueError: dimension mismatch While Predicting New Values Sentiment Analysis

I am relatively new to machine learning. I am trying to do sentiment analysis prediction.
The Type column contains the sentiment of the tweet (positive, negative, or neutral as 0, 1 and 2). The Tweet column contains the tweets.
I am trying to predict the sentiments of a new set of tweets as 0, 1 and 2.
When I ran the code given here, I got a dimension mismatch error.
import pandas as pd
train_tweets = pd.read_csv("tweets_type.csv")
from sklearn.model_selection import train_test_split
y = train_tweets.Type
X= train_tweets.Tweet
train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=1)
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
vect.fit(train_X)
train_X_dtm = vect.transform(train_X)
test_X_dtm = vect.transform(test_X)
test_X_dtm
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
%time nb.fit(train_X_dtm, train_y)
# make class predictions for X_test_dtm
y_pred_class = nb.predict(test_X_dtm)
# calculate accuracy of class predictions
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix
metrics.accuracy_score(test_y, y_pred_class)
march_tweets = pd.read_csv("march_data.csv")
X=march_tweets.Tweet
vect.fit(X)
train_new_dtm = vect.transform(X)
new_pred_class = nb.predict(train_new_dtm)
The error I am getting is the dimension mismatch raised by the final predict call.
I would be so glad if you could help me.
It seems I made a mistake by fitting the vectorizer on X after I had already fitted it on train_X. I found out there is no use in doing that again once the model is fitted. So I removed this line and it worked perfectly:
vect.fit(X)
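For clarity, a minimal sketch of the corrected prediction block, which reuses the vocabulary the vectorizer learned from the training tweets:
march_tweets = pd.read_csv("march_data.csv")
X = march_tweets.Tweet
# transform only; refitting would rebuild the vocabulary and change
# the columns the trained model expects
new_dtm = vect.transform(X)
new_pred_class = nb.predict(new_dtm)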

Train-test split does not seem to work properly in Python?

I am trying to run a kNN (k-nearest neighbour) algorithm in Python.
The dataset I am using to try and do this is available at the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/wine
Here is the code I am using:
#1. LIBRARIES
import os
import pandas as pd
import numpy as np
print os.getcwd() # Prints the working directory
os.chdir('C:\\file_path') # Provide the path here
#2. VARIABLES
variables = pd.read_csv('wines.csv')
winery = variables['winery']
alcohol = variables['alcohol']
malic = variables['malic']
ash = variables['ash']
ash_alcalinity = variables['ash_alcalinity']
magnesium = variables['magnesium']
phenols = variables['phenols']
flavanoids = variables['flavanoids']
nonflavanoids = variables['nonflavanoids']
proanthocyanins = variables['proanthocyanins']
color_intensity = variables['color_intensity']
hue = variables['hue']
od280 = variables['od280']
proline = variables['proline']
#3. MAX-MIN NORMALIZATION
alcoholscaled=(alcohol-min(alcohol))/(max(alcohol)-min(alcohol))
malicscaled=(malic-min(malic))/(max(malic)-min(malic))
ashscaled=(ash-min(ash))/(max(ash)-min(ash))
ash_alcalinity_scaled=(ash_alcalinity-min(ash_alcalinity))/(max(ash_alcalinity)-min(ash_alcalinity))
magnesiumscaled=(magnesium-min(magnesium))/(max(magnesium)-min(magnesium))
phenolsscaled=(phenols-min(phenols))/(max(phenols)-min(phenols))
flavanoidsscaled=(flavanoids-min(flavanoids))/(max(flavanoids)-min(flavanoids))
nonflavanoidsscaled=(nonflavanoids-min(nonflavanoids))/(max(nonflavanoids)-min(nonflavanoids))
proanthocyaninsscaled=(proanthocyanins-min(proanthocyanins))/(max(proanthocyanins)-min(proanthocyanins))
color_intensity_scaled=(color_intensity-min(color_intensity))/(max(color_intensity)-min(color_intensity))
huescaled=(hue-min(hue))/(max(hue)-min(hue))
od280scaled=(od280-min(od280))/(max(od280)-min(od280))
prolinescaled=(proline-min(proline))/(max(proline)-min(proline))
alcoholscaled.mean()
alcoholscaled.median()
alcoholscaled.min()
alcoholscaled.max()
#4. DATA FRAME
d = {'alcoholscaled' : pd.Series([alcoholscaled]),
     'malicscaled' : pd.Series([malicscaled]),
     'ashscaled' : pd.Series([ashscaled]),
     'ash_alcalinity_scaled' : pd.Series([ash_alcalinity_scaled]),
     'magnesiumscaled' : pd.Series([magnesiumscaled]),
     'phenolsscaled' : pd.Series([phenolsscaled]),
     'flavanoidsscaled' : pd.Series([flavanoidsscaled]),
     'nonflavanoidsscaled' : pd.Series([nonflavanoidsscaled]),
     'proanthocyaninsscaled' : pd.Series([proanthocyaninsscaled]),
     'color_intensity_scaled' : pd.Series([color_intensity_scaled]),
     'hue_scaled' : pd.Series([huescaled]),
     'od280scaled' : pd.Series([od280scaled]),
     'prolinescaled' : pd.Series([prolinescaled])}
df = pd.DataFrame(d)
#5. TRAIN-TEST SPLIT
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(np.matrix(df),np.matrix(winery),test_size=0.3)
print X_train.shape, y_train.shape
print X_test.shape, y_test.shape
#6. K-NEAREST NEIGHBOUR ALGORITHM
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=10)
knn.fit(X_train, y_train)
print("Test set score: {:.2f}".format(knn.score(X_test, y_test)))
In section 5, when I run the train-test split imported from sklearn.model_selection, it does not appear to run correctly, because it prints the shapes (0, 13) (0, 178) (1, 13) (1, 178).
Then, upon trying to run the kNN, I get the error message: Found array with 0 sample(s) (shape=(0, 13)) while a minimum of 1 is required. This is not due to the max-min normalisation, as I still get this error even when the variables are not scaled.
I'm not exactly sure where your code is going wrong, it's a slightly different way of going about it compared to the sklearn docs. However, I can show you a different way of getting the train test split to work on the wine dataset for you.
from sklearn.datasets import load_wine
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
X, y = load_wine(return_X_y=True)
X_scaled = MinMaxScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y,
                                                    test_size=0.3)
knn = KNeighborsClassifier(n_neighbors=10)
knn.fit(X_train, y_train)
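As for where the original code goes wrong, my reading is the data-frame construction in section 4: wrapping each already-built column in pd.Series([...]) produces a one-element Series whose single value is the whole column, so df ends up with one row, and np.matrix(winery) likewise has shape (1, 178). A 70/30 split of one row then leaves zero training samples, which matches the reported shapes. A minimal sketch of the fix within the original approach:
# pass the scaled Series directly; no extra list wrapping
d = {'alcoholscaled': alcoholscaled,
     'malicscaled': malicscaled,
     # ... the remaining scaled columns ...
     'prolinescaled': prolinescaled}
df = pd.DataFrame(d)
# winery is already one label per row; no np.matrix needed
X_train, X_test, y_train, y_test = train_test_split(df, winery, test_size=0.3)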

LDA Producing Fewer Components Than Requested in Python

I am working on the following data set:
http://archive.ics.uci.edu/ml/datasets/Bank+Marketing
The data can be found by clicking on the Data Folder link. There are two data sets present, a training and a testing set. The file I am using contains the combined data from both sets.
I am attempting to apply linear discriminant analysis (LDA) to obtain two components; however, when my code runs, it produces just a single component. I also obtain just a single component if I set n_components = 3.
I just finished testing PCA, which works fine for any number n I provide, as long as n is less than or equal to the number of features present in the X arrays at the time of the transformation.
I am not sure why LDA is behaving so strangely. Here is my code:
#Load libraries
import pandas
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
dataset = pandas.read_csv('bank-full.csv', engine="python", delimiter=';')
#Output Basic Dataset Info
print(dataset.shape)
print(dataset.head(20))
print(dataset.describe())
# Split-out validation dataset
X = dataset.iloc[:,[0,5,9,11,12,13,14]] #we are selecting only the "clean data" w/o preprocessing
Y = dataset.iloc[:,16]
validation_size = 0.20
seed = 7
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X, Y, test_size=validation_size, random_state=seed)
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_temp = X_train
X_validation = sc_X.transform(X_validation)
'''# Applying PCA
from sklearn.decomposition import PCA
pca = PCA(n_components = 5)
X_train = pca.fit_transform(X_train)
X_validation = pca.transform(X_validation)
explained_variance = pca.explained_variance_ratio_'''
# Applying LDA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
lda = LDA(n_components = 2)
X_train = lda.fit_transform(X_train, Y_train)
X_validation = lda.transform(X_validation)
LDA (at least the implementation in sklearn) can produce at most k-1 components, where k is the number of classes. So if you are dealing with binary classification, you'll end up with only one dimension.
Refer to the manual for more detail: http://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html
Also related:
Python (scikit learn) lda collapsing to single dimension
LDA ignoring n_components?
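A minimal self-contained sketch of that limit, on synthetic data purely for illustration (note that newer sklearn versions raise an error rather than silently truncating when n_components exceeds the limit, so only valid values are requested here):
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

rng = np.random.RandomState(0)
X = rng.randn(120, 7)

# 3 classes: up to 3 - 1 = 2 discriminant components are available
y3 = np.repeat([0, 1, 2], 40)
print(LDA(n_components=2).fit_transform(X, y3).shape)  # (120, 2)

# 2 classes (like this yes/no bank target): only 2 - 1 = 1 component
y2 = np.repeat([0, 1], 60)
print(LDA(n_components=1).fit_transform(X, y2).shape)  # (120, 1)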
