Identify feature names from a pandas dataframe - python

I have a piece of code as follow:
# transforming data to best 20 features
from sklearn.feature_selection import SelectKBest, chi2
import matplotlib.pyplot as plt
fs = SelectKBest(score_func=chi2, k=20)
fs.fit(X_train, y_train)
X_train = fs.transform(X_train)
X_test = fs.transform(X_test)
# what are scores for the features
for i in range(len(fs.scores_)):
print('Feature %d: %f' % (i, fs.scores_[i]))
# plot the scores
plt.bar([i for i in range(len(fs.scores_))], fs.scores_)
plt.show()
The plot gives me the output as indicated in this picture, I'm wondering how can I identify the actual feature names of these features instead of "1-20" ? I tried get_support() but it gives an error since my data is in array format as I used train_test_split.

The features are in the same order as the data inside the X_train array. So in order to get your feature names you should extract them before making the X_train to a numpy array. If you are using a pandas dataframe named df what you can do is:
for i in range(len(fs.scores_)):
print(f'Feature {df.columns[i]}: {fs.scores_[i]}')

Related

Feature importance using gridsearchcv for logistic regression

I've trained a logistic regression model like this:
reg = LogisticRegression(random_state = 40)
cvreg = GridSearchCV(reg, param_grid={'C':[0.05,0.1,0.5],
'penalty':['none','l1','l2'],
'solver':['saga']},
cv = 5)
cvreg.fit(X_train, y_train)
Now to show the feature's importance I've tried this code, but I don't get the names of the coefficients in the plot:
from matplotlib import pyplot
importance = cvreg.best_estimator_.coef_[0]
pyplot.bar([x for x in range(len(importance))], importance)
pyplot.show()
Obviously, the plot isn't very informative. How do I add the names of the coefficients to the x-axis?
The importance of the coeff is:
cvreg.best_estimator_.coef_
array([[1.10303023e+00, 7.48816905e-01, 4.27705027e-04, 6.01404570e-01]])
The coefficients correspond to the columns of X_train, so pass in the X_train names instead of range(len(importance)).
Assuming X_train is a pandas dataframe:
import matplotlib.pyplot as plt
features = X_train.columns
importance = cvreg.best_estimator_.coef_[0]
plt.bar(features, importance)
plt.show()
Note that if X_train is just a numpy array without column names, you will have to define the features list based on your own data dictionary.

Identifying arrays with a band structure

I would like to identify arrays with a band like structure (first image) as compared to a more homogeneous structure shown in the homogenous image.
I have so far used some skewness and RMS techniques to test for this but it doesn't work well if the bands are evenly spaced. Are there any more refined ways of identifying such arrays in Python?
Try sns.pairplot.
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# The dataset: wages
# We fetch the data from OpenML. Note that setting the parameter as_frame to True will retrieve the data as a pandas dataframe.
from sklearn.datasets import fetch_openml
survey = fetch_openml(data_id=534, as_frame=True)
# Then, we identify features X and targets y: the column WAGE is our target variable (i.e., the variable which we want to predict).
X = survey.data[survey.feature_names]
X.describe(include="all")
y = survey.target.values.ravel()
survey.target.head()
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
train_dataset = X_train.copy()
train_dataset.insert(0, "WAGE", y_train)
sns.pairplot(train_dataset, kind='reg', diag_kind='kde')

How can I normalize columns in a data frame and then plot a regression line? [duplicate]

This question already has answers here:
Linear Regression on Pandas DataFrame using Sklearn ( IndexError: tuple index out of range)
(5 answers)
Closed 3 years ago.
I have a large data frame with MANY columns. I want to normalize a few columns which are all numeric, and then plot two using regression. I thought the code below would do it for me.
from sklearn import preprocessing
# Create x, where x the 'scores' column's values as floats
modDF = df[['WeightedAvg','Score','Co','Score', 'PeerGroup', 'TimeT', 'Ter', 'Spread']].values.astype(float)
# Create a minimum and maximum processor object
min_max_scaler = preprocessing.MinMaxScaler()
# Create an object to transform the data to fit minmax processor
x_scaled = min_max_scaler.fit_transform(modDF)
# Run the normalizer on the dataframe
df_normalized = pd.DataFrame(x_scaled)
import seaborn as sns
import matplotlib.pyplot as plt
sns.regplot(x="WeightedAvg", y="Spread", data=modDF)
However, I am getting the following error: IndexError: only integers, slices (:), ellipsis (...), numpy.newaxis (None) and integer or boolean arrays are valid indices
I did a regression without normalizing, using sns.regplot and it worked, but it looked weird, so I want to see it with normalization applied. I know how the regression works. I just don't know how the regression works.
There is no need to use the command: df_normalized = pd.DataFrame(x_scaled).
If you want to run a linear regression. This should work:
from sklearn import preprocessing
from sklearn.linear_model import LinearRegression
df = ['WeightedAvg','Score','Co','Score', 'PeerGroup', 'TimeT', 'Ter', 'Spread']
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce', axis=1)
X = df[['WeightedAvg','Score','Co','Score', 'PeerGroup', 'TimeT', 'Ter', 'Spread']]
#select your target variable
y = df[['target']]
#train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
# Create a minimum and maximum processor object
min_max_scaler = preprocessing.MinMaxScaler()
# Create an object to transform the data to fit minmax processor
X_train_scaled = min_max_scaler.fit_transform(X_train)
X_test_scaled = min_max_scaler.transform(X_test)
#start linear regression
reg = LinearRegression().fit(X_train_scaled, y_train)
#predict for test
y_predict = reg(X_test_scaled, y_test)
If you work with train/test-split it is important that you use the scaler fitting only on the training data, the test data is unknow to that point in time! For the testing part you are only allowed to use it for transforming.

mlextend plot_decision_regions with model fit on Pandas DataFrame?

I'm a big fan of mlxtend's plot_decision_regions function, (http://rasbt.github.io/mlxtend/#examples , https://stackoverflow.com/a/43298736/1870832)
It accepts an X(just two columns at a time), y, and (fitted) classifier clf object, and then provides a pretty awesome visualization of the relationship between model predictions, true y-values, and a pair of independent variables.
A couple restrictions:
X and y have to be numpy arrays, and clf needs to have a predict() method. Fair enough. My problem is that in my case, the classifier clf object I would like to visualize has already been fitted on a Pandas DataFrame...
import numpy as np
import pandas as pd
import xgboost as xgb
import matplotlib
matplotlib.use('Agg')
from mlxtend.plotting import plot_decision_regions
import matplotlib.pyplot as plt
# Create arbitrary dataset for example
df = pd.DataFrame({'Planned_End': np.random.uniform(low=-5, high=5, size=50),
'Actual_End': np.random.uniform(low=-1, high=1, size=50),
'Late': np.random.random_integers(low=0, high=2, size=50)}
)
# Fit a Classifier to the data
# This classifier is fit on the data as a Pandas DataFrame
X = df[['Planned_End', 'Actual_End']]
y = df['Late']
clf = xgb.XGBClassifier()
clf.fit(X, y)
So now when I try to use plot_decision_regions passing X/y as numpy arrays...
# Plot Decision Region using mlxtend's awesome plotting function
plot_decision_regions(X=X.values,
y=y.values,
clf=clf,
legend=2)
I (understandably) get an error that the model can't find the column names of the dataset it was trained on
ValueError: feature_names mismatch: ['Planned_End', 'Actual_End'] ['f0', 'f1']
expected Planned_End, Actual_End in input data
training data did not have the following fields: f1, f0
In my actual case, it would be a big deal to avoid training our model on Pandas DataFrames. Is there a way to still produce decision_regions plots for a classifier trained on a Pandas DataFrame?
Try to change:
X = df[['Planned_End', 'Actual_End']].values
y = df['Late'].values
and proceed to:
clf = xgb.XGBClassifier()
clf.fit(X, y)
plot_decision_regions(X=X,
y=y,
clf=clf,
legend=2)
OR fit & plot using X.values and y.values

add column to data set in python

I am trying to add predicted data back to my original dataset in Python. I think I'm supposed to use Pandas and ASSIGN and pd.DataFrame but I have no clue how to write this after reading all the documentation (sorry I'm new to all this and just started learning coding recently). I've written my code below and just need help with the code for adding my predictions back to the dataset. Thanks for the help!
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values
# Splitting the dataset into the Training set and Test set
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25,
random_state = 0)
# Feature Scaling X_train and X_test
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
#Feature scaling the all independent variables used to build the model
whole_dataset = sc.transform(X)
# Fitting classifier to the Training set
# Create your Naive Bayes here
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)
# Predicting the Test set results
y_pred = classifier.predict_proba(X_test)
# Predicting the results for the whole dataset
y_pred2 = classifier.predict_proba(whole_dataset)
# Add y_pred2 predictions back to the dataset
???
You can just do dataset['prediction'] = y_pred to add a new column.
Pandas supports a simple syntax for adding new columns, here it will add a new column and probably take a view on the numpy array returned from sklearn so it should be nice and fast.
EDIT
Looking at your code and the data, you're misunderstanding what train_test_split does, this is splitting the data into 3/4 1/4 splits of your original dataset which has 400 rows, your X train data contains 300 rows, the test data is 100 rows. You're then trying to assign back to your original dataset which is 400 rows. Firstly the number of rows don't match, secondly what is returned from predict_proba is a matrix of the predicted classes as a percentage. So what you want to do after training is to predict on the original dataset and assign this back as 2 columns by sub-selecting each column:
y_pred = classifier.predict_proba(X)
now assign this back :
dataset['predict_class_1'],dataset['predict_class_2'] = y_pred[:,0],y_pred[:,1]
There are several solutions. The answer of EdChurm had mentioned one.
As far as I know, pandas has other 2 methods to work with it.
df.insert()
df.assign()
Since you didn't provide the data in use, here's a pretty simple example.
import pandas as pd
import numpy as np
np.random.seed(1)
df = pd.DataFrame(np.random.randn(10), columns=['raw'])
df = df.assign(cube_raw=df['raw']**2)
df.insert(1,'square_raw',df['raw']**3)
df
raw square_raw cube_raw
0 1.624345 2.638498 4.285832
1 -0.611756 0.374246 -0.228947
2 -0.528172 0.278965 -0.147342
3 -1.072969 1.151262 -1.235268
4 0.865408 0.748930 0.648130
5 -2.301539 5.297080 -12.191435
6 1.744812 3.044368 5.311849
7 -0.761207 0.579436 -0.441071
8 0.319039 0.101786 0.032474
9 -0.249370 0.062186 -0.015507
Just keep in mind that df.assign() doesn't work inplace, so you should reassign to your previous variable.
In my opinion, I prefer df.insert() the most, for it allows you to assign which location you want to insert. (with parameter loc)

Categories

Resources