I used the L1-based feature selection shown below to select suitable columns from a pandas DataFrame X.
from sklearn.svm import LinearSVC
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
iris = load_iris()
X, y = iris.data, iris.target
lsvc = LinearSVC(C=0.01, penalty="l1", dual=False).fit(X, y)
model = SelectFromModel(lsvc, prefit=True)
X_new = model.transform(X)
However, it is not clear to me how I can get the column names. Since X_new is a numpy array, I tried this:
X_new.dtype.names
But it returns nothing. So how can I actually find out which columns have been selected?
Once you have converted your data into a csv file, you will want to use pd.read_csv to get that file into a dataframe.
You can then use the columns attribute to access the column names.
Furthermore, you can call the tolist() method on it to get those names as a list, as in the sketch below.
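A minimal sketch of that approach (f.csv is just a placeholder file name):
import pandas as pd

df = pd.read_csv('f.csv')    # load the csv into a dataframe
print(df.columns)            # Index object holding the column names
print(df.columns.tolist())   # the same names as a plain Python list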
Alternatively, you could use Ahmad's method:
import re
with open('f.csv', 'r') as f:
    alllines = f.readlines()
columns = re.sub(' +', ' ', alllines[0])  # collapse runs of spaces in the header line
columns = columns.strip().split(',')      # split the header on commas
print(columns)
EDIT: The OP solved the question by calling get_support() on the fitted SelectFromModel instance (model.get_support()).
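For completeness, a minimal sketch of that fix, continuing the code above (the column names come from iris.feature_names, since X here is a plain numpy array):
import numpy as np
mask = model.get_support()                     # boolean mask of the kept features
selected = np.array(iris.feature_names)[mask]  # map the mask back to names
print(selected)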
I have a dataframe:
df = pd.DataFrame({'Company': ['abc', 'xyz', 'def'],
                   'Q1-2019': [9.05, 8.64, 6.3],
                   'Q2-2019': [8.94, 8.56, 7.09],
                   'Q3-2019': [8.86, 8.45, 7.09],
                   'Q4-2019': [8.34, 8.61, 7.25]})
The data is an average response of the same question asked across 4 quarters.
I am trying to create a benchmark index from this data. To do so, I want to preprocess it first by either standardizing or normalizing it.
How would I standardize/normalize across the entire dataframe? What is the best way to go about this?
I can do this for a single row or column, but I am struggling to do it across the whole dataframe:
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
# define scaler
scaler = MinMaxScaler()  # or StandardScaler()
X = df.loc[1, 'Q1-2019':'Q4-2019']  # one row of the numeric columns
X = X.to_numpy().reshape(-1, 1)     # the scaler expects a 2D array
# transform data
scaled = scaler.fit_transform(X)
If I understood your need correctly, you can use ColumnTransformer to apply the same transformation (e.g. scaling) separately to different columns.
As you can read in the linked documentation, you need to provide a tuple containing:
a name for the step
the chosen transformer (e.g. StandardScaler), or a whole Pipeline
a list of columns to which the selected transformation should be applied
Code example
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
# specify the columns to scale
columns = ['Q1-2019', 'Q2-2019', 'Q3-2019', 'Q4-2019']
# create a ColumnTransformer instance
ct = ColumnTransformer([
    ('scaler', StandardScaler(), columns)
])
# fit and transform the input dataframe
ct.fit_transform(df)
array([[ 0.86955718,  0.93177476,  0.96056682,  0.46493449],
       [ 0.53109031,  0.45544147,  0.41859563,  0.92419906],
       [-1.40064749, -1.38721623, -1.37916245, -1.38913355]])
ColumnTransformer outputs a numpy array with the transformed values, fitted on the input dataframe df. Even though there are no column names any more, the array columns are still ordered in the same way as the input dataframe, so it is easy to convert the array back to a pandas dataframe if you need to.
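If you do need a dataframe again, a minimal sketch (assuming pandas is imported as pd and reusing the columns list from above):
scaled_df = pd.DataFrame(ct.fit_transform(df), columns=columns)  # same column order as the input
print(scaled_df.head())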
In addition to @RicS's answer, note that what the scikit-learn function returns is a numpy array, not a dataframe any more. Also, the Company column is not included. You may consider this to convert the result back into a dataframe:
scaler = StandardScaler()
x = scaler.fit_transform(df.drop("Company", axis=1))  # scale all columns except Company
y = pd.concat([df["Company"], pd.DataFrame(x, columns=df.columns[1:])], axis=1)  # put Company and the scaled values back into one dataframe
y.head()
I want to split my data into features and labels, where the first 6 columns determine the 7th column. I have selected the first 6 columns, and that part works perfectly:
import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report, confusion_matrix
#Assign column names to the dataset
names=['buying', 'maint', 'doors', 'persons', 'lug_boot','safety', 'class']
# load the dataset in csv format into the pandas dataframe
cardata= pd.read_csv(r'C:\Users\user\Downloads\car.data', names=names)
X = cardata.iloc[:, 0:6]
The above code is working perfectly and when I run
print(X.head())
it prints the first 6 columns, with the exception of the last column, which is supposed to be predicted.
But the code below does not seem to work, as it outputs the same thing as the code above:
y = cardata.select_dtypes(include=[object])
print(y.head())
Please help: I need to assign the variable y to only the last column, that is, the seventh column. At the moment the output of print(y.head()) is the same as print(X.head()), which should not be the case; I need it to print only the last column.
Try this:
X, y = cardata.iloc[:, :-1], cardata.iloc[:, -1]
This selects all rows and splits X and y on the last column (index = -1). This should get you the result you are looking for; see the quick check below.
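A quick check of the result, assuming cardata is loaded as in the question:
X, y = cardata.iloc[:, :-1], cardata.iloc[:, -1]
print(X.head())  # the first 6 feature columns
print(y.head())  # only the last ('class') column, as a Series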
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
import pandas as pd
import numpy as np
# Read the data.
data = np.asarray(pd.read_csv('data.csv', header=None))
# Assign the features to the variable X, and the labels to the variable y.
X = data[:,0:2]
y = data[:,2]
# TODO: Create the model and assign it to the variable model.
# Find the right parameters for this model to achieve 100% accuracy on the dataset.
model = SVC()
model.fit(X,y)
2 Questions:
The data goes from a pandas DataFrame (read via pd.read_csv) into a numpy array. Is that better? Is there a good reason for that? Why not stay with the DataFrame?
I do not understand this notation:
X = data[:,0:2]
y = data[:,2]
What does it do?
Thank you.
The data consists of a CSV file with many rows like this:
0.28917,0.65643,0
It includes three columns: the first two contain the coordinates of the points, and the third one the label.
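To illustrate the slicing on a tiny array with the same layout (a hedged sketch; the second row is made up, not from data.csv):
import numpy as np
data = np.array([[0.28917, 0.65643, 0],
                 [0.12345, 0.67890, 1]])  # two rows: x, y, label
X = data[:, 0:2]  # all rows, columns 0 and 1 -> the coordinates
y = data[:, 2]    # all rows, column 2 -> the labels
print(X.shape, y.shape)  # (2, 2) (2,)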
I'm trying to conduct a supervised machine-learning experiment using the SelectKBest feature of scikit-learn, but I'm not sure how to create a new dataframe after finding the best features:
Let's assume I would like to conduct the experiment selecting 5 best features:
from sklearn.feature_selection import SelectKBest, f_classif
select_k_best_classifier = SelectKBest(score_func=f_classif, k=5).fit_transform(features_dataframe, targeted_class)
Now, if I add the line:
import pandas as pd
dataframe = pd.DataFrame(select_k_best_classifier)
I receive a new dataframe without feature names (only index starting from 0 to 4), but I want to create a dataframe with the new selected features, in a way like this:
dataframe = pd.DataFrame(fit_transformed_features, columns=features_names)
My question is how to create the features_names list?
I know that I should use:
select_k_best_classifier.get_support()
This returns an array of boolean values, where the indices of the True values represent the columns that should be selected in the original dataframe.
How should I use this boolean array together with the array of all feature names, which I can get via feature_names = list(features_dataframe.columns.values)?
This doesn't require loops.
# Create and fit selector
selector = SelectKBest(f_classif, k=5)
selector.fit(features_df, target)
# Get columns to keep and create new dataframe with those only
cols_idxs = selector.get_support(indices=True)
features_df_new = features_df.iloc[:,cols_idxs]
For me this code works fine and is more 'pythonic' (note that get_support() must be called on the fitted SelectKBest instance itself, not on the array returned by fit_transform):
mask = select_k_best_classifier.get_support()
new_features = features_dataframe.columns[mask]
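As a follow-up usage sketch, the reduced dataframe can then be built directly from those names:
dataframe = features_dataframe[new_features]  # keeps the original column names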
You can do the following:
mask = select_k_best_classifier.get_support() #list of booleans
new_features = [] # The list of your K best features
for bool_val, feature in zip(mask, feature_names):
    if bool_val:
        new_features.append(feature)
Then set those names as the columns of your new dataframe:
dataframe = pd.DataFrame(fit_transformed_features, columns=new_features)
The following code will help you find the top K features together with their F-scores. Let X be the pandas dataframe whose columns are all the features, and y the list of class labels.
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif
# Suppose we select the 5 features with the top 5 Fisher scores
selector = SelectKBest(f_classif, k=5)
# New dataframe with the selected features, for later use in the classifier.
# fit() alone works too, if you want only the feature names and their scores.
X_new = selector.fit_transform(X, y)
names = X.columns.values[selector.get_support()]
scores = selector.scores_[selector.get_support()]
names_scores = list(zip(names, scores))
ns_df = pd.DataFrame(data = names_scores, columns=['Feat_names', 'F_Scores'])
#Sort the dataframe for better visualization
ns_df_sorted = ns_df.sort_values(['F_Scores', 'Feat_names'], ascending = [False, True])
print(ns_df_sorted)
Select the best 10 features according to chi2:
from sklearn.feature_selection import SelectKBest, chi2
KBest = SelectKBest(chi2, k=10).fit(X, y)
Get the features with get_support():
f = KBest.get_support(1)  # indices of the selected features
Create a new df called X_new:
X_new = X[X.columns[f]]  # final features
As of scikit-learn 1.0, transformers have the get_feature_names_out method, which means you can write:
dataframe = pd.DataFrame(fit_transformed_features, columns=transformer.get_feature_names_out())
There is another alternative method which, however, is not as fast as the above solutions:
# Use the selector to retrieve the best features
X_new = select_k_best_classifier.fit_transform(train[feature_cols],train['is_attributed'])
# Get back the kept features as a DataFrame with dropped columns as all 0s
selected_features = pd.DataFrame(select_k_best_classifier.inverse_transform(X_new),
                                 index=train.index,
                                 columns=feature_cols)
selected_columns = selected_features.columns[selected_features.var() != 0]
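As a short follow-up, the reduced dataframe can then be recovered from those names:
X_selected = train[selected_columns]  # dataframe restricted to the kept features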
# Fit the SelectKBest instance
select_k_best_classifier = SelectKBest(score_func=f_classif, k=5).fit(features_dataframe, targeted_class)
# Extract the required features
new_features = select_k_best_classifier.get_feature_names_out(features_names)
Suppose that you want to choose the 10 best features:
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2
selector = SelectKBest(score_func=chi2, k=10)
selector.fit_transform(X, y)
features_names = selector.get_feature_names_out()  # names of the selected features only
print(features_names)