Filling missing values of test from groupby mean of training set - python

I have two dataframes, train and test. The test set has missing values in one column.
import numpy as np
import pandas as pd
train = [[0,1],[0,2],[0,3],[0,7],[0,7],[1,3],[1,5],[1,2],[1,2]]
test = [[0,0],[0,np.nan],[1,0],[1,np.nan]]
train = pd.DataFrame(train, columns = ['A','B'])
test = pd.DataFrame(test, columns = ['A','B'])
The test set has two missing values in column B, and the groupby column is A.
If the imputing strategy is mode, the missing values should be imputed with 7 and 2.
If the imputing strategy is mean, the missing values should be (1+2+3+7+7)/5 = 4 and (3+5+2+2)/4 = 3.
What is a good way to do this?
This question is related, but uses only one dataframe instead of two.

IIUC, here's one way:
from statistics import mode
test_mode = test.set_index('A').fillna(train.groupby('A').agg(mode)).reset_index()
test_mean = test.set_index('A').fillna(train.groupby('A').mean()).reset_index()
If you want a function:
from statistics import mode
def evaluate_nan(strategy='mean'):
    return test.set_index('A').fillna(train.groupby('A').agg(strategy)).reset_index()
test_mean = evaluate_nan()
test_mode = evaluate_nan(strategy = mode)
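To sanity-check the results (a quick sketch, just printing the frames produced above):
print(test_mean) # the NaNs in B become 4.0 and 3.0
print(test_mode) # the NaNs in B become 7.0 and 2.0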

Selecting specific features based on correlation values

I am using the Housing train.csv data from Kaggle to run a prediction.
https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data?select=train.csv
I am trying to generate a correlation and only keep the features that correlate with SalePrice from 0.5 to 0.9. I tried to use this function to filter some of it, but it only removes the features with correlation values above 0.9.
How would I update this function to only keep those specific features that I need to generate a correlation heat map?
data = train
corr = data.corr()
columns = np.full((corr.shape[0],), True, dtype=bool)
for i in range(corr.shape[0]):
    for j in range(i+1, corr.shape[0]):
        if corr.iloc[i,j] >= 0.9:
            if columns[j]:
                columns[j] = False
selected_columns = data.columns[columns]
data = data[selected_columns]
import pandas as pd
data = pd.read_csv('train.csv')
col = data.columns
c = [i for i in col if data[i].dtypes == 'int64' or data[i].dtypes == 'float64'] # drop columns with dtype == object
main_col = ['SalePrice'] # column with which we have to compare correlation
corr_saleprice = data.corr().filter(main_col).drop(main_col)
c1 = (corr_saleprice['SalePrice'] >= 0.5) & (corr_saleprice['SalePrice'] <= 0.9)
c2 = (corr_saleprice['SalePrice'] >= -0.9) & (corr_saleprice['SalePrice'] <= -0.5)
req_index = list(corr_saleprice[c1 | c2].index) # selecting columns meeting the given criteria
# req_index.append('SalePrice') # if you want the SalePrice column in your final dataframe too, uncomment this line
data = data[req_index]
data
Also, using for loops is not very efficient; a direct, vectorized implementation like this is preferable. I hope this is what you want!
For generating the heatmap, you can use the following code:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
a =data.corr()
mask = np.triu(np.ones_like(a, dtype=bool)) # np.bool is deprecated in recent NumPy; use the builtin bool
plt.figure(figsize=(10,10))
_ = sns.heatmap(a,cmap=sns.diverging_palette(250, 20, n=250),square=True,mask=mask,annot=True,center=0.5)
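The np.triu mask hides the upper triangle of the symmetric correlation matrix, so each feature pair is annotated only once.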

How to scale all columns except last column?

I'm using python 3.7.6.
I'm working on classification problem.
I want to scale my data frame (df) features columns.
The dataframe contains 56 columns (55 feature columns and the last column is the target column).
I want to scale the feature columns.
I'm doing it as follows:
y = df.iloc[:,-1]
target_name = df.columns[-1]
from FeatureScaling import feature_scaling
df = feature_scaling.scale(df.iloc[:,0:-1], standardize=False)
df[target_name] = y
but it seems inefficient, because I need to recreate the dataframe (add the target column back to the scaling result).
Is there a way to scale just some columns without changing the others, in an efficient way?
(i.e. the result from scale will contain the scaled columns and one column which is not scaled)
Using the index of columns for scaling or other pre-processing operations is not a very good idea, as the code breaks every time you create a new feature. Rather, use column names, e.g.
using scikit-learn:
from sklearn.preprocessing import StandardScaler, MinMaxScaler
features = [<features to standardize>]
scaler = StandardScaler()
# fit_transform returns a 2d numpy.array; we cast it to a pd.DataFrame
standardized_features = pd.DataFrame(scaler.fit_transform(df[features].copy()), columns=features)
old_shape = df.shape
# drop the unnormalized features from the dataframe
df.drop(features, axis = 1, inplace = True)
# join back the normalized features
df = pd.concat([df, standardized_features], axis=1)
assert old_shape == df.shape, "something went wrong!"
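One caveat worth knowing: pd.concat aligns on the index, so if df does not have a default 0..n-1 index, reset it (or the index of standardized_features) first, otherwise the join can introduce NaN rows.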
Or you can use a function like this if you prefer not to split and join the data back:
import numpy as np

def normalize(x):
    # x is an entire column (a pd.Series), not a single value
    if np.std(x) == 0:
        raise ValueError('Constant column')
    return (x - np.mean(x)) / np.std(x)

for col in features:
    df[col] = normalize(df[col]) # map(normalize) would apply it element-wise and always raise
You can slice the columns you want:
df.iloc[:, :-1] = feature_scaling.scale(df.iloc[:, :-1], standardize=False)
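If you'd rather use scikit-learn than your FeatureScaling module, the same slice assignment works with a scaler (a minimal sketch, assuming all 55 feature columns are numeric):
from sklearn.preprocessing import StandardScaler
df.iloc[:, :-1] = StandardScaler().fit_transform(df.iloc[:, :-1])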

Python - How to determine the feature / column names returned by Chi Squared test [duplicate]

I'm trying to conduct a supervised machine-learning experiment using the SelectKBest feature of scikit-learn, but I'm not sure how to create a new dataframe after finding the best features:
Let's assume I would like to conduct the experiment selecting 5 best features:
from sklearn.feature_selection import SelectKBest, f_classif
select_k_best_classifier = SelectKBest(score_func=f_classif, k=5).fit_transform(features_dataframe, targeted_class)
Now, if I add the line:
import pandas as pd
dataframe = pd.DataFrame(select_k_best_classifier)
I receive a new dataframe without feature names (only column indices 0 to 4), but I want to create a dataframe with the new selected features, in a way like this:
dataframe = pd.DataFrame(fit_transformed_features, columns=features_names)
My question is how to create the features_names list?
I know that I should use:
select_k_best_classifier.get_support()
This returns an array of boolean values, where the True indices correspond to the columns that should be selected from the original dataframe.
How should I use this boolean array with the array of all features names I can get via the method feature_names = list(features_dataframe.columns.values) ?
This doesn't require loops.
# Create and fit selector
selector = SelectKBest(f_classif, k=5)
selector.fit(features_df, target)
# Get columns to keep and create new dataframe with those only
cols_idxs = selector.get_support(indices=True)
features_df_new = features_df.iloc[:,cols_idxs]
For me this code works fine and is more 'pythonic':
mask = select_k_best_classifier.get_support()
new_features = features_dataframe.columns[mask]
You can do the following:
mask = select_k_best_classifier.get_support() # list of booleans
new_features = [] # the list of your K best features
for bool_val, feature in zip(mask, feature_names):
    if bool_val:
        new_features.append(feature)
Then change the name of your features:
dataframe = pd.DataFrame(fit_transformed_features, columns=new_features)
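Equivalently, the loop can be collapsed into a list comprehension:
new_features = [feature for bool_val, feature in zip(mask, feature_names) if bool_val]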
The following code will help you find the top K features along with their F-scores. Let X be the pandas dataframe whose columns are all the features, and y be the list of class labels.
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif
#Suppose, we select 5 features with top 5 Fisher scores
selector = SelectKBest(f_classif, k = 5)
#New dataframe with the selected features for later use in the classifier. fit() method works too, if you want only the feature names and their corresponding scores
X_new = selector.fit_transform(X, y)
names = X.columns.values[selector.get_support()]
scores = selector.scores_[selector.get_support()]
names_scores = list(zip(names, scores))
ns_df = pd.DataFrame(data = names_scores, columns=['Feat_names', 'F_Scores'])
#Sort the dataframe for better visualization
ns_df_sorted = ns_df.sort_values(['F_Scores', 'Feat_names'], ascending = [False, True])
print(ns_df_sorted)
Select the best 10 features according to chi2:
from sklearn.feature_selection import SelectKBest, chi2
KBest = SelectKBest(chi2, k=10).fit(X, y)
Get the features with get_support():
f = KBest.get_support(indices=True) # positions of the most important features
Create a new df called X_new:
X_new = X[X.columns[f]] # final features
As of scikit-learn 1.0, transformers have the get_feature_names_out method, which means you can write:
dataframe = pd.DataFrame(fit_transformed_features, columns=transformer.get_feature_names_out())
There is another alternative method which, however, is not as fast as the above solutions.
# Use the selector to retrieve the best features
X_new = select_k_best_classifier.fit_transform(train[feature_cols],train['is_attributed'])
# Get back the kept features as a DataFrame with dropped columns as all 0s
selected_features = pd.DataFrame(select_k_best_classifier.inverse_transform(X_new),
                                 index=train.index,
                                 columns=feature_cols)
selected_columns = selected_features.columns[selected_features.var() != 0]
# Fit the SelectKBest instance
select_k_best_classifier = SelectKBest(score_func=f_classif, k=5).fit(features_dataframe, targeted_class)
# Extract the required features
new_features = select_k_best_classifier.get_feature_names_out(features_names)
Suppose that you want to choose the 10 best features:
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2
selector = SelectKBest(score_func=chi2, k=10)
selector.fit_transform(X, y)
# note: feature_names_in_ lists all input features; use get_feature_names_out() for the selected ones
features_names = selector.get_feature_names_out()
print(features_names)


How can I normalize the data in a range of columns in my pandas dataframe

Suppose I have a pandas data frame surveyData:
I want to normalize the data in each column by performing:
surveyData_norm = (surveyData - surveyData.mean()) / (surveyData.max() - surveyData.min())
This would work fine if my data table only contained the columns I wanted to normalize. However, I have some columns containing string data, like:
Name State Gender Age Income Height
Sam CA M 13 10000 70
Bob AZ M 21 25000 55
Tom FL M 30 100000 45
I only want to normalize the Age, Income, and Height columns, but my above method does not work because of the string data in the Name, State, and Gender columns.
You can perform operations on a subset of rows or columns in pandas in a number of ways. One useful way is indexing:
# Assuming same lines from your example
cols_to_norm = ['Age','Height']
survey_data[cols_to_norm] = survey_data[cols_to_norm].apply(lambda x: (x - x.min()) / (x.max() - x.min()))
This will apply it to only the columns you desire and assign the result back to those columns. Alternatively you could set them to new, normalized columns and keep the originals if you want.
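If you want the exact formula from your question (mean normalization rather than min-max), the same indexing pattern works; a sketch reusing cols_to_norm:
survey_data[cols_to_norm] = survey_data[cols_to_norm].apply(lambda x: (x - x.mean()) / (x.max() - x.min()))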
I think it's better to use sklearn.preprocessing in this case, which gives us many more scaling options.
The way to do that in your case, using StandardScaler, would be:
from sklearn.preprocessing import StandardScaler
cols_to_norm = ['Age','Height']
surveyData[cols_to_norm] = StandardScaler().fit_transform(surveyData[cols_to_norm])
A simple and more efficient way: pre-calculate the statistics (dropna() skips missing data):
mean_age = survey_data.Age.dropna().mean()
max_age = survey_data.Age.dropna().max()
min_age = survey_data.Age.dropna().min()
survey_data['Age'] = survey_data['Age'].apply(lambda x: (x - mean_age) / (max_age - min_age))
This way will work...
I think it's really nice to use built-in functions
# Assuming same lines from your example
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
cols_to_norm = ['Age','Height']
survey_data[cols_to_norm] = scaler.fit_transform(survey_data[cols_to_norm])
MinMax normalize all numeric columns with minmax_scale
import numpy as np
from sklearn.preprocessing import minmax_scale
# cols = ['Age', 'Height']
cols = df.select_dtypes(np.number).columns
df[cols] = minmax_scale(df[cols])
Note: this keeps the index, column names, and non-numerical variables unchanged.
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
# let dataset here be your data
minmax = MinMaxScaler()
for x in dataset.columns[dataset.dtypes == 'int64']:
    dataset[x] = minmax.fit_transform(np.array(dataset[x]).reshape(-1, 1))
