I am using the Housing train.csv data from Kaggle to run a prediction.
https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data?select=train.csv
I am trying to generate a correlation matrix and keep only the features whose correlation with SalePrice is between 0.5 and 0.9. I tried to use this function to filter some of them, but it only removes the features whose correlation is above 0.9.
How would I update this function to only keep those specific features that I need to generate a correlation heat map?
import numpy as np
data = train  # train is the DataFrame loaded from train.csv
corr = data.corr()
columns = np.full((corr.shape[0],), True, dtype=bool)
for i in range(corr.shape[0]):
    for j in range(i + 1, corr.shape[0]):
        if corr.iloc[i, j] >= 0.9:
            if columns[j]:
                columns[j] = False
selected_columns = data.columns[columns]
data = data[selected_columns]
import pandas as pd
data = pd.read_csv('train.csv')
col = data.columns
c = [i for i in col if data[i].dtypes == 'int64' or data[i].dtypes == 'float64']  # keep only numeric columns (drop dtype == object)
main_col = ['SalePrice']  # column against which we compute the correlations
corr_saleprice = data[c].corr().filter(main_col).drop(main_col)
c1 = (corr_saleprice['SalePrice'] >= 0.5) & (corr_saleprice['SalePrice'] <= 0.9)
c2 = (corr_saleprice['SalePrice'] >= -0.9) & (corr_saleprice['SalePrice'] <= -0.5)
req_index = list(corr_saleprice[c1 | c2].index)  # select the columns satisfying either criterion
#req_index.append('SalePrice')  # if you also want the SalePrice column in your final dataframe, uncomment this line
data = data[req_index]
data
Also, using for loops is not very efficient; a direct vectorized implementation like the one above is preferable. I hope this is what you want!
For generating the heatmap, you can use the following code:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
a = data.corr()
mask = np.triu(np.ones_like(a, dtype=bool))  # np.bool is deprecated; use the builtin bool
plt.figure(figsize=(10, 10))
_ = sns.heatmap(a, cmap=sns.diverging_palette(250, 20, n=250), square=True, mask=mask, annot=True, center=0.5)
Let's say that I have two 1-D arrays with two different statistical distributions. Now, I want to match both distributions, using one of them as the "target".
In the example, I "shifted" one of the distributions using MinMaxScaler() from scikit-learn to match it to the other one, but I am sure I can achieve an "automatic" and "better" match with some API or some code.
In the example I have both arrays in the same DataFrame (and both have the same length), but I'd be very pleased if somebody knows a way to achieve this using two different DataFrames and/or two arrays with different lengths.
Thank you!!
CODE
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
import plotly.figure_factory as ff
################## DATA ######################
np.random.seed(54)
crv = np.random.uniform(1,99,(1,100)).flatten()
np.random.seed(115)
crv_target = np.random.uniform(51,149,(1,100)).flatten()
# Create DataFrame
df = pd.DataFrame(data=[crv, crv_target]).T
df = df.rename(columns={0: "crv", 1: "crv_target"})
# Scaler
scale = MinMaxScaler(feature_range=(50, 150))
# Note: MinMaxScaler ignores the y argument; this only rescales crv to the fixed range (50, 150)
df['crv_shifted'] = scale.fit_transform(X=df['crv'].values.reshape(-1, 1), y=df['crv_target'].values.reshape(-1, 1))
# Create distplot
data = [df['crv_shifted'],df['crv_target'],df['crv']]
labels = ['crv_shifted','crv_target','crv']
colors = ['#F8C471', '#22D2E6','#CD6155']
fig = ff.create_distplot(data, labels, show_hist=False, show_rug=False, colors=colors)
fig.show()
LINK TO PLOT
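One possible direction, sketched below and not part of the original post, is quantile mapping: map each value of crv onto the value of crv_target at the same quantile. This also works when the two arrays live in different DataFrames or have different lengths.
# Minimal quantile-mapping sketch; assumes crv and crv_target from the snippet above.
import numpy as np
quantiles = np.linspace(0, 1, 101)           # 101 evenly spaced probability levels
src_q = np.quantile(crv, quantiles)          # quantiles of the source distribution
tgt_q = np.quantile(crv_target, quantiles)   # quantiles of the target distribution
crv_matched = np.interp(crv, src_q, tgt_q)   # map each source value onto the target's scale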
I am writing preprocessing code for my LSTM training. My csv contains more than 30 variables. After applying some EDA techniques, I found that half of the features can be dropped without any effect on training.
Right now I am dropping such features manually using pandas.
I want to write code that can drop such features automatically.
I wrote code to visualize the heatmap and the correlations in this way:
# I am making a class, so this part is from preprocessing.
# self.data is a DataFrame which contains all the csv data
# (assumes matplotlib.pyplot as plt, seaborn as sns and scipy.stats as stats are imported)
def calculateCorrelationByPearson(self):
    columns = self.data.columns
    plt.figure(figsize=(12, 8))
    sns.heatmap(data=self.data.corr(method='pearson'), annot=True, fmt='.2f',
                linewidths=0.5, cmap='Blues')
    plt.show()
    for column in columns:
        corr = stats.spearmanr(self.data['total'], self.data[column])
        print(f'{column} - corr coefficient:{corr[0]}, p-value:{corr[1]}')
This gives me a perfect view of my features and their relationships with each other.
Now I want to drop the columns which are not important.
Let's say those with a correlation of less than 0.4 with the target.
How can I apply this logic to my code?
Here is an approach to remove variables whose correlation coefficient with the target is below some threshold:
import pandas as pd
from scipy.stats import spearmanr
data = pd.DataFrame([{"A": 1, "B": 2, "C": 3}, {"A": 2, "B": 3, "C": 1}, {"A": 3, "B": 4, "C": 0},
                     {"A": 4, "B": 4, "C": 1}, {"A": 5, "B": 6, "C": 2}])
targetVar = "A"
corr_threshold = 0.4
corr = spearmanr(data)  # returns (correlation matrix, p-value matrix)
corrSeries = pd.Series(corr[0][:, 0], index=data.columns)  # correlations with the first column, which is the target here
corrSeries = corrSeries[(corrSeries.index != targetVar) & (corrSeries > corr_threshold)]  # apply the threshold (keeps positive correlations; use .abs() to also keep strong negative ones)
vars_to_keep = list(corrSeries.index.values)  # list of variables to keep
vars_to_keep.append(targetVar)  # add the target variable back in
data2 = data[vars_to_keep]
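To plug this into the class from the question above, a rough sketch (assuming the target column is called 'total' and a 0.4 threshold, as stated there; the method name is made up) could look like this:
# Hypothetical helper method; self.data is the DataFrame, stats is scipy.stats, pd is pandas.
def dropWeaklyCorrelatedColumns(self, target='total', threshold=0.4):
    corr_matrix, _ = stats.spearmanr(self.data)  # full Spearman correlation matrix
    corr_with_target = pd.Series(corr_matrix[:, self.data.columns.get_loc(target)],
                                 index=self.data.columns)
    weak = corr_with_target[(corr_with_target.index != target) &
                            (corr_with_target.abs() < threshold)].index
    self.data = self.data.drop(columns=list(weak))  # drop the weakly correlated columns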
I'm using python 3.7.6.
I'm working on a classification problem.
I want to scale my data frame's (df) feature columns.
The dataframe contains 56 columns (55 feature columns and the last column is the target column).
I want to scale the feature columns.
I'm doing it as follows:
y = df.iloc[:,-1]
target_name = df.columns[-1]
from FeatureScaling import feature_scaling
df = feature_scaling.scale(df.iloc[:,0:-1], standardize=False)
df[target_name] = y
but this seems inefficient, because I need to recreate the dataframe (add the target column back to the scaling result).
Is there a way to scale just some columns without changing the others, in an efficient way?
(i.e. the result of scale should contain the scaled columns plus the one column that is not scaled)
Using column indices for scaling or other pre-processing operations is not a very good idea, as the code breaks every time you create a new feature. Rather, use column names, e.g.
using scikit-learn:
from sklearn.preprocessing import StandardScaler, MinMaxScaler
features = [<features to standardize>]
scaler = StandardScaler()
# fit_transform returns a 2d numpy.array; we cast it to a pd.DataFrame
standardized_features = pd.DataFrame(scaler.fit_transform(df[features].copy()), columns=features)
old_shape = df.shape
# drop the unnormalized features from the dataframe
df.drop(features, axis = 1, inplace = True)
# join back the normalized features
df = pd.concat([df, standardized_features], axis= 1)
assert old_shape == df.shape, "something went wrong!"
Or you can use a function like this if you prefer not to split and join the data back:
import numpy as np
def normalize(x):
    if np.std(x) == 0:
        raise ValueError('Constant column')
    return (x - np.mean(x)) / np.std(x)
for col in features:
    df[col] = normalize(df[col])  # apply to the whole column; .map would call normalize on each scalar, where np.std is always 0
You can slice the columns you want:
df.iloc[:, :-1] = feature_scaling.scale(df.iloc[:, :-1], standardize=False)
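Another option, sketched here and not taken from the original answers, is to assign the scaled values back into the same feature columns so the target column is never touched (MinMaxScaler is just a stand-in for whatever scaler you use):
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
# Toy frame standing in for the real df: two feature columns plus the target as the last column.
df = pd.DataFrame({'f1': [1.0, 2.0, 3.0], 'f2': [10.0, 20.0, 30.0], 'target': [0, 1, 0]})
feature_cols = df.columns[:-1]  # every column except the last (the target)
df[feature_cols] = MinMaxScaler().fit_transform(df[feature_cols])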
I am using sklearn in python to perform principal component analysis (PCA) on gene expression data. My data is loaded as a pandas dataframe, for which I can call df.head() and the df looks good. I am using sklearn to generate a loading matrix, but the matrix only displays a generic index and will not accept a column name as an index. I have 1722 genes, so it is important that I obtain the loading score for each gene computationally.
Here is my code for PCA:
import pandas as pd
from sklearn.decomposition import PCA
from sklearn import preprocessing
# Load the data as pandas dataframe
cols = ['gene', 'FC_TSWV', 'FC_WFT', 'FC_TSWV_WFT']
df = pd.read_csv('./PCA.txt', names = cols, header = None, index_col = 'gene')
# preprocess data:
scaled_df = preprocessing.scale(df.T)
# perform PCA
pca = PCA()
pca.fit(scaled_df)
pca_data = pca.transform(scaled_df)
# Generate loading matrix. HERE IS WHERE THE TROUBLE IS:
loading_scores = pd.Series(pca.components_[0], index = df.gene)
# Print loading matrix
sorted_loading_scores = loading_scores.abs().sort_values(ascending=False)
print(loading_scores)
I have tried:
loading_scores = pd.Series(pca.components_[0], index=df.gene)
loading_scores = pd.Series(pca.components_[0], index=df['gene'])
loading_scores = pd.Series(pca.components_[0], index=df.loc['gene'])
but I get: AttributeError: 'DataFrame' object has no attribute 'gene'.
If I do not specify an index at all, the loading scores are designated with the generic 0-based index.
Anyone know how to fix this?
Use df.index instead of df.gene or df['gene'].
Once you set a column as the index, you access it through the .index attribute, not through the column name anymore.
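For example, a minimal fix based on the code above:
# 'gene' is already the index (index_col='gene'), so pass df.index
loading_scores = pd.Series(pca.components_[0], index=df.index)
sorted_loading_scores = loading_scores.abs().sort_values(ascending=False)
print(sorted_loading_scores)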
I am trying to run K-Means in Python on aggregated data files. For example, instead of a data frame with 3 identical records represented by 3 rows, a single row represents all 3, with a column like cnt (arbitrarily named) holding the number 3 for those instances.
Below is some basic starter code that does NOT use the aggregated representation of the rows. Please let me know if you would like me to post the .csv too, but it should be pretty basic:
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt
data = pd.read_csv('../Data/wholesale_data.csv')
data.head()
categorical_features = ['Channel', 'Region']
continuous_features = ['Fresh', 'Milk', 'Grocery', 'Frozen',
'Detergents_Paper', 'Delicassen']
for col in categorical_features:  # for each categorical column
    dummies = pd.get_dummies(data[col], prefix=col)  # one-hot encoding
    data = pd.concat([data, dummies], axis=1)  # append to data
    data.drop(col, axis=1, inplace=True)  # drop original column
data.head()
mms = MinMaxScaler()
mms.fit(data)
data_transformed = mms.transform(data)
sum_of_squared_distances = []
K = range(1,15)
for k in K:
    km = KMeans(n_clusters=k)  # init model
    km = km.fit(data_transformed)  # fit model
    sum_of_squared_distances.append(km.inertia_)  # overall SSE
plt.plot(K, sum_of_squared_distances, 'bx-')
plt.xlabel('k')
plt.ylabel('Sum_of_squared_distances')
plt.title('Elbow Method For Optimal k')
plt.show()
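One way to work with the aggregated representation directly, sketched below and not part of the original post, is to pass the count column as sample_weight, which scikit-learn's KMeans.fit accepts; 'cnt' is the arbitrary column name mentioned in the question:
# Assumes each row of data carries a 'cnt' column saying how many original rows it represents.
weights = data['cnt']
features = data.drop(columns=['cnt'])            # cluster on the feature columns only
features_scaled = MinMaxScaler().fit_transform(features)
km = KMeans(n_clusters=5, random_state=0)        # 5 clusters chosen arbitrarily for the sketch
km.fit(features_scaled, sample_weight=weights)   # each row is weighted by how many records it represents
The same sample_weight argument can also be passed to km.fit inside the elbow loop above.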