I tried the RobustScaler in sklearn and found that the results do not match the formula.
The formula of the RobustScaler in sklearn is (x - median) / IQR, i.e. (x - median) / (Q3 - Q1).
I have the matrix shown in the code below (4 samples × 4 features).
I tested the first value of feature one (row one, column one). The scaled value should be (1 - 3) / (5.5 - 1.5) = -0.5. However, the result from sklearn is -0.67. Does anyone know where my calculation goes wrong?
The code using sklearn is as below:
import numpy as np
from sklearn.preprocessing import RobustScaler
x=[[1,2,3,4],[4,5,6,7],[7,8,9,10],[2,1,1,1]]
scaler = RobustScaler(quantile_range=(25.0, 75.0),with_centering=True)
x_new = scaler.fit_transform(x)
print(x_new)
From the RobustScaler documentation (emphasis added):
Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set.
i.e. the median and IQR quantities are calculated per column, and not for the whole array.
Having clarified that, let's calculate the scaled values for your first column manually:
import numpy as np
x1 = np.array([1, 4, 7, 2]) # your 1st column here
q75, q25 = np.percentile(x1, [75 ,25])
iqr = q75 - q25
x1_med = np.median(x1)
x1_scaled = (x1-x1_med)/iqr
x1_scaled
# array([-0.66666667, 0.33333333, 1.33333333, -0.33333333])
which is the same as the first column of your own x_new, as calculated by scikit-learn:
# your code verbatim:
from sklearn.preprocessing import RobustScaler
x=[[1,2,3,4],[4,5,6,7],[7,8,9,10],[2,1,1,1]]
scaler = RobustScaler(quantile_range=(25.0, 75.0),with_centering=True)
x_new = scaler.fit_transform(x)
print(x_new)
# result
[[-0.66666667 -0.375 -0.35294118 -0.33333333]
[ 0.33333333 0.375 0.35294118 0.33333333]
[ 1.33333333 1.125 1.05882353 1. ]
[-0.33333333 -0.625 -0.82352941 -1. ]]
np.all(x1_scaled == x_new[:,0])
# True
Similarly for the rest of the columns (features): you need to calculate the median and IQR values separately for each of them before scaling; a vectorized version of this check is sketched below.
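For completeness, here is a small vectorized sketch of the same manual check for all four columns at once (using the x matrix from your question):
import numpy as np
x = np.array([[1, 2, 3, 4], [4, 5, 6, 7], [7, 8, 9, 10], [2, 1, 1, 1]], dtype=float)
med = np.median(x, axis=0)                     # per-column medians
q75, q25 = np.percentile(x, [75, 25], axis=0)  # per-column quartiles
print((x - med) / (q75 - q25))                 # reproduces x_new column by column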
UPDATE (after comment):
As pointed out in the Wikipedia entry on quartiles:
For discrete distributions, there is no universal agreement on selecting the quartile values
See also the relevant reference, Sample quantiles in statistical packages:
There are a large number of different definitions used for sample quantiles in statistical computer packages
Digging into the documentation of np.percentile used here, you'll see that there are no fewer than five (5) different interpolation methods, and not all of them produce identical results (see also the four different methods demonstrated in the Wikipedia entry linked just above); here is a quick demonstration of these methods and their results on the x1 data defined above:
np.percentile(x1, [75 ,25]) # interpolation='linear' by default
# array([4.75, 1.75])
np.percentile(x1, [75 ,25], interpolation='lower')
# array([4, 1])
np.percentile(x1, [75 ,25], interpolation='higher')
# array([7, 2])
np.percentile(x1, [75 ,25], interpolation='midpoint')
# array([5.5, 1.5])
np.percentile(x1, [75 ,25], interpolation='nearest')
# array([4, 2])
Notice that no two methods produce identical results; it should also be apparent that the definition you are using in your own calculation corresponds to interpolation='midpoint', while the default NumPy method is interpolation='linear'. And as Ben Reiniger correctly points out in the comments below, what is actually used in the source code of RobustScaler is np.nanpercentile (a variant of np.percentile used here that can handle nan values) with the default interpolation='linear' setting.
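To make this concrete, here is a short sketch reproducing both numbers for the first element of your first column (note that newer NumPy versions name this keyword method instead of interpolation):
import numpy as np
x1 = np.array([1, 4, 7, 2])
med = np.median(x1)
# 'midpoint' reproduces the hand calculation from the question
q75, q25 = np.percentile(x1, [75, 25], interpolation='midpoint')
print((x1[0] - med) / (q75 - q25))   # -0.5
# the default 'linear' rule reproduces what RobustScaler returns
q75, q25 = np.percentile(x1, [75, 25])
print((x1[0] - med) / (q75 - q25))   # -0.666...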
I would like to explore solutions for performing expanding OLS in pandas (or other DataFrame/Series-friendly libraries) efficiently.
Assuming the dataset is large, I am NOT interested in any solutions with a for-loop;
I am looking for solutions about expanding rather than rolling. Rolling functions always require a fixed window, while expanding uses a variable window (starting from the beginning);
Please do not suggest pandas.stats.ols.MovingOLS because it is deprecated;
Please do not suggest other deprecated methods such as expanding_mean.
For example, there is a DataFrame df with two columns X and y. To make it simpler, let's just calculate beta.
Currently, I am thinking about something like
import numpy as np
import pandas as pd
import statsmodels.api as sm
def my_OLS_func(df, y_name, X_name):
    y = df[y_name]
    X = df[X_name]
    X = sm.add_constant(X)
    b = np.linalg.pinv(X.T.dot(X)).dot(X.T).dot(y)
    return b
df = pd.DataFrame({'X':[1,2.5,3], 'y':[4,5,6.3]})
df['beta'] = df.expanding().apply(my_OLS_func, args = ('y', 'X'))
Expected values of df['beta'] are 0 (or NaN), 0.66666667, and 1.038462.
However, this does not seem to work: expanding().apply() passes each column to the function as a separate Series, so I am not sure how one could pass the two Series (X and y) together as arguments.
Any suggestions would be appreciated.
One option is to use the RecursiveLS (recursive least squares) model from Statsmodels:
import numpy as np
import statsmodels.api as sm

# Simulate some data
rs = np.random.RandomState(seed=12345)
nobs = 100000
beta = [10., -0.2]
sigma2 = 2.5
exog = sm.add_constant(rs.uniform(size=nobs))
eps = rs.normal(scale=sigma2**0.5, size=nobs)
endog = np.dot(exog, beta) + eps
# Construct and fit the recursive least squares model
mod = sm.RecursiveLS(endog, exog)
res = mod.fit()
# This is a 2 x 100,000 numpy array with the regression coefficients
# that would be estimated when using data from the beginning of the
# sample to each point. You should usually ignore the first k=2
# datapoints since they are controlled by a diffuse prior.
res.recursive_coefficients.filtered
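For the small DataFrame from the question, a minimal sketch along the same lines would be the following; the filtered recursive slope reproduces the expanding-OLS estimates once enough data points are available (the earliest values are driven by the diffuse prior, as noted in the comment above):
import pandas as pd
import statsmodels.api as sm
df = pd.DataFrame({'X': [1, 2.5, 3], 'y': [4, 5, 6.3]})
mod = sm.RecursiveLS(df['y'], sm.add_constant(df['X']))
res = mod.fit()
# Row 0 holds the recursive intercept estimates, row 1 the slope (beta),
# each estimated with data from the start of the sample up to that point.
print(res.recursive_coefficients.filtered[1])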
Since by using sklearn's StandardScaler the initial data are normalized, isn't it problematic that the initial data are not the same anymore?
Example:
from sklearn.preprocessing import StandardScaler
import numpy as np
data = np.array([[1,1],[2,0]])
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
print(data)
[[1 1]
[2 0]]
print(scaled_data)
[[-1. 1.]
[ 1. -1.]]
As you can see, the data are no longer the same after normalization. How can that change not affect the results of later processing, given that the data are different? And in which scenarios is it suitable to perform normalization (we basically do it for data with negative values, but in which kinds of processing is it appropriate)?
Let's look at the official docs for StandardScaler:
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
From that, we can see this formula:
The standard score of a sample x is calculated as:
z = (x - u) / s
Here u is the mean and s the standard deviation of the training samples; this is the usual standard score (z-score) from statistics.
Geometrically, we subtract the same value from every entry of a column and then divide by another constant.
We are just shifting and rescaling the data, so no information is lost and the original values can be recovered, as the sketch below shows.
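As a quick illustration (a minimal sketch using the array from the question), the scaler stores the mean and scale it used, so the original values can be recovered with inverse_transform:
import numpy as np
from sklearn.preprocessing import StandardScaler
data = np.array([[1, 1], [2, 0]])
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
# Undo the scaling: recovers the original values (up to floating-point error)
restored = scaler.inverse_transform(scaled_data)
print(np.allclose(restored, data))  # True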
Another point to keep in mind is that, by default, the StandardScaler class in sklearn does not modify the data in place:
"copy : boolean, optional, default True
If False, try to avoid a copy and do inplace scaling instead. This is not guaranteed to always work inplace; e.g. if the data is not a NumPy array or scipy.sparse CSR matrix, a copy may still be returned."
So above, since you've assigned the result to the name scaled_data, the object referenced by data is unchanged, as long as you keep the default copy=True parameter of StandardScaler.
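Here is a small sketch of that copy behaviour (assuming a float NumPy array, since in-place scaling also requires a compatible dtype):
import numpy as np
from sklearn.preprocessing import StandardScaler
data = np.array([[1.0, 1.0], [2.0, 0.0]])
StandardScaler(copy=True).fit_transform(data)   # default: works on a copy
print(data)                                     # original array unchanged
StandardScaler(copy=False).fit_transform(data)  # tries to scale in place
print(data)                                     # now holds the scaled values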
I have performed a PCA analysis on my original dataset, and on the compressed dataset transformed by the PCA I have selected the number of PCs I want to keep (they explain almost 94% of the variance). Now I am struggling to identify which of the original features are important in the reduced dataset.
How do I find out which feature is important and which is not among the remaining Principal Components after the dimension reduction?
Here is my code:
from sklearn.decomposition import PCA
pca = PCA(n_components=8)
pca.fit(scaledDataset)
projection = pca.transform(scaledDataset)
Furthermore, I also tried to run a clustering algorithm on the reduced dataset, but surprisingly the score is lower than on the original dataset. How is that possible?
First of all, I assume that by features you mean the variables and not the samples/observations. In this case, you could do something like the following by creating a biplot function that shows everything in one plot. In this example, I am using the iris data.
Before the example, please note that the basic idea when using PCA as a tool for feature selection is to select variables according to the magnitude (from largest to smallest in absolute values) of their coefficients (loadings). See my last paragraph after the plot for more details.
Overview:
PART1: I explain how to check the importance of the features and how to plot a biplot.
PART2: I explain how to check the importance of the features and how to save them into a pandas dataframe using the feature names.
PART 1:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.decomposition import PCA
import pandas as pd
from sklearn.preprocessing import StandardScaler
iris = datasets.load_iris()
X = iris.data
y = iris.target
#In general a good idea is to scale the data
scaler = StandardScaler()
scaler.fit(X)
X=scaler.transform(X)
pca = PCA()
x_new = pca.fit_transform(X)
def myplot(score, coeff, labels=None):
    xs = score[:, 0]
    ys = score[:, 1]
    n = coeff.shape[0]
    scalex = 1.0 / (xs.max() - xs.min())
    scaley = 1.0 / (ys.max() - ys.min())
    plt.scatter(xs * scalex, ys * scaley, c=y)
    for i in range(n):
        plt.arrow(0, 0, coeff[i, 0], coeff[i, 1], color='r', alpha=0.5)
        if labels is None:
            plt.text(coeff[i, 0] * 1.15, coeff[i, 1] * 1.15, "Var" + str(i + 1), color='g', ha='center', va='center')
        else:
            plt.text(coeff[i, 0] * 1.15, coeff[i, 1] * 1.15, labels[i], color='g', ha='center', va='center')
    plt.xlim(-1, 1)
    plt.ylim(-1, 1)
    plt.xlabel("PC{}".format(1))
    plt.ylabel("PC{}".format(2))
    plt.grid()
#Call the function. Use only the 2 PCs.
myplot(x_new[:,0:2],np.transpose(pca.components_[0:2, :]))
plt.show()
Visualize what's going on using the biplot
Now, the importance of each feature is reflected by the magnitude of the corresponding values in the eigenvectors (higher magnitude means higher importance).
Let's first see how much variance each PC explains.
pca.explained_variance_ratio_
[0.72770452, 0.23030523, 0.03683832, 0.00515193]
PC1 explains 72% and PC2 23%. Together, if we keep PC1 and PC2 only, they explain 95%.
Now, let's find the most important features.
print(abs( pca.components_ ))
[[0.52237162 0.26335492 0.58125401 0.56561105]
[0.37231836 0.92555649 0.02109478 0.06541577]
[0.72101681 0.24203288 0.14089226 0.6338014 ]
[0.26199559 0.12413481 0.80115427 0.52354627]]
Here, pca.components_ has shape [n_components, n_features]. Thus, by looking at PC1 (the first principal component), which is the first row [0.52237162 0.26335492 0.58125401 0.56561105], we can conclude that features 1, 3 and 4 (or Var 1, 3 and 4 in the biplot) are the most important. This is also clearly visible from the biplot (that's why we often use this plot to summarize the information in a visual way).
To sum up, look at the absolute values of the eigenvectors' components corresponding to the k largest eigenvalues (in sklearn the components are sorted by explained_variance_). The larger these absolute values are, the more a specific feature contributes to that principal component; a short sketch of this ranking follows below.
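For instance, continuing the iris example above (this is just a sketch reusing the pca and iris objects already fitted/loaded), you could rank the original features by the absolute value of their loading on PC1:
loadings_pc1 = np.abs(pca.components_[0])   # loadings on the first PC
order = np.argsort(loadings_pc1)[::-1]      # largest magnitude first
for idx in order:
    print(iris.feature_names[idx], round(loadings_pc1[idx], 3))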
PART 2:
The important features are the ones that influence the components more and thus have a large absolute value/score on the component.
To get the most important features on the PCs with names and save them into a pandas dataframe use this:
from sklearn.decomposition import PCA
import pandas as pd
import numpy as np
np.random.seed(0)
# 10 samples with 5 features
train_features = np.random.rand(10,5)
model = PCA(n_components=2).fit(train_features)
X_pc = model.transform(train_features)
# number of components
n_pcs= model.components_.shape[0]
# get the index of the most important feature on EACH component
# LIST COMPREHENSION HERE
most_important = [np.abs(model.components_[i]).argmax() for i in range(n_pcs)]
initial_feature_names = ['a','b','c','d','e']
# get the names
most_important_names = [initial_feature_names[most_important[i]] for i in range(n_pcs)]
# LIST COMPREHENSION HERE AGAIN
dic = {'PC{}'.format(i): most_important_names[i] for i in range(n_pcs)}
# build the dataframe
df = pd.DataFrame(dic.items())
This prints:
0 1
0 PC0 e
1 PC1 d
So on the first PC (labelled PC0 above) the feature named e is the most important, and on the second PC it is d. A small extension of this idea is sketched below.
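If you want more than the single most important feature per component, a small extension sketch (reusing model, n_pcs and initial_feature_names from the block above, with a hypothetical top_k of 2) is:
top_k = 2
for i in range(n_pcs):
    idx = np.abs(model.components_[i]).argsort()[::-1][:top_k]
    print('PC{}'.format(i), [initial_feature_names[j] for j in idx])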
Nice article as well here: https://towardsdatascience.com/pca-clearly-explained-how-when-why-to-use-it-and-feature-importance-a-guide-in-python-7c274582c37e?source=friends_link&sk=65bf5440e444c24aff192fedf9f8b64f
The pca library contains this functionality:
pip install pca
A demonstration of extracting the feature importance follows:
# Import libraries
import numpy as np
import pandas as pd
from pca import pca
# Lets create a dataset with features that have decreasing variance.
# We want to extract feature f1 as most important, followed by f2 etc
f1=np.random.randint(0,100,250)
f2=np.random.randint(0,50,250)
f3=np.random.randint(0,25,250)
f4=np.random.randint(0,10,250)
f5=np.random.randint(0,5,250)
f6=np.random.randint(0,4,250)
f7=np.random.randint(0,3,250)
f8=np.random.randint(0,2,250)
f9=np.random.randint(0,1,250)
# Combine into dataframe
X = np.c_[f1,f2,f3,f4,f5,f6,f7,f8,f9]
X = pd.DataFrame(data=X, columns=['f1','f2','f3','f4','f5','f6','f7','f8','f9'])
# Initialize
model = pca()
# Fit transform
out = model.fit_transform(X)
# Print the top features. The results show that f1 is best, followed by f2 etc
print(out['topfeat'])
# PC feature
# 0 PC1 f1
# 1 PC2 f2
# 2 PC3 f3
# 3 PC4 f4
# 4 PC5 f5
# 5 PC6 f6
# 6 PC7 f7
# 7 PC8 f8
# 8 PC9 f9
Plot the explained variance
model.plot()
Make the biplot. It can be nicely seen that the feature with the most variance (f1) is almost horizontal in the plot, whereas the feature with the second most variance (f2) is almost vertical. This is expected because most of the variance is in f1, followed by f2, etc.
ax = model.biplot(n_feat=10, legend=False)
Biplot in 3d. Here we see the nice addition of the expected f3 in the plot in the z-direction.
ax = model.biplot3d(n_feat=10, legend=False)
# original_num_df: the original numeric dataframe
# pca: the fitted PCA model
import numpy as np
import pandas as pd

def create_importance_dataframe(pca, original_num_df):
    # Change the components ndarray to a dataframe
    importance_df = pd.DataFrame(pca.components_)
    # Assign columns
    importance_df.columns = original_num_df.columns
    # Change to absolute values
    importance_df = importance_df.apply(np.abs)
    # Transpose
    importance_df = importance_df.transpose()
    # Change column names again
    ## First get number of pcs
    num_pcs = importance_df.shape[1]
    ## Generate the new column names
    new_columns = [f'PC{i}' for i in range(1, num_pcs + 1)]
    ## Now rename
    importance_df.columns = new_columns
    # Return importance df
    return importance_df
# Call function to create importance df
importance_df =create_importance_dataframe(pca, original_num_df)
# Show first few rows
display(importance_df.head())
# Sort depending on PC of interest
## PC1 top 10 important features
pc1_top_10_features = importance_df['PC1'].sort_values(ascending = False)[:10]
print('\nPC1 top 10 features are:')
display(pc1_top_10_features )
## PC2 top 10 important features
pc2_top_10_features = importance_df['PC2'].sort_values(ascending = False)[:10]
print('\nPC2 top 10 features are:')
display(pc2_top_10_features )
I have the MovieLens dataset, to which I want to apply PCA, but the sklearn PCA function does not seem to do it correctly.
I have a 718×8913 matrix whose rows are the users and whose columns are the movies.
Here is my Python code:
Load movie names and movie ratings
movies = pd.read_csv('movies.csv')
ratings = pd.read_csv('ratings.csv')
ratings.drop(['timestamp'], axis=1, inplace=True)
def replace_name(x):
    return movies[movies['movieId']==x].title.values[0]
ratings.movieId = ratings.movieId.map(replace_name)
M = ratings.pivot_table(index=['userId'], columns=['movieId'], values='rating')
df1 = M.replace(np.nan, 0, regex=True)
Standardizing
X_std = StandardScaler().fit_transform(df1)
Apply PCA
pca = PCA()
result = pca.fit_transform(X_std)
print(result.shape)
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')
plt.show()
I didn't set any number of components, so I expected PCA to return a 718×8913 matrix in the new dimensions, but the result has size 718×718, pca.explained_variance_ratio_ has length 718, and the sum of all its entries is 1. How is this possible?
I have 8913 features, yet PCA returns only 718 components whose explained variance ratios sum to 1. Can anyone explain what is wrong here?
My plot of the cumulative explained variance:
As you can see in the plot above, there are only 718 components and their ratios sum to 1, but I have 8913 features. Where did they go?
Test with a smaller example
I even tried the scikit-learn PCA example from the PCA documentation page; I changed the example and just increased the number of features:
import numpy as np
from sklearn.decomposition import PCA
import pandas as pd
X = np.array([[-1, -1,3,4,-1, -1,3,4], [-2, -1,5,-1, -1,3,4,2], [-3, -2,1,-1, -1,3,4,1],
[1, 1,4,-1, -1,3,4,2], [2, 1,0,-1, -1,3,4,2], [3, 2,10,-1, -1,3,4,10]])
ipca = PCA(n_components = 7)
print (X.shape)
ipca.fit(X)
result = ipca.transform(X)
print (result.shape);
In this example we have 6 samples and 8 features. I set n_components to 7, but the result size is 6×6.
I think that when the number of features is larger than the number of samples, the maximum number of components scikit-learn's PCA will return equals the number of samples.
See the documentation on PCA.
Because you did not pass an n_components parameter to PCA(), sklearn uses min(n_samples, n_features) as the value of n_components, which is why you get a reduced feature set equal to n_samples.
I believe your variance is equal to 1 because you didn't set the n_components, from the documentation:
If n_components is not set then all components are stored and the sum
of explained variances is equal to 1.0.
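A quick way to convince yourself of this (a sketch with random data, not your MovieLens matrix):
import numpy as np
from sklearn.decomposition import PCA
X = np.random.rand(6, 8)                     # 6 samples, 8 features
pca = PCA().fit(X)                           # n_components not set
print(pca.components_.shape)                 # (6, 8): at most min(6, 8) components
print(pca.explained_variance_ratio_.sum())   # ~1.0, since all components are stored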
I am currently moving my data analysis from R to Python. When scaling a dataset in R I would use R's scale(), which to my understanding does the following: (x - mean(x)) / sd(x).
To replace that function I tried to use sklearn.preprocessing.scale(). From my understanding of its description it does the same thing. Nonetheless, I ran a little test file and found that the two methods return different values. Obviously the standard deviations are not the same... Is someone able to explain why the standard deviations "deviate" from one another?
MWE:
# import packages
from sklearn import preprocessing
import numpy
import rpy2.robjects.numpy2ri
from rpy2.robjects.packages import importr
rpy2.robjects.numpy2ri.activate()
# Set up R namespaces
R = rpy2.robjects.r
np1 = numpy.array([[1.0,2.0],[3.0,1.0]])
print "Numpy-array:"
print np1
print "Scaled numpy array through R.scale()"
print R.scale(np1)
print "-------"
print "Scaled numpy array through preprocessing.scale()"
print preprocessing.scale(np1, axis = 0, with_mean = True, with_std = True)
scaler = preprocessing.StandardScaler()
scaler.fit(np1)
print "Mean of preprocessing.scale():"
print scaler.mean_
print "Std of preprocessing.scale():"
print scaler.std_
Output:
It seems to have to do with how standard deviation is calculated.
>>> import numpy as np
>>> a = np.array([[1, 2],[3, 1]])
>>> np.std(a, axis=0)
array([ 1. , 0.5])
>>> np.std(a, axis=0, ddof=1)
array([ 1.41421356, 0.70710678])
From numpy.std documentation,
ddof : int, optional
Means Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements. By default ddof is zero.
Apparently, R.scale() uses ddof=1, but sklearn.preprocessing.StandardScaler() uses ddof=0.
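Since the two conventions differ only by the constant factor sqrt(n / (n - 1)), you can also convert one result into the other; a small sketch with the array from the question:
import numpy as np
from sklearn import preprocessing
a = np.array([[1.0, 2.0], [3.0, 1.0]])
n = a.shape[0]
sklearn_scaled = preprocessing.scale(a)                 # divides by std with ddof=0
r_like_scaled = sklearn_scaled * np.sqrt((n - 1) / n)   # matches R's scale() (ddof=1)
print(r_like_scaled)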
EDIT: (To explain how to use alternate ddof)
There doesn't seem to be a straightforward way to calculate std with alternate ddof, without accessing the variables of the StandardScaler() object itself.
import numpy
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
sc.fit(data)
# Now, sc.mean_ and sc.std_ are the mean and standard deviation of the data
# Replace the sc.std_ value with the std calculated using numpy
sc.std_ = numpy.std(data, axis=0, ddof=1)
The current answers are good, but sklearn has changed a bit in the meantime. The new syntax that makes sklearn behave exactly like R's scale() is:
from sklearn.preprocessing import StandardScaler
import numpy as np
sc = StandardScaler()
sc.fit(data)
sc.scale_ = np.std(data, axis=0, ddof=1).to_list()  # assumes data is a pandas DataFrame; for a NumPy array use .tolist()
sc.transform(data)
Feature request:
https://github.com/scikit-learn/scikit-learn/issues/23758
R.scale documentation says:
The root-mean-square for a (possibly centered) column is defined as sqrt(sum(x^2)/(n-1)), where x is a vector of the non-missing values and n is the number of non-missing values. In the case center = TRUE, this is the same as the standard deviation, but in general it is not. (To scale by the standard deviations without centering, use scale(x, center = FALSE, scale = apply(x, 2, sd, na.rm = TRUE)).)
However, sklearn.preprocessing.StandardScaler always scales by the standard deviation.
In my case, I wanted to replicate R's scale() in Python without centering, so I followed @Sid's advice above in a slightly different way:
import numpy as np
from sklearn.preprocessing import StandardScaler

def get_scale_1d(v):
    # I copied this function from the R source code, haha
    v = v[~np.isnan(v)]
    std = np.sqrt(
        np.sum(v ** 2) / np.max([1, len(v) - 1])
    )
    return std

sc = StandardScaler()
sc.fit(data)
sc.std_ = np.apply_along_axis(func1d=get_scale_1d, axis=0, arr=data)
sc.transform(data)