Why is the cumsum of PCA explained_variance_ratio_ greater than 1? - python

I have a dataframe, df, as below:

Name   Sales   Book   Sign
Andy   10000   2      10
Bobo   20000   3      20
Tim    0       2      14
...    ...     ...    ...
I would like to perform PCA with n_components = 0.9.
So I first used StandardScaler() from sklearn to standardize the values (call the result Z), getting the array below:
[[ 1.33865216  1.80350169  1.90692518  1.40305228]
 [ 0.98050987  0.68720789  0.33371191  0.67278892]
 [ 0.95059432  1.10958933  0.47673129  0.85535476]
 [-0.20264719 -0.54976631 -0.23836565 -0.81816542]
 [-1.01921185 -1.63589    -0.52440442 -1.85270517]
 [ 0.89958047 -0.03687457  0.90578946  0.79449948]
 [-1.16715811 -0.85146734 -1.23950137 -0.81816542]
 [-0.3867463   0.05363574 -0.09534626  0.33808489]]
Then I used CustomPCA from the advanced_pca module to perform PCA with varimax rotation. Below is the code:
from advanced_pca import CustomPCA

varimax_pca = CustomPCA(n_components=n_components, rotation='varimax', random_state=9527)
However, I found something strange: the cumsum of explained_variance_ratio_ is greater than 1.
pca_var_ratio = varimax_pca.fit(Z).explained_variance_ratio_
print(pca_var_ratio)
>>>[0.57124482 1.09019268]
Is this a bug? Is it normal for the cumsum of the explained variance ratio to be greater than 1?
Thanks!
This is the link of advanced_pca module: https://pypi.org/project/advanced-pca/0.1/
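For reference, the explained_variance_ratio_ of a plain (unrotated) sklearn PCA always sums to at most 1, so comparing against it can show whether the values above come from the varimax rotation step. A minimal sanity-check sketch, assuming Z is the standardized array shown earlier:
import numpy as np
from sklearn.decomposition import PCA

# Plain PCA with no rotation: its ratios are guaranteed to sum to <= 1
plain_pca = PCA(n_components=0.9).fit(Z)
print(plain_pca.explained_variance_ratio_)
print(np.cumsum(plain_pca.explained_variance_ratio_))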

Related

Use same Min and Max Data for Multiple Features in MinMaxScaler

I have a dataset of 5 features. Two of these features are very similar but do not have the same min and max values.
... | feature 2 | feature 3 | ...
--------------------------------
..., 208.429993, 206.619995, ...
..., 207.779999, 205.050003, ...
..., 206.029999, 203.410004, ...
..., 204.429993, 202.600006, ...
..., 206.429993, 204.25, ...
feature 3 is always smaller than feature 2, and it is important that it stays that way after scaling. But since feature 2 and feature 3 do not have exactly the same min and max values, after scaling they will both end up with 0 and 1 as their min and max by default. This removes the relationship between the values. In fact, after scaling, the first sample becomes:
... | feature 2 | feature 3 | ...
--------------------------------
..., 0.00268, 0.00279, ...
This is something that I do not want. I cannot seem to find a way to manually change the min and max values of MinMaxScaler. There are other ugly hacks, such as combining feature 2 and feature 3 into one column for scaling and splitting them again afterward. But I would first like to know whether there is a solution handled by sklearn, such as using the same min and max for multiple features.
Otherwise, the simplest workaround would do.
Fit the scaler with one column and transform both. Trying with the data you posted:
feature_1 feature_2
0 208.429993 206.619995
1 207.779999 205.050003
2 206.029999 203.410004
3 204.429993 202.600006
import numpy as np
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler.fit(df['feature_2'].values.reshape(-1, 1))

# transform both columns with the scaler fitted on feature_2 only
np.column_stack([scaler.transform(df[[col]].values) for col in ['feature_1', 'feature_2']])

array([[1.45024949, 1.        ],
       [1.288559  , 0.60945366],
       [0.85323442, 0.20149259],
       [0.45522189, 0.        ]])
If you scale data that are outside of the range you used to fit the scaler, the scaled data will be outside of [0,1].
The only way to avoid it is to scale each column individually.
Whether or not this is a problem depends on what you want to do with the data after scaling.
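The pooled-fit workaround mentioned in the question (combining the two columns for scaling and splitting them again afterward) can also be written compactly; a minimal sketch, assuming the same feature_1 and feature_2 columns as above:
from sklearn.preprocessing import MinMaxScaler

# Fit on the pooled values of both columns so they share one min and one max
shared = MinMaxScaler()
shared.fit(df[['feature_1', 'feature_2']].values.reshape(-1, 1))

# Transform each column with the shared scaler; both stay in [0, 1]
# and their ordering is preserved
for col in ['feature_1', 'feature_2']:
    df[col + '_scaled'] = shared.transform(df[[col]].values).ravel()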

Documents-terms matrix dimensionality reduction

I am working with text documents clustering, with a Hierarchical Clustering approach, in Python.
I have a corpus of 10k documents and have constructed a documents-terms matrix over a dictionary based on a collection of terms classified as 'keyword' for the entire corpus.
The matrix has a shape: [10000 x 2000] and is very sparse. (let's call it dtm)
id      0   1   2   4  ...  1998  1999
0       0   0   0   1  ...     0     0
1       0   1   0   0  ...     0     1
2       1   0   0   0  ...     1     0
...    ..  ..  ..  ..  ...    ..    ..
9999    0   0   0   0  ...     0     0
I think that applying some dimensionality reduction technique could improve the precision of the clustering.
I have tried an MDS-based approach like this:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import MDS

def select_n_components(var_ratio, goal_var: float) -> int:
    # Variance explained so far and number of components kept so far
    total_variance = 0.0
    n_components = 0
    # Walk through the explained variance of each component
    for explained_variance in var_ratio:
        total_variance += explained_variance
        n_components += 1
        # Stop once we reach the target level of explained variance
        if total_variance >= goal_var:
            break
    # Return the number of components
    return n_components

def do_MDS(dtm):
    # Scale dtm into the range [0, 1] for better variance maximization
    scl = MinMaxScaler(feature_range=(0, 1))
    data_rescaled = scl.fit_transform(dtm)
    tsvd = TruncatedSVD(n_components=data_rescaled.shape[1] - 1)
    tsvd.fit(data_rescaled)
    # List of explained variances
    tsvd_var_ratios = tsvd.explained_variance_ratio_
    optimal_components = select_n_components(tsvd_var_ratios, 0.95)
    mds = MDS(n_components=optimal_components, dissimilarity="euclidean", random_state=1)
    pos = mds.fit_transform(dtm.values)
    U_df = pd.DataFrame(pos)
    U_df_transposed = U_df.T  # for consistency with the pipeline workflow, export a terms-documents matrix
    return U_df_transposed
The objective is to automatically detect an optimal number of components and apply the dimensionality reduction. But the output has not shown a tangible enhancement.
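A minimal usage sketch of the function above, assuming dtm is the 10000 x 2000 documents-terms DataFrame described earlier:
# dtm: documents-terms DataFrame (rows = documents, columns = terms)
reduced = do_MDS(dtm)
# After the transpose, rows are the reduced dimensions and columns are the documents
print(reduced.shape)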

Regression by group in python pandas

I want to ask a quick question related to regression analysis in python pandas.
So, assume that I have the following dataset:
Group Y X
1 10 6
1 5 4
1 3 1
2 4 6
2 2 4
2 3 9
My aim is to run a regression: Y is the dependent variable and X is the independent variable. The issue is that I want to run this regression by Group and print the coefficients in a new dataset. So, the results should look like:
Group Coefficient
1 0.25 (let's assume that the coefficient is 0.25)
2 0.30
I hope I have explained my question clearly.
Many thanks in advance for your help.
I am not sure about the type of regression you need, but this is how you do an OLS (ordinary least squares) regression:
import pandas as pd
import statsmodels.api as sm

def regress(data, yvar, xvars):
    Y = data[yvar]
    X = data[xvars]
    X['intercept'] = 1.0
    result = sm.OLS(Y, X).fit()
    return result.params

# This is what you need
df.groupby('Group').apply(regress, 'Y', ['X'])
You can define your regression function and pass parameters to it as mentioned.
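To get the Group / Coefficient table asked for, a minimal sketch building on the groupby result above (the column name 'Coefficient' is only illustrative):
params = df.groupby('Group').apply(regress, 'Y', ['X'])

# Keep only the slope on X and rename it to match the desired output
coef_df = params[['X']].rename(columns={'X': 'Coefficient'}).reset_index()
print(coef_df)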

Sampling rows in data frame with an empirical probability distribution of a variable

I have got the following problem.
Let's assume that we have a data frame with a few variables. Moreover, one variable (var_A) is a probability score: its values range from 0 to 1. I want to sample rows from this data frame in such a way that rows with a higher value of var_A are more likely to be picked, so I guess that I have to draw from an empirical distribution of var_A. I know how to implement the EDF of var_A as suggested here, but I have no idea how to use this distribution for sampling rows.
Can you please help me with this?
Thanks
You can use numpy.random.choice to sample in this manner:
import numpy as np
num_dists = 4
num_samples = 10
var_A = np.random.uniform(0, 1, num_dists)
# ensure var_A sums to 1
var_A /= np.sum(var_A)
samples = np.random.choice(len(var_A), num_samples, p=var_A)
print('var_A: ', var_A)
print('samples: ', samples)
Sample output:
var_A: [ 0.23262621 0.02990421 0.22357316 0.51389642]
samples: [3 0 0 2 0 0 2 3 3 2]
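To sample actual rows of the data frame in proportion to var_A, a minimal sketch using pandas (assuming the data frame is named df; the weights do not need to sum to 1, as pandas normalizes them internally):
# Draw 10 rows with replacement, weighted by var_A
sampled_rows = df.sample(n=10, replace=True, weights=df['var_A'], random_state=0)
print(sampled_rows)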

Very Large Values Predicted for Linear Regression

I'm trying to run a linear regression in Python to determine house prices given many features. Some of these are numeric and some are non-numeric. I'm attempting to one-hot encode the non-numeric columns, attach the new numeric columns to the old dataframe, and drop the non-numeric columns. This is done on both the training data and the test data.
I then took the intersection of the two sets of feature columns (since some encodings were only present in the testing data). Afterwards, the data goes into a linear regression. The code is the following:
import numpy
import pandas
from sklearn.linear_model import LinearRegression

non_numeric = list(set(list(train)) - set(list(train._get_numeric_data())))
train = pandas.concat([train, pandas.get_dummies(train[non_numeric])], axis=1)
train.drop(non_numeric, axis=1, inplace=True)
train = train._get_numeric_data()
train.fillna(0, inplace=True)

non_numeric = list(set(list(test)) - set(list(test._get_numeric_data())))
test = pandas.concat([test, pandas.get_dummies(test[non_numeric])], axis=1)
test.drop(non_numeric, axis=1, inplace=True)
test = test._get_numeric_data()
test.fillna(0, inplace=True)

feature_columns = list(set(train) & set(test))
#feature_columns.remove('SalePrice')
X = train[feature_columns]
y = train['SalePrice']

lm = LinearRegression(normalize=False)
lm.fit(X, y)

predictions = numpy.absolute(lm.predict(test).round(decimals=2))
The issue that I'm having is that I get these absurdly high Sale Prices as output, somewhere in the hundreds of millions of dollars. Before I tried one hot encoding I got reasonable numbers in the hundreds of thousands of dollars. I'm having trouble figuring out what changed.
Also, if there is a better way to do this I'd be eager to hear about it.
You seem to be running into collinearity introduced by the categorical variables in your feature columns, since the one-hot encoded columns of a categorical variable always sum to 1.
If you have one categorical variable, you need to set fit_intercept=False in your linear regression (or drop one of the one-hot encoded feature columns).
If you have more than one categorical variable, you need to drop one feature column per category to break the collinearity.
from sklearn.linear_model import LinearRegression
import numpy as np
import pandas as pd
In [72]:
df = pd.read_csv('/home/siva/anaconda3/data.csv')
df
Out[72]:
   C1  C2  C3     y
0   1   0   0  12.4
1   1   0   0  11.9
2   0   1   0   8.3
3   0   1   0   3.1
4   0   0   1   5.4
5   0   0   1   6.2
In [73]:
X = df.iloc[:, 0:3]
y = df.iloc[:, -1]
In [74]:
reg = LinearRegression()
reg.fit(X, y)
Out[74]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
In [75]:
reg.coef_, reg.intercept_
Out[75]:
(array([ 4.26666667, -2.18333333, -2.08333333]), 7.8833333333333346)
We find that the coefficients for C1, C2, C3 do not make sense given the X above.
In [76]:
reg1 = LinearRegression(fit_intercept=False)
reg1.fit(X, y)
Out[76]:
LinearRegression(copy_X=True, fit_intercept=False, n_jobs=1, normalize=False)
In [77]:
reg1.coef_
Out[77]:
array([ 12.15,   5.7 ,   5.8 ])
We find that the coefficients make much more sense when fit_intercept is set to False.
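Alternatively, instead of turning off the intercept, one dummy column per categorical variable can be dropped at encoding time. A minimal sketch using pandas get_dummies with drop_first=True (the column names here are only illustrative):
import pandas as pd

raw = pd.DataFrame({'colour': ['red', 'red', 'blue', 'green'],
                    'size': ['S', 'M', 'M', 'L']})

# drop_first=True removes one dummy column per categorical variable,
# which breaks the exact collinearity with the intercept
encoded = pd.get_dummies(raw, drop_first=True)
print(encoded.columns.tolist())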
A detailed explanation for a similar question can be found below:
https://stats.stackexchange.com/questions/224051/one-hot-vs-dummy-encoding-in-scikit-learn
I posted this at the stats site and Ami Tavory pointed out that the get_dummies should be run on the merged train and test dataframe to ensure that the same dummy variables were set up in both dataframes. This solved the issue.
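A minimal sketch of that fix (assuming train and test are the raw dataframes read in earlier): concatenate them before encoding, then split them back so both get identical dummy columns.
import pandas as pd

# Encode train and test together so both end up with the same dummy columns
combined = pd.get_dummies(pd.concat([train, test], keys=['train', 'test'], sort=False))

train_encoded = combined.xs('train')
test_encoded = combined.xs('test')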
