I have data with many dimensions (thousands) and have successfully performed a principal component analysis on it.
The output looks as shown above. My problem is that (as shown on the x-axis) one principal component accounts for all of the variance. I worry that it is masking all of the other principal components. Therefore, I want to figure out which dimension this principal component corresponds to, and then eliminate it from my data. Is there a way to work out which dimension it corresponds to?
If any code is necessary to aid understanding, it is this:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.decomposition import PCA
print(fdata.shape)
# scale data
scaled_fdata = preprocessing.scale(fdata.T)
# setting up PCA
pca = PCA()
pca.fit(scaled_fdata)
pca_data = pca.transform(scaled_fdata)
# scree plot to see how many principal components are needed to account for most of the variance
per_var = np.round(pca.explained_variance_ratio_* 100, decimals=1)
labels = ['PC' + str(x) for x in range(1, len(per_var)+1)]
# create scree plot diagram to see how many principal components should be used
plt.bar(x=range(1,len(per_var)+1), height=per_var, tick_label=labels)
plt.ylabel('Percentage of Explained Variance')
plt.xlabel('Principal Component')
plt.title('Scree Plot - Females')
plt.show()
#do the PCA plot
pca_df = pd.DataFrame(pca_data, columns=labels)
plt.scatter(pca_df.PC1, pca_df.PC2)
plt.title('My PCA Graph - Females')
plt.xlabel('PC1 - {0}%'.format(per_var[0]))
plt.ylabel('PC2 - {0}%'.format(per_var[1]))
for sample in pca_df.index:
    plt.annotate(sample, (pca_df.PC1.loc[sample], pca_df.PC2.loc[sample]))
plt.show()
I don't think you need to worry about it masking all of the other principal components, since that is the very point of principal component analysis. Normally when you do a PCA, the corresponding eigenvalues are reported, and they are used to rank the importance of the PCs: the PCs are ranked by how much of the variance they explain. The first one explains the most variance (which is why it "masks" the other PCs), then the second, the third, and so on. All you need to do is plot the cumulative sum of the explained variance and select a cutoff for how many PCs you want to use. For example, if the top 5 PCs explain 99% of the variance, you only need those 5 PCs.
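For example, a minimal sketch (assuming the pca object fitted in your code above) of plotting the cumulative explained variance and picking a cutoff:
import numpy as np
import matplotlib.pyplot as plt
# cumulative explained variance across the principal components
cum_var = np.cumsum(pca.explained_variance_ratio_) * 100
plt.plot(range(1, len(cum_var) + 1), cum_var, marker='o')
plt.axhline(99, color='r', linestyle='--', label='99% cutoff')
plt.xlabel('Number of Principal Components')
plt.ylabel('Cumulative Explained Variance (%)')
plt.legend()
plt.show()
# smallest number of PCs needed to reach the 99% cutoff
n_keep = np.argmax(cum_var >= 99) + 1
print(n_keep)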
Read through this: https://towardsdatascience.com/an-approach-to-choosing-the-number-of-components-in-a-principal-component-analysis-pca-3b9f3d6e73fe
Here is a solution using Principal Component Analysis with Varimax rotation.
Glad if it works for you!
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import decomposition
from sklearn import datasets
from sklearn.preprocessing import scale
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from factor_analyzer import FactorAnalyzer
from sklearn.preprocessing import StandardScaler
df = pd.read_excel('excel path')
df.drop(['DISTRICT'],axis=1,inplace=True)
numCOl = df
numCorr = numCOl.corr()
print(numCorr.round(3))
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity
chi_square_value,p_value=calculate_bartlett_sphericity(df)
chi_square_value, p_value
# performing the Kaiser-Meyer-Olkin (KMO) test
from factor_analyzer.factor_analyzer import calculate_kmo
kmo_all,kmo_model = calculate_kmo(df)
kmo_model
fa = FactorAnalyzer()
fa.fit(df)
ev,v = fa.get_eigenvalues()
print(ev)
# plotting the scree plot
plt.scatter(range(1,df.shape[1]+1),ev)
plt.plot(range(1,df.shape[1]+1),ev)
plt.title('Scree Plot')
plt.xlabel('Number of Factors')
plt.ylabel('Eigenvalue')
plt.grid()
plt.show()
fa = FactorAnalyzer(n_factors=4, method='principal', rotation='varimax')
fa.fit(df)
data_1 = pd.DataFrame(fa.loadings_,index=df.columns)
data_1.round(3)
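As an optional follow-up, here is a small sketch (assuming the factor_analyzer API used above) that also checks how much variance each rotated factor explains:
# variance, proportional variance and cumulative variance of the rotated factors
var_df = pd.DataFrame(fa.get_factor_variance(),
                      index=['Variance', 'Proportional Var', 'Cumulative Var'])
print(var_df.round(3))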
I have 136 numbers that come from an overlapping mixture of 8 Gaussian distributions. I want to find the mean and variance of each Gaussian distribution! Can you find any mistakes in my code?
file = open("1.txt",'r') #data is in 1.txt like 0,0,0,0,0,0,1,0,0,1,4,4,6,14,25,43,71,93,123,194...
y=[int (i) for i in list((file.read()).split(','))] # build the list y from the data above
x=list(range(1,len(y)+1)) # the x values
z=list(zip(x,y)) # z consists of pairs (1, 0), (2, 0), ...
Through the above process I obtain a list z whose elements are the 136 points (x, y) on the xy plane, with the given data as the y values.
Now I want to obtain each Gaussian distribution's mean and variance. The basic assumption is that the given data consists of 8 overlapping Gaussian distributions.
import numpy as np
from sklearn.mixture import GaussianMixture
data = np.array(z).reshape(-1,1)
model = GaussianMixture(n_components=8).fit(data)
print(model.means_)
file.close()
Actually, I don't know how to write the code to print the 8 means and variances... Can anyone help me?
You can use this; I have made some sample code for your visualizations:
import numpy as np
from sklearn.mixture import GaussianMixture
import scipy.stats
import matplotlib.pyplot as plt
%matplotlib inline
#Sample data
x = [0,0,0,0,0,0,1,0,0,1,4,4,6,14,25,43,71,93,123,194]
num_components = 2
#Fit a model onto the data
data = np.array(x).reshape(-1,1)
model = GaussianMixture(n_components=num_components).fit(data)
#Get list of means and variances
mu = np.abs(model.means_.flatten())
sd = np.sqrt(np.abs(model.covariances_.flatten()))
#Plotting
extend_window = 50 #this is for zooming into or out of the graph; the higher it is, the more zoomed out
x_values = np.arange(data.min()-extend_window, data.max()+extend_window, 0.1) #For plotting smooth graphs
plt.plot(data, np.zeros(data.shape), linestyle='None', markersize = 10.0, marker='o') #plot the data on x axis
#plot the different distributions (in this case 2 of them)
for i in range(num_components):
    y_values = scipy.stats.norm(mu[i], sd[i])
    plt.plot(x_values, y_values.pdf(x_values))
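To answer the original question directly, here is a small follow-up sketch (using the model fitted above; with the default covariance_type the covariances have shape (n_components, 1, 1) for 1-D data) that prints each component's mean and variance:
# print the mean and variance of each fitted Gaussian component
for i in range(num_components):
    print(f"component {i}: mean = {model.means_[i, 0]:.3f}, variance = {model.covariances_[i, 0, 0]:.3f}")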
I use the fuzzy-c-means clustering implementation and I would like the data X to form the number of clusters I define in the algorithm (I believe that is how it works). But the behavior is confusing.
cm = FCM(n_clusters=6)
cm.fit(X)
This code generates a plot with 4 labels - [0,2,4,6]
cm = FCM(n_clusters=4)
cm.fit(X)
This code generates a plot with 4 labels - [0,1,2,3]
I expect labels [0,1,2,3,4,5] when I initialize the number of clusters to be 6.
code:
from fcmeans import FCM
from matplotlib import pyplot as plt
from seaborn import scatterplot as scatter
# fit the fuzzy-c-means
fcm = FCM(n_clusters=6)
fcm.fit(X)
# outputs
fcm_centers = fcm.centers
fcm_labels = fcm.u.argmax(axis=1)
# plot result
%matplotlib inline
f, axes = plt.subplots(1, 2, figsize=(11,5))
scatter(X[:,0], X[:,1], ax=axes[0])
scatter(X[:,0], X[:,1], ax=axes[1], hue=fcm_labels)
scatter(fcm_centers[:,0], fcm_centers[:,1], ax=axes[1],marker="s",s=200)
plt.show()
Fuzzy c-means is a fuzzy clustering algorithm.
The labels are only an approximation to the fuzzy assignment.
Most likely two clusters are pretty weak, and hence never win the argmax operation used to produce the labels. That doesn't mean these clusters have not been used; you are just not using the full fuzzy result.
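As a minimal sketch (reusing the fitted fcm object from your code), you can inspect the full membership matrix fcm.u rather than only its argmax and see how strongly each cluster is actually used:
import numpy as np
memberships = fcm.u                          # shape (n_samples, n_clusters)
print(memberships.shape)
print(memberships.sum(axis=1)[:5])           # each row sums to 1 (fuzzy partition)
# mean membership per cluster: weak clusters have a low average membership
print(np.round(memberships.mean(axis=0), 3))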
I'm using fuzzy-c-means version 1.7.0:
>>> import fcmeans
>>> fcmeans.__version__
'1.7.0'
Using the iris data:
>>> from sklearn.datasets import load_iris
>>> iris = load_iris().data
>>> model = fcmeans.FCM(n_clusters = 2)
>>> model.fit(iris)
>>> pred = model.predict(iris)
>>> from collections import Counter
>>> Counter(pred)
Counter({0: 97, 1: 53})
So, the n_clusters parameter was applied correctly.
I read about it, and it looks like once the algorithm reaches the knee point (the maximum number of clusters it can form with the data), it won't create anything more than this. So in my case, 4 was the maximum number of clusters the algorithm could form with the given dataset.
I am trying to evaluate the amplitude spectrum of the Google Trends time series using a fast Fourier transform. If you look at the data for 'diet' provided here, it shows a very strong seasonal pattern:
I thought I could analyze this pattern using an FFT, which presumably should show a strong peak for a period of 1 year.
However when I apply a FFT like this (a_gtrend_ham being the time series multiplied with a Hamming window):
import matplotlib.pyplot as plt
import numpy as np
from numpy.fft import fft, fftshift
import pandas as pd
gtrend = pd.read_csv('multiTimeline.csv',index_col=0)
gtrend.index = pd.to_datetime(gtrend.index, format='%Y-%m')
# Sampling rate
fs = 12 #Points per year
a_gtrend_orig = gtrend['diet: (Worldwide)']
N_gtrend_orig = len(a_gtrend_orig)
length_gtrend_orig = N_gtrend_orig / fs
t_gtrend_orig = np.linspace(0, length_gtrend_orig, num = N_gtrend_orig, endpoint = False)
a_gtrend_sel = a_gtrend_orig.loc['2005-01-01 00:00:00':'2017-12-01 00:00:00']
N_gtrend = len(a_gtrend_sel)
length_gtrend = N_gtrend / fs
t_gtrend = np.linspace(0, length_gtrend, num = N_gtrend, endpoint = False)
a_gtrend_zero_mean = a_gtrend_sel - np.mean(a_gtrend_sel)
ham = np.hamming(len(a_gtrend_zero_mean))
a_gtrend_ham = a_gtrend_zero_mean * ham
N_gtrend = len(a_gtrend_ham)
ampl_gtrend = 1/N_gtrend * abs(fft(a_gtrend_ham))
mag_gtrend = fftshift(ampl_gtrend)
freq_gtrend = np.linspace(-0.5, 0.5, len(ampl_gtrend))
response_gtrend = 20 * np.log10(mag_gtrend)
response_gtrend = np.clip(response_gtrend, -100, 100)
My resulting amplitude spectrum does not show any dominant peak:
Where is my misunderstanding of how to use the FFT to get the spectrum of the data series?
Here is a clean implementation of what I think you are trying to accomplish. I include graphical output and a brief discussion of what it likely means.
First, we use rfft() because the data is real-valued. This saves the time and effort (and reduces the bug rate) that would otherwise follow from generating the redundant negative frequencies. And we use rfftfreq() to generate the frequency list (again, it is unnecessary to hand-code it, and using the API reduces the bug rate).
For your data, the Tukey window is more appropriate than the Hamming and similar cos or sin based window functions. Notice also that we subtract the median before multiplying by the window function. The median() is a fairly robust estimate of the baseline, certainly more so than the mean().
In the graph you can see that the data falls quickly from its initial value and then ends low. The Hamming and similar windows sample the middle too narrowly for this and needlessly attenuate a lot of useful data.
For the FT graphs, we skip the zero frequency bin (the first point) since this only contains the baseline and omitting it provides a more convenient scaling for the y-axes.
You will notice some high frequency components in the graph of the FT output.
I include a sample code below that illustrates a possible origin of those high frequency components.
Okay here is the code:
import matplotlib.pyplot as plt
import numpy as np
from numpy.fft import rfft, rfftfreq
from scipy.signal import tukey
from numpy.fft import fft, fftshift
import pandas as pd
gtrend = pd.read_csv('multiTimeline.csv',index_col=0,skiprows=2)
#print(gtrend)
gtrend.index = pd.to_datetime(gtrend.index, format='%Y-%m')
#print(gtrend.index)
a_gtrend_orig = gtrend['diet: (Worldwide)']
t_gtrend_orig = np.linspace( 0, len(a_gtrend_orig)/12, len(a_gtrend_orig), endpoint=False )
a_gtrend_windowed = (a_gtrend_orig-np.median( a_gtrend_orig ))*tukey( len(a_gtrend_orig) )
plt.subplot( 2, 1, 1 )
plt.plot( t_gtrend_orig, a_gtrend_orig, label='raw data' )
plt.plot( t_gtrend_orig, a_gtrend_windowed, label='windowed data' )
plt.xlabel( 'years' )
plt.legend()
a_gtrend_psd = abs(rfft( a_gtrend_orig ))
a_gtrend_psdtukey = abs(rfft( a_gtrend_windowed ) )
# Notice that we assert the delta-time here,
# It would be better to get it from the data.
a_gtrend_freqs = rfftfreq( len(a_gtrend_orig), d = 1./12. )
# For the PSD graph, we skip the first point (the zero-frequency bin); it represents
# the baseline (or mean), is usually not relevant to the analysis, and omitting it gives a more useful scale
plt.subplot( 2, 1, 2 )
plt.plot( a_gtrend_freqs[1:], a_gtrend_psd[1:], label='psd raw data' )
plt.plot( a_gtrend_freqs[1:], a_gtrend_psdtukey[1:], label='windowed psd' )
plt.xlabel( 'frequency ($yr^{-1}$)' )
plt.legend()
plt.tight_layout()
plt.show()
And here is the output displayed graphically. There are strong signals at 1/year and at about 0.14 yr^-1 (a period of roughly 7 years, i.e. half of 14 years), and there is a set of higher frequency signals that at first perusal might seem quite mysterious.
We see that the windowing function is actually quite effective in bringing the data to baseline and you see that the relative signal strengths in the FT are not altered very much by applying the window function.
If you look at the data closely, there seems to be some repeated variations within the year. If those occur with some regularity, they can be expected to appear as signals in the FT, and indeed the presence or absence of signals in the FT is often used to distinguish between signal and noise. But as will be shown, there is a better explanation for the high frequency signals.
Okay, now here is a sample code that illustrates one way those high frequency components can be produced. In this code, we create a single tone, and then we create a set of spikes at the same frequency as the tone. Then we Fourier transform the two signals and finally, graph the raw and FT data.
import matplotlib.pyplot as plt
import numpy as np
from numpy.fft import rfft, rfftfreq
t = np.linspace( 0, 1, 1000 )
y = np.cos( 50*3.14*t )
y2 = [ 1. if 1.-v < 0.01 else 0. for v in y ]
plt.subplot( 2, 1, 1 )
plt.plot( t, y, label='tone' )
plt.plot( t, y2, label='spikes' )
plt.xlabel('time')
plt.subplot( 2, 1, 2 )
plt.plot( rfftfreq(len(y),d=1/1000.), abs( rfft(y) ), label='tone' )
plt.plot( rfftfreq(len(y2),d=1/1000.), abs( rfft(y2) ), label='spikes' )
plt.xlabel('frequency')
plt.legend()
plt.tight_layout()
plt.show()
Okay, here are the graphs of the tone, and the spikes, and then their Fourier transforms. Notice that the spikes produce high frequency components that are very similar to those in our data.
In other words, the origin of the high frequency components is very likely in the short time scales associated with the spikey character of signals in the raw data.
I have performed a PCA on my original dataset, and from the compressed dataset produced by the PCA I have selected the number of PCs I want to keep (they explain almost 94% of the variance). Now I am struggling to identify which original features are important in the reduced dataset.
How do I find out which feature is important and which is not among the remaining Principal Components after the dimension reduction?
Here is my code:
from sklearn.decomposition import PCA
pca = PCA(n_components=8)
pca.fit(scaledDataset)
projection = pca.transform(scaledDataset)
Furthermore, I also tried to run a clustering algorithm on the reduced dataset, but surprisingly for me the score is lower than on the original dataset. How is that possible?
First of all, I assume that by "features" you mean the variables and not the samples/observations. In this case, you could do something like the following by creating a biplot function that shows everything in one plot. In this example, I am using the iris data.
Before the example, please note that the basic idea when using PCA as a tool for feature selection is to select variables according to the magnitude (from largest to smallest in absolute values) of their coefficients (loadings). See my last paragraph after the plot for more details.
Overview:
PART 1: I explain how to check the importance of the features and how to plot a biplot.
PART 2: I explain how to check the importance of the features and how to save them into a pandas dataframe using the feature names.
PART 1:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.decomposition import PCA
import pandas as pd
from sklearn.preprocessing import StandardScaler
iris = datasets.load_iris()
X = iris.data
y = iris.target
#In general a good idea is to scale the data
scaler = StandardScaler()
scaler.fit(X)
X=scaler.transform(X)
pca = PCA()
x_new = pca.fit_transform(X)
def myplot(score, coeff, labels=None):
    xs = score[:,0]
    ys = score[:,1]
    n = coeff.shape[0]
    scalex = 1.0/(xs.max() - xs.min())
    scaley = 1.0/(ys.max() - ys.min())
    plt.scatter(xs * scalex, ys * scaley, c = y)
    for i in range(n):
        plt.arrow(0, 0, coeff[i,0], coeff[i,1], color = 'r', alpha = 0.5)
        if labels is None:
            plt.text(coeff[i,0]* 1.15, coeff[i,1] * 1.15, "Var"+str(i+1), color = 'g', ha = 'center', va = 'center')
        else:
            plt.text(coeff[i,0]* 1.15, coeff[i,1] * 1.15, labels[i], color = 'g', ha = 'center', va = 'center')
    plt.xlim(-1,1)
    plt.ylim(-1,1)
    plt.xlabel("PC{}".format(1))
    plt.ylabel("PC{}".format(2))
    plt.grid()
#Call the function. Use only the 2 PCs.
myplot(x_new[:,0:2],np.transpose(pca.components_[0:2, :]))
plt.show()
Visualize what's going on using the biplot
Now, the importance of each feature is reflected by the magnitude of the corresponding values in the eigenvectors (higher magnitude - higher importance)
Let's first see how much variance each PC explains.
pca.explained_variance_ratio_
[0.72770452, 0.23030523, 0.03683832, 0.00515193]
PC1 explains 72% and PC2 23%. Together, if we keep PC1 and PC2 only, they explain 95%.
Now, let's find the most important features.
print(abs( pca.components_ ))
[[0.52237162 0.26335492 0.58125401 0.56561105]
[0.37231836 0.92555649 0.02109478 0.06541577]
[0.72101681 0.24203288 0.14089226 0.6338014 ]
[0.26199559 0.12413481 0.80115427 0.52354627]]
Here, pca.components_ has shape [n_components, n_features]. Thus, by looking at PC1 (the first principal component), which is the first row [0.52237162 0.26335492 0.58125401 0.56561105], we can conclude that features 1, 3 and 4 (or Var 1, 3 and 4 in the biplot) are the most important. This is also clearly visible from the biplot (that's why we often use this plot to summarize the information in a visual way).
To sum up, look at the absolute values of the eigenvectors' components corresponding to the k largest eigenvalues. In sklearn the components are sorted by explained_variance_. The larger these absolute values are, the more a specific feature contributes to that principal component.
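For instance, here is a minimal sketch (continuing the iris example above) that ranks all features by the magnitude of their loading on PC1:
# rank the iris features by the magnitude of their PC1 loading
feature_names = iris.feature_names
pc1_loadings = np.abs(pca.components_[0])
for idx in pc1_loadings.argsort()[::-1]:
    print(feature_names[idx], round(pc1_loadings[idx], 3))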
PART 2:
The important features are the ones that influence more the components and thus, have a large absolute value/score on the component.
To get the most important features on the PCs with names and save them into a pandas dataframe use this:
from sklearn.decomposition import PCA
import pandas as pd
import numpy as np
np.random.seed(0)
# 10 samples with 5 features
train_features = np.random.rand(10,5)
model = PCA(n_components=2).fit(train_features)
X_pc = model.transform(train_features)
# number of components
n_pcs= model.components_.shape[0]
# get the index of the most important feature on EACH component
# LIST COMPREHENSION HERE
most_important = [np.abs(model.components_[i]).argmax() for i in range(n_pcs)]
initial_feature_names = ['a','b','c','d','e']
# get the names
most_important_names = [initial_feature_names[most_important[i]] for i in range(n_pcs)]
# LIST COMPREHENSION HERE AGAIN
dic = {'PC{}'.format(i): most_important_names[i] for i in range(n_pcs)}
# build the dataframe
df = pd.DataFrame(dic.items())
This prints:
0 1
0 PC0 e
1 PC1 d
So for the first principal component (labelled PC0 above) the most important feature is e, and for the second (PC1) it is d.
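If you need more than a single feature per component, a small variation of the snippet above (same objects, just sorting all loadings instead of taking only the argmax) gives a ranked list per PC:
# rank all features per component by absolute loading, largest first
for i in range(n_pcs):
    order = np.abs(model.components_[i]).argsort()[::-1]
    print('PC{}:'.format(i), [initial_feature_names[j] for j in order])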
Nice article as well here: https://towardsdatascience.com/pca-clearly-explained-how-when-why-to-use-it-and-feature-importance-a-guide-in-python-7c274582c37e?source=friends_link&sk=65bf5440e444c24aff192fedf9f8b64f
The pca library contains this functionality.
pip install pca
A demonstration of how to extract the feature importance follows:
# Import libraries
import numpy as np
import pandas as pd
from pca import pca
# Lets create a dataset with features that have decreasing variance.
# We want to extract feature f1 as most important, followed by f2 etc
f1=np.random.randint(0,100,250)
f2=np.random.randint(0,50,250)
f3=np.random.randint(0,25,250)
f4=np.random.randint(0,10,250)
f5=np.random.randint(0,5,250)
f6=np.random.randint(0,4,250)
f7=np.random.randint(0,3,250)
f8=np.random.randint(0,2,250)
f9=np.random.randint(0,1,250)
# Combine into dataframe
X = np.c_[f1,f2,f3,f4,f5,f6,f7,f8,f9]
X = pd.DataFrame(data=X, columns=['f1','f2','f3','f4','f5','f6','f7','f8','f9'])
# Initialize
model = pca()
# Fit transform
out = model.fit_transform(X)
# Print the top features. The results show that f1 is best, followed by f2 etc
print(out['topfeat'])
# PC feature
# 0 PC1 f1
# 1 PC2 f2
# 2 PC3 f3
# 3 PC4 f4
# 4 PC5 f5
# 5 PC6 f6
# 6 PC7 f7
# 7 PC8 f8
# 8 PC9 f9
Plot the explained variance
model.plot()
Make the biplot. It can be nicely seen that the feature with the most variance (f1) is almost horizontal in the plot, whereas the feature with the second most variance (f2) is almost vertical. This is expected because most of the variance is in f1, followed by f2, etc.
ax = model.biplot(n_feat=10, legend=False)
Biplot in 3D. Here we also see the expected f3 appear in the plot along the z-direction.
ax = model.biplot3d(n_feat=10, legend=False)
import numpy as np
import pandas as pd
# original_num_df is the original numeric dataframe
# pca is the fitted model
def create_importance_dataframe(pca, original_num_df):
    # Change the pca components ndarray to a dataframe
    importance_df = pd.DataFrame(pca.components_)
    # Assign columns
    importance_df.columns = original_num_df.columns
    # Change to absolute values
    importance_df = importance_df.apply(np.abs)
    # Transpose
    importance_df = importance_df.transpose()
    # Change column names again
    ## First get number of pcs
    num_pcs = importance_df.shape[1]
    ## Generate the new column names
    new_columns = [f'PC{i}' for i in range(1, num_pcs + 1)]
    ## Now rename
    importance_df.columns = new_columns
    # Return importance df
    return importance_df
# Call function to create importance df
importance_df =create_importance_dataframe(pca, original_num_df)
# Show first few rows
display(importance_df.head())
# Sort depending on PC of interest
## PC1 top 10 important features
pc1_top_10_features = importance_df['PC1'].sort_values(ascending=False)[:10]
print('PC1 top 10 features are:')
display(pc1_top_10_features)
## PC2 top 10 important features
pc2_top_10_features = importance_df['PC2'].sort_values(ascending=False)[:10]
print('PC2 top 10 features are:')
display(pc2_top_10_features)
I am writing code to simulate a mono-energetic gamma beam for which the dominant interaction is photoelectric absorption, mu = 2 cm^-1, and I need to generate 50000 random numbers and sample the interaction depth (which I do not know whether I did correctly or not).
I know that the mean free path = mu^-1, but I need to find the mean free path from the simulation as well as from mu and compare them. Is what I did in the code right or not?
import random
import matplotlib.pyplot as plt
import numpy as np
mu=(2)
random.seed=()
data = np.random.randn(50000)*10
bins = np.arange(data.min(), data.max()+1e-8, 0.1)
meanfreepath = 1/mu
print(meanfreepath)
plt.hist(data, bins=bins)
plt.show()
Well, the interaction depth distribution is an exponential one, not a Gaussian.
So the code would be
import numpy as np
lmbda = 2 # cm^-1
beta = 1.0/lmbda
data = np.random.exponential(scale=beta, size=50000)
mfp = np.mean(data)
print(mfp)
# build histogram
More details at https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.random.exponential.html
The code above produced
0.4977168417102998
which looks like 2^-1 to me.
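For completeness, the histogram mentioned in the comment above could be built like this — a minimal sketch (assuming matplotlib is available) that also overlays the theoretical exponential pdf:
import numpy as np
import matplotlib.pyplot as plt
lmbda = 2  # cm^-1
data = np.random.exponential(scale=1.0/lmbda, size=50000)
# histogram of the sampled interaction depths, normalised to a density
plt.hist(data, bins=100, density=True, alpha=0.6, label='sampled depths')
# theoretical pdf lmbda * exp(-lmbda * x); its mean is 1/lmbda = 0.5 cm
x = np.linspace(0, data.max(), 200)
plt.plot(x, lmbda * np.exp(-lmbda * x), label='theoretical pdf')
plt.xlabel('interaction depth (cm)')
plt.ylabel('probability density')
plt.legend()
plt.show()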