Can the dimension of the data be reduced to only one principal component?
I tried it on the iris data set:
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import pandas as pd
import matplotlib.pyplot as plt

X = StandardScaler().fit_transform(load_iris().data)  # X = standardized iris data
pca = PCA(n_components=1)
pca_X = pca.fit_transform(X)
pca_df = pd.DataFrame(pca_X, columns=["PCA1"])
plt.plot(pca_df["PCA1"], "o")
We can see three different clusters. So, can the dimension be reduced to 1?
You can choose to reduce the dimensions to 1 using PCA; the only thing it promises is that the resulting principal component lies in the direction of highest variance in the data.
If you are reducing the dimensions in order to improve classification, you can use Linear Discriminant Analysis (LDA), which gives you the direction of maximum separation between the classes.
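For illustration, a minimal sketch of that supervised alternative on the iris data (assuming the species labels are used as the class targets):
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

iris = load_iris()
# LDA is supervised: it uses the class labels to find the direction
# of maximum separation between the classes.
lda = LinearDiscriminantAnalysis(n_components=1)
lda_X = lda.fit_transform(iris.data, iris.target)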
Yes, the dimension can be reduced to 1, which is exactly what you have done in your example.
The y-axis in your plot shows the coordinate of each observation with respect to the first principal component.
The three clusters likely relate to the three species in the Iris dataset and have nothing to do with the number of components.
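If you want to check how much of the original information the single component retains, the fitted PCA exposes explained_variance_ratio_; a quick sketch using the pca object from the question:
print(pca.explained_variance_ratio_)
# For the standardized iris data the first component captures roughly 73%
# of the total variance, so one dimension already keeps most of the information.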
When do I apply PCA: after preprocessing the entire dataset (i.e. removing null values, encoding, etc.) or before? After I've completely preprocessed my dataset,
from sklearn.preprocessing import StandardScaler

# Scale the first 14 columns; fit on the training split only.
sc = StandardScaler()
x_train[:, 0:14] = sc.fit_transform(x_train[:, 0:14])
x_test[:, 0:14] = sc.transform(x_test[:, 0:14])
I'm left with data of shape 113126 × 91.
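For reference, if PCA is applied after this scaling step, a minimal sketch could look like the following (the number of components is only an illustrative choice, and x_train/x_test are the arrays from the snippet above):
from sklearn.decomposition import PCA

pca = PCA(n_components=30)                 # illustrative choice, not a recommendation
x_train_pca = pca.fit_transform(x_train)   # learn the projection on the training split only
x_test_pca = pca.transform(x_test)         # reuse the same projection on the test split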
Applying PCA on scaled data is better because you won't face the "large vs. tiny" problem between features.
The "large vs. tiny" problem means that the variances of the features differ widely. For example, in a dataset one feature might have a range of (-5, +5) while another lies in the range (-10000, +10000). Features with larger values can dominate the process.
PCA is a dimensionality reduction technique used on large data sets: it transforms a large collection of variables into a smaller one that still contains most of the information in the original set. To reduce dimensions, PCA takes the eigenvectors with the highest eigenvalues and maps your data points onto those vectors; hence the dimensionality is reduced.
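That eigenvector view can be written out directly in NumPy; a minimal sketch on a small random matrix (X_demo is a made-up stand-in for your data):
import numpy as np

X_demo = np.random.normal(size=(100, 5))
X_demo = X_demo - X_demo.mean(axis=0)                 # centre the data first
cov = np.cov(X_demo, rowvar=False)                    # feature covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)                # eigh: for symmetric matrices
top2 = eigvecs[:, np.argsort(eigvals)[::-1][:2]]      # eigenvectors with the 2 largest eigenvalues
X_reduced = X_demo @ top2                             # project the points onto those directions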
Let me give you an example of how applying PCA after scaling helps.
First, let me import the packages we will use for this example.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale, normalize
import matplotlib.pyplot as plt
# For reproducibility
np.random.seed(123)
Let me make a dummy data set on which we will see the effect of applying PCA before and after scaling.
rows = 100
features = 7
X = np.random.normal(size=[rows, features])
X = np.append(X, 3*np.random.choice(2, size = [rows,1]), axis = 1)
A dummy dataset is created in the variable X with 100 examples and 8 features: 7 standard normal features plus one appended binary feature that takes the values 0 or 3. Now let's apply PCA to it without scaling and plot the data.
pca = PCA(2)
low_x = pca.fit_transform(X)
plt.scatter(low_x[:,0], low_x[:,1])
Here is a plot of the data after reducing the number of features from 8 to 2 without scaling the dataset. You can see that the data points are crowded together and messy, and one feature has a much higher variance than the other. This will affect the results of any further processing or modelling.
Let's apply feature scaling first and then apply PCA to the dataset.
X_normalized = normalize(X)
pca = PCA(2)
low_x = pca.fit_transform(X_normalized)
plt.scatter(low_x[:,0], low_x[:,1])
In the following plot, the data is clearly spread out, and there is no big difference between the variances of the two plotted dimensions.
Hence, it is always better to apply normalization before applying PCA to a dataset.
But always remember one thing: data science is mostly trial and error. Try this, and if it doesn't help your results, you can always try a different approach.
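If you want to bake that ordering in (scale first, then PCA), sklearn's Pipeline is convenient; a minimal sketch, using StandardScaler (a per-feature scaler) rather than the row-wise normalize used above:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# The pipeline guarantees scaling is always fitted and applied before PCA.
scaled_pca = make_pipeline(StandardScaler(), PCA(2))
low_x = scaled_pca.fit_transform(X)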
My question is simple. I think I more or less understand the FFT and the DFT. What I don't understand is why, in Python or MATLAB, we use the number of samples as the FFT size, and why every sample taken in the time domain corresponds to a frequency bin in the frequency domain.
For example, with SciPy's fftpack, to plot the spectrum of a .wav file signal we use:
from scipy import fftpack

FFT_out = abs(fftpack.fft(time_domain_signal))
Frequency_Vector = fftpack.fftfreq(len(FFT_out), 1 / Sampling_rate)
Now if I check len(FFT_out),
it is the same as the number of samples (i.e. sampling frequency × duration of the audio signal), and since fftfreq returns the frequency vector that contains the frequency bins, len(FFT_out) = number of frequency bins.
A simple explanation would be appreciated.
Mathematically, a key property of the Fourier transform is that it is linear and invertible. The latter means that if two signals have the same Fourier transform they are equal, and that for any spectrum there is a signal with that spectrum.
For implementations working on a finite collection of samples of a signal, the first property means that the Fourier transform can be represented by an N x M matrix, where N is the number of time samples and M the number of frequency samples. The second property means that the matrix must be invertible, and therefore square, i.e. we must have M == N.
You say that time bins and frequency bins correspond, and that is true in the sense that there are the same number of them. However, the value in each frequency bin depends on all the time values.
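Both properties are easy to check numerically; a small NumPy sketch (N time samples in, N frequency bins out, and the inverse transform recovers the original signal):
import numpy as np

N = 16
x = np.random.normal(size=N)      # N time-domain samples
X = np.fft.fft(x)                 # N frequency bins, one per sample
print(len(X) == N)                # True: as many bins as samples
x_back = np.fft.ifft(X)           # invert the transform...
print(np.allclose(x, x_back))     # True: ...and no information was lost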
I want to fit an autoregressive model on some data stored in a dataframe, and I have 96 data points per day. The data is the value of solar irradiance in some region, and I know it has a 1-day seasonality. I want to obtain a simple linear model using scikit-learn's LinearRegression, and I want to specify which lagged data points to use: the last 10 data points, plus the data point with a lag of 97, which corresponds to the data point 24 hours earlier. How can I specify which lagged coefficients I want to use? I don't want 97 coefficients, I just want to use 11 of them: the previous 10 data points and the data point 97 positions back.
Just build a dataset X with 11 columns [x(t-97), x(t-10), x(t-9), ..., x(t-1)]. The series of x(t) will then be your target y.
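A minimal sketch of that construction with pandas and scikit-learn, on a made-up irradiance series with a 96-sample daily cycle (all names here are illustrative):
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up irradiance series: 10 days of 96 samples with a daily cycle plus noise.
y_series = pd.Series(np.sin(np.arange(960) * 2 * np.pi / 96) + np.random.normal(scale=0.1, size=960))

lags = list(range(1, 11)) + [97]               # the 10 most recent lags plus lag 97
df = pd.DataFrame({"y": y_series})
for lag in lags:
    df[f"lag_{lag}"] = df["y"].shift(lag)      # lagged copies of the series
df = df.dropna()                               # drop rows where a lag is not available

X = df[[f"lag_{lag}" for lag in lags]]
y = df["y"]
model = LinearRegression().fit(X, y)           # exactly one coefficient per selected lag
print(model.coef_, model.intercept_)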
I have x1 = Job Level (numerical), x2 = Job Code (categorical) and y = Stock Value (numerical). For a data set of 3 × 500, I have 250 NaN values in Stock Value.
What do I need to change in my code below to treat x2 as a categorical variable and rerun the program to find the coefficients?
[Data set example]
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_excel("stats.xlsx")

df_nonull = df.dropna()
X_train = df_nonull[['Job Code', 'Job Level']]
y_train = df_nonull[['Stock Value']]

X_test = df[['Job Code', 'Job Level']]
y_test = df[['Stock Value']]

regressor = LinearRegression()
model = regressor.fit(X_train, y_train)

# display coefficients
print(regressor.coef_)
This is a straightforward model training problem. Your available training data (observations) are the rows with Stock Value present; your later "real" data are the rows without.
Categorical data is perfectly legitimate in such cases. In fact, you might try declaring Job Level as categorical as well, since it's discrete; that frees you from any assumption of linearity (although it also discards the ordering of the levels).
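For example, one common way to let LinearRegression consume Job Code as a categorical feature is one-hot encoding with pd.get_dummies; a minimal sketch along the lines of the question's code (column names taken from the question):
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_excel("stats.xlsx")
# Each distinct Job Code becomes its own 0/1 indicator column.
df_encoded = pd.get_dummies(df, columns=["Job Code"])

df_nonull = df_encoded.dropna()
feature_cols = [c for c in df_nonull.columns if c != "Stock Value"]

regressor = LinearRegression().fit(df_nonull[feature_cols], df_nonull["Stock Value"])
print(regressor.coef_)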
Your task is to choose a model type that serves your data properly. This requires research and experimentation; welcome to data science. Since you haven't discussed your data's shape, density, connectivity, clustering, etc., there's really not much we can explore with you. Six observations on three features (note that Job Code and Job Title are not 100% coupled) is not enough for educated speculation.
Try adding some polynomial terms to your "linear" regression: perhaps a squared term and a square-root term for each input. That's often the first attempt for such a task.
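A quick sketch of that idea for the numeric input, using PolynomialFeatures for the squared term and adding the square-root term by hand (it assumes df_nonull from the code above and non-negative Job Level values):
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

X_num = df_nonull[["Job Level"]].to_numpy(dtype=float)
y = df_nonull["Stock Value"]

poly = PolynomialFeatures(degree=2, include_bias=False)   # adds Job Level and Job Level**2
X_poly = poly.fit_transform(X_num)
X_poly = np.column_stack([X_poly, np.sqrt(X_num)])        # hand-made square-root term

model = LinearRegression().fit(X_poly, y)
print(model.coef_)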
I have a set of vectors, in python, composing my knowledge base, for example:
KB=[[1,2,3,4],[1,2,2,1],[4,3,1,2],[5,4,3,5]]
Now I computed the cluster for KB, using:
from sklearn.cluster import KMeans
model=KMeans(n_clusters=3)
model.fit(KB)
Now I have a new entry (I could have more than one),
A=[3,2,1,3]
and I would like to know which of the clusters computed above best fits A, exploiting the KB.
Could you help me?
Thanks in advance
Here you are:
KB = [[1,2,3,4],[1,2,2,1],[4,3,1,2],[5,4,3,5]]

from sklearn.cluster import KMeans
model = KMeans(n_clusters=3).fit(KB)

A = [3,2,1,3]
l = model.predict([A])          # cluster assigned to the new entry
print(model.labels_, l)

centers = model.cluster_centers_.copy()
print(centers)
So that the model is fitted, I joined the construction and the fit into one line.
I then use the predict method to assign A to a cluster.
I also print the labels of the examples that were used to fit the model.
Edit: add a plot.
import matplotlib.pyplot as plt
import numpy

# Compute the distance of each vector in KB to each cluster centre
d = numpy.array([[numpy.linalg.norm(numpy.array(KBi) - cj) for KBi in KB] for cj in centers])
print(d)

# distances to cluster 0 (x axis) vs. cluster 1 (y axis)
plt.scatter(d[0], d[1])
plt.pause(10)
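Since the question mentions possibly having more than one new entry: predict accepts a whole batch, and KMeans.transform returns the distance of each entry to every cluster centre; a small sketch reusing the fitted model from above:
new_entries = [[3, 2, 1, 3],
               [1, 2, 3, 3]]                 # several new vectors at once
labels = model.predict(new_entries)          # nearest-centroid cluster for each entry
distances = model.transform(new_entries)     # distance of each entry to every centre
print(labels)
print(distances)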