I have read various articles about standardization and normalization, but none of them offers a concrete example of how to rescale data using their formulas.
I would like to transform data as follows:
given data = [x1...xn],
rescale(data, n) should rescale it to length n whilst retaining the distribution, for example:
eg_1 = np.array([1]) --> rescale(eg_1, 2) --> [0.5, 0.5]
eg_2 = np.array([1,1]) --> rescale(eg_2, 4) --> [0.5, 0.5, 0.5, 0.5]
eg_3 = np.array([0,1]) --> rescale(eg_3, 4) --> [0, 0, 0.5, 0.5]
If possible, I would also like the inverse to hold, for example
inv_eg1 = np.array([0.5,0.5,0.5,0.5]) --> inv_rescale(inv_eg1, 2) --> [1, 1]
My initial attempt was simply the formula (sum of the values in the array / total length of the array) * position in the desired range = value at that position,
but unfortunately it does not retain the distribution.
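For the simple cases above, something like the following rough sketch would reproduce the examples (it assumes the target length is an integer multiple of the original length, and vice versa for the inverse), but I am looking for a general approach:

import numpy as np

def rescale(data, n):
    # assumes n is an integer multiple of len(data)
    factor = n // len(data)
    return np.repeat(data, factor) / factor  # preserves the total sum and the shape

def inv_rescale(data, n):
    # assumes len(data) is an integer multiple of n
    factor = len(data) // n
    return data.reshape(n, factor).sum(axis=1)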
The purpose is that I want to apply various kernels and matrices of different shapes, but I do not want to use padding.
Please help
I have a data array on which I have performed an FFT. This is the code that I have applied.
import numpy as np
# "data" is a column vector on which FFT needs to be performed
# N = No. of points in "data"
# dt = time interval between two corresponding data points
FFT_data = np.fft.fft(data) # Complex values
FFT_data_real = 2/N*abs(FFT_data) # Absolute values
However, I went through the following link: https://www.dsprelated.com/showarticle/1159.php
It says that to enhance the SNR we can apply "RMS-averaged FFT" and "Vector Averaged FFT".
Can somebody please let me know how we would go about doing these two methods in Python, or point me to any documentation/links we can refer to?
As your reference indicates:
If you take the square root of the average of the squares of your sample spectra, you are doing RMS Averaging. Another alternative is Vector Averaging in which you average the real and complex components separately.
Obviously to be able to perform either averaging you'd need to have more than a single data set to average. In your example code, you have a single column vector data. Let's assume you have multiple such column vectors arranged as a 2D NxM matrix, where N is the number of points per dataset and M is the number of datasets. Since the datasets are stored in columns, when computing the FFT you will need to specify the parameter axis=0 to compute the FFT along columns.
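For concreteness, the snippets below assume a data matrix of that shape; a hypothetical stand-in could be built like this (the signal, noise level and sizes are arbitrary choices):

import numpy as np

# Hypothetical N x M data matrix: M noisy repetitions of the same signal, one per column
N, M = 1024, 16
t = np.arange(N)
data = np.sin(2*np.pi*0.05*t)[:, None] + 0.5*np.random.randn(N, M)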
RMS-averaged FFT
As the name suggests, for this method you need to take the square root of the mean of the squared amplitudes. Since the different sets are stored in columns, you'd need to take the average along axis 1 (the axis other than the one used for the FFT).
FFT_data = np.fft.fft(data, axis=0) # Complex values
FFT_data_real = 2/N*abs(FFT_data) # Absolute values
rms_averaged = np.sqrt(np.mean(FFT_data_real**2, axis=1))
Vector Averaged FFT
In this case you need to obtain the real and imaginary components of the FFT data, then compute the average on each separately:
FFT_data = np.fft.fft(data, axis=0) # Complex values
real_part_avg = 2/N*np.mean(np.real(FFT_data),axis=1)
imag_part_avg = 2/N*np.mean(np.imag(FFT_data),axis=1)
vector_averaged = np.abs(real_part_avg+1j*imag_part_avg)
Note that I've kept the 2/N scaling you had for the absolute values.
But what can I do if I really only have one dataset?
If that dataset happens to be stationary and sufficiently large then you could break down your dataset into smaller blocks. This can be done by reshaping your vector into an NxM matrix with the following:
data = data.reshape(M, N).T  # N x M: each column is one contiguous block of N samples
...
Then you could perform the averaging with either method.
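For example, here is a minimal end-to-end sketch of the block-splitting idea (the signal is a synthetic stand-in and the block count M is an arbitrary choice):

import numpy as np

# Synthetic stand-in for the single long dataset
fs = 1000.0                                                 # sampling rate (assumed)
x = np.sin(2*np.pi*50*np.arange(8192)/fs) + 0.5*np.random.randn(8192)

M = 8                                                       # number of blocks (arbitrary)
N = len(x) // M                                             # points per block
blocks = x[:N*M].reshape(M, N).T                            # N x M, one contiguous block per column

FFT_blocks = np.fft.fft(blocks, axis=0)
FFT_abs = 2/N*np.abs(FFT_blocks)
rms_averaged = np.sqrt(np.mean(FFT_abs**2, axis=1))         # RMS-averaged spectrum

real_part_avg = 2/N*np.mean(np.real(FFT_blocks), axis=1)
imag_part_avg = 2/N*np.mean(np.imag(FFT_blocks), axis=1)
vector_averaged = np.abs(real_part_avg + 1j*imag_part_avg)  # vector-averaged spectrum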
I'm trying to do this
np.random.choice(a, 1, p=p)
where len(a) != len(p)
Could you point me in the right direction for how to resize the probability distribution "p"? The idea is to keep the same distribution but over a different number of variables.
EDIT:
Basically this (https://en.wikipedia.org/wiki/Scale_parameter) but with a discrete variable.
I think interpolation is the way to go, as suggested by Ryan Sander. I am using a neural network to output the policy distribution over an environment's action space, and I'm trying to train the network on multiple environments with different action space sizes. For example, the network outputs a distribution over an action space of size 6 (actions [0,1,...,5], i.e. 6 numbers summing to 1) and I'm trying to sample this distribution over an action space of size 9, or the other way around.
The problem with interpolation is that the values I get do not sum to 1. If I apply softmax to those values, the resulting distribution does not have the same(ish) shape as the original.
A couple of suggestions:
If you want to "interpolate" your probability distribution across the new values, you can do so using np.interp, e.g. using the example below:
# Set parameters for interpolation
xp = <VALS THAT P IS CURRENTLY OVER>
x = <VALS YOU WANT P TO BE OVER>
# Now interpolate
p_interp = np.interp(x, xp, p)
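As a concrete, hypothetical illustration of the 6-to-9 action case mentioned in the question, you could interpolate over normalized positions and then renormalize so the result sums to 1 (the renormalization step is an extra choice on top of the plain np.interp call, added so the output is a valid distribution):

import numpy as np

p6 = np.array([0.05, 0.10, 0.30, 0.30, 0.15, 0.10])  # distribution over 6 actions
xp = np.linspace(0, 1, len(p6))                      # positions the 6 values sit on
x = np.linspace(0, 1, 9)                             # positions for 9 actions

p9 = np.interp(x, xp, p6)
p9 /= p9.sum()                                       # renormalize so it sums to 1

sample = np.random.choice(np.arange(9), 1, p=p9)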
If you want to simply sample from the same variables as before (i.e. use the exact same probability distribution), you can use np.pad. You will probably want to specify different values for the left and right sides, depending on where the values you sample from in p fit in with the values in a.
# Amounts to pad with zeros on each side
pad_width_left = 5   # padding on the left-hand side
pad_width_right = 3  # padding on the right-hand side
# Now pad the vector (np.pad accepts separate left/right widths as a tuple)
p_padded = np.pad(p, (pad_width_left, pad_width_right))
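A hypothetical usage of the padded distribution with np.random.choice (the sizes are made up to match the pad widths above):

import numpy as np

a = np.arange(12)                    # 12 possible values
p = np.array([0.1, 0.2, 0.3, 0.4])   # original distribution over 4 of them
p_padded = np.pad(p, (5, 3))         # zero probability for the 5 left / 3 right extras
sample = np.random.choice(a, 1, p=p_padded)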
I have a function that gives me probability distributions for each class, in terms of a matrix corresponding to mean values and another matrix corresponding to variance values. For example, if I had four classes then I would have the following outputs:
y_means = [1,2,3,4]
y_variance = [0.01,0.02,0.03,0.04]
I need to do the following calculation to the mean values to continue with the rest of my program:
y_means = np.array(y_means)
y_means = np.reshape(y_means,(y_means.size,1))
A = np.random.randn(10,y_means.size)
y_means = np.matmul(A,y_means)
Here, I have used the numpy.random.randn function to generate random samples from a standard normal distribution, and then multiplied this with the matrix of mean values to obtain a new output matrix. The dimension of the output matrix is then (10 x 1).
I need to do a similar calculation such that my output_variances will also be a (10 x 1) matrix. But it is not meaningful to multiply the variances in the same way with random samples from a standard normal distribution, because this would also result in negative values. This is undesirable because my ultimate aim is to create a normal distribution with these mean values and their corresponding variance values using:
torch.distributions.normal.Normal(loc=y_means, scale=y_variance)
So my question is whether there is any method by which I can get a variance value for each random sample generated by numpy.random.randn, because then the multiplication of such a matrix with output_variance would make more sense.
Or if there is any other strategy for this that I might be unaware of, please let me know.
The problem mentioned in the question required another matrix of the same dimension as A that corresponded to a variance measure for the random samples present in A.
Taking a row-wise or column-wise variance of the matrix denoted by A using numpy.var() didn't give a similar 10 x 4 matrix to multiply with y_variance.
I had solved the above problem by using the following approach:
First create a matrix with the same dimensions as A with zero entries, using the following line of code:
A_var = np.zeros_like(A)
Then, using torch.distributions, create normal distributions with the values in A as the mean and zeros as the variance:
dist_A = torch.distributions.normal.Normal(loc=torch.Tensor(A), scale=torch.Tensor(A_var))
https://pytorch.org/docs/stable/distributions.html lists all the operations possible on Normal distributions in PyTorch. The sample() method can generate samples of any given size from a distribution. This property was exploited to first generate a sample tensor of size 10 x 10 x 4 and then calculate the variance along axis 0.
np.var(np.array(dist_A.sample((10,))), axis=0)
This would result in a variance matrix of size 10 x 4, which can be used for calculations with y_variance.
I want to combine the phase spectrum of one image and the magnitude spectrum of a different image into one image.
I have got phase spectrum and magnitude spectrum of image A and image B.
Here is the code.
f = np.fft.fft2(grayA)
fshift1 = np.fft.fftshift(f)
phase_spectrumA = np.angle(fshift1)
magnitude_spectrumA = 20*np.log(np.abs(fshift1))
f2 = np.fft.fft2(grayB)
fshift2 = np.fft.fftshift(f2)
phase_spectrumB = np.angle(fshift2)
magnitude_spectrumB = 20*np.log(np.abs(fshift2))
I am trying to figure it out, but I still do not know how to do that.
Below is my test code.
imgCombined = abs(f) * math.exp(1j*np.angle(f2))
I wish it could come out just like that.
Here are a few things that you would need to fix for your code to work as intended:
The math.exp function only supports scalar arguments. For an element-wise exponentiation of an array you should use numpy.exp instead.
Similarly, note that the * operator performs matrix multiplication for np.matrix objects; for plain NumPy arrays it is already element-wise, but using np.multiply makes the element-wise intent explicit.
With these fixes you should get the frequency-domain combined matrix as follows:
combined = np.multiply(np.abs(f), np.exp(1j*np.angle(f2)))
To obtain the corresponding spatial-domain image, you would then need to compute the inverse transform (and take the real part, since there may be small residual imaginary parts due to numerical errors) with:
imgCombined = np.real(np.fft.ifft2(combined))
Finally the result can be shown with:
import matplotlib.pyplot as plt
plt.imshow(imgCombined, cmap='gray')
Note that imgCombined may contain values outside the [0,1] range. You would then need to decide how you want to rescale the values to fit the expected [0,1] range.
The default scaling (resulting in the image shown above) is to linearly scale the values such that the minimum value is set to 0 and the maximum value is set to 1.
Another way could be to limit the values to that range (i.e. forcing all negative values to 0 and all values greater than 1 to 1).
Finally, another approach, which seems to provide a result closer to the screenshot provided, would be to take the absolute value with imgCombined = np.abs(imgCombined).
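Putting it all together, here is a minimal self-contained sketch (the input images are random stand-ins for grayA and grayB, and the min-max rescaling is just one of the options discussed above):

import numpy as np
import matplotlib.pyplot as plt

# Stand-in grayscale images (assumed to have the same shape)
grayA = np.random.rand(128, 128)   # magnitude source
grayB = np.random.rand(128, 128)   # phase source

f = np.fft.fft2(grayA)
f2 = np.fft.fft2(grayB)

# Magnitude of A combined element-wise with phase of B
combined = np.multiply(np.abs(f), np.exp(1j*np.angle(f2)))
imgCombined = np.real(np.fft.ifft2(combined))

# Linearly rescale to [0, 1] for display (one possible choice)
imgCombined = (imgCombined - imgCombined.min()) / (imgCombined.max() - imgCombined.min())
plt.imshow(imgCombined, cmap='gray')
plt.show()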
I am trying to do dimensionality reduction using the PCA function of sklearn, specifically
from sklearn.decomposition import PCA
def mypca(X,comp):
pca = PCA(n_components=comp)
pca.fit(X)
PCA(copy=True, n_components=comp, whiten=False)
Xpca = pca.fit_transform(X)
return Xpca
for n_comp in range(10,1000,20):
Xpca = mypca(X,n_comp) # X is a 2 dimensional array
print Xpca
I am calling the mypca function from a loop with different values for comp. I am doing this in order to find the best value of comp for the problem I am trying to solve. But the mypca function always returns the same value of Xpca irrespective of the value of comp.
The value it returns is correct for the first value of comp I send from the loop, i.e. the Xpca value which it returns each time is correct for comp = 10 in my case.
What should I do in order to find the best value of comp?
You use PCA to reduce the dimension.
From your code:
for n_comp in range(10,1000,20):
Xpca = mypca(X,n_comp) # X is a 2 dimensional array
print Xpca
Your input dataset X is only a 2-dimensional array; the minimum n_comp is 10, so PCA tries to find the 10 best dimensions for you. Since 10 > 2, you will always get the same answer. :)
It looks like you're trying to pass different values for number of components, and re-fit with each. A great thing about PCA is that it's actually not necessary to do this. You can fit the full number of components (even as many components as dimensions in your dataset), then simply discard the components you don't want (i.e. those with small variance). This is equivalent to re-fitting the entire model with fewer components. Saves a lot of computation.
How to do it:
# x = input data, size(<points>, <dimensions>)
# fit the full model
max_components = x.shape[1] # as many components as input dimensions
pca = PCA(n_components=max_components)
pca.fit(x)
# transform the data (contains all components)
y_all = pca.transform(x)
# keep only the top k components (with greatest variance)
k = 2
y = y_all[:, 0:k]
In terms of how to select the number of components, it depends what you want to do. One standard way of choosing the number of components k is to look at the fraction of variance explained (R^2) by each choice of k. If your data is distributed near a low-dimensional linear subspace, then when you plot R^2 vs. k, the curve will have an 'elbow' shape. The elbow will be located at the dimensionality of the subspace. It's good practice to look at this curve because it helps understand the data. Even if there's no clean elbow, it's common to choose a threshold value for R^2, e.g. to preserve 95% of the variance.
Here's how to do it (this should be done on the model with max_components components):
# Calculate fraction of variance explained
# for each choice of number of components
r2 = pca.explained_variance_.cumsum() / x.var(0).sum()
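As a side note, scikit-learn also exposes these fractions directly through explained_variance_ratio_, which avoids computing the total variance by hand. A small sketch of picking k from a 95% threshold (the threshold itself is just an example):

import numpy as np

r2 = pca.explained_variance_ratio_.cumsum()   # cumulative fraction of variance explained
k = int(np.searchsorted(r2, 0.95) + 1)        # smallest k that preserves ~95% of the variance
y = y_all[:, 0:k]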
Another way you might want to proceed is to take the PCA-transformed data and feed it to a downstream algorithm (e.g. classifier/regression), then select your number of components based on the performance (e.g. using cross validation).
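Here is a minimal sketch of that cross-validation idea, assuming a classification task and using scikit-learn's Pipeline and GridSearchCV (X, y and the candidate component counts are placeholders):

from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Hypothetical pipeline: PCA followed by a classifier
pipe = Pipeline([('pca', PCA()), ('clf', LogisticRegression(max_iter=1000))])
param_grid = {'pca__n_components': [2, 5, 10, 20]}   # candidate values (placeholders)

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)                                     # X, y: your data and labels
print(search.best_params_)                           # best number of components found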
Side note: Maybe just a formatting issue, but your code block in mypca() should be indented, or it won't be interpreted as part of the function.