I'm wondering why the following:
sklearn.preprocessing.StandardScaler().fit_transform([[58,144000]])
gives this result:
array([[0., 0.]])
I'm doing a logistic regression where I run fit_transform() on an array of values (the actual data file) like the ones above, and that transform seems to work fine. But when I try a single pair of values as shown above ([[58,144000]]), I get zeros.
For predictions using a "new" input, I need to scale that new value the same way as the test/train data were scaled so my ML predictions will work.
Thanks for any suggestions.
If you read the docs, you may wonder why it expects a 2D array: after all, you can compute the mean and standard deviation of a 1D vector, as your question implies. The answer is that it expects data shaped as (samples, features).
So when you pass data like [[58,144000]], it is a (1, 2) array, which means 1 sample with 2 features. The scaler then fits and transforms each feature separately, and each feature here is a single number. Standardizing a single value gives zero, hence [[0., 0.]].
On the other hand, if you pass the data as [[58],[144000]], it is (2, 1), which means 2 samples and 1 feature. The scaler then standardizes that one feature across both samples and gives the result you expected: [[-1.],[1.]].
import numpy as np
from sklearn.preprocessing import StandardScaler

x = [58, 144000]
mu = np.mean(x)
sigma = np.std(x)
print([(58 - mu) / sigma, (144000 - mu) / sigma])        # [-1.0, 1.0]
print(StandardScaler().fit_transform([[58], [144000]]))  # [[-1.] [ 1.]]
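To scale a genuinely new input the same way as the train/test data (the second part of the question), fit the scaler on the training matrix once and reuse the fitted object for single samples. A minimal sketch, assuming a hypothetical training array X_train with the same two columns:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training data: same two columns as the single sample above
X_train = np.array([[25, 50000], [40, 90000], [58, 144000]])

scaler = StandardScaler()
scaler.fit(X_train)                               # learn per-column mean and std from the training data

X_train_scaled = scaler.transform(X_train)        # use this for fitting the model
x_new_scaled = scaler.transform([[58, 144000]])   # reuse the same statistics for a new input
print(x_new_scaled)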
Related
I want to fit this data:
I have the following model function.
import numpy as np

def losvd_param(v, v_rot, v_disp, h3, h4):
    y = np.asarray((np.asarray(v) - v_rot) / v_disp)  # define new variable y for compact notation
    return (np.exp(-0.5 * y**2) * (1 + h3*((2*np.sqrt(2)*y**3 - 3*np.sqrt(2)*y)/np.sqrt(6))
                                     + h4*((4*y**4 - 12*y**2 + 3)/np.sqrt(24))))
The 4 parameters refer to: x-value of maximum, width of the distribution, skewness and kurtosis.
I use curve_fit() to fit my data:
gh_moments = curve_fit(losvd_param, vel_corr_peak, broadening_func)[0]
and get the unexpected output [1. 1. 1. 1.], which is clearly not correct; it should be something more like [1318, 300, 0, 0]. Putting those values into my model function manually, I roughly get the right fit to my data. I also get the warning:
OptimizeWarning: Covariance of the parameters could not be estimated
Can anybody tell me why this could be the case?
Edit: I get the same result when I use a different model function (a simple Gaussian). With a linear model the fit is "working", so it might be something connected to the Gaussian function? (Note that my x-array covers values roughly from 500 to 2250.)
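One plausible explanation, offered here only as an assumption: curve_fit starts from p0 = [1, 1, 1, 1] by default, and with v in the 500-2250 range the factor exp(-0.5 * y**2) underflows to zero at v_rot = 1, v_disp = 1, so the residuals are flat in the parameters and the optimizer never moves (which would also explain the covariance warning). A hedged sketch of supplying an explicit starting guess via p0, reusing the rough values from the question:

from scipy.optimize import curve_fit

# Assumption: start near the expected peak and width instead of the default p0 = [1, 1, 1, 1]
p0 = [1318, 300, 0, 0]
gh_moments, gh_cov = curve_fit(losvd_param, vel_corr_peak, broadening_func, p0=p0)
print(gh_moments)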
I have read various articles about standardization and normalization, but none of them offers a concrete example of how to rescale data using their formulas.
I would like to transform data as follows:
given data = [x1, ..., xn]
rescale(data, n) should rescale it to length n while retaining the distribution, for example:
eg_1 = np.array([1]) --> rescale(eg_1, 2) --> [0.5, 0.5]
eg_2 = np.array([1,1]) --> rescale(eg_2, 4) --> [0.5, 0.5, 0.5, 0.5]
eg_3 = np.array([0,1]) --> rescale(eg_3, 4) --> [0, 0, 0.5, 0.5]
If possible, I would also like for the inverse to be true, for example
inv_eg1 = np.array([0.5, 0.5, 0.5, 0.5]) --> inv_rescale(inv_eg1, 2) --> [1, 1]
My initial attempt was simply the formula (sum of the values in the array / total length of the array) * position in the desired range = value at that position, but unfortunately it does not retain the distribution.
The purpose is that I want to apply various kernels and matrices of different shapes, but I do not want to use padding.
Please help.
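The examples above are consistent with spreading each value evenly over a block of the new length, and summing those blocks back for the inverse. A sketch under that assumption (rescale and inv_rescale are the hypothetical names from the question; this only handles lengths that divide evenly):

import numpy as np

def rescale(data, n):
    # Spread each value evenly over k = n / len(data) slots, preserving the total sum
    k = n // len(data)
    return np.repeat(np.asarray(data, dtype=float), k) / k

def inv_rescale(data, n):
    # Inverse: sum each block of k = len(data) / n consecutive values back into one entry
    k = len(data) // n
    return np.asarray(data, dtype=float).reshape(n, k).sum(axis=1)

print(rescale(np.array([0, 1]), 4))                    # [0.  0.  0.5 0.5]
print(inv_rescale(np.array([0.5, 0.5, 0.5, 0.5]), 2))  # [1. 1.]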
Say I have a data set of 100 data points, where each data point is a 4x3 matrix. My question is how I should calculate the standard deviation of this data set. I tried the following code, but I don't know if the result is correct; if it is, I want to know how it works. I know the standard deviation equation for 1D data, but I don't know how std is defined for a collection of m x n matrices, and the docstring of np.std only explains the 1D case.
import numpy as np

datalist = []
for _ in range(100):
    data = np.random.random((4, 3))
    datalist.append(data)

std = np.std(np.asarray(datalist))
print(std)
It seems like you have some unnecessary steps. To begin with, you can generate 100 matrices of 4x3 like this:
x = np.random.rand(100, 4, 3)
Then just call np.std on it:
np.std(x)
0.2827262559096299
That's if you want the standard deviation of all values. If you want it per matrix cell, specify the axis argument:
np.std(x, axis=0)
array([[0.27863211, 0.2670126 , 0.28752064],
[0.28540484, 0.25365294, 0.28905531],
[0.28848584, 0.27695767, 0.26886147],
[0.27138472, 0.3135065 , 0.29361115]])
axis=0 means that it's going to collapse the axis 0 (the one with size 100), which will return a matrix of 4x3.
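As for how np.std works on such an array when no axis is given: it simply flattens the array and applies the usual 1D formula to all 100*4*3 values. A small verification sketch:

import numpy as np

x = np.random.rand(100, 4, 3)

# np.std over the whole array is the plain 1D formula applied to all 1200 values
manual = np.sqrt(np.mean((x - x.mean())**2))
print(np.isclose(np.std(x), manual))  # True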
I have a 4D array of shape (1948, 60, 2, 3) which records the differences in end-effector positions (x, y, z) over 60 time steps.
1948 is the number of samples, 60 is the number of time steps, 2 is for left_arm and right_arm, and 3 denotes the x, y, z positions.
A sample of how it looks is below:
array([[[ 3.93048840e-05, 7.70215296e-04, 1.13865805e-03],
[ 1.11679799e-04, -7.04810066e-04, 1.83552688e-04]],
[[ -6.26468389e-04, 6.86668923e-04, 1.57112482e-04],
[ 3.68164582e-04, 7.98948528e-04, 4.50642200e-04]],
[[ 2.51472961e-04, -2.48105983e-04, 7.52486843e-04],
[ 8.99905240e-05, 1.70473461e-04, -3.09927572e-04]],
[[ -7.52414330e-04, 5.46782063e-04, -3.76679264e-04],
[ -3.12531026e-04, -3.36585211e-04, 5.79075595e-05]],
[[ 7.69968002e-04, -1.95524291e-03, -8.65666619e-04],
[ 2.37583215e-04, 4.59415986e-04, 6.07292643e-04]],
[[ 1.41795261e-03, -1.62364401e-03, -8.99673829e-04],
I want to normalize this data, as I need to train a neural network on it. How do I go about normalizing a 4D array? I have an intuition for images. Can I normalize each example separately, or should the normalization be computed over the entire 4D array?
The trick is to use keepdims=True, which lets broadcasting happen without bothering us with the housekeeping work of re-expanding dimensions. Hence, a solution that handles ndarrays of any dimensionality would be:
# Get min, max values among all elements for each entry of the last axis (the x, y, z coordinates here)
x_min = np.min(x, axis=tuple(range(x.ndim - 1)), keepdims=True)
x_max = np.max(x, axis=tuple(range(x.ndim - 1)), keepdims=True)

# Normalize with those min, max values, leveraging broadcasting
out = (x - x_min) / (x_max - x_min)
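A usage sketch on data shaped like the array in the question (the random values are just a stand-in), showing that the statistics are computed per coordinate:

import numpy as np

x = np.random.randn(1948, 60, 2, 3)  # stand-in for the real end-effector differences

x_min = np.min(x, axis=tuple(range(x.ndim - 1)), keepdims=True)  # shape (1, 1, 1, 3)
x_max = np.max(x, axis=tuple(range(x.ndim - 1)), keepdims=True)
out = (x - x_min) / (x_max - x_min)

print(x_min.shape, out.shape)  # (1, 1, 1, 3) (1948, 60, 2, 3)
print(out.min(), out.max())    # 0.0 1.0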
First, yes, you can normalize this data; there is no problem with that.
Second, there is nothing special about 4D arrays. Normalization should simply be performed separately for each feature. So, depending on the type of normalization, you should calculate the max and min (or mean and std) values for each feature across all samples in the training set.
In your case you should decide which parts of the data come from the same distribution. So decide for each dimension:
1) The first dimension is just the number of samples, so it does not introduce a new distribution. Treat it as the number of data entries.
2) Time step. Here you should decide: do the x, y, z values have a distinct distribution at each of the 60 time steps? If not, treat this dimension the same way as the previous one. If yes, calculate max, min (or mean, std) for each feature separately at each time step. (For intuition: can the arm at step 0 actually take values similar to those at steps 30 or 60? If yes, they all just correspond to data entries; if no, you get 60 times as many features.)
3) Do the left arm and right arm have different x, y, z values? If yes, again calculate their statistics separately. (I guess they do, because the left and right arm statistically tend to occupy different points in space.)
4) The x, y, z values definitely have independent distributions, so calculate their statistics separately.
Once you decide, you will have between 3 and 360 features (depending on your choices), so calculate the necessary values for them (max, min or mean, std) and perform the standard routine, as sketched below.
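For illustration, here is a sketch of one of those choices (treating each arm/coordinate pair as a feature, i.e. 6 features shared across all time steps); the axes chosen are an assumption, not the only valid split:

import numpy as np

x = np.random.randn(1948, 60, 2, 3)  # stand-in for the real data

# Statistics per (arm, coordinate) feature, computed over all samples and time steps
mean = x.mean(axis=(0, 1), keepdims=True)  # shape (1, 1, 2, 3)
std = x.std(axis=(0, 1), keepdims=True)

x_norm = (x - mean) / std
print(x_norm.shape)  # (1948, 60, 2, 3)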
Hope it helps!
I am trying to do dimensionality reduction using PCA function of sklearn, specifically
from sklearn.decomposition import PCA
def mypca(X,comp):
pca = PCA(n_components=comp)
pca.fit(X)
PCA(copy=True, n_components=comp, whiten=False)
Xpca = pca.fit_transform(X)
return Xpca
for n_comp in range(10,1000,20):
Xpca = mypca(X,n_comp) # X is a 2 dimensional array
print Xpca
I am calling the mypca function from a loop with different values of comp, in order to find the best value of comp for the problem I am trying to solve. But mypca always returns the same Xpca regardless of the value of comp.
The value it returns is correct for the first value of comp I pass from the loop, i.e. the Xpca it returns each time is the one that is correct for comp = 10 in my case.
What should I do in order to find the best value of comp?
You use PCA to reduce the dimension.
From your code:
for n_comp in range(10,1000,20):
Xpca = mypca(X,n_comp) # X is a 2 dimensional array
print Xpca
Your input dataset X is only a 2 dimensional array, and the minimum n_comp is 10, so PCA tries to find the 10 best dimensions for you. Since 10 > 2, you will always get the same answer. :)
It looks like you're trying to pass different values for number of components, and re-fit with each. A great thing about PCA is that it's actually not necessary to do this. You can fit the full number of components (even as many components as dimensions in your dataset), then simply discard the components you don't want (i.e. those with small variance). This is equivalent to re-fitting the entire model with fewer components. Saves a lot of computation.
How to do it:
from sklearn.decomposition import PCA

# x = input data, size (<points>, <dimensions>)
# fit the full model
max_components = x.shape[1]  # as many components as input dimensions
pca = PCA(n_components=max_components)
pca.fit(x)

# transform the data (contains all components)
y_all = pca.transform(x)

# keep only the top k components (with greatest variance)
k = 2
y = y_all[:, 0:k]
In terms of how to select the number of components, it depends what you want to do. One standard way of choosing the number of components k is to look at the fraction of variance explained (R^2) by each choice of k. If your data is distributed near a low-dimensional linear subspace, then when you plot R^2 vs. k, the curve will have an 'elbow' shape. The elbow will be located at the dimensionality of the subspace. It's good practice to look at this curve because it helps understand the data. Even if there's no clean elbow, it's common to choose a threshold value for R^2, e.g. to preserve 95% of the variance.
Here's how to do it (this should be done on the model with max_components components):
# Calculate fraction of variance explained
# for each choice of number of components
r2 = pca.explained_variance_.cumsum() / x.var(0).sum()
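As a usage example continuing the snippets above, one way to pick the smallest k that preserves, say, 95% of the variance (the 0.95 threshold is just an illustrative value):

import numpy as np

# Smallest number of components whose cumulative explained variance reaches 95%
k = int(np.searchsorted(r2, 0.95)) + 1
y = y_all[:, 0:k]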
Another way you might want to proceed is to take the PCA-transformed data and feed it to a downstream algorithm (e.g. classifier/regression), then select your number of components based on the performance (e.g. using cross validation).
Side note: Maybe just a formatting issue, but your code block in mypca() should be indented, or it won't be interpreted as part of the function.