Normalize Dataset in Python


So I have a dataset that I want to normalize. The dataset consists of a bunch of numbers, so I'm just going to post one line of it:
1,1,22,22,22,19,18,14,49.895756,17.775994,5.27092,0.771761,0.018632,0.006864,0.003923,0.003923,0.486903,0.100025,1,0
Does anyone know how to do it? I'm not allowed to use Scikit-Learn.

Normalization takes all your values and transforms them so that they lie in between 0 and 1.
To perform this:
First find the minimum value (call it a) and the maximum value (call it b)
Take every value in your data set (call it d) and find (d-a)/(b-a).
(d-a) makes sure that the range goes from [a,b] to [0,b-a] and then dividing by (b-a) makes the range [0,1].
In Python you would first convert your dataset to a numpy array (a much more efficient data structure)
import numpy as np
d = np.array(your_dataset)
Then find the max and min
a = d.min()
b = d.max()
Finally you perform the operation
d = (d-a)/(b-a)
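For a dataset with several rows like the one in the question, each column is usually scaled on its own, so that a column with large values does not squash the others. Here is a minimal sketch of that variant, assuming the data is already loaded as rows of numbers. The name min_max_normalize and the sample rows are made up for this example, and mapping constant columns to 0 is my own choice to avoid dividing by zero.

```python
import numpy as np

def min_max_normalize(data):
    """Scale each column of a 2D array to the range [0, 1]."""
    d = np.asarray(data, dtype=float)
    a = d.min(axis=0)                       # per-column minimum
    b = d.max(axis=0)                       # per-column maximum
    span = np.where(b - a == 0, 1.0, b - a) # constant column: avoid dividing by zero
    return (d - a) / span

# Illustrative rows only, not the asker's real data
rows = [[1, 22, 49.895756],
        [3, 19, 17.775994],
        [2, 14, 5.27092]]
print(min_max_normalize(rows))
```

Dropping axis=0 gives back the single global minimum and maximum used in the answer above.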

To normalize a dataset in the standardization (z-score) sense, calculate the average df['column_name'].mean() and the standard deviation df['column_name'].std() of the column, then subtract the average from every value and divide the result by the standard deviation. The result has mean 0 and standard deviation 1 rather than lying between 0 and 1.
So the result would look something like this:
avg = df['column_name'].mean()
std = df['column_name'].std()
normalized = (df['column_name'] - avg) / std
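The same calculation works without pandas. This is a minimal sketch with NumPy only; standardize is a hypothetical helper name and the values are invented. One detail worth knowing: pandas' .std() divides by n-1 by default, while np.std divides by n unless ddof=1 is passed, so ddof=1 is used here to match the pandas lines above.

```python
import numpy as np

def standardize(data, ddof=1):
    """Subtract the mean and divide by the standard deviation, column-wise."""
    d = np.asarray(data, dtype=float)
    avg = d.mean(axis=0)
    std = d.std(axis=0, ddof=ddof)   # ddof=1 matches the pandas default
    return (d - avg) / std

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up numbers
z = standardize(values)
print(z.mean(), z.std(ddof=1))  # approximately 0 and 1
```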

Related

Vector and RMS averaging in FFT

I have a data array on which I have performed an FFT. This is the code that I have applied.
import numpy as np
# "data" is a column vector on which FFT needs to be performed
# N = No. of points in "data"
# dt = time interval between two corresponding data points
FFT_data = np.fft.fft(data) # Complex values
FFT_data_real = 2/N*abs(FFT_data) # Absolute values
However, I went through following link: https://www.dsprelated.com/showarticle/1159.php
Here it says, to enhance the SNR we can apply "RMS-averaged FFT" and "Vector Averaged FFT".
Can somebody please let me know how to go about implementing these two methods in Python, or is there any documentation/links to which we can refer?

As your reference indicates:
If you take the square root of the average of the squares of your sample spectra, you are doing RMS Averaging. Another alternative is Vector Averaging in which you average the real and complex components separately.
Obviously to be able to perform either averaging you'd need to have more than a single data set to average. In your example code, you have a single column vector data. Let's assume you have multiple such column vectors arranged as a 2D NxM matrix, where N is the number of points per dataset and M is the number of datasets. Since the datasets are stored in columns, when computing the FFT you will need to specify the parameter axis=0 to compute the FFT along columns.
RMS-averaged FFT
As the name suggests, for this method you need to take the square-root of the mean of the squared amplitudes. Since the different sets are stored in columns, you'd need to do the average along the axis 1 (the other axis than the one used for the FFT).
FFT_data = np.fft.fft(data, axis=0) # Complex values
FFT_data_real = 2/N*abs(FFT_data) # Absolute values
rms_averaged = np.sqrt(np.mean(FFT_data_real**2, axis=1))
Vector Averaged FFT
In this case you need to obtain the real and imaginary components of the FFT data, then compute the average on each separately:
FFT_data = np.fft.fft(data, axis=0) # Complex values
real_part_avg = 2/N*np.mean(np.real(FFT_data),axis=1)
imag_part_avg = 2/N*np.mean(np.imag(FFT_data),axis=1)
vector_averaged = np.abs(real_part_avg+1j*imag_part_avg)
Note that I've kept the 2/N scaling you had for the absolute values.
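To see the difference between the two methods, here is a small experiment on synthetic data. Everything in it is an assumption for illustration: the tone sits exactly on bin 16, it has the same phase in every column, and the noise is Gaussian. Under those conditions vector averaging should lower the noise floor, while RMS averaging only smooths it. Taking np.mean of the complex FFT values is equivalent to averaging the real and imaginary parts separately.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 256, 200                      # points per dataset, number of datasets
n = np.arange(N)[:, None]            # sample indices as a column, shape (N, 1)
tone = np.sin(2 * np.pi * 16 * n / N)            # exactly 16 cycles per dataset
data = tone + rng.normal(0.0, 1.0, (N, M))       # same tone, fresh noise per column

FFT_data = np.fft.fft(data, axis=0)              # FFT along each column
rms_averaged = np.sqrt(np.mean((2 / N * np.abs(FFT_data)) ** 2, axis=1))
vector_averaged = np.abs(2 / N * np.mean(FFT_data, axis=1))

noise_bins = np.r_[1:16, 17:N // 2]  # positive-frequency bins without the tone
print("tone bin:   ", rms_averaged[16], vector_averaged[16])
print("noise floor:", rms_averaged[noise_bins].mean(),
      vector_averaged[noise_bins].mean())
```

If the phase of the tone differed randomly from column to column, vector averaging would attenuate the tone as well, so it only helps when the datasets are synchronized.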
But what can I do if I really only have one dataset?
If that dataset happens to be stationary and sufficiently large then you could break down your dataset into smaller blocks. This can be done by reshaping your vector into an NxM matrix with the following:
data = data.reshape(M,N).T # M consecutive blocks of N points, one block per column
...
Then you could perform the averaging with either method.
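One thing to watch when splitting a single record: NumPy fills arrays row by row, so to end up with consecutive samples in each column the vector has to be reshaped to (M, N) first and then transposed. A minimal sketch, where split_into_blocks is a hypothetical helper and any leftover samples that do not fill a whole block are simply dropped:

```python
import numpy as np

def split_into_blocks(data, N):
    """Cut a 1D record into consecutive blocks of N samples, one block per column."""
    data = np.asarray(data).ravel()
    M = data.size // N                      # number of complete blocks
    return data[:M * N].reshape(M, N).T     # shape (N, M)

record = np.arange(10)                      # stand-in for a long measurement
blocks = split_into_blocks(record, N=4)
print(blocks)                               # columns hold samples 0-3 and 4-7
```

For vector averaging the blocks also need to start at the same phase of the signal, which in practice means the block length should cover a whole number of periods.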

Getting variance values for random samples generated from a standard normal distribution using numpy

I have a function that gives me probability distributions for each class, in terms of a matrix corresponding to mean values and another matrix corresponding to variance values. For example, if I had four classes then I would have the following outputs:
y_means = [1,2,3,4]
y_variance = [0.01,0.02,0.03,0.04]
I need to do the following calculation to the mean values to continue with the rest of my program:
y_means = np.array(y_means)
y_means = np.reshape(y_means,(y_means.size,1))
A = np.random.randn(10,y_means.size)
y_means = np.matmul(A,y_means)
Here, I have used the numpy.random.randn function to generate random samples from a standard normal distribution, and then multiply this with the matrix with the mean value to obtain a new output matrix. The dimension of the output matrix would then be of the size (10 x 1).
I need to do a similar calculation such that my output_variances will also be a (10 x 1) matrix. But it is not meaningful to multiply the variances in the same way with random samples from a standard normal distribution, because this would result in negative values as well. This is undesirable because my ultimate aim would be to create a normal distribution with these mean values and their corresponding variance values using:
torch.distributions.normal.Normal(loc=y_means, scale=y_variance)
So my question is whether there is any method by which I can get a variance value for each random sample generated by numpy.random.randn. The multiplication of such a matrix with output_variance would then make more sense.
Or if there is any other strategy for this that I might be unaware of, please let me know.

The problem mentioned in the question required another matrix of the same dimension as A that corresponded to a variance measure for the random samples present in A.
Taking a row-wise or column-wise variance of the matrix denoted by A using numpy.var() didn't give a similar 10 x 4 matrix to multiply with y_variance.
I had solved the above problem by using the following approach:
First create a matrix with the same dimensions as A with zero entries, using the following line of code:
A_var = np.zeros_like(A)
then, using torch.distributions, create normal distributions with the values in A as the mean and zeroes as variance:
dist_A = torch.distributions.normal.Normal(loc=torch.Tensor(A), scale=torch.Tensor(A_var))
https://pytorch.org/docs/stable/distributions.html lists all the operations possible on Normal distributions in PyTorch. The sample() method can generate samples from a given distribution for any size. This property was exploited to first generate a sample matrix of size 10 x 10 x 4 and then calculate the variance along axis 0.
np.var(np.array(dist_A.sample((10,))),axis=0)
This would result in a variance matrix of size 10 x 4, which can be used for calculations with y_variance.
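Another strategy that may be worth considering, and which is not part of the answer above: if the class outputs are assumed to be independent, the variance of a weighted sum is the sum of the squared weights times the individual variances. The variances can then be pushed through the same matrix A by squaring its entries, which can never produce negative values. This sketch uses NumPy only. Note that the scale argument of torch.distributions.normal.Normal is a standard deviation, so the square root would be the value to pass there.

```python
import numpy as np

rng = np.random.default_rng(0)
y_means = np.array([1.0, 2.0, 3.0, 4.0]).reshape(-1, 1)
y_variance = np.array([0.01, 0.02, 0.03, 0.04]).reshape(-1, 1)

A = rng.standard_normal((10, y_means.size))   # same role as np.random.randn(10, 4)
out_means = A @ y_means                       # shape (10, 1)
out_variance = (A ** 2) @ y_variance          # shape (10, 1), never negative
out_std = np.sqrt(out_variance)               # standard deviation per output row
print(out_means.ravel())
print(out_variance.ravel())
```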

How to interpret this fft graph

I want to apply Fourier transformation using fft function to my time series data to find "patterns" by extracting the dominant frequency components in the observed data, ie. the lowest 5 dominant frequencies to predict the y value (bacteria count) at the end of each time series.
I would like to preserve the smallest 5 coefficients as features, and eliminate the rest.
My code is as below:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('/content/drive/My Drive/df.csv', sep=',')
X = df.iloc[0:2,0:10000]
dft_X = np.fft.fft(X)
print(dft_X)
print(len(dft_X))
plt.plot(dft_X)
plt.grid(True)
plt.show()
# What is the graph about(freq/amplitude)? How much data did it use?
for i in dft_X:
    m = i[np.argpartition(i,5)[:5]]
    n = i[np.argpartition(i,range(5))[:5]]
    print(m,'\n',n)
Here is the output:
But I am not sure how to interpret this graph. To be precise,
1) Does the graph show the transformed values of the input data? I only used 2 rows of data(each row is a time series), thus data is 2x10000, why are there so many lines in the graph?
2) To obtain frequency value, should I use np.fft.fftfreq(n, d=timestep)?
Parameters:
n : int
Window length.
d : scalar, optional
Sample spacing (inverse of the sampling rate). Defaults to 1.
Returns:
f : ndarray
Array of length n containing the sample frequencies.
How to determine n(window length) and sample spacing?
3) Why are transformed values all complex numbers?
Thanks

I'm gonna answer in reverse order of your questions
3) Why are transformed values all complex numbers?
The output of a Fourier Transform is always complex numbers. To get around this fact, you can either apply the absolute value on the output of the transform, or only plot the real part using:
plt.plot(dft_X.real)
2) To obtain frequency value, should I use np.fft.fftfreq(n, d=timestep)?
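The FFT output itself only contains the coefficients, ordered by frequency bin; it does not contain the frequencies. To label the bins with actual frequency values, np.fft.fftfreq(n, d=timestep) is the right tool: n is the number of samples in each time series (10000 here) and d is the time between two consecutive samples.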
1) Does the graph show the transformed values of the input data? I only used 2 rows of data(each row is a time series), thus data is 2x10000, why are there so many lines in the graph?
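Your graph has so many lines because plt.plot draws one line for each column of a 2-D array, and your transformed data has 10000 columns. np.fft.fft already works along the last axis, so each row is transformed separately; transpose the result before plotting, e.g. plt.plot(np.abs(dft_X).T), and you'll get one line per time series, which looks like an actual frequency-domain plot.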
Follow up
Would using absolute value or real part of the output as features for a later model have different effect than using the original output?
Absolute values are easier to work with usually.
Using real part
Using absolute value
Here's the Octave code that generated this:
Fs = 4000; % Sampling rate of signal
T = 1/Fs; % Period
L = 4000; % Length of signal
t = (0:L-1)*T; % Time axis
freq = 1000; % Frequency of our sinusoid
sig = sin(freq*2*pi*t); % Fill Time-Domain with 1000 Hz sinusoid
f_sig = fft(sig); % Apply FFT
f = Fs*(0:(L/2))/L; % Frequency axis
figure
plot(f,abs(f_sig/L)(1:end/2+1)); % peak at 1kHz
figure
plot(f,real(f_sig/L)(1:end/2+1)); % main peak at 1kHz
In my example, you can see the absolute value returned no noise at frequencies other than the sinusoid of frequency 1kHz I generated while the real part had a bigger peak at 1kHz but also had much more noise.
As for effects, I don't know what you mean by that.
is it expected that "frequency values" always be complex numbers
Always? No. A Fourier series represents a periodic function as a sum of sines and cosines, and the coefficients say how much of each frequency is present. Sines and cosines can be written in complex form through Euler's formula, which is the most convenient way to store Fourier coefficients. The real and imaginary parts together encode the magnitude and the phase of each frequency component (i.e. if I have 2 sine functions of the same frequency, they can have different complex forms depending on the time shift). Most libraries that provide an FFT function will therefore, by default, store FFT coefficients as complex numbers, to facilitate phase and magnitude calculations.
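A quick way to see the role of phase is to transform a sine and a cosine of the same frequency. This sketch reuses the numbers from the Octave example (4000 samples at 4000 Hz, a 1000 Hz tone) but is written in Python with NumPy, and the variable names are my own. Both signals should show the same magnitude at the 1 kHz bin, while the real part and the phase differ.

```python
import numpy as np

Fs = 4000                 # sampling rate in Hz
L = 4000                  # number of samples
t = np.arange(L) / Fs     # time axis
k = 1000                  # with Fs == L, bin k corresponds to k Hz

sine = np.fft.fft(np.sin(2 * np.pi * 1000 * t)) / L
cosine = np.fft.fft(np.cos(2 * np.pi * 1000 * t)) / L

for name, coeff in (("sine", sine[k]), ("cosine", cosine[k])):
    print(name, "real:", coeff.real, "abs:", abs(coeff), "phase:", np.angle(coeff))
```

This is why using only the real part as a feature can hide a component entirely, whereas the absolute value does not depend on where in its cycle the signal happened to start.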
Is it convention that FFT use each column of dataset when plotting a line
I think it comes from how matplotlib's plot handles 2-D input, not from np.fft.
Could you please show me how to apply FFT on each row separately
There are many ways to go around this and I don't want to force you down one path, so I will propose the general solution to iterate over each row of your dataframe and apply the FFT on each specific row. Otherwise, in your case, I believe transposing your output could also work.
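A minimal sketch of the row-by-row approach, using synthetic data in place of df.csv. The sampling interval timestep is an assumed value, the two test signals are invented, and np.fft.rfft is used instead of np.fft.fft because real input makes the negative-frequency half redundant; that substitution is mine, not the asker's.

```python
import numpy as np

timestep = 0.01                       # assumed seconds between samples
n_samples = 1000
t = np.arange(n_samples) * timestep

# Two made-up time series, one per row
X = np.vstack([
    np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 12 * t),
    np.sin(2 * np.pi * 20 * t),
])

freqs = np.fft.rfftfreq(n_samples, d=timestep)         # frequency of each bin in Hz
spectrum = np.abs(np.fft.rfft(X, axis=1)) / n_samples  # magnitude, one row per series

# Indices of the 5 largest magnitudes in each row, strongest first
top5 = np.argsort(spectrum, axis=1)[:, -5:][:, ::-1]
for row, idx in enumerate(top5):
    print(row, freqs[idx], spectrum[row, idx])

# To plot one line per series, pass the transpose: plt.plot(freqs, spectrum.T)
```

The magnitudes in spectrum[row, idx] (optionally together with freqs[idx]) could then serve as the five features per series.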

Get median value in each bin in a 2D grid

I have a 2-D array of coordinates, and each coordinate corresponds to a value z (like z=f(x,y)). Now I want to divide this whole 2-D coordinate set into, for example, 100 even bins, and calculate the median value of z in each bin. Then I want to use the scipy.interpolate.griddata function to create an interpolated z surface. How can I achieve this in Python? I was thinking of using np.histogram2d, but I think there is no median function in it, and I am having a hard time understanding how scipy.stats.binned_statistic works. Can someone help me please? Thanks.

With numpy.histogram2d you can both count the number of data points in each bin and sum their values, which lets you compute the average (though not the median) per bin.
I would try something like this:
import numpy as np
coo=np.array([np.arange(1000),np.arange(1000)]).T #your array coordinates
def func(x, y): return x*(1-x)*np.sin(np.pi*x) / (1.5+np.sin(2*np.pi*y**2)**2)
z = func(coo[:,0], coo[:,1])
(n,ex,ey)=np.histogram2d(coo[:,0], coo[:,1],bins=100) # here we get counting
(tot,ex,ey)=np.histogram2d(coo[:,0], coo[:,1],bins=100,weights=z) # here we get total over z
average=tot/n
average=np.nan_to_num(average) #cure 0/0
print(average)

You'll need one or more functions, depending on how you want to structure things:
A function to create the bins: it should take in your data, determine how big each bin is, and return an array or a list of arrays.
Happy to help with this but would need more information about the data.
Get the median of the bins:
NumPy has a median function
http://docs.scipy.org/doc/numpy-1.10.1/reference/generated/numpy.median.html
Essentially, the median of an array called "bin" would be:
numpy.median(bin)
Note: numpy.median also accepts a 2-D array together with an axis argument, so if all bins hold the same number of values you could get the medians at once with numpy.median(bins, axis=1), which returns an array with the median of each bin.
Updated
Not 100% on your example code, so here goes:
import numpy as np
# added some parentheses as I wasn't sure of the math; also removed the semicolons
def bincalc(x, y):
    return x*(1-x)*(np.sin(np.pi*x))/(1.5+np.sin(2*(np.pi*y)**2)**2)
coo = np.random.rand(1000,2) # 1000 random (x, y) points
z = bincalc(coo[:,0], coo[:,1]) # evaluate z at every point
z_med = np.median(z) # median over all points (no binning yet)
print(z_med)
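Neither snippet above actually computes a median per bin, so here is a minimal sketch of that step with NumPy only. The helper name binned_median_2d is made up, the bins are equal-width between the minimum and maximum of each coordinate, and empty bins are left as NaN; these are all my assumptions. scipy.stats.binned_statistic_2d with statistic='median' should give the same kind of result if SciPy is acceptable.

```python
import numpy as np

def binned_median_2d(x, y, z, bins=10):
    """Median of z in each cell of a bins x bins grid over (x, y)."""
    x, y, z = (np.asarray(v, dtype=float) for v in (x, y, z))
    x_edges = np.linspace(x.min(), x.max(), bins + 1)
    y_edges = np.linspace(y.min(), y.max(), bins + 1)
    # Digitizing against the interior edges yields indices 0 .. bins-1
    ix = np.digitize(x, x_edges[1:-1])
    iy = np.digitize(y, y_edges[1:-1])
    med = np.full((bins, bins), np.nan)     # NaN marks empty cells
    for i in range(bins):
        for j in range(bins):
            mask = (ix == i) & (iy == j)
            if mask.any():
                med[i, j] = np.median(z[mask])
    return med, x_edges, y_edges

rng = np.random.default_rng(0)
coo = rng.random((1000, 2))                 # random (x, y) points for illustration
z = coo[:, 0] + coo[:, 1]                   # stand-in for z = f(x, y)
med, x_edges, y_edges = binned_median_2d(coo[:, 0], coo[:, 1], z, bins=10)
print(med.shape)                            # 10 x 10 grid, i.e. 100 bins
```

The bin centres, 0.5 * (x_edges[:-1] + x_edges[1:]) and likewise for y, together with the non-NaN medians can then be handed to scipy.interpolate.griddata as the points and values.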

Python - how to normalize time-series data

I have a dataset of time-series examples. I want to calculate the similarity between various time-series examples, however I do not want to take into account differences due to scaling (i.e. I want to look at similarities in the shape of the time-series, not their absolute value). So, to this end, I need a way of normalizing the data. That is, making all of the time-series examples fall between a certain region e.g [0,100]. Can anyone tell me how this can be done in python

The solutions given are good for series that are neither increasing nor decreasing (stationary). For financial time series (or any other series with a bias) the formula given is not right. The series should first be detrended, or the scaling should be based on the latest 100-200 samples.
And if the time series doesn't come from a normal distribution (as is the case in finance), it is advisable to apply a nonlinear function (a standard CDF function, for example) to compress the outliers.
Aronson and Masters' book (Statistically Sound Machine Learning for Algorithmic Trading) uses the following formula (on 200-day chunks):
V = 100 * N(0.5 * (X - F50) / (F75 - F25)) - 50
Where:
X : data point
F50 : median (50th percentile) of the latest 200 points
F75 : 75th percentile of the latest 200 points
F25 : 25th percentile of the latest 200 points
N : normal CDF

Assuming that your timeseries is an array, try something like this:
(timeseries-timeseries.min())/(timeseries.max()-timeseries.min())
This will confine your values between 0 and 1
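Since the question asks for a region such as [0,100], the same formula can be stretched to any target range. A minimal sketch, where scale_to_range is a hypothetical helper; it assumes the series is not constant, otherwise the division by max minus min fails.

```python
import numpy as np

def scale_to_range(timeseries, low=0.0, high=100.0):
    """Min-max scale a 1D series so its values span [low, high]."""
    ts = np.asarray(timeseries, dtype=float)
    unit = (ts - ts.min()) / (ts.max() - ts.min())   # values in [0, 1]
    return low + unit * (high - low)

# Two series with the same shape but different scale
a = scale_to_range([1, 2, 3, 5])
b = scale_to_range([10, 20, 30, 50])
print(a)
print(b)
```

Applying it to each series separately makes series that differ only by a positive scale factor and an offset come out identical, which is the property the question is after.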

Following my previous comment, here is a (not optimized) Python function that does scaling and/or normalization.
(It needs a pandas DataFrame as input and doesn't check that, so it raises errors if supplied with another object type. If you need to use a list or numpy.array you need to modify it, or you could convert those objects to pandas.DataFrame() first.)
This function is slow, so it's advisable to run it just once and store the results.
from scipy.stats import norm
import pandas as pd
def get_NormArray(df, n, mode = 'total', linear = False):
    '''
    It computes the normalized value on the stats of n values ( Modes: total or scale )
    using the formulas from the book "Statistically sound machine learning..."
    (Aronson and Masters) but the decision to apply a non linear scaling is left to the user.
    It is modified to fit the data from -1 to 1 instead of -100 to 100
    df is an input DataFrame. it returns also a DataFrame, but it could return a list.
    n define the number of data points to get the mean and the quartiles for the normalization
    modes: scale: scale, without centering. total: center and scale.
    '''
    temp =[]
    for i in range(len(df))[::-1]:
        if i >= n: # there will be a traveling norm until we reach the initial n values.
            # those values will be normalized using the last computed values of F50,F75 and F25
            F50 = df[i-n:i].quantile(0.5)
            F75 = df[i-n:i].quantile(0.75)
            F25 = df[i-n:i].quantile(0.25)
        if linear == True and mode == 'total':
            v = 0.5 * ((df.iloc[i]-F50)/(F75-F25))-0.5
        elif linear == True and mode == 'scale':
            v = 0.25 * df.iloc[i]/(F75-F25) -0.5
        elif linear == False and mode == 'scale':
            v = 0.5* norm.cdf(0.25*df.iloc[i]/(F75-F25))-0.5
        else: # even if strange values are given, it will perform full normalization with compression as default
            v = norm.cdf(0.5*(df.iloc[i]-F50)/(F75-F25))-0.5
        temp.append(v[0])
    return pd.DataFrame(temp[::-1])
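Because the function above loops over every row, a vectorized variant may be worth trying on long series. This is only a sketch under several assumptions: it works on a 1D NumPy array instead of a DataFrame, it implements just the centred, compressed form V = 100 * N(0.5 * (X - F50) / (F75 - F25)) - 50, it leaves the first n points as NaN instead of reusing the earliest statistics, and it needs NumPy 1.20 or newer for sliding_window_view. The names rolling_robust_normalize and normal_cdf are made up.

```python
import math
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Element-wise error function; otypes keeps it usable on empty input
_erf = np.vectorize(math.erf, otypes=[float])

def normal_cdf(x):
    """Standard normal CDF written with the error function."""
    return 0.5 * (1.0 + _erf(np.asarray(x, dtype=float) / math.sqrt(2.0)))

def rolling_robust_normalize(x, n=200):
    """Normalize each point using the quartiles of the n points before it."""
    x = np.asarray(x, dtype=float)
    # Window k covers x[k:k+n] and is used to normalize x[k+n]
    windows = sliding_window_view(x, n)[:-1]
    f25, f50, f75 = np.percentile(windows, [25, 50, 75], axis=1)
    out = np.full(x.shape, np.nan)          # first n points have no full window
    out[n:] = 100.0 * normal_cdf(0.5 * (x[n:] - f50) / (f75 - f25)) - 50.0
    return out

series = np.arange(10.0)                    # toy upward trend
print(rolling_robust_normalize(series, n=4))
```

A window whose 25th and 75th percentiles coincide would divide by zero, so flat stretches of data need separate handling.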

I'm not going to give the Python code, but the definition of normalizing, is that for every value (datapoint) you calculate "(value-mean)/stdev". Your values will not fall between 0 and 1 (or 0 and 100) but I don't think that's what you want. You want to compare the variation. Which is what you are left with if you do this.

from sklearn import preprocessing
normalized_data = preprocessing.minmax_scale(data)
You can take a look at normalize-standardize-time-series-data-python and sklearn.preprocessing.minmax_scale.
