implementing R scale function in pandas in Python?

What is the efficient equivalent of R's scale function in pandas? E.g.
newdf <- scale(df)
written in pandas? Is there an elegant way using transform?

Scaling is very common in machine learning tasks, so it is implemented in scikit-learn's preprocessing module. You can pass a pandas DataFrame to its scale function.
The only "problem" is that the returned object is no longer a DataFrame but a numpy array, which is usually not a real issue if you want to pass it to a machine learning model anyway (e.g. SVM or logistic regression). If you want to keep the DataFrame, you need a small workaround:
from sklearn.preprocessing import scale
from pandas import DataFrame
newdf = DataFrame(scale(df), index=df.index, columns=df.columns)
See also here.
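One detail worth checking before relying on either route: R's scale divides by the sample standard deviation (n - 1 in the denominator), which is also the default of pandas' std, while scikit-learn's scale divides by the population standard deviation (n in the denominator). The sketch below is pandas-only and uses a made-up DataFrame; the names scaled_r_style and scaled_sklearn_style are just for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical example data
df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 20.0, 40.0, 80.0]})

# pandas' std() defaults to ddof=1 (divide by n - 1), the same convention as R's scale()
scaled_r_style = (df - df.mean()) / df.std()

# Dividing by the population standard deviation (ddof=0) reproduces the
# convention used by scikit-learn's scale()
scaled_sklearn_style = (df - df.mean()) / df.std(ddof=0)

# The two results differ by the constant factor sqrt(n / (n - 1))
n = len(df)
ratio = scaled_sklearn_style / scaled_r_style
```

For large n the factor is negligible, but for small frames the two results differ visibly.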

I don't know R, but from reading the documentation it looks like the following would do the trick (albeit in a slightly less general way)
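import numpy as np

def scale(y, c=True, sc=True):
    x = y.copy()
    if c:
        # center each column on its mean
        x -= x.mean()
    if sc and c:
        # centered: divide by the sample standard deviation
        x /= x.std()
    elif sc:
        # not centered: divide by the root mean square, as R does
        x /= np.sqrt(x.pow(2).sum().div(x.count() - 1))
    return x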
For the more general version you'd probably need to do some type/length checking.
EDIT: Added explanation of the denominator in elif sc: clause
From the R docs:
... If ‘scale’ is
‘TRUE’ then scaling is done by dividing the (centered) columns of
‘x’ by their standard deviations if ‘center’ is ‘TRUE’, and the
root mean square otherwise. If ‘scale’ is ‘FALSE’, no scaling is
done.
The root-mean-square for a (possibly centered) column is defined
as sqrt(sum(x^2)/(n-1)), where x is a vector of the non-missing
values and n is the number of non-missing values. In the case
‘center = TRUE’, this is the same as the standard deviation, but
in general it is not.
The line np.sqrt(x.pow(2).sum().div(x.count() - 1)) computes the root mean square from that definition: it squares x (the pow method), sums down each column, and divides by the number of non-NaN values in each column minus one (the count method).
As a side note, the reason I didn't simply compute the RMS after centering is that the std method calls bottleneck for faster computation in the special case where you want the standard deviation rather than the more general RMS.
You could instead compute the RMS after centering. It might be worth a benchmark: I haven't run one, and now that I'm writing this I'm not actually sure which is faster.
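The statement from the R docs that the RMS of a centered column equals its standard deviation can be checked numerically. This is a small numpy sketch on made-up data with a few missing values; the helper rms is a name introduced here, not part of R or pandas.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=50)
x[::10] = np.nan  # pretend some values are missing

valid = x[~np.isnan(x)]  # the non-missing values, as in R's definition

def rms(v):
    # root mean square as defined in the R docs: sqrt(sum(v^2) / (n - 1))
    return np.sqrt(np.sum(v ** 2) / (v.size - 1))

rms_centered = rms(valid - valid.mean())  # RMS after centering
sample_std = valid.std(ddof=1)            # sample standard deviation
rms_uncentered = rms(valid)               # what scale(center=FALSE) divides by
```

Here rms_centered and sample_std agree to rounding error, while rms_uncentered is larger because the mean is far from zero.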

Related

Double antiderivative computation in python

I have the following problem. I have a function f defined in python using numpy functions. The function is smooth and integrable on positive reals. I want to construct the double antiderivative of the function (assuming that both the value and the slope of the antiderivative at 0 are 0) so that I can evaluate it on any positive real smaller than 100.
Definition of antiderivative of f at x:
integrate f(s) with s from 0 to x
Definition of double antiderivative of f at x:
integrate (integrate f(t) with t from 0 to s) with s from 0 to x
The actual form of f is not important, so I will use a simple one for convenience. But please note that even though my example has a known closed form, my actual function does not.
import numpy as np
f = lambda x: np.exp(-x)*x
My solution is to construct the antiderivative as an array using naive numerical integration:
N = 10000
delta = 100/N
xs = np.linspace(0,100,N+1)
vs = f(xs)
avs = np.cumsum(vs)*delta
aavs = np.cumsum(avs)*delta
This of course works but it gives me arrays instead of functions. But this is not a big problem as I can interpolate aavs using a spline to get a function and get rid of the arrays.
from scipy.interpolate import UnivariateSpline
aaf = UnivariateSpline(xs, aavs)
The function aaf is approximately the double antiderivative of f.
The problem is that even though it works, there is quite a bit of overhead before I can get my function and precision is expensive.
My other idea was to interpolate f by a spline and take the antiderivative of that, however this introduces numerical errors that are too big for what I want to use the function.
Is there any better way to do that? By better I mean faster without sacrificing accuracy.
Edit: What I hope is possible is to use some kind of Fourier transform to avoid integrating twice. I hope that there is some convenient transform of vs that lets me multiply the values component-wise with xs and transform back to get the double antiderivative. I played with this a bit, but I got lost.
Edit: I figured out that using the trapezoidal rule instead of a naive sum increases the accuracy quite a bit. Using Simpson's rule should increase the accuracy further, but it's somewhat fiddly to do with numpy arrays.
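For reference, a cumulative trapezoidal rule can be written with plain numpy in a couple of lines. The sketch below applies it twice to the example f and compares against the closed form x - 2 + (x + 2)*exp(-x), which is only available because this example happens to be integrable by hand. The helper name cumulative_trapezoid is made up for this sketch.

```python
import numpy as np

def cumulative_trapezoid(values, step):
    # out[i] approximates the integral from xs[0] to xs[i]; out[0] is 0
    out = np.zeros_like(values, dtype=float)
    out[1:] = np.cumsum((values[1:] + values[:-1]) * 0.5 * step)
    return out

f = lambda x: np.exp(-x) * x
N = 10000
xs = np.linspace(0, 100, N + 1)
delta = xs[1] - xs[0]
vs = f(xs)

# Trapezoidal rule applied twice
aavs_trap = cumulative_trapezoid(cumulative_trapezoid(vs, delta), delta)

# Naive cumulative sums, as in the original approach
aavs_naive = np.cumsum(np.cumsum(vs) * delta) * delta

# Closed-form double antiderivative of x*exp(-x) with value and slope 0 at 0
exact = xs - 2 + (xs + 2) * np.exp(-xs)
trap_err = np.max(np.abs(aavs_trap - exact))
naive_err = np.max(np.abs(aavs_naive - exact))
```

On this grid the trapezoidal version has a noticeably smaller maximum error than the naive sums.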
Edit: As @user202729 rightfully complains, this seems off. It seems off because I have skipped some details. I explain here why what I say makes sense, but it does not affect my question.
My actual goal is not to find the double antiderivative of f, but to find a transformation of this. I have skipped that because I think it only confuses the matter.
The function f decays exponentially as x approaches 0 or infinity. I am minimizing the numerical error in the integration by starting the sum from 0 and going up to approximately the peak of f. This ensures that the relative error is approximately constant. Then I start from the opposite direction, from some very big x, and go back to the peak. Then I do the same for the antiderivative values.
Then I transform the aavs by another function which is sensitive to numerical errors. Then I find the region where the errors are big (the values oscillate violently) and drop these values. Finally I approximate what I believe are good values by a spline.
Now if I use a spline to approximate f, it introduces an absolute error which is the dominant term in a rather large interval. This gets "integrated" twice and it ends up being a rather large relative error in aavs. Then once I transform aavs, I find that the 'good region' has shrunk considerably.
EDIT: The actual form of f is something I'm still looking into. However, it is going to be a generalisation of the lognormal distribution. Right now I am playing with the following family.
I start by defining a generalization of the normal distribution:
import numpy as np
from scipy import special

def pdf_n(params, center=0.0, slope=8):
    scale, min, diff = params
    if diff > 0:
        r = min
        l = min + diff
    else:
        r = min - diff
        l = min
    def retfun(m):
        x = (m - center)/scale
        # the exponent moves smoothly from l (far left) to r (far right)
        E = special.expit(slope*x)*(r - l) + l
        return np.exp( -np.power(1 + x*x, E)/2 )
    return np.vectorize(retfun)
It may not be obvious what is happening here, but the result is quite simple. The function decays as exp(-x^(2l)) on the left and as exp(-x^(2r)) on the right. For min=1 and diff=0, this is the normal distribution. Note that this is not normalized. Then I define
g = pdf_n(params)
f = np.vectorize(lambda x:g(np.log(x))/x/area)
where area is the normalization constant.
Note that this is not the actual code I use. I stripped it down to the bare minimum.

You can compute the two np.cumsum (and the divisions) at once more efficiently using Numba. This is significantly faster since there is no need for several temporary arrays to be allocated, filled, read again and freed. Here is a naive implementation:
import numba as nb
import numpy as np

@nb.njit('float64[::1](float64[::1], float64)') # Assume vs is contiguous
def doubleAntiderivative_naive(vs, delta):
    res = np.empty(vs.size, dtype=np.float64)
    sum1, sum2 = 0.0, 0.0
    for i in range(vs.size):
        sum1 += vs[i] * delta
        sum2 += sum1 * delta
        res[i] = sum2
    return res
However, this sum is not very good in terms of numerical stability. Kahan summation is needed to improve the accuracy (or possibly the alternative Kahan–Babuška-Klein algorithm if you are paranoid about accuracy and performance does not matter so much). Note that Numpy uses a pairwise algorithm, which is quite good but far from perfect in terms of accuracy (it is a good compromise between performance and accuracy).
Moreover, delta can be factored out of the summation (i.e. the result just needs to be multiplied by delta**2).
Here is an implementation using the more accurate Kahan summation:
@nb.njit('float64[::1](float64[::1], float64)')
def doubleAntiderivative_accurate(vs, delta):
    res = np.empty(vs.size, dtype=np.float64)
    delta2 = delta * delta
    sum1, sum2 = 0.0, 0.0
    c1, c2 = 0.0, 0.0
    for i in range(vs.size):
        # Kahan summation of the antiderivative of vs
        y1 = vs[i] - c1
        t1 = sum1 + y1
        c1 = (t1 - sum1) - y1
        sum1 = t1
        # Kahan summation of the double antiderivative of vs
        y2 = sum1 - c2
        t2 = sum2 + y2
        c2 = (t2 - sum2) - y2
        sum2 = t2
        res[i] = sum2 * delta2
    return res
Here is the performance of the approaches on my machine (with an i5-9600KF processor):
Numpy cumsum: 51.3 us
Naive Numba: 11.6 us
Accurate Numba: 37.2 us
Here is the relative error of the approaches (based on the provided input function):
Numpy cumsum: 1e-13
Naive Numba: 5e-14
Accurate Numba: 2e-16
Perfect precision: 1e-16 (assuming 64-bit numbers are used)
If f can be easily computed using Numba (this is the case here), then vs[i] can be replaced by calls to f (inlined by Numba). This helps to reduce the memory consumption of the computation (N can be huge without saturating your RAM).
As for the interpolation, splines often give good numerical results, but they are quite expensive to compute and AFAIK they require the whole array to be computed (each item of the array impacts the whole spline, although some items may have a negligible impact on their own). Given your needs, you could consider using Lagrange polynomials. You should be careful when using Lagrange polynomials near the edges. In your case, you can easily solve the numerical divergence issue at the edges by extending the array with the border values (since you know the derivative at each edge of vs is 0). With this method you can apply the interpolation on the fly, which can be good for both performance (typically if the computation is parallelized) and memory usage.

First, I created a version of the code I found more intuitive. Here I multiply cumulative sum values by bin widths. I believe there is a small error in the original version of the code related to the bin width issue.
import numpy as np
f = lambda x: np.exp(-x)*x
N = 1000
xs = np.linspace(0,100,N+1)
domainwidth = ( np.max(xs) - np.min(xs) )
binwidth = domainwidth / N
vs = f(xs)
avs = np.cumsum(vs)*binwidth
aavs = np.cumsum(avs)*binwidth
Next, for visualization here is some very simple plotting code:
import matplotlib
import matplotlib.pyplot as plt
plt.figure()
plt.scatter( xs, vs )
plt.figure()
plt.scatter( xs, avs )
plt.figure()
plt.scatter( xs, aavs )
plt.show()
The first integral matches the known result for the example expression, which can be checked on Wolfram.
Below is a simple function that extracts an element from the double antiderivative. Note that int is a bad rounding function. I assume this is what you have implemented already.
def extract_double_antideriv_value(x):
    return aavs[int(x/binwidth)]
singleresult = extract_double_antideriv_value(50.24)
print('singleresult', singleresult)
Whatever full computation steps are required, we need to know them before we can start optimizing. Do you have a million different functions to integrate? If you only need to query a single double anti-derivative many times, your original solution should be fairly ideal.
Symbolic Approximation:
Have you considered approximations to the original function f which have closed-form integrals? You have a limited domain on which the function lives. Perhaps approximate f with a Taylor series (which can be constructed with a known maximum error) and then integrate exactly? (Consider Padé, Taylor, Fourier, Chebyshev, Lagrange (as suggested by another answer), etc.)
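As one possible instance of this idea, numpy's polynomial module can fit a Chebyshev interpolant to f on the domain and integrate it exactly, term by term. This is only a sketch: the degree 200 is an arbitrary choice for this example, and the closed form used for the comparison exists only because the example f is simple. Whether the accuracy is sufficient for the real f would have to be checked.

```python
import numpy as np
from numpy.polynomial import Chebyshev

f = lambda x: np.exp(-x) * x

# Chebyshev interpolant of f on [0, 100]; degree 200 is an assumption
approx_f = Chebyshev.interpolate(f, deg=200, domain=[0, 100])

# Integrate twice; lbnd=0 makes each antiderivative vanish at x = 0
double_antiderivative = approx_f.integ(m=2, lbnd=0)

# Compare with the closed form available for this particular example
xs = np.linspace(0, 100, 1001)
exact = xs - 2 + (xs + 2) * np.exp(-xs)
max_abs_err = np.max(np.abs(double_antiderivative(xs) - exact))
```

The result is a callable object, so it can be evaluated at any point in [0, 100] without a separate interpolation step.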
Log Tricks:
Another alternative to dealing with spiky errors, would be to take the log of your original function. Is f always positive? Is the integration error caused because the neighborhood around the max is very small? If so, you can study ln(f) or even ln(ln(f)) instead. It would really help to understand what f looks like more.
Approximation Integration Tricks:
There exist countless integration tricks in general, which can produce approximate closed-form solutions to otherwise intractable integrals. A very common one when exponential functions are involved (I think yours is exponential?) is Laplace's Method. But which trick to pull out of the bag is highly dependent upon the conditions which f satisfies.

Tensorflow normalize Vs. traditional way of subtracting mean and dividing by std

I'm curious about any preference of using tf.keras.utils.normalize vs the way we usually normalize a series, subtracting mean and dividing by standard deviation:
import tensorflow as tf
import numpy as np
series = np.random.random(10) + 10 * np.sin(np.random.random(1))
mean = np.mean(series)
std = np.std(series)
(series - mean) / std
tf.keras.utils.normalize(series)
Are there any pros/cons for either method?
tf.keras.utils.normalize gives values in the [0, 1] range, but we get values in the [-1, 1] range using the other method.

tf.keras.utils.normalize divides the data along the specified axis by its norm, so it just makes the data along that axis a unit vector with respect to your favorite lp norm. Whether this is preferable to sklearn.preprocessing.StandardScaler depends on the problem. For many time series problems you want to detrend the series, i.e. make the mean 0, so StandardScaler is appropriate. If you only want the inputs to be reasonably similarly scaled, both methods are more or less equivalent.
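The difference is easy to see with plain numpy. The sketch below reproduces the two operations on a made-up series; the unit-vector line assumes the default L2 norm and is a numpy stand-in, not a call into TensorFlow.

```python
import numpy as np

# Hypothetical series with a large positive offset
series = np.array([9.2, 9.5, 10.1, 10.4, 9.8])

# "Traditional" standardization: zero mean, unit standard deviation
standardized = (series - series.mean()) / series.std()

# Unit-vector normalization: divide by the L2 norm of the whole series
unit_vector = series / np.linalg.norm(series)
```

Because the series is all positive, unit_vector stays in (0, 1) and keeps the offset, while standardized removes the offset and contains negative values.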

Normalizations in sklearn and their differences

I have read many articles suggesting this formula
N = (x - min(x))/(max(x)-min(x))
for normalization,
but when I dug into sklearn's Normalizer I found that it uses this formula
x / np.linalg.norm(x)
The latter uses the l2-norm by default. Which one should I use? Why is there a difference between the two?

There are different normalization techniques and sklearn provides many of them. Please note that we are looking at 1d arrays here. For a matrix, the scalers below are applied to each column, while Normalizer is applied to each row (have a look at this post for an in-depth example: Scaling features for machine learning). Let's go through some of them:
Scikit-learn's MinMaxScaler performs (x - min(x))/(max(x)-min(x)). This scales your array so that all values lie between 0 and 1. It can be useful if you want to apply some transformation afterwards where no negative values are allowed (e.g. a log-transform), or for scaling RGB pixels as done in some MNIST examples.
Scikit-learn's StandardScaler performs (x-x.mean())/x.std(), which centers the array around zero and scales it to unit variance. This is a standard transformation and is applicable in many situations, but keep in mind that you will get negative values. It is especially useful when you have Gaussian-sampled data that is not centered around 0 and/or does not have unit variance.
Scikit-learn's Normalizer performs x / np.linalg.norm(x). This sets the length of your array/vector to 1. It might come in handy if you want to do some linear algebra, e.g. if you want to implement the Gram-Schmidt algorithm.
Scikit-learn's RobustScaler can be used to scale data with outliers. Mean and standard deviation are not robust to outliers, therefore this scaler centers on the median and scales by a quantile range (the interquartile range by default).
There are also non-linear transformations, like QuantileTransformer, which scales by quantile ranges, and PowerTransformer, which maps any distribution to a distribution similar to a Gaussian distribution.
And there are many other normalizations used in machine learning, and their vast number can be confusing. The idea behind normalizing data in ML is usually that you don't want your model to treat one feature differently from the others simply because it has a higher mean or a larger variance. For most standard cases I use MinMaxScaler or StandardScaler, depending on whether scaling according to the variance seems important to me.
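To make the formulas concrete, here is a numpy-only sketch that mirrors them on a small made-up matrix. It imitates the arithmetic of the scalers with their default settings rather than calling scikit-learn, so treat it as an illustration of the formulas and not as the library's implementation.

```python
import numpy as np

# Hypothetical data: 3 samples (rows), 2 features (columns)
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 1000.0]])

# Min-max scaling, per column: values end up in [0, 1]
min_max = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Standardization, per column: zero mean, unit variance
standard = (X - X.mean(axis=0)) / X.std(axis=0)

# Robust scaling, per column: center on the median, divide by the IQR
q25, q50, q75 = np.percentile(X, [25, 50, 75], axis=0)
robust = (X - q50) / (q75 - q25)

# Unit-norm normalization, per row: every sample becomes a unit vector
row_unit = X / np.linalg.norm(X, axis=1, keepdims=True)
```

Note the axis argument: the first three operate down the columns (features), the last one along the rows (samples).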

np.linalg.norm is given by (for a matrix, the default is the Frobenius norm):
np.linalg.norm(x) = sqrt(sum_i_j(abs(x_i_j)^2))
so let's assume you have:
X = (1 2
0 -1)
then with this you would have:
np.linalg.norm(X) = sqrt(1^2 + 2^2 + 0^2 + (-1)^2) = sqrt(6) ≈ 2.45
X = (0.41 0.82
0 -0.41)
with the other approach you would have:
min(X) = -1
max(X) = 2
max(X) - min(X) = 3
X = (0.67 1
0.33 0)
The (x - min(x))/(max(x) - min(x)) approach is what MinMaxScaler implements; with it all the values are always between 0 and 1. The other approach normalizes your values, but you can still have negative values. Depending on your next steps you need to decide which one to use.

Based on the API description
Scikit-learn normalizer scales input vectors individually to a unit norm (vector length).
That is why it uses the L2 norm (you can also use L1, as explained in the API).
I think you are looking for a scaler instead of a normalizer, judging by your description. Please find the Min-Max scaler in this link.
Also, you can consider a standard scaler, which normalizes values by removing the mean and scaling by the standard deviation.

Python - how to normalize time-series data

I have a dataset of time-series examples. I want to calculate the similarity between various time-series examples, however I do not want to take into account differences due to scaling (i.e. I want to look at similarities in the shape of the time-series, not their absolute value). So, to this end, I need a way of normalizing the data, that is, making all of the time-series examples fall within a certain range, e.g. [0,100]. Can anyone tell me how this can be done in Python?

The solutions given are good for series that are neither increasing nor decreasing (stationary). For financial time series (or any other series with a bias) the formula given is not right. The series should first be detrended, or the scaling should be based on the latest 100-200 samples.
And if the time series doesn't come from a normal distribution (as is the case in finance), it is advisable to apply a non-linear function (a standard CDF function, for example) to compress the outliers.
Aronson and Masters' book (Statistically Sound Machine Learning for Algorithmic Trading) uses the following formula (on 200-day chunks):
V = 100 * N(0.5 * (X - F50)/(F75 - F25)) - 50
Where:
X : data point
F50 : median (50th percentile) of the latest 200 points
F75 : percentile 75
F25 : Percentile 25
N : normal CDF
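A minimal sketch of this formula for a single window, using only numpy and the standard library. F50 is taken to be the 50th percentile here, the window is synthetic heavy-tailed data, and the helper names normal_cdf and compress are made up for the example.

```python
import math
import numpy as np

def normal_cdf(z):
    # standard normal CDF expressed through the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def compress(window, x):
    # V = 100 * N(0.5 * (X - F50) / (F75 - F25)) - 50
    f25, f50, f75 = np.percentile(window, [25, 50, 75])
    return 100.0 * normal_cdf(0.5 * (x - f50) / (f75 - f25)) - 50.0

rng = np.random.default_rng(1)
window = rng.standard_t(df=3, size=200)  # synthetic heavy-tailed sample

at_median = compress(window, np.median(window))   # a point at the median
extreme = compress(window, window.max() + 1000)   # a far outlier
```

A point at the median maps to 0, and even an extreme outlier cannot leave the interval [-50, 50], which is the compression effect described above.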

Assuming that your timeseries is an array, try something like this:
(timeseries-timeseries.min())/(timeseries.max()-timeseries.min())
This will confine your values between 0 and 1

Following my previous comment, here is a (not optimized) Python function that does scaling and/or normalization.
It needs a pandas DataFrame as input and doesn't check for that, so it raises errors if supplied with another object type. If you need to use a list or numpy.array, you need to modify it, or you could convert those objects to a pandas.DataFrame first.
This function is slow, so it's advisable to run it just once and store the results.
from scipy.stats import norm
import numpy as np
import pandas as pd

def get_NormArray(df, n, mode = 'total', linear = False):
    '''
    It computes the normalized value on the stats of n values ( Modes: total or scale )
    using the formulas from the book "Statistically sound machine learning..."
    (Aronson and Masters) but the decision to apply a non linear scaling is left to the user.
    It is modified to fit the data from -1 to 1 instead of -100 to 100
    df is an input DataFrame. it returns also a DataFrame, but it could return a list.
    n define the number of data points to get the median and the quartiles for the normalization
    modes: scale: scale, without centering. total: center and scale.
    '''
    temp =[]
    for i in range(len(df))[::-1]:
        if i >= n: # there will be a traveling norm until we reach the initial n values.
            # those values will be normalized using the last computed values of F50,F75 and F25
            F50 = df[i-n:i].quantile(0.5)
            F75 = df[i-n:i].quantile(0.75)
            F25 = df[i-n:i].quantile(0.25)
        if linear and mode == 'total':
            v = 0.5 * ((df.iloc[i]-F50)/(F75-F25))-0.5
        elif linear and mode == 'scale':
            v = 0.25 * df.iloc[i]/(F75-F25) -0.5
        elif not linear and mode == 'scale':
            v = 0.5* norm.cdf(0.25*df.iloc[i]/(F75-F25))-0.5
        else: # even if strange values are given, it will perform full normalization with compression as default
            v = norm.cdf(0.5*(df.iloc[i]-F50)/(F75-F25))-0.5
        # keep the value for the first column, by position (v may be a Series or an array)
        temp.append(np.asarray(v).ravel()[0])
    return pd.DataFrame(temp[::-1])

I'm not going to give the Python code, but the definition of normalizing, is that for every value (datapoint) you calculate "(value-mean)/stdev". Your values will not fall between 0 and 1 (or 0 and 100) but I don't think that's what you want. You want to compare the variation. Which is what you are left with if you do this.

from sklearn import preprocessing
normalized_data = preprocessing.minmax_scale(data)
You can take a look here normalize-standardize-time-series-data-python
and
sklearn.preprocessing.minmax_scale

Equivalent python command for quantile in matlab

I'm trying to replicate some Matlab code in Python. I could not find an exact equivalent to the Matlab function quantile. The closest I found is scipy's mquantiles.
Matlab example:
quantile( [ 8.60789925e-05, 1.98989354e-05 , 1.68308882e-04, 1.69379370e-04], 0.8)
...gives: 0.00016958
Same example in python:
scipy.stats.mstats.mquantiles( [8.60789925e-05, 1.98989354e-05, 1.68308882e-04, 1.69379370e-04], 0.8)
...gives 0.00016912
Does anyone know how to exactly replicate Matlab's quantile function?

The documentation for quantile (under the More About => Algorithms section) gives the exact algorithm used. Here's some python code that does it for a single quantile for a flat array, using a partial sort (np.partition here; the original version of this answer used bottleneck's partsort):
import numpy as np

def quantile(a, prob):
    """
    Estimates the prob'th quantile of the values in a data array.
    Uses the algorithm of matlab's quantile(), namely:
    - Remove any nan values
    - Take the sorted data as the (.5/n), (1.5/n), ..., (1-.5/n) quantiles.
    - Use linear interpolation for values between (.5/n) and (1 - .5/n).
    - Use the minimum or maximum for quantiles outside that range.
    See also: scipy.stats.mstats.mquantiles
    """
    a = np.asanyarray(a)
    a = a[np.logical_not(np.isnan(a))].ravel()
    n = a.size
    if prob >= 1 - .5/n:
        return a.max()
    elif prob <= .5 / n:
        return a.min()
    # find the two bounds we're interpolating between:
    # that is, find i such that (i+.5) / n <= prob <= (i+1.5)/n
    t = n * prob - .5
    i = int(np.floor(t))  # must be an integer to be usable as an index
    # partial sort so that the ith element is at position i, with bigger ones
    # to the right and smaller to the left
    a = np.partition(a, i)
    if i == t: # did we luck out and get an integer index?
        return a[i]
    else:
        # we'll linearly interpolate between this and the next index
        smaller = a[i]
        larger = a[i+1:].min()
        if np.isinf(smaller):
            return smaller # avoid inf - inf
        return smaller + (larger - smaller) * (t - i)
I only did the single-quantile, 1d case because that's all I needed. If you want several quantiles, it's probably worth just doing the full sort; to do it per-axis, if you know you don't have any NaNs, all you should need to do is add an axis argument to the sort and vectorize the linear interpolation bit. Doing it per-axis with NaNs would be a little trickier.
This code gives:
>>> quantile([ 8.60789925e-05, 1.98989354e-05 , 1.68308882e-04, 1.69379370e-04], 0.8)
0.00016905822360000001
and the matlab code gave 0.00016905822359999999; the difference is 3e-20. (which is less than machine precision)
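For several quantiles at once, the full-sort variant mentioned above can be sketched with np.interp, which interpolates linearly between the plotting positions (k + 0.5)/n and clamps to the minimum or maximum outside them. This is a 1d sketch only; the function name matlab_like_quantile is made up, and it follows the algorithm as described in the docstring above rather than anything checked against MATLAB itself.

```python
import numpy as np

def matlab_like_quantile(values, probs):
    data = np.sort(np.asarray(values, dtype=float))
    data = data[~np.isnan(data)]  # drop NaNs (np.sort puts them last)
    n = data.size
    # sorted data are treated as the 0.5/n, 1.5/n, ..., (n - 0.5)/n quantiles
    positions = (np.arange(n) + 0.5) / n
    # np.interp clamps to data[0] / data[-1] outside [0.5/n, 1 - 0.5/n]
    return np.interp(probs, positions, data)

data = [8.60789925e-05, 1.98989354e-05, 1.68308882e-04, 1.69379370e-04]
q80 = matlab_like_quantile(data, 0.8)
several = matlab_like_quantile(data, [0.0, 0.5, 1.0])
```

For the example input, q80 agrees with the single-quantile function above.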

Your input vector only has 4 values, which is far too few to get a good approximation of the quantiles of the underlying distribution. The discrepancy is probably the result of Matlab and SciPy using different heuristics to compute quantiles on under sampled distributions.

A bit late, but:
mquantiles is very flexible. You just need to provide alphap and betap parameters.
Here, since MATLAB does a linear interpolation, you need to set the parameters to (0.5,0.5).
In [9]: scipy.stats.mstats.mquantiles( [8.60789925e-05, 1.98989354e-05, 1.68308882e-04, 1.69379370e-04], 0.8, alphap=0.5, betap=0.5)
EDIT: MATLAB says that it does linear interpolation, however it seems that it calculates the quantile through piece-wise linear interpolation, which is equivalent to Type 5 quantile in R, and (0.5, 0.5) in scipy.
