I'd like to normalize my training set before passing it to my NN. Instead of doing it manually (subtracting the mean and dividing by the std), I tried keras.utils.normalize(), and I am surprised by the results I got.
Running this:
import numpy as np
from keras.utils import normalize

r = np.random.rand(3000) * 1000
nr = normalize(r)
print(np.mean(r))
print(np.mean(nr))
print(np.std(r))
print(np.std(nr))
print(np.min(r))
print(np.min(nr))
print(np.max(r))
print(np.max(nr))
Results in that:
495.60440066771866
0.015737914577213984
291.4440194021
0.009254802974329002
0.20755517410064872
6.590913227674956e-06
999.7631481267636
0.03174747238214018
Unfortunately, the docs don't explain what's happening under the hood. Can you please explain what it does and if I should use keras.utils.normalize instead of what I would have done manually?
It is not the kind of normalization you expect. Actually, it uses np.linalg.norm() under the hood to normalize the given data using Lp-norms:
def normalize(x, axis=-1, order=2):
    """Normalizes a Numpy array.

    # Arguments
        x: Numpy array to normalize.
        axis: axis along which to normalize.
        order: Normalization order (e.g. 2 for L2 norm).

    # Returns
        A normalized copy of the array.
    """
    l2 = np.atleast_1d(np.linalg.norm(x, order, axis))
    l2[l2 == 0] = 1
    return x / np.expand_dims(l2, axis)
For example, in the default case it normalizes the data using the L2 norm (i.e. the sum of the squares of the elements along the given axis equals one). That is why the mean and standard deviation in your output are not 0 and 1.
If that is the normalization you want, you can use this function. If you want mean and std normalization but don't want to do it manually, you can use StandardScaler() from sklearn, or MinMaxScaler() for min-max scaling.
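To make the difference concrete, here is a minimal NumPy-only sketch. It assumes a 1-D input like the one in the question and reproduces the default L2 behaviour by hand rather than calling Keras; the variable names are mine.

```python
import numpy as np

rng = np.random.default_rng(0)
r = rng.random(3000) * 1000

# L2 normalization: divide by the Euclidean norm of the whole vector.
# This mirrors what the function shown above does for a 1-D array.
l2_normalized = r / np.linalg.norm(r, ord=2)

# Standardization: subtract the mean and divide by the standard deviation.
standardized = (r - r.mean()) / r.std()

print(np.sum(l2_normalized ** 2))              # sum of squares of the L2 version
print(standardized.mean(), standardized.std())  # mean and std of the standardized version
```

The L2 version has a sum of squares of one but keeps a non-zero mean, while the standardized version has (numerically) zero mean and unit standard deviation.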
I have ocean temperature data with dimensions 'depth' and 'time', stored as data(depth, time).
I want to apply the 'detrend' function at each depth and save the results, so that I end up with a single array of shape (number of depths, time). Depth = 42 and time = 72
for i in range(44):
    depth = temp[:,i]
    detrend = s.detrend(depth)
But this only gives the result for the last depth.
Please let me know how to keep all of them.
Assuming you're using scipy.signal.detrend(), you can use the axis argument to do this without a loop. It's unclear to me which axis you're hoping to detrend, but you can pick axis=0 or axis=1 depending on which you want. So:
detrend = s.detrend(temp,axis=0)
In general, if you do want to do something like this in a loop you could create an empty array of the right size that you then write into in each iteration of the loop.
detrended = np.zeros_like(temp)
for i in range(44):
    depth = temp[:,i]
    detrended[:,i] = s.detrend(depth)
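To illustrate what detrending along one axis means, here is a NumPy-only sketch that removes a least-squares linear fit along the time axis for every depth at once. It swaps in np.polyfit for scipy.signal.detrend, and the (42, 72) depth-by-time layout and the synthetic data are assumptions taken from the question's description.

```python
import numpy as np

rng = np.random.default_rng(0)
n_depth, n_time = 42, 72
t = np.arange(n_time)

# Hypothetical data: a different linear trend at each depth, plus noise.
slopes = np.linspace(0.1, 1.0, n_depth)[:, None]
temp = slopes * t + rng.normal(size=(n_depth, n_time))

# np.polyfit fits every column of a 2-D y at once, so pass time-major data.
coeffs = np.polyfit(t, temp.T, deg=1)                 # shape (2, n_depth)
trend = coeffs[0][:, None] * t + coeffs[1][:, None]   # shape (n_depth, n_time)
detrended = temp - trend
```

With this layout, the time axis is axis=1, which is the axis to pass if you use the axis argument instead.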
First, if you don't save the detrended data in a list or an array during the for loop, only the last detrend result will survive. In fact, there is no need to use a for loop for this.
Instead of using scipy methods, you can also directly use the xarray polyfit and polyval methods, which let you specify the dimension over which to detrend.
Example:
import numpy as np
import xarray as xr
import matplotlib.pyplot as plt
#test data with linear trend
data_vars=np.arange(1000)[None,:]+np.random.randn(50,1000)*100
data=xr.DataArray(data_vars,dims=['depth','time'])
p = data.polyfit(dim='time', deg=1)
fit = xr.polyval(data['time'], p.polyfit_coefficients)
detrend = data-fit
#eg for zeroth depth
data.isel(depth=0).plot(label='raw')
detrend.isel(depth=0).plot(label='lin. detrended')
plt.legend()
My goal is to compute a derivative of a moving window of a multidimensional dataset along a given dimension, where the dataset is stored as Xarray DataArray or DataSet.
In the simplest case, given a 2D array I would like to compute a moving difference across multiple entries in one dimension, e.g.:
data = np.kron(np.linspace(0,1,10), np.linspace(1,4,6)).reshape(10,6)
T = 3
reducedArray = np.zeros_like(data)
for i in range(data.shape[1]):
    if i < T:
        reducedArray[:,i] = data[:,i] - data[:,0]
    else:
        reducedArray[:,i] = data[:,i] - data[:,i-T]
where the if i < T condition ensures that input and output contain proper values (i.e., no NaNs) and are of identical shape.
Xarray's diff aims to perform a finite-difference approximation of a given derivative order using nearest-neighbours, so it is not suitable here, hence the question:
Is it possible to perform this operation using Xarray functions only?
The rolling weighted average example appears to be something similar, but still too distinct due to the usage of NumPy routines. I've been thinking that something along the lines of the following should work:
xr2DDataArray = xr.DataArray(
data,
dims=('x','y'),
coords={'x':np.linspace(0,1,10), 'y':np.linspace(1,4,6)}
)
r = xr2DDataArray.rolling(x=T,min_periods=2)
r.reduce( redFn )
I am struggling with the definition of redFn here, though.
Caveat The actual dataset to which the operation is to be applied will have a size of ~10GiB, so a solution that does not blow up the memory requirements will be highly appreciated!
Update/Solution
Using Xarray rolling
After sleeping on it and a bit more fiddling the post linked above actually contains a solution. To obtain a finite difference we just have to define the weights to be $\pm 1$ at the ends and $0$ else:
def fdMovingWindow(data, **kwargs):
    T = kwargs['T']
    del kwargs['T']
    weights = np.zeros(T)
    weights[0] = -1
    weights[-1] = 1
    axis = kwargs['axis']
    if data.shape[axis] == T:
        return np.sum(data * weights, **kwargs)
    else:
        return 0

r.reduce(fdMovingWindow, T=4)
Alternatively, using construct and a dot product:
weights = np.zeros(T)
weights[0] = -1
weights[-1] = 1
xrWeights = xr.DataArray(weights, dims=['window'])
xr2DDataArray.rolling(y=T,min_periods=1).construct('window').dot(xrWeights)
This carries a massive caveat: the procedure essentially creates a list of arrays representing the moving window. This is fine for a modest 2D / 3D array, but for a 4D array that takes up ~10 GiB in memory this will lead to an OOM death!
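One detail worth checking with the weighted-window idea is the window length: a window of T samples spans only T - 1 steps, so reproducing data[:, i] - data[:, i-T] needs T + 1 samples. The sketch below verifies this in plain NumPy with sliding_window_view (NumPy 1.20 or later) instead of xarray rolling; it covers only the columns where a full window exists, and the names windows and lagged are mine.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

data = np.kron(np.linspace(0, 1, 10), np.linspace(1, 4, 6)).reshape(10, 6)
T = 3

# T + 1 samples per window, so the two ends are exactly T steps apart.
weights = np.zeros(T + 1)
weights[0] = -1
weights[-1] = 1

# The window axis is appended last: shape (10, 6 - T, T + 1).
windows = sliding_window_view(data, T + 1, axis=1)
lagged = windows @ weights                       # shape (10, 6 - T)

print(np.allclose(lagged, data[:, T:] - data[:, :-T]))
```

The window array is a view on the original data, but the product may still allocate temporaries, so this is meant as a check of the arithmetic rather than as a memory-safe recipe for the 10 GiB case.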
Simplistic - memory efficient
A less memory-intensive way is to copy the array and work in a way similar to NumPy's arrays:
xrDiffArray = xr2DDataArray.copy()
dy = xr2DDataArray.y.values[1] - xr2DDataArray.y.values[0]  # equidistant sampling
for src in xr2DDataArray:
    if src.y.values < xr2DDataArray.y.values[0] + T*dy:
        xrDiffArray.loc[dict(y = src.y.values)] = src.values - xr2DDataArray.values[0]
    else:
        xrDiffArray.loc[dict(y = src.y.values)] = src.values - xr2DDataArray.sel(y = src.y.values - dy*T).values
This will produce the intended result without dimensional errors, but it requires a copy of the dataset.
I was hoping to utilise Xarray to prevent a copy and instead just chain operations that are then evaluated if and when values are actually requested.
A suggestion as to how to accomplish this will still be welcomed!
I have never used xarray, so maybe I am mistaken, but I think you can get the result you want without loops or conditionals. This is at least twice as fast as your example for NumPy arrays:
data = np.kron(np.linspace(0,1,10), np.linspace(1,4,6)).reshape(10,6)
reducedArray = np.empty_like(data)
reducedArray[:, T:] = data[:, T:] - data[:, :-T]
reducedArray[:, :T] = data[:, :T] - data[:, 0, np.newaxis]
I imagine the improvement will be larger when using DataArrays.
It does not use xarray functions, but it does not depend on NumPy functions either, only on slicing. I am confident that translating this to xarray will be straightforward. I know that it works if there are no coords, but once you include them you get an error because of the coords mismatch (the coords of data[:, T:] and data[:, :-T] are different). Sadly, I can't do better right now.
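As a sanity check on the slicing above, the following sketch compares it with the loop from the question on the same toy data. The names loop_result and sliced_result are mine, and it covers plain NumPy arrays only, not the xarray case with coordinates.

```python
import numpy as np

data = np.kron(np.linspace(0, 1, 10), np.linspace(1, 4, 6)).reshape(10, 6)
T = 3

# Reference: the explicit loop from the question.
loop_result = np.zeros_like(data)
for i in range(data.shape[1]):
    ref = 0 if i < T else i - T
    loop_result[:, i] = data[:, i] - data[:, ref]

# Vectorized: two slice assignments, no loop and no conditionals.
sliced_result = np.empty_like(data)
sliced_result[:, T:] = data[:, T:] - data[:, :-T]
sliced_result[:, :T] = data[:, :T] - data[:, 0, np.newaxis]

print(np.allclose(loop_result, sliced_result))
```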
I would like to scale an array of size [192,4000] to a specific range. I would like each row (1:192) to be rescaled to a specific range, e.g. (-840,840). I run this very simple code:
import numpy as np
from sklearn import preprocessing as sp
sample_mat = np.random.randint(-840,840, size=(192, 4000))
scaler = sp.MinMaxScaler(feature_range=(-840,840))
scaler = scaler.fit(sample_mat)
scaled_mat= scaler.transform(sample_mat)
This messes up my matrix range, even when the max and min of my original matrix are exactly the same. I can't figure out what is wrong. Any ideas?
You can do this manually.
It is a linear transformation of the minmax normalized data.
interval_min = -840
interval_max = 840
scaled_mat = (sample_mat - np.min(sample_mat)) / (np.max(sample_mat) - np.min(sample_mat)) * (interval_max - interval_min) + interval_min
MinMaxScaler supports a feature_range argument at initialization, which makes it produce output in that range.
For example, scaler = MinMaxScaler(feature_range=(1, 2)) will yield output in the (1,2) range.
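Note that MinMaxScaler scales each column (feature) independently, whereas the question asks for each row to be rescaled. A minimal NumPy sketch of row-wise scaling follows; it assumes no row is constant (otherwise the division is by zero), and the names row_min and row_max are mine.

```python
import numpy as np

rng = np.random.default_rng(0)
sample_mat = rng.integers(-840, 840, size=(192, 4000)).astype(float)

interval_min, interval_max = -840, 840

# Per-row extrema; keepdims=True keeps shape (192, 1) so they broadcast per row.
row_min = sample_mat.min(axis=1, keepdims=True)
row_max = sample_mat.max(axis=1, keepdims=True)

# Min-max normalize each row to [0, 1], then map linearly to the target interval.
scaled_mat = (sample_mat - row_min) / (row_max - row_min) * (interval_max - interval_min) + interval_min
```

With scikit-learn, the same effect should be obtainable by fitting on the transpose and transposing back, e.g. scaler.fit_transform(sample_mat.T).T, since the scaler then treats each original row as a feature.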
I would like to explore efficient ways of performing expanding OLS in pandas (or in other libraries that work well with DataFrame/Series objects).
Assuming the dataset is large, I am NOT interested in any solutions with a for-loop;
I am looking for solutions about expanding rather than rolling. Rolling functions always require a fixed window while expanding uses a variable window (starting from beginning);
Please do not suggest pandas.stats.ols.MovingOLS because it is deprecated;
Please do not suggest other deprecated methods such as expanding_mean.
For example, there is a DataFrame df with two columns X and y. To make it simpler, let's just calculate beta.
Currently, I am thinking about something like
import numpy as np
import pandas as pd
import statsmodels.api as sm
def my_OLS_func(df, y_name, X_name):
    y = df[y_name]
    X = df[X_name]
    X = sm.add_constant(X)
    b = np.linalg.pinv(X.T.dot(X)).dot(X.T).dot(y)
    return b

df = pd.DataFrame({'X':[1,2.5,3], 'y':[4,5,6.3]})
df['beta'] = df.expanding().apply(my_OLS_func, args = ('y', 'X'))
Expected values of df['beta'] are 0 (or NaN), 0.66666667, and 1.038462.
However, this method does not seem to work because the method seems very inflexible. I am not sure how one could pass the two Series as arguments.
Any suggestions would be appreciated.
One option is to use the RecursiveLS (recursive least squares) model from Statsmodels:
# Simulate some data
rs = np.random.RandomState(seed=12345)
nobs = 100000
beta = [10., -0.2]
sigma2 = 2.5
exog = sm.add_constant(rs.uniform(size=nobs))
eps = rs.normal(scale=sigma2**0.5, size=nobs)
endog = np.dot(exog, beta) + eps
# Construct and fit the recursive least squares model
mod = sm.RecursiveLS(endog, exog)
res = mod.fit()
# This is a 2 x 100,000 numpy array with the regression coefficients
# that would be estimated when using data from the beginning of the
# sample to each point. You should usually ignore the first k=2
# datapoints since they are controlled by a diffuse prior.
res.recursive_coefficients.filtered
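For the single-regressor case in the question, the expanding slope can also be written in closed form with cumulative sums, which avoids both a loop and a model fit. This is a NumPy-only sketch of that idea, not the RecursiveLS approach above; it assumes one regressor plus an intercept, and plain cumulative sums can lose precision on long or badly scaled series.

```python
import numpy as np

x = np.array([1.0, 2.5, 3.0])
y = np.array([4.0, 5.0, 6.3])
n = np.arange(1, len(x) + 1)

# Running sums over the expanding window.
sx, sy = np.cumsum(x), np.cumsum(y)
sxx, sxy = np.cumsum(x * x), np.cumsum(x * y)

# OLS slope with intercept: beta = (n*Sxy - Sx*Sy) / (n*Sxx - Sx^2).
num = n * sxy - sx * sy
den = n * sxx - sx ** 2
with np.errstate(divide="ignore", invalid="ignore"):
    beta = np.where(den != 0, num / den, np.nan)

print(beta)
```

The first entry is NaN because a slope is undefined for a single point; the remaining entries match the expected values 0.66666667 and 1.038462 from the question.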
I am trying to implement a very simple example of the law of large numbers using PyMC. The goal is to generate many sample averages of samples of different sizes. For example, in the code below, I'm repeatedly taking groups of 5 samples (samples_to_average = 5), calculating their mean, and then finding the 95% CI of the resulting trace.
The code below runs, but what I'd like to do is modify samples_to_average to be a list, so that I can calculate confidence intervals for a range of different sample sizes in a single pass.
import scipy.misc
import numpy as np
import pymc as mc
samples_to_average = 5
list_of_samples = mc.DiscreteUniform("response", lower=1, upper=10, size=1000)
@mc.deterministic
def sample_average(x=list_of_samples, n=samples_to_average):
    samples = int(n)
    selected = x[0:samples]
    total = np.sum(selected)
    sample_average = float(total) / samples
    return sample_average
def getConfidenceInterval():
    responseModel = mc.Model([samples_to_average, list_of_samples, sample_average])
    mapRes = mc.MAP(responseModel)
    mapRes.fit()
    mcmc = mc.MCMC(responseModel)
    mcmc.sample(10000, 5000)
    upper = np.percentile(mcmc.trace('sample_average')[:], 95)
    lower = np.percentile(mcmc.trace('sample_average')[:], 5)
    return (lower, upper)

print(getConfidenceInterval())
Most examples I've seen using the deterministic decorator use global stochastic variables. However, to achieve my aim, I think what I need to do is create a stochastic variable (of the correct length) in getConfidenceInterval(), and pass this to sample_average (rather than supplying sample_average using globals / default parameter).
How can a variable created in getConfidenceInterval() be passed into sample_average(), or alternatively, what is another way that I can evaluate multiple models using different values of samples_to_average? I'd like to avoid globals if possible.
Before addressing your question, I would like to simplify the way sample_average is written so that it is more compact and easier to understand.
sample_average = mc.Lambda('sample_average', lambda x=list_of_samples, n=samples_to_average: np.mean(x[:n]))
Now you can generalize this to the case where samples_to_average is an array of parameters:
samples_to_average = np.arange(5, 25, 5)
sample_average = mc.Lambda('sample_average', lambda x=list_of_samples, n=samples_to_average: [np.mean(x[:t]) for t in n])
The getConfidenceInterval function would also have to be changed as shown below:
def getConfidenceInterval():
    responseModel = mc.Model([samples_to_average, list_of_samples, sample_average])
    mapRes = mc.MAP(responseModel)
    mapRes.fit()
    mcmc = mc.MCMC(responseModel)
    mcmc.sample(10000, 5000)
    # pass a list, not a generator: newer NumPy versions reject generators in vstack
    average = np.vstack([t for t in mcmc.trace('sample_average')])
    upper = np.percentile(average, 95, axis=0)
    lower = np.percentile(average, 5, axis=0)
    return (lower, upper)
I used vstack to aggregate the sample averages into a 2D array and then used the axis option in Numpy's percentile function to compute percentiles along each column.
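To show the column-wise percentile step in isolation, here is a sketch that replaces the PyMC trace with direct NumPy sampling. That substitution is mine, as are n_replicates and draws; the sample sizes and the discrete uniform on 1..10 follow the code above. Note that the 5th and 95th percentiles bound a central 90% interval.

```python
import numpy as np

rng = np.random.default_rng(0)
samples_to_average = np.arange(5, 25, 5)   # sample sizes 5, 10, 15, 20
n_replicates = 10000                       # stands in for the length of the trace

# Each row is one replicate of draws from a discrete uniform on 1..10.
draws = rng.integers(1, 11, size=(n_replicates, samples_to_average.max()))

# Column j holds the mean of the first samples_to_average[j] draws of each replicate.
average = np.column_stack([draws[:, :n].mean(axis=1) for n in samples_to_average])

# axis=0 computes one percentile per column, i.e. per sample size.
upper = np.percentile(average, 95, axis=0)
lower = np.percentile(average, 5, axis=0)

print(lower, upper)
```

The interval narrows as the sample size grows, which is the law-of-large-numbers behaviour the question is after.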