Python: how to improve a normalization algorithm? - python

I have a list X containing the data generated by N different users, so the users are indexed i = 0, 1, ..., N-1. Each entry Xi has a different length.
I want to normalize the value of each user Xi over the global dataset X.
This is what I am doing. First of all I create a 1D list containing all the data, so:
tmp = list()
for i in range(0,len(X)):
    tmp.extend(X[i])
then I convert it to an array and I remove outliers and NaN.
A = np.array(tmp)
A = A[~np.isnan(A)] #remove NaN
tr = np.percentile(A,95)
A = A[A < tr] #remove outliers
and then I create the histogram of this dataset
p, x = np.histogram(A, bins=10) # bin it into n = N/10 bins
finally I normalize the value of each users over the histogram I created, so:
Xn = list()
for i in range(0,len(X)):
    tmp = np.array(X[i])
    tmp = tmp[tmp < tr]
    tmp = np.histogram(tmp, x)
    Xn.append(tmp[0]/sum(tmp[0]))
My data set is very large and this process can take a while. I am wondering if there is a better way to do this, or a package that does it.

For the first part, if each element X[i] of X is a list, you may be able to use sum, and then convert directly to an array, or use concatenate:
import numpy as np
# Example X (np.nan replaces np.NaN, which was removed in NumPy 2.0)
X = [list(range(i)) for i in range(3, 19)] + [[2., np.nan]]
# Build array with sum
A = np.array(sum(X, []))
# Build array with concatenate
A = np.concatenate(X)
The latter is more readable.
For the second part, I would store indices of the user to which each data point belongs.
idx = np.concatenate([np.full(len(x), i, int) for i,x in enumerate(X)])
tr = np.nanpercentile(A,95)
ok = A < tr # this excludes outliers, +Inf and NaN
idx = idx[ok]
A = A[ok]
Finally, you can compute x from the range of A, then use digitize on A and get the bins of each element. Then each pair (idx,bin-1) identifies the datum of a given user belonging to a given bin. You can then sum all these contributions using the at method of the ufunc add (see documentation). Finally, you divide by the sum over bins to normalize.
x = np.linspace(A.min(), A.max(), 10+1)
bin = np.digitize(A, x)
Xn = np.zeros((len(X), len(x)))
np.add.at(Xn, (idx,bin-1), 1)
Xn /= Xn.sum(axis=1)[:,np.newaxis]
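The steps above can be put together in one self-contained sketch. The helper name `normalize_per_user` and its parameters are hypothetical, and two details differ from the snippet above by assumption: only the interior bin edges are passed to digitize, so the maximum value falls into the last of the 10 bins (the same convention as np.histogram), and users with no remaining data get a row of zeros instead of a division by zero.

```python
import numpy as np

def normalize_per_user(X, n_bins=10, pct=95):
    """Per-user histograms over shared bins; each non-empty row sums to 1.

    Hypothetical helper, not part of NumPy.
    """
    # Flatten all users into one array and remember the owner of each value
    A = np.concatenate([np.asarray(x, dtype=float) for x in X])
    idx = np.repeat(np.arange(len(X)), [len(x) for x in X])

    tr = np.nanpercentile(A, pct)
    ok = A < tr                      # False for NaN, so NaNs drop out too
    A, idx = A[ok], idx[ok]

    edges = np.linspace(A.min(), A.max(), n_bins + 1)
    # Interior edges only: bin numbers run 0..n_bins-1
    b = np.digitize(A, edges[1:-1])

    counts = np.zeros((len(X), n_bins))
    np.add.at(counts, (idx, b), 1)   # unbuffered add handles repeated (user, bin) pairs

    totals = counts.sum(axis=1, keepdims=True)
    Xn = np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)
    return Xn, edges

X = [[1.0, 2.0, 3.0, float("nan")], [2.0, 2.5, 100.0], [0.5, 1.5, 2.5, 3.5, 4.5]]
Xn, edges = normalize_per_user(X, n_bins=4)
print(Xn)
```

If the counts are large, np.bincount on a combined index (idx * n_bins + b) is usually faster than np.add.at, at the cost of a reshape afterwards.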

Related

Using python/numpy to create a complex matrix

Using python/numpy, I would like to create a 2D matrix M whose components are:
I know I can do this with a bunch of for loops but is there a better way to do this by using numpy (not using for loops)?
This is what I tried, which ends up giving me a ValueError.
I tried to first define a function that takes the sum over k:
def sum_function(i,j):
    initial_array = np.arange(g(i,j),h(i,j)+1)
    applied_array = f(i,j,initial_array)
    return applied_array.sum()
then I tried to create the M matrix with np.mgrid as follows:
ii, jj = np.mgrid[start:fin, start:fin]
M_matrix = sum_function(ii,jj)
--
(Edited)
Let me write down the concrete form of a matrix as an example:
M_{i,j} = \sum_{k=min(i,j)}^{i+j}\sin{\left( (i+j)^k \right)}
If i,j = 0,1, then this matrix is 2 by 2 and its form will be
\bigl(\begin{smallmatrix}
\sin(0) & \sin(1) \\
\sin(1) & \sin(2)+\sin(4)
\end{smallmatrix}\bigr)
Now if the matrix gets really big, how would I create this matrix without using for loops?
To simplify thinking, let's ravel the i,j dimensions into a single ij dimension. Can we evaluate 3 arrays:
G = g(ij) # for all ij values
H = h(ij)
F = f(ij, kk) # for all ij, and all kk
In other words, can g,h,f be evaluated at multiple values, to produce whole-arrays?
If the G and H values were the same for all ij, or subsets (preferably slices), then
F[:, G:H].sum(axis=1)
would be the value for all ij.
If the H-G difference, the size of each slice, was the same, then we can construct a 2d indexing array, GH such that
F[:, GH].sum(axis=1)
In other words we are summing constant size windows of the F rows.
But if the H-G differences vary across ij, I think we are stuck with doing the sum for each ij element separately - with Python level loops, or ones compiled with numba or cython.
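One possible workaround for varying window sizes, not taken from the answer above, is to use prefix sums: once F has been evaluated for all ij and all kk, a cumulative sum along the kk axis turns every window sum into a difference of two entries. The sketch below assumes F is a 2-D float array with one row per ij, and that G and H are integer arrays of inclusive window bounds; the name `window_sums` is hypothetical.

```python
import numpy as np

def window_sums(F, G, H):
    """Return sum(F[r, G[r]:H[r] + 1]) for every row r without a Python loop.

    Hypothetical helper: F has shape (n_rows, n_k); G and H hold inclusive bounds.
    """
    # C[r, m] is the sum of the first m entries of row r, so C[r, 0] == 0
    zeros = np.zeros((F.shape[0], 1))
    C = np.concatenate([zeros, np.cumsum(F, axis=1)], axis=1)
    rows = np.arange(F.shape[0])
    return C[rows, H + 1] - C[rows, G]

rng = np.random.default_rng(1)
F = rng.normal(size=(6, 9))
G = rng.integers(0, 4, size=6)          # window starts
H = G + rng.integers(0, 5, size=6)      # window ends (inclusive), always >= G
print(window_sums(F, G, H))
```

The subtraction can lose precision when the entries of F differ by many orders of magnitude, so this is a trade-off rather than a drop-in replacement for summing each window directly.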
I think I found an answer to this myself. I first create a 3D array F_{i,j,k} = f(i,j,k). Then I create a mask_array whose component is True if g(i,j) <= k <= h(i,j), and False otherwise. Finally I compute the element-wise product of these two arrays, F*mask_array, and take the sum over the k axis.
For example, this matrix can be efficiently created by the following code.
M_{i,j} = \sum_{k=min(i,j)}^{i+j}\sin{\left( (i+j)^k \right)}
# in this example, g(i,j) = min(i,j), h(i,j) = i+j and f(i,j,k) = sin((i+j)^k)
# 0 <= i, j <= 2
# kk should range from min g(i,j) to max h(i,j)
ii, jj, kk = np.mgrid[0:3,0:3,0:5]
# k >= g(i,j): k >= min(i,j) holds if k >= i or k >= j
frm1 = kk >= jj
frm2 = kk >= ii
frm = np.logical_or(frm1,frm2)
# k <= h(i,j)
to = kk <= ii+jj
# mask
k_mask = np.logical_and(frm,to)
def f(i,j,k):
    return np.sin((i+j)**k)
M_before_mask = f(ii,jj,kk)
# Matrix created
M_matrix = (M_before_mask*k_mask).sum(axis=2)
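To gain some confidence in the masking approach, it can be compared against plain loops on a small grid. The sketch below assumes the same g, h and f as in the example; the function names `M_loops` and `M_masked` are hypothetical. Floats are used for the base so that large powers do not overflow an integer dtype.

```python
import numpy as np

def M_loops(n):
    """Reference version with explicit loops over i and j."""
    M = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            k = np.arange(min(i, j), i + j + 1)
            M[i, j] = np.sin(float(i + j) ** k).sum()
    return M

def M_masked(n):
    """Masked version: evaluate f on the full (i, j, k) grid, then zero out unused k."""
    # k runs from 0 up to max h(i,j) = 2*(n-1)
    ii, jj, kk = np.mgrid[0:n, 0:n, 0:2 * n - 1]
    mask = (kk >= np.minimum(ii, jj)) & (kk <= ii + jj)
    vals = np.sin((ii + jj).astype(float) ** kk)
    return np.where(mask, vals, 0.0).sum(axis=2)

print(np.allclose(M_loops(3), M_masked(3)))
```

The masked version allocates an array of roughly 2*n**3 elements, so for large n the memory cost may outweigh the benefit of avoiding the loops.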

Bin average as a function of position

I want to efficiently calculate the average of a variable (say temperature) over multiple areas of the plane.
I essentially want to do the following.
import numpy as np
num = 10000
XYT = np.random.uniform(0, 1, (num, 3))
X = np.transpose(XYT)[0]
Y = np.transpose(XYT)[1]
T = np.transpose(XYT)[2]
size = 10
bins = np.empty((size, size))
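# bin index of every point after rescaling X, Y from [0, 1) to [0, size)
bx = np.floor(X * size).astype(int)
by = np.floor(Y * size).astype(int)
for i in range(size):
    for j in range(size):
        bins[i][j] = T[(bx == i) & (by == j)].mean()  # mean T of the points in bin [i][j]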
I would use pandas (although I'm sure you can achieve basically the same with vanilla numpy):
import pandas
df = pandas.DataFrame({'x': X, 'y': Y, 'temp': T})
# solve quadrants
df['quadrant'] = (df['x']>=0)*2 + (df['y']>=0)*1
# group by and aggregate
mean_per_quadrant = df.groupby(['quadrant'])['temp'].aggregate(['mean'])
You may need to create multiple quadrant cutoffs to get unique groupings.
For example, (df['x']>=50)*4 + (df['x']>=0)*2 + (df['y']>=0)*1 would add an extra 2 quadrants to our group (one y>=0, and one y<0). Just make sure you use powers of 2.
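For a regular grid such as the 10 x 10 one in the question, a plain NumPy sketch is also possible: np.histogram2d with weights=T gives the sum of T per bin, and a second call without weights gives the count per bin, so their ratio is the mean. The seed and the variable names below are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
num, size = 10000, 10
X, Y, T = rng.uniform(0, 1, (3, num))

extent = [[0, 1], [0, 1]]
# Sum of T in every (x, y) bin
sums, xedges, yedges = np.histogram2d(X, Y, bins=size, range=extent, weights=T)
# Number of points in every (x, y) bin
counts, _, _ = np.histogram2d(X, Y, bins=size, range=extent)

with np.errstate(invalid="ignore", divide="ignore"):
    bins = sums / counts   # NaN wherever a bin is empty

print(bins.shape)
```

Here bins[i, j] corresponds to xedges[i] <= X < xedges[i+1] and yedges[j] <= Y < yedges[j+1], with the last bin in each direction also including its upper edge.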

How to do vector-matrix multiplication with conditions?

I want to obtain a list (or array, doesn't matter) of A from the following formula:
A_i = X_(k!=i) * S_(k!=i) * X'_(k!=i)
where:
X is a vector (and X' is the transpose of X), S is a matrix, and the subscript k is defined as {k=1,2,3,...n| k!=i}.
X = [x1, x2, ..., xn]
S = [[s11,s12,...,s1n],
     [s21,s22,...,s2n],
     [... ... ... ..],
     [sn1,sn2,...,snn]]
I take the following as an example:
X = [0.1,0.2,0.3,0.5]
S = [[0.4,0.1,0.3,0.5],
     [2,1.5,2.4,0.6],
     [0.4,0.1,0.3,0.5],
     [2,1.5,2.4,0.6]]
So, eventually, I would get a list of four values for A.
I did this:
import numpy as np
x = np.array([0.1,0.2,0.3,0.5])
s = np.matrix([[0.4,0.1,0.3,0.5],[2,1.5,2.4,0.6],[0.4,0.1,0.3,0.5],[2,1.5,2.4,0.6]])
for k in range(x) if k!=i
A = (x.dot(s)).dot(np.transpose(x))
print (A)
I am confused with how to use a conditional 'for' loop. Could you please help me to solve it? Thanks.
EDIT:
Just to explain more. If you take i=1, then the formula will be:
A_1 = X_(k!=1) * S_(k!=1) * X'_(k!=1)
So any array (or value) associated with subscript 1 will be deleted in X and S. like:
X = [0.2,0.3,0.5]
S = [[1.5,2.4,0.6],
     [0.1,0.3,0.5],
     [1.5,2.4,0.6]]
Step 1: correctly calculate A_i
Step 2: collect them into A
I assume what you want to calculate is
An easy way to do so is to mask away the entries using masked arrays. This way we don't need to delete or copy any matrices.
# sample
x = np.array([1,2,3,4])
s = np.diag([4,5,6,7])
# we will use masked arrays to remove k=i
vec_mask = np.zeros_like(x)
matrix_mask = np.zeros_like(s)
i = 0 # start
# set masks
vec_mask[i] = 1
matrix_mask[i] = matrix_mask[:,i] = 1
s_mask = np.ma.array(s, mask=matrix_mask)
x_mask = np.ma.array(x, mask=vec_mask)
# reduced product, remember using np.ma.inner instead np.inner
Ai = np.ma.inner(np.ma.inner(x_mask, s_mask), x_mask.T)
vec_mask[i] = 0
matrix_mask[i] = matrix_mask[:,i] = 0
As terms of 0 don't add to the sum, we actually can ignore masking the matrix and just mask the vector:
# we will use masked arrays to remove k=i
mask = np.zeros_like(x)
i = 0 # start
# set masks
mask[i] = 1
x_mask = np.ma.array(x, mask=mask)
# reduced product
Ai = np.ma.inner(np.ma.inner(x_mask, s), x_mask.T)
# unset mask
mask[i] = 0
The final step is to assemble A out of the A_is, so in total we get
x = np.array([1,2,3,4])
s = np.diag([4,5,6,7])
mask = np.zeros_like(x)
x_mask = np.ma.array(x, mask=mask)
A = []
for i in range(len(x)):
    x_mask.mask[i] = 1
    Ai = np.ma.inner(np.ma.inner(x_mask, s), x_mask.T)
    A.append(Ai)
    x_mask.mask[i] = 0
A_vec = np.array(A)
Implementing a matrix/vector product using loops will be rather slow in Python. Therefore, I suggest actually deleting the rows/columns/elements at the given index and performing the fast built-in dot product without any explicit loops:
i = 0 # don't forget Python's indices are zero-based
x_ = np.delete(X, i) # remove element
s_ = np.delete(S, i, axis=0) # remove row
s_ = np.delete(s_, i, axis=1) # remove column
result = x_.dot(s_).dot(x_) # no need to transpose a 1-D array
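To obtain all A_i at once, one option (an algebraic shortcut, not part of the answers above) is to start from the full quadratic form and subtract the terms that involve index i: A_i = x'Sx - x_i (Sx)_i - x_i (x'S)_i + x_i^2 S_ii. The last term is added back because the diagonal entry is subtracted twice. The function name `leave_one_out_quadratic` is hypothetical.

```python
import numpy as np

def leave_one_out_quadratic(x, S):
    """Return A with A[i] = x_(k!=i) @ S_(k!=i) @ x_(k!=i) for every i."""
    x = np.asarray(x, dtype=float)
    S = np.asarray(S, dtype=float)
    total = x @ S @ x            # full quadratic form
    row = x * (S @ x)            # terms whose row index is i
    col = x * (x @ S)            # terms whose column index is i
    diag = x**2 * np.diag(S)     # term counted in both row and col
    return total - row - col + diag

x = [0.1, 0.2, 0.3, 0.5]
S = [[0.4, 0.1, 0.3, 0.5],
     [2, 1.5, 2.4, 0.6],
     [0.4, 0.1, 0.3, 0.5],
     [2, 1.5, 2.4, 0.6]]
print(leave_one_out_quadratic(x, S))
```

This needs only a few matrix-vector products, whereas calling np.delete inside a loop copies the matrix once per index.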

Gridwise application of the bisection method

I need to find roots for a generalized state space. That is, I have a discrete grid of dimensions grid=AxBx(...)xX, of which I do not know ex ante how many dimensions it has (the solution should be applicable to any grid.size) .
I want to find the roots (f(z) = 0) for every state z inside grid using the bisection method. Say remainder contains f(z), and I know f'(z) < 0. Then I need to
increase z if remainder > 0
decrease z if remainder < 0
Wlog, say the matrix history of shape (grid.shape, T) contains the history of earlier values of z for every point in the grid and I need to increase z (since remainder > 0). I will then need to select zAlternative inside history[z, :] that is the "smallest of those that are larger than z". In pseudo-code, that is:
zAlternative = hist[z,:][hist[z,:] > z].min()
I had asked this earlier. The solution I was given was
b = sort(history[..., :-1], axis=-1)
mask = b > history[..., -1:]
index = argmax(mask, axis=-1)
indices = tuple([arange(j) for j in b.shape[:-1]])
indices = meshgrid(*indices, indexing='ij', sparse=True)
indices.append(index)
indices = tuple(indices)
lowerZ = history[indices]
b = sort(history[..., :-1], axis=-1)
mask = b <= history[..., -1:]
index = argmax(mask, axis=-1)
indices = tuple([arange(j) for j in b.shape[:-1]])
indices = meshgrid(*indices, indexing='ij', sparse=True)
indices.append(index)
indices = tuple(indices)
higherZ = history[indices]
newZ = history[..., -1]
criterion = 0.05
increase = remainder > 0 + criterion
decrease = remainder < 0 - criterion
newZ[increase] = 0.5*(newZ[increase] + higherZ[increase])
newZ[decrease] = 0.5*(newZ[decrease] + lowerZ[decrease])
However, this code ceases to work for me. I feel extremely bad about admitting it, but I never understood the magic that is happening with the indices, therefore I unfortunately need help.
What the code actually does is give me the lowest or the highest value, respectively. That is, if I fix on two specific z values:
history[z1] = array([0.3, 0.2, 0.1])
history[z2] = array([0.1, 0.2, 0.3])
I will get higherZ[z1] = 0.3 and lowerZ[z2] = 0.1, that is, the extrema. The correct value for both cases would have been 0.2. What's going wrong here?
If needed, in order to generate testing data, you can use something along the lines of
history = tile(array([0.1, 0.3, 0.2, 0.15, 0.13])[newaxis,newaxis,:], (10, 20, 1))
remainder = -1*ones((10, 20))
to test the second case.
Expected outcome
I adjusted the history variable above, to give test cases for both upwards and downwards. Expected outcome would be
lowerZ = 0.1 * ones((10,20))
higherZ = 0.15 * ones((10,20))
Which is, for every point z in history[z, :], the next highest previous value (higherZ) and the next smallest previous value (lowerZ). Since all points z have exactly the same history ([0.1, 0.3, 0.2, 0.15, 0.13]), they will all have the same values for lowerZ and higherZ. Of course, in general, the histories for each z will be different and hence the two matrices will contain potentially different values on every grid point.
I compared what you posted here to the solution for your previous post and noticed some differences.
For the smaller z, you said
mask = b > history[..., -1:]
index = argmax(mask, axis=-1)
They said:
mask = b >= a[..., -1:]
index = np.argmax(mask, axis=-1) - 1
For the larger z, you said
mask = b <= history[..., -1:]
index = argmax(mask, axis=-1)
They said:
mask = b > a[..., -1:]
index = np.argmax(mask, axis=-1)
Using the solution for your previous post, I get:
import numpy as np
history = np.tile(np.array([0.1, 0.3, 0.2, 0.15, 0.13])[np.newaxis,np.newaxis,:], (10, 20, 1))
remainder = -1*np.ones((10, 20))
a = history
# b is a sorted ndarray excluding the most recent observation
# it is sorted along the observation axis
b = np.sort(a[..., :-1], axis=-1)
# mask is a boolean array, comparing the (sorted)
# previous observations to the current observation - [..., -1:]
mask = b > a[..., -1:]
# The next 5 statements build an indexing array.
# True evaluates to one and False evaluates to zero.
# argmax() will return the index of the first True,
# in this case along the last (observations) axis.
# index is an array with the shape of z (2-d for this test data).
# It represents the index of the next greater
# observation for every 'element' of z.
index = np.argmax(mask, axis=-1)
# The next two statements construct arrays of indices
# for every element of z - the first n-1 dimensions of history.
indices = tuple([np.arange(j) for j in b.shape[:-1]])
indices = np.meshgrid(*indices, indexing='ij', sparse=True)
# Adding index to the end of indices (the last dimension of history)
# produces a 'group' of indices that will 'select' a single observation
# for every 'element' of z
indices.append(index)
indices = tuple(indices)
higherZ = b[indices]
mask = b >= a[..., -1:]
# Since b excludes the current observation, we want the
# index just before the next highest observation for lowerZ,
# hence the minus one.
index = np.argmax(mask, axis=-1) - 1
indices = tuple([np.arange(j) for j in b.shape[:-1]])
indices = np.meshgrid(*indices, indexing='ij', sparse=True)
indices.append(index)
indices = tuple(indices)
lowerZ = b[indices]
assert np.all(lowerZ == .1)
assert np.all(higherZ == .15)
which seems to work
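If the sorted array itself is not needed, the index bookkeeping can be avoided with a masking sketch: replace every previous observation that does not qualify by +inf (or -inf) and take the minimum (or maximum) along the last axis. This is an alternative to the code above rather than a restatement of it; the function name `neighbours_in_history` is hypothetical, and points with no higher (or lower) previous observation end up as +inf (or -inf).

```python
import numpy as np

def neighbours_in_history(history):
    """Return (lowerZ, higherZ) relative to the current observation history[..., -1]."""
    cur = history[..., -1:]      # current observation, kept as a length-1 last axis
    prev = history[..., :-1]     # all earlier observations
    # smallest previous value strictly above the current one
    higher = np.where(prev > cur, prev, np.inf).min(axis=-1)
    # largest previous value strictly below the current one
    lower = np.where(prev < cur, prev, -np.inf).max(axis=-1)
    return lower, higher

history = np.tile(np.array([0.1, 0.3, 0.2, 0.15, 0.13])[np.newaxis, np.newaxis, :], (10, 20, 1))
lowerZ, higherZ = neighbours_in_history(history)
print(lowerZ[0, 0], higherZ[0, 0])
```

Because only the last axis is treated specially, the same function works for a grid with any number of dimensions.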
Another approach produces z-shaped arrays for the next highest and lowest observation in history relative to the current observation, given that the current observation is history[...,-1:].
This constructs the higher and lower arrays by manipulating the strides of history to make it easier to iterate over the observations of each element of z.
This is accomplished using numpy.lib.stride_tricks.as_strided and an n-dim generalized function found at Efficient Overlapping Windows with Numpy - I will include its source at the end.
There is a single python loop that has 200 iterations for history.shape of (10,20,x).
import numpy as np
history = np.tile(np.array([0.1, 0.3, 0.2, 0.15, 0.13])[np.newaxis,np.newaxis,:], (10, 20, 1))
remainder = -1*np.ones((10, 20))
z_shape = final_shape = history.shape[:-1]
number_of_observations = history.shape[-1]
# np.prod replaces np.product, which was removed in NumPy 2.0
number_of_elements_in_z = np.prod(z_shape)
# manipulate histories to efficiently iterate over
# the observations of each "element" of z
s = sliding_window(history, (1,1,number_of_observations))
# s.shape will be (number_of_elements_in_z, number_of_observations)
# create arrays of the next lower and next higher observation
lowerZ = np.zeros(number_of_elements_in_z)
higherZ = np.zeros(number_of_elements_in_z)
for ndx, observations in enumerate(s):
    current_observation = observations[-1]
    a = np.sort(observations)
    lowerZ[ndx] = a[a < current_observation][-1]
    higherZ[ndx] = a[a > current_observation][0]
assert np.all(lowerZ == .1)
assert np.all(higherZ == .15)
lowerZ = lowerZ.reshape(z_shape)
higherZ = higherZ.reshape(z_shape)
sliding_window from Efficient Overlapping Windows with Numpy
import numpy as np
from numpy.lib.stride_tricks import as_strided as ast
from itertools import product
def norm_shape(shape):
    '''
    Normalize numpy array shapes so they're always expressed as a tuple,
    even for one-dimensional shapes.
    Parameters
        shape - an int, or a tuple of ints
    Returns
        a shape tuple
    from http://www.johnvinyard.com/blog/?p=268
    '''
    try:
        i = int(shape)
        return (i,)
    except TypeError:
        # shape was not a number
        pass
    try:
        t = tuple(shape)
        return t
    except TypeError:
        # shape was not iterable
        pass
    raise TypeError('shape must be an int, or a tuple of ints')
def sliding_window(a,ws,ss = None,flatten = True):
    '''
    Return a sliding window over a in any number of dimensions
    Parameters:
        a - an n-dimensional numpy array
        ws - an int (a is 1D) or tuple (a is 2D or greater) representing the size
             of each dimension of the window
        ss - an int (a is 1D) or tuple (a is 2D or greater) representing the
             amount to slide the window in each dimension. If not specified, it
             defaults to ws.
        flatten - if True, all slices are flattened, otherwise, there is an
                  extra dimension for each dimension of the input.
    Returns
        an array containing each n-dimensional window from a
    from http://www.johnvinyard.com/blog/?p=268
    '''
    if None is ss:
        # ss was not provided. the windows will not overlap in any direction.
        ss = ws
    ws = norm_shape(ws)
    ss = norm_shape(ss)
    # convert ws, ss, and a.shape to numpy arrays so that we can do math in every
    # dimension at once.
    ws = np.array(ws)
    ss = np.array(ss)
    shape = np.array(a.shape)
    # ensure that ws, ss, and a.shape all have the same number of dimensions
    ls = [len(shape),len(ws),len(ss)]
    if 1 != len(set(ls)):
        error_string = 'a.shape, ws and ss must all have the same length. They were {}'
        raise ValueError(error_string.format(str(ls)))
    # ensure that ws is smaller than a in every dimension
    if np.any(ws > shape):
        error_string = 'ws cannot be larger than a in any dimension. a.shape was {} and ws was {}'
        raise ValueError(error_string.format(str(a.shape),str(ws)))
    # how many slices will there be in each dimension?
    newshape = norm_shape(((shape - ws) // ss) + 1)
    # the shape of the strided array will be the number of slices in each dimension
    # plus the shape of the window (tuple addition)
    newshape += norm_shape(ws)
    # the strides tuple will be the array's strides multiplied by step size, plus
    # the array's strides (tuple addition)
    newstrides = norm_shape(np.array(a.strides) * ss) + a.strides
    strided = ast(a,shape = newshape,strides = newstrides)
    if not flatten:
        return strided
    # Collapse strided so that it has one more dimension than the window. I.e.,
    # the new array is a flat list of slices.
    meat = len(ws) if ws.shape else 0
    firstdim = (np.prod(newshape[:-meat]),) if ws.shape else ()
    dim = firstdim + (newshape[-meat:])
    # remove any dimensions with size 1
    # (a tuple is built explicitly because filter() returns an iterator in Python 3)
    dim = tuple(i for i in dim if i != 1)
    return strided.reshape(dim)

Numpy interconversion between multidimensional and linear indexing

I'm looking for a fast way to interconvert between linear and multidimensional indexing in Numpy.
To make my usage concrete, I have a large collection of N particles, each assigned 5 float values (dimensions) giving an Nx5 array. I then bin each dimension using numpy.digitize with an appropriate choice of bin boundaries to assign each particle a bin in the 5 dimensional space.
import numpy
N = 10
ndims = 5
p = numpy.random.normal(size=(N,ndims))
bbnds = ndims*[None]
for idim in range(ndims):
    bbnds[idim] = numpy.array([-float('inf')]+[-2.,-1.,0.,1.,2.]+[float('inf')])
binassign = ndims*[None]
for idim in range(ndims):
    binassign[idim] = numpy.digitize(p[:,idim],bbnds[idim]) - 1
binassign then contains rows that correspond to the multidimensional index. If I then want to convert the multidimensional index to a linear index, I think I would want to do something like:
linind = numpy.arange(6**5).reshape(6,6,6,6,6)
This would give a look-up for each multidimensional index to map it to a linear index. You could then go back using:
mindx = numpy.unravel_index(x,linind.shape)
Where I'm having difficulties is figuring out how to take binassign (the Nx5 array) containing the multidimensional index in each row, and converting that to a 1d linear index, by using it to slice the linear indexing array linind.
If anyone has a one (or several) line indexing trick to go back and forth between the multidimensional index and the linear index in a way that vectorizes the operation for all N particles, I would appreciate your insight.
You can simply calculate the index of each bin:
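box_indices = numpy.dot(6**numpy.arange(ndims), binassign)
The scalar product simply does 1*x0 + 6*x1 + 6*6*x2 + …, where the base 6 is the number of bins per dimension in the question. With these weights the first index varies fastest, which corresponds to Fortran ordering. This is done very efficiently through NumPy's dot().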
Although I very much like EOL's answer, I wanted to generalize it a bit for non-uniform numbers of bins along each direction, and also to highlight the differences between C and F styles of ordering. Here is an example solution:
import numpy
ndims = 5
N = 10
# Define bin boundaries
binbnds = ndims*[None]
nbins = []
for idim in range(ndims):
    binbnds[idim] = numpy.linspace(-10.0,10.0,numpy.random.randint(2,15))
    binbnds[idim][0] = -float('inf')
    binbnds[idim][-1] = float('inf')
    nbins.append(binbnds[idim].shape[0]-1)
nstates = numpy.cumprod(nbins)[-1]
# Define variable values for N particles in ndims dimensions
p = numpy.random.normal(size=(N,ndims))
# Assign to bins along each dimension
binassign = ndims*[None]
for idim in range(ndims):
    binassign[idim] = numpy.digitize(p[:,idim],binbnds[idim]) - 1
binassign = numpy.array(binassign)
# multidimensional array with elements mapping from multidim to linear index
# Two different arrays for C vs F ordering
linind_C = numpy.arange(nstates).reshape(nbins,order='C')
linind_F = numpy.arange(nstates).reshape(nbins,order='F')
and now make the conversion
# Fast conversion to linear index
b_F = numpy.cumprod([1] + nbins)[:-1]
b_C = numpy.cumprod([1] + nbins[::-1])[:-1][::-1]
box_index_F = numpy.dot(b_F,binassign)
box_index_C = numpy.dot(b_C,binassign)
and to check for correctness:
# Check
print('Checking correct mapping for each particle F order')
for k in range(N):
    ii = box_index_F[k]
    jj = linind_F[tuple(binassign[:,k])]
    print('particle %d %s (%d %d)' % (k,ii == jj,ii,jj))
print('Checking correct mapping for each particle C order')
for k in range(N):
    ii = box_index_C[k]
    jj = linind_C[tuple(binassign[:,k])]
    print('particle %d %s (%d %d)' % (k,ii == jj,ii,jj))
And for completeness, if you want to go back from the 1d index to the multidimensional index in a fast, vectorized-style way:
print('Convert C-style from linear to multi')
x = box_index_C.reshape(-1,1)
# integer division (//) is required in Python 3; / would produce floats
bassign_rev_C = x // b_C % nbins
print('Convert F-style from linear to multi')
x = box_index_F.reshape(-1,1)
bassign_rev_F = x // b_F % nbins
and again to check:
print('Check C-order')
for k in range(N):
    ii = tuple(binassign[:,k])
    jj = tuple(bassign_rev_C[k,:])
    print(ii==jj,ii,jj)
print('Check F-order')
for k in range(N):
    ii = tuple(binassign[:,k])
    jj = tuple(bassign_rev_F[k,:])
    print(ii==jj,ii,jj)
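NumPy also ships functions for both directions, numpy.ravel_multi_index and numpy.unravel_index, which accept an order argument for C or Fortran ordering. The sketch below uses made-up bin counts and random bin assignments (both are assumptions for illustration) and compares the result with the cumprod-based weights used above.

```python
import numpy as np

rng = np.random.default_rng(0)
nbins = (6, 4, 3)   # number of bins per dimension (illustrative values)
N = 8               # number of particles

# binassign[d, k] is the bin of particle k along dimension d
binassign = np.stack([rng.integers(0, n, size=N) for n in nbins])

# Multidimensional index -> linear index
lin_C = np.ravel_multi_index(tuple(binassign), nbins, order="C")
lin_F = np.ravel_multi_index(tuple(binassign), nbins, order="F")

# Linear index -> multidimensional index
back_C = np.array(np.unravel_index(lin_C, nbins, order="C"))
back_F = np.array(np.unravel_index(lin_F, nbins, order="F"))

# Same weights as the manual dot-product version
b_F = np.cumprod((1,) + nbins)[:-1]
b_C = np.cumprod((1,) + nbins[::-1])[:-1][::-1]
print(np.array_equal(lin_F, b_F @ binassign), np.array_equal(lin_C, b_C @ binassign))
```

Unlike the dot product, ravel_multi_index raises an error by default when an index is out of range for its dimension, which can catch binning mistakes early.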
