Why is this code using both flatten() and reshape(1,-1)? - python

I have two questions:
Why has the programmer used both reshape(1, -1) and flatten? I think they do the same thing (convert rows to columns).
Is the role of axis=0 to add the remaining n-20 rows vertically to X?
k = 5  # supervising time (k next interval closing price that we are gonna predict)
N = np.size(data, axis=0)  # N = data.shape[0]
winSize = 20  # the size of the sliding window
f = data.shape[1]  # the number of features
X = np.zeros((0, winSize * f))
Ytr = np.zeros((0, 1))
for i in tqdm(range(N - winSize - k)):
    X = np.concatenate((X, data[i: i+winSize].flatten().reshape(1, -1)), axis=0)
    Ytr = np.concatenate((Ytr, label[i+winSize+k].reshape(-1, 1)), axis=0)
I have tried to run the code with only reshape(1, -1) or only flatten(), but I got an error.
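For reference, a minimal sketch (my own, not from the post) of how the two calls differ and why the 2-D row shape matters for the concatenation along axis=0:

import numpy as np

window = np.arange(6).reshape(3, 2)   # a toy window with winSize=3 and f=2

flat = window.flatten()               # shape (6,): a 1-D copy of the window
row = flat.reshape(1, -1)             # shape (1, 6): the same values as a single 2-D row

X = np.zeros((0, 6))
X = np.concatenate((X, row), axis=0)  # works: both operands are 2-D
# np.concatenate((X, flat), axis=0)   # fails: X is 2-D but flat is 1-D
print(flat.shape, row.shape, X.shape)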

Related

Random portfolio simulation using dirichlet distribution with weight boundaries

I'm currently stuck on a random portfolio simulation problem: I'm struggling to generate portfolios that satisfy certain constraints.
The code I have is below:
import numpy as np
from scipy import stats
# n is number of simulation, and width is number of assets
n ,width = 1000000, 38
bound = [0.02, 0.04]
np.random.seed(5) # Set seed
random_weights = np.random.dirichlet(np.ones(width), size = n)
# alphas = np.ones((width,))
# random_weights = np.abs(stats.dirichlet.rvs(alphas, size=n))
# Select only rows that meet weight condition:
cond1 = np.all(bound[0] <= random_weights, axis=1)
cond2 = np.all(random_weights <= bound[1], axis=1)
valid_rows = cond1*cond2
new_weights = random_weights[valid_rows, :]
new_weights ends up being empty.
I have also tried:
weights = np.random.random((n, width))
weights_sum = weights.sum(axis=1)
weights_sum = np.reshape(weights_sum, (n, 1))
# Standardise these weights so that they sum to 1
random_weights = weights / weights_sum
cond1 = np.all(bound[0] <= random_weights, axis=1)
cond2 = np.all(random_weights <= bound[1], axis=1)
valid_rows = cond1*cond2
new_weights = random_weights[valid_rows, :]
new_weights still ends up being empty
Would you be able to advise what a possible solution is, and why this might be happening?
OK, looking at the Dirichlet distribution and following https://math.stackexchange.com/questions/1439299/sum-of-squares-of-random-variables-that-satisfy-x-1-dotsx-n-1/1440385#1440385:
n = 38
E[Xi] = 1/n
Var[Xi] = (n-1)/(n+1) * 1/n^2 ≈ 1/n^2
StdDev[Xi] = sqrt(Var[Xi]) ≈ 1/n
You're trying to get ALL 38 random numbers at once, each with mean 1/38 ≈ 0.026 and standard deviation ≈ 0.026, to fall inside the [0.02, 0.04] interval.
That would be an extremely rare event.
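As a possible workaround (my own sketch, not part of the original answer): a symmetric Dirichlet with a larger concentration parameter has a much smaller per-weight standard deviation (it shrinks roughly as 1/sqrt(alpha)), so the rejection step starts accepting rows. The value alpha = 300 below is an illustrative guess, not a recommendation from the thread:

import numpy as np

n, width = 100000, 38
bound = [0.02, 0.04]
np.random.seed(5)

# Larger alpha concentrates the weights around the mean 1/width,
# so far more rows satisfy the bounds.
alpha = 300.0
random_weights = np.random.dirichlet(np.ones(width) * alpha, size=n)

valid = np.all((random_weights >= bound[0]) & (random_weights <= bound[1]), axis=1)
new_weights = random_weights[valid]
print('acceptance rate:', valid.mean())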

ValueError: cannot reshape array of size 300 into shape (100,100,3)

I'm struggling to reshape my image, which has dimensions (100, 100, 3). The total array for all images has shape (3267, 100, 3).
def get_batch(batch_size, s="train"):
    """Create batch of n pairs, half same class, half different class"""
    if s == 'train':
        X = Xtrain
        X = X.reshape(-1, 100, 100, 3)
        # X = X.reshape(-1, 20, 105, 105)
        categories = train_classes
    else:
        X = Xval
        X = X.reshape(-1, 100, 100, 3)
        categories = val_classes
    n_classes, n_examples, w, h, chan = X.shape
    print(n_classes)
    print(type(n_classes))
    print(n_classes.shape)
    # randomly sample several classes to use in the batch
    categories = rng.choice(n_classes, size=(batch_size,), replace=False)
    # initialize 2 empty arrays for the input image batch
    pairs = [np.zeros((batch_size, h, w, 1)) for i in range(2)]
    # initialize vector for the targets
    targets = np.zeros((batch_size,))
    # make one half of it '1's, so 2nd half of batch has same class
    targets[batch_size//2:] = 1
    for i in range(batch_size):
        category = categories[i]
        idx_1 = rng.randint(0, n_examples)
        pairs[0][i, :, :, :] = X[category, idx_1].reshape(w, h, chan)
        idx_2 = rng.randint(0, n_examples)
        # pick images of same class for 1st half, different for 2nd
        if i >= batch_size // 2:
            category_2 = category
        else:
            # add a random number to the category modulo n classes to ensure 2nd image has a different category
            category_2 = (category + rng.randint(1, n_classes)) % n_classes
        pairs[1][i, :, :, :] = X[category_2, idx_2].reshape(w, h, 1)
    return pairs, targets
However, when executing pairs[0][i,:,:,:] = X[category, idx_1].reshape(w, h, chan) I always get the error that an array of size 300 cannot be reshaped into (100, 100, 3). I honestly don't see why this should be a problem...
Can anybody help me out?
You want to reshape an array of 300 elements into (100, 100, 3). That cannot work, because 100*100*3 = 30000 and 30000 is not equal to 300; you can only reshape when the output shape has the same number of values as the input.
I suggest you use (10, 10, 3) instead, because 10*10*3 = 300.
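A quick demonstration of that rule (my own example, not from the answer):

import numpy as np

a = np.arange(300)
print(a.reshape(10, 10, 3).shape)   # works: 10*10*3 == 300
try:
    a.reshape(100, 100, 3)          # fails: 100*100*3 == 30000 != 300
except ValueError as e:
    print(e)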

IndexError: index 720 is out of bounds for axis 0 with size 720

With the following code I am trying to do cross-validation: splitting the data into k = 5 folds, then picking one fold for validation and the remaining folds for training.
# size of the different sets:
# features (800, 2)
# labels (800,)
lambdaValues = [0.001, 0.01, 0.1, 1, 10, 100]
nFolds = 5
foldSize = int(len(features) / nFolds)
for i in range(len(lambdaValues)):
    for ii in range(0, len(features), foldSize):
        pick_training = np.array(features)
        pick_validation = np.zeros((foldSize, 2))
        training_labels = labels
        validation_labels = np.zeros((foldSize, 2))
        k = ii
        # perform k-fold split (here a five-fold split)
        for n in range(foldSize):
            pick_validation[n] = pick_training[k]
            validation_labels[n] = training_labels[k]
            pick_training = np.delete(pick_training, k, axis=0)
            training_labels = np.delete(training_labels, k, axis=0)
            k += 1
        # do something (cross-validate)
        # do something (cross-validate)
        # do something (cross-validate)
        print(pick_training.shape)    # expected size (640, 2)
        print(pick_validation.shape)  # expected size (160, 2)
However, right now when I run it (still missing some actual computations) I get the error from the title: IndexError: index 720 is out of bounds for axis 0 with size 720.
I can't quite understand why this happens, as the values of n and k look correct, and the size of pick_validation is (160, 2) and pick_training is (640, 2).
Can anyone hint at why it says that index 720 is out of bounds, given that neither of the arrays above is of size 720?
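Not part of the original post, but a minimal sketch of a fold split that avoids deleting rows from a shrinking array (a common cause of this kind of IndexError), assuming features and labels are the (800, 2) and (800,) arrays described above:

import numpy as np

# assuming `features` is the (800, 2) array and `labels` the (800,) array from the question
features = np.asarray(features)
labels = np.asarray(labels)

nFolds = 5
foldSize = len(features) // nFolds

for start in range(0, len(features), foldSize):
    mask = np.zeros(len(features), dtype=bool)
    mask[start:start + foldSize] = True

    pick_validation = features[mask]     # (160, 2)
    validation_labels = labels[mask]     # (160,)
    pick_training = features[~mask]      # (640, 2)
    training_labels = labels[~mask]      # (640,)
    # ... fit and evaluate for each lambda here ...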

How to improve performance of coincidence filtering of a time-series?

I'm working on non-stationary experimental data from fluid dynamics. We have measured data on three channels, so the samples are not directly coincident (measured at the same time). I want to filter them with a window scheme to get coincident samples and discard all others.
Unfortunately, I cannot upload the original data set due to company restrictions, but I have set up a minimal example which generates a similar (smaller) dataset. The original dataset consists of 500000 values per channel, each annotated with an arrival time. The coincidence is checked with these time stamps.
Right now, I loop over each sample of the first channel and look at the time differences to the other channels. If a difference is smaller than the specified window width, the index is saved. It would probably be a bit faster if I specified an interval in which to check for the differences (say 100 or 1000 samples in the neighborhood), but the data rate between the channels can differ significantly, so that is not implemented yet. I would prefer to get rid of looping over each sample, if possible.
def filterCoincidence(df, window=50e-6):
    '''
    Filters the dataset with arbitrary different data rates on different channels to coincident samples.
    The coincidence is checked with regard to a time window specified as argument.
    '''
    AT_cols = [col for col in df.columns if 'AT' in col]
    if len(AT_cols) == 1:
        print('only one group available')
        return
    used_ix = np.zeros((df.shape[0], len(AT_cols)))
    used_ix.fill(np.nan)
    for ix, sample in enumerate(df[AT_cols[0]]):
        used_ix[ix, 0] = ix
        test_ix = np.zeros(2)
        for ii, AT_col in enumerate(AT_cols[1:]):
            diff = np.abs(df[AT_col] - sample)
            index = diff[diff <= window].sort_values().index.values
            if len(index) == 0:
                test_ix[ii] = None
                continue
            test_ix[ii] = [ix_use if (ix_use not in used_ix[:, ii+1] or ix == 0) else None for ix_use in index][0]
        if not np.any(np.isnan(test_ix)):
            used_ix[ix, 1:] = test_ix
        else:
            used_ix[ix, 1:] = [None, None]
    used_ix = used_ix[~np.isnan(used_ix).any(axis=1)]
    print(used_ix.shape)
    return
# imports needed to run the snippet stand-alone
import numpy as np
import pandas as pd

no_points = 10000
no_groups = 3
meas_duration = 60
df = pd.DataFrame(np.transpose([np.sort(np.random.rand(no_points)*meas_duration) for _ in range(no_groups)]), columns=['AT {}'.format(i) for i in range(no_groups)])
filterCoincidence(df, window=1e-3)
Is there an existing module which can do this sort of filtering? In any case, it would be awesome if you could give me some hints to increase the performance of the code.
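One vectorised building block for this kind of check (my own sketch, not from the thread) is np.searchsorted: for each arrival time on a reference channel it finds, without a per-sample Python loop, whether another (sorted) channel has a sample within the window:

import numpy as np

def coincident_mask(ref, other, window=50e-6):
    """Boolean mask over ref: True where `other` (sorted) contains at least
    one sample within +/- window of the reference arrival time."""
    lo = np.searchsorted(other, ref - window, side='left')
    hi = np.searchsorted(other, ref + window, side='right')
    return hi > lo

# toy data similar to the generator above
rng = np.random.default_rng(0)
AT0, AT1 = (np.sort(rng.random(10000) * 60) for _ in range(2))
print(coincident_mask(AT0, AT1, window=1e-3).sum(), 'reference samples have a coincident partner')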
Just to update this thread in case somebody else has a similar problem: I think that after several code revisions I have found a proper solution.
# imports needed for the snippet below
import datetime
import itertools
import numpy as np
import pandas as pd

def filterCoincidence(self, AT1, AT2, AT3, window=0.05e-3):
    '''
    Filters the dataset with arbitrary different data rates on different channels to coincident samples.
    The coincidence is checked with regard to a time window specified as argument.
    - arguments:
        - three time series AT1, AT2 and AT3 (arrival times of particles in my case)
        - window size (50 microseconds as default setting)
    - output: indices of combined samples
    '''
    start_time = datetime.datetime.now()
    AT_list = [AT1, AT2, AT3]
    # take the shortest period of time
    min_EndArrival = np.max(AT_list)
    max_BeginArrival = np.min(AT_list)
    for i, col in enumerate(AT_list):
        min_EndArrival = min(min_EndArrival, np.max(col))
        max_BeginArrival = max(max_BeginArrival, np.min(col))
    for i, col in enumerate(AT_list):
        AT_list[i] = np.delete(AT_list[i], np.where((col < max_BeginArrival - window) | (col > min_EndArrival + window)))
    # get channel with lowest data rate
    num_points = np.zeros(len(AT_list))
    datarate = np.zeros(len(AT_list))
    for i, AT in enumerate(AT_list):
        num_points[i] = AT.shape[0]
        datarate[i] = num_points[i] / (AT[-1] - AT[0])
    used_ref = np.argmin(datarate)
    # process coincidence
    AT_ref_val = AT_list[used_ref]
    AT_list = list(np.delete(AT_list, used_ref))
    overview = np.zeros((AT_ref_val.shape[0], 3), dtype=int)
    overview[:, 0] = np.arange(AT_ref_val.shape[0], dtype=int)
    borders = np.empty(2, dtype=object)
    max_diff = np.zeros(2, dtype=int)
    for i, AT in enumerate(AT_list):
        neighbors_lower = np.searchsorted(AT, AT_ref_val - window, side='left')
        neighbors_upper = np.searchsorted(AT, AT_ref_val + window, side='left')
        borders[i] = np.transpose([neighbors_lower, neighbors_upper])
        coinc_ix = np.where(np.diff(borders[i], axis=1).flatten() != 0)[0]
        max_diff[i] = np.max(np.diff(borders[i], axis=1))
        overview[coinc_ix, i+1] = 1
    use_ix = np.where(~np.any(overview == 0, axis=1))
    borders[0] = borders[0][use_ix]
    borders[1] = borders[1][use_ix]
    overview = overview[use_ix]
    # create all possible combinations referring to the reference
    combinations = np.prod(max_diff)
    test = np.empty((overview.shape[0]*combinations, 3), dtype=object)
    for i, [ref_ix, at1, at2] in enumerate(zip(overview[:, 0], borders[0], borders[1])):
        test[i*combinations:i*combinations+combinations, 0] = ref_ix
        at1 = np.arange(at1[0], at1[1])
        at2 = np.arange(at2[0], at2[1])
        test[i*combinations:i*combinations+at1.shape[0]*at2.shape[0], 1:] = np.asarray(list(itertools.product(at1, at2)))
    test = test[~np.any(pd.isnull(test), axis=1)]
    # check distances
    ix_ref = test[:, 0]
    test = test[:, 1:]
    test = np.insert(test, used_ref, ix_ref, axis=1)
    test = test.astype(int)
    AT_list.insert(used_ref, AT_ref_val)
    AT_mat = np.zeros(test.shape)
    for i, AT in enumerate(AT_list):
        AT_mat[:, i] = AT[test[:, i]]
    distances = np.zeros((test.shape[0], len(list(itertools.combinations(range(3), 2)))))
    for i, AT in enumerate(itertools.combinations(range(3), 2)):
        distances[:, i] = np.abs(AT_mat[:, AT[0]] - AT_mat[:, AT[1]])
    ix = np.where(np.all(distances <= window, axis=1))[0]
    test = test[ix, :]
    distances = distances[ix, :]
    # check duplicates
    # use sum of differences as similarity factor
    dist_sum = np.max(distances, axis=1)
    unique_sorted = np.argsort([np.unique(test[:, i]).shape[0] for i in range(test.shape[1])])[::-1]
    test = np.hstack([test, dist_sum.reshape(-1, 1)])
    test = test[test[:, -1].argsort()]
    for j in unique_sorted:
        _, ix = np.unique(test[:, j], return_index=True)
        test = test[ix, :]
    test = test[:, :3]
    test = test[test[:, used_ref].argsort()]
    # check that all values are after each other
    ix = np.where(np.any(np.diff(test, axis=0) > 0, axis=1))[0]
    ix = np.append(ix, test.shape[0]-1)
    test = test[ix, :]
    print('{} coincident samples obtained in {}.'.format(test.shape[0], datetime.datetime.now()-start_time))
    return test
I'm certain that there is a better solution, but for me it works now. And I know the variable names should definitely be chosen with more clarity (e.g. test), but I will clean up my code at the end of my master's thesis... perhaps :-)

Gridwise application of the bisection method

I need to find roots for a generalized state space. That is, I have a discrete grid of dimensions grid = A x B x (...) x X, of which I do not know ex ante how many dimensions it has (the solution should be applicable to any grid.size).
I want to find the roots (f(z) = 0) for every state z inside grid using the bisection method. Say remainder contains f(z), and I know f'(z) < 0. Then I need to
increase z if remainder > 0
decrease z if remainder < 0
WLOG, say the matrix history of shape (grid.shape, T) contains the history of earlier values of z for every point in the grid, and I need to increase z (since remainder > 0). I will then need to select zAlternative inside history[z, :] that is the "smallest of those that are larger than z". In pseudo-code, that is:
zAlternative = hist[z,:][hist[z,:] > z].min()
I had asked this earlier. The solution I was given was
b = sort(history[..., :-1], axis=-1)
mask = b > history[..., -1:]
index = argmax(mask, axis=-1)
indices = tuple([arange(j) for j in b.shape[:-1]])
indices = meshgrid(*indices, indexing='ij', sparse=True)
indices.append(index)
indices = tuple(indices)
lowerZ = history[indices]
b = sort(history[..., :-1], axis=-1)
mask = b <= history[..., -1:]
index = argmax(mask, axis=-1)
indices = tuple([arange(j) for j in b.shape[:-1]])
indices = meshgrid(*indices, indexing='ij', sparse=True)
indices.append(index)
indices = tuple(indices)
higherZ = history[indices]
newZ = history[..., -1]
criterion = 0.05
increase = remainder > 0 + criterion
decrease = remainder < 0 - criterion
newZ[increase] = 0.5*(newZ[increase] + higherZ[increase])
newZ[decrease] = 0.5*(newZ[decrease] + lowerZ[decrease])
However, this code has stopped working for me. I feel extremely bad about admitting it, but I never understood the magic that is happening with the indices, so I unfortunately need help.
What the code actually does is give me the lowest and the highest value, respectively. That is, if I fix two specific z values:
history[z1] = array([0.3, 0.2, 0.1])
history[z2] = array([0.1, 0.2, 0.3])
I will get higherZ[z1] = 0.3 and lowerZ[z2] = 0.1, that is, the extrema. The correct value for both cases would have been 0.2. What's going wrong here?
If needed, in order to generate testing data, you can use something along the lines of
history = tile(array([0.1, 0.3, 0.2, 0.15, 0.13])[newaxis,newaxis,:], (10, 20, 1))
remainder = -1*ones((10, 20))
to test the second case.
Expected outcome
I adjusted the history variable above, to give test cases for both upwards and downwards. Expected outcome would be
lowerZ = 0.1 * ones((10,20))
higherZ = 0.15 * ones((10,20))
Which is, for every point z in history[z, :], the next highest previous value (higherZ) and the next smallest previous value (lowerZ). Since all points z have exactly the same history ([0.1, 0.3, 0.2, 0.15, 0.13]), they will all have the same values for lowerZ and higherZ. Of course, in general, the histories for each z will be different and hence the two matrices will contain potentially different values on every grid point.
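To make the target behaviour concrete, here is a tiny 1-D sanity check (my own, matching the expected outcome above) for a single grid point:

import numpy as np

hist = np.array([0.1, 0.3, 0.2, 0.15, 0.13])   # one grid point's history
current = hist[-1]                              # 0.13, the most recent value
previous = hist[:-1]

lowerZ = previous[previous < current].max()     # 0.1  - next smaller previous value
higherZ = previous[previous > current].min()    # 0.15 - next larger previous value
print(lowerZ, higherZ)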
I compared what you posted here to the solution for your previous post and noticed some differences.
For the smaller z, you said
mask = b > history[..., -1:]
index = argmax(mask, axis=-1)
They said:
mask = b >= a[..., -1:]
index = np.argmax(mask, axis=-1) - 1
For the larger z, you said
mask = b <= history[..., -1:]
index = argmax(mask, axis=-1)
They said:
mask = b > a[..., -1:]
index = np.argmax(mask, axis=-1)
Using the solution for your previous post, I get:
import numpy as np
history = np.tile(np.array([0.1, 0.3, 0.2, 0.15, 0.13])[np.newaxis,np.newaxis,:], (10, 20, 1))
remainder = -1*np.ones((10, 20))
a = history
# b is a sorted ndarray excluding the most recent observation
# it is sorted along the observation axis
b = np.sort(a[..., :-1], axis=-1)
# mask is a boolean array, comparing the (sorted)
# previous observations to the current observation - [..., -1:]
mask = b > a[..., -1:]
# The next 5 statements build an indexing array.
# True evaluates to one and False evaluates to zero.
# argmax() will return the index of the first True,
# in this case along the last (observations) axis.
# index is an array with the shape of z (2-d for this test data).
# It represents the index of the next greater
# observation for every 'element' of z.
index = np.argmax(mask, axis=-1)
# The next two statements construct arrays of indices
# for every element of z - the first n-1 dimensions of history.
indices = tuple([np.arange(j) for j in b.shape[:-1]])
indices = np.meshgrid(*indices, indexing='ij', sparse=True)
# Adding index to the end of indices (the last dimension of history)
# produces a 'group' of indices that will 'select' a single observation
# for every 'element' of z
indices.append(index)
indices = tuple(indices)
higherZ = b[indices]
mask = b >= a[..., -1:]
# Since b excludes the current observation, we want the
# index just before the next highest observation for lowerZ,
# hence the minus one.
index = np.argmax(mask, axis=-1) - 1
indices = tuple([np.arange(j) for j in b.shape[:-1]])
indices = np.meshgrid(*indices, indexing='ij', sparse=True)
indices.append(index)
indices = tuple(indices)
lowerZ = b[indices]
assert np.all(lowerZ == .1)
assert np.all(higherZ == .15)
which seems to work
z-shaped arrays for the next highest and lowest observation in history relative to the current observation, given that the current observation is history[..., -1:].
This constructs the higher and lower arrays by manipulating the strides of history to make it easier to iterate over the observations of each element of z. This is accomplished using numpy.lib.stride_tricks.as_strided and an n-dimensional generalized function found at Efficient Overlapping Windows with Numpy - its source is included at the end.
There is a single python loop that has 200 iterations for history.shape of (10,20,x).
import numpy as np
history = np.tile(np.array([0.1, 0.3, 0.2, 0.15, 0.13])[np.newaxis,np.newaxis,:], (10, 20, 1))
remainder = -1*np.ones((10, 20))
z_shape = final_shape = history.shape[:-1]
number_of_observations = history.shape[-1]
number_of_elements_in_z = np.product(z_shape)
# manipulate histories to efficiently iterate over
# the observations of each "element" of z
s = sliding_window(history, (1,1,number_of_observations))
# s.shape will be (number_of_elements_in_z, number_of_observations)
# create arrays of the next lower and next higher observation
lowerZ = np.zeros(number_of_elements_in_z)
higherZ = np.zeros(number_of_elements_in_z)
for ndx, observations in enumerate(s):
    current_observation = observations[-1]
    a = np.sort(observations)
    lowerZ[ndx] = a[a < current_observation][-1]
    higherZ[ndx] = a[a > current_observation][0]
assert np.all(lowerZ == .1)
assert np.all(higherZ == .15)
lowerZ = lowerZ.reshape(z_shape)
higherZ = higherZ.reshape(z_shape)
sliding_window from Efficient Overlapping Windows with Numpy
import numpy as np
from numpy.lib.stride_tricks import as_strided as ast
from itertools import product
def norm_shape(shape):
    '''
    Normalize numpy array shapes so they're always expressed as a tuple,
    even for one-dimensional shapes.
    Parameters
        shape - an int, or a tuple of ints
    Returns
        a shape tuple
    from http://www.johnvinyard.com/blog/?p=268
    '''
    try:
        i = int(shape)
        return (i,)
    except TypeError:
        # shape was not a number
        pass
    try:
        t = tuple(shape)
        return t
    except TypeError:
        # shape was not iterable
        pass
    raise TypeError('shape must be an int, or a tuple of ints')
def sliding_window(a, ws, ss=None, flatten=True):
    '''
    Return a sliding window over a in any number of dimensions
    Parameters:
        a  - an n-dimensional numpy array
        ws - an int (a is 1D) or tuple (a is 2D or greater) representing the size
             of each dimension of the window
        ss - an int (a is 1D) or tuple (a is 2D or greater) representing the
             amount to slide the window in each dimension. If not specified, it
             defaults to ws.
        flatten - if True, all slices are flattened, otherwise, there is an
             extra dimension for each dimension of the input.
    Returns
        an array containing each n-dimensional window from a
    from http://www.johnvinyard.com/blog/?p=268
    '''
    if None is ss:
        # ss was not provided. the windows will not overlap in any direction.
        ss = ws
    ws = norm_shape(ws)
    ss = norm_shape(ss)
    # convert ws, ss, and a.shape to numpy arrays so that we can do math in every
    # dimension at once.
    ws = np.array(ws)
    ss = np.array(ss)
    shape = np.array(a.shape)
    # ensure that ws, ss, and a.shape all have the same number of dimensions
    ls = [len(shape), len(ws), len(ss)]
    if 1 != len(set(ls)):
        error_string = 'a.shape, ws and ss must all have the same length. They were {}'
        raise ValueError(error_string.format(str(ls)))
    # ensure that ws is smaller than a in every dimension
    if np.any(ws > shape):
        error_string = 'ws cannot be larger than a in any dimension. a.shape was {} and ws was {}'
        raise ValueError(error_string.format(str(a.shape), str(ws)))
    # how many slices will there be in each dimension?
    newshape = norm_shape(((shape - ws) // ss) + 1)
    # the shape of the strided array will be the number of slices in each dimension
    # plus the shape of the window (tuple addition)
    newshape += norm_shape(ws)
    # the strides tuple will be the array's strides multiplied by step size, plus
    # the array's strides (tuple addition)
    newstrides = norm_shape(np.array(a.strides) * ss) + a.strides
    strided = ast(a, shape=newshape, strides=newstrides)
    if not flatten:
        return strided
    # Collapse strided so that it has one more dimension than the window. I.e.,
    # the new array is a flat list of slices.
    meat = len(ws) if ws.shape else 0
    firstdim = (np.product(newshape[:-meat]),) if ws.shape else ()
    dim = firstdim + (newshape[-meat:])
    # remove any dimensions with size 1
    dim = tuple(filter(lambda i: i != 1, dim))  # tuple() needed on Python 3, where filter() returns an iterator
    return strided.reshape(dim)
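A quick usage check of the helper on the test data above (my own note): each row of the result is one grid point's full history, which is what the loop in the answer iterates over.

import numpy as np

history = np.tile(np.array([0.1, 0.3, 0.2, 0.15, 0.13])[np.newaxis, np.newaxis, :], (10, 20, 1))
s = sliding_window(history, (1, 1, history.shape[-1]))
print(s.shape)   # (200, 5): one row per grid point, 5 observations each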
