Numpy: Efficient access to sub-arrays generated by numpy.split - python

I have the following code that generates a list of sub-arrays using the split function. Here, I simply compare the first value of each tuple and, based on the difference, generate the sub-arrays. So far so good.
import numpy as np
f = np.genfromtxt("d_n_isogro_ms.txt", names=True, dtype=None, usecols=(1,-1))
dm = np.absolute(np.diff(f['mz']))
pos = np.where(dm > 2)[0] + 1
fsplit = np.array_split(f, pos)
This is what the sample input looks like (only an excerpt):
[(270.0332, 472) (271.0376, 1936) (272.0443, 11188) (273.0495, 65874)
(274.0517, 8582) (275.0485, 4081) (276.0523, 659) (286.058, 1078)
(287.0624, 4927) (288.0696, 22481) (289.0757, 84001) (290.078, 13688)
(291.0746, 5402) (430.1533, 13995) (431.1577, 2992) (432.1685, 504)]
<type 'numpy.ndarray'>
The position for this particular data is then computed as:
pos = [7,12]
And here is my sample output:
[array([(270.0332, 472), (271.0376, 1936), (272.0443, 11188),
(273.0495, 65874), (274.0517, 8582), (275.0485, 4081),
(276.0523, 659)], dtype=[('mz', '<f8'), ('I', '<i8')]),
array([(286.058, 1078), (287.0624, 4927), (288.0696, 22481),
(289.0757, 84001), (290.078, 13688), (291.0746, 5402)],
dtype=[('mz', '<f8'), ('I', '<i8')]),
array([(430.1533, 13995),
(431.1577, 2992), (432.1685, 504)],
dtype=[('mz', '<f8'), ('I', '<i8')])]
I would like to compute the weighted average for each of these arrays. Is there an efficient way of doing this with numpy? I basically fail at the indexing. Preferably, I would like to use the dtype names to identify the weights and the values.
Maybe one could even do the whole operation on the fly.
Thank you very much for your help in advance.
Best,
Christian

The output of np.array_split is a Python list containing arrays of unequal lengths. The best you can do in that case is a Python loop:
result = [np.average(f_i['mz'], weights=f_i['I']) for f_i in fsplit]
But it is possible to come up with a completely vectorized solution, by using add.reduceat instead of array_split:
dm = np.abs(np.diff(f['mz']))
pos = np.flatnonzero(np.r_[True, dm > 2])
totals = np.add.reduceat(f['mz']*f['I'], pos)
counts = np.add.reduceat(f['I'], pos)
result = totals / counts
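As a quick sanity check (a minimal sketch, assuming the same f, pos and result as above), the vectorized result should agree with the per-group loop:
groups = np.array_split(f, pos[1:])  # pos[0] is 0, so skip it when splitting
looped = np.array([np.average(g['mz'], weights=g['I']) for g in groups])
assert np.allclose(result, looped)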

Related

Getting wrong results with np.argpartition, while selecting maximum n values from an array

I was using an answer to the question 'How do I get indices of N maximum values in a NumPy array?' in my ML model, which outputs LogSoftmax layer values, to get the top 4 classes for each output. In most cases it sorts and returns the values correctly, but in a few cases I see partially unsorted results like this:
arr = np.array([-3.0302, -2.7103, -7.4844, -3.4761, -5.3009, -5.2121, -3.7549, -4.7834,
-5.8870, -3.4839, -5.0104, -3.0992, -4.8823, -0.3319, -6.8084])
ind = np.argpartition(arr, -4)[-4:]
print(arr[ind])
and the output is
[-3.0992 -3.0302 -0.3319 -2.7103]
which is not fully sorted; I expected the maximum value to come last, but that is not the case here. I checked with other examples and they work fine, like
arr = np.array([45, 35, 67.345, -34.5555, 66, -0.23655, 11.0001, 0.234444444])
ind = np.argpartition(arr, -4)[-4:]
print(arr[ind])
output
[35. 45. 66. 67.345]
What could be the reason? Did I miss anything?
If you're not planning on actually utilizing the sorted indices, why not just use np.sort?
>>> arr = np.array([-3.0302, -2.7103, -7.4844, -3.4761, -5.3009, -5.2121, -3.7549,
-4.7834, -5.8870, -3.4839, -5.0104, -3.0992, -4.8823, -0.3319, -6.8084])
>>> np.sort(arr)[-4:]
array([-3.0992, -3.0302, -2.7103, -0.3319])
Alternatively, you could use a range for your kth option on np.argpartition:
arr[np.argpartition(arr, range(0, -4, -1))[-4:]]
array([-3.0992, -3.0302, -2.7103, -0.3319])
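Another common pattern, if you do want the top-k values in order: partition first, then sort only the selected entries (a sketch; the cheap partition step stays, and only 4 values get sorted):
ind = np.argpartition(arr, -4)[-4:]      # top-4 indices, in arbitrary order
ind_sorted = ind[np.argsort(arr[ind])]   # order those 4 indices by their values
print(arr[ind_sorted])                   # ascending, so the maximum comes last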

Randomly select timeframes as tuples from a list of time points

I have several time points taken from a video with some max time length (T). These points are stored in a list of lists as follows:
time_pt_nested_list =
[[0.0, 6.131, 32.892, 43.424, 46.969, 108.493, 142.69, 197.025, 205.793, 244.582, 248.913, 251.518, 258.798, 264.021, 330.02, 428.965],
[11.066, 35.73, 64.784, 151.31, 289.03, 306.285, 328.7, 408.274, 413.64],
[48.447, 229.74, 293.19, 333.343, 404.194, 418.575],
[66.37, 242.16, 356.96, 424.967],
[78.711, 358.789, 403.346],
[84.454, 373.593, 422.384],
[102.734, 394.58],
[158.534],
[210.112],
[247.61],
[340.02],
[365.146],
[372.153]]
Each list above is associated with some probability; I'd like to randomly select points from each list according to its probability to form n tuples of contiguous time spans, such as the following:
[(0,t1),(t1,t2),(t2,t3),...,(tn,T)]
where n is specified by the user. The returned tuples should only contain floating-point numbers from the nested list above. I want the first list to have the highest probability of being sampled and appearing in the returned tuples, the second list a slightly lower probability, and so on. The exact details of these probabilities are not important, but it would be nice if the user could pass a parameter that controls how fast the probability decays as the list index increases.
The returned tuples are timeframes that should exactly cover the entire video and should not overlap. 0 and T may not necessarily appear in time_pt_nested_list (but they may). Are there nice ways to implement this? I would be grateful for any insightful suggestions.
For example if the user inputs 6 as the number of subclips, then this will be an example output:
[(0.0, 32.892), (32.892, 64.784), (64.784, 229.74), (229.74, 306.285), (306.285, 418.575), (418.575, 437.47)]
All numbers appearing in the tuples come from time_pt_nested_list, except 0.0 and 437.47 (well, 0.0 does appear here, but it may not in other cases). Here 437.47 is the length of the video, which is also given and may not appear in the list.
This is simpler than it may look. You really just need to sample n - 1 interior points from your sublists, each with a row-dependent sampling probability; together with the two fixed endpoints, the time-ordered samples give your tuples.
import numpy as np

# user params
n = 6
prob_falloff_param = 0.2

# flatten to (row_index, time) pairs, sorted by time
lin_list = sorted(
    [(idx, el) for idx, row in enumerate(time_pt_nested_list) for el in row],
    key=lambda x: x[1],
)

# endpoints are required, so they are excluded from the random selection process
t0 = lin_list.pop(0)[1]
T = lin_list.pop(-1)[1]
arr = np.array(lin_list)

# define row weights; prob_falloff_param controls how fast the probability decays
weights = np.exp(-prob_falloff_param * arr[:, 0]**2)
norm_weights = weights / np.sum(weights)

# choose (weighted) random interior points, then create the tuple list
random_points = sorted(np.random.choice(arr[:, 1], size=n - 1, replace=False, p=norm_weights))
time_arr = [t0, *random_points, T]
output = list(zip(time_arr, time_arr[1:]))
example outputs:
# n = 6
[(0.0, 78.711),
(78.711, 84.454),
(84.454, 158.534),
(158.534, 210.112),
(210.112, 372.153),
(372.153, 428.965)]
# n = 12
[(0.0, 6.131),
(6.131, 43.424),
(43.424, 64.784),
(64.784, 84.454),
(84.454, 102.734),
(102.734, 210.112),
(210.112, 229.74),
(229.74, 244.582),
(244.582, 264.021),
(264.021, 372.153),
(372.153, 424.967),
(424.967, 428.965)]
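To verify the requirement that the spans are contiguous and cover the whole range, a small check along these lines could help (a sketch, using output, t0 and T as defined in the snippet above):
assert output[0][0] == t0 and output[-1][1] == T
assert all(first[1] == second[0] for first, second in zip(output, output[1:]))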

How to efficiently mutate certain num of values in an array?

Given an initial 2-D array:
initial = [
[0.6711999773979187, 0.1949000060558319],
[-0.09300000220537186, 0.310699999332428],
[-0.03889999911189079, 0.2736999988555908],
[-0.6984000205993652, 0.6407999992370605],
[-0.43619999289512634, 0.5810999870300293],
[0.2825999855995178, 0.21310000121593475],
[0.5551999807357788, -0.18289999663829803],
[0.3447999954223633, 0.2071000039577484],
[-0.1995999962091446, -0.5139999985694885],
[-0.24400000274181366, 0.3154999911785126]]
The goal is to multiply some random values inside the array by a random percentage. Let's say only 3 random numbers get replaced by a random multiplier; we should get something like this:
output = [
[0.6711999773979187, 0.52],
[-0.09300000220537186, 0.310699999332428],
[-0.03889999911189079, 0.2736999988555908],
[-0.6984000205993652, 0.6407999992370605],
[-0.43619999289512634, 0.5810999870300293],
[0.84, 0.21310000121593475],
[0.5551999807357788, -0.18289999663829803],
[0.3447999954223633, 0.2071000039577484],
[-0.1995999962091446, 0.21],
[-0.24400000274181366, 0.3154999911785126]]
I've tried doing this:
import random
import numpy as np

def mutate(array2d, num_changes):
    for _ in range(num_changes):
        row, col = array2d.shape
        rand_row = np.random.randint(row)
        rand_col = np.random.randint(col)
        cell_value = array2d[rand_row][rand_col]
        array2d[rand_row][rand_col] = random.uniform(0, 1) * cell_value
    return array2d
That works for 2D arrays, but there's a chance that the same value is mutated more than once =(
I also don't think it's efficient, and it only works on 2D arrays.
Is there a way to do such a "mutation" on an array of any shape, and more efficiently?
There's no restriction on which values the "mutation" can choose, but the number of mutations should be kept strictly to the user-specified number.
One fairly simple way would be to work with a raveled view of the array. You can generate all your numbers at once that way, and make it easier to guarantee that you won't process the same index twice in one call:
def mutate(array_anyd, num_changes):
    raveled = array_anyd.reshape(-1)
    indices = np.random.choice(raveled.size, size=num_changes, replace=False)
    values = np.random.uniform(0, 1, size=num_changes)
    raveled[indices] *= values
I use array_anyd.reshape(-1) in favor of array_anyd.ravel() because according to the docs, the former is less likely to make an inadvertent copy.
There is of course still such a possibility. You can add an extra check to write back if you need to. A more efficient way would be to use np.unravel_index to avoid creating a view to begin with:
def mutate(array_anyd, num_changes):
    indices = np.random.choice(array_anyd.size, size=num_changes, replace=False)
    indices = np.unravel_index(indices, array_anyd.shape)
    values = np.random.uniform(0, 1, size=num_changes)
    array_anyd[indices] *= values
There is no need to return anything because the modification is done in-place. Conventionally, such functions do not return anything. See for example list.sort vs sorted.
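A minimal usage sketch with the initial data from the question (the nested list must first become a float array, since both versions of mutate modify it in place and return nothing):
arr = np.array(initial)     # float array; mutated in place
mutate(arr, num_changes=3)
print(arr)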
Using np.random.shuffle instead of np.random.choice, this would be a different solution. It works on an array of any shape.
def mutate(arrayIn, num_changes):
    mult = np.zeros(arrayIn.ravel().shape[0])
    mult[:num_changes] = np.random.uniform(0, 1, num_changes)
    np.random.shuffle(mult)
    mult = mult.reshape(arrayIn.shape)
    arrayIn = arrayIn + mult * arrayIn
    return arrayIn
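Unlike the in-place versions above, this one builds and returns a new array, so a usage sketch (again assuming initial from the question) would be:
arr = np.array(initial)
mutated = mutate(arr, num_changes=3)   # arr itself is left unchanged
print(mutated)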

Numpy (n, 1, m) to (n,m)

I am working on a problem which involves a batch of 19 tokens each with 400 features. I get the shape (19,1,400) when concatenating two vectors of size (1, 200) into the final feature vector. If I squeeze the 1 out I am left with (19,) but I am trying to get (19,400). I have tried converting to list, squeezing and raveling but nothing has worked.
Is there a way to convert this array to the correct shape?
def attn_output_concat(sample):
    out_h, state_h = get_output_and_state_history(agent.model, sample)
    attns = get_attentions(state_h)
    inner_outputs = get_inner_outputs(state_h)
    if len(attns) != len(inner_outputs):
        print('Length err')
    else:
        tokens = [np.zeros((400))] * largest
        print(np.array(tokens).shape)
        for j, (attns_token, inner_token) in enumerate(zip(attns, inner_outputs)):
            tokens[j] = np.concatenate([attns_token, inner_token], axis=1)
        print(np.array(tokens).shape)
        return tokens
The easiest way would be to declare tokens as a NumPy array of shape (19, 400) to start with. That's also more memory- and time-efficient. Here's the relevant portion of your code, revised...
import numpy as np

attns_token = np.zeros(shape=(1, 200))
inner_token = np.zeros(shape=(1, 200))
largest = 19

tokens = np.zeros(shape=(largest, 400))
for j in range(largest):
    tokens[j] = np.concatenate([attns_token, inner_token], axis=1)
print(tokens.shape)
BTW... It makes it difficult for people to help you if you don't include a self-contained and runnable segment of code (which is probably why you haven't gotten a response on this yet). Something like the above snippet is preferred and will help you get better answers, because there's less guessing at what you're trying to accomplish.
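For completeness, if the (19, 1, 400) array has already been built, dropping the singleton axis also gives the desired shape. A sketch with placeholder data of that shape:
stacked = np.array([np.concatenate([np.zeros((1, 200)), np.zeros((1, 200))], axis=1)
                    for _ in range(19)])   # shape (19, 1, 400), as in the question
flat = stacked.squeeze(axis=1)             # shape (19, 400); stacked.reshape(19, 400) works too
print(flat.shape)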

Multiple Matrix Multiplications with Numpy

I have about 650 csv-based matrices. I plan on loading each one using Numpy as in the following example:
m1 = numpy.loadtxt(open("matrix1.txt", "rb"), delimiter=",", skiprows=1)
There are matrix2.txt, matrix3.txt, ..., matrix650.txt files that I need to process.
My end goal is to multiply each matrix by each other, meaning I don't necessarily have to maintain 650 matrices but rather just 2 (1 ongoing and 1 that I am currently multiplying my ongoing by.)
Here is an example of what I mean with matrices defined from 1 to n: M1, M2, M3, .., Mn.
M1*M2*M3*...*Mn
The dimensions on all the matrices are the same. The matrices are not square. There are 197 rows and 11 columns. None of the matrices are sparse and every cell comes into play.
What is the best/most efficient way to do this in python?
EDIT: I took what was suggested and got it to work by taking the transpose since it isn't a square matrix. As an addendum to the question, is there a way in Numpy to do element by element multiplication?
A Python 3 solution: if "each matrix by each other" actually means just multiplying them in a chain, and the matrices have compatible dimensions ((n, m) · (m, o) · (o, p) · ...), which you hint at with "(1 ongoing and 1 that...)", then use (if available):
import numpy as np
from functools import partial

fnames = map("matrix{}.txt".format, range(1, 651))
matrices = list(map(partial(np.loadtxt, delimiter=',', skiprows=1), fnames))
np.linalg.multi_dot(matrices)
or:
from functools import reduce, partial
fnames = map("matrix{}.txt".format, range(1, 651))
matrices = map(partial(np.loadtxt, delimiter=',', skiprows=1), fnames)
res = reduce(np.dot, matrices)
Maps etc. are lazy in Python 3, so in the reduce version the files are read only as needed. np.loadtxt doesn't require a pre-opened file; a filename will do.
Doing all the combinations lazily, given that the matrices have the same shape (will do a lot of rereading of data):
from functools import partial
from itertools import starmap, combinations
map_loadtxt = partial(map, partial(np.loadtxt, delimiter=',', skiprows=1))
fname_combs = combinations(map("matrix{}.txt".format, range(1, 651)), 2)
res = list(starmap(np.dot, map(map_loadtxt, fname_combs)))
Using a bit of grouping to reduce reloading of files:
from itertools import groupby, combinations, chain
from functools import partial
from operator import itemgetter
loader = partial(np.loadtxt, delimiter=',', skiprows=1)
fname_pairs = combinations(map("matrix{}.txt".format, range(1, 651)), 2)
groups = groupby(fname_pairs, itemgetter(0))
res = list(chain.from_iterable(
    map(loader(k).dot, map(loader, map(itemgetter(1), g)))
    for k, g in groups
))
Since the matrices are not square, but have the same dimensions, you would have to add transposes before multiplication to match the dimensions. For example either loader(k).T.dot or map(np.transpose, map(loader, ...)).
If on the other hand the question actually was meant to address element wise multiplication, replace np.dot with np.multiply.
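A minimal sketch of that element-wise variant, reusing the lazy loading pattern from above (valid here because all matrices share the same (197, 11) shape):
from functools import reduce, partial
fnames = map("matrix{}.txt".format, range(1, 651))
matrices = map(partial(np.loadtxt, delimiter=',', skiprows=1), fnames)
elementwise_product = reduce(np.multiply, matrices)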
1. Variant: Nice code but reads all matrices at once
import itertools
import numpy as np

matrixFileCount = 3
matrices = [np.loadtxt(open("matrix%s.txt" % i), delimiter=",", skiprows=1)
            for i in range(1, matrixFileCount + 1)]
allC = itertools.combinations([x for x in range(matrixFileCount)], 2)
allCMultiply = [np.dot(matrices[c[0]], matrices[c[1]]) for c in allC]
print(allCMultiply)
2. Variant: Only loads 2 files at once; nice code but a lot of reloading
allCMultiply = []
fileList = ["matrix%s.txt" % x for x in range(1, matrixFileCount + 1)]
allC = itertools.combinations(fileList, 2)
for c in allC:
    m = [np.loadtxt(open(file), delimiter=",", skiprows=1) for file in c]
    allCMultiply.append(np.dot(m[0], m[1]))
print(allCMultiply)
3. Variant: Like the second, but avoids some reloading while keeping only 2 matrices in memory at any one time.
Because the combinations created with itertools come in order, like (1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4), you can sometimes avoid loading both of the 2 matrices.
matrixFileCount = 3
allCMultiply = []
mLoaded = {'file': None, 'matrix': None}
fileList = ["matrix%s.txt" % x for x in range(1, matrixFileCount + 1)]
allC = itertools.combinations(fileList, 2)
for c in allC:
    if c[0] == mLoaded['file']:
        m = [mLoaded['matrix'], np.loadtxt(open(c[1]), delimiter=",", skiprows=1)]
    else:
        mLoaded = {'file': None, 'matrix': None}
        m = [np.loadtxt(open(file), delimiter=",", skiprows=1) for file in c]
        mLoaded = {'file': c[0], 'matrix': m[0]}
    allCMultiply.append(np.dot(m[0], m[1]))
print(allCMultiply)
Performance
If you can load all matrices into memory at once, the first variant is faster than the second, because in the second you reload matrices a lot. The third variant is slower than the first but faster than the second, because it sometimes avoids reloading matrices.
0.943613052368 (Variant 1: 10 matrices of size 2x2, 1000 executions)
7.75622487068 (Variant 2: 10 matrices of size 2x2, 1000 executions)
4.83783197403 (Variant 3: 10 matrices of size 2x2, 1000 executions)
Kordi's answer loads all of the matrices before doing the multiplication. And that's fine if you know the matrices are going to be small. If you want to conserve memory, however, I'd do the following:
import numpy as np
from functools import reduce

def get_dot_product(fnames):
    assert len(fnames) > 0
    accum_val = np.loadtxt(fnames[0], delimiter=',', skiprows=1)
    return reduce(_product_from_file, fnames[1:], accum_val)

def _product_from_file(running_product, fname):
    return running_product.dot(np.loadtxt(fname, delimiter=',', skiprows=1))
If the matrices are large and irregular in shape (not square), there are also optimization algorithms for determining the optimal associative groupings (i.e., where to put the parentheses), but in most cases I doubt it would be worth the overhead of loading and unloading each file twice, once to figure out the associative groupings and then once to carry it out. NumPy is surprisingly fast even on pretty big matrices.
How about a really simple solution avoiding map, reduce and the like? The default NumPy array object does element-wise multiplication with *.
import numpy

size = (197, 11)
result = numpy.ones(size)
for i in range(1, 651):
    result *= numpy.loadtxt(open("matrix{}.txt".format(i), "rb"),
                            delimiter=",", skiprows=1)
