Rounding an array to values given in another array - python

Say I have an array:
values = np.array([1.1,2.2,3.3,4.4,2.1,8.4])
I want to round these values to members of an arbitrary array, say:
rounds = np.array([1.,3.5,5.1,6.7,9.2])
ideally returning an array of rounded numbers and an array of the residues:
rounded = np.array([1.,1.,3.5,5.1,1.,9.2])
residues = np.array([-0.1,-1.2,0.2,0.7,-1.1,0.8])
Is there a good pythonic way of doing this?

One option is this:
>>> x = np.subtract.outer(values, rounds)
>>> y = np.argmin(abs(x), axis=1)
And then rounded and residues are, respectively:
>>> rounds[y]
array([ 1. , 1. , 3.5, 5.1, 1. , 9.2])
>>> rounds[y] - values
array([-0.1, -1.2, 0.2, 0.7, -1.1, 0.8])
Essentially x is a 2D array of every value in values minus every value in rounds. y is a 1D array of the index of the minimum absolute value of each row of x. This y is then used to index rounds.
I should caveat this answer by noting that if len(values) * len(rounds) is big (e.g. starting to exceed 10e8), memory usage may become a concern. In that case, you could build y up iteratively instead to avoid allocating one large block of memory for x, for example by processing values in chunks as in the sketch below.
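A minimal sketch of that chunked approach (the helper name and chunk size are my own, not from the original answer):
import numpy as np
def nearest_indices_chunked(values, rounds, chunk_size=10_000):
    # argmin over |values[i] - rounds[j]|, processing values one chunk at a time
    y = np.empty(len(values), dtype=np.intp)
    for start in range(0, len(values), chunk_size):
        stop = start + chunk_size
        diffs = np.subtract.outer(values[start:stop], rounds)
        y[start:stop] = np.argmin(np.abs(diffs), axis=1)
    return y
y = nearest_indices_chunked(values, rounds)
rounded, residues = rounds[y], rounds[y] - values
This only ever holds a chunk_size-by-len(rounds) block in memory at a time.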

As the items in the rounds array are sorted (or if not, sort them first), we can do this in O(n log n) time using numpy.searchsorted:
from functools import partial

def closest(rounds, x):
    ind = np.searchsorted(rounds, x, side='right')
    length = len(rounds)
    if ind == 0:
        return rounds[0]
    elif ind == length:
        return rounds[-1]
    else:
        left, right = rounds[ind-1], rounds[ind]
        return min((left, right), key=lambda y: abs(x - y))

f = partial(closest, rounds)
rounded = np.apply_along_axis(f, 1, values[:, None])[:, 0]
residues = rounded - values
print(repr(rounded))
print(repr(residues))
Output:
array([ 1. , 1. , 3.5, 5.1, 1. , 9.2])
array([-0.1, -1.2, 0.2, 0.7, -1.1, 0.8])

The same time complexity as the answer by Ashwini Chaudhary, but fully vectorized:
def round_to(rounds, values):
    # The main speed is in this line
    I = np.searchsorted(rounds, values)
    # Pad so that we can index easier
    rounds_p = np.pad(rounds, 1, mode='edge')
    # We have to decide between I and I+1
    rounded = np.vstack([rounds_p[I], rounds_p[I+1]])
    residues = rounded - values
    J = np.argmin(np.abs(residues), axis=0)
    K = np.arange(len(values))
    return rounded[J, K], residues[J, K]
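Applied to the arrays from the question, this gives (my own quick check):
values = np.array([1.1, 2.2, 3.3, 4.4, 2.1, 8.4])
rounds = np.array([1., 3.5, 5.1, 6.7, 9.2])
rounded, residues = round_to(rounds, values)
print(rounded)   # [1.  1.  3.5 5.1 1.  9.2]
print(residues)  # [-0.1 -1.2  0.2  0.7 -1.1  0.8]  (up to floating-point noise)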

Find the closest number to x in rounds:
def findClosest(x, rounds):
    return rounds[np.argmin(np.absolute(rounds - x))]
Loop over all values:
rounded = [findClosest(x, rounds) for x in values]
residues = np.array(rounded) - values
This is a straightforward method, but you can be more efficient by using the fact that your rounds array is sorted.
def findClosest(x, rounds):
    for n in range(len(rounds)):
        if x < rounds[n]:
            if n == 0:
                return rounds[n]
            elif rounds[n] - x > x - rounds[n-1]:
                return rounds[n-1]
            else:
                return rounds[n]
    return rounds[-1]
This might be, but is not necessarily, faster than the argmin approach: you lose time to the Python for loop, but you don't have to scan the whole rounds array.

The selected answer is already great. This one may seem convoluted to those who aren't used to more complex list comprehensions, but otherwise it's actually quite clear (IMO) if you're familiar with them.
(Interestingly enough, this happens to run faster than the selected answer. Why would the NumPy version be slower than this? Hmm...)
values = np.array([1.1,2.2,3.3,4.4,2.1,8.4])
rounds = np.array([1.,3.5,5.1,6.7,9.2])
rounded, residues = zip(*[
    [rounds[cIndex], dists[cIndex]]
    for v in values
    for dists in [[r - v for r in rounds]]
    for absDists in [[abs(d) for d in dists]]
    for cIndex in [absDists.index(min(absDists))]
])
print(np.array(rounded))
print(np.array(residues))

Related

Generate a list of 100 elements, with each element having a 50% chance of being 0, and a 50% chance of being a random number between 0 and 1

I am quite new to this and I am trying to learn on my own. As I said in the title, I am trying to create a list of 100 numbers whose elements have either a 50% chance of being 0 or a 50% chance of being a number between 0 and 1. I made it like the one below. It works, but it is tedious and not well coded. Any hints on how to make it better?
import random
import numpy as np
#define a list of 100 random numbers between 0 and 1
randomlist = []
for i in range(0, 100):
    n = random.uniform(0, 1)
    randomlist.append(n)
print(randomlist)
#create a list of 100 numbers of 0's and 1's
def random_binary_string(length):
    sample_values = '01'  # pool of strings
    result_str = ''.join(random.choice(sample_values) for i in range(length))
    return result_str
l = 100
x = random_binary_string(l)
x1 = np.array(list(map(int, x)))
print(x1)
#combine both lists. Keep the value of the binary list if it is equal to zero. Else, substitute it
#with the value of randomlist at the corresponding index position
finalist = []
for i in range(len(x1)):
    if x1[i] == 0:
        finalist.append(x1[i])
    else:
        finalist.append(randomlist[i])
print(finalist)
Thanks a lot!
You can simplify your code by nesting the two conditions. This avoids the need to keep two separate lists in memory and then merge them at the end.
randomlist = []
for i in range(0, 100):
    if random.choice((0, 1)) == 1:
        randomlist.append(random.uniform(0, 1))
    else:
        randomlist.append(0)
This is simple and succinct enough that you can refactor it to a single list comprehension. This is more compact but somewhat less legible.
randomlist = [random.uniform(0,1) if random.choice((0, 1)) else 0 for i in range(0,100)]
Here, we also shorten the code slightly by exploiting the fact that 0 is falsey and 1 is truthy in Python; i.e. they evaluate to False and True, respectively, in a boolean context. So if random.choice((0, 1)) == 1 can be abbreviated to simply if random.choice((0, 1)).
Somewhat obscurely, you can further simplify this (in the sense of using less code) by observing that the expression B if A else A (and here the else branch's 0 is exactly the falsey value of A) can be short-circuited into the expression A and B. This is not very obvious if you are not familiar with boolean logic, but I think you can work it out on paper.
randomlist = [random.choice((0, 1)) and random.uniform(0,1) for i in range(0,100)]
Demo: https://ideone.com/uGHS2Y
You could try doing something like this:
import random

def create_random_list():
    random_list = list()
    for _ in range(100):
        if random.choice((True, False)):
            random_list.append(0)
        else:
            random_list.append(random.uniform(0, 1))
    return random_list

randomly_generated_list = create_random_list()
print(len(randomly_generated_list), randomly_generated_list)
# 100 [x_0, ..., x_99]
I propose this method:
first generate a random list of 'A' and 'B' with random.choice (50% 'A' and 50% 'B'),
then replace each 'A' with a random number between 0 and 1,
and each 'B' with 0.
Code here:
import random
ll = [ random.choice(['A', 'B']) for x in range(200)]
print(ll, len(ll))
for i in range(len(ll)):
    if ll[i] == 'A':
        ll[i] = random.random()
    else:
        ll[i] = 0
print(ll, len(ll))
Shorter code here:
import random
ll = [ random.choice([0, random.random()]) for x in range(200)]
print(ll, len(ll), ll.count(0))
Since you are using NumPy, I would probably do the following:
Create the array of num_el elements using random.uniform
(keeping in mind that the upper bound is excluded: [low, high))
Create a boolean mask with probability p=0.5 of True vs. False using random.choice
Use the mask to set some elements of the array to zero via boolean indexing
That's the code:
num_el = 10
p = 0.5
res = np.random.uniform(0., 1., size=(1, num_el))
bool_mat = np.random.choice(a=[False, True], size=(1, num_el), p=[p, 1-p])
res[bool_mat] = 0.
res
# array([[0. , 0.51213168, 0. , 0.68230528, 0.5287728 ,
# 0.9072587 , 0. , 0.43078057, 0.89735872, 0. ]])
The approach to use depends on whether your objective is to get exactly half of the outcomes to be zeroes, or have the expected number of zeros be half the total. It wasn't clear from your question which way you viewed the problem, so I've implemented both approaches as functions.
If you want a deterministic fixed proportion of zeroes/non-zeroes, the first function in the code below will do the trick. It creates a list with the desired number of zeros and non-zeros, and then uses shuffling (which I timed to be faster than sampling). If you want exactly half, then obviously the argument n has to be even.
If your goal is a probabilistic 50% zeroes, use the second function.
import random

# Exactly floor(n / 2) outcomes are zeros, i.e., exactly half when n is even.
# This version is trivial to modify to give any desired proportion of zeros.
def make_rand_list_v1(n=100):
    m = n // 2
    n -= m
    ary = [random.random() for _ in range(n)] + [0] * m
    random.shuffle(ary)
    return ary

# Each outcome has probability 0.5 of being zero
def make_rand_list_v2(n=100):
    return [random.getrandbits(1) and random.uniform(0, 1) for _ in range(n)]
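A quick way to see the difference between the two (my own usage example; the counts in the comments are what you should expect, not captured output):
lst1 = make_rand_list_v1(100)
print(lst1.count(0))              # exactly 50 zeros, every run
lst2 = make_rand_list_v2(100)
print(sum(v == 0 for v in lst2))  # roughly 50 zeros, varies from run to run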

Tuple-like (lexicographical) max in numpy

I find myself running into the following situation in numpy multiple times over the past couple of months, and I cannot imagine there is not a proper solution for it.
I have a 2d array, let's say
x = np.array([
[1, 2, 3],
[2, -5, .333],
[1, 4, 2],
[2, -5, 4]])
Now I would like to sort / get the maximum / argsort / argmax / etc. this array in such a way that it first compares the first column. If the first column is equal, it compares the second column, and then the third. For our example this means:
# max like python: max(x.tolist())
np.tuple_like_amax(x) = np.array([2, -5, 4])
# argmax doesn't have a python equivalent, but something like: [i for i, e in enumerate(x.tolist()) if e == max(x.tolist())][0]
np.tuple_like_argmax = 3
# sorting like python: sorted(x.tolist())
np.tuple_like_sort(x) = np.array([[1.0, 2.0, 3.0], [1.0, 4.0, 2.0], [2.0, -5.0, 0.333], [2.0, -5.0, 4.0]])
# argsort doesn't have python equivalent, but something like: sorted(range(len(x)), key=lambda i: x[i].tolist())
np.tuple_like_argsort(x) = np.array([0, 2, 1, 3])
This is exactly the way python compares tuples (so actually just calling max(x.tolist()) does the trick here for max). It does feel, however, like a waste of time and memory to first convert the array to a python list, and in addition I would like to use things like argmax, sort and all the other great numpy functions.
So just to be clear, I'm not interested in python code that mimics an argmax, but for something that achieves this without converting the lists to python lists.
Found so far:
np.sort seems to work on structured arrays when order= is given. It does feel to me that creating a structured array and then using this method is overkill. Also, argmax doesn't seem to support this, meaning that one would have to use argsort, which has a much higher complexity.
Here I will focus only on finding the lexicographic argmax (the others: max, argmin, and min can be found trivially from argmax). In addition, unlike np.argmax(), we will return all rows that are at rank 0 (if there are duplicate rows), i.e. all the indices where the row is the lexicographic maximum.
The idea is that, for the "tuple-like order" desired here, the function is really:
find all indices where the first column has the maximum;
break ties using the places where the second column is max, under the condition that the first column is max;
etc., as long as there are ties to break (and more columns).
def ixmax(x, k=0, idx=None):
    col = x[idx, k] if idx is not None else x[:, k]
    z = np.where(col == col.max())[0]
    return z if idx is None else idx[z]

def lexargmax(x):
    idx = None
    for k in range(x.shape[1]):
        idx = ixmax(x, k, idx)
        if len(idx) < 2:
            break
    return idx
At first, I was worried that the explicit looping in Python would kill it. But it turns out that it is quite fast. In the case where there are no ties (more likely with independent float values, for instance), it returns immediately after a single np.where(x[:, 0] == x[:, 0].max()). Only in the case of ties do we need to look at the (much smaller) subset of rows that were tied. In unfavorable conditions (many repeated values in all columns), it is still ~100x faster or more than the partition method, and faster than lexsort() by a factor of O(log n), of course.
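As a quick sanity check on the x from the question (my own usage example, not part of the original tests):
x = np.array([
    [1, 2, 3],
    [2, -5, .333],
    [1, 4, 2],
    [2, -5, 4]])
print(lexargmax(x))      # [3]  (row [2, -5, 4] is the lexicographic maximum)
print(x[lexargmax(x)])   # [[ 2.    -5.     4.   ]]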
Test 1: correctness
for i in range(1000):
    x = np.random.randint(0, 10, size=(1000, 8))
    found = lexargmax(x)
    assert lexargmax_by_sort(x) in found and np.unique(x[found], axis=0).shape[0] == 1
(where lexargmax_by_sort is np.lexsort(x[:, ::-1].T)[-1])
Test 2: speed
x = np.random.randint(0, 10, size=(100_000, 100))
a = %timeit -o lexargmax(x)
# 776 µs ± 313 ns per loop
b = %timeit -o lexargmax_by_sort(x)
# 507 ms ± 2.65 ms per loop
# b.average / a.average: 652
c = %timeit -o lexargmax_by_partition(x)
# 141 ms ± 2.38 ms
# c.average / a.average: 182
(where lexargmax_by_partition is based on @MadPhysicist's very elegant idea:
def lexargmax_by_partition(x):
    view = np.ndarray(x.shape[0], dtype=[('', x.dtype)] * x.shape[1], buffer=x)
    return np.argpartition(view, -1)[-1]
)
After some more testing on various sizes, we get the following time measurements and performance ratios (the plots themselves are not reproduced here). Interestingly, lexargmax_by_partition execution time seems fairly independent of m, the number of columns, whereas our lexargmax depends a little bit on it. I believe that reflects the fact that, in this setting (purposeful collisions of max in each column), the more columns we have, the "deeper" we need to go when breaking ties.
Previous (wrong) answer
To find the argmax of the row by lexicographic order, I was thinking you could do:
def lexmax(x):
    r = (2.0 ** np.arange(x.shape[1]))[::-1]
    return np.argmax(((x == x.max(axis=0)) * r).sum(axis=1))
Explanation:
x == x.max(axis=0) (as an int) is 1 for each element that is equal to the column's max. In your example, it is (astype(int)):
[[0 0 0]
[1 0 0]
[0 1 0]
[1 0 1]]
then we multiply by a column weight that is more than the sum of 1's on the right. Powers of two achieve that. We do it in float to address cases with more than 64 columns.
But this is fatally flawed: The positions of max in the second column should be considered only in the subset where the first column had the max value (to break the tie).
Other approaches, including affine transformations of all columns so that we can sum them and find the max, don't work either: if the max in column 0 is, say, 1.0, and there is a second place at 0.999, then we would have to know that 0.001 difference ahead of time and make sure no combination of values from the columns to the right could sum up to overtake it. So, that's a dead end.
To sort the rows of the array lexicographically, you can use np.lexsort. The only catch is that it sorts by the last key first, so you have to reverse the order of the columns:
index = np.lexsort(x.T[::-1])
OR
index = np.lexsort(x[:, ::-1].T)
This is "argsort". You can make it into "sort" by doing
x[index]
"min" and "max" can be done trivially by using the index:
xmin = x[index[0]]
xmax = x[index[-1]]
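For the x in the question, this gives (my own check of the above):
index = np.lexsort(x[:, ::-1].T)
print(index)          # [0 2 1 3]
print(x[index[-1]])   # [ 2. -5.  4.]  (the lexicographic max)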
Alternatively, you can use a technique I suggested in one of my questions: Sorting array of objects by row using custom dtype. The idea is to make each row into a structure that has a field for each element:
view = np.ndarray(x.shape[0], dtype=[('', x.dtype)] * x.shape[1], buffer=x)
You can sort the array in-place by operating on the view:
>>> view.sort()
>>> x
array([[ 1. , 2. , 3. ],
[ 1. , 4. , 2. ],
[ 2. , -5. , 0.333],
[ 2. , -5. , 4. ]])
That's because the ndarray constructor points to x as the original buffer.
You can not get argmin, argmax, min and max to work on the result. However, you can still get the min and max in O(n) time using my favorite function in all of numpy: np.partition:
view.partition([0, -1])
xmin = x[0]
xmax = x[-1]
You can use argpartition on the array as well to get the indices of the desired elements:
index = view.argpartition([0, -1])[[0, -1]]
xmin = x[index[0]]
xmax = x[index[-1]]
Notice that both sort and partition have an order argument that you can use to rearrange the comparison of the columns.
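For example, the unnamed fields in this dtype get NumPy's automatic names 'f0', 'f1', 'f2', so (assuming that default naming) you could compare the last column first:
view.sort(order=('f2', 'f0', 'f1'))  # compare column 2 first, then column 0, then column 1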

Script in function to write find second smallest value if smallest value fulfills a particular condition

For a set of matrices that I have, called distance_matrix (these exist in a function that generates all of them over a given range), I need to find the smallest value in each matrix, identified of course by an index pair, for which I have this code:
min_indices = np.unravel_index(np.argmin(np.abs(distance_matrix)),np.shape(distance_matrix))
This works perfectly well. But now I need to write something that finds the second lowest value if the indices returned by the above code are (0,0). I guess I can't use the code above, since you can't modify it to find the next value (as far as I can tell). I've tried with an if loop, but that's not quite working:
sdiff = np.diff(np.sign(np.diff(distance_matrix)))
rising_1 = (sdiff == 2)
rising_2 = (sdiff[:-1] == 1) & (sdiff[1:] == 1)
rising_all = rising_1
rising_all[1:] = rising_all[1:] | rising_2
min_ind = np.where(rising_all)[0] + 1
minima = list(zip(min_ind, distance_matrix[min_ind]))
for ind_pair in range(0, len(minima)):
    if ind_pair == (0, 0):
        minima = sorted(minima, key=lambda pair: pair[1])[1]
    else:
        minima = sorted(minima, key=lambda pair: pair[1])[0]
Supposing, then, that the distance matrix is 2-dimensional, and using the following test data:
distance_matrix = np.array([[0. , 1. , 2. ],
[1. , 0.5, 1.5],
[2. , 1.5, 2. ]])
Now,
np.unravel_index(
np.argmin(np.abs(distance_matrix)),
np.shape(distance_matrix)
)
returns (0, 0) for you, which is what you don't want, of course. But is there a reason you can't just make that impossible by using the following:
mask = np.ones(np.shape(distance_matrix))
mask[0, 0] = np.nan # you can put this in a loop if there is
# more than one coordinate set you don't want
distance_matrix * mask
# array([[nan, 1. , 2. ],
# [1. , 0.5, 1.5],
# [2. , 1.5, 2. ]])
np.unravel_index(
np.nanargmin(np.abs(distance_matrix * mask)),
np.shape(distance_matrix)
)
# (1, 1)
Note that nanargmin is a version of argmin that ignores NaNs.
You can find the indices of the first two smallest elements at once using np.argpartition. Partitioning on the second element of the array guarantees that the first element will be smaller, so you get both.
indices2 = np.unravel_index(
    np.argpartition(np.abs(distance_matrix), 1, None),
    distance_matrix.shape)
You must pass an axis of None to ravel the index instead of partitioning the columns. Otherwise you will not be able to call np.unravel_index directly on the result.
If you want each index separately:
i = np.argpartition(np.abs(distance_matrix), 1, None)
indices = np.unravel_index(i[0], distance_matrix.shape)
indices2 = np.unravel_index(i[1], distance_matrix.shape)
If you want to get more minima, you will have to sort the left side of the array.
k = 5
i = np.argpartition(np.abs(distance_matrix), k, None)[:k]
i.sort()
indices = np.unravel_index(i, distance_matrix.shape)
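Applied to the test matrix from the other answer (my own check; the comments show what the partition yields for that data):
distance_matrix = np.array([[0. , 1. , 2. ],
                            [1. , 0.5, 1.5],
                            [2. , 1.5, 2. ]])
i = np.argpartition(np.abs(distance_matrix), 1, None)
indices  = np.unravel_index(i[0], distance_matrix.shape)   # (0, 0): the smallest value, 0.0
indices2 = np.unravel_index(i[1], distance_matrix.shape)   # (1, 1): the second smallest, 0.5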

Numpy find number of occurrences in a 2D array

Is there a numpy function to count the number of occurrences of a certain value in a 2D numpy array? E.g.
np.random.random((3,3))
array([[ 0.68878371, 0.2511641 , 0.05677177],
[ 0.97784099, 0.96051717, 0.83723156],
[ 0.49460617, 0.24623311, 0.86396798]])
How do I find the number of times 0.83723156 occurs in this array?
arr = np.random.random((3,3))
# elements exactly equal to the value we are looking for
condition = arr == 0.83723156
# count the matching elements
np.count_nonzero(condition)
The value of condition is a boolean array indicating whether each element of the array satisfied the condition. np.count_nonzero counts how many nonzero elements are in an array; for booleans, it counts the number of elements with a True value.
To be able to deal with floating point accuracy, you could do something like this instead:
condition = np.fabs(arr - 0.83723156) < 0.001
For floating point arrays np.isclose is a much better option than either comparing with the exact same element or defining a custom range.
>>> a = np.array([[ 0.68878371, 0.2511641 , 0.05677177],
[ 0.97784099, 0.96051717, 0.83723156],
[ 0.49460617, 0.24623311, 0.86396798]])
>>> np.isclose(a, 0.83723156).sum()
1
Note that real numbers are not represented exactly in a computer, that is why np.isclose will work while == doesn't:
>>> (0.1 + 0.2) == 0.3
False
Instead:
>>> np.isclose(0.1 + 0.2, 0.3)
True
To count the number of times x appears in any array, you can simply sum the boolean array that results from a == x:
>>> col = numpy.arange(3)
>>> cols = numpy.tile(col, 3)
>>> (cols == 1).sum()
3
It should go without saying, but I'll say it anyway: this is not very useful with floating point numbers unless you specify a range, like so:
>>> a = numpy.random.random((3, 3))
>>> ((a > 0.5) & (a < 0.75)).sum()
2
This general principle works for all sorts of tests. For example, if you want to count the number of floating point values that are integral:
>>> a = numpy.random.random((3, 3)) * 10
>>> a
array([[ 7.33955747, 0.89195947, 4.70725211],
[ 6.63686955, 5.98693505, 4.47567936],
[ 1.36965745, 5.01869306, 5.89245242]])
>>> a.astype(int)
array([[7, 0, 4],
[6, 5, 4],
[1, 5, 5]])
>>> (a == a.astype(int)).sum()
0
>>> a[1, 1] = 8
>>> (a == a.astype(int)).sum()
1
You can also use np.isclose() as described by Imanol Luengo, depending on what your goal is. But often, it's more useful to know whether values are in a range than to know whether they are arbitrarily close to some arbitrary value.
The problem with isclose is that its default tolerance values (rtol and atol) are arbitrary, and the results it generates are not always obvious or easy to predict. To deal with complex floating point arithmetic, it does even more floating point arithmetic! A simple range is much easier to reason about precisely. (This is an expression of a more general principle: first, do the simplest thing that could possibly work.)
Still, isclose and its cousin allclose have their uses. I usually use them to see if a whole array is very similar to another whole array, which doesn't seem to be your question.
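For instance (a tiny illustration of that whole-array use, not from the original answer):
a = np.array([0.1 + 0.2, 1.0])
b = np.array([0.3, 1.0])
print(a == b)             # [False  True]  (exact comparison trips over rounding)
print(np.allclose(a, b))  # True  (the arrays agree within the default tolerances)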
If it may be of use to anyone: for very large 2D arrays, if you want to count how many times every element appears within the entire array, you can flatten the array and count how many times each element appears:
from itertools import chain
from collections import Counter
# the large array is called arr
flatten_arr = list(chain.from_iterable(arr))
dico_nodeid_appearence = Counter(flatten_arr)
# how many times x appeared in arr
dico_nodeid_appearence[x]

Speed up minimum search in Numpy/Python

I have two floating arrays and want to find data points which match within a certain range.
This is what I got so far:
import numpy as np
for vx in range(len(arr1)):
    match = (np.abs(arr2 - arr1[vx])).argmin()
    if abs(arr1[vx] - arr2[match]) < 0.375:
        point = arr2[match]
The problem is that arr1 contains 150000 elements and arr2 around 110000 elements. This takes an awful lot of time. Do you have suggestions to speed things up?
In addition to not being vectorized, your current search is O(n * m), where n is the size of arr2 and m is the size of arr1. In these kinds of searches it helps to sort arr1 or arr2 so you can use a binary search. Sorting ends up being the slowest step, but it's still faster if m is large, because the O(n log n) sort is faster than O(n * m).
Here is how you can do the search in a vectorized way using the sorted array:
def find_closest(A, target):
    # A must be sorted
    idx = A.searchsorted(target)
    idx = np.clip(idx, 1, len(A) - 1)
    left = A[idx - 1]
    right = A[idx]
    idx -= target - left < right - target
    return A[idx]

arr2.sort()
closest = find_closest(arr2, arr1)
closest = np.where(abs(closest - arr1) < .375, closest, np.nan)
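A small worked example with made-up data (arr1 and arr2 here are placeholders for your real arrays):
arr2 = np.array([0.2, 1.0, 2.5, 3.7])   # hypothetical reference values
arr1 = np.array([0.9, 3.0, 10.0])       # hypothetical query values
arr2.sort()
closest = find_closest(arr2, arr1)
closest = np.where(abs(closest - arr1) < .375, closest, np.nan)
print(closest)   # [ 1. nan nan]  (only 0.9 has a match within 0.375)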
The whole idea of using numpy is to avoid computation with loops.
Specifying criteria and extracting a new array that satisfies them can be implemented easily with array computation. Here's an example extracting the values from array a whose absolute difference from the corresponding element in array b is less than 0.75:
a = np.array([1, 0, 0.5, 1.2])
b = np.array([1.2, 1.1, 1.3, 1.4])
c = a[abs(a - b) < 0.75]
Which gives us
array([ 1. , 1.2])
