Numpy find number of occurrences in a 2D array - python

Is there a numpy function to count the number of occurrences of a certain value in a 2D numpy array? E.g.:
np.random.random((3,3))
array([[ 0.68878371,  0.2511641 ,  0.05677177],
       [ 0.97784099,  0.96051717,  0.83723156],
       [ 0.49460617,  0.24623311,  0.86396798]])
How do I find the number of times 0.83723156 occurs in this array?

arr = np.random.random((3,3))
# find the elements exactly equal to the target value
condition = arr == 0.83723156
# count the elements
np.count_nonzero(condition)
The value of condition is a boolean array indicating whether each element of the array satisfies the condition. np.count_nonzero counts how many nonzero elements are in an array; in the case of booleans, it counts the number of elements with a True value.
To deal with floating point accuracy, you could do something like this instead:
condition = np.fabs(arr - 0.83723156) < 0.001
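Putting it together (a sketch; the 0.001 tolerance is arbitrary and should be chosen to match the precision of your data):
arr = np.random.random((3, 3))
# count the elements within 0.001 of the target value
np.count_nonzero(np.fabs(arr - 0.83723156) < 0.001)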

For floating point arrays, np.isclose is a much better option than either comparing for exact equality or defining a custom tolerance range.
>>> a = np.array([[ 0.68878371,  0.2511641 ,  0.05677177],
...               [ 0.97784099,  0.96051717,  0.83723156],
...               [ 0.49460617,  0.24623311,  0.86396798]])
>>> np.isclose(a, 0.83723156).sum()
1
Note that real numbers are not represented exactly in a computer, which is why np.isclose works while == doesn't:
>>> (0.1 + 0.2) == 0.3
False
Instead:
>>> np.isclose(0.1 + 0.2, 0.3)
True

To count the number of times x appears in any array, you can simply sum the boolean array that results from a == x:
>>> col = numpy.arange(3)
>>> cols = numpy.tile(col, 3)
>>> (cols == 1).sum()
3
It should go without saying, but I'll say it anyway: this is not very useful with floating point numbers unless you specify a range, like so:
>>> a = numpy.random.random((3, 3))
>>> ((a > 0.5) & (a < 0.75)).sum()
2
This general principle works for all sorts of tests. For example, if you want to count the number of floating point values that are integral:
>>> a = numpy.random.random((3, 3)) * 10
>>> a
array([[ 7.33955747,  0.89195947,  4.70725211],
       [ 6.63686955,  5.98693505,  4.47567936],
       [ 1.36965745,  5.01869306,  5.89245242]])
>>> a.astype(int)
array([[7, 0, 4],
       [6, 5, 4],
       [1, 5, 5]])
>>> (a == a.astype(int)).sum()
0
>>> a[1, 1] = 8
>>> (a == a.astype(int)).sum()
1
You can also use np.isclose() as described by Imanol Luengo, depending on what your goal is. But often, it's more useful to know whether values are in a range than to know whether they are arbitrarily close to some arbitrary value.
The problem with isclose is that its default tolerance values (rtol and atol) are arbitrary, and the results it generates are not always obvious or easy to predict. To deal with complex floating point arithmetic, it does even more floating point arithmetic! A simple range is much easier to reason about precisely. (This is an expression of a more general principle: first, do the simplest thing that could possibly work.)
Still, isclose and its cousin allclose have their uses. I usually use them to see if a whole array is very similar to another whole array, which doesn't seem to be your question.
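For example, a minimal sketch of that whole-array use:
>>> b = a + 1e-9
>>> numpy.allclose(a, b)
True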

If it may be of use to anyone: for very large 2D arrays, if you want to count how many times each element appears within the entire array, you can flatten the array into a list and then count the occurrences of each element:
from itertools import chain
from collections import Counter

# the large array is called arr
flat_arr = list(chain.from_iterable(arr))
counts = Counter(flat_arr)
# how many times x appeared in arr
counts[x]
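As an aside (a sketch not in the original answer), numpy itself can produce the same per-element counts in a vectorized way:
vals, counts = np.unique(arr, return_counts=True)
count_lookup = dict(zip(vals, counts))
count_lookup[x]  # assumes x actually occurs in arr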

Related

Is there a way to remove specific elements in an array using numpy functions?

Is there a way to remove specific elements from an array that meet certain criteria (such as a condition on the values), using numpy.delete, a boolean mask, or any other numpy function?
For example:
import numpy as np
arr = np.random.chisquare(6, 10)
array([4.61518458, 4.80728541, 4.59749491, 3.44053946, 5.52507358,
       7.97092747, 2.01946678, 6.26877508, 3.68286537, 2.06759469])
Now for test purposes I would like to know if I can use some numpy function to remove all elements that are divisible by the given value k
>>> np.delete(arr, 1, 0)
[4.61518458 4.59749491 3.44053946 5.52507358 7.97092747 2.01946678
 6.26877508 3.68286537 2.06759469]
The delete(arr, 1, 0) call only removes the value at that position. Is there a way to delete multiple values based on a lambda, or on a condition like the one I mentioned above?
Yes, this is numpy's boolean ("mask") indexing. You use a comparison operator to produce an array of booleans, with True for the elements to keep and False for the ones to toss. So, for example, to keep all the elements less than 5:
selections = array < 5
array = array[selections]
That will only keep the elements where selections is True.
Of course, since all your values are floats, they aren't going to be divisible by an integer k, but that's another story.
To remove the values divisible by a given k, building on Tim's answer, keep the elements whose remainder is nonzero:
k = 6  # a number
array = array[array % k != 0]
Since you're looking at floating point division, and will therefore be subject to numerical limitations, you shouldn't expect the result of the division to be exact. Instead, I would suggest removing all the numbers that are almost divisible by k.
For your problem I would set a threshold and use np.logical_and:
arr[np.logical_and(arr % k > threshold, (k - (arr % k)) > threshold)]
Explanation
Consider the following problem:
k = 1.0000002300000000450001000101
x = np.array([k * i for i in range(1,10)] + [0.5,])
#array([1.00000023, 2.00000046, 3.00000069, 4.00000092, 5.00000115,
# 6.00000138, 7.00000161, 8.00000184, 9.00000207, 0.5])
In theory, all the numbers but the last one (0.5) should be divisible by k exactly. In reality, numerical precision limits that capability (if you really want to dig into why, I'd refer to the link above on floating point arithmetic).
np.where(x%k==0)
#(array([0, 1, 2, 3, 5, 7], dtype=int64),)
x[x%k==0]
#array([1.00000023, 2.00000046, 3.00000069, 4.00000092, 6.00000138,
# 8.00000184])
We've missed a few that we would like to have been caught (x[4], x[6] and x[8], with values of 5*k, 7*k and 9*k). If we look at the modular division itself, we see that the missed numbers are almost 0 or almost k (we expect the last one, since 0.5%k==0.5):
x[x%k!=0]%k
#array([1.00000023e+00, 4.44089210e-16, 1.00000023e+00, 5.00000000e-01])
So the best we can do is find a work around where we look for cases that are close enough. Noting that the differences above are O(2**-51), we can use 2**-50 as our threshold in this case but for practical purposes we can probably be a bit more lenient.
You also mention you want to eliminate the values that are divisible, so we want to keep the values where x%k > threshold and k-x%k > threshold:
threshold = 2**-50
x[np.logical_and((x % k) > threshold, (k - (x % k)) > threshold)]
#array([0.5])
If you wanted to keep them, then you'd use the opposite inequalities and use np.logical_or:
x[np.logical_or((x % k) < threshold, (k - (x % k)) < threshold)]
#array([1.00000023, 2.00000046, 3.00000069, 4.00000092, 5.00000115,
# 6.00000138, 7.00000161, 8.00000184, 9.00000207])

Rounding Numbers that fall within variable number of ranges in Python

I have an input list of numbers:
lst = [3.253, -11.348, 6.576, 2.145, -11.559, 7.733, 5.825]
I am trying to think of a way to replace each number in the list with a given number if it falls into a certain range. I want to create multiple ranges based on the min and max of the input list, and an input number that controls how many ranges there are.
For example, say I want 3 ranges equally divided between min and max:
numRanges = 3
lstMin = min(lst)
lstMax = max(lst)
step = (lstMax - lstMin) / numRanges
range1 = range(lstMin, lstMin + step)
range2 = range(range1 + step)
range3 = range(range2 + step)
Right away, is there a way to make the number of ranges be driven by the numRanges variable?
Later I want to take the input list and do something like:
for i in lst:
    if i in range1:
        finalLst.append(1)  # 1 comes from range1 and will grow if there are more ranges
    elif i in range2:
        finalLst.append(2)  # 2 comes from range2 and will grow if there are more ranges
    elif i in range3:
        finalLst.append(3)  # 3 comes from range3 and will grow if there are more ranges
The way I see it now, it is all "manual", and I am not sure how to make it more flexible so that I can just specify how many ranges I want plus a list of numbers, and let the code do the rest. Thank you for your help in advance. The desired output would be:
finalLst = [3, 1, 3, 3, 1, 3, 3]
This is easy to do with basic mathematical operations in a list comprehension:
numRanges = 3
lstMin = min(lst)
lstMax = max(lst) + 1e-12 # small value added to avoid floating point rounding issues
step = (lstMax - lstMin) / numRanges
range_numbers = [int((x-lstMin) / step) for x in lst]
This will give an integer for each value in the original list, with 0 indicating that the value falls in the first range, 1 being the second, and so on. It's almost the same as your code, but the numbers start at 0 rather than 1 (you could stick a + 1 in the calculation if you really want 1-indexing).
The small value I've added to lstMax is there for two reasons. The first is to make sure that floating point rounding issues don't make the largest value in the list yield numRanges as its range index rather than numRanges-1 (indicating the numRanges-th range). The other reason is to avoid a division-by-zero error if the list only contains a single value (possibly repeated) such that min(lst) and max(lst) return the same thing.
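For instance, applied to the example list from the question (a quick check of the arithmetic):
lst = [3.253, -11.348, 6.576, 2.145, -11.559, 7.733, 5.825]
# range_numbers == [2, 0, 2, 2, 0, 2, 2], i.e. the desired finalLst shifted down by one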
Python has a very nice tool for doing exactly this kind of work, called bisect. Let's say your range list is defined as such:
ranges = [-15, -10, -5, 5, 10, 15]
For your input list, you simply call bisect, like so:
from bisect import bisect

lst = [3.253, -11.348, 6.576, 2.145, -11.559, 7.733, 5.825]
results = [ranges[bisect(ranges, element)] for element in lst]
Which results in
[5, -10, 10, 5, -10, 10, 10]
You can then extend this to any arbitrary list of ranges using ranges = range(start,stop,step) in python 2.7 or ranges = list(range(start,stop,step)) in python 3.X
Update
I reread your question, and this is probably closer to what you're looking for (still using bisect):
from numpy import linspace
from bisect import bisect_left

def find_range(numbers, segments):
    mx = max(numbers)
    mn = min(numbers)
    ranges = linspace(mn, mx, segments)
    return [bisect_left(ranges, element) + 1 for element in numbers]
>>> find_range(lst, 3)
[3, 2, 3, 3, 1, 3, 3]

Min-Max difference in continuous part of certain length within a np.array

I have a numpy array of values like this:
a = np.array((1, 3, 4, 5, 10))
In this case the array has length 5. Now I want to know the difference between the lowest and highest value in the array, but only within a certain continuous part of the array, for example with length 3.
So in this case it would be the difference between 4 and 10, so 6. It would also be nice to have the index of the starting point of the continuous part (in the above example that would be 2). So something like this:
def f(a, length_of_part):
    ...
    return (max_difference, starting_index)
I know I could iterate over sliced parts of the array, but for my actual purpose I have ~150k arrays of length 1500, so that would take too long.
What would be an easy and quick way of doing this?
Thanks in advance!
This is a bit tricky to do in a vectorised way in Numpy. One option is to use numpy.lib.stride_tricks.as_strided, which requires care because it allows access to arbitrary memory. Here's an example for a window size of k = 3:
>>> k = 3
>>> shape = (len(a) - k + 1, k)
>>> b = numpy.lib.stride_tricks.as_strided(
...         a, shape=shape, strides=(a.itemsize, a.itemsize))
>>> moving_ptp = b.ptp(axis=1)
>>> start_index = moving_ptp.argmax()
>>> moving_ptp[start_index]
6
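On NumPy 1.20 and newer, numpy.lib.stride_tricks.sliding_window_view builds the same windows without the raw-memory risks of as_strided; a sketch mirroring the result above:
>>> b = numpy.lib.stride_tricks.sliding_window_view(a, k)
>>> moving_ptp = b.ptp(axis=1)
>>> moving_ptp.argmax()
2
>>> moving_ptp.max()
6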

Extract the minimum of a list of arrays, plot lines along with their intersections in Python

I have a list that is made up of sub arrays like:
[array(.....),array(.....),array(.....),....]
All the arrays are of the same length (basically each array represents a line).
I want to extract the minimum values element by element. So if each array has 100 elements, then I want the final list to be 100 elements in length. I also want the points where these lines intersect. Something like this should clarify what I mean:
https://www.dropbox.com/s/xshkhvqp0ay3vxc/g14.png
I make no claims that this is the best way to do it, having not fully wrapped my mind around Python. I have coded up a solution to a similar problem while calculating phase diagrams. At a given temperature, for some number of free energies functions for different phases, what is the overall minimum free energy curve, and which function is the minimum at each point.
G = []
iPhase = 0
while iPhase < len(f):                 # loop through all free energy functions
    G.append(f[iPhase](x))             # x is an array of x values
    iPhase = iPhase + 1
minG = G[0][:]                         # overall minimum free energy curve, starting with the 0th
minF = np.zeros(len(minG))             # index indicating which function f[i](x) is the min
iPhase = 1
while iPhase < len(f):
    nextF = np.zeros(len(x), dtype=bool)
    np.less(G[iPhase], minG, out=nextF)    # where is the next free energy function less than the current min?
    minG = np.minimum(minG, G[iPhase])     # new overall minimum
    minF = np.ma.filled(np.ma.array(minF, mask=nextF), fill_value=iPhase)  # new index as needed
    iPhase = iPhase + 1
So, the final output is an overall minimum minG, and an index of which curve it came from in minF. Now, if you want to refine the intersection points, one can use
changes = np.array(np.where(minF[:-1]!=minF[1:]))
to return the indices of where the lines crossed, and which functions were involved. If the different functions are truly lines y = m*x + b, you can do the algebra to get the exact crossing. For more complex functions, a more involved procedure is needed (I define a temporary function as the difference of the two under consideration, and then use scipy.optimize.newton to find its zero).
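A sketch of that refinement step (f1, f2, and the starting guess x_guess are hypothetical placeholders for whichever two curves cross):
from scipy.optimize import newton

# the root of the difference is the intersection of the two curves
crossing = newton(lambda x: f1(x) - f2(x), x_guess)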
Let's say that you have the following list:
>>> l = [array('i', [5, 15, 1, 25]),
...      array('i', [5, 15, 2, 25]),
...      array('i', [5, 15, 3, 25])]
You can obtain the minimum value of each array with the following:
>>> [min(x) for x in l]
[1, 2, 3]
Sorry but I don't understand the rest of your question :)
Like this:
>>> import numpy as np
>>> np.random.seed(0)
>>> data = np.random.rand(3, 4)
>>> data
array([[ 0.5488135 ,  0.71518937,  0.60276338,  0.54488318],
       [ 0.4236548 ,  0.64589411,  0.43758721,  0.891773  ],
       [ 0.96366276,  0.38344152,  0.79172504,  0.52889492]])
>>> result = data.min(axis=1)
>>> result
array([ 0.54488318, 0.4236548 , 0.38344152])
The intersection of two lines can be found by algebra:
ax+b = cx+d
implies: x = (d-b) / (a-c)
Thus, in Python, all you have to do is associate a pair (a, b) with each array; these can be found by taking any two points from the array or, if you must, by a least-squares fit. :-)
This will give you pairwise intersections, but it is an N^2 algorithm. Presumably you can do better by sweeping, but at that point a computational geometry text is in order.
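A minimal sketch of that algebra (line_intersection is a hypothetical helper; each line is given by its slope and intercept):
def line_intersection(a, b, c, d):
    # y = a*x + b meets y = c*x + d where a*x + b = c*x + d
    if a == c:
        return None  # parallel lines never intersect
    x = (d - b) / (a - c)
    return (x, a * x + b)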

Dealing with multi-dimensional arrays when ndims not known in advance

I am working with data from netcdf files, with multi-dimensional variables, read into numpy arrays. I need to scan all values in all dimensions (axes in numpy) and alter some values. But, I don't know in advance the dimension of any given variable. At runtime I can, of course, get the ndims and shapes of the numpy array.
How can I program a loop through all values without knowing the number of dimensions or shapes in advance? If I knew a variable was exactly 2-dimensional, I would do:
shp = myarray.shape
for i in range(shp[0]):
    for j in range(shp[1]):
        do_something(myarray[i][j])
You should look into ravel, nditer and ndindex.
# For the simple case
for value in np.nditer(a):
    do_something_with(value)

# This is similar to above
for value in a.ravel():
    do_something_with(value)

# Or if you need the index
for idx in np.ndindex(a.shape):
    a[idx] = do_something_with(a[idx])
On an unrelated note, numpy arrays are indexed a[i, j] instead of a[i][j]. In python a[i, j] is equivalent to indexing with a tuple, ie a[(i, j)].
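A quick demonstration:
>>> a = np.arange(6).reshape(2, 3)
>>> a[1, 2]
5
>>> a[(1, 2)]
5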
You can use the flat property of numpy arrays, which returns an iterator over all values (no matter the shape).
For instance:
>>> A = np.array([[1,2,3],[4,5,6]])
>>> for x in A.flat:
...     print(x)
1
2
3
4
5
6
You can also set the values in the same order they're returned, e.g. like this:
>>> A.flat[:] = [x / 2 if x % 2 == 0 else x for x in A.flat]
>>> A
array([[1, 1, 3],
[2, 5, 3]])
I am not sure the order in which flat returns the elements is formally guaranteed; it iterates through the elements in row-major order, so in practice the traversal is always the same for a given shape, but be careful if you rely on it.
And this will work for any dimension.
Edit: to clarify what I meant by "order not guaranteed": the order of elements returned by flat does not change from run to run, but I think it would be unwise to count on it for things like row1 = A.flat[:N], although it will work most of the time.
This might be the easiest with recursion:
a = numpy.array(range(30)).reshape(5, 3, 2)

def recursive_do_something(array):
    if len(array.shape) == 1:
        for obj in array:
            do_something(obj)
    else:
        for subarray in array:
            recursive_do_something(subarray)

recursive_do_something(a)
In case you want the indices:
a = numpy.array(range(30)).reshape(5, 3, 2)

def do_something(x, indices):
    print(indices, x)

def recursive_do_something(array, indices=None):
    indices = indices or []
    if len(array.shape) == 1:
        for obj in array:
            do_something(obj, indices)
    else:
        for i, subarray in enumerate(array):
            recursive_do_something(subarray, indices + [i])

recursive_do_something(a)
Look into Python's itertools module.
Python 2: http://docs.python.org/2/library/itertools.html#itertools.product
Python 3: http://docs.python.org/3.3/library/itertools.html#itertools.product
This will allow you to do something along the lines of:
from itertools import product

for idx in product(*(range(n) for n in shp)):
    do_something(myarray[idx])
