Assume I have an array of tuples that is sorted by the first value. I want to find the first index where a condition on the first element of the tuple holds, i.e. how do I replace the following code
test_array = [(1,2),(3,4),(5,6),(7,8),(9,10)]
min_value = 5
index = 0
for c in test_array:
    if c[0] > min_value:
        break
    else:
        index = index + 1
with the equivalent of a MATLAB find?
I.e., at the end of this loop I expect to get 3, but I'd like to make this more efficient. I am fine with using numpy for this. I tried using argmax, but to no avail.
Thanks
Since the list is sorted and if you know the max possible value for the second element (or if there can only be 1 element with the same first value), you could apply bisect on the list of tuples (returns the sorted insertion position in the list)
import bisect
test_array = [(1,2),(3,4),(5,6),(7,8),(9,10)]
min_value = 5
print(bisect.bisect_left(test_array,(min_value,10000)))
Hardcoding 10000 is bad, so if you only have integers you can do this instead:
print(bisect.bisect_left(test_array,(min_value+1,)))
result: 3
If you had floats (this also works with integers) you could use sys.float_info.epsilon like this:
import sys
print(bisect.bisect_left(test_array,(min_value*(1+sys.float_info.epsilon),)))
It has O(log(n)) complexity so it's much better than a simple for loop when there are a lot of elements.
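A variant of the same idea that avoids the hardcoded 10000 is sketched below. It assumes every tuple is a pair whose second element is a finite number, so that math.inf compares greater than all of them; the helper name first_index_greater is made up for illustration.

```python
import bisect
import math


def first_index_greater(pairs, threshold):
    """Index of the first pair whose first element is > threshold.

    Assumes `pairs` is sorted and each second element is a finite number.
    Returns len(pairs) when no such pair exists.
    """
    # (threshold, inf) sorts after every pair whose first element equals
    # threshold, so the insertion point is the first strictly larger pair.
    return bisect.bisect_right(pairs, (threshold, math.inf))


test_array = [(1, 2), (3, 4), (5, 6), (7, 8), (9, 10)]
index = first_index_greater(test_array, 5)
print(index)
```

On Python 3.10 or later, the bisect functions also accept a key argument, so bisect.bisect_right(test_array, 5, key=lambda t: t[0]) should give the same index without any sentinel value.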
In general, numpy's where is used in a fashion similar to MATLAB's find. However, where cannot be told to return only the first element found, so it always evaluates the condition on every element. From a computational perspective, what you're doing here is arguably no less efficient.
The where equivalent would be
index = numpy.where(numpy.array([t[0] for t in test_array]) > min_value)[0]  # all matching indices
index = index[0]  # keep only the first match
You can use numpy to flag the elements that satisfy the condition and then use argmax() to get the index of the first one:
import numpy
test_array = numpy.array([(1,2),(3,4),(5,6),(7,8),(9,10)])
min_value = 5
print((test_array[:,0]>min_value).argmax())
If you would like to find all of the elements that satisfy the condition, you can replace argmax() with nonzero()[0].
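One caveat with the argmax approach: when no element satisfies the condition, argmax on an all-False mask still returns 0, which is indistinguishable from a match at position 0. A minimal sketch of a guard is below; the helper name first_index_where and the choice to return None are assumptions for illustration, not part of numpy.

```python
import numpy as np


def first_index_where(mask):
    """Index of the first True in a boolean array, or None if there is none."""
    mask = np.asarray(mask, dtype=bool)
    if not mask.any():
        # argmax would return 0 here, which would look like a valid hit
        return None
    return int(mask.argmax())


test_array = np.array([(1, 2), (3, 4), (5, 6), (7, 8), (9, 10)])
found = first_index_where(test_array[:, 0] > 5)
missing = first_index_where(test_array[:, 0] > 100)
print(found, missing)
```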
Related
In Python, how can we provide a function that takes as input:
A number
An array
and then provides as output the rank of the number in the array? If the number is not part of the array, then the output should be the rank of the largest number lower than the value given.
For example, if the function was given
the values 7.23 and [1.2,4.3,5,7.23,63.1], then the rank should be 4.
the values 3.5 and [1.2,4.3,5,7.23,63.1], then the rank should be 1.
the values 100 and [1.2,4.3,5,7.23,63.1], then the rank should be 5.
Assuming the list/array is sorted, you can use bisect.bisect_right:
from bisect import bisect_right
bisect_right(array, number)
Example:
bisect_right([1.2,4.3,5,7.23,63.1], 7.23)
4
bisect_right([1.2,4.3,5,7.23,63.1], 3.5)
1
bisect_right([1.2,4.3,5,7.23,63.1], 100)
5
I think it's best to try to solve this question yourself first to get some practice, but if you have tried and are now looking for a solution, below is a simple way of doing so.
Note that the solution below assumes the array is sorted. The elif also isn't necessary because of the return above it (you can replace it with if), but I have included it for readability.
def getRank(arr, value):
    for index, arrVal in enumerate(arr):
        if arrVal == value:
            return index + 1
        elif arrVal > value:
            return index
    return len(arr)
Another note: this is a simple O(n) solution, since in the worst case you have to iterate through the whole array. Assuming the list is sorted, a binary search could do the same with a complexity of O(log n).
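The O(log n) alternative mentioned above could look roughly like the hand-written binary search below. It is a sketch that assumes the array is sorted in ascending order; the name get_rank_binary is made up here, and the function counts elements less than or equal to the value, which is the same convention as bisect_right.

```python
def get_rank_binary(arr, value):
    """Number of elements in the sorted sequence `arr` that are <= value."""
    lo, hi = 0, len(arr)
    while lo < hi:
        mid = (lo + hi) // 2
        if arr[mid] <= value:
            # Everything up to and including mid counts towards the rank
            lo = mid + 1
        else:
            # arr[mid] is too large, so the answer lies to the left
            hi = mid
    return lo


ranks = [get_rank_binary([1.2, 4.3, 5, 7.23, 63.1], v) for v in (7.23, 3.5, 100)]
print(ranks)
```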
Let's say I have a list: l=[7,2,20,9] and I want to find the minimum absolute difference among all of its elements (in this case it would be 9-7 = 2 or equivalently |7-9|). To do it in O(n log n) complexity, I need to sort, take the differences, and find the minimum element:
import numpy as np
sorted_l = sorted(l) # sort list
diff_sorted = abs(np.diff(sorted_l)) # get absolute value differences
min_diff = min(diff_sorted) # get min element
However, after doing this, I need to track which elements were used in the original l list that gave rise to this difference. So for l the minimum difference is 2 and the output I need is 7 and 9 since 9-7 is 2. Is there a way to do this? sorted method ruins the order and it's hard to backtrack. Am I missing something obvious? Thanks.
Use:
index = diff_sorted.tolist().index(min_diff)
sorted_l[index:index+2]
Output
[7, 9]
Whole Script
import numpy as np
l=[12,24,36,35,7]
sorted_l = sorted(l)
diff_sorted = np.diff(sorted_l)
min_diff = min(diff_sorted)
index = diff_sorted.tolist().index(min_diff)
sorted_l[index:index+2]
Output
[35, 36]
Explanation
tolist converts the numpy array into a Python list, which has an index method that returns the position of its argument. Using tolist and index, we therefore get the position of the minimum difference in the sorted array. Slicing with [index:index+2] then selects the two adjacent numbers in the sorted array that produced that minimum difference.
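If the positions in the original, unsorted list are needed as well, one option is to sort the indices instead of the values. The following is a sketch in plain Python under that assumption; closest_pair is a hypothetical helper name.

```python
def closest_pair(values):
    """Return (min_diff, (i, j)) where i and j index into the original list."""
    if len(values) < 2:
        raise ValueError("need at least two elements")
    # Sort the positions by the value they point at; `values` is untouched.
    order = sorted(range(len(values)), key=lambda i: values[i])
    # The closest pair must be adjacent in sorted order.
    i, j = min(zip(order, order[1:]),
               key=lambda pair: values[pair[1]] - values[pair[0]])
    return values[j] - values[i], (i, j)


l = [7, 2, 20, 9]
diff, (i, j) = closest_pair(l)
print(diff, l[i], l[j], i, j)
```

The sort still dominates the cost, so the overall complexity stays at O(n log n).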
I have a array with these elements:
array= [21558 43101 64638 86173 107701 129232 150775 172355 193864 215457
237071 258586 280130 301687 23255 344790 366285 387838 409365 430856
452367 473893 495456 516955 538543 560110 581641 603188]
In my program, there is a variable n that is randomly sorted. What I'm trying to achieve is very simple, but I just can't get anything to work.
With the line below, I'll find the index of the first value that is greater than n
value_index=np.where(array > n)[0][0]
What I need is to find the value that it represents, not the index.
Of course, I can simply use the value_index variable to look the value up in the array, but I'm trying to be as efficient as possible.
Can anyone help me find the fastest way possible to find this value?
Numpy generally isn't very good at getting the first of something without first computing the rest of the values. There is no equivalent to Python's
next(x for x in array if x > n)
Instead, you have to compute the mask of x > n, and get the first index of that. There are better ways to do this than np.where:
ind = np.flatnonzero(array > n)[0]
OR
ind = np.argmax(array > n)
In either case, your best bet to get the value is
array[ind]
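If the array is already sorted in ascending order, a binary search can find the first value greater than n without building the full mask. The sketch below uses the standard-library bisect module on a plain list with shortened sample data; np.searchsorted(array, n, side='right') is the numpy counterpart. The helper name and the None fallback are illustrative choices.

```python
from bisect import bisect_right


def first_value_greater(sorted_values, n):
    """First value strictly greater than n, or None if there is none."""
    # bisect_right returns the position just past any elements equal to n
    idx = bisect_right(sorted_values, n)
    if idx == len(sorted_values):
        return None
    return sorted_values[idx]


values = [21558, 43101, 64638, 86173, 107701]
result = first_value_greater(values, 50000)
print(result)
```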
I have some very large lists that I am working with (>1M rows), and I am trying to find a fast (the fastest?) way of, given a float, ranking that float against the list of floats and finding its percentage rank compared to the range of the list. Here is my attempt, but it's extremely slow:
X =[0.595068426145485,
0.613726840488019,
1.1532608695652,
1.92952380952385,
4.44137931034496,
3.46432160804035,
2.20331487122673,
2.54736842105265,
3.57702702702689,
1.93202764976956,
1.34720184204056,
0.824997304105564,
0.765782842381996,
0.615110856990126,
0.622708022872803,
1.03211045820975,
0.997225012974318,
0.496352327702226,
0.67103858866700,
0.452224068868272,
0.441842124852685,
0.447584524952608,
0.4645525042246]
val = 1.5
arr = np.array(X) #X is actually a pandas column, hence the conversion
arr = np.insert(arr,1,val, axis=None) #insert the val into arr, to then be ranked
st = np.sort(arr)
RANK = float([i for i,k in enumerate(st) if k == val][0])+1 #Find position
PCNT_RANK = (1-(1-round(RANK/len(st),6)))*100 #Find percentage of value compared to range
print RANK, PCNT_RANK
>>> 17.0 70.8333
For the percentage ranks I could probably build a distribution and sample from it, not quite sure yet, any suggestions welcome...it's going to be used heavily so any speed-up will be advantageous.
Thanks.
Sorting the array seems to be rather slow. If you don't need the array to be sorted in the end, then numpy's boolean operations are quicker.
arr = np.array(X)
bool_array = arr < val # Returns boolean array
RANK = float(np.sum(bool_array))
PCT_RANK = RANK/len(X)
Or, better yet, use a list comprehension and avoid numpy altogether.
RANK = float(sum([x<val for x in X]))
PCT_RANK = RANK/len(X)
Doing some timing, the numpy solution above gives 6.66 us on my system while the list comprehension method gives 3.74 us.
The two slow parts of your code are:
st = np.sort(arr). Sorting the list takes on average O(n log n) time, where n is the size of the list.
RANK = float([i for i, k in enumerate(st) if k == val][0]) + 1. Iterating through the list takes O(n) time.
If you don't need to sort the list, then as @ChrisMueller points out, you can just iterate through it once without sorting, which takes O(n) time and will be the fastest option.
If you do need to sort the list (or have access to it pre-sorted), then the fastest option for the second step is RANK = np.searchsorted(st, val) + 1. Since the list is already sorted, finding the index will only take O(log n) time by binary search instead of having to iterate through the whole list. This will still be a lot faster than your original code.
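When many values have to be ranked against the same data, the sort only needs to happen once, and each lookup is then a binary search. Below is a sketch using the standard-library bisect module in place of np.searchsorted; the sample data is made up, percent_rank is a hypothetical helper, and the percentage follows the question's convention of counting the value as an extra element.

```python
from bisect import bisect_left


def percent_rank(sorted_values, val):
    """Rank and percentage rank of val as if it were inserted into the data."""
    # Elements strictly below val, plus one for val itself
    rank = bisect_left(sorted_values, val) + 1
    pct = 100.0 * rank / (len(sorted_values) + 1)
    return rank, pct


data = [0.5, 2.0, 1.0, 3.0]   # made-up sample, not the data from the question
sorted_data = sorted(data)    # O(n log n), done once
rank, pct = percent_rank(sorted_data, 1.5)  # O(log n) per lookup
print(rank, pct)
```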
I have an array with elements like a = array.array('i',[3,5,7,2,8,9,10,37,99]). Now I have to find the 4th largest element. If this were a list, I could find it this way:
l = [3,5,7,2,8,9,10,37,99]
m = sorted(l)
m[-4]
You could use numpy.argsort, which gives you the indices that would sort the array in ascending order. So:
from numpy import argsort
index_to_fourth_largest_element = argsort(a)[-4]
But if you use this solution (meaning that you use numpy) and plan to do more with the array, you could consider using numpy.array instead of array.array in the first place.
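For completeness, a sketch that stays in the standard library: sorted() and heapq.nlargest both accept any iterable, so they should work on an array.array directly without converting it to a list first.

```python
import array
import heapq

a = array.array('i', [3, 5, 7, 2, 8, 9, 10, 37, 99])

# nlargest returns the 4 largest values in descending order,
# so the last one is the 4th largest.
fourth_largest = heapq.nlargest(4, a)[-1]

# sorted() returns a new list, so the list approach works here too.
also_fourth = sorted(a)[-4]

print(fourth_largest, also_fourth)
```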