python: find value within range in float array

I have the following sorted python list, although multiple values can occur:
[0.0943200769115388, 0.17380131294164516, 0.4063245853719435,
0.45796523225774904, 0.5040225609708342, 0.5229351852840304,
0.6145136350368882, 0.6220712583558284, 0.7190096076050408,
0.8486436998476048, 0.8957381707345986, 0.9774325873910711,
0.9832076130275351, 0.985386554764682, 1.0]
Now, I want to know the index in the array where a particular value may fall:
For example, a value of 0.25 would fall at index 2 because it is between 0.173 and 0.40. I guess I can go through the list and do this in a for loop, but I was wondering if there is some better way to do this which may be more computationally efficient. I create this array once but have to perform many lookups.

>>> vals = [0.0943200769115388, 0.17380131294164516, 0.4063245853719435,
...         0.45796523225774904, 0.5040225609708342, 0.5229351852840304,
...         0.6145136350368882, 0.6220712583558284, 0.7190096076050408,
...         0.8486436998476048, 0.8957381707345986, 0.9774325873910711,
...         0.9832076130275351, 0.985386554764682, 1.0]
>>> import bisect
>>> bisect.bisect(vals, 0.25)
2

If you know the list is already sorted, the textbook solution is a binary search. Keep two index bounds, min and max, initialized to 0 and len - 1. Set mid to (min + max) // 2 (integer division). Compare the value at index mid with your target value: if it's less, set min to mid + 1; if it's greater, set max to mid - 1. Repeat until you either find the value or max < min, at which point you will have found the desired index in O(log n) steps.
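For reference, here is a minimal hand-rolled version of that search; it returns the same insertion point as bisect (which already does this for you, in C):
def binary_search(vals, target):
    # classic binary search on a sorted list
    lo, hi = 0, len(vals) - 1
    while lo <= hi:
        mid = (lo + hi) // 2  # integer division
        if vals[mid] < target:
            lo = mid + 1
        elif vals[mid] > target:
            hi = mid - 1
        else:
            return mid  # exact match
    return lo  # not found: lo is the interval index, e.g. 2 for 0.25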

Related

Ranking a number in an array in Python

In Python, how can we write a function that takes as input a number and an array, and outputs the rank of the number in the array? If the number is not part of the array, it should return the rank of the nearest value below the one given.
For example, if the function was given:
the values 7.23 and [1.2,4.3,5,7.23,63.1], the rank should be 4.
the values 3.5 and [1.2,4.3,5,7.23,63.1], the rank should be 1.
the values 100 and [1.2,4.3,5,7.23,63.1], the rank should be 5.
Assuming the list/array is sorted, you can use bisect.bisect_right:
from bisect import bisect_right
bisect_right(array, number)
Example:
>>> bisect_right([1.2,4.3,5,7.23,63.1], 7.23)
4
>>> bisect_right([1.2,4.3,5,7.23,63.1], 3.5)
1
>>> bisect_right([1.2,4.3,5,7.23,63.1], 100)
5
I think it's best to try to solve this question yourself first for practice, but if you have tried and are now looking for a solution, below is a simple way of doing so.
Note that the solution below assumes the array is sorted. The elif isn't strictly necessary because of the return above it (a plain if would do), but I have included it for readability.
def getRank(arr, value):
    for index, arrVal in enumerate(arr):
        if arrVal == value:
            return index + 1
        elif arrVal > value:
            return index
    return len(arr)
One more note: this is a simple O(n) solution, since in the worst case you iterate through the whole array. Because the list is sorted, a binary search brings that down to O(log n), as sketched below.
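A minimal sketch of that, reusing bisect_right from the first answer and assuming a sorted list without duplicate values (with duplicates, getRank returns the rank of the first match while bisect_right counts past all of them):
from bisect import bisect_right

def getRankFast(arr, value):
    # O(log n) equivalent of getRank for a sorted list without duplicates
    return bisect_right(arr, value)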

Trouble finding large jumps between data points in an array

I am trying to write a sigma-clipping program that calculates the differences between each point in an array and its neighbor, and if the difference is greater than x times the standard deviation of the array, sets the neighbor equal to the average of the two points closest to it. For example, given the array testarray = np.array([1.01, 2.0, 1.22, 1.005, .996, 0.95]) and a threshold of 2 deviations, the function would search through the array and set the 2.0 equal to 1.115, the average of 1.01 and 1.22.
def sigmaclip2(array, stand):
    originalDeviation = np.std(array)
    differences = np.abs(np.diff(array))
    for i in range(len(differences)):
        if differences[i] > stand * originalDeviation:
            if array[i+1] != array[-1]:
                array[i+1] = (array[i] + array[i+2]) / 2.0
            else:
                array[i+1] = (array[i] + array[i-1]) / 2.0
        else:
            pass
    return array
This code works for this small testarray. But, I am working with a larger data set (~12000 elements). When I try to run it on the larger data set, I get the same array back that I plugged in.
Does anyone know what might be going wrong?
I should note that I have tried some of Python's built in sigma clipping routines, such as the one from Astropy, but it appears as if that cuts off any values that are greater than x times the standard deviation of the array. This is not what I want to do. I want to find any large, sudden jumps (often caused by 1 bad value) and set that bad value equal to the average of the 2 points around it if the bad value is more than x times the standard deviation discrepant from its neighbor.
In line 6 of your function, array[-1] may be a typo, as it always refers to the last element of the array. Are you missing an i? In that case you may also need to shift by one, since differences[0] is the diff between array[0] and array[1].
PS: I think I would use np.where with slice notation on array to find just the indexes to alter, rather than using a normal Python loop. With numpy, a loop is almost always a bad idea.
EDIT
I understand about the edges, but I don't think your code does what you expect: when I run it, it averages array[2] to 1.06 as well as array[1] to 1.115.
If I change line 6 to if array[i+1] != array[i-1]: (array[-1] is the last entry, always 0.95), it still doesn't work properly.
You also have to think about what you want your code to do when there is more than one outlier, e.g. 1.01, 2.0, 2.25, 1.99, 1.22, 1.005, .996, 0.95. To cope with single outliers I would use something like:
def sigmaclip3(array, stand):
    cutoff = stand * np.std(array)
    diffs = np.abs(np.diff(array))
    ix = np.where((diffs[:-1] > cutoff) &
                  (diffs[1:] > cutoff))[0] + 1
    array[ix] = (array[ix - 1] + array[ix + 1]) / 2.0
    return array
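Running this vectorized version on the small test array from the question replaces only the outlier (values worked out by hand from cutoff = 2 * std ≈ 0.739):
import numpy as np

testarray = np.array([1.01, 2.0, 1.22, 1.005, 0.996, 0.95])
print(sigmaclip3(testarray, 2))
# the 2.0 becomes (1.01 + 1.22) / 2 = 1.115; everything else is untouched:
# [1.01  1.115 1.22  1.005 0.996 0.95 ]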

Replace a loop in python with the equivalent of a matlab find

Assume I have an array of tuples sorted by the first value. I want to find the first index at which a condition on the first element of the tuple holds, i.e. how do I replace the following code
test_array = [(1,2),(3,4),(5,6),(7,8),(9,10)]
min_value = 5
index = 0
for c in test_array:
    if c[0] > min_value:
        break
    else:
        index = index + 1
with the equivalent of a MATLAB find?
i.e. at the end of this loop I expect to get 3, but I'd like to make this more efficient. I am fine with using numpy for this. I tried using argmax but to no avail.
Thanks
Since the list is sorted, and if you know the maximum possible value of the second element (or if only one element can have a given first value), you can apply bisect to the list of tuples (it returns the sorted insertion position in the list):
import bisect
test_array = [(1,2),(3,4),(5,6),(7,8),(9,10)]
min_value = 5
print(bisect.bisect_left(test_array,(min_value,10000)))
Hardcoding 10000 is bad, so if you only have integers you can do this instead:
print(bisect.bisect_left(test_array,(min_value+1,)))
result: 3
If you have floats (this also works with integers) you can use sys.float_info.epsilon like this:
import sys
print(bisect.bisect_left(test_array,(min_value*(1+sys.float_info.epsilon),)))
It has O(log(n)) complexity so it's much better than a simple for loop when there are a lot of elements.
In general, numpy's where is used in a fashion similar to MATLAB's find. However, from an efficiency standpoint, where cannot be told to return only the first element found, so from a computational perspective your early-exit loop is not necessarily less efficient.
The where equivalent would be
index = numpy.where(numpy.array([t[0] for t in test_array]) > min_value)
index = index[0][0]  # first index where the condition holds -> 3
You can use numpy to mark the elements that obey the condition, and then use argmax() to get the index of the first one:
import numpy
test_array = numpy.array([(1,2),(3,4),(5,6),(7,8),(9,10)])
min_value = 5
print((test_array[:,0] > min_value).argmax())
If you would like to find all of the elements that obey the condition, you can replace argmax() with nonzero()[0].
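Since the first elements are sorted, numpy's binary search, searchsorted, is also worth noting; it is the closest numpy analogue of bisect, and the lookup itself is O(log n) once the first elements are in an array:
import numpy as np

test_array = [(1,2),(3,4),(5,6),(7,8),(9,10)]
first = np.array([t[0] for t in test_array])
# insertion point to the right of any existing 5s = first index with t[0] > 5
print(np.searchsorted(first, 5, side='right'))  # -> 3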

Calculate a discrete mean in python

I have a set of data points and a program that goes through the data set, takes every n points, sums them, and puts the sum in a new list. With that I can make simple bar plots.
Now I'd like to calculate a discrete mean for my new list.
The formula I'm using is: t_av = (1/nsmp) * Sum[N_i * t_i, {i, n_l, n_u}]
Basically I have nsmp bins that have N_i numbers in them, t_i is the time of a bin, n_l is the first bin, and n_u is the last bin.
So if my list is this: [373, 156, 73, 27, 16],
I have 5 bins, and I have: t_av=1/5 (373*1+156*2+73*3+27*4+16*5)=218.4
Now I have run into a problem. I tried this:
for i in range(0, len(L)):
    sr_vr = L[i] * i
tsr = sr_vr / nsmp
where nsmp is the number of bins I can set, and L is already calculated. Since range goes 0, 1, 2, 3, 4, I won't get the correct answer, because my first bin is multiplied by 0. If I say range(1, len(L)+1) I'll get IndexError: list index out of range, since that messes up the L[i]*i part: it would multiply the second element of the list by 1 and then be one entry short at the end.
How do I correct this?
You can just use L[i]*(i+1) (assuming you stick with zero-based indexing).
However you can also use enumerate() to loop over indices and values together, and you can even provide 1 as the second argument so that the indexing starts at 1 instead of 0.
Here is how I would write this:
tsr = sum(x * i for i, x in enumerate(L, 1)) / len(L)
Note that if you are on Python 2.x and L contains entirely integers this will perform integer division. To get a float just convert one of the arguments to a float (for example float(len(L))). You can also use from __future__ import division.
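Checking against the example from the question:
>>> L = [373, 156, 73, 27, 16]
>>> sum(x * i for i, x in enumerate(L, 1)) / len(L)
218.4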

Finding Nth item of unsorted list without sorting the list

Hey. I have a very large array and I want to find the Nth largest value. Trivially I can sort the array and then take the Nth element but I'm only interested in one element so there's probably a better way than sorting the entire array...
A heap is the best data structure for this operation and Python has an excellent built-in library to do just this, called heapq.
import heapq

def nth_largest(n, iter):
    return heapq.nlargest(n, iter)[-1]
Example Usage:
>>> import random
>>> iter = [random.randint(0,1000) for i in range(100)]
>>> n = 10
>>> nth_largest(n, iter)
920
Confirm result by sorting:
>>> list(sorted(iter))[-10]
920
Sorting would require at least O(n log n) runtime, whereas there are very efficient selection algorithms that can solve your problem in linear time.
Partition-based selection (often called quickselect), based on the idea of quicksort (recursive partitioning), is a good solution (see link for pseudocode and another example).
A simple modified quicksort works very well in practice. It has average running time proportional to N (though the worst-case, bad-luck running time is O(N^2)).
Proceed as in quicksort: pick a pivot value at random, then stream through your values and put each into one of two bins depending on whether it is above or below the pivot.
In quicksort you'd then recursively sort both bins. But to compute the N-th highest value, you only need to recurse into ONE of the bins: the population of each bin tells you which one holds your N-th highest value. For example, if you want the 125th highest value and the split leaves 75 in the "high" bin and 150 in the "low" bin, you can ignore the high bin and just find the 125 - 75 = 50th highest value in the low bin alone.
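A minimal recursive sketch of that idea (random pivot; n is 1-based, so n=1 returns the maximum):
import random

def quickselect_nth_largest(values, n):
    pivot = random.choice(values)
    above = [v for v in values if v > pivot]   # the "high" bin
    if n <= len(above):
        return quickselect_nth_largest(above, n)
    equal = [v for v in values if v == pivot]
    if n <= len(above) + len(equal):
        return pivot
    below = [v for v in values if v < pivot]   # the "low" bin
    return quickselect_nth_largest(below, n - len(above) - len(equal))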
You can iterate over the entire sequence while maintaining a list of the N largest values found so far, which is O(n) for a small fixed N. That being said, I think it would just be simpler to sort the list.
You could try the median-of-medians method; its speed is O(N).
Use heapsort. It only partially orders the list until you draw the elements out.
You essentially want to produce a "top-N" list and select the one at the end of that list.
So you can scan the array once and insert into an empty list when the largeArray item is greater than the last item of your top-N list, then drop the last item.
After you finish scanning, pick the last item in your top-N list.
An example for ints and N = 5:
import java.util.Arrays;
import java.util.Collections;

Integer[] top5 = new Integer[5];
Arrays.fill(top5, Integer.MIN_VALUE); // or your min value
for (int i = 0; i < largeArray.length; i++) {
    if (largeArray[i] > top5[4]) {
        // replace the last (smallest) entry of the top-5 list:
        top5[4] = largeArray[i];
        // re-sort descending so top5[4] is again the smallest kept value
        Arrays.sort(top5, Collections.reverseOrder());
    }
}
// top5[4] now holds the 5th largest value
As people have said, you can walk the list once, keeping track of the K largest values. If K is large, keeping that list sorted naively makes the algorithm close to O(n^2).
However, if you store your K largest values in a binary tree (or a heap), the operation becomes O(n log K).
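A sketch of that idea using Python's heapq module as the tree, keeping a min-heap of the K largest values seen so far:
import heapq

def kth_largest(values, k):
    heap = []  # min-heap holding the k largest values seen so far
    for v in values:
        if len(heap) < k:
            heapq.heappush(heap, v)
        elif v > heap[0]:
            heapq.heapreplace(heap, v)  # drop the smallest, add v: O(log k)
    return heap[0]  # smallest of the k largest = the k-th largest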
According to Wikipedia, this is the best selection algorithm:
function findFirstK(list, left, right, k)
    if right > left
        select pivotIndex between left and right
        pivotNewIndex := partition(list, left, right, pivotIndex)
        if pivotNewIndex > k  // new condition
            findFirstK(list, left, pivotNewIndex-1, k)
        if pivotNewIndex < k
            findFirstK(list, pivotNewIndex+1, right, k)
Its average complexity is O(n).
One thing you should do if this is in production code is test with samples of your data.
For example, you might consider 1000- or 10000-element 'large' arrays and code up a quickselect method from a recipe.
The compiled nature of sorted, and its somewhat hidden and constantly evolving optimizations, make it faster than a quickselect written in pure Python on small to medium-sized datasets (under 1,000,000 elements). You may also find that as the array grows beyond that, memory is handled more efficiently in native code, and the benefit continues.
So even though quickselect is O(n) versus sorted's O(n log n), big-O doesn't account for how many actual machine-code instructions each of the n elements costs, the impact on pipelining, or the use of processor caches, all of which the creators and maintainers of sorted bake into its implementation.
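A minimal way to run that comparison on your own data, assuming the quickselect_nth_largest sketch from earlier in this thread:
import random
import timeit

data = [random.random() for _ in range(100000)]
print(timeit.timeit(lambda: sorted(data)[-10], number=20))
print(timeit.timeit(lambda: quickselect_nth_largest(data, 10), number=20))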
You can keep two counts for each element: the number of elements bigger than it, and the number of elements smaller than it. The element for which the count of bigger elements equals N - 1 is the Nth highest. See the solution below:
def NthHighest(l, n):
    if len(l) < n:
        return 0
    for i in range(len(l)):
        low_count = 0
        up_count = 0
        for j in range(len(l)):
            if l[j] > l[i]:
                up_count = up_count + 1
            else:
                low_count = low_count + 1
        if up_count == n - 1:
            return l[i]

# find the 4th largest number
l = [1,3,4,9,5,15,5,13,19,27,22]
print(NthHighest(l, 4))
Using the above solution you can find both the Nth highest and, by checking low_count instead, the Nth lowest.
If you do not mind using pandas then:
import pandas as pd
N = 10
column_name = 0
pd.DataFrame(your_array).nlargest(N, column_name)
The above code will show you the N largest values along with the index position of each value.
Hope it helps. :-)
Pandas Nlargest Documentation
