In Python, how can we provide a function that takes as input:
A number
An array
and then provides as output the rank of the
number in the array? If the number is not in the array, it should return the rank of the largest value lower than the given number.
For example, if the function was given
the values 7.23 and
[1.2,4.3,5,7.23,63.1], then the rank should be 4.
the values 3.5 and
[1.2,4.3,5,7.23,63.1], then the rank should be 1.
the values 100 and
[1.2,4.3,5,7.23,63.1], then the rank should be 5.
Assuming the list/array is sorted, you can use bisect.bisect_right:
from bisect import bisect_right
bisect_right(array, number)
Example:
bisect_right([1.2,4.3,5,7.23,63.1], 7.23)
4
bisect_right([1.2,4.3,5,7.23,63.1], 3.5)
1
bisect_right([1.2,4.3,5,7.23,63.1], 100)
5
I think it's best to try to solve this question yourself first to get some practice, but if you have tried and are now looking for a solution, below is a simple way of doing so.
Note that the solution below assumes the array is sorted. The elif also isn't necessary because of the return above (it can be replaced with if), but I have included it for readability.
def getRank(arr, value):
    for index, arrVal in enumerate(arr):
        if arrVal == value:
            return index + 1
        elif arrVal > value:
            return index
    return len(arr)
Another note: this is a simple O(n) solution, as in the worst case you have to iterate through the whole array. Assuming the list is sorted, a binary search could be done with a complexity of O(log n), as sketched below.
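For completeness, here is a minimal binary-search sketch of the same getRank behaviour (the name getRankBinary is my own; it assumes the array is sorted and, for ties, behaves like bisect_right):
def getRankBinary(arr, value):
    # Returns the count of elements <= value, which matches getRank above
    # when the values in arr are unique.
    lo, hi = 0, len(arr)
    while lo < hi:
        mid = (lo + hi) // 2
        if arr[mid] <= value:
            lo = mid + 1
        else:
            hi = mid
    return lo
For example, getRankBinary([1.2,4.3,5,7.23,63.1], 3.5) returns 1 and getRankBinary([1.2,4.3,5,7.23,63.1], 100) returns 5, each in O(log n) comparisons.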
Related
Assume I have an array of tuples sorted by the first value. I want to find the first index where a condition on the first element of the tuple holds, i.e. how do I replace the following code
test_array = [(1,2),(3,4),(5,6),(7,8),(9,10)]
min_value = 5
index = 0
for c in test_array:
    if c[0] > min_value:
        break
    else:
        index = index + 1
with the equivalent of MATLAB's find?
At the end of this loop I expect to get 3, but I'd like to make this more efficient. I am fine with using numpy for this. I tried using argmax but to no avail.
Thanks
Since the list is sorted, and if you know the maximum possible value for the second element (or if there can only be one element with a given first value), you can apply bisect to the list of tuples (it returns the sorted insertion position in the list):
import bisect
test_array = [(1,2),(3,4),(5,6),(7,8),(9,10)]
min_value = 5
print(bisect.bisect_left(test_array,(min_value,10000)))
Hardcoding 10000 is bad, so if you only have integers you can do this instead:
print(bisect.bisect_left(test_array,(min_value+1,)))
result: 3
if you had floats (it also works with integers) you could use sys.float_info.epsilon like this:
import sys
print(bisect.bisect_left(test_array,(min_value*(1+sys.float_info.epsilon),)))
It has O(log(n)) complexity so it's much better than a simple for loop when there are a lot of elements.
In general, numpy's where is used in a fashion similar to MATLAB's find. However, from an efficiency standpoint, where cannot be controlled to return only the first element found, so from a computational perspective what you're doing here is not necessarily less efficient.
The where equivalent would be
import numpy
index = numpy.where(numpy.array([t[0] for t in test_array]) > min_value)
index = index[0][0]
You can use numpy to indicate the elements that obey the conditions and then use argmax(), to get the index of the first one
import numpy
test_array = numpy.array([(1,2),(3,4),(5,6),(7,8),(9,10)])
min_value = 5
print((test_array[:,0] > min_value).argmax())
If you would like to find all of the elements that obey the condition, you can replace argmax() with nonzero()[0].
I have some very large lists that I am working with (>1M rows), and I am trying to find a fast (the fastest?) way of, given a float, ranking that float compared to the list of floats, and finding its percentage rank compared to the range of the list. Here is my attempt, but it's extremely slow:
X =[0.595068426145485,
0.613726840488019,
1.1532608695652,
1.92952380952385,
4.44137931034496,
3.46432160804035,
2.20331487122673,
2.54736842105265,
3.57702702702689,
1.93202764976956,
1.34720184204056,
0.824997304105564,
0.765782842381996,
0.615110856990126,
0.622708022872803,
1.03211045820975,
0.997225012974318,
0.496352327702226,
0.67103858866700,
0.452224068868272,
0.441842124852685,
0.447584524952608,
0.4645525042246]
val = 1.5
arr = np.array(X) #X is actually a pandas column, hence the conversion
arr = np.insert(arr,1,val, axis=None) #insert the val into arr, to then be ranked
st = np.sort(arr)
RANK = float([i for i,k in enumerate(st) if k == val][0])+1 #Find position
PCNT_RANK = (1-(1-round(RANK/len(st),6)))*100 #Find percentage of value compared to range
print RANK, PCNT_RANK
17.0 70.8333
For the percentage ranks I could probably build a distribution and sample from it; I'm not quite sure yet, so any suggestions are welcome. It's going to be used heavily, so any speed-up will be advantageous.
Thanks.
Sorting the array seems to be rather slow. If you don't need the array to be sorted in the end, then numpy's boolean operations are quicker.
arr = np.array(X)
bool_array = arr < val # Returns boolean array
RANK = float(np.sum(bool_array))
PCT_RANK = RANK/len(X)
Or, better yet, use a list comprehension and avoid numpy altogether.
RANK = float(sum([x<val for x in X]))
PCT_RANK = RANK/len(X)
Doing some timing, the numpy solution above gives 6.66 us on my system while the list comprehension method gives 3.74 us.
The two slow parts of your code are:
st = np.sort(arr). Sorting the list takes on average O(n log n) time, where n is the size of the list.
RANK = float([i for i, k in enumerate(st) if k == val][0]) + 1. Iterating through the list takes O(n) time.
If you don't need to sort the list, then as #ChrisMueller points out, you can just iterate through it once without sorting, which takes O(n) time and will be the fastest option.
If you do need to sort the list (or have access to it pre-sorted), then the fastest option for the second step is RANK = np.searchsorted(st, val) + 1. Since the list is already sorted, finding the index will only take O(log n) time by binary search instead of having to iterate through the whole list. This will still be a lot faster than your original code.
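Putting that together, a minimal sketch of the faster path (assuming X and val as defined in the question, and that you sort once before querying):
import numpy as np

st = np.sort(np.array(X))                    # O(n log n), done once
RANK = float(np.searchsorted(st, val) + 1)   # O(log n) binary search per query
PCNT_RANK = RANK / (len(st) + 1) * 100       # +1 because the original counts val itself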
I'm trying to implement a method to keep the visited states of the 8 puzzle from being generated again.
My initial approach was to save each visited pattern in a list and do a linear check each time the algorithm wants to generate a child.
Now I want to do this in O(1) time through list access. Each pattern in 8 puzzle is an ordered permutation of numbers between 1 to 9 (9 being the blank block), for example 125346987 is:
1 2 5
3 4 6
_ 8 7
The number of all possible permutations of this kind is around 363,000 (9!). What is the best way to hash these numbers to indexes of a list of that size?
You can map a permutation of N items to its index in the list of all permutations of N items (ordered lexicographically).
Here's some code that does this, and a demonstration that it produces indexes 0 to 23 once each for all permutations of a 4-letter sequence.
import itertools
def fact(n):
    r = 1
    for i in xrange(n):
        r *= i + 1
    return r

def I(perm):
    if len(perm) == 1:
        return 0
    return sum(p < perm[0] for p in perm) * fact(len(perm) - 1) + I(perm[1:])

for p in itertools.permutations('abcd'):
    print p, I(p)
The best way to understand the code is to prove its correctness. For an array of length n, there are (n-1)! permutations with the smallest element of the array appearing first, (n-1)! permutations with the second smallest element appearing first, and so on.
So, to find the index of a given permutation, count how many items are smaller than the first element of the permutation and multiply that by (n-1)!. Then recursively add the index of the remainder of the permutation, considered as a permutation of (n-1) elements. The base case is when you have a permutation of length 1. Obviously there's only one such permutation, so its index is 0.
A worked example: [1324].
[1324]: 1 appears first, and that's the smallest element in the array, so that gives 0 * (3!)
Removing 1 gives us [324]. The first element is 3. There's one element that's smaller, so that gives us 1 * (2!).
Removing 3 gives us [24]. The first element is 2. That's the smallest element remaining, so that gives us 0 * (1!).
Removing 2 gives us [4]. There's only one element, so we use the base case and get 0.
Adding up, we get 0*3! + 1*2! + 0*1! + 0 = 1*2! = 2. So [1324] is at index 2 in the sorted list of permutations of 4 elements. That's correct, because at index 0 is [1234], at index 1 is [1243], and the lexicographically next permutation is our [1324].
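To tie the worked example back to the code, here is a quick check (a Python 3 restatement of the I function above under the illustrative name perm_index, just confirming the index of [1324]):
from math import factorial

def perm_index(perm):
    # Lehmer-code style index: count of smaller elements, weighted by factorials.
    if len(perm) == 1:
        return 0
    return sum(p < perm[0] for p in perm) * factorial(len(perm) - 1) + perm_index(perm[1:])

print(perm_index((1, 3, 2, 4)))  # 2, matching the worked example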
I believe you're asking for a function to map permutations to array indices. This dictionary maps all permutations of numbers 1-9 to values from 0 to 9!-1.
import itertools
index = itertools.count(0)
permutations = itertools.permutations(range(1, 10))
hashes = {h:next(index) for h in permutations}
For example, hashes[(1,2,5,3,4,6,9,8,7)] gives a value of 1445.
If you need them in strings instead of tuples, use:
permutations = [''.join(x) for x in itertools.permutations('123456789')]
or as integers:
permutations = [int(''.join(x)) for x in itertools.permutations('123456789')]
It looks like you are only interested in whether or not you have already visited the permutation.
You should use a set. It grants the O(1) look-up you are interested in.
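A minimal sketch of that idea, using the permutation tuple itself as the set element (the function names here are only illustrative):
visited = set()

def mark_visited(state):
    # state is a tuple such as (1, 2, 5, 3, 4, 6, 9, 8, 7)
    visited.add(state)

def was_visited(state):
    return state in visited  # average O(1) membership test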
A structure that is efficient in both space and lookup for this problem is a trie, as it shares space between permutations with a common lexicographical prefix,
i.e. the space used for "123" in 1234 and in 1235 is the same.
Let's assume 0 as a replacement for '_' in your example, for simplicity.
Storing
Your trie will be a tree of booleans: the root is an empty node, and each node contains 9 children, one for each of the digits 0 to 8 (with 0 standing in for _), each with a boolean flag initially set to false.
You can create the trie on the fly: as you encounter a permutation, store its digits in the trie by setting the corresponding booleans to true.
Lookup
The trie is traversed from root to children based on the digits of the permutation, and if the nodes have been marked as true, that means the permutation has occurred before. The complexity of a lookup is just 9 node hops.
Here is how the trie would look for a 4-digit example:
[image: Python trie diagram]
This trie can be easily stored in a list of booleans, say myList.
Where myList[0] is the root, as explained in the concept here :
https://webdocs.cs.ualberta.ca/~holte/T26/tree-as-array.html
The final trie in a list would be around 9 + 9^2 + 9^3 + ... + 9^8 bits, i.e. less than 10 MB for all lookups.
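As a rough sketch of the same idea, here is a trie built from nested dictionaries instead of the flat boolean list described above (names are illustrative; each lookup is at most 9 hops):
trie = {}

def trie_insert(perm):
    # perm is a string such as '125346987'
    node = trie
    for digit in perm:
        node = node.setdefault(digit, {})
    node['end'] = True  # mark a complete permutation as visited

def trie_contains(perm):
    node = trie
    for digit in perm:
        if digit not in node:
            return False
        node = node[digit]
    return 'end' in node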
Use
I've developed a heuristic function for this specific case. It is not a minimal perfect hash, as the mapping is not onto [0, 9!-1] but into [1, 767359], but it is O(1).
Let's assume we already have a file / reserved memory / whatever with 767359 bits set to 0 (e.g., mem = [False] * 767359). Let an 8-puzzle pattern be mapped to a Python string (e.g., '125346987'). Then, the hash function is determined by:
def getPosition( input_str ):
    data = []
    opts = range(1,10)
    n = int(input_str[0])
    opts.pop(opts.index(n))
    for c in input_str[1:len(input_str)-1]:
        k = opts.index(int(c))
        opts.pop(k)
        data.append(k)
    ind = data[3]<<14 | data[5]<<12 | data[2]<<9 | data[1]<<6 | data[0]<<3 | data[4]<<1 | data[6]<<0
    output_str = str(ind)+str(n)
    output = int(output_str)
    return output
I.e., in order to check if an 8-puzzle pattern = 125346987 has already been used, we need to:
pattern = '125346987'
pos = getPosition(pattern)
used = mem[pos-1] #mem starts in 0, getPosition in 1.
With a minimal perfect hash we would have needed 9! bits to store the booleans. In this case we need about 2x more (767359/9! ≈ 2.11), but recall that it is not even 1 Mb (barely 100 KB).
Note that the function is easily invertible.
Check
I could prove mathematically why this works and why there won't be any collisions, but since this is a programming forum let's just run it for every possible permutation and check that all the hash values (positions) are indeed different:
def getPosition( input_str ):
    data = []
    opts = range(1,10)
    n = int(input_str[0])
    opts.pop(opts.index(n))
    for c in input_str[1:len(input_str)-1]:
        k = opts.index(int(c))
        opts.pop(k)
        data.append(k)
    ind = data[3]<<14 | data[5]<<12 | data[2]<<9 | data[1]<<6 | data[0]<<3 | data[4]<<1 | data[6]<<0
    output_str = str(ind)+str(n)
    output = int(output_str)
    return output
#CHECKING PURPOSES
def addperm(x,l):
    return [ l[0:i] + [x] + l[i:] for i in range(len(l)+1) ]

def perm(l):
    if len(l) == 0:
        return [[]]
    return [x for y in perm(l[1:]) for x in addperm(l[0],y) ]

#We generate all the permutations
all_perms = perm([ i for i in range(1,10)])
print "Number of all possible perms.: "+str(len(all_perms)) #indeed 9! = 362880
#We execute our hash function over all the perms and store the output.
all_positions = []
for permutation in all_perms:
    perm_string = ''.join(map(str,permutation))
    all_positions.append(getPosition(perm_string))
#We want to check if there has been any collision, i.e., if there
#is one position that is repeated at least twice.
print "Number of different hashes: "+str(len(set(all_positions)))
#also 9!, so the hash works properly.
How does it work?
The idea behind this has to do with a tree: at the beginning it has 9 branches going to 9 nodes, each corresponding to a digit. From each of these nodes we have 8 branches going to 8 nodes, each corresponding to a digit except its parent, then 7, and so on.
We first store the first digit of our input string in a separate variable and pop it out from our 'node' list, because we have already taken the branch corresponding to the first digit.
Then we have 8 branches, and we choose the one corresponding to our second digit. Note that, since there are 8 branches, we need 3 bits to store the index of our chosen branch, and the maximum value it can take is 111 for the 8th branch (we map branches 1-8 to binary 000-111). Once we have chosen and stored the branch index, we pop that value out, so that the next node list doesn't include this digit again.
We proceed in the same way for branches 7, 6 and 5. Note that when we have 7 branches we still need 3 bits, though the maximum value will be 110. When we have 5 branches, the index will be at most binary 100.
Then we get to 4 branches and we notice that this can be stored with just 2 bits, same for 3 branches. For 2 branches we will just need 1 bit, and for the last branch we don't need any bit: there will be just one branch pointing to the last digit, which will be the remaining one from our original 1-9 list.
So, what we have so far: the first digit stored in a separate variable and a list of 7 indexes representing branches. The first 4 indexes can be represented with 3 bits, the following 2 indexes can be represented with 2 bits and the last index with 1 bit.
The idea is to concatenate all these indexes in their bit form to create a larger number. Since we have 17 bits, this number will be at most 2^17=131072. Now we just add the first digit we had stored to the end of that number (at most this digit will be 9) and we have that the biggest number we can create is 1310729.
But we can do better: recall that when we had 5 branches we needed 3 bits, though the maximum value was binary 100. What if we arrange our bits so that those with more 0s come first? If so, in the worst case scenario our final bit number will be the concatenation of:
100 10 101 110 111 11 1
Which in decimal is 76735. Then we proceed as before (adding the 9 at the end) and we get that our biggest possible generated number is 767359, which is the amount of bits we need and corresponds to input string 987654321, while the lowest possible number is 1 which corresponds to input string 123456789.
Just to finish: one might wonder why we have stored the first digit in a separate variable and added it at the end. The reason is that if we had kept it, then the number of branches at the beginning would have been 9, so for storing the first index (1-9) we would have needed 4 bits (0000 to 1000), which would have made our mapping much less efficient, as in that case the biggest possible number (and therefore the amount of memory needed) would have been
1000 100 10 101 110 111 11 1
which is 1125311 in decimal (1.13 Mb vs 768 Kb). It is quite interesting to see that the ratio 1.13M/0.768M ≈ 1.47 has something to do with the ratio of using four bits compared to just adding a decimal digit (2^4/10 = 1.6), which makes a lot of sense (the difference is due to the fact that with the first approach we are not fully using the 4 bits).
First. There is nothing faster than a list of booleans. There's a total of 9! == 362880 possible permutations for your task, which is a reasonably small amount of data to store in memory:
import math
visited_states = [False] * math.factorial(9)
Alternatively, you can use an array of bytes, which is slightly slower (though not by much) and has a much lower memory footprint (by an order of magnitude at least). However, any memory savings from using an array will probably be of little value considering the next step.
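A small sketch of that byte-array alternative, assuming the permutation has already been converted to an index as described in the next step (function names are illustrative):
import math

visited_states = bytearray(math.factorial(9))  # 362880 zero-valued bytes

def mark_visited(idx):
    visited_states[idx] = 1       # any non-zero byte means "already seen"

def was_visited(idx):
    return bool(visited_states[idx])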
Second. You need to convert your specific permutation to it's index. There are algorithms which do this, one of the best StackOverflow questions on this topic is probably this one:
Finding the index of a given permutation
You have fixed permutation size n == 9, so whatever complexity an algorithm has, it will be equivalent to O(1) in your situation.
However to produce even faster results, you can pre-populate a mapping dictionary which will give you an O(1) lookup:
import itertools

all_permutations = map(lambda p: ''.join(p), itertools.permutations('123456789'))
permutation_index = dict((perm, index) for index, perm in enumerate(all_permutations))
This dictionary will consume about 50 Mb of memory, which is... not that much actually. Especially since you only need to create it once.
After all this is done, checking your specific combination is done with:
visited = visited_states[permutation_index['168249357']]
Marking it to visited is done in the same manner:
visited_states[permutation_index['168249357']] = True
Note that using any of the permutation-index algorithms will be much slower than the mapping dictionary. Most of those algorithms are of O(n^2) complexity, and in your case that results in roughly 81 times worse performance, even discounting the extra Python code itself. So unless you have heavy memory constraints, using the mapping dictionary is probably the best solution speed-wise.
Addendum. As has been pointed out by Palec, visited_states list is actually not needed at all - it's perfectly possible to store True/False values directly in the permutation_index dictionary, which saves some memory and an extra list lookup.
Notice that if you type hash(125346987) it returns 125346987. That is for a reason: there is no point in hashing an integer to anything other than an integer.
What you should do is, when you find a pattern, add it to a dictionary rather than a list. This will provide the fast lookup you need rather than traversing the list like you are doing now.
So say you find the pattern 125346987 you can do:
foundPatterns = {}
#some code to find the pattern
foundPatterns[1] = 125346987
#more code
#test if there?
125346987 in foundPatterns.values()
True
If you must always have O(1), then it seems like a bit array would do the job. You'd only need to store 363,000 elements, which seems doable. Note, though, that in practice it's not always faster. The simplest implementation looks like:
Create data structure
visited_bitset = [False for _ in xrange(363000)]
Test current state and add if not visited yet
if not visited_bitset[current_state]:
    visited_bitset[current_state] = True
Paul's answer might work.
Elisha's answer is a perfectly valid hash function that guarantees no collisions. A size of 9! would be the bare minimum for a guaranteed collision-free hash function, but (unless someone corrects me; Paul probably has) I don't believe there exists a function to map each board to a value in the domain [0, 9!], let alone a hash function that is nothing more than O(1).
If you have 1 GB of memory to support a Boolean array of 864197532 (i.e. 987654321 - 123456789) indices, you guarantee (computationally) the O(1) requirement.
Practically speaking (meaning when you run on a real system) this isn't going to be cache friendly, but on paper this solution will definitely work. Even if a perfect function did exist, I doubt it would be cache friendly either.
Using prebuilt types like set or hashmap (sorry, I haven't programmed Python in a while, so I don't remember the exact datatype) must have amortized O(1) lookups. But using one of these with a suboptimal hash function like n % RANDOM_PRIME_NUM_GREATER_THAN_100000 might give the best solution.
I have the following sorted python list, although multiple values can occur:
[0.0943200769115388, 0.17380131294164516, 0.4063245853719435,
0.45796523225774904, 0.5040225609708342, 0.5229351852840304,
0.6145136350368882, 0.6220712583558284, 0.7190096076050408,
0.8486436998476048, 0.8957381707345986, 0.9774325873910711,
0.9832076130275351, 0.985386554764682, 1.0]
Now, I want to know the index in the array where a particular value may fall:
For example, a value of 0.25 would fall at index 2 because it is between 0.173 and 0.40. I guess I could go through the list and do this in a for loop, but I was wondering if there is some better way to do this which may be more computationally efficient. I create this array once but have to perform many lookups.
>>> vals = [0.0943200769115388, 0.17380131294164516, 0.4063245853719435,
0.45796523225774904, 0.5040225609708342, 0.5229351852840304,
0.6145136350368882, 0.6220712583558284, 0.7190096076050408,
0.8486436998476048, 0.8957381707345986, 0.9774325873910711,
0.9832076130275351, 0.985386554764682, 1.0]
>>> import bisect
>>> bisect.bisect(vals, 0.25)
2
If you know the list is already sorted, then the textbook solution is to do a binary search. You keep two index bounds, min and max. Initialize them to 0 and len - 1. Then set mid to be (min + max) / 2. Compare the value at index mid with your target value. If it's less, then set min to mid + 1. If it's greater, then set max to mid - 1. Repeat until you either find the value or until max < min, in which case you will have found the desired index in O(log(n)) steps.
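A minimal sketch of that min/max/mid loop (when the value is absent it returns the insertion point, which matches the bisect result above):
def find_index(vals, target):
    lo, hi = 0, len(vals) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if vals[mid] == target:
            return mid            # exact match found
        elif vals[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return lo                     # insertion point when target is absent
For the list above, find_index(vals, 0.25) returns 2.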
Hey. I have a very large array and I want to find the Nth largest value. Trivially I can sort the array and then take the Nth element but I'm only interested in one element so there's probably a better way than sorting the entire array...
A heap is the best data structure for this operation and Python has an excellent built-in library to do just this, called heapq.
import heapq
def nth_largest(n, iter):
    return heapq.nlargest(n, iter)[-1]
Example Usage:
>>> import random
>>> iter = [random.randint(0,1000) for i in range(100)]
>>> n = 10
>>> nth_largest(n, iter)
920
Confirm result by sorting:
>>> list(sorted(iter))[-10]
920
Sorting would require O(n log n) runtime at minimum; there are very efficient selection algorithms which can solve your problem in linear time.
Partition-based selection (sometimes called quickselect), which is based on the idea of quicksort (recursive partitioning), is a good solution (see the link for pseudocode, plus another example).
A simple modified quicksort works very well in practice. It has average running time proportional to N (though worst case bad luck running time is O(N^2)).
Proceed like a quicksort. Pick a pivot value randomly, then stream through your values and see if they are above or below that pivot value and put them into two bins based on that comparison.
In quicksort you'd then recursively sort each of those two bins. But for the N-th highest value computation, you only need to recurse into ONE of the bins: the population of each bin tells you which bin holds your N-th highest value. So, for example, if you want the 125th highest value, and you sort into two bins which have 75 in the "high" bin and 150 in the "low" bin, you can ignore the high bin and just proceed to finding the 125-75=50th highest value in the low bin alone.
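A hedged sketch of that partition-based idea (pivot chosen at random, recursing only into the bin that must contain the answer; it assumes 1 <= n <= len(values), and the name is illustrative):
import random

def quickselect_nth_largest(values, n):
    # n = 1 means the maximum value
    pivot = random.choice(values)
    high = [v for v in values if v > pivot]
    equal = [v for v in values if v == pivot]
    low = [v for v in values if v < pivot]
    if n <= len(high):
        return quickselect_nth_largest(high, n)
    if n <= len(high) + len(equal):
        return pivot
    return quickselect_nth_largest(low, n - len(high) - len(equal))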
You can iterate the entire sequence maintaining a list of the 5 largest values you find (this will be O(n)). That being said I think it would just be simpler to sort the list.
You could try the Median of Medians method - its speed is O(N).
Use heapsort. It only partially orders the list until you draw the elements out.
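A small sketch of that partial-ordering idea with heapq (negating values so the min-heap pops the largest first; it assumes 1 <= n <= len(values), and the name is illustrative):
import heapq

def nth_largest_by_heap(values, n):
    heap = [-v for v in values]
    heapq.heapify(heap)          # O(len(values))
    for _ in range(n - 1):
        heapq.heappop(heap)      # draw out only the first n - 1 elements
    return -heap[0]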
You essentially want to produce a "top-N" list and select the one at the end of that list.
So you can scan the array once and insert into an empty list when the largeArray item is greater than the last item of your top-N list, then drop the last item.
After you finish scanning, pick the last item in your top-N list.
An example for ints and N = 5:
int[] top5 = new int[5];
top5[0] = top5[1] = top5[2] = top5[3] = top5[4] = 0x80000000; // or your min value
// note: quickSort here must sort in descending order, so top5[4] stays the smallest of the top 5
for(int i = 0; i < largeArray.length; i++) {
    if(largeArray[i] > top5[4]) {
        // insert into top5:
        top5[4] = largeArray[i];
        // resort:
        quickSort(top5);
    }
}
As people have said, you can walk the list once keeping track of the K largest values. If K is large this algorithm will be close to O(n^2).
However, you can store your K largest values in a binary tree and the operation becomes O(n log K).
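A minimal sketch of that approach, using a heapq min-heap as the ordered structure holding the K largest values seen so far (the smallest of them sits at the root, so each new element costs O(log K); it assumes len(values) >= K):
import heapq

def kth_largest(values, k):
    heap = []
    for v in values:
        if len(heap) < k:
            heapq.heappush(heap, v)
        elif v > heap[0]:
            heapq.heappushpop(heap, v)  # replace the smallest of the current top K
    return heap[0]                      # the K-th largest overall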
According to Wikipedia, this is the best selection algorithm:
function findFirstK(list, left, right, k)
    if right > left
        select pivotIndex between left and right
        pivotNewIndex := partition(list, left, right, pivotIndex)
        if pivotNewIndex > k // new condition
            findFirstK(list, left, pivotNewIndex-1, k)
        if pivotNewIndex < k
            findFirstK(list, pivotNewIndex+1, right, k)
Its complexity is O(n)
One thing you should do if this is in production code is test with samples of your data.
For example, you might consider arrays of 1000 or 10000 elements 'large', and code up a quickselect method from a recipe.
The compiled nature of sorted, and its somewhat hidden and constantly evolving optimizations, make it faster than a Python-written quickselect method on small to medium sized datasets (< 1,000,000 elements). Also, you might find that as you increase the size of the array beyond that amount, memory is more efficiently handled in native code, and the benefit continues.
So, even if quickselect is O(n) vs sorted's O(n log n), that doesn't take into account how many actual machine code instructions processing each of the n elements will take, any impacts on pipelining, use of processor caches and other things the creators and maintainers of sorted will bake into the Python code.
You can keep two counts for each element: the number of elements bigger than it, and the number of elements smaller than it.
Then check whether the number of elements bigger than an element equals N - 1; the element that satisfies this condition is your output.
Check the solution below:
def NthHighest(l, n):
    if len(l) < n:
        return 0
    for i in range(len(l)):
        low_count = 0
        up_count = 0
        for j in range(len(l)):
            if l[j] > l[i]:
                up_count = up_count + 1
            else:
                low_count = low_count + 1
        # print(l[i], low_count, up_count)
        if up_count == n-1:
            # print(l[i])
            return l[i]
# # find the 4th largest number
l = [1,3,4,9,5,15,5,13,19,27,22]
print(NthHighest(l,4))
Using the above solution you can find both the Nth highest and the Nth lowest.
If you do not mind using pandas then:
import pandas as pd
N = 10
column_name = 0
pd.DataFrame(your_array).nlargest(N, column_name)
The above code will show you the N largest values along with the index position of each value.
Hope it helps. :-)
Pandas Nlargest Documentation