I have tried to summarize the problem statement like this:
Given n, k and an array (a list) arr, where n = len(arr) and k is an integer in the range [1, n] inclusive.
For an array (or list) myList, the unfairness sum is defined as the sum of the absolute differences between all possible pairs (combinations with 2 elements each) in myList.
To explain: if myList = [1, 2, 5, 5, 6], then its unfairness sum is computed as below. Please note that elements are considered unique by their index in the list, not by their values.
Unfairness Sum = |1-2| + |1-5| + |1-5| + |1-6| + |2-5| + |2-5| + |2-6| + |5-5| + |5-6| + |5-6|
If you actually need to look at the problem statement, it's HERE.
My Objective
Given n, k, arr (as described above), find the minimum unfairness sum (MUS) out of all the unfairness sums of the sub arrays possible, with the constraint that each len(sub array) = k [which is a good thing that makes our lives easier, I believe :) ]
What I have tried
Well, there is a lot to be added in here, so I'll try to be as short as I can.
My first approach was the one below, where I used itertools.combinations to get all the possible combinations and statistics.variance to check the spread of the data (yeah, I know I'm a mess).
Before you see the code, do you think variance and unfairness sum are perfectly related (I know they are strongly related), i.e. does the sub array with minimum variance have to be the sub array with the MUS?
You only have to check the LetMeDoIt(n, k, arr) function. If you need an MCVE, check the second code snippet below.
from itertools import combinations as cmb
from statistics import variance as varn

def LetMeDoIt(n, k, arr):
    v = []
    s = []
    subs = [list(x) for x in list(cmb(arr, k))]  # getting all sub arrays from arr in a list
    i = 0
    for sub in subs:
        if i != 0:
            var = varn(sub)  # the variance thingy
            if float(var) < float(min(v)):
                v.remove(v[0])
                v.append(var)
                s.remove(s[0])
                s.append(sub)
            else:
                pass
        elif i == 0:
            var = varn(sub)
            v.append(var)
            s.append(sub)
            i = 1
    final = []
    f = list(cmb(s[0], 2))  # getting list of all pairs (after determining sub array with least MUS)
    for r in f:
        final.append(abs(r[0]-r[1]))  # calculating the MUS in my messy way
    return sum(final)
The above code works fine for n<30 but raised a MemoryError beyond that.
In the Python chat, Kevin suggested I try a generator, which is memory efficient (it really is), but since a generator also produces those combinations on the fly as we iterate over them, it was estimated to take over 140 hours (:/) for n=50, k=8.
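For the record, the generator variant is roughly the sketch below (my reconstruction, not Kevin's exact code; it assumes k >= 2, since statistics.variance needs at least two data points):

from itertools import combinations
from statistics import variance

def LetMeDoIt_lazy(n, k, arr):
    # min() consumes the combinations generator lazily, so memory stays flat,
    # but it still visits all C(n, k) combinations - hence the 140-hour estimate
    best = min(combinations(arr, k), key=variance)
    return sum(abs(a - b) for a, b in combinations(best, 2))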
I posted the same as a question on SO HERE (you might want to have a look to understand me properly - it has discussions and an answer by fusion which took me to my second approach - a better one (I should say fusion's approach xD)).
Second Approach
from itertools import combinations as cmb

def myvar(arr):  # a function to calculate variance
    l = len(arr)
    m = sum(arr)/l
    return sum((i-m)**2 for i in arr)/l

def LetMeDoIt(n, k, arr):
    sorted_list = sorted(arr)  # I think sorting the array makes it easy to get the sub array with MUS quickly
    variance = None
    min_variance_sub = None
    for i in range(n - k + 1):
        sub = sorted_list[i:i+k]
        var = myvar(sub)
        if variance is None or var < variance:
            variance = var
            min_variance_sub = sub
    final = []
    f = list(cmb(min_variance_sub, 2))  # again getting all possible pairs in my messy way
    for r in f:
        final.append(abs(r[0] - r[1]))
    return sum(final)

def MainApp():
    n = int(input())
    k = int(input())
    arr = list(int(input()) for _ in range(n))
    result = LetMeDoIt(n, k, arr)
    print(result)

if __name__ == '__main__':
    MainApp()
This code works perfectly for n up to 1000 (maybe more), but terminates due to time out (5 seconds is the limit on the online judge :/ ) for n beyond 10000 (the biggest test case has n=100000).
=====
How would you approach this problem to take care of all the test cases within the given time limit (5 sec)? (The problem was listed under algorithms & dynamic programming.)
(For your reference you can have a look at:
successful submissions (py3, py2, C++, Java) on this problem by other candidates - so that you can explain that approach for me and future visitors;
an editorial by the problem setter explaining how to approach the question;
a solution code by the problem setter himself (py2, C++);
input data (test cases) and expected output.)
Edit 1:
For future visitors of this question, the conclusions I have reached so far are:
that variance and unfairness sum are not perfectly related (they are strongly related, though), which implies that among a lot of lists of integers, the list with minimum variance doesn't always have to be the list with the minimum unfairness sum. If you want to know why, I asked that as a separate question on Math Stack Exchange HERE, where one of the mathematicians proved it for me xD (and it's worth taking a look, 'cause it was unexpected); a concrete counterexample is also sketched right after this list.
As far as the question is concerned overall, you can read the answers by archer & Attersson below (I'm still trying to figure out a naive approach to carry this out - it shouldn't be far off by now, though).
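Here is a concrete counterexample of my own (not from the linked posts): of the two lists below, the first has the smaller variance but the larger unfairness sum, so minimising one does not minimise the other.

from itertools import combinations

def unfairness(xs):
    return sum(abs(a - b) for a, b in combinations(xs, 2))

def pop_variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m)**2 for x in xs) / len(xs)

xs_spread = [0, 2, 4, 6, 9]  # variance 9.76, unfairness sum 44
xs_spike = [0, 0, 0, 0, 10]  # variance 16.0, unfairness sum 40
print(pop_variance(xs_spread), unfairness(xs_spread))  # 9.76 44
print(pop_variance(xs_spike), unfairness(xs_spike))    # 16.0 40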
Thank you for any help or suggestions :)
You must work on your list SORTED and check only sublists with consecutive elements. This is because, necessarily, any sublist that includes at least one element that is not consecutive will have a higher unfairness sum.
For example if the list is
[1,3,7,10,20,35,100,250,2000,5000] and you want to check for sublists of length 3, then the solution must be one of [1,3,7], [3,7,10], [7,10,20] etc.
Any other sublist, e.g. [1,3,10], will have a higher unfairness sum because 10 > 7, and therefore all of its differences with the rest of the elements will be larger than those of 7.
The same goes for [1,7,10] (non-consecutive on the left side), as 1 < 3.
Given that, you only have to check the consecutive sublists of length k, which reduces the execution time significantly.
Regarding coding, something like this should work:
import itertools

def myvar(array):  # note: despite the name, this computes the unfairness sum
    return sum(abs(i[0]-i[1]) for i in itertools.combinations(array, 2))

def minsum(n, k, arr):
    arr = sorted(arr)  # the argument above only holds for a sorted list
    res = myvar(arr[0:k])  # initialise with the first subarray's sum
    for i in range(1, n-k+1):  # there are n-k+1 windows in total
        res = min(res, myvar(arr[i:i+k]))
    return res
I see this question still has no complete answer, so I will sketch a correct algorithm which will pass the judge. I will not write the code, in order to respect the purpose of the HackerRank challenge, since we already have working solutions above.
The original array must be sorted. This has a complexity of O(N log N).
At this point you can check consecutive sub arrays, as non-consecutive ones will result in a worse (or equal, but never better) "unfairness sum". This is also explained in archer's answer.
The last passage, finding the minimum "unfairness sum", can be done in O(N). You need to calculate the US for every consecutive k-long subarray. The mistake is recalculating this for every step in O(k), which brings the complexity of this passage to O(k*N). It can be done in O(1) per step, as the editorial you posted shows, including the mathematical formulae. It requires a prior initialization of a cumulative (prefix sum) array after step 1 (done in O(N), with O(N) space complexity too).
("It works but terminates due to time out for n<=10000" - from the comments on archer's answer.)
To explain step 3, think about k = 100. As you scroll the N-long array, on the first iteration you must calculate the US for the sub array from element 0 to 99 as usual, requiring 100 passages. The next step needs you to calculate the same for a sub array that differs from the previous one by only 1 element: 1 to 100. Then 2 to 101, etc.
If it helps, think of it like a snake. One block is removed and one is added.
There is no need to perform the whole O(k) scrolling. Just work out the maths as explained in the editorial and you will do it in O(1).
So the final complexity will asymptotically be O(N log N), due to the initial sort.
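To make the O(1) update concrete, here is a minimal sketch of the full pipeline (my own code and names, not the problem setter's): dropping a[i] removes its k-1 pairs, adding a[i+k] adds k-1 new ones, and both corrections only need a prefix-sum lookup.

def min_unfairness_sum(n, k, arr):
    a = sorted(arr)  # step 1: O(N log N)
    p = [0] * (n + 1)  # prefix sums: p[j] = a[0] + ... + a[j-1]
    for j in range(n):
        p[j+1] = p[j] + a[j]
    # US of the first window: a[j] is the larger element in j of its pairs
    # and the smaller in (k-1-j), so it contributes a[j] * (2j - (k-1))
    us = sum(a[j] * (2*j - (k-1)) for j in range(k))
    best = us
    for i in range(n - k):  # slide the window: drop a[i], add a[i+k]
        shared = p[i+k] - p[i+1]  # sum of the k-1 elements both windows share
        us += (k-1) * (a[i] + a[i+k]) - 2 * shared
        best = min(best, us)
    return best

print(min_unfairness_sum(5, 3, [1, 2, 5, 5, 6]))  # 2, from the window [5, 5, 6]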
I have a list of tuples, which I am using to mark the lower and upper bounds of ranges. For example:
[(3,10), (4,11), (2,6), (8,11), (9,11)] # 5 separate ranges.
I want to find the ranges where three or more of the input ranges overlap. For instance the tuples listed above would return:
[(4,6), (8,11)]
I tried the method provided by @WolframH in answer to this post.
But I don't know what I can do to:
Give me more than one output range
Set a threshold of at least three range overlaps to qualify an output
You first have to find all combinations of ranges. Then you can transform them to sets and calculate the intersection:
import itertools

limits = [(3,10), (4,11), (2,6), (8,11), (9,11)]
ranges = [range(*lim) for lim in limits]

results = []
for comb in itertools.combinations(ranges, 3):
    intersection = set(comb[0]).intersection(comb[1])
    intersection = intersection.intersection(comb[2])
    if intersection and intersection not in results and \
       not any(map(intersection.issubset, results)):
        results = list(filter(lambda res: not intersection.issuperset(res), results))  # list() needed in Python 3
        results.append(intersection)

result_limits = [(res[0], res[-1]+1) for res in map(sorted, results)]  # sort each set before taking its ends
It should give you all 3-wise intersections
You can, of course, solve this by brute-force checking all the combinations if you want. If you need this algorithm to scale, though, you can do it in (pseudo) O(n log n). You can technically come up with a degenerate worst case that's O(n**2), but whatchagonnado.
Basically, you sort the ranges, then for a given range you look to its immediate left to see whether the bounds overlap, and if so you then look right to mark overlapping intervals as results. Pseudocode (which is actually almost valid Python, look at that):
ranges.sort()
for left_range, current_range, right_range in sliding_window(ranges, 3):
    if left_range.right < current_range.left:
        continue
    while right_range.left < min(left_range.right, current_range.right):
        results.append(overlap(left_range, right_range))
        right_range = right_range.next
    # Before moving on to the next node, extend the current_range's right bound
    # to be the longer of (left_range.right, current_range.right).
    # This makes sense if you think about it.
    current_range.right = max(left_range.right, current_range.right)
merge_overlapping(results)
(you also need to merge some possibly-overlapping ranges at the end, this is another nlogn operation - though n will usually be much smaller there. I won't discuss the code for that, but it's similar in approach to the above, involving a sort-then-merge. See here for an example.)
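For completeness, the threshold-of-three question above can also be answered directly with a sweep line over the endpoints, with no pairwise intersections at all. A minimal sketch (my own function name, treating the tuples as half-open ranges, which matches the expected output):

def overlaps_of_at_least(limits, threshold=3):
    events = []
    for lo, hi in limits:
        events.append((lo, +1))  # a range opens
        events.append((hi, -1))  # a range closes
    events.sort()  # ends sort before starts at the same point
    results, count, start = [], 0, None
    for pos, delta in events:
        count += delta
        if count >= threshold and start is None:
            start = pos  # entered a region covered >= threshold times
        elif count < threshold and start is not None:
            results.append((start, pos))  # left that region
            start = None
    return results

print(overlaps_of_at_least([(3,10), (4,11), (2,6), (8,11), (9,11)]))  # [(4, 6), (8, 11)]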
I have a function which receives an integer as an input and, depending on which range this input lies in, assigns to it a difficulty value. I know that this can be done using if/else blocks. I was wondering whether there is a more efficient/cleaner way to do it.
I tried to do something like this
TIME_RATING_KEY = {
    range(0, 46): 1,
    range(46, 91): 2,
    range(91, 136): 3,
    range(136, 201): 4,
    range(201, 10800): 5,
}
But I found out that we can't use a range as a key in a dict (right?). So is there a better way to do this?
You can implement an interval tree. This kind of data structure is able to return all the intervals that intersect a given input point.
In your case intervals don't overlap, so they would always return 1 interval.
Centered interval trees run in O(log n + m) time, where m is the number of intervals returned (1 in your case). So this would reduce the complexity from O(n) to O(log n).
The idea of these interval trees is the following:
You consider the interval that encloses all the intervals you have
Take the center of that interval and partition the given intervals into those that end before that point, those that contain that point and those that start after it.
Recursively construct the same kind of tree for the intervals ending before the center and those starting after it
Keep the intervals that contain the center point in two sorted sequences. One sorted by starting point, and the other sorted by ending point
When searching go left or right depending on the center point. When you find an overlap you use binary search on the sorted sequence you want to check (this allows for looking up not only intervals that contain a given point but intervals that intersect or contain a given interval).
It's trivial to modify the data structure to return a specific value instead of the found interval.
This said, from the context I don't think you actually need to speed up this lookup, and you should probably use the simpler and more readable solution, since it would be more maintainable and there are fewer chances to make mistakes.
However, reading about the more efficient data structure can turn out to be useful in the future.
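That said, because the intervals in this question are contiguous and non-overlapping, a plain bisect over the upper boundaries already gives an O(log n) lookup without building a tree. A small sketch under that assumption:

import bisect

UPPER_BOUNDS = [46, 91, 136, 201]  # upper bounds of difficulties 1..4; 201-10799 is 5

def time_rating(n):
    if not 0 <= n < 10800:
        raise ValueError("input outside the rated range")
    # bisect_right counts how many upper bounds n has already reached
    return bisect.bisect_right(UPPER_BOUNDS, n) + 1

print(time_rating(45))   # 1
print(time_rating(46))   # 2
print(time_rating(200))  # 4
print(time_rating(201))  # 5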
The simplest way is probably just to write a short function:
def convert(n, difficulties=[0, 46, 91, 136, 201]):
    if n < difficulties[0]:
        raise ValueError
    for difficulty, end in enumerate(difficulties):
        if n < end:
            return difficulty
    else:
        return len(difficulties)
Examples:
>>> convert(32)
1
>>> convert(68)
2
>>> convert(150)
4
>>> convert(250)
5
As a side note: You can use a range as a dictionary key in Python 3.x, but not directly in 2.x (because range returns a list). You could do:
TIME_RATING_KEY = {tuple(range(0, 46)): 1, ...}
However that won't be much help!
I am working on a postage application which is required to check an integer postcode against a number of postcode ranges, and return a different code based on which range the postcode matches against.
Each code has more than one postcode range. For example, the M code should be returned if the postcode is within the ranges 1000-2429, 2545-2575, 2640-2686 or is equal to 2890.
I could write this as:
if 1000 <= postcode <= 2429 or 2545 <= postcode <= 2575 or 2640 <= postcode <= 2686 or postcode == 2890:
    return 'M'
but this seems like a lot of lines of code, given that there are 27 returnable codes and 77 total ranges to check against. Is there a more efficient (and preferably more concise) method of matching an integer to all these ranges using Python?
Edit: There are a lot of excellent solutions flying around, so I have implemented all the ones that I could, and benchmarked their performances.
The environment for this program is a web service (Django-powered actually) which needs to check postcode region codes one-by-one, on the fly. My preferred implementation, then, would be one that can be quickly used for each request, and does not need any process to be kept in memory, or needs to process many postcodes in bulk.
I tested the following solutions using timeit.Timer with default 1000000 repetitions using randomly generated postcodes each time.
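(For reference, the harness presumably looked something like the sketch below; this is my reconstruction from the description, with get_code standing for whichever implementation is under test.)

from timeit import Timer

setup = "from __main__ import get_code; import random"
stmt = "get_code(random.randint(1000, 9999))"
print(Timer(stmt, setup).timeit(number=1000000))  # seconds for 1m lookups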
IF solution (my original)
if 1000 <= postcode <= 2249 or 2555 <= postcode <= 2574 or ...:
    return 'M'
if 2250 <= postcode <= 2265 or ...:
    return 'N'
...
Time for 1m reps: 5.11 seconds.
Ranges in tuples (Jeff Mercado)
Somewhat more elegant to my mind and certainly easier to enter and read the ranges. Particularly good if they change over time, which is possible. But it did end up four times slower in my implementation.
if any(lower <= postcode <= upper for (lower, upper) in [(1000, 2249), (2555, 2574), ...]):
    return 'M'
if any(lower <= postcode <= upper for (lower, upper) in [(2250, 2265), ...]):
    return 'N'
...
Time for 1m reps: 19.61 seconds.
Set membership (gnibbler)
As stated by the author, "it's only better if you are building the set once to check against many postcodes in a loop". But I thought I would test it anyway to see.
if postcode in set(chain(*(xrange(start, end+1) for start, end in ((1000, 2249), (2555, 2574), ...)))):
    return 'M'
if postcode in set(chain(*(xrange(start, end+1) for start, end in ((2250, 2265), ...)))):
    return 'N'
...
Time for 1m reps: 339.35 seconds.
Bisect (robert king)
This one may have been a bit above my intellect level. I learnt a lot reading about the bisect module but just couldn't quite work out which parameters to give find_ge() to make a runnable test. I expect that it would be extremely fast with a loop of many postcodes, but not if it had to do the setup each time. So, with 1m repetitions of filling numbers, edgepairs, edgeanswers etc for just one postal region code (the M code with four ranges), but not actually running the fast_solver:
Time for 1m reps: 105.61 seconds.
Dict (sentinel)
Using one dict per postal region code pre-generated, cPickled in a source file (106 KB), and loaded for each run. I was expecting much better performance from this method, but on my system at least, the IO really destroyed it. The server is a not-quite-blindingly-fast-top-of-the-line Mac Mini.
Time for 1m reps: 5895.18 seconds (extrapolated from a 10,000 run).
The summary
Well, I was expecting someone to just give a simple 'duh' answer that I hadn't considered, but it turns out this is much more complicated (and even a little controversial).
If every nanosecond of efficiency counted in this case, I would probably keep a separate process running which implemented one of the binary search or dict solutions and kept the result in memory for an extremely fast lookup. However, since the IF tree takes only five seconds to run a million times, which is plenty fast enough for my small business, that's what I'll end up using in my application.
Thank you to everyone for contributing!
You can throw your ranges into tuples and put the tuples in a list. Then use any() to help you find if your value is within these ranges.
ranges = [(1000, 2429), (2545, 2575), (2640, 2686), (2890, 2890)]
if any(lower <= postcode <= upper for (lower, upper) in ranges):
    print('M')
Probably the fastest will be to check the membership of a set
>>> from itertools import chain
>>> ranges = ((1000, 2429), (2545, 2575), (2640, 2686), (2890, 2890))
>>> postcodes = set(chain(*(xrange(start, end+1) for start, end in ranges)))
>>> 1000 in postcodes
True
>>> 2500 in postcodes
False
But it does use more memory this way, and building the set takes time, so it's only better if you are building the set once to check against many postcodes in a loop
EDIT: seems that different ranges need to map to different letters
>>> from itertools import chain
>>> ranges = {'M': ((1000, 2429), (2545, 2575), (2640, 2686), (2890, 2890)),
...           # more ranges
...           }
>>> postcodemap = dict((k, v) for v in ranges
...                    for k in chain(*(xrange(start, end+1) for start, end in ranges[v])))
>>> print postcodemap.get(1000)
M
>>> print postcodemap.get(2500)
None
Here is a fast and short solution, using numpy:
import numpy as np

lows = np.array([1, 10, 100])  # the lower bounds
ups = np.array([3, 15, 130])   # the upper bounds

def in_range(x):
    return np.any((lows <= x) & (x <= ups))
Now for instance
in_range(2) # True
in_range(23) # False
You only have to solve for the edge cases, and for one number between edge cases, when doing inequalities.
e.g. if you do the following tests on TEN:
10 < 20, 10 < 15, 10 > 8, 10 > 12
it will give True, True, True, False.
But note that the closest boundary numbers to 10 are 8 and 12; this means that 9, 10 and 11 will all give the same answers that ten did. So if you don't have too many initial range numbers and they are sparse, then this will help. Otherwise you will need to see if your inequalities are transitive and use a range tree or something.
So what you can do is sort all of your boundaries into intervals.
e.g. if your inequalities had the numbers 12, 50, 192,999
you would get the following intervals that ALL have the same answer:
less than 12, 12, 13-49, 50, 51-191, 192, 193-998, 999, 999+
as you can see from these intervals we only need to solve for 9 cases and we can then quickly solve for anything.
Here is an example of how I might carry it out for solving for a new number x using these pre-calculated results:
a) is x a boundary? (is it in the set)
if yes, then return the answer you found for that boundary previously.
otherwise use case b)
b) find the maximum boundary number that is smaller than x, call it maxS
find the minimum boundary number that is larger than x call it minL.
Now just return any previously found solution that was between maxS and minL.
See Python binary search-like function to find first number in sorted list greater than a specific value for finding the closest numbers. The bisect module will help (import it in your code); this is how you find maxS and minL.
You can use bisect and the function I have included in my sample code:

from bisect import bisect_left

def find_ge(a, key):
    '''Find smallest item greater-than or equal to key.
    Raise ValueError if no such item exists.
    If multiple keys are equal, return the leftmost.
    '''
    i = bisect_left(a, key)
    if i == len(a):
        raise ValueError('No item found with key at or above: %r' % (key,))
    return a[i]

ranges = [(1000, 2429), (2545, 2575), (2640, 2686), (2890, 2890)]
numbers = []
for pair in ranges:
    numbers += list(pair)
numbers += [-999999, 999999]  # ensure nothing goes outside the range
numbers.sort()
edges = set(numbers)

edgepairs = {}
for i in range(len(numbers)-1):
    # a representative point strictly between two adjacent edges
    edgepairs[(numbers[i], numbers[i+1])] = (numbers[i] + numbers[i+1]) // 2

def slow_solver(x):
    return  # your answer for postcode x

listedges = list(edges)
edgeanswers = dict(zip(listedges, map(slow_solver, listedges)))
edgepairsanswers = dict(zip(edgepairs.keys(), map(slow_solver, edgepairs.values())))

# now we are ready for fast solving:
def fast_solver(x):
    if x in edges:
        return edgeanswers[x]
    else:
        # find maxS and minL using find_ge and your own similar find_le
        return edgepairsanswers[(maxS, minL)]
The full data isn't there, but I'm assuming the ranges are non-overlapping, so you can express your ranges as a single sorted tuple of ranges, along with their codes:
ranges = (
    (1000, 2249, 'M'),
    (2250, 2265, 'N'),
    (2555, 2574, 'M'),
    # ...
)
This means we can binary search over them in one go. This should be O(log(N)) time, which should result in pretty decent performance with very large sets.
def code_lookup(value, ranges):
    left, right = 0, len(ranges)
    while left != right - 1:
        mid = left + (right - left) // 2
        if value <= ranges[mid - 1][1]:  # Check left split max
            right = mid
        elif value >= ranges[mid][0]:  # Check right split min
            left = mid
        else:  # We are in a gap
            return None
    if ranges[left][0] <= value <= ranges[left][1]:
        # Return the code
        return ranges[left][2]
I don't have your exact values, but for comparison I ran it against some generated ranges (77 ranges with various codes) and compared it to a naive approach:
def get_code_naive(value):
    if 1000 < value < 2249:
        return 'M'
    if 2250 < value < 2265:
        return 'N'
    # ...
The result for 1,000,000 was that the naive version ran in about 5 sec and the binary search version in 4 sec. So it's a bit faster (20%), the codes are a lot nicer to maintain and the longer the list gets, the more it will out-perform the naive method over time.
Recently I had a similar requirement and I used bit manipulation to test if an integer belongs to said range. It is definitely faster, but I guess not suitable if your ranges involve huge numbers. I liberally copied example methods from here
First we create a binary number which will have all bits in the range set to 1.
# Sets the bits to one between lower and upper range
def setRange(permitRange, lower, upper):
    # the range is inclusive of left & right edge, so add 1 to the upper limit
    bUpper = 1 << (upper + 1)
    bLower = 1 << lower
    mask = bUpper - bLower
    return permitRange | mask

# For my case the ranges also include single integers, so I added a method to set single bits
def setBit(permitRange, number):
    mask = 1 << number
    return permitRange | mask
Now it's time to parse the ranges and populate our binary mask. If the highest number in the ranges is n, we will be creating an integer greater than 2^n in binary:
# Example ranges (10-20, 25, 30-50)
rangeList = "10-20, 25, 30-50"
maxRange = 100
permitRange = 1 << maxRange
for item in rangeList.split(","):
    item = item.strip()  # strip spaces so isdigit() works
    if item.isdigit():
        permitRange = setBit(permitRange, int(item))
    else:
        lower, upper = item.split("-", 1)
        permitRange = setRange(permitRange, int(lower), int(upper))
To check if a number 'n' belongs to the range, simply test the bit at n'th position
# returns a non-zero result, 2**offset, if the bit at 'offset' is one
def testBit(permitRange, number):
    mask = 1 << number
    return permitRange & mask

if testBit(permitRange, 10):
    do_something()
Warning - This is probably premature optimisation. For a large list of ranges it might be worthwhile, but probably not in your case. Also, although dictionary/set solutions will use more memory, they are still probably a better choice.
You could do a binary-search into your range end-points. This would be easy if all ranges are non-overlapping, but could still be done (with some tweaks) for overlapping ranges.
Do a find-highest-match-less-than binary search. This is the same as a find-lowest-match-greater-than-or-equal (lower bound) binary search, except that you subtract one from the result.
Use half-open items in your list of end points - that is if your range is 1000..2429 inclusive, use the values 1000 and 2430. If you get an end-point and a start-point with the same value (two ranges touching, so there is no gap between) exclude the end-point for the lower range from your list.
If you find a start-of-range end-point, your goal value is within that range. If you find an end-of-range end-point, your goal value isn't in any range.
The binary search algorithm is roughly (don't expect this to run without editing)...
while upperbound > lowerbound :
testpos = lowerbound + ((upperbound-lowerbound) // 2)
if item [testpos] > goal :
# new best-so-far
upperbound = testpos
else :
lowerbound = testpos + 1
Note - the "//" division operator is necessary for integer division in Python 3. In Python 2, the normal "/" will work, but it's best to be ready for Python 3.
At the end, both upperbound and lowerbound point to the found item - but for the "upper bound" search. Subtract one to get the required search result. If that gives -1, there is no matching range.
There's probably a binary search routine in the library that does the upper-bound search, so prefer that to this if so. To get a better understanding of how the binary search works, see How can I better understand the one-comparison-per-iteration binary search? - no, I'm not above begging for upvotes ;-)
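In Python, bisect is that library routine. A minimal sketch of the half-open end-point scheme described above, using the question's M ranges flattened into a sorted end-point list (my own arrangement of them):

import bisect

# end-points of 1000-2429, 2545-2575, 2640-2686, 2890, stored half-open
endpoints = [1000, 2430, 2545, 2576, 2640, 2687, 2890, 2891]

def in_any_range(postcode):
    i = bisect.bisect_right(endpoints, postcode)
    return i % 2 == 1  # odd index: past a start-point but before its end-point

print(in_any_range(2429))  # True
print(in_any_range(2500))  # False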
Python has a range(a, b) function which means the range from (and including) a, to (but excluding) b. You can make a list of these ranges and check to see if a number is in any of them. It may be more efficient to use xrange(a, b) which has the same meaning but doesn't actually make a list in memory.
list_of_ranges = []
list_of_ranges.append(xrange(1000, 2430))
list_of_ranges.append(xrange(2545, 2576))
for x in [999, 1000, 2429, 2430, 2544, 2545]:
    result = False
    for r in list_of_ranges:
        if x in r:
            result = True
            break
    print x, result
A bit of a silly approach to an old question, but I was curious how well regex character classes would handle the problem, since this exact problem occurs frequently in questions about character validity.
To make a regex for the "M" postal codes you showed, we can turn the numbers into unicode using chr():
import re

m_ranges = [(1000, 2249), (2545, 2575), (2640, 2686)]
m_singletons = [2890]
m_range_char_class_members = [fr"{chr(low)}-{chr(high)}" for (low, high) in m_ranges]
m_singleton_char_class_members = [fr"{chr(x)}" for x in m_singletons]
m_char_class = f"[{''.join(m_range_char_class_members + m_singleton_char_class_members)}]"
m_regex = re.compile(m_char_class)
Then a very rough benchmark on 1 million random postal codes for this method vs your original if-statement:
import random
import time

test_values = [random.randint(1000, 9999) for _ in range(1000000)]

def is_m_regex(num):
    return m_regex.match(chr(num))

def is_m_if(num):
    return 1000 <= num <= 2249 or 2545 <= num <= 2575 or 2640 <= num <= 2686 or num == 2890

def run_regex_test():
    start_time = time.time()
    for i in test_values:
        is_m_regex(i)
    print("--- REGEX: %s seconds ---" % (time.time() - start_time))

def run_if_test():
    start_time = time.time()
    for i in test_values:
        is_m_if(i)
    print("--- IF: %s seconds ---" % (time.time() - start_time))
...
running regex test
--- REGEX: 0.3418138027191162 seconds ---
--- IF: 0.19183707237243652 seconds ---
So this would suggest that for comparing one character at a time, using raw if statements is faster than character classes in regexes. No surprise here, since using regex is a bit silly for this problem.
BUT. When doing an operation like sub to eliminate all matches from a string composed of all the original test values, it ran much quicker:
blob_value = ''.join([chr(x) for x in test_values])

def run_regex_test_char_blob():
    start_time = time.time()
    subbed = m_regex.sub('', blob_value)
    print("--- REGEX BLOB: %s seconds ---" % (time.time() - start_time))
    print(f"original blob length : {len(blob_value)}")
    print(f"sub length : {len(subbed)}")
...
--- REGEX BLOB: 0.03655815124511719 seconds ---
original blob length : 1000000
sub length : 851928
The sub method here replaces all occurrences of M-postal-characters (~15% of this sample), which means it operated on all 1 million characters of the string. That would suggest to me that mass operations by the re package are MUCH more efficient than individual operations suggested in these answers. So if you've really got a lot of comparisons to do at once in a data pipeline, you may find the best performance by doing some string composition and using regex.
In Python 3.2, functools.lru_cache was introduced.
Your solution along with the aforementioned decorator should be pretty fast.
Or, Python 3.9's functools.cache could be used as well (which should be even faster).
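A minimal sketch of that suggestion, wrapping a chained-if lookup like the original question's (only the M branch is shown; the remaining codes are elided):

from functools import lru_cache

@lru_cache(maxsize=None)  # functools.cache in 3.9+ behaves the same here
def get_code(postcode):
    if 1000 <= postcode <= 2429 or 2545 <= postcode <= 2575 \
            or 2640 <= postcode <= 2686 or postcode == 2890:
        return 'M'
    # ... the remaining 26 codes ...
    return None

print(get_code(2000))  # 'M'; repeated calls for 2000 become dictionary hits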
Have you really made benchmarks? Does the performance of this piece of code influence the performance of the overall application? So benchmark first!
But you can also use a dict e.g. for storing all keys of the "M" ranges:
mhash = {1000: True, 1001: True, ..., 2429: True, ...}
if postcode in mhash:
    print 'M'
Of course: the hashes require more memory but access time is O(1).
Hey. I have a very large array and I want to find the Nth largest value. Trivially I can sort the array and then take the Nth element, but I'm only interested in one element, so there's probably a better way than sorting the entire array...
A heap is the best data structure for this operation and Python has an excellent built-in library to do just this, called heapq.
import heapq

def nth_largest(n, iter):
    return heapq.nlargest(n, iter)[-1]
Example Usage:
>>> import random
>>> iter = [random.randint(0,1000) for i in range(100)]
>>> n = 10
>>> nth_largest(n, iter)
920
Confirm result by sorting:
>>> list(sorted(iter))[-10]
920
Sorting would require O(n log n) runtime at minimum - there are very efficient selection algorithms which can solve your problem in linear time.
Partition-based selection (sometimes called quickselect), which is based on the idea of quicksort (recursive partitioning), is a good solution (see link for pseudocode + another example).
A simple modified quicksort works very well in practice. It has average running time proportional to N (though worst case bad luck running time is O(N^2)).
Proceed like a quicksort. Pick a pivot value randomly, then stream through your values and see if they are above or below that pivot value and put them into two bins based on that comparison.
In quicksort you'd then recursively sort each of those two bins. But for the N-th highest value computation, you only need to sort ONE of the bins.. the population of each bin tells you which bin holds your n-th highest value. So for example if you want the 125th highest value, and you sort into two bins which have 75 in the "high" bin and 150 in the "low" bin, you can ignore the high bin and just proceed to finding the 125-75=50th highest value in the low bin alone.
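A minimal sketch of that bin-and-recurse idea (my own names; average O(N), with n=1 meaning the maximum):

import random

def nth_largest_select(values, n):
    values = list(values)
    k = n - 1  # index in descending order
    while True:
        pivot = random.choice(values)
        above = [v for v in values if v > pivot]
        equal = [v for v in values if v == pivot]
        if k < len(above):
            values = above  # the answer is in the "high" bin
        elif k < len(above) + len(equal):
            return pivot  # the pivot itself is the answer
        else:
            k -= len(above) + len(equal)  # discard the "high" bin
            values = [v for v in values if v < pivot]

print(nth_largest_select([1, 3, 4, 9, 5, 15, 5, 13, 19, 27, 22], 4))  # 15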
You can iterate the entire sequence maintaining a list of the 5 largest values you find (this will be O(n)). That being said I think it would just be simpler to sort the list.
You could try the Median of Medians method - its speed is O(N).
Use heapsort. It only partially orders the list until you draw the elements out.
You essentially want to produce a "top-N" list and select the one at the end of that list.
So you can scan the array once and insert into an empty list when the largeArray item is greater than the last item of your top-N list, then drop the last item.
After you finish scanning, pick the last item in your top-N list.
An example for ints and N = 5:
int[] top5 = new int[5];
top5[0] = top5[1] = top5[2] = top5[3] = top5[4] = 0x80000000; // or your min value

for (int i = 0; i < largeArray.length; i++) {
    if (largeArray[i] > top5[4]) {
        // insert into top5:
        top5[4] = largeArray[i];
        // resort descending, so top5[4] stays the smallest of the five:
        quickSort(top5);
    }
}
As people have said, you can walk the list once keeping track of the K largest values. If K is large this algorithm will be close to O(n²).
However, you can store your K largest values as a binary tree and the operation becomes O(n log k).
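heapq can play the role of that tree: keep a size-k min-heap whose root is always the current K-th largest. A small sketch (my own function name):

import heapq

def kth_largest(nums, k):
    heap = []
    for x in nums:
        if len(heap) < k:
            heapq.heappush(heap, x)
        elif x > heap[0]:
            heapq.heapreplace(heap, x)  # evict the smallest of the current top-k
    return heap[0]  # the k-th largest

print(kth_largest([1, 3, 4, 9, 5, 15, 5, 13, 19, 27, 22], 4))  # 15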
According to Wikipedia, this is the best selection algorithm:
function findFirstK(list, left, right, k)
    if right > left
        select pivotIndex between left and right
        pivotNewIndex := partition(list, left, right, pivotIndex)
        if pivotNewIndex > k  // new condition
            findFirstK(list, left, pivotNewIndex-1, k)
        if pivotNewIndex < k
            findFirstK(list, pivotNewIndex+1, right, k)
Its complexity is O(n)
One thing you should do if this is in production code is test with samples of your data.
For example, you might consider 1000 or 10000 elements 'large' arrays, and code up a quickselect method from a recipe.
The compiled nature of sorted, and its somewhat hidden and constantly evolving optimizations, make it faster than a quickselect method written in Python on small to medium sized datasets (< 1,000,000 elements). Also, you might find that as you increase the size of the array beyond that amount, memory is more efficiently handled in native code, and the benefit continues.
So, even if quickselect is O(n) vs sorted's O(n log n), that doesn't take into account how many actual machine-code instructions processing each of the n elements will take, any impacts on pipelining, use of processor caches, and other things the creators and maintainers of sorted will bake into their code.
You can keep two different counts for each element - the number of elements bigger than it, and the number of elements smaller than it.
Then look for the element for which the number of bigger elements equals N-1 - that element is your output.
Check the solution below:
def NthHighest(l, n):
    if len(l) < n:
        return 0
    for i in range(len(l)):
        low_count = 0
        up_count = 0
        for j in range(len(l)):
            if l[j] > l[i]:
                up_count = up_count + 1
            else:
                low_count = low_count + 1
        # print(l[i], low_count, up_count)
        if up_count == n-1:
            # print(l[i])
            return l[i]

# find the 4th largest number
l = [1, 3, 4, 9, 5, 15, 5, 13, 19, 27, 22]
print(NthHighest(l, 4))
Using the above solution you can find both the Nth highest and the Nth lowest.
If you do not mind using pandas then:
import pandas as pd
N = 10
column_name = 0
pd.DataFrame(your_array).nlargest(N, column_name)
The above code will show you the N largest values along with the index position of each value.
Hope it helps. :-)
Pandas Nlargest Documentation