I'm trying to solve a problem with splitting a set into subsets.
The input data are a list and an integer.
The task is to divide a set into N-element subsets whose sums of elements are almost equal. As this is an NP-hard problem, I have tried two approaches:
a) iterate over all possibilities and distribute them with mpi4py to many machines (for a list of more than 100 elements and 20-element subsets this takes far too long)
b) use mpi4py to send the list to nodes with different random seeds, but in this case I potentially evaluate the same partition many times. For instance, with 100 numbers and 5 subsets of 20 elements each, after 60 s my result could easily be beaten by a human simply looking at the table.
In the end I'm looking for a heuristic algorithm which can run on a distributed system and build N-element subsets from a bigger set so that the subset sums are almost equal.
a = list(range(1, 13))
k = 3
One possible solution:
[1,2,11,12] [3,4,9,10] [5,6,7,8]
because the sums are 26, 26 and 26.
It is not always possible to make the sums or the numbers of elements exactly
equal. The difference between the maximum and minimum number of elements in the
subsets should be 0 (if len(a)/k is an integer) or 1.
edit 1:
I am investigating two options: 1. The parent generates all combinations and then sends them to the parallel algorithm (but this is too slow for me). 2. The parent sends the list, and each node generates its own subsets and evaluates the subset sums within a restricted time, then sends its best result to the parent. The parent receives these results and chooses the one that minimizes the difference between the subset sums (see the sketch below). I think the second option has the potential to be faster.
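Here is a rough sketch of what option 2 might look like with mpi4py; the helper functions, the time budget and the example data are my own assumptions for illustration, not a tested solution:
# A sketch of option 2, assuming mpi4py is installed and the script is
# started with e.g. "mpiexec -n 4 python partition.py".
import random
import time
from mpi4py import MPI

def random_partition(items, k):
    # Shuffle the items and slice them into k subsets whose sizes differ by at most 1.
    shuffled = items[:]
    random.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]

def spread(subsets):
    # Difference between the largest and smallest subset sum (0 is a perfect split).
    sums = [sum(s) for s in subsets]
    return max(sums) - min(sums)

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

a = list(range(1, 101)) if rank == 0 else None
k = 5
a = comm.bcast(a, root=0)            # the parent sends the list to every node
random.seed(rank)                    # each node explores a different random stream

best = None
deadline = time.time() + 10          # restricted search time per node, in seconds
while time.time() < deadline:
    candidate = random_partition(a, k)
    if best is None or spread(candidate) < spread(best):
        best = candidate

results = comm.gather(best, root=0)  # the parent collects every node's best partition
if rank == 0:
    overall_best = min(results, key=spread)
    print("best spread found:", spread(overall_best))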
Best regards,
Szczepan
I think you're trying to do something more complicated than necessary - do you actually need an exact solution (global optimum)? Regarding the heuristic solution, I had to do something along these lines in the past so here's my take on it:
Reformulate the problem as follows: you have a vector with a given mean (the 'global mean') and you want to break it into chunks such that the mean of each individual chunk is as close as possible to the 'global mean'.
Just divide it into chunks randomly and then iteratively swap elements between the chunks until you get acceptable results. You can experiment with different ways of doing it; here I'm just reshuffling the elements of the chunks with the minimum and maximum 'chunk mean'.
In general, the bigger the chunk is, the easier it becomes, because the first random split already gives you a not-so-bad solution (think sample means).
How big is your input list? I tested this with a 100000-element input (uniform-distribution integers). With 50 2000-element chunks you get the result instantly; with 2000 50-element chunks you need to wait <1 min.
import numpy as np

my_numbers = np.random.randint(10000, size=100000)
chunks = 50
iter_limit = 10000
desired_mean = my_numbers.mean()
acceptable_range = 0.1

split = np.array_split(my_numbers, chunks)

for i in range(iter_limit):
    split_means = np.array([array.mean() for array in split])  # this can be optimized, some of the means are known
    current_min = split_means.min()
    current_max = split_means.max()
    mean_diff = split_means.ptp()
    if i % 100 == 0 or mean_diff <= acceptable_range:
        print("Iter: {}, Desired: {}, Min {}, Max {}, Range {}".format(i, desired_mean, current_min, current_max, mean_diff))
    if mean_diff <= acceptable_range:
        print('Acceptable solution found')
        break
    min_index = split_means.argmin()
    max_index = split_means.argmax()
    # pop the higher positional index first so the lower one stays valid
    if max_index < min_index:
        merged = np.hstack((split.pop(min_index), split.pop(max_index)))
    else:
        merged = np.hstack((split.pop(max_index), split.pop(min_index)))
    reshuffle_range = mean_diff + 1
    while reshuffle_range > mean_diff:
        # this while just ensures that you're not getting a worse split, either the same or better
        np.random.shuffle(merged)
        modified_arrays = np.array_split(merged, 2)
        reshuffle_range = np.array([array.mean() for array in modified_arrays]).ptp()
    split += modified_arrays
I am trying to implement the Brown clustering algorithm in Python.
I have a data structure of clusters = List[List].
At any given time the outer list has a length of at most 40 or 41.
Each inner list contains English words such as 'the', 'hello', etc.
I have 8000 words in total (the vocabulary), and initially the first 40 words are put into clusters.
I iterate over my vocabulary from 41 to 8000
# do some computation; this takes very little time
# merge 2 items in the list and delete one item from the list
# e.g. if c1 and c2 are indices of items in clusters, then:
for i in range(41, 8000):
    clusters.append(vocabulary[i])
    c1 = computation 1
    c2 = computation 2
    clusters[c1] = clusters[c1] + clusters[c2]
    del clusters[c2]
But the time taken by the line clusters[c1] = clusters[c1] + clusters[c2] grows gradually as I iterate over my vocabulary. Initially, for items 41-50, it is 1 sec, but for every further 20 items in the vocabulary the time grows by 1 sec.
On commenting out just clusters[c1] = clusters[c1] + clusters[c2] from my code, I observe that all iterations take constant time. I am not sure how I can speed up this process.
for i in range(41, 8000):
    clusters.append(vocabulary[i])
    c1 = computation 1
    c2 = computation 2
    #clusters[c1] = clusters[c1] + clusters[c2]
    del clusters[c2]
I am new to Stack Overflow, so please excuse any incorrect formatting here.
Thanks
The problem you're running into is that list concatenation is a linear time operation. Thus, your entire loop is O(n^2) (and that's prohibitively slow for n much larger than 1000). This is ignoring how copying such large lists can be bad for cache performance, etc.
Disjoint Set data structure
The solution I recommend is to use a disjoint-set data structure. This is a tree-based data structure that "self-flattens" as you perform queries, resulting in very fast runtimes for "merging" clusters.
The basic idea is that each word starts off as its own "singleton" tree, and merging clusters consists of making the root of one tree the child of another. This repeats (with some care for balancing) until you have as many clusters as desired.
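To make the idea concrete, here is a minimal union-find sketch assuming integer elements; the join/query method names mirror the usage below, but the linked implementation may differ in detail:
# Minimal union-find sketch (path compression + union by size); an illustration,
# not necessarily identical to the linked implementation.
class DisjointSet:
    def __init__(self, n):
        self.parent = list(range(n))  # each element starts as its own root
        self.size = [1] * n

    def find(self, x):
        # Walk up to the root, flattening the path as we go.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def join(self, a, b):
        # Merge the clusters containing a and b (attach the smaller tree to the larger).
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]

    def query(self, a, b):
        # True if a and b are currently in the same cluster.
        return self.find(a) == self.find(b)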
I've written an example implementation (GitHub link) that assumes elements of each set are numbers. As long as you have a mapping from vocabulary terms to integers, it should work just fine for your purposes. (Note: I've done some preliminary testing, but I wrote it in 5 minutes right now so I'd recommend checking my work. ;) )
To use in your code, I would do something like the following:
clusters = DisjointSet(8000)
# some code to merge the first 40 words into clusters
for i in range(41, 8000):
    c1 = some_computation()  # assuming c1 is a number
    c2 = some_computation()  # assuming c2 is a number
    clusters.join(c1, c2)

# Now, if you want to determine if some word with number k is
# in the same cluster as a word with number j:
print("{} and {} are in the same cluster? {}".format(j, k, clusters.query(j, k)))
Regarding Sets vs Lists
While sets provide faster access times than lists, they actually have a worse runtime when copying. This makes sense in theory, because a set object has to allocate and assign more memory space than a list to maintain an appropriate load factor. Also, it is likely that inserting so many items could result in a "rehash" of the entire hash table, which is a quadratic-time operation in the worst case.
However, practice is what we're concerned with now, so I ran a quick experiment to determine exactly how much worse off sets are than lists.
Code for performing this test, in case anyone was interested, is below. I'm using the Intel packaging of Python, so my performance may be slightly faster than on your machine.
import time
import random
import numpy as np
import matplotlib.pyplot as plt

data = []
for trial in range(5):
    trial_data = []
    for N in range(0, 20000, 50):
        l1 = random.sample(range(1000000), N)
        l2 = random.sample(range(1000000), N)
        s1 = set(l1)
        s2 = set(l2)
        # Time to concatenate two lists of length N
        start_lst = time.clock()
        l3 = l1 + l2
        stop_lst = time.clock()
        # Time to union two sets of length N
        start_set = time.clock()
        s3 = s1 | s2
        stop_set = time.clock()
        trial_data.append([N, stop_lst - start_lst, stop_set - start_set])
    data.append(trial_data)

# average the trials and plot
data_array = np.array(data)
avg_data = np.average(data_array, 0)

fig = plt.figure()
ax = plt.gca()
ax.plot(avg_data[:, 0], avg_data[:, 1], label='Lists')
ax.plot(avg_data[:, 0], avg_data[:, 2], label='Sets')
ax.set_xlabel('Length of set or list (N)')
ax.set_ylabel('Seconds to union or concat (s)')
plt.legend(loc=2)
plt.show()
I have a file with a column of values that I would like to compare against a dictionary that contains two values which together form a range.
for instance:
File A:
Chr1 200 ....
Chr3 300
File B:
Chr1 200 300 ...
Chr2 300 350 ...
For now I have created a dictionary of ranges from File B:
Ranges = {}
for Line in FileB:
    LineB = Line.strip('\n').split('\t')
    Chr = LineB[0]
    Ranges.setdefault(Chr, []).append(LineB)
For the comparison:
for Line in MethylationFile:
    Line = Line.strip("\n")
    Info = Line.split("\t")
    Chr = Info[0]
    Location = int(Info[1])
    Annotation = ""
    for i, r in enumerate(Ranges[Chr]):
        n = i + 1
        while (n < len(Ranges[Chr])):
            if (int(Ranges[Chr][i][1]) <= Location <= int(Ranges[Chr][i][2])):
                Annotation = '\t'.join(Ranges[Chr][i][4:])
            n += 1
    OutFile.write(Line + '\t' + Annotation + '\n')
If I leave the while loop in, the program does not seem to finish (or is probably running too slowly to get results), since I have over 7,000 values in each dictionary. If I change the while loop to an if statement the program runs, but at an incredibly slow pace.
I'm looking for a way to make this program faster and more efficient.
Dictionaries are great when you want to look up a key by exact match. In particular, the hash of the lookup key has to be the same as the hash of the stored key.
If your ranges are consistent, you could fake this by writing a hash function that returns the same value for a range, and for every value within that range. But if they're not, this hash function would have to keep track of all of the known ranges, which takes you back to the same problem you're starting with.
In that case, the right data structure here is probably some kind of sorted collection. If you only need to build up the collection, and then use it many times without ever modifying it, just sorting a list and using the bisect module will do it for you. If you need to modify the collection after creation, you'll want something built around a binary tree or B-tree variant of some kind, like blist or bintrees.
This will reduce the time to find a range from N/2 to log2(N). So, if you've got 10000 ranges, instead of 5000 comparisons, you'll do 14.
While we're at it, it would help to convert the range start and stop values to ints once, instead of doing it each time. Also, if you want to use the stdlib bisect, you unfortunately can't pass a key to most functions, so let's reorganize the ranges into comparable order too. So:
Ranges = {}
for Line in FileB:
    LineB = Line.strip('\n').split('\t')
    Chr = LineB[0]
    Ranges.setdefault(Chr, []).append((int(LineB[1]), int(LineB[2]), LineB[0]))

for r in Ranges.values():
    r.sort()
Now, instead of this loop:
for i, r in enumerate(Ranges[Chr]):
    # ...
Do this:
import bisect

i = bisect.bisect(Ranges[Chr], (Location, Location, None))
if i:
    r = Ranges[Chr][i-1]
    if r[0] <= Location < r[1]:
        pass  # do whatever you wanted with r
    else:
        pass  # there is no range that includes Location
else:
    pass  # Location is before all ranges
You have to be careful thinking about bisect, and it's possible I've got this wrong on the first attempt, so… read the docs on what it does, and experiment with your data (printing out the results of the bisect function), before trusting this.
If your ranges can overlap, and you want to be able to find all ranges that contain a value rather than just one, you'll need a bit more than this to keep things efficient. There's no way to fully-order overlapping ranges, so bisect won't cut it.
If you're expecting more than log N matches per average lookup, you can do it with two sorted lists and bisect.
But otherwise, you need a more complex data structure, and more complex code. For example, if you can spare N^2 space, you can keep the time at log N by having, for each range in the first list, a second list, sorted by end, of all the values with a matching start.
And at this point, I think it's getting complex enough that you want to look for a library to do it for you.
However, you might want to consider a different solution.
If you use numpy or a database instead of pure Python, this can't cut the algorithmic complexity from N to log N… but it can cut the constant overhead by a factor of 10 or so, which may be good enough. In fact, if you're doing tons of searches on a medium-small list, it may even be better.
Plus, it looks a lot simpler, and once you get used to array operations or SQL, it may even be more readable. So:
RangeArrays = [np.array([a[:2] for a in value]) for value in Ranges]
… or, if Ranges is a dict mapping strings to values, instead of a list:
RangeArrays = {key: np.array([a[:2] for a in value]) for key, value in Ranges.items()}
Then, instead of this:
for i, r in enumerate(Ranges[Chr]):
    # ...
Do:
comparisons = Location < RangeArrays[Chr]
matches = comparisons[:, 0] < comparisons[:, 1]
indices = matches.nonzero()[0]
for index in indices:
    r = Ranges[Chr][index]
    # Do stuff with r
(You can of course make things more concise, but it's worth doing it this way and printing out all of the intermediate steps to see why it works.)
Or, using a database:
cur = db.execute('''SELECT Start, Stop, Chr FROM Ranges
                    WHERE Start <= ? AND Stop > ?''', (Location, Location))
for (Start, Stop, Chr) in cur:
    pass  # do stuff with each matching range
I am working on a postage application which is required to check an integer postcode against a number of postcode ranges, and return a different code based on which range the postcode matches against.
Each code has more than one postcode range. For example, the M code should be returned if the postcode is within the ranges 1000-2429, 2545-2575, 2640-2686 or is equal to 2890.
I could write this as:
if 1000 <= postcode <= 2429 or 2545 <= postcode <= 2575 or 2640 <= postcode <= 2686 or postcode == 2890:
    return 'M'
but this seems like a lot of lines of code, given that there are 27 returnable codes and 77 total ranges to check against. Is there a more efficient (and preferably more concise) method of matching an integer to all these ranges using Python?
Edit: There are a lot of excellent solutions flying around, so I have implemented all the ones that I could and benchmarked their performance.
The environment for this program is a web service (Django-powered actually) which needs to check postcode region codes one-by-one, on the fly. My preferred implementation, then, would be one that can be quickly used for each request, and does not need any process to be kept in memory, or needs to process many postcodes in bulk.
I tested the following solutions using timeit.Timer with default 1000000 repetitions using randomly generated postcodes each time.
IF solution (my original)
if 1000 <= postcode <= 2249 or 2555 <= postcode <= 2574 or ...:
    return 'M'
if 2250 <= postcode <= 2265 or ...:
    return 'N'
...
Time for 1m reps: 5.11 seconds.
Ranges in tuples (Jeff Mercado)
Somewhat more elegant to my mind and certainly easier to enter and read the ranges. Particularly good if they change over time, which is possible. But it did end up four times slower in my implementation.
if any(lower <= postcode <= upper for (lower, upper) in [(1000, 2249), (2555, 2574), ...]):
    return 'M'
if any(lower <= postcode <= upper for (lower, upper) in [(2250, 2265), ...]):
    return 'N'
...
Time for 1m reps: 19.61 seconds.
Set membership (gnibbler)
As stated by the author, "it's only better if you are building the set once to check against many postcodes in a loop". But I thought I would test it anyway to see.
if postcode in set(chain(*(xrange(start, end+1) for start, end in ((1000, 2249), (2555, 2574), ...)))):
    return 'M'
if postcode in set(chain(*(xrange(start, end+1) for start, end in ((2250, 2265), ...)))):
    return 'N'
...
Time for 1m reps: 339.35 seconds.
Bisect (robert king)
This one may have been a bit above my intellect level. I learnt a lot reading about the bisect module but just couldn't quite work out which parameters to give find_ge() to make a runnable test. I expect that it would be extremely fast with a loop of many postcodes, but not if it had to do the setup each time. So, with 1m repetitions of filling numbers, edgepairs, edgeanswers etc for just one postal region code (the M code with four ranges), but not actually running the fast_solver:
Time for 1m reps: 105.61 seconds.
Dict (sentinel)
Using one dict per postal region code pre-generated, cPickled in a source file (106 KB), and loaded for each run. I was expecting much better performance from this method, but on my system at least, the IO really destroyed it. The server is a not-quite-blindingly-fast-top-of-the-line Mac Mini.
Time for 1m reps: 5895.18 seconds (extrapolated from a 10,000 run).
The summary
Well, I was expecting someone to just give a simple 'duh' answer that I hadn't considered, but it turns out this is much more complicated (and even a little controversial).
If every nanosecond of efficiency counted in this case, I would probably keep a separate process running which implemented one of the binary search or dict solutions and kept the result in memory for an extremely fast lookup. However, since the IF tree takes only five seconds to run a million times, which is plenty fast enough for my small business, that's what I'll end up using in my application.
Thank you to everyone for contributing!
You can throw your ranges into tuples and put the tuples in a list. Then use any() to help you find if your value is within these ranges.
ranges = [(1000, 2429), (2545, 2575), (2640, 2686), (2890, 2890)]
if any(lower <= postcode <= upper for (lower, upper) in ranges):
    print('M')
Probably the fastest will be to check the membership of a set
>>> from itertools import chain
>>> ranges = ((1000, 2429), (2545, 2575), (2640, 2686), (2890, 2890))
>>> postcodes = set(chain(*(xrange(start, end+1) for start, end in ranges)))
>>> 1000 in postcodes
True
>>> 2500 in postcodes
False
But it does use more memory this way, and building the set takes time, so it's only better if you are building the set once to check against many postcodes in a loop
EDIT: seems that different ranges need to map to different letters
>>> from itertools import chain
>>> ranges = {'M': ((1000, 2429), (2545, 2575), (2640, 2686), (2890, 2890)),
...           # more ranges
...          }
>>> postcodemap = dict((k, v) for v in ranges for k in chain(*(xrange(start, end + 1) for start, end in ranges[v])))
>>> print postcodemap.get(1000)
M
>>> print postcodemap.get(2500)
None
Here is a fast and short solution, using numpy:
import numpy as np
lows = np.array([1, 10, 100]) # the lower bounds
ups = np.array([3, 15, 130]) # the upper bounds
def in_range(x):
    return np.any((lows <= x) & (x <= ups))
Now for instance
in_range(2) # True
in_range(23) # False
You only have to solve for the edge cases, and for one number between each pair of edge cases, when doing inequalities.
e.g. if you do the following tests on TEN:
10 < 20, 10 < 15, 10 > 8, 10 > 12
it will give True, True, True, False
but note that the closest edge numbers to 10 are 8 and 12.
This means that 9, 10 and 11 will all give the same answers that ten did. If you don't have too many initial range numbers and they are sparse, then this will help. Otherwise you will need to see if your inequalities are transitive and use a range tree or something.
So what you can do is sort all of your boundaries into intervals.
e.g. if your inequalities had the numbers 12, 50, 192, 999
you would get the following intervals, which ALL have the same answer within each interval:
less than 12, 12, 13-49, 50, 51-191, 192, 193-998, 999, 999+
as you can see from these intervals, we only need to solve for 9 cases and can then quickly solve for anything.
Here is an example of how I might carry it out for solving for a new number x using these pre-calculated results:
a) Is x a boundary? (Is it in the set?)
If yes, then return the answer you found for that boundary previously.
Otherwise use case b).
b) Find the maximum boundary number that is smaller than x; call it maxS.
Find the minimum boundary number that is larger than x; call it minL.
Now just return any previously found solution that was between maxS and minL.
See Python binary search-like function to find first number in sorted list greater than a specific value
for finding the closest numbers. The bisect module will help (import it in your code).
This will help with finding maxS and minL.
You can use bisect and the function I have included in my sample code:
from bisect import bisect_left

def find_ge(a, key):
    '''Find smallest item greater-than or equal to key.
    Raise ValueError if no such item exists.
    If multiple keys are equal, return the leftmost.
    '''
    i = bisect_left(a, key)
    if i == len(a):
        raise ValueError('No item found with key at or above: %r' % (key,))
    return a[i]
ranges = [(1000, 2429), (2545, 2575), (2640, 2686), (2890, 2890)]
numbers = []
for pair in ranges:
    numbers += list(pair)
numbers += [-999999, 999999]  # ensure nothing goes outside the range
numbers.sort()
edges = set(numbers)
edgepairs = {}
for i in range(len(numbers) - 1):
    edgepairs[(numbers[i], numbers[i+1])] = (numbers[i] + numbers[i+1]) // 2  # a number between the two edges

def slow_solver(x):
    return  # your answer for postcode x

listedges = list(edges)
edgeanswers = dict(zip(listedges, map(slow_solver, listedges)))
edgepairsanswers = dict(zip(edgepairs.keys(), map(slow_solver, edgepairs.values())))
# now we are ready for fast solving:
def fast_solver(x):
    if x in edges:
        return edgeanswers[x]
    else:
        # find minL and maxS using find_ge and your own similar find_le (see the sketch below)
        return edgepairsanswers[(maxS, minL)]
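For reference, that find_le counterpart could look like the following sketch, modelled on the standard bisect recipes:
from bisect import bisect_right

def find_le(a, key):
    '''Find largest item less-than or equal to key.
    Raise ValueError if no such item exists.
    If multiple keys are equal, return the rightmost.
    '''
    i = bisect_right(a, key)
    if i == 0:
        raise ValueError('No item found with key at or below: %r' % (key,))
    return a[i - 1]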
The full data isn't there, but I'm assuming the ranges are non-overlapping, so you can express your ranges as a single sorted tuple of ranges, along with their codes:
ranges = (
    (1000, 2249, 'M'),
    (2250, 2265, 'N'),
    (2555, 2574, 'M'),
    # ...
)
This means we can binary search over them in one go. This should be O(log(N)) time, which should result in pretty decent performance with very large sets.
def code_lookup(value, ranges):
    left, right = 0, len(ranges)
    while left != right - 1:
        mid = left + (right - left) // 2
        if value <= ranges[mid - 1][1]:  # Check left split max
            right = mid
        elif value >= ranges[mid][0]:  # Check right split min
            left = mid
        else:  # We are in a gap
            return None
    if ranges[left][0] <= value <= ranges[left][1]:
        # Return the code
        return ranges[left][2]
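For example, with the (truncated) ranges tuple above, a lookup would look like this:
print(code_lookup(2000, ranges))  # 'M'  (falls in 1000-2249)
print(code_lookup(2300, ranges))  # None (falls in the gap after 2265)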
I don't have your exact values, but for comparison I ran it against some generated ranges (77 ranges with various codes) and compared it to a naive approach:
def get_code_naive(value):
    if 1000 < value < 2249:
        return 'M'
    if 2250 < value < 2265:
        return 'N'
    # ...
The result for 1,000,000 lookups was that the naive version ran in about 5 sec and the binary search version in 4 sec. So it's a bit faster (20%), the codes are a lot nicer to maintain, and the longer the list gets, the more it will outperform the naive method.
Recently I had a similar requirement and I used bit manipulation to test whether an integer belongs to a given range. It is definitely faster, but I guess it's not suitable if your ranges involve huge numbers. I liberally copied the example methods from here.
First we create a binary number which will have all bits in the range set to 1.
# Sets the bits to one between lower and upper range
def setRange(permitRange, lower, upper):
    # the range is inclusive of left & right edge, so add 1 to the upper limit
    bUpper = 1 << (upper + 1)
    bLower = 1 << lower
    mask = bUpper - bLower
    return (permitRange | mask)
# For my case the ranges also include single integers, so I added a method to set single bits
# Set individual bits to 1
def setBit(permitRange, number):
    mask = 1 << number
    return (permitRange | mask)
Now it is time to parse the ranges and populate our binary mask. If the highest number in the ranges is n, we will be creating an integer greater than 2^n in binary.
# Example range (10-20, 25, 30-50)
rangeList = "10-20, 25, 30-50"
maxRange = 100
permitRange = 1 << maxRange
for item in rangeList.split(","):
    item = item.strip()          # drop the space after each comma so isdigit() works
    if item.isdigit():
        permitRange = setBit(permitRange, int(item))
    else:
        lower, upper = item.split("-", 1)
        permitRange = setRange(permitRange, int(lower), int(upper))
# permitRange now has a 1 bit at every permitted position
To check if a number 'n' belongs to the range, simply test the bit at n'th position
# returns a non-zero result, 2**offset, if the bit at 'offset' is one
def testBit(permitRange, number):
    mask = 1 << number
    return (permitRange & mask)

if testBit(permitRange, 10):
    do_something()
Warning - This is probably premature optimisation. For a large list of ranges it might be worthwhile, but probably not in your case. Also, although dictionary/set solutions will use more memory, they are still probably a better choice.
You could do a binary-search into your range end-points. This would be easy if all ranges are non-overlapping, but could still be done (with some tweaks) for overlapping ranges.
Do a find-highest-match-less-than binary search. This is the same as a find-lowest-match-greater-than-or-equal (lower bound) binary search, except that you subtract one from the result.
Use half-open items in your list of end points - that is if your range is 1000..2429 inclusive, use the values 1000 and 2430. If you get an end-point and a start-point with the same value (two ranges touching, so there is no gap between) exclude the end-point for the lower range from your list.
If you find a start-of-range end-point, your goal value is within that range. If you find an end-of-range end-point, your goal value isn't in any range.
The binary search algorithm is roughly (don't expect this to run without editing)...
while upperbound > lowerbound :
    testpos = lowerbound + ((upperbound-lowerbound) // 2)
    if item[testpos] > goal :
        # new best-so-far
        upperbound = testpos
    else :
        lowerbound = testpos + 1
Note - the "//" division operator is necessary for integer division in Python 3. In Python 2, the normal "/" will work, but it's best to be ready for Python 3.
At the end, both upperbound and lowerbound point to the found item - but for the "upper bound" search. Subtract one to get the required search result. If that gives -1, there is no matching range.
There's probably a binary search routine in the library that does the upper-bound search, so prefer that to this if so. To get a better understanding of how the binary search works, see How can I better understand the one-comparison-per-iteration binary search? - no, I'm not above begging for upvotes ;-)
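As hinted at above, the stdlib bisect module provides the upper-bound search. Here is a rough sketch of the half-open end-point idea, assuming non-overlapping, non-touching ranges and made-up range/code data:
import bisect

# Hypothetical inclusive ranges and their codes (non-overlapping, non-touching).
ranges = [(1000, 2429, 'M'), (2545, 2575, 'M'), (2640, 2686, 'M'), (2890, 2890, 'M')]

endpoints = []   # alternating start and end+1 values, sorted
codes = []
for start, end, code in ranges:
    endpoints.append(start)      # start-of-range end point
    endpoints.append(end + 1)    # end-of-range end point (half-open)
    codes.append(code)

def lookup(goal):
    # Index of the highest end point <= goal (the "upper bound" search minus one).
    i = bisect.bisect_right(endpoints, goal) - 1
    if i < 0 or i % 2 == 1:
        return None              # before all ranges, or in a gap between ranges
    return codes[i // 2]         # an even index is a start-of-range end point

print(lookup(2890))  # M
print(lookup(2500))  # None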
Python has a range(a, b) function which means the range from (and including) a, to (but excluding) b. You can make a list of these ranges and check to see if a number is in any of them. It may be more efficient to use xrange(a, b) which has the same meaning but doesn't actually make a list in memory.
list_of_ranges = []
list_of_ranges.append(xrange(1000, 2430))
list_of_ranges.append(xrange(2545, 2576))

for x in [999, 1000, 2429, 2430, 2544, 2545]:
    result = False
    for r in list_of_ranges:
        if x in r:
            result = True
            break
    print x, result
A bit of a silly approach to an old question, but I was curious how well regex character classes would handle the problem, since this exact problem occurs frequently in questions about character validity.
To make a regex for the "M" postal codes you showed, we can turn the numbers into Unicode characters using chr():
import re

m_ranges = [(1000, 2249), (2545, 2575), (2640, 2686)]
m_singletons = [2890]
m_range_char_class_members = [fr"{chr(low)}-{chr(high)}" for (low, high) in m_ranges]
m_singleton_char_class_members = [fr"{chr(x)}" for x in m_singletons]
m_char_class = f"[{''.join(m_range_char_class_members + m_singleton_char_class_members)}]"
m_regex = re.compile(m_char_class)
Then a very rough benchmark on 1 million random postal codes for this method vs your original if-statement:
import random
import time

test_values = [random.randint(1000, 9999) for _ in range(1000000)]

def is_m_regex(num):
    return m_regex.match(chr(num))

def is_m_if(num):
    return 1000 <= num <= 2249 or 2545 <= num <= 2575 or 2640 <= num <= 2686 or num == 2890

def run_regex_test():
    start_time = time.time()
    for i in test_values:
        is_m_regex(i)
    print("--- REGEX: %s seconds ---" % (time.time() - start_time))

def run_if_test():
    start_time = time.time()
    for i in test_values:
        is_m_if(i)
    print("--- IF: %s seconds ---" % (time.time() - start_time))
...
running regex test
--- REGEX: 0.3418138027191162 seconds ---
--- IF: 0.19183707237243652 seconds ---
So this would suggest that for comparing one character at a time, using raw if statements is faster than character classes in regexes. No surprise here, since using regex is a bit silly for this problem.
BUT. When doing an operation like sub to eliminate all matches from a string composed of all the original test values, it ran much more quickly:
blob_value = ''.join([chr(x) for x in test_values])

def run_regex_test_char_blob():
    start_time = time.time()
    subbed = m_regex.sub('', blob_value)
    print("--- REGEX BLOB: %s seconds ---" % (time.time() - start_time))
    print(f"original blob length : {len(blob_value)}")
    print(f"sub length : {len(subbed)}")
...
--- REGEX BLOB: 0.03655815124511719 seconds ---
original blob length : 1000000
sub length : 851928
The sub method here replaces all occurrences of M-postal-characters (~15% of this sample), which means it operated on all 1 million characters of the string. That would suggest to me that mass operations by the re package are MUCH more efficient than individual operations suggested in these answers. So if you've really got a lot of comparisons to do at once in a data pipeline, you may find the best performance by doing some string composition and using regex.
In Python 3.2 functools.lru_cache was introduced.
Your solution, along with the aforementioned decorator, should be pretty fast.
Or, Python 3.9's functools.cache could be used as well (which should be even faster).
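A rough sketch of how that might look (the if-chain body is just a placeholder standing in for the original solution):
from functools import lru_cache

@lru_cache(maxsize=None)        # on Python 3.9+, functools.cache is equivalent
def get_code(postcode):
    # placeholder for the original if/elif chain over all 77 ranges
    if (1000 <= postcode <= 2429 or 2545 <= postcode <= 2575
            or 2640 <= postcode <= 2686 or postcode == 2890):
        return 'M'
    # ... other codes ...
    return None

get_code(2890)   # computed once
get_code(2890)   # subsequent identical lookups come from the cache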
Have you actually run benchmarks? Does the performance of this piece of code influence the performance of the overall application? So benchmark first!
But you can also use a dict e.g. for storing all keys of the "M" ranges:
mhash = {1000: True, 1001: True, ..., 2429: True, ...}
if postcode in mhash:
    print 'M'
Of course: the hashes require more memory but access time is O(1).
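If typing out every key by hand is impractical, the dict can also be generated from the range definitions; here is a sketch (the ranges layout is an assumption based on the question):
ranges = {'M': [(1000, 2429), (2545, 2575), (2640, 2686), (2890, 2890)],
          # ... the other 26 codes and their ranges ...
         }

code_for = {}
for code, pairs in ranges.items():
    for lower, upper in pairs:
        for postcode in range(lower, upper + 1):
            code_for[postcode] = code

print(code_for.get(2890))   # M
print(code_for.get(2500))   # None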