Multiprocessing a Straight-forward Computation

Breakdown of the problem: There's a fair bit of underlying motivation, but let's say we are given some matrix N (code included at the bottom) and we wish to solve the matrix-vector equation Nv = b, where b is a binary string of weight k. I'm currently solving this using numpy.linalg.lstsq. If the 'residual' of this least-square calculation is less than 0.000001, I'm accepting it as a solution.
First, I need to generate all binary strings of a certain length and weight k. The following function does this.
'''
This was provided in one of the answers here:
https://stackoverflow.com/questions/58069431/find-all-binary-strings-of-certain-weight-has-fast-as-possible/
Given a value with weight k, you can get the (co)lexically next value as follows (using bit manipulations).
'''
def ksubsetcolexsuccessor(length, k):
    limit = 1 << length
    val = (1 << k) - 1  # smallest value with k set bits
    while val < limit:
        yield "{0:0{1}b}".format(val, length)
        minbit = val & -val  # lowest set bit
        fillbit = (val + minbit) & ~val  # the bit just above the lowest run of 1s
        val = (val + minbit) | ((fillbit // (minbit << 1)) - 1)  # next value with k set bits
So for length = 6 and k = 2 we have:
000011, 000101, 000110, 001001, 001010, 001100, 010001, 010010, 010100, 011000, 100001, 100010, 100100, 101000, 110000.
We plug each into Nv = b and solve:
import numpy as np
solutions = 0
for b in ksubsetcolexsuccessor(n, k):
    v = np.linalg.lstsq(N, np.array(list(b), dtype=float))
    if v[1] < 0.000001:
        solutions += 1
This totals the number of solutions. My issue is that this is fairly slow. I'd like to eventually try n = 72 and k = 8. I noticed that only a fraction of my CPU is being used, so I'd like to multiprocess. I started by figuring out how to generate the above binary strings in chunks. I have written a function that gives me values for val that let me start and stop generating the above binary string sequence at any place. So my plan was to have as many chunks as I have cores on my machine. For example:
Core 1 processes: 000011, 000101, 000110, 001001, 001010, 001100, 010001
Core 2 processes: 010010, 010100, 011000, 100001, 100010, 100100, 101000, 110000
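The function that produces the chunk start values isn't shown. One way to compute them, sketched here under stated assumptions, uses the combinatorial number system. Here `colex_unrank` and `chunk_bounds` are hypothetical names, and `math.comb` needs Python 3.8+. The r-th weight-k value in the order produced by ksubsetcolexsuccessor (increasing integers, i.e. colex order) has set bits at positions c_k > ... > c_1 with r = C(c_k, k) + ... + C(c_1, 1), so each boundary can be found without enumerating the strings before it.

```python
from math import comb


def colex_unrank(rank, k):
    """Return the integer whose k set bits form the rank-th k-subset in
    colex order (rank 0 -> (1 << k) - 1), i.e. the rank-th value that a
    Gosper-style successor such as ksubsetcolexsuccessor visits."""
    val = 0
    for i in range(k, 0, -1):
        # Greedily pick the largest c with comb(c, i) <= rank.
        c = i - 1
        while comb(c + 1, i) <= rank:
            c += 1
        val |= 1 << c
        rank -= comb(c, i)
    return val


def chunk_bounds(length, k, parts):
    """Split all weight-k strings of the given length into `parts`
    contiguous (start_val, end_val) ranges of nearly equal size."""
    total = comb(length, k)
    starts = [colex_unrank(j * total // parts, k) for j in range(parts)]
    return list(zip(starts, starts[1:] + [1 << length]))
```

For length 6 and k = 2, `chunk_bounds(6, 2, 2)` gives `[(3, 18), (18, 64)]`. This matches the `mid_val = 18` split used below. For length 72 and k = 8, the same call only performs a few hundred `comb` evaluations per boundary.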
I am unfamiliar with multiprocessing and multithreading. Here is what I've done:
from multiprocessing import Process
import numpy as np
'''
A version where you can specify where to start and end by passing start_val and end_val to the function.
It performs the least-squares computation for each b and tallies the trials that work.
'''
def ksubsetcolexsuccessorSegmentComp(n, k, val, limit, num):
    while val < limit:
        b = "{0:0{1}b}".format(val, n)
        v = np.linalg.lstsq(N, np.array(list(b), dtype=float))
        if v[1] < 0.000001:
            num += 1
        minbit = val & -val
        fillbit = (val + minbit) & ~val
        val = (val + minbit) | ((fillbit // (minbit << 1)) - 1)
    print(num)  # each process prints its own tally
if __name__ == "__main__":  # needed when processes are started with "spawn" (e.g. on Windows)
    solutions = 0
    length = 6
    weight = 2
    start_val = (1 << weight) - 1
    mid_val = 18  # if splitting into 2 processes
    end_val = 1 << length
    # Multiprocessing
    p1 = Process(target=ksubsetcolexsuccessorSegmentComp, args=(length, weight, start_val, mid_val, solutions))
    p2 = Process(target=ksubsetcolexsuccessorSegmentComp, args=(length, weight, mid_val, end_val, solutions))
    p1.start()
    p2.start()
    p1.join()
    p2.join()
This does indeed reduce the computation time by around 45% (as the length and weight increase). My CPU runs at around 50% during this. Oddly enough, splitting into 4 processes takes roughly the same time as splitting into 2 (even though my CPU runs at around 85% during that). On the laptop I'm testing this on, I have 2 physical cores and 4 logical cores. How many processes are optimal on such a CPU?
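In the Process-based code above, each child receives its own copy of `solutions`, so the parent's variable never changes and each child only prints a partial tally. The sketch below returns per-chunk counts to the parent with `multiprocessing.Pool` instead. It makes a few assumptions. `count_segment` and `count_parallel` are hypothetical names. N is passed explicitly rather than read from a global. `bounds` is a list of `(start_val, end_val)` pairs such as the ones your chunking function produces.

```python
from multiprocessing import Pool

import numpy as np


def count_segment(N, length, val, limit, tol=1e-6):
    """Count the weight-k strings b from val (inclusive) to limit (exclusive),
    in the order of ksubsetcolexsuccessor, for which N v = b is solvable.
    The weight k is implied by the number of set bits in val."""
    count = 0
    while val < limit:
        b = np.array([float(ch) for ch in format(val, "0{}b".format(length))])
        _, residuals, _, _ = np.linalg.lstsq(N, b, rcond=None)
        # residuals is empty if N is rank-deficient or not taller than wide;
        # such b are skipped here.
        if residuals.size and residuals[0] < tol:
            count += 1
        minbit = val & -val
        fillbit = (val + minbit) & ~val
        val = (val + minbit) | ((fillbit // (minbit << 1)) - 1)
    return count  # returned to the parent instead of printed


def count_parallel(N, length, bounds, processes=None):
    """bounds: list of (start_val, end_val) pairs, one task per pair.
    processes=None lets Pool use os.cpu_count() workers."""
    tasks = [(N, length, lo, hi) for lo, hi in bounds]
    with Pool(processes) as pool:
        return sum(pool.starmap(count_segment, tasks))
```

For the two-way split in the question, call `count_parallel(N, length, [(start_val, mid_val), (mid_val, end_val)], processes=2)` from inside an `if __name__ == "__main__":` block. It returns the total directly. Splitting into more chunks than processes also lets `Pool` balance uneven chunks. The best process count on a 2-core/4-thread laptop is something to measure rather than assume.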
Questions:
In my lack of knowledge of multiprocessing, have I done this correctly? i.e. is this the fastest way to multiprocess this program?
Can anyone see any clear way to improve computation time? I've looked into multithreading within each process, but it seemed to slow down the computation.
The slowest part of this is definitely np.linalg.lstsq. Are there any clever ways to speed this part up?
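On the last question: N never changes, yet every `lstsq` call refactorizes it. One alternative, swapped in here and not taken from the question, is to build the projector P = I - N N^+ once with `np.linalg.pinv`. For each b, the squared norm of P b equals the squared residual that `lstsq` reports, so a whole batch of b vectors can be tested with one matrix product. `count_solutions_batched` is a hypothetical name, and the approach has not been benchmarked here. Unlike the `lstsq` loop, it still produces a residual when N is rank-deficient (where `lstsq` returns an empty residual array), so the counts can differ in that case.

```python
import numpy as np


def count_solutions_batched(N, b_strings, tol=1e-6, batch=4096):
    """Count binary strings b for which N v = b has a least-squares
    solution with squared residual below tol.

    The projector P = I - N N^+ is built once; for each b, P b is the
    least-squares residual, so a batch of b vectors is tested with a
    single matrix product instead of one lstsq call each.
    """
    N = np.asarray(N, dtype=float)
    P = np.eye(N.shape[0]) - N @ np.linalg.pinv(N)
    count = 0
    rows = []
    for b in b_strings:
        rows.append([float(ch) for ch in b])
        if len(rows) == batch:
            count += _count_batch(P, rows, tol)
            rows = []
    if rows:
        count += _count_batch(P, rows, tol)
    return count


def _count_batch(P, rows, tol):
    B = np.array(rows).T  # one candidate b per column
    R = P @ B  # least-squares residual of each column
    return int(np.count_nonzero(np.sum(R * R, axis=0) < tol))
```

With the generator from the question, `count_solutions_batched(N, ksubsetcolexsuccessor(length, weight))` would replace the per-b loop. Each worker process could run it on its own chunk.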
If anyone wishes to test the code, here is how I generate the matrix N:
'''
Creating the matrix N
'''
def matrixgenerator(G, cardinality, degree):
    # cardinality = number of group elements, degree = n
    matrix = np.zeros((cardinality, (degree-1)**2 + 1))
    # Generate the matrix (it'll be missing stab_nn)
    column_index = 0
    for i in range(1, degree):
        for j in range(1, degree):
            for k in range(cardinality):
                if G[k](i) == j:
                    matrix[k][column_index] = 1
            column_index += 1
    # Determine the final column of N, which is determined by the characteristic vector of stab_nn
    for k in range(cardinality):
        if G[k](degree) == degree:
            matrix[k][(degree-1)**2] = 1
    return matrix
n = 3  # This will give length 6 and weight 2 binary strings
# Try n = 7 and change the second argument below to '4'; this takes roughly 14 min with 2 processes on my device
Group = TransitiveGroup(n, 2)  # SageMath 8.8
length = Group.order()
weight = length // n
N = matrixgenerator(Group, length, n)
This requires SageMath 8.8. Note this is the only part of the code that requires SageMath. Everything else is effectively written in Python 2.7. Thanks in advance.

Related

Performance improvements for Python large integer multiplication

In Python, I'm working on a project that involves multiplying very large numbers, because I take the n-th multifactorial of x, where x is n 1s in a row (for example, x = 111 for n = 3).
The whole thing works efficiently, but I'm spending 80%+ of my computation time calculating the product of the integers for the factorial.
This bottleneck becomes especially noticeable at n = 7 where I effectively hit a brick wall. 6 takes under 0.1 seconds, 7 takes 7.5 seconds, and 8 takes so long I've stopped it after a few minutes without it completing.
Any way I can improve the efficiency of this? More specifically the efficiency of the math.prod(arr).
import argparse
import MyFormatter
import datetime
import math
def first_n_digits(num, n):
    return num // 10 ** (int(math.log(num, 10)) - n + 1)
start = datetime.datetime.now()
parser = argparse.ArgumentParser(
    formatter_class=MyFormatter.MyFormatter,
    description="Calcs x factorial",
    usage="",
)
parser.add_argument("-n", "--number", type=int)
args = parser.parse_args()
if args.number == 1:
    print(1)
    exit()
s = ""
for _ in range(0, args.number):
    s = s + "1"
n = 1
s = int(s)
arr = []
while (s > 1):
    arr.append(s)
    s -= args.number
n = math.prod(arr)
fnd = str(first_n_digits(n, 3))
print("{}.{}{}e{}".format(fnd[0], fnd[1], fnd[2], int(math.log10(n))))
end = datetime.datetime.now()
print(end-start)

You don't need an exact product. You just need 3 leading digits and an order of magnitude. You're wasting tremendous amounts of time and memory doing the computation in exact integer arithmetic.
One initial step would be to add up logarithms instead of multiplying integers:
log_prod = 0
while s > 1:
    log_prod += math.log10(s)
    s -= args.number
magnitude = int(log_prod)
normalized = 10**(log_prod-magnitude)
This discards millions of digits of precision you don't need, computing results in approximate floating-point arithmetic that still has enough precision for your use case. normalized is a number between 1 and 10 that has the same leading digits as the full product, and magnitude is the full product's order of magnitude.
This still has to add up a lot of logarithms as the input size increases, taking exponentially more time and losing more precision. Further steps might involve using a more sophisticated summation routine (helping with precision, but not runtime), or finding a different way to express the multifactorial that's more amenable to computation.
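The sketch below puts the log-summing steps together into one function. It uses `math.fsum` as one example of the "more sophisticated summation routine" mentioned above. `multifactorial_leading` is a hypothetical name. Instead of printing, it returns the mantissa (between 1 and 10) and the exponent.

```python
import math


def multifactorial_leading(n):
    """Approximate the product x * (x - n) * (x - 2n) * ... over the terms
    greater than 1, where x is the integer written as n ones (111 for n = 3).

    Returns (mantissa, exponent) with 1 <= mantissa < 10, so that the
    product is approximately mantissa * 10**exponent.
    """
    x = int("1" * n)
    # Sum base-10 logarithms instead of multiplying big integers;
    # math.fsum avoids the extra rounding error a plain running sum accumulates.
    log_prod = math.fsum(math.log10(s) for s in range(x, 1, -n))
    exponent = int(log_prod)
    mantissa = 10 ** (log_prod - exponent)
    return mantissa, exponent
```

For n = 2 the product is 11 * 9 * 7 * 5 * 3 = 10395, so the function returns a mantissa of about 1.0395 with exponent 4. Runtime is still linear in the number of terms, which grows roughly tenfold with each extra digit of n.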

Efficient algorithm for getting number of partitions of integer with distinct parts (Partition function Q)

I need to create a function that takes one int argument and returns an int representing the number of partitions of the input integer into distinct parts. Namely,
input:3 -> output: 1 -> {1, 2}
input:6 -> output: 3 -> {1, 2, 3}, {2, 4}, {1, 5}
...
Since I am looking only for distinct parts, something like this is not allowed:
4 -> {1, 1, 1, 1} or {1, 1, 2}
So far I have managed to come up with some algorithms which would find every possible combination, but they are pretty slow and effective only until n=100 or so.
And since I only need the number of combinations, not the combinations themselves, the partition function Q should solve the problem.
Does anybody know how to implement this efficiently?
More information about the problem: OEIS, Partition Function Q
EDIT:
To avoid any confusion, the DarrylG answer also includes the trivial (single) partition, but this does not affect the quality of it in any way.
EDIT 2:
jodag's answer (the accepted one) does not include the trivial partition.

Tested two algorithms:
Simple recurrence relation
Wolfram MathWorld algorithm (based upon Georgiadis, Kediaya, Sloane)
Both implemented with memoization using lru_cache.
Results: the Wolfram MathWorld approach is orders of magnitude faster.
1. Simple recurrence relation (with Memoization)
Reference
Code
from functools import lru_cache
@lru_cache(maxsize=None)
def p(n, d=0):
    if n:
        return sum(p(n-k, n-2*k+1) for k in range(1, n-d+1))
    else:
        return 1
Performance
n     Time (sec)
10    0.0020
50    0.5530
100   8.7430
200   168.5830
2. Wolfram MathWorld algorithm
(based upon Georgiadis, Kediaya, Sloane)
Reference
Code
# Implementation of q recurrence
# https://mathworld.wolfram.com/PartitionFunctionQ.html
from functools import lru_cache
from math import sqrt
class PartitionQ():
    def __init__(self, MAXN):
        self.MAXN = MAXN
        self.j_seq = self.calc_j_seq(MAXN)
    @lru_cache(maxsize=None)
    def q(self, n):
        " Q strict partition function "
        assert n < self.MAXN
        if n == 0:
            return 1
        sqrt_n = int(sqrt(n)) + 1
        temp = sum(((-1)**(k+1))*self.q(n-k*k) for k in range(1, sqrt_n))
        return 2*temp + self.s(n)
    def s(self, n):
        if n in self.j_seq:
            return (-1)**self.j_seq[n]
        else:
            return 0
    def calc_j_seq(self, MAX_N):
        """ Used to determine if n of form j*(3*j (+/-) 1) / 2
        by creating a dictionary of n, j value pairs """
        result = {}
        j = 0
        valn = -1
        while valn <= MAX_N:
            jj = 3*j*j
            valp, valn = (jj - j)//2, (jj+j)//2
            result[valp] = j
            result[valn] = j
            j += 1
        return result
Performance
n     Time (sec)
10    0.00087
50    0.00059
100   0.00125
200   0.10933
Conclusion: This algorithm is orders of magnitude faster than the simple recurrence relationship
Algorithm
Reference

I think a straightforward and efficient way to solve this is to explicitly compute the coefficient of the generating function from the Wolfram PartitionsQ link in the original post.
This is a pretty illustrative example of how to construct generating functions and how they can be used to count solutions. To start, we recognize that the problem may be posed as follows:
Let m_1 + m_2 + ... + m_{n-1} = n where m_j = 0 or m_j = j for all j.
Q(n) is the number of solutions of the equation.
We can find Q(n) by constructing the following polynomial (i.e. the generating function)
(1 + x)(1 + x^2)(1 + x^3)...(1 + x^(n-1))
The number of solutions is the number of ways the terms combine to make x^n, i.e. the coefficient of x^n after expanding the polynomial. Therefore, we can solve the problem by simply performing the polynomial multiplication.
def Q(n):
    # Represent polynomial as a list of coefficients from x^0 to x^n.
    # G_0 = 1
    G = [int(g_pow == 0) for g_pow in range(n + 1)]
    for k in range(1, n):
        # G_k = G_{k-1} * (1 + x^k)
        # This is equivalent to adding G shifted to the right by k to G
        # Ignore powers greater than n since we don't need them.
        G = [G[g_pow] if g_pow - k < 0 else G[g_pow] + G[g_pow - k] for g_pow in range(n + 1)]
    return G[n]
Timing (average of 1000 iterations)
import time
print("n Time (sec)")
for n in [10, 50, 100, 200, 300, 500, 1000]:
    t0 = time.time()
    for i in range(1000):
        Q(n)
    elapsed = time.time() - t0
    print('%-5d%.08f' % (n, elapsed / 1000))
n Time (sec)
10 0.00001000
50 0.00017500
100 0.00062900
200 0.00231200
300 0.00561900
500 0.01681900
1000 0.06701700

You can memoize the recurrences in equations 8, 9, and 10 in the MathWorld article you linked to get a runtime quadratic in N.
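That answer gives no code. As an illustration of a quadratic-time approach, here is a standard 0/1-knapsack-style dynamic program. It is not necessarily the MathWorld equations 8 to 10 referred to above. `ways[t]` counts sets of distinct parts summing to t. Each part size is added once, and totals are iterated downward so that no part is reused. `q_distinct` is a hypothetical name. It counts the trivial partition {n}, so subtract 1 to match the question's convention.

```python
def q_distinct(n):
    """Number of partitions of n into distinct parts, including {n} itself."""
    ways = [1] + [0] * n  # ways[0] = 1: the empty set of parts
    for part in range(1, n + 1):
        # Iterate totals downward so each part size is used at most once.
        for total in range(n, part - 1, -1):
            ways[total] += ways[total - part]
    return ways[n]
```

For example, `q_distinct(6) - 1` is 3, matching {1, 2, 3}, {2, 4}, {1, 5} from the question. The work is about n^2/2 additions on Python integers.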

def partQ(n):
    result = []
    def rec(part, tgt, allowed):
        if tgt == 0:
            result.append(sorted(part))
        elif tgt > 0:
            for i in allowed:
                rec(part + [i], tgt - i, allowed - set(range(1, i + 1)))
    rec([], n, set(range(1, n)))
    return result
The work is done by the rec internal function, which takes:
part - a list of parts whose sum is always less than or equal to the target n
tgt - the remaining partial sum that needs to be added to the sum of part to get to n
allowed - a set of numbers still allowed to be used in the full partitioning
When tgt = 0 is passed, the sum of part is n, and part is added to the result list. If tgt is still positive, each of the allowed numbers is tried as an extension of part, in a recursive call.

Python - How to improve efficiency of complex recursive function?

In this video by Mathologer on, amongst other things, infinite sums, there are 3 different infinite sums shown at 9:25, when the video suddenly freezes and an elephant deity pops up, challenging the viewer to find "the probable values" of the expressions. I wrote the following script to approximate the last of the three (i.e. 1 + 3.../2...) with increasing precision:
from decimal import Decimal as D, getcontext  # for accurate results
def main(c):  # faster code when functions defined locally (I think)
    def run1(c):
        c += 1
        if c <= DEPTH:
            return D(1) + run3(c)/run2(c)
        else:
            return D(1)
    def run2(c):
        c += 1
        if c <= DEPTH:
            return D(2) + run2(c)/run1(c)
        else:
            return D(2)
    def run3(c):
        c += 1
        if c <= DEPTH:
            return D(3) + run1(c)/run3(c)
        else:
            return D(3)
    return run1(c)
getcontext().prec = 10  # too much precision isn't currently necessary
for x in range(1, 31):
    DEPTH = x
    print(x, main(0))
Now this is working totally fine for 1 <= x <= 20ish, but it starts taking an eternity for each result after that. I do realize that this is due to the exponentially increasing number of function calls being made at each DEPTH level. It is also clear that I won't be able to calculate the series comfortably up to an arbitrary point. However, the point at which the program slows down is too early for me to clearly identify the limit the series is converging to (it might be 1.75, but I need more DEPTH to be certain).
My question is: How do I get as much out of my script as possible (performance-wise)?
I have tried:
1. finding the mathematical solution to this problem. (No matching results)
2. finding ways to optimize recursive functions in general. According to multiple sources (e.g. this), Python doesn't optimize tail recursion by default, so I tried switching to an iterative style, but I ran out of ideas on how to accomplish this almost instantly...
Any help is appreciated!
NOTE: I know that I could go about this mathematically instead of "brute-forcing" the limit, but I want to get my program running well, now that I've started...

You can store the results of the run1, run2 and run3 functions in arrays to prevent them from being recalculated every time, since in your example, main(1) calls run1(1), which calls run3(2) and run2(2), which in turn call run1(3), run2(3), run1(3) (again) and run3(3), and so on.
You can see that run1(3) is being evaluated twice, and this only gets worse as the number increases; if we count the number of times each function is called, these are the results:
level  run1  run2  run3
1      1     0     0
2      0     1     1
3      1     2     1
4      3     2     3
5      5     6     5
6      11    10    11
7      21    22    21
8      43    42    43
9      85    86    85
...
20     about 175,000 each (2^19 calls in total at that level)
...
30     about 179 million each (2^29 calls in total at that level)
This is actually a variant of a Pascal triangle, and you could probably figure out the results mathematically; but since here you asked for a non mathematical optimization, just notice how the number of calls increases exponentially; it doubles at each iteration. This is even worse since each call will generate thousands of subsequent calls with higher values, which is what you want to avoid.
Therefore what you want to do is store the value of each call, so that the function does not need to be called a thousand times (and itself make thousands more calls) to always get the same result. This is called memoization.
Here is an example solution in Python (run1 is shown; run2 and run3 follow the same pattern):
# before calling main, create the lists val1, val2, val3, each of size DEPTH + 1, filled with -1:
# val1, val2, val3 = [-1] * (DEPTH + 1), [-1] * (DEPTH + 1), [-1] * (DEPTH + 1)
def run1(c):  # same thing for run2 and run3
    c += 1
    if c <= DEPTH:
        local3 = val3[c]  # read run3(c)
        if local3 == -1:  # if run3(c) hasn't been computed yet
            local3 = run3(c)  # we compute it
            val3[c] = local3  # and store it in the list
        local2 = val2[c]  # same with run2(c)
        if local2 == -1:
            local2 = run2(c)
            val2[c] = local2
        return D(1) + local3/local2  # use the value from the list or from the computation
    else:
        return D(1)
Here I use -1 since your functions seem to only generate positive numbers, and -1 is an easy placeholder for the empty cells. In other cases you might have to use an object, as Cabu below me did. I however think this would be slower due to the cost of retrieving properties in an object versus reading an array, but I might be wrong about that. Either way, your code should be much, much faster than it is now, with a cost of O(n) instead of O(2^n).
This would technically let your code keep going at a constant cost per level, but the recursion will actually cause an early stack overflow: with Python's default recursion limit of 1000, you can get to a depth of roughly a thousand (more if you raise the limit with sys.setrecursionlimit).
Edit: As ShadowRanger added in the comments, you can keep your original code and simply add @lru_cache(maxsize=n) before each of your run1, run2 and run3 functions, where n is one of the first powers of two above DEPTH (for example, 32 if depth is 25). This requires from functools import lru_cache.

With some memoization, you can get all the way up to the recursion limit:
from decimal import Decimal as D, getcontext  # for accurate results
def main(c):  # faster code when functions defined locally (I think)
    # Store partial results of run1, run2 and run3, keyed by argument.
    # They are created inside main (rather than passed as parameters)
    # so that they are reset on every call.
    mrun1, mrun2, mrun3 = {}, {}, {}
    def run1(c):
        if c in mrun1:  # if the partial result is already computed, return it
            return mrun1[c]
        if c + 1 <= DEPTH:
            v = D(1) + run3(c + 1) / run2(c + 1)
        else:
            v = D(1)
        mrun1[c] = v  # otherwise store it under this argument and return it
        return v
    def run2(c):
        if c in mrun2:
            return mrun2[c]
        if c + 1 <= DEPTH:
            v = D(2) + run2(c + 1) / run1(c + 1)
        else:
            v = D(2)
        mrun2[c] = v
        return v
    def run3(c):
        if c in mrun3:
            return mrun3[c]
        if c + 1 <= DEPTH:
            v = D(3) + run1(c + 1) / run3(c + 1)
        else:
            v = D(3)
        mrun3[c] = v
        return v
    return run1(c)
getcontext().prec = 150
for x in range(1, 997):
    DEPTH = x
    print(x, main(0))
Going beyond about 997 exceeds Python's default recursion limit (1000) and raises a RecursionError.
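The question also asks about an iterative style. Each level's three values depend only on the three values one level deeper, so the recursion can be unrolled bottom-up with no memo tables and no recursion limit. This sketch assumes the same base cases and DEPTH semantics as `main(0)` in the question. `nested_value` is a hypothetical name.

```python
from decimal import Decimal as D


def nested_value(depth):
    """Iterative (bottom-up) equivalent of main(0) from the question,
    with DEPTH passed in as `depth`."""
    # At the deepest level every run function returns its base value.
    r1, r2, r3 = D(1), D(2), D(3)
    for _ in range(depth):
        # Move one level up. The tuple assignment evaluates all three
        # right-hand sides with the values from the level below.
        r1, r2, r3 = D(1) + r3 / r2, D(2) + r2 / r1, D(3) + r1 / r3
    return r1
```

Set `getcontext().prec` as in the question before calling it. Each extra level costs three Decimal divisions, so depths far beyond 997 are cheap.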

How to speed up code to solve bit deletion puzzle

[This is related to Minimum set cover]
I would like to solve the following puzzle by computer for small size of n. Consider all 2^n binary vectors of length n. For each one you delete exactly n/3 of the bits, leaving a binary vector length 2n/3 (assume n is an integer multiple of 3). The goal is to choose the bits you delete so as to minimize the number of different binary vectors of length 2n/3 that remain at the end.
For example, for n = 3 the optimal answer is 2 different vectors 11 and 00. For n = 6 it is 4, for n = 9 it is 6 and for n = 12 it is 10.
I had previously attempted to solve this problem as a minimum set cover problem of the following sort. All the lists contain only 1s and 0s.
I say that a list A covers a list B if you can make B from A by inserting exactly x symbols.
Consider all 2^n lists of 1s and 0s of length n and set x = n/3. I would like to compute a minimal set of lists of length 2n/3 that covers them all. David Eisenstat provided code that converted this minimal set cover problem into a Mixed Integer Programming Problem that could be fed into CPLEX (or http://scip.zib.de/ which is open source).
from collections import defaultdict
from itertools import product, combinations
def all_fill(source, num):
    output_len = (len(source) + num)
    for where in combinations(range(output_len), len(source)):
        poss = ([[0, 1]] * output_len)
        for (w, s) in zip(where, source):
            poss[w] = [s]
        for tup in product(*poss):
            (yield tup)
def variable_name(seq):
    return ('x' + ''.join((str(s) for s in seq)))
n = 12
shortn = ((2 * n) // 3)
x = (n // 3)
all_seqs = list(product([0, 1], repeat=shortn))
hit_sets = defaultdict(set)
for seq in all_seqs:
    for fill in all_fill(seq, x):
        hit_sets[fill].add(seq)
print('Minimize')
print(' + '.join((variable_name(seq) for seq in all_seqs)))
print('Subject To')
for (fill, seqs) in hit_sets.items():
    print(' + '.join((variable_name(seq) for seq in seqs)), '>=', 1)
print('Binary')
for seq in all_seqs:
    print(variable_name(seq))
print('End')
The problem is that if you set n=15 then the instance it outputs is too large for any solver I can find. Is there a more efficient way of solving this problem so I can solve n=15 or even n = 18?
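For small n, a candidate answer can be checked directly. Deleting n/3 bits from a length-n string leaves exactly its length-2n/3 subsequences, so "A covers B" is a subsequence test. `is_subsequence` and `is_cover` are hypothetical names for this brute-force sketch. It loops over all 2^n strings, so it is only practical for the small sizes discussed here.

```python
from itertools import product


def is_subsequence(short, long):
    """True if `short` can be obtained from `long` by deleting symbols."""
    it = iter(long)
    # `ch in it` advances the iterator past the first match of ch
    return all(ch in it for ch in short)


def is_cover(cover, n):
    """True if every length-n bit string contains at least one string
    from `cover` (length 2n/3) as a subsequence, i.e. can be reduced to
    it by deleting n/3 bits."""
    for bits in product("01", repeat=n):
        s = "".join(bits)
        if not any(is_subsequence(c, s) for c in cover):
            return False
    return True
```

For example, `is_cover(["00", "11"], 3)` returns True, matching the answer of 2 for n = 3. The size-6 cover for N = 9 printed in the answer below can be checked the same way.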

This doesn't solve your problem (well, not quickly enough), but you're not getting many ideas and someone else may find something useful to build on here.
It's a short pure Python 3 program, using backtracking search with some greedy ordering heuristics. It solves the N = 3, 6, and 9 instances very quickly. It finds a cover of size 10 for N=12 quickly too, but will apparently take a much longer time to exhaust the search space (I'm out of time for this, and it's still running). For N=15, the initialization time is already slow.
Bitstrings are represented by plain N-bit integers here, so consume little storage. That's to ease recoding in a faster language. It does make heavy use of sets of integers, but no other "advanced" data structures.
Hope this helps someone! But it's clear that the combinatorial explosion of possibilities as N increases ensures that nothing will be "fast enough" without digging deeper into the mathematics of the problem.
def dump(cover):
    for s in sorted(cover):
        print(" {:0{width}b}".format(s, width=I))
def new_best(cover):
    global best_cover, best_size
    assert len(cover) < best_size
    best_size = len(cover)
    best_cover = cover.copy()
    print("N =", N, "new best cover, size", best_size)
    dump(best_cover)
def initialize(N, X, I):
    from itertools import combinations
    # Map a "wide" (length N) bitstring to the set of all
    # "narrow" (length I) bitstrings that generate it.
    w2n = [set() for _ in range(2**N)]
    # Map a narrow bitstring to all the wide bitstrings
    # it generates.
    n2w = [set() for _ in range(2**I)]
    for wide, wset in enumerate(w2n):
        for t in combinations(range(N), X):
            narrow = wide
            for i in reversed(t):  # largest i to smallest
                hi, lo = divmod(narrow, 1 << i)
                narrow = ((hi >> 1) << i) | lo
            wset.add(narrow)
            n2w[narrow].add(wide)
    return w2n, n2w
def solve(needed, cover):
    if len(cover) >= best_size:
        return
    if not needed:
        new_best(cover)
        return
    # Find something needed with minimal generating set.
    _, winner = min((len(w2n[g]), g) for g in needed)
    # And order its generators by how much reduction they make
    # to `needed`.
    for g in sorted(w2n[winner],
                    key=lambda g: len(needed & n2w[g]),
                    reverse=True):
        cover.add(g)
        solve(needed - n2w[g], cover)
        cover.remove(g)
N = 9  # CHANGE THIS TO WHAT YOU WANT
assert N % 3 == 0
X = N // 3  # number of bits to exclude
I = N - X  # number of bits to include
print("initializing")
w2n, n2w = initialize(N, X, I)
best_cover = None
best_size = 2**I + 1  # "infinity"
print("solving")
solve(set(range(2**N)), set())
Example output for N=9:
initializing
solving
N = 9 new best cover, size 6
000000
000111
001100
110011
111000
111111
Followup
For N=12 this eventually finished, confirming that the minimal covering set contains 10 elements (which it found very soon at the start). I didn't time it, but it took at least 5 hours.
Why's that? Because it's close to brain-dead ;-) A completely naive search would try all subsets of the 256 8-bit short strings. There are 2**256 such subsets, about 1.2e77 - it wouldn't finish in the expected lifetime of the universe ;-)
The ordering gimmicks here first detect that the "all 0" and "all 1" short strings must be in any covering set, so pick them. That leaves us looking at "only" the 254 remaining short strings. Then the greedy "pick an element that covers the most" strategy very quickly finds a covering set with 11 total elements, and shortly thereafter a covering with 10 elements. That happens to be optimal, but it takes a long time to exhaust all other possibilities.
At this point, any attempt at a covering set that reaches 10 elements is aborted (it can't possibly be smaller than 10 elements then!). If that were done wholly naively too, it would need to try adding (to the "all 0" and "all 1" strings) all 8-element subsets of the 254 remaining, and 254-choose-8 is about 3.8e14. Very much smaller than 1.2e77 - but still way too large to be practical. It's an interesting exercise to understand how the code manages to do so much better than that. Hint: it has a lot to do with the data in this problem.
Industrial-strength solvers are incomparably more sophisticated and complex. I was pleasantly surprised at how well this simple little program did on the smaller problem instances! It got lucky.
But for N=15 this simple approach is hopeless. It quickly finds a cover with 18 elements, but makes no more visible progress for at least hours. Internally, it's still working with needed sets containing hundreds (even thousands) of elements, which makes the body of solve() quite expensive. It still has 2**10 - 2 = 1022 short strings to consider, and 1022-choose-16 is about 6e34. I don't expect it would visibly help even if this code were sped up by a factor of a million.
It was fun to try, though :-)
And a small rewrite
This version runs at least 6 times faster on a full N=12 run, simply by cutting off futile searches one level earlier. Also speeds initialization, and cuts memory use by changing the 2**N w2n sets into lists (no set operations are used on those). It's still hopeless for N=15, though :-(
def dump(cover):
    for s in sorted(cover):
        print(" {:0{width}b}".format(s, width=I))
def new_best(cover):
    global best_cover, best_size
    assert len(cover) < best_size
    best_size = len(cover)
    best_cover = cover.copy()
    print("N =", N, "new best cover, size", best_size)
    dump(best_cover)
def initialize(N, X, I):
    from itertools import combinations
    # Map a "wide" (length N) bitstring to the set of all
    # "narrow" (length I) bitstrings that generate it.
    w2n = [set() for _ in range(2**N)]
    # Map a narrow bitstring to all the wide bitstrings
    # it generates.
    n2w = [set() for _ in range(2**I)]
    # mask[i] is a string of i 1-bits
    mask = [2**i - 1 for i in range(N)]
    for t in combinations(range(N), X):
        t = t[::-1]  # largest i to smallest
        for wide, wset in enumerate(w2n):
            narrow = wide
            for i in t:  # delete bit 2**i
                narrow = ((narrow >> (i+1)) << i) | (narrow & mask[i])
            wset.add(narrow)
            n2w[narrow].add(wide)
    # release some space
    for i, s in enumerate(w2n):
        w2n[i] = list(s)
    return w2n, n2w
def solve(needed, cover):
    if not needed:
        if len(cover) < best_size:
            new_best(cover)
        return
    if len(cover) >= best_size - 1:
        # can't possibly be extended to a cover < best_size
        return
    # Find something needed with minimal generating set.
    _, winner = min((len(w2n[g]), g) for g in needed)
    # And order its generators by how much reduction they make
    # to `needed`.
    for g in sorted(w2n[winner],
                    key=lambda g: len(needed & n2w[g]),
                    reverse=True):
        cover.add(g)
        solve(needed - n2w[g], cover)
        cover.remove(g)
N = 9  # CHANGE THIS TO WHAT YOU WANT
assert N % 3 == 0
X = N // 3  # number of bits to exclude
I = N - X  # number of bits to include
print("initializing")
w2n, n2w = initialize(N, X, I)
best_cover = None
best_size = 2**I + 1  # "infinity"
print("solving")
solve(set(range(2**N)), set())
print("best for N =", N, "has size", best_size)
dump(best_cover)

First consider if you have 6 bits. You can throw away 2 bits. Therefore, any pattern with a zero-one balance of 6-0, 5-1 or 4-2 can be converted to 0000 or 1111. In the case of a 3-3 zero-one balance, any pattern can be converted to one of four cases: 1000, 0001, 0111, or 1110. Therefore, one possible minimum set for 6 bits is:
0000
0001
0111
1110
1000
1111
Now consider 9 bits with 3 thrown away. You have the following set of 14 master patterns:
000000
100000
000001
010000
000010
110000
000011
001111
111100
101111
111101
011111
111110
111111
In other words, each pattern set has ones/zeros in the center, with every permutation of n/3-1 bits on each end. For example, if you have 24 bits then you will have 17 bits in the center and 7 bits on the ends. Since 2^7 = 128 you will have 4 x 128 - 2 = 510 possible patterns.
To find correct deletions there are various algorithms. One method is to find the edit distance between the current bit set and each master pattern. The pattern with the minimum edit distance is the one to convert to. This method uses dynamic programming. Another method would be to do a tree search through the patterns using a set of rules to find the matching pattern.

Why is the maxmin divide and conquer implementation slower than the other maxmin algorithms?

I'm comparing the complexity of two implementations of the maxmin algorithm: the brute force way and the divide and conquer way. I tested both algorithms on ten inputs with between 1000000 and 10000000 elements. The algorithms follow.
The brute force implementation:
def maxmin1(vetor):
    max, min = vetor[0], vetor[0]
    for elem in vetor[1:]:
        if elem > max:
            max = elem
        if elem < min:
            min = elem
    return (min, max)
and the divide and conquer implementation:
def maxmin4(vetor, inicio, fim):
    if ((fim - inicio) == 1):
        return (vetor[inicio], vetor[inicio])
    elif ((fim - inicio) == 2):
        if (vetor[inicio] < vetor[inicio+1]):
            return (vetor[inicio], vetor[inicio+1])
        else:
            return (vetor[inicio+1], vetor[inicio])
    else:
        # // keeps the split point an integer index in Python 3
        (min_left, max_left) = maxmin4(vetor, inicio, (fim-inicio)//2 + inicio)
        (min_right, max_right) = maxmin4(vetor, (fim-inicio)//2 + inicio, fim)
        if (max_left < max_right):
            max = max_right
        else:
            max = max_left
        if (min_left < min_right):
            min = min_left
        else:
            min = min_right
        return (min, max)
and the results:
input N    | time algorithm 1 | time algorithm 2
1000000    | 0.1299650669     | 0.6347620487
2000000    | 0.266600132      | 1.3034451008
3000000    | 0.393116951      | 2.2436430454
4000000    | 0.5371210575     | 2.5098109245
5000000    | 0.6094739437     | 3.4496300221
6000000    | 0.8271648884     | 4.6163318157
7000000    | 1.0598180294     | 4.8950240612
8000000    | 1.053456068      | 5.1900761128
9000000    | 1.1843969822     | 5.6422820091
10000000   | 1.361964941      | 6.9290060997
I don't understand why the first algorithm was faster than the second: the first has complexity 2(n-1) and the second has complexity 3n/2-2, so in theory the first should be slower than the second. Why does this happen?

As it turns out, there does seem to be a bug in your code or a mistake in your analysis—but it doesn't matter. I'll get to it at the end.
If you look at your results, it seems pretty clear that there's a constant difference of about 5x between the two. That implies that the algorithmic complexity of the second isn't any worse than the first, it's just got a much higher constant multiplier—you're doing the same number of steps, but each one is much more work.
It's possible that this is just an artifact of you testing such a narrow range, only a single factor of 10. But running your tests with a wider range of values, like this:
import random
import timeit
for i in 100, 1000, 10000, 100000, 1000000, 10000000:
    v = [random.random() for _ in range(i)]
    t1 = timeit.timeit(lambda: maxmin1(v), number=1)
    t2 = timeit.timeit(lambda: maxmin4(v, 0, len(v)), number=1)
    print('{:8}: {:.8f} {:.8f} (x{:.8f})'.format(i, t1, t2, t2/t1))
… you can see that the pattern holds up:
100: 0.00002003 0.00010014 (x5.00000000)
1000: 0.00017500 0.00080800 (x4.61716621)
10000: 0.00172400 0.00821304 (x4.76393307)
100000: 0.01630187 0.08839488 (x5.42237660)
1000000: 0.17010999 0.76053309 (x4.47083153)
10000000: 1.77093697 8.32503319 (x4.70092010)
So, why the higher constant overhead in the second version? Well, the first version is just doing a simple for iteration, two comparisons, and 1 assignment for each element. The second is calling functions, building and exploding tuples, doing more comparisons, etc. That's bound to be slower. If you want to know why it's exactly 5x slower (or, actually, 15x slower, if you're doing 2n/3 steps instead of just 2n), you'll need to do some profiling, or at least look at the bytecode. But I don't think it's worth it.
The moral of the story is that there's a reason 2(n-1) and 2n/3-2 are both O(n): When you've got two different complexity classes, like O(n) and O(n**2), that will always make a difference for large n; when you've got two algorithms in the same class, the constants in the implementation (the cost of each step) can easily outweigh the constants in the step count.
Meanwhile, how can we verify the 2n/3-2 analysis? Simple, just add a global counter that you increment once for each call to maxmin4. Here are the expected and actual results:
100: 65 127
1000: 665 1023
10000: 6665 11807
100000: 66665 131071
1000000: 666665 1048575
10000000: 6666665 11611391
But this just means you're doing about 2/3rds as many steps instead of about 1/3rd, so the constant cost of each step is 7.5x rather than 15x. In the end, that doesn't really affect the analysis.

Although the divide and conquer approach guarantees the minimum number of compares, the actual complexity of the program depends on the total number of operations performed in the program.
In your case, you do around 4 or 5 operations for about n/2 function calls (the leaf nodes of the binary tree of function calls), and around 16 operations for the internal nodes (counting all the assignments, arithmetic operations, comparisons, and tuple constructions). That sums up to around 10n total operations.
In the first program, the total number of operations is essentially 2.x*n (where x depends on the number of assignments performed).
This, together with the relative simplicity of operations in program 1 over program 2 results in the factor of 5 observed in the two programs.
Also, the number of compares by the divide and conquer algorithm should be 3n/2, and not 2n/3.

I would be very surprised to ever see Python recursion run faster than Python iteration. Try this implementation of maxmin, taking two values at a time.
def minmax(seq):
    def byTwos(seq):
        # yield values from sequence two at a time;
        # if there is an odd number of values, pair the
        # last value with itself (won't hurt minmax
        # evaluation)
        seq = iter(seq)
        for last in seq:  # the for loop ends cleanly when seq runs out
            yield last, next(seq, last)
    seqByTwos = byTwos(seq)
    # initialize minval and maxval
    a, b = next(seqByTwos, (None, None))
    if a < b:
        minval, maxval = a, b
    else:
        minval, maxval = b, a
    # now walk the rest of the sequence
    for a, b in seqByTwos:
        if a < b:
            if a < minval:
                minval = a
            if b > maxval:
                maxval = b
        else:
            if b < minval:
                minval = b
            if a > maxval:
                maxval = a
    return minval, maxval
If you want to count comparisons, then pass a sequence of objects that implement __lt__ and __gt__, and have those methods update a global counter.
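Following that suggestion, here is a sketch that counts comparisons with a wrapper class. `Counted` and `count` are hypothetical names. `maxmin4` is rewritten with `//` and shorter names but keeps the question's splitting logic. `maxmin1` always makes exactly 2(n - 1) comparisons, while the divide-and-conquer version makes 3n/2 - 2 when n is a power of two. So it does make fewer comparisons and is still slower, which is the constant-factor point made in the answers above.

```python
import random


class Counted:
    """Wraps a value and counts every < or > comparison made on it."""
    comparisons = 0

    def __init__(self, value):
        self.value = value

    def __lt__(self, other):
        Counted.comparisons += 1
        return self.value < other.value

    def __gt__(self, other):
        Counted.comparisons += 1
        return self.value > other.value


def maxmin1(v):
    lo = hi = v[0]
    for elem in v[1:]:
        if elem > hi:
            hi = elem
        if elem < lo:
            lo = elem
    return lo, hi


def maxmin4(v, start, end):
    if end - start == 1:
        return v[start], v[start]
    if end - start == 2:
        if v[start] < v[start + 1]:
            return v[start], v[start + 1]
        return v[start + 1], v[start]
    mid = start + (end - start) // 2
    lo1, hi1 = maxmin4(v, start, mid)
    lo2, hi2 = maxmin4(v, mid, end)
    hi = hi2 if hi1 < hi2 else hi1
    lo = lo1 if lo1 < lo2 else lo2
    return lo, hi


def count(fn, n):
    """Run fn on n random wrapped values and return the comparison count."""
    data = [Counted(random.random()) for _ in range(n)]
    Counted.comparisons = 0
    fn(data)
    return Counted.comparisons
```

For example, `count(maxmin1, 1024)` is 2046, while `count(lambda d: maxmin4(d, 0, len(d)), 1024)` is 1534.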
