Given a number M and a list A containing N elements (A1, A2, ...), find all the numbers k such that 1 <= k <= M and gcd(Ai, k) = 1 for every Ai.
Here's my code. Its only problem is the nested loops, which make it slow when the inputs are big. How can I rewrite it so that it takes less time?
N, M = [int(v) for v in input().split()]
A = [int(v) for v in input().split()]

from math import gcd

cnt = 0
print(N)
for k in range(1, M+1):
    for i in range(N):
        if gcd(k, A[i]) == 1:
            cnt += 1
    if cnt == N:
        print(k)
    cnt = 0
Input example (the first line contains N and M, the second contains the list A1, A2, ...):
3 12
6 1 5
Here's a fast version that eliminates the nested loops:
N, M = [int(v) for v in input().split()]
A = [int(v) for v in input().split()]

from math import gcd

print(N)
l = 1
for v in A:
    l = l*v//gcd(l, v)
for k in range(1, M+1):
    if gcd(l, k) == 1:
        print(k)
It works by first taking the LCM, l, of the values in A. It then suffices to check whether the GCD of k and l is 1, which means k has no common factors with any of the values in A.
Note: If you're using a newer version of Python than I am (3.9 or later), you can import lcm from math and replace l = l*v//gcd(l, v) with l = lcm(l, v).
Or, as Kelly Bundy pointed out, lcm accepts an arbitrary number of arguments, so the first loop can be replaced with l = lcm(*A) if you're using 3.9 or later.
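As a quick sanity check on the sample input: the loop computes l = lcm(6, 1, 5) = 30, and the only values of k in 1..12 with gcd(30, k) == 1 are 1, 7 and 11:
from math import gcd

A = [6, 1, 5]
M = 12
l = 1
for v in A:
    l = l*v//gcd(l, v)  # l ends up as lcm(6, 1, 5) = 30
print([k for k in range(1, M+1) if gcd(l, k) == 1])  # [1, 7, 11]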
Just another approach, using sympy.ntheory's factorint and Python sets. From the point of view of speed it has, on my machine, no advantage over the math.lcm() or math.gcd() based solutions when applied to small lists and numbers, but it excels at very large randomized lists:
M = 12
lstA = (6, 1, 5)

from sympy.ntheory import factorint

lstAfactors = []
for a in lstA:
    lstAfactors += factorint(a)
setA = set(lstAfactors)
for k in range(1, M+1):
    if not (set(factorint(k)) & setA):
        print(k)
The code above implements the idea described in Yatisi's answer (and coded by Tom Karzes using math.gcd()), but uses sympy.ntheory's factorint() and set() instead of math.gcd().
In terms of speed, the factorint() solution seems to be the fastest on the test data below:
# ======================================================================
from time import perf_counter as T
from math import gcd, lcm
from sympy import factorint
from random import choice

#M = 3000
#lstA = 100 * [6, 12, 18, 121, 256, 1024, 361, 2123, 39]
M = 8000
lstA = [choice(range(1, 8000)) for _ in range(8000)]

# ----------------------------------------------------------------------
from sympy.ntheory import factorint
lstResults = []
lstAfactors = []
sT = T()
for a in lstA:
    lstAfactors += factorint(a)
setA = set(lstAfactors)
for k in range(1, M+1):
    if not (set(factorint(k)) & setA):
        lstResults += [k]
print("factorint:", T()-sT)
#print(lstResults)
print("---")

# ----------------------------------------------------------------------
lstResults = []
sT = T()
#l = 1
#for a in lstA:
#    l = (l*a)//gcd(l, a)  # can be replaced by:
l = lcm(*lstA)  # least common multiple divisible by all lstA items
# ^-- which runs MAYBE a bit faster than the loop with gcd()
for k in range(1, M+1):
    if gcd(l, k) == 1:
        lstResults += [k]
print("lcm() :", T()-sT)
#print(lstResults)
print("---")

# ----------------------------------------------------------------------
lstResults = []
sT = T()
l = 1
for a in lstA:
    l = (l*a)//gcd(l, a)  # can be replaced by:
#l = lcm(*lstA)  # least common multiple divisible by all lstA items
# ^-- which runs MAYBE a bit faster than the loop with gcd()
for k in range(1, M+1):
    if gcd(l, k) == 1:
        lstResults += [k]
print("gcd() :", T()-sT)
#print(lstResults)
print("---")

# ----------------------------------------------------------------------
import numpy as np
A = np.array(lstA)

def find_gcd_np(M, A, to_gcd=1):
    vals = np.arange(1, M + 1)
    return vals[np.all(np.gcd(vals, np.array(A)[:, None]) == to_gcd, axis=0)]

sT = T()
lstResults = find_gcd_np(M, A, 1).tolist()
print("numpy :", T()-sT)
#print(lstResults)
print("---")
printing:
factorint: 0.09754624799825251
---
lcm() : 0.10102138598449528
---
gcd() : 0.10236155497841537
---
numpy : 6.923375226906501
---
The timing results change dramatically for the other data variant in the code above (the commented-out M = 3000 case), printing:
factorint: 0.021642255946062505
---
lcm() : 0.0010238440008834004
---
gcd() : 0.0013772319070994854
---
numpy : 0.19953695288859308
---
where the factorint-based approach is 20x and the numpy-based approach 200x slower than the gcd/lcm-based ones.
Run the timing test yourself online. It won't run the large-data case, but it can at least demonstrate that the numpy approach is 100x slower than the gcd one:
factorint: 0.03271647123619914
---
lcm() : 0.003286922350525856
---
gcd() : 0.0029655308462679386
---
numpy : 0.41759901121258736
1 https://ato.pxeger.com/run?1=3VXBitswED0W9BXDGoq16-zaSbtsAzmk0GN6KLmUkAbFkdcismQkZbc-9Et62Uv7Uf2ajiwnNt1Du7ClUIORpXmaefNmLH39Xjeu1Orh4dvBFaObHy--RDB7locURlfgRMUBQFS1Ng5qbopNrg_KcQPMwjKAKubKHnSb7xKQeRVstqnq5mQrWO60EcoFo2Fqh0NnzEstck6iBfqCGUzSNCWRtG6OkyxN4RxW1wlkY3xv_JglMH7tV9LxqwQm136ejSf4-WZNhk46H6suQoxhb3mcJTdpSikU2sAGhIKw7BdhTUgEo2d5BjJcKldybZrHaiDDD9wepLOe9GrdUg5mGxbscraMKfFkmSfrAVOCOQ6RF7PeZ8wosbxNHId4AAte9n3KKNziIqOtOxAFKO0g9pt6Z3sU6qV3NO9gEEIfWWPk1X5N6hZ8dto3PUsAaY_skpIoGPtN9AhHlc7oMyo-4DXULpK-kUj0i4ZRmwqaYnnO6NUV9m8sE2AUIsiZgi0Hw2vJcr6DbTMF4rHY3_G53-9RkjOL7aurSiuoMK6oJYeduBNWbPFr2wCTsg0HwvHKYsxPoxHclyIvwRyUhcX849t3yGorfFtY_3-5EoNjw4DUuoZ7gf-Yp_bb6nX89xQPAsj-oFo-F-oRT6nW3y5WqNXjdn9SpaL_rVSt139Wqu7YUgd_pOPxr2rijxdVXzJjWNOeMZTseAGFULsNkt2oOl4kME_AaT-fHZO_Y9Iet56kgQvIaGs23B2MalErj5EyxsFn75eSPuScrqYJvNeKr1sRQxjsic_CzlLat9Owyx6zy-il01JYF5-0C1k-UepwC3eX8fFS_gk)
This is probably more a math question than a programming question; however, here comes my take. Depending on M and A, it might be better to:
1. Find the prime divisors of the Ai (have a look at this) and put them in a set.
2. Either remove (sieve) all multiples of these primes from list(range(1, M+1)), which you can do (more) efficiently with smart ordering, or find all primes smaller than or equal to M (which could even be pre-computed) that are not divisors of any Ai and compute all of their multiples up to M.
Explanation: since gcd(Ai, k) = 1 if and only if Ai and k have no common divisors, they also have no common prime divisors. Thus, we can first find all prime divisors of the Ai and then make sure our k don't have any of them as divisors, either.
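A minimal sketch of the sieve variant, assuming sympy is available for the factoring step (sympy.primefactors returns the distinct prime divisors of a number):
from sympy import primefactors

def coprime_ks(M, A):
    # collect every prime divisor of any Ai
    bad_primes = set()
    for a in A:
        bad_primes.update(primefactors(a))
    # sieve: mark every multiple of a "bad" prime as excluded
    excluded = [False] * (M + 1)
    for p in bad_primes:
        for multiple in range(p, M + 1, p):
            excluded[multiple] = True
    return [k for k in range(1, M + 1) if not excluded[k]]

print(coprime_ks(12, [6, 1, 5]))  # [1, 7, 11]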
Using numpy with vectorised operations will be a good alternative when your input range M goes up into the hundreds and higher while A stays small (about the size of your current A):
import numpy as np

def find_gcd_np(M, A, to_gcd=1):
    vals = np.arange(1, M + 1)
    return vals[np.all(np.gcd(vals, np.array(A)[:, None]) == to_gcd, axis=0)]
Usage:
print(find_gcd_np(100, [6, 1, 5], 1))
I need to calculate the Fourier transform of a 256-element float64 signal. The requirement is that I need to invoke these FFTs from inside a cuda.jitted section, and they must complete within 25 usec. Alas, cuda.jit-compiled functions do not allow invoking external libraries, so I wrote my own. Alas, my single-core code is still way too slow (~250 usec on a Quadro P4000). Is there a better way?
I created a single-core FFT function that gives correct results, but is alas 10x too slow. I don't understand how to make good use of multiple cores.
---fft.py
from numba import cuda, boolean, void, int32, float32, float64, complex128
import math, sys, cmath

def _transform_radix2(vector, inverse, out):
    n = len(vector)
    levels = int32(math.log(float32(n))/math.log(float32(2)))
    assert 2**levels == n  # error: Length is not a power of 2

    # uncomment either Numba.Cuda or Numpy memory allocation
    # (intelligent conditional compilation??)
    exptable = cuda.local.array(1024, dtype=complex128)
    #exptable = np.zeros(1024, np.complex128)

    assert (n // 2) <= len(exptable)  # error: FFT length > MAXFFTSIZE

    coef = complex128((2j if inverse else -2j) * math.pi / n)
    for i in range(n // 2):
        exptable[i] = cmath.exp(i * coef)

    for i in range(n):
        x = i
        y = 0
        for j in range(levels):
            y = (y << 1) | (x & 1)
            x >>= 1
        out[i] = vector[y]

    size = 2
    while size <= n:
        halfsize = size // 2
        tablestep = n // size
        for i in range(0, n, size):
            k = 0
            for j in range(i, i + halfsize):
                temp = out[j + halfsize] * exptable[k]
                out[j + halfsize] = out[j] - temp
                out[j] += temp
                k += tablestep
        size *= 2

    scale = float64(n if inverse else 1)
    for i in range(n):
        out[i] = out[i]/scale  # the inverse requires a scaling

# now create the Numba.cuda version to be called by a GPU
gtransform_radix2 = cuda.jit(device=True)(_transform_radix2)
---test.py
from numba import cuda, void, float64, complex128, boolean
import cupy as cp
import numpy as np
import timeit
import fft
@cuda.jit(void(float64[:], boolean, complex128[:]))
def fftbench(y, inverse, FT):
    Y = cuda.local.array(256, dtype=complex128)
    for i in range(len(y)):
        Y[i] = complex128(y[i])
    fft.gtransform_radix2(Y, False, FT)
str='\nbest [%2d/%2d] iterations, min:[%9.3f], max:[%9.3f], mean:[%9.3f], std:[%9.3f] usec'
a=[127.734375 ,130.87890625 ,132.1953125 ,129.62109375 ,118.6015625
,110.2890625 ,106.55078125 ,104.8203125 ,106.1875 ,109.328125
,113.5 ,118.6640625 ,125.71875 ,127.625 ,120.890625
,114.04296875 ,112.0078125 ,112.71484375 ,110.18359375 ,104.8828125
,104.47265625 ,106.65625 ,109.53515625 ,110.73828125 ,111.2421875
,112.28125 ,112.38671875 ,112.7734375 ,112.7421875 ,113.1328125
,113.24609375 ,113.15625 ,113.66015625 ,114.19921875 ,114.5
,114.5546875 ,115.09765625 ,115.2890625 ,115.7265625 ,115.41796875
,115.73828125 ,116. ,116.55078125 ,116.5625 ,116.33984375
,116.63671875 ,117.015625 ,117.25 ,117.41015625 ,117.6640625
,117.859375 ,117.91015625 ,118.38671875 ,118.51171875 ,118.69921875
,118.80859375 ,118.67578125 ,118.78125 ,118.49609375 ,119.0078125
,119.09375 ,119.15234375 ,119.33984375 ,119.31640625 ,119.6640625
,119.890625 ,119.80078125 ,119.69140625 ,119.65625 ,119.83984375
,119.9609375 ,120.15625 ,120.2734375 ,120.47265625 ,120.671875
,120.796875 ,120.4609375 ,121.1171875 ,121.35546875 ,120.94921875
,120.984375 ,121.35546875 ,120.87109375 ,120.8359375 ,121.2265625
,121.2109375 ,120.859375 ,121.17578125 ,121.60546875 ,121.84375
,121.5859375 ,121.6796875 ,121.671875 ,121.78125 ,121.796875
,121.8828125 ,121.9921875 ,121.8984375 ,122.1640625 ,121.9375
,122. ,122.3515625 ,122.359375 ,122.1875 ,122.01171875
,121.91015625 ,122.11328125 ,122.1171875 ,122.6484375 ,122.81640625
,122.33984375 ,122.265625 ,122.78125 ,122.44921875 ,122.34765625
,122.59765625 ,122.63671875 ,122.6796875 ,122.6171875 ,122.34375
,122.359375 ,122.7109375 ,122.83984375 ,122.546875 ,122.25390625
,122.06640625 ,122.578125 ,122.7109375 ,122.83203125 ,122.5390625
,122.2421875 ,122.06640625 ,122.265625 ,122.13671875 ,121.8046875
,121.87890625 ,121.88671875 ,122.2265625 ,121.63671875 ,121.14453125
,120.84375 ,120.390625 ,119.875 ,119.34765625 ,119.0390625
,118.4609375 ,117.828125 ,117.1953125 ,116.9921875 ,116.046875
,115.16015625 ,114.359375 ,113.1875 ,110.390625 ,108.41796875
,111.90234375 ,117.296875 ,127.0234375 ,147.58984375 ,158.625
,129.8515625 ,120.96484375 ,124.90234375 ,130.17578125 ,136.47265625
,143.9296875 ,150.24609375 ,141. ,117.71484375 ,109.80859375
,115.24609375 ,118.44140625 ,120.640625 ,120.9921875 ,111.828125
,101.6953125 ,111.21484375 ,114.91015625 ,115.2265625 ,118.21875
,125.3359375 ,139.44140625 ,139.76953125 ,135.84765625 ,137.3671875
,141.67578125 ,139.53125 ,136.44921875 ,135.08203125 ,135.7890625
,137.58203125 ,138.7265625 ,154.33203125 ,172.01171875 ,152.24609375
,129.8046875 ,125.59375 ,125.234375 ,127.32421875 ,132.8984375
,147.98828125 ,152.328125 ,153.7734375 ,155.09765625 ,156.66796875
,159.0546875 ,151.83203125 ,138.91796875 ,138.0546875 ,140.671875
,143.48046875 ,143.99609375 ,146.875 ,146.7578125 ,141.15234375
,141.5 ,140.76953125 ,140.8828125 ,145.5625 ,150.78125
,148.89453125 ,150.02734375 ,150.70703125 ,152.24609375 ,148.47265625
,131.95703125 ,125.40625 ,123.265625 ,123.57421875 ,129.859375
,135.6484375 ,144.51171875 ,155.05078125 ,158.4453125 ,140.8125
,100.08984375 ,104.29296875 ,128.55078125 ,139.9921875 ,143.38671875
,143.69921875 ,137.734375 ,124.48046875 ,116.73828125 ,114.84765625
,113.85546875 ,117.45703125 ,122.859375 ,125.8515625 ,133.22265625
,139.484375 ,135.75 ,122.69921875 ,115.7734375 ,116.9375
,127.57421875]
y1 = cp.zeros(len(a), cp.complex128)
FT1 = cp.zeros(len(a), cp.complex128)
for i in range(len(a)):
    y1[i] = a[i]  # convert to complex to feed the FFT
r=1000
series=sorted(timeit.repeat("fftbench(y1, False, FT1)", number=1, repeat=r, globals=globals()))
series=series[0:r-5]
print(str % (len(series), r, 1e6*np.min(series), 1e6*np.max(series), 1e6*np.mean(series), 1e6*np.std(series)));
What I want: a faster implementation (t << 25 usec).
The drawback of your algorithm is that, even on the GPU, it runs on a single core.
In order to understand how to design algorithms on Nvidia GPGPUs, I recommend looking at
the CUDA C Programming Guide and at the Numba documentation to apply the code in Python.
Moreover, to understand what's wrong with your code, I recommend using the Nvidia profiler.
The following parts of the answer will explain how to apply the basics to your example.
Run multiple threads
To improve performance, you will first need to launch multiple threads that can run in parallel. CUDA handles threads as follows:
Threads are grouped into blocks of n threads (n < 1024).
Each thread within the same block can be synchronized and has access to a (fast) common memory space called "shared memory".
You can run multiple blocks in parallel in a "grid", but you will lose the synchronization mechanism.
The syntax to run multiple threads is the following:
fftbench[griddim, blockdim](y1, False, FT1)
To simplify, I will use only one block of size 256:
fftbench[1, 256](y1, False, FT1)
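More generally, a common CUDA pattern (my illustration, not part of the original answer) is to derive the grid size from the problem size so that every array element gets its own thread:
blockdim = 256                            # threads per block (must be < 1024)
griddim = (n + blockdim - 1) // blockdim  # ceiling division: enough blocks to cover n elements
fftbench[griddim, blockdim](y1, False, FT1)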
Memory
To improve GPU performance, it's important to look at where the data will be stored. There are three main spaces:
global memory: it's the "RAM" of your GPU; it's slow and has a high latency; this is where all your arrays are placed when you send them to the GPU.
shared memory: it's a small, fast-access memory; all the threads of a block have access to the same shared memory.
local memory: physically it's the same as global memory, but each thread accesses its own local memory.
Typically, if you use the same data multiple times, you should try to store it in shared memory to avoid the latency of global memory.
In your code, you can store exptable in shared memory:
exptable = cuda.shared.array(1024, dtype=complex128)
and if n is not too big, you may want to use a shared working array instead of out:
working = cuda.shared.array(256, dtype=complex128)
Assign tasks to each thread
Of course, if you don't change your function, all threads will do the same job and it will just slow down your program.
In this example we will assign each thread to one cell of the array. To do so, we have to get the unique ID of the thread within a block:
idx = cuda.threadIdx.x
Now we will be able to speed up the for loops; let's handle them one by one:
exptable = cuda.shared.array(1024, dtype=complex128)
...
for i in range(n // 2):
    exptable[i] = cmath.exp(i * coef)
Here is the goal: we want the first n/2 threads to fill this array; then all the threads will be able to use it.
So in this case, just replace the for loop by a condition on the thread's idx:
if idx < n // 2:
    exptable[idx] = cmath.exp(idx * coef)
For the last two loops it's easier: each thread will deal with one cell of the array:
for i in range(n):
    x = i
    y = 0
    for j in range(levels):
        y = (y << 1) | (x & 1)
        x >>= 1
    out[i] = vector[y]
becomes
x = idx
y = 0
for j in range(levels):
    y = (y << 1) | (x & 1)
    x >>= 1
working[idx] = vector[y]
and
for i in range(n):
    out[i] = out[i]/scale  # the inverse requires a scaling
becomes
out[idx]=working[idx]/scale # the inverse requires a scaling
I use the shared array working, but you can replace it by out if you want to use global memory.
Now, let's look at the while loop. We said that we want each thread to only deal with one cell of the array, so we can try to parallelize the two for loops inside.
...
for i in range(0, n, size):
    k = 0
    for j in range(i, i + halfsize):
        temp = out[j + halfsize] * exptable[k]
        out[j + halfsize] = out[j] - temp
        out[j] += temp
        k += tablestep
...
To simplify, I will only use half of the threads: we take the first 128 threads and determine j as follows:
...
if idx < 128:
    j = (idx % halfsize) + size*(idx // halfsize)
...
k is:
k = tablestep*(idx%halfsize)
so we get the loop:
size = 2
while size <= n:
    halfsize = size // 2
    tablestep = n // size
    if idx < 128:
        j = (idx % halfsize) + size*(idx // halfsize)
        k = tablestep*(idx % halfsize)
        temp = working[j + halfsize] * exptable[k]
        working[j + halfsize] = working[j] - temp
        working[j] += temp
    size *= 2
Synchronization
Last but not least, we need to synchronize all these threads. In fact, the program will not work if we do not sync. On the GPU, threads may not run at the same time, so you can get issues when data is produced by one thread and used by another, for example:
exptable[0] is used by thread_2 before thread_0 stores its value
working[j + halfsize] is modified by another thread before you store it in temp
To prevent this, we can use the function:
cuda.syncthreads()
All the threads in the same block will finish this line before executing the rest of the code.
In this example, you need to synchronize at two points: after the working initialization and after each iteration of the while loop.
Then your code looks like:
def _transform_radix2(vector, inverse, out):
    n = len(vector)
    levels = int32(math.log(float32(n))/math.log(float32(2)))
    assert 2**levels == n  # error: Length is not a power of 2

    exptable = cuda.shared.array(1024, dtype=complex128)
    working = cuda.shared.array(256, dtype=complex128)
    assert (n // 2) <= len(exptable)  # error: FFT length > MAXFFTSIZE

    idx = cuda.threadIdx.x  # unique ID of this thread within the block (see above)

    coef = complex128((2j if inverse else -2j) * math.pi / n)
    if idx < n // 2:
        exptable[idx] = cmath.exp(idx * coef)

    x = idx
    y = 0
    for j in range(levels):
        y = (y << 1) | (x & 1)
        x >>= 1
    working[idx] = vector[y]
    cuda.syncthreads()

    size = 2
    while size <= n:
        halfsize = size // 2
        tablestep = n // size
        if idx < 128:
            j = (idx % halfsize) + size*(idx // halfsize)
            k = tablestep*(idx % halfsize)
            temp = working[j + halfsize] * exptable[k]
            working[j + halfsize] = working[j] - temp
            working[j] += temp
        size *= 2
        cuda.syncthreads()

    scale = float64(n if inverse else 1)
    out[idx] = working[idx]/scale  # the inverse requires a scaling
I feel like your question is a good way to introduce some basics about GPGPU computing, and I have tried to answer it in a didactic way. The final code is far from perfect and can be optimized a lot; I highly recommend reading the Programming Guide linked above if you want to learn more about GPU optimizations.
I have a multidimensional array (result) that should be filled by some nested loops. The function fun() is a complex and time-consuming function. I want to fill the array elements in parallel so that I can use all of my system's processing power.
Here's the code:
import numpy as np

def fun(x, y, z):
    # time-consuming computation...
    # ...
    return output

dim1 = 10
dim2 = 20
dim3 = 30

result = np.zeros([dim1, dim2, dim3])
for i in xrange(dim1):
    for j in xrange(dim2):
        for k in xrange(dim3):
            result[i, j, k] = fun(i, j, k)
My question is: can I parallelize this code, and if so, how?
I'm using Windows 10 64-bit and Python 2.7.
Please provide your solution by changing my code if you can.
Thanks!
If you want a more general solution, taking advantage of fully parallel execution, then why not use something like this:
>>> import multiprocess as mp
>>> p = mp.Pool()
>>>
>>> # a time consuming function taking x,y,z,...
>>> def fun(*args):
...     import time
...     time.sleep(.1)
...     return sum(*args)
...
>>> dim1, dim2, dim3 = 10, 20, 30
>>> import itertools
>>> input = ((i,j,k) for i,j,k in itertools.combinations_with_replacement(xrange(dim3), 3) if i < dim1 and j < dim2)
>>> results = p.map(fun, input)
>>> p.close()
>>> p.join()
>>>
>>> results[:2]
[0, 1]
>>> results[-2:]
[56, 57]
Note I'm using multiprocess instead of multiprocessing, but that's only to get the ability to work in the interpreter.
I didn't use a numpy.array, but if you had to... you could just dump the output from p.map directly into a numpy.array and then modify the shape attribute to be shape = (dim1, dim2, dim3), or you could do something like this:
>>> input = ((i,j,k) for i,j,k in itertools.combinations_with_replacement(xrange(dim3), 3) if i < dim1 and j < dim2)
>>> import numpy as np
>>> results = np.empty(dim1*dim2*dim3)
>>> res = p.imap(fun, input)
>>> for i,r in enumerate(res):
...     results[i] = r
...
>>> results.shape = (dim1,dim2,dim3)
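For the first option mentioned above (dumping the p.map output straight into an array), a sketch using itertools.product instead of the filtered combinations, so that the number of results is exactly dim1*dim2*dim3 and the reshape is valid (this substitution is my assumption, not part of the original recipe):
>>> import itertools
>>> import numpy as np
>>> p = mp.Pool()  # fresh pool (the earlier one was closed)
>>> input = itertools.product(xrange(dim1), xrange(dim2), xrange(dim3))
>>> results = np.array(p.map(fun, input))  # one result per (i,j,k) triple
>>> results.shape = (dim1, dim2, dim3)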
Here is a version of the code that runs fun(i, j, k) in parallel for different k indices. This is done by running fun in different processes using https://docs.python.org/2/library/multiprocessing.html
import numpy as np
from multiprocessing import Pool

def fun(x, y, z):
    # time-consuming computation...
    # ...
    return output

def fun_wrapper(indices):
    return fun(*indices)  # return is needed so pool.map collects the values

if __name__ == '__main__':
    dim1 = 10
    dim2 = 20
    dim3 = 30

    result = np.zeros([dim1, dim2, dim3])

    pool = Pool(processes=8)
    for i in xrange(dim1):
        for j in xrange(dim2):
            result[i, j] = pool.map(fun_wrapper, [(i, j, k) for k in xrange(dim3)])
This is not the most elegant solution, but you may start with it. And you will get a speed-up only if fun contains time-consuming computation.
A simple approach could be to divide the array into sections and create some threads to operate through these sections. For example, one section from (0,0,0) to (5,10,15) and another one from (5,10,16) to (10,20,30).
You can use the threading module and do something like this:
import numpy as np
import threading as t

def fun(x, y, z):
    # time-consuming computation...
    # ...
    return output

dim1 = 10
dim2 = 20
dim3 = 30

result = np.zeros([dim1, dim2, dim3])

# b - beginning index, e - end index
def work(ib, jb, kb, ie, je, ke):
    for i in xrange(ib, ie):
        for j in xrange(jb, je):
            for k in xrange(kb, ke):
                result[i, j, k] = fun(i, j, k)

threads = list()
threads.append(t.Thread(target=work, args=(0, 0, 0, dim1/2, dim2/2, dim3/2)))
threads.append(t.Thread(target=work, args=(dim1/2, dim2/2, dim3/2 + 1, dim1, dim2, dim3)))
for thread in threads:
    thread.start()
You can define these sections through some algorithm and determine the number of threads dynamically, as in the sketch below. I hope it helps you, or at least gives you some ideas.
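For instance, a minimal sketch of a dynamic split along the first axis only (my illustration, not part of the original answer), which also joins the threads so the caller knows when result is fully filled:
n_threads = 4
step = dim1 // n_threads or 1  # section size along the first axis

threads = [t.Thread(target=work, args=(s, 0, 0, min(s + step, dim1), dim2, dim3))
           for s in range(0, dim1, step)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()  # wait until every section has been filled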
When I use a fairly straightforward cost function for my optimization objective, Gurobi gives back an answer, but when I complicate things with math.log() functions, or even with i**2 instead of i*i, it produces an error similar to one of the following:
GurobiError: Divisor must be a constant
TypeError: a float is required
TypeError: unsupported operand type(s) for ** or pow(): 'Var' and 'int'
I tried to reformulate math.log((m-i)/i) to math.log(m-i) - math.log(i); this produces the "float is required" error. Changing i*i to i**2 produces the unsupported-operand error.
Now my question is: is it just impossible to use a more complex function within Gurobi, or am I making a mistake elsewhere?
Here is a snippet of my model:
from gurobipy import *
import pandas as pd
import numpy as np
import time
import math

start_time = time.time()

# example NL (i, 20, 0.08, -6.7, 301)
def cost(i, j, k, l, m):
    cost = (j - l)*i + k*i*i - l*(m - i) * (math.log((m - i) / i))
    return cost

def utility(i, j, k, l):
    utility = j + k*i + l*i*i
    return utility

"""
def cost(i, j, k, l):
    cost = j + k*i + .5*l*i*i
    return cost
"""

# assign files to use as input and as output
outputfile = 'model1nodeoutput.csv'
inputfile = 'marketclearinginput.xlsx'

# define dataframes
dfdemand = pd.read_excel(inputfile, sheetname="demand", encoding='utf8')
dfproducer = pd.read_excel(inputfile, sheetname="producer", encoding='utf8')

m = Model("1NodeMultiPeriod")

dofprod = [m.addVar(lb=3.0, ub=300, name=h) for h in dfproducer['name']]
dofdem = [m.addVar(lb=3.0, ub=300, name=h) for h in dfdemand['name']]

# Integrate new variables
m.update()

# Set objective
m.setObjective(quicksum([utility(i, j, k, l) for i, j, k, l
                         in zip(dofdem, dfdemand['c'], dfdemand['a'], dfdemand['b'])]) -
               quicksum([cost(i, j, k, l, m) for i, j, k, l, m
                         in zip(dofprod, dfproducer['c'], dfproducer['a'], dfproducer['b'], dfproducer['Pmax'])]),
               GRB.MAXIMIZE)

# Set constraints
# Set constraints for producers
for i, j, k in zip(dofprod, dfproducer['Pmin'], dfproducer['Pmax']):
    m.addConstr(i >= j)
    m.addConstr(i <= k)

# Set constraints for demand
for i, j, k in zip(dofdem, dfdemand['Pmin'], dfdemand['Pmax']):
    m.addConstr(i >= j)
    m.addConstr(i <= k)

# Build the timestamp list, pd or np unique both possible, pd faster and preserves order
# Timestamps skip the first 3 symbols (example L1T2034 becomes 2034)
timestamps = pd.unique([i.varName[3:] for i in dofprod])

# Set constraint produced >= demanded
# (this should be the last constraint added, for shadow variables)
for h in timestamps:
    m.addConstr(quicksum([i for i in dofprod if i.varName.endswith(h)]) >=
                quicksum([i for i in dofdem if i.varName.endswith(h)]))

m.optimize()
Your problem might have to do with the Gurobi quicksum() function. Perhaps try sum().
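If you want to try that quickly, a minimal sketch of the swap in the objective, with everything else in the model unchanged (the zip variable is renamed from m to p here so it does not shadow the Model object):
# hypothetical variant: Python's built-in sum() instead of Gurobi's quicksum()
m.setObjective(sum(utility(i, j, k, l) for i, j, k, l
                   in zip(dofdem, dfdemand['c'], dfdemand['a'], dfdemand['b'])) -
               sum(cost(i, j, k, l, p) for i, j, k, l, p
                   in zip(dofprod, dfproducer['c'], dfproducer['a'],
                          dfproducer['b'], dfproducer['Pmax'])),
               GRB.MAXIMIZE)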
I'm trying to implement an algorithm from the Algorithmic Toolbox course on Coursera that takes an arithmetic expression such as 5+8*4-2 and computes its largest possible value. However, I don't really understand the choice of indices in the last part of the algorithm shown below; my implementation fails to compute values using the ones initialized in the two tables (which store the maximized and minimized values of subexpressions).
The evalt function just takes the operator character and applies the corresponding operation to the two operands:
def evalt(a, b, op):
    if op == '+':
        return a + b
    #and so on
MinMax computes the minimum and maximum values of subexpressions:
def MinMax(i, j, op, m, M):
    mmin = 10000
    mmax = -10000
    for k in range(i, j-1):
        a = evalt(M[i][k], M[k+1][j], op[k])
        b = evalt(M[i][k], m[k+1][j], op[k])
        c = evalt(m[i][k], M[k+1][j], op[k])
        d = evalt(m[i][k], m[k+1][j], op[k])
        mmin = min(mmin, a, b, c, d)
        mmax = max(mmax, a, b, c, d)
    return (mmin, mmax)
And this is the body of the main function:
def get_maximum_value(dataset):
    op = dataset[1:len(dataset):2]
    d = dataset[0:len(dataset)+1:2]
    n = len(d)

    # initializing matrices/tables
    m = [[0 for i in range(n)] for j in range(n)]  # minimized values
    M = [[0 for i in range(n)] for j in range(n)]  # maximized values

    for i in range(n):
        m[i][i] = int(d[i])  # so that the tables will look like
        M[i][i] = int(d[i])  # [[i, 0, 0...], [0, i, 0...], [0, 0, i,...]]

    for s in range(n):  # here's where I get confused
        for i in range(n-s):
            j = i + s
            m[i][j], M[i][j] = MinMax(i, j, op, m, M)

    return M[0][n-1]
Sorry to bother; here's what had to be improved:
for s in range(1, n):
in the main function, and
for k in range(i, j):
in the MinMax function. Now it works.
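With those two changes applied, a quick sanity check on the expression from the question (my example, assuming evalt handles all three operators; 50 comes from the grouping ((5+8)*4)-2):
print(get_maximum_value("5+8*4-2"))  # 50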
The following change should work.
for s in range(1, n):
    for i in range(0, n-s):