For some reason, the execution time is still the same as without threading. But if I add something like time.sleep(secs), the threads clearly do run concurrently inside the target function d.
def d(CurrentPos, polygon, angale, id):
    Returnvalue = 0
    lock = True
    steg = 0.0005
    distance = 0
    x = 0
    y = 0

    while lock == True:
        x = math.sin(math.radians(angale)) * distance + CurrentPos[0]
        y = math.cos(math.radians(angale)) * distance + CurrentPos[1]
        Localpoint = Point(x, y)
        inout = polygon.contains(Localpoint)
        distance = distance + steg
        if inout == False:
            lock = False
            l = LineString([[CurrentPos[0], CurrentPos[1]], [x, y]])
            Returnvalue = list(l.intersection(polygon).coords)[0]
            Returnvalue = calculateDistance(CurrentPos[0], CurrentPos[1],
                                            Returnvalue[0], Returnvalue[1])
            with Arraylock:
                ReturnArray.append(Returnvalue)
                ReturnArray.append(id)

def Main(CurrentPos, Map):
    threads = []
    for i in range(8):
        t = threading.Thread(target=d, name='thread{}'.format(i),
                             args=(CurrentPos, Map, angales[i], i))
        threads.append(t)
        t.start()
    for i in threads:
        i.join()
Welcome to the world of the Global Interpreter Lock, a.k.a. the GIL. Your function looks like CPU-bound code (calculations, loops, conditionals, memory access, etc.). You can't use threads to increase the performance of CPU-bound tasks, sorry. It is Python's limitation.
There are operations in Python that release the GIL, e.g. disk I/O, network I/O and the one you actually tried: sleep. And indeed, threads do increase the performance of I/O-bound tasks. But arithmetic and/or memory access won't run in parallel in Python.
The standard workaround is to use processes instead of threads. But this is often painful due to not-that-easy interprocess communication. You may also want to consider using low-level libraries like numpy that actually release the GIL in certain situations (that can only be done at the C level; the GIL is not accessible from Python itself), or using some other language without this limitation, e.g. C#, Java, C, C++ and so on.
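As a rough sketch of that workaround (not a drop-in replacement for the code above): if d is refactored to return its result instead of appending to the shared ReturnArray, the eight CPU-bound calls can be spread over a process pool. The angales list is assumed to exist as in the question.

from multiprocessing import Pool

def d(CurrentPos, polygon, angale, id):
    # ... same ray-marching loop as above, but instead of appending to
    # ReturnArray under Arraylock, finish with:
    #     return (id, Returnvalue)
    ...

def Main(CurrentPos, Map):
    args = [(CurrentPos, Map, angales[i], i) for i in range(8)]
    with Pool(processes=8) as pool:        # separate processes, so no GIL contention
        results = pool.starmap(d, args)    # each tuple is unpacked into d's arguments
    return results                         # list of (id, distance) pairs, in submission order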
Related
I want to use a trie search with Python 3.7 in order to match a string against some given words.
The trie search algorithm itself is actually quite fast; however, I also want to use all of the cores my CPU has. Let's assume my PC has 8 cores and I want to use 7 of them.
So I split my word database into 7 equally big lists and built a trie from each one. (That's the basic idea for parallelizing the code.)
However, when I call Process() from the multiprocessing module, the Process().start() method can take a couple of seconds on the real database (the search itself takes about a microsecond).
To be honest, I'm not yet a professional programmer, which means I have probably made some major mistake in the code. Does someone see why starting the process is so damn slow?
Please consider that I tested the script with a much bigger database than the trie below. I also tested the script starting only one process each time, and that was also significantly slower.
I wanted to provide less code; however, I think it's nice to see the actual running problem. I can also provide additional info if needed.
import string
import sys
import time
from multiprocessing import Process, Manager
from itertools import combinations_with_replacement


class TrieNode:

    def __init__(self):
        self.isString = False
        self.children = {}

    def insertString(self, word, root):
        currentNode = root
        for char in word:
            if char not in currentNode.children:
                currentNode.children[char] = TrieNode()
            currentNode = currentNode.children[char]
        currentNode.isString = True

    def findStrings(self, prefix, node, results):
        # Append the result when a word end has been found
        if node.isString:
            results.append(prefix)
        for char in node.children:
            self.findStrings(prefix + char, node.children[char], results)

    def findSubStrings(self, start_prefix, root, results):
        currentNode = root
        for char in start_prefix:
            # Stop the loop on missing prefixes or their children
            if char not in currentNode.children:
                break
            # Otherwise move on to the child node
            else:
                currentNode = currentNode.children[char]
        # Use findStrings recursively to collect the end nodes
        self.findStrings(start_prefix, currentNode, results)
        return results


def gen_word_list(num_words, min_word_len=4, max_word_len=10):
    wordList = []
    total_words = 0
    for long_word in combinations_with_replacement(string.ascii_lowercase, max_word_len):
        wordList.append(long_word)
        total_words += 1
        if total_words >= num_words:
            break

        for cut_length in range(1, max_word_len - min_word_len + 1):
            wordList.append(long_word[:-cut_length])
            total_words += 1
            if total_words >= num_words:
                break

    return wordList


if __name__ == '__main__':
    # Sample word list
    wordList = gen_word_list(1.5 * 10**5)

    # Configs
    try:
        n_cores = int(sys.argv[-1] or 7)
    except ValueError:
        n_cores = 7

    # Repetitions to do in order to estimate the runtime of a single run
    num_repeats = 20
    real_num_repeats = n_cores * num_repeats

    # Creating the trie
    root = TrieNode()
    # Adding words
    for word in wordList:
        root.insertString(word, root)

    # Extending the trie to use it on multiple cores at once
    multiroot = [root] * n_cores

    # Measure time
    print('Single process ...')
    t_0 = time.time()
    for i in range(real_num_repeats):
        r = []
        root.findSubStrings('he', root, r)
    single_proc_time = (time.time() - t_0)
    print(single_proc_time / real_num_repeats)

    # Using multiple cores to speed up the process
    man = Manager()

    # Loop to test the multicore solution
    # (fewer repetitions are done to compare the timings to the single-core solution)
    print('\nMultiprocess ...')
    t_00 = time.time()
    p_init_time = 0
    procs_append_time = 0
    p_start_time = 0
    for i in range(num_repeats):
        # Create a share-able list
        res = man.list()
        procs = []
        for i in range(n_cores):
            t_0 = time.time()
            p = Process(target=multiroot[i].findSubStrings, args=('a', multiroot[i], res))
            t_1 = time.time()
            p_init_time += t_1 - t_0
            procs.append(p)
            t_2 = time.time()
            procs_append_time += t_2 - t_1
            p.start()
            p_start_time += time.time() - t_2
        for p in procs:
            p.join()
    multi_proc_time = time.time() - t_00
    print(multi_proc_time / real_num_repeats)

    init_overhead = p_init_time / single_proc_time
    append_overhead = procs_append_time / single_proc_time
    start_overhead = p_start_time / single_proc_time
    total_overhead = (multi_proc_time - single_proc_time) / single_proc_time
    print(f"Process(...) overhead: {init_overhead:.1%}")
    print(f"procs.append(p) overhead: {append_overhead:.1%}")
    print(f"p.start() overhead: {start_overhead:.1%}")
    print(f"Total overhead: {total_overhead:.1%}")
Single process ...
0.007229958261762347
Multiprocess ...
0.7615800397736686
Process(...) overhead: 0.9%
procs.append(p) overhead: 0.0%
p.start() overhead: 8.2%
Total overhead: 10573.8%
General idea
There are many things to consider, and most of them are already described in Multiprocessing > Programming guidelines. The most important thing is to remember that you are actually working with multiple processes, so there are three (or four) ways in which variables are handled:
1. Synchronized wrappers over ctypes shared-state variables (like multiprocessing.Value). The actual variable is always "one object" in memory, and the wrapper by default uses locking to set/get the real value.
2. Proxies (like Manager().list()). These variables are similar to shared-state variables, but live in a special "server process", and all operations on them actually send pickled values between the manager process and the active process: results.append(x) pickles x and sends it from the process that makes the call to the manager process, where it is unpickled and appended. Any other access to results (like len(results), or iteration over results) involves the same pickling/sending/unpickling round trip. So proxies are generally much slower than any other approach for common variables, and in many cases using a manager for "local" parallelization gives worse performance even compared to single-process runs. But a manager server can be used remotely, so it is reasonable to use one when you want to parallelize the work using workers distributed over multiple machines.
3. Objects that are available when the subprocess is created. With the "fork" start method, all objects available at the moment the subprocess is created are still available and are "not shared", so changing them only changes them locally in the subprocess. But before they are changed, each process really shares the memory of each such object, so:
If they are used read-only, nothing is copied or "communicated".
If they are changed, they are copied inside the subprocess and the copy is what gets changed. This is called copy-on-write (COW). Please note that making a new reference to an object, e.g. assigning a variable to reference it or appending it to a list, increases the object's ref_count, and that is considered a change.
Behavior may also vary depending on the start method: e.g. with the "spawn"/"forkserver" methods, mutable global variables are not really "the same objects"; the value seen by the subprocess may not be the same as in the parent process.
So the initial values of multiroot[i] (used in Process(target=..., args=(..., multiroot[i], ...))) are shared, but:
If you are not using the 'fork' start method (and by default Windows does not use it), then all the args are pickled at least once for each subprocess, so start may take a long time if multiroot[i].children is huge.
Even if you are using fork: initially multiroot[i] seems to be shared and not copied, but I'm not sure what happens when variables are assigned inside the findSubStrings method (e.g. currentNode = ...); maybe it causes copy-on-write (COW), and so the whole TrieNode instance gets copied.
What can be done to improve the situation:
If you are using the fork start method, then make sure that the "database" objects (the TrieNode instances) are truly read-only and don't even have methods with variable assignments in them. For example, you can move findSubStrings to another class, and make sure to do all the instance.insertString calls before starting the subprocesses.
You are using a man.list() instance as the results argument of findSubStrings. This means that a separate "wrapper" is created for each subprocess, and every results.append(prefix) call pickles prefix and sends it to the server process. If you are using a Pool with a limited number of processes, then it's not a big deal. If you are spawning a huge number of subprocesses, then it might affect performance. I also think that by default they all use locking, so concurrent appends might be relatively slow. If the order of items in results does not matter (I'm not experienced with prefix trees and don't remember the theory behind them), then you can fully avoid any overhead related to concurrent results.append:
create a new results list inside the findSubStrings method, and don't use res = man.list() at all;
to get the "final" results, iterate over every result object returned by pool.apply_async(), get the results, and "merge" them.
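A small sketch of that "local results + merge" pattern, with a hypothetical worker function standing in for the trie search:

from multiprocessing import Pool

def find_for_prefix(prefix):
    # stand-in for findSubStrings: builds and returns a plain local list,
    # so only the finished list is pickled back to the parent once
    return [prefix + suffix for suffix in ('a', 'b', 'c')]

if __name__ == '__main__':
    prefixes = ['he', 'hel', 'help']
    with Pool(processes=3) as pool:
        async_results = [pool.apply_async(find_for_prefix, (p,)) for p in prefixes]
        # get the per-task results and merge them in the parent, instead of
        # having every worker append to a Manager().list() proxy
        merged = [item for a_res in async_results for item in a_res.get()]
    print(merged)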
Using weak references
Using currentNode = root in findSubStrings will result in COW of root. That's why weak references (currentNodeRef = weakref.ref(root)) can give a little extra benefit.
Example
import string
import sys
import time
import weakref
from copy import deepcopy
from multiprocessing import Pool
from itertools import combinations_with_replacement


class TrieNode:

    def __init__(self):
        self.isString = False
        self.children = {}

    def insertString(self, word, root):
        current_node = root
        for char in word:
            if char not in current_node.children:
                current_node.children[char] = TrieNode()
            current_node = current_node.children[char]
        current_node.isString = True


# findStrings: not a method of TrieNode anymore, and works with a reference to the node.
def findStrings(prefix, node_ref, results):
    # Append the result when a word end has been found
    if node_ref().isString:
        results.append(prefix)
    for char in node_ref().children:
        findStrings(prefix + char, weakref.ref(node_ref().children[char]), results)


# findSubStrings: not a method of TrieNode anymore, and works with a reference to the node.
def findSubStrings(start_prefix, node_ref, results=None):
    if results is None:
        results = []
    current_node_ref = node_ref
    for char in start_prefix:
        # Stop the loop on missing prefixes or their children
        if char not in current_node_ref().children:
            break
        # Otherwise move on to the child node
        else:
            current_node_ref = weakref.ref(current_node_ref().children[char])
    # Use findStrings recursively to collect the end nodes
    findStrings(start_prefix, current_node_ref, results)
    return results


def gen_word_list(num_words, min_word_len=4, max_word_len=10):
    wordList = []
    total_words = 0
    for long_word in combinations_with_replacement(string.ascii_lowercase, max_word_len):
        wordList.append(long_word)
        total_words += 1
        if total_words >= num_words:
            break

        for cut_length in range(1, max_word_len - min_word_len + 1):
            wordList.append(long_word[:-cut_length])
            total_words += 1
            if total_words >= num_words:
                break

    return wordList


if __name__ == '__main__':
    # Sample word list
    wordList = gen_word_list(1.5 * 10**5)

    # Configs
    try:
        n_cores = int(sys.argv[-1] or 7)
    except ValueError:
        n_cores = 7

    # Repetitions to do in order to estimate the runtime of a single run
    real_num_repeats = 420
    simulated_num_repeats = real_num_repeats // n_cores

    # Creating the trie
    root = TrieNode()
    # Adding words
    for word in wordList:
        root.insertString(word, root)

    # Create tries for subprocesses:
    multiroot = [deepcopy(root) for _ in range(n_cores)]
    # NOTE: actually all subprocesses could use the same `root`, but let's copy them to simulate
    # that we are using different tries when splitting the job into sub-jobs

    # localFindSubStrings: defined after `multiroot`, so `multiroot` can be used as a "shared" variable
    def localFindSubStrings(start_prefix, root_index=None, results=None):
        if root_index is None:
            root_ref = weakref.ref(root)
        else:
            root_ref = weakref.ref(multiroot[root_index])
        return findSubStrings(start_prefix, root_ref, results)

    # Measure time
    print('Single process ...')
    single_proc_num_results = None
    t_0 = time.time()
    for i in range(real_num_repeats):
        iteration_results = localFindSubStrings('help', )
        if single_proc_num_results is None:
            single_proc_num_results = len(iteration_results)
    single_proc_time = (time.time() - t_0)
    print(single_proc_time / real_num_repeats)

    # Loop to test the multicore solution
    # (fewer repetitions are done to compare the timings to the single-core solution)
    print('\nMultiprocess ...')
    p_init_time = 0
    apply_async_time = 0
    results_join_time = 0

    # Should processes be joined between repeats (simulate a single job on multiple cores)
    # or not (simulate multiple jobs running simultaneously)?
    PARALLEL_REPEATS = True

    if PARALLEL_REPEATS:
        t_0 = time.time()
        pool = Pool(processes=n_cores)
        t_1 = time.time()
        p_init_time += t_1 - t_0
        async_results = []

    final_results = []
    t_00 = time.time()

    for repeat_num in range(simulated_num_repeats):

        final_result = []
        final_results.append(final_result)

        if not PARALLEL_REPEATS:
            t_0 = time.time()
            pool = Pool(processes=n_cores)
            t_1 = time.time()
            p_init_time += t_1 - t_0
            async_results = []
        else:
            t_1 = time.time()

        async_results.append(
            (
                final_result,
                pool.starmap_async(
                    localFindSubStrings,
                    [('help', core_num) for core_num in range(n_cores)],
                )
            )
        )
        t_2 = time.time()
        apply_async_time += t_2 - t_1

        if not PARALLEL_REPEATS:
            for _, a_res in async_results:
                for result_part in a_res.get():
                    t_3 = time.time()
                    final_result.extend(result_part)
                    results_join_time += time.time() - t_3
            pool.close()
            pool.join()

    if PARALLEL_REPEATS:
        for final_result, a_res in async_results:
            for result_part in a_res.get():
                t_3 = time.time()
                final_result.extend(result_part)
                results_join_time += time.time() - t_3
        pool.close()
        pool.join()

    multi_proc_time = time.time() - t_00
    # The work is not really parallelized, it's just 'duplicated' over the cores,
    # so we divide by `real_num_repeats` (not `simulated_num_repeats`)
    print(multi_proc_time / real_num_repeats)

    init_overhead = p_init_time / single_proc_time
    apply_async_overhead = apply_async_time / single_proc_time
    results_join_percent = results_join_time / single_proc_time
    total_overhead = (multi_proc_time - single_proc_time) / single_proc_time
    print(f"Pool(...) overhead: {init_overhead:.1%}")
    print(f"pool.starmap_async(...) overhead: {apply_async_overhead:.1%}")
    print(f"Results join time percent: {results_join_percent:.1%}")
    print(f"Total overhead: {total_overhead:.1%}")

    for iteration_results in final_results:
        num_results = len(iteration_results) / n_cores
        if num_results != single_proc_num_results:
            raise AssertionError(f'length of results should not change! {num_results} != {single_proc_num_results}')
NOTES:
PARALLEL_REPEATS=True simulates running multiple jobs (for example, each job could be started for a different prefix, but in this example I use the same prefix to have a consistent "load" for each run), with each job "parallelized" over all cores.
PARALLEL_REPEATS=False simulates running a single job parallelized over all cores, and it's slower than the single-process solution.
It seems that parallelism is only better when each worker in the pool is issued apply_async more than once.
Example output:
Single process ...
0.007109369550432477
Multiprocess ...
0.002928720201764788
Pool(...) overhead: 1.3%
pool.apply_async(...) overhead: 1.5%
Results join time percent: 1.8%
Total overhead: -58.8%
At first I want to thank everyone who participated, as every answer contributed to the solution.
As the first comments pointed out, creating a new process every time leads to Python shipping the needed data into the process. This can take a couple of seconds and leads to an undesired delay.
What brought the ultimate solution for me was creating the processes (one per core) only once, using the Process class of the multiprocessing library, during the startup of the program.
You can then communicate with the processes using the Pipe class of the same module.
I found the ping-pong example here really helpful: https://www.youtube.com/watch?v=s1SkCYMnfbY&t=900s
It is still not optimal, since multiple pipes trying to talk to the process at the same time cause the process to crash.
However, I should be able to solve this issue using queues. If someone is interested in the solution, feel free to ask.
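For anyone interested, a minimal sketch of that persistent-worker idea, reusing the TrieNode/findSubStrings code from the question (the worker loop and the None sentinel are my own assumptions, not the exact final solution):

from multiprocessing import Process, Pipe

def trie_worker(conn, trie_root):
    # the trie is transferred to the worker once, when the process starts;
    # afterwards only small prefix strings and result lists cross the pipe
    while True:
        prefix = conn.recv()
        if prefix is None:                 # sentinel: shut the worker down
            break
        conn.send(trie_root.findSubStrings(prefix, trie_root, []))

if __name__ == '__main__':
    parent_conn, child_conn = Pipe()
    worker = Process(target=trie_worker, args=(child_conn, root))  # `root` built at startup
    worker.start()

    parent_conn.send('he')                 # cheap: only the prefix is pickled
    print(len(parent_conn.recv()))         # number of matches for this prefix

    parent_conn.send(None)                 # ask the worker to exit
    worker.join()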
I am using the threading library to accelerate the calculation of each point's neighborhood in a point cloud, by calling the function CalculateAllPointsNeighbors at the bottom of the post.
The function receives a search radius, a maximum number of neighbors and a number of threads to split the work over. No changes are made to any of the points, and each point stores data in its own np.ndarray cell accessed by its own index.
The following function times how long it takes N threads to finish calculating all point neighborhoods:
def TimeFuncThreads(classObj, uptothreads):
    listTimers = []
    startNum = 1
    EndNum = uptothreads + 1
    for i in range(startNum, EndNum):
        print("Current Number of Threads to Test: ", i)
        tempT = time.time()
        classObj.CalculateAllPointsNeighbors(searchRadius=0.05, maxNN=25, maxThreads=i)
        tempT = time.time() - tempT
        listTimers.append(tempT)

    PlotXY(np.arange(startNum, EndNum), listTimers)
The problem is, I've been getting very different results in each run. Here are the plots from 5 subsequent runs of the function TimeFuncThreads; the X axis is the number of threads, Y is the runtime. First, they look totally random, and second, there is no significant acceleration.
I'm confused now: am I using the threading library wrong, and what is this behavior that I'm getting?
The function that handles the threading and the function that is called from each thread:
def CalculateAllPointsNeighbors(self, searchRadius=0.20, maxNN=50, maxThreads=8):
    threadsList = []
    pointsIndices = np.arange(self.numberOfPoints)
    splitIndices = np.array_split(pointsIndices, maxThreads)
    for i in range(maxThreads):
        threadsList.append(threading.Thread(target=self.GetPointsNeighborsByID,
                                            args=(splitIndices[i], searchRadius, maxNN)))
    [t.start() for t in threadsList]
    [t.join() for t in threadsList]

def GetPointsNeighborsByID(self, idx, searchRadius=0.05, maxNN=20):
    if isinstance(idx, int):
        idx = [idx]
    for currentPointIndex in idx:
        currentPoint = self.pointsOpen3D.points[currentPointIndex]
        pointNeighborhoodObject = self.GetPointNeighborsByCoordinates(currentPoint, searchRadius, maxNN)
        self.pointsNeighborsArray[currentPointIndex] = pointNeighborhoodObject
        self.__RotatePointNeighborhood(currentPointIndex)
It pains me to be the one to introduce you to the Python GIL (Global Interpreter Lock). It is a very "nice" feature that makes parallelism using threads in Python a nightmare.
If you really want to improve your code's speed, you should be looking at the multiprocessing module.
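A rough sketch of what that could look like here (compute_point_neighbors is a hypothetical module-level helper, not the class API from the question; worker processes cannot write into self.pointsNeighborsArray directly, so they have to return their chunk of results):

import numpy as np
from multiprocessing import Pool

def neighbors_for_indices(args):
    # compute and *return* the neighborhood objects for one chunk of indices
    indices, search_radius, max_nn = args
    return [compute_point_neighbors(i, search_radius, max_nn) for i in indices]

def calculate_all_points_neighbors(num_points, search_radius=0.05, max_nn=25, workers=4):
    chunks = np.array_split(np.arange(num_points), workers)
    with Pool(processes=workers) as pool:
        parts = pool.map(neighbors_for_indices, [(c, search_radius, max_nn) for c in chunks])
    # stitch the per-chunk results back together in the parent process
    return [nb for part in parts for nb in part]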
I am completely new to Python or any such programming language. I have some experience with Mathematica. I have a mathematical problem which Mathematica does solve with her own 'Parallelize' methods, but it leaves the system quite exhausted after using all the cores! I can barely use the machine during the run. Hence, I was looking for a coding alternative and found Python kind of easy to learn and implement. So without further ado, let me tell you the mathematical problem and the issues with my Python code. As the full code is too long, let me give an outline.
1. Numerically solve a differential equation of the form y''(t) + f(t) y(t) = 0, to get y(t) for some range, say C <= t <= D.
2. Next, interpolate the numerical result over some desired range to get the function w(t), say for A <= t <= B.
3. Using w(t), solve another differential equation of the form z''(t) + [a + b w(t)] z(t) = 0 for some range of a and b, for which I am using the loop below.
4. Define F = 1 + sol1[157], to make a list like {a, b, F}. So let me give a prototype of the loop, as this takes most of the computation time.
for q in np.linspace(0.0, 4.0, 100):
    for a in np.linspace(-2.0, 7.0, 100):
        print('Solving for q = {}, a = {}'.format(q, a))
        sol1 = odeint(fun, [1, 0], t, args=(a, q))[..., 0]
        print(t[157])
        F = 1 + sol1[157]
        f1.write("{} {} {} \n".format(q, a, F))
f1.close()
Now, the real loop takes about 4 hours and 30 minutes to complete (with some built-in functional form of w(t), it takes about 2 minutes). When I applied numba/autojit (without properly understanding what it does and how!) before the definition of fun in my code, the run time improved significantly, to about 2 hours and 30 minutes. Also, writing the two loops with itertools.product reduced the run time by only about 2 more minutes. However, Mathematica, when I let her use all 4 cores, finishes the task within 30 minutes.
So, is there a way to improve the runtime in Python?
To speed up Python, you have three options:
1. Deal with specific bottlenecks in the program (as suggested in @LutzL's comment).
2. Try to speed up the code by compiling it into C using Cython (or by including C code using weave or similar techniques). Since the time-consuming computations in your case are not in Python code proper but in scipy modules (at least I believe they are), this would not help you very much here.
3. Implement multiprocessing, as you suggested in your original question. This will speed up your code to up to X (slightly less than) times faster if you have X cores. Unfortunately this is rather complicated in Python.
Implementing multiprocessing - example using the prototype loop from the original question
I assume that the computations you do inside the nested loops in your prototype code are actually independent of one another. Since your prototype code is incomplete, I am not sure whether this is the case; if they are not independent, this will, of course, not work. I will give an example using, not your differential equation problem, but a prototype fun function of the same signature (input and output variables).
import numpy as np
import scipy.integrate
import multiprocessing as mp


def fun(y, t, b, c):
    # replace this function with whatever function you want to work with
    # (this one is the example function from the scipy docs for odeint)
    theta, omega = y
    dydt = [omega, -b*omega - c*np.sin(theta)]
    return dydt


# definitions of work thread and write thread functions
def run_thread(input_queue, output_queue):
    # run threads will pull tasks from the input_queue, push results into output_queue
    while True:
        try:
            queueitem = input_queue.get(block=False)
            if len(queueitem) == 3:
                a, q, t = queueitem
                sol1 = scipy.integrate.odeint(fun, [1, 0], t, args=(a, q))[..., 0]
                F = 1 + sol1[157]
                output_queue.put((q, a, F))
        except Exception as e:
            print(str(e))
            print("Queue exhausted, terminating")
            break


def write_thread(queue):
    # write thread will pull results from output_queue, write them to outputfile.txt
    f1 = open("outputfile.txt", "w")
    while True:
        try:
            queueitem = queue.get(block=False)
            if queueitem[0] == "TERMINATE":
                f1.close()
                break
            else:
                q, a, F = queueitem
                print("{} {} {} \n".format(q, a, F))
                f1.write("{} {} {} \n".format(q, a, F))
        except:
            # necessary since it will throw an error whenever output_queue is empty
            pass


# define time point sequence
t = np.linspace(0, 10, 201)

# prepare input and output Queues
mpM = mp.Manager()
input_queue = mpM.Queue()
output_queue = mpM.Queue()

# prepare tasks, collect them in input_queue
for q in np.linspace(0.0, 4.0, 100):
    for a in np.linspace(-2.0, 7.0, 100):
        # Your computations as commented here will now happen in run_threads as defined above and created below
        # print('Solving for q = {}, a = {}'.format(q,a))
        # sol1 = scipy.integrate.odeint(fun, [1, 0], t, args=( a, q))[..., 0]
        # print(t[157])
        # F = 1 + sol1[157]
        input_tupel = (a, q, t)
        input_queue.put(input_tupel)

# create threads
thread_number = mp.cpu_count()
procs_list = [mp.Process(target=run_thread, args=(input_queue, output_queue)) for i in range(thread_number)]
write_proc = mp.Process(target=write_thread, args=(output_queue,))

# start threads
for proc in procs_list:
    proc.start()
write_proc.start()

# wait for run_threads to finish
for proc in procs_list:
    proc.join()

# terminate write_thread
output_queue.put(("TERMINATE",))
write_proc.join()
Explanation
We define the individual problems (or rather their parameters) before commencing computation; we collect them in an input Queue.
We define a function (run_thread) that is run in the worker processes. This function computes individual problems until there are none left in the input Queue; it pushes the results into an output Queue.
We start as many such worker processes as we have CPUs.
We start an additional process (write_thread) for collecting the results from the output Queue and writing them into a file.
Caveats
For smaller problems, you can run multiprocessing without Queues. However, if the number of individual computations is large, you will exceed the maximum number of processes the kernel allows you, after which the kernel kills your program.
There are differences between operating systems in how multiprocessing works. The example above will work on Linux (perhaps also on other Unix-like systems such as macOS and BSD), but not on Windows. The reason is that Windows does not have a fork() system call. (I do not have access to a Windows machine and can therefore not try to implement it for Windows.)
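As a rough sketch of a more portable variant (swapping the hand-rolled queues for a multiprocessing.Pool; not tested on Windows): with the "spawn" start method that Windows uses, everything that creates processes has to sit behind the if __name__ == '__main__': guard so the module can be re-imported safely in each child, and the Pool also keeps the number of worker processes bounded.

import numpy as np
import scipy.integrate
import multiprocessing as mp

def fun(y, t, b, c):
    # same example right-hand side as above
    theta, omega = y
    return [omega, -b*omega - c*np.sin(theta)]

def solve_one(params):
    # one (a, q) task: returns the tuple that write_thread would have written
    a, q, t = params
    sol1 = scipy.integrate.odeint(fun, [1, 0], t, args=(a, q))[..., 0]
    return (q, a, 1 + sol1[157])

if __name__ == '__main__':
    t = np.linspace(0, 10, 201)
    tasks = [(a, q, t) for q in np.linspace(0.0, 4.0, 100)
             for a in np.linspace(-2.0, 7.0, 100)]
    with mp.Pool(processes=mp.cpu_count()) as pool, open("outputfile.txt", "w") as f1:
        for q, a, F in pool.imap_unordered(solve_one, tasks):
            f1.write("{} {} {}\n".format(q, a, F))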
I have a nested loop of the form
while x < lat2[0]:
    while y > lat3[1]:
        if is_inside_nepal([x, y]):
            print("inside")
        else:
            print("not")
        y = y - (1/150.0)
    y = lat2[1]
    x = x + (1/150.0)
# here lat2[0] represents a large number
Now this normally takes around 50 seconds to execute.
I have changed this loop to the following multiprocessing code:
def v1find_coordinates(q):
    while not q.empty():
        x1 = q.get()
        x2 = x1 + incfactor
        while x1 < x2:
            def func(x1):
                while y > lat3[1]:
                    if is_inside([x1, y]):
                        print x1, y, "inside"
                    else:
                        print x1, y, "not inside"
                    y = y - (1/150.0)
            func(x1)
            y = lat2[1]
            x1 = x1 + (1/150.0)

incfactor = 0.7
xvalues = drange(x, lat2[0], incfactor)
# this drange function is to get a list with a decimal increment factor

cores = mp.cpu_count()
q = Queue()
for i in xvalues:
    q.put(i)

for i in range(0, cores):
    p = Process(target=v1find_coordinates, args=(q,))
    p.start()
    p.Daemon = True
    processes.append(p)

for i in processes:
    print("now joining")
    i.join()
This multiprocessing code also takes around 50 s to execute, which means there is no time difference between the two.
I have also tried using pools, and I have managed the chunk size. I have googled and searched through other Stack Overflow questions, but can't find any satisfying answer.
The only answer I could find was that the time is spent in process management, which makes both results the same. If this is the reason, then how can I get multiprocessing to produce faster results?
Will implementing this in C instead of Python give faster results?
I am not expecting drastic results, but by common sense one can tell that running on 4 cores should be a lot faster than running on 1 core. Yet I am getting similar results. Any kind of help would be appreciated.
You seem to be using a thread Queue (from Queue import Queue). This does not work as expected, because Process uses fork() and clones the entire Queue into each worker Process. As a result every worker drains its own private copy of the queue and does all of the work, so you get no speedup.
Use:
from multiprocessing import Queue
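A minimal sketch of the fixed setup (the chunk values and the worker body are placeholders for the question's latitude/longitude scan): with a multiprocessing.Queue the workers really share one task queue, so each chunk is handled by exactly one process.

from multiprocessing import Process, Queue   # note: multiprocessing.Queue, not Queue.Queue
from queue import Empty

def worker(q):
    while True:
        try:
            x1 = q.get(timeout=1)     # small timeout so workers exit once the queue is drained
        except Empty:
            break
        print('processing chunk starting at', x1)   # do the lat/long scan for this chunk here

if __name__ == '__main__':
    q = Queue()
    for i in range(8):
        q.put(i * 0.7)                # 0.7 is the question's incfactor
    procs = [Process(target=worker, args=(q,)) for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()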
I need to use an atomic CompareAndSet operation in my Python program, but I couldn't find any reference on how to use one.
Does Python provide such an atomic function?
Thank you.
From the atomics library:
import atomics
a = atomics.atomic(width=4, atype=atomics.INT)
# set to 5 if a.load() compares == to 0
res = a.cmpxchg_strong(expected=0, desired=5)
print(res.success)
Note: I am the author of this library
Python atomic for shared data types.
https://sharedatomic.top
The module can be used for atomic operations under multiple-process and multiple-thread conditions. High-performance Python! High concurrency, high performance!
Atomic API example with multiprocessing and multiple threads:
You need the following steps to utilize the module:
Create the function used by the child processes, referring to UIntAPIs, IntAPIs, BytearrayAPIs, StringAPIs, SetAPIs, ListAPIs. In each process, you can create multiple threads.
def process_run(a):
    def subthread_run(a):
        a.array_sub_and_fetch(b'\x0F')

    threadlist = []
    for t in range(5000):
        threadlist.append(Thread(target=subthread_run, args=(a,)))

    for t in range(5000):
        threadlist[t].start()

    for t in range(5000):
        threadlist[t].join()
Create the shared bytearray:
a = atomic_bytearray(b'ab', length=7, paddingdirection='r', paddingbytes=b'012', mode='m')
Start processes / threads to utilize the shared bytearray:
processlist = []
for p in range(2):
    processlist.append(Process(target=process_run, args=(a,)))

for p in range(2):
    processlist[p].start()

for p in range(2):
    processlist[p].join()

assert a.value == int.to_bytes(27411031864108609, length=8, byteorder='big')