I am writing a simple caesar cipher program in python using threads and queues. Even though my program is able to run, it doesn't create the necessary output file. Would appreciate any help, thanks!
I am guessing the anomaly starts where I use the queues to store ciphered strings, here:
for i in range(0,len(data),l):
while not q1.full:
q1.put(data[index:index+l])
index+=l
while not q2.empty:
output_file.write(q2.get())
Here is the whole code:
import threading
import sys
import Queue
import string
#argumanlarin alinmasi
if len(sys.argv)!=4:
print("Duzgun giriniz: '<filename>.py s n l'")
sys.exit(0)
else:
s=int(sys.argv[1])
n=int(sys.argv[2])
l=int(sys.argv[3])
#Global
index = 0
#kuyruk deklarasyonu
q1 = Queue.Queue(n)
q2 = Queue.Queue(2000)
lock = threading.Lock()
#Threadler
threads=[]
#dosyayi okuyarak stringe cevirme
myfile=open('metin.txt','r')
data=myfile.read()
#Thread tanimlamasi
class WorkingThread(threading.Thread):
def __init__(self):
threading.Thread.__init__(self)
def run(self):
lock.acquire()
q2.put(self.caesar(q1.get(), s))
lock.release()
def caesar(self, plaintext, shift):
alphabet = string.ascii_lowercase
shifted_alphabet = alphabet[shift:] + alphabet[:shift]
table = string.maketrans(alphabet, shifted_alphabet)
return plaintext.translate(table)
for i in range(0,n):
current_thread = WorkingThread()
current_thread.start()
threads.append(current_thread)
output_file=open("crypted"+ "_"+ str(s)+"_"+str(n)+"_"+str(l)+".txt", "w")
for i in range(0,len(data),l):
while not q1.full:
q1.put(data[index:index+l])
index+=l
while not q2.empty:
output_file.write(q2.get())
for i in range(0,n):
threads[i].join()
output_file.close()
myfile.close()
while not q1.full can never be True, as full is a method, and as such will always be True in a boolean context, therefore not q1.full will always be False, you need to call the method: q1.full(). Same for q2.full.
Also, you should'n try to detect wheather the queue is full in this case. If it ever would be not full, then you would continut to add data until it was and then ignore the rest, or your index can increase beyond the size of data and you would continue to add 0-length data chunks.
You should use a separate thread for writing to q1 and for reading from q2, then you can just let q1 block on put().
Also, you're using the same lock in your worker threads to basically serialize all computations, which defeats the purpose of threading. The problem you're dealing with is CPU bound, for which multithreading isn't going to give you any speedup in python. Have a look at the multiprocessing module. Using multiprocessing.Pool.map() (or some other of the map-methods) the whole program could be simplified dramatically and give you a speed up through mutliprocessig at the same time.
Related
I'm working on an optimization problem, and you can see a simplified version of my code posted below (the origin code is too complicated for asking such a question, and I hope my simplified code has simulated the original one as much as possible).
My purpose:
use the function foo in the function optimization, but foo can take very long time due to some hard situations. So I use multiprocessing to set a time limit for execution of the function (proc.join(iter_time), the method is from an anwser from this question; How to limit execution time of a function call?).
My problem:
In the while loop, every time the generated values for extra are the same.
The list lst's length is always 1, which means in every iteration in the while loop it starts from an empty list.
My guess: possible reason can be each time I create a process the random seed is counting from the beginning, and each time the process is terminated, there could be some garbage collection mechanism to clean the memory the processused, so the list is cleared.
My question
Anyone know the real reason of such problems?
if not using multiprocessing, is there anyway else that I can realize my purpose while generate different random numbers? btw I have tried func_timeout but it has other problems that I cannot handle...
random.seed(123)
lst = [] # a global list for logging data
def foo(epoch):
...
extra = random.random()
lst.append(epoch + extra)
...
def optimization(loop_time, iter_time):
start = time.time()
epoch = 0
while time.time() <= start + loop_time:
proc = multiprocessing.Process(target=foo, args=(epoch,))
proc.start()
proc.join(iter_time)
if proc.is_alive(): # if the process is not terminated within time limit
print("Time out!")
proc.terminate()
if __name__ == '__main__':
optimization(300, 2)
You need to use shared memory if you want to share variables across processes. This is because child processes do not share their memory space with the parent. Simplest way to do this here would be to use managed lists and delete the line where you set a number seed. This is what is causing same number to be generated because all child processes will take the same seed to generate the random numbers. To get different random numbers either don't set a seed, or pass a different seed to each process:
import time, random
from multiprocessing import Manager, Process
def foo(epoch, lst):
extra = random.random()
lst.append(epoch + extra)
def optimization(loop_time, iter_time, lst):
start = time.time()
epoch = 0
while time.time() <= start + loop_time:
proc = Process(target=foo, args=(epoch, lst))
proc.start()
proc.join(iter_time)
if proc.is_alive(): # if the process is not terminated within time limit
print("Time out!")
proc.terminate()
print(lst)
if __name__ == '__main__':
manager = Manager()
lst = manager.list()
optimization(10, 2, lst)
Output
[0.2035898948744943, 0.07617925389396074, 0.6416754412198231, 0.6712193790613651, 0.419777147554235, 0.732982735576982, 0.7137712131028766, 0.22875414425414997, 0.3181113880578589, 0.5613367673646847, 0.8699685474084119, 0.9005359611195111, 0.23695341111251134, 0.05994288664062197, 0.2306562314450149, 0.15575356275408125, 0.07435292814989103, 0.8542361251850187, 0.13139055891993145, 0.5015152768477814, 0.19864873743952582, 0.2313646288041601, 0.28992667535697736, 0.6265055915510219, 0.7265797043535446, 0.9202923318284002, 0.6321511834038631, 0.6728367262605407, 0.6586979597202935, 0.1309226720786667, 0.563889613032526, 0.389358766191921, 0.37260564565714316, 0.24684684162272597, 0.5982042933298861, 0.896663326233504, 0.7884030244369596, 0.6202229004466849, 0.4417549843477827, 0.37304274232635715, 0.5442716244427301, 0.9915536257041505, 0.46278512685707873, 0.4868394190894778, 0.2133187095154937]
Keep in mind that using managers will affect performance of your code. Alternate to this, you could also use multiprocessing.Array, which is faster than managers but is less flexible in what data it can store, or Queues as well.
I am trying to do a word counter with mapreduce using concurrent.futures, previously I've done a multi threading version, but was so slow because is CPU bound.
I have done the mapping part to divide the words into ['word1',1], ['word2,1], ['word1,1], ['word3',1] and between the processes, so each process will take care of a part of the text file. The next step ("shuffling") is to put these words in a dictionary so that it looks like this: word1: [1,1], word2:[1], word3: [1], but I cannot share the dictionary between the processes because we are using multiprocessing instead of multithreading, so how can I make each process add the "1" to the dictionary shared between all the processes? I'm stuck with this, and I can't continue.
I am at this point:
import sys
import re
import concurrent.futures
import time
# Read text file
def input(index):
try:
reader = open(sys.argv[index], "r", encoding="utf8")
except OSError:
print("Error")
sys.exit()
texto = reader.read()
reader.close()
return texto
# Convert text to list of words
def splitting(input_text):
input_text = input_text.lower()
input_text = re.sub('[,.;:!¡?¿()]+', '', input_text)
words = input_text.split()
n_processes = 4
# Creating processes
with concurrent.futures.ProcessPoolExecutor() as executor:
results = []
for id_process in range(n_processes):
results.append(executor.submit(mapping, words, n_processes, id_process))
for f in concurrent.futures.as_completed(results):
print(f.result())
def mapping(words, n_processes, id_process):
word_map_result = []
for i in range(int((id_process / n_processes) * len(words)),
int(((id_process + 1) / n_processes) * len(words))):
word_map_result.append([words[i], 1])
return word_map_result
if __name__ == '__main__':
if len(sys.argv) == 1:
print("Please, specify a text file...")
sys.exit()
start_time = time.time()
for index in range(1, len(sys.argv)):
print(sys.argv[index], ":", sep="")
text = input(index)
splitting(text)
# for word in result_dictionary_words:
# print(word, ':', result_dictionary_words[word])
print("--- %s seconds ---" % (time.time() - start_time))
I've seen that when doing concurrent programming it is usually best to avoid using shared state as far as possible, so how I can implement Map reduce word count without share the dictionary between processes?
You can create a shared dictionary using a Manager from multiprocessing. I understand from your program that it is your word_map_result you need to share.
You could try something like this
from multiprocessing import Manager
...
def splitting():
...
word_map_result = Manager().dict()
with concurrent.futures.....:
...
results.append(executor.submit(mapping, words, n_processes, id_process, word_map_result)
...
...
def mapping(words, n_processes, id_process, word_map_result):
for ...
# Do not return anything - word_map_result is up to date in your main process
Basically you will remove the local copy of word_map_result from your mapping function and pass it the Manager instance as a parameter. This word_map_result is now shared between all your subprocesses and the main program. Managers add data transfer overhead, though, so this might not help you very much.
In this case you do not return anything from the workers so you do not need the for loop to process results either in your main program - your word_map_result is identical in all subprocesses and the main program.
I may have misunderstood your problem and I am not familiar with the algorithm if it is possible to re-engineer that to work so that you don't need to share anything between processes.
It seems like a misconception to be using multiprocessing at all. First, there is overhead in creating the pool and overhead in passing data to and from the processes. And if you decide to use a shared, managed dictionary that worker function mapping can use to store its results in, know that a managed dictionary uses a proxy, the accessing of which is rather slow. The alternative to using a managed dictionary would be as you currently have it, i.e. mapping returns a list and the main process uses those results to create the keys and values of the dictionary. But what then is the point of mapping returning a list where each element is always a list of two elements where the second element is always the constant value 1? Isn't that rather wasteful of time and space?
I think your performance will be no faster (probably slower) than just implementing splitting as:
# Convert text to list of words
def splitting(input_text):
input_text = input_text.lower()
input_text = re.sub('[,.;:!¡?¿()]+', '', input_text)
words = input_text.split()
results = {}
for word in words:
results[word] = [1]
return results
I have been looking around for some time, but haven't had luck finding an example that could solve my problem. I have added an example from my code. As one can notice this is slow and the 2 functions could be done separately.
My aim is to print every second the latest parameter values. At the same time the slow processes can be calculated in the background. The latest value is shown and when any process is ready the value is updated.
Can anybody recommend a better way to do it? An example would be really helpful.
Thanks a lot.
import time
def ProcessA(parA):
# imitate slow process
time.sleep(5)
parA += 2
return parA
def ProcessB(parB):
# imitate slow process
time.sleep(10)
parB += 5
return parB
# start from here
i, parA, parB = 1, 0, 0
while True: # endless loop
print(i)
print(parA)
print(parB)
time.sleep(1)
i += 1
# update parameter A
parA = ProcessA(parA)
# update parameter B
parB = ProcessB(parB)
I imagine this should do it for you. This has the benefit of you being able to add extra parallel funcitons up to a total equal to the number of cores you have. Edits are welcome.
#import time module
import time
#import the appropriate multiprocessing functions
from multiprocessing import Pool
#define your functions
#whatever your slow function is
def slowFunction(x):
return someFunction(x)
#printingFunction
def printingFunction(new,current,timeDelay):
while new == current:
print current
time.sleep(timeDelay)
#set the initial value that will be printed.
#Depending on your function this may take some time.
CurrentValue = slowFunction(someTemporallyDynamicVairable)
#establish your pool
pool = Pool()
while True: #endless loop
#an asynchronous function, this will continue
# to run in the background while your printing operates.
NewValue = pool.apply_async(slowFunction(someTemporallyDynamicVairable))
pool.apply(printingFunction(NewValue,CurrentValue,1))
CurrentValue = NewValue
#close your pool
pool.close()
This is a followup question to this. User Will suggested using a queue, I tried to implement that solution below. The solution works just fine with j=1000, however, it hangs as I try to scale to larger numbers. I am stuck here and cannot determine why it hangs. Any suggestions would be appreciated. Also, the code is starting to get ugly as I keep messing with it, I apologize for all the nested functions.
def run4(j):
"""
a multicore approach using queues
"""
from multiprocessing import Process, Queue, cpu_count
import os
def bazinga(uncrunched_queue, crunched_queue):
"""
Pulls the next item off queue, generates its collatz
length and
"""
num = uncrunched_queue.get()
while num != 'STOP': #Signal that there are no more numbers
length = len(generateChain(num, []) )
crunched_queue.put([num , length])
num = uncrunched_queue.get()
def consumer(crunched_queue):
"""
A process to pull data off the queue and evaluate it
"""
maxChain = 0
biggest = 0
while not crunched_queue.empty():
a, b = crunched_queue.get()
if b > maxChain:
biggest = a
maxChain = b
print('%d has a chain of length %d' % (biggest, maxChain))
uncrunched_queue = Queue()
crunched_queue = Queue()
numProcs = cpu_count()
for i in range(1, j): #Load up the queue with our numbers
uncrunched_queue.put(i)
for i in range(numProcs): #put sufficient stops at the end of the queue
uncrunched_queue.put('STOP')
ps = []
for i in range(numProcs):
p = Process(target=bazinga, args=(uncrunched_queue, crunched_queue))
p.start()
ps.append(p)
p = Process(target=consumer, args=(crunched_queue, ))
p.start()
ps.append(p)
for p in ps: p.join()
You're putting 'STOP' poison pills into your uncrunched_queue (as you should), and having your producers shut down accordingly; on the other hand your consumer only checks for emptiness of the crunched queue:
while not crunched_queue.empty():
(this working at all depends on a race condition, btw, which is not good)
When you start throwing non-trivial work units at your bazinga producers, they take longer. If all of them take long enough, your crunched_queue dries up, and your consumer dies. I think you may be misidentifying what's happening - your program doesn't "hang", it just stops outputting stuff because your consumer is dead.
You need to implement a smarter methodology for shutting down your consumer. Either look for n poison pills, where n is the number of producers (who accordingly each toss one in the crunched_queue when they shut down), or use something like a Semaphore that counts up for each live producer and down when one shuts down.
I'm trying to run three functions (each can take up to 1 second to execute) every second. I'd then like to store the output from each function, and write them to separate files.
At the moment I'm using Timers for my delay handling. (I could subclass Thread, but that's getting a bit complicated for this simple script)
def main:
for i in range(3):
set_up_function(i)
t = Timer(1, run_function, [i])
t.start()
time.sleep(100) # Without this, main thread exits
def run_function(i):
t = Timer(1, run_function, [i])
t.start()
print function_with_delay(i)
What's the best way to handle the output from function_with_delay? Append the result to a global list for each function?
Then I could put something like this at the end of my main function:
...
while True:
time.sleep(30) # or in a try/except with a loop of 1 second sleeps so I can interrupt
for i in range(3):
save_to_disk(data[i])
Thoughts?
Edit: Added my own answer as a possibility
I believe the python Queue module is designed for precisely this sort of scenario. You could do something like this, for example:
def main():
q = Queue.Queue()
for i in range(3):
t = threading.Timer(1, run_function, [q, i])
t.start()
while True:
item = q.get()
save_to_disk(item)
q.task_done()
def run_function(q, i):
t = threading.Timer(1, run_function, [q, i])
t.start()
q.put(function_with_delay(i))
I would say store a list of lists (bool, str), where bool is whether the function has finished running and str is the output. Each function locks the list with a mutex to append output (or if you don't care about thread safety omit this). Then, have a simple polling loop checking if all the bool values are True, and if so then do your save_to_disk calls.
Another alternative would be to implement a class (taken from this answer) that uses threading.Lock(). This has the advantage of being able to wait on the ItemStore, and save_to_disk can use getAll, rather than polling the queue. (More efficient for large data sets?)
This is particularly suited to writing at a set time interval (ie every 30 seconds), rather than once per second.
class ItemStore(object):
def __init__(self):
self.lock = threading.Lock()
self.items = []
def add(self, item):
with self.lock:
self.items.append(item)
def getAll(self):
with self.lock:
items, self.items = self.items, []
return items