I am trying to build a word counter with MapReduce using concurrent.futures. I previously wrote a multithreading version, but it was very slow because the task is CPU bound.
I have done the mapping part, which divides the words into ['word1', 1], ['word2', 1], ['word1', 1], ['word3', 1] and splits them between the processes, so each process takes care of a part of the text file. The next step ("shuffling") is to put these words into a dictionary so that it looks like this: word1: [1,1], word2: [1], word3: [1], but I cannot share the dictionary between the processes because we are using multiprocessing instead of multithreading. How can I make each process add its "1"s to a dictionary shared between all the processes? I'm stuck here and can't continue.
I am at this point:
import sys
import re
import concurrent.futures
import time

# Read text file
def input(index):
    try:
        reader = open(sys.argv[index], "r", encoding="utf8")
    except OSError:
        print("Error")
        sys.exit()
    texto = reader.read()
    reader.close()
    return texto

# Convert text to list of words
def splitting(input_text):
    input_text = input_text.lower()
    input_text = re.sub('[,.;:!¡?¿()]+', '', input_text)
    words = input_text.split()
    n_processes = 4
    # Creating processes
    with concurrent.futures.ProcessPoolExecutor() as executor:
        results = []
        for id_process in range(n_processes):
            results.append(executor.submit(mapping, words, n_processes, id_process))
        for f in concurrent.futures.as_completed(results):
            print(f.result())

def mapping(words, n_processes, id_process):
    word_map_result = []
    for i in range(int((id_process / n_processes) * len(words)),
                   int(((id_process + 1) / n_processes) * len(words))):
        word_map_result.append([words[i], 1])
    return word_map_result

if __name__ == '__main__':
    if len(sys.argv) == 1:
        print("Please, specify a text file...")
        sys.exit()
    start_time = time.time()
    for index in range(1, len(sys.argv)):
        print(sys.argv[index], ":", sep="")
        text = input(index)
        splitting(text)
        # for word in result_dictionary_words:
        #     print(word, ':', result_dictionary_words[word])
    print("--- %s seconds ---" % (time.time() - start_time))
I've read that when doing concurrent programming it is usually best to avoid shared state as far as possible, so how can I implement a MapReduce word count without sharing the dictionary between processes?
You can create a shared dictionary using a Manager from multiprocessing. I understand from your program that it is your word_map_result you need to share.
You could try something like this:
from multiprocessing import Manager
...
def splitting():
    ...
    word_map_result = Manager().dict()
    with concurrent.futures.....:
        ...
        results.append(executor.submit(mapping, words, n_processes, id_process, word_map_result))
        ...
    ...
def mapping(words, n_processes, id_process, word_map_result):
    for ...
        # Do not return anything - word_map_result is up to date in your main process
Basically you will remove the local copy of word_map_result from your mapping function and pass the managed dictionary to it as a parameter. This word_map_result is now shared between all your subprocesses and the main program. Managers add data-transfer overhead, though, so this might not help you very much.
In this case you do not return anything from the workers, so you do not need the for loop that processes results in your main program either; word_map_result is identical in all subprocesses and in the main program.
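Filled out into a runnable sketch, it might look roughly like this. The manager lock is my own addition to guard the read-modify-write on the dict proxy, and the text handling is simplified compared to your program:
from concurrent.futures import ProcessPoolExecutor
from multiprocessing import Manager

def mapping(words, n_processes, id_process, word_map_result, lock):
    # each process handles its own slice of the word list
    start = int((id_process / n_processes) * len(words))
    end = int(((id_process + 1) / n_processes) * len(words))
    for word in words[start:end]:
        with lock:  # guard the read-modify-write on the shared proxy
            counts = word_map_result.get(word, [])
            counts.append(1)
            word_map_result[word] = counts

def splitting(input_text, n_processes=4):
    words = input_text.lower().split()
    manager = Manager()
    word_map_result = manager.dict()
    lock = manager.Lock()
    with ProcessPoolExecutor() as executor:
        futures = [executor.submit(mapping, words, n_processes, i, word_map_result, lock)
                   for i in range(n_processes)]
        for f in futures:
            f.result()  # re-raise any exception from the workers
    return dict(word_map_result)

if __name__ == '__main__':
    print(splitting("the quick brown fox jumps over the lazy fox"))
Because every update goes through the proxy and the lock, this will be slow on large inputs, which is exactly the overhead mentioned above.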
I may have misunderstood your problem, and I am not familiar enough with the algorithm to know whether it can be re-engineered so that you don't need to share anything between processes.
It seems like a misconception to be using multiprocessing at all. First, there is overhead in creating the pool and overhead in passing data to and from the processes. And if you decide to use a shared, managed dictionary that worker function mapping can use to store its results in, know that a managed dictionary uses a proxy, the accessing of which is rather slow. The alternative to using a managed dictionary would be as you currently have it, i.e. mapping returns a list and the main process uses those results to create the keys and values of the dictionary. But what then is the point of mapping returning a list where each element is always a list of two elements where the second element is always the constant value 1? Isn't that rather wasteful of time and space?
I think your performance will be no faster (probably slower) than just implementing splitting as:
# Convert text to list of words
def splitting(input_text):
    input_text = input_text.lower()
    input_text = re.sub('[,.;:!¡?¿()]+', '', input_text)
    words = input_text.split()
    results = {}
    for word in words:
        results.setdefault(word, []).append(1)
    return results
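That said, if you do want to keep a process pool for larger inputs, the usual way to avoid sharing anything is for each worker to return its own partial count and for the main process to do the "reduce" step by merging them. A rough sketch of that idea, using collections.Counter instead of lists of 1s (the chunking and the sample text are just illustrative):
import concurrent.futures
from collections import Counter

def count_chunk(words):
    # each worker counts its own chunk locally; nothing is shared
    return Counter(words)

def parallel_word_count(words, n_chunks=4):
    chunk_size = max(1, (len(words) + n_chunks - 1) // n_chunks)
    chunks = [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]
    total = Counter()
    with concurrent.futures.ProcessPoolExecutor() as executor:
        # "reduce" step: merge the partial counts in the main process
        for partial in executor.map(count_chunk, chunks):
            total += partial
    return total

if __name__ == '__main__':
    text = "the quick brown fox jumps over the lazy dog the end"
    print(parallel_word_count(text.split()))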
I'm working on an optimization problem, and you can see a simplified version of my code below (the original code is too complicated to post for such a question, and I hope my simplified version simulates the original as closely as possible).
My purpose:
Use the function foo in the function optimization, but foo can take a very long time in some hard cases. So I use multiprocessing to set a time limit on the execution of the function (proc.join(iter_time); the method is from an answer to this question: How to limit execution time of a function call?).
My problem:
In the while loop, the generated values for extra are the same every time.
The length of the list lst is always 1, which means every iteration of the while loop starts from an empty list.
My guess: a possible reason is that each time I create a process, the random seed starts from the beginning again, and each time the process is terminated, some garbage-collection mechanism cleans up the memory the process used, so the list is cleared.
My question:
Does anyone know the real reason for these problems?
If I don't use multiprocessing, is there any other way to achieve my purpose while generating different random numbers? By the way, I have tried func_timeout, but it has other problems that I cannot handle...
random.seed(123)
lst = []  # a global list for logging data

def foo(epoch):
    ...
    extra = random.random()
    lst.append(epoch + extra)
    ...

def optimization(loop_time, iter_time):
    start = time.time()
    epoch = 0
    while time.time() <= start + loop_time:
        proc = multiprocessing.Process(target=foo, args=(epoch,))
        proc.start()
        proc.join(iter_time)
        if proc.is_alive():  # if the process is not terminated within time limit
            print("Time out!")
            proc.terminate()

if __name__ == '__main__':
    optimization(300, 2)
You need to use shared memory if you want to share variables across processes, because child processes do not share their memory space with the parent. The simplest way to do that here would be to use a managed list and to delete the line where you set the random seed. The seed is what causes the same number to be generated every time, because all child processes start from the same seed. To get different random numbers, either don't set a seed, or pass a different seed to each process:
import time, random
from multiprocessing import Manager, Process

def foo(epoch, lst):
    extra = random.random()
    lst.append(epoch + extra)

def optimization(loop_time, iter_time, lst):
    start = time.time()
    epoch = 0
    while time.time() <= start + loop_time:
        proc = Process(target=foo, args=(epoch, lst))
        proc.start()
        proc.join(iter_time)
        if proc.is_alive():  # if the process is not terminated within time limit
            print("Time out!")
            proc.terminate()
    print(lst)

if __name__ == '__main__':
    manager = Manager()
    lst = manager.list()
    optimization(10, 2, lst)
Output
[0.2035898948744943, 0.07617925389396074, 0.6416754412198231, 0.6712193790613651, 0.419777147554235, 0.732982735576982, 0.7137712131028766, 0.22875414425414997, 0.3181113880578589, 0.5613367673646847, 0.8699685474084119, 0.9005359611195111, 0.23695341111251134, 0.05994288664062197, 0.2306562314450149, 0.15575356275408125, 0.07435292814989103, 0.8542361251850187, 0.13139055891993145, 0.5015152768477814, 0.19864873743952582, 0.2313646288041601, 0.28992667535697736, 0.6265055915510219, 0.7265797043535446, 0.9202923318284002, 0.6321511834038631, 0.6728367262605407, 0.6586979597202935, 0.1309226720786667, 0.563889613032526, 0.389358766191921, 0.37260564565714316, 0.24684684162272597, 0.5982042933298861, 0.896663326233504, 0.7884030244369596, 0.6202229004466849, 0.4417549843477827, 0.37304274232635715, 0.5442716244427301, 0.9915536257041505, 0.46278512685707873, 0.4868394190894778, 0.2133187095154937]
Keep in mind that using managers will affect the performance of your code. As an alternative, you could use multiprocessing.Array, which is faster than a manager but less flexible in the data it can store, or a Queue.
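For example, here is a rough sketch of the same loop using a multiprocessing.Queue instead of a managed list, and passing an explicit per-process seed as suggested above. The seed offset and the timings are just illustrative:
import random
import time
from multiprocessing import Process, Queue
from queue import Empty

def foo(epoch, seed, result_queue):
    random.seed(seed)                # each child seeds its own RNG
    extra = random.random()
    result_queue.put(epoch + extra)  # send the result back to the parent

def optimization(loop_time, iter_time):
    results = []
    result_queue = Queue()
    start = time.time()
    epoch = 0
    while time.time() <= start + loop_time:
        proc = Process(target=foo, args=(epoch, 123 + epoch, result_queue))
        proc.start()
        proc.join(iter_time)
        if proc.is_alive():  # not finished within the time limit
            print("Time out!")
            # note: terminating a process while it is using the queue
            # can, in principle, leave the queue in a broken state
            proc.terminate()
        else:
            try:
                results.append(result_queue.get(timeout=1))
            except Empty:
                pass
        epoch += 1
    print(results)

if __name__ == '__main__':
    optimization(10, 2)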
I have a fairly complex Python function that I'm trying to run against around 100 or so different NYSE stock symbols. Right now it takes around 5 minutes to complete. This is not a terrible amount of time, but I'm trying to make it quicker with multithreading. My idea is that, since this is a single function and I'm just passing new parameters on each iteration, it might work to store a list of symbols that have "completed"; then on a new iteration it runs through the list, and if a symbol doesn't exist in the list, it runs the computation. Here's some code I put together:
iteration_count = 0
for index, row in stocklist_df.iterrows():
    # The below just filters input data
    if '-' in row[0]:
        continue
    elif '.' in row[0]:
        continue
    elif '^' in row[0]:
        continue
    elif len(row[0]) > 4:
        continue
    else:
        symbol = row[0]
        # idea is that on first iteration it runs on first thread and appends to threadlist
        # second iteration looks at threadlist and if symbol exists, then skips and goes to the next
        threadlist.append([symbol, iteration_count])
        t1 = threading.Thread(target=get_info(symbol, 0))
        t1.start()
        if iteration_count > 1:
            t2 = threading.Thread(target=get_info(symbol, 0))
            t2.start()
Right now this doesn't appear to be working, and I'm not sure whether this is the best solution or I'm just implementing it wrong. How can I achieve this?
I confess to having had some difficulty following your logic. I will just offer up that the usual method of handling threading when you have multiple, similar requests is to use thread pooling. There are several ways offered by Python, such as the ThreadPoolExecutor class in the concurrent.futures module (see the manual for documentation). The following is an example. Here, function get_info essentially just returns its argument:
import concurrent.futures

def get_info(symbol):
    return 'answer: ' + symbol

symbols = ['abc', 'def', 'ghi', 'jkl']
NUMBER_THREADS = min(30, len(symbols))
with concurrent.futures.ThreadPoolExecutor(max_workers=NUMBER_THREADS) as executor:
    results = executor.map(get_info, symbols)
    for result in results:
        print(result)
Prints:
answer: abc
answer: def
answer: ghi
answer: jkl
You can play around with the number of threads you create. If you are using, for example, the requests package to retrieve URLs from the same website, then you might wish to create a requests Session object and pass that as an additional argument to get_info:
import concurrent.futures
import requests
import functools

def get_info(session, symbol):
    """
    r = session.get('https://somewebsite.com?symbol=' + symbol)
    return r.text
    """
    return 'symbol answer: ' + symbol

symbols = ['abc', 'def', 'ghi', 'jkl']
NUMBER_THREADS = min(30, len(symbols))
with requests.Session() as session:
    get_info_with_session = functools.partial(get_info, session)  # this will be the first argument
    with concurrent.futures.ThreadPoolExecutor(max_workers=NUMBER_THREADS) as executor:
        results = executor.map(get_info_with_session, symbols)
        for result in results:
            print(result)
I have been looking around for some time, but haven't had luck finding an example that solves my problem. I have added an example from my code. As one can notice, this is slow and the two functions could be done separately.
My aim is to print the latest parameter values every second, while the slow processes are calculated in the background. The latest value is shown, and whenever a process finishes, the value is updated.
Can anybody recommend a better way to do it? An example would be really helpful.
Thanks a lot.
import time

def ProcessA(parA):
    # imitate slow process
    time.sleep(5)
    parA += 2
    return parA

def ProcessB(parB):
    # imitate slow process
    time.sleep(10)
    parB += 5
    return parB

# start from here
i, parA, parB = 1, 0, 0
while True:  # endless loop
    print(i)
    print(parA)
    print(parB)
    time.sleep(1)
    i += 1
    # update parameter A
    parA = ProcessA(parA)
    # update parameter B
    parB = ProcessB(parB)
I imagine this should do it for you. It has the benefit that you can add extra parallel functions, up to a total equal to the number of cores you have. Edits are welcome.
# import time module
import time
# import the appropriate multiprocessing functions
from multiprocessing import Pool

# define your functions
# whatever your slow function is
def slowFunction(x):
    return someFunction(x)

# printingFunction: keep printing the current value until the new one is ready
def printingFunction(async_result, current, timeDelay):
    while not async_result.ready():
        print(current)
        time.sleep(timeDelay)
    return async_result.get()

# set the initial value that will be printed.
# Depending on your function this may take some time.
CurrentValue = slowFunction(someTemporallyDynamicVariable)
# establish your pool
pool = Pool()
while True:  # endless loop
    # an asynchronous call; slowFunction keeps running in the background
    # while the printing loop operates in the main process.
    NewValue = pool.apply_async(slowFunction, (someTemporallyDynamicVariable,))
    CurrentValue = printingFunction(NewValue, CurrentValue, 1)
# close your pool
pool.close()
I am writing a simple Caesar cipher program in Python using threads and queues. Even though my program runs, it doesn't create the expected output file. I would appreciate any help, thanks!
I am guessing the anomaly starts where I use the queues to store ciphered strings, here:
for i in range(0, len(data), l):
    while not q1.full:
        q1.put(data[index:index+l])
        index += l
    while not q2.empty:
        output_file.write(q2.get())
Here is the whole code:
import threading
import sys
import Queue
import string

# get the arguments
if len(sys.argv) != 4:
    print("Duzgun giriniz: '<filename>.py s n l'")
    sys.exit(0)
else:
    s = int(sys.argv[1])
    n = int(sys.argv[2])
    l = int(sys.argv[3])

# Global
index = 0

# queue declarations
q1 = Queue.Queue(n)
q2 = Queue.Queue(2000)

lock = threading.Lock()

# Threads
threads = []

# read the file into a string
myfile = open('metin.txt', 'r')
data = myfile.read()

# Thread definition
class WorkingThread(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)

    def run(self):
        lock.acquire()
        q2.put(self.caesar(q1.get(), s))
        lock.release()

    def caesar(self, plaintext, shift):
        alphabet = string.ascii_lowercase
        shifted_alphabet = alphabet[shift:] + alphabet[:shift]
        table = string.maketrans(alphabet, shifted_alphabet)
        return plaintext.translate(table)

for i in range(0, n):
    current_thread = WorkingThread()
    current_thread.start()
    threads.append(current_thread)

output_file = open("crypted" + "_" + str(s) + "_" + str(n) + "_" + str(l) + ".txt", "w")

for i in range(0, len(data), l):
    while not q1.full:
        q1.put(data[index:index+l])
        index += l
    while not q2.empty:
        output_file.write(q2.get())

for i in range(0, n):
    threads[i].join()

output_file.close()
myfile.close()
while not q1.full can never be True: full is a method, and a method object is always truthy in a boolean context, so not q1.full is always False. You need to call the method: q1.full(). The same goes for q2.empty.
Also, you shouldn't try to detect whether the queue is full in this case. If it were ever not full, you would keep adding data until it was and then ignore the rest, or your index could increase beyond the size of data and you would keep adding zero-length data chunks.
You should use a separate thread for writing to q1 and another for reading from q2; then you can just let q1 block on put().
Also, you're using the same lock in all your worker threads, which basically serializes the computation and defeats the purpose of threading. The problem you're dealing with is CPU bound, for which multithreading isn't going to give you any speedup in Python. Have a look at the multiprocessing module. Using multiprocessing.Pool.map() (or one of the other map methods), the whole program could be simplified dramatically and gain a speedup through multiprocessing at the same time.
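For illustration, here is a minimal sketch of that Pool.map approach in Python 3 (so str.maketrans instead of string.maketrans); the shift, chunk length, and file names are just placeholders matching your example:
import string
from functools import partial
from multiprocessing import Pool

def caesar(shift, plaintext):
    alphabet = string.ascii_lowercase
    shifted = alphabet[shift:] + alphabet[:shift]
    table = str.maketrans(alphabet, shifted)
    return plaintext.translate(table)

if __name__ == '__main__':
    s, l = 3, 64  # shift and chunk length, illustrative values
    with open('metin.txt', 'r') as f:
        data = f.read()
    # split the input into fixed-size chunks
    chunks = [data[i:i + l] for i in range(0, len(data), l)]
    with Pool() as pool:
        # each chunk is enciphered in a separate process
        ciphered = pool.map(partial(caesar, s), chunks)
    with open('crypted.txt', 'w') as out:
        out.write(''.join(ciphered))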
I saw the reference here, and tried to use the method for my for loop, but it does not seem to work as expected.
def concatMessage(obj_grab, content):
    for logCatcher in obj_grab:
        for key in logCatcher.dic_map:
            regex = re.compile(key)
            for j in range(len(content)):
                for m in re.finditer(regex, content[j]):
                    content[j] += " " + logCatcher.index + " " + logCatcher.dic_map[key]
    return content

def transferConcat(args):
    return concatMessage(*args)

if __name__ == "__name__":
    pool = Pool()
    content = pool.map(transferConcat, [(obj_grab, content)])[0]
    pool.close()
    pool.join()
I want to improve the performance of the for loop because it takes 22 seconds to run.
When I run the method directly, it also takes about 22 seconds.
It seems the attempted speedup has failed.
What should I do to speed up my for loop?
Why is pool.map not working in my case?
After the reminder from nablahero, I revised my code as below:
if __name__ == "__main__":
    content = input_file(target).split("\n")
    content = manager.list(content)
    for files in source:
        obj_grab.append((LogCatcher(files), content))
    pool = Pool()
    pool.map(transferConcat, obj_grab)
    pool.close()
    pool.join()

def concatMessage(LogCatcher, content):
    for key in LogCatcher.dic_map:
        regex = re.compile(key)
        for j in range(len(content)):
            for m in re.finditer(regex, content[j]):
                content[j] += LogCatcher.index + LogCatcher.dic_map[key]

def transferConcat(args):
    return concatMessage(*args)
After a long wait, it took 82 seconds to finish...
Why did I get this result? How can I revise my code?
obj_grab is a list that contains the LogCatchers for the different input files.
content is the file content I want to concatenate onto, and I use Manager() to let the multiple processes work on the same content.
What's in obj_grab and content? I guess it contains only one object, so when you start your Pool you call the function transferConcat only once, because you only have one object in obj_grab and content.
If you use map, have a look at your reference again: obj_grab and content must be lists of objects in order to speed your program up, because map calls the function multiple times with different obj_grab and content values.
pool.map does not speed up the function itself - the function just gets called multiple times in parallel with different data!
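To make that concrete, here is a tiny, self-contained sketch of the pattern: the input is split into several independent chunks, and pool.map runs the worker once per chunk in parallel. process_chunk here is just a stand-in for your concatMessage logic, and the data is made up:
import re
from multiprocessing import Pool

def process_chunk(lines):
    # stand-in for the real per-chunk work
    pattern = re.compile(r"error")
    return [line + " [matched]" if pattern.search(line) else line
            for line in lines]

if __name__ == '__main__':
    content = ["error in module a", "all good", "error in module b", "done"]
    n_chunks = 2
    size = max(1, (len(content) + n_chunks - 1) // n_chunks)
    chunks = [content[i:i + size] for i in range(0, len(content), size)]
    with Pool() as pool:
        # one call to process_chunk per chunk, running in parallel
        partial_results = pool.map(process_chunk, chunks)
    merged = [line for part in partial_results for line in part]
    print(merged)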
I hope that clears some things up.