How to alternate iterations when multithreading a single function with Python

I have a fairly complex Python function that I'm trying to run against around 100 different NYSE stock symbols. Right now it takes around 5 minutes to complete. That is not a terrible amount of time, but I'm trying to make it quicker with multithreading. My idea is that, since this is a single function and I'm just passing new parameters on each iteration, I could store a list of symbols that have "completed"; on each new iteration the code runs through that list and, if a symbol doesn't exist in it yet, runs the computation. Here's some code I put together:
iteration_count = 0
for index, row in stocklist_df.iterrows():
    # The below just filters input data
    if '-' in row[0]:
        continue
    elif '.' in row[0]:
        continue
    elif '^' in row[0]:
        continue
    elif len(row[0]) > 4:
        continue
    else:
        symbol = row[0]
        # idea is that on the first iteration it runs on the first thread and appends to threadlist
        # a later iteration looks at threadlist and, if the symbol exists, skips and goes to the next
        threadlist.append([symbol, iteration_count])
        t1 = threading.Thread(target=get_info(symbol, 0))
        t1.start()
        if iteration_count > 1:
            t2 = threading.Thread(target=get_info(symbol, 0))
            t2.start()
Right now this doesn't appear to be working, and I'm not sure whether this is the best solution or whether I'm just implementing it wrong. How can I achieve this?

I confess to having had some difficulty following your logic. I will just offer up that the usual method of handling threading when you have multiple, similar requests is to use thread pooling. There are several ways offered by Python, such as the ThreadPoolExecutor class in the concurrent.futures module (see the manual for documentation). The following is an example; here the function get_info essentially just returns its argument:
import concurrent.futures

def get_info(symbol):
    return 'answer: ' + symbol

symbols = ['abc', 'def', 'ghi', 'jkl']
NUMBER_THREADS = min(30, len(symbols))
with concurrent.futures.ThreadPoolExecutor(max_workers=NUMBER_THREADS) as executor:
    results = executor.map(get_info, symbols)
    for result in results:
        print(result)
Prints:
answer: abc
answer: def
answer: ghi
answer: jkl
You can play around with the number of threads you create. If, for example, you are using the requests package to retrieve URLs from the same website, then you might wish to create a requests Session object and pass it as an additional argument to get_info:
import concurrent.futures
import requests
import functools

def get_info(session, symbol):
    """
    r = session.get('https://somewebsite.com?symbol=' + symbol)
    return r.text
    """
    return 'symbol answer: ' + symbol

symbols = ['abc', 'def', 'ghi', 'jkl']
NUMBER_THREADS = min(30, len(symbols))
with requests.Session() as session:
    get_info_with_session = functools.partial(get_info, session)  # session will be the first argument
    with concurrent.futures.ThreadPoolExecutor(max_workers=NUMBER_THREADS) as executor:
        results = executor.map(get_info_with_session, symbols)
        for result in results:
            print(result)
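Applied to the question's setup, a minimal sketch of the same pooling idea might look like this; it assumes the stocklist_df DataFrame and the get_info(symbol, 0) function behave as described in the question, so treat it as an outline rather than a drop-in replacement:
import concurrent.futures

# Collect the symbols that survive the question's filters
symbols = []
for index, row in stocklist_df.iterrows():  # stocklist_df as defined in the question
    symbol = row[0]
    if '-' in symbol or '.' in symbol or '^' in symbol or len(symbol) > 4:
        continue
    symbols.append(symbol)

NUMBER_THREADS = min(30, len(symbols))
with concurrent.futures.ThreadPoolExecutor(max_workers=NUMBER_THREADS) as executor:
    # Pass the callable itself; the pool does the calling on worker threads
    results = executor.map(lambda s: get_info(s, 0), symbols)  # get_info from the question
    for symbol, result in zip(symbols, results):
        print(symbol, result)
Note that the pool receives the callable itself. The question's target=get_info(symbol, 0) calls get_info immediately in the main thread and hands its return value to Thread, which is one reason the original attempt shows no parallel speedup.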

Related

Share variable in concurrent.futures

I am trying to build a word counter with MapReduce using concurrent.futures. I previously wrote a multithreading version, but it was very slow because the work is CPU bound.
I have done the mapping part, which divides the words into pairs like ['word1', 1], ['word2', 1], ['word1', 1], ['word3', 1] and splits them between the processes, so each process takes care of a part of the text file. The next step ("shuffling") is to put these words into a dictionary so it looks like word1: [1, 1], word2: [1], word3: [1], but I cannot share the dictionary between the processes because we are using multiprocessing instead of multithreading. How can I make each process add its "1"s to a dictionary shared between all the processes? I'm stuck on this and can't continue.
I am at this point:
import sys
import re
import concurrent.futures
import time

# Read text file
def input(index):
    try:
        reader = open(sys.argv[index], "r", encoding="utf8")
    except OSError:
        print("Error")
        sys.exit()
    texto = reader.read()
    reader.close()
    return texto

# Convert text to list of words
def splitting(input_text):
    input_text = input_text.lower()
    input_text = re.sub('[,.;:!¡?¿()]+', '', input_text)
    words = input_text.split()
    n_processes = 4
    # Creating processes
    with concurrent.futures.ProcessPoolExecutor() as executor:
        results = []
        for id_process in range(n_processes):
            results.append(executor.submit(mapping, words, n_processes, id_process))
        for f in concurrent.futures.as_completed(results):
            print(f.result())

def mapping(words, n_processes, id_process):
    word_map_result = []
    for i in range(int((id_process / n_processes) * len(words)),
                   int(((id_process + 1) / n_processes) * len(words))):
        word_map_result.append([words[i], 1])
    return word_map_result

if __name__ == '__main__':
    if len(sys.argv) == 1:
        print("Please, specify a text file...")
        sys.exit()
    start_time = time.time()
    for index in range(1, len(sys.argv)):
        print(sys.argv[index], ":", sep="")
        text = input(index)
        splitting(text)
        # for word in result_dictionary_words:
        #     print(word, ':', result_dictionary_words[word])
    print("--- %s seconds ---" % (time.time() - start_time))
I've seen that when doing concurrent programming it is usually best to avoid shared state as far as possible, so how can I implement MapReduce word counting without sharing the dictionary between processes?
You can create a shared dictionary using a Manager from multiprocessing. I understand from your program that it is your word_map_result you need to share.
You could try something like this
from multiprocessing import Manager
...

def splitting():
    ...
    word_map_result = Manager().dict()
    with concurrent.futures.....:
        ...
        results.append(executor.submit(mapping, words, n_processes, id_process, word_map_result))
        ...
    ...

def mapping(words, n_processes, id_process, word_map_result):
    for ...
        # Do not return anything - word_map_result is up to date in your main process
Basically you will remove the local copy of word_map_result from your mapping function and pass the managed dictionary to it as a parameter. This word_map_result is then shared between all your subprocesses and the main program. Managers add data transfer overhead, though, so this might not help you very much.
In this case you do not return anything from the workers, so you do not need the for loop that processes results in your main program either - your word_map_result is identical in all subprocesses and the main program.
I may have misunderstood your problem, and I am not familiar enough with the algorithm to know whether it could be re-engineered so that you don't need to share anything between processes.
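For completeness, here is a minimal runnable sketch of that Manager-based approach; the per-process slicing, the managed Lock used when merging, and the small demo under __main__ are illustrative additions rather than part of the answer above:
import concurrent.futures
from multiprocessing import Manager

def mapping(words, n_processes, id_process, word_map_result, lock):
    # Each process handles its own slice of the word list
    start = int((id_process / n_processes) * len(words))
    end = int(((id_process + 1) / n_processes) * len(words))
    local = {}
    for word in words[start:end]:
        local.setdefault(word, []).append(1)
    # Merge into the shared dictionary under a lock so updates are not lost
    with lock:
        for word, ones in local.items():
            word_map_result[word] = word_map_result.get(word, []) + ones

def splitting(words, n_processes=4):
    with Manager() as manager:
        word_map_result = manager.dict()
        lock = manager.Lock()
        with concurrent.futures.ProcessPoolExecutor() as executor:
            futures = [
                executor.submit(mapping, words, n_processes, i, word_map_result, lock)
                for i in range(n_processes)
            ]
            concurrent.futures.wait(futures)
        return dict(word_map_result)

if __name__ == '__main__':
    print(splitting("uno dos uno tres dos uno".split()))
Merging a local dictionary once per process keeps traffic through the proxy low, which matters because every access to a managed dict is a round trip to the manager process.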
It seems like a misconception to be using multiprocessing at all. First, there is overhead in creating the pool and overhead in passing data to and from the processes. And if you decide to use a shared, managed dictionary that worker function mapping can use to store its results in, know that a managed dictionary uses a proxy, the accessing of which is rather slow. The alternative to using a managed dictionary would be as you currently have it, i.e. mapping returns a list and the main process uses those results to create the keys and values of the dictionary. But what then is the point of mapping returning a list where each element is always a list of two elements where the second element is always the constant value 1? Isn't that rather wasteful of time and space?
I think your performance will be no faster (probably slower) than just implementing splitting as:
# Convert text to a dictionary mapping each word to a list of 1s
def splitting(input_text):
    input_text = input_text.lower()
    input_text = re.sub('[,.;:!¡?¿()]+', '', input_text)
    words = input_text.split()
    results = {}
    for word in words:
        # Append a 1 per occurrence, matching the word1: [1, 1] shape from the question
        results.setdefault(word, []).append(1)
    return results
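As a side note (not part of the answer above): if plain integer counts would do instead of the question's lists of 1s, collections.Counter gets the same result in essentially one line:
import re
from collections import Counter

def splitting(input_text):
    words = re.sub('[,.;:!¡?¿()]+', '', input_text.lower()).split()
    return Counter(words)  # e.g. Counter({'word1': 2, 'word2': 1})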

Running out of memory on Python product iteration chain

I am trying to build a list of possible string combinations and then iterate over it. I am running out of memory executing the line below, which I understand, because it's several billion entries.
data = list(map(''.join,chain.from_iterable(product(string.digits+string.ascii_lowercase+'/',repeat = i) for i in range(0,7))))
So I think that, rather than creating this massive list up front, I could create and execute against it in waves, with some kind of "holding string" that I save to disk and can restart from when I want. I.e., generate and iterate over a million rows, then save the holding string to a file; then start up again with the next million rows, but begin my mapping/iterations at the "holding string" or the row after it. I have no clue how to do that, and I think I might have to drop the .from_iterable(product( code I had implemented. If that idea is not clear (or is clear but stupid), let me know.
Also, another option, rather than working around the memory issue, would be to somehow optimize the iterable list itself, but I'm not sure how I would do that either. I'm trying to map an API that has no existing documentation. While I don't know that a non-exhaustive list is the route to take, I'm certainly open to suggestions.
Here is the code chunk I've been using:
import csv
import string
from itertools import product, chain

# Open stringfile. If it doesn't exist, create it
try:
    with open(stringfile) as f:
        reader = csv.reader(f, delimiter=',')
        data = list(reader)
        f.close()
except:
    data = list(map(''.join, chain.from_iterable(product(string.digits + string.ascii_lowercase + '/', repeat=i) for i in range(0, 6))))
    f = open(stringfile, 'w')
    f.write('\n'.join(data))
    f.close()
    pass
# Iterate against it
...
EDIT: Further poking at this led me to this thread, which covers a similar topic. There is discussion about using islice, which helps me post-mapping (the script crashed last night during the API calls because of an error in my exception handling; I just restarted it at the 400k-th item).
Can I use islice within a product? That is, have the generator produce items 10 million through 12 million (for example) and operate on just those items, as a way to preserve memory?
Here is the most recent snippet of what I'm doing. You can see I plugged islice in further down, in the actual iteration, but I want to islice in the actual generation (the data = line).
# Open stringfile. If it doesn't exist, create it
try:
    with open(stringfile) as f:
        reader = csv.reader(f, delimiter=',')
        data = list(reader)
        f.close()
except:
    data = list(map(''.join, chain.from_iterable(product(string.digits + string.ascii_lowercase + '/', repeat=i) for i in range(3, 5))))
    f = open(stringfile, 'w')
    f.write(str('\n'.join(data)))
    f.close()
    pass
print("Total items: " + str(len(data) - substart))
fdf = pd.DataFrame()
sdf = pd.DataFrame()
qdf = pd.DataFrame()
attctr = 0
# Iterate through the string combination list
for idx, kw in islice(enumerate(data), substart, substop):
    # Attempt API call. Do the cooldown function if there is an issue.
    if idx / 1000 == int(idx / 1000):
        print("Iteration " + str(idx) + " of " + str(len(data)))
    attctr += 1
    if attctr == attcd:
        print("Cooling down!")
        time.sleep(cdtimer)
        attctr = 0
    try:
        ....
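On the "islice in the actual generation" question: the product/chain expression can be kept as a lazy generator and sliced directly, so nothing is materialized up front. A minimal sketch, where the wave boundaries substart/substop are placeholders you would persist between runs:
import string
from itertools import product, chain, islice

chars = string.digits + string.ascii_lowercase + '/'

# Lazy generator of candidate strings; no list() and no up-front memory cost
candidates = map(''.join, chain.from_iterable(
    product(chars, repeat=i) for i in range(3, 5)))

# Process one "wave" of the sequence, e.g. items substart..substop
substart, substop = 0, 1000000
for idx, kw in enumerate(islice(candidates, substart, substop), start=substart):
    pass  # API call for kw would go here; save idx somewhere to resume later
Items before substart are still generated and discarded by islice, but that is cheap compared with the API calls themselves.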

Python - how do I save an itertools.product loop and resume where it left off

I am writing a Python script to basically check every possible URL and log it if it responds to a request.
I found a post on Stack Overflow that suggested a method of generating the strings for the URLs, which works well.
for n in range(1, 4 + 1):
    for comb in product(chars, repeat=n):
        url = ("http://" + ''.join(comb) + ".com")
        currentUrl = url
        checkUrl(url)
As you can imagine there are way too many URLs and it is going to take a very long time, so I am trying to find a way to save my script's progress and resume from where it left off.
My question is: how can I have the loop start from a specific place? Or does anyone have a working piece of code that does the same thing and will allow me to specify a starting point?
This is my script so far:
import urllib.request
from string import digits, ascii_uppercase, ascii_lowercase
from itertools import product

goodUrls = "Valid_urls.txt"
saveFile = "save.txt"
currentUrl = ''

def checkUrl(url):
    print("Trying - " + url)
    try:
        urllib.request.urlopen(url)
    except Exception as e:
        None
    else:
        log = open(goodUrls, 'a')
        log.write(url + '\n')

chars = digits + ascii_lowercase
try:
    while True:
        for n in range(1, 4 + 1):
            for comb in product(chars, repeat=n):
                url = ("http://" + ''.join(comb) + ".com")
                currentUrl = url
                checkUrl(url)
except KeyboardInterrupt:
    print("Saving and Exiting")
    open(saveFile, 'w').write(currentUrl)
The return value of itertools.product is an iterator. As such, all you'll have to do is:
products = product(...)
for foo in products:
    if bar(foo):
        spam(foo)
        break
# other stuff

for foo in products:
    ...  # starts where you left off
In your case the time taken to iterate through the possibilities is pretty small, at least compared to the time it'll take to make all those network requests. You could either save all the possibilities to disk and dump a list of what's left after every run of the program, or you could just save which number you're on. Since product has deterministic output, that should do it.
try:
    with open("progress.txt") as f:
        first_up = int(f.read().strip())
except FileNotFoundError:
    first_up = 0

try:
    for i, foo in enumerate(products):
        if i < first_up:
            continue  # skip work already done on a previous run
        ...  # do stuff down here
except KeyboardInterrupt:
    # this is really rude to do, by the by....
    print("Saving and exiting")
    with open("progress.txt", "w") as f:
        f.write(str(i))
If there's some reason you need a human-readable "progress" file, you can save the last URL you attempted, as you did above, and do:
for foo in itertools.dropwhile(lambda p: p != saved_url, products):
    ...  # do stuff
Although the attempt to find all the URLs by this method is ridiculous, the general question posed is a very good one. The short answer is that you cannot pickle an iterator in a straightforward way, because the pickle mechanism can't save the iterator's internal state. However, you can pickle an object that implements both __iter__ and __next__. So if you create a class that has the desired functionality and also works as an iterator (by implementing those two functions), it can be pickled and reloaded. The reloaded object, when you make an iterator from it, will continue from where it left off.
#! python3.6
import pickle

class AllStrings:
    CHARS = "abcdefghijklmnopqrstuvwxyz0123456789"

    def __init__(self):
        self.indices = [0]

    def __iter__(self):
        return self

    def __next__(self):
        s = ''.join([self.CHARS[n] for n in self.indices])
        for m in range(len(self.indices)):
            self.indices[m] += 1
            if self.indices[m] < len(self.CHARS):
                break
            self.indices[m] = 0
        else:
            self.indices.append(0)
        return s

try:
    with open("bookmark.txt", "rb") as f:
        all_strings = pickle.load(f)
except IOError:
    all_strings = AllStrings()

try:
    for s in iter(all_strings):
        print(s)
except KeyboardInterrupt:
    with open("bookmark.txt", "wb") as f:
        pickle.dump(all_strings, f)
This solution also removes the limitation on the length of the string. The iterator will run forever, eventually generating all possible strings. Of course at some point the application will stop due to the increasing entropy of the universe.

Python parallel calculations in a while loop

I have been looking around for some time but haven't had luck finding an example that solves my problem. I have added an example from my code. As one can see, it is slow, and the two functions could be run independently.
My aim is to print the latest parameter values every second while the slow processes are calculated in the background. The latest value is shown, and whenever one of the processes finishes, its value is updated.
Can anybody recommend a better way to do it? An example would be really helpful.
Thanks a lot.
import time

def ProcessA(parA):
    # imitate slow process
    time.sleep(5)
    parA += 2
    return parA

def ProcessB(parB):
    # imitate slow process
    time.sleep(10)
    parB += 5
    return parB

# start from here
i, parA, parB = 1, 0, 0
while True:  # endless loop
    print(i)
    print(parA)
    print(parB)
    time.sleep(1)
    i += 1
    # update parameter A
    parA = ProcessA(parA)
    # update parameter B
    parB = ProcessB(parB)
I imagine this should do it for you. This has the benefit that you can add extra parallel functions, up to a total equal to the number of cores you have. Edits are welcome.
# import time module
import time
# import the appropriate multiprocessing functions
from multiprocessing import Pool

# define your functions
# whatever your slow function is (someFunction and
# someTemporallyDynamicVariable are placeholders)
def slowFunction(x):
    return someFunction(x)

# printingFunction: print the current value every timeDelay seconds
# until the asynchronous result is ready, then return the new value
def printingFunction(new, current, timeDelay):
    while not new.ready():
        print(current)
        time.sleep(timeDelay)
    return new.get()

# set the initial value that will be printed.
# Depending on your function this may take some time.
CurrentValue = slowFunction(someTemporallyDynamicVariable)

# establish your pool
pool = Pool()
while True:  # endless loop
    # an asynchronous call; slowFunction keeps running in the background
    # while your printing operates. Pass the callable and its arguments
    # separately so the pool does the calling in a worker process.
    NewValue = pool.apply_async(slowFunction, (someTemporallyDynamicVariable,))
    CurrentValue = printingFunction(NewValue, CurrentValue, 1)
# close your pool
pool.close()
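A hedged alternative sketch, staying closer to the question's two-function setup: submit ProcessA and ProcessB to a process pool, print the last known values every second, and swap in each result as soon as its future reports done. The resubmit-on-completion logic is an illustrative choice, not something taken from the answer above.
import time
import concurrent.futures

def ProcessA(parA):
    time.sleep(5)   # imitate slow process
    return parA + 2

def ProcessB(parB):
    time.sleep(10)  # imitate slow process
    return parB + 5

if __name__ == '__main__':
    i, parA, parB = 1, 0, 0
    with concurrent.futures.ProcessPoolExecutor(max_workers=2) as executor:
        futA = executor.submit(ProcessA, parA)
        futB = executor.submit(ProcessB, parB)
        while True:
            print(i, parA, parB)
            time.sleep(1)
            i += 1
            # When a computation finishes, take its result and start the next one
            if futA.done():
                parA = futA.result()
                futA = executor.submit(ProcessA, parA)
            if futB.done():
                parB = futB.result()
                futB = executor.submit(ProcessB, parB)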

Why is multiprocessing performance invisible?

I saw the reference here and tried to use the method for my for loop, but it does not seem to work as expected.
import re
from multiprocessing import Pool

def concatMessage(obj_grab, content):
    for logCatcher in obj_grab:
        for key in logCatcher.dic_map:
            regex = re.compile(key)
            for j in range(len(content)):
                for m in re.finditer(regex, content[j]):
                    content[j] += " " + logCatcher.index + " " + logCatcher.dic_map[key]
    return content

def transferConcat(args):
    return concatMessage(*args)

if __name__ == "__main__":
    pool = Pool()
    content = pool.map(transferConcat, [(obj_grab, content)])[0]
    pool.close()
    pool.join()
I want to improve the performance of the for loop because it takes 22 seconds to run.
When I run the method directly, it also takes about 22 seconds.
It seems the enhancement has failed.
What should I do to speed up my for loop?
Why is pool.map not working in my case?
After the reminder from nablahero, I revised my code as below:
if __name__ == "__main__":
    content = input_file(target).split("\n")
    content = manager.list(content)
    for files in source:
        obj_grab.append((LogCatcher(files), content))
    pool = Pool()
    pool.map(transferConcat, obj_grab)
    pool.close()
    pool.join()

def concatMessage(LogCatcher, content):
    for key in LogCatcher.dic_map:
        regex = re.compile(key)
        for j in range(len(content)):
            for m in re.finditer(regex, content[j]):
                content[j] += LogCatcher.index + LogCatcher.dic_map[key]

def transferConcat(args):
    return concatMessage(*args)
After a long wait, it took 82 seconds to finish...
Why did I get this result? How can I revise my code?
obj_grab is a list that contains LogCatchers for different input files.
content is the file I want to concatenate to; I use Manager() so the multiple processes can work on the same file.
What's in obj_grab and content? I guess it contains only one object, so when you start your Pool you call the function transferConcat only once, because you only have one object in obj_grab and content.
If you use map, have a look at your reference again: obj_grab and content must be lists of objects in order to speed your program up, because map calls the function multiple times with different obj_grab and content values.
pool.map does not speed up the function itself - the function just gets called multiple times in parallel with different data!
I hope that clears some things up.
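To illustrate that point, a hedged sketch of splitting the work into several tasks might look like the following; it assumes obj_grab and content exist as in the question, that LogCatcher objects can be pickled, and the chunking scheme itself is an illustrative choice:
import re
from multiprocessing import Pool

def concatChunk(args):
    # Each task gets every LogCatcher but only its own slice of content
    obj_grab, chunk = args
    for logCatcher in obj_grab:
        for key in logCatcher.dic_map:
            regex = re.compile(key)
            for j in range(len(chunk)):
                for m in regex.finditer(chunk[j]):
                    chunk[j] += " " + logCatcher.index + " " + logCatcher.dic_map[key]
    return chunk

if __name__ == "__main__":
    n_tasks = 4
    size = (len(content) + n_tasks - 1) // n_tasks
    chunks = [content[i:i + size] for i in range(0, len(content), size)]
    with Pool() as pool:
        # One call per chunk, so pool.map really does run several tasks in parallel
        results = pool.map(concatChunk, [(obj_grab, chunk) for chunk in chunks])
    content = [line for chunk in results for line in chunk]
Each task returns its own modified chunk and the results are stitched back together in the parent, instead of sharing a managed list, which avoids the proxy overhead that made the revised version slower.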
