Multiprocessing pool returning wrong results - python

Another confused parallel coder here!
Our internal Hive database has an API layer which we need to use to access the data. There is a 300 second query timeout limit, so I wanted to use multiprocessing to execute multiple queries in parallel:
from multiprocessing import Pool
import pandas as pd
import time
from hive2pandas_anxpy import Hive2Pandas  # custom module for querying our Hive db and converting the results to a Pandas dataframe
import datetime

def run_query(hour):
    start_time = time.time()
    start_datetime = datetime.datetime.now()
    query = """SELECT id, revenue from table where date='2014-05-20 %s' limit 50""" % hour
    h2p = Hive2Pandas(query, 'username')
    h2p.run()
    elapsed_time = int(time.time() - start_time)
    end_datetime = datetime.datetime.now()
    return {'query': query, 'start_time': start_datetime, 'end_time': end_datetime,
            'elapsed_time': elapsed_time, 'data': h2p.data_df}

if __name__ == '__main__':
    start_time = time.time()
    pool = Pool(4)
    hours = ['17', '18', '19']
    results = pool.map_async(run_query, hours)
    pool.close()
    pool.join()
    print(int(time.time() - start_time))
The issue I'm having is that one of the queries always returns no data, but when I run the same query in the usual fashion, it returns data. Since I'm new to multiprocessing, I'm wondering if there are any obvious issues with how I'm using it above?

I think the issue you are having is that the results object is not ready by the time you want to use it. Also, since you have a known timeout, I would suggest using it to your advantage in the code.
This code shows how you can force a timeout after 300 seconds if the results from all the queries have not been collected by then.
if __name__ == '__main__':
    start_time = time.time()
    hours = ['17', '18', '19']
    with Pool(processes=4) as pool:
        results = pool.map_async(run_query, hours)
        print(results.get(timeout=300))
    print(int(time.time() - start_time))
Otherwise you should still be calling results.get() to retrieve your data, or specify a callback function for map_async, as in the sketch below.
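Here is a minimal sketch (assuming the run_query function defined above) showing both approaches together; the collect helper and the collected list are names introduced just for this example:

from multiprocessing import Pool

collected = []

def collect(result_list):
    # map_async hands the callback the full list of results once every worker has finished
    collected.extend(result_list)

if __name__ == '__main__':
    with Pool(4) as pool:
        async_result = pool.map_async(run_query, ['17', '18', '19'], callback=collect)
        results = async_result.get(timeout=300)  # blocks until done, or raises multiprocessing.TimeoutError
    print(len(results), len(collected))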

Related

how to "poll" python multiprocess pool apply_async

I have a task function like this:
def task(s):
    # doing some thing
    return res
The original program is:
res = []
for i in data:
    res.append(task(i))
# using pickle to save res every 30s
I need to process a lot of data and I don't care about the output order of the results. Because of the long running time, I need to save the current progress regularly. Now I want to change it to multiprocessing:
pool = Pool(4)
status = []
res = []
for i in data:
    status.append(pool.apply_async(task, (i,)))
for i in status:
    res.append(i.get())
# using pickle to save res every 30s
Suppose I have processes p0, p1, p2, p3 in the Pool and 10 tasks (task(0) ... task(9)), and p0 takes a very long time to finish task(0):
Will the main process be blocked at the first res.append(i.get())?
If p1 has finished task(1) while p0 is still working on task(0), will p1 move on to task(4) or a later one?
If the answer to the first question is yes, how can I get the other results in advance and collect the result of task(0) last?
I updated my code, but the main process got blocked somewhere while the other processes were still working on tasks. What's wrong? Here is the core of the code:
with concurrent.futures.ProcessPoolExecutor(4) as ex:
    for i in self.inBuffer:
        futuresList.append(ex.submit(warpper, i))
    for i in concurrent.futures.as_completed(futuresList):
        (word, r) = i.result()
        self.resDict[word] = r
        self.logger.info("{} --> {}".format(word, r))
        cur = datetime.now()
        if (cur - self.timeStmp).total_seconds() > 30:
            self.outputPickle()
            self.timeStmp = datetime.now()
The length of self.inBuffer is about 100000. self.logger.info writes the info to a log file. For some special inputs i, the warpper function prints auxiliary information with print. self.resDict is a dict that stores the results. self.outputPickle() writes a .pkl file using pickle.dump.
At first, the code ran normally: both the log file and the prints from warpper were updated. But at some point I found that the log file had not been updated for a long time (several hours, even though a single call to warpper should take no more than 120 s), while warpper was still printing information (until I killed the process it printed about 100 messages without any update of the log file). Also, the timestamp of the output .pkl file did not change. Here is the implementation of outputPickle():
def outputPickle(self):
    if os.path.exists(os.path.join(self.wordDir, self.outFile)):
        if os.path.exists(os.path.join(self.wordDir, "{}_backup".format(self.outFile))):
            os.remove(os.path.join(self.wordDir, "{}_backup".format(self.outFile)))
        shutil.copy(os.path.join(self.wordDir, self.outFile), os.path.join(self.wordDir, "{}_backup".format(self.outFile)))
    with open(os.path.join(self.wordDir, self.outFile), 'wb') as f:
        pickle.dump(self.resDict, f)
Then I added three print calls:
print("getting res of something")
(word, r) = i.result()
print("finishing i.result")
self.resDict[word] = r
print("finished getting res of {}".format(word))
Here is the log:
getting res of something
finishing i.result
finished getting res of CNICnanotubesmolten
getting res of something
finishing i.result
finished getting res of CNN0
getting res of something
message by warpper
message by warpper
message by warpper
message by warpper
message by warpper
The log "message by warpper" can be printed at most once every time the warpper is called
Yes
Yes, since the tasks are submitted asynchronously. Also, p1 (or another worker) will pick up another chunk of data if the size of the input iterable is larger than the maximum number of processes/workers.
"... how to get other results in advance"
One convenient option is to rely on concurrent.futures.as_completed, which yields the futures as they are completed:
import time
import concurrent.futures

def func(x):
    time.sleep(3)
    return x ** 2

if __name__ == '__main__':
    data = range(1, 5)
    results = []
    with concurrent.futures.ProcessPoolExecutor(4) as ex:
        futures = [ex.submit(func, i) for i in data]
        # process the earlier results as they are completed
        for fut in concurrent.futures.as_completed(futures):
            res = fut.result()
            results.append(res)
            print(res)
Sample output:
4
1
9
16
Another option is to use the callback argument of apply_async(func[, args[, kwds[, callback[, error_callback]]]]); the callback accepts a single argument, the returned result of the function. In that callback you can process the result in a minimal way (keeping in mind that it is tied to a single result from one concrete function call). The general scheme looks as follows:
from multiprocessing import Pool  # func as defined above

def res_callback(v):
    # ... process the result
    with open('test.txt', 'a') as f:  # just an example
        f.write(str(v))
    print(v, flush=True)

if __name__ == '__main__':
    data = range(1, 5)
    results = []
    with Pool(4) as pool:
        tasks = [pool.apply_async(func, (i,), callback=res_callback) for i in data]
        # await the submitted tasks here
But that scheme would still require you to somehow await (get()) the results of the submitted tasks, for example as sketched below.
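A minimal sketch of that wait, assuming the func and res_callback definitions above; calling wait() (or get()) on each AsyncResult keeps the main process alive until every task and its callback have completed:

from multiprocessing import Pool

if __name__ == '__main__':
    data = range(1, 5)
    with Pool(4) as pool:
        tasks = [pool.apply_async(func, (i,), callback=res_callback) for i in data]
        for t in tasks:
            t.wait()  # blocks until this task has finished and its callback has run
    # an equivalent alternative is pool.close() followed by pool.join()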

Dask: How to return a tuple of futures in client.submit

I need to return a tuple from a task, which has to be unpacked in the main process because each element of the tuple will go to a different Dask task. I would like to avoid unnecessary communication, so I think the tuple elements should be Futures.
The best way I came up with is to scatter the data only to the same worker to get the future. Is there a better way to do it?
import numpy as np
import time
from dask.distributed import Client
from dask.distributed import get_client, get_worker

def other_task(data):
    return len(data)

def costly_task():
    start = time.time()
    client = get_client()
    data = np.arange(100000000)
    metadata = len(data)
    print('worker time', time.time() - start)
    start = time.time()
    future_data, future_metadata = client.scatter(data, workers=[get_worker().id]), client.scatter(metadata, workers=[get_worker().id])
    print('scatter time', time.time() - start)
    return future_data, future_metadata

if __name__ == '__main__':
    client = Client(processes=True)
    start = time.time()
    future_data, future_metadata = client.submit(costly_task).result()
    metadata = future_metadata.result()
    print('costly task time', time.time() - start)
    other_metadata = client.submit(other_task, future_data).result()
    print('total time', time.time() - start)
The times I get with the script above are:
worker time 0.12443423271179199
scatter time 0.7880995273590088
costly task time 0.923513650894165
total time 0.9366424083709717
It is possible via delayed:
from dask.distributed import Client
from dask import delayed

@delayed(nout=3)
def costly_task():
    return 1, 2, 3

client = Client(processes=True)
a, b, c = costly_task()  # these are delayed
future_a, future_b, future_c = client.compute([a, b, c])  # split results
See this Q&A for further details.

Trying to add throttle control to paralleled API calls in python

I am using the Google Places API, which has a limit of 10 queries per second. This means I cannot make more than 10 requests within a second. With serial execution this wouldn't be an issue, since the API's average response time is 250 ms, so I would only make about 4 calls per second.
To utilize the full 10 QPS limit I used multithreading to make parallel API calls. But now I need to control the number of calls that happen in a second; it should not go beyond 10 (the Google API starts throwing errors if I cross the limit).
Below is the code I have so far. I am not able to figure out why the program sometimes just gets stuck or takes a lot longer than required.
import time
from datetime import datetime
import random
from threading import Lock
from concurrent.futures import ThreadPoolExecutor as pool
import concurrent.futures
import requests
import matplotlib.pyplot as plt
from statistics import mean
from ratelimiter import RateLimiter

def make_parallel(func, qps=10):
    lock = Lock()
    threads_execution_que = []
    limit_hit = False

    def qps_manager(arg):
        current_second = time.time()
        lock.acquire()
        if len(threads_execution_que) >= qps or limit_hit:
            limit_hit = True
            if current_second - threads_execution_que[0] <= 1:
                time.sleep(current_second - threads_execution_que[0])
        current_time = time.time()
        threads_execution_que.append(current_time)
        lock.release()
        res = func(arg)
        lock.acquire()
        threads_execution_que.remove(current_time)
        lock.release()
        return res

    def wrapper(iterable, number_of_workers=12):
        result = []
        with pool(max_workers=number_of_workers) as executer:
            bag = {executer.submit(func, i): i for i in iterable}
            for future in concurrent.futures.as_completed(bag):
                result.append(future.result())
        return result

    return wrapper

@make_parallel
def api_call(i):
    min_func_time = random.uniform(.25, .3)
    start_time = time.time()
    try:
        response = requests.get('https://jsonplaceholder.typicode.com/posts', timeout=1)
    except Exception as e:
        response = e
    if (time.time() - start_time) - min_func_time < 0:
        time.sleep(min_func_time - (time.time() - start_time))
    return response

api_call([1]*50)
Ideally the code should take no more than about 1.5 seconds, but currently it is taking about 12-14 seconds.
The script runs at the expected speed as soon as I remove the QPS manager logic.
Please suggest what I am doing wrong, and also whether there is a package already available that provides this mechanism out of the box.
Looks like ratelimit does just that:
from ratelimit import limits, sleep_and_retry

@make_parallel
@sleep_and_retry
@limits(calls=10, period=1)
def api_call(i):
    try:
        response = requests.get("https://jsonplaceholder.typicode.com/posts", timeout=1)
    except Exception as e:
        response = e
    return response
EDIT: I did some testing and it looks like @sleep_and_retry is a little too optimistic, so just increase the period a little, to 1.2 seconds:
from datetime import timedelta

s = datetime.now()
api_call([1] * 50)
elapsed_time = datetime.now() - s
print(elapsed_time > timedelta(seconds=50 / 10))
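Concretely, the suggested change is just the period argument on the limits decorator; the body of api_call stays the same as above:

@make_parallel
@sleep_and_retry
@limits(calls=10, period=1.2)  # slightly longer period to stay safely under the 10 QPS limit
def api_call(i):
    ...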

Slow brute force program in python

So here's the problem: our security teacher made a site that requires authentication and then asks for a code (4 characters) so that you can access a file. He told us to write a brute force program in Python (any library we want) that can find the code. To start, I wanted to make a program that tries random combinations on that code field just to get an idea of the time per request (I'm using the requests library), and the result was disappointing: each request takes around 8 seconds.
With some calculations: there are 36^4 = 1,679,616 possible combinations, and at about 8 seconds per request that is roughly 13,436,928 seconds, i.e. around 155.52 days.
I would really appreciate it if anyone could help me make this faster (he told us that it is possible to try around 1200 combinations per second).
Here's my code:
import requests
import time
import random

def gen():
    alphabet = "abcdefghijklmnopqrstuvwxyz0123456789"
    pw_length = 4
    mypw = ""
    for i in range(pw_length):
        next_index = random.randrange(len(alphabet))
        mypw = mypw + alphabet[next_index]
    return mypw

t0 = time.clock()
t1 = time.time()
cookie = {'ig': 'b0b5294376ef12a219147211fc33d7bb'}
for i in range(0, 5):
    t2 = time.clock()
    t3 = time.time()
    values = {'RECALL': gen()}
    r = requests.post('http://www.example.com/verif.php', stream=True, cookies=cookie, data=values)
    print("##################################")
    print("cpu time for req ", i, ":", time.clock() - t2)
    print("wall time for req ", i, ":", time.time() - t3)
    print("##################################")
print("##################################")
print("Total cpu time:", time.clock() - t0)
print("Total wall time:", time.time() - t1)
Thank you
One thing you could try is to use a Pool of workers to do multiple requests in parallel, passing a password to each worker. Something like:
import itertools
import requests
from multiprocessing import Pool

# alphabet and cookie as defined in the question; NUMBER_OF_PROCESSES is up to you
def pass_generator():
    for pass_tuple in itertools.product(alphabet, repeat=4):
        yield ''.join(pass_tuple)

def check_password(password):
    values = {'RECALL': password}
    r = requests.post('http://www.example.com/verif.php', stream=True, cookies=cookie, data=values)
    # Check response here.

pool = Pool(processes=NUMBER_OF_PROCESSES)
pool.map(check_password, pass_generator())
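A hedged variant (not part of the original answer) showing how one might stop as soon as a candidate succeeds; is_correct is a hypothetical helper you would write to inspect the actual site's response:

def try_password(password):
    values = {'RECALL': password}
    r = requests.post('http://www.example.com/verif.php', stream=True, cookies=cookie, data=values)
    return password, is_correct(r)  # hypothetical check of the response

if __name__ == '__main__':
    with Pool(processes=16) as pool:
        # imap_unordered yields results as they finish, so we can stop on the first hit
        for password, ok in pool.imap_unordered(try_password, pass_generator(), chunksize=64):
            if ok:
                print("Found code:", password)
                pool.terminate()
                break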

Task Chaining with Cursor issue on app engine. Exception: Too big query offset. Anyone else get this issue?

I'm not sure if anyone else has this problem, but I'm getting an exception "Too big query offset" when using a cursor for chaining tasks on the App Engine development server (not sure if it happens in production).
The error occurs when requesting a cursor after 4000+ records have been processed in a single query.
I wasn't aware that offsets had anything to do with cursors; perhaps it's just a quirk in the App Engine SDK.
To fix it, either shorten the time allowed before the task is deferred (so fewer records get processed at a time), or, when checking the elapsed time, also check that the number of records processed is still within range, e.g. if time.time() > end_time or count == 2000, then reset count and defer the task (see the sketch below). 2000 is an arbitrary number; I'm not sure what the limit should be.
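In other words, the deferral condition in the loop further down becomes something like:

if time.time() > end_time or count >= 2000:  # 2000 is an arbitrary cap on records per run
    count = 0
    cursor = transacts.cursor()
    deferred.defer(test_cursor_task, cursor)
    return False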
EDIT:
After making the above-mentioned changes, the task never finishes executing. The with_cursor(cursor) code is being called, but it seems to start from the beginning each time. Am I missing something obvious?
The code that causes the exception is as follows:
The table "Transact" has 4800 rows. The error occurs when transacts.cursor() is called when time.time() > end_time is true. 4510 records have been processed at the time when the cursor is requested, which seems to cause the error (on development server, haven't tested elsewhere).
def some_task(trans):
    tts = db.get(trans)
    for t in tts:
        #logging.info('in some_task')
        pass

def test_cursor(request):
    ret = test_cursor_task()

def test_cursor_task(cursor=None):
    startDate = datetime.datetime(2010, 7, 30)
    endDate = datetime.datetime(2010, 8, 30)
    end_time = time.time() + 20.0
    transacts = Transact.all().filter('transactionDate >', startDate).filter('transactionDate <=', endDate)
    count = 0
    if cursor:
        transacts.with_cursor(cursor)
    trans = []
    logging.info('queue_trans')
    for tran in transacts:
        count += 1
        #trans.append(str(tran))
        trans.append(str(tran.key()))
        if len(trans) == 20:
            deferred.defer(some_task, trans, _countdown=500)
            trans = []
        if time.time() > end_time:
            logging.info(count)
            if len(trans) > 0:
                deferred.defer(some_task, trans, _countdown=500)
                trans = []
            logging.info('time limit exceeded setting next call to queue')
            cursor = transacts.cursor()
            deferred.defer(test_cursor_task, cursor)
            logging.info('returning false')
            return False
    return True
    return HttpResponse('')
Hope this helps someone.
Thanks
Bert
Try this again without using the iterator functionality:
#...
CHUNK = 500
objs = transacts.fetch(CHUNK)
for tran in objs:
    do_your_stuff
if len(objs) == CHUNK:
    deferred.defer(my_task_again, cursor=str(transacts.cursor()))
This works for me.
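A minimal sketch of how the chained task might look under that approach, assuming the Transact model and the deferred library from the question; my_task_again, startDate, endDate and do_your_stuff are placeholders:

from google.appengine.ext import deferred

CHUNK = 500

def my_task_again(cursor=None):
    transacts = Transact.all().filter('transactionDate >', startDate).filter('transactionDate <=', endDate)
    if cursor:
        transacts.with_cursor(cursor)  # resume where the previous batch stopped
    objs = transacts.fetch(CHUNK)      # fetch a bounded batch instead of iterating the query
    for tran in objs:
        do_your_stuff(tran)
    if len(objs) == CHUNK:             # a full batch means there may be more rows to process
        deferred.defer(my_task_again, cursor=str(transacts.cursor()))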
