I have code that needs to make HTTP requests (let's say 1000). So far I have tried three approaches, each with 50 HTTP requests. The results and code are below.
The fastest approach uses threads; the issue is that I lose some data (from what I understood, due to the GIL). My questions are the following:
My understanding is that the correct approach in this case is to use multiprocessing. Is there any way I can improve the speed of that approach? Matching the threading time would be great.
I would guess that the more links I have, the more time the serial and threading approaches would take, while the multiprocessing approach would slow down much more gradually. Do you have any source that would allow me to estimate the time it would take to run the code with n links?
Serial - Time To Run around 10 seconds
import queue
import requests

def get_data(link, **kwargs):
    data = requests.get(link)
    # When a queue is supplied (threaded version), push the response onto it
    if "queue" in kwargs and isinstance(kwargs["queue"], queue.Queue):
        kwargs["queue"].put(data)
    else:
        return data

links = [link_1, link_2, ..., link_n]
matrix = []
for link in links:
    matrix.append(get_data(link))
Threads - Time To Run around 0.8 of a second
import threading

def get_data_thread(links):
    q = queue.Queue()
    for link in links:
        data = threading.Thread(target = get_data, args = (link, ), kwargs = {"queue" : q})
        data.start()
    data.join()
    return q

matrix = []
q = get_data_thread(links)
while not q.empty():
    matrix.append(q.get())
Multiprocessing - Time To Run around 5 seconds
import multiprocessing as mp

def get_data_pool(links):
    p = mp.Pool()
    data = p.map(get_data, links)
    return data

if __name__ == "__main__":
    matrix = get_data_pool(links)
If I were to suggest anything, I would go with aiohttp. A sketch of the code:
import aiohttp
import asyncio

async def fetch(session, link):
    async with session.get(link) as resp:
        # read the body while the connection is still open
        return await resp.text()

async def main(links):
    # one session shared by all requests; gather() runs them concurrently
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, link) for link in links))

if __name__ == "__main__":
    links = [link_1, link_2, ..., link_n]
    matrix = asyncio.run(main(links))
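Separately from the aiohttp route, the lost results in the threaded version are probably not caused by the GIL. If join() is called only on the last thread started, the queue can be drained while other requests are still in flight, and those responses never reach matrix. A minimal sketch of joining every thread before reading the queue follows. It assumes a stand-in fake_get function (hypothetical, it just sleeps) in place of requests.get so that it runs without a network:

```python
import queue
import threading
import time

def fake_get(link):
    # Hypothetical stand-in for requests.get(link): sleeps, then returns a string
    time.sleep(0.01)
    return f"response for {link}"

def get_data(link, q, fetch=fake_get):
    # Each thread puts exactly one result on the shared queue
    q.put(fetch(link))

def get_data_threads(links, fetch=fake_get):
    q = queue.Queue()
    threads = [threading.Thread(target=get_data, args=(link, q, fetch))
               for link in links]
    for t in threads:
        t.start()
    # Wait for every thread, not just the last one, before draining the queue
    for t in threads:
        t.join()
    results = []
    while not q.empty():
        results.append(q.get())
    return results

links = [f"link_{i}" for i in range(50)]
matrix = get_data_threads(links)
print(len(matrix))
```

Results arrive in completion order, not in the order of links; if the order matters, put (link, data) pairs on the queue instead.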
Related
Here I am basically calling the API with various values taken from the list list_of_string_ids.
I am expecting to create 20 threads, tell them to do something, write the values to the DB, and then have them all return zero and go on to take the next data, and so on.
I have a problem getting this to work using threading. Below is code that works correctly as expected; however, it takes very long to finish execution (around 45 minutes or more). The website I am getting the data from allows async I/O at a rate of 20 requests at once.
I assume this can make my code 20x faster, but I am not really sure how to implement it.
import requests
import json
import time
import threading
import queue

headers = {'Content-Type': 'application/json',
           'Authorization': 'Bearer TOKEN'}
start = time.perf_counter()
project_id_number = 123
project_id_string = 'pjiji4533'
name = "Assignment"
list_of_string_ids = [132,123,5345,123,213,213,...,n] # Len of list is 20000

def construct_url_threaded(project_id_number, id_string):
    # Plain string with .format(); an f-string with empty {} is a syntax error
    url = "https://api.test.com/{}/{}".format(project_id_number, id_string)
    r = requests.get(url, headers=headers)  # Max rate allowed is 20 requests at once.
    json_text = r.json()
    comments = json.dumps(json_text, indent=2)
    for item in json_text['data']:
        # DO STUFF
        pass

for string_id in list_of_string_ids:
    construct_url_threaded(project_id_number=project_id_number, id_string=string_id)
My attempt is below:
def main():
    q = queue.Queue()
    threads = [threading.Thread(target=create_url_threaded, args=(project_id_number,string_id, q)) for i in range(5) ] #5 is for testing
    for th in threads:
        th.daemon = True
        th.start()
    result1 = q.get()
    result2 = q.get()
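One way to keep at most 20 requests in flight is a thread pool with 20 workers. The sketch below is only an illustration under assumptions: fetch_one is a hypothetical stand-in that sleeps instead of calling requests.get, and MAX_CONCURRENT and fetch_all are names introduced here, not part of the original code.

```python
import time
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT = 20  # the limit the question says the API allows

def fetch_one(project_id_number, id_string):
    # Hypothetical stand-in for requests.get(url, headers=headers).json()
    time.sleep(0.01)
    return {"data": [{"project": project_id_number, "id": id_string}]}

def fetch_all(project_id_number, string_ids, fetch=fetch_one):
    # The pool never runs more than MAX_CONCURRENT calls at the same time,
    # and map() yields results in the same order as string_ids.
    with ThreadPoolExecutor(max_workers=MAX_CONCURRENT) as pool:
        return list(pool.map(lambda s: fetch(project_id_number, s), string_ids))

sample_ids = list(range(100))
results = fetch_all(123, sample_ids)
print(len(results))
```

Whether this gets close to a 20x speedup depends on how the server responds under load. Writing to the database from the main thread, after the results come back, avoids sharing one connection between threads.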
So here is my use case:
I read from a database rows containing information to make a complex SOAP call (I'm using zeep to do these calls).
One row from the database corresponds to a request to the service.
There can be up to 20 thousand lines, so I don't want to read everything in memory before making the calls.
I need to process the responses: when the response is OK, I need to store some returned information back into my database, and when there is an exception I need to process the exception for that particular request/response pair.
I need also to capture some external information at the time of the request creation, so that I know where to store the response from the request. In my current code I'm using the delightful property of gather() that makes the results come in the same order.
I read the relevant PEPs and Python documentation but I'm still very confused, as there seem to be multiple ways to solve the same problem.
I also went through countless exercises on the web, but the examples are all trivial: it's either asyncio.sleep() or some web scraping with a finite list of URLs.
The solution that I have come up with so far kinda works. The asyncio.gather() method is very, very useful, but I have not been able to 'feed' it from a generator. I'm currently just counting to an arbitrary size and then starting a .gather() operation. I've transcribed the code, with the boring parts left out, and I've tried to anonymise it.
I've tried solutions involving semaphores, queues, and different event loops, but I'm failing every time. Ideally I'd like to be able to create Futures 'continuously'. I think I'm missing the logic of 'convert this awaitable call to a future'.
I'd be grateful for any help!
import asyncio
import logging
import time
from asyncio import Future
from typing import Tuple

import zeep
from zeep.plugins import HistoryPlugin

history = HistoryPlugin()
max_concurrent_calls = 5
provoke_errors = True

def export_data_async(db_variant: str, order_nrs: set):
    st = time.time()
    results = []
    loop = asyncio.get_event_loop()

    def get_client1(service_name: str, system: Systems = Systems.ACME) -> Tuple[zeep.Client, zeep.client.Factory]:
        client1 = zeep.Client(wsdl=system.wsdl_url(service_name=service_name),
                              transport=transport,
                              plugins=[history],
                              )
        factory_ns2 = client1.type_factory(namespace='ns2')
        return client1, factory_ns2

    table = 'ZZZZ'
    moveback_table = 'EEEEEE'
    moveback_dict = create_default_empty_ordered_dict('attribute1 attribute2 attribute3 attribute3')
    client, factory = get_client1(service_name='ACMEServiceName')
    if log.isEnabledFor(logging.DEBUG):
        client.wsdl.dump()
    zeep_log = logging.getLogger('zeep.transports')
    zeep_log.setLevel(logging.DEBUG)

    with Db(db_variant) as db:
        db.open_db(CON_STRING[db_variant])
        db.init_table_for_read(table, order_list=order_nrs)
        counter_failures = 0
        tasks = []
        sids = []
        results = []

        def handle_future(future: Future) -> None:
            results.extend(future.result())

        def process_tasks_concurrently() -> None:
            # Run the current batch, then store each response or log each fault
            nonlocal tasks, sids, counter_failures, results
            futures = asyncio.gather(*tasks, return_exceptions=True)
            futures.add_done_callback(handle_future)
            loop.run_until_complete(futures)
            for i, response_or_fault in enumerate(results):
                if type(response_or_fault) in [zeep.exceptions.Fault, zeep.exceptions.TransportError]:
                    counter_failures += 1
                    log_webservice_fault(sid=sids[i], db=db, err=response_or_fault, object=table)
                else:
                    db.write_dict_to_table(
                        moveback_table,
                        {'sid': sids[i],
                         'attribute1': response_or_fault['XXX']['XXX']['xxx'],
                         'attribute2': response_or_fault['XXX']['XXX']['XXXX']['XXX'],
                         'attribute3': response_or_fault['XXXX']['XXXX']['XXX'],
                         }
                    )
            db.commit_db_con()
            tasks = []
            sids = []
            results = []
            return

        for row in db.rows(table):
            if int(row.id) % 2 == 0 and provoke_errors:
                payload = faulty_message_payload(row=row,
                                                 factory=factory,
                                                 )
            else:
                payload = message_payload(row=row,
                                          factory=factory,
                                          )
            tasks.append(client.service.myRequest(
                MessageHeader=factory.MessageHeader(**message_header_arguments(row=row)),
                myRequestPayload=payload,
                _soapheaders=[security_soap_header],
            ))
            sids.append(row.sid)
            if len(tasks) == max_concurrent_calls:
                process_tasks_concurrently()
        if tasks:  # this is the remainder of len(db.rows) % max_concurrent_calls
            process_tasks_concurrently()
        loop.run_until_complete(transport.session.close())
        db.execute_this_statement(statement=update_sql)
        db.commit_db_con()
        log.info(db.activity_log)
        if counter_failures:
            log.info(f"{table :<25} Count failed: {counter_failures}")
        print("time async: %.2f" % (time.time() - st))
    return results
Failed attempt with Queue: (blocks at await client.service)
loop = asyncio.get_event_loop()
counter = 0
results = []

async def payload_generator(db_variant: str, order_nrs: set):
    # code that generates the data for the request
    yield counter, row, payload

async def service_call_worker(queue, results):
    while True:
        counter, row, payload = await queue.get()
        results.append(await client.service.myServicename(
            MessageHeader=calculate_message_header(row=row),
            myPayload=payload,
            _soapheaders=[security_soap_header],
            )
        )
        print(colorama.Fore.BLUE + f'after result returned {counter}')
        # Here do the relevant processing of response or error
        queue.task_done()

async def main_with_q():
    n_workers = 3
    queue = asyncio.Queue(n_workers)
    e = pprint.pformat(queue)
    p = payload_generator(DB_VARIANT, order_list_from_args())
    results = []
    workers = [asyncio.create_task(service_call_worker(queue, results))
               for _ in range(n_workers)]
    async for c in p:
        await queue.put(c)
    await queue.join()  # wait for all tasks to be processed
    for worker in workers:
        worker.cancel()

if __name__ == '__main__':
    try:
        loop.run_until_complete(main_with_q())
        loop.run_until_complete(transport.session.close())
    finally:
        loop.close()
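For comparison, here is a minimal, self-contained sketch of the bounded worker-pool pattern the attempt above is aiming for: a lazy row source feeds a bounded asyncio.Queue, and a fixed number of workers make the calls and handle each response or fault together with its own sid. Everything service-related is an assumption made for illustration: fake_service_call stands in for the awaited SOAP call, rows() stands in for db.rows(table), and ValueError stands in for the zeep fault types.

```python
import asyncio

MAX_CONCURRENT_CALLS = 5

async def fake_service_call(payload):
    # Hypothetical stand-in for the awaited SOAP request
    await asyncio.sleep(0.01)
    if payload % 2 == 0:
        raise ValueError(f"fault for payload {payload}")
    return {"echo": payload}

def rows():
    # Stand-in for db.rows(table): yields (sid, payload) one row at a time
    for n in range(23):
        yield f"sid-{n}", n

async def worker(queue, handled):
    while True:
        sid, payload = await queue.get()
        try:
            try:
                response = await fake_service_call(payload)
                handled.append((sid, "ok", response))   # store response here
            except ValueError as err:
                handled.append((sid, "fault", err))     # log the fault here
        finally:
            queue.task_done()

async def main():
    # The bounded queue keeps only a few rows in memory at any time
    queue = asyncio.Queue(maxsize=MAX_CONCURRENT_CALLS)
    handled = []
    workers = [asyncio.create_task(worker(queue, handled))
               for _ in range(MAX_CONCURRENT_CALLS)]
    for item in rows():
        await queue.put(item)  # waits while the queue is full
    await queue.join()         # every queued row has been handled
    for w in workers:
        w.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return handled

handled = asyncio.run(main())
print(len(handled))
```

Because every queue item carries its sid, the response can be matched to its row without relying on the result ordering of gather(). This pattern only overlaps calls if the service call is a real coroutine; a blocking call awaited inside a worker would stall the whole event loop.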
I'm trying to submit around 150 million jobs to celery using the following code:
from celery import chain
from .task_receiver import do_work,handle_results,get_url

urls = '/home/ubuntu/celery_main/urls'

if __name__ == '__main__':
    fh = open(urls,'r')
    alldat = fh.readlines()
    fh.close()
    for line in alldat:
        try:
            result = chain(get_url.s(line[:-1]),do_work.s(line[:-1])).apply_async()
        except:
            print ("failed to submit job")
        print('task submitted ' + str(line[:-1]))
Would it be faster to split the file into chunks and run multiple instances of this code? Or what can I do? I'm using memcached as the backend, rabbitmq as the broker.
import multiprocessing
from celery import chain
from .task_receiver import do_work,handle_results,get_url

urls = '/home/ubuntu/celery_main/urls'
num_workers = 200

def worker(urls,id):
    """worker function"""
    for url in urls:
        print ("%s - %s" % (id,url))
        result = chain(get_url.s(url),do_work.s(url)).apply_async()
    return

if __name__ == '__main__':
    fh = open(urls,'r')
    alldat = fh.readlines()
    fh.close()
    jobs = []
    stack = []
    id = 0
    for i in alldat:
        stack.append(i[:-1])
        # Start a process once the current chunk is full
        if len(stack) >= len(alldat) / num_workers:
            id = id + 1
            p = multiprocessing.Process(target=worker, args=(stack,id,))
            jobs.append(p)
            p.start()
            stack = []
    if stack:  # submit the leftover URLs that did not fill a whole chunk
        id = id + 1
        p = multiprocessing.Process(target=worker, args=(stack,id,))
        jobs.append(p)
        p.start()
    for j in jobs:
        j.join()
If I understand your problem correctly:
you have a list of 150M urls
you want to run get_url() then do_work() on each of the urls
so you have two issues:
going over the 150M urls
queuing the tasks
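Regarding the main for loop in your code: yes, you could make it faster with multithreading. Your main thread could read the file and pass chunks of it to sub-threads that create the Celery tasks. Note that in CPython, threads mainly help here because submitting a task is I/O-bound (a round trip to the broker); to make use of several CPU cores you would need separate processes, for example via multiprocessing.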
Check the guide and the documentation:
https://realpython.com/intro-to-python-threading/
https://docs.python.org/3/library/threading.html
Now imagine you have one worker receiving these tasks. The code generates 150M chains and pushes them to the queue. Each chain consists of get_url() followed by do_work(), and the worker picks up the next chain only when do_work() finishes.
If get_url() takes a short time and do_work() takes a long time, the worker alternates quick task, slow task, and the total time is:
t_total_per_worker = (t_get_url_average + t_do_work_average) x 150M
If you have n workers:
t_total = t_total_per_worker / n
t_total = (t_get_url_average + t_do_work_average) x 150M / n
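To make the formula concrete, here is a small worked example. The per-task timings (0.25 s and 0.75 s) and the worker counts are made-up numbers for illustration only, and estimate_total_seconds is a name introduced here; measure your own averages before relying on any estimate.

```python
def estimate_total_seconds(t_get_url, t_do_work, n_tasks, n_workers):
    # t_total = (t_get_url_average + t_do_work_average) x n_tasks / n
    return (t_get_url + t_do_work) * n_tasks / n_workers

N_TASKS = 150_000_000
SECONDS_PER_DAY = 86_400

# Hypothetical averages: get_url() 0.25 s, do_work() 0.75 s
one_worker = estimate_total_seconds(0.25, 0.75, N_TASKS, 1)
hundred_workers = estimate_total_seconds(0.25, 0.75, N_TASKS, 100)

print(one_worker / SECONDS_PER_DAY, "days with 1 worker")
print(hundred_workers / SECONDS_PER_DAY, "days with 100 workers")
```

The estimate ignores broker overhead and assumes the workers never sit idle, so treat it as a lower bound.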
Now if get_url() is time critical while do_work() is not, then, if you can, you should run all 150M get_url() first and when that is done run all 150M do_work(), but that may require changes to your process design.
That is what I would do. Maybe others have better ideas!?
I am making a web scraper to build a database. The site I plan to use has index pages, each containing 50 links. The number of pages to parse is estimated at 60K or more, which is why I want to implement multiprocessing.
Here is some pseudo-code of what I want to do:
def harvester(index):
    main=dict()
    ....
    links = foo.findAll ( 'a')
    for link in links:
        main.append(worker(link))
        # or maybe something like: map_async(worker(link))

def worker(url):
    ''' this function gather the data from the given url'''
    return dictionary
Now what I want is to have a certain number of worker functions gathering data in parallel on different pages. This data would then be appended to a big dictionary located in harvester, or written directly to a CSV file by the worker function.
I'm wondering how I can implement parallelism. I have done a fair amount of research on using gevent, threading and multiprocessing, but I am not sure how to implement it.
I am also not sure whether appending data to a large dictionary or writing directly to a CSV using DictWriter will be stable with that many inputs at the same time.
Thanks
I suggest splitting your work into separate workers that communicate via queues.
Here you mostly have I/O wait time (crawling, CSV writing).
So you can do the following (not tested, just see the idea):
import csv
import queue  # this module is named Queue in Python 2
import threading

# Placeholder: replace with the keys of the dicts returned by extract_data
FIELDNAMES = ['url', 'title']

class CsvWriter(threading.Thread):
    def __init__(self, resultq):
        super(CsvWriter, self).__init__()
        self.resultq = resultq
        self.outfile = open('results.csv', 'w', newline='')
        self.writer = csv.DictWriter(self.outfile, fieldnames=FIELDNAMES)

    def run(self):
        done = False
        while not done:
            row = self.resultq.get()
            if row != -1:
                self.writer.writerow(row)
            else:
                done = True
        self.outfile.close()

class Crawler(threading.Thread):
    def __init__(self, inputq, resultq):
        super(Crawler, self).__init__()
        self.iq = inputq
        self.oq = resultq

    def run(self):
        done = False
        while not done:
            link = self.iq.get()
            if link != -1:
                result = self.extract_data(link)
                self.oq.put(result)
            else:
                done = True

    def extract_data(self, link):
        # crawl and extract what you need and return a dict
        pass

def main():
    linkq = queue.Queue()
    for url in your_urls:
        linkq.put(url)
    resultq = queue.Queue()
    writer = CsvWriter(resultq)
    writer.start()
    crawlers = [Crawler(linkq, resultq) for _ in range(10)]
    for c in crawlers:
        c.start()
    # One -1 sentinel per crawler tells each of them to stop
    for _ in crawlers:
        linkq.put(-1)
    for c in crawlers:
        c.join()
    resultq.put(-1)  # all crawlers are done, so stop the writer
    writer.join()
This code should work (apart from possible typos), and it exits once all the URLs have been processed.
So here's the problem: our security teacher made a site that requires authentication and then asks for a code (4 characters) before you can access a file. He told us to write a brute-force program in Python (any library we want) that can find the password. To do that, I first wanted to make a program that tries random combinations on that code field, just to get an idea of the time each request takes (I'm using the requests library). The result was disappointing: each request takes around 8 seconds.
With some calculations: 36^4 = 1,679,616 possible combinations, which at 8 seconds each is 13,436,928 seconds, so my program would take around 155.52 days.
I would really appreciate it if anyone could help me make that faster (he told us that it is possible to try around 1200 combinations per second).
Here's my code:
import requests
import time
import random

def gen():
    alphabet = "abcdefghijklmnopqrstuvwxyz0123456789"
    pw_length = 4
    mypw = ""
    for i in range(pw_length):
        next_index = random.randrange(len(alphabet))
        mypw = mypw + alphabet[next_index]
    return mypw

# time.clock() was removed in Python 3.8; process_time() measures CPU time
t0 = time.process_time()
t1 = time.time()
cookie = {'ig': 'b0b5294376ef12a219147211fc33d7bb'}
for i in range(0,5):
    t2 = time.process_time()
    t3 = time.time()
    values = {'RECALL':gen()}
    r = requests.post('http://www.example.com/verif.php', stream=True, cookies=cookie, data=values)
    print("##################################")
    print("cpu time for req ",i,":", time.process_time()-t2)
    print("wall time for req ",i,":", time.time()-t3)
    print("##################################")
print("##################################")
print("Total cpu time:", time.process_time()-t0)
print("Total wall time:", time.time()-t1)
Thank you
One thing you could try is a Pool of workers that makes multiple requests in parallel, passing one password to each worker. Something like:
import itertools
from multiprocessing import Pool

import requests

alphabet = "abcdefghijklmnopqrstuvwxyz0123456789"  # same alphabet as in the question
cookie = {'ig': 'b0b5294376ef12a219147211fc33d7bb'}  # cookie from the question

def pass_generator():
    for pass_tuple in itertools.product(alphabet, repeat=4):
        yield ''.join(pass_tuple)

def check_password(password):
    values = {'RECALL': password}
    r = requests.post('http://www.example.com/verif.php', stream=True, cookies=cookie, data=values)
    # Check response here.

if __name__ == '__main__':
    pool = Pool(processes=NUMBER_OF_PROCESSES)
    pool.map(check_password, pass_generator())
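As a sanity check on the size of the search space, the sketch below enumerates every 4-character code over the question's alphabet and estimates the running time at a given request rate. It makes no network requests; candidates and days_needed are names introduced here for illustration.

```python
import itertools

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"
CODE_LENGTH = 4
SECONDS_PER_DAY = 86_400

def candidates():
    # Yields every possible code exactly once, in a fixed order
    for chars in itertools.product(ALPHABET, repeat=CODE_LENGTH):
        yield "".join(chars)

def days_needed(n_candidates, requests_per_second):
    return n_candidates / requests_per_second / SECONDS_PER_DAY

total = sum(1 for _ in candidates())
print(total)                                       # number of codes to try
print(days_needed(total, 1 / 8))                   # one request every 8 seconds
print(days_needed(total, 1200) * 24 * 60)          # minutes at 1200 requests/second
```

Enumerating the codes in order, as pass_generator() does, guarantees that each one is tried once, unlike random generation, which repeats candidates. At the 1200 requests per second mentioned in the question, the whole space takes roughly 23 minutes.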