I'm trying to create a script that sends over 1000 requests to one page at the same time, using the requests library with 1000 threads. It seems to do the first 50 or so requests all within 1 second, whereas the other 9950 take considerably longer. I measured it like this:
def print_to_cmd(strinng):
    queueLock.acquire()
    print strinng
    queueLock.release()

start = time.time()
resp = requests.get('http://test.net/', headers=header)
end = time.time()

print_to_cmd(str(end-start))
I'm thinking the requests library is limiting how fast they are getting sent.
Does anybody know a way in Python to send the requests all at the same time? I have a VPS with 200 Mb upload, so that is not the issue; it's something to do with Python or the requests library limiting it. They all need to hit the website within 1 second of each other.
Thanks for reading and I hope somebody can help.
I have generally found that the best solution is to use an asynchronous library like tornado. The easiest solution I have found, however, is to use ThreadPoolExecutor.
import requests
from concurrent.futures import ThreadPoolExecutor

def get_url(url):
    return requests.get(url)

with ThreadPoolExecutor(max_workers=50) as pool:
    print(list(pool.map(get_url, list_of_urls)))
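If you do want to go the tornado route mentioned above, a minimal sketch could look like the following. This assumes tornado 5+ (so it cooperates with asyncio) and reuses the list_of_urls name from the example; the max_clients value is an illustrative assumption:

import asyncio
from tornado.httpclient import AsyncHTTPClient

# Cap the number of simultaneous connections the client keeps open (illustrative value)
AsyncHTTPClient.configure(None, max_clients=50)

async def fetch_all(urls):
    client = AsyncHTTPClient()
    # raise_error=False returns the response object even for non-2xx status codes
    responses = await asyncio.gather(*(client.fetch(url, raise_error=False) for url in urls))
    return [resp.code for resp in responses]

print(asyncio.run(fetch_all(list_of_urls)))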
I know this is an old question, but you can now do this using asyncio and aiohttp.
import asyncio
import aiohttp
from aiohttp import ClientSession

async def fetch_html(url: str, session: ClientSession, **kwargs) -> str:
    resp = await session.request(method="GET", url=url, **kwargs)
    resp.raise_for_status()
    return await resp.text()

async def make_requests(url: str, **kwargs) -> None:
    async with ClientSession() as session:
        tasks = []
        for i in range(1, 1000):
            tasks.append(
                fetch_html(url=url, session=session, **kwargs)
            )
        results = await asyncio.gather(*tasks)
        # do something with results

if __name__ == "__main__":
    asyncio.run(make_requests(url='http://test.net/'))
You can read more about it and see an example here.
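One practical note for the example above: aiohttp's ClientSession uses a TCPConnector that is limited to 100 simultaneous connections by default, so not all 1000 requests will actually be in flight at once. If that matters for your test, you can raise or disable the limit when creating the session, roughly like this (sketch; limit=0 disables the cap entirely):

connector = aiohttp.TCPConnector(limit=0)  # 0 disables the per-session connection limit
async with ClientSession(connector=connector) as session:
    ...  # same task creation and asyncio.gather as above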
Assuming you know what you are doing, I first suggest you implement a backoff policy with jitter, to prevent a predictable "thundering herd" against your server (a sketch is at the end of this answer). That said, you should consider doing some threading:
import threading

class FuncThread(threading.Thread):
    def __init__(self, target, *args):
        # Initialise the base Thread first so it does not overwrite these attributes
        threading.Thread.__init__(self)
        self._target = target
        self._args = args

    def run(self):
        self._target(*self._args)
so that you would do something like
t = FuncThread(doApiCall, url)
t.start()
where your method doApiCall is defined like this
def doApiCall(url):
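As for the backoff policy with jitter mentioned at the top of this answer, a minimal sketch could look like this. The retry count, base delay and cap are illustrative assumptions, and requests is only used as an example client:

import random
import time

import requests

def get_with_backoff(url, max_retries=5, base_delay=0.5, cap=30.0):
    for attempt in range(max_retries):
        try:
            resp = requests.get(url, timeout=10)
            if resp.status_code < 500:
                return resp
        except requests.RequestException:
            pass  # fall through to the retry delay below
        # "Full jitter": sleep a random amount between 0 and the exponentially growing cap
        time.sleep(random.uniform(0, min(cap, base_delay * 2 ** attempt)))
    raise RuntimeError("Giving up on {} after {} attempts".format(url, max_retries))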
Related
I have some Python code that I want to speed up using threads, but when I try, the same lines end up duplicated in the output. Is there any way I could speed it up without getting duplicate lines?
code
import requests
import json

f = open("urls.json")
data = json.load(f)

def urls():
    for i in data['urls']:
        r = requests.get("https://" + i)
        print(r.headers)
You can use the ThreadPoolExecutor class from concurrent.futures. It is a more efficient approach than using the Thread class directly.
You can change the max_workers value according to your task.
Here is the piece of code:
import requests
from concurrent.futures import ThreadPoolExecutor
import json

with open("urls.json") as f:
    data = json.load(f)

def urls():
    urls = ["https://" + url for url in data['urls']]
    print(urls)
    with ThreadPoolExecutor(max_workers=5) as pool:
        iterator = pool.map(requests.get, urls)
        for response in iterator:
            print(response.headers)
            print("\n")
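For what it's worth, duplicated lines in a threaded attempt usually mean that each thread was started with the whole urls() function as its target, so every thread fetched every URL. A hypothetical reconstruction of that anti-pattern (not your actual code) would be:

import threading

# Anti-pattern: every thread runs the entire loop, so each URL is requested
# (and its headers printed) once per thread.
threads = [threading.Thread(target=urls) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

With pool.map above, each URL is submitted exactly once, so the output is not duplicated.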
Make async or threaded calls.
So, you would do something like this:
import aiohttp
import asyncio
import time

start_time = time.time()

async def main():
    async with aiohttp.ClientSession() as session:
        for number in range(1, 151):
            pokemon_url = f'https://pokeapi.co/api/v2/pokemon/{number}'
            async with session.get(pokemon_url) as resp:
                pokemon = await resp.json()
                print(pokemon['name'])

asyncio.run(main())
Could also do multiprocessing as per the comment, but async is better for I/O-bound tasks like this.
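Note that the loop above awaits each response before sending the next request, so the calls still run one at a time. To actually overlap them, you can create all the coroutines first and run them with asyncio.gather; a minimal sketch against the same pokeapi URL:

import aiohttp
import asyncio

async def fetch_pokemon(session, number):
    url = f'https://pokeapi.co/api/v2/pokemon/{number}'
    async with session.get(url) as resp:
        data = await resp.json()
        return data['name']

async def main():
    async with aiohttp.ClientSession() as session:
        # Schedule all 150 requests together and wait for them to finish
        names = await asyncio.gather(*(fetch_pokemon(session, n) for n in range(1, 151)))
        print(names)

asyncio.run(main())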
I have some HTML pages that I am trying to extract the text from using asynchronous web requests through aiohttp and asyncio; after extracting the text I save the files locally. I am using BeautifulSoup (under extract_text()) to process the text from the response and extract the relevant text within the HTML page (excluding the code, etc.), but I am facing an issue where my synchronous version of the script is faster than my asynchronous + multiprocessing version.
As I understand it, using the BeautifulSoup function causes the main event loop to block within parse(), so based on these two StackOverflow questions [0, 1], I figured the best thing to do was to run extract_text() within its own process (as it is a CPU-bound task), which should prevent the event loop from blocking.
This results in the script taking 1.5x longer than the synchronous version (with no multiprocessing).
To confirm that this was not an issue with my implementation of the asynchronous code, I removed the use of extract_text() and instead saved the raw text from the response object. Doing this resulted in my asynchronous code being much faster, showing that the issue is purely from extract_text() being run in a separate process.
Am I missing some important detail here?
import asyncio
from asyncio import Semaphore
import json
import logging
from pathlib import Path
from typing import List, Optional

import aiofiles
from aiohttp import ClientSession
import aiohttp
from bs4 import BeautifulSoup
import concurrent.futures
import functools


def extract_text(raw_text: str) -> str:
    return " ".join(BeautifulSoup(raw_text, "html.parser").stripped_strings)


async def fetch_text(
    url: str,
    session: ClientSession,
    semaphore: Semaphore,
    **kwargs: dict,
) -> str:
    async with semaphore:
        response = await session.request(method="GET", url=url, **kwargs)
        response.raise_for_status()
        logging.info("Got response [%s] for URL: %s", response.status, url)
        text = await response.text(encoding="utf-8")
        return text


async def parse(
    url: str,
    session: ClientSession,
    semaphore: Semaphore,
    **kwargs,
) -> Optional[str]:
    try:
        text = await fetch_text(
            url=url,
            session=session,
            semaphore=semaphore,
            **kwargs,
        )
    except (
        aiohttp.ClientError,
        aiohttp.http_exceptions.HttpProcessingError,
    ) as e:
        logging.error(
            "aiohttp exception for %s [%s]: %s",
            url,
            getattr(e, "status", None),
            getattr(e, "message", None),
        )
    except Exception as e:
        logging.exception(
            "Non-aiohttp exception occurred: %s",
            getattr(e, "__dict__", None),
        )
    else:
        loop = asyncio.get_running_loop()
        with concurrent.futures.ProcessPoolExecutor() as pool:
            extract_text_ = functools.partial(extract_text, text)
            text = await loop.run_in_executor(pool, extract_text_)
        logging.info("Found text for %s", url)
        return text


async def process_file(
    url: dict,
    session: ClientSession,
    semaphore: Semaphore,
    **kwargs: dict,
) -> None:
    category = url.get("category")
    link = url.get("link")
    if category and link:
        text = await parse(
            url=f"{URL}/{link}",
            session=session,
            semaphore=semaphore,
            **kwargs,
        )
        if text:
            save_path = await get_save_path(
                link=link,
                category=category,
            )
            await write_file(html_text=text, path=save_path)
        else:
            logging.warning("Text for %s not found, skipping it...", link)


async def process_files(
    html_files: List[dict],
    semaphore: Semaphore,
) -> None:
    async with ClientSession() as session:
        tasks = [
            process_file(
                url=file,
                session=session,
                semaphore=semaphore,
            )
            for file in html_files
        ]
        await asyncio.gather(*tasks)


async def write_file(
    html_text: str,
    path: Path,
) -> None:
    # Write to file using aiofiles
    ...


async def get_save_path(link: str, category: str) -> Path:
    # return path to save
    ...


async def main_async(
    num_files: Optional[int],
    semaphore_count: int,
) -> None:
    html_files = # get all the files to process
    semaphore = Semaphore(semaphore_count)
    await process_files(
        html_files=html_files,
        semaphore=semaphore,
    )


if __name__ == "__main__":
    NUM_FILES = # passed through CLI args
    SEMAPHORE_COUNT = # passed through CLI args
    asyncio.run(
        main_async(
            num_files=NUM_FILES,
            semaphore_count=SEMAPHORE_COUNT,
        )
    )
SnakeViz charts across 1000 samples:
Async version with extract_text and multiprocessing
Async version without extract_text
Sync version with extract_text (notice how the html_parser from BeautifulSoup takes up the majority of the time here)
Sync version without extract_text
Here is roughly what your asynchronous program does:
Launch num_files parse() tasks concurrently
Each parse() task creates its own ProcessPoolExecutor and asynchronously awaits extract_text (which is executed in the previously created process pool).
This is suboptimal for several reasons:
It creates num_files process pools, which are expensive to create and take memory
Each pool is only used for one single operation, which is counterproductive: as many concurrent operations as possible should be submitted to a given pool
You are creating a new ProcessPoolExecutor each time the parse() function is called. You could try to instantiate it once (as a global for instance, or passed through a function argument):
from concurrent.futures import ProcessPoolExecutor

async def parse(loop, executor, ...):
    ...
    text = await loop.run_in_executor(executor, extract_text, text)

# and then in `process_file` (or `process_files`):

async def process_file(...):
    ...
    loop = asyncio.get_running_loop()
    with ProcessPoolExecutor() as executor:
        ...
        await parse(loop, executor, ...)
I benchmarked the overhead of creating a ProcessPoolExecutor on my old MacBook Air 2015 and it shows that it is quite slow (almost 100 ms for pool creation, opening, submit and shutdown):
from time import perf_counter
from concurrent.futures import ProcessPoolExecutor


def noop():
    # Module-level no-op: lambdas cannot be pickled and sent to a process pool
    return None


def main_1():
    """Pool created once"""
    reps = 100
    t1 = perf_counter()
    with ProcessPoolExecutor() as executor:
        for _ in range(reps):
            executor.submit(noop)
    t2 = perf_counter()
    print(f"{(t2 - t1) / reps * 1_000} ms")  # 2 ms/it


def main_2():
    """Pool created at each iteration"""
    reps = 100
    t1 = perf_counter()
    for _ in range(reps):
        with ProcessPoolExecutor() as executor:
            executor.submit(noop)
    t2 = perf_counter()
    print(f"{(t2 - t1) / reps * 1_000} ms")  # 100 ms/it


if __name__ == "__main__":
    main_1()
    main_2()
You may again hoist it up into the process_files function, which avoids recreating the pool for each file.
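A minimal sketch of that hoisting, reusing the names from the question (this assumes process_file and parse are extended to accept and forward the loop and executor arguments):

from concurrent.futures import ProcessPoolExecutor

async def process_files(html_files, semaphore):
    loop = asyncio.get_running_loop()
    # One pool for the whole run, shared by every task
    with ProcessPoolExecutor() as executor:
        async with ClientSession() as session:
            tasks = [
                process_file(url=file, session=session, semaphore=semaphore,
                             loop=loop, executor=executor)
                for file in html_files
            ]
            await asyncio.gather(*tasks)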
Also, try to inspect your first SnakeViz chart more closely in order to know what exactly in process.py:submit is taking that much time.
One last thing, be careful of the semantics of using a context manager on an executor:
from concurrent.futures import ProcessPoolExecutor

with ProcessPoolExecutor() as executor:
    for i in range(100):
        executor.submit(some_work, i)
Not only does this create an executor and submit work to it, but it also waits for all the work to finish before exiting the with statement.
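If you do not want that blocking behaviour, you can manage the executor's lifetime explicitly instead of using the with statement; a sketch using the same some_work example:

from concurrent.futures import ProcessPoolExecutor

executor = ProcessPoolExecutor()
try:
    for i in range(100):
        executor.submit(some_work, i)
finally:
    # Returns immediately; already-submitted work keeps running in the worker processes
    executor.shutdown(wait=False)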
I am trying to open multiple web sessions and save the data into a CSV. I have written my code using a for loop and requests.get, but it is taking very long to access 90 web locations. Can anyone let me know how to run the whole process in parallel for loc_var?
The code is working fine; the only issue is that it runs one by one for each loc_var, which takes a long time.
I want to access all the loc_var URLs in the for loop in parallel and do the CSV write operation.
Below is the Code:
import pandas as pd
import numpy as np
import os
import requests
import datetime
import zipfile

t = datetime.date.today() - datetime.timedelta(2)
server = [("A", "web1", ":5000", "username=usr&password=p7Tdfr")]

'''List of all web_ips'''
web_1 = ["Web1","Web2","Web3","Web4","Web5","Web6","Web7","Web8","Web9","Web10","Web11","Web12","Web13","Web14","Web15"]

'''List of All location'''
loc_var = ["post1","post2","post3","post4","post5","post6","post7","post8","post9","post10","post11","post12","post13","post14","post15","post16","post17","post18"]

for s, web, port, usr in server:
    login_url = 'http://' + web + port + '/api/v1/system/login/?' + usr
    print(login_url)
    s = requests.session()
    login_response = s.post(login_url)
    print("login Response", login_response)

    # Start accessing the Web for each loc_var
    for mkt in loc_var:
        # output is a CSV file
        com_actions_url = 'http://' + web + port + '/api/v1/3E+date(%5C%22' + str(t) + '%5C%22)and+location+%3D%3D+%27' + mkt + '%27%22&page_size=-1&format=%22csv%22'
        print("com_action_url", com_actions_url)
        r = s.get(com_actions_url)
        print("action", r)
        if r.ok == True:
            with open(os.path.join("/home/Reports_DC/", "relation_%s.csv" % mkt), 'wb') as f:
                f.write(r.content)
        # If loc is not accessible, try with another entry from the web_1 list
        if r.ok == False:
            while r.ok == False:
                for web_2 in web_1:
                    login_url = 'http://' + web_2 + port + '/api/v1/system/login/?' + usr
                    com_actions_url = 'http://' + web_2 + port + '/api/v1/3E+date(%5C%22' + str(t) + '%5C%22)and+location+%3D%3D+%27' + mkt + '%27%22&page_size=-1&format=%22csv%22'
                    login_response = s.post(login_url)
                    print("login Response", login_response)
                    print("com_action_url", com_actions_url)
                    r = s.get(com_actions_url)
                    if r.ok == True:
                        with open(os.path.join("/home/Reports_DC/", "relation_%s.csv" % mkt), 'wb') as f:
                            f.write(r.content)
                        break
There are multiple approaches that you can take to make concurrent HTTP requests. Two that I've used are (1) multiple threads with concurrent.futures.ThreadPoolExecutor, or (2) sending the requests asynchronously using asyncio/aiohttp.
To use a thread pool to send your requests in parallel, you would first generate a list of the URLs that you want to fetch in parallel (in your case, a list of login_urls and com_action_urls), and then request all of the URLs concurrently as follows:
from concurrent.futures import ThreadPoolExecutor
import requests

def fetch(url):
    page = requests.get(url)
    return page.text
    # Catch HTTP errors/exceptions here

pool = ThreadPoolExecutor(max_workers=5)

urls = ['http://www.google.com', 'http://www.yahoo.com', 'http://www.bing.com']  # Create a list of urls

for page in pool.map(fetch, urls):
    # Do whatever you want with the results ...
    print(page[0:100])
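For the question's case, the urls list can be built from the variables in the original code before handing it to pool.map, for example (using the same web, port, t and loc_var values):

urls = [
    'http://' + web + port
    + '/api/v1/3E+date(%5C%22' + str(t) + '%5C%22)and+location+%3D%3D+%27'
    + mkt + '%27%22&page_size=-1&format=%22csv%22'
    for mkt in loc_var
]

Keep in mind that the original code logs in through a session object, so you may want fetch() to use that shared, authenticated session rather than a bare requests.get.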
Using asyncio/aiohttp is generally faster than the threaded approach above, but the learning curve is steeper. Here is a simple example (Python 3.7+):
import asyncio
import aiohttp

urls = ['http://www.google.com', 'http://www.yahoo.com', 'http://www.bing.com']

async def fetch(session, url):
    async with session.get(url) as resp:
        return await resp.text()
        # Catch HTTP errors/exceptions here

async def fetch_concurrent(urls):
    loop = asyncio.get_event_loop()
    async with aiohttp.ClientSession() as session:
        tasks = []
        for u in urls:
            tasks.append(loop.create_task(fetch(session, u)))

        for result in asyncio.as_completed(tasks):
            page = await result
            # Do whatever you want with results
            print(page[0:100])

asyncio.run(fetch_concurrent(urls))
But unless you are going to be making a huge number of requests, the threaded approach will likely be sufficient (and way easier to implement).
I am trying to understand how to handle a grpc api with bidirectional streaming (using the Python API).
Say I have the following simple server definition:
syntax = "proto3";

package simple;

service TestService {
  rpc Translate(stream Msg) returns (stream Msg) {}
}

message Msg
{
  string msg = 1;
}
Say that the messages that will be sent from the client come asynchronously (as a consequence of the user selecting some UI elements).
The generated python stub for the client will contain a method Translate that will accept a generator function and will return an iterator.
What is not clear to me is how I would write the generator function that will return messages as they are created by the user. Sleeping on the thread while waiting for messages doesn't sound like the best solution.
This is a bit clunky right now, but you can accomplish your use case as follows:
#!/usr/bin/env python

from __future__ import print_function

import time
import random
import collections
import threading

from concurrent import futures
from concurrent.futures import ThreadPoolExecutor
import grpc

from translate_pb2 import Msg
from translate_pb2_grpc import TestServiceStub
from translate_pb2_grpc import TestServiceServicer
from translate_pb2_grpc import add_TestServiceServicer_to_server


def translate_next(msg):
    return ''.join(reversed(msg))


class Translator(TestServiceServicer):
    def Translate(self, request_iterator, context):
        for req in request_iterator:
            print("Translating message: {}".format(req.msg))
            yield Msg(msg=translate_next(req.msg))


class TranslatorClient(object):
    def __init__(self):
        self._stop_event = threading.Event()
        self._request_condition = threading.Condition()
        self._response_condition = threading.Condition()
        self._requests = collections.deque()
        self._last_request = None
        self._expected_responses = collections.deque()
        self._responses = {}

    def _next(self):
        with self._request_condition:
            while not self._requests and not self._stop_event.is_set():
                self._request_condition.wait()
            if len(self._requests) > 0:
                return self._requests.popleft()
            else:
                raise StopIteration()

    def next(self):
        return self._next()

    def __next__(self):
        return self._next()

    def add_response(self, response):
        with self._response_condition:
            request = self._expected_responses.popleft()
            self._responses[request] = response
            self._response_condition.notify_all()

    def add_request(self, request):
        with self._request_condition:
            self._requests.append(request)
            with self._response_condition:
                self._expected_responses.append(request.msg)
            self._request_condition.notify()

    def close(self):
        self._stop_event.set()
        with self._request_condition:
            self._request_condition.notify()

    def translate(self, to_translate):
        self.add_request(to_translate)
        with self._response_condition:
            while True:
                self._response_condition.wait()
                if to_translate.msg in self._responses:
                    return self._responses[to_translate.msg]


def _run_client(address, translator_client):
    with grpc.insecure_channel('localhost:50054') as channel:
        stub = TestServiceStub(channel)
        responses = stub.Translate(translator_client)
        for resp in responses:
            translator_client.add_response(resp)


def main():
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))
    add_TestServiceServicer_to_server(Translator(), server)
    server.add_insecure_port('[::]:50054')
    server.start()

    translator_client = TranslatorClient()
    client_thread = threading.Thread(
        target=_run_client, args=('localhost:50054', translator_client))
    client_thread.start()

    def _translate(to_translate):
        return translator_client.translate(Msg(msg=to_translate)).msg

    translator_pool = futures.ThreadPoolExecutor(max_workers=4)
    to_translate = ("hello", "goodbye", "I", "don't", "know", "why",)
    translations = translator_pool.map(_translate, to_translate)
    print("Translations: {}".format(zip(to_translate, translations)))

    translator_client.close()
    client_thread.join()
    server.stop(None)


if __name__ == "__main__":
    main()
The basic idea is to have an object called TranslatorClient running on a separate thread, correlating requests and responses. It expects that responses will return in the order that requests were sent out. It also implements the iterator interface so that you can pass it directly to an invocation of the Translate method on your stub.
We spin up a thread running _run_client which pulls responses out of TranslatorClient and feeds them back in the other end with add_response.
The main function I included here is really just a strawman since I don't have the particulars of your UI code. I'm running _translate in a ThreadPoolExecutor to demonstrate that, even though translator_client.translate is synchronous, it yields, allowing you to have multiple in-flight requests at once.
We recognize that this is a lot of code to write for such a simple use case. Ultimately, the answer will be asyncio support. We have plans for this in the not-too-distant future. But for the moment, this sort of solution should keep you going whether you're running python 2 or python 3.
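Since this answer was written, asyncio support has shipped as grpc.aio; with it, the client side of the same bidirectional stream can be fed from an asyncio.Queue instead of a hand-rolled iterator. A rough sketch, assuming the same generated translate_pb2 / translate_pb2_grpc modules:

import asyncio

import grpc
from translate_pb2 import Msg
from translate_pb2_grpc import TestServiceStub

async def run_client(address, outgoing: asyncio.Queue):
    async with grpc.aio.insecure_channel(address) as channel:
        stub = TestServiceStub(channel)

        async def request_iterator():
            # Messages are pushed onto the queue by the UI (or any producer)
            while True:
                msg = await outgoing.get()
                if msg is None:  # sentinel to close the request stream
                    return
                yield Msg(msg=msg)

        async for response in stub.Translate(request_iterator()):
            print("Translated:", response.msg)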
I have 500,000 URLs and want to get the response of each asynchronously.
import aiohttp
import asyncio

@asyncio.coroutine
def worker(url):
    response = yield from aiohttp.request('GET', url, connector=aiohttp.TCPConnector(share_cookies=True, verify_ssl=False))
    body = yield from response.read_and_close()
    print(url)

def main():
    url_list = []  # lacs of urls, extracting from a file
    loop = asyncio.get_event_loop()
    loop.run_until_complete(asyncio.wait([worker(u) for u in url_list]))

main()
I want 200 connections at a time (200 concurrent), not more than this, because
when I run this program for 50 URLs it works fine, i.e. url_list[:50],
but if I pass the whole list, I get this error:
aiohttp.errors.ClientOSError: Cannot connect to host www.example.com:443 ssl:True Future/Task exception was never retrieved future: Task()
Maybe the frequency is too high and the server is refusing to respond after a limit?
Yes, one can expect a server to stop responding after causing too much traffic (whatever the definition of "too much traffic" is) to it.
One way to limit the number of concurrent requests (to throttle them) in such cases is to use asyncio.Semaphore, similar in use to the semaphores used in multithreading: just like there, you create a semaphore and make sure the operation you want to throttle acquires that semaphore prior to doing the actual work and releases it afterwards.
For your convenience, asyncio.Semaphore implements the context manager protocol to make it even easier.
Most basic approach:
CONCURRENT_REQUESTS = 200


@asyncio.coroutine
def worker(url, semaphore):
    # Acquiring/releasing semaphore using context manager.
    with (yield from semaphore):
        response = yield from aiohttp.request(
            'GET',
            url,
            connector=aiohttp.TCPConnector(share_cookies=True,
                                           verify_ssl=False))
        body = yield from response.read_and_close()
        print(url)


def main():
    url_list = []  # lacs of urls, extracting from a file
    semaphore = asyncio.Semaphore(CONCURRENT_REQUESTS)
    loop = asyncio.get_event_loop()
    loop.run_until_complete(asyncio.wait([worker(u, semaphore) for u in url_list]))
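On current Python and aiohttp versions the same idea reads a little differently (async/await syntax, one shared ClientSession, and the semaphore used as an async context manager); a sketch:

import asyncio
import aiohttp

CONCURRENT_REQUESTS = 200

async def worker(url, session, semaphore):
    async with semaphore:  # at most CONCURRENT_REQUESTS in flight
        async with session.get(url) as response:
            body = await response.read()
            print(url)

async def main():
    url_list = []  # lacs of urls, extracting from a file
    semaphore = asyncio.Semaphore(CONCURRENT_REQUESTS)
    connector = aiohttp.TCPConnector(ssl=False)  # counterpart of verify_ssl=False
    async with aiohttp.ClientSession(connector=connector) as session:
        await asyncio.gather(*(worker(u, session, semaphore) for u in url_list))

asyncio.run(main())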