Python Multi Requests URLs

Hi guys,
I'm new to Python (I started less than 2 weeks ago), so I need some advice and tricks :p
What is the fastest and most efficient way to fetch around 1500 API requests?
1. Use an async function that executes them all and returns the results?
2. Divide them into lists of 300 URLs and put every list inside a Thread, which will execute them inside an async loop?
3. Do the same thing as the second suggestion, but with Processes instead of Threads?
For the moment it's working for me, but it takes something like 8s to execute 1400 API requests, and when I try a single request without threads it takes 9s.
Am I doing something wrong?
Fetch one URL (I tried to pass the Session as a param, but I get errors when it reaches 700 requests):
async def fetch_one(url):
    async with curio_http.ClientSession() as session:
        response = await session.get(url)
        content = await response.json()
        return content
Fetch a list of URLs inside an async loop:
async def fetchMultiURLs(url_list):
    tasks = []
    responses = []
    for url in url_list:
        task = await curio.spawn(fetch_one(url))
        tasks.append(task)
    for task in tasks:
        content = await task.join()
        responses.append(content)
        print(content)
Create Threads, each running an async loop over X URLs.
For example, MultiFetch(URLS, 200) with 600 URLs will create 3 threads, each making 200 requests asynchronously:
def MultiFetch(URLS,X):
    MyThreadsList = []
    MyThreadsResults = []
    N_Threads = (lambda x: int (x/X) if (x % X == 0) else int(x/X)+1) (len(URLS))
    for i in range( N_Threads ): # iterates ListSize / X times (rounded up)
        MyThreadsList.append( Thread( target = curio.run , args = (fetchMultiURLs( (URLS[ i*X:(X*i+X)]) ) ,) ) )
        MyThreadsList[i].start()
    for i in range( N_Threads ):
        MyThreadsResults.append(MyThreadsList[i].join())
    return MyThreadsResults

Finally I found a solution :) It takes 2.2s to fetch 1400 URLs.
I used the 3rd suggestion (an async loop inside Processes).
import curio
import curio_http
from multiprocessing import Pool
# Fetch 1 URL
async def fetch_one(url):
    async with curio_http.ClientSession() as session:
        response = await session.get(url)
        content = await response.json()
        return content
# Fetch X URLs
async def fetchMultiURLs(url_list):
    tasks = []
    responses = []
    for url in url_list:
        task = await curio.spawn(fetch_one(url))
        tasks.append(task)
    for task in tasks:
        content = await task.join()
        responses.append(content)
    return responses
# I tried a lambda instead of this function, but it did not work
# (Pool.map has to pickle the callable, and lambdas cannot be pickled)
def RuningCurio(X):
    return curio.run(fetchMultiURLs(X))
# Create Processes, each running an async loop over X URLs.
# In my case (I'm using a VPS) a single Process can easily fetch 700 links in less than 1s, so don't use multiple Processes below that number of URLs (just use the fetchMultiURLs function).
def MultiFetch(URLS,X):
    MyListofLists = []
    LengthURLs = len(URLS)
    N_Process = int (LengthURLs / X) if ( LengthURLs % X == 0) else int( LengthURLs / X) + 1
    for i in range( N_Process ): # Create a list of lists ( [ [1,2,3],[4,5,6],[7,8,9] ] )
        MyListofLists.append(URLS[ i*X:(X*i+X)])
    P = Pool( N_Process)
    return P.map( RuningCurio ,MyListofLists)
# I'm fetching 2100 URLs in 1.1s. I hope this solution helps you guys.
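The rounding-up and slicing step in MultiFetch can also be written as one small helper. This is only a sketch: split_into_chunks is a name made up for illustration, it does no fetching, and it just shows how 600 URLs with X = 200 end up as 3 lists (and how a remainder gets its own shorter list).

```python
def split_into_chunks(urls, size):
    """Split a list into consecutive slices of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be a positive integer")
    # range() steps through the start index of every slice, so a trailing
    # remainder automatically becomes a final, shorter chunk.
    return [urls[start:start + size] for start in range(0, len(urls), size)]


if __name__ == "__main__":
    # Placeholder URLs, purely for illustration.
    demo_urls = [f"https://example.com/item/{i}" for i in range(7)]
    for chunk in split_into_chunks(demo_urls, 3):
        print(len(chunk), chunk[0])
```

The number of chunks is then simply len(split_into_chunks(URLS, X)), which could be passed to Pool in place of the separate N_Process calculation.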

Related

How to optimize my performances, using asynchronous python code

I'm looking to optimize my code in order to process the info faster. First time playing with asynchronous requests. And also still new to Python. I hope my code makes sense.
I'm using FastAPI as a framework. And aiohttp to send my requests.
Right now, I'm only interested in getting the total of results per word searched. I will be dumping the json into a DB afterwards.
My code sends requests to the public Crossref API.
As an example, I'm searching for the terms from 2022-06-02 to 2022-06-03 (inclusive). The terms being searched are: 'paper' (3146 results), 'ammonium' (1430 results) and 'bleach' (23 results). Example:
https://api.crossref.org/works?rows=1000&sort=created&mailto=youremail@domain.com&query=paper&filter=from-index-date:2022-06-02,until-index-date:2022-06-03&cursor=*
This returns 3146 rows. I need to search for only one term at a time. I did not try to split it per day as well to see if it's faster.
There is also a recursive context in this. This is where I feel like I'm mishandling the asynchronous concept. Here is why I need a recursive call.
Deep paging requests
Deep paging using cursors can be used to iterate over large result sets, without any limits on their size.
To use deep paging make a query as normal, but include the cursor parameter with a value of *, for example:
https://api.crossref.org/works?rows=1000&sort=created&mailto=youremail@domain.com&query=ammonium&filter=from-index-date:2022-06-02,until-index-date:2022-06-03&cursor=*
A next-cursor field will be provided in the JSON response. To get the next page of results, pass the value of next-cursor as the cursor parameter. For example:
https://api.crossref.org/works?rows=1000&sort=created&mailto=youremail@domain.com&query=ammonium&filter=from-index-date:2022-06-02,until-index-date:2022-06-03&cursor=<value of next-cursor parameter>
Advice from the CrossRef doc
Clients should check the number of returned items. If the number of returned items is equal to the number of expected rows then the end of the result set has been reached. Using next-cursor beyond this point will result in responses with an empty items list.
My processing time is still through the roof with just 3 words (and 7 requests), it's over 15sec. I'm trying to turn that down to under 5 seconds if possible? Using postman, the longest request took about 4 seconds to come back
This is what I have so far if you want to try it out.
schema.py
from datetime import datetime
from typing import List, Optional
from pydantic import BaseModel
class CrossRefSearchRequest(BaseModel):
    keywords: List[str]
    date_from: Optional[datetime] = None
    date_to: Optional[datetime] = None
controler.py
import time
from fastapi import FastAPI, APIRouter, Request
app = FastAPI(title="CrossRef API", openapi_url=f"{settings.API_V1_STR}/openapi.json")
api_router = APIRouter()
service = CrossRefService()
@api_router.post("/search", status_code=201)
async def search_keywords(*, search_args: CrossRefSearchRequest) -> dict:
    fixed_search_args = {
        "sort": "created",
        "rows": "1000",
        "cursor": "*"
    }
    results = await service.cross_ref_request(search_args, **fixed_search_args)
    return {k: len(v) for k, v in results.items()}
# sets the header X-Process-Time, in order to have the time for each request
@app.middleware("http")
async def add_process_time_header(request: Request, call_next):
    start_time = time.time()
    response = await call_next(request)
    process_time = time.time() - start_time
    response.headers["X-Process-Time"] = str(process_time)
    return response
app.include_router(api_router)
if __name__ == "__main__":
    # Use this for debugging purposes only
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8001, log_level="debug")
service.py
from datetime import datetime, timedelta
def _setup_date_default(date_from_req: datetime, date_to_req: datetime):
    yesterday = datetime.utcnow() - timedelta(days=1)
    date_from = yesterday if date_from_req is None else date_from_req
    date_to = yesterday if date_to_req is None else date_to_req
    return date_from.strftime(DATE_FORMAT_CROSS_REF), date_to.strftime(DATE_FORMAT_CROSS_REF)
class CrossRefService:
    def __init__(self):
        self.client = CrossRefClient()
    # my recursive call for the next cursor
    async def _send_client_request(self, final_result: dict[str, list[str]], keywords: list[str], date_from: str, date_to: str, **kwargs):
        json_responses = await self.client.cross_ref_request_date_range(keywords, date_from, date_to, **kwargs)
        for json_response in json_responses:
            message = json_response.get('message', {})
            keyword = message.get('query').get('search-terms')
            next_cursor = message.get('next-cursor')
            total_results = message.get('total-results')
            search_results = message.get('items', [{}]) if total_results > 0 else []
            if final_result[keyword] is None:
                final_result[keyword] = search_results
            else:
                final_result[keyword].extend(search_results)
            if total_results > int(kwargs['rows']) and len(search_results) == int(kwargs['rows']):
                kwargs['cursor'] = next_cursor
                await self._send_client_request(final_result, [keyword], date_from, date_to, **kwargs)
    async def cross_ref_request(self, request: CrossRefSearchRequest, **kwargs) -> dict[str, list[str]]:
        date_from, date_to = _setup_date_default(request.date_from, request.date_to)
        results: dict[str, list[str]] = dict.fromkeys(request.keywords)
        await self._send_client_request(results, request.keywords, date_from, date_to, **kwargs)
        return results
client.py
import asyncio
from aiohttp import ClientSession
async def _send_request_task(session: ClientSession, url: str):
    try:
        async with session.get(url) as response:
            await response.read()
            return response
    # exception handler to come
    except Exception as e:
        print(f"exception for {url}")
        print(str(e))
class CrossRefClient:
    base_url = "https://api.crossref.org/works?" \
               "query={}&" \
               "filter=from-index-date:{},until-index-date:{}&" \
               "sort={}&" \
               "rows={}&" \
               "cursor={}"
    def __init__(self) -> None:
        self.headers = {
            "User-Agent": "my_app/v0.1 (example.com/; mailto:youremail@domain.com) using FastAPI"
        }
    async def cross_ref_request_date_range(
            self, keywords: list[str], date_from: str, date_to: str, **kwargs
    ) -> list:
        async with ClientSession(headers=self.headers) as session:
            tasks = [
                asyncio.create_task(
                    _send_request_task(session, self.base_url.format(
                        keyword, date_from, date_to, kwargs['sort'], kwargs['rows'], kwargs['cursor']
                    )),
                    name=TASK_NAME_BASE.format(keyword, date_from, date_to)
                )
                for keyword in keywords
            ]
            responses = await asyncio.gather(*tasks)
            return [await response.json() for response in responses]
How can I optimize this and make better use of asynchronous calls? Also, this recursive loop might not be the best way to do it either. Any ideas on that too?
I implemented a solution for synchronous calls and it's even slower. So I guess I'm not too far away.
Thanks!

Your code looks fine and you are not misusing the asynchronous concept.
Perhaps you are limited by the session's connection pool, which allows 100 simultaneous connections by default (the limit parameter of the connector). Take a look at https://docs.aiohttp.org/en/stable/client_reference.html#aiohttp.BaseConnector
Maybe the upstream server is just slow to answer a large number of requests.
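On the recursion question: as far as I can tell from the posted code, the recursive call is awaited inside the for loop, so the follow-up pages of one keyword finish before the next keyword's follow-up pages are requested. One alternative worth trying is a plain loop per keyword, with the keywords gathered concurrently. The sketch below is not the Crossref client. fetch_page is a made-up stand-in that serves an in-memory list, and ITEMS, fetch_all_pages and search are hypothetical names. It only shows the shape of the cursor loop, using the same "a short page means the end" check as the question's code.

```python
import asyncio

# Hypothetical in-memory data standing in for the remote result set.
ITEMS = list(range(2500))


async def fetch_page(keyword, cursor, rows):
    """Stand-in for one HTTP request; returns a Crossref-shaped dict."""
    await asyncio.sleep(0)  # yield to the event loop like a real request would
    start = 0 if cursor == "*" else int(cursor)
    page = ITEMS[start:start + rows]
    return {"message": {"items": page, "next-cursor": str(start + len(page))}}


async def fetch_all_pages(keyword, rows=1000):
    """Follow next-cursor in a loop until a short (or empty) page arrives."""
    items, cursor = [], "*"
    while True:
        message = (await fetch_page(keyword, cursor, rows))["message"]
        page = message.get("items", [])
        items.extend(page)
        if len(page) < rows:
            return items
        cursor = message["next-cursor"]


async def search(keywords, rows=1000):
    """Page through all keywords concurrently; pages of one keyword stay sequential."""
    results = await asyncio.gather(*(fetch_all_pages(k, rows) for k in keywords))
    return dict(zip(keywords, results))


if __name__ == "__main__":
    totals = asyncio.run(search(["paper", "ammonium", "bleach"]))
    print({keyword: len(items) for keyword, items in totals.items()})
```

With this shape the total time should be bounded by the keyword with the most pages rather than by the sum of all follow-up pages, though the server's response time per request stays the same.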

python asyncio & httpx

I am very new to asynchronous programming and I was playing around with httpx. I have the following code and I am sure I am doing something wrong, I just don't know what it is. There are two functions, one synchronous and the other asynchronous. They both pull from Google Finance. On my system I am seeing the following times:
Asynchronous: 5.015218734741211
Synchronous: 5.173618316650391
Here is the code:
import httpx
import asyncio
import time
def sync_pull(url):
    r = httpx.get(url)
    print(r.status_code)
async def async_pull(url):
    async with httpx.AsyncClient() as client:
        r = await client.get(url)
        print(r.status_code)
if __name__ == "__main__":
    goog_fin_nyse_url = 'https://www.google.com/finance/quote/'
    tickers = ['F', 'TWTR', 'CVX', 'VZ', 'GME', 'GM', 'PG', 'AAL',
               'MARK', 'AAP', 'THO', 'NGD', 'ZSAN', 'SEAC',
               ]
    print("Running asynchronously...")
    async_start = time.time()
    for ticker in tickers:
        url = goog_fin_nyse_url + ticker + ':NYSE'
        asyncio.run(async_pull(url))
    async_end = time.time()
    print(f"Time lapsed is: {async_end - async_start}")
    print("Running synchronously...")
    sync_start = time.time()
    for ticker in tickers:
        url = goog_fin_nyse_url + ticker + ':NYSE'
        sync_pull(url)
    sync_end = time.time()
    print(f"Time lapsed is: {sync_end - sync_start}")
I had hoped the asynchronous method approach would require a fraction of the time the synchronous approach is requiring. What am I doing wrong?

When you say asyncio.run(async_pull) you're saying run 'async_pull' and wait for the result to come back. Since you do this once per each ticker in your loop, you're essentially using asyncio to run things synchronously and won't see performance benefits.
What you need to do is create several async calls and run them concurrently. There are several ways to do this, the easiest is to use asyncio.gather (see https://docs.python.org/3/library/asyncio-task.html#asyncio.gather) which takes in a sequence of coroutines and runs them concurrently. Adapting your code is fairly straightforward, you create an async function to take a list of urls and then call async_pull on each of them and then pass that in to asyncio.gather and await the results. Adapting your code to this looks like the following:
import httpx
import asyncio
import time
def sync_pull(url):
    r = httpx.get(url)
    print(r.status_code)
async def async_pull(url):
    async with httpx.AsyncClient() as client:
        r = await client.get(url)
        print(r.status_code)
async def async_pull_all(urls):
    return await asyncio.gather(*[async_pull(url) for url in urls])
if __name__ == "__main__":
    goog_fin_nyse_url = 'https://www.google.com/finance/quote/'
    tickers = ['F', 'TWTR', 'CVX', 'VZ', 'GME', 'GM', 'PG', 'AAL',
               'MARK', 'AAP', 'THO', 'NGD', 'ZSAN', 'SEAC',
               ]
    print("Running asynchronously...")
    async_start = time.time()
    results = asyncio.run(async_pull_all([goog_fin_nyse_url + ticker + ':NYSE' for ticker in tickers]))
    async_end = time.time()
    print(f"Time lapsed is: {async_end - async_start}")
    print("Running synchronously...")
    sync_start = time.time()
    for ticker in tickers:
        url = goog_fin_nyse_url + ticker + ':NYSE'
        sync_pull(url)
    sync_end = time.time()
    print(f"Time lapsed is: {sync_end - sync_start}")
Running this way, the asynchronous version runs in about a second for me as opposed to seven synchronously.
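If you want to see the effect without hitting any server, here is a small offline sketch of the same comparison. fake_pull is a made-up stand-in that just sleeps for 0.1s instead of making a request, and the other names are hypothetical too. Awaiting the calls one by one costs roughly the sum of the delays, while asyncio.gather costs roughly the longest single delay.

```python
import asyncio
import time


async def fake_pull(url, delay=0.1):
    """Stand-in for an HTTP request: waits `delay` seconds, then returns the URL."""
    await asyncio.sleep(delay)
    return url


async def pull_sequentially(urls):
    # Each await finishes before the next call starts, like asyncio.run in a loop.
    return [await fake_pull(url) for url in urls]


async def pull_concurrently(urls):
    # All coroutines are scheduled together and wait at the same time.
    return await asyncio.gather(*(fake_pull(url) for url in urls))


def timed(coro):
    """Run a coroutine to completion and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    urls = [f"https://example.com/{i}" for i in range(5)]
    _, sequential = timed(pull_sequentially(urls))
    _, concurrent = timed(pull_concurrently(urls))
    print(f"sequential: {sequential:.2f}s, concurrent: {concurrent:.2f}s")
```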

Here's a nice pattern I use (I tend to change it a little each time). In general, I make a module async_utils.py and just import the top-level fetching function (e.g. here fetch_things), and then my code is free to forget about the internals (other than error handling). You can do it in other ways, but I like the 'functional' style of aiostream, and often find the repeated calls to the process function take certain defaults I set using functools.partial.
Note: async currying with partials is Python 3.8+ only
You can pass in a tqdm.tqdm progress bar to pbar (initialised with known size total=len(things)) to have it update when each async response is processed.
import asyncio
import httpx
from aiostream import stream
from functools import partial
__all__ = ["fetch", "process_thing", "async_fetch_urlset", "fetch_things"]
async def fetch(session, url, raise_for_status=False):
    response = await session.get(str(url))
    if raise_for_status:
        response.raise_for_status()
    return response
async def process_thing(data, things, pbar=None, verbose=False):
    # Map the response back to the thing it came from in the things list
    source_url = data.history[0].url if data.history else data.url
    thing = next(t for t in things if source_url == t.get("thing_url"))
    # Handle `data.content` here, where `data` is the `httpx.Response`
    if verbose:
        print(f"Processing {source_url=}")
    thing.update({"computed_value": "result goes here"})
    if pbar:
        pbar.update()
async def async_fetch_urlset(urls, things, pbar=None, verbose=False, timeout_s=10.0):
    timeout = httpx.Timeout(timeout=timeout_s)
    async with httpx.AsyncClient(timeout=timeout) as session:
        ws = stream.repeat(session)
        xs = stream.zip(ws, stream.iterate(urls))
        ys = stream.starmap(xs, fetch, ordered=False, task_limit=20)
        process = partial(process_thing, things=things, pbar=pbar, verbose=verbose)
        zs = stream.map(ys, process)
        return await zs
def fetch_things(urls, things, pbar=None, verbose=False):
    return asyncio.run(async_fetch_urlset(urls, things, pbar, verbose))
In this example, the input is a list of dicts (with string keys and values), things: list[dict[str,str]], and the key "thing_url" is accessed to retrieve the URL. Having a dict or object is desirable instead of just the URL string for when you want to 'map' the result back to the object it came from. The process_thing function is able to modify the input list things in-place (i.e. any changes are not scoped within the function, they change it back in the scope that called it).
You'll often find errors arise during async runs that you don't get when running synchronously, so you'll need to catch them and retry. A common gotcha is to retry at the wrong level (e.g. around the entire loop).
In particular, you'll want to import and catch httpcore.ConnectTimeout, httpx.ConnectTimeout, httpx.RemoteProtocolError, and httpx.ReadTimeout.
Increasing the timeout_s parameter will reduce the frequency of the timeout errors by letting the AsyncClient 'wait' for longer, but doing so may in fact slow down your program (it won't "fail fast" quite as fast).
Here's an example of how to use the async_utils module given above:
from async_utils import fetch_things
import httpx
import httpcore
# UNCOMMENT THIS TO SEE ALL THE HTTPX INTERNAL LOGGING
#import logging
#log = logging.getLogger()
#log.setLevel(logging.DEBUG)
#log_format = logging.Formatter('[%(asctime)s] [%(levelname)s] - %(message)s')
#console = logging.StreamHandler()
#console.setLevel(logging.DEBUG)
#console.setFormatter(log_format)
#log.addHandler(console)
# The key must be "thing_url", because that is what process_thing looks up
things = [
    {"thing_url": "https://python.org", "name": "Python"},
    {"thing_url": "https://www.python-httpx.org/", "name": "HTTPX"},
]
#log.debug("URLSET:" + str(list(t.get("thing_url") for t in things)))
def make_urlset(things):
    """Make a URL generator (empty if all have been fetched)"""
    urlset = (t.get("thing_url") for t in things if "computed_value" not in t)
    return urlset
retryable_errors = (
    httpcore.ConnectTimeout,
    httpx.ConnectTimeout, httpx.RemoteProtocolError, httpx.ReadTimeout,
)
# ASYNCHRONOUS:
max_retries = 100
for i in range(max_retries):
    print(f"Retry {i}")
    try:
        urlset = make_urlset(things)
        foo = fetch_things(urls=urlset, things=things, verbose=True)
    except retryable_errors as exc:
        print(f"Caught {exc!r}")
        if i == max_retries - 1:
            raise
    except Exception:
        raise
    else:
        break  # the run finished without a retryable error, so stop retrying
# SYNCHRONOUS:
#for t in things:
#    resp = httpx.get(t["thing_url"])
In this example I set a key "computed_value" on a dictionary once the async response has successfully been processed, which then prevents that URL from being entered into the generator on the next round (when make_urlset is called again). In this way, the generator gets progressively smaller. You can also do it with lists, but I find a generator of the URLs to be pulled works reliably. For an object you'd change the dictionary key assignment/access (update/in) to attribute assignment/access (setattr/hasattr).
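The "shrinking set" idea can be tried out without httpx or aiostream. The following is a stripped-down sketch under stated assumptions: flaky_fetch is a made-up stand-in that fails once for URLs ending in "flaky", and fetch_pending and fetch_with_retries are hypothetical names, not part of the module above. It swaps the stream pipeline for plain asyncio.gather with return_exceptions=True, so one failure does not discard the results that did succeed.

```python
import asyncio

attempts = {}  # per-URL attempt counter, used only by the fake fetch below


async def flaky_fetch(url):
    """Stand-in for an HTTP call: URLs ending in 'flaky' fail on their first attempt."""
    attempts[url] = attempts.get(url, 0) + 1
    await asyncio.sleep(0)
    if attempts[url] == 1 and url.endswith("flaky"):
        raise TimeoutError(url)
    return f"body of {url}"


async def fetch_pending(things):
    """Fetch only the things without a result; return how many failed this round."""
    pending = [t for t in things if "computed_value" not in t]
    outcomes = await asyncio.gather(
        *(flaky_fetch(t["thing_url"]) for t in pending),
        return_exceptions=True,  # failures come back as values instead of raising
    )
    for thing, outcome in zip(pending, outcomes):
        if not isinstance(outcome, Exception):
            thing["computed_value"] = outcome
    return sum(isinstance(outcome, Exception) for outcome in outcomes)


def fetch_with_retries(things, max_retries=3):
    """Repeat rounds until nothing failed or the retry budget is used up."""
    for _ in range(max_retries):
        if asyncio.run(fetch_pending(things)) == 0:
            break
    return things
```

Each round only touches the items that still lack "computed_value", so a URL that already succeeded is never requested again.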

I wanted to post a working version of the code using futures. The run time is virtually the same:
import httpx
import asyncio
import time
#
#--------------------------------------------------------------------
# Synchronous pull
#--------------------------------------------------------------------
#
def sync_pull(url):
    r = httpx.get(url)
    print(r.status_code)
#
#--------------------------------------------------------------------
# Asynchronous Pull
#--------------------------------------------------------------------
#
async def async_pull(url):
    async with httpx.AsyncClient() as client:
        r = await client.get(url)
        print(r.status_code)
#
#--------------------------------------------------------------------
# Build tasks queue & execute coroutines
#--------------------------------------------------------------------
#
async def build_task() -> None:
    goog_fin_nyse_url = 'https://www.google.com/finance/quote/'
    tickers = ['F', 'TWTR', 'CVX', 'VZ', 'GME', 'GM', 'PG', 'AAL',
               'MARK', 'AAP', 'THO', 'NGD', 'ZSAN', 'SEAC',
               ]
    tasks = []
    #
    ## The following block schedules one future per URL
    for ticker in tickers:
        url = goog_fin_nyse_url + ticker + ':NYSE'
        tasks.append(asyncio.ensure_future(async_pull(url)))
    start_time = time.time()
    #
    ## Wait for all scheduled futures to finish; they run
    ## concurrently, so this completes quickly
    await asyncio.gather(*tasks)
    #
    ## Calculate time lapsed
    finish_time = time.time()
    elapsed_time = finish_time - start_time
    print(f"\n Time spent processing: {elapsed_time} ")
# Start from here
if __name__ == "__main__":
    asyncio.run(build_task())

Asynchronous error handling and response processing of an unbounded list of tasks using zeep

So here is my use case:
I read from a database rows containing information to make a complex SOAP call (I'm using zeep to do these calls).
One row from the database corresponds to a request to the service.
There can be up to 20 thousand lines, so I don't want to read everything in memory before making the calls.
I need to process the responses: when the response is OK, I need to store some returned information back into my database, and when there is an exception I need to process the exception for that particular request/response pair.
I need also to capture some external information at the time of the request creation, so that I know where to store the response from the request. In my current code I'm using the delightful property of gather() that makes the results come in the same order.
I read the relevant PEPs and Python documentation but I'm still very confused, as there seems to be multiple ways to solve the same problem.
I also went through countless exercises on the web, but the examples are all trivial - it's either asyncio.sleep() or some webscraping with a finite list of urls.
The solution that I have come up with so far kind of works. The asyncio.gather() method is very, very useful, but I have not been able to 'feed' it from a generator. I'm currently just counting to an arbitrary size and then starting a .gather() operation. I've transcribed the code, with the boring parts left out, and I've tried to anonymise it.
I've tried solutions involving semaphores, queues and different event loops, but I'm failing every time. Ideally I'd like to be able to create Futures 'continuously'. I think I'm missing the logic of 'convert this awaitable call to a future'.
I'd be grateful for any help!
import asyncio
from asyncio import Future
import zeep
from zeep.plugins import HistoryPlugin
history = HistoryPlugin()
max_concurrent_calls = 5
provoke_errors = True
def export_data_async(db_variant: str, order_nrs: set):
    st = time.time()
    results = []
    loop = asyncio.get_event_loop()
    def get_client1(service_name: str, system: Systems = Systems.ACME) -> Tuple[zeep.Client, zeep.client.Factory]:
        client1 = zeep.Client(wsdl=system.wsdl_url(service_name=service_name),
                              transport=transport,
                              plugins=[history],
                              )
        factory_ns2 = client1.type_factory(namespace='ns2')
        return client1, factory_ns2
    table = 'ZZZZ'
    moveback_table = 'EEEEEE'
    moveback_dict = create_default_empty_ordered_dict('attribute1 attribute2 attribute3 attribute3')
    client, factory = get_client1(service_name='ACMEServiceName')
    if log.isEnabledFor(logging.DEBUG):
        client.wsdl.dump()
        zeep_log = logging.getLogger('zeep.transports')
        zeep_log.setLevel(logging.DEBUG)
    with Db(db_variant) as db:
        db.open_db(CON_STRING[db_variant])
        db.init_table_for_read(table, order_list=order_nrs)
        counter_failures = 0
        tasks = []
        sids = []
        results = []
        def handle_future(future: Future) -> None:
            results.extend(future.result())
        def process_tasks_concurrently() -> None:
            nonlocal tasks, sids, counter_failures, results
            futures = asyncio.gather(*tasks, return_exceptions=True)
            futures.add_done_callback(handle_future)
            loop.run_until_complete(futures)
            for i, response_or_fault in enumerate(results):
                if type(response_or_fault) in [zeep.exceptions.Fault, zeep.exceptions.TransportError]:
                    counter_failures += 1
                    log_webservice_fault(sid=sids[i], db=db, err=response_or_fault, object=table)
                else:
                    db.write_dict_to_table(
                        moveback_table,
                        {'sid': sids[i],
                         'attribute1': response_or_fault['XXX']['XXX']['xxx'],
                         'attribute2': response_or_fault['XXX']['XXX']['XXXX']['XXX'],
                         'attribute3': response_or_fault['XXXX']['XXXX']['XXX'],
                         }
                    )
            db.commit_db_con()
            tasks = []
            sids = []
            results = []
            return
        for row in db.rows(table):
            if int(row.id) % 2 == 0 and provoke_errors:
                payload = faulty_message_payload(row=row,
                                                 factory=factory,
                                                 )
            else:
                payload = message_payload(row=row,
                                          factory=factory,
                                          )
            tasks.append(client.service.myRequest(
                MessageHeader=factory.MessageHeader(**message_header_arguments(row=row)),
                myRequestPayload=payload,
                _soapheaders=[security_soap_header],
            ))
            sids.append(row.sid)
            if len(tasks) == max_concurrent_calls:
                process_tasks_concurrently()
        if tasks:  # this is the remainder of len(db.rows) % max_concurrent_calls
            process_tasks_concurrently()
        loop.run_until_complete(transport.session.close())
        db.execute_this_statement(statement=update_sql)
        db.commit_db_con()
        log.info(db.activity_log)
        if counter_failures:
            log.info(f"{table :<25} Count failed: {counter_failures}")
    print("time async: %.2f" % (time.time() - st))
    return results
Failed attempt with Queue (blocks at await client.service):
loop = asyncio.get_event_loop()
counter = 0
results = []
async def payload_generator(db_variant: str, order_nrs: set):
    # code that generates the data for the request
    yield counter, row, payload
async def service_call_worker(queue, results):
    while True:
        counter, row, payload = await queue.get()
        results.append(await client.service.myServicename(
            MessageHeader=calculate_message_header(row=row),
            myPayload=payload,
            _soapheaders=[security_soap_header],
            )
        )
        print(colorama.Fore.BLUE + f'after result returned {counter}')
        # Here do the relevant processing of response or error
        queue.task_done()
async def main_with_q():
    n_workers = 3
    queue = asyncio.Queue(n_workers)
    e = pprint.pformat(queue)
    p = payload_generator(DB_VARIANT, order_list_from_args())
    results = []
    workers = [asyncio.create_task(service_call_worker(queue, results))
               for _ in range(n_workers)]
    async for c in p:
        await queue.put(c)
    await queue.join()  # wait for all tasks to be processed
    for worker in workers:
        worker.cancel()
if __name__ == '__main__':
    try:
        loop.run_until_complete(main_with_q())
        loop.run_until_complete(transport.session.close())
    finally:
        loop.close()
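For comparison, here is a minimal, self-contained sketch of the bounded worker pattern that the Queue attempt is aiming for, with zeep and the database taken out. Everything in it is an assumption for illustration: fake_call is a made-up coroutine standing in for the SOAP call (odd payloads fail), and worker and run_bounded are hypothetical names. It shows three things from the use case: rows come from a lazy generator, at most n_workers calls are in flight, and each result or exception is stored under the sid it belongs to.

```python
import asyncio


async def fake_call(payload):
    """Made-up stand-in for the service call; odd payloads raise an error."""
    await asyncio.sleep(0.01)
    if payload % 2:
        raise ValueError(f"bad payload {payload}")
    return payload * 10


async def worker(queue, results):
    """Take (sid, payload) pairs off the queue and record the result or the exception."""
    while True:
        sid, payload = await queue.get()
        try:
            results[sid] = await fake_call(payload)
        except Exception as exc:  # keep the failure next to the sid that caused it
            results[sid] = exc
        finally:
            queue.task_done()


async def run_bounded(rows, n_workers=3):
    """Feed rows from any iterable (including a generator) to a fixed pool of workers."""
    queue = asyncio.Queue(maxsize=n_workers)  # put() waits while the queue is full
    results = {}
    workers = [asyncio.create_task(worker(queue, results)) for _ in range(n_workers)]
    for item in rows:
        await queue.put(item)
    await queue.join()  # every queued item has been processed
    for task in workers:
        task.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return results


if __name__ == "__main__":
    rows = ((f"sid{i}", i) for i in range(6))  # lazy generator, like reading db rows
    for sid, outcome in sorted(asyncio.run(run_bounded(rows)).items()):
        print(sid, repr(outcome))
```

Because the queue has a maximum size, the generator is only advanced as fast as the workers drain it, so the full set of rows never has to sit in memory at once.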

Parsing large number of HTML files with asyncio aiofiles and parsing them in pandas DataFrame

I have around 40 000 HTML files on disk and function that parses HTML with Beautiful Soup and returns dictionary for each HTML.
During reading/parsing I'm appending all dictionaries to list and creating pandas DataFrame in the end.
It all works OK in synchronous mode, but it takes a long time to run, so I want to run it with aiofiles.
Currently my code looks like this:
# Function for fetching all ad info from single page
async def getFullAdSoup(soup):
    ...
    adFullFInfo = {}  # dictionary parsed from Beautiful Soup object
    return await adFullFInfo
async def main():
    adExtendedDF = pd.DataFrame()
    adExtendednfo = {}
    htmls = glob.glob("HTML_directory" + "/*.html")  # Get all HTML files from directory
    htmlTasks = []  # Holds list of returned dictionaries
    for html in natsorted(htmls):
        async with aiofiles.open(html, mode='r', encoding='UTF-8', errors='strict', buffering=1) as f:
            contents = await f.read()
        htmlTasks.append(getFullAdSoup(BeautifulSoup(contents, features="lxml")))
        htmlDicts = await asyncio.gather(*htmlTasks)
    adExtendedDF = pd.DataFrame(data=htmlDicts, ignore_index=True)
if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())
The error I'm getting is:
  File "C:/Users/.../test.py", line 208, in getFullAdSoup
    return await adFullFInfo
TypeError: object dict can't be used in 'await' expression
I found a similar question here, but I'm unable to make it work.
I don't know how to turn my parsing function into an asynchronous one, or how to iterate over the files calling that function.

Your error happens because you await a dict. I'm guessing you misunderstood: you don't need await in the return statement for the function to be async. I would refactor it like this:
# Function for fetching all ad info from single page
async def getFullAdSoup(soup):
    ...
    adFullFInfo = {}  # dictionary parsed from Beautiful Soup object
    return adFullFInfo  #*****1****
async def main():
    adExtendedDF = pd.DataFrame()
    adExtendednfo = {}
    htmls = glob.glob("HTML_directory" + "/*.html")  # Get all HTML files from directory
    htmlTasks = []  # Holds list of returned dictionaries
    for html in natsorted(htmls):
        async with aiofiles.open(html, mode='r', encoding='UTF-8', errors='strict', buffering=1) as f:
            contents = await f.read()
        htmlTasks.append(asyncio.create_task(  #****2****
            getFullAdSoup(BeautifulSoup(contents, features="lxml"))))
        await asyncio.sleep(0)  #****3****
    htmlDicts = await asyncio.gather(*htmlTasks)  #****4****
    adExtendedDF = pd.DataFrame(data=htmlDicts, ignore_index=True)
if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())
4 changes:
1. No need to await the dict.
2. Use asyncio.create_task to schedule the task to run ASAP.
3. sleep(0) to release the event loop and let the task start running.
4. Move the gather call outside of the loop, so you can gather all tasks at once instead of one at a time.
2 and 3 are optional, but I find that they can make a lot of difference in speed, depending on what you are doing.
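To make the first change concrete, here is a tiny standalone sketch (parse, broken_parse and parse_all are made-up names, and the "parsing" is just len() so it runs without BeautifulSoup). A coroutine function hands back its value with a plain return. Awaiting a dict is what raises the TypeError from the question.

```python
import asyncio


async def parse(text):
    # A coroutine function returns its value with a plain `return`.
    return {"length": len(text)}


async def broken_parse(text):
    # Awaiting a plain dict raises TypeError, as in the question.
    return await {"length": len(text)}


async def parse_all(texts):
    # Schedule every parse first, then collect the results in input order.
    tasks = [asyncio.create_task(parse(text)) for text in texts]
    return await asyncio.gather(*tasks)


if __name__ == "__main__":
    print(asyncio.run(parse_all(["a", "bcd"])))
    try:
        asyncio.run(broken_parse("x"))
    except TypeError as exc:
        print(f"TypeError: {exc}")
```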

Asyncio Loop Within Asyncio Loop

I'm just starting to use Asyncio and I'm trying to use it to parse a website.
I'm trying to parse 6 sections (self.signals) of the site, each section has N number of pages with tables on them, so essentially I'm trying to async the loop that calls what section, and async the pages in each section. This is what I have so far.
class FinViz():
    def __init__(self):
        self.url = 'https://finviz.com/screener.ashx?v=160&s='
        self.signals = {
            'Earnings_Before' : 'n_earningsbefore',
            'Earnings_After' : 'n_earningsafter',
            'Most_Active' : 'ta_mostactive',
            'Top_Gainers' : 'ta_topgainers',
            'Most_Volatile' : 'ta_mostvolatile',
            'News' : 'n_majornews',
            'Upgrade' : 'n_upgrades',
            'Unusual_Volume' : 'ta_unusualvolume'
        }
        self.ticks = []
    def _parseTable(self, data):
        i, signal = data
        url = self.signals[signal] if i == 0 else self.signals[signal] + '&r={}'.format(str(i * 20 + 1))
        soup = BeautifulSoup(urlopen(self.url + url, timeout = 3).read(), 'html5lib')
        table = soup.find('div', {'id' : 'screener-content'}).find('table',
            {'width' : '100%', 'cellspacing': '1', 'cellpadding' : '3', 'border' : '0', 'bgcolor' : '#d3d3d3'})
        for row in table.findAll('tr'):
            col = row.findAll('td')[1]
            if col.find('a'):
                self.ticks.append(col.find('a').text)
    async def parseSignal(self, signal):
        try:
            soup = BeautifulSoup(urlopen(self.url + self.signals[signal], timeout = 3).read(), 'html5lib')
            tot = int(soup.find('td', {'class' : 'count-text'}).text.split()[1])
            with concurrent.futures.ThreadPoolExecutor(max_workers = 20) as executor:
                loop = asyncio.get_event_loop()
                futures = []
                for i in range(tot // 20 + (tot % 20 > 0)):
                    futures.append(loop.run_in_executor(executor, self._parseTable, (i, signal)))
                for response in await asyncio.gather(*futures):
                    pass
        except URLError:
            pass
    async def getAll(self):
        with concurrent.futures.ThreadPoolExecutor(max_workers = 20) as executor:
            loop = asyncio.get_event_loop()
            futures = []
            for signal in self.signals:
                futures.append(await loop.run_in_executor(executor, self.parseSignal, signal))
            for response in await asyncio.gather(*futures):
                pass
        print(self.ticks)
if __name__ == '__main__':
    x = FinViz()
    loop = asyncio.get_event_loop()
    loop.run_until_complete(x.getAll())
This does do the job successfully, but it somehow does it slower than if I were to do the parsing without asyncio.
Any tips for an asynchronous noob?
Edit: Added full code

Remember that Python has a GIL, so threads will not speed up CPU-bound work such as parsing the HTML; they only help while a call is blocked on I/O. To potentially speed things up, use a ProcessPoolExecutor. Note, however, that you'll incur the following overhead:
1. pickling/unpickling the data sent to the sub-process worker
2. pickling/unpickling the result sent back to the main process
You can avoid 1. if you run in a fork-safe environment and store the data in a global variable.
You can also do things like share a memory-mapped file. Sharing raw strings/bytes is the fastest.
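Whichever executor you pick, the submission pattern matters too: collect the futures from run_in_executor first and await them together, rather than awaiting each one inside the loop. Below is a small sketch of that shape. blocking_job is a made-up stand-in (it just sleeps, like a blocking urlopen call would), and run_all and demo are hypothetical names. It uses a ThreadPoolExecutor so it runs anywhere; a ProcessPoolExecutor can be passed in the same way, provided the function and its arguments can be pickled.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def blocking_job(n):
    """Stand-in for a blocking call such as urlopen(...).read()."""
    time.sleep(0.1)
    return n * n


async def run_all(numbers, executor):
    loop = asyncio.get_running_loop()
    # Submit everything first; nothing is awaited inside this comprehension.
    futures = [loop.run_in_executor(executor, blocking_job, n) for n in numbers]
    # Await all futures together so the jobs overlap in the pool.
    return await asyncio.gather(*futures)


def demo():
    """Run five 0.1s jobs on five workers and report the results and elapsed time."""
    with ThreadPoolExecutor(max_workers=5) as executor:
        start = time.perf_counter()
        results = asyncio.run(run_all(range(5), executor))
        elapsed = time.perf_counter() - start
    return results, elapsed


if __name__ == "__main__":
    results, elapsed = demo()
    print(results, f"{elapsed:.2f}s")
```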
