I have a dataframe where each row is a record and I need to send each record in the body of a post request. Right now I am looping through the dataframe to accomplish this. I am constrained by the fact that each record must be posted individually. Is there a faster way to accomplish this?
Iterating over the data frame is not the issue here. The issue is that you have to wait for the server to respond to each of your requests. A network request takes eons compared to the CPU time needed to iterate over the data frame. In other words, your program is I/O bound, not CPU bound.
One way to speed it up is to use coroutines. Let's say you have to make 1000 requests. Instead of firing one request, waiting for the response, then firing the next request and so on, you fire all 1000 requests at once and tell Python to wait until you have received all 1000 responses.
Since you didn't provide any code, here's a small program to illustrate the point:
import aiohttp
import asyncio
import numpy as np
import time
from typing import List


async def send_single_request(session: aiohttp.ClientSession, url: str):
    async with session.get(url) as response:
        return await response.json()


async def send_all_requests(urls: List[str]):
    async with aiohttp.ClientSession() as session:
        # Make 1 coroutine for each request
        coroutines = [send_single_request(session, url) for url in urls]
        # Wait until all coroutines have finished
        return await asyncio.gather(*coroutines)


# We will make 10 requests to httpbin.org. Each request will take at least d
# seconds. If you were to fire them sequentially, they would have taken at least
# delays.sum() seconds to complete.
np.random.seed(42)
delays = np.random.randint(0, 5, 10)
urls = [f"https://httpbin.org/delay/{d}" for d in delays]

# Instead, we will fire all 10 requests at once, then wait until all 10 have
# finished.
t1 = time.time()
result = asyncio.run(send_all_requests(urls))
t2 = time.time()

print(f"Expected time: {delays.sum()} seconds")
print(f"Actual time: {t2 - t1:.2f} seconds")
Output:
Expected time: 28 seconds
Actual time: 4.57 seconds
You have to read up a bit on coroutines and how they work, but for the most part they are not too complicated for your use case. This comes with a couple of caveats:
All your requests must be independent of each other.
The rate limit on the server must be sufficient to handle your workload. For example, if it restricts you to 2 requests per minute, there is no way around that other than upgrading to a different service tier.
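Since the original question is about POSTing one record per request body, here is a minimal sketch of the same pattern adapted to that case. The names df, POST_URL and the /records endpoint are assumptions for illustration, not part of your setup:

import aiohttp
import asyncio
import pandas as pd

POST_URL = "https://example.com/records"  # hypothetical endpoint

async def post_single_record(session: aiohttp.ClientSession, record: dict):
    # Send one record as the JSON body of a POST request
    async with session.post(POST_URL, json=record) as response:
        return await response.json()

async def post_all_records(df: pd.DataFrame):
    async with aiohttp.ClientSession() as session:
        records = df.to_dict(orient="records")  # one dict per row
        coroutines = [post_single_record(session, record) for record in records]
        return await asyncio.gather(*coroutines)

# results = asyncio.run(post_all_records(df))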
Related
I have a function that makes a POST request followed by a lot of processing. All of that takes 30 seconds.
I need to execute this function every 6 minutes, so I used asyncio for that ... But it's not asynchronous: my API is blocked until the end of the function ... Later I will have processing that takes 5 minutes to execute.
def update_all():
    # do request and treatment (30 secs)
    ...

async def run_update_all():
    while True:
        await asyncio.sleep(6 * 60)
        update_all()

loop = asyncio.get_event_loop()
loop.create_task(run_update_all())
So I don't understand why, while update_all() is executing, all incoming requests are left pending, waiting for update_all() to finish, instead of being handled asynchronously.
I found an answer thanks to the hint from larsks.
I did this:
def update_all():
    # Do a synchronous POST request and processing that takes a long time
    ...

async def launch_async():
    loop = asyncio.get_event_loop()
    while True:
        await asyncio.sleep(120)
        loop.run_in_executor(None, update_all)

asyncio.create_task(launch_async())
With that code I'm able to launch a synchronous function every X seconds without blocking the main thread of FastAPI :D
I hope this will help other people in the same situation as me.
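For readers wiring the same pattern into FastAPI, here is a minimal self-contained sketch under assumptions: the /health route and the body of update_all() are placeholders, not code from the answer above.

import asyncio
import time

from fastapi import FastAPI

app = FastAPI()

def update_all():
    # Placeholder for the blocking request + processing (assumed ~30 seconds)
    time.sleep(30)

async def launch_async():
    loop = asyncio.get_running_loop()
    while True:
        await asyncio.sleep(120)
        # Run the blocking function in the default thread pool so the event
        # loop (and therefore FastAPI) keeps serving requests in the meantime.
        loop.run_in_executor(None, update_all)

@app.on_event("startup")
async def schedule_background_job():
    asyncio.create_task(launch_async())

@app.get("/health")
async def health():
    # This route stays responsive even while update_all() is running.
    return {"status": "ok"}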
So, I'm trying to write a queue in Python for load testing and I'm stuck. What I have:
A REST API with authentication;
A POST route for sending a request;
The requests package for sending requests;
An array of user emails that also contains each user's user_id.
When I pass authentication I get a user_id and an access_token. I can save them in a dictionary or a list, but then I need to use them when sending requests: the user_id is needed for the route, the access_token for the checksum calculation. But I don't see a clean way to handle this. I thought about two for loops, but I would need a way to save the access_token and user_id for all of my users in a dictionary and then use it, and I'm not sure about that approach either.
I tried to do it with multithreading. For example:
import concurrent.futures
import datetime
import queue
import time

q = queue.Queue()

def run_with_queue_api_authorize():
    with concurrent.futures.ThreadPoolExecutor(max_workers=100) as pool:
        auth = [pool.submit(ApiAuthorization.get_token_generated, email)
                for email in ApiVariables.users]
        for r in concurrent.futures.as_completed(auth):
            q.put(r)
    print(q.qsize())
    return q

def run_with_queue_send_activity():
    start_time = int(datetime.datetime.now().timestamp())
    run_with_queue_api_authorize()
    end_time = int(datetime.datetime.now().timestamp())
    print(f"Execution time for auth: {end_time - start_time}, start time is {start_time}, end_time is {end_time}")

    time.sleep(10)

    start_time = int(datetime.datetime.now().timestamp())
    with concurrent.futures.ThreadPoolExecutor(max_workers=100) as pool:
        while not q.empty():
            task = q.get()
            if task.done():
                pool.submit(MultiQueue.multi_post_user_activity, task)
            if q.empty():
                break
    end_time = int(datetime.datetime.now().timestamp())
    print(f"Execution time for post user activity: {end_time - start_time}, start time is {start_time}, end_time is {end_time}")
When I use this code I can send my test data, but only for one user out of 3000 threads; it fails for the other 2999. I don't understand why it doesn't work.
I also tried to create a queue with the threading package, but it finishes immediately without any output. I think the ThreadPoolExecutor solution is the more reliable and more workable one.
I can take a user_id from a separate array, but then I still need to get the access_token for that user, which doesn't look like a workable approach.
Why do I need this? Because I need to measure the number of requests per second. But when I do authorization in the threads, I also send data at the same time asynchronously, so the total time of my test would include the authorization time.
How can I resolve this problem? I've watched some videos about multithreading and read a lot about the subject, but I can't apply it to my case.
I'd be grateful for any advice.
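One way to structure the two phases described above is to authorize everyone first, keep the (user_id, access_token) pairs in a dictionary, and only time the send phase. A sketch under assumptions: authorize() and post_activity() are hypothetical stand-ins for ApiAuthorization.get_token_generated and MultiQueue.multi_post_user_activity.

import concurrent.futures
import time

def authorize(email):
    # Hypothetical stand-in: performs the auth request and
    # returns (user_id, access_token) for one user.
    return email, "dummy-token"

def post_activity(user_id, access_token):
    # Hypothetical stand-in: sends one POST using this user's credentials.
    pass

def load_test(emails, workers=100):
    # Phase 1: authorize everyone up front and keep the credentials.
    credentials = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(authorize, email) for email in emails]
        for future in concurrent.futures.as_completed(futures):
            user_id, access_token = future.result()
            credentials[user_id] = access_token

    # Phase 2: only this part is timed, so authorization does not skew the rate.
    start = time.time()
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        sends = [pool.submit(post_activity, uid, token)
                 for uid, token in credentials.items()]
        concurrent.futures.wait(sends)
    elapsed = time.time() - start
    print(f"{len(credentials)} requests in {elapsed:.2f}s "
          f"({len(credentials) / max(elapsed, 1e-9):.1f} req/s)")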
I am creating a system where I have to query a remote server periodically, about 10,000 times a second. That is quite a lot, but it is still experimental, and I own the server, so there are no issues with exceeding load or anything.
The way I did that is to spin up 50 processes, with each process spinning up about 200 threads, and each thread running an asyncio loop over 2 tasks forever.
The loop looks like this:
async def getDataPeriodically(item):
    while True:
        self.getNewData(item)
        await asyncio.sleep(replayInterval)

entriesLoop = asyncio.get_event_loop()
entriesLoop.create_task(getDataPeriodically("X"))
entriesLoop.create_task(getDataPeriodically("Y"))
entriesLoop.run_forever()
The issue I had is that although replayInterval is set to 0.5 seconds, or even 1 second, self.getNewData wouldn't finish the HTTP request in time. Sometimes it finishes 10 seconds later and sometimes even 2 minutes later.
I would like to know whether running an asyncio loop inside a thread decreases efficiency or conflicts with the concurrency logic of the thread.
If you can change getNewData(), you do not need the await calls.
Threads can update object attributes directly, so you can pass in a dictionary (or other object) and monitor a specific attribute.
This doesn't answer your question about asyncio, but it may help with your overall problem.
...
def getNewData(self, obj):
    # Request data
    # Once data is received
    obj['dataReceived'] = True
...

def getDataPeriodically(item):
    obj = {'dataReceived': False}
    while True:
        self.getNewData(item, obj)
        while not obj['dataReceived']:  # Wait for getNewData to receive data
            pass
        # Do whatever with data
        obj['dataReceived'] = False  # Prep for next HTTP request

thread = threading.Thread(target=getDataPeriodically, args=(item,))
thread.daemon = True
thread.start()
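As a side note, the while not obj['dataReceived'] loop above spins the CPU while it waits; a threading.Event expresses the same signal without busy-waiting. A sketch of that variant, with the actual HTTP request left as a placeholder:

import threading
import time

def getNewData(item, done: threading.Event):
    # Placeholder for the real HTTP request for `item`
    time.sleep(0.5)
    done.set()  # Signal that data has been received

def getDataPeriodically(item):
    done = threading.Event()
    while True:
        done.clear()
        getNewData(item, done)
        done.wait()  # Block without spinning until getNewData signals
        # Do whatever with the data

thread = threading.Thread(target=getDataPeriodically, args=("X",), daemon=True)
thread.start()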
How can I measure the full time grpc-python takes to handle a request?
So far the best I can do is:
def Run(self, request, context):
    start = time.time()
    # service code...
    end = time.time()
    return myservice_stub.Response()
But this doesn't measure how much time gRPC takes to serialize the request and the response, to transfer them over the network, and so on. I'm looking for a way to "hook" into these steps.
You can measure on the client side:
start = time.time()
response = stub.Run(request)
total_end_to_end = time.time() - start
Then you can get the total overhead (serialization, transfer) by subtracting the computation time of the Run method.
To automate the process, you can add (at least for the sake of the test) the computation time as a field on myservice_stub.Response.
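A sketch of that approach on the client side, assuming the response message has been given a computation_seconds field (a hypothetical name) that Run() fills in on the server:

import time

# Measure end-to-end latency, then subtract the server's own computation time
# to estimate the gRPC overhead (serialization + network transfer).
start = time.time()
response = stub.Run(request)
end_to_end = time.time() - start

# `computation_seconds` is a hypothetical field added to the response message
# for the purpose of this test; the server sets it inside Run().
overhead = end_to_end - response.computation_seconds
print(f"end to end: {end_to_end:.4f}s, gRPC overhead: {overhead:.4f}s")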
Not sure if I'm taking the right approach. I need to change the number of requests I'm sending dynamically, and for this I thought I could dynamically change the number of semaphore slots. Let's say I want 10 parallel calls during the first minute, then increase to 20 for 2 minutes, then decrease to 15, and so on, like a load test. Is it possible to control dynamic_value at run time?
async def send_cus(cus_id):
    async with ClientSession(connector=TCPConnector(limit=0)) as session:
        num_parallel = asyncio.Semaphore(dynamic_value)

        async def send_cust_2(cust_id):
            async with num_parallel:
                # do something ...
                ...

        tasks = [send_cust_2(cust_id) for cust_id in my_ordered_lst]
        await asyncio.gather(*tasks)
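If it is acceptable to run the workload in stages rather than resize one live semaphore, a simple variant is to give each stage its own semaphore sized to that stage's parallelism. A sketch under assumptions: STAGES, URL, and the use of plain GETs are placeholders, not the original workload.

import asyncio
import aiohttp

# Hypothetical ramp plan: (parallel_calls, total_requests) for each stage.
STAGES = [(10, 100), (20, 200), (15, 150)]
URL = "https://httpbin.org/get"  # placeholder endpoint

async def run_stage(session, parallel_calls, total_requests):
    # Each stage gets a fresh semaphore sized for that stage's parallelism,
    # instead of trying to change the value of one semaphore at run time.
    sem = asyncio.Semaphore(parallel_calls)

    async def bounded_fetch():
        async with sem:
            async with session.get(URL) as resp:
                return resp.status

    return await asyncio.gather(*(bounded_fetch() for _ in range(total_requests)))

async def main():
    async with aiohttp.ClientSession() as session:
        for parallel_calls, total_requests in STAGES:
            await run_stage(session, parallel_calls, total_requests)

# asyncio.run(main())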