Optimizing speed for bulk SSL API requests - python

I have written a script that makes around 530 API calls, which I intend to run every 5 minutes; from these calls I store data to process in bulk later (prediction etc.).
The API has a limit of 500 requests per second, yet when running my code I am seeing roughly 2 seconds per call (due to SSL, I believe).
How can I speed this up so I can run the roughly 530 requests within 5 minutes? The current run time renders the data I am collecting useless.
Code:
def getsurge(lat, long):
    # Ask the API for price estimates from the given pickup point to a fixed drop-off
    response = client.get_price_estimates(
        start_latitude=lat,
        start_longitude=long,
        end_latitude=-34.063676,
        end_longitude=150.815075
    )
    result = response.json.get('prices')
    return result

def writetocsv(database):
    database_writer = csv.writer(database)
    database_writer.writerow(HEADER)
    pool = Pool()
    # Open Estimate Database
    while True:
        for data in coordinates:
            line = data.split(',')
            long = line[3]
            lat = line[4][:-2]
            estimate = getsurge(lat, long)
            timechecked = datetime.datetime.now()
            for d in estimate:
                if d['display_name'] == 'TAXI':
                    database_writer.writerow([timechecked, [line[0], line[1]], d['surge_multiplier']])
                    database.flush()
                    print(timechecked, [line[0], line[1]], d['surge_multiplier'])

Is the API under your control? If so, create an endpoint which can give you all the data you need in one go.
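If it isn't, the usual way to attack the ~2 seconds per call is to reuse connections and issue the calls concurrently, since each call is dominated by network and SSL latency rather than CPU. Below is a minimal sketch of how the inner loop of writetocsv above could fan the calls out over a thread pool; the worker count is an illustrative value you would tune against the 500 requests/second cap, and it assumes the underlying client can safely be shared across threads (if not, create one per thread).
from concurrent.futures import ThreadPoolExecutor

def surge_for(data):
    # Parse one coordinate line and fetch its estimate on a worker thread
    line = data.split(',')
    long = line[3]
    lat = line[4][:-2]
    return line, getsurge(lat, long)

with ThreadPoolExecutor(max_workers=20) as executor:
    for line, estimate in executor.map(surge_for, coordinates):
        timechecked = datetime.datetime.now()
        for d in estimate:
            if d['display_name'] == 'TAXI':
                database_writer.writerow([timechecked, [line[0], line[1]], d['surge_multiplier']])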

Related

Retrieve all the API response while you have maximum offset in Python

I am attempting to retrieve data from an API that has a maximum offset of 200,000, and the records I need exceed that offset. Below is a sample of the code I am using; when I reach the offset limit of 200,000 it breaks. (The API doesn't return anything helpful about how many pages/requests are needed, which is why I keep going until there are no more results.) I need to find a way to loop through and pull all the data. Thanks.
def pull_api_data():
    offset_limit = 0
    teltel_data = []
    # Loop through the results and add them if present
    while True:
        print("Skip", offset_limit, "rows before beginning to return results")
        querystring = {"offset": "{}".format(offset_limit),
                       "filter": "starttime>={}".format(date_filter),
                       "limit": "5000"}
        response = session.get(url=url, headers=the_headers, params=querystring)
        data = response.json()['data']
        # Do we have more data from teltel?
        if len(data) == 0:
            break
        # If yes, then add the data to the main list, teltel_data
        teltel_data.extend(data)
        # Increase offset_limit to skip the already added data
        offset_limit = offset_limit + 5000
    # Transform the raw data by converting it to a dataframe and do the necessary cleaning

pull_api_data()
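One common workaround for a hard offset cap, given that the querystring above already filters on starttime, is to page within narrower date windows so that no single window ever needs an offset above the limit. The sketch below assumes the API also accepts an upper bound on starttime; the combined filter syntax, window size and date format are illustrative and would have to match what the real API expects.
import datetime

def pull_window(window_start, window_end):
    # Same paging loop as above, but bounded to one date window,
    # so the offset never has to climb past the API's cap
    offset, rows = 0, []
    while True:
        querystring = {"offset": str(offset),
                       "filter": "starttime>={},starttime<{}".format(window_start, window_end),
                       "limit": "5000"}
        data = session.get(url=url, headers=the_headers, params=querystring).json()['data']
        if not data:
            return rows
        rows.extend(data)
        offset += 5000

def pull_all(first_day, last_day, days_per_window=7):
    all_rows, start = [], first_day
    while start < last_day:
        end = start + datetime.timedelta(days=days_per_window)
        all_rows.extend(pull_window(start.isoformat(), end.isoformat()))
        start = end
    return all_rows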

Parallel web requests with GPU on Google Colab

I need to obtain properties from a web service for a large list of products (~25,000), and this is a very time-sensitive operation (ideally it should execute in just a few seconds). I coded this first using a for loop as a proof of concept, but it takes 1.25 hours. I'd like to vectorize this code and execute the HTTP requests in parallel using a GPU on Google Colab. I've removed many of the unnecessary details, but it's important to note that the products and their web service URLs are stored in a DataFrame.
Will this be faster to execute on a GPU, or should I just use multiple threads on a CPU?
What is the best way to implement this? And how can I save the results from parallel processes to the results DataFrame (all_product_properties) without running into concurrency problems?
Each product has multiple properties (key-value pairs) that I'm obtaining from the JSON response, but the product_id is not included in the JSON response, so I need to add it to the DataFrame.
import json
import pandas as pd
import requests

# DataFrame containing a string column of urls
urls = pd.DataFrame(["www.url1.com", "www.url2.com", ..., "www.url3.com"], columns=["url"])

# Initialize an empty dataframe to store properties for all products
all_product_properties = pd.DataFrame(columns=["product_id", "property_name", "property_value"])

for i in range(1, len(urls)):
    curr_url = urls.loc[i, "url"]
    try:
        http_response = requests.request("GET", curr_url)
        if http_response is not None:
            http_response_json = json.loads(http_response.text)
            # Extract product properties from the JSON response
            product_properties_json = http_response_json['product_properties']
            curr_product_properties_df = pd.json_normalize(product_properties_json)
            # Add the product id since it's not returned in the JSON
            curr_product_properties_df["product_id"] = i
            # Save the current product's properties to the DataFrame containing all product properties
            all_product_properties = pd.concat([all_product_properties, curr_product_properties_df])
    except Exception as e:
        print(e)
GPUs probably will not help here since they are meant for accelerating numerical operations. However, since you are trying to parallelize HTTP requests which are I/O bound, you can use Python multithreading (part of the standard library) to reduce the time required.
In addition, concatenating pandas DataFrames in a loop is a very slow operation (see: Why does concatenation of DataFrames get exponentially slower?). You can instead append your output to a list and run a single concat after the loop has concluded.
Here's how I would implement your code with multithreading:
import concurrent.futures
import json
import threading
import time

import pandas as pd
import requests

# Use an empty list for storing the loop output
all_product_properties = []

thread_local = threading.local()

def get_session():
    # One requests.Session per thread, so connections are reused safely
    if not hasattr(thread_local, "session"):
        thread_local.session = requests.Session()
    return thread_local.session

def download_site(indexed_url):
    index, url = indexed_url  # keep the row index so the product_id survives the thread pool
    session = get_session()
    try:
        with session.get(url) as response:
            if response is not None:
                print(f"Read {len(response.content)} from {url}")
                http_response_json = json.loads(response.text)
                product_properties_json = http_response_json['product_properties']
                curr_product_properties_df = pd.json_normalize(product_properties_json)
                # Add the product id since it's not returned in the JSON
                curr_product_properties_df["product_id"] = index
                # Return the current product's properties for the final concat
                return curr_product_properties_df
    except Exception as e:
        print(e)

def download_all_sites(sites):
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        all_product_properties = executor.map(download_site, enumerate(sites))
    return all_product_properties

if __name__ == "__main__":
    # Store URLs as a list, example below
    urls = ["https://www.jython.org", "http://olympus.realpython.org/dice"] * 10
    start_time = time.time()
    all_product_properties = download_all_sites(urls)
    # Drop failed downloads (None) before the single concat
    all_product_properties = pd.concat(
        [df for df in all_product_properties if df is not None])
    duration = time.time() - start_time
    print(f"Downloaded {len(urls)} in {duration} seconds")
Reference: this RealPython article on multithreading and multiprocessing in Python: https://realpython.com/python-concurrency/
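To feed the question's urls DataFrame into this instead of the example list, the url column can be passed as a plain list; the positional index then doubles as the product_id. A short sketch:
all_product_properties = download_all_sites(urls["url"].tolist())
all_product_properties = pd.concat(
    [df for df in all_product_properties if df is not None])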

Python 3.7.4 -> How to keep memory usage low?

The following code retrieves and creates an index of unique cards on a given database.
for x in range(2010, 2015):
    for y in range(1, 13):
        index = str(x) + "-" + str("0" + str(y) if y < 10 else y)
        url = urlBase.replace("INDEX", index)
        response = requests.post(url, data=query, auth=(user, pwd))
        if response.status_code != 200:
            continue
        # This is a big JSON, around 4MB each
        parsedJson = json.loads(response.content)["aggregations"]["uniqCards"]["buckets"]
        for z in parsedJson:
            valKey = 0
            ind = 0
            header = str(z["key"])[:8]
            if header in headers:
                ind = headers.index(header)
            else:
                headers.append(header)
            valKey = int(str(ind) + str(z["key"])[8:])
            creditCards.append(CreditCard(valKey, x * 100 + y))
The CreditCard object, the only one that survives the scope, is around 64 bytes each.
After running, this code was supposed to map around 10 million cards. That would translate to 640 million bytes, or around 640 megabytes.
The problem is that midway through this operation, memory consumption hits about 3 GB...
My first guess is that, for some reason, the GC is not collecting parsedJson. What should I do to keep memory consumption under control? Can I dispose of that object manually?
Edit1:
The CreditCard class is defined as:
class CreditCard:
    number = 0
    knownSince = 0

    def __init__(self, num, date):
        self.number = num
        self.knownSince = date
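As a side note on the 64-byte estimate: a plain class like this also carries a per-instance __dict__, so 10 million instances cost considerably more than the two attributes alone. Declaring __slots__ is a standard way to drop that overhead; a minimal sketch of the same class (not part of the original post):
class CreditCard:
    __slots__ = ("number", "knownSince")  # no per-instance __dict__, so each object stays small

    def __init__(self, num, date):
        self.number = num
        self.knownSince = date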
Edit2:
When I get to 3.5 million cards in creditCards.__len__(), sys.getsizeof(creditCards) reports 31 MB, but the process is consuming 2 GB!
The problem is json.loads: loading a 4 MB response results in a 5-8x memory jump.
Edit:
I managed to work around this using a custom mapper for the JSON:
def object_decoder(obj):
    if 'key' in obj:
        return CreditCard(obj['key'], xy)
    return obj
Now the memory grows slowly, and I've been able to parse the whole set using around 2 GB of memory.
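For reference, a decoder like this is wired in through the object_hook parameter of json.loads, which passes every decoded JSON object through the hook, so the intermediate dicts are replaced as they are built instead of piling up. A short usage sketch, assuming xy is the x * 100 + y date stamp from the loop further up:
xy = x * 100 + y  # the date stamp the decoder closes over
parsedJson = json.loads(response.content, object_hook=object_decoder)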

Replacing foreach with threading

My program basically has to get around 6000 items from the DB and call an external API for each item. This takes almost 30 minutes to complete. I thought of using threads here, where I could create multiple threads and split the work to reduce the time, so I came up with something like this. But I have two questions. How do I store the response from the API that is processed by the function?
api = externalAPI()

for x in instruments:
    response = api.getProcessedItems(x.symbol, days, return_perc)
    if response > float(return_perc):
        return_response.append([x.trading_symbol, x.name, response])
So in the above example the for loop runs 6000 times (len(instruments) == 6000).
Now let's say I have split the 6000 items into 2 * 3000 items and do something like this:
class externalApi:
    def handleThread(self, symbol, days, perc):
        # I call the external API and process the items
        # How do I store the processed data?

    def getProcessedItems(self, symbol, days, perc):
        _thread.start_new_thread(self.handleThread, (symbol, days, perc))
        _thread.start_new_thread(self.handleThread, (symbol, days, perc))
        return self.thread_response
I am just starting out with threads. It would be helpful to know whether this is the right approach to reduce the time here.
P.S.: Time is important here. I want to reduce it from 30 minutes to 1 minute.
I suggest using the worker-queue pattern, like so:
You have a queue of jobs; each worker takes a job, works on it, and puts the result on another queue. When all workers are done, the result queue is read and the results are processed.
import queue
import threading

def worker(pool, result_q):
    while True:
        job = pool.get()
        result = handle(job)  # handle the job
        result_q.put(result)
        pool.task_done()

q = queue.Queue()
res_q = queue.Queue()

for i in range(num_worker_threads):
    t = threading.Thread(target=worker, args=(q, res_q))
    t.daemon = True
    t.start()

for job in jobs:
    q.put(job)

q.join()

while not res_q.empty():
    result = res_q.get()
    # do something with the result
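To connect this back to the question, each queued job can simply be one instrument, and handle() can wrap the API call from the original sequential loop; a brief sketch reusing the api, days and return_perc names from the question (it assumes getProcessedItems is the original synchronous call, and the tuple handle returns is what ends up on res_q):
def handle(instrument):
    # One API call per job; keep symbol, name and response together
    response = api.getProcessedItems(instrument.symbol, days, return_perc)
    return (instrument.trading_symbol, instrument.name, response)

jobs = instruments  # each job handed to the workers is simply one instrument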
The worker-queue pattern suggested in shahaf's answer works fine, but Python provides an even higher-level abstraction in concurrent.futures, namely ThreadPoolExecutor, which takes care of the queueing and starting of threads for you:
from concurrent.futures import ThreadPoolExecutor
executor = ThreadPoolExecutor(max_workers=30)
responses = executor.map(process_item, (x.symbol for x in instruments))
The main complication with using executor.map() is that it can only map over one argument, meaning that there can be only one input to process_item (namely symbol).
However, if more arguments are needed, it is possible to define a new function which fixes all arguments but one. This can be done manually or using partial from functools:
from functools import partial
process_item = partial(api.handleThread, days=days, perc=return_perc)
Applying the ThreadPoolExecutor strategy to your current problem would then give a solution similar to:
from concurrent.futures import ThreadPoolExecutor
from functools import partial

class Instrument:
    def __init__(self, symbol, name):
        self.symbol = symbol
        self.name = name

instruments = [Instrument('SMB', 'Name'), Instrument('FNK', 'Funky')]

class externalApi:
    def handleThread(self, symbol, days, perc):
        # Call the external API and process the items
        # Example, to give something back:
        if symbol == 'FNK':
            return days * 3
        else:
            return days

def process_item_generator(api, days, perc):
    return partial(api.handleThread, days=days, perc=perc)

days = 5
return_perc = 10
api = externalApi()
process_item = process_item_generator(api, days, return_perc)
executor = ThreadPoolExecutor(max_workers=30)
responses = executor.map(process_item, (x.symbol for x in instruments))
return_response = ([x.symbol, x.name, response]
                   for x, response in zip(instruments, responses)
                   if response > float(return_perc))
Here I have assumed that x.symbol is the same as x.trading_symbol and I have made a dummy implementation of your API call, to get some type of return value, but it should give a good idea of how to do this. Due to this, the code is a bit longer, but then again, it becomes a runnable example.

Global variables shared across all requests in Pyramid

I have this code in Pylons that calculates the network usage of the Linux system on which the webapp runs. To calculate network utilization, we read /proc/net/dev twice, which gives us the amount of transmitted data, and divide the difference by the time elapsed between the two reads.
I don't want to do this calculation at regular intervals. There's JS code which periodically fetches this data, so the transfer rate is the average number of bytes transmitted between two requests per unit of time. In Pylons, I used pylons.app_globals to store the reading, which is subtracted from the next reading on the subsequent request. But apparently there's no app_globals in Pyramid, and I'm not sure whether using thread locals is the correct course of action. Also, although request.registry.settings is apparently shared across all requests, I'm reluctant to store my data there, since, as the name implies, it should only store settings.
def netUsage():
    netusage = {'rx': 0, 'tx': 0, 'time': time.time()}
    rtn = {}
    net_file = open('/proc/net/dev')
    for line in net_file.readlines()[2:]:
        tmp = map(string.atof, re.compile(r'\d+').findall(line[line.find(':'):]))
        if line[:line.find(':')].strip() == "lo":
            continue
        netusage['rx'] += tmp[0]
        netusage['tx'] += tmp[8]
    net_file.close()
    rx = netusage['rx'] - app_globals.prevNetusage['rx'] if app_globals.prevNetusage['rx'] else 0
    tx = netusage['tx'] - app_globals.prevNetusage['tx'] if app_globals.prevNetusage['tx'] else 0
    elapsed = netusage['time'] - app_globals.prevNetusage['time']
    rtn['rx'] = humanReadable(rx / elapsed)
    rtn['tx'] = humanReadable(tx / elapsed)
    app_globals.prevNetusage = netusage
    return rtn
#memorize(duration = 3)
def getSysStat():
    memTotal, memUsed = getMemUsage()
    net = netUsage()
    loadavg = getLoadAverage()
    return {'cpu': getCPUUsage(),
            'mem': int((memUsed / memTotal) * 100),
            'load1': loadavg[0],
            'load5': loadavg[1],
            'load15': loadavg[2],
            'procNum': loadavg[3],
            'lastProc': loadavg[4],
            'rx': net['rx'],
            'tx': net['tx']
            }
Using request thread locals is considered bad design and should not be abused, according to the official Pyramid docs.
My advice is to use a simple key-value store like memcached or Redis if possible.
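A minimal sketch of that idea with the redis-py client, keeping the previous reading under a key of our own choosing (the key name, serialization and connection settings are illustrative, not part of the original answer):
import json

import redis

r = redis.Redis()  # assumes a local Redis instance on the default port

def load_prev_netusage():
    raw = r.get('netusage:prev')
    return json.loads(raw) if raw else None

def save_netusage(netusage):
    r.set('netusage:prev', json.dumps(netusage))
netUsage() would then call load_prev_netusage() instead of reading app_globals.prevNetusage, and save_netusage(netusage) instead of assigning to it; because the value lives in Redis, it is shared across all worker threads and processes.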
