Retrieve all API responses when the API has a maximum offset in Python

I am attempting to retrieve data from an API that has a maximum offset of 200,000, but the number of records I need to pull exceeds that offset. Below is a sample of the code I am using; when I reach the offset limit of 200,000 it breaks (the API doesn't return any helpful information about how many pages/requests are needed, which is why I keep requesting until there are no more results). I need to find a way to loop through and pull all the data. Thanks
def pull_api_data():
    offset_limit = 0
    teltel_data = []
    # Loop through the results and add if present
    while True:
        print("Skip", offset_limit, "rows before beginning to return results")
        querystring = {"offset": "{}".format(offset_limit), "filter": "starttime>={}".format(date_filter), "limit": "5000"}
        response = session.get(url=url, headers=the_headers, params=querystring)
        data = response.json()['data']
        # Do we have more data from teltel?
        if len(data) == 0:
            break
        # If yes, then add the data to the main list, teltel_data
        teltel_data.extend(data)
        # Increase offset_limit to skip the already added data
        offset_limit = offset_limit + 5000
    # transform the raw data by converting it to a dataframe and do necessary cleaning

pull_api_data()
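One workaround (not shown in the original post, a sketch only) is to stop paging on the offset alone: whenever the offset approaches the 200,000 cap, advance the starttime filter to the newest value seen so far and reset the offset to 0, so no single window ever has to page past the cap. This assumes each record carries a starttime field and that session, url, the_headers and date_filter are defined as in the question:

def pull_api_data_windowed():
    page_size = 5000
    offset_limit = 0
    window_start = date_filter
    teltel_data = []
    while True:
        querystring = {
            "offset": str(offset_limit),
            "filter": "starttime>={}".format(window_start),
            "limit": str(page_size),
        }
        response = session.get(url=url, headers=the_headers, params=querystring)
        data = response.json()['data']
        if not data:
            break
        teltel_data.extend(data)
        offset_limit += page_size
        # Before the 200,000 offset cap is reached, slide the time window forward
        # and restart from offset 0.
        if offset_limit + page_size > 200000:
            window_start = max(row['starttime'] for row in data)  # assumes a 'starttime' field on each record
            offset_limit = 0
    return teltel_data

Depending on how the API treats the window boundary, records sharing the boundary starttime may be returned twice (or skipped), so de-duplicating on a unique identifier after the loop is a sensible safeguard.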

Related

Get all transactions from OKX with python

I am trying to build a full overview of my transactions (buy/sell/deposit/withdrawal/earnings and bot trades) with Python for OKX, but I only get 2 trades back (and I have made more than 2).
I have tried sending requests to orders-history-archive and using fetchMyTrades from the CCXT library (I have tried some other functions as well, but I still don't get my transactions).
Is there a way to get a full overview for OKX with Python (and for other brokers/wallets)?
Here is how I try to get the data with CCXT (it only gives 2 results):
def getMyTrades(self):
    tData = []
    tSymboles = [
        'BTC/USDT',
        'ETH/USDT',
        'SHIB/USDT',
        'CELO/USDT',
        'XRP/USDT',
        'SAMO/USDT',
        'NEAR/USDT',
        'ETHW/USDT',
        'DOGE/USDT',
        'SOL/USDT',
        'LUNA/USDT'
    ]
    for item in tSymboles:
        if exchange.has['fetchMyTrades']:
            since = exchange.milliseconds() - 60*60*24*180*1000  # -180 days from now
            while since < exchange.milliseconds():
                symbol = item  # change for your symbol
                limit = 20  # change for your limit
                orders = exchange.fetchMyTrades(symbol, since, limit)
                if len(orders):
                    since = orders[len(orders) - 1]['timestamp'] + 1
                    tData += orders
                else:
                    break
    return tData
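One thing worth checking (a sketch only, under the assumption that the missing trades simply belong to symbols not on the hand-written list) is to page through every market the exchange reports rather than a fixed set of pairs; the placeholder credentials below are hypothetical and enableRateLimit lets CCXT pace the many requests this makes:

import ccxt

exchange = ccxt.okx({
    'apiKey': '...',      # hypothetical placeholders
    'secret': '...',
    'password': '...',
    'enableRateLimit': True,
})

def get_all_my_trades(days=180, limit=100):  # per-page limit is an assumption; check the exchange's maximum
    t_data = []
    since_start = exchange.milliseconds() - days * 24 * 60 * 60 * 1000
    for symbol in exchange.load_markets():
        since = since_start
        while True:
            trades = exchange.fetch_my_trades(symbol, since, limit)
            if not trades:
                break
            since = trades[-1]['timestamp'] + 1
            t_data += trades
    return t_data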

Parallel web requests with a GPU on Google Colab

I need to obtain properties from a web service for a large list of products (~25,000), and this is a very time-sensitive operation (ideally I need it to execute in just a few seconds). I coded this first using a for loop as a proof of concept, but it takes 1.25 hours. I'd like to vectorize this code and execute the HTTP requests in parallel using a GPU on Google Colab. I've removed many of the unnecessary details, but it's important to note that the products and their web service URLs are stored in a DataFrame.
Will this be faster to execute on a GPU? Or should I just use multiple threads on a CPU?
What is the best way to implement this? And how can I save the results from the parallel processes to the results DataFrame (all_product_properties) without running into concurrency problems?
Each product has multiple properties (key-value pairs) that I'm obtaining from the JSON response, but the product_id is not included in the JSON response, so I need to add the product_id to the DataFrame.
#DataFrame containing string column of urls
urls = pd.DataFrame(["www.url1.com", "www.url2.com", ..., "www.url3.com"], columns=["url"])

#initialize empty dataframe to store properties for all products
all_product_properties = pd.DataFrame(columns=["product_id", "property_name", "property_value"])

for i in range(1, len(urls)):
    curr_url = urls.loc[i, "url"]
    try:
        http_response = requests.request("GET", curr_url)
        if http_response is not None:
            http_response_json = json.loads(http_response.text)
            #extract product properties from JSON response
            product_properties_json = http_response_json['product_properties']
            curr_product_properties_df = pd.json_normalize(product_properties_json)
            #add product id since it's not returned in the JSON
            curr_product_properties_df["product_id"] = i
            #save current product properties to DataFrame containing all product properties
            all_product_properties = pd.concat([all_product_properties, curr_product_properties_df])
    except Exception as e:
        print(e)
GPUs probably will not help here since they are meant for accelerating numerical operations. However, since you are trying to parallelize HTTP requests which are I/O bound, you can use Python multithreading (part of the standard library) to reduce the time required.
In addition, concatenating pandas dataframes in a loop is a very slow operation (see: Why does concatenation of DataFrames get exponentially slower?). You can instead append your output to a list, and run just a single concat after the loop has concluded.
Here's how I would implement your code w/ multithreading:
import concurrent.futures
import json
import threading
import time

import pandas as pd
import requests

# Use an empty list for storing loop output
all_product_properties = []

thread_local = threading.local()

def get_session():
    # One requests.Session per thread (a Session should not be shared across threads)
    if not hasattr(thread_local, "session"):
        thread_local.session = requests.Session()
    return thread_local.session

def download_site(indexed_url):
    # Receives a (product_id, url) pair so the id can be attached to the result,
    # since the JSON response does not include it
    product_id, url = indexed_url
    session = get_session()
    try:
        with session.get(url) as response:
            if response is not None:
                http_response_json = json.loads(response.text)
                # extract product properties from JSON response
                product_properties_json = http_response_json['product_properties']
                curr_product_properties_df = pd.json_normalize(product_properties_json)
                # add product id since it's not returned in the JSON
                curr_product_properties_df["product_id"] = product_id
                print(f"Read {len(response.content)} from {url}")
                # return current product properties for the single concat after the loop
                return curr_product_properties_df
    except Exception as e:
        print(e)

def download_all_sites(sites):
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        all_product_properties = executor.map(download_site, enumerate(sites))
    return all_product_properties

if __name__ == "__main__":
    # Store URLs as list, example below
    urls = ["https://www.jython.org", "http://olympus.realpython.org/dice"] * 10
    start_time = time.time()
    all_product_properties = download_all_sites(urls)
    # drop None entries from failed requests before the single concat
    all_product_properties = pd.concat([df for df in all_product_properties if df is not None])
    duration = time.time() - start_time
    print(f"Downloaded {len(urls)} in {duration} seconds")
Reference: this RealPython article on multithreading and multiprocessing in Python: https://realpython.com/python-concurrency/

How to loop through millions of Django model objects without getting an out of range or other error

I have millions of objects in a Postgres database and I need to send data from 200 of them at a time to an API, which will give me additional information (the API can only deal with up to 200 elements at a time). I've tried several strategies. The first strategy ended up with my script getting killed because it used too much memory. This attempt below worked better, but I got the following error: django.db.utils.DataError: bigint out of range. This error happened around when the "start" variable reached 42,000. What is a more efficient way to accomplish this task? Thank you.
articles_to_process = Article.objects.all()  # This will be in the millions
dois = articles_to_process.values_list('doi', flat=True)  # These are IDs of articles
start = 0
end = 200  # The API to which I will send IDs can only return up to 200 records at a time.
number_of_dois = dois.count()
times_to_loop = (number_of_dois / 200) + 1
while times_to_loop > 0:
    times_to_loop = times_to_loop - 1
    chunk = dois[start:end]
    doi_string = ', '.join(chunk)
    start = start + 200
    end = end + 200
    [DO API CALL, GET DATA FOR EACH ARTICLE, SAVE THAT DATA TO ARTICLE]
Consider using iterator:
chunk_size = 200
counter = 0
idx = []
for article_id in dois.iterator(chunk_size=chunk_size):
    counter += 1
    idx.append(str(article_id))
    if counter >= chunk_size:
        doi_string = ', '.join(idx)
        idx = []
        counter = 0
        # DO API CALL, GET DATA FOR EACH ARTICLE, SAVE THAT DATA TO ARTICLE

# send the final partial batch, if any
if idx:
    doi_string = ', '.join(idx)
    # DO API CALL, GET DATA FOR EACH ARTICLE, SAVE THAT DATA TO ARTICLE
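A slightly more compact variant of the same idea (a sketch, assuming the same dois queryset as in the question) batches the iterator with itertools.islice, so the leftover chunk is handled automatically:

from itertools import islice

article_ids = dois.iterator(chunk_size=200)
while True:
    chunk = list(islice(article_ids, 200))
    if not chunk:
        break
    doi_string = ', '.join(str(article_id) for article_id in chunk)
    # DO API CALL, GET DATA FOR EACH ARTICLE, SAVE THAT DATA TO ARTICLE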

How to collect all results from a web API in Python?

I am collecting data from a web API by using a Python script. The web API provides a maximum of 50 results per request ("size": 50). However, I need to collect all the results. Please let me know how I can do it. My initial code is below. Thank you in advance.
def getData():
    headers = {
        'Content-type': 'application/json',
    }
    data = '{"size":50,"sites.recruitment_status":"ACTIVE", "sites.org_state_or_province":"VA"}'
    response = requests.post('https://clinicaltrialsapi.cancer.gov/v1/clinical-trials', headers=headers, data=data)
    print(response.json())
To add to the answer already given, you can get the total number of results from the initial JSON. You can then use a loop to increment through the batches:
import requests

url = "https://clinicaltrialsapi.cancer.gov/v1/clinical-trials"

r = requests.get(url).json()
num_results = int(r['total'])

results_per_request = 50
total = 0
all_trials = []
while total < num_results:
    # "from" is the paging offset, "size" the page size (max 50)
    batch = requests.get(url, params={"size": results_per_request, "from": total}).json()
    all_trials.extend(batch.get('trials', []))
    total += results_per_request
    print(total)
Everything is in the docs:
https://clinicaltrialsapi.cancer.gov/#!/Clinical45trials/searchTrialsByGet
GET clinical-trials
Filters all clinical trials based upon supplied filter params. Filter
params may be any of the fields in the schema as well as any of the
following params...
size: limit the amount of results to a supplied amount (default is 10,
max is 50)
from: start the results from a supplied starting point (default is 0)
...
So you just have to specify a "from" value, and increment it 50 by 50.
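Applied to the POST request from the question, that could look like the sketch below (assuming the endpoint also accepts "from" and "size" in the JSON body alongside the filter params, and returns the results under a 'trials' key):

import requests

url = 'https://clinicaltrialsapi.cancer.gov/v1/clinical-trials'
headers = {'Content-type': 'application/json'}

all_trials = []
start = 0
while True:
    payload = {
        "size": 50,
        "from": start,
        "sites.recruitment_status": "ACTIVE",
        "sites.org_state_or_province": "VA",
    }
    batch = requests.post(url, headers=headers, json=payload).json()
    trials = batch.get('trials', [])
    if not trials:
        break
    all_trials.extend(trials)
    start += 50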

Optimizing speed for bulk SSL API requests

I have written a script to run around 530 API calls, which I intend to run every 5 minutes; from these calls I store data to process in bulk later (prediction, etc.).
The API has a limit of 500 requests per second. However, when running my code I am seeing about 2 seconds per call (due to SSL, I believe).
How can I speed this up so I can run 500 requests within 5 minutes? The time currently required renders the data I am collecting useless :(
Code:
def getsurge(lat, long):
    response = client.get_price_estimates(
        start_latitude=lat,
        start_longitude=long,
        end_latitude=-34.063676,
        end_longitude=150.815075
    )
    result = response.json.get('prices')
    return result

def writetocsv(database):
    database_writer = csv.writer(database)
    database_writer.writerow(HEADER)
    pool = Pool()
    # Open Estimate Database
    while True:
        for data in coordinates:
            line = data.split(',')
            long = line[3]
            lat = line[4][:-2]
            estimate = getsurge(lat, long)
            timechecked = datetime.datetime.now()
            for d in estimate:
                if d['display_name'] == 'TAXI':
                    database_writer.writerow([timechecked, [line[0], line[1]], d['surge_multiplier']])
                    database.flush()
                    print(timechecked, [line[0], line[1]], d['surge_multiplier'])
Is the API under your control? If so, create an endpoint which can give you all the data you need in one go.
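If the API is not under your control, the per-call latency (largely the TLS handshake and network round trip) can be hidden by issuing the calls concurrently, much like the multithreading answer above. A minimal sketch, assuming getsurge, coordinates, database_writer and database are defined as in the question, and assuming the underlying client object is safe to call from multiple threads (worth verifying with the SDK's documentation):

import concurrent.futures
import datetime

def fetch_estimate(data):
    # Parse one coordinate line and fetch its price estimate
    line = data.split(',')
    long = line[3]
    lat = line[4][:-2]
    return line, getsurge(lat, long), datetime.datetime.now()

with concurrent.futures.ThreadPoolExecutor(max_workers=20) as executor:
    for line, estimate, timechecked in executor.map(fetch_estimate, coordinates):
        for d in estimate:
            if d['display_name'] == 'TAXI':
                database_writer.writerow([timechecked, [line[0], line[1]], d['surge_multiplier']])
                database.flush()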
