Retrieve all API responses when the API has a maximum offset in Python

I am attempting to retrieve data from an API that has a maximum offset of 200,000, but the number of records I need to pull exceeds that offset. Below is a sample of the code I am using; when I reach the offset limit of 200,000 it breaks (the API doesn't return any helpful information about how many pages/requests are needed, which is why I keep requesting until there are no more results). I need to find a way to loop through and pull all the data. Thanks
def pull_api_data():
    offset_limit = 0
    teltel_data = []
    # Loop through the results and add if present
    while True:
        print("Skip", offset_limit, "rows before beginning to return results")
        querystring = {"offset": "{}".format(offset_limit), "filter": "starttime>={}".format(date_filter), "limit": "5000"}
        response = session.get(url=url, headers=the_headers, params=querystring)
        data = response.json()['data']
        # Do we have more data from teltel?
        if len(data) == 0:
            break
        # If yes, then add the data to the main list, teltel_data
        teltel_data.extend(data)
        # Increase offset_limit to skip the already added data
        offset_limit = offset_limit + 5000
    # transform the raw data by converting it to a dataframe and do necessary cleaning

pull_api_data()
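One workaround (not shown in the original post, a sketch only) is to stop paging on the offset alone: whenever the offset approaches the 200,000 cap, advance the starttime filter to the newest value seen so far and reset the offset to 0, so no single window ever has to page past the cap. This assumes each record carries a starttime field and that session, url, the_headers and date_filter are defined as in the question:

def pull_api_data_windowed():
    page_size = 5000
    offset_limit = 0
    window_start = date_filter
    teltel_data = []
    while True:
        querystring = {
            "offset": str(offset_limit),
            "filter": "starttime>={}".format(window_start),
            "limit": str(page_size),
        }
        response = session.get(url=url, headers=the_headers, params=querystring)
        data = response.json()['data']
        if not data:
            break
        teltel_data.extend(data)
        offset_limit += page_size
        # Before the 200,000 offset cap is reached, slide the time window forward
        # and restart from offset 0.
        if offset_limit + page_size > 200000:
            window_start = max(row['starttime'] for row in data)  # assumes a 'starttime' field on each record
            offset_limit = 0
    return teltel_data

Depending on how the API treats the window boundary, records sharing the boundary starttime may be returned twice (or skipped), so de-duplicating on a unique identifier after the loop is a sensible safeguard.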

Related

Get all transactions from OKX with python

I am trying to build a full overview of my transactions (buy/sell/deposit/withdrawal/earnings and bot trades) with Python for OKX, but I only get 2 trades back (and I have made more than 2).
I have tried sending requests to orders-history-archive and using fetchMyTrades from the CCXT library (I have tried some other functions as well, but I still don't get my transactions).
Is there a way to get a full overview for OKX with Python (and for other brokers/wallets)?
Here is how I try to get the data with CCXT (it only gives 2 results):
def getMyTrades(self):
    tData = []
    tSymboles = [
        'BTC/USDT',
        'ETH/USDT',
        'SHIB/USDT',
        'CELO/USDT',
        'XRP/USDT',
        'SAMO/USDT',
        'NEAR/USDT',
        'ETHW/USDT',
        'DOGE/USDT',
        'SOL/USDT',
        'LUNA/USDT'
    ]
    for item in tSymboles:
        if exchange.has['fetchMyTrades']:
            since = exchange.milliseconds() - 60*60*24*180*1000  # -180 days from now
            while since < exchange.milliseconds():
                symbol = item  # change for your symbol
                limit = 20  # change for your limit
                orders = exchange.fetchMyTrades(symbol, since, limit)
                if len(orders):
                    since = orders[len(orders) - 1]['timestamp'] + 1
                    tData += orders
                else:
                    break
    return tData
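One thing worth checking (a sketch only, under the assumption that the missing trades simply belong to symbols not on the hand-written list) is to page through every market the exchange reports rather than a fixed set of pairs; the placeholder credentials below are hypothetical and enableRateLimit lets CCXT pace the many requests this makes:

import ccxt

exchange = ccxt.okx({
    'apiKey': '...',      # hypothetical placeholders
    'secret': '...',
    'password': '...',
    'enableRateLimit': True,
})

def get_all_my_trades(days=180, limit=100):  # per-page limit is an assumption; check the exchange's maximum
    t_data = []
    since_start = exchange.milliseconds() - days * 24 * 60 * 60 * 1000
    for symbol in exchange.load_markets():
        since = since_start
        while True:
            trades = exchange.fetch_my_trades(symbol, since, limit)
            if not trades:
                break
            since = trades[-1]['timestamp'] + 1
            t_data += trades
    return t_data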

Parallel web requests with a GPU on Google Colab

I need to obtain properties from a web service for a large list of products (~25,000), and this is a very time-sensitive operation (ideally I need it to execute in just a few seconds). I coded this first using a for loop as a proof of concept, but it takes 1.25 hours. I'd like to vectorize this code and execute the HTTP requests in parallel using a GPU on Google Colab. I've removed many of the unnecessary details, but it's important to note that the products and their web service URLs are stored in a DataFrame.
Will this be faster to execute on a GPU? Or should I just use multiple threads on a CPU?
What is the best way to implement this? And how can I save the results from the parallel processes to the results DataFrame (all_product_properties) without running into concurrency problems?
Each product has multiple properties (key-value pairs) that I'm obtaining from the JSON response, but the product_id is not included in the JSON response, so I need to add the product_id to the DataFrame.
#DataFrame containing string column of urls
urls = pd.DataFrame(["www.url1.com", "www.url2.com", ..., "www.url3.com"], columns=["url"])

#initialize empty dataframe to store properties for all products
all_product_properties = pd.DataFrame(columns=["product_id", "property_name", "property_value"])

for i in range(1, len(urls)):
    curr_url = urls.loc[i, "url"]
    try:
        http_response = requests.request("GET", curr_url)
        if http_response is not None:
            http_response_json = json.loads(http_response.text)
            #extract product properties from JSON response
            product_properties_json = http_response_json['product_properties']
            curr_product_properties_df = pd.json_normalize(product_properties_json)
            #add product id since it's not returned in the JSON
            curr_product_properties_df["product_id"] = i
            #save current product properties to DataFrame containing all product properties
            all_product_properties = pd.concat([all_product_properties, curr_product_properties_df])
    except Exception as e:
        print(e)
GPUs probably will not help here since they are meant for accelerating numerical operations. However, since you are trying to parallelize HTTP requests which are I/O bound, you can use Python multithreading (part of the standard library) to reduce the time required.
In addition, concatenating pandas dataframes in a loop is a very slow operation (see: Why does concatenation of DataFrames get exponentially slower?). You can instead append your output to a list, and run just a single concat after the loop has concluded.
Here's how I would implement your code w/ multithreading:
import concurrent.futures
import json
import threading
import time

import pandas as pd
import requests

# Use an empty list for storing loop output
all_product_properties = []

thread_local = threading.local()

def get_session():
    # One requests.Session per thread (a Session should not be shared across threads)
    if not hasattr(thread_local, "session"):
        thread_local.session = requests.Session()
    return thread_local.session

def download_site(indexed_url):
    # Receives a (product_id, url) pair so the id can be attached to the result,
    # since the JSON response does not include it
    product_id, url = indexed_url
    session = get_session()
    try:
        with session.get(url) as response:
            if response is not None:
                http_response_json = json.loads(response.text)
                # extract product properties from JSON response
                product_properties_json = http_response_json['product_properties']
                curr_product_properties_df = pd.json_normalize(product_properties_json)
                # add product id since it's not returned in the JSON
                curr_product_properties_df["product_id"] = product_id
                print(f"Read {len(response.content)} from {url}")
                # return current product properties for the single concat after the loop
                return curr_product_properties_df
    except Exception as e:
        print(e)

def download_all_sites(sites):
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        all_product_properties = executor.map(download_site, enumerate(sites))
    return all_product_properties

if __name__ == "__main__":
    # Store URLs as list, example below
    urls = ["https://www.jython.org", "http://olympus.realpython.org/dice"] * 10
    start_time = time.time()
    all_product_properties = download_all_sites(urls)
    # drop None entries from failed requests before the single concat
    all_product_properties = pd.concat([df for df in all_product_properties if df is not None])
    duration = time.time() - start_time
    print(f"Downloaded {len(urls)} in {duration} seconds")
Reference: this RealPython article on multithreading and multiprocessing in Python: https://realpython.com/python-concurrency/

How to loop through millions of Django model objects without getting an out of range or other error

I have millions of objects in a Postgres database and I need to send data from 200 of them at a time to an API, which will give me additional information (the API can only deal with up to 200 elements at a time). I've tried several strategies. The first strategy ended up with my script getting killed because it used too much memory. This attempt below worked better, but I got the following error: django.db.utils.DataError: bigint out of range. This error happened around when the "start" variable reached 42,000. What is a more efficient way to accomplish this task? Thank you.
articles_to_process = Article.objects.all()  # This will be in the millions
dois = articles_to_process.values_list('doi', flat=True)  # These are IDs of articles
start = 0
end = 200  # The API to which I will send IDs can only return up to 200 records at a time.
number_of_dois = dois.count()
times_to_loop = (number_of_dois / 200) + 1
while times_to_loop > 0:
    times_to_loop = times_to_loop - 1
    chunk = dois[start:end]
    doi_string = ', '.join(chunk)
    start = start + 200
    end = end + 200
    [DO API CALL, GET DATA FOR EACH ARTICLE, SAVE THAT DATA TO ARTICLE]
Consider using iterator:
chunk_size = 200
counter = 0
idx = []
for article_id in dois.iterator(chunk_size=chunk_size):
    counter += 1
    idx.append(str(article_id))
    if counter >= chunk_size:
        doi_string = ', '.join(idx)
        idx = []
        counter = 0
        # DO API CALL, GET DATA FOR EACH ARTICLE, SAVE THAT DATA TO ARTICLE

# send the final partial batch, if any
if idx:
    doi_string = ', '.join(idx)
    # DO API CALL, GET DATA FOR EACH ARTICLE, SAVE THAT DATA TO ARTICLE
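A slightly more compact variant of the same idea (a sketch, assuming the same dois queryset as in the question) batches the iterator with itertools.islice, so the leftover chunk is handled automatically:

from itertools import islice

article_ids = dois.iterator(chunk_size=200)
while True:
    chunk = list(islice(article_ids, 200))
    if not chunk:
        break
    doi_string = ', '.join(str(article_id) for article_id in chunk)
    # DO API CALL, GET DATA FOR EACH ARTICLE, SAVE THAT DATA TO ARTICLE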

How to collect all results from a web API in Python?

I am collecting data from a web API by using a Python script. The web API provides a maximum of 50 results per request ("size": 50). However, I need to collect all the results. Please let me know how I can do it. My initial code is below. Thank you in advance.
def getData():
    headers = {
        'Content-type': 'application/json',
    }
    data = '{"size":50,"sites.recruitment_status":"ACTIVE", "sites.org_state_or_province":"VA"}'
    response = requests.post('https://clinicaltrialsapi.cancer.gov/v1/clinical-trials', headers=headers, data=data)
    print(response.json())
To add to the answer already given, you can get the total number of results from the initial JSON. You can then use a loop to increment through the batches:
import requests

url = "https://clinicaltrialsapi.cancer.gov/v1/clinical-trials"

r = requests.get(url).json()
num_results = int(r['total'])

results_per_request = 50
total = 0
all_trials = []
while total < num_results:
    # "from" is the paging offset, "size" the page size (max 50)
    batch = requests.get(url, params={"size": results_per_request, "from": total}).json()
    all_trials.extend(batch.get('trials', []))
    total += results_per_request
    print(total)
Everything is in the docs:
https://clinicaltrialsapi.cancer.gov/#!/Clinical45trials/searchTrialsByGet
GET clinical-trials
Filters all clinical trials based upon supplied filter params. Filter
params may be any of the fields in the schema as well as any of the
following params...
size: limit the amount of results to a supplied amount (default is 10,
max is 50)
from: start the results from a supplied starting point (default is 0)
...
So you just have to specify a "from" value, and increment it 50 by 50.
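Applied to the POST request from the question, that could look like the sketch below (assuming the endpoint also accepts "from" and "size" in the JSON body alongside the filter params, and returns the results under a 'trials' key):

import requests

url = 'https://clinicaltrialsapi.cancer.gov/v1/clinical-trials'
headers = {'Content-type': 'application/json'}

all_trials = []
start = 0
while True:
    payload = {
        "size": 50,
        "from": start,
        "sites.recruitment_status": "ACTIVE",
        "sites.org_state_or_province": "VA",
    }
    batch = requests.post(url, headers=headers, json=payload).json()
    trials = batch.get('trials', [])
    if not trials:
        break
    all_trials.extend(trials)
    start += 50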

Optimizing speed for bulk SSL API requests

I have written a script to run around 530 API calls, which I intend to run every 5 minutes; from these calls I store data to process in bulk later (prediction, etc.).
The API has a limit of 500 requests per second. However, when running my code I am seeing about 2 seconds per call (due to SSL, I believe).
How can I speed this up so I can run 500 requests within 5 minutes? The time currently required renders the data I am collecting useless :(
Code:
def getsurge(lat, long):
    response = client.get_price_estimates(
        start_latitude=lat,
        start_longitude=long,
        end_latitude=-34.063676,
        end_longitude=150.815075
    )
    result = response.json.get('prices')
    return result

def writetocsv(database):
    database_writer = csv.writer(database)
    database_writer.writerow(HEADER)
    pool = Pool()
    # Open Estimate Database
    while True:
        for data in coordinates:
            line = data.split(',')
            long = line[3]
            lat = line[4][:-2]
            estimate = getsurge(lat, long)
            timechecked = datetime.datetime.now()
            for d in estimate:
                if d['display_name'] == 'TAXI':
                    database_writer.writerow([timechecked, [line[0], line[1]], d['surge_multiplier']])
                    database.flush()
                    print(timechecked, [line[0], line[1]], d['surge_multiplier'])
Is the API under your control? If so, create an endpoint which can give you all the data you need in one go.
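If the API is not under your control, the per-call latency (largely the TLS handshake and network round trip) can be hidden by issuing the calls concurrently, much like the multithreading answer above. A minimal sketch, assuming getsurge, coordinates, database_writer and database are defined as in the question, and assuming the underlying client object is safe to call from multiple threads (worth verifying with the SDK's documentation):

import concurrent.futures
import datetime

def fetch_estimate(data):
    # Parse one coordinate line and fetch its price estimate
    line = data.split(',')
    long = line[3]
    lat = line[4][:-2]
    return line, getsurge(lat, long), datetime.datetime.now()

with concurrent.futures.ThreadPoolExecutor(max_workers=20) as executor:
    for line, estimate, timechecked in executor.map(fetch_estimate, coordinates):
        for d in estimate:
            if d['display_name'] == 'TAXI':
                database_writer.writerow([timechecked, [line[0], line[1]], d['surge_multiplier']])
                database.flush()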
