I've been trying to extract websocket information from the Bitfinex websocket client service. Below is the code. The script works fine when I subscribe to under 30 crypto pairs (i.e. "p" or "PAIRS" has fewer than 30 elements), but if I try to go higher the script never gets to the "save_data" coroutine. Any ideas why this could be happening?
I modified the script from: "https://mmquant.net/replicating-orderbooks-from-websocket-stream-with-python-and-asyncio/", kudos to Mmquant for making the code available and giving an awesome script description.
import aiohttp
import asyncio
import ujson
from tabulate import tabulate
from copy import deepcopy
import pandas as pd
from openpyxl import load_workbook
import datetime
from datetime import datetime
import numpy as np
from collections import OrderedDict
from time import sleep
"""
Load the workbook to dump the API data into, and instruct pandas not to generate a new sheet.
The Excel workbook must:
1. Be of type ".xlsx"; this is only because load_workbook is pointed at an .xlsx file below. This can be changed.
2. Have a worksheet named "Sheet1". This can be adjusted below.
3. Be named "bitfinexwsasync.xlsx". This can be changed below.
4. Sit in the same folder as this script.
"""
book = load_workbook('bitfinexwsasync.xlsx') #.xlsx Excel spreadsheet that will be used for the placement and extracting of data.
apdat = book['Sheet1'] #Assign a variable to the sheet where the trade ratios will be put. This is case sensitive.
#The next 3 lines are critical to allow overwriting of data and not creating a new worksheet when using panda dataframes.
writer = pd.ExcelWriter('bitfinexwsasync.xlsx', engine='openpyxl')
writer.book = book
writer.sheets = dict((ws.title, ws) for ws in book.worksheets)
#Get a list of all the ratios and add the standard trade url: "https://api.bitfinex.com/v1/book/" before the ratios.
burl = 'https://api.bitfinex.com/v1/book/' #This is the standard url for retrieving trade ratios, the pair symbol must be added after this.
sym = pd.read_json('https://api.bitfinex.com/v1/symbols',orient='values') #This is a list of all the symbols on the Bitfinex website.
p=[]
p=[0]*len(sym)
for i in range(0,len(sym)):
p[i]=sym.loc[i,0]
p=tuple(p)
m=len(p) #Max number of trade ratios to extract for this script. Script cannot run the full set of 105 trade ratios, it will time-out.
p=p[0:m]
d=[]
e=[]
j=[]
"""
NOTE:
The script cannot run for the full 105 pairs; it times out and becomes unresponsive.
Stability testing showed that calling 21 pairs per script at a refresh rate of 5 seconds did not cause any time-out problems.
"""
print('________________________________________________________________________________________________________')
print('')
print('Bitfinex Websocket Trading Orderbook Extraction - Asynchronous.')
print('There are a total of ', len(sym), ' trade ratios in this exchange.')
print('Only ',m,' trading pairs will be extracted by this script, namely:',p)
print('Process initiated at',datetime.now().strftime('%Y-%m-%d %H:%M:%S'),'.') #Tells me the date and time that the data extraction was initiated.
print('________________________________________________________________________________________________________')
print('')
# Pairs to generate orderbooks for.
PAIRS = p
# If there are n pairs we need to subscribe to n websocket channels.
# This is the subscription message template.
# For details about settings refer to https://bitfinex.readme.io/v2/reference#ws-public-order-books.
SUB_MESG = {
'event': 'subscribe',
'channel': 'book',
'freq': 'F0', #Adjust for real time
'len': '25',
'prec': 'P0'
# 'pair': <pair>
}
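# For example, after the per-pair copy made in get_book() below, the message
# sent for the symbol 'btcusd' ends up as:
#   {'event': 'subscribe', 'channel': 'book', 'freq': 'F0',
#    'len': '25', 'prec': 'P0', 'pair': 'btcusd'}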
def build_book(res, pair):
""" Updates orderbook.
:param res: Orderbook update message.
:param pair: Updated pair.
"""
global orderbooks
# Filter out subscription status messages.
if res.data[0] == '[':
# String to json
data = ujson.loads(res.data)[1]
# Build orderbook
# Observe the structure of orderbook. The prices are keys for corresponding count and amount.
# Structuring data in this way significantly simplifies orderbook updates.
if len(data) > 10:
bids = {
str(level[0]): [str(level[1]), str(level[2])]
for level in data if level[2] > 0
}
asks = {
str(level[0]): [str(level[1]), str(level[2])[1:]]
for level in data if level[2] < 0
}
orderbooks[pair]['bids'] = bids
orderbooks[pair]['asks'] = asks
# Update orderbook and filter out heartbeat messages.
elif data[0] != 'h':
# Example update message structure [1765.2, 0, 1] where we have [price, count, amount].
# Update algorithm pseudocode from Bitfinex documentation:
# 1. - When count > 0 then you have to add or update the price level.
# 1.1- If amount > 0 then add/update bids.
# 1.2- If amount < 0 then add/update asks.
# 2. - When count = 0 then you have to delete the price level.
# 2.1- If amount = 1 then remove from bids
# 2.2- If amount = -1 then remove from asks
data = [str(data[0]), str(data[1]), str(data[2])]
if int(data[1]) > 0: # 1.
if float(data[2]) > 0: # 1.1
orderbooks[pair]['bids'].update({data[0]: [data[1], data[2]]})
elif float(data[2]) < 0: # 1.2
orderbooks[pair]['asks'].update({data[0]: [data[1], str(data[2])[1:]]})
elif data[1] == '0': # 2.
if data[2] == '1': # 2.1
if orderbooks[pair]['bids'].get(data[0]):
del orderbooks[pair]['bids'][data[0]]
elif data[2] == '-1': # 2.2
if orderbooks[pair]['asks'].get(data[0]):
del orderbooks[pair]['asks'][data[0]]
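# Worked example of the rules above: with orderbooks[pair]['bids'] == {'1765.2': ['3', '0.5']},
# the update [1765.3, 2, 0.8] has count > 0 and amount > 0 (rule 1.1), so the level is
# added and bids becomes {'1765.2': ['3', '0.5'], '1765.3': ['2', '0.8']};
# the update [1765.2, 0, 1] has count == 0 and amount == 1 (rule 2.1), so the
# 1765.2 bid level is deleted.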
async def save_data():
""" Save the data to the excel spreadsheet specified """
#NOTE: adjusted to run every 5 seconds, i.e. changed "await asyncio.sleep(10)" to "await asyncio.sleep(5)".
global orderbooks
while 1:
d=[]
e=[]
j=[]
await asyncio.sleep(5)
for pair in PAIRS:
bids2 = [[v[1], v[0], k] for k, v in orderbooks[pair]['bids'].items()]
asks2 = [[k, v[0], v[1]] for k, v in orderbooks[pair]['asks'].items()]
bids2.sort(key=lambda x: float(x[2]), reverse=True)
asks2.sort(key=lambda x: float(x[0]))
table2 = [[*bid, *ask] for (bid, ask) in zip(bids2, asks2)]
d.extend(table2)
e.extend([0]*len(table2))
e[len(e)-len(table2)]=pair
j.extend([0]*len(d))
j[0]=datetime.now().strftime('%Y-%m-%d %H:%M:%S')
s = pd.DataFrame(d, columns=['bid:amount', 'bid:count', 'bid:price', 'ask:price', 'ask:count', 'ask:amount'])
r = pd.DataFrame(e, columns=['Trade pair'])
u = pd.DataFrame(j, columns=['Last updated'])
z = pd.concat([s, r, u], axis=1, join_axes=[s.index])
z.to_excel(writer, 'Sheet1', startrow=0, startcol=0, index=False)
writer.save()
print('Update completed at',datetime.now().strftime('%Y-%m-%d %H:%M:%S'),'.')
async def get_book(pair, session):
""" Subscribes for orderbook updates and fetches updates. """
#print('enter get_book, pair: {}'.format(pair))
pair_dict = deepcopy(SUB_MESG) #Copy the template so each pair gets its own message without modifying SUB_MESG.
pair_dict.update({'pair': pair}) #Updates the dictionary SUB_MESG with the new pair to be evaluated. Will be added to the end of the dictionary.
async with session.ws_connect('wss://api.bitfinex.com/ws/2') as ws:
asyncio.ensure_future(ws.send_json(pair_dict)) #This was added and replaced "ws.send_json(pair_dict)" as Ubuntu python required a link to asyncio for this function.
while 1: #Loops infinitely.
res = await ws.receive()
print(pair_dict['pair'], res.data) # debug
build_book(res, pair)
async def main():
""" Driver coroutine. """
async with aiohttp.ClientSession() as session:
coros = [get_book(pair, session) for pair in PAIRS]
# Append the coroutine that saves orderbook snapshots to Excel every 5s.
coros.append(save_data())
await asyncio.wait(coros)
orderbooks = {
pair: {}
for pair in PAIRS
}
loop = asyncio.get_event_loop()
loop.run_until_complete(main())
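A minimal sketch of one thing to try (assuming the stall above roughly 30 pairs comes from opening one websocket connection per pair all at once; the limit of 5 is an arbitrary value): stagger the connection handshakes with a semaphore while every subscription still stays alive afterwards.

CONNECT_GATE = asyncio.Semaphore(5)  # arbitrary cap on simultaneous handshakes

async def get_book(pair, session):
    """ Subscribes for orderbook updates; only the connect step is throttled. """
    pair_dict = deepcopy(SUB_MESG)
    pair_dict.update({'pair': pair})
    async with CONNECT_GATE:
        ws = await session.ws_connect('wss://api.bitfinex.com/ws/2')
    try:
        await ws.send_json(pair_dict)
        while True:
            res = await ws.receive()
            build_book(res, pair)
    finally:
        await ws.close()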
Related
I am exploring the Azure management APIs. The ADF monitor pipeline endpoint returns only 100 records at a time, so I created a while loop, but for some reason I am not able to get the next token.
ct = d.get('continuationToken','')
c = 1
while ct!='':
req_body = self.getDataBody(ct)
data = self.getResponse(data_url,data_headers,req_body)
nct = self.getContinuationToken(data,c)
c = c+1
print(c)
if ct == nct:
print(ct)
print(nct)
print('duplicate token')
break
ct = nct
if ct == '':
break
Here, the token is not getting updated in the next iteration.
Update:
Here are the functions that the above code uses:
def getDataBody(self,ct):
start_date = datetime.now().strftime("%Y-%m-%d")
end_date = (datetime.now() + timedelta(days=1)).strftime("%Y-%m-%d")
data_body = {'lastUpdatedAfter': start_date, 'lastUpdatedBefore': end_date}
if ct!='':
data_body['continuationToken'] = ct
return data_body
def getResponse(self,url,headers,body):
data = requests.post(url,headers=headers,data=body)
return data.text
def getContinuationToken(self,data,c):
d = json.loads(data)
with open(f'data/{c}.json','w') as f:
json.dump(d,f)
return d.get('continuationToken','')
You can try increasing the timeout in the ADF activity; it may be that the timeout setting in your current ADF activity is less than the actual time it takes to execute that API.
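Another thing worth checking (an assumption, since the request headers are not shown): requests.post(..., data=body) sends the dict form-encoded, while the ADF run-query APIs expect a JSON body, so the continuationToken may never reach the service in a form it recognizes. A standalone sketch of getResponse using json= (dropping self for brevity):

import requests

def getResponse(url, headers, body):
    # json= serializes the dict to JSON and sets Content-Type: application/json;
    # data= would send it form-encoded instead.
    resp = requests.post(url, headers=headers, json=body)
    return resp.text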
I've built a small download manager to get data for the SHARADAR tables in Quandl. GIT
This is functioning well but the downloads are very slow for the larger files (up to 2 gb over 10 years).
I attempted to use asyncio but this didn't speed up the downloads. This may be because Quandl doesn't allow concurrent downloads. Am I making an error in my code, or is this a restriction I will have to live with from Quandl?
import asyncio
import math
import time
import pandas as pd
import quandl
import update
def segment_dates(table, date_start, date_end):
# Determine the number of days per asyncio loop. Determined by the max size of the
# range of data divided by the size of the files in 100 mb chunks.
# reduce this number for smaller more frequent downloads.
total_days = 40
# Number of days per download should be:
sizer = math.ceil(total_days / update.sharadar_tables[table][2])
# Number of days between start and end.
date_diff = date_end - date_start
loop_count = int(math.ceil(date_diff.days / sizer))
sd = date_start
sync_li = []
for _ in range(loop_count):
ed = sd + pd.Timedelta(days=sizer)
if ed > date_end:
ed = date_end
sync_li.append((sd, ed,))
sd = ed + pd.Timedelta(days=1)
return sync_li
async def get_data(table, kwarg):
"""
Using the table name and kwargs retrieves the most current data.
:param table: Name of table to update.
:param kwarg: Dictionary containing the parameters to send to Quandl.
:return dataframe: Pandas dataframe containing latest data for the table.
"""
return quandl.get_table("SHARADAR/" + table.upper(), paginate=True, **kwarg)
async def main():
table = "SF1"
# Name of the column that has the date field for this particular table.
date_col = update.sharadar_tables[table][0]
date_start = pd.to_datetime("2020-03-15")
date_end = pd.to_datetime("2020-04-01")
apikey = "API Key"
quandl.ApiConfig.api_key = apikey
# Get a list containing the times start and end for loops.
times = segment_dates(table, date_start, date_end)
wait_li = []
for t in times:
kwarg = {date_col: {"gte": t[0].strftime("%Y-%m-%d"), "lte": t[1].strftime("%Y-%m-%d")}}
wait_li.append(loop.create_task(get_data(table, kwarg)))
await asyncio.wait(wait_li)
return wait_li
if __name__ == "__main__":
starter = time.time()
try:
loop = asyncio.get_event_loop()
res = loop.run_until_complete(main())
for r in res:
df = r.result()
print(df.shape)
print(df.head())
except:
raise ValueError("error")
finally:
# loop.close()
print("Finished in {}".format(time.time() - starter))
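One detail that stands out (sketch below, not a confirmed fix): quandl.get_table is a blocking call, so awaiting it inside an async def still runs the downloads one after another. Handing each call to a thread pool at least lets several requests be in flight at once, if Quandl permits that:

import asyncio
import functools

import quandl

async def get_data(table, kwarg):
    """Run the blocking quandl.get_table call in the default thread pool."""
    loop = asyncio.get_event_loop()
    func = functools.partial(quandl.get_table, "SHARADAR/" + table.upper(),
                             paginate=True, **kwarg)
    # run_in_executor(None, ...) uses the default ThreadPoolExecutor, so several
    # downloads can overlap instead of running back to back.
    return await loop.run_in_executor(None, func)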
I am new to Python. By following code from others I have put together the following code. It outputs the results from "GetLatestTick" by looping through each "pair" in the list under the __main__ section. I can even perform calculations on the line being processed. How can I organize this output into a DataFrame for sorting and other manipulation?
from multiprocessing import Process, Lock
import requests, json
from tabulate import tabulate
def bittrex(lock, pair):
lock.acquire()
#print(pair)
get_request_link = ('https://bittrex.com/Api/v2.0/pub/market/GetLatestTick?marketName='
+ pair + '&tickInterval=thirtyMin')
api = requests.get(get_request_link)
data = json.loads(api.text)
for i in data:
if i == 'result':
for h in data[i]:
print(pair,h['O'],h['H'],h['L'],h['C'],h['V'],h['BV'],
(((h['O'])/(h['H']))-1),h['T'])
lock.release()
return data
def main():
bitt_all = bittrex(lock, pair) # This still required
if __name__ == '__main__':
lock = Lock()
pair = ['BTC-MUE','USD-ETH','USDT-TRX','USDT-BTC']
for pair in pair:
Process(target=bittrex, args=(lock, pair)).start()
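A minimal sketch of one way to do this (assuming the v2.0 endpoint in the question still responds): have each worker return its rows instead of printing them, collect the results with a multiprocessing Pool, and load everything into a single DataFrame for sorting:

from multiprocessing import Pool

import pandas as pd
import requests

def fetch_rows(pair):
    url = ('https://bittrex.com/Api/v2.0/pub/market/GetLatestTick?marketName='
           + pair + '&tickInterval=thirtyMin')
    data = requests.get(url).json()
    rows = []
    for h in data.get('result') or []:
        rows.append([pair, h['O'], h['H'], h['L'], h['C'], h['V'], h['BV'],
                     (h['O'] / h['H']) - 1, h['T']])
    return rows

if __name__ == '__main__':
    pairs = ['BTC-MUE', 'USD-ETH', 'USDT-TRX', 'USDT-BTC']
    with Pool(len(pairs)) as pool:
        results = pool.map(fetch_rows, pairs)         # one list of rows per pair
    flat = [row for rows in results for row in rows]  # flatten into a single list
    df = pd.DataFrame(flat, columns=['pair', 'O', 'H', 'L', 'C', 'V', 'BV',
                                     'O/H-1', 'T'])
    print(df.sort_values('V', ascending=False))

Pool.map preserves the order of the input pairs, so no lock is needed just to keep the output readable.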
So I am creating a script to communicate with our API server for asset management and retrieve some information. I've found that the portion of the script with the longest total time is:
{method 'read' of '_ssl._SSLSocket' objects}
Currently we're pulling information for about 25 assets, and that specific portion is taking 18.89 seconds.
Is there any way to optimize this so it doesn't take 45 minutes to do all 2,700 computers we have?
I can provide a copy of the actual code if that would be helpful.
import urllib2
import base64
import json
import csv
# Count Number so that process only runs for 25 assets at a time will be
# replaced with a variable that is determined by the number of computers added
# to the list
Count_Stop = 25
final_output_list = []
def get_creds():
# Credentials Function that retrieves username:pw from .file
with open('.cred') as cred_file:
cred_string = cred_file.read().rstrip()
return cred_string
print(cred_string)
def get_all_assets():
# Function to retrieve computer ID + computer names and store the ID in a
# new list called computers_parsed
request = urllib2.Request('jss'
'JSSResource/computers')
creds = get_creds()
request.add_header('Authorization', 'Basic ' + base64.b64encode(creds))
response = urllib2.urlopen(request).read()
# At this point the request for ID + name has been retrieved and now to be
# formatted in json
parsed_ids_json = json.loads(response)
# Then assign the parsed list (which has nested lists) at key 'computers'
# to a new list variable called computer_set
computer_set = parsed_ids_json['computers']
# New list to store just the computer ID's obtained in Loop below
computer_ids = []
# Count variable, when equal to max # of computers in Count_stop it stops.
count = 0
# This for loop iterates over ID + name in computer_set and returns the ID
# to the list computer_ids
for computers in computer_set:
count += 1
computer_ids.append(computers['id'])
# This IF condition allows for the script to be tested at 25 assets
# instead of all 2,000+ (comment out other announce_all_assets call)
if count == Count_Stop:
announce_all_assets(computer_ids, count)
# announce_all_assets(computer_ids, count)
def announce_all_assets(computer_ids, count):
print('Final list of ID\'s for review: ' + str(computer_ids))
print('Total number of computers to check against JSS: ' +
str(count))
extension_attribute_request(computer_ids, count)
def extension_attribute_request(computer_ids, count):
# Creating new variable, first half of new URL used in loop to get
# extension attributes using the computer ID's in computers_ids
base_url = 'jss'
what_we_want = '/subset/extensionattributes'
creds = get_creds()
print('Extension attribute function starts now:')
for ids in computer_ids:
request_url = base_url + str(ids) + what_we_want
request = urllib2.Request(request_url)
request.add_header('Authorization', 'Basic ' + base64.b64encode(creds))
response = urllib2.urlopen(request).read()
parsed_ext_json = json.loads(response)
ext_att_json = parsed_ext_json['computer']['extension_attributes']
retrieve_all_ext(ext_att_json)
def retrieve_all_ext(ext_att_json):
new_computer = {}
# new_computer['original_id'] = ids['id']
# new_computer['original_name'] = ids['name']
for computer in ext_att_json:
new_computer[str(computer['name'])] = computer['value']
add_to_master_list(new_computer)
def add_to_master_list(new_computer):
final_output_list.append(new_computer)
print(final_output_list)
def main():
# Function to run the get all assets function
get_all_assets()
if __name__ == '__main__':
# Function to run the functions in order: main > get all assets >
main()
I'd highly recommend using the 'requests' module over 'urllib2'. It handles a lot of stuff for you and will save you many a headache.
I believe it will also give you better performance, but I'd love to hear your feedback.
Here's your code using requests. (I've added newlines to highlight my changes. Note the built-in .json() decoder.):
# Requires requests module be installed.:
# `pip install requests` or `pip3 install requests`
# https://pypi.python.org/pypi/requests/
import requests
import base64
import json
import csv
# Count Number so that process only runs for 25 assets at a time will be
# replaced with a variable that is determined by the number of computers added
# to the list
Count_Stop = 25
final_output_list = []
def get_creds():
# Credentials Function that retrieves username:pw from .file
with open('.cred') as cred_file:
cred_string = cred_file.read().rstrip()
return cred_string
print(cred_string)
def get_all_assets():
# Function to retrieve computer ID + computer names and store the ID in a
# new list called computers_parsed
base_url = 'jss'
what_we_want = 'JSSResource/computers'
request_url = base_url + what_we_want
# NOTE the request_url is constructed based on your request assignment just below.
# As such, it is malformed as a URL, and I assume anonymized for your posting on SO.
# request = urllib2.Request('jss'
# 'JSSResource/computers')
#
creds = get_creds()
headers={
'Authorization': 'Basic ' + base64.b64encode(creds),
}
response = requests.get(request_url, headers=headers)  # headers must be passed by keyword; the 2nd positional arg is params
parsed_ids_json = response.json()
#[NO NEED FOR THE FOLLOWING. 'requests' HANDLES DECODING JSON. SEE ABOVE ASSIGNMENT.]
# At this point the request for ID + name has been retrieved and now to be
# formatted in json
# parsed_ids_json = json.loads(response)
# Then assign the parsed list (which has nested lists) at key 'computers'
# to a new list variable called computer_set
computer_set = parsed_ids_json['computers']
# New list to store just the computer ID's obtained in Loop below
computer_ids = []
# Count variable, when equal to max # of computers in Count_stop it stops.
count = 0
# This for loop iterates over ID + name in computer_set and returns the ID
# to the list computer_ids
for computers in computer_set:
count += 1
computer_ids.append(computers['id'])
# This IF condition allows for the script to be tested at 25 assets
# instead of all 2,000+ (comment out other announce_all_assets call)
if count == Count_Stop:
announce_all_assets(computer_ids, count)
# announce_all_assets(computer_ids, count)
def announce_all_assets(computer_ids, count):
print('Final list of ID\'s for review: ' + str(computer_ids))
print('Total number of computers to check against JSS: ' +
str(count))
extension_attribute_request(computer_ids, count)
def extension_attribute_request(computer_ids, count):
# Creating new variable, first half of new URL used in loop to get
# extension attributes using the computer ID's in computers_ids
base_url = 'jss'
what_we_want = '/subset/extensionattributes'
creds = get_creds()
print('Extension attribute function starts now:')
for ids in computer_ids:
request_url = base_url + str(ids) + what_we_want
headers={
'Authorization': 'Basic ' + base64.b64encode(creds),
}
response = requests.get(request_url, headers=headers)  # headers must be passed by keyword; the 2nd positional arg is params
parsed_ext_json = response.json()
ext_att_json = parsed_ext_json['computer']['extension_attributes']
retrieve_all_ext(ext_att_json)
def retrieve_all_ext(ext_att_json):
new_computer = {}
# new_computer['original_id'] = ids['id']
# new_computer['original_name'] = ids['name']
for computer in ext_att_json:
new_computer[str(computer['name'])] = computer['value']
add_to_master_list(new_computer)
def add_to_master_list(new_computer):
final_output_list.append(new_computer)
print(final_output_list)
def main():
# Function to run the get all assets function
get_all_assets()
if __name__ == '__main__':
# Function to run the functions in order: main > get all assets >
main()
Please do let me know the relative performance time with your 25 assets in 18.89 seconds! I'm very curious.
I'd still recommend my other answer, which uses the requests module, from a pure cleanliness perspective (requests is very clean to work with), but I recognize it may or may not address your original question.
If you want to try PyCurl, which likely will impact your original question, here's the same code implemented with that approach:
# Requires pycurl module be installed.:
# `pip install pycurl` or `pip3 install pycurl`
# https://pypi.python.org/pypi/pycurl/7.43.0
# NOTE: The syntax used herein for pycurl is python 3 compliant.
# Not python 2 compliant.
import pycurl
import base64
import json
import csv
from io import BytesIO  # needed for the response buffer in pycurl_data below
def pycurl_data( url, headers ):
buffer = BytesIO()
connection = pycurl.Curl()
connection.setopt( connection.URL, url )
connection.setopt(pycurl.HTTPHEADER, headers )
connection.setopt( connection.WRITEDATA, buffer )
connection.perform()
connection.close()
body = buffer.getvalue()
# NOTE: The following assumes a byte string and a utf8 format. Change as desired.
return json.loads( body.decode('utf8') )
# Count Number so that process only runs for 25 assets at a time will be
# replaced with a variable that is determined by the number of computers added
# to the list
Count_Stop = 25
final_output_list = []
def get_creds():
# Credentials Function that retrieves username:pw from .file
with open('.cred') as cred_file:
cred_string = cred_file.read().rstrip()
return cred_string
print(cred_string)
def get_all_assets():
# Function to retrieve computer ID + computer names and store the ID in a
# new list called computers_parsed
base_url = 'jss'
what_we_want = 'JSSResource/computers'
request_url = base_url + what_we_want
# NOTE the request_url is constructed based on your request assignment just below.
# As such, it is malformed as a URL, and I assume anonymized for your posting on SO.
# request = urllib2.Request('jss'
# 'JSSResource/computers')
#
creds = get_creds()
headers = [ 'Authorization: Basic ' + base64.b64encode(creds.encode()).decode() ]  # encode/decode keeps this a str in Python 3
response = pycurl_data( request_url, headers )
# pycurl_data() already returns parsed JSON (it calls json.loads above), so the
# response can be used directly.
parsed_ids_json = response
# Then assign the parsed list (which has nested lists) at key 'computers'
# to a new list variable called computer_set
computer_set = parsed_ids_json['computers']
# New list to store just the computer ID's obtained in Loop below
computer_ids = []
# Count variable, when equal to max # of computers in Count_stop it stops.
count = 0
# This for loop iterates over ID + name in computer_set and returns the ID
# to the list computer_ids
for computers in computer_set:
count += 1
computer_ids.append(computers['id'])
# This IF condition allows for the script to be tested at 25 assets
# instead of all 2,000+ (comment out other announce_all_assets call)
if count == Count_Stop:
announce_all_assets(computer_ids, count)
# announce_all_assets(computer_ids, count)
def announce_all_assets(computer_ids, count):
print('Final list of ID\'s for review: ' + str(computer_ids))
print('Total number of computers to check against JSS: ' +
str(count))
extension_attribute_request(computer_ids, count)
def extension_attribute_request(computer_ids, count):
# Creating new variable, first half of new URL used in loop to get
# extension attributes using the computer ID's in computers_ids
base_url = 'jss'
what_we_want = '/subset/extensionattributes'
creds = get_creds()
print('Extension attribute function starts now:')
for ids in computer_ids:
request_url = base_url + str(ids) + what_we_want
headers = [ 'Authorization: Basic ' + base64.b64encode(creds.encode()).decode() ]
response = pycurl_data( request_url, headers )
parsed_ext_json = response  # already parsed JSON; json.dumps would turn it back into a string
ext_att_json = parsed_ext_json['computer']['extension_attributes']
retrieve_all_ext(ext_att_json)
def retrieve_all_ext(ext_att_json):
new_computer = {}
# new_computer['original_id'] = ids['id']
# new_computer['original_name'] = ids['name']
for computer in ext_att_json:
new_computer[str(computer['name'])] = computer['value']
add_to_master_list(new_computer)
def add_to_master_list(new_computer):
final_output_list.append(new_computer)
print(final_output_list)
def main():
# Function to run the get all assets function
get_all_assets()
if __name__ == '__main__':
# Function to run the functions in order: main > get all assets >
main()
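One more thought: the 18.89 seconds is almost entirely time spent waiting on the network (the 'read' of _SSLSocket), so issuing the per-computer requests in parallel will likely matter more than which HTTP library you use. A minimal sketch with requests and a thread pool, keeping the anonymized 'jss' base URL from your post (the worker count of 10 is an arbitrary starting point):

from concurrent.futures import ThreadPoolExecutor
import base64

import requests

def fetch_extension_attributes(computer_id, creds):
    # 'jss' is the anonymized base URL from the question; replace with the real one.
    request_url = 'jss' + str(computer_id) + '/subset/extensionattributes'
    headers = {
        # .encode()/.decode() keeps the header a str on Python 3.
        'Authorization': 'Basic ' + base64.b64encode(creds.encode()).decode(),
        # Ask explicitly for JSON (harmless if the server already defaults to it).
        'Accept': 'application/json',
    }
    return requests.get(request_url, headers=headers).json()

def fetch_all(computer_ids, creds, workers=10):
    # Run up to `workers` requests at once; tune this against what the JSS server tolerates.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(fetch_extension_attributes, cid, creds)
                   for cid in computer_ids]
        return [f.result() for f in futures]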
I've got a script here which (ideally) iterates through multiple pages X of JSON data for each entity Y (in this case, multiple loans X for each team Y). The way that the api is constructed, I believe I must physically change a subdirectory within the URL in order to iterate through multiple entities. Here is the explicit documentation and URL:
GET /teams/:id/loans
Returns loans belonging to a particular team.
Example http://api.kivaws.org/v1/teams/2/loans.json
Parameters:
  id (number)        Required. The team ID for which to return loans.
  page (number)      The page position of results to return. Default: 1
  sort_by (string)   The order by which to sort results. One of: oldest, newest. Default: newest
  app_id (string)    The application id in reverse DNS notation.
  ids_only (string)  Return IDs only to make the return object smaller. One of: true, false. Default: false
Response: loan_listing – HTML, JSON, XML, RSS
Status: Production
And here is my script, which does run and appear to extract the correct data, but doesn't seem to write any data to the outfile:
# -*- coding: utf-8 -*-
import urllib.request as urllib
import json
import time
# storing team loans dict. The key is the team id, en value is the list of lenders
team_loans = {}
url = "http://api.kivaws.org/v1/teams/"
#teams_id range 1 - 11885
for i in range(1, 100):
params = dict(
id = i
)
#i =1
try:
handle = urllib.urlopen(str(url+str(i)+"/loans.json"))
print(handle)
except:
print("Could not handle url")
continue
# reading response
item_html = handle.read().decode('utf-8')
# converting bytes to str
data = str(item_html)
# converting to json
data = json.loads(data)
# getting number of pages to crawl
numPages = data['paging']['pages']
# deleting paging data
data.pop('paging')
# calling additional pages
if numPages >1:
for pa in range(2,numPages+1,1):
#pa = 2
handle = urllib.urlopen(str(url+str(i)+"/loans.json?page="+str(pa)))
print("Pulling loan data from team " + str(i) + "...")
# reading response
item_html = handle.read().decode('utf-8')
# converting bytes to str
datatemp = str(item_html)
# converting to json
datatemp = json.loads(datatemp)
#Pagings are redundant headers
datatemp.pop('paging')
# adding data to initial list
for loan in datatemp['loans']:
data['loans'].append(loan)
time.sleep(2)
# recording loans by team in dict
team_loans[i] = data['loans']
if (data['loans']):
print("===Data added to the team_loan dictionary===")
else:
print("!!!FAILURE to add data to team_loan dictionary!!!")
# recording data to file when 10 teams are read
print("===Finished pulling from page " + str(i) + "===")
if (int(i) % 10 == 0):
outfile = open("team_loan.json", "w")
print("===Now writing data to outfile===")
json.dump(team_loans, outfile, sort_keys = True, indent = 2, ensure_ascii=True)
outfile.close()
else:
print("!!!FAILURE to write data to outfile!!!")
# compliance with API # of requests
time.sleep(2)
print ('Done! Check your outfile (team_loan.json)')
I know that may be a heady amount of code to throw in your faces, but it's a pretty sequential process.
Again, this program is pulling the correct data, but it is not writing this data to the outfile. Can anyone understand why?
For others who may read this post: the script does in fact write data to the outfile. It was simply the test-code logic that was wrong; ignore the print statements I put into place.
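For completeness, a minimal sketch of just the write step (my own helper name, not from the original script), using a context manager so the file is always closed, and dumping everything collected so far every 10 teams:

import json

def dump_team_loans(team_loans, path='team_loan.json'):
    # Overwrites the file with the full dict each time; since team_loans keeps
    # accumulating across iterations, nothing is lost between dumps.
    with open(path, 'w') as outfile:
        json.dump(team_loans, outfile, sort_keys=True, indent=2, ensure_ascii=True)

# inside the loop over team ids:
# if i % 10 == 0:
#     dump_team_loans(team_loans)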