Question: Is a time delay a good way of dealing with request rate limits?
I am very new to requests, APIs and web services. I am trying to create a web service that, given an ID, makes a request to the MusicBrainz API and retrieves some information. However, I am apparently making too many requests, or making them too quickly. In the last line of the code, if the delay parameter is set to 0, this error appears:
{'error': 'Your requests are exceeding the allowable rate limit. Please see http://wiki.musicbrainz.org/XMLWebService for more information.'}
And looking into that link, I found out that:
The rate at which your IP address is making requests is measured. If that rate is too high, all your requests will be declined (http 503) until the rate drops again. Currently that rate is (on average) 1 request per second.
So I thought, okay, I will insert a time delay of 1 second and it will work. And it did work, but I imagine there are nicer, neater and smarter ways of dealing with such a problem. Do you know of one?
CODE:
####################################################
################### INSTRUCTIONS ###################
####################################################
'''
This script runs locally and returns a JSON formatted file, containing
information about the release-groups of an artist whose MBID must be provided.
'''
#########################################
############ CODE STARTS ################
#########################################
#IMPORT PACKAGES
#All of them come with Anaconda3 installation, otherwise they can be installed with pip
import requests
import json
import math
import time
#Base URL for looking-up release-groups on musicbrainz.org
root_URL = 'http://musicbrainz.org/ws/2/'
#Parameters to run an example
offset = 10
limit = 1
MBID = '65f4f0c5-ef9e-490c-aee3-909e7ae6b2ab'
def collect_data(MBID, root_URL):
'''
Description: Auxiliary function to collect data from the MusicBrainz API
Arguments:
MBID - MusicBrainz Identity of some artist.
root_URL - MusicBrainz root_URL for requests
Returns:
decoded_output - dictionary file containing all the information about the release-groups
of type album of the requested artist
'''
#Joins paths. Note: Release-groups can be filtered by type.
URL_complete = root_URL + 'release-group?artist=' + MBID + '&type=album' + '&fmt=json'
#Creates a requests object and sends a GET request
request = requests.get(URL_complete)
assert request.status_code == 200
output = request.content #bytes
decoded_output = json.loads(output) #dict
return decoded_output
def collect_releases(release_group_id, root_URL, delay = 1):
'''
Description: Auxiliary function to collect data from the MusicBrainz API
Arguments:
release_group_id - ID of the release-group whose number of releases is to be extracted
root_URL - MusicBrainz root_URL for requests
Returns:
releases_count - integer containing the number of releases of the release-group
'''
URL_complete = root_URL + 'release-group/' + release_group_id + '?inc=releases' + '&fmt=json'
#Creates a requests object and sends a GET request
request = requests.get(URL_complete)
#Parses the content of the request to a dictionary
output = request.content
decoded_output = json.loads(output)
#Time delay to not exceed MusicBrainz request rate limit
time.sleep(delay)
releases_count = 0
if 'releases' in decoded_output:
releases_count = len(decoded_output['releases'])
else:
print(decoded_output)
#raise ValueError(decoded_output)
return releases_count
def paginate(store_albums, offset, limit = 50):
'''
Description: Auxiliary function to paginate results
Arguments:
store_albums - Dictionary containing information about each release-group
offset - Integer. Corresponds to starting album to show.
limit - Integer. Default to 50. Maximum number of albums to show per page
Returns:
albums_paginated - Paginated albums according to specified limit and offset
'''
#Restricts limit to 150
if limit > 150:
limit = 150
if offset > len(store_albums['albums']):
raise ValueError('Offset is greater than number of albums')
#Apply offset
albums_offset = store_albums['albums'][offset:]
#Count pages
pages = math.ceil(len(albums_offset) / limit)
albums_limited = []
if len(albums_offset) > limit:
for i in range(pages):
albums_limited.append(albums_offset[i * limit : (i+1) * limit])
else:
albums_limited = albums_offset
albums_paginated = {'albums' : None}
albums_paginated['albums'] = albums_limited
return albums_paginated
def post(MBID, offset, limit, delay = 1):
#Calls the auxiliary function 'collect_data' that retrieves the JSON file from the MusicBrainz API
json_file = collect_data(MBID, root_URL)
#Creates list and dictionary for storing the information about each release-group
album_details_list = []
album_details = {"id": None, "title": None, "year": None, "release_count": None}
#Loops through all release-groups in the JSON file
for item in json_file['release-groups']:
album_details["id"] = item["id"]
album_details["title"] = item["title"]
album_details["year"] = item["first-release-date"].split("-")[0]
album_details["release_count"] = collect_releases(item["id"], root_URL, delay)
album_details_list.append(album_details.copy())
#Creates dictionary with all the albums of the artist
store_albums = {"albums": None}
store_albums["albums"] = album_details_list
#Paginates the dictionary
stored_paginated_albums = paginate(store_albums, offset , limit)
#Returns JSON typed file containing the different albums arranged according to offset&limit
return json.dumps(stored_paginated_albums)
#Runs the program and prints the JSON output as specified in the wording of the exercise
print(post(MBID, offset, limit, delay = 1))
There aren't really any nicer ways of dealing with this problem, other than asking the API owner to increase your rate limit. The only way to avoid rate-limit errors is to not make too many requests at a time, and short of hacking the API so that you bypass its request counter, you're stuck with waiting one second between requests.
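That said, the fixed time.sleep(delay) scattered through the code can at least be consolidated so the throttling lives in one place and only waits as long as it needs to. A minimal sketch, using the same requests library; the class name and the retry-on-503 behaviour are my own additions, not part of the original code:
import time
import requests

class ThrottledClient:
    """Sketch of a self-throttling GET helper: waits just enough to stay
    under roughly one request per second and retries once on a 503."""

    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval
        self._last_request = 0.0
        self.session = requests.Session()

    def get(self, url, **kwargs):
        # Sleep only for the remainder of the interval, not a fixed second.
        wait = self.min_interval - (time.time() - self._last_request)
        if wait > 0:
            time.sleep(wait)
        response = self.session.get(url, **kwargs)
        self._last_request = time.time()
        if response.status_code == 503:
            # Rate limit hit anyway: back off for one interval and retry once.
            time.sleep(self.min_interval)
            response = self.session.get(url, **kwargs)
            self._last_request = time.time()
        return response

# Usage in the functions above would then look like:
# client = ThrottledClient()
# request = client.get(URL_complete)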
Related
My Problem:
The web app I'm building relies on real-time transcription of a user's voice along with timestamps for when each word begins and ends.
Google's Speech-to-Text API has a limit of 4 minutes for streaming requests, but I want users to be able to run their mics for as long as 30 minutes if they so choose.
Thankfully, Google provides its own code examples for how to make successive requests to their Speech-to-Text API in a way that mimics endless streaming speech recognition.
I've adapted their Python infinite streaming example for my purposes (see below for my code). The timestamps provided by Google are pretty accurate but the issue is that when I exceed the streaming limit (4 minutes) and a new request is made, the timestamped transcript returned by Google's API from the new request is off by as much as 5 seconds or more.
Below is an example of the output when I adjust the streaming limit to 10 seconds (so a new request to Google's Speech-to-Text API begins every 10 seconds).
The timestamp you see printed next to each transcribed response (the 'corrected_time' in the code) is the timestamp for the end of the transcribed line, not the beginning. These timestamps are accurate for the first request but are off by ~4 seconds in the second request and ~9 seconds in the third request.
In a nutshell, I want to make sure that when the streaming limit is exceeded and a new request is made, the timestamps returned by Google for that new request are adjusted accurately.
My Code:
To help you understand what's going on, I would recommend running it on your machine (only takes a couple of minutes to get working if you have a Google Cloud service account).
I've included more detail on my current diagnosis below the code.
#!/usr/bin/env python
"""Google Cloud Speech API sample application using the streaming API.
NOTE: This module requires the dependencies `pyaudio`.
To install using pip:
pip install pyaudio
Example usage:
python THIS_FILENAME.py
"""
# [START speech_transcribe_infinite_streaming]
import os
import re
import sys
import time
from google.cloud import speech
import pyaudio
from six.moves import queue
# Audio recording parameters
STREAMING_LIMIT = 20000 # 20 seconds (originally 4 mins but shortened for testing purposes)
SAMPLE_RATE = 16000
CHUNK_SIZE = int(SAMPLE_RATE / 10) # 100ms
# Environment Variable set for Google Credentials. Put the json service account
# key in the root directory
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'YOUR_SERVICE_ACCOUNT_KEY.json'
def get_current_time():
"""Return Current Time in MS."""
return int(round(time.time() * 1000))
class ResumableMicrophoneStream:
"""Opens a recording stream as a generator yielding the audio chunks."""
def __init__(self, rate, chunk_size):
self._rate = rate
self.chunk_size = chunk_size
self._num_channels = 1
self._buff = queue.Queue()
self.closed = True
self.start_time = get_current_time()
self.restart_counter = 0
self.audio_input = []
self.last_audio_input = []
self.result_end_time = 0
self.is_final_end_time = 0
self.final_request_end_time = 0
self.bridging_offset = 0
self.last_transcript_was_final = False
self.new_stream = True
self._audio_interface = pyaudio.PyAudio()
self._audio_stream = self._audio_interface.open(
format=pyaudio.paInt16,
channels=self._num_channels,
rate=self._rate,
input=True,
frames_per_buffer=self.chunk_size,
# Run the audio stream asynchronously to fill the buffer object.
# This is necessary so that the input device's buffer doesn't
# overflow while the calling thread makes network requests, etc.
stream_callback=self._fill_buffer,
)
def __enter__(self):
self.closed = False
return self
def __exit__(self, type, value, traceback):
self._audio_stream.stop_stream()
self._audio_stream.close()
self.closed = True
# Signal the generator to terminate so that the client's
# streaming_recognize method will not block the process termination.
self._buff.put(None)
self._audio_interface.terminate()
def _fill_buffer(self, in_data, *args, **kwargs):
"""Continuously collect data from the audio stream, into the buffer."""
self._buff.put(in_data)
return None, pyaudio.paContinue
def generator(self):
"""Stream Audio from microphone to API and to local buffer"""
while not self.closed:
data = []
"""
THE BELOW 'IF' STATEMENT IS WHERE THE ERROR IS LIKELY OCCURRING
This statement runs when the streaming limit is hit and a new request is made.
"""
if self.new_stream and self.last_audio_input:
chunk_time = STREAMING_LIMIT / len(self.last_audio_input)
if chunk_time != 0:
if self.bridging_offset < 0:
self.bridging_offset = 0
if self.bridging_offset > self.final_request_end_time:
self.bridging_offset = self.final_request_end_time
chunks_from_ms = round(
(self.final_request_end_time - self.bridging_offset)
/ chunk_time
)
self.bridging_offset = round(
(len(self.last_audio_input) - chunks_from_ms) * chunk_time
)
for i in range(chunks_from_ms, len(self.last_audio_input)):
data.append(self.last_audio_input[i])
self.new_stream = False
# Use a blocking get() to ensure there's at least one chunk of
# data, and stop iteration if the chunk is None, indicating the
# end of the audio stream.
chunk = self._buff.get()
self.audio_input.append(chunk)
if chunk is None:
return
data.append(chunk)
# Now consume whatever other data's still buffered.
while True:
try:
chunk = self._buff.get(block=False)
if chunk is None:
return
data.append(chunk)
self.audio_input.append(chunk)
except queue.Empty:
break
yield b"".join(data)
def listen_print_loop(responses, stream):
"""Iterates through server responses and prints them.
The responses passed is a generator that will block until a response
is provided by the server.
Each response may contain multiple results, and each result may contain
multiple alternatives; Here we print only the transcription for the top
alternative of the top result.
In this case, responses are provided for interim results as well. If the
response is an interim one, print a line feed at the end of it, to allow
the next result to overwrite it, until the response is a final one. For the
final one, print a newline to preserve the finalized transcription.
"""
for response in responses:
if get_current_time() - stream.start_time > STREAMING_LIMIT:
stream.start_time = get_current_time()
break
if not response.results:
continue
result = response.results[0]
if not result.alternatives:
continue
transcript = result.alternatives[0].transcript
result_seconds = 0
result_micros = 0
if result.result_end_time.seconds:
result_seconds = result.result_end_time.seconds
if result.result_end_time.microseconds:
result_micros = result.result_end_time.microseconds
stream.result_end_time = int((result_seconds * 1000) + (result_micros / 1000))
corrected_time = (
stream.result_end_time
- stream.bridging_offset
+ (STREAMING_LIMIT * stream.restart_counter)
)
# Display interim results, but with a carriage return at the end of the
# line, so subsequent lines will overwrite them.
if result.is_final:
sys.stdout.write("FINAL RESULT # ")
sys.stdout.write(str(corrected_time/1000) + ": " + transcript + "\n")
stream.is_final_end_time = stream.result_end_time
stream.last_transcript_was_final = True
# Exit recognition if any of the transcribed phrases could be
# one of our keywords.
if re.search(r"\b(exit|quit)\b", transcript, re.I):
sys.stdout.write("Exiting...\n")
stream.closed = True
break
else:
sys.stdout.write("INTERIM RESULT # ")
sys.stdout.write(str(corrected_time/1000) + ": " + transcript + "\r")
stream.last_transcript_was_final = False
def main():
"""start bidirectional streaming from microphone input to speech API"""
client = speech.SpeechClient()
config = speech.RecognitionConfig(
encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
sample_rate_hertz=SAMPLE_RATE,
language_code="en-US",
max_alternatives=1,
)
streaming_config = speech.StreamingRecognitionConfig(
config=config, interim_results=True
)
mic_manager = ResumableMicrophoneStream(SAMPLE_RATE, CHUNK_SIZE)
print(mic_manager.chunk_size)
sys.stdout.write('\nListening, say "Quit" or "Exit" to stop.\n\n')
sys.stdout.write("End (ms) Transcript Results/Status\n")
sys.stdout.write("=====================================================\n")
with mic_manager as stream:
while not stream.closed:
sys.stdout.write(
"\n" + str(STREAMING_LIMIT * stream.restart_counter) + ": NEW REQUEST\n"
)
stream.audio_input = []
audio_generator = stream.generator()
requests = (
speech.StreamingRecognizeRequest(audio_content=content)
for content in audio_generator
)
responses = client.streaming_recognize(streaming_config, requests)
# Now, put the transcription responses to use.
listen_print_loop(responses, stream)
if stream.result_end_time > 0:
stream.final_request_end_time = stream.is_final_end_time
stream.result_end_time = 0
stream.last_audio_input = []
stream.last_audio_input = stream.audio_input
stream.audio_input = []
stream.restart_counter = stream.restart_counter + 1
if not stream.last_transcript_was_final:
sys.stdout.write("\n")
stream.new_stream = True
if __name__ == "__main__":
main()
# [END speech_transcribe_infinite_streaming]
My Current Diagnosis
The 'corrected_time' is not being set correctly when new requests are made. This is due to the 'bridging_offset' not being set correctly. So what we need to look at is the 'generator()' method in the 'ResumableMicrophoneStream' class.
In the 'generator()' method, there is an 'if' statement which is run when the streaming limit is hit and a new request is made
if self.new_stream and self.last_audio_input:
Its purpose appears to be to take any lingering audio data that wasn't finished being transcribed before the streaming limit was hit and add it to the buffer before any new audio chunks so that it's transcribed in the new request.
It is also the responsibility of this 'if' statement to set the 'bridging offset' but I'm not entirely sure what this offset represents. All I know is that however it is being set, it is not being set accurately.
Time offset values show the beginning and the end of each spoken word
that is recognized in the supplied audio. A time offset value
represents the amount of time that has elapsed from the beginning of
the audio, in increments of 100ms.
This tells us that the offsets you receive are always measured from the beginning of the audio supplied in that particular request, i.e. they restart from zero each time a new streaming request begins rather than continuing from the start of the whole session. That would be my guess as to why it's causing your application problems.
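To illustrate that point: the result_end_time Google returns is measured from the start of the audio in the current request only, so an absolute timestamp needs the duration of audio already streamed in earlier requests added back on. A minimal sketch with my own function name; it is not taken from the posted code:
# Sketch only: shift a per-request offset by the audio already streamed
# in earlier requests (each chunk in the posted code is 100 ms of audio).
def absolute_time_ms(result_end_time_ms, chunks_sent_previously, chunk_ms=100):
    audio_already_sent_ms = chunks_sent_previously * chunk_ms
    return result_end_time_ms + audio_already_sent_ms

# e.g. 3.2 s into the third request, after 200 chunks (20 s) already streamed:
print(absolute_time_ms(3200, 200))  # -> 23200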
So, I started learning about Python's requests package lately and I got into a challenge.
At first I was given a link, "http://pms.zelros.com/", that only gave me one tip: the query param id must be a UUID v4.
I started working on that and so far I've come up with this:
def get_optimal_frequency(nb_of_requests=50):
"""
This sends a number of requests in a row to raise Error 429 and get the "optimal frequency" by setting it
to the maximal 'X-Rate-Limit-Remaining' that we got + 10% as a margin of error
:param nb_of_requests: The number of requests sent to raise the error
:return: The safe time to wait between requests in ms
:rtype: int
"""
session = requests.Session()
query = uuid.uuid4()
optimal_frequency = 0
headers = {
'User-Agent': 'Chrome/79.0.3945.88',
}
for i in range(nb_of_requests):
response = session.get("http://pms.zelros.com", params={'id':query}, headers=headers)
if response.headers.get('X-Rate-Limit-Remaining') is not None and int(response.headers.get('X-Rate-Limit-Remaining')) > optimal_frequency:
optimal_frequency = int(response.headers.get('X-Rate-Limit-Remaining'))
return 1.1*optimal_frequency
def spam_until_score(score):
"""
This sends requests with a uuidv4 until the desired score is reached
:param score: The score wanted
:return: The response of the last request
:rtype: requests.models.Response
"""
start = time.time()
current_score = 0
query = uuid.uuid4()
session = requests.Session()
optimal_frequency = get_optimal_frequency()
headers = {
'User-Agent': 'Chrome/79.0.3945.88',
}
while(current_score < score):
response = session.get("http://pms.zelros.com", params={'id':query}, headers=headers)
dict_response = response.json()
if (int(dict_response.get('score')) < current_score):
break
else:
current_score = int(dict_response.get('score'))
time.sleep(optimal_frequency/1000)
end = time.time()
duration = end - start
return response, duration
But I'm stuck: the goal is to reach a score of 1,000,000, and getting to 10,000 took 5536 s.
The hints I've got so far are these:
Level 10000
From /people
Let's add a people payload
"people": [x, x, x]
Level 2000
And you can add a score payload to optimize your preparation
Level 700
You can /prepare your request.
Level 300
Nice start. It was easy :)
Let's use some fancy http verbs.
Level 100
You already know that you cannot spam me.
But do you know that there is an optimal frequency to contact me ?
Level 0
Hello !
Welcome to the Zelros challenge
The goal is to reach a one millon score.
Sorry for the long message, but here are my questions:
- Is there a way to send more requests without raising error 429, maybe using parallel requests? If yes, how should I do it?
- I don't really get how preparing requests could help me (see the sketch after this list).
- What other HTTP methods besides GET could I be using?
Thanks for your time and help.
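On the prepared-requests point, this is what that looks like in the requests library: the request is built and serialized once up front, and only the send happens inside the loop. A minimal sketch reusing the URL and headers from the code above; note that the challenge hint "You can /prepare your request" may just as well refer to a /prepare endpoint on the server rather than to this client-side feature:
import uuid
import requests

session = requests.Session()

# Build and serialize the request once ("prepare" it)...
req = requests.Request(
    "GET",
    "http://pms.zelros.com",
    params={"id": str(uuid.uuid4())},
    headers={"User-Agent": "Chrome/79.0.3945.88"},
)
prepared = session.prepare_request(req)

# ...then only the send happens in the hot loop.
response = session.send(prepared)
print(response.status_code)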
So I am trying my best to navigate my way through the Facebook API. I need to create a script that will download my business' campaign information daily as a csv file so I can use another script to upload the information to our database easily.
I finally have code that works to print the information to the log, but I am reaching the user request limit because I have to call get_insights() for every single campaign individually. I am wondering if anyone knows how to help me make it so I don't have to call the facebook API as often.
What I would like to do is find a field where I can get the daily spend, so I don't have to call the API in every iteration of my campaign loop, but I cannot for the life of me find a way to do so.
#Import all the facebook mumbo jumbo
from facebookads.api import FacebookAdsApi
from facebookads.adobjects.adset import AdSet
from facebookads.adobjects.campaign import Campaign
from facebookads.adobjects.adsinsights import AdsInsights
from facebookads.adobjects.adreportrun import AdReportRun
from facebookads.adobjects.adaccount import AdAccount
from facebookads.adobjects.business import Business
import time
#Set the login info
my_app_id = '****'
my_app_secret = '****'
my_access_token = '****'
#Start the connection to the facebook API
FacebookAdsApi.init(my_app_id, my_app_secret, my_access_token)
business = Business('****')
#Get all ad accounts on the business account
accounts = business.get_owned_ad_accounts(fields=[AdAccount.Field.id])
#iterate through all accounts in the business account
for account in accounts:
tempaccount = AdAccount(account[AdAccount.Field.id])
#get all campaigns in the adaccount
campaigns = tempaccount.get_campaigns(fields=[Campaign.Field.name,Campaign.Field])
#iterate trough all the campaigns in the adaccount
for campaign in campaigns:
print(campaign[Campaign.Field.name])
#get the insight info (spend) from each campaign
campaignsights = campaign.get_insights(params={'date_preset':'yesterday'},fields=[AdsInsights.Field.spend])
print (campaignsights)
It took a while of digging through the API and guessing but I got it! Here is my final script:
# This program downloads all relevant Facebook traffic info as a csv file
# This program requires info from the Facebook Ads API: https://github.com/facebook/facebook-python-ads-sdk
# Import all the facebook mumbo jumbo
from facebookads.api import FacebookAdsApi
from facebookads.adobjects.adsinsights import AdsInsights
from facebookads.adobjects.adaccount import AdAccount
from facebookads.adobjects.business import Business
# Import the csv writer and the date/time function
import datetime
import csv
# Set the info to get connected to the API. Do NOT share this info
my_app_id = '****'
my_app_secret = '****'
my_access_token = '****'
# Start the connection to the facebook API
FacebookAdsApi.init(my_app_id, my_app_secret, my_access_token)
# Create a business object for the business account
business = Business('****')
# Get yesterday's date for the filename, and the csv data
yesterdaybad = datetime.datetime.now() - datetime.timedelta(days=1)
yesterdayslash = yesterdaybad.strftime('%m/%d/%Y')
yesterdayhyphen = yesterdaybad.strftime('%m-%d-%Y')
# Define the destination filename
filename = yesterdayhyphen + '_fb.csv'
filelocation = "/cron/downloads/"+ filename
# Get all ad accounts on the business account
accounts = business.get_owned_ad_accounts(fields=[AdAccount.Field.id])
# Open or create new file
try:
csvfile = open(filelocation, 'w+')
except:
print ("Cannot open file.")
# To keep track of rows added to file
rows = 0
try:
# Create file writer
filewriter = csv.writer(csvfile, delimiter=',')
except Exception as err:
print(err)
# Iterate through the adaccounts
for account in accounts:
# Create an AdAccount object from the adaccount id to make it possible to get insights
tempaccount = AdAccount(account[AdAccount.Field.id])
# Grab insight info for all ads in the adaccount
ads = tempaccount.get_insights(params={'date_preset':'yesterday',
'level':'ad'
},
fields=[AdsInsights.Field.account_id,
AdsInsights.Field.account_name,
AdsInsights.Field.ad_id,
AdsInsights.Field.ad_name,
AdsInsights.Field.adset_id,
AdsInsights.Field.adset_name,
AdsInsights.Field.campaign_id,
AdsInsights.Field.campaign_name,
AdsInsights.Field.cost_per_outbound_click,
AdsInsights.Field.outbound_clicks,
AdsInsights.Field.spend
]
);
# Iterate through all the ad-level insight rows returned for this adaccount
for ad in ads:
# Set default values in case the insight info is empty
date = yesterdayslash
accountid = ad[AdsInsights.Field.account_id]
accountname = ""
adid = ""
adname = ""
adsetid = ""
adsetname = ""
campaignid = ""
campaignname = ""
costperoutboundclick = ""
outboundclicks = ""
spend = ""
# Set values from insight data
if ('account_id' in ad) :
accountid = ad[AdsInsights.Field.account_id]
if ('account_name' in ad) :
accountname = ad[AdsInsights.Field.account_name]
if ('ad_id' in ad) :
adid = ad[AdsInsights.Field.ad_id]
if ('ad_name' in ad) :
adname = ad[AdsInsights.Field.ad_name]
if ('adset_id' in ad) :
adsetid = ad[AdsInsights.Field.adset_id]
if ('adset_name' in ad) :
adsetname = ad[AdsInsights.Field.adset_name]
if ('campaign_id' in ad) :
campaignid = ad[AdsInsights.Field.campaign_id]
if ('campaign_name' in ad) :
campaignname = ad[AdsInsights.Field.campaign_name]
if ('cost_per_outbound_click' in ad) : # This is stored strangely, takes a few steps to break through the layers
costperoutboundclicklist = ad[AdsInsights.Field.cost_per_outbound_click]
costperoutboundclickdict = costperoutboundclicklist[0]
costperoutboundclick = costperoutboundclickdict.get('value')
if ('outbound_clicks' in ad) : # This is stored strangely, takes a few steps to break through the layers
outboundclickslist = ad[AdsInsights.Field.outbound_clicks]
outboundclicksdict = outboundclickslist[0]
outboundclicks = outboundclicksdict.get('value')
if ('spend' in ad) :
spend = ad[AdsInsights.Field.spend]
# Write all ad info to the file, and increment the number of rows that will display
filewriter.writerow([date, accountid, accountname, adid, adname, adsetid, adsetname, campaignid, campaignname, costperoutboundclick, outboundclicks, spend])
rows += 1
csvfile.close()
# Print report
print (str(rows) + " rows added to the file " + filename)
I then have a php script that takes the csv file and uploads it to my database. The key is pulling all the insight data in one big yank. You can then break it up however you want because each ad has information about its adset, adaccount, and campaign.
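For example, because every ad-level row carries its campaign name, per-campaign daily spend can be re-aggregated locally from those rows without any extra API calls. A small sketch, assuming the ads iterable and the AdsInsights import from the script above:
# Sketch: group the ad-level spend back up by campaign, locally.
from collections import defaultdict

spend_by_campaign = defaultdict(float)
for ad in ads:
    name = ad[AdsInsights.Field.campaign_name] if 'campaign_name' in ad else 'unknown'
    if 'spend' in ad:
        spend_by_campaign[name] += float(ad[AdsInsights.Field.spend])

for name, total in spend_by_campaign.items():
    print(name, total)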
Adding a couple of small functions to improve on LucyTurtle's answer as it is still susceptible to Facebook's Rate Limiting
import logging
import time
import requests as rq
#Function to find the string between two strings or characters
def find_between( s, first, last ):
try:
start = s.index( first ) + len( first )
end = s.index( last, start )
return s[start:end]
except ValueError:
return ""
#Function to check how close you are to the FB Rate Limit
def check_limit():
check=rq.get('https://graph.facebook.com/v3.3/act_'+account_number+'/insights?access_token='+my_access_token)
call=float(find_between(check.headers['x-business-use-case-usage'],'call_count":','}'))
cpu=float(find_between(check.headers['x-business-use-case-usage'],'total_cputime":','}'))
total=float(find_between(check.headers['x-business-use-case-usage'],'total_time":',','))
usage=max(call,cpu,total)
return usage
#Check if you reached 75% of the limit, if yes then back-off for 5 minutes (put this chunk in your 'for ad in ads' loop, every 100-200 iterations)
if (check_limit()>75):
print('75% Rate Limit Reached. Cooling Time 5 Minutes.')
logging.debug('75% Rate Limit Reached. Cooling Time 5 Minutes.')
time.sleep(300)
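One way to wire that in, assuming the check_limit() helper above; the wrapper name and the every/threshold numbers are purely illustrative:
def throttled(ads, every=150, threshold=75, cooldown=300):
    # Yield the ads unchanged, but pause when usage gets close to the limit.
    # Checking only every 'every' iterations keeps the check itself cheap.
    for i, ad in enumerate(ads):
        if i and i % every == 0 and check_limit() > threshold:
            print('75% Rate Limit Reached. Cooling Time 5 Minutes.')
            logging.debug('75% Rate Limit Reached. Cooling Time 5 Minutes.')
            time.sleep(cooldown)
        yield ad

# In the earlier script, 'for ad in ads:' would then become 'for ad in throttled(ads):'.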
I'd just like to say
Thank you.
As Marks Andre said - you made my day!
The FB SDK documentation is exhaustive, but it completely lacks practical implementation examples for day-to-day tasks like this one. Bookmark is set - page will be revisited soon.
So the only thing I can actually contribute for fellow sufferers: it seems that with the newer facebook_business SDK you can simply replace "facebookads" in the import statements with "facebook_business".
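Based on that remark, the imports at the top of the script above would become something like this (a sketch of the rename only, not verified against the newer SDK here):
# Same objects, new package name: facebook_business instead of facebookads
from facebook_business.api import FacebookAdsApi
from facebook_business.adobjects.adsinsights import AdsInsights
from facebook_business.adobjects.adaccount import AdAccount
from facebook_business.adobjects.business import Business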
So I am creating a script to communicate with our API server for asset management and retrieve some information. I've found that the largest share of the script's total run time is spent in:
{method 'read' of '_ssl._SSLSocket' objects}
Currently we're pulling information about 25 assets or so and that specific portion is taking 18.89 seconds.
Is there any way to optimize this so it doesn't take 45 minutes to do all 2,700 computers we have?
I can provide a copy of the actual code if that would be helpful.
import urllib2
import base64
import json
import csv
# Count Number so that process only runs for 25 assets at a time will be
# replaced with a variable that is determined by the number of computers added
# to the list
Count_Stop = 25
final_output_list = []
def get_creds():
# Credentials Function that retrieves username:pw from .file
with open('.cred') as cred_file:
cred_string = cred_file.read().rstrip()
return cred_string
print(cred_string)
def get_all_assets():
# Function to retrieve computer ID + computer names and store the ID in a
# new list called computers_parsed
request = urllib2.Request('jss'
'JSSResource/computers')
creds = get_creds()
request.add_header('Authorization', 'Basic ' + base64.b64encode(creds))
response = urllib2.urlopen(request).read()
# At this point the request for ID + name has been retrieved and now to be
# formatted in json
parsed_ids_json = json.loads(response)
# Then assign the parsed list (which has nested lists) at key 'computers'
# to a new list variable called computer_set
computer_set = parsed_ids_json['computers']
# New list to store just the computer ID's obtained in Loop below
computer_ids = []
# Count variable, when equal to max # of computers in Count_stop it stops.
count = 0
# This for loop iterates over ID + name in computer_set and returns the ID
# to the list computer_ids
for computers in computer_set:
count += 1
computer_ids.append(computers['id'])
# This IF condition allows for the script to be tested at 25 assets
# instead of all 2,000+ (comment out other announce_all_assets call)
if count == Count_Stop:
announce_all_assets(computer_ids, count)
# announce_all_assets(computer_ids, count)
def announce_all_assets(computer_ids, count):
print('Final list of ID\'s for review: ' + str(computer_ids))
print('Total number of computers to check against JSS: ' +
str(count))
extension_attribute_request(computer_ids, count)
def extension_attribute_request(computer_ids, count):
# Creating new variable, first half of new URL used in loop to get
# extension attributes using the computer ID's in computers_ids
base_url = 'jss'
what_we_want = '/subset/extensionattributes'
creds = get_creds()
print('Extension attribute function starts now:')
for ids in computer_ids:
request_url = base_url + str(ids) + what_we_want
request = urllib2.Request(request_url)
request.add_header('Authorization', 'Basic ' + base64.b64encode(creds))
response = urllib2.urlopen(request).read()
parsed_ext_json = json.loads(response)
ext_att_json = parsed_ext_json['computer']['extension_attributes']
retrieve_all_ext(ext_att_json)
def retrieve_all_ext(ext_att_json):
new_computer = {}
# new_computer['original_id'] = ids['id']
# new_computer['original_name'] = ids['name']
for computer in ext_att_json:
new_computer[str(computer['name'])] = computer['value']
add_to_master_list(new_computer)
def add_to_master_list(new_computer):
final_output_list.append(new_computer)
print(final_output_list)
def main():
# Function to run the get all assets function
get_all_assets()
if __name__ == '__main__':
# Function to run the functions in order: main > get all assets >
main()
I'd highly recommend using the requests module over urllib2. It handles a lot of stuff for you and will save you many a headache.
I believe it will also give you better performance, but I'd love to hear your feedback.
Here's your code using requests. (I've added newlines to highlight my changes. Note the built-in .json() decoder.):
# Requires requests module be installed.:
# `pip install requests` or `pip3 install requests`
# https://pypi.python.org/pypi/requests/
import requests
import base64
import json
import csv
# Count Number so that process only runs for 25 assets at a time will be
# replaced with a variable that is determined by the number of computers added
# to the list
Count_Stop = 25
final_output_list = []
def get_creds():
# Credentials Function that retrieves username:pw from .file
with open('.cred') as cred_file:
cred_string = cred_file.read().rstrip()
return cred_string
print(cred_string)
def get_all_assets():
# Function to retrieve computer ID + computer names and store the ID in a
# new list called computers_parsed
base_url = 'jss'
what_we_want = 'JSSResource/computers'
request_url = base_url + what_we_want
# NOTE the request_url is constructed based on your request assignment just below.
# As such, it is malformed as a URL, and I assume anonymized for your posting on SO.
# request = urllib2.Request('jss'
# 'JSSResource/computers')
#
creds = get_creds()
headers={
'Authorization': 'Basic ' + base64.b64encode(creds),
}
response = requests.get(request_url, headers=headers)
parsed_ids_json = response.json()
#[NO NEED FOR THE FOLLOWING. 'requests' HANDLES DECODING JSON. SEE ABOVE ASSIGNMENT.]
# At this point the request for ID + name has been retrieved and now to be
# formatted in json
# parsed_ids_json = json.loads(response)
# Then assign the parsed list (which has nested lists) at key 'computers'
# to a new list variable called computer_set
computer_set = parsed_ids_json['computers']
# New list to store just the computer ID's obtained in Loop below
computer_ids = []
# Count variable, when equal to max # of computers in Count_stop it stops.
count = 0
# This for loop iterates over ID + name in computer_set and returns the ID
# to the list computer_ids
for computers in computer_set:
count += 1
computer_ids.append(computers['id'])
# This IF condition allows for the script to be tested at 25 assets
# instead of all 2,000+ (comment out other announce_all_assets call)
if count == Count_Stop:
announce_all_assets(computer_ids, count)
# announce_all_assets(computer_ids, count)
def announce_all_assets(computer_ids, count):
print('Final list of ID\'s for review: ' + str(computer_ids))
print('Total number of computers to check against JSS: ' +
str(count))
extension_attribute_request(computer_ids, count)
def extension_attribute_request(computer_ids, count):
# Creating new variable, first half of new URL used in loop to get
# extension attributes using the computer ID's in computers_ids
base_url = 'jss'
what_we_want = '/subset/extensionattributes'
creds = get_creds()
print('Extension attribute function starts now:')
for ids in computer_ids:
request_url = base_url + str(ids) + what_we_want
headers={
'Authorization': 'Basic ' + base64.b64encode(creds),
}
response = requests.get(request_url, headers=headers)
parsed_ext_json = response.json()
ext_att_json = parsed_ext_json['computer']['extension_attributes']
retrieve_all_ext(ext_att_json)
def retrieve_all_ext(ext_att_json):
new_computer = {}
# new_computer['original_id'] = ids['id']
# new_computer['original_name'] = ids['name']
for computer in ext_att_json:
new_computer[str(computer['name'])] = computer['value']
add_to_master_list(new_computer)
def add_to_master_list(new_computer):
final_output_list.append(new_computer)
print(final_output_list)
def main():
# Function to run the get all assets function
get_all_assets()
if __name__ == '__main__':
# Function to run the functions in order: main > get all assets >
main()
Please do let me know the relative performance time with your 25 assets in 18.89 seconds! I'm very curious.
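One more note on the timing question: part of that 'read' of '_ssl._SSLSocket' cost is likely the TLS handshake being repeated for every computer. Sharing a single requests.Session pools and reuses the underlying connection, so the handshake is paid once per host rather than once per asset. A minimal sketch of that change (the helper name is mine, not from the original code):
import requests

# One Session for the whole run: connections (and their TLS handshakes)
# are pooled and reused across requests to the same host.
session = requests.Session()

def authed_get(url, headers):
    return session.get(url, headers=headers)

# In the code above, the requests.get(...) calls in get_all_assets() and
# extension_attribute_request() would go through this helper instead.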
I'd still recommend my other answer, the one using the requests module, from a pure cleanliness perspective (requests is very clean to work with), but I recognize it may or may not address your original question.
If you want to try PyCurl, which likely will impact your original question, here's the same code implemented with that approach:
# Requires pycurl module be installed.:
# `pip install pycurl` or `pip3 install pycurl`
# https://pypi.python.org/pypi/pycurl/7.43.0
# NOTE: The syntax used herein for pycurl is python 3 compliant.
# Not python 2 compliant.
import pycurl
import base64
import json
import csv
from io import BytesIO
def pycurl_data( url, headers ):
buffer = BytesIO()
connection = pycurl.Curl()
connection.setopt( connection.URL, url )
connection.setopt(pycurl.HTTPHEADER, headers )
connection.setopt( connection.WRITEDATA, buffer )
connection.perform()
connection.close()
body = buffer.getvalue()
# NOTE: The following assumes a byte string and a utf8 format. Change as desired.
return json.loads( body.decode('utf8') )
# Count Number so that process only runs for 25 assets at a time will be
# replaced with a variable that is determined by the number of computers added
# to the list
Count_Stop = 25
final_output_list = []
def get_creds():
# Credentials Function that retrieves username:pw from .file
with open('.cred') as cred_file:
cred_string = cred_file.read().rstrip()
return cred_string
print(cred_string)
def get_all_assets():
# Function to retrieve computer ID + computer names and store the ID in a
# new list called computers_parsed
base_url = 'jss'
what_we_want = 'JSSResource/computers'
request_url = base_url + what_we_want
# NOTE the request_url is constructed based on your request assignment just below.
# As such, it is malformed as a URL, and I assume anonymized for your posting on SO.
# request = urllib2.Request('jss'
# 'JSSResource/computers')
#
creds = get_creds()
# encode/decode so the header builds under Python 3 (b64encode needs bytes)
headers = [ 'Authorization: Basic ' + base64.b64encode(creds.encode()).decode() ]
response = pycurl_data( request_url, headers )
# At this point the request for ID + name has been retrieved and pycurl_data()
# has already parsed the JSON response into a dict
parsed_ids_json = response
# Then assign the parsed list (which has nested lists) at key 'computers'
# to a new list variable called computer_set
computer_set = parsed_ids_json['computers']
# New list to store just the computer ID's obtained in Loop below
computer_ids = []
# Count variable, when equal to max # of computers in Count_stop it stops.
count = 0
# This for loop iterates over ID + name in computer_set and returns the ID
# to the list computer_ids
for computers in computer_set:
count += 1
computer_ids.append(computers['id'])
# This IF condition allows for the script to be tested at 25 assets
# instead of all 2,000+ (comment out other announce_all_assets call)
if count == Count_Stop:
announce_all_assets(computer_ids, count)
# announce_all_assets(computer_ids, count)
def announce_all_assets(computer_ids, count):
print('Final list of ID\'s for review: ' + str(computer_ids))
print('Total number of computers to check against JSS: ' +
str(count))
extension_attribute_request(computer_ids, count)
def extension_attribute_request(computer_ids, count):
# Creating new variable, first half of new URL used in loop to get
# extension attributes using the computer ID's in computers_ids
base_url = 'jss'
what_we_want = '/subset/extensionattributes'
creds = get_creds()
print('Extension attribute function starts now:')
for ids in computer_ids:
request_url = base_url + str(ids) + what_we_want
headers = [ 'Authorization: Basic ' + base64.b64encode(creds.encode()).decode() ]
response = pycurl_data( request_url, headers )
parsed_ext_json = response
ext_att_json = parsed_ext_json['computer']['extension_attributes']
retrieve_all_ext(ext_att_json)
def retrieve_all_ext(ext_att_json):
new_computer = {}
# new_computer['original_id'] = ids['id']
# new_computer['original_name'] = ids['name']
for computer in ext_att_json:
new_computer[str(computer['name'])] = computer['value']
add_to_master_list(new_computer)
def add_to_master_list(new_computer):
final_output_list.append(new_computer)
print(final_output_list)
def main():
# Function to run the get all assets function
get_all_assets()
if __name__ == '__main__':
# Function to run the functions in order: main > get all assets >
main()
I've got a script here which (ideally) iterates through multiple pages X of JSON data for each entity Y (in this case, multiple loans X for each team Y). The way that the api is constructed, I believe I must physically change a subdirectory within the URL in order to iterate through multiple entities. Here is the explicit documentation and URL:
GET /teams/:id/loans
Returns loans belonging to a particular team.
Example http://api.kivaws.org/v1/teams/2/loans.json
Parameters:
id (number) – Required. The team ID for which to return loans.
page (number) – The page position of results to return. Default: 1
sort_by (string) – The order by which to sort results. One of: oldest, newest. Default: newest
app_id (string) – The application id in reverse DNS notation.
ids_only (string) – Return IDs only to make the return object smaller. One of: true, false. Default: false
Response: loan_listing – HTML, JSON, XML, RSS
Status: Production
And here is my script, which does run and appear to extract the correct data, but doesn't seem to write any data to the outfile:
# -*- coding: utf-8 -*-
import urllib.request as urllib
import json
import time
# storing team loans dict. The key is the team id, and the value is the list of loans
team_loans = {}
url = "http://api.kivaws.org/v1/teams/"
#teams_id range 1 - 11885
for i in range(1, 100):
params = dict(
id = i
)
#i =1
try:
handle = urllib.urlopen(str(url+str(i)+"/loans.json"))
print(handle)
except:
print("Could not handle url")
continue
# reading response
item_html = handle.read().decode('utf-8')
# converting bytes to str
data = str(item_html)
# converting to json
data = json.loads(data)
# getting number of pages to crawl
numPages = data['paging']['pages']
# deleting paging data
data.pop('paging')
# calling additional pages
if numPages >1:
for pa in range(2,numPages+1,1):
#pa = 2
handle = urllib.urlopen(str(url+str(i)+"/loans.json?page="+str(pa)))
print("Pulling loan data from team " + str(i) + "...")
# reading response
item_html = handle.read().decode('utf-8')
# converting bytes to str
datatemp = str(item_html)
# converting to json
datatemp = json.loads(datatemp)
#Pagings are redundant headers
datatemp.pop('paging')
# adding data to initial list
for loan in datatemp['loans']:
data['loans'].append(loan)
time.sleep(2)
# recording loans by team in dict
team_loans[i] = data['loans']
if (data['loans']):
print("===Data added to the team_loan dictionary===")
else:
print("!!!FAILURE to add data to team_loan dictionary!!!")
# recording data to file when 10 teams are read
print("===Finished pulling from page " + str(i) + "===")
if (int(i) % 10 == 0):
outfile = open("team_loan.json", "w")
print("===Now writing data to outfile===")
json.dump(team_loans, outfile, sort_keys = True, indent = 2, ensure_ascii=True)
outfile.close()
else:
print("!!!FAILURE to write data to outfile!!!")
# compliance with API # of requests
time.sleep(2)
print ('Done! Check your outfile (team_loan.json)')
I know that may be a hefty amount of code to throw in your faces, but it's a pretty sequential process.
Again, this program is pulling the correct data, but it is not writing this data to the outfile. Can anyone understand why?
For others who may read this post, the script does in fact write data to an outfile. It was simply my test code logic that was wrong. Ignore the print statements I have put into place.
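For completeness, here is roughly how that write step reads once the misleading test prints are dropped; the variable names are the ones from the script above:
# Write the accumulated dict every 10 teams; the old 'else' branch printed a
# failure message whenever i wasn't a multiple of 10, which only made it look
# like the write was failing.
if i % 10 == 0:
    with open("team_loan.json", "w") as outfile:
        print("===Now writing data to outfile===")
        json.dump(team_loans, outfile, sort_keys=True, indent=2, ensure_ascii=True)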