Google BigQuery python - error paginating table

Google BigQuery python - error paginating table - python

I have a large table in BigQuery, which i have to go through, get all data and process it in my GAE app. Since my table is going to be about 4m rows, i decided i have to get data via pagination mechanism implemeted in code examples here > https://cloud.google.com/bigquery/querying-data
def async_query(query):
client = bigquery.Client()
query_job = client.run_async_query(str(uuid.uuid4()), query)
query_job.use_legacy_sql = False
query_job.use_query_cache = False
query_job.begin()
wait_for_job(query_job)
query_results = query_job.results()
page_token = None
output_rows = []
while True:
rows, total_rows, page_token = query_results.fetch_data(max_results = 200, page_token = page_token)
output_rows = output_rows + rows
if not page_token:
break
def wait_for_job(job):
while True:
job.reload() # Refreshes the state via a GET request.
if job.state == 'DONE':
if job.error_result:
raise RuntimeError(job.errors)
return
time.sleep(1)
But when i execute it i receive an error:
DeadlineExceededError: The overall deadline for responding to the HTTP request was exceeded.
When max_results parameter > table size it works fine. When max_results < table size and pagination required - i get this error.
Am i missing something?

The error indicates your overall request handler processing takes too long. Very likely because of the multiple query_results.fetch_data iterations, due to pagination.
You may want to check:
Dealing with DeadlineExceededErrors
Deadline errors: 60 seconds or less in Google App Engine
You'll probably have to re-think your app a bit, maybe try to not get the whole result immediately and instead either:
get just a portion of the result
get the result on a separate request, later on, after obtaining it in the background, for example:
via a single task queue request (10 min instead or 60s deadline)
by assembling it from multiple chunks collected in separate task queue requests to really make it scalable (not sure if this actually works with bigquery, I only tried it with the datastore)

Related

Simple function to respect Twitter's V2 API rate limits?

Problem:
Often we'd like to pull much more data than Twitter would like us to at one time. In between each query it would be wonderful if there was a simple function to call that checks if you need to wait.
Question:
What is a simple function for respecting Twitter's API limits and ensuring that any long-running-query will complete successfully without harassing Twitter and ensure the querying user does not get banned?
Ideal Answer:
The most ideal answer would be a portable function that should work in all situations. That is, finish (properly) no matter what, and respect Twitter's API rate limit rules.
Caveat
I have submitted a working answer of my own but I am unsure if there is a way to improve it.

I am developing a Python package to utilize Twitter's new V2 API. I want to make sure that I am respecting Twitter's rate limits as best as I possibly can.
Below are the two functions used to wait when needed. They check the API call response headers for remaining queries and then also rely on Twitter's HTTP codes provided here as an ultimate backup. As far as I can tell, these three HTTP codes are the only time-related errors, and the others should raise issues for an API user to inform them of whatever they are doing incorrectly.
from datetime import datetime
from osometweet.utils import pause_until
def manage_rate_limits(response):
"""Manage Twitter V2 Rate Limits
This method takes in a `requests` response object after querying
Twitter and uses the headers["x-rate-limit-remaining"] and
headers["x-rate-limit-reset"] headers objects to manage Twitter's
most common, time-dependent HTTP errors.
"""
while True:
# Get number of requests left with our tokens
remaining_requests = int(response.headers["x-rate-limit-remaining"])
# If that number is one, we get the reset-time
# and wait until then, plus 15 seconds.
# The regular 429 exception is caught below as well,
# however, we want to program defensively, where possible.
if remaining_requests == 1:
buffer_wait_time = 15
resume_time = datetime.fromtimestamp( int(response.headers["x-rate-limit-reset"]) + buffer_wait_time )
print(f"One request from being rate limited. Waiting on Twitter.\n\tResume Time: {resume_time}")
pause_until(resume_time)
# Explicitly checking for time dependent errors.
# Most of these errors can be solved simply by waiting
# a little while and pinging Twitter again - so that's what we do.
if response.status_code != 200:
# Too many requests error
if response.status_code == 429:
buffer_wait_time = 15
resume_time = datetime.fromtimestamp( int(response.headers["x-rate-limit-reset"]) + buffer_wait_time )
print(f"Too many requests. Waiting on Twitter.\n\tResume Time: {resume_time}")
pause_until(resume_time)
# Twitter internal server error
elif response.status_code == 500:
# Twitter needs a break, so we wait 30 seconds
resume_time = datetime.now().timestamp() + 30
print(f"Internal server error # Twitter. Giving Twitter a break...\n\tResume Time: {resume_time}")
pause_until(resume_time)
# Twitter service unavailable error
elif response.status_code == 503:
# Twitter needs a break, so we wait 30 seconds
resume_time = datetime.now().timestamp() + 30
print(f"Twitter service unavailable. Giving Twitter a break...\n\tResume Time: {resume_time}")
pause_until(resume_time)
# If we get this far, we've done something wrong and should exit
raise Exception(
"Request returned an error: {} {}".format(
response.status_code, response.text
)
)
# Each time we get a 200 response, exit the function and return the response object
if response.ok:
return response
Here is the pause_until function.
def pause_until(time):
""" Pause your program until a specific end time. 'time' is either
a valid datetime object or unix timestamp in seconds (i.e. seconds
since Unix epoch) """
end = time
# Convert datetime to unix timestamp and adjust for locality
if isinstance(time, datetime):
# If we're on Python 3 and the user specified a timezone,
# convert to UTC and get tje timestamp.
if sys.version_info[0] >= 3 and time.tzinfo:
end = time.astimezone(timezone.utc).timestamp()
else:
zoneDiff = pytime.time() - (datetime.now() - datetime(1970, 1, 1)).total_seconds()
end = (time - datetime(1970, 1, 1)).total_seconds() + zoneDiff
# Type check
if not isinstance(end, (int, float)):
raise Exception('The time parameter is not a number or datetime object')
# Now we wait
while True:
now = pytime.time()
diff = end - now
#
# Time is up!
#
if diff <= 0:
break
else:
# 'logarithmic' sleeping to minimize loop iterations
sleep(diff / 2)
This seems to work quite nicely but I'm not sure if there are edge-cases that will break this or if there is simply a more elegant/simple way to do this.

Google Contacts API: Temporary internal error, when uploading contact photos in parallel

I need to change the contact photo for a large number of contacts, using the python client for the Google Contacts API 3.0
gdata==2.0.18
The code I'm running is:
client = gdata.contacts.client.ContactsClient(source=MY_APP_NAME)
GDClientAuth(client, MY_AUTH)
def _get_valid_contact(contact_id):
contact = client.GetContact(contact_id)
if contact.GetPhotoLink() is None:
# Generate a proper photo link for this contact
link = gdata.contacts.data.ContactLink()
link.etag = '*'
link.href = generate_photo_url(contact)
link.rel = 'http://schemas.google.com/contacts/2008/rel#photo'
link.type = 'image/*'
contact.link.append(link)
return contact
def upload_photo(contact_id, image_path, image_type, image_size):
contact = _get_valid_contact(contact_id)
try:
client.ChangePhoto(media=image_path,
contact_entry_or_url=contact,
content_type=image_type,
content_length=image_size)
except gdata.client.RequestError as req:
if req.status == 412:
#handle etag mismatches, etc...
pass
Given a list of valid Google contact ids, if I run the upload_photo method sequentially for each of them, everything goes smoothly, and all the contacts get their photo changed:
for contact_id in CONTACT_ID_LIST:
upload_photo(contact_id, '/path/to/image', 'image/png', 1234)
However, if I try to upload the photos in parallel (using at least 4 threads), some of them shall randomly fail with 500, A temporary internal problem has occurred. Try again later as a response to the client.ChangePhoto call. I can retry these photos later, though, and they finally get updated:
from multiprocessing.pool import ThreadPool
pool = ThreadPool(4)
for contact_id in CONTACT_ID_LIST:
pool.apply_async(func=upload_photo,
args=(contact_id,'/path/to/image', 'image/png', 1234))
The more threads I use, the more frequently the error happens.
The only similar issue I could find is http://code.google.com/a/google.com/p/apps-api-issues/issues/detail?id=2507, and it was solved some time ago.
The issue I'm facing now might be different, as it happens randomly, and only when running the updates in parallel. So there are chances there might be a race condition at some point at the Google Contacts API end.

Will RPC on App Engine localize to a single instance?

I'm using RPC to fetch multiple URLs asynchronously. I'm using a global variable to track completion and notice that the contents of that global have radically different contents before and after the RPC calls complete.
Feels like I'm missing something obvious... Is it possible for the rpc.wait() to result in the app context being loaded on a new instance when the callbacks are made?
Here's the basic pattern...
aggregated_results = {}
def aggregateData(sid):
# local variable tracking results
aggregated_results[sid] = []
# create a bunch of asynchronous url fetches to get all of the route data
rpcs = []
for r in routes:
rpc = urlfetch.create_rpc()
rpc.callback = create_callback(rpc,sid)
urlfetch.make_fetch_call(rpc, url)
rpcs.append(rpc)
# all of the schedule URLs have been fetched. now wait for them to finish
for rpc in rpcs:
rpc.wait()
# look at results
try:
if len(aggregated_results[sid]) == 0:
logging.debug("We couldn't find results for transaction")
except KeyError as e:
logging.error('aggregation error: %s' % e.message)
logging.debug(aggregated_results)
return aggregated_results[sid]
def magic_callback(rpc, sid):
# do some work to parse the result
# of the urlfetch call...
# <hidden>
#
try:
if len(aggregated_results[sid]) == 0:
aggregated_results[sid] = [stop]
else:
done = False
for i, s in enumerate(aggregated_results[sid]):
if stop.time <= s.time:
aggregated_results[sid].insert(i,stop)
done = True
break
if not done:
aggregated_results[sid].append(stop)
except KeyError as e:
logging.error('aggregation error: %s' % e.message)
The KeyError is thrown both inside the callback as well as the end of processing all of the results. Neither of those should happen.
When I print out the contents of the dictionary, the sid is in fact gone, but there are other entries for other requests that are being processed. In some cases, more entries than I see when the respective request starts.
This pattern is called on a web request handler. Not in the background.
It's as if, the callbacks occur on a difference instance.
The sid key in this case is a combination of strings that includes a time string and I'm confident it is unique.

GAE - How can I combine the results of several asynchronous url fetches?

I have a Google AppEngine (in Python) application where I need to perform 4 to 5 url fetches, and then combine the data before I print it out to the response.
I can do this without any problems using a synchronous workflow, but since the urls that I am fetching are not related or dependent on each other, performing this asynchronously would be the most ideal (and quickest).
I have read and re-read the documentation here, but I just can't figure out how to get read the contents for each url. I've also searched the web for a small example (which is really what I am in need of). I have seen this SO question, but again, here they don't mention anything about reading the contents of these individual asynchronous url fetches.
Does anyone have any simple examples of how to perform 4 or 5 asynchronous url fetches with AppEngine? And then combine the results before I print it to the response?
Here is what I have so far:
rpcs = []
for album in result_object['data']:
total_facebook_photo_count = total_facebook_photo_count + album['count']
facebook_albumid_array.append(album['id'])
#Get the photos in the photo album
facebook_photos_url = 'https://graph.facebook.com/%s/photos?access_token=%s&limit=1000' % (album['id'], access_token)
rpc = urlfetch.create_rpc()
urlfetch.make_fetch_call(rpc, facebook_photos_url)
rpcs.append(rpc)
for rpc in rpcs:
result = rpc.get_result()
self.response.out.write(result.content)
However, it still looks like the line: result = rpc.get_result() is forcing it to wait for the first request to finish, then the second, then the third, and so forth. Is there a way to simply put the results in a variables as they are received?
Thanks!

In the example, text = result.content is where you get the content (body).
To do url fetches in parallell, you could set them up, add to a list and check results afterwards. Expanding on the example already mentioned, it could look something like:
from google.appengine.api import urlfetch
futures = []
for url in urls:
rpc = urlfetch.create_rpc()
urlfetch.make_fetch_call(rpc, url)
futures.append(rpc)
contents = []
for rpc in futures:
try:
result = rpc.get_result()
if result.status_code == 200:
contents.append(result.content)
# ...
except urlfetch.DownloadError:
# Request timed out or failed.
# ...
concatenated_result = '\n'.join(contents)
In this example, we assemble the body of all the requests that returned status code 200, and concatenate with linebreak between them.
Or with ndb, my personal preference for anything async on GAE, something like:
#ndb.tasklet
def get_urls(urls):
ctx = ndb.get_context()
result = yield map(ctx.urlfetch, urls)
contents = [r.content for r in result if r.status_code==200]
raise ndb.Return('\n'.join(contents))

I use this code (implmented before I learned about ndb tasklets):
while rpcs:
rpc = UserRPC.wait_any(rpcs)
result = rpc.get_result()
# process result here
rpcs.remove(rpc)

Python-ldap search: Size Limit Exceeded

I'm using the python-ldap library to connect to our LDAP server and run queries. The issue I'm running into is that despite setting a size limit on the search, I keep getting SIZELIMIT_EXCEEDED errors on any query that would return too many results. I know that the query itself is working because I will get a result if the query returns a small subset of users. Even if I set the size limit to something absurd, like 1, I'll still get a SIZELIMIT_EXCEEDED on those bigger queries. I've pasted a generic version of my query below. Any ideas as to what I'm doing wrong here?
result = self.ldap.search_ext_s(self.base, self.scope, '(personFirstMiddle=<value>*)', sizelimit=5)

When the LDAP client requests a size-limit, that is called a 'client-requested' size limit. A client-requested size limit cannot override the size-limit set by the server. The server may set a size-limit for the server as a whole, for a particular authorization identity, or for other reasons - whichever the case, the client may not override the server size limit. The search request may have to be issued in multiple parts using the simple paged results control or the virtual list view control.

Here's a Python3 implementation that I came up with after heavily editing what I found here and in the official documentation. At the time of writing this it works with the pip3 package python-ldap version 3.2.0.
def get_list_of_ldap_users():
hostname = "google.com"
username = "username_here"
password = "password_here"
base = "dc=google,dc=com"
print(f"Connecting to the LDAP server at '{hostname}'...")
connect = ldap.initialize(f"ldap://{hostname}")
connect.set_option(ldap.OPT_REFERRALS, 0)
connect.simple_bind_s(username, password)
connect=ldap_server
search_flt = "(personFirstMiddle=<value>*)" # get all users with a specific middle name
page_size = 500 # how many users to search for in each page, this depends on the server maximum setting (default is 1000)
searchreq_attrlist=["cn", "sn", "name", "userPrincipalName"] # change these to the attributes you care about
req_ctrl = SimplePagedResultsControl(criticality=True, size=page_size, cookie='')
msgid = connect.search_ext(base=base, scope=ldap.SCOPE_SUBTREE, filterstr=search_flt, attrlist=searchreq_attrlist, serverctrls=[req_ctrl])
total_results = []
pages = 0
while True: # loop over all of the pages using the same cookie, otherwise the search will fail
pages += 1
rtype, rdata, rmsgid, serverctrls = connect.result3(msgid)
for user in rdata:
total_results.append(user)
pctrls = [c for c in serverctrls if c.controlType == SimplePagedResultsControl.controlType]
if pctrls:
if pctrls[0].cookie: # Copy cookie from response control to request control
req_ctrl.cookie = pctrls[0].cookie
msgid = connect.search_ext(base=base, scope=ldap.SCOPE_SUBTREE, filterstr=search_flt, attrlist=searchreq_attrlist, serverctrls=[req_ctrl])
else:
break
else:
break
return total_results
This will return a list of all users but you can edit it as required to return what you want without hitting the SIZELIMIT_EXCEEDED issue :)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Google BigQuery python - error paginating table - python

Related

Simple function to respect Twitter's V2 API rate limits?

Google Contacts API: Temporary internal error, when uploading contact photos in parallel

Will RPC on App Engine localize to a single instance?

GAE - How can I combine the results of several asynchronous url fetches?

Python-ldap search: Size Limit Exceeded

Categories

Resources