Is there any way to run queries repeatedly on Google BigQuery using a Python script?
I want to query a dataset on the Google BigQuery platform for one week's data, and I want to do this over a whole year. It is a bit too tedious to query the dataset 52 times by hand. Instead I would prefer to write a Python script (as I know Python).
I hope someone could point me in the right direction regarding this.
BigQuery supplies client libraries for several languages -- see https://cloud.google.com/bigquery/client-libraries -- and in particular for Python, with docs at https://developers.google.com/resources/api-libraries/documentation/bigquery/v2/python/latest/?_ga=1.176926572.834714677.1415848949 (you'll need to follow the hyperlinks to understand the docs).
https://cloud.google.com/bigquery/bigquery-api-quickstart gives an example of a command-line program, in either Java or Python, that uses the Google BigQuery API to run a query on one of the available Sample Datasets and display the result. After imports, and setting a few constants, the Python script boils down to
storage = Storage('bigquery_credentials.dat')
credentials = storage.get()

if credentials is None or credentials.invalid:
    # Run oauth2 flow with default arguments.
    credentials = tools.run_flow(FLOW, storage, tools.argparser.parse_args([]))

http = httplib2.Http()
http = credentials.authorize(http)

bigquery_service = build('bigquery', 'v2', http=http)

try:
    query_request = bigquery_service.jobs()
    query_data = {'query': 'SELECT TOP(title, 10) as title, COUNT(*) as revision_count FROM [publicdata:samples.wikipedia] WHERE wp_namespace = 0;'}
    query_response = query_request.query(projectId=PROJECT_NUMBER,
                                         body=query_data).execute()
    print 'Query Results:'
    for row in query_response['rows']:
        result_row = []
        for field in row['f']:
            result_row.append(field['v'])
        print '\t'.join(result_row)
except HttpError as err:
    print 'Error:', pprint.pprint(err.content)
except AccessTokenRefreshError:
    print ("Credentials have been revoked or expired, please re-run"
           " the application to re-authorize")
As you see, just 30 lines, mostly concerned with getting and checking authorization and handling errors. The "core" part, net of such considerations, is really just half those lines:
bigquery_service = build('bigquery', 'v2', http=http)
query_request = bigquery_service.jobs()
query_data = {'query': 'SELECT TOP(title, 10) as title, COUNT(*) as revision_count FROM [publicdata:samples.wikipedia] WHERE wp_namespace = 0;'}
query_response = query_request.query(projectId=PROJECT_NUMBER,
                                     body=query_data).execute()
print 'Query Results:'
for row in query_response['rows']:
    result_row = []
    for field in row['f']:
        result_row.append(field['v'])
    print '\t'.join(result_row)
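To address the "52 weeks" part of the question: once the service object exists, you can loop over week boundaries and substitute each date range into the SQL. A minimal sketch, assuming a hypothetical table [mydataset.mytable] with an event_date column (adapt both names to your dataset):

import datetime

def run_query(sql):
    # Reuses bigquery_service and PROJECT_NUMBER from the snippet above.
    return bigquery_service.jobs().query(
        projectId=PROJECT_NUMBER, body={'query': sql}).execute()

start = datetime.date(2014, 1, 6)  # assumed start: the year's first Monday
for week in range(52):
    week_start = start + datetime.timedelta(weeks=week)
    week_end = week_start + datetime.timedelta(days=7)
    # Table and column names are placeholders for illustration only.
    sql = ("SELECT COUNT(*) FROM [mydataset.mytable] "
           "WHERE event_date >= '%s' AND event_date < '%s'"
           % (week_start, week_end))
    response = run_query(sql)
    print 'Week %d:' % (week + 1), response.get('rows')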
You can use Google Dataflow for Python, and if it's a one-time thing, run it from your terminal or equivalent. Alternatively, you can set up a cron job on App Engine that loops through the code 52 times to collect your data. Search for "Google Dataflow scheduling" for details.
I have a user-facing portal built on CherryPy with forms that, once submitted, are sent to JIRA via the REST API for tracking purposes. After submission I take the user-supplied information from the form, together with the JIRA issue ID, and store them in an Oracle DB.
I then extended the portal so users can view their submissions on a list page and select a record to see what is stored in the DB for that submission. I had the idea of also using the JIRA REST API to fetch the status and assignee of the issue within JIRA. Converting my code from submitting to the API to querying it with the necessary JQL statement was fairly simple and can be seen below.
def jira_status_check(jira_id):
    if jira_id != "No JIRA Issue":
        try:
            search_url = "https://myjirainstance.atlassian.net/rest/api/2/search/?jql=issue=" + jira_id + "&fields=status,assignee,resolution"
            print search_url
            username = 'some_user'
            password = 'some_password'
            request = urllib2.Request(search_url)
            base64string = base64.encodestring('%s:%s' % (username, password)).replace('\n', '')
            request.add_header("Authorization", "Basic %s" % base64string)
            request.add_header("Content-Type", "application/json")
            result = urllib2.urlopen(request).read()
            json_results = json.loads(result)
            print json_results
            fields = json_results["issues"][0]["fields"]
            jira_status = fields["status"]["name"]
            if fields["resolution"] is not None:
                jira_status = jira_status + " - " + fields["resolution"]["name"]
            if fields["assignee"] is None:
                assignee_name = "Unassigned"
                assignee_NT = "Unassigned"
            else:
                assignee_name = fields["assignee"]["displayName"]
                assignee_NT = fields["assignee"]["name"]
            print jira_status
            print assignee_name
            print assignee_NT
            output = [jira_status, assignee_name, assignee_NT]
        except Exception:
            jira_status = "No JIRA Issue by that number or JIRA inaccessible"
            assignee_name = "No JIRA Issue by that number or JIRA inaccessible"
            assignee_NT = "No JIRA Issue by that number or JIRA inaccessible"
            output = [jira_status, assignee_name, assignee_NT]
    else:
        jira_status = "No JIRA Issue"
        assignee_name = "No JIRA Issue"
        assignee_NT = "No JIRA Issue"
        output = [jira_status, assignee_name, assignee_NT]
    return output
However, this is limited to searching a single record at a time, which works when you are only viewing a single record, but I was hoping to extend it to my list page and search many issues at once with one API query rather than tons of single-issue queries. I am able to use JQL and the REST API to search with multiple issue numbers at a link like this: https://myjirainstance.atlassian.net/rest/api/2/search/?jql=Issue%3DSPL-3284%20OR%20Issue%3DSPL-3285&fields=status,assignee,resolution
But then I wondered: what if a bad issue ID is saved and queried as part of that big query? Previously this was handled by the except clause in my jira_status_check function when it was a single-record query. When I query the REST API with a link like the last one shared, I instead get
{"errorMessages":["An issue with key 'SPL-6666' does not exist for field 'Issue'."],"warningMessages":[]}
I tried to build a query from an advanced search of issues but when I do something like Issue=SPL-3284 OR Issue=SPL-3285 OR Issue=SPL-6666 I get a response of An issue with key 'SPL-6666' does not exist for field 'Issue'.
Is there a correct way to search via JQL with multiple Issue numbers and give back no values for the fields for ones without matching issue numbers?
Or am I stuck with doing a ton of single-issue queries to the API to cover my bases? That would be less than ideal, and might push me to limit the API queries to the single-record view rather than the list page, for usability's sake.
Would I be better off moving my function to query JIRA to javascript/jquery that can populate the list of submissions after the page is rendered?
I ended up reaching out to Atlassian with my question about JQL, and they pointed me to the following REST API documentation and told me about the validateQuery parameter to add to my JQL to achieve my search: https://docs.atlassian.com/jira/REST/6.1.7/
When I now use a query similar to this on my REST API link, with the additional parameter
jql=Issue%3DSPL-3284 OR Issue%3DSPL-3285&fields=status,assignee,resolution&validateQuery=true
I get back a JSON with actual content for the issues which are valid and then a separate warningMessages object with any that are bad. An example JSON is below, but obviously $CONTENT would be actual results from the query
{
    "expand": "schema,names",
    "startAt": 0,
    "maxResults": 50,
    "total": 2,
    "issues": [
        {
            $CONTENT
        },
        {
            $CONTENT
        }
    ],
    "warningMessages": [
        "An issue with key 'SPL-6666' does not exist for field 'Issue'."
    ]
}
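For anyone who wants to wire this into Python directly: below is a minimal sketch of a batched lookup using the same urllib2/basic-auth approach as the function earlier in this question (the helper name and field handling are my own illustration, not part of the original code):

import base64
import json
import urllib
import urllib2

def jira_bulk_status_check(jira_ids, username='some_user', password='some_password'):
    # Build "issue=KEY-1 OR issue=KEY-2 ..."; validateQuery=true downgrades
    # bad keys to warningMessages instead of failing the whole request.
    jql = " OR ".join("issue=%s" % jid for jid in jira_ids)
    params = urllib.urlencode({
        'jql': jql,
        'fields': 'status,assignee,resolution',
        'validateQuery': 'true',
    })
    request = urllib2.Request("https://myjirainstance.atlassian.net/rest/api/2/search?" + params)
    auth = base64.encodestring('%s:%s' % (username, password)).replace('\n', '')
    request.add_header("Authorization", "Basic %s" % auth)
    request.add_header("Content-Type", "application/json")
    results = json.loads(urllib2.urlopen(request).read())
    for warning in results.get('warningMessages', []):
        print 'Warning:', warning
    return results.get('issues', [])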
Hopefully someone else will find this helpful in the future
I am using a Python script to get segment info from Google Analytics. All of the built-in segments that come with Google Analytics print fine, but the custom segments I have created are not showing up.
Here are the relevant portions of the script:
def get_service(api_name, api_version, scope, key_file_location,
                service_account_email):
    """Get a service that communicates to a Google API.

    Args:
        api_name: The name of the api to connect to.
        api_version: The api version to connect to.
        scope: A list of auth scopes to authorize for the application.
        key_file_location: The path to a valid service account p12 key file.
        service_account_email: The service account email address.

    Returns:
        A service that is connected to the specified API.
    """
    credentials = ServiceAccountCredentials.from_p12_keyfile(
        service_account_email, key_file_location, scopes=scope)
    http = credentials.authorize(httplib2.Http())

    # Build the service object.
    service = build(api_name, api_version, http=http)

    return service
def get_segments(service):
    try:
        segments = service.management().segments().list().execute()
    except TypeError, error:
        # Handle errors in constructing a query.
        print 'There was an error in constructing your query : %s' % error
    except HttpError, error:
        # Handle API errors.
        print ('There was an API error : %s : %s' %
               (error.resp.status, error.resp.reason))

    # Example #2:
    # The results of the list method are stored in the segments object.
    # The following code shows how to iterate through them.
    for segment in segments.get('items', []):
        print 'Segment Id = %s' % segment.get('id')
        print 'Segment kind = %s' % segment.get('kind')
        print 'Segment segmentId = %s' % segment.get('segmentId')
        print 'Segment Name = %s' % segment.get('name')
        print 'Segment Definition = %s' % segment.get('definition')
        if segment.get('created'):
            print 'Created = %s' % segment.get('created')
            print 'Updated = %s' % segment.get('updated')
        print
def main():
    # Define the auth scopes to request.
    scope = ['https://www.googleapis.com/auth/analytics.readonly']

    # Use the developer console and replace the values with your
    # service account email and relative location of your key file.
    service_account_email = '************'
    key_file_location = '*********'

    # Authenticate and construct service.
    service = get_service('analytics', 'v3', scope, key_file_location,
                          service_account_email)
    get_segments(service)

if __name__ == '__main__':
    main()
You need to have the Collaborate permission enabled for the custom segments you have created.
Manage Segments#Set Segment availability
Visit this link and go to the 'Set Segment availability' section there.
Apply the 'Collaborate' permission option as shown in the link.
After this, your segments will be pulled by the API code you posted ;)
Please note: the Analytics Core Reporting API doesn't have access to custom segments. They can only be accessed through the Analytics Management API.
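As a quick sanity check once the permission is applied, you can filter the list() response down to just your own segments; the Management API's segment resource carries a type field for this (a small sketch based on the loop in the question):

def get_custom_segments(service):
    # List only user-created segments by filtering on 'type'
    # ('CUSTOM' vs. 'BUILT_IN').
    segments = service.management().segments().list().execute()
    for segment in segments.get('items', []):
        if segment.get('type') == 'CUSTOM':
            print 'Custom segment: %s (%s)' % (segment.get('name'),
                                               segment.get('segmentId'))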
Now, I use this script to query BigQuery using the Python API:
import argparse

from googleapiclient.discovery import build
from googleapiclient.errors import HttpError
from oauth2client.client import GoogleCredentials

credentials = GoogleCredentials.get_application_default()
bigquery_service = build('bigquery', 'v2', credentials=credentials)

def request(query):
    # 'project' is the project ID, set elsewhere in the script.
    query_request = bigquery_service.jobs()
    query_data = {'query': query, 'timeoutMs': 100000}
    query_response = query_request.query(projectId=project, body=query_data).execute()
    return query_response

query = """
select domain
from
[logs.compressed_v40_20170313]
limit 150000"""

respond = request(query)
I get these results:
print respond['totalRows']  # total number of rows in the response
u'150000'
print len(respond['rows'])  # actual number of rows returned
100000
Question: how do I receive the remaining 50,000 rows?
To get more results after the first page of results, you need to call getQueryResults.
In your case, you'll need to get the Job ID and Page Token from the query response.
query_response = query_request.query(projectId=project, body=query_data).execute()

page_token = query_response['pageToken']
job_id = query_response['jobReference']['jobId']

next_page = bigquery_service.jobs().getQueryResults(
    projectId=project, jobId=job_id, pageToken=page_token).execute()
Continue this in a loop until you have all query results.
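A minimal sketch of that loop, building on the request() function from the question (the field names follow the v2 REST response; verify them against your client version):

def fetch_all_rows(query):
    response = request(query)
    job_id = response['jobReference']['jobId']
    rows = response.get('rows', [])
    # 'pageToken' is only present while more pages remain.
    page_token = response.get('pageToken')
    while page_token:
        page = bigquery_service.jobs().getQueryResults(
            projectId=project, jobId=job_id, pageToken=page_token).execute()
        rows.extend(page.get('rows', []))
        page_token = page.get('pageToken')
    return rows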
Note: the call to query can time out, but the query will still be running in the background. We recommend you create an explicit Job ID and insert a job manually rather than using the query method.
See the "async" query sample. Note: that it is not quite the proper name, since this sample does wait for the query to finish.
I'm trying to get a very simple Python script to talk to Freebase.
All the examples I've found use the simple / api key authorization model. So I made a Google Developer account, made a project, and tried to get a key as Google says to. It demands I provide a list of numeric IP addresses that I'll call from. Not feasible, since I don't have a fixed IP (I do have dyndns set up, but that doesn't help since Google won't take a domain name, only numerics).
So I tried OAuth2, which is overkill for what I need (I'm not accessing any non-public user data). But I couldn't find even one online example of using OAuth2 for Freebase. I tried adjusting other examples, but after bouncing around between appengine, Decorator, several obsolete Python libraries, and several other approaches, I got nowhere.
Can anyone either explain or point to a good example of how to do this (without spending 10x more time on authorization than on the app I'm trying to authorize)? A working example with OAuth2, preferably without many layers of "simplifying" APIs, or a tip on how to get around the fixed-IP requirement for API key authorization, would be fantastic. Thanks!
Steve
I had to do this for Google Drive, but as far as I know this should work for any Google API.
When you create a new Client ID in the developer console, you should have the option to create a Service Account. This will create a public/private key pair, and you can use that to authenticate without any OAuth nonsense.
I stole this code out of our GDrive library, so it may be broken, and it is GDrive-specific, so you will need to replace anything that says "drive" with whatever Freebase wants.
But I hope it's enough to get you started.
# Sample code that connects to Google Drive
from apiclient.discovery import build
import httplib2
from oauth2client.client import SignedJwtAssertionCredentials, VerifyJwtTokenError

SERVICE_EMAIL = "you@gmail.com"
PRIVATE_KEY_PATH = "./private_key.p12"

# Load private key
key = open(PRIVATE_KEY_PATH, 'rb').read()

# Build the credentials object
credentials = SignedJwtAssertionCredentials(SERVICE_EMAIL, key, scope='https://www.googleapis.com/auth/drive')

try:
    http = httplib2.Http()
    http = credentials.authorize(http)
except VerifyJwtTokenError as e:
    print(u"Unable to authorize using our private key: VerifyJwtTokenError, {0}".format(e))
    raise

connection = build('drive', 'v2', http=http)

# You can now use connection to call anything you need for freebase - see their API docs for more info.
Working from #Rachel's sample code, with a bit of fiddling I got to this, which works, and illustrates the topic, search, and query features.
You must install the Google API Python client from https://code.google.com/p/google-api-python-client/downloads/list (urllib and json are part of the standard library).
You must enable billing from 'settings' for the specific project.
The mqlread() interface for Python is broken as of April 2014.
The documented 'freebase.readonly' scope doesn't work.
import argparse
import json
import sys
import urllib

from apiclient.discovery import build
import httplib2
from oauth2client.client import SignedJwtAssertionCredentials, VerifyJwtTokenError

# Argument parsing (reconstructed: the original snippet assumed an
# existing 'args' object)
#
parser = argparse.ArgumentParser()
parser.add_argument('--serviceEmail')
parser.add_argument('--privateKeyFile')
parser.add_argument('--topicID')
parser.add_argument('--query')
parser.add_argument('--verbose', action='store_true')
args = parser.parse_args()

# Set up needed constants
#
SERVICE_EMAIL = args.serviceEmail
PRIVATE_KEY_PATH = args.privateKeyFile
topicID = args.topicID
query = args.query

search_url = 'https://www.googleapis.com/freebase/v1/search'
topic_url = 'https://www.googleapis.com/freebase/v1/topic'
mql_url = "https://www.googleapis.com/freebase/v1/mqlread"

key = open(PRIVATE_KEY_PATH, 'rb').read()
credentials = SignedJwtAssertionCredentials(SERVICE_EMAIL, key,
    scope='https://www.googleapis.com/auth/freebase')
try:
    http = httplib2.Http()
    http = credentials.authorize(http)
except VerifyJwtTokenError as e:
    print(u"Unable to authorize via private key: VerifyJwtTokenError, {0}".format(e))
    raise

connection = build('freebase', 'v1', http=http)

# Search for a topic by Freebase topic ID
# https://developers.google.com/freebase/v1/topic-overview
#
params = {'filter': 'suggest'}
url = topic_url + topicID + '?' + urllib.urlencode(params)
if args.verbose: print("URL: " + url)
resp = urllib.urlopen(url).read()
if args.verbose: print("Response: " + resp)
respJ = json.loads(resp)

print("Topic property(s) for '%s': " % topicID)
for property in respJ['property']:
    print(' ' + property + ':')
    for value in respJ['property'][property]['values']:
        print('  - ' + value['text'])
print("\n")

# Do a regular search
# https://developers.google.com/freebase/v1/search-overview
#
params = {'query': query}
url = search_url + '?' + urllib.urlencode(params)
if args.verbose: print("URL: " + url)
resp = urllib.urlopen(url).read()
if args.verbose: print("Response: " + resp)
respJ = json.loads(resp)

print("Search result for '%s': " % query)
theKeys = {}
for res in respJ['result']:
    print("%-40s %-15s %10.5f" %
          (res['name'], res['mid'], res['score']))

    # Run a query on the retrieved ID, to get its types:
    params = '{ "id": "%s", "type": []}' % (res['mid'])
    url = mql_url + '?query=' + params
    resp = urllib.urlopen(url).read()
    respJ = json.loads(resp)
    print("  Type(s): " + repr(respJ['result']['type']))

    otherKeys = []
    for k in res:
        if k not in ['name', 'mid', 'score']:
            otherKeys.append(k)
    if otherKeys:
        print("  Other keys: " + ", ".join(otherKeys))

sys.exit(0)
I am just starting with the Google Analytics Reporting API and used the Hello API tutorial to get started. (https://developers.google.com/analytics/solutions/articles/hello-analytics-api)
Unfortunately, I am stuck before I even start. I read it (twice), created the project, and updated the client_secrets.json file... but when I run the main, it crashes.
File "C:\Python27\New Libraries Downloaded\analytics-v3-python-cmd-line\hello_analytics_api_v3.py", line 173, in <module>
main(sys.argv)
File "C:\Python27\New Libraries Downloaded\analytics-v3-python-cmd-line\hello_analytics_api_v3.py", line 56, in main
service, flags = sample_tools.init(argv, 'analytics', 'v3', __doc__, __file__, scope='https://www.googleapis.com/auth/analytics.readonly')
NameError: global name '__file__' is not defined
I'm new (really really new) to this, so any help (and a more detailed tutorial) would be much appreciated.
Thanks !
EDIT: I haven't changed anything from the original code in the tutorial. I'll worry about modifications after I get this running. Thanks!
CODE: hello_analytics_api_v3.py
import argparse
import sys

from apiclient.errors import HttpError
from apiclient import sample_tools
from oauth2client.client import AccessTokenRefreshError


def main(argv):
    # Authenticate and construct service.
    service, flags = sample_tools.init(
        argv, 'analytics', 'v3', __doc__, __file__,
        scope='https://www.googleapis.com/auth/analytics.readonly')

    # Try to make a request to the API. Print the results or handle errors.
    try:
        first_profile_id = get_first_profile_id(service)
        if not first_profile_id:
            print 'Could not find a valid profile for this user.'
        else:
            results = get_top_keywords(service, first_profile_id)
            print_results(results)

    except TypeError, error:
        # Handle errors in constructing a query.
        print ('There was an error in constructing your query : %s' % error)

    except HttpError, error:
        # Handle API errors.
        print ('Arg, there was an API error : %s : %s' %
               (error.resp.status, error._get_reason()))

    except AccessTokenRefreshError:
        # Handle Auth errors.
        print ('The credentials have been revoked or expired, please re-run '
               'the application to re-authorize')


def get_first_profile_id(service):
    """Traverses Management API to return the first profile id.

    This first queries the Accounts collection to get the first account ID.
    This ID is used to query the Webproperties collection to retrieve the first
    webproperty ID. And both account and webproperty IDs are used to query the
    Profile collection to get the first profile id.

    Args:
        service: The service object built by the Google API Python client library.

    Returns:
        A string with the first profile ID. None if a user does not have any
        accounts, webproperties, or profiles.
    """
    accounts = service.management().accounts().list().execute()

    if accounts.get('items'):
        firstAccountId = accounts.get('items')[0].get('id')
        webproperties = service.management().webproperties().list(
            accountId=firstAccountId).execute()

        if webproperties.get('items'):
            firstWebpropertyId = webproperties.get('items')[0].get('id')
            profiles = service.management().profiles().list(
                accountId=firstAccountId,
                webPropertyId=firstWebpropertyId).execute()

            if profiles.get('items'):
                return profiles.get('items')[0].get('id')

    return None


def get_top_keywords(service, profile_id):
    """Executes and returns data from the Core Reporting API.

    This queries the API for the top 25 organic search terms by visits.

    Args:
        service: The service object built by the Google API Python client library.
        profile_id: String The profile ID from which to retrieve analytics data.

    Returns:
        The response returned from the Core Reporting API.
    """
    return service.data().ga().get(
        ids='ga:' + profile_id,
        start_date='2012-01-01',
        end_date='2012-01-15',
        metrics='ga:visits',
        dimensions='ga:source,ga:keyword',
        sort='-ga:visits',
        filters='ga:medium==organic',
        start_index='1',
        max_results='25').execute()


def print_results(results):
    """Prints out the results.

    This prints out the profile name, the column headers, and all the rows of
    data.

    Args:
        results: The response returned from the Core Reporting API.
    """
    print
    print 'Profile Name: %s' % results.get('profileInfo').get('profileName')
    print

    # Print header.
    output = []
    for header in results.get('columnHeaders'):
        output.append('%30s' % header.get('name'))
    print ''.join(output)

    # Print data table.
    if results.get('rows', []):
        for row in results.get('rows'):
            output = []
            for cell in row:
                output.append('%30s' % cell)
            print ''.join(output)
    else:
        print 'No Rows Found'


if __name__ == '__main__':
    main(sys.argv)
According to the error, the program doesn't recognize __file__. This error comes up in IPython (__file__ is only defined when code is run from a file, not typed into an interactive session), but it shouldn't come up when running an actual file: there, __file__ returns the script's full path and file name.
Try saving the code to a file and running it from there, or simply paste in the full path and file name instead.
Also be sure that the client secrets file is located in the same folder as your script!
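If you just want it to work from an interactive session, one workaround is to pass sample_tools.init an explicit filename instead of __file__; that argument is only used to locate client_secrets.json next to the script (the path below is a hypothetical example):

import sys
from apiclient import sample_tools

# __file__ replaced by a literal path; sample_tools.init looks for
# client_secrets.json in the same directory as this file.
service, flags = sample_tools.init(
    sys.argv, 'analytics', 'v3', __doc__,
    'C:\\Python27\\analytics-v3-python-cmd-line\\hello_analytics_api_v3.py',
    scope='https://www.googleapis.com/auth/analytics.readonly')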