How to query an advanced search with the Google Custom Search API? - python

How can I programmatically, using the Google Python client library, run an advanced search with the Google Custom Search API and return a list of the first n links matching the terms and parameters of that advanced search?
I tried to check the documentation (I did not find any example) and this answer. However, the latter did not work, since there is currently no support for the AJAX API. So far I have tried this:
from googleapiclient.discovery import build
import pprint

my_api_key = "<My API key>"
my_cse_id = "test"

def google_search(search_term, api_key, cse_id, **kwargs):
    service = build("customsearch", "v1", developerKey="<My developer key>")
    res = service.cse().list(q=search_term, cx=cse_id, **kwargs).execute()
    return res['items']

results = google_search('dogs', my_api_key, my_cse_id, num=10)
for result in results:
    pprint.pprint(result)
And this:
import pprint
from googleapiclient.discovery import build

def main():
    service = build("customsearch", "v1", developerKey="<My developer key>")
    res = service.cse().list(q='dogs').execute()
    pprint.pprint(res)

if __name__ == '__main__':
    main()
Thus, any idea of how to do an advanced search with Google's search engine API? This is how my credentials look in the Google console:
[screenshot: API credentials in the Google Cloud console]

First you need to define a custom search engine as described here, then make sure your my_cse_id matches the ID of that custom search engine (cse), e.g.
cx='017576662512468239146:omuauf_lfve'
is a search engine which only searches for domains ending with .com.
Next we need our developerKey.
from googleapiclient.discovery import build
service = build("customsearch", "v1", developerKey=dev_key)
Now we can execute our search.
res = service.cse().list(q=search_term, cx=my_cse_id).execute()
We can add additional search parameters, like language or country by using the arguments described here, e.g.
res = service.cse().list(q="the best dog food", cx=my_cse_id, cr="countryUK", lr="lang_en").execute()
would search for "the best dog food" in English, with results restricted to sites from the UK.
The following modified code worked for me. api_key was removed since it was never used.
from googleapiclient.discovery import build

my_cse_id = "012156694711735292392:rl7x1k3j0vy"
dev_key = "<Your developer key>"

def google_search(search_term, cse_id, **kwargs):
    service = build("customsearch", "v1", developerKey=dev_key)
    res = service.cse().list(q=search_term, cx=cse_id, **kwargs).execute()
    return res['items']

results = google_search('boxer dogs', my_cse_id, num=10, cr="countryCA", lr="lang_en")
for result in results:
    print(result.get('link'))
Output
http://www.aboxerworld.com/whiteboxerfaqs.htm
http://boxerrescueontario.com/?section=available_dogs
http://www.aboxerworld.com/abouttheboxerbreed.htm
http://m.huffpost.com/ca/entry/10992754
http://rawboxers.com/aboutraw.shtml
http://www.tanoakboxers.com/
http://www.mondlichtboxers.com/
http://www.tanoakboxers.com/puppies/
http://www.landosboxers.com/dogs/puppies/puppies.htm
http://www.boxerrescuequebec.com/

An alternative using the Python requests library, if you do not want to use the Google API discovery client:
import requests, pprint

q = 'italy'
api_key = 'AIzaSyCs.....................'
response = requests.get('https://content.googleapis.com/customsearch/v1',
                        params={'cx': '013027958806940070381:dazyknr8pvm', 'q': q, 'key': api_key})
pprint.pprint(response.json())
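The JSON has the same shape as what the discovery client returns, so pulling out just the result links might look like this (a sketch, assuming the response above):
data = response.json()
# Each search hit sits in the 'items' list; 'link' holds the result URL.
for item in data.get('items', []):
    print(item['link'])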

This is late but hopefully it helps someone...
For an advanced search, use:
response = service.cse().list(q="mysearchterm",
                              cx="017576662512468239146:omuauf_lfve").execute()
The list() method takes more arguments that let you refine your search; the full set of parameters is documented here:
https://developers.google.com/custom-search/json-api/v1/reference/cse/list
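For example, a sketch combining several of the parameters from that reference (the values here are purely illustrative):
# Restrict results to recent PDF documents from one site, with safe search on.
response = service.cse().list(
    q="mysearchterm",
    cx="017576662512468239146:omuauf_lfve",
    num=10,                    # results per page (maximum is 10)
    start=11,                  # 1-based index of the first result, for paging
    siteSearch="example.com",  # only return results from this site
    dateRestrict="y1",         # only results from the last year
    fileType="pdf",            # only PDF documents
    safe="active",             # enable safe search
).execute()
for item in response.get('items', []):
    print(item['link'])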

Related

Python ml engine predict: How can I make a googleapiclient.discovery.build persistent?

I need to make online predictions from a model that is deployed in Cloud ML Engine. My code in Python is similar to the one found in the docs (https://cloud.google.com/ml-engine/docs/tensorflow/online-predict):
service = googleapiclient.discovery.build('ml', 'v1')
name = 'projects/{}/models/{}'.format(project, model)
if version is not None:
    name += '/versions/{}'.format(version)

response = service.projects().predict(
    name=name,
    body={'instances': instances}
).execute()
However, I receive the "instances" data from outside the script, so I wonder if there is a way to serve requests without running service = googleapiclient.discovery.build('ml', 'v1') each time beforehand, since it takes time.
PS: this is my very first project on GCP. Thank you.
Something like this will work. You'll want to initialize your service globally then use that service instance to make your call.
import os

import googleapiclient.discovery

AI_SERVICE = None

def ai_platform_init():
    global AI_SERVICE
    # Set GCP authentication
    credentials = os.environ.get('GOOGLE_APPLICATION_CREDENTIALS')
    # Path to your credentials
    credentials_path = os.path.join(os.path.dirname(__file__), 'ai-platform-credentials.json')
    if credentials is None and os.path.exists(credentials_path):
        os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = credentials_path
    # Create the AI Platform service (MemoryCache is defined below)
    if os.path.exists(credentials_path):
        AI_SERVICE = googleapiclient.discovery.build('ml', 'v1', cache=MemoryCache())

# Initialize AI Platform on load.
ai_platform_init()
Then later on, you can do something like this:
def call_ai_platform():
    response = AI_SERVICE.projects().predict(name=name,
                                             body={'instances': instances}).execute()
Bonus! In case you were curious about the MemoryCache class in the googleapiclient.discovery call, it was borrowed from another SO answer:
class MemoryCache():
    """A workaround for cache warnings from Google.

    Check out: https://github.com/googleapis/google-api-python-client/issues/325#issuecomment-274349841
    """
    _CACHE = {}

    def get(self, url):
        return MemoryCache._CACHE.get(url)

    def set(self, url, content):
        MemoryCache._CACHE[url] = content

oauth2client is now deprecated

In the Python code for requesting data from Google Analytics (https://developers.google.com/analytics/devguides/reporting/core/v4/quickstart/service-py) via an API, oauth2client is used. The code was last updated in July 2018, and oauth2client has since been deprecated. My question is: can I get the same code with google-auth or oauthlib used instead of oauth2client?
I googled for a way to replace the parts of the code where oauth2client is used, but since I am not a developer I didn't succeed. This is how I tried to adapt the code in that link (https://developers.google.com/analytics/devguides/reporting/core/v4/quickstart/service-py) to google-auth. Any idea how to fix this?
import argparse

from apiclient.discovery import build
from google.oauth2 import service_account
from google.auth.transport.urllib3 import AuthorizedHttp

SCOPES = ['...']
DISCOVERY_URI = ('...')
CLIENT_SECRETS_PATH = 'client_secrets.json'  # Path to client_secrets.json file.
VIEW_ID = '...'

def initialize_analyticsreporting():
    """Initializes the analyticsreporting service object.

    Returns:
        analytics: an authorized analyticsreporting service object.
    """
    # Parse command-line arguments.
    credentials = service_account.Credentials.from_service_account_file(CLIENT_SECRETS_PATH)

    # Prepare credentials, and authorize HTTP object with them.
    # If the credentials don't exist or are invalid run through the native client
    # flow. The Storage object will ensure that if successful the good
    # credentials will get written back to a file.
    authed_http = AuthorizedHttp(credentials)
    response = authed_http.request('GET', SCOPES)

    # Build the service object.
    analytics = build('analytics', 'v4', http=http, discoveryServiceUrl=DISCOVERY_URI)

    return analytics
def get_report(analytics):
    # Use the Analytics Service Object to query the Analytics Reporting API V4.
    return analytics.reports().batchGet(
        body={
            "reportRequests": [
                {
                    "viewId": VIEW_ID,
                    "dateRanges": [
                        {
                            "startDate": "2019-01-01",
                            "endDate": "yesterday"
                        }],
                    "dimensions": [
                        {"name": "ga:transactionId"},
                        {"name": "ga:sourceMedium"},
                        {"name": "ga:date"}],
                    "metrics": [
                        {"expression": "ga:transactionRevenue"}]
                }]
        }
    ).execute()
def printResults(response):
    for report in response.get("reports", []):
        columnHeader = report.get("columnHeader", {})
        dimensionHeaders = columnHeader.get("dimensions", [])
        metricHeaders = columnHeader.get("metricHeader", {}).get("metricHeaderEntries", [])
        rows = report.get("data", {}).get("rows", [])

        for row in rows:
            dimensions = row.get("dimensions", [])
            dateRangeValues = row.get("metrics", [])

            for header, dimension in zip(dimensionHeaders, dimensions):
                print(header + ": " + dimension)

            for i, values in enumerate(dateRangeValues):
                for metric, value in zip(metricHeaders, values.get("values")):
                    print(metric.get("name") + ": " + value)

def main():
    analytics = initialize_analyticsreporting()
    response = get_report(analytics)
    printResults(response)

if __name__ == '__main__':
    main()
I need to obtain the response as JSON with the given dimensions and metrics from Google Analytics.
For those running into this problem who wish to port to the newer auth libraries, do a diff between the two versions of the short, simple Google Drive API sample at the code repo for the G Suite APIs intro codelab to see what needs to be updated (and what can stay as-is). The bottom line is that the API client library code can remain the same while all you do is swap out the auth libraries underneath.
Note that sample is only for user account auth... for service account auth, the update is similar, but I don't have an example of that yet (working on one though... will update this once it's published).
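In the meantime, a minimal sketch of the service-account variant, assuming a service-account JSON key file and a google-api-python-client version that accepts credentials directly in build():
from apiclient.discovery import build
from google.oauth2 import service_account

SCOPES = ['https://www.googleapis.com/auth/analytics.readonly']
KEY_FILE = 'service-account.json'  # hypothetical path to your key file

# Load service-account credentials with the newer google-auth library...
credentials = service_account.Credentials.from_service_account_file(
    KEY_FILE, scopes=SCOPES)

# ...and hand them straight to build(); no Http object or oauth2client needed.
analytics = build('analyticsreporting', 'v4', credentials=credentials)
The rest of the quickstart code (get_report and so on) should work unchanged against this analytics object.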

Enforce parameters in a Google search (Python)

I created a Python script that allows me to search Google from a terminal.
from googleapiclient.discovery import build
import pprint

searchSubject = input("Research: ")
# What I'm looking for, like in a normal Google search
api_key = "my_api_key"
cse_id = "my_cse_id"

def google_search(search_term, api_key, cse_id, **kwargs):
    service = build("customsearch", "v1", developerKey=api_key)
    res = service.cse().list(q=search_term, cx=cse_id, **kwargs).execute()
    return res['items']

results = google_search(str(searchSubject), api_key, cse_id, num=10)
for result in results:
    pprint.pprint(result)
The thing is, I would like to force certain parameters into my search. I am looking for .onion websites, but when I type that, I get salad recipes. I would like the search to explicitly include .onion (the . is the key word here). I tried to find proper documentation on forcing parameters but didn't find any.
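One option, sketched here rather than taken from the thread: the cse().list() call accepts an exactTerms parameter that forces a phrase to appear in every returned document, which keeps the literal ".onion" from being dropped or reinterpreted (the query string below is just an example):
from googleapiclient.discovery import build

api_key = "my_api_key"
cse_id = "my_cse_id"

service = build("customsearch", "v1", developerKey=api_key)
# exactTerms requires every result to contain the phrase verbatim,
# so ".onion" is matched literally instead of as a plain word.
res = service.cse().list(q="hidden service", cx=cse_id,
                         exactTerms=".onion", num=10).execute()
for item in res.get('items', []):
    print(item['link'])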

Google Custom Search API - Inaccurate results

I'm trying to use the aforementioned API to get the number of Google search results for a query, but it's not giving me the correct results. I have my custom search engine configured to search the entire web. For example, if I search "hello" I get 113,000,000 results, yet if I do the same search on Google myself I get 1,340,000,000. I'm quite new to all this and feeling quite lost; can anybody please help? Thanks
from googleapiclient.discovery import build
import pprint

my_api_key = "API KEY"
my_cse_id = "CSE ID"

def main():
    service = build("customsearch", "v1", developerKey=my_api_key)
    res = service.cse().list(q='hello', cx=my_cse_id).execute()
    pprint.pprint(res)

if __name__ == '__main__':
    main()
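For what it's worth (an observation, not from the thread): the count in the API response lives under searchInformation.totalResults, and like the figure on the Google results page it is only an estimate, so the two are not expected to match exactly. Inside main() above it could be read like this:
# totalResults is a string holding a rough estimate, not an exact count.
total = res['searchInformation']['totalResults']
print(total)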

How to get more than 100,000 results in a response using the Google BigQuery Python API?

Currently, I use this script to query BigQuery using the Python API:
import argparse

from googleapiclient.discovery import build
from googleapiclient.errors import HttpError
from oauth2client.client import GoogleCredentials

credentials = GoogleCredentials.get_application_default()
bigquery_service = build('bigquery', 'v2', credentials=credentials)

def request(query):
    query_request = bigquery_service.jobs()
    query_data = {'query': query, 'timeoutMs': 100000}
    query_response = query_request.query(projectId=project, body=query_data).execute()
    return query_response

query = """
select domain
from
  [logs.compressed_v40_20170313]
limit 150000"""

respond = request(query)
I have got these results:
print respond['totalRows']  # total number of rows in the response
u'150000'
print len(respond['rows'])  # actual number of rows returned
100000
Question: how do I receive the remaining 50,000 rows?
To get more results after the first page of results, you need to call getQueryResults.
In your case, you'll need to get the Job ID and Page Token from the query response.
query_response = query_request.query(projectId=project, body=query_data).execute()

page_token = query_response['pageToken']
job_id = query_response['jobReference']['jobId']

next_page = bigquery_service.jobs().getQueryResults(
    projectId=project, jobId=job_id, pageToken=page_token).execute()
Continue this in a loop until you have all query results.
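A minimal sketch of that loop, assuming the query_response from above (pageToken is only present while more pages remain):
# Collect every row by following page tokens until none is returned.
rows = query_response.get('rows', [])
page_token = query_response.get('pageToken')
job_id = query_response['jobReference']['jobId']

while page_token:
    next_page = bigquery_service.jobs().getQueryResults(
        projectId=project, jobId=job_id, pageToken=page_token).execute()
    rows.extend(next_page.get('rows', []))
    page_token = next_page.get('pageToken')

print(len(rows))  # should now be the full 150,000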
Note: the call to query can time out, but the query will still be running in the background. We recommend you create an explicit Job ID and insert a job manually rather than using the query method.
See the "async" query sample. Note: that it is not quite the proper name, since this sample does wait for the query to finish.
