BigQuery vs Custom Search for High-throughput Google (Scholar) Searches? - python

For a list of ~30 thousand keywords, I'm trying to find out how many Google search hits exist for each keyword, similar to this website but on a larger scale: http://azich.org/google/.
I am using Python to query and was originally planning to use pygoogle. Unfortunately, Google has a limit of ~100 searches a day for a free account. I am willing to use a paid service, but I am not sure which Google service makes more sense - BigQuery or Custom Search. BigQuery seems to be for searches over a dataset you provide, whereas Custom Search seems to be a website-search solution for a small "slice" of the internet.
Would someone refer me to the appropriate service that will allow me to perform the above task? It doesn't need to be a free service - I am willing to pay.
Two more things, if possible: I'd like the searches to be from Google Scholar, but this is not necessary. Second, I'd like to save the text from the front page, such as the blurb from each search result, so I can text-mine the front-page results.

BigQuery is not a tool for interacting with Google Search in any way. BigQuery is a tool you feed your own data into and then run analytical queries over; you first need to ingest that data.
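For the hit counts and front-page snippets described in the question, the Custom Search JSON API is presumably the closer fit of the two. A minimal sketch, assuming you have an API key and a Programmable Search Engine ID (cx) configured to search the whole web (note that Custom Search does not cover Google Scholar):

```python
import requests

API_KEY = "YOUR_API_KEY"  # from the Google Cloud console
CX = "YOUR_ENGINE_ID"     # a Programmable Search Engine set to search the entire web

def hits_and_snippets(keyword):
    """Return the estimated total hit count and first-page snippets for a keyword."""
    resp = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={"key": API_KEY, "cx": CX, "q": keyword},
        timeout=30,
    )
    resp.raise_for_status()
    data = resp.json()
    total = int(data["searchInformation"]["totalResults"])
    snippets = [item.get("snippet", "") for item in data.get("items", [])]
    return total, snippets

total, snippets = hits_and_snippets("example keyword")
print(total, snippets[:2])
```

Note that the paid tier is also capped per day, so 30k keywords may need to be spread over multiple days.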

Related

Scraping vs Google Trends API using Python

I'm trying to collect the top five search queries for each trend for the past year by category on Google Trends.
I don't know if I should do this using a Python library such as pytrends, which per its docs requires a keyword to query GT - but I don't have any specific keyword; I want to fetch every search query for a term in every category that can be found.
Or should I use a scraping library such as Selenium or BeautifulSoup4 to collect this information directly from the GT website?
The goal of this is to be able to retrieve the top 5 websites for each query later ...
Which direction should I take?
It is better to use one of the unofficial APIs.
These connect to the internal Google APIs that power the Trends UI and return structured information. Scraping, by contrast, would mostly return unstructured HTML, and you would need to extract the structured data yourself; that information will not be as reliable or as complete.
It is the difference between talking to an API intended for machine-to-machine communication versus a web UI intended for machine-to-human interaction.
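A minimal sketch with pytrends, one of those unofficial libraries. As the asker noted, it does need a seed keyword per payload, so "python" below is a stand-in for whatever terms you iterate over:

```python
from pytrends.request import TrendReq

# Connect to the unofficial endpoints that power the Google Trends UI.
pytrends = TrendReq(hl="en-US", tz=360)

# pytrends requires a seed keyword per payload; cat=0 means "All categories".
pytrends.build_payload(["python"], cat=0, timeframe="today 12-m", geo="")

# Returns a dict keyed by keyword, each entry holding "top" and "rising" DataFrames.
related = pytrends.related_queries()
print(related["python"]["top"].head(5))  # top five related queries, past year
```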

What is a search query in Google Custom Search Engine?

This question has nothing to do with technical help; however, I need to understand what a search query is under the Google Custom Search API. If I am not mistaken, a search query is what I type into the Google search box, isn't it?
If so, under the Google Custom Search API it is said that I can make 100 queries a day. Keeping that in mind, I was being cautious in making queries, and the total came to 54 queries.
After 54 queries, I received the error below. The error says: This API requires billing to be enabled on the project. Visit https://console.developers.google.com/billing?project=236852110619 to enable billing. Why is that?
Does that mean that, after enabling billing, I can still use the 46 queries that belong to the free quota?
Quota for this API seems to be pro-rated if you start in the middle of a day. Are you able to get the full 100 queries the next day?

How to connect Superset to external APIs like Google Analytics?

I would like to show Google Analytics and Google Search Console data directly in Superset through their APIs:
Make direct queries to the Google Analytics API in JSON (instead of storing the results in my database and then showing them in Superset) and show the result in Superset
Make direct queries to the Google Search Console API in JSON and show the result in Superset
Make direct queries to other amazing JSON APIs and show the result in Superset
How can I do so?
I couldn't find a Google Analytics datasource. I couldn't find a Google Search Console datasource either.
I can't find a way to display in Superset data retrieved from an API, only data stored in a database. I must be missing something, but I can't find anything in the docs related to authenticating & querying external APIs.
Superset can't query external data APIs directly. Superset has to work with a supported database or data engine (https://superset.incubator.apache.org/installation.html#database-dependencies). This means you need to find a way to fetch data out of the API and store it in a supported database / data engine. Some options:
Build a little Python pipeline that queries the data API, flattens the data into something tabular / relational, uploads it to one of the supported data sources linked above, and set up Superset so it can talk to that database / data engine (a sketch of such a pipeline follows this answer).
For more robust setups, you may want to work with a devops / infrastructure person to stand up a workflow scheduler like Apache Airflow (https://airflow.apache.org/) that regularly pings the API and stores the results in a database Superset can talk to.
If you want to regularly query data from a popular 3rd party API, I also recommend checking out Meltano and learning more about Singer taps. These will handle some of the heavy lifting of fetching data from an API regularly and storing it in a database like Postgres. The good news is that there's a Singer tap for Google Analytics - https://github.com/singer-io/tap-google-analytics
Either way, Superset is just a thin layer above your database / data engine. So there’s no way around the reality that you need to find a way to extract data out of an API and store it in a compatible data source.
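A minimal sketch of the first option. The API URL is a hypothetical stand-in for the Google Analytics / Search Console APIs (both of which additionally require OAuth, omitted here), and the Postgres connection string is an assumption about your setup:

```python
import pandas as pd
import requests
from sqlalchemy import create_engine

# Hypothetical JSON endpoint standing in for the real GA / GSC APIs.
API_URL = "https://api.example.com/report"

def load_report_into_postgres():
    rows = requests.get(API_URL, timeout=30).json()

    # Flatten nested JSON into a tabular shape Superset can chart.
    df = pd.json_normalize(rows)

    # Any database Superset supports will do; Postgres is a common choice.
    engine = create_engine("postgresql://superset:superset@localhost/analytics")
    df.to_sql("ga_report", engine, if_exists="replace", index=False)

if __name__ == "__main__":
    load_report_into_postgres()  # run on a schedule (cron, Airflow, ...)
```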
There is also the project shillelagh by one of Superset's contributors. It gives a SQL interface to REST APIs; the same package is used in Apache Superset to connect with gsheets.
New adapters are relatively easy to implement. There's a step-by-step tutorial that explains how to create a new adapter to an API or filetype in shillelagh.
Under the hood, shillelagh uses SQLite virtual tables via APSW, a SQLite wrapper.
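A minimal usage sketch, following shillelagh's documented DB API: with the right adapter installed, a URL can be queried as if it were a table (the spreadsheet URL below is a placeholder):

```python
from shillelagh.backends.apsw.db import connect

# shillelagh exposes a standard DB API connection backed by SQLite + APSW.
connection = connect(":memory:")
cursor = connection.cursor()

# Placeholder URL; any resource a shillelagh adapter understands works here.
SQL = 'SELECT * FROM "https://docs.google.com/spreadsheets/d/SOME_SHEET_ID/edit" LIMIT 5'
for row in cursor.execute(SQL):
    print(row)
```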
Redash is an alternative to Superset for that task, though it doesn't have the same features. Here is a comparison of the integrations available in both tools: https://discuss.redash.io/t/a-comparison-of-redash-and-superset/1503
A quick alternative is paying for a third party service like: https://www.stitchdata.com/integrations/google-analytics/superset/
There is no such connector available by default.
A recommended solution would be to store your Google Analytics and Search Console data in a database; you could write a script that pulls data every 4 hours or whatever interval works for you.
Also, you shouldn't store all the data, only the dimensions/metrics you wish to see in your reports.

How to extract information (citation, h-index, currently working institution etc) about all professors in a specific field from Google scholar?

I want to compare different information (citations, h-index, etc.) about professors in a specific field across different institutions all over the world using data-mining and analysis techniques. But I have no idea how to extract these data for hundreds (or even thousands) of professors, since Google does not provide an official API for it. So I am wondering whether there are any other ways to do that.
This Google Code tool will calculate an individual h-index. If you only do this on demand for a limited number of people in a particular field, you will not break the terms of use - they don't specifically refer to limits on access, but they do refer to disruption of service (e.g. bulk requests may potentially cause this). The export questions in the Scholar FAQ state:
I wrote a program to download lots of search results, but you blocked my computer from accessing Google Scholar. Can you raise the limit?
Err, no, please respect our robots.txt when you access Google Scholar using automated software. As the wearers of crawler's shoes and webmaster's hat, we cannot recommend adherence to web standards highly enough.
Web of Science does have an API available and a collaboration agreement with Google Scholar, but Web of Science access is only granted to certain individuals.
A solution could be to use the user's Web of Science credentials (or your own) to return the information on demand - perhaps just for the top people in the field - and then store it as you planned. Google Scholar only updates a few times per week, so this would not be excessive use.
The other option is to request permission from Google, which is mentioned in the terms of use, although that seems unlikely to be granted.
I've done a project exactly on this.
You provide the script an input text file with the names of the professors you'd like to retrieve information about, and the script crawls Google Scholar and extracts the info you are interested in.
The project also provides functionality for automatically downloading the profile pictures of the researchers/professors.
To respect the constraints imposed by the portal, you can set a delay between requests. If you have >1k profiles to crawl it might take a while, but it works.
A concurrency-enabled script has also been implemented, and it runs much faster than the basic sequential approach.
Note: to specify the information you need, you have to know either the id or the class name of the HTML elements Google Scholar generates (see the sketch below).
Good luck!
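To illustrate that last note, here is a minimal scraping sketch. The "gsc_rsb_std" class name reflects Scholar's profile-page markup at the time of writing and may change without notice, and heavy automated access is against the terms of use quoted above, so treat this as on-demand use only:

```python
import time
import requests
from bs4 import BeautifulSoup

HEADERS = {"User-Agent": "Mozilla/5.0"}  # Scholar rejects the default Python UA

def profile_stats(user_id, delay=5.0):
    """Fetch citation count, h-index, and i10-index from a Scholar profile page."""
    url = f"https://scholar.google.com/citations?user={user_id}"
    html = requests.get(url, headers=HEADERS, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")

    # The stats table cells alternate "all time" / "since <year>" values;
    # the class name is an assumption that may need updating if parsing fails.
    cells = [td.text for td in soup.select("td.gsc_rsb_std")]

    time.sleep(delay)  # be polite: space out requests to avoid being blocked
    return {"citations": cells[0], "h_index": cells[2], "i10_index": cells[4]}
```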

Query Events from Multiple Google Calendars in Single Batch

Is there a way to query for events in multiple calendars (in the same Google account) in a single batch request?
I've been through the Google documentation here, but it hasn't really helped.
What I'm trying to do really is scan through a given user's calendars and get a list of events for each one.
An example in python/gdata would be amazing.
EDIT: Looks like this answers my question. TL;DR not possible.
Short answer: No
Long answer: The API does not allow you to fetch more than one calendar with a single request and it doesn't allow you to manipulate more than one calendar in a single request.
If this is really important to you (you're trying to reduce requests, so I guess it's about performance), you could possibly use Google App Engine to create a function that performs this work for you. I'm not sure you would see a big jump in performance, though.
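The standard workaround is one calendar-list call followed by one events call per calendar. The question asked for python/gdata, which is long deprecated, so here is a sketch with the current google-api-python-client instead, assuming creds is an already-authorized OAuth credentials object:

```python
from googleapiclient.discovery import build

# "creds" is an authorized google.oauth2 credentials object (OAuth flow omitted).
service = build("calendar", "v3", credentials=creds)

# One request to list the user's calendars...
calendars = service.calendarList().list().execute().get("items", [])

# ...then one events request per calendar, since the API offers no
# single call that fetches events across multiple calendars at once.
all_events = {}
for cal in calendars:
    events = service.events().list(
        calendarId=cal["id"], maxResults=50, singleEvents=True
    ).execute()
    all_events[cal["summary"]] = events.get("items", [])
```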
