Fetching tables from BigQuery using a Service Account in JSON format - Python

I need to fetch tables from a dataset using a service account in JSON format.
I have got the list of datasets from one project using the following code snippet:
client = bigquery.Client.from_service_account_json(path)
datasets = list(client.list_datasets())
Now I need the list of all tables from a particular dataset.
I do not have IAM rights, so I'm using a service account.

I do not really understand what you mean when you say you use Service Account credentials because you do not have IAM rights. But I presume that you have a JSON-encoded key for a Service Account with the right permissions to access the data that you want to retrieve.
As of writing, the latest version of the BigQuery Python Client Library google.cloud.bigquery is 0.32.0. In this version of the library, you can use the list_tables() function to list the tables in a given dataset. To do so, you will have to pass as an argument to that function the reference to a dataset. Below I share a small code example that lists the datasets (and their nested tables) for the project to which the Service Account in use has access:
from google.cloud import bigquery

client = bigquery.Client.from_service_account_json("/path/to/key.json")

for dataset in client.list_datasets():
    print(" - {}".format(dataset.dataset_id))
    dataset_ref = client.dataset(dataset.dataset_id)
    for table in client.list_tables(dataset_ref):
        print("   - {}".format(table.table_id))
The output is something like:
 - dataset1
   - table1
   - table2
 - dataset2
   - table1
You can find more information about the Dataset class and the Table class in the documentation too. Also here you have the general page of the documentation for the BigQuery Python Client Library.
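For example, once you know a dataset and table ID, a small follow-up sketch with the same client (the IDs below are placeholders) could fetch one table and print its schema:
# Hypothetical follow-up: inspect the schema of a single table.
table_ref = client.dataset("dataset1").table("table1")  # placeholder IDs
table = client.get_table(table_ref)
for field in table.schema:
    print("{} ({})".format(field.name, field.field_type))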

Related

Google BigQuery query in Python works when using result(), but Permission issue when using to_dataframe()

I've run into a problem after upgrading my pip packages: my BigQuery connector that returns query results suddenly stopped working with the following error message.
from google.cloud import bigquery
from google.oauth2 import service_account

credentials = service_account.Credentials.from_service_account_file(
    'path/to/file',
    scopes=[
        'https://www.googleapis.com/auth/cloud-platform',
        'https://www.googleapis.com/auth/drive',
        'https://www.googleapis.com/auth/bigquery',
    ],
)
client = bigquery.Client(credentials=credentials)
data = client.query('select * from dataset.table').to_dataframe()

PermissionDenied: 403 request failed: the user does not have 'bigquery.readsessions.create' permission
But! If you switched the code to
data = client.query('select * from dataset.table').result()
(to_dataframe -> result), you received the data as a RowIterator and were able to read it properly.
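For reference, a minimal sketch of consuming that RowIterator (the table name is just a placeholder):
rows = client.query('select * from dataset.table').result()  # returns a RowIterator
for row in rows:
    # each Row can be read by index, by column name, or via .items()
    print(dict(row.items()))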
The same script using to_dataframe with the same credentials was working on the server, so I pinned my bigquery package to the same version, 2.28.0, which still did not help.
I could not find any advice on this error/topic anywhere, so I just want to share it in case any of you face the same thing.
Just set
create_bqstorage_client=False
from google.cloud import bigquery

client = bigquery.Client()
query_job = client.query(query)  # `query` is your SQL string
df = query_job.result().to_dataframe(create_bqstorage_client=False)
There are different ways of receiving data from BigQuery. Using the BQ Storage API is considered more efficient for larger result sets compared to the other options:
The BigQuery Storage Read API provides a third option that represents an improvement over prior options. When you use the Storage Read API, structured data is sent over the wire in a binary serialization format. This allows for additional parallelism among multiple consumers for a set of results
The Python BQ library internally determines whether it can use the BQ Storage API or not. For the result method, it uses the traditional tabledata.list method internally, whereas the to_dataframe method uses the BQ Storage API if the corresponding package is installed.
However, using the BQ Storage API requires you to have the BigQuery Read Session User role (i.e. the bigquery.readsessions.create permission), which in your case seems to be lacking.
With google-cloud-bigquery-storage uninstalled, the google-cloud-bigquery package falls back to the tabledata.list method. Hence, by uninstalling this package, you were working around the lack of rights.
See the BQ Python Library Documentation for details.
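If you want to confirm which code path applies in your environment, one quick check (the messages below are only illustrative) is whether the optional storage package can be imported at all:
# If this import succeeds, to_dataframe() will try the Storage Read API and
# therefore needs the bigquery.readsessions.create permission.
try:
    import google.cloud.bigquery_storage  # noqa: F401
    print("google-cloud-bigquery-storage is installed; the Storage Read API may be used")
except ImportError:
    print("not installed; the client falls back to tabledata.list")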
Resolution
Along with the google-cloud-bigquery package, I had also installed the google-cloud-bigquery-storage package. Once I uninstalled that one using
pip uninstall google-cloud-bigquery-storage
everything started working again! Unfortunately, the error message was not very straightforward, so it took some time to figure out :)

BigQuery cross-project access via Cloud Functions

Let's say I have two GCP projects, A and B, and I am the owner of both. When I use the UI, I can query BigQuery tables in project B from both projects. But I run into problems when I try to run a Cloud Function in project A from which I try to access a BigQuery table in project B. Specifically, I run into a 403 Access Denied: Table <>: User does not have permission to query table <>. I am a bit confused as to why I can't access the data in B and what I need to do. In my Cloud Function all I do is:
from google.cloud import bigquery
client = bigquery.Client()
query = client.query(<my-query>)
res = query.result()
The service account used to run the function exists in project A - how do I give it editor access to BigQuery in project B? (Or what else should I do?).
Basically, you have an issue with IAM permissions and roles on the service account used to run the function.
Granting the role bigquery.admin to your service account in project B would do the trick.
However, it may not be the most appropriate solution with regard to best practices. The link below provides a few scenarios with examples of the roles best suited to your case.
https://cloud.google.com/bigquery/docs/access-control-examples
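Once the service account from project A has been granted a suitable role in project B, a minimal sketch of the Cloud Function code (the project, dataset and table names below are placeholders) is to reference the table by its fully qualified name:
from google.cloud import bigquery

# Runs as the function's service account in project A; the query job is created
# in project A while the table itself lives in project B.
client = bigquery.Client()
query = "SELECT * FROM `project-b.some_dataset.some_table` LIMIT 10"
for row in client.query(query).result():
    print(dict(row.items()))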

Is it possible to limit a Google service account to specific BigQuery datasets within a project?

I've set up a service account using the GCP UI for a specific project Project X. Within Project X there are 3 datasets:
Dataset 1
Dataset 2
Dataset 3
If I assign the role BigQuery Admin to the service account on Project X, it is currently inherited by all 3 datasets.
In other words, all of these datasets inherit the permissions assigned to the service account at the project level. Is there any way to modify the permissions for the service account so that it only has access to specified datasets? E.g. allow access to Dataset 1 but not Dataset 2 or Dataset 3.
Is this type of configuration possible?
I've tried to add a condition in the UI, but when I use the Name resource type and set the value equal to Dataset 1, I'm not able to access any of the datasets. Presumably the value is not correct, or a dataset is not a valid Name resource.
UPDATE
Adding some more detail regarding what I'd already tried before posting, as well as some more detail on what I'm doing.
For my particular use case, I'm trying to perform SQL queries as well as modifying tables in BigQuery through the API (using Python).
Case A:
I create a service account with the role 'BigQuery Admin'.
This role is propagated to all datasets within the project; the permission is inherited and I cannot remove this service account role from any of the datasets.
In this case I'm able to query all datasets and tables using the Python API - as you'd expect.
Case B:
I create a service account with no default role.
No role is propagated and I can assign roles to specific datasets by clicking on the 'Share dataset' option in the UI to assign the 'BigQuery Admin' role to them.
In this case I'm not able to query any of the datasets or tables and get the following error if I try:
Forbidden: 403 POST https://bigquery.googleapis.com/bq/projects/project-x/jobs: Access Denied: Project X: User does not have bigquery.jobs.create permission in project Project X.
Even though the permissions required (bigquery.jobs.create in this case) exist for the dataset I want, I can't query the data as it appears that the bigquery.jobs.create permission is also required at a project level to use the API.
I'm posting the solution that I found to the problem in case it is useful to anyone else trying to accomplish the same.
Assign the role "BigQuery Job User" at a project level in order to have the permission bigquery.jobs.create assigned to the service account for that project.
You can then manually assign specific datasets the role of "BigQuery Data Editor" in order to query them through the API in Python. Do this by clicking on "Share dataset" in the BigQuery UI. So for this example, I've "Shared" Dataset 1 and Dataset 2 with the service account.
You should now be able to query the datasets for which you've assigned the BigQuery Data Editor role in Python.
However, for Dataset 3, for which the "BigQuery Data Editor" role has not been assigned, if you attempt to query a table this should return the error:
Forbidden: 403 Access Denied: Table Project-x:dataset_3.table_1: User does not have permission to query table Project-x:dataset_3.table_1.
As described above, we now have sufficient permissions to access the project but not the table within Dataset 3 - by design.
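If you would rather perform the "Share dataset" step programmatically instead of through the UI, a hedged sketch using the Python client (the dataset ID and service account e-mail below are placeholders) is to update the dataset's access entries:
from google.cloud import bigquery

client = bigquery.Client(project="project-x")
dataset = client.get_dataset("project-x.dataset_1")  # placeholder dataset ID

# Append an access entry for the service account; WRITER is roughly the
# dataset-level equivalent of BigQuery Data Editor.
entries = list(dataset.access_entries)
entries.append(
    bigquery.AccessEntry(
        role="WRITER",
        entity_type="userByEmail",
        entity_id="my-service-account@project-x.iam.gserviceaccount.com",
    )
)
dataset.access_entries = entries
dataset = client.update_dataset(dataset, ["access_entries"])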
As you can see here, you can grant access in your dataset to some entities, including service accounts:
Google account e-mail: Grants an individual Google account access to the dataset
Google Group: Grants all members of a Google group access to the dataset
Google Apps Domain: Grants all users and groups in a Google domain access to the dataset
Service account: Grants a service account access to the dataset
Anybody: Enter "allUsers" to grant access to the general public
All Google accounts: Enter "allAuthenticatedUsers" to grant access to any user signed in to a Google Account
I suggest that you create a service account without project-level BigQuery permissions and then grant it access to a specific dataset.
I hope this helps.
Please keep in mind that access to BigQuery can be granted at project level or dataset level.
The dataset is the lowest level at which you can assign permissions, so accounts can access all the resources in the dataset, e.g. tables, views, columns and rows. Permissions at the project level, as you have already noticed, are propagated (inherited) by all the datasets in the project.
Regarding your service account, by default Google Cloud assigns it an address like service_account_name@example.gserviceaccount.com, and during the process of sharing the dataset, as commented by @rmesteves, you will need this email address to grant it the desired permissions.
It seems that the steps you described ("Name resource type") are not the correct ones. In the BigQuery UI please try:
Click on the dataset name (e.g. Dataset1 in your example) you want to share.
Then, at the right of the screen you will see the option "Share Dataset"; click on it.
Follow the instructions to assign your service account a BigQuery role such as BigQuery Admin, BigQuery Data Owner, or BigQuery User, among others. Check the previous link to be aware of what kind of things these roles can perform.

Connecting to Azure Table storage from Azure Databricks

I am trying to connect to Azure Table storage from Databricks. I can't seem to find any resources that don't refer to blob containers, but I have tried modifying that approach for tables.
spark.conf.set(
    "fs.azure.account.key.accountname.table.core.windows.net",
    "accountkey")

blobDirectPath = "wasbs://accountname.table.core.windows.net/TableName"
df = spark.read.parquet(blobDirectPath)
For now I am assuming that the tables are Parquet files. I am currently getting authentication errors on this code.
According to my research, Azure Databricks does not support Azure Table storage as a data source. For more details, please refer to https://docs.azuredatabricks.net/spark/latest/data-sources/index.html.
Besides, if you still want to use table storage, you can use the Azure Cosmos DB Table API. But they have some differences. For more details, please refer to https://learn.microsoft.com/en-us/azure/cosmos-db/faq#where-is-table-api-not-identical-with-azure-table-storage-behavior.
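If you only need the data rather than a native Spark data source, one possible workaround (a sketch assuming the azure-data-tables package is available; the account, key and table names are placeholders) is to read the entities with the Table storage SDK and build a Spark DataFrame from them:
from azure.data.tables import TableServiceClient

# Placeholder connection string and table name; entities come back as dict-like objects.
conn_str = (
    "DefaultEndpointsProtocol=https;AccountName=accountname;"
    "AccountKey=accountkey;EndpointSuffix=core.windows.net"
)
table_client = TableServiceClient.from_connection_string(conn_str).get_table_client("TableName")
entities = [dict(e) for e in table_client.list_entities()]

# Assumes this runs in a Databricks notebook where a SparkSession named `spark` already exists.
df = spark.createDataFrame(entities)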

Is it possible to connect and query a BigQuery table from Google App-Engine (python) without OAuth2 authentication dialog?

I'm working on a Google App Engine project which stores around 100K entities in the Datastore. Since I have to search within the string properties of those entities, I need an effective way to do it.
After some research I found Google's BigQuery service, which looks perfect for me. I already imported the entities into BigQuery via the web interface, but I cannot connect and run a query on BigQuery from the App Engine code.
My App-Engine project has no web interface. It generates only JSON outputs which are consumed by mobile applications.
So, my question is this: is it possible to connect and run a query from the App-Engine python code without the OAuth2 authentication dialog?
Yes. Simply use what's known as a "service account" as described here. Then, some simple Python code once you've exported GOOGLE_APPLICATION_CREDENTIALS to point to the credential file:
from google.cloud import bigquery

client = bigquery.Client(project='PROJECT_ID')

for dataset in client.list_datasets():
    do_something_with(dataset)
More info here too.
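If you prefer not to rely on the GOOGLE_APPLICATION_CREDENTIALS environment variable, a hedged alternative (the key path and project ID are placeholders) is to load the service account key explicitly, as in the first answer above:
from google.cloud import bigquery

# Load the key file directly instead of using the environment variable.
client = bigquery.Client.from_service_account_json("/path/to/key.json", project="PROJECT_ID")
for dataset in client.list_datasets():
    do_something_with(dataset)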
Just a quick caution: this is case sensitive, so check the capitalisation of 'Client'.
my_bigquery_client = bigquery.client(project='my_project')
This fails with the error "TypeError: 'module' object is not callable".
my_bigquery_client = bigquery.Client(project='my_project')
This works.
