Query public datasets in BigQuery using Python (PyCharm)

I want to retrieve data from Google BigQuery, but user authentication is failing for me. This is the message I am getting:
"Access Denied: Project bigquery-public-data: The user sajith-sudhi#adroit-marking-183823.iam.gserviceaccount.com does not have bigquery.jobs.create permission in project bigquery-public-data"
Here is the code:
from google.cloud import bigquery
from google.oauth2 import service_account

credentials = service_account.Credentials.from_service_account_file(
    'NYCTaxi-c81bd00c9864.json')
project_id = 'bigquery-public-data'
client = bigquery.Client(credentials=credentials, project=project_id)

query_job = client.query("""
    SELECT *
    FROM new_york.tlc_yellow_trips_2016
    LIMIT 1000""")
results = query_job.result()  # Waits for job to complete.

In the line below you should set your own project:
project_id = 'bigquery-public-data'
so it becomes:
project_id = 'your_project'
And in the query itself you should qualify the table with the public project, as below:
query_job = client.query("""
    SELECT *
    FROM `bigquery-public-data.new_york.tlc_yellow_trips_2016`
    LIMIT 1000""")
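Putting both changes together, a minimal corrected sketch might look like this (assuming the same service-account key file as in the question and a placeholder project ID, your_project, in which the service account is allowed to create jobs):

from google.cloud import bigquery
from google.oauth2 import service_account

credentials = service_account.Credentials.from_service_account_file(
    'NYCTaxi-c81bd00c9864.json')

# Bill the job to your own project, not bigquery-public-data
project_id = 'your_project'  # placeholder
client = bigquery.Client(credentials=credentials, project=project_id)

# Fully qualify the public table inside the query
query_job = client.query("""
    SELECT *
    FROM `bigquery-public-data.new_york.tlc_yellow_trips_2016`
    LIMIT 1000""")

for row in query_job.result():  # waits for the job to complete
    print(row)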

Related

Python BigQuery create temp table

When I create a temp table via Python, an error is thrown:
400 Use of CREATE TEMPORARY TABLE requires a script or session
How can I create a session?
from google.colab import auth
from google.cloud import bigquery
from google.colab import data_table

client = bigquery.Client(project=project, location=location)
client.query('''
    create temp table t_acquisted_users as
    select *
    from table_a
    limit 10
''').result()
You can create a session through the BigQuery API by using the create_session parameter in a job config, for example:
job_config = bigquery.QueryJobConfig(create_session=True)
There are more details in this excellent article:
https://dev.to/stack-labs/bigquery-transactions-over-multiple-queries-with-sessions-2ll5
That's how I fixed it quickly; awaiting a better answer from others.
# create session
client0 = bigquery.Client(project=project, location=location)
job = client0.query(
    "SELECT 1;",  # a query that can't fail
    job_config=bigquery.QueryJobConfig(create_session=True)
)
session_id = job.session_info.session_id
job.result()

# set default session
client = bigquery.Client(
    project=project, location=location,
    default_query_job_config=bigquery.QueryJobConfig(
        connection_properties=[
            bigquery.query.ConnectionProperty(
                key="session_id", value=session_id
            )
        ]
    ))
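With the session-scoped client in place, the original temp-table statement from the question should then run. A minimal sketch, assuming the same table_a and the client created above:

client.query('''
    create temp table t_acquisted_users as
    select *
    from table_a
    limit 10
''').result()

The temp table only exists for the lifetime of the session created above, so any later queries against it must go through the same session-scoped client.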

How to move blob data to Snowflake through Python

I am trying to move data from an ADLS blob to a Snowflake table.
I am able to do the same with the UI.
Steps followed for the UI:
Generated the following SAS token:
sp=rl&st=2021-06-01T05:45:37Z&se=2021-06-01T13:45:37Z&spr=https&sv=2020-02-10&sr=c&sig=rYYY4o%2YY3jj%2XXXXXAB%2Bo8ygrtyAVCnPOxomlOc%3D
I am able to load the table with the above token in the Snowflake web UI:
copy into FIRST_LEVEL.MOVIES
from 'azure://adlsedmadifpoc.blob.core.windows.net/airflow-dif/raw-area/'
credentials=(azure_sas_token='sp=rl&st=2021-06-01T05:45:37Z&se=2021-06-01T13:45:37Z&spr=https&sv=2020-02-10&sr=c&sig=rYYY4o%2YY3jj%2XXXXXAB%2Bo8ygrtyAVCnPOxomlOc%3D')
FORCE = TRUE file_format = (TYPE = CSV);
I am trying to do the same with Python:
from azure.storage.blob import BlobServiceClient, generate_blob_sas, BlobSasPermissions
from datetime import datetime, timedelta
import snowflake.connector

def generate_sas_token(file_name):
    sas = generate_blob_sas(account_name="xxxx",
                            account_key="p5V2GELxxxxQ4tVgLdj9inKwwYWlAnYpKtGHAg==",
                            container_name="airflow-dif",
                            blob_name=file_name,
                            permission=BlobSasPermissions(read=True),
                            expiry=datetime.utcnow() + timedelta(hours=2))
    print(sas)
    return sas

sas = generate_sas_token("raw-area/moviesDB.csv")

# Connection string
conn = snowflake.connector.connect(user='xx', password='xx#123', account='xx.southeast-asia.azure', database='xx')
# Create cursor
cur = conn.cursor()
# Execute SQL statements
cur.execute(
    f"copy into FIRST_LEVEL.MOVIES FROM 'azure://xxx.blob.core.windows.net/airflow-dif/raw-area/moviesDB.csv' credentials=(azure_sas_token='{sas}') file_format = (TYPE = CSV) ;")
cur.execute("Commit ;")
cur.close()
conn.close()
SAS token generated by the code:
se=2021-06-01T07%3A42%3A11Z&sp=rt&sv=2020-06-12&sr=b&sig=ZhZMPSI%yyyyAPTqqE0%3D
I am unable to include the List permission while generating the SAS token through Python.
I am facing the below error:
snowflake.connector.errors.ProgrammingError: 091003 (22000): Failure using stage area. Cause: [Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature. (Status Code: 403; Error Code: AuthenticationFailed)]
I might have a list of CSV files in that folder in the future.
Any help appreciated. Thanks.
The following code worked:
from azure.storage.blob import generate_container_sas, ContainerSasPermissions
from datetime import datetime, timedelta
import snowflake.connector

def get_sas_token():
    container_sas_token = generate_container_sas(
        account_name='XX',
        account_key='p5V2GEL3AqGuPMMYXXXQ4tVgLdj9inKwwYWlAnYpKtGHAg==',
        container_name='airflow-dif',
        permission=ContainerSasPermissions(read=True, list=True),
        expiry=datetime.utcnow() + timedelta(hours=1)
    )
    print(container_sas_token)
    return container_sas_token

sas = get_sas_token()

# Connection string
conn = snowflake.connector.connect(user='XX', password='XX#123', account='XX.southeast-asia.azure', database='XX')
# Create cursor
cur = conn.cursor()
# Execute SQL statements
cur.execute(
    f"copy into FIRST_LEVEL.MOVIES FROM 'azure://XX.blob.core.windows.net/airflow-dif/raw-area/' credentials=(azure_sas_token='{sas}') FORCE = TRUE file_format = (TYPE = CSV) ;")
print(cur.fetchone())
cur.execute("Commit ;")
cur.close()
conn.close()
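As an optional check before closing the cursor and connection, you could count the rows that landed in the target table; a small sketch, assuming the same cur and the FIRST_LEVEL.MOVIES table from above:

# Verify the load (run before cur.close()/conn.close())
cur.execute("select count(*) from FIRST_LEVEL.MOVIES;")
print(cur.fetchone())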
Thank you Gaurav for your inputs.

Unable to read data from AWS Glue Database/Tables using Python

My requirement is to use a Python script to read data from an AWS Glue database into a dataframe. When I researched, I found the library "awswrangler". I'm using the below code to connect and read data:
import awswrangler as wr
import boto3

profile_name = 'aws_profile_dev'
REGION = 'us-east-1'

# Retrieving credentials to connect to AWS
ACCESS_KEY_ID, SECRET_ACCESS_KEY, SESSION_TOKEN = get_profile_credentials(profile_name)

session = boto3.session.Session(
    aws_access_key_id=ACCESS_KEY_ID,
    aws_secret_access_key=SECRET_ACCESS_KEY,
    aws_session_token=SESSION_TOKEN
)

my_df = wr.athena.read_sql_table(table='mytable_1', database='shared_db', boto3_session=session)
However, when I'm running the above code, I'm getting the following error - "ValueError: year 0 is out of range"
Alternatively, I tried using another library - "pyathena". The code I'm trying to use is:
from pyathena import connect
import pandas as pd

conn = connect(aws_access_key_id=ACCESS_KEY_ID,
               aws_secret_access_key=SECRET_ACCESS_KEY,
               aws_session_token=SESSION_TOKEN,
               s3_staging_dir='s3://my-sample-bucket/',
               region_name='us-east-1')
df = pd.read_sql("select * from AwsDataCatalog.shared_db.mytable_1 limit 1000", conn)
Using this, I'm able to retrieve data, but it only works if I use a limit. That is, if I run the query without a limit, i.e. "select * from AwsDataCatalog.shared_db.mytable_1", it gives the error: ValueError: year 0 is out of range
Weird behavior - for example, if I run:
df = pd.read_sql("select * from AwsDataCatalog.shared_db.mytable_1 limit 1200", conn)
it sometimes gives the same error; if I simply reduce the limit value and run it (for example with limit 1199), and later run it again with limit 1200, it works. But this doesn't work if I'm trying to read more than ~1300 rows. I have a total of 2002 rows in the table, and I need to read the entire table.
Please help! Thank you!
Use the following code in Python to get the data you are looking for.
import boto3

query = "SELECT * from table_name"
DATABASE = 'database_name'
output = 's3://output-bucket/output-folder'

s3_resource = boto3.resource("s3")
s3_client = boto3.client('s3')
athena_client = boto3.client('athena')

# Execution
response = athena_client.start_query_execution(
    QueryString=query,
    QueryExecutionContext={
        'Database': DATABASE
    },
    ResultConfiguration={
        'OutputLocation': output,
    }
)
queryId = response['QueryExecutionId']
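start_query_execution only submits the query and returns an execution ID. A hedged sketch of how you might wait for completion and read the rows back with the same boto3 Athena client (the polling interval and row handling here are illustrative choices, not part of the original answer):

import time

# Poll until the query reaches a terminal state
while True:
    status = athena_client.get_query_execution(QueryExecutionId=queryId)
    state = status['QueryExecution']['Status']['State']
    if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
        break
    time.sleep(2)

if state == 'SUCCEEDED':
    # Fetch the first page of results; paginate with NextToken for larger result sets
    results = athena_client.get_query_results(QueryExecutionId=queryId)
    for row in results['ResultSet']['Rows']:
        print([col.get('VarCharValue') for col in row['Data']])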
I have found a way, using awswrangler, to query data directly from Athena into a pandas dataframe on your local machine. This doesn't require us to provide an output location on S3.
import boto3
import awswrangler as wr

profile_name = 'Dev-AWS'
REGION = 'us-east-1'

# This automatically retrieves credentials from your AWS credentials file after you run `aws configure` on the command line
ACCESS_KEY_ID, SECRET_ACCESS_KEY, SESSION_TOKEN = get_profile_credentials(profile_name)

session = boto3.session.Session(
    aws_access_key_id=ACCESS_KEY_ID,
    aws_secret_access_key=SECRET_ACCESS_KEY,
    aws_session_token=SESSION_TOKEN
)

wr.athena.read_sql_query("select * from table_name", database="db_name", boto3_session=session)
Alternatively, if you don't want to query Athena but want to read an entire Glue table, you can use:
my_df = wr.athena.read_sql_table(table='my_table', database='my_db', boto3_session=session)
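If the default read path keeps failing on the full table, awswrangler also exposes a ctas_approach flag on read_sql_query. This is only a hedged sketch of that option (it assumes your installed awswrangler version supports the parameter, and it is not a confirmed fix for the "year 0 is out of range" error):

# ctas_approach=False makes awswrangler fetch results through the regular Athena API
# instead of a CTAS temp table; behaviour may differ across awswrangler versions.
my_df = wr.athena.read_sql_query(
    "select * from mytable_1",
    database="shared_db",
    ctas_approach=False,
    boto3_session=session,
)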

bigquery, extract_table AttributeError: 'Client' object has no attribute 'dataset'

My question is about code to extract a table from BigQuery and save it as a JSON file.
I made my code mostly by following the gcloud tutorials in their documentation.
I couldn't set my credentials implicitly, so I did it explicitly, pointing to my JSON key file. But it seems that the path I took doesn't quite get the right "Client" object.
If anyone could clarify how the whole implicit and explicit credentials thing works, that would help me a lot too!
I am using Python 2.7 and PyCharm. The code is as follows:
from gcloud import bigquery
from google.cloud import storage

def bigquery_get_rows():
    json_key = "path/to/my/json_file.json"
    storage_client = storage.Client.from_service_account_json(json_key)
    print("\nGot the client\n")

    # Make an authenticated API request
    buckets = list(storage_client.list_buckets())
    print(buckets)
    print(storage_client)

    # Setting up the environment
    bucket_name = 'my_bucket/name'
    print(bucket_name)
    destination_uri = 'gs://{}/{}'.format(bucket_name, 'my_table_json_name.json')
    print(destination_uri)

    # dataset_ref = client.dataset('samples', project='my_project_name')
    dataset_ref = storage_client.dataset('my_dataset_name', project='my_project_id')
    print(dataset_ref)
    table_ref = dataset_ref.table('my_table_to_be_extracted_name')
    print(table_ref)

    job_config = bigquery.job.ExtractJobConfig()
    job_config.destination_format = (
        bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON)

    extract_job = client.extract_table(
        table_ref, destination_uri, job_config=job_config)  # API request
    extract_job.result()  # Waits for job to complete.

bigquery_get_rows()
You are using the wrong client object: you are trying to use the Cloud Storage client to work with BigQuery.
Instead of
dataset_ref = storage_client.dataset('my_dataset_name', project='my_project_id')
it should be:
bq_client = bigquery.Client.from_service_account_json(
    'path/to/service_account.json')
dataset_ref = bq_client.dataset('my_dataset_name', project='my_project_id')
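Putting it together, a minimal corrected sketch of the extract job (same placeholder names as the question; this assumes a client-library version where Client.dataset() is still available, since it was removed in newer releases of google-cloud-bigquery):

from google.cloud import bigquery

# Use a BigQuery client, not a Cloud Storage client, for dataset/table refs
bq_client = bigquery.Client.from_service_account_json(
    'path/to/service_account.json')

dataset_ref = bq_client.dataset('my_dataset_name', project='my_project_id')
table_ref = dataset_ref.table('my_table_to_be_extracted_name')

destination_uri = 'gs://{}/{}'.format('my_bucket_name', 'my_table_json_name.json')

job_config = bigquery.job.ExtractJobConfig()
job_config.destination_format = bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON

extract_job = bq_client.extract_table(
    table_ref, destination_uri, job_config=job_config)  # API request
extract_job.result()  # Waits for the job to complete.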

How to run a BigQuery query in Python

This is the query that I have been running in BigQuery and that I want to run in my Python script. How would I change it, or what do I have to add, for it to run in Python?
#standardSQL
SELECT
Serial,
MAX(createdAt) AS Latest_Use,
SUM(ConnectionTime/3600) as Total_Hours,
COUNT(DISTINCT DeviceID) AS Devices_Connected
FROM `dataworks-356fa.FirebaseArchive.testf`
WHERE Model = "BlueBox-pH"
GROUP BY Serial
ORDER BY Serial
LIMIT 1000;
From what I have been researching, it seems that I can't save this query as a permanent table using Python. Is that true? And if it is true, is it still possible to export a temporary table?
You need to use the BigQuery Python client library; then something like this should get you up and running:
from google.cloud import bigquery

client = bigquery.Client(project='PROJECT_ID')

query = "SELECT...."
dataset = client.dataset('dataset')
table = dataset.table(name='table')
job = client.run_async_query('my-job', query)
job.destination = table
job.write_disposition = 'WRITE_TRUNCATE'
job.begin()
https://googlecloudplatform.github.io/google-cloud-python/stable/bigquery-usage.html
See the current BigQuery Python client tutorial.
Here is another way using a JSON file for the service account:
>>> from google.cloud import bigquery
>>>
>>> CREDS = 'test_service_account.json'
>>> client = bigquery.Client.from_service_account_json(json_credentials_path=CREDS)
>>> job = client.query('select * from dataset1.mytable')
>>> for row in job.result():
...     print(row)
This is a good usage guide:
https://googleapis.github.io/google-cloud-python/latest/bigquery/usage/index.html
To simply run and write a query:
# from google.cloud import bigquery
# client = bigquery.Client()
# dataset_id = 'your_dataset_id'

job_config = bigquery.QueryJobConfig()

# Set the destination table
table_ref = client.dataset(dataset_id).table("your_table_id")
job_config.destination = table_ref

sql = """
    SELECT corpus
    FROM `bigquery-public-data.samples.shakespeare`
    GROUP BY corpus;
"""

# Start the query, passing in the extra configuration.
query_job = client.query(
    sql,
    # Location must match that of the dataset(s) referenced in the query
    # and of the destination table.
    location="US",
    job_config=job_config,
)  # API request - starts the query

query_job.result()  # Waits for the query to finish
print("Query results loaded to table {}".format(table_ref.path))
I personally prefer querying using pandas:
# BQ authentication
import pydata_google_auth
import pandas as pd

SCOPES = [
    'https://www.googleapis.com/auth/cloud-platform',
    'https://www.googleapis.com/auth/drive',
]

credentials = pydata_google_auth.get_user_credentials(
    SCOPES,
    # Set auth_local_webserver to True to have a slightly more convenient
    # authorization flow. Note, this doesn't work if you're running from a
    # notebook on a remote server, such as over SSH or with Google Colab.
    auth_local_webserver=True,
)

query = "SELECT * FROM my_table"
data = pd.read_gbq(query, project_id=MY_PROJECT_ID, credentials=credentials, dialect='standard')
The pythonbq package is very simple to use and a great place to start. It uses python-gbq.
To get started you would need to generate a BigQuery JSON key for external app access; you can generate a service-account key in the Google Cloud console.
Your code would look something like:
from pythonbq import pythonbq

myProject = pythonbq(
    bq_key_path='path/to/bq/key.json',
    project_id='myGoogleProjectID'
)

SQL_CODE = """
SELECT
  Serial,
  MAX(createdAt) AS Latest_Use,
  SUM(ConnectionTime/3600) as Total_Hours,
  COUNT(DISTINCT DeviceID) AS Devices_Connected
FROM `dataworks-356fa.FirebaseArchive.testf`
WHERE Model = "BlueBox-pH"
GROUP BY Serial
ORDER BY Serial
LIMIT 1000;
"""
output = myProject.query(sql=SQL_CODE)
