Create multiple views in shell script or BigQuery - Python

When I export data from MySQL to BigQuery, some of the data gets duplicated. As a way to fix this, I thought of creating views of these tables using ROW_NUMBER(); the query to do this is shown below. The problem is that a lot of tables in my dataset are duplicated, and when I add new tables and export them to BigQuery they will likely have duplicated data too. I don't want to write this kind of query every time I add a new table to my dataset (I want a view to be created for a table at the moment I export it). Is it possible to do this in a loop in the query (like 'for each table in my dataset, do this')? Is it possible to do it in a shell script (when a table is exported to BigQuery, create a view for it)? As a last resort, is it possible to do it in Python?
SELECT
  * EXCEPT (row_num)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) AS row_num
  FROM dataset1.table1
)
WHERE row_num = 1

This can definitely be done in Python.
I would recommend using the google-cloud-python library: https://github.com/GoogleCloudPlatform/google-cloud-python
Your script should look something like this:
from google.cloud import bigquery

client = bigquery.Client()
dataset_ref = client.dataset('dataset_name')
tables = list(client.list_tables(dataset_ref))

for tab in tables:
    # create a view named v_<table> for every table in the dataset
    view = bigquery.Table(dataset_ref.table("v_{}".format(tab.table_id)))
    view.view_query = "SELECT * FROM `my-project.dataset_name.{}`".format(tab.table_id)
    # views default to standard SQL; set this to True if you need a legacy SQL view
    view.view_use_legacy_sql = False
    client.create_table(view)
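If each view should apply the deduplication query from the question instead of a plain SELECT *, you can build the query string inside the same loop. A minimal sketch, assuming every table in the dataset has id and updated_at columns (the project and dataset names are placeholders):

from google.cloud import bigquery

client = bigquery.Client()
dataset_ref = client.dataset('dataset_name')

# deduplication template; assumes each table has `id` and `updated_at` columns
dedup_sql = """
SELECT * EXCEPT (row_num)
FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) AS row_num
  FROM `my-project.dataset_name.{table}`
)
WHERE row_num = 1
"""

for tab in client.list_tables(dataset_ref):
    view = bigquery.Table(dataset_ref.table("v_{}".format(tab.table_id)))
    view.view_query = dedup_sql.format(table=tab.table_id)
    view.view_use_legacy_sql = False
    client.create_table(view)

Re-running this script after each export keeps the views in sync without writing the query by hand for every new table.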

Related

Reading Data from Temp Table in Snowflake into Jupyter Notebook

I am trying to query data from Snowflake into a Jupyter Notebook. Since some columns were not present in the original table, I created a temporary table that has the required new columns. Unfortunately, due to work restrictions, I can't show the whole output here, but when I ran the CREATE TEMPORARY TABLE command I got the following output:
Table CUSTOMER_ACCOUNT_NEW successfully created.
Here is the query I used to make the TEMP table.
CREATE OR REPLACE TEMPORARY TABLE DATA_LAKE.CUSTOMER.CUSTOMER_ACCOUNT_NEW AS
SELECT ID,
VERIFICATION_PROFILE,
get_path(VERIFICATION_PROFILE,'identityMindMostRecentResults')::VARCHAR AS identitymind,
get_path(VERIFICATION_PROFILE,'identityMindMostRecentResults."mm:1"')::VARCHAR AS mm1,
get_path(VERIFICATION_PROFILE,'identityMindMostRecentResults."mm:2"')::VARCHAR AS mm2,
get_path(VERIFICATION_PROFILE,'identityMindMostRecentResults.res')::VARCHAR AS res,
get_path(VERIFICATION_PROFILE,'identityMindMostRecentResults."ss:1"')::VARCHAR AS sanctions,
get_path(VERIFICATION_PROFILE,'autoVerified.facts.account.riskScore')::VARCHAR AS riskscore,
get_path(VERIFICATION_PROFILE,'autoVerified.facts.giact.verificationResponse')::VARCHAR AS GIACT,
get_path(VERIFICATION_PROFILE,'autoVerified.facts.account.type')::VARCHAR AS acct_type,
get_path(VERIFICATION_PROFILE,'autoVerified.verified')::VARCHAR AS verified,
get_path(VERIFICATION_PROFILE,'bankInformationProvided')::VARCHAR AS Bank_info_given,
get_path(VERIFICATION_PROFILE,'businessInformationProvided')::VARCHAR AS Business_info_given,
get_path(VERIFICATION_PROFILE,'autoVerified.facts.account.industry.riskLevel')::VARCHAR AS industry_risk
FROM DATA_LAKE.CUSTOMER.CUSTOMER_ACCOUNT
WHERE DATEDIFF('day',TO_DATE(TIME_UPDATED),CURRENT_DATE())<=90
I would like to mention that VERIFICATION_PROFILE is a JSON blob, hence I had to use get_path to retrieve the values. Moreover, keys like mm:1 are originally in double quotes, so I used them as they are, and it works fine in Snowflake.
Then, using the Snowflake Python connector, I tried to run the following query:
import pandas as pd
import warnings
import snowflake.connector as sf

ctx = sf.connect(
    user='*****',
    password='*****',
    account='*******',
    warehouse='********',
    database='DATA_LAKE',
    schema='CUSTOMER'
)

# create cursor
curs = ctx.cursor()

sqlnew2 = "SELECT * FROM DATA_LAKE.CUSTOMER.CUSTOMER_ACCOUNT_NEW;"
curs.execute(sqlnew2)
df = curs.fetch_pandas_all()
Here curs is the cursor object created earlier. Then I got the following message:
ProgrammingError: 002003 (42S02): SQL compilation error:
Object 'DATA_LAKE.CUSTOMER.CUSTOMER_ACCOUNT_NEW' does not exist or not authorized.
May I know whether the Snowflake connector allows us to query data from temporary tables or not? Help/advice is greatly appreciated.
Temp tables only live as long as the session in which they were created:
Temporary tables can have a Time Travel retention period of 1 day; however, a temporary table is purged once the session (in which the table was created) ends so the actual retention period is for 24 hours or the remainder of the session, whichever is shorter.
You might want to use a transient table instead:
https://docs.snowflake.com/en/user-guide/tables-temp-transient.html#comparison-of-table-types
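If the table only needs to exist for the duration of the notebook session, another option is to create the temporary table and query it over the same connector session, since a temp table created from the Snowflake UI lives in a different session and is invisible to the connector. A minimal sketch, reusing the ctx connection from the question and shortened to two of the columns (swap TEMPORARY for TRANSIENT to make the table survive across sessions, as suggested above):

curs = ctx.cursor()

# create the table in *this* session so it is visible to the queries below
curs.execute("""
CREATE OR REPLACE TEMPORARY TABLE DATA_LAKE.CUSTOMER.CUSTOMER_ACCOUNT_NEW AS
SELECT ID,
       get_path(VERIFICATION_PROFILE, 'autoVerified.verified')::VARCHAR AS verified
FROM DATA_LAKE.CUSTOMER.CUSTOMER_ACCOUNT
WHERE DATEDIFF('day', TO_DATE(TIME_UPDATED), CURRENT_DATE()) <= 90
""")

# the temp table is queryable from the same session
curs.execute("SELECT * FROM DATA_LAKE.CUSTOMER.CUSTOMER_ACCOUNT_NEW")
df = curs.fetch_pandas_all()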

How to query bigquery tables in google data studio with python-like string formatting in table names based on custom parameters?

So I have several tables, one per product per year, with names like 2020product5, 2019product5, 2018product6 and so on. I have added two custom parameters in Google Data Studio named year and product_id, but I could not use them in the table names themselves. I have used parameterized queries before, but only in conditions like where product_id = #product_id, and that setup only works if all of the data is in the same table, which is not the case here. In Python I would use string formatters like f"{year}product{product_id}", but that obviously does not work in this case...
Using BigQuery's built-in CONCAT and FORMAT functions does not help either, as both throw the following validation error: Table-valued function not found: CONCAT at [1:15]
So how do I get around querying BigQuery tables in Google Data Studio with python-like string formatting in the table names, based on custom parameters?
After much research I (kinda) sorted it out. It turns out that querying schema-level entities (e.g. table names) dynamically is a database-level feature. BigQuery does not support formatting inside a table name, so tables named as in the question (e.g. 2020product5, 2019product5, 2018product6) cannot be queried directly. However, it does have a _TABLE_SUFFIX pseudo column, which lets you access tables dynamically as long as the varying part of the name is at the end of the table name. (This feature also enables date-wise partitioning, and many tools that use BQ as a data sink rely on it, so if you are using BQ as a data sink there is a good chance your original data source is already doing this.) Thus, table names like product52020, product52019, product62018 can be accessed dynamically, including from Data Studio, using the following:
SELECT * FROM `project_salsa_101.dashboards.product*` WHERE _TABLE_SUFFIX = CONCAT(#product_id, #year)
P.S.: I used Python to write a quick-and-dirty script that loops through the products and years, copies each table into a new one with the suffix at the end, and drops the original. Adding the script with formatted strings here in case it is useful for anyone with a similar case:
import itertools

from google.cloud import bigquery
from google.oauth2 import service_account

credentials = service_account.Credentials.from_service_account_file(
    'project_salsa_101-bq-admin.json')
project_id = 'project_salsa_101'
schema = 'dashboards'

client = bigquery.Client(credentials=credentials, project=project_id)

# product_ids and years are the lists to migrate, e.g. product_ids = [5, 6]; years = [2018, 2019, 2020]
for product_id, year in itertools.product(product_ids, years):
    # read the old-style table (suffix at the front)
    df = client.query(f"""
        SELECT * FROM `{project_id}.{schema}.{year}product{product_id}`
    """).result().to_dataframe()

    # write it back with the suffix at the end (requires pandas-gbq)
    df.to_gbq(project_id=project_id,
              destination_table=f'{schema}.product{product_id}{year}',
              credentials=service_account.Credentials.from_service_account_file(
                  'credentials.json'),
              if_exists='replace')

    # drop the old table
    client.query(f"""
        DROP TABLE `{project_id}.{schema}.{year}product{product_id}`""").result()

How to retrieve all the catalog names, schema names and table names in a database like Snowflake or any such database?

I need to drop some columns and uppercase the data in Snowflake tables.
For that I need to loop through all the catalogs/DBs, their respective schemas, and then the tables.
I need to do this in Python: list the catalogs, the schemas, and then the tables, after which I will execute the SQL queries to do the manipulations.
How do I proceed with this?
1. List all the catalog names
2. List all the schema names
3. List all the table names
I have established a connection using the Python Snowflake connector.
Your best source for this information is the SNOWFLAKE.ACCOUNT_USAGE share that Snowflake provides. You'll need to grant privileges on it to whatever role you are using to connect with Python. From there you have the following views available: DATABASES, SCHEMATA, TABLES, and more.
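A minimal sketch of that approach, assuming the ctx connection you already opened with the Python connector and a role that has been granted access to the ACCOUNT_USAGE share:

curs = ctx.cursor()

# the TABLES view lists every table in the account; DELETED IS NULL skips dropped tables
curs.execute("""
    SELECT table_catalog, table_schema, table_name
    FROM SNOWFLAKE.ACCOUNT_USAGE.TABLES
    WHERE deleted IS NULL
    ORDER BY table_catalog, table_schema, table_name
""")
for catalog, schema, table in curs.fetchall():
    print(catalog, schema, table)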
The easiest way would be to follow the below process
show databases;
select "name" from table(result_scan(last_query_id()));
This will give you the list of Databases. Put them in a list. Traverse through this list and on each item do the following:
use <DBNAME>;
show schemas;
select "name" from table(result_scan(last_query_id()));
Get the list of schemas. Then, for each schema in the list, do the following:
use schema <SchemaName>;
show tables;
select "name" from table(result_scan(last_query_id()));
Get the list of tables and then run your queries.
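Wired up with the Python connector, the same process looks roughly like the sketch below (it assumes the ctx connection from the question; the connector returns the SHOW output straight to the cursor, so the result_scan step is optional here):

curs = ctx.cursor()

# for SHOW DATABASES / SCHEMAS / TABLES the "name" column is the second field
databases = [row[1] for row in curs.execute("SHOW DATABASES").fetchall()]
for db in databases:
    schemas = [row[1] for row in curs.execute(f"SHOW SCHEMAS IN DATABASE {db}").fetchall()]
    for schema in schemas:
        tables = [row[1] for row in curs.execute(f"SHOW TABLES IN SCHEMA {db}.{schema}").fetchall()]
        for table in tables:
            print(f"{db}.{schema}.{table}")  # run your ALTER/UPDATE queries here instead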
You probably will not need the result_scan. Recently, I created a Python program to list all columns for all tables within Snowflake. My requirement was to validate each column and calculate some numerical statistics for the columns. I was able to do it using 'SHOW COLUMNS' only. I have open-sourced some of the common Snowflake operations, which are available here:
https://github.com/Infosys/Snowflake-Python-Development-Framework
You can clone this code and then use the framework to create your Python program to list the columns as shown below, after which you can do whatever you like with the column details.
from utilities.sf_operations import Snowflakeconnection

connection = Snowflakeconnection(profilename='snowflake_host')
sfconnectionresults = connection.get_snowflake_connection()
sfconnection = sfconnectionresults.get('connection')
statuscode = sfconnectionresults.get('statuscode')
statusmessage = sfconnectionresults.get('statusmessage')
print(sfconnection, statuscode, statusmessage)

snow_sql = 'SHOW COLUMNS;'
queryresult = connection.execute_snowquery(sfconnection, snow_sql)
print(queryresult['result'])

print('column_name|table_name|column_attribute')
print('---------------------------------------------')
for rows in queryresult['result']:
    table_name = rows[0]
    schema_name = rows[1]
    column_name = rows[2]
    column_attribute = rows[3]
    is_Null = rows[4]
    default_Value = rows[5]
    kind = rows[6]
    expression = rows[7]
    comment = rows[8]
    database_name = rows[9]
    autoincrement = rows[10]
    print(column_name + '|' + table_name + '|' + column_attribute)

Adding a column from an existing BQ table to another BQ table using Python

I am trying to experiment with creating new tables from existing BQ tables, all within Python. So far I've successfully created the table using some similar code, but now I want to add another column to it from another table, which I have not been successful with. I think the problem is somewhere in my SQL code.
Basically what I want here is to add another column named "ip_address" and put all the info from another table into that column.
I've tried splitting the two SQL statements and running them separately, and I've tried many different combinations of the commands (taking out CHAR, adding (32) after it, combining everything into one statement, etc.), and I still run into problems.
from google.cloud import bigquery


def alter(client, sql_alter, job_config, table_id):
    query_job = client.query(sql_alter, job_config=job_config)
    query_job.result()
    print(f'Query results appended to table {table_id}')


def main():
    client = bigquery.Client.from_service_account_json('my_json')
    table_id = 'ref.datasetid.tableid'
    job_config = bigquery.QueryJobConfig()
    sql_alter = """
    ALTER TABLE `ref.datasetid.tableid`
        ADD COLUMN ip_address CHAR;
    INSERT INTO `ref.datasetid.tableid` ip_address
    SELECT ip
    FROM `ref.datasetid.table2id`;
    """
    alter(client, sql_alter, job_config, table_id)


if __name__ == '__main__':
    main()
With this code, the current error is "400 Syntax error: Unexpected extra token INSERT at [4:9]". Also, do I have to keep referencing my table as ref.datasetid.tableid, or can I write just tableid? I've run into errors before it gets that far, so I'm still not sure. Still a beginner, so help is greatly appreciated!
BigQuery does not support adding a column through ALTER TABLE or similar DDL statements (at least as of this answer); take a look at Modifying table schemas, where you can find an example of how to add a new column when you append data to a table during a load job.
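For reference, here is a minimal sketch of the schema-update approach those docs describe: it appends an empty, NULLable ip_address column to the existing table (table and column names follow the question; note that BigQuery has no CHAR type, so STRING is used). Populating the column from the other table would then be a separate query or load job:

from google.cloud import bigquery

client = bigquery.Client.from_service_account_json('my_json')

# fetch the current schema and append the new NULLABLE column
table = client.get_table('ref.datasetid.tableid')
new_schema = list(table.schema)
new_schema.append(bigquery.SchemaField('ip_address', 'STRING', mode='NULLABLE'))

table.schema = new_schema
client.update_table(table, ['schema'])  # API request that patches only the schema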

BigQuery insert job instead of streaming

I am currently using BigQuery's streaming option to load data into tables. However, tables that have a date partition on them do not show any partitions... I am aware that this is an effect of the streaming.
The Python code I use:
def stream_data(dataset_name, table_name, data):
    bigquery_client = bigquery.Client()
    dataset = bigquery_client.dataset(dataset_name)
    table = dataset.table(table_name)

    # Reload the table to get the schema.
    table.reload()

    rows = data
    errors = table.insert_data(rows)

    if not errors:
        print('Loaded 1 row into {}:{}'.format(dataset_name, table_name))
    else:
        print('Errors:')
        print(errors)
Will date-partitioned tables eventually show their partitions, and if not, how can I create an insert job instead to achieve this?
Not sure what you mean by "partitions not being shown", but when you create a partitioned table you will only see one single table.
The only difference is that you can query this table for specific date partitions, like so:
SELECT
  *
FROM
  mydataset.partitioned_table
WHERE
  _PARTITIONTIME BETWEEN TIMESTAMP('2016-12-25')
  AND TIMESTAMP('2016-12-31');
As you can see in this example, partitioned tables have the meta column _PARTITIONTIME and that's what you use to select the partitions you are interested in.
For more info, here are the docs explaining a bit more about querying data in partitioned tables.
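As for the second part of the question, a load job can target a single day's partition directly via the $YYYYMMDD partition decorator, which is one way to replace streaming inserts. A rough sketch with the google-cloud-bigquery client, assuming newline-delimited JSON rows in a local file (dataset, table and file names are placeholders):

from google.cloud import bigquery

client = bigquery.Client()

# target the 2016-12-25 partition with the $YYYYMMDD decorator
table_ref = client.dataset('mydataset').table('partitioned_table$20161225')

job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
job_config.write_disposition = bigquery.WriteDisposition.WRITE_APPEND

with open('rows.json', 'rb') as source_file:
    load_job = client.load_table_from_file(source_file, table_ref, job_config=job_config)

load_job.result()  # wait for the load job to finish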
