I have a use case where I have to write event data to multiple tables in Bigtable using Python. Is this possible in Bigtable, or can we not do that?
When I try to write the data to multiple tables in the same code, this error occurs:
google.cloud.bigtable.table.TableMismatchError:
Please confirm whether we can do this with Bigtable or not.
Bigtable doesn't support writing to multiple tables in a single request, so you'd need to get a separate table object for each table and then group the writes by table. You could do something like this:
table1 = instance.table(table_id1)
table2 = instance.table(table_id2)

timestamp = datetime.datetime.utcnow()
column_family_id = "stats_summary"

# Rows destined for the first table.
rows1 = [table1.direct_row("tablet#a0b81f74#20190501"),
         table1.direct_row("tablet#a0b81f74#20190502")]
rows1[0].set_cell(...)
rows1[0].set_cell(...)
rows1[1].set_cell(...)
rows1[1].set_cell(...)

# Rows destined for the second table.
rows2 = [table2.direct_row("tablet#20190501#a0b81f74"),
         table2.direct_row("tablet#20190502#a0b81f74")]
rows2[0].set_cell(...)
rows2[0].set_cell(...)
rows2[1].set_cell(...)
rows2[1].set_cell(...)

# Send each batch of rows to its own table.
response1 = table1.mutate_rows(rows1)
response2 = table2.mutate_rows(rows2)
There are more examples of how to perform various writes in the Bigtable documentation, so once you create your other Bigtable table connections, you can just follow those examples.
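Each mutate_rows call returns one status per row, so it is also worth checking for partial failures, following the pattern used in the standard Bigtable write samples:

for response in (response1, response2):
    for i, status in enumerate(response):
        if status.code != 0:
            print("Error writing row {}: {}".format(i, status.message))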
So I have several tables, one per product per year, with names like:
2020product5, 2019product5, 2018product6 and so on. I have added two custom parameters in Google Data Studio named year and product_id, but I could not use them in the table names themselves. I have used parameterized queries before, but only in conditions like where product_id = #product_id, and that setup only works if all of the data is in the same table, which is not the case here. In Python I would use string formatting like f"{year}product{product_id}", but that obviously does not work in this case...
Using BigQuery's built-in CONCAT and FORMAT functions does not help, as both throw the following validation error: Table-valued function not found: CONCAT at [1:15]
So how do I go about querying BigQuery tables in Google Data Studio with Python-like string formatting in table names based on custom parameters?
After much research I (kind of) sorted it out. It turns out that querying schema-level entities such as table names dynamically is a database-level feature, and BigQuery does not support formatting within a table name, so tables named as in the question (e.g. 2020product5, 2019product5, 2018product6) cannot be queried directly. However, BigQuery does have a _TABLE_SUFFIX pseudo column, which lets you access tables dynamically through a wildcard, provided that the variable part of the table name is at the end. (This also enables date-wise partitioning, and many tools that use BigQuery as a data sink rely on it, so if you are using BigQuery as a data sink there is a good chance your original data source already names tables this way.) Thus, table names like product52020, product52019, product62018 can be accessed dynamically, and of course from Data Studio too, using the following:
SELECT * FROM `project_salsa_101.dashboards.product*` WHERE _TABLE_SUFFIX = CONCAT(#product_id, #year)
P.S.: I used Python to write a quick script which looped through the products and years, copied each table into a new one with the suffix at the end, and dropped the old one. It goes as follows (adding the script with formatted strings so it might be useful for anyone with a similar case, with nominal effort):
import itertools

from google.cloud import bigquery
from google.oauth2 import service_account

credentials = service_account.Credentials.from_service_account_file(
    'project_salsa_101-bq-admin.json')
project_id = 'project_salsa_101'
schema = 'dashboards'
client = bigquery.Client(credentials=credentials, project=project_id)

# The product ids and years that appear in the existing table names.
product_ids = [5, 6]
years = [2018, 2019, 2020]

for product_id, year in itertools.product(product_ids, years):
    # Copy the old table into one with the variable part at the end of the name.
    df = client.query(f"""
        SELECT * FROM `{project_id}.{schema}.{year}product{product_id}`
    """).result().to_dataframe()
    df.to_gbq(project_id=project_id,
              destination_table=f'{schema}.product{product_id}{year}',
              credentials=service_account.Credentials.from_service_account_file(
                  'credentials.json'),
              if_exists='replace')
    # Drop the old table.
    client.query(f"""
        DROP TABLE `{project_id}.{schema}.{year}product{product_id}`""").result()
I need to drop some columns and uppercase the data in Snowflake tables.
For that I need to loop through all the catalogs/databases, their respective schemas, and then the tables.
I need to do this in Python: list the catalogs, then the schemas, and then the tables, after which I will be executing the SQL queries to do the manipulations.
How do I proceed with this?
1. List all the catalog names
2. List all the schema names
3. List all the table names
I have established a connection using the Python Snowflake connector.
Your best source for this information is the SNOWFLAKE.ACCOUNT_USAGE share that Snowflake provides. You'll need to grant privileges on it to whatever role you are using to connect with Python. From there, you have the following views available: DATABASES, SCHEMATA, TABLES, and more.
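As a rough sketch, assuming the snowflake-connector-python package, placeholder credentials, and that your role has been granted access to ACCOUNT_USAGE, you can pull all three levels from the TABLES view in a single query:

import snowflake.connector

# Placeholder connection details; use your own account and credentials.
conn = snowflake.connector.connect(user='MY_USER', password='MY_PASSWORD',
                                   account='MY_ACCOUNT')
cur = conn.cursor()

# DELETED is NULL for tables that still exist.
cur.execute("""
    SELECT table_catalog, table_schema, table_name
    FROM SNOWFLAKE.ACCOUNT_USAGE.TABLES
    WHERE deleted IS NULL
""")
for catalog, schema, table in cur:
    print(catalog, schema, table)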
The easiest way would be to follow the process below:
show databases;
select "name" from table(result_scan(last_query_id()));
This will give you the list of Databases. Put them in a list. Traverse through this list and on each item do the following:
use <DBNAME>;
show schemas;
select "name" from table(result_scan(last_query_id()));
Get the list of schemas. Then, for each schema, run:
use schema <SchemaName>;
show tables;
select "name" from table(result_scan(last_query_id()));
Get the list of tables and then run your queries. A Python sketch of this whole loop is shown below.
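Here is a minimal sketch of that loop, assuming the snowflake-connector-python package and placeholder credentials; it relies on the object name being in the second ("name") column of each SHOW result:

import snowflake.connector

conn = snowflake.connector.connect(user='MY_USER', password='MY_PASSWORD',
                                   account='MY_ACCOUNT')
cur = conn.cursor()

# SHOW DATABASES / SCHEMAS / TABLES return the object name in the "name" column.
databases = [row[1] for row in cur.execute("SHOW DATABASES")]
for db in databases:
    cur.execute(f'USE DATABASE "{db}"')
    schemas = [row[1] for row in cur.execute("SHOW SCHEMAS")]
    for schema in schemas:
        if schema == 'INFORMATION_SCHEMA':
            continue  # skip the built-in metadata schema
        cur.execute(f'USE SCHEMA "{schema}"')
        tables = [row[1] for row in cur.execute("SHOW TABLES")]
        for table in tables:
            # Run your ALTER TABLE / UPDATE statements against each table here.
            print(db, schema, table)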
You probably will not need the result_scan. Recently, I created a Python program to list all columns for all tables within Snowflake. My requirement was to validate each column and calculate some numerical statistics on the columns. I was able to do it using 'SHOW COLUMNS' only. I have open-sourced some of the common Snowflake operations, which are available here:
https://github.com/Infosys/Snowflake-Python-Development-Framework
You can clone this code and then use the framework to create your Python program to list the columns as shown below, after which you can do whatever you would like with the column details:
from utilities.sf_operations import Snowflakeconnection
connection = Snowflakeconnection(profilename ='snowflake_host')
sfconnectionresults = connection.get_snowflake_connection()
sfconnection = sfconnectionresults.get('connection')
statuscode = sfconnectionresults.get('statuscode')
statusmessage = sfconnectionresults.get('statusmessage')
print(sfconnection, statuscode, statusmessage)
snow_sql = 'SHOW COLUMNS;'
queryresult = connection.execute_snowquery(sfconnection, snow_sql)
print(queryresult['result'])
print('column_name|table_name|column_attribute')
print('---------------------------------------------')
for rows in queryresult['result']:
    table_name = rows[0]
    schema_name = rows[1]
    column_name = rows[2]
    column_attribute = rows[3]
    is_Null = rows[4]
    default_Value = rows[5]
    kind = rows[6]
    expression = rows[7]
    comment = rows[8]
    database_name = rows[9]
    autoincrement = rows[10]
    print(column_name + '|' + table_name + '|' + column_attribute)
I'm deleting BigQuery rows from a table using the pandas-gbq library, which works fine.
However, since this is a "read" action, by default the whole query result is fetched into a dataframe, and I don't want that to happen, since it is unnecessary.
This is my current code below; any ideas about a way to perform a delete action without fetching the table as a df?
Thanks in advance.
Delete gbq table rows - Today and yesterday
sql = """
DELETE FROM `bla.bla.bla`
WHERE Day = '{today}' OR Day = '{yesterday}'
"""
sql = sql.format(today=curr_date, yesterday=prev_date)
pandas_gbq.read_gbq(sql, project_id=project_id, credentials=credentials)
Why not use google-cloud-bigquery to invoke the query, which provides better access to the BQ API surface?
pandas_gbq by its nature provides only a subset to enable integration with the pandas ecosystem. See this document for more information about the differences and migrating between the two.
Here's a quick equivalent using google-cloud-bigquery:
def do_the_thing(curr_date, prev_date):
    from google.cloud import bigquery
    bqclient = bigquery.Client()
    sql = """
    DELETE FROM `bla.bla.bla`
    WHERE Day = '{today}' OR Day = '{yesterday}'
    """.format(today=curr_date, yesterday=prev_date)
    query = bqclient.query(sql)
    print("started query as {}".format(query.job_id))
    # invoke result() to wait until completion
    # DML queries don't return rows so we don't need a row iterator
    query.result()
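If you'd rather avoid string formatting inside the SQL, the same delete can also be expressed with query parameters. A minimal sketch (the function name delete_rows is illustrative, and the Day column is assumed to be a DATE):

from google.cloud import bigquery

def delete_rows(curr_date, prev_date):
    bqclient = bigquery.Client()
    sql = """
    DELETE FROM `bla.bla.bla`
    WHERE Day = @today OR Day = @yesterday
    """
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("today", "DATE", curr_date),
            bigquery.ScalarQueryParameter("yesterday", "DATE", prev_date),
        ]
    )
    bqclient.query(sql, job_config=job_config).result()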
When I export data from MySQL to BigQuery, some data gets duplicated. As a way to fix this, I thought of creating views of these tables using ROW_NUMBER(). The query to do this is shown below. The problem is that a lot of tables in my dataset are duplicated, and when I add new tables and export them to BigQuery they will probably contain duplicated data too, and I don't want to write this kind of query every time I add a new table to my dataset (I want a view for a table to be created at the moment I export it). Is it possible to do this in a loop in the query (like 'for each table in my dataset, do this')? Is it possible to do this in a shell script (when a table is exported to BigQuery, create a view for it)? Failing that, is it possible to do it in Python?
SELECT
* EXCEPT (ROW_NUMBER)
FROM
(
SELECT
*, ROW_NUMBER() OVER (PARTITION BY id order by updated_at desc) ROW_NUMBER
FROM dataset1.table1
)
WHERE ROW_NUMBER = 1
It can definitely be done in Python.
I would recommend using the google-cloud-python library: https://github.com/GoogleCloudPlatform/google-cloud-python
So your script could be something like this:
from google.cloud import bigquery

client = bigquery.Client()
dataset_ref = client.dataset('dataset_name')
tables = list(client.list_tables(dataset_ref))

for tab in tables:
    # Create a view named v_<table> on top of each table.
    view = bigquery.Table(dataset_ref.table("v_{}".format(tab.table_id)))
    view.view_query = "SELECT * FROM `{}.{}.{}`".format(
        tab.project, tab.dataset_id, tab.table_id)
    # Views use standard SQL by default; set view_use_legacy_sql = True if needed.
    client.create_table(view)
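If each view should also deduplicate rows as in the query from the question, the view_query in the loop above could embed that pattern instead of a plain SELECT * (assuming every table has id and updated_at columns, as in the question):

    view.view_query = """
        SELECT * EXCEPT (rn)
        FROM (
            SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn
            FROM `{}.{}.{}`
        )
        WHERE rn = 1
    """.format(tab.project, tab.dataset_id, tab.table_id)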
I am currently using BigQuery's streaming option to load data into tables. However, tables that have date partitioning enabled do not show any partitions... I am aware of this being an effect of the streaming.
The Python code I use:
from google.cloud import bigquery

def stream_data(dataset_name, table_name, data):
    bigquery_client = bigquery.Client()
    dataset = bigquery_client.dataset(dataset_name)
    table = dataset.table(table_name)

    # Reload the table to get the schema.
    table.reload()

    rows = data
    errors = table.insert_data(rows)

    if not errors:
        print('Loaded 1 row into {}:{}'.format(dataset_name, table_name))
    else:
        print('Errors:')
        print(errors)
Will the date-partitioned tables eventually show their partitions, and if not, how can I create an insert job to achieve this?
Not sure what you mean by "partitions not being shown", but when you create a partitioned table you will only see one single table.
The only difference is that you can query this table for date partitions, like so:
SELECT
*
FROM
mydataset.partitioned_table
WHERE
_PARTITIONTIME BETWEEN TIMESTAMP('2016-12-25')
AND TIMESTAMP('2016-12-31');
As you can see in this example, partitioned tables have the pseudo column _PARTITIONTIME, and that's what you use to select the partitions you are interested in.
For more info, here are the docs explaining a bit more about querying data in partitioned tables.
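To run that same check from Python, a sketch using the google-cloud-bigquery client (with the dataset and table names from the example above) could look like this:

from google.cloud import bigquery

client = bigquery.Client()
sql = """
    SELECT *
    FROM `mydataset.partitioned_table`
    WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2016-12-25')
                             AND TIMESTAMP('2016-12-31')
"""
for row in client.query(sql).result():
    print(row)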