Delete BigQuery table content with pandas-gbq - Python

I'm deleting BigQuery rows from a table using the "pandas-gbq" library, which works fine.
However, since read_gbq is a "read" call, the whole table content is fetched by default, which is unnecessary and something I'd like to avoid.
My current code is below; any ideas on how to perform a delete without fetching the table as a DataFrame?
Thanks in advance.
# Delete gbq table rows - today and yesterday
sql = """
DELETE FROM `bla.bla.bla`
WHERE Day = '{today}' OR Day = '{yesterday}'
"""
sql = sql.format(today=curr_date, yesterday=prev_date)
pandas_gbq.read_gbq(sql, project_id=project_id, credentials=credentials)

Why not use google-cloud-bigquery to invoke the query? It provides better access to the BQ API surface.
pandas_gbq by its nature exposes only a subset of it, to enable integration with the pandas ecosystem. See this document for more information about the differences and migrating between the two.
Here's a quick equivalent using google-cloud-bigquery:
def do_the_thing():
    from google.cloud import bigquery
    bqclient = bigquery.Client()
    sql = """
    DELETE FROM `bla.bla.bla`
    WHERE Day = '{today}' OR Day = '{yesterday}'
    """
    # fill in the dates exactly as in the question
    sql = sql.format(today=curr_date, yesterday=prev_date)
    query = bqclient.query(sql)
    print("started query as {}".format(query.job_id))
    # invoke result() to wait until completion;
    # DML statements like DELETE don't return rows, so we don't need the row iterator
    query.result()
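As a side note, the same client also supports query parameters, which avoids formatting the dates into the SQL string at all. A minimal sketch, assuming the Day column is a DATE (if it is a STRING, the parameter type would need to change):
import datetime
from google.cloud import bigquery

curr_date = datetime.date.today()                      # same variables as in the question
prev_date = curr_date - datetime.timedelta(days=1)

bqclient = bigquery.Client()
sql = """
DELETE FROM `bla.bla.bla`
WHERE Day = @today OR Day = @yesterday
"""
# Bind the dates as query parameters instead of string-formatting them in.
job_config = bigquery.QueryJobConfig(
    query_parameters=[
        bigquery.ScalarQueryParameter("today", "DATE", curr_date),
        bigquery.ScalarQueryParameter("yesterday", "DATE", prev_date),
    ]
)
query = bqclient.query(sql, job_config=job_config)
query.result()  # wait for the DELETE to finish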

Related

Does fetching data from Azure Table Storage with Python take too long? The data has around 1000 rows per hour and I am fetching it hour by hour

from azure.cosmosdb.table.tableservice import TableService
import pandas as pd

def queryAzureTable(azureTableName, filterQuery):
    table_service = TableService(account_name='accountname', account_key='accountkey')
    tasks = table_service.query_entities(azureTableName, filter=filterQuery)
    return tasks

filterQuery = f"PartitionKey eq '{key}' and Timestamp ge datetime'2022-06-15T09:00:00' and Timestamp lt datetime'2022-06-15T10:00:00'"
entities = queryAzureTable("TableName", filterQuery)
for i in entities:
    print(i)
or
df = pd.DataFrame(entities)
Above is the code I am using. The Azure table has only around 1000 entries, which should not take long, but extracting them takes more than an hour with this code.
Both approaches, iterating with a for loop or converting the entities directly to a DataFrame, take too long.
Could anyone let me know why it is taking so long, or whether it generally takes that much time?
If so, is there any alternative that takes no more than 10-15 minutes, without increasing the number of clusters already in use?
I read that multithreading might resolve it and tried that too, but it doesn't seem to help; maybe I am writing it wrong. Could anyone help me with the code using multithreading, or any alternative way?
I tried listing all the rows in my table storage. By default, Azure Table Storage returns at most 1000 rows (entities) per query response.
There are also limits on the partition key and row key, each of which must not exceed 1 KiB. Unfortunately, the type of storage account also matters for the latency of your output. As you're trying to query 1000 rows at once:
Make sure your table storage is in a region near you.
Check the scalability targets and limitations for Azure Table Storage here: https://learn.microsoft.com/en-us/azure/storage/tables/scalability-targets#scale-targets-for-table-storage
Also, AFAIK, in your code you can directly use the list_entities method to list all the entities in the table instead of writing such a complex query:
I tried the code below and was able to retrieve all the table entities within a few seconds using a standard general-purpose v2 storage account:
Code:
from azure.data.tables import TableClient

table_client = TableClient.from_connection_string(conn_str="<connection-string>", table_name="myTable")

# Query the entities in the table
entities = list(table_client.list_entities())

for i, entity in enumerate(entities):
    print("Entity #{}: {}".format(i, entity))
With pandas:
import pandas as pd
from azure.cosmosdb.table.tableservice import TableService

CONNECTION_STRING = "DefaultEndpointsProtocol=https;AccountName=siliconstrg45;AccountKey=<account-key>;EndpointSuffix=core.windows.net"
SOURCE_TABLE = "myTable"

def set_table_service():
    """ Set the Azure Table Storage service """
    return TableService(connection_string=CONNECTION_STRING)

def get_dataframe_from_table_storage_table(table_service):
    """ Create a dataframe from table storage data """
    return pd.DataFrame(get_data_from_table_storage_table(table_service))

def get_data_from_table_storage_table(table_service):
    """ Retrieve data from Table Storage """
    for record in table_service.query_entities(SOURCE_TABLE):
        yield record

ts = set_table_service()
df = get_dataframe_from_table_storage_table(table_service=ts)
print(df)
If your table storage scalability targets are in place, you can consider a few points from this document to increase the IOPS of your table storage:
https://learn.microsoft.com/en-us/azure/storage/tables/storage-performance-checklist
Also, storage quotas and limits vary by Azure subscription type.

How to import a large BigQuery table into JupyterLab?

In BigQuery, I have a table with 608 GB of data, 50 million rows, and 2651 columns. I'm trying to load it into JupyterLab as a pandas DataFrame before doing any modeling. I'm saving the query's results into a pandas DataFrame using %%bigquery. However, because of the large size, I'm getting an error. I followed the documentation here and a couple of Stack Overflow discussions (this) that suggested using LIMIT and setting allow_large_results = True. However, I am unable to determine how to apply them to my specific problem.
Kindly please advise.
Thanks.
If you want to set configuration.query.allowLargeResults to true, you must also provide a destination table object in your job configuration.
If you are using Python, here is an example that sets allow_large_results to true:
from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

# TODO(developer): Set table_id to the ID of the destination table.
# table_id = "your-project.your_dataset.your_table_name"

# Set the destination table and use_legacy_sql to True to use
# legacy SQL syntax.
job_config = bigquery.QueryJobConfig(
    allow_large_results=True, destination=table_id, use_legacy_sql=True
)

sql = """
    SELECT corpus
    FROM [bigquery-public-data:samples.shakespeare]
    GROUP BY corpus;
"""

# Start the query, passing in the extra configuration.
query_job = client.query(sql, job_config=job_config)  # Make an API request.
query_job.result()  # Wait for the job to complete.

print("Query results loaded to the table {}".format(table_id))
If you are querying via the API:
"configuration": {
  "query": {
    "allowLargeResults": true,
    "query": "select uid from [project:dataset.table]",
    "destinationTable": {
      "projectId": "project",
      "datasetId": "dataset",
      "tableId": "table"
    }
  }
}
Using allow_large_results has the following limitations:
You must specify a destination table.
You cannot specify a top-level ORDER BY, TOP, or LIMIT clause.
Window functions can return large query results only if used in conjunction with a PARTITION BY clause.
You can see this official documentation for details.
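Once the query results are in the destination table, one way to bring part of them into the notebook as a DataFrame is to read only the columns you need; a rough sketch where the table ID, column name, and row cap are placeholders (installing google-cloud-bigquery-storage speeds up large downloads):
from google.cloud import bigquery

client = bigquery.Client()
table_id = "your-project.your_dataset.your_table_name"  # placeholder: the destination table

# Read only the columns you need instead of all 2651, and cap rows while experimenting.
rows = client.list_rows(
    table_id,
    selected_fields=[bigquery.SchemaField("corpus", "STRING")],  # placeholder column
    max_results=100000,
)
df = rows.to_dataframe()
print(df.shape)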

How to query BigQuery tables in Google Data Studio with Python-like string formatting in table names based on custom parameters?

So I have several tables, one per product per year, with names like 2020product5, 2019product5, 2018product6, and so on. I have added two custom parameters in Google Data Studio named year and product_id, but I could not use them in the table names themselves. I have used parameterized queries before in conditions like where product_id = #product_id, but that setup only works if all of the data is in the same table, which is not the case here. In Python I use string formatters like f"{year}product{product_id}", but that obviously does not work in this case.
Using BigQuery's built-in CONCAT and FORMAT functions does not help either, as both throw the following validation error: Table-valued function not found: CONCAT at [1:15]
So how do I query BigQuery tables in Google Data Studio with Python-like string formatting in table names based on custom parameters?
After much research I (kind of) sorted it out. It turns out that querying schema-level entities, e.g. table names, dynamically is a database-level feature. BigQuery does not support formatting within a table name, so tables named as in the question (e.g. 2020product5, 2019product5, 2018product6) cannot be queried directly. However, it does have a _TABLE_SUFFIX pseudo column, which lets you access tables dynamically, provided the varying part of the table name is at the end. (This also enables date-wise partitioning, and many tools that use BQ as a data sink rely on it, so if you are using BQ as a data sink there is a good chance your original data source already does this.) Thus, table names like product52020, product52019, product62018, with the varying part at the end, can be accessed dynamically, including from Data Studio, using the following:
SELECT * FROM `project_salsa_101.dashboards.product*` WHERE _TABLE_SUFFIX = CONCAT(#product_id, #year)
P.S.: I used Python to create a quick-and-dirty script that loops through the products and tables and copies each one into a new table with the suffix at the end. Adding the script with formatted strings here, as it might be useful for anyone with a similar case (nominal effort required):
import itertools
from google.cloud import bigquery
from google.oauth2 import service_account

credentials = service_account.Credentials.from_service_account_file(
    'project_salsa_101-bq-admin.json')
project_id = 'project_salsa_101'
schema = 'dashboards'

client = bigquery.Client(credentials=credentials, project=project_id)

# product_ids and years are the lists of product ids and years to migrate
for product_id, year in itertools.product(product_ids, years):
    df = client.query(f"""
        SELECT * FROM `{project_id}.{schema}.{year}product{product_id}`
    """).result().to_dataframe()
    df.to_gbq(project_id=project_id,
              destination_table=f'{schema}.product{product_id}{year}',
              credentials=service_account.Credentials.from_service_account_file(
                  'credentials.json'),
              if_exists='replace')
    client.query(f"""
        DROP TABLE `{project_id}.{schema}.{year}product{product_id}`""").result()

Adding a column from an existing BQ table to another BQ table using Python

I am trying to experiment with creating new tables from existing BQ tables, all within Python. So far I've successfully created the table using some similar code, but now I want to add another column to it from another table, which I have not been successful with. I think the problem is somewhere in my SQL code.
Basically what I want is to add another column named "ip_address" and put all the info from another table into that column.
I've tried splitting the two SQL statements and running them separately, and I've tried many different combinations of the commands (taking out CHAR, adding (32) after it, combining everything into one statement, etc.), and I still run into problems.
from google.cloud import bigquery

def alter(client, sql_alter, job_config, table_id):
    query_job = client.query(sql_alter, job_config=job_config)
    query_job.result()
    print(f'Query results appended to table {table_id}')

def main():
    client = bigquery.Client.from_service_account_json('my_json')
    table_id = 'ref.datasetid.tableid'
    job_config = bigquery.QueryJobConfig()
    sql_alter = """
    ALTER TABLE `ref.datasetid.tableid`
    ADD COLUMN ip_address CHAR;
    INSERT INTO `ref.datasetid.tableid` ip_address
    SELECT ip
    FROM `ref.datasetid.table2id`;
    """
    alter(client, sql_alter, job_config, table_id)

if __name__ == '__main__':
    main()
With this code, the current error is "400 Syntax error: Unexpected extra token INSERT at [4:9]". Also, do I have to keep referencing my table as ref.datasetid.tableid, or can I write just tableid? I've run into errors before it gets that far, so I'm still not sure. Still a beginner, so help is greatly appreciated!
BigQuery does not support ALTER TABLE or other DDL statements; take a look at Modifying table schemas, where you can find an example of how to add a new column when you append data to a table during a load job.
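As a sketch of that schema-modification route, following the pattern in the Modifying table schemas documentation and reusing the question's placeholder table ID: this only adds an empty NULLABLE column, and the ip values from the other table would then be filled in separately (e.g. with a DML UPDATE or during a load job).
from google.cloud import bigquery

client = bigquery.Client.from_service_account_json('my_json')  # same credentials file as in the question
table_id = 'ref.datasetid.tableid'                             # placeholder table ID from the question

table = client.get_table(table_id)       # fetch the table and its current schema

new_schema = table.schema[:]             # copy the existing schema
new_schema.append(bigquery.SchemaField("ip_address", "STRING"))  # added columns must be NULLABLE
table.schema = new_schema

table = client.update_table(table, ["schema"])  # push only the schema change
print("Added column ip_address to {}".format(table_id))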

Create multiple views in shell script or BigQuery

When I export data from MySQL to BigQuery, some data gets duplicated. As a way to fix this, I thought of creating views of these tables using ROW_NUMBER, with the query shown below. The problem is that a lot of tables in my dataset are duplicated, and when I add new tables and export them to BigQuery they will likely have duplicated data as well. I don't want to write this kind of query by hand every time I add a new table to my dataset; I want a view to be created the moment I export a new table. Is it possible to do this in a loop in the query (like 'for each table in my dataset, do this')? Is it possible in a shell script (when a table is exported to BigQuery, create a view for it)? Failing that, is it possible in Python?
SELECT
  * EXCEPT (ROW_NUMBER)
FROM (
  SELECT
    *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) ROW_NUMBER
  FROM dataset1.table1
)
WHERE ROW_NUMBER = 1
It definitely can be done in Python.
I would recommend using the google-cloud-python library: https://github.com/GoogleCloudPlatform/google-cloud-python
I think your script should be something like this:
from google.cloud import bigquery

client = bigquery.Client()
dataset_ref = client.dataset('dataset_name')
tables = list(client.list_tables(dataset_ref))

for tab in tables:
    view = bigquery.Table(dataset_ref.table("v_{}".format(tab.table_id)))
    view.view_query = "select * from `my-project.dataset_name.{}`".format(tab.table_id)
    # views use standard SQL by default; set view_use_legacy_sql = True for a legacy SQL view
    view.view_use_legacy_sql = False
    client.create_table(view)
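If each view should also deduplicate like the query in the question, rather than being a plain SELECT *, the same loop can template the ROW_NUMBER pattern into view_query; a sketch, assuming every table has id and updated_at columns and using my-project.dataset_name as a placeholder:
from google.cloud import bigquery

client = bigquery.Client()
dataset_ref = client.dataset('dataset_name')

# Deduplication template: keep the most recent row per id.
dedup_sql = """
SELECT * EXCEPT (rn)
FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn
  FROM `my-project.dataset_name.{table}`
)
WHERE rn = 1
"""

for tab in client.list_tables(dataset_ref):
    if tab.table_type != "TABLE":
        continue  # skip existing views so only base tables get wrapped
    view = bigquery.Table(dataset_ref.table("v_{}".format(tab.table_id)))
    view.view_query = dedup_sql.format(table=tab.table_id)
    client.create_table(view)  # one deduplicating view per source table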

Categories

Resources