BigQuery insert job instead of streaming - python

I am currently using BigQuery's streaming option to load data into tables. However, tables that have date partitioning enabled do not show any partitions... I am aware that this is an effect of the streaming.
The Python code I use:
from google.cloud import bigquery

def stream_data(dataset_name, table_name, data):
    bigquery_client = bigquery.Client()
    dataset = bigquery_client.dataset(dataset_name)
    table = dataset.table(table_name)
    # Reload the table to get the schema.
    table.reload()
    rows = data
    errors = table.insert_data(rows)
    if not errors:
        print('Loaded 1 row into {}:{}'.format(dataset_name, table_name))
    else:
        print('Errors:')
        print(errors)
Will the date partitions eventually show up and, if not, how can I create an insert (load) job instead so that they do?

Not sure what you mean by "partitions not being shown", but when you create a partitioned table you will only ever see one single table.
The only difference here is that you can query in this table for date partitions, like so:
SELECT
*
FROM
mydataset.partitioned_table
WHERE
_PARTITIONTIME BETWEEN TIMESTAMP('2016-12-25')
AND TIMESTAMP('2016-12-31');
As you can see in this example, partitioned tables have the meta column _PARTITIONTIME and that's what you use to select the partitions you are interested in.
For more info, here are the docs explaining a bit more about querying data in partitioned tables.
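If you want to run that same partition-filtered query from Python, a minimal sketch with the google-cloud-bigquery client could look like the following; the project, dataset, and table names are placeholders:

from google.cloud import bigquery

client = bigquery.Client()

query = """
    SELECT *
    FROM `my-project.mydataset.partitioned_table`
    WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2016-12-25')
                             AND TIMESTAMP('2016-12-31')
"""

# Run the query and iterate over the rows in the selected partitions.
for row in client.query(query).result():
    print(dict(row))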

Related

Airflow: BigQuery SQL Insert empty data to the table

Using Airflow, I am trying to get the data from one table and insert it into another in BigQuery. I have 5 origin tables and 5 destination tables. My SQL query and Python logic work for 4 of the tables, successfully getting the data and inserting it into their respective destination tables, but they don't work for 1 table.
query = '''SELECT * EXCEPT(eventdate) FROM `gcp_project.gcp_dataset.gcp_table_1`
           WHERE id = "1234"
           AND eventdate = "2023-01-18"
        '''
# Delete the previous destination table if it exists
bigquery_client.delete_table("gcp_project.gcp_dataset.dest_gcp_table_1", not_found_ok=True)

job_config = bigquery.QueryJobConfig()
table_ref = bigquery_client.dataset(gcp_dataset).table(dest_gcp_table_1)
job_config.destination = table_ref
job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE

# Start the query, passing in the extra configuration.
query_job = bigquery_client.query(query=query,
                                  location='US',
                                  job_config=job_config)

# Check if the table is successfully written
while not query_job.done():
    time.sleep(1)
logging.info("Data is written into a destination table with {} number of rows for id {}."
             .format(query_job.result().total_rows, id))
I have even tried using the SQL query with CREATE OR REPLACE, but the result was still the same: table_1 comes out empty. I have also tried BigQueryInsertJobOperator, but table_1 still comes out empty.
Note: the table_1 data is around 270 MB with 1,463,306 rows; it is also the biggest of all the tables being copied into another table.
When I execute the above logic from my local machine it works fine for table_1 as well, and I can see the data in GCP BigQuery.
I am not sure what is happening behind the scenes. Does anyone have any idea why this is happening or what could cause it?
Found the root cause for this: the previous query, which is responsible for populating the origin table, was still running in the GCP BigQuery backend. Because of that, the above query did not get any data.
Solution: call query_job.result() on the populating query. This waits for that job to complete before the next query executes.
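A minimal sketch of that fix, assuming the job that populates the origin table is also launched with bigquery_client.query (the populating statement shown here is just a placeholder):

# Job that populates the origin table (placeholder query).
populate_job = bigquery_client.query("INSERT INTO `gcp_project.gcp_dataset.gcp_table_1` SELECT ...")
populate_job.result()  # Block until the origin table is fully populated.

# Only now start the copy into the destination table.
query_job = bigquery_client.query(query=query, location='US', job_config=job_config)
query_job.result()     # Block until the destination table is written.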

Can we insert in multiple tables using bigtable

I have a use case where I have to write event data to multiple tables in Bigtable using Python. Is this possible in Bigtable, or can it not be done?
When I try to write the data to multiple tables in the same code, this error occurs:
google.cloud.bigtable.table.TableMismatchError:
Please confirm whether we can do this with Bigtable or not.
Bigtable doesn't support writing to multiple tables in a single request, so you'd need to get a table object for each table and then group the writes by table. You could do something like this:
table1 = instance.table(table_id1)
table2 = instance.table(table_id2)

timestamp = datetime.datetime.utcnow()
column_family_id = "stats_summary"

# Rows destined for table1.
rows1 = [table1.direct_row("tablet#a0b81f74#20190501"),
         table1.direct_row("tablet#a0b81f74#20190502")]
rows1[0].set_cell(...)
rows1[0].set_cell(...)
rows1[1].set_cell(...)
rows1[1].set_cell(...)

# Rows destined for table2.
rows2 = [table2.direct_row("tablet#20190501#a0b81f74"),
         table2.direct_row("tablet#20190502#a0b81f74")]
rows2[0].set_cell(...)
rows2[0].set_cell(...)
rows2[1].set_cell(...)
rows2[1].set_cell(...)

# Send each batch of rows to its own table.
response1 = table1.mutate_rows(rows1)
response2 = table2.mutate_rows(rows2)
There are more examples of how to perform various writes in the Bigtable documentation, so once you create your other Bigtable table connections, you can just follow those examples.
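For reference, a filled-in version of one of those writes might look like the sketch below; the project, instance, table, column names, and values are placeholders chosen only to show the set_cell and mutate_rows calls:

import datetime
from google.cloud import bigtable

client = bigtable.Client(project="my-project", admin=True)
instance = client.instance("my-instance")
table1 = instance.table("table-id-1")

timestamp = datetime.datetime.utcnow()
column_family_id = "stats_summary"

row = table1.direct_row("tablet#a0b81f74#20190501")
# set_cell(column_family_id, column_qualifier, value, timestamp)
row.set_cell(column_family_id, "connected_cell", 1, timestamp)
row.set_cell(column_family_id, "os_build", "PQ2A.190405.003", timestamp)

# mutate_rows returns one status per row; a non-zero code means that row failed.
for status in table1.mutate_rows([row]):
    if status.code != 0:
        print("Error writing row: {}".format(status.message))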

How to delete a particular month from a parquet file partitioned by month

I have monthly Revenue data for the last 5 years and I am storing the DataFrames for the respective months in parquet format in append mode, partitioned by the month column. Here is the pseudo-code below -
def Revenue(filename):
    df = spark.read.load(filename)
    .
    .
    df.write.format('parquet').mode('append').partitionBy('month').save('/path/Revenue')

Revenue('Revenue_201501.csv')
Revenue('Revenue_201502.csv')
Revenue('Revenue_201503.csv')
Revenue('Revenue_201504.csv')
Revenue('Revenue_201505.csv')
The df gets stored in parquet format on a monthly basis, with one sub-folder per month partition.
Question: How can I delete the parquet folder corresponding to a particular month?
One way would be to load all these parquet files into a big df, use a .where() clause to filter out that particular month, and then save it back in parquet format, partitioned by month, in overwrite mode, like this -
from pyspark.sql.functions import col, lit

# If we want to remove data from Feb, 2015
df = spark.read.format('parquet').load('Revenue.parquet')
df = df.where(col('month') != lit('2015-02-01'))
df.write.format('parquet').mode('overwrite').partitionBy('month').save('/path/Revenue')
But this approach is quite cumbersome.
The other way is to directly delete the folder of that particular month, but I am not sure if that's the right way to approach things, lest we alter the metadata in an unforeseeable way.
What would be the right way to delete the parquet data for a particular month?
Spark supports dropping a partition, both data and metadata.
Quoting the Scala code comment:
/**
* Drop Partition in ALTER TABLE: to drop a particular partition for a table.
*
* This removes the data and metadata for this partition.
* The data is actually moved to the .Trash/Current directory if Trash is configured,
* unless 'purge' is true, but the metadata is completely lost.
* An error message will be issued if the partition does not exist, unless 'ifExists' is true.
* Note: purge is always false when the target is a view.
*
* The syntax of this command is:
* {{{
* ALTER TABLE table DROP [IF EXISTS] PARTITION spec1[, PARTITION spec2, ...] [PURGE];
* }}}
*/
In your case, there is no backing table.
We could register the DataFrame as a temp table and use the above syntax (temp table documentation).
From PySpark, we could run the SQL using the syntax in this link.
Sample:
spark.read.format('parquet').load('Revenue.parquet').registerTempTable("tmp")
spark.sql("ALTER TABLE tmp DROP IF EXISTS PARTITION (month='2015-02-01') PURGE")
The statement below will only delete the metadata related to the partition:
ALTER TABLE db.yourtable DROP IF EXISTS PARTITION(loaded_date="2019-08-22");
If you want to delete the data as well, you need to set the 'EXTERNAL' table property of your Hive external table to FALSE. This turns your Hive table into a managed table.
alter table db.yourtable set TBLPROPERTIES('EXTERNAL'='FALSE');
Afterwards you can set it back to an external table:
alter table db.yourtable set TBLPROPERTIES('EXTERNAL'='TRUE');
I tried setting these properties through the Spark session but ran into an issue:
spark.sql("""alter table db.test_external set tblproperties ("EXTERNAL"="TRUE")""")
pyspark.sql.utils.AnalysisException: u"Cannot set or change the preserved property key: 'EXTERNAL';"
I am sure there must be some way to do this, but I ended up using Python. I defined the function below in PySpark and it did the job.
query=""" hive -e 'alter table db.yourtable set tblproperties ("EXTERNAL"="FALSE");ALTER TABLE db.yourtable DROP IF EXISTS PARTITION(loaded_date="2019-08-22");' """
def delete_partition():
print("I am here")
import subprocess
import sys
p=subprocess.Popen(query,shell=True,stderr=subprocess.PIPE)
stdout,stderr = p.communicate()
if p.returncode != 0:
print stderr
sys.exit(1)
>>> delete_partition()
This will delete both the metadata and the data.
Note: I have tested this with a Hive ORC external partitioned table, which is partitioned on loaded_date.
# Partition Information
# col_name data_type comment
loaded_date string
Update:
Basically your data is lying at an HDFS location in subdirectories named like
/Revenue/month=2015-02-01
/Revenue/month=2015-03-01
/Revenue/month=2015-04-01
and so on
import subprocess

def delete_partition(month_delete):
    print("I am here")
    hdfs_path = "/some_hdfs_location/Revenue/month="
    final_path = hdfs_path + month_delete
    # Remove the partition directory for the given month directly from HDFS.
    subprocess.call(["hadoop", "fs", "-rm", "-r", final_path])
    print("got deleted")

delete_partition("2015-02-01")

Python - link database query to python dataframe

I have a spreadsheet with seven thousand rows of user ids. I need to query a database table and return results matching the ids in the spreadsheet.
My current approach is to read the entire database table into a pandas data frame and then merge it with another data frame created from the spreadsheet. I'd prefer not to read the entire table into memory due to its size. Is there any way to do this without reading in the entire table? In Access or SAS, I could write a query that joins the locally created table (i.e. created from the spreadsheet) with the database table.
Current code that reads the entire table into memory:
import pandas as pd

# read spreadsheet
external_file = pd.read_excel("userlist.xlsx")
# query
qry = "select id,term_code,group_code from employee_table"
# read table from Oracle database
oracle_data = pd.read_sql(qry, connection)
# merge spreadsheet with oracle data
df = pd.merge(external_file, oracle_data, on=['id', 'term_code'])
I realize the following isn't possible, but I would like to be able to query the database like this, where "external_file" is a data frame created from my spreadsheet (or at least find an equivalent solution):
query = """
select a.id,
a.term_code,
a.group_code
from employee_table a
inner join external_file b on a.id = b.id and a.term_code=b.term_code
"""
I think you could use xlwings (https://www.xlwings.org) to create a function that reads the id column and builds the query you want.
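In the same spirit, here is a sketch of that idea without xlwings, reusing the pandas calls already in the question: read the ids from the spreadsheet and push them into the database query in chunks, so only the matching rows ever come back. The chunk size and the string-quoting of ids are assumptions:

import pandas as pd

# ids from the spreadsheet
external_file = pd.read_excel("userlist.xlsx")
ids = external_file['id'].dropna().unique().tolist()

# Oracle allows at most 1000 literals in an IN list, so query in chunks.
chunk_size = 1000
chunks = []
for i in range(0, len(ids), chunk_size):
    # Assumes id is stored as a string; drop the quotes for numeric ids.
    id_list = ",".join("'{}'".format(x) for x in ids[i:i + chunk_size])
    qry = ("select id, term_code, group_code "
           "from employee_table where id in ({})".format(id_list))
    chunks.append(pd.read_sql(qry, connection))

oracle_data = pd.concat(chunks, ignore_index=True)

# Merge on term_code as well, as in the original code.
df = pd.merge(external_file, oracle_data, on=['id', 'term_code'])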

Create multiple views in shell script or big query

When I export data from MySQL to BigQuery, some rows are duplicated. As a way to fix this, I thought of creating views of these tables using ROW_NUMBER(); the query to do this is shown below. The problem is that a lot of tables in my dataset have duplicates, and when I add new tables and export them to BigQuery they will likely have duplicated data too, and I don't want to write this kind of query by hand every time I add a new table to my dataset (I want a view to be created for a table the moment I export it). Is it possible to do this in a loop in the query (like 'for each table in my dataset, do this')? Is it possible in a shell script (when exporting a table to BigQuery, create a view for it)? Failing that, is it possible in Python?
SELECT
* EXCEPT (ROW_NUMBER)
FROM
(
SELECT
*, ROW_NUMBER() OVER (PARTITION BY id order by updated_at desc) ROW_NUMBER
FROM dataset1.table1
)
WHERE ROW_NUMBER = 1
It can definitely be done in Python.
I would recommend using the google-cloud-python library https://github.com/GoogleCloudPlatform/google-cloud-python
So I think your script should be something like this:
from google.cloud import bigquery

client = bigquery.Client()
dataset_id = 'my-project.dataset_name'

tables = list(client.list_tables(dataset_id))
for tab in tables:
    view = bigquery.Table('{}.v_{}'.format(dataset_id, tab.table_id))
    view.view_query = 'select * from `{}.{}`'.format(dataset_id, tab.table_id)
    # if creating a legacy SQL view, set this to True instead
    view.view_use_legacy_sql = False
    client.create_table(view)
