How to update / delete in Snowflake from an AWS Glue script - Python

I want to delete a record in a Snowflake table from a DataFrame object.
Similarly, I want to perform an update on the basis of a "key" in the DataFrame against the Snowflake table.
My research indicates that the Utils method can perform DDL operations, but I am unable to find an example to refer to.

As you mentioned, you can use the runQuery() method of the Utils object to execute DDL/DML SQL statements:
https://docs.snowflake.net/manuals/user-guide/spark-connector-use.html#executing-ddl-dml-sql-statements
If you want to do it based on some keys, you can iterate over the rows of the DataFrame and run a SQL statement for each row:
how to loop through each row of dataFrame in pyspark
But this will be a performance killer. Snowflake is a data warehouse, so you should always prefer batch updates over single-row updates.
I would suggest writing your DataFrame to a staging table in Snowflake, and then calling a SQL statement to update the rows in the target table based on the staging table, along the lines of the sketch below.
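A rough sketch of that approach from an AWS Glue PySpark job, assuming the Snowflake Spark connector is available in the job. The option values, table names, and the KEY_COL / VAL_COL columns are placeholders, and reaching Utils.runQuery through sc._jvm is a commonly shown pattern rather than an official Python API, so verify it against your connector version:

sf_options = {
    "sfURL": "your_account.snowflakecomputing.com",
    "sfUser": "...",
    "sfPassword": "...",
    "sfDatabase": "...",
    "sfSchema": "...",
    "sfWarehouse": "...",
}

# 1) Batch-write the DataFrame to a staging table in Snowflake
df.write \
    .format("net.snowflake.spark.snowflake") \
    .options(**sf_options) \
    .option("dbtable", "STAGING_TABLE") \
    .mode("overwrite") \
    .save()

# 2) Run one set-based UPDATE (or DELETE) against the target table,
#    joining on the key column; sc is the Glue job's SparkContext
sc._jvm.net.snowflake.spark.snowflake.Utils.runQuery(sf_options, """
    UPDATE TARGET_TABLE t
    SET VAL_COL = s.VAL_COL
    FROM STAGING_TABLE s
    WHERE t.KEY_COL = s.KEY_COL
""")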

Related

How to exclude a column when creating BigQuery external table?

I'm trying to create an external table in BQ using data stored in a GCS bucket. Below is the DDL command I'm using:
CREATE OR REPLACE EXTERNAL TABLE `external table`
OPTIONS (
format = 'parquet',
uris = ['gs://...', 'gs://...']
);
How can I exclude a particular column from being imported to the external table, given that I cannot ALTER an external table to DROP COLUMN after it has been created?
There is no provision for selecting columns while loading data from GCS.
The following documentation provided by Google lists all possible configurations and properties for loading data into BigQuery from GCS, but the option you are looking for is not there.
https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-csv
But there are other ways to do this.
First option,
You can load the data into a temp table first, and then select the required columns from it to populate the second table.
Second option,
If you don't want to do this manually, you can make use of Cloud Run on BigQuery events. Whenever you load data into the first table, it triggers Cloud Run, in which you can write your code to remove the column you do not want and insert the rest into the second table.
https://cloud.google.com/blog/topics/developers-practitioners/how-trigger-cloud-run-actions-bigquery-events
Third option,
If this is just a one-time activity, you can load the whole data into one table and then create a view with the required columns on top of it.
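If you go the view route, a minimal sketch using the BigQuery Python client and SELECT * EXCEPT; the dataset, table, view, and column names here are placeholders:

from google.cloud import bigquery

client = bigquery.Client()  # assumes application default credentials

# "mydataset.raw_external" is the external table over the GCS files,
# and "unwanted_col" is the column to hide from consumers
client.query("""
    CREATE OR REPLACE VIEW `mydataset.filtered_view` AS
    SELECT * EXCEPT (unwanted_col)
    FROM `mydataset.raw_external`
""").result()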

Upload data to Exasol from a Python DataFrame

I wonder if there's any way to upload a dataframe and create a new table in Exasol? import_from_pandas assumes the table already exists. Do we need to run a SQL statement separately to create the table? For other databases, to_sql can just create the table if it doesn't exist.
Yes, as you mentioned, import_from_pandas requires an existing table, so you need to create the table before writing to it. You can run a SQL create table ... script via connection.execute before using import_from_pandas. Also, to_sql needs a table, since according to the documentation it is translated into a SQL INSERT command.
Pandas to_sql can create a new table if it does not exist, but it needs an SQLAlchemy connection, which is not supported for Exasol out of the box. However, there seems to be a SQLAlchemy dialect for Exasol you could use (I haven't tried it yet): sqlalchemy-exasol.
Alternatively, I think you have to use a CREATE TABLE statement and then populate the table via pyexasol's import_from_pandas, roughly like this:
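A minimal sketch with pyexasol; the DSN, credentials, schema, table name, and column types are placeholders you'd adapt to your data:

import pandas as pd
import pyexasol

# Hypothetical connection parameters
conn = pyexasol.connect(dsn="exasol-host:8563", user="sys",
                        password="...", schema="MY_SCHEMA")

df = pd.DataFrame({"ID": [1, 2], "NAME": ["a", "b"]})

# import_from_pandas does not create the table, so create it first
conn.execute("""
    CREATE TABLE IF NOT EXISTS MY_TABLE (
        ID   DECIMAL(18,0),
        NAME VARCHAR(100)
    )
""")
conn.import_from_pandas(df, "MY_TABLE")
conn.close()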

What is the best way of updating a BigQuery table from a pandas DataFrame with many rows?

I have a dataset in BigQuery with 100 thousand+ rows and 10 columns. I'm also continuously adding new data to the dataset. I want to fetch the data that has not been processed, process it, and write it back to my table. Currently, I'm fetching it into a pandas dataframe using the bigquery Python library and processing it using pandas.
Now, I want to update the table with the new pre-processed data. One way of doing it is using a SQL statement and calling the query function of the bigquery.Client() class, or using a job like here.
bqclient = bigquery.Client(
    credentials=credentials,
    project=project_id,
)
query = """UPDATE `dataset.table` SET field_1 = '3' WHERE field_2 = '1'"""
bqclient.query(query)
But it doesn't make sense to create an update statement for each row.
Another way I found is using the to_gbq function of the pandas-gbq package. The disadvantage of this is that it rewrites the whole table.
Question: What is the best way of updating a BigQuery table from a pandas dataframe?
Google BigQuery is mainly used for data analysis when your data is static and you don't have to update values, since the architecture is designed for that kind of workload. Therefore, if you want to update the data, there are some options, but they are very heavy:
1. The one you mentioned: run a query and update the rows one by one.
2. Recreate the table using only the new values.
3. Append the new data with a different timestamp.
4. Use partitioned tables [1] and, if possible, clustered tables [2]. This way, when you want to update the table, you can use the partitioned and clustered columns in the update and the query will be less heavy. Also, you can append the new data in a new partition, let's say the one for the current day.
If you are using the data for analytical reasons, maybe the best options are 2 and 3, but I always recommend having [1] and [2].
[1] https://cloud.google.com/bigquery/docs/querying-partitioned-tables
[2] https://cloud.google.com/bigquery/docs/clustered-tables
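As a rough illustration of option 3 (appending the pre-processed rows as a new batch instead of updating them in place), a minimal sketch with the BigQuery client; the table id and column names are hypothetical, and the target table is assumed to already have a processed_at timestamp column:

import pandas as pd
from google.cloud import bigquery

client = bigquery.Client()  # assumes application default credentials

# Hypothetical pre-processed rows, tagged with a processing timestamp
processed = pd.DataFrame({"field_1": ["3"], "field_2": ["1"]})
processed["processed_at"] = pd.Timestamp.utcnow()

# Append the batch instead of updating rows one by one
job_config = bigquery.LoadJobConfig(write_disposition="WRITE_APPEND")
job = client.load_table_from_dataframe(processed, "dataset.table",
                                       job_config=job_config)
job.result()  # wait for the load job to complete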

What happens if cursor.executemany fails in Python

I am using cursor.executemany to insert thousands of rows into a Snowflake database from some other source. If the insert fails for some reason, does it roll back all the inserts?
Is there some way to insert only if the same row does not exist yet? There is no primary key or unique key on the table.
If the insert fails for some reason, does it roll back all the inserts?
The cursor.executemany(…) implementation in Snowflake's Python Connector generates a single multi-row INSERT INTO command whose values are pre-evaluated by the query compiler before the inserts run, so they all run together or fail early if a value is unacceptable for its column type.
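For illustration only, a minimal executemany call with the Snowflake Python Connector; the connection parameters, table, and columns are placeholders:

import snowflake.connector

conn = snowflake.connector.connect(account="...", user="...", password="...",
                                   warehouse="...", database="...", schema="...")
cur = conn.cursor()

# The connector rewrites this into one multi-row INSERT, so the whole
# batch either succeeds together or fails during compilation
rows = [(1, "a"), (2, "b"), (3, "c")]
cur.executemany("INSERT INTO MY_TABLE (COL1, COL2) VALUES (%s, %s)", rows)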
Is there some way to insert only if the same row does not exist yet? There is no primary key or unique key on the table.
If there are no ID-like columns, you'll need to define a condition that qualifies two rows to be the same (such as a multi-column match).
Assuming your new batch of inserts is in a temporary table TEMP, the following SQL can insert into the DESTINATION table by checking all rows against a set built from the DESTINATION table.
Using HASH(…) as a basis for comparison, comparing all columns in each row together (in order):
INSERT INTO DESTINATION
SELECT *
FROM TEMP
WHERE
HASH(*) NOT IN ( SELECT HASH(*) FROM DESTINATION )
As suggested by Christian in the comments, the MERGE SQL command can also be used once you have an identifier strategy (join keys). This too requires the new rows to be placed in a temporary table first, and it offers the ability to perform an UPDATE if a row is already found.
Note: HASH(…) may have collisions and isn't the best fit. It is better to form an identifier from one or more of your table's columns and compare those. Your question lacks information about table and data characteristics, so I've picked a very simple approach involving HASH(…) here.
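For reference, a rough sketch of the MERGE variant driven from the Python Connector; the join keys KEY1/KEY2, the payload column COL3, and the connection parameters are all hypothetical, and the new rows are assumed to have been loaded into TEMP beforehand:

import snowflake.connector

conn = snowflake.connector.connect(account="...", user="...", password="...",
                                   warehouse="...", database="...", schema="...")
cur = conn.cursor()

# Update matching rows, insert the rest, in one set-based statement
cur.execute("""
    MERGE INTO DESTINATION d
    USING TEMP t
      ON d.KEY1 = t.KEY1 AND d.KEY2 = t.KEY2
    WHEN MATCHED THEN UPDATE SET d.COL3 = t.COL3
    WHEN NOT MATCHED THEN INSERT (KEY1, KEY2, COL3)
      VALUES (t.KEY1, t.KEY2, t.COL3)
""")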

Pandas: Record count inserted by Python to_sql function

I am using the Python to_sql function to insert data into a database table from a Pandas dataframe.
I am able to insert data into the database table, but I want to know in my code how many records were inserted.
How can I know the record count of inserts (I do not want to write one more query against the database table to get the record count)?
Also, is there a way to see logs for this function's execution, like which queries were executed?
There is no way to do this, since Python cannot know how many of the records being inserted were already in the table.
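Regarding the logging part of the question: if you are passing an SQLAlchemy engine to to_sql (an assumption about your setup), enabling echo on the engine will log every statement it emits, including the INSERTs generated by to_sql. A minimal sketch:

import pandas as pd
from sqlalchemy import create_engine

# echo=True makes SQLAlchemy log each emitted SQL statement
engine = create_engine("sqlite:///example.db", echo=True)

df = pd.DataFrame({"a": [1, 2, 3]})
df.to_sql("my_table", engine, if_exists="append", index=False)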
