Pandas: Record count inserted by Python to_sql function - python

I am using the Python to_sql function to insert data into a database table from a Pandas dataframe.
I am able to insert data into the database table, but I want to know in my code how many records were inserted.
How can I get the record count of the inserts (I do not want to write one more query against the database table to get the record count)?
Also, is there a way to see logs for this function's execution, like which queries were executed, etc.?

There is no way to do this, since Python cannot know how many of the records being inserted were already in the table.
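That said, for the second part of the question (seeing what was executed), SQLAlchemy can echo every statement it sends, and len(df) tells you how many rows were handed to the database in the first place. A minimal sketch, assuming a SQLAlchemy engine; the connection string and table name are placeholders:

import pandas as pd
from sqlalchemy import create_engine

# echo=True makes SQLAlchemy log every SQL statement it executes,
# so you can see the INSERTs that to_sql generates.
engine = create_engine("sqlite:///example.db", echo=True)  # placeholder connection string

df = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})

rows_sent = len(df)  # number of rows handed to the database
df.to_sql("my_table", engine, if_exists="append", index=False)
print(f"Attempted to insert {rows_sent} rows")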

Related

Creating and loading to a Postgres partitioned table

I am trying to read XML file(s) in Python, parse them, extract the required fields, and insert the extracted data into a Postgres table. I am new to Python and Postgres, so I was hoping someone could clarify some questions I have in mind.
The requirement is that there will be two target tables in Postgres for every XML file (of a certain business entity, e.g. customers, products, etc.) that should be received and read on a particular day - CURRENT and HISTORY.
The CURRENT table (e.g. CUST_CURR) is supposed to hold the latest data received for a particular run (the current day's file), and the HISTORY table (CUST_HIST) will contain the history of all the data received up to the previous run - i.e. the records for every run just keep getting appended to the HIST table.
However, the requirement is to make the HIST table a PARTITIONED table (to improve query response time by partition-pruning) based on the current process run date. In other words, during a particular run, the CURR table needs to be truncated and loaded with the day's extracted records, and the records already existing in the CURR table should be copied/inserted/appended into the HIST table in a NEW Partition (of the HIST table) based on the run date.
Now, when I searched the internet to learn more about Postgres table PARTITIONing, it appears that to create NEW partitions, a new table (with a different name) needs to be created manually every time to represent that partition, per the documentation. The example there shows a CREATE TABLE statement for creating a partition:
CREATE TABLE CUST_HIST_20220630 PARTITION OF CUST_HIST
FOR VALUES FROM ('2022-06-30') TO ('2022-07-01')
I am sure I have misinterpreted this, but can anyone please correct me and help clear up my confusion?
So if anyone wants to query the HIST table with a run-date filter (assuming the partitions are created on the run_dt column), does the user have to query that particular (sub)table (something like SELECT * FROM CUST_HIST_20220630 WHERE run_dt >= '2022-05-31') instead of the main partitioned table (SELECT * FROM CUST_HIST)?
In other RDBMSs (Oracle, Teradata, etc.) the partitions are created automatically when the data is loaded, and they remain part of the same table. When a user queries the table on the partitioning column, the DB optimizer understands this, prunes the unnecessary partitions, and reads only the required partition(s), thereby greatly improving response time.
Could someone please clear up my confusion? Is there a way to automate partition creation while loading data into a Postgres table using Python (psycopg2)? I am new to Postgres and Python, so please forgive my naivety.
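For illustration, a rough psycopg2 sketch of how partition creation could be scripted as part of the daily load, assuming a CUST_HIST parent table partitioned by RANGE (run_dt); the DSN, table names and columns are placeholders:

import datetime
import psycopg2
from psycopg2 import sql

run_date = datetime.date(2022, 6, 30)            # current process run date
next_day = run_date + datetime.timedelta(days=1)
partition_name = f"cust_hist_{run_date:%Y%m%d}"  # e.g. cust_hist_20220630

conn = psycopg2.connect("dbname=mydb user=myuser password=secret")  # placeholder DSN
with conn, conn.cursor() as cur:
    # Create the partition for this run date if it does not already exist.
    cur.execute(
        sql.SQL(
            "CREATE TABLE IF NOT EXISTS {} PARTITION OF cust_hist "
            "FOR VALUES FROM (%s) TO (%s)"
        ).format(sql.Identifier(partition_name)),
        (run_date.isoformat(), next_day.isoformat()),
    )
    # Append the current CURR records to the parent HIST table; Postgres routes
    # each row to the matching partition automatically based on run_dt.
    cur.execute("INSERT INTO cust_hist SELECT * FROM cust_curr")  # column lists depend on your schema
conn.close()

Afterwards, querying the parent table with a run_dt filter (SELECT * FROM cust_hist WHERE run_dt >= ...) lets the planner prune to the relevant partitions, so the date-suffixed sub-tables do not have to be queried directly.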

How do you handle large query results for a simple SELECT in BigQuery with the Python client library?

I have a table to which I wrote 1.6 million records, and each record has two columns: an ID and a JSON string column.
I want to select all of those records and write the JSON in each row out as a file. However, the query result is too large, and I get the 403 associated with that:
"403 Response too large to return. Consider specifying a destination table in your job configuration."
I've been looking at the below documentation around this and understand that they recommend specifying a table for the results and viewing them there, BUT all I want to do is select * from the table, so that would effectively just be copying it over, and I feel like I would run into the same issue querying that result table.
https://cloud.google.com/bigquery/docs/reference/standard-sql/introduction
https://cloud.google.com/bigquery/docs/reference/rest/v2/Job#JobConfigurationQuery.FIELDS.allow_large_results
What is the best practice here? Pagination? Table sampling? list_rows?
I'm using the Python client library, as stated in the question title. My current code is just this:
query = f'SELECT * FROM `{project}.{dataset}.{table}`'
return client.query(query)
I should also mention that the IDs are not sequential, they're just alphanumerics.
The best practice and most efficient way is to export your data and then download it, instead of querying the whole table (SELECT *).
From there, you can extract the data you need from the exported files (e.g. CSV, JSON, etc.) using Python code, without having to wait for a SELECT * query to finish.
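A rough sketch of that approach with the Python client, using an extract job to export the table to newline-delimited JSON files in a GCS bucket; the bucket path and identifiers below are placeholders:

from google.cloud import bigquery

client = bigquery.Client()

project, dataset, table = "my-project", "my_dataset", "my_table"  # same identifiers as in the question
destination_uri = "gs://my-bucket/exports/my_table-*.json"        # placeholder path; * lets BigQuery shard the export

job_config = bigquery.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON
)
extract_job = client.extract_table(
    f"{project}.{dataset}.{table}",
    destination_uri,
    job_config=job_config,
)
extract_job.result()  # wait for the export to finish

The exported files can then be downloaded (e.g. with the google-cloud-storage client) and split into one file per row locally, without ever running SELECT * through the query API.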

Upload data to Exasol from a Python dataframe

I wonder if there's any way to upload a dataframe and create a new table in Exasol? import_from_pandas assumes the table already exists. Do we need to run a separate SQL statement to create the table? For other databases, to_sql can just create the table if it doesn't exist.
Yes, as you mentioned, import_from_pandas requires an existing table, so you need to create the table before writing to it. You can run a create table ... statement with connection.execute before using import_from_pandas. Also, to_sql needs a table, since according to the documentation it is translated into a SQL insert command.
Pandas to_sql can create a new table if it does not exist, but it needs an SQLAlchemy connection, which is not supported for Exasol out of the box. However, there seems to be an SQLAlchemy dialect for Exasol you could use (I haven't tried it yet): sqlalchemy-exasol.
Alternatively, I think you have to use a create table statement and then populate the table via pyexasol's import_from_pandas.
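Putting those two steps together, a small pyexasol sketch; the DSN, credentials, schema and column definitions are placeholders and would need to match your dataframe:

import pandas as pd
import pyexasol

df = pd.DataFrame({"ID": [1, 2], "NAME": ["alice", "bob"]})  # placeholder dataframe

conn = pyexasol.connect(dsn="exasol-host:8563", user="sys", password="secret", schema="MY_SCHEMA")  # placeholders

# import_from_pandas expects the table to exist, so create it first.
conn.execute("CREATE TABLE IF NOT EXISTS MY_TABLE (ID DECIMAL(18,0), NAME VARCHAR(100))")

# Bulk-load the dataframe into the (now existing) table.
conn.import_from_pandas(df, "MY_TABLE")
conn.close()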

How to update/delete in Snowflake from an AWS Glue script

I want to delete records in a Snowflake table based on a dataframe object.
Similarly, I want to perform updates on the Snowflake table on the basis of a "key" in the dataframe.
My research indicates that the Utils method can perform DDL/DML operations, but I am unable to find an example to refer to.
As you mentioned, you can use the runQuery() method of the Utils object to execute DDL/DML SQL statements:
https://docs.snowflake.net/manuals/user-guide/spark-connector-use.html#executing-ddl-dml-sql-statements
If you want to do it based on some keys, you can iterate over the rows of the DataFrame and run a SQL statement for each row:
how to loop through each row of dataFrame in pyspark
But this will be a performance killer. Snowflake is a data warehouse, so you should always prefer "batch updates" instead of single row updates.
I would suggest writing your dataframe to a staging table in Snowflake and then running a SQL statement to update the rows in the target table based on the staging table.
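For illustration, a sketch of that staging-table pattern in a Glue/PySpark job, assuming the Snowflake Spark connector is available to the job; the connection options, table names and key/columns are placeholders:

from pyspark.sql import SparkSession

SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Placeholder for the real dataframe holding the keys and new values.
df = spark.createDataFrame([(1, "new value")], ["KEY", "COL1"])

sf_options = {  # placeholder connection options
    "sfURL": "myaccount.snowflakecomputing.com",
    "sfUser": "user",
    "sfPassword": "password",
    "sfDatabase": "MY_DB",
    "sfSchema": "PUBLIC",
    "sfWarehouse": "MY_WH",
}

# 1) Write the dataframe to a staging table (replacing any previous run's contents).
df.write.format(SNOWFLAKE_SOURCE_NAME) \
    .options(**sf_options) \
    .option("dbtable", "TARGET_STAGING") \
    .mode("overwrite") \
    .save()

# 2) Apply the changes in one batch statement via Utils.runQuery.
merge_sql = """
    MERGE INTO TARGET t
    USING TARGET_STAGING s ON t.KEY = s.KEY
    WHEN MATCHED THEN UPDATE SET t.COL1 = s.COL1
    WHEN NOT MATCHED THEN INSERT (KEY, COL1) VALUES (s.KEY, s.COL1)
"""
sc._jvm.net.snowflake.spark.snowflake.Utils.runQuery(sf_options, merge_sql)

A MERGE is shown here; a plain UPDATE ... FROM or DELETE ... USING against the staging table works the same way.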

Write from a query to a table in BigQuery only if the query result is not empty

In BigQuery it's possible to write the results of a query to a new table. I'd like the table to be created only when the query returns at least one row; basically, I don't want to end up creating an empty table. I can't find an option to do that. (I am using the Python library, but I suppose the same applies to the raw API.)
Since you have to specify the destination in the query definition, and you don't know what it will return until you run it, can you tack a LIMIT 1 onto the end?
You can check the row count in the job result object and then, if there are results, re-run the query without the limit into your new table.
There's no option to do this in one step. I'd recommend running the query, inspecting the results, and then performing a table copy with WRITE_TRUNCATE to commit the results to the final location if the intermediate output contains at least one row.
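A sketch of that two-step approach with the Python client; the dataset/table names and the query string are placeholders:

from google.cloud import bigquery

client = bigquery.Client()

intermediate = "my-project.my_dataset.query_results_tmp"  # placeholder intermediate table
final = "my-project.my_dataset.final_table"               # placeholder final destination

# 1) Run the query into an intermediate destination table.
query_config = bigquery.QueryJobConfig(
    destination=intermediate,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
rows = client.query("SELECT ...", job_config=query_config).result()  # "SELECT ..." stands in for your query

# 2) Copy the results to the final table only if at least one row came back.
if rows.total_rows > 0:
    copy_config = bigquery.CopyJobConfig(write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE)
    client.copy_table(intermediate, final, job_config=copy_config).result()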
