I am trying to read XML file(s) in Python, parse them, extract the required fields, and insert the extracted data into a Postgres table. I am new to Python and Postgres, so I was hoping someone could clarify a few questions I have.
The requirement is that there will be 2 target tables in Postgres for every XML file (of a certain business entity e.g. customers, product etc) that should be received and read on a particular day - CURRENT and HISTORY.
The CURRENT table (e.g. CUST_CURR) is supposed to hold the latest data received for a particular run (the current day's file), and the HISTORY table (CUST_HIST) will contain the history of all the data received up to the previous run - i.e. every run just keeps appending its records to the HIST table.
However, the requirement is to make the HIST table a PARTITIONED table (to improve query response time by partition-pruning) based on the current process run date. In other words, during a particular run, the CURR table needs to be truncated and loaded with the day's extracted records, and the records already existing in the CURR table should be copied/inserted/appended into the HIST table in a NEW Partition (of the HIST table) based on the run date.
Now, when I searched the internet to learn more about partitioning tables in Postgres, it appears that to create NEW partitions, new tables need to be created manually (with a different name) every time, each one representing a partition (per the documentation?). The example shows a CREATE TABLE statement for creating a partition:
CREATE TABLE CUSTOMER_HIST_20220630 PARTITION OF CUSTOMER_HIST
FOR VALUES FROM ('2022-06-30') TO ('2022-07-01');
I am sure I have misinterpreted this, but can anyone please correct me and help clear up the confusion?
So if anyone wants to query the HIST table with a run-date filter (assuming the partitions are created on the run_date column), does the user have to query that particular (sub)table (something like SELECT * FROM CUST_HIST_20220630 WHERE run_dt >= '2022-05-31') instead of the main partitioned table (SELECT * FROM CUSTOMER_HIST)?
In other RDBMSs (Oracle, Teradata, etc.) the partitions are created automatically when the data is loaded and they remain part of the same table. When a user queries the table on the partitioning column, the optimizer understands this and prunes the unnecessary partitions, reading only the required partition(s) and thereby greatly improving response time.
Could someone please clear up my confusion? Is there a way to automate partition creation while loading data into a Postgres table using Python (psycopg2)? I am new to Postgres and Python, so please forgive my naivety.
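For context, here is roughly what I imagine a run could look like with psycopg2. All table/column names here are hypothetical (cust_curr, cust_hist, run_dt), and it assumes cust_hist has already been created once as a range-partitioned parent, e.g. CREATE TABLE cust_hist (LIKE cust_curr) PARTITION BY RANGE (run_dt), so that queries against the parent with a run_dt filter get pruned to the matching partitions automatically.

# Sketch only - hypothetical names; assumes cust_hist is a range-partitioned
# parent on run_dt and that cust_curr has the same column layout.
import datetime
import psycopg2

prev_run_date = datetime.date(2022, 6, 30)           # run date carried by the rows currently in cust_curr
next_day = prev_run_date + datetime.timedelta(days=1)
part_name = f"cust_hist_{prev_run_date:%Y%m%d}"

conn = psycopg2.connect("dbname=mydb user=myuser")   # adjust connection details
with conn, conn.cursor() as cur:
    # One-day partition for that run date (bounds come from code, not user input).
    cur.execute(
        f"CREATE TABLE IF NOT EXISTS {part_name} PARTITION OF cust_hist "
        f"FOR VALUES FROM ('{prev_run_date}') TO ('{next_day}')"
    )
    # Append the previous run's CURR rows; Postgres routes them to the new partition.
    cur.execute("INSERT INTO cust_hist SELECT * FROM cust_curr")
    # Reset CURR for today's load.
    cur.execute("TRUNCATE TABLE cust_curr")
    # ... then insert the rows parsed from today's XML file into cust_curr.
conn.close()

Is something along these lines the intended way to do it?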
Related
I have the following partitions in my table partitioned by 'DATE'
Row partition_id
1 20210222
2 20210223
I am trying to overwrite one of these partitions, '20210222', using the BigQuery Python API.
My table name is table_name$20210222 and I am using WRITE_TRUNCATE as the write disposition, but I am getting the following error:
google.api_core.exceptions.BadRequest: 400 Some rows belong to different partitions
rather than destination partition 20210222
I want to be able to overwrite just one of these partitions through my Python code. The load works with both WRITE_APPEND and WRITE_TRUNCATE, but WRITE_APPEND adds duplicates and WRITE_TRUNCATE deletes the whole table's previous contents and only adds the new data. I want to replace the data for a single existing partition.
BigQuery converts and stores all TIMESTAMP values in the UTC timezone. So, if you have a table partitioned by DAY and you try to insert a timestamp in a timezone other than UTC, it will first be converted to UTC and then stored in the relevant partition.
In your case, it seems that you had a value like 2020-12-25 00:30:00 CET, which when converted to UTC on storage becomes 2020-12-24 23:30:00 UTC, so it belongs to a different partition than other values from the same day (in your timezone).
That leads to the error that you encountered because you are trying to overwrite a specific partition and some of the entries that you are trying to insert belong to a different one.
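To illustrate the conversion with plain Python (nothing BigQuery-specific, just the timezone math):

# A CET timestamp from "Dec 25" lands in the Dec 24 UTC day partition.
from datetime import datetime, timezone, timedelta

cet = timezone(timedelta(hours=1))
local_ts = datetime(2020, 12, 25, 0, 30, tzinfo=cet)
utc_ts = local_ts.astimezone(timezone.utc)
print(utc_ts)         # 2020-12-24 23:30:00+00:00
print(utc_ts.date())  # 2020-12-24 -> partition 20201224, not 20201225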
You might want to try using the MERGE statement directly to overwrite the data in that particular partition, as below.
MERGE INTO `table` AS INTERNAL_DEST
USING (
  SELECT * FROM data_to_overwrite
) AS INTERNAL_SOURCE
ON FALSE
WHEN NOT MATCHED BY SOURCE
  AND INTERNAL_DEST.partition_id BETWEEN 20210222 AND 20210222
  -- the range of the partitions to be overwritten
  THEN DELETE
WHEN NOT MATCHED THEN INSERT ROW
UPDATE: additional notes and caveats.
I would say this is the only feasible way I know of to update data like this: wrap this query in Python code and call it to overwrite the historical data (see the sketch below the caveats).
Please note that overwriting historical data goes against BigQuery best practices for loading data. It can be used sparingly for batch updates, but it should never be used as a transactional update operation like what we do in a transactional database.
This is considered a DML operation, which has a much lower quota compared to other data-loading approaches. See this post for details about the differences between BigQuery loading strategies: BigQuery: too many table dml insert operations for this table
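As a rough sketch of that wrapping, assuming the google-cloud-bigquery client library and hypothetical project/dataset names:

# Sketch only - runs the partition-overwrite MERGE from Python.
from google.cloud import bigquery

client = bigquery.Client()  # relies on default application credentials

merge_sql = """
MERGE INTO `my_project.my_dataset.table` AS INTERNAL_DEST
USING (
  SELECT * FROM `my_project.my_dataset.data_to_overwrite`
) AS INTERNAL_SOURCE
ON FALSE
WHEN NOT MATCHED BY SOURCE
  AND INTERNAL_DEST.partition_id BETWEEN 20210222 AND 20210222
  THEN DELETE
WHEN NOT MATCHED THEN INSERT ROW
"""

job = client.query(merge_sql)  # submits the DML job
job.result()                   # waits for completion; raises on error
print("Rows affected:", job.num_dml_affected_rows)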
Presently, we send entire files to the Cloud (Google Cloud Storage) to be imported into BigQuery and do a simple drop/replace. However, as the file sizes have grown, our network team doesn't particularly like the bandwidth we are taking while other ETLs are also trying to run. As a result, we are looking into sending up changed/deleted rows only.
Trying to find the path/help docs on how to do this. Scope - I will start with a simple example. We have a large table with 300 million records. Rather than sending 300 million records every night, we would send over the X million that have changed or been deleted. I then need to incorporate those changed/deleted records into the BigQuery tables.
We presently use Node JS to move from Storage to BigQuery and Python via Composer to schedule native table updates in BigQuery.
Hope to get pointed in the right direction for how to start down this path.
Stream the full row on every update to BigQuery.
Let the table accommodate multiple rows for the same primary entity.
Write a view, e.g. table_last, that picks the most recent row (see the sketch below).
This way all your queries run near-realtime on the real data.
You can occasionally deduplicate the table by running a query that rewrites the table keeping only the latest row per entity.
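For example, the table_last view could look roughly like this (hypothetical project, dataset and column names):

# Sketch only - a view exposing just the most recent row per primary entity.
from google.cloud import bigquery

client = bigquery.Client()
client.query("""
CREATE OR REPLACE VIEW `my_project.my_dataset.table_last` AS
SELECT * EXCEPT (rn)
FROM (
  SELECT
    t.*,
    ROW_NUMBER() OVER (PARTITION BY entity_id ORDER BY updated_at DESC) AS rn
  FROM `my_project.my_dataset.table` AS t
)
WHERE rn = 1
""").result()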
Another approach is to have one final table and one table that you stream into, plus a MERGE statement scheduled to run every X minutes that writes the updates from the streamed table into the final table.
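A rough sketch of that MERGE (hypothetical names again; it could be run by a BigQuery scheduled query or from Python as below):

# Sketch only - fold the streamed staging table into the final table.
from google.cloud import bigquery

client = bigquery.Client()
client.query("""
MERGE `my_project.my_dataset.final_table` AS T
USING (
  SELECT * EXCEPT (rn)
  FROM (
    SELECT s.*,
           ROW_NUMBER() OVER (PARTITION BY entity_id ORDER BY updated_at DESC) AS rn
    FROM `my_project.my_dataset.stream_table` AS s
  )
  WHERE rn = 1
) AS S
ON T.entity_id = S.entity_id
WHEN MATCHED THEN UPDATE SET payload = S.payload, updated_at = S.updated_at
WHEN NOT MATCHED THEN INSERT ROW
""").result()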
I have a table in SQL Server and the table already has data for the month of November. I have to insert data for the previous months, January through October. I have the data in a spreadsheet. I want to do a bulk insert using Python. I have successfully established the connection to the server using Python and I am able to access the table. However, I don't know how to insert data above the rows that are already present in the table on the server. The table doesn't have any constraints, primary keys or indexes.
I am not sure whether an insert like this is possible. If it is, kindly share some clues.
Notes: I don't have access to SSIS. I can't use BULK INSERT because I can't map my shared drive to the SQL Server machine. That's why I have decided to use a Python script for this operation.
SQL Server Management Studio is just the GUI for interacting with SQL Server.
However, I don't know how to insert data above the rows that are already present in the table on the server
Tables are ordered or structured based on the clustered index. Since you don't have one (you said there aren't any PKs or indexes), inserting the records "below" or "above" existing rows isn't something that can happen. A table without a clustered index is called a heap, which is what you have.
Thus, just insert the data. The order of the results will be determined by any ORDER BY clause you place on a statement, or by the clustered index on the table if you create one.
I assume you think your data is ordered because, by chance, when you run SELECT * FROM table your results appear to be in the same order each time. However, this blog will show you that this isn't guaranteed and elaborates on the fact that your results truly aren't ordered without an ORDER BY clause.
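As a minimal sketch of "just insert the data", assuming pyodbc (any DB-API driver works similarly); the table and column names here are made up, and the rows would come from reading the spreadsheet with something like openpyxl or pandas:

# Sketch only - batch-insert the spreadsheet rows for Jan..Oct.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes"
)
cursor = conn.cursor()
cursor.fast_executemany = True  # sends the rows in batches instead of one by one

rows = [
    ("2021-01-05", 123.45, "ABC"),
    ("2021-02-11", 678.90, "DEF"),
]  # ...rows read from the spreadsheet
cursor.executemany(
    "INSERT INTO dbo.MonthlyData (entry_date, amount, code) VALUES (?, ?, ?)",
    rows,
)
conn.commit()
conn.close()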
I hope I can make this as clear as possible. I am maintaining data regarding stage movements, i.e. when someone moves into a stage and when someone moves out. I want the BigQuery table to have a single entry for each stage movement (due to the kind of queries I'll be running on the data), but there are two updates, one for in and one for out, so this is what I am doing:
Normal streaming insert when someone moves into a stage.
While moving out:
a. Copy the table, minus that row, back to the same destination using a query like
SELECT * FROM my_dataset.my_table WHERE id != "id"
b. Do a streaming insert for the new row.
The problem is, there are random data drops when doing streaming inserts after the copy operation.
I found this link: After recreating BigQuery table streaming inserts are not working?
where it has been mentioned that there should be a delay of more than 2 minutes before doing streaming inserts in this case to avoid data drops. But I want it to be instantaneous, since multiple stage movements can happen within a few seconds. Is there a workaround or a fix for this? Or do I have to rethink my complete process on an append-only basis, which isn't looking likely right now?
do I have to rethink my complete process on an append-only basis?
My suggestion for your particular case would be not to truncate the table on each and every "move out".
Assuming you have a field that identifies the most recent row (a timestamp, an ordering column, etc.), you can easily filter out the old rows with something like:
SELECT <your_fields>
FROM (
  SELECT
    <your_fields>,
    ROW_NUMBER() OVER(PARTITION BY id ORDER BY timestamp DESC) AS most_recent_row
  FROM my_dataset.my_table
)
WHERE most_recent_row = 1
If needed, you can do a daily purge of the "old/not latest" rows into a truncated table using the very same approach as above.
where it has been mentioned?
Maybe not explicitly your case, but check the Data availability section.
Also, in How to change the template table schema, read the third paragraph (I feel it is related).
I have an HTML file on the network which updates almost every minute with new rows in a table. At any point, the file contains close to 15,000 rows. I want to create a MySQL table with all the data from that table, plus some additional columns that I compute from the available data.
The HTML table contains, say, rows from the last 3 days. I want to store all of them in my MySQL table and update the table every hour or so (can this be done via cron?).
For connecting to the DB, I'm using MySQLdb, which works fine. However, I'm not sure what the best practices are. I can scrape the data using bs4 and connect to the table using MySQLdb. But how should I update the table? What scraping logic would use the least resources?
I am not fetching any results, just scraping and writing.
Any pointers, please?
My suggestion: instead of updating values row by row, try a bulk insert into a temporary table and then move the data into the actual table based on some timing key. If you have a key column, that will also make it easy to read the recent rows you added.
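A rough sketch of that idea with MySQLdb (hypothetical table and column names):

# Sketch only - bulk-load scraped rows into a staging table, then move the
# new ones into the actual table based on a timestamp "timing key".
import MySQLdb

conn = MySQLdb.connect(host="localhost", user="user", passwd="secret", db="mydb")
cur = conn.cursor()

cur.execute("CREATE TEMPORARY TABLE staging LIKE actual_table")

scraped_rows = [  # produced by the bs4 scrape
    ("2014-06-01 10:00:00", "row-a", 1.25),
    ("2014-06-01 10:01:00", "row-b", 2.50),
]
cur.executemany(
    "INSERT INTO staging (scraped_at, label, value) VALUES (%s, %s, %s)",
    scraped_rows,
)

# Find the newest row we already have, then copy only newer rows across.
cur.execute("SELECT COALESCE(MAX(scraped_at), '1970-01-01') FROM actual_table")
(last_seen,) = cur.fetchone()
cur.execute(
    "INSERT INTO actual_table (scraped_at, label, value) "
    "SELECT scraped_at, label, value FROM staging WHERE scraped_at > %s",
    (last_seen,),
)
conn.commit()
conn.close()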
You can adopt the following approach:
For the purpose of the discussion, let master be the final destination of the scraped data.
Then we can adopt the following steps:
Scrape data from the web page.
Store this scraped data in a temporary table in MySQL, say temp.
Perform an EXCEPT operation to pull out only those rows which exist in temp but not in master.
Persist the rows obtained in step 3 in the master table.
Please refer to this link to understand how to perform SET operations in MySQL. Also, it would be advisable to place all this logic within a stored procedure and pass it the set of data to be processed (not sure if this part is possible in MySQL).
Adding one more step to the approach: based on the discussion below, we can use a timestamp-based column to determine the newest rows that need to be placed into the table. The SET-based approach above works well in case there are no timestamp-based columns.
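A rough sketch of steps 3-4 with MySQLdb, using an anti-join in place of EXCEPT since older MySQL versions don't support it (hypothetical table and column names):

# Sketch only - insert the rows that exist in temp but not yet in master.
import MySQLdb

conn = MySQLdb.connect(host="localhost", user="user", passwd="secret", db="mydb")
cur = conn.cursor()
cur.execute("""
    INSERT INTO master (scraped_at, label, value)
    SELECT t.scraped_at, t.label, t.value
    FROM temp AS t
    LEFT JOIN master AS m
      ON m.scraped_at = t.scraped_at AND m.label = t.label
    WHERE m.scraped_at IS NULL  -- present in temp, missing from master
""")
conn.commit()
conn.close()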