I have a table in SQL Server that already contains data for the month of November. I now have to insert data for the previous months, January through October, which I have in a spreadsheet, and I want to do a bulk insert using Python. I have successfully established the connection to the server using Python and can access the table. However, I don't know how to insert data above the rows that are already present in the table. The table has no constraints, primary keys, or indexes.
I am not sure whether inserting on a condition like this is possible. If it is, kindly share some clues.
Notes: I don't have access to SSIS, and I can't use "BULK INSERT" because I can't map my shared drive to the SQL Server machine. That's why I have decided to use a Python script for the operation.
SQL Server Management Studio is just the GUI for interacting with SQL Server.
However, I don't know how to insert data above the rows those are
already present in the table of the server
Tables are ordered, or structured, based on the clustered index. Since you said there are no primary keys or indexes, you don't have one, so inserting records "below" or "above" existing rows isn't meaningful. A table without a clustered index is called a heap, which is what you have.
Thus, just insert the data. The order of the results will be determined by any ORDER BY clause you place on a statement, or by the clustered index on the table if you create one.
I assume you think your data is ordered because, by chance, your results appear in the same order each time you run SELECT * FROM table. However, this blog post shows that this isn't guaranteed, and it explains why your results truly aren't ordered without an ORDER BY clause.
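To make this concrete, here's a minimal runnable sketch. It uses Python's built-in sqlite3 as a stand-in so the example is self-contained; with SQL Server you would use pyodbc, but the pattern is the same: executemany for the bulk insert, and an ORDER BY when reading. The table and column names are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE sales (report_month TEXT, amount REAL)")

# November data is already in the table...
cur.execute("INSERT INTO sales VALUES ('2021-11', 500.0)")

# ...and the earlier months are inserted afterwards; physical insert
# order does not matter on a heap.
earlier = [("2021-01", 100.0), ("2021-02", 200.0), ("2021-10", 400.0)]
cur.executemany("INSERT INTO sales VALUES (?, ?)", earlier)

# The ORDER BY clause, not the insert order, determines what you read back.
rows = cur.execute(
    "SELECT report_month, amount FROM sales ORDER BY report_month"
).fetchall()
print(rows[0])  # January comes back first despite being inserted last
```

The same ORDER BY works identically against your SQL Server heap, which is why there is no need to control where the rows physically land.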
Related
I am trying to read XML files in Python, parse them, extract the required fields, and insert the extracted data into a Postgres table. I am new to Python and Postgres, so I was hoping someone could clarify some questions I have.
The requirement is that there will be 2 target tables in Postgres for every XML file (of a certain business entity, e.g. customers, products, etc.) that is received and read on a particular day: CURRENT and HISTORY.
The CURRENT table (e.g. CUST_CURR) is supposed to hold the latest data received for a particular run (the current day's file), and the HISTORY table (CUST_HIST) will contain the history of all data received up to the previous run, i.e. records just keep being appended to the HIST table on every run.
However, the requirement is to make the HIST table a PARTITIONED table (to improve query response time through partition pruning) based on the current process run date. In other words, during a particular run, the CURR table needs to be truncated and loaded with the day's extracted records, and the records already in the CURR table should be copied/inserted/appended into the HIST table in a NEW partition (of the HIST table) based on the run date.
Now, when I searched the internet to learn more about partitioned tables in Postgres, it appears (from the documentation) that to create new partitions, new tables need to be created manually, with a different name, every time. The example shows a CREATE TABLE statement for creating a partition:
CREATE TABLE CUST_HIST_20220630 PARTITION OF CUST_HIST
FOR VALUES FROM ('2022-06-30') TO ('2022-07-01');
I am sure I have misinterpreted this, but can anyone please correct me and help clear up the confusion?
So if anyone wants to query the HIST table with a run-date filter (assuming the partitions are created on the run_dt column), does the user have to query that particular (sub)table (something like SELECT * FROM CUST_HIST_20220630 WHERE run_dt >= '2022-05-31') instead of the main partitioned table (SELECT * FROM CUST_HIST)?
In other RDBMSs (Oracle, Teradata, etc.), partitions are created automatically when data is loaded, and they remain part of the same table. When a user queries the table on the partitioning column, the optimizer understands this, prunes the unnecessary partitions, and reads only the required partition(s), greatly improving response time.
Could someone please clear up my confusion? Is there a way to automate partition creation while loading data into a Postgres table using Python (psycopg2)? I am new to Postgres and Python, so please forgive my naivety.
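One common way to automate this is to have the load script issue a CREATE TABLE ... PARTITION OF statement for the run date before inserting; you then still query the parent table, and Postgres prunes to the matching partition just as Oracle or Teradata would. A minimal sketch that only builds the DDL string (the parent name CUST_HIST and a one-day-per-partition scheme are assumptions for illustration); with psycopg2 you would execute this string before the day's INSERT/COPY:

```python
from datetime import date, timedelta

def partition_ddl(run_date: date, parent: str = "CUST_HIST") -> str:
    """Build the DDL for a daily partition of the given parent table."""
    next_day = run_date + timedelta(days=1)
    name = f"{parent}_{run_date:%Y%m%d}"
    return (
        f"CREATE TABLE IF NOT EXISTS {name} PARTITION OF {parent} "
        f"FOR VALUES FROM ('{run_date:%Y-%m-%d}') TO ('{next_day:%Y-%m-%d}')"
    )

ddl = partition_ddl(date(2022, 6, 30))
# With psycopg2 (hypothetical connection): cur.execute(ddl); conn.commit()
```

IF NOT EXISTS makes the step idempotent, so reruns of the same day's load are harmless. Queries still go against the parent (SELECT * FROM CUST_HIST WHERE run_dt = ...); users never need to name the partition.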
I am new to Python and, of course, MySQL. I recently created a Python function that generates a list of values that I want to insert into a table (2 columns) in MySQL based on their specification.
Is it possible to create a procedure that takes a list of values I send through Python, checks whether these values are already in one of my two columns, and:
if they are already in the second one, doesn't return them,
if they are in the first one, returns all that are found there,
if they are in neither, returns them with some kind of flag so I can handle them in Python and insert them into the correct table?
EXTRA EXPLANATION
Let me try to explain what I want to achieve so maybe you can give me a push in the right direction. First, I get a list of CPE items in Python, like "cpe:/a:apache:iotdb:0.9.0", and my goal is to save them into a database where the CPEs related to IoT are differentiated from the generic ones and saved in different tables or columns. This distinction should be made by user input for each item, but only once per item, so after parsing all items in Python I first want to check in the database whether they already exist in one of the tables or columns.
So for each list item that I pass, I want to query MySQL and:
if it already exists in the non-IoT column, return nothing,
if it already exists in the IoT column, return the item,
if it exists in neither, also return the item so I can get user input in Python to verify whether it is an IoT item and insert it into the database afterwards.
I think you could use a library called pandas.
I don't know if it is the best solution, but it could work.
Export what you have in SQL into pandas, or query the database directly with pandas (e.g. pandas.read_sql).
Check out the library, it's really helpful for exploring data sets:
https://pandas.pydata.org/
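Once the two columns are loaded into Python, the three-way check from the question becomes a simple set lookup. A hedged sketch (the function and variable names are made up; in practice iot_known and non_iot_known would come from your two columns, e.g. set(pd.read_sql("SELECT cpe FROM iot_cpes", conn)["cpe"])):

```python
def classify_cpes(items, iot_known, non_iot_known):
    """Split incoming CPE strings into the cases the question describes:
    already-IoT items are returned, known non-IoT items are dropped,
    and unseen items are returned separately so the user can be asked."""
    already_iot = [c for c in items if c in iot_known]
    needs_input = [
        c for c in items if c not in iot_known and c not in non_iot_known
    ]
    return already_iot, needs_input

iot, unknown = classify_cpes(
    ["cpe:/a:apache:iotdb:0.9.0", "cpe:/a:apache:httpd:2.4", "cpe:/a:new:thing:1.0"],
    iot_known={"cpe:/a:apache:iotdb:0.9.0"},
    non_iot_known={"cpe:/a:apache:httpd:2.4"},
)
```

This keeps the round trips down to one SELECT per column instead of one query per item, and the user-input step then only runs for the `unknown` list.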
I have 10M+ records per day to insert into a Postgres database.
90% are duplicates and only the unique records should be inserted (this can be checked on a specific column value).
Because of the large volume, batch inserts seem like the only sensible option.
I'm trying to figure out how to make this work.
I've tried:
SQLAlchemy, but it throws an error. So I assume it's not possible.
s = Session(bind=engine)
s.bulk_insert_mappings(Model, rows)
s.commit()
Throws:
IntegrityError: (psycopg2.errors.UniqueViolation) duplicate key value violates unique constraint "..._key"
Pandas' to_sql doesn't have this unique-record capability.
So I'm thinking of putting new records into an "intermediate table", then running background jobs in parallel to add those records to the main table if they don't already exist. I don't know if this is the most efficient approach.
Is there a better approach?
Is there some way to make SQLAlchemy or Pandas do this?
There are two common ways to go about solving this problem. To pick between these, you need to examine where you're willing to spend the compute power, and whether or not the extra network transfer is going to be an issue. We don't have enough information to make that judgement call for you.
Option 1: Load to a temporary table
This option is basically what you described. Have a temporary table, or a table that's dedicated to the load, which matches the schema of your destination table. Obviously this staging table should not have the unique constraints.
Load the entirety of your batch into this table, and once it's all there, insert from this table into your destination table. You can very easily use standard SQL statements to do any kind of manipulation you need, such as selecting distinct rows, keeping only the first record per key, or whatever else.
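As a runnable sketch of this pattern (using Python's built-in sqlite3 so the example is self-contained; with Postgres you'd do the same through psycopg2, and could also use INSERT ... ON CONFLICT DO NOTHING instead of the NOT EXISTS filter). Table and column names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE dest (key TEXT UNIQUE, val TEXT)")
cur.execute("CREATE TABLE staging (key TEXT, val TEXT)")  # no constraints

# The destination already holds one record.
cur.execute("INSERT INTO dest VALUES ('a', 'old')")

# Load the whole batch, duplicates and all, into the staging table...
batch = [("a", "dup"), ("b", "new"), ("b", "new-again"), ("c", "new")]
cur.executemany("INSERT INTO staging VALUES (?, ?)", batch)

# ...then insert one row per key, skipping keys the destination has seen.
cur.execute("""
    INSERT INTO dest
    SELECT key, MIN(val) FROM staging s
    WHERE NOT EXISTS (SELECT 1 FROM dest d WHERE d.key = s.key)
    GROUP BY key
""")
count = cur.execute("SELECT COUNT(*) FROM dest").fetchone()[0]
```

At the 10M-rows-per-day scale you'd load the staging table with COPY rather than executemany, but the dedup-and-insert step is the same single SQL statement.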
Option 2: Only load unique values, filtering with pandas
Pandas has a drop_duplicates() function which limits your DataFrame to unique entries; you can specify which columns to check and which row to keep.
df = df.drop_duplicates(subset=["Age"])
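For this use case you would deduplicate on the uniqueness column within the batch, and also drop rows whose key already exists in the database before calling to_sql. A sketch with a made-up key column ("key"); in practice the `existing` set would come from something like set(pd.read_sql("SELECT key FROM t", engine)["key"]):

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "a", "b", "c"], "val": [1, 2, 3, 4]})
existing = {"c"}  # keys already present in the destination table

# Drop in-batch duplicates, keeping the first occurrence of each key...
df = df.drop_duplicates(subset=["key"], keep="first")
# ...then drop rows whose key is already in the database.
df = df[~df["key"].isin(existing)]
# df is now safe to write with df.to_sql(..., if_exists="append")
```

Note the race: if other writers can insert between the read of `existing` and the write, you still need the unique constraint (and Option 1) as the backstop.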
I am using cursor.executemany to insert thousands of rows into a Snowflake database from another source. If the insert fails for some reason, does it roll back all the inserts?
Is there some way to insert only if the same row does not exist yet? There is no primary key nor unique key in the table
So if in case the insert fails due to some reason, does it rollback all the inserts?
Snowflake's Python Connector implements cursor.executemany(…) by building a single multi-row INSERT INTO statement, whose values are evaluated by the query compiler before the insert runs. So they all succeed together, or the statement fails early if a value is unacceptable for its column type.
Is there some way to insert only if the same row does not exist yet? There is no primary key nor unique key in the table
If there are no ID-like columns, you'll need to define a condition that qualifies two rows to be the same (such as a multi-column match).
Assuming your new batch of inserts are in a temporary table TEMP, the following SQL can insert into the DESTINATION table by performing a check of all rows against a set from the DESTINATION table.
Using HASH(…) as a basis for comparison, comparing all columns in each row together (in order):
INSERT INTO DESTINATION
SELECT *
FROM TEMP
WHERE
HASH(*) NOT IN ( SELECT HASH(*) FROM DESTINATION )
As suggested by Christian in the comments, the MERGE SQL command can also be used once you have an identifier strategy (join keys). This too requires the new rows to be placed in a temporary table first, and it offers the ability to perform an UPDATE if a row is already found.
Note: HASH(…) can have collisions, so it isn't a perfect fit. It is better to form an identifier from one or more of your table's columns and compare those. Your question lacks information about the table and data characteristics, so I've shown a very simple approach using HASH(…) here.
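The idea behind the HASH(*) filter can be illustrated in plain Python: hash every full row, then keep only batch rows whose hash isn't already in the destination's hash set. (hashlib here only mimics what HASH(*) does; in Snowflake the whole comparison happens in SQL, and the names below are made up for the sketch.)

```python
import hashlib

def row_hash(row):
    """Hash all columns of a row together, in order (mimics HASH(*))."""
    return hashlib.sha256("|".join(map(str, row)).encode()).hexdigest()

destination = [("alice", 30), ("bob", 25)]   # rows already loaded
batch = [("bob", 25), ("carol", 41)]          # new batch, one duplicate

seen = {row_hash(r) for r in destination}
new_rows = [r for r in batch if row_hash(r) not in seen]
```

The same caveat applies as in the SQL version: two different rows could in principle collide to one hash, which is why a real identifier built from your own columns is the safer comparison.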
I have an HTML file on a network share which updates almost every minute with new rows in a table. At any point, the file contains close to 15,000 rows. I want to create a MySQL table with all the data in that table, plus some more columns that I compute from the available data.
The HTML table contains, say, rows from the last 3 days. I want to store all of them in my MySQL table and update the table every hour or so (can this be done via cron?).
For connecting to the DB, I'm using MySQLdb, which works fine. However, I'm not sure what the best practices are here. I can scrape the data using bs4 and connect to the table using MySQLdb, but how should I update the table? What logic should I use to scrape the page with the least resources?
I am not fetching any results, just scraping and writing.
Any pointers, please?
My suggestion: instead of updating values row by row, use a bulk insert into a temporary table, and then move the data into the actual table based on some timing key. If you have a key column, that will make it easy to read the most recently added rows.
You can adopt the following approach:
For the purpose of the discussion, let master be the final destination of the scraped data.
Then we can adopt the following steps:
Scrape data from the web page.
Store this scraped data within the temporary table within MySQL say temp.
Perform an EXCEPT operation to pull out only those rows which exist within temp but not in master.
Persist the rows obtained in step 3 in the master table.
Please refer to this link to understand how to perform set operations in MySQL. Also, it would be advisable to place all this logic within a stored procedure and pass it the set of data to be processed (not sure whether this part is possible in MySQL).
Adding one more step to the approach: based on the discussion below, we can use a timestamp column to determine the newest rows that need to be placed into the table. The set-based approach above works well when there are no timestamp columns.
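Since EXCEPT only arrived in MySQL 8.0.31, the same set difference is often written as an anti-join (LEFT JOIN ... WHERE NULL), which works on any version. A self-contained sketch of the temp-then-master flow, using Python's built-in sqlite3 so it runs anywhere; the SQL has the same shape in MySQL, and the table/column names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE master (url TEXT, scraped_at TEXT)")
cur.execute("CREATE TABLE temp (url TEXT, scraped_at TEXT)")

# One row from a previous scrape is already in master.
cur.execute("INSERT INTO master VALUES ('/page1', '2014-01-01')")

# The hourly scrape dumps everything it saw into temp, overlaps included.
cur.executemany("INSERT INTO temp VALUES (?, ?)",
                [("/page1", "2014-01-01"), ("/page2", "2014-01-02")])

# Anti-join: rows present in temp but missing from master.
cur.execute("""
    INSERT INTO master
    SELECT t.url, t.scraped_at
    FROM temp t
    LEFT JOIN master m
        ON m.url = t.url AND m.scraped_at = t.scraped_at
    WHERE m.url IS NULL
""")
total = cur.execute("SELECT COUNT(*) FROM master").fetchone()[0]
```

The cron job then just truncates temp, re-scrapes with bs4, and reruns the anti-join insert; master only ever grows by the genuinely new rows.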