I have an HTML file on a network share that updates almost every minute with new rows in a table. At any point, the file contains close to 15,000 rows. I want to create a MySQL table with all the data in that table, plus some more columns that I compute from the available data.
The HTML table contains rows from, say, the last 3 days. I want to store all of them in my MySQL table and update it every hour or so (can this be done via a cron job?)
For connecting to the DB, I'm using MySQLdb, which works fine. However, I'm not sure what the best practices are here. I can scrape the data using bs4 and connect to the table using MySQLdb. But how should I update the table? What scraping logic would use the least resources?
I am not fetching any results, just scraping and writing.
Any pointers, please?
My suggestion: instead of updating values row by row, try a bulk insert into a temporary table and then move the data into the actual table based on some timing key. If you have a key column, it will also make it easy to read off the most recently added rows.
You can adopt the following approach:
For the purpose of this discussion, let master be the final destination of the scraped data.
Then we can adopt the following steps:
Scrape data from the web page.
Store this scraped data in a temporary table within MySQL, say temp.
Perform an EXCEPT-style operation to pull out only those rows which exist in temp but not in master.
Persist the rows obtained in step 3 in the master table.
Please refer to this link to see how to perform SET operations in MySQL. It would also be advisable to place all this logic within a stored procedure and pass it the set of data to be processed (not sure if this part is possible in MySQL).
Adding one more step to the approach: based on the discussion below, we can use a timestamp-based column to determine the newest rows that need to be placed into the table. The SET-based approach above works well when there are no timestamp columns.
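The steps above can be sketched as follows. This uses an in-memory SQLite database purely for illustration (with MySQLdb only the connection setup differs; the SQL pattern is the same), and all table and column names are hypothetical. The temporary table is named scrape_temp here because TEMP is a reserved word in SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE master (match_id INTEGER PRIMARY KEY, score TEXT)")
cur.execute("CREATE TABLE scrape_temp (match_id INTEGER, score TEXT)")

# Pretend master already holds one previously scraped row.
cur.execute("INSERT INTO master VALUES (1, '2-0')")

# Steps 1-2: bulk-insert everything scraped this run into the temp table.
scraped = [(1, "2-0"), (2, "1-1"), (3, "0-3")]
cur.executemany("INSERT INTO scrape_temp VALUES (?, ?)", scraped)

# Steps 3-4: copy over only the rows the temp table has that master lacks
# (MySQL has no EXCEPT in older versions, so a LEFT JOIN / IS NULL works).
cur.execute("""
    INSERT INTO master (match_id, score)
    SELECT t.match_id, t.score
    FROM scrape_temp t
    LEFT JOIN master m ON m.match_id = t.match_id
    WHERE m.match_id IS NULL
""")
conn.commit()
# master now holds 3 rows; the duplicate of match 1 was skipped.
```

Truncating scrape_temp at the start of each hourly cron run keeps the load idempotent.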
I am trying to read XML file(s) in Python, parse them, extract the required fields, and insert the extracted data into a Postgres table. I am new to Python and Postgres, so I was hoping someone could clarify some questions I have.
The requirement is that there will be 2 target tables in Postgres for every XML file (of a certain business entity, e.g. customers, products, etc.) that should be received and read on a particular day - CURRENT and HISTORY.
The CURRENT table (e.g. CUST_CURR) is supposed to hold the latest data received for a particular run (current day's file) on that particular day, and the HISTORY table (CUST_HIST) will contain the history of all the data received up to the previous run - i.e. just keep on appending the records for every run into the HIST table.
However, the requirement is to make the HIST table a PARTITIONED table (to improve query response time by partition-pruning) based on the current process run date. In other words, during a particular run, the CURR table needs to be truncated and loaded with the day's extracted records, and the records already existing in the CURR table should be copied/inserted/appended into the HIST table in a NEW Partition (of the HIST table) based on the run date.
Now, when I searched the internet to learn more about partitioning tables in Postgres, it appears from the documentation that to create new partitions, new tables need to be created manually (with a different name) every time, one per partition. The example there shows a CREATE TABLE statement for creating a partition:
CREATE TABLE CUST_HIST_20220630 PARTITION OF CUST_HIST
FOR VALUES FROM ('2006-02-01') TO ('2006-03-01');
I am sure I have misinterpreted this, but can anyone please correct me and help clear up the confusion?
So if anyone has to query the HIST table with a run-date filter (assuming the partitions are created on the run_date column), does the user have to query that particular (sub)table (something like SELECT * FROM CUST_HIST_20220630 WHERE run_dt >= '2022-05-31') instead of the main partitioned table (SELECT * FROM CUST_HIST)?
In other RDBMSs (Oracle, Teradata, etc.) the partitions are created automatically when the data is loaded and they remain part of the same table. When a user queries the table with a filter on the partitioned column, the optimizer understands this and prunes the unnecessary partitions, reading only the required partition(s) and greatly improving response time.
Could someone please clear up my confusion: is there a way to automate partition creation while loading data into a Postgres table using Python (psycopg2)? I am new to Postgres and Python, so please forgive my naivety.
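On the automation part, one common pattern is to have the loader emit the partition DDL itself before each run: Postgres declarative partitioning does not auto-create partitions, but CREATE TABLE IF NOT EXISTS ... PARTITION OF is cheap to issue every time. A minimal sketch, with hypothetical table and column names (only the DDL string is built and checked here; the psycopg2 call that would execute it is shown as a comment):

```python
from datetime import date, timedelta

def partition_ddl(parent: str, run_date: date) -> str:
    """Build DDL for a daily partition of a range-partitioned parent table."""
    nxt = run_date + timedelta(days=1)
    name = f"{parent}_{run_date.strftime('%Y%m%d')}"
    return (
        f"CREATE TABLE IF NOT EXISTS {name} PARTITION OF {parent} "
        f"FOR VALUES FROM ('{run_date.isoformat()}') TO ('{nxt.isoformat()}')"
    )

ddl = partition_ddl("cust_hist", date(2022, 6, 30))
print(ddl)
# With psycopg2, the loader would run this before inserting the day's rows:
#   cur.execute(ddl)
```

Note that queries can still target the parent table (SELECT * FROM cust_hist WHERE run_dt = ...); the planner prunes to the matching partition automatically, so users never need to name the sub-tables.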
I have 10M+ records per day to insert into a Postgres database.
90% are duplicates and only the unique records should be inserted (this can be checked on a specific column value).
Because of the large volume, batch inserts seem like the only sensible option.
I'm trying to figure out how to make this work.
I've tried:
SQLAlchemy, but it throws an error. So I assume it's not possible.
s = Session(bind=engine)
s.bulk_insert_mappings(Model, rows)
s.commit()
Throws:
IntegrityError: (psycopg2.errors.UniqueViolation) duplicate key value violates unique constraint "..._key"
Pandas' to_sql doesn't have this unique-record capability.
So I'm thinking of putting new records in an "intermediate table", then running background jobs in parallel to add those records to the main table if they don't already exist. I don't know if this is the most efficient procedure.
Is there a better approach?
Is there some way to make SQLAlchemy or Pandas do this?
There are two common ways to go about solving this problem. To pick between these, you need to examine where you're willing to spend the compute power, and whether or not the extra network transfer is going to be an issue. We don't have enough information to make that judgement call for you.
Option 1: Load to a temporary table
This option is basically what you described. Have a temporary table or a table that's dedicated to the load, which matches the schema of your destination table. Obviously this should exclude the unique constraints.
Load the entirety of your batch into this table, and once it's all there, insert from it into your destination table. You can very easily use standard SQL statements to do any kind of manipulation you need, such as selecting distinct rows, keeping only the first record, or whatever else.
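A minimal sketch of Option 1, using an in-memory SQLite database to stand in for Postgres (the INSERT ... ON CONFLICT DO NOTHING statement is the same in both; table and column names are invented for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE events (event_id TEXT PRIMARY KEY, payload TEXT)")
cur.execute("CREATE TABLE events_staging (event_id TEXT, payload TEXT)")

# Load the whole batch, duplicates and all, into the staging table.
batch = [("a", "1"), ("a", "1"), ("b", "2")]
cur.executemany("INSERT INTO events_staging VALUES (?, ?)", batch)

# Move only unique rows into the destination; duplicates are dropped both
# within the batch (DISTINCT) and against existing rows (ON CONFLICT).
# The "WHERE true" only works around a SQLite parsing quirk with upserts
# after a SELECT; it is unnecessary in Postgres.
cur.execute("""
    INSERT INTO events (event_id, payload)
    SELECT DISTINCT event_id, payload FROM events_staging WHERE true
    ON CONFLICT (event_id) DO NOTHING
""")
conn.commit()
# events now holds 2 rows: ("a", "1") and ("b", "2").
```

With psycopg2 the same statements would run against the real staging table, typically loaded with COPY or executemany for speed.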
Option 2: Only load unique values, filtering with pandas
Pandas has a drop_duplicates() function which limits your dataframe to unique entries, and you can specify things such as which columns to check and which row to keep.
df = df.drop_duplicates(subset=["Age"])
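For instance, with an invented id column as the uniqueness key:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 2, 3, 3],
    "value": ["a", "a2", "b", "c", "c"],
})

# keep="first" retains the first occurrence of each id and drops the rest.
unique_df = df.drop_duplicates(subset=["id"], keep="first")
print(len(unique_df))  # 3
```

Keep in mind this only removes duplicates within the batch; rows that already exist in the database still need something like ON CONFLICT DO NOTHING at insert time.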
I have a dataset in BigQuery with 100,000+ rows and 10 columns, and I'm continuously adding new data to it. I want to fetch the rows that have not been processed, process them, and write them back to my table. Currently, I fetch them into a pandas dataframe using the BigQuery Python library and process them with pandas.
Now, I want to update the table with the newly pre-processed data. One way of doing it is building an SQL statement and calling the query function of the bigquery.Client() class, or using a job like here:
bqclient = bigquery.Client(
credentials=credentials,
project=project_id,
)
query = """UPDATE `dataset.table` SET field_1 = '3' WHERE field_2 = '1'"""
bqclient.query(query)
But it doesn't make sense to create an UPDATE statement for each row.
Another way I found is the to_gbq function of the pandas-gbq package. The disadvantage of this is that it rewrites the whole table.
Question: What is the best way of updating Bigquery table from pandas dataframe?
Google BigQuery is mainly used for data analysis when your data is static and you don't have to update values; the architecture is basically designed for that kind of workload. Therefore, if you want to update the data, there are some options, but they are very heavy:
The one you mentioned: a query that updates one row at a time.
Recreate the table using only the new values.
Append the new data with a different timestamp.
Use partitioned tables [1] and, if possible, clustered tables [2]; this way, when you want to update the table you can filter on the partitioning and clustering columns and the query will be less heavy. You can also append the new data into a new partition, say for the current day.
If you are using the data for analytical purposes, options 2 and 3 are probably best, but I always recommend having [1] and [2].
[1] https://cloud.google.com/bigquery/docs/querying-partitioned-tables
[2] https://cloud.google.com/bigquery/docs/clustered-tables
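One way to avoid per-row UPDATEs entirely is to load the processed dataframe into a staging table and fold it into the main table with a single MERGE statement. This is a hedged sketch: the dataset, table, and column names are hypothetical, only the SQL string is built and checked here, and the client calls that would execute it are shown as comments.

```python
def build_merge(target: str, staging: str, key: str, cols: list) -> str:
    """Build a BigQuery MERGE that updates matching rows and inserts new ones."""
    sets = ", ".join(f"T.{c} = S.{c}" for c in cols)
    ins_cols = ", ".join([key] + cols)
    ins_vals = ", ".join(f"S.{c}" for c in [key] + cols)
    return (
        f"MERGE `{target}` T USING `{staging}` S ON T.{key} = S.{key} "
        f"WHEN MATCHED THEN UPDATE SET {sets} "
        f"WHEN NOT MATCHED THEN INSERT ({ins_cols}) VALUES ({ins_vals})"
    )

sql = build_merge("dataset.table", "dataset.table_staging", "id", ["field_1"])
print(sql)
# With the BigQuery client:
#   bqclient.load_table_from_dataframe(df, "dataset.table_staging").result()
#   bqclient.query(sql).result()
```

This keeps the operation to one load job plus one query job per batch, which fits BigQuery's batch-oriented architecture far better than row-by-row updates.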
I have a table in SQL Server that already has data for the month of November. I have to insert data for the previous months, starting from January through October, which I have in a spreadsheet. I want to do a bulk insert using Python. I have successfully established the connection to the server and can access the table. However, I don't know how to insert data above the rows that are already present in the table. The table doesn't have any constraints, primary keys, or indexes.
I am not sure whether conditional insertion like this is possible. If it is, kindly share some clues.
Notes: I don't have access to SSIS, and I can't use "BULK INSERT" because I can't map my shared drive to the SQL Server machine. That's why I decided to use a Python script for the operation.
SQL Server Management Studio is just the GUI for interacting with SQL Server.
However, I don't know how to insert data above the rows that are
already present in the table
Tables are ordered or structured based on the clustered index. Since you don't have one (you said there aren't any PKs or indexes), inserting the records "below" or "above" existing rows isn't meaningful. A table without a clustered index is called a heap, which is what you have.
Thus, just insert the data. The order of results is determined by any ORDER BY clause you place on a statement, or by the clustered index on the table if you create one.
I assume you think your data is ordered because, by chance, when you run SELECT * FROM table your results appear in the same order each time. However, this blog will show you that this isn't guaranteed: your results truly aren't ordered without an ORDER BY clause.
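The point above can be sketched like this, using SQLite in place of SQL Server (with SQL Server you'd use a pyodbc cursor and executemany in the same way); the table, column names, and spreadsheet contents are invented for the example:

```python
import csv
import io
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE sales (month INTEGER, amount REAL)")  # a heap: no keys or indexes

# November data is "already there"...
cur.execute("INSERT INTO sales VALUES (11, 500.0)")

# ...and the earlier months from the spreadsheet are simply inserted after it.
spreadsheet = io.StringIO("month,amount\n1,100.0\n2,200.0\n")
rows = [(int(r["month"]), float(r["amount"])) for r in csv.DictReader(spreadsheet)]
cur.executemany("INSERT INTO sales VALUES (?, ?)", rows)
conn.commit()

# Physical insertion order doesn't matter: reads are ordered with ORDER BY.
months = [m for (m,) in cur.execute("SELECT month FROM sales ORDER BY month")]
print(months)  # [1, 2, 11]
```

With pyodbc, setting cursor.fast_executemany = True before the executemany call speeds up large batch inserts considerably.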
I want to scrape some specific webpages on a regular basis (e.g. each hour), and I want to do this with Python. The scraped results should be inserted into an SQLite table. New info will be scraped, but 'old' information will also get scraped again, since the Python script will run each hour.
To be more precise, I want to scrape a sports-result page, where more and more match-results get published on the same page as the tournament proceeds. So with each new scraping I just need the new results to be entered in the SQLite-table, since the older ones already got scraped (and inserted into the table) one hour before (or even earlier).
I also don't want to insert the same result twice, when it gets scraped the second time. So there should be some mechanism to check if one result already got scraped. Can this be done on SQL-level? So, that I scrape the whole page, make an INSERT statement for each result, but only those INSERT statements get executed successfully which were not present in the database before. I'm thinking of something like a UNIQUE keyword or so.
Or am I thinking too much about performance and should I just do a DROP TABLE each time before scraping and then scrape everything from scratch again? It's not really much data: about 100 records (= matches) per tournament and about 50 tournaments a year.
Basically I would just be interested in some kind of best-practice approach.
What you want to do is an upsert (update or insert if it doesn't exist).
Check here to see how to do it in sqlite: SQLite UPSERT - ON DUPLICATE KEY UPDATE
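The linked question predates SQLite's native upsert; since SQLite 3.24 it can be written directly with INSERT ... ON CONFLICT. A minimal sketch with a hypothetical schema for the scraped results:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE results (
        tournament TEXT,
        match_no INTEGER,
        score TEXT,
        UNIQUE (tournament, match_no)
    )
""")

UPSERT = """
    INSERT INTO results (tournament, match_no, score) VALUES (?, ?, ?)
    ON CONFLICT (tournament, match_no) DO UPDATE SET score = excluded.score
"""
# First scrape inserts the row...
cur.execute(UPSERT, ("open2013", 7, "6-4 6-3"))
# ...and re-scraping the same match an hour later just overwrites the score
# instead of inserting a duplicate.
cur.execute(UPSERT, ("open2013", 7, "6-4 7-5"))
conn.commit()
# results still holds exactly 1 row, with the newer score.
```

Running the same statement for every scraped result each hour is idempotent, which is exactly what a cron-driven scraper needs.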
It looks like you want to insert data if it doesn't exist? Perhaps something like:
Check if the entry exists
Insert Data if it doesn't
Update the entry if it does? (do you want to update)
You could issue 2 separate SQL statements, a SELECT then an INSERT/UPDATE.
Or you could set a UNIQUE constraint, and I believe SQLite will raise an IntegrityError on a duplicate:
try:
    # your insert here, e.g.:
    cur.execute("INSERT INTO results (match_id, score) VALUES (?, ?)",
                (match_id, score))
except sqlite3.IntegrityError:
    # this result was already scraped earlier; skip the duplicate
    pass
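A variant of the same idea that avoids the try/except entirely is SQLite's INSERT OR IGNORE, which silently skips rows that would violate the UNIQUE constraint (schema invented for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE results (match_id INTEGER UNIQUE, score TEXT)")

# First hourly scrape inserts two results...
cur.executemany("INSERT OR IGNORE INTO results VALUES (?, ?)",
                [(1, "3-1"), (2, "0-0")])
# ...the next scrape re-sends match 1, which is silently skipped,
# and adds the newly finished match 3.
cur.executemany("INSERT OR IGNORE INTO results VALUES (?, ?)",
                [(1, "3-1"), (3, "2-2")])
conn.commit()
# results holds 3 rows: matches 1, 2, and 3, each exactly once.
```

This pairs naturally with executemany over the whole scraped page, so each hourly run is one batch statement rather than a per-row check.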