Update field with no-value - python

I have a table in a PostgreSQL database.
I'm writing data to this table (using some computation with Python and psycopg2 to write results down in a specific column in that table).
I need to update some existing cell of that column.
Until now, I could either delete the complete row before writing this single cell (because all other cells in the row were written back at the same time), or delete the entire column for the same reason.
Now I can't do that anymore, because that would mean a long computation time to rebuild either the row or the column just to write a few new values into some cells.
I know the UPDATE command. It works well for that.
But if some cells have existing values and a new computation no longer produces a result for them, I would like to "clear" those values so the table stays up to date with the last computation I've done.
Is there a simple way to do that? UPDATE doesn't seem to work (it seems to keep the old values).
To be clear, I'm using psycopg2 to write to my table.

You simply update the cell with the value NULL in SQL. psycopg2 will write NULL to the database when you update the column with Python's None.
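A minimal sketch of that with psycopg2, using hypothetical table and column names (results, score, id) and a hypothetical connection string:

import psycopg2

conn = psycopg2.connect("dbname=mydb user=me")  # hypothetical connection string
with conn, conn.cursor() as cur:
    # Passing None as the parameter makes psycopg2 send SQL NULL,
    # which clears the existing value in that cell.
    cur.execute(
        "UPDATE results SET score = %s WHERE id = %s",
        (None, 42),
    )
conn.close()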

Related

Is there a way to return multiple values to python after a MySql query?

I am new to Python and, of course, MySQL. I recently created a Python function that generates a list of values that I want to insert into a table (2 columns) in MySQL, based on their specification.
Is it possible to create a procedure that takes a list of values that I'm sending through Python, checks if these values are already in one of my two columns, and:
if they are already in the second one, doesn't return them,
if they are in the first one, returns all that are contained there,
if they are in none of them, returns them with some kind of flag so I can handle them through Python and insert them into the correct table.
EXTRA EXPLANATION
Let me try to explain what I want to achieve so maybe you can give me a push and help me out. First, I get a list of CPE items like this ("cpe:/a:apache:iotdb:0.9.0") in Python, and my goal is to save them into a database where the CPEs related to IoT are differentiated from the generic ones and saved in different tables or columns. This distinction should be made by user input for each item, but only once per item, so after parsing all items in Python I first want to check in the database whether they already exist in one of the tables or columns.
So for each list item that I pass, I want to query MySQL and:
if it already exists in the non-IoT column, don't return anything,
if it already exists in the IoT column, return the item,
if it exists in neither, also return the item so I can get user input in Python to verify whether it is an IoT item or not, and insert it into the database after that.
I think you could use a library called pandas.
I don't know if it is the best solution, but it could work.
Export what you have in SQL into pandas, or just query the database with pandas.
Check out this library; it's really helpful for exploring data sets.
https://pandas.pydata.org/
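A minimal sketch of that idea, assuming a hypothetical table cpe_items with columns iot_cpe and non_iot_cpe and a hypothetical pymysql connection string; the classification rules themselves are the ones from the question:

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection string and table/column names.
engine = create_engine("mysql+pymysql://user:password@localhost/cpedb")

# Pull both columns once, then classify the incoming items in Python.
known = pd.read_sql("SELECT iot_cpe, non_iot_cpe FROM cpe_items", engine)
iot_set = set(known["iot_cpe"].dropna())
non_iot_set = set(known["non_iot_cpe"].dropna())

def classify(items):
    # Returns (already_iot, unknown); items already in the non-IoT column are skipped.
    already_iot = [i for i in items if i in iot_set]
    unknown = [i for i in items if i not in iot_set and i not in non_iot_set]
    return already_iot, unknown

already_iot, unknown = classify(["cpe:/a:apache:iotdb:0.9.0"])
# Prompt the user about each item in `unknown`, then insert it into the right column.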

Batch insert only unique records into PostgreSQL with Python (millions of records per day)

I have 10M+ records per day to insert into a Postgres database.
90% are duplicates and only the unique records should be inserted (this can be checked on a specific column value).
Because of the large volume, batch inserts seem like the only sensible option.
I'm trying to figure out how to make this work.
I've tried:
SQLAlchemy, but it throws an error. So I assume it's not possible.
s = Session(bind=engine)
s.bulk_insert_mappings(Model, rows)
s.commit()
Throws:
IntegrityError: (psycopg2.errors.UniqueViolation) duplicate key value violates unique constraint "..._key"
Pandas' to_sql doesn't have this unique-record capability.
So I'm thinking of putting new records in an "intermediate table", then running background jobs in parallel to add those records to the main table if they don't already exist. I don't know if this is the most efficient procedure.
Is there a better approach?
Is there some way to make SQLAlchemy or Pandas do this?
There are two common ways to go about solving this problem. To pick between these, you need to examine where you're willing to spend the compute power, and whether or not the extra network transfer is going to be an issue. We don't have enough information to make that judgement call for you.
Option 1: Load to a temporary table
This option is basically what you described. Have a temporary table or a table that's dedicated to the load, which matches the schema of your destination table. Obviously this should exclude the unique constraints.
Load the entirety of your batch into this table, and once it's all there, insert from this table into your destination table. You can very easily use standard SQL statements to do any kind of manipulation you need, such as DISTINCT, keeping only the first record, or whatever else.
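A minimal sketch of option 1 with psycopg2, assuming a hypothetical destination table events whose uniqueness is enforced on event_id, a staging table events_staging with the same columns but no unique constraint, and rows being the batch of tuples to load:

import psycopg2
from psycopg2.extras import execute_values

conn = psycopg2.connect("dbname=mydb user=me")  # hypothetical connection string
with conn, conn.cursor() as cur:
    # Bulk-load the raw batch into the staging table (no constraints here).
    execute_values(
        cur,
        "INSERT INTO events_staging (event_id, payload) VALUES %s",
        rows,  # e.g. [(1, 'a'), (2, 'b'), ...]
    )
    # Move only new, de-duplicated rows into the destination table.
    cur.execute(
        """
        INSERT INTO events (event_id, payload)
        SELECT DISTINCT ON (event_id) event_id, payload
        FROM events_staging
        ON CONFLICT (event_id) DO NOTHING
        """
    )
    cur.execute("TRUNCATE events_staging")
conn.close()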
Option 2: Only load unique values, filtering with pandas
Pandas has a drop_duplicates() function which limits your dataframe to unique entries, and you can specify things such as which columns to check and which row to keep.
df = df.drop_duplicates(subset=["Age"])

Migration of Django field with default value to PostgreSQL database

https://docs.djangoproject.com/en/1.10/topics/migrations/
Here it says:
"PostgreSQL is the most capable of all the databases here in terms of schema support; the only caveat is that adding columns with default values will cause a full rewrite of the table, for a time proportional to its size.
"For this reason, it’s recommended you always create new columns with null=True, as this way they will be added immediately."
I am asking if I understand this correctly.
From what I understand, I should first create the field with null=True and no default value, migrate it, then give it a default value and migrate again; that way the column is added immediately. Otherwise the whole table would be rewritten, and Django's migrations won't do this two-step trick by themselves?
It's also mentioned in that same page that:
In addition, MySQL will fully rewrite tables for almost every schema
operation and generally takes a time proportional to the number of
rows in the table to add or remove columns. On slower hardware this
can be worse than a minute per million rows - adding a few columns to
a table with just a few million rows could lock your site up for over
ten minutes.
and
SQLite has very little built-in schema alteration support, and so
Django attempts to emulate it by:
Creating a new table with the new schema
Copying the data across
Dropping the old table
Renaming the new table to match the original name
So, in short, what the statement you are referring to above really says is that PostgreSQL exhibits MySQL-like behaviour when adding a new column with a default value.
The approach you are trying would work. Adding a nullable column means no table rewrite. You can then alter the column to have a default value. However, existing NULLs will remain NULL.
The way I understand it, on the second migration the default value will not be written to the existing rows. Only when a new row is created with no value for that field will the default be written.
I think the warning to use null=True for a new column is only about performance. If you really want all existing rows to have the default value, just use default= and accept the performance cost of a table rewrite.
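A minimal sketch of the two-step approach, with hypothetical app, model, and field names ("news", Article, status):

# Migration 1: add the column as nullable with no default -> it is added
# immediately, with no full table rewrite on PostgreSQL.
from django.db import migrations, models

class Migration(migrations.Migration):
    dependencies = [("news", "0001_initial")]
    operations = [
        migrations.AddField(
            model_name="article",
            name="status",
            field=models.CharField(max_length=20, null=True),
        ),
    ]

# Migration 2 (a separate file, where it would also be named Migration): attach the
# default. Existing rows keep NULL; only new rows written without a value get "draft".
class Migration2(migrations.Migration):
    dependencies = [("news", "0002_article_status")]
    operations = [
        migrations.AlterField(
            model_name="article",
            name="status",
            field=models.CharField(max_length=20, null=True, default="draft"),
        ),
    ]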

Write from a query to table in BigQuery only if query is not empty

In BigQuery it's possible to write the results of a query to a new table. I'd like the table to be created only when the query returns at least one row; basically, I don't want to end up creating an empty table. I can't find an option to do that. (I am using the Python library, but I suppose the same applies to the raw API.)
Since you have to specify the destination in the query definition, and you don't know what it will return until you run it, can you tack a LIMIT 1 onto the end?
You can check the row count in the job result object and, if there are results, re-run the query without the limiter into your new table.
There's no option to do this in one step. I'd recommend running the query, inspecting the results, and then performing a table copy with WRITE_TRUNCATE to commit the results to the final location if the intermediate output contains at least one row.
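A minimal sketch of that second approach with the google-cloud-bigquery client, using hypothetical project, dataset, and table names and a hypothetical query:

from google.cloud import bigquery

client = bigquery.Client()
sql = "SELECT name FROM `my_project.my_dataset.source_table` WHERE value > 0"

# Run the query into an intermediate destination table first.
tmp_table = "my_project.my_dataset.tmp_results"
job = client.query(
    sql,
    job_config=bigquery.QueryJobConfig(
        destination=tmp_table, write_disposition="WRITE_TRUNCATE"
    ),
)
result = job.result()  # waits for the query to finish

# Copy the intermediate output to the final table only if it has at least one row.
if result.total_rows > 0:
    copy_config = bigquery.CopyJobConfig(
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE
    )
    client.copy_table(
        tmp_table, "my_project.my_dataset.final_table", job_config=copy_config
    ).result()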

Update a MySQL table from an HTML table with thousands of rows

I have an HTML file on the network which is updated almost every minute with new rows in a table. At any point, the file contains close to 15,000 rows. I want to create a MySQL table with all the data in that table, plus some more columns that I compute from the available data.
The HTML table contains, say, rows from the last 3 days. I want to store all of them in my MySQL table and update the table every hour or so (can this be done via cron?).
For connecting to the DB, I'm using MySQLdb, which works fine. However, I'm not sure what the best practices are. I can scrape the data using bs4 and connect to the table using MySQLdb. But how should I update the table? What scraping logic would use the least resources?
I am not fetching any results, just scraping and writing.
Any pointers, please?
My suggestion: instead of updating values row by row, try a bulk insert into a temporary table and then move the data into the actual table based on some timing key. If you have a timestamp key column, it becomes easy to read back the most recent rows you added.
You can adopt the following approach:
For the purpose of the discussion, let master be the final destination of the scraped data.
Then we can adopt the following steps:
Scrape data from the web page.
Store this scraped data in a temporary table within MySQL, say temp.
Perform an EXCEPT operation to pull out only those rows which exist within temp but not in master.
Persist the rows obtained in step 3 in the master table.
Please refer to this link to understand how to perform SET operations in MySQL. Also, it would be advisable to place all this logic within a stored procedure and pass it the set of data to be processed (not sure if this part is possible in MySQL).
Adding one more step to the approach: based on the discussion below, we can use a timestamp-based column to determine the newest rows that need to be placed into the table. The approach above using SET-based operations works well when there is no timestamp column.
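A minimal sketch of steps 1-4 with MySQLdb, using hypothetical table and column names (temp, master, row_key, col_a, col_b), where scraped_rows is the list of tuples produced by the bs4 scraper. MySQL versions before 8.0.31 don't support EXCEPT, so the sketch uses a LEFT JOIN anti-join, which filters out the same rows:

import MySQLdb

conn = MySQLdb.connect(host="localhost", user="me", passwd="secret", db="scrape")  # hypothetical
cur = conn.cursor()

# Steps 1-2: bulk-load the scraped rows into the staging table.
cur.execute("TRUNCATE TABLE temp")
cur.executemany(
    "INSERT INTO temp (row_key, col_a, col_b) VALUES (%s, %s, %s)",
    scraped_rows,
)

# Steps 3-4: copy only rows not already present in master.
cur.execute(
    """
    INSERT INTO master (row_key, col_a, col_b)
    SELECT t.row_key, t.col_a, t.col_b
    FROM temp t
    LEFT JOIN master m ON m.row_key = t.row_key
    WHERE m.row_key IS NULL
    """
)
conn.commit()
cur.close()
conn.close()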
