Adding data to an already existing partition in Postgres using Python - python

I need to partition a table on 2 columns and insert records into an already existing partition of a Postgres table using Python (psycopg2).
I am very new to Python and Postgres, so I am struggling a bit with a challenging requirement. From what I found online, Postgres does not support list partitioning on multiple columns.
I have 2 tables - "cust_details_curr" and "cust_details_hist". Both tables have the same structure, but the "_hist" table needs to be partitioned on 2 columns - 'area_code' and 'eff_date'.
CREATE TABLE cust_details_curr
(
    cust_id int,
    area_code varchar(5),
    cust_name varchar(20),
    cust_age int,
    eff_date date
);
CREATE TABLE cust_details_hist
(
    cust_id int,
    area_code varchar(5),
    cust_name varchar(20),
    cust_age int,
    eff_date date
); -- Needs to be partitioned on area_code and eff_date
The "area_code" is passed as an argument to the process.
The column "eff_date" is supposed to contain the current process run
date.
There are multiple "area_codes" to be passed as an argument to the program - (there are 5 values - A501, A502, A503, A504, X101) all of which will run sequentially on the same day (i.e eff_date will be the same for all the runs).
The requirement is that whenever the "curr" table is loaded for a specific "area_code", the program must first copy the data already in the "curr" table for that area_code into the "_hist" partition for that "eff_date" and "area_code". Next, the data for the same area_code must be deleted from the "curr" table, and new data for that area_code is loaded with the current process date in the eff_date column.
The process runs for one area_code at a time, so it runs several times for different area_codes on the same day (which means they all share the same eff_date).
So my questions are:
How do I partition the _hist table by 2 columns - area_code and eff_date?
Also, once a partition for an eff_date (say 2022-08-01) has been created and loaded in the _hist table for one of the area_codes (say A501), the next job in the sequence will need to load data for another area_code (say A502) into the same eff_date partition (since eff_date is the same for both process instances, as they run on the same day). How can I insert data into the existing partition?
I devised the following (crude) way to handle the requirement when there was only a single partition column, "eff_date". I would execute the SQL queries below in order, which roughly covers the initial requirement for a single eff_date and area_code value.
However, I am struggling to figure out how to do the same with area_code as a second partition column in the _hist table, and how to insert data into an already existing date partition (eff_date) that was loaded by a previous area_code instance.
CREATE TABLE cust_details_curr
(
    cust_id int,
    area_code varchar(5),
    cust_name varchar(20),
    cust_age int,
    eff_date date
);
CREATE TABLE cust_details_hist
(
    cust_id int,
    area_code varchar(5),
    cust_name varchar(20),
    cust_age int,
    eff_date date
) PARTITION BY LIST (eff_date); -- Partitioned by list
from datetime import datetime

table_name = "cust_details_curr"
table_name_hist = table_name.replace('_curr', '_hist')   # cust_details_hist
e = datetime.now()
eff_date = e.strftime("%Y-%m-%d")
dttime = e.strftime("%Y%m%d_%H%M%S")
# name of the new table that will be attached as a partition of the _hist table
table_name_curr_part = table_name_hist + '_' + str(dttime)
# area_code is passed in as an argument to the process
query_count = f"SELECT count(*) as cnt FROM {table_name} WHERE area_code = '{area_code}';"
query_date = f"SELECT DISTINCT eff_date FROM {table_name} WHERE area_code = '{area_code}';"
cur.execute(query_date)       # cur comes from the psycopg2 connection (trimmed below)
eff_date = cur.fetchone()[0]  # the eff_date currently sitting in the curr table
query_crt = f"CREATE TABLE {table_name_curr_part} (LIKE {table_name_hist} INCLUDING DEFAULTS);"
query_ins_part = f"INSERT INTO {table_name_curr_part} SELECT * FROM {table_name} WHERE area_code = '{area_code}' AND eff_date = '{eff_date}';"
query_add_part = f"ALTER TABLE {table_name_hist} ATTACH PARTITION {table_name_curr_part} FOR VALUES IN (DATE '{eff_date}');"
query_del = f"DELETE FROM {table_name} WHERE area_code = '{area_code}';"
query_ins_curr = f"INSERT INTO {table_name} (cust_id, area_code, cust_name, cust_age, eff_date) VALUES %s"
cur.execute(....)
# Program trimmed in the interest of space
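The trimmed part essentially just runs those statements in order. Roughly, it looks like this (a simplified sketch: `conn` is the open psycopg2 connection and `new_rows` holds the fresh data for this area_code, both placeholder names):
from psycopg2.extras import execute_values

# Simplified sketch of the trimmed part: run the statements above in order.
with conn, conn.cursor() as cur:
    cur.execute(query_crt)        # empty table shaped like the _hist table
    cur.execute(query_ins_part)   # copy the old rows for this area_code into it
    cur.execute(query_add_part)   # attach it as the eff_date partition of _hist
    cur.execute(query_del)        # clear the old rows from the curr table
    execute_values(cur, query_ins_curr, new_rows)  # reload curr with today's data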
Can anyone please help me implement a workaround for the above requirements with multiple partition columns? And how can I load data into an already existing partition?
Happy to provide additional information. Any help is appreciated.

Related

Insert unique auto-increment ID on python to sql?

I'd like to insert an Order_ID to make each row unique, using Python and pyodbc with SQL Server.
Currently, my code is:
name = input("Your name")

def connectiontoSQL(order_id, name):
    query = f'''\
    insert into order (Order_ID, Name)
    values('{order_id}','{name}')'''
    return (execute_query_commit(conn, query))
If my table in the SQL database is empty and I'd like the order_ID to increment by 1 every time I execute,
how should I code order_id in Python so that it automatically creates the first order_ID as OD001, and the next execution creates OD002?
You can create an INT identity column as your primary key and add a computed column that holds the order number you display in your application.
create table Orders
(
    [OrderId] [int] IDENTITY(0,1) NOT NULL,
    [OrderNumber] as 'OD' + right('00000' + cast(OrderId as varchar(6)), 6),
    [OrderDate] date,
    PRIMARY KEY CLUSTERED
    (
        [OrderId] ASC
    )
)
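On the Python side you then only insert the real columns and can read back the OrderNumber that SQL Server computed. A minimal sketch, assuming an open pyodbc connection `conn` and the Orders table above:
import datetime

# Minimal sketch, assuming an open pyodbc connection `conn`.
# OrderId/OrderNumber are generated by SQL Server, so only OrderDate is inserted.
cursor = conn.cursor()
cursor.execute("INSERT INTO Orders (OrderDate) VALUES (?)", datetime.date(2017, 7, 6))
conn.commit()

# Read back the display number computed for the newest row.
cursor.execute("SELECT TOP 1 OrderNumber FROM Orders ORDER BY OrderId DESC")
print(cursor.fetchone()[0])  # e.g. 'OD000000' for the very first row (IDENTITY starts at 0)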

Count the number of non-null values in each column of each table in MySQL

Is there a way to produce this output using SQL for all tables in a given database (using MySQL) without having to specify individual table names and columns?
Table Column Count
---- ---- ----
Table1 Col1 0
Table1 Col2 100
Table1 Col3 0
Table1 Col4 67
Table1 Col5 0
Table2 Col1 30
Table2 Col2 0
Table2 Col3 2
... ... ...
The purpose is to identify columns for analysis based on how much data they contain (a significant number of columns are empty).
The 'workaround' solution using python (one table at a time):
# Libraries
import pymysql
import pandas as pd
import pymysql.cursors

# Connect to mariaDB
connection = pymysql.connect(host='localhost',
                             user='root',
                             password='my_password',
                             db='my_database',
                             charset='latin1',
                             cursorclass=pymysql.cursors.DictCursor)

# Get column metadata
sql = """SELECT *
         FROM `INFORMATION_SCHEMA`.`COLUMNS`
         WHERE `TABLE_SCHEMA`='my_database'
      """
with connection.cursor() as cursor:
    cursor.execute(sql)
    result = cursor.fetchall()

# Store in dataframe
df = pd.DataFrame(result)
df = df[['TABLE_NAME', 'COLUMN_NAME']]

# Build SQL string (one table at a time for now)
my_table = 'my_table'
df_my_table = df[df.TABLE_NAME==my_table].copy()
cols = list(df_my_table.COLUMN_NAME)
col_strings = [''.join(['COUNT(', x, ') AS ', x, ', ']) for x in cols]
col_strings[-1] = col_strings[-1].replace(',','')
sql = ''.join(['SELECT '] + col_strings + ['FROM ', my_table])

# Execute
with connection.cursor() as cursor:
    cursor.execute(sql)
    result = cursor.fetchall()
The result is a dictionary of column names and counts.
Basically, no. See also this answer.
Also, note that the closest match in that answer is essentially the method you're already using, just implemented less efficiently as reflective SQL.
I'd do the same as you did - build a SQL statement like
SELECT
COUNT(*) AS `count`,
SUM(IF(columnName1 IS NULL,1,0)) AS columnName1,
...
SUM(IF(columnNameN IS NULL,1,0)) AS columnNameN
FROM tableName;
using information_schema as a source for table and column names, then execute it for each table in MySQL, then disassemble the single row returned into N tuple entries (tableName, columnName, total, nulls).
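If it helps, that same idea can be driven from Python, much like the per-table workaround in the question: loop over the tables from information_schema, run one aggregate query per table, and flatten the single row each query returns into (table, column, total, nulls) tuples. A rough sketch, reusing the pymysql `connection` (with DictCursor) from the question:
# Rough sketch: one NULL-count query per table, reusing the pymysql
# `connection` (DictCursor) opened in the question's workaround.
results = []  # (table_name, column_name, total_rows, null_count)

with connection.cursor() as cursor:
    cursor.execute("""SELECT TABLE_NAME, COLUMN_NAME
                      FROM INFORMATION_SCHEMA.COLUMNS
                      WHERE TABLE_SCHEMA = 'my_database'""")
    columns = cursor.fetchall()

tables = {}
for row in columns:
    tables.setdefault(row['TABLE_NAME'], []).append(row['COLUMN_NAME'])

with connection.cursor() as cursor:
    for table, cols in tables.items():
        sums = ', '.join("SUM(IF(`{0}` IS NULL, 1, 0)) AS `{0}`".format(c) for c in cols)
        cursor.execute("SELECT COUNT(*) AS `_total`, {0} FROM `{1}`".format(sums, table))
        row = cursor.fetchone()
        total = row.pop('_total')
        results.extend((table, col, total, nulls) for col, nulls in row.items())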
It is possible, but it's not going to be quick.
As mentioned in a previous answer you can work your way through the columns table in the information_schema to build queries to get the counts. It's then just a question of how long you are prepared to wait for the answer because you end up counting every row, for every column, in every table. You can speed things up a bit if you exclude columns that are defined as NOT NULL in the cursor (i.e. IS_NULLABLE = 'YES').
The solution suggested by LSerni is going to be much faster, particularly if you have very wide tables and/or high row counts, but would require more work handling the results.
e.g.
DELIMITER //

DROP PROCEDURE IF EXISTS non_nulls //

CREATE PROCEDURE non_nulls (IN sname VARCHAR(64))
BEGIN
    -- Parameters:
    -- Schema name to check
    -- call non_nulls('sakila');
    DECLARE vTABLE_NAME varchar(64);
    DECLARE vCOLUMN_NAME varchar(64);
    DECLARE vIS_NULLABLE varchar(3);
    DECLARE vCOLUMN_KEY varchar(3);
    DECLARE done BOOLEAN DEFAULT FALSE;
    DECLARE cur1 CURSOR FOR
        SELECT `TABLE_NAME`, `COLUMN_NAME`, `IS_NULLABLE`, `COLUMN_KEY`
        FROM `information_schema`.`columns`
        WHERE `TABLE_SCHEMA` = sname
        ORDER BY `TABLE_NAME` ASC, `ORDINAL_POSITION` ASC;
    DECLARE CONTINUE HANDLER FOR NOT FOUND SET done := TRUE;

    DROP TEMPORARY TABLE IF EXISTS non_nulls;

    CREATE TEMPORARY TABLE non_nulls(
        table_name VARCHAR(64),
        column_name VARCHAR(64),
        column_key CHAR(3),
        is_nullable CHAR(3),
        rows BIGINT,
        populated BIGINT
    );

    OPEN cur1;

    read_loop: LOOP
        FETCH cur1 INTO vTABLE_NAME, vCOLUMN_NAME, vIS_NULLABLE, vCOLUMN_KEY;
        IF done THEN
            LEAVE read_loop;
        END IF;
        SET @sql := CONCAT('INSERT INTO non_nulls ',
            '(table_name,column_name,column_key,is_nullable,rows,populated) ',
            'SELECT \'', vTABLE_NAME, '\',\'', vCOLUMN_NAME, '\',\'', vCOLUMN_KEY, '\',\'',
            vIS_NULLABLE, '\', COUNT(*), COUNT(`', vCOLUMN_NAME, '`) ',
            'FROM `', sname, '`.`', vTABLE_NAME, '`');
        PREPARE stmt1 FROM @sql;
        EXECUTE stmt1;
        DEALLOCATE PREPARE stmt1;
    END LOOP;

    CLOSE cur1;

    SELECT * FROM non_nulls;
END //

DELIMITER ;
call non_nulls('sakila');
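If you want to drive this from Python as well (like the workaround in the question), you can call the procedure and read the rows of its final SELECT directly. A small sketch, assuming the same pymysql `connection`:
# Small sketch: call the procedure and read its result set,
# assuming the pymysql `connection` (DictCursor) from the question.
with connection.cursor() as cursor:
    cursor.callproc('non_nulls', ('sakila',))
    for row in cursor.fetchall():
        print(row['table_name'], row['column_name'], row['rows'], row['populated'])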

How to ignore start on input in hive insert query

I have data format in tab separated
State:ca city:california population:1M
I want to create a DB. When I insert, I should ignore the "state:", "city:" and "population:" prefixes, and I want to insert the state with its population into a state table and the city with its population into a city table.
There will then be 2 tables: one with state and population, the other with city and population.
CREATE EXTERNAL TABLE IF NOT EXISTS CSP.original
(
    st STRING COMMENT 'State',
    ct STRING COMMENT 'City',
    po STRING COMMENT 'Population'
)
COMMENT 'Original Table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
This didn't work. It added the comments but it didn't ignore the prefixes.
I also want to create 2 tables, for state and city. Can anyone please help me?
You would have to create the external table first.
Step 1:
CREATE EXTERNAL TABLE all_info (state STRING, population INT) PARTITIONED BY (date STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
Step 2:
CREATE TABLE IF NOT EXISTS state (state string, population INT) PARTITIONED BY (date string);
CREATE TABLE IF NOT EXISTS city (city string, population INT) PARTITIONED BY (date string);
Step 3:
INSERT OVERWRITE TABLE state
PARTITION (date = '20170706')
SELECT *
FROM all_info
WHERE date = '20170706' AND
      instr(state, 'state:') = 1;
INSERT OVERWRITE TABLE city
PARTITION (date = '20170706')
SELECT *
FROM all_info
WHERE date = '20170706' AND
      instr(state, 'city:') = 1;

How to create variable columns in MYSQL table using python

How to create variable columns in a table according to the user input?
In other words, I have a table that contains the ID of students, but I need to create variable columns for weeks according to the user's choice. For example, if the number of weeks chosen by the user is 2, then we create a table like this:
cur.execute("""CREATE TABLE Attendance
    (
    Week1 int,
    Week2 int,
    ID int primary key
    )""")
You can just build the column defs as a string in Python:
num_weeks = 4
week_column_defs = ', '.join("Week{} int".format(week_num) for week_num in range(1, num_weeks+1))
command = """CREATE TABLE Attendance
    (
    {weeks},
    ID int primary key
    )""".format(weeks=week_column_defs)
cur.execute(command)
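A possible follow-up once the table exists: build the matching INSERT the same way and pass the values as parameters. This is a hypothetical usage sketch, assuming num_weeks = 4, the same DB-API cursor `cur`, and %s placeholders as used by the common MySQL drivers:
# Hypothetical usage sketch: one attendance row per student, assuming the
# table above was created with num_weeks = 4 and `cur` is the same cursor.
columns = ', '.join("Week{}".format(n) for n in range(1, num_weeks + 1)) + ', ID'
placeholders = ', '.join(['%s'] * (num_weeks + 1))
insert_sql = "INSERT INTO Attendance ({0}) VALUES ({1})".format(columns, placeholders)
cur.execute(insert_sql, (1, 0, 1, 1, 1001))  # week flags first, student ID last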

python sqlalchemy distinct column values

I have 6 tables in my SQLite database, each table with 6 columns(Date, user, NormalA, specialA, contact, remarks) and 1000+ rows.
How can I use sqlalchemy to sort through the Date column to look for duplicate dates, and delete that row?
Assuming this is your model:
class MyTable(Base):
    __tablename__ = 'my_table'
    id = Column(Integer, primary_key=True)
    date = Column(DateTime)
    user = Column(String)
    # do not really care of columns other than `id` and `date`
    # important here is the fact that `id` is a PK
Following are two ways to delete your data:
Find duplicates, mark them for deletion and commit the transaction
Create a single SQL query which will perform deletion on the database directly.
For both of them a helper sub-query will be used:
# helper subquery: find first row (by primary key) for each unique date
subq = (
    session.query(MyTable.date, func.min(MyTable.id).label("min_id"))
    .group_by(MyTable.date)
).subquery('date_min_id')
Option-1: Find duplicates, mark them for deletion and commit the transaction
# query to find all duplicates
q_duplicates = (
    session
    .query(MyTable)
    .join(subq, and_(
        MyTable.date == subq.c.date,
        MyTable.id != subq.c.min_id)
    )
)

for x in q_duplicates:
    print("Will delete %s" % x)
    session.delete(x)

session.commit()
Option-2: Create a single SQL query which will perform deletion on the database directly
sq = (
    session
    .query(MyTable.id)
    .join(subq, and_(
        MyTable.date == subq.c.date,
        MyTable.id != subq.c.min_id)
    )
).subquery("subq")

dq = (
    session
    .query(MyTable)
    .filter(MyTable.id.in_(sq))
).delete(synchronize_session=False)
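A short follow-up note on Option-2 (same `session` as above): the bulk delete returns how many rows it removed, and it still needs a commit, just like Option-1.
# dq holds the number of rows the bulk DELETE removed; commit to make it permanent.
print("Deleted %d duplicate rows" % dq)
session.commit()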
Inspired by "Find duplicate values in SQL table", this might help you to select the duplicate dates:
query = session.query(
    MyTable
).\
    having(func.count(MyTable.date) > 1).\
    group_by(MyTable.date).all()
If you only want to show unique dates, distinct is what you might need.
While I like the whole object-oriented approach of SQLAlchemy, sometimes I find it easier to use some SQL directly.
And since the records don't have a key, we need the row number (_ROWID_) to delete the targeted records, and I don't think the ORM API exposes it.
So first we connect to the database:
from sqlalchemy import create_engine
db = create_engine(r'sqlite:///C:\temp\example.db')
eng = db.engine
Then to list all the records:
for row in eng.execute("SELECT * FROM TableA;"):
    print row
And to display all the duplicated records where the dates are identical:
for row in eng.execute("""
        SELECT * FROM {table}
        WHERE {field} IN (SELECT {field} FROM {table} GROUP BY {field} HAVING COUNT(*) > 1)
        ORDER BY {field};
        """.format(table="TableA", field="Date")):
    print row
Now that we identified all the duplicates, they probably need to be fixed if the other fields are different:
eng.execute("UPDATE TableA SET NormalA=18, specialA=20 WHERE Date = '2016-18-12' ;");
eng.execute("UPDATE TableA SET NormalA=4, specialA=8 WHERE Date = '2015-18-12' ;");
And finally, to keep the first inserted record and delete the more recent duplicated records:
print eng.execute("""
DELETE FROM {table}
WHERE _ROWID_ NOT IN (SELECT MIN(_ROWID_) FROM {table} GROUP BY {field});
""".format(table="TableA", field="Date")).rowcount
Or to keep the last inserted record and delete the other duplicated records:
print eng.execute("""
DELETE FROM {table}
WHERE _ROWID_ NOT IN (SELECT MAX(_ROWID_) FROM {table} GROUP BY {field});
""".format(table="TableA", field="Date")).rowcount
