Combine multiple table schemas into one separate table in Databricks - python

I am learning Databricks and am in an exploration and research phase. While triaging the Python options I came across several tools: DataFrames with PySpark, the Bamboo library, the Apache Spark libraries for reading SQL objects, Pandas, etc.
Somehow I keep mixing up the usage of all of these libraries.
I am exploring these alternatives to achieve one task: how to combine or merge multiple table schemas into one table.
For instance, suppose I have 20 tables: Table1, Table2, Table3, ..., Table20.
Table1 has 3 columns:
Col1 | Col2 | Col3
Table2 has 4 columns:
Col4 | Col5 | Col6 | Col7
and in that way each of the 20 tables has its own columns.
Can the community provide some insight into how to approach this implementation?
It would be greatly appreciated.
Troubleshooting
schema1 = "database string, table string"
schema2 = "table string, column string, datatype string"
tbl_df = spark.createDataFrame([],schema1)
tbl_df3 = spark.createDataFrame([],schema2)
db_list = [x[0] for x in spark.sql("SHOW DATABASES").rdd.collect()]
for db in db_list:
# Get list of tables from each the database
db_tables = spark.sql(f"SHOW TABLES in {db}").rdd.map(lambda row: row.tableName).collect()
# For each table, get list of columns
for table in db_tables:
#initialize the database
spark.sql(f"use {db}")
df = spark.createDataFrame([(db, table.strip())], schema=['database', 'table'])
tbl_df = tbl_df.union(df)
The code above works fine and gives me the list of all databases and their associated tables. The next thing I am trying to achieve is schema2.
Based on the list of tables, I managed to retrieve the list of columns for every table, but I believe it is returned in the form of Row tuples.
For example, when I iterate over db_tables in the for loop as below,
columns = spark.sql(f"DESCRIBE TABLE {table}").rdd.collect()
this gives me the result below:
[Row(col_name='Col1', data_type='timestamp', comment=None), Row(col_name='Col2', data_type='string', comment=None), Row(col_name='Col3', data_type='string', comment=None)]
[Row(col_name='Col4', data_type='timestamp', comment=None), Row(col_name='Col5', data_type='timestamp', comment=None), Row(col_name='Col6', data_type='timestamp', comment=None), Row(col_name='Col7', data_type='timestamp', comment=None)]
This is my real challenge now: I am trying to figure out how to access the Row format above and transform it into the tabular outcome below.
Table | Column | Datatype
-------------------------
Table1| Col1 | timestamp
Table1| Col2 | string
Table1| Col3 | string
Table2| Col4 | timestamp
Table2| Col5 | timestamp
Table2| Col6 | timestamp
Table2| Col7 | timestamp
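As a minimal sketch of one way those Row objects could be unpacked into (table, column, datatype) tuples, assuming columns holds the collected DESCRIBE output for a given table (DESCRIBE can also emit extra detail rows, so some filtering may be needed):
# Hedged sketch: flatten the Row objects into tuples matching schema2.
rows = [(table, r.col_name, r.data_type)
        for r in columns
        if r.col_name and not r.col_name.startswith('#')]
col_df = spark.createDataFrame(rows, schema2)
tbl_df3 = tbl_df3.union(col_df)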
Finally, I will merge or join the two dataframes on the table name (taking it as the key) and generate the final outcome like below:
Database| Table | Column | Datatype
------------------------------------
Db1 | Table1| Col1 | timestamp
Db1 | Table1| Col2 | string
Db1 | Table1| Col3 | string
Db1 | Table2| Col4 | timestamp
Db1 | Table2| Col5 | timestamp
Db1 | Table2| Col6 | timestamp
Db1 | Table2| Col7 | timestamp

If each table has unique columns, you can use unionByName. To create a single table with the merged schema, you can use the following code:
#list of table names
tables = ['default.t1','default.t2','default.t3']
final_df = spark.sql(f'select * from {tables[0]}') #load 1st table to a dataframe
#display(final_df)
final = 'final_df'
for table in tables[1:]:
    final = final + f'.unionByName(spark.sql("select * from {table}"),allowMissingColumns=True)' #creating string expression to get final result
#print(final)
req_df = eval(final)
#display(req_df)
req_df.printSchema()
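The string-and-eval construction works, but the same result can be accumulated directly on DataFrames without eval; a minimal sketch under the same assumptions (each table has unique columns, and the default.t1/t2/t3 names are illustrative):
# Hedged sketch: build the merged-schema DataFrame without eval().
tables = ['default.t1', 'default.t2', 'default.t3']
final_df = spark.table(tables[0])
for table in tables[1:]:
    # allowMissingColumns=True fills columns missing on either side with nulls
    final_df = final_df.unionByName(spark.table(table), allowMissingColumns=True)
final_df.printSchema()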
UPDATE:
To get the database name, table name, column name, and column type for each table in each database, you can use the following code:
My table creation code:
%sql
create database d1;
create table d1.t1(id int, gname varchar(40));
create table d1.t2(fname varchar(40),lname varchar(40));
create database d2;
create table d2.tb1(id varchar(40),age int, name varchar(40));
To get the dataframe as per the requirement:
from pyspark.sql.functions import lit

db_list = [x[0] for x in spark.sql("SHOW DATABASES").rdd.collect()]
#db_list
db_tables = spark.sql(f"SHOW TABLES in {db_list[0]}")
for i in db_list[1:]:
    db_tables = db_tables.union(spark.sql(f"SHOW TABLES in {i}"))
#display(db_tables)
final_df = None
for row in db_tables.collect():
    if final_df is None:
        final_df = spark.sql(f"DESCRIBE TABLE {row.database}.{row.tableName}")\
            .withColumn('database', lit(f'{row.database}'))\
            .withColumn('tablename', lit(f'{row.tableName}'))\
            .select('database', 'tablename', 'col_name', 'data_type')
    else:
        final_df = final_df.union(spark.sql(f"DESCRIBE TABLE {row.database}.{row.tableName}")\
            .withColumn('database', lit(f'{row.database}'))\
            .withColumn('tablename', lit(f'{row.tableName}'))\
            .select('database', 'tablename', 'col_name', 'data_type'))
display(final_df)
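As an aside, the same inventory can be built through the PySpark catalog API instead of SHOW/DESCRIBE statements; a minimal sketch (spark.catalog.listDatabases, listTables and listColumns are standard catalog methods, and display assumes a Databricks notebook):
# Hedged sketch: collect (database, tablename, col_name, data_type) via spark.catalog.
rows = []
for db in spark.catalog.listDatabases():
    for tbl in spark.catalog.listTables(db.name):
        if tbl.isTemporary:
            continue  # skip temporary views, which do not belong to a database
        for col in spark.catalog.listColumns(tbl.name, db.name):
            rows.append((db.name, tbl.name, col.name, col.dataType))
catalog_df = spark.createDataFrame(rows, "database string, tablename string, col_name string, data_type string")
display(catalog_df)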

Related

Python using LOAD DATA LOCAL INFILE IGNORE 1 LINES to update existing table

I have successfully loaded large CSV files (with 1 header row) into a MySQL table from Python with the command:
LOAD DATA LOCAL INFILE 'file.csv' INTO TABLE 'table" FIELDS TERMINATED BY ';' IGNORE 1 LINES (@vone, @vtwo, @vthree) SET DatumTijd = @vone, Debiet = NULLIF(@vtwo,''), Boven = NULLIF(@vthree,'')
The file contains historic data back to 1970. Every month I get an update with roughly 4320 rows that need to be added to the existing table. Sometimes there is an overlap with the existing table, so I would like to use REPLACE. But this does not seem to work in combination with IGNORE 1 LINES. The primary key is DatumTijd, which follows the mysql datetime format.
I tried several combinations of REPLACE and IGNORE in different orders, both before the INTO TABLE "table" part and after the FIELDS TERMINATED part.
Any suggestions on how to solve this?
Apart from the possible typo of enclosing the table name in single quotes rather than backticks, the load statement works fine on my Windows device given the following data:
one;two;three
2023-01-01;1;1
2023-01-02;2;2
2023-01-01;3;3
2022-01-04;;;
Note that I prefer COALESCE to NULLIF, and I have included an auto_increment id to demonstrate what REPLACE actually does, i.e. delete and insert.
drop table if exists t;
create table t(
    id int auto_increment primary key,
    DatumTijd date,
    Debiet varchar(1),
    boven varchar(1),
    unique index key1(DatumTijd)
);
LOAD DATA INFILE 'C:\\Program Files\\MariaDB 10.1\\data\\sandbox\\data.txt'
REPLACE INTO TABLE t
FIELDS TERMINATED BY ';'
IGNORE 1 LINES (@vone, @vtwo, @vthree)
SET DatumTijd = @vone,
    Debiet = coalesce(@vtwo,''),
    Boven = coalesce(@vthree,'')
;
select * from t;
+----+------------+--------+-------+
| id | DatumTijd  | Debiet | boven |
+----+------------+--------+-------+
|  2 | 2023-01-02 | 2      | 2     |
|  3 | 2023-01-01 | 3      | 3     |
|  4 | 2022-01-04 |        |       |
+----+------------+--------+-------+
3 rows in set (0.001 sec)
It should not matter that REPLACE in effect creates a new record, but if it does matter to you, consider loading into a staging table and then doing an INSERT ... ON DUPLICATE KEY UPDATE into the target, as sketched below.
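A minimal sketch of that staging-table route from Python, assuming a mysql-connector-python connection opened with allow_local_infile=True; the connection settings, file name, and staging table name are placeholders:
import mysql.connector

# Hedged sketch: load the monthly file into a keyless staging table, then upsert
# into the target table t so overlapping DatumTijd rows are updated in place.
# Host/user/password/database and the file name are placeholders.
conn = mysql.connector.connect(host="localhost", user="user", password="pass",
                               database="sandbox", allow_local_infile=True)
cur = conn.cursor()
cur.execute("DROP TABLE IF EXISTS t_staging")
cur.execute("CREATE TABLE t_staging (DatumTijd date, Debiet varchar(1), boven varchar(1))")
cur.execute("""
    LOAD DATA LOCAL INFILE 'file.csv'
    INTO TABLE t_staging
    FIELDS TERMINATED BY ';'
    IGNORE 1 LINES (@vone, @vtwo, @vthree)
    SET DatumTijd = @vone,
        Debiet = NULLIF(@vtwo, ''),
        boven = NULLIF(@vthree, '')
""")
cur.execute("""
    INSERT INTO t (DatumTijd, Debiet, boven)
    SELECT DatumTijd, Debiet, boven FROM t_staging
    ON DUPLICATE KEY UPDATE Debiet = VALUES(Debiet), boven = VALUES(boven)
""")
conn.commit()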

django mysql connector - is allowing >1 entry per specific field in django

I've written some code to parse a website and insert the results into a MySQL db.
The problem is I am getting a lot of duplicates per FKToTech_id, like:
id | ref              | FKToTech_id
---+------------------+------------
 1 | website.com/path | 1
 2 | website.com/path | 1
 3 | website.com/path | 1
What I'm looking for instead is to have one (1) row in this table, based on whether ref has already been entered for that FKToTech_id, and not to have multiple copies of the same row, like:
id | ref              | FKToTech_id
---+------------------+------------
 1 | website.com/path | 1
How can I modify my code below to simply pass in Python if the above is true (i.e. one ref already exists with the same FKToTech_id)?
for i in elms:
    allcves = {cursor.execute("INSERT INTO TechBooks (ref, FKToTech_id) VALUES (%s, %s) ", (i.attrs["href"], row[1])) for row in cves}
    mydb.commit()
Thanks
Make ref a unique column, then use INSERT IGNORE to skip the insert if it would cause a duplicate key error.
ALTER TABLE TechBooks ADD UNIQUE INDEX (ref);

for i in elms:
    cursor.executemany("INSERT IGNORE INTO TechBooks (ref, FKToTech_id) VALUES (%s, %s) ", [(i.attrs["href"], row[1]) for row in cves])
    mydb.commit()
I'm not sure what your intent was by assigning the results of cursor.execute() to allcves. cursor.execute() doesn't return a value unless you use multi=True. I've replaced the useless set comprehension with use of cursor.executemany() to insert many rows at once.

PySpark: Multiple join conditions with cast type as string

In Spark SQL, I need to cast as_of_date to string and do a multiple inner join across 3 tables, selecting all rows & columns in table1, 2 and 3 after the join. Example table schemas are shown below.
Tablename : Table_01 alias t1
Column | Datatype
as_of_date | String
Tablename | String
Credit_Card | String
Tablename : Table_02 alias t2
Column | Datatype
as_of_date | INT
Customer_name | String
tablename | string
Tablename : Table_03 alias t3
Column | Datatype
as_of_date | String
tablename | String
address | String
Join use-case :
t1.as_of_date = t2.as_of_date AND t1.tablename = t2.tablename
t2.as_of_date = t3.as_of_date AND t2.tablename = t3.tablename
The tables are already created in Hive; I am doing a Spark transformation on top of these tables, and I am converting as_of_date in table_02 to string.
There are 2 approaches I have thought of, but I am unsure which is the better one.
Approach 1:
df = spark.sql("select t1.*,t2.*,t3.* from table_1 t1 where cast(t1.as_of_date as string) inner join table_t2 t2 on t1.as_of_date = t2.as_of_date AND t1.tablename = t2.tablename inner join table_03 t3 on t2.as_of_date = t3.as_of_date and t2.tablename = t3.tablename")
Approach 2:
df_t1 = spark.sql("select * from table_01");
df_t2 = spark.sql("select * from table_02");
df_t3 = spark.sql("select * from table_03");
## Cast as_of_date as String if dtype as of date is int
if dict(df_t2.dtypes)["as_of_date"] == 'int':
df_t1["as_of_date"].cast(cast(StringType())
## Join Condition
df = df_t1.alias('t1').join(df_t2.alias('t2'),on="t1.tablename=t2.tablename AND t1.as_of_date = t2.as_of_date", how='inner').join(df_t3.alias('t3'),on="t2.as_of_date = t3.as_of_date AND t2.tablename = t3.tablename",how='inner').select('t1.*,t2.*,t3.*')
I feel that approach 2 is long-winded; I would appreciate some advice on which approach to go with for ease of maintenance of the scripts.
I would suggest using Spark SQL directly, as below. You can cast every as_of_date column from all tables to a string regardless of its data type: you want to cast the integer into a string, and if you also cast a string into a string, it does no harm.
df = spark.sql("""
select t1.*, t2.*, t3.*
from t1
join t2 on string(t1.as_of_date) = string(t2.as_of_date) AND t1.tablename = t2.tablename
join t3 on string(t2.as_of_date) = string(t3.as_of_date) AND t2.tablename = t3.tablename
""")

SQL multiple table conditional retrieving statement

I'm trying to retrieve data from an sqlite table conditional on another sqlite table within the same database. The formats are the following:
master table
--------------------------------------------------
| id | TypeA | TypeB | ...
--------------------------------------------------
| 2020/01/01 | ID_0-0 |ID_0-1 | ...
--------------------------------------------------
child table
--------------------------------------------------
| id | Attr1 | Attr2 | ...
--------------------------------------------------
| ID_0-0 | 112.04 |-3.45 | ...
--------------------------------------------------
I want to write an sqlite3 query that takes:
A date D present in master.id
A list of types present in master's columns
A list of attributes present in child's columns
and returns a dataframe whose rows are the types and whose columns are the attributes. Of course, I could just read the tables into pandas and do the work there, but I think that would be more computationally intensive, and I want to learn more SQL syntax!
UPDATE
So far I've tried:
"""SELECT *
FROM child
WHERE EXISTS
(SELECT *
FROM master
WHERE master.TypeX = Type_input
WHERE master.dateX = date_input)"""
and then concatenate the strings over all required TypeX and dateX values and execute them with:
cur.executescript(script)
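As a hedged sketch of one way this lookup could be expressed with bound parameters and pandas.read_sql_query (the type/attribute names TypeA, TypeB, Attr1, Attr2 and the date D are illustrative, and the result has one row per requested type):
import sqlite3
import pandas as pd

conn = sqlite3.connect("mydata.db")  # placeholder database file
D = "2020/01/01"
types = ["TypeA", "TypeB"]
attrs = ["Attr1", "Attr2"]

frames = []
for t in types:
    # The type selects a *column* of master, so it has to be interpolated into
    # the SQL text (column names cannot be bound parameters); the date is bound.
    sql = f"""
        SELECT child.{', child.'.join(attrs)}
        FROM master
        JOIN child ON child.id = master.{t}
        WHERE master.id = ?
    """
    frames.append(pd.read_sql_query(sql, conn, params=(D,)).assign(type=t))

result = pd.concat(frames, ignore_index=True).set_index("type")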

Transforming a Pandas DataFrame into a VALUES sql statement

Using pandas in Python, I need to be able to generate efficient queries from a dataframe into PostgreSQL. Unfortunately, DataFrame.to_sql(...) only performs direct inserts, and the query I wish to make is fairly complicated.
Ideally, I'd like to do this:
WITH my_data AS (
SELECT * FROM (
VALUES
<dataframe data>
) AS data (col1, col2, col3)
)
UPDATE my_table
SET
my_table.col1 = my_data.col1,
my_table.col2 = complex_function(my_table.col2, my_data.col2),
FROM my_data
WHERE my_table.col3 < my_data.col3;
However, to do that, I would need to turn my dataframe into a plain VALUES statement. I could, of course, write my own functions, but past experience has taught me that writing functions to escape and sanitize SQL should never be done manually.
We are using SQLAlchemy, but bound parameters seem to only work with a limited number of arguments, and ideally I would like the serialization of the dataframe into text to be done at C speed.
So, is there a way, either through pandas or through SQLAlchemy, to efficiently turn my dataframe into the VALUES sub-statement and insert it into my query?
You could use psycopg2.extras.execute_values.
For example, given this setup
CREATE TABLE my_table (
col1 int
, col2 text
, col3 int
);
INSERT INTO my_table VALUES
(99, 'X', 1)
, (99, 'Y', 2)
, (99, 'Z', 99);
# | col1 | col2 | col3 |
# |------+------+------|
# | 99 | X | 1 |
# | 99 | Y | 2 |
# | 99 | Z | 99 |
The python code
import psycopg2
import psycopg2.extras as pge
import pandas as pd
import config
df = pd.DataFrame([
    (1, 'A', 10),
    (2, 'B', 20),
    (3, 'C', 30)])

with psycopg2.connect(host=config.HOST, user=config.USER, password=config.PASS, database=config.USER) as conn:
    with conn.cursor() as cursor:
        sql = '''WITH my_data AS (
            SELECT * FROM (
                VALUES %s
            ) AS data (col1, col2, col3)
        )
        UPDATE my_table
        SET
            col1 = my_data.col1,
            -- col2 = complex_function(col2, my_data.col2)
            col2 = my_table.col2 || my_data.col2
        FROM my_data
        WHERE my_table.col3 < my_data.col3'''
        pge.execute_values(cursor, sql, df.values)
updates my_table to be
# SELECT * FROM my_table
| col1 | col2 | col3 |
|------+------+------|
| 99 | Z | 99 |
| 1 | XA | 1 |
| 1 | YA | 2 |
Alternatively, you could use psycopg2 to generate the SQL.
The code in format_values is almost entirely copied from the source code for pge.execute_values.
import psycopg2
import psycopg2.extras as pge
import pandas as pd
import config
df = pd.DataFrame([
    (1, "A'foo'", 10),
    (2, 'B', 20),
    (3, 'C', 30)])

def format_values(cur, sql, argslist, template=None, page_size=100):
    enc = pge._ext.encodings[cur.connection.encoding]
    if not isinstance(sql, bytes):
        sql = sql.encode(enc)
    pre, post = pge._split_sql(sql)
    result = []
    for page in pge._paginate(argslist, page_size=page_size):
        if template is None:
            template = b'(' + b','.join([b'%s'] * len(page[0])) + b')'
        parts = pre[:]
        for args in page:
            parts.append(cur.mogrify(template, args))
            parts.append(b',')
        parts[-1:] = post
        result.append(b''.join(parts))
    return b''.join(result).decode(enc)

with psycopg2.connect(host=config.HOST, user=config.USER, password=config.PASS, database=config.USER) as conn:
    with conn.cursor() as cursor:
        sql = '''WITH my_data AS (
            SELECT * FROM (
                VALUES %s
            ) AS data (col1, col2, col3)
        )
        UPDATE my_table
        SET
            col1 = my_data.col1,
            -- col2 = complex_function(col2, my_data.col2)
            col2 = my_table.col2 || my_data.col2
        FROM my_data
        WHERE my_table.col3 < my_data.col3'''
        print(format_values(cursor, sql, df.values))
yields
WITH my_data AS (
SELECT * FROM (
VALUES (1,'A''foo''',10),(2,'B',20),(3,'C',30)
) AS data (col1, col2, col3)
)
UPDATE my_table
SET
col1 = my_data.col1,
-- col2 = complex_function(col2, my_data.col2)
col2 = my_table.col2 || my_data.col2
FROM my_data
WHERE my_table.col3 < my_data.col3
