I have a few tables in different databases, and below is a sample comparison I am trying:
EmplTbl = cur.execute("select A, B, C from EmployeeTable where EmplName in ('A','B')")
emp_entries = set(cur)
DeptTbl = cur.execute("select A, B, C from DeptTable")
dept_entries = set(cur)
print(emp_entries.difference(dept_entries))
In this example I have provided only 3 columns for comparison, but in my case I have 30-40 columns.
When I try the set difference, a 'for' loop, or a dataframe join comparison, the script runs very slowly and eventually dies with the message "Killed".
In the code below I am trying to do an inner join to get exact matches:
EmplTbl = cur.execute("select A, B, C from EmployeeTable where EmplName in ('A','B')")
emp_entries = set(cur)
DeptTbl = cur.execute("select A, B, C from DeptTable")
for dept_row in cur:
    if dept_row in emp_entries:
        print(dept_row)
Volume of records: I may have up to 10 million.
Is there any way I can improve performance and make it run fast? I have a Linux server with a 4-node configuration.
Please suggest.
You can do the difference directly in SQL:

SELECT col1, col2, col3 FROM table1
MINUS
SELECT col1, col2, col3 FROM table2;

-- or --

SELECT col1, col2, col3 FROM table1 t1
WHERE NOT EXISTS
    (SELECT 1 FROM table2 t2
     WHERE t1.col1 = t2.col1
       AND t1.col2 = t2.col2
       AND t1.col3 = t2.col3);
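If both tables are reachable from a single connection, you can push the difference into the engine and fetch only the (much smaller) result. A minimal sketch reusing the cursor from the question; note that MINUS is Oracle syntax, and most other engines spell it EXCEPT:

cur.execute("""
    SELECT A, B, C FROM EmployeeTable WHERE EmplName IN ('A','B')
    MINUS
    SELECT A, B, C FROM DeptTable
""")
# Only the differing rows come back over the wire
for row in cur:
    print(row)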
Cheers!!
Is there a way to produce this output using SQL for all tables in a given database (using MySQL) without having to specify individual table names and columns?
Table    Column   Count
------   ------   -----
Table1   Col1     0
Table1   Col2     100
Table1   Col3     0
Table1   Col4     67
Table1   Col5     0
Table2   Col1     30
Table2   Col2     0
Table2   Col3     2
...      ...      ...
The purpose is to identify columns for analysis based on how much data they contain (a significant number of columns are empty).
The 'workaround' solution using Python (one table at a time):
# Libraries
import pymysql
import pandas as pd
import pymysql.cursors

# Connect to MariaDB
connection = pymysql.connect(host='localhost',
                             user='root',
                             password='my_password',
                             db='my_database',
                             charset='latin1',
                             cursorclass=pymysql.cursors.DictCursor)

# Get column metadata
sql = """SELECT *
         FROM `INFORMATION_SCHEMA`.`COLUMNS`
         WHERE `TABLE_SCHEMA`='my_database'
      """
with connection.cursor() as cursor:
    cursor.execute(sql)
    result = cursor.fetchall()

# Store in dataframe
df = pd.DataFrame(result)
df = df[['TABLE_NAME', 'COLUMN_NAME']]

# Build SQL string (one table at a time for now)
my_table = 'my_table'
df_my_table = df[df.TABLE_NAME == my_table].copy()
cols = list(df_my_table.COLUMN_NAME)
col_strings = [''.join(['COUNT(', x, ') AS ', x, ', ']) for x in cols]
col_strings[-1] = col_strings[-1].replace(',', '')  # strip trailing comma
sql = ''.join(['SELECT '] + col_strings + ['FROM ', my_table])

# Execute
with connection.cursor() as cursor:
    cursor.execute(sql)
    result = cursor.fetchall()
The result is a dictionary of column names and counts.
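To reshape that dictionary into the Table / Column / Count layout from the question, something like this works (a short sketch reusing the result and my_table variables from above):

counts = result[0]  # one row: {column_name: count, ...}
report = pd.DataFrame(
    [(my_table, col, n) for col, n in counts.items()],
    columns=['Table', 'Column', 'Count'],
)
print(report)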
Basically, no. See also this answer.
Also, note that the closest match of the answer above is actually the method you're already using, but less efficiently implemented in reflective SQL.
I'd do the same as you did: build SQL like

SELECT
    COUNT(*) AS `count`,
    SUM(IF(columnName1 IS NULL,1,0)) AS columnName1,
    ...
    SUM(IF(columnNameN IS NULL,1,0)) AS columnNameN
FROM tableName;
using information_schema as a source for table and column names, then execute it for each table in MySQL, then disassemble the single row returned into N tuple entries (tableName, columnName, total, nulls).
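A sketch of that in Python, assuming the pymysql connection from the question (the helper name and its arguments are illustrative, not from the answer):

def null_counts(connection, schema, table, columns):
    # Build the single-row query: total rows plus a NULL count per column
    selects = ['COUNT(*) AS `count`'] + [
        'SUM(IF(`{0}` IS NULL,1,0)) AS `{0}`'.format(c) for c in columns
    ]
    sql = 'SELECT {} FROM `{}`.`{}`'.format(', '.join(selects), schema, table)
    with connection.cursor() as cursor:
        cursor.execute(sql)
        row = cursor.fetchone()  # DictCursor: {alias: value, ...}
    total = row.pop('count')
    # Disassemble the single row into (table, column, total, nulls) tuples
    return [(table, col, total, nulls) for col, nulls in row.items()]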
It is possible, but it's not going to be quick.
As mentioned in a previous answer, you can work your way through the columns table in information_schema to build queries to get the counts. It's then just a question of how long you are prepared to wait for the answer, because you end up counting every row, for every column, in every table. You can speed things up a bit by excluding columns that are defined as NOT NULL from the cursor (i.e. selecting only IS_NULLABLE = 'YES').
The solution suggested by LSerni is going to be much faster, particularly if you have very wide tables and/or high row counts, but would require more work handling the results.
e.g.
DELIMITER //

DROP PROCEDURE IF EXISTS non_nulls //

CREATE PROCEDURE non_nulls (IN sname VARCHAR(64))
BEGIN
    -- Parameters:
    -- Schema name to check
    -- call non_nulls('sakila');

    DECLARE vTABLE_NAME varchar(64);
    DECLARE vCOLUMN_NAME varchar(64);
    DECLARE vIS_NULLABLE varchar(3);
    DECLARE vCOLUMN_KEY varchar(3);
    DECLARE done BOOLEAN DEFAULT FALSE;
    DECLARE cur1 CURSOR FOR
        SELECT `TABLE_NAME`, `COLUMN_NAME`, `IS_NULLABLE`, `COLUMN_KEY`
        FROM `information_schema`.`columns`
        WHERE `TABLE_SCHEMA` = sname
        ORDER BY `TABLE_NAME` ASC, `ORDINAL_POSITION` ASC;
    DECLARE CONTINUE HANDLER FOR NOT FOUND SET done := TRUE;

    DROP TEMPORARY TABLE IF EXISTS non_nulls;

    CREATE TEMPORARY TABLE non_nulls(
        table_name VARCHAR(64),
        column_name VARCHAR(64),
        column_key CHAR(3),
        is_nullable CHAR(3),
        rows BIGINT,
        populated BIGINT
    );

    OPEN cur1;

    read_loop: LOOP
        FETCH cur1 INTO vTABLE_NAME, vCOLUMN_NAME, vIS_NULLABLE, vCOLUMN_KEY;
        IF done THEN
            LEAVE read_loop;
        END IF;
        -- Build and run one INSERT ... SELECT per column, counting total
        -- rows and non-NULL values
        SET @sql := CONCAT('INSERT INTO non_nulls ',
            '(table_name,column_name,column_key,is_nullable,rows,populated) ',
            'SELECT \'', vTABLE_NAME, '\',\'', vCOLUMN_NAME, '\',\'', vCOLUMN_KEY, '\',\'',
            vIS_NULLABLE, '\', COUNT(*), COUNT(`', vCOLUMN_NAME, '`) ',
            'FROM `', sname, '`.`', vTABLE_NAME, '`');
        PREPARE stmt1 FROM @sql;
        EXECUTE stmt1;
        DEALLOCATE PREPARE stmt1;
    END LOOP;

    CLOSE cur1;

    SELECT * FROM non_nulls;
END //

DELIMITER ;

call non_nulls('sakila');
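If you would rather drive this from Python than the mysql client, the procedure can be called through the same pymysql connection used in the earlier workaround (a sketch; assumes that DictCursor setup):

with connection.cursor() as cursor:
    cursor.callproc('non_nulls', ('sakila',))
    for row in cursor.fetchall():
        print(row['table_name'], row['column_name'], row['rows'], row['populated'])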
I have three columns in the following table:
col1, col2, col3
now, I have the following python code
table = "#temp_table"
col = 'col1'
val = 'a'
cursor.execute("""select "{}" from "{}" as a where "{}" = "{}"
               """.format(col, table, col, val))
For reasons that are probably obvious to others, this does not work. How can I rewrite this to accomplish what I am trying to do?
Your query will work if you change the double quotes to backticks (`) around the identifiers, and pass the value as a query parameter rather than quoting it:

cursor.execute("""select `{}` from `{}` as a where `{}` = %s
               """.format(col, table, col), (val,))

Note that identifiers (table and column names) cannot be passed as query parameters, so if they come from user input, validate them against a whitelist of known names before formatting them in.
I'm using the Python bindings for sqlite3 and I'm attempting to do a query something like this:
table1
col1 | col2
------------
aaaaa|1
aaabb|2
bbbbb|3
test.py
def get_rows(db, ugc):
    # I want a startswith query, but want to protect against potential SQL
    # injection with the user-generated content
    return db.execute(
        # Does not work :)
        "SELECT * FROM table1 WHERE col1 LIKE ? + '%'",
        [ugc],
    ).fetchall()
Is there a way to do this safely?
Expected behaviour:
>>> get_rows(db, 'aa')
[('aaaaa', 1), ('aaabb', 2)]
In SQL, + is used to add numbers.
Your SQL ends up as ... WHERE col1 LIKE 0.
To concatenate strings, use ||:
db.execute(
    "SELECT * FROM table1 WHERE col1 LIKE ? || '%'",
    [ugc],
)
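Alternatively, build the full pattern in Python and pass it as the parameter. One caveat: any % or _ in the user input acts as a wildcard unless escaped, so a more defensive sketch looks like this:

def get_rows(db, ugc):
    # Escape LIKE wildcards in the user input, then append our own '%'
    escaped = ugc.replace('\\', '\\\\').replace('%', '\\%').replace('_', '\\_')
    return db.execute(
        "SELECT * FROM table1 WHERE col1 LIKE ? ESCAPE '\\'",
        [escaped + '%'],
    ).fetchall()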
Suppose I have the following very simple query:
query = 'SELECT * FROM table1 WHERE id = %s'
And I'm calling it from a Python SQL wrapper, in this case psycopg:
cur.execute(query, (row_id,))
The thing is that if row_id is None, I would like to get all the rows, but that query would return an empty table instead.
The easy way to approach this would be:
if row_id is not None:
    cur.execute(query, (row_id,))
else:
    cur.execute("SELECT * FROM table1")
Of course this is non-idiomatic and gets unnecessarily complex with non-trivial queries. I guess there is a way to handle this in the SQL itself, but I couldn't find anything. What is the right way?
Try using the COALESCE function, as below:
query = 'SELECT * FROM table1 WHERE id = COALESCE(%s,id)'
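Usage is then a single call site for both cases (a minimal sketch); one caveat is that COALESCE(%s, id) still won't match rows where id itself is NULL:

# One statement handles both the filtered and the unfiltered case
cur.execute('SELECT * FROM table1 WHERE id = COALESCE(%s, id)', (row_id,))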
SELECT * FROM table1 WHERE id = %s OR %s IS NULL
But depending on how the variable is forwarded to the query, it might be better to make it 0 if it is None:
SELECT * FROM table1 WHERE id = %s OR %s = 0
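Since the placeholder appears twice, the value has to be supplied twice with positional parameters; psycopg's named-parameter style lets you pass it once. A short sketch:

# Positional: one value per placeholder
cur.execute('SELECT * FROM table1 WHERE id = %s OR %s IS NULL', (row_id, row_id))

# Named: the same value referenced by both placeholders
cur.execute('SELECT * FROM table1 WHERE id = %(id)s OR %(id)s IS NULL', {'id': row_id})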