Run checks on Items from tables in Sqlite and python

Run checks on Items from tables in Sqlite and python - python

I have two tables below:
----------
Items | QTY
----------
sugar | 14
mango | 10
apple | 50
berry | 1
----------
Items |QTY
----------
sugar |10
mango |5
apple |48
berry |1
I use the following query in python to check difference between the QTY of table one and table two.
cur = conn.cursor()
cur.execute("select s.Items, s.qty - t.qty as quantity from Stock s join Second_table t on s.Items = t.Items;")
remaining_quantity = cur.fetchall()
I'm a bit stuck on how to go about what I need to accomplish. I need to check the difference between the quantity of table one and table two, if the quantity (difference) is under 5 then for those Items I want to be able to store this in another table column with the value 1 if not then the value will be 0 for those Items. How can I go about this?
Edit:
I have attempted this like by looping through the rows and if the column value is less than 5 then insert into the new table with the value below. :
for row in remaining_quantity:
print(row[1])
if((row[1]) < 5):
cur.execute('INSERT OR IGNORE INTO check_quantity_tb VALUES (select distinct s.Items, s.qty, s.qty - t.qty as quantity, 1 from Stock s join Second_table t on s.Items = t.Items'), row)
print(row)
But I get a SQL syntax error not sure where the error could be :/

First modify your first query so you retrieve all relevant infos and don't have to issue subqueries later:
readcursor = conn.cursor()
readcursor.execute(
"select s.Items, s.qty, s.qty - t.qty as remain "
"from Stock s join Second_table t on s.Items = t.Items;"
)
Then use it to update your third table:
writecursor = conn.cursor()
for items, qty, remain in readcursor:
print(remain)
if remain < 5:
writecursor.execute(
'INSERT OR IGNORE INTO check_quantity_tb VALUES (?, ?, ?, ?)',
(items, qty, remain, 1)
)
conn.commit()
Note the following points:
1/ We use two distinct cursor so we can iterate over the first one while wrting with the second one. This avoids fetching all results in memory, which can be a real life saver on huge datasets
2/ when iterating on the first cursor, we unpack the rows into their individual componants. This is called "tuple unpacking" (but actually works for most sequence types):
>>> row = ("1", "2", "3")
>>> a, b, c = row
>>> a
'1'
>>> b
'2'
>>> c
'3'
3/ We let the db-api module do the proper sanitisation and escaping of the values we want to insert. This avoids headaches with escaping / quoting etc and protects your code from SQL injection attacks (not that you might have one here, but that's the correct way to write parameterized queries in Python).
NB : since you didn't not post your full table definitions nor clear specs - not even the full error message and traceback - I only translated your code snippet to something more sensible (avoiding the costly and useless subquery, which migh or not be the cause of your error). I can't garantee it will work out of the box, but at least it should put you back on tracks.
NB2 : you mentionned you had to set the last col to either 1 or 0 depending on remain value. If that's the case, you want your loop to be:
writecursor = conn.cursor()
for items, qty, remain in readcursor:
print(remain)
flag = 1 if remain < 5 else 0
writecursor.execute(
'INSERT OR IGNORE INTO check_quantity_tb VALUES (?, ?, ?, ?)',
(items, qty, remain, flag)
)
conn.commit()
If you instead only want to process rows where remain < 5, you can specify it directly in your first query with a where clause.

Related

How to store large scale comparison?

I'm comparing images between thousands of users who can have between 1 and 12 photos. The comparison between two photos returns a score that I need to be stored so I don't make the comparison twice.
What is the best way of storing it?
I thought about storing in a table with one photo per row/column but this can quickly get out of hand

Multi-indexed pandas if you want to work in memory, like so:
df = pd.DataFrame(index=[['Alice/foo.png'], ['Bob/bar.png']], columns=['user1', 'user2', 'score'], data=[['Alice', 'Bob', 42.0]])
df.index.names = ['photo1', 'photo2']
df
user1 user2 score
photo1 photo2
user1/foo.png user2/bar.png Alice Bob 42.0
SQLite if you want to work on disk, like so
import sqlite3
# The important part: defining the table
conn = sqlite3.Connection('photos.sqlite')
c = conn.cursor()
c.execute('CREATE TABLE IF NOT EXISTS photos (photoid INTEGER PRIMARY KEY, photopath TEXT UNIQUE, userid INT)')
c.execute('CREATE TABLE IF NOT EXISTS photoscores (photoid1 INTEGER, photoid2 INTEGER, userid1 INT, userid2 INT, score REAL)')
c.execute('CREATE UNIQUE INDEX photopair on photoscores (photoid1, photoid2)')
conn.commit()
# Example of populating the table
sql = ('INSERT INTO photos(photopath, userid) VALUES ("foo.png", 1)',
'INSERT INTO photos(photopath, userid) VALUES ("bar.png", 2)')
for statement in sql:
c.execute(statement)
conn.commit()
sql = 'SELECT photoid FROM PHOTOS'
c.execute(sql)
values = [v[0] for v in enumerate(c.fetchall())]
# This is more complicated than it needs to be, especially since
# values will always be sorted if this code is run, but I'm just
# emphasizing the need to keep the photo ids and user ids aligned
sql = ('INSERT OR REPLACE INTO photoscores VALUES (%s, %s, %s, %s, 42.0)'
% (tuple(sorted(values)) + ((1, 2) if values == sorted(values) else (2, 1))))
c.execute(sql)
conn.commit()
import pandas as pd
pd.read_sql('SELECT * FROM photoscores', conn)
photoid1 photoid2 userid1 userid2 score
0 0 1 1 2 42.0
If you use the SQL code, I'd suggest always sorting the pair of photo IDs you compare.
The key thing is that you want something with an underlying hash map that will quickly tell you whether an existing pair of photos has already been compared.

Insert record from list if not exists in table

cHandler = myDB.cursor()
cHandler.execute('select UserId,C1,LogDate from DeviceLogs_12_2019') // data from remote sql server database
curs = connection.cursor()
curs.execute("""select * from biometric""") //data from my database table
lst = []
result= cHandler.fetchall()
for row in result:
lst.append(row)
lst2 = []
result2= curs.fetchall()
for row in result2:
lst2.append(row)
t = []
r = [elem for elem in lst if not elem in lst2]
for i in r:
print(i)
t.append(i)
for i in t:
frappe.db.sql("""Insert into biometric(UserId,C1,LogDate) select '%s','%s','%s' where not exists(select * from biometric where UserID='%s' and LogDate='%s')""",(i[0],i[1],i[2],i[0],i[2]),as_dict=1)
I am trying above code to insert data into my table if record not exists but getting error :
pymysql.err.ProgrammingError: (1064, "You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near '1111'',''in'',''2019-12-03 06:37:15'' where not exists(select * from biometric ' at line 1")
Is there anything I am doing wrong or any other way to achieve this?

It appears you have potentially four problems:
There is a from clause missing between select and where not exists.
When using a prepared statement you do not enclose your placeholder arguments, %s, within quotes. Your SQL should be:
Your loop:
Loop:
t = []
r = [elem for elem in lst if not elem in lst2]
for i in r:
print(i)
t.append(i)
If you are trying to only include rows from the remote site that will not be duplicates, then you should explicitly check the two fields that matter, i.e. UserId and LogDate. But what is the point since your SQL is taking care of making sure that you are excluding these duplicate rows? Also, what is the point of copying everything form r to t?
SQL:
Insert into biometric(UserId,C1,LogDate) select %s,%s,%s from DUAL where not exists(select * from biometric where UserID=%s and LogDate=%s
But here is the problem even with the above SQL:
If the not exists clause is false, then the select %s,%s,%s from DUAL ... returns no columns and the column count will not match the number of columns you are trying to insert, namely three.
If your concern is getting an error due to duplicate keys because (UserId, LogDate) is either a UNIQUE or PRIMARY KEY, then add the IGNORE keyword on the INSERT statement and then if a row with the key already exists, the insertion will be ignored. But there is no way of knowing since you have not provided this information:
for i in t:
frappe.db.sql("Insert IGNORE into biometric(UserId,C1,LogDate) values(%s,%s,%s)",(i[0],i[1],i[2]))
If you do not want multiple rows with the same (UserId, LogDate) combination, then you should define a UNIQUE KEY on these two columns and then the above SQL should be sufficient. There is also an ON DUPLICATE KEY SET ... variation of the INSERT statement where if the key exists you can do an update instead (look this up).
If you don't have a UNIQUE KEY defined on these two columns or you need to print out those rows which are being updated, then you do need to test for the presence of the existing keys. But this would be the way to do it:
cHandler = myDB.cursor()
cHandler.execute('select UserId,C1,LogDate from DeviceLogs_12_2019') // data from remote sql server database
rows = cHandler.fetchall()
curs = connection.cursor()
for row in rows:
curs.execute("select UserId from biometric where UserId=%s and LogDate=%s", (ros[0], row[2])) # row already in biometric table?
biometric_row = curs.fetchone()
if biometric_row is None: # no, it is not
print(row)
frappe.db.sql("Insert into biometric(UserId,C1,LogDate) values(%s, %s, %s)", (row[0],row[1],row[2]))

Effectively classifying a DB of event logs by a column

Situation
I am using Python 3.7.2 with its built-in sqlite3 module. (sqlite3.version == 2.6.0)
I have a sqlite database that looks like:
| user_id | action | timestamp |
| ------- | ------ | ---------- |
| Alice | 0 | 1551683796 |
| Alice | 23 | 1551683797 |
| James | 1 | 1551683798 |
| ....... | ...... | .......... |
where user_id is TEXT, action is an arbitary INTEGER, and timestamp is an INTEGER representing UNIX time.
The database has 200M rows, and there are 70K distinct user_ids.
Goal
I need to make a Python dictionary that looks like:
{
"Alice":[(0, 1551683796), (23, 1551683797)],
"James":[(1, 1551683798)],
...
}
that has user_ids as keys and respective event logs as values, which are lists of tuples (action, timestamp). Hopefully each list will be sorted by timestamp in increasing order, but even if it isn't, I think it can be easily achieved by sorting each list after a dictionary is made.
Effort
I have the following code to query the database. It first queries for the list of users (with user_list_cursor), and then query for all rows belonging to the user.
import sqlite3
connection = sqlite3.connect("database.db")
user_list_cursor = connection.cursor()
user_list_cursor.execute("SELECT DISTINCT user_id FROM EVENT_LOG")
user_id = user_list_cursor.fetchone()
classified_log = {}
log_cursor = connection.cursor()
while user_id:
user_id = user_id[0] # cursor.fetchone() returns a tuple
query = (
"SELECT action, timestamp"
" FROM TABLE"
" WHERE user_id = ?"
" ORDER BY timestamp ASC"
)
parameters = (user_id,)
local_cursor.execute(query, parameters) # Here is the bottleneck
classified_log[user_id] = list()
for row in local_cursor.fetchall():
classified_log[user_id].append(row)
user_id = user_list_cursor.fetchone()
Problem
The query execution for each user is too slow. That single line of code (which is commented as bottleneck) takes around 10 seconds for each user_id. I think I am making a wrong approach with the queries. What is the right way to achieve the goal?
I tried searching with keywords "classify db by a column", "classify sql by a column", "sql log to dictionary python", but nothing seems to match my situation. I think this wouldn't be a rare need, so maybe I'm missing the right keyword to search with.
Reproducibility
If anyone is willing to reproduce the situation with a 200M row sqlite database, the following code will create a 5GB database file.
But I hope there is somebody who is familiar with such a situation and knows how to write the right query.
import sqlite3
import random
connection = sqlite3.connect("tmp.db")
cursor = connection.cursor()
cursor.execute(
"CREATE TABLE IF NOT EXISTS EVENT_LOG (user_id TEXT, action INTEGER, timestamp INTEGER)"
)
query = "INSERT INTO EVENT_LOG VALUES (?, ?, ?)"
parameters = []
for timestamp in range(200_000_000):
user_id = f"user{random.randint(0, 70000)}"
action = random.randint(0, 1_000_000)
parameters.append((user_id, action, timestamp))
cursor.executemany(query, parameters)
connection.commit()
cursor.close()
connection.close()

Big thanks to #Strawberry and #Solarflare for their help given in comments.
The following solution achieved more than 70X performance increase, so I'm leaving what I did as an answer for completeness' sake.
I used indices and queried for the whole table, as they suggested.
import sqlite3
from operators import attrgetter
connection = sqlite3.connect("database.db")
# Creating index, thanks to #Solarflare
cursor = connection.cursor()
cursor.execute("CREATE INDEX IF NOT EXISTS idx_user_id ON EVENT_LOG (user_id)")
cursor.commit()
# Reading the whole table, then make lists by user_id. Thanks to #Strawberry
cursor.execute("SELECT user_id, action, timestamp FROM EVENT_LOG ORDER BY user_id ASC")
previous_user_id = None
log_per_user = list()
classified_log = dict()
for row in cursor:
user_id, action, timestamp = row
if user_id != previous_user_id:
if previous_user_id:
log_per_user.sort(key=itemgetter(1))
classified_log[previous_user_id] = log_per_user[:]
log_per_user = list()
log_per_user.append((action, timestamp))
previous_user_id = user_id
So the points are
Indexing by user_id to make ORDER BY user_id ASC execute in acceptable time.
Reading the whole table, then classify by user_id, instead of making individual queries for each user_id.
Iterating over cursor to read row by row, instead of cursor.fetchall().

Count the number of non-null values in each column of each table in MySQL

Is there a way to produce this output using SQL for all tables in a given database (using MySQL) without having to specify individual table names and columns?
Table Column Count
---- ---- ----
Table1 Col1 0
Table1 Col2 100
Table1 Col3 0
Table1 Col4 67
Table1 Col5 0
Table2 Col1 30
Table2 Col2 0
Table2 Col3 2
... ... ...
The purpose is to identify columns for analysis based on how much data they contain (a significant number of columns are empty).
The 'workaround' solution using python (one table at a time):
# Libraries
import pymysql
import pandas as pd
import pymysql.cursors
# Connect to mariaDB
connection = pymysql.connect(host='localhost',
user='root',
password='my_password',
db='my_database',
charset='latin1',
cursorclass=pymysql.cursors.DictCursor)
# Get column metadata
sql = """SELECT *
FROM `INFORMATION_SCHEMA`.`COLUMNS`
WHERE `TABLE_SCHEMA`='my_database'
"""
with connection.cursor() as cursor:
cursor.execute(sql)
result = cursor.fetchall()
# Store in dataframe
df = pd.DataFrame(result)
df = df[['TABLE_NAME', 'COLUMN_NAME']]
# Build SQL string (one table at a time for now)
my_table = 'my_table'
df_my_table = df[df.TABLE_NAME==my_table].copy()
cols = list(df_my_table.COLUMN_NAME)
col_strings = [''.join(['COUNT(', x, ') AS ', x, ', ']) for x in cols]
col_strings[-1] = col_strings[-1].replace(',','')
sql = ''.join(['SELECT '] + col_strings + ['FROM ', my_table])
# Execute
with connection.cursor() as cursor:
cursor.execute(sql)
result = cursor.fetchall()
The result is a dictionary of column names and counts.

Basically, no. See also this answer.
Also, note that the closest match of the answer above is actually the method you're already using, but less efficiently implemented in reflective SQL.
I'd do the same as you did - build a SQL like
SELECT
COUNT(*) AS `count`,
SUM(IF(columnName1 IS NULL,1,0)) AS columnName1,
...
SUM(IF(columnNameN IS NULL,1,0)) AS columnNameN
FROM tableName;
using information_schema as a source for table and column names, then execute it for each table in MySQL, then disassemble the single row returned into N tuple entries (tableName, columnName, total, nulls).

It is possible, but it's not going to be quick.
As mentioned in a previous answer you can work your way through the columns table in the information_schema to build queries to get the counts. It's then just a question of how long you are prepared to wait for the answer because you end up counting every row, for every column, in every table. You can speed things up a bit if you exclude columns that are defined as NOT NULL in the cursor (i.e. IS_NULLABLE = 'YES').
The solution suggested by LSerni is going to be much faster, particularly if you have very wide tables and/or high row counts, but would require more work handling the results.
e.g.
DELIMITER //
DROP PROCEDURE IF EXISTS non_nulls //
CREATE PROCEDURE non_nulls (IN sname VARCHAR(64))
BEGIN
-- Parameters:
-- Schema name to check
-- call non_nulls('sakila');
DECLARE vTABLE_NAME varchar(64);
DECLARE vCOLUMN_NAME varchar(64);
DECLARE vIS_NULLABLE varchar(3);
DECLARE vCOLUMN_KEY varchar(3);
DECLARE done BOOLEAN DEFAULT FALSE;
DECLARE cur1 CURSOR FOR
SELECT `TABLE_NAME`, `COLUMN_NAME`, `IS_NULLABLE`, `COLUMN_KEY`
FROM `information_schema`.`columns`
WHERE `TABLE_SCHEMA` = sname
ORDER BY `TABLE_NAME` ASC, `ORDINAL_POSITION` ASC;
DECLARE CONTINUE HANDLER FOR NOT FOUND SET done := TRUE;
DROP TEMPORARY TABLE IF EXISTS non_nulls;
CREATE TEMPORARY TABLE non_nulls(
table_name VARCHAR(64),
column_name VARCHAR(64),
column_key CHAR(3),
is_nullable CHAR(3),
rows BIGINT,
populated BIGINT
);
OPEN cur1;
read_loop: LOOP
FETCH cur1 INTO vTABLE_NAME, vCOLUMN_NAME, vIS_NULLABLE, vCOLUMN_KEY;
IF done THEN
LEAVE read_loop;
END IF;
SET #sql := CONCAT('INSERT INTO non_nulls ',
'(table_name,column_name,column_key,is_nullable,rows,populated) ',
'SELECT \'', vTABLE_NAME, '\',\'', vCOLUMN_NAME, '\',\'', vCOLUMN_KEY, '\',\'',
vIS_NULLABLE, '\', COUNT(*), COUNT(`', vCOLUMN_NAME, '`) ',
'FROM `', sname, '`.`', vTABLE_NAME, '`');
PREPARE stmt1 FROM #sql;
EXECUTE stmt1;
DEALLOCATE PREPARE stmt1;
END LOOP;
CLOSE cur1;
SELECT * FROM non_nulls;
END //
DELIMITER ;
call non_nulls('sakila');

How to deal with repetition of a column in Sqlite? By using compression?

I know (from the answers of this question) that Sqlite by default doesn't enable compression. Is it possible to enable it, or would this require another tool? Here is the situation:
I need to add millions of rows in a Sqlite database. The table contains a description column (~500 char on average), and on average, each description is shared by, say, 40 rows, like this:
id name othercolumn description
1 azefds ... This description will be the same for probably 40 rows
2 tsdyug ... This description will be the same for probably 40 rows
...
40 wxcqds ... This description will be the same for probably 40 rows
41 azeyui ... This one is unique
42 uiuotr ... This one will be shared by 60 rows
43 poipud ... This one will be shared by 60 rows
...
101 iuotyp ... This one will be shared by 60 rows
102 blaxwx ... Same description for the next 10 rows
103 sdhfjk ... Same description for the next 10 rows
...
Question:
Would you just insert rows like this, and enable a compression algorithm of the DB? Pro: you don't have to deal with 2 tables, it's easier when querying.
or
Would you use 2 tables?
id name othercolumn descriptionid
1 azefds ... 1
2 tsdyug ... 1
...
40 wxcqds ... 1
41 azeyui ... 2
...
id description
1 This description will be the same for probably 40 rows
2 This one is unique
Con: instead of the simple select id, name, description from mytable from solution #1, we have to use a complex way to retrieve this, involving 2 tables, and probably multiple queries? Or maybe is it possible to do it without a complex query, but with a clever query with union or merge or anything like this?

Using multiple tables will not only prevent inconsistency, and take less space, but may also be faster, even if multiple/more complex queries are involved (precisely because it involves moving less data around). Which you should use depends on which of those characteristics are most important to you.
A query to retrieve the results when you have 2 tables would look something like this (which is really just a join between the two tables):
select table1.id, table1.name, table1.othercolumn, table2.description
from table1, table2
where table1.descriptionid=table2.id

Here is some illustration code in Python for ScottHunter's answer:
import sqlite3
conn = sqlite3.connect(':memory:')
c = conn.cursor()
c.execute("CREATE TABLE mytable (id integer, name text, descriptionid integer)")
c.execute("CREATE TABLE descriptiontable (id integer, description text)")
c.execute('INSERT INTO mytable VALUES(1, "abcdef", 1)');
c.execute('INSERT INTO mytable VALUES(2, "ghijkl", 1)');
c.execute('INSERT INTO mytable VALUES(3, "iovxcd", 2)');
c.execute('INSERT INTO mytable VALUES(4, "zuirur", 1)');
c.execute('INSERT INTO descriptiontable VALUES(1, "Description1")');
c.execute('INSERT INTO descriptiontable VALUES(2, "Description2")');
c.execute('SELECT mytable.id, mytable.name, descriptiontable.description FROM mytable, descriptiontable WHERE mytable.descriptionid=descriptiontable.id');
print c.fetchall()
#[(1, u'abcdef', u'Description1'),
# (2, u'ghijkl', u'Description1'),
# (3, u'iovxcd', u'Description2'),
# (4, u'zuirur', u'Description1')]

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.